Section 8 Developing R packages
Raphael Scherrer (thanks to Pedro Neves for guiding me through these steps)
By now you know what R packages are and you have been using many of them, some of them part of the tidyverse and others not. R packages are modules, or coherent libraries of functions, designed at specific sets of tasks. Packages, or libraries, are common to many programming languages, the philosophy behind them being: pick only the tools you need for your task, without having to download all the possible toolboxes. Currently CRAN (the Comprehensive R Archive Network) is host to more than 16,000 packages (link), and that is not counting R packages hosted by other platforms such as GitHub, Bioconductor or rOpenSci. This is what makes R such a powerful and popular language. Why there are so many packages is because anyone can write their own package and make it available to others, so the growth of the R universe if very much community-driven. Here we will show you how to write your own package. Most of the content of this tutorial follows Hadley Wickham’s exhaustive book on R packages.
8.1 Why writing packages?
You may very well have written analysis pipelines in R for various projects and never felt the need to make packages for them. So why bother? you may ask. The main reasons are:
- deployment: packages make it easier for people to use your code
- reproducibility: packages can be a convenient way to make your study fully reproducible
- consistency: there is a common set of rules on how packages should be organized, which forces you to make your code understandable to everyone
- security: the common conventions around package syntax make it possible for third-party tools to check your code for bugs or style, which also means you can trust packages hosted at some platforms when you know they run these tests, for example
8.2 Hands-on workflow
8.2.1 Primer: what is an RStudio project?
An RStudio project is a virtual context associated with a specific working directory on your computer. A project is the recommended unit of work for a given analysis. This is because it keeps track of the R workspace and history for that analysis, together with the working directory (meaning you never have to use setwd
anymore). A project has the extension .Rproj. See this page for more information. As we shall see, developing a package requires creating a project for it.
8.2.2 Create a project for your package
In RStudio, click on File, then New Project. There, you have the option to create a new package. This will create all the files that are needed, in particular a DESCRIPTION, a NAMESPACE, a .Rbuildignore, and a man/ and R/ folders. Use the .Rproj file to develop the package (launching it will open RStudio and place you in the right directory). It is possible to create an R package by assembling all those files together by yourself, but RStudio really makes it painless.
8.2.3 Link to GitHub?
At this stage you may want to host your package on an online version control platform such as GitHub. One way to do this is the following. Assuming that git is already installed on your machine and linked to your GitHub account, you need to:
- Create a project for your package locally (the step above)
- Create an empty repository on GitHub for your package
- Initialize git in the local copy by running
git init
from within - Stage and commit (
git add .
andgit commit -m "some commit message"
) - Link the local copy to the remote one with
git remote add origin https://github.com/username/reponame
- Push using
git push -u origin master
You should be all set. Useful links include this page, this one and also the instructions given by GitHub upon creation of an empty online repository.
8.2.4 Write your functions
A package is nothing much more than a convenient collection of functions that one may want to use repeatedly. Here we assume that you are comfortable with writing R functions. Prefer saving each function as its own R script (.R) and save them in the dedicated R/
folder. Here is an example function that repeats multiple elements, multiple times and returns a vector of those.
mrep <- function(x, n) {
assertthat::are_equal(length(x), length(n)) # security check
purrr::reduce(purrr::map2(x, n, ~ rep(.x, .y)), c)
}
We can use this function, for example, to repeat the number 1 once, number 2 twice and number 3 three times:
Note that when calling functions from other packages (here purrr
and asserthat
) we do not use library
or require
, as this would make all the functions of these packages available. Instead we use the namespace of the respective package, separated from the function name with a ::
. Although a package that uses library
will typically build just fine, it is considered bad practice and will not pass CRAN’s requirements, which are implemented in the R CMD CHECK command (more on this later).
8.2.5 Tests
Do you want to go test-driven? Then write your tests first, and follow those guidelines. Although tests are out of the scope of this tutorial, they are a vital part of package development, so we highly recommend this read as your next step to go further.
8.2.6 Document your functions
The documentation of a function is what shows up when you type ?function-name
for example (e.g. ?purrr::reduce
). When writing your package, you must provide a documentation for each of your functions so your user knows what the function does, what arguments it takes, what it returns and has examples of the function being used. Each function documentation goes in its own .Rd file, stored in the man/
folder.
roxygen2
is an R package that makes documentation very easy. It allows you to write the documentation as a header of a function’s R script, and save this header into its own .Rd file in man/
. All the lines that go into the documentation must start with the special comment characters #'
. If we take our previous example:
#' Repeat multiple things multiple times
#'
#' A function to repeat multiple things multiple times.
#'
#' @param x A vector of things
#' @param n A vector of numbers times each thing must be repeated
#'
#' @details The function can take a vector of any atomic type
#'
#' @return A vector of the same type as `x`
#'
#' @examples
#'
#' mrep(seq(3), seq(3))
#'
#' @export
mrep <- function(x, n) {
assertthat::are_equal(length(x), length(n)) # security check
purrr::reduce(purrr::map2(x, n, ~ rep(.x, .y)), c)
}
Here, everything starting with #'
will be interpreted by roxygen2
as part of the documentation. Different fields can be supplied:
- The first line is the title of the documentation page
- The second line is the description
@param
goes for each of the parameters, with their description@details
if you want to be more specific on what happens backstage@return
tells the user what the function returns@examples
shows some use-cases@export
indicates that this function can be called explicitly by the user (as opposed to an internal function of the package that is only meant to be used by other functions of the package)
Other fields such as @note
can be specified, but these are the main ones. A package with incomplete documentation will build fine, but again this will not pass R CMD CHECK for CRAN’s requirements, which require you, for example, to always have examples for exported functions.
To effectively produce the documentation, run roxygen2::roxygenize()
(or roxygenise
) from within the working directory of the package. roxyygen2
may not update the NAMESPACE file if it has not been created by roxygen2
in the first place, so you may have to erase NAMESPACE before running roxygenize()
(then it will automatically create a new NAMESPACE). We do not describe here what the NAMESPACE is, as it is a bit too advanced for this tutorial, just remember that you may have to erase it before documenting if you see a warning.
8.2.7 Build the package
Once some functions are added and their documentation is ready, the package should be able to build. Use the Install and Restard button under the Build tab in RStudio for that. Your package is now installed and loaded. Alternatively you can build your package from the command line by running R CMD INSTALL. If your package is on GitHub (or another remote server), you can also build it with devtools
, for example with devtools::install_github("username/reponame")
.
8.3 Write a vignette
A vignette is a more user-oriented overview of your package. In contrast to the individual documentation of each function, the vignette takes the user for a tour of the package to show use-cases of the functions in context.
A vignette is written in Rmarkdown. The Rmarkdown language is out of the scope of this tutorial, but is a great way to combine textual information (it inherits from markdown) with embedded chunks of R code and their output (this tutorial is written in Rmarkdown). See this link, or this cheatsheet, or inspire yourself from the source code of this tutorial to get more familiar with Rmarkdown.
We use the usethis
package to set up everything we need to get our vignette ready. Running usethis::use_vignette
will create a vignettes/
folder with the vignette .Rmd file in it, that you can then edit.
The vignette can be rendered in multiple output formats,, such as an HTML web page or a LaTeX-looking PDF. RStudio does this through the Knit button, which calls the knitr
package in the background. By default, upon creation of the vignette only the HTML output is supported. To change the possible outputs (e.g. allow both HTML and PDF), change the output
part of the header of the .Rmd file with:
output:
pdf_document: default
html_document:
keep_md: yes
Now the drop-down menu of the Knit button will offer the possibility to render the vignette as PDF as well as HTML.
The Knit button renders a vignette, but does not save it. You could of course save it manually, but devtools
offers the build_vignettes
function to automatize this task. Running it will create two new folders, doc/
and Meta/
. The former contains the rendered vignette, in the first format specified in the output
header (so PDF in the above example) while the latter contains some data used to render that vignette. It is best to not touch those, and stick to editing the vignette file located in the vignettes/
folder. One exception: one can render a vignette manually with the Knit button and save the rendered output into the doc/
folder.
Do you want to host the vignette on a web page dedicated to your package, also with an overview of all the functions as well as their documentation? Then pkgdown
is your friend, but this is out of the scope of this tutorial (yes, the web page for the pkgdown
package is built with pkgdown
).
8.4 Update the description
In the top folder of your package is a DESCRIPTION file. This contains some important information. Make sure that you update the Title, Author, Maintainer, Description and License fields. The Imports field requires you to supply the names of the dependencies of your package: what packages need to be installed for your functions to work? In our example, mrep
calls functions from assertthat
and purrr
, so our Imports field will look something like:
Imports:
asserthat,
purrr
These dependencies will be downloaded and installed automatically upon installation of your package. You can specify version requirements for the packages you load (see Hadley’s book). The Suggests field is for packages that are not required but recommended (e.g. knitr
to build the vignette locally).
Dependencies will be downloaded from CRAN by default. In order to add packages from other platforms, you may have to add some keywords to your DESCRIPTION file. For example, the ggtree
package is hosted by Bioconductor. You can add it with the other packages in Imports, but you need to add “biocViews:” before Imports, e.g.
biocViews:
Imports:
asserthat,
purrr,
ggtree
A special case of dependencies is operators from other packages, such as the famous pipe (%>%
) from magrittr
, because you cannot just write magrittr::%>%
in your functions. Again, usethis
is our friend here, and you can run usethis::use_pipe()
to make the pipe operator fully available to your functions without having to use library
. (This command will update the NAMESPACE.)
As a minor note, you can also use the DESCRIPTION file to give extra options to the build of your documentation. For example, to allow roxygen2
to understand the markdown syntax when rendering the help pages of your functions, use
Roxygen: list(markdown = TRUE)
8.5 Check the package
8.5.1 Good practices
As mentioned before, CRAN has specific requirements that are implemented in the R CMD CHECK command. Running this command, or clicking on Check within the Build tab, will run a series of quality controls on your code, and will indicate what does not meet the requirements. A package is CRAN-compatible if no errors and no warnings are issued (notes are fine).
Generally, CHECK will make sure all the things we talked about above are done. It will look at the functions, the documentation, run your examples (and your tests if you have some) make sure that the vignette renders, and that all dependencies are accessible. If anything is wrong, it will tell you what.
One thing to keep in mind is that CHECK will run your examples (in the documentation files), unless these are surrounded with \dontrun{
and }
. This can be used for examples that, e.g., would require some specific data that you do not make available with the package, or just because the example takes too long or is too computation-heavy.
CHECK also dislikes files and folders that are not absolutely necessary to the package. It will complain if, say, you have a scripts/
folder with extra draft scripts you used to develop and try your functions, or a data/
folder containing some example data. You can add the names of these folders to .Rbuildignore
to tell CHECK to ignore those when checking your package (.Rbuildignore
works in many respects just like a .gitignore
file).
8.5.2 Better practices
If all the above are met, CHECK should be happy and in theory your package should be CRAN-compatible. Some platforms, such as rOpenSci, have stricter standards, however, and those requirements come from a good place. We will highlight two things here.
First, rOpenSci will require 100% code coverage in your package. This means that during the execution of the CHECK command, every single line of code must be run. This is often impossible to achieve without having tests, and thus strongly encourages test-driven development. The testthat
package can be used to write tests that check for the outcomes of your functions under different circumstances, or scenarios. See the section on regex
for examples. In a package, test
will be stored in a tests/
folder, which can be set-up by our old friend usethis
, by running usethis::use_testthat()
. Having tests is always good!
Second, rOpenSci will also check your coding style. In R, it is possible to write the same code in different ways, for example:
library(tidyverse)
x <- mrep(seq(3), seq(3))
y <- rep(1, 6)
tibble(
V1 = x,
V2 = y
)
#> # A tibble: 6 x 2
#> V1 V2
#> <int> <dbl>
#> 1 1 1
#> 2 2 1
#> 3 2 1
#> 4 3 1
#> 5 3 1
#> 6 3 1
versus
x = mrep(seq(3), seq(3))
y = rep(1, 6)
tibble(V1 =x,
V2 =y
)
#> # A tibble: 6 x 2
#> V1 V2
#> <int> <dbl>
#> 1 1 1
#> 2 2 1
#> 3 2 1
#> 4 3 1
#> 5 3 1
#> 6 3 1
Both styles will run, and CHECK will not complain. However, lintr
will. lintr
is a style checker that makes sure that you follow the tidyverse recommended style. This style includes things such as: no use of =
as an assignment operator (only use <-
), always put a space after an equal sign or a comma among others. lintr
will be run on all of your R code if you submit your package to rOpenSci. The reason behing using a style checker is similar to the basic philosophy of the tidyverse: make things follow a convention, so that pieces of code speak the same language (so to speak, pun intended) and integrate nicely with each other.
8.5.3 Even better practices
Git and GitHub (or other version control platforms) are your friends when it comes to developing packages or software in general. You may want to check out how to use them. One strength of these platforms is that they allow you to give access to third-party platforms to your package, that can be used to quality-control your code. These are known as continuous integration tools, Travis CI and AppVeyor being two famous examples. By activating these tools on your repository (hosted, say, on GitHub), these platforms can access your package and remotely run all kinds of things for you: run R CMD CHECK, make sure that the code coverage is 100%, or run lintr
for you, every time you upload an edited version of your code. This gives you an extra safety net to make sure that your package (or at least the version hosted online and available to people) is always working, and it may even give you a hint if, for example, one dependency of your package breaks (due to errors independent of you). If you want to know more, you can for example check the R package babette
, which makes use of these tools and is hosted at rOpenSci.
8.6 References
- Hadley’s book on developing R packages