Section 8 Developing R packages

Raphael Scherrer (thanks to Pedro Neves for guiding me through these steps)

By now you know what R packages are and you have been using many of them, some of them part of the tidyverse and others not. R packages are modules, or coherent libraries of functions, designed for specific sets of tasks. Packages, or libraries, are common to many programming languages, the philosophy behind them being: pick only the tools you need for your task, without having to download all the possible toolboxes. Currently CRAN (the Comprehensive R Archive Network) hosts more than 16,000 packages, and that is not counting R packages hosted on other platforms such as GitHub, Bioconductor or rOpenSci. This is what makes R such a powerful and popular language. There are so many packages because anyone can write their own and make it available to others, so the growth of the R universe is very much community-driven. Here we will show you how to write your own package. Most of the content of this tutorial follows Hadley Wickham’s exhaustive book on R packages.

8.1 Why write packages?

You may very well have written analysis pipelines in R for various projects and never felt the need to make packages for them. So why bother? you may ask. The main reasons are:

deployment: packages make it easier for people to use your code
reproducibility: packages can be a convenient way to make your study fully reproducible
consistency: there is a common set of rules on how packages should be organized, which forces you to make your code understandable to everyone
security: the common conventions around package structure make it possible for third-party tools to check your code for bugs or style, which also means that you can trust packages hosted on platforms that are known to run these checks

8.2 Hands-on workflow

8.2.1 Primer: what is an RStudio project?

An RStudio project is a virtual context associated with a specific working directory on your computer. A project is the recommended unit of work for a given analysis. This is because it keeps track of the R workspace and history for that analysis, together with the working directory (meaning you never have to use setwd anymore). A project has the extension .Rproj. See this page for more information. As we shall see, developing a package requires creating a project for it.

8.2.2 Create a project for your package

In RStudio, click on File, then New Project. There, you have the option to create a new package. This will create all the files that are needed, in particular a DESCRIPTION, a NAMESPACE, a .Rbuildignore, and the man/ and R/ folders. Use the .Rproj file to develop the package (launching it will open RStudio and place you in the right directory). It is possible to create an R package by assembling all those files yourself, but RStudio really makes it painless.
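The same skeleton can also be created from the R console. The following is only a minimal sketch, assuming that the usethis package is installed. The name mypkg is hypothetical, and a temporary directory is used here just to keep the example harmless.

```r
# Assumption: the usethis package is installed.
# "mypkg" is a hypothetical package name used for illustration.
pkg_path <- file.path(tempdir(), "mypkg")

# Create the package skeleton without opening a new RStudio session
usethis::create_package(pkg_path, open = FALSE)

# Inspect what was created (hidden files such as .Rbuildignore included)
list.files(pkg_path, all.files = TRUE, no.. = TRUE)
```

In practice you would replace the temporary path with the folder where you want your package to live.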

8.2.3 Link to GitHub?

At this stage you may want to host your package on an online version control platform such as GitHub. One way to do this is the following. Assuming that git is already installed on your machine and linked to your GitHub account, you need to:

Create a project for your package locally (the step above)
Create an empty repository on GitHub for your package
Initialize git in the local copy by running git init from within the package directory
Stage and commit (git add . and git commit -m "some commit message")
Link the local copy to the remote one with git remote add origin https://github.com/username/reponame
Push using git push -u origin master (or main, depending on the name of your default branch)

You should be all set. Useful links include this page, this one and also the instructions given by GitHub upon creation of an empty online repository.

8.2.4 Write your functions

A package is little more than a convenient collection of functions that one may want to use repeatedly. Here we assume that you are comfortable with writing R functions. Prefer saving each function in its own R script (.R), stored in the dedicated R/ folder. Here is an example function that repeats each of several elements a given number of times and returns a vector of the results.

mrep <- function(x, n) {
  
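  # are_equal() alone only returns TRUE or FALSE; assert_that() turns it into an error
  assertthat::assert_that(assertthat::are_equal(length(x), length(n))) # security check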
  purrr::reduce(purrr::map2(x, n, ~ rep(.x, .y)), c)
  
}

We can use this function, for example, to repeat the number 1 once, number 2 twice and number 3 three times:

mrep(seq(3), seq(3))
#> [1] 1 2 2 3 3 3

Note that when calling functions from other packages (here purrr and assertthat) we do not use library or require, as this would attach all the functions of these packages. Instead we prefix the function name with the name of its package, separated by ::. Although a package that uses library will typically build just fine, it is considered bad practice and will not pass CRAN’s requirements, which are implemented in the R CMD check command (more on this later).
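To illustrate the :: notation outside of a package, here is a small sketch. It uses stats (shipped with R) for the first call. The second part assumes nothing about purrr being installed: it checks for the package first and falls back to base R otherwise. The variable names are made up for this example.

```r
# `::` calls a function from a package without attaching that package
x <- c(4, 8, 15, 16, 23, 42)
m <- stats::median(x)

# Use purrr only if it happens to be installed, otherwise fall back to base R
if (requireNamespace("purrr", quietly = TRUE)) {
  doubled <- purrr::map_dbl(x, ~ .x * 2)
} else {
  doubled <- vapply(x, function(v) v * 2, numeric(1))
}
```

Either branch gives the same result, and neither one changes which packages are attached to your session.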

8.2.5 Tests

Do you want to go test-driven? Then write your tests first, and follow those guidelines. Although tests are out of the scope of this tutorial, they are a vital part of package development, so we highly recommend this read as your next step to go further.

8.2.6 Document your functions

The documentation of a function is what shows up when you type ?function-name (e.g. ?purrr::reduce). When writing your package, you must provide documentation for each of your functions so that your users know what the function does, what arguments it takes and what it returns, and can see examples of the function being used. The documentation of each function goes in its own .Rd file, stored in the man/ folder.

roxygen2 is an R package that makes documentation very easy. It allows you to write the documentation as a header of a function’s R script, and save this header into its own .Rd file in man/. All the lines that go into the documentation must start with the special comment characters #'. If we take our previous example:

#' Repeat multiple things multiple times
#' 
#' A function to repeat multiple things multiple times.
#' 
#' @param x A vector of things
#' @param n A vector of the number of times each thing must be repeated
#' 
#' @details The function can take a vector of any atomic type
#' 
#' @return A vector of the same type as `x`
#'
#' @examples
#'
#' mrep(seq(3), seq(3))
#'
#' @export

mrep <- function(x, n) {
  
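  # are_equal() alone only returns TRUE or FALSE; assert_that() turns it into an error
  assertthat::assert_that(assertthat::are_equal(length(x), length(n))) # security check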
  purrr::reduce(purrr::map2(x, n, ~ rep(.x, .y)), c)
  
}

Here, everything starting with #' will be interpreted by roxygen2 as part of the documentation. Different fields can be supplied:

The first line is the title of the documentation page
The second paragraph is the description
@param goes for each of the parameters, with their description
@details if you want to be more specific on what happens backstage
@return tells the user what the function returns
@examples shows some use-cases
@export indicates that this function can be called explicitly by the user (as opposed to an internal function of the package that is only meant to be used by other functions of the package)
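As a contrast with an exported function, here is a sketch of how an internal helper could be documented. The function check_lengths is hypothetical and not part of the original example. It has no @export tag, and the roxygen2 tag @noRd asks roxygen2 not to generate an .Rd file for it.

```r
#' Check that two vectors have the same length
#'
#' Internal helper (hypothetical example). Without an `@export` tag it is
#' not part of the public interface of the package.
#'
#' @param x,n Vectors whose lengths are compared
#'
#' @return `TRUE` invisibly, or an error if the lengths differ
#'
#' @noRd
check_lengths <- function(x, n) {
  if (length(x) != length(n)) {
    stop("`x` and `n` must have the same length", call. = FALSE)
  }
  invisible(TRUE)
}
```

Other functions of the package could then call check_lengths(x, n) at the top of their body.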

Other fields such as @note can be specified, but these are the main ones. A package with incomplete documentation will build fine, but again this will not pass R CMD check for CRAN’s requirements, which require you, for example, to always have examples for exported functions.

To actually produce the documentation, run roxygen2::roxygenize() (or roxygenise()) from within the working directory of the package. roxygen2 may not update the NAMESPACE file if that file was not created by roxygen2 in the first place, so you may have to delete NAMESPACE before running roxygenize() (it will then automatically create a new one). We do not describe here what the NAMESPACE is, as it is a bit too advanced for this tutorial; just remember that you may have to delete it before documenting if you see a warning.

8.2.7 Build the package

Once some functions are added and their documentation is ready, the package should be able to build. Use the Install and Restart button under the Build tab in RStudio for that. Your package is now installed and loaded. Alternatively you can build and install your package from the command line by running R CMD INSTALL. If your package is on GitHub (or another remote server), you can also install it with devtools, for example with devtools::install_github("username/reponame").

8.3 Write a vignette

A vignette is a more user-oriented overview of your package. In contrast to the individual documentation of each function, the vignette takes the user for a tour of the package to show use-cases of the functions in context.

A vignette is written in Rmarkdown. The Rmarkdown language is out of the scope of this tutorial, but is a great way to combine textual information (it inherits from markdown) with embedded chunks of R code and their output (this tutorial is written in Rmarkdown). See this link, or this cheatsheet, or take inspiration from the source code of this tutorial to get more familiar with Rmarkdown.

We use the usethis package to set up everything we need to get our vignette ready. Running usethis::use_vignette("my-vignette") (with a name of your choice) will create a vignettes/ folder with the vignette .Rmd file in it, which you can then edit.

The vignette can be rendered in multiple output formats, such as an HTML web page or a LaTeX-looking PDF. RStudio does this through the Knit button, which calls the knitr package in the background. By default, upon creation of the vignette only the HTML output is supported. To change the possible outputs (e.g. allow both HTML and PDF), change the output part of the header of the .Rmd file to:

output:
  pdf_document: default
  html_document:
    keep_md: yes

Now the drop-down menu of the Knit button will offer the possibility to render the vignette as PDF as well as HTML.

The Knit button renders a vignette, but does not save it. You could of course save it manually, but devtools offers the build_vignettes function to automate this task. Running it will create two new folders, doc/ and Meta/. The former contains the rendered vignette, in the first format specified in the output header (so PDF in the above example), while the latter contains some data used to render that vignette. It is best not to touch those, and to stick to editing the vignette file located in the vignettes/ folder. One exception: one can render a vignette manually with the Knit button and save the rendered output into the doc/ folder.

Do you want to host the vignette on a web page dedicated to your package, also with an overview of all the functions as well as their documentation? Then pkgdown is your friend, but this is out of the scope of this tutorial (yes, the web page for the pkgdown package is built with pkgdown).

8.4 Update the description

In the top folder of your package is a DESCRIPTION file. This contains some important information. Make sure that you update the Title, Author, Maintainer, Description and License fields. The Imports field requires you to supply the names of the dependencies of your package: what packages need to be installed for your functions to work? In our example, mrep calls functions from assertthat and purrr, so our Imports field will look something like:

Imports:
  assertthat,
  purrr

These dependencies will be downloaded and installed automatically upon installation of your package. You can specify version requirements for the packages you load (see Hadley’s book). The Suggests field is for packages that are not required but recommended (e.g. knitr to build the vignette locally).
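To make the layout of these fields more concrete, the sketch below writes a small, hypothetical DESCRIPTION to a temporary file and reads the dependency fields back with base R. The package name mypkg and the version requirement on purrr are assumptions made up for illustration, not recommendations.

```r
# Hypothetical DESCRIPTION content; the version requirement is illustrative only
desc_lines <- c(
  "Package: mypkg",
  "Version: 0.1.0",
  "Imports:",
  "    assertthat,",
  "    purrr (>= 0.3.0)",
  "Suggests:",
  "    knitr"
)

# Write it to a temporary file and read the two dependency fields back
desc_file <- tempfile()
writeLines(desc_lines, desc_file)
desc <- read.dcf(desc_file, fields = c("Imports", "Suggests"))

# Split the comma-separated Imports field into individual entries
imports <- trimws(strsplit(desc[1, "Imports"], ",", fixed = TRUE)[[1]])
imports
```

A version requirement is written in parentheses after the package name, as in the purrr entry above.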

Dependencies will be downloaded from CRAN by default. In order to add packages from other platforms, you may have to add some keywords to your DESCRIPTION file. For example, the ggtree package is hosted by Bioconductor. You can add it with the other packages in Imports, but you need to add “biocViews:” before Imports, e.g.

biocViews:
Imports:
  assertthat,
  purrr,
  ggtree

A special case of dependencies is operators from other packages, such as the famous pipe (%>%) from magrittr, because you cannot just write magrittr::%>% in your functions. Again, usethis is our friend here, and you can run usethis::use_pipe() to make the pipe operator fully available to your functions without having to use library. (This command will update the NAMESPACE.)

As a minor note, you can also use the DESCRIPTION file to give extra options to the build of your documentation. For example, to allow roxygen2 to understand the markdown syntax when rendering the help pages of your functions, use

Roxygen: list(markdown = TRUE)

8.5 Check the package

8.5.1 Good practices

As mentioned before, CRAN has specific requirements that are implemented in the R CMD check command. Running this command, or clicking on Check within the Build tab, will run a series of quality controls on your code, and will indicate what does not meet the requirements. A package is CRAN-compatible if no errors and no warnings are issued (notes are often tolerated, but are best kept to a minimum).

Generally, CHECK will make sure all the things we talked about above are done. It will look at the functions and the documentation, run your examples (and your tests if you have some), and make sure that the vignette renders and that all dependencies are accessible. If anything is wrong, it will tell you what.

One thing to keep in mind is that CHECK will run your examples (in the documentation files), unless these are surrounded with \dontrun{ and }. This can be used for examples that, e.g., would require some specific data that you do not make available with the package, or just because the example takes too long or is too computation-heavy.
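As an illustration, the roxygen2 header below has one example that is run by the check and one that is wrapped in \dontrun{}. This is a sketch only: the file path is hypothetical, and the function body is a dependency-free stand-in for mrep so that the snippet can be run on its own.

```r
#' Repeat multiple things multiple times
#'
#' @param x A vector of things
#' @param n A vector of the number of times each thing must be repeated
#'
#' @examples
#' # Fast example: run by the check
#' mrep(seq(3), seq(3))
#'
#' \dontrun{
#' # Skipped by the check: relies on a hypothetical file not shipped with the package
#' big <- readRDS("path/to/large_input.rds")
#' mrep(big$x, big$n)
#' }
#'
#' @export
mrep <- function(x, n) {
  # Base-R stand-in for the body, so that this sketch has no dependencies
  stopifnot(length(x) == length(n))
  unlist(Map(rep, x, n), use.names = FALSE)
}
```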

CHECK also dislikes files and folders that are not absolutely necessary to the package. It will complain if, say, you have a scripts/ folder with extra draft scripts you used to develop and try your functions, or a data/ folder containing some example data. You can add the names of these folders to .Rbuildignore to tell CHECK to ignore those when checking your package (.Rbuildignore works in many respects just like a .gitignore file).
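One difference with .gitignore is worth illustrating: each line of .Rbuildignore is a regular expression matched, case-insensitively, against paths relative to the top of the package. The sketch below mimics that matching in plain R. The patterns and file names are hypothetical, and the helper is_ignored is made up for this example (it is not how R itself implements the check).

```r
# Hypothetical .Rbuildignore patterns (in the file itself, backslashes are not doubled)
ignore_patterns <- c("^.*\\.Rproj$", "^\\.Rproj\\.user$", "^scripts$")

# Hypothetical top-level entries of a package directory
paths <- c("mypkg.Rproj", ".Rproj.user", "scripts", "R", "DESCRIPTION")

# A path is ignored if it matches at least one pattern
is_ignored <- function(path, patterns) {
  any(vapply(patterns, grepl, logical(1), x = path,
             perl = TRUE, ignore.case = TRUE))
}

ignored <- vapply(paths, is_ignored, logical(1), patterns = ignore_patterns)
paths[ignored]
```

Anchoring a pattern with ^ and $, as in ^scripts$, avoids accidentally ignoring other files whose names merely contain the word scripts.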

8.5.2 Better practices

If all the above are met, CHECK should be happy and in theory your package should be CRAN-compatible. Some platforms, such as rOpenSci, have stricter standards, however, and those requirements come from a good place. We will highlight two things here.

First, rOpenSci will require 100% code coverage in your package. This means that during the execution of the CHECK command, every single line of code must be run. This is often impossible to achieve without having tests, and thus strongly encourages test-driven development. The testthat package can be used to write tests that check the outcomes of your functions under different circumstances, or scenarios. See the section on regex for examples. In a package, tests are stored in a tests/ folder, which can be set up by our old friend usethis, by running usethis::use_testthat(). Having tests is always good!
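To give an idea of what such tests look like, here is a minimal sketch, assuming that the testthat package is installed. The file name tests/testthat/test-mrep.R is hypothetical, and mrep is redefined here with base R only so that the snippet runs on its own. Inside a package the function would live in R/ and would not be redefined in the test file.

```r
# Stand-in for the package function, so that this sketch runs on its own
mrep <- function(x, n) {
  stopifnot(length(x) == length(n))
  unlist(Map(rep, x, n), use.names = FALSE)
}

# Hypothetical contents of tests/testthat/test-mrep.R
testthat::test_that("mrep repeats each element the requested number of times", {
  testthat::expect_equal(mrep(seq(3), seq(3)), c(1L, 2L, 2L, 3L, 3L, 3L))
  testthat::expect_length(mrep(c("a", "b"), c(2, 0)), 2)
})

testthat::test_that("mrep rejects inputs of different lengths", {
  testthat::expect_error(mrep(1:3, 1:2))
})
```

Each test_that() block describes one scenario, and the second one makes sure that the error path of the function is also exercised, which counts towards code coverage.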

Second, rOpenSci will also check your coding style. In R, it is possible to write the same code in different ways, for example:

library(tidyverse)
x <- mrep(seq(3), seq(3))
y <- rep(1, 6)
tibble(
  V1 = x,
  V2 = y
)
#> # A tibble: 6 x 2
#>      V1    V2
#>   <int> <dbl>
#> 1     1     1
#> 2     2     1
#> 3     2     1
#> 4     3     1
#> 5     3     1
#> 6     3     1

versus

x = mrep(seq(3), seq(3))
y = rep(1, 6)
tibble(V1 =x,
       V2 =y
)
#> # A tibble: 6 x 2
#>      V1    V2
#>   <int> <dbl>
#> 1     1     1
#> 2     2     1
#> 3     2     1
#> 4     3     1
#> 5     3     1
#> 6     3     1

Both styles will run, and CHECK will not complain. However, lintr will. lintr is a style checker that makes sure that you follow the tidyverse recommended style. This style includes things such as: no use of = as an assignment operator (only use <-), and always put a space after an equal sign or a comma, among others. lintr will be run on all of your R code if you submit your package to rOpenSci. The reason behind using a style checker is similar to the basic philosophy of the tidyverse: make things follow a convention, so that pieces of code speak the same language (so to speak, pun intended) and integrate nicely with each other.
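You can try this out yourself. The sketch below, which assumes that the lintr package is installed, writes the second code style to a temporary file and runs lintr on it with its default settings. The exact messages depend on the lintr version, so none are reproduced here.

```r
# Assumption: the lintr package is installed.
# Write the badly styled code to a temporary R script
code_file <- tempfile(fileext = ".R")
writeLines(
  c(
    "x = mrep(seq(3), seq(3))",
    "y = rep(1, 6)",
    "tibble(V1 =x,",
    "       V2 =y",
    ")"
  ),
  code_file
)

# Check the script with the default linters; the code is analyzed, not executed
lints <- lintr::lint(code_file)
print(lints)
```

From within a package project, lintr::lint_package() applies the same checks to all the R code of the package.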

8.5.3 Even better practices

Git and GitHub (or other version control platforms) are your friends when it comes to developing packages or software in general. You may want to check out how to use them. One strength of these platforms is that they let you grant third-party services access to your package, which can then be used to quality-control your code. These are known as continuous integration tools, Travis CI and AppVeyor being two famous examples. Once these tools are activated on your repository (hosted, say, on GitHub), they can access your package and remotely run all kinds of things for you every time you upload an edited version of your code: run R CMD check, make sure that the code coverage is 100%, or run lintr. This gives you an extra safety net to make sure that your package (or at least the version hosted online and available to people) is always working, and it may even give you a hint if, for example, one dependency of your package breaks (due to errors independent of you). If you want to know more, you can for example check the R package babette, which makes use of these tools and is hosted at rOpenSci.

8.6 References

Hadley’s book on developing R packages