Section 7 Programming in the tidyverse

Load the packages for the day.

A function to look at errors.

7.1 An explanation of the problem

7.1.1 What the issue is

Get some data from Phylacine, and attempt to select or filter.

Examine small_mammals and small_mammals_too to check whether they are as expected.

The difference in the number of rows is because dplyr::filter could not understand the string "Mass.g" as a variable in the dataframe.

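This is because the tidyverse makes a distinction between the string "Mass.g" and the bare variable name Mass.g. dplyr::filter uses data masking, in which only the bare name is looked up as a column of the dataframe; the string is treated as an ordinary character value.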
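The distinction can be illustrated with a minimal sketch. The small table below is a hypothetical stand-in for the Phylacine data (the real file and its contents are not reproduced here), and the 1000 g threshold is an assumption made for illustration.

```r
library(dplyr)

# Hypothetical stand-in for the Phylacine trait table
mammals <- tibble(
  Binomial.1.2 = c("Mus_musculus", "Panthera_leo", "Sorex_minutus"),
  Mass.g = c(20, 160000, 4)
)

# Bare name: Mass.g is looked up as a column, so rows are filtered by mass
small_mammals <- filter(mammals, Mass.g < 1000)

# String: "Mass.g" is an ordinary character value, so the comparison gives
# a single logical value that is recycled; all rows are kept or all dropped
small_mammals_too <- filter(mammals, "Mass.g" < 1000)

nrow(small_mammals)
nrow(small_mammals_too)
```

The two results have different numbers of rows even though the two calls look almost identical.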

A better explanation of (some of) the theory behind this can be found here: Programming with dplyr.

The same issue arises with functions such as dplyr::summarise and dplyr::group_by.

7.1.2 Why the issue is a problem

Consider an analysis pipeline as follows.

data %>% select variables %>% summarise by groups

Now consider that this analysis pipeline is repeated many times in your document. Consider also that a well-intentioned person has renamed the dataframe columns.

The group-summarise code above will no longer work.

This illustrates the problem in part: when the columns to be operated upon are unknown to the programmer, much of basic tidyverse code cannot be generalised to be used with any dataframe.
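A minimal sketch of this situation, assuming a hypothetical table with columns order_name and mass_g and a hypothetical helper mean_mass, might look as follows.

```r
library(dplyr)

# Hypothetical data; column names are assumptions for illustration
mammals <- tibble(
  order_name = c("Rodentia", "Rodentia", "Carnivora"),
  mass_g = c(20, 30, 160000)
)

# The pipeline hard-codes the column names it operates on
mean_mass <- function(data) {
  data %>%
    select(order_name, mass_g) %>%
    group_by(order_name) %>%
    summarise(mean_mass = mean(mass_g))
}

original <- mean_mass(mammals)

# After the columns are renamed, the same code raises an error
renamed <- rename(mammals, Order = order_name, Mass = mass_g)
after_rename <- tryCatch(
  mean_mass(renamed),
  error = function(e) "pipeline failed"
)
```

The pipeline works on the original table, while on the renamed one the error is caught and after_rename holds the placeholder string.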

7.1.3 Passing variables as strings is (also) an issue

The variables to be operated on could be given as strings, perhaps as the argument to a function, or as a global variable. This way, a single global vector could contain the grouping variables for all further summarise procedures.

This runs into the problem identified earlier.

In the case of a standard filter %>% group %>% summarise pipeline, the function’s operations are evident. It must filter a dataframe based on one or more columns, and then summarise by groups. The filter to be applied, the variables to group by, and the variables to be summarised should be passed as function arguments. Just how this is to be done is not immediately obvious.

7.2 Flexible selection is easy

Selection often precedes data operations, but is not part of the pipeline dealt with further.

This is because dplyr::select appears to work on both quoted and unquoted variables, but in general some useful select helpers such as dplyr::all_of should be used. These straightforward helper functions significantly expand select’s flexibility and ease of use, and are not covered here. See the select help for more information.
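As a brief illustration of one such helper, the sketch below selects columns named in a character vector using all_of. The table and the vector vars_to_keep are hypothetical.

```r
library(dplyr)

# Hypothetical data
mammals <- tibble(
  binomial = c("Mus_musculus", "Panthera_leo"),
  order_name = c("Rodentia", "Carnivora"),
  mass_g = c(20, 160000)
)

# Column names stored as strings, e.g. in a global vector
vars_to_keep <- c("binomial", "mass_g")

# all_of() tells select that the vector holds column names
selected <- select(mammals, all_of(vars_to_keep))
```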

7.3 A first attempt at a flexible function

The attempt below to write such a function, which gives the mean and confidence intervals of groups, is likely to fail.
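The original function is not reproduced here. A simplified sketch of such a naive attempt, computing only a group mean and using hypothetical names (naive_summary, order_name, mass_g), could look like this.

```r
library(dplyr)

# Hypothetical data
mammals <- tibble(
  order_name = c("Rodentia", "Rodentia", "Carnivora", "Carnivora"),
  mass_g = c(20, 30, 160000, 5000)
)

# Naive attempt: arguments are handed directly to the dplyr verbs
naive_summary <- function(data, filters, grouping_var, summary_var) {
  data %>%
    filter(filters) %>%
    group_by(grouping_var) %>%
    summarise(mean = mean(summary_var))
}

# The call fails; the error message is captured rather than stopping the script
attempt <- tryCatch(
  naive_summary(mammals, mass_g > 1000, order_name, mass_g),
  error = function(e) conditionMessage(e)
)
```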

7.3.1 Failure of the first attempt

This function initially failed because filter could not find mass_g in the dataframe. This is because mass_g is treated as an independent R object, while the function should instead treat it as a variable in a dataframe.

The difference between so-called data and environment variables is explained better at the rlang and tidyeval websites and tutorials linked at the end of this chapter. It is this difference that prevents filter from correctly interpreting mass_g.

7.3.2 Passing arguments as strings doesn’t help

The example below tries to get filter to work. What could be tried? One option is to attempt passing the filtering process as a string argument, i.e., "mass_g > 1000".

While this doesn’t work, it is on the right track: the filters argument needs some extra processing, beyond simply changing its type, before filter can use it.
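A minimal sketch of the string attempt, with a hypothetical helper string_filter, shows that filter rejects a character value because it expects a logical vector.

```r
library(dplyr)

# Hypothetical data
mammals <- tibble(
  order_name = c("Rodentia", "Carnivora"),
  mass_g = c(20, 160000)
)

# The filter is passed as a string and handed to filter() unchanged
string_filter <- function(data, filters) {
  filter(data, filters)
}

# filter() receives a character value, not a logical vector, and errors
attempt <- tryCatch(
  string_filter(mammals, "mass_g > 1000"),
  error = function(e) conditionMessage(e)
)
```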

7.3.3 None of the other arguments will be successful

filter was the first step to fail, and the error stopped further evaluation. However, none of the steps in the custom function would have worked, for the same reason: every argument needs some processing before it can be passed to its respective function.

7.4 Flexible filtering in a function

The first thing to try is to change how filter uses the argument passed to it. Here, the argument filters is passed as a character vector, and is set by default to filter out mammals with masses below 1 kg.

The argument could be passed as a list, but the rlang::parse_exprs function works on vectors, not lists. The conversion between them is trivial for single level lists with atomic types (purrr::as_vector).

A brief detour: Expressions in R

A full explanation of how R works under the hood would take a very long time. A working knowledge of the parts that can be exploited is usually sufficient to use most of R’s functionality.

R expressions are one such part. An expression is R code that has been captured but not yet evaluated. A string containing syntactically valid R code can be parsed (interpreted) as an R expression.

What does rlang::parse_exprs do? It interprets a string as an R command. This expression can then be evaluated later. Consider the following, where a is assigned the numeric value 3.

Here, a + 3 was converted to an expression in the second command, and only evaluated in the third.
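The three commands can be sketched as follows. This sketch uses rlang::parse_expr, the single-expression counterpart of parse_exprs (which returns a list of expressions); the object names are assumptions.

```r
library(rlang)

# First command: assign a value
a <- 3

# Second command: parse the string into an expression; nothing is computed yet
a_plus_3 <- parse_expr("a + 3")

# Third command: evaluate the stored expression
result <- eval(a_plus_3)
result
```

Because the expression is only evaluated in the last step, changing a before calling eval would change the result.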

Unquoting with !!!

R expressions underlie R code. Their evaluation can be forced inside another function using the special operators !! and !!!, for single and multiple R expressions respectively.
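Combining parsing and the unquote-splice operator gives a flexible filter. The sketch below is one possible version under assumed names (filter_mammals, mass_g, order_name); the default keeps mammals heavier than 1 kg.

```r
library(dplyr)
library(rlang)

# Hypothetical data
mammals <- tibble(
  order_name = c("Rodentia", "Rodentia", "Carnivora", "Carnivora"),
  mass_g = c(20, 30, 160000, 5000)
)

filter_mammals <- function(data, filters = c("mass_g > 1000")) {
  # Parse each string into an unevaluated expression
  filters <- parse_exprs(filters)
  # Splice the list of expressions into filter() as separate conditions
  filter(data, !!!filters)
}

large <- filter_mammals(mammals)
very_large_carnivores <- filter_mammals(
  mammals,
  filters = c("mass_g > 10000", "order_name == 'Carnivora'")
)
```

Multiple conditions spliced in this way are combined with a logical AND, as with any comma-separated conditions in filter.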

7.5 Flexible grouping in a function

Just as the exact filtering approach can be controlled from a single string vector in the example above, the grouping variables can also be stored and passed as arguments using the ... (dots) argument. Dots are a convenient way of capturing any arguments that do not match one of the function’s named parameters. Here, they are used to accept the grouping variables.
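A minimal sketch of grouping via dots, using a hypothetical function count_by and hypothetical columns, could be written as:

```r
library(dplyr)
library(rlang)

# Hypothetical data
mammals <- tibble(
  order_name = c("Rodentia", "Rodentia", "Carnivora", "Carnivora"),
  family_name = c("Muridae", "Muridae", "Felidae", "Canidae"),
  mass_g = c(20, 30, 160000, 5000)
)

count_by <- function(data, ...) {
  # Capture the unquoted grouping variables without evaluating them
  grouping_vars <- enquos(...)
  data %>%
    group_by(!!!grouping_vars) %>%
    summarise(n = n(), .groups = "drop")
}

by_order <- count_by(mammals, order_name)
by_order_family <- count_by(mammals, order_name, family_name)
```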

7.5.2 Passing grouping variables as strings

In the previous example, the grouping variables were passed as unquoted variables, then enquo-ted and parsed, after which they were applied. An alternative way of passing arguments to a function is as a string vector, i.e., grouping_vars = c("var_a", "var_b").

This can be done by interpreting the string vector as R symbols using rlang::syms. It could also be done by treating them as a full expression using the previously covered rlang::parse_exprs. However, both methods must use an unquoting-splice (!!!), i.e., force the evaluation of a list of R expressions.
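The symbol-based approach can be sketched as below, assuming a hypothetical function count_by_strings and hypothetical column names.

```r
library(dplyr)
library(rlang)

# Hypothetical data
mammals <- tibble(
  order_name = c("Rodentia", "Rodentia", "Carnivora", "Carnivora"),
  family_name = c("Muridae", "Muridae", "Felidae", "Canidae"),
  mass_g = c(20, 30, 160000, 5000)
)

count_by_strings <- function(data, grouping_vars) {
  # Convert each string into a symbol, i.e. a bare variable name
  grouping_vars <- syms(grouping_vars)
  data %>%
    group_by(!!!grouping_vars) %>%
    summarise(n = n(), .groups = "drop")
}

by_order_family <- count_by_strings(mammals, c("order_name", "family_name"))
```

syms is the stricter choice when only column names are expected, since parse_exprs would also accept arbitrary code such as "log(mass_g)".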

7.6 Flexible summarising in a function

Summarising using string expressions has been around in the tidyverse for a very long time, and summarise_at is a function most users are familiar with, along with its variants summarise_if and summarise_all.
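As a short illustration, summarise_at accepts the variables to summarise as a character vector. The data and the vector summary_vars below are hypothetical.

```r
library(dplyr)

# Hypothetical data
mammals <- tibble(
  order_name = c("Rodentia", "Rodentia", "Carnivora", "Carnivora"),
  mass_g = c(20, 30, 160000, 5000),
  length_cm = c(8, 10, 200, 90)
)

# Variables to summarise, stored as strings
summary_vars <- c("mass_g", "length_cm")

group_means <- mammals %>%
  group_by(order_name) %>%
  summarise_at(summary_vars, mean)
```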

7.6.2 Using across for summary variables

In dplyr 1.0.0, the summarise_* functions were superseded by the across function, which is used inside summarise. This works somewhat differently. The example below shows how the mean of a trait of mammal groups can be found.

This example makes use of embracing using {{ }}, where the double curly braces tell the dplyr verb to use the column that the caller supplied as the function argument, rather than looking for a variable with the argument’s own name.
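A minimal sketch of embracing with across, under assumed names (mean_by_group, order_name, mass_g), might be:

```r
library(dplyr)

# Hypothetical data
mammals <- tibble(
  order_name = c("Rodentia", "Rodentia", "Carnivora", "Carnivora"),
  mass_g = c(20, 30, 160000, 5000)
)

mean_by_group <- function(data, group_var, summary_var) {
  data %>%
    # {{ }} forwards the caller's unquoted column to the dplyr verb
    group_by({{ group_var }}) %>%
    summarise(across({{ summary_var }}, mean), .groups = "drop")
}

group_means <- mean_by_group(mammals, order_name, mass_g)
```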

across also accepts multiple functions, just as the summarise_* functions did. This works as follows.
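One way to sketch this is to pass a named list of functions to across; the function name mean_sd_by_group and the data are assumptions.

```r
library(dplyr)

# Hypothetical data
mammals <- tibble(
  order_name = c("Rodentia", "Rodentia", "Carnivora", "Carnivora"),
  mass_g = c(20, 30, 160000, 5000)
)

mean_sd_by_group <- function(data, group_var, summary_var) {
  data %>%
    group_by({{ group_var }}) %>%
    # A named list of functions gives one output column per function
    summarise(
      across({{ summary_var }}, list(mean = mean, sd = sd)),
      .groups = "drop"
    )
}

group_stats <- mean_sd_by_group(mammals, order_name, mass_g)
names(group_stats)
```

The output columns are named by joining the variable name and the function name with an underscore.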

7.6.3 Summarise multiple variables using ...

Here, the unquoted and unnamed variables passed to the function are captured by ... and enquos-ed, i.e., their evaluation is delayed. Then each variable is unquoted inside a call to the mean function, and this call is itself captured, unevaluated, using expr. Since there are multiple variables to summarise, these expressions are stored as a list.
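The steps described above can be sketched as follows. The function name summarise_means, the output naming scheme (a mean_ prefix), and the data are assumptions made for illustration.

```r
library(dplyr)
library(rlang)

# Hypothetical data
mammals <- tibble(
  order_name = c("Rodentia", "Rodentia", "Carnivora", "Carnivora"),
  mass_g = c(20, 30, 160000, 5000),
  length_cm = c(8, 10, 200, 90)
)

summarise_means <- function(data, group_var, ...) {
  # Capture the summary variables without evaluating them
  summary_vars <- enquos(...)
  # Build one unevaluated call, mean(<variable>), per captured variable
  summary_exprs <- lapply(summary_vars, function(v) expr(mean(!!v)))
  # Name each expression so the output columns are labelled
  names(summary_exprs) <- paste0(
    "mean_", vapply(summary_vars, as_name, character(1))
  )
  data %>%
    group_by({{ group_var }}) %>%
    summarise(!!!summary_exprs, .groups = "drop")
}

group_means <- summarise_means(mammals, order_name, mass_g, length_cm)
```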

expr and enquo

expr and enquo do essentially the same thing: both defuse, or quote, R code, i.e., delay its evaluation. expr works on expressions supplied by the primary user, while enquo works on arguments passed to a function. When in doubt, ask whether the expression to be quoted has entered the function environment as an argument. If yes, use enquo, and if not, use expr. The plural forms enquos and exprs exist for multiple arguments.
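The difference can be seen in a small sketch. The helper functions capture_arg and capture_wrong are hypothetical and exist only to contrast the two operators.

```r
library(rlang)

# expr() quotes code written directly by the user
user_expr <- expr(mass_g / 1000)

# enquo() quotes the code the caller supplied as an argument
capture_arg <- function(x) enquo(x)
arg_quo <- capture_arg(mass_g / 1000)

# expr() inside a function quotes the argument's name, not what was passed
capture_wrong <- function(x) expr(x)
wrong <- capture_wrong(mass_g / 1000)

quo_get_expr(arg_quo)
wrong
```

enquo recovers the code that was passed in, whereas expr used on a function argument only returns the symbol x.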

7.6.4 Summarise with multiple functions

The final step is to pass multiple summary functions to the summary variables. Unlike the earlier example using summarise(across(vars, funs)), the goal here is to apply one function to each variable.

This is done by passing the functions and the variables on which they should operate as strings, and using string interpolation via glue to construct a coherent R expression. This expression is then named and evaluated.
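A possible sketch of this approach is given below. The function name summarise_custom, its argument names, and the naming scheme for output columns are assumptions; functions and variables are matched by position.

```r
library(dplyr)
library(rlang)
library(glue)

# Hypothetical data
mammals <- tibble(
  order_name = c("Rodentia", "Rodentia", "Carnivora", "Carnivora"),
  mass_g = c(20, 30, 160000, 5000),
  length_cm = c(8, 10, 200, 90)
)

summarise_custom <- function(data, grouping_vars, functions, variables) {
  # Interpolate strings such as "mean(mass_g)" and parse them as expressions
  summary_exprs <- parse_exprs(as.character(glue("{functions}({variables})")))
  # Name the expressions, e.g. "mean_mass_g", to label the output columns
  names(summary_exprs) <- as.character(glue("{functions}_{variables}"))
  data %>%
    group_by(!!!syms(grouping_vars)) %>%
    summarise(!!!summary_exprs, .groups = "drop")
}

group_summary <- summarise_custom(
  mammals,
  grouping_vars = "order_name",
  functions = c("mean", "max"),
  variables = c("mass_g", "length_cm")
)
```

Since the strings are parsed as code, this approach should only be used with trusted input.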

7.7 Further resources