Section 7 Programming in the tidyverse
Load the packages for the day.
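The package-loading code is not reproduced in this section. Judging by the functions used later (read_csv, the %>% pipe, map_int, str_replace_all, unqualified calls to parse_exprs, and glue::glue), a plausible sketch would be the following. The exact package set is an assumption, not the original chunk.

```r
# Assumed package set; the original loading chunk is not shown
library(tidyverse) # dplyr, readr, purrr, stringr and the %>% pipe
library(rlang)     # parse_exprs(), syms(), enquos(), expr()
library(glue)      # string interpolation, used to build column names
```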
A function to look at errors.
try_this <- function(ex) {
  tryCatch(
    expr = {
      ex
    },
    error = function(e) {
      print(glue::glue(as.character(e), "\n"))
    }
  )
}
7.1 An explanation of the problem
7.1.1 What the issue is
Get some data from Phylacine, and attempt to select or filter.
# read in phylacine data
data = read_csv("data/phylacine_traits.csv")
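# regular filtering
small_mammals = data %>%
  filter(Mass.g < 1000)
# quoted column name: presumably how small_mammals_too was made
small_mammals_too = data %>%
  filter("Mass.g" < 1000)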
Examine small_mammals and small_mammals_too to check whether they are as expected.
# count rows
map_int(list(sm_1 = small_mammals, sm2 = small_mammals_too),
nrow)
#> sm_1 sm2
#> 4381 0
The difference in the number of rows arises because dplyr::filter did not treat the string "Mass.g" as a variable in the dataframe.
Data-masking verbs such as filter and summarise (built on tidy evaluation, as implemented in rlang) make a distinction between the string "Mass.g" and the bare column name Mass.g.
A fuller explanation of (some of) the theory behind this can be found in the dplyr vignette Programming with dplyr.
The same issue arises with functions such as dplyr::summarise and dplyr::group_by.
# summarise using an unquoted variable
summarise(data,
mean_mass = mean(Mass.g))
#> # A tibble: 1 x 1
#> mean_mass
#> <dbl>
#> 1 156882.
# this will print a warning
summarise(data,
mean_mass = mean("Mass.g"))
#> Warning in mean.default("Mass.g"): argument is not numeric or logical: returning
#> NA
#> # A tibble: 1 x 1
#> mean_mass
#> <dbl>
#> 1 NA
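For comparison, a column name held as a string can be made to work by looking it up through the .data pronoun. This is a minimal sketch on a small made-up tibble (toy and mass_col are hypothetical names, not part of the Phylacine data).

```r
library(dplyr)

# hypothetical toy data standing in for the Phylacine traits
toy <- tibble::tibble(Mass.g = c(10, 2000, 500))

# the column name, held as a string
mass_col <- "Mass.g"

# .data[[...]] looks the string up as a column of the dataframe
toy_summary <- summarise(toy, mean_mass = mean(.data[[mass_col]]))
toy_small <- filter(toy, .data[[mass_col]] < 1000)
```

Here the string is never compared or averaged itself; it is only used to find the column.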
7.1.2 Why the issue is a problem
Consider an analysis pipeline as follows.
data %>% select variables %>% summarise by groups
data %>%
select(Mass.g, Diet.Plant, Order.1.2) %>%
group_by(Order.1.2) %>%
summarise_all(.funs = mean) %>%
head()
#> # A tibble: 6 x 3
#> Order.1.2 Mass.g Diet.Plant
#> <chr> <dbl> <dbl>
#> 1 Afrosoricida 306. 0.947
#> 2 Carnivora 47905. 14.1
#> 3 Cetartiodactyla 1854811. 76.2
#> 4 Chiroptera 49.1 27.3
#> 5 Cingulata 235529. 43.0
#> 6 Dasyuromorphia 748. 1.09
Now consider that this analysis pipeline is repeated many times in your document. Consider also that a well-intentioned person has renamed the dataframe columns.
data <- data %>%
`colnames<-`(str_replace_all(colnames(data), "\\.", "_") %>%
str_to_lower %>%
str_remove("_1_2"))
The group-summarise code above will no longer work.
try_this(ex =
data %>%
select(Mass.g, Diet.Plant, Order.1.2) %>%
group_by(Order.1.2) %>%
summarise_all(.funs = mean) %>%
head()
)
#> Error: Can't subset columns that don't exist.
#> ✖ Column `Mass.g` doesn't exist.
This illustrates the problem in part: when the columns to be operated upon are unknown to the programmer, much of basic tidyverse code cannot be generalised to be used with any dataframe.
7.1.3 Passing variables as strings is (also) an issue
The variables to be operated on could be given as strings, perhaps as the argument to a function, or as a global variable. This way, a single global vector could contain the grouping variables for all further summarise procedures.
This runs into the problem identified earlier.
# choose some variables
vars_to_select = c("Mass.g", "Diet.Plant")
vars_to_group = c("Order.1.2")
# attempt to select and summarise on group
# the tidyverse will not be pleased
try_this(ex =
data %>%
select(vars_to_select) %>% # this works with a warning
group_by(vars_to_group) %>%
summarise(mean_mass = mean(Mass.g),
mean_plant = mean(Diet.Plant))
)
#> Error: Can't subset columns that don't exist.
#> ✖ Columns `Mass.g` and `Diet.Plant` don't exist.
In the case of a standard filter %>% group %>% summarise pipeline, the function’s operations are evident. It must filter a dataframe based on one or more columns, and then summarise by groups. The filter to be applied, the variables to group by, and the variables to be summarised should be passed as function arguments; just how this is to be done is not immediately obvious.
7.2 Flexible selection is easy
Selection often precedes data operations, but is not part of the pipeline dealt with further.
This is because dplyr::select appears to work on both quoted and unquoted variables, but in general select helpers such as dplyr::all_of should be used when the column names are held in a character vector. These straightforward helper functions significantly expand select’s flexibility and ease of use, and are not covered here. See the select help for more information.
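Although the select helpers are outside the scope of this chapter, a brief sketch may help fix the idea. It uses a small made-up tibble (toy is hypothetical) and the all_of and any_of helpers.

```r
library(dplyr)

# hypothetical toy data with the renamed column style
toy <- tibble::tibble(
  mass_g = c(10, 2000),
  diet_plant = c(0, 80),
  order = c("A", "B")
)

vars_to_select <- c("mass_g", "diet_plant")

# all_of() is strict: it errors if any name is missing from the data
selected <- select(toy, all_of(vars_to_select))

# any_of() is lenient: names that are not present are skipped
selected_lenient <- select(toy, any_of(c(vars_to_select, "not_a_column")))
```

The strict form is usually the safer choice in a pipeline, since a renamed column then fails loudly at the selection step.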
7.3 A first attempt at a flexible function
The attempt below to write such a function, which gives the mean and confidence intervals of groups, is likely to fail.
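# define a ci function: half-width of a normal-approximation confidence interval
ci <- function(x, ci = 95) {
  # count only non-missing values, to match sd(na.rm = TRUE)
  n <- sum(!is.na(x))
  qnorm(1 - (1 - ci / 100) / 2) * sd(x, na.rm = TRUE) / sqrt(n)
}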
custom_summary <- function(data, filters, grouping_vars, summary_vars) {
data %>%
filter(filters) %>%
group_by(grouping_vars) %>%
summarise(mean = mean(summary_vars),
ci = ci(summary_vars))
}
7.3.1 Failure of the first attempt
# this is going to fail, so look at the error message
try_this(ex = custom_summary(data,
filters = list(mass_g > 1000),
grouping_vars = list(order, family),
summary_vars = list(diet_plant))
)
#> Error: Problem with `filter()` input `..1`.
#> ✖ object 'mass_g' not found
#> ℹ Input `..1` is `filters`.
This function failed at the first step because filter could not find mass_g. The argument list(mass_g > 1000) is evaluated in the environment from which the function was called, where mass_g is treated as an independent R object, while the function should instead treat it as a variable in a dataframe.
The difference between so-called data and environment variables is explained better at the rlang and tidyeval websites and tutorials linked at the end of this chapter. It is this difference that prevents filter from correctly interpreting mass_g.
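The distinction between data and environment variables can be illustrated with a minimal sketch. The tibble toy and the object cutoff are made up for the illustration; .data and .env are the pronouns that dplyr makes available inside data-masking verbs.

```r
library(dplyr)

# hypothetical toy data
toy <- tibble::tibble(mass_g = c(10, 2000, 500))

# an environment variable: an ordinary R object outside the dataframe
cutoff <- 1000

# mass_g is found in the dataframe (a data variable),
# cutoff is found in the calling environment (an environment variable)
heavy <- filter(toy, mass_g > cutoff)

# the pronouns make the distinction explicit
heavy_explicit <- filter(toy, .data$mass_g > .env$cutoff)
```

Inside filter both kinds of variable can be mixed freely. Outside it, as in list(mass_g > 1000), only environment variables are visible, which is why the first attempt fails.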
7.3.2 Passing arguments as strings doesn’t help
The example below tries to get filter to work. What could be tried? One option is to attempt passing the filtering process as a string argument, i.e., "mass_g > 1000".
# it doesn't matter whether filters is a vector or list
try_this(ex = custom_summary(data,
filters = c("mass_g > 1000"),
grouping_vars = list(order, family),
summary_vars = list(diet_plant))
)
#> Error: Problem with `filter()` input `..1`.
#> ✖ Input `..1` must be a logical vector, not a character.
#> ℹ Input `..1` is `filters`.
While this doesn’t work, it is on the right track: the filters argument needs some extra work beyond changing the type.
7.3.3 None of the other arguments will be successful
filter was the first step to fail, and evaluation stopped there. None of the other steps of the custom function would have worked either, for the same reason: all the arguments need some work before they can be passed to their respective functions.
7.4 Flexible filtering in a function
The first thing to try is to change how filter uses the argument passed to it.
Here, the argument filters is passed as a character vector, and is set by default to filter out mammals with masses below 1 kg.
The argument could be passed as a list, but the rlang::parse_exprs function works on vectors, not lists. The conversion between them is trivial for single level lists with atomic types (purrr::as_vector).
A brief detour: Expressions in R
A full explanation of how R works under the hood would take a very long time. A working knowledge of how these workings can be exploited is usually sufficient to use most of R’s functionality.
R expressions are one such feature. They represent R code that has been captured but not yet evaluated. Any string containing syntactically valid R code can be parsed (interpreted) as an R expression.
What does rlang::parse_exprs do? It interprets each string in a character vector as an R command, and returns the resulting expressions as a list.
This expression can then be evaluated later. Consider the following, where a is assigned the numeric value 3.
# a is assigned
a = 3
# parsed but not evaluated
rlang::parse_expr("a + 3")
#> a + 3
# evaluated
rlang::parse_expr("a + 3") %>% eval
#> [1] 6
Here, a + 3 was converted to an expression in the second command, and only evaluated in the third.
Unquoting with !!!
R expressions underlie R code. A captured expression can be injected into a call to another function, where it is then evaluated, using the special operators !! and !!!, for single expressions and lists of expressions respectively.
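A short sketch of the two operators, using a made-up tibble (toy is hypothetical): !! injects one expression, while !!! splices each element of a list in as a separate argument.

```r
library(dplyr)
library(rlang)

# hypothetical toy data
toy <- tibble::tibble(
  mass_g = c(10, 2000, 500),
  diet_plant = c(0, 80, 5)
)

# one captured expression, injected with !!
one_filter <- parse_expr("mass_g > 100")
res_one <- filter(toy, !!one_filter)

# a list of captured expressions, spliced with !!!
many_filters <- parse_exprs(c("mass_g > 100", "diet_plant < 10"))
res_many <- filter(toy, !!!many_filters)

# the spliced call is equivalent to writing the conditions out by hand
res_by_hand <- filter(toy, mass_g > 100, diet_plant < 10)
```

Since filter combines its arguments with a logical AND, splicing several expressions narrows the result in the same way as listing them by hand.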
7.4.1 Flexible filtering using expressions
Consider the case where mammals below 1 kg body mass are to be excluded. The dplyr code would look like this:
filter(data, mass_g > 1000)
This fixes both the variable to be filtered by, as well as the cut-off value. This can be made flexible for a custom function that allows any kind of filtering.
custom_summary = function(data,
filters = c("mass_g > 1000")) {
# THIS IS THE IMPORTANT BIT
filters = rlang::parse_exprs(filters)
data %>%
filter(!!!filters)
}
Try this function with single and multiple filters.
# mammals above a kilo
custom_summary(data,
filters = c("mass_g > 1000")) %>%
select(binomial, mass_g) %>%
head()
#> # A tibble: 6 x 2
#> binomial mass_g
#> <chr> <dbl>
#> 1 Acerodon_jubatus 1075
#> 2 Acinonyx_jubatus 46700
#> 3 Acratocnus_odontrigonus 22990
#> 4 Acratocnus_ye 21310
#> 5 Addax_nasomaculatus 70000.
#> 6 Aepyceros_melampus 52500.
# mammals between 250 and 500 g and which are mostly carnivorous
custom_summary(data,
filters = c("between(mass_g, 250, 500)",
"diet_plant < 10")) %>%
select(binomial, mass_g, diet_plant) %>%
head()
#> # A tibble: 6 x 3
#> binomial mass_g diet_plant
#> <chr> <dbl> <dbl>
#> 1 Chrysospalax_trevelyani 426. 0
#> 2 Cyclopes_didactylus 330. 0
#> 3 Desmana_moschata 383 0
#> 4 Dologale_dybowskii 350 0
#> 5 Hydromys_chrysogaster 480. 0
#> 6 Hyosciurus_heinrichi 296 0
The function filter correctly processes the string passed to filter the data.
7.5 Flexible grouping in a function
Just as the exact filtering approach can be controlled from a single string vector in the example above, the grouping variables can also be stored and passed as arguments using the ... (dots) argument. Dots are a convenient way of collecting all arguments that are not matched to a named parameter of a function.
Here, they are used to accept the grouping variables.
7.5.1 Using ... and ‘forwarding’
custom_summary = function(data,
filters = c("mass_g > 1000"),
...) {
# deal with groups
grouping_vars = rlang::enquos(...)
data %>%
filter(!!!rlang::parse_exprs(filters)) %>%
# this is the important bit
group_by(!!!grouping_vars)
}
Try the function again, and check the grouping variables.
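One way to try the function is sketched below on a small made-up tibble (toy is hypothetical, standing in for the Phylacine data). The function is repeated so that the example is self-contained, and dplyr::group_vars is used to check the grouping variables.

```r
library(dplyr)

# hypothetical toy data
toy <- tibble::tibble(
  order = c("A", "A", "B"),
  family = c("a1", "a2", "b1"),
  mass_g = c(1500, 20, 3000)
)

# the dots-based function from this section
custom_summary <- function(data, filters = c("mass_g > 1000"), ...) {
  grouping_vars <- rlang::enquos(...)
  data %>%
    filter(!!!rlang::parse_exprs(filters)) %>%
    group_by(!!!grouping_vars)
}

# name the filters argument, then pass grouping variables unquoted
grouped <- custom_summary(toy, filters = "mass_g > 1000", order, family)
group_vars(grouped)
```

Because filters comes before ... in the function signature, it is matched by position. Calling custom_summary(toy, order, family) would therefore hand order to filters, so it is safer to name filters explicitly.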
7.5.2 Passing grouping variables as strings
In the previous example, the grouping variables were passed as unquoted variables, then captured with enquos, after which they were spliced into group_by.
An alternative way of passing arguments to a function is as a string vector, i.e., grouping_vars = c("var_a", "var_b").
This can be done by interpreting the string vector as R symbols using rlang::syms. It could also be done by treating them as full expressions using the previously covered rlang::parse_exprs. However, both methods must use the unquote-splice operator (!!!), i.e., inject a list of R expressions into the call.
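The practical difference between the two approaches can be sketched as follows, on a made-up tibble (toy is hypothetical). syms treats each string purely as a column name, whereas parse_exprs parses each string as code, so a string may also describe a computed grouping.

```r
library(dplyr)
library(rlang)

# hypothetical toy data
toy <- tibble::tibble(
  order = c("A", "A", "B"),
  mass_g = c(1500, 20, 3000)
)

# syms(): each string becomes a column name, nothing more
by_name <- toy %>% group_by(!!!syms("order"))

# parse_exprs(): each string is parsed as code, so it may be a computation
by_expr <- toy %>% group_by(!!!parse_exprs("mass_g > 1000"))

group_vars(by_name)
n_groups(by_expr)
```

syms is the more conservative choice when only column names are expected, since a string such as "mass_g > 1000" would then be looked up as a column name and fail rather than being run as code.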
7.5.3 Using rlang::syms
custom_summary = function(data,
filters = c("mass_g > 1000"),
grouping_vars) {
# deal with groups
grouping_vars = rlang::syms(grouping_vars)
data %>%
filter(!!!rlang::parse_exprs(filters)) %>%
# this is the important bit
group_by(!!!grouping_vars)
}
custom_summary(data,
filters = c("mass_g > 1000"),
grouping_vars = c("order", "family")
) %>%
summarise(mean_mass = mean(mass_g)) %>%
head()
#> # A tibble: 6 x 3
#> # Groups: order [2]
#> order family mean_mass
#> <chr> <chr> <dbl>
#> 1 Afrosoricida Tenrecidae 13220
#> 2 Carnivora Ailuridae 4900
#> 3 Carnivora Canidae 10502.
#> 4 Carnivora Eupleridae 5853.
#> 5 Carnivora Felidae 52801.
#> 6 Carnivora Herpestidae 2334.
7.5.4 Using rlang::parse_exprs
custom_summary = function(data,
filters = c("mass_g > 1000"),
grouping_vars) {
# deal with groups
grouping_vars = rlang::parse_exprs(grouping_vars)
data %>%
filter(!!!rlang::parse_exprs(filters)) %>%
# this is the important bit
group_by(!!!grouping_vars)
}
custom_summary(data,
filters = c("mass_g > 1000"),
grouping_vars = c("family", "iucn_status")
) %>%
summarise(mean_mass = mean(mass_g)) %>%
head()
#> # A tibble: 6 x 3
#> # Groups: family [5]
#> family iucn_status mean_mass
#> <chr> <chr> <dbl>
#> 1 Ailuridae EN 4900
#> 2 Anomaluridae DD 1770
#> 3 Antilocapridae EP 40503.
#> 4 Antilocapridae LC 46083.
#> 5 Aotidae LC 1060
#> 6 Aplodontiidae LC 1004
7.6 Flexible summarising in a function
Summarising using string expressions has been around in the tidyverse for a very long time, and summarise_at is a function most users are familiar with, along with its variants summarise_if and summarise_all.
7.6.1 Using dplyr::summarise_at
Simply pass a string vector to the .vars argument of summarise_at, while passing a list, named or otherwise, of functions to the .funs argument.
custom_summary = function(data,
filters = c("mass_g > 1000"),
grouping_vars,
summary_vars,
summary_funs) {
# deal with groups
grouping_vars = rlang::parse_exprs(grouping_vars)
data %>%
filter(!!!parse_exprs(filters)) %>%
group_by(!!!grouping_vars) %>%
# important bit
summarise_at(.vars = summary_vars,
.funs = summary_funs)
}
custom_summary(data,
grouping_vars = c("order", "family"),
summary_vars = "mass_g",
summary_funs = list(this_is_a_mean = mean, sd))
#> # A tibble: 113 x 4
#> # Groups: order [24]
#> order family this_is_a_mean fn1
#> <chr> <chr> <dbl> <dbl>
#> 1 Afrosoricida Tenrecidae 13220 NA
#> 2 Carnivora Ailuridae 4900 NA
#> 3 Carnivora Canidae 10502. 11618.
#> 4 Carnivora Eupleridae 5853. 6234.
#> 5 Carnivora Felidae 52801. 88201.
#> 6 Carnivora Herpestidae 2334. 937.
#> # … with 107 more rows
7.6.2 Using the across function for summary variables
In dplyr 1.0.0, the summarise_* variants were superseded by the across function, which is used inside summarise. This works somewhat differently.
The example below shows how the mean of a trait of mammal groups can be found.
This example makes use of embracing using {{ }}, where the double curly braces capture the expression supplied as a function argument and inject it into the enclosed call, so that it is looked up among the columns of the dataframe rather than in the function environment.
custom_summary = function(data,
filters = c("mass_g > 1000"),
grouping_vars,
summary_vars) {
# deal with groups
grouping_vars = parse_exprs(grouping_vars)
data %>%
filter(!!!parse_exprs(filters)) %>%
group_by(!!!grouping_vars) %>%
# important bit
summarise(across({{ summary_vars }},
~ mean(.)))
}
custom_summary(data,
grouping_vars = c("order", "family"),
summary_vars = c(mass_g, diet_plant)) %>%
head()
#> # A tibble: 6 x 4
#> # Groups: order [2]
#> order family mass_g diet_plant
#> <chr> <chr> <dbl> <dbl>
#> 1 Afrosoricida Tenrecidae 13220 4
#> 2 Carnivora Ailuridae 4900 80
#> 3 Carnivora Canidae 10502. 15.0
#> 4 Carnivora Eupleridae 5853. 2.67
#> 5 Carnivora Felidae 52801. 0.348
#> 6 Carnivora Herpestidae 2334. 9.86
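Embracing can be seen in isolation in a smaller function. The sketch below uses a made-up tibble and a hypothetical helper, mean_by, that takes one grouping variable and one summary variable as unquoted arguments.

```r
library(dplyr)

# hypothetical toy data
toy <- tibble::tibble(
  order = c("A", "A", "B", "B"),
  mass_g = c(10, 30, 100, 300)
)

# hypothetical helper: both column arguments are embraced
mean_by <- function(data, group, var) {
  data %>%
    group_by({{ group }}) %>%
    summarise(mean_value = mean({{ var }}))
}

res <- mean_by(toy, order, mass_g)
```

Without the braces, group_by(group) would look for a column literally called group, and fail.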
across also accepts multiple functions, just as summarise_at did. This works as follows.
# mean and sd
data %>%
group_by(order, family) %>%
summarise(across(c(mass_g, diet_plant),
list(~ mean(.),
~ sd(.))
)
) %>%
head()
#> # A tibble: 6 x 6
#> # Groups: order [2]
#> order family mass_g_1 mass_g_2 diet_plant_1 diet_plant_2
#> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
#> 1 Afrosoricida Chrysochloridae 60.7 86.6 0 0
#> 2 Afrosoricida Tenrecidae 449. 2197. 1.5 6.83
#> 3 Carnivora Ailuridae 4900 NA 80 NA
#> 4 Carnivora Canidae 10268. 11568. 16.0 18.0
#> 5 Carnivora Eupleridae 3777. 5364. 4.6 6.72
#> 6 Carnivora Felidae 52801. 88201. 0.348 2.36
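The numbered suffixes in the output above (mass_g_1, mass_g_2) arise because the functions in the list are unnamed. A sketch of one alternative, assuming the summary variables are held as strings: name the functions in the list, and wrap the string vector in all_of. The tibble toy is made up for the illustration.

```r
library(dplyr)

# hypothetical toy data
toy <- tibble::tibble(
  order = c("A", "A", "B", "B"),
  mass_g = c(10, 30, 100, 300),
  diet_plant = c(0, 20, 50, 100)
)

# summary variables held as strings
summary_vars <- c("mass_g", "diet_plant")

toy_summary <- toy %>%
  group_by(order) %>%
  summarise(across(all_of(summary_vars),
                   list(mean = ~ mean(.x), sd = ~ sd(.x))))

names(toy_summary)
```

With a named list, the default naming scheme of across joins the column name and the function name, giving columns such as mass_g_mean and mass_g_sd.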
7.6.3 Summarise multiple variables using ...
Here, the unquoted and unnamed variables passed to the function are captured by ... and enquos-ed, i.e., their evaluation is delayed.
Then each variable is injected into a call to the mean function using !!, and this expression is captured using expr. Since there are multiple variables to summarise, these expressions are stored as a list.
custom_summary = function(data,
grouping_vars,
filters,
...) {
# deal with groups
grouping_vars = rlang::parse_exprs(grouping_vars)
# deal with summary variables
summary_vars = rlang::enquos(...)
# apply the summary function to the variables
summary_vars <- purrr::map(summary_vars, function(var) {
rlang::expr(mean(!!var, na.rm = TRUE))
})
data %>%
filter(!!!rlang::parse_exprs(filters)) %>%
group_by(!!!grouping_vars) %>%
# important bit
summarise(!!!summary_vars)
}
custom_summary(data,
grouping_vars = c("order", "family"),
filters = "mass_g > 10",
mass_g, diet_plant) %>%
head()
#> # A tibble: 6 x 4
#> # Groups: order [2]
#> order family `mean(mass_g, na.rm = T… `mean(diet_plant, na.rm = …
#> <chr> <chr> <dbl> <dbl>
#> 1 Afrosorici… Chrysochlori… 60.7 0
#> 2 Afrosorici… Tenrecidae 597. 2
#> 3 Carnivora Ailuridae 4900 80
#> 4 Carnivora Canidae 10268. 16.0
#> 5 Carnivora Eupleridae 3777. 4.6
#> 6 Carnivora Felidae 52801. 0.348
expr and enquo
expr and enquo do essentially the same thing: defusing/quoting (delaying evaluation of) R code. expr works on expressions supplied by the primary user, while enquo works on arguments passed to a function, and also records the environment in which the argument was written. When in doubt, ask whether the expression to be quoted has entered the function environment as an argument. If yes, use enquo, and if not, expr. The plural forms enquos and exprs exist for multiple arguments.
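The rule of thumb can be checked with a minimal sketch. The helper capture_arg is hypothetical and exists only to show enquo at work on a function argument.

```r
library(rlang)

# expr(): quote code typed directly by the user
typed <- expr(mean(mass_g))

# enquo(): quote code that arrived as a function argument
capture_arg <- function(x) enquo(x)
passed <- capture_arg(mean(mass_g))

# the quosure holds the same expression, plus the environment it came from
quo_get_expr(passed)
is_quosure(passed)
```

Neither call evaluates mean(mass_g), so no object called mass_g needs to exist at this point.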
7.6.3.1 Correct the names of summary variables
The example above returns summary variables that are not assigned a name.
The enquos function can take the names from the variable names when called with .named = TRUE, so that the column for mean(mass_g) is returned as mass_g.
Since it is useful to add a tag to make clear what the summary variable is (mean, variance etc.), an extra glue step is added to assign informative names to the summary variables.
custom_summary = function(data,
grouping_vars,
filters,
...) {
# deal with groups
grouping_vars = rlang::parse_exprs(grouping_vars)
# deal with summary variables
summary_vars = rlang::enquos(..., .named = TRUE)
# apply the summary function to the variables
summary_vars <- purrr::map(summary_vars, function(var) {
rlang::expr(mean(!!var, na.rm = TRUE))
})
# add a prefix to the summary variables
names(summary_vars) <- glue::glue('mean_{names(summary_vars)}')
data %>%
filter(!!!rlang::parse_exprs(filters)) %>%
group_by(!!!grouping_vars) %>%
# important bit
summarise(!!!summary_vars)
}
custom_summary(data,
grouping_vars = c("order", "family"),
filters = "mass_g > 10",
mass_g, diet_plant) %>%
head()
#> # A tibble: 6 x 4
#> # Groups: order [2]
#> order family mean_mass_g mean_diet_plant
#> <chr> <chr> <dbl> <dbl>
#> 1 Afrosoricida Chrysochloridae 60.7 0
#> 2 Afrosoricida Tenrecidae 597. 2
#> 3 Carnivora Ailuridae 4900 80
#> 4 Carnivora Canidae 10268. 16.0
#> 5 Carnivora Eupleridae 3777. 4.6
#> 6 Carnivora Felidae 52801. 0.348
7.6.4 Summarise with multiple functions
The final step is to pass multiple summary functions to the summary variables.
Unlike the earlier example using summarise(across(vars, funs)), the goal here is to apply one function to each variable.
This is done by passing the functions and the variables on which they should operate as strings, and using string interpolation via glue to construct a coherent R expression. This expression is then named and evaluated.
custom_summary = function(data,
grouping_vars,
filters,
functions,
summary_vars) {
# deal with groups
grouping_vars = parse_exprs(grouping_vars)
# deal with summary variables
# summary_vars = # enquos(..., .named = TRUE)
# apply the summary function to the variables
summary_exprs <- parse_exprs(glue::glue('{functions}({summary_vars}, na.rm = TRUE)'))
# add a prefix to the summary variables
names(summary_exprs) <- glue::glue('{functions}_{summary_vars}')
data %>%
filter(!!!parse_exprs(filters)) %>%
group_by(!!!grouping_vars) %>%
# important bit
summarise(!!!summary_exprs)
}
custom_summary(data,
grouping_vars = c("order", "family"),
filters = "mass_g > 10",
functions = c("mean", "var"),
summary_vars = c("mass_g", "diet_plant")) %>%
head()
#> # A tibble: 6 x 4
#> # Groups: order [2]
#> order family mean_mass_g var_diet_plant
#> <chr> <chr> <dbl> <dbl>
#> 1 Afrosoricida Chrysochloridae 60.7 0
#> 2 Afrosoricida Tenrecidae 597. 61.8
#> 3 Carnivora Ailuridae 4900 NA
#> 4 Carnivora Canidae 10268. 325.
#> 5 Carnivora Eupleridae 3777. 45.2
#> 6 Carnivora Felidae 52801. 5.57
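The string construction step can be inspected on its own, without any data. In this sketch the two vectors are the same as in the call above; glue is vectorised, so element i of functions is paired with element i of summary_vars, which is why only mean_mass_g and var_diet_plant appear in the output.

```r
library(rlang)

functions <- c("mean", "var")
summary_vars <- c("mass_g", "diet_plant")

# build one string of R code per (function, variable) pair
expr_strings <- glue::glue("{functions}({summary_vars}, na.rm = TRUE)")

# parse the strings into unevaluated expressions and name them
summary_exprs <- parse_exprs(expr_strings)
names(summary_exprs) <- glue::glue("{functions}_{summary_vars}")

expr_strings
names(summary_exprs)
```

To apply every function to every variable instead, the two vectors would first have to be expanded into all combinations; that is a different design from the pairwise one used here.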
7.7 Further resources
dplyr: https://dplyr.tidyverse.org/index.html
Tidy evaluation (superseded and archived, but still useful): https://tidyeval.tidyverse.org/
rlang: https://rlang.r-lib.org/