Section 4 Working with lists and iteration

4.1 List columns with tidyr

4.1.1 Nesting data

It may become necessary to indicate the groups of a tibble in a somewhat more explicit way than simply using dplyr::group_by. tidyr offers the option to create nested tibbles, that is, to store complex objects in the columns of a tibble. This includes other tibbles, as well as model objects and plots.

NB: Nesting data is done using tidyr::nest, which is different from the similarly named tidyr::nesting.

The example below shows how Phylacine data can be converted into a nested tibble.

The data is now a nested data frame. Its two columns are, respectively, a character vector (the order name) and a list (the data for all mammals in the corresponding order).

While nest can be used without first grouping the tibble, it’s just much easier to group first.
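As a minimal sketch of nesting, the code below uses the built-in mtcars data as a stand-in for Phylacine (an assumption made so the example is self-contained), with cyl playing the role of the grouping column. Here the grouping column is numeric rather than character.

```r
library(dplyr)
library(tidyr)

# Assumption: mtcars stands in for the Phylacine data,
# and cyl takes the place of the taxonomic order.
nested_cars <- as_tibble(mtcars) %>%
  group_by(cyl) %>%
  nest()

# One row per group; all other columns are packed into the list column `data`
nested_cars

# Class of each column: numeric for cyl, list for data
sapply(nested_cars, class)

# The same nesting without grouping first: the columns to pack
# must then be named explicitly
nested_cars_alt <- nest(as_tibble(mtcars), data = -cyl)
```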

4.1.2 Unnesting data

A nested tibble can be converted back into the original, or into a processed form, using tidyr::unnest. The original groups are retained.

The unnest_longer and unnest_wider variants of unnest are maturing functions, that is, not in their final form. They allow interesting variations on unnesting; these are shown here but advised against. Instead, unnest the data first, and then convert it to the form needed.
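A minimal sketch of unnesting, again assuming mtcars as a stand-in dataset: the nested tibble is first restored to its original shape, and then unnested after a small processing step.

```r
library(dplyr)
library(tidyr)

# Assumption: mtcars stands in for the Phylacine data
nested_cars <- as_tibble(mtcars) %>%
  group_by(cyl) %>%
  nest()

# Back to one row per car
unnested_cars <- unnest(nested_cars, cols = data)

# A processed form: keep only the first two rows of each group, then unnest
first_two <- nested_cars %>%
  mutate(data = lapply(data, head, n = 2)) %>%
  unnest(cols = data)
```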

4.1.3 Working with list columns

The class of a list column is list, and working with list columns (and lists, and list-like objects such as vectors) makes iteration necessary, since iteration is one of the few ways to operate on lists.

The two examples below get the class and the number of rows of the nested tibbles in the list column.

Functionals

The second example uses lapply, and this is a functional. Functionals are functions that take another function as one of their arguments. Base R functionals include the *apply family of functions: apply, lapply, vapply and so on.
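The two approaches can be sketched as follows, assuming a nested version of mtcars in place of the Phylacine data. The first uses an explicit for loop, the second uses the base R functionals lapply and vapply.

```r
library(dplyr)
library(tidyr)

# Assumption: mtcars stands in for the Phylacine data
nested_cars <- as_tibble(mtcars) %>% nest(data = -cyl)

# 1. A for loop over the list column, filling a pre-allocated vector
n_rows <- integer(nrow(nested_cars))
for (i in seq_along(nested_cars$data)) {
  n_rows[i] <- nrow(nested_cars$data[[i]])
}

# 2. Functionals: the function to apply is passed as an argument
classes <- lapply(nested_cars$data, class)   # a list of class vectors
n_rows_list <- lapply(nested_cars$data, nrow) # a list of integers

# vapply additionally checks the type and length of each result
n_rows_vec <- vapply(nested_cars$data, nrow, FUN.VALUE = integer(1))
```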

4.2 Iteration with map

The tidyverse replaces traditional loop-based iteration with functionals from the purrr package.

Why use purrr

A good reason to use purrr functionals instead of base R functionals is their consistent and clear naming, which always indicates how they should be used. This is explained in the examples below. How map differs from for and lapply is best explained in the Advanced R book.

4.2.1 Basic use of map

map works very similarly to lapply: .x is the object on whose elements the function .f is applied.

map works on any list-like object, which includes vectors, and always returns a list. map takes two arguments, the object on which to operate, and the function to apply to each element.
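A minimal sketch of basic map usage on a plain integer vector. The helper square is a hypothetical function defined only for this illustration.

```r
library(purrr)

# Hypothetical helper used only for this illustration
square <- function(x) x^2

# .x is the object to iterate over, .f the function applied to each element
squares <- map(.x = 1:3, .f = square)

# The same call using purrr's formula shorthand, where .x is the current element
squares_short <- map(1:3, ~ .x^2)

# lapply gives the same list
squares_base <- lapply(1:3, square)
```

Even though the input is an atomic vector, the result is always a list.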

4.2.3 map variants returning data frames

map_df returns a data frame, binding the results by row by default; map_dfr does this explicitly, while map_dfc returns a data frame bound by column.

map accepts arguments to the function being mapped, such as in the example above, where head() accepts the argument n = 2.

map_dfr behaves the same as map_df.

map_dfc binds the resulting 3 data frames of two rows each by column, and automatically repairs the column names, adding a suffix to each duplicate.
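The behaviour described above can be sketched with mtcars split by cyl, assumed here as a stand-in for the Phylacine data. Note that in recent purrr releases (1.0.0 and later) map_dfr and map_dfc are superseded by combining map with list_rbind and list_cbind, so this sketch reflects the older interface used in this section.

```r
library(dplyr)
library(purrr)

# Assumption: mtcars stands in for the Phylacine data
cars <- as_tibble(mtcars)
cars_by_cyl <- split(cars, cars$cyl)   # a list of 3 tibbles

# Extra arguments after .f are passed on to the mapped function (here n = 2)
heads <- map(cars_by_cyl, head, n = 2)

# Bind the 3 two-row results by row: 6 rows, 11 columns
by_row <- map_dfr(cars_by_cyl, head, n = 2)

# Bind the 3 two-row results by column: 2 rows, 33 columns,
# with duplicated column names repaired automatically
by_col <- map_dfc(cars_by_cyl, head, n = 2)
```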

4.2.5 Selective mapping using map variants

map_at and map_if work like other *_at and *_if functions. Here, map_if is used to run a linear model only on those tibbles which have sufficient data. The predicate is specified by .p.

In this example, the nested tibble is given a new column using dplyr::mutate, where the data to be added is a mixed list.

Some elements of the column model are tibbles, which have not been operated on because they have fewer than 100 rows (species). The remaining elements are lm objects.
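A minimal sketch of selective mapping, assuming mtcars in place of the Phylacine data. Because mtcars is small, the threshold is set to 10 rows rather than 100, and the model mpg ~ wt is chosen only for illustration.

```r
library(dplyr)
library(tidyr)
library(purrr)

# Assumption: mtcars stands in for the Phylacine data
nested_cars <- as_tibble(mtcars) %>%
  nest(data = -cyl) %>%
  mutate(
    model = map_if(
      .x = data,
      .p = function(df) nrow(df) > 10,           # predicate: enough data?
      .f = function(df) lm(mpg ~ wt, data = df)  # applied only where .p is TRUE
    )
  )

# Elements where the predicate failed are returned unchanged (still tibbles)
is_model <- map_lgl(nested_cars$model, inherits, what = "lm")
```

The model column is a mixed list: the 6-cylinder group has only 7 cars, so its element remains a tibble, while the other two elements are lm objects.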

4.3 More map variants

map also has variants along the axis of how many elements are operated upon. map2 operates on two vectors or list-like elements, and returns a single list as output, while pmap operates on a list of list-like elements. The output has as many elements as the input lists, which must be of the same length.
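As a small sketch of pmap, the list below holds three equal-length vectors whose names (a, b, c) are arbitrary choices for this illustration. They are matched to the arguments of the mapped function.

```r
library(purrr)

# Three inputs of the same length, collected in a single list
inputs <- list(a = 1:3, b = 4:6, c = 7:9)

# The function is called once per position: (1, 4, 7), (2, 5, 8), (3, 6, 9)
sums <- pmap_int(inputs, function(a, b, c) a + b + c)
sums
```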

4.3.1 Mapping over two inputs with map2

map2 has the same variants as map, allowing for different return types. Here map2_int returns an integer vector.

map2 doesn’t have _at and _if variants.

One use case for map2 is to deal with both a list element and its index, as shown in the example. This may be necessary when the list index is removed in a split or nest. This can also be done with imap, where the index is referred to as .y.
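A minimal sketch of map2 and imap, assuming mtcars split by cyl as the example list. The message format is made up for this illustration.

```r
library(purrr)

# map2_int: element-wise sum of two integer vectors, returned as integers
pair_sums <- map2_int(1:3, 4:6, function(x, y) x + y)

# Assumption: mtcars stands in for the Phylacine data
splits <- split(mtcars, mtcars$cyl)

# map2 over each list element and its name
labels_map2 <- map2_chr(
  splits, names(splits),
  function(df, nm) paste0(nm, " cylinders: ", nrow(df), " cars")
)

# imap does the same; the name (or index) is available as .y
labels_imap <- imap_chr(splits, ~ paste0(.y, " cylinders: ", nrow(.x), " cars"))
```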

4.4 Combining map variants and tidyverse functions

The example below shows a relatively complex data manipulation pipeline. Such pipelines must either be thought through carefully in advance, or checked for required output on small subsets of data, so as not to consume excessive system resources.

In the pipeline:

  1. The tibble becomes a nested dataframe by order (using tidyr::nest),
  2. If there are enough data points (> 100), a linear model of mass ~ plant diet is fit (using purrr::map_if, and stats::lm),
  3. The model coefficients are extracted if the model was fit (using purrr::map & dplyr::case_when),
  4. The model coefficients are converted to data for plotting (using purrr::map, tibble::tibble, & tidyr::pivot_wider),
  5. The raw data is plotted along with the model fit, taking the title from the nested data frame (using purrr::pmap & ggplot2::ggplot).
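The five steps above can be sketched as follows. This is not the original pipeline: it assumes mtcars in place of Phylacine, a threshold of 10 rows, the model mpg ~ wt, and a plain if in place of dplyr::case_when when extracting coefficients.

```r
library(dplyr)
library(tidyr)
library(purrr)
library(tibble)
library(ggplot2)

cars_models <- as_tibble(mtcars) %>%
  # 1. nest by group
  nest(data = -cyl) %>%
  mutate(
    # 2. fit a model only where there are enough rows
    model = map_if(data, function(df) nrow(df) > 10,
                   function(df) lm(mpg ~ wt, data = df)),
    # 3. extract coefficients where a model was fit, otherwise NULL
    coefs = map(model, function(m) if (inherits(m, "lm")) coef(m) else NULL),
    # 4. convert coefficients to a one-row tibble for plotting
    coef_data = map(coefs, function(co) {
      if (is.null(co)) return(NULL)
      tibble(term = c("intercept", "slope"), value = unname(co)) %>%
        pivot_wider(names_from = term, values_from = value)
    }),
    # 5. plot raw data, adding the fitted line where available
    plot = pmap(list(data, coef_data, cyl), function(df, co, title) {
      p <- ggplot(df, aes(x = wt, y = mpg)) +
        geom_point() +
        labs(title = paste(title, "cylinders"))
      if (!is.null(co)) {
        p <- p + geom_abline(intercept = co$intercept, slope = co$slope)
      }
      p
    })
  )
```

Running each mutate step on its own first, on a small subset, makes it easier to check the intermediate list columns.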

4.5 A return to map variants

Lists are often nested, that is, a list element may itself be a list. It is possible to map a function over elements at a specific depth using map_depth.

In the example, phylacine is split by order, and then by IUCN status, creating a two-level list, with the second layer operated on.
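A minimal sketch of mapping at depth, assuming mtcars split first by cyl and then by am in place of the order and IUCN status split described above.

```r
library(purrr)

# Assumption: mtcars stands in for the Phylacine data.
# First level: cyl; second level: am (0 = automatic, 1 = manual)
two_level <- map(split(mtcars, mtcars$cyl), function(df) split(df, df$am))

# Apply nrow to the data frames sitting at the second level
n_by_group <- map_depth(two_level, .depth = 2, .f = nrow)

# Number of manual 4-cylinder cars
n_by_group[["4"]][["1"]]
```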

4.5.1 Iteration without a return

map and its variants have a return type, which is either a list or a vector. However, it is often necessary to iterate a function over a list-like object for that function’s side effects, such as printing a message to screen, plotting a series of figures, or saving to file.

walk is the function for this task. It has only the variants walk2, iwalk, and pwalk, whose logic is similar to map2, imap, and pmap. In the example, the function applied to each list element is intended to print a message.
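A minimal sketch of walk and iwalk, assuming mtcars split by cyl as the list. The message text is made up for this illustration.

```r
library(purrr)

# Assumption: mtcars stands in for the Phylacine data
splits <- split(mtcars, mtcars$cyl)

# walk calls the function for its side effect (a message)
# and invisibly returns its input unchanged
returned <- walk(names(splits), function(nm) {
  message("Processing cars with ", nm, " cylinders")
})

# iwalk gives access to both the element (.x) and its name (.y)
iwalk(splits, ~ message(.y, ": ", nrow(.x), " rows"))
```

Because the input is returned invisibly, walk can sit in the middle of a pipeline without interrupting it.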

4.5.2 Modify rather than map

When the return type is expected to be the same as the input type, that is, a list returning a list, or a character vector returning the same, modify can help with keeping strictly to those expectations.

In the example, simply adding 2 to each vector element produces an error, because the output is numeric (double) rather than integer. modify helps ensure some type safety in this way.

Converting the output to an integer, which was the original input type, serves as a solution.
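A minimal sketch of the type-preserving form. How modify treats a mismatched return type may differ between purrr versions, so only the explicitly converted call is shown here.

```r
library(purrr)

ints <- 1:3

# map always returns a list, whatever the input type
mapped <- map(ints, function(x) x + 2)

# modify keeps the input type; the result is converted back to integer explicitly
modified <- modify(ints, function(x) as.integer(x + 2))

# A character vector in, a character vector out
shouted <- modify(c("a", "b"), toupper)
```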

A note on invoke

invoke used to be a wrapper around do.call, and can still be found with its family of functions in purrr. It is however retired in favour of functionality already present in map and rlang::exec, the latter of which will be covered in another session.

4.6 Other functions for working with lists

purrr has a number of functions to work with lists, especially lists that are not nested list-columns in a tibble.

4.6.1 Filtering lists

Lists can be filtered on any predicate using keep, while the special case compact is applied when the empty elements of a list are to be filtered out. discard is the opposite of keep, and keeps only elements not satisfying a condition. Again, the predicate is specified by .p.

head_while is a bit of an odd case: it returns all elements of a list-like object in sequence until the first one fails to satisfy a predicate, specified by .p.
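The filtering functions can be sketched on a small made-up list, using vector length as the predicate.

```r
library(purrr)

# A made-up list with some empty elements
x <- list(1:3, NULL, 4:10, integer(0), 11:12)

# compact drops NULL and zero-length elements
non_empty <- compact(x)

# keep retains elements satisfying the predicate .p
long_ones <- keep(x, .p = function(el) length(el) > 2)

# discard retains the elements that do NOT satisfy .p
short_ones <- discard(x, .p = function(el) length(el) > 2)

# head_while stops at the first element that fails .p (here, the first even number)
leading_odds <- head_while(c(1, 3, 5, 6, 7), .p = function(el) el %% 2 == 1)
```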

4.6.2 Summarising lists

The purrr functions every, some, has_element, detect, detect_index, and vec_depth help determine whether a list passes a certain logical test or not. These are seldom used and are not discussed here.

4.6.3 Reduction and accumulation

reduce helps combine elements along a list using a specific function. Consider the example below where list elements are concatenated into a single vector.
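A minimal sketch of this kind of reduction, using a made-up list of integer vectors and c as the combining function.

```r
library(purrr)

pieces <- list(1:2, 3:4, 5:6)

# Equivalent to c(c(1:2, 3:4), 5:6)
combined <- reduce(pieces, c)
combined
```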

This can also be applied to data frames. Consider some random samples of mtcars, each with only 5 cars removed. The objective is to find the cars present in all 10 samples.

The way reduce works in the example below is to take the first element and find its intersection with the second, and to take the result and find its intersection with the third and so on.
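A minimal sketch of this intersection. For simplicity it works on the car names (row names of mtcars) rather than on whole data frames, which is an assumption of this illustration, and the seed is arbitrary.

```r
library(purrr)

set.seed(1)  # arbitrary seed, only for reproducibility

# 10 samples of car names, each with 5 randomly chosen cars removed
samples <- map(1:10, function(i) {
  rownames(mtcars)[-sample(nrow(mtcars), 5)]
})

# intersect(intersect(samples[[1]], samples[[2]]), samples[[3]]) and so on
common_cars <- reduce(samples, intersect)
```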

accumulate works very similarly, except it retains the intermediate products. The first element is retained as is. accumulate2 and reduce2 work on two lists, following the same logic as map2 etc. Both functions can be used in much more complex ways than demonstrated here.
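A minimal sketch contrasting the two on a running sum:

```r
library(purrr)

# reduce returns only the final result
total <- reduce(1:4, `+`)

# accumulate keeps every intermediate result; the first element is unchanged
running <- accumulate(1:4, `+`)
running
```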

4.6.4 Miscellaneous operations

purrr offers a few more functions to work with lists (or list-like objects). prepend works very similarly to append, except it adds to the head of a list. splice adds multiple objects together in a list. splice will break the existing list structure of input lists.

flatten has a similar behaviour, and converts a list of vectors or list of lists to a single list-like object. flatten_* options allow the output type to be specified.

transpose shifts the index order in multi-level lists. This is seen in the example, where the iucn_status goes from being the index of the second level to the index of the first.
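A minimal sketch of flatten and transpose on small made-up lists. In recent purrr releases these functions are superseded by list_flatten and list_transpose, so this reflects the older interface described in this section.

```r
library(purrr)

# flatten removes one level of list structure
flat <- flatten(list(list(1, 2), list(3)))

# flatten_int additionally fixes the output type to an integer vector
flat_int <- flatten_int(list(1:2, 3:4))

# transpose swaps the first and second level of indexing
nested <- list(
  a = list(x = 1, y = 2),
  b = list(x = 3, y = 4)
)
swapped <- transpose(nested)
swapped$x$b
```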

4.7 Lists of ggplots with patchwork

The patchwork library helps compose ggplots, which will be properly introduced in the next session. patchwork usually works on lists of ggplots, which can come from a standalone list, or from a list column in a nested dataframe. The example below shows the latter, with the data data frame from earlier.
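A minimal sketch of composing a list of ggplots with patchwork::wrap_plots. It assumes a standalone list built from mtcars rather than the nested data frame from earlier; a list column of plots could be passed to wrap_plots in the same way.

```r
library(purrr)
library(ggplot2)
library(patchwork)

# Assumption: mtcars stands in for the Phylacine data
splits <- split(mtcars, mtcars$cyl)

# One scatterplot per list element, titled with the element's name
plots <- imap(splits, function(df, nm) {
  ggplot(df, aes(x = wt, y = mpg)) +
    geom_point() +
    labs(title = paste(nm, "cylinders"))
})

# Arrange the list of plots side by side
combined <- wrap_plots(plots, ncol = 3)
combined
```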