Section 2 Reshaping data tables in the tidyverse, and other things

Raphael Scherrer

In this chapter we will learn what tidy means in the context of the tidyverse, and how to reshape our data into a tidy format using the tidyr package. But first, let us take a detour and introduce the tibble.

2.1 The new data frame: tibble

The tibble is the recommended class to use to store tabular data in the tidyverse. Consider it as the operational unit of any data science pipeline. For most practical purposes, a tibble is basically a data.frame.

The differences between tibble and data.frame lie, among other things, in how they are displayed and how they are subsetted. Most functions working with data.frame will work with tibble and vice versa. Use the as* family of functions to switch back and forth between the two if needed, using e.g. as.data.frame or as_tibble.

In terms of display, the tibble has the advantage of showing the class of each column: chr for character, fct for factor, int for integer, dbl for numeric and lgl for logical, just to name the main atomic classes. This may be more important than you think, because many hard-to-find bugs in R are due to wrong variable types and/or cryptic type conversions. This especially happens with factor and character, which can cause quite some confusion. More about this in the extra section at the end of this chapter!
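
For example, a minimal tibble (the values here are made up for illustration) shows the class of each column under its name:

library(tibble)

tibble(
  island = c("Isabela", "Santa Cruz"), # chr
  n_birds = c(50L, 50L),               # int
  sampled = c(TRUE, TRUE)              # lgl
)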

Note that you can build a tibble by rows rather than by columns with tribble:
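
For example, a sketch building the same small table as above, but row by row:

tribble(
  ~island,      ~n_birds, ~sampled,
  "Isabela",    50L,      TRUE,
  "Santa Cruz", 50L,      TRUE
)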

As a rule of thumb, try to convert your tables to tibbles whenever you can, especially when the original table is not a data frame. For example, the principal component analysis function prcomp outputs a matrix of coordinates in principal component space.

# Perform a PCA on mtcars
pca_scores <- prcomp(mtcars)$x
head(pca_scores) # looks like a data frame or a tibble...
#>                       PC1   PC2   PC3    PC4    PC5     PC6      PC7     PC8
#> Mazda RX4          -79.60  2.13 -2.15 -2.707 -0.702 -0.3149 -0.09870 -0.0779
#> Mazda RX4 Wag      -79.60  2.15 -2.22 -2.178 -0.884 -0.4534 -0.00355 -0.0957
#> Datsun 710        -133.89 -5.06 -2.14  0.346  1.106  1.1730  0.00576  0.1362
#> Hornet 4 Drive       8.52 44.99  1.23  0.827  0.424 -0.0579 -0.02431  0.2212
#> Hornet Sportabout  128.69 30.82  3.34 -0.521  0.737 -0.3329  0.10630 -0.0530
#> Valiant            -23.22 35.11 -3.26  1.401  0.803 -0.0884  0.23895  0.4239
#>                      PC9    PC10   PC11
#> Mazda RX4         -0.200 -0.2901  0.106
#> Mazda RX4 Wag     -0.353 -0.1928  0.107
#> Datsun 710        -0.198  0.0763  0.267
#> Hornet 4 Drive     0.356 -0.0906  0.209
#> Hornet Sportabout  0.153 -0.1886 -0.109
#> Valiant            0.101 -0.0377  0.276
class(pca_scores) # but is actually a matrix
#> [1] "matrix"

# Convert to tibble
as_tibble(pca_scores)
#> # A tibble: 32 x 11
#>       PC1   PC2   PC3    PC4    PC5     PC6      PC7     PC8    PC9    PC10
#>     <dbl> <dbl> <dbl>  <dbl>  <dbl>   <dbl>    <dbl>   <dbl>  <dbl>   <dbl>
#> 1  -79.6   2.13 -2.15 -2.71  -0.702 -0.315  -0.0987  -0.0779 -0.200 -0.290 
#> 2  -79.6   2.15 -2.22 -2.18  -0.884 -0.453  -0.00355 -0.0957 -0.353 -0.193 
#> 3 -134.   -5.06 -2.14  0.346  1.11   1.17    0.00576  0.136  -0.198  0.0763
#> 4    8.52 45.0   1.23  0.827  0.424 -0.0579 -0.0243   0.221   0.356 -0.0906
#> 5  129.   30.8   3.34 -0.521  0.737 -0.333   0.106   -0.0530  0.153 -0.189 
#> 6  -23.2  35.1  -3.26  1.40   0.803 -0.0884  0.239    0.424   0.101 -0.0377
#> # … with 26 more rows, and 1 more variable: PC11 <dbl>

This is important because a matrix can contain only one type of values (e.g. only numeric or character), while tibble (and data.frame) allow you to have columns of different types.

So, in the tidyverse we are going to work with tibbles, got it. But what does “tidy” mean exactly?

2.2 The concept of tidy data

When it comes to putting data into tables, there are many ways one could organize a dataset. The tidy format is one such format. According to the formal definition, a table is tidy if each column is a variable and each row is an observation. In practice, however, I found that this is not a very operational definition, especially in ecology and evolution where we often record multiple variables per individual. So, let’s dig in with an example.

Say we have a dataset of several morphometrics measured on Darwin’s finches in the Galapagos islands. Let’s first get this dataset.
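
The code generating the dataset is not shown here; a minimal simulation with the same structure might look like this (the object name finches, the id column and all values are assumptions):

library(tidyverse)

# Simulate 100 birds, 50 per island, with made-up measurements
set.seed(42)
finches <- tibble(
  id = 1:100,
  island = rep(c("Isabela", "Santa Cruz"), each = 50),
  beak_length = rnorm(100, mean = 12, sd = 1),
  beak_width = rnorm(100, mean = 8, sd = 1),
  body_weight = rnorm(100, mean = 18, sd = 2)
)
finches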

Here, we pretend to have measured beak_length, beak_width and body_weight on 100 birds, 50 of them from Isabela and 50 of them from Santa Cruz. In this tibble, each row is an individual bird. This is probably the way most scientists would record their data in the field. However, a single bird is not an “observation” in the sense used in the tidyverse. Our dataset is not tidy but messy.

The tidy equivalent of this dataset has each measurement (and not each individual bird) as the unit of observation, i.e. one measurement per row. The pivot_longer function is the easiest way to get to this format. It belongs to the tidyr package, which we'll cover in a minute.
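
A sketch of that reshaping, using the simulated finches tibble from above (the column names variable and value are assumptions):

finches_long <- finches %>%
  pivot_longer(
    cols = c(beak_length, beak_width, body_weight), # the columns to stack
    names_to = "variable", # new column holding the former column names
    values_to = "value"    # new column holding the measurements
  )
finches_long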

As you can see, our tibble now has three times as many rows and fewer columns. This format is rather unintuitive and not optimal for display. However, it provides a very standardized and consistent way of organizing data that will be understood (and expected) by pretty much all functions in the tidyverse. This makes the tidyverse tools work well together and reduces the time you would otherwise spend reformatting your data from one tool to the next.

That does not mean that the messy format is useless though. There may be use-cases where you need to switch back and forth between formats. For this reason I prefer referring to these formats using their other names: long (tidy) versus wide (messy). For example, matrix operations work much faster on wide data, and the wide format arguably looks nicer for display. Luckily the tidyr package gives us the tools to reshape our data as needed, as we shall see shortly.

Another common example of the wide-or-long dilemma is when dealing with contingency tables. This would be the case, for example, if we asked how many observations we have for each morphometric and each island. We use table (from base R) to get the answer:
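
A sketch using the long-format tibble assumed above:

finch_counts <- with(finches_long, table(variable, island))
finch_counts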

A variety of statistical tests can be used on contingency tables such as Fisher’s exact test, the chi-square test or the binomial test. Contingency tables are in the wide format by construction, but they too can be pivoted to the long format, and the tidyverse manipulation tools will expect you to do so. Actually, tibble knows that very well and does it by default if you convert your table into a tibble:
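
For example, with the contingency table built above:

as_tibble(finch_counts) # one row per cell, with a count column n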

Summary: Tidy or not tidy

To sum up, the definition of what is tidy and what is not is somewhat subjective. Tables can be in long or wide format, and depending on the complexity of a dataset, there may even be some intermediate states. To be clear, the tidyverse does not only accept long tables, and wide tables may sometimes be the way to go. This is very use-case specific. Have a clear idea of what you want to do with your data (what tidyverse tools you will use), and use that to figure out which format makes more sense. And remember, tidyr is here to easily do the switching for you.

2.3 Reshaping with tidyr

The tidyr package implements tools to easily switch between layouts and also perform a few other reshaping operations. Old school R users will be familiar with the reshape and reshape2 packages, of which tidyr is the tidyverse equivalent. Beware that tidyr is about playing with the general layout of the dataset, while operations and transformations of the data are within the scope of the dplyr and purrr packages. All these packages work hand-in-hand really well, and analysis pipelines usually involve all of them. But today, we focus on the first member of this holy trinity, which is often the first one you’ll need because you will want to reshape your data before doing other things. So, please hold your non-layout-related questions for the next chapters.

2.3.1 Pivoting

Pivoting a dataset between the long and wide layout is the main purpose of tidyr (check out the package’s logo). We already saw the pivot_longer function above. This function converts a table from wide to long format. Similarly, there is a pivot_wider function that does exactly the opposite and takes you back to the wide format:
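
A sketch going back to the wide layout, assuming the finches_long tibble from earlier:

finches_long %>%
  pivot_wider(
    names_from = variable, # column names come from the variable column
    values_from = value    # cell contents come from the value column
  )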

The order of the columns is not exactly as it was, but this should not matter in a data analysis pipeline where you should access columns by their names. It is straightforward to change the order of the columns, but this is more within the scope of the dplyr package.

If you are familiar with earlier versions of the tidyverse, pivot_longer and pivot_wider are the respective equivalents of gather and spread, which are now superseded.

There are a few other reshaping operations from tidyr that are worth knowing.

2.3.2 Handling missing values

Say we have some missing measurements in the column “value” of our finch dataset:
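
Since the original data are not shown, here is a sketch that knocks out a few values at random to create some NAs:

set.seed(24)
finches_na <- finches_long
finches_na$value[sample(nrow(finches_na), 10)] <- NA
finches_na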

We could get rid of the rows that have missing values using drop_na:
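
For example, continuing with the finches_na sketch from above:

finches_na %>% drop_na()      # drop rows with an NA in any column
finches_na %>% drop_na(value) # or only check the value column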

Alternatively, we could replace the NAs with some user-defined value:
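
A sketch with replace_na (replacing by zero is an arbitrary choice):

finches_na %>% replace_na(replace = list(value = 0))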

where the replace argument takes a named list, and the names should refer to the columns to apply the replacement to.

We could also replace NAs with the most recent non-NA values:
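
For example, with fill:

finches_na %>% fill(value) # fill downwards with the last non-NA value (the default direction)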

Note that most functions in the tidyverse take a tibble as their first argument, and the columns to which to apply the functions are usually passed as “objects” rather than character strings. In the above example, we passed the value column as value, not "value". These column-objects are evaluated by the tidyverse functions in the context of the data (the tibble) they belong to.

2.3.3 Splitting and combining cells

The tidyr package offers tools to split and combine columns. This is a nice extension to the string manipulations we saw last week in the stringr tutorial.

Say we want to add the specific dates when we took measurements on our birds (we would normally do this using dplyr but for now we will stick to the old way):
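
The dates below are made up; in this sketch, Santa Cruz is (deliberately) only visited in 2019, which will matter later in the section on expanding tables:

finches_long$day <- sample(1:28, nrow(finches_long), replace = TRUE)
finches_long$month <- sample(1:12, nrow(finches_long), replace = TRUE)
finches_long$year <- ifelse(finches_long$island == "Isabela" & finches_long$id > 25, 2020, 2019)
finches_long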

We could combine the day, month and year columns into a single date column, with a dash as a separator, using unite:
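
For example, continuing with the sketch above:

finches_dates <- finches_long %>%
  unite(col = "date", day, month, year, sep = "-")
finches_dates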

Of course, we can revert to the previous dataset by splitting the date column with separate.
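
A sketch of that splitting:

finches_dates %>%
  separate(date, into = c("day", "month", "year"), sep = "-")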

But note that the day, month and year columns are now of class character and not integer anymore. This is because they result from the splitting of date, which itself was a character column.

You can also separate a single column into multiple rows using separate_rows:
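
A sketch on a small, hypothetical table where one cell holds several comma-separated values:

ring_records <- tibble(
  island = c("Isabela", "Santa Cruz"),
  ringed_birds = c("1,2,3", "51,52")
)
ring_records %>% separate_rows(ringed_birds, sep = ",")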

2.3.4 Expanding tables using combinations

Instead of getting rid of rows with NAs, we may want to add rows with NAs, for example, for combinations of parameters that we did not measure.

We could generate a tibble with all combinations of island, morphometric and year using expand_grid:
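
A sketch (the values listed are assumptions based on the simulated data above):

expand_grid(
  island = c("Isabela", "Santa Cruz"),
  variable = c("beak_length", "beak_width", "body_weight"),
  year = c(2019, 2020)
)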

If we already have a tibble to work from that contains the variables to combine, we can use expand on that tibble:
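
For example, on the finches_long tibble assumed above:

finches_long %>% expand(island, variable, year)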

As you can see, we get all the combinations of the variables of interest, even those that are missing. But sometimes you might be interested in variables that are nested within each other and not crossed. For example, say we have measured birds at different locations within each island:
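
A sketch adding a made-up site column, with sites A and B on Isabela and sites C and D on Santa Cruz:

finches_long$site <- ifelse(
  finches_long$id %% 2 == 0,
  ifelse(finches_long$island == "Isabela", "A", "C"),
  ifelse(finches_long$island == "Isabela", "B", "D")
)
finches_long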

Of course, if sites A and B are on Isabela, they cannot be on Santa Cruz, where we have sites C and D instead. It would not make sense to expand assuming that island and site are crossed, instead, they are nested. We can therefore expand using the nesting function:
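
For example:

finches_long %>% expand(nesting(island, site, year))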

But now the missing data for Santa Cruz in 2020 are not accounted for because expand thinks the year is also nested within island. To get back the missing combination, we use crossing, the complement of nesting:
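
One way to write this is to let expand do the crossing itself, since its arguments are crossed with one another; the crossing helper mentioned here is the explicit equivalent of this behaviour:

finches_long %>% expand(nesting(island, site), year)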

Here, we specify that site is nested within island and these two are crossed with year. Easy!

But wait a minute. These combinations are all very good, but our measurements have disappeared! We can get them back by levelling up to the complete function instead of using expand:
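
For example:

finches_long %>% complete(nesting(island, site), year)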

which nicely keeps the rest of the columns in the tibble and just adds the missing combinations.

2.3.5 Nesting

The tidyr package has yet another feature that makes the tidyverse very powerful: the nest function. However, it makes little sense without combining it with the functions in the purrr package, so we will not cover it in this chapter but rather in the purrr chapter.

2.3.6 What else can be tidied up?

2.3.6.2 Graphs with tidygraph

For some datasets there is no trivial, intuitive way to store them in a table. This is the case, for example, for data underlying graphs (as in networks), which contain information about relations between entities. What is the unit of observation in a network? A node? An edge between two nodes? Nodes and edges in a network may each have node- or edge-specific variables mapped to them, and both may be equally valid units of observation. The tidygraph package has tools to store graph data in a tidyverse-friendly object, consisting of two tibbles: one for node-specific information, the other for edge-specific information. This package goes hand in hand with ggraph, which makes plotting networks compatible with the grammar of graphics.

2.3.6.3 Trees with tidytree

Phylogenetic trees are a special type of graph that suffers from the same issue, i.e. being non-trivial to store in a table. The tidytree package and its companion treeio offer an interface to convert tree-like objects (from most formats used by other packages and software) into a tidyverse-friendly format. Again, the point is that the rest of the tidyverse can be used to wrangle or plot this type of data in the same way as one would do with regular tabular data. For plotting a tidytree with the grammar of graphics, see ggtree.

2.4 Extra: factors and the forcats package

Categorical variables can be stored in R as character strings in character or factor objects. A factor looks like a character, but it is actually an integer vector in which each integer is mapped to a character label. In this respect it is sort of an enhanced version of character. A character vector is recognizable by its double quotes, while a factor is printed with its labels and its levels. The levels of a factor are the unique values that appear in the vector, so if I added an extra occurrence of my name, we would still have the same levels. Note that the levels are returned as a character vector, in alphabetical order, by the levels function:
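
A minimal sketch of the difference (the names used here are placeholders, not the chapter's original example):

my_names <- c("Raphael", "Ana", "Bert", "Raphael")
my_names              # a character vector, printed with double quotes
my_names_fct <- factor(my_names)
my_names_fct          # a factor, printed with its labels and its levels
levels(my_names_fct)  # the unique values, in alphabetical order
#> [1] "Ana"     "Bert"    "Raphael"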

Why does it matter? Well, most operations on categorical variables can be performed on character or factor objects, so it does not matter so much which one you use for your own data. However, some functions in R require you to provide categorical variables in one specific format, and others may even implicitly convert your variables. In ggplot2 for example, character vectors are converted into factors by default. So, it is always good to remember the differences and what type your variables are.

But this is a tidyverse tutorial, so I would like to introduce here the package forcats, which offers tools to manipulate factors. First of all, most tools from stringr will work on factors. The forcats functions expand the string manipulation toolbox with factor-specific utilities. Similar in philosophy to stringr, where function names start with str_, most functions in forcats start with fct_.

I see two main ways forcats can come in handy with the kind of data most people deal with: playing with the order of the levels of a factor and playing with the levels themselves. We will show here a few examples, but the full breadth of factor manipulations can be found online or in the excellent forcats cheatsheet.

2.4.1 Change the order of the levels

One example use-case where you would want to change the order of the levels of a factor is when plotting. Your categorical variable, for example, may not be plotted in the order you want. If we plot the distribution of each variable across islands, we get
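
The figure itself is not reproduced here; a sketch of the kind of ggplot2 code that would produce it, using the long finches_long tibble assumed above:

ggplot(finches_long, aes(x = island, y = value)) +
  geom_boxplot() +
  facet_wrap(~ variable, scales = "free_y")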

Here, the islands (horizontal axis) and the variables (the facets) are displayed in alphabetical order. When making a figure you may want to customize these orders in such a way that your message is optimally conveyed by your figure, and this may involve playing with the order of levels.

Use fct_relevel to manually change the order of the levels:
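
A sketch, putting Santa Cruz first:

finches_long$island <- fct_relevel(finches_long$island, "Santa Cruz", "Isabela")
levels(finches_long$island)
#> [1] "Santa Cruz" "Isabela"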

Beware that reordering a factor does not change the order of the items within the vector, only the order of the levels. So, it does not introduce any mismatch between the island column and the other columns! It only matters when the levels are called, for example, in a ggplot. As you can see:
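
For example, continuing with the sketch above:

head(finches_long$island) # the values themselves are still in their original order
#> [1] Isabela Isabela Isabela Isabela Isabela Isabela
#> Levels: Santa Cruz Isabela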

Alternatively, use fct_inorder to set the order of the levels to the order in which they appear:
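
For example:

levels(fct_inorder(finches_long$island)) # Isabela appears first in the data
#> [1] "Isabela" "Santa Cruz"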

or fct_rev to reverse the order of the levels:
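
For example:

levels(fct_rev(finches_long$island)) # the reverse of the current level order
#> [1] "Isabela" "Santa Cruz"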

Other variants exist to do more complex reordering, all present in the forcats cheatsheet, for example:

* fct_infreq to re-order according to the frequency of each level (how many observations on each island?)
* fct_shift to shift the order of all levels by a certain rank (in a circular way so that the last one becomes the first one or vice versa)
* fct_shuffle if you want your levels in random order
* fct_reorder, which reorders based on an associated variable (see fct_reorder2 for even more complex relationships between the factor and the associated variable)

2.4.2 Change the levels themselves

Changing the levels of a factor will change the labels in the actual vector. It is similar to performing a string substitution in stringr. One can change the levels of a factor using fct_recode:
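
A sketch (the new labels are made up):

fct_recode(finches_long$island,
           "Isabela Island" = "Isabela",
           "Santa Cruz Island" = "Santa Cruz")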

or collapse factor levels together using fct_collapse:
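
For example, collapsing the two beak variables into one level (the new level names are made up):

fct_collapse(finches_long$variable,
             beak = c("beak_length", "beak_width"),
             body = "body_weight")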

Again, we do not provide an exhaustive list of forcats functions here but only the most common ones, to give a glimpse of the many things one can do with factors. So, if you are dealing with factors, remember that forcats may have handy tools for you. Among others:

* fct_anon to “anonymize”, i.e. replace the levels by random integers
* fct_lump to collapse levels together based on their frequency (e.g. keep the two most frequent levels and lump the rest together)

2.4.3 Dropping levels

If you use factors in your tibble and get rid of one level, for any reason, the factor will usually remember the old levels, which may cause some problems when applying functions to your data.

Use droplevels (from base R) to make sure you get rid of levels that are not in your data anymore:
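
A sketch using base R subsetting (dplyr comes in a later chapter): keeping only the Isabela rows leaves the Santa Cruz level behind until we drop it:

isabela_only <- finches_long[finches_long$island == "Isabela", ]
levels(isabela_only$island)             # Santa Cruz is still a level
#> [1] "Santa Cruz" "Isabela"
levels(droplevels(isabela_only)$island) # unused levels are dropped
#> [1] "Isabela"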

Fortunately, most functions within the tidyverse will not complain about missing levels, and will automatically get rid of those nonexistent levels for you. But because factors are such common causes of bugs, keep this in mind!

Note that this is equivalent to doing:
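
The original code is not shown here; presumably it is the forcats equivalent, fct_drop, applied to the factor column (an assumption):

levels(fct_drop(isabela_only$island))
#> [1] "Isabela"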

2.4.4 Other things

Among other things you can use in forcats:

* fct_count to get the frequency of each level
* fct_c to combine factors together

2.4.5 Take home message for forcats

Use this package to manipulate your factors. Do you need factors? Or are character vectors enough? That is your call, and may depend on the kind of analyses you want to do and what they require. We saw here that for plotting, having factors allows you to do quite some tweaking of the display. If you encounter a situation where the order of encoding of your character vector starts to matter, then maybe converting it into a factor will make your life easier. And if you do so, remember that lots of tools to perform all kinds of manipulations are available to you in both stringr and forcats.

2.5 External resources

Find lots of additional info by looking up the following links: