Section 3 Data manipulation with dplyr

3.1 Introduction

3.1.1 Foreword on dplyr

dplyr is tasked with performing all sorts of transformations on a dataset.

The structure of dplyr revolves around a set of functions, the so-called verbs, that share a common syntax and logic, and are meant to work with one another in chained operations. Chained operations are performed with the pipe operator (%>%), that will be introduced in section 3.2.2.

The basic syntax is verb(data, variable), where data is a data frame and variable is the name of one or more columns containing a set of values for each observation.

There are 5 main verbs, which names already hint at what they do: rename(), select(), filter(), mutate(), and summarise(). I’m going to introduce each of them (and a couple more) through the following sections.

3.1.2 Example data

Through this tutorial, we will be using mammal trait data from the Phylacine database. Let’s have a peek at what it contains.

readr automatically loads the data in a tibble, as we have seen in chapter 1 and 2. Calling the tibble gives a nice preview of what it contains. We have data for 5,831 mammal species, and the variables contain information on taxonomy, (broad) habitat, mass, IUCN status, and diet.

If you remember Section 1.2 on tidy data, you may see that this data isn’t exactly tidy. In fact, some columns are in wide (and messy) format, like the “habitat” (terrestrial, marine, etc.) and diet columns.

dplyr actually does not require your data to be strictly tidy. If you feel that your data satisfies the definition “one observation per row, one variable per column”, that’s probably good enough.

I use a tibble here, but dplyr works equally well on base data frames. In fact, dplyr is built for data.frame objects, and tibbles are data frames. Therefore, tibbles are mortal.

3.2 Working with existing variables

3.2.1 Renaming variables with rename()

The variable names in the phylacine dataset are descriptive, but quite unpractical. Typing Binomial.1.2. is cumbersome and subject to typos (in fact, I just made one). binomial would be much simpler to use.

Changing names is straightforward with rename().

The first argument is always .data, the data table you want to apply change to. Note how columns are referred to. Once the data table as been passed as an argument, there is no need to refer to it directly anymore, dplyr understands that you’re dealing with variables inside that data frame. So drop that data$var, data[, "var"], and forget the very existence of attach() / detach().

You can refer to variables names either with strings or directly as objects, whether you’re reading or creating them:

I have applied similar changes to all variables in the dataset. Here is what the new names look like:

#> # A tibble: 5,831 x 24
#>   binomial order family genus species terrestrial marine freshwater aerial
#>   <chr>    <chr> <chr>  <chr> <chr>         <dbl>  <dbl>      <dbl>  <dbl>
#> 1 Abditom… Rode… Murid… Abdi… latide…           1      0          0      0
#> 2 Abeomel… Rode… Murid… Abeo… sevia             1      0          0      0
#> 3 Abraway… Rode… Crice… Abra… ruschii           1      0          0      0
#> 4 Abrocom… Rode… Abroc… Abro… bennet…           1      0          0      0
#> 5 Abrocom… Rode… Abroc… Abro… bolivi…           1      0          0      0
#> 6 Abrocom… Rode… Abroc… Abro… budini            1      0          0      0
#> # … with 5,825 more rows, and 15 more variables: life_habit_method <chr>,
#> #   life_habit_source <chr>, mass_g <dbl>, mass_method <chr>,
#> #   mass_source <chr>, mass_comparison <chr>, mass_comparison_source <chr>,
#> #   island_endemicity <chr>, iucn_status <chr>, added_iucn_status <chr>,
#> #   diet_plant <dbl>, diet_vertebrate <dbl>, diet_invertebrate <dbl>,
#> #   diet_method <chr>, diet_source <chr>

3.2.2 The pipe operator %>%

If you have already come across pieces of code using the tidyverse, chances are that you have seen this odd symbol. While the pipe is not strictly-speaking a part of the tidyverse (it comes from its own package, magrittr), it is imported along with each package and widely used in conjunction with its functions. What does it do? Consider the following example with rename():

phylacine2 <- readr::read_csv("data/phylacine_traits.csv")
# regular syntax
rename(phylacine2, "binomial" = "Binomial.1.2")
#> # A tibble: 5,831 x 24
#>   binomial Order.1.2 Family.1.2 Genus.1.2 Species.1.2 Terrestrial Marine
#>   <chr>    <chr>     <chr>      <chr>     <chr>             <dbl>  <dbl>
#> 1 Abditom… Rodentia  Muridae    Abditomys latidens              1      0
#> 2 Abeomel… Rodentia  Muridae    Abeomelo… sevia                 1      0
#> 3 Abraway… Rodentia  Cricetidae Abrawaya… ruschii               1      0
#> 4 Abrocom… Rodentia  Abrocomid… Abrocoma  bennettii             1      0
#> 5 Abrocom… Rodentia  Abrocomid… Abrocoma  boliviensis           1      0
#> 6 Abrocom… Rodentia  Abrocomid… Abrocoma  budini                1      0
#> # … with 5,825 more rows, and 17 more variables: Freshwater <dbl>,
#> #   Aerial <dbl>, Life.Habit.Method <chr>, Life.Habit.Source <chr>,
#> #   Mass.g <dbl>, Mass.Method <chr>, Mass.Source <chr>, Mass.Comparison <chr>,
#> #   Mass.Comparison.Source <chr>, Island.Endemicity <chr>,
#> #   IUCN.Status.1.2 <chr>, Added.IUCN.Status.1.2 <chr>, Diet.Plant <dbl>,
#> #   Diet.Vertebrate <dbl>, Diet.Invertebrate <dbl>, Diet.Method <chr>,
#> #   Diet.Source <chr>
# alternative syntax with the pipe operator
phylacine2 %>% rename("binomial" = "Binomial.1.2")
#> # A tibble: 5,831 x 24
#>   binomial Order.1.2 Family.1.2 Genus.1.2 Species.1.2 Terrestrial Marine
#>   <chr>    <chr>     <chr>      <chr>     <chr>             <dbl>  <dbl>
#> 1 Abditom… Rodentia  Muridae    Abditomys latidens              1      0
#> 2 Abeomel… Rodentia  Muridae    Abeomelo… sevia                 1      0
#> 3 Abraway… Rodentia  Cricetidae Abrawaya… ruschii               1      0
#> 4 Abrocom… Rodentia  Abrocomid… Abrocoma  bennettii             1      0
#> 5 Abrocom… Rodentia  Abrocomid… Abrocoma  boliviensis           1      0
#> 6 Abrocom… Rodentia  Abrocomid… Abrocoma  budini                1      0
#> # … with 5,825 more rows, and 17 more variables: Freshwater <dbl>,
#> #   Aerial <dbl>, Life.Habit.Method <chr>, Life.Habit.Source <chr>,
#> #   Mass.g <dbl>, Mass.Method <chr>, Mass.Source <chr>, Mass.Comparison <chr>,
#> #   Mass.Comparison.Source <chr>, Island.Endemicity <chr>,
#> #   IUCN.Status.1.2 <chr>, Added.IUCN.Status.1.2 <chr>, Diet.Plant <dbl>,
#> #   Diet.Vertebrate <dbl>, Diet.Invertebrate <dbl>, Diet.Method <chr>,
#> #   Diet.Source <chr>

Got it? The pipe takes the object on its left-side and silently feeds it to the first argument of the function on its right-side. It could be read as “take x, then do…”. The reason for using the pipe is because it makes code syntax closer to the syntax of a sentence, and therefore, easier and faster for your brain to process (and write!) the code. In particular, the pipe enables easy chains of operations, where you apply something to an object, then apply something else to the outcome, and so on… Through the later sections, you will see some examples of chained operations with dplyr functions, but for that I first need to introduce a couple more verbs.

Using the pipe can be quite unsettling at first, because you are not used to think in this way. But if you push a bit for it, I promise it will make things a lot easier (and it’s quite addictive!). To avoid typing the tedious symbols, magrittr installs a shortcut for you in RStudio. Use Ctrl + Shift + M on Windows, and Cmd + Shift + M on MacOS.

Finally I should emphasize that the use of the pipe isn’t limited to the tidyverse, but extends to almost all R functions. Remember that by default the piped value is always matched to the first argument of the following function

If you need to pass the left-hand side to an argument other than the first, you can use the dot place-holder ..

Because of its syntax, most base R operators are not compatible with the pipe (but this is very rarely needed). If needed, magrittr introduces alternative functions for operators.

Subsetting operators can be piped, with the dot place-holder.

Because subsetting in this way is particularly hideous, dplyr delivers a function to extract values from a single variable. In only works on tables, though.

3.2.3 Select variables with select()

To extract a set of variables (i.e. columns), use the conveniently-named select(). The basic syntax is the same as rename(): pass your data as the first argument, then call the variables to select, quoted or not.

You can select by variable numbers. This is not recommended, as prone to errors, especially if you change the variable order.

select() can also be used to exclude variables:

phylacine %>% select(-binomial)
#> # A tibble: 5,831 x 23
#>   order family genus species terrestrial marine freshwater aerial
#>   <chr> <chr>  <chr> <chr>         <dbl>  <dbl>      <dbl>  <dbl>
#> 1 Rode… Murid… Abdi… latide…           1      0          0      0
#> 2 Rode… Murid… Abeo… sevia             1      0          0      0
#> 3 Rode… Crice… Abra… ruschii           1      0          0      0
#> 4 Rode… Abroc… Abro… bennet…           1      0          0      0
#> 5 Rode… Abroc… Abro… bolivi…           1      0          0      0
#> 6 Rode… Abroc… Abro… budini            1      0          0      0
#> # … with 5,825 more rows, and 15 more variables: life_habit_method <chr>,
#> #   life_habit_source <chr>, mass_g <dbl>, mass_method <chr>,
#> #   mass_source <chr>, mass_comparison <chr>, mass_comparison_source <chr>,
#> #   island_endemicity <chr>, iucn_status <chr>, added_iucn_status <chr>,
#> #   diet_plant <dbl>, diet_vertebrate <dbl>, diet_invertebrate <dbl>,
#> #   diet_method <chr>, diet_source <chr>
phylacine %>% select(-(binomial:species))
#> # A tibble: 5,831 x 19
#>   terrestrial marine freshwater aerial life_habit_meth… life_habit_sour… mass_g
#>         <dbl>  <dbl>      <dbl>  <dbl> <chr>            <chr>             <dbl>
#> 1           1      0          0      0 Reported         IUCN. 2016. IUC…   269 
#> 2           1      0          0      0 Reported         IUCN. 2016. IUC…    52 
#> 3           1      0          0      0 Reported         IUCN. 2016. IUC…    63 
#> 4           1      0          0      0 Reported         IUCN. 2016. IUC…   250 
#> 5           1      0          0      0 Reported         IUCN. 2016. IUC…   158 
#> 6           1      0          0      0 Reported         IUCN. 2016. IUC…   361.
#> # … with 5,825 more rows, and 12 more variables: mass_method <chr>,
#> #   mass_source <chr>, mass_comparison <chr>, mass_comparison_source <chr>,
#> #   island_endemicity <chr>, iucn_status <chr>, added_iucn_status <chr>,
#> #   diet_plant <dbl>, diet_vertebrate <dbl>, diet_invertebrate <dbl>,
#> #   diet_method <chr>, diet_source <chr>

select() and rename() are pretty similar, and in fact, select() can also rename variables along the way:

And you can mix all of that at once:

3.2.4 Select variables with helpers

The Rstudio team just released dplyr 1.0.0, and along with it, some nice helper functions to ease the selection of a set of variables. I give three examples here, and encourage you to look at the documentation (?select()) to find out more.

3.2.5 Rearranging variable order with relocate()

The order of variables seldom matters in dplyr, but due to popular demand, dplyr now has a dedicated verb to rearrange the order of variables. The syntax is identical to rename(), select().

phylacine %>% relocate(mass_g, .before = binomial)
#> # A tibble: 5,831 x 24
#>   mass_g binomial order family genus species terrestrial marine freshwater
#>    <dbl> <chr>    <chr> <chr>  <chr> <chr>         <dbl>  <dbl>      <dbl>
#> 1   269  Abditom… Rode… Murid… Abdi… latide…           1      0          0
#> 2    52  Abeomel… Rode… Murid… Abeo… sevia             1      0          0
#> 3    63  Abraway… Rode… Crice… Abra… ruschii           1      0          0
#> 4   250  Abrocom… Rode… Abroc… Abro… bennet…           1      0          0
#> 5   158  Abrocom… Rode… Abroc… Abro… bolivi…           1      0          0
#> 6   361. Abrocom… Rode… Abroc… Abro… budini            1      0          0
#> # … with 5,825 more rows, and 15 more variables: aerial <dbl>,
#> #   life_habit_method <chr>, life_habit_source <chr>, mass_method <chr>,
#> #   mass_source <chr>, mass_comparison <chr>, mass_comparison_source <chr>,
#> #   island_endemicity <chr>, iucn_status <chr>, added_iucn_status <chr>,
#> #   diet_plant <dbl>, diet_vertebrate <dbl>, diet_invertebrate <dbl>,
#> #   diet_method <chr>, diet_source <chr>
phylacine %>% relocate(starts_with("diet"), .after = species)
#> # A tibble: 5,831 x 24
#>   binomial order family genus species diet_plant diet_vertebrate
#>   <chr>    <chr> <chr>  <chr> <chr>        <dbl>           <dbl>
#> 1 Abditom… Rode… Murid… Abdi… latide…        100               0
#> 2 Abeomel… Rode… Murid… Abeo… sevia           78               3
#> 3 Abraway… Rode… Crice… Abra… ruschii         88               1
#> 4 Abrocom… Rode… Abroc… Abro… bennet…        100               0
#> 5 Abrocom… Rode… Abroc… Abro… bolivi…        100               0
#> 6 Abrocom… Rode… Abroc… Abro… budini         100               0
#> # … with 5,825 more rows, and 17 more variables: diet_invertebrate <dbl>,
#> #   diet_method <chr>, diet_source <chr>, terrestrial <dbl>, marine <dbl>,
#> #   freshwater <dbl>, aerial <dbl>, life_habit_method <chr>,
#> #   life_habit_source <chr>, mass_g <dbl>, mass_method <chr>,
#> #   mass_source <chr>, mass_comparison <chr>, mass_comparison_source <chr>,
#> #   island_endemicity <chr>, iucn_status <chr>, added_iucn_status <chr>

3.3 Working with observations

3.3.2 Subset rows by position - slice()

Use slice() and its variants to extract particular rows.

phylacine %>% slice(3) # third row
#> # A tibble: 1 x 24
#>   binomial order family genus species terrestrial marine freshwater aerial
#>   <chr>    <chr> <chr>  <chr> <chr>         <dbl>  <dbl>      <dbl>  <dbl>
#> 1 Abraway… Rode… Crice… Abra… ruschii           1      0          0      0
#> # … with 15 more variables: life_habit_method <chr>, life_habit_source <chr>,
#> #   mass_g <dbl>, mass_method <chr>, mass_source <chr>, mass_comparison <chr>,
#> #   mass_comparison_source <chr>, island_endemicity <chr>, iucn_status <chr>,
#> #   added_iucn_status <chr>, diet_plant <dbl>, diet_vertebrate <dbl>,
#> #   diet_invertebrate <dbl>, diet_method <chr>, diet_source <chr>
phylacine %>% slice(5, 1, 2) # fifth, first and second row
#> # A tibble: 3 x 24
#>   binomial order family genus species terrestrial marine freshwater aerial
#>   <chr>    <chr> <chr>  <chr> <chr>         <dbl>  <dbl>      <dbl>  <dbl>
#> 1 Abrocom… Rode… Abroc… Abro… bolivi…           1      0          0      0
#> 2 Abditom… Rode… Murid… Abdi… latide…           1      0          0      0
#> 3 Abeomel… Rode… Murid… Abeo… sevia             1      0          0      0
#> # … with 15 more variables: life_habit_method <chr>, life_habit_source <chr>,
#> #   mass_g <dbl>, mass_method <chr>, mass_source <chr>, mass_comparison <chr>,
#> #   mass_comparison_source <chr>, island_endemicity <chr>, iucn_status <chr>,
#> #   added_iucn_status <chr>, diet_plant <dbl>, diet_vertebrate <dbl>,
#> #   diet_invertebrate <dbl>, diet_method <chr>, diet_source <chr>
phylacine %>% slice(rep(3, 2)) # duplicate the third row
#> # A tibble: 2 x 24
#>   binomial order family genus species terrestrial marine freshwater aerial
#>   <chr>    <chr> <chr>  <chr> <chr>         <dbl>  <dbl>      <dbl>  <dbl>
#> 1 Abraway… Rode… Crice… Abra… ruschii           1      0          0      0
#> 2 Abraway… Rode… Crice… Abra… ruschii           1      0          0      0
#> # … with 15 more variables: life_habit_method <chr>, life_habit_source <chr>,
#> #   mass_g <dbl>, mass_method <chr>, mass_source <chr>, mass_comparison <chr>,
#> #   mass_comparison_source <chr>, island_endemicity <chr>, iucn_status <chr>,
#> #   added_iucn_status <chr>, diet_plant <dbl>, diet_vertebrate <dbl>,
#> #   diet_invertebrate <dbl>, diet_method <chr>, diet_source <chr>
phylacine %>% slice(-c(2:5830)) # exclude all but first and last row
#> # A tibble: 2 x 24
#>   binomial order family genus species terrestrial marine freshwater aerial
#>   <chr>    <chr> <chr>  <chr> <chr>         <dbl>  <dbl>      <dbl>  <dbl>
#> 1 Abditom… Rode… Murid… Abdi… latide…           1      0          0      0
#> 2 Zyzomys… Rode… Murid… Zyzo… woodwa…           1      0          0      0
#> # … with 15 more variables: life_habit_method <chr>, life_habit_source <chr>,
#> #   mass_g <dbl>, mass_method <chr>, mass_source <chr>, mass_comparison <chr>,
#> #   mass_comparison_source <chr>, island_endemicity <chr>, iucn_status <chr>,
#> #   added_iucn_status <chr>, diet_plant <dbl>, diet_vertebrate <dbl>,
#> #   diet_invertebrate <dbl>, diet_method <chr>, diet_source <chr>

phylacine %>% slice_tail(n = 3) # last three rows
#> # A tibble: 3 x 24
#>   binomial order family genus species terrestrial marine freshwater aerial
#>   <chr>    <chr> <chr>  <chr> <chr>         <dbl>  <dbl>      <dbl>  <dbl>
#> 1 Zyzomys… Rode… Murid… Zyzo… palata…           1      0          0      0
#> 2 Zyzomys… Rode… Murid… Zyzo… pedunc…           1      0          0      0
#> 3 Zyzomys… Rode… Murid… Zyzo… woodwa…           1      0          0      0
#> # … with 15 more variables: life_habit_method <chr>, life_habit_source <chr>,
#> #   mass_g <dbl>, mass_method <chr>, mass_source <chr>, mass_comparison <chr>,
#> #   mass_comparison_source <chr>, island_endemicity <chr>, iucn_status <chr>,
#> #   added_iucn_status <chr>, diet_plant <dbl>, diet_vertebrate <dbl>,
#> #   diet_invertebrate <dbl>, diet_method <chr>, diet_source <chr>
phylacine %>% slice_max(mass_g) # largest mammal
#> # A tibble: 1 x 24
#>   binomial order family genus species terrestrial marine freshwater aerial
#>   <chr>    <chr> <chr>  <chr> <chr>         <dbl>  <dbl>      <dbl>  <dbl>
#> 1 Balaeno… Ceta… Balae… Bala… muscul…           0      1          0      0
#> # … with 15 more variables: life_habit_method <chr>, life_habit_source <chr>,
#> #   mass_g <dbl>, mass_method <chr>, mass_source <chr>, mass_comparison <chr>,
#> #   mass_comparison_source <chr>, island_endemicity <chr>, iucn_status <chr>,
#> #   added_iucn_status <chr>, diet_plant <dbl>, diet_vertebrate <dbl>,
#> #   diet_invertebrate <dbl>, diet_method <chr>, diet_source <chr>

You can also sample random rows in the data:

phylacine %>% slice_sample() # a random row
#> # A tibble: 1 x 24
#>   binomial order family genus species terrestrial marine freshwater aerial
#>   <chr>    <chr> <chr>  <chr> <chr>         <dbl>  <dbl>      <dbl>  <dbl>
#> 1 Crocidu… Euli… Soric… Croc… levicu…           1      0          0      0
#> # … with 15 more variables: life_habit_method <chr>, life_habit_source <chr>,
#> #   mass_g <dbl>, mass_method <chr>, mass_source <chr>, mass_comparison <chr>,
#> #   mass_comparison_source <chr>, island_endemicity <chr>, iucn_status <chr>,
#> #   added_iucn_status <chr>, diet_plant <dbl>, diet_vertebrate <dbl>,
#> #   diet_invertebrate <dbl>, diet_method <chr>, diet_source <chr>

# bootstrap
phylacine %>% slice_sample(n = 5831, replace = TRUE)
#> # A tibble: 5,831 x 24
#>   binomial order family genus species terrestrial marine freshwater aerial
#>   <chr>    <chr> <chr>  <chr> <chr>         <dbl>  <dbl>      <dbl>  <dbl>
#> 1 Rhinolo… Chir… Rhino… Rhin… adami             0      0          0      1
#> 2 Hylomys… Euli… Erina… Hylo… megalo…           1      0          0      0
#> 3 Sciurus… Rode… Sciur… Sciu… yucata…           1      0          0      0
#> 4 Emballo… Chir… Embal… Emba… alecto            0      0          0      1
#> 5 Pteralo… Chir… Ptero… Pter… taki              0      0          0      1
#> 6 Lasiorh… Dipr… Vomba… Lasi… latifr…           1      0          0      0
#> # … with 5,825 more rows, and 15 more variables: life_habit_method <chr>,
#> #   life_habit_source <chr>, mass_g <dbl>, mass_method <chr>,
#> #   mass_source <chr>, mass_comparison <chr>, mass_comparison_source <chr>,
#> #   island_endemicity <chr>, iucn_status <chr>, added_iucn_status <chr>,
#> #   diet_plant <dbl>, diet_vertebrate <dbl>, diet_invertebrate <dbl>,
#> #   diet_method <chr>, diet_source <chr>

3.3.3 Subsetting rows by value with filter()

filter() does a similar job as slice(), but extract rows that satisfy a set of conditions. The conditions are supplied much the same way as you would do for an if statement.

Along with mutate() (next section), this is probably the function you are going to use the most.

For example, I might want to extract mammals above a given mass:

Are there any flying mammals that aren’t bats?

Are humans included in the table?

filter() can be used to deal with NAs:

Tip: dplyr introduces the useful function between() that does exactly what the name implies

Note that you can pipe operations inside function arguments as in the last line above (arguments are expressions, after all!).

3.4 Making new variables

3.4.1 Create new variables with mutate()

Very often in data analysis, you will want to create new variables, or edit existing ones. This is done easily through mutate(). For example, consider the diet data:

These three variables show the percentage of each category of food that make the diet of that species. They should sum to 100, unless the authors made a typo or other entry error. To assert this, I’m going to create a new variable, total_diet.

mutate() adds a variable to the table, and keeps all other variables. Sometimes you may want to just keep the new variable, and drop the other ones. That’s the job of mutate()’s twin sibling, transmute(). For example, I want to combine diet_invertebrate and diet_vertebrate together:

You may want to keep some variables and drop others. You could pipe mutate() and select() to do so, or you could just pass the variables to keep to transmute().

You can also refer to variables you’re creating to derive new variables from them as part of the same operation, this is not an issue.

Sometimes, you may need to perform an operation based on the row number (I don’t have a good example in mind). tibble has a built-in function to do just that:

3.4.2 Summarise observations with summarise()

mutate() applies operations to all observations in a table. By contrast, summarise() applies operations to groups of observations, and returns, er, summaries. The default grouping unit is the entire table:

Above you can see that bats account for a large portion of mammal species diversity (nb_aerial). How much exactly? Just as with mutate(), you can perform operations on the variables you just created, in the same statement:

One fifth!

If the british spelling bothers you, summarize() exists and is strictly equivalent.

Here’s a simple trick with logical (TRUE / FALSE) variables. Their sum is the count of observations that evaluate to TRUE (because TRUE is taken as 1 and FALSE as 0) and their mean is the proportion of TRUE observations. This can be exploited to count the number of observations that satisfy a condition:

There are more summaries that just means and counts (see ?summarise() for some helpful functions). In fact, summarise can use any function or expression that evaluates to a single value or a vector of values. This includes base R max(), quantiles, etc.

mutate() and transmute() can compute summaries as well, but they will return the summary once for each observation, in a new column.

3.4.3 Grouping observations by variables

In most cases you don’t want to run summary operations on the entire set of observations, but instead on observations that share a common value, i.e. groups. For example, I want to run the summary displayed above, but for each Order of mammals.

distinct() extracts all the unique values of a variable

I could work my way with what we have already seen, filtering observations (filter(order == "Rodentia")) and then pipeing the output to summarise(), and do it again for each Order. But that would be tedious.

Instead, I can use group_by() to pool observations by order.

At first glance, nothing has changed, apart from an extra line of information in the output that tells me the observations have been grouped. But now here’s what happen if I run the same summarise() statement on an ungrouped and a grouped table

I get one value for each group.

Observations can be grouped by multiple variables, which will output a summary for every unique combination of groups.

Whenever you call summarise(), the last level of grouping is dropped. Note how in the output table above, observations are still grouped by order, and no longer by IUCN status. If I summarise observations again:

I get the summary across orders, and the table is no longer grouped at all. This is useful to consider if you need to work on summaries across different levels of the data.

For example, I would like to know how the species in each order are distributed between the different levels of threat in the IUCN classification. To get these proportions, I need to first get the count of each number of species in a level of threat inside an order, and divide that by the number of species in that order.

10.2% of Carnivores are Endangered (“EN”).

3.4.4 Grouped data and other dplyr verbs

Grouping does not only affect the behaviour of summarise, but under circumstances, other verbs can (and will!) perform operations by groups.

To avoid grouped operations, you can simply drop grouping with ungroup().

3.5 Working with multiple tables

3.5.2 Combining variables of two tables with mutating joins

Mutating joins are tailored to combine tables that share a set of observations but have different variables.

As an example, let’s split the phylacine dataset in two smaller datasets, one containing information on diet and one on the dominant habitat.

The two datasets each contain 5 species, the first three are shared, and the two last differ between the two.

To use mutate-joins, both tables need to have a key, a variable that identifies each observation. Here, that would be binomial, the sepcies names. If your table doesn’t have such a key and the rows between the tables match one another, remember you can create a row number variable easily with tibble::column_to_rownames().

inner_join combined the variables, and dropped the observations that weren’t matched between the two tables. There are three other variations of mutating joins, differing in what they do with unmatching variables.

left_join(diet, life_habit, by = "binomial")
#> # A tibble: 5 x 8
#>   binomial diet_plant diet_vertebrate diet_invertebra… terrestrial marine
#>   <chr>         <dbl>           <dbl>            <dbl>       <dbl>  <dbl>
#> 1 Abditom…        100               0                0           1      0
#> 2 Abeomel…         78               3               19           1      0
#> 3 Abraway…         88               1               11           1      0
#> 4 Abrocom…        100               0                0          NA     NA
#> 5 Abrocom…        100               0                0          NA     NA
#> # … with 2 more variables: freshwater <dbl>, aerial <dbl>
right_join(diet, life_habit, by = "binomial")
#> # A tibble: 5 x 8
#>   binomial diet_plant diet_vertebrate diet_invertebra… terrestrial marine
#>   <chr>         <dbl>           <dbl>            <dbl>       <dbl>  <dbl>
#> 1 Abditom…        100               0                0           1      0
#> 2 Abeomel…         78               3               19           1      0
#> 3 Abraway…         88               1               11           1      0
#> 4 Abrocom…         NA              NA               NA           1      0
#> 5 Abrocom…         NA              NA               NA           1      0
#> # … with 2 more variables: freshwater <dbl>, aerial <dbl>
full_join(diet, life_habit, by = "binomial")
#> # A tibble: 7 x 8
#>   binomial diet_plant diet_vertebrate diet_invertebra… terrestrial marine
#>   <chr>         <dbl>           <dbl>            <dbl>       <dbl>  <dbl>
#> 1 Abditom…        100               0                0           1      0
#> 2 Abeomel…         78               3               19           1      0
#> 3 Abraway…         88               1               11           1      0
#> 4 Abrocom…        100               0                0          NA     NA
#> 5 Abrocom…        100               0                0          NA     NA
#> 6 Abrocom…         NA              NA               NA           1      0
#> # … with 1 more row, and 2 more variables: freshwater <dbl>, aerial <dbl>

semi_join(diet, life_habit, by  = "binomial")
#> # A tibble: 3 x 4
#>   binomial             diet_plant diet_vertebrate diet_invertebrate
#>   <chr>                     <dbl>           <dbl>             <dbl>
#> 1 Abditomys_latidens          100               0                 0
#> 2 Abeomelomys_sevia            78               3                19
#> 3 Abrawayaomys_ruschii         88               1                11
anti_join(diet, life_habit, by  = "binomial")
#> # A tibble: 2 x 4
#>   binomial             diet_plant diet_vertebrate diet_invertebrate
#>   <chr>                     <dbl>           <dbl>             <dbl>
#> 1 Abrocoma_bennettii          100               0                 0
#> 2 Abrocoma_boliviensis        100               0                 0