Section 3 Data manipulation with `dplyr`

# load the tidyverse
library(tidyverse)

3.1 Introduction

3.1.1 Foreword on `dplyr`

dplyr is tasked with performing all sorts of transformations on a dataset.

The structure of dplyr revolves around a set of functions, the so-called verbs, that share a common syntax and logic, and are meant to work with one another in chained operations. Chained operations are performed with the pipe operator (%>%), that will be introduced in section 3.2.2.

The basic syntax is verb(data, variable), where data is a data frame and variable is the name of one or more columns containing a set of values for each observation.

There are 5 main verbs, which names already hint at what they do: rename(), select(), filter(), mutate(), and summarise(). I’m going to introduce each of them (and a couple more) through the following sections.

3.1.2 Example data

Through this tutorial, we will be using mammal trait data from the Phylacine database. Let’s have a peek at what it contains.

phylacine <- read_csv("data/phylacine_traits.csv")
phylacine
#> # A tibble: 5,831 x 24
#>   Binomial.1.2 Order.1.2 Family.1.2 Genus.1.2 Species.1.2 Terrestrial Marine
#>   <chr>        <chr>     <chr>      <chr>     <chr>             <dbl>  <dbl>
#> 1 Abditomys_l… Rodentia  Muridae    Abditomys latidens              1      0
#> 2 Abeomelomys… Rodentia  Muridae    Abeomelo… sevia                 1      0
#> 3 Abrawayaomy… Rodentia  Cricetidae Abrawaya… ruschii               1      0
#> 4 Abrocoma_be… Rodentia  Abrocomid… Abrocoma  bennettii             1      0
#> 5 Abrocoma_bo… Rodentia  Abrocomid… Abrocoma  boliviensis           1      0
#> 6 Abrocoma_bu… Rodentia  Abrocomid… Abrocoma  budini                1      0
#> # … with 5,825 more rows, and 17 more variables: Freshwater <dbl>,
#> #   Aerial <dbl>, Life.Habit.Method <chr>, Life.Habit.Source <chr>,
#> #   Mass.g <dbl>, Mass.Method <chr>, Mass.Source <chr>, Mass.Comparison <chr>,
#> #   Mass.Comparison.Source <chr>, Island.Endemicity <chr>,
#> #   IUCN.Status.1.2 <chr>, Added.IUCN.Status.1.2 <chr>, Diet.Plant <dbl>,
#> #   Diet.Vertebrate <dbl>, Diet.Invertebrate <dbl>, Diet.Method <chr>,
#> #   Diet.Source <chr>

readr automatically loads the data in a tibble, as we have seen in chapter 1 and 2. Calling the tibble gives a nice preview of what it contains. We have data for 5,831 mammal species, and the variables contain information on taxonomy, (broad) habitat, mass, IUCN status, and diet.

If you remember Section 1.2 on tidy data, you may see that this data isn’t exactly tidy. In fact, some columns are in wide (and messy) format, like the “habitat” (terrestrial, marine, etc.) and diet columns.

dplyr actually does not require your data to be strictly tidy. If you feel that your data satisfies the definition “one observation per row, one variable per column”, that’s probably good enough.

I use a tibble here, but dplyr works equally well on base data frames. In fact, dplyr is built for data.frame objects, and tibbles are data frames. Therefore, tibbles are mortal.

3.2 Working with existing variables

3.2.1 Renaming variables with `rename()`

The variable names in the phylacine dataset are descriptive, but quite unpractical. Typing Binomial.1.2. is cumbersome and subject to typos (in fact, I just made one). binomial would be much simpler to use.

Changing names is straightforward with rename().

rename(.data = phylacine, "binomial" = Binomial.1.2)
#> # A tibble: 5,831 x 24
#>   binomial Order.1.2 Family.1.2 Genus.1.2 Species.1.2 Terrestrial Marine
#>   <chr>    <chr>     <chr>      <chr>     <chr>             <dbl>  <dbl>
#> 1 Abditom… Rodentia  Muridae    Abditomys latidens              1      0
#> 2 Abeomel… Rodentia  Muridae    Abeomelo… sevia                 1      0
#> 3 Abraway… Rodentia  Cricetidae Abrawaya… ruschii               1      0
#> 4 Abrocom… Rodentia  Abrocomid… Abrocoma  bennettii             1      0
#> 5 Abrocom… Rodentia  Abrocomid… Abrocoma  boliviensis           1      0
#> 6 Abrocom… Rodentia  Abrocomid… Abrocoma  budini                1      0
#> # … with 5,825 more rows, and 17 more variables: Freshwater <dbl>,
#> #   Aerial <dbl>, Life.Habit.Method <chr>, Life.Habit.Source <chr>,
#> #   Mass.g <dbl>, Mass.Method <chr>, Mass.Source <chr>, Mass.Comparison <chr>,
#> #   Mass.Comparison.Source <chr>, Island.Endemicity <chr>,
#> #   IUCN.Status.1.2 <chr>, Added.IUCN.Status.1.2 <chr>, Diet.Plant <dbl>,
#> #   Diet.Vertebrate <dbl>, Diet.Invertebrate <dbl>, Diet.Method <chr>,
#> #   Diet.Source <chr>

The first argument is always .data, the data table you want to apply change to. Note how columns are referred to. Once the data table as been passed as an argument, there is no need to refer to it directly anymore, dplyr understands that you’re dealing with variables inside that data frame. So drop that data$var, data[, "var"], and forget the very existence of attach() / detach().

You can refer to variables names either with strings or directly as objects, whether you’re reading or creating them:

rename(
  phylacine,
  # this works
  binomial = Binomial.1.2
)
rename(
  phylacine,
  # this works too!
  binomial = "Binomial.1.2"
)
rename(
  phylacine,
  # guess what
  "binomial" = "Binomial.1.2"
)

I have applied similar changes to all variables in the dataset. Here is what the new names look like:

#> # A tibble: 5,831 x 24
#>   binomial order family genus species terrestrial marine freshwater aerial
#>   <chr>    <chr> <chr>  <chr> <chr>         <dbl>  <dbl>      <dbl>  <dbl>
#> 1 Abditom… Rode… Murid… Abdi… latide…           1      0          0      0
#> 2 Abeomel… Rode… Murid… Abeo… sevia             1      0          0      0
#> 3 Abraway… Rode… Crice… Abra… ruschii           1      0          0      0
#> 4 Abrocom… Rode… Abroc… Abro… bennet…           1      0          0      0
#> 5 Abrocom… Rode… Abroc… Abro… bolivi…           1      0          0      0
#> 6 Abrocom… Rode… Abroc… Abro… budini            1      0          0      0
#> # … with 5,825 more rows, and 15 more variables: life_habit_method <chr>,
#> #   life_habit_source <chr>, mass_g <dbl>, mass_method <chr>,
#> #   mass_source <chr>, mass_comparison <chr>, mass_comparison_source <chr>,
#> #   island_endemicity <chr>, iucn_status <chr>, added_iucn_status <chr>,
#> #   diet_plant <dbl>, diet_vertebrate <dbl>, diet_invertebrate <dbl>,
#> #   diet_method <chr>, diet_source <chr>

3.2.2 The pipe operator `%>%`

If you have already come across pieces of code using the tidyverse, chances are that you have seen this odd symbol. While the pipe is not strictly-speaking a part of the tidyverse (it comes from its own package, magrittr), it is imported along with each package and widely used in conjunction with its functions. What does it do? Consider the following example with rename():

phylacine2 <- readr::read_csv("data/phylacine_traits.csv")
# regular syntax
rename(phylacine2, "binomial" = "Binomial.1.2")
#> # A tibble: 5,831 x 24
#>   binomial Order.1.2 Family.1.2 Genus.1.2 Species.1.2 Terrestrial Marine
#>   <chr>    <chr>     <chr>      <chr>     <chr>             <dbl>  <dbl>
#> 1 Abditom… Rodentia  Muridae    Abditomys latidens              1      0
#> 2 Abeomel… Rodentia  Muridae    Abeomelo… sevia                 1      0
#> 3 Abraway… Rodentia  Cricetidae Abrawaya… ruschii               1      0
#> 4 Abrocom… Rodentia  Abrocomid… Abrocoma  bennettii             1      0
#> 5 Abrocom… Rodentia  Abrocomid… Abrocoma  boliviensis           1      0
#> 6 Abrocom… Rodentia  Abrocomid… Abrocoma  budini                1      0
#> # … with 5,825 more rows, and 17 more variables: Freshwater <dbl>,
#> #   Aerial <dbl>, Life.Habit.Method <chr>, Life.Habit.Source <chr>,
#> #   Mass.g <dbl>, Mass.Method <chr>, Mass.Source <chr>, Mass.Comparison <chr>,
#> #   Mass.Comparison.Source <chr>, Island.Endemicity <chr>,
#> #   IUCN.Status.1.2 <chr>, Added.IUCN.Status.1.2 <chr>, Diet.Plant <dbl>,
#> #   Diet.Vertebrate <dbl>, Diet.Invertebrate <dbl>, Diet.Method <chr>,
#> #   Diet.Source <chr>
# alternative syntax with the pipe operator
phylacine2 %>% rename("binomial" = "Binomial.1.2")
#> # A tibble: 5,831 x 24
#>   binomial Order.1.2 Family.1.2 Genus.1.2 Species.1.2 Terrestrial Marine
#>   <chr>    <chr>     <chr>      <chr>     <chr>             <dbl>  <dbl>
#> 1 Abditom… Rodentia  Muridae    Abditomys latidens              1      0
#> 2 Abeomel… Rodentia  Muridae    Abeomelo… sevia                 1      0
#> 3 Abraway… Rodentia  Cricetidae Abrawaya… ruschii               1      0
#> 4 Abrocom… Rodentia  Abrocomid… Abrocoma  bennettii             1      0
#> 5 Abrocom… Rodentia  Abrocomid… Abrocoma  boliviensis           1      0
#> 6 Abrocom… Rodentia  Abrocomid… Abrocoma  budini                1      0
#> # … with 5,825 more rows, and 17 more variables: Freshwater <dbl>,
#> #   Aerial <dbl>, Life.Habit.Method <chr>, Life.Habit.Source <chr>,
#> #   Mass.g <dbl>, Mass.Method <chr>, Mass.Source <chr>, Mass.Comparison <chr>,
#> #   Mass.Comparison.Source <chr>, Island.Endemicity <chr>,
#> #   IUCN.Status.1.2 <chr>, Added.IUCN.Status.1.2 <chr>, Diet.Plant <dbl>,
#> #   Diet.Vertebrate <dbl>, Diet.Invertebrate <dbl>, Diet.Method <chr>,
#> #   Diet.Source <chr>

Got it? The pipe takes the object on its left-side and silently feeds it to the first argument of the function on its right-side. It could be read as “take x, then do…”. The reason for using the pipe is because it makes code syntax closer to the syntax of a sentence, and therefore, easier and faster for your brain to process (and write!) the code. In particular, the pipe enables easy chains of operations, where you apply something to an object, then apply something else to the outcome, and so on… Through the later sections, you will see some examples of chained operations with dplyr functions, but for that I first need to introduce a couple more verbs.

Using the pipe can be quite unsettling at first, because you are not used to think in this way. But if you push a bit for it, I promise it will make things a lot easier (and it’s quite addictive!). To avoid typing the tedious symbols, magrittr installs a shortcut for you in RStudio. Use Ctrl + Shift + M on Windows, and Cmd + Shift + M on MacOS.

Finally I should emphasize that the use of the pipe isn’t limited to the tidyverse, but extends to almost all R functions. Remember that by default the piped value is always matched to the first argument of the following function

5 %>% rep(3)
#> [1] 5 5 5
"meow" %>% cat()
#> meow

If you need to pass the left-hand side to an argument other than the first, you can use the dot place-holder ..

"meow" %>% cat("cats", "go")
#> meow cats go

Because of its syntax, most base R operators are not compatible with the pipe (but this is very rarely needed). If needed, magrittr introduces alternative functions for operators.

Subsetting operators can be piped, with the dot place-holder.

# 5 %>% * 3 # no, won't work
# 5 %>% .* 3 # neither
5 %>% magrittr::multiply_by(3) # yes
#> [1] 15

# subsetting
list("monkey see", "monkey_do") %>% .[[2]]
#> [1] "monkey_do"
phylacine %>% .$binomial %>% head()
#> [1] "Abditomys_latidens"   "Abeomelomys_sevia"    "Abrawayaomys_ruschii"
#> [4] "Abrocoma_bennettii"   "Abrocoma_boliviensis" "Abrocoma_budini"

Because subsetting in this way is particularly hideous, dplyr delivers a function to extract values from a single variable. In only works on tables, though.

phylacine %>% pull(binomial) %>% head()
#> [1] "Abditomys_latidens"   "Abeomelomys_sevia"    "Abrawayaomys_ruschii"
#> [4] "Abrocoma_bennettii"   "Abrocoma_boliviensis" "Abrocoma_budini"

3.2.3 Select variables with `select()`

To extract a set of variables (i.e. columns), use the conveniently-named select(). The basic syntax is the same as rename(): pass your data as the first argument, then call the variables to select, quoted or not.

# Single variable
phylacine %>% select(binomial)
#> # A tibble: 5,831 x 1
#>   binomial            
#>   <chr>               
#> 1 Abditomys_latidens  
#> 2 Abeomelomys_sevia   
#> 3 Abrawayaomys_ruschii
#> 4 Abrocoma_bennettii  
#> 5 Abrocoma_boliviensis
#> 6 Abrocoma_budini     
#> # … with 5,825 more rows
# A set of variables
phylacine %>% select(genus, "species", mass_g)
#> # A tibble: 5,831 x 3
#>   genus        species     mass_g
#>   <chr>        <chr>        <dbl>
#> 1 Abditomys    latidens      269 
#> 2 Abeomelomys  sevia          52 
#> 3 Abrawayaomys ruschii        63 
#> 4 Abrocoma     bennettii     250 
#> 5 Abrocoma     boliviensis   158 
#> 6 Abrocoma     budini        361.
#> # … with 5,825 more rows
# A range of contiguous variables
phylacine %>% select(family:terrestrial)
#> # A tibble: 5,831 x 4
#>   family      genus        species     terrestrial
#>   <chr>       <chr>        <chr>             <dbl>
#> 1 Muridae     Abditomys    latidens              1
#> 2 Muridae     Abeomelomys  sevia                 1
#> 3 Cricetidae  Abrawayaomys ruschii               1
#> 4 Abrocomidae Abrocoma     bennettii             1
#> 5 Abrocomidae Abrocoma     boliviensis           1
#> 6 Abrocomidae Abrocoma     budini                1
#> # … with 5,825 more rows

You can select by variable numbers. This is not recommended, as prone to errors, especially if you change the variable order.

phylacine %>% select(2)
#> # A tibble: 5,831 x 1
#>   order   
#>   <chr>   
#> 1 Rodentia
#> 2 Rodentia
#> 3 Rodentia
#> 4 Rodentia
#> 5 Rodentia
#> 6 Rodentia
#> # … with 5,825 more rows

select() can also be used to exclude variables:

phylacine %>% select(-binomial)
#> # A tibble: 5,831 x 23
#>   order family genus species terrestrial marine freshwater aerial
#>   <chr> <chr>  <chr> <chr>         <dbl>  <dbl>      <dbl>  <dbl>
#> 1 Rode… Murid… Abdi… latide…           1      0          0      0
#> 2 Rode… Murid… Abeo… sevia             1      0          0      0
#> 3 Rode… Crice… Abra… ruschii           1      0          0      0
#> 4 Rode… Abroc… Abro… bennet…           1      0          0      0
#> 5 Rode… Abroc… Abro… bolivi…           1      0          0      0
#> 6 Rode… Abroc… Abro… budini            1      0          0      0
#> # … with 5,825 more rows, and 15 more variables: life_habit_method <chr>,
#> #   life_habit_source <chr>, mass_g <dbl>, mass_method <chr>,
#> #   mass_source <chr>, mass_comparison <chr>, mass_comparison_source <chr>,
#> #   island_endemicity <chr>, iucn_status <chr>, added_iucn_status <chr>,
#> #   diet_plant <dbl>, diet_vertebrate <dbl>, diet_invertebrate <dbl>,
#> #   diet_method <chr>, diet_source <chr>
phylacine %>% select(-(binomial:species))
#> # A tibble: 5,831 x 19
#>   terrestrial marine freshwater aerial life_habit_meth… life_habit_sour… mass_g
#>         <dbl>  <dbl>      <dbl>  <dbl> <chr>            <chr>             <dbl>
#> 1           1      0          0      0 Reported         IUCN. 2016. IUC…   269 
#> 2           1      0          0      0 Reported         IUCN. 2016. IUC…    52 
#> 3           1      0          0      0 Reported         IUCN. 2016. IUC…    63 
#> 4           1      0          0      0 Reported         IUCN. 2016. IUC…   250 
#> 5           1      0          0      0 Reported         IUCN. 2016. IUC…   158 
#> 6           1      0          0      0 Reported         IUCN. 2016. IUC…   361.
#> # … with 5,825 more rows, and 12 more variables: mass_method <chr>,
#> #   mass_source <chr>, mass_comparison <chr>, mass_comparison_source <chr>,
#> #   island_endemicity <chr>, iucn_status <chr>, added_iucn_status <chr>,
#> #   diet_plant <dbl>, diet_vertebrate <dbl>, diet_invertebrate <dbl>,
#> #   diet_method <chr>, diet_source <chr>

select() and rename() are pretty similar, and in fact, select() can also rename variables along the way:

phylacine %>% select("linnaeus" = binomial)
#> # A tibble: 5,831 x 1
#>   linnaeus            
#>   <chr>               
#> 1 Abditomys_latidens  
#> 2 Abeomelomys_sevia   
#> 3 Abrawayaomys_ruschii
#> 4 Abrocoma_bennettii  
#> 5 Abrocoma_boliviensis
#> 6 Abrocoma_budini     
#> # … with 5,825 more rows

And you can mix all of that at once:

phylacine %>% select(
  "fam" = family,
  genus:freshwater,
  -terrestrial
)
#> # A tibble: 5,831 x 5
#>   fam         genus        species     marine freshwater
#>   <chr>       <chr>        <chr>        <dbl>      <dbl>
#> 1 Muridae     Abditomys    latidens         0          0
#> 2 Muridae     Abeomelomys  sevia            0          0
#> 3 Cricetidae  Abrawayaomys ruschii          0          0
#> 4 Abrocomidae Abrocoma     bennettii        0          0
#> 5 Abrocomidae Abrocoma     boliviensis      0          0
#> 6 Abrocomidae Abrocoma     budini           0          0
#> # … with 5,825 more rows

3.2.4 Select variables with helpers

The Rstudio team just released dplyr 1.0.0, and along with it, some nice helper functions to ease the selection of a set of variables. I give three examples here, and encourage you to look at the documentation (?select()) to find out more.

phylacine %>% select(where(is.numeric))
#> # A tibble: 5,831 x 8
#>   terrestrial marine freshwater aerial mass_g diet_plant diet_vertebrate
#>         <dbl>  <dbl>      <dbl>  <dbl>  <dbl>      <dbl>           <dbl>
#> 1           1      0          0      0   269         100               0
#> 2           1      0          0      0    52          78               3
#> 3           1      0          0      0    63          88               1
#> 4           1      0          0      0   250         100               0
#> 5           1      0          0      0   158         100               0
#> 6           1      0          0      0   361.        100               0
#> # … with 5,825 more rows, and 1 more variable: diet_invertebrate <dbl>
phylacine %>% select(contains("mass") | contains("diet"))
#> # A tibble: 5,831 x 10
#>   mass_g mass_method mass_source mass_comparison mass_comparison… diet_plant
#>    <dbl> <chr>       <chr>       <chr>           <chr>                 <dbl>
#> 1   269  Reported    Smith, F. … <NA>            <NA>                    100
#> 2    52  Reported    Smith, F. … <NA>            <NA>                     78
#> 3    63  Reported    Smith, F. … <NA>            <NA>                     88
#> 4   250  Reported    Smith, F. … <NA>            <NA>                    100
#> 5   158  Reported    Smith, F. … <NA>            <NA>                    100
#> 6   361. Assumed is… Journal of… Abrocoma_ciner… Journal of Mamm…        100
#> # … with 5,825 more rows, and 4 more variables: diet_vertebrate <dbl>,
#> #   diet_invertebrate <dbl>, diet_method <chr>, diet_source <chr>

habitats <- c("terrestrial", "marine", "arboreal", "fossorial", "freshwater")
phylacine %>% select(any_of(habitats))
#> # A tibble: 5,831 x 3
#>   terrestrial marine freshwater
#>         <dbl>  <dbl>      <dbl>
#> 1           1      0          0
#> 2           1      0          0
#> 3           1      0          0
#> 4           1      0          0
#> 5           1      0          0
#> 6           1      0          0
#> # … with 5,825 more rows

3.2.5 Rearranging variable order with `relocate()`

The order of variables seldom matters in dplyr, but due to popular demand, dplyr now has a dedicated verb to rearrange the order of variables. The syntax is identical to rename(), select().

phylacine %>% relocate(mass_g, .before = binomial)
#> # A tibble: 5,831 x 24
#>   mass_g binomial order family genus species terrestrial marine freshwater
#>    <dbl> <chr>    <chr> <chr>  <chr> <chr>         <dbl>  <dbl>      <dbl>
#> 1   269  Abditom… Rode… Murid… Abdi… latide…           1      0          0
#> 2    52  Abeomel… Rode… Murid… Abeo… sevia             1      0          0
#> 3    63  Abraway… Rode… Crice… Abra… ruschii           1      0          0
#> 4   250  Abrocom… Rode… Abroc… Abro… bennet…           1      0          0
#> 5   158  Abrocom… Rode… Abroc… Abro… bolivi…           1      0          0
#> 6   361. Abrocom… Rode… Abroc… Abro… budini            1      0          0
#> # … with 5,825 more rows, and 15 more variables: aerial <dbl>,
#> #   life_habit_method <chr>, life_habit_source <chr>, mass_method <chr>,
#> #   mass_source <chr>, mass_comparison <chr>, mass_comparison_source <chr>,
#> #   island_endemicity <chr>, iucn_status <chr>, added_iucn_status <chr>,
#> #   diet_plant <dbl>, diet_vertebrate <dbl>, diet_invertebrate <dbl>,
#> #   diet_method <chr>, diet_source <chr>
phylacine %>% relocate(starts_with("diet"), .after = species)
#> # A tibble: 5,831 x 24
#>   binomial order family genus species diet_plant diet_vertebrate
#>   <chr>    <chr> <chr>  <chr> <chr>        <dbl>           <dbl>
#> 1 Abditom… Rode… Murid… Abdi… latide…        100               0
#> 2 Abeomel… Rode… Murid… Abeo… sevia           78               3
#> 3 Abraway… Rode… Crice… Abra… ruschii         88               1
#> 4 Abrocom… Rode… Abroc… Abro… bennet…        100               0
#> 5 Abrocom… Rode… Abroc… Abro… bolivi…        100               0
#> 6 Abrocom… Rode… Abroc… Abro… budini         100               0
#> # … with 5,825 more rows, and 17 more variables: diet_invertebrate <dbl>,
#> #   diet_method <chr>, diet_source <chr>, terrestrial <dbl>, marine <dbl>,
#> #   freshwater <dbl>, aerial <dbl>, life_habit_method <chr>,
#> #   life_habit_source <chr>, mass_g <dbl>, mass_method <chr>,
#> #   mass_source <chr>, mass_comparison <chr>, mass_comparison_source <chr>,
#> #   island_endemicity <chr>, iucn_status <chr>, added_iucn_status <chr>

3.3 Working with observations

3.3.1 Ordering rows by value - `arrange()`

arrange() sorts rows in the data by ascending value for a given variable. Use the wrapper desc() to sort by descending values instead.

# Smallest mammals
phylacine %>%
  arrange(mass_g) %>%
  select(binomial, mass_g)
#> # A tibble: 5,831 x 2
#>   binomial            mass_g
#>   <chr>                <dbl>
#> 1 Sorex_yukonicus        1.6
#> 2 Crocidura_levicula     1.8
#> 3 Suncus_remyi           1.8
#> 4 Crocidura_lusitania    2  
#> 5 Kerivoula_minuta       2.1
#> 6 Suncus_etruscus        2.1
#> # … with 5,825 more rows

# Largest mammals
phylacine %>%
  arrange(desc(mass_g)) %>%
  select(binomial, mass_g)
#> # A tibble: 5,831 x 2
#>   binomial                  mass_g
#>   <chr>                      <dbl>
#> 1 Balaenoptera_musculus  190000000
#> 2 Balaena_mysticetus     100000000
#> 3 Balaenoptera_physalus   70000000
#> 4 Caperea_marginata       32000000
#> 5 Megaptera_novaeangliae  30000000
#> 6 Eschrichtius_robustus   28500000
#> # … with 5,825 more rows

# Extra variables are used to sort ties in the first variable
phylacine %>%
  arrange(mass_g, desc(binomial)) %>%
  select(binomial, mass_g)
#> # A tibble: 5,831 x 2
#>   binomial            mass_g
#>   <chr>                <dbl>
#> 1 Sorex_yukonicus        1.6
#> 2 Suncus_remyi           1.8
#> 3 Crocidura_levicula     1.8
#> 4 Crocidura_lusitania    2  
#> 5 Suncus_etruscus        2.1
#> 6 Kerivoula_minuta       2.1
#> # … with 5,825 more rows

Important: NA values, if present, are always ordered at the end!

3.3.2 Subset rows by position - `slice()`

Use slice() and its variants to extract particular rows.

phylacine %>% slice(3) # third row
#> # A tibble: 1 x 24
#>   binomial order family genus species terrestrial marine freshwater aerial
#>   <chr>    <chr> <chr>  <chr> <chr>         <dbl>  <dbl>      <dbl>  <dbl>
#> 1 Abraway… Rode… Crice… Abra… ruschii           1      0          0      0
#> # … with 15 more variables: life_habit_method <chr>, life_habit_source <chr>,
#> #   mass_g <dbl>, mass_method <chr>, mass_source <chr>, mass_comparison <chr>,
#> #   mass_comparison_source <chr>, island_endemicity <chr>, iucn_status <chr>,
#> #   added_iucn_status <chr>, diet_plant <dbl>, diet_vertebrate <dbl>,
#> #   diet_invertebrate <dbl>, diet_method <chr>, diet_source <chr>
phylacine %>% slice(5, 1, 2) # fifth, first and second row
#> # A tibble: 3 x 24
#>   binomial order family genus species terrestrial marine freshwater aerial
#>   <chr>    <chr> <chr>  <chr> <chr>         <dbl>  <dbl>      <dbl>  <dbl>
#> 1 Abrocom… Rode… Abroc… Abro… bolivi…           1      0          0      0
#> 2 Abditom… Rode… Murid… Abdi… latide…           1      0          0      0
#> 3 Abeomel… Rode… Murid… Abeo… sevia             1      0          0      0
#> # … with 15 more variables: life_habit_method <chr>, life_habit_source <chr>,
#> #   mass_g <dbl>, mass_method <chr>, mass_source <chr>, mass_comparison <chr>,
#> #   mass_comparison_source <chr>, island_endemicity <chr>, iucn_status <chr>,
#> #   added_iucn_status <chr>, diet_plant <dbl>, diet_vertebrate <dbl>,
#> #   diet_invertebrate <dbl>, diet_method <chr>, diet_source <chr>
phylacine %>% slice(rep(3, 2)) # duplicate the third row
#> # A tibble: 2 x 24
#>   binomial order family genus species terrestrial marine freshwater aerial
#>   <chr>    <chr> <chr>  <chr> <chr>         <dbl>  <dbl>      <dbl>  <dbl>
#> 1 Abraway… Rode… Crice… Abra… ruschii           1      0          0      0
#> 2 Abraway… Rode… Crice… Abra… ruschii           1      0          0      0
#> # … with 15 more variables: life_habit_method <chr>, life_habit_source <chr>,
#> #   mass_g <dbl>, mass_method <chr>, mass_source <chr>, mass_comparison <chr>,
#> #   mass_comparison_source <chr>, island_endemicity <chr>, iucn_status <chr>,
#> #   added_iucn_status <chr>, diet_plant <dbl>, diet_vertebrate <dbl>,
#> #   diet_invertebrate <dbl>, diet_method <chr>, diet_source <chr>
phylacine %>% slice(-c(2:5830)) # exclude all but first and last row
#> # A tibble: 2 x 24
#>   binomial order family genus species terrestrial marine freshwater aerial
#>   <chr>    <chr> <chr>  <chr> <chr>         <dbl>  <dbl>      <dbl>  <dbl>
#> 1 Abditom… Rode… Murid… Abdi… latide…           1      0          0      0
#> 2 Zyzomys… Rode… Murid… Zyzo… woodwa…           1      0          0      0
#> # … with 15 more variables: life_habit_method <chr>, life_habit_source <chr>,
#> #   mass_g <dbl>, mass_method <chr>, mass_source <chr>, mass_comparison <chr>,
#> #   mass_comparison_source <chr>, island_endemicity <chr>, iucn_status <chr>,
#> #   added_iucn_status <chr>, diet_plant <dbl>, diet_vertebrate <dbl>,
#> #   diet_invertebrate <dbl>, diet_method <chr>, diet_source <chr>

phylacine %>% slice_tail(n = 3) # last three rows
#> # A tibble: 3 x 24
#>   binomial order family genus species terrestrial marine freshwater aerial
#>   <chr>    <chr> <chr>  <chr> <chr>         <dbl>  <dbl>      <dbl>  <dbl>
#> 1 Zyzomys… Rode… Murid… Zyzo… palata…           1      0          0      0
#> 2 Zyzomys… Rode… Murid… Zyzo… pedunc…           1      0          0      0
#> 3 Zyzomys… Rode… Murid… Zyzo… woodwa…           1      0          0      0
#> # … with 15 more variables: life_habit_method <chr>, life_habit_source <chr>,
#> #   mass_g <dbl>, mass_method <chr>, mass_source <chr>, mass_comparison <chr>,
#> #   mass_comparison_source <chr>, island_endemicity <chr>, iucn_status <chr>,
#> #   added_iucn_status <chr>, diet_plant <dbl>, diet_vertebrate <dbl>,
#> #   diet_invertebrate <dbl>, diet_method <chr>, diet_source <chr>
phylacine %>% slice_max(mass_g) # largest mammal
#> # A tibble: 1 x 24
#>   binomial order family genus species terrestrial marine freshwater aerial
#>   <chr>    <chr> <chr>  <chr> <chr>         <dbl>  <dbl>      <dbl>  <dbl>
#> 1 Balaeno… Ceta… Balae… Bala… muscul…           0      1          0      0
#> # … with 15 more variables: life_habit_method <chr>, life_habit_source <chr>,
#> #   mass_g <dbl>, mass_method <chr>, mass_source <chr>, mass_comparison <chr>,
#> #   mass_comparison_source <chr>, island_endemicity <chr>, iucn_status <chr>,
#> #   added_iucn_status <chr>, diet_plant <dbl>, diet_vertebrate <dbl>,
#> #   diet_invertebrate <dbl>, diet_method <chr>, diet_source <chr>

You can also sample random rows in the data:

phylacine %>% slice_sample() # a random row
#> # A tibble: 1 x 24
#>   binomial order family genus species terrestrial marine freshwater aerial
#>   <chr>    <chr> <chr>  <chr> <chr>         <dbl>  <dbl>      <dbl>  <dbl>
#> 1 Crocidu… Euli… Soric… Croc… levicu…           1      0          0      0
#> # … with 15 more variables: life_habit_method <chr>, life_habit_source <chr>,
#> #   mass_g <dbl>, mass_method <chr>, mass_source <chr>, mass_comparison <chr>,
#> #   mass_comparison_source <chr>, island_endemicity <chr>, iucn_status <chr>,
#> #   added_iucn_status <chr>, diet_plant <dbl>, diet_vertebrate <dbl>,
#> #   diet_invertebrate <dbl>, diet_method <chr>, diet_source <chr>

# bootstrap
phylacine %>% slice_sample(n = 5831, replace = TRUE)
#> # A tibble: 5,831 x 24
#>   binomial order family genus species terrestrial marine freshwater aerial
#>   <chr>    <chr> <chr>  <chr> <chr>         <dbl>  <dbl>      <dbl>  <dbl>
#> 1 Rhinolo… Chir… Rhino… Rhin… adami             0      0          0      1
#> 2 Hylomys… Euli… Erina… Hylo… megalo…           1      0          0      0
#> 3 Sciurus… Rode… Sciur… Sciu… yucata…           1      0          0      0
#> 4 Emballo… Chir… Embal… Emba… alecto            0      0          0      1
#> 5 Pteralo… Chir… Ptero… Pter… taki              0      0          0      1
#> 6 Lasiorh… Dipr… Vomba… Lasi… latifr…           1      0          0      0
#> # … with 5,825 more rows, and 15 more variables: life_habit_method <chr>,
#> #   life_habit_source <chr>, mass_g <dbl>, mass_method <chr>,
#> #   mass_source <chr>, mass_comparison <chr>, mass_comparison_source <chr>,
#> #   island_endemicity <chr>, iucn_status <chr>, added_iucn_status <chr>,
#> #   diet_plant <dbl>, diet_vertebrate <dbl>, diet_invertebrate <dbl>,
#> #   diet_method <chr>, diet_source <chr>

3.3.3 Subsetting rows by value with `filter()`

filter() does a similar job as slice(), but extract rows that satisfy a set of conditions. The conditions are supplied much the same way as you would do for an if statement.

Along with mutate() (next section), this is probably the function you are going to use the most.

For example, I might want to extract mammals above a given mass:

# megafauna
phylacine %>%
  filter(mass_g > 1e5) %>% # 100 kg
  select(binomial, mass_g)
#> # A tibble: 302 x 2
#>   binomial                   mass_g
#>   <chr>                       <dbl>
#> 1 Ailuropoda_melanoleuca    108400 
#> 2 Alcelaphus_buselaphus     171002.
#> 3 Alces_alces               356998 
#> 4 Archaeoindris_fontoynonti 160000 
#> 5 Arctocephalus_forsteri    101250 
#> 6 Arctocephalus_pusillus    178500 
#> # … with 296 more rows

# non-extinct megafauna
phylacine %>%
  filter(mass_g > 1e5, iucn_status != "EP") %>%
  select(binomial, mass_g, iucn_status)
#> # A tibble: 178 x 3
#>   binomial                 mass_g iucn_status
#>   <chr>                     <dbl> <chr>      
#> 1 Ailuropoda_melanoleuca  108400  VU         
#> 2 Alcelaphus_buselaphus   171002. LC         
#> 3 Alces_alces             356998  LC         
#> 4 Arctocephalus_forsteri  101250  LC         
#> 5 Arctocephalus_pusillus  178500  LC         
#> 6 Arctocephalus_townsendi 105000  LC         
#> # … with 172 more rows

Are there any flying mammals that aren’t bats?

phylacine %>%
  filter(aerial == 1, order != "Chiroptera")
#> # A tibble: 0 x 24
#> # … with 24 variables: binomial <chr>, order <chr>, family <chr>, genus <chr>,
#> #   species <chr>, terrestrial <dbl>, marine <dbl>, freshwater <dbl>,
#> #   aerial <dbl>, life_habit_method <chr>, life_habit_source <chr>,
#> #   mass_g <dbl>, mass_method <chr>, mass_source <chr>, mass_comparison <chr>,
#> #   mass_comparison_source <chr>, island_endemicity <chr>, iucn_status <chr>,
#> #   added_iucn_status <chr>, diet_plant <dbl>, diet_vertebrate <dbl>,
#> #   diet_invertebrate <dbl>, diet_method <chr>, diet_source <chr>
# no :(

Are humans included in the table?

phylacine %>% filter(binomial == "Homo_sapiens")
#> # A tibble: 1 x 24
#>   binomial order family genus species terrestrial marine freshwater aerial
#>   <chr>    <chr> <chr>  <chr> <chr>         <dbl>  <dbl>      <dbl>  <dbl>
#> 1 Homo_sa… Prim… Homin… Homo  sapiens           1      0          0      0
#> # … with 15 more variables: life_habit_method <chr>, life_habit_source <chr>,
#> #   mass_g <dbl>, mass_method <chr>, mass_source <chr>, mass_comparison <chr>,
#> #   mass_comparison_source <chr>, island_endemicity <chr>, iucn_status <chr>,
#> #   added_iucn_status <chr>, diet_plant <dbl>, diet_vertebrate <dbl>,
#> #   diet_invertebrate <dbl>, diet_method <chr>, diet_source <chr>

filter() can be used to deal with NAs:

phylacine %>%
  filter(!is.na(mass_comparison))
#> # A tibble: 754 x 24
#>   binomial order family genus species terrestrial marine freshwater aerial
#>   <chr>    <chr> <chr>  <chr> <chr>         <dbl>  <dbl>      <dbl>  <dbl>
#> 1 Abrocom… Rode… Abroc… Abro… budini            1      0          0      0
#> 2 Abrocom… Rode… Abroc… Abro… famati…           1      0          0      0
#> 3 Abrocom… Rode… Abroc… Abro… shista…           1      0          0      0
#> 4 Abrocom… Rode… Abroc… Abro… uspall…           1      0          0      0
#> 5 Abrocom… Rode… Abroc… Abro… vaccar…           1      0          0      0
#> 6 Acerodo… Chir… Ptero… Acer… humilis           0      0          0      1
#> # … with 748 more rows, and 15 more variables: life_habit_method <chr>,
#> #   life_habit_source <chr>, mass_g <dbl>, mass_method <chr>,
#> #   mass_source <chr>, mass_comparison <chr>, mass_comparison_source <chr>,
#> #   island_endemicity <chr>, iucn_status <chr>, added_iucn_status <chr>,
#> #   diet_plant <dbl>, diet_vertebrate <dbl>, diet_invertebrate <dbl>,
#> #   diet_method <chr>, diet_source <chr>

Tip: dplyr introduces the useful function between() that does exactly what the name implies

between(1:5, 2, 4)
#> [1] FALSE  TRUE  TRUE  TRUE FALSE

# Mesofauna
phylacine %>%
  filter(mass_g > 1e3, mass_g < 1e5) %>%
  select(binomial, mass_g)
#> # A tibble: 1,126 x 2
#>   binomial                mass_g
#>   <chr>                    <dbl>
#> 1 Acerodon_jubatus         1075 
#> 2 Acinonyx_jubatus        46700 
#> 3 Acratocnus_odontrigonus 22990 
#> 4 Acratocnus_ye           21310 
#> 5 Addax_nasomaculatus     70000.
#> 6 Aepyceros_melampus      52500.
#> # … with 1,120 more rows

# same thing
phylacine %>%
  filter(mass_g %>% between(1e3, 1e5)) %>%
  select(binomial, mass_g)
#> # A tibble: 1,148 x 2
#>   binomial                mass_g
#>   <chr>                    <dbl>
#> 1 Acerodon_jubatus         1075 
#> 2 Acinonyx_jubatus        46700 
#> 3 Acratocnus_odontrigonus 22990 
#> 4 Acratocnus_ye           21310 
#> 5 Addax_nasomaculatus     70000.
#> 6 Aepyceros_melampus      52500.
#> # … with 1,142 more rows

Note that you can pipe operations inside function arguments as in the last line above (arguments are expressions, after all!).

3.4 Making new variables

3.4.1 Create new variables with `mutate()`

Very often in data analysis, you will want to create new variables, or edit existing ones. This is done easily through mutate(). For example, consider the diet data:

diet <- phylacine %>%
  select(
    binomial,
    contains("diet") & !contains(c("method", "source"))
  )
diet
#> # A tibble: 5,831 x 4
#>   binomial             diet_plant diet_vertebrate diet_invertebrate
#>   <chr>                     <dbl>           <dbl>             <dbl>
#> 1 Abditomys_latidens          100               0                 0
#> 2 Abeomelomys_sevia            78               3                19
#> 3 Abrawayaomys_ruschii         88               1                11
#> 4 Abrocoma_bennettii          100               0                 0
#> 5 Abrocoma_boliviensis        100               0                 0
#> 6 Abrocoma_budini             100               0                 0
#> # … with 5,825 more rows

These three variables show the percentage of each category of food that make the diet of that species. They should sum to 100, unless the authors made a typo or other entry error. To assert this, I’m going to create a new variable, total_diet.

diet <- diet %>% mutate(
  "total_diet" = diet_vertebrate + diet_invertebrate + diet_plant
)
diet
#> # A tibble: 5,831 x 5
#>   binomial             diet_plant diet_vertebrate diet_invertebrate total_diet
#>   <chr>                     <dbl>           <dbl>             <dbl>      <dbl>
#> 1 Abditomys_latidens          100               0                 0        100
#> 2 Abeomelomys_sevia            78               3                19        100
#> 3 Abrawayaomys_ruschii         88               1                11        100
#> 4 Abrocoma_bennettii          100               0                 0        100
#> 5 Abrocoma_boliviensis        100               0                 0        100
#> 6 Abrocoma_budini             100               0                 0        100
#> # … with 5,825 more rows

all(diet$total_diet == 100)
#> [1] TRUE
# cool and good

mutate() adds a variable to the table, and keeps all other variables. Sometimes you may want to just keep the new variable, and drop the other ones. That’s the job of mutate()’s twin sibling, transmute(). For example, I want to combine diet_invertebrate and diet_vertebrate together:

diet %>%
  transmute(
    "diet_animal" = diet_invertebrate + diet_vertebrate
  )
#> # A tibble: 5,831 x 1
#>   diet_animal
#>         <dbl>
#> 1           0
#> 2          22
#> 3          12
#> 4           0
#> 5           0
#> 6           0
#> # … with 5,825 more rows

You may want to keep some variables and drop others. You could pipe mutate() and select() to do so, or you could just pass the variables to keep to transmute().

diet %>%
  transmute(
    "diet_animal" = diet_invertebrate + diet_vertebrate,
    diet_plant
  )
#> # A tibble: 5,831 x 2
#>   diet_animal diet_plant
#>         <dbl>      <dbl>
#> 1           0        100
#> 2          22         78
#> 3          12         88
#> 4           0        100
#> 5           0        100
#> 6           0        100
#> # … with 5,825 more rows

You can also refer to variables you’re creating to derive new variables from them as part of the same operation, this is not an issue.

diet %>%
  transmute(
    "diet_animal" = diet_invertebrate + diet_vertebrate,
    diet_plant,
    "total_diet" = diet_animal + diet_plant
  )
#> # A tibble: 5,831 x 3
#>   diet_animal diet_plant total_diet
#>         <dbl>      <dbl>      <dbl>
#> 1           0        100        100
#> 2          22         78        100
#> 3          12         88        100
#> 4           0        100        100
#> 5           0        100        100
#> 6           0        100        100
#> # … with 5,825 more rows

Sometimes, you may need to perform an operation based on the row number (I don’t have a good example in mind). tibble has a built-in function to do just that:

phylacine %>%
  select(binomial) %>%
  tibble::rownames_to_column(var = "row_nb")
#> # A tibble: 5,831 x 2
#>   row_nb binomial            
#>   <chr>  <chr>               
#> 1 1      Abditomys_latidens  
#> 2 2      Abeomelomys_sevia   
#> 3 3      Abrawayaomys_ruschii
#> 4 4      Abrocoma_bennettii  
#> 5 5      Abrocoma_boliviensis
#> 6 6      Abrocoma_budini     
#> # … with 5,825 more rows

3.4.2 Summarise observations with `summarise()`

mutate() applies operations to all observations in a table. By contrast, summarise() applies operations to groups of observations, and returns, er, summaries. The default grouping unit is the entire table:

phylacine %>%
  summarise(
    "nb_species" = n(), # counts observations
    "nb_terrestrial" = sum(terrestrial),
    "nb_marine" = sum(marine),
    "nb_freshwater" = sum(freshwater),
    "nb_aerial" = sum(aerial),
    "mean_mass_g" = mean(mass_g)
  )
#> # A tibble: 1 x 6
#>   nb_species nb_terrestrial nb_marine nb_freshwater nb_aerial mean_mass_g
#>        <int>          <dbl>     <dbl>         <dbl>     <dbl>       <dbl>
#> 1       5831           4575       135           156      1162     156882.

Above you can see that bats account for a large portion of mammal species diversity (nb_aerial). How much exactly? Just as with mutate(), you can perform operations on the variables you just created, in the same statement:

phylacine %>%
  summarise(
    "nb_species" = n(),
    "nb_aerial" = sum(aerial), # bats
    "prop_aerial" = nb_aerial / nb_species
  )
#> # A tibble: 1 x 3
#>   nb_species nb_aerial prop_aerial
#>        <int>     <dbl>       <dbl>
#> 1       5831      1162       0.199

One fifth!

If the british spelling bothers you, summarize() exists and is strictly equivalent.

Here’s a simple trick with logical (TRUE / FALSE) variables. Their sum is the count of observations that evaluate to TRUE (because TRUE is taken as 1 and FALSE as 0) and their mean is the proportion of TRUE observations. This can be exploited to count the number of observations that satisfy a condition:

phylacine %>%
  summarise(
    "nb_species" = n(),
    "nb_megafauna" = sum(mass_g > 100000),
    "p_megafauna" = mean(mass_g > 100000)
  )
#> # A tibble: 1 x 3
#>   nb_species nb_megafauna p_megafauna
#>        <int>        <int>       <dbl>
#> 1       5831          302      0.0518

There are more summaries that just means and counts (see ?summarise() for some helpful functions). In fact, summarise can use any function or expression that evaluates to a single value or a vector of values. This includes base R max(), quantiles, etc.

mutate() and transmute() can compute summaries as well, but they will return the summary once for each observation, in a new column.

phylacine %>%
  mutate("nb_species" = n()) %>%
  select(binomial, nb_species)
#> # A tibble: 5,831 x 2
#>   binomial             nb_species
#>   <chr>                     <int>
#> 1 Abditomys_latidens         5831
#> 2 Abeomelomys_sevia          5831
#> 3 Abrawayaomys_ruschii       5831
#> 4 Abrocoma_bennettii         5831
#> 5 Abrocoma_boliviensis       5831
#> 6 Abrocoma_budini            5831
#> # … with 5,825 more rows

3.4.3 Grouping observations by variables

In most cases you don’t want to run summary operations on the entire set of observations, but instead on observations that share a common value, i.e. groups. For example, I want to run the summary displayed above, but for each Order of mammals.

distinct() extracts all the unique values of a variable

phylacine %>% distinct(order)
#> # A tibble: 29 x 1
#>   order          
#>   <chr>          
#> 1 Rodentia       
#> 2 Chiroptera     
#> 3 Carnivora      
#> 4 Pilosa         
#> 5 Diprotodontia  
#> 6 Cetartiodactyla
#> # … with 23 more rows

I could work my way with what we have already seen, filtering observations (filter(order == "Rodentia")) and then pipeing the output to summarise(), and do it again for each Order. But that would be tedious.

Instead, I can use group_by() to pool observations by order.

phylacine %>%
  group_by(order)
#> # A tibble: 5,831 x 24
#> # Groups:   order [29]
#>   binomial order family genus species terrestrial marine freshwater aerial
#>   <chr>    <chr> <chr>  <chr> <chr>         <dbl>  <dbl>      <dbl>  <dbl>
#> 1 Abditom… Rode… Murid… Abdi… latide…           1      0          0      0
#> 2 Abeomel… Rode… Murid… Abeo… sevia             1      0          0      0
#> 3 Abraway… Rode… Crice… Abra… ruschii           1      0          0      0
#> 4 Abrocom… Rode… Abroc… Abro… bennet…           1      0          0      0
#> 5 Abrocom… Rode… Abroc… Abro… bolivi…           1      0          0      0
#> 6 Abrocom… Rode… Abroc… Abro… budini            1      0          0      0
#> # … with 5,825 more rows, and 15 more variables: life_habit_method <chr>,
#> #   life_habit_source <chr>, mass_g <dbl>, mass_method <chr>,
#> #   mass_source <chr>, mass_comparison <chr>, mass_comparison_source <chr>,
#> #   island_endemicity <chr>, iucn_status <chr>, added_iucn_status <chr>,
#> #   diet_plant <dbl>, diet_vertebrate <dbl>, diet_invertebrate <dbl>,
#> #   diet_method <chr>, diet_source <chr>

At first glance, nothing has changed, apart from an extra line of information in the output that tells me the observations have been grouped. But now here’s what happen if I run the same summarise() statement on an ungrouped and a grouped table

phylacine %>%
  summarise(
    "n_species" = n(),
    "mean_mass_g" = mean(mass_g)
  )
#> # A tibble: 1 x 2
#>   n_species mean_mass_g
#>       <int>       <dbl>
#> 1      5831     156882.

phylacine %>%
  group_by(order) %>%
  summarise(
    "n_species" = n(),
    "mean_mass_g" = mean(mass_g)
  )
#> # A tibble: 29 x 3
#>   order           n_species mean_mass_g
#>   <chr>               <int>       <dbl>
#> 1 Afrosoricida           57       306. 
#> 2 Carnivora             313     47905. 
#> 3 Cetartiodactyla       392   1854811. 
#> 4 Chiroptera           1162        49.1
#> 5 Cingulata              39    235529. 
#> 6 Dasyuromorphia         74       748. 
#> # … with 23 more rows

I get one value for each group.

Observations can be grouped by multiple variables, which will output a summary for every unique combination of groups.

phylacine %>%
  group_by(order, iucn_status) %>%
  summarise(
    "n_species" = n()
  )
#> # A tibble: 138 x 3
#> # Groups:   order [29]
#>   order        iucn_status n_species
#>   <chr>        <chr>           <int>
#> 1 Afrosoricida CR                  1
#> 2 Afrosoricida DD                  4
#> 3 Afrosoricida EN                  7
#> 4 Afrosoricida EP                  2
#> 5 Afrosoricida LC                 32
#> 6 Afrosoricida NT                  3
#> # … with 132 more rows

Whenever you call summarise(), the last level of grouping is dropped. Note how in the output table above, observations are still grouped by order, and no longer by IUCN status. If I summarise observations again:

phylacine %>%
  group_by(order, iucn_status) %>%
  summarise(
    "n_species" = n()
  ) %>%
  summarise(
    "n_species_2" = n()
  )
#> # A tibble: 29 x 2
#>   order           n_species_2
#>   <chr>                 <int>
#> 1 Afrosoricida              7
#> 2 Carnivora                 8
#> 3 Cetartiodactyla           9
#> 4 Chiroptera                8
#> 5 Cingulata                 5
#> 6 Dasyuromorphia            7
#> # … with 23 more rows

I get the summary across orders, and the table is no longer grouped at all. This is useful to consider if you need to work on summaries across different levels of the data.

For example, I would like to know how the species in each order are distributed between the different levels of threat in the IUCN classification. To get these proportions, I need to first get the count of each number of species in a level of threat inside an order, and divide that by the number of species in that order.

phylacine %>%
  group_by(order, iucn_status) %>%
  summarise("n_order_iucn" = n()) %>%
  # grouping by iucn_status silently dropped
  mutate(
    "n_order" = sum(n_order_iucn),
    "p_iucn" = n_order_iucn / n_order
  )
#> # A tibble: 138 x 5
#> # Groups:   order [29]
#>   order        iucn_status n_order_iucn n_order p_iucn
#>   <chr>        <chr>              <int>   <int>  <dbl>
#> 1 Afrosoricida CR                     1      57 0.0175
#> 2 Afrosoricida DD                     4      57 0.0702
#> 3 Afrosoricida EN                     7      57 0.123 
#> 4 Afrosoricida EP                     2      57 0.0351
#> 5 Afrosoricida LC                    32      57 0.561 
#> 6 Afrosoricida NT                     3      57 0.0526
#> # … with 132 more rows

10.2% of Carnivores are Endangered (“EN”).

3.4.4 Grouped data and other `dplyr` verbs

Grouping does not only affect the behaviour of summarise, but under circumstances, other verbs can (and will!) perform operations by groups.

# Species with a higher mass than the mammal mean
phylacine %>%
  select("binomial", "mass_g") %>%
  filter(mass_g > mean(mass_g, na.rm = TRUE))
#> # A tibble: 234 x 2
#>   binomial                      mass_g
#>   <chr>                          <dbl>
#> 1 Alcelaphus_buselaphus        171002.
#> 2 Alces_alces                  356998 
#> 3 Archaeoindris_fontoynonti    160000 
#> 4 Arctocephalus_pusillus       178500 
#> 5 Arctodus_simus               709500 
#> 6 Balaena_mysticetus        100000000 
#> # … with 228 more rows

# Species with a higher mass than the mean in their order
phylacine %>%
  group_by(order) %>%
  select("binomial", "mass_g") %>%
  filter(mass_g > mean(mass_g, na.rm = TRUE))
#> # A tibble: 890 x 3
#> # Groups:   order [27]
#>   order      binomial             mass_g
#>   <chr>      <chr>                 <dbl>
#> 1 Chiroptera Acerodon_celebensis    390 
#> 2 Chiroptera Acerodon_humilis       600.
#> 3 Chiroptera Acerodon_jubatus      1075 
#> 4 Chiroptera Acerodon_leucotis      513.
#> 5 Chiroptera Acerodon_mackloti      470.
#> 6 Rodentia   Aeretes_melanopterus   732.
#> # … with 884 more rows

# Largest mammal
phylacine %>%
  select(binomial, mass_g) %>%
  slice_max(mass_g)
#> # A tibble: 1 x 2
#>   binomial                 mass_g
#>   <chr>                     <dbl>
#> 1 Balaenoptera_musculus 190000000
# Largest species in each order
phylacine %>%
  group_by(order) %>%
  select(binomial, mass_g) %>%
  slice_max(mass_g)
#> # A tibble: 30 x 3
#> # Groups:   order [29]
#>   order           binomial                             mass_g
#>   <chr>           <chr>                                 <dbl>
#> 1 Afrosoricida    Plesiorycteropus_madagascariensis     13220
#> 2 Carnivora       Mirounga_leonina                    1600000
#> 3 Cetartiodactyla Balaenoptera_musculus             190000000
#> 4 Chiroptera      Acerodon_jubatus                       1075
#> 5 Cingulata       Glyptodon_clavipes                  2000000
#> 6 Dasyuromorphia  Thylacinus_cynocephalus               30000
#> # … with 24 more rows

To avoid grouped operations, you can simply drop grouping with ungroup().

3.5 Working with multiple tables

3.5.1 Binding tables

dplyr introduces bind_rows() and bind_cols(), which are equivalent to base R rbind() and cbind(), with a few extra feature. They are faster, and can bind many tables at once, and bind data frames with vectors or lists.

bind_rows() has an option to pass a variable specifying which dataset each observation originates from.

porpoises <- phylacine %>%
  filter(family == "Phocoenidae") %>%
  select(binomial, iucn_status)
echidnas <- phylacine %>%
  filter(family == "Tachyglossidae") %>%
  select(binomial, iucn_status)

bind_rows(
  "porpoise" = porpoises,
  "echidna" = echidnas,
  .id = "kind"
)
#> # A tibble: 13 x 3
#>   kind     binomial                    iucn_status
#>   <chr>    <chr>                       <chr>      
#> 1 porpoise Neophocaena_asiaeorientalis VU         
#> 2 porpoise Neophocaena_phocaenoides    VU         
#> 3 porpoise Phocoena_dioptrica          DD         
#> 4 porpoise Phocoena_phocoena           LC         
#> 5 porpoise Phocoena_sinus              CR         
#> 6 porpoise Phocoena_spinipinnis        DD         
#> # … with 7 more rows

3.5.2 Combining variables of two tables with mutating joins

Mutating joins are tailored to combine tables that share a set of observations but have different variables.

As an example, let’s split the phylacine dataset in two smaller datasets, one containing information on diet and one on the dominant habitat.

diet <- phylacine %>%
  select(binomial, diet_plant:diet_invertebrate) %>%
  slice(1:5)
diet
#> # A tibble: 5 x 4
#>   binomial             diet_plant diet_vertebrate diet_invertebrate
#>   <chr>                     <dbl>           <dbl>             <dbl>
#> 1 Abditomys_latidens          100               0                 0
#> 2 Abeomelomys_sevia            78               3                19
#> 3 Abrawayaomys_ruschii         88               1                11
#> 4 Abrocoma_bennettii          100               0                 0
#> 5 Abrocoma_boliviensis        100               0                 0

life_habit <- phylacine %>% select(binomial, terrestrial:aerial) %>%
  slice(1:3, 6:7)
life_habit
#> # A tibble: 5 x 5
#>   binomial             terrestrial marine freshwater aerial
#>   <chr>                      <dbl>  <dbl>      <dbl>  <dbl>
#> 1 Abditomys_latidens             1      0          0      0
#> 2 Abeomelomys_sevia              1      0          0      0
#> 3 Abrawayaomys_ruschii           1      0          0      0
#> 4 Abrocoma_budini                1      0          0      0
#> 5 Abrocoma_cinerea               1      0          0      0

The two datasets each contain 5 species, the first three are shared, and the two last differ between the two.

intersect(diet$binomial, life_habit$binomial)
#> [1] "Abditomys_latidens"   "Abeomelomys_sevia"    "Abrawayaomys_ruschii"
setdiff(diet$binomial, life_habit$binomial)
#> [1] "Abrocoma_bennettii"   "Abrocoma_boliviensis"

To use mutate-joins, both tables need to have a key, a variable that identifies each observation. Here, that would be binomial, the sepcies names. If your table doesn’t have such a key and the rows between the tables match one another, remember you can create a row number variable easily with tibble::column_to_rownames().

inner_join(diet, life_habit, by = "binomial")
#> # A tibble: 3 x 8
#>   binomial diet_plant diet_vertebrate diet_invertebra… terrestrial marine
#>   <chr>         <dbl>           <dbl>            <dbl>       <dbl>  <dbl>
#> 1 Abditom…        100               0                0           1      0
#> 2 Abeomel…         78               3               19           1      0
#> 3 Abraway…         88               1               11           1      0
#> # … with 2 more variables: freshwater <dbl>, aerial <dbl>

inner_join combined the variables, and dropped the observations that weren’t matched between the two tables. There are three other variations of mutating joins, differing in what they do with unmatching variables.

left_join(diet, life_habit, by = "binomial")
#> # A tibble: 5 x 8
#>   binomial diet_plant diet_vertebrate diet_invertebra… terrestrial marine
#>   <chr>         <dbl>           <dbl>            <dbl>       <dbl>  <dbl>
#> 1 Abditom…        100               0                0           1      0
#> 2 Abeomel…         78               3               19           1      0
#> 3 Abraway…         88               1               11           1      0
#> 4 Abrocom…        100               0                0          NA     NA
#> 5 Abrocom…        100               0                0          NA     NA
#> # … with 2 more variables: freshwater <dbl>, aerial <dbl>
right_join(diet, life_habit, by = "binomial")
#> # A tibble: 5 x 8
#>   binomial diet_plant diet_vertebrate diet_invertebra… terrestrial marine
#>   <chr>         <dbl>           <dbl>            <dbl>       <dbl>  <dbl>
#> 1 Abditom…        100               0                0           1      0
#> 2 Abeomel…         78               3               19           1      0
#> 3 Abraway…         88               1               11           1      0
#> 4 Abrocom…         NA              NA               NA           1      0
#> 5 Abrocom…         NA              NA               NA           1      0
#> # … with 2 more variables: freshwater <dbl>, aerial <dbl>
full_join(diet, life_habit, by = "binomial")
#> # A tibble: 7 x 8
#>   binomial diet_plant diet_vertebrate diet_invertebra… terrestrial marine
#>   <chr>         <dbl>           <dbl>            <dbl>       <dbl>  <dbl>
#> 1 Abditom…        100               0                0           1      0
#> 2 Abeomel…         78               3               19           1      0
#> 3 Abraway…         88               1               11           1      0
#> 4 Abrocom…        100               0                0          NA     NA
#> 5 Abrocom…        100               0                0          NA     NA
#> 6 Abrocom…         NA              NA               NA           1      0
#> # … with 1 more row, and 2 more variables: freshwater <dbl>, aerial <dbl>

semi_join(diet, life_habit, by  = "binomial")
#> # A tibble: 3 x 4
#>   binomial             diet_plant diet_vertebrate diet_invertebrate
#>   <chr>                     <dbl>           <dbl>             <dbl>
#> 1 Abditomys_latidens          100               0                 0
#> 2 Abeomelomys_sevia            78               3                19
#> 3 Abrawayaomys_ruschii         88               1                11
anti_join(diet, life_habit, by  = "binomial")
#> # A tibble: 2 x 4
#>   binomial             diet_plant diet_vertebrate diet_invertebrate
#>   <chr>                     <dbl>           <dbl>             <dbl>
#> 1 Abrocoma_bennettii          100               0                 0
#> 2 Abrocoma_boliviensis        100               0                 0

3.5.3 Filtering matching observations between two tables wiht filtering joins

So-called filtering joins return row from the first table that are matched (or not, for anti_join()) in the second.

semi_join(diet, life_habit, by  = "binomial")
#> # A tibble: 3 x 4
#>   binomial             diet_plant diet_vertebrate diet_invertebrate
#>   <chr>                     <dbl>           <dbl>             <dbl>
#> 1 Abditomys_latidens          100               0                 0
#> 2 Abeomelomys_sevia            78               3                19
#> 3 Abrawayaomys_ruschii         88               1                11
anti_join(diet, life_habit, by  = "binomial")
#> # A tibble: 2 x 4
#>   binomial             diet_plant diet_vertebrate diet_invertebrate
#>   <chr>                     <dbl>           <dbl>             <dbl>
#> 1 Abrocoma_bennettii          100               0                 0
#> 2 Abrocoma_boliviensis        100               0                 0

Section 3 Data manipulation with dplyr

3.1 Introduction

3.1.1 Foreword on dplyr

3.1.2 Example data

3.2 Working with existing variables

3.2.1 Renaming variables with rename()

3.2.2 The pipe operator %>%

3.2.3 Select variables with select()