Section 3 Data manipulation with dplyr
3.1 Introduction
3.1.1 Foreword on dplyr
dplyr
is tasked with performing all sorts of transformations on a dataset.
The structure of dplyr
revolves around a set of functions, the so-called
verbs, that share a common syntax and logic, and are meant to work with one
another in chained operations. Chained operations are performed with the pipe
operator (%>%
), that will be introduced in section 3.2.2.
The basic syntax is verb(data, variable)
, where data
is a data frame and
variable
is the name of one or more columns containing a set of values for
each observation.
There are 5 main verbs, which names already hint at what they do: rename()
,
select()
, filter()
, mutate()
, and summarise()
.
I’m going to introduce each of them (and a couple more) through the following sections.
3.1.2 Example data
Through this tutorial, we will be using mammal trait data from the Phylacine database. Let’s have a peek at what it contains.
phylacine <- read_csv("data/phylacine_traits.csv")
phylacine
#> # A tibble: 5,831 x 24
#> Binomial.1.2 Order.1.2 Family.1.2 Genus.1.2 Species.1.2 Terrestrial Marine
#> <chr> <chr> <chr> <chr> <chr> <dbl> <dbl>
#> 1 Abditomys_l… Rodentia Muridae Abditomys latidens 1 0
#> 2 Abeomelomys… Rodentia Muridae Abeomelo… sevia 1 0
#> 3 Abrawayaomy… Rodentia Cricetidae Abrawaya… ruschii 1 0
#> 4 Abrocoma_be… Rodentia Abrocomid… Abrocoma bennettii 1 0
#> 5 Abrocoma_bo… Rodentia Abrocomid… Abrocoma boliviensis 1 0
#> 6 Abrocoma_bu… Rodentia Abrocomid… Abrocoma budini 1 0
#> # … with 5,825 more rows, and 17 more variables: Freshwater <dbl>,
#> # Aerial <dbl>, Life.Habit.Method <chr>, Life.Habit.Source <chr>,
#> # Mass.g <dbl>, Mass.Method <chr>, Mass.Source <chr>, Mass.Comparison <chr>,
#> # Mass.Comparison.Source <chr>, Island.Endemicity <chr>,
#> # IUCN.Status.1.2 <chr>, Added.IUCN.Status.1.2 <chr>, Diet.Plant <dbl>,
#> # Diet.Vertebrate <dbl>, Diet.Invertebrate <dbl>, Diet.Method <chr>,
#> # Diet.Source <chr>
readr
automatically loads the data in a tibble
, as we have seen in chapter
1 and 2. Calling the tibble gives a nice preview of what it contains. We have
data for 5,831 mammal species, and the variables contain information on taxonomy,
(broad) habitat, mass, IUCN status, and diet.
If you remember Section 1.2 on tidy data, you may see that this data isn’t exactly tidy. In fact, some columns are in wide (and messy) format, like the “habitat” (terrestrial, marine, etc.) and diet columns.
dplyr
actually does not require your data to be strictly tidy. If you feel that your
data satisfies the definition “one observation per row, one variable per column”,
that’s probably good enough.
I use a tibble
here, but dplyr
works equally well on base data frames. In
fact, dplyr
is built for data.frame
objects, and tibbles are data frames.
Therefore, tibbles are mortal.
3.2 Working with existing variables
3.2.1 Renaming variables with rename()
The variable names in the phylacine dataset are descriptive, but quite unpractical. Typing
Binomial.1.2.
is cumbersome and subject to typos (in fact, I just made one).
binomial
would be much simpler to use.
Changing names is straightforward with rename()
.
rename(.data = phylacine, "binomial" = Binomial.1.2)
#> # A tibble: 5,831 x 24
#> binomial Order.1.2 Family.1.2 Genus.1.2 Species.1.2 Terrestrial Marine
#> <chr> <chr> <chr> <chr> <chr> <dbl> <dbl>
#> 1 Abditom… Rodentia Muridae Abditomys latidens 1 0
#> 2 Abeomel… Rodentia Muridae Abeomelo… sevia 1 0
#> 3 Abraway… Rodentia Cricetidae Abrawaya… ruschii 1 0
#> 4 Abrocom… Rodentia Abrocomid… Abrocoma bennettii 1 0
#> 5 Abrocom… Rodentia Abrocomid… Abrocoma boliviensis 1 0
#> 6 Abrocom… Rodentia Abrocomid… Abrocoma budini 1 0
#> # … with 5,825 more rows, and 17 more variables: Freshwater <dbl>,
#> # Aerial <dbl>, Life.Habit.Method <chr>, Life.Habit.Source <chr>,
#> # Mass.g <dbl>, Mass.Method <chr>, Mass.Source <chr>, Mass.Comparison <chr>,
#> # Mass.Comparison.Source <chr>, Island.Endemicity <chr>,
#> # IUCN.Status.1.2 <chr>, Added.IUCN.Status.1.2 <chr>, Diet.Plant <dbl>,
#> # Diet.Vertebrate <dbl>, Diet.Invertebrate <dbl>, Diet.Method <chr>,
#> # Diet.Source <chr>
The first argument is always .data
, the data table you want to apply change to.
Note how columns are referred to. Once the data table as been passed as an
argument, there is no need to refer to it directly anymore, dplyr
understands
that you’re dealing with variables inside that data frame. So drop that
data$var
, data[, "var"]
, and forget the very existence of attach()
/
detach()
.
You can refer to variables names either with strings or directly as objects, whether you’re reading or creating them:
rename(
phylacine,
# this works
binomial = Binomial.1.2
)
rename(
phylacine,
# this works too!
binomial = "Binomial.1.2"
)
rename(
phylacine,
# guess what
"binomial" = "Binomial.1.2"
)
I have applied similar changes to all variables in the dataset. Here is what the new names look like:
#> # A tibble: 5,831 x 24
#> binomial order family genus species terrestrial marine freshwater aerial
#> <chr> <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
#> 1 Abditom… Rode… Murid… Abdi… latide… 1 0 0 0
#> 2 Abeomel… Rode… Murid… Abeo… sevia 1 0 0 0
#> 3 Abraway… Rode… Crice… Abra… ruschii 1 0 0 0
#> 4 Abrocom… Rode… Abroc… Abro… bennet… 1 0 0 0
#> 5 Abrocom… Rode… Abroc… Abro… bolivi… 1 0 0 0
#> 6 Abrocom… Rode… Abroc… Abro… budini 1 0 0 0
#> # … with 5,825 more rows, and 15 more variables: life_habit_method <chr>,
#> # life_habit_source <chr>, mass_g <dbl>, mass_method <chr>,
#> # mass_source <chr>, mass_comparison <chr>, mass_comparison_source <chr>,
#> # island_endemicity <chr>, iucn_status <chr>, added_iucn_status <chr>,
#> # diet_plant <dbl>, diet_vertebrate <dbl>, diet_invertebrate <dbl>,
#> # diet_method <chr>, diet_source <chr>
3.2.2 The pipe operator %>%
If you have already come across pieces of code using the tidyverse, chances are
that you have seen this odd symbol. While the pipe is not strictly-speaking a
part of the tidyverse (it comes from its own package, magrittr
), it is
imported along with each package and widely used in conjunction with its
functions.
What does it do? Consider the following example with rename()
:
phylacine2 <- readr::read_csv("data/phylacine_traits.csv")
# regular syntax
rename(phylacine2, "binomial" = "Binomial.1.2")
#> # A tibble: 5,831 x 24
#> binomial Order.1.2 Family.1.2 Genus.1.2 Species.1.2 Terrestrial Marine
#> <chr> <chr> <chr> <chr> <chr> <dbl> <dbl>
#> 1 Abditom… Rodentia Muridae Abditomys latidens 1 0
#> 2 Abeomel… Rodentia Muridae Abeomelo… sevia 1 0
#> 3 Abraway… Rodentia Cricetidae Abrawaya… ruschii 1 0
#> 4 Abrocom… Rodentia Abrocomid… Abrocoma bennettii 1 0
#> 5 Abrocom… Rodentia Abrocomid… Abrocoma boliviensis 1 0
#> 6 Abrocom… Rodentia Abrocomid… Abrocoma budini 1 0
#> # … with 5,825 more rows, and 17 more variables: Freshwater <dbl>,
#> # Aerial <dbl>, Life.Habit.Method <chr>, Life.Habit.Source <chr>,
#> # Mass.g <dbl>, Mass.Method <chr>, Mass.Source <chr>, Mass.Comparison <chr>,
#> # Mass.Comparison.Source <chr>, Island.Endemicity <chr>,
#> # IUCN.Status.1.2 <chr>, Added.IUCN.Status.1.2 <chr>, Diet.Plant <dbl>,
#> # Diet.Vertebrate <dbl>, Diet.Invertebrate <dbl>, Diet.Method <chr>,
#> # Diet.Source <chr>
# alternative syntax with the pipe operator
phylacine2 %>% rename("binomial" = "Binomial.1.2")
#> # A tibble: 5,831 x 24
#> binomial Order.1.2 Family.1.2 Genus.1.2 Species.1.2 Terrestrial Marine
#> <chr> <chr> <chr> <chr> <chr> <dbl> <dbl>
#> 1 Abditom… Rodentia Muridae Abditomys latidens 1 0
#> 2 Abeomel… Rodentia Muridae Abeomelo… sevia 1 0
#> 3 Abraway… Rodentia Cricetidae Abrawaya… ruschii 1 0
#> 4 Abrocom… Rodentia Abrocomid… Abrocoma bennettii 1 0
#> 5 Abrocom… Rodentia Abrocomid… Abrocoma boliviensis 1 0
#> 6 Abrocom… Rodentia Abrocomid… Abrocoma budini 1 0
#> # … with 5,825 more rows, and 17 more variables: Freshwater <dbl>,
#> # Aerial <dbl>, Life.Habit.Method <chr>, Life.Habit.Source <chr>,
#> # Mass.g <dbl>, Mass.Method <chr>, Mass.Source <chr>, Mass.Comparison <chr>,
#> # Mass.Comparison.Source <chr>, Island.Endemicity <chr>,
#> # IUCN.Status.1.2 <chr>, Added.IUCN.Status.1.2 <chr>, Diet.Plant <dbl>,
#> # Diet.Vertebrate <dbl>, Diet.Invertebrate <dbl>, Diet.Method <chr>,
#> # Diet.Source <chr>
Got it? The pipe takes the object on its left-side and silently feeds it to the
first argument of the function on its right-side. It could be read as “take x,
then do…”.
The reason for using the pipe is because it makes code syntax closer to
the syntax of a sentence, and therefore, easier and faster for your brain to
process (and write!) the code. In particular, the pipe enables easy chains of
operations, where you apply something to an object, then apply something else to
the outcome, and so on…
Through the later sections, you will see some examples of chained operations
with dplyr
functions, but
for that I first need to introduce a couple more verbs.
Using the pipe can be quite unsettling at first, because you are not used to
think in this way. But if you push a bit for it, I promise it will make things
a lot easier (and it’s quite addictive!). To avoid typing the tedious symbols,
magrittr
installs a shortcut for you in RStudio. Use Ctrl + Shift + M
on
Windows, and Cmd + Shift + M
on MacOS.
Finally I should emphasize that the use of the pipe isn’t limited to the tidyverse, but extends to almost all R functions. Remember that by default the piped value is always matched to the first argument of the following function
If you need to pass the left-hand side to an argument other than the first,
you can use the dot place-holder .
.
Because of its syntax, most base R operators are not compatible with the pipe
(but this is very rarely needed).
If needed, magrittr
introduces alternative functions for operators.
Subsetting operators can be piped, with the dot place-holder.
# 5 %>% * 3 # no, won't work
# 5 %>% .* 3 # neither
5 %>% magrittr::multiply_by(3) # yes
#> [1] 15
# subsetting
list("monkey see", "monkey_do") %>% .[[2]]
#> [1] "monkey_do"
phylacine %>% .$binomial %>% head()
#> [1] "Abditomys_latidens" "Abeomelomys_sevia" "Abrawayaomys_ruschii"
#> [4] "Abrocoma_bennettii" "Abrocoma_boliviensis" "Abrocoma_budini"
Because subsetting in this way is particularly hideous, dplyr
delivers a function to extract values from a single variable. In only works on tables, though.
3.2.3 Select variables with select()
To extract a set of variables (i.e. columns), use the conveniently-named
select()
. The basic syntax is the same as rename()
: pass your data as the
first argument, then call the variables to select, quoted or not.
# Single variable
phylacine %>% select(binomial)
#> # A tibble: 5,831 x 1
#> binomial
#> <chr>
#> 1 Abditomys_latidens
#> 2 Abeomelomys_sevia
#> 3 Abrawayaomys_ruschii
#> 4 Abrocoma_bennettii
#> 5 Abrocoma_boliviensis
#> 6 Abrocoma_budini
#> # … with 5,825 more rows
# A set of variables
phylacine %>% select(genus, "species", mass_g)
#> # A tibble: 5,831 x 3
#> genus species mass_g
#> <chr> <chr> <dbl>
#> 1 Abditomys latidens 269
#> 2 Abeomelomys sevia 52
#> 3 Abrawayaomys ruschii 63
#> 4 Abrocoma bennettii 250
#> 5 Abrocoma boliviensis 158
#> 6 Abrocoma budini 361.
#> # … with 5,825 more rows
# A range of contiguous variables
phylacine %>% select(family:terrestrial)
#> # A tibble: 5,831 x 4
#> family genus species terrestrial
#> <chr> <chr> <chr> <dbl>
#> 1 Muridae Abditomys latidens 1
#> 2 Muridae Abeomelomys sevia 1
#> 3 Cricetidae Abrawayaomys ruschii 1
#> 4 Abrocomidae Abrocoma bennettii 1
#> 5 Abrocomidae Abrocoma boliviensis 1
#> 6 Abrocomidae Abrocoma budini 1
#> # … with 5,825 more rows
You can select by variable numbers. This is not recommended, as prone to errors, especially if you change the variable order.
phylacine %>% select(2)
#> # A tibble: 5,831 x 1
#> order
#> <chr>
#> 1 Rodentia
#> 2 Rodentia
#> 3 Rodentia
#> 4 Rodentia
#> 5 Rodentia
#> 6 Rodentia
#> # … with 5,825 more rows
select()
can also be used to exclude variables:
phylacine %>% select(-binomial)
#> # A tibble: 5,831 x 23
#> order family genus species terrestrial marine freshwater aerial
#> <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
#> 1 Rode… Murid… Abdi… latide… 1 0 0 0
#> 2 Rode… Murid… Abeo… sevia 1 0 0 0
#> 3 Rode… Crice… Abra… ruschii 1 0 0 0
#> 4 Rode… Abroc… Abro… bennet… 1 0 0 0
#> 5 Rode… Abroc… Abro… bolivi… 1 0 0 0
#> 6 Rode… Abroc… Abro… budini 1 0 0 0
#> # … with 5,825 more rows, and 15 more variables: life_habit_method <chr>,
#> # life_habit_source <chr>, mass_g <dbl>, mass_method <chr>,
#> # mass_source <chr>, mass_comparison <chr>, mass_comparison_source <chr>,
#> # island_endemicity <chr>, iucn_status <chr>, added_iucn_status <chr>,
#> # diet_plant <dbl>, diet_vertebrate <dbl>, diet_invertebrate <dbl>,
#> # diet_method <chr>, diet_source <chr>
phylacine %>% select(-(binomial:species))
#> # A tibble: 5,831 x 19
#> terrestrial marine freshwater aerial life_habit_meth… life_habit_sour… mass_g
#> <dbl> <dbl> <dbl> <dbl> <chr> <chr> <dbl>
#> 1 1 0 0 0 Reported IUCN. 2016. IUC… 269
#> 2 1 0 0 0 Reported IUCN. 2016. IUC… 52
#> 3 1 0 0 0 Reported IUCN. 2016. IUC… 63
#> 4 1 0 0 0 Reported IUCN. 2016. IUC… 250
#> 5 1 0 0 0 Reported IUCN. 2016. IUC… 158
#> 6 1 0 0 0 Reported IUCN. 2016. IUC… 361.
#> # … with 5,825 more rows, and 12 more variables: mass_method <chr>,
#> # mass_source <chr>, mass_comparison <chr>, mass_comparison_source <chr>,
#> # island_endemicity <chr>, iucn_status <chr>, added_iucn_status <chr>,
#> # diet_plant <dbl>, diet_vertebrate <dbl>, diet_invertebrate <dbl>,
#> # diet_method <chr>, diet_source <chr>
select()
and rename()
are pretty similar, and in fact, select()
can also
rename variables along the way:
phylacine %>% select("linnaeus" = binomial)
#> # A tibble: 5,831 x 1
#> linnaeus
#> <chr>
#> 1 Abditomys_latidens
#> 2 Abeomelomys_sevia
#> 3 Abrawayaomys_ruschii
#> 4 Abrocoma_bennettii
#> 5 Abrocoma_boliviensis
#> 6 Abrocoma_budini
#> # … with 5,825 more rows
And you can mix all of that at once:
phylacine %>% select(
"fam" = family,
genus:freshwater,
-terrestrial
)
#> # A tibble: 5,831 x 5
#> fam genus species marine freshwater
#> <chr> <chr> <chr> <dbl> <dbl>
#> 1 Muridae Abditomys latidens 0 0
#> 2 Muridae Abeomelomys sevia 0 0
#> 3 Cricetidae Abrawayaomys ruschii 0 0
#> 4 Abrocomidae Abrocoma bennettii 0 0
#> 5 Abrocomidae Abrocoma boliviensis 0 0
#> 6 Abrocomidae Abrocoma budini 0 0
#> # … with 5,825 more rows
3.2.4 Select variables with helpers
The Rstudio team just released dplyr 1.0.0
, and along with it, some nice
helper functions to ease the selection of a set of variables. I give three
examples here, and encourage you to look at the documentation (?select()
) to
find out more.
phylacine %>% select(where(is.numeric))
#> # A tibble: 5,831 x 8
#> terrestrial marine freshwater aerial mass_g diet_plant diet_vertebrate
#> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 1 0 0 0 269 100 0
#> 2 1 0 0 0 52 78 3
#> 3 1 0 0 0 63 88 1
#> 4 1 0 0 0 250 100 0
#> 5 1 0 0 0 158 100 0
#> 6 1 0 0 0 361. 100 0
#> # … with 5,825 more rows, and 1 more variable: diet_invertebrate <dbl>
phylacine %>% select(contains("mass") | contains("diet"))
#> # A tibble: 5,831 x 10
#> mass_g mass_method mass_source mass_comparison mass_comparison… diet_plant
#> <dbl> <chr> <chr> <chr> <chr> <dbl>
#> 1 269 Reported Smith, F. … <NA> <NA> 100
#> 2 52 Reported Smith, F. … <NA> <NA> 78
#> 3 63 Reported Smith, F. … <NA> <NA> 88
#> 4 250 Reported Smith, F. … <NA> <NA> 100
#> 5 158 Reported Smith, F. … <NA> <NA> 100
#> 6 361. Assumed is… Journal of… Abrocoma_ciner… Journal of Mamm… 100
#> # … with 5,825 more rows, and 4 more variables: diet_vertebrate <dbl>,
#> # diet_invertebrate <dbl>, diet_method <chr>, diet_source <chr>
habitats <- c("terrestrial", "marine", "arboreal", "fossorial", "freshwater")
phylacine %>% select(any_of(habitats))
#> # A tibble: 5,831 x 3
#> terrestrial marine freshwater
#> <dbl> <dbl> <dbl>
#> 1 1 0 0
#> 2 1 0 0
#> 3 1 0 0
#> 4 1 0 0
#> 5 1 0 0
#> 6 1 0 0
#> # … with 5,825 more rows
3.2.5 Rearranging variable order with relocate()
The order of variables seldom matters in dplyr
, but due to popular demand,
dplyr
now has a dedicated verb to rearrange the order of variables. The syntax
is identical to rename()
, select()
.
phylacine %>% relocate(mass_g, .before = binomial)
#> # A tibble: 5,831 x 24
#> mass_g binomial order family genus species terrestrial marine freshwater
#> <dbl> <chr> <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl>
#> 1 269 Abditom… Rode… Murid… Abdi… latide… 1 0 0
#> 2 52 Abeomel… Rode… Murid… Abeo… sevia 1 0 0
#> 3 63 Abraway… Rode… Crice… Abra… ruschii 1 0 0
#> 4 250 Abrocom… Rode… Abroc… Abro… bennet… 1 0 0
#> 5 158 Abrocom… Rode… Abroc… Abro… bolivi… 1 0 0
#> 6 361. Abrocom… Rode… Abroc… Abro… budini 1 0 0
#> # … with 5,825 more rows, and 15 more variables: aerial <dbl>,
#> # life_habit_method <chr>, life_habit_source <chr>, mass_method <chr>,
#> # mass_source <chr>, mass_comparison <chr>, mass_comparison_source <chr>,
#> # island_endemicity <chr>, iucn_status <chr>, added_iucn_status <chr>,
#> # diet_plant <dbl>, diet_vertebrate <dbl>, diet_invertebrate <dbl>,
#> # diet_method <chr>, diet_source <chr>
phylacine %>% relocate(starts_with("diet"), .after = species)
#> # A tibble: 5,831 x 24
#> binomial order family genus species diet_plant diet_vertebrate
#> <chr> <chr> <chr> <chr> <chr> <dbl> <dbl>
#> 1 Abditom… Rode… Murid… Abdi… latide… 100 0
#> 2 Abeomel… Rode… Murid… Abeo… sevia 78 3
#> 3 Abraway… Rode… Crice… Abra… ruschii 88 1
#> 4 Abrocom… Rode… Abroc… Abro… bennet… 100 0
#> 5 Abrocom… Rode… Abroc… Abro… bolivi… 100 0
#> 6 Abrocom… Rode… Abroc… Abro… budini 100 0
#> # … with 5,825 more rows, and 17 more variables: diet_invertebrate <dbl>,
#> # diet_method <chr>, diet_source <chr>, terrestrial <dbl>, marine <dbl>,
#> # freshwater <dbl>, aerial <dbl>, life_habit_method <chr>,
#> # life_habit_source <chr>, mass_g <dbl>, mass_method <chr>,
#> # mass_source <chr>, mass_comparison <chr>, mass_comparison_source <chr>,
#> # island_endemicity <chr>, iucn_status <chr>, added_iucn_status <chr>
3.3 Working with observations
3.3.1 Ordering rows by value - arrange()
arrange()
sorts rows in the data by ascending value for a given variable.
Use the wrapper desc()
to sort by descending values instead.
# Smallest mammals
phylacine %>%
arrange(mass_g) %>%
select(binomial, mass_g)
#> # A tibble: 5,831 x 2
#> binomial mass_g
#> <chr> <dbl>
#> 1 Sorex_yukonicus 1.6
#> 2 Crocidura_levicula 1.8
#> 3 Suncus_remyi 1.8
#> 4 Crocidura_lusitania 2
#> 5 Kerivoula_minuta 2.1
#> 6 Suncus_etruscus 2.1
#> # … with 5,825 more rows
# Largest mammals
phylacine %>%
arrange(desc(mass_g)) %>%
select(binomial, mass_g)
#> # A tibble: 5,831 x 2
#> binomial mass_g
#> <chr> <dbl>
#> 1 Balaenoptera_musculus 190000000
#> 2 Balaena_mysticetus 100000000
#> 3 Balaenoptera_physalus 70000000
#> 4 Caperea_marginata 32000000
#> 5 Megaptera_novaeangliae 30000000
#> 6 Eschrichtius_robustus 28500000
#> # … with 5,825 more rows
# Extra variables are used to sort ties in the first variable
phylacine %>%
arrange(mass_g, desc(binomial)) %>%
select(binomial, mass_g)
#> # A tibble: 5,831 x 2
#> binomial mass_g
#> <chr> <dbl>
#> 1 Sorex_yukonicus 1.6
#> 2 Suncus_remyi 1.8
#> 3 Crocidura_levicula 1.8
#> 4 Crocidura_lusitania 2
#> 5 Suncus_etruscus 2.1
#> 6 Kerivoula_minuta 2.1
#> # … with 5,825 more rows
Important: NA
values, if present, are always ordered at the end!
3.3.2 Subset rows by position - slice()
Use slice()
and its variants to extract particular rows.
phylacine %>% slice(3) # third row
#> # A tibble: 1 x 24
#> binomial order family genus species terrestrial marine freshwater aerial
#> <chr> <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
#> 1 Abraway… Rode… Crice… Abra… ruschii 1 0 0 0
#> # … with 15 more variables: life_habit_method <chr>, life_habit_source <chr>,
#> # mass_g <dbl>, mass_method <chr>, mass_source <chr>, mass_comparison <chr>,
#> # mass_comparison_source <chr>, island_endemicity <chr>, iucn_status <chr>,
#> # added_iucn_status <chr>, diet_plant <dbl>, diet_vertebrate <dbl>,
#> # diet_invertebrate <dbl>, diet_method <chr>, diet_source <chr>
phylacine %>% slice(5, 1, 2) # fifth, first and second row
#> # A tibble: 3 x 24
#> binomial order family genus species terrestrial marine freshwater aerial
#> <chr> <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
#> 1 Abrocom… Rode… Abroc… Abro… bolivi… 1 0 0 0
#> 2 Abditom… Rode… Murid… Abdi… latide… 1 0 0 0
#> 3 Abeomel… Rode… Murid… Abeo… sevia 1 0 0 0
#> # … with 15 more variables: life_habit_method <chr>, life_habit_source <chr>,
#> # mass_g <dbl>, mass_method <chr>, mass_source <chr>, mass_comparison <chr>,
#> # mass_comparison_source <chr>, island_endemicity <chr>, iucn_status <chr>,
#> # added_iucn_status <chr>, diet_plant <dbl>, diet_vertebrate <dbl>,
#> # diet_invertebrate <dbl>, diet_method <chr>, diet_source <chr>
phylacine %>% slice(rep(3, 2)) # duplicate the third row
#> # A tibble: 2 x 24
#> binomial order family genus species terrestrial marine freshwater aerial
#> <chr> <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
#> 1 Abraway… Rode… Crice… Abra… ruschii 1 0 0 0
#> 2 Abraway… Rode… Crice… Abra… ruschii 1 0 0 0
#> # … with 15 more variables: life_habit_method <chr>, life_habit_source <chr>,
#> # mass_g <dbl>, mass_method <chr>, mass_source <chr>, mass_comparison <chr>,
#> # mass_comparison_source <chr>, island_endemicity <chr>, iucn_status <chr>,
#> # added_iucn_status <chr>, diet_plant <dbl>, diet_vertebrate <dbl>,
#> # diet_invertebrate <dbl>, diet_method <chr>, diet_source <chr>
phylacine %>% slice(-c(2:5830)) # exclude all but first and last row
#> # A tibble: 2 x 24
#> binomial order family genus species terrestrial marine freshwater aerial
#> <chr> <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
#> 1 Abditom… Rode… Murid… Abdi… latide… 1 0 0 0
#> 2 Zyzomys… Rode… Murid… Zyzo… woodwa… 1 0 0 0
#> # … with 15 more variables: life_habit_method <chr>, life_habit_source <chr>,
#> # mass_g <dbl>, mass_method <chr>, mass_source <chr>, mass_comparison <chr>,
#> # mass_comparison_source <chr>, island_endemicity <chr>, iucn_status <chr>,
#> # added_iucn_status <chr>, diet_plant <dbl>, diet_vertebrate <dbl>,
#> # diet_invertebrate <dbl>, diet_method <chr>, diet_source <chr>
phylacine %>% slice_tail(n = 3) # last three rows
#> # A tibble: 3 x 24
#> binomial order family genus species terrestrial marine freshwater aerial
#> <chr> <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
#> 1 Zyzomys… Rode… Murid… Zyzo… palata… 1 0 0 0
#> 2 Zyzomys… Rode… Murid… Zyzo… pedunc… 1 0 0 0
#> 3 Zyzomys… Rode… Murid… Zyzo… woodwa… 1 0 0 0
#> # … with 15 more variables: life_habit_method <chr>, life_habit_source <chr>,
#> # mass_g <dbl>, mass_method <chr>, mass_source <chr>, mass_comparison <chr>,
#> # mass_comparison_source <chr>, island_endemicity <chr>, iucn_status <chr>,
#> # added_iucn_status <chr>, diet_plant <dbl>, diet_vertebrate <dbl>,
#> # diet_invertebrate <dbl>, diet_method <chr>, diet_source <chr>
phylacine %>% slice_max(mass_g) # largest mammal
#> # A tibble: 1 x 24
#> binomial order family genus species terrestrial marine freshwater aerial
#> <chr> <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
#> 1 Balaeno… Ceta… Balae… Bala… muscul… 0 1 0 0
#> # … with 15 more variables: life_habit_method <chr>, life_habit_source <chr>,
#> # mass_g <dbl>, mass_method <chr>, mass_source <chr>, mass_comparison <chr>,
#> # mass_comparison_source <chr>, island_endemicity <chr>, iucn_status <chr>,
#> # added_iucn_status <chr>, diet_plant <dbl>, diet_vertebrate <dbl>,
#> # diet_invertebrate <dbl>, diet_method <chr>, diet_source <chr>
You can also sample random rows in the data:
phylacine %>% slice_sample() # a random row
#> # A tibble: 1 x 24
#> binomial order family genus species terrestrial marine freshwater aerial
#> <chr> <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
#> 1 Crocidu… Euli… Soric… Croc… levicu… 1 0 0 0
#> # … with 15 more variables: life_habit_method <chr>, life_habit_source <chr>,
#> # mass_g <dbl>, mass_method <chr>, mass_source <chr>, mass_comparison <chr>,
#> # mass_comparison_source <chr>, island_endemicity <chr>, iucn_status <chr>,
#> # added_iucn_status <chr>, diet_plant <dbl>, diet_vertebrate <dbl>,
#> # diet_invertebrate <dbl>, diet_method <chr>, diet_source <chr>
# bootstrap
phylacine %>% slice_sample(n = 5831, replace = TRUE)
#> # A tibble: 5,831 x 24
#> binomial order family genus species terrestrial marine freshwater aerial
#> <chr> <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
#> 1 Rhinolo… Chir… Rhino… Rhin… adami 0 0 0 1
#> 2 Hylomys… Euli… Erina… Hylo… megalo… 1 0 0 0
#> 3 Sciurus… Rode… Sciur… Sciu… yucata… 1 0 0 0
#> 4 Emballo… Chir… Embal… Emba… alecto 0 0 0 1
#> 5 Pteralo… Chir… Ptero… Pter… taki 0 0 0 1
#> 6 Lasiorh… Dipr… Vomba… Lasi… latifr… 1 0 0 0
#> # … with 5,825 more rows, and 15 more variables: life_habit_method <chr>,
#> # life_habit_source <chr>, mass_g <dbl>, mass_method <chr>,
#> # mass_source <chr>, mass_comparison <chr>, mass_comparison_source <chr>,
#> # island_endemicity <chr>, iucn_status <chr>, added_iucn_status <chr>,
#> # diet_plant <dbl>, diet_vertebrate <dbl>, diet_invertebrate <dbl>,
#> # diet_method <chr>, diet_source <chr>
3.3.3 Subsetting rows by value with filter()
filter()
does a similar job as slice()
, but extract rows that satisfy a set
of conditions. The conditions are supplied much the same way as you would do
for an if
statement.
Along with mutate()
(next section), this is probably the function you are
going to use the most.
For example, I might want to extract mammals above a given mass:
# megafauna
phylacine %>%
filter(mass_g > 1e5) %>% # 100 kg
select(binomial, mass_g)
#> # A tibble: 302 x 2
#> binomial mass_g
#> <chr> <dbl>
#> 1 Ailuropoda_melanoleuca 108400
#> 2 Alcelaphus_buselaphus 171002.
#> 3 Alces_alces 356998
#> 4 Archaeoindris_fontoynonti 160000
#> 5 Arctocephalus_forsteri 101250
#> 6 Arctocephalus_pusillus 178500
#> # … with 296 more rows
# non-extinct megafauna
phylacine %>%
filter(mass_g > 1e5, iucn_status != "EP") %>%
select(binomial, mass_g, iucn_status)
#> # A tibble: 178 x 3
#> binomial mass_g iucn_status
#> <chr> <dbl> <chr>
#> 1 Ailuropoda_melanoleuca 108400 VU
#> 2 Alcelaphus_buselaphus 171002. LC
#> 3 Alces_alces 356998 LC
#> 4 Arctocephalus_forsteri 101250 LC
#> 5 Arctocephalus_pusillus 178500 LC
#> 6 Arctocephalus_townsendi 105000 LC
#> # … with 172 more rows
Are there any flying mammals that aren’t bats?
phylacine %>%
filter(aerial == 1, order != "Chiroptera")
#> # A tibble: 0 x 24
#> # … with 24 variables: binomial <chr>, order <chr>, family <chr>, genus <chr>,
#> # species <chr>, terrestrial <dbl>, marine <dbl>, freshwater <dbl>,
#> # aerial <dbl>, life_habit_method <chr>, life_habit_source <chr>,
#> # mass_g <dbl>, mass_method <chr>, mass_source <chr>, mass_comparison <chr>,
#> # mass_comparison_source <chr>, island_endemicity <chr>, iucn_status <chr>,
#> # added_iucn_status <chr>, diet_plant <dbl>, diet_vertebrate <dbl>,
#> # diet_invertebrate <dbl>, diet_method <chr>, diet_source <chr>
# no :(
Are humans included in the table?
phylacine %>% filter(binomial == "Homo_sapiens")
#> # A tibble: 1 x 24
#> binomial order family genus species terrestrial marine freshwater aerial
#> <chr> <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
#> 1 Homo_sa… Prim… Homin… Homo sapiens 1 0 0 0
#> # … with 15 more variables: life_habit_method <chr>, life_habit_source <chr>,
#> # mass_g <dbl>, mass_method <chr>, mass_source <chr>, mass_comparison <chr>,
#> # mass_comparison_source <chr>, island_endemicity <chr>, iucn_status <chr>,
#> # added_iucn_status <chr>, diet_plant <dbl>, diet_vertebrate <dbl>,
#> # diet_invertebrate <dbl>, diet_method <chr>, diet_source <chr>
filter()
can be used to deal with NAs:
phylacine %>%
filter(!is.na(mass_comparison))
#> # A tibble: 754 x 24
#> binomial order family genus species terrestrial marine freshwater aerial
#> <chr> <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
#> 1 Abrocom… Rode… Abroc… Abro… budini 1 0 0 0
#> 2 Abrocom… Rode… Abroc… Abro… famati… 1 0 0 0
#> 3 Abrocom… Rode… Abroc… Abro… shista… 1 0 0 0
#> 4 Abrocom… Rode… Abroc… Abro… uspall… 1 0 0 0
#> 5 Abrocom… Rode… Abroc… Abro… vaccar… 1 0 0 0
#> 6 Acerodo… Chir… Ptero… Acer… humilis 0 0 0 1
#> # … with 748 more rows, and 15 more variables: life_habit_method <chr>,
#> # life_habit_source <chr>, mass_g <dbl>, mass_method <chr>,
#> # mass_source <chr>, mass_comparison <chr>, mass_comparison_source <chr>,
#> # island_endemicity <chr>, iucn_status <chr>, added_iucn_status <chr>,
#> # diet_plant <dbl>, diet_vertebrate <dbl>, diet_invertebrate <dbl>,
#> # diet_method <chr>, diet_source <chr>
Tip: dplyr
introduces the useful function between()
that does exactly what the name implies
between(1:5, 2, 4)
#> [1] FALSE TRUE TRUE TRUE FALSE
# Mesofauna
phylacine %>%
filter(mass_g > 1e3, mass_g < 1e5) %>%
select(binomial, mass_g)
#> # A tibble: 1,126 x 2
#> binomial mass_g
#> <chr> <dbl>
#> 1 Acerodon_jubatus 1075
#> 2 Acinonyx_jubatus 46700
#> 3 Acratocnus_odontrigonus 22990
#> 4 Acratocnus_ye 21310
#> 5 Addax_nasomaculatus 70000.
#> 6 Aepyceros_melampus 52500.
#> # … with 1,120 more rows
# same thing
phylacine %>%
filter(mass_g %>% between(1e3, 1e5)) %>%
select(binomial, mass_g)
#> # A tibble: 1,148 x 2
#> binomial mass_g
#> <chr> <dbl>
#> 1 Acerodon_jubatus 1075
#> 2 Acinonyx_jubatus 46700
#> 3 Acratocnus_odontrigonus 22990
#> 4 Acratocnus_ye 21310
#> 5 Addax_nasomaculatus 70000.
#> 6 Aepyceros_melampus 52500.
#> # … with 1,142 more rows
Note that you can pipe operations inside function arguments as in the last line above (arguments are expressions, after all!).
3.4 Making new variables
3.4.1 Create new variables with mutate()
Very often in data analysis, you will want to create new variables, or edit
existing ones. This is done easily through mutate()
. For example, consider
the diet data:
diet <- phylacine %>%
select(
binomial,
contains("diet") & !contains(c("method", "source"))
)
diet
#> # A tibble: 5,831 x 4
#> binomial diet_plant diet_vertebrate diet_invertebrate
#> <chr> <dbl> <dbl> <dbl>
#> 1 Abditomys_latidens 100 0 0
#> 2 Abeomelomys_sevia 78 3 19
#> 3 Abrawayaomys_ruschii 88 1 11
#> 4 Abrocoma_bennettii 100 0 0
#> 5 Abrocoma_boliviensis 100 0 0
#> 6 Abrocoma_budini 100 0 0
#> # … with 5,825 more rows
These three variables show the percentage of each category of food that make the
diet of that species. They should sum to 100, unless the authors made a typo or
other entry error. To assert this, I’m going to create a new variable,
total_diet
.
diet <- diet %>% mutate(
"total_diet" = diet_vertebrate + diet_invertebrate + diet_plant
)
diet
#> # A tibble: 5,831 x 5
#> binomial diet_plant diet_vertebrate diet_invertebrate total_diet
#> <chr> <dbl> <dbl> <dbl> <dbl>
#> 1 Abditomys_latidens 100 0 0 100
#> 2 Abeomelomys_sevia 78 3 19 100
#> 3 Abrawayaomys_ruschii 88 1 11 100
#> 4 Abrocoma_bennettii 100 0 0 100
#> 5 Abrocoma_boliviensis 100 0 0 100
#> 6 Abrocoma_budini 100 0 0 100
#> # … with 5,825 more rows
all(diet$total_diet == 100)
#> [1] TRUE
# cool and good
mutate()
adds a variable to the table, and keeps all other variables.
Sometimes you may want to just keep the new variable, and drop the other ones.
That’s the job of mutate()
’s twin sibling, transmute()
. For example,
I want to combine diet_invertebrate
and diet_vertebrate
together:
diet %>%
transmute(
"diet_animal" = diet_invertebrate + diet_vertebrate
)
#> # A tibble: 5,831 x 1
#> diet_animal
#> <dbl>
#> 1 0
#> 2 22
#> 3 12
#> 4 0
#> 5 0
#> 6 0
#> # … with 5,825 more rows
You may want to keep some variables and drop others. You could pipe mutate()
and select()
to do so, or you could just pass the variables to keep to
transmute()
.
diet %>%
transmute(
"diet_animal" = diet_invertebrate + diet_vertebrate,
diet_plant
)
#> # A tibble: 5,831 x 2
#> diet_animal diet_plant
#> <dbl> <dbl>
#> 1 0 100
#> 2 22 78
#> 3 12 88
#> 4 0 100
#> 5 0 100
#> 6 0 100
#> # … with 5,825 more rows
You can also refer to variables you’re creating to derive new variables from them as part of the same operation, this is not an issue.
diet %>%
transmute(
"diet_animal" = diet_invertebrate + diet_vertebrate,
diet_plant,
"total_diet" = diet_animal + diet_plant
)
#> # A tibble: 5,831 x 3
#> diet_animal diet_plant total_diet
#> <dbl> <dbl> <dbl>
#> 1 0 100 100
#> 2 22 78 100
#> 3 12 88 100
#> 4 0 100 100
#> 5 0 100 100
#> 6 0 100 100
#> # … with 5,825 more rows
Sometimes, you may need to perform an operation based on the row number
(I don’t have a good example in mind). tibble
has a built-in function to do
just that:
phylacine %>%
select(binomial) %>%
tibble::rownames_to_column(var = "row_nb")
#> # A tibble: 5,831 x 2
#> row_nb binomial
#> <chr> <chr>
#> 1 1 Abditomys_latidens
#> 2 2 Abeomelomys_sevia
#> 3 3 Abrawayaomys_ruschii
#> 4 4 Abrocoma_bennettii
#> 5 5 Abrocoma_boliviensis
#> 6 6 Abrocoma_budini
#> # … with 5,825 more rows
3.4.2 Summarise observations with summarise()
mutate()
applies operations to all observations in a table. By contrast,
summarise()
applies operations to groups of observations, and returns, er,
summaries. The default grouping unit is the entire table:
phylacine %>%
summarise(
"nb_species" = n(), # counts observations
"nb_terrestrial" = sum(terrestrial),
"nb_marine" = sum(marine),
"nb_freshwater" = sum(freshwater),
"nb_aerial" = sum(aerial),
"mean_mass_g" = mean(mass_g)
)
#> # A tibble: 1 x 6
#> nb_species nb_terrestrial nb_marine nb_freshwater nb_aerial mean_mass_g
#> <int> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 5831 4575 135 156 1162 156882.
Above you can see that bats account for a large portion of mammal species
diversity (nb_aerial
). How much exactly? Just as with mutate()
, you can
perform operations on the variables you just created, in the same statement:
phylacine %>%
summarise(
"nb_species" = n(),
"nb_aerial" = sum(aerial), # bats
"prop_aerial" = nb_aerial / nb_species
)
#> # A tibble: 1 x 3
#> nb_species nb_aerial prop_aerial
#> <int> <dbl> <dbl>
#> 1 5831 1162 0.199
One fifth!
If the british spelling bothers you, summarize()
exists and is strictly
equivalent.
Here’s a simple trick with logical (TRUE / FALSE) variables. Their sum is the
count of observations that evaluate to TRUE
(because TRUE
is taken as 1 and
FALSE
as 0) and their mean is the proportion of TRUE
observations. This can
be exploited to count the number of observations that satisfy a condition:
phylacine %>%
summarise(
"nb_species" = n(),
"nb_megafauna" = sum(mass_g > 100000),
"p_megafauna" = mean(mass_g > 100000)
)
#> # A tibble: 1 x 3
#> nb_species nb_megafauna p_megafauna
#> <int> <int> <dbl>
#> 1 5831 302 0.0518
There are more summaries that just means and counts
(see ?summarise()
for some helpful functions). In fact, summarise can
use any function or expression that evaluates to a single value or a vector
of values. This includes base R max()
, quantiles
, etc.
mutate()
and transmute()
can compute summaries as well, but they will
return the summary once for each observation, in a new column.
phylacine %>%
mutate("nb_species" = n()) %>%
select(binomial, nb_species)
#> # A tibble: 5,831 x 2
#> binomial nb_species
#> <chr> <int>
#> 1 Abditomys_latidens 5831
#> 2 Abeomelomys_sevia 5831
#> 3 Abrawayaomys_ruschii 5831
#> 4 Abrocoma_bennettii 5831
#> 5 Abrocoma_boliviensis 5831
#> 6 Abrocoma_budini 5831
#> # … with 5,825 more rows
3.4.3 Grouping observations by variables
In most cases you don’t want to run summary operations on the entire set of observations, but instead on observations that share a common value, i.e. groups. For example, I want to run the summary displayed above, but for each Order of mammals.
distinct()
extracts all the unique values of a variable
phylacine %>% distinct(order)
#> # A tibble: 29 x 1
#> order
#> <chr>
#> 1 Rodentia
#> 2 Chiroptera
#> 3 Carnivora
#> 4 Pilosa
#> 5 Diprotodontia
#> 6 Cetartiodactyla
#> # … with 23 more rows
I could work my way with what we have already seen, filtering observations (filter(order == "Rodentia")
) and then pipeing the output to summarise()
, and do it again for each Order. But that would be tedious.
Instead, I can use group_by()
to pool observations by order
.
phylacine %>%
group_by(order)
#> # A tibble: 5,831 x 24
#> # Groups: order [29]
#> binomial order family genus species terrestrial marine freshwater aerial
#> <chr> <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
#> 1 Abditom… Rode… Murid… Abdi… latide… 1 0 0 0
#> 2 Abeomel… Rode… Murid… Abeo… sevia 1 0 0 0
#> 3 Abraway… Rode… Crice… Abra… ruschii 1 0 0 0
#> 4 Abrocom… Rode… Abroc… Abro… bennet… 1 0 0 0
#> 5 Abrocom… Rode… Abroc… Abro… bolivi… 1 0 0 0
#> 6 Abrocom… Rode… Abroc… Abro… budini 1 0 0 0
#> # … with 5,825 more rows, and 15 more variables: life_habit_method <chr>,
#> # life_habit_source <chr>, mass_g <dbl>, mass_method <chr>,
#> # mass_source <chr>, mass_comparison <chr>, mass_comparison_source <chr>,
#> # island_endemicity <chr>, iucn_status <chr>, added_iucn_status <chr>,
#> # diet_plant <dbl>, diet_vertebrate <dbl>, diet_invertebrate <dbl>,
#> # diet_method <chr>, diet_source <chr>
At first glance, nothing has changed, apart from an extra line of information in the output that tells me the observations have been grouped. But now here’s what happen if I run the same summarise()
statement on an ungrouped and a grouped table
phylacine %>%
summarise(
"n_species" = n(),
"mean_mass_g" = mean(mass_g)
)
#> # A tibble: 1 x 2
#> n_species mean_mass_g
#> <int> <dbl>
#> 1 5831 156882.
phylacine %>%
group_by(order) %>%
summarise(
"n_species" = n(),
"mean_mass_g" = mean(mass_g)
)
#> # A tibble: 29 x 3
#> order n_species mean_mass_g
#> <chr> <int> <dbl>
#> 1 Afrosoricida 57 306.
#> 2 Carnivora 313 47905.
#> 3 Cetartiodactyla 392 1854811.
#> 4 Chiroptera 1162 49.1
#> 5 Cingulata 39 235529.
#> 6 Dasyuromorphia 74 748.
#> # … with 23 more rows
I get one value for each group.
Observations can be grouped by multiple variables, which will output a summary for every unique combination of groups.
phylacine %>%
group_by(order, iucn_status) %>%
summarise(
"n_species" = n()
)
#> # A tibble: 138 x 3
#> # Groups: order [29]
#> order iucn_status n_species
#> <chr> <chr> <int>
#> 1 Afrosoricida CR 1
#> 2 Afrosoricida DD 4
#> 3 Afrosoricida EN 7
#> 4 Afrosoricida EP 2
#> 5 Afrosoricida LC 32
#> 6 Afrosoricida NT 3
#> # … with 132 more rows
Whenever you call summarise()
, the last level of grouping is dropped. Note how in the output table above, observations are still grouped by order, and no longer by IUCN status. If I summarise observations again:
phylacine %>%
group_by(order, iucn_status) %>%
summarise(
"n_species" = n()
) %>%
summarise(
"n_species_2" = n()
)
#> # A tibble: 29 x 2
#> order n_species_2
#> <chr> <int>
#> 1 Afrosoricida 7
#> 2 Carnivora 8
#> 3 Cetartiodactyla 9
#> 4 Chiroptera 8
#> 5 Cingulata 5
#> 6 Dasyuromorphia 7
#> # … with 23 more rows
I get the summary across orders, and the table is no longer grouped at all. This is useful to consider if you need to work on summaries across different levels of the data.
For example, I would like to know how the species in each order are distributed between the different levels of threat in the IUCN classification. To get these proportions, I need to first get the count of each number of species in a level of threat inside an order, and divide that by the number of species in that order.
phylacine %>%
group_by(order, iucn_status) %>%
summarise("n_order_iucn" = n()) %>%
# grouping by iucn_status silently dropped
mutate(
"n_order" = sum(n_order_iucn),
"p_iucn" = n_order_iucn / n_order
)
#> # A tibble: 138 x 5
#> # Groups: order [29]
#> order iucn_status n_order_iucn n_order p_iucn
#> <chr> <chr> <int> <int> <dbl>
#> 1 Afrosoricida CR 1 57 0.0175
#> 2 Afrosoricida DD 4 57 0.0702
#> 3 Afrosoricida EN 7 57 0.123
#> 4 Afrosoricida EP 2 57 0.0351
#> 5 Afrosoricida LC 32 57 0.561
#> 6 Afrosoricida NT 3 57 0.0526
#> # … with 132 more rows
10.2% of Carnivores are Endangered (“EN”).
3.4.4 Grouped data and other dplyr
verbs
Grouping does not only affect the behaviour of summarise
, but under circumstances, other verbs can (and will!) perform operations by groups.
# Species with a higher mass than the mammal mean
phylacine %>%
select("binomial", "mass_g") %>%
filter(mass_g > mean(mass_g, na.rm = TRUE))
#> # A tibble: 234 x 2
#> binomial mass_g
#> <chr> <dbl>
#> 1 Alcelaphus_buselaphus 171002.
#> 2 Alces_alces 356998
#> 3 Archaeoindris_fontoynonti 160000
#> 4 Arctocephalus_pusillus 178500
#> 5 Arctodus_simus 709500
#> 6 Balaena_mysticetus 100000000
#> # … with 228 more rows
# Species with a higher mass than the mean in their order
phylacine %>%
group_by(order) %>%
select("binomial", "mass_g") %>%
filter(mass_g > mean(mass_g, na.rm = TRUE))
#> # A tibble: 890 x 3
#> # Groups: order [27]
#> order binomial mass_g
#> <chr> <chr> <dbl>
#> 1 Chiroptera Acerodon_celebensis 390
#> 2 Chiroptera Acerodon_humilis 600.
#> 3 Chiroptera Acerodon_jubatus 1075
#> 4 Chiroptera Acerodon_leucotis 513.
#> 5 Chiroptera Acerodon_mackloti 470.
#> 6 Rodentia Aeretes_melanopterus 732.
#> # … with 884 more rows
# Largest mammal
phylacine %>%
select(binomial, mass_g) %>%
slice_max(mass_g)
#> # A tibble: 1 x 2
#> binomial mass_g
#> <chr> <dbl>
#> 1 Balaenoptera_musculus 190000000
# Largest species in each order
phylacine %>%
group_by(order) %>%
select(binomial, mass_g) %>%
slice_max(mass_g)
#> # A tibble: 30 x 3
#> # Groups: order [29]
#> order binomial mass_g
#> <chr> <chr> <dbl>
#> 1 Afrosoricida Plesiorycteropus_madagascariensis 13220
#> 2 Carnivora Mirounga_leonina 1600000
#> 3 Cetartiodactyla Balaenoptera_musculus 190000000
#> 4 Chiroptera Acerodon_jubatus 1075
#> 5 Cingulata Glyptodon_clavipes 2000000
#> 6 Dasyuromorphia Thylacinus_cynocephalus 30000
#> # … with 24 more rows
To avoid grouped operations, you can simply drop grouping with ungroup()
.
3.5 Working with multiple tables
3.5.1 Binding tables
dplyr
introduces bind_rows()
and bind_cols()
, which are equivalent to base
R rbind()
and cbind()
, with a few extra feature. They are faster, and can
bind many tables at once, and bind data frames with vectors or lists.
bind_rows()
has an option to pass a variable specifying which dataset each
observation originates from.
porpoises <- phylacine %>%
filter(family == "Phocoenidae") %>%
select(binomial, iucn_status)
echidnas <- phylacine %>%
filter(family == "Tachyglossidae") %>%
select(binomial, iucn_status)
bind_rows(
"porpoise" = porpoises,
"echidna" = echidnas,
.id = "kind"
)
#> # A tibble: 13 x 3
#> kind binomial iucn_status
#> <chr> <chr> <chr>
#> 1 porpoise Neophocaena_asiaeorientalis VU
#> 2 porpoise Neophocaena_phocaenoides VU
#> 3 porpoise Phocoena_dioptrica DD
#> 4 porpoise Phocoena_phocoena LC
#> 5 porpoise Phocoena_sinus CR
#> 6 porpoise Phocoena_spinipinnis DD
#> # … with 7 more rows
3.5.2 Combining variables of two tables with mutating joins
Mutating joins are tailored to combine tables that share a set of observations but have different variables.
As an example, let’s split the phylacine
dataset in two smaller datasets,
one containing information on diet and one on the dominant habitat.
diet <- phylacine %>%
select(binomial, diet_plant:diet_invertebrate) %>%
slice(1:5)
diet
#> # A tibble: 5 x 4
#> binomial diet_plant diet_vertebrate diet_invertebrate
#> <chr> <dbl> <dbl> <dbl>
#> 1 Abditomys_latidens 100 0 0
#> 2 Abeomelomys_sevia 78 3 19
#> 3 Abrawayaomys_ruschii 88 1 11
#> 4 Abrocoma_bennettii 100 0 0
#> 5 Abrocoma_boliviensis 100 0 0
life_habit <- phylacine %>% select(binomial, terrestrial:aerial) %>%
slice(1:3, 6:7)
life_habit
#> # A tibble: 5 x 5
#> binomial terrestrial marine freshwater aerial
#> <chr> <dbl> <dbl> <dbl> <dbl>
#> 1 Abditomys_latidens 1 0 0 0
#> 2 Abeomelomys_sevia 1 0 0 0
#> 3 Abrawayaomys_ruschii 1 0 0 0
#> 4 Abrocoma_budini 1 0 0 0
#> 5 Abrocoma_cinerea 1 0 0 0
The two datasets each contain 5 species, the first three are shared, and the two last differ between the two.
intersect(diet$binomial, life_habit$binomial)
#> [1] "Abditomys_latidens" "Abeomelomys_sevia" "Abrawayaomys_ruschii"
setdiff(diet$binomial, life_habit$binomial)
#> [1] "Abrocoma_bennettii" "Abrocoma_boliviensis"
To use mutate-joins, both tables need to have a key, a variable that
identifies each observation. Here, that would be binomial
, the sepcies names.
If your table doesn’t have such a key and the rows between the tables match
one another, remember you can create a row number variable easily with
tibble::column_to_rownames()
.
inner_join(diet, life_habit, by = "binomial")
#> # A tibble: 3 x 8
#> binomial diet_plant diet_vertebrate diet_invertebra… terrestrial marine
#> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 Abditom… 100 0 0 1 0
#> 2 Abeomel… 78 3 19 1 0
#> 3 Abraway… 88 1 11 1 0
#> # … with 2 more variables: freshwater <dbl>, aerial <dbl>
inner_join
combined the variables, and dropped the observations that weren’t
matched between the two tables. There are three other variations of mutating
joins, differing in what they do with unmatching variables.
left_join(diet, life_habit, by = "binomial")
#> # A tibble: 5 x 8
#> binomial diet_plant diet_vertebrate diet_invertebra… terrestrial marine
#> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 Abditom… 100 0 0 1 0
#> 2 Abeomel… 78 3 19 1 0
#> 3 Abraway… 88 1 11 1 0
#> 4 Abrocom… 100 0 0 NA NA
#> 5 Abrocom… 100 0 0 NA NA
#> # … with 2 more variables: freshwater <dbl>, aerial <dbl>
right_join(diet, life_habit, by = "binomial")
#> # A tibble: 5 x 8
#> binomial diet_plant diet_vertebrate diet_invertebra… terrestrial marine
#> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 Abditom… 100 0 0 1 0
#> 2 Abeomel… 78 3 19 1 0
#> 3 Abraway… 88 1 11 1 0
#> 4 Abrocom… NA NA NA 1 0
#> 5 Abrocom… NA NA NA 1 0
#> # … with 2 more variables: freshwater <dbl>, aerial <dbl>
full_join(diet, life_habit, by = "binomial")
#> # A tibble: 7 x 8
#> binomial diet_plant diet_vertebrate diet_invertebra… terrestrial marine
#> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 Abditom… 100 0 0 1 0
#> 2 Abeomel… 78 3 19 1 0
#> 3 Abraway… 88 1 11 1 0
#> 4 Abrocom… 100 0 0 NA NA
#> 5 Abrocom… 100 0 0 NA NA
#> 6 Abrocom… NA NA NA 1 0
#> # … with 1 more row, and 2 more variables: freshwater <dbl>, aerial <dbl>
semi_join(diet, life_habit, by = "binomial")
#> # A tibble: 3 x 4
#> binomial diet_plant diet_vertebrate diet_invertebrate
#> <chr> <dbl> <dbl> <dbl>
#> 1 Abditomys_latidens 100 0 0
#> 2 Abeomelomys_sevia 78 3 19
#> 3 Abrawayaomys_ruschii 88 1 11
anti_join(diet, life_habit, by = "binomial")
#> # A tibble: 2 x 4
#> binomial diet_plant diet_vertebrate diet_invertebrate
#> <chr> <dbl> <dbl> <dbl>
#> 1 Abrocoma_bennettii 100 0 0
#> 2 Abrocoma_boliviensis 100 0 0
3.5.3 Filtering matching observations between two tables wiht filtering joins
So-called filtering joins return row from the first table that are matched
(or not, for anti_join()
) in the second.
semi_join(diet, life_habit, by = "binomial")
#> # A tibble: 3 x 4
#> binomial diet_plant diet_vertebrate diet_invertebrate
#> <chr> <dbl> <dbl> <dbl>
#> 1 Abditomys_latidens 100 0 0
#> 2 Abeomelomys_sevia 78 3 19
#> 3 Abrawayaomys_ruschii 88 1 11
anti_join(diet, life_habit, by = "binomial")
#> # A tibble: 2 x 4
#> binomial diet_plant diet_vertebrate diet_invertebrate
#> <chr> <dbl> <dbl> <dbl>
#> 1 Abrocoma_bennettii 100 0 0
#> 2 Abrocoma_boliviensis 100 0 0