Section 1 Reading files and string manipulation

Load the packages for the day.
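The package-loading code is not shown in this text. A minimal sketch, assuming the three packages named in the headings of this section are installed:

```r
# Packages assumed from the section headings; install them first if needed
library(readr)    # reading and writing text files
library(stringr)  # string manipulation
library(glue)     # string interpolation
```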

1.1 Data import and export with readr

Ecologists and evolutionary biologists most often encounter data in the wild as text files, usually with the extensions .csv or .txt. Often, such data also has to be written to file from within R. The readr package contains a number of functions to help with reading and writing text files.

1.1.1 Reading data

Reading in a csv file with readr is done with the read_csv function, a faster alternative to the base R read.csv. Here, read_csv is applied to the mtcars example.
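The file used in the original example is not shown, so the sketch below assumes the copy of mtcars that ships with readr, located with readr_example. The show_col_types argument (readr 2.0 or later) only silences the column specification message.

```r
library(readr)

# readr bundles a csv copy of mtcars; readr_example() returns its path
mtcars_file <- readr_example("mtcars.csv")

# read the file into a tibble
mtcars_data <- read_csv(mtcars_file, show_col_types = FALSE)
head(mtcars_data)
```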

The read_csv2 function is useful when dealing with files where the separator between columns is a semicolon ;, and where the decimal point is represented by a comma ,.

Other variants include:

  • read_tsv for tab-separated files, and

  • read_delim, a general case which allows the separator to be specified manually.
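A minimal sketch of these variants, using small made-up datasets passed as literal text with I() (this assumes readr 2.0 or later; in practice a file path would be given instead):

```r
library(readr)

# semicolon separator and decimal comma
semicolon_data <- read_csv2(
  I("site;temperature\nA;12,5\nB;9,75\n"),
  show_col_types = FALSE
)

# tab separator
tab_data <- read_tsv(
  I("site\tcount\nA\t3\nB\t5\n"),
  show_col_types = FALSE
)

# any other separator, specified manually through delim
pipe_data <- read_delim(
  I("site|count\nA|3\nB|5\n"),
  delim = "|",
  show_col_types = FALSE
)
```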

The readr import functions attempt to guess the column types from the first N rows of the data. This N can be set using the function argument guess_max. The n_max argument sets the number of rows to read, while the skip argument sets the number of rows to be skipped before reading data.

By default, the column names are taken from the first row of the data, but they can be manually specified by passing a character vector to col_names.
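The arguments above can be combined as in the following sketch. The data are made up for illustration: a file with one line of logger metadata and no header row, passed as literal text with I() (readr 2.0 or later).

```r
library(readr)

# hypothetical raw text: one metadata line, then three data rows without a header
raw_text <- I("exported from logger 7\nA,12.5,3\nB,9.75,5\nC,11.0,2\n")

site_data <- read_csv(
  raw_text,
  skip = 1,                                        # skip the metadata line
  col_names = c("site", "temperature", "count"),   # names supplied manually
  n_max = 2,                                       # read only the first two rows
  guess_max = 2,                                   # guess types from two rows
  show_col_types = FALSE
)
site_data
```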

There are some other arguments to the data import functions, but the defaults usually just work.

1.1.2 Writing data

Writing data uses the write_* family of functions, with implementations for csv, csv2 etc. (represented by the asterisk), mirroring the import functions discussed above. The write_* functions offer the append argument, which allows a data frame to be added to an existing file.

These functions are not covered here.
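For orientation only, a minimal sketch of the append argument, using made-up data frames and a temporary file:

```r
library(readr)

out_file <- tempfile(fileext = ".csv")

first_batch <- data.frame(site = c("A", "B"), count = c(3, 5))
second_batch <- data.frame(site = "C", count = 2)

# the first call creates the file and writes the header
write_csv(first_batch, out_file)

# append = TRUE adds rows to the existing file without repeating the header
write_csv(second_batch, out_file, append = TRUE)

combined <- read_csv(out_file, show_col_types = FALSE)
```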

1.1.3 Reading and writing lines

Sometimes, there is text output generated in R which needs to be written to file, but is not in the form of a dataframe. A good example is model outputs. It is good practice to save model output as a text file, and add it to version control. Similarly, it may be necessary to import such text, either for display to screen, or to extract data.

This can be done using the readr functions read_lines and write_lines. Consider the model summary from a simple linear model.

The model summary can be written to file. When writing lines to file, be aware of the differences between Unix and Windows line separators. Usually, this causes no trouble.

This model output can be read back in for display, and each line of the model output is an element in a character vector.
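The original model is not shown. The sketch below assumes a regression of mpg on wt from the built-in mtcars data, and uses base R's capture.output to turn the printed summary into a character vector.

```r
library(readr)

# an assumed example model
model <- lm(mpg ~ wt, data = mtcars)

# capture the printed summary, one element per printed line
summary_lines <- capture.output(summary(model))

# write the lines to a temporary text file; sep defaults to "\n"
model_file <- tempfile(fileext = ".txt")
write_lines(summary_lines, model_file)

# read the lines back in as a character vector
model_text <- read_lines(model_file)
head(model_text)
```

The sep argument of write_lines can be changed, for example to "\r\n", when Windows-style line endings are required.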

These few functions demonstrate the most common uses of readr, but most other use cases for text data can be handled using different function arguments, including reading data off the web, unzipping compressed files before reading, and specifying the column types to control for type conversion errors.

Excel files

Finally, data is often shared or stored by well-meaning people in the form of Microsoft Excel sheets. Indeed, Excel (especially when synced regularly to remote storage) is a good way of noting down observational data in the field. The readxl package allows importing from Excel files, including reading in specific sheets.
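A minimal sketch of reading a specific sheet, assuming the readxl package is installed and using the example workbook that ships with it:

```r
library(readxl)

# readxl bundles an example workbook; readxl_example() returns its path
excel_file <- readxl_example("datasets.xlsx")

# list the sheets in the workbook
excel_sheets(excel_file)

# read one sheet by name
mtcars_sheet <- read_excel(excel_file, sheet = "mtcars")
```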

1.2 String manipulation with stringr

stringr is the tidyverse package for string manipulation, and exists in an interesting symbiosis with the stringi package. For the most part, stringr is a wrapper around stringi, and is almost always more than sufficient for day-to-day needs.

stringr functions begin with str_.

1.2.2 Detecting strings

Count the frequency of a pattern in a string with str_count, which returns an integer. Detect whether a pattern exists in a string with str_detect, which returns a logical and can be used as a predicate.

Both are vectorised, i.e., automatically applied to a vector of arguments.

Vectorising over both string and pattern works as expected.
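A minimal sketch with a made-up vector of fruit names:

```r
library(stringr)

fruits <- c("apple", "banana", "pear", "pineapple")

# count occurrences of "a" in each element
str_count(fruits, "a")                     # 1 3 1 1

# detect whether each element contains "an"
str_detect(fruits, "an")                   # FALSE TRUE FALSE FALSE

# vectorised over string and pattern: each element gets its own pattern
str_count(fruits, c("a", "b", "p", "p"))   # 1 1 1 3
```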

str_locate locates the search pattern in a string, and returns the start and end as a two column matrix.
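For example, with a made-up vector of fruit names:

```r
library(stringr)

fruits <- c("apple", "banana", "pear")

# one row per input string, with columns "start" and "end";
# strings without a match get NA in both columns
an_location <- str_locate(fruits, "an")
an_location
```

Only the first match in each string is located, so "banana" yields start 2 and end 3.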

Detect whether a string starts or ends with a pattern using str_starts and str_ends. Both are vectorised, and both have a negate argument, which returns the negative, i.e., returns FALSE if the search pattern is detected.
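A short sketch using str_starts and str_ends on a made-up vector (this assumes stringr 1.4 or later, where these functions are available):

```r
library(stringr)

fruits <- c("apple", "banana", "pear", "pineapple")

str_starts(fruits, "p")                  # FALSE FALSE TRUE TRUE
str_ends(fruits, "e")                    # TRUE FALSE FALSE TRUE

# negate = TRUE flips the result
str_starts(fruits, "p", negate = TRUE)   # TRUE TRUE FALSE FALSE
```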

str_subset (which is not related to str_sub) helps with subsetting a character vector based on a str_detect predicate. In the example, all elements containing “banana” are subset.

str_which has the same logic except that it returns the vector position and not the elements.
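The original vector is not shown. A sketch with a made-up one:

```r
library(stringr)

foods <- c("apple", "banana", "banana bread", "pear")

# the elements that contain "banana"
str_subset(foods, "banana")   # "banana" "banana bread"

# the positions of those elements
str_which(foods, "banana")    # 2 3
```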

1.2.3 Matching strings

str_match returns the first match of the pattern in each string. The return type is a character matrix, with one row per input string.

A simple case is shown below where the search pattern is the phrase “banana”.
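A minimal sketch with made-up strings. Note that str_match returns a character matrix with one row per input string, and NA where there is no match.

```r
library(stringr)

foods <- c("banana bread", "apple pie")

# one row per input string; the single column holds the full match
banana_match <- str_match(foods, "banana")
banana_match
```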

The search pattern can be extended to look for multiple subsets of the search pattern. Consider searching for dates and times.

Here, the search pattern is a regex pattern that looks for a set of four digits (\\d{4}) and a month name (\\w+) separated by a hyphen. There’s much more to be explored in dealing with dates and times in lubridate, another tidyverse package.

The return type is a character matrix with one row per input string, where the first column is the string subset matching the full search pattern, followed by as many columns as there are parts to the search pattern. The parts of interest in the search pattern are indicated by wrapping them in parentheses. For example, in the case below, wrapping [-.] in parentheses will turn it into a distinct part of the search pattern.
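The original date strings are not shown, so the sketch below uses made-up ones. It shows how each parenthesised part of the pattern becomes its own column in the matrix returned by str_match.

```r
library(stringr)

dates <- c("1970-january-01", "2001-march-15")

# two parts: the year and the month name
year_month <- str_match(dates, "(\\d{4})-(\\w+)")
year_month
# columns: full match, year, month

# wrapping the separator in parentheses adds a third part
with_separator <- str_match(dates, "(\\d{4})([-.])(\\w+)")
with_separator
# columns: full match, year, separator, month
```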

Multiple possible matches are dealt with using str_match_all. An example case is uncertainty in date-time in raw data, where the date has been entered as 1970-somemonth-01 or 1970/anothermonth/01.

The return type is a list, with one element per input string. Each element is a character matrix, where each row is one possible match, and each column after the first (the full match) corresponds to the parts of the search pattern.
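A sketch using the two date formats mentioned above in a single made-up string:

```r
library(stringr)

messy_date <- "1970-somemonth-01 or 1970/anothermonth/01"

# [-/] accepts either separator; the parentheses mark year and month
all_matches <- str_match_all(messy_date, "(\\d{4})[-/](\\w+)")

# a list with one element (one input string); the element is a matrix
# with one row per match and columns: full match, year, month
all_matches[[1]]
```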

1.2.4 Simpler pattern extraction

The full functionality of str_match_* can be boiled down to the most common use case, extracting one or more full matches of the search pattern using str_extract and str_extract_all respectively.

str_extract returns a character vector with the same length as the input string vector, while str_extract_all returns a list with one element per input string, each a character vector whose elements are the matches.
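The difference between the two can be seen in a sketch with made-up strings:

```r
library(stringr)

notes <- c("1970-january-01", "no date here", "1999-may-02 and 2000-june-03")

# first match per string, NA where nothing matches
first_match <- str_extract(notes, "\\d{4}-\\w+")
first_match     # "1970-january" NA "1999-may"

# all matches per string, as a list of character vectors
every_match <- str_extract_all(notes, "\\d{4}-\\w+")
every_match[[3]]   # "1999-may" "2000-june"
```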

1.2.6 Replacing string elements

str_replace replaces the first match of the search pattern, and can be co-opted into the task of recovering simulation parameters or other data from regularly named files. str_replace_all works the same way but replaces all matches of the search pattern.

str_remove is a wrapper around str_replace where the replacement is set to "". This is not covered here.

Having replaced unwanted characters in the filename with spaces, str_trim offers a way to remove leading and trailing whitespace.
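The original filenames are not shown. The sketch below uses hypothetical simulation output names to walk through the replace-then-trim steps.

```r
library(stringr)

# hypothetical filenames encoding a growth rate and a replicate number
files <- c("sim_growth0.5_rep1.csv", "sim_growth1.5_rep2.csv")

# str_replace changes only the first match: here, the file extension
no_extension <- str_replace(files, "\\.csv$", "")

# str_replace_all changes every run of lowercase letters and underscores
spaced <- str_replace_all(no_extension, "[a-z_]+", " ")
spaced               # " 0.5 1" " 1.5 2"

# remove the leading space left behind
parameters <- str_trim(spaced)
parameters           # "0.5 1" "1.5 2"
```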

1.2.7 Subsetting within strings

When strings are highly regular, useful data can be extracted from a string using str_sub. In the date-time example, the year is always represented by the first four characters.

Similarly, it’s possible to extract the last few characters using negative indices.

Finally, it’s also possible to replace characters within a string based on the position. This requires using the assignment operator <-.
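These three uses of str_sub can be sketched with made-up date strings:

```r
library(stringr)

dates <- c("1970-january-01", "2001-march-15")

# the first four characters: the year
years <- str_sub(dates, 1, 4)
years                    # "1970" "2001"

# negative indices count from the end: the last two characters
days <- str_sub(dates, -2, -1)
days                     # "01" "15"

# replace by position using the assignment form
str_sub(dates, 1, 4) <- "1999"
dates                    # "1999-january-01" "1999-march-15"
```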

1.2.8 Padding and truncating strings

Strings included in filenames or plots are often of unequal lengths, especially when they represent numbers. str_pad can pad strings with suitable characters to maintain equal length filenames, which are easier to work with.

Strings can also be truncated if they are too long.
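A minimal sketch of both operations, with made-up inputs:

```r
library(stringr)

replicate_ids <- c("1", "10", "100")

# pad on the left with zeros to a common width of three characters
padded_ids <- str_pad(replicate_ids, width = 3, side = "left", pad = "0")
padded_ids    # "001" "010" "100"

# truncate a long label to at most ten characters, including the ellipsis
long_label <- "a very long label for a plot"
short_label <- str_trunc(long_label, width = 10)
short_label
```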

1.2.9 Stringr aspects not covered here

Some stringr functions are not covered here. These include:

  • str_wrap (of dubious use),

  • str_interp, str_glue* (better to use glue; see below),

  • str_sort, str_order (used in sorting a character vector),

  • str_to_* functions such as str_to_upper and str_to_lower (case conversion),

  • str_view* (a graphical view of search pattern matches), and

  • word, boundary etc. The use of word is covered below.

stringi, of which stringr is a wrapper, offers a lot more flexibility and control.

1.3 String interpolation with glue

The idea behind string interpolation is to procedurally generate new complex strings from pre-existing data.

glue is as simple as the example shown.

This creates and prints a vector of car names stating each is a car model.
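The original example is not shown. A sketch along the same lines, assuming the car names are taken from the row names of the built-in mtcars data:

```r
library(glue)

# the first three car names from mtcars
car_names <- rownames(mtcars)[1:3]

# expressions inside {} are evaluated and pasted into the string
car_statements <- glue("{car_names} is a car model")
car_statements
```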

The related glue_data is even more useful in printing from a dataframe. In this example, it can quickly generate command line arguments or filenames.
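A sketch of glue_data generating filenames, using a made-up data frame of simulation parameters (the column names growth and run are hypothetical):

```r
library(glue)

# hypothetical simulation parameters
params <- data.frame(growth = c(0.5, 1.5), run = c(1, 2))

# column names in {} are looked up in the data frame, row by row
file_names <- glue_data(params, "sim_growth{growth}_run{run}.csv")
file_names
```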

Finally, the convenient glue_sql and glue_data_sql are used to safely write SQL queries where variables from data are appropriately quoted. This is not covered here, but it is good to know it exists.

glue has some more functions (glue_safe, glue_collapse, and glue_col), but these are infrequently used. Their functionality can be found on the glue GitHub page.

1.4 Strings in ggplot

ggplot has two geoms (wait for the ggplot tutorial to understand more about geoms) that work with text: geom_text and geom_label. These geoms allow text to be pasted on to the main body of a plot.

Often, these may overlap when the data are closely spaced. The package ggrepel offers another geom, geom_text_repel (and the related geom_label_repel) that help arrange text on a plot so it doesn’t overlap with other features. This is not perfect, but it works more often than not.

More examples can be found on the ggrepel website.

Here, the arguments to geom_text_repel are taken both from the mtcars data (position), as well as from the car brands extracted using stringr::word (labels), which tries to separate strings based on a regular pattern.
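The original plotting code is not shown. A sketch of what such a plot might look like, assuming ggplot2 and ggrepel are installed, that wt and mpg are the plotted variables, and that the brand is the first word of each row name:

```r
library(ggplot2)
library(ggrepel)

cars <- mtcars

# word() splits on spaces by default; the first word is taken as the brand
cars$brand <- stringr::word(rownames(cars), 1)

car_plot <- ggplot(cars, aes(x = wt, y = mpg, label = brand)) +
  geom_point() +
  geom_text_repel()   # labels are nudged apart to reduce overlap

car_plot
```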

The details of ggplot are covered in a later tutorial.

This is not a good looking plot, because it breaks other rules of plot design, such as whether this sort of plot should be made at all. Labels and text need to be applied sparingly, for example drawing attention or adding information to outliers.