1 Preamble

Suggested reading material: Chapter 5 on Data transformation in the R for Data Science book.

On the RStudio site there is also a nice cheat sheet

1.1 Needed libraries for this tutorial:

library(tidyverse)

1.2 Getting the example data into R

The data we are going to use is the web. It can be downloaded onto your computer by clicking on the link or what is better (because it becomes repoducable) download the data on to your computer (only have to do this once, the file is big):

download.file("http://www.hafro.is/~einarhj/crfmr/data-raw/SAU_FAO_31_v44-1.csv",
              "SAU_FAO_31_v44-1.csv")

and then read in the data each time you want to work on it:

sau <- read.csv("SAU_FAO_31_v44-1.csv",
                stringsAsFactors = FALSE)

1.3 Quick overview of the data

glimpse(sau)

## Observations: 275,736
## Variables: 13
## $ area_name        <chr> "Atlantic, Western Central", "Atlantic, Weste...
## $ area_type        <chr> "fao_area", "fao_area", "fao_area", "fao_area...
## $ year             <int> 1950, 1950, 1950, 1950, 1950, 1950, 1950, 195...
## $ scientific_name  <chr> "Marine fishes not identified", "Marine fishe...
## $ common_name      <chr> "Marine fishes nei", "Marine fishes nei", "Ma...
## $ functional_group <chr> "Medium demersals (30 - 89 cm)", "Medium deme...
## $ commercial_group <chr> "Other fishes & inverts", "Other fishes & inv...
## $ fishing_entity   <chr> "Antigua & Barbuda", "Antigua & Barbuda", "An...
## $ fishing_sector   <chr> "Subsistence", "Artisanal", "Artisanal", "Sub...
## $ catch_type       <chr> "Landings", "Landings", "Landings", "Landings...
## $ reporting_status <chr> "Unreported", "Reported", "Unreported", "Unre...
## $ tonnes           <dbl> 1.827121e+02, 1.889356e+01, 1.257746e+02, 1.5...
## $ landed_value     <dbl> 2.678560e+05, 2.769796e+04, 1.843856e+05, 2.2...

2 Grammar of data

The tidyverse package, through the dplyr and tidyr, are a set of tools for a common set of problems connected to aggregates or summaries of data.

Similar to ggplot2 they feature a Domain Specific Language (DSL) specially designed for data summaries.
Developed by Hadley Wickam, the creator ggplot2 and other useful tools.

Commonly, when collating summaries by group, one wants to:

Split up a big data structure into homogeneous pieces,
Apply a function to each piece
Combine all the results back together.

For example, one might want to

quickly calculate summary statistics by species
perform group-wise transformations like scaling or standardizing
fit the same model to different …

3 dplyr

Essentially dplyr offers a collection of simple but powerful verbs (functions) that facilitate this split-apply-combined process:

filter: keep rows matching criteria
select: pick or drop columns by name
arrange: order the rows according to a variable
mutate: add new variables
summarise: reduce variables to values
group_by: gives the group to apply the analysis functions to

The structure of these commands is always the same:

First argument is a data frame
Always return a data frame
Subsequent arguments say what to do with data frame
(Never modify in place)
It recognizes the columns of the data.frame as variables, that is only need to call col.name instead of dat$col.name

3.1 Select

Select only certain columns

sau.redux <- 
  select(sau, year, common_name, tonnes)
glimpse(sau.redux)

## Observations: 275,736
## Variables: 3
## $ year        <int> 1950, 1950, 1950, 1950, 1950, 1950, 1950, 1950, 19...
## $ common_name <chr> "Marine fishes nei", "Marine fishes nei", "Marine ...
## $ tonnes      <dbl> 1.827121e+02, 1.889356e+01, 1.257746e+02, 1.556623...

select all columns except area_name and area_type:

sau.redux <- 
  select(sau, -area_name, -area_type)
glimpse(sau.redux)

## Observations: 275,736
## Variables: 11
## $ year             <int> 1950, 1950, 1950, 1950, 1950, 1950, 1950, 195...
## $ scientific_name  <chr> "Marine fishes not identified", "Marine fishe...
## $ common_name      <chr> "Marine fishes nei", "Marine fishes nei", "Ma...
## $ functional_group <chr> "Medium demersals (30 - 89 cm)", "Medium deme...
## $ commercial_group <chr> "Other fishes & inverts", "Other fishes & inv...
## $ fishing_entity   <chr> "Antigua & Barbuda", "Antigua & Barbuda", "An...
## $ fishing_sector   <chr> "Subsistence", "Artisanal", "Artisanal", "Sub...
## $ catch_type       <chr> "Landings", "Landings", "Landings", "Landings...
## $ reporting_status <chr> "Unreported", "Reported", "Unreported", "Unre...
## $ tonnes           <dbl> 1.827121e+02, 1.889356e+01, 1.257746e+02, 1.5...
## $ landed_value     <dbl> 2.678560e+05, 2.769796e+04, 1.843856e+05, 2.2...

Other useful helper function

# not run
select(sau, starts_with("fishing"))     # Select all columns that start with "fishing"
select(sau, ends_with("name"))          # Select all columns that end with "name"
select(sau, scientific_name:catch_type) # Select all columns between "scientific_name" and "catch_type"
select(sau, contains("group"))          # Select all columns that contains "group"

Select also allows you to rename columns:

sau.redux <- 
  select(sau, year, country = fishing_entity, name = common_name, tonnes)
glimpse(sau.redux)

## Observations: 275,736
## Variables: 4
## $ year    <int> 1950, 1950, 1950, 1950, 1950, 1950, 1950, 1950, 1950, ...
## $ country <chr> "Antigua & Barbuda", "Antigua & Barbuda", "Antigua & B...
## $ name    <chr> "Marine fishes nei", "Marine fishes nei", "Marine fish...
## $ tonnes  <dbl> 1.827121e+02, 1.889356e+01, 1.257746e+02, 1.556623e+01...

but this also removes columns.

3.2 Rename

Instead of changing variable name with select (as shown above) one can use rename. The difference is that all original columns are intact. Lets for sake of sanity do that here on the original data frame (i.e. we overwrite the sau dataframe):

sau <- 
  rename(sau,
         area = area_name,
         type = area_type,
         latin = scientific_name,
         species = common_name,
         bgroup = functional_group,
         cgroup = commercial_group,
         country = fishing_entity,
         sector = fishing_sector,
         cclass = catch_type,
         status = reporting_status,
         value = landed_value)
glimpse(sau)

## Observations: 275,736
## Variables: 13
## $ area    <chr> "Atlantic, Western Central", "Atlantic, Western Centra...
## $ type    <chr> "fao_area", "fao_area", "fao_area", "fao_area", "fao_a...
## $ year    <int> 1950, 1950, 1950, 1950, 1950, 1950, 1950, 1950, 1950, ...
## $ latin   <chr> "Marine fishes not identified", "Marine fishes not ide...
## $ species <chr> "Marine fishes nei", "Marine fishes nei", "Marine fish...
## $ bgroup  <chr> "Medium demersals (30 - 89 cm)", "Medium demersals (30...
## $ cgroup  <chr> "Other fishes & inverts", "Other fishes & inverts", "O...
## $ country <chr> "Antigua & Barbuda", "Antigua & Barbuda", "Antigua & B...
## $ sector  <chr> "Subsistence", "Artisanal", "Artisanal", "Subsistence"...
## $ cclass  <chr> "Landings", "Landings", "Landings", "Landings", "Landi...
## $ status  <chr> "Unreported", "Reported", "Unreported", "Unreported", ...
## $ tonnes  <dbl> 1.827121e+02, 1.889356e+01, 1.257746e+02, 1.556623e+01...
## $ value   <dbl> 2.678560e+05, 2.769796e+04, 1.843856e+05, 2.282010e+04...

Lets also get rid of the column area, type and latin.

sau <- select(sau, -area, -type, -latin)

3.3 Filter

One can create a subset of the data using the filter command:

sau.dominica <-
  filter(sau,
         country == "Dominica")
glimpse(sau.dominica)

## Observations: 3,515
## Variables: 10
## $ year    <int> 1950, 1950, 1950, 1950, 1950, 1950, 1950, 1950, 1950, ...
## $ species <chr> "Marine fishes nei", "Marine fishes nei", "Marine fish...
## $ bgroup  <chr> "Medium demersals (30 - 89 cm)", "Medium demersals (30...
## $ cgroup  <chr> "Other fishes & inverts", "Other fishes & inverts", "O...
## $ country <chr> "Dominica", "Dominica", "Dominica", "Dominica", "Domin...
## $ sector  <chr> "Subsistence", "Subsistence", "Artisanal", "Subsistenc...
## $ cclass  <chr> "Landings", "Landings", "Landings", "Landings", "Landi...
## $ status  <chr> "Reported", "Unreported", "Reported", "Reported", "Unr...
## $ tonnes  <dbl> 97.98476, 83.12100, 50.47700, 7.92000, 7.56000, 4.0800...
## $ value   <dbl> 143645.661, 121855.379, 73999.280, 11610.720, 11082.96...

and filter even further (here two countries and one commerical group):

sau.dominica.tunas <-
  filter(sau,
         country %in% c("Dominica", "Barbados"),
         cgroup == "Tuna & billfishes")
glimpse(sau.dominica.tunas)

## Observations: 1,667
## Variables: 10
## $ year    <int> 1950, 1950, 1950, 1950, 1950, 1950, 1950, 1950, 1950, ...
## $ species <chr> "Billfishes", "Billfishes", "Bullet and frigate tunas"...
## $ bgroup  <chr> "Large pelagics (>=90 cm)", "Large pelagics (>=90 cm)"...
## $ cgroup  <chr> "Tuna & billfishes", "Tuna & billfishes", "Tuna & bill...
## $ country <chr> "Barbados", "Barbados", "Barbados", "Barbados", "Domin...
## $ sector  <chr> "Artisanal", "Recreational", "Recreational", "Recreati...
## $ cclass  <chr> "Landings", "Landings", "Landings", "Landings", "Landi...
## $ status  <chr> "Reported", "Unreported", "Unreported", "Unreported", ...
## $ tonnes  <dbl> 2.103319e+01, 1.831712e-03, 1.373784e-03, 1.373784e-03...
## $ value   <dbl> 3.083466e+04, 2.685289e+00, 2.013967e+00, 2.013967e+00...

Filter can use any logical statement:

 a == b   ## a is equal to b
 a != b   ## a is not equal to b
 a > b    ## a is greater than b
 a >= b   ## a is greater or equal to b
 a < b    ## a is less than b
 a <= b   ## a is less or equal to b
 a & b    ## a and b
 a | b    ## a or b
 !a       ## not a
 is.na(a) ## is a equal to NA (missing)
 a %in% b ## elements of column a that are also in b

A related function is the slice function, that simply filters the data by row number. E.g. to filter the 1st, 4th, 5th, 6th and the 100th row one can do:

sau.slice <- 
  slice(sau, c(1,4:6,100))
glimpse(sau.slice)

## Observations: 5
## Variables: 10
## $ year    <int> 1950, 1950, 1950, 1950, 1950
## $ species <chr> "Marine fishes nei", "Marine fishes nei", "Marine fish...
## $ bgroup  <chr> "Medium demersals (30 - 89 cm)", "Medium demersals (30...
## $ cgroup  <chr> "Other fishes & inverts", "Other fishes & inverts", "O...
## $ country <chr> "Antigua & Barbuda", "Bahamas", "Bahamas", "Bahamas", ...
## $ sector  <chr> "Subsistence", "Subsistence", "Artisanal", "Artisanal"...
## $ cclass  <chr> "Landings", "Landings", "Landings", "Landings", "Landi...
## $ status  <chr> "Unreported", "Unreported", "Reported", "Unreported", ...
## $ tonnes  <dbl> 182.71212, 15.56623, 10.01596, 28.00161, 18.52800
## $ value   <dbl> 267855.96, 22820.10, 14683.40, 41050.36, 27162.05

3.4 Arrange

Arrange by species name (in ascending order):

sau.arr <- arrange(sau, species)
glimpse(sau.arr)

## Observations: 275,736
## Variables: 10
## $ year    <int> 1950, 1950, 1950, 1950, 1950, 1951, 1951, 1951, 1951, ...
## $ species <chr> "Acoupa weakfish", "Acoupa weakfish", "Acoupa weakfish...
## $ bgroup  <chr> "Large demersals (>=90 cm)", "Large demersals (>=90 cm...
## $ cgroup  <chr> "Perch-likes", "Perch-likes", "Perch-likes", "Perch-li...
## $ country <chr> "Guyana", "Guyana", "Guyana", "Suriname", "Suriname", ...
## $ sector  <chr> "Subsistence", "Artisanal", "Artisanal", "Subsistence"...
## $ cclass  <chr> "Landings", "Landings", "Landings", "Landings", "Landi...
## $ status  <chr> "Unreported", "Reported", "Unreported", "Unreported", ...
## $ tonnes  <dbl> 325.077622, 16.889866, 5.066960, 1016.261896, 176.2660...
## $ value   <dbl> 476563.794, 24760.544, 7428.163, 1489839.939, 258405.9...

and in descending order:

sau.darr <- arrange(sau, desc(year))
glimpse(sau.darr)

## Observations: 275,736
## Variables: 10
## $ year    <int> 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, ...
## $ species <chr> "Marine fishes nei", "Marine fishes nei", "Marine fish...
## $ bgroup  <chr> "Medium demersals (30 - 89 cm)", "Medium demersals (30...
## $ cgroup  <chr> "Other fishes & inverts", "Other fishes & inverts", "O...
## $ country <chr> "Antigua & Barbuda", "Antigua & Barbuda", "Antigua & B...
## $ sector  <chr> "Subsistence", "Subsistence", "Artisanal", "Recreation...
## $ cclass  <chr> "Landings", "Landings", "Landings", "Landings", "Landi...
## $ status  <chr> "Reported", "Unreported", "Reported", "Reported", "Unr...
## $ tonnes  <dbl> 1.340877e+03, 7.534594e+01, 1.806750e+03, 3.349000e+02...
## $ value   <dbl> 1965725.1783, 110457.1498, 2648695.5000, 490963.4000, ...

You can also arrange by more that one column:

sau.arr2 <- arrange(sau, desc(year), species)
glimpse(sau.arr2)

## Observations: 275,736
## Variables: 10
## $ year    <int> 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, ...
## $ species <chr> "Acoupa weakfish", "Acoupa weakfish", "Acoupa weakfish...
## $ bgroup  <chr> "Large demersals (>=90 cm)", "Large demersals (>=90 cm...
## $ cgroup  <chr> "Perch-likes", "Perch-likes", "Perch-likes", "Perch-li...
## $ country <chr> "Guyana", "Guyana", "Guyana", "Belize", "Belize", "Bel...
## $ sector  <chr> "Subsistence", "Artisanal", "Artisanal", "Subsistence"...
## $ cclass  <chr> "Landings", "Landings", "Landings", "Landings", "Landi...
## $ status  <chr> "Unreported", "Reported", "Unreported", "Unreported", ...
## $ tonnes  <dbl> 56.6908346, 820.5335653, 247.6644052, 11.2982886, 2.15...
## $ value   <dbl> 83108.7636, 1202902.2068, 363076.0180, 16563.2911, 315...

3.5 Mutate

NOTE: Need to check what the unit of the value is

Mutate allows you to add new columns to your data. Let’s e.g. calculate the value per kg

sau.vperkg <-
  mutate(sau,
         value_per_kg = value/(tonnes * 1000))

You can also do more than one “mutation”, e.g. here we convert tonnes to kilograms and then use that variable to calculate value per kilogram

sau.vperkg <-
  mutate(sau,
         kg = tonnes/1000,
         value_per_kg = value/kg)

3.6 Summarise

Useful to create (not so) simple summaries of the data

sau.summ <- 
  summarise(sau, 
            stonnes = sum(tonnes, na.rm = TRUE),
            svalue  = sum(value,  na.rm = TRUE))

Useful helper function

n()             ## counts the number of rows
n_distinct(x)   ## counts the number distinct values of x
first(x)        ## first value of x
last(x)         ## last value of x
nth(x,n)        ## nth value of x

For example, if we want to calculate the number of records, number species, biological and commercial groups:

sau.summ2 <- 
  summarise(sau,
            n = n(),
            nspecies = n_distinct(species),
            nbgroup =  n_distinct(bgroup),
            ncgroup =  n_distinct(cgroup))
sau.summ2

##        n nspecies nbgroup ncgroup
## 1 275736      441      26      11

3.7 Group_by

Summarize is somewhat useful on its own but become much more useful when combined with a group_by statement

group_by Group data into rows with the same value of (a) particular variable(s)
ungroup Remove grouping information from data frame

sau.group <- 
  group_by(sau, country)
sau.group

## Source: local data frame [275,736 x 10]
## Groups: country [90]
## 
##     year           species                        bgroup
##    <int>             <chr>                         <chr>
## 1   1950 Marine fishes nei Medium demersals (30 - 89 cm)
## 2   1950 Marine fishes nei Medium demersals (30 - 89 cm)
## 3   1950 Marine fishes nei Medium demersals (30 - 89 cm)
## 4   1950 Marine fishes nei Medium demersals (30 - 89 cm)
## 5   1950 Marine fishes nei Medium demersals (30 - 89 cm)
## 6   1950 Marine fishes nei Medium demersals (30 - 89 cm)
## 7   1950 Marine fishes nei Medium demersals (30 - 89 cm)
## 8   1950 Marine fishes nei Medium demersals (30 - 89 cm)
## 9   1950 Marine fishes nei Medium demersals (30 - 89 cm)
## 10  1950 Marine fishes nei Medium demersals (30 - 89 cm)
## # ... with 275,726 more rows, and 7 more variables: cgroup <chr>,
## #   country <chr>, sector <chr>, cclass <chr>, status <chr>, tonnes <dbl>,
## #   value <dbl>

In the above code we have grouped the data by country (note the second line). If we now repeat the summary calculation we did above on the grouped dataframe (sau.group) we get the summary by each country:

sau.summ3 <- 
  summarise(sau.group,
            n = n(),
            nspecies = n_distinct(species),
            nbgroup =  n_distinct(bgroup),
            ncgroup =  n_distinct(cgroup))
sau.summ3

We can use the group_by function on more than one column, e.g. by country and catch-class and then apply some summary statistics:

sau.group2 <- 
  group_by(sau, country, cclass)
sau.summ4 <- 
  summarise(sau.group2,
            n = n(),
            nspecies = n_distinct(species),
            nbgroup =  n_distinct(bgroup),
            ncgroup =  n_distinct(cgroup))
sau.summ4

## Source: local data frame [165 x 6]
## Groups: country [?]
## 
##                country   cclass     n nspecies nbgroup ncgroup
##                  <chr>    <chr> <int>    <int>   <int>   <int>
## 1              Algeria Discards     1        1       1       1
## 2              Algeria Landings     6        6       2       2
## 3        Anguilla (UK) Landings  3673       22       9       7
## 4    Antigua & Barbuda Landings  4483       20       8       6
## 5            Argentina Discards    17        1       1       1
## 6            Argentina Landings    93       11       2       3
## 7  Aruba (Netherlands) Discards  1849       29      14       9
## 8  Aruba (Netherlands) Landings  2578       36      14       9
## 9              Bahamas Discards     8        1       1       1
## 10             Bahamas Landings  6211       42      10       6
## # ... with 155 more rows

3.8 Combining dplyr verbs

In R one can apply functions to data repeatedly:

sau.summ <- 
  arrange(summarise(group_by(select(sau, country, tonnes), country), n=n(), stonnes = sum(tonnes)), desc(stonnes))
sau.summ

The problem here is that this is very hard to read because in order to understand what has been done one has to read the code from inside out. What one can read can be written in some kind of a pseudo code:

select columns country and tonnes from the sau dataframe
group it by country
provide summary statistic of the number of records and the sum of the total catches
arrange it by the total catch in descending order (countries with highest catch first).

3.9 The `%>%` operator

dplyr allows chaining of operations using the $\%>\%$ (pipe). By use of the pipe one can get the same as above by:

sau %>% 
  select(country, tonnes) %>% 
  group_by(country) %>% 
  summarise(n = n(),
            stonnes = sum(tonnes)) %>% 
  arrange(desc(stonnes))

This needs a bit of explaining. Instead of calling a function with f(x,y) you can do x %>% f(y), that is “take x, then apply f with a setting y”. This essentially means that %>% operator takes the stuff on the left and places it as the first argument to function on the right hand side. If we take e.g. the first bit of code above:

# not run
sau %>% 
  select(country, tonnes)

It basically means that we pass the dataframe sau as the first argument in the select function.

4 tidyr

Data are normally not tidy …

The grammar of data manipulation