1 Preamble
- 1.1 Needed libraries for this tutorial:
- 1.2 The example data
2 Grammar of data
3 dplyr

1 Preamble

NOTE TO READERS: The first version (until 27.1.2017) of this document used Sea Around Us (SAU) CRFM regional catch data. That version is now found in the link given below. The reason for switching was that the SAU data was a bit too much as a starter (275736 rows). And a bit limited in terms of what could be shown as computational statistics. The new data used to introduce the grammar of data is introduced below.

Here is the older version of the document that used the SAU data. It and the comment above will be removed from this site within the next few weeks.

Suggested reading material: Chapter 5 on Data transformation in the R for Data Science book.

On the RStudio site there is also a nice cheat sheet

1.1 Needed libraries for this tutorial:

library(tidyverse)

1.2 The example data

The data we are going to use is the regional catch and effort flying fish data that was used as a part of the UNU-FTP stock assessment course that was held some years ago. It data contains observation of catch and effort by year, month, country and vessel type.

Getting the example data into R:

ff <- read.csv("http://www.hafro.is/~einarhj/crfmr/data-raw/flyingfish.csv",
               stringsAsFactors = FALSE)

Quick overview of the data:

glimpse(ff)

## Observations: 589
## Variables: 6
## $ Year        <int> 1988, 1988, 1988, 1988, 1988, 1988, 1988, 1988, 19...
## $ Month       <int> 1, 2, 3, 4, 5, 6, 7, 11, 11, 12, 12, 1, 1, 2, 2, 3...
## $ Country     <chr> "Tobago", "Tobago", "Tobago", "Tobago", "Tobago", ...
## $ Vessel      <chr> "Dayboats", "Dayboats", "Dayboats", "Dayboats", "D...
## $ Weight..kg. <dbl> 14366, 14453, 12366, 5409, 8887, 10725, 243, 65, 5...
## $ Trips       <int> 132, 134, 225, 159, 154, 174, 29, 1, 5, 16, 10, 25...

2 Grammar of data

The tidyverse package, through the dplyr and tidyr, are a set of tools for a common set of problems connected to aggregates or summaries of data.

Similar to ggplot2 they feature a Domain Specific Language (DSL) specially designed for data summaries.
Developed by Hadley Wickam, the creator ggplot2 and other useful tools.

Commonly, when collating summaries by group, one wants to:

Split up a big data structure into homogeneous pieces,
Apply a function to each piece
Combine all the results back together.

For example, one might want to

quickly calculate summary statistics by species
perform group-wise transformations like scaling or standardizing
fit the same model to different …

3 dplyr

Essentially dplyr offers a collection of simple but powerful verbs (functions) that facilitate this split-apply-combined process:

filter: keep rows matching criteria
select: pick or drop columns by name
arrange: order the rows according to a variable
mutate: add new variables
summarise: reduce variables to values
group_by: gives the group to apply the analysis functions to

The structure of these commands is always the same:

First argument is a data frame
Always return a data frame
Subsequent arguments say what to do with data frame
(Never modify in place)
It recognizes the columns of the data.frame as variables, that is only need to call col.name instead of dat$col.name

3.1 Select

Select only certain columns

ff.redux <- 
  select(ff, Year, Month, Trips)
glimpse(ff.redux)

## Observations: 589
## Variables: 3
## $ Year  <int> 1988, 1988, 1988, 1988, 1988, 1988, 1988, 1988, 1988, 19...
## $ Month <int> 1, 2, 3, 4, 5, 6, 7, 11, 11, 12, 12, 1, 1, 2, 2, 3, 3, 4...
## $ Trips <int> 132, 134, 225, 159, 154, 174, 29, 1, 5, 16, 10, 25, 68, ...

select all columns except Country and Weight..kg. (e.g. if we were only interested in analysing the effort in some potential following steps):

ff.redux <- 
  select(ff, -Country, -Weight..kg.)
glimpse(ff.redux)

## Observations: 589
## Variables: 4
## $ Year   <int> 1988, 1988, 1988, 1988, 1988, 1988, 1988, 1988, 1988, 1...
## $ Month  <int> 1, 2, 3, 4, 5, 6, 7, 11, 11, 12, 12, 1, 1, 2, 2, 3, 3, ...
## $ Vessel <chr> "Dayboats", "Dayboats", "Dayboats", "Dayboats", "Dayboa...
## $ Trips  <int> 132, 134, 225, 159, 154, 174, 29, 1, 5, 16, 10, 25, 68,...

Select also allows you to rename columns:

ff.redux <- 
  select(ff, catch = Weight..kg., effort = Trips)
glimpse(ff.redux)

## Observations: 589
## Variables: 2
## $ catch  <dbl> 14366, 14453, 12366, 5409, 8887, 10725, 243, 65, 55, 19...
## $ effort <int> 132, 134, 225, 159, 154, 174, 29, 1, 5, 16, 10, 25, 68,...

but this also removes columns.

3.2 Rename

Instead of changing variable name with select (as shown above) one can use rename. The difference is that all original data are left intact. Lets for sake of sanity do that here on the original data frame (i.e. we overwrite the ff dataframe):

ff <- 
  rename(ff,
         month = Month,
         country = Country,
         vessel = Vessel,
         catch = Weight..kg.,
         effort = Trips)
glimpse(ff)

## Observations: 589
## Variables: 6
## $ Year    <int> 1988, 1988, 1988, 1988, 1988, 1988, 1988, 1988, 1988, ...
## $ month   <int> 1, 2, 3, 4, 5, 6, 7, 11, 11, 12, 12, 1, 1, 2, 2, 3, 3,...
## $ country <chr> "Tobago", "Tobago", "Tobago", "Tobago", "Tobago", "Tob...
## $ vessel  <chr> "Dayboats", "Dayboats", "Dayboats", "Dayboats", "Daybo...
## $ catch   <dbl> 14366, 14453, 12366, 5409, 8887, 10725, 243, 65, 55, 1...
## $ effort  <int> 132, 134, 225, 159, 154, 174, 29, 1, 5, 16, 10, 25, 68...

Noticed that I:

Overwrote the orginal ff-dataframe.
Changed all variables except Year. If you had used select instead of rename in the code above that column would have been dropped from the ff-dataframe.
In three of the cases I only changed the first capital letter from upper to lower. The reason is that if one has a capital letters in variable names that means an extra key stroke (SHIFT) in any subsequent codeing. I.e. just trying to save some extra physical activity :-)
Turned the variable name Weight..kg. and Trips into a more generic name:
- Firstly because I, the analysist know that the catch is in kg and the effort is number of trips.
- Secondly, and more importantly if I were to work later on a different data set, that would e.g. have variable names as Weight..lbs. and Hours for catch and effort by renaming them to catch and effort I could use the code below with minimal changes.

Now what is left is of couse to change the variable name Year to year (would have already done it in the step above but then you would possibly have missed the point of what rename does):

ff <- rename(ff, year = Year)

3.3 Filter

One can create a subset of the data using the filter command:

ff.barbados <-
  filter(ff,
         country == "Barbados")
glimpse(ff.barbados)

## Observations: 337
## Variables: 6
## $ year    <int> 1988, 1988, 1989, 1989, 1989, 1989, 1989, 1989, 1989, ...
## $ month   <int> 11, 12, 1, 2, 3, 4, 5, 6, 7, 11, 12, 1, 2, 3, 4, 5, 6,...
## $ country <chr> "Barbados", "Barbados", "Barbados", "Barbados", "Barba...
## $ vessel  <chr> "Iceboats", "Iceboats", "Iceboats", "Iceboats", "Icebo...
## $ catch   <dbl> 65, 19279, 32884, 16356, 836, 7768, 14662, 17645, 6145...
## $ effort  <int> 1, 16, 25, 12, 3, 11, 23, 26, 10, 2, 3, 13, 8, 8, 11, ...

Notice here that ff.barbados has only 337 obsverations compared with the full dataset having 589 observations.

One can filter further:

ff.barbados.dayboats <-
  filter(ff,
       country %in% c("Barbados"),
       vessel == "Dayboats")
glimpse(ff.barbados.dayboats)

## Observations: 146
## Variables: 6
## $ year    <int> 1994, 1994, 1994, 1994, 1994, 1994, 1994, 1995, 1995, ...
## $ month   <int> 1, 2, 3, 4, 5, 6, 7, 1, 2, 3, 4, 5, 6, 7, 8, 10, 11, 1...
## $ country <chr> "Barbados", "Barbados", "Barbados", "Barbados", "Barba...
## $ vessel  <chr> "Dayboats", "Dayboats", "Dayboats", "Dayboats", "Daybo...
## $ catch   <dbl> 1098.60, 749.80, 1681.60, 1142.20, 636.40, 1984.00, 30...
## $ effort  <int> 9, 11, 21, 12, 15, 14, 2, 1388, 1204, 1182, 776, 1878,...

So now we are down to 146 observations (number of observations of Barbados dayboats).

Filter can use any logical statement:

 a == b   ## a is equal to b
 a != b   ## a is not equal to b
 a > b    ## a is greater than b
 a >= b   ## a is greater or equal to b
 a < b    ## a is less than b
 a <= b   ## a is less or equal to b
 a & b    ## a and b
 a | b    ## a or b
 !a       ## not a
 is.na(a) ## is a equal to NA (missing)
 a %in% b ## elements of column a that are also in b

For example (results not shown, try it out yourself):

d1 <- filter(ff, year >= 2001)
glimpse(d1)
d2 <- filter(ff, year >= 2001, month == 12)
glimpse(d2)
d3 <- filter(ff, year >= 2001, month == 12, country == "StLucia")
glimpse(d3)
d4 <- filter(ff, year >= 2001, month == 12, country == "StLucia", effort < 25)
glimpse(d4)
d5 <- filter(ff, year >= 2001, month == 12, country == "StLucia", effort < 25, catch > 400)
glimpse(d5)

Notice that the in the last step (created object d5) we ended up with a dataframe that contains only one observation (row).

Further along this line we can e.g. exclude the Barbados obervation from the data. This can be done in three ways, all leading to the same results (not run, try it yourself):

d1 <- filter(ff, country != "Barbados")
glimpse(d1)
d2 <- filter(ff, country %in% c("StLucia", "Tobago"))
glimpse(d2)
d3 <- filter(ff, country == "StLucia" | country == "Tobago")
glimpse(d3)

A related function is the slice function, that simply filters the data by row number. E.g. to filter the 1st, 4th, 5th, 6th and the 100th row one can do:

ff.slice <- 
  slice(ff, c(1,4:6,100))
glimpse(ff.slice)

## Observations: 5
## Variables: 6
## $ year    <int> 1988, 1988, 1988, 1988, 1993
## $ month   <int> 1, 4, 5, 6, 12
## $ country <chr> "Tobago", "Tobago", "Tobago", "Tobago", "Tobago"
## $ vessel  <chr> "Dayboats", "Dayboats", "Dayboats", "Dayboats", "Daybo...
## $ catch   <dbl> 14366, 5409, 8887, 10725, 7149
## $ effort  <int> 132, 159, 154, 174, 45

Not very useful on its own, but the power of the slice function will be shown later.

3.4 Arrange

Arrange by catch (in ascending order):

ff.arr <- arrange(ff, catch)
glimpse(ff.arr)

## Observations: 589
## Variables: 6
## $ year    <int> 1995, 1999, 2007, 2003, 1992, 1997, 2001, 2003, 1996, ...
## $ month   <int> 4, 4, 7, 3, 11, 8, 3, 8, 11, 4, 10, 9, 11, 9, 7, 10, 1...
## $ country <chr> "StLucia", "StLucia", "StLucia", "StLucia", "Tobago", ...
## $ vessel  <chr> "Dayboats", "Dayboats", "Dayboats", "Dayboats", "Daybo...
## $ catch   <dbl> 0.4545455, 0.9090909, 1.8181818, 4.7727273, 5.0000000,...
## $ effort  <int> 1, 1, 1, 2, 6, 1, 2, 1, 1, 1, 1, 6, 3, 3, 2, 2, 1, 1, ...

or in descending order:

ff.darr <- arrange(ff, desc(catch))
glimpse(ff.darr)

## Observations: 589
## Variables: 6
## $ year    <int> 2003, 2008, 2008, 2004, 1998, 1998, 2003, 1998, 2008, ...
## $ month   <int> 5, 4, 5, 3, 5, 3, 6, 6, 2, 6, 2, 5, 5, 4, 5, 6, 3, 5, ...
## $ country <chr> "Barbados", "Barbados", "Barbados", "Barbados", "Barba...
## $ vessel  <chr> "Iceboats", "Iceboats", "Iceboats", "Iceboats", "Icebo...
## $ catch   <dbl> 317393.0, 299613.0, 284108.0, 280106.0, 259247.0, 2544...
## $ effort  <int> 259, 185, 202, 679, 267, 249, 218, 213, 151, 199, 216,...

You can also arrange by more that one column:

ff.arr2 <- arrange(ff, desc(year), desc(catch))
glimpse(ff.arr2)

## Observations: 589
## Variables: 6
## $ year    <int> 2008, 2008, 2008, 2008, 2008, 2008, 2008, 2008, 2008, ...
## $ month   <int> 4, 5, 2, 3, 1, 6, 5, 6, 4, 3, 1, 2, 5, 4, 3, 6, 12, 1,...
## $ country <chr> "Barbados", "Barbados", "Barbados", "Barbados", "Barba...
## $ vessel  <chr> "Iceboats", "Iceboats", "Iceboats", "Iceboats", "Icebo...
## $ catch   <dbl> 299613, 284108, 218236, 196913, 163096, 160004, 55611,...
## $ effort  <int> 185, 202, 151, 198, 116, 133, 293, 226, 188, 217, 221,...

Which basically means that we have sorted the data such that the last year of observation (2008) comes first and then within any year the highest catch comes first. In the particular case fist record is for the year 2008, month is March (4), country is Barbados and the Vessel type is Iceboats. Not very interesting in itself (one could have achieved the same in Excel), but hold on - this simple function may become useful further down the line.

3.5 Mutate

Mutate allows you to add new columns to your data. Let’s e.g. calculate the value per kg

ff.cpue <-
  mutate(ff,
         cpue = catch/effort)

You can also do more than one “mutation”, e.g. here we convert kilograms to pund, and then use that variable to calculate cpue in units of pounds per trip.

ff.cpue_lbs <-
  mutate(ff,
         catch_lbs = catch * 2.20462262,
         cpue_lbs =  catch_lbs/effort)
glimpse(ff.cpue_lbs)

## Observations: 589
## Variables: 8
## $ year      <int> 1988, 1988, 1988, 1988, 1988, 1988, 1988, 1988, 1988...
## $ month     <int> 1, 2, 3, 4, 5, 6, 7, 11, 11, 12, 12, 1, 1, 2, 2, 3, ...
## $ country   <chr> "Tobago", "Tobago", "Tobago", "Tobago", "Tobago", "T...
## $ vessel    <chr> "Dayboats", "Dayboats", "Dayboats", "Dayboats", "Day...
## $ catch     <dbl> 14366, 14453, 12366, 5409, 8887, 10725, 243, 65, 55,...
## $ effort    <int> 132, 134, 225, 159, 154, 174, 29, 1, 5, 16, 10, 25, ...
## $ catch_lbs <dbl> 31671.6086, 31863.4107, 27262.3633, 11924.8038, 1959...
## $ cpue_lbs  <dbl> 239.93643, 237.78665, 121.16606, 74.99877, 127.22390...

3.6 Summarise

Useful to create (not so) simple summaries of the data

ff.summ <- 
  summarise(ff, 
            catch  = sum(catch),
            effort = sum(effort))
glimpse(ff.summ)

## Observations: 1
## Variables: 2
## $ catch  <dbl> 20541932
## $ effort <int> 78369

Useful helper function

n()             ## counts the number of rows
n_distinct(x)   ## counts the number distinct values of x
first(x)        ## first value of x
last(x)         ## last value of x
nth(x,n)        ## nth value of x

For example, if we want to calculate the number of records, number of countries, months and years:

ff.summ2 <- 
  summarise(ff,
            n = n(),
            ncountries = n_distinct(country),
            nmonths    = n_distinct(month),
            nyear      = n_distinct(year),
            year_first = min(year),
            year_last  = max(year))
ff.summ2

##     n ncountries nmonths nyear year_first year_last
## 1 589          3      12    21       1988      2008

3.7 Group_by

Summarize is somewhat useful on its own but become much more useful when combined with a group_by statements:

group_by Group data into rows with the same value of (a) particular variable(s)
ungroup Remove grouping information from data frame

ff.group <- 
  group_by(ff, country)
ff.group

## Source: local data frame [589 x 6]
## Groups: country [3]
## 
##     year month  country   vessel catch effort
##    <int> <int>    <chr>    <chr> <dbl>  <int>
## 1   1988     1   Tobago Dayboats 14366    132
## 2   1988     2   Tobago Dayboats 14453    134
## 3   1988     3   Tobago Dayboats 12366    225
## 4   1988     4   Tobago Dayboats  5409    159
## 5   1988     5   Tobago Dayboats  8887    154
## 6   1988     6   Tobago Dayboats 10725    174
## 7   1988     7   Tobago Dayboats   243     29
## 8   1988    11 Barbados Iceboats    65      1
## 9   1988    11   Tobago Dayboats    55      5
## 10  1988    12 Barbados Iceboats 19279     16
## # ... with 579 more rows

In the above code we have grouped the data by country (note the second line). If we now repeat the summary calculation we did above on the grouped dataframe (ff.group) we get the summary by each country:

ff.summ3 <- 
  summarise(ff.group,
            n = n(),
            ncountries = n_distinct(country),
            nmonths    = n_distinct(month),
            nyear      = n_distinct(year),
            year_first = min(year),
            year_last  = max(year))
ff.summ3

## # A tibble: 3 × 7
##    country     n ncountries nmonths nyear year_first year_last
##      <chr> <int>      <int>   <int> <int>      <int>     <int>
## 1 Barbados   337          1      12    21       1988      2008
## 2  StLucia   100          1      11    13       1995      2007
## 3   Tobago   152          1      10    20       1988      2007

We can use the group_by function on more than one column, e.g. by country and catch-class and then apply some summary statistics:

ff.group2 <- 
  group_by(ff, country, vessel)
ff.summ4 <- 
  summarise(ff.group2,
            n = n(),
            ncountries = n_distinct(country),
            nmonths    = n_distinct(month),
            nyear      = n_distinct(year),
            year_first = min(year),
            year_last  = max(year))
ff.summ4

## Source: local data frame [4 x 8]
## Groups: country [?]
## 
##    country   vessel     n ncountries nmonths nyear year_first year_last
##      <chr>    <chr> <int>      <int>   <int> <int>      <int>     <int>
## 1 Barbados Dayboats   146          1      12    15       1994      2008
## 2 Barbados Iceboats   191          1      12    21       1988      2008
## 3  StLucia Dayboats   100          1      11    13       1995      2007
## 4   Tobago Dayboats   152          1      10    20       1988      2007

3.8 Combining dplyr verbs

In R one can apply functions to data repeatedly:

ff.summ <- 
  arrange(summarise(group_by(select(mutate(ff, cpue = catch/effort), country, vessel, cpue), country, vessel), n=n(), cpue = mean(cpue)), desc(cpue))
ff.summ

## Source: local data frame [4 x 4]
## Groups: country [3]
## 
##    country   vessel     n      cpue
##      <chr>    <chr> <int>     <dbl>
## 1 Barbados Iceboats   191 653.98598
## 2   Tobago Dayboats   152 161.21923
## 3  StLucia Dayboats   100 105.37125
## 4 Barbados Dayboats   146  84.65103

The problem here is that this is very hard to read because in order to understand what has been done one has to read the code from inside out. If we were to write this as a pseudo-code we could do something like this:

calculate the cpue per trip
select the columns country, vessel and cpue
group the data by country and vessel
provide summary statistic of the number of records and the mean cpue
arrange it by the cpue in descending order (country and vessel with highest cpue first).

3.9 The `%>%` operator

dplyr allows chaining of operations using the $\%>\%$ (pipe). By use of the pipe one can get the same as above by:

ff %>% 
  mutate(cpue = catch/effort) %>% 
  select(country, vessel, cpue) %>% 
  group_by(country, vessel) %>% 
  summarise(n = n(),
            cpue = mean(cpue)) %>% 
  arrange(desc(cpue))

## Source: local data frame [4 x 4]
## Groups: country [3]
## 
##    country   vessel     n      cpue
##      <chr>    <chr> <int>     <dbl>
## 1 Barbados Iceboats   191 653.98598
## 2   Tobago Dayboats   152 161.21923
## 3  StLucia Dayboats   100 105.37125
## 4 Barbados Dayboats   146  84.65103

Now this is much more readable and almost equivalent to the pseudo-code we wrote above.

What happens here is that instead of calling a function with f(x,y) you can do x %>% f(y), that is “take x, then apply f with a setting y. This essentially means that %>% operator takes the stuff on the left and places it as the first argument to function on the right hand side. If we take e.g. the first bit of code above:

# not run
ff %>% 
  mutate(cpue = catch/effort)

It basically means that we pass the dataframe ff as the first argument in the mutate function.

The grammar of data manipulation