NOTE TO READERS: The first version (until 27.1.2017) of this document used Sea Around Us (SAU) CRFM regional catch data. That version is now found in the link given below. The reason for switching was that the SAU data was a bit too much as a starter (275736 rows). And a bit limited in terms of what could be shown as computational statistics. The new data used to introduce the grammar of data is introduced below.Here is the older version of the document that used the SAU data. It and the comment above will be removed from this site within the next few weeks.
Suggested reading material: Chapter 5 on Data transformation in the R for Data Science book.
On the RStudio site there is also a nice cheat sheet
library(tidyverse)The data we are going to use is the regional catch and effort flying fish data that was used as a part of the UNU-FTP stock assessment course that was held some years ago. It data contains observation of catch and effort by year, month, country and vessel type.
Getting the example data into R:
ff <- read.csv("http://www.hafro.is/~einarhj/crfmr/data-raw/flyingfish.csv",
               stringsAsFactors = FALSE)Quick overview of the data:
glimpse(ff)## Observations: 589
## Variables: 6
## $ Year        <int> 1988, 1988, 1988, 1988, 1988, 1988, 1988, 1988, 19...
## $ Month       <int> 1, 2, 3, 4, 5, 6, 7, 11, 11, 12, 12, 1, 1, 2, 2, 3...
## $ Country     <chr> "Tobago", "Tobago", "Tobago", "Tobago", "Tobago", ...
## $ Vessel      <chr> "Dayboats", "Dayboats", "Dayboats", "Dayboats", "D...
## $ Weight..kg. <dbl> 14366, 14453, 12366, 5409, 8887, 10725, 243, 65, 5...
## $ Trips       <int> 132, 134, 225, 159, 154, 174, 29, 1, 5, 16, 10, 25...The tidyverse package, through the dplyr and tidyr, are a set of tools for a common set of problems connected to aggregates or summaries of data.
Commonly, when collating summaries by group, one wants to:
For example, one might want to
Essentially dplyr offers a collection of simple but powerful verbs (functions) that facilitate this split-apply-combined process:
The structure of these commands is always the same:
col.name instead of dat$col.nameSelect only certain columns
ff.redux <- 
  select(ff, Year, Month, Trips)
glimpse(ff.redux)## Observations: 589
## Variables: 3
## $ Year  <int> 1988, 1988, 1988, 1988, 1988, 1988, 1988, 1988, 1988, 19...
## $ Month <int> 1, 2, 3, 4, 5, 6, 7, 11, 11, 12, 12, 1, 1, 2, 2, 3, 3, 4...
## $ Trips <int> 132, 134, 225, 159, 154, 174, 29, 1, 5, 16, 10, 25, 68, ...select all columns except Country and Weight..kg. (e.g. if we were only interested in analysing the effort in some potential following steps):
ff.redux <- 
  select(ff, -Country, -Weight..kg.)
glimpse(ff.redux)## Observations: 589
## Variables: 4
## $ Year   <int> 1988, 1988, 1988, 1988, 1988, 1988, 1988, 1988, 1988, 1...
## $ Month  <int> 1, 2, 3, 4, 5, 6, 7, 11, 11, 12, 12, 1, 1, 2, 2, 3, 3, ...
## $ Vessel <chr> "Dayboats", "Dayboats", "Dayboats", "Dayboats", "Dayboa...
## $ Trips  <int> 132, 134, 225, 159, 154, 174, 29, 1, 5, 16, 10, 25, 68,...Select also allows you to rename columns:
ff.redux <- 
  select(ff, catch = Weight..kg., effort = Trips)
glimpse(ff.redux)## Observations: 589
## Variables: 2
## $ catch  <dbl> 14366, 14453, 12366, 5409, 8887, 10725, 243, 65, 55, 19...
## $ effort <int> 132, 134, 225, 159, 154, 174, 29, 1, 5, 16, 10, 25, 68,...but this also removes columns.
Instead of changing variable name with select (as shown above) one can use rename. The difference is that all original data are left intact. Lets for sake of sanity do that here on the original data frame (i.e. we overwrite the ff dataframe):
ff <- 
  rename(ff,
         month = Month,
         country = Country,
         vessel = Vessel,
         catch = Weight..kg.,
         effort = Trips)
glimpse(ff)## Observations: 589
## Variables: 6
## $ Year    <int> 1988, 1988, 1988, 1988, 1988, 1988, 1988, 1988, 1988, ...
## $ month   <int> 1, 2, 3, 4, 5, 6, 7, 11, 11, 12, 12, 1, 1, 2, 2, 3, 3,...
## $ country <chr> "Tobago", "Tobago", "Tobago", "Tobago", "Tobago", "Tob...
## $ vessel  <chr> "Dayboats", "Dayboats", "Dayboats", "Dayboats", "Daybo...
## $ catch   <dbl> 14366, 14453, 12366, 5409, 8887, 10725, 243, 65, 55, 1...
## $ effort  <int> 132, 134, 225, 159, 154, 174, 29, 1, 5, 16, 10, 25, 68...Noticed that I:
select instead of rename in the code above that column would have been dropped from the ff-dataframe.Now what is left is of couse to change the variable name Year to year (would have already done it in the step above but then you would possibly have missed the point of what rename does):
ff <- rename(ff, year = Year)One can create a subset of the data using the filter command:
ff.barbados <-
  filter(ff,
         country == "Barbados")
glimpse(ff.barbados)## Observations: 337
## Variables: 6
## $ year    <int> 1988, 1988, 1989, 1989, 1989, 1989, 1989, 1989, 1989, ...
## $ month   <int> 11, 12, 1, 2, 3, 4, 5, 6, 7, 11, 12, 1, 2, 3, 4, 5, 6,...
## $ country <chr> "Barbados", "Barbados", "Barbados", "Barbados", "Barba...
## $ vessel  <chr> "Iceboats", "Iceboats", "Iceboats", "Iceboats", "Icebo...
## $ catch   <dbl> 65, 19279, 32884, 16356, 836, 7768, 14662, 17645, 6145...
## $ effort  <int> 1, 16, 25, 12, 3, 11, 23, 26, 10, 2, 3, 13, 8, 8, 11, ...Notice here that ff.barbados has only 337 obsverations compared with the full dataset having 589 observations.
One can filter further:
ff.barbados.dayboats <-
  filter(ff,
       country %in% c("Barbados"),
       vessel == "Dayboats")
glimpse(ff.barbados.dayboats)## Observations: 146
## Variables: 6
## $ year    <int> 1994, 1994, 1994, 1994, 1994, 1994, 1994, 1995, 1995, ...
## $ month   <int> 1, 2, 3, 4, 5, 6, 7, 1, 2, 3, 4, 5, 6, 7, 8, 10, 11, 1...
## $ country <chr> "Barbados", "Barbados", "Barbados", "Barbados", "Barba...
## $ vessel  <chr> "Dayboats", "Dayboats", "Dayboats", "Dayboats", "Daybo...
## $ catch   <dbl> 1098.60, 749.80, 1681.60, 1142.20, 636.40, 1984.00, 30...
## $ effort  <int> 9, 11, 21, 12, 15, 14, 2, 1388, 1204, 1182, 776, 1878,...So now we are down to 146 observations (number of observations of Barbados dayboats).
Filter can use any logical statement:
 a == b   ## a is equal to b
 a != b   ## a is not equal to b
 a > b    ## a is greater than b
 a >= b   ## a is greater or equal to b
 a < b    ## a is less than b
 a <= b   ## a is less or equal to b
 a & b    ## a and b
 a | b    ## a or b
 !a       ## not a
 is.na(a) ## is a equal to NA (missing)
 a %in% b ## elements of column a that are also in b For example (results not shown, try it out yourself):
d1 <- filter(ff, year >= 2001)
glimpse(d1)
d2 <- filter(ff, year >= 2001, month == 12)
glimpse(d2)
d3 <- filter(ff, year >= 2001, month == 12, country == "StLucia")
glimpse(d3)
d4 <- filter(ff, year >= 2001, month == 12, country == "StLucia", effort < 25)
glimpse(d4)
d5 <- filter(ff, year >= 2001, month == 12, country == "StLucia", effort < 25, catch > 400)
glimpse(d5)Notice that the in the last step (created object d5) we ended up with a dataframe that contains only one observation (row).
Further along this line we can e.g. exclude the Barbados obervation from the data. This can be done in three ways, all leading to the same results (not run, try it yourself):
d1 <- filter(ff, country != "Barbados")
glimpse(d1)
d2 <- filter(ff, country %in% c("StLucia", "Tobago"))
glimpse(d2)
d3 <- filter(ff, country == "StLucia" | country == "Tobago")
glimpse(d3)A related function is the slice function, that simply filters the data by row number. E.g. to filter the 1st, 4th, 5th, 6th and the 100th row one can do:
ff.slice <- 
  slice(ff, c(1,4:6,100))
glimpse(ff.slice)## Observations: 5
## Variables: 6
## $ year    <int> 1988, 1988, 1988, 1988, 1993
## $ month   <int> 1, 4, 5, 6, 12
## $ country <chr> "Tobago", "Tobago", "Tobago", "Tobago", "Tobago"
## $ vessel  <chr> "Dayboats", "Dayboats", "Dayboats", "Dayboats", "Daybo...
## $ catch   <dbl> 14366, 5409, 8887, 10725, 7149
## $ effort  <int> 132, 159, 154, 174, 45Not very useful on its own, but the power of the slice function will be shown later.
Arrange by catch (in ascending order):
ff.arr <- arrange(ff, catch)
glimpse(ff.arr)## Observations: 589
## Variables: 6
## $ year    <int> 1995, 1999, 2007, 2003, 1992, 1997, 2001, 2003, 1996, ...
## $ month   <int> 4, 4, 7, 3, 11, 8, 3, 8, 11, 4, 10, 9, 11, 9, 7, 10, 1...
## $ country <chr> "StLucia", "StLucia", "StLucia", "StLucia", "Tobago", ...
## $ vessel  <chr> "Dayboats", "Dayboats", "Dayboats", "Dayboats", "Daybo...
## $ catch   <dbl> 0.4545455, 0.9090909, 1.8181818, 4.7727273, 5.0000000,...
## $ effort  <int> 1, 1, 1, 2, 6, 1, 2, 1, 1, 1, 1, 6, 3, 3, 2, 2, 1, 1, ...or in descending order:
ff.darr <- arrange(ff, desc(catch))
glimpse(ff.darr)## Observations: 589
## Variables: 6
## $ year    <int> 2003, 2008, 2008, 2004, 1998, 1998, 2003, 1998, 2008, ...
## $ month   <int> 5, 4, 5, 3, 5, 3, 6, 6, 2, 6, 2, 5, 5, 4, 5, 6, 3, 5, ...
## $ country <chr> "Barbados", "Barbados", "Barbados", "Barbados", "Barba...
## $ vessel  <chr> "Iceboats", "Iceboats", "Iceboats", "Iceboats", "Icebo...
## $ catch   <dbl> 317393.0, 299613.0, 284108.0, 280106.0, 259247.0, 2544...
## $ effort  <int> 259, 185, 202, 679, 267, 249, 218, 213, 151, 199, 216,...You can also arrange by more that one column:
ff.arr2 <- arrange(ff, desc(year), desc(catch))
glimpse(ff.arr2)## Observations: 589
## Variables: 6
## $ year    <int> 2008, 2008, 2008, 2008, 2008, 2008, 2008, 2008, 2008, ...
## $ month   <int> 4, 5, 2, 3, 1, 6, 5, 6, 4, 3, 1, 2, 5, 4, 3, 6, 12, 1,...
## $ country <chr> "Barbados", "Barbados", "Barbados", "Barbados", "Barba...
## $ vessel  <chr> "Iceboats", "Iceboats", "Iceboats", "Iceboats", "Icebo...
## $ catch   <dbl> 299613, 284108, 218236, 196913, 163096, 160004, 55611,...
## $ effort  <int> 185, 202, 151, 198, 116, 133, 293, 226, 188, 217, 221,...Which basically means that we have sorted the data such that the last year of observation (2008) comes first and then within any year the highest catch comes first. In the particular case fist record is for the year 2008, month is March (4), country is Barbados and the Vessel type is Iceboats. Not very interesting in itself (one could have achieved the same in Excel), but hold on - this simple function may become useful further down the line.
Mutate allows you to add new columns to your data. Let’s e.g. calculate the value per kg
ff.cpue <-
  mutate(ff,
         cpue = catch/effort)You can also do more than one “mutation”, e.g. here we convert kilograms to pund, and then use that variable to calculate cpue in units of pounds per trip.
ff.cpue_lbs <-
  mutate(ff,
         catch_lbs = catch * 2.20462262,
         cpue_lbs =  catch_lbs/effort)
glimpse(ff.cpue_lbs)## Observations: 589
## Variables: 8
## $ year      <int> 1988, 1988, 1988, 1988, 1988, 1988, 1988, 1988, 1988...
## $ month     <int> 1, 2, 3, 4, 5, 6, 7, 11, 11, 12, 12, 1, 1, 2, 2, 3, ...
## $ country   <chr> "Tobago", "Tobago", "Tobago", "Tobago", "Tobago", "T...
## $ vessel    <chr> "Dayboats", "Dayboats", "Dayboats", "Dayboats", "Day...
## $ catch     <dbl> 14366, 14453, 12366, 5409, 8887, 10725, 243, 65, 55,...
## $ effort    <int> 132, 134, 225, 159, 154, 174, 29, 1, 5, 16, 10, 25, ...
## $ catch_lbs <dbl> 31671.6086, 31863.4107, 27262.3633, 11924.8038, 1959...
## $ cpue_lbs  <dbl> 239.93643, 237.78665, 121.16606, 74.99877, 127.22390...Useful to create (not so) simple summaries of the data
ff.summ <- 
  summarise(ff, 
            catch  = sum(catch),
            effort = sum(effort))
glimpse(ff.summ)## Observations: 1
## Variables: 2
## $ catch  <dbl> 20541932
## $ effort <int> 78369Useful helper function
n()             ## counts the number of rows
n_distinct(x)   ## counts the number distinct values of x
first(x)        ## first value of x
last(x)         ## last value of x
nth(x,n)        ## nth value of x  For example, if we want to calculate the number of records, number of countries, months and years:
ff.summ2 <- 
  summarise(ff,
            n = n(),
            ncountries = n_distinct(country),
            nmonths    = n_distinct(month),
            nyear      = n_distinct(year),
            year_first = min(year),
            year_last  = max(year))
ff.summ2##     n ncountries nmonths nyear year_first year_last
## 1 589          3      12    21       1988      2008Summarize is somewhat useful on its own but become much more useful when combined with a group_by statements:
ff.group <- 
  group_by(ff, country)
ff.group## Source: local data frame [589 x 6]
## Groups: country [3]
## 
##     year month  country   vessel catch effort
##    <int> <int>    <chr>    <chr> <dbl>  <int>
## 1   1988     1   Tobago Dayboats 14366    132
## 2   1988     2   Tobago Dayboats 14453    134
## 3   1988     3   Tobago Dayboats 12366    225
## 4   1988     4   Tobago Dayboats  5409    159
## 5   1988     5   Tobago Dayboats  8887    154
## 6   1988     6   Tobago Dayboats 10725    174
## 7   1988     7   Tobago Dayboats   243     29
## 8   1988    11 Barbados Iceboats    65      1
## 9   1988    11   Tobago Dayboats    55      5
## 10  1988    12 Barbados Iceboats 19279     16
## # ... with 579 more rowsIn the above code we have grouped the data by country (note the second line). If we now repeat the summary calculation we did above on the grouped dataframe (ff.group) we get the summary by each country:
ff.summ3 <- 
  summarise(ff.group,
            n = n(),
            ncountries = n_distinct(country),
            nmonths    = n_distinct(month),
            nyear      = n_distinct(year),
            year_first = min(year),
            year_last  = max(year))
ff.summ3## # A tibble: 3 × 7
##    country     n ncountries nmonths nyear year_first year_last
##      <chr> <int>      <int>   <int> <int>      <int>     <int>
## 1 Barbados   337          1      12    21       1988      2008
## 2  StLucia   100          1      11    13       1995      2007
## 3   Tobago   152          1      10    20       1988      2007We can use the group_by function on more than one column, e.g. by country and catch-class and then apply some summary statistics:
ff.group2 <- 
  group_by(ff, country, vessel)
ff.summ4 <- 
  summarise(ff.group2,
            n = n(),
            ncountries = n_distinct(country),
            nmonths    = n_distinct(month),
            nyear      = n_distinct(year),
            year_first = min(year),
            year_last  = max(year))
ff.summ4## Source: local data frame [4 x 8]
## Groups: country [?]
## 
##    country   vessel     n ncountries nmonths nyear year_first year_last
##      <chr>    <chr> <int>      <int>   <int> <int>      <int>     <int>
## 1 Barbados Dayboats   146          1      12    15       1994      2008
## 2 Barbados Iceboats   191          1      12    21       1988      2008
## 3  StLucia Dayboats   100          1      11    13       1995      2007
## 4   Tobago Dayboats   152          1      10    20       1988      2007In R one can apply functions to data repeatedly:
ff.summ <- 
  arrange(summarise(group_by(select(mutate(ff, cpue = catch/effort), country, vessel, cpue), country, vessel), n=n(), cpue = mean(cpue)), desc(cpue))
ff.summ## Source: local data frame [4 x 4]
## Groups: country [3]
## 
##    country   vessel     n      cpue
##      <chr>    <chr> <int>     <dbl>
## 1 Barbados Iceboats   191 653.98598
## 2   Tobago Dayboats   152 161.21923
## 3  StLucia Dayboats   100 105.37125
## 4 Barbados Dayboats   146  84.65103The problem here is that this is very hard to read because in order to understand what has been done one has to read the code from inside out. If we were to write this as a pseudo-code we could do something like this:
%>% operatordplyr allows chaining of operations using the \(\%>\%\) (pipe). By use of the pipe one can get the same as above by:
ff %>% 
  mutate(cpue = catch/effort) %>% 
  select(country, vessel, cpue) %>% 
  group_by(country, vessel) %>% 
  summarise(n = n(),
            cpue = mean(cpue)) %>% 
  arrange(desc(cpue))## Source: local data frame [4 x 4]
## Groups: country [3]
## 
##    country   vessel     n      cpue
##      <chr>    <chr> <int>     <dbl>
## 1 Barbados Iceboats   191 653.98598
## 2   Tobago Dayboats   152 161.21923
## 3  StLucia Dayboats   100 105.37125
## 4 Barbados Dayboats   146  84.65103Now this is much more readable and almost equivalent to the pseudo-code we wrote above.
What happens here is that instead of calling a function with f(x,y) you can do x %>% f(y), that is “take x, then apply f with a setting y. This essentially means that %>% operator takes the stuff on the left and places it as the first argument to function on the right hand side. If we take e.g. the first bit of code above:
# not run
ff %>% 
  mutate(cpue = catch/effort)It basically means that we pass the dataframe ff as the first argument in the mutate function.