NOTE TO READERS: The first version (until 27.1.2017) of this document used Sea Around Us (SAU) CRFM regional catch data. That version is now found in the link given below. The reason for switching was that the SAU data was a bit too much as a starter (275736 rows). And a bit limited in terms of what could be shown as computational statistics. The new data used to introduce the grammar of data is introduced below.
Here is the older version of the document that used the SAU data. It and the comment above will be removed from this site within the next few weeks.
Suggested reading material: Chapter 5 on Data transformation in the R for Data Science book.
On the RStudio site there is also a nice cheat sheet
library(tidyverse)
The data we are going to use is the regional catch and effort flying fish data that was used as a part of the UNU-FTP stock assessment course that was held some years ago. It data contains observation of catch and effort by year, month, country and vessel type.
Getting the example data into R:
ff <- read.csv("http://www.hafro.is/~einarhj/crfmr/data-raw/flyingfish.csv",
stringsAsFactors = FALSE)
Quick overview of the data:
glimpse(ff)
## Observations: 589
## Variables: 6
## $ Year <int> 1988, 1988, 1988, 1988, 1988, 1988, 1988, 1988, 19...
## $ Month <int> 1, 2, 3, 4, 5, 6, 7, 11, 11, 12, 12, 1, 1, 2, 2, 3...
## $ Country <chr> "Tobago", "Tobago", "Tobago", "Tobago", "Tobago", ...
## $ Vessel <chr> "Dayboats", "Dayboats", "Dayboats", "Dayboats", "D...
## $ Weight..kg. <dbl> 14366, 14453, 12366, 5409, 8887, 10725, 243, 65, 5...
## $ Trips <int> 132, 134, 225, 159, 154, 174, 29, 1, 5, 16, 10, 25...
The tidyverse package, through the dplyr and tidyr, are a set of tools for a common set of problems connected to aggregates or summaries of data.
Commonly, when collating summaries by group, one wants to:
For example, one might want to
Essentially dplyr offers a collection of simple but powerful verbs (functions) that facilitate this split-apply-combined process:
The structure of these commands is always the same:
col.name
instead of dat$col.name
Select only certain columns
ff.redux <-
select(ff, Year, Month, Trips)
glimpse(ff.redux)
## Observations: 589
## Variables: 3
## $ Year <int> 1988, 1988, 1988, 1988, 1988, 1988, 1988, 1988, 1988, 19...
## $ Month <int> 1, 2, 3, 4, 5, 6, 7, 11, 11, 12, 12, 1, 1, 2, 2, 3, 3, 4...
## $ Trips <int> 132, 134, 225, 159, 154, 174, 29, 1, 5, 16, 10, 25, 68, ...
select all columns except Country and Weight..kg. (e.g. if we were only interested in analysing the effort in some potential following steps):
ff.redux <-
select(ff, -Country, -Weight..kg.)
glimpse(ff.redux)
## Observations: 589
## Variables: 4
## $ Year <int> 1988, 1988, 1988, 1988, 1988, 1988, 1988, 1988, 1988, 1...
## $ Month <int> 1, 2, 3, 4, 5, 6, 7, 11, 11, 12, 12, 1, 1, 2, 2, 3, 3, ...
## $ Vessel <chr> "Dayboats", "Dayboats", "Dayboats", "Dayboats", "Dayboa...
## $ Trips <int> 132, 134, 225, 159, 154, 174, 29, 1, 5, 16, 10, 25, 68,...
Select also allows you to rename columns:
ff.redux <-
select(ff, catch = Weight..kg., effort = Trips)
glimpse(ff.redux)
## Observations: 589
## Variables: 2
## $ catch <dbl> 14366, 14453, 12366, 5409, 8887, 10725, 243, 65, 55, 19...
## $ effort <int> 132, 134, 225, 159, 154, 174, 29, 1, 5, 16, 10, 25, 68,...
but this also removes columns.
Instead of changing variable name with select (as shown above) one can use rename
. The difference is that all original data are left intact. Lets for sake of sanity do that here on the original data frame (i.e. we overwrite the ff
dataframe):
ff <-
rename(ff,
month = Month,
country = Country,
vessel = Vessel,
catch = Weight..kg.,
effort = Trips)
glimpse(ff)
## Observations: 589
## Variables: 6
## $ Year <int> 1988, 1988, 1988, 1988, 1988, 1988, 1988, 1988, 1988, ...
## $ month <int> 1, 2, 3, 4, 5, 6, 7, 11, 11, 12, 12, 1, 1, 2, 2, 3, 3,...
## $ country <chr> "Tobago", "Tobago", "Tobago", "Tobago", "Tobago", "Tob...
## $ vessel <chr> "Dayboats", "Dayboats", "Dayboats", "Dayboats", "Daybo...
## $ catch <dbl> 14366, 14453, 12366, 5409, 8887, 10725, 243, 65, 55, 1...
## $ effort <int> 132, 134, 225, 159, 154, 174, 29, 1, 5, 16, 10, 25, 68...
Noticed that I:
select
instead of rename
in the code above that column would have been dropped from the ff-dataframe.Now what is left is of couse to change the variable name Year to year (would have already done it in the step above but then you would possibly have missed the point of what rename
does):
ff <- rename(ff, year = Year)
One can create a subset of the data using the filter
command:
ff.barbados <-
filter(ff,
country == "Barbados")
glimpse(ff.barbados)
## Observations: 337
## Variables: 6
## $ year <int> 1988, 1988, 1989, 1989, 1989, 1989, 1989, 1989, 1989, ...
## $ month <int> 11, 12, 1, 2, 3, 4, 5, 6, 7, 11, 12, 1, 2, 3, 4, 5, 6,...
## $ country <chr> "Barbados", "Barbados", "Barbados", "Barbados", "Barba...
## $ vessel <chr> "Iceboats", "Iceboats", "Iceboats", "Iceboats", "Icebo...
## $ catch <dbl> 65, 19279, 32884, 16356, 836, 7768, 14662, 17645, 6145...
## $ effort <int> 1, 16, 25, 12, 3, 11, 23, 26, 10, 2, 3, 13, 8, 8, 11, ...
Notice here that ff.barbados has only 337 obsverations compared with the full dataset having 589 observations.
One can filter further:
ff.barbados.dayboats <-
filter(ff,
country %in% c("Barbados"),
vessel == "Dayboats")
glimpse(ff.barbados.dayboats)
## Observations: 146
## Variables: 6
## $ year <int> 1994, 1994, 1994, 1994, 1994, 1994, 1994, 1995, 1995, ...
## $ month <int> 1, 2, 3, 4, 5, 6, 7, 1, 2, 3, 4, 5, 6, 7, 8, 10, 11, 1...
## $ country <chr> "Barbados", "Barbados", "Barbados", "Barbados", "Barba...
## $ vessel <chr> "Dayboats", "Dayboats", "Dayboats", "Dayboats", "Daybo...
## $ catch <dbl> 1098.60, 749.80, 1681.60, 1142.20, 636.40, 1984.00, 30...
## $ effort <int> 9, 11, 21, 12, 15, 14, 2, 1388, 1204, 1182, 776, 1878,...
So now we are down to 146 observations (number of observations of Barbados dayboats).
Filter can use any logical statement:
a == b ## a is equal to b
a != b ## a is not equal to b
a > b ## a is greater than b
a >= b ## a is greater or equal to b
a < b ## a is less than b
a <= b ## a is less or equal to b
a & b ## a and b
a | b ## a or b
!a ## not a
is.na(a) ## is a equal to NA (missing)
a %in% b ## elements of column a that are also in b
For example (results not shown, try it out yourself):
d1 <- filter(ff, year >= 2001)
glimpse(d1)
d2 <- filter(ff, year >= 2001, month == 12)
glimpse(d2)
d3 <- filter(ff, year >= 2001, month == 12, country == "StLucia")
glimpse(d3)
d4 <- filter(ff, year >= 2001, month == 12, country == "StLucia", effort < 25)
glimpse(d4)
d5 <- filter(ff, year >= 2001, month == 12, country == "StLucia", effort < 25, catch > 400)
glimpse(d5)
Notice that the in the last step (created object d5) we ended up with a dataframe that contains only one observation (row).
Further along this line we can e.g. exclude the Barbados obervation from the data. This can be done in three ways, all leading to the same results (not run, try it yourself):
d1 <- filter(ff, country != "Barbados")
glimpse(d1)
d2 <- filter(ff, country %in% c("StLucia", "Tobago"))
glimpse(d2)
d3 <- filter(ff, country == "StLucia" | country == "Tobago")
glimpse(d3)
A related function is the slice
function, that simply filters the data by row number. E.g. to filter the 1st, 4th, 5th, 6th and the 100th row one can do:
ff.slice <-
slice(ff, c(1,4:6,100))
glimpse(ff.slice)
## Observations: 5
## Variables: 6
## $ year <int> 1988, 1988, 1988, 1988, 1993
## $ month <int> 1, 4, 5, 6, 12
## $ country <chr> "Tobago", "Tobago", "Tobago", "Tobago", "Tobago"
## $ vessel <chr> "Dayboats", "Dayboats", "Dayboats", "Dayboats", "Daybo...
## $ catch <dbl> 14366, 5409, 8887, 10725, 7149
## $ effort <int> 132, 159, 154, 174, 45
Not very useful on its own, but the power of the slice
function will be shown later.
Arrange by catch (in ascending order):
ff.arr <- arrange(ff, catch)
glimpse(ff.arr)
## Observations: 589
## Variables: 6
## $ year <int> 1995, 1999, 2007, 2003, 1992, 1997, 2001, 2003, 1996, ...
## $ month <int> 4, 4, 7, 3, 11, 8, 3, 8, 11, 4, 10, 9, 11, 9, 7, 10, 1...
## $ country <chr> "StLucia", "StLucia", "StLucia", "StLucia", "Tobago", ...
## $ vessel <chr> "Dayboats", "Dayboats", "Dayboats", "Dayboats", "Daybo...
## $ catch <dbl> 0.4545455, 0.9090909, 1.8181818, 4.7727273, 5.0000000,...
## $ effort <int> 1, 1, 1, 2, 6, 1, 2, 1, 1, 1, 1, 6, 3, 3, 2, 2, 1, 1, ...
or in descending order:
ff.darr <- arrange(ff, desc(catch))
glimpse(ff.darr)
## Observations: 589
## Variables: 6
## $ year <int> 2003, 2008, 2008, 2004, 1998, 1998, 2003, 1998, 2008, ...
## $ month <int> 5, 4, 5, 3, 5, 3, 6, 6, 2, 6, 2, 5, 5, 4, 5, 6, 3, 5, ...
## $ country <chr> "Barbados", "Barbados", "Barbados", "Barbados", "Barba...
## $ vessel <chr> "Iceboats", "Iceboats", "Iceboats", "Iceboats", "Icebo...
## $ catch <dbl> 317393.0, 299613.0, 284108.0, 280106.0, 259247.0, 2544...
## $ effort <int> 259, 185, 202, 679, 267, 249, 218, 213, 151, 199, 216,...
You can also arrange by more that one column:
ff.arr2 <- arrange(ff, desc(year), desc(catch))
glimpse(ff.arr2)
## Observations: 589
## Variables: 6
## $ year <int> 2008, 2008, 2008, 2008, 2008, 2008, 2008, 2008, 2008, ...
## $ month <int> 4, 5, 2, 3, 1, 6, 5, 6, 4, 3, 1, 2, 5, 4, 3, 6, 12, 1,...
## $ country <chr> "Barbados", "Barbados", "Barbados", "Barbados", "Barba...
## $ vessel <chr> "Iceboats", "Iceboats", "Iceboats", "Iceboats", "Icebo...
## $ catch <dbl> 299613, 284108, 218236, 196913, 163096, 160004, 55611,...
## $ effort <int> 185, 202, 151, 198, 116, 133, 293, 226, 188, 217, 221,...
Which basically means that we have sorted the data such that the last year of observation (2008) comes first and then within any year the highest catch comes first. In the particular case fist record is for the year 2008, month is March (4), country is Barbados and the Vessel type is Iceboats. Not very interesting in itself (one could have achieved the same in Excel), but hold on - this simple function may become useful further down the line.
Mutate allows you to add new columns to your data. Let’s e.g. calculate the value per kg
ff.cpue <-
mutate(ff,
cpue = catch/effort)
You can also do more than one “mutation”, e.g. here we convert kilograms to pund, and then use that variable to calculate cpue in units of pounds per trip.
ff.cpue_lbs <-
mutate(ff,
catch_lbs = catch * 2.20462262,
cpue_lbs = catch_lbs/effort)
glimpse(ff.cpue_lbs)
## Observations: 589
## Variables: 8
## $ year <int> 1988, 1988, 1988, 1988, 1988, 1988, 1988, 1988, 1988...
## $ month <int> 1, 2, 3, 4, 5, 6, 7, 11, 11, 12, 12, 1, 1, 2, 2, 3, ...
## $ country <chr> "Tobago", "Tobago", "Tobago", "Tobago", "Tobago", "T...
## $ vessel <chr> "Dayboats", "Dayboats", "Dayboats", "Dayboats", "Day...
## $ catch <dbl> 14366, 14453, 12366, 5409, 8887, 10725, 243, 65, 55,...
## $ effort <int> 132, 134, 225, 159, 154, 174, 29, 1, 5, 16, 10, 25, ...
## $ catch_lbs <dbl> 31671.6086, 31863.4107, 27262.3633, 11924.8038, 1959...
## $ cpue_lbs <dbl> 239.93643, 237.78665, 121.16606, 74.99877, 127.22390...
Useful to create (not so) simple summaries of the data
ff.summ <-
summarise(ff,
catch = sum(catch),
effort = sum(effort))
glimpse(ff.summ)
## Observations: 1
## Variables: 2
## $ catch <dbl> 20541932
## $ effort <int> 78369
Useful helper function
n() ## counts the number of rows
n_distinct(x) ## counts the number distinct values of x
first(x) ## first value of x
last(x) ## last value of x
nth(x,n) ## nth value of x
For example, if we want to calculate the number of records, number of countries, months and years:
ff.summ2 <-
summarise(ff,
n = n(),
ncountries = n_distinct(country),
nmonths = n_distinct(month),
nyear = n_distinct(year),
year_first = min(year),
year_last = max(year))
ff.summ2
## n ncountries nmonths nyear year_first year_last
## 1 589 3 12 21 1988 2008
Summarize is somewhat useful on its own but become much more useful when combined with a group_by
statements:
ff.group <-
group_by(ff, country)
ff.group
## Source: local data frame [589 x 6]
## Groups: country [3]
##
## year month country vessel catch effort
## <int> <int> <chr> <chr> <dbl> <int>
## 1 1988 1 Tobago Dayboats 14366 132
## 2 1988 2 Tobago Dayboats 14453 134
## 3 1988 3 Tobago Dayboats 12366 225
## 4 1988 4 Tobago Dayboats 5409 159
## 5 1988 5 Tobago Dayboats 8887 154
## 6 1988 6 Tobago Dayboats 10725 174
## 7 1988 7 Tobago Dayboats 243 29
## 8 1988 11 Barbados Iceboats 65 1
## 9 1988 11 Tobago Dayboats 55 5
## 10 1988 12 Barbados Iceboats 19279 16
## # ... with 579 more rows
In the above code we have grouped the data by country (note the second line). If we now repeat the summary calculation we did above on the grouped dataframe (ff.group) we get the summary by each country:
ff.summ3 <-
summarise(ff.group,
n = n(),
ncountries = n_distinct(country),
nmonths = n_distinct(month),
nyear = n_distinct(year),
year_first = min(year),
year_last = max(year))
ff.summ3
## # A tibble: 3 × 7
## country n ncountries nmonths nyear year_first year_last
## <chr> <int> <int> <int> <int> <int> <int>
## 1 Barbados 337 1 12 21 1988 2008
## 2 StLucia 100 1 11 13 1995 2007
## 3 Tobago 152 1 10 20 1988 2007
We can use the group_by
function on more than one column, e.g. by country and catch-class and then apply some summary statistics:
ff.group2 <-
group_by(ff, country, vessel)
ff.summ4 <-
summarise(ff.group2,
n = n(),
ncountries = n_distinct(country),
nmonths = n_distinct(month),
nyear = n_distinct(year),
year_first = min(year),
year_last = max(year))
ff.summ4
## Source: local data frame [4 x 8]
## Groups: country [?]
##
## country vessel n ncountries nmonths nyear year_first year_last
## <chr> <chr> <int> <int> <int> <int> <int> <int>
## 1 Barbados Dayboats 146 1 12 15 1994 2008
## 2 Barbados Iceboats 191 1 12 21 1988 2008
## 3 StLucia Dayboats 100 1 11 13 1995 2007
## 4 Tobago Dayboats 152 1 10 20 1988 2007
In R one can apply functions to data repeatedly:
ff.summ <-
arrange(summarise(group_by(select(mutate(ff, cpue = catch/effort), country, vessel, cpue), country, vessel), n=n(), cpue = mean(cpue)), desc(cpue))
ff.summ
## Source: local data frame [4 x 4]
## Groups: country [3]
##
## country vessel n cpue
## <chr> <chr> <int> <dbl>
## 1 Barbados Iceboats 191 653.98598
## 2 Tobago Dayboats 152 161.21923
## 3 StLucia Dayboats 100 105.37125
## 4 Barbados Dayboats 146 84.65103
The problem here is that this is very hard to read because in order to understand what has been done one has to read the code from inside out. If we were to write this as a pseudo-code we could do something like this:
%>%
operatordplyr allows chaining of operations using the \(\%>\%\) (pipe). By use of the pipe one can get the same as above by:
ff %>%
mutate(cpue = catch/effort) %>%
select(country, vessel, cpue) %>%
group_by(country, vessel) %>%
summarise(n = n(),
cpue = mean(cpue)) %>%
arrange(desc(cpue))
## Source: local data frame [4 x 4]
## Groups: country [3]
##
## country vessel n cpue
## <chr> <chr> <int> <dbl>
## 1 Barbados Iceboats 191 653.98598
## 2 Tobago Dayboats 152 161.21923
## 3 StLucia Dayboats 100 105.37125
## 4 Barbados Dayboats 146 84.65103
Now this is much more readable and almost equivalent to the pseudo-code we wrote above.
What happens here is that instead of calling a function with f(x,y)
you can do x %>% f(y)
, that is “take x, then apply f with a setting y. This essentially means that %>%
operator takes the stuff on the left and places it as the first argument to function on the right hand side. If we take e.g. the first bit of code above:
# not run
ff %>%
mutate(cpue = catch/effort)
It basically means that we pass the dataframe ff as the first argument in the mutate
function.