Suggested reading material: Chapter 5 on Data transformation in the R for Data Science book.
On the RStudio site there is also a nice cheat sheet
library(tidyverse)
The data we are going to use is the web. It can be downloaded onto your computer by clicking on the link or what is better (because it becomes repoducable) download the data on to your computer (only have to do this once, the file is big):
download.file("http://www.hafro.is/~einarhj/crfmr/data-raw/SAU_FAO_31_v44-1.csv",
"SAU_FAO_31_v44-1.csv")
and then read in the data each time you want to work on it:
sau <- read.csv("SAU_FAO_31_v44-1.csv",
stringsAsFactors = FALSE)
glimpse(sau)
## Observations: 275,736
## Variables: 13
## $ area_name <chr> "Atlantic, Western Central", "Atlantic, Weste...
## $ area_type <chr> "fao_area", "fao_area", "fao_area", "fao_area...
## $ year <int> 1950, 1950, 1950, 1950, 1950, 1950, 1950, 195...
## $ scientific_name <chr> "Marine fishes not identified", "Marine fishe...
## $ common_name <chr> "Marine fishes nei", "Marine fishes nei", "Ma...
## $ functional_group <chr> "Medium demersals (30 - 89 cm)", "Medium deme...
## $ commercial_group <chr> "Other fishes & inverts", "Other fishes & inv...
## $ fishing_entity <chr> "Antigua & Barbuda", "Antigua & Barbuda", "An...
## $ fishing_sector <chr> "Subsistence", "Artisanal", "Artisanal", "Sub...
## $ catch_type <chr> "Landings", "Landings", "Landings", "Landings...
## $ reporting_status <chr> "Unreported", "Reported", "Unreported", "Unre...
## $ tonnes <dbl> 1.827121e+02, 1.889356e+01, 1.257746e+02, 1.5...
## $ landed_value <dbl> 2.678560e+05, 2.769796e+04, 1.843856e+05, 2.2...
The tidyverse package, through the dplyr and tidyr, are a set of tools for a common set of problems connected to aggregates or summaries of data.
Commonly, when collating summaries by group, one wants to:
For example, one might want to
Essentially dplyr offers a collection of simple but powerful verbs (functions) that facilitate this split-apply-combined process:
The structure of these commands is always the same:
col.name
instead of dat$col.name
Select only certain columns
sau.redux <-
select(sau, year, common_name, tonnes)
glimpse(sau.redux)
## Observations: 275,736
## Variables: 3
## $ year <int> 1950, 1950, 1950, 1950, 1950, 1950, 1950, 1950, 19...
## $ common_name <chr> "Marine fishes nei", "Marine fishes nei", "Marine ...
## $ tonnes <dbl> 1.827121e+02, 1.889356e+01, 1.257746e+02, 1.556623...
select all columns except area_name
and area_type
:
sau.redux <-
select(sau, -area_name, -area_type)
glimpse(sau.redux)
## Observations: 275,736
## Variables: 11
## $ year <int> 1950, 1950, 1950, 1950, 1950, 1950, 1950, 195...
## $ scientific_name <chr> "Marine fishes not identified", "Marine fishe...
## $ common_name <chr> "Marine fishes nei", "Marine fishes nei", "Ma...
## $ functional_group <chr> "Medium demersals (30 - 89 cm)", "Medium deme...
## $ commercial_group <chr> "Other fishes & inverts", "Other fishes & inv...
## $ fishing_entity <chr> "Antigua & Barbuda", "Antigua & Barbuda", "An...
## $ fishing_sector <chr> "Subsistence", "Artisanal", "Artisanal", "Sub...
## $ catch_type <chr> "Landings", "Landings", "Landings", "Landings...
## $ reporting_status <chr> "Unreported", "Reported", "Unreported", "Unre...
## $ tonnes <dbl> 1.827121e+02, 1.889356e+01, 1.257746e+02, 1.5...
## $ landed_value <dbl> 2.678560e+05, 2.769796e+04, 1.843856e+05, 2.2...
Other useful helper function
# not run
select(sau, starts_with("fishing")) # Select all columns that start with "fishing"
select(sau, ends_with("name")) # Select all columns that end with "name"
select(sau, scientific_name:catch_type) # Select all columns between "scientific_name" and "catch_type"
select(sau, contains("group")) # Select all columns that contains "group"
Select also allows you to rename columns:
sau.redux <-
select(sau, year, country = fishing_entity, name = common_name, tonnes)
glimpse(sau.redux)
## Observations: 275,736
## Variables: 4
## $ year <int> 1950, 1950, 1950, 1950, 1950, 1950, 1950, 1950, 1950, ...
## $ country <chr> "Antigua & Barbuda", "Antigua & Barbuda", "Antigua & B...
## $ name <chr> "Marine fishes nei", "Marine fishes nei", "Marine fish...
## $ tonnes <dbl> 1.827121e+02, 1.889356e+01, 1.257746e+02, 1.556623e+01...
but this also removes columns.
Instead of changing variable name with select (as shown above) one can use rename
. The difference is that all original columns are intact. Lets for sake of sanity do that here on the original data frame (i.e. we overwrite the sau
dataframe):
sau <-
rename(sau,
area = area_name,
type = area_type,
latin = scientific_name,
species = common_name,
bgroup = functional_group,
cgroup = commercial_group,
country = fishing_entity,
sector = fishing_sector,
cclass = catch_type,
status = reporting_status,
value = landed_value)
glimpse(sau)
## Observations: 275,736
## Variables: 13
## $ area <chr> "Atlantic, Western Central", "Atlantic, Western Centra...
## $ type <chr> "fao_area", "fao_area", "fao_area", "fao_area", "fao_a...
## $ year <int> 1950, 1950, 1950, 1950, 1950, 1950, 1950, 1950, 1950, ...
## $ latin <chr> "Marine fishes not identified", "Marine fishes not ide...
## $ species <chr> "Marine fishes nei", "Marine fishes nei", "Marine fish...
## $ bgroup <chr> "Medium demersals (30 - 89 cm)", "Medium demersals (30...
## $ cgroup <chr> "Other fishes & inverts", "Other fishes & inverts", "O...
## $ country <chr> "Antigua & Barbuda", "Antigua & Barbuda", "Antigua & B...
## $ sector <chr> "Subsistence", "Artisanal", "Artisanal", "Subsistence"...
## $ cclass <chr> "Landings", "Landings", "Landings", "Landings", "Landi...
## $ status <chr> "Unreported", "Reported", "Unreported", "Unreported", ...
## $ tonnes <dbl> 1.827121e+02, 1.889356e+01, 1.257746e+02, 1.556623e+01...
## $ value <dbl> 2.678560e+05, 2.769796e+04, 1.843856e+05, 2.282010e+04...
Lets also get rid of the column area, type and latin.
sau <- select(sau, -area, -type, -latin)
One can create a subset of the data using the filter
command:
sau.dominica <-
filter(sau,
country == "Dominica")
glimpse(sau.dominica)
## Observations: 3,515
## Variables: 10
## $ year <int> 1950, 1950, 1950, 1950, 1950, 1950, 1950, 1950, 1950, ...
## $ species <chr> "Marine fishes nei", "Marine fishes nei", "Marine fish...
## $ bgroup <chr> "Medium demersals (30 - 89 cm)", "Medium demersals (30...
## $ cgroup <chr> "Other fishes & inverts", "Other fishes & inverts", "O...
## $ country <chr> "Dominica", "Dominica", "Dominica", "Dominica", "Domin...
## $ sector <chr> "Subsistence", "Subsistence", "Artisanal", "Subsistenc...
## $ cclass <chr> "Landings", "Landings", "Landings", "Landings", "Landi...
## $ status <chr> "Reported", "Unreported", "Reported", "Reported", "Unr...
## $ tonnes <dbl> 97.98476, 83.12100, 50.47700, 7.92000, 7.56000, 4.0800...
## $ value <dbl> 143645.661, 121855.379, 73999.280, 11610.720, 11082.96...
and filter even further (here two countries and one commerical group):
sau.dominica.tunas <-
filter(sau,
country %in% c("Dominica", "Barbados"),
cgroup == "Tuna & billfishes")
glimpse(sau.dominica.tunas)
## Observations: 1,667
## Variables: 10
## $ year <int> 1950, 1950, 1950, 1950, 1950, 1950, 1950, 1950, 1950, ...
## $ species <chr> "Billfishes", "Billfishes", "Bullet and frigate tunas"...
## $ bgroup <chr> "Large pelagics (>=90 cm)", "Large pelagics (>=90 cm)"...
## $ cgroup <chr> "Tuna & billfishes", "Tuna & billfishes", "Tuna & bill...
## $ country <chr> "Barbados", "Barbados", "Barbados", "Barbados", "Domin...
## $ sector <chr> "Artisanal", "Recreational", "Recreational", "Recreati...
## $ cclass <chr> "Landings", "Landings", "Landings", "Landings", "Landi...
## $ status <chr> "Reported", "Unreported", "Unreported", "Unreported", ...
## $ tonnes <dbl> 2.103319e+01, 1.831712e-03, 1.373784e-03, 1.373784e-03...
## $ value <dbl> 3.083466e+04, 2.685289e+00, 2.013967e+00, 2.013967e+00...
Filter can use any logical statement:
a == b ## a is equal to b
a != b ## a is not equal to b
a > b ## a is greater than b
a >= b ## a is greater or equal to b
a < b ## a is less than b
a <= b ## a is less or equal to b
a & b ## a and b
a | b ## a or b
!a ## not a
is.na(a) ## is a equal to NA (missing)
a %in% b ## elements of column a that are also in b
A related function is the slice
function, that simply filters the data by row number. E.g. to filter the 1st, 4th, 5th, 6th and the 100th row one can do:
sau.slice <-
slice(sau, c(1,4:6,100))
glimpse(sau.slice)
## Observations: 5
## Variables: 10
## $ year <int> 1950, 1950, 1950, 1950, 1950
## $ species <chr> "Marine fishes nei", "Marine fishes nei", "Marine fish...
## $ bgroup <chr> "Medium demersals (30 - 89 cm)", "Medium demersals (30...
## $ cgroup <chr> "Other fishes & inverts", "Other fishes & inverts", "O...
## $ country <chr> "Antigua & Barbuda", "Bahamas", "Bahamas", "Bahamas", ...
## $ sector <chr> "Subsistence", "Subsistence", "Artisanal", "Artisanal"...
## $ cclass <chr> "Landings", "Landings", "Landings", "Landings", "Landi...
## $ status <chr> "Unreported", "Unreported", "Reported", "Unreported", ...
## $ tonnes <dbl> 182.71212, 15.56623, 10.01596, 28.00161, 18.52800
## $ value <dbl> 267855.96, 22820.10, 14683.40, 41050.36, 27162.05
Arrange by species name (in ascending order):
sau.arr <- arrange(sau, species)
glimpse(sau.arr)
## Observations: 275,736
## Variables: 10
## $ year <int> 1950, 1950, 1950, 1950, 1950, 1951, 1951, 1951, 1951, ...
## $ species <chr> "Acoupa weakfish", "Acoupa weakfish", "Acoupa weakfish...
## $ bgroup <chr> "Large demersals (>=90 cm)", "Large demersals (>=90 cm...
## $ cgroup <chr> "Perch-likes", "Perch-likes", "Perch-likes", "Perch-li...
## $ country <chr> "Guyana", "Guyana", "Guyana", "Suriname", "Suriname", ...
## $ sector <chr> "Subsistence", "Artisanal", "Artisanal", "Subsistence"...
## $ cclass <chr> "Landings", "Landings", "Landings", "Landings", "Landi...
## $ status <chr> "Unreported", "Reported", "Unreported", "Unreported", ...
## $ tonnes <dbl> 325.077622, 16.889866, 5.066960, 1016.261896, 176.2660...
## $ value <dbl> 476563.794, 24760.544, 7428.163, 1489839.939, 258405.9...
and in descending order:
sau.darr <- arrange(sau, desc(year))
glimpse(sau.darr)
## Observations: 275,736
## Variables: 10
## $ year <int> 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, ...
## $ species <chr> "Marine fishes nei", "Marine fishes nei", "Marine fish...
## $ bgroup <chr> "Medium demersals (30 - 89 cm)", "Medium demersals (30...
## $ cgroup <chr> "Other fishes & inverts", "Other fishes & inverts", "O...
## $ country <chr> "Antigua & Barbuda", "Antigua & Barbuda", "Antigua & B...
## $ sector <chr> "Subsistence", "Subsistence", "Artisanal", "Recreation...
## $ cclass <chr> "Landings", "Landings", "Landings", "Landings", "Landi...
## $ status <chr> "Reported", "Unreported", "Reported", "Reported", "Unr...
## $ tonnes <dbl> 1.340877e+03, 7.534594e+01, 1.806750e+03, 3.349000e+02...
## $ value <dbl> 1965725.1783, 110457.1498, 2648695.5000, 490963.4000, ...
You can also arrange by more that one column:
sau.arr2 <- arrange(sau, desc(year), species)
glimpse(sau.arr2)
## Observations: 275,736
## Variables: 10
## $ year <int> 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, ...
## $ species <chr> "Acoupa weakfish", "Acoupa weakfish", "Acoupa weakfish...
## $ bgroup <chr> "Large demersals (>=90 cm)", "Large demersals (>=90 cm...
## $ cgroup <chr> "Perch-likes", "Perch-likes", "Perch-likes", "Perch-li...
## $ country <chr> "Guyana", "Guyana", "Guyana", "Belize", "Belize", "Bel...
## $ sector <chr> "Subsistence", "Artisanal", "Artisanal", "Subsistence"...
## $ cclass <chr> "Landings", "Landings", "Landings", "Landings", "Landi...
## $ status <chr> "Unreported", "Reported", "Unreported", "Unreported", ...
## $ tonnes <dbl> 56.6908346, 820.5335653, 247.6644052, 11.2982886, 2.15...
## $ value <dbl> 83108.7636, 1202902.2068, 363076.0180, 16563.2911, 315...
NOTE: Need to check what the unit of the value is
Mutate allows you to add new columns to your data. Let’s e.g. calculate the value per kg
sau.vperkg <-
mutate(sau,
value_per_kg = value/(tonnes * 1000))
You can also do more than one “mutation”, e.g. here we convert tonnes to kilograms and then use that variable to calculate value per kilogram
sau.vperkg <-
mutate(sau,
kg = tonnes/1000,
value_per_kg = value/kg)
Useful to create (not so) simple summaries of the data
sau.summ <-
summarise(sau,
stonnes = sum(tonnes, na.rm = TRUE),
svalue = sum(value, na.rm = TRUE))
Useful helper function
n() ## counts the number of rows
n_distinct(x) ## counts the number distinct values of x
first(x) ## first value of x
last(x) ## last value of x
nth(x,n) ## nth value of x
For example, if we want to calculate the number of records, number species, biological and commercial groups:
sau.summ2 <-
summarise(sau,
n = n(),
nspecies = n_distinct(species),
nbgroup = n_distinct(bgroup),
ncgroup = n_distinct(cgroup))
sau.summ2
## n nspecies nbgroup ncgroup
## 1 275736 441 26 11
Summarize is somewhat useful on its own but become much more useful when combined with a group_by
statement
sau.group <-
group_by(sau, country)
sau.group
## Source: local data frame [275,736 x 10]
## Groups: country [90]
##
## year species bgroup
## <int> <chr> <chr>
## 1 1950 Marine fishes nei Medium demersals (30 - 89 cm)
## 2 1950 Marine fishes nei Medium demersals (30 - 89 cm)
## 3 1950 Marine fishes nei Medium demersals (30 - 89 cm)
## 4 1950 Marine fishes nei Medium demersals (30 - 89 cm)
## 5 1950 Marine fishes nei Medium demersals (30 - 89 cm)
## 6 1950 Marine fishes nei Medium demersals (30 - 89 cm)
## 7 1950 Marine fishes nei Medium demersals (30 - 89 cm)
## 8 1950 Marine fishes nei Medium demersals (30 - 89 cm)
## 9 1950 Marine fishes nei Medium demersals (30 - 89 cm)
## 10 1950 Marine fishes nei Medium demersals (30 - 89 cm)
## # ... with 275,726 more rows, and 7 more variables: cgroup <chr>,
## # country <chr>, sector <chr>, cclass <chr>, status <chr>, tonnes <dbl>,
## # value <dbl>
In the above code we have grouped the data by country (note the second line). If we now repeat the summary calculation we did above on the grouped dataframe (sau.group) we get the summary by each country:
sau.summ3 <-
summarise(sau.group,
n = n(),
nspecies = n_distinct(species),
nbgroup = n_distinct(bgroup),
ncgroup = n_distinct(cgroup))
sau.summ3
We can use the group_by
function on more than one column, e.g. by country and catch-class and then apply some summary statistics:
sau.group2 <-
group_by(sau, country, cclass)
sau.summ4 <-
summarise(sau.group2,
n = n(),
nspecies = n_distinct(species),
nbgroup = n_distinct(bgroup),
ncgroup = n_distinct(cgroup))
sau.summ4
## Source: local data frame [165 x 6]
## Groups: country [?]
##
## country cclass n nspecies nbgroup ncgroup
## <chr> <chr> <int> <int> <int> <int>
## 1 Algeria Discards 1 1 1 1
## 2 Algeria Landings 6 6 2 2
## 3 Anguilla (UK) Landings 3673 22 9 7
## 4 Antigua & Barbuda Landings 4483 20 8 6
## 5 Argentina Discards 17 1 1 1
## 6 Argentina Landings 93 11 2 3
## 7 Aruba (Netherlands) Discards 1849 29 14 9
## 8 Aruba (Netherlands) Landings 2578 36 14 9
## 9 Bahamas Discards 8 1 1 1
## 10 Bahamas Landings 6211 42 10 6
## # ... with 155 more rows
In R one can apply functions to data repeatedly:
sau.summ <-
arrange(summarise(group_by(select(sau, country, tonnes), country), n=n(), stonnes = sum(tonnes)), desc(stonnes))
sau.summ
The problem here is that this is very hard to read because in order to understand what has been done one has to read the code from inside out. What one can read can be written in some kind of a pseudo code:
%>%
operatordplyr allows chaining of operations using the \(\%>\%\) (pipe). By use of the pipe one can get the same as above by:
sau %>%
select(country, tonnes) %>%
group_by(country) %>%
summarise(n = n(),
stonnes = sum(tonnes)) %>%
arrange(desc(stonnes))
This needs a bit of explaining. Instead of calling a function with f(x,y)
you can do x %>% f(y)
, that is “take x, then apply f with a setting y”. This essentially means that %>%
operator takes the stuff on the left and places it as the first argument to function on the right hand side. If we take e.g. the first bit of code above:
# not run
sau %>%
select(country, tonnes)
It basically means that we pass the dataframe sau as the first argument in the select function.
Data are normally not tidy …