Preamble


Commonly, when collating summaries by group, one wants to:

  • Split up a big data structure into homogeneous pieces,
  • Apply a function to each piece
  • Combine all the results back together.

For example, one might want to:

  • Calculate summary statistics for each category (read: group)
  • Perform group-wise transformations (read: apply), such as summarising or re-scaling/standardizing
  • Fit the same model to different subsets of the data

The tidyverse, through the dplyr package, comes to the rescue.

  • Similar to ggplot2, it features a Domain Specific Language (DSL) specially designed for data summaries.
  • Developed by Hadley Wickham, the creator of ggplot2 and other useful tools.

Essentially, dplyr offers a collection of simple but powerful commands (or verbs) that facilitate this split-apply-combine process:

  • filter: keep rows matching criteria
  • select: pick columns by name
  • arrange: order the rows according to a variable
  • mutate: add new variables
  • summarise: reduce variables to values
  • group_by: define the groups that the analysis functions are applied to

The structure of these commands is always the same:

  • The first argument to the function is a data frame
  • Subsequent arguments say what to do with the data frame
  • They always return a data frame
    • this is the key difference between the tidyverse and the base R approach
  • They recognize the columns of the data frame as variables, i.e. one only needs to write col.name instead of dat$col.name
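
As a minimal sketch of that last point, compare a base R subset with its dplyr equivalent (using the minke data d that is imported in the next section):

d[d$year == 2004 & d$sex == "Male", ]   # base R: columns referenced as d$col.name
filter(d, year == 2004, sex == "Male")  # dplyr: bare column names, returns a tibble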

Manipulate Cases


Import the data used:

library(tidyverse)
d <- 
  read_csv("ftp://ftp.hafro.is/pub/data/csv/minke.csv")

filter: Extract rows

One can extract rows that meet logical criteria by using the filter command. The first argument is the name of the data frame, with subsequent argument(s) being logical expressions. E.g. one subsets the minke data to only the year 2004 by:

filter(d, year == 2004)
## # A tibble: 25 × 13
##       id date                  lon   lat area  length weight   age sex   matur…¹
##    <dbl> <dttm>              <dbl> <dbl> <chr>  <dbl>  <dbl> <dbl> <chr> <chr>  
##  1     1 2004-06-10 22:00:00 -21.4  65.7 North    780     NA  11.3 Fema… pregna…
##  2   690 2004-06-15 17:00:00 -21.4  65.7 North    793     NA  NA   Fema… pregna…
##  3   926 2004-06-22 01:00:00 -19.8  66.5 North    858     NA  15.5 Fema… pregna…
##  4  1430 2004-06-22 09:00:00 -19.8  66.5 North    840     NA  15.9 Fema… pregna…
##  5  1432 2004-06-26 05:00:00 -19.6  65.9 North    761     NA  14   Fema… pregna…
##  6  1434 2004-07-04 16:00:00 -21.0  65.5 North    819     NA  16.6 Fema… anoest…
##  7  1435 2004-06-04 19:00:00 -22.5  64.6 South    836     NA  46.8 Male  mature 
##  8  1437 2004-06-15 13:00:00 -23.0  64.2 South    764     NA   9.9 Male  mature 
##  9  1439 2004-06-08 07:00:00 -22.8  64.6 South    774   4620  10.9 Fema… pregna…
## 10  1440 2004-06-16 11:00:00 -23.4  64.2 South    778     NA  25.3 Male  mature 
## # … with 15 more rows, 3 more variables: stomach.volume <dbl>,
## #   stomach.weight <dbl>, year <dbl>, and abbreviated variable name ¹​maturity

Same as:

filter(d, year %in% 2004)

Only data from 2004 and onwards:

filter(d, year >= 2004)
## # A tibble: 154 × 13
##       id date                  lon   lat area  length weight   age sex   matur…¹
##    <dbl> <dttm>              <dbl> <dbl> <chr>  <dbl>  <dbl> <dbl> <chr> <chr>  
##  1     1 2004-06-10 22:00:00 -21.4  65.7 North    780     NA  11.3 Fema… pregna…
##  2   690 2004-06-15 17:00:00 -21.4  65.7 North    793     NA  NA   Fema… pregna…
##  3   926 2004-06-22 01:00:00 -19.8  66.5 North    858     NA  15.5 Fema… pregna…
##  4  1430 2004-06-22 09:00:00 -19.8  66.5 North    840     NA  15.9 Fema… pregna…
##  5  1432 2004-06-26 05:00:00 -19.6  65.9 North    761     NA  14   Fema… pregna…
##  6  1434 2004-07-04 16:00:00 -21.0  65.5 North    819     NA  16.6 Fema… anoest…
##  7  1435 2004-06-04 19:00:00 -22.5  64.6 South    836     NA  46.8 Male  mature 
##  8  1437 2004-06-15 13:00:00 -23.0  64.2 South    764     NA   9.9 Male  mature 
##  9  1439 2004-06-08 07:00:00 -22.8  64.6 South    774   4620  10.9 Fema… pregna…
## 10  1440 2004-06-16 11:00:00 -23.4  64.2 South    778     NA  25.3 Male  mature 
## # … with 144 more rows, 3 more variables: stomach.volume <dbl>,
## #   stomach.weight <dbl>, year <dbl>, and abbreviated variable name ¹​maturity

Using subsequent arguments one can refine the subset further:

filter(d, year >= 2004 & year < 2006)
filter(d, year >= 2004, year < 2006)     # same as above
filter(d, between(year, 2004, 2005))     # same as above

All but the year 2004 would be:

filter(d, year != 2004)

But this would give the years 2005 to 2007:

filter(d, year %in% 2005:2007)

Which would be the same as:

filter(d, !year %in% 2003:2004)

Filter takes any logical statement:

 x == a   # x is equal to a
 x %in% a # x is "within" a
 x != a   # x is not equal to a
 x > a    # x is greater than a
 x >= a   # x is greater or equal to a
 x < a    # x is less than a
 x <= a   # x is less or equal to a
 a & b    # a and b
 a | b    # a or b
 is.na(x) # is x equal to NA (missing)
 ...      # ...
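
For example, to extract the whales where the weight was not recorded (a minimal sketch on the minke data):

filter(d, is.na(weight))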

The arguments can operate on different variables. E.g. to extract mature males caught in 2007 one would write:

filter(d, maturity == "mature", sex == "Male", year == 2007)

NOTE: A comma is read as "AND". If one wants to specify "OR", use the "|" operator:

filter(d, maturity == "mature" | sex == "Male" | year == 2007)
Exercise
  • Find all males caught in the northern area in 2006 and 2007
  • Find all males that are either immature or pregnant
  • Find all whales caught that did not have a maturity status determined
  • Find all whales that were caught in year 2007 or were mature males

select: Extract columns

The select function allows you to extract certain columns:

select(d, id, year, maturity)
## # A tibble: 190 × 3
##       id  year maturity
##    <dbl> <dbl> <chr>   
##  1     1  2004 pregnant
##  2   690  2004 pregnant
##  3   926  2004 pregnant
##  4  1333  2003 immature
##  5  1334  2003 immature
##  6  1335  2003 immature
##  7  1336  2003 pregnant
##  8  1338  2003 pregnant
##  9  1339  2003 pubertal
## 10  1341  2003 mature  
## # … with 180 more rows

You can also omit certain columns using negative indexing: for example, you can select all columns except length:

select(d, -length)
## # A tibble: 190 × 12
##       id date                  lon   lat area  weight   age sex    maturity
##    <dbl> <dttm>              <dbl> <dbl> <chr>  <dbl> <dbl> <chr>  <chr>   
##  1     1 2004-06-10 22:00:00 -21.4  65.7 North     NA  11.3 Female pregnant
##  2   690 2004-06-15 17:00:00 -21.4  65.7 North     NA  NA   Female pregnant
##  3   926 2004-06-22 01:00:00 -19.8  66.5 North     NA  15.5 Female pregnant
##  4  1333 2003-09-30 16:00:00 -21.6  65.7 North     NA   7.2 Male   immature
##  5  1334 2003-09-25 15:00:00 -15.6  66.3 North     NA  12.3 Female immature
##  6  1335 2003-09-16 16:00:00 -18.7  66.2 North     NA   9.6 Female immature
##  7  1336 2003-09-12 17:00:00 -21.5  65.7 North     NA  17.3 Female pregnant
##  8  1338 2003-09-09 12:00:00 -22.8  66.1 North     NA  13.8 Female pregnant
##  9  1339 2003-08-31 13:00:00 -17.5  66.6 North     NA  12.2 Male   pubertal
## 10  1341 2003-08-30 17:00:00 -14.7  66.2 North     NA  15.4 Male   mature  
## # … with 180 more rows, and 3 more variables: stomach.volume <dbl>,
## #   stomach.weight <dbl>, year <dbl>

A combination of variables to drop could be written as:

select(d, -c(weight, maturity))

The select function has some useful helper functions:

starts_with('stomach')  # Finds all columns that start with "stomach"
ends_with('weight')     # Finds all columns that end with "weight"
id:age                  # Finds all columns between "id" and "age"
contains('mach')        # Finds all columns that contain "mach"
-date                   # Remove column "date" from the dataset
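
These helpers are used inside select; e.g. the two stomach columns could be picked with (a minimal sketch):

select(d, starts_with("stomach"))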

And you can of course combine these at will:

select(d, id:length, starts_with('stomach'))
## # A tibble: 190 × 8
##       id date                  lon   lat area  length stomach.volume stomach.w…¹
##    <dbl> <dttm>              <dbl> <dbl> <chr>  <dbl>          <dbl>       <dbl>
##  1     1 2004-06-10 22:00:00 -21.4  65.7 North    780             58      31.9  
##  2   690 2004-06-15 17:00:00 -21.4  65.7 North    793             90      36.3  
##  3   926 2004-06-22 01:00:00 -19.8  66.5 North    858             24       9.42 
##  4  1333 2003-09-30 16:00:00 -21.6  65.7 North    567             25       3.64 
##  5  1334 2003-09-25 15:00:00 -15.6  66.3 North    774             85       5.51 
##  6  1335 2003-09-16 16:00:00 -18.7  66.2 North    526             18       1.17 
##  7  1336 2003-09-12 17:00:00 -21.5  65.7 North    809            200      99    
##  8  1338 2003-09-09 12:00:00 -22.8  66.1 North    820            111      64.3  
##  9  1339 2003-08-31 13:00:00 -17.5  66.6 North    697              8       0.801
## 10  1341 2003-08-30 17:00:00 -14.7  66.2 North    777             25       1.42 
## # … with 180 more rows, and abbreviated variable name ¹​stomach.weight

select also allows you to rename columns as you select them:

select(d, id, yr = year)
## # A tibble: 190 × 2
##       id    yr
##    <dbl> <dbl>
##  1     1  2004
##  2   690  2004
##  3   926  2004
##  4  1333  2003
##  5  1334  2003
##  6  1335  2003
##  7  1336  2003
##  8  1338  2003
##  9  1339  2003
## 10  1341  2003
## # … with 180 more rows

but this only keeps the requested columns; the others are dropped.

rename: Rename columns

If you just want to rename a couple of columns in the data frame, leaving the other columns intact, you can use the function rename:

rename(d, time = date)
## # A tibble: 190 × 13
##       id time                  lon   lat area  length weight   age sex   matur…¹
##    <dbl> <dttm>              <dbl> <dbl> <chr>  <dbl>  <dbl> <dbl> <chr> <chr>  
##  1     1 2004-06-10 22:00:00 -21.4  65.7 North    780     NA  11.3 Fema… pregna…
##  2   690 2004-06-15 17:00:00 -21.4  65.7 North    793     NA  NA   Fema… pregna…
##  3   926 2004-06-22 01:00:00 -19.8  66.5 North    858     NA  15.5 Fema… pregna…
##  4  1333 2003-09-30 16:00:00 -21.6  65.7 North    567     NA   7.2 Male  immatu…
##  5  1334 2003-09-25 15:00:00 -15.6  66.3 North    774     NA  12.3 Fema… immatu…
##  6  1335 2003-09-16 16:00:00 -18.7  66.2 North    526     NA   9.6 Fema… immatu…
##  7  1336 2003-09-12 17:00:00 -21.5  65.7 North    809     NA  17.3 Fema… pregna…
##  8  1338 2003-09-09 12:00:00 -22.8  66.1 North    820     NA  13.8 Fema… pregna…
##  9  1339 2003-08-31 13:00:00 -17.5  66.6 North    697     NA  12.2 Male  pubert…
## 10  1341 2003-08-30 17:00:00 -14.7  66.2 North    777     NA  15.4 Male  mature 
## # … with 180 more rows, 3 more variables: stomach.volume <dbl>,
## #   stomach.weight <dbl>, year <dbl>, and abbreviated variable name ¹​maturity
Exercise
  • Select age, length and id from the minke dataset, and rename id to “whale_id”
  • Select the id column and all columns that contain the text “weight”.

arrange: Order rows

To sort the data we employ the arrange-function. Sorting by length (in ascending order) is done as follows:

arrange(d, length)
## # A tibble: 190 × 13
##       id date                  lon   lat area  length weight   age sex   matur…¹
##    <dbl> <dttm>              <dbl> <dbl> <chr>  <dbl>  <dbl> <dbl> <chr> <chr>  
##  1 20927 2006-07-19 15:00:00 -19.2  66.3 North    461     NA   3.4 Fema… immatu…
##  2 20254 2005-08-10 16:00:00 -16.8  66.4 North    474     NA   3.4 Male  immatu…
##  3 20421 2005-07-19 17:00:00 -15.9  64.1 South    482     NA   3.6 Male  immatu…
##  4  1481 2004-07-03 09:00:00 -15.8  64   South    502     NA   6.7 Fema… immatu…
##  5  1344 2003-08-26 16:00:00 -19.3  66.2 North    508     NA  NA   Male  immatu…
##  6  1504 2003-08-18 17:00:00 -23.9  65.0 South    520     NA  13.3 Male  immatu…
##  7  1335 2003-09-16 16:00:00 -18.7  66.2 North    526     NA   9.6 Fema… immatu…
##  8  1343 2003-08-28 19:00:00 -15.8  66.4 North    542     NA  NA   Fema… immatu…
##  9 19211 2005-07-23 16:00:00 -14.2  65.8 North    564     NA   5.2 Fema… immatu…
## 10 20427 2005-08-03 15:00:00 -23.1  64.6 South    566   1663   6.1 Male  mature 
## # … with 180 more rows, 3 more variables: stomach.volume <dbl>,
## #   stomach.weight <dbl>, year <dbl>, and abbreviated variable name ¹​maturity

and in descending order:

arrange(d, desc(length))
## # A tibble: 190 × 13
##       id date                  lon   lat area  length weight   age sex   matur…¹
##    <dbl> <dttm>              <dbl> <dbl> <chr>  <dbl>  <dbl> <dbl> <chr> <chr>  
##  1 20255 2005-08-17 06:00:00 -21.5  65.7 North    871     NA  10.5 Fema… pregna…
##  2 21023 2007-06-14 19:00:00 -14.4  66.2 North    871     NA  29.7 Fema… pregna…
##  3 20424 2005-07-27 21:00:00 -23.0  64.5 South    870     NA  47.4 Male  mature 
##  4  1500 2003-08-31 07:00:00 -24.2  65.1 South    861     NA  NA   Fema… pregna…
##  5 19213 2005-08-04 14:00:00 -16.5  63.7 South    860     NA  29.6 Fema… pregna…
##  6   926 2004-06-22 01:00:00 -19.8  66.5 North    858     NA  15.5 Fema… pregna…
##  7 20971 2006-07-20 08:00:00 -18.4  67.0 North    858     NA  25.1 Fema… pregna…
##  8 20978 2006-08-12 18:00:00 -22.6  66.1 North    857     NA  NA   Male  mature 
##  9  1447 2004-07-04 11:00:00 -14.9  64.2 South    853     NA  12.1 Fema… pregna…
## 10 20261 2005-08-09 21:00:00 -22.7  64.3 South    853     NA  24.1 Fema… pregna…
## # … with 180 more rows, 3 more variables: stomach.volume <dbl>,
## #   stomach.weight <dbl>, year <dbl>, and abbreviated variable name ¹​maturity

You can also arrange by more than one column:

arrange(d, sex, desc(length))
## # A tibble: 190 × 13
##       id date                  lon   lat area  length weight   age sex   matur…¹
##    <dbl> <dttm>              <dbl> <dbl> <chr>  <dbl>  <dbl> <dbl> <chr> <chr>  
##  1 20255 2005-08-17 06:00:00 -21.5  65.7 North    871     NA  10.5 Fema… pregna…
##  2 21023 2007-06-14 19:00:00 -14.4  66.2 North    871     NA  29.7 Fema… pregna…
##  3  1500 2003-08-31 07:00:00 -24.2  65.1 South    861     NA  NA   Fema… pregna…
##  4 19213 2005-08-04 14:00:00 -16.5  63.7 South    860     NA  29.6 Fema… pregna…
##  5   926 2004-06-22 01:00:00 -19.8  66.5 North    858     NA  15.5 Fema… pregna…
##  6 20971 2006-07-20 08:00:00 -18.4  67.0 North    858     NA  25.1 Fema… pregna…
##  7  1447 2004-07-04 11:00:00 -14.9  64.2 South    853     NA  12.1 Fema… pregna…
##  8 20261 2005-08-09 21:00:00 -22.7  64.3 South    853     NA  24.1 Fema… pregna…
##  9  1449 2004-07-05 05:00:00 -16.6  63.8 South    852     NA  25   Fema… pregna…
## 10 20991 2006-08-24 13:00:00 -14.0  65.6 North    850     NA  36.6 Fema… pregna…
## # … with 180 more rows, 3 more variables: stomach.volume <dbl>,
## #   stomach.weight <dbl>, year <dbl>, and abbreviated variable name ¹​maturity

mutate: Compute new column

mutate allows you to add new columns to your data. Let’s calculate the approximate weight:

d2 <- select(d, id, length, weight)
mutate(d2, computed_weight = 0.00001 * length^3)
## # A tibble: 190 × 4
##       id length weight computed_weight
##    <dbl>  <dbl>  <dbl>           <dbl>
##  1     1    780     NA           4746.
##  2   690    793     NA           4987.
##  3   926    858     NA           6316.
##  4  1333    567     NA           1823.
##  5  1334    774     NA           4637.
##  6  1335    526     NA           1455.
##  7  1336    809     NA           5295.
##  8  1338    820     NA           5514.
##  9  1339    697     NA           3386.
## 10  1341    777     NA           4691.
## # … with 180 more rows

You can also do more than one “mutation”:

mutate(d2,
       computed_weight = 0.00001 * length^3,
       approx_weight = ifelse(is.na(weight), 0.00001 * length^3, weight))
## # A tibble: 190 × 5
##       id length weight computed_weight approx_weight
##    <dbl>  <dbl>  <dbl>           <dbl>         <dbl>
##  1     1    780     NA           4746.         4746.
##  2   690    793     NA           4987.         4987.
##  3   926    858     NA           6316.         6316.
##  4  1333    567     NA           1823.         1823.
##  5  1334    774     NA           4637.         4637.
##  6  1335    526     NA           1455.         1455.
##  7  1336    809     NA           5295.         5295.
##  8  1338    820     NA           5514.         5514.
##  9  1339    697     NA           3386.         3386.
## 10  1341    777     NA           4691.         4691.
## # … with 180 more rows

To make our lives a bit easier, mutate “remembers” earlier transformations within the sequence:

mutate(d2,
       computed_weight = 0.00001 * length^3,
       approx_weight = ifelse(is.na(weight), computed_weight, weight))
Exercise
  • Add a column to the minke whale dataset where Fulton’s condition factor, \(K=100{,}000\,(W/L^3)\), is calculated

Summarise cases

summarise:

To summarise data one uses the summarise-function. Below we calculate the number of observations (using the n()-function) and the mean minke length.

summarise(d,
          n.obs = n(),
          ml = mean(length, na.rm = TRUE))
## # A tibble: 1 × 2
##   n.obs    ml
##   <int> <dbl>
## 1   190  753.

group_by:

The power of the summarise command is revealed when used in conjunction with the group_by-function. The latter function splits the data into groups based on one or more variables. E.g. one can split the minke table by maturity:

by_maturity <- 
  group_by(d, maturity)
by_maturity
## # A tibble: 190 × 13
## # Groups:   maturity [6]
##       id date                  lon   lat area  length weight   age sex   matur…¹
##    <dbl> <dttm>              <dbl> <dbl> <chr>  <dbl>  <dbl> <dbl> <chr> <chr>  
##  1     1 2004-06-10 22:00:00 -21.4  65.7 North    780     NA  11.3 Fema… pregna…
##  2   690 2004-06-15 17:00:00 -21.4  65.7 North    793     NA  NA   Fema… pregna…
##  3   926 2004-06-22 01:00:00 -19.8  66.5 North    858     NA  15.5 Fema… pregna…
##  4  1333 2003-09-30 16:00:00 -21.6  65.7 North    567     NA   7.2 Male  immatu…
##  5  1334 2003-09-25 15:00:00 -15.6  66.3 North    774     NA  12.3 Fema… immatu…
##  6  1335 2003-09-16 16:00:00 -18.7  66.2 North    526     NA   9.6 Fema… immatu…
##  7  1336 2003-09-12 17:00:00 -21.5  65.7 North    809     NA  17.3 Fema… pregna…
##  8  1338 2003-09-09 12:00:00 -22.8  66.1 North    820     NA  13.8 Fema… pregna…
##  9  1339 2003-08-31 13:00:00 -17.5  66.6 North    697     NA  12.2 Male  pubert…
## 10  1341 2003-08-30 17:00:00 -14.7  66.2 North    777     NA  15.4 Male  mature 
## # … with 180 more rows, 3 more variables: stomach.volume <dbl>,
## #   stomach.weight <dbl>, year <dbl>, and abbreviated variable name ¹​maturity

The table appears “intact” because it still has 190 observations, but it now has 6 groups, representing the different values of the maturity variable in the data (anoestrous, immature, mature, pregnant, pubertal and NA).
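
If needed, the grouping can be inspected directly; a minimal sketch using the dplyr helpers group_vars and n_groups:

group_vars(by_maturity)   # which variable(s) define the groups
n_groups(by_maturity)     # how many groups there are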

The summarise-command respects the grouping, as shown by running the same command as above, but now on the grouped data frame:

summarise(by_maturity,
          n.obs = n(),
          mean = mean(length, na.rm = TRUE))
## # A tibble: 6 × 3
##   maturity   n.obs  mean
##   <chr>      <int> <dbl>
## 1 anoestrous     6  784.
## 2 immature      25  619.
## 3 mature        74  758.
## 4 pregnant      66  800.
## 5 pubertal       6  692.
## 6 <NA>          13  752.
Exercise
  • Calculate the minimum, median, mean, standard deviation and standard error of whale lengths in each year

Combining verbs


In R one can apply functions to data by nesting one call inside another:

summarise(group_by(d, sex, year), obs = n())
## # A tibble: 10 × 3
## # Groups:   sex [2]
##    sex     year   obs
##    <chr>  <dbl> <int>
##  1 Female  2003    13
##  2 Female  2004    15
##  3 Female  2005    14
##  4 Female  2006    27
##  5 Female  2007    25
##  6 Male    2003    23
##  7 Male    2004    10
##  8 Male    2005    20
##  9 Male    2006    31
## 10 Male    2007    12

All this quickly becomes cumbersome and hard to read. For example, what does this do?

arrange(summarise(group_by(d, sex, year), obs = n()), desc(obs))

The %>% operator

dplyr allows chaining of operations using the %>% operator, often referred to as the pipe operator. The above code then becomes:

d %>% 
  group_by(sex, year) %>% 
  summarise(obs = n()) %>% 
  arrange(desc(obs))
## # A tibble: 10 × 3
## # Groups:   sex [2]
##    sex     year   obs
##    <chr>  <dbl> <int>
##  1 Male    2006    31
##  2 Female  2006    27
##  3 Female  2007    25
##  4 Male    2003    23
##  5 Male    2005    20
##  6 Female  2004    15
##  7 Female  2005    14
##  8 Female  2003    13
##  9 Male    2007    12
## 10 Male    2004    10

This needs a bit of explaining. Notice that the first argument in each function is missing. Instead of calling a function with f(x, y) you can write x %>% f(y), that is “take x, then apply f with setting y”. In essence, the %>% operator takes what is on the left and places it as the first argument of the function on the right-hand side. This gets infinitely more useful when you start creating longer chains:

d %>%
  # note - here we overwrite the weight column:
  mutate(weight  = ifelse(!is.na(weight), weight, 0.00001 * length^3),
         K = 100 * weight * 1000 / length^3) %>% 
  group_by(sex, year) %>% 
  summarise(n=n(),
            ml = mean(length),
            sl = sd(length),
            mK = mean(K),
            sK = sd(K)) %>% 
  arrange(year)
## # A tibble: 10 × 7
## # Groups:   sex [2]
##    sex     year     n    ml    sl    mK       sK
##    <chr>  <dbl> <int> <dbl> <dbl> <dbl>    <dbl>
##  1 Female  2003    13  761. 107.  1     1.60e-16
##  2 Male    2003    23  722.  90.2 1     1.59e-16
##  3 Female  2004    15  763.  98.0 1.00  9.38e- 4
##  4 Male    2004    10  764.  47.2 1     1.28e-16
##  5 Female  2005    14  754.  92.7 0.997 6.92e- 2
##  6 Male    2005    20  724. 103.  0.974 4.80e- 2
##  7 Female  2006    27  770.  86.7 0.993 3.47e- 2
##  8 Male    2006    31  743.  56.7 0.998 1.75e- 2
##  9 Female  2007    25  772   60.8 0.988 9.89e- 2
## 10 Male    2007    12  772.  39.8 1.01  3.34e- 2

It is recommended that you “read” the %>% as “then”. Hence the above can be read as:

  1. take data d
  2. then compute the weight where it is missing, and calculate K
  3. then split into groups by sex and year
  4. then calculate the mean and standard deviation of length and K for each group
  5. then arrange by year

Note that RStudio has a built-in shortcut for the %>% operator: [ctrl] + [shift] + M.
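
As a minimal illustration of the x %>% f(y) form, the two calls below are equivalent:

filter(d, year == 2004)      # ordinary function call
d %>% filter(year == 2004)   # the same call, written with the pipe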

Exercise
  • How do you produce this table, the statistic being based on length?
## # A tibble: 4 × 3
## # Groups:   sex [2]
##   sex    area  standard.error
##   <chr>  <chr>          <dbl>
## 1 Female North          10.3 
## 2 Female South          16.3 
## 3 Male   North          18.9 
## 4 Male   South           8.02

Tidy dataframes

The definition of a tidy dataset: each column is a variable, and each row is an observation. The SMB dataset (ftp://ftp.hafro.is/pub/data/csv/is_smb.csv) is an example of an untidy dataset. Let’s take a look:

smb <- 
  read_csv("ftp://ftp.hafro.is/pub/data/csv/is_smb.csv")
glimpse(smb)
## Rows: 19,846
## Columns: 40
## $ id             <dbl> 44929, 44928, 44927, 44926, 44925, 44922, 44921, 44920,…
## $ date           <date> 1990-03-16, 1990-03-16, 1990-03-16, 1990-03-16, 1990-0…
## $ year           <dbl> 1990, 1990, 1990, 1990, 1990, 1990, 1990, 1990, 1990, 1…
## $ vid            <dbl> 1307, 1307, 1307, 1307, 1307, 1307, 1307, 1307, 1307, 1…
## $ tow_id         <chr> "670-15", "670-13", "670-12", "670-1", "670-2", "720-1"…
## $ t1             <dttm> 1990-03-16 05:05:00, 1990-03-16 07:48:00, 1990-03-16 0…
## $ t2             <dttm> 1990-03-16 06:05:00, 1990-03-16 08:51:00, 1990-03-16 1…
## $ lon1           <dbl> -20.70367, -20.49767, -20.44650, -20.21500, -19.99100, …
## $ lat1           <dbl> 66.52250, 66.50917, 66.58583, 66.57417, 66.65767, 67.24…
## $ lon2           <dbl> -20.73167, -20.53350, -20.29083, -20.07750, -19.98883, …
## $ lat2           <dbl> 66.59167, 66.57583, 66.61550, 66.61600, 66.72383, 67.29…
## $ ir             <chr> "62C9", "62C9", "62C9", "62C9", "62D0", "63C9", "63C9",…
## $ ir_lon         <dbl> -20.5, -20.5, -20.5, -20.5, -19.5, -20.5, -20.5, -20.5,…
## $ ir_lat         <dbl> 66.75, 66.75, 66.75, 66.75, 66.75, 67.25, 67.25, 67.25,…
## $ z1             <dbl> 297, 302, 205, 88, 129, 249, 332, 244, 291, 230, 140, 3…
## $ z2             <dbl> 330, 314, 291, 117, 172, 307, 360, 256, 282, 270, 158, …
## $ temp_s         <dbl> 1.1, 1.1, 1.0, 1.1, 0.9, 1.3, 1.2, 1.2, 1.1, 1.0, 1.0, …
## $ temp_b         <dbl> NA, 1.0, NA, 1.0, 1.1, 1.0, -0.5, 0.9, 0.9, 0.9, NA, -0…
## $ speed          <dbl> 4.0, 3.8, 3.2, 4.0, 4.1, 4.1, 3.9, 3.8, 3.9, 3.9, 4.0, …
## $ duration       <dbl> 60, 63, 74, 60, 59, 59, 61, 64, 62, 62, 60, 59, 65, 66,…
## $ towlength      <dbl> 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 3, 4…
## $ horizontal     <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ verical        <dbl> 2.0, 3.0, 2.4, 2.6, NA, NA, NA, 2.2, 2.0, 2.0, 2.0, 2.8…
## $ wind           <dbl> 15, 15, 15, 15, 12, 9, 9, 7, 7, 5, 5, 3, 7, 7, 1, 7, 7,…
## $ wind_direction <dbl> 5, 5, 5, 9, 9, 9, 9, 9, 9, 14, 14, 5, 5, 5, 32, 36, 5, …
## $ bormicon       <dbl> 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 2…
## $ oldstrata      <dbl> 34, 34, 34, 34, 34, 33, 36, 33, 33, 34, 34, 34, 34, 41,…
## $ newstrata      <dbl> 31, 31, 31, 32, 32, 18, 37, 18, 18, 31, 32, 31, 31, 36,…
## $ cod_kg         <dbl> 56.30017, 29.28811, 34.62728, 8.32594, 43.83642, 1.2309…
## $ cod_n          <dbl> 155, 75, 125, 13, 81, 6, 0, 3, 0, 52, 330, 77, 30, 48, …
## $ haddock_kg     <dbl> 0.50302238, 0.00000000, 0.46842356, 2.39978621, 11.5888…
## $ haddock_n      <dbl> 16, 0, 17, 7, 42, 0, 0, 0, 0, 12, 10, 0, 0, 0, 0, 0, 1,…
## $ saithe_kg      <dbl> 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0…
## $ saithe_n       <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 5893…
## $ wolffish_kg    <dbl> 4.7770270, 3.1999367, 1.7183545, 8.4568957, 11.6068288,…
## $ wolffish_n     <dbl> 12, 1, 2, 27, 26, 0, 0, 0, 0, 2, 69, 0, 0, 27, 0, 0, 29…
## $ plaice_kg      <dbl> 0.0000000, 0.0000000, 0.3660489, 4.2098115, 2.4292287, …
## $ plaice_n       <dbl> 0, 0, 1, 6, 3, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0…
## $ monkfish_kg    <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ monkfish_n     <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…

Can you identify columns that should actually be observations?

pivot_longer

To make a table longer, we can employ the pivot_longer-function. Let’s just take the biological variables:

wide <- 
  smb %>% 
  select(id, cod_kg:monkfish_n)
long <- 
  wide %>% 
  # select just the abundance variables
  select(id, ends_with("_n")) %>% 
  pivot_longer(cols = -id, names_to = "species", values_to = "n") %>% 
  mutate(species = str_remove(species, "_n"))
glimpse(long)
## Rows: 119,076
## Columns: 3
## $ id      <dbl> 44929, 44929, 44929, 44929, 44929, 44929, 44928, 44928, 44928,…
## $ species <chr> "cod", "haddock", "saithe", "wolffish", "plaice", "monkfish", …
## $ n       <dbl> 155, 16, 0, 12, 0, 0, 75, 0, 0, 1, 0, 0, 125, 17, 0, 2, 1, 0, …

So we have moved from a dataframe that was 19846 rows with 7 variables to a dataframe of 119076 (19846 stations x 6 species) rows with only 3 variables:

  • id: Station id
  • species: Species name
  • n: Abundance (standardized to 4 nautical miles)

This latter (longer) format is the proper format for efficient computer programming. The following exercise should illustrate that.

Exercise
  1. Use the wide-dataframe above and write code to calculate the median abundance (the *_n columns) for each species
  2. Use the long-dataframe above and write code to calculate the median abundance for each species
## # A tibble: 1 × 6
##     cod haddock saithe wolfish plaice monkfish
##   <dbl>   <dbl>  <dbl>   <dbl>  <dbl>    <dbl>
## 1    50      53      0       8      0        0
## # A tibble: 6 × 2
##   species  median
##   <chr>     <dbl>
## 1 cod          50
## 2 haddock      53
## 3 monkfish      0
## 4 plaice        0
## 5 saithe        0
## 6 wolffish      8

pivot_wider

When reconstructing untidy tables we sometimes need to make a long table wider again. In the above example we only made a long table for abundance (n). We could modify the above code for the biomass (kg), but doing both in one step requires the use of pivot_wider, the steps being:

  1. Make a long table containing id, the current variable names (cod_n, cod_kg, …) and the corresponding values (abundance or biomass)
  2. Separate the value measured (kg or n) from the species name
  3. Generate separate columns for abundance (n) and biomass (kg)
long <-
  wide %>% 
  # step 1
  pivot_longer(-id) %>%
  # step 2
  separate(name, sep = "_", into = c("species", "variable")) %>% 
  # step 3
  pivot_wider(names_from = variable)
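
For the record, the trip back to a wide table, with one column per species and measure, could look something like this (a sketch; the resulting column names, e.g. kg_cod, differ from the original cod_kg):

long %>% 
  pivot_wider(names_from = species, values_from = c(kg, n))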

Combine variables (join)


Normally the data we are interested in working with do not reside in one table. E.g. for a typical groundfish survey one stores the station table separately from the biological table. Let’s read in the SMB tables in such a format:

station <- 
  read_csv("ftp://ftp.hafro.is/pub/data/csv/is_smb_stations.csv")
biological <- 
  read_csv("ftp://ftp.hafro.is/pub/data/csv/is_smb_biological.csv")
station %>% select(id:lat1) %>%  arrange(id) %>% glimpse()
## Rows: 19,846
## Columns: 9
## $ id     <dbl> 29654, 29655, 29656, 29657, 29658, 29659, 29660, 29661, 29662, …
## $ date   <date> 1985-03-21, 1985-03-20, 1985-03-20, 1985-03-20, 1985-03-20, 19…
## $ year   <dbl> 1985, 1985, 1985, 1985, 1985, 1985, 1985, 1985, 1985, 1985, 198…
## $ vid    <dbl> 1273, 1273, 1273, 1273, 1273, 1273, 1273, 1273, 1273, 1273, 127…
## $ tow_id <chr> "374-3", "374-11", "374-12", "373-13", "373-11", "373-12", "373…
## $ t1     <dttm> 1985-03-21 00:45:00, 1985-03-20 22:25:00, 1985-03-20 20:35:00,…
## $ t2     <dttm> 1985-03-21 01:50:00, 1985-03-20 23:25:00, 1985-03-20 21:45:00,…
## $ lon1   <dbl> -24.17967, -24.19783, -24.40467, -23.99833, -23.67983, -23.7210…
## $ lat1   <dbl> 63.96350, 63.75367, 63.71817, 63.58000, 63.74767, 63.83767, 63.…
biological %>% arrange(id) %>% glimpse() 
## Rows: 68,834
## Columns: 4
## $ id      <dbl> 29654, 29654, 29654, 29654, 29655, 29655, 29655, 29655, 29656,…
## $ species <chr> "cod", "haddock", "saithe", "wolffish", "cod", "haddock", "sai…
## $ kg      <dbl> 53.4895900, 29.6435494, 22.8682684, 0.6901652, 29.6210500, 203…
## $ n       <dbl> 13, 15, 5, 1, 6, 155, 23, 4, 6, 156, 39, 4, 43, 421, 8, 11, 25…

Here the information on the station, such as date and geographical location, is kept separate from the detailed measurements performed at each station (weight and numbers by species). The records in the station table are unique (i.e. the information for each station only appears once), while more than one species can be measured at each station. Take note that the biological table does not contain records of species that were not observed at a station. The link between the two tables, in this case, is the variable id.

left_join: Matching values from y to x

Let’s say we were interested in combining the geographical position and the species records for one station:

station %>% 
  select(id, date, lon1, lat1) %>% 
  filter(id == 29654) %>% 
  left_join(biological)
## # A tibble: 4 × 7
##      id date        lon1  lat1 species      kg     n
##   <dbl> <date>     <dbl> <dbl> <chr>     <dbl> <dbl>
## 1 29654 1985-03-21 -24.2  64.0 cod      53.5      13
## 2 29654 1985-03-21 -24.2  64.0 haddock  29.6      15
## 3 29654 1985-03-21 -24.2  64.0 saithe   22.9       5
## 4 29654 1985-03-21 -24.2  64.0 wolffish  0.690     1

Take note here:

  • The joining is by the common variable name in the two tables (here id).
  • We only have records of four species in this table, i.e. a value of zero for a species is not stored in the data.
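
The join variable can also be stated explicitly via the by argument, should one want to be certain which columns are used (a minimal sketch of the same join):

station %>% 
  select(id, date, lon1, lat1) %>% 
  filter(id == 29654) %>% 
  left_join(biological, by = "id")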

right_join: Matching values from x to y

This is really the inverse of left_join.
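
A minimal sketch of that symmetry: the two calls below return the same rows, only the column order differs:

left_join(station, biological)     # keep all rows of station, add matches from biological
right_join(biological, station)    # the same rows, biological columns first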

inner_join: Retain only rows with matches

An example of joining the station information with the monkfish records only, i.e. only the stations where monkfish was recorded:

station %>% 
  inner_join(biological %>% 
               filter(species == "monkfish"))
## # A tibble: 2,038 × 31
##       id date        year   vid tow_id t1                  t2                 
##    <dbl> <date>     <dbl> <dbl> <chr>  <dttm>              <dttm>             
##  1 44154 1990-03-11  1990  1273 374-1  1990-03-11 10:25:00 1990-03-11 11:30:00
##  2 44271 1990-03-12  1990  1273 323-1  1990-03-12 09:15:00 1990-03-12 10:25:00
##  3 44382 1990-03-13  1990  1273 320-4  1990-03-13 14:20:00 1990-03-13 15:20:00
##  4 44156 1990-03-11  1990  1273 374-3  1990-03-11 07:10:00 1990-03-11 08:15:00
##  5 44379 1990-03-14  1990  1273 320-3  1990-03-14 01:40:00 1990-03-14 02:40:00
##  6 44377 1990-03-14  1990  1273 320-12 1990-03-14 05:45:00 1990-03-14 06:50:00
##  7 44375 1990-03-14  1990  1273 321-11 1990-03-14 09:30:00 1990-03-14 10:40:00
##  8 44373 1990-03-14  1990  1273 371-12 1990-03-14 13:30:00 1990-03-14 14:20:00
##  9 44356 1990-03-16  1990  1273 366-11 1990-03-16 00:50:00 1990-03-16 01:30:00
## 10 50775 1992-03-15  1992  1273 371-12 1992-03-15 09:00:00 1992-03-15 10:10:00
## # … with 2,028 more rows, and 24 more variables: lon1 <dbl>, lat1 <dbl>,
## #   lon2 <dbl>, lat2 <dbl>, ir <chr>, ir_lon <dbl>, ir_lat <dbl>, z1 <dbl>,
## #   z2 <dbl>, temp_s <dbl>, temp_b <dbl>, speed <dbl>, duration <dbl>,
## #   towlength <dbl>, horizontal <dbl>, verical <dbl>, wind <dbl>,
## #   wind_direction <dbl>, bormicon <dbl>, oldstrata <dbl>, newstrata <dbl>,
## #   species <chr>, kg <dbl>, n <dbl>

full_join: Retain all rows

station %>% 
  full_join(biological %>% 
               filter(species == "monkfish")) %>% 
  select(id, species:n)
## # A tibble: 19,846 × 4
##       id species    kg     n
##    <dbl> <chr>   <dbl> <dbl>
##  1 44929 <NA>       NA    NA
##  2 44928 <NA>       NA    NA
##  3 44927 <NA>       NA    NA
##  4 44926 <NA>       NA    NA
##  5 44925 <NA>       NA    NA
##  6 44922 <NA>       NA    NA
##  7 44921 <NA>       NA    NA
##  8 44920 <NA>       NA    NA
##  9 44919 <NA>       NA    NA
## 10 44918 <NA>       NA    NA
## # … with 19,836 more rows

Here all the station records are retained, irrespective of whether monkfish was caught or not.
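
If one wanted explicit zeros for the stations where no monkfish was recorded, the missing values could be filled in after the join, e.g. with tidyr's replace_na (a sketch, not part of the original example):

station %>% 
  full_join(biological %>% 
              filter(species == "monkfish")) %>% 
  mutate(species = replace_na(species, "monkfish"),
         kg = replace_na(kg, 0),
         n  = replace_na(n, 0)) %>% 
  select(id, species:n)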

Exercise 1

Run through this set of code, which illustrates the different join types:

x <- tibble(A = c("a", "b", "c"),
            B = c("t", "u", "v"),
            C = c(1, 2, 3))
y <- tibble(A = c("a", "b", "d"),
            B = c("t", "u", "w"),
            D = c(3, 2, 1))

left_join(x, y)
right_join(x, y)
left_join(y, x)
inner_join(x, y)
full_join(x, y)

Combine cases (bind)


bind_rows: One on top of the other as a single table.

union: Rows in x or y

intersect: Rows in x and y.

setdiff: Rows in x but not y

Exercise 2

Run through this set of code, which illustrates the different ways of combining cases:

y <- tibble(A = c("c", "d"),
            B = c("v", "w"),
            C = c(3, 4))
bind_rows(x, y)
union(x, y)
intersect(x, y)
setdiff(x, y)
setdiff(y, x)