Transformation 2

Preamble

Commonly, when collating summaries by group, one wants to:

Split up a big data structure into homogeneous pieces,
Apply a function to each piece
Combine all the results back together.

Reading material

Libraries and data

library(tidyverse)
w <- read_csv("https://heima.hafro.is/~einarhj/data/minke.csv")

Summarise cases

summarise:

To summarise data one uses the summarise-function. Below we calculate the number of observations (using the n-function and the mean minke length.

w |> 
  summarise(n.obs = n(),
            ml = mean(length, na.rm = TRUE))

# A tibble: 1 × 2
  n.obs    ml
  <int> <dbl>
1   190  753.

group_by:

The power of the command summarise is revealed when used in conjunction with group_by-function. The latter functions splits the data into groups based on one or multiple variables. E.g. one can split the minke table by maturity:

w |> 
  group_by(maturity)

# A tibble: 190 × 13
# Groups:   maturity [6]
      id date                  lon   lat area  length weight   age sex   
   <dbl> <dttm>              <dbl> <dbl> <chr>  <dbl>  <dbl> <dbl> <chr> 
 1     1 2004-06-10 22:00:00 -21.4  65.7 North    780     NA  11.3 Female
 2   690 2004-06-15 17:00:00 -21.4  65.7 North    793     NA  NA   Female
 3   926 2004-06-22 01:00:00 -19.8  66.5 North    858     NA  15.5 Female
 4  1333 2003-09-30 16:00:00 -21.6  65.7 North    567     NA   7.2 Male  
 5  1334 2003-09-25 15:00:00 -15.6  66.3 North    774     NA  12.3 Female
 6  1335 2003-09-16 16:00:00 -18.7  66.2 North    526     NA   9.6 Female
 7  1336 2003-09-12 17:00:00 -21.5  65.7 North    809     NA  17.3 Female
 8  1338 2003-09-09 12:00:00 -22.8  66.1 North    820     NA  13.8 Female
 9  1339 2003-08-31 13:00:00 -17.5  66.6 North    697     NA  12.2 Male  
10  1341 2003-08-30 17:00:00 -14.7  66.2 North    777     NA  15.4 Male  
# ℹ 180 more rows
# ℹ 4 more variables: maturity <chr>, stomach.volume <dbl>,
#   stomach.weight <dbl>, year <dbl>

The table appears “intact” because it still has 190 observations but it has 6 groups, representing the different values of the maturity scaling in the data (anoestrous, immature, mature, pregnant, pubertal and “NA”).

The summarise-command respects the grouping, as shown if one uses the same command as used above, but now on a dataframe that has been grouped:

w |> 
  group_by(maturity) |> 
  summarise(n.obs = n(),
            mean = mean(length))

# A tibble: 6 × 3
  maturity   n.obs  mean
  <chr>      <int> <dbl>
1 anoestrous     6  784.
2 immature      25  619.
3 mature        74  758.
4 pregnant      66  800.
5 pubertal       6  692.
6 <NA>          13  752.

Excercise

Calculate the number of observations and minimum, median, mean, standard deviation and standard error of whale lengths in each year
The function names in R are:
n
min
max
median
mean
sd

The function for standard error is not availble in R. A remedy is that you use college statistical knowledge to derive them from the appropriate variables you derived above.

This may help to find different basic statistics function.

Solution

w |> 
  group_by(year) |> 
  summarise(n = n(),
            min = min(length),
            max = max(length),
            median = median(length),
            mean = mean(length),
            sd = sd(length),
            se = sd / sqrt(n))

Note: If you would not be using the pipe-code-flow your code would be something like this:

summarise(group_by(w, year),
          n = n(),
          min = min(length))  # etc.

Shown because you are bound to come accross code like this.

Multiple group splitting

w |> 
  group_by(maturity, year) |> 
  summarise(n = n(),
            min = min(length),
            max = max(length),
            median = median(length),
            mean = mean(length),
            sd = sd(length),
            se = sd / sqrt(n),
            .groups = "drop") |> 
  knitr::kable(digits = 2)

maturity	year	n	min	max	median	mean	sd	se
anoestrous	2004	1	819	819	819.0	819.00	NA	NA
anoestrous	2005	1	762	762	762.0	762.00	NA	NA
anoestrous	2006	4	761	804	780.5	781.50	17.60	8.80
immature	2003	6	508	774	534.0	572.83	100.64	41.09
immature	2004	3	502	712	684.0	632.67	114.02	65.83
immature	2005	8	474	737	640.5	624.12	109.84	38.84
immature	2006	4	461	706	604.5	594.00	100.76	50.38
immature	2007	4	622	760	695.5	693.25	56.82	28.41
mature	2003	17	693	845	771.0	765.88	35.40	8.59
mature	2004	10	685	836	771.0	764.10	47.20	14.93
mature	2005	13	566	870	740.0	743.85	72.19	20.02
mature	2006	24	606	857	750.5	750.50	59.67	12.18
mature	2007	10	701	819	792.5	777.10	39.35	12.44
pregnant	2003	9	730	861	809.0	801.00	39.88	13.29
pregnant	2004	10	745	858	814.0	809.10	42.88	13.56
pregnant	2005	7	752	871	825.0	822.29	44.59	16.85
pregnant	2006	19	732	858	802.0	799.26	38.42	8.81
pregnant	2007	21	682	871	790.0	787.00	49.71	10.85
pubertal	2003	3	603	697	692.0	664.00	52.89	30.53
pubertal	2005	1	764	764	764.0	764.00	NA	NA
pubertal	2006	2	692	706	699.0	699.00	9.90	7.00
NA	2003	1	840	840	840.0	840.00	NA	NA
NA	2004	1	634	634	634.0	634.00	NA	NA
NA	2005	4	740	790	775.0	770.00	24.49	12.25
NA	2006	5	662	803	763.0	745.60	53.44	23.90
NA	2007	2	716	780	748.0	748.00	45.25	32.00

Joins

Normally the data we are interested in working with do not reside in one table. E.g. for a typical survey data one stores a “station table” separate from a “detail” table. The surveys could be anything, e.g. catch sampling at a landing site or scientific trawl or UV-surveys.

Lets read in the Icelandic groundfish survey tables in such a format:

station <- 
  read_csv("https://heima.hafro.is/~einarhj/data/is_smb_stations.csv")
biological <- 
  read_csv("https://heima.hafro.is/~einarhj/data/is_smb_biological.csv")
station %>% select(id:lat1) %>%  arrange(id) %>% glimpse()

Rows: 19,846
Columns: 9
$ id     <dbl> 29654, 29655, 29656, 29657, 29658, 29659, 29660, 29661, 29662, …
$ date   <date> 1985-03-21, 1985-03-20, 1985-03-20, 1985-03-20, 1985-03-20, 19…
$ year   <dbl> 1985, 1985, 1985, 1985, 1985, 1985, 1985, 1985, 1985, 1985, 198…
$ vid    <dbl> 1273, 1273, 1273, 1273, 1273, 1273, 1273, 1273, 1273, 1273, 127…
$ tow_id <chr> "374-3", "374-11", "374-12", "373-13", "373-11", "373-12", "373…
$ t1     <dttm> 1985-03-21 00:45:00, 1985-03-20 22:25:00, 1985-03-20 20:35:00,…
$ t2     <dttm> 1985-03-21 01:50:00, 1985-03-20 23:25:00, 1985-03-20 21:45:00,…
$ lon1   <dbl> -24.17967, -24.19783, -24.40467, -23.99833, -23.67983, -23.7210…
$ lat1   <dbl> 63.96350, 63.75367, 63.71817, 63.58000, 63.74767, 63.83767, 63.…

biological %>% arrange(id) %>% glimpse()

Rows: 68,834
Columns: 4
$ id      <dbl> 29654, 29654, 29654, 29654, 29655, 29655, 29655, 29655, 29656,…
$ species <chr> "cod", "haddock", "saithe", "wolffish", "cod", "haddock", "sai…
$ kg      <dbl> 53.4895900, 29.6435494, 22.8682684, 0.6901652, 29.6210500, 203…
$ n       <dbl> 13, 15, 5, 1, 6, 155, 23, 4, 6, 156, 39, 4, 43, 421, 8, 11, 25…

Here the information on the station, such as date, geographical location are kept separate from the detail measurments that are performed at each station (weight and numbers by species). The records in the station table are unique (i.e. each station information only appear once) while we can have more than one species measured at each station. Take note here that the biological table does not contain records of species that were not observed at that station. The link between the two tables, in this case, is the variable id.

Take also note that there are 6 species in the data

biological |> count(species)

# A tibble: 6 × 2
  species      n
  <chr>    <int>
1 cod      19425
2 haddock  16458
3 monkfish  2038
4 plaice    6388
5 saithe    8440
6 wolffish 16085

but that the number of records by species is different between species. E.g. there are 19425 records of cod (out of 19846 stations) but only 2038 records of monkfish. Not storing zero records is quite common in databases. It is thus the responsibility of the user to understand the structure of the data and how to deal with zero’s vs. missing data (NA) in the downstream analysis.

left_join: Matching values from y to x

Lets say we were interested in combining the geographical position and the species records from one station:

station %>% 
  select(id, date, lon1, lat1) %>% 
  filter(id == 29654) %>% 
  left_join(biological)

# A tibble: 4 × 7
     id date        lon1  lat1 species      kg     n
  <dbl> <date>     <dbl> <dbl> <chr>     <dbl> <dbl>
1 29654 1985-03-21 -24.2  64.0 cod      53.5      13
2 29654 1985-03-21 -24.2  64.0 haddock  29.6      15
3 29654 1985-03-21 -24.2  64.0 saithe   22.9       5
4 29654 1985-03-21 -24.2  64.0 wolffish  0.690     1

Take note here:

The joining is by common variable name in the two tables (here id)
That we only have records of fours species in this table, again a value of zero for a species is not stored in the data.

If we were to want to obtain the records for monkfish from this survey we could do:

d <- 
  station %>% 
  select(id, date, lon1, lat1) |> 
  left_join(biological |> 
              filter(species == "monkfish"),
            by = join_by(id))
d

# A tibble: 19,846 × 7
      id date        lon1  lat1 species    kg     n
   <dbl> <date>     <dbl> <dbl> <chr>   <dbl> <dbl>
 1 44929 1990-03-16 -20.7  66.5 <NA>       NA    NA
 2 44928 1990-03-16 -20.5  66.5 <NA>       NA    NA
 3 44927 1990-03-16 -20.4  66.6 <NA>       NA    NA
 4 44926 1990-03-16 -20.2  66.6 <NA>       NA    NA
 5 44925 1990-03-16 -20.0  66.7 <NA>       NA    NA
 6 44922 1990-03-16 -20.7  67.2 <NA>       NA    NA
 7 44921 1990-03-16 -20.7  67.3 <NA>       NA    NA
 8 44920 1990-03-17 -20.5  67.2 <NA>       NA    NA
 9 44919 1990-03-17 -20.2  67.1 <NA>       NA    NA
10 44918 1990-03-17 -20.2  66.7 <NA>       NA    NA
# ℹ 19,836 more rows

Now we have 19846 records, but most of them have missing mass (kg) and abundance (n). To account for the zeros we would need to do the following additional steps:

d <- 
  d |> 
  mutate(species = replace_na(species, "monkfish"),
         kg = replace_na(kg, 0),
         n = replace_na(n, 0))
d

# A tibble: 19,846 × 7
      id date        lon1  lat1 species     kg     n
   <dbl> <date>     <dbl> <dbl> <chr>    <dbl> <dbl>
 1 44929 1990-03-16 -20.7  66.5 monkfish     0     0
 2 44928 1990-03-16 -20.5  66.5 monkfish     0     0
 3 44927 1990-03-16 -20.4  66.6 monkfish     0     0
 4 44926 1990-03-16 -20.2  66.6 monkfish     0     0
 5 44925 1990-03-16 -20.0  66.7 monkfish     0     0
 6 44922 1990-03-16 -20.7  67.2 monkfish     0     0
 7 44921 1990-03-16 -20.7  67.3 monkfish     0     0
 8 44920 1990-03-17 -20.5  67.2 monkfish     0     0
 9 44919 1990-03-17 -20.2  67.1 monkfish     0     0
10 44918 1990-03-17 -20.2  66.7 monkfish     0     0
# ℹ 19,836 more rows

So only now would we able to do some summary statistics like the mean catch and standard error. Lets just use the inbuilt ggplot function to do that and get a plot at the same time:

d |> 
  mutate(year = year(date)) |> 
  ggplot(aes(year, kg)) +
  stat_summary(fun.data = "mean_cl_boot")

If we had done this without considering zero stations we have:

d |> 
  mutate(year = year(date)) |> 
  filter(kg != 0) |> 
  ggplot(aes(year, kg)) +
  stat_summary(fun.data = "mean_cl_boot")

Ergo: When working with data in general it is always important to think about if missingness constitutes a zero or a true NA.

But we may want some tabular data, so we could do this rather than directly a plot:

d_sum <-
  d |> 
  mutate(year = year(date)) |> 
  group_by(year) |> 
  summarise(n = n(),
            min_kg = min(kg),
            max_kg = max(kg),
            mean_kg = mean(kg),
            sd_kg = sd(kg),
            se_kg = sd_kg / sqrt(n),
            error = se_kg * 1.96,
            sum_kg = sum(kg))
d_sum

# A tibble: 35 × 9
    year     n min_kg max_kg mean_kg sd_kg  se_kg  error sum_kg
   <dbl> <int>  <dbl>  <dbl>   <dbl> <dbl>  <dbl>  <dbl>  <dbl>
 1  1985   593      0   31.5  0.251  2.04  0.0839 0.164   149. 
 2  1986   585      0   34.6  0.138  1.64  0.0678 0.133    80.5
 3  1987   566      0   20.2  0.166  1.57  0.0661 0.130    94.1
 4  1988   544      0   23.1  0.178  1.62  0.0696 0.136    96.6
 5  1989   568      0   70.2  0.260  3.10  0.130  0.255   148. 
 6  1990   567      0   14.4  0.119  0.986 0.0414 0.0811   67.7
 7  1991   570      0   18.3  0.0793 0.903 0.0378 0.0741   45.2
 8  1992   574      0   53.5  0.340  2.76  0.115  0.226   195. 
 9  1993   597      0   48.7  0.237  2.67  0.109  0.214   141. 
10  1994   596      0   24.4  0.178  1.72  0.0706 0.138   106. 
# ℹ 25 more rows

And then the plot would be:

d_sum |> 
  mutate(se_lower = mean_kg - error, 
         se_upper = mean_kg + error) |> 
  ggplot() +
  geom_pointrange(aes(year, mean_kg, ymin = se_lower, ymax = se_upper),
                  size = 0.5)

right_join: Matching values from x to y

This is really the inverse of left_join.

inner_join: Retain only rows with matches

Example of only joining station information with monkfish, were monkfish was recorded:

station %>% 
  inner_join(biological %>% 
               filter(species == "monkfish"))

# A tibble: 2,038 × 31
      id date        year   vid tow_id t1                  t2                 
   <dbl> <date>     <dbl> <dbl> <chr>  <dttm>              <dttm>             
 1 44154 1990-03-11  1990  1273 374-1  1990-03-11 10:25:00 1990-03-11 11:30:00
 2 44271 1990-03-12  1990  1273 323-1  1990-03-12 09:15:00 1990-03-12 10:25:00
 3 44382 1990-03-13  1990  1273 320-4  1990-03-13 14:20:00 1990-03-13 15:20:00
 4 44156 1990-03-11  1990  1273 374-3  1990-03-11 07:10:00 1990-03-11 08:15:00
 5 44379 1990-03-14  1990  1273 320-3  1990-03-14 01:40:00 1990-03-14 02:40:00
 6 44377 1990-03-14  1990  1273 320-12 1990-03-14 05:45:00 1990-03-14 06:50:00
 7 44375 1990-03-14  1990  1273 321-11 1990-03-14 09:30:00 1990-03-14 10:40:00
 8 44373 1990-03-14  1990  1273 371-12 1990-03-14 13:30:00 1990-03-14 14:20:00
 9 44356 1990-03-16  1990  1273 366-11 1990-03-16 00:50:00 1990-03-16 01:30:00
10 50775 1992-03-15  1992  1273 371-12 1992-03-15 09:00:00 1992-03-15 10:10:00
# ℹ 2,028 more rows
# ℹ 24 more variables: lon1 <dbl>, lat1 <dbl>, lon2 <dbl>, lat2 <dbl>,
#   ir <chr>, ir_lon <dbl>, ir_lat <dbl>, z1 <dbl>, z2 <dbl>, temp_s <dbl>,
#   temp_b <dbl>, speed <dbl>, duration <dbl>, towlength <dbl>,
#   horizontal <dbl>, verical <dbl>, wind <dbl>, wind_direction <dbl>,
#   bormicon <dbl>, oldstrata <dbl>, newstrata <dbl>, species <chr>, kg <dbl>,
#   n <dbl>

full_join: Retain all rows

station %>% 
  full_join(biological %>% 
              filter(species == "monkfish")) %>% 
  select(id, species:n)

# A tibble: 19,846 × 4
      id species    kg     n
   <dbl> <chr>   <dbl> <dbl>
 1 44929 <NA>       NA    NA
 2 44928 <NA>       NA    NA
 3 44927 <NA>       NA    NA
 4 44926 <NA>       NA    NA
 5 44925 <NA>       NA    NA
 6 44922 <NA>       NA    NA
 7 44921 <NA>       NA    NA
 8 44920 <NA>       NA    NA
 9 44919 <NA>       NA    NA
10 44918 <NA>       NA    NA
# ℹ 19,836 more rows

Here all the stations records are retained irrepsective if monkfish was caught or not.

Excercise

Run through this set of code that supposedly mimics the pictograms above:

x <- tibble(A = c("a", "b", "c"),
            B = c("t", "u", "v"),
            C = c(1, 2, 3))
y <- tibble(A = c("a", "b", "d"),
            B = c("t", "u", "w"),
            D = c(3, 2, 1))
x
y
left_join(x, y)
right_join(x, y)
left_join(y, x)
inner_join(x, y)
full_join(x, y)

Combine cases (bind)

bind_rows: One on top of the other as a single table.

union: Rows in x or y

intersect: Rows in x and y.

setdiff: Rows in x but not y

Excercise

Run through this set of code that supposedly mimics the pictograms above:

x <- tibble(A = c("a", "b", "c"),
            B = c("t", "u", "v"),
            C = c(1, 2, 3))

y <- tibble(A = c("c", "d"),
            B = c("v", "w"),
            C = c(3, 4))
x
y
bind_rows(x, y)
union(x, y)
intersect(x, y)
setdiff(x, y)
setdiff(y, x)