Commonly, when collating summaries by group, one wants to split a big data structure into homogeneous pieces, apply a function to each piece and then combine the results back together. For example, one might want to calculate summary statistics for each group or perform group-wise transformations. The tidyverse, through the dplyr package, comes to the rescue. Essentially dplyr offers a collection of simple but powerful commands (or verbs) that facilitate this split-apply-combine process: filter, select, arrange, mutate, summarise and group_by. The structure of these commands is always the same: the first argument is a data frame, subsequent arguments say what to do with it, and the result is again a data frame. Within these verbs columns are referred to by name alone, i.e. one writes
col.name
instead of
dat$col.name
Import the data used:
library(tidyverse)
d <- read_csv("ftp://ftp.hafro.is/pub/data/csv/minke.csv")
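To get a quick overview of what was just imported one can (a suggestion, not part of the original text) peek at the data with glimpse:
glimpse(d)   # 190 rows and 13 columns: id, date, lon, lat, area, length, weight, age, sex, maturity, ...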
One can extract rows that meet logical criteria by using the filter command. The first argument is the name of the data frame, with subsequent argument(s) being logical expressions. E.g. one subsets the minke data to the year 2004 by:
filter(d, year == 2004)
## # A tibble: 25 × 13
## id date lon lat area length weight age sex matur…¹
## <dbl> <dttm> <dbl> <dbl> <chr> <dbl> <dbl> <dbl> <chr> <chr>
## 1 1 2004-06-10 22:00:00 -21.4 65.7 North 780 NA 11.3 Fema… pregna…
## 2 690 2004-06-15 17:00:00 -21.4 65.7 North 793 NA NA Fema… pregna…
## 3 926 2004-06-22 01:00:00 -19.8 66.5 North 858 NA 15.5 Fema… pregna…
## 4 1430 2004-06-22 09:00:00 -19.8 66.5 North 840 NA 15.9 Fema… pregna…
## 5 1432 2004-06-26 05:00:00 -19.6 65.9 North 761 NA 14 Fema… pregna…
## 6 1434 2004-07-04 16:00:00 -21.0 65.5 North 819 NA 16.6 Fema… anoest…
## 7 1435 2004-06-04 19:00:00 -22.5 64.6 South 836 NA 46.8 Male mature
## 8 1437 2004-06-15 13:00:00 -23.0 64.2 South 764 NA 9.9 Male mature
## 9 1439 2004-06-08 07:00:00 -22.8 64.6 South 774 4620 10.9 Fema… pregna…
## 10 1440 2004-06-16 11:00:00 -23.4 64.2 South 778 NA 25.3 Male mature
## # … with 15 more rows, 3 more variables: stomach.volume <dbl>,
## # stomach.weight <dbl>, year <dbl>, and abbreviated variable name ¹maturity
Same as:
filter(d, year %in% 2004)
Only data from 2004 and onwards:
filter(d, year >= 2004)
## # A tibble: 154 × 13
## id date lon lat area length weight age sex matur…¹
## <dbl> <dttm> <dbl> <dbl> <chr> <dbl> <dbl> <dbl> <chr> <chr>
## 1 1 2004-06-10 22:00:00 -21.4 65.7 North 780 NA 11.3 Fema… pregna…
## 2 690 2004-06-15 17:00:00 -21.4 65.7 North 793 NA NA Fema… pregna…
## 3 926 2004-06-22 01:00:00 -19.8 66.5 North 858 NA 15.5 Fema… pregna…
## 4 1430 2004-06-22 09:00:00 -19.8 66.5 North 840 NA 15.9 Fema… pregna…
## 5 1432 2004-06-26 05:00:00 -19.6 65.9 North 761 NA 14 Fema… pregna…
## 6 1434 2004-07-04 16:00:00 -21.0 65.5 North 819 NA 16.6 Fema… anoest…
## 7 1435 2004-06-04 19:00:00 -22.5 64.6 South 836 NA 46.8 Male mature
## 8 1437 2004-06-15 13:00:00 -23.0 64.2 South 764 NA 9.9 Male mature
## 9 1439 2004-06-08 07:00:00 -22.8 64.6 South 774 4620 10.9 Fema… pregna…
## 10 1440 2004-06-16 11:00:00 -23.4 64.2 South 778 NA 25.3 Male mature
## # … with 144 more rows, 3 more variables: stomach.volume <dbl>,
## # stomach.weight <dbl>, year <dbl>, and abbreviated variable name ¹maturity
Using subsequent arguments one can refine the subset further:
filter(d, year >= 2004 & year < 2006)
filter(d, year >= 2004, year < 2006) # same as above
filter(d, between(year, 2004, 2005)) # same as above
All but the year 2004 would be:
filter(d, year != 2004)
And this would give the years 2005 to 2007:
filter(d, year %in% 2005:2007)
Which, since the data only span 2003 to 2007, is the same as:
filter(d, !year %in% 2003:2004)
Filter takes any logical statement:
x == a # x is equal to a
x %in% a # x is "within" a
x != a # x is not equal to a
x > a # x is greater than a
x >= a # x is greater than or equal to a
x < a # x is less than a
x <= a # x is less than or equal to a
a & b # a and b
a | b # a or b
is.na(x) # x is NA (missing)
# ... ...
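As a small illustration (not in the original text), is.na() can be used directly inside filter, e.g. to extract or exclude whales with a missing weight measurement:
filter(d, is.na(weight))    # rows where weight was not recorded
filter(d, !is.na(weight))   # rows with a weight measurement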
The arguments can operate on different variables. E.g. to extract mature males caught in 2007 one would write:
filter(d, maturity == "mature", sex == "Male", year == 2007)
NOTE: A comma between arguments is read as "AND". To specify "OR" use "|":
filter(d, maturity == "mature" | sex == "Male" | year == 2007)
The select function allows you to extract certain columns:
select(d, id, year, maturity)
## # A tibble: 190 × 3
## id year maturity
## <dbl> <dbl> <chr>
## 1 1 2004 pregnant
## 2 690 2004 pregnant
## 3 926 2004 pregnant
## 4 1333 2003 immature
## 5 1334 2003 immature
## 6 1335 2003 immature
## 7 1336 2003 pregnant
## 8 1338 2003 pregnant
## 9 1339 2003 pubertal
## 10 1341 2003 mature
## # … with 180 more rows
You can also omit certain columns using negative indexing: for example you can select all columns except length:
select(d, -length)
## # A tibble: 190 × 12
## id date lon lat area weight age sex maturity
## <dbl> <dttm> <dbl> <dbl> <chr> <dbl> <dbl> <chr> <chr>
## 1 1 2004-06-10 22:00:00 -21.4 65.7 North NA 11.3 Female pregnant
## 2 690 2004-06-15 17:00:00 -21.4 65.7 North NA NA Female pregnant
## 3 926 2004-06-22 01:00:00 -19.8 66.5 North NA 15.5 Female pregnant
## 4 1333 2003-09-30 16:00:00 -21.6 65.7 North NA 7.2 Male immature
## 5 1334 2003-09-25 15:00:00 -15.6 66.3 North NA 12.3 Female immature
## 6 1335 2003-09-16 16:00:00 -18.7 66.2 North NA 9.6 Female immature
## 7 1336 2003-09-12 17:00:00 -21.5 65.7 North NA 17.3 Female pregnant
## 8 1338 2003-09-09 12:00:00 -22.8 66.1 North NA 13.8 Female pregnant
## 9 1339 2003-08-31 13:00:00 -17.5 66.6 North NA 12.2 Male pubertal
## 10 1341 2003-08-30 17:00:00 -14.7 66.2 North NA 15.4 Male mature
## # … with 180 more rows, and 3 more variables: stomach.volume <dbl>,
## # stomach.weight <dbl>, year <dbl>
A combination of variables to drop could be written as:
select(d, -c(weight, maturity))
The select function has some useful helper functions:
starts_with('stomach') # Finds all columns that start with "stomach"
ends_with('weight') # Finds all columns that end with "weight"
id:age # Finds all columns between "id" and "age"
contains('mach') # Finds all columns that contain "mach"
-date # Remove column "date" from the dataset
And you can of course combine these at will:
select(d, id:length, starts_with('stomach'))
## # A tibble: 190 × 8
## id date lon lat area length stomach.volume stomach.w…¹
## <dbl> <dttm> <dbl> <dbl> <chr> <dbl> <dbl> <dbl>
## 1 1 2004-06-10 22:00:00 -21.4 65.7 North 780 58 31.9
## 2 690 2004-06-15 17:00:00 -21.4 65.7 North 793 90 36.3
## 3 926 2004-06-22 01:00:00 -19.8 66.5 North 858 24 9.42
## 4 1333 2003-09-30 16:00:00 -21.6 65.7 North 567 25 3.64
## 5 1334 2003-09-25 15:00:00 -15.6 66.3 North 774 85 5.51
## 6 1335 2003-09-16 16:00:00 -18.7 66.2 North 526 18 1.17
## 7 1336 2003-09-12 17:00:00 -21.5 65.7 North 809 200 99
## 8 1338 2003-09-09 12:00:00 -22.8 66.1 North 820 111 64.3
## 9 1339 2003-08-31 13:00:00 -17.5 66.6 North 697 8 0.801
## 10 1341 2003-08-30 17:00:00 -14.7 66.2 North 777 25 1.42
## # … with 180 more rows, and abbreviated variable name ¹stomach.weight
select also allows you to rename columns as you select them:
select(d, id, yr = year)
## # A tibble: 190 × 2
## id yr
## <dbl> <dbl>
## 1 1 2004
## 2 690 2004
## 3 926 2004
## 4 1333 2003
## 5 1334 2003
## 6 1335 2003
## 7 1336 2003
## 8 1338 2003
## 9 1339 2003
## 10 1341 2003
## # … with 180 more rows
Note that this only keeps the requested columns; the others are dropped.
If you just want to rename a couple of columns, leaving the other columns intact, you can use the rename function:
rename(d, time = date)
## # A tibble: 190 × 13
## id time lon lat area length weight age sex matur…¹
## <dbl> <dttm> <dbl> <dbl> <chr> <dbl> <dbl> <dbl> <chr> <chr>
## 1 1 2004-06-10 22:00:00 -21.4 65.7 North 780 NA 11.3 Fema… pregna…
## 2 690 2004-06-15 17:00:00 -21.4 65.7 North 793 NA NA Fema… pregna…
## 3 926 2004-06-22 01:00:00 -19.8 66.5 North 858 NA 15.5 Fema… pregna…
## 4 1333 2003-09-30 16:00:00 -21.6 65.7 North 567 NA 7.2 Male immatu…
## 5 1334 2003-09-25 15:00:00 -15.6 66.3 North 774 NA 12.3 Fema… immatu…
## 6 1335 2003-09-16 16:00:00 -18.7 66.2 North 526 NA 9.6 Fema… immatu…
## 7 1336 2003-09-12 17:00:00 -21.5 65.7 North 809 NA 17.3 Fema… pregna…
## 8 1338 2003-09-09 12:00:00 -22.8 66.1 North 820 NA 13.8 Fema… pregna…
## 9 1339 2003-08-31 13:00:00 -17.5 66.6 North 697 NA 12.2 Male pubert…
## 10 1341 2003-08-30 17:00:00 -14.7 66.2 North 777 NA 15.4 Male mature
## # … with 180 more rows, 3 more variables: stomach.volume <dbl>,
## # stomach.weight <dbl>, year <dbl>, and abbreviated variable name ¹maturity
To sort the data we employ the arrange-function. Sorting by length is as follows (in ascending order):
arrange(d, length)
## # A tibble: 190 × 13
## id date lon lat area length weight age sex matur…¹
## <dbl> <dttm> <dbl> <dbl> <chr> <dbl> <dbl> <dbl> <chr> <chr>
## 1 20927 2006-07-19 15:00:00 -19.2 66.3 North 461 NA 3.4 Fema… immatu…
## 2 20254 2005-08-10 16:00:00 -16.8 66.4 North 474 NA 3.4 Male immatu…
## 3 20421 2005-07-19 17:00:00 -15.9 64.1 South 482 NA 3.6 Male immatu…
## 4 1481 2004-07-03 09:00:00 -15.8 64 South 502 NA 6.7 Fema… immatu…
## 5 1344 2003-08-26 16:00:00 -19.3 66.2 North 508 NA NA Male immatu…
## 6 1504 2003-08-18 17:00:00 -23.9 65.0 South 520 NA 13.3 Male immatu…
## 7 1335 2003-09-16 16:00:00 -18.7 66.2 North 526 NA 9.6 Fema… immatu…
## 8 1343 2003-08-28 19:00:00 -15.8 66.4 North 542 NA NA Fema… immatu…
## 9 19211 2005-07-23 16:00:00 -14.2 65.8 North 564 NA 5.2 Fema… immatu…
## 10 20427 2005-08-03 15:00:00 -23.1 64.6 South 566 1663 6.1 Male mature
## # … with 180 more rows, 3 more variables: stomach.volume <dbl>,
## # stomach.weight <dbl>, year <dbl>, and abbreviated variable name ¹maturity
and in descending order:
arrange(d, desc(length))
## # A tibble: 190 × 13
## id date lon lat area length weight age sex matur…¹
## <dbl> <dttm> <dbl> <dbl> <chr> <dbl> <dbl> <dbl> <chr> <chr>
## 1 20255 2005-08-17 06:00:00 -21.5 65.7 North 871 NA 10.5 Fema… pregna…
## 2 21023 2007-06-14 19:00:00 -14.4 66.2 North 871 NA 29.7 Fema… pregna…
## 3 20424 2005-07-27 21:00:00 -23.0 64.5 South 870 NA 47.4 Male mature
## 4 1500 2003-08-31 07:00:00 -24.2 65.1 South 861 NA NA Fema… pregna…
## 5 19213 2005-08-04 14:00:00 -16.5 63.7 South 860 NA 29.6 Fema… pregna…
## 6 926 2004-06-22 01:00:00 -19.8 66.5 North 858 NA 15.5 Fema… pregna…
## 7 20971 2006-07-20 08:00:00 -18.4 67.0 North 858 NA 25.1 Fema… pregna…
## 8 20978 2006-08-12 18:00:00 -22.6 66.1 North 857 NA NA Male mature
## 9 1447 2004-07-04 11:00:00 -14.9 64.2 South 853 NA 12.1 Fema… pregna…
## 10 20261 2005-08-09 21:00:00 -22.7 64.3 South 853 NA 24.1 Fema… pregna…
## # … with 180 more rows, 3 more variables: stomach.volume <dbl>,
## # stomach.weight <dbl>, year <dbl>, and abbreviated variable name ¹maturity
You can also arrange by more than one column:
arrange(d, sex, desc(length))
## # A tibble: 190 × 13
## id date lon lat area length weight age sex matur…¹
## <dbl> <dttm> <dbl> <dbl> <chr> <dbl> <dbl> <dbl> <chr> <chr>
## 1 20255 2005-08-17 06:00:00 -21.5 65.7 North 871 NA 10.5 Fema… pregna…
## 2 21023 2007-06-14 19:00:00 -14.4 66.2 North 871 NA 29.7 Fema… pregna…
## 3 1500 2003-08-31 07:00:00 -24.2 65.1 South 861 NA NA Fema… pregna…
## 4 19213 2005-08-04 14:00:00 -16.5 63.7 South 860 NA 29.6 Fema… pregna…
## 5 926 2004-06-22 01:00:00 -19.8 66.5 North 858 NA 15.5 Fema… pregna…
## 6 20971 2006-07-20 08:00:00 -18.4 67.0 North 858 NA 25.1 Fema… pregna…
## 7 1447 2004-07-04 11:00:00 -14.9 64.2 South 853 NA 12.1 Fema… pregna…
## 8 20261 2005-08-09 21:00:00 -22.7 64.3 South 853 NA 24.1 Fema… pregna…
## 9 1449 2004-07-05 05:00:00 -16.6 63.8 South 852 NA 25 Fema… pregna…
## 10 20991 2006-08-24 13:00:00 -14.0 65.6 North 850 NA 36.6 Fema… pregna…
## # … with 180 more rows, 3 more variables: stomach.volume <dbl>,
## # stomach.weight <dbl>, year <dbl>, and abbreviated variable name ¹maturity
mutate allows you to add new columns to your data. Let’s calculate the approximate weight:
d2 <- select(d, id, length, weight)
mutate(d2, computed_weight = 0.00001 * length^3)
## # A tibble: 190 × 4
## id length weight computed_weight
## <dbl> <dbl> <dbl> <dbl>
## 1 1 780 NA 4746.
## 2 690 793 NA 4987.
## 3 926 858 NA 6316.
## 4 1333 567 NA 1823.
## 5 1334 774 NA 4637.
## 6 1335 526 NA 1455.
## 7 1336 809 NA 5295.
## 8 1338 820 NA 5514.
## 9 1339 697 NA 3386.
## 10 1341 777 NA 4691.
## # … with 180 more rows
You can also do more than one “mutation”:
mutate(d2,
computed_weight = 0.00001 * length^3,
approx_weight = ifelse(is.na(weight), 0.00001 * length^3, weight))
## # A tibble: 190 × 5
## id length weight computed_weight approx_weight
## <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1 780 NA 4746. 4746.
## 2 690 793 NA 4987. 4987.
## 3 926 858 NA 6316. 6316.
## 4 1333 567 NA 1823. 1823.
## 5 1334 774 NA 4637. 4637.
## 6 1335 526 NA 1455. 1455.
## 7 1336 809 NA 5295. 5295.
## 8 1338 820 NA 5514. 5514.
## 9 1339 697 NA 3386. 3386.
## 10 1341 777 NA 4691. 4691.
## # … with 180 more rows
To make our lives a bit easier mutate “remembers” earlier transformations within the sequence:
mutate(d2,
computed_weight = 0.00001 * length^3,
approx_weight = ifelse(is.na(weight), computed_weight, weight))
To summarise data one uses the summarise-function. Below we calculate the number of observations (using the n()-function) and the mean minke length:
summarise(d,
n.obs = n(),
ml = mean(length, na.rm = TRUE))
## # A tibble: 1 × 2
## n.obs ml
## <int> <dbl>
## 1 190 753.
The power of the summarise-command is revealed when used in conjunction with the group_by-function. The latter splits the data into groups based on one or more variables. E.g. one can split the minke table by maturity:
by_maturity <- group_by(d, maturity)
by_maturity
## # A tibble: 190 × 13
## # Groups: maturity [6]
## id date lon lat area length weight age sex matur…¹
## <dbl> <dttm> <dbl> <dbl> <chr> <dbl> <dbl> <dbl> <chr> <chr>
## 1 1 2004-06-10 22:00:00 -21.4 65.7 North 780 NA 11.3 Fema… pregna…
## 2 690 2004-06-15 17:00:00 -21.4 65.7 North 793 NA NA Fema… pregna…
## 3 926 2004-06-22 01:00:00 -19.8 66.5 North 858 NA 15.5 Fema… pregna…
## 4 1333 2003-09-30 16:00:00 -21.6 65.7 North 567 NA 7.2 Male immatu…
## 5 1334 2003-09-25 15:00:00 -15.6 66.3 North 774 NA 12.3 Fema… immatu…
## 6 1335 2003-09-16 16:00:00 -18.7 66.2 North 526 NA 9.6 Fema… immatu…
## 7 1336 2003-09-12 17:00:00 -21.5 65.7 North 809 NA 17.3 Fema… pregna…
## 8 1338 2003-09-09 12:00:00 -22.8 66.1 North 820 NA 13.8 Fema… pregna…
## 9 1339 2003-08-31 13:00:00 -17.5 66.6 North 697 NA 12.2 Male pubert…
## 10 1341 2003-08-30 17:00:00 -14.7 66.2 North 777 NA 15.4 Male mature
## # … with 180 more rows, 3 more variables: stomach.volume <dbl>,
## # stomach.weight <dbl>, year <dbl>, and abbreviated variable name ¹maturity
The table appears "intact" because it still has 190 observations, but it now has 6 groups, representing the different maturity stages recorded in the data (anoestrous, immature, mature, pregnant, pubertal and NA).
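To inspect the grouping itself one can (a suggestion, not part of the original text) ask dplyr directly:
n_groups(by_maturity)    # 6
group_keys(by_maturity)  # one row per group: the maturity stages listed above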
The summarise-command respects the grouping, as shown when the same command as used above is applied to a data frame that has been grouped:
summarise(by_maturity,
n.obs = n(),
mean = mean(length, na.rm = TRUE))
## # A tibble: 6 × 3
## maturity n.obs mean
## <chr> <int> <dbl>
## 1 anoestrous 6 784.
## 2 immature 25 619.
## 3 mature 74 758.
## 4 pregnant 66 800.
## 5 pubertal 6 692.
## 6 <NA> 13 752.
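As an aside (not from the original text), grouping followed by counting is so common that dplyr provides count() as a shorthand:
count(d, maturity)   # equivalent to summarise(group_by(d, maturity), n = n())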
In R one can also nest function calls, applying one function directly to the result of another:
summarise(group_by(d, sex, year), obs = n())
## # A tibble: 10 × 3
## # Groups: sex [2]
## sex year obs
## <chr> <dbl> <int>
## 1 Female 2003 13
## 2 Female 2004 15
## 3 Female 2005 14
## 4 Female 2006 27
## 5 Female 2007 25
## 6 Male 2003 23
## 7 Male 2004 10
## 8 Male 2005 20
## 9 Male 2006 31
## 10 Male 2007 12
All this quickly becomes cumbersome and hard to read. For example, what does this do?
arrange(summarise(group_by(d, sex, year), obs = n()), desc(obs))
dplyr allows chaining of operations using the %>% operator, often referred to as the pipe operator. The above code then becomes:
d %>%
  group_by(sex, year) %>%
  summarise(obs = n()) %>%
  arrange(desc(obs))
## # A tibble: 10 × 3
## # Groups: sex [2]
## sex year obs
## <chr> <dbl> <int>
## 1 Male 2006 31
## 2 Female 2006 27
## 3 Female 2007 25
## 4 Male 2003 23
## 5 Male 2005 20
## 6 Female 2004 15
## 7 Female 2005 14
## 8 Female 2003 13
## 9 Male 2007 12
## 10 Male 2004 10
This needs a bit of explaining. Notice that the first argument of each function is missing. Instead of calling a function as f(x, y) you can write x %>% f(y), that is "take x, then apply f with setting y". In other words, the %>% operator takes whatever is on its left and places it as the first argument of the function on its right-hand side. This gets infinitely more useful when you start creating longer chains:
d %>%
  # note - here we overwrite the weight column:
  mutate(weight = ifelse(!is.na(weight), weight, 0.00001 * length^3),
         K = 100 * weight * 1000 / length^3) %>%
  group_by(sex, year) %>%
  summarise(n = n(),
            ml = mean(length),
            sl = sd(length),
            mK = mean(K),
            sK = sd(K)) %>%
  arrange(year)
## # A tibble: 10 × 7
## # Groups: sex [2]
## sex year n ml sl mK sK
## <chr> <dbl> <int> <dbl> <dbl> <dbl> <dbl>
## 1 Female 2003 13 761. 107. 1 1.60e-16
## 2 Male 2003 23 722. 90.2 1 1.59e-16
## 3 Female 2004 15 763. 98.0 1.00 9.38e- 4
## 4 Male 2004 10 764. 47.2 1 1.28e-16
## 5 Female 2005 14 754. 92.7 0.997 6.92e- 2
## 6 Male 2005 20 724. 103. 0.974 4.80e- 2
## 7 Female 2006 27 770. 86.7 0.993 3.47e- 2
## 8 Male 2006 31 743. 56.7 0.998 1.75e- 2
## 9 Female 2007 25 772 60.8 0.988 9.89e- 2
## 10 Male 2007 12 772. 39.8 1.01 3.34e- 2
It is recommended that you "read" the %>% as "then". Hence the above can be read as: take d, then compute the weight and the condition factor K, then group by sex and year, then summarise the number of observations along with the mean and standard deviation of length and K, then arrange by year.
Note that RStudio has a built-in shortcut for the %>% operator: [ctrl] + [shift] + M.
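A minimal illustration of the equivalence (not part of the original text): even a single verb can be written either way.
filter(d, year == 2004)
# is the same as
d %>% filter(year == 2004)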
As an exercise, calculate the standard error of the mean length by sex and area; the expected outcome is:
## # A tibble: 4 × 3
## # Groups: sex [2]
## sex area standard.error
## <chr> <chr> <dbl>
## 1 Female North 10.3
## 2 Female South 16.3
## 3 Male North 18.9
## 4 Male South 8.02
The definition of a tidy dataset: each column is a variable, and each row is an observation. The SMB dataset (ftp://ftp.hafro.is/pub/data/csv/is_smb.csv) is an example of an untidy dataset. Let's take a look:
smb <- read_csv("ftp://ftp.hafro.is/pub/data/csv/is_smb.csv")
glimpse(smb)
## Rows: 19,846
## Columns: 40
## $ id <dbl> 44929, 44928, 44927, 44926, 44925, 44922, 44921, 44920,…
## $ date <date> 1990-03-16, 1990-03-16, 1990-03-16, 1990-03-16, 1990-0…
## $ year <dbl> 1990, 1990, 1990, 1990, 1990, 1990, 1990, 1990, 1990, 1…
## $ vid <dbl> 1307, 1307, 1307, 1307, 1307, 1307, 1307, 1307, 1307, 1…
## $ tow_id <chr> "670-15", "670-13", "670-12", "670-1", "670-2", "720-1"…
## $ t1 <dttm> 1990-03-16 05:05:00, 1990-03-16 07:48:00, 1990-03-16 0…
## $ t2 <dttm> 1990-03-16 06:05:00, 1990-03-16 08:51:00, 1990-03-16 1…
## $ lon1 <dbl> -20.70367, -20.49767, -20.44650, -20.21500, -19.99100, …
## $ lat1 <dbl> 66.52250, 66.50917, 66.58583, 66.57417, 66.65767, 67.24…
## $ lon2 <dbl> -20.73167, -20.53350, -20.29083, -20.07750, -19.98883, …
## $ lat2 <dbl> 66.59167, 66.57583, 66.61550, 66.61600, 66.72383, 67.29…
## $ ir <chr> "62C9", "62C9", "62C9", "62C9", "62D0", "63C9", "63C9",…
## $ ir_lon <dbl> -20.5, -20.5, -20.5, -20.5, -19.5, -20.5, -20.5, -20.5,…
## $ ir_lat <dbl> 66.75, 66.75, 66.75, 66.75, 66.75, 67.25, 67.25, 67.25,…
## $ z1 <dbl> 297, 302, 205, 88, 129, 249, 332, 244, 291, 230, 140, 3…
## $ z2 <dbl> 330, 314, 291, 117, 172, 307, 360, 256, 282, 270, 158, …
## $ temp_s <dbl> 1.1, 1.1, 1.0, 1.1, 0.9, 1.3, 1.2, 1.2, 1.1, 1.0, 1.0, …
## $ temp_b <dbl> NA, 1.0, NA, 1.0, 1.1, 1.0, -0.5, 0.9, 0.9, 0.9, NA, -0…
## $ speed <dbl> 4.0, 3.8, 3.2, 4.0, 4.1, 4.1, 3.9, 3.8, 3.9, 3.9, 4.0, …
## $ duration <dbl> 60, 63, 74, 60, 59, 59, 61, 64, 62, 62, 60, 59, 65, 66,…
## $ towlength <dbl> 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 3, 4…
## $ horizontal <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ verical <dbl> 2.0, 3.0, 2.4, 2.6, NA, NA, NA, 2.2, 2.0, 2.0, 2.0, 2.8…
## $ wind <dbl> 15, 15, 15, 15, 12, 9, 9, 7, 7, 5, 5, 3, 7, 7, 1, 7, 7,…
## $ wind_direction <dbl> 5, 5, 5, 9, 9, 9, 9, 9, 9, 14, 14, 5, 5, 5, 32, 36, 5, …
## $ bormicon <dbl> 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 2…
## $ oldstrata <dbl> 34, 34, 34, 34, 34, 33, 36, 33, 33, 34, 34, 34, 34, 41,…
## $ newstrata <dbl> 31, 31, 31, 32, 32, 18, 37, 18, 18, 31, 32, 31, 31, 36,…
## $ cod_kg <dbl> 56.30017, 29.28811, 34.62728, 8.32594, 43.83642, 1.2309…
## $ cod_n <dbl> 155, 75, 125, 13, 81, 6, 0, 3, 0, 52, 330, 77, 30, 48, …
## $ haddock_kg <dbl> 0.50302238, 0.00000000, 0.46842356, 2.39978621, 11.5888…
## $ haddock_n <dbl> 16, 0, 17, 7, 42, 0, 0, 0, 0, 12, 10, 0, 0, 0, 0, 0, 1,…
## $ saithe_kg <dbl> 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0…
## $ saithe_n <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 5893…
## $ wolffish_kg <dbl> 4.7770270, 3.1999367, 1.7183545, 8.4568957, 11.6068288,…
## $ wolffish_n <dbl> 12, 1, 2, 27, 26, 0, 0, 0, 0, 2, 69, 0, 0, 27, 0, 0, 29…
## $ plaice_kg <dbl> 0.0000000, 0.0000000, 0.3660489, 4.2098115, 2.4292287, …
## $ plaice_n <dbl> 0, 0, 1, 6, 3, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0…
## $ monkfish_kg <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ monkfish_n <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
Can you identify columns that should actually be observations?
To make a table longer, we can employ the pivot_longer-function. Let's just take the biological variables:
wide <-
  smb %>%
  select(id, cod_kg:monkfish_n)
long <-
  wide %>%
  # select just the abundance variables
  select(id, ends_with("_n")) %>%
  pivot_longer(cols = -id, names_to = "species", values_to = "n") %>%
  mutate(species = str_remove(species, "_n"))
glimpse(long)
## Rows: 119,076
## Columns: 3
## $ id <dbl> 44929, 44929, 44929, 44929, 44929, 44929, 44928, 44928, 44928,…
## $ species <chr> "cod", "haddock", "saithe", "wolffish", "plaice", "monkfish", …
## $ n <dbl> 155, 16, 0, 12, 0, 0, 75, 0, 0, 1, 0, 0, 125, 17, 0, 2, 1, 0, …
So we have moved from a data frame of 19846 rows and 7 variables (id plus the six abundance columns) to a data frame of 119076 rows (19846 stations x 6 species) and only 3 variables.
This latter (longer) format is the proper format for efficient computer programming, as the following exercise should illustrate.
Exercise: calculate the median abundance (n) of each species, first from the wide table and then from the long one. The expected outcome:
## # A tibble: 1 × 6
## cod haddock saithe wolfish plaice monkfish
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 50 53 0 8 0 0
## # A tibble: 6 × 2
## species median
## <chr> <dbl>
## 1 cod 50
## 2 haddock 53
## 3 monkfish 0
## 4 plaice 0
## 5 saithe 0
## 6 wolffish 8
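One way to produce output of this kind (a sketch, assuming the aim is the median of the abundance columns) in each format:
# wide format: every species column has to be handled explicitly (or swept over with across())
wide %>%
  summarise(across(ends_with("_n"), median))
# long format: a single grouped summarise covers all species
long %>%
  group_by(species) %>%
  summarise(median = median(n))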
When tidying untidy tables we sometimes need to make a long table wider again. In the above example we only made a long table for the abundance (n). We could modify the code to do the same for the biomass (kg), but handling both at once requires the use of pivot_wider, the steps being:
long <-
  wide %>%
  # step 1
  pivot_longer(-id) %>%
  # step 2
  separate(name, sep = "_", into = c("species", "variable")) %>%
  # step 3
  pivot_wider(names_from = variable)
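As an aside, newer versions of tidyr can do steps 1-3 in a single call via the special ".value" sentinel in names_to (a sketch under that assumption, not shown in the original):
wide %>%
  pivot_longer(cols = -id,
               names_to = c("species", ".value"),
               names_sep = "_")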
Normally the data we are interested in working with do not reside in one table. E.g. for typical groundfish survey data one stores the station table separately from the biological table. Let's read in the SMB tables in such a format:
station <-
  read_csv("ftp://ftp.hafro.is/pub/data/csv/is_smb_stations.csv")
biological <-
  read_csv("ftp://ftp.hafro.is/pub/data/csv/is_smb_biological.csv")
station %>% select(id:lat1) %>% arrange(id) %>% glimpse()
## Rows: 19,846
## Columns: 9
## $ id <dbl> 29654, 29655, 29656, 29657, 29658, 29659, 29660, 29661, 29662, …
## $ date <date> 1985-03-21, 1985-03-20, 1985-03-20, 1985-03-20, 1985-03-20, 19…
## $ year <dbl> 1985, 1985, 1985, 1985, 1985, 1985, 1985, 1985, 1985, 1985, 198…
## $ vid <dbl> 1273, 1273, 1273, 1273, 1273, 1273, 1273, 1273, 1273, 1273, 127…
## $ tow_id <chr> "374-3", "374-11", "374-12", "373-13", "373-11", "373-12", "373…
## $ t1 <dttm> 1985-03-21 00:45:00, 1985-03-20 22:25:00, 1985-03-20 20:35:00,…
## $ t2 <dttm> 1985-03-21 01:50:00, 1985-03-20 23:25:00, 1985-03-20 21:45:00,…
## $ lon1 <dbl> -24.17967, -24.19783, -24.40467, -23.99833, -23.67983, -23.7210…
## $ lat1 <dbl> 63.96350, 63.75367, 63.71817, 63.58000, 63.74767, 63.83767, 63.…
biological %>% arrange(id) %>% glimpse()
## Rows: 68,834
## Columns: 4
## $ id <dbl> 29654, 29654, 29654, 29654, 29655, 29655, 29655, 29655, 29656,…
## $ species <chr> "cod", "haddock", "saithe", "wolffish", "cod", "haddock", "sai…
## $ kg <dbl> 53.4895900, 29.6435494, 22.8682684, 0.6901652, 29.6210500, 203…
## $ n <dbl> 13, 15, 5, 1, 6, 155, 23, 4, 6, 156, 39, 4, 43, 421, 8, 11, 25…
Here the information on the station, such as date and geographical location, is kept separate from the detailed measurements taken at each station (weight and numbers by species). The records in the station table are unique (i.e. each station's information appears only once), while more than one species can be measured at each station. Take note that the biological table does not contain records of species that were not observed at a station. The link between the two tables, in this case, is the variable id.
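A quick sanity check of this structure (a suggestion, not part of the original text) is to confirm that id identifies a single row in the station table but potentially several in the biological table:
nrow(station)               # 19846 stations
n_distinct(station$id)      # should also be 19846 - one row per station
n_distinct(biological$id)   # at most 19846 - only stations with catch records appear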
Let's say we were interested in combining the geographical position and the species records from one station:
station %>%
  select(id, date, lon1, lat1) %>%
  filter(id == 29654) %>%
  left_join(biological)
## # A tibble: 4 × 7
## id date lon1 lat1 species kg n
## <dbl> <date> <dbl> <dbl> <chr> <dbl> <dbl>
## 1 29654 1985-03-21 -24.2 64.0 cod 53.5 13
## 2 29654 1985-03-21 -24.2 64.0 haddock 29.6 15
## 3 29654 1985-03-21 -24.2 64.0 saithe 22.9 5
## 4 29654 1985-03-21 -24.2 64.0 wolffish 0.690 1
Take note here:
* The joining is by the common variable name in the two tables (here id).
* We only have records of four species in this table, i.e. a value of zero for a species is not stored in the data.
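If explicit zeros are needed, one approach (a sketch, not from the original text) is to complete the species list after the join and fill the gaps with zeros, using tidyr's complete() and nesting() (both attached via the tidyverse):
station %>%
  select(id, date, lon1, lat1) %>%
  filter(id == 29654) %>%
  left_join(biological) %>%
  complete(nesting(id, date, lon1, lat1),
           species = unique(biological$species),
           fill = list(kg = 0, n = 0))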
right_join is really the inverse of left_join: it keeps all records of the second table rather than the first. inner_join, in turn, only keeps records that have a match in both tables. An example of joining station information only with stations where monkfish was recorded:
station %>%
  inner_join(biological %>%
               filter(species == "monkfish"))
## # A tibble: 2,038 × 31
## id date year vid tow_id t1 t2
## <dbl> <date> <dbl> <dbl> <chr> <dttm> <dttm>
## 1 44154 1990-03-11 1990 1273 374-1 1990-03-11 10:25:00 1990-03-11 11:30:00
## 2 44271 1990-03-12 1990 1273 323-1 1990-03-12 09:15:00 1990-03-12 10:25:00
## 3 44382 1990-03-13 1990 1273 320-4 1990-03-13 14:20:00 1990-03-13 15:20:00
## 4 44156 1990-03-11 1990 1273 374-3 1990-03-11 07:10:00 1990-03-11 08:15:00
## 5 44379 1990-03-14 1990 1273 320-3 1990-03-14 01:40:00 1990-03-14 02:40:00
## 6 44377 1990-03-14 1990 1273 320-12 1990-03-14 05:45:00 1990-03-14 06:50:00
## 7 44375 1990-03-14 1990 1273 321-11 1990-03-14 09:30:00 1990-03-14 10:40:00
## 8 44373 1990-03-14 1990 1273 371-12 1990-03-14 13:30:00 1990-03-14 14:20:00
## 9 44356 1990-03-16 1990 1273 366-11 1990-03-16 00:50:00 1990-03-16 01:30:00
## 10 50775 1992-03-15 1992 1273 371-12 1992-03-15 09:00:00 1992-03-15 10:10:00
## # … with 2,028 more rows, and 24 more variables: lon1 <dbl>, lat1 <dbl>,
## # lon2 <dbl>, lat2 <dbl>, ir <chr>, ir_lon <dbl>, ir_lat <dbl>, z1 <dbl>,
## # z2 <dbl>, temp_s <dbl>, temp_b <dbl>, speed <dbl>, duration <dbl>,
## # towlength <dbl>, horizontal <dbl>, verical <dbl>, wind <dbl>,
## # wind_direction <dbl>, bormicon <dbl>, oldstrata <dbl>, newstrata <dbl>,
## # species <chr>, kg <dbl>, n <dbl>
full_join retains all records from both tables, whether or not there is a match:
station %>%
  full_join(biological %>%
              filter(species == "monkfish")) %>%
  select(id, species:n)
## # A tibble: 19,846 × 4
## id species kg n
## <dbl> <chr> <dbl> <dbl>
## 1 44929 <NA> NA NA
## 2 44928 <NA> NA NA
## 3 44927 <NA> NA NA
## 4 44926 <NA> NA NA
## 5 44925 <NA> NA NA
## 6 44922 <NA> NA NA
## 7 44921 <NA> NA NA
## 8 44920 <NA> NA NA
## 9 44919 <NA> NA NA
## 10 44918 <NA> NA NA
## # … with 19,836 more rows
Here all the station records are retained, irrespective of whether monkfish was caught or not.
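Building on that, one could (a sketch, not part of the original text) turn the missing records into zeros and compute, say, how many stations recorded monkfish at all:
station %>%
  full_join(biological %>% filter(species == "monkfish")) %>%
  mutate(n = replace_na(n, 0)) %>%
  summarise(stations = n(),                 # 19846
            with_monkfish = sum(n > 0))     # should match the 2038 monkfish records above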
Run through this set of code, which mimics the join operations described above, and study what each call returns:
x <- tibble(A = c("a", "b", "c"),
            B = c("t", "u", "v"),
            C = c(1, 2, 3))
y <- tibble(A = c("a", "b", "d"),
            B = c("t", "u", "w"),
            D = c(3, 2, 1))
left_join(x, y)    # all rows of x, with y's columns added where there is a match
right_join(x, y)   # all rows of y, with x's columns added where there is a match
left_join(y, x)    # the same matches as right_join(x, y), with columns in a different order
inner_join(x, y)   # only rows that appear in both x and y
full_join(x, y)    # all rows of both x and y
Run through this set of code, which illustrates the row-binding and set operations, and study what each call returns:
y <- tibble(A = c("c", "d"),
            B = c("v", "w"),
            C = c(3, 4))
bind_rows(x, y)    # stack the rows of x and y (duplicates are kept)
union(x, y)        # all distinct rows that appear in x or y
intersect(x, y)    # rows that appear in both x and y
setdiff(x, y)      # rows of x that do not appear in y
setdiff(y, x)      # rows of y that do not appear in x