Tidy data

Preamble

Raw data are often not in a tidy format. The tidyr-package has a number of functions that are specifically designed to deal with restructuring untidy dataframes. Here only three commonly used functions will be introduced, gather, spread and separate by using very simple examples. Additional examples of use are provided in some of the case examples. And then a more thorough account is given in chapter 12 in R for Data Science.

Note that when we here refer to tidying data it is not the same thing as “data cleaning”. The latter deals specifically with correcting typos.

Needed libraries:

library(tidyverse)
library(readxl)

The data

We are going to use some very simple data here, just to cover the basic of the tidyr-functions. They are stored in an excel workbook, called tidyr.xlsx. Please have a look at the different sheets. To get the data we use the download.file-function (I assume that you have a directory named data-raw within you current R working directory):

download.file(url = "http://www.hafro.is/~einarhj/crfmr/data-raw/tidyr.xlsx",
              destfile = "data-raw/tidyr.xlsx",
              mode = "wb")

Gather

Datasets are often arranged in a wide format, mainly because it is more readable for humans. Take e.g. this dataset representing catches by gear by year:

wide <- read_excel("data-raw/tidyr.xlsx", "wide")
wide

## # A tibble: 3 × 3
##    gear `2001` `2002`
##   <chr>  <dbl>  <dbl>
## 1  hook    500    750
## 2  line    400    600
## 3  nets     20     20

Here the column names represent a variable name (year) and it contains the actual value (2001 and 2002). And in each row we have observation of catches in two rows not one.

To tidy dataset such as these we use the gather-function:

wide %>% 
  gather(key = year, value = catch, `2001`:`2002`)

## # A tibble: 6 × 3
##    gear  year catch
##   <chr> <chr> <dbl>
## 1  hook  2001   500
## 2  line  2001   400
## 3  nets  2001    20
## 4  hook  2002   750
## 5  line  2002   600
## 6  nets  2002    20

Here we specified that the key was year because the column names we are trying to gather store the value for the variable year and then the actual values are given the variable name “catch”.

Note that we could also have used a negative value for the selection of column:

# not run
wide %>% 
  gather(key = year, value = catch, -gear)

In the above we basically select all columns (to gather) except the gear column.

The acute eye may have spotted that the year is actually a character. There is an optional argument in gather called convert. The default is set to FALSE. If we overwrite the default and set it to TRUE, the function “tries to” to convert the variable to appropriate class:

wide %>% 
  gather(year, catch, -gear, convert = TRUE)

## # A tibble: 6 × 3
##    gear  year catch
##   <chr> <int> <dbl>
## 1  hook  2001   500
## 2  line  2001   400
## 3  nets  2001    20
## 4  hook  2002   750
## 5  line  2002   600
## 6  nets  2002    20

Spread

spread is the opposite of gather, i.e. takes the data from a long format to a wide format. Lets import a dataframe of a long format:

long <- read_excel("data-raw/tidyr.xlsx", "long")
long

## # A tibble: 12 × 4
##     gear  year   type value
##    <chr> <dbl>  <chr> <dbl>
## 1   hook  2001  catch   500
## 2   hook  2001 effort   100
## 3   hook  2002  catch   750
## 4   hook  2002 effort   150
## 5   line  2001  catch   400
## 6   line  2001 effort   200
## 7   line  2002  catch   600
## 8   line  2002 effort   300
## 9   nets  2001  catch    20
## 10  nets  2001 effort    50
## 11  nets  2002  catch    20
## 12  nets  2002 effort    25

In the example above an observation is catch and effort by a gear and year, but each observation is spread across two rows, where the numerical value have different meanings. We can spread these values across the two types as follows:

long %>% 
  spread(key = type, value = value)

## # A tibble: 6 × 4
##    gear  year catch effort
## * <chr> <dbl> <dbl>  <dbl>
## 1  hook  2001   500    100
## 2  hook  2002   750    150
## 3  line  2001   400    200
## 4  line  2002   600    300
## 5  nets  2001    20     50
## 6  nets  2002    20     25

Here we specified that the variable key (catch and effort) which we want to be separate column was stored in the column “type” and that the actual measurements were stored in the column “value”.

Separate

We may often need to separate one column into multiple columns. Take e.g. these VMS data:

vms <- read_excel("data-raw/tidyr.xlsx", "vms")
vms

## # A tibble: 100 × 2
##    longitude  latitude
##        <chr>     <chr>
## 1   4.0635 S  54.502 W
## 2   9.3088 N 56.1453 W
## 3   0.5545 S 53.7693 W
## 4   1.6254 S 53.4741 W
## 5  11.1198 N 54.2884 W
## 6   8.8444 N 56.2456 W
## 7  11.3536 N 54.3263 W
## 8   3.4886 N 53.5014 W
## 9   2.4971 S 56.6335 W
## 10   4.047 S 55.2361 W
## # ... with 90 more rows

Here we have a mix of numbers and text, the text characters “S”, “N” and “W” referring to South, North and West. An as it stand, given that the columns are of class character, we can not do any numerical analysis or plotting of the data. Here separate comes to the rescue. The argument it takes is the name of the column we want to separate and the name of the columns we want to split the data into, and the separator:

vms %>% 
  separate(longitude, c("lon", "NS"), sep = " ") %>% 
  separate(latitude, c("lat", "EW"), sep = " ")

## # A tibble: 100 × 4
##        lon    NS     lat    EW
## *    <chr> <chr>   <chr> <chr>
## 1   4.0635     S  54.502     W
## 2   9.3088     N 56.1453     W
## 3   0.5545     S 53.7693     W
## 4   1.6254     S 53.4741     W
## 5  11.1198     N 54.2884     W
## 6   8.8444     N 56.2456     W
## 7  11.3536     N 54.3263     W
## 8   3.4886     N 53.5014     W
## 9   2.4971     S 56.6335     W
## 10   4.047     S 55.2361     W
## # ... with 90 more rows

Notice that we still have the lons and lats as character vector. Here we could set the additional argument convert to TRUE:

vms %>% 
  separate(longitude, c("lon", "NS"), sep = " ", convert = TRUE) %>% 
  separate(latitude, c("lat", "EW"), sep = " ", convert = TRUE)

## # A tibble: 100 × 4
##        lon    NS     lat    EW
## *    <dbl> <chr>   <dbl> <chr>
## 1   4.0635     S 54.5020     W
## 2   9.3088     N 56.1453     W
## 3   0.5545     S 53.7693     W
## 4   1.6254     S 53.4741     W
## 5  11.1198     N 54.2884     W
## 6   8.8444     N 56.2456     W
## 7  11.3536     N 54.3263     W
## 8   3.4886     N 53.5014     W
## 9   2.4971     S 56.6335     W
## 10  4.0470     S 55.2361     W
## # ... with 90 more rows

So now we have the data in a tidy format. We could simplify this further, because when we start analyzing the data we would use a negative coordinate for western longitude and southern latitudes:

vms %>% 
  separate(longitude, c("lon", "NS"), sep = " ", convert = TRUE) %>% 
  separate(latitude, c("lat", "EW"), sep = " ", convert = TRUE) %>% 
  mutate(lon = ifelse(NS == "S", -lon, lon),
         lat = ifelse(EW == "W", -lat, lat)) %>% 
  select(lon, lat)

## # A tibble: 100 × 2
##        lon      lat
##      <dbl>    <dbl>
## 1  -4.0635 -54.5020
## 2   9.3088 -56.1453
## 3  -0.5545 -53.7693
## 4  -1.6254 -53.4741
## 5  11.1198 -54.2884
## 6   8.8444 -56.2456
## 7  11.3536 -54.3263
## 8   3.4886 -53.5014
## 9  -2.4971 -56.6335
## 10 -4.0470 -55.2361
## # ... with 90 more rows