Raw data are often not in a tidy format. The tidyr-package has a number of functions that are specifically designed to deal with restructuring untidy dataframes. Here only three commonly used functions will be introduced, gather
, spread
and separate
by using very simple examples. Additional examples of use are provided in some of the case examples. And then a more thorough account is given in chapter 12 in R for Data Science.
Note that when we here refer to tidying data it is not the same thing as “data cleaning”. The latter deals specifically with correcting typos.
Needed libraries:
library(tidyverse)
library(readxl)
We are going to use some very simple data here, just to cover the basic of the tidyr-functions. They are stored in an excel workbook, called tidyr.xlsx. Please have a look at the different sheets. To get the data we use the download.file
-function (I assume that you have a directory named data-raw within you current R working directory):
download.file(url = "http://www.hafro.is/~einarhj/crfmr/data-raw/tidyr.xlsx",
destfile = "data-raw/tidyr.xlsx",
mode = "wb")
Datasets are often arranged in a wide format, mainly because it is more readable for humans. Take e.g. this dataset representing catches by gear by year:
wide <- read_excel("data-raw/tidyr.xlsx", "wide")
wide
## # A tibble: 3 × 3
## gear `2001` `2002`
## <chr> <dbl> <dbl>
## 1 hook 500 750
## 2 line 400 600
## 3 nets 20 20
Here the column names represent a variable name (year) and it contains the actual value (2001 and 2002). And in each row we have observation of catches in two rows not one.
To tidy dataset such as these we use the gather
-function:
wide %>%
gather(key = year, value = catch, `2001`:`2002`)
## # A tibble: 6 × 3
## gear year catch
## <chr> <chr> <dbl>
## 1 hook 2001 500
## 2 line 2001 400
## 3 nets 2001 20
## 4 hook 2002 750
## 5 line 2002 600
## 6 nets 2002 20
Here we specified that the key was year because the column names we are trying to gather store the value for the variable year and then the actual values are given the variable name “catch”.
Note that we could also have used a negative value for the selection of column:
# not run
wide %>%
gather(key = year, value = catch, -gear)
In the above we basically select all columns (to gather) except the gear column.
The acute eye may have spotted that the year is actually a character. There is an optional argument in gather
called convert. The default is set to FALSE. If we overwrite the default and set it to TRUE, the function “tries to” to convert the variable to appropriate class:
wide %>%
gather(year, catch, -gear, convert = TRUE)
## # A tibble: 6 × 3
## gear year catch
## <chr> <int> <dbl>
## 1 hook 2001 500
## 2 line 2001 400
## 3 nets 2001 20
## 4 hook 2002 750
## 5 line 2002 600
## 6 nets 2002 20
spread
is the opposite of gather
, i.e. takes the data from a long format to a wide format. Lets import a dataframe of a long format:
long <- read_excel("data-raw/tidyr.xlsx", "long")
long
## # A tibble: 12 × 4
## gear year type value
## <chr> <dbl> <chr> <dbl>
## 1 hook 2001 catch 500
## 2 hook 2001 effort 100
## 3 hook 2002 catch 750
## 4 hook 2002 effort 150
## 5 line 2001 catch 400
## 6 line 2001 effort 200
## 7 line 2002 catch 600
## 8 line 2002 effort 300
## 9 nets 2001 catch 20
## 10 nets 2001 effort 50
## 11 nets 2002 catch 20
## 12 nets 2002 effort 25
In the example above an observation is catch and effort by a gear and year, but each observation is spread across two rows, where the numerical value have different meanings. We can spread these values across the two types as follows:
long %>%
spread(key = type, value = value)
## # A tibble: 6 × 4
## gear year catch effort
## * <chr> <dbl> <dbl> <dbl>
## 1 hook 2001 500 100
## 2 hook 2002 750 150
## 3 line 2001 400 200
## 4 line 2002 600 300
## 5 nets 2001 20 50
## 6 nets 2002 20 25
Here we specified that the variable key (catch and effort) which we want to be separate column was stored in the column “type” and that the actual measurements were stored in the column “value”.
We may often need to separate one column into multiple columns. Take e.g. these VMS data:
vms <- read_excel("data-raw/tidyr.xlsx", "vms")
vms
## # A tibble: 100 × 2
## longitude latitude
## <chr> <chr>
## 1 4.0635 S 54.502 W
## 2 9.3088 N 56.1453 W
## 3 0.5545 S 53.7693 W
## 4 1.6254 S 53.4741 W
## 5 11.1198 N 54.2884 W
## 6 8.8444 N 56.2456 W
## 7 11.3536 N 54.3263 W
## 8 3.4886 N 53.5014 W
## 9 2.4971 S 56.6335 W
## 10 4.047 S 55.2361 W
## # ... with 90 more rows
Here we have a mix of numbers and text, the text characters “S”, “N” and “W” referring to South, North and West. An as it stand, given that the columns are of class character, we can not do any numerical analysis or plotting of the data. Here separate
comes to the rescue. The argument it takes is the name of the column we want to separate and the name of the columns we want to split the data into, and the separator:
vms %>%
separate(longitude, c("lon", "NS"), sep = " ") %>%
separate(latitude, c("lat", "EW"), sep = " ")
## # A tibble: 100 × 4
## lon NS lat EW
## * <chr> <chr> <chr> <chr>
## 1 4.0635 S 54.502 W
## 2 9.3088 N 56.1453 W
## 3 0.5545 S 53.7693 W
## 4 1.6254 S 53.4741 W
## 5 11.1198 N 54.2884 W
## 6 8.8444 N 56.2456 W
## 7 11.3536 N 54.3263 W
## 8 3.4886 N 53.5014 W
## 9 2.4971 S 56.6335 W
## 10 4.047 S 55.2361 W
## # ... with 90 more rows
Notice that we still have the lons and lats as character vector. Here we could set the additional argument convert to TRUE:
vms %>%
separate(longitude, c("lon", "NS"), sep = " ", convert = TRUE) %>%
separate(latitude, c("lat", "EW"), sep = " ", convert = TRUE)
## # A tibble: 100 × 4
## lon NS lat EW
## * <dbl> <chr> <dbl> <chr>
## 1 4.0635 S 54.5020 W
## 2 9.3088 N 56.1453 W
## 3 0.5545 S 53.7693 W
## 4 1.6254 S 53.4741 W
## 5 11.1198 N 54.2884 W
## 6 8.8444 N 56.2456 W
## 7 11.3536 N 54.3263 W
## 8 3.4886 N 53.5014 W
## 9 2.4971 S 56.6335 W
## 10 4.0470 S 55.2361 W
## # ... with 90 more rows
So now we have the data in a tidy format. We could simplify this further, because when we start analyzing the data we would use a negative coordinate for western longitude and southern latitudes:
vms %>%
separate(longitude, c("lon", "NS"), sep = " ", convert = TRUE) %>%
separate(latitude, c("lat", "EW"), sep = " ", convert = TRUE) %>%
mutate(lon = ifelse(NS == "S", -lon, lon),
lat = ifelse(EW == "W", -lat, lat)) %>%
select(lon, lat)
## # A tibble: 100 × 2
## lon lat
## <dbl> <dbl>
## 1 -4.0635 -54.5020
## 2 9.3088 -56.1453
## 3 -0.5545 -53.7693
## 4 -1.6254 -53.4741
## 5 11.1198 -54.2884
## 6 8.8444 -56.2456
## 7 11.3536 -54.3263
## 8 3.4886 -53.5014
## 9 -2.4971 -56.6335
## 10 -4.0470 -55.2361
## # ... with 90 more rows