Case examples: Logbook data

Preamble

Here we are going to work on tidying the Logbook data introduce in Importing data from Excel. Although the codes provided below are case specific there are generally only limited number of functions one needs to master to import consistently formatted excel sheets:

readWorksheet: Read in a whole sheet or a selected part of a sheet
select: Pick or drop columns by name
gather: Gather columns into rows
separate: Separate one column into many
spread: Spread rows into columns

Needed libraries:

library(tidyverse)
library(XLConnect)

If one thinks about what are the “true” records (rows) in the Logbook data it basically boils down to two landing value (different categories) by species by harbour, boat and trip (we will ignore the harbor for now, see further below). So the variables (columns) we would like to end with are:

id: This is just a numerical row identifier that most likely is also a numerical species identifier.
species: English name for a species
latin: Latin name for a species
boat: name of boat
trip: Numerical value for the trip
ag: Amount landed in category “ag”
lg: Amount landed in category “lg”

Lets load the wb into R and get the sheet names:

wb <- loadWorkbook("data-raw/logbook.xlsx")
sheets <- getSheets(wb)

Side point, try running this code

sheets[1]
sheets[2]
sheets[3]
i <- 7
sheets[i]

Note that each line returns the name of the 1st, 2nd, 3rd and the ith sheet (here 7). We are going to use this feature further down the line.

Reading in metadata

Before we go into reading the actual catch data lets read in the metadata that reside in region “A4:A6”. If only because the code is easier.

meta <- readWorksheet(wb, sheet = 1,
                      region = "A4:A6",
                      header = FALSE)
meta

##                    Col1
## 1 Boat name: Georgieann
## 2      Boat number: 999
## 3     Landing plcace: 1

Since no column header was read in we are given a default column header “Col1”. We can use the separate function to split the data into two columns:

meta <-
  meta %>% 
  separate(Col1, c("variable", "value"), sep = ": ")
meta

##         variable      value
## 1      Boat name Georgieann
## 2    Boat number        999
## 3 Landing plcace          1

What is in the first column are variables, that ideally should be column names. Here we can use the spread-function:

meta %>% 
  spread(variable, value)

##    Boat name Boat number Landing plcace
## 1 Georgieann         999              1

Because we want to read in the data from many sheets (vessels) we generate a function:

read_metadata <- function(wb, sheet) {
  
  res <-
    readWorksheet(wb, sheet = sheet,
                      region = "A4:A6",
                      header = FALSE) %>% 
    separate(Col1, c("variable", "value"), sep = ": ") %>% 
    spread(variable, value)
  return(res)
}

To read in metadata for the second sheet we now just do:

read_metadata(wb, 2)

##   Boat name Boat number Landing plcace
## 1 Donatello         998              1

We can then create a loop to read in all the sheets:

res <- list()
for(i in 1:length(sheets)) {
  res[[i]] <- read_metadata(wb, i)
}
meta <- bind_rows(res)
meta

##     Boat name Boat number Landing plcace
## 1  Georgieann         999              1
## 2   Donatello         998              1
## 3    Merriman         997              1
## 4   Grandison         996              1
## 5   Deyaneira         995              1
## 6     Breklyn         993              2
## 7      Jalina         992              2
## 8     Dayanne         991              1
## 9      Arsene         990              1
## 10   Ladaijah         989              1
## 11      Yhair         988              1
## 12    Kiernan         987              1
## 13    Dalayah         986              3
## 14      Harue         985              3
## 15     Ladema         984              3
## 16   Kyrollos         983              3
## 17   Aleander         982              3
## 18     Klowie         982              3
## 19      Doxie         981              4
## 20      Neven         980              4
## 21     Rowdie         979              5

We could actually include the loop within the function, so lets modify it:

read_metadata <- function(wb, sheet) {
  
  res <- list()
  
  for(i in 1:length(sheet)) {
  res[[i]] <-
    readWorksheet(wb, sheet = sheet[i],
                      region = "A4:A6",
                      header = FALSE) %>% 
    separate(Col1, c("variable", "value"), sep = ": ") %>% 
    spread(variable, value)
  }
  meta <- bind_rows(res)
  return(meta)
}

And then we could read in one sheet:

read_metadata(wb, sheets[10])

##   Boat name Boat number Landing plcace
## 1  Ladaijah         989              1

Or all of them

read_metadata(wb, sheets)

##     Boat name Boat number Landing plcace
## 1  Georgieann         999              1
## 2   Donatello         998              1
## 3    Merriman         997              1
## 4   Grandison         996              1
## 5   Deyaneira         995              1
## 6     Breklyn         993              2
## 7      Jalina         992              2
## 8     Dayanne         991              1
## 9      Arsene         990              1
## 10   Ladaijah         989              1
## 11      Yhair         988              1
## 12    Kiernan         987              1
## 13    Dalayah         986              3
## 14      Harue         985              3
## 15     Ladema         984              3
## 16   Kyrollos         983              3
## 17   Aleander         982              3
## 18     Klowie         982              3
## 19      Doxie         981              4
## 20      Neven         980              4
## 21     Rowdie         979              5

Reading in and tidying the catch data

As said the catch data start in row 9 in all the sheets. And because different sheets have different number columns to read in we would not like to specify each time different sheet areas. So to read in the first sheet we do:

i <- 1
d <- readWorksheet(wb, sheets[i],
                   startRow = 9) %>% 
  tbl_df()

Lets work the AG landing category first. What we want is to end up with a dataframe that has three columns:

id: Row number, correponds to species
trip: The trip number
ag: The weight of the catch

So we need to take the following steps:

Select the appropriate columns
Transform the data from wide to long
Get the trip number in order

We select the column by:

ag <-
  d %>% 
  select(id, starts_with("AG"))
glimpse(ag)

## Observations: 61
## Variables: 35
## $ id    <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 1...
## $ AG    <dbl> NA, NA, NA, NA, 1199, NA, NA, 1118, NA, NA, 384, 449, 41...
## $ AG.1  <dbl> 490, NA, 663, 542, 1265, NA, 148, 19, NA, NA, NA, NA, 12...
## $ AG.2  <dbl> 474, NA, 214, 740, 1097, NA, 781, 1921, 162, NA, 112, NA...
## $ AG.3  <dbl> 260, NA, 770, 434, 1168, NA, 347, 2093, 87, NA, NA, NA, ...
## $ AG.4  <dbl> 321, NA, 415, 292, 1087, NA, 82, 82, NA, NA, NA, 133, NA...
## $ AG.5  <dbl> 255, NA, 140, 278, 417, NA, 71, 1359, NA, NA, NA, NA, NA...
## $ AG.6  <dbl> 561, NA, 316, 350, 633, 36, 316, 3878, 33, NA, 224, NA, ...
## $ AG.7  <dbl> 224, NA, 227, 2211, 551, NA, 633, 3391, 71, NA, 108, NA,...
## $ AG.8  <dbl> 561, NA, 214, 1097, 418, NA, 852, 9896, 70, NA, NA, NA, ...
## $ AG.9  <dbl> 469, NA, 358, 306, 770, NA, 567, 2640, NA, NA, 867, NA, ...
## $ AG.10 <dbl> 347, NA, 276, 1236, 816, NA, 289, 6395, 28, NA, 1428, NA...
## $ AG.11 <dbl> 357, NA, 439, 995, 643, NA, 469, 5206, 18, NA, NA, NA, 1...
## $ AG.12 <dbl> 337, NA, NA, 359, 689, NA, 439, 3646, 10, NA, NA, NA, 30...
## $ AG.13 <dbl> 495, NA, 393, 556, 719, NA, 1245, 4617, 8, NA, NA, NA, 2...
## $ AG.14 <dbl> 418, NA, 339, 291, 852, NA, 1148, 1618, NA, NA, 31, NA, ...
## $ AG.15 <dbl> 341, NA, 265, 332, 648, NA, 286, 722, NA, NA, NA, NA, 19...
## $ AG.16 <dbl> 689, NA, 296, 332, 781, NA, 481, 961, NA, NA, 1276, NA, ...
## $ AG.17 <dbl> 490, NA, 327, 398, 1020, NA, 578, 1363, NA, NA, 163, NA,...
## $ AG.18 <dbl> 689, NA, 270, 366, 837, NA, 512, 3679, 15, NA, NA, NA, 2...
## $ AG.19 <dbl> 337, NA, 270, 1321, 973, NA, 669, 7731, 3, NA, NA, NA, 3...
## $ AG.20 <dbl> 577, NA, 272, 2121, 878, NA, 459, 5906, NA, NA, NA, NA, ...
## $ AG.21 <dbl> 495, NA, 255, 720, 842, NA, 671, 1784, NA, NA, NA, NA, 4...
## $ AG.22 <dbl> 270, NA, 281, 2277, 781, NA, 933, 8671, NA, NA, NA, 56, ...
## $ AG.23 <dbl> 316, NA, 332, 597, 821, NA, 760, 9270, NA, NA, NA, NA, 6...
## $ AG.24 <dbl> 388, NA, 370, 2190, 709, NA, 464, 5259, NA, NA, NA, NA, ...
## $ AG.25 <dbl> 286, NA, 270, 1171, 816, NA, 573, 2104, NA, NA, NA, NA, ...
## $ AG.26 <dbl> 388, NA, 311, 296, 816, NA, 677, 1042, NA, NA, NA, NA, 5...
## $ AG.27 <dbl> 296, NA, 270, 1356, 730, NA, NA, 4278, 15, NA, NA, NA, N...
## $ AG.28 <dbl> 465, NA, 230, 279, 699, NA, 449, 306, NA, NA, NA, NA, 33...
## $ AG.29 <dbl> 463, NA, 2633, 332, 878, NA, 673, 1283, NA, NA, NA, NA, ...
## $ AG.30 <dbl> 473, NA, 771, 482, 1010, NA, 487, 517, 8, NA, NA, NA, 44...
## $ AG.31 <dbl> NA, NA, 321, 520, 668, NA, 1107, 9337, 55, NA, NA, 35, N...
## $ AG.32 <dbl> 454, NA, NA, 440, 668, NA, 1029, 6289, 22, NA, NA, NA, 5...
## $ AG.33 <dbl> 398, NA, 704, 502, 658, NA, 491, 2935, NA, NA, NA, NA, 1...

Besides the id column each column represents a trip. I.e. this is a variable that we would like to put in a single column. To do that we use the gather-function, where we give the variable that are now as column names the name trip and the catch weight we name AG. Since id is already in order we do not reformat that column:

ag <-
  ag %>% 
  gather(key = trip, value = ag, -id)
glimpse(ag)

## Observations: 2,074
## Variables: 3
## $ id   <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17...
## $ trip <chr> "AG", "AG", "AG", "AG", "AG", "AG", "AG", "AG", "AG", "AG...
## $ ag   <dbl> NA, NA, NA, NA, 1199, NA, NA, 1118, NA, NA, 384, 449, 41,...

Take note that now we have 2074 rows, whereas the orginal data had 61 rows, one for each species. We can use the table-function to quickly get an overview of the number of redords for each id:

table(ag$id, useNA = "ifany")

## 
##    1    2    3    4    5    6    7    8    9   10   11   12   13   14   15 
##   34   34   34   34   34   34   34   34   34   34   34   34   34   34   34 
##   16   17   18   19   20   21   22   23   24   25   26   27   28   29   30 
##   34   34   34   34   34   34   34   34   34   34   34   34   34   34   34 
##   31   32   33   34   35   36   37   38   39   40   41   42   43   44   45 
##   34   34   34   34   34   34   34   34   34   34   34   34   34   34   34 
##   46   47   48   49   50   51   52   53   54   55   56   57   58   59   60 
##   34   34   34   34   34   34   34   34   34   34   34   34   34   34   34 
## <NA> 
##   34

so we have 34 record (trips) for each id (species). The reason that we have the same number for each species is that the data includes NA’s. Notice also that there are records that does not have an id. This is because in the original worksheet there is a column sum at the bottom of each trip. We will get rid of that in the final version of the script.

Now the trip column contains the formerly column names “AG”, “AG.1”, “AG.2”, etc. Names that represent the trip number. We need to split that column into two separate columns.

ag <- 
  ag %>% 
  separate(trip, c("category", "trip"), convert = TRUE)

## Warning: Too few values at 61 locations: 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11,
## 12, 13, 14, 15, 16, 17, 18, 19, 20, ...

The warning message you get is because the first trip is really labelled just “AG” and does hence not have any “.” that acts as the separator.

glimpse(ag)

## Observations: 2,074
## Variables: 4
## $ id       <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16...
## $ category <chr> "AG", "AG", "AG", "AG", "AG", "AG", "AG", "AG", "AG",...
## $ trip     <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
## $ ag       <dbl> NA, NA, NA, NA, 1199, NA, NA, 1118, NA, NA, 384, 449,...

Now the category column contains only “AG” and the trip column the numerical value for the trip number. The reason we gotten the latter as numerical value is because the argument convert was set to TRUE. We notice also that for the first trip has no value, the second trip has the value one, etc. So we need fix that using the following code:

ag <- 
  ag %>% 
  select(-category) %>%      # We do not need that column anymore
  mutate(trip = ifelse(is.na(trip), 0, trip),   # Set the first trip to value 0
         trip = trip + 1)                       # Add one to each trip number

The 6 commands we have used can by tied into a single pipe:

ag <-
  d %>% 
  filter(!is.na(id)) %>% 
  select(id, starts_with("AG")) %>% 
  gather(key = trip, value = ag, -id) %>% 
  separate(trip, c("category", "trip"), convert = TRUE) %>% 
  select(-category) %>% 
  mutate(trip = ifelse(is.na(trip), 0, trip),
         trip = trip + 1)

To work on the “LG” landing categoy we only need to change the code in two places

lg <-
  d %>% 
  filter(!is.na(id)) %>% 
  select(id, starts_with("LG")) %>% 
  gather(key = trip, value = lg, -id) %>% 
  separate(trip, c("category", "trip"), convert = TRUE) %>% 
  select(-category) %>% 
  mutate(trip = ifelse(is.na(trip), 0, trip),
         trip = trip + 1)

We can now tie the two data frame together:

catch <- 
  left_join(ag, lg, by = c("trip", "id")) %>% 
  arrange(trip, id)

We finally want to add the name of the vessel:

catch <-
  catch %>% 
  mutate(vessel = sheets[i])

We then simply generate a species table from the data that was originally read in an stored in object d:

species <- 
  d %>% 
  select(id:latin)

Generating a function

Now since the data in the different sheets is suspected to be of the same structural format (with the exception that the number of trips is variable) we should be able to repeat the procedure. And since we are going to attempt this we may as well generate our own function that includes the essence of the script we generated above within a single object R object that is a function. We give the function a very generic name (read_catch) and add some (hopefully) useful comments/reminders to ourselves:

wb <- loadWorkbook("data-raw/logbook.xlsx")

read_catch <- function(wb, sheet.no = 1) {
  
  sheets <- getSheets(wb)
  d <- readWorksheet(wb, sheet.no,
                   startRow = 9) %>% 
  tbl_df()
  
  ag <-
    d %>% 
    filter(!is.na(id)) %>% 
    select(id, starts_with("AG")) %>% 
    gather(key = trip, value = ag, -id) %>% 
    separate(trip, c("category", "trip"), convert = TRUE) %>% 
    select(-category) %>% 
    mutate(trip = ifelse(is.na(trip), 0, trip),
           trip = trip + 1,
           ag = as.numeric(ag))
  lg <-
    d %>% 
    filter(!is.na(id)) %>% 
    select(id, starts_with("LG")) %>% 
    gather(key = trip, value = lg, -id) %>% 
    separate(trip, c("category", "trip"), convert = TRUE) %>% 
    select(-category) %>% 
    mutate(trip = ifelse(is.na(trip), 0, trip),
           trip = trip + 1,
           lg = as.numeric(lg))
  catch <-
    ag %>% 
    left_join(lg, by = c("trip", "id")) %>% 
    filter(!is.na(ag)) %>%  
    mutate(vessel = sheets[i])
  
  return(catch)
}

Now to use this function and read in the second sheet:

res <- read_catch(wb, sheet.no = 2)
glimpse(res)

## Observations: 596
## Variables: 5
## $ id     <dbl> 1, 3, 4, 5, 7, 8, 13, 14, 19, 23, 26, 27, 29, 31, 37, 4...
## $ trip   <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1...
## $ ag     <dbl> 398, 704, 502, 658, 491, 2935, 10, 190, 622, 1329, 1020...
## $ lg     <dbl> 4189, 741, 502, 658, 491, 2935, 10, 190, 1073, 1398, 10...
## $ vessel <chr> "Georgieann", "Georgieann", "Georgieann", "Georgieann",...

Reading in multiple sheets

The next step is to read in all the sheet. Here we demonstrate it by using a for-loop (TODO: Need to add documentation on the basics in the introduction).

res <- list()
for (i in 1:length(sheets)) {
  res[[i]] <- read_catch(wb, sheets[i]) # read in sheet number i
}
# aggregate the data
catch <- bind_rows(res)
# take a peek
glimpse(catch)

## Observations: 8,006
## Variables: 5
## $ id     <dbl> 5, 8, 11, 12, 13, 14, 23, 24, 26, 27, 29, 34, 37, 45, 4...
## $ trip   <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2...
## $ ag     <dbl> 1199, 1118, 384, 449, 41, 1173, 2088, 827, 2709, 407, 5...
## $ lg     <dbl> 1199, 1118, 404, 449, 43, 1173, 2198, 827, 2709, 428, 5...
## $ vessel <chr> "Georgieann", "Georgieann", "Georgieann", "Georgieann",...

Now the data are in a tidy format and hence ready for some analysis. E.g. calculate the number of trips and the total catch by vessel:

catch %>% 
  mutate(catch = ag + lg) %>% 
  group_by(vessel) %>% 
  summarise(n = n_distinct(trip),
            catch = sum(catch))

## # A tibble: 20 × 3
##        vessel     n     catch
##         <chr> <int>     <dbl>
## 1    Aleander    30  516938.0
## 2      Arsene    27  821960.0
## 3     Breklyn    25 1228870.0
## 4     Dalayah    29  432840.0
## 5     Dayanne    27  586388.0
## 6   Deyaneira    31 1397440.0
## 7   Donatello    26 1657479.0
## 8       Doxie    11  293574.0
## 9  Georgieann    34 1431449.0
## 10  Grandison    37 1550092.0
## 11      Harue    31  450944.0
## 12     Jalina    13  269508.0
## 13    Kiernan    17  421460.0
## 14   Kyrollos    26  390980.0
## 15   Ladaijah    20  334840.0
## 16     Ladema    28  446140.0
## 17   Merriman    31 1519968.0
## 18      Neven    10  348628.0
## 19     Rowdie     6   64676.4
## 20      Yhair    23  811730.0

Or, e.g. the catch of each species and display the top 10 species with the highest catch:

catch %>% 
  mutate(catch = ag + lg) %>%
  group_by(id) %>% 
  summarise(catch = sum(catch)) %>% 
  arrange(desc(catch)) %>% 
  slice(1:10) %>% 
  left_join(species, by = "id")

## # A tibble: 10 × 4
##       id     catch                        species
##    <dbl>     <dbl>                          <chr>
## 1     23 2146263.2                    KANDRATIKI 
## 2     45 1868574.0    WIT WITTIE /TONKIN WEAKFISH
## 3      8 1396952.0 BLACK SNAPPER (COROCORO GRUNT)
## 4     47 1018283.0                     SILVERBELT
## 5     18  990339.2       GRUNT/ MILKFISH/NEERTJIE
## 6     26  806995.2                        KOEPILA
## 7     14  799531.0            DAGOETIFI/BANGAMARY
## 8     27  790911.2        KROKUS/ZEEKOEBI/CROAKER
## 9     29  709479.6                  LANE SNAPPER 
## 10    37  535196.8                           POES
## # ... with 1 more variables: latin <chr>

knitr::opts_chunk$set(eval = FALSE)

Recapitulation of the case

The above seems like a lot of steps to take. But once we have set up the code correctly we can apply it to more Excel sheets that have the same structure, e.g. for other years. If we save the functions in an R-script tidyr_case1.R under e.g. the R-directory we effectively really end with only needing the following code:

rm(list=ls())           # Nothing up my sleve, deleted everthing the working directory
library(tidyverse)
library(XLConnect)

source("R/tidy_case1.R")  # TODO: Need to clarify this in the basic documents
wb <- loadWorkbook("data-raw/logbook.xlsx")
sheets <- getSheets(wb)
catch <- read_catch(wb, sheets)
vessel <- read_metadata(wb, sheets)