Here we are going to work on tidying the Logbook data introduce in Importing data from Excel. Although the codes provided below are case specific there are generally only limited number of functions one needs to master to import consistently formatted excel sheets:
readWorksheet
: Read in a whole sheet or a selected part of a sheetselect
: Pick or drop columns by namegather
: Gather columns into rowsseparate
: Separate one column into manyspread
: Spread rows into columnsNeeded libraries:
library(tidyverse)
library(XLConnect)
If one thinks about what are the “true” records (rows) in the Logbook data it basically boils down to two landing value (different categories) by species by harbour, boat and trip (we will ignore the harbor for now, see further below). So the variables (columns) we would like to end with are:
Lets load the wb into R and get the sheet names:
wb <- loadWorkbook("data-raw/logbook.xlsx")
sheets <- getSheets(wb)
Side point, try running this code
sheets[1]
sheets[2]
sheets[3]
i <- 7
sheets[i]
Note that each line returns the name of the 1st, 2nd, 3rd and the ith sheet (here 7). We are going to use this feature further down the line.
Before we go into reading the actual catch data lets read in the metadata that reside in region “A4:A6”. If only because the code is easier.
meta <- readWorksheet(wb, sheet = 1,
region = "A4:A6",
header = FALSE)
meta
## Col1
## 1 Boat name: Georgieann
## 2 Boat number: 999
## 3 Landing plcace: 1
Since no column header was read in we are given a default column header “Col1”. We can use the separate
function to split the data into two columns:
meta <-
meta %>%
separate(Col1, c("variable", "value"), sep = ": ")
meta
## variable value
## 1 Boat name Georgieann
## 2 Boat number 999
## 3 Landing plcace 1
What is in the first column are variables, that ideally should be column names. Here we can use the spread
-function:
meta %>%
spread(variable, value)
## Boat name Boat number Landing plcace
## 1 Georgieann 999 1
Because we want to read in the data from many sheets (vessels) we generate a function:
read_metadata <- function(wb, sheet) {
res <-
readWorksheet(wb, sheet = sheet,
region = "A4:A6",
header = FALSE) %>%
separate(Col1, c("variable", "value"), sep = ": ") %>%
spread(variable, value)
return(res)
}
To read in metadata for the second sheet we now just do:
read_metadata(wb, 2)
## Boat name Boat number Landing plcace
## 1 Donatello 998 1
We can then create a loop to read in all the sheets:
res <- list()
for(i in 1:length(sheets)) {
res[[i]] <- read_metadata(wb, i)
}
meta <- bind_rows(res)
meta
## Boat name Boat number Landing plcace
## 1 Georgieann 999 1
## 2 Donatello 998 1
## 3 Merriman 997 1
## 4 Grandison 996 1
## 5 Deyaneira 995 1
## 6 Breklyn 993 2
## 7 Jalina 992 2
## 8 Dayanne 991 1
## 9 Arsene 990 1
## 10 Ladaijah 989 1
## 11 Yhair 988 1
## 12 Kiernan 987 1
## 13 Dalayah 986 3
## 14 Harue 985 3
## 15 Ladema 984 3
## 16 Kyrollos 983 3
## 17 Aleander 982 3
## 18 Klowie 982 3
## 19 Doxie 981 4
## 20 Neven 980 4
## 21 Rowdie 979 5
We could actually include the loop within the function, so lets modify it:
read_metadata <- function(wb, sheet) {
res <- list()
for(i in 1:length(sheet)) {
res[[i]] <-
readWorksheet(wb, sheet = sheet[i],
region = "A4:A6",
header = FALSE) %>%
separate(Col1, c("variable", "value"), sep = ": ") %>%
spread(variable, value)
}
meta <- bind_rows(res)
return(meta)
}
And then we could read in one sheet:
read_metadata(wb, sheets[10])
## Boat name Boat number Landing plcace
## 1 Ladaijah 989 1
Or all of them
read_metadata(wb, sheets)
## Boat name Boat number Landing plcace
## 1 Georgieann 999 1
## 2 Donatello 998 1
## 3 Merriman 997 1
## 4 Grandison 996 1
## 5 Deyaneira 995 1
## 6 Breklyn 993 2
## 7 Jalina 992 2
## 8 Dayanne 991 1
## 9 Arsene 990 1
## 10 Ladaijah 989 1
## 11 Yhair 988 1
## 12 Kiernan 987 1
## 13 Dalayah 986 3
## 14 Harue 985 3
## 15 Ladema 984 3
## 16 Kyrollos 983 3
## 17 Aleander 982 3
## 18 Klowie 982 3
## 19 Doxie 981 4
## 20 Neven 980 4
## 21 Rowdie 979 5
As said the catch data start in row 9 in all the sheets. And because different sheets have different number columns to read in we would not like to specify each time different sheet areas. So to read in the first sheet we do:
i <- 1
d <- readWorksheet(wb, sheets[i],
startRow = 9) %>%
tbl_df()
Lets work the AG landing category first. What we want is to end up with a dataframe that has three columns:
So we need to take the following steps:
We select the column by:
ag <-
d %>%
select(id, starts_with("AG"))
glimpse(ag)
## Observations: 61
## Variables: 35
## $ id <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 1...
## $ AG <dbl> NA, NA, NA, NA, 1199, NA, NA, 1118, NA, NA, 384, 449, 41...
## $ AG.1 <dbl> 490, NA, 663, 542, 1265, NA, 148, 19, NA, NA, NA, NA, 12...
## $ AG.2 <dbl> 474, NA, 214, 740, 1097, NA, 781, 1921, 162, NA, 112, NA...
## $ AG.3 <dbl> 260, NA, 770, 434, 1168, NA, 347, 2093, 87, NA, NA, NA, ...
## $ AG.4 <dbl> 321, NA, 415, 292, 1087, NA, 82, 82, NA, NA, NA, 133, NA...
## $ AG.5 <dbl> 255, NA, 140, 278, 417, NA, 71, 1359, NA, NA, NA, NA, NA...
## $ AG.6 <dbl> 561, NA, 316, 350, 633, 36, 316, 3878, 33, NA, 224, NA, ...
## $ AG.7 <dbl> 224, NA, 227, 2211, 551, NA, 633, 3391, 71, NA, 108, NA,...
## $ AG.8 <dbl> 561, NA, 214, 1097, 418, NA, 852, 9896, 70, NA, NA, NA, ...
## $ AG.9 <dbl> 469, NA, 358, 306, 770, NA, 567, 2640, NA, NA, 867, NA, ...
## $ AG.10 <dbl> 347, NA, 276, 1236, 816, NA, 289, 6395, 28, NA, 1428, NA...
## $ AG.11 <dbl> 357, NA, 439, 995, 643, NA, 469, 5206, 18, NA, NA, NA, 1...
## $ AG.12 <dbl> 337, NA, NA, 359, 689, NA, 439, 3646, 10, NA, NA, NA, 30...
## $ AG.13 <dbl> 495, NA, 393, 556, 719, NA, 1245, 4617, 8, NA, NA, NA, 2...
## $ AG.14 <dbl> 418, NA, 339, 291, 852, NA, 1148, 1618, NA, NA, 31, NA, ...
## $ AG.15 <dbl> 341, NA, 265, 332, 648, NA, 286, 722, NA, NA, NA, NA, 19...
## $ AG.16 <dbl> 689, NA, 296, 332, 781, NA, 481, 961, NA, NA, 1276, NA, ...
## $ AG.17 <dbl> 490, NA, 327, 398, 1020, NA, 578, 1363, NA, NA, 163, NA,...
## $ AG.18 <dbl> 689, NA, 270, 366, 837, NA, 512, 3679, 15, NA, NA, NA, 2...
## $ AG.19 <dbl> 337, NA, 270, 1321, 973, NA, 669, 7731, 3, NA, NA, NA, 3...
## $ AG.20 <dbl> 577, NA, 272, 2121, 878, NA, 459, 5906, NA, NA, NA, NA, ...
## $ AG.21 <dbl> 495, NA, 255, 720, 842, NA, 671, 1784, NA, NA, NA, NA, 4...
## $ AG.22 <dbl> 270, NA, 281, 2277, 781, NA, 933, 8671, NA, NA, NA, 56, ...
## $ AG.23 <dbl> 316, NA, 332, 597, 821, NA, 760, 9270, NA, NA, NA, NA, 6...
## $ AG.24 <dbl> 388, NA, 370, 2190, 709, NA, 464, 5259, NA, NA, NA, NA, ...
## $ AG.25 <dbl> 286, NA, 270, 1171, 816, NA, 573, 2104, NA, NA, NA, NA, ...
## $ AG.26 <dbl> 388, NA, 311, 296, 816, NA, 677, 1042, NA, NA, NA, NA, 5...
## $ AG.27 <dbl> 296, NA, 270, 1356, 730, NA, NA, 4278, 15, NA, NA, NA, N...
## $ AG.28 <dbl> 465, NA, 230, 279, 699, NA, 449, 306, NA, NA, NA, NA, 33...
## $ AG.29 <dbl> 463, NA, 2633, 332, 878, NA, 673, 1283, NA, NA, NA, NA, ...
## $ AG.30 <dbl> 473, NA, 771, 482, 1010, NA, 487, 517, 8, NA, NA, NA, 44...
## $ AG.31 <dbl> NA, NA, 321, 520, 668, NA, 1107, 9337, 55, NA, NA, 35, N...
## $ AG.32 <dbl> 454, NA, NA, 440, 668, NA, 1029, 6289, 22, NA, NA, NA, 5...
## $ AG.33 <dbl> 398, NA, 704, 502, 658, NA, 491, 2935, NA, NA, NA, NA, 1...
Besides the id column each column represents a trip. I.e. this is a variable that we would like to put in a single column. To do that we use the gather
-function, where we give the variable that are now as column names the name trip and the catch weight we name AG. Since id is already in order we do not reformat that column:
ag <-
ag %>%
gather(key = trip, value = ag, -id)
glimpse(ag)
## Observations: 2,074
## Variables: 3
## $ id <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17...
## $ trip <chr> "AG", "AG", "AG", "AG", "AG", "AG", "AG", "AG", "AG", "AG...
## $ ag <dbl> NA, NA, NA, NA, 1199, NA, NA, 1118, NA, NA, 384, 449, 41,...
Take note that now we have 2074 rows, whereas the orginal data had 61 rows, one for each species. We can use the table
-function to quickly get an overview of the number of redords for each id:
table(ag$id, useNA = "ifany")
##
## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
## 34 34 34 34 34 34 34 34 34 34 34 34 34 34 34
## 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30
## 34 34 34 34 34 34 34 34 34 34 34 34 34 34 34
## 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45
## 34 34 34 34 34 34 34 34 34 34 34 34 34 34 34
## 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60
## 34 34 34 34 34 34 34 34 34 34 34 34 34 34 34
## <NA>
## 34
so we have 34 record (trips) for each id (species). The reason that we have the same number for each species is that the data includes NA’s. Notice also that there are records that does not have an id. This is because in the original worksheet there is a column sum at the bottom of each trip. We will get rid of that in the final version of the script.
Now the trip column contains the formerly column names “AG”, “AG.1”, “AG.2”, etc. Names that represent the trip number. We need to split that column into two separate columns.
ag <-
ag %>%
separate(trip, c("category", "trip"), convert = TRUE)
## Warning: Too few values at 61 locations: 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11,
## 12, 13, 14, 15, 16, 17, 18, 19, 20, ...
The warning message you get is because the first trip is really labelled just “AG” and does hence not have any “.” that acts as the separator.
glimpse(ag)
## Observations: 2,074
## Variables: 4
## $ id <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16...
## $ category <chr> "AG", "AG", "AG", "AG", "AG", "AG", "AG", "AG", "AG",...
## $ trip <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
## $ ag <dbl> NA, NA, NA, NA, 1199, NA, NA, 1118, NA, NA, 384, 449,...
Now the category column contains only “AG” and the trip column the numerical value for the trip number. The reason we gotten the latter as numerical value is because the argument convert was set to TRUE. We notice also that for the first trip has no value, the second trip has the value one, etc. So we need fix that using the following code:
ag <-
ag %>%
select(-category) %>% # We do not need that column anymore
mutate(trip = ifelse(is.na(trip), 0, trip), # Set the first trip to value 0
trip = trip + 1) # Add one to each trip number
The 6 commands we have used can by tied into a single pipe:
ag <-
d %>%
filter(!is.na(id)) %>%
select(id, starts_with("AG")) %>%
gather(key = trip, value = ag, -id) %>%
separate(trip, c("category", "trip"), convert = TRUE) %>%
select(-category) %>%
mutate(trip = ifelse(is.na(trip), 0, trip),
trip = trip + 1)
To work on the “LG” landing categoy we only need to change the code in two places
lg <-
d %>%
filter(!is.na(id)) %>%
select(id, starts_with("LG")) %>%
gather(key = trip, value = lg, -id) %>%
separate(trip, c("category", "trip"), convert = TRUE) %>%
select(-category) %>%
mutate(trip = ifelse(is.na(trip), 0, trip),
trip = trip + 1)
We can now tie the two data frame together:
catch <-
left_join(ag, lg, by = c("trip", "id")) %>%
arrange(trip, id)
We finally want to add the name of the vessel:
catch <-
catch %>%
mutate(vessel = sheets[i])
We then simply generate a species table from the data that was originally read in an stored in object d:
species <-
d %>%
select(id:latin)
Now since the data in the different sheets is suspected to be of the same structural format (with the exception that the number of trips is variable) we should be able to repeat the procedure. And since we are going to attempt this we may as well generate our own function that includes the essence of the script we generated above within a single object R object that is a function. We give the function a very generic name (read_catch
) and add some (hopefully) useful comments/reminders to ourselves:
wb <- loadWorkbook("data-raw/logbook.xlsx")
read_catch <- function(wb, sheet.no = 1) {
sheets <- getSheets(wb)
d <- readWorksheet(wb, sheet.no,
startRow = 9) %>%
tbl_df()
ag <-
d %>%
filter(!is.na(id)) %>%
select(id, starts_with("AG")) %>%
gather(key = trip, value = ag, -id) %>%
separate(trip, c("category", "trip"), convert = TRUE) %>%
select(-category) %>%
mutate(trip = ifelse(is.na(trip), 0, trip),
trip = trip + 1,
ag = as.numeric(ag))
lg <-
d %>%
filter(!is.na(id)) %>%
select(id, starts_with("LG")) %>%
gather(key = trip, value = lg, -id) %>%
separate(trip, c("category", "trip"), convert = TRUE) %>%
select(-category) %>%
mutate(trip = ifelse(is.na(trip), 0, trip),
trip = trip + 1,
lg = as.numeric(lg))
catch <-
ag %>%
left_join(lg, by = c("trip", "id")) %>%
filter(!is.na(ag)) %>%
mutate(vessel = sheets[i])
return(catch)
}
Now to use this function and read in the second sheet:
res <- read_catch(wb, sheet.no = 2)
glimpse(res)
## Observations: 596
## Variables: 5
## $ id <dbl> 1, 3, 4, 5, 7, 8, 13, 14, 19, 23, 26, 27, 29, 31, 37, 4...
## $ trip <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1...
## $ ag <dbl> 398, 704, 502, 658, 491, 2935, 10, 190, 622, 1329, 1020...
## $ lg <dbl> 4189, 741, 502, 658, 491, 2935, 10, 190, 1073, 1398, 10...
## $ vessel <chr> "Georgieann", "Georgieann", "Georgieann", "Georgieann",...
The next step is to read in all the sheet. Here we demonstrate it by using a for-loop (TODO: Need to add documentation on the basics in the introduction).
res <- list()
for (i in 1:length(sheets)) {
res[[i]] <- read_catch(wb, sheets[i]) # read in sheet number i
}
# aggregate the data
catch <- bind_rows(res)
# take a peek
glimpse(catch)
## Observations: 8,006
## Variables: 5
## $ id <dbl> 5, 8, 11, 12, 13, 14, 23, 24, 26, 27, 29, 34, 37, 45, 4...
## $ trip <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2...
## $ ag <dbl> 1199, 1118, 384, 449, 41, 1173, 2088, 827, 2709, 407, 5...
## $ lg <dbl> 1199, 1118, 404, 449, 43, 1173, 2198, 827, 2709, 428, 5...
## $ vessel <chr> "Georgieann", "Georgieann", "Georgieann", "Georgieann",...
Now the data are in a tidy format and hence ready for some analysis. E.g. calculate the number of trips and the total catch by vessel:
catch %>%
mutate(catch = ag + lg) %>%
group_by(vessel) %>%
summarise(n = n_distinct(trip),
catch = sum(catch))
## # A tibble: 20 × 3
## vessel n catch
## <chr> <int> <dbl>
## 1 Aleander 30 516938.0
## 2 Arsene 27 821960.0
## 3 Breklyn 25 1228870.0
## 4 Dalayah 29 432840.0
## 5 Dayanne 27 586388.0
## 6 Deyaneira 31 1397440.0
## 7 Donatello 26 1657479.0
## 8 Doxie 11 293574.0
## 9 Georgieann 34 1431449.0
## 10 Grandison 37 1550092.0
## 11 Harue 31 450944.0
## 12 Jalina 13 269508.0
## 13 Kiernan 17 421460.0
## 14 Kyrollos 26 390980.0
## 15 Ladaijah 20 334840.0
## 16 Ladema 28 446140.0
## 17 Merriman 31 1519968.0
## 18 Neven 10 348628.0
## 19 Rowdie 6 64676.4
## 20 Yhair 23 811730.0
Or, e.g. the catch of each species and display the top 10 species with the highest catch:
catch %>%
mutate(catch = ag + lg) %>%
group_by(id) %>%
summarise(catch = sum(catch)) %>%
arrange(desc(catch)) %>%
slice(1:10) %>%
left_join(species, by = "id")
## # A tibble: 10 × 4
## id catch species
## <dbl> <dbl> <chr>
## 1 23 2146263.2 KANDRATIKI
## 2 45 1868574.0 WIT WITTIE /TONKIN WEAKFISH
## 3 8 1396952.0 BLACK SNAPPER (COROCORO GRUNT)
## 4 47 1018283.0 SILVERBELT
## 5 18 990339.2 GRUNT/ MILKFISH/NEERTJIE
## 6 26 806995.2 KOEPILA
## 7 14 799531.0 DAGOETIFI/BANGAMARY
## 8 27 790911.2 KROKUS/ZEEKOEBI/CROAKER
## 9 29 709479.6 LANE SNAPPER
## 10 37 535196.8 POES
## # ... with 1 more variables: latin <chr>
knitr::opts_chunk$set(eval = FALSE)
The above seems like a lot of steps to take. But once we have set up the code correctly we can apply it to more Excel sheets that have the same structure, e.g. for other years. If we save the functions in an R-script tidyr_case1.R under e.g. the R-directory we effectively really end with only needing the following code:
rm(list=ls()) # Nothing up my sleve, deleted everthing the working directory
library(tidyverse)
library(XLConnect)
source("R/tidy_case1.R") # TODO: Need to clarify this in the basic documents
wb <- loadWorkbook("data-raw/logbook.xlsx")
sheets <- getSheets(wb)
catch <- read_catch(wb, sheets)
vessel <- read_metadata(wb, sheets)