Preamble


I suspect that those of you that are not storing your data in a database may use and have myriads of Excel sheets that contain the raw data or some aggregate thereof. Even those of you that do store the bulk of the data in a database may store additional data, that “do not fit the database structure” in Excel sheets. The current document provides some ideas (read: R code) on how one can read in consistently formatted data from Excel.

In the document Importing data into R we introduced the read_excel and excel_sheets functions that reside in the readxl-package. Although the read-function is plain and simple, it has its limitation. The first rows in the Excel sheet may often contain metadata, like boat name, boat registration id, landing harbor name. Or whatever else. And this may then be followed in rows further down below with the actual individual records. Or we may have multiple tables within one sheet.

There are two possible approaches:

The former is in many cases undesirable. If only because reconstructing meticulously laid out data in Excel into a computer likable form:

So, attempting to generate an R-code, which basically leaves the current Excel structure intact may be a desirable root to take (consider that as a strength). The weakness is that if the original structure in the Excel is complicated the R-code to read in the data and tidying it may also be complicated.

Getting the example data onto your computer

For this session we also need to access some example excel workbook. So lets first download it:

download.file(url = "http://www.hafro.is/~einarhj/crfmr/data-raw/iccat_codes.xlsx",
              destfile = "data-raw/iccat_codes.xlsx",
              mode = "wb")

Here I specified in the function argument destfile “data-raw/iccat_codes.xlsx”. The slash (“/”) means that what comes before is the name of a directory (read: folder) in my current R-working directory. Here the name of that directory is “data-raw”. This directory has to exist, otherwise one get an error. E.g. if I had tried something like:

download.file(url = "http://www.hafro.is/~einarhj/crfmr/data-raw/iccat_codes.xlsx",
              destfile = "garbage/iccat_codes.xlsx",
              mode = "wb")

I would have gotten something like this in the console:

downloaded 0 bytes
Error in download.file(url = "http://www.hafro.is/~einarhj/crfmr/data-raw/iccat_codes.xlsx",  : 
  cannot download all files
In addition: Warning messages:
1: In download.file(url = "http://www.hafro.is/~einarhj/crfmr/data-raw/iccat_codes.xlsx",  :
  URL http://www.hafro.is/~einarhj/crfmr/data-raw/iccat_codes.xlsx: cannot open destfile 'garbage/iccat_codes.xlsx', reason 'No such file or directory'
2: In download.file(url = "http://www.hafro.is/~einarhj/crfmr/data-raw/iccat_codes.xlsx",  :
  downloaded length 0 != reported length 37391
> download.file(url = "http://www.hafro.is/~einarhj/crfmr/data-raw/iccat_codes.xlsx",
+               destfile = "garbage/iccat_codes.xlsx")
downloaded 0 bytes
Error in download.file(url = "http://www.hafro.is/~einarhj/crfmr/data-raw/iccat_codes.xlsx",  : 
  cannot download all files
In addition: Warning messages:
1: In download.file(url = "http://www.hafro.is/~einarhj/crfmr/data-raw/iccat_codes.xlsx",  :
  URL http://www.hafro.is/~einarhj/crfmr/data-raw/iccat_codes.xlsx: cannot open destfile 'garbage/iccat_codes.xlsx', reason 'No such file or directory'
2: In download.file(url = "http://www.hafro.is/~einarhj/crfmr/data-raw/iccat_codes.xlsx",  :
  downloaded length 0 != reported length 37391

The above is a classical R warning message, obtuse. In this case the key text is:

cannot open destfile 'garbage/iccat_codes.xlsx', reason 'No such file or directory'

So the reason for the error message is that I do not have the directory named “garbage” in my current R-working directory.

If you do not already have the data-raw directory, you can create it from within R by doing:

dir.create("data-raw")

What has happened is that a new directory has been generated on your computer hard drive (check out for yourself). And then you have to re-run the download.file command above. Again, once you have downloaded the file onto your computer you do not need to repeat the above command.

One could of course also have downloaded the iccat_codes.xlsx with the usual point-and-mouse-click method. But by using an R-script one makes the whole process fully reproducible (what is done is written in the code itself).

Reading a tidy Excel sheet into R

The XLConnect-package has functions that allow refinement with respect to what is read in from Excel. If you have not install that package using the following command:

install.packages("XLConnect")

Remember that once a package is installed on your computer you do not need to reinstall it in every session. Only if you want to get the latest update would you repeat the above command.

library(tidyverse)
library(XLConnect)

We start by creating a “connection” to the Excel workbook using the loadWorkbook-function from the XLConnect-package:

wb <- loadWorkbook(filename = "data-raw/iccat_codes.xlsx")

We can get the names of the sheets that are in the workbook by:

getSheets(wb)
## [1] "reporting_flag" "codes"

So we have 2 sheets, one named “reporting_flag”, the other “codes”.

To get a glimpse of the first sheet we could try to do:

readWorksheet(object = wb,
              sheet = "reporting_flag") %>% 
  glimpse()
## Observations: 123
## Variables: 6
## $ FlagName                   <chr> "Albania", "Algerie", "Angola", "Ba...
## $ FlagCode                   <chr> "ALB", "DZA", "AGO", "BRB", "BLZ", ...
## $ Status                     <chr> "CP", "CP", "CP", "CP", "CP", "CP",...
## $ Base.Ports.Zones..current. <chr> NA, NA, NA, NA, "ETRO", NA, NA, "Ar...
## $ FlagA3ISO                  <chr> "ALB", "DZA", "AGO", "BRB", "BLZ", ...
## $ FlagA2ISO                  <chr> "AL", "DZ", "AO", "BB", "BZ", "BR",...

Here things just “look” OK. Take note that we could also have used a numeric value for the sheet-argument (in this case 1, the sheet being the first position in the workbook).

But lets try the second sheet (named “codes”) using the same approach:

readWorksheet(object = wb,
              sheet = "codes") %>% 
  glimpse()
## Observations: 229
## Variables: 21
## $ Table..Reporting.Flags..Flags.of.Vessel...Countries..A2...A3.ISO.3166. <chr> ...
## $ Col2                                                                   <chr> ...
## $ Col3                                                                   <chr> ...
## $ Col4                                                                   <chr> ...
## $ Col5                                                                   <chr> ...
## $ Col6                                                                   <chr> ...
## $ Col7                                                                   <lgl> ...
## $ Table..Species..Tunas...sharks..standard.codes                         <chr> ...
## $ Col9                                                                   <chr> ...
## $ Col10                                                                  <chr> ...
## $ Col11                                                                  <chr> ...
## $ Col12                                                                  <chr> ...
## $ Col13                                                                  <chr> ...
## $ Col14                                                                  <chr> ...
## $ Col15                                                                  <lgl> ...
## $ Table..Gears                                                           <chr> ...
## $ Col17                                                                  <chr> ...
## $ Col18                                                                  <chr> ...
## $ Col19                                                                  <lgl> ...
## $ Table..Effort.types                                                    <chr> ...
## $ Col21                                                                  <chr> ...

If you open the Excel workbook iccat_codes.xlsx in its native environment and look at the individual sheets you may gather for yourself the reason why. In short, the “reporting_flag”-sheet is tidy while the “code”-sheet has multiple tables.

Reading in specific regions from an Excel sheet

Here we are going to need specify additional arguments within the readWorksheet-function. To see what arguments a function takes one can always do:

args(readWorksheet)
## function (object, sheet, startRow = 0, startCol = 0, endRow = 0, 
##     endCol = 0, autofitRow = TRUE, autofitCol = TRUE, region = NULL, 
##     header = TRUE, rownames = NULL, colTypes = character(0), 
##     forceConversion = FALSE, dateTimeFormat = getOption("XLConnect.dateTimeFormat"), 
##     check.names = TRUE, useCachedValues = FALSE, keep = NULL, 
##     drop = NULL, simplify = FALSE, readStrategy = "default") 
## NULL

We see that we have a lot of options that we can specify.

Notice that in the “code”-worksheet we have a table starting in row 2 going from column A to column F. This is identical to the table in the “reporting_flag”-sheet. Because there are other tables in the “code”-worksheet we would need to specify the starting row and the end column in the readWorksheet-function:

country <- 
  readWorksheet(wb, "codes",
                startRow = 2,
                endCol = 6)
glimpse(country)
## Observations: 123
## Variables: 6
## $ FlagName                   <chr> "Albania", "Algerie", "Angola", "Ba...
## $ FlagCode                   <chr> "ALB", "DZA", "AGO", "BRB", "BLZ", ...
## $ Status                     <chr> "CP", "CP", "CP", "CP", "CP", "CP",...
## $ Base.Ports.Zones..current. <chr> NA, NA, NA, NA, "ETRO", NA, NA, "Ar...
## $ FlagA3ISO                  <chr> "ALB", "DZA", "AGO", "BRB", "BLZ", ...
## $ FlagA2ISO                  <chr> "AL", "DZ", "AO", "BB", "BZ", "BR",...

For the species code table, we would need to specify both the starting column (H) and the end column (N):

species <- 
  readWorksheet(wb, "codes",
                startRow = 2,
                startCol = 8,
                endCol = 14)
glimpse(species)
## Observations: 228
## Variables: 7
## $ SpeciesCode <chr> "BFT", "YFT", "ALB", "BET", "SKJ", "SAI", "BUM", "...
## $ ScieName    <chr> "Thunnus thynnus", "Thunnus albacares", "Thunnus a...
## $ CoNameEN    <chr> "Atlantic bluefin tuna", "Yellowfin tuna", "Albaco...
## $ CoNameFR    <chr> "Thon rouge de l'Atlantique", "Albacore", "Germon"...
## $ CoNameES    <chr> "Atún rojo del Atlántico", "Rabil", "Atún blanco",...
## $ IccSpcGrp   <chr> "1-Tuna (major sp.)", "1-Tuna (major sp.)", "1-Tun...
## $ TaxonType   <chr> "1-Species", "1-Species", "1-Species", "1-Species"...

The next table is the gear table. Because there is a table below (Quadrants) one would also need to specify the end row:

gear <-
  readWorksheet(wb, "codes",
                startRow = 2,
                endRow = 53,
                startCol = 18,
                endCol = 23)
glimpse(gear)
## Observations: 51
## Variables: 4
## $ GearGroup      <chr> "LL", "LL", "LL", "LL", "LL", "LL", "LL", "LL",...
## $ Col2           <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,...
## $ EffortTypeCode <chr> "NO DATA", "D.FISH", "D.FISH.G", "D.AT SEA", "N...
## $ EffortType     <chr> "Data not available", "Number of days fishing",...

Bottom line is that one can specify which part of a worksheet is read in by using the “startRow”, “endRow”, “startCol” and “endCol” arguments. There are also plenty of other arguments that one can specify in the readWorksheet-function. For further information of what they do check out the help page by typing “?readWorksheet” in the console.

Reading in not so nicely structured worksheet

In this example we have a workbook that contains a number of sheets. The data is a full year of landing statistics by boats, trips and species. Each boat is stored in a separate sheet. The Excel file is called logbook.xlsx. It can be downloaded by either clicking on the link above or what is more reproducible by using the download.file-function:

download.file("http://www.hafro.is/~einarhj/crfmr/data-raw/logbook.xlsx",
              "data-raw/logbook.xlsx",
              mode = "wb")

If one opens the file in Excel one observes that each sheet is of the same format:

A screenshot of one sheet is as follows:

Lets load the wb into R and get some information on the sheet (boat) names:

wb <- loadWorkbook("data-raw/logbook.xlsx")
sheets <- getSheets(wb)
sheets
##  [1] "Georgieann" "Donatello"  "Merriman"   "Grandison"  "Deyaneira" 
##  [6] "Breklyn"    "Jalina"     "Dayanne"    "Arsene"     "Ladaijah"  
## [11] "Yhair"      "Kiernan"    "Dalayah"    "Harue"      "Ladema"    
## [16] "Kyrollos"   "Aleander"   "Klowie"     "Doxie"      "Neven"     
## [21] "Rowdie"

So here we have 21 number of boats, the first being named Georgieann, the second Donatello and so on.

In the above we see that the name of the first sheet (boat) is “Georgieann”. We can read in the data using the read_excel function. We skip the first 8 rows since the data column names actually start in row 9. Hence:

d <- readWorksheet(wb, "Georgieann",
                   startRow = 9) %>% 
  tbl_df()

or alternatively:

d <- readWorksheet(wb, 1,
                   startRow = 9) %>% 
  tbl_df()

or alternatively:

d <- readWorksheet(wb, sheets[1],
                   startRow = 9) %>% 
  tbl_df()

Lest see what we have got:

glimpse(d)
## Observations: 61
## Variables: 72
## $ id      <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16,...
## $ species <chr> "AMERICAN HARVEST/SILVER POMFRET", "B-LINER (VERMILLIO...
## $ latin   <chr> "PEPRILUS PARU", "RHOMBOPLITES AURORUBENS", "CYNOSCION...
## $ AG      <dbl> NA, NA, NA, NA, 1199, NA, NA, 1118, NA, NA, 384, 449, ...
## $ LG      <dbl> NA, NA, NA, NA, 1199, NA, NA, 1118, NA, NA, 404, 449, ...
## $ AG.1    <dbl> 490, NA, 663, 542, 1265, NA, 148, 19, NA, NA, NA, NA, ...
## $ LG.1    <dbl> 516, NA, 698, 542, 1265, NA, 148, 19, NA, NA, NA, NA, ...
## $ AG.2    <dbl> 474, NA, 214, 740, 1097, NA, 781, 1921, 162, NA, 112, ...
## $ LG.2    <dbl> 499, NA, 226, 740, 1097, NA, 781, 1921, 162, NA, 118, ...
## $ AG.3    <dbl> 260, NA, 770, 434, 1168, NA, 347, 2093, 87, NA, NA, NA...
## $ LG.3    <dbl> 274, NA, 811, 434, 1168, NA, 347, 2093, 87, NA, NA, NA...
## $ AG.4    <dbl> 321, NA, 415, 292, 1087, NA, 82, 82, NA, NA, NA, 133, ...
## $ LG.4    <dbl> 338, NA, 437, 292, 1087, NA, 82, 82, NA, NA, NA, 133, ...
## $ AG.5    <dbl> 255, NA, 140, 278, 417, NA, 71, 1359, NA, NA, NA, NA, ...
## $ LG.5    <dbl> 269, NA, 147, 278, 417, NA, 71, 1359, NA, NA, NA, NA, ...
## $ AG.6    <dbl> 561, NA, 316, 350, 633, 36, 316, 3878, 33, NA, 224, NA...
## $ LG.6    <dbl> 591, NA, 333, 350, 633, 36, 316, 3878, 33, NA, 236, NA...
## $ AG.7    <dbl> 224, NA, 227, 2211, 551, NA, 633, 3391, 71, NA, 108, N...
## $ LG.7    <dbl> 236, NA, 238, 2211, 551, NA, 633, 3391, 71, NA, 114, N...
## $ AG.8    <dbl> 561, NA, 214, 1097, 418, NA, 852, 9896, 70, NA, NA, NA...
## $ LG.8    <dbl> 591, NA, 226, 1097, 418, NA, 852, 9896, 70, NA, NA, NA...
## $ AG.9    <dbl> 469, NA, 358, 306, 770, NA, 567, 2640, NA, NA, 867, NA...
## $ LG.9    <dbl> 494, NA, 377, 306, 770, NA, 567, 2640, NA, NA, 913, NA...
## $ AG.10   <dbl> 347, NA, 276, 1236, 816, NA, 289, 6395, 28, NA, 1428, ...
## $ LG.10   <dbl> 365, NA, 290, 1236, 816, NA, 289, 6395, 28, NA, 1482, ...
## $ AG.11   <dbl> 357, NA, 439, 995, 643, NA, 469, 5206, 18, NA, NA, NA,...
## $ LG.11   <dbl> 376, NA, 439, 995, 643, NA, 469, 5206, 18, NA, NA, NA,...
## $ AG.12   <dbl> 337, NA, NA, 359, 689, NA, 439, 3646, 10, NA, NA, NA, ...
## $ LG.12   <dbl> 354, NA, NA, 359, 689, NA, 439, 3646, 10, NA, NA, NA, ...
## $ AG.13   <dbl> 495, NA, 393, 556, 719, NA, 1245, 4617, 8, NA, NA, NA,...
## $ LG.13   <dbl> 521, NA, 414, 556, 719, NA, 1245, 4617, 8, NA, NA, NA,...
## $ AG.14   <dbl> 418, NA, 339, 291, 852, NA, 1148, 1618, NA, NA, 31, NA...
## $ LG.14   <dbl> 440, NA, 357, 291, 852, NA, 1148, 1618, NA, NA, 32, NA...
## $ AG.15   <dbl> 341, NA, 265, 332, 648, NA, 286, 722, NA, NA, NA, NA, ...
## $ LG.15   <dbl> 359, NA, 279, 332, 648, NA, 286, 722, NA, NA, NA, NA, ...
## $ AG.16   <dbl> 689, NA, 296, 332, 781, NA, 481, 961, NA, NA, 1276, NA...
## $ LG.16   <dbl> 725, NA, 311, 332, 781, NA, 481, 961, NA, NA, 1343, NA...
## $ AG.17   <dbl> 490, NA, 327, 398, 1020, NA, 578, 1363, NA, NA, 163, N...
## $ LG.17   <dbl> 516, NA, 344, 398, 1020, NA, 578, 1363, NA, NA, 172, N...
## $ AG.18   <dbl> 689, NA, 270, 366, 837, NA, 512, 3679, 15, NA, NA, NA,...
## $ LG.18   <dbl> 725, NA, 285, 366, 837, NA, 512, 3679, 15, NA, NA, NA,...
## $ AG.19   <dbl> 337, NA, 270, 1321, 973, NA, 669, 7731, 3, NA, NA, NA,...
## $ LG.19   <dbl> 354, NA, 285, 1321, 973, NA, 669, 7731, 3, NA, NA, NA,...
## $ AG.20   <dbl> 577, NA, 272, 2121, 878, NA, 459, 5906, NA, NA, NA, NA...
## $ LG.20   <dbl> 607, NA, 287, 2121, 878, NA, 459, 5906, NA, NA, NA, NA...
## $ AG.21   <dbl> 495, NA, 255, 720, 842, NA, 671, 1784, NA, NA, NA, NA,...
## $ LG.21   <dbl> 521, NA, 269, 720, 842, NA, 671, 1784, NA, NA, NA, NA,...
## $ AG.22   <dbl> 270, NA, 281, 2277, 781, NA, 933, 8671, NA, NA, NA, 56...
## $ LG.22   <dbl> 285, NA, 295, 2277, 781, NA, 933, 8671, NA, NA, NA, 59...
## $ AG.23   <dbl> 316, NA, 332, 597, 821, NA, 760, 9270, NA, NA, NA, NA,...
## $ LG.23   <dbl> 333, NA, 349, 597, 821, NA, 760, 9270, NA, NA, NA, NA,...
## $ AG.24   <dbl> 388, NA, 370, 2190, 709, NA, 464, 5259, NA, NA, NA, NA...
## $ LG.24   <dbl> 408, NA, 390, 2190, 709, NA, 464, 5259, NA, NA, NA, NA...
## $ AG.25   <dbl> 286, NA, 270, 1171, 816, NA, 573, 2104, NA, NA, NA, NA...
## $ LG.25   <dbl> 301, NA, 285, 1171, 816, NA, 573, 2104, NA, NA, NA, NA...
## $ AG.26   <dbl> 388, NA, 311, 296, 816, NA, 677, 1042, NA, NA, NA, NA,...
## $ LG.26   <dbl> 408, NA, 328, 296, 816, NA, 677, 1042, NA, NA, NA, NA,...
## $ AG.27   <dbl> 296, NA, 270, 1356, 730, NA, NA, 4278, 15, NA, NA, NA,...
## $ LG.27   <dbl> 311, NA, 285, 1356, 730, NA, NA, 4278, 15, NA, NA, NA,...
## $ AG.28   <dbl> 465, NA, 230, 279, 699, NA, 449, 306, NA, NA, NA, NA, ...
## $ LG.28   <dbl> 490, NA, 242, 497, 699, NA, 449, 306, NA, NA, NA, NA, ...
## $ AG.29   <dbl> 463, NA, 2633, 332, 878, NA, 673, 1283, NA, NA, NA, NA...
## $ LG.29   <dbl> 488, NA, 2771, 332, 878, NA, 673, 1283, NA, NA, NA, NA...
## $ AG.30   <dbl> 473, NA, 771, 482, 1010, NA, 487, 517, 8, NA, NA, NA, ...
## $ LG.30   <dbl> 498, NA, 812, 482, 1010, NA, 487, 517, 8, NA, NA, NA, ...
## $ AG.31   <dbl> NA, NA, 321, 520, 668, NA, 1107, 9337, 55, NA, NA, 35,...
## $ LG.31   <dbl> NA, NA, 338, 520, 668, NA, 1107, 9337, 55, NA, NA, 37,...
## $ AG.32   <dbl> 454, NA, NA, 440, 668, NA, 1029, 6289, 22, NA, NA, NA,...
## $ LG.32   <dbl> 478, NA, NA, 440, 668, NA, 1029, 6289, 22, NA, NA, NA,...
## $ AG.33   <dbl> 398, NA, 704, 502, 658, NA, 491, 2935, NA, NA, NA, NA,...
## $ LG.33   <dbl> 4189, NA, 741, 502, 658, NA, 491, 2935, NA, NA, NA, NA...
## $ Total   <dbl> 31244, NA, 27101, 51656, 55114, 72, 37372, 242632, 121...

So this sheet has 61 rows of data (each row represents a species) and 72 columns. We note that the first three columns are as expected some kind of an id, species name and Latin name. Following that are 34 pairs of column names “AG” and “LG” and then the last column is labelled Total (presumably row total). We can get an overview of the names by:

names(d)
##  [1] "id"      "species" "latin"   "AG"      "LG"      "AG.1"    "LG.1"   
##  [8] "AG.2"    "LG.2"    "AG.3"    "LG.3"    "AG.4"    "LG.4"    "AG.5"   
## [15] "LG.5"    "AG.6"    "LG.6"    "AG.7"    "LG.7"    "AG.8"    "LG.8"   
## [22] "AG.9"    "LG.9"    "AG.10"   "LG.10"   "AG.11"   "LG.11"   "AG.12"  
## [29] "LG.12"   "AG.13"   "LG.13"   "AG.14"   "LG.14"   "AG.15"   "LG.15"  
## [36] "AG.16"   "LG.16"   "AG.17"   "LG.17"   "AG.18"   "LG.18"   "AG.19"  
## [43] "LG.19"   "AG.20"   "LG.20"   "AG.21"   "LG.21"   "AG.22"   "LG.22"  
## [50] "AG.23"   "LG.23"   "AG.24"   "LG.24"   "AG.25"   "LG.25"   "AG.26"  
## [57] "LG.26"   "AG.27"   "LG.27"   "AG.28"   "LG.28"   "AG.29"   "LG.29"  
## [64] "AG.30"   "LG.30"   "AG.31"   "LG.31"   "AG.32"   "LG.32"   "AG.33"  
## [71] "LG.33"   "Total"

Notice that for the first trip the landing types are labeled “AG” and “LG”. For the second trip we have “AG.1” and “LG.1” and so on, ending with the labels “AG.33” and “LG.33” for trip number 34. The reason is that in R one has to have unique column names. Thus if there are multiple column with the same name in a sheet, the readWorksheet-function automatically put a sequential number to the the names.

It is nice to observe that all these columns were interpreted as numeric (coded as “\<dbl\>” above). This means than no nasty characters other than numbers or even just some “spaces” were in this sheet.

Now although we gotten the data into R, the format of the table is such that any further analysis is very difficult. We deal with that in Case examples: Logbook data.