Cleaning data

Preamble

The difference between tidying vs cleaning:

  • Tidying refers to restructuring the dataset such that each column is a variable and each row an observation.
  • Cleaning refers to changing values in a variable, most often because of data-entry mistakes.

In this session we are going to read in some catch sampling data from a country and a specific period. The data structure is tidy but there are various issues with the data entry. Each record (row) consist of the following observations:

  • vessel id
  • landing site id
  • date of sampling
  • gear id
  • species id
  • catch weight (these have been scrambled)

The data collection at the landings site were initially recorded on a paper and then entered in a spreadsheet in the office. This is a very common procedure. Also what is most common is that there were no restrictions made with respect to what could be entered in the spreadsheet.

Import

library(tidyverse)
library(janitor)
d <- 
  read_csv("https://heima.hafro.is/~einarhj/data/raw-data.csv",
           # this argument is to ensure that the function read_csv is not guessing the variable
           #  type from the first 100 thousand records
           guess_max = Inf)
New names:
Rows: 14966 Columns: 16
── Column specification
──────────────────────────────────────────────────────── Delimiter: "," chr
(7): Vessel #, LANDING SITE, DATE MM/DD/YYYY, GEAR, SPECIES, SOAK TIM... dbl
(2): WEIGHT...6, TRAPS PULL lgl (7): WEIGHT...10, Fishing 1, Fishing2,
Fishing3, PAB, Fdays, Raised
ℹ Use `spec()` to retrieve the full column specification for this data. ℹ
Specify the column types or set `show_col_types = FALSE` to quiet this message.
• `WEIGHT` -> `WEIGHT...6`
• `WEIGHT` -> `WEIGHT...10`

Initially when we read in any data it is important to read the messages that we obtain. When using read_csv we get some informations about what type each variable is allocated to (like “chr”, “dbl” and “lgl”). Also here we get the information that the column name “WEIGHT” occurs twice. And because in R we can not have the same variable name twice in a dataframe, read_csv adds “…X” to the original, the X being a number that denotes in what column number in the original data each of the duplicate names occurs (here column number 6 and 10).

Lets get a peak at the data:

d |> glimpse()
Rows: 14,966
Columns: 16
$ `Vessel #`           <chr> "586BE", "586BE", "428SP", "428SP", "586BE", "586…
$ `LANDING SITE`       <chr> "BAEP", "BAEP", "SAPB", "SAPB", "BAEP", "BAEP", "…
$ `DATE    MM/DD/YYYY` <chr> "01/01/2023", "01/01/2023", "01/01/2023", "01/01/…
$ GEAR                 <chr> "SGUN", "SGUN", "HLIN", "HLIN", "SGUN", "SGUN", "…
$ SPECIES              <chr> "CLASSD", "PALIAR", "SERRGU", "SERRFU", "CLASSD",…
$ WEIGHT...6           <dbl> 5, 240, 155, 20, 4, 85, 10, 10, 14, 22, 12, 12, 5…
$ `TRAPS PULL`         <dbl> NA, NA, NA, NA, NA, NA, 200, 200, 200, NA, NA, NA…
$ `SOAK TIME  (HRS)`   <chr> NA, NA, NA, NA, NA, NA, "14", "14", "14", NA, NA,…
$ `AREA FISHED`        <chr> "E7", "E7", "B3", "B3", "E7", "E7", "D3,E3,D4,E4"…
$ WEIGHT...10          <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
$ `Fishing 1`          <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
$ Fishing2             <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
$ Fishing3             <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
$ PAB                  <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
$ Fdays                <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
$ Raised               <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…

Variable names

I usually look at the variable names first, and here I just do not like them. In general:

  • Long names makes coding downstream problematic
  • Spaces in names are even worse, because then we need to put variable names in quotation marks (like SOAK TIME (HRS)).
  • Remembering if a variable name has a captial creates extra strains and also requries extra keyboard typing.

All of these qualms are solved using the function “clean_names” that resides in the {janitor}-package. Thus we can do:

d |> 
  clean_names() |> 
  glimpse()
Rows: 14,966
Columns: 16
$ vessel_number   <chr> "586BE", "586BE", "428SP", "428SP", "586BE", "586BE", …
$ landing_site    <chr> "BAEP", "BAEP", "SAPB", "SAPB", "BAEP", "BAEP", "BAIM1…
$ date_mm_dd_yyyy <chr> "01/01/2023", "01/01/2023", "01/01/2023", "01/01/2023"…
$ gear            <chr> "SGUN", "SGUN", "HLIN", "HLIN", "SGUN", "SGUN", "TRAP"…
$ species         <chr> "CLASSD", "PALIAR", "SERRGU", "SERRFU", "CLASSD", "PAL…
$ weight_6        <dbl> 5, 240, 155, 20, 4, 85, 10, 10, 14, 22, 12, 12, 5, 60,…
$ traps_pull      <dbl> NA, NA, NA, NA, NA, NA, 200, 200, 200, NA, NA, NA, NA,…
$ soak_time_hrs   <chr> NA, NA, NA, NA, NA, NA, "14", "14", "14", NA, NA, NA, …
$ area_fished     <chr> "E7", "E7", "B3", "B3", "E7", "E7", "D3,E3,D4,E4", "D3…
$ weight_10       <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
$ fishing_1       <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
$ fishing2        <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
$ fishing3        <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
$ pab             <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
$ fdays           <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
$ raised          <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…

This is better. Now we see that the variables from weight_10 to raised are all of type logical (“lgl”) and we see at least in the top rows just NA’s as values. There is a way to remove variables where all records are empty using the “remove_empty” function, again from the {janitor}:

d |> 
  clean_names() |> 
  remove_empty() |> 
  glimpse()
Rows: 14,966
Columns: 9
$ vessel_number   <chr> "586BE", "586BE", "428SP", "428SP", "586BE", "586BE", …
$ landing_site    <chr> "BAEP", "BAEP", "SAPB", "SAPB", "BAEP", "BAEP", "BAIM1…
$ date_mm_dd_yyyy <chr> "01/01/2023", "01/01/2023", "01/01/2023", "01/01/2023"…
$ gear            <chr> "SGUN", "SGUN", "HLIN", "HLIN", "SGUN", "SGUN", "TRAP"…
$ species         <chr> "CLASSD", "PALIAR", "SERRGU", "SERRFU", "CLASSD", "PAL…
$ weight_6        <dbl> 5, 240, 155, 20, 4, 85, 10, 10, 14, 22, 12, 12, 5, 60,…
$ traps_pull      <dbl> NA, NA, NA, NA, NA, NA, 200, 200, 200, NA, NA, NA, NA,…
$ soak_time_hrs   <chr> NA, NA, NA, NA, NA, NA, "14", "14", "14", NA, NA, NA, …
$ area_fished     <chr> "E7", "E7", "B3", "B3", "E7", "E7", "D3,E3,D4,E4", "D3…

This actually also removes empty records, i.e. where all the values in a measurment are NA’s.

Taken together the initial code flow could be something like this, here I also include renaming even the cleaned names, making them generally shorter to save typing downstream:

d <- 
  read_csv("https://heima.hafro.is/~einarhj/data/raw-data.csv", guess_max = Inf) |> 
  clean_names() |> 
  remove_empty(which = "cols") |> 
  rename(vid = vessel_number,
         lid = landing_site,
         date = date_mm_dd_yyyy,
         gid = gear,
         sid = species,
         wt = weight_6,
         n_traps = traps_pull,
         effort = soak_time_hrs,
         area = area_fished) |>
  # add a variable that constitue the row number
  mutate(.rid = 1:n())

Cleaning values

We currently have:

d |> glimpse()
Rows: 14,966
Columns: 10
$ vid     <chr> "586BE", "586BE", "428SP", "428SP", "586BE", "586BE", "582BW",…
$ lid     <chr> "BAEP", "BAEP", "SAPB", "SAPB", "BAEP", "BAEP", "BAIM1", "BAIM…
$ date    <chr> "01/01/2023", "01/01/2023", "01/01/2023", "01/01/2023", "01/02…
$ gid     <chr> "SGUN", "SGUN", "HLIN", "HLIN", "SGUN", "SGUN", "TRAP", "TRAP"…
$ sid     <chr> "CLASSD", "PALIAR", "SERRGU", "SERRFU", "CLASSD", "PALIAR", "S…
$ wt      <dbl> 5, 240, 155, 20, 4, 85, 10, 10, 14, 22, 12, 12, 5, 60, 2, 3, 8…
$ n_traps <dbl> NA, NA, NA, NA, NA, NA, 200, 200, 200, NA, NA, NA, NA, 200, 15…
$ effort  <chr> NA, NA, NA, NA, NA, NA, "14", "14", "14", NA, NA, NA, NA, "14"…
$ area    <chr> "E7", "E7", "B3", "B3", "E7", "E7", "D3,E3,D4,E4", "D3,E3,D4,E…
$ .rid    <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18,…

Now it is time to check the variable type. We see that vid, lid, gid, sid are all characters, which in this specific case we expect. Variable wt is numeric (“dbl”) but we take note that variable “effort” is character but we kind of expected a numerical. Then we have the date variable that we expected to be in R date-format.

Cleaning the effort values (soak time)

Let’s first check the effort variable. One way to check why a variable is not as expected is to try to make it numeric and then check cases were the original value was not NA but the converted values is NA. We do this interactively first, creating a new variable that we force to be numeric:

d |> 
  mutate(tmp = as.numeric(effort)) |> 
  filter(!is.na(effort), is.na(tmp))
# A tibble: 12 × 11
   vid   lid   date    gid   sid           wt n_traps effort area   .rid   tmp
   <chr> <chr> <chr>   <chr> <chr>      <dbl>   <dbl> <chr>  <chr> <int> <dbl>
 1 590BE BAEP2 6/9/23  SGUN  CLASSD       160      NA A4     <NA>   6675    NA
 2 590BE BAEP2 6/9/23  SGUN  PALIAR         5      NA A4     <NA>   6676    NA
 3 555BE BAEP2 6/9/23  SGUN  CLASSD        43      NA A5     <NA>   6677    NA
 4 555BE BAEP2 6/9/23  SGUN  SERRGU       140      NA A5     <NA>   6678    NA
 5 555BE BAEP2 6/9/23  SGUN  PALIAR         5      NA A5     <NA>   6679    NA
 6 485BE BAEP2 6/12/23 SCUBA STRO          30      NA A4     <NA>   6680    NA
 7 244BE BAEP2 6/12/23 SCUBA STRO           2      NA D7     <NA>   6681    NA
 8 629BE BAEP  4/25/23 SGUN  LUTJCA       240      NA F7     <NA>   6682    NA
 9 561BE BAEP  4/25/23 HLIN  CLASSD        10      NA D5     <NA>   6683    NA
10 582BW BAIM  9/4/23  TRAP  LUTJCH        30      NA 72 hrs E6    11717    NA
11 582BW BAIM  9/4/23  TRAP  SERRGU        59      NA 72 hrs E6    11718    NA
12 582BW BAIM  9/4/23  TRAP  SHELL FISH     8      NA 72hrs  E6    11720    NA

So we have 12 observations were we could not make effort numeric, the reason being that for those records we have characters (letters) in the data. In the first 9 records we observe that we have a letter and a number. If you were familiar with the data to begin with you would recognize these as being the fishing area code. For those records we also observe that values for area is missing. In the last three records the data-entry person added ” hrs”, ” hrs” and “hrs” in addition to the value (here 72 in all cases). Let’s try to remove the text “hrs” and then convert the effort to numeric:

d |> 
  mutate(effort = str_remove(effort, "hrs"),
         # get rid of spaces at the beginning or the end of a character vector
         effort = str_squish(effort)) |> 
  mutate(tmp = as.numeric(effort)) |> 
  filter(!is.na(effort), is.na(tmp))
# A tibble: 9 × 11
  vid   lid   date    gid   sid       wt n_traps effort area   .rid   tmp
  <chr> <chr> <chr>   <chr> <chr>  <dbl>   <dbl> <chr>  <chr> <int> <dbl>
1 590BE BAEP2 6/9/23  SGUN  CLASSD   160      NA A4     <NA>   6675    NA
2 590BE BAEP2 6/9/23  SGUN  PALIAR     5      NA A4     <NA>   6676    NA
3 555BE BAEP2 6/9/23  SGUN  CLASSD    43      NA A5     <NA>   6677    NA
4 555BE BAEP2 6/9/23  SGUN  SERRGU   140      NA A5     <NA>   6678    NA
5 555BE BAEP2 6/9/23  SGUN  PALIAR     5      NA A5     <NA>   6679    NA
6 485BE BAEP2 6/12/23 SCUBA STRO      30      NA A4     <NA>   6680    NA
7 244BE BAEP2 6/12/23 SCUBA STRO       2      NA D7     <NA>   6681    NA
8 629BE BAEP  4/25/23 SGUN  LUTJCA   240      NA F7     <NA>   6682    NA
9 561BE BAEP  4/25/23 HLIN  CLASSD    10      NA D5     <NA>   6683    NA

Now the above code likely worked as expected because we only now have records where the effort is actually the area code. We could create and algorithm were we move these values to the right variable. One way out of many could be:

d <- 
  d |> 
  mutate(effort = str_remove(effort, "hrs"),
         # get rid of spaces at the beginning or the end of a character vector
         effort = str_squish(effort)) |> 
  mutate(tmp = as.numeric(effort),
         area = case_when(!is.na(effort) & is.na(tmp) & is.na(area) ~ effort,
                          .default = area)) |> 
  # lets remove the original effort and then rename the tmp as effort
  select(-effort) |> 
  rename(effort = tmp)
d |> glimpse()                          
Rows: 14,966
Columns: 10
$ vid     <chr> "586BE", "586BE", "428SP", "428SP", "586BE", "586BE", "582BW",…
$ lid     <chr> "BAEP", "BAEP", "SAPB", "SAPB", "BAEP", "BAEP", "BAIM1", "BAIM…
$ date    <chr> "01/01/2023", "01/01/2023", "01/01/2023", "01/01/2023", "01/02…
$ gid     <chr> "SGUN", "SGUN", "HLIN", "HLIN", "SGUN", "SGUN", "TRAP", "TRAP"…
$ sid     <chr> "CLASSD", "PALIAR", "SERRGU", "SERRFU", "CLASSD", "PALIAR", "S…
$ wt      <dbl> 5, 240, 155, 20, 4, 85, 10, 10, 14, 22, 12, 12, 5, 60, 2, 3, 8…
$ n_traps <dbl> NA, NA, NA, NA, NA, NA, 200, 200, 200, NA, NA, NA, NA, 200, 15…
$ area    <chr> "E7", "E7", "B3", "B3", "E7", "E7", "D3,E3,D4,E4", "D3,E3,D4,E…
$ .rid    <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18,…
$ effort  <dbl> NA, NA, NA, NA, NA, NA, 14, 14, 14, NA, NA, NA, NA, 14, 13, 13…

So now we have (hopefully) gotten the effort records corrected. We may actually want to check what gear actually contain the effort (these should be only gears that require soak time):

d |> 
  filter(!is.na(effort)) |> 
  count(gid)
# A tibble: 10 × 2
   gid       n
   <chr> <int>
 1 BAIM      1
 2 HLIN     16
 3 LLIN      5
 4 SCUBA     3
 5 SGUN      1
 6 TRAP   2468
 7 TRAPS    64
 8 TROL      4
 9 Traps     3
10 traps     4

So for most of the records we do see gear coding that resembles something that requires soak time, but there are also some gears were we would not expect a value. Now the error here could be either because the gear was wrongly reported or that a soak time was accidentially entered. The only way to check that is by checking the original data.

Cleaning the gear code (incomplete)

Now there is another thing we spotted when checking what gear have reported soak time and that was different spelling of gear codes. One way to try to correct for such spelling mistakes is to create an overview or lookup table to see the different forms of entries and then create a code that take case of corrections. For gear code we have the following different spellings:

d |> 
  count(gid) |> 
  knitr::kable()
gid n
BAIM 1
BSNE 218
DIVE 1
FREE STYLE 9
FREESTYLE 3
Free Dive 1
HLIN 1323
HLNE 1
HLine 1
Hline 6
LLIN 35
LOOP 40
Line 2
MACHINE 2
NET 7
Net 3
POT 2
ROD 1
S/GUN 6
SCUBA 1411
SEINE 2
SGUN 8070
SSGUN 1
Spear Gun 1
TRAP 3667
TRAPS 72
TROL 65
TROLL 8
Traps 3
traps 4

I tend to use the case_when function in a iterative procedure, here the final code for some initial correction the gear code entry:

d <- 
  d |> 
  #      All values to upper capital
  mutate(gid = str_to_upper(gid),
         # get rid of extra spaces
         gid = str_trim(gid),
         gid = case_when(gid %in% c("FREE DIVE", "FREE STYLE", "FREESTYLE") ~ "DIVE",
                         gid %in% c("HLINE", "HLNE") ~ "HLIN",
                         gid %in% c("POT", "TRAPS") ~ "TRAP",
                         gid %in% c("SPEAR GUN", "SSGUN", "S/GUN") ~ "SGUN",
                         gid %in% c("TROLL") ~ "TROL",
                         .default = gid))
d |> count(gid) |> knitr::kable()
gid n
BAIM 1
BSNE 218
DIVE 14
HLIN 1331
LINE 2
LLIN 35
LOOP 40
MACHINE 2
NET 10
ROD 1
SCUBA 1411
SEINE 2
SGUN 8078
TRAP 3748
TROL 73

For gear I now have at least corrected some of the obvious spelling mistakes. Some gear codes are most likely still not correct but the only way to ammend those is to have e.g. knowledge of what are the “accepted” gear codes for the specific data in question. And one can expect that in some cases one needs to check the original data records (the paper trail).

Cleaning the landing site code (incomplete)

Similar as for gear we could try to make some corrections for the landings sites. The orginal values for those are:

d |> 
  count(lid) |> 
  knitr::kable()
lid n
BAEP 726
BAEP1 1373
BAEP10 32
BAEP2 1657
BAEP3 5176
BAEP4 3
BAEP5 3
BAEP6 92
BAFC 68
BAFC10 71
BAFC2 57
BAFC3 132
BAFC4 122
BAFC7 91
BAIM 661
BAIM1 31
BAIM10 175
BAIM2 3
BAIM3 60
BAIM4 4
BAIM5 4
BAIM6 5
BEAP 48
CARAMBOLA 3
CHTN 582
CHTN2 13
CNGD 1
DIBA 19
DIBA1 13
DIBA10 135
DIBA2 6
DIBA4 64
DIBA7 90
FBAY 1
FBAY7 1
JEUP 649
JEUP11 3
JEUP12 6
JNBY 92
LGHL 149
NEGU 9
NEGU2 162
NEGU4 176
NEW GUINEA 2
NEWCASTLE 14
NWCL 288
NWCL10 83
NWCL5 10
OLR01 1
OLRO 107
OLRO1 37
OLRO10 53
OLRO3 12
OLRO4 9
ORLO1 15
ORLO10 4
ORLO4 21
PALP 378
PALP2 10
PALP4 147
PALP6 109
PALP7 84
SAPB 660
SAPB10 3
SAPB3 74
SAPB4 77

Now before proceeding it would be helpful to have some local knowledge. One thing I notice though is that landings code starts in most cases with four letters that in many cases are followed by a number. There are however a couple of cases where the landings site is longer than a 4 letter code. I will specifically assume that “NEW GUINEA” should be “NEGU” and “NEWCASTLE” to be “NWCL”. There are also 3 “CARAMBOLA” entries that I can not make an educated guess what the code would be, so I will just convert that to “CARA”. I will start on correcting those and then proceed with the splitting

d <- 
  d |> 
  mutate(lid = case_when(lid == "NEW GUINEA" ~ "NEGU",
                         lid == "NEWCASTLE" ~ "NWCL",
                         lid == "CARAMBOLA" ~ "CARA",
                         .default = lid)) |> 
  # argument sep = 4, splits the data after the 4th character
  # argument remove = FALSE retains the oginal column
  separate(col = lid, into = c("lid", "lid2"), sep = 4, convert = TRUE) |> 
  mutate(lid = str_replace(lid, "0", "O"),
         # spotted this also late in the game, made an enducated quess which
         #  version was right
         lid = if_else(lid == "ORLO", "OLRO", lid),
         lid = if_else(lid == "BEAP", "BAEP", lid))
d |> count(lid) |> knitr::kable()
lid n
BAEP 9110
BAFC 541
BAIM 943
CARA 3
CHTN 595
CNGD 1
DIBA 327
FBAY 2
JEUP 658
JNBY 92
LGHL 149
NEGU 349
NWCL 395
OLRO 259
PALP 728
SAPB 814
d |> count(lid2)
# A tibble: 11 × 2
    lid2     n
   <int> <int>
 1     1  1470
 2     2  1908
 3     3  5454
 4     4   623
 5     5    17
 6     6   206
 7     7   266
 8    10   556
 9    11     3
10    12     6
11    NA  4457

Converting to dates

… text and code pending

d <- 
  d |> 
  rename(date_org = date) |> 
  mutate(tmp1 = mdy(date_org),
         # some dates are in this format
         tmp2 = dmy(date_org),
         date = case_when(!is.na(tmp1) ~ tmp1,
                          !is.na(tmp2) ~ tmp2,
                          .default = NA)) |> 
  select(-c(tmp1, tmp2)) 
# records to check:
d |> 
  filter(is.na(date)) |> 
  glimpse()
Rows: 5
Columns: 12
$ vid      <chr> "582BW", "496DB", "496DB", "601 BW", "586BE"
$ lid      <chr> "BAIM", "DIBA", "DIBA", "BAIM", "BAEP"
$ lid2     <int> 6, 10, 10, NA, 2
$ date_org <chr> "O4/15/2023", "4/31/2023", "6/31/2023", "IO/31/23", "11/08320…
$ gid      <chr> "TROL", "SGUN", "SGUN", "DIVE", "SCUBA"
$ sid      <chr> "Mahi Mahi", "CLASSD", "CLASSD", "STROGI", "LOBSTER"
$ wt       <dbl> 3, 5, 39, 15, 110
$ n_traps  <dbl> NA, NA, NA, NA, NA
$ area     <chr> NA, "C5", "C5", "E6", "G7"
$ .rid     <int> 9415, 10605, 10607, 12345, 12720
$ effort   <dbl> NA, NA, NA, NA, NA
$ date     <date> NA, NA, NA, NA, NA

Some tidying

If one thinks about it the process when sampling landings data is in this specific case something like the following:

  • One or more observers go to a landing site on a particular date.
  • The observer picks at minimum a vessel that arrives to land their catch
  • The observer records the details of the catch by each species from that vessel

Given this it would not be illogical to have 3 separte entry forms:

  1. A landing site report entry sheet. The following variables being recorded (additional potential data that one could record as shown in brackets):
  • Date
  • Name of landing site
  • Name of the observer(s)
  • (Time of observers arrival)
  • (Time of observers departure)
  • (…)
  1. An trip information sheet for each vessel. There one would report things like:
  • Vessel id
  • Effort data
    • If gear is POTS
      • soak time in hours, else
      • 1 day?
    • (Time of departure)
    • (Time of arrival)
    • (…)
  • Fishing location
  • (Crew number)
  • (Fuel used)
  • (…)
  1. A detail table
  • Species id
  • Weight of catch
  • (Value of catch)
  • (…)

In practice we would most likely record item 2) as header info on a sheet followed further down on the same sheet by the catch details (item 3). The bottom line here is that the actual data imported are really 3 separate tidy tables.

One way to think about this is if your boss was interested in the statistics of the sampling site visited by date over a year. The way to achieve this given the current datatable structure is to first get the distinct records of date and landings site:

table1 <- 
  d |> 
  # there is still some error in the date entry
  filter(year(date) == 2023) |> 
  select(date, lid) |>
  distinct() 
table1 |> 
  ggplot(aes(date, lid)) +
  geom_tile(fill = "red")