ICES VMS datacall: present -> dplyr (-> duckdb)

Note: This document is actively evolving. Feedback, corrections, and contributions are welcome!

1 Preamble

This document reviews and illustrates the ICES VMS datacall workflow, highlighting both the legacy and an alternative data processing approach.

The goal of this exercise is to clarify the process, if only personally for one who resides mostly in the tidyverse environment. Moving to the tidyverse, more specifically using the {dplyr} verbs, is not a goal in itself, but rather a stepping stone towards backend translation of that type of R code to SQL via the {dbplyr} package. The ultimate aim is to process the data via duckdb, reading sweeps of files residing on disk from an R script, without importing the whole datasets into R.
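
As a flavour of that ultimate aim, a minimal sketch (hypothetical file paths): {dbplyr} translates the dplyr verbs to SQL and duckdb scans the files directly, so nothing enters R memory until collect().

Code
library(duckdb)   # attaches DBI
library(dplyr)    # tbl() on a DBI connection dispatches to {dbplyr}
con <- dbConnect(duckdb())
# Lazily scan hypothetical yearly ping files - nothing is read into R yet
pings <- tbl(con, sql("SELECT * FROM read_parquet('data/tacsat_*.parquet')"))
pings |>
  count(VE_COU, VE_REF) |>   # translated to SQL, executed by duckdb
  collect()                  # only the small summary is pulled into R
dbDisconnect(con, shutdown = TRUE)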

1.1 TODO

  • Implement “split-among-pings” the new way
  • Expand spatial procedures (points-in-polygons for ICES areas, harbours, habitats)
  • DuckDB section: save processed files as Parquet and show how to start from there

1.2 The Story So Far

The ICES VMS datacall codebase is a legacy one, shaped by contributions from multiple developers, each with their preferred R dialect (data.table, tidyverse, etc.). While {vmstools} remains foundational, many new functions have accumulated over time—often as standalone scripts with limited documentation.

  • Challenge: The codebase is heterogeneous and difficult to refactor wholesale. Incremental, undocumented changes can also introduce long-term maintenance risks.
  • Aim: This document provides an external review and stepwise modernization, focusing on reproducibility, transparency, and scalability. Note that the author is not a long-time datacall maintainer, but rather a user and reviewer of the latest code version.

1.3 Approach

  • Use the dplyr grammar throughout, to facilitate translation across:
    • in-memory data frames,
    • traditional transactional databases,
    • on-the-fly in-memory databases (e.g., DuckDB).
  • Rationale:
    • Out-of-memory workflows help avoid system crashes as data volumes grow (e.g., higher VMS ping frequency, new data sources like AIS/SSF).
    • Anticipate future requirements and enable smooth scaling without major rewrites.
  • Coding principles:
    • Use and expose generic, well-tested dplyr functions.
    • Use pipelines for readability.
    • Log all data quality checks as part of the pipeline, not in a separate object.
    • Minimize redundant date/time variables—generate datetime columns once upfront, and reconstruct character formats only if needed later.
    • Split EFLALO data into “trips” and “events” and use those when assigning events to pings.

Downstream, we compare the “official” 2025 script (“datacall”) with a modernized approach (“myway”) for validation.

1.4 Packages and ad-hoc functions

We rely on several packages and external scripts/functions. The list below includes both CRAN and GitHub dependencies.

Code
library(data.table)  # Required for some datacall functions
library(vmstools)    # Provides foundational VMS/logbook utilities
library(tidyverse)   # Modern data manipulation and piping
library(duckdb)      # For scalable, out-of-memory data processing

### datacall functions - cannot source global.R; the datacall trip_assign is provided here as a temporary gist
source("https://gist.githubusercontent.com/einarhjorleifsson/deee46b493f1f95e95e905c15f1c2432/raw/e2d2a0a8241ac8f9c86c3971de677ff260d7319d/rb_trip_assign.R")

### ramb functions
source("https://raw.githubusercontent.com/einarhjorleifsson/ramb/refs/heads/main/R/rb_sum_across.R")
# source(here::here("R/dc_spread_cash_and_catch_v2.R"))

rb_getCodeList <- function(code_type, from_web = TRUE) {
  if (from_web) {
    icesVocab::getCodeList(code_type)
  } else {
    readr::read_rds(paste0(here::here("data/icesVocab"), "/", code_type, ".rds"))
  }
}
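
As a usage note, a one-off caching sketch (the directory name follows rb_getCodeList above) so that from_web = FALSE works offline:

Code
dir.create(here::here("data/icesVocab"), recursive = TRUE, showWarnings = FALSE)
for (ct in c("GearType", "TargetAssemblage", "Metier6_FishingActivity")) {
  icesVocab::getCodeList(ct) |>
    readr::write_rds(paste0(here::here("data/icesVocab"), "/", ct, ".rds"))
}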

1.5 Data

For illustration, we use the {vmstools} built-in datasets and ICES vocabularies.

Code
data(eflalo)
eflalo_org <- eflalo  # Make a copy for downstream comparisons
data(tacsat)

# Advised to use from_web = TRUE; set to FALSE here due to temporary web-access problems
iv_gear <-   rb_getCodeList("GearType", from_web = FALSE)$Key
iv_target <- rb_getCodeList("TargetAssemblage", from_web = FALSE)$Key
iv_met6 <-   rb_getCodeList("Metier6_FishingActivity", from_web = FALSE)$Key

1.6 Date-time consolidation, summed catch weight and value & unique row number

Convert and consolidate all relevant date and datetime fields up front.

  • This reduces clutter and makes downstream processing simpler

  • Unique row IDs are generated for easy traceability and comparison checks

Code
remove <- TRUE  # Set to FALSE if you want to retain original variables when new ones are created
eflalo <- 
  eflalo |> 
  as_tibble() |>
  rename(LE_MET6 = LE_MET_level6) |> 
  unite(col = "D_DATIM", FT_DDAT, FT_DTIME, sep = " ", remove = remove) |> 
  unite(col = "L_DATIM", FT_LDAT, FT_LTIME, sep = " ", remove = remove) |> 
  # should really be FT_DDATIM and FT_LDATIM - that is what splitAmongPings checks for
  mutate(D_DATIM = dmy_hms(D_DATIM, tz = "UTC"),       # Note: {vmstools} used "GMT"
         L_DATIM = dmy_hms(L_DATIM, tz = "UTC"),
         LE_CDAT = dmy(LE_CDAT),
         .after = FT_REF) |> 
  arrange(VE_COU, VE_REF, LE_CDAT) |> 
  mutate(.eid = 1L:n()) |> 
  select(.eid, VE_COU, VE_REF, VE_LEN, VE_KW, VE_TON, everything()) |> 
  # for the demo, reduces the clutter
  rb_sum_across("LE_KG", remove = remove) |> 
  rb_sum_across("LE_EURO", remove = remove)
tacsat <- 
  tacsat |> 
  as_tibble() |> 
  # have to keep SI_DATE, because used in trip_assign
  unite("SI_DATIM", SI_DATE, SI_TIME, sep = " ", remove = FALSE) |>
  select(-SI_TIME) |> 
  # dmy_hms() returns NA for "29/02/1801" - 1801 is not a leap year
  mutate(SI_DATIM = dmy_hms(SI_DATIM, tz = "UTC"),
         SI_DATE = dmy(SI_DATE)) |> 
  # hence filter upfront
  filter(!is.na(SI_DATIM)) |> 
  arrange(VE_COU, VE_REF, SI_DATIM) |> 
  # possibly part of the checks, but it is natural to do it upfront
  distinct(VE_COU, VE_REF, SI_DATIM, .keep_all = TRUE) |> 
  arrange(VE_COU, VE_REF, SI_DATIM) |> 
  mutate(.pid = 1L:n(),
         .before = VE_COU)

2 Pre-processing

2.1 eflalo: Checks

All records are assessed for potential errors or issues, which are logged in a checks column. This makes it easy to review and filter problematic records. Here the bookkeeping of potential record filtering is part of the pipe-flow process. Additional checks can easily be implemented, and in the end the user can decide which records should be filtered, based on the values in “checks”.

Code
eflalo <- 
  eflalo |> 
  # only needed if one wants to do additional testing
  separate(LE_MET6, into = c(".m6gear", ".m6target", ".m6mesh", ".m6rest"), sep = "_", remove = FALSE, extra = "merge") |> 
  separate(.m6mesh, into = c(".m1", ".m2"), sep = "-", remove = FALSE, fill = "right", convert = TRUE) |> 
  separate(LE_ID, into = c(".etid", ".egear", ".eir"), sep = "-", remove = FALSE) |> 
  # TODO: Deal with .m1 or .m2 being characters
  #mutate(.m2 = case_when(is.na(.m2) & as.character(.m1) == "0" ~ "0",
  #                       .default = .m2)) |> 
  mutate(
    checks = case_when(                                                   # datacall
      # Catch record outlier - code pending                               # 1.3.2
      base::duplicated(paste(VE_REF, LE_ID, LE_CDAT)) ~ "01 duplicated events",   # 1.3.3
      is.na(D_DATIM) | is.na(L_DATIM) ~ "02 impossible time",                       # 1.3.4
      year(D_DATIM) == (year(L_DATIM) - 1) ~ "03 new years trip",                   # 1.3.5
      D_DATIM > L_DATIM ~ "04 departure before arrival",                           # 1.3.6
      # "05 overlapping trips  - code pending                             # 1.3.7
      !LE_GEAR %in% iv_gear ~ "06 gear (metier 4) invalid",                  # 1.4.1
      !LE_MET6 %in% iv_met6 ~ "07 metier 6 invalid",                         # 3.5.5
      # Addendum, needs discussions
      !between(LE_CDAT, as_date(D_DATIM), as_date(L_DATIM)) ~ "ok - 08 catch date not within trip",
      # metier level 6 component checks
      LE_GEAR != .m6gear ~ "ok - 01 met6 gear different from gear",
      !.m6target %in% iv_target ~ "ok - 10 met6 target invalid",
      # event id component checks
      FT_REF != .etid ~ "ok - 11 event id tripid different from tripid",
      LE_GEAR != .egear ~ "ok - 12 event id gear different from gear",
      LE_RECT != .eir ~ "ok - 13 event id rectangle different from rectangle",
      # Pending solution prior to case_when applied
      #!between(mesh, as.integer(.m1), .m2) ~ "11 mesh size not in the met6 range",
      .default = "ok")
  )
eflalo |> 
  count(checks) |> 
  mutate('%' = round(n / sum(n) * 100, 2)) |> 
  knitr::kable(caption = "Logbooks: All records not starting with 'ok' will be filtered downstream")
Logbooks: All records not starting with ‘ok’ will be filtered downstream
checks n %
01 duplicated events 3 0.07
07 metier 6 invalid 741 16.33
ok 3775 83.17
ok - 01 met6 gear different from gear 20 0.44
Code
eflalo <-
  eflalo |> 
  select(-c(.etid, .egear, .eir,
            .m6gear, .m6target, .m6mesh, .m1, .m2, .m6rest)) |> 
  # You may want to use another filter
  filter(str_starts(checks, "ok")) |> 
  select(-checks)

2.2 eflalo: Split eflalo into trip and event table

Splitting the data clarifies intent as well as making the “myway” downstream code flow simpler.

Code
trips <-
  eflalo |> 
  group_by(VE_COU, VE_REF, FT_REF) |> 
  mutate(weight = sum(LE_KG),
         price = sum(LE_EURO)) |> 
  ungroup() |> 
  # here we could add variables corresponding to the highest trip
  select(VE_COU, VE_REF, VE_LEN, VE_KW, VE_TON, FT_REF, D_DATIM, L_DATIM, weight, price) |> 
  distinct(VE_COU, VE_REF, FT_REF, .keep_all = TRUE) |> 
  mutate(.tid = 1L:n(),
         .before = VE_COU)
events <- 
  eflalo |> 
  select(.eid, VE_COU, VE_REF, FT_REF, starts_with("LE_")) |> 
  # reduce the clutter for this demo test only - do not use generic code
  janitor::remove_empty(which = "cols")
Code
# TODO: printout with smaller fonts
trips
# A tibble: 2,519 × 11
    .tid VE_COU   VE_REF VE_LEN VE_KW VE_TON FT_REF D_DATIM            
   <int> <chr>    <chr>   <dbl> <dbl>  <dbl> <chr>  <dttm>             
 1     1 Atlantis 449         5    14     NA 263378 1800-01-02 07:00:00
 2     2 Atlantis 452         4    11     NA 263383 1800-01-02 07:00:00
 3     3 Atlantis 36         25   221     NA 264004 1800-01-05 22:00:00
 4     4 Atlantis 449         5    14     NA 263379 1800-01-06 08:00:00
 5     5 Atlantis 449         5    14     NA 263380 1800-01-07 09:00:00
 6     6 Atlantis 36         25   221     NA 264005 1800-01-07 14:00:00
 7     7 Atlantis 449         5    14     NA 263382 1800-01-09 06:00:00
 8     8 Atlantis 36         25   221     NA 264513 1800-01-12 22:00:00
 9     9 Atlantis 36         25   221     NA 264514 1800-01-15 11:00:00
10    10 Atlantis 449         5    14     NA 263795 1800-01-16 07:00:00
# ℹ 2,509 more rows
# ℹ 3 more variables: L_DATIM <dttm>, weight <dbl>, price <dbl>

As for the event table we have:

Code
# TODO: printout with smaller fonts
events
# A tibble: 3,795 × 15
    .eid VE_COU   VE_REF FT_REF LE_ID   LE_CDAT    LE_GEAR LE_MSZ LE_RECT LE_DIV
   <int> <chr>    <chr>  <chr>  <chr>   <date>     <chr>    <dbl> <chr>   <chr> 
 1     1 Atlantis 449    263378 263378… 1800-01-02 GNS        140 34F4    IVc   
 2     2 Atlantis 452    263383 263383… 1800-01-02 GNS        140 34F4    IVc   
 3     3 Atlantis 36     264004 264004… 1800-01-06 OTB        120 31F3    IVc   
 4     4 Atlantis 449    263379 263379… 1800-01-06 GNS        160 34F4    IVc   
 5     5 Atlantis 449    263380 263380… 1800-01-07 GNS        160 34F4    IVc   
 6     6 Atlantis 36     264005 264005… 1800-01-08 OTB        120 31F3    IVc   
 7     7 Atlantis 449    263382 263382… 1800-01-09 GNS        160 34F4    IVc   
 8     8 Atlantis 36     264513 264513… 1800-01-14 OTB        120 31F3    IVc   
 9    10 Atlantis 36     264514 264514… 1800-01-16 OTB        120 31F3    IVc   
10    11 Atlantis 449    263795 263795… 1800-01-16 GNS        160 34F4    IVc   
# ℹ 3,785 more rows
# ℹ 5 more variables: LE_MET6 <chr>, LE_UNIT <fct>, LE_EFF <dbl>, LE_KG <dbl>,
#   LE_EURO <dbl>

2.3 eflalo: Create highest_event and highest_trip tables

I think the intent of the datacall trip_assign function is:

  1. For each fishing day, add the values of gear, mesh and met6 from the event record with the highest catch within that day (no summation done a priori)
  2. For non-fishing days, add the values of gear, mesh and met6 from the gear/mesh/met6/rectangle combination with the highest summed catch within the trip

One way to think about this in the dplyr flow is to create two tables: a highest-event table and a highest-trip table. We will use these two tables in the “myway” flow downstream, that code flow being fully independent of the trip_assign function.

Code
highest_event <-
  events |> 
  arrange(desc(LE_KG)) |> 
  group_by(VE_COU, VE_REF, FT_REF, LE_CDAT) |> 
  slice(1) |> 
  ungroup() |> 
  select(VE_COU, VE_REF, FT_REF, LE_CDAT, LE_GEAR, LE_MSZ, LE_MET6, LE_RECT)
highest_event
# A tibble: 2,519 × 8
   VE_COU   VE_REF FT_REF LE_CDAT    LE_GEAR LE_MSZ LE_MET6           LE_RECT
   <chr>    <chr>  <chr>  <date>     <chr>    <dbl> <chr>             <chr>  
 1 Atlantis 10     272058 1800-05-07 TBB         80 TBB_DEF_70-99_0_0 32F2   
 2 Atlantis 10     273424 1800-05-14 TBB         80 TBB_DEF_70-99_0_0 32F3   
 3 Atlantis 10     273483 1800-05-21 TBB         80 TBB_DEF_70-99_0_0 32F3   
 4 Atlantis 10     274268 1800-05-27 TBB         80 TBB_DEF_70-99_0_0 32F3   
 5 Atlantis 10     316938 1801-09-08 TBB         80 TBB_DEF_70-99_0_0 32F3   
 6 Atlantis 10     317799 1801-09-14 TBB         80 TBB_DEF_70-99_0_0 32F3   
 7 Atlantis 10     318270 1801-09-22 TBB         80 TBB_DEF_70-99_0_0 32F3   
 8 Atlantis 10     318999 1801-09-29 TBB         80 TBB_DEF_70-99_0_0 31F2   
 9 Atlantis 1000   271704 1800-05-07 TBB         80 TBB_DEF_70-99_0_0 34F4   
10 Atlantis 1000   272438 1800-05-14 TBB         80 TBB_DEF_70-99_0_0 34F4   
# ℹ 2,509 more rows
Code
highest_trip <-
  events |> 
  group_by(VE_COU, VE_REF, FT_REF, LE_GEAR, LE_MSZ, LE_MET6, LE_RECT) |> 
  summarise(LE_KG = sum(LE_KG),
            .groups = "drop") |> 
  arrange(desc(LE_KG)) |> 
  group_by(VE_COU, VE_REF, FT_REF) |> 
  slice(1) |> 
  ungroup() |> 
  select(VE_COU, VE_REF, FT_REF, .LE_GEAR = LE_GEAR, .LE_MSZ = LE_MSZ, 
         .LE_MET6 = LE_MET6, .LE_RECT = LE_RECT)
highest_trip
# A tibble: 2,519 × 7
   VE_COU   VE_REF FT_REF .LE_GEAR .LE_MSZ .LE_MET6          .LE_RECT
   <chr>    <chr>  <chr>  <chr>      <dbl> <chr>             <chr>   
 1 Atlantis 10     272058 TBB           80 TBB_DEF_70-99_0_0 32F2    
 2 Atlantis 10     273424 TBB           80 TBB_DEF_70-99_0_0 32F3    
 3 Atlantis 10     273483 TBB           80 TBB_DEF_70-99_0_0 32F3    
 4 Atlantis 10     274268 TBB           80 TBB_DEF_70-99_0_0 32F3    
 5 Atlantis 10     316938 TBB           80 TBB_DEF_70-99_0_0 32F3    
 6 Atlantis 10     317799 TBB           80 TBB_DEF_70-99_0_0 32F3    
 7 Atlantis 10     318270 TBB           80 TBB_DEF_70-99_0_0 32F3    
 8 Atlantis 10     318999 TBB           80 TBB_DEF_70-99_0_0 31F2    
 9 Atlantis 1000   271704 TBB           80 TBB_DEF_70-99_0_0 34F4    
10 Atlantis 1000   272438 TBB           80 TBB_DEF_70-99_0_0 34F4    
# ℹ 2,509 more rows

TODO: Add a proper test here to illustrate the intent.
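
In the meantime, a minimal sketch of what such a test could check - key uniqueness and coverage (all three pipes should return zero rows):

Code
highest_event |> count(VE_COU, VE_REF, FT_REF, LE_CDAT) |> filter(n > 1)
highest_trip  |> count(VE_COU, VE_REF, FT_REF) |> filter(n > 1)
# every trip in events should be represented in highest_trip
events |> anti_join(highest_trip, by = join_by(VE_COU, VE_REF, FT_REF))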

2.4 tacsat: Checks

Code
it_min <- 5 * 60 # Interval threshold minimum [units: seconds]
tacsat <- 
  tacsat |> 
  group_by(VE_COU, VE_REF) |> 
  mutate(dt = c(it_min, diff(unclass(SI_DATIM)))) |>
  ungroup() |> 
  mutate(
    checks = 
      case_when(
        is.na(VE_COU) ~ "00 1 no country id",
        is.na(VE_REF) ~ "00 2 no vessel id",
        is.na(SI_DATIM) ~ "00 3 no time",
        is.na(SI_LONG) ~  "00 4 no lon",
        is.na(SI_LATI) ~  "00 5 no lat",
        is.na(SI_SP) ~ "00 6 no speed",
        base::duplicated(paste(VE_COU, VE_REF, SI_LONG, SI_LATI, SI_DATIM)) ~ "02 duplicates",
        !between(SI_LONG, -180, 180) | !between(SI_LATI, -90, 90) ~ "03 coordinates out of bound",
        dt < it_min ~ "04 time interval below threshold",
        .default = "ok")
  )

tacsat |>
  count(checks) |>
  mutate('%' = round(n / sum(n) * 100, 2)) |>
  knitr::kable(caption = "VMS: All records not 'ok' will be filtered downstream")
VMS: All records not ‘ok’ will be filtered downstream
checks n %
00 6 no speed 11900 13.84
04 time interval below threshold 1128 1.31
ok 72954 84.85
Code
tacsat <- 
  tacsat |> 
  filter(str_starts(checks, "ok")) |>
  select(-c(checks))
trails <- tacsat  # for myway

3 Analysis

3.1 Add trip information to pings

  • First we add trip information using the legacy vmstools::mergeEflalo2Tacsat
  • Then we use a loop to add fixed trip-related variables (we do not expect vessel length, power or tonnage to change within a trip)
Code
tacsat <-   # normally called tacsatp in the code flow 
  mergeEflalo2Tacsat(eflalo |> as.data.frame(),
                     tacsat |> as.data.frame())
# NOTE: event information added downstream - to clarify thinking
cols <- c("VE_LEN", "VE_KW")
for (col in cols) {
  tacsat[[col]] <- eflalo[[col]][match(tacsat$FT_REF, eflalo$FT_REF)]
}
Code
trails <- 
  trails |> 
  left_join(trips,
            by = join_by(VE_COU, VE_REF, between(SI_DATIM, D_DATIM, L_DATIM)),
            relationship = "many-to-one") |> 
  # No longer needed
  select(-c(D_DATIM, L_DATIM))
Code
comparison <- 
  tacsat |> 
  as_tibble() |> 
  select(.pid, VE_COU, VE_REF, FT_REF, length_dc = VE_LEN, kw_dc = VE_KW) |> 
  left_join(trails |> select(.pid, VE_COU, VE_REF, FT_REF, length_mw = VE_LEN, kw_mw = VE_KW),
            by = join_by(.pid, VE_COU, VE_REF, FT_REF)) |> 
  filter(length_dc != length_mw | kw_dc != kw_mw) |> 
  select(-c(.pid, FT_REF)) |> 
  distinct()
tacsat |> 
  as_tibble() |> 
  select(.pid, VE_COU, VE_REF, FT_REF, length_dc = VE_LEN, kw_dc = VE_KW) |> 
  left_join(trails |> select(.pid, VE_COU, VE_REF, FT_REF, length_mw = VE_LEN, kw_mw = VE_KW)) |> 
  filter(length_dc != length_mw | kw_dc != kw_mw) |> 
  select(-c(.pid, FT_REF)) |> 
  distinct() |> 
  left_join(eflalo_org |> 
              filter(VE_REF %in% unique(comparison$VE_REF)) |> 
              select(VE_REF, VE_LEN, VE_KW) |> 
              distinct()) |> 
  knitr::kable(caption = "Assigning vessel info to pings - discrepancies\n'_dc': datacall, '_mw: myway, original from eflalo\nSeems like myway same as the original")
Assigning vessel info to pings - discrepancies ’_dc’: datacall, ’_mw’: myway, original from eflalo Seems like myway same as the original
VE_COU VE_REF length_dc kw_dc length_mw kw_mw VE_LEN VE_KW
Atlantis 729 24.73 221 25.00000 221 25.00000 221
Atlantis 864 25.27 184 17.52212 111 17.52212 111
Atlantis 915 24.50 349 24.00000 221 24.00000 221

Seems “myway” is correct given the original, so this is something to check in the datacall flow.

3.2 Add event information to pings

This step is not a precursor to the next step - consider rearranging …

Code
## Add gear, mesh and met6 to pings --------------------------------------------
### datacall ----------------------------------------------------------------
# Now add the event info - this step is needed so that the next step works - could be simplified
cols <- c("LE_GEAR", "LE_MSZ", "LE_RECT", "LE_MET6")
for (col in cols) {
  tacsat[[col]] <- eflalo[[col]][match(tacsat$FT_REF, eflalo$FT_REF)]
}

tacsat_LE_GEAR <- trip_assign(tacsat, eflalo, col = "LE_GEAR",  haul_logbook = F)
# Here and subsequently, replaced %!in% with %in%, moving the ! upfront
tacsat <- rbindlist(list(tacsat[!tacsat$FT_REF %in% tacsat_LE_GEAR$FT_REF,], tacsat_LE_GEAR), fill = T, ignore.attr=TRUE)

tacsat_LE_MSZ <- trip_assign(tacsat, eflalo, col = "LE_MSZ",  haul_logbook = F)
tacsat <- rbindlist(list(tacsat[!tacsat$FT_REF %in% tacsat_LE_MSZ$FT_REF,], tacsat_LE_MSZ), fill = T)

tacsat_LE_RECT <- trip_assign(tacsat, eflalo, col = "LE_RECT",  haul_logbook = F)
tacsat <- rbindlist(list(tacsat[!tacsat$FT_REF %in% tacsat_LE_RECT$FT_REF,], tacsat_LE_RECT), fill = T)

tacsat_LE_MET <- trip_assign(tacsat, eflalo, col = "LE_MET6",  haul_logbook = F)
tacsat <- rbindlist(list(tacsat[!tacsat$FT_REF %in% tacsat_LE_MET$FT_REF,], tacsat_LE_MET), fill = T)

Here we use left joins to the trails and then consolidate the final values. Take note that the joining is keyed as follows:

1. highest_event (    fishing day): join_by(VE_COU, VE_REF, FT_REF, SI_DATE == LE_CDAT)
2. highest_trip  (non-fishing day): join_by(VE_COU, VE_REF, FT_REF)
Code
trails <- 
  trails |> 
  left_join(highest_event,
            by = join_by(VE_COU, VE_REF, FT_REF, SI_DATE == LE_CDAT),
            relationship = "many-to-one") |> 
  left_join(highest_trip,
            by = join_by(VE_COU, VE_REF, FT_REF),
            relationship = "many-to-one") |> 
  mutate(LE_GEAR = case_when(is.na(LE_GEAR) & !is.na(.LE_GEAR) ~ .LE_GEAR,
                             .default = LE_GEAR),
         LE_MSZ = case_when(is.na(LE_MSZ) & !is.na(.LE_MSZ) ~ .LE_MSZ,
                            .default = LE_MSZ),
         LE_MET6 = case_when(is.na(LE_MET6) & !is.na(.LE_MET6) ~ .LE_MET6,
                             .default = LE_MET6),
         LE_RECT = case_when(is.na(LE_RECT) & !is.na(.LE_RECT) ~ .LE_RECT,
                             .default = LE_RECT)) |> 
  select(-c(.LE_GEAR, .LE_MSZ, .LE_MET6, .LE_RECT))
Code
comparison <- 
  tacsat |> 
  as_tibble() |> 
  select(.pid, VE_COU, VE_REF, FT_REF, LE_GEAR, LE_MSZ, LE_MET6, LE_RECT) |> 
  left_join(trails |> select(.pid, VE_COU, VE_REF, FT_REF, 
                             .LE_GEAR = LE_GEAR, .LE_MSZ = LE_MSZ, 
                             .LE_MET6 = LE_MET6, .LE_RECT = LE_RECT),
            by = join_by(.pid, VE_COU, VE_REF, FT_REF)) |> 
  # reorder
  select(.pid, VE_COU, VE_REF, FT_REF, LE_GEAR, .LE_GEAR, LE_MSZ, .LE_MSZ, LE_MET6, .LE_MET6, LE_RECT, .LE_RECT)
comparison |> 
  #filter(LE_GEAR != .LE_GEAR  | LE_MSZ != .LE_MSZ | LE_MET6 != .LE_MET6) |> 
  mutate(gear = if_else(LE_GEAR == .LE_GEAR, "yes", "no", "yes_na"),
         mesh = if_else(LE_MSZ  == .LE_MSZ, "yes", "no", "yes_na"),
         met6 = if_else(LE_MET6 == .LE_MET6, "yes", "no", "yes_na"),
         rect = if_else(LE_RECT   == .LE_RECT, "yes", "no", "yes_na")) |> 
  count(gear, mesh, met6, rect) |> 
  mutate('%' = round(n / sum(n) * 100, 2)) |> 
  knitr::kable(caption = "Assigning events info to pings - discrepancies between datacall vs. myway\nNeeds a revisit, at least the rectangles")
Assigning events info to pings - discrepancies between datacall vs. myway Needs a revisit, at least the rectangles
gear mesh met6 rect n %
no yes yes no 33 0.05
yes no yes no 738 1.01
yes yes yes no 24089 33.02
yes yes yes yes 32609 44.70
yes_na yes_na yes_na yes_na 15485 21.23
Code
# For now let's look at the event records where gear and rectangles are not the same (33 records)
problem_trips <- 
  comparison |> 
  #filter(LE_GEAR != .LE_GEAR  | LE_MSZ != .LE_MSZ | LE_MET6 != .LE_MET6) |> 
  filter(LE_GEAR != .LE_GEAR  & LE_RECT != .LE_RECT) |> 
  select(VE_COU, VE_REF, FT_REF) |> 
  distinct()
problem_events <- 
  problem_trips |> 
  left_join(events |> select(VE_REF, FT_REF, LE_CDAT, LE_GEAR, LE_RECT, LE_KG)) |> 
  arrange(LE_CDAT, desc(LE_KG))
problem_events |> knitr::kable(caption = "Events table, original data")
Events table, original data
VE_COU VE_REF FT_REF LE_CDAT LE_GEAR LE_RECT LE_KG
Atlantis 1527 270892 1800-04-29 TBB 36F5 688.5230
Atlantis 1527 270892 1800-04-29 OTB 34F4 373.7451
Atlantis 1527 270892 1800-04-29 TBB 34F4 319.9824
Code
problem_events |> 
  inner_join(trails) |> 
  arrange(SI_DATIM) |> 
  select(.pid, SI_DATIM, VE_REF, FT_REF, LE_CDAT, SI_DATE, LE_GEAR, LE_RECT, LE_KG) |> 
  #distinct(VE_REF, FT_REF, LE_CDAT, SI_DATE, LE_GEAR, LE_RECT, LE_KG, .keep_all = TRUE) |> 
  knitr::kable(caption = "Discrepancies - myway")
Discrepancies - myway
.pid SI_DATIM VE_REF FT_REF LE_CDAT SI_DATE LE_GEAR LE_RECT LE_KG
3696 1800-04-28 09:46:00 1527 270892 1800-04-29 1800-04-28 TBB 36F5 688.523
3699 1800-04-28 11:42:00 1527 270892 1800-04-29 1800-04-28 TBB 36F5 688.523
3702 1800-04-28 13:36:00 1527 270892 1800-04-29 1800-04-28 TBB 36F5 688.523
3705 1800-04-28 15:32:00 1527 270892 1800-04-29 1800-04-28 TBB 36F5 688.523
3708 1800-04-28 17:28:00 1527 270892 1800-04-29 1800-04-28 TBB 36F5 688.523
3711 1800-04-28 19:22:00 1527 270892 1800-04-29 1800-04-28 TBB 36F5 688.523
3714 1800-04-28 21:18:00 1527 270892 1800-04-29 1800-04-28 TBB 36F5 688.523
3716 1800-04-28 23:12:00 1527 270892 1800-04-29 1800-04-28 TBB 36F5 688.523
3718 1800-04-29 01:08:00 1527 270892 1800-04-29 1800-04-29 TBB 36F5 688.523
3720 1800-04-29 03:04:00 1527 270892 1800-04-29 1800-04-29 TBB 36F5 688.523
3722 1800-04-29 04:58:00 1527 270892 1800-04-29 1800-04-29 TBB 36F5 688.523
3724 1800-04-29 06:54:00 1527 270892 1800-04-29 1800-04-29 TBB 36F5 688.523
3726 1800-04-29 08:48:00 1527 270892 1800-04-29 1800-04-29 TBB 36F5 688.523
3728 1800-04-29 10:44:00 1527 270892 1800-04-29 1800-04-29 TBB 36F5 688.523
3731 1800-04-29 14:34:00 1527 270892 1800-04-29 1800-04-29 TBB 36F5 688.523
3734 1800-04-29 18:24:00 1527 270892 1800-04-29 1800-04-29 TBB 36F5 688.523
3736 1800-04-29 20:20:00 1527 270892 1800-04-29 1800-04-29 TBB 36F5 688.523
3739 1800-04-30 00:10:00 1527 270892 1800-04-29 1800-04-30 TBB 36F5 688.523
3741 1800-04-30 02:06:00 1527 270892 1800-04-29 1800-04-30 TBB 36F5 688.523
3744 1800-04-30 05:56:00 1527 270892 1800-04-29 1800-04-30 TBB 36F5 688.523
3746 1800-04-30 07:50:00 1527 270892 1800-04-29 1800-04-30 TBB 36F5 688.523
3748 1800-04-30 09:46:00 1527 270892 1800-04-29 1800-04-30 TBB 36F5 688.523
3750 1800-04-30 11:42:00 1527 270892 1800-04-29 1800-04-30 TBB 36F5 688.523
3752 1800-04-30 13:36:00 1527 270892 1800-04-29 1800-04-30 TBB 36F5 688.523
3754 1800-04-30 15:32:00 1527 270892 1800-04-29 1800-04-30 TBB 36F5 688.523
3756 1800-04-30 17:28:00 1527 270892 1800-04-29 1800-04-30 TBB 36F5 688.523
3760 1800-04-30 19:22:00 1527 270892 1800-04-29 1800-04-30 TBB 36F5 688.523
3764 1800-04-30 21:18:00 1527 270892 1800-04-29 1800-04-30 TBB 36F5 688.523
3768 1800-04-30 23:12:00 1527 270892 1800-04-29 1800-04-30 TBB 36F5 688.523
3772 1800-05-01 01:08:00 1527 270892 1800-04-29 1800-05-01 TBB 36F5 688.523
3776 1800-05-01 03:02:00 1527 270892 1800-04-29 1800-05-01 TBB 36F5 688.523
3780 1800-05-01 04:58:00 1527 270892 1800-04-29 1800-05-01 TBB 36F5 688.523
3784 1800-05-01 06:54:00 1527 270892 1800-04-29 1800-05-01 TBB 36F5 688.523
Code
problem_events |> 
  inner_join(tacsat) |> 
  arrange(SI_DATIM) |> 
  select(.pid, SI_DATIM, VE_REF, FT_REF, LE_CDAT, SI_DATE, LE_GEAR, LE_RECT, LE_KG) |> 
  #distinct(VE_REF, FT_REF, LE_CDAT, SI_DATE, LE_GEAR, LE_RECT, LE_KG, .keep_all = TRUE) |> 
  knitr::kable(caption = "Discrepancies - datacall")
Discrepancies - datacall
.pid SI_DATIM VE_REF FT_REF LE_CDAT SI_DATE LE_GEAR LE_RECT LE_KG
3696 1800-04-28 09:46:00 1527 270892 1800-04-29 1800-04-28 OTB 34F4 373.7451
3699 1800-04-28 11:42:00 1527 270892 1800-04-29 1800-04-28 OTB 34F4 373.7451
3702 1800-04-28 13:36:00 1527 270892 1800-04-29 1800-04-28 OTB 34F4 373.7451
3705 1800-04-28 15:32:00 1527 270892 1800-04-29 1800-04-28 OTB 34F4 373.7451
3708 1800-04-28 17:28:00 1527 270892 1800-04-29 1800-04-28 OTB 34F4 373.7451
3711 1800-04-28 19:22:00 1527 270892 1800-04-29 1800-04-28 OTB 34F4 373.7451
3714 1800-04-28 21:18:00 1527 270892 1800-04-29 1800-04-28 OTB 34F4 373.7451
3716 1800-04-28 23:12:00 1527 270892 1800-04-29 1800-04-28 OTB 34F4 373.7451
3718 1800-04-29 01:08:00 1527 270892 1800-04-29 1800-04-29 OTB 34F4 373.7451
3720 1800-04-29 03:04:00 1527 270892 1800-04-29 1800-04-29 OTB 34F4 373.7451
3722 1800-04-29 04:58:00 1527 270892 1800-04-29 1800-04-29 OTB 34F4 373.7451
3724 1800-04-29 06:54:00 1527 270892 1800-04-29 1800-04-29 OTB 34F4 373.7451
3726 1800-04-29 08:48:00 1527 270892 1800-04-29 1800-04-29 OTB 34F4 373.7451
3728 1800-04-29 10:44:00 1527 270892 1800-04-29 1800-04-29 OTB 34F4 373.7451
3731 1800-04-29 14:34:00 1527 270892 1800-04-29 1800-04-29 OTB 34F4 373.7451
3734 1800-04-29 18:24:00 1527 270892 1800-04-29 1800-04-29 OTB 34F4 373.7451
3736 1800-04-29 20:20:00 1527 270892 1800-04-29 1800-04-29 OTB 34F4 373.7451
3739 1800-04-30 00:10:00 1527 270892 1800-04-29 1800-04-30 OTB 34F4 373.7451
3741 1800-04-30 02:06:00 1527 270892 1800-04-29 1800-04-30 OTB 34F4 373.7451
3744 1800-04-30 05:56:00 1527 270892 1800-04-29 1800-04-30 OTB 34F4 373.7451
3746 1800-04-30 07:50:00 1527 270892 1800-04-29 1800-04-30 OTB 34F4 373.7451
3748 1800-04-30 09:46:00 1527 270892 1800-04-29 1800-04-30 OTB 34F4 373.7451
3750 1800-04-30 11:42:00 1527 270892 1800-04-29 1800-04-30 OTB 34F4 373.7451
3752 1800-04-30 13:36:00 1527 270892 1800-04-29 1800-04-30 OTB 34F4 373.7451
3754 1800-04-30 15:32:00 1527 270892 1800-04-29 1800-04-30 OTB 34F4 373.7451
3756 1800-04-30 17:28:00 1527 270892 1800-04-29 1800-04-30 OTB 34F4 373.7451
3760 1800-04-30 19:22:00 1527 270892 1800-04-29 1800-04-30 OTB 34F4 373.7451
3764 1800-04-30 21:18:00 1527 270892 1800-04-29 1800-04-30 OTB 34F4 373.7451
3768 1800-04-30 23:12:00 1527 270892 1800-04-29 1800-04-30 OTB 34F4 373.7451
3772 1800-05-01 01:08:00 1527 270892 1800-04-29 1800-05-01 OTB 34F4 373.7451
3776 1800-05-01 03:02:00 1527 270892 1800-04-29 1800-05-01 OTB 34F4 373.7451
3780 1800-05-01 04:58:00 1527 270892 1800-04-29 1800-05-01 OTB 34F4 373.7451
3784 1800-05-01 06:54:00 1527 270892 1800-04-29 1800-05-01 OTB 34F4 373.7451
Code
# TODO: Check why something is rotten in the State of Denmark
print("There is a pending issue, the two 'ways' do not give the same results")
[1] "There is a pending issue, the two 'ways' do not give the same results"

3.3 Interval and state

Here I will keep things simple, applying a simple speed filter to both the datacall and myway flows, also throwing in the interval calculation along the way:

Code
speed_criteria <- 
  tribble(~LE_GEAR, ~s1,  ~s2,
          "GN",       1,    6,
          "GNS",      1,    6,
          "OTB",      1,    6,
          "OTM",      1,    6,
          "PTB",      1,    6,
          "TBB",      1,    6)

tacsat <- 
  tacsat |>
  arrange(VE_COU, VE_REF, SI_DATIM) |> 
  group_by(VE_COU, VE_REF, FT_REF) |> 
  mutate(INTV = c(NA_real_, diff(unclass(SI_DATIM))) / 60) |> # in minutes
  fill(INTV, .direction = "up") |> 
  ungroup() |> 
  left_join(speed_criteria) |> 
  mutate(SI_STATE = case_when(is.na(SI_SP) ~ NA,
                              between(SI_SP, s1, s2) ~ 1,
                              !between(SI_SP, s1, s2) ~ 0,
                              .default = -9999)) |> 
  select(-c(s1, s2))

trails <- 
  trails |>
  arrange(VE_COU, VE_REF, SI_DATIM) |> 
  group_by(VE_COU, VE_REF, FT_REF) |> 
  mutate(INTV = c(NA_real_, diff(unclass(SI_DATIM))) / 60) |> # in minutes
  fill(INTV, .direction = "up") |> 
  ungroup() |> 
  left_join(speed_criteria) |> 
  mutate(SI_STATE = case_when(is.na(SI_SP) ~ NA,
                              between(SI_SP, s1, s2) ~ 1,
                              !between(SI_SP, s1, s2) ~ 0,
                              .default = -9999)) |> 
  select(-c(s1, s2))

Interval checks (do a proper one; should also check for negative intervals):

Code
table(tacsat$INTV == 0, useNA = "ifany")

FALSE  <NA> 
72944    10 
Code
table(trails$INTV == 0, useNA = "ifany")

FALSE  <NA> 
72944    10 
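
And a sketch of the missing negative-interval check (negative values would indicate pings that are not ordered in time within a trip):

Code
tacsat |> as_tibble() |> summarise(n_negative = sum(INTV < 0, na.rm = TRUE))
trails |> summarise(n_negative = sum(INTV < 0, na.rm = TRUE))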

Speed checks:

Code
tacsat |> 
  mutate(in_eflalo = ifelse(FT_REF %in% unique(eflalo$FT_REF), "yes", "no")) |>  
  count(in_eflalo, SI_STATE)
# A tibble: 3 × 3
  in_eflalo SI_STATE     n
  <chr>        <dbl> <int>
1 no           -9999 15485
2 yes              0 29826
3 yes              1 27643
Code
trails |> 
  mutate(in_eflalo = ifelse(FT_REF %in% unique(eflalo$FT_REF), "yes", "no")) |>  
  count(in_eflalo, SI_STATE)
# A tibble: 3 × 3
  in_eflalo SI_STATE     n
  <chr>        <dbl> <int>
1 no           -9999 15485
2 yes              0 29826
3 yes              1 27643
Code
trails |> 
  mutate(in_eflalo = ifelse(FT_REF %in% unique(eflalo$FT_REF), "yes", "no")) |> 
  filter(SI_STATE >= 0) |> 
  count(in_eflalo, SI_STATE, LE_GEAR)
# A tibble: 9 × 4
  in_eflalo SI_STATE LE_GEAR     n
  <chr>        <dbl> <chr>   <int>
1 yes              0 GNS         9
2 yes              0 OTB      1226
3 yes              0 OTM      4763
4 yes              0 PTB        30
5 yes              0 TBB     23798
6 yes              1 OTB      3168
7 yes              1 OTM      4069
8 yes              1 PTB        43
9 yes              1 TBB     20363

For now, remove the records where no LE_GEAR is assigned to the trip (these are the SI_STATE -9999 records) and where the interval is not positive:

Code
tacsat <- tacsat |> filter(!is.na(LE_GEAR), INTV > 0)
trails <- trails |> filter(!is.na(LE_GEAR), INTV > 0)

3.4 Split among pings

Code
eflalo2 <- 
  eflalo |> 
  #filter(FT_REF == TRIP) |> 
  # vmstools::splitAmongPings checks if trips cross the new year and stops if any do,
  #  hence we need these (Note: not in d/m/y format, may cause trouble)
  mutate(FT_DDAT = as.character(as_date(D_DATIM)),
         FT_DTIME = str_sub(as.character(D_DATIM), 12),
         FT_LDAT = as.character(as_date(L_DATIM)),
         FT_LTIME = str_sub(as.character(L_DATIM), 12))

tacsat2 <-
  tacsat |> 
  #filter(FT_REF == TRIP) |> 
  select(-LE_RECT)

tacsatEflalo <-  splitAmongPings(
  # TODO: create an issue in {vmstools}
  tacsat = tacsat2 |> as.data.frame(),
  eflalo = eflalo2 |> as.data.frame(),
  variable = "all",
  level = c("trip", "ICESrectangle", "day"),
  conserve = FALSE 
  #, by = "INTV"
) |> 
  as_tibble()
[1] "level: day"
[1] "kg in eflalo 56868810"
[1] "kg in merged tacsat 56868810"
[1] "kg removed from eflalo 56868810"
[1] "level: rectangle"
[1] "kg in eflalo 177646931"
[1] "kg in merged tacsat 177646931"
[1] "kg removed from eflalo 177646931"
[1] "level: trip"
[1] "kg in eflalo 28941866"
[1] "kg in merged tacsat 28941866"
[1] "kg removed from eflalo 28941866"
Code
# Fix variable names upstream; the idea is to make a distinction between
#  standard TACSAT and EFLALO variable names and types, and
#  derived or external variables
trails2 <- 
  trails |> 
  select(-LE_RECT) |> 
  rename(time = SI_DATIM,
         state = SI_STATE) |> 
  mutate(ir = ramb::rb_d2ir(SI_LONG, SI_LATI),
         state = as.integer(state))
events2 <- 
  events |> 
  rename(date = LE_CDAT) |> 
  group_by(VE_COU, VE_REF, FT_REF, LE_RECT, date) |> 
  summarise(LE_KG = sum(LE_KG, na.rm = TRUE),
            LE_EURO = sum(LE_EURO, na.rm = TRUE),
            .groups = "drop") |> 
  mutate(.eid = as.integer(1:n()),
         .before = VE_COU)
trails_spread2 <- 
  ramb::dc_spread_cash_and_catch(
    trails = trails2,
    events = events2,
    remove_diagnostic = FALSE
  )
Code
comp <-
  tacsatEflalo |> 
  select(FT_REF, .pid, dc = LE_KG) |> 
  full_join(trails_spread2 |> 
              select(.pid, rb = LE_KG, .how, ir))
comp |> 
  group_by(FT_REF) |> 
  summarise(dc = sum(dc, na.rm = TRUE),
            rb = sum(rb, na.rm = TRUE)) |> 
  filter(!near(dc, rb)) |> 
  knitr::kable(caption = "List of FT_REFs that are not the same")
List of FT_REFs that are not the same
FT_REF dc rb
273053 3975.201 2875.4039
317846 1604.871 519.3498
Code
comp |> 
  ggplot(aes(dc / 1e3, rb / 1e3)) +
  theme_bw() +
  geom_abline(linewidth = 2, colour = "pink") +
  geom_point(size = 0.5) +
  scale_x_log10() +
  scale_y_log10() +
  labs(title = "Comparison of catch spread",
       x = "datacall", y = "myway",
       caption = "Guess there is more than one way to skin a cat")

3.5 Appendix

Copilot confronts the datacall scripts

This workflow provides a reproducible, quality-controlled process to prepare, link, and summarize Vessel Monitoring System (VMS) and logbook (EFLALO) data for submission to the ICES VMS and Logbook Data Call. The process is organized into three main sections, each with clear substeps and quality control.


3.6 1. Pre-processing and Cleaning

  • Static Data Preparation:
    • Load static reference datasets such as harbours and ICES statistical areas.
    • Prepare harbour polygons (with buffer) and standardize coordinate encodings.
  • Yearly Loop:
    For each year:
    • Data Loading:
      • Load and format raw TACSAT (VMS) and EFLALO (logbook) data.
    • TACSAT Cleaning:
      • Remove records outside ICES areas, with impossible coordinates, duplicates, pseudo-duplicates, and records within harbours.
      • Track and log record exclusions at each stage.
      • Save cleaned TACSAT data.
    • EFLALO Cleaning:
      • Remove duplicate trips, invalid or missing dates, trips outside the year, and arrival-before-departure errors.
      • Detect and remove overlapping trips.
      • Check gear and metier codes against ICES vocabularies.
      • Save cleaned EFLALO data and exclusion statistics.

3.7 2. Linking, Spatial Enrichment, and Effort Calculation

  • Spatial Data Setup:
    • Load and prepare habitat and bathymetry layers for spatial joins.
  • Yearly Loop:
    For each year:
    • Trip Assignment:
      • Link VMS records to logbook trips by matching date/time windows.
      • Save unmatched records for quality control.
      • Merge trip and vessel attributes (gear, mesh size, vessel length, etc.) to VMS records.
    • Advanced Assignment:
      • Refine assignment for trips with multiple gears or ICES rectangles using event-level data.
    • Effort and Activity Status:
      • Calculate and cap time intervals between VMS points.
      • Remove incomplete records.
      • Classify vessel positions as fishing or steaming (using speed-based methods).
      • Retain only valid metier codes.
      • Distribute logbook landings among VMS pings based on fishing effort (interval).
    • Spatial and Effort Enrichment:
      • Join habitat and bathymetry data to VMS points.
      • Assign C-square codes, year, and month.
      • Compute kilowatt-hour and swept area metrics.
      • Assign gear width (using either models or lookup rules).
      • Save enriched datasets.

3.8 3. Construction and Output of Summary Tables

  • Yearly Table Construction:
    For each year:
    • Table 1:
      • VMS-level records with spatial, temporal, effort, vessel, and gear information.
    • Table 2:
      • Logbook-level records with trip-level effort, catch, and VMS-link status.
  • Postprocessing:
    • Merge tables across years.
    • Anonymize vessel IDs and assign vessel length classes using ICES vocabulary.
    • Aggregate and summarize by key groupings (e.g., C-square, metier, vessel length, habitat).
    • Compute summary statistics like effort, catch, swept area, and vessel counts.
    • Generate field type and NA summaries for QC.
  • Vocabulary and QC Checks:
    • Check all key fields against ICES DATSU vocabularies (metiers, gear, vessel length, ICES rectangles, country codes, etc.).
    • Filter out or correct invalid/missing values and log exclusions.
  • Final Output:
    • Save tables in ICES-ready CSV format for submission.
    • Provide in-script reporting and intermediate files for transparency and troubleshooting.

3.9 Key Points

  • Transparency and Traceability: Every major step and QC point is traceable, with annotation and data snapshots for review and troubleshooting.
  • Reproducibility and Standards Compliance: All processing follows ICES data standards and is reproducible.
  • Comprehensive Data Preparation: The workflow ensures harmonized, linked, and spatially enriched data, ready for robust aggregation and ICES submission.

3.10 Appendix - the datacall code, the short version

#'------------------------------------------------------------------------------
#
# Script: Extract and process VMS and logbook data for ICES VMS data call
# Sections:
#   1. Pre-process and clean TACSAT and EFLALO data
#   2. Link TACSAT and EFLALO, add spatial and effort info
#   3. Construct and output summary tables
#
#'------------------------------------------------------------------------------

# Section 1: Pre-process and clean TACSAT and EFLALO data ----------------------

library(icesVocab)
library(vmstools)
library(data.table)
library(tidyverse)
library(sf)
source("0_global_functions.R")
source("0_global_settings.R")

# 1.1 Load and prepare static data ---------------------------------------------
data(harbours)
data(ICESareas)

harbours_alt <- 
  harbours |>
  dplyr::mutate(harbour = iconv(harbour, from = "latin1", to = "UTF-8")) |>
  as_tibble() |>
  sf::st_as_sf(coords = c("lon", "lat"), crs = 4326) |>
  sf::st_transform(crs = 3857) |>
  sf::st_buffer(dist = 3000) |>
  sf::st_transform(crs = 4326) |>
  dplyr::select(harbour)

# 1.2 Loop through years to process data ---------------------------------------
for (year in yearsToSubmit) {
  print(paste0("Start loop for year ", year))
  
  # 1.2.1 Load TACSAT and EFLALO -----------------------------------------------
  tacsat_name <- load(file.path(dataPath, paste0("tacsat_", year, ".RData")))
  eflalo_name <- load(file.path(dataPath, paste0("eflalo_", year, ".RData")))
  tacsat <- data.frame(get(tacsat_name))
  eflalo <- data.frame(get(eflalo_name))
  tacsat <- formatTacsat(tacsat)
  eflalo <- formatEflalo(eflalo)
  
  # 1.2.2 Clean TACSAT ---------------------------------------------------------
  remrecsTacsat <- matrix(
    NA, nrow = 6, ncol = 2,
    dimnames = list(
      c("total", "outsideICESarea", "duplicates", "notPossible", "pseudoDuplicates", "harbour"),
      c("rows", "percentage"))
  )
  remrecsTacsat["total", ] <- c(nrow(tacsat), "100%")
  ia <- transform_to_sf(ICESareas, coords = c("SI_LONG", "SI_LATI"))
  tacsat <- transform_to_sf(tacsat, coords = c("SI_LONG", "SI_LATI"))
  ia <- ia %>% sf::st_make_valid() %>% sf::st_transform(4326) |> sf::st_zm()
  overs <- sf::st_intersects(tacsat, ia)
  tacsat <- tacsat[lengths(overs) > 0, ]
  percentage_remaining <- round(nrow(tacsat) / as.numeric(remrecsTacsat["total", 1]) * 100, 2)
  remrecsTacsat["outsideICESarea", ] <- c(nrow(tacsat), percentage_remaining)
  tacsat$SI_DATIM <- as.POSIXct(paste(tacsat$SI_DATE, tacsat$SI_TIME), tz = "GMT", format = "%d/%m/%Y  %H:%M")
  tacsat$unique_id <- paste(tacsat$VE_REF, tacsat$SI_LATI, tacsat$SI_LONG, tacsat$SI_DATIM)
  tacsat <- tacsat[!duplicated(tacsat$unique_id), ]
  percentage_remaining <- round(nrow(tacsat) / as.numeric(remrecsTacsat["total", 1]) * 100, 2)
  remrecsTacsat["duplicates", ] <- c(nrow(tacsat), percentage_remaining)
  coords <- st_coordinates(tacsat)
  invalid_positions <- which(coords[,2] > 90 | coords[,2] < -90 | coords[,1] > 180 | coords[,1] < -180)
  if (length(invalid_positions) > 0) tacsat <- tacsat[-invalid_positions, ]
  percentage_remaining <- round(nrow(tacsat) / as.numeric(remrecsTacsat["total", 1]) * 100, 2)
  remrecsTacsat["notPossible", ] <- c(nrow(tacsat), percentage_remaining)
  tacsat <- sfsortTacsat(tacsat)
  tacsat$INTV <- intervalTacsat(as.data.frame(tacsat), level = "vessel", fill.na = TRUE)$INTV
  tacsat <- tacsat[tacsat$INTV >= intThres, ]
  percentage_remaining <- round(nrow(tacsat) / as.numeric(remrecsTacsat["total", 1]) * 100, 2)
  remrecsTacsat["pseudoDuplicates", ] <- c(nrow(tacsat), percentage_remaining)
  tacsat$INTV <- NULL
  overs <- sf::st_intersects(tacsat, harbours_alt)
  tacsat <- tacsat[!(lengths(overs) > 0), ]
  percentage_remaining <- round(nrow(tacsat) / as.numeric(remrecsTacsat["total", 1]) * 100, 2)
  remrecsTacsat["harbour", ] <- c(nrow(tacsat), percentage_remaining)
  save(remrecsTacsat, file = file.path(outPath, paste0("remrecsTacsat", year, ".RData")))
  tacsat <- as.data.frame(tacsat) %>% dplyr::select(-geometry, -unique_id)
  save(tacsat, file = file.path(outPath, paste("cleanTacsat", year, ".RData", sep = "")))
  message("Cleaning tacsat completed for year ", year)
  print(remrecsTacsat)
  
  # 1.2.3 Clean EFLALO ---------------------------------------------------------
  remrecsEflalo <- matrix(
    NA, nrow = 8, ncol = 2,
    dimnames = list(
      c("total", "duplicated", "impossible time", "before 1st Jan", "departArrival", "overlappingTrips", "MetierL4_LE_GEAR", "MetierL6_LE_MET"),
      c("rows", "percentage"))
  )
  remrecsEflalo["total", ] <- c(nrow(eflalo), "100%")
  trip_id <- create_trip_id(eflalo)
  eflalo <- eflalo[!duplicated(trip_id), ]
  num_records <- nrow(eflalo)
  percent_removed <- round((num_records - as.numeric(remrecsEflalo["total", 1])) / as.numeric(remrecsEflalo["total", 1]) * 100, 2)
  remrecsEflalo["duplicated", ] <- c(num_records, 100 + percent_removed)
  eflalo$FT_DDATIM <- convert_to_datetime(eflalo$FT_DDAT, eflalo$FT_DTIME)
  eflalo$FT_LDATIM <- convert_to_datetime(eflalo$FT_LDAT, eflalo$FT_LTIME)
  eflalo <- eflalo[!is.na(eflalo$FT_DDATIM) & !is.na(eflalo$FT_LDATIM), ]
  num_records <- nrow(eflalo)
  percent_removed <- round((num_records - as.numeric(remrecsEflalo["total", 1])) / as.numeric(remrecsEflalo["total", 1]) * 100, 2)
  remrecsEflalo["impossible time", ] <- c(num_records, 100 + percent_removed)
  eflalo <- remove_before_jan(eflalo, year)
  num_records <- nrow(eflalo)
  percent_removed <- round((num_records - as.numeric(remrecsEflalo["total", 1])) / as.numeric(remrecsEflalo["total", 1]) * 100, 2)
  remrecsEflalo["before 1st Jan", ] <- c(num_records, 100 + percent_removed)
  idx <- which(eflalo$FT_LDATIM >= eflalo$FT_DDATIM)
  eflalo <- eflalo[idx, ]
  remrecsEflalo["departArrival", ] <- c(
    nrow(eflalo),
    100 + round((nrow(eflalo) - as.numeric(remrecsEflalo["total", 1])) / as.numeric(remrecsEflalo["total", 1]) * 100, 2)
  )
  eflalo <- orderBy(~ VE_COU + VE_REF + FT_DDATIM + FT_LDATIM, data = eflalo)
  dt1 <- data.table(eflalo)[, .(VE_REF, FT_REF, FT_DDATIM, FT_LDATIM)]
  dt1 <- unique(dt1, by = c("VE_REF", "FT_REF"))
  setkey(dt1, VE_REF, FT_DDATIM, FT_LDATIM)
  dt2 <- dt1[, ref := .N > 1, by = key(dt1)][ref == TRUE]
  dt3 <- dt2[, .(FT_REF_NEW = FT_REF[1]), by = .(VE_REF, FT_DDATIM, FT_LDATIM)]
  dt4 <- merge(dt2, dt3)
  eflalo2 <- merge(data.table(eflalo), dt4, all.x = TRUE)
  eflalo2[!is.na(FT_REF_NEW), FT_REF := FT_REF_NEW]
  eflalo2[, FT_REF_NEW := NULL]
  eflalo <- data.frame(eflalo2)
  eflalo <- eflalo %>% dplyr::select(-ref)
  dt1 <- data.table(ID = eflalo$VE_REF, FT = eflalo$FT_REF,
                    startdate = eflalo$FT_DDATIM,
                    enddate = eflalo$FT_LDATIM)
  dt1 <- dt1[!duplicated(paste(dt1$ID, dt1$FT)), ]
  setkey(dt1, ID, startdate, enddate)
  result <- foverlaps(dt1, dt1, by.x = c("ID", "startdate", "enddate"),
                      by.y = c("ID", "startdate", "enddate"))
  overlapping.trips <- subset(result, startdate < i.enddate & enddate > i.startdate & FT != i.FT)
  if (nrow(overlapping.trips) > 0) {
    eflalo <- eflalo[!eflalo$FT_REF %in% overlapping.trips$FT, ]
    save(overlapping.trips, file = file.path(outPath, paste0("overlappingTrips", year, ".RData")))
  }
  num_records <- nrow(eflalo)
  percent_removed <- round((num_records - as.numeric(remrecsEflalo["total", 1])) / as.numeric(remrecsEflalo["total", 1]) * 100, 2)
  remrecsEflalo["overlappingTrips", ] <- c(num_records, 100 + percent_removed)
  m4_ices <- getCodeList("GearType")
  eflalo <- eflalo %>% filter(LE_GEAR %in% m4_ices$Key)
  remrecsEflalo["MetierL4_LE_GEAR", ] <- c(nrow(eflalo), round(nrow(eflalo) / as.numeric(remrecsEflalo["total", 1]) * 100, 2))
  m6_ices <- getCodeList("Metier6_FishingActivity")
  eflalo <- eflalo %>% filter(LE_MET %in% m6_ices$Key)
  remrecsEflalo["MetierL6_LE_MET", ] <- c(nrow(eflalo), round(nrow(eflalo) / as.numeric(remrecsEflalo["total", 1]) * 100, 2))
  save(remrecsEflalo, file = file.path(outPath, paste0("remrecsEflalo", year, ".RData")))
  save(eflalo, file = file.path(outPath, paste0("cleanEflalo", year, ".RData")))
  message("Cleaning eflalo completed for year ", year)
  print(remrecsEflalo)
}

rm(harbours, harbours_alt, ICESareas, tacsat_name, eflalo_name, tacsat, eflalo, remrecsTacsat, remrecsEflalo,
   ia, overs, coords, invalid_positions, trip_id, percent_removed, num_records, idx, dt1, result, overlapping.trips)
rm(list = ls(pattern = "_20"))

# Section 2: Link TACSAT and EFLALO, add spatial and effort info ---------------

library(sf)
sf::sf_use_s2(FALSE)
eusm <- readRDS(paste0(dataPath, "eusm.rds")) %>% st_transform(4326)
bathy <- readRDS(paste0(dataPath, "ICES_GEBCO.rds")) %>% st_set_crs(4326)

for (year in yearsToSubmit) {
  print(paste0("Start loop for year ", year))
  load(file = paste0(outPath, "/cleanEflalo", year, ".RData"))
  load(file = paste0(outPath, "/cleanTacsat", year, ".RData"))
  tacsat$geometry <- NULL
  
  # 2.1 Merge EFLALO trip info to TACSAT ---------------------------------------
  tacsatp <- mergeEflalo2Tacsat(eflalo, tacsat)
  tacsatp <- data.frame(tacsatp)
  tacsatpmin <- subset(tacsatp, FT_REF == 0)
  cat(sprintf("%.2f%% of tacsat data did not merge\n", (nrow(tacsatpmin) / nrow(tacsatp)) * 100))
  save(tacsatpmin, file = file.path(outPath, paste0("tacsatNotMerged", year, ".RData")))
  tacsatp <- subset(tacsatp, FT_REF != 0)
  cols <- c("LE_GEAR", "LE_MSZ", "VE_LEN", "VE_KW", "LE_RECT", "LE_MET", "LE_WIDTH", "VE_FLT", "VE_COU")
  for (col in cols) {
    tacsatp[[col]] <- eflalo[[col]][match(tacsatp$FT_REF, eflalo$FT_REF)]
  }
  for (col in c("LE_GEAR","LE_MSZ","LE_RECT","LE_MET")) {
    tripinfo <- trip_assign(tacsatp, eflalo, col = col, haul_logbook = FALSE)
    tacsatp <- rbindlist(list(tacsatp[tacsatp$FT_REF %!in% tripinfo$FT_REF,], tripinfo), fill = TRUE)
  }
  if ("LE_WIDTH" %in% names(eflalo)) {
    tripinfo <- trip_assign(tacsatp, eflalo, col = "LE_WIDTH", haul_logbook = FALSE)
    tacsatp <- rbindlist(list(tacsatp[tacsatp$FT_REF %!in% tripinfo$FT_REF,], tripinfo), fill = TRUE)
  }
  tacsatp <- as.data.frame(tacsatp)
  save(tacsatp, file = file.path(outPath, paste0("tacsatMerged", year, ".RData")))
  
  # 2.2 Calculate time interval, quality control, assign activity, merge landings ----
  tacsatp <- intvTacsat(tacsatp, level = "trip", fill.na = TRUE)
  tacsatp$INTV[tacsatp$INTV > intvThres] <- 2 * intvThres
  tacsatp$INTV[is.na(tacsatp$INTV)] <- intvThres
  idx <- which(is.na(tacsatp$VE_REF) | is.na(tacsatp$SI_LONG) | is.na(tacsatp$SI_LATI) |
                 is.na(tacsatp$SI_DATIM) | is.na(tacsatp$SI_SP))
  if (length(idx) > 0) tacsatp <- tacsatp[-idx, ]
  # [activity state assignment block omitted for brevity]
  save(tacsatp, file = file.path(outPath, paste0("tacsatActivity", year, ".RData")))
  kept <- nrow(tacsatp)
  removed <- nrow(tacsatp %>% filter(LE_MET %!in% valid_metiers))
  tacsatp <- tacsatp %>% filter(LE_MET %in% valid_metiers)
  cat(sprintf("%.2f%% of tacsatp removed due to invalid metier l6\n", (removed / (removed + kept) * 100)))
  idx_kg <- grep("LE_KG_", colnames(eflalo)[colnames(eflalo) %!in% c("LE_KG_TOT")])
  idx_euro <- grep("LE_EURO_", colnames(eflalo)[colnames(eflalo) %!in% c("LE_EURO_TOT")])
  if ("LE_KG_TOT" %!in% names(eflalo)) eflalo$LE_KG_TOT <- rowSums(eflalo[, idx_kg], na.rm = TRUE)
  if ("LE_EURO_TOT" %!in% names(eflalo)) eflalo$LE_EURO_TOT <- rowSums(eflalo[, idx_euro], na.rm = TRUE)
  eflalo <- eflalo[, -c(idx_kg, idx_euro)]
  eflaloM <- subset(eflalo, FT_REF %in% unique(tacsatp$FT_REF))
  eflaloNM <- subset(eflalo, !FT_REF %in% unique(tacsatp$FT_REF))
  message(sprintf("%.2f%% of the eflalo data not in tacsat\n", (nrow(eflaloNM) / (nrow(eflaloNM) + nrow(eflaloM)) * 100)))
  tacsatp$SI_STATE <- ifelse(tacsatp$SI_STATE == "f", 1, 0)
  tacsatp <- tacsatp[tacsatp$SI_STATE == 1 & !is.na(tacsatp$INTV), ]
  if ((sum(tacsatp$INTV == 0) > 0) || (sum(is.na(tacsatp$INTV)) > 0)) {
    message(sprintf("%.2f%% of intervals in tacsatp contain NA or zeros and have been discarded.\n", 
                    (sum(tacsatp$INTV == 0) + sum(is.na(tacsatp$INTV))) / nrow(tacsatp) * 100))
    tacsatp <- tacsatp %>% filter(!is.na(INTV) & INTV > 0)
  }
  tacsatEflalo <- splitAmongPings(
    tacsat = tacsatp,
    eflalo = eflaloM,
    variable = "all",
    level = c("day", "ICESrectangle", "trip"),
    conserve = TRUE, 
    by = "INTV"
  )
  eflalo$tripInTacsat <- ifelse(eflalo$FT_REF %in% tacsatEflalo$FT_REF, "Y", "N")
  save(tacsatEflalo, file = file.path(outPath, paste0("tacsatEflalo", year, ".RData")))
  save(eflalo, file = file.path(outPath, paste0("/processedEflalo", year, ".RData")))
  print("Dispatching landings completed")
}

# 2.3 Add habitat, depth, and effort refinements -------------------------------
for (year in yearsToSubmit) {
  print(paste0("Start loop for year ", year))
  load(file = paste0(outPath, "tacsatEflalo", year, ".RData"))
  tacsatEflalo <- tacsatEflalo |> 
    sf::st_as_sf(coords = c("SI_LONG", "SI_LATI"), remove = FALSE) |> 
    sf::st_set_crs(4326) |> 
    st_join(eusm, join = st_intersects) |> 
    st_join(bathy, join = st_intersects) |> 
    mutate(geometry = NULL) |> 
    data.frame()
  tacsatEflalo$Csquare <- CSquare(tacsatEflalo$SI_LONG, tacsatEflalo$SI_LATI, degrees = 0.05)
  tacsatEflalo$Year <- year(tacsatEflalo$SI_DATIM)
  tacsatEflalo$Month <- month(tacsatEflalo$SI_DATIM)
  tacsatEflalo$kwHour <- tacsatEflalo$VE_KW * tacsatEflalo$INTV / 60  # INTV in minutes -> kW-hours
  tacsatEflalo$INTV <- tacsatEflalo$INTV / 60                         # minutes -> hours
  tacsatEflalo$GEARWIDTHKM <- add_gearwidth(tacsatEflalo)
  # swept area: gear width [km] * time [hours] * speed [knots] * 1.852 [km/nmi]
  tacsatEflalo$SA_KM2 <- tacsatEflalo$GEARWIDTHKM * tacsatEflalo$INTV * tacsatEflalo$SI_SP * 1.852
  # NOTE: data.table syntax below - requires tacsatEflalo to be a data.table
  tacsatEflalo[, .(min = min(GEARWIDTHKM), max = max(GEARWIDTHKM)), by = .(LE_MET)]
  save(tacsatEflalo, file = file.path(outPath, paste0("tacsatEflalo", year, ".RData")))
}

rm(speedarr, tacsatp, tacsatEflalo, eflalo, eflaloM, eflaloNM)

# Section 3: Construct and output summary tables -------------------------------

library(glue)
library(gt)

for (year in yearsToSubmit) {
  load(file = paste0(outPath, "/processedEflalo", year, ".RData"))
  load(file = paste0(outPath, "/tacsatEflalo", year, ".RData"))
  
  # 3.1 Table 2: Logbook records -----------------------------------------------
  eflalo$Year <- year(eflalo$FT_LDATIM)
  eflalo$Month <- month(eflalo$FT_LDATIM)
  eflalo$INTV <- 1
  eflalo$record <- 1
  res <- aggregate(
    eflalo$record,
    by = as.list(eflalo[, c("VE_COU", "VE_REF", "LE_CDAT")]),
    FUN = sum,
    na.rm = TRUE
  )
  colnames(res) <- c("VE_COU", "VE_REF", "LE_CDAT", "nrRecords")
  eflalo <- merge(eflalo, res, by = c("VE_COU", "VE_REF", "LE_CDAT"))
  eflalo$INTV <- eflalo$INTV / eflalo$nrRecords
  eflalo$kwDays <- eflalo$VE_KW * eflalo$INTV
  eflalo$tripInTacsat <- ifelse(eflalo$FT_REF %in% tacsatEflalo$FT_REF, "Y", "N")
  RecordType <- "LE"
  cols <- c(
    "VE_REF", "VE_COU", "Year", "Month", "LE_RECT", "LE_GEAR", "LE_MET",
    "VE_LEN", "tripInTacsat", "INTV", "kwDays", "LE_KG_TOT", "LE_EURO_TOT"
  )
  if (year == yearsToSubmit[1]) {
    table2 <- cbind(RT = RecordType, eflalo[, cols])
  } else {
    table2 <- rbind(table2, cbind(RT = RecordType, eflalo[, cols]))
  }
  save(table2, file = file.path(outPath, "table2.RData"))
  message(glue("Table 2 for year {year} is completed"))
  
  # 3.2 Table 1: VMS with assigned logbook info --------------------------------
  tacsatEflalo <- data.frame(tacsatEflalo)
  RecordType <- "VE"
  cols <- c(
    "VE_REF", "VE_COU", "Year", "Month", "Csquare", "MSFD_BBHT", "depth", "LE_GEAR",
    "LE_MET", "SI_SP", "INTV", "VE_LEN", "kwHour", "VE_KW", "LE_KG_TOT", "LE_EURO_TOT",
    "GEARWIDTHKM", "SA_KM2"
  )
  if (year == yearsToSubmit[1]) {
    table1 <- cbind(RT = RecordType, tacsatEflalo[, cols])
  } else {
    table1 <- rbind(table1, cbind(RT = RecordType, tacsatEflalo[, cols]))
  }
  save(table1, file = file.path(outPath, "table1.RData"))
  message(glue("Table 1 for year {year} is completed"))
}

table(table1$INTV > 0)
table(table2$INTV > 0)

# 3.3 Prepare and anonymize vessel IDs -----------------------------------------
load(file = paste0(outPath, "/table1.RData"))
load(file = paste0(outPath, "/table2.RData"))
VE_lut <- data.frame(VE_REF = unique(c(table1$VE_REF, table2$VE_REF)))
fmt <- paste0("%0", floor(log10(nrow(VE_lut))) + 1, "d")
VE_lut$VE_ID <- paste0(table1$VE_COU[1], sprintf(fmt, 1:nrow(VE_lut)))
table1 <- left_join(table1, VE_lut)
table2 <- left_join(table2, VE_lut)

# 3.4 Add vessel length categories from ICES DATSU vocab -----------------------
vlen_ices <- getCodeList("VesselLengthClass")
vlen_icesc <- vlen_ices %>%
  filter(Key %in% c("VL0006", "VL0608", "VL0810", "VL1012", "VL1215", "VL1518", "VL1824", "VL2440", "VL40XX")) %>%
  dplyr::select(Key) %>%
  dplyr::arrange(Key)
table1$LENGTHCAT <- table1$VE_LEN %>% cut(
  breaks = c(0, 6, 8, 10, 12, 15, 18, 24, 40, Inf),
  right = FALSE, include.lowest = TRUE, labels = vlen_icesc$Key
)
table2$LENGTHCAT <- table2$VE_LEN %>% cut(
  breaks = c(0, 6, 8, 10, 12, 15, 18, 24, 40, Inf),
  right = FALSE, include.lowest = TRUE, labels = vlen_icesc$Key
)

# 3.5 Aggregate and summarize tables -------------------------------------------
table1Save <- table1 %>%
  separate(col = LE_MET, c("MetierL4", "MetierL5"), sep = '_', extra = "drop", remove = FALSE) %>%
  group_by(RecordType = RT, CountryCode = VE_COU, Year, Month, Csquare, MetierL4, MetierL5, MetierL6 = LE_MET, VesselLengthRange = LENGTHCAT, Habitat = MSFD_BBHT, Depth = depth) %>%
  summarise(
    No_Records = n(),
    AverageFishingSpeed = mean(SI_SP),
    FishingHour = sum(INTV, na.rm = TRUE),
    AverageInterval = mean(INTV, na.rm = TRUE),
    AverageVesselLength = mean(VE_LEN, na.rm = TRUE),
    AveragekW = mean(VE_KW, na.rm = TRUE),
    kWFishingHour = sum(kwHour, na.rm = TRUE),
    SweptArea = sum(SA_KM2, na.rm = TRUE),
    TotWeight = sum(LE_KG_TOT, na.rm = TRUE),
    TotValue = sum(LE_EURO_TOT, na.rm = TRUE),
    NoDistinctVessels = n_distinct(VE_ID, na.rm = TRUE),
    AnonymizedVesselID = ifelse(n_distinct(VE_ID) < 3, paste(unique(VE_ID), collapse = ";"), 'not_required'),
    AverageGearWidth = mean(GEARWIDTHKM, na.rm = TRUE),
    .groups = "drop"
  ) %>%
  relocate(NoDistinctVessels, AnonymizedVesselID, .before = Csquare) %>%
  as.data.frame()
table2Save <- table2 %>%
  separate(col = LE_MET, c("MetierL4", "MetierL5"), sep = '_', remove = FALSE) %>%
  group_by(RecordType = RT, CountryCode = VE_COU, Year, Month, ICESrectangle = LE_RECT, MetierL4, MetierL5, MetierL6 = LE_MET, VesselLengthRange = LENGTHCAT, VMSEnabled = tripInTacsat) %>%
  summarise(
    FishingDays = sum(INTV, na.rm = TRUE),
    kWFishingDays = sum(kwDays, na.rm = TRUE),
    TotWeight = sum(LE_KG_TOT, na.rm = TRUE),
    TotValue = sum(LE_EURO_TOT, na.rm = TRUE),
    NoDistinctVessels = n_distinct(VE_ID, na.rm = TRUE),
    AnonymizedVesselID = ifelse(n_distinct(VE_ID) < 3, paste(unique(VE_ID), collapse = ";"), 'not_required'),
    .groups = "drop"
  ) %>%
  relocate(NoDistinctVessels, AnonymizedVesselID, .before = ICESrectangle) %>%
  as.data.frame()
saveRDS(table1Save, paste0(outPath, "table1Save.rds"))
saveRDS(table2Save, paste0(outPath, "table2Save.rds"))

# 3.6 Field NA/format summaries (optional) -------------------------------------
table_nas <- NULL
for (nn in colnames(table1Save)) {
  table_na <- table(is.na(table1Save[, nn]))
  row <- c(field = nn, is_na =  ifelse(is.na(table_na['TRUE']), 0, table_na['TRUE'] ), total_records = length(table1Save[, nn]), field_type = class(table1Save[, nn]))
  table_nas <- rbind(table_nas, row)
}
gt(
  as_tibble(table_nas),
  rowname_col = 'field'
) %>%
  tab_header(title = md('Summary of **Table 1**  number of NA and records types'))
table_nas <- NULL
for (nn in colnames(table2Save)) {
  table_na <- table(is.na(table2Save[, nn]))
  row <- c(field = nn, is_na =  ifelse(is.na(table_na['TRUE']), 0, table_na['TRUE'] ), total_records = length(table2Save[, nn]), field_type = class(table2Save[, nn]))
  table_nas <- rbind(table_nas, row)
}
gt(
  as_tibble(table_nas),
  rowname_col = 'field'
) %>%
  tab_header(title = md('Summary of **Table 2**  number of NA and records types'))

table(table1$INTV > 0)
table(table2$INTV > 0)

# 3.7 Save as ICES-ready CSV ---------------------------------------------------
colnames(table1Save)[which(colnames(table1Save) == "Csquare")] <- "C-square"
colnames(table1Save)[which(colnames(table1Save) == "Habitat")] <- "HabitatType"
colnames(table1Save)[which(colnames(table1Save) == "Depth")] <- "DepthRange"
colnames(table1Save)[which(colnames(table1Save) == "No_Records")] <- "NumberOfRecords"
write.table(table1Save, file.path(outPath, "table1Save.csv"), na = "", row.names = FALSE, col.names = TRUE, sep = ",", quote = FALSE)
write.table(table2Save, file.path(outPath, "table2Save.csv"), na = "", row.names = FALSE, col.names = TRUE, sep = ",", quote = FALSE)

#'------------------------------------------------------------------------------
# End of script
#'------------------------------------------------------------------------------