ICES VMS datacall: present -> dplyr (-> duckdb)

Note: This document is actively evolving. Feedback, corrections, and contributions are welcome!

1 Preamble

This document illustrates as well as reviews the ICES VMS datacall workflow, highlighting both the legacy and alternative data processing approaches.

The goal of this exercise was to clarify the process, if only personally for one that resides mostly in the tidyverse environment. Moving to the tidyverse, more specifically using the {dplyr}-verbs is not a goal in itself, but rather a stepping stone towards being able to use backend translation of that type of R code to SQL, via the {dbplyr}-package. Processing the data via in-memory duckdb reading from sweeps of files residing on the computer via R script, without importing the whole datasets into R is the ultimate aim.

1.1 TODO

Implement “split-among-pings” the new way
Expand spatial procedures (points-in-polygons for ICES areas, harbours, habitats)
DuckDB section: save processed files as Parquet and show how to start from there

1.2 The Story So Far

The ICES VMS datacall codebase is a legacy, shaped by contributions from multiple developers, each with their preferred R dialect (data.table, tidyverse, etc.). While {vmstools} remains foundational, many new functions have accumulated over time—often as standalone scripts with limited documentation.

Challenge: The codebase is heterogeneous and difficult to refactor wholesale. Incremental, undocumented changes can also introduce long-term maintenance risks.
Aim: This document provides an external review and stepwise modernization, focusing on reproducibility, transparency, and scalability. Note that the author is not a long-time datacall maintainer, but rather a user and reviewer of the latest code version.

1.3 Approach

Use the dplyr grammar throughout, to facilitate translation across:
in-memory data frames,
traditional transactional databases,
on-the-fly in-memory databases (e.g., DuckDB).
Rationale:
Out-of-memory workflows help avoid system crashes as data volumes grow (e.g., higher VMS ping frequency, new data sources like AIS/SSF).
Anticipate future requirements and enable smooth scaling without major rewrites.
Coding principles:
Use and expose generic, well-tested dplyr functions.
Use pipeline for readability
Log all data quality checks as part of the pipeline, not in separate object.
Minimize redundant date/time variables—generate datetime columns once upfront, and reconstruct character formats only if needed later.
Split EFLALO data into “trips” and “events” and use those when assigning events to pings.

Downstream, compare the “official” 2025 script (“datacall”) and a modernized approach (“myway”) for validation.

1.4 Packages and ad-hoc functions

We rely on several packages and external scripts/functions. The list below includes both CRAN and GitHub dependencies.

Code

library(data.table)  # Required for some datacall functions
library(vmstools)    # Provides foundational VMS/logbook utilities
library(tidyverse)   # Modern data manipulation and piping
library(duckdb)      # For scalable, out-of-memory data processing

### datacall functions - can not source global.R, here a datacall trip_assign as a temporary gist
source("https://gist.githubusercontent.com/einarhjorleifsson/deee46b493f1f95e95e905c15f1c2432/raw/e2d2a0a8241ac8f9c86c3971de677ff260d7319d/rb_trip_assign.R")

### ramb functions
source("https://raw.githubusercontent.com/einarhjorleifsson/ramb/refs/heads/main/R/rb_sum_across.R")
# source(here::here("R/dc_spread_cash_and_catch_v2.R"))

rb_getCodeList <- function(code_type, from_web = TRUE) {
  if(from_web) {
    icesVocab::getCodeList(code_type)
  } else {
    readr::read_rds(paste0(here::here("data/icesVocab"), "/", code_type, ".rds"))
  }
}

1.5 Data

For illustration, we use the {vmstools} built-in datasets and ICES vocabularies.

Code

data(eflalo)
eflalo_org <- eflalo  # Make a copy for donwstream comparisons
data(tacsat)

# Advice to change from_web = FALSE to from_web = TRUE, have temporary problems
iv_gear <-   rb_getCodeList("GearType", from_web = FALSE)$Key
iv_target <- rb_getCodeList("TargetAssemblage", from_web = FALSE)$Key
iv_met6 <-   rb_getCodeList("Metier6_FishingActivity", from_web = FALSE)$Key

1.6 Date-time consolidation, sum catch and weight & unique rownumber

Convert and consolidate all relevant date and datetime fields up front.

This reduces clutter and makes downstream processing simpler
Unique row IDs are generated for easy traceability and comparison checks

Code

remove <- TRUE  # Decide if you want retain orginal variables if new are created
eflalo <- 
  eflalo |> 
  as_tibble() |>
  rename(LE_MET6 = LE_MET_level6) |> 
  unite(col = "D_DATIM", FT_DDAT, FT_DTIME, sep = " ", remove = remove) |> 
  unite(col = "L_DATIM", FT_LDAT, FT_LTIME, sep = " ", remove = remove) |> 
  # should really be FT_DDATIM and FT_LDATIM - that is what splitAmongPings checks for
  mutate(D_DATIM = dmy_hms(D_DATIM, tz = "UTC"),       # Note: {vmstools} used "GMT"
         L_DATIM = dmy_hms(L_DATIM, tz = "UTC"),
         LE_CDAT = dmy(LE_CDAT),
         .after = FT_REF) |> 
  arrange(VE_COU, VE_COU, LE_CDAT) |> 
  mutate(.eid = 1L:n()) |> 
  select(.eid, VE_COU, VE_REF, VE_LEN, VE_KW, VE_TON, everything()) |> 
  # for the demo, reduces the clutter
  rb_sum_across("LE_KG", remove = remove) |> 
  rb_sum_across("LE_EURO", remove = remove)
tacsat <- 
  tacsat |> 
  as_tibble() |> 
  # have to keep SI_DATE, because used in trip_assign
  unite("SI_DATIM", SI_DATE, SI_TIME, sep = " ", remove = FALSE) |>
  select(-SI_TIME) |> 
  # UTC does not recognize "29/02/1801", not a leap year
  mutate(SI_DATIM = dmy_hms(SI_DATIM, tz = "UTC"),
         SI_DATE = dmy(SI_DATE)) |> 
  # hence filter upfront
  filter(!is.na(SI_DATIM)) |> 
  arrange(VE_COU, VE_REF, SI_DATIM) |> 
  # possibly part of the checks, but it is natural to do it upfront
  distinct(VE_COU, VE_REF, SI_DATIM, .keep_all = TRUE) |> 
  arrange(VE_COU, VE_COU, SI_DATIM) |> 
  mutate(.pid = 1L:n(),
         .before = VE_COU)

2 Pre-processing

2.1 eflalo: Checks

All records are assessed for potential errors or issues, which are logged in a checks column. This makes it easy to review and filter problematic records. Here bookkeeping of potential filtering of records is part of the pipe-flow process. Additional checks can easily be implemented and then in the end the user can decide what records should be filtered, based on values in “checks”.

Code

eflalo <- 
  eflalo |> 
  # only needed if one wants to do additional testing
  separate(LE_MET6, into = c(".m6gear", ".m6target", ".m6mesh", ".m6rest"), sep = "_", remove = FALSE, extra = "merge") |> 
  separate(.m6mesh, into = c(".m1", ".m2"), sep = "-", remove = FALSE, fill = "right", convert = TRUE) |> 
  separate(LE_ID, into = c(".etid", ".egear", ".eir"), sep = "-", remove = FALSE) |> 
  # TODO: Deal with .m1 or .m2 being characters
  #mutate(.m2 = case_when(is.na(.m2) & as.character(.m1) == "0" ~ "0",
  #                       .default = .m2)) |> 
  mutate(
    checks = case_when(                                                   # datacall
      # Catch record outlier - code pending                               # 1.3.2
      base::duplicated(paste(VE_REF, LE_ID, LE_CDAT)) ~ "01 duplicated events",   # 1.3.3
      is.na(D_DATIM) | is.na(L_DATIM) ~ "02 impossible time",                       # 1.3.4
      year(D_DATIM) == (year(L_DATIM) - 1) ~ "03 new years trip",                   # 1.3.5
      D_DATIM > L_DATIM ~ "04 departure before arrival",                           # 1.3.6
      # "05 overlapping trips  - code pending                             # 1.3.7
      !LE_GEAR %in% iv_gear ~ "06 gear (metier 4) invalid",                  # 1.4.1
      !LE_MET6 %in% iv_met6 ~ "07 metier 6 invalid",                         # 3.5.5
      # Addendum, needs discussions
      !between(LE_CDAT, as_date(D_DATIM), as_date(L_DATIM)) ~ "ok - 08 catch date not within trip",
      # metier level 6 component checks
      LE_GEAR != .m6gear ~ "ok - 01 met6 gear different from gear",
      !.m6target %in% iv_target ~ "ok - 10 met6 target invalid",
      # event id component checks
      FT_REF != .etid ~ "ok - 11 event id tripid different from tripid",
      LE_GEAR != .egear ~ "ok - 12 event id gear different from gear",
      LE_RECT != .eir ~ "ok - 13 event id rectangle different from rectangle",
      # Pending solution prior to case_when applied
      #!between(mesh, as.integer(.m1), .m2) ~ "11 mesh size not in the met6 range",
      .default = "ok")
  )
eflalo |> 
  count(checks) |> 
  mutate('%' = round(n / sum(n) * 100, 2)) |> 
  knitr::kable(caption = "Logbooks: All records not starting with 'ok' will be filtered downstream")

Logbooks: All records not starting with ‘ok’ will be filtered downstream
checks	n	%
01 duplicated events	3	0.07
07 metier 6 invalid	741	16.33
ok	3775	83.17
ok - 01 met6 gear different from gear	20	0.44

Code

eflalo <-
  eflalo |> 
  select(-c(.etid, .egear, .eir,
            .m6gear, .m6target, .m6mesh, .m1, .m2, .m6rest)) |> 
  # You may want to use another filter
  filter(str_starts(checks, "ok")) |> 
  select(-checks)

2.2 eflalo: Split eflalo into trip and event table

Splitting the data clarifies intent as well as making “myway” downstream code flow simpler.

Code

trips <-
  eflalo |> 
  group_by(VE_COU, VE_REF, FT_REF) |> 
  mutate(weight = sum(LE_KG),
         price = sum(LE_EURO)) |> 
  ungroup() |> 
  # here we could add variables corresponding to the highest trip
  select(VE_COU, VE_REF, VE_LEN, VE_KW, VE_TON, FT_REF, D_DATIM, L_DATIM, weight, price) |> 
  distinct(VE_COU, VE_REF, FT_REF, .keep_all = TRUE) |> 
  mutate(.tid = 1L:n(),
         .before = VE_COU)
events <- 
  eflalo |> 
  select(.eid, VE_COU, VE_REF, FT_REF, starts_with("LE_")) |> 
  # reduce the clutter for this demo test only - do not use generic code
  janitor::remove_empty(which = "cols")

Code

# TODO: printout with smaller fonts
trips

# A tibble: 2,519 × 11
    .tid VE_COU   VE_REF VE_LEN VE_KW VE_TON FT_REF D_DATIM            
   <int> <chr>    <chr>   <dbl> <dbl>  <dbl> <chr>  <dttm>             
 1     1 Atlantis 449         5    14     NA 263378 1800-01-02 07:00:00
 2     2 Atlantis 452         4    11     NA 263383 1800-01-02 07:00:00
 3     3 Atlantis 36         25   221     NA 264004 1800-01-05 22:00:00
 4     4 Atlantis 449         5    14     NA 263379 1800-01-06 08:00:00
 5     5 Atlantis 449         5    14     NA 263380 1800-01-07 09:00:00
 6     6 Atlantis 36         25   221     NA 264005 1800-01-07 14:00:00
 7     7 Atlantis 449         5    14     NA 263382 1800-01-09 06:00:00
 8     8 Atlantis 36         25   221     NA 264513 1800-01-12 22:00:00
 9     9 Atlantis 36         25   221     NA 264514 1800-01-15 11:00:00
10    10 Atlantis 449         5    14     NA 263795 1800-01-16 07:00:00
# ℹ 2,509 more rows
# ℹ 3 more variables: L_DATIM <dttm>, weight <dbl>, price <dbl>

As for the event table we have:

Code

# TODO: printout with smaller fonts
events

# A tibble: 3,795 × 15
    .eid VE_COU   VE_REF FT_REF LE_ID   LE_CDAT    LE_GEAR LE_MSZ LE_RECT LE_DIV
   <int> <chr>    <chr>  <chr>  <chr>   <date>     <chr>    <dbl> <chr>   <chr> 
 1     1 Atlantis 449    263378 263378… 1800-01-02 GNS        140 34F4    IVc   
 2     2 Atlantis 452    263383 263383… 1800-01-02 GNS        140 34F4    IVc   
 3     3 Atlantis 36     264004 264004… 1800-01-06 OTB        120 31F3    IVc   
 4     4 Atlantis 449    263379 263379… 1800-01-06 GNS        160 34F4    IVc   
 5     5 Atlantis 449    263380 263380… 1800-01-07 GNS        160 34F4    IVc   
 6     6 Atlantis 36     264005 264005… 1800-01-08 OTB        120 31F3    IVc   
 7     7 Atlantis 449    263382 263382… 1800-01-09 GNS        160 34F4    IVc   
 8     8 Atlantis 36     264513 264513… 1800-01-14 OTB        120 31F3    IVc   
 9    10 Atlantis 36     264514 264514… 1800-01-16 OTB        120 31F3    IVc   
10    11 Atlantis 449    263795 263795… 1800-01-16 GNS        160 34F4    IVc   
# ℹ 3,785 more rows
# ℹ 5 more variables: LE_MET6 <chr>, LE_UNIT <fct>, LE_EFF <dbl>, LE_KG <dbl>,
#   LE_EURO <dbl>

2.3 eflalo: Create higest_event and highest trip tables

I think the intent is of the datacall trip_assing function is:

For each fishing day, add the values of gear, mesh and met6 from the highest event catch record within the day (no summation done a priori)
For non-fishing days, add the values of gear, mesh and met6 from the highest trip catches

One way to think about this in the dplyr-flow is to create two tables, highest event table and highest trip table. We will use these two tables in “myway” flow downstream, that code flow being being fully independent of the assign_trip-function.

Code

highest_event <-
  events |> 
  arrange(desc(LE_KG)) |> 
  group_by(VE_COU, VE_REF, FT_REF, LE_CDAT) |> 
  slice(1) |> 
  ungroup() |> 
  select(VE_COU, VE_REF, FT_REF, LE_CDAT, LE_GEAR, LE_MSZ, LE_MET6, LE_RECT)
highest_event

# A tibble: 2,519 × 8
   VE_COU   VE_REF FT_REF LE_CDAT    LE_GEAR LE_MSZ LE_MET6           LE_RECT
   <chr>    <chr>  <chr>  <date>     <chr>    <dbl> <chr>             <chr>  
 1 Atlantis 10     272058 1800-05-07 TBB         80 TBB_DEF_70-99_0_0 32F2   
 2 Atlantis 10     273424 1800-05-14 TBB         80 TBB_DEF_70-99_0_0 32F3   
 3 Atlantis 10     273483 1800-05-21 TBB         80 TBB_DEF_70-99_0_0 32F3   
 4 Atlantis 10     274268 1800-05-27 TBB         80 TBB_DEF_70-99_0_0 32F3   
 5 Atlantis 10     316938 1801-09-08 TBB         80 TBB_DEF_70-99_0_0 32F3   
 6 Atlantis 10     317799 1801-09-14 TBB         80 TBB_DEF_70-99_0_0 32F3   
 7 Atlantis 10     318270 1801-09-22 TBB         80 TBB_DEF_70-99_0_0 32F3   
 8 Atlantis 10     318999 1801-09-29 TBB         80 TBB_DEF_70-99_0_0 31F2   
 9 Atlantis 1000   271704 1800-05-07 TBB         80 TBB_DEF_70-99_0_0 34F4   
10 Atlantis 1000   272438 1800-05-14 TBB         80 TBB_DEF_70-99_0_0 34F4   
# ℹ 2,509 more rows

Code

highest_trip <-
  events |> 
  group_by(VE_COU, VE_REF, FT_REF, LE_GEAR, LE_MSZ, LE_MET6, LE_RECT) |> 
  summarise(LE_KG = sum(LE_KG),
            .groups = "drop") |> 
  arrange(desc(LE_KG)) |> 
  group_by(VE_COU, VE_REF, FT_REF) |> 
  slice(1) |> 
  ungroup() |> 
  select(VE_COU, VE_REF, FT_REF, .LE_GEAR = LE_GEAR, .LE_MSZ = LE_MSZ, 
         .LE_MET6 = LE_MET6, .LE_RECT = LE_RECT)
highest_trip

# A tibble: 2,519 × 7
   VE_COU   VE_REF FT_REF .LE_GEAR .LE_MSZ .LE_MET6          .LE_RECT
   <chr>    <chr>  <chr>  <chr>      <dbl> <chr>             <chr>   
 1 Atlantis 10     272058 TBB           80 TBB_DEF_70-99_0_0 32F2    
 2 Atlantis 10     273424 TBB           80 TBB_DEF_70-99_0_0 32F3    
 3 Atlantis 10     273483 TBB           80 TBB_DEF_70-99_0_0 32F3    
 4 Atlantis 10     274268 TBB           80 TBB_DEF_70-99_0_0 32F3    
 5 Atlantis 10     316938 TBB           80 TBB_DEF_70-99_0_0 32F3    
 6 Atlantis 10     317799 TBB           80 TBB_DEF_70-99_0_0 32F3    
 7 Atlantis 10     318270 TBB           80 TBB_DEF_70-99_0_0 32F3    
 8 Atlantis 10     318999 TBB           80 TBB_DEF_70-99_0_0 31F2    
 9 Atlantis 1000   271704 TBB           80 TBB_DEF_70-99_0_0 34F4    
10 Atlantis 1000   272438 TBB           80 TBB_DEF_70-99_0_0 34F4    
# ℹ 2,509 more rows

Add test here to illustrate the intent

2.4 tacsat: Checks

Code

it_min <- 5 * 60 # Interval threshold minimum [units: seconds]
tacsat <- 
  tacsat |> 
  group_by(VE_COU, VE_REF) |> 
  mutate(dt = c(it_min, diff(unclass(SI_DATIM)))) |>
  ungroup() |> 
  mutate(
    checks = 
      case_when(
        is.na(VE_COU) ~ "00 1 no country id",
        is.na(VE_REF) ~ "00 2 no vessel id",
        is.na(SI_DATIM) ~ "00 3 no time",
        is.na(SI_LONG) ~  "00 4 no lon",
        is.na(SI_LATI) ~  "00 5 no lat",
        is.na(SI_SP) ~ "00 6 no speed",
        base::duplicated(paste(VE_COU, VE_REF, SI_LONG, SI_LATI, SI_DATIM)) ~ "02 duplicates",
        !between(SI_LONG, -180, 180) | !between(SI_LATI, -90, 90) ~ "03 coordinates out of bound",
        dt < it_min ~ "04 time interval exceeds threshold",
        .default = "ok")
  )

tacsat |>
  count(checks) |>
  mutate('%' = round(n / sum(n) * 100, 2)) |>
  knitr::kable(caption = "VMS: All records not 'ok' will be filtered downstream")

VMS: All records not ‘ok’ will be filtered downstream
checks	n	%
00 6 no speed	11900	13.84
04 time interval exceeds threshold	1128	1.31
ok	72954	84.85

Code

tacsat <- 
  tacsat |> 
  filter(str_starts(checks, "ok")) |>
  select(-c(checks))
trails <- tacsat  # for myway

3 Analysis

3.1 Add trip information to pings

First we add trip using the legacy vmstools::mergeEflalo2Tacsat
Then use a loop to add fixed trip related variables, (We do not expect vessel length, power or tonnage to change within a trip)

Code

tacsat <-   # normally called tacsatp in the code flow 
  mergeEflalo2Tacsat(eflalo |> as.data.frame(),
                     tacsat |> as.data.frame())
# NOTE: event information added downstream - to clarify thinking
cols <- c("VE_LEN", "VE_KW")
for (col in cols) {
  tacsat[[col]] <- eflalo[[col]][match(tacsat$FT_REF, eflalo$FT_REF)]
}

Code

trails <- 
  trails |> 
  left_join(trips,
            by = join_by(VE_COU, VE_REF, between(SI_DATIM, D_DATIM, L_DATIM)),
            relationship = "many-to-one") |> 
  # No longer needed
  select(-c(D_DATIM, L_DATIM))

Code

comparison <- 
  tacsat |> 
  as_tibble() |> 
  select(.pid, VE_COU, VE_REF, FT_REF, length_dc = VE_LEN, kw_dc = VE_KW) |> 
  left_join(trails |> select(.pid, VE_COU, VE_REF, FT_REF, length_mw = VE_LEN, kw_mw = VE_KW),
            by = join_by(.pid, VE_COU, VE_REF, FT_REF)) |> 
  filter(length_dc != length_mw | kw_dc != kw_mw) |> 
  select(-c(.pid, FT_REF)) |> 
  distinct()
tacsat |> 
  as_tibble() |> 
  select(.pid, VE_COU, VE_REF, FT_REF, length_dc = VE_LEN, kw_dc = VE_KW) |> 
  left_join(trails |> select(.pid, VE_COU, VE_REF, FT_REF, length_mw = VE_LEN, kw_mw = VE_KW)) |> 
  filter(length_dc != length_mw | kw_dc != kw_mw) |> 
  select(-c(.pid, FT_REF)) |> 
  distinct() |> 
  left_join(eflalo_org |> 
              filter(VE_REF %in% unique(comparison$VE_REF)) |> 
              select(VE_REF, VE_LEN, VE_KW) |> 
              distinct()) |> 
  knitr::kable(caption = "Assigning vessel info to pings - discrepancies\n'_dc': datacall, '_mw: myway, original from eflalo\nSeems like myway same as the original")

Assigning vessel info to pings - discrepancies ’_dc’: datacall, ’_mw: myway, original from eflalo Seems like myway same as the original
VE_COU	VE_REF	length_dc	kw_dc	length_mw	kw_mw	VE_LEN	VE_KW
Atlantis	729	24.73	221	25.00000	221	25.00000	221
Atlantis	864	25.27	184	17.52212	111	17.52212	111
Atlantis	915	24.50	349	24.00000	221	24.00000	221

Seems “myway” is correct given the original, so something to check in the datacall flow

3.2 Add event information to pings

This step is is not a precursor to the next step - consider rearranging …

Code

## Add gear, mesh and met6 to pings --------------------------------------------
### datacall ----------------------------------------------------------------
# Now add the event info - this step needed so that next step works - could be simplified
cols <- c("LE_GEAR", "LE_MSZ", "LE_RECT", "LE_MET6")
for (col in cols) {
  tacsat[[col]] <- eflalo[[col]][match(tacsat$FT_REF, eflalo$FT_REF)]
}

tacsat_LE_GEAR <- trip_assign(tacsat, eflalo, col = "LE_GEAR",  haul_logbook = F)
# Here and subsequently, replaced %!in% with %in%, moving the ! upfront
tacsat <- rbindlist(list(tacsat[!tacsat$FT_REF %in% tacsat_LE_GEAR$FT_REF,], tacsat_LE_GEAR), fill = T, ignore.attr=TRUE)

tacsat_LE_MSZ <- trip_assign(tacsat, eflalo, col = "LE_MSZ",  haul_logbook = F)
tacsat <- rbindlist(list(tacsat[!tacsat$FT_REF %in% tacsat_LE_MSZ$FT_REF,], tacsat_LE_MSZ), fill = T)

tacsat_LE_RECT <- trip_assign(tacsat, eflalo, col = "LE_RECT",  haul_logbook = F)
tacsat <- rbindlist(list(tacsat[!tacsat$FT_REF %in% tacsat_LE_RECT$FT_REF,], tacsat_LE_RECT), fill = T)

tacsat_LE_MET <- trip_assign(tacsat, eflalo, col = "LE_MET6",  haul_logbook = F)
tacsat <- rbindlist(list(tacsat[!tacsat$FT_REF %in% tacsat_LE_MET$FT_REF,], tacsat_LE_MET), fill = T)

Here we use a left join to the trails and the final values consolidated. Take note that the joining is as follows:

1. highest_event (    fishing day): join_by(VE_COU, VE_REF, FT_REF, SI_DATE == LE_CDAT)
2. highest_trips (non-fishing day): join_by(VE_COU, VE_REF, FT_REF)

Code

trails <- 
  trails |> 
  left_join(highest_event,
            by = join_by(VE_COU, VE_REF, FT_REF, SI_DATE == LE_CDAT),
            relationship = "many-to-one") |> 
  left_join(highest_trip,
            by = join_by(VE_COU, VE_REF, FT_REF),
            relationship = "many-to-one") |> 
  mutate(LE_GEAR = case_when(is.na(LE_GEAR) & !is.na(.LE_GEAR) ~ .LE_GEAR,
                             .default = LE_GEAR),
         LE_MSZ = case_when(is.na(LE_MSZ) & !is.na(.LE_MSZ) ~ .LE_MSZ,
                            .default = LE_MSZ),
         LE_MET6 = case_when(is.na(LE_MET6) & !is.na(.LE_MET6) ~ .LE_MET6,
                             .default = LE_MET6),
         LE_RECT = case_when(is.na(LE_RECT) & !is.na(.LE_RECT) ~ .LE_RECT,
                             .default = LE_RECT)) |> 
  select(-c(.LE_GEAR, .LE_MSZ, .LE_MET6, .LE_RECT))

Code

comparison <- 
  tacsat |> 
  as_tibble() |> 
  select(.pid, VE_COU, VE_REF, FT_REF, LE_GEAR, LE_MSZ, LE_MET6, LE_RECT) |> 
  left_join(trails |> select(.pid, VE_COU, VE_REF, FT_REF, 
                             .LE_GEAR = LE_GEAR, .LE_MSZ = LE_MSZ, 
                             .LE_MET6 = LE_MET6, .LE_RECT = LE_RECT),
            by = join_by(.pid, VE_COU, VE_REF, FT_REF)) |> 
  # reorder
  select(.pid, VE_COU, VE_REF, FT_REF, LE_GEAR, .LE_GEAR, LE_MSZ, .LE_MSZ, LE_MET6, .LE_MET6, LE_RECT, .LE_RECT)
comparison |> 
  #filter(LE_GEAR != .LE_GEAR  | LE_MSZ != .LE_MSZ | LE_MET6 != .LE_MET6) |> 
  mutate(gear = if_else(LE_GEAR == .LE_GEAR, "yes", "no", "yes_na"),
         mesh = if_else(LE_MSZ  == .LE_MSZ, "yes", "no", "yes_na"),
         met6 = if_else(LE_MET6 == .LE_MET6, "yes", "no", "yes_na"),
         rect = if_else(LE_RECT   == .LE_RECT, "yes", "no", "yes_na")) |> 
  count(gear, mesh, met6, rect) |> 
  mutate('%' = round(n / sum(n) * 100, 2)) |> 
  knitr::kable(caption = "Assigning events info to pings - discrepancies between datacall vs. myway\nNeeds a revisit, at least the rectangles")

Assigning events info to pings - discrepancies between datacall vs. myway Needs a revisit, at least the rectangles
gear	mesh	met6	rect	n	%
no	yes	yes	no	33	0.05
yes	no	yes	no	738	1.01
yes	yes	yes	no	24089	33.02
yes	yes	yes	yes	32609	44.70
yes_na	yes_na	yes_na	yes_na	15485	21.23

Code

# For now let's look at the event records were gear and rectangles are not the same (33 records)
problem_trips <- 
  comparison |> 
  #filter(LE_GEAR != .LE_GEAR  | LE_MSZ != .LE_MSZ | LE_MET6 != .LE_MET6) |> 
  filter(LE_GEAR != .LE_GEAR  & LE_RECT != .LE_RECT) |> 
  select(VE_COU, VE_REF, FT_REF) |> 
  distinct()
problem_events <- 
  problem_trips |> 
  left_join(events |> select(VE_REF, FT_REF, LE_CDAT, LE_GEAR, LE_RECT, LE_KG)) |> 
  arrange(LE_CDAT, desc(LE_KG))
problem_events |> knitr::kable(caption = "Events table, original data")

Events table, original data
VE_COU	VE_REF	FT_REF	LE_CDAT	LE_GEAR	LE_RECT	LE_KG
Atlantis	1527	270892	1800-04-29	TBB	36F5	688.5230
Atlantis	1527	270892	1800-04-29	OTB	34F4	373.7451
Atlantis	1527	270892	1800-04-29	TBB	34F4	319.9824

Code

problem_events |> 
  inner_join(trails) |> 
  arrange(SI_DATIM) |> 
  select(.pid, SI_DATIM, VE_REF, FT_REF, LE_CDAT, SI_DATE, LE_GEAR, LE_RECT, LE_KG) |> 
  #distinct(VE_REF, FT_REF, LE_CDAT, SI_DATE, LE_GEAR, LE_RECT, LE_KG, .keep_all = TRUE) |> 
  knitr::kable(caption = "Discrepancies - myway")

Discrepancies - myway
.pid	SI_DATIM	VE_REF	FT_REF	LE_CDAT	SI_DATE	LE_GEAR	LE_RECT	LE_KG
3696	1800-04-28 09:46:00	1527	270892	1800-04-29	1800-04-28	TBB	36F5	688.523
3699	1800-04-28 11:42:00	1527	270892	1800-04-29	1800-04-28	TBB	36F5	688.523
3702	1800-04-28 13:36:00	1527	270892	1800-04-29	1800-04-28	TBB	36F5	688.523
3705	1800-04-28 15:32:00	1527	270892	1800-04-29	1800-04-28	TBB	36F5	688.523
3708	1800-04-28 17:28:00	1527	270892	1800-04-29	1800-04-28	TBB	36F5	688.523
3711	1800-04-28 19:22:00	1527	270892	1800-04-29	1800-04-28	TBB	36F5	688.523
3714	1800-04-28 21:18:00	1527	270892	1800-04-29	1800-04-28	TBB	36F5	688.523
3716	1800-04-28 23:12:00	1527	270892	1800-04-29	1800-04-28	TBB	36F5	688.523
3718	1800-04-29 01:08:00	1527	270892	1800-04-29	1800-04-29	TBB	36F5	688.523
3720	1800-04-29 03:04:00	1527	270892	1800-04-29	1800-04-29	TBB	36F5	688.523
3722	1800-04-29 04:58:00	1527	270892	1800-04-29	1800-04-29	TBB	36F5	688.523
3724	1800-04-29 06:54:00	1527	270892	1800-04-29	1800-04-29	TBB	36F5	688.523
3726	1800-04-29 08:48:00	1527	270892	1800-04-29	1800-04-29	TBB	36F5	688.523
3728	1800-04-29 10:44:00	1527	270892	1800-04-29	1800-04-29	TBB	36F5	688.523
3731	1800-04-29 14:34:00	1527	270892	1800-04-29	1800-04-29	TBB	36F5	688.523
3734	1800-04-29 18:24:00	1527	270892	1800-04-29	1800-04-29	TBB	36F5	688.523
3736	1800-04-29 20:20:00	1527	270892	1800-04-29	1800-04-29	TBB	36F5	688.523
3739	1800-04-30 00:10:00	1527	270892	1800-04-29	1800-04-30	TBB	36F5	688.523
3741	1800-04-30 02:06:00	1527	270892	1800-04-29	1800-04-30	TBB	36F5	688.523
3744	1800-04-30 05:56:00	1527	270892	1800-04-29	1800-04-30	TBB	36F5	688.523
3746	1800-04-30 07:50:00	1527	270892	1800-04-29	1800-04-30	TBB	36F5	688.523
3748	1800-04-30 09:46:00	1527	270892	1800-04-29	1800-04-30	TBB	36F5	688.523
3750	1800-04-30 11:42:00	1527	270892	1800-04-29	1800-04-30	TBB	36F5	688.523
3752	1800-04-30 13:36:00	1527	270892	1800-04-29	1800-04-30	TBB	36F5	688.523
3754	1800-04-30 15:32:00	1527	270892	1800-04-29	1800-04-30	TBB	36F5	688.523
3756	1800-04-30 17:28:00	1527	270892	1800-04-29	1800-04-30	TBB	36F5	688.523
3760	1800-04-30 19:22:00	1527	270892	1800-04-29	1800-04-30	TBB	36F5	688.523
3764	1800-04-30 21:18:00	1527	270892	1800-04-29	1800-04-30	TBB	36F5	688.523
3768	1800-04-30 23:12:00	1527	270892	1800-04-29	1800-04-30	TBB	36F5	688.523
3772	1800-05-01 01:08:00	1527	270892	1800-04-29	1800-05-01	TBB	36F5	688.523
3776	1800-05-01 03:02:00	1527	270892	1800-04-29	1800-05-01	TBB	36F5	688.523
3780	1800-05-01 04:58:00	1527	270892	1800-04-29	1800-05-01	TBB	36F5	688.523
3784	1800-05-01 06:54:00	1527	270892	1800-04-29	1800-05-01	TBB	36F5	688.523

Code

problem_events |> 
  inner_join(tacsat) |> 
  arrange(SI_DATIM) |> 
  select(.pid, SI_DATIM, VE_REF, FT_REF, LE_CDAT, SI_DATE, LE_GEAR, LE_RECT, LE_KG) |> 
  #distinct(VE_REF, FT_REF, LE_CDAT, SI_DATE, LE_GEAR, LE_RECT, LE_KG, .keep_all = TRUE) |> 
  knitr::kable(caption = "Discrepancies - datacall")

Discrepancies - datacall
.pid	SI_DATIM	VE_REF	FT_REF	LE_CDAT	SI_DATE	LE_GEAR	LE_RECT	LE_KG
3696	1800-04-28 09:46:00	1527	270892	1800-04-29	1800-04-28	OTB	34F4	373.7451
3699	1800-04-28 11:42:00	1527	270892	1800-04-29	1800-04-28	OTB	34F4	373.7451
3702	1800-04-28 13:36:00	1527	270892	1800-04-29	1800-04-28	OTB	34F4	373.7451
3705	1800-04-28 15:32:00	1527	270892	1800-04-29	1800-04-28	OTB	34F4	373.7451
3708	1800-04-28 17:28:00	1527	270892	1800-04-29	1800-04-28	OTB	34F4	373.7451
3711	1800-04-28 19:22:00	1527	270892	1800-04-29	1800-04-28	OTB	34F4	373.7451
3714	1800-04-28 21:18:00	1527	270892	1800-04-29	1800-04-28	OTB	34F4	373.7451
3716	1800-04-28 23:12:00	1527	270892	1800-04-29	1800-04-28	OTB	34F4	373.7451
3718	1800-04-29 01:08:00	1527	270892	1800-04-29	1800-04-29	OTB	34F4	373.7451
3720	1800-04-29 03:04:00	1527	270892	1800-04-29	1800-04-29	OTB	34F4	373.7451
3722	1800-04-29 04:58:00	1527	270892	1800-04-29	1800-04-29	OTB	34F4	373.7451
3724	1800-04-29 06:54:00	1527	270892	1800-04-29	1800-04-29	OTB	34F4	373.7451
3726	1800-04-29 08:48:00	1527	270892	1800-04-29	1800-04-29	OTB	34F4	373.7451
3728	1800-04-29 10:44:00	1527	270892	1800-04-29	1800-04-29	OTB	34F4	373.7451
3731	1800-04-29 14:34:00	1527	270892	1800-04-29	1800-04-29	OTB	34F4	373.7451
3734	1800-04-29 18:24:00	1527	270892	1800-04-29	1800-04-29	OTB	34F4	373.7451
3736	1800-04-29 20:20:00	1527	270892	1800-04-29	1800-04-29	OTB	34F4	373.7451
3739	1800-04-30 00:10:00	1527	270892	1800-04-29	1800-04-30	OTB	34F4	373.7451
3741	1800-04-30 02:06:00	1527	270892	1800-04-29	1800-04-30	OTB	34F4	373.7451
3744	1800-04-30 05:56:00	1527	270892	1800-04-29	1800-04-30	OTB	34F4	373.7451
3746	1800-04-30 07:50:00	1527	270892	1800-04-29	1800-04-30	OTB	34F4	373.7451
3748	1800-04-30 09:46:00	1527	270892	1800-04-29	1800-04-30	OTB	34F4	373.7451
3750	1800-04-30 11:42:00	1527	270892	1800-04-29	1800-04-30	OTB	34F4	373.7451
3752	1800-04-30 13:36:00	1527	270892	1800-04-29	1800-04-30	OTB	34F4	373.7451
3754	1800-04-30 15:32:00	1527	270892	1800-04-29	1800-04-30	OTB	34F4	373.7451
3756	1800-04-30 17:28:00	1527	270892	1800-04-29	1800-04-30	OTB	34F4	373.7451
3760	1800-04-30 19:22:00	1527	270892	1800-04-29	1800-04-30	OTB	34F4	373.7451
3764	1800-04-30 21:18:00	1527	270892	1800-04-29	1800-04-30	OTB	34F4	373.7451
3768	1800-04-30 23:12:00	1527	270892	1800-04-29	1800-04-30	OTB	34F4	373.7451
3772	1800-05-01 01:08:00	1527	270892	1800-04-29	1800-05-01	OTB	34F4	373.7451
3776	1800-05-01 03:02:00	1527	270892	1800-04-29	1800-05-01	OTB	34F4	373.7451
3780	1800-05-01 04:58:00	1527	270892	1800-04-29	1800-05-01	OTB	34F4	373.7451
3784	1800-05-01 06:54:00	1527	270892	1800-04-29	1800-05-01	OTB	34F4	373.7451

Code

# TODO: Check why something is wrotten in the State of Denmark
print("There is a pending issue, the two 'ways' do not give the same results")

[1] "There is a pending issue, the two 'ways' do not give the same results"

3.3 Interval and state

Here I will keep things simple, applying a simple speed filter for both the datacall and myway, also throwing in interval along the way:

Code

speed_criteria <- 
  tribble(~LE_GEAR, ~s1,  ~s2,
          "GN",       1,    6,
          "GNS",      1,    6,
          "OTB",      1,    6,
          "OTM",      1,    6,
          "PTB",      1,    6,
          "TBB",      1,    6)

tacsat <- 
  tacsat |>
  arrange(VE_COU, VE_REF, SI_DATIM) |> 
  group_by(VE_COU, VE_REF, FT_REF) |> 
  mutate(INTV = c(NA_real_, diff(unclass(SI_DATIM))) / 60) |> # in minutes
  fill(INTV, .direction = "up") |> 
  ungroup() |> 
  left_join(speed_criteria) |> 
  mutate(SI_STATE = case_when(is.na(SI_SP) ~ NA,
                              between(SI_SP, s1, s2) ~ 1,
                              !between(SI_SP, s1, s2) ~ 0,
                              .default = -9999)) |> 
  select(-c(s1, s2))

trails <- 
  trails |>
  arrange(VE_COU, VE_REF, SI_DATIM) |> 
  group_by(VE_COU, VE_REF, FT_REF) |> 
  mutate(INTV = c(NA_real_, diff(unclass(SI_DATIM))) / 60) |> # in minutes
  fill(INTV, .direction = "up") |> 
  ungroup() |> 
  left_join(speed_criteria) |> 
  mutate(SI_STATE = case_when(is.na(SI_SP) ~ NA,
                              between(SI_SP, s1, s2) ~ 1,
                              !between(SI_SP, s1, s2) ~ 0,
                              .default = -9999)) |> 
  select(-c(s1, s2))

interval checks (do a proper one, should also check for negative intverals)

Code

table(tacsat$INTV == 0, useNA = "ifany")


FALSE  <NA> 
72944    10

Code

table(trails$INTV == 0, useNA = "ifany")


FALSE  <NA> 
72944    10

Speed checks:

Code

tacsat |> 
  mutate(in_eflalo = ifelse(FT_REF %in% unique(eflalo$FT_REF), "yes", "no")) |>  
  count(in_eflalo, SI_STATE)

# A tibble: 3 × 3
  in_eflalo SI_STATE     n
  <chr>        <dbl> <int>
1 no           -9999 15485
2 yes              0 29826
3 yes              1 27643

Code

trails |> 
  mutate(in_eflalo = ifelse(FT_REF %in% unique(eflalo$FT_REF), "yes", "no")) |>  
  count(in_eflalo, SI_STATE)

# A tibble: 3 × 3
  in_eflalo SI_STATE     n
  <chr>        <dbl> <int>
1 no           -9999 15485
2 yes              0 29826
3 yes              1 27643

Code

trails |> 
  mutate(in_eflalo = ifelse(FT_REF %in% unique(eflalo$FT_REF), "yes", "no")) |> 
  filter(SI_STATE >= 0) |> 
  count(in_eflalo, SI_STATE, LE_GEAR)

# A tibble: 9 × 4
  in_eflalo SI_STATE LE_GEAR     n
  <chr>        <dbl> <chr>   <int>
1 yes              0 GNS         9
2 yes              0 OTB      1226
3 yes              0 OTM      4763
4 yes              0 PTB        30
5 yes              0 TBB     23798
6 yes              1 OTB      3168
7 yes              1 OTM      4069
8 yes              1 PTB        43
9 yes              1 TBB     20363

For now remove the records where there is not LE_GEAR assigned to trips and SI_STATE is -9999

Code

tacsat <- tacsat |> filter(!is.na(LE_GEAR), INTV > 0)
trails <- trails |> filter(!is.na(LE_GEAR), INTV > 0)

3.4 Split among pings

Code

eflalo2 <- 
  eflalo |> 
  #filter(FT_REF == TRIP) |> 
  # vmstools::splitAmongPings checks if trips cross newyear and stops if any
  #  hence need these (Note: not in d/m/y format, may cause trouble)
  mutate(FT_DDAT = as.character(as_date(D_DATIM)),
         FT_DTIME = str_sub(as.character(D_DATIM), 12),
         FT_LDAT = as.character(as_date(L_DATIM)),
         FT_LTIME = str_sub(as.character(L_DATIM), 12))

tacsat2 <-
  tacsat |> 
  #filter(FT_REF == TRIP) |> 
  select(-LE_RECT)

tacsatEflalo <-  splitAmongPings(
  # create an issue in [vmstools}
  tacsat = tacsat2 |> as.data.frame(),
  eflalo = eflalo2 |> as.data.frame(),
  variable = "all",
  level = c("trip", "ICESrectangle", "day"),
  conserve = FALSE 
  #, by = "INTV"
) |> 
  as_tibble()

[1] "level: day"

[1] "kg in eflalo 56868810"
[1] "kg in merged tacsat 56868810"
[1] "kg removed from eflalo 56868810"
[1] "level: rectangle"
[1] "kg in eflalo 177646931"
[1] "kg in merged tacsat 177646931"
[1] "kg removed from eflalo 177646931"
[1] "level: trip"
[1] "kg in eflalo 28941866"
[1] "kg in merged tacsat 28941866"
[1] "kg removed from eflalo 28941866"

Code

# Fix variable names upstream, idea is to make a distinction between
#  Standard TACSAT and EFLALO variable names and types and 
#  derived or external variables
trails2 <- 
  trails |> 
  select(-LE_RECT) |> 
  rename(time = SI_DATIM,
         state = SI_STATE) |> 
  mutate(ir = ramb::rb_d2ir(SI_LONG, SI_LATI),
         state = as.integer(state))
events2 <- 
  events |> 
  rename(date = LE_CDAT) |> 
  group_by(VE_COU, VE_REF, FT_REF, LE_RECT, date) |> 
  summarise(LE_KG = sum(LE_KG, na.rm = TRUE),
            LE_EURO = sum(LE_EURO, na.rm = TRUE),
            .groups = "drop") |> 
  mutate(.eid = as.integer(1:n()),
         .before = VE_COU)
trails_spread2 <- 
  ramb::dc_spread_cash_and_catch(
    trails = trails2,
    events = events2,
    remove_diagnostic = FALSE
  )

Code

comp <-
  tacsatEflalo |> 
  select(FT_REF, .pid, dc = LE_KG) |> 
  full_join(trails_spread2 |> 
              select(.pid, rb = LE_KG, .how, ir))
comp |> 
  group_by(FT_REF) |> 
  summarise(dc = sum(dc, na.rm = TRUE),
            rb = sum(rb, na.rm = TRUE)) |> 
  filter(!near(dc, rb)) |> 
  knitr::kable(caption = "List of FT_REFs that are not the same")

List of FT_REFs that are not the same
FT_REF	dc	rb
273053	3975.201	2875.4039
317846	1604.871	519.3498

Code

comp |> 
  ggplot(aes(dc / 1e3, rb / 1e3)) +
  theme_bw() +
  geom_abline(linewidth = 2, colour = "pink") +
  geom_point(size = 0.5) +
  scale_x_log10() +
  scale_y_log10() +
  labs(title = "Comparison of catch spread",
       x = "datacall", y = "myway",
       caption = "Guess there is more than one way to skin a cat")

3.5 Appendix

Copilot contronts the datacall scripts

This workflow provides a reproducible, quality-controlled process to prepare, link, and summarize Vessel Monitoring System (VMS) and logbook (EFLALO) data for submission to the ICES VMS and Logbook Data Call. The process is organized into three main sections, each with clear substeps and quality control.

3.6 1. Pre-processing and Cleaning

Static Data Preparation:
- Load static reference datasets such as harbours and ICES statistical areas.
- Prepare harbour polygons (with buffer) and standardize coordinate encodings.
Yearly Loop:
For each year:
- Data Loading:
  - Load and format raw TACSAT (VMS) and EFLALO (logbook) data.
- TACSAT Cleaning:
  - Remove records outside ICES areas, with impossible coordinates, duplicates, pseudo-duplicates, and records within harbours.
  - Track and log record exclusions at each stage.
  - Save cleaned TACSAT data.
- EFLALO Cleaning:
  - Remove duplicate trips, invalid or missing dates, trips outside the year, and arrival-before-departure errors.
  - Detect and remove overlapping trips.
  - Check gear and metier codes against ICES vocabularies.
  - Save cleaned EFLALO data and exclusion statistics.

3.7 2. Linking, Spatial Enrichment, and Effort Calculation

Spatial Data Setup:
- Load and prepare habitat and bathymetry layers for spatial joins.
Yearly Loop:
For each year:
- Trip Assignment:
  - Link VMS records to logbook trips by matching date/time windows.
  - Save unmatched records for quality control.
  - Merge trip and vessel attributes (gear, mesh size, vessel length, etc.) to VMS records.
- Advanced Assignment:
  - Refine assignment for trips with multiple gears or ICES rectangles using event-level data.
- Effort and Activity Status:
  - Calculate and cap time intervals between VMS points.
  - Remove incomplete records.
  - Classify vessel positions as fishing or steaming (using speed-based methods).
  - Retain only valid metier codes.
  - Distribute logbook landings among VMS pings based on fishing effort (interval).
- Spatial and Effort Enrichment:
  - Join habitat and bathymetry data to VMS points.
  - Assign C-square codes, year, and month.
  - Compute kilowatt-hour and swept area metrics.
  - Assign gear width (using either models or lookup rules).
  - Save enriched datasets.

3.8 3. Construction and Output of Summary Tables

Yearly Table Construction:
For each year:
- Table 1:
  - VMS-level records with spatial, temporal, effort, vessel, and gear information.
- Table 2:
  - Logbook-level records with trip-level effort, catch, and VMS-link status.
Postprocessing:
- Merge tables across years.
- Anonymize vessel IDs and assign vessel length classes using ICES vocabulary.
- Aggregate and summarize by key groupings (e.g., C-square, metier, vessel length, habitat).
- Compute summary statistics like effort, catch, swept area, and vessel counts.
- Generate field type and NA summaries for QC.
Vocabulary and QC Checks:
- Check all key fields against ICES DATSU vocabularies (metiers, gear, vessel length, ICES rectangles, country codes, etc.).
- Filter out or correct invalid/missing values and log exclusions.
Final Output:
- Save tables in ICES-ready CSV format for submission.
- Provide in-script reporting and intermediate files for transparency and troubleshooting.

3.9 Key Points

Transparency and Traceability: Every major step and QC point is traceable, with annotation and data snapshots for review and troubleshooting.
Reproducibility and Standards Compliance: All processing follows ICES data standards and is reproducible.
Comprehensive Data Preparation: The workflow ensures harmonized, linked, and spatially enriched data, ready for robust aggregation and ICES submission.

3.10 Appendix - the datacall code, the short version

#'------------------------------------------------------------------------------
#
# Script: Extract and process VMS and logbook data for ICES VMS data call
# Sections:
#   1. Pre-process and clean TACSAT and EFLALO data
#   2. Link TACSAT and EFLALO, add spatial and effort info
#   3. Construct and output summary tables
#
#'------------------------------------------------------------------------------

# Section 1: Pre-process and clean TACSAT and EFLALO data ----------------------

library(icesVocab)
library(vmstools)
library(data.table)
library(tidyverse)
library(sf)
source("0_global_functions.R")
source("0_global_settings.R")

# 1.1 Load and prepare static data ---------------------------------------------
data(harbours)
data(ICESareas)

harbours_alt <- 
  harbours |>
  dplyr::mutate(harbour = iconv(harbour, from = "latin1", to = "UTF-8")) |>
  as_tibble() |>
  sf::st_as_sf(coords = c("lon", "lat"), crs = 4326) |>
  sf::st_transform(crs = 3857) |>
  sf::st_buffer(dist = 3000) |>
  sf::st_transform(crs = 4326) |>
  dplyr::select(harbour)

# 1.2 Loop through years to process data ---------------------------------------
for (year in yearsToSubmit) {
  print(paste0("Start loop for year ", year))
  
  # 1.2.1 Load TACSAT and EFLALO -----------------------------------------------
  tacsat_name <- load(file.path(dataPath, paste0("tacsat_", year, ".RData")))
  eflalo_name <- load(file.path(dataPath, paste0("eflalo_", year, ".RData")))
  tacsat <- data.frame(get(tacsat_name))
  eflalo <- data.frame(get(eflalo_name))
  tacsat <- formatTacsat(tacsat)
  eflalo <- formatEflalo(eflalo)
  
  # 1.2.2 Clean TACSAT ---------------------------------------------------------
  remrecsTacsat <- matrix(
    NA, nrow = 6, ncol = 2,
    dimnames = list(
      c("total", "outsideICESarea", "duplicates", "notPossible", "pseudoDuplicates", "harbour"),
      c("rows", "percentage"))
  )
  remrecsTacsat["total", ] <- c(nrow(tacsat), "100%")
  ia <- transform_to_sf(ICESareas, coords = c("SI_LONG", "SI_LATI"))
  tacsat <- transform_to_sf(tacsat, coords = c("SI_LONG", "SI_LATI"))
  ia <- ia %>% sf::st_make_valid() %>% sf::st_transform(4326) |> sf::st_zm()
  overs <- sf::st_intersects(tacsat, ia)
  tacsat <- tacsat[lengths(overs) > 0, ]
  percentage_remaining <- round(nrow(tacsat) / as.numeric(remrecsTacsat["total", 1]) * 100, 2)
  remrecsTacsat["outsideICESarea", ] <- c(nrow(tacsat), percentage_remaining)
  tacsat$SI_DATIM <- as.POSIXct(paste(tacsat$SI_DATE, tacsat$SI_TIME), tz = "GMT", format = "%d/%m/%Y  %H:%M")
  tacsat$unique_id <- paste(tacsat$VE_REF, tacsat$SI_LATI, tacsat$SI_LONG, tacsat$SI_DATIM)
  tacsat <- tacsat[!duplicated(tacsat$unique_id), ]
  percentage_remaining <- round(nrow(tacsat) / as.numeric(remrecsTacsat["total", 1]) * 100, 2)
  remrecsTacsat["duplicates", ] <- c(nrow(tacsat), percentage_remaining)
  coords <- st_coordinates(tacsat)
  invalid_positions <- which(coords[,2] > 90 | coords[,2] < -90 | coords[,1] > 180 | coords[,1] < -180)
  if (length(invalid_positions) > 0) tacsat <- tacsat[-invalid_positions, ]
  percentage_remaining <- round(nrow(tacsat) / as.numeric(remrecsTacsat["total", 1]) * 100, 2)
  remrecsTacsat["notPossible", ] <- c(nrow(tacsat), percentage_remaining)
  tacsat <- sfsortTacsat(tacsat)
  tacsat$INTV <- intervalTacsat(as.data.frame(tacsat), level = "vessel", fill.na = TRUE)$INTV
  tacsat <- tacsat[tacsat$INTV >= intThres, ]
  percentage_remaining <- round(nrow(tacsat) / as.numeric(remrecsTacsat["total", 1]) * 100, 2)
  remrecsTacsat["pseudoDuplicates", ] <- c(nrow(tacsat), percentage_remaining)
  tacsat$INTV <- NULL
  overs <- sf::st_intersects(tacsat, harbours_alt)
  tacsat <- tacsat[!(lengths(overs) > 0), ]
  percentage_remaining <- round(nrow(tacsat) / as.numeric(remrecsTacsat["total", 1]) * 100, 2)
  remrecsTacsat["harbour", ] <- c(nrow(tacsat), percentage_remaining)
  save(remrecsTacsat, file = file.path(outPath, paste0("remrecsTacsat", year, ".RData")))
  tacsat <- as.data.frame(tacsat) %>% dplyr::select(-geometry, -unique_id)
  save(tacsat, file = file.path(outPath, paste("cleanTacsat", year, ".RData", sep = "")))
  message("Cleaning tacsat completed for year ", year)
  print(remrecsTacsat)
  
  # 1.2.3 Clean EFLALO ---------------------------------------------------------
  remrecsEflalo <- matrix(
    NA, nrow = 8, ncol = 2,
    dimnames = list(
      c("total", "duplicated", "impossible time", "before 1st Jan", "departArrival", "overlappingTrips", "MetierL4_LE_GEAR", "MetierL6_LE_MET"),
      c("rows", "percentage"))
  )
  remrecsEflalo["total", ] <- c(nrow(eflalo), "100%")
  trip_id <- create_trip_id(eflalo)
  eflalo <- eflalo[!duplicated(trip_id), ]
  num_records <- nrow(eflalo)
  percent_removed <- round((num_records - as.numeric(remrecsEflalo["total", 1])) / as.numeric(remrecsEflalo["total", 1]) * 100, 2)
  remrecsEflalo["duplicated", ] <- c(num_records, 100 + percent_removed)
  eflalo$FT_DDATIM <- convert_to_datetime(eflalo$FT_DDAT, eflalo$FT_DTIME)
  eflalo$FT_LDATIM <- convert_to_datetime(eflalo$FT_LDAT, eflalo$FT_LTIME)
  eflalo <- eflalo[!is.na(eflalo$FT_DDATIM) & !is.na(eflalo$FT_LDATIM), ]
  num_records <- nrow(eflalo)
  percent_removed <- round((num_records - as.numeric(remrecsEflalo["total", 1])) / as.numeric(remrecsEflalo["total", 1]) * 100, 2)
  remrecsEflalo["impossible time", ] <- c(num_records, 100 + percent_removed)
  eflalo <- remove_before_jan(eflalo, year)
  num_records <- nrow(eflalo)
  percent_removed <- round((num_records - as.numeric(remrecsEflalo["total", 1])) / as.numeric(remrecsEflalo["total", 1]) * 100, 2)
  remrecsEflalo["before 1st Jan", ] <- c(num_records, 100 + percent_removed)
  idx <- which(eflalo$FT_LDATIM >= eflalo$FT_DDATIM)
  eflalo <- eflalo[idx, ]
  remrecsEflalo["departArrival", ] <- c(
    nrow(eflalo),
    100 + round((nrow(eflalo) - as.numeric(remrecsEflalo["total", 1])) / as.numeric(remrecsEflalo["total", 1]) * 100, 2)
  )
  eflalo <- orderBy(~ VE_COU + VE_REF + FT_DDATIM + FT_LDATIM, data = eflalo)
  dt1 <- data.table(eflalo)[, .(VE_REF, FT_REF, FT_DDATIM, FT_LDATIM)]
  dt1 <- unique(dt1, by = c("VE_REF", "FT_REF"))
  setkey(dt1, VE_REF, FT_DDATIM, FT_LDATIM)
  dt2 <- dt1[, ref := .N > 1, by = key(dt1)][ref == TRUE]
  dt3 <- dt2[, .(FT_REF_NEW = FT_REF[1]), by = .(VE_REF, FT_DDATIM, FT_LDATIM)]
  dt4 <- merge(dt2, dt3)
  eflalo2 <- merge(data.table(eflalo), dt4, all.x = TRUE)
  eflalo2[!is.na(FT_REF_NEW), FT_REF := FT_REF_NEW]
  eflalo2[, FT_REF_NEW := NULL]
  eflalo <- data.frame(eflalo2)
  eflalo <- eflalo %>% dplyr::select(-ref)
  dt1 <- data.table(ID = eflalo$VE_REF, FT = eflalo$FT_REF,
                    startdate = eflalo$FT_DDATIM,
                    enddate = eflalo$FT_LDATIM)
  dt1 <- dt1[!duplicated(paste(dt1$ID, dt1$FT)), ]
  setkey(dt1, ID, startdate, enddate)
  result <- foverlaps(dt1, dt1, by.x = c("ID", "startdate", "enddate"),
                      by.y = c("ID", "startdate", "enddate"))
  overlapping.trips <- subset(result, startdate < i.enddate & enddate > i.startdate & FT != i.FT)
  if (nrow(overlapping.trips) > 0) {
    eflalo <- eflalo[!eflalo$FT_REF %in% overlapping.trips$FT, ]
    save(overlapping.trips, file = file.path(outPath, paste0("overlappingTrips", year, ".RData")))
  }
  num_records <- nrow(eflalo)
  percent_removed <- round((num_records - as.numeric(remrecsEflalo["total", 1])) / as.numeric(remrecsEflalo["total", 1]) * 100, 2)
  remrecsEflalo["overlappingTrips", ] <- c(num_records, 100 + percent_removed)
  m4_ices <- getCodeList("GearType")
  eflalo <- eflalo %>% filter(LE_GEAR %in% m4_ices$Key)
  remrecsEflalo["MetierL4_LE_GEAR", ] <- c(nrow(eflalo), round(nrow(eflalo) / as.numeric(remrecsEflalo["total", 1]) * 100, 2))
  m6_ices <- getCodeList("Metier6_FishingActivity")
  eflalo <- eflalo %>% filter(LE_MET %in% m6_ices$Key)
  remrecsEflalo["MetierL6_LE_MET", ] <- c(nrow(eflalo), round(nrow(eflalo) / as.numeric(remrecsEflalo["total", 1]) * 100, 2))
  save(remrecsEflalo, file = file.path(outPath, paste0("remrecsEflalo", year, ".RData")))
  save(eflalo, file = file.path(outPath, paste0("cleanEflalo", year, ".RData")))
  message("Cleaning eflalo completed for year ", year)
  print(remrecsEflalo)
}

rm(harbours, harbours_alt, ICESareas, tacsat_name, eflalo_name, tacsat, eflalo, remrecsTacsat, remrecsEflalo,
   ia, overs, coords, invalid_positions, trip_id, percent_removed, num_records, idx, dt1, result, overlapping.trips)
rm(list = ls(pattern = "_20"))

# Section 2: Link TACSAT and EFLALO, add spatial and effort info ---------------

library(sf)
sf::sf_use_s2(FALSE)
eusm <- readRDS(paste0(dataPath, "eusm.rds")) %>% st_transform(4326)
bathy <- readRDS(paste0(dataPath, "ICES_GEBCO.rds")) %>% st_set_crs(4326)

for (year in yearsToSubmit) {
  print(paste0("Start loop for year ", year))
  load(file = paste0(outPath, "/cleanEflalo", year, ".RData"))
  load(file = paste0(outPath, "/cleanTacsat", year, ".RData"))
  tacsat$geometry <- NULL
  
  # 2.1 Merge EFLALO trip info to TACSAT ---------------------------------------
  tacsatp <- mergeEflalo2Tacsat(eflalo, tacsat)
  tacsatp <- data.frame(tacsatp)
  tacsatpmin <- subset(tacsatp, FT_REF == 0)
  cat(sprintf("%.2f%% of tacsat data did not merge\n", (nrow(tacsatpmin) / nrow(tacsatp)) * 100))
  save(tacsatpmin, file = file.path(outPath, paste0("tacsatNotMerged", year, ".RData")))
  tacsatp <- subset(tacsatp, FT_REF != 0)
  cols <- c("LE_GEAR", "LE_MSZ", "VE_LEN", "VE_KW", "LE_RECT", "LE_MET", "LE_WIDTH", "VE_FLT", "VE_COU")
  for (col in cols) {
    tacsatp[[col]] <- eflalo[[col]][match(tacsatp$FT_REF, eflalo$FT_REF)]
  }
  for (col in c("LE_GEAR","LE_MSZ","LE_RECT","LE_MET")) {
    tripinfo <- trip_assign(tacsatp, eflalo, col = col, haul_logbook = FALSE)
    tacsatp <- rbindlist(list(tacsatp[tacsatp$FT_REF %!in% tripinfo$FT_REF,], tripinfo), fill = TRUE)
  }
  if ("LE_WIDTH" %in% names(eflalo)) {
    tripinfo <- trip_assign(tacsatp, eflalo, col = "LE_WIDTH", haul_logbook = FALSE)
    tacsatp <- rbindlist(list(tacsatp[tacsatp$FT_REF %!in% tripinfo$FT_REF,], tripinfo), fill = TRUE)
  }
  tacsatp <- as.data.frame(tacsatp)
  save(tacsatp, file = file.path(outPath, paste0("tacsatMerged", year, ".RData")))
  
  # 2.2 Calculate time interval, quality control, assign activity, merge landings ----
  tacsatp <- intvTacsat(tacsatp, level = "trip", fill.na = TRUE)
  tacsatp$INTV[tacsatp$INTV > intvThres] <- 2 * intvThres
  tacsatp$INTV[is.na(tacsatp$INTV)] <- intvThres
  idx <- which(is.na(tacsatp$VE_REF) | is.na(tacsatp$SI_LONG) | is.na(tacsatp$SI_LATI) |
                 is.na(tacsatp$SI_DATIM) | is.na(tacsatp$SI_SP))
  if (length(idx) > 0) tacsatp <- tacsatp[-idx, ]
  # [activity state assignment block omitted for brevity]
  save(tacsatp, file = file.path(outPath, paste0("tacsatActivity", year, ".RData")))
  kept <- nrow(tacsatp)
  removed <- nrow(tacsatp %>% filter(LE_MET %!in% valid_metiers))
  tacsatp <- tacsatp %>% filter(LE_MET %in% valid_metiers)
  cat(sprintf("%.2f%% of tacsatp removed due to invalid metier l6\n", (removed / (removed + kept) * 100)))
  idx_kg <- grep("LE_KG_", colnames(eflalo)[colnames(eflalo) %!in% c("LE_KG_TOT")])
  idx_euro <- grep("LE_EURO_", colnames(eflalo)[colnames(eflalo) %!in% c("LE_EURO_TOT")])
  if ("LE_KG_TOT" %!in% names(eflalo)) eflalo$LE_KG_TOT <- rowSums(eflalo[, idx_kg], na.rm = TRUE)
  if ("LE_EURO_TOT" %!in% names(eflalo)) eflalo$LE_EURO_TOT <- rowSums(eflalo[, idx_euro], na.rm = TRUE)
  eflalo <- eflalo[, -c(idx_kg, idx_euro)]
  eflaloM <- subset(eflalo, FT_REF %in% unique(tacsatp$FT_REF))
  eflaloNM <- subset(eflalo, !FT_REF %in% unique(tacsatp$FT_REF))
  message(sprintf("%.2f%% of the eflalo data not in tacsat\n", (nrow(eflaloNM) / (nrow(eflaloNM) + nrow(eflaloM)) * 100)))
  tacsatp$SI_STATE <- ifelse(tacsatp$SI_STATE == "f", 1, 0)
  tacsatp <- tacsatp[tacsatp$SI_STATE == 1 & !is.na(tacsatp$INTV), ]
  if ((sum(tacsatp$INTV == 0) > 0) || (sum(is.na(tacsatp$INTV)) > 0)) {
    message(sprintf("%.2f%% of intervals in tacsatp contain NA or zeros and have been discarded.\n", 
                    (sum(tacsatp$INTV == 0) + sum(is.na(tacsatp$INTV))) / nrow(tacsatp) * 100))
    tacsatp <- tacsatp %>% filter(!is.na(INTV) & INTV > 0)
  }
  tacsatEflalo <- splitAmongPings(
    tacsat = tacsatp,
    eflalo = eflaloM,
    variable = "all",
    level = c("day", "ICESrectangle", "trip"),
    conserve = TRUE, 
    by = "INTV"
  )
  eflalo$tripInTacsat <- ifelse(eflalo$FT_REF %in% tacsatEflalo$FT_REF, "Y", "N")
  save(tacsatEflalo, file = file.path(outPath, paste0("tacsatEflalo", year, ".RData")))
  save(eflalo, file = file.path(outPath, paste0("/processedEflalo", year, ".RData")))
  print("Dispatching landings completed")
}

# 2.3 Add habitat, depth, and effort refinements -------------------------------
for (year in yearsToSubmit) {
  print(paste0("Start loop for year ", year))
  load(file = paste0(outPath, "tacsatEflalo", year, ".RData"))
  tacsatEflalo <- tacsatEflalo |> 
    sf::st_as_sf(coords = c("SI_LONG", "SI_LATI"), remove = FALSE) |> 
    sf::st_set_crs(4326) |> 
    st_join(eusm, join = st_intersects) |> 
    st_join(bathy, join = st_intersects) |> 
    mutate(geometry = NULL) |> 
    data.frame()
  tacsatEflalo$Csquare <- CSquare(tacsatEflalo$SI_LONG, tacsatEflalo$SI_LATI, degrees = 0.05)
  tacsatEflalo$Year <- year(tacsatEflalo$SI_DATIM)
  tacsatEflalo$Month <- month(tacsatEflalo$SI_DATIM)
  tacsatEflalo$kwHour <- tacsatEflalo$VE_KW * tacsatEflalo$INTV / 60
  tacsatEflalo$INTV <- tacsatEflalo$INTV / 60
  tacsatEflalo$GEARWIDTHKM <- add_gearwidth(tacsatEflalo)
  tacsatEflalo$SA_KM2 <- tacsatEflalo$GEARWIDTHKM * tacsatEflalo$INTV * tacsatEflalo$SI_SP * 1.852
  tacsatEflalo[, .(min = min(GEARWIDTHKM), max = max(GEARWIDTHKM)), by = .(LE_MET)]
  save(tacsatEflalo, file = file.path(outPath, paste0("tacsatEflalo", year, ".RData")))
}

rm(speedarr, tacsatp, tacsatEflalo, eflalo, eflaloM, eflaloNM)

# Section 3: Construct and output summary tables -------------------------------

library(glue)
library(gt)

for (year in yearsToSubmit) {
  load(file = paste0(outPath, "/processedEflalo", year, ".RData"))
  load(file = paste0(outPath, "/tacsatEflalo", year, ".RData"))
  
  # 3.1 Table 2: Logbook records -----------------------------------------------
  eflalo$Year <- year(eflalo$FT_LDATIM)
  eflalo$Month <- month(eflalo$FT_LDATIM)
  eflalo$INTV <- 1
  eflalo$record <- 1
  res <- aggregate(
    eflalo$record,
    by = as.list(eflalo[, c("VE_COU", "VE_REF", "LE_CDAT")]),
    FUN = sum,
    na.rm = TRUE
  )
  colnames(res) <- c("VE_COU", "VE_REF", "LE_CDAT", "nrRecords")
  eflalo <- merge(eflalo, res, by = c("VE_COU", "VE_REF", "LE_CDAT"))
  eflalo$INTV <- eflalo$INTV / eflalo$nrRecords
  eflalo$kwDays <- eflalo$VE_KW * eflalo$INTV
  eflalo$tripInTacsat <- ifelse(eflalo$FT_REF %in% tacsatEflalo$FT_REF, "Y", "N")
  RecordType <- "LE"
  cols <- c(
    "VE_REF", "VE_COU", "Year", "Month", "LE_RECT", "LE_GEAR", "LE_MET",
    "VE_LEN", "tripInTacsat", "INTV", "kwDays", "LE_KG_TOT", "LE_EURO_TOT"
  )
  if (year == yearsToSubmit[1]) {
    table2 <- cbind(RT = RecordType, eflalo[, cols])
  } else {
    table2 <- rbind(table2, cbind(RT = RecordType, eflalo[, cols]))
  }
  save(table2, file = file.path(outPath, "table2.RData"))
  message(glue("Table 2 for year {year} is completed"))
  
  # 3.2 Table 1: VMS with assigned logbook info --------------------------------
  tacsatEflalo <- data.frame(tacsatEflalo)
  RecordType <- "VE"
  cols <- c(
    "VE_REF", "VE_COU", "Year", "Month", "Csquare", "MSFD_BBHT", "depth", "LE_GEAR",
    "LE_MET", "SI_SP", "INTV", "VE_LEN", "kwHour", "VE_KW", "LE_KG_TOT", "LE_EURO_TOT",
    "GEARWIDTHKM", "SA_KM2"
  )
  if (year == yearsToSubmit[1]) {
    table1 <- cbind(RT = RecordType, tacsatEflalo[, cols])
  } else {
    table1 <- rbind(table1, cbind(RT = RecordType, tacsatEflalo[, cols]))
  }
  save(table1, file = file.path(outPath, "table1.RData"))
  message(glue("Table 1 for year {year} is completed"))
}

table(table1$INTV > 0)
table(table2$INTV > 0)

# 3.3 Prepare and anonymize vessel IDs -----------------------------------------
load(file = paste0(outPath, "/table1.RData"))
load(file = paste0(outPath, "/table2.RData"))
VE_lut <- data.frame(VE_REF = unique(c(table1$VE_REF, table2$VE_REF)))
fmt <- paste0("%0", floor(log10(nrow(VE_lut))) + 1, "d")
VE_lut$VE_ID <- paste0(table1$VE_COU[1], sprintf(fmt, 1:nrow(VE_lut)))
table1 <- left_join(table1, VE_lut)
table2 <- left_join(table2, VE_lut)

# 3.4 Add vessel length categories from ICES DATSU vocab -----------------------
vlen_ices <- getCodeList("VesselLengthClass")
vlen_icesc <- vlen_ices %>%
  filter(Key %in% c("VL0006", "VL0608", "VL0810", "VL1012", "VL1215", "VL1518", "VL1824", "VL2440", "VL40XX")) %>%
  dplyr::select(Key) %>%
  dplyr::arrange(Key)
table1$LENGTHCAT <- table1$VE_LEN %>% cut(
  breaks = c(0, 6, 8, 10, 12, 15, 18, 24, 40, Inf),
  right = FALSE, include.lowest = TRUE, labels = vlen_icesc$Key
)
table2$LENGTHCAT <- table2$VE_LEN %>% cut(
  breaks = c(0, 6, 8, 10, 12, 15, 18, 24, 40, Inf),
  right = FALSE, include.lowest = TRUE, labels = vlen_icesc$Key
)

# 3.5 Aggregate and summarize tables -------------------------------------------
table1Save <- table1 %>%
  separate(col = LE_MET, c("MetierL4", "MetierL5"), sep = '_', extra = "drop", remove = FALSE) %>%
  group_by(RecordType = RT, CountryCode = VE_COU, Year, Month, Csquare, MetierL4, MetierL5, MetierL6 = LE_MET, VesselLengthRange = LENGTHCAT, Habitat = MSFD_BBHT, Depth = depth) %>%
  summarise(
    No_Records = n(),
    AverageFishingSpeed = mean(SI_SP),
    FishingHour = sum(INTV, na.rm = TRUE),
    AverageInterval = mean(INTV, na.rm = TRUE),
    AverageVesselLength = mean(VE_LEN, na.rm = TRUE),
    AveragekW = mean(VE_KW, na.rm = TRUE),
    kWFishingHour = sum(kwHour, na.rm = TRUE),
    SweptArea = sum(SA_KM2, na.rm = TRUE),
    TotWeight = sum(LE_KG_TOT, na.rm = TRUE),
    TotValue = sum(LE_EURO_TOT, na.rm = TRUE),
    NoDistinctVessels = n_distinct(VE_ID, na.rm = TRUE),
    AnonymizedVesselID = ifelse(n_distinct(VE_ID) < 3, paste(unique(VE_ID), collapse = ";"), 'not_required'),
    AverageGearWidth = mean(GEARWIDTHKM, na.rm = TRUE),
    .groups = "drop"
  ) %>%
  relocate(NoDistinctVessels, AnonymizedVesselID, .before = Csquare) %>%
  as.data.frame()
table2Save <- table2 %>%
  separate(col = LE_MET, c("MetierL4", "MetierL5"), sep = '_', remove = FALSE) %>%
  group_by(RecordType = RT, CountryCode = VE_COU, Year, Month, ICESrectangle = LE_RECT, MetierL4, MetierL5, MetierL6 = LE_MET, VesselLengthRange = LENGTHCAT, VMSEnabled = tripInTacsat) %>%
  summarise(
    FishingDays = sum(INTV, na.rm = TRUE),
    kWFishingDays = sum(kwDays, na.rm = TRUE),
    TotWeight = sum(LE_KG_TOT, na.rm = TRUE),
    TotValue = sum(LE_EURO_TOT, na.rm = TRUE),
    NoDistinctVessels = n_distinct(VE_ID, na.rm = TRUE),
    AnonymizedVesselID = ifelse(n_distinct(VE_ID) < 3, paste(unique(VE_ID), collapse = ";"), 'not_required'),
    .groups = "drop"
  ) %>%
  relocate(NoDistinctVessels, AnonymizedVesselID, .before = ICESrectangle) %>%
  as.data.frame()
saveRDS(table1Save, paste0(outPath, "table1Save.rds"))
saveRDS(table2Save, paste0(outPath, "table2Save.rds"))

# 3.6 Field NA/format summaries (optional) -------------------------------------
table_nas <- NULL
for (nn in colnames(table1Save)) {
  table_na <- table(is.na(table1Save[, nn]))
  row <- c(field = nn, is_na =  ifelse(is.na(table_na['TRUE']), 0, table_na['TRUE'] ), total_records = length(table1Save[, nn]), field_type = class(table1Save[, nn]))
  table_nas <- rbind(table_nas, row)
}
gt(
  as_tibble(table_nas),
  rowname_col = 'field'
) %>%
  tab_header(title = md('Summary of **Table 1**  number of NA and records types'))
table_nas <- NULL
for (nn in colnames(table2Save)) {
  table_na <- table(is.na(table2Save[, nn]))
  row <- c(field = nn, is_na =  ifelse(is.na(table_na['TRUE']), 0, table_na['TRUE'] ), total_records = length(table2Save[, nn]), field_type = class(table2Save[, nn]))
  table_nas <- rbind(table_nas, row)
}
gt(
  as_tibble(table_nas),
  rowname_col = 'field'
) %>%
  tab_header(title = md('Summary of **Table 2**  number of NA and records types'))

table(table1$INTV > 0)
table(table2$INTV > 0)

# 3.7 Save as ICES-ready CSV ---------------------------------------------------
colnames(table1Save)[which(colnames(table1Save) == "Csquare")] <- "C-square"
colnames(table1Save)[which(colnames(table1Save) == "Habitat")] <- "HabitatType"
colnames(table1Save)[which(colnames(table1Save) == "Depth")] <- "DepthRange"
colnames(table1Save)[which(colnames(table1Save) == "No_Records")] <- "NumberOfRecords"
write.table(table1Save, file.path(outPath, "table1Save.csv"), na = "", row.names = FALSE, col.names = TRUE, sep = ",", quote = FALSE)
write.table(table2Save, file.path(outPath, "table2Save.csv"), na = "", row.names = FALSE, col.names = TRUE, sep = ",", quote = FALSE)

#'------------------------------------------------------------------------------
# End of script
#'------------------------------------------------------------------------------