AIS gridding and summations in duckdb: Speed test

Author: Einar Hjörleifsson

Published: September 26, 2025

Data gridding speed test: duckdb vs R

This test is related to the SFD geospatial framework - spatial grids, but here the focus is solely on comparing regular gridding. The original data source is parquet files that are read into an in-process duckdb database. The tests are as follows:

  • Do the gridding within duckdb, only extracting the summary results into R (db object in the test below).
    • The connection time is not included in the test; it is almost instantaneous.
  • Import the data into R first (~2.5 million records, which takes a little time and is not included in the speed test), then:
    • Do the gridding the same way as in duckdb, then do some summary (gr object in the test below).
    • Use {vmstools::CSquare} for gridding, then do some summary (cs object in the test below).
library(vmstools)    # To access the CSquare function
library(duckdb)      # All things duckdb
library(duckdbfs)    # Wrapper to connect to multiple parquet files
library(tidyverse)   # Backend SQL communication to duckdb via {dbplyr}
library(bench)       # For the comparison test


# Create a connection to parquet files, filter out one year, one gear
q <-
  open_dataset(here::here("data/ais/trail")) |>
  filter(year == 2024, agf_gid == 6)
# Import data into R (some ~2.5 million records):
d <-
  q |>
  dplyr::collect()
# Grid resolution (same as in the data call) and then the test:

dx <- dy <- 0.05
res <- bench::mark(
  # Here we let duckdb do the heavy lifting, only import to R at the last step
  db <-
    q |>
    mutate(x = lon %/% dx * dx + dx/2,
           y = lat %/% dy * dy + dy/2) |>
    count(x, y) |>
    dplyr::collect(),
  # Start with table already in R-memory
  gr <-
    d |>
    mutate(x = lon %/% dx * dx + dx/2,
           y = lat %/% dy * dy + dy/2) |>
    count(x, y),
  # The conventional way
  cs <-
    d |>
    mutate(cs = CSquare(lon = lon, lat = lat, degrees = dx)) |>
    count(cs),
  check = FALSE,   # the three expressions return different objects
  relative = TRUE  # report timings relative to the fastest expression
)
res |>
  select(expression, median, `itr/sec`, mem_alloc) |>
  dplyr::mutate(expression = stringr::str_sub(expression, 1, 5))
# A tibble: 3 × 4
  expression median `itr/sec` mem_alloc
  <chr>       <dbl>     <dbl>     <dbl>
1 db <-        1.12      39.6        1 
2 gr <-        1         44.4     1097.
3 cs <-       44.4        1       1048.

Observations:

  • There is not much speed gained when the gridding is done within duckdb vs within R.
    • But here we have not included the import time.
    • Nor have we taken into account the negative effect of having a large object loaded into R memory.
  • Gridding with x %/% dx * dx + dx/2 is much faster (roughly 44 times in this test) than {vmstools::CSquare}; see the sketch after this list.
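
A minimal sketch of what the midpoint formula does, using a made-up coordinate (lon = -23.987 is just an illustrative value):

# Floor-divide the coordinate to find its 0.05-degree cell,
# then shift to the cell midpoint:
dx <- 0.05
lon <- -23.987
lon %/% dx              # -480: the cell index (floor division)
lon %/% dx * dx         # -24.000: the western edge of the cell
lon %/% dx * dx + dx/2  # -23.975: the cell midpoint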

Now we may as well check whether we get exactly the same results from x %/% dx * dx + dx/2 as from {vmstools::CSquare}. I kind of suspect not, but that is another story.

db |>
  # convert the midpoints to c-squares
  mutate(cs = CSquare(lon = x, lat = y, degrees = dx)) |>
  rename(db_n = n) |>
  full_join(cs, by = "cs") |>
  mutate(diff = n - db_n) |>
  count(diff) |>
  mutate(`%` = round(n / sum(n) * 100, 2)) |> 
  knitr::kable(caption = "Difference in ping allocation by 'CSquare' using `x%/%dx * dx + dx/2` vs {vmstools::CSquare}")
Difference in ping allocation by 'CSquare' using x %/% dx * dx + dx/2 vs {vmstools::CSquare}

  diff      n      %
    -4      1   0.01
    -3      4   0.04
    -2     28   0.25
    -1    195   1.73
     0  10834  95.89
     1    205   1.81
     2     25   0.22
     3      4   0.04
    NA      2   0.02
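
About 96% of the c-squares get identical counts; the remainder differ by a handful of pings, plausibly positions sitting on or very near cell boundaries, though I have not verified that. To dig further one could list the individual c-squares where the counts disagree (a sketch reusing the objects above, not run here):

# Hypothetical follow-up: pull out the c-squares where the two
# methods disagree, so the mismatching pings can be inspected
db |>
  mutate(cs = CSquare(lon = x, lat = y, degrees = dx)) |>
  rename(db_n = n) |>
  full_join(cs, by = "cs") |>
  filter(is.na(n) | is.na(db_n) | n != db_n) |>
  arrange(cs)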