Points in polygons via duckdb

Point-in-polygon analysis answers the question of whether a given point in a plane falls within a specified polygon. If we have multiple polygons we can also ask which polygon a point belongs to. Solving this within R is tried, tested and straightforward as long as the data are not too big. But what if we have a dataset so big that processing it challenges the available memory? Here we put duckdb to the test.
Categories: code, rtip

Author: Einar Hjörleifsson

Published: June 19, 2025
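
Before scaling things up, a minimal toy sketch with {sf} makes the basic question concrete: a 2 x 2 square and two points, one inside it and one outside. This is just an illustration, not part of the datasets used below.

# A toy polygon (a 2 x 2 square) and two points, one inside and one outside
square <- sf::st_sfc(sf::st_polygon(list(rbind(c(0, 0), c(2, 0), c(2, 2), c(0, 2), c(0, 0)))))
points <- sf::st_sfc(sf::st_point(c(1, 1)), sf::st_point(c(3, 3)))
# Returns a logical matrix: TRUE for the first point, FALSE for the second
sf::st_intersects(points, square, sparse = FALSE)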

A dummy dataset

Here we generate a simple test dataset of 100 million coordinate pairs. We store the data as a parquet file which, if you have not heard of it, is kind of like a csv file on steroids.

# Not run
N <- 1e8
# A tibble that covers the bounding box of FAO area 27, with a running
#  record number (.id) so that subsets can be taken downstream
tibble::tibble(lon = runif(N, -44.00, 68.50),
               lat = runif(N, 36.00, 89.99)) |> 
  dplyr::mutate(.id = dplyr::row_number(),
                .before = lon) |> 
  duckdbfs::write_dataset(here::here("data/test/test_dataset.parquet"))

In addition we will use the eusm sediment spatial dataset, the same as used in the ICES VMS data call. This dataset has 3.17 million polygons comprising a total of 119 million point coordinates! Furthermore, some of the polygons are not valid (meaning, e.g., that a polygon may intersect itself) and there are “holes” between some of the polygons. Not ideal for proper analysis, but a good candidate for the test setup here.
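
If you want to convince yourself about the state of the polygons, a sketch along these lines should do, here pointed at the same geopackage used in the code further below (and admittedly heavy on a layer with millions of features):

# Read the polygon layer and flag invalid geometries - st_is_valid() can take
#  a while on a large layer
polys <- sf::read_sf(here::here("data/auxillary/ICESareas.gpkg"))
nrow(polys)                    # number of polygon features
table(sf::st_is_valid(polys))  # FALSE counts the invalid geometries
# One could repair them with sf::st_make_valid(), but below we simply switch
#  s2 off and use the data as-is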

Library loading

We will be using a number of packages, but instead of loading them we will call each function directly within the script, just to limit ambiguity. We start off by turning the Google s2 geometry library (s2geometry.io) off, because otherwise the conventional spatial join will not work, given that the eusm has invalid geometries (see above).

sf::sf_use_s2(FALSE)

duckdb

In this section the data are not loaded into R memory; instead a temporary duckdb database is created that merely points to the dataset on disk. The processing steps are otherwise the same as in the conventional approach shown further below, except that here we run the test with up to 100 million ais points, and the data are only imported into R at the very end of the join process.

# Open the polygon and the point datasets lazily - nothing is collected into R yet
eusm <-
  duckdbfs::open_dataset(here::here("data/auxillary/ICESareas.gpkg")) |> 
  dplyr::select(area = Area_Full, geom)
ais <- 
  duckdbfs::open_dataset(here::here("data/test/test_dataset.parquet"))


# Time the spatial join for an ever increasing number of points, 1e4 up to 1e8
o <- seq(4, 8, by = 0.2)
N <- 10^o
tictoc::tic.clearlog()
for(i in seq_along(N)) {
  print(i)
  tictoc::tic()
  X <- N[i]
  ais_join <- 
    ais |> 
    dplyr::filter(.id <= X) |> 
    dplyr::mutate(geom = ST_POINT(lon, lat)) |> 
    duckdbfs::spatial_join(eusm, by = "st_intersects", join = "left") |> 
    dplyr::select(-geom_1) |> 
    dplyr::collect()
  tictoc::toc(log = TRUE, quiet = TRUE)
}
log.lst <- tictoc::tic.log(format = FALSE)
tictoc::tic.clearlog()
timings <- unlist(lapply(log.lst, function(x) x$toc - x$tic))
timings
duckdb_eusm <- 
  tibble::tibble(n = N / 1e6,
                 timing = timings |> as.vector())
duckdb_eusm |> readr::write_csv("test_duckdbfs.csv")
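
As an aside, it is easy to convince oneself that the tables above really are lazy: until collect() is called, dplyr only builds a SQL query that duckdb runs against the files on disk. Something along these lines (not part of the timing) shows this:

# Build a query on the lazy ais table and inspect the SQL instead of running it
ais |> 
  dplyr::filter(.id <= 1e6) |> 
  dplyr::count() |> 
  dplyr::show_query()
# Only when we collect() is the (here single-row) result pulled into R
ais |> 
  dplyr::filter(.id <= 1e6) |> 
  dplyr::count() |> 
  dplyr::collect()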

The conventional way

In this section we are going to read the data into memory. And because of the limitation that this creates, we will not test more than 10 million ais records. The steps are as follows:

  • Load the dataset into memory
  • Run a loop with an ever increasing number of ais-records, ending with a dataframe of 10 million records
  • Within each loop generate an sf-tibble and join that with the eusm-polygons to get the sediment type for each ping
  • Keep track of time elapsed using {tictoc}
  • Summarise the timing and plot the results.
# Here read data into memory
eusm <-
  sf::read_sf(here::here("data/auxillary/ICESareas.gpkg")) |> 
  dplyr::select(fao = Area_Full)
ais <- 
  nanoparquet::read_parquet(here::here("data/test/test_dataset.parquet")) |> 
  # Let's just limit the case to 10 million records,
  #  which is enough to get the point across
  dplyr::filter(.id <= 1e7)
o <- seq(4, 7, by = 0.2)
N <- 10^o
tictoc::tic.clearlog()
for(i in seq_along(N)) {
  tictoc::tic()
  X <- N[i]
  ais_join <- 
    ais |> 
    dplyr::filter(.id <= X) |> 
    sf::st_as_sf(coords = c("lon", "lat"),
                 crs = 4326,
                 remove = FALSE) |> 
    sf::st_join(eusm, join = sf::st_intersects, left = TRUE)
  tictoc::toc(log = TRUE, quiet = TRUE)
}
log.lst <- tictoc::tic.log(format = FALSE)
tictoc::tic.clearlog()
timings <- unlist(lapply(log.lst, function(x) x$toc - x$tic))
timings
dplyr_eusm <- 
  tibble::tibble(n = N / 1e6,
                 timing = timings |> as.vector()) 

Comparison

dplyr::bind_rows(duckdb_eusm |> dplyr::mutate(method = "duckdb"),
                 dplyr_eusm  |> dplyr::mutate(method = "dplyr")) |> 
  ggplot2::ggplot(ggplot2::aes(n, timing / 60, colour = method)) +
  ggplot2::geom_point() +
  ggplot2::labs(x = "Millions of pings",
                y = "Timing [minutes]",
                colour = "Method") |> 
  ggplot2::scale_colour_brewer(palette = "Set1")
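
Beyond the plot, one can put the two timing tables side by side and compute how many times faster the duckdb route is for the point counts tested in both runs. A small sketch (rounding n first so that the floating-point join keys match):

# Join the two timing tables on the number of pings and compute the speedup
dplyr::inner_join(
  duckdb_eusm |> dplyr::rename(duckdb = timing) |> dplyr::mutate(n = round(n, 3)),
  dplyr_eusm  |> dplyr::rename(dplyr  = timing) |> dplyr::mutate(n = round(n, 3)),
  by = "n"
) |> 
  dplyr::mutate(speedup = dplyr / duckdb)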

Conclusion

Generally, doing the point-in-polygon join via {duckdb} works out better than via {dplyr}/{sf}: the data never need to be fully loaded into R, so the approach scales to point datasets that would otherwise challenge the available memory.