# Data gridding speed test: duckdb vs R

This test is related to the SFD geospatial framework - spatial grids, but here the focus is solely on comparing regular gridding approaches. The original data source is parquet files that are read into an in-process duckdb database. We do the following tests:

- Do the gridding within duckdb, extracting only the summary results into R (the `db` object in the test below).
  - The connection time is not included in the test; it is almost instantaneous.
- Import the data into R first (~2.5 million records; that takes a little time but is not included in the speed test), then:
  - Do the gridding the same way as in duckdb and do some summary (the `gr` object in the test below).
  - Use {vmstools::CSquare} for the gridding, then do some summary (the `cs` object in the test below).

```r
library(vmstools)  # To access the CSquare function
library(duckdb)    # All things duckdb
library(duckdbfs)  # Installed via install.packages("duckdbfs"), a wrapper
                   # to connect to multiple parquet files
library(tidyverse) # For backend SQL communication with duckdb via {dbplyr}
library(bench)     # For the comparison test

# Create a connection to the parquet files, filter out one year, one gear
q <-
  open_dataset(here::here("data/ais/trail")) |>
  filter(year == 2024, agf_gid == 6)
```
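Note that `q` at this point is a lazy duckdb table; nothing has been read into R yet. As a minimal sanity check (a sketch, assuming the parquet files contain the `year` and `agf_gid` columns used in the filter above), one can let duckdb count the rows and return only a single summary value:

```r
# The row count is computed inside duckdb; only one number comes back to R
q |>
  dplyr::count() |>
  dplyr::collect()
```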
```r
# Import the data into R (some ~2.5 million records):
d <-
  q |>
  dplyr::collect()
```
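Since one of the observations below concerns the cost of holding the full table in R memory, a quick way to gauge the footprint (a rough sketch; the actual number depends on which columns the data contains) is:

```r
# Approximate size of the imported table in R memory
format(object.size(d), units = "MB")
```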
```r
# Grid resolution (same as in the datacall) and then the test:
dx <- dy <- 0.05
res <- bench::mark(
  # Here we let duckdb do the heavy lifting, only import to R at the last step
  db <-
    q |>
    mutate(x = lon %/% dx * dx + dx / 2,
           y = lat %/% dy * dy + dy / 2) |>
    count(x, y) |>
    dplyr::collect(),
  # Start with the table already in R memory
  gr <-
    d |>
    mutate(x = lon %/% dx * dx + dx / 2,
           y = lat %/% dy * dy + dy / 2) |>
    count(x, y),
  # The conventional way
  cs <-
    d |>
    mutate(cs = CSquare(lon = lon, lat = lat, degrees = dx)) |>
    count(cs),
  check = FALSE,
  relative = TRUE
)
res |>
  select(expression, median, `itr/sec`, mem_alloc) |>
  dplyr::mutate(expression = stringr::str_sub(expression, 1, 5))
```
```
# A tibble: 3 × 4
  expression median `itr/sec` mem_alloc
  <chr>       <dbl>     <dbl>     <dbl>
1 db <-        1.12      39.6        1
2 gr <-        1         44.4     1097.
3 cs <-       44.4        1       1048.
```
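Besides the summary table, {bench} can also plot the distribution of the individual iteration timings. This is an optional extra, assuming {ggplot2} and {ggbeeswarm} (used by the default plot type) are installed:

```r
# Beeswarm plot of each iteration's timing, one row per expression
ggplot2::autoplot(res)
```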
Observations:

- There is not much gained in speed when the gridding is done within duckdb vs within R.
  - But here we have not included the import time.
  - Nor do we take into account the negative effect of having loaded a large object into R memory.
- Using `x %/% dx * dx + dx/2` to grid is much faster (roughly a factor of 40 here) than {vmstools::CSquare}; see the small illustration after this list.
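To see what the floor-division formula is doing, take a single coordinate (a made-up value, just for illustration). With `dx = 0.05`, a point at longitude -23.47 falls in the cell [-23.50, -23.45), whose midpoint is -23.475:

```r
# x %/% dx floors the coordinate to a cell index, * dx recovers the
# lower cell edge, and + dx/2 shifts to the cell midpoint
lon <- -23.47
dx  <- 0.05
lon %/% dx * dx + dx / 2
#> [1] -23.475
```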
Now we may as well check if we get the same results from `x %/% dx * dx + dx/2` vs {vmstools::CSquare}. I kind of suspect not exactly, but that is another story.
```r
db |>
  # convert the midpoints to c-squares
  mutate(cs = CSquare(lon = x, lat = y, degrees = dx)) |>
  rename(db_n = n) |>
  full_join(cs) |>
  mutate(diff = n - db_n) |>
  count(diff) |>
  mutate(`%` = round(n / sum(n) * 100, 2)) |>
  knitr::kable(caption = "Difference in ping allocation by 'CSquare' using `x %/% dx * dx + dx/2` vs {vmstools::CSquare}")
```
| diff |     n |     % |
|-----:|------:|------:|
|   -4 |     1 |  0.01 |
|   -3 |     4 |  0.04 |
|   -2 |    28 |  0.25 |
|   -1 |   195 |  1.73 |
|    0 | 10834 | 95.89 |
|    1 |   205 |  1.81 |
|    2 |    25 |  0.22 |
|    3 |     4 |  0.04 |
|   NA |     2 |  0.02 |

Table: Difference in ping allocation by 'CSquare' using `x %/% dx * dx + dx/2` vs {vmstools::CSquare}
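So the two methods agree for about 96% of the c-squares but not all. One plausible, though unverified, explanation is points sitting exactly on (or numerically very close to) a cell boundary, where floor division and floating-point representation can push a point into the adjacent cell. A hypothetical spot check, not run against the data here:

```r
# 0.15 is not exactly representable in floating point, so floor division
# drops a coordinate at 0.15 into the lower cell (index 2, not 3):
0.15 %/% 0.05
#> [1] 2

# The corresponding c-square, which may or may not agree with the
# floor-division midpoint for such boundary cases:
CSquare(lon = 0.15, lat = 0.15, degrees = 0.05)
```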