If you have ever worked with DATRAS data you have gone through the cumbersome task of first spending time on getting the data. Either through point-and-mouse click download approach or possibly via some coding procedure using the DATRAS webservices.

The above access/distribution methods are well tested and server their purposes. One alternative may be that someone generates the three fundamental raw DATRAS data as parquet-files and stores them on a webspace, from which users can address them directly.

Here the raw DATRAS data for all the surveys have been combined into single tables (only the HH and the HL tables, the CA table is pending). There are at least two ways to access them into R:

  • Direct read into R
  • Via DuckDB
# library(tidyverse)

Direct read into R

The files are stored in parquet format. Consider them to be csv-files on steroids. We can read them directly into R by:

hh <- arrow::read_parquet("https://heima.hafro.is/~einarhj/DATRAS/HH.parquet")
hl <- arrow::read_parquet("https://heima.hafro.is/~einarhj/DATRAS/HL.parquet")

So here we got some ~140965 records, which basically are all the tows in the DATRAS database. And some ~13518402 records of length data. The count by surveys and years is as follows:

hl |> 
  dplyr::filter(Year >= 2010) |> 
  dplyr::mutate(Survey = paste0(Survey, " ", Quarter)) |> 
  dplyr::count(Survey, Year) |> 
  tidyr::spread(Year, n) |> 
  knitr::kable()
Survey 2010 2011 2012 2013 2014 2015 2016 2017 2018 2019 2020 2021 2022 2023 2024
BITS 1 27639 26850 22854 26160 21510 23880 25144 31186 28532 25667 27709 27835 25472 28096 28324
BITS 4 20947 20106 18318 18675 20652 20848 27063 27557 24852 25010 25317 24778 27560 26849 NA
BTS 1 11533 11607 9141 12333 16717 18262 19407 19539 18963 21227 13954 18431 13037 17899 NA
BTS 3 40068 45800 47792 50217 48333 49060 49674 51295 51576 53766 45969 52992 37536 53945 NA
BTS 4 5866 5347 4987 4961 NA NA NA NA NA NA NA NA NA NA NA
BTS-GSA17 4 NA NA NA NA NA NA 1019 928 1123 993 1023 993 707 976 NA
BTS-VIII 4 NA 9805 8078 4558 6496 5148 4824 5118 5106 3095 4068 3729 2341 3461 NA
Can-Mar 3 23471 27027 21056 22989 19509 21168 28859 27158 10720 24764 21186 14215 NA NA NA
DYFS 3 24689 22773 29399 32811 38511 34923 25065 32664 30062 26802 35009 28940 33152 24493 NA
DYFS 4 6876 9614 9181 6838 7023 8638 7957 7821 9684 9535 5969 8818 8063 8036 NA
EVHOE 4 21953 25982 21526 22900 24727 24020 24979 3912 24965 24497 29614 24989 18524 23420 NA
FR-CGFS 4 4620 4918 5146 5649 8488 8647 9194 8134 8867 9945 8731 8662 8377 10040 NA
FR-WCGFS 3 NA NA NA NA NA NA NA NA NA NA NA NA NA 5219 NA
FR-WCGFS 4 NA NA NA NA NA NA NA NA NA NA NA 5446 NA NA NA
IE-IAMS 1 NA NA NA NA NA NA 8554 6636 7039 8597 5723 4864 4205 10007 NA
IE-IAMS 2 NA NA NA NA NA NA NA 5190 6881 6306 3772 3708 5209 6952 NA
IE-IGFS 4 36790 31780 32456 34610 33612 29494 37997 29869 32449 32123 24367 31275 31595 33289 NA
NIGFS 1 10793 11555 11933 10695 11841 12847 14110 12005 10892 11433 11261 11058 11802 10893 NA
NIGFS 4 9106 8994 10299 9923 10755 11550 11380 10931 10920 10687 9922 8329 9061 6783 NA
NL-BSAS 3 NA NA NA NA NA NA NA NA NA 73 333 773 512 1505 NA
NL-BSAS 4 NA NA NA NA NA NA NA NA NA 1267 905 343 448 NA NA
NS-IBTS 1 42942 42277 52063 47451 44670 52486 50151 54712 44244 44780 47663 51151 32767 45753 47705
NS-IBTS 2 NA 92 1324 NA NA 787 NA NA NA NA NA NA NA NA NA
NS-IBTS 3 39620 42789 42692 40424 46572 48374 53164 49428 47228 48456 52849 51987 49676 50732 NA
NS-IBTS_UNIFtest 1 NA NA NA NA NA NA NA NA NA NA NA 15 NA NA NA
NS-IBTS_UNIFtest 4 NA NA NA NA NA NA NA NA NA 189 NA NA NA NA NA
NS-IDPS 3 NA NA NA 1898 NA NA 230 NA NA NA NA NA NA NA NA
NSSS 4 1106 759 2872 3078 4091 3611 4997 4294 3892 4923 4593 4499 4833 3026 NA
PT-IBTS 4 6089 5788 NA 6224 5553 6479 5345 7240 3851 NA NA 9511 6893 4741 NA
SCOROC 3 NA 5146 4748 3990 7067 6282 6413 6294 6203 6589 5047 5981 5933 5888 NA
SCOWCGFS 1 NA 9196 10475 12551 10690 10931 12089 12036 10869 11015 11069 10977 NA 11598 10978
SCOWCGFS 4 NA 8729 11259 4457 11704 10561 11315 10311 9653 11165 9571 9711 11196 10263 NA
SE-SOUND 1 NA 440 328 299 246 248 344 153 218 671 544 831 694 737 NA
SE-SOUND 3 NA 321 272 340 275 310 280 227 NA NA NA NA NA NA NA
SE-SOUND 4 NA NA NA NA NA NA NA NA 439 707 571 677 889 649 NA
SNS 3 4052 3362 NA 3892 3988 3877 3571 3589 3751 3144 3505 3056 3119 3228 NA
SNS 4 NA NA 3295 NA NA NA NA NA NA 189 NA NA NA NA NA
SP-ARSA 1 3752 3560 2702 4185 4406 4187 5304 5080 7127 10096 5101 NA 5133 5453 NA
SP-ARSA 4 5110 4451 5075 6006 5411 6321 5595 5949 11668 5771 5601 NA 6067 5966 NA
SP-NORTH 3 4900 NA NA NA NA NA NA NA NA NA NA NA NA NA NA
SP-NORTH 4 23722 24693 24579 30081 28118 29357 28678 28313 25592 27827 24355 24233 16721 17306 NA
SP-PORC 3 19870 18799 21806 24005 18074 25301 20919 18948 19153 18522 13074 21721 14215 11941 NA
SWC-IBTS 1 8860 NA NA NA NA NA NA NA NA NA NA NA NA NA NA

Comment: Somewhat surprising how few surveys in 2024, possibly because of the attack.

Reading via duckDB

Another way is to use duckDB:

q_hh <- duckdbfs::open_dataset("https://heima.hafro.is/~einarhj/hh/hh.parquet")
q_hl <- duckdbfs::open_dataset("https://heima.hafro.is/~einarhj/hh/hl.parquet")

q_hh |> dplyr::glimpse()
#> Rows: ??
#> Columns: 69
#> Database: DuckDB v1.0.1-dev4228 [unknown@Linux 5.10.0-28-amd64:R 4.3.2/:memory:]
#> $ RecordType        <chr> "HH", "HH", "HH", "HH", "HH", "HH", "HH", "HH", "HH"…
#> $ Survey            <chr> "BITS", "BITS", "BITS", "BITS", "BITS", "BITS", "BIT…
#> $ Quarter           <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
#> $ Country           <chr> "DK", "DK", "DK", "DK", "DK", "DK", "DK", "DK", "DK"…
#> $ Ship              <chr> "26D4", "26D4", "26D4", "26D4", "26D4", "26D4", "26D…
#> $ Gear              <chr> "CAM", "CAM", "CAM", "EXP", "EXP", "GRT", "GRT", "GR…
#> $ SweepLngt         <int> NA, NA, NA, 110, 110, NA, NA, NA, NA, NA, NA, NA, NA…
#> $ GearEx            <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
#> $ DoorType          <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
#> $ StNo              <chr> "150", "151", "152", "149", "147", "10", "101", "1",…
#> $ HaulNo            <int> 67, 68, 69, 66, 65, 8, 54, 2, 55, 56, 57, 59, 60, 58…
#> $ Year              <chr> "1991", "1991", "1991", "1991", "1991", "1991", "199…
#> $ Month             <int> 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3…
#> $ Day               <int> 20, 20, 20, 19, 19, 6, 17, 5, 17, 17, 17, 17, 18, 17…
#> $ TimeShot          <chr> "514", "644", "923", "2128", "1829", "1417", "511", …
#> $ DepthStratum      <chr> "11", "11", "12", "12", "12", "10", "11", "9", "11",…
#> $ HaulDur           <int> 60, 60, 60, 60, 60, 60, 60, 60, 60, 60, 60, 60, 60, …
#> $ DayNight          <chr> "D", "D", "D", "N", "N", "D", "D", "D", "D", "D", "D…
#> $ ShootLat          <dbl> 55.6000, 55.6667, 55.5167, 55.4500, 55.5500, 54.9833…
#> $ ShootLong         <dbl> 16.2500, 16.2667, 16.1667, 15.1167, 15.1833, 14.1500…
#> $ HaulLat           <dbl> -9, -9, -9, -9, -9, -9, -9, -9, -9, -9, -9, -9, -9, …
#> $ HaulLong          <dbl> -9, -9, -9, -9, -9, -9, -9, -9, -9, -9, -9, -9, -9, …
#> $ StatRec           <chr> "40G6", "40G6", "40G6", "39G5", "40G5", "38G4", "39G…
#> $ Depth             <int> 76, 71, 80, 83, 80, 47, 79, 34, 60, 80, 80, 73, 64, …
#> $ HaulVal           <chr> "V", "V", "V", "V", "V", "V", "V", "V", "V", "V", "V…
#> $ HydroStNo         <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, "104+5", NA, "10…
#> $ StdSpecRecCode    <chr> "1", "1", "1", "1", "1", "1", "1", "1", "1", "1", "1…
#> $ BySpecRecCode     <chr> "1", "1", "1", "1", "1", "1", "1", "1", "1", "1", "1…
#> $ DataType          <chr> "C", "C", "C", "C", "C", "C", "C", "C", "C", "C", "C…
#> $ Netopening        <dbl> -9, 5, 5, 7, 16, 3, 3, 4, 3, 3, 3, 3, 3, 3, 3, 3, 3,…
#> $ Rigging           <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
#> $ Tickler           <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
#> $ Distance          <int> 6111, 6482, 6482, 6667, 8519, 6667, 6482, 6296, 6667…
#> $ Warplngt          <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
#> $ Warpdia           <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
#> $ WarpDen           <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
#> $ DoorSurface       <dbl> -9, -9, -9, -9, -9, -9, -9, -9, -9, -9, -9, -9, -9, …
#> $ DoorWgt           <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
#> $ DoorSpread        <dbl> -9, -9, -9, -9, -9, -9, -9, -9, -9, -9, -9, -9, -9, …
#> $ WingSpread        <dbl> -9, -9, -9, -9, -9, -9, -9, -9, -9, -9, -9, -9, -9, …
#> $ Buoyancy          <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
#> $ KiteDim           <dbl> -9, -9, -9, -9, -9, -9, -9, -9, -9, -9, -9, -9, -9, …
#> $ WgtGroundRope     <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
#> $ TowDir            <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
#> $ GroundSpeed       <dbl> -9, -9, -9, -9, -9, -9, -9, -9, -9, -9, -9, -9, -9, …
#> $ SpeedWater        <dbl> -9, -9, -9, -9, -9, -9, -9, -9, -9, -9, -9, -9, -9, …
#> $ SurCurDir         <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
#> $ SurCurSpeed       <dbl> -9, -9, -9, -9, -9, -9, -9, -9, -9, -9, -9, -9, -9, …
#> $ BotCurDir         <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
#> $ BotCurSpeed       <dbl> -9, -9, -9, -9, -9, -9, -9, -9, -9, -9, -9, -9, -9, …
#> $ WindDir           <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
#> $ WindSpeed         <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
#> $ SwellDir          <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
#> $ SwellHeight       <dbl> -9, -9, -9, -9, -9, -9, -9, -9, -9, -9, -9, -9, -9, …
#> $ SurTemp           <dbl> -9, -9, -9, -9, -9, -9, -9, -9, -9, -9, -9, -9, -9, …
#> $ BotTemp           <dbl> -9, -9, -9, -9, -9, -9, -9, -9, -9, -9, -9, -9, -9, …
#> $ SurSal            <dbl> -9, -9, -9, -9, -9, -9, -9, -9, -9, -9, -9, -9, -9, …
#> $ BotSal            <dbl> -9, -9, -9, -9, -9, -9, -9, -9, -9, -9, -9, -9, -9, …
#> $ ThermoCline       <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
#> $ ThClineDepth      <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
#> $ CodendMesh        <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
#> $ SecchiDepth       <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
#> $ Turbidity         <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
#> $ TidePhase         <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
#> $ TideSpeed         <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
#> $ PelSampType       <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
#> $ MinTrawlDepth     <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
#> $ MaxTrawlDepth     <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
#> $ DateofCalculation <chr> "20220301", "20220301", "20220301", "20220301", "202…
q_hl |> dplyr::glimpse()
#> Rows: ??
#> Columns: 29
#> Database: DuckDB v1.0.1-dev4228 [unknown@Linux 5.10.0-28-amd64:R 4.3.2/:memory:]
#> $ RecordType        <chr> "HL", "HL", "HL", "HL", "HL", "HL", "HL", "HL", "HL"…
#> $ Survey            <chr> "BITS", "BITS", "BITS", "BITS", "BITS", "BITS", "BIT…
#> $ Quarter           <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
#> $ Country           <chr> "DK", "DK", "DK", "DK", "DK", "DK", "DK", "DK", "DK"…
#> $ Ship              <chr> "26D4", "26D4", "26D4", "26D4", "26D4", "26D4", "26D…
#> $ Gear              <chr> "GRT", "GRT", "GRT", "GRT", "GRT", "GRT", "GRT", "GR…
#> $ SweepLngt         <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
#> $ GearEx            <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
#> $ DoorType          <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
#> $ StNo              <chr> "124", "124", "124", "124", "124", "124", "124", "12…
#> $ HaulNo            <int> 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, …
#> $ Year              <chr> "1991", "1991", "1991", "1991", "1991", "1991", "199…
#> $ SpecCodeType      <chr> "W", "W", "W", "W", "W", "W", "W", "W", "W", "W", "W…
#> $ SpecCode          <chr> "127141", "127141", "127141", "127141", "127141", "1…
#> $ SpecVal           <chr> "1", "1", "1", "1", "1", "1", "1", "1", "1", "1", "1…
#> $ Sex               <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
#> $ TotalNo           <dbl> 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 1, 2, 2,…
#> $ CatIdentifier     <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
#> $ NoMeas            <int> 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 1, 2, 2,…
#> $ SubFactor         <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
#> $ SubWgt            <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
#> $ CatCatchWgt       <int> 47, 47, 47, 47, 47, 47, 47, 47, 47, 47, 47, 4, 28, 2…
#> $ LngtCode          <chr> "1", "1", "1", "1", "1", "1", "1", "1", "1", "1", "1…
#> $ LngtClass         <int> 22, 24, 25, 26, 27, 28, 29, 30, 31, 35, 37, 32, 44, …
#> $ HLNoAtLngt        <dbl> 1, 3, 2, 3, 1, 3, 2, 1, 1, 1, 1, 1, 1, 1, 113, 1, 1,…
#> $ DevStage          <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
#> $ LenMeasType       <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
#> $ DateofCalculation <chr> "20220301", "20220301", "20220301", "20220301", "202…
#> $ Valid_Aphia       <chr> "127141", "127141", "127141", "127141", "127141", "1…

Note the header in the printout above, basically we do not have the number of rows. Let’s look at what we got:

q_hh |> dplyr::explain()
#> <SQL>
#> SELECT *
#> FROM uivgipaxhlkpppr
#> 
#> <PLAN>
#> physical_plan
#> ┌───────────────────────────┐
#> │       PARQUET_SCAN        │
#> │   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
#> │        Projections:       │
#> │         RecordType        │
#> │           Survey          │
#> │          Quarter          │
#> │          Country          │
#> │            Ship           │
#> │            Gear           │
#> │         SweepLngt         │
#> │           GearEx          │
#> │          DoorType         │
#> │            StNo           │
#> │           HaulNo          │
#> │            Year           │
#> │           Month           │
#> │            Day            │
#> │          TimeShot         │
#> │        DepthStratum       │
#> │          HaulDur          │
#> │          DayNight         │
#> │          ShootLat         │
#> │         ShootLong         │
#> │          HaulLat          │
#> │          HaulLong         │
#> │          StatRec          │
#> │           Depth           │
#> │          HaulVal          │
#> │         HydroStNo         │
#> │       StdSpecRecCode      │
#> │       BySpecRecCode       │
#> └───────────────────────────┘

So basically we have not imported anything into R yet, we just having established a connection to the tables. We could proceed a little further using regular dplyr codes:

q <-
  q_hl |> 
  dplyr::mutate(Year = as.integer(Year)) |> # Fix this upstream
  dplyr::filter(Year >= 2010) |> 
  dplyr::mutate(Survey = paste0(Survey, " ", Quarter)) |> 
  dplyr::count(Survey, Year)
q |> dplyr::glimpse()
#> Rows: ??
#> Columns: 3
#> Database: DuckDB v1.0.1-dev4228 [unknown@Linux 5.10.0-28-amd64:R 4.3.2/:memory:]
#> $ Survey <chr> "BTS-VIII 4", "BTS-VIII 4", "Can-Mar 3", "SP-PORC 3", "SP-PORC …
#> $ Year   <int> 2019, 2022, 2011, 2010, 2014, 2015, 2018, 2023, 2010, 2014, 201…
#> $ n      <dbl> 3095, 2341, 27027, 19870, 18074, 25301, 19153, 11941, 6876, 702…
q |> dplyr::explain()
#> <SQL>
#> SELECT Survey, "Year", COUNT(*) AS n
#> FROM (
#>   SELECT
#>     RecordType,
#>     CONCAT_WS('', Survey, ' ', "Quarter") AS Survey,
#>     "Quarter",
#>     Country,
#>     Ship,
#>     Gear,
#>     SweepLngt,
#>     GearEx,
#>     DoorType,
#>     StNo,
#>     HaulNo,
#>     "Year",
#>     SpecCodeType,
#>     SpecCode,
#>     SpecVal,
#>     Sex,
#>     TotalNo,
#>     CatIdentifier,
#>     NoMeas,
#>     SubFactor,
#>     SubWgt,
#>     CatCatchWgt,
#>     LngtCode,
#>     LngtClass,
#>     HLNoAtLngt,
#>     DevStage,
#>     LenMeasType,
#>     DateofCalculation,
#>     Valid_Aphia
#>   FROM (
#>     SELECT
#>       RecordType,
#>       Survey,
#>       "Quarter",
#>       Country,
#>       Ship,
#>       Gear,
#>       SweepLngt,
#>       GearEx,
#>       DoorType,
#>       StNo,
#>       HaulNo,
#>       CAST("Year" AS INTEGER) AS "Year",
#>       SpecCodeType,
#>       SpecCode,
#>       SpecVal,
#>       Sex,
#>       TotalNo,
#>       CatIdentifier,
#>       NoMeas,
#>       SubFactor,
#>       SubWgt,
#>       CatCatchWgt,
#>       LngtCode,
#>       LngtClass,
#>       HLNoAtLngt,
#>       DevStage,
#>       LenMeasType,
#>       DateofCalculation,
#>       Valid_Aphia
#>     FROM iooirrdpbssatmk
#>   ) q01
#>   WHERE ("Year" >= 2010.0)
#> ) q01
#> GROUP BY Survey, "Year"
#> 
#> <PLAN>
#> physical_plan
#> ┌───────────────────────────┐
#> │       HASH_GROUP_BY       │
#> │   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
#> │          Groups:          │
#> │             #0            │
#> │             #1            │
#> │   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
#> │        Aggregates:        │
#> │        count_star()       │
#> │   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
#> │   Estimated Cardinality:  │
#> │          1351840          │
#> └─────────────┬─────────────┘
#> ┌─────────────┴─────────────┐
#> │         PROJECTION        │
#> │   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
#> │        Projections:       │
#> │           Survey          │
#> │            Year           │
#> │   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
#> │   Estimated Cardinality:  │
#> │          2703680          │
#> └─────────────┬─────────────┘
#> ┌─────────────┴─────────────┐
#> │         PROJECTION        │
#> │   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
#> │        Projections:       │
#> │           Survey          │
#> │            Year           │
#> │   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
#> │   Estimated Cardinality:  │
#> │          2703680          │
#> └─────────────┬─────────────┘
#> ┌─────────────┴─────────────┐
#> │         PROJECTION        │
#> │   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
#> │        Projections:       │
#> │           Survey          │
#> │          Quarter          │
#> │            Year           │
#> │   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
#> │   Estimated Cardinality:  │
#> │          2703680          │
#> └─────────────┬─────────────┘
#> ┌─────────────┴─────────────┐
#> │           FILTER          │
#> │   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
#> │           Name:           │
#> │ (CAST(Year AS INTEGER) >= │
#> │            2010)          │
#> │   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
#> │   Estimated Cardinality:  │
#> │          2703680          │
#> └─────────────┬─────────────┘
#> ┌─────────────┴─────────────┐
#> │       PARQUET_SCAN        │
#> │   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
#> │        Projections:       │
#> │           Survey          │
#> │          Quarter          │
#> │            Year           │
#> │   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
#> │   Estimated Cardinality:  │
#> │          13518402         │
#> └───────────────────────────┘

So basically we have generate a complex query, but yet not imported anything into R. If we want to get this into R (e.g. for plotting) we have to “collect” the data:

d <- 
  q |> 
  dplyr::collect()
d |> dplyr::glimpse()
#> Rows: 415
#> Columns: 3
#> $ Survey <chr> "BTS-VIII 4", "BTS-VIII 4", "Can-Mar 3", "BITS 4", "DYFS 4", "D…
#> $ Year   <int> 2019, 2022, 2011, 2021, 2010, 2014, 2015, 2016, 2018, 2023, 201…
#> $ n      <dbl> 3095, 2341, 27027, 24778, 6876, 7023, 8638, 25065, 9684, 8036, …

So it is only in this last step were we have loaded something into R.

Why should we do it this way:

  • Generally faster
  • Not limited by users computer memory