If you have ever worked with DATRAS data you have gone through the cumbersome task of first spending time on getting the data. Either through point-and-mouse click download approach or possibly via some coding procedure using the DATRAS webservices.
The above access/distribution methods are well tested and server their purposes. One alternative may be that someone generates the three fundamental raw DATRAS data as parquet-files and stores them on a webspace, from which users can address them directly.
Here the raw DATRAS data for all the surveys have been combined into single tables (only the HH and the HL tables, the CA table is pending). There are at least two ways to access them into R:
# library(tidyverse)
The files are stored in parquet format. Consider them to be csv-files on steroids. We can read them directly into R by:
hh <- arrow::read_parquet("https://heima.hafro.is/~einarhj/DATRAS/HH.parquet")
hl <- arrow::read_parquet("https://heima.hafro.is/~einarhj/DATRAS/HL.parquet")
So here we got some ~140965 records, which basically are all the tows in the DATRAS database. And some ~13518402 records of length data. The count by surveys and years is as follows:
hl |>
dplyr::filter(Year >= 2010) |>
dplyr::mutate(Survey = paste0(Survey, " ", Quarter)) |>
dplyr::count(Survey, Year) |>
tidyr::spread(Year, n) |>
knitr::kable()
Survey | 2010 | 2011 | 2012 | 2013 | 2014 | 2015 | 2016 | 2017 | 2018 | 2019 | 2020 | 2021 | 2022 | 2023 | 2024 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
BITS 1 | 27639 | 26850 | 22854 | 26160 | 21510 | 23880 | 25144 | 31186 | 28532 | 25667 | 27709 | 27835 | 25472 | 28096 | 28324 |
BITS 4 | 20947 | 20106 | 18318 | 18675 | 20652 | 20848 | 27063 | 27557 | 24852 | 25010 | 25317 | 24778 | 27560 | 26849 | NA |
BTS 1 | 11533 | 11607 | 9141 | 12333 | 16717 | 18262 | 19407 | 19539 | 18963 | 21227 | 13954 | 18431 | 13037 | 17899 | NA |
BTS 3 | 40068 | 45800 | 47792 | 50217 | 48333 | 49060 | 49674 | 51295 | 51576 | 53766 | 45969 | 52992 | 37536 | 53945 | NA |
BTS 4 | 5866 | 5347 | 4987 | 4961 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
BTS-GSA17 4 | NA | NA | NA | NA | NA | NA | 1019 | 928 | 1123 | 993 | 1023 | 993 | 707 | 976 | NA |
BTS-VIII 4 | NA | 9805 | 8078 | 4558 | 6496 | 5148 | 4824 | 5118 | 5106 | 3095 | 4068 | 3729 | 2341 | 3461 | NA |
Can-Mar 3 | 23471 | 27027 | 21056 | 22989 | 19509 | 21168 | 28859 | 27158 | 10720 | 24764 | 21186 | 14215 | NA | NA | NA |
DYFS 3 | 24689 | 22773 | 29399 | 32811 | 38511 | 34923 | 25065 | 32664 | 30062 | 26802 | 35009 | 28940 | 33152 | 24493 | NA |
DYFS 4 | 6876 | 9614 | 9181 | 6838 | 7023 | 8638 | 7957 | 7821 | 9684 | 9535 | 5969 | 8818 | 8063 | 8036 | NA |
EVHOE 4 | 21953 | 25982 | 21526 | 22900 | 24727 | 24020 | 24979 | 3912 | 24965 | 24497 | 29614 | 24989 | 18524 | 23420 | NA |
FR-CGFS 4 | 4620 | 4918 | 5146 | 5649 | 8488 | 8647 | 9194 | 8134 | 8867 | 9945 | 8731 | 8662 | 8377 | 10040 | NA |
FR-WCGFS 3 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | 5219 | NA |
FR-WCGFS 4 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | 5446 | NA | NA | NA |
IE-IAMS 1 | NA | NA | NA | NA | NA | NA | 8554 | 6636 | 7039 | 8597 | 5723 | 4864 | 4205 | 10007 | NA |
IE-IAMS 2 | NA | NA | NA | NA | NA | NA | NA | 5190 | 6881 | 6306 | 3772 | 3708 | 5209 | 6952 | NA |
IE-IGFS 4 | 36790 | 31780 | 32456 | 34610 | 33612 | 29494 | 37997 | 29869 | 32449 | 32123 | 24367 | 31275 | 31595 | 33289 | NA |
NIGFS 1 | 10793 | 11555 | 11933 | 10695 | 11841 | 12847 | 14110 | 12005 | 10892 | 11433 | 11261 | 11058 | 11802 | 10893 | NA |
NIGFS 4 | 9106 | 8994 | 10299 | 9923 | 10755 | 11550 | 11380 | 10931 | 10920 | 10687 | 9922 | 8329 | 9061 | 6783 | NA |
NL-BSAS 3 | NA | NA | NA | NA | NA | NA | NA | NA | NA | 73 | 333 | 773 | 512 | 1505 | NA |
NL-BSAS 4 | NA | NA | NA | NA | NA | NA | NA | NA | NA | 1267 | 905 | 343 | 448 | NA | NA |
NS-IBTS 1 | 42942 | 42277 | 52063 | 47451 | 44670 | 52486 | 50151 | 54712 | 44244 | 44780 | 47663 | 51151 | 32767 | 45753 | 47705 |
NS-IBTS 2 | NA | 92 | 1324 | NA | NA | 787 | NA | NA | NA | NA | NA | NA | NA | NA | NA |
NS-IBTS 3 | 39620 | 42789 | 42692 | 40424 | 46572 | 48374 | 53164 | 49428 | 47228 | 48456 | 52849 | 51987 | 49676 | 50732 | NA |
NS-IBTS_UNIFtest 1 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | 15 | NA | NA | NA |
NS-IBTS_UNIFtest 4 | NA | NA | NA | NA | NA | NA | NA | NA | NA | 189 | NA | NA | NA | NA | NA |
NS-IDPS 3 | NA | NA | NA | 1898 | NA | NA | 230 | NA | NA | NA | NA | NA | NA | NA | NA |
NSSS 4 | 1106 | 759 | 2872 | 3078 | 4091 | 3611 | 4997 | 4294 | 3892 | 4923 | 4593 | 4499 | 4833 | 3026 | NA |
PT-IBTS 4 | 6089 | 5788 | NA | 6224 | 5553 | 6479 | 5345 | 7240 | 3851 | NA | NA | 9511 | 6893 | 4741 | NA |
SCOROC 3 | NA | 5146 | 4748 | 3990 | 7067 | 6282 | 6413 | 6294 | 6203 | 6589 | 5047 | 5981 | 5933 | 5888 | NA |
SCOWCGFS 1 | NA | 9196 | 10475 | 12551 | 10690 | 10931 | 12089 | 12036 | 10869 | 11015 | 11069 | 10977 | NA | 11598 | 10978 |
SCOWCGFS 4 | NA | 8729 | 11259 | 4457 | 11704 | 10561 | 11315 | 10311 | 9653 | 11165 | 9571 | 9711 | 11196 | 10263 | NA |
SE-SOUND 1 | NA | 440 | 328 | 299 | 246 | 248 | 344 | 153 | 218 | 671 | 544 | 831 | 694 | 737 | NA |
SE-SOUND 3 | NA | 321 | 272 | 340 | 275 | 310 | 280 | 227 | NA | NA | NA | NA | NA | NA | NA |
SE-SOUND 4 | NA | NA | NA | NA | NA | NA | NA | NA | 439 | 707 | 571 | 677 | 889 | 649 | NA |
SNS 3 | 4052 | 3362 | NA | 3892 | 3988 | 3877 | 3571 | 3589 | 3751 | 3144 | 3505 | 3056 | 3119 | 3228 | NA |
SNS 4 | NA | NA | 3295 | NA | NA | NA | NA | NA | NA | 189 | NA | NA | NA | NA | NA |
SP-ARSA 1 | 3752 | 3560 | 2702 | 4185 | 4406 | 4187 | 5304 | 5080 | 7127 | 10096 | 5101 | NA | 5133 | 5453 | NA |
SP-ARSA 4 | 5110 | 4451 | 5075 | 6006 | 5411 | 6321 | 5595 | 5949 | 11668 | 5771 | 5601 | NA | 6067 | 5966 | NA |
SP-NORTH 3 | 4900 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
SP-NORTH 4 | 23722 | 24693 | 24579 | 30081 | 28118 | 29357 | 28678 | 28313 | 25592 | 27827 | 24355 | 24233 | 16721 | 17306 | NA |
SP-PORC 3 | 19870 | 18799 | 21806 | 24005 | 18074 | 25301 | 20919 | 18948 | 19153 | 18522 | 13074 | 21721 | 14215 | 11941 | NA |
SWC-IBTS 1 | 8860 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
Comment: Somewhat surprising how few surveys in 2024, possibly because of the attack.
Another way is to use duckDB:
q_hh <- duckdbfs::open_dataset("https://heima.hafro.is/~einarhj/hh/hh.parquet")
q_hl <- duckdbfs::open_dataset("https://heima.hafro.is/~einarhj/hh/hl.parquet")
q_hh |> dplyr::glimpse()
#> Rows: ??
#> Columns: 69
#> Database: DuckDB v1.0.1-dev4228 [unknown@Linux 5.10.0-28-amd64:R 4.3.2/:memory:]
#> $ RecordType <chr> "HH", "HH", "HH", "HH", "HH", "HH", "HH", "HH", "HH"…
#> $ Survey <chr> "BITS", "BITS", "BITS", "BITS", "BITS", "BITS", "BIT…
#> $ Quarter <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
#> $ Country <chr> "DK", "DK", "DK", "DK", "DK", "DK", "DK", "DK", "DK"…
#> $ Ship <chr> "26D4", "26D4", "26D4", "26D4", "26D4", "26D4", "26D…
#> $ Gear <chr> "CAM", "CAM", "CAM", "EXP", "EXP", "GRT", "GRT", "GR…
#> $ SweepLngt <int> NA, NA, NA, 110, 110, NA, NA, NA, NA, NA, NA, NA, NA…
#> $ GearEx <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
#> $ DoorType <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
#> $ StNo <chr> "150", "151", "152", "149", "147", "10", "101", "1",…
#> $ HaulNo <int> 67, 68, 69, 66, 65, 8, 54, 2, 55, 56, 57, 59, 60, 58…
#> $ Year <chr> "1991", "1991", "1991", "1991", "1991", "1991", "199…
#> $ Month <int> 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3…
#> $ Day <int> 20, 20, 20, 19, 19, 6, 17, 5, 17, 17, 17, 17, 18, 17…
#> $ TimeShot <chr> "514", "644", "923", "2128", "1829", "1417", "511", …
#> $ DepthStratum <chr> "11", "11", "12", "12", "12", "10", "11", "9", "11",…
#> $ HaulDur <int> 60, 60, 60, 60, 60, 60, 60, 60, 60, 60, 60, 60, 60, …
#> $ DayNight <chr> "D", "D", "D", "N", "N", "D", "D", "D", "D", "D", "D…
#> $ ShootLat <dbl> 55.6000, 55.6667, 55.5167, 55.4500, 55.5500, 54.9833…
#> $ ShootLong <dbl> 16.2500, 16.2667, 16.1667, 15.1167, 15.1833, 14.1500…
#> $ HaulLat <dbl> -9, -9, -9, -9, -9, -9, -9, -9, -9, -9, -9, -9, -9, …
#> $ HaulLong <dbl> -9, -9, -9, -9, -9, -9, -9, -9, -9, -9, -9, -9, -9, …
#> $ StatRec <chr> "40G6", "40G6", "40G6", "39G5", "40G5", "38G4", "39G…
#> $ Depth <int> 76, 71, 80, 83, 80, 47, 79, 34, 60, 80, 80, 73, 64, …
#> $ HaulVal <chr> "V", "V", "V", "V", "V", "V", "V", "V", "V", "V", "V…
#> $ HydroStNo <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, "104+5", NA, "10…
#> $ StdSpecRecCode <chr> "1", "1", "1", "1", "1", "1", "1", "1", "1", "1", "1…
#> $ BySpecRecCode <chr> "1", "1", "1", "1", "1", "1", "1", "1", "1", "1", "1…
#> $ DataType <chr> "C", "C", "C", "C", "C", "C", "C", "C", "C", "C", "C…
#> $ Netopening <dbl> -9, 5, 5, 7, 16, 3, 3, 4, 3, 3, 3, 3, 3, 3, 3, 3, 3,…
#> $ Rigging <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
#> $ Tickler <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
#> $ Distance <int> 6111, 6482, 6482, 6667, 8519, 6667, 6482, 6296, 6667…
#> $ Warplngt <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
#> $ Warpdia <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
#> $ WarpDen <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
#> $ DoorSurface <dbl> -9, -9, -9, -9, -9, -9, -9, -9, -9, -9, -9, -9, -9, …
#> $ DoorWgt <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
#> $ DoorSpread <dbl> -9, -9, -9, -9, -9, -9, -9, -9, -9, -9, -9, -9, -9, …
#> $ WingSpread <dbl> -9, -9, -9, -9, -9, -9, -9, -9, -9, -9, -9, -9, -9, …
#> $ Buoyancy <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
#> $ KiteDim <dbl> -9, -9, -9, -9, -9, -9, -9, -9, -9, -9, -9, -9, -9, …
#> $ WgtGroundRope <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
#> $ TowDir <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
#> $ GroundSpeed <dbl> -9, -9, -9, -9, -9, -9, -9, -9, -9, -9, -9, -9, -9, …
#> $ SpeedWater <dbl> -9, -9, -9, -9, -9, -9, -9, -9, -9, -9, -9, -9, -9, …
#> $ SurCurDir <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
#> $ SurCurSpeed <dbl> -9, -9, -9, -9, -9, -9, -9, -9, -9, -9, -9, -9, -9, …
#> $ BotCurDir <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
#> $ BotCurSpeed <dbl> -9, -9, -9, -9, -9, -9, -9, -9, -9, -9, -9, -9, -9, …
#> $ WindDir <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
#> $ WindSpeed <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
#> $ SwellDir <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
#> $ SwellHeight <dbl> -9, -9, -9, -9, -9, -9, -9, -9, -9, -9, -9, -9, -9, …
#> $ SurTemp <dbl> -9, -9, -9, -9, -9, -9, -9, -9, -9, -9, -9, -9, -9, …
#> $ BotTemp <dbl> -9, -9, -9, -9, -9, -9, -9, -9, -9, -9, -9, -9, -9, …
#> $ SurSal <dbl> -9, -9, -9, -9, -9, -9, -9, -9, -9, -9, -9, -9, -9, …
#> $ BotSal <dbl> -9, -9, -9, -9, -9, -9, -9, -9, -9, -9, -9, -9, -9, …
#> $ ThermoCline <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
#> $ ThClineDepth <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
#> $ CodendMesh <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
#> $ SecchiDepth <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
#> $ Turbidity <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
#> $ TidePhase <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
#> $ TideSpeed <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
#> $ PelSampType <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
#> $ MinTrawlDepth <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
#> $ MaxTrawlDepth <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
#> $ DateofCalculation <chr> "20220301", "20220301", "20220301", "20220301", "202…
q_hl |> dplyr::glimpse()
#> Rows: ??
#> Columns: 29
#> Database: DuckDB v1.0.1-dev4228 [unknown@Linux 5.10.0-28-amd64:R 4.3.2/:memory:]
#> $ RecordType <chr> "HL", "HL", "HL", "HL", "HL", "HL", "HL", "HL", "HL"…
#> $ Survey <chr> "BITS", "BITS", "BITS", "BITS", "BITS", "BITS", "BIT…
#> $ Quarter <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
#> $ Country <chr> "DK", "DK", "DK", "DK", "DK", "DK", "DK", "DK", "DK"…
#> $ Ship <chr> "26D4", "26D4", "26D4", "26D4", "26D4", "26D4", "26D…
#> $ Gear <chr> "GRT", "GRT", "GRT", "GRT", "GRT", "GRT", "GRT", "GR…
#> $ SweepLngt <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
#> $ GearEx <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
#> $ DoorType <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
#> $ StNo <chr> "124", "124", "124", "124", "124", "124", "124", "12…
#> $ HaulNo <int> 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, …
#> $ Year <chr> "1991", "1991", "1991", "1991", "1991", "1991", "199…
#> $ SpecCodeType <chr> "W", "W", "W", "W", "W", "W", "W", "W", "W", "W", "W…
#> $ SpecCode <chr> "127141", "127141", "127141", "127141", "127141", "1…
#> $ SpecVal <chr> "1", "1", "1", "1", "1", "1", "1", "1", "1", "1", "1…
#> $ Sex <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
#> $ TotalNo <dbl> 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 1, 2, 2,…
#> $ CatIdentifier <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
#> $ NoMeas <int> 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 1, 2, 2,…
#> $ SubFactor <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
#> $ SubWgt <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
#> $ CatCatchWgt <int> 47, 47, 47, 47, 47, 47, 47, 47, 47, 47, 47, 4, 28, 2…
#> $ LngtCode <chr> "1", "1", "1", "1", "1", "1", "1", "1", "1", "1", "1…
#> $ LngtClass <int> 22, 24, 25, 26, 27, 28, 29, 30, 31, 35, 37, 32, 44, …
#> $ HLNoAtLngt <dbl> 1, 3, 2, 3, 1, 3, 2, 1, 1, 1, 1, 1, 1, 1, 113, 1, 1,…
#> $ DevStage <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
#> $ LenMeasType <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
#> $ DateofCalculation <chr> "20220301", "20220301", "20220301", "20220301", "202…
#> $ Valid_Aphia <chr> "127141", "127141", "127141", "127141", "127141", "1…
Note the header in the printout above, basically we do not have the number of rows. Let’s look at what we got:
q_hh |> dplyr::explain()
#> <SQL>
#> SELECT *
#> FROM uivgipaxhlkpppr
#>
#> <PLAN>
#> physical_plan
#> ┌───────────────────────────┐
#> │ PARQUET_SCAN │
#> │ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ │
#> │ Projections: │
#> │ RecordType │
#> │ Survey │
#> │ Quarter │
#> │ Country │
#> │ Ship │
#> │ Gear │
#> │ SweepLngt │
#> │ GearEx │
#> │ DoorType │
#> │ StNo │
#> │ HaulNo │
#> │ Year │
#> │ Month │
#> │ Day │
#> │ TimeShot │
#> │ DepthStratum │
#> │ HaulDur │
#> │ DayNight │
#> │ ShootLat │
#> │ ShootLong │
#> │ HaulLat │
#> │ HaulLong │
#> │ StatRec │
#> │ Depth │
#> │ HaulVal │
#> │ HydroStNo │
#> │ StdSpecRecCode │
#> │ BySpecRecCode │
#> └───────────────────────────┘
So basically we have not imported anything into R yet, we just having established a connection to the tables. We could proceed a little further using regular dplyr codes:
q <-
q_hl |>
dplyr::mutate(Year = as.integer(Year)) |> # Fix this upstream
dplyr::filter(Year >= 2010) |>
dplyr::mutate(Survey = paste0(Survey, " ", Quarter)) |>
dplyr::count(Survey, Year)
q |> dplyr::glimpse()
#> Rows: ??
#> Columns: 3
#> Database: DuckDB v1.0.1-dev4228 [unknown@Linux 5.10.0-28-amd64:R 4.3.2/:memory:]
#> $ Survey <chr> "BTS-VIII 4", "BTS-VIII 4", "Can-Mar 3", "SP-PORC 3", "SP-PORC …
#> $ Year <int> 2019, 2022, 2011, 2010, 2014, 2015, 2018, 2023, 2010, 2014, 201…
#> $ n <dbl> 3095, 2341, 27027, 19870, 18074, 25301, 19153, 11941, 6876, 702…
q |> dplyr::explain()
#> <SQL>
#> SELECT Survey, "Year", COUNT(*) AS n
#> FROM (
#> SELECT
#> RecordType,
#> CONCAT_WS('', Survey, ' ', "Quarter") AS Survey,
#> "Quarter",
#> Country,
#> Ship,
#> Gear,
#> SweepLngt,
#> GearEx,
#> DoorType,
#> StNo,
#> HaulNo,
#> "Year",
#> SpecCodeType,
#> SpecCode,
#> SpecVal,
#> Sex,
#> TotalNo,
#> CatIdentifier,
#> NoMeas,
#> SubFactor,
#> SubWgt,
#> CatCatchWgt,
#> LngtCode,
#> LngtClass,
#> HLNoAtLngt,
#> DevStage,
#> LenMeasType,
#> DateofCalculation,
#> Valid_Aphia
#> FROM (
#> SELECT
#> RecordType,
#> Survey,
#> "Quarter",
#> Country,
#> Ship,
#> Gear,
#> SweepLngt,
#> GearEx,
#> DoorType,
#> StNo,
#> HaulNo,
#> CAST("Year" AS INTEGER) AS "Year",
#> SpecCodeType,
#> SpecCode,
#> SpecVal,
#> Sex,
#> TotalNo,
#> CatIdentifier,
#> NoMeas,
#> SubFactor,
#> SubWgt,
#> CatCatchWgt,
#> LngtCode,
#> LngtClass,
#> HLNoAtLngt,
#> DevStage,
#> LenMeasType,
#> DateofCalculation,
#> Valid_Aphia
#> FROM iooirrdpbssatmk
#> ) q01
#> WHERE ("Year" >= 2010.0)
#> ) q01
#> GROUP BY Survey, "Year"
#>
#> <PLAN>
#> physical_plan
#> ┌───────────────────────────┐
#> │ HASH_GROUP_BY │
#> │ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ │
#> │ Groups: │
#> │ #0 │
#> │ #1 │
#> │ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ │
#> │ Aggregates: │
#> │ count_star() │
#> │ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ │
#> │ Estimated Cardinality: │
#> │ 1351840 │
#> └─────────────┬─────────────┘
#> ┌─────────────┴─────────────┐
#> │ PROJECTION │
#> │ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ │
#> │ Projections: │
#> │ Survey │
#> │ Year │
#> │ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ │
#> │ Estimated Cardinality: │
#> │ 2703680 │
#> └─────────────┬─────────────┘
#> ┌─────────────┴─────────────┐
#> │ PROJECTION │
#> │ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ │
#> │ Projections: │
#> │ Survey │
#> │ Year │
#> │ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ │
#> │ Estimated Cardinality: │
#> │ 2703680 │
#> └─────────────┬─────────────┘
#> ┌─────────────┴─────────────┐
#> │ PROJECTION │
#> │ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ │
#> │ Projections: │
#> │ Survey │
#> │ Quarter │
#> │ Year │
#> │ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ │
#> │ Estimated Cardinality: │
#> │ 2703680 │
#> └─────────────┬─────────────┘
#> ┌─────────────┴─────────────┐
#> │ FILTER │
#> │ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ │
#> │ Name: │
#> │ (CAST(Year AS INTEGER) >= │
#> │ 2010) │
#> │ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ │
#> │ Estimated Cardinality: │
#> │ 2703680 │
#> └─────────────┬─────────────┘
#> ┌─────────────┴─────────────┐
#> │ PARQUET_SCAN │
#> │ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ │
#> │ Projections: │
#> │ Survey │
#> │ Quarter │
#> │ Year │
#> │ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ │
#> │ Estimated Cardinality: │
#> │ 13518402 │
#> └───────────────────────────┘
So basically we have generate a complex query, but yet not imported anything into R. If we want to get this into R (e.g. for plotting) we have to “collect” the data:
d <-
q |>
dplyr::collect()
d |> dplyr::glimpse()
#> Rows: 415
#> Columns: 3
#> $ Survey <chr> "BTS-VIII 4", "BTS-VIII 4", "Can-Mar 3", "BITS 4", "DYFS 4", "D…
#> $ Year <int> 2019, 2022, 2011, 2021, 2010, 2014, 2015, 2016, 2018, 2023, 201…
#> $ n <dbl> 3095, 2341, 27027, 24778, 6876, 7023, 8638, 25065, 9684, 8036, …
So it is only in this last step were we have loaded something into R.
Why should we do it this way: