Getting started


The ggplot bible

Libraries

Needed libraries:

library(tidyverse)   # installs sweeps of packages
library(maps)        # some basic country maps
library(mapdata)     # higher resolution maps
library(marmap)      # access global topgraphy data

Key components of ggplot

Data used:

minke <- 
  read_csv("ftp://ftp.hafro.is/pub/data/csv/minke.csv")
glimpse(minke)
## Rows: 190
## Columns: 13
## $ id             <dbl> 1, 690, 926, 1333, 1334, 1335, 1336, 1338, 1339, 1341, …
## $ date           <dttm> 2004-06-10 22:00:00, 2004-06-15 17:00:00, 2004-06-22 0…
## $ lon            <dbl> -21.42350, -21.39183, -19.81333, -21.57500, -15.61167, …
## $ lat            <dbl> 65.66183, 65.65350, 66.51167, 65.67333, 66.29000, 66.17…
## $ area           <chr> "North", "North", "North", "North", "North", "North", "…
## $ length         <dbl> 780, 793, 858, 567, 774, 526, 809, 820, 697, 777, 739, …
## $ weight         <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ age            <dbl> 11.3, NA, 15.5, 7.2, 12.3, 9.6, 17.3, 13.8, 12.2, 15.4,…
## $ sex            <chr> "Female", "Female", "Female", "Male", "Female", "Female…
## $ maturity       <chr> "pregnant", "pregnant", "pregnant", "immature", "immatu…
## $ stomach.volume <dbl> 58, 90, 24, 25, 85, 18, 200, 111, 8, 25, 38, 6, 11, 100…
## $ stomach.weight <dbl> 31.900000, 36.290000, 9.420000, 3.640000, 5.510000, 1.1…
## $ year           <dbl> 2004, 2004, 2004, 2003, 2003, 2003, 2003, 2003, 2003, 2…

See full explanation of variables in Datasets - overview).

ggplot has three key components:

  1. argument data, which must be a data.frame or some derivative there of (tbl, data.table, …)

  2. A set of aesthetic mappings (called with the aes~function) between variables in the data and visual properties, and

  3. At least one call to a layer which describes how to render each observation.

ggplot(data = minke) +
  aes(x = lon, y = lat, colour = sex) + 
  layer(geom = "point", stat = "identity", position = "identity")

Generally we do not call layer directly but use functions starting with geom_ that are a shortcut calls to the layer-function. Hence the above call is normally written as:

ggplot(data = minke, aes(x = lon, y = lat, colour = sex)) + geom_point()

Different syntax, equivalent outcome:

ggplot()                        + geom_point(data = minke, aes(lon, lat, colour = sex))
ggplot(data = minke)            + geom_point(aes(x = lon, y = lat, colour = sex))
ggplot(minke)                   + geom_point(aes(lon, lat, colour = sex))
ggplot(minke, aes(lon, lat, colour = sex))    + geom_point()

A ggplot can be stored as an object for later use:

p <- ggplot(minke) + geom_point(aes(lon, lat, colour = sex))

In the above plot, we basically have mapped the longitude position on the x-axis and the latitude position on the y-axis. There are two things missing:

  • A background or a reference, giving the reader a better indication of the geographical region of the sample location.
  • The projection looks odd.

Reference map

  • Maps as background for r-plot can come from myriad of sources. Here we take an example of shoreline that reside in the map-packages.
  • To get the data into required ggplot2 form (a data.frame) we use the map_data-function.
iceland <- map_data("worldHires", region = "Iceland")
glimpse(iceland)
## Rows: 13,393
## Columns: 6
## $ long      <dbl> -14.55777, -14.56333, -14.56751, -14.57335, -14.57971, -14.5…
## $ lat       <dbl> 66.38364, 66.38003, 66.37751, 66.37470, 66.37303, 66.37029, …
## $ group     <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ order     <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 1…
## $ region    <chr> "Iceland", "Iceland", "Iceland", "Iceland", "Iceland", "Icel…
## $ subregion <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …

Here we have just a simple dataframe with 13393 coordinates (long and lat) and some other variables. We can try map these coordinates to different layers (more on labs later):

m <- 
  ggplot(iceland, aes(long, lat, group = group))
m1 <- m + geom_point()
m2 <- m + geom_line()
m3 <- m + geom_path()
m4 <- m + geom_polygon()
# quickfix: to see the plot type,
m1
m2
m3
m4

The above sweep of code and plots demonstrate that background maps are just a set of longitudinal and latitudinal data that are ordered in a specific way.

Projections

As noted above a map is just a xy-plot but with the addition of having a projections. A proper coverage of projections is done in a xxx, here we just cover some basics with respect to ggplot2. We could try to guess the projections (or rather the aspect ration of the plot) as done on the left or better use the specific functions available in ggplot:

m <- 
  ggplot(iceland, aes(long, lat, group = group)) +
  geom_path()
m1 <- m + coord_fixed(ratio = 2.4) 
m2 <- m + coord_quickmap()
m3 <- m + coord_map()
m4 <- m + coord_map(projection = "stereographic")

Note that the geom_quickmap is an approximation (albeit good one for most purposes), if one is operating on a larger scale geom_map may be more accurate (actually all maps are wrong when put on a two dimensional pane).

Adding layers

The power of ggplot comes into place when one adds layers on top of other layers. With the following code one first generates the background-map and then adds the minke datapoints on top of that background map:

m <-
  ggplot() +
  geom_polygon(data = iceland, 
               aes(long, lat, group = group)) +
  geom_point(data = minke, 
             aes(lon, lat, colour = sex))
m + coord_quickmap()

Besides adding a point layer on top of the polygon layer we have added one aesthetics:

  • colour: Here we are letting the value of the variable sex control the colour displayed. Hence we call that argument within the aes-function.

We will be introduced to more aesthetics as we move along.

Customization

ggplot2 provides a lot of sensible defaults that it derives from the data used to generate a plot. Here we will take a peek on those of that may be of most interest for now, further customization being introduced as we move along.

When one writes:

p <- 
  ggplot(minke) + 
  geom_point(aes(lon, lat, colour = sex))

what ggplot does is actually:

ggplot(minke) + 
  geom_point(aes(lon, lat, colour = sex)) +
  scale_x_continuous() +
  scale_y_continuous() +
  scale_colour_discrete()

Within the scale_-functions all the settings take the default argument, check e.g.:

?scale_x_continuous

Here we observe among other things: name, breaks, minor_breaks, limits, position, … Here is an example where among other the default name and breaks are overwritten:

p + 
  scale_x_continuous(name = "Longitude",
                     breaks = seq(-25, -10, by = 1),
                     minor_breaks = NULL) +
  scale_y_continuous(name = "Latitude",
                     breaks = seq(63, 67, by = 0.5),
                     minor_breaks = NULL) +
  scale_colour_discrete(name = "Sex") +
  coord_map()

If one was interested in:

  • Overwrite the default black colour in the polygon with grey (fill = “grey”)
  • Suppressing the axis labels (name = NULL)
  • Suppressing the axis values (breaks = NULL)
  • Change the default colour palette to brewer scale (palette = “Set1”)
  • Place the legend in the center of the figure (theme(legend.position = c(0.5, 0.5)))

we would do:

m <- 
  ggplot() +
  geom_polygon(data = iceland, 
               aes(long, lat, group = group),
               fill = "grey") +
  geom_point(data = minke,
             aes(lon, lat, colour = sex)) +
  scale_x_continuous(name = NULL,
                     breaks = NULL) +
  scale_y_continuous(name = NULL,
                     breaks = NULL) +
  scale_colour_brewer(name = "Sex",
                      palette = "Set1") +
  theme(legend.position = c(0.5, 0.5)) +
  coord_map()
m

If one only interested in overwriting the default axis-name one could use the labs-shortcut function.

One of ggplot defaults is that the range of the x and the y-axis is dictated by the range in the data. If we want to “zoom into” a plot we can overwrite the default boundaries of the horizontal and vertical axis by calling the arguments xlim and ylim within the coord_quickmap-function.

m + coord_quickmap(xlim = c(-25, -20), ylim = c(65, 66.6))

A list of all the fixed (english) colour available in R:

colours()

Exercise

Exercise
  • Emulate the following map of the Faeroe Islands (or use whatever country you like) using the data in the map-package (hint: Faeroe Island is a colony of a country).
  • What happens if you do not use the group-argument?

The minke data used is:

minke.fo <- 
  tibble(lon = c(-8.5, -5.5),
         lat = c(61.5, 62.2),
         sex = c("Male", "Female"))

More layers


Here we will only focus on ggplot layers that are useful when plotting spatial data.

Data used:

smb <- 
  read_csv("ftp://ftp.hafro.is/pub/data/csv/is_smb.csv") 
smb2019 <- filter(smb, year == 2019)
glimpse(smb2019)
## Rows: 581
## Columns: 40
## $ id             <dbl> 490893, 490829, 490830, 490831, 490832, 490833, 490834,…
## $ date           <date> 2019-03-06, 2019-03-06, 2019-03-06, 2019-03-06, 2019-0…
## $ year           <dbl> 2019, 2019, 2019, 2019, 2019, 2019, 2019, 2019, 2019, 2…
## $ vid            <dbl> 1131, 2350, 2350, 2350, 2350, 2350, 2350, 2350, 2350, 2…
## $ tow_id         <chr> "669-13", "673-4", "673-7", "672-4", "722-1", "722-3", …
## $ t1             <dttm> 2019-03-06 08:55:00, 2019-03-06 13:21:00, 2019-03-06 1…
## $ t2             <dttm> 2019-03-06 10:02:00, 2019-03-06 14:23:00, 2019-03-06 1…
## $ lon1           <dbl> -19.80333, -23.15833, -22.99800, -22.95317, -22.82617, …
## $ lat1           <dbl> 66.78167, 66.90200, 66.91200, 66.90350, 66.99933, 67.04…
## $ lon2           <dbl> -19.95550, -23.00733, -23.05833, -22.82517, -22.99583, …
## $ lat2           <dbl> 66.75300, 66.93250, 66.85033, 66.94700, 67.00033, 67.11…
## $ ir             <chr> "62D0", "62C6", "62C6", "62C7", "62C7", "63C7", "63C7",…
## $ ir_lon         <dbl> -19.5, -23.5, -23.5, -22.5, -22.5, -22.5, -22.5, -23.5,…
## $ ir_lat         <dbl> 66.75, 66.75, 66.75, 66.75, 66.75, 67.25, 67.25, 67.25,…
## $ z1             <dbl> 270, 226, 211, 207, 224, 235, 269, 243, 249, 240, 232, …
## $ z2             <dbl> 242, 222, 176, 212, 247, 250, 250, 230, 247, 246, 253, …
## $ temp_s         <dbl> 4.0, 3.1, 3.4, 3.3, 4.5, 4.6, 4.6, 4.8, 4.5, 5.0, 4.6, …
## $ temp_b         <dbl> 1.3, 2.8, 3.7, 3.7, 1.3, 1.8, 1.3, 1.7, 1.3, 3.4, 2.1, …
## $ speed          <dbl> 3.6, 3.9, 3.8, 3.8, 3.8, 3.8, 3.8, 3.8, 3.8, 3.7, 3.8, …
## $ duration       <dbl> 67, 62, 64, 63, 63, 64, 64, 64, 64, 65, 64, 60, 60, 60,…
## $ towlength      <dbl> 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4…
## $ horizontal     <dbl> 97, 103, 90, 98, 94, 92, 102, 101, 87, 93, 94, 99, 105,…
## $ verical        <dbl> 2.2, 2.3, 2.3, 2.4, 2.5, 2.6, 2.6, 2.6, 2.5, 2.8, 2.6, …
## $ wind           <dbl> 13, 10, 8, 3, 10, 12, 12, 14, 14, 12, 10, 7, 7, 7, 5, 3…
## $ wind_direction <dbl> 5, 2, 36, 7, 5, 5, 5, 5, 5, 5, 7, 5, 5, 5, 5, 5, 7, 2, …
## $ bormicon       <dbl> 3, 2, 2, 2, 2, 2, 3, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2…
## $ oldstrata      <dbl> 34, 87, 87, 87, 87, 87, 36, 87, 87, 87, 87, 87, 87, 87,…
## $ newstrata      <dbl> 31, 18, 40, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18,…
## $ cod_kg         <dbl> 4471.76481, 333.11652, 42.05824, 229.95894, 270.20725, …
## $ cod_n          <dbl> 1204, 98, 45, 98, 139, 329, 244, 85, 69, 82, 402, 381, …
## $ haddock_kg     <dbl> 302.572948, 31.090541, 490.880857, 92.117378, 0.000000,…
## $ haddock_n      <dbl> 210, 72, 646, 186, 0, 0, 0, 0, 2, 0, 0, 62, 73, 53, 0, …
## $ saithe_kg      <dbl> 9375.3863235, 0.0000000, 1.7301465, 1.4905674, 0.000000…
## $ saithe_n       <dbl> 2940, 0, 1, 1, 0, 0, 0, 1, 1, 0, 0, 2, 6, 2, 0, 0, 6, 3…
## $ wolffish_kg    <dbl> 1.3074065, 9.2663356, 3.4822189, 11.0757233, 4.6397027,…
## $ wolffish_n     <dbl> 3, 7, 12, 4, 12, 29, 3, 19, 4, 4, 9, 5, 6, 3, 1, 14, 13…
## $ plaice_kg      <dbl> 0.000000, 0.000000, 0.000000, 0.000000, 0.000000, 0.000…
## $ plaice_n       <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ monkfish_kg    <dbl> 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000,…
## $ monkfish_n     <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…

Since we are going to use some Icelandic survey data in the following example we first generate an appropriate reference background layer:

m <- 
  ggplot() +
  geom_polygon(data = iceland, aes(long, lat, group = group),
               fill = "grey") +
  coord_map() +
  scale_x_continuous(name = NULL, breaks = NULL) +
  scale_y_continuous(name = NULL, breaks = NULL)

geom_segment

The segment-layer is useful when wanting to display some very simple vector data that only have a start and an end point. Lets take the tow location from the Icelandic spring survey as an example:

m +
  geom_segment(data = smb2019,
               aes(x    = lon1, y    = lat1,
                   xend = lon2, yend = lat2))

geom_point

We can assign a variable to the size argument in the geom_point-layer, the size of the point then being some function of value of a variable. Lets visualize the abundance of cod:

p <-
  m +
  geom_point(data = smb2019,
             aes(lon1, lat1, size = cod_n),
             alpha = 0.2,
             colour = "red")
p

Here we have introduced a new aesthetic named alpha. This argument controls the transparency of the bubble, the value going from 0 (invisible) to 1 (no transparency, the ggplot default).

Besides the spatial patterns observed, there is one problem with this plot: The radius of the bubble size is not proportional to abundance. ggplot provides a very useful function to remedy this (now zero abundance gets a zero radius):

p +
  scale_size_area(max_size = 20)

The value assigned to max_size is a bit of a trial and error as well as personal preference.

Exercise

Exercise
  • Create a code that emulates the following plot for the abundance of cod (or for any other variable of interest).

geom_path

We have already shown a plot using the path-layer, but just to create a background map. For a more appropriate demonstration lets use the vessel track data from the 2019 spring survey. Besides the coordinate data the key additional element of this dataset is time, which it has already been ordered by.

track <- 
  read_csv("ftp://ftp.hafro.is/pub/data/csv/is_smb_vms2019.csv")
m +
  geom_path(data = track,
            aes(x = lon, y = lat, colour = vessel))

geom_tile

The survey data contains the ICES rectangles that a tow belongs (ir) to as well as the center coordinate position (ir_lon and ir_lat) we can use the tile-layer to display some characteristics of the data (we will later deal with how this was achieved).

However, since there is more than one tow in each tile we need first to generate some summary of the data by each rectangle (actually here by the center-points):

s <-
  smb2019 %>% 
  group_by(ir_lon, ir_lat) %>% 
  summarise(Cod = mean(cod_n))
p <-
  m +
  geom_tile(data = s,
            aes(x = ir_lon, y = ir_lat, fill = Cod))
p

Do not like the default colour?. Try:

p +
  scale_fill_viridis_c(option = "B", direction = -1)

This display has still at least two problems:

  • The background reference (Iceland) is now hidden behind some tiles (see exercise below).
  • The mapping of the value on the colour scale used may not be the most reviling (much more details on that later).

Exercise

Exercise

Ameliorate the code for the plot above such that is looks like this (you have to start from scratch):

geom_raster

The above plot using the tile-layer is actually a kind of a raster plot, albeit the resolution being not very high.

To demonstrate the use of the raster-layer lets use the global relief models from the ETOPO1 dataset hosted on a NOAA server can be accessed using the getNOAA.bathy-function in the marmap-package (you need to install that package). To access them one specifies the boundary of the data of interest and then, since we are using ggplot for mapping are turned into a data frame using the fortify-function:

xlim <- c(-28, -10)
ylim <- c(62.5, 67.5)

depth <- 
  getNOAA.bathy(lon1 = xlim[1], lon2 = xlim[2],
                lat1 = ylim[1], lat2 = ylim[2],
                resolution = 1) %>% 
  fortify() %>%  # turn the object into a data.frame
  filter(z <= 0)
glimpse(depth)
## Rows: 254,268
## Columns: 3
## $ x <dbl> -28.00000, -27.98332, -27.96664, -27.94995, -27.93327, -27.91659, -2…
## $ y <dbl> 67.5, 67.5, 67.5, 67.5, 67.5, 67.5, 67.5, 67.5, 67.5, 67.5, 67.5, 67…
## $ z <int> -297, -298, -300, -302, -304, -306, -306, -305, -305, -305, -305, -3…

So this data is just a set of x (longitude), y (latitudes) and z (depth). The dataset is a raster-grid which we can visualize by using the geom_raster-function:

m +
  geom_raster(data = depth,
              aes(x = x, y = y, fill = z)) +
  coord_quickmap() +
  scale_x_continuous(name = NULL, breaks = NULL) +
  scale_y_continuous(name = NULL, breaks = NULL) +
  labs(fill = "Depth [m]")

geom_contour

We will later deal with the syntax for generating contours from rasters. However, as is often the case, ggplot provides a shortcut for generating contours based on raster data, the contour-layer.

m2 <- 
  m +
  geom_contour(data = depth, aes(x, y, z = z),
               breaks = c(-25, -50, -100, -200, -400),
               lwd = 0.1) 
m2

Here we have specified to display the depth contours of 25, 50, 100, 200 and 400 meters.

Now we are ready to add the minke data or any other data of interest:

m2 + geom_point(data = minke, aes(lon, lat), colour = "red")

Exercise

Exercise
  1. Create a depth raster map of the region of your interest.
  2. Create a contour map of the region of your interest, specifying your own preference for the depth values to display.

and / or

  1. Create a code that emulates the following plot

Facetting

Facetting allows one to split up the data by a variable and display the same graph for each subset. One can split the data by one variable (wrap) or by two variables (grid).

facet_wrap

Lets say we want to visualize the survey biomass of a species (here haddock) over time. For simplicity here we will only use selected 6 years:

smb2 <- 
  # more on filter later
  filter(smb, year %in% c(1995, 2000, 2005, 2010, 2015, 2019))
p <-
  m + 
  geom_point(data = smb2,
             aes(lon1, lat1, size = haddock_kg),
             alpha = 0.15,
             colour = "red") +
  scale_size_area(max_size = 15)
p +
  facet_wrap(~ year)

facet_grid

data used: Survey abundance data of cod by year, age and yearclass at each station.

rbya <- 
  read_csv("ftp://ftp.hafro.is/pub/data/csv/is_smb_cod_rbya.csv")
rbya <- filter(rbya, yc %in% c(1984, 1985, 1996, 1997))
glimpse(rbya)
## Rows: 6,582
## Columns: 7
## $ id   <dbl> 31885, 31883, 31882, 31880, 31879, 31876, 31875, 31874, 31873, 31…
## $ year <dbl> 1986, 1986, 1986, 1986, 1986, 1986, 1986, 1986, 1986, 1986, 1986,…
## $ lon1 <dbl> -24.52467, -24.70017, -24.20867, -24.79200, -25.09133, -25.92833,…
## $ lat1 <dbl> 64.48383, 64.59000, 64.66183, 64.93800, 64.68400, 64.67583, 64.77…
## $ age  <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
## $ n    <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 0, 2, 0, 0, 0, 0, 11, 0, 0, 0, 4…
## $ yc   <dbl> 1985, 1985, 1985, 1985, 1985, 1985, 1985, 1985, 1985, 1985, 1985,…

Note: Here we have filtered out four year classes.

m + 
  geom_point(data = rbya,
             aes(lon1, lat1, size = n),
             alpha = 0.2,
             colour = "red") +
  scale_size_area(max_size = 10) +
  facet_grid(age ~ yc) +
  labs(title = "Abundance of cod", 
       subtitle = "By yearclass and age",
       size = "per 4 miles")

Note: The in the facet_grid-function the first variable (here age) is plotted as rows, the second variable (here yearclass) as columns.

Bits

Getting location from ggplot

In base R there is function called locator. For ggplot there is an equivalent function available in the ggmap-package. Check out:

library(ggmap)
?gglocator