The ggplot bible
Needed libraries:
library(tidyverse) # installs sweeps of packages
library(maps) # some basic country maps
library(mapdata) # higher resolution maps
library(marmap) # access global topgraphy data
Data used:
<-
minke read_csv("ftp://ftp.hafro.is/pub/data/csv/minke.csv")
glimpse(minke)
## Rows: 190
## Columns: 13
## $ id <dbl> 1, 690, 926, 1333, 1334, 1335, 1336, 1338, 1339, 1341, …
## $ date <dttm> 2004-06-10 22:00:00, 2004-06-15 17:00:00, 2004-06-22 0…
## $ lon <dbl> -21.42350, -21.39183, -19.81333, -21.57500, -15.61167, …
## $ lat <dbl> 65.66183, 65.65350, 66.51167, 65.67333, 66.29000, 66.17…
## $ area <chr> "North", "North", "North", "North", "North", "North", "…
## $ length <dbl> 780, 793, 858, 567, 774, 526, 809, 820, 697, 777, 739, …
## $ weight <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ age <dbl> 11.3, NA, 15.5, 7.2, 12.3, 9.6, 17.3, 13.8, 12.2, 15.4,…
## $ sex <chr> "Female", "Female", "Female", "Male", "Female", "Female…
## $ maturity <chr> "pregnant", "pregnant", "pregnant", "immature", "immatu…
## $ stomach.volume <dbl> 58, 90, 24, 25, 85, 18, 200, 111, 8, 25, 38, 6, 11, 100…
## $ stomach.weight <dbl> 31.900000, 36.290000, 9.420000, 3.640000, 5.510000, 1.1…
## $ year <dbl> 2004, 2004, 2004, 2003, 2003, 2003, 2003, 2003, 2003, 2…
See full explanation of variables in Datasets - overview).
ggplot has three key components:
argument data, which must be a
data.frame
or some derivative there of (tbl
,
data.table
, …)
A set of aesthetic mappings (called with the aes~function) between variables in the data and visual properties, and
At least one call to a layer which describes how to render each observation.
ggplot(data = minke) +
aes(x = lon, y = lat, colour = sex) +
layer(geom = "point", stat = "identity", position = "identity")
Generally we do not call layer directly but use functions starting with geom_ that are a shortcut calls to the layer-function. Hence the above call is normally written as:
ggplot(data = minke, aes(x = lon, y = lat, colour = sex)) + geom_point()
Different syntax, equivalent outcome:
ggplot() + geom_point(data = minke, aes(lon, lat, colour = sex))
ggplot(data = minke) + geom_point(aes(x = lon, y = lat, colour = sex))
ggplot(minke) + geom_point(aes(lon, lat, colour = sex))
ggplot(minke, aes(lon, lat, colour = sex)) + geom_point()
A ggplot can be stored as an object for later use:
<- ggplot(minke) + geom_point(aes(lon, lat, colour = sex)) p
In the above plot, we basically have mapped the longitude position on the x-axis and the latitude position on the y-axis. There are two things missing:
<- map_data("worldHires", region = "Iceland")
iceland glimpse(iceland)
## Rows: 13,393
## Columns: 6
## $ long <dbl> -14.55777, -14.56333, -14.56751, -14.57335, -14.57971, -14.5…
## $ lat <dbl> 66.38364, 66.38003, 66.37751, 66.37470, 66.37303, 66.37029, …
## $ group <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ order <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 1…
## $ region <chr> "Iceland", "Iceland", "Iceland", "Iceland", "Iceland", "Icel…
## $ subregion <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
Here we have just a simple dataframe with 13393 coordinates (long and lat) and some other variables. We can try map these coordinates to different layers (more on labs later):
<-
m ggplot(iceland, aes(long, lat, group = group))
<- m + geom_point()
m1 <- m + geom_line()
m2 <- m + geom_path()
m3 <- m + geom_polygon() m4
# quickfix: to see the plot type,
m1
m2
m3 m4
The above sweep of code and plots demonstrate that background maps are just a set of longitudinal and latitudinal data that are ordered in a specific way.
As noted above a map is just a xy-plot but with the addition of having a projections. A proper coverage of projections is done in a xxx, here we just cover some basics with respect to ggplot2. We could try to guess the projections (or rather the aspect ration of the plot) as done on the left or better use the specific functions available in ggplot:
<-
m ggplot(iceland, aes(long, lat, group = group)) +
geom_path()
<- m + coord_fixed(ratio = 2.4)
m1 <- m + coord_quickmap()
m2 <- m + coord_map()
m3 <- m + coord_map(projection = "stereographic") m4
Note that the geom_quickmap is an approximation (albeit good one for most purposes), if one is operating on a larger scale geom_map may be more accurate (actually all maps are wrong when put on a two dimensional pane).
The power of ggplot comes into place when one adds layers on top of other layers. With the following code one first generates the background-map and then adds the minke datapoints on top of that background map:
<-
m ggplot() +
geom_polygon(data = iceland,
aes(long, lat, group = group)) +
geom_point(data = minke,
aes(lon, lat, colour = sex))
+ coord_quickmap() m
Besides adding a point layer on top of the polygon layer we have added one aesthetics:
colour
: Here we are letting the value of the variable
sex control the colour displayed. Hence we call that argument within the
aes-function.We will be introduced to more aesthetics as we move along.
ggplot2 provides a lot of sensible defaults that it derives from the data used to generate a plot. Here we will take a peek on those of that may be of most interest for now, further customization being introduced as we move along.
When one writes:
<-
p ggplot(minke) +
geom_point(aes(lon, lat, colour = sex))
what ggplot does is actually:
ggplot(minke) +
geom_point(aes(lon, lat, colour = sex)) +
scale_x_continuous() +
scale_y_continuous() +
scale_colour_discrete()
Within the scale_-functions all the settings take the default argument, check e.g.:
?scale_x_continuous
Here we observe among other things: name, breaks, minor_breaks, limits, position, … Here is an example where among other the default name and breaks are overwritten:
+
p scale_x_continuous(name = "Longitude",
breaks = seq(-25, -10, by = 1),
minor_breaks = NULL) +
scale_y_continuous(name = "Latitude",
breaks = seq(63, 67, by = 0.5),
minor_breaks = NULL) +
scale_colour_discrete(name = "Sex") +
coord_map()
If one was interested in:
we would do:
<-
m ggplot() +
geom_polygon(data = iceland,
aes(long, lat, group = group),
fill = "grey") +
geom_point(data = minke,
aes(lon, lat, colour = sex)) +
scale_x_continuous(name = NULL,
breaks = NULL) +
scale_y_continuous(name = NULL,
breaks = NULL) +
scale_colour_brewer(name = "Sex",
palette = "Set1") +
theme(legend.position = c(0.5, 0.5)) +
coord_map()
m
If one only interested in overwriting the default axis-name one could use the labs-shortcut function.
One of ggplot defaults is that the range of the x and the y-axis is dictated by the range in the data. If we want to “zoom into” a plot we can overwrite the default boundaries of the horizontal and vertical axis by calling the arguments xlim and ylim within the coord_quickmap-function.
+ coord_quickmap(xlim = c(-25, -20), ylim = c(65, 66.6)) m
A list of all the fixed (english) colour available in R:
colours()
The minke data used is:
<-
minke.fo tibble(lon = c(-8.5, -5.5),
lat = c(61.5, 62.2),
sex = c("Male", "Female"))
Here we will only focus on ggplot layers that are useful when plotting spatial data.
Data used:
<-
smb read_csv("ftp://ftp.hafro.is/pub/data/csv/is_smb.csv")
<- filter(smb, year == 2019)
smb2019 glimpse(smb2019)
## Rows: 581
## Columns: 40
## $ id <dbl> 490893, 490829, 490830, 490831, 490832, 490833, 490834,…
## $ date <date> 2019-03-06, 2019-03-06, 2019-03-06, 2019-03-06, 2019-0…
## $ year <dbl> 2019, 2019, 2019, 2019, 2019, 2019, 2019, 2019, 2019, 2…
## $ vid <dbl> 1131, 2350, 2350, 2350, 2350, 2350, 2350, 2350, 2350, 2…
## $ tow_id <chr> "669-13", "673-4", "673-7", "672-4", "722-1", "722-3", …
## $ t1 <dttm> 2019-03-06 08:55:00, 2019-03-06 13:21:00, 2019-03-06 1…
## $ t2 <dttm> 2019-03-06 10:02:00, 2019-03-06 14:23:00, 2019-03-06 1…
## $ lon1 <dbl> -19.80333, -23.15833, -22.99800, -22.95317, -22.82617, …
## $ lat1 <dbl> 66.78167, 66.90200, 66.91200, 66.90350, 66.99933, 67.04…
## $ lon2 <dbl> -19.95550, -23.00733, -23.05833, -22.82517, -22.99583, …
## $ lat2 <dbl> 66.75300, 66.93250, 66.85033, 66.94700, 67.00033, 67.11…
## $ ir <chr> "62D0", "62C6", "62C6", "62C7", "62C7", "63C7", "63C7",…
## $ ir_lon <dbl> -19.5, -23.5, -23.5, -22.5, -22.5, -22.5, -22.5, -23.5,…
## $ ir_lat <dbl> 66.75, 66.75, 66.75, 66.75, 66.75, 67.25, 67.25, 67.25,…
## $ z1 <dbl> 270, 226, 211, 207, 224, 235, 269, 243, 249, 240, 232, …
## $ z2 <dbl> 242, 222, 176, 212, 247, 250, 250, 230, 247, 246, 253, …
## $ temp_s <dbl> 4.0, 3.1, 3.4, 3.3, 4.5, 4.6, 4.6, 4.8, 4.5, 5.0, 4.6, …
## $ temp_b <dbl> 1.3, 2.8, 3.7, 3.7, 1.3, 1.8, 1.3, 1.7, 1.3, 3.4, 2.1, …
## $ speed <dbl> 3.6, 3.9, 3.8, 3.8, 3.8, 3.8, 3.8, 3.8, 3.8, 3.7, 3.8, …
## $ duration <dbl> 67, 62, 64, 63, 63, 64, 64, 64, 64, 65, 64, 60, 60, 60,…
## $ towlength <dbl> 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4…
## $ horizontal <dbl> 97, 103, 90, 98, 94, 92, 102, 101, 87, 93, 94, 99, 105,…
## $ verical <dbl> 2.2, 2.3, 2.3, 2.4, 2.5, 2.6, 2.6, 2.6, 2.5, 2.8, 2.6, …
## $ wind <dbl> 13, 10, 8, 3, 10, 12, 12, 14, 14, 12, 10, 7, 7, 7, 5, 3…
## $ wind_direction <dbl> 5, 2, 36, 7, 5, 5, 5, 5, 5, 5, 7, 5, 5, 5, 5, 5, 7, 2, …
## $ bormicon <dbl> 3, 2, 2, 2, 2, 2, 3, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2…
## $ oldstrata <dbl> 34, 87, 87, 87, 87, 87, 36, 87, 87, 87, 87, 87, 87, 87,…
## $ newstrata <dbl> 31, 18, 40, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18,…
## $ cod_kg <dbl> 4471.76481, 333.11652, 42.05824, 229.95894, 270.20725, …
## $ cod_n <dbl> 1204, 98, 45, 98, 139, 329, 244, 85, 69, 82, 402, 381, …
## $ haddock_kg <dbl> 302.572948, 31.090541, 490.880857, 92.117378, 0.000000,…
## $ haddock_n <dbl> 210, 72, 646, 186, 0, 0, 0, 0, 2, 0, 0, 62, 73, 53, 0, …
## $ saithe_kg <dbl> 9375.3863235, 0.0000000, 1.7301465, 1.4905674, 0.000000…
## $ saithe_n <dbl> 2940, 0, 1, 1, 0, 0, 0, 1, 1, 0, 0, 2, 6, 2, 0, 0, 6, 3…
## $ wolffish_kg <dbl> 1.3074065, 9.2663356, 3.4822189, 11.0757233, 4.6397027,…
## $ wolffish_n <dbl> 3, 7, 12, 4, 12, 29, 3, 19, 4, 4, 9, 5, 6, 3, 1, 14, 13…
## $ plaice_kg <dbl> 0.000000, 0.000000, 0.000000, 0.000000, 0.000000, 0.000…
## $ plaice_n <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ monkfish_kg <dbl> 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000,…
## $ monkfish_n <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
Since we are going to use some Icelandic survey data in the following example we first generate an appropriate reference background layer:
<-
m ggplot() +
geom_polygon(data = iceland, aes(long, lat, group = group),
fill = "grey") +
coord_map() +
scale_x_continuous(name = NULL, breaks = NULL) +
scale_y_continuous(name = NULL, breaks = NULL)
The segment-layer is useful when wanting to display some very simple vector data that only have a start and an end point. Lets take the tow location from the Icelandic spring survey as an example:
+
m geom_segment(data = smb2019,
aes(x = lon1, y = lat1,
xend = lon2, yend = lat2))
We can assign a variable to the size argument in the geom_point-layer, the size of the point then being some function of value of a variable. Lets visualize the abundance of cod:
<-
p +
m geom_point(data = smb2019,
aes(lon1, lat1, size = cod_n),
alpha = 0.2,
colour = "red")
p
Here we have introduced a new aesthetic named alpha. This argument controls the transparency of the bubble, the value going from 0 (invisible) to 1 (no transparency, the ggplot default).
Besides the spatial patterns observed, there is one problem with this plot: The radius of the bubble size is not proportional to abundance. ggplot provides a very useful function to remedy this (now zero abundance gets a zero radius):
+
p scale_size_area(max_size = 20)
The value assigned to max_size is a bit of a trial and error as well as personal preference.
We have already shown a plot using the path-layer, but just to create a background map. For a more appropriate demonstration lets use the vessel track data from the 2019 spring survey. Besides the coordinate data the key additional element of this dataset is time, which it has already been ordered by.
<-
track read_csv("ftp://ftp.hafro.is/pub/data/csv/is_smb_vms2019.csv")
+
m geom_path(data = track,
aes(x = lon, y = lat, colour = vessel))
The survey data contains the ICES rectangles that a tow belongs (ir) to as well as the center coordinate position (ir_lon and ir_lat) we can use the tile-layer to display some characteristics of the data (we will later deal with how this was achieved).
However, since there is more than one tow in each tile we need first to generate some summary of the data by each rectangle (actually here by the center-points):
<-
s %>%
smb2019 group_by(ir_lon, ir_lat) %>%
summarise(Cod = mean(cod_n))
<-
p +
m geom_tile(data = s,
aes(x = ir_lon, y = ir_lat, fill = Cod))
p
Do not like the default colour?. Try:
+
p scale_fill_viridis_c(option = "B", direction = -1)
This display has still at least two problems:
Ameliorate the code for the plot above such that is looks like this (you have to start from scratch):
The above plot using the tile-layer is actually a kind of a raster plot, albeit the resolution being not very high.
To demonstrate the use of the raster-layer
lets use the global relief models from the ETOPO1 dataset hosted on a
NOAA server can be accessed using the
getNOAA.bathy
-function in the marmap-package (you need to
install that package). To access them one specifies the boundary of the
data of interest and then, since we are using ggplot for mapping are
turned into a data frame using the
fortify-function:
<- c(-28, -10)
xlim <- c(62.5, 67.5)
ylim
<-
depth getNOAA.bathy(lon1 = xlim[1], lon2 = xlim[2],
lat1 = ylim[1], lat2 = ylim[2],
resolution = 1) %>%
fortify() %>% # turn the object into a data.frame
filter(z <= 0)
glimpse(depth)
## Rows: 254,268
## Columns: 3
## $ x <dbl> -28.00000, -27.98332, -27.96664, -27.94995, -27.93327, -27.91659, -2…
## $ y <dbl> 67.5, 67.5, 67.5, 67.5, 67.5, 67.5, 67.5, 67.5, 67.5, 67.5, 67.5, 67…
## $ z <int> -297, -298, -300, -302, -304, -306, -306, -305, -305, -305, -305, -3…
So this data is just a set of x (longitude), y (latitudes) and z
(depth). The dataset is a raster-grid which we can visualize by using
the geom_raster
-function:
+
m geom_raster(data = depth,
aes(x = x, y = y, fill = z)) +
coord_quickmap() +
scale_x_continuous(name = NULL, breaks = NULL) +
scale_y_continuous(name = NULL, breaks = NULL) +
labs(fill = "Depth [m]")
We will later deal with the syntax for generating contours from rasters. However, as is often the case, ggplot provides a shortcut for generating contours based on raster data, the contour-layer.
<-
m2 +
m geom_contour(data = depth, aes(x, y, z = z),
breaks = c(-25, -50, -100, -200, -400),
lwd = 0.1)
m2
Here we have specified to display the depth contours of 25, 50, 100, 200 and 400 meters.
Now we are ready to add the minke data or any other data of interest:
+ geom_point(data = minke, aes(lon, lat), colour = "red") m2
and / or
Facetting allows one to split up the data by a variable and display the same graph for each subset. One can split the data by one variable (wrap) or by two variables (grid).
Lets say we want to visualize the survey biomass of a species (here haddock) over time. For simplicity here we will only use selected 6 years:
<-
smb2 # more on filter later
filter(smb, year %in% c(1995, 2000, 2005, 2010, 2015, 2019))
<-
p +
m geom_point(data = smb2,
aes(lon1, lat1, size = haddock_kg),
alpha = 0.15,
colour = "red") +
scale_size_area(max_size = 15)
+
p facet_wrap(~ year)
data used: Survey abundance data of cod by year, age and yearclass at each station.
<-
rbya read_csv("ftp://ftp.hafro.is/pub/data/csv/is_smb_cod_rbya.csv")
<- filter(rbya, yc %in% c(1984, 1985, 1996, 1997))
rbya glimpse(rbya)
## Rows: 6,582
## Columns: 7
## $ id <dbl> 31885, 31883, 31882, 31880, 31879, 31876, 31875, 31874, 31873, 31…
## $ year <dbl> 1986, 1986, 1986, 1986, 1986, 1986, 1986, 1986, 1986, 1986, 1986,…
## $ lon1 <dbl> -24.52467, -24.70017, -24.20867, -24.79200, -25.09133, -25.92833,…
## $ lat1 <dbl> 64.48383, 64.59000, 64.66183, 64.93800, 64.68400, 64.67583, 64.77…
## $ age <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
## $ n <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 0, 2, 0, 0, 0, 0, 11, 0, 0, 0, 4…
## $ yc <dbl> 1985, 1985, 1985, 1985, 1985, 1985, 1985, 1985, 1985, 1985, 1985,…
Note: Here we have filtered out four year classes.
+
m geom_point(data = rbya,
aes(lon1, lat1, size = n),
alpha = 0.2,
colour = "red") +
scale_size_area(max_size = 10) +
facet_grid(age ~ yc) +
labs(title = "Abundance of cod",
subtitle = "By yearclass and age",
size = "per 4 miles")
Note: The in the facet_grid-function the first variable (here age) is plotted as rows, the second variable (here yearclass) as columns.
In base R there is function called locator. For ggplot there is an equivalent function available in the ggmap-package. Check out:
library(ggmap)
?gglocator