Nice resources:
library(tidyverse) # This loads among other things the ggplot2-package
library(ggmap)
library(gridExtra)
library(maps)
library(mapdata)
Note: if you have not installed these packages, check the Installing documentation.
For this tutorial we are going to use the following data sets (information on them is found here):
minke <- read.csv("http://www.hafro.is/~einarhj/data/minke.csv",
                  stringsAsFactors = FALSE)
sau <- read.csv("http://www.hafro.is/~einarhj/data/sau-crfm-country-catches.csv",
                stringsAsFactors = FALSE)
iceland <- read.csv("http://www.hafro.is/~einarhj/data/iceland.csv",
                    stringsAsFactors = FALSE)
To get a quick overview of the data we use the glimpse function:
glimpse(minke)
#> Observations: 190
#> Variables: 13
#> $ id <int> 1, 690, 926, 1333, 1334, 1335, 1336, 1338, 1339...
#> $ date <chr> "2004-06-10 22:00:00", "2004-06-15 17:00:00", "...
#> $ lon <dbl> -21.4, -21.4, -19.8, -21.6, -15.6, -18.7, -21.5...
#> $ lat <dbl> 65.7, 65.7, 66.5, 65.7, 66.3, 66.2, 65.7, 66.1,...
#> $ area <chr> "North", "North", "North", "North", "North", "N...
#> $ length <int> 780, 793, 858, 567, 774, 526, 809, 820, 697, 77...
#> $ weight <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,...
#> $ age <dbl> 11.3, NA, 15.5, 7.2, 12.3, 9.6, 17.3, 13.8, 12....
#> $ sex <chr> "Female", "Female", "Female", "Male", "Female",...
#> $ maturity <chr> "pregnant", "pregnant", "pregnant", "immature",...
#> $ stomach.volume <dbl> 58, 90, 24, 25, 85, 18, 200, 111, 8, 25, 38, 6,...
#> $ stomach.weight <dbl> 31.900, 36.290, 9.420, 3.640, 5.510, 1.167, 99....
#> $ year <int> 2004, 2004, 2004, 2003, 2003, 2003, 2003, 2003,...
glimpse(sau)
#> Observations: 64
#> Variables: 4
#> $ year <int> 1950, 1951, 1952, 1953, 1954, 1955, 1956, 1957, 195...
#> $ reported <dbl> 14.2, 14.5, 16.8, 19.9, 19.9, 21.9, 23.2, 23.6, 27....
#> $ unreported <dbl> 75.8, 78.3, 79.4, 87.0, 88.0, 90.7, 92.3, 91.0, 93....
#> $ total <dbl> 90.0, 92.8, 96.3, 106.9, 107.9, 112.6, 115.6, 114.6...
glimpse(iceland)
#> Observations: 1,323
#> Variables: 2
#> $ lat <dbl> 65.8, 65.8, 65.8, 65.7, 65.7, 65.7, 65.6, 65.6, 65.6, 65.6...
#> $ lon <dbl> -23.9, -24.0, -24.1, -24.1, -24.0, -23.9, -23.8, -23.8, -2...
ggplot has three key components:
data,
a set of aesthetic mappings [aes] for the variables in the data, and
at least one layer which describes how to render each observation. Layers are usually created with a geom function.
ggplot(data = minke, aes(x = age, y = length)) +
geom_point()
Here we have created a scatter plot where age is plotted (“mapped”) on the x-axis and length on the y-axis. Different syntax can be used to get the same outcome:
ggplot(minke, aes(x = age, y = length)) + geom_point()
ggplot(minke, aes(age, length)) + geom_point()
ggplot() + geom_point(data = minke, aes(age, length))
ggplot(data = minke) + geom_point(aes(x = age, y = length))
ggplot(minke) + geom_point(aes(age, length))
minke %>% ggplot() + geom_point(aes(age, length))
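As a quick check that these call styles are interchangeable, the sketch below builds two of the variants on a small made-up data frame and compares the data ggplot2 computes for the point layer, using `layer_data()`. The `toy` data frame and its values are hypothetical stand-ins, not the minke data.

```r
library(ggplot2)

# Hypothetical toy data standing in for the minke data set
toy <- data.frame(age = c(5, 10, 20, 30),
                  length = c(550, 700, 800, 850))

# Mapping given in ggplot() versus data and mapping given in the layer
p_a <- ggplot(toy, aes(age, length)) + geom_point()
p_b <- ggplot() + geom_point(data = toy, aes(age, length))

# layer_data() returns the data frame a layer is drawn from
d_a <- layer_data(p_a)
d_b <- layer_data(p_b)

# TRUE if both variants place the points at the same positions
same_xy <- isTRUE(all.equal(d_a[, c("x", "y")], d_b[, c("x", "y")]))
same_xy
```

Which style to use is mostly a matter of taste; putting the data and mapping in `ggplot()` saves typing when several layers share them.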
There is quite a set of options with which one can control colour, size, shape and alpha (the transparency of the points):
ggplot(minke) +
  geom_point(aes(age, length), colour = "red", size = 2, shape = 3, alpha = 0.4)
One can also store the plot in an object for later use:
p <- ggplot(minke) + geom_point(aes(age, length))
The above is useful if one wants to build a plot step-by-step or arrange many plots together for display (see below).
In the above cases the colour, size, shape and alpha levels are fixed values set outside the aes function. We can however use the value of a variable in the dataset to control the visualization. Below are some of the things that can be set:
One can add more aesthetics to the plot, e.g. if one wants to distinguish between sex or the area where the whale was caught:
p <- ggplot(minke)
p + geom_point(aes(age, length, colour = sex))
p + geom_point(aes(age, length, colour = area))
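The difference between setting an aesthetic to a fixed value and mapping it to a variable can be made visible by inspecting the colours ggplot2 assigns to each point. This is a minimal sketch on a hypothetical `toy` data frame (the column values are invented for illustration).

```r
library(ggplot2)

# Hypothetical toy data; values are made up
toy <- data.frame(age = c(5, 10, 20, 30),
                  length = c(550, 700, 800, 850),
                  sex = c("Female", "Male", "Female", "Male"))

# Fixed value: colour is set outside aes()
p_fixed <- ggplot(toy) + geom_point(aes(age, length), colour = "red")

# Mapped value: colour is driven by the sex column, inside aes()
p_mapped <- ggplot(toy) + geom_point(aes(age, length, colour = sex))

# Colours actually used for the points in each plot
fixed_colours  <- unique(layer_data(p_fixed)$colour)
mapped_colours <- unique(layer_data(p_mapped)$colour)

fixed_colours          # a single colour for all points
length(mapped_colours) # one colour per level of sex
```

Only the mapped version produces a legend, because only there does the colour carry information from the data.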
Colours can be manually specified using the scale_colour_manual function:
p <- ggplot(minke)
p + geom_point(aes(age, length, colour = sex)) +
  scale_colour_manual(values = c("orange","brown"))
p + geom_point(aes(age, length, colour = area)) +
  scale_colour_manual(values = c("green","red"))
In the above case sex and area had only two values, hence only two colours were specified.
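When colours are given as an unnamed vector they are assigned to the levels in order. A possible way to make the assignment explicit is to pass a named vector to `scale_colour_manual`, as sketched here on a hypothetical `toy` data frame (not the minke data).

```r
library(ggplot2)

# Hypothetical toy data; values are made up
toy <- data.frame(age = c(5, 10, 20, 30),
                  length = c(550, 700, 800, 850),
                  sex = c("Female", "Male", "Female", "Male"))

# Names tie each colour to a level, regardless of the order they are listed in
p <- ggplot(toy) +
  geom_point(aes(age, length, colour = sex)) +
  scale_colour_manual(values = c(Male = "brown", Female = "orange"))

d <- layer_data(p)

# Colours of the two female observations (ages 5 and 20)
female_colours <- as.character(d$colour[d$x %in% c(5, 20)])
female_colours
```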
p + geom_point(aes(age, length, shape = sex))
Write code that results in these plots:
p + geom_point(aes(age, length, size = stomach.volume))
To reveal overlays:
p + geom_point(aes(age, length, size = stomach.volume), alpha = 0.6)
p + geom_point(aes(age, length, size = stomach.volume), alpha = 0.3, col = "red")
A plot can be split into panels using the facet_wrap function. Here we split the plot by the two survey areas (North and South):
ggplot(minke) +
  geom_point(aes(age, length, colour = sex)) +
  facet_wrap(~ area)
Write code that results in this plot:
If one wants two different plots side by side, one needs to store each plot as an object and then use the grid.arrange function. E.g.:
p1 <- p + geom_point(aes(age, length), colour = "blue")
p2 <- p + geom_point(aes(age, length, shape = sex))
grid.arrange(p1, p2, ncol = 2)
In the above cases we have used only one type of layer, the point layer. ggplot2 has of course many more. Below are some examples that give a brief overview of some other layers.
ggplot(sau, aes(year, total)) + geom_line()
You may have noticed that by default the plot area covers only the range of the data. In the above case one may want the y-axis to start at zero. For that we use the expand_limits function:
ggplot(sau, aes(year, total)) + geom_line() + expand_limits(y = 0)
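To see what `expand_limits` does to the axis, one can compare the range the y scale is trained on with and without it. The sketch below uses a hypothetical `catches` data frame with invented values and reads the trained range through `layer_scales()`; treat the exact accessor as an assumption about ggplot2 internals rather than a documented recipe.

```r
library(ggplot2)

# Hypothetical catch series; values are made up
catches <- data.frame(year = 1950:1954,
                      total = c(90, 93, 96, 107, 108))

p <- ggplot(catches, aes(year, total)) + geom_line()

# Range of the y scale when only the data determine it
y_default <- layer_scales(p)$y$range$range

# Range of the y scale after asking for zero to be included
y_expanded <- layer_scales(p + expand_limits(y = 0))$y$range$range

y_default
y_expanded
```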
We can create bar charts of counts for discrete data using the geom_bar function:
ggplot(minke, aes(maturity)) + geom_bar()
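By default `geom_bar` counts the number of rows in each category, which is the same tally that `table()` gives. A minimal sketch on a hypothetical `toy` data frame (the maturity values are invented):

```r
library(ggplot2)

# Hypothetical toy data; values are made up
toy <- data.frame(maturity = c("immature", "mature", "mature",
                               "pregnant", "pregnant", "pregnant"))

p <- ggplot(toy, aes(maturity)) + geom_bar()

# Bar heights computed by ggplot2 versus a plain frequency table
bar_counts   <- as.numeric(layer_data(p)$count)
table_counts <- as.numeric(table(toy$maturity))

bar_counts
table_counts
```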
Modify the above code to generate the following:
We can create histograms for continuous data using the geom_histogram function. That function also has an argument for controlling the bin width:
p <- ggplot(minke, aes(length))
p + geom_histogram()
p + geom_histogram(binwidth = 50)
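The effect of `binwidth` can be checked by looking at the bins ggplot2 computes: a smaller bin width gives more bins, while the total count stays the same. This sketch uses a hypothetical `toy` vector of lengths, not the minke data.

```r
library(ggplot2)

# Hypothetical lengths, evenly spread between 500 and 900
toy <- data.frame(length = seq(500, 900, by = 10))

p <- ggplot(toy, aes(length))

# Bins computed for two different bin widths
d50  <- layer_data(p + geom_histogram(binwidth = 50))
d100 <- layer_data(p + geom_histogram(binwidth = 100))

nrow(d50)        # number of bins with binwidth = 50
nrow(d100)       # number of bins with binwidth = 100
sum(d50$count)   # every observation falls in exactly one bin
```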
One may want a histogram that shows the size distribution of e.g. the different sexes:
p + geom_histogram(aes(fill = sex))
It is quite hard to see the length distribution of the females here, so it may be better to split the plot into different panels:
p + geom_histogram(aes(fill = sex)) + facet_wrap(~ sex, ncol = 1)
Instead of a histogram it may also be better to show things as frequency lines:
p + geom_freqpoly()
p + geom_freqpoly(aes(colour = sex), binwidth = 50)
Adding a little random noise to the data to avoid over-plotting can be done using the geom_jitter function:
p <- ggplot(minke, aes(sex, length))
p + geom_point()
p + geom_jitter()
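By default `geom_jitter` adds noise in both directions. When only the discrete axis should be spread out, the `width` and `height` arguments can limit the noise, as in this sketch on a hypothetical `toy` data frame (values invented): the lengths stay exact and only the horizontal position is jittered.

```r
library(ggplot2)

# Hypothetical toy data; values are made up
toy <- data.frame(sex = c("Female", "Female", "Male", "Male"),
                  length = c(780, 793, 567, 774))

# Jitter horizontally by at most 0.2, leave the measured lengths untouched
p <- ggplot(toy, aes(sex, length)) +
  geom_jitter(width = 0.2, height = 0)

d <- layer_data(p)

# Horizontal offsets from the category positions (1 = Female, 2 = Male)
x_offset <- as.numeric(d$x) - round(as.numeric(d$x))
x_offset
```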
Instead of a jitter plot, a more condensed way to show the distribution of the data is to use box or violin plots:
p + geom_boxplot()
p + geom_violin()
In ggplot one can have more than one layer. E.g. if we generate a summary distribution plot we may also want to get a “glimpse” of the raw data:
p + geom_boxplot() + geom_jitter(colour = "red", alpha = 0.3)
Create the following 3 layer plot:
Another example where we add layers could be adding a smoother to a point-plot:
p <- ggplot(minke, aes(age, length))
p + geom_point() + geom_smooth()
p + geom_point() + geom_smooth(span = 0.1)
We even have some specific models we could try, here a linear model:
p + geom_point() + geom_smooth(method = "lm")
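The line drawn by `geom_smooth(method = "lm")` should correspond to an ordinary `lm()` fit of y on x. The following sketch checks this on a hypothetical `toy` data frame (values invented) by comparing the smoother's fitted values with predictions from `lm()`.

```r
library(ggplot2)

# Hypothetical toy data; values are made up
toy <- data.frame(age = c(5, 10, 15, 20, 25, 30),
                  length = c(560, 690, 740, 800, 820, 850))

p <- ggplot(toy, aes(age, length)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x)

# The smoother is the second layer of the plot
smooth <- layer_data(p, 2)

# Fit the same model by hand and predict at the smoother's x positions
fit  <- lm(length ~ age, data = toy)
pred <- predict(fit, newdata = data.frame(age = smooth$x))

# Largest difference between the two sets of fitted values
max_diff <- max(abs(smooth$y - pred))
max_diff
```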
If we want to put a plot into a report we may not like the defaults and want to refine the plot further. There are a number of functions in ggplot2 that give us full control over the final look.
So far, the labels in the plot have just been the variable names in the dataframe. To specify the labels one can use the labs function:
p <- ggplot(minke, aes(age, length, colour = sex)) + geom_point()
p
p + labs(x = "Age [year]", y = "Length [cm]",
         colour = "Sex", title = "My minke plot",
         subtitle = "Based on survey data from 2003-2007")
To control which values appear as tick marks one can use:
p <- ggplot(minke, aes(age, length)) + geom_point() + labs(x = NULL, y = NULL)
p
p +
  scale_x_continuous(breaks = c(5, 10, 15, 20, 25, 30, 35, 40, 45))
p +
  scale_x_continuous(breaks = seq(5, 45, by = 5)) +
  scale_y_continuous(breaks = seq(500, 950, by = 50))
Create code that mimics the following plot:
If you only want to “zoom into” a data area you can use xlim or ylim:
p <- ggplot(minke, aes(maturity, length))
p + geom_jitter()
p + geom_jitter() + ylim(600, 800)
p + geom_jitter() + ylim(NA, 800) # setting only one limit
For discrete variables:
p + geom_jitter() + ylim(600,800) + xlim("immature","mature")
But be careful when using these with summary statistics, e.g.:
p + geom_boxplot()
p + geom_boxplot() + ylim(600, 800)
This is because when you specify ylim, the data outside that range are filtered out completely from the plot data, so the summary statistics are computed on the remaining data only. The remedy is to set the limits inside the coord_cartesian function:
p + geom_boxplot() + coord_cartesian(ylim = c(600, 800))
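The difference between the two approaches can be illustrated with numbers. In this sketch a hypothetical `toy` data frame (invented values) contains one extreme observation: with `ylim` it is dropped before the box plot statistics are computed, while `coord_cartesian` only zooms the view and leaves the statistics unchanged.

```r
library(ggplot2)

# Hypothetical toy data with one extreme value
toy <- data.frame(grp = "a", value = c(1:9, 100))

p <- ggplot(toy, aes(grp, value)) + geom_boxplot()

# Zooming: the median is computed from all ten values
med_zoom <- layer_data(p + coord_cartesian(ylim = c(0, 10)))$middle

# Filtering: the value 100 is removed first (ggplot2 warns about this)
med_filter <- suppressWarnings(layer_data(p + ylim(0, 10))$middle)

med_zoom    # median of 1, ..., 9, 100
med_filter  # median of 1, ..., 9
```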
The object iceland contains the latitude and longitude of the Icelandic shoreline. We could try a point or a line plot:
p <- ggplot(iceland, aes(lon, lat)) + labs(x = NULL, y = NULL)
p + geom_point()
p + geom_line()
The point plot shows something that is close to recognizable. However, in the line plot the data are rearranged such that the line is drawn from the smallest x-value to the next-smallest x-value and so on. The iceland object actually has the data arranged in a specific order. If one wants to retain that order one needs to use the geom_path function. In addition, because the data are coordinates we want to set the correct aspect ratio between the x- and the y-axis, hence here we also need to use the coord_map function:
p + geom_path()
p + geom_path() + coord_map()
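The reordering done by `geom_line` can be seen on a tiny example. The sketch below uses a hypothetical `square` outline (five invented points, the last closing the shape) and inspects the order in which each geom connects the points.

```r
library(ggplot2)

# Hypothetical outline of a unit square, listed in drawing order
square <- data.frame(lon = c(0, 1, 1, 0, 0),
                     lat = c(0, 0, 1, 1, 0))

p <- ggplot(square, aes(lon, lat))

# geom_path connects the points in the order they appear in the data
x_path <- as.numeric(layer_data(p + geom_path())$x)

# geom_line sorts the points by x before connecting them
x_line <- as.numeric(layer_data(p + geom_line())$x)

x_path
x_line
```

Only the path version traces the square; the line version zigzags between the sorted x-values, which is what happens to the Icelandic shoreline above.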
So, GIS mapping is nothing more than having data in a structured order, connecting the dots with a line, and specifying the map projection.
Map outlines are also available via the map_data function:
m <- map_data("world")
str(m)
#> 'data.frame': 99338 obs. of 6 variables:
#> $ long : num -69.9 -69.9 -69.9 -70 -70.1 ...
#> $ lat : num 12.5 12.4 12.4 12.5 12.5 ...
#> $ group : num 1 1 1 1 1 1 1 1 1 1 ...
#> $ order : int 1 2 3 4 5 6 7 8 9 10 ...
#> $ region : chr "Aruba" "Aruba" "Aruba" "Aruba" ...
#> $ subregion: chr NA NA NA NA ...
ggplot(m) +
  geom_polygon(aes(long, lat, group = group)) +
  coord_quickmap()
The resolution of “world” is not very high, as can be seen if we only want Barbados:
m <- map_data("world", region = "Barbados")
ggplot(m) +
  geom_polygon(aes(long, lat, group = group)) +
  coord_quickmap()
If we want higher resolution the object “worldHires” is often sufficient:
m <- map_data("worldHires")
m <- m[m$long > -85 & m$long < -60 & m$lat > 0 & m$lat < 30,]
ggplot(m) +
  geom_polygon(aes(long, lat, group = group)) +
  coord_quickmap(ylim = c(7, 28))
But it is still not really good, e.g. if we check out St. Vincent:
m <- m[m$region == "Saint Vincent",]
ggplot(m) +
  geom_polygon(aes(long, lat, group = group)) +
  coord_quickmap()
We only get the main island, not the Grenadines.
m <- ggplot(iceland, aes(lon, lat)) +
  theme_bw() +
  geom_polygon(fill = "grey90") +
  coord_map() +
  labs(x = NULL, y = NULL)
m
m + geom_point(data = minke, aes(lon, lat))
m + geom_point(data = minke, aes(lon, lat, colour = area))
m + geom_point(data = minke, aes(lon, lat, colour = sex))
m + geom_point(data = minke, aes(lon, lat, colour = year))
m + geom_point(data = minke, aes(lon, lat, colour = factor(year)))
m + geom_point(data = minke, aes(lon, lat, size = length), alpha = 0.2)
# a possible remedy for the above plot (beyond this basic introduction)
m + geom_point(data = minke, aes(lon, lat, colour = length)) +
  scale_colour_gradient(low = "yellow", high = "red")
# Note: recent versions of ggmap need a Google API key, registered with
# register_google(), before source = "google" will work
m2 <- get_map(location = c(-19,65), zoom = 6, maptype = "satellite", source = "google")
m2 <- ggmap(m2) +
  labs(x = NULL, y = NULL)
m2
Repeat previous plots where we use “iceland” or do new ones, but using the Google base as a background. E.g.: