Visualization - part I
Preamble
In this section we introduce how to plot data in R. R has quite a number of plotting "lingos"; here we limit ourselves to the ggplot lingo. But first we are going to sidetrack a bit.
RStudio projects and orientation
Before we do anything else, let's create an RStudio project.
We are going to use a dataset of 190 observations on minke whales in this tutorial. Details on the data and the variables can be obtained here.
We are going to deal with data import in more detail later in the course. For those wanting to get ahead, read r4ds - Import.
Creating a ggplot
Our first objective is to learn how to generate this type of graph:
Plot key components
ggplot has three key components:
- data, which must be a data.frame,
- a set of aesthetic mappings (aes) between variables in the data and visual properties, and
- at least one layer, which describes how to render each observation.
We always start creating a plot by calling ggplot.
The above code is actually composed of different elements that are built on top of each other:
# A blank canvas
ggplot(data = w)
# Add (map) what will be on x and y axis
ggplot(data = w,
mapping = aes(x = age, y = length))
# Add point layer
ggplot(data = w,
mapping = aes(x = age, y = length)) +
geom_point()
Different syntax, equivalent outcome:
ggplot(data = w, mapping = aes(age, length)) + geom_point()
ggplot() + geom_point(data = w, mapping = aes(age, length))
ggplot(data = w) + geom_point(mapping = aes(x = age, y = length))
ggplot(w) + geom_point(aes(age, length))
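Because a plot is built up by adding elements, it can also be stored in an object and extended later. The following is a minimal sketch of that idea; the data frame `toy` and its values are made up for illustration and only stand in for the minke data `w`.

```r
library(ggplot2)

# Hypothetical stand-in for the minke data frame `w` (values are made up)
toy <- data.frame(
  age    = c(3, 7, 12, 18, 25, 31),
  length = c(560, 650, 720, 780, 810, 830)
)

# Store the blank canvas plus the aesthetic mappings in an object ...
p <- ggplot(data = toy, mapping = aes(x = age, y = length))

# ... and add a layer to it later; `p` itself is left unchanged
p_points <- p + geom_point()

# Typing the object name draws the plot
p_points
```

Storing intermediate steps this way can be convenient when several variants of the same base plot are needed.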
We have not yet managed to emulate the plot shown at the beginning. We can show the sex of each minke by using colour, and add a “loess” smoother layer via geom_smooth:
ggplot(w,
aes(age, length, colour = sex)) +
geom_point()
# ... add a loess-smoother
ggplot(w,
aes(age, length, colour = sex)) +
geom_point() +
geom_smooth()
As a data-exploration exercise this plot should suffice. But if we were to put this figure into a report to be read by others, we may want to add nicer labels and possibly other colours, in addition to some auxiliary information:
ggplot(w, aes(age, length, colour = sex)) +
geom_point() +
geom_smooth() +
labs(x = "Age [year]",
y = "Length [cm]",
colour = "Sex",
title = "Minke whale",
subtitle = "Age and length by sex",
caption = "Data from Iceland") +
scale_colour_brewer(palette = "Set1")
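A note on the smoother used above: when no method is given, geom_smooth picks one automatically, and for datasets with fewer than 1000 observations that is loess. The sketch below names the method explicitly and switches off the confidence band with `se = FALSE`. The data frame `toy` is simulated and only stands in for the minke data; its growth curve is an arbitrary assumption.

```r
library(ggplot2)

# Hypothetical stand-in for the minke data (simulated, not real measurements)
set.seed(42)
toy <- data.frame(
  age = runif(80, min = 1, max = 40),
  sex = rep(c("Female", "Male"), each = 40)
)
toy$length <- 850 - 400 * exp(-0.15 * toy$age) + rnorm(80, sd = 25)

p <- ggplot(toy, aes(age, length, colour = sex)) +
  geom_point() +
  # Name the smoothing method explicitly and drop the confidence band
  geom_smooth(method = "loess", formula = y ~ x, se = FALSE)
p
```

Being explicit about the method avoids the informational message that geom_smooth otherwise prints and makes the choice visible to the reader of the code.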
Distributions
A histogram that shows the distribution of the data can be generated with geom_histogram:
ggplot(w) +
geom_histogram(aes(length),
binwidth = 30)
We see that most observations lie towards the right of the plot, with relatively few whales shorter than ~7 meters (700 cm). The binwidth above is set to 30 [cm]. Which binwidth to use is the user's choice, but below are examples of two extremes, both of which are less informative than the one above:
ggplot(w) +
geom_histogram(aes(length),
binwidth = 5)
ggplot(w) +
geom_histogram(aes(length),
binwidth = 200)
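Instead of a bin width, geom_histogram also accepts the number of bins via the `bins` argument; if neither is given it falls back to 30 bins and prints a message. A minimal sketch, using a simulated data frame `toy` as a stand-in for the minke data (the mean and spread are made up):

```r
library(ggplot2)

# Hypothetical stand-in for the minke lengths (simulated values, in cm)
set.seed(1)
toy <- data.frame(length = rnorm(190, mean = 750, sd = 80))

# Ask for a number of bins instead of a bin width
p <- ggplot(toy) +
  geom_histogram(aes(length), bins = 20)
p
```

Specifying `binwidth` is often easier to interpret because it is expressed in the units of the data, while `bins` is handy when the scale of the variable is not known in advance.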
A geom_boxplot allows us to get a broader idea of distributions, particularly when comparing different categories:
ggplot(w,
aes(x = sex, y = length)) +
geom_boxplot()
Here we see that females in the sample are generally larger than males (as we also saw in the scatterplot) and that the “outliers” are in the lower length range of the data (as indicated in the histogram above).
We can get a different view of the distribution using geom_violin:
ggplot(w,
aes(x = sex, y = length)) +
geom_violin()
The violin plot is in effect a mirrored density curve, drawn once per category. The same kind of density curve, here for both sexes combined, can be drawn with geom_density:
ggplot(w,
aes(x = length)) +
geom_density()
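The geom_density call above pools both sexes into a single curve. To get closer to what the violin plot shows, one option is to map sex to colour so that each sex gets its own curve. This is a sketch on a simulated data frame `toy`; the group means and spreads are made up and do not come from the minke data.

```r
library(ggplot2)

# Hypothetical stand-in for the minke data (simulated lengths in cm)
set.seed(2)
toy <- data.frame(
  sex    = rep(c("Female", "Male"), each = 95),
  length = c(rnorm(95, mean = 780, sd = 70),
             rnorm(95, mean = 740, sd = 70))
)

# One density curve per sex, distinguished by colour
p <- ggplot(toy, aes(x = length, colour = sex)) +
  geom_density()
p
```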
Facets
One of the strengths of ggplot is that you can split the plot up based on some (categorical) variables in your data by using facet_wrap (or facet_grid). Take e.g. this histogram, where we have used “fill” (rather than “colour”) to separate out the sexes:
ggplot(w,
aes(length, fill = sex)) +
geom_histogram()
This histogram is difficult to “read”, particularly when it comes to the females (because their bars are stacked on top of those of the males). Here we can resort to splitting up the plot into facets based on sex:
ggplot(w,
aes(length, fill = sex)) +
geom_histogram() +
facet_wrap(. ~ sex)
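By default facet_wrap places the panels side by side. When the aim is to compare distributions along the same x-axis, stacking the panels in a single column can be easier to read. A minimal sketch using the `ncol` argument, on a simulated data frame `toy` that only stands in for the minke data:

```r
library(ggplot2)

# Hypothetical stand-in for the minke data (simulated lengths in cm)
set.seed(3)
toy <- data.frame(
  sex    = rep(c("Female", "Male"), each = 95),
  length = c(rnorm(95, mean = 780, sd = 70),
             rnorm(95, mean = 740, sd = 70))
)

p <- ggplot(toy, aes(length, fill = sex)) +
  geom_histogram(binwidth = 30) +
  # One column of panels, so both sexes share the same x-axis position
  facet_wrap(~ sex, ncol = 1)
p
```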
Saving your plot
Your objective with creating a graph is to use it as a part of your communication with others. There are two ways to export graphs out of RStudio.
- A simple copy-paste:
  - In the “Plots” pane click on “Export” -> “Copy to Clipboard …”.
  - Adjust the dimensions to your liking and then right-click the image to copy it.
  - Paste this into your favourite communication medium.
- Use ggsave to save your active graph:
ggsave(filename = "minke-plot.png")
Check the help file for ggsave to explore the options you have when saving a plot.
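As an illustration of some of those options, the sketch below saves a specific plot object (rather than the last plot displayed) with an explicit size and resolution. The data frame `toy`, the output location in a temporary directory, and the chosen dimensions are all assumptions made for the example.

```r
library(ggplot2)

# Hypothetical stand-in for the minke data frame `w` (values are made up)
toy <- data.frame(
  age    = c(3, 7, 12, 18, 25, 31),
  length = c(560, 650, 720, 780, 810, 830)
)

p <- ggplot(toy, aes(age, length)) +
  geom_point()

# Write to a temporary directory so the example leaves no files behind
out <- file.path(tempdir(), "minke-plot.png")

# Pass the plot explicitly and fix its size and resolution
ggsave(filename = out, plot = p,
       width = 16, height = 10, units = "cm", dpi = 300)
```

Fixing the width and height in the call makes the saved figure reproducible, since it no longer depends on the current size of the “Plots” pane.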
A “spatial” plot
The minke dataset has coordinates. If you are a map enthusiast (like me) you may want to get a spatial representation of the location of each observation.
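A first approximation of such a plot is a scatterplot of the coordinates with a map-like aspect ratio. The sketch below assumes the coordinates are stored in columns called `lon` and `lat`; these names, the data frame `toy`, and the positions in it are made up for illustration and may not match the minke data.

```r
library(ggplot2)

# Hypothetical positions (made-up values); the column names `lon` and `lat`
# are assumptions, not taken from the minke dataset
toy <- data.frame(
  lon = c(-24.1, -22.5, -20.3, -18.7, -15.2),
  lat = c(65.9, 64.3, 66.4, 66.6, 64.1)
)

p <- ggplot(toy, aes(x = lon, y = lat)) +
  geom_point() +
  # Approximate map projection: keeps distances roughly comparable
  # in the x and y directions at these latitudes
  coord_quickmap()
p
```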