NOTE: This document is only partially completed
This exercise demonstrates the power of combining functions from the dplyr and ggplot2 packages (and then some) as a data exploration tool for revealing potential patterns and trends.
It is expected that readers are already familiar with the basics of data manipulation and plotting. We will use the flying fish data again.
Needed packages:
library(tidyverse)
library(lubridate)
Getting the example data into R:
ff <- read.csv("http://www.hafro.is/~einarhj/crfmr/data-raw/flyingfish.csv",
               stringsAsFactors = FALSE)
Let's see what we got:
glimpse(ff)
## Observations: 589
## Variables: 6
## $ Year <int> 1988, 1988, 1988, 1988, 1988, 1988, 1988, 1988, 19...
## $ Month <int> 1, 2, 3, 4, 5, 6, 7, 11, 11, 12, 12, 1, 1, 2, 2, 3...
## $ Country <chr> "Tobago", "Tobago", "Tobago", "Tobago", "Tobago", ...
## $ Vessel <chr> "Dayboats", "Dayboats", "Dayboats", "Dayboats", "D...
## $ Weight..kg. <dbl> 14366, 14453, 12366, 5409, 8887, 10725, 243, 65, 5...
## $ Trips <int> 132, 134, 225, 159, 154, 174, 29, 1, 5, 16, 10, 25...
So we have 589 observations. Take note that Year, Month, Weight..kg. and Trips are numerical variables (labelled as <int> or <dbl> above), as expected. The column names are not to my liking (I want to minimize keyboard work in the code that follows) so I change the column names:
names(ff) <- c("year", "month", "country", "vessel", "catch", "effort")
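Assigning to names() relies on the columns being in exactly the order shown in the glimpse() output. As a sketch of an alternative that does not depend on column order, rename() from dplyr maps each new name to an old one explicitly. The data frame toy below is a made-up, two-row stand-in that only mimics the column names of the flying fish data:

```r
library(dplyr)

# Hypothetical stand-in for the flying fish data (values are made up)
toy <- data.frame(Year = c(1988L, 1988L),
                  Month = c(1L, 2L),
                  Country = c("Tobago", "Tobago"),
                  Vessel = c("Dayboats", "Dayboats"),
                  Weight..kg. = c(100, 200),
                  Trips = c(10L, 20L),
                  stringsAsFactors = FALSE)

# rename() takes new_name = old_name pairs, so column order does not matter
toy <-
  toy %>%
  rename(year = Year,
         month = Month,
         country = Country,
         vessel = Vessel,
         catch = Weight..kg.,
         effort = Trips)

names(toy)
```

Either approach gives the same short, lower-case names; rename() is simply a little more robust should the input file ever change its column order.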
And because we are going to be most interested in the catch per unit effort data we might as well generate that variable:
ff <-
  ff %>%
  mutate(cpue = catch/effort)
There are two columns that refer to time, year and month. Because I later want to plot the data along time I might as well set up a proper date variable. Here I use the ymd function from the lubridate package (for further reading check the introductory vignette). Since there is no specific “day” in the data I just use the first day of each month as a dummy constant.
ff <-
  ff %>%
  mutate(cpue = catch/effort,
         date = ymd(paste(year, month, 1, sep = "-")))
If you now take a glimpse at the data you notice that the type of the date column is labelled as <date> (i.e. as expected).
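To see the date construction in isolation, here is a minimal sketch on a few made-up year and month values (not taken from the flying fish data). It also shows make_date, another lubridate function, which builds the date directly from its numeric parts without pasting a string first:

```r
library(lubridate)

# Made-up year/month pairs, only for illustration
year  <- c(1988L, 1988L, 1989L)
month <- c(1L, 11L, 2L)

# Approach used in the text: paste to "year-month-1" and parse with ymd()
d1 <- ymd(paste(year, month, 1, sep = "-"))

# Alternative: build the date from its numeric components
d2 <- make_date(year, month, 1L)

class(d1)       # the class of the parsed column
all(d1 == d2)   # do both approaches give the same dates?
```

Which of the two to use is mostly a matter of taste; the paste and ymd route is kept in this exercise.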
ff %>%
  ggplot(aes(date, effort, colour = country)) +
  geom_line() +
  facet_wrap(~ vessel) +
  labs(x = NULL, y = "Number of trips")
ff %>%
  ggplot(aes(date, catch, colour = country)) +
  geom_line() +
  facet_wrap(~ vessel) +
  labs(x = NULL, y = "Catch [kg]")
ff %>%
  ggplot(aes(date, cpue, colour = country)) +
  geom_line() +
  facet_wrap(~ vessel) +
  labs(x = NULL, y = "CPUE [kg/trip]")
The main patterns one observes are:
To reveal long-term trends in catch per unit of effort we could try to use a smoother:
ff %>%
  ggplot(aes(date, cpue, colour = country)) +
  geom_point(size = 0.5) +
  geom_smooth(span = 0.3) +
  facet_wrap(~ vessel) +
  labs(x = NULL, y = "CPUE [kg/trip]")
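The span argument controls how wiggly the smoother is. For series of this size geom_smooth defaults to a loess fit, so the effect of span can be sketched with the base R loess function. The series below is made up (a sine curve plus noise) and has nothing to do with the flying fish data:

```r
# Made-up noisy series, only for illustration
set.seed(1)
x <- 1:100
y <- sin(x / 10) + rnorm(100, sd = 0.3)

# Local regression with a narrow and a wide window
fit_narrow <- loess(y ~ x, span = 0.3)
fit_wide   <- loess(y ~ x, span = 0.9)

# Spread of the residuals: the narrow span follows the data more closely
sd(residuals(fit_narrow))
sd(residuals(fit_wide))
```

A small span tracks short-term fluctuations while a large span gives a stiffer curve; the value of 0.3 used above is a judgment call rather than a recommended setting.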
Among the general patterns are:
In order to make comparisons across the cpue series one can normalize the data such that the mean within each group (country and vessel) is equal to 1, using a combination of the group_by and mutate functions. Notice that I start by filtering the data so that only the year range common to all series (1995-2007) is included in the analysis:
ff %>%
  filter(year %in% 1995:2007) %>%
  group_by(country, vessel) %>%
  mutate(mean = cpue/mean(cpue)) %>%
  ungroup() %>%
  mutate(group = paste(country, vessel)) %>%
  ggplot(aes(date, mean, colour = group, fill = group)) +
  theme_bw() +
  geom_hline(yintercept = 1) +
  geom_smooth(aes(y = mean), span = 0.3) +
  #facet_wrap(~ vessel + country) +
  labs(x = NULL, y = "CPUE index") +
  scale_fill_brewer(palette = "Set1") +
  scale_colour_brewer(palette = "Set1")
Here we have not plotted the actual data, just the “loess” smoother. We added a horizontal line (using geom_hline) to indicate the mean within each time series. The default ggplot colour scheme was also overridden using the scale_xxx_brewer set of functions (type ?scale_fill_brewer to get further information on these functions).
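The group_by and mutate combination used for the normalization can be checked on a small made-up example. The two groups and their values below are hypothetical, and the column name index is my choice for this sketch (the code above stores the result in a column called mean):

```r
library(dplyr)

# Made-up CPUE values for two hypothetical fleets on very different scales
toy <- data.frame(group = rep(c("A", "B"), each = 3),
                  cpue  = c(10, 20, 30, 100, 150, 350))

toy <-
  toy %>%
  group_by(group) %>%
  mutate(index = cpue / mean(cpue)) %>%   # mean(cpue) is computed per group
  ungroup()

# Check: the mean of the index within each group
toy %>%
  group_by(group) %>%
  summarize(mean_index = mean(index))
```

Because mutate is applied to grouped data, mean(cpue) is evaluated separately for each group, which is what puts the series on a common scale.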
TODO: Add text to explain the code
p <-
  ff %>%
  filter(year %in% 1995:2007) %>%
  ggplot(aes(factor(month), cpue)) +
  geom_boxplot()
p
p + facet_wrap(~ vessel)
ff %>%
  filter(year %in% 1955:2007) %>%
  group_by(month, vessel) %>%
  # mean, 5th, 50th and 95th percentiles of cpue by month and vessel
  summarize(mean = mean(cpue),
            p005 = quantile(cpue, 0.05),
            p050 = quantile(cpue, 0.50),
            p095 = quantile(cpue, 0.95)) %>%
  ggplot(aes(factor(month))) +
  # vertical line from the 5th to the 95th percentile
  geom_linerange(aes(ymin = p005,
                     ymax = p095)) +
  # median in black, mean in red
  geom_point(aes(y = p050)) +
  geom_point(aes(y = mean), colour = "red") +
  facet_wrap(~ vessel, scales = "free_y")
ff %>%
  filter(year %in% 1955:2007) %>%
  ggplot(aes(factor(month), cpue)) +
  theme_bw() +
  # mean with bootstrapped confidence limits (mean_cl_boot needs the Hmisc package)
  stat_summary(fun.data = "mean_cl_boot", colour = "red", size = 1) +
  facet_wrap(~ vessel, scales = "free_y")
## Warning: Removed 1 rows containing missing values (geom_pointrange).
TODO: Add text to explain the code
ff %>%
  filter(year %in% 1955:2007) %>%
  ggplot(aes(vessel, cpue)) +
  theme_bw() +
  stat_summary(fun.data = "mean_cl_boot", colour = "red", size = 1)