Preamble

NOTE: This document is only partially completed

In this exercise the power of combining functions from the dplyr and ggplot2 packages (and then some) as a data exploration tool to reveal potential patterns and trends is demonstrated.

It is expected that readers are already familiar with the basics of data manipulation and plotting.

The data we are going to use is the regional catch and effort flying fish data that was used as a part of the UNU-FTP stock assessment course that was held some years ago. It data contains observation of catch and effort by year, month, country and vessel type.

The most interesting parameter we may want to focus on is the catch per unit effort. This is because it is often proposed that such data may be a proxy measure of the actual biomass index. In analytical models that use time series data (stock production model or the more data demanding length and/or age based models) the cpue index is often assumed to be linearly related to stock biomass or abundance (green line in the schematic graph below) through an equation sometimes referred to as “the link model”

\(CPUE = qX\)

where cpue can be either in mass or in numbers and equivalently the X either the stock biomass or stock numbers. The \(q\) (often called catchability is estimated within the model).

The linear relationship (read: “The model assumption”) is often suspected to not hold (read: “Is wrong”), particular if the index (read: cpue) is base on fisheries dependent data. The reason in the latter case is that fisherman’s behavior is driven by getting has high catch as possible with the least amount of effort (cost of fishing often plays a role). So even though the stock may e.g. be declining fishermen will change behavior by whatever means in order hold up the catch per unit effort.

Add text to introduce:

\(CPUE = qX^b\)

In other words the fisherman’s objective is not estimating changes in stock size over time - it only us fisheries scientist that are sometimes daring enough to make that assumption. Often muttering at the same time “Given the data that one has this is the best one can do”.

But enough of a preamble, our main objective here is to learn some more R, including standardization of catch per unit effort data using GLM.

“They do not address any basic problems with cpue as an abundance index, such as hyperstability or hyperdepletion”

Getting the data into R

Once done you should have the flyfish Excel sheet in your current working directory (recall that to get information of the current working directory one can use the getwd command. If you open the workbook in Excel and go to sheet “flyfish” you see that the data we are interested in reading into R starts in row 3 and column “F” (number 6) and ends column “K” (number 6). Some may note that there is also data in column “L” (CPUE). We can omit them in the importing step because we can derive from other variables (“Weight (kg)” and “Trip” internally in R. Since the data is only a section of the worksheet we use the functions in the XLConnect package using the following code:

library(XLConnect)
library(tidyverse)
library(lubridate)
library(stringr)
library(broom)
df <- read.csv("http://www.hafro.is/~einarhj/crfmr/data-raw/flyingfish.csv",
               stringsAsFactors = FALSE)

Lets see what we got:

glimpse(df)

## Observations: 589
## Variables: 6
## $ Year        <int> 1988, 1988, 1988, 1988, 1988, 1988, 1988, 1988, 19...
## $ Month       <int> 1, 2, 3, 4, 5, 6, 7, 11, 11, 12, 12, 1, 1, 2, 2, 3...
## $ Country     <chr> "Tobago", "Tobago", "Tobago", "Tobago", "Tobago", ...
## $ Vessel      <chr> "Dayboats", "Dayboats", "Dayboats", "Dayboats", "D...
## $ Weight..kg. <dbl> 14366, 14453, 12366, 5409, 8887, 10725, 243, 65, 5...
## $ Trips       <int> 132, 134, 225, 159, 154, 174, 29, 1, 5, 16, 10, 25...

So we have 589 observations. Take note that the variable type of Year, Month, Weight..kg. and Trips are numerical values (labelled as <dbl> above) as expected. The column names are not to my liking (I want to minimize keyboard work in the code that follows) so I change the column names:

names(df) <- c("year", "month", "country", "vessel", "catch", "effort")

And because we are going to be most interested in the catch per unit effort data we might as well generate that variable:

df <-
  df %>% 
  mutate(cpue = catch/effort)

A tiny introduction to “date” format in R

There are two columns that refer to time, year and month. Because I want later to plot data along time I might as well set up the proper date variable. Here I use the ymd function from the lubridate package (for further reading check the introductory vignette). Since there is no specific “day” in the data I just use the first day of each month as a dummy constant.

df <- 
  df %>% 
  mutate(cpue = catch/effort,
         date = ymd(paste(year, month, 1, sep = "-")))

If you now take a glimpse at the data you notice that the type for the date column is labelled as <date> (i.e. as expected).

An introduction to GLM

TODO: Add text here to explain model and code

Catch per unit effort is in many cases the only index available to inform on stock trends. Raw cpue may, as we have seen, be influenced by factors such as fishing season, fishing area, vessel and gear type. If these change over the time (years) the trend in the raw cpue index may be reflecting changes in these factors rather than being indicative of changes in biomass. Take an imaginary simple case where there was a gradual change in the season when fishing effort (and hence sampling) occurs. E.g. from being more or less the same in all months in the beginning towards being focused only when the highest catch rates occur in terminal years. All else being equal a simple annual mean of all the observations would result in annual index that would increase over time.

Standardizing cpue with generalized linear models (GLMs) provides one way to remove some of these effects, resulting in an index that hopefully more accurately reflects changes in the true biomass. It however does not alleviate all problems such imprecise measure of true effort and the issue of “hyperstability” discussed at the beginning of this section.

If the data is already in a tidy format setting up the R-code for the model is relatively easy:

df <-
  df %>% 
  mutate(lcatch = log(catch),
         leffort = log(effort),
         year = factor(year),
         month = factor(month))

m <- glm(formula = lcatch ~ leffort + year + month + country + vessel,
     data = df)

In the above model the variable we are trying to “explain” is the catch (on a log-scale) as a function of the effort (on a log scale), year, month, country and vessel. Lets see what we got:

## 
## Call:  glm(formula = lcatch ~ leffort + year + month + country + vessel, 
##     data = df)
## 
## Coefficients:
##    (Intercept)         leffort        year1989        year1990  
##        3.32651         1.05153         0.41083         0.50806  
##       year1991        year1992        year1993        year1994  
##        0.98037         0.73147         1.17973         0.71047  
##       year1995        year1996        year1997        year1998  
##        0.64791         0.86752         0.75707         1.21512  
##       year1999        year2000        year2001        year2002  
##        0.99096         1.39016         1.46239         1.29172  
##       year2003        year2004        year2005        year2006  
##        0.93242         0.87069         1.18142         1.00108  
##       year2007        year2008          month2          month3  
##        1.08888         1.53798        -0.05906        -0.12176  
##         month4          month5          month6          month7  
##       -0.39176         0.21219         0.09258        -0.67453  
##         month8          month9         month10         month11  
##       -1.45909        -1.92004        -1.72528        -0.94345  
##        month12  countryStLucia   countryTobago  vesselIceboats  
##       -0.25471        -0.36789         0.48707         2.07469  
## 
## Degrees of Freedom: 588 Total (i.e. Null);  553 Residual
## Null Deviance:       3698 
## Residual Deviance: 336.2     AIC: 1415

This object is no longer a familiar dataframe frame but of a class that is called ‘“glm” “lm”’ (try running class(m). Besides the coefficients there are a bunch of statistical measures in the output that I try to explain.

The deviance is a measure of goodness of fit of a generalized linear model, the higher the value the worse the fit. The “Null Deviance” shows how well the response variable is predicted by a model that includes only the intercept (grand mean). For our example, we have a value of 3700 with 588 degrees of freedom (that is the total number of observations minus 1, try nrow(df)-1). Including the independent variables (effort, year, month, country and vessel) decreased the deviance to 337.1 with 553 degrees of freedom, a significant reduction in deviance. I.e the deviance was reduced by 3362.9 with a loss of 35 degrees of freedom.

The AIC (Akaike Information Criterion) provides a method for assessing the quality of your model through comparison of related models. It’s based on the Deviance, but penalizes you for making the model more complicated. Much like adjusted R-squared, it’s intent is to prevent you from including irrelevant predictors.

However, unlike adjusted R-squared, the number itself is not meaningful. If you have more than one similar candidate models (where all of the variables of the simpler model occur in the more complex models), then you should select the model that has the smallest AIC. We will look at that later on.

We can get the coefficient statistics with the tidy function from the broom-package:

res <- tidy(m)
res

##              term    estimate  std.error  statistic       p.value
## 1     (Intercept)  3.32650851 0.32233385 10.3200719  5.918807e-23
## 2         leffort  1.05153383 0.03087474 34.0580650 7.040910e-138
## 3        year1989  0.41083077 0.30177855  1.3613650  1.739528e-01
## 4        year1990  0.50805675 0.30894166  1.6445071  1.006398e-01
## 5        year1991  0.98037158 0.29779886  3.2920596  1.058039e-03
## 6        year1992  0.73146921 0.30054661  2.4337962  1.525700e-02
## 7        year1993  1.17972707 0.30107428  3.9183920  1.003101e-04
## 8        year1994  0.71047367 0.28497550  2.4931044  1.295447e-02
## 9        year1995  0.64790739 0.27538228  2.3527563  1.898423e-02
## 10       year1996  0.86752408 0.27534446  3.1506865  1.716975e-03
## 11       year1997  0.75707010 0.27162218  2.7872176  5.499013e-03
## 12       year1998  1.21512098 0.27451258  4.4264673  1.154458e-05
## 13       year1999  0.99096317 0.27530153  3.5995556  3.474004e-04
## 14       year2000  1.39015501 0.27692344  5.0199976  6.979207e-07
## 15       year2001  1.46239385 0.27593444  5.2997874  1.677531e-07
## 16       year2002  1.29172327 0.27486170  4.6995390  3.295154e-06
## 17       year2003  0.93241575 0.27722953  3.3633349  8.233802e-04
## 18       year2004  0.87069427 0.27540934  3.1614551  1.655834e-03
## 19       year2005  1.18141990 0.27903432  4.2339592  2.688754e-05
## 20       year2006  1.00107628 0.28003119  3.5748742  3.810504e-04
## 21       year2007  1.08888050 0.27721397  3.9279424  9.651411e-05
## 22       year2008  1.53797978 0.33541149  4.5853521  5.609542e-06
## 23         month2 -0.05906178 0.13433563 -0.4396583  6.603563e-01
## 24         month3 -0.12175501 0.13450388 -0.9052156  3.657454e-01
## 25         month4 -0.39176469 0.13843225 -2.8300102  4.823805e-03
## 26         month5  0.21218772 0.13435704  1.5792825  1.148431e-01
## 27         month6  0.09258200 0.13569037  0.6823034  4.953328e-01
## 28         month7 -0.67452668 0.15622292 -4.3177190  1.868521e-05
## 29         month8 -1.45909298 0.26050398 -5.6010392  3.361772e-08
## 30         month9 -1.92003948 0.38546708 -4.9810725  8.466223e-07
## 31        month10 -1.72528300 0.26281113 -6.5647259  1.202114e-10
## 32        month11 -0.94344837 0.16515759 -5.7124131  1.820986e-08
## 33        month12 -0.25471012 0.14317452 -1.7790184  7.578595e-02
## 34 countryStLucia -0.36789105 0.13502967 -2.7245201  6.643261e-03
## 35  countryTobago  0.48707079 0.10579656  4.6038431  5.150379e-06
## 36 vesselIceboats  2.07469122 0.09186045 22.5852503  1.605031e-80

We thus have a dataframe with the following columns:

term: these are related explanatory variables we included in the analysis (log effort, year, month, vessel, country)
estimates: These are the coefficient estimates (see later).
std.error: The standard error of the coefficients
statistic: The t-statistics
p.value: The probability value that indicate significance

Because the coefficient values (estimate) and the associated error are on a log scale lets rescale them:

res <-
  res %>%
  mutate(mean = exp(estimate),
         lower.ci = exp(estimate - 2 * std.error),
         upper.ci = exp(estimate + 2 * std.error))

Now even though the information we are most interested is the year factor lets visualize the other factors because they are also of interest. Lets start with months:

var <- "month"
res %>% 
  filter(str_detect(term, var)) %>% 
  mutate(term = str_replace(term, var, ""),
         term = as.integer(term)) %>% 
  ggplot(aes(term, mean)) +
  theme_bw() +
  geom_hline(yintercept = 1, colour = "grey") +
  geom_point() +
  geom_line() +
  geom_linerange(aes(ymin = lower.ci,
                     ymax = upper.ci)) +
  scale_x_continuous(name = "Month", breaks = c(2:12)) +
  scale_y_continuous(name = "Index")

You first notice that the first month is not included. That is because the model by default “sets” that to one (actually zero on the log scale). Here I have just included a horizontal line to represent that value. To interpret this plot you can think of the response values representing the relative catch per unit effort by month where the year, vessel and country factor have been taken into account. The error bars represent the 95% confidence interval (approximately). Notice e.g. that the error bars for the month of 2, 3, 5, 6 and 12 “overlap” with one (the horizontal line). That means that mean cpue for those months is not significantly different from the first month. So the broad seasonal patterns are that the high catch rates occur in the fishing season December to June followed by the low catch rate season from July through November.

# Food for thought:
res %>% 
  filter(str_detect(term, var)) %>% 
  mutate(term = str_replace(term, var, ""),
         term = as.integer(term)) %>% 
  ggplot(aes(term, std.error)) +
  geom_point() +
  expand_limits(y = 0)

Equivalently we can do this for the vessel type. Since we have only two vessel type we only obtain the mean and std.error estimates for the Iceboats (the Dayboats are then set to one). The mean is 7.96and the 95% confidence interval is 6.63 - 9.57. Ergo the Iceboats have approximately 8 times higher catch rates than the Dayboats all else being equal. And the difference is significant.

When we look at the country term we get:

##             term  lower.ci      mean  upper.ci      p.value
## 1 countryStLucia 0.5283743 0.6921926 0.9068015 6.643261e-03
## 2  countryTobago 1.3171597 1.6275418 2.0110640 5.150379e-06

I.e all else being equal the catch rates for St. Lucia are lower and the catch rates for Tobago higher than that observed in Barbados, both being significantly different. Take note that this does not mean that the Tobago fleet is the most efficient rather it may mean that the local density in the waters may be higher, that the true effort is higher, etc. Here I lack expertise/knowledge of these flying fish fisheries. But it may be of interest to dig a little deeper into the reason for the above observation.

Now finally the term that is normally of interest, the year factors:

var <- "year"
res %>% 
  filter(str_detect(term, var)) %>% 
  mutate(term = str_replace(term, var, ""),
         term = as.integer(term)) %>% 
  ggplot(aes(term, mean)) +
  theme_bw() +
  geom_hline(yintercept = 1, colour = "grey") +
  geom_point() +
  geom_line() +
  geom_linerange(aes(ymin = lower.ci,
                     ymax = upper.ci)) +
  labs(x = NULL, y = "Index") +
  expand_limits(y = 0)

If we were to assume that the standardize cpue index is a good indicator of the true biomass (\(CPUE = qB\)) we can at minimum conclude with some reasonable confidence that the long term effect of human removals from this stock has not had any detrimental effect on the stock development.

GLM - stepwise addition of variables

TODO: Add text here to explain approach

m1 <- glm(formula = lcatch ~ leffort + year, data = df)
r1 <-
  m1 %>% 
  tidy() %>% 
  mutate(model = "1. + year")
m2 <- glm(formula = lcatch ~ leffort + year + vessel, data = df)
r2 <-
  m2 %>% 
  tidy() %>% 
  mutate(model = "2. + vessel")
m3 <-  glm(formula = lcatch ~ leffort + year + vessel + month, data = df)
r3 <-
  m3 %>%
  tidy() %>% 
  mutate(model = "3. + month")
m4 <- lm(formula = lcatch ~ leffort + year + month + country + vessel, data = df)
r4 <-
  m4 %>%
  tidy() %>% 
  mutate(model = "4. + country")
  
d <-
  bind_rows(r1, r2) %>% 
  bind_rows(r3) %>% 
  bind_rows(r4) %>% 
  mutate(mean = exp(estimate),
         lower.ci = exp(estimate - 2 * std.error),
         upper.ci = exp(estimate + 2 * std.error)) %>% 
  filter(str_detect(term, "year")) %>% 
  mutate(term = str_replace(term, "year", ""),
         term = as.integer(term))
d %>%
  ggplot(aes(term, mean, colour = model)) +
  geom_line() +
  geom_point() +
  labs(x = NULL, y = "Standardized cpue")

d %>% 
  ggplot(aes(term, std.error, colour = model)) + 
  geom_line() +
  geom_point() +
  labs(x = NULL, y = "Coefficent of variation")

GLM - think of title name

Preamble

Getting the data into R

A tiny introduction to “date” format in R

An introduction to GLM

GLM - stepwise addition of variables