NOTE: Currently use data from the 2014 R course. Need to get permission before pushing to the public sphere. If permission not granted need to use another dataset.
We are going to import an Excel file called all_length_data.xls. In the example case it resides in a directory called data-raw under the current R working space.
Needed packages:
library(tidyverse) # or load just dplyr
library(stringr) # for string manipulations
library(readxl) # has functions to read data from Excel
First go:
We are going to work with a dataset that contains 2309 observations and 7 variables
d_raw <- read_excel("data-raw/all_length_data.xls")
Notice that in the console the following text was returned:
There were 50 or more warnings (use warnings() to see the first 50)
Lets not worry about that for now, we will return to that later. To get a view of the data read in do:
glimpse(d_raw)
## Observations: 2,458
## Variables: 7
## $ DATE <chr> "Fall 2004", "Fall 2004", "Fall 2004", "F...
## $ Boat <chr> "Harold", "Harold", "Harold", "Harold", "...
## $ Technique <chr> "trap", "trap", "trap", "trap", "trap", "...
## $ Species <chr> "Acanthurus chirurgus", "Haemulon flavoli...
## $ Fork Length (cm) <dbl> 24.0, 18.0, 24.0, 27.0, 26.0, 21.0, 20.0,...
## $ Total Length (cm) <dbl> 26.0, 19.5, 25.0, 28.5, 28.5, 23.0, 22.0,...
## $ Standard Length (cm) <dbl> 21.0, 16.5, 21.0, 24.5, 24.0, 19.0, 19.0,...
So we have 7 variables as expected. The first 4 variables are of type character and the last three are numeric values. However the number of observations are 2458 which is more than the 2309 variables as stated above.
The glimpse
function used above shows the topmost records. We can use the tail
function to get a view of the last records (6 records by default):
tail(d_raw)
## # A tibble: 6 × 7
## DATE Boat Technique Species `Fork Length (cm)` `Total Length (cm)`
## <chr> <chr> <chr> <chr> <dbl> <dbl>
## 1 <NA> <NA> <NA> <NA> NA NA
## 2 <NA> <NA> <NA> <NA> NA NA
## 3 <NA> <NA> <NA> <NA> NA NA
## 4 <NA> <NA> <NA> <NA> NA NA
## 5 <NA> <NA> <NA> <NA> NA NA
## 6 <NA> <NA> <NA> <NA> NA NA
## # ... with 1 more variables: `Standard Length (cm)` <dbl>
We see that all these observations are empty (marked as NA). A very convenient function to get rid of empty record (rows) is the xxxx
-function (NOTE: need to find that function):
# use this for now
d <- d_raw[rowSums(is.na(d_raw)) != ncol(d_raw),]
glimpse(d)
## Observations: 2,309
## Variables: 7
## $ DATE <chr> "Fall 2004", "Fall 2004", "Fall 2004", "F...
## $ Boat <chr> "Harold", "Harold", "Harold", "Harold", "...
## $ Technique <chr> "trap", "trap", "trap", "trap", "trap", "...
## $ Species <chr> "Acanthurus chirurgus", "Haemulon flavoli...
## $ Fork Length (cm) <dbl> 24.0, 18.0, 24.0, 27.0, 26.0, 21.0, 20.0,...
## $ Total Length (cm) <dbl> 26.0, 19.5, 25.0, 28.5, 28.5, 23.0, 22.0,...
## $ Standard Length (cm) <dbl> 21.0, 16.5, 21.0, 24.5, 24.0, 19.0, 19.0,...
So now we have the number of observations we expected.
The names of the variables is not very convenient for further processing (contain space, mix of lower and upper cap, …). Lets rename them:
d <-
d %>%
rename(date = DATE,
boat = Boat,
gear = Technique,
species = Species,
flength = `Fork Length (cm)`,
tlength = `Total Length (cm)`,
slength = `Standard Length (cm)`)
names(d)
## [1] "date" "boat" "gear" "species" "flength" "tlength" "slength"
Now lets look at the non-numeric variable and see if they are as expected. One way to do this is as follows:
table(d$gear)
##
## Electronic Reel Hawaiian Sling hook and line Hook and Line
## 718 163 8 396
## Line lobster hook lobster trap Lobster Trap
## 66 1 130 159
## NA sling spear trap
## 29 7 152 467
## Traps Trolling
## 7 6
First we notice that R is case sensitive (lobster trap in not the same as Lobster Trap). We also observe other type of (likely) typos. We expect for example that Traps and trap refer to the same gear group. We also see that Trolling aligns not to the right, which means that is has a trailing space. Lets try to clean those and other likely typing errors:
d <-
d %>%
mutate(gear = str_trim(gear), # Get rid of leading and trailing spaces
gear = ifelse(gear == "NA", NA, gear), # NA is hear actually NA
# NOTE: check if this can not be fixed upstream (e.g. read_excel)
gear = Hmisc::capitalize(tolower(gear)), # Just the first word with capital
gear = ifelse(gear == "Trap", "Traps", gear),
gear = ifelse(gear == "Lobster hook", "Lobster trap", gear))
Lets what we have got:
table(d$gear)
##
## Electronic reel Hawaiian sling Hook and line Line
## 718 163 404 66
## Lobster trap Sling Spear Traps
## 290 7 152 474
## Trolling
## 6
Now we can do similar things with the boat variables:
table(d$boat)
##
## 117.000000 20.000000 27 Anthony Aubrey
## 1 19 4 20 11
## Brooks C Eagle Caiban Caliban CMK Boat
## 18 58 4 229 6
## Da Boss TCI Dealer Dolph George Goldilocks
## 304 51 346 5 2
## Harbour Snaper Harold Harold's Imanuel Isabella
## 58 82 14 29 15
## Isabella Lady T Lil Irene Mary Jay 2 Morris
## 387 1 5 1 108
## Normanda North Star Picket poacher Poacher
## 331 3 21 27 6
## Ronald Ronald Dean samuel Samuel SC112
## 12 24 24 7 2
## Soupy Starfish Suvi The Dance
## 8 61 1 4
We notice that we seem to have numerical values as text. This needs to be dug into further. But where we have actual names we observe some likely typos (not listed here, just read the code below):
d <-
d %>%
mutate(boat = str_trim(boat), # get rid of leading or trailing spaces
boat = ifelse(boat == "Caiban", "Caliban", boat),
boat = ifelse(boat == "Harold's", "Harold", boat),
boat = ifelse(boat == "poacher", "Poacher", boat),
boat = ifelse(boat == "Ronald", "Ronald Dean", boat),
boat = ifelse(boat == "samuel", "Samuel", boat))
table(d$species)
##
## Acanthocybium solandri Acanthostracion polygonius
## 1 2
## Acanthostracion polygonius Acanthostracion sp
## 11 1
## Acanthurus bahianus Acanthurus chirurgus
## 24 2
## Acanthurus coeruleus Apsilus dentatus
## 2 39
## Balistes vetula Calamus calamus
## 16 5
## Canthidermis sufflamen Canthidermis sufflament
## 5 2
## Caranx latus Caranx latus jack
## 1 1
## Caranx lugubris Caranx ruber
## 4 73
## Cephalopholis fulvus Cephalopholis fulvus grouper
## 36 13
## Cephalopholus cruentatus Coryphaena equiselis
## 5 1
## Epinephelus guttatus Epinephelus mystacinus
## 40 69
## Epinephelus striatus Etelis oculatus
## 229 127
## Ginglymostoma cirratum glass eye lutjanus
## 1 1
## Haemulon album Haemulon carbonarium
## 144 10
## Haemulon flavolineatum Haemulon melanurum
## 129 2
## Haemulon parra Haemulon plumierii
## 3 165
## Haemulon plumierii Haemulon sciurus
## 78 60
## Holocentrus adscensionis Holocentrus rufus
## 3 2
## Hypoplectrus spp. Kyphosus
## 1 1
## L.maximus Lachnolaimus maximus
## 3 43
## Lactophrys triqueter lutjanus
## 2 154
## lutjanus Lutjanus analis
## 1 63
## Lutjanus apodus Lutjanus buccanella
## 4 274
## Lutjanus campechanus Lutjanus griseus
## 111 66
## Lutjanus mahogani Lutjanus mahogoni
## 1 17
## Lutjanus purpureus mahogony lutjanus
## 50 1
## Mycteroperca venenosa NA
## 9 5
## Ocyurus chrysurus Pomacanthus arcuatus
## 73 3
## Pomacanthus paru porgy
## 1 1
## Pseudupeneus maculatus rockfish
## 5 2
## S.taenopterus salmon rockfish
## 1 1
## Scarus iseri (IP) Scarus iseri parrotfish
## 1 5
## Scarus taeniopterus Scarus vetula
## 35 6
## Scarus vetula (IP) schoolmaster lutjanus
## 1 1
## Scomberomorus cavalla Sparisoma aurofrenatum
## 1 6
## Sparisoma viride Sphyraena barracuda
## 27 14
## Thunnus Thunnus albacares
## 1 8
## yellowfin mojarra
## 3
Bit of a nightmare to clean these :-) One way to do this is to generate a table with the unique species names, export it, add a column in some simple data editor (Excel will to nicely for this).
species <-
d %>%
mutate(species = str_trim(species))
But lets start to look at the numerical variables
ggplot(d, aes(tlength, slength)) +
geom_point() +
geom_abline()