Cleaning data

NOTE: Currently use data from the 2014 R course. Need to get permission before pushing to the public sphere. If permission not granted need to use another dataset.

We are going to import an Excel file called all_length_data.xls. In the example case it resides in a directory called data-raw under the current R working space.

Needed packages:

library(tidyverse)      # or load just dplyr
library(stringr)        # for string manipulations
library(readxl)         # has functions to read data from Excel

First go:

We are going to work with a dataset that contains 2309 observations and 7 variables

d_raw <- read_excel("data-raw/all_length_data.xls")

Notice that in the console the following text was returned:

There were 50 or more warnings (use warnings() to see the first 50)

Lets not worry about that for now, we will return to that later. To get a view of the data read in do:

glimpse(d_raw)

## Observations: 2,458
## Variables: 7
## $ DATE                 <chr> "Fall 2004", "Fall 2004", "Fall 2004", "F...
## $ Boat                 <chr> "Harold", "Harold", "Harold", "Harold", "...
## $ Technique            <chr> "trap", "trap", "trap", "trap", "trap", "...
## $ Species              <chr> "Acanthurus chirurgus", "Haemulon flavoli...
## $ Fork Length  (cm)    <dbl> 24.0, 18.0, 24.0, 27.0, 26.0, 21.0, 20.0,...
## $ Total Length (cm)    <dbl> 26.0, 19.5, 25.0, 28.5, 28.5, 23.0, 22.0,...
## $ Standard Length (cm) <dbl> 21.0, 16.5, 21.0, 24.5, 24.0, 19.0, 19.0,...

So we have 7 variables as expected. The first 4 variables are of type character and the last three are numeric values. However the number of observations are 2458 which is more than the 2309 variables as stated above.

The glimpse function used above shows the topmost records. We can use the tail function to get a view of the last records (6 records by default):

tail(d_raw)

## # A tibble: 6 × 7
##    DATE  Boat Technique Species `Fork Length  (cm)` `Total Length (cm)`
##   <chr> <chr>     <chr>   <chr>               <dbl>               <dbl>
## 1  <NA>  <NA>      <NA>    <NA>                  NA                  NA
## 2  <NA>  <NA>      <NA>    <NA>                  NA                  NA
## 3  <NA>  <NA>      <NA>    <NA>                  NA                  NA
## 4  <NA>  <NA>      <NA>    <NA>                  NA                  NA
## 5  <NA>  <NA>      <NA>    <NA>                  NA                  NA
## 6  <NA>  <NA>      <NA>    <NA>                  NA                  NA
## # ... with 1 more variables: `Standard Length (cm)` <dbl>

We see that all these observations are empty (marked as NA). A very convenient function to get rid of empty record (rows) is the xxxx-function (NOTE: need to find that function):

# use this for now
d <- d_raw[rowSums(is.na(d_raw)) != ncol(d_raw),]
glimpse(d)

## Observations: 2,309
## Variables: 7
## $ DATE                 <chr> "Fall 2004", "Fall 2004", "Fall 2004", "F...
## $ Boat                 <chr> "Harold", "Harold", "Harold", "Harold", "...
## $ Technique            <chr> "trap", "trap", "trap", "trap", "trap", "...
## $ Species              <chr> "Acanthurus chirurgus", "Haemulon flavoli...
## $ Fork Length  (cm)    <dbl> 24.0, 18.0, 24.0, 27.0, 26.0, 21.0, 20.0,...
## $ Total Length (cm)    <dbl> 26.0, 19.5, 25.0, 28.5, 28.5, 23.0, 22.0,...
## $ Standard Length (cm) <dbl> 21.0, 16.5, 21.0, 24.5, 24.0, 19.0, 19.0,...

So now we have the number of observations we expected.

The names of the variables is not very convenient for further processing (contain space, mix of lower and upper cap, …). Lets rename them:

d <-
  d %>% 
  rename(date = DATE,
         boat = Boat,
         gear = Technique,
         species = Species,
         flength = `Fork Length  (cm)`,
         tlength = `Total Length (cm)`,
         slength = `Standard Length (cm)`)
names(d)

## [1] "date"    "boat"    "gear"    "species" "flength" "tlength" "slength"

Now lets look at the non-numeric variable and see if they are as expected. One way to do this is as follows:

table(d$gear)

## 
## Electronic Reel  Hawaiian Sling   hook and line   Hook and Line 
##             718             163               8             396 
##            Line    lobster hook    lobster trap    Lobster Trap 
##              66               1             130             159 
##              NA           sling           spear            trap 
##              29               7             152             467 
##           Traps       Trolling  
##               7               6

First we notice that R is case sensitive (lobster trap in not the same as Lobster Trap). We also observe other type of (likely) typos. We expect for example that Traps and trap refer to the same gear group. We also see that Trolling aligns not to the right, which means that is has a trailing space. Lets try to clean those and other likely typing errors:

d <- 
  d %>% 
  mutate(gear = str_trim(gear),                        # Get rid of leading and trailing spaces
         gear = ifelse(gear == "NA", NA, gear),        # NA is hear actually NA
                                                       #  NOTE: check if this can not be fixed upstream (e.g. read_excel)
         gear = Hmisc::capitalize(tolower(gear)),      # Just the first word with capital
         gear = ifelse(gear == "Trap", "Traps", gear),
         gear = ifelse(gear == "Lobster hook", "Lobster trap", gear))

Lets what we have got:

table(d$gear)

## 
## Electronic reel  Hawaiian sling   Hook and line            Line 
##             718             163             404              66 
##    Lobster trap           Sling           Spear           Traps 
##             290               7             152             474 
##        Trolling 
##               6

Now we can do similar things with the boat variables:

table(d$boat)

## 
##     117.000000      20.000000             27        Anthony         Aubrey 
##              1             19              4             20             11 
##         Brooks        C Eagle         Caiban        Caliban       CMK Boat 
##             18             58              4            229              6 
##    Da Boss TCI         Dealer          Dolph        George      Goldilocks 
##            304             51            346              5              2 
## Harbour Snaper         Harold       Harold's        Imanuel       Isabella 
##             58             82             14             29             15 
##      Isabella          Lady T      Lil Irene     Mary Jay 2         Morris 
##            387              1              5              1            108 
##       Normanda     North Star         Picket        poacher        Poacher 
##            331              3             21             27              6 
##         Ronald    Ronald Dean         samuel         Samuel          SC112 
##             12             24             24              7              2 
##          Soupy       Starfish           Suvi      The Dance 
##              8             61              1              4

We notice that we seem to have numerical values as text. This needs to be dug into further. But where we have actual names we observe some likely typos (not listed here, just read the code below):

d <- 
  d %>% 
  mutate(boat = str_trim(boat),    # get rid of leading or trailing spaces
         boat = ifelse(boat == "Caiban", "Caliban", boat),
         boat = ifelse(boat == "Harold's", "Harold", boat),
         boat = ifelse(boat == "poacher", "Poacher", boat),
         boat = ifelse(boat == "Ronald", "Ronald Dean", boat),
         boat = ifelse(boat == "samuel", "Samuel", boat))

table(d$species)

## 
##       Acanthocybium solandri   Acanthostracion polygonius 
##                            1                            2 
##  Acanthostracion polygonius            Acanthostracion sp 
##                           11                            1 
##          Acanthurus bahianus         Acanthurus chirurgus 
##                           24                            2 
##         Acanthurus coeruleus             Apsilus dentatus 
##                            2                           39 
##              Balistes vetula              Calamus calamus 
##                           16                            5 
##       Canthidermis sufflamen      Canthidermis sufflament 
##                            5                            2 
##                 Caranx latus            Caranx latus jack 
##                            1                            1 
##              Caranx lugubris                 Caranx ruber 
##                            4                           73 
##         Cephalopholis fulvus Cephalopholis fulvus grouper 
##                           36                           13 
##     Cephalopholus cruentatus         Coryphaena equiselis 
##                            5                            1 
##         Epinephelus guttatus       Epinephelus mystacinus 
##                           40                           69 
##         Epinephelus striatus              Etelis oculatus 
##                          229                          127 
##       Ginglymostoma cirratum           glass eye lutjanus 
##                            1                            1 
##               Haemulon album         Haemulon carbonarium 
##                          144                           10 
##       Haemulon flavolineatum           Haemulon melanurum 
##                          129                            2 
##               Haemulon parra           Haemulon plumierii 
##                            3                          165 
##          Haemulon plumierii              Haemulon sciurus 
##                           78                           60 
##     Holocentrus adscensionis            Holocentrus rufus 
##                            3                            2 
##            Hypoplectrus spp.                     Kyphosus 
##                            1                            1 
##                    L.maximus         Lachnolaimus maximus 
##                            3                           43 
##         Lactophrys triqueter                     lutjanus 
##                            2                          154 
##                    lutjanus               Lutjanus analis 
##                            1                           63 
##              Lutjanus apodus          Lutjanus buccanella 
##                            4                          274 
##         Lutjanus campechanus             Lutjanus griseus 
##                          111                           66 
##            Lutjanus mahogani            Lutjanus mahogoni 
##                            1                           17 
##           Lutjanus purpureus            mahogony lutjanus 
##                           50                            1 
##        Mycteroperca venenosa                           NA 
##                            9                            5 
##            Ocyurus chrysurus         Pomacanthus arcuatus 
##                           73                            3 
##             Pomacanthus paru                        porgy 
##                            1                            1 
##       Pseudupeneus maculatus                     rockfish 
##                            5                            2 
##                S.taenopterus              salmon rockfish 
##                            1                            1 
##            Scarus iseri (IP)      Scarus iseri parrotfish 
##                            1                            5 
##          Scarus taeniopterus                Scarus vetula 
##                           35                            6 
##           Scarus vetula (IP)        schoolmaster lutjanus 
##                            1                            1 
##        Scomberomorus cavalla       Sparisoma aurofrenatum 
##                            1                            6 
##             Sparisoma viride          Sphyraena barracuda 
##                           27                           14 
##                      Thunnus           Thunnus albacares  
##                            1                            8 
##            yellowfin mojarra 
##                            3

Bit of a nightmare to clean these :-) One way to do this is to generate a table with the unique species names, export it, add a column in some simple data editor (Excel will to nicely for this).

species <-
  d %>% 
  mutate(species = str_trim(species))

But lets start to look at the numerical variables

ggplot(d, aes(tlength, slength)) +
  geom_point() +
  geom_abline()