Import and export

Needed library:

library(tidyverse)
library(janitor)

R is not a database software so one normally has to import the data from some other sources.

Entering data directly into R

Although R is not a good data entry medium it is possible. E.g. to one can create vectors by:

weight <- c( 1,  5,  3,  2,  6)
length <- c(10, 17, 14, 12, 18)

Or one could generate a data frame by:

d <- 
  tibble(weight = c( 1,  5,  3,  2,  6),
         length = c(10, 17, 14, 12, 18))

Or one could edit a data frame using the {editData}-package. But that said R is not a good data entry medium and normally only done on an ad hoc basis.

Importing data from plain text files

A lot of functions in R deal with reading in text files in various formats. We have already used the read_csv-function.

w <- read_csv(file = "https://heima.hafro.is/~einarhj/data/minke.csv")

Text files (like above) that reside on the web can normally be read in directly into R. However some other files like Excel can not. If we know the url-path of any files we can first download them before importing them. Here we will just use the minke csv-file as an introductory example, saving the file in the directory “data-raw” in our project directory using the download.file-function:

if(!dir.exists("data-raw")) dir.create("data-raw")  # create if not exists
download.file(url = "https://heima.hafro.is/~einarhj/data/minke.csv",
              mode = "wb",
              destfile = "data-raw/minke.csv")

We can then simply read the file from our computer:

w <- read_csv("data-raw/minke.csv")

There are different read_xxx functions available in the {readr}-package that can be used to read in different formatted text files, check e.g.:

read_csv        # US style csv file (column separator is ',' and dec '.'
read_csv2       # Alternative style csv file (column separator is ';' and dec ','
read_tsv        # Tab deliminated data, US style decimal (.)
read_tsv2       # Tab deliminated data, style decimal (,)
read_delim      # The generic read function

There are also base-R functions like:

# Base R functions
read.table      # Can read in most text files, has a lot of arguements
                #  This may actually be your best friend
read.fortran    # Fortran formated text
readLines       # Raw lines from the file

To read the minke datafile using read.table we would do:

w <- read.table("data-raw/minke.csv", sep = ",", header = TRUE)
w |> glimpse()

Rows: 190
Columns: 13
$ id             <int> 1, 690, 926, 1333, 1334, 1335, 1336, 1338, 1339, 1341, …
$ date           <chr> "2004-06-10T22:00:00Z", "2004-06-15T17:00:00Z", "2004-0…
$ lon            <dbl> -21.42350, -21.39183, -19.81333, -21.57500, -15.61167, …
$ lat            <dbl> 65.66183, 65.65350, 66.51167, 65.67333, 66.29000, 66.17…
$ area           <chr> "North", "North", "North", "North", "North", "North", "…
$ length         <int> 780, 793, 858, 567, 774, 526, 809, 820, 697, 777, 739, …
$ weight         <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ age            <dbl> 11.3, NA, 15.5, 7.2, 12.3, 9.6, 17.3, 13.8, 12.2, 15.4,…
$ sex            <chr> "Female", "Female", "Female", "Male", "Female", "Female…
$ maturity       <chr> "pregnant", "pregnant", "pregnant", "immature", "immatu…
$ stomach.volume <dbl> 58, 90, 24, 25, 85, 18, 200, 111, 8, 25, 38, 6, 11, 100…
$ stomach.weight <dbl> 31.900000, 36.290000, 9.420000, 3.640000, 5.510000, 1.1…
$ year           <int> 2004, 2004, 2004, 2003, 2003, 2003, 2003, 2003, 2003, 2…

Take note that the date is here a character, and to get the proper time-format we would need an additional step:

w |> 
  mutate(date = ymd_hms(date)) |> # more on this later
  glimpse()

Rows: 190
Columns: 13
$ id             <int> 1, 690, 926, 1333, 1334, 1335, 1336, 1338, 1339, 1341, …
$ date           <dttm> 2004-06-10 22:00:00, 2004-06-15 17:00:00, 2004-06-22 0…
$ lon            <dbl> -21.42350, -21.39183, -19.81333, -21.57500, -15.61167, …
$ lat            <dbl> 65.66183, 65.65350, 66.51167, 65.67333, 66.29000, 66.17…
$ area           <chr> "North", "North", "North", "North", "North", "North", "…
$ length         <int> 780, 793, 858, 567, 774, 526, 809, 820, 697, 777, 739, …
$ weight         <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ age            <dbl> 11.3, NA, 15.5, 7.2, 12.3, 9.6, 17.3, 13.8, 12.2, 15.4,…
$ sex            <chr> "Female", "Female", "Female", "Male", "Female", "Female…
$ maturity       <chr> "pregnant", "pregnant", "pregnant", "immature", "immatu…
$ stomach.volume <dbl> 58, 90, 24, 25, 85, 18, 200, 111, 8, 25, 38, 6, 11, 100…
$ stomach.weight <dbl> 31.900000, 36.290000, 9.420000, 3.640000, 5.510000, 1.1…
$ year           <int> 2004, 2004, 2004, 2003, 2003, 2003, 2003, 2003, 2003, 2…

On file paths

The first argument in read_xxx functions is the file name, including the path. If the file is in the current working directory (use getwd() to get the location of you current working directory [or press the “blue R” button in the Files-tab]) one can simply write the following command:

w <- read_csv("minke.csv")

If the data file are in folder called “data-raw” within the working directory:

w <- read_csv('data-raw/minke.csv')

One can also use absolute paths like:

read_csv("~/edu/crfmr/data-raw/minke.csv")      # if on Linux/Mac
read_csv("C:/Users/username/Documents/edu/crfmr/data-raw/minke.csv") # if on Windose

Absolute paths are however specific to each computer directory structure and it is thus not recommended to use if you are distributing the project to other collaborators.

If the data is not in the current working directory tree one may use:

read_csv("../***.csv")             # One folder up
read_csv("../../***.csv")          # Two folders up
read_csv("../../data-raw/***.csv") # Two folders up, one folder down

Again, if sharing a project this may not be reproducible.

Sanity checks

After the data has been imported one frequently checks the data to see what was imported is as expected. Some common functions are:

head(d)     # shows the top 6 lines of d
tail(d)     # shows the last 6 lines of d
dim(d)      # shows the num row and col of d
names(d)    # gives the column names of d
summary(d)  # quick summary statistics for the columns of d
str(d)      # show the structure of the data, e.g. variable types
glimpse(d)  # dplyr equivalent of str for dataframe

Take particular note that the class (type) of each variable is as you expect. E.g. in the minke case one expects that the class for length is numeric (dbl) and class for date is a date-class (dttm), not e.g. character.

Arguments

Lets generate a short inline csv-file:

tmp <- 
  "metadata 1: this data was collected on R/V Bjarni
  metatdata 2: this research was funded by EU
  ID,Date,Fishing area
  1,2021-01-01,North
  N/A,2021-01-02,North
  NA,NA,NA
  3,2021-02-28,South
  4,2021-02-29,N/A
  5,-9999,South"
writeLines(tmp, "data-raw/tmp.csv")

This short datafile is on purpose quite messy. We have:

Metadata: Here the first two lines
Variable names: Here in the third line
Missing data represented as “NA”, “N/A” or -9999

Reading the tmp “csv”-file with just the default arguments gives us this:

read_csv("data-raw/tmp.csv")

Warning: One or more parsing issues, call `problems()` on your data frame for details,
e.g.:
  dat <- vroom(...)
  problems(dat)

# A tibble: 8 × 1
  `metadata 1: this data was collected on R/V Bjarni`
  <chr>                                              
1 metatdata 2: this research was funded by EU        
2 ID,Date,Fishing area                               
3 1,2021-01-01,North                                 
4 N/A,2021-01-02,North                               
5 NA,NA,NA                                           
6 3,2021-02-28,South                                 
7 4,2021-02-29,N/A                                   
8 5,-9999,South

Look carefully at the output message. This dataset actually contains only one variable (metadata 1: this data was collected on R/V Bjarni), the reason being that the first line in the data is interpreted as the variable names and since we have no comma there the read_csv function “thinks” that the data is only one variable.

Part of the message says: “call problems() on your data frame for details”. Lets read in again and run also the suggested problems().

read_csv("data-raw/tmp.csv")

Warning: One or more parsing issues, call `problems()` on your data frame for details,
e.g.:
  dat <- vroom(...)
  problems(dat)

# A tibble: 8 × 1
  `metadata 1: this data was collected on R/V Bjarni`
  <chr>                                              
1 metatdata 2: this research was funded by EU        
2 ID,Date,Fishing area                               
3 1,2021-01-01,North                                 
4 N/A,2021-01-02,North                               
5 NA,NA,NA                                           
6 3,2021-02-28,South                                 
7 4,2021-02-29,N/A                                   
8 5,-9999,South

problems()

This says that we expected the data to be only 1 column but got 3 columns because it detected two “,” in each of the listed rows. In order to get the data properly into R we need to overwrite some of the defaults, first let’s skip the 2 metadata lines:

read_csv("data-raw/tmp.csv",
         skip = 2)                # skip the first two lines (the metadata)

# A tibble: 6 × 3
  ID    Date       `Fishing area`
  <chr> <chr>      <chr>         
1 1     2021-01-01 North         
2 N/A   2021-01-02 North         
3 <NA>  <NA>       <NA>          
4 3     2021-02-28 South         
5 4     2021-02-29 N/A           
6 5     -9999      South

Now in the above code we manage to get things roughly right. But take note that:

ID and date is a character, but we kind of expected a numerical (actually an integer) and a date respectively
The third record returns “NA” coloured in red because in R, NA is represented as missing data by default.

Exercise:

Check the help file for read_csv
Try to fix the na argument in order to account for “N/A” and “-9999” in the data as missing

Solution:

read_csv("data-raw/tmp.csv",
         skip = 2,                      # skip the first two lines
         na = c("NA", "N/A", "-9999"))  # Representation of missing value

But what happened to the date in ID 4?? Try this (right after running the read_csv) to get a hint:

problems()

Variable names

The variables names in imported files are often lengthy, contain combination of upper and lower cases and often have spaces. This mean downstream coding is often cumbersome though doable. To make coding less so, it is strongly adviced that you:

Keep variable names as short as possible
Use only lower case letters
use “_” instead of a space in the name

janitor::clean_names does the last two parts pretty well and dplyr::rename the first.

d <- read_csv("data-raw/tmp.csv",
         skip = 2,
         na = c("NA", "N/A", "-9999")) |> 
  clean_names() |>                # get here only lower case letters 
                                  # replace " " with "_"
  rename(area = fishing_area)     # shorten name
d

# A tibble: 6 × 3
     id date       area 
  <dbl> <date>     <chr>
1     1 2021-01-01 North
2    NA 2021-01-02 North
3    NA NA         <NA> 
4     3 2021-02-28 South
5     4 NA         <NA> 
6     5 NA         South

Exercise:

Read the help file on `remove_empty`` and apply it to the above dataframe such that row 3, which has no data is dropped.

Solution:

d |> 
  remove_empty(which = "rows")

Writing files

Writing a csv-file is as simple as:

write_csv(w, "data-raw/my-minke.csv")

write_rds() store data in R’s custom binary format called RDS preserving the exact object format.

write_rds(w, "data-raw/my-minke.rds")

These can be read in using read_rds():

read_rds("data-raw/my-minke.rds")

Exercise:

Write the “d” table as rds
Read the table in again

Solution:

d |> write_rds("data-raw/tmp.rds")
read_rds("data-raw/tmp.rds")

Importing data from excel sheets

The readxl-package provides “light-weight” support to read in Excel files directly into R. The minke data is avalaible in an excel format called minke.xlsx. You can either download it onto your computer via the usual point and mouse click or use the download.file function:

download.file(url = "https://heima.hafro.is/~einarhj/data/minke.xlsx",
              destfile = "data-raw/minke.xlsx",
              mode = "wb")
library(readxl)
d <- 
  read_excel("data-raw/minke.xlsx")
glimpse(d)

Rows: 190
Columns: 12
$ id             <dbl> 1, 690, 926, 1333, 1334, 1335, 1336, 1338, 1339, 1341, …
$ lon            <dbl> -21.42350, -21.39183, -19.81333, -21.57500, -15.61167, …
$ lat            <dbl> 65.66183, 65.65350, 66.51167, 65.67333, 66.29000, 66.17…
$ area           <chr> "North", "North", "North", "North", "North", "North", "…
$ length         <dbl> 780, 793, 858, 567, 774, 526, 809, 820, 697, 777, 739, …
$ weight         <chr> "NA", "NA", "NA", "NA", "NA", "NA", "NA", "NA", "NA", "…
$ age            <chr> "11.3", "NA", "15.5", "7.2", "12.3", "9.6", "17.3", "13…
$ sex            <chr> "Female", "Female", "Female", "Male", "Female", "Female…
$ maturity       <chr> "pregnant", "pregnant", "pregnant", "immature", "immatu…
$ stomach.volume <chr> "58", "90", "24", "25", "85", "18", "200", "111", "8", …
$ stomach.weight <chr> "31.9", "36.29", "9.42", "3.64", "5.51", "1.167", "99",…
$ year           <dbl> 2004, 2004, 2004, 2003, 2003, 2003, 2003, 2003, 2003, 2…

The read_excel function will by default read in the 1st data-sheet (checkout args(read_excel). To get information on what sheets are in an excel file one can use the excel_sheets function:

excel_sheets("data-raw/minke.xlsx")  # only one sheet here

[1] "maesurements"

If NAs are represented by something other than blank cells, set the na argument by e.g. if -9999 represents missing data then:

read_excel("data-raw/minke.xlsx", na = "-9999") # actually no na values in this file

One can read in a certain cell range by:

read_excel("data-raw/minke.xlsx", range = "A1:C10")

# A tibble: 9 × 3
     id   lon   lat
  <dbl> <dbl> <dbl>
1     1 -21.4  65.7
2   690 -21.4  65.7
3   926 -19.8  66.5
4  1333 -21.6  65.7
5  1334 -15.6  66.3
6  1335 -18.7  66.2
7  1336 -21.5  65.7
8  1338 -22.8  66.1
9  1339 -17.5  66.6

Sanity check on the object read in from Excel is an absolute must because the data can be notoriously corrupt because the user is free to do whatever in that framework either intentionally or by accident.

Other software connections

Package haven provides functions for reading in SPSS, STATA and SAS files:

library(haven)
read_sas("path/to/file") ## SAS files
read_por("path/to/file") ## SPSS portable files
read_sav("path/to/file") ## SPSS data files
read_dta("path/to/file") ## Stata files

Similarly in the R.matlab package there is a function that reads in matlab type of data:

library(R.matlab)
readMat('path/to/file') ## Matlab data files

Importing directly from tip-files

We can read tip-files directly into R using the function foreign::read.dbf. I you have not installed the {foreign}-package before you have to install it first by running:

install.packages("foreign")

Then do:

library(tidyverse)
library(foreign)

If you have a raw tip-file (the suffix is .DBF) you could try something like this:

read.dbf("your/path/to/the/data/tip11.DBF", as.is = TRUE)

Of course you need to replace “your/path/to/the/data/tip11.DBF” with your path of your data on your own computer. If you do not have tip file, but want to give it a go run this code:

# download some species lookup-TIP data taken from some anonymous country
download.file(url = "https://heima.hafro.is/~einarhj/data/FISHCODE.DBF", destfile = "FISHCODE.DBF", mode = "wb")
fishcode <- read.dbf("FISHCODE.DBF", as.is = TRUE)
glimpse(fishcode)

Rows: 797
Columns: 10
$ NODC_COD   <chr> "8741000000", "8835000000", "8861010206", "9202000000", "88…
$ SC_NAME    <chr> "O+:ELOP. O:ANG. O-:ANG.", "O+:ACAN. O:PER. O-:PER.", "SPHO…
$ CMN_NAME   <chr> NA, NA, NA, NA, "AMBERJACK,GREATER", "AMBERJACK,LESSER", "A…
$ ALP_CODE   <chr> NA, NA, NA, NA, "GAM", "LAM", NA, NA, NA, NA, NA, "FAN", NA…
$ AREA       <chr> "(  )", "(  )", NA, NA, "(11)", "(11)", NA, NA, NA, NA, NA,…
$ NMFS_COD   <chr> "NONE", "NONE", "NONE", "NONE", "1812", "1815", "60", "NONE…
$ FISHERY    <chr> NA, NA, NA, NA, "RF", "RF", "EG", "EG", "EG", "EG", "RF", "…
$ PR_CODE    <chr> NA, NA, "741", NA, "112", NA, "42", NA, "44", "43", "578", …
$ OECS_CODE  <chr> NA, NA, NA, NA, "CARADU", NA, "ENGR", "ENGRLY", "ENGRHE", "…
$ LOCAL_NAME <chr> NA, NA, NA, NA, "AMBERJACK,GREATER", "AMBERJACK,LESSER", "A…