Preamble


Load the required packages

library(tidyverse)

# Statistical software
library(haven)

# Excel I/O
library(readxl)
library(openxlsx)

# for database access
library(dbplyr)
library(RSQLite)

Note that if you do not have a package installed you can use `install.packages`, e.g.:

install.packages("openxlsx")
install.packages("RSQLite")

Sources of data


Keyboard


One can simply enter data directly into the console:

weight <- c(1,5,3,2,6)
length <- c(10,17,14,12,18)
plot(length,weight)

or, if you want a more structured entry, one can do:

d <- tibble(weight = c(1,5,3,2,6),
            length = c(10,17,14,12,18))
ggplot(d,aes(length,weight)) + geom_point()

R even has a built-in data entry window:

dat <- data.frame(id=numeric(0), species=character(0),
                  length=numeric(0),age=numeric(0),
                  lat = numeric(0),lon=numeric(0))
dat <- edit(dat)

but this only creates variables in R that, unless saved, will disappear when R is closed.

Saving R objects

R has a special file format for saving R objects, such as data frames, called Rdata files. To save an object to a file you can:

save(d,file='d.Rdata')

where R saves the d data frame to a file called ‘d.Rdata’. You can save any number of objects to a file:

save(d,length, weight, file = 'someData.Rdata')

If you want to save everything in your workspace you can also do:

save.image(file='everything.Rdata')
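
The objects can then be read back into the workspace with the load function:

load('everything.Rdata')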

Text files


Text files are the most basic type of data file and are commonly used for data exchange. Most of the data we are interested in is in some sort of tabular form, and a whole host of functions is available to import these files into R. These include the following functions from the readr package, included in the tidyverse package:

read_csv  ## read comma separated files where the decimal is denoted as '.'
read_csv2 ## read semicolon separated files where the decimal is denoted as ','
read_tsv  ## read tab separated files
read_table ## read white space separated files 
read_delim ## read files with a user-specified delimiter

These functions all have similar inputs where the settings of the data import can be tweaked:

read_function(file,           ## location of the data file 
              skip=n,         ## how many lines at the top of the file should be discarded
              col_names=TRUE, ## should the first line be treated as column names
              comment = '#',  ## everything after the comment symbol will be ignored
              n_max=Inf)      ## maximum number of lines to read in

Note that read_function is a placeholder name for any of the read_* functions. For the most common cases only the file location is needed, which means if your data file is situated in your working directory this is just the name of the file:

d <- read_function('nameOfFile.txt')

But files can be stored anywhere on your computer. If the file is in a folder called data within the working directory:

d <- read_function('data/nameOfFile.txt')

If it is in the folder that contains your working directory (i.e. one level up):

d <- read_function('../nameOfFile.txt')

And if it is somewhere on the computer one can use absolute positioning:

d <- read_function('~/Path/to/File/nameOfFile.txt')  ## linux/mac
d <- read_function('c:/Path/to/File/nameOfFile.txt') ## windows

R can even read files on-line:

d <- read_function('http://somwhereontheweb.com/nameOfFile.txt')

Downloading files: Before proceeding let's download some files to our computer. Here we get some datasets and place them in a folder called “data” within your project:

dir.create('data')
download.file("ftp://ftp.hafro.is/pub/tcrenv/minke.csv",
              destfile = 'data/minke.csv')
download.file("ftp://ftp.hafro.is/pub/tcrenv/iCod_2017.xlsx",
              destfile = 'data/example_excel.xlsx',
              mode = 'wb')

Information on these datasets can be found here.

Now reading in the minke whale dataset into R is simply:

minke <- 
  read_csv('data/minke.csv')

We could also have read this directly from the web:

minke <- read_csv("ftp://ftp.hafro.is/pub/tcrenv/minke.csv")

It is however useful to look at how one would use read_delim to import the same data file:

minke <-
  read_delim(file = 'data/minke.csv', ## path to the file
             col_names = TRUE,        ## are column names
                                      ## at the top
             delim = ',',             ## column separator symbol
             skip = 0,                ## num lines at top to skip
             comment = '#')           ## indicating comments
Exercise 1
  1. How would you import data where the column separator is “|”?
  2. Your current working directory is “~/Documents/workdir1/” and you want to read a file named “data.csv” in “~/Documents/datadir/”. Can you think of two ways to write the file path?
  3. Copy the following text to a text file and read in this file using read_delim:

    this is some data
    col1 col2 col3
    1 2 a
    3 4 b
    5 6 c

Useful sanity checks

When importing data into R it is useful to keep the following in mind:

  • Does each variable have its own column and each subject its own line?
  • Are there any unnecessary lines or columns?
  • Do the data contain any non-ASCII characters?
  • Are there gaps in the data?
  • Are the results entered consistently?
  • Does every variable have its own name?
  • Are the numbers correctly entered?
  • Are there any items that can cause misunderstanding?

Any of these points can cause trouble when importing the data, in particular when the data have been prepared using Excel. Notably, automatic date conversion in Excel is frequently the culprit behind data import issues. After importing it is often useful to run a couple of tests on the data to ensure you have read it in correctly:

The first six rows:

head(minke)
## # A tibble: 6 x 13
##      id date                  lon   lat area  length weight   age sex   
##   <int> <dttm>              <dbl> <dbl> <chr>  <int>  <int> <dbl> <chr> 
## 1     1 2004-06-10 22:00:00 -21.4  65.7 North    780     NA 11.3  Female
## 2   690 2004-06-15 17:00:00 -21.4  65.7 North    793     NA NA    Female
## 3   926 2004-06-22 01:00:00 -19.8  66.5 North    858     NA 15.5  Female
## 4  1333 2003-09-30 16:00:00 -21.6  65.7 North    567     NA  7.20 Male  
## 5  1334 2003-09-25 15:00:00 -15.6  66.3 North    774     NA 12.3  Female
## 6  1335 2003-09-16 16:00:00 -18.7  66.2 North    526     NA  9.60 Female
## # ... with 4 more variables: maturity <chr>, stomach.volume <dbl>,
## #   stomach.weight <dbl>, year <int>

The last six rows:

tail(minke)
## # A tibble: 6 x 13
##      id date                  lon   lat area  length weight   age sex   
##   <int> <dttm>              <dbl> <dbl> <chr>  <int>  <int> <dbl> <chr> 
## 1 21065 2007-08-28 18:00:00 -21.9  66.3 North    840   5301  27.5 Female
## 2 21066 2007-08-29 13:00:00 -21.9  66.3 North    838   5685  22.6 Female
## 3 21068 2007-09-02 16:00:00 -23.5  64.7 South    731   4358  16.3 Male  
## 4 21069 2007-05-09 10:00:00 -22.4  64.3 South    803     NA  21.3 Male  
## 5 21070 2007-05-14 20:00:00 -22.6  64.3 South    701     NA  14.9 Male  
## 6 21072 2007-05-15 16:00:00 -23.1  64.2 South    802     NA  32.9 Male  
## # ... with 4 more variables: maturity <chr>, stomach.volume <dbl>,
## #   stomach.weight <dbl>, year <int>

How many rows and columns are there in the data?

dim(minke)
## [1] 190  13

What are the names of the columns?

names(minke)
##  [1] "id"             "date"           "lon"            "lat"           
##  [5] "area"           "length"         "weight"         "age"           
##  [9] "sex"            "maturity"       "stomach.volume" "stomach.weight"
## [13] "year"

Summary statistics for the dataset

summary(minke)
##        id             date                          lon        
##  Min.   :    1   Min.   :2003-08-18 17:00:00   Min.   :-24.63  
##  1st Qu.: 1503   1st Qu.:2004-06-22 03:00:00   1st Qu.:-22.75  
##  Median :20677   Median :2006-01-16 13:00:00   Median :-20.94  
##  Mean   :14547   Mean   :2005-09-26 08:10:44   Mean   :-19.69  
##  3rd Qu.:20982   3rd Qu.:2006-08-14 11:30:00   3rd Qu.:-16.43  
##  Max.   :21072   Max.   :2007-09-02 16:00:00   Max.   :-13.21  
##                                                                
##       lat            area               length          weight    
##  Min.   :63.30   Length:190         Min.   :461.0   Min.   :1663  
##  1st Qu.:64.17   Class :character   1st Qu.:725.5   1st Qu.:3462  
##  Median :65.13   Mode  :character   Median :770.0   Median :4165  
##  Mean   :65.11                      Mean   :752.6   Mean   :4084  
##  3rd Qu.:66.10                      3rd Qu.:802.8   3rd Qu.:4620  
##  Max.   :67.04                      Max.   :871.0   Max.   :5685  
##                                                     NA's   :169   
##       age            sex              maturity         stomach.volume  
##  Min.   : 3.40   Length:190         Length:190         Min.   :  0.00  
##  1st Qu.:11.40   Class :character   Class :character   1st Qu.: 14.75  
##  Median :17.30   Mode  :character   Mode  :character   Median : 32.00  
##  Mean   :18.97                                         Mean   : 41.91  
##  3rd Qu.:24.30                                         3rd Qu.: 57.25  
##  Max.   :47.40                                         Max.   :200.00  
##  NA's   :17                                            NA's   :2       
##  stomach.weight         year     
##  Min.   :  0.000   Min.   :2003  
##  1st Qu.:  2.312   1st Qu.:2004  
##  Median :  7.819   Median :2006  
##  Mean   : 14.822   Mean   :2005  
##  3rd Qu.: 17.483   3rd Qu.:2006  
##  Max.   :106.250   Max.   :2007  
##  NA's   :2

Structure of the data

glimpse(minke)
## Observations: 190
## Variables: 13
## $ id             <int> 1, 690, 926, 1333, 1334, 1335, 1336, 1338, 1339...
## $ date           <dttm> 2004-06-10 22:00:00, 2004-06-15 17:00:00, 2004...
## $ lon            <dbl> -21.42350, -21.39183, -19.81333, -21.57500, -15...
## $ lat            <dbl> 65.66183, 65.65350, 66.51167, 65.67333, 66.2900...
## $ area           <chr> "North", "North", "North", "North", "North", "N...
## $ length         <int> 780, 793, 858, 567, 774, 526, 809, 820, 697, 77...
## $ weight         <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,...
## $ age            <dbl> 11.3, NA, 15.5, 7.2, 12.3, 9.6, 17.3, 13.8, 12....
## $ sex            <chr> "Female", "Female", "Female", "Male", "Female",...
## $ maturity       <chr> "pregnant", "pregnant", "pregnant", "immature",...
## $ stomach.volume <dbl> 58, 90, 24, 25, 85, 18, 200, 111, 8, 25, 38, 6,...
## $ stomach.weight <dbl> 31.900000, 36.290000, 9.420000, 3.640000, 5.510...
## $ year           <int> 2004, 2004, 2004, 2003, 2003, 2003, 2003, 2003,...
Exercise 2
  1. What is wrong with this file?

    a,b,c 
    1,1,bla,bla
    2,2,bingo
  2. What went wrong with the import?

## # A tibble: 6 x 1
##   `id,date,lon,lat,area,length,weight,age,sex,maturity,stomach.volume,sto…
##   <chr>                                                                   
## 1 1,2004-06-10T22:00:00Z,-21.4235,65.66183333333332,North,780,,11.3,Femal…
## 2 690,2004-06-15T17:00:00Z,-21.391833333333306,65.6535,North,793,,,Female…
## 3 926,2004-06-22T01:00:00Z,-19.813333333333304,66.51166666666671,North,85…
## 4 1333,2003-09-30T16:00:00Z,-21.575,65.67333333333332,North,567,,7.2,Male…
## 5 1334,2003-09-25T15:00:00Z,-15.611666666666704,66.29,North,774,,12.3,Fem…
## 6 1335,2003-09-16T16:00:00Z,-18.67,66.17833333333331,North,526,,9.6,Femal…

Exporting data

R can also export data into formats readable by other software. Among these are:

write_csv   # write a comma separated file
write_delim # write a file with user defined delimiter
write_tsv   # write a tab separated file
write_excel_csv  # write a comma separated file for Excel
write       # write a line to a file

The most used inputs to these functions are the object to be exported and the file name with the desired location. To export the minke dataset as a comma separated file you can simply do:

write_csv(minke, path = 'minke.csv')

but with a bit more control

write_delim(minke,
            path = 'minke-class.csv',  ## file name
            col_names = TRUE,          ## write header
            delim = ',')               ## specify the delimiter

The write_excel_csv function is nearly identical to write_csv, except it adds a hidden character at the top of the file that instructs Excel to read the file as “utf-8” (the universal character encoding).
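
As a minimal sketch (the file name here is just illustrative; note that more recent readr versions use the argument file instead of path):

write_excel_csv(minke, path = 'minke-excel.csv')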

You can write single lines to a file with the write function:

write('Here is some text', file = 'Afile.txt')

But this becomes more useful when appending the text to an existing file:

write('# comment describing the data', file = 'dataFile.csv')
write_csv(minke,path = 'dataFile.csv', append = TRUE, col_names = TRUE)
write('# other comment at the bottom', file = 'dataFile.csv',append = TRUE)

Be careful with append: if it is not set properly the file can be overwritten.

Excel


Importing Excel files

With the readxl package we can read Excel files directly into R with the read_excel function:

d <- read_excel(path,               # Path to the excel file
                sheet = NULL,       # What sheet should be read (either name or number)
                range = NULL,       # What cells should be read   
                skip = 0,           # how many lines at the top should be skipped
                col_names = TRUE,   # does the first line contain column names
                col_types = 'guess')# what is the type of data in the columns

In the simplest of cases, where the data is well formed, we can simply read the data in:

icod_summary <- read_excel("data/example_excel.xlsx")
icod_summary
## # A tibble: 63 x 7
##     Year    Rec    SSB   `B4+` Landings    HR  Fbar
##    <dbl>  <dbl>  <dbl>   <dbl>    <dbl> <dbl> <dbl>
##  1  1955 151968 936957 2354550   545250 0.230 0.290
##  2  1956 152779 791479 2079670   486909 0.230 0.290
##  3  1957 170645 771545 1876560   455182 0.240 0.310
##  4  1958 220745 871819 1863510   517359 0.280 0.350
##  5  1959 289220 849812 1825050   459081 0.250 0.320
##  6  1960 154235 707616 1752670   470121 0.270 0.370
##  7  1961 192814 466533 1495590   377291 0.250 0.360
##  8  1962 128753 568226 1491700   388985 0.260 0.380
##  9  1963 177361 507238 1314700   408800 0.310 0.460
## 10  1964 203993 450537 1218020   437012 0.360 0.550
## # ... with 53 more rows

You can also list the sheets in the Excel file

excel_sheets('data/example_excel.xlsx')
## [1] "iCod_summary" "Nobob"

So if the data is in a specific sheet we can specify the sheet directly by name or number:

# by order
read_excel("data/example_excel.xlsx", sheet = 1) 
# by name
read_excel("data/example_excel.xlsx", sheet = 'iCod_summary') 

Excel files are notoriously free-form, so data are often mingled with comments and colors. We can specify the columns and rows where the data are located within the sheet explicitly with the range argument of read_excel:

d <- read_excel('data/example_excel.xlsx', range = 'A1:D6')
d
## # A tibble: 5 x 4
##    Year    Rec    SSB   `B4+`
##   <dbl>  <dbl>  <dbl>   <dbl>
## 1  1955 151968 936957 2354550
## 2  1956 152779 791479 2079670
## 3  1957 170645 771545 1876560
## 4  1958 220745 871819 1863510
## 5  1959 289220 849812 1825050

And you can even specify the sheet in the range Excel-style:

d <- read_excel('data/example_excel.xlsx',range = 'iCod_summary!A1:D6')

The readxl package also provides a couple of helper functions for specifying the range:

anchored      # anchor selection to a specific cell
cell_cols     # read selected columns
cell_rows     # read selected rows
cell_limits   # read selected columns and rows

So you can ask for only the 3rd and 4th columns:

read_excel('data/example_excel.xlsx',range = cell_cols(3:4))
## # A tibble: 63 x 2
##       SSB   `B4+`
##     <dbl>   <dbl>
##  1 936957 2354550
##  2 791479 2079670
##  3 771545 1876560
##  4 871819 1863510
##  5 849812 1825050
##  6 707616 1752670
##  7 466533 1495590
##  8 568226 1491700
##  9 507238 1314700
## 10 450537 1218020
## # ... with 53 more rows
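
The other helpers work in a similar way; here is a sketch using cell_rows and cell_limits (the ranges are illustrative):

read_excel('data/example_excel.xlsx', range = cell_rows(1:6))      # rows 1 to 6
read_excel('data/example_excel.xlsx',
           range = cell_limits(c(1, 1), c(6, 4)))                  # rows 1-6, columns A-D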

When importing you can also explicitly define the type of each column using the col_types argument:

read_excel('data/example_excel.xlsx',
           col_types = c('text','numeric','numeric','numeric',
                         'numeric','numeric','numeric'))
## # A tibble: 63 x 7
##    Year     Rec    SSB   `B4+` Landings    HR  Fbar
##    <chr>  <dbl>  <dbl>   <dbl>    <dbl> <dbl> <dbl>
##  1 1955  151968 936957 2354550   545250 0.230 0.290
##  2 1956  152779 791479 2079670   486909 0.230 0.290
##  3 1957  170645 771545 1876560   455182 0.240 0.310
##  4 1958  220745 871819 1863510   517359 0.280 0.350
##  5 1959  289220 849812 1825050   459081 0.250 0.320
##  6 1960  154235 707616 1752670   470121 0.270 0.370
##  7 1961  192814 466533 1495590   377291 0.250 0.360
##  8 1962  128753 568226 1491700   388985 0.260 0.380
##  9 1963  177361 507238 1314700   408800 0.310 0.460
## 10 1964  203993 450537 1218020   437012 0.360 0.550
## # ... with 53 more rows

Available column types are:

guess, numeric, text, logical, date, list, skip

Note that when the data in the cells can’t be coerced into the desired type the cell contents will be replaced with NAs.
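
One use of this is to drop columns by marking them as 'skip'; a sketch (the vector must have one entry per column in the sheet):

read_excel('data/example_excel.xlsx',
           col_types = c('numeric','numeric','numeric','numeric',
                         'numeric','skip','skip'))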

Exercise 3
  1. Use the range argument to import the tabular data from the ‘Nobob’ sheet in the example_excel file
  2. Can you think of a way to ensure that the columns are imported as numbers?

Exporting to Excel

Another package, openxlsx, provides tools to export data to an Excel format. To export a single data frame you can simply write:

write.xlsx(minke, file = 'minke.xlsx')

If you need to export more than one data frame you can use a named list:

write.xlsx(list(minke=minke, iCod=icod_summary), file = 'RtoExcel.xlsx')

Other statistical software


Data is often stored in formats created by other statistical software packages such as SPSS, SAS, and Stata. The haven package provides the necessary functions to import from and export to these file formats:

# SAS
read_sas    # read sas file
write_sas   # write sas file
# SPSS
read_sav    # read spss file
write_sav   # write spss file

# Stata
read_dta    # read stata file
write_dta   # write stata file
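
These work in the same way as the readr functions; a minimal sketch with a hypothetical SPSS file:

d <- read_sav('data/survey.sav')   # hypothetical file name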

There are a number of SPSS/SAS/Stata features that have no direct equivalent in R. haven preserves them so you can choose what to do with them, and provides functions for working with labelled values and tagged missing values. To simply eliminate these attributes, use one of the zap functions:

as_factor     # Convert input to a factor.
labelled_spss # Labelled vectors for SPSS
labelled      # Create a labelled vector.
is.labelled   # Test whether a vector is labelled
print_labels  # Print the labels of a labelled vector

tagged_na        # "Tagged" missing values
na_tag  
is_tagged_na 
format_tagged_na 
print_tagged_na

zap_empty     # Convert empty strings into missing values.
zap_formats   # Remove format attributes
zap_labels    # Zap labels
zap_missing   # Zap special missings to regular R missings
zap_widths    # Remove display width attributes
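
Continuing with the hypothetical d from above, a sketch of stripping these attributes after import:

d <- zap_labels(d)    # drop value labels, keeping the underlying codes
d <- zap_missing(d)   # convert tagged missing values to regular NAs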

Database connectivity


Databases are commonly used to store (large amounts of) data and numerous software vendors provide database solutions, both general and specific. Similarly, numerous packages exist to interact with databases in R, notably DBI, RODBC and dplyr. Typically in an R session the user queries the database for the data needed for a particular analysis and loads it into memory. Larger datasets that don’t fit into memory will need to be subsampled. The most common types of database engines have a specialised R package:

Engine     R-package
Oracle     ROracle
Postgres   RPostgreSQL
MSSQL      odbc
MySQL      RMySQL
SQLite     RSQLite

To illustrate how the database connectivity works we will set up a dummy SQLite database containing the minke dataset. The following command creates the database and sets up a connection to it:

db <- src_sqlite('minke.db',create = TRUE)
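
Note that in more recent versions of dplyr the src_sqlite function has been superseded; the same connection can be made directly with DBI (a sketch; the examples below keep using db):

con <- DBI::dbConnect(RSQLite::SQLite(), dbname = 'minke.db')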

It is fairly straightforward to add data to the database. Here the minke whale data is copied to a table called minke_tbl

tmp <- copy_to(db,minke,'minke_tbl')

We can query the database in many ways. If we want everything from the table we use the tbl command:

minke.tbl <- tbl(db,'minke_tbl')
minke.tbl
## # Source:   table<minke_tbl> [?? x 13]
## # Database: sqlite 3.19.3 [/home/haf/einarhj/edu/tcrenv/minke.db]
##       id    date   lon   lat area  length weight   age sex   matu… stomac…
##    <int>   <dbl> <dbl> <dbl> <chr>  <int>  <int> <dbl> <chr> <chr>   <dbl>
##  1     1  1.09e⁹ -21.4  65.7 North    780     NA 11.3  Fema… preg…   58.0 
##  2   690  1.09e⁹ -21.4  65.7 North    793     NA NA    Fema… preg…   90.0 
##  3   926  1.09e⁹ -19.8  66.5 North    858     NA 15.5  Fema… preg…   24.0 
##  4  1333  1.06e⁹ -21.6  65.7 North    567     NA  7.20 Male  imma…   25.0 
##  5  1334  1.06e⁹ -15.6  66.3 North    774     NA 12.3  Fema… imma…   85.0 
##  6  1335  1.06e⁹ -18.7  66.2 North    526     NA  9.60 Fema… imma…   18.0 
##  7  1336  1.06e⁹ -21.5  65.7 North    809     NA 17.3  Fema… preg…  200   
##  8  1338  1.06e⁹ -22.8  66.1 North    820     NA 13.8  Fema… preg…  111   
##  9  1339  1.06e⁹ -17.5  66.6 North    697     NA 12.2  Male  pube…    8.00
## 10  1341  1.06e⁹ -14.7  66.2 North    777     NA 15.4  Male  matu…   25.0 
## # ... with more rows, and 2 more variables: stomach.weight <dbl>,
## #   year <int>

You can also run arbitrary SQL commands:

num.minke <- 
  tbl(db,sql('select count(1) from minke_tbl'))
num.minke
## # Source:   SQL [?? x 1]
## # Database: sqlite 3.19.3 [/home/haf/einarhj/edu/tcrenv/minke.db]
##   `count(1)`
##        <int>
## 1        190
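
More commonly the query is built with ordinary dplyr verbs, which are translated to SQL behind the scenes, and collect() pulls the result into memory; a sketch using minke_tbl:

north <-
  minke.tbl %>%
  filter(area == 'North') %>%
  select(id, length, age)
collect(north)   # execute the query and return the result as a tibble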

Describe the table

tbl_vars(minke.tbl)
##  [1] "id"             "date"           "lon"            "lat"           
##  [5] "area"           "length"         "weight"         "age"           
##  [9] "sex"            "maturity"       "stomach.volume" "stomach.weight"
## [13] "year"
minke.tbl  # note: glimpse(minke.tbl) generates an error
## # Source:   table<minke_tbl> [?? x 13]
## # Database: sqlite 3.19.3 [/home/haf/einarhj/edu/tcrenv/minke.db]
##       id    date   lon   lat area  length weight   age sex   matu… stomac…
##    <int>   <dbl> <dbl> <dbl> <chr>  <int>  <int> <dbl> <chr> <chr>   <dbl>
##  1     1  1.09e⁹ -21.4  65.7 North    780     NA 11.3  Fema… preg…   58.0 
##  2   690  1.09e⁹ -21.4  65.7 North    793     NA NA    Fema… preg…   90.0 
##  3   926  1.09e⁹ -19.8  66.5 North    858     NA 15.5  Fema… preg…   24.0 
##  4  1333  1.06e⁹ -21.6  65.7 North    567     NA  7.20 Male  imma…   25.0 
##  5  1334  1.06e⁹ -15.6  66.3 North    774     NA 12.3  Fema… imma…   85.0 
##  6  1335  1.06e⁹ -18.7  66.2 North    526     NA  9.60 Fema… imma…   18.0 
##  7  1336  1.06e⁹ -21.5  65.7 North    809     NA 17.3  Fema… preg…  200   
##  8  1338  1.06e⁹ -22.8  66.1 North    820     NA 13.8  Fema… preg…  111   
##  9  1339  1.06e⁹ -17.5  66.6 North    697     NA 12.2  Male  pube…    8.00
## 10  1341  1.06e⁹ -14.7  66.2 North    777     NA 15.4  Male  matu…   25.0 
## # ... with more rows, and 2 more variables: stomach.weight <dbl>,
## #   year <int>

List all tables in a database

db_list_tables(db$con)
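
When you are finished it is good practice to close the connection; db$con is the underlying DBI connection (a sketch):

DBI::dbDisconnect(db$con)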

For more information related to database connectivity refer to:

  • https://db.rstudio.com/dplyr/


Exercise 4

Try to read a file from your “home” directory.