Load the required packages
library(tidyverse)
# Statistical software
library(haven)
# Excel I/O
library(readxl)
library(openxlsx)
# for database access
library(dbplyr)
library(RSQLite)
Note if you do not have a package installed use `install.packages´, e.g.:
install.packages("openxlsx")
install.packages("RSQLite")
one can simply enter data directly to the console
weight <- c(1,5,3,2,6)
length <- c(10,17,14,12,18)
plot(length,weight)
or if you want a more structured entry one can :
d <- data_frame(weight = c(1,5,3,2,6),
length = c(10,17,14,12,18))
ggplot(d,aes(length,weight)) + geom_point()
R even has a built data entry window:
dat <- data.frame(id=numeric(0), species=character(0),
length=numeric(0),age=numeric(0),
lat = numeric(0),lon=numeric(0))
dat <- edit(dat)
but this only creates variable in R that, unless saved, will disappear when R is closed.
R has a special file format for saving R objects, such as data frames, called Rdata files. To save an object to a file you can:
save(d,file='d.Rdata')
where R saves the d data frame to a file called ‘d.Rdata’. You can any number of objects to a file:
save(d,length, weight, file = 'someData.Rdata')
If you want to save everything in your workspace you can also do:
save.image(file='everything.Rdata')
Text files are the most basic types of data files and commonly used in all data exchange. Most of the data that we are interested in is in some sort of tabular form and whole host of functions are available to import these files into R. These include the following functions from the readr
package, included in the tidyverse
package:
read_csv ## read comma separated files where the decimal is denoted as '.'
read_csv2 ## read semicolon separated files where the decimal is denoted as ','
read_tsv ## read tab separated files
read_table ## read white space separated files
read_delim ## read files with user specified delimination
These functions all have similar inputs where the settings of the data import can be tweaked:
read_function(file, ## location of the data file
skip=n, ## how many lines at the top of the file should be discarded
col_names=TRUE, ## should the first line be treated as column names
comment = '#', ## everything after the comment symbol will be ignored
n_max=Inf) ## maximum number of lines to read in
Note that read_function
is a placeholder name for any of the read_*
functions. For the most common cases only the file location is needed, which means if your data file is situated in your working directory this is just the name of the file:
d <- read_function('nameOfFile.txt')
But files can be stored anywhere on your computer, if the file is in a folder called data within the working directory:
d <- read_function('data/nameOfFile.txt')
If it is the folder that contains your working directory:
d <- read_function('../nameOfFile.txt')
And if it is somewhere on the computer one can use absolute positioning:
d <- read_function('~/Path/to/File/nameOfFile.txt') ## linux/mac
d <- read_function('c:/Path/to/File/nameOfFile.txt') ## windows
R can even read files on-line:
d <- read_function('http://somwhereontheweb.com/nameOfFile.txt')
Downloading files: Before proceeding lets download some files on our computer. Here we get some datasets and place the datasets in a folder called “data” within your project:
dir.create('data')
download.file("ftp://ftp.hafro.is/pub/tcrenv/minke.csv",
destfile = 'data/minke.csv')
download.file("ftp://ftp.hafro.is/pub/tcrenv/iCod_2017.xlsx",
destfile = 'data/example_excel.xlsx',
mode = 'wb')
Information on these datasets can be found here.
Now reading in the minke whale dataset into R is simply:
minke <-
read_csv('data/minke.csv')
We could also have read this directly from the web:
minke <- read_csv("ftp://ftp.hafro.is/pub/tcrenv/minke.csv")
It is however useful to look at how one would use read_delim
to import the same data file:
minke <-
read_delim(file = 'data/minke.csv', ## path to the file
col_names = TRUE, ## are column names
## at the top
delim = ',', ## column separator symbol
skip = 0, ## num lines at top to skip
comment = '#') ## indicating comments
Copy the following text to a text file and read in this file using read_delim
:
this is some data
col1 col2 col3
1 2 a
3 4 b
5 6 c
When importing data into R it is fairly useful to have the following in mind
Any of these points can cause trouble when importing the data, in particular when the data has been prepared using Excel. Notably automatic date conversions in Excel is frequently the culprit for many data import issues. After importing is often useful to run a couple of tests on the data to ensure you have read the data in correctly:
The first six rows:
head(minke)
## # A tibble: 6 x 13
## id date lon lat area length weight age sex
## <int> <dttm> <dbl> <dbl> <chr> <int> <int> <dbl> <chr>
## 1 1 2004-06-10 22:00:00 -21.4 65.7 North 780 NA 11.3 Female
## 2 690 2004-06-15 17:00:00 -21.4 65.7 North 793 NA NA Female
## 3 926 2004-06-22 01:00:00 -19.8 66.5 North 858 NA 15.5 Female
## 4 1333 2003-09-30 16:00:00 -21.6 65.7 North 567 NA 7.20 Male
## 5 1334 2003-09-25 15:00:00 -15.6 66.3 North 774 NA 12.3 Female
## 6 1335 2003-09-16 16:00:00 -18.7 66.2 North 526 NA 9.60 Female
## # ... with 4 more variables: maturity <chr>, stomach.volume <dbl>,
## # stomach.weight <dbl>, year <int>
last six rows
tail(minke)
## # A tibble: 6 x 13
## id date lon lat area length weight age sex
## <int> <dttm> <dbl> <dbl> <chr> <int> <int> <dbl> <chr>
## 1 21065 2007-08-28 18:00:00 -21.9 66.3 North 840 5301 27.5 Female
## 2 21066 2007-08-29 13:00:00 -21.9 66.3 North 838 5685 22.6 Female
## 3 21068 2007-09-02 16:00:00 -23.5 64.7 South 731 4358 16.3 Male
## 4 21069 2007-05-09 10:00:00 -22.4 64.3 South 803 NA 21.3 Male
## 5 21070 2007-05-14 20:00:00 -22.6 64.3 South 701 NA 14.9 Male
## 6 21072 2007-05-15 16:00:00 -23.1 64.2 South 802 NA 32.9 Male
## # ... with 4 more variables: maturity <chr>, stomach.volume <dbl>,
## # stomach.weight <dbl>, year <int>
How many columns or rows are there in the data
dim(minke)
## [1] 190 13
What are the names of the column
names(minke)
## [1] "id" "date" "lon" "lat"
## [5] "area" "length" "weight" "age"
## [9] "sex" "maturity" "stomach.volume" "stomach.weight"
## [13] "year"
Summary statistics for the dataset
summary(minke)
## id date lon
## Min. : 1 Min. :2003-08-18 17:00:00 Min. :-24.63
## 1st Qu.: 1503 1st Qu.:2004-06-22 03:00:00 1st Qu.:-22.75
## Median :20677 Median :2006-01-16 13:00:00 Median :-20.94
## Mean :14547 Mean :2005-09-26 08:10:44 Mean :-19.69
## 3rd Qu.:20982 3rd Qu.:2006-08-14 11:30:00 3rd Qu.:-16.43
## Max. :21072 Max. :2007-09-02 16:00:00 Max. :-13.21
##
## lat area length weight
## Min. :63.30 Length:190 Min. :461.0 Min. :1663
## 1st Qu.:64.17 Class :character 1st Qu.:725.5 1st Qu.:3462
## Median :65.13 Mode :character Median :770.0 Median :4165
## Mean :65.11 Mean :752.6 Mean :4084
## 3rd Qu.:66.10 3rd Qu.:802.8 3rd Qu.:4620
## Max. :67.04 Max. :871.0 Max. :5685
## NA's :169
## age sex maturity stomach.volume
## Min. : 3.40 Length:190 Length:190 Min. : 0.00
## 1st Qu.:11.40 Class :character Class :character 1st Qu.: 14.75
## Median :17.30 Mode :character Mode :character Median : 32.00
## Mean :18.97 Mean : 41.91
## 3rd Qu.:24.30 3rd Qu.: 57.25
## Max. :47.40 Max. :200.00
## NA's :17 NA's :2
## stomach.weight year
## Min. : 0.000 Min. :2003
## 1st Qu.: 2.312 1st Qu.:2004
## Median : 7.819 Median :2006
## Mean : 14.822 Mean :2005
## 3rd Qu.: 17.483 3rd Qu.:2006
## Max. :106.250 Max. :2007
## NA's :2
Structure of the data
glimpse(minke)
## Observations: 190
## Variables: 13
## $ id <int> 1, 690, 926, 1333, 1334, 1335, 1336, 1338, 1339...
## $ date <dttm> 2004-06-10 22:00:00, 2004-06-15 17:00:00, 2004...
## $ lon <dbl> -21.42350, -21.39183, -19.81333, -21.57500, -15...
## $ lat <dbl> 65.66183, 65.65350, 66.51167, 65.67333, 66.2900...
## $ area <chr> "North", "North", "North", "North", "North", "N...
## $ length <int> 780, 793, 858, 567, 774, 526, 809, 820, 697, 77...
## $ weight <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,...
## $ age <dbl> 11.3, NA, 15.5, 7.2, 12.3, 9.6, 17.3, 13.8, 12....
## $ sex <chr> "Female", "Female", "Female", "Male", "Female",...
## $ maturity <chr> "pregnant", "pregnant", "pregnant", "immature",...
## $ stomach.volume <dbl> 58, 90, 24, 25, 85, 18, 200, 111, 8, 25, 38, 6,...
## $ stomach.weight <dbl> 31.900000, 36.290000, 9.420000, 3.640000, 5.510...
## $ year <int> 2004, 2004, 2004, 2003, 2003, 2003, 2003, 2003,...
What is wrong with this file
a,b,c
1,1,bla,bla
2,2,bingo
What went wrong with the import?
## # A tibble: 6 x 1
## `id,date,lon,lat,area,length,weight,age,sex,maturity,stomach.volume,sto…
## <chr>
## 1 1,2004-06-10T22:00:00Z,-21.4235,65.66183333333332,North,780,,11.3,Femal…
## 2 690,2004-06-15T17:00:00Z,-21.391833333333306,65.6535,North,793,,,Female…
## 3 926,2004-06-22T01:00:00Z,-19.813333333333304,66.51166666666671,North,85…
## 4 1333,2003-09-30T16:00:00Z,-21.575,65.67333333333332,North,567,,7.2,Male…
## 5 1334,2003-09-25T15:00:00Z,-15.611666666666704,66.29,North,774,,12.3,Fem…
## 6 1335,2003-09-16T16:00:00Z,-18.67,66.17833333333331,North,526,,9.6,Femal…
R can also export data into formats readable by other software. Among these are:
write_csv # write a comma separated file
write_delim # write a file with user defined delimiter
write_tsv # write a tab separated file
write_excel_csv # write a comma separated file for Excel
write # write a line to a file
Most used inputs to these functions are the object to be exported and the file name with the desired location. To export the minke data set as a comma separated file you can simply do:
write_csv(minke, path = 'minke.csv')
but with a bit more control
write_delim(minke,
path = 'minke-class.csv', ## file name
col_names = TRUE, ## write header
delim = ',') ## specifiy the delimiter
The write_excel_csv
is nearly identical to write_csv
, except it adds a hidden character at the top of the file that instructs Excel to read the file with “utf-8” (the universal character encoding).
You can write single lines to a file with the write
function:
write('Here is some text', file = 'Afile.txt')
But this becomes more useful when appending the text to an existing file:
write('# comment describing the data', file = 'dataFile.csv')
write_csv(minke,path = 'dataFile.csv', append = TRUE, col_names = TRUE)
write('# other comment at the bottom', file = 'dataFile.csv',append = TRUE)
Be careful with append, if not properly set the file can be overwritten.
With the readxl package we can read Excel files directly into R with the read_excel
function:
d <- read_excel(path, # Path to the excel file
sheet = NULL, # What sheet should be read (either name or number)
range = NULL, # What cells should be read
skip = 0, # how many lines a the top should be skipped
col_names = TRUE, # is the first line
col_types = 'guess')# what is the type of data in the column
In the simplest of cases where the data is well formed we can simply read the data in
icod_summary <- read_excel("data/example_excel.xlsx")
icod_summary
## # A tibble: 63 x 7
## Year Rec SSB `B4+` Landings HR Fbar
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1955 151968 936957 2354550 545250 0.230 0.290
## 2 1956 152779 791479 2079670 486909 0.230 0.290
## 3 1957 170645 771545 1876560 455182 0.240 0.310
## 4 1958 220745 871819 1863510 517359 0.280 0.350
## 5 1959 289220 849812 1825050 459081 0.250 0.320
## 6 1960 154235 707616 1752670 470121 0.270 0.370
## 7 1961 192814 466533 1495590 377291 0.250 0.360
## 8 1962 128753 568226 1491700 388985 0.260 0.380
## 9 1963 177361 507238 1314700 408800 0.310 0.460
## 10 1964 203993 450537 1218020 437012 0.360 0.550
## # ... with 53 more rows
You can also list the sheets in the Excel file
excel_sheets('data/example_excel.xlsx')
## [1] "iCod_summary" "Nobob"
So if the data is in a specific sheet we can specifiy the sheet directly by name or number:
# by order
read_excel("data/example_excel.xlsx", sheet = 1)
# by name
read_excel("data/example_excel.xlsx", sheet = 'iCod_summary')
Excel files are notoriously free form so often data are mingled in with comments and colors. We can specify the columns and rows where the data are located within the sheet explicity with the range setting of read_excel
:
d <- read_excel('data/example_excel.xlsx', range = 'A1:D6')
d
## # A tibble: 5 x 4
## Year Rec SSB `B4+`
## <dbl> <dbl> <dbl> <dbl>
## 1 1955 151968 936957 2354550
## 2 1956 152779 791479 2079670
## 3 1957 170645 771545 1876560
## 4 1958 220745 871819 1863510
## 5 1959 289220 849812 1825050
And you can even specify the sheet in the range Excel-style:
d <- read_excel('data/example_excel.xlsx',range = 'iCod_summary!A1:D6')
The readxl
package provides also a couple of helper functions when specifing the range:
anchored # anchor selction to a specific cell
cell_cols # read selected columns
cell_rows # read selected roww
cell_limits # read selected columns and rows
So you can ask for only the 3rd and the 4th column:
read_excel('data/example_excel.xlsx',range = cell_cols(3:4))
## # A tibble: 63 x 2
## SSB `B4+`
## <dbl> <dbl>
## 1 936957 2354550
## 2 791479 2079670
## 3 771545 1876560
## 4 871819 1863510
## 5 849812 1825050
## 6 707616 1752670
## 7 466533 1495590
## 8 568226 1491700
## 9 507238 1314700
## 10 450537 1218020
## # ... with 53 more rows
When importing you can also explicitly define how the columns in your data will be determined using the col_types argument:
read_excel('data/example_excel.xlsx',col_types = c('text','numeric','numeric','numeric','numeric','numeric','numeric'))
## # A tibble: 63 x 7
## Year Rec SSB `B4+` Landings HR Fbar
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1955 151968 936957 2354550 545250 0.230 0.290
## 2 1956 152779 791479 2079670 486909 0.230 0.290
## 3 1957 170645 771545 1876560 455182 0.240 0.310
## 4 1958 220745 871819 1863510 517359 0.280 0.350
## 5 1959 289220 849812 1825050 459081 0.250 0.320
## 6 1960 154235 707616 1752670 470121 0.270 0.370
## 7 1961 192814 466533 1495590 377291 0.250 0.360
## 8 1962 128753 568226 1491700 388985 0.260 0.380
## 9 1963 177361 507238 1314700 408800 0.310 0.460
## 10 1964 203993 450537 1218020 437012 0.360 0.550
## # ... with 53 more rows
Available column types are:
guess | numeric |
---|---|
text | logical |
date | list |
skip | - |
Note that when the data in the cells can’t be coerced in the the desired format the cell contents will be replace with NA
-s.
Another package, openxlsx
, provides tools to export data to an Excel format. To export a single data frame you can simply write:
write.xlsx(minke, file = 'minke.xlsx')
If you need to export more that one data frame you can use a named list:
write.xlsx(list(minke=minke, iCod=icod_summary), file = 'RtoExcel.xlsx')
Data is often stored in format created by other statistical software packages such a SPSS and SAS. The haven package provides the necessary functions to import and export to these file formats:
# SAS
read_sas # read sas file
write_sas # write sas file
# SPSS
read_sav # read spss file
write_sav # write spss file
# Stata
read_dta # read stata file
write_dta # write stata file
There are a number of SPSS/SAS/Stata features that have no direct equivalent in R. Haven preserves them so you can choose what do with them. To simply eliminate them, use one of the zap functions:
as_factor # Convert input to a factor.
labelled_spss # Labelled vectors for SPSS
labelled # Create a labelled vector.
is.labelled # -><-
print_labels # Print the labels of a labelled vector
tagged_na # "Tagged" missing values
na_tag
is_tagged_na
format_tagged_na
print_tagged_na
zap_empty # Convert empty strings into missing values.
zap_formats # Remove format attributes
zap_labels # Zap labels
zap_missing # Zap special missings to regular R missings
zap_widths # Remove display width attributes
Databases are commonly used to store (large amounts of) data and numerous software vendors provide database solutions, both general and specific Similarly numerous packages exist to interact with databases in R. Notably DBI, RODBC and dplyr. Typically in an R session the user queries the database for the data needed for a particular analysis and loads it into memory. Larger datasets, that don’t fit into memory will need to be subsampled. The most common types of database engines have a specialised R package:
Engine | R-package |
---|---|
Oracle | ROracle |
Postgres | RPostgreSQL |
MSSQL | odbc |
MySQL | RMySQL |
sqlite | RSQLite |
To illustrate how the database connectivity works we will set up a dummy sqlite database containing the minke dataset. The following command creates and sets up a link to a dummy database
db <- src_sqlite('minke.db',create = TRUE)
It is fairly straightforward to add data to the database. Here the minke whale data is copied to a table called minke_tbl
tmp <- copy_to(db,minke,'minke_tbl')
We can query the database in many ways. If we want everything from the table we use the tbl
command:
minke.tbl <- tbl(db,'minke_tbl')
minke.tbl
## # Source: table<minke_tbl> [?? x 13]
## # Database: sqlite 3.19.3 [/home/haf/einarhj/edu/tcrenv/minke.db]
## id date lon lat area length weight age sex matu… stomac…
## <int> <dbl> <dbl> <dbl> <chr> <int> <int> <dbl> <chr> <chr> <dbl>
## 1 1 1.09e⁹ -21.4 65.7 North 780 NA 11.3 Fema… preg… 58.0
## 2 690 1.09e⁹ -21.4 65.7 North 793 NA NA Fema… preg… 90.0
## 3 926 1.09e⁹ -19.8 66.5 North 858 NA 15.5 Fema… preg… 24.0
## 4 1333 1.06e⁹ -21.6 65.7 North 567 NA 7.20 Male imma… 25.0
## 5 1334 1.06e⁹ -15.6 66.3 North 774 NA 12.3 Fema… imma… 85.0
## 6 1335 1.06e⁹ -18.7 66.2 North 526 NA 9.60 Fema… imma… 18.0
## 7 1336 1.06e⁹ -21.5 65.7 North 809 NA 17.3 Fema… preg… 200
## 8 1338 1.06e⁹ -22.8 66.1 North 820 NA 13.8 Fema… preg… 111
## 9 1339 1.06e⁹ -17.5 66.6 North 697 NA 12.2 Male pube… 8.00
## 10 1341 1.06e⁹ -14.7 66.2 North 777 NA 15.4 Male matu… 25.0
## # ... with more rows, and 2 more variables: stomach.weight <dbl>,
## # year <int>
You can also run arbitrary sql commands:
num.minke <-
tbl(db,sql('select count(1) from minke_tbl'))
num.minke
## # Source: SQL [?? x 1]
## # Database: sqlite 3.19.3 [/home/haf/einarhj/edu/tcrenv/minke.db]
## `count(1)`
## <int>
## 1 190
Describe the table
tbl_vars(minke.tbl)
## [1] "id" "date" "lon" "lat"
## [5] "area" "length" "weight" "age"
## [9] "sex" "maturity" "stomach.volume" "stomach.weight"
## [13] "year"
minke.tbl #glimpse(minke.tbl) generates error
## # Source: table<minke_tbl> [?? x 13]
## # Database: sqlite 3.19.3 [/home/haf/einarhj/edu/tcrenv/minke.db]
## id date lon lat area length weight age sex matu… stomac…
## <int> <dbl> <dbl> <dbl> <chr> <int> <int> <dbl> <chr> <chr> <dbl>
## 1 1 1.09e⁹ -21.4 65.7 North 780 NA 11.3 Fema… preg… 58.0
## 2 690 1.09e⁹ -21.4 65.7 North 793 NA NA Fema… preg… 90.0
## 3 926 1.09e⁹ -19.8 66.5 North 858 NA 15.5 Fema… preg… 24.0
## 4 1333 1.06e⁹ -21.6 65.7 North 567 NA 7.20 Male imma… 25.0
## 5 1334 1.06e⁹ -15.6 66.3 North 774 NA 12.3 Fema… imma… 85.0
## 6 1335 1.06e⁹ -18.7 66.2 North 526 NA 9.60 Fema… imma… 18.0
## 7 1336 1.06e⁹ -21.5 65.7 North 809 NA 17.3 Fema… preg… 200
## 8 1338 1.06e⁹ -22.8 66.1 North 820 NA 13.8 Fema… preg… 111
## 9 1339 1.06e⁹ -17.5 66.6 North 697 NA 12.2 Male pube… 8.00
## 10 1341 1.06e⁹ -14.7 66.2 North 777 NA 15.4 Male matu… 25.0
## # ... with more rows, and 2 more variables: stomach.weight <dbl>,
## # year <int>
List all tables in a database
db_list_tables(db$con)
For information related to database connectivity refer to: * https://db.rstudio.com/dplyr/
Try to open a file from “home”