Needed library:
library(tidyverse)
Note if you have not installed the tidyverse
-package then if you want to use the glimpse
function introduced below you at minimum need to load the dplyr
-package:
library(dplyr)
R is not a database software so one normally has to import the data from some other sources.
Although R is not a good data entry medium it is possible. E.g. to one can create vectors by:
weight <- c(1,5,3,2,6)
length <- c(10,17,14,12,18)
Or one could generate a dataframe by:
d <- data.frame(weight = c(1, 5, 3, 2, 6),
length = c(10, 17, 14, 12, 18))
Or one could use the simple R Data Editor:
d <- data.frame(weight = numeric(0),
length = numeric(0))
d <- edit(d)
Here the edit
-function opens up a very primitive data-entry window with the data then being stored in the object d. But as said R is not a good data entry medium so this is rarely used for anything else than ad hoc entries of data.
A lot of functions in R deal with reading in text files in various formats. The base function is read.table
:
d <-
read.table(file = "nameOfFile", # path to the file
header = TRUE, # are there column names
# at the top
dec = ".", # decimal sign
sep = " ", # column separator symbol
skip = 0, # num lines at top to skip
stringsAsFactors = FALSE, # treat text as character
comment.char = "#") # indicating comments
Some examples of the derivatives of read.table
are:
read.csv # US style CSV file (column separator is ',' and dec '.'
read.csv2 # European style CSV file (column separator is ';' and dec ','
read.delim # Tab deliminated data, US style decimal (.)
read.delim # Tab deliminated data, EU style decimal (,)
read.fwf # Fixed width input (common format in the old days)
read.fortran # Fotran formated text
readLines # Raw lines from the file
scan # Reads in a vector from the input
NOTE: Below is a generic issue, not limited to read.table The first argument in the read.table
function and its derivative is the file name, including the path. If the file (lets say it is called minke.csv) is in the current working directory (use getwd()
to get information of you current working directory) one can simply write the following command:
minke <- read.table("minke.csv",
header = TRUE,
sep = ",")
Or, because we know the file is a csv-file we simply do:
minke <- read.csv("minke.csv")
If the data file are in folder called data-raw within the working directory:
minke <- read.csv('data-raw/minke.csv')
If it is somewhere on the computer one can use absolute positioning. E.g if it resides in a root folder called all_data one would use the following:
minke <- read.csv("~/all_data/minke.csv") # linux/mac
minke <- read.csv("C:/all_data/minke.csv") # windows
If the folder name contains spaces (e.g. folder name is all data one would need to resort to:
minke <- read.csv("C:/all\t data/minke.csv")
One can also read online data directly into R, e.g.:
minke <- read.csv("http://www.hafro.is/~einarhj/data/minke.csv")
glimpse(minke)
## Observations: 190
## Variables: 13
## $ id <int> 1, 690, 926, 1333, 1334, 1335, 1336, 1338, 1339...
## $ date <fctr> 2004-06-10 22:00:00, 2004-06-15 17:00:00, 2004...
## $ lon <dbl> -21.42350, -21.39183, -19.81333, -21.57500, -15...
## $ lat <dbl> 65.66183, 65.65350, 66.51167, 65.67333, 66.2900...
## $ area <fctr> North, North, North, North, North, North, Nort...
## $ length <int> 780, 793, 858, 567, 774, 526, 809, 820, 697, 77...
## $ weight <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,...
## $ age <dbl> 11.3, NA, 15.5, 7.2, 12.3, 9.6, 17.3, 13.8, 12....
## $ sex <fctr> Female, Female, Female, Male, Female, Female, ...
## $ maturity <fctr> pregnant, pregnant, pregnant, immature, immatu...
## $ stomach.volume <dbl> 58, 90, 24, 25, 85, 18, 200, 111, 8, 25, 38, 6,...
## $ stomach.weight <dbl> 31.900000, 36.290000, 9.420000, 3.640000, 5.510...
## $ year <int> 2004, 2004, 2004, 2003, 2003, 2003, 2003, 2003,...
Take note that the non-numerial (character) data are imported as factors. Although desirable in some cases (e.g. when fitting a model) for most purposes we want them in just as character strings. Hence when reading in things we often put the stringsAsFactors argument to FALSE:
minke <- read.csv("http://www.hafro.is/~einarhj/data/minke.csv",
stringsAsFactors = FALSE)
glimpse(minke)
## Observations: 190
## Variables: 13
## $ id <int> 1, 690, 926, 1333, 1334, 1335, 1336, 1338, 1339...
## $ date <chr> "2004-06-10 22:00:00", "2004-06-15 17:00:00", "...
## $ lon <dbl> -21.42350, -21.39183, -19.81333, -21.57500, -15...
## $ lat <dbl> 65.66183, 65.65350, 66.51167, 65.67333, 66.2900...
## $ area <chr> "North", "North", "North", "North", "North", "N...
## $ length <int> 780, 793, 858, 567, 774, 526, 809, 820, 697, 77...
## $ weight <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,...
## $ age <dbl> 11.3, NA, 15.5, 7.2, 12.3, 9.6, 17.3, 13.8, 12....
## $ sex <chr> "Female", "Female", "Female", "Male", "Female",...
## $ maturity <chr> "pregnant", "pregnant", "pregnant", "immature",...
## $ stomach.volume <dbl> 58, 90, 24, 25, 85, 18, 200, 111, 8, 25, 38, 6,...
## $ stomach.weight <dbl> 31.900000, 36.290000, 9.420000, 3.640000, 5.510...
## $ year <int> 2004, 2004, 2004, 2003, 2003, 2003, 2003, 2003,...
The readxl
-package provides support to read in Excel files directly into R (We later introduce you to the XLConnect
package that has more options). The minke data stored in an excel sheet called minke.xlsx. You can either download it onto your computer via the usual point and mouse click or use the download.file
function:
download.file(url = "http://www.hafro.is/~einarhj/data/minke.xlsx",
destfile = "minke.xlsx",
mode = "wb")
library(readxl)
d <- read_excel("minke.xlsx")
glimpse(d)
## Observations: 190
## Variables: 12
## $ id <dbl> 1, 690, 926, 1333, 1334, 1335, 1336, 1338, 1339...
## $ lon <dbl> -21.42350, -21.39183, -19.81333, -21.57500, -15...
## $ lat <dbl> 65.66183, 65.65350, 66.51167, 65.67333, 66.2900...
## $ area <chr> "North", "North", "North", "North", "North", "N...
## $ length <dbl> 780, 793, 858, 567, 774, 526, 809, 820, 697, 77...
## $ weight <chr> "NA", "NA", "NA", "NA", "NA", "NA", "NA", "NA",...
## $ age <chr> "11.3", "NA", "15.5", "7.2", "12.3", "9.6", "17...
## $ sex <chr> "Female", "Female", "Female", "Male", "Female",...
## $ maturity <chr> "pregnant", "pregnant", "pregnant", "immature",...
## $ stomach.volume <chr> "58", "90", "24", "25", "85", "18", "200", "111...
## $ stomach.weight <chr> "31.9", "36.29", "9.42", "3.64", "5.51", "1.167...
## $ year <dbl> 2004, 2004, 2004, 2003, 2003, 2003, 2003, 2003,...
The read_excel
function will by default read in the 1st data-sheet (checkout args(read_excel)
. To get information on what sheets are in an excel file one can use the excel_sheets
function:
excel_sheets("minke.xlsx")
## [1] "maesurements"
If NAs are represented by something other than blank cells, set the na argument by e.g. if -9999 represents missing data then:
read_excel("minke.xlsx", na = "-9999")
After the data has been imported one frequently checks the data to see what was imported is as expected. Some common functions are:
head(d) # shows the top 6 lines of d
tail(d) # shows the last 6 lines of d
dim(d) # shows the num row and col of d
names(d) # gives the column names of d
summary(d) # quick summary statistics for the columns of d
str(d) # show the structure of the data, e.g. variable types
glimpse(d) # dplyr equivalent of str that works on data frames
Databases are commonly used to store (large amounts of) data and numerous software vendors provide database solutions, both general and specific. Similarly numerous packages exist to interact with databases in R. Notably DBI
, RODBC
and dplyr
. Typically in an R session the user queries the database for the data needed for a particular analysis and loads it into memory. Larger datasets, that don’t fit into memory may however need to be processed prior to importing it into R.
RODBC
packages provides functions to connect to an Access database
library(RODBC) # Load RODBC package
# Connect to Access db
db <-
odbcConnectAccess("C:/Documents/NameOfMyAccessDatabase")
# Get data
d <- sqlQuery(db , "select *
from Name_of_table_in_my_database")
close(db) # close connection
The dplyr
package has built in connectivity for a wide range of data base types:
src_sqlite() ## sets up a connection
## with an sqlite database
src_postgres() ## sets up a connection with
## an postgres database
tbl() ## calls a table from a database
sql() ## prepares a sql query
copy_to() ## saves a dataframe to a database
## as a table
… need some words here
library(RPostgreSQL)
# connect to the database
ram <- src_postgres(dbname = "srdb",
host = "nautilus-vm.mathstat.dal.ca",
user = "srdbuser",
password = "srd6us3r!",
port = 5432,
options = "-c search_path=srdb")
ram # list of the tables in the database
tbl(ram, "assessment") %>% # Access the table
filter(stockid == "CODICE") %>% # I only want to look at my cod
select(assessid) %>% # I only wanted to get the assessid from this table
# to be used in the left_join below
left_join(tbl(ram, "timeseries")) %>% # Get data from the timeries (only 'CODICE'
collect() %>% # Get the data to my computer
ggplot() + # And to the plot :-)
geom_line(aes(tsyear, tsvalue)) +
facet_wrap(~ tsid, scale = "free_y")
Package haven
provides functions for reading in SPSS, STATA and SAS files:
library(haven)
read_sas("path/to/file") ## SAS files
read_por("path/to/file") ## SPSS portable files
read_sav("path/to/file") ## SPSS data files
read_dta("path/to/file") ## Stata files
Similarly in the R.matlab
package there is a function that reads in matlab type of data:
library(R.matlab)
readMat('path/to/file') ## Matlab data files
The ’rfishbase’ package allows access to FishBase directly from R:
library(rfishbase)
# query data on length weight relationship
length_weight("Gadus morhua")
# query growth parameters
popgrowth("Parexocoetus brachypterus")
# find common names (in many languages)
common_names("Parexocoetus brachypterus")
# diet data
diet("Parexocoetus brachypterus")
# fecundity
fecundity("Parexocoetus brachypterus")