Ideology behind R for Data Science

The tidyverse is a collection of R packages that provides tools for data science and is especially useful for wrangling, manipulating, visualizing, and communicating large data sets. Extensions of the tidyverse also enable direct connection to, and manipulation of, SQL databases (e.g., dbplyr). Here we briefly introduce the main concepts of programming in this style, all derived directly from the open-access book R for Data Science by Garrett Grolemund and Hadley Wickham (which can be found here).

As you can read here, the main idea behind using the tidyverse is that exploratory data analysis in R is composed of a few main steps: first importing and tidying data, then iteratively transforming, visualising, and modeling the data to understand the patterns they contain, and finally communicating results effectively. The tidyverse was designed as a programming method and collection of functions that turn these tasks into a simple, uniform routine that can be applied to any dataset. Standardizing the approach taken toward a data science project aids both its reproducibility and the ability to collaborate on it.

The foundation: tibbles, tidy data, and piping

Tibbles

First we need to install and load the tidyverse. After that we can have a look at the main form of data storage, called a tibble:

install.packages('tidyverse')
install.packages('nycflights13') # this is an example data package
library(tidyverse) # attaches the core packages listed below

## ── Attaching packages ─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────── tidyverse 1.3.0 ──

## ✓ ggplot2 3.3.2     ✓ purrr   0.3.4
## ✓ tibble  3.0.3     ✓ dplyr   1.0.1
## ✓ tidyr   1.1.1     ✓ stringr 1.4.0
## ✓ readr   1.3.1     ✓ forcats 0.5.0

## ── Conflicts ────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()

library(nycflights13)

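Note that a tibble is essentially the same as a data frame (for example, one made with data.frame) but with some useful information printed (e.g., dimensions and data types), as well as some restrictions placed on how it can be manipulated. These help prevent common errors. For example, recycling is stricter than in data frames: a data frame recycles any vector whose length divides the number of rows evenly, whereas a tibble (version 3.0 or later) recycles only vectors of length one, so both assignments below fail: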

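try(flights$year <- c(2013, 2014))   # error: a tibble does not recycle a length-2 vector
try(flights$year <- rep(2014, 7))    # error: a length-7 vector is not recycled either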
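The difference in recycling can be seen on a small scale. The following is a minimal sketch that assumes tibble 3.0 or later; the tables `df` and `tb` and their columns are hypothetical and are not part of the flights data.

```r
library(tibble)

# A small hypothetical table, built once as a data frame and once as a tibble
df <- data.frame(id = 1:4)
tb <- tibble(id = 1:4)

# Data frames recycle any vector whose length divides the number of rows
df$grp <- c("a", "b")          # stored as a, b, a, b

# Tibbles recycle only vectors of length one
tb$year <- 2013                # fine: 2013 is repeated four times
res <- try(tb$grp <- c("a", "b"), silent = TRUE)
inherits(res, "try-error")     # TRUE: the length-2 assignment is rejected
```

The stricter rule means that a vector of the wrong length raises an error instead of being silently repeated.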

Tidy data

In addition, the above data may look like a standard data set obtained from anywhere, but it is not: it has already been formatted as ‘tidy data’. Data can be arranged in tables in a variety of ways for presentation, but for manipulation and analysis one format is much easier to work with than the others, so data should be transformed into this format before analysis. In the tidyverse this format is called ‘tidy’, and three rules make a dataset tidy:

1. Each variable must have its own column.
2. Each observation must have its own row.
3. Each value must have its own cell.

If, for example, the flights dataset were organized such that each carrier, origin, or dest had their own sets of columns, the data would no longer be tidy.
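As an illustration of the rules, here is a minimal sketch with a small hypothetical table (`untidy`, with made-up layout but counts in the style of the flights data) in which the variable origin is spread across column names. It assumes tidyr 1.0.0 or later for `pivot_longer()`.

```r
library(tibble)
library(tidyr)

# Hypothetical untidy table: one column per origin airport, so the variable
# "origin" lives in the column names rather than in a column of its own
untidy <- tibble(
  dest = c("ATL", "AUS"),
  EWR  = c(5022, 968),
  JFK  = c(1930, 1471)
)

# Tidy version: one row per dest/origin pair and one column per variable
tidy <- pivot_longer(untidy, cols = c(EWR, JFK),
                     names_to = "origin", values_to = "n")
tidy
```

In the tidy version, origin can be used directly for filtering, grouping, or mapping to a plot aesthetic.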

Piping

In base R programming, functions wrap around the objects that they are applied to, which are often indexed, and this manipulated object is saved as a new one. What is written is arranged like an onion: in the following example, the first step of the command is in the center of code (calling the object flights), followed by indexing the 15th and 16th columns. As we move away from the center, a function is applied, and finally the output of that function is assigned to a new object.

sub <- apply(flights[,c(15:16)], 2, mean, na.rm = T)

Piping, or using ‘%>%’ to pass objects from one function to the next, introduces a programming method that makes the process more intuitive by aligning the code with the order of operations:

flights %>% 
  select(15, 16) %>%  # or use select(air_time, distance)
  apply(., 2, mean, na.rm = T) -> sub

In the above code, the flights tibble was piped into the ‘select’ function, which kept only its 15th and 16th columns. ‘Select’ does not require an argument where flights is referenced because the pipe supplies the left-hand object as the first argument of the function on its right. Note that selecting by column name (no quotations needed) is also possible and more useful in most cases. After being piped to ‘select’, the result was then piped to the function ‘apply’. ‘Apply’ is an older base R function, but it also takes its data as the first argument, so the placeholder ‘.’ is optional here; it is written out to make explicit where the piped object goes, and it becomes necessary whenever the data must be passed to an argument other than the first. Finally, the resulting vector of column means is assigned to ‘sub’ at the end, but alternatively it could have been assigned at the beginning as in the non-piped version.
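The equivalence between nested and piped calls can be checked directly. This is a minimal sketch on a hypothetical three-row table (`toy`) standing in for the two numeric columns of flights; `colMeans()` is used only as a compact stand-in for the `apply()` call.

```r
library(dplyr)

# Hypothetical stand-in for two numeric columns of flights
toy <- tibble(air_time = c(60, 120, NA), distance = c(400, 800, 1200))

# Nested ("onion") form: read from the inside out
nested <- colMeans(select(toy, air_time, distance), na.rm = TRUE)

# Piped form: x %>% f(y) is evaluated as f(x, y)
piped <- toy %>%
  select(air_time, distance) %>%
  colMeans(na.rm = TRUE)

# The placeholder "." states explicitly which argument receives the input
piped_dot <- toy %>%
  select(air_time, distance) %>%
  apply(X = ., MARGIN = 2, FUN = mean, na.rm = TRUE)

all.equal(nested, piped)
all.equal(piped, piped_dot)
```

All three forms compute the same column means; only the reading order of the code differs.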

The power of tidyverse: all you need in a handful of functions

Like ‘select’, many functions come with the tidyverse packages, but only a small set is needed for almost any kind of data wrangling. These are the only functions we touch on in this brief introduction. However, beyond the tidyverse, there are also a variety of packages that implement more advanced piping-compatible functions, in particular ones that speed up the manipulation of large data sets (e.g., dbplyr, purrrlyr).

The most commonly used tidyverse commands, with a brief description, include:

* select() - selects columns
* filter() - retains rows according to boolean criteria
* arrange() - sorts rows
* rename() - renames existing columns
* mutate() - creates new columns
* group_by() / ungroup() - groups data according to column values (such as factors)
* summarise() - reduces the dataset to an aggregated level. Used after grouping (which defines the aggregation level) and along with functions that define how to aggregate (e.g., n(), sum(), mean()); count() is a shortcut for grouping and then counting rows.
* gather() / spread() - convert data between the tidy (‘long’) format and ‘wide’ formats; in tidyr 1.0.0 and later they are superseded by pivot_longer() / pivot_wider()
* full_join(), left_join(), etc. - join two data frames according to criteria that define which rows match (i.e., joins as in relational databases)
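As a minimal sketch of the join functions listed above, the following uses two hypothetical miniature tables (`toy_flights` and `toy_carriers`, with made-up values) that share a `carrier` key.

```r
library(dplyr)

# Hypothetical miniature versions of a flights table and a carrier lookup table
toy_flights  <- tibble(carrier  = c("UA", "DL", "XX"),
                       distance = c(1400, 760, 500))
toy_carriers <- tibble(carrier = c("UA", "DL", "AA"),
                       name    = c("United", "Delta", "American"))

# left_join keeps every row of the left table; unmatched keys get NA
left <- left_join(toy_flights, toy_carriers, by = "carrier")

# full_join keeps every key found in either table
full <- full_join(toy_flights, toy_carriers, by = "carrier")

left
full
```

Here `left` has three rows, with a missing `name` for the unmatched carrier "XX", while `full` has four rows because "AA" appears only in the lookup table.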

Below is an example of how the function ‘apply’ in the previous example can be replaced using tidyverse commands, and how functions such as ‘aggregate’ can be replaced using ‘group_by’ and ‘summarise’:

  flights %>%
  select(air_time, distance) %>% 
  summarise(mn_airtime = mean(air_time, na.rm = T),
            mn_distance = mean(distance))

## # A tibble: 1 x 2
##   mn_airtime mn_distance
##        <dbl>       <dbl>
## 1       151.       1040.

 # or if the operation should occur by groupings:
  flights %>%
  select(dest, air_time, distance) %>%
  group_by(dest) %>% 
  summarise(mn_airtime = mean(air_time, na.rm = T),
            mn_distance = mean(distance))

## `summarise()` ungrouping output (override with `.groups` argument)

## # A tibble: 105 x 3
##    dest  mn_airtime mn_distance
##    <chr>      <dbl>       <dbl>
##  1 ABQ        249.        1826 
##  2 ACK         42.1        199 
##  3 ALB         31.8        143 
##  4 ANC        413.        3370 
##  5 ATL        113.         757.
##  6 AUS        213.        1514.
##  7 AVL         89.9        584.
##  8 BDL         25.5        116 
##  9 BGR         54.1        378 
## 10 BHM        123.         866.
## # … with 95 more rows
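For comparison with the grouped summary, here is a minimal sketch of the base R `aggregate()` call it replaces. The table `toy` is a hypothetical three-row stand-in for flights, and the non-formula interface of `aggregate()` is used so that missing values are handled per column, as in the `summarise()` version.

```r
library(dplyr)

# Hypothetical miniature stand-in for flights
toy <- tibble(
  dest     = c("ATL", "ATL", "AUS"),
  air_time = c(110, NA, 210),
  distance = c(757, 757, 1514)
)

# Base R: aggregate the two numeric columns by destination
base_res <- aggregate(
  as.data.frame(toy[c("air_time", "distance")]),
  by  = list(dest = toy$dest),
  FUN = mean, na.rm = TRUE
)

# Tidyverse: the same aggregation with group_by() and summarise()
tidy_res <- toy %>%
  group_by(dest) %>%
  summarise(mn_airtime  = mean(air_time, na.rm = TRUE),
            mn_distance = mean(distance, na.rm = TRUE))

all.equal(base_res$air_time, tidy_res$mn_airtime)
```

Both give one row per destination; the tidyverse version names each output column explicitly and reads in the order the steps are carried out.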

Here is a smattering of demonstrations on how to use the other important functions and their equivalents in base R:

filter()

#filter()
flights[flights$month==3 & flights$dest=="DEN",]

## # A tibble: 643 x 19
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
##  1  2013     3     1      630            630         0      831            910
##  2  2013     3     1      648            649        -1      830            912
##  3  2013     3     1      649            649         0      839            920
##  4  2013     3     1      814            815        -1     1012           1056
##  5  2013     3     1      827            830        -3     1045           1106
##  6  2013     3     1      843            800        43     1042           1031
##  7  2013     3     1      857            900        -3     1057           1135
##  8  2013     3     1      922            925        -3     1115           1205
##  9  2013     3     1     1056           1100        -4     1258           1326
## 10  2013     3     1     1144           1137         7     1327           1403
## # … with 633 more rows, and 11 more variables: arr_delay <dbl>, carrier <chr>,
## #   flight <int>, tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>,
## #   distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>

flights %>% 
  filter(month == 3, dest == "DEN")

## # A tibble: 643 x 19
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
##  1  2013     3     1      630            630         0      831            910
##  2  2013     3     1      648            649        -1      830            912
##  3  2013     3     1      649            649         0      839            920
##  4  2013     3     1      814            815        -1     1012           1056
##  5  2013     3     1      827            830        -3     1045           1106
##  6  2013     3     1      843            800        43     1042           1031
##  7  2013     3     1      857            900        -3     1057           1135
##  8  2013     3     1      922            925        -3     1115           1205
##  9  2013     3     1     1056           1100        -4     1258           1326
## 10  2013     3     1     1144           1137         7     1327           1403
## # … with 633 more rows, and 11 more variables: arr_delay <dbl>, carrier <chr>,
## #   flight <int>, tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>,
## #   distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>

arrange()

o <- order(flights$distance)
flights[o,c('year','month','day','distance')]

## # A tibble: 336,776 x 4
##     year month   day distance
##    <int> <int> <int>    <dbl>
##  1  2013     7    27       17
##  2  2013     1     3       80
##  3  2013     1     4       80
##  4  2013     1     4       80
##  5  2013     1     4       80
##  6  2013     1     5       80
##  7  2013     1     6       80
##  8  2013     1     7       80
##  9  2013     1     8       80
## 10  2013     1     9       80
## # … with 336,766 more rows

flights[rev(o),c('year','month','day','distance')]

## # A tibble: 336,776 x 4
##     year month   day distance
##    <int> <int> <int>    <dbl>
##  1  2013     9    30     4983
##  2  2013     9    29     4983
##  3  2013     9    28     4983
##  4  2013     9    27     4983
##  5  2013     9    25     4983
##  6  2013     9    23     4983
##  7  2013     9    22     4983
##  8  2013     9    21     4983
##  9  2013     9    20     4983
## 10  2013     9    18     4983
## # … with 336,766 more rows

flights %>% 
  select(year, month, day, distance) %>% 
  arrange(distance)

## # A tibble: 336,776 x 4
##     year month   day distance
##    <int> <int> <int>    <dbl>
##  1  2013     7    27       17
##  2  2013     1     3       80
##  3  2013     1     4       80
##  4  2013     1     4       80
##  5  2013     1     4       80
##  6  2013     1     5       80
##  7  2013     1     6       80
##  8  2013     1     7       80
##  9  2013     1     8       80
## 10  2013     1     9       80
## # … with 336,766 more rows

flights %>% 
  select(year, month, day, distance) %>% 
  arrange(desc(distance))

## # A tibble: 336,776 x 4
##     year month   day distance
##    <int> <int> <int>    <dbl>
##  1  2013     1     1     4983
##  2  2013     1     2     4983
##  3  2013     1     3     4983
##  4  2013     1     4     4983
##  5  2013     1     5     4983
##  6  2013     1     6     4983
##  7  2013     1     7     4983
##  8  2013     1     8     4983
##  9  2013     1     9     4983
## 10  2013     1    10     4983
## # … with 336,766 more rows

rename()

flights2 <- data.frame(flights, year2 = flights$year)
flights2 <- as_tibble(flights2[c(dim(flights2)[2],2:(dim(flights2)[2]-1))])
flights2

## # A tibble: 336,776 x 19
##    year2 month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
##  1  2013     1     1      517            515         2      830            819
##  2  2013     1     1      533            529         4      850            830
##  3  2013     1     1      542            540         2      923            850
##  4  2013     1     1      544            545        -1     1004           1022
##  5  2013     1     1      554            600        -6      812            837
##  6  2013     1     1      554            558        -4      740            728
##  7  2013     1     1      555            600        -5      913            854
##  8  2013     1     1      557            600        -3      709            723
##  9  2013     1     1      557            600        -3      838            846
## 10  2013     1     1      558            600        -2      753            745
## # … with 336,766 more rows, and 11 more variables: arr_delay <dbl>,
## #   carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## #   air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>

flights2 <- 
  flights %>% 
  rename(year2 = year)
flights2

## # A tibble: 336,776 x 19
##    year2 month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
##  1  2013     1     1      517            515         2      830            819
##  2  2013     1     1      533            529         4      850            830
##  3  2013     1     1      542            540         2      923            850
##  4  2013     1     1      544            545        -1     1004           1022
##  5  2013     1     1      554            600        -6      812            837
##  6  2013     1     1      554            558        -4      740            728
##  7  2013     1     1      555            600        -5      913            854
##  8  2013     1     1      557            600        -3      709            723
##  9  2013     1     1      557            600        -3      838            846
## 10  2013     1     1      558            600        -2      753            745
## # … with 336,766 more rows, and 11 more variables: arr_delay <dbl>,
## #   carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## #   air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>

mutate()

flights2 <- as_tibble(data.frame(flights, air_time_hr = flights$air_time/60, distance_1000m = flights$distance/1000))


flights2 <-
  flights %>% 
  mutate(air_time_hr = air_time/60, distance_1000m = distance/1000)
flights2

## # A tibble: 336,776 x 21
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
##  1  2013     1     1      517            515         2      830            819
##  2  2013     1     1      533            529         4      850            830
##  3  2013     1     1      542            540         2      923            850
##  4  2013     1     1      544            545        -1     1004           1022
##  5  2013     1     1      554            600        -6      812            837
##  6  2013     1     1      554            558        -4      740            728
##  7  2013     1     1      555            600        -5      913            854
##  8  2013     1     1      557            600        -3      709            723
##  9  2013     1     1      557            600        -3      838            846
## 10  2013     1     1      558            600        -2      753            745
## # … with 336,766 more rows, and 13 more variables: arr_delay <dbl>,
## #   carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## #   air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>,
## #   air_time_hr <dbl>, distance_1000m <dbl>

group_by() with count()

#count() with group_by()
#this yields data in a 'wide' format (one column per origin)
flights2a <- t(table(flights[,c('origin', 'dest')]))
head(flights2a)

##      origin
## dest    EWR   JFK   LGA
##   ABQ     0   254     0
##   ACK     0   265     0
##   ALB   439     0     0
##   ANC     8     0     0
##   ATL  5022  1930 10263
##   AUS   968  1471     0

#this yields 'tidy' data
flights2 <- 
  flights %>% 
  group_by(origin, dest) %>% 
  count()
flights2

## # A tibble: 224 x 3
## # Groups:   origin, dest [224]
##    origin dest      n
##    <chr>  <chr> <int>
##  1 EWR    ALB     439
##  2 EWR    ANC       8
##  3 EWR    ATL    5022
##  4 EWR    AUS     968
##  5 EWR    AVL     265
##  6 EWR    BDL     443
##  7 EWR    BNA    2336
##  8 EWR    BOS    5327
##  9 EWR    BQN     297
## 10 EWR    BTV     931
## # … with 214 more rows

spread() / gather() to convert between wide and tidy (long) formats:

flights2b <-
  flights2 %>% 
  as_tibble() %>% 
  spread(key = origin, value = n, fill = 0)

#both objects now have the same wide layout
flights2b

## # A tibble: 105 x 4
##    dest    EWR   JFK   LGA
##    <chr> <dbl> <dbl> <dbl>
##  1 ABQ       0   254     0
##  2 ACK       0   265     0
##  3 ALB     439     0     0
##  4 ANC       8     0     0
##  5 ATL    5022  1930 10263
##  6 AUS     968  1471     0
##  7 AVL     265     0    10
##  8 BDL     443     0     0
##  9 BGR       0     0   375
## 10 BHM       0     1   296
## # … with 95 more rows

head(flights2a)

##      origin
## dest    EWR   JFK   LGA
##   ABQ     0   254     0
##   ACK     0   265     0
##   ALB   439     0     0
##   ANC     8     0     0
##   ATL  5022  1930 10263
##   AUS   968  1471     0

flights2b %>% 
  gather(key = origin, value = n, -dest) %>% 
  filter(n!=0)

## # A tibble: 224 x 3
##    dest  origin     n
##    <chr> <chr>  <dbl>
##  1 ALB   EWR      439
##  2 ANC   EWR        8
##  3 ATL   EWR     5022
##  4 AUS   EWR      968
##  5 AVL   EWR      265
##  6 BDL   EWR      443
##  7 BNA   EWR     2336
##  8 BOS   EWR     5327
##  9 BQN   EWR      297
## 10 BTV   EWR      931
## # … with 214 more rows

head(flights2)

## # A tibble: 6 x 3
## # Groups:   origin, dest [6]
##   origin dest      n
##   <chr>  <chr> <int>
## 1 EWR    ALB     439
## 2 EWR    ANC       8
## 3 EWR    ATL    5022
## 4 EWR    AUS     968
## 5 EWR    AVL     265
## 6 EWR    BDL     443
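In tidyr 1.0.0 and later, `spread()` and `gather()` are superseded by `pivot_wider()` and `pivot_longer()`. The following is a minimal sketch of the same round trip with the newer functions, using a hypothetical three-row table of counts (`counts`) instead of the full flights2 object.

```r
library(dplyr)
library(tidyr)

# Hypothetical miniature table of counts in tidy (long) format
counts <- tibble(
  origin = c("EWR", "JFK", "EWR"),
  dest   = c("ATL", "ATL", "ALB"),
  n      = c(5022L, 1930L, 439L)
)

# Long to wide: the counterpart of spread(key = origin, value = n, fill = 0)
wide <- counts %>%
  pivot_wider(names_from = origin, values_from = n,
              values_fill = list(n = 0L))

# Wide to long: the counterpart of gather(key = origin, value = n, -dest)
back <- wide %>%
  pivot_longer(cols = -dest, names_to = "origin", values_to = "n") %>%
  filter(n != 0)

wide
back
```

The argument names `names_from`, `values_from`, `names_to`, and `values_to` state the direction of the reshaping, which is the main practical difference from `key` and `value`.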

Plotting in tidyverse

The tidyverse also includes ggplot2, which is intended to simplify the process of creating plots so that data can be quickly and easily visualized as an iterative component of the exploratory analysis process. Some advantages include a clear method for translating data to visuals, many preconfigured attributes, and the ability to build on and modify previously stored plot objects without needing to recreate them. A downside is that taking advantage of the full power and flexibility of ggplot2 requires wide knowledge of the available options, and therefore involves a long learning curve. However, the ultimate results are well worth the learning investment. For a basic explanation and cheat sheet see here.

ggplot: Key components

ggplot has three key components:

Data, which must be a data.frame (or a tibble),
A set of aesthetic mappings (aes) between variables in the data and visual properties, and
At least one layer which describes how to render each observation.

sub <- 
  flights %>% 
  sample_n(100, replace = F) %>% 
  filter(!is.na(distance), !is.na(air_time), !is.na(origin), !is.na(month), !is.na(dest))

ggplot(data = sub, aes(x = distance, y = air_time)) + geom_point()

Different syntax, equivalent outcome:

ggplot(sub, aes(distance, air_time)) + geom_point()
ggplot()                    + geom_point(data = sub, aes(distance, air_time))
ggplot(data = sub)            + geom_point(aes(x = distance, y = air_time))
ggplot(sub)                   + geom_point(aes(distance, air_time))

A plot can be stored as an object for later use. This is a useful feature: because plots are created as ggplot objects, they can be stored and modified by the user at a later point.

p <- ggplot(sub, aes(distance, air_time)) + geom_point()

The class:

class(p)

## [1] "gg"     "ggplot"

The structure (a bit of Latin - not run here):

str(p)
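Modifying a stored plot can be sketched as follows. The data frame `toy` and the object names `p_toy` and `p_toy2` are hypothetical; the axis units (miles and minutes) are those documented for the flights data.

```r
library(ggplot2)

# Hypothetical three-point data set
toy <- data.frame(distance = c(200, 800, 1400), air_time = c(40, 120, 200))

# Store a basic scatterplot
p_toy <- ggplot(toy, aes(distance, air_time)) + geom_point()

# Adding to a stored plot returns a new plot; p_toy itself is unchanged
p_toy2 <- p_toy +
  geom_line() +
  labs(x = "Distance (miles)", y = "Air time (minutes)")

length(p_toy$layers)    # number of layers in the stored plot
length(p_toy2$layers)   # one more layer after adding geom_line()
```

Nothing is drawn until a plot object is printed, so variants of the same base plot can be built up and compared cheaply.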

aesthetic

Adding more variables to a two dimensional scatterplot can be done by mapping the variables to an aesthetic (colour, fill, size, shape, alpha)

colour

p <- ggplot(sub, aes(distance, air_time))
p + geom_point(aes(colour = origin))
p + geom_point(aes(colour = dest))

Manual control of colours or other palette schemes (here brewer):

p + geom_point(aes(colour = origin)) +
  scale_colour_manual(values = c("orange","brown","green"))
p + geom_point(aes(colour = dest)) +
  scale_colour_brewer(palette = "Set1")

## Warning in RColorBrewer::brewer.pal(n, pal): n too large, allowed maximum for palette Set1 is 9
## Returning the palette you asked for with that many colors

## Warning: Removed 71 rows containing missing values (geom_point).

Note, to view all the brewer palettes do:

RColorBrewer::display.brewer.all()

shape

p + geom_point(aes(distance, air_time, shape = origin))

size

p + geom_point(aes(distance, air_time, size = month))

One can also “fix” the aesthetic manually, e.g.:

ggplot(sub, aes(distance, air_time)) + geom_point(colour = "blue", shape = 8, size = 10)

Note here that the call to colour, shape, etc. is done outside the aes-call. One can also combine calls inside and outside the aes-function (here alpha makes overlapping data points visible):

p + geom_point(aes(distance, air_time, size = month), alpha = 0.3, col = "red")

Facetting

Splitting a graph into subsets based on a categorical variable.

ggplot(sub) + 
  geom_point(aes(distance, air_time, colour = as.factor(year))) + 
  facet_wrap(~ origin)

One can also split the plot using two variables using the function facet_grid:

ggplot(sub) +
  geom_point(aes(distance, air_time)) +
  facet_grid(as.factor(year) ~ origin)

Adding layers

The power of ggplot comes into play when one adds layers on top of other layers. Let’s for now look at only two examples.

Add a line to a scatterplot

ggplot(sub, aes(distance, air_time)) +
  geom_point() +
  geom_line()

Add a smoother to a scatterplot

p <- ggplot(sub, aes(distance, air_time))
p + geom_point() + geom_smooth()

## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

p + geom_point() + geom_smooth(method = "lm")

## `geom_smooth()` using formula 'y ~ x'


Statistical summary graphs

There are some useful built-in routines within the ggplot2 package which allow one to create simple summary plots of the raw data.

bar plot

One can create a bar graph for discrete data using geom_bar:

ggplot(sub, aes(dest)) + geom_bar()

The graph shows the number of observations we have for each destination. The original data is first transformed behind the scenes into a table of counts before being rendered.
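The table of counts that geom_bar computes can be inspected and compared with a manual count. This is a minimal sketch on a hypothetical six-row table (`toy`); `layer_data()` is the ggplot2 helper that returns the data a layer actually draws.

```r
library(ggplot2)
library(dplyr)

# Hypothetical destinations: ATL twice, AUS once, BOS three times
toy <- tibble(dest = c("ATL", "ATL", "AUS", "BOS", "BOS", "BOS"))

# Manual table of counts
manual <- toy %>% count(dest)

# The same counts as computed by the stat behind geom_bar()
p_bar <- ggplot(toy, aes(dest)) + geom_bar()
built <- layer_data(p_bar)

manual$n
built$count
```

If the counts have already been computed, `geom_col()` (or `geom_bar(stat = "identity")`) draws them as given instead of counting again.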

histograms

For continuous data one uses the geom_histogram function (left: default number of bins; right: binwidth specified as 50 minutes):

p <- ggplot(sub, aes(air_time))
p + geom_histogram()

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

p + geom_histogram(binwidth = 50)

One can add another variable (left) or better use facet (right):

p + geom_histogram(aes(fill = origin))

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

p + geom_histogram() + facet_wrap(~ origin, ncol = 1)

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Frequency polygons

Alternatives to histograms for continuous data are frequency polygons:

p + geom_freqpoly(lwd = 1)

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

p + geom_freqpoly(aes(colour = origin), lwd = 1)

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Box-plots

Boxplots, which are more condensed summaries of the data than histograms, are called using geom_boxplot. Here two versions of the same graph are shown: the one on the left is the default, but on the right we have reordered the dest variable on the x-axis such that air_time increases from left to right (reorder() ranks the groups by their mean unless another function is supplied):

ggplot(sub, aes(dest, air_time)) + geom_boxplot()
p <- ggplot(sub, aes(reorder(dest, air_time), air_time)) + geom_boxplot()
p
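Because `reorder()` uses the mean by default, the ordering can differ from an ordering by median when a group contains outliers. A minimal sketch on hypothetical data (`toy`), using only base R:

```r
# Hypothetical data: group A has one large outlier
toy <- data.frame(
  dest     = c("A", "A", "A", "B", "B", "B"),
  air_time = c(10, 10, 100, 30, 30, 30)
)

# Default: ordered by mean (A = 40, B = 30), so B comes first
by_mean <- levels(reorder(toy$dest, toy$air_time))

# With FUN = median (A = 10, B = 30), A comes first
by_median <- levels(reorder(toy$dest, toy$air_time, FUN = median))

by_mean
by_median
```

To order the boxes by the line drawn inside them, one could therefore write `reorder(dest, air_time, FUN = median)` inside `aes()`.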

It is sometimes useful to plot the “raw” data over summary plots. Using geom_point as an overlay is sometimes not very useful when points overlap too much; geom_jitter can sometimes be more useful:

p + geom_point(colour = "red", alpha = 0.5, size = 1)
p + geom_jitter(colour = "red", alpha = 0.5, size = 1)

Exercise: read the help on geom_violin and write code that shows the same data as a violin plot.

Other statistical summaries

Using stat_summary one can call specific summary statistics. Here are examples of four plots. Going from top-left to bottom-right we have:

Raw data with the median air_time at each distance (red) superimposed
A pointrange plot showing the mean and the range
A pointrange plot showing the mean and the standard error
A pointrange plot showing the mean and its bootstrap confidence limits

sub$distance <- round(sub$distance)
p <- ggplot(sub, aes(distance, air_time))
p + geom_point(alpha = 0.25) + stat_summary(fun = "median", geom = "point", colour = "red")

# fun, fun.min and fun.max replace the deprecated fun.y, fun.ymin and fun.ymax (ggplot2 3.3.0 and later)
p + stat_summary(fun = "mean", fun.min = "min", fun.max = "max", geom = "pointrange")

p + stat_summary(fun.data = "mean_se")

## Warning: Removed 36 rows containing missing values (geom_segment).

p + stat_summary(fun.data = "mean_cl_boot")  # requires the Hmisc package to be installed

## Warning: Removed 36 rows containing missing values (geom_segment).
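What `fun.data = "mean_se"` computes for each x value can be reproduced by hand. A minimal sketch, assuming a hypothetical vector `x` of four air times observed at one distance:

```r
library(ggplot2)

# Hypothetical air times observed at a single distance
x <- c(100, 120, 140, 160)

# Mean plus and minus one standard error, computed manually
se <- sd(x) / sqrt(length(x))
manual <- c(y = mean(x), ymin = mean(x) - se, ymax = mean(x) + se)

# The helper that stat_summary(fun.data = "mean_se") calls for each group
helper <- mean_se(x)

manual
helper
```

A distance with a single observation has no standard error, so the range for that group cannot be drawn; this is one way warnings about removed rows, like the ones above, can arise.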
