1 Introduction

According to a 2012 report of the Caribbean Regional Fisheries Mechanism (CRFM), Microsoft (MS) Access databases are commonly used in the region, accounting for 38% of databases used by Fisheries Departments for storing and managing their fishery data¹. This brief exercise intends to show one method using R to connect to an MS Access database. The hope this that this will provide a useful means of automating your workflow with your existing MS Access data and producing richer reports more quickly, thanks to the tools present within the R ecosystem.

Note that MS Access is only available on Windows machines. Therefore, this exercise will only be applicable for users of those machines.

The exercise was done using R version 4.3.1² within the RStudio Integrated Development Environment (IDE), version 2023.06.0³.

2 RODBC

An R package called RODBC⁴, or R-Open Database Connectivity, will be used for this exercise. Note that there may be other packages which may achieve in similar or different results. But in this case I intend to show you some of the considerations for accessing your database via R, using RODBC.

ODBC is an Application Programming Interface (API) for accessing DBMS, such as MS Access. It is also Operating System (OS) agnostic, meaning it can be used on virtually any operating system. RODBC enables low-level access to “talk” directly to the ODBC interface of a database connection. Additionally, RODBC allows for higher level access to a database, making it possible to read, save, copy, and manipulate data within the database using the Structured Query Language (SQL).

3 The RODBC Workflow

Using the RODBC package is fairly simple. Here are the basic steps:

Install the RODBC package (this is only necessary once, if the RODBC package is not already installed)
Load the RODBC package/library
Connect to your MS Access database
View the names of the tables and queries in your database
Pull data from your tables or queries into R
Proceed with your data preparation, analysis and visualization
Close the database connection when you are done

These steps will be explored in the exercise.

3.1 Step 1: Install the RODBC package

It is important to stress that this step is only necessary once, and this is if you do not already have an installation of RODBC on your working machine. Installation can be done via the code below:

install.packages(“RODBC”)

3.2 Step 2: Load the RODBC package/library

Here we will load RODBC along with the other packages we will use in this exercise.

library(RODBC) # Loads the RODBC package into the R evironment
library(tidyverse) # For plots, reshaping data and summaries. Takes the place of ggplot2, dplyr and tidyr
library(knitr) # For neat tables
library(here) # Used for finding files within your project. See more information at https://github.com/jennybc/here_here/blob/master/README.md and https://www.tidyverse.org/articles/2017/12/workflow-vs-script
library(janitor) # For some some data exploration, cleaning and other useful functionality

Remember that you can always use “??” and the name of the package to get some detailed guidance on an package. For example:

??RODBC

3.3 Step 3: Connect to your MS Access database

A dummy MS Access database was created to help illustrate the key goal of this exercise, connecting to an MS Access database using R.

For ease of access, the database is available in my Google Drive folder https://drive.google.com/file/d/1jua_yEVh1TXsmteFsrqsJi74eAZEnHH7/view?usp=drive_link. The database can be downloaded and saved to your machine. In this exercise, the database is saved to the “input” folder within my working directory for my project.

A few things need to be in mind before we proceed further:

You do not need to have MS Access installed on your machine in order to connect to the database. However, you do need to have the Microsoft Access Database Engine installed. This can be found here: https://www.microsoft.com/en-us/download/details.aspx?id=54920
Reinstalling the Microsoft Access Database Engine can help address certain issues in connecting to your MS Access database.
If you do have MS Access already installed, the Microsoft Access Database Engine will also be installed by default.
If you have the 32-bit version of MS Access installed, you will need to run the 32-bit version of R in order to connect to your database. If you have 64-bit MS Access, you must use 64-bit R. This is the case also for the version of the Microsoft Access Database Engine you have installed. Note that the 64-bit version of the Microsoft Access Database Engine is the one with “x64” in the file name (see screenshot below).

Downloading the Microsoft Access Database Engine

Now that those considerations are sorted, we can go ahead and connect to the MS Access database. We will be naming our database connection “dummy_db”. The function odbcConnectAccess2007()⁵ will be used to achieve this, using a single argument, stating where our database is located.

dummy_db <- odbcConnectAccess2007(access.file = here("input", "dummy_database.accdb"))

If everything worked as intended, congratulations! You’ve now connected to the MS Access database! In the Environment tab you will see the following:

Now connected to dummy_db

3.4 Step 4: View the names of the tables and queries in your database

Now we can begin to explore our database. The function sqlTables() will be used to list the items within our database. These include “system tables” and user tables (called simply “tables”). The only argument used within sqlTables() is the database connection name.

db_items <- sqlTables(dummy_db)

We will ignore the system tables during this exercise. The items of importance to us are the user tables, listed from row 13 to 19.

The user tables are briefly explained below:

Table name	Description
boats	A registry of fishing boats
catch	Catch record for the fishing fleet
contacts	A registry of fisherfolk
LUboatType	A look-up table listing boat types within the fishing fleet
LUcommunities	A look-up table of fishing communities
LUfish	A look-up table of fish species
LUsex	A look-up table of sexes

3.5 Step 5: Pull data from your tables or queries into R

We now have an idea of what data exists within our database. In this step we will read in data from our user tables. This will be done using sqlQuery(). The arguments for sqlQuery are the database connection name and the SQL code⁶, enclosed within quotes.

3.5.1 Boats

boats <- sqlQuery(dummy_db, 
                  "SELECT * 
                  FROM boats")

So what does this code do? The SQL code asks our database to give us all the columns from the “boats” table. “SELECT *” means to select all columns. “FROM boats”, says from what table or querry to select the columns. That’s it.

The “boats” table is shown below:

ID	BoatID	BoatType	BoatOwner
1	1266	3	1
2	2223	4	2
3	1122	1	4
4	1090	2	3
5	5643	2	5
6	1267	4	1

Perhaps one of the first things you notice with this data is that the “BoatType” and “BoatOwner” columns are coded. This is a common practice with databases. So in order for us to make sense of the data in this table we will need to include data from the appropriate look-up tables, “LUboatType” and “contacts”. Let’s load these in now.

3.5.2 Boat types

boat_types <- sqlQuery(dummy_db, 
                  "SELECT * 
                  FROM LUboatType")

ID	Type
1	Pirogue
2	Longliner
3	Keel
4	canoe

3.5.3 Contacts

contacts <- sqlQuery(dummy_db, 
                  "SELECT * 
                  FROM contacts")

ID	FirstName	LastName	Sex	DateOfBirth	Community
1	John	Brown	1	2000-01-07	1
2	Mary	Steel	2	1987-07-07	4
3	ralph	baron	3	1966-12-15	5
4	Harriette	Maison	2	1991-03-05	2
5	Billy	Jackman	1	19 July 2003	3

3.5.4 Boats (decoded)

Now we can decode the coded columns within the “boats” data, using the left_join() function:

boats_decoded <- boats %>% 
  left_join(boat_types, 
            by = c("BoatType" = "ID")) %>%  # The "boats" table is linked via the "BoatType" column to the "boat_types" table using the "ID" column in that table.
  left_join(contacts,
            by = c("BoatOwner" = "ID")) %>% # The "boats" table is linked via the "BoatOwner" column to the "contacts" table using the "ID" column in that table.
  select(-c(BoatType, BoatOwner)) # We remove the coded columns, as we no longer need them

Note that since we did not specify what columns from either table within our left_join() functions all non-linked columns from both tables were included in our final result.

ID	BoatID	Type	FirstName	LastName	Sex	DateOfBirth	Community
1	1266	Keel	John	Brown	1	2000-01-07	1
2	2223	canoe	Mary	Steel	2	1987-07-07	4
3	1122	Pirogue	Harriette	Maison	2	1991-03-05	2
4	1090	Longliner	ralph	baron	3	1966-12-15	5
5	5643	Longliner	Billy	Jackman	1	19 July 2003	3
6	1267	canoe	John	Brown	1	2000-01-07	1

3.5.5 Contacts (decoded)

You may notice that the “contacts” data also has coded columns: “Sex” and “Community”. Great! Let’s decode these too! This time we will call the sqlQuery() function within our left_join() code. This is a bit more complex coding, but I encourage you to follow along as the lesson may prove useful to your future work.

contacts_decoded <- contacts %>% 
  left_join(sqlQuery(dummy_db, # We use sqlQuery(), within the left_join() to pull data from all columns of the "LUsex" look-up table
                  "SELECT * 
                  FROM LUsex"),
            by = c("Sex" = "ID")) %>%  # The "contacts" table is linked via the "Sex" column to the "LUsex" table, using the "ID" column 
  left_join(sqlQuery(dummy_db, # We use sqlQuery(), within the left_join() to pull data from all columns of the "LUcommunities" look-up table
                     "SELECT *
                     FROM LUcommunities"),
            by = c("Community" = "ID")) %>% # The "contacts" table is linked via the "Community" column to the "LUcommunities" table, using the "ID" column 
  select(-c(Sex, Community)) %>% # We remove the coded columns, as we no longer need them
  rename(Sex = Sex.y) # Lastly, we rename the "Sex.y" column to "Sex". This occurred because there were two columns with the same name, i.e. "Sex", so the second column with that name was appended with ".y" at the end.

ID	FirstName	LastName	DateOfBirth	Sex	CommunityName	Parish	Country
1	John	Brown	2000-01-07	Male	Belvedere	St. Thomas	Jumanji
2	Mary	Steel	1987-07-07	Female	Scott’s peak	St. David	Jumanji
3	ralph	baron	1966-12-15	Unspecified	Cornwall	St John	Jumanji
4	Harriette	Maison	1991-03-05	Female	Mahaut	St. Paul	Jumanji
5	Billy	Jackman	19 July 2003	Male	Hastings	St. paul	Jumanji

3.5.6 Boats (decoded and revised)

Time to revise our decoded boats data:

boats_decoded <- boats %>% 
  left_join(boat_types, 
            by = c("BoatType" = "ID")) %>%  
  left_join(contacts_decoded, # This time we use the recently created "contacts_decoded" data instead. All other code remains the same. 
            by = c("BoatOwner" = "ID")) %>% 
  select(-c(BoatType, BoatOwner))

ID	BoatID	Type	FirstName	LastName	DateOfBirth	Sex	CommunityName	Parish	Country
1	1266	Keel	John	Brown	2000-01-07	Male	Belvedere	St. Thomas	Jumanji
2	2223	canoe	Mary	Steel	1987-07-07	Female	Scott’s peak	St. David	Jumanji
3	1122	Pirogue	Harriette	Maison	1991-03-05	Female	Mahaut	St. Paul	Jumanji
4	1090	Longliner	ralph	baron	1966-12-15	Unspecified	Cornwall	St John	Jumanji
5	5643	Longliner	Billy	Jackman	19 July 2003	Male	Hastings	St. paul	Jumanji
6	1267	canoe	John	Brown	2000-01-07	Male	Belvedere	St. Thomas	Jumanji

3.5.7 Catch (decoded)

The final data we will pull from our database is the “catch” data. Note that we will be decoding as we go along from within the sqlQuery() function, using SQL code. Again, this is a bit more complex coding, however, much can be learned of the possibilities available to you in using sqlQuery(). It is important to remember that there are many ways to achieve the same results in R.

catch_decoded <- sqlQuery(dummy_db,
                  "SELECT 
                  a.DateLanded AS date_landed, 
                  b.BoatID AS boat_ID,
                  c.CommunityName AS port_landed,
                  d.FishSpp AS fish_caught,
                  a.WeightLanded AS weight_lbs
                  
                  FROM 
                  catch a,
                  boats b,
                  LUcommunities c,
                  LUfish d
                  
                  WHERE
                  a.BoatID = b.ID
                  AND
                  a.PortLanded = c.ID
                  AND
                  a.FishCaught = d.ID")

I will attempt to explain the code above. We are using five functions within SQL: “SELECT” and “FROM”, which we have seen and used previously, and lastly “WHERE”, “AND” and “AS”, which are new.

Let’s begin with “SELECT”: We select five columns from four tables. These columns are “DateLanded”, “BoatID”, “CommunityName”, “FishSpp” and “WeightLanded”. As you may have noticed from those names they do not all belong to the same table within our database. You may have also noticed that letters are attached to the beginning of the names of these columns. These correspond to the table where these columns are found. So “a.DateLanded”, for example, is found in a separate table from “d.FishSpp”.

This leads us to “FROM”: Here we list the tables from which our columns will be derived and assign them a letter. So “catch a” means that we are referring to the “catch” table within our database and assigning it to letter “a”, so we can reference its columns. So “a.DateLanded” means that the column is found within the “catch” table.

How does “WHERE” and “AND” fit in?: These act similarly to the “by =” in left_join(). It says how the four tables should be linked and or merged. “WHERE” gives the conditions by which the four tables are joined, then “AND” continues. So “a.BoatID” in the “catch” table is linked to the “boats” table by the “b.ID” column and “a.PortLanded” in the “catch” table is linked to the “LUcommunities” table by the “c.ID” column and lastly “a.FishCaught” in the “catch” table is linked to the “LUfish” table by the “d.ID” column. These are the columns in the “catch” data we would like to decode.

One last thing: “AS”. Within the “SELECT” portion of the code you may have seen this “a.DateLanded AS date_landed”. This means we are selecting “DateLanded” within the “catch” table but we want to rename that column to “date_landed”. This is done for each selected column.

The final data is shown below:

date_landed	boat_ID	port_landed	fish_caught	weight_lbs
2023-06-01	2223	Mahaut	Dolphinfish	100
2023-06-03	1266	Belvedere	Yellowfin tuna	98
2023-06-09	2223	Mahaut	Dolphinfish	10
2023-06-02	1090	Hastings	Ballyhoo	420
2023-06-14	1090	Cornwall	Red Snapper	34
2023-06-23	5643	Scott’s peak	Red Snapper	56
2023-06-20	2223	Mahaut	Yellowfin tuna	77
2023-06-30	5643	Scott’s peak	No catch	5

3.6 Step 6: Proceed with your data preparation, analysis and visualization

We now have all of our tables pulled into R and decoded them as well. We can begin cleaning and data analysis. The data from the database is not perfect by any means. I noticed varying date formats, lower case letters used for the beginning of names, an unspecified sex and a “no catch” with 5 lbs weight. Perhaps you also observed these issues. I’ll leave these to you to resolve, for the most part.

catch_cleaned <- catch_decoded %>% 
  mutate(weight_lbs = if_else(fish_caught == "No catch", 0, weight_lbs)) %>%  # If fish_catch is "No catch", then set weight_lbs to zero, otherwise use the weight_lbs value corresponding to the row.
  mutate(across(port_landed, str_to_title)) # Fix the lower case letters in the port_landed names.

We can also do a quick summary and see what the data look like when plotted:

# Prep data ----
# Catch weight landed summarized by port and species
catch_port_spp <- catch_cleaned %>% 
  group_by(port_landed, fish_caught) %>% 
  summarise(weight_lbs = sum(weight_lbs, na.rm = TRUE)) %>% 
  filter(fish_caught != "No catch")

# Plot ----
ggplot(catch_port_spp, aes(x = fish_caught, y = weight_lbs, fill = fish_caught)) +
  geom_bar(stat = "identity") +
  geom_text(aes(label = round(weight_lbs, 0)), 
            position = position_stack(),
            vjust = "middle", hjust = "top",
            size = 3,
            angle = 0,
            check_overlap = TRUE) +
  facet_wrap(.~ port_landed) +
  coord_flip() +
  theme_bw() +
  theme(axis.text.x  = element_text(angle=0, vjust=0.5), 
        panel.grid.major.y = element_line(linetype=1, color="dark gray"),
        legend.position="none") +
  labs(x = "", 
       y = "Catch (pounds)",
       fill = "")

Total catch by species for each port

3.7 Step 7. Close the database connection

We are done with working with our MS Access database and the data contained therein. To close the connection we simply use this code:

odbcClose(dummy_db)

That’s it! You are done. Of course, you can still continue working with the data imported into R after the database connection is closed. So step 7 may indeed come before step 6.

Masters, June. 2012. “Overview of the Status of Performance of CARIFIS in CRFM Member States, and Options for the Way Forward.” CRFM Technical & Advisory Document Number 2012 / 4. Belize and Saint Vincent and the Grenadines↩︎
https://cran.r-project.org/↩︎
https://posit.co/products/open-source/rstudio/↩︎
RODBC documentation is found here: https://www.rdocumentation.org/packages/RODBC/versions/0.8-3/topics/RODBC ↩︎
Note that this particular function works with the type of MS Access database we are using for this exercise, an ACCDB file (.accdb). Other variations are available for other database types.↩︎
If you are completely unfamiliar with SQL, you can learn here: https://www.w3schools.com/sql/. This exercise does not delve deep into SQL code, so there is no need to worry about having to learn a lot of code before proceeding further. Also, any code used will be explained.↩︎

Using R to connect to an MS Access database

Derrick Theophille

2023-07-08