Databases

Functions, dbplyr and working with SQL in R

Introduction

Many “big data” problems are actually “small data” problems in disguise. That is, we only really need a subset of the data, or we want to aggregate the data into some smaller dataset. For example, we might want to access Census data, but only for a handful of municipalities. Or we might want to analyse climate data collected from a large number of weather stations, but aggregated up to the national or monthly level. In such cases, the underlying bottleneck is interacting with the original data, which is too big to fit into memory. How do we store data of this magnitude and then access it effectively? The answer is: through a database.

Databases can exist either locally or remotely, as well as in-memory or on-disk. Regardless of where a database is located, the key point is that information is stored in a way that allows for very quick extraction and/or aggregation. More often than not, you will need to extract several subsets and harmonize or transform them. To facilitate and automate this task, you will need to write your own functions and know how to iterate them over the relevant subsets of the data. This week’s session thus ties in well with the sections on functions and iteration that we quickly touched upon during the last lab.


Joins and Databases 🔭

Although this week’s session is nominally about databases (and we will spend the majority of it on them), we believe that joins in R are a key skill in their own right. We will therefore split the session in two. First, we will cover joins as implemented by the tidyverse. In that section you will learn to:

  • join together different datasets
  • differentiate between types of joins

The second part of the session will deal with databases and SQL. Here you will learn to:

  • connect to remote databases with R
  • generate SQL queries in R with dbplyr
  • manipulate and transform data in a remote database
  • collect hosted data and store it locally

Joins with dplyr

One of the mainstays of the dplyr package is merging data with its family of join operations:

  • dplyr::inner_join(df1, df2)
  • dplyr::left_join(df1, df2)
  • dplyr::right_join(df1, df2)
  • dplyr::full_join(df1, df2)
  • dplyr::semi_join(df1, df2)
  • dplyr::anti_join(df1, df2)

(You might find it helpful to see visual depictions of the different join operations here.)
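
Before we turn to real data, here is a toy sketch (with made-up data frames df1 and df2) of how mutating joins differ from filtering joins:

# toy data, just for illustration
df1 <- data.frame(id = 1:3, x = c("a", "b", "c"))
df2 <- data.frame(id = 2:4, y = c("B", "C", "D"))

dplyr::inner_join(df1, df2, by = "id") # only matching rows, columns from both
dplyr::left_join(df1, df2, by = "id")  # all rows of df1; unmatched rows get NA
dplyr::semi_join(df1, df2, by = "id")  # rows of df1 with a match; adds no columns
dplyr::anti_join(df1, df2, by = "id")  # rows of df1 without a match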


For the simple examples that we will see here, we’ll need some data sets that come bundled with the nycflights13 package.

  • Load it now and then inspect these data frames in your own console, e.g. as sketched below.
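
One minimal way to do so:

library(nycflights13)
dplyr::glimpse(flights) # one row per flight departing NYC in 2013
dplyr::glimpse(planes)  # one row per plane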

Let’s perform a left join on the flights and planes datasets. Note: I’m going to subset columns after the join, but only to keep the printed output compact.

dplyr::left_join(flights, planes) |>
  dplyr::select(year, month, day, dep_time, arr_time, carrier, flight, tailnum, type, model) |>
  head(3) ## Just to save vertical space in output
## # A tibble: 3 × 10
##    year month   day dep_time arr_time carrier flight tailnum type  model
##   <int> <int> <int>    <int>    <int> <chr>    <int> <chr>   <chr> <chr>
## 1  2013     1     1      517      830 UA        1545 N14228  <NA>  <NA> 
## 2  2013     1     1      533      850 UA        1714 N24211  <NA>  <NA> 
## 3  2013     1     1      542      923 AA        1141 N619AA  <NA>  <NA>

Note that dplyr made a reasonable guess about which columns to join on (i.e. columns that share the same name). It also told us its choices:

## Joining with `by = join_by(year, tailnum)`

However, there’s an obvious problem here: the variable “year” does not have a consistent meaning across the two datasets! In flights it refers to the year of the flight, while in planes it refers to the plane’s year of construction.

Luckily, there’s an easy way to avoid this problem (see ?dplyr::left_join).

You just need to be more explicit in your join call by using the by = argument. You can also rename any ambiguous columns to avoid confusion.

dplyr::left_join(
  flights,
  planes |> dplyr::rename(year_built = year), ## Not necessary w/ below line, but helpful
  by = "tailnum" ## Be specific about the joining column
) |>
  dplyr::select(year, month, day, dep_time, arr_time, carrier, flight, tailnum, year_built, type, model) |>
  head(3) 
## # A tibble: 3 × 11
##    year month   day dep_time arr_time carrier flight tailnum year_built type    
##   <int> <int> <int>    <int>    <int> <chr>    <int> <chr>        <int> <chr>   
## 1  2013     1     1      517      830 UA        1545 N14228        1999 Fixed w…
## 2  2013     1     1      533      850 UA        1714 N24211        1998 Fixed w…
## 3  2013     1     1      542      923 AA        1141 N619AA        1990 Fixed w…
## # ℹ 1 more variable: model <chr>

One last thing for now: note what happens if we again specify the join column, but don’t rename the ambiguous “year” column in either of the data frames.

dplyr::left_join(
  flights,
  planes, ## Not renaming "year" to "year_built" this time
  by = "tailnum"
) |>
  dplyr::select(dplyr::contains("year"), month, day, dep_time, arr_time, carrier, flight, tailnum, type, model) |>
  head(3)
## # A tibble: 3 × 11
##   year.x year.y month   day dep_time arr_time carrier flight tailnum type  model
##    <int>  <int> <int> <int>    <int>    <int> <chr>    <int> <chr>   <chr> <chr>
## 1   2013   1999     1     1      517      830 UA        1545 N14228  Fixe… 737-…
## 2   2013   1998     1     1      533      850 UA        1714 N24211  Fixe… 737-…
## 3   2013   1990     1     1      542      923 AA        1141 N619AA  Fixe… 757-…

Make sure you know what “year.x” and “year.y” are. Again, it pays to be specific.
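
As an aside, if you ever do want to keep both columns but with clearer labels, the join functions take a suffix argument. A small sketch (the suffix names here are our own choice):

dplyr::left_join(
  flights,
  planes,
  by = "tailnum",
  suffix = c("_flight", "_plane") # instead of the default c(".x", ".y")
) |>
  dplyr::select(year_flight, year_plane, tailnum) |>
  head(3)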


Exercise 1 🏋

You need to identify flights operated by planes that aren’t in the registration database. This could indicate:

  • data entry errors (wrong tail numbers)
  • foreign-registered aircraft
  • missing records that need to be added to the system

Your task:

  1. Find all flights where the plane’s tailnum doesn’t appear in the planes registry. How many flights were operated by unregistered aircraft?

  2. Which airline has the most flights with unregistered planes?

Hint: Think about which type of join keeps rows from the first table that DON’T have a match in the second table.


Working with databases 💾

So far, we have dealt with small datasets that easily fit into your computer’s memory. But what about datasets that are too large for our computers to handle as a whole?

In this case, storing the data outside of R and organizing it in a database is helpful. Connecting to the database allows you to retrieve only the parts needed for your current analysis.

Even better, many large datasets are already available in public or private databases. You can query them without having to download the data first.


Necessary packages

Accessing databases in R requires a few packages:

  • dbplyr is the database backend for dplyr. It lets you keep using the familiar dplyr syntax, but with remote data stored in databases.

  • DBI is a package that allows R to connect easily to a DBMS (DataBase Management System).

  • A driver package for the specific database backend, such as RSQLite for SQLite. Other options include:

    • RMariaDB::MariaDB() for MariaDB,
    • RPostgres::Postgres() for PostgreSQL,
    • odbc::odbc() for ODBC-compatible databases,
    • mongolite::mongo() for MongoDB,
    • and bigrquery::bigquery() for BigQuery.

Connecting to a database

To connect to a database, we will use DBI::dbConnect() from the DBI package, which defines a common interface between R and database management systems. The first argument is the database driver (in our case, SQLite) and the second argument is the name and location of the database.

Most existing databases don’t live in a file, but on a server. In addition to the driver, database connections will therefore also require details like user, password, host, port, etc. That means your code will often look more like the sketch below (shown with the MariaDB driver, since SQLite is file-based and has no notion of a host):

con <- DBI::dbConnect(RMariaDB::MariaDB(),  # driver for a server-based DBMS
  host = "database.rstudio.com",
  user = "your_username",
  password = rstudioapi::askForPassword("Your password")
)

For the purposes of this lab, however, we are connecting to an in-memory database. That way we can avoid potential issues with registering for database access, creating and caching credentials, and defining safe ports and other boring details.

To avoid all this hassle, we will create and host our own (small) database in local memory. Luckily, the code to do so looks just like the general case above, except that SQLite only needs a path to the database. (Here, ":memory:" is a special path that creates an in-memory database.)

We then save the database connection and store it in the object con for further use in exploring and querying data.

# set up connection with DBI and RSQLite
con <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")

Next, let us get a quick summary of the database connection using summary(). It shows “SQLiteConnection” as the class; we can ignore the other details for the time being. Great!

summary(con)
##           Length            Class             Mode 
##                1 SQLiteConnection               S4

If you were to connect to a real online database that someone else generated, you could now call DBI::dbListTables(con) to see a list of the tables it contains. Our local database, however, is still empty.

We need to populate our database. We will copy some data from nycflights13 to our database connection. In real life, this step would probably be taken care of by the responsible database maintainer.

# upload local data frame into remote data source; here: database
dplyr::copy_to(
  dest = con, 
  df = nycflights13::flights, 
  name = "flights"
)

Indexing

Unfortunately, it is not enough to just copy data to our database. For good performance, we also need to pass a list of indexes to the function. In this example, we set up indexes that will allow us to quickly process the data by time, carrier, plane, and destination. Creating the right indexes is key to good database performance. Again, in applications where we don’t set up the database ourselves, this will be taken care of by the database maintainer.

dplyr::copy_to(
  dest = con, 
  df = nycflights13::flights, 
  name = "flights",
  temporary = FALSE, 
  indexes = list(
    c("year", "month", "day"), 
    "carrier", 
    "tailnum",
    "dest"
  ),
  overwrite = TRUE # the table already exists (we created it above); without this, copy_to would refuse to replace it
)
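
If you want to verify which indexes were actually created, SQLite exposes them through a PRAGMA statement (this quick check is specific to SQLite):

DBI::dbGetQuery(con, "PRAGMA index_list('flights')") # one row per index on the table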

List Existing Tables in a Database

Now that we are connected to a database, let us list all the tables present in it using DBI::dbListTables().

DBI::dbListTables(con)
## [1] "flights"      "sqlite_stat1" "sqlite_stat1" "sqlite_stat4" "sqlite_stat4"

As you can see, there is only one table of ours for now (flights). The other entries are internal bookkeeping tables that SQLite creates, which you can safely ignore. Usually you would find many different tables in a relational database.


Queries

A query is a request for data or information from a database table or combination of tables. 📖

Reference Table

So how do you query a table in a database?

It is actually fairly straightforward: you use the dplyr::tbl() function, indicating the connection and the name of the table you want to interact with.

# generate reference table from the database
flights_db <- dplyr::tbl(con, "flights")
flights_db 
## # Source:   table<`flights`> [?? x 19]
## # Database: sqlite 3.50.4 [:memory:]
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
##  1  2013     1     1      517            515         2      830            819
##  2  2013     1     1      533            529         4      850            830
##  3  2013     1     1      542            540         2      923            850
##  4  2013     1     1      544            545        -1     1004           1022
##  5  2013     1     1      554            600        -6      812            837
##  6  2013     1     1      554            558        -4      740            728
##  7  2013     1     1      555            600        -5      913            854
##  8  2013     1     1      557            600        -3      709            723
##  9  2013     1     1      557            600        -3      838            846
## 10  2013     1     1      558            600        -2      753            745
## # ℹ more rows
## # ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
## #   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## #   hour <dbl>, minute <dbl>, time_hour <dbl>
## You will notice that there are some slight differences from our <tibble> or <data.frame> types (see "Source:" and "Database:")

The console output shows that this is a remote source: the table is not stored in our local R environment. Nor do you need to transfer the entire table to work with it; you can perform operations directly on the remote source. What is more, you can rely on the dplyr syntax to formulate your queries, and R will automatically translate them into SQL (more on that below).


Selecting Columns

You can select specific columns:

# perform various queries
flights_db |> dplyr::select(year:day, dep_delay, arr_delay)
## # Source:   SQL [?? x 5]
## # Database: sqlite 3.50.4 [:memory:]
##     year month   day dep_delay arr_delay
##    <int> <int> <int>     <dbl>     <dbl>
##  1  2013     1     1         2        11
##  2  2013     1     1         4        20
##  3  2013     1     1         2        33
##  4  2013     1     1        -1       -18
##  5  2013     1     1        -6       -25
##  6  2013     1     1        -4        12
##  7  2013     1     1        -5        19
##  8  2013     1     1        -3       -14
##  9  2013     1     1        -3        -8
## 10  2013     1     1        -2         8
## # ℹ more rows

Filtering by Rows

Access only specific rows:

flights_db |> dplyr::filter(dep_delay > 240)
## # Source:   SQL [?? x 19]
## # Database: sqlite 3.50.4 [:memory:]
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
##  1  2013     1     1      848           1835       853     1001           1950
##  2  2013     1     1     1815           1325       290     2120           1542
##  3  2013     1     1     1842           1422       260     1958           1535
##  4  2013     1     1     2115           1700       255     2330           1920
##  5  2013     1     1     2205           1720       285       46           2040
##  6  2013     1     1     2343           1724       379      314           1938
##  7  2013     1     2     1332            904       268     1616           1128
##  8  2013     1     2     1412            838       334     1710           1147
##  9  2013     1     2     1607           1030       337     2003           1355
## 10  2013     1     2     2131           1512       379     2340           1741
## # ℹ more rows
## # ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
## #   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## #   hour <dbl>, minute <dbl>, time_hour <dbl>

Summary Statistics

Or immediately generate summary statistics for different groups:

flights_db |> 
  dplyr::group_by(dest) |>
  dplyr::summarize(mean_dep_time = mean(dep_time))
## # Source:   SQL [?? x 2]
## # Database: sqlite 3.50.4 [:memory:]
##    dest  mean_dep_time
##    <chr>         <dbl>
##  1 ABQ           2006.
##  2 ACK           1033.
##  3 ALB           1627.
##  4 ANC           1635.
##  5 ATL           1293.
##  6 AUS           1521.
##  7 AVL           1175.
##  8 BDL           1490.
##  9 BGR           1690.
## 10 BHM           1944.
## # ℹ more rows
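
Note, by the way, that SQL aggregation functions always drop missing values, and dbplyr will warn you unless you make this explicit. A sketch that computes the actual average departure delay per destination, with na.rm = TRUE spelled out:

flights_db |> 
  dplyr::group_by(dest) |>
  dplyr::summarize(mean_dep_delay = mean(dep_delay, na.rm = TRUE)) # explicit na.rm silences the warning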

More advanced operations

You can even generate and plot figures without the need to store the data in your local environment:

library(ggplot2) # needed for plotting

flights_db |> 
  dplyr::filter(distance > 75) |>
  dplyr::group_by(origin, hour) |>
  dplyr::summarize(delay = mean(dep_delay, na.rm = TRUE)) |>
  ggplot(aes(hour, delay, color = origin)) + 
  geom_line() 


Joins

Databases become more exciting with more tables. So let’s add a couple more:

dplyr::copy_to(
  dest = con, 
  df = nycflights13::planes, 
  name = "planes", # we create the "planes" table
  temporary = FALSE, 
  indexes = "tailnum"
)

dplyr::copy_to(
  dest = con, 
  df = nycflights13::airlines, 
  name = "airlines", # we create the "airlines" table
  temporary = FALSE, 
  indexes = "carrier" 
)

dplyr::copy_to(
  dest = con, 
  df = nycflights13::weather, 
  name = "weather", # we create the "weather" table
  temporary = FALSE, 
  indexes = list(
    c("year", "month", "day", "hour", "origin")
  )
)

Exercise 2 🏋

Now it’s your turn! Let’s add the airports table to our database.

Your task:

  1. Inspect the nycflights13::airports dataset (use head(), glimpse(), or View()). Which column uniquely identifies each airport?

  2. Add the airports table to the database with the appropriate index.

Now let us call DBI::dbListTables() again on our “con” database connection. As you can see, there are several more tables now.

DBI::dbListTables(con)
## [1] "airlines"     "flights"      "planes"       "sqlite_stat1" "sqlite_stat1"
## [6] "sqlite_stat4" "sqlite_stat4" "weather"

Working with joins

The join syntax has its origin in SQL. Unsurprisingly, you can join tables without having to store the data in memory. Here is how you perform a left join:

planes_db <- dplyr::tbl(con, "planes")

dplyr::left_join(
  flights_db,
  planes_db |> dplyr::rename(year_built = year), # taking care of name ambiguity
  by = "tailnum" ## Important: Be specific about the joining column
) |>
  dplyr::select(year, month, day, dep_time, arr_time, carrier, flight, tailnum,
         year_built, type, model) 
## # Source:   SQL [?? x 11]
## # Database: sqlite 3.50.4 [:memory:]
##     year month   day dep_time arr_time carrier flight tailnum year_built type   
##    <int> <int> <int>    <int>    <int> <chr>    <int> <chr>        <int> <chr>  
##  1  2013     1     1      517      830 UA        1545 N14228        1999 Fixed …
##  2  2013     1     1      533      850 UA        1714 N24211        1998 Fixed …
##  3  2013     1     1      542      923 AA        1141 N619AA        1990 Fixed …
##  4  2013     1     1      544     1004 B6         725 N804JB        2012 Fixed …
##  5  2013     1     1      554      812 DL         461 N668DN        1991 Fixed …
##  6  2013     1     1      554      740 UA        1696 N39463        2012 Fixed …
##  7  2013     1     1      555      913 B6         507 N516JB        2000 Fixed …
##  8  2013     1     1      557      709 EV        5708 N829AS        1998 Fixed …
##  9  2013     1     1      557      838 B6          79 N593JB        2004 Fixed …
## 10  2013     1     1      558      753 AA         301 N3ALAA          NA <NA>   
## # ℹ more rows
## # ℹ 1 more variable: model <chr>

This should all feel very familiar right? 😁


A look under the hood

As you saw, you can conduct your analyses in a database much the same way you are used to doing it in R, all without the data having to be stored on your own device.

There are, however, some differences between ordinary data frames and remote database queries that are worth pointing out.

The most important among these is that your R code is translated into SQL and executed in the database on the remote server, not in R on your local machine.

This has the following implications. When working with databases, dplyr tries to be as lazy as possible 😪:

  • It never pulls data into R unless you explicitly ask for it.
  • It delays doing any work until the last possible moment: it collects together everything you want to do and then sends it to the database in one step.

This even applies when you assign the output of a database query to an object:

tailnum_delay_db <- flights_db |> 
  dplyr::group_by(tailnum) |>
  dplyr::summarize(
    delay = mean(arr_delay),
    n = n()
  ) |> 
  dplyr::arrange(desc(delay)) |>
  dplyr::filter(n > 100)

This leads to some unexpected behaviour:

Exhibit A: Because there’s generally no way to determine how many rows a query will return unless you actually run it, nrow() is always NA.

nrow(tailnum_delay_db)
## [1] NA

Exhibit B: Because you can’t find the last few rows without executing the whole query, you can’t use tail().

tail(tailnum_delay_db)
## Error in `tail()`:
## ! `tail()` is not supported on database backends.
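
By contrast, head() does work on a lazy table, because it can be translated into a LIMIT clause:

head(tailnum_delay_db, 3) # fine: becomes LIMIT 3 in the generated SQL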

Inspecting queries

We can always inspect the SQL code that dbplyr is generating in the background:

tailnum_delay_db |> dplyr::show_query()
## <SQL>
## SELECT `tailnum`, AVG(`arr_delay`) AS `delay`, COUNT(*) AS `n`
## FROM `flights`
## GROUP BY `tailnum`
## HAVING (COUNT(*) > 100.0)
## ORDER BY `delay` DESC

That’s probably not how you would write the SQL yourself, but it works.
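
If you only want to see how a single R expression translates, dbplyr::translate_sql() comes in handy (a quick sketch; the exact output may vary slightly across dbplyr versions):

dbplyr::translate_sql(mean(arr_delay, na.rm = TRUE), con = con)
## <SQL> AVG(`arr_delay`)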

More information about SQL translation can be found in vignette("translation-verb", package = "dbplyr").


From remote to local storage

If you then want to pull the data into a local data frame, use dplyr::collect():

tailnum_delay <- tailnum_delay_db |> dplyr::collect()
tailnum_delay
## # A tibble: 1,201 × 3
##    tailnum delay     n
##    <chr>   <dbl> <int>
##  1 N11119   30.3   148
##  2 N16919   29.9   251
##  3 N14998   27.9   230
##  4 N15910   27.6   280
##  5 N13123   26.0   121
##  6 N11192   25.9   154
##  7 N14950   25.3   219
##  8 N21130   25.0   126
##  9 N24128   24.9   129
## 10 N22971   24.7   230
## # ℹ 1,191 more rows
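
Once collected, the result is an ordinary tibble, so functions like nrow() and tail() behave as usual:

nrow(tailnum_delay)
## [1] 1201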

(Optional) Exercise 3 🏋

You want to find the full airline names (not just the carrier codes) for all flights on January 1st.

Your task:

  1. Join flights_db with the airlines table (create a reference to it first, e.g. airlines_db <- dplyr::tbl(con, "airlines")) and filter for flights on January 1st. Select the carrier code, airline name, and flight number.

  2. Collect the results into a local data frame.


Using SQL directly in R

If, for whatever reason, you want to write your SQL queries yourself, you can use DBI::dbGetQuery() to run raw SQL queries from R scripts:

sql_query <- "SELECT * FROM flights WHERE dep_delay > 240.0 LIMIT 5"
DBI::dbGetQuery(con, sql_query)
##   year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
## 1 2013     1   1      848           1835       853     1001           1950
## 2 2013     1   1     1815           1325       290     2120           1542
## 3 2013     1   1     1842           1422       260     1958           1535
## 4 2013     1   1     2115           1700       255     2330           1920
## 5 2013     1   1     2205           1720       285       46           2040
##   arr_delay carrier flight tailnum origin dest air_time distance hour minute
## 1       851      MQ   3944  N942MQ    JFK  BWI       41      184   18     35
## 2       338      EV   4417  N17185    EWR  OMA      213     1134   13     25
## 3       263      EV   4633  N18120    EWR  BTV       46      266   14     22
## 4       250      9E   3347  N924XJ    JFK  CVG      115      589   17      0
## 5       246      AA   1999  N5DNAA    EWR  MIA      146     1085   17     20
##    time_hour
## 1 1357081200
## 2 1357063200
## 3 1357066800
## 4 1357077600
## 5 1357077600
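
If parts of a query depend on user input, it is safer to pass values via the params argument rather than pasting them into the query string (a sketch using SQLite's ? placeholder):

DBI::dbGetQuery(
  con,
  "SELECT * FROM flights WHERE dep_delay > ? LIMIT 5",
  params = list(240)
)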

If you want to learn more about writing SQL with dbplyr, check out vignette('sql', package = 'dbplyr').


Disconnect from database

When you are done with your SQL queries, it is a good idea to disconnect from the database. This becomes increasingly important if you work with servers that charge you for their services!

DBI::dbDisconnect(con)
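
If you open connections inside your own functions (remember the intro about writing functions!), a defensive pattern is to register the disconnect with on.exit(), so it runs even if the query fails. The helper below, query_flights_n(), is a hypothetical example:

query_flights_n <- function(db_path) {
  con <- DBI::dbConnect(RSQLite::SQLite(), db_path)
  on.exit(DBI::dbDisconnect(con), add = TRUE) # always disconnect, even on error
  DBI::dbGetQuery(con, "SELECT COUNT(*) AS n FROM flights")
}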

Exercises on Databases

If you want to practice accessing databases going forward, have a look at the practice script here. It comes with a database file, so you can open the connection locally without needing to register anywhere, and it is accompanied by an answer sheet. You can thank our colleague Will Lowe for the excellent example!


(Advanced) BigQuery

If you are still curious about databases and SQL and wonder how you might scale all this up for a real project, you might want to look into Google’s BigQuery service. You can register for a free Google Cloud account here. Be aware that you only get a certain amount of free queries (1 TB of processed data per month) before you are charged. BigQuery is one of the most widely used services for interacting with online databases, and it offers a number of public datasets that you can easily practice with. Everything we saw above applies, with the exception that you need to specify the backend bigrquery::bigquery().

Here is an example of how it would look:

con <- DBI::dbConnect(
  bigrquery::bigquery(),
  project = "publicdata",
  dataset = "samples",
  billing = google_cloud_project_name # This will tell Google whom to charge
)
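
From there, everything works as before. Assuming the connection above succeeds (and your billing project is set up), you could query the bundled shakespeare sample table, which has one row per word and work:

shakespeare_db <- dplyr::tbl(con, "shakespeare")
shakespeare_db |>
  dplyr::count(corpus, wt = word_count, sort = TRUE) # total words per work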

Acknowledgements

The section on databases and SQL relies on the vignette from the dbplyr package, the RStudio tutorial on databases, as well as the Databases session in Grant McDermott’s Data Science for Economists.

This script was drafted by Tom Arendt and Lisa Oswald, with contributions by Steve Kerr, Hiba Ahmad, Carmen Garro, Sebastian Ramirez-Ruiz, Killian Conyngham and Carol Sobral.