Databases
Functions, dbplyr and working with SQL in R
Introduction
Many “big data” problems are actually “small data problems in disguise”. That is, we only really need a subset of the data, or maybe we want to aggregate the data up to some coarser level. For example, we might want to access Census data, but only for a handful of municipalities. Or, we might want to analyse climate data collected from a large number of weather stations, but aggregated up to the national or monthly level. In such cases, the underlying bottleneck is interacting with the original data, which is too big to fit into memory. How do we store data of this magnitude and then access it effectively? The answer is through a database.
Databases can exist either locally or remotely, as well as in-memory or on-disk. Regardless of where a database is located, the key point is that information is stored in a way that allows for very quick extraction and/or aggregation. More often than not, you will probably need to extract several subsets and harmonize or transform them. To facilitate and automate this task, you will need to write your own functions and know how to iterate them over the relevant subsets of data. This week’s session thus ties in well with the sections on functions and iteration that we quickly touched upon during the last lab.
Joins and Databases 🔭
Although this week’s session is nominally about databases - and we will spend the majority of our time on them - we believe that joins in R are a key skill in their own right. Therefore, we will split the session in two. First, we will cover joins as implemented by the tidyverse. In that section you will learn to:
- join together different datasets
- differentiate between types of joins
The second part of the session will deal with databases and SQL. Here you will learn to:
- connect to remote databases with R
- generate SQL queries in R with dbplyr
- manipulate and transform data in a remote database
- collect hosted data and store it locally
Joins with dplyr
One of the mainstays of the dplyr package is merging data with its family of join operations:
dplyr::inner_join(df1, df2) ## keep only rows that have a match in both tables
dplyr::left_join(df1, df2)  ## keep all rows of df1, add matching columns from df2
dplyr::right_join(df1, df2) ## keep all rows of df2, add matching columns from df1
dplyr::full_join(df1, df2)  ## keep all rows of both tables
dplyr::semi_join(df1, df2)  ## keep rows of df1 with a match in df2 (no new columns)
dplyr::anti_join(df1, df2)  ## keep rows of df1 with NO match in df2
(You might find it helpful to see visual depictions of the different join operations here.)
For the simple examples that we will see here, we’ll need some data sets that come bundled with the nycflights13 package.
- Load it now and then inspect these data frames in your own console.
Let’s perform a left join on the flights and planes datasets. - Note: I’m going to subset the columns after the join, but only to keep the text on the slide.
dplyr::left_join(flights, planes) |>
dplyr::select(year, month, day, dep_time, arr_time, carrier, flight, tailnum, type, model) |>
head(3) ## Just to save vertical space in output
## # A tibble: 3 × 10
## year month day dep_time arr_time carrier flight tailnum type model
## <int> <int> <int> <int> <int> <chr> <int> <chr> <chr> <chr>
## 1 2013 1 1 517 830 UA 1545 N14228 <NA> <NA>
## 2 2013 1 1 533 850 UA 1714 N24211 <NA> <NA>
## 3 2013 1 1 542 923 AA 1141 N619AA <NA> <NA>
Note that dplyr made a reasonable guess about which columns to join on (i.e. columns that share the same name). It also told us its choices:
## Joining, by = c("year", "tailnum")
However, there’s an obvious problem here: the variable “year” does not have a consistent meaning across our joining datasets! - In one it refers to the year of flight, in the other it refers to year of construction.
Luckily, there’s an easy way to avoid this problem. - Try ?dplyr::join.
You just need to be more explicit in your join call by using the by = argument. - You can also rename any ambiguous columns to avoid confusion.
dplyr::left_join(
flights,
planes |> dplyr::rename(year_built = year), ## Not necessary w/ below line, but helpful
by = "tailnum" ## Be specific about the joining column
) |>
dplyr::select(year, month, day, dep_time, arr_time, carrier, flight, tailnum, year_built, type, model) |>
head(3)
## # A tibble: 3 × 11
## year month day dep_time arr_time carrier flight tailnum year_built type
## <int> <int> <int> <int> <int> <chr> <int> <chr> <int> <chr>
## 1 2013 1 1 517 830 UA 1545 N14228 1999 Fixed w…
## 2 2013 1 1 533 850 UA 1714 N24211 1998 Fixed w…
## 3 2013 1 1 542 923 AA 1141 N619AA 1990 Fixed w…
## # ℹ 1 more variable: model <chr>
Last thing I’ll mention for now: note what happens if we again specify the join column… but don’t rename the ambiguous “year” column in either of the given data frames.
dplyr::left_join(
flights,
planes, ## Not renaming "year" to "year_built" this time
by = "tailnum"
) |>
dplyr::select(dplyr::contains("year"), month, day, dep_time, arr_time, carrier, flight, tailnum, type, model) |>
head(3)
## # A tibble: 3 × 11
## year.x year.y month day dep_time arr_time carrier flight tailnum type model
## <int> <int> <int> <int> <int> <int> <chr> <int> <chr> <chr> <chr>
## 1 2013 1999 1 1 517 830 UA 1545 N14228 Fixe… 737-…
## 2 2013 1998 1 1 533 850 UA 1714 N24211 Fixe… 737-…
## 3 2013 1990 1 1 542 923 AA 1141 N619AA Fixe… 757-…
Make sure you know what “year.x” and “year.y” are. Again, it pays to be specific.
Exercise 1 🏋
You need to identify flights operated by planes that aren’t in the registration database. This could indicate:
- Data entry errors (wrong tail numbers)
- Foreign-registered aircraft
- Missing records that need to be added to the system
Your task:
1. Find all flights where the plane’s tailnum doesn’t appear in the planes registry.
2. How many flights were operated by unregistered aircraft?
3. Which airline has the most flights with unregistered planes?
Hint: Think about which type of join keeps rows from the first table that DON’T have a match in the second table.
Working with databases 💾
So far, we have dealt with small datasets that easily fit into your computer’s memory. But what about data that are too large for our computers to handle as a whole?
In this case, storing the data outside of R and organizing it in a database is helpful. Connecting to the database allows you to retrieve only the parts needed for your current analysis.
Even better, many large datasets are already available in public or private databases. You can query them without having to download the data first.
Necessary packages
Accessing databases in R requires a few packages:
- dbplyr is the database backend for dplyr. It lets you use the familiar dplyr syntax, but with remote data stored in databases.
- DBI is a package that allows R to connect easily to a DBMS (DataBase Management System).
- Some package to interact with the back end of the specific remote database, such as RSQLite. Other options include: RMariaDB::MariaDB() for MariaDB, RPostgres::Postgres() for PostgreSQL, odbc::odbc() for ODBC, mongolite::mongo() for MongoDB, and bigrquery::bigquery() for BigQuery.
Connecting to a database
To connect to the database, we will use DBI::dbConnect()
from the DBI
package which defines a common interface
between R and database management systems. The first
argument is the database driver which in our
case is SQLite and the second argument is the
name and location of the database.
Most existing databases don’t live in a file, but instead live on a server. In addition to the two arguments above, database drivers will therefore also require details like user, password, host, port, etc. That means your code will often look more like this:
con <- DBI::dbConnect(RMariaDB::MariaDB(), # driver for a server-based DBMS (SQLite takes a file path instead)
  host = "database.rstudio.com",
  user = "your_username",
  password = rstudioapi::askForPassword("Your password")
)
For the purposes of this lab however, we are connecting to an in-memory database. That way we can avoid potential issues with the registration for access to a database, creation and caching of credentials, as well as defining safe ports and other boring details.
To avoid all this hassle, we will create and host our own (small) database in our local memory. Luckily, the code to do so is the same as in the general case above, except that SQLite only needs a path to the database. (Here, ":memory:" is a special path that creates an in-memory database.) We then store the database connection in the object con for further use in exploring and querying data.
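In code, that one-time setup looks like this:
con <- DBI::dbConnect(RSQLite::SQLite(), ":memory:") # driver + path; ":memory:" keeps the database in RAM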
Next, let us get a quick summary of the database connection using summary(). It shows “SQLiteConnection” under class; we can ignore the other details for the time being. Great!
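Let us try it:
summary(con) # quick check of the connection object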
## Length Class Mode
## 1 SQLiteConnection S4
If you were to connect to a real online database that someone else generated, you could now call DBI::dbListTables(con) to see a list of the tables present in the database. Our local database, however, is still devoid of content.
We need to populate our database. We will copy some data from nycflights13 to our database connection. In real life, this step would probably be taken care of by the responsible database maintainer.
# upload local data frame into remote data source; here: database
dplyr::copy_to(
dest = con,
df = nycflights13::flights,
name = "flights"
)
Indexing
Unfortunately, it is not enough to just copy data to our database. We also need to pass a list of indexes to the function. In this example, we set up indexes that will allow us to quickly process the data by time, carrier, plane, and destination. Creating the right indexes is key to good database performance. Again, in applications where we don’t set up the database ourselves, this will be taken care of by the database maintainer.
dplyr::copy_to(
dest = con,
df = nycflights13::flights,
name = "flights",
temporary = FALSE,
indexes = list(
c("year", "month", "day"),
"carrier",
"tailnum",
"dest"
),
overwrite = TRUE # the table already exists (we created it above); with FALSE, copy_to() would refuse to overwrite it
)
List Existing Tables in a Database
Now that we are connected to a database, let us list all the tables present in it using DBI::dbListTables():
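DBI::dbListTables(con)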
## [1] "flights" "sqlite_stat1" "sqlite_stat1" "sqlite_stat4" "sqlite_stat4"
As you can see, there is only one table for now (flights). The other objects that show up are internal SQLite infrastructure that you can safely ignore. Usually, you would find many different tables in a relational database.
Queries
A query is a request for data or information from a database table or combination of tables. 📖
Reference Table
So how do you query a table in a database?
It is actually fairly straightforward. You use the dplyr::tbl() function, where you indicate the connection and the name of the table you want to interact with:
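flights_db <- dplyr::tbl(con, "flights") # a reference to the remote table, reused throughout this session
flights_db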
## # Source: table<`flights`> [?? x 19]
## # Database: sqlite 3.50.4 [:memory:]
## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
## <int> <int> <int> <int> <int> <dbl> <int> <int>
## 1 2013 1 1 517 515 2 830 819
## 2 2013 1 1 533 529 4 850 830
## 3 2013 1 1 542 540 2 923 850
## 4 2013 1 1 544 545 -1 1004 1022
## 5 2013 1 1 554 600 -6 812 837
## 6 2013 1 1 554 558 -4 740 728
## 7 2013 1 1 555 600 -5 913 854
## 8 2013 1 1 557 600 -3 709 723
## 9 2013 1 1 557 600 -3 838 846
## 10 2013 1 1 558 600 -2 753 745
## # ℹ more rows
## # ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
## # tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## # hour <dbl>, minute <dbl>, time_hour <dbl>
## You will notice that there are some slight differences from our <tibble> or <data.frame> types (see "Source:" and "Database:")
The console output shows that this is a remote source; the table is not stored in our RStudio environment. Nor should you need to transfer the entire table to your RStudio environment: you can perform operations directly on the remote source. What is more, you can rely on the dplyr syntax to formulate your queries. R will automatically translate it into SQL (more on that below).
Selecting Columns
You can select specific columns:
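For example, a query along these lines produces the output below:
flights_db |>
  dplyr::select(year:day, dep_delay, arr_delay)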
## # Source: SQL [?? x 5]
## # Database: sqlite 3.50.4 [:memory:]
## year month day dep_delay arr_delay
## <int> <int> <int> <dbl> <dbl>
## 1 2013 1 1 2 11
## 2 2013 1 1 4 20
## 3 2013 1 1 2 33
## 4 2013 1 1 -1 -18
## 5 2013 1 1 -6 -25
## 6 2013 1 1 -4 12
## 7 2013 1 1 -5 19
## 8 2013 1 1 -3 -14
## 9 2013 1 1 -3 -8
## 10 2013 1 1 -2 8
## # ℹ more rows
Filtering by Rows
Access only specific rows:
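For example, to keep only flights with a departure delay above 240 minutes (a query along these lines matches the output below):
flights_db |>
  dplyr::filter(dep_delay > 240)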
## # Source: SQL [?? x 19]
## # Database: sqlite 3.50.4 [:memory:]
## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
## <int> <int> <int> <int> <int> <dbl> <int> <int>
## 1 2013 1 1 848 1835 853 1001 1950
## 2 2013 1 1 1815 1325 290 2120 1542
## 3 2013 1 1 1842 1422 260 1958 1535
## 4 2013 1 1 2115 1700 255 2330 1920
## 5 2013 1 1 2205 1720 285 46 2040
## 6 2013 1 1 2343 1724 379 314 1938
## 7 2013 1 2 1332 904 268 1616 1128
## 8 2013 1 2 1412 838 334 1710 1147
## 9 2013 1 2 1607 1030 337 2003 1355
## 10 2013 1 2 2131 1512 379 2340 1741
## # ℹ more rows
## # ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
## # tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## # hour <dbl>, minute <dbl>, time_hour <dbl>
Summary Statistics
Or immediately generate summary statistics for different groups:
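A query along the following lines produces the output below. (Note: as in the dbplyr vignette, the delay column here is actually the average dep_time per destination.)
flights_db |>
  dplyr::group_by(dest) |>
  dplyr::summarize(delay = mean(dep_time)) # averages dep_time, despite the column name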
## # Source: SQL [?? x 2]
## # Database: sqlite 3.50.4 [:memory:]
## dest delay
## <chr> <dbl>
## 1 ABQ 2006.
## 2 ACK 1033.
## 3 ALB 1627.
## 4 ANC 1635.
## 5 ATL 1293.
## 6 AUS 1521.
## 7 AVL 1175.
## 8 BDL 1490.
## 9 BGR 1690.
## 10 BHM 1944.
## # ℹ more rows
More advanced operations
You can even generate and plot figures without the need to store the data in your local environment:
library(ggplot2) # needed for ggplot() and geom_line()

flights_db |>
  dplyr::filter(distance > 75) |>
  dplyr::group_by(origin, hour) |>
  dplyr::summarize(delay = mean(dep_delay, na.rm = TRUE)) |>
  ggplot(aes(hour, delay, color = origin)) + # ggplot2 collects the aggregated data automatically
  geom_line()
Joins
Databases become more exciting with more tables. So let’s add a couple more:
dplyr::copy_to(
dest = con,
df = nycflights13::planes,
name = "planes", # we create the "planes" table
temporary = FALSE,
indexes = "tailnum"
)
dplyr::copy_to(
dest = con,
df = nycflights13::airlines,
name = "airlines", # we create the "airlines" table
temporary = FALSE,
indexes = "carrier"
)
dplyr::copy_to(
dest = con,
df = nycflights13::weather,
name = "weather", # we create the "weather" table
temporary = FALSE,
indexes = list(
c("year", "month", "day", "hour", "origin")
)
)
Exercise 2 🏋
Now it’s your turn! Let’s add the airports table to our database.
Your task:
1. Inspect the nycflights13::airports dataset (use head(), glimpse(), or View()). Which column uniquely identifies each airport?
2. Add the airports table to the database with the appropriate index.
Now let us call DBI::dbListTables() again on our "con" database connection. As you can see, there are several more tables now:
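DBI::dbListTables(con)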
## [1] "airlines" "flights" "planes" "sqlite_stat1" "sqlite_stat1"
## [6] "sqlite_stat4" "sqlite_stat4" "weather"
Working with joins
The join syntax has its origin in SQL. Unsurprisingly, you can join tables without having to store the data in memory. Here is how you perform a left join:
planes_db <- dplyr::tbl(con, 'planes')
dplyr::left_join(
flights_db,
planes_db |> dplyr::rename(year_built = year), # taking care of name ambiguity
by = "tailnum" ## Important: Be specific about the joining column
) |>
dplyr::select(year, month, day, dep_time, arr_time, carrier, flight, tailnum,
year_built, type, model)
## # Source: SQL [?? x 11]
## # Database: sqlite 3.50.4 [:memory:]
## year month day dep_time arr_time carrier flight tailnum year_built type
## <int> <int> <int> <int> <int> <chr> <int> <chr> <int> <chr>
## 1 2013 1 1 517 830 UA 1545 N14228 1999 Fixed …
## 2 2013 1 1 533 850 UA 1714 N24211 1998 Fixed …
## 3 2013 1 1 542 923 AA 1141 N619AA 1990 Fixed …
## 4 2013 1 1 544 1004 B6 725 N804JB 2012 Fixed …
## 5 2013 1 1 554 812 DL 461 N668DN 1991 Fixed …
## 6 2013 1 1 554 740 UA 1696 N39463 2012 Fixed …
## 7 2013 1 1 555 913 B6 507 N516JB 2000 Fixed …
## 8 2013 1 1 557 709 EV 5708 N829AS 1998 Fixed …
## 9 2013 1 1 557 838 B6 79 N593JB 2004 Fixed …
## 10 2013 1 1 558 753 AA 301 N3ALAA NA <NA>
## # ℹ more rows
## # ℹ 1 more variable: model <chr>
This should all feel very familiar, right? 😁
A look under the hood
As you saw, you can conduct your analyses in a database the same way you are used to doing it in R. All that without your data having to be stored on your own device.
There are, however, some differences between ordinary data frames and remote database queries that are worth pointing out.
The most important among these is that your R code is translated into SQL and executed in the database on the remote server, not in R on your local machine.
This has the following implications. When working with databases, dplyr tries to be as lazy as possible 😪:
- It never pulls data into R unless you explicitly ask for it.
- It delays doing any work until the last possible moment: it collects together everything you want to do and then sends it to the database in one step.
This even applies when you assign the output of a database query to an object:
tailnum_delay_db <- flights_db |>
dplyr::group_by(tailnum) |>
dplyr::summarize(
delay = mean(arr_delay),
n = n()
) |>
dplyr::arrange(desc(delay)) |>
dplyr::filter(n > 100)
This leads to some unexpected behaviour:
Exhibit A: Because there’s generally no way to determine how many rows a query will return unless you actually run it, nrow() is always NA:
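nrow(tailnum_delay_db)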
## [1] NA
Exhibit B: Because you can’t find the last few rows without executing the whole query, you can’t use tail():
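tail(tailnum_delay_db)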
## Error in `tail()`:
## ! `tail()` is not supported on database backends.
Inspecting queries
We can always inspect the SQL code that dbplyr is generating in the background:
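tailnum_delay_db |> dplyr::show_query()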
## <SQL>
## SELECT `tailnum`, AVG(`arr_delay`) AS `delay`, COUNT(*) AS `n`
## FROM `flights`
## GROUP BY `tailnum`
## HAVING (COUNT(*) > 100.0)
## ORDER BY `delay` DESC
That’s probably not how you would write the SQL yourself, but it works.
More information about SQL translation can be found here: vignette("translation-verb").
From remote to local storage
If you then want to pull the data into a local data frame, use dplyr::collect():
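For example, storing the result in a local object (the name tailnum_delay is just illustrative):
tailnum_delay <- tailnum_delay_db |> dplyr::collect() # this is when the query is actually executed
tailnum_delay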
## # A tibble: 1,201 × 3
## tailnum delay n
## <chr> <dbl> <int>
## 1 N11119 30.3 148
## 2 N16919 29.9 251
## 3 N14998 27.9 230
## 4 N15910 27.6 280
## 5 N13123 26.0 121
## 6 N11192 25.9 154
## 7 N14950 25.3 219
## 8 N21130 25.0 126
## 9 N24128 24.9 129
## 10 N22971 24.7 230
## # ℹ 1,191 more rows
(Optional) Exercise 3 🏋
You want to find the full airline names (not just the carrier codes) for all flights on January 1st.
Your task:
1. Join flights_db with airlines_db (you can create the latter with dplyr::tbl(con, "airlines")) and filter for flights on January 1st. Select the carrier code, airline name, and flight number.
2. Collect the results into a local data frame.
Using SQL directly in R
If, for whatever reason, you want to write your SQL queries yourself, you can use DBI::dbGetQuery() to run SQL queries in R scripts:
sql_query <- "SELECT * FROM flights WHERE dep_delay > 240.0 LIMIT 5"
DBI::dbGetQuery(con, sql_query)
## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
## 1 2013 1 1 848 1835 853 1001 1950
## 2 2013 1 1 1815 1325 290 2120 1542
## 3 2013 1 1 1842 1422 260 1958 1535
## 4 2013 1 1 2115 1700 255 2330 1920
## 5 2013 1 1 2205 1720 285 46 2040
## arr_delay carrier flight tailnum origin dest air_time distance hour minute
## 1 851 MQ 3944 N942MQ JFK BWI 41 184 18 35
## 2 338 EV 4417 N17185 EWR OMA 213 1134 13 25
## 3 263 EV 4633 N18120 EWR BTV 46 266 14 22
## 4 250 9E 3347 N924XJ JFK CVG 115 589 17 0
## 5 246 AA 1999 N5DNAA EWR MIA 146 1085 17 20
## time_hour
## 1 1357081200
## 2 1357063200
## 3 1357066800
## 4 1357077600
## 5 1357077600
If you want to learn more about writing SQL with dbplyr, check out vignette('sql', package = 'dbplyr').
Disconnect from database
When you are done with your SQL queries, it is a good idea to disconnect from the database. This becomes increasingly important if you work with servers that charge you for their services!
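For our in-memory database, that is a single call:
DBI::dbDisconnect(con)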
Exercises on Databases
If you want to practice accessing databases going forward, have a look at the practice script here. It comes with a database file so you can open the connection locally without the need to register. You will get a practice script, as well as an answer sheet. You can thank our colleague Will Lowe for the excellent example!
(Advanced) BigQuery
If you are still curious about databases and SQL and wonder how you might scale all this up for the purposes of a real project, you might be interested in Google’s BigQuery service. You can register for a free Google Cloud account here. Be aware that you only have a certain amount of free queries (1 TB / month) before you are charged. BigQuery is one of the most widely used services for interacting with online databases, and it has a number of public datasets that you can easily practice with. Everything we saw above applies, with the exception that you need to specify the backend bigrquery::bigquery().
Here is an example of how it would look:
con <- DBI::dbConnect(
bigrquery::bigquery(),
project = "publicdata",
dataset = "samples",
billing = google_cloud_project_name # This will tell Google whom to charge
)
Acknowledgements
The section on databases and SQL relies on the vignette from the dbplyr package, the RStudio tutorial on databases, as well as the Databases session in Grant McDermott’s Data Science for Economists.
This script was drafted by Tom Arendt and Lisa Oswald, with contributions by Steve Kerr, Hiba Ahmad, Carmen Garro, Sebastian Ramirez-Ruiz, Killian Conyngham and Carol Sobral.