Automation, Scheduling and Debugging
Welcome to Week 3! 👋
This week’s tutorial will be divided into three parts:
- More iteration practice
- Automation with R scripts and scheduling with `cronR` and `taskscheduleR`
- Debugging with `traceback()`, `browser()`, `try()` and `tryCatch()`
Before we start: GitHub Fork 🍴
Now that we have seen some of the basics of Git and GitHub, let’s expand on this knowledge to learn how to fork this repository to your own GitHub account. A fork is a copy of a repository that allows you to freely experiment with changes without affecting the original project. In the case of these labs, this allows you to add your own notes and work on the exercises without overwriting the original repository.
To do this you can follow these steps:
- Go to the labs GitHub page
- Click on the “Fork” button in the top right corner of the page.
- This will create a copy of the repository in your own GitHub account.
- You can then clone this forked repository to your local machine using the “Code” button and copying the URL just like we did in the previous session with our own repositories.
Whenever the original repository is updated, an option to sync fork will appear on your forked repository page on GitHub. Clicking this will update your fork with the latest changes we make to the labs. You can then pull those changes into your local cloned version of the repository.
Note: This is by far the easiest way of doing things, but if you want to do everything properly through the command line, you would set the original lab repository as an upstream remote to fetch and merge changes from the original repository into your local clone. A full walkthrough of this can be found here.
1. More Iteration with purrr 🔁
We saw some examples of iteration with `purrr::map()` in the last session. Now that we have also covered it in the lectures, let’s do a few more examples to get more comfortable with this approach.
Quick Recap
Remember the `map()` family syntax:
- `purrr::map()` returns a list.
- `purrr::map_lgl()` returns a logical vector.
- `purrr::map_int()` returns an integer vector.
- `purrr::map_dbl()` returns a double vector.
- `purrr::map_chr()` returns a character vector.
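As a quick illustration of the different return types (a minimal sketch, not tied to the exercises below):

```r
library(purrr)

x <- list(1, 4, 9)
map(x, sqrt)        # a list containing 1, 2, 3
map_dbl(x, sqrt)    # a double vector: 1 2 3
map_chr(x, ~ paste0("root: ", sqrt(.x)))  # a character vector
```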
Introduction to Nesting
Sometimes we want to perform the same analysis on different groups in our data. For example, we might want to calculate summary statistics for each species of penguin, or fit a model for each country in a dataset.
Nesting is a powerful technique that helps us organize grouped data. When we nest data:
- We group our data by some variable(s)
- All the remaining data for each group gets “packed” into a special list-column
- Each row now represents one group, with all that group’s data stored in the list-column
# Before nesting: regular dataframe
penguins
#> species bill_length_mm bill_depth_mm ...
#> Adelie 39.1 18.7 ...
#> Adelie 39.5 17.4 ...
#> Chinstrap 46.5 17.9 ...
# After nesting by species: each species gets one row
penguins %>%
  group_by(species) %>%
  nest()
#> species data
#> Adelie <tibble [152 × 7]>
#> Chinstrap <tibble [68 × 7]>
#> Gentoo <tibble [124 × 7]>
The magic happens when we combine nesting with `purrr::map()`: we can apply the same function to each group’s data, then unnest the results to get a clean summary table.
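Here is a minimal sketch of that pattern using the built-in mtcars data (deliberately not the penguin data, so the exercises below stay yours to solve):

```r
library(dplyr)
library(tidyr)
library(purrr)

mtcars %>%
  group_by(cyl) %>%
  nest() %>%                                            # one row per cylinder group
  mutate(stats = map(data, ~ tibble(mean_mpg = mean(.x$mpg)))) %>%
  select(cyl, stats) %>%
  unnest(stats)                                         # clean summary table
```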
Exercise: Penguin Bill Analysis
Let’s look at our penguin dataset again as an example.
Exercise 1 Group the penguins by species and nest the remaining data into a column called `data`. Save the result in a new dataframe called `penguins_nested`:
Exercise 2 Write a function called `summarize_bills` that takes a dataframe and returns:
- `count`: number of rows
- `mean_bill_length`: mean bill length (handle NAs)
- `max_bill_length`: max bill length (handle NAs)
Exercise 3 Add a new column called `bill_stats` to `penguins_nested` that applies your `summarize_bills` function to each nested dataframe using `map()`:
Exercise 4 Create a clean summary table by selecting just `species` and `bill_stats`, then unnesting:
Exercise 5 Write the complete pipeline from the original penguins data to the final summary in one go:
2. Automation of Projects 🤖
When working on a data science project, you will often have to repeat the same steps multiple times. For example, you might have to:
- import data
- clean data
- run analyses
- create visualizations
- generate reports
- etc.
To avoid doing these steps manually every time, you can automate them using pipelines. We already saw pipelines in the context of the `dplyr` package using the pipe operator (`%>%` or `|>`); now we will look at how to automate entire workflows of R scripts using the `source()` function.
Folder structure and thinking in pipelines ⚙️
When working on a data science project, it is important to keep your files organized. An example folder structure based on the example you saw in the lecture might look as follows:
lotr_project/
├── data/
│ ├── raw/
│ └── processed/
├── scripts/
│   ├── Makefile.R
│   ├── 00-packages.R
│   ├── 01-download-data.R
│   ├── 02-process-data.R
│   └── 03-plot.R
├── outputs/
│ ├── figures/
│ └── tables/
└── README.md
In this example, the `data/` folder contains raw and processed data, the `scripts/` folder contains R scripts that perform specific tasks, and the `outputs/` folder contains figures and tables generated by the scripts. The `Makefile.R` script is the main script that sources the other scripts in the correct order.
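A minimal sketch of what such a `Makefile.R` could contain (the actual file in the lab repository may differ; the relative paths assume your working directory is the project root):

```r
# Makefile.R: run the whole pipeline in the correct order
source("scripts/00-packages.R")       # load packages
source("scripts/01-download-data.R")  # download raw data into data/raw/
source("scripts/02-process-data.R")   # write cleaned data to data/processed/
source("scripts/03-plot.R")           # save figures to outputs/figures/
```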
Let’s try it ourselves
To work with this example I have created all the files and folders as above in this week’s lab within the lotr_project folder. Try running the `Makefile.R` script to see how it works. You can open the different scripts and run the makefile line by line to see what each part does.
Exercise 1 Can you add a new script that creates a summary table of the words spoken by Species and Film and saves it to the `outputs/tables/` folder? You can use the `dplyr` package to create the summary table and the `write_tsv()` function to save it as a .tsv file. Don’t forget to source your new script in the `Makefile.R` script.
Exercise 2 Update the `Makefile.R` script so that it also clears out the table you created in the exercise above.
Exercise 3 Do your new script and the 03-plot.R script need to be run sequentially? How could we think about these two scripts in terms of the pipeline graphs you saw in the lecture?
Scheduling with `cronR` and `taskscheduleR` 📆
Now that we have a working pipeline, let’s learn how to schedule it to run automatically!
Task scheduling allows you to:
- Run your pipeline at specific times (daily, weekly, monthly)
- Process new data automatically
- Generate reports on a schedule
- Perform maintenance tasks
Aside: Relative vs Absolute Paths 🗺
When working with file paths in R, you can use either relative or absolute paths. So far we have used relative paths, as this makes it easier to share code with others. When we schedule tasks, however, the working directory might not be what we expect, so it is often safer to use absolute paths. A useful package for this is `here`, which allows you to construct file paths relative to the root of your project. You will see some examples of this in the code below. (Note that for this reason the lotr project as we have set it up above will not work if you just run the `Makefile.R` script, as it uses relative paths. You would have to amend the script to use absolute paths, use the `here` package, or set your working directory to the lotr_project folder first. Below it is included just for demonstration.)
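For instance, a short sketch of how `here` builds absolute paths (the file name below is hypothetical):

```r
library(here)

here()                            # absolute path to the project root
here("data", "raw", "words.csv")  # absolute path assembled from the root
```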
Scheduling with `cronR` on Mac and Linux ⏰
# install.packages("cronR")
library(cronR)
# define the path to your Master R script
cmd <- cron_rscript(here::here("session-03-automation","lotr_project","scripts","Makefile.R"))
# add a new cron job to run the script every day at 1am
cron_add(
  command = cmd,
  frequency = "daily",
  at = "01:00",
  id = "lotr_pipeline_daily",
  description = "Daily LOTR analysis pipeline"
)
# List all scheduled tasks
cron_ls()
# To remove a scheduled task
# cron_rm(id = "lotr_pipeline_daily")
Example cron schedule expressions
# Every 5 minutes
cron_add(command = cron_rscript("script.R"), frequency = "*/5 * * * *")
# Every hour
cron_add(command = cron_rscript("script.R"), frequency = "hourly")
# Every day at 3:30 PM
cron_add(command = cron_rscript("script.R"), frequency = "daily", at = "15:30")
# Every Monday at 9 AM
cron_add(command = cron_rscript("script.R"), frequency = "0 9 * * 1")
# First day of every month
cron_add(command = cron_rscript("script.R"), frequency = "0 9 1 * *")
Scheduling with `taskscheduleR` on Windows ⏰
# install.packages("taskscheduleR")
library(taskscheduleR)
# define the path to your Master R script
makefile_path <- here::here("session-03-automation","lotr_project","scripts","Makefile.R")
# Schedule our LOTR pipeline to run daily at 8 AM
taskscheduler_create(
  taskname = "lotr_pipeline_daily",
  rscript = makefile_path,
  schedule = "DAILY",
  starttime = "08:00"
)
# List all scheduled tasks
taskscheduler_ls()
# To remove a scheduled task
# taskscheduler_delete(taskname = "lotr_pipeline_daily")
Example `taskscheduleR` schedule expressions
# Every 5 minutes
taskscheduler_create(taskname = "lotr_pipeline_5min", rscript = "path/to/your/Makefile.R", schedule = "MINUTE", modifier = 5)
# Every hour
taskscheduler_create(taskname = "lotr_pipeline_hourly", rscript = "path/to/your/Makefile.R", schedule = "HOURLY")
# Every day at 3:30 PM
taskscheduler_create(taskname = "lotr_pipeline_daily_1530", rscript = "path/to/your/Makefile.R", schedule = "DAILY", starttime = "15:30")
# Every Monday at 9 AM
taskscheduler_create(taskname = "lotr_pipeline_weekly", rscript = "path/to/your/Makefile.R", schedule = "WEEKLY", starttime = "09:00", days = "MON")
# First day of every month
taskscheduler_create(taskname = "lotr_pipeline_monthly", rscript = "path/to/your/Makefile.R", schedule = "MONTHLY", starttime = "09:00", days = 1)
3. Debugging 🐞
When you write code, things will inevitably go wrong at some point. You can professionalize the way you:
- fix unanticipated problems (debugging)
- let functions communicate problems and take actions based on those communications (condition handling)
- learn how to avoid common problems before they occur (defensive programming)
Potential problems are communicated via “conditions” (e.g., errors, warnings, and messages). For example:
- fatal errors are raised by `stop()` and force all execution to terminate
- warnings are generated by `warning()` and display potential problems
- messages are generated by `message()` and can provide informative output along the way
Debugging workflow
- Realize that you have a bug
- Make the bug repeatable: start with a big chunk of code and narrow it down to isolate it
- Figure out where it is
- Fix it and test it
Debugging tools 🔦
The following function calculates the geodesic distance between two points specified by latitude/longitude (in degrees) using the Haversine formula (hf); it is taken from here as an example. We’ve inserted some bugs… 🐛
geod_dist <- function(lat1, lon1, lat2, lon2, earth.radius = 6371) {
# from degrees to radians
deg2rad <- function(deg) return(deg*pi/180)
lon1 <- deg2rad(lon1)
lat1 <- deg2rad(lat1)
lon2 <- deg2rad(long2)
lat2 <- deg2rad(lat2)
# calculation
delta.long <- (lon2 - lon1)
delta.lat <- (lat2 - lat1)
a <- sin(delta.lat/2)^2 + cos(lat1) * cos(lat2) * sing(delta.long/2)^2
c <- 2 * asin(min(1,sqrt(a)))
d = earth_radius * c
return(d)
}
geod_dist(lat1 = 49.5, lon1 = 8.4, lat2 = 52.5, lon2 = 13.4)
Trial and error 🎓
That is, if you see the error right away, try to fix it. 🤓 A slightly more principled approach is to use a binary search strategy: run only half of the code and see if the error persists. If yes, the bug is in that half; if not, it is in the other half. Repeat until you find the bug.
Making function global 🌎
One way to approach debugging is to turn the arguments of the function into global objects (objects you can see in your environment, otherwise the objects are only available within your function). Then you can step through the code line by line to locate the bug.
# make the objects that are otherwise entered as input parameters to your function global
lat1 <- 49.5; lon1 <- 8.4; lat2 <- 52.5; lon2 <- 13.4
# now, execute line by line
deg2rad <- function(deg) return(deg*pi/180)
lon1 <- deg2rad(lon1)
lat1 <- deg2rad(lat1)
lon2 <- deg2rad(long2)
lat2 <- deg2rad(lat2)
delta.long <- (lon2 - lon1)
delta.lat <- (lat2 - lat1)
a <- sin(delta.lat/2)^2 + cos(lat1) * cos(lat2) * sing(delta.long/2)^2
c <- 2 * asin(min(1,sqrt(a)))
d = earth_radius * c
return(d)
Problem: This creates global objects that match argument names, which can become confusing and cause problems that become obvious when the function is called in a different environment. ⚠️
In case you choose this option, it is a good idea to clean your environment afterwards, or simply to remove all the new global objects using `rm()`.
Side note If you haven’t done so already, as a general best-practice advice, change the settings in your global options to “never” save the workspace, as this can cause similar issues to the example described above.
Using `traceback()` 👈
This shows you where the error occurred (but not why). Read from bottom to top (e.g. 1. called function X, 2. called function Y, error occurred in line #6 of function Y)
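A quick sketch of what this looks like (hypothetical functions; your output may differ slightly):

```r
f <- function(x) g(x)
g <- function(x) log(x)

f("a")
#> Error in log(x) : non-numeric argument to mathematical function

traceback()
#> 2: g(x) at #1
#> 1: f("a")
```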
Using `browser()` 🦊
You can add `browser()` into your function somewhere before you expect the error.
Note: While `browser()` works best within a clean .R script, it can also help with troubleshooting within .Rmd files.
geod_dist <- function(lat1, lon1, lat2, lon2, earth.radius = 6371) {
# from degrees to radians
browser()
deg2rad <- function(deg) return(deg*pi/180)
lon1 <- deg2rad(lon1)
lat1 <- deg2rad(lat1)
lon2 <- deg2rad(lon2)
lat2 <- deg2rad(lat2)
# calculation
delta.long <- (lon2 - lon1)
delta.lat <- (lat2 - lat1)
a <- sin(delta.lat/2)^2 + cos(lat1) * cos(lat2) * sin(delta.long/2)^2
c <- 2 * asin(min(1,sqrt(a)))
d = earth_radius * c
return(d)
}
geod_dist(lat1 = 49.5, lon1 = 8.4, lat2 = 52.5, lon2 = 13.4)
You can then interactively work through the function line by line by hitting enter in the console or send additional lines of code.
Note: Other helpful tools for debugging R functions written by someone else include `debug()`, which automatically opens a debugger at the start of a function call, and `trace()`, which allows temporary code modifications inside functions that you can’t easily access (e.g. `ggplot()`).
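A minimal sketch of the `debug()` workflow, using the function from above:

```r
debug(geod_dist)    # flag the function: the next call opens the debugger
geod_dist(lat1 = 49.5, lon1 = 8.4, lat2 = 52.5, lon2 = 13.4)
undebug(geod_dist)  # remove the flag again
```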
Condition handling 🐜🐝🐜
Sometimes errors are expected, and you want to handle them automatically, e.g.:
- model fails to converge
- file download fails
- stack processing of lists
Useful functions to deal with such cases: `try()` and `tryCatch()`.
Using `try()` 🤷
Ignore error:
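For example, wrapping the failing call in `try()` lets execution continue (a sketch; the function name is hypothetical):

```r
f1 <- function(x) {
  try(log(x))  # the error is printed but does not stop the function
  10
}
f1("a")
```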
## Error in log(x) : non-numeric argument to mathematical function
## [1] 10
Suppress error message:
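For example, with `silent = TRUE` (again a sketch):

```r
f2 <- function(x) {
  try(log(x), silent = TRUE)  # the error message is suppressed entirely
  10
}
f2("a")
```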
## [1] 10
Pass a block of code to `try()`:
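For example (a sketch):

```r
try({
  a <- 1
  b <- "x"
  a + b  # fails: numeric + character
})
```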
## Error in a + b : non-numeric argument to binary operator
Capture the output of `try()`:
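For example (a sketch):

```r
success <- try(1 + 2)
failure <- try("a" + "b")  # prints the error, but still returns an object
class(success)
class(failure)
```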
## Error in "a" + "b" : non-numeric argument to binary operator
## [1] "numeric"
## [1] "try-error"
Use `possibly()` from the purrr package, a function similar to `try()`, when applying a function to multiple elements in a list. You can also provide a default value (here: `NA`) in case execution fails.
library(purrr)
elements <- list(1, 2, 3, "f")
results_a <- purrr::map(elements, log)                      # errors on the last element
results_b <- purrr::map(elements, purrr::possibly(log, NA)) # returns NA for "f" instead
Condition handling with `tryCatch()` 🎣
React to conditions, such as errors, warnings, messages, or interruptions, with certain actions called “handlers”. These are functions that map conditions to handler functions, which can do anything but typically will return a value or create a more informative error message:
show_condition <- function(code) {
  tryCatch(code,
    error = function(c) "error",
    warning = function(c) "warning",
    message = function(c) "message"
  )
}
show_condition(stop("!"))
## [1] "error"
show_condition(warning("?!"))
## [1] "warning"
show_condition(message("?"))
## [1] "message"
If no condition is captured, `tryCatch()` returns the value of the input:
show_condition(15)
## [1] 15
A more real-life example of how to use `tryCatch()` is this one:
model_selection <- function(data, formula1, formula2) {
  tryCatch(lm(formula1, data), error = function(e) lm(formula2, data))
}
You try to fit `formula1` to the data; however, maybe it is a model with very strict requirements. You might also have a more robust (but maybe less interesting) `formula2` that you might fit if the requirements are not met and the modeling process throws an error.
Make more informative error messages:
read.csv_new <- function(file, ...) {
  tryCatch(read.csv(file, ...), error = function(c) {
    c$message <- paste0(c$message, " (in file: ", file, ")")
    stop(c)
  })
}
read.csv("code/dummy.csv")
read.csv_new("code/dummy.csv")
Defensive programming 🛡
“making code fail in a well-defined manner”
“fail fast”: as soon as something wrong is discovered, signal an error
Three rules to implement the “fail fast” principle:
- be strict about what you accept (e.g., only scalars)
- avoid functions that use non-standard evaluation, such as `subset()`, `transform()`, or `with()`
- avoid functions that return different types of output depending on their input (e.g. `sapply()`)
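As an illustration of the first rule, a minimal sketch of input checking with `stopifnot()` (the function and checks are purely illustrative):

```r
mean_rate <- function(x) {
  # fail fast: insist on a non-empty numeric vector before doing any work
  stopifnot(is.numeric(x), length(x) > 0)
  mean(x)
}

mean_rate(c(1, 2, 3))  # works: returns 2
# mean_rate("a")       # fails immediately with an informative error
```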
Debugging Exercise 💀
Here is a piece of code that comes with a few flaws. As an optional take-home exercise, please identify the bugs, remove them and report what you have done using comments.
# load packages
library(tidyverse)
library(LegislatoR)
# get political data on German legislators
political_df <-
left_join(x = filter(get_political(legislature = "ger"), as.numeric("session") == 18),
y = get_core(legislature = "ger"), by = "pageid")
# wiki traffic data
traffic_df <-
get_traffic(legislature = "ger") |>
filter(date >= "2013-10-22" & date <= "2017-10-24") |>
group_by(pageid) |>
summarize(traffic_mean = mean(traffic, na.rm = TRUE),
traffic_max = max(traffic, na.rm = TRUE))
# sessions served
sessions_served_df <-
get_political(legislature = "deu") %%
group_by(pageid) |>
dplyr::summarize(sessions_served = n())
# merge
legislator_df <-
left_join(political_df, sessions_served_df, by = "pageid") |>
left_join(traffic_df, by = "pageid")
# compute age
get_age <- function(birth, date_at) {
date_at_fmt <- date_at
birth_fmt <- birth
diff <- difftime(lubridate::ymd(birth_fmt), lubridate::ymd(date_at_fmt))
diff_years <- time_length(diff, "years")
diff_years
}
legislator_df$age_in_years <- round(get_age(legislator_df$birth, "2017-10-24"), 0)
# plot top 10 pageviews
legislator_df <- arrange(legislator_df, desc(traffic_mean))
legislator_df$rank <- 1:nrow(legislator_df)
legislator_df_table <- dplyr::select(rank, name, traffic_mean, traffic_max)
names(legislator_df_table) <- c("Rank", "Representative", "Mean", "Maximum")
legislator_df_table <- head(legislator_df_table, 10)
ggplot(legislator_df_table, aes(y = Mean, x = -Rank)) +
xlab("Rank") + ylab("Avg. daily page views") +
labs(title = "Top 10 representatives by average daily page views") +
geom_bar(stats = "identity") + # change to geom_col
scale_x_continuous(breaks = -nrow(legislator_df_table):-1, labels = rev(1:nrow(legislator_df_table))) # add +
geom_text(aes(y = 10, label = Representative), hjust = 0, color = "white", size = 2) +
coord_flip() +
theme_minimal()
# run model of page views as a function of sessions served, party, sex, and age in years
legislator_df$traffic_log <- log(legislator_df$traffic_mean)
covars <- c("sessions_served", "party", "sex", "age_in_years")
fmla <- paste("traffic_log", paste(covars, collapse = " - "), sep = " ~ ")
summary(log_traffic_model <- lm(fmla, legislator_df))
# plot table
sjPlot::tab_model(log_traffic_model)
Acknowledgements
This script was drafted by Tom Arendt and Lisa Oswald, with contributions by Steve Kerr, Hiba Ahmad, Carmen Garro, Sebastian Ramirez-Ruiz, Killian Conyngham and Carol Sobral. It draws heavily on the materials for the Introduction to R workshop within the Data Science Seminar series at Hertie, created by Therese Anders, and on Statistical Modeling and Causal Inference by Sebastian Ramirez-Ruiz and Lisa Oswald.
The Lord of the Rings dialogue analysis example is based on the GitHub materials for STAT545 at UBC. You can find their version here.