Automation, Scheduling and Debugging
Welcome to Week 3! 👋
This week’s tutorial will be divided into three parts:
- More iteration practice
- Automation with R scripts and scheduling with `cronR` and `taskscheduleR`
- Debugging with `traceback()`, `browser()`, `try()` and `tryCatch()`
Before we start: GitHub Fork 🍴
Now that we have seen some of the basics of Git and GitHub, let’s expand on this knowledge to learn how to fork this repository to your own GitHub account. A fork is a copy of a repository that allows you to freely experiment with changes without affecting the original project. In the case of these labs, this allows you to add your own notes and work on the exercises without overwriting the original repository.
To do this you can follow these steps:
- Go to the labs GitHub page
- Click on the “Fork” button in the top right corner of the page.
- This will create a copy of the repository in your own GitHub account.
- You can then clone this forked repository to your local machine using the “Code” button and copying the URL just like we did in the previous session with our own repositories.
Whenever the original repository is updated, an option to sync fork will appear on your forked repository page on GitHub. Clicking this will update your fork with the latest changes we make to the labs. You can then pull those changes into your local cloned version of the repository.
Note: This is by far the easiest way of doing things, but if you want to do everything properly through the command line, you would set the original lab repository as an upstream remote to fetch and merge changes from the original repository into your local clone. A full walkthrough of this can be found here.
1. More Iteration with purrr 🔁
We saw some examples of iteration with `purrr::map()` in the last session. Now that we have also covered it in the lectures, let’s do a few more examples to get more comfortable with this approach.
Quick Recap
Remember the `map()` family syntax:
- `purrr::map()` returns a list.
- `purrr::map_lgl()` returns a logical vector.
- `purrr::map_int()` returns an integer vector.
- `purrr::map_dbl()` returns a double vector.
- `purrr::map_chr()` returns a character vector.
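As a quick illustration of the different return types (a minimal sketch, not tied to the exercises below):

```r
library(purrr)

x <- list(1, 4, 9)
map(x, sqrt)        # a list containing 1, 2, 3
map_dbl(x, sqrt)    # a double vector: 1 2 3
map_chr(x, ~ paste0("root: ", sqrt(.x)))  # a character vector
```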
Introduction to Nesting
Sometimes we want to perform the same analysis on different groups in our data. For example, we might want to calculate summary statistics for each species of penguin, or fit a model for each country in a dataset.
Nesting is a powerful technique that helps us organize grouped data. When we nest data:
- We group our data by some variable(s)
- All the remaining data for each group gets “packed” into a special list-column
- Each row now represents one group, with all that group’s data stored in the list-column
# Before nesting: regular dataframe
penguins
#> species bill_length_mm bill_depth_mm ...
#> Adelie 39.1 18.7 ...
#> Adelie 39.5 17.4 ...
#> Chinstrap 46.5 17.9 ...
# After nesting by species: each species gets one row
penguins %>%
  group_by(species) %>%
  nest()
#> species data
#> Adelie <tibble [152 × 7]>
#> Chinstrap <tibble [68 × 7]>
#> Gentoo <tibble [124 × 7]>
The magic happens when we combine nesting with `purrr::map()`: we can apply the same function to each group’s data, then unnest the results to get a clean summary table.
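Here is a minimal sketch of that pattern using the built-in mtcars data (deliberately not the penguin data, so the exercises below stay yours to solve):

```r
library(dplyr)
library(tidyr)
library(purrr)

mtcars %>%
  group_by(cyl) %>%
  nest() %>%                                            # one row per cylinder group
  mutate(stats = map(data, ~ tibble(mean_mpg = mean(.x$mpg)))) %>%
  select(cyl, stats) %>%
  unnest(stats)                                         # clean summary table
```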
Exercise: Penguin Bill Analysis
Let’s look at our penguin dataset again as an example.
Exercise 1 Group the penguins by species and nest the remaining data into a column called `data`. Save the result in a new dataframe called `penguins_nested`:
Exercise 2 Write a function called `summarize_bills` that takes a dataframe and returns:
- `count`: number of rows
- `mean_bill_length`: mean bill length (handle NAs)
- `max_bill_length`: max bill length (handle NAs)
Exercise 3 Add a new column called `bill_stats` to `penguins_nested` that applies your `summarize_bills` function to each nested dataframe using `map()`:
Exercise 4 Create a clean summary table by selecting just `species` and `bill_stats`, then unnesting:
Exercise 5 Write the complete pipeline from the original penguins data to the final summary in one go:
2. Automation of Projects 🤖
When working on a data science project, you will often have to repeat the same steps multiple times. For example, you might have to:
- import data
- clean data
- run analyses
- create visualizations
- generate reports
- etc.
To avoid doing these steps manually every time, you can automate them using pipelines. We already saw pipelines in the context of the `dplyr` package using the pipe operator (`%>%` or `|>`); now we will look at how to automate entire workflows of R scripts using the `source()` function.
Folder structure and thinking in pipelines ⚙️
When working on a data science project, it is important to keep your files organized. An example folder structure based on the example you saw in the lecture might look as follows:
lotr_project/
├── data/
│ ├── raw/
│ └── processed/
├── scripts/
│   ├── Makefile.R
│   ├── 00-packages.R
│   ├── 01-download-data.R
│   ├── 02-process-data.R
│   └── 03-plot.R
├── outputs/
│ ├── figures/
│ └── tables/
└── README.md
In this example, the `data/` folder contains raw and processed data, the `scripts/` folder contains R scripts that perform specific tasks, and the `outputs/` folder contains figures and tables generated by the scripts. The `Makefile.R` script is the main script that sources the other scripts in the correct order.
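A minimal sketch of what such a `Makefile.R` could contain (the actual file in the lab repository may differ; the relative paths assume your working directory is the project root):

```r
# Makefile.R: run the whole pipeline in the correct order
source("scripts/00-packages.R")       # load packages
source("scripts/01-download-data.R")  # download raw data into data/raw/
source("scripts/02-process-data.R")   # write cleaned data to data/processed/
source("scripts/03-plot.R")           # save figures to outputs/figures/
```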
Let’s try it ourselves
To work with this example I have created all the files and folders as above in this week’s lab within the lotr_project folder. Try running the `Makefile.R` script to see how it works. You can open the different scripts and run the makefile line by line to see what each part does.
Exercise 1 Can you add a new script that creates a summary table of the words spoken by Species and Film and saves it to the `outputs/tables/` folder? You can use the `dplyr` package to create the summary table and the `write_tsv()` function to save it as a .tsv file. Don’t forget to source your new script in the `Makefile.R` script.
Exercise 2 Update the `Makefile.R` script so that it also clears out the table you created in the exercise above.
Exercise 3 Do your new script and the 03-plot.R script need to be run sequentially? How could we think about these two scripts in terms of the pipeline graphs you saw in the lecture?
Scheduling with `cronR` and `taskscheduleR` 📆
Now that we have a working pipeline, let’s learn how to schedule it to run automatically!
Task scheduling allows you to:
- Run your pipeline at specific times (daily, weekly, monthly)
- Process new data automatically
- Generate reports on a schedule
- Perform maintenance tasks
Aside: Relative vs Absolute Paths 🗺
When working with file paths in R, you can use either relative or absolute paths. So far we have used relative paths, as this makes it easier to share code with others. When we schedule tasks, however, the working directory might not be what we expect, so it is often safer to use absolute paths. A useful package for this is `here`, which allows you to construct file paths relative to the root of your project. You will see some examples of this in the code below. (Note that for this reason the lotr project as we have set it up above will not work if you just run the `Makefile.R` script, as it uses relative paths. You would have to amend the script to use absolute paths, use the `here` package, or set your working directory to the lotr_project folder first. Below it is included just for demonstration.)
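For instance, a short sketch of how `here` builds absolute paths (the file name below is hypothetical):

```r
library(here)

here()                            # absolute path to the project root
here("data", "raw", "words.csv")  # absolute path assembled from the root
```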
Scheduling with `cronR` on Mac and Linux ⏰
# install.packages("cronR")
library(cronR)
# define the path to your Master R script
cmd <- cron_rscript(here::here("session-03-automation","lotr_project","scripts","Makefile.R"))
# add a new cron job to run the script every day at 1am
cron_add(
  command = cmd,
  frequency = "daily",
  at = "01:00",
  id = "lotr_pipeline_daily",
  description = "Daily LOTR analysis pipeline"
)
# List all scheduled tasks
cron_ls()
# To remove a scheduled task
# cron_rm(id = "lotr_pipeline_daily")
Example cron schedule expressions
# Every 5 minutes
cron_add(command = cron_rscript("script.R"), frequency = "*/5 * * * *")
# Every hour
cron_add(command = cron_rscript("script.R"), frequency = "hourly")
# Every day at 3:30 PM
cron_add(command = cron_rscript("script.R"), frequency = "daily", at = "15:30")
# Every Monday at 9 AM
cron_add(command = cron_rscript("script.R"), frequency = "0 9 * * 1")
# First day of every month
cron_add(command = cron_rscript("script.R"), frequency = "0 9 1 * *")
Scheduling with `taskscheduleR` on Windows ⏰
# install.packages("taskscheduleR")
library(taskscheduleR)
# define the path to your Master R script
makefile_path <- here::here("session-03-automation","lotr_project","scripts","Makefile.R")
# Schedule our LOTR pipeline to run daily at 8 AM
taskscheduler_create(
  taskname = "lotr_pipeline_daily",
  rscript = makefile_path,
  schedule = "DAILY",
  starttime = "08:00"
)
# List all scheduled tasks
taskscheduler_ls()
# To remove a scheduled task
# taskscheduler_delete(taskname = "lotr_pipeline_daily")
Example `taskscheduleR` schedule expressions
# Every 5 minutes
taskscheduler_create(taskname = "lotr_pipeline_5min", rscript = "path/to/your/Makefile.R", schedule = "MINUTE", modifier = 5)
# Every hour
taskscheduler_create(taskname = "lotr_pipeline_hourly", rscript = "path/to/your/Makefile.R", schedule = "HOURLY")
# Every day at 3:30 PM
taskscheduler_create(taskname = "lotr_pipeline_daily_1530", rscript = "path/to/your/Makefile.R", schedule = "DAILY", starttime = "15:30")
# Every Monday at 9 AM
taskscheduler_create(taskname = "lotr_pipeline_weekly", rscript = "path/to/your/Makefile.R", schedule = "WEEKLY", starttime = "09:00", days = "MON")
# First day of every month
taskscheduler_create(taskname = "lotr_pipeline_monthly", rscript = "path/to/your/Makefile.R", schedule = "MONTHLY", starttime = "09:00", days = 1)
3. Debugging 🐞
When you write code, things will inevitably go wrong at some point. You can professionalize the way you:
- fix unanticipated problems (debugging)
- let functions communicate problems and take actions based on those communications (condition handling)
- learn how to avoid common problems before they occur (defensive programming)
Potential problems are communicated via “conditions” (e.g., errors, warnings, and messages). For example:
- fatal errors are raised by `stop()` and force all execution to terminate
- warnings are generated by `warning()` and display potential problems
- messages are generated by `message()` and can provide informative output along the way
Debugging workflow
- Realize that you have a bug
- Make the bug repeatable: start with a big chunk of code and narrow it down to isolate it
- Figure out where it is
- Fix it and test it
Debugging tools 🔦
The following function calculates the geodesic distance between two points specified by latitude/longitude (in degrees) using the Haversine formula (hf); it is taken from here as an example. We’ve inserted some bugs… 🐛
geod_dist <- function(lat1, lon1, lat2, lon2, earth.radius = 6371) {
# from degrees to radians
deg2rad <- function(deg) return(deg*pi/180)
lon1 <- deg2rad(lon1)
lat1 <- deg2rad(lat1)
lon2 <- deg2rad(long2)
lat2 <- deg2rad(lat2)
# calculation
delta.long <- (lon2 - lon1)
delta.lat <- (lat2 - lat1)
a <- sin(delta.lat/2)^2 + cos(lat1) * cos(lat2) * sing(delta.long/2)^2
c <- 2 * asin(min(1,sqrt(a)))
d = earth_radius * c
return(d)
}
geod_dist(lat1 = 49.5, lon1 = 8.4, lat2 = 52.5, lon2 = 13.4)
Trial and error 🎓
That is, if you see the error right away, try to fix it. 🤓 A slightly more principled approach is to use a binary search strategy: run only half of the code and see if the error persists. If yes, the bug is in that half; if not, it is in the other half. Repeat until you find the bug.
Making function global 🌎
One way to approach debugging is to turn the arguments of the function into global objects (objects you can see in your environment, otherwise the objects are only available within your function). Then you can step through the code line by line to locate the bug.
# make the objects that are otherwise entered as input parameters to your function global
lat1 <- 49.5; lon1 <- 8.4; lat2 <- 52.5; lon2 <- 13.4
# now, execute line by line
deg2rad <- function(deg) return(deg*pi/180)
lon1 <- deg2rad(lon1)
lat1 <- deg2rad(lat1)
lon2 <- deg2rad(long2)
lat2 <- deg2rad(lat2)
delta.long <- (lon2 - lon1)
delta.lat <- (lat2 - lat1)
a <- sin(delta.lat/2)^2 + cos(lat1) * cos(lat2) * sing(delta.long/2)^2
c <- 2 * asin(min(1,sqrt(a)))
d = earth_radius * c
return(d)
Problem: This creates global objects that match argument names, which can become confusing and cause problems that become obvious when the function is called in a different environment. ⚠️
In case you choose this option, it is a good idea to clean your environment afterwards, or simply to remove all the new global objects using `rm()`.
Side note If you haven’t done so already, as a general best-practice advice, change the settings in your global options to “never” save the workspace, as this can cause similar issues to the example described above.
Using `traceback()` 👈
This shows you where the error occurred (but not why). Read from bottom to top (e.g. 1. called function X, 2. called function Y, error occurred in line #6 of function Y)
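A quick sketch of what this looks like (hypothetical functions; your output may differ slightly):

```r
f <- function(x) g(x)
g <- function(x) log(x)

f("a")
#> Error in log(x) : non-numeric argument to mathematical function

traceback()
#> 2: g(x) at #1
#> 1: f("a")
```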
Using `browser()` 🦊
You can add `browser()` into your function somewhere before you expect the error.
Note: While `browser()` works best within a clean .R script, it can also help with troubleshooting within .Rmd files.
geod_dist <- function(lat1, lon1, lat2, lon2, earth.radius = 6371) {
# from degrees to radians
browser()
deg2rad <- function(deg) return(deg*pi/180)
lon1 <- deg2rad(lon1)
lat1 <- deg2rad(lat1)
lon2 <- deg2rad(lon2)
lat2 <- deg2rad(lat2)
# calculation
delta.long <- (lon2 - lon1)
delta.lat <- (lat2 - lat1)
a <- sin(delta.lat/2)^2 + cos(lat1) * cos(lat2) * sin(delta.long/2)^2
c <- 2 * asin(min(1,sqrt(a)))
d = earth_radius * c
return(d)
}
geod_dist(lat1 = 49.5, lon1 = 8.4, lat2 = 52.5, lon2 = 13.4)
You can then interactively work through the function line by line by hitting enter in the console or send additional lines of code.
Note: Other helpful tools for debugging R functions written by someone else include `debug()`, which automatically opens a debugger at the start of a function call, and `trace()`, which allows temporary code modifications inside functions that you can’t easily access (e.g. `ggplot()`).
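A minimal sketch of the `debug()` workflow, using the function from above:

```r
debug(geod_dist)    # flag the function: the next call opens the debugger
geod_dist(lat1 = 49.5, lon1 = 8.4, lat2 = 52.5, lon2 = 13.4)
undebug(geod_dist)  # remove the flag again
```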
Condition handling 🐜🐝🐜
Sometimes errors are expected, and you want to handle them automatically, e.g.:
- model fails to converge
- file download fails
- stack processing of lists
Useful functions to deal with such cases: `try()` and `tryCatch()`.
Using `try()` 🤷
Ignore error:
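For example, wrapping the failing call in `try()` lets execution continue (a sketch; the function name is hypothetical):

```r
f1 <- function(x) {
  try(log(x))  # the error is printed but does not stop the function
  10
}
f1("a")
```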
## Error in log(x) : non-numeric argument to mathematical function
## [1] 10
Suppress error message:
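For example, with `silent = TRUE` (again a sketch):

```r
f2 <- function(x) {
  try(log(x), silent = TRUE)  # the error message is suppressed entirely
  10
}
f2("a")
```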
## [1] 10
Pass a block of code to `try()`:
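For example (a sketch):

```r
try({
  a <- 1
  b <- "x"
  a + b  # fails: numeric + character
})
```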
## Error in a + b : non-numeric argument to binary operator
Capture the output of `try()`:
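For example (a sketch):

```r
success <- try(1 + 2)
failure <- try("a" + "b")  # prints the error, but still returns an object
class(success)
class(failure)
```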
## Error in "a" + "b" : non-numeric argument to binary operator
## [1] "numeric"
## [1] "try-error"
Use `possibly()` from the purrr package, a function similar to `try()`, when applying a function to multiple elements in a list. You can also provide a default value (here: `NA`) in case execution fails.
library(purrr)
elements <- list(1, 2, 3, "f")
results_a <- purrr::map(elements, log)                      # errors on the last element
results_b <- purrr::map(elements, purrr::possibly(log, NA)) # returns NA for "f" instead
Condition handling with `tryCatch()` 🎣
React to conditions, such as errors, warnings, messages, or interruptions, with certain actions called “handlers”. These are functions that map conditions to handler functions, which can do anything but typically will return a value or create a more informative error message:
show_condition <- function(code) {
  tryCatch(code,
    error = function(c) "error",
    warning = function(c) "warning",
    message = function(c) "message"
  )
}
show_condition(stop("!"))
## [1] "error"
show_condition(warning("?!"))
## [1] "warning"
show_condition(message("?"))
## [1] "message"
If no condition is captured, `tryCatch()` returns the value of the input:
show_condition(15)
## [1] 15
A more real-life example of how to use `tryCatch()` is this one:
model_selection <- function(data, formula1, formula2) {
  tryCatch(lm(formula1, data), error = function(e) lm(formula2, data))
}
You try to fit `formula1` to the data; however, maybe it is a model with very strict requirements. You might also have a more robust (but maybe less interesting) `formula2` that you might fit if the requirements are not met and the modeling process throws an error.
Make more informative error messages:
read.csv_new <- function(file, ...) {
  tryCatch(read.csv(file, ...), error = function(c) {
    c$message <- paste0(c$message, " (in file: ", file, ")")
    stop(c)
  })
}
read.csv("code/dummy.csv")
read.csv_new("code/dummy.csv")
Defensive programming 🛡
“making code fail in a well-defined manner”
“fail fast”: as soon as something wrong is discovered, signal an error
Three rules to implement the “fail fast” principle:
- be strict about what you accept (e.g., only scalars)
- avoid functions that use non-standard evaluation, such as `subset()`, `transform()`, or `with()`
- avoid functions that return different types of output depending on their input (e.g. `sapply()`)
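As an illustration of the first rule, a minimal sketch of input checking with `stopifnot()` (the function and checks are purely illustrative):

```r
mean_rate <- function(x) {
  # fail fast: insist on a non-empty numeric vector before doing any work
  stopifnot(is.numeric(x), length(x) > 0)
  mean(x)
}

mean_rate(c(1, 2, 3))  # works: returns 2
# mean_rate("a")       # fails immediately with an informative error
```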
Debugging Exercise 💀
Here is a piece of code that comes with a few flaws. As an optional take-home exercise, please identify the bugs, remove them and report what you have done using comments.
# load packages
library(tidyverse)
library(LegislatoR)
# get political data on German legislators
political_df <-
left_join(x = filter(get_political(legislature = "ger"), as.numeric("session") == 18),
y = get_core(legislature = "ger"), by = "pageid")
# wiki traffic data
traffic_df <-
get_traffic(legislature = "ger") |>
filter(date >= "2013-10-22" & date <= "2017-10-24") |>
group_by(pageid) |>
summarize(traffic_mean = mean(traffic, na.rm = TRUE),
traffic_max = max(traffic, na.rm = TRUE))
# sessions served
sessions_served_df <-
get_political(legislature = "deu") %%
group_by(pageid) |>
dplyr::summarize(sessions_served = n())
# merge
legislator_df <-
left_join(political_df, sessions_served_df, by = "pageid") |>
left_join(traffic_df, by = "pageid")
# compute age
get_age <- function(birth, date_at) {
date_at_fmt <- date_at
birth_fmt <- birth
diff <- difftime(lubridate::ymd(birth_fmt), lubridate::ymd(date_at_fmt))
diff_years <- time_length(diff, "years")
diff_years
}
legislator_df$age_in_years <- round(get_age(legislator_df$birth, "2017-10-24"), 0)
# plot top 10 pageviews
legislator_df <- arrange(legislator_df, desc(traffic_mean))
legislator_df$rank <- 1:nrow(legislator_df)
legislator_df_table <- dplyr::select(rank, name, traffic_mean, traffic_max)
names(legislator_df_table) <- c("Rank", "Representative", "Mean", "Maximum")
legislator_df_table <- head(legislator_df_table, 10)
ggplot(legislator_df_table, aes(y = Mean, x = -Rank)) +
xlab("Rank") + ylab("Avg. daily page views") +
labs(title = "Top 10 representatives by average daily page views") +
geom_bar(stats = "identity") + # change to geom_col
scale_x_continuous(breaks = -nrow(legislator_df_table):-1, labels = rev(1:nrow(legislator_df_table))) # add +
geom_text(aes(y = 10, label = Representative), hjust = 0, color = "white", size = 2) +
coord_flip() +
theme_minimal()
# run model of page views as a function of sessions served, party, sex, and age in years
legislator_df$traffic_log <- log(legislator_df$traffic_mean)
covars <- c("sessions_served", "party", "sex", "age_in_years")
fmla <- paste("traffic_log", paste(covars, collapse = " - "), sep = " ~ ")
summary(log_traffic_model <- lm(fmla, legislator_df))
# plot table
sjPlot::tab_model(log_traffic_model)
Acknowledgements
This script was drafted by Tom Arendt and Lisa Oswald, with contributions by Steve Kerr, Hiba Ahmad, Carmen Garro, Sebastian Ramirez-Ruiz, Killian Conyngham and Carol Sobral. It draws heavily on the materials for the Introduction to R workshop within the Data Science Seminar series at Hertie, created by Therese Anders, and on Statistical Modeling and Causal Inference by Sebastian Ramirez-Ruiz and Lisa Oswald.
The Lord of the Rings dialogue analysis example is based on the GitHub materials for STAT545 at UBC. You can find their version here.