• 1 Setup
  • 2 Objectives
  • 3 Copy-and-paste programming
  • 4 Using a function
  • 5 Application (many models)

1 Setup

# Install packages 
if (!require("pacman")) install.packages("pacman")
## Loading required package: pacman
pacman::p_load(tidyverse, # tidyverse pkgs including purrr
               tictoc, # performance test 
               broom) # tidy modeling

2 Objectives

  • How to use purrr to automate a workflow in a cleaner, faster, and more extensible way

3 Copy-and-paste programming

Copy-and-paste programming, sometimes referred to as just pasting, is the production of highly repetitive computer programming code, as produced by copy and paste operations. It is primarily a pejorative term; those who use the term are often implying a lack of programming competence. It may also be the result of technology limitations (e.g., an insufficiently expressive development environment) as subroutines or libraries would normally be used instead. However, there are occasions when copy-and-paste programming is considered acceptable or necessary, such as for boilerplate, loop unrolling (when not supported automatically by the compiler), or certain programming idioms, and it is supported by some source code editors in the form of snippets. - Wikipedia

  • Example

  • Let’s imagine df is a survey dataset.

    • a, b, c, d = Survey questions (one column each)

    • -99: non-responses

    • Your goal: replace -99 with NA

# Data
df <- tibble("a" = -99,
             "b" = -99,
             "c" = -99,
             "d" = -99)
             
# Copy and paste 
df$a[df$a == -99] <- NA
df$b[df$b == -99] <- NA
df$c[df$c == -99] <- NA
df$d[df$d == -99] <- NA

df
## # A tibble: 1 x 4
##       a     b     c     d
##   <dbl> <dbl> <dbl> <dbl>
## 1    NA    NA    NA    NA

  • Challenge 1. Explain why this solution is not very efficient. (e.g., If df$a[df$a == -99] <- NA has an error, how are you going to fix it?) A solution that cannot be automated is not scalable.

4 Using a function

  • Let’s recall what a function is in R: input + computation + output

  • If you write a function, you gain efficiency because you don’t need to copy and paste the computation part.

function(input) {
  computation
  return(output)
}

# Function
fix_missing <- function(x) {
  x[x == -99] <- NA
  x
}

# Apply function to each column (vector)
df$a <- fix_missing(df$a)
df$b <- fix_missing(df$b)
df$c <- fix_missing(df$c)
df$d <- fix_missing(df$d)

df
## # A tibble: 1 x 4
##       a     b     c     d
##   <dbl> <dbl> <dbl> <dbl>
## 1    NA    NA    NA    NA

  • Challenge 2. Why is using a function more efficient than 100% copying and pasting? Can you think of a way to automate the process?

  • Many options for automation in R: for loop, apply family, etc.

  • Here’s a tidy solution that comes from the purrr package.

  • The power and joy of a one-liner.

df <- purrr::map_df(df, fix_missing)

df
## # A tibble: 1 x 4
##       a     b     c     d
##   <dbl> <dbl> <dbl> <dbl>
## 1    NA    NA    NA    NA
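
The other automation options mentioned above (a for loop and the apply family) can be sketched as follows. This is only an illustration: the toy data frame and the names df_loop and df_apply are made up here, and base R is used so the sketch needs no packages.

```r
# Same helper as in the text
fix_missing <- function(x) {
  x[x == -99] <- NA
  x
}

# Toy data (hypothetical values; a base data.frame keeps the sketch dependency-free)
df_loop <- data.frame(a = c(1, -99), b = c(-99, 2), c = c(3, 4), d = c(-99, -99))
df_apply <- df_loop

# Option 1: for loop over column positions
for (i in seq_along(df_loop)) {
  df_loop[[i]] <- fix_missing(df_loop[[i]])
}

# Option 2: apply family; assigning into df_apply[] keeps the data.frame structure
df_apply[] <- lapply(df_apply, fix_missing)

# Both approaches give the same result
identical(df_loop, df_apply)
```

Both versions work, but each needs more bookkeeping (an index variable, or the `[]` assignment trick) than the one-line map call.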

map() is a higher-order function that applies a given function to each element of a list/vector.

This is how map() works. It’s easier to understand with a picture.

- Input: Takes a vector/list. 

- Computation: Calls the function once for each element of the vector 

- Output: Returns a list by default, or another data format if you use a typed helper (e.g., the `_df` helper returns a data frame)
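
The input, computation, and output steps above can be seen in a small sketch. The list x and its values are made up for illustration; map() and map_dbl() are the purrr functions discussed in the text.

```r
library(purrr)

# Input: a named list with three elements (hypothetical toy values)
x <- list(a = 1:3, b = c(2, 4, 6), c = 10)

# Computation: mean() is called once per element
# Output: map() always returns a list, one element per input element
as_list <- map(x, mean)

# Output with a typed helper: map_dbl() returns a named double vector
# and throws an error if any result is not a single number
as_dbl <- map_dbl(x, mean)

str(as_list)
as_dbl
```

Choosing the helper up front documents the type you expect, so a surprising result fails loudly instead of silently changing the output shape.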

Challenge 3. If you run the code below, what will the data type of the output be?

map_chr(df, fix_missing)
##  a  b  c  d 
## NA NA NA NA
# Built-in data 
data("airquality")

# Time the for-loop version
tic()

out1 <- vector("double", ncol(airquality)) # Placeholder 

for (i in seq_along(airquality)){ # Sequence variable 
  
  out1[[i]] <- mean(airquality[[i]], na.rm = TRUE) # Assign a computation result to each element 
  
}

toc()
## 0.007 sec elapsed
# Time the map_dbl() version

tic()

out1 <- airquality %>% map_dbl(mean, na.rm = TRUE)

toc()
## 0.003 sec elapsed
  • In short, map() is more readable, somewhat faster in this small timing, and easy to extend to other data science tasks (e.g., wrangling, modeling, and visualization) using %>%.

  • Final point: Why not base R apply family?

Short answer: purrr::map() is simpler to write. For instance,

map_dbl(x, mean, na.rm = TRUE)
# is equivalent to
vapply(x, mean, na.rm = TRUE, FUN.VALUE = double(1))
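
A quick check of this equivalence on the airquality data used earlier is sketched below. The variable names via_map and via_vapply are made up for this example.

```r
library(purrr)

data("airquality")

# purrr version: the _dbl suffix states the expected output type
via_map <- map_dbl(airquality, mean, na.rm = TRUE)

# Base R version: the expected type is spelled out through FUN.VALUE
via_vapply <- vapply(airquality, mean, na.rm = TRUE, FUN.VALUE = double(1))

# Both return a named double vector with one mean per column
all.equal(via_map, via_vapply)
```

The two calls do the same work; the difference is mostly how much you have to type to declare the output type.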

5 Application (many models)

  • One popular application of map() is to run regression models (or whatever model you want to run) on list-columns. No more copying and pasting for running many regression models on subgroups!

  • For more about this technique, read the Many Models chapter of R for Data Science.

# Function
lm_model <- function(df) {
  lm(Temp ~ Ozone, data = df)
}

# Map
models <- airquality %>%
  group_by(Month) %>%
  nest() %>% # Create list-columns 
  mutate(ols = map(data, lm_model)) # Map 

models$ols[1]
## [[1]]
## 
## Call:
## lm(formula = Temp ~ Ozone, data = df)
## 
## Coefficients:
## (Intercept)        Ozone  
##     62.8842       0.1629
# Add tidying 
tidy_lm_model <- purrr::compose( # compose() applies functions right to left by default
  broom::tidy, # second: convert lm objects into tidy tibbles 
  lm_model)    # first: fit the model

tidied_models <- airquality %>%
  group_by(Month) %>%
  nest() %>% # Create list-columns 
  mutate(ols = map(data, tidy_lm_model)) 

tidied_models$ols[1]
## [[1]]
## # A tibble: 2 x 5
##   term        estimate std.error statistic  p.value
##   <chr>          <dbl>     <dbl>     <dbl>    <dbl>
## 1 (Intercept)   62.9      1.61       39.2  2.88e-23
## 2 Ozone          0.163    0.0500      3.26 3.31e- 3
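
One possible next step, not part of the original walkthrough, is to flatten the tidied list-column into a single table so that coefficients can be compared across months. The sketch below assumes tidyr::unnest() is used for this and refits the models inline so it can run on its own; the name coefs is made up for the example.

```r
library(dplyr)
library(tidyr)
library(purrr)
library(broom)

data("airquality")

coefs <- airquality %>%
  group_by(Month) %>%
  nest() %>%                                                    # one row per month, data in a list-column
  mutate(ols = map(data, ~ tidy(lm(Temp ~ Ozone, data = .x)))) %>% # fit and tidy each monthly model
  select(Month, ols) %>%
  unnest(ols) %>%                                               # one row per month-term pair
  ungroup()

coefs
```

The result has one row per month and model term, which makes it easy to filter for the Ozone slope or plot the estimates by month.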