# Install packages
if (!require("pacman")) install.packages("pacman")
## Loading required package: pacman
pacman::p_load(tidyverse, # tidyverse pkgs including purrr
tictoc, # performance test
broom) # tidy modeling
Using `purrr` to automate workflow in a cleaner, faster, and more extendable way

Copy-and-paste programming, sometimes referred to as just pasting, is the production of highly repetitive computer programming code, as produced by copy and paste operations. It is primarily a pejorative term; those who use the term are often implying a lack of programming competence. It may also be the result of technology limitations (e.g., an insufficiently expressive development environment) as subroutines or libraries would normally be used instead. However, there are occasions when copy-and-paste programming is considered acceptable or necessary, such as for boilerplate, loop unrolling (when not supported automatically by the compiler), or certain programming idioms, and it is supported by some source code editors in the form of snippets. - Wikipedia
Example
Let’s imagine `df` is a survey dataset.
a, b, c, d = Survey respondents
-99: non-responses
Your goal: replace -99 with NA
# Data
df <- tibble("a" = -99,
"b" = -99,
"c" = -99,
"d" = -99)
# Copy and paste
df$a[df$a == -99] <- NA
df$b[df$b == -99] <- NA
df$c[df$c == -99] <- NA
df$d[df$d == -99] <- NA
df
a <dbl> | b <dbl> | c <dbl> | d <dbl> |
---|---|---|---|
NA | NA | NA | NA |
(If `df$a[df$a == -99] <- NA` has an error, how are you going to fix it?) A solution is not scalable if it is not automatable.
Let’s recall what a function is in R: input + computation + output
If you write a function, you gain efficiency because you don’t need to copy and paste the computation part.
` function(input){
computation
return(output)
} `
# Function
fix_missing <- function(x) {
x[x == -99] <- NA
x
}
# Apply function to each column (vector)
df$a <- fix_missing(df$a)
df$b <- fix_missing(df$b)
df$c <- fix_missing(df$c)
df$d <- fix_missing(df$d)
df
a <dbl> | b <dbl> | c <dbl> | d <dbl> |
---|---|---|---|
NA | NA | NA | NA |
Challenge 2: Why is using a function more efficient than 100% copying and pasting? Can you think of a way to automate the process?
There are many options for automation in R: `for` loops, the `apply` family, etc.
Here’s a tidy solution that comes from the `purrr` package.
The power and joy of a one-liner.
df <- purrr::map_df(df, fix_missing)
df
a <dbl> | b <dbl> | c <dbl> | d <dbl> |
---|---|---|---|
NA | NA | NA | NA |
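For comparison, the base R options mentioned above (a `for` loop and the `apply` family) could be sketched roughly as follows. This is only an illustration: the toy data is rebuilt as a plain `data.frame` so the block runs on its own, and `df_loop` and `df_apply` are names made up for this example.

```r
# Toy data and helper, redefined here so the block is self-contained
df <- data.frame(a = -99, b = -99, c = -99, d = -99)

fix_missing <- function(x) {
  x[x == -99] <- NA
  x
}

# Option 1: a for loop over the column positions
df_loop <- df
for (i in seq_along(df_loop)) {
  df_loop[[i]] <- fix_missing(df_loop[[i]])
}

# Option 2: lapply(); assigning into df_apply[] keeps the data frame structure
df_apply <- df
df_apply[] <- lapply(df_apply, fix_missing)

# Both approaches should give the same data frame
identical(df_loop, df_apply)
```

Both versions work, but each needs a little more bookkeeping (a copy, an index, or the `[]` assignment) than the `purrr` one-liner.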
`map()` is a higher-order function that applies a given function to each element of a list/vector.
This is how `map()` works. It’s easier to understand with a picture.
- Input: Takes a vector/list.
- Computation: Calls the function once for each element of the vector.
- Output: Returns a list or whatever data format you prefer (e.g., the `_df` helper returns a data frame).
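The input, computation, and output steps above can be seen in a minimal sketch. The vector `x` and the helper `square()` are hypothetical names used only for illustration.

```r
library(purrr)

x <- c(1, 2, 3)               # Input: a vector
square <- function(v) v^2     # Computation: called once per element (hypothetical helper)

out_list <- map(x, square)      # Output: a list with one element per input
out_dbl  <- map_dbl(x, square)  # Output: a double vector instead of a list

str(out_list)
out_dbl
```

The typed variants (`map_dbl()`, `map_chr()`, and so on) differ from `map()` only in the container they return.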
Challenge 3: If you run the code below, what’s going to be the data type of the output?
map_chr(df, fix_missing)
## a b c d
## NA NA NA NA
`map()` is a good alternative to a `for` loop. (For more information, watch Hadley Wickham’s talk titled “The Joy of Functional Programming (for Data Science)”.)
# Built-in data
data("airquality")
# 0.029 sec elapsed
tic()
out1 <- vector("double", ncol(airquality)) # Placeholder
for (i in seq_along(airquality)){ # Sequence variable
out1[[i]] <- mean(airquality[[i]], na.rm = TRUE) # Assign a computation result to each element
}
toc()
## 0.007 sec elapsed
# 0.004 sec elapsed
tic()
out1 <- airquality %>% map_dbl(mean, na.rm = TRUE)
toc()
## 0.003 sec elapsed
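A single `tic()`/`toc()` run on such a small dataset is noisy, so the numbers above are best read as indicative. One rough way to get a steadier comparison is to repeat each approach many times with base R's `system.time()`. The wrappers `col_means_loop()` and `col_means_map()` and the repetition count `n_reps` are assumptions made for this sketch, not part of the original example.

```r
library(purrr)
data("airquality")

# The two approaches from above, wrapped in (hypothetical) helper functions
col_means_loop <- function(d) {
  out <- vector("double", ncol(d))
  for (i in seq_along(d)) {
    out[[i]] <- mean(d[[i]], na.rm = TRUE)
  }
  out
}

col_means_map <- function(d) map_dbl(d, mean, na.rm = TRUE)

n_reps <- 1000  # arbitrary number of repetitions

t_loop <- system.time(for (r in seq_len(n_reps)) col_means_loop(airquality))
t_map  <- system.time(for (r in seq_len(n_reps)) col_means_map(airquality))

t_loop[["elapsed"]]
t_map[["elapsed"]]

# Sanity check: both approaches compute the same column means
all.equal(col_means_loop(airquality), unname(col_means_map(airquality)))
```

Timings will vary by machine and package version, so no expected values are shown here.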
In short, `map()` is more readable, faster in the timing above, and easily extensible to other data science tasks (e.g., wrangling, modeling, and visualization) using `%>%`.
Final point: Why not the base R `apply` family?
Short answer: `purrr::map()` is simpler to write. For instance,
`map_dbl(x, mean, na.rm = TRUE)` = `vapply(x, mean, na.rm = TRUE, FUN.VALUE = double(1))`
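As a quick check of that equivalence, here is a small sketch on the built-in `airquality` data. The object names `via_map` and `via_vapply` are made up for this example.

```r
library(purrr)
data("airquality")

# purrr version: the return type is encoded in the function name
via_map <- map_dbl(airquality, mean, na.rm = TRUE)

# base R version: the return type is passed through FUN.VALUE
via_vapply <- vapply(airquality, mean, na.rm = TRUE, FUN.VALUE = double(1))

# Both return a named double vector of column means
all.equal(via_map, via_vapply)
```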
One popular application of `map()` is to run regression models (or whatever model you want to run) on list-columns. No more copying and pasting for running many regression models on subgroups!
For more about this technique, read the Many Models subchapter of R for Data Science.
# Function
lm_model <- function(df) {
lm(Temp ~ Ozone, data = df)
}
# Map
models <- airquality %>%
group_by(Month) %>%
nest() %>% # Create list-columns
mutate(ols = map(data, lm_model)) # Map
models$ols[1]
## [[1]]
##
## Call:
## lm(formula = Temp ~ Ozone, data = df)
##
## Coefficients:
## (Intercept) Ozone
## 62.8842 0.1629
# Add tidying
tidy_lm_model <- purrr::compose( # compose multiple functions
broom::tidy, # convert lm objects into tidy tibbles
lm_model)
tidied_models <- airquality %>%
group_by(Month) %>%
nest() %>% # Create list-columns
mutate(ols = map(data, tidy_lm_model))
tidied_models$ols[1]
## [[1]]
## # A tibble: 2 x 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 62.9 1.61 39.2 2.88e-23
## 2 Ozone 0.163 0.0500 3.26 3.31e- 3
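Once each subgroup has a tidied model, the list-column can be flattened so that the coefficients sit side by side in one table. The following is a sketch under the same setup as above, not the original author's code. It uses `tidyr::unnest()`, and the object name `slopes` is made up for this example.

```r
library(dplyr)
library(tidyr)
library(purrr)
data("airquality")

slopes <- airquality %>%
  group_by(Month) %>%
  nest() %>%                                     # one row per month, data in a list-column
  mutate(ols = map(data, function(d) {
    broom::tidy(lm(Temp ~ Ozone, data = d))      # fit and tidy in one step
  })) %>%
  ungroup() %>%
  select(Month, ols) %>%
  unnest(ols) %>%                                # one row per month and model term
  filter(term == "Ozone")                        # keep only the Ozone slope

slopes
```

This gives one row per month with the Ozone coefficient and its standard error, which is convenient for comparing subgroups or plotting.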