Last updated: 2020-11-10

Knit directory: r4ds_book/

This reproducible R Markdown analysis was created with workflowr (version 1.6.2). The Checks tab describes the reproducibility checks that were applied when the results were created. The Past versions tab lists the development history.


Great! Since the R Markdown file has been committed to the Git repository, you know the exact version of the code that produced these results.

Great job! The global environment was empty. Objects defined in the global environment can affect the analysis in your R Markdown file in unknown ways. For reproducibility it’s best to always run the code in an empty environment.

The command set.seed(20200814) was run prior to running the code in the R Markdown file. Setting a seed ensures that any results that rely on randomness, e.g. subsampling or permutations, are reproducible.

Great job! Recording the operating system, R version, and package versions is critical for reproducibility.

Nice! There were no cached chunks for this analysis, so you can be confident that you successfully produced the results during this run.

Great job! Using relative paths to the files within your workflowr project makes it easier to run your code on other machines.

Great! You are using Git for version control. Tracking code development and connecting the code version to the results is critical for reproducibility.

The results in this page were generated with repository version 8864bd0. See the Past versions tab to see a history of the changes made to the R Markdown and HTML files.

Note that you need to be careful to ensure that all relevant files for the analysis have been committed to Git prior to generating the results (you can use wflow_publish or wflow_git_commit). workflowr only checks the R Markdown file, but you know if there are other scripts or data files that it depends on. Below is the status of the Git repository when the results were generated:


Ignored files:
    Ignored:    .Rproj.user/

Untracked files:
    Untracked:  analysis/images/
    Untracked:  code_snipp.txt
    Untracked:  data/at_health_facilities.csv
    Untracked:  data/infant_hiv.csv
    Untracked:  data/ranking.csv

Unstaged changes:
    Modified:   analysis/sample_exam1.Rmd

Note that any generated files, e.g. HTML, png, CSS, etc., are not included in this status report because it is ok for generated content to have uncommitted changes.


These are the previous versions of the repository in which changes were made to the R Markdown (analysis/ch17_iteration.Rmd) and HTML (docs/ch17_iteration.html) files. If you’ve configured a remote Git repository (see ?wflow_git_remote), click on the hyperlinks in the table below to view the files as they were in that past version.

File Version Author Date Message
Rmd 8864bd0 sciencificity 2020-11-10 added ch17

Iteration

For loops

We have seen one tool for avoiding repetition: functions. Another tool for reducing duplication is iteration, which helps when you need to do the same thing to multiple inputs, such as repeating the same operation on different columns or on different datasets.

Let’s say we want to calculate the median for each column in a dataframe, and we don’t want to repeat ourselves.

library(tidyverse) # provides tibble(), %>%, summarise(), across() and the purrr map functions

df <- tibble(
  a = rnorm(10),
  b = rnorm(10),
  c = rnorm(10),
  d = rnorm(10)
)

# what we don't want
median(df$a)
#> [1] -0.2988316
# ...
median(df$d)
#> [1] 0.7417024

df %>% 
  summarise(across(.cols = dplyr::everything(),
                median,
                .names = "median_{.col}"))
#> # A tibble: 1 x 4
#>   median_a median_b median_c median_d
#>      <dbl>    <dbl>    <dbl>    <dbl>
#> 1   -0.299    0.334    0.246    0.742

output <- vector("double", ncol(df)) #1. output
for (i in seq_along(df)){            #2. sequence 
  output[[i]] <- median(df[[i]])     #3. body
}
output
#> [1] -0.2988316  0.3338866  0.2461308  0.7417024

A for loop has three components:

  1. The output: output <- vector("double", ncol(df)). Before you start the loop, you must always allocate sufficient space for the output. This is very important for efficiency: if you grow the output at each iteration with c(), the loop will be very slow.

    A general way of creating an empty vector of given length is the vector() function. It has two arguments: the type of the vector (“logical”, “integer”, “double”, “character”, etc) and the length of the vector.

  2. The sequence: i in seq_along(df). This determines what to loop over: each run of the for loop will assign i to a different value from seq_along(df). It’s useful to think of i as a pronoun, like “it”.

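    You might not have seen seq_along() before. It’s a safe version of the familiar 1:length(l): if the vector has length zero, 1:length(l) produces c(1, 0) and the loop runs twice, whereas seq_along() produces an empty sequence and the loop is skipped.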

  3. The body: output[[i]] <- median(df[[i]]). This is the code that does the work. It’s run repeatedly, each time with a different value for i. The first iteration will run output[[1]] <- median(df[[1]]), the second will run output[[2]] <- median(df[[2]]), and so on.
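
To see why seq_along() is the safer choice for the sequence, consider a zero-length vector. This is a minimal base R sketch; the variable names `empty` and `runs` are only illustrative.

```r
empty <- vector("double", 0)

# 1:length(empty) counts down from 1 to 0, so a loop over it would run twice
1:length(empty)
#> [1] 1 0

# seq_along() returns an empty sequence, so the loop body is skipped
seq_along(empty)
#> integer(0)

runs <- 0
for (i in seq_along(empty)) {
  runs <- runs + 1
}
runs
#> [1] 0
```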

Exercises

  1. Write for loops to:

    Think about the output, sequence, and body before you start writing the loop.

    1. Compute the mean of every column in mtcars.

      output <- vector("double", ncol(mtcars))
      for (i in seq_along(mtcars)) {
        output[[i]] <- mean(mtcars[[i]])
      }
      output
      #>  [1]  20.090625   6.187500 230.721875 146.687500   3.596563   3.217250
      #>  [7]  17.848750   0.437500   0.406250   3.687500   2.812500
      
      mtcars %>% 
        summarise(across(.cols = dplyr::everything(),
                         mean,
                         .names = "mean_{.col}"))
      #>   mean_mpg mean_cyl mean_disp  mean_hp mean_drat mean_wt mean_qsec mean_vs
      #> 1 20.09062   6.1875  230.7219 146.6875  3.596563 3.21725  17.84875  0.4375
      #>   mean_am mean_gear mean_carb
      #> 1 0.40625    3.6875    2.8125
    2. Determine the type of each column in nycflights13::flights.

      flights <- nycflights13::flights
      output <- vector("list", ncol(flights))
      for (i in seq_along(flights)) {
        output[[i]] <- class(flights[[i]])
      }
      str(output)
      #> List of 19
      #>  $ : chr "integer"
      #>  $ : chr "integer"
      #>  $ : chr "integer"
      #>  $ : chr "integer"
      #>  $ : chr "integer"
      #>  $ : chr "numeric"
      #>  $ : chr "integer"
      #>  $ : chr "integer"
      #>  $ : chr "numeric"
      #>  $ : chr "character"
      #>  $ : chr "integer"
      #>  $ : chr "character"
      #>  $ : chr "character"
      #>  $ : chr "character"
      #>  $ : chr "numeric"
      #>  $ : chr "numeric"
      #>  $ : chr "numeric"
      #>  $ : chr "numeric"
      #>  $ : chr [1:2] "POSIXct" "POSIXt"
    3. Compute the number of unique values in each column of iris.

      output <- vector("integer", ncol(iris))
      for (i in seq_along(iris)) {
        output[[i]] <- n_distinct(iris[[i]])
      }
      output
      #> [1] 35 23 43 22  3
    4. Generate 10 random normals from distributions with means of -10, 0, 10, and 100.

      mean_vec <- c(-10, 0, 10, 100)
      output <- vector("list", length(mean_vec))
      for (i in seq_along(mean_vec)){
        output[[i]] <- rnorm(10, mean = mean_vec[i])
      }
      str(output)
      #> List of 4
      #>  $ : num [1:10] -10.21 -12.18 -9.99 -10.16 -8.49 ...
      #>  $ : num [1:10] -1.294 0.1356 0.4745 -0.7789 0.0538 ...
      #>  $ : num [1:10] 11.58 10.67 10.13 10.93 9.97 ...
      #>  $ : num [1:10] 101.9 98.3 100.3 101 100.5 ...
  2. Eliminate the for loop in each of the following examples by taking advantage of an existing function that works with vectors:

    out <- ""
    for (x in letters) {
      out <- stringr::str_c(out, x)
    }
    out
    
    x <- sample(100)
    sd <- 0
    for (i in seq_along(x)) {
      sd <- sd + (x[i] - mean(x)) ^ 2
    }
    sd <- sqrt(sd / (length(x) - 1))
    
    x <- runif(100)
    out <- vector("numeric", length(x))
    out[1] <- x[1]
    for (i in 2:length(x)) {
      out[i] <- out[i - 1] + x[i]
    }
    out <- ""
    for (x in letters) {
      out <- stringr::str_c(out, x)
    }
    out
    #> [1] "abcdefghijklmnopqrstuvwxyz"
    
    stringr::str_c(letters, collapse = "")
    #> [1] "abcdefghijklmnopqrstuvwxyz"
    
    x <- sample(100)
    sd <- 0
    for (i in seq_along(x)) {
      sd <- sd + (x[i] - mean(x)) ^ 2
    }
    sd <- sqrt(sd / (length(x) - 1))
    sd
    #> [1] 29.01149
    
    sd(x)
    #> [1] 29.01149
    
    x <- runif(100)
    out <- vector("numeric", length(x))
    out[1] <- x[1]
    for (i in 2:length(x)) {
      out[i] <- out[i - 1] + x[i]
    }
    out
    #>   [1]  0.6351527  0.7174654  1.4689535  1.6411959  1.8778177  1.9641860
    #>   [7]  2.0263957  2.8975083  3.3813111  4.1285753  4.4454395  4.6571805
    #>  [13]  4.9415380  5.0334833  5.5301909  6.3107788  6.4833254  6.9499442
    #>  [19]  7.6978861  8.1622564  8.4064616  8.5897091  9.1733486  9.3924883
    #>  [25]  9.5146333 10.0474487 10.8114975 11.0506961 11.8088200 11.8298386
    #>  [31] 12.4914362 13.1417364 13.8578049 13.9971990 14.4020404 15.2428350
    #>  [37] 15.7026112 16.3960690 17.0198206 17.8616812 17.8928900 18.3448546
    #>  [43] 18.7582465 18.8578642 19.6400662 19.8919098 20.7973655 21.3284186
    #>  [49] 22.1071170 22.4692254 22.9612479 23.6905152 23.8950712 24.1598892
    #>  [55] 24.8504630 25.4103966 25.9334435 26.2508637 26.9836438 27.7134747
    #>  [61] 28.0923461 28.1181665 28.2125266 29.1600522 30.0720412 31.0685855
    #>  [67] 31.2085178 31.9706640 32.3461839 33.2858005 33.8393848 34.6512925
    #>  [73] 34.7389255 35.6962343 36.6018704 36.7594209 37.6794123 38.4072685
    #>  [79] 39.0140218 39.8144649 40.6395816 41.5091561 42.2289230 43.1918513
    #>  [85] 44.1113285 44.3325169 45.1351856 45.2971424 45.6439158 46.5736370
    #>  [91] 47.5553769 48.0237049 48.1872471 48.5820102 48.6626798 49.1583621
    #>  [97] 49.3742582 49.4005363 49.4005479 49.4909540
    
    cumsum(x)
    #>   [1]  0.6351527  0.7174654  1.4689535  1.6411959  1.8778177  1.9641860
    #>   [7]  2.0263957  2.8975083  3.3813111  4.1285753  4.4454395  4.6571805
    #>  [13]  4.9415380  5.0334833  5.5301909  6.3107788  6.4833254  6.9499442
    #>  [19]  7.6978861  8.1622564  8.4064616  8.5897091  9.1733486  9.3924883
    #>  [25]  9.5146333 10.0474487 10.8114975 11.0506961 11.8088200 11.8298386
    #>  [31] 12.4914362 13.1417364 13.8578049 13.9971990 14.4020404 15.2428350
    #>  [37] 15.7026112 16.3960690 17.0198206 17.8616812 17.8928900 18.3448546
    #>  [43] 18.7582465 18.8578642 19.6400662 19.8919098 20.7973655 21.3284186
    #>  [49] 22.1071170 22.4692254 22.9612479 23.6905152 23.8950712 24.1598892
    #>  [55] 24.8504630 25.4103966 25.9334435 26.2508637 26.9836438 27.7134747
    #>  [61] 28.0923461 28.1181665 28.2125266 29.1600522 30.0720412 31.0685855
    #>  [67] 31.2085178 31.9706640 32.3461839 33.2858005 33.8393848 34.6512925
    #>  [73] 34.7389255 35.6962343 36.6018704 36.7594209 37.6794123 38.4072685
    #>  [79] 39.0140218 39.8144649 40.6395816 41.5091561 42.2289230 43.1918513
    #>  [85] 44.1113285 44.3325169 45.1351856 45.2971424 45.6439158 46.5736370
    #>  [91] 47.5553769 48.0237049 48.1872471 48.5820102 48.6626798 49.1583621
    #>  [97] 49.3742582 49.4005363 49.4005479 49.4909540
  3. Combine your function writing and for loop skills:

    1. Write a for loop that print()s the lyrics to the children’s song “Alice the camel”.

      “Alice The Camel” Lyrics
       Alice the camel has five humps.
       Alice the camel has five humps.
       Alice the camel has five humps.
       So go, Alice, go!
       Boom, boom, boom, boom!
      
       Alice the camel has four humps.
       Alice the camel has four humps.
       Alice the camel has four humps.
       So go, Alice, go!
       Boom, boom, boom, boom!
      
       Alice the camel has three humps.
       Alice the camel has three humps.
       Alice the camel has three humps.
       So go, Alice, go!
       Boom, boom, boom, boom!
      
       Alice the camel has two humps.
       Alice the camel has two humps.
       Alice the camel has two humps.
       So go, Alice, go!
       Boom, boom, boom, boom!
      
       Alice the camel has one hump.
       Alice the camel has one hump.
       Alice the camel has one hump.
       So go, Alice, go!
       Boom, boom, boom, boom!
      
       Alice the camel has no humps.
       Alice the camel has no humps.
       Alice the camel has no humps.
       ‘Cause Alice is a horse, of course!
       
      alice_song <- function(){
        times <- c("five", "four", "three", "two", "one", "no")
        song_lyrics <- vector("character", length(times))
        for (i in seq_along(times)) {
          if (times[i] == "no") {
            song_lyrics[[i]] <- str_glue(
            "\nAlice the camel has {times[i]} humps.
            Alice the camel has {times[i]} humps.
            Alice the camel has {times[i]} humps.
            'Cause Alice is a horse, of course!\n\n
            ")
      
          } else if (times[i] == "one") {
            song_lyrics[[i]] <- str_glue(
            "\nAlice the camel has {times[i]} hump.
            Alice the camel has {times[i]} hump.
            Alice the camel has {times[i]} hump.
            So go, Alice, go!
            Boom, boom, boom, boom!\n\n
            ")
          } else {
            song_lyrics[[i]] <- str_glue(
            "\nAlice the camel has {times[i]} humps.
            Alice the camel has {times[i]} humps.
            Alice the camel has {times[i]} humps.
            So go, Alice, go!
            Boom, boom, boom, boom!\n\n
            ")
          }
        }
        song_lyrics
      }
      
      song_lyrics <- alice_song()
      writeLines(str_c(song_lyrics, collapse = ""))
      #> Alice the camel has five humps.
      #>       Alice the camel has five humps.
      #>       Alice the camel has five humps.
      #>       So go, Alice, go!
      #>       Boom, boom, boom, boom!
      #> 
      #> Alice the camel has four humps.
      #>       Alice the camel has four humps.
      #>       Alice the camel has four humps.
      #>       So go, Alice, go!
      #>       Boom, boom, boom, boom!
      #> 
      #> Alice the camel has three humps.
      #>       Alice the camel has three humps.
      #>       Alice the camel has three humps.
      #>       So go, Alice, go!
      #>       Boom, boom, boom, boom!
      #> 
      #> Alice the camel has two humps.
      #>       Alice the camel has two humps.
      #>       Alice the camel has two humps.
      #>       So go, Alice, go!
      #>       Boom, boom, boom, boom!
      #> 
      #> Alice the camel has one hump.
      #>       Alice the camel has one hump.
      #>       Alice the camel has one hump.
      #>       So go, Alice, go!
      #>       Boom, boom, boom, boom!
      #> 
      #> Alice the camel has no humps.
      #>       Alice the camel has no humps.
      #>       Alice the camel has no humps.
      #>       'Cause Alice is a horse, of course!
    2. Convert the nursery rhyme “ten in the bed” to a function. Generalise it to any number of people in any sleeping structure.

      roll_over <- function(num = "ten") {
        num_levels <-  c("one", "two", "three", "four", "five",
                     "six", "seven", "eight", "nine", "ten")
        num_fact <- factor(num, levels = num_levels)
        output <- vector("character", as.integer(num_fact))
        for (i in seq_along(output)){
          if(num_levels[[length(output)-(i-1)]] == "one"){
            output[[i]] <- str_glue(
            "There was {num_levels[[length(output)-(i-1)]]} in the bed",
             " and the little one said Ahhhhhh ...\n",
             "\n")
          } else {
            output[[i]] <- str_glue(
            "There were {num_levels[[length(output)-(i-1)]]} in the bed",
            " and the little one said roll over, roll over ...\n",
            "\n")
          }
        }
        output
      }
      writeLines(str_c(roll_over("three"), collapse = ""))
      #> There were three in the bed and the little one said roll over, roll over ...
      #> There were two in the bed and the little one said roll over, roll over ...
      #> There was one in the bed and the little one said Ahhhhhh ...
      writeLines(str_c(roll_over("five"), collapse = ""))
      #> There were five in the bed and the little one said roll over, roll over ...
      #> There were four in the bed and the little one said roll over, roll over ...
      #> There were three in the bed and the little one said roll over, roll over ...
      #> There were two in the bed and the little one said roll over, roll over ...
      #> There was one in the bed and the little one said Ahhhhhh ...
    3. Convert the song “99 bottles of beer on the wall” to a function. Generalise to any number of any vessel containing any liquid on any surface.

  4. It’s common to see for loops that don’t preallocate the output and instead increase the length of a vector at each step:

    output <- vector("integer", 0)
    for (i in seq_along(x)) {
      output <- c(output, lengths(x[[i]]))
    }
    output

    How does this affect performance? Design and execute an experiment.
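
One possible experiment for the last exercise is sketched here. It times the growing pattern against a preallocated loop on the same input. The input `x`, its size, and the function names `grow()` and `preallocate()` are assumptions made for illustration, and length() is used in place of lengths() to keep one value per element.

```r
# Hypothetical input: a list of 10,000 short integer vectors
x <- rep(list(1:3), 1e4)

grow <- function(x) {
  output <- vector("integer", 0)
  for (i in seq_along(x)) {
    output <- c(output, length(x[[i]]))  # copies the whole vector at every step
  }
  output
}

preallocate <- function(x) {
  output <- vector("integer", length(x))  # space reserved once, up front
  for (i in seq_along(x)) {
    output[[i]] <- length(x[[i]])
  }
  output
}

# both approaches return the same result
identical(grow(x), preallocate(x))
#> [1] TRUE

# elapsed times depend on the machine, so no timings are shown here
system.time(grow(x))
system.time(preallocate(x))
```

Repeating the comparison for several input sizes should show whether the gap widens as the input grows.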

For loop variations

Here are some for loop variations.

  1. Modifying an existing object, instead of creating a new object.
  2. Looping over names or values, instead of indices.
  3. Handling outputs of unknown length.
  4. Handling sequences of unknown length.

Modifying an existing object

Sometimes you want to use a for loop to modify an existing object.

For example, let’s say we want to rescale every column in a data frame.

To solve this with a for loop we again think about the three components:

  1. Output: we already have the output — it’s the same as the input!

  2. Sequence: we can think about a data frame as a list of columns, so we can iterate over each column with seq_along(df).

  3. Body: apply rescale01().

df <- tibble(
  a = rnorm(10),
  b = rnorm(10),
  c = rnorm(10),
  d = rnorm(10)
)
df
#> # A tibble: 10 x 4
#>         a      b      c       d
#>     <dbl>  <dbl>  <dbl>   <dbl>
#>  1 -0.184 -0.781 -0.992  0.694 
#>  2  0.317  1.41  -0.287  1.16  
#>  3  0.536  0.626 -1.26   0.798 
#>  4 -1.79  -0.700  0.336 -1.30  
#>  5 -1.68   0.803 -0.681 -0.575 
#>  6 -0.732  0.221 -0.802 -1.03  
#>  7 -0.459 -0.808  2.49   0.778 
#>  8  0.592  0.614  0.735  0.764 
#>  9 -1.25   0.411  1.32   0.0356
#> 10  0.873 -0.139  0.196  0.403

rescale01 <- function(x){
  rng <- range(x, na.rm = TRUE)
  (x - rng[1]) /(rng[2] - rng[1])
}

for(i in seq_along(df)){
  df[[i]] <- rescale01(df[[i]])
}
df
#> # A tibble: 10 x 4
#>         a      b      c     d
#>     <dbl>  <dbl>  <dbl> <dbl>
#>  1 0.604  0.0122 0.0720 0.812
#>  2 0.792  1      0.260  1    
#>  3 0.874  0.647  0      0.854
#>  4 0      0.0490 0.426  0    
#>  5 0.0436 0.726  0.155  0.296
#>  6 0.398  0.464  0.123  0.112
#>  7 0.501  0      1      0.846
#>  8 0.895  0.641  0.532  0.840
#>  9 0.204  0.550  0.687  0.544
#> 10 1      0.302  0.389  0.693

Looping patterns

To loop over a vector we have:

  1. Loop over the numeric indices: for (i in seq_along(xs)), extracting the value with xs[[i]]. This is the most general form, because the position gives you access to both the name and the value.

  2. Loop over the elements: for (x in xs). Useful if you only care about side-effects, like plotting or printing.

  3. Loop over the names: for (nm in names(xs)). This gives you the name, which you can use to access the value with xs[[nm]].

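# x is any named vector or list; these values are only illustrative
x <- c(a = 1, b = 2, c = 3)

# when creating named output, name the results vector up front
results <- vector("list", length(x))
names(results) <- names(x)

# looping over indices gives access to both the name and the value
for (i in seq_along(x)) {
  name <- names(x)[[i]]
  value <- x[[i]]
}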
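
The three looping patterns can be compared side by side. This is a small base R sketch; the vector `xs` and its values are made up for illustration.

```r
xs <- c(first = 10, second = 20, third = 30)  # illustrative named vector

# 1. over indices: position, name and value are all available
for (i in seq_along(xs)) {
  cat(i, names(xs)[[i]], xs[[i]], "\n")
}

# 2. over elements: only the value is available (names are dropped)
for (value in xs) {
  cat(value, "\n")
}

# 3. over names: look the value up by name
for (nm in names(xs)) {
  cat(nm, xs[[nm]], "\n")
}
```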

Unknown output length

Sometimes you might not know how long the output will be. In that case, save each result in a list and combine the pieces into a single vector once the loop is done. Here unlist() flattens a list of vectors into a single vector. A stricter option is purrr::flatten_dbl(), which throws an error if the input isn’t a list of doubles.

means <- c(0,1,2)

output <- vector("list", length(means))
for (i in seq_along(means)){
  n <- sample(100, 1)
  output[[i]] <- rnorm(n, means[[i]])
}
str(output)
#> List of 3
#>  $ : num [1:96] 1.085 1.192 -0.476 -0.342 0.208 ...
#>  $ : num [1:22] 2.548 0.3431 0.6396 1.8766 0.0867 ...
#>  $ : num [1:58] 2.01 3.17 2.01 2.36 2.05 ...
str(unlist(output))
#>  num [1:176] 1.085 1.192 -0.476 -0.342 0.208 ...

This pattern occurs in other places too:

  1. You might be generating a long string. Instead of paste()ing together each iteration with the previous, save the output in a character vector and then combine that vector into a single string with paste(output, collapse = "").

  2. You might be generating a big data frame. Instead of sequentially rbind()ing in each iteration, save the output in a list, then use dplyr::bind_rows(output) to combine the output into a single data frame.
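
Both variations follow the same collect-then-combine shape. The sketch below uses a made-up character vector `words` as input; the `dplyr::` prefix assumes only that dplyr is installed.

```r
words <- c("iteration", "reduces", "duplication")  # illustrative input

# strings: store each piece, then combine once after the loop
pieces <- vector("character", length(words))
for (i in seq_along(words)) {
  pieces[[i]] <- toupper(words[[i]])
}
paste(pieces, collapse = " ")
#> [1] "ITERATION REDUCES DUPLICATION"

# data frames: store one small data frame per iteration in a list ...
rows <- vector("list", length(words))
for (i in seq_along(words)) {
  rows[[i]] <- data.frame(
    word = words[[i]],
    n_char = nchar(words[[i]]),
    stringsAsFactors = FALSE
  )
}
# ... then bind them together in a single step
combined <- dplyr::bind_rows(rows)
```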

Unknown sequence length

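Sometimes you don’t even know how long the sequence should run for. In that case you can use a while loop, which has just a condition and a body. The example after the skeleton below counts how many coin flips it takes to get three heads in a row.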

while (condition) {
  # body
}
flip <- function() sample(c("T", "H"), 1)
flips <- 0
nheads <- 0
while (nheads < 3) {
  if (flip() == "H") {
    nheads <- nheads + 1
  } else {
    nheads <- 0
  }
  flips <- flips + 1
}
flips
#> [1] 7

Exercises

  1. Imagine you have a directory full of CSV files that you want to read in. You have their paths in a vector, files <- dir("data/", pattern = "\\.csv$", full.names = TRUE), and now want to read each one with read_csv(). Write the for loop that will load them into a single data frame.

  2. What happens if you use for (nm in names(x)) and x has no names? What if only some of the elements are named? What if the names are not unique?

  3. Write a function that prints the mean of each numeric column in a data frame, along with its name. For example, show_mean(iris) would print:

    show_mean(iris)
    #> Sepal.Length: 5.84
    #> Sepal.Width:  3.06
    #> Petal.Length: 3.76
    #> Petal.Width:  1.20

    (Extra challenge: what function did I use to make sure that the numbers lined up nicely, even though the variable names had different lengths?)

  4. What does this code do? How does it work?

    trans <- list( 
      disp = function(x) x * 0.0163871,
      am = function(x) {
        factor(x, labels = c("auto", "manual"))
      }
    )
    for (var in names(trans)) {
      mtcars[[var]] <- trans[[var]](mtcars[[var]])
    }
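
One way to explore the last exercise is to run the loop on a copy of the data and inspect the two affected columns. In this sketch the copy name `mtcars2` is an assumption, chosen so the built-in dataset stays untouched.

```r
mtcars2 <- mtcars  # hypothetical copy; the original mtcars is not modified

trans <- list(
  disp = function(x) x * 0.0163871,  # cubic inches to litres
  am = function(x) {
    factor(x, labels = c("auto", "manual"))
  }
)

# names(trans) is c("disp", "am"): each name selects both the function
# stored in trans and the column of mtcars2 that it is applied to
for (var in names(trans)) {
  mtcars2[[var]] <- trans[[var]](mtcars2[[var]])
}

# disp is now in litres and am is a factor
str(mtcars2[, c("disp", "am")])
```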

Exercises

  1. Read the documentation for apply(). In the 2d case, what two for loops does it generalise?

  2. Adapt col_summary() so that it only applies to numeric columns. You might want to start with an is_numeric() function that returns a logical vector that has a TRUE corresponding to each numeric column.
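
A possible approach to the second exercise is sketched below. It redefines col_summary() (the function written in the map functions section) so that the block is self-contained. The logical vector is built inline with vapply() rather than a separate is_numeric() helper, and naming the output is an extra that the exercise does not ask for.

```r
col_summary <- function(df, fun) {
  # logical vector with one TRUE per numeric column
  numeric_cols <- vapply(df, is.numeric, logical(1))
  numeric_df <- df[numeric_cols]

  out <- vector("double", length(numeric_df))
  names(out) <- names(numeric_df)
  for (i in seq_along(numeric_df)) {
    out[[i]] <- fun(numeric_df[[i]])
  }
  out
}

# iris has four numeric columns and one factor column (Species)
col_summary(iris, mean)
```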

The map functions

The {purrr} package provides a family of functions for the common task of iteration. There is one function for each type of output:

  • map() makes a list.
  • map_lgl() makes a logical vector.
  • map_int() makes an integer vector.
  • map_dbl() makes a double vector.
  • map_chr() makes a character vector.

Each function takes a vector as input, applies a function to each piece, and then returns a new vector that’s the same length (and has the same names) as the input. The type of the vector is determined by the suffix to the map function.
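
As a quick illustration of how the suffix fixes the output type, the sketch below applies each variant to the same input. The list `nums` and the helper formulas are made up for this example.

```r
library(purrr)

nums <- list(a = 1:3, b = 4:9)  # illustrative named list

map(nums, length)                           # list
map_int(nums, length)                       # integer vector
map_dbl(nums, mean)                         # double vector
map_lgl(nums, ~ length(.x) > 3)             # logical vector
map_chr(nums, ~ paste(.x, collapse = "-"))  # character vector
```

Every result has length 2 and carries the names a and b from the input.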

df <- tibble(
  a = rnorm(10),
  b = rnorm(10),
  c = rnorm(10),
  d = rnorm(10)
)

output <- vector("double", length(df))
for (i in seq_along(df)) {
  output[[i]] <- mean(df[[i]])
}
output
#> [1] -0.3470941 -0.3799759 -0.1439421  0.4843150

# generalise it into a function
col_mean <- function(df) {
  output <- vector("double", length(df))
  for (i in seq_along(df)) {
    output[i] <- mean(df[[i]])
  }
  output
}

# uh-oh, two more functions that differ only slightly are needed!
col_median <- function(df) {
  output <- vector("double", length(df))
  for (i in seq_along(df)) {
    output[i] <- median(df[[i]])
  }
  output
}
col_sd <- function(df) {
  output <- vector("double", length(df))
  for (i in seq_along(df)) {
    output[i] <- sd(df[[i]])
  }
  output
}
col_mean(df)
#> [1] -0.3470941 -0.3799759 -0.1439421  0.4843150
col_sd(df)
#> [1] 1.0006300 0.9881514 0.9117865 0.6340628
col_median(df)
#> [1] -0.31932179 -0.51488140 -0.02374104  0.49163958

We can then generalise further once we realise that a function can be passed to another function as an argument! 💪

col_summary <- function(df, fun) {
  out <- vector("double", length(df))
  for (i in seq_along(df)) {
    out[i] <- fun(df[[i]])
  }
  out
}
col_summary(df, median)
#> [1] -0.31932179 -0.51488140 -0.02374104  0.49163958
col_summary(df, mean)
#> [1] -0.3470941 -0.3799759 -0.1439421  0.4843150

With {purrr} functions we could do this in a cinch! 🐱

map_dbl(df, mean)
#>          a          b          c          d 
#> -0.3470941 -0.3799759 -0.1439421  0.4843150
map_dbl(df, median)
#>           a           b           c           d 
#> -0.31932179 -0.51488140 -0.02374104  0.49163958
# or using pipes
df %>% map_dbl(mean)
#>          a          b          c          d 
#> -0.3470941 -0.3799759 -0.1439421  0.4843150
df %>% map_dbl(median)
#>           a           b           c           d 
#> -0.31932179 -0.51488140 -0.02374104  0.49163958

There are a few differences between map_*() and the col_summary() function we wrote:

  • The {purrr} functions are implemented in C, which makes them a little faster.

  • The second argument, .f, the function to apply, can be:

    • a formula,
    • a character vector, or
    • an integer vector.
  • map_*() uses … ([dot dot dot]) to pass along additional arguments to .f each time it’s called:

    map_dbl(df, mean, trim = 0.5)
    #>           a           b           c           d 
    #> -0.31932179 -0.51488140 -0.02374104  0.49163958
  • The map functions also preserve names:

    z <- list(x = 1:3, y = 4:5)
    map_int(z, length)
    #> x y 
    #> 3 2

Shortcuts

There are a few shortcuts that you can use with .f. Say we want to fit a linear model to each group in a dataset.

(models <- mtcars %>% 
  split(.$cyl) %>% 
   # explicitly create a func, with an input which is each
   # split
  map(function(df) lm(mpg ~ wt, data = df)))
#> $`4`
#> 
#> Call:
#> lm(formula = mpg ~ wt, data = df)
#> 
#> Coefficients:
#> (Intercept)           wt  
#>      39.571       -5.647  
#> 
#> 
#> $`6`
#> 
#> Call:
#> lm(formula = mpg ~ wt, data = df)
#> 
#> Coefficients:
#> (Intercept)           wt  
#>       28.41        -2.78  
#> 
#> 
#> $`8`
#> 
#> Call:
#> lm(formula = mpg ~ wt, data = df)
#> 
#> Coefficients:
#> (Intercept)           wt  
#>      23.868       -2.192

purrr provides a convenient shortcut for an anonymous function: a one-sided formula.

(models <- mtcars %>% 
  split(.$cyl) %>% 
  map(~lm(mpg ~ wt, data = .)))
#> $`4`
#> 
#> Call:
#> lm(formula = mpg ~ wt, data = .)
#> 
#> Coefficients:
#> (Intercept)           wt  
#>      39.571       -5.647  
#> 
#> 
#> $`6`
#> 
#> Call:
#> lm(formula = mpg ~ wt, data = .)
#> 
#> Coefficients:
#> (Intercept)           wt  
#>       28.41        -2.78  
#> 
#> 
#> $`8`
#> 
#> Call:
#> lm(formula = mpg ~ wt, data = .)
#> 
#> Coefficients:
#> (Intercept)           wt  
#>      23.868       -2.192

We use . to refer to the current list element.

Let’s say we want to extract a summary statistic like the \(R^2\).

  • First run summary() and
  • Then extract the component called r.squared.
models %>% 
  map(summary) %>% 
  map_dbl(~.$r.squared) # alternate would be: map_dbl(function(df) df$r.squared)
#>         4         6         8 
#> 0.5086326 0.4645102 0.4229655

But extracting named components is a common operation, so purrr provides an even shorter shortcut: you can use a string. 😁

models %>% 
  map(summary) %>% 
  map_dbl("r.squared")
#>         4         6         8 
#> 0.5086326 0.4645102 0.4229655

You can also use an integer to select elements by position:

x <- list(list(1, 2, 3), list(4, 5, 6), list(7, 8, 9))
x %>% map_dbl(2)
#> [1] 2 5 8

Exercises

  1. Write code that uses one of the map functions to:

    1. Compute the mean of every column in mtcars.

      mtcars %>% 
        map_dbl(mean)
      #>        mpg        cyl       disp         hp       drat         wt       qsec 
      #>  20.090625   6.187500 230.721875 146.687500   3.596563   3.217250  17.848750 
      #>         vs         am       gear       carb 
      #>   0.437500   0.406250   3.687500   2.812500
    2. Determine the type of each column in nycflights13::flights.

      nycflights13::flights %>% 
        map(class) %>% 
        str()
      #> List of 19
      #>  $ year          : chr "integer"
      #>  $ month         : chr "integer"
      #>  $ day           : chr "integer"
      #>  $ dep_time      : chr "integer"
      #>  $ sched_dep_time: chr "integer"
      #>  $ dep_delay     : chr "numeric"
      #>  $ arr_time      : chr "integer"
      #>  $ sched_arr_time: chr "integer"
      #>  $ arr_delay     : chr "numeric"
      #>  $ carrier       : chr "character"
      #>  $ flight        : chr "integer"
      #>  $ tailnum       : chr "character"
      #>  $ origin        : chr "character"
      #>  $ dest          : chr "character"
      #>  $ air_time      : chr "numeric"
      #>  $ distance      : chr "numeric"
      #>  $ hour          : chr "numeric"
      #>  $ minute        : chr "numeric"
      #>  $ time_hour     : chr [1:2] "POSIXct" "POSIXt"
    3. Compute the number of unique values in each column of iris.

    iris %>% 
      map_int(n_distinct)
    #> Sepal.Length  Sepal.Width Petal.Length  Petal.Width      Species 
    #>           35           23           43           22            3
    4. Generate 10 random normals from distributions with means of -10, 0, 10, and 100.
    means <- c(-10,0,10,100) 
    means %>% 
      map(.f = rnorm, n = 10) %>% 
      str()
    #> List of 4
    #>  $ : num [1:10] -8.72 -9.98 -8.7 -10.98 -8.89 ...
    #>  $ : num [1:10] 0.4208 2.1663 -0.9836 0.9044 -0.0486 ...
    #>  $ : num [1:10] 10.53 12.08 9.01 9.73 9.97 ...
    #>  $ : num [1:10] 99 99.5 101.5 100.2 99.7 ...
  2. How can you create a single vector that for each column in a data frame indicates whether or not it’s a factor?

    library(palmerpenguins)
    penguins %>% 
      map_lgl(is.factor)
    #>           species            island    bill_length_mm     bill_depth_mm 
    #>              TRUE              TRUE             FALSE             FALSE 
    #> flipper_length_mm       body_mass_g               sex              year 
    #>             FALSE             FALSE              TRUE             FALSE
  3. What happens when you use the map functions on vectors that aren’t lists? What does map(1:5, runif) do? Why?

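    The map functions also work on atomic vectors, treating each element as one input. Here map() passes 1, then 2, then 3, and so on, to the first argument of runif(), which is n, the number of random uniform values to generate. The result is a list of five vectors with lengths 1 to 5.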

    map(1:5, runif)
    #> [[1]]
    #> [1] 0.1289916
    #> 
    #> [[2]]
    #> [1] 0.3131800 0.2377395
    #> 
    #> [[3]]
    #> [1] 0.6506688 0.8067705 0.9219552
    #> 
    #> [[4]]
    #> [1] 0.18483083 0.08303098 0.04313698 0.21822547
    #> 
    #> [[5]]
    #> [1] 0.8439198 0.6398833 0.1607786 0.5981102 0.7360123
  4. What does map(-2:2, rnorm, n = 5) do? Why? What does map_dbl(-2:2, rnorm, n = 5) do? Why?

    map(-2:2, rnorm, n = 5) calls rnorm() once for each of -2, -1, 0, 1 and 2. Because n = 5 is supplied by name, each element is matched to the next free argument, mean, so the result is a list of five vectors, each holding 5 draws.

    map_dbl(-2:2, rnorm, n = 5) makes the same five calls, but map_dbl() must return a double vector of the same length as the input, which requires every call to return exactly one value. Each call returns 5 values, so map_dbl() throws an error.

    map(-2:2, rnorm, n = 5)
    #> [[1]]
    #> [1] -3.249627 -2.514916 -3.880303 -2.265142 -2.744386
    #> 
    #> [[2]]
    #> [1] -1.64073179 -0.82186021 -0.05958671 -0.49470928  0.22813335
    #> 
    #> [[3]]
    #> [1] -0.9775321  2.0979430  1.7334248 -1.1306294  0.9757886
    #> 
    #> [[4]]
    #> [1]  2.0303995  0.8985192 -0.5593058  1.0157854  1.0507722
    #> 
    #> [[5]]
    #> [1] 1.269560 3.608558 1.851002 2.412937 2.099466
  5. Rewrite map(x, function(df) lm(mpg ~ wt, data = df)) to eliminate the anonymous function.

    # map(x, function(df) lm(mpg ~ wt, data = df))
    
    map(x, ~lm(mpg~wt, data = .))
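
Returning to exercise 4 above, the difference between map() and map_dbl() can be checked directly. In this sketch the error is caught with tryCatch() and replaced by a placeholder string, because the exact wording of the message varies between purrr versions. The names `draws`, `result` and `all_draws` are illustrative.

```r
library(purrr)

# each call returns five numbers, which map() can hold in a list
draws <- map(-2:2, rnorm, n = 5)
lengths(draws)
#> [1] 5 5 5 5 5

# map_dbl() needs exactly one double per element, so this call fails
result <- tryCatch(
  map_dbl(-2:2, rnorm, n = 5),
  error = function(e) "error"
)
result
#> [1] "error"

# one way to get a single double vector: flatten the list afterwards
all_draws <- unlist(draws)
length(all_draws)
#> [1] 25
```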

sessionInfo()
#> R version 3.6.3 (2020-02-29)
#> Platform: x86_64-w64-mingw32/x64 (64-bit)
#> Running under: Windows 10 x64 (build 18363)
#> 
#> Matrix products: default
#> 
#> locale:
#> [1] LC_COLLATE=English_South Africa.1252  LC_CTYPE=English_South Africa.1252   
#> [3] LC_MONETARY=English_South Africa.1252 LC_NUMERIC=C                         
#> [5] LC_TIME=English_South Africa.1252    
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> other attached packages:
#>  [1] palmerpenguins_0.1.0 werpals_0.1.0        lubridate_1.7.9     
#>  [4] magrittr_1.5         flair_0.0.2          forcats_0.5.0       
#>  [7] stringr_1.4.0        dplyr_1.0.2          purrr_0.3.4         
#> [10] readr_1.4.0          tidyr_1.1.2          tibble_3.0.3        
#> [13] ggplot2_3.3.2        tidyverse_1.3.0      workflowr_1.6.2     
#> 
#> loaded via a namespace (and not attached):
#>  [1] Rcpp_1.0.4.6       ps_1.3.2           assertthat_0.2.1   rprojroot_1.3-2   
#>  [5] digest_0.6.27      utf8_1.1.4         R6_2.4.1           cellranger_1.1.0  
#>  [9] backports_1.1.6    reprex_0.3.0       evaluate_0.14      httr_1.4.2        
#> [13] pillar_1.4.6       rlang_0.4.8        readxl_1.3.1       rstudioapi_0.11   
#> [17] whisker_0.4        rmarkdown_2.4      nycflights13_1.0.1 munsell_0.5.0     
#> [21] broom_0.7.2        compiler_3.6.3     httpuv_1.5.2       modelr_0.1.8      
#> [25] xfun_0.13          pkgconfig_2.0.3    htmltools_0.5.0    tidyselect_1.1.0  
#> [29] emo_0.0.0.9000     fansi_0.4.1        crayon_1.3.4       dbplyr_2.0.0      
#> [33] withr_2.2.0        later_1.0.0        grid_3.6.3         jsonlite_1.7.1    
#> [37] gtable_0.3.0       lifecycle_0.2.0    DBI_1.1.0          git2r_0.26.1      
#> [41] scales_1.1.0       cli_2.1.0          stringi_1.5.3      fs_1.5.0          
#> [45] promises_1.1.0     xml2_1.3.2         ellipsis_0.3.1     generics_0.0.2    
#> [49] vctrs_0.3.2        tools_3.6.3        glue_1.4.2         hms_0.5.3         
#> [53] yaml_2.2.1         colorspace_1.4-1   rvest_0.3.6        knitr_1.28        
#> [57] haven_2.3.1