Last updated: 2020-11-01

Checks: 7 0

Knit directory: r4ds_book/

This reproducible R Markdown analysis was created with workflowr (version 1.6.2). The Checks tab describes the reproducibility checks that were applied when the results were created. The Past versions tab lists the development history.

R Markdown file: up-to-date

Great! Since the R Markdown file has been committed to the Git repository, you know the exact version of the code that produced these results.

Environment: empty

Great job! The global environment was empty. Objects defined in the global environment can affect the analysis in your R Markdown file in unknown ways. For reproducibility it’s best to always run the code in an empty environment.

Seed: set.seed(20200814)

The command set.seed(20200814) was run prior to running the code in the R Markdown file. Setting a seed ensures that any results that rely on randomness, e.g. subsampling or permutations, are reproducible.

Session information: recorded

Great job! Recording the operating system, R version, and package versions is critical for reproducibility.

Cache: none

Nice! There were no cached chunks for this analysis, so you can be confident that you successfully produced the results during this run.

File paths: relative

Great job! Using relative paths to the files within your workflowr project makes it easier to run your code on other machines.

Repository version: a8057e7

Great! You are using Git for version control. Tracking code development and connecting the code version to the results is critical for reproducibility.

The results in this page were generated with repository version a8057e7. See the Past versions tab to see a history of the changes made to the R Markdown and HTML files.

Note that you need to be careful to ensure that all relevant files for the analysis have been committed to Git prior to generating the results (you can use wflow_publish or wflow_git_commit). workflowr only checks the R Markdown file, but you know if there are other scripts or data files that it depends on. Below is the status of the Git repository when the results were generated:


Ignored files:
    Ignored:    .Rproj.user/

Untracked files:
    Untracked:  analysis/images/
    Untracked:  code_snipp.txt

Note that any generated files, e.g. HTML, png, CSS, etc., are not included in this status report because it is ok for generated content to have uncommitted changes.

These are the previous versions of the repository in which changes were made to the R Markdown (analysis/ch11_strings.Rmd) and HTML (docs/ch11_strings.html) files. If you’ve configured a remote Git repository (see ?wflow_git_remote), click on the hyperlinks in the table below to view the files as they were in that past version.

File	Version	Author	Date	Message
Rmd	a8057e7	sciencificity	2020-11-01	added ch11
html	0aef1b0	sciencificity	2020-10-31	Build site.
Rmd	72ad7d9	sciencificity	2020-10-31	added ch10

Strings


String Basics

(string1 <- "This is a string")
#> [1] "This is a string"
(string2 <- 'To put a "quote" inside a string, use single quotes')
#> [1] "To put a \"quote\" inside a string, use single quotes"

writeLines(string1)
#> This is a string
writeLines(string2)
#> To put a "quote" inside a string, use single quotes

double_quote <- "\"" # or '"'
single_quote <- '\'' # or "'"

If you want to include a literal backslash, you’ll need to double it up: "\\".

The printed representation of a string is not the same as the string itself, because the printed representation shows the escapes. To see the raw contents of the string, use writeLines():

x <- c("\"", "\\")
x
#> [1] "\"" "\\"
writeLines(x)
#> "
#> \

Other useful escapes:

"\n": newline
"\t": tab
See the complete list by requesting help on ": ?'"' or ?"'".
Strings like "\u00b5" use a Unicode escape, which is a way of writing non-English characters.
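As a small illustration of the Unicode escape in the list above, here is a minimal sketch in base R. The variable name mu is only for illustration; how the character is displayed may depend on your console encoding.

```r
# "\u00b5" is the Unicode escape for the micro sign
mu <- "\u00b5"

# It is a single character, even though it is typed as six
nchar(mu)

# Its code point is 0xB5, i.e. 181 in decimal
utf8ToInt(mu)

# writeLines() shows the character itself rather than the escape
writeLines(mu)
```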

(string3 <- "This\tis\ta\tstring\twith\t\ttabs\tin\tit.\nHow about that?")
#> [1] "This\tis\ta\tstring\twith\t\ttabs\tin\tit.\nHow about that?"
writeLines(string3)
#> This is  a   string  with        tabs    in  it.
#> How about that?

## From `?'"'` help page
## Backslashes need doubling, or they have a special meaning.
x <- "In ALGOL, you could do logical AND with /\\."
print(x)      # shows it as above ("input-like")
#> [1] "In ALGOL, you could do logical AND with /\\."
writeLines(x) # shows it as you like it ;-)
#> In ALGOL, you could do logical AND with /\.

Some String Functions

String Length

Use str_length().

str_length(c("a", "R for Data Science", NA))
#> [1]  1 18 NA

Combining Strings

Use str_c().

Use sep = some_char to separate values with a character; the default separator is the empty string.
Shorter vectors are recycled.
Use str_replace_na() to replace NAs with the literal string "NA".
Objects of length 0 are silently dropped.
Use collapse to reduce a vector of strings to a single string.

str_c("a", "R for Data Science")
#> [1] "aR for Data Science"

str_c("x", "y", "z")
#> [1] "xyz"

str_c("x", "y", "z", sep = ", ") # separate using character
#> [1] "x, y, z"

str_c("prefix-", c("a","b", "c"), "-suffix")
#> [1] "prefix-a-suffix" "prefix-b-suffix" "prefix-c-suffix"

x <- c("abc", NA)

str_c("|=", x, "=|") # inputs of length 1, 2 and 1: the length-1 inputs are recycled
#> [1] "|=abc=|" NA

str_c("|=", str_replace_na(x), "=|") # to actually show the NA
#> [1] "|=abc=|" "|=NA=|"

Notice that the shorter vector is recycled.

Objects of 0 length are dropped.

name <- "Vebash"
time_of_day <- "evening"
birthday <- FALSE

str_c("Good ", time_of_day, " ",
      name, if(birthday) ' and Happy Birthday!')
#> [1] "Good evening Vebash"

str_c("prefix-", c("a","b", "c"), "-suffix", collapse = ', ')
#> [1] "prefix-a-suffix, prefix-b-suffix, prefix-c-suffix"

str_c("prefix-", c("a","b", "c"), "-suffix") # note the diff without
#> [1] "prefix-a-suffix" "prefix-b-suffix" "prefix-c-suffix"

Subsetting Strings

Use str_sub().

The start and end arguments give the (inclusive) positions of the substring you’re looking for; negative values count backwards from the end of the string.
It does not fail if the string is too short; it returns as much as it can.
You can use the assignment form of str_sub() to modify strings.

x <- c("Apple", "Banana", "Pear")

str_sub(x, 1, 3) # get 1st three chars of each
#> [1] "App" "Ban" "Pea"

str_sub(x, -3, -1) # get last three chars of each
#> [1] "ple" "ana" "ear"

str_sub("a", 1, 5) # too short but no failure
#> [1] "a"

x # before change
#> [1] "Apple"  "Banana" "Pear"

# Go get from x the 1st char, and assign to it
# the lower version of its character
str_sub(x, 1, 1) <- str_to_lower(str_sub(x, 1, 1))

x # after the str_sub assign above
#> [1] "apple"  "banana" "pear"
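The same ideas also work from the end of the string. The sketch below is an illustration only (the names fruit and last_two are made up here): it omits end, which then defaults to the last character, and uses the assignment form with negative positions.

```r
library(stringr)

fruit <- c("Apple", "Banana", "Pear")

# With only a negative start, end defaults to the last character,
# so this takes the last two characters of each string
last_two <- str_sub(fruit, -2)
last_two

# Assignment form: overwrite the final character of each string
str_sub(fruit, -1, -1) <- "!"
fruit
```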

Locales

str_to_lower(), str_to_upper() and str_to_title() all change case. The rules for changing case may depend on your locale, though.

# Turkish has two i's: with and without a dot, and it
# has a different rule for capitalising them:
str_to_upper(c("i", "ı"))
#> [1] "I" "I"
str_to_upper(c("i", "ı"), locale = "tr")
#> [1] "İ" "I"

Sorting is also affected by locales. In base R we use sort() or order(); in {stringr} we use str_sort() and str_order(), which take an additional locale argument.

x <- c("apple", "banana", "eggplant")

str_sort(x, locale = "en")
#> [1] "apple"    "banana"   "eggplant"

str_sort(x, locale = "haw")
#> [1] "apple"    "eggplant" "banana"

str_order(x, locale = "en")
#> [1] 1 2 3

str_order(x, locale = "haw")
#> [1] 1 3 2

Exercises

In code that doesn’t use stringr, you’ll often see paste() and paste0(). What’s the difference between the two functions? What stringr function are they equivalent to? How do the functions differ in their handling of NA?

# from the help page
## When passing a single vector, paste0 and paste work like as.character.
paste0(1:12)
#>  [1] "1"  "2"  "3"  "4"  "5"  "6"  "7"  "8"  "9"  "10" "11" "12"
paste(1:12)        # same
#>  [1] "1"  "2"  "3"  "4"  "5"  "6"  "7"  "8"  "9"  "10" "11" "12"
as.character(1:12) # same
#>  [1] "1"  "2"  "3"  "4"  "5"  "6"  "7"  "8"  "9"  "10" "11" "12"

## If you pass several vectors to paste0, they are concatenated in a
## vectorized way.
(nth <- paste0(1:12, c("st", "nd", "rd", rep("th", 9))))
#>  [1] "1st"  "2nd"  "3rd"  "4th"  "5th"  "6th"  "7th"  "8th"  "9th"  "10th"
#> [11] "11th" "12th"

(nth <- paste(1:12, c("st", "nd", "rd", rep("th", 9))))
#>  [1] "1 st"  "2 nd"  "3 rd"  "4 th"  "5 th"  "6 th"  "7 th"  "8 th"  "9 th" 
#> [10] "10 th" "11 th" "12 th"

(nth <- str_c(1:12, c("st", "nd", "rd", rep("th", 9))))
#>  [1] "1st"  "2nd"  "3rd"  "4th"  "5th"  "6th"  "7th"  "8th"  "9th"  "10th"
#> [11] "11th" "12th"


(na_th <- paste0(1:13, c("st", "nd", "rd", rep("th", 9), NA)))
#>  [1] "1st"  "2nd"  "3rd"  "4th"  "5th"  "6th"  "7th"  "8th"  "9th"  "10th"
#> [11] "11th" "12th" "13NA"

(na_th <- paste(1:13, c("st", "nd", "rd", rep("th", 9), NA)))
#>  [1] "1 st"  "2 nd"  "3 rd"  "4 th"  "5 th"  "6 th"  "7 th"  "8 th"  "9 th" 
#> [10] "10 th" "11 th" "12 th" "13 NA"

(na_th <- str_c(1:13, c("st", "nd", "rd", rep("th", 9), NA)))
#>  [1] "1st"  "2nd"  "3rd"  "4th"  "5th"  "6th"  "7th"  "8th"  "9th"  "10th"
#> [11] "11th" "12th" NA

paste() inserts a space between values by default; this can be overridden, e.g. with sep = "".
paste0() uses the empty string as its separator, so the resulting values have no spaces in between.
str_c() is the stringr equivalent; like paste0(), its default separator is the empty string.
paste() and paste0() convert NA to the literal string "NA", whereas str_c() treats NA as missing, so any element combined with an NA becomes NA.
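To make the NA difference concrete, here is a minimal sketch (the variable names strict and lenient are only for illustration) showing that wrapping the input in str_replace_na() makes str_c() behave like paste0():

```r
library(stringr)

x <- c("a", NA)

# str_c() propagates the missing value
strict <- str_c(x, "-end")
strict

# Replacing NA with the string "NA" first mimics paste0()
lenient <- str_c(str_replace_na(x), "-end")
lenient

identical(lenient, paste0(x, "-end"))
```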

In your own words, describe the difference between the sep and collapse arguments to str_c().

sep is the string inserted between the corresponding elements of the input vectors when they are concatenated element-wise.
collapse is the string inserted between the resulting elements when they are then combined into a single string.

(na_th_sep <- str_c(1:12, c("st", "nd", "rd", rep("th", 9)),
                    # sep only
                    sep = "'"))
#>  [1] "1'st"  "2'nd"  "3'rd"  "4'th"  "5'th"  "6'th"  "7'th"  "8'th"  "9'th" 
#> [10] "10'th" "11'th" "12'th"

(na_th_col <- str_c(1:12, c("st", "nd", "rd", rep("th", 9)),
                    # collapse only
                    collapse = "; "))
#> [1] "1st; 2nd; 3rd; 4th; 5th; 6th; 7th; 8th; 9th; 10th; 11th; 12th"

(na_th <- str_c(1:12, c("st", "nd", "rd", rep("th", 9)),
                # both
                sep = " ", collapse = ", "))
#> [1] "1 st, 2 nd, 3 rd, 4 th, 5 th, 6 th, 7 th, 8 th, 9 th, 10 th, 11 th, 12 th"

Use str_length() and str_sub() to extract the middle character from a string. What will you do if the string has an even number of characters?

x <- "This is a string."

y <- "This is a string, no full stop"

z <- "I"

str_length(x)/2
#> [1] 8.5
str_length(y)/2
#> [1] 15

str_sub(x, ceiling(str_length(x)/2),
        ceiling(str_length(x)/2))
#> [1] "a"

str_sub(y, str_length(y)/2,
        str_length(y)/2 + 1)
#> [1] "ng"

str_sub(z, ceiling(str_length(z)/2),
        ceiling(str_length(z)/2))
#> [1] "I"
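The steps above can be wrapped in one helper. This is only a sketch: the function name middle_chars is made up here, and returning the two middle characters for even-length strings is the same choice as in the code above, not the only possible answer.

```r
library(stringr)

# Return the middle character of each string, or the two middle
# characters when the string has an even number of characters
middle_chars <- function(s) {
  n <- str_length(s)
  start <- ceiling(n / 2)  # odd n: the middle; even n: left of centre
  end <- floor(n / 2) + 1  # odd n: the middle; even n: right of centre
  str_sub(s, start, end)
}

middle_chars(c("This is a string.", "This is a string, no full stop", "I"))
```

Because str_length() and str_sub() are vectorised, the helper works on a whole character vector at once.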

What does str_wrap() do? When might you want to use it?

It is a wrapper around stringi::stri_wrap() which implements the Knuth-Plass paragraph wrapping algorithm.

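The text is wrapped to a given width. The default is 80; overriding this with width = 40 targets lines of about 40 characters. Further arguments such as indent (the indentation of the first line of each paragraph) may be specified. It is useful when long text must fit a fixed width, for example in console output or plot labels.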
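A short sketch of str_wrap() in use; the sample sentence and the names txt, wrapped and wrapped_lines are only for illustration. The result is a single string with embedded newlines, so writeLines() is used to display it.

```r
library(stringr)

txt <- "The quick brown fox jumps over the lazy dog and keeps on running."

# Wrap to a target width of 20 characters
wrapped <- str_wrap(txt, width = 20)
writeLines(wrapped)

# Split on the inserted newlines to inspect the individual line lengths
wrapped_lines <- str_split(wrapped, "\n")[[1]]
str_length(wrapped_lines)
```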

What does str_trim() do? What’s the opposite of str_trim()?

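It removes whitespace from the start and end of a string; the side argument restricts this to the left or right only. The opposite is str_pad(), which adds whitespace (or another padding character) until the string reaches a given width.

str_squish() goes further: it removes whitespace at the beginning and end of the string and reduces repeated whitespace in the middle to a single space. 🥂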

(x <- str_trim("  This has \n some spaces   in the     middle and end    "))
#> [1] "This has \n some spaces   in the     middle and end"
# whitespace removed from begin and end of string
writeLines(x)
#> This has 
#>  some spaces   in the     middle and end

(y <- str_squish("  This has \n some spaces   in the     middle and end    ... oh, not any more ;)"))
#> [1] "This has some spaces in the middle and end ... oh, not any more ;)"
# whitespace removed from begin, middle and end of string
writeLines(y)
#> This has some spaces in the middle and end ... oh, not any more ;)
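The examples above only show trimming, so here is a minimal sketch of the opposite direction with str_pad(), plus the side argument of str_trim(). The variable names are only for illustration.

```r
library(stringr)

# str_pad() adds whitespace until the string reaches the given width;
# by default the padding goes on the left
padded_left <- str_pad("abc", width = 6)
padded_left

# side = "both" splits the padding between the two ends
padded_both <- str_pad("abc", width = 7, side = "both")
padded_both

# str_trim() undoes the padding
trimmed <- str_trim(padded_both)
trimmed

# side restricts trimming to one end
left_only <- str_trim("  abc  ", side = "left")
left_only
```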

Write a function that turns (e.g.) a vector c("a", "b", "c") into the string a, b, and c. Think carefully about what it should do if given a vector of length 0, 1, or 2.

length 0: return empty string
length 1: return string
length 2: return first part “and” second part
length 3: return first part “,” second part “and” third part.

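stringify <- function(v){
  # length 0: nothing to join, so return the empty string
  if (length(v) == 0){
    ""
  }
  # length 1: return the single value unchanged
  else if (length(v) == 1){
    v
  }
  else if (length(v) == 2){
    str_c(v, collapse = " and ")
  }
  else {
    # put " and " before the last element, ", " after all but the
    # last two elements, then glue the pieces into one string
    str_c(c(rep("", (length(v) - 1)), " and "),
          v, c(rep(", ", (length(v) - 2)), rep("", 2)),
          collapse = "")
  }
}
emp <- character(0) # a genuinely empty (length 0) vector
stringify(emp)
#> [1] ""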

x <- "a"
stringify(x)
#> [1] "a"

y <- c("a", "b")
stringify(y)
#> [1] "a and b"

z <- c("a", "b", "c")
stringify(z)
#> [1] "a, b and c"

l <- letters
stringify(letters)
#> [1] "a, b, c, d, e, f, g, h, i, j, k, l, m, n, o, p, q, r, s, t, u, v, w, x, y and z"
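The exercise text writes the target as a, b, and c, with a comma before "and", while the function above omits it. As an alternative sketch (the name stringify_oxford is made up here), the variant below joins all but the last element with ", " and then appends ", and " plus the last element.

```r
library(stringr)

# Variant that keeps the comma before "and" for three or more values
stringify_oxford <- function(v) {
  n <- length(v)
  if (n == 0) return("")
  if (n == 1) return(v)
  if (n == 2) return(str_c(v, collapse = " and "))
  # join everything except the last element, then add the last one
  str_c(str_c(v[-n], collapse = ", "), ", and ", v[n])
}

stringify_oxford(c("a", "b", "c"))
```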

sessionInfo()
#> R version 3.6.3 (2020-02-29)
#> Platform: x86_64-w64-mingw32/x64 (64-bit)
#> Running under: Windows 10 x64 (build 18363)
#> 
#> Matrix products: default
#> 
#> locale:
#> [1] LC_COLLATE=English_South Africa.1252  LC_CTYPE=English_South Africa.1252   
#> [3] LC_MONETARY=English_South Africa.1252 LC_NUMERIC=C                         
#> [5] LC_TIME=English_South Africa.1252    
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> other attached packages:
#>  [1] tidyquant_1.0.0            quantmod_0.4.17           
#>  [3] TTR_0.23-6                 PerformanceAnalytics_2.0.4
#>  [5] xts_0.12-0                 zoo_1.8-7                 
#>  [7] magrittr_1.5               lubridate_1.7.8           
#>  [9] emo_0.0.0.9000             flair_0.0.2               
#> [11] forcats_0.5.0              stringr_1.4.0             
#> [13] dplyr_1.0.0                purrr_0.3.4               
#> [15] readr_1.3.1                tidyr_1.1.0               
#> [17] tibble_3.0.3               ggplot2_3.3.0             
#> [19] tidyverse_1.3.0            workflowr_1.6.2           
#> 
#> loaded via a namespace (and not attached):
#>  [1] Rcpp_1.0.4.6     lattice_0.20-38  assertthat_0.2.1 rprojroot_1.3-2 
#>  [5] digest_0.6.25    R6_2.4.1         cellranger_1.1.0 backports_1.1.6 
#>  [9] reprex_0.3.0     evaluate_0.14    httr_1.4.2       pillar_1.4.6    
#> [13] rlang_0.4.7      curl_4.3         readxl_1.3.1     rstudioapi_0.11 
#> [17] whisker_0.4      rmarkdown_2.4    munsell_0.5.0    broom_0.5.6     
#> [21] compiler_3.6.3   httpuv_1.5.2     modelr_0.1.6     xfun_0.13       
#> [25] pkgconfig_2.0.3  htmltools_0.5.0  tidyselect_1.1.0 quadprog_1.5-8  
#> [29] fansi_0.4.1      crayon_1.3.4     dbplyr_1.4.3     withr_2.2.0     
#> [33] later_1.0.0      Quandl_2.10.0    grid_3.6.3       nlme_3.1-144    
#> [37] jsonlite_1.7.0   gtable_0.3.0     lifecycle_0.2.0  DBI_1.1.0       
#> [41] git2r_0.26.1     scales_1.1.0     cli_2.0.2        stringi_1.4.6   
#> [45] fs_1.4.1         promises_1.1.0   xml2_1.3.2       ellipsis_0.3.1  
#> [49] generics_0.0.2   vctrs_0.3.2      tools_3.6.3      glue_1.4.1      
#> [53] hms_0.5.3        yaml_2.2.1       colorspace_1.4-1 rvest_0.3.5     
#> [57] knitr_1.28       haven_2.2.0

Chapter 11 - Strings with {stringr}

Vebash Naidoo

31/10/2020
