Last updated: 2020-11-01

These are the previous versions of the repository in which changes were made to the R Markdown (analysis/ch11_strings.Rmd) and HTML (docs/ch11_strings.html) files. If you’ve configured a remote Git repository (see ?wflow_git_remote), click on the hyperlinks in the table below to view the files as they were in that past version.

File Version Author Date Message
Rmd 3ec3460 sciencificity 2020-11-01 more of ch11
html bf15f3b sciencificity 2020-11-01 Build site.
Rmd a8057e7 sciencificity 2020-11-01 added ch11
html 0aef1b0 sciencificity 2020-10-31 Build site.
Rmd 72ad7d9 sciencificity 2020-10-31 added ch10

Strings

String Basics

(string1 <- "This is a string")
#> [1] "This is a string"
(string2 <- 'To put a "quote" inside a string, use single quotes')
#> [1] "To put a \"quote\" inside a string, use single quotes"

writeLines(string1)
#> This is a string
writeLines(string2)
#> To put a "quote" inside a string, use single quotes
double_quote <- "\"" # or '"'
single_quote <- '\'' # or "'"

If you want to include a literal backslash, you’ll need to double it up: "\\".

The printed representation of a string is not the same as the string itself, because the printed representation shows the escapes. To see the raw contents of the string, use writeLines():

x <- c("\"", "\\")
x
#> [1] "\"" "\\"
writeLines(x)
#> "
#> \

Other useful escapes:

  • "\n": newline
  • "\t": tab
  • See the complete list by getting help on ": ?'"', or ?"'".
  • When you see strings like "\u00b5", this is a way of writing non-English characters.
(string3 <- "This\tis\ta\tstring\twith\t\ttabs\tin\tit.\nHow about that?")
#> [1] "This\tis\ta\tstring\twith\t\ttabs\tin\tit.\nHow about that?"
writeLines(string3)
#> This is  a   string  with        tabs    in  it.
#> How about that?

## From `?'"'` help page
## Backslashes need doubling, or they have a special meaning.
x <- "In ALGOL, you could do logical AND with /\\."
print(x)      # shows it as above ("input-like")
#> [1] "In ALGOL, you could do logical AND with /\\."
writeLines(x) # shows it as you like it ;-)
#> In ALGOL, you could do logical AND with /\.
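
As a small check of the point above, the sketch below counts characters with base R's nchar(): each escape sequence stands for exactly one character, even though its printed form shows two or more. The vector name escapes and its element names are made up for illustration.

```r
# Each escape sequence below stands for exactly one character.
escapes <- c(backslash = "\\", quote = "\"", newline = "\n",
             tab = "\t", micro = "\u00b5")

# nchar() counts the characters in the string itself,
# not the characters in its printed (escaped) representation.
n_chars <- nchar(escapes)
n_chars

# print() shows the escaped form; writeLines() shows the raw contents.
print(escapes[["backslash"]])
writeLines(escapes[["backslash"]])
```

Every element of n_chars is 1, which is why "\\" is a single backslash once it is inside the string.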

Some String Functions

String Length

Use str_length().

str_length(c("a", "R for Data Science", NA))
#> [1]  1 18 NA

Combining Strings

Use str_c().

  • Use sep = some_char to separate the values with a character; the default separator is the empty string.
  • Shorter vectors are recycled.
  • Use str_replace_na(x) to replace NA values with the literal string "NA".
  • Objects of length 0 are silently dropped.
  • Use collapse to reduce a vector of strings to a single string.
str_c("a", "R for Data Science")
#> [1] "aR for Data Science"

str_c("x", "y", "z")
#> [1] "xyz"

str_c("x", "y", "z", sep = ", ") # separate using character
#> [1] "x, y, z"

str_c("prefix-", c("a","b", "c"), "-suffix")
#> [1] "prefix-a-suffix" "prefix-b-suffix" "prefix-c-suffix"
x <- c("abc", NA)

str_c("|=", x, "=|") # concatenating a 1 long, with 2 long, with 1 long
#> [1] "|=abc=|" NA

str_c("|=", str_replace_na(x), "=|") # to actually show the NA
#> [1] "|=abc=|" "|=NA=|"

Notice that the shorter vector is recycled.

Objects of 0 length are dropped.

name <- "Vebash"
time_of_day <- "evening"
birthday <- FALSE

str_c("Good ", time_of_day, " ",
      name, if(birthday) ' and Happy Birthday!')
#> [1] "Good evening Vebash"

str_c("prefix-", c("a","b", "c"), "-suffix", collapse = ', ')
#> [1] "prefix-a-suffix, prefix-b-suffix, prefix-c-suffix"

str_c("prefix-", c("a","b", "c"), "-suffix") # note the diff without
#> [1] "prefix-a-suffix" "prefix-b-suffix" "prefix-c-suffix"

Subsetting Strings

Use str_sub().

  • The start and end arguments give the (inclusive) positions of the substring you’re looking for.
  • It does not fail if the string is too short; it returns as much as it can.
  • You can use the assignment form of str_sub() to modify strings.
x <- c("Apple", "Banana", "Pear")

str_sub(x, 1, 3) # get 1st three chars of each
#> [1] "App" "Ban" "Pea"

str_sub(x, -3, -1) # get last three chars of each
#> [1] "ple" "ana" "ear"

str_sub("a", 1, 5) # too short but no failure
#> [1] "a"

x # before change
#> [1] "Apple"  "Banana" "Pear"

# Go get from x the 1st char, and assign to it
# the lower version of its character
str_sub(x, 1, 1) <- str_to_lower(str_sub(x, 1, 1))

x # after the str_sub assign above
#> [1] "apple"  "banana" "pear"

Locales

str_to_lower(), str_to_upper() and str_to_title() all change case. The result may depend on your locale, though.

# Turkish has two i's: with and without a dot, and it
# has a different rule for capitalising them:
str_to_upper(c("i", "ı"))
#> [1] "I" "I"
str_to_upper(c("i", "ı"), locale = "tr")
#> [1] "İ" "I"
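
The three case functions are named above but only str_to_upper() is shown. A minimal sketch of all three, using the default locale and a made-up example string:

```r
library(stringr)

title <- "r for data science"

upper  <- str_to_upper(title)  # every letter upper-cased
lower  <- str_to_lower(upper)  # and back to lower case
titled <- str_to_title(title)  # first letter of each word upper-cased

upper
#> [1] "R FOR DATA SCIENCE"
lower
#> [1] "r for data science"
titled
#> [1] "R For Data Science"
```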

Sorting is also affected by locale. In base R we use sort() or order(); in {stringr} we use str_sort() and str_order(), which take an additional locale argument.

x <- c("apple", "banana", "eggplant")

str_sort(x, locale = "en")
#> [1] "apple"    "banana"   "eggplant"

str_sort(x, locale = "haw")
#> [1] "apple"    "eggplant" "banana"

str_order(x, locale = "en")
#> [1] 1 2 3

str_order(x, locale = "haw")
#> [1] 1 3 2

Exercises

  1. In code that doesn’t use stringr, you’ll often see paste() and paste0(). What’s the difference between the two functions? What stringr function are they equivalent to? How do the functions differ in their handling of NA?

    # from the help page
    ## When passing a single vector, paste0 and paste work like as.character.
    paste0(1:12)
    #>  [1] "1"  "2"  "3"  "4"  "5"  "6"  "7"  "8"  "9"  "10" "11" "12"
    paste(1:12)        # same
    #>  [1] "1"  "2"  "3"  "4"  "5"  "6"  "7"  "8"  "9"  "10" "11" "12"
    as.character(1:12) # same
    #>  [1] "1"  "2"  "3"  "4"  "5"  "6"  "7"  "8"  "9"  "10" "11" "12"
    
    ## If you pass several vectors to paste0, they are concatenated in a
    ## vectorized way.
    (nth <- paste0(1:12, c("st", "nd", "rd", rep("th", 9))))
    #>  [1] "1st"  "2nd"  "3rd"  "4th"  "5th"  "6th"  "7th"  "8th"  "9th"  "10th"
    #> [11] "11th" "12th"
    
    (nth <- paste(1:12, c("st", "nd", "rd", rep("th", 9))))
    #>  [1] "1 st"  "2 nd"  "3 rd"  "4 th"  "5 th"  "6 th"  "7 th"  "8 th"  "9 th" 
    #> [10] "10 th" "11 th" "12 th"
    
    (nth <- str_c(1:12, c("st", "nd", "rd", rep("th", 9))))
    #>  [1] "1st"  "2nd"  "3rd"  "4th"  "5th"  "6th"  "7th"  "8th"  "9th"  "10th"
    #> [11] "11th" "12th"
    
    
    (na_th <- paste0(1:13, c("st", "nd", "rd", rep("th", 9), NA)))
    #>  [1] "1st"  "2nd"  "3rd"  "4th"  "5th"  "6th"  "7th"  "8th"  "9th"  "10th"
    #> [11] "11th" "12th" "13NA"
    
    (na_th <- paste(1:13, c("st", "nd", "rd", rep("th", 9), NA)))
    #>  [1] "1 st"  "2 nd"  "3 rd"  "4 th"  "5 th"  "6 th"  "7 th"  "8 th"  "9 th" 
    #> [10] "10 th" "11 th" "12 th" "13 NA"
    
    (na_th <- str_c(1:13, c("st", "nd", "rd", rep("th", 9), NA)))
    #>  [1] "1st"  "2nd"  "3rd"  "4th"  "5th"  "6th"  "7th"  "8th"  "9th"  "10th"
    #> [11] "11th" "12th" NA
    • paste() inserts a space between values, and may be overridden with sep = "". In other words the default separator is a space.

    • paste0() has a separator that is by default the empty string so resulting vector values have no spaces in between.

    • str_c() is the stringr equivalent.

    • paste() and paste0() treat NA values as the literal string "NA", whereas str_c() treats NA as missing, so any element combined with an NA becomes NA.

  2. In your own words, describe the difference between the sep and collapse arguments to str_c().

    • sep is the separator that appears between vector values when these are concatenated in a vectorised fashion.
    • collapse is the separator between values when all vectors are collapsed into a single contiguous string value.
    (na_th_sep <- str_c(1:12, c("st", "nd", "rd", rep("th", 9)),
                        # sep only
                        sep = "'"))
    #>  [1] "1'st"  "2'nd"  "3'rd"  "4'th"  "5'th"  "6'th"  "7'th"  "8'th"  "9'th" 
    #> [10] "10'th" "11'th" "12'th"
    
    (na_th_col <- str_c(1:12, c("st", "nd", "rd", rep("th", 9)),
                        # collapse only
                        collapse = "; "))
    #> [1] "1st; 2nd; 3rd; 4th; 5th; 6th; 7th; 8th; 9th; 10th; 11th; 12th"
    
    (na_th <- str_c(1:12, c("st", "nd", "rd", rep("th", 9)),
                    # both
                    sep = " ", collapse = ", "))
    #> [1] "1 st, 2 nd, 3 rd, 4 th, 5 th, 6 th, 7 th, 8 th, 9 th, 10 th, 11 th, 12 th"
  3. Use str_length() and str_sub() to extract the middle character from a string. What will you do if the string has an even number of characters?

    x <- "This is a string."
    
    y <- "This is a string, no full stop"
    
    z <- "I"
    
    str_length(x)/2
    #> [1] 8.5
    str_length(y)/2
    #> [1] 15
    
    str_sub(x, ceiling(str_length(x)/2),
            ceiling(str_length(x)/2))
    #> [1] "a"
    
    str_sub(y, str_length(y)/2,
            str_length(y)/2 + 1)
    #> [1] "ng"
    
    str_sub(z, ceiling(str_length(z)/2),
            ceiling(str_length(z)/2))
    #> [1] "I"
  4. What does str_wrap() do? When might you want to use it?

    It is a wrapper around stringi::stri_wrap() which implements the Knuth-Plass paragraph wrapping algorithm.

    The text is wrapped to a given width. The default is 80; overriding it with width = 40 wraps the text at 40 characters per line. Further arguments such as indent (the indentation of the first line of each paragraph) may be specified. It is useful when long text has to fit a fixed-width display, such as console output.

  5. What does str_trim() do? What’s the opposite of str_trim()?

    It removes whitespace from the left and right of a string. str_pad() does the opposite: it adds whitespace.

    str_squish() removes extra whitespace at the beginning, the end and in the middle of a string. 🥂

    (x <- str_trim("  This has \n some spaces   in the     middle and end    "))
    #> [1] "This has \n some spaces   in the     middle and end"
    # whitespace removed from begin and end of string
    writeLines(x)
    #> This has 
    #>  some spaces   in the     middle and end
    
    (y <- str_squish("  This has \n some spaces   in the     middle and end    ... oh, not any more ;)"))
    #> [1] "This has some spaces in the middle and end ... oh, not any more ;)"
    # whitespace removed from begin, middle and end of string
    writeLines(y)
    #> This has some spaces in the middle and end ... oh, not any more ;)
  6. Write a function that turns (e.g.) a vector c("a", "b", "c") into the string a, b, and c. Think carefully about what it should do if given a vector of length 0, 1, or 2.

    • length 0: return the input unchanged (an empty vector)
    • length 1: return the string unchanged
    • length 2: return the two parts joined by “and”
    • length 3 or more: join the parts with “,” and put “and” before the last part.
    stringify <- function(v){
      if (length(v) == 0 || length(v) == 1){
        v
      }
      else if (length(v) == 2){
        str_c(v, collapse = " and ")
      }
      else if (length(v) > 2){
        str_c(c(rep("", (length(v) - 1)), " and "),
              v, c(rep(", ", (length(v) - 2)), rep("", 2)), 
               collapse = "")
      }
    }
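    emp <- "" # note: a length-1 vector holding the empty string
    stringify(emp)
    #> [1] ""
    
    stringify(character(0)) # a genuinely length-0 vector
    #> character(0)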
    
    x <- "a"
    stringify(x)
    #> [1] "a"
    
    y <- c("a", "b")
    stringify(y)
    #> [1] "a and b"
    
    z <- c("a", "b", "c")
    stringify(z)
    #> [1] "a, b and c"
    
    l <- letters
    stringify(letters)
    #> [1] "a, b, c, d, e, f, g, h, i, j, k, l, m, n, o, p, q, r, s, t, u, v, w, x, y and z"
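
The exercise text asks for "a, b, and c", with a comma before the "and". A minimal sketch of a variant that produces that form and returns an empty string for a length-0 input is shown below. The name stringify_oxford is made up for illustration, and returning "" for length 0 is an assumption about the desired behaviour.

```r
library(stringr)

# Hypothetical variant of stringify(); name and length-0 behaviour are assumptions.
stringify_oxford <- function(v) {
  n <- length(v)
  if (n == 0) return("")                            # nothing to join
  if (n == 1) return(v)                             # single value, returned as-is
  if (n == 2) return(str_c(v, collapse = " and "))  # no comma for two items
  # Join all but the last element with ", ", then append ", and <last>".
  str_c(str_c(v[-n], collapse = ", "), ", and ", v[n])
}

stringify_oxford(character(0))
#> [1] ""
stringify_oxford("a")
#> [1] "a"
stringify_oxford(c("a", "b"))
#> [1] "a and b"
stringify_oxford(c("a", "b", "c"))
#> [1] "a, b, and c"
```

Splitting the vector into "everything but the last" and "the last" keeps the collapse logic to a single str_c() call per part.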

Pattern Matching with Regex

  • Find a specific pattern

    x <- c("apple", "banana", "pear")
    # find any "an" char seq in vector x
    str_view(x, "an")
  • Find any character besides the newline char.

    # find any char followed by an "a" followed by any char
    str_view(x, ".a.") 
  • What if we want to literally match .?

    We need to escape the . to say “hey, literally find me a . char in the string, I don’t want to use its special behaviour this time”.

    \\.

    (dot <- "\\.")
    #> [1] "\\."
    writeLines(dot)
    #> \.
    
    str_view(c("abc", "a$c", "a.c", "b.e"), 
             # find a char 
             # followed by a literal . 
             # followed by another char
             ".\\..")
  • What if we want the literal \?

    Recall that to add a literal backslash in a string we have to escape it using \\.

    (backslash <- "This string contains the \\ char and we
    want to find it.")
    #> [1] "This string contains the \\ char and we\nwant to find it."
    writeLines(backslash)
    #> This string contains the \ char and we
    #> want to find it.

    So to find it using regex we need to escape each backslash in our regex i.e. \\\\. 👿

    writeLines(backslash)
    #> This string contains the \ char and we
    #> want to find it.
    str_view(backslash, "\\\\")
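
Since str_view() renders its result as a widget, the same match can also be checked programmatically. The sketch below uses str_detect() and str_locate(), plus fixed() as an alternative that switches off regex interpretation so only one level of escaping is needed. The variable names and the shortened example string are made up for illustration.

```r
library(stringr)

backslash <- "This string contains the \\ char"

# Regex route: the string "\\\\" holds the two-character regex \\,
# which matches one literal backslash.
found_regex <- str_detect(backslash, "\\\\")

# Literal route: fixed() turns off regex interpretation, so the
# one-character string "\\" is matched as-is.
found_fixed <- str_detect(backslash, fixed("\\"))

# str_locate() reports where the first match starts and ends.
pos <- str_locate(backslash, "\\\\")

found_regex
#> [1] TRUE
found_fixed
#> [1] TRUE
pos
```

Both routes find the backslash, which is the 26th character of the string.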

Exercises

  1. Explain why each of these strings don’t match a \: "\", "\\", "\\\".

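    As we saw above, to represent a literal \ in a string we write "\\". To match it with a regex we must escape each of those backslashes with another \, so the pattern needs 2 * 2, i.e. 4, backslashes: "\\\\". Each of the three strings falls short:

    • "\": the backslash escapes the closing quote, so the string is never terminated.
    • "\\": the string holds a single \, which in a regex is an escape character with nothing after it to escape.
    • "\\\": the first two backslashes give a literal \ in the string, and the third escapes the closing quote, so again the string is never terminated.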

  2. How would you match the sequence "'\?

    (string4 <- "This is the funky string: \"\'\\")
    #> [1] "This is the funky string: \"'\\"
    writeLines(string4)
    #> This is the funky string: "'\
    str_view(string4, "\\\"\\\'\\\\")
  3. What patterns will the regular expression \..\..\.. match? How would you represent it as a string?

    It matches a literal . followed by any character, three times in a row, e.g. .x.y.z.

    (string5 <- ".x.y.z something else .z.a.r")
    #> [1] ".x.y.z something else .z.a.r"
    writeLines(string5)
    #> .x.y.z something else .z.a.r
    str_view_all(string5, "\\..\\..\\..")
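
To see the matches for exercise 3 as plain text rather than in the viewer, one option is str_extract_all(). This is only a sketch; the variable names pattern and matches are made up for illustration.

```r
library(stringr)

string5 <- ".x.y.z something else .z.a.r"

# The regex \..\..\.. is written as the string "\\..\\..\\.."
pattern <- "\\..\\..\\.."
writeLines(pattern)
#> \..\..\..

# str_extract_all() returns a list with one element per input string.
matches <- str_extract_all(string5, pattern)[[1]]
matches
#> [1] ".x.y.z" ".z.a.r"
```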

Anchors

Use:

  • ^ to match the start of the string.

  • $ to match the end of the string.

    x
    #> [1] "apple"  "banana" "pear"
    str_view(x, "^a") # any starting with a?
    str_view(x, "a$") # any ending with a?
  • To match a full string (not just the string being a part of a bigger string).

    (x <- c("apple pie", "apple", "apple cake"))
    #> [1] "apple pie"  "apple"      "apple cake"
    str_view(x, "apple") # match any "apple"
    str_view(x, "^apple$") # match the word "apple"
  • Match boundary between words with \b.
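
The \b anchor has no example above, so here is a minimal sketch. Inside an R string the backslash must be doubled, giving "\\b". The test vector is made up for illustration.

```r
library(stringr)

x <- c("summarise", "sum", "sum total", "rowsum")

# Without boundaries, "sum" matches anywhere inside a longer word.
anywhere <- str_detect(x, "sum")
anywhere
#> [1] TRUE TRUE TRUE TRUE

# With \b on both sides, only the standalone word "sum" matches.
whole_word <- str_detect(x, "\\bsum\\b")
whole_word
#> [1] FALSE  TRUE  TRUE FALSE
```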

Exercises

  1. How would you match the literal string "$^$"?

    (x <- "How would you match the literal string $^$?")
    #> [1] "How would you match the literal string $^$?"
    str_view(x, "\\$\\^\\$")
  2. Given the corpus of common words in stringr::words, create regular expressions that find all words that:

    1. Start with “y”.
    stringr::words %>% 
        as_tibble()
    #> # A tibble: 980 x 1
    #>    value   
    #>    <chr>   
    #>  1 a       
    #>  2 able    
    #>  3 about   
    #>  4 absolute
    #>  5 accept  
    #>  6 account 
    #>  7 achieve 
    #>  8 across  
    #>  9 act     
    #> 10 active  
    #> # ... with 970 more rows
    
    str_view(stringr::words, "^y", match = TRUE)
    2. End with “x”.
    str_view(stringr::words, "x$", match = TRUE)
    3. Are exactly three letters long. (Don’t cheat by using str_length()!)
    str_view(stringr::words, "^...$", match = TRUE)
    4. Have seven letters or more.
    str_view(stringr::words, "^.......", match = TRUE)

    Since this list is long, you might want to use the match argument to str_view() to show only the matching or non-matching words.
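
The same four patterns can be checked by eye on a short vector with str_subset(), which returns the matching elements as plain text. The vector w below is made up for illustration and is not taken from stringr::words.

```r
library(stringr)

# A small made-up vector, so the results are easy to verify.
w <- c("yes", "box", "yellow", "cat", "example", "wax", "by")

starts_y   <- str_subset(w, "^y")        # starts with "y"
ends_x     <- str_subset(w, "x$")        # ends with "x"
three_long <- str_subset(w, "^...$")     # exactly three characters
seven_plus <- str_subset(w, "^.......")  # at least seven characters

starts_y
#> [1] "yes"    "yellow"
ends_x
#> [1] "box" "wax"
three_long
#> [1] "yes" "box" "cat" "wax"
seven_plus
#> [1] "example"
```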

Character classes

  • \d: matches any digit.
  • \s: matches any whitespace (e.g. space, tab, newline).
  • [abc]: matches a, b, or c.
  • [^abc]: matches anything except a, b, or c.

To create a regular expression containing \d or \s, we’ll need to escape the \ for the string, so we’ll type "\\d" or "\\s".

A character class containing a single character is a nice alternative to backslash escapes when we’re looking for a single metacharacter in a regex.

(x <- "How would you match the literal string $^$?")
#> [1] "How would you match the literal string $^$?"
str_view(x, "[$][\\^][$]")

(y <- "This sentence has a full stop. Can we find it?")
#> [1] "This sentence has a full stop. Can we find it?"
str_view(y, "[.]")

# Look for a literal character that normally has special meaning in a regex
str_view(c("abc", "a.c", "a*c", "a c"), "a[.]c")
str_view(c("abc", "a.c", "a*c", "a c"), ".[*]c")
str_view(c("abc", "a.c", "a*c", "a c"), "a[ ]")

This works for most (but not all) regex metacharacters:

  • Works for: $ . | ? * + ( ) [ {.
  • Does not work for: ] \ ^ and -. These characters have special meaning even inside a character class, and hence must be handled with backslash escapes, e.g. the [\\^] in the first example above.
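
A short sketch comparing the two escaping styles with str_detect(): the character class and the backslash escape give the same result for ., while ^ needs the backslash even inside the brackets. The test vector is made up for illustration.

```r
library(stringr)

x <- c("abc", "a.c", "a*c", "a^c")

dot_class  <- str_detect(x, "a[.]c")    # character class around the dot
dot_escape <- str_detect(x, "a\\.c")    # backslash escape, same meaning
star_class <- str_detect(x, "a[*]c")    # works for * as well
caret      <- str_detect(x, "a[\\^]c")  # ^ still needs the backslash

dot_class
#> [1] FALSE  TRUE FALSE FALSE
star_class
#> [1] FALSE FALSE  TRUE FALSE
caret
#> [1] FALSE FALSE FALSE  TRUE
```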

You can use alternation to pick between one or more alternative patterns. For example, abc|d..f will match either "abc" or "deaf". Note that the precedence for | is low, and hence may be confusing: we might have expected the above to mean ab followed by either c or d..f (matching abc or abdeaf), but it does not. It matches either the whole first part abc OR the whole second part d..f. We need to use parentheses to make it clear what we are looking for.

str_view(c("grey", "gray"), "gr(e|a)y")
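
The effect of the low precedence of | can be seen by extracting the matched text with and without parentheses. This is a minimal sketch on a made-up vector; the names no_parens and with_parens are illustrative.

```r
library(stringr)

x <- c("abc", "deaf", "abdeaf")

# Without parentheses: the whole of "abc" OR the whole of "d..f".
no_parens <- str_extract(x, "abc|d..f")
no_parens
#> [1] "abc"  "deaf" "deaf"

# With parentheses: "ab" followed by either "c" or "d..f".
with_parens <- str_extract(x, "ab(c|d..f)")
with_parens
#> [1] "abc"    NA       "abdeaf"
```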

Exercises

  1. Create regular expressions to find all words that:

    1. Start with a vowel.

      reg_ex <- "^[aeiou]"
      (x <- c("aardvark", "bat", "umbrella", 
              "escape", "xray", "owl"))
      #> [1] "aardvark" "bat"      "umbrella" "escape"   "xray"     "owl"
      str_view(x, reg_ex)
    2. That only contain consonants. (Hint: thinking about matching “not”-vowels.)

      This needs one tool we have not met yet: the + after the closing bracket of the character class means “one or more”. The pattern below therefore finds the words in stringr::words that consist of one or more non-vowel characters and nothing else.

      reg_ex <- "^[^aeiou]+$"
      str_view(stringr::words, reg_ex, match = TRUE)
    3. End with ed, but not with eed.

      reg_ex <- "(^|[^e])ed$" # the ^ alternative also allows the bare word "ed"
      str_view(stringr::words, reg_ex, match = TRUE)
    4. End with ing or ise.

      reg_ex <- "i(ng|se)$"
      str_view(stringr::words, reg_ex, match = TRUE)
  2. Empirically verify the rule “i before e except after c”.

    correct_reg_ex <- "[^c]ie|[c]ei"
    str_view(stringr::words, correct_reg_ex, match = TRUE)
    opp_reg_ex <- "[^c]ei|[c]ie" # the opposite: ei not after c, or ie after c
    str_view(stringr::words, opp_reg_ex, match = TRUE)
  3. Is “q” always followed by a “u”?

    reg_ex <- "q[^u]"
    str_view(stringr::words, reg_ex, match = TRUE)
    reg_ex <- "qu"
    str_view(stringr::words, reg_ex, match = TRUE)

    In the stringr::words dataset, yes.

  4. Write a regular expression that matches a word if it’s probably written in British English, not American English.

    reg_ex <- "colour" # British spelling keeps the "u"
    str_view(c("colour", "color", "colouring"), reg_ex)
    reg_ex <- "visualis(e|ation)" # British spelling uses "s" rather than "z"
    str_view(c("visualisation", "visualization", 
               "visualise", "visualize"), 
             reg_ex)
  5. Create a regular expression that will match telephone numbers as commonly written in your country.

    reg_ex <- "[+]27[(]0[)][\\d]+"
    str_view(c("0828907654", "+27(0)862345678", "777-8923-111"),
             reg_ex)
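
For the “ends with ed, but not with eed” exercise, a pattern of the form [^e]ed$ requires some character before the ed, so it cannot match the bare word ed. One hedged alternative is (^|[^e])ed$, which also accepts the start of the string in that position. The sketch below compares the two on a made-up vector.

```r
library(stringr)

# Made-up test words; not taken from stringr::words.
w <- c("bed", "feed", "ed", "hundred", "need")

# Requires a non-"e" character before "ed".
needs_prefix <- str_subset(w, "[^e]ed$")
needs_prefix
#> [1] "bed"     "hundred"

# Accepts either the start of the string or a non-"e" character.
allows_bare <- str_subset(w, "(^|[^e])ed$")
allows_bare
#> [1] "bed"     "ed"      "hundred"
```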

sessionInfo()
#> R version 3.6.3 (2020-02-29)
#> Platform: x86_64-w64-mingw32/x64 (64-bit)
#> Running under: Windows 10 x64 (build 18363)
#> 
#> Matrix products: default
#> 
#> locale:
#> [1] LC_COLLATE=English_South Africa.1252  LC_CTYPE=English_South Africa.1252   
#> [3] LC_MONETARY=English_South Africa.1252 LC_NUMERIC=C                         
#> [5] LC_TIME=English_South Africa.1252    
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> other attached packages:
#>  [1] magrittr_1.5    flair_0.0.2     forcats_0.5.0   stringr_1.4.0  
#>  [5] dplyr_1.0.0     purrr_0.3.4     readr_1.3.1     tidyr_1.1.0    
#>  [9] tibble_3.0.3    ggplot2_3.3.0   tidyverse_1.3.0 workflowr_1.6.2
#> 
#> loaded via a namespace (and not attached):
#>  [1] tidyselect_1.1.0  xfun_0.13         haven_2.2.0       lattice_0.20-38  
#>  [5] colorspace_1.4-1  vctrs_0.3.2       generics_0.0.2    htmltools_0.5.0  
#>  [9] emo_0.0.0.9000    yaml_2.2.1        utf8_1.1.4        rlang_0.4.7      
#> [13] later_1.0.0       pillar_1.4.6      withr_2.2.0       glue_1.4.1       
#> [17] DBI_1.1.0         dbplyr_1.4.3      modelr_0.1.6      readxl_1.3.1     
#> [21] lifecycle_0.2.0   munsell_0.5.0     gtable_0.3.0      cellranger_1.1.0 
#> [25] rvest_0.3.5       htmlwidgets_1.5.1 evaluate_0.14     knitr_1.28       
#> [29] httpuv_1.5.2      fansi_0.4.1       broom_0.5.6       Rcpp_1.0.4.6     
#> [33] promises_1.1.0    backports_1.1.6   scales_1.1.0      jsonlite_1.7.0   
#> [37] fs_1.4.1          hms_0.5.3         digest_0.6.25     stringi_1.4.6    
#> [41] grid_3.6.3        rprojroot_1.3-2   cli_2.0.2         tools_3.6.3      
#> [45] crayon_1.3.4      whisker_0.4       pkgconfig_2.0.3   ellipsis_0.3.1   
#> [49] xml2_1.3.2        reprex_0.3.0      lubridate_1.7.8   rstudioapi_0.11  
#> [53] assertthat_0.2.1  rmarkdown_2.4     httr_1.4.2        R6_2.4.1         
#> [57] nlme_3.1-144      git2r_0.26.1      compiler_3.6.3