Last updated: 2020-11-01

Checks: 7 0

Knit directory: r4ds_book/

This reproducible R Markdown analysis was created with workflowr (version 1.6.2). The Checks tab describes the reproducibility checks that were applied when the results were created. The Past versions tab lists the development history.


Great! Since the R Markdown file has been committed to the Git repository, you know the exact version of the code that produced these results.

Great job! The global environment was empty. Objects defined in the global environment can affect the analysis in your R Markdown file in unknown ways. For reproduciblity it’s best to always run the code in an empty environment.

The command set.seed(20200814) was run prior to running the code in the R Markdown file. Setting a seed ensures that any results that rely on randomness, e.g. subsampling or permutations, are reproducible.

Great job! Recording the operating system, R version, and package versions is critical for reproducibility.

Nice! There were no cached chunks for this analysis, so you can be confident that you successfully produced the results during this run.

Great job! Using relative paths to the files within your workflowr project makes it easier to run your code on other machines.

Great! You are using Git for version control. Tracking code development and connecting the code version to the results is critical for reproducibility.

The results in this page were generated with repository version 2cd3513. See the Past versions tab to see a history of the changes made to the R Markdown and HTML files.

Note that you need to be careful to ensure that all relevant files for the analysis have been committed to Git prior to generating the results (you can use wflow_publish or wflow_git_commit). workflowr only checks the R Markdown file, but you know if there are other scripts or data files that it depends on. Below is the status of the Git repository when the results were generated:


Ignored files:
    Ignored:    .Rproj.user/

Untracked files:
    Untracked:  analysis/images/
    Untracked:  code_snipp.txt

Note that any generated files, e.g. HTML, png, CSS, etc., are not included in this status report because it is ok for generated content to have uncommitted changes.


These are the previous versions of the repository in which changes were made to the R Markdown (analysis/ch11_strings.Rmd) and HTML (docs/ch11_strings.html) files. If you’ve configured a remote Git repository (see ?wflow_git_remote), click on the hyperlinks in the table below to view the files as they were in that past version.

File Version Author Date Message
Rmd 2cd3513 sciencificity 2020-11-01 more of ch11
html d8813e9 sciencificity 2020-11-01 Build site.
Rmd 3ec3460 sciencificity 2020-11-01 more of ch11
html bf15f3b sciencificity 2020-11-01 Build site.
Rmd a8057e7 sciencificity 2020-11-01 added ch11
html 0aef1b0 sciencificity 2020-10-31 Build site.
Rmd 72ad7d9 sciencificity 2020-10-31 added ch10

Strings

Click on the tab buttons below for each section

String Basics

String Basics

(string1 <- "This is a string")
#> [1] "This is a string"
(string2 <- 'To put a "quote" inside a string, use single quotes')
#> [1] "To put a \"quote\" inside a string, use single quotes"

writeLines(string1)
#> This is a string
writeLines(string2)
#> To put a "quote" inside a string, use single quotes
double_quote <- "\"" # or '"'
single_quote <- '\'' # or "'"

If you want to include a literal backslash, you’ll need to double it up: "\\".

The printed representation of a string is not the same as string itself, because the printed representation shows the escapes. To see the raw contents of the string, use writeLines():

x <- c("\"", "\\")
x
#> [1] "\"" "\\"
writeLines(x)
#> "
#> \

Other useful ones:

  • "\n": newline
  • "\t": tab
  • See the complete list by getting help on ": ?'"', or ?"'".
  • When you see strings like "\u00b5", this is a way of writing non-English characters.
(string3 <- "This\tis\ta\tstring\twith\t\ttabs\tin\tit.\nHow about that?")
#> [1] "This\tis\ta\tstring\twith\t\ttabs\tin\tit.\nHow about that?"
writeLines(string3)
#> This is  a   string  with        tabs    in  it.
#> How about that?

## From `?'"'` help page
## Backslashes need doubling, or they have a special meaning.
x <- "In ALGOL, you could do logical AND with /\\."
print(x)      # shows it as above ("input-like")
#> [1] "In ALGOL, you could do logical AND with /\\."
writeLines(x) # shows it as you like it ;-)
#> In ALGOL, you could do logical AND with /\.

Some String Functions

Some String Functions

String Length

Use str_length().

str_length(c("a", "R for Data Science", NA))
#> [1]  1 18 NA

Combining Strings

Use str_c().

  • Use sep = some_char to separate values with a character, the default separator is the empty string.
  • Shorter length vectors are recycled.
  • Use str_replace_na(list) to replace NAs with literal NA.
  • Objects of length 0 are silently dropped.
  • Use `collapse to reduce a vector of strings to a single string.
str_c("a", "R for Data Science")
#> [1] "aR for Data Science"

str_c("x", "y", "z")
#> [1] "xyz"

str_c("x", "y", "z", sep = ", ") # separate using character
#> [1] "x, y, z"

str_c("prefix-", c("a","b", "c"), "-suffix")
#> [1] "prefix-a-suffix" "prefix-b-suffix" "prefix-c-suffix"
x <- c("abc", NA)

str_c("|=", x, "=|") # concatenating a 1 long, with 2 long, with 1 long
#> [1] "|=abc=|" NA

str_c("|=", str_replace_na(x), "=|") # to actually show the NA
#> [1] "|=abc=|" "|=NA=|"

Notice that the shorter vector is recycled.

Objects of 0 length are dropped.

name <- "Vebash"
time_of_day <- "evening"
birthday <- FALSE

str_c("Good ", time_of_day, " ",
      name, if(birthday) ' and Happy Birthday!')
#> [1] "Good evening Vebash"

str_c("prefix-", c("a","b", "c"), "-suffix", collapse = ', ')
#> [1] "prefix-a-suffix, prefix-b-suffix, prefix-c-suffix"

str_c("prefix-", c("a","b", "c"), "-suffix") # note the diff without
#> [1] "prefix-a-suffix" "prefix-b-suffix" "prefix-c-suffix"

Subsetting Strings

Use str_sub().

  • start and end args give the (inclusive) position of the substring you’re looking for.
  • does not fail if string too short, returns as much as it can.
  • can use the assignment operator of str_sub() to modify strings.
x <- c("Apple", "Banana", "Pear")

str_sub(x, 1, 3) # get 1st three chars of each
#> [1] "App" "Ban" "Pea"

str_sub(x, -3, -1) # get last three chars of each
#> [1] "ple" "ana" "ear"

str_sub("a", 1, 5) # too short but no failure
#> [1] "a"

x # before change
#> [1] "Apple"  "Banana" "Pear"

# Go get from x the 1st char, and assign to it
# the lower version of its character
str_sub(x, 1, 1) <- str_to_lower(str_sub(x, 1, 1))

x # after the str_sub assign above
#> [1] "apple"  "banana" "pear"

Locales

str_to_lower(), str_to_upper() and str_to_title() are all functions that amend case. Amending case may be dependant on your locale though.

# Turkish has two i's: with and without a dot, and it
# has a different rule for capitalising them:
str_to_upper(c("i", "ı"))
#> [1] "I" "I"
str_to_upper(c("i", "ı"), locale = "tr")
#> [1] "I" "I"

Sorting is also affected by locales. In Base R we use sort or order, in {stringr} we use str_sort() and str_order() with the additional argument locale.

x <- c("apple", "banana", "eggplant")

str_sort(x, locale = "en")
#> [1] "apple"    "banana"   "eggplant"

str_sort(x, locale = "haw")
#> [1] "apple"    "eggplant" "banana"

str_order(x, locale = "en")
#> [1] 1 2 3

str_order(x, locale = "haw")
#> [1] 1 3 2

Exercises

  1. In code that doesn’t use stringr, you’ll often see paste() and paste0(). What’s the difference between the two functions? What stringr function are they equivalent to? How do the functions differ in their handling of NA?

    # from the help page
    ## When passing a single vector, paste0 and paste work like as.character.
    paste0(1:12)
    #>  [1] "1"  "2"  "3"  "4"  "5"  "6"  "7"  "8"  "9"  "10" "11" "12"
    paste(1:12)        # same
    #>  [1] "1"  "2"  "3"  "4"  "5"  "6"  "7"  "8"  "9"  "10" "11" "12"
    as.character(1:12) # same
    #>  [1] "1"  "2"  "3"  "4"  "5"  "6"  "7"  "8"  "9"  "10" "11" "12"
    
    ## If you pass several vectors to paste0, they are concatenated in a
    ## vectorized way.
    (nth <- paste0(1:12, c("st", "nd", "rd", rep("th", 9))))
    #>  [1] "1st"  "2nd"  "3rd"  "4th"  "5th"  "6th"  "7th"  "8th"  "9th"  "10th"
    #> [11] "11th" "12th"
    
    (nth <- paste(1:12, c("st", "nd", "rd", rep("th", 9))))
    #>  [1] "1 st"  "2 nd"  "3 rd"  "4 th"  "5 th"  "6 th"  "7 th"  "8 th"  "9 th" 
    #> [10] "10 th" "11 th" "12 th"
    
    (nth <- str_c(1:12, c("st", "nd", "rd", rep("th", 9))))
    #>  [1] "1st"  "2nd"  "3rd"  "4th"  "5th"  "6th"  "7th"  "8th"  "9th"  "10th"
    #> [11] "11th" "12th"
    
    
    (na_th <- paste0(1:13, c("st", "nd", "rd", rep("th", 9), NA)))
    #>  [1] "1st"  "2nd"  "3rd"  "4th"  "5th"  "6th"  "7th"  "8th"  "9th"  "10th"
    #> [11] "11th" "12th" "13NA"
    
    (na_th <- paste(1:13, c("st", "nd", "rd", rep("th", 9), NA)))
    #>  [1] "1 st"  "2 nd"  "3 rd"  "4 th"  "5 th"  "6 th"  "7 th"  "8 th"  "9 th" 
    #> [10] "10 th" "11 th" "12 th" "13 NA"
    
    (na_th <- str_c(1:13, c("st", "nd", "rd", rep("th", 9), NA)))
    #>  [1] "1st"  "2nd"  "3rd"  "4th"  "5th"  "6th"  "7th"  "8th"  "9th"  "10th"
    #> [11] "11th" "12th" NA
    • paste() inserts a space between values, and may be overridden with sep = "". In other words the default separator is a space.

    • paste0() has a separator that is by default the empty string so resulting vector values have no spaces in between.

    • str_c() is the stringr equivalent.

    • paste() and paste0() treat NA values as literal string NA, whereas str_c treats NA as missing and that vectorised operation results in an NA.

  2. In your own words, describe the difference between the sep and collapse arguments to str_c().

    • sep is the separator that appears between vector values when these are concatenated in a vectorised fashion.
    • collapse is the separator between values when all vectors are collapsed into a single contiguous string value.
    (na_th_sep <- str_c(1:12, c("st", "nd", "rd", rep("th", 9)),
                        # sep only
                        sep = "'"))
    #>  [1] "1'st"  "2'nd"  "3'rd"  "4'th"  "5'th"  "6'th"  "7'th"  "8'th"  "9'th" 
    #> [10] "10'th" "11'th" "12'th"
    
    (na_th_col <- str_c(1:12, c("st", "nd", "rd", rep("th", 9)),
                        # collapse only
                        collapse = "; "))
    #> [1] "1st; 2nd; 3rd; 4th; 5th; 6th; 7th; 8th; 9th; 10th; 11th; 12th"
    
    (na_th <- str_c(1:12, c("st", "nd", "rd", rep("th", 9)),
                    # both
                    sep = " ", collapse = ", "))
    #> [1] "1 st, 2 nd, 3 rd, 4 th, 5 th, 6 th, 7 th, 8 th, 9 th, 10 th, 11 th, 12 th"
  3. Use str_length() and str_sub() to extract the middle character from a string. What will you do if the string has an even number of characters?

    x <- "This is a string."
    
    y <- "This is a string, no full stop"
    
    z <- "I"
    
    str_length(x)/2
    #> [1] 8.5
    str_length(y)/2
    #> [1] 15
    
    str_sub(x, ceiling(str_length(x)/2),
            ceiling(str_length(x)/2))
    #> [1] "a"
    
    str_sub(y, str_length(y)/2,
            str_length(y)/2 + 1)
    #> [1] "ng"
    
    str_sub(z, ceiling(str_length(z)/2),
            ceiling(str_length(z)/2))
    #> [1] "I"
  4. What does str_wrap() do? When might you want to use it?

    It is a wrapper around stringi::stri_wrap() which implements the Knuth-Plass paragraph wrapping algorithm.

    The text is wrapped based on a given width. The default is 80, overridding this to 40 will mean 40 characters on a line. Further arguments such as indent (the indentation of start of each paragraph) may be specified.

  5. What does str_trim() do? What’s the opposite of str_trim()?

    It removes whitespace from the left and right of a string. str_pad() is the opposite functionality.

    str_squish() removes extra whitepace, in beginning of string, end of string and the middle. 🥂

    (x <- str_trim("  This has \n some spaces   in the     middle and end    "))
    #> [1] "This has \n some spaces   in the     middle and end"
    # whitespace removed from begin and end of string
    writeLines(x)
    #> This has 
    #>  some spaces   in the     middle and end
    
    (y <- str_squish("  This has \n some spaces   in the     middle and end    ... oh, not any more ;)"))
    #> [1] "This has some spaces in the middle and end ... oh, not any more ;)"
    # whitespace removed from begin, middle and end of string
    writeLines(y)
    #> This has some spaces in the middle and end ... oh, not any more ;)
  6. Write a function that turns (e.g.) a vector c("a", "b", "c") into the string a, b, and c. Think carefully about what it should do if given a vector of length 0, 1, or 2.

    • length 0: return empty string
    • length 1: return string
    • length 2: return first part “and” second part
    • length 3: return first part “,” second part “and” third part.
    stringify <- function(v){
      if (length(v) == 0 | length(v) == 1){
        v
      }
      else if (length(v) == 2){
        str_c(v, collapse = " and ")
      }
      else if (length(v) > 2){
        str_c(c(rep("", (length(v) - 1)), " and "),
              v, c(rep(", ", (length(v) - 2)), rep("", 2)), 
               collapse = "")
      }
    }
    emp <- ""
    stringify(emp)
    #> [1] ""
    
    x <- "a"
    stringify(x)
    #> [1] "a"
    
    y <- c("a", "b")
    stringify(y)
    #> [1] "a and b"
    
    z <- c("a", "b", "c")
    stringify(z)
    #> [1] "a, b and c"
    
    l <- letters
    stringify(letters)
    #> [1] "a, b, c, d, e, f, g, h, i, j, k, l, m, n, o, p, q, r, s, t, u, v, w, x, y and z"

Pattern Matching with Regex

Pattern Matching with Regex

  • Find a specific pattern

    x <- c("apple", "banana", "pear")
      # find any "an" char seq in vector x
      str_view(x, "an")
  • Find any character besides the newline char.

    # find any char followed by an "a" followed by any char
    str_view(x, ".a.") 
  • What if we want to literally match .?

    We need to escape the . to say “hey, literally find me a . char in the string, I don’t want to use it’s special behaviour this time”.

    \\.

    (dot <- "\\.")
    #> [1] "\\."
    writeLines(dot)
    #> \.
    
    str_view(c("abc", "a$c", "a.c", "b.e"), 
             # find a char 
             # followed by a literal . 
             # followed by another char
             ".\\..")
  • What if we want the literal \?

    Recall that to add a literal backslash in a string we have to escape it using \\.

    (backslash <- "This string contains the \\ char and we
    want to find it.")
    #> [1] "This string contains the \\ char and we\nwant to find it."
    writeLines(backslash)
    #> This string contains the \ char and we
    #> want to find it.

    So to find it using regex we need to escape each backslash in our regex i.e. \\\\. 👿

    writeLines(backslash)
    #> This string contains the \ char and we
    #> want to find it.
    str_view(backslash, "\\\\")

Exercises

  1. Explain why each of these strings don’t match a \: "\", "\\", "\\\".

    As we saw above in a string to literally print a \ we use "\\". If we need to match it we need to escape each \, with a \. Since we have two \’s in a string, matching requires 2 * 2 i.e. 4 \

  2. How would you match the sequence "'\?

    (string4 <- "This is the funky string: \"\'\\")
    #> [1] "This is the funky string: \"'\\"
    writeLines(string4)
    #> This is the funky string: "'\
    str_view(string4, "\\\"\\\'\\\\")
  3. What patterns will the regular expression \..\..\.. match? How would you represent it as a string?

    It matches the pattern literal . followed by any character x 3.

    (string5 <- ".x.y.z something else .z.a.r")
    #> [1] ".x.y.z something else .z.a.r"
    writeLines(string5)
    #> .x.y.z something else .z.a.r
    str_view_all(string5, "\\..\\..\\..")

Anchors

Anchors

Use:

  • ^ to match the start of the string.

  • $ to match the end of the string.

    x
    #> [1] "apple"  "banana" "pear"
    str_view(x, "^a") # any starting with a?
    str_view(x, "a$") # any ending with a?
  • To match a full string (not just the string being a part of a bigger string).

    (x <- c("apple pie", "apple", "apple cake"))
    #> [1] "apple pie"  "apple"      "apple cake"
    str_view(x, "apple") # match any "apple"
    str_view(x, "^apple$") # match the word "apple"
  • Match boundary between words with \b.

Exercises

  1. How would you match the literal string "$^$"?

    (x <- "How would you match the literal string $^$?")
    #> [1] "How would you match the literal string $^$?"
    str_view(x, "\\$\\^\\$")
  2. Given the corpus of common words in stringr::words, create regular expressions that find all words that:

    1. Start with “y”.
    stringr::words %>% 
        as_tibble()
    #> # A tibble: 980 x 1
    #>    value   
    #>    <chr>   
    #>  1 a       
    #>  2 able    
    #>  3 about   
    #>  4 absolute
    #>  5 accept  
    #>  6 account 
    #>  7 achieve 
    #>  8 across  
    #>  9 act     
    #> 10 active  
    #> # ... with 970 more rows
    
    str_view(stringr::words, "^y", match = TRUE)
    1. End with “x”
    str_view(stringr::words, "x$", match = TRUE)
    1. Are exactly three letters long. (Don’t cheat by using str_length()!)
    str_view(stringr::words, "^...$", match = TRUE)
    1. Have seven letters or more.
    str_view(stringr::words, "^.......", match = TRUE)

    Since this list is long, you might want to use the match argument to str_view() to show only the matching or non-matching words.

Character classes

Character classes

  • \d: matches any digit.
  • \s: matches any whitespace (e.g. space, tab, newline).
  • [abc]: matches a, b, or c.
  • [^abc]: matches anything except a, b, or c.

To create a regular expression containing \d or \s, we’ll need to escape the \ for the string, so we’ll type "\\d" or "\\s".

A character class containing a single character is a nice alternative to backslash escapes when we’re looking for a single metacharacter in a regex.

(x <- "How would you match the literal string $^$?")
#> [1] "How would you match the literal string $^$?"
str_view(x, "[$][\\^][$]")

(y <- "This sentence has a full stop. Can we find it?")
#> [1] "This sentence has a full stop. Can we find it?"
str_view(y, "[.]")

# Look for a literal character that normally has special meaning in a regex
str_view(c("abc", "a.c", "a*c", "a c"), "a[.]c")
str_view(c("abc", "a.c", "a*c", "a c"), ".[*]c")
str_view(c("abc", "a.c", "a*c", "a c"), "a[ ]")

This works for most (but not all) regex metacharacters:

  • Works for: $ . | ? * + ( ) [ {.
  • Does not work for: Some characters have special meaning even inside a character class, and hence must be handled with backslash escapes. These are ] \ ^ and -. E.g. In the first example above.

You can use alternation to pick between one or more alternative patterns. For example, abc|d..f will match either ‘“abc”’, or "deaf". Note that the precedence for | is low, and hence may be confusing (e.g. we may have expected the above to match either abc or abdeaf or abchgf, but it does not - it matches either the first part abc OR the second part dxxf). We need to use parentheses to make it clear what we are looking for.

str_view(c("grey", "gray"), "gr(e|a)y")

Exercises

  1. Create regular expressions to find all words that:

    1. Start with a vowel.

      reg_ex <- "^[aeiou]"
      (x <- c("aardvark", "bat", "umbrella", 
              "escape", "xray", "owl"))
      #> [1] "aardvark" "bat"      "umbrella" "escape"   "xray"     "owl"
      str_view(x, reg_ex)
    2. That only contain consonants. (Hint: thinking about matching “not”-vowels.)

      I don’t know how to do this with only the tools we have learnt so far so you will see a new character below + that is after the character class end bracket - this means one or more, i.e. find words that contain one or more non-vowel words in stringr::words.

      reg_ex <- "^[^aeiou]+$"
      str_view(stringr::words, reg_ex, match = TRUE)
    3. End with ed, but not with eed.

      reg_ex <- "[^e][e][d]$"
      str_view(stringr::words, reg_ex, match = TRUE)
    4. End with ing or ise.

      reg_ex <- "i(ng|se)$"
      str_view(stringr::words, reg_ex, match = TRUE)
  2. Empirically verify the rule “i before e except after c”.

    correct_reg_ex <- "[^c]ie|[c]ei"
    str_view(stringr::words, correct_reg_ex, match = TRUE)
    opp_reg_ex <- "[^c]ei|[c]ie" # opp is e before i before a non c
    str_view(stringr::words, opp_reg_ex, match = TRUE)
  3. Is “q” always followed by a “u”?

    reg_ex <- "q[^u]"
    str_view(stringr::words, reg_ex, match = TRUE)
    reg_ex <- "qu"
    str_view(stringr::words, reg_ex, match = TRUE)

    In the stringr::words dataset yes.

  4. Write a regular expression that matches a word if it’s probably written in British English, not American English.

    reg_ex <- "col(o|ou)r"
    str_view(c("colour", "color", "colouring"), reg_ex)
    reg_ex <- "visuali(s|z)(e|ation)"
    str_view(c("visualisation", "visualization", 
               "visualise", "visualize"), 
             reg_ex)
  5. Create a regular expression that will match telephone numbers as commonly written in your country.

    reg_ex <- "[+]27[(]0[)][\\d]+"
    str_view(c("0828907654", "+27(0)862345678", "777-8923-111"),
             reg_ex)

Repetition

Repetition

The next step up in power involves controlling how many times a pattern matches:

  • ?: 0 or 1
  • +: 1 or more
  • *: 0 or more

You can also specify the number of matches precisely:

  • {n}: exactly n
  • {n,}: n or more
  • {,m}: at most m
  • {n,m}: between n and m
x <- "1888 is the longest year in Roman numerals: MDCCCLXXXVIII"

str_view(x, "CC?") # C or CC if exists

str_view(x, "CC+") # CC or CCC or CCCC etc. at least two C's

# CL or CX or CLX at least 1 C, followed by one of more L's & X's
str_view(x, "C[LX]+") 

str_view(x, "C{2}") # find exactly 2 C's

str_view(x, "C{1,}") # find 1 or more C's

str_view(x, "C{1,2}") # min 1 C, max 2 C's

(y <- '&lt;span style="color:#008080;background-color:#9FDDBA"&gt;`alpha`&lt;//span&gt;')
#> [1] "&lt;span style=\"color:#008080;background-color:#9FDDBA\"&gt;`alpha`&lt;//span&gt;"
writeLines(y)
#> &lt;span style="color:#008080;background-color:#9FDDBA"&gt;`alpha`&lt;//span&gt;

# .*? - find to the first > otherwise greedy
str_view(y, '^&lt;.*?(&gt;){1,}') 

The ? after .* makes the matching less greedy. It finds the first multiple characters until a > is encountered

Exercises

  1. Describe the equivalents of ?, +, * in {m,n} form.

    • ? - {0,1} 0 or 1
    • + - {1,} 1 or more
    • * - {0,} 0 or more
  2. Describe in words what these regular expressions match: (read carefully to see if I’m using a regular expression or a string that defines a regular expression.)

    1. ^.*$ Matches any string that does not contain a newline character in it. String defining regular expression.

      reg_ex <-  "^.*$"
      (x <- "This is a string with 0 newline chars")
      #> [1] "This is a string with 0 newline chars"
      writeLines(x)
      #> This is a string with 0 newline chars
      str_view(x, reg_ex)
      
      (y <- "This is a string with a couple \n\n newline chars")
      #> [1] "This is a string with a couple \n\n newline chars"
      writeLines(y)
      #> This is a string with a couple 
      #> 
      #>  newline chars
      str_view(y, reg_ex)

      Notice no match for y (none of the text highlighted).

    2. "\\{.+\\}"

      Matches a { followed by one or more of any character but the newline character followed by the }. String defining a regular expression.

      reg_ex <- "\\{.+\\}"
      str_view(c("{a}", "{}", "{a,b,c}", "{a, b\n, c}"), reg_ex)

      Notice that {a, b , c} is not highlighted, this is because there is a \n (newline sequence) after the b.

    3. \d{4}-\d{2}-\d{2}

      Matches exactly 4 digits followed by a - followed by exactly 2 digits, followed by a -, followed by exactly 2 digits. Regular expression (the \d needs another ).

      reg_ex <- "\\d{4}-\\d{2}-\\d{2}" 
      str_view(c("1234-34-12", "12345-34-23", "084-87-98",
                 "2020-01-01"), reg_ex)
    4. "\\\\{4}"

      Matches exactly 4 backslashes. String defining reg expr.

      reg_ex <- "\\\\{4}"
      str_view(c("\\\\", "\\\\\\\\"),
               reg_ex)
  3. Create regular expressions to find all words that:

    1. Start with three consonants.
    reg_ex <- "^[^aeiou]{3}.*"
    str_view(c("fry", "fly", "scrape", "scream", "ate", "women",
               "strap", "splendid", "test"), reg_ex)
    1. Have three or more vowels in a row.
    reg_ex <- ".*[aeiou]{3,}.*"
    str_view(stringr::words, reg_ex, match=TRUE)
    1. Have two or more vowel-consonant pairs in a row.
    reg_ex <- ".*([aeiou][^aeiou]){2,}.*"
    str_view(stringr::words, reg_ex, match = TRUE)
  4. Solve the beginner regexp crosswords at https://regexcrossword.com/challenges/beginner.

    regex complete

Backreferences

Backreferences

Parentheses can be used to make complex expressions more clear, and can also create a numbered capturing group (number 1, 2 etc.). A capturing group stores the part of the string matched by the part of the regular expression inside the parentheses. You can refer to the same text as previously matched by a capturing group with backreferences, like \1, \2 etc.

The following regex finds all fruits that have a repeated pair of letters.

# (..)\\1 says find any two letters - these are a group, is 
# this then followed by the same 2 letters?
# Yes - match found
# No - whawha
str_view(fruit, "(..)\\1", match = TRUE)

For e.g. for banana:

  • It starts at “ba” that becomes the group 1, then it moves it along and says is the next 2 letters “ba” (i.e. equivalent to group 1) too? Nope.
  • It moves along to “an” and that is the new group 1. Then it moves along and says - are the next two letters equivalent to group 1 (i.e. is it “an”) - Yes it is! found a word that matches.

Exercises

  1. Describe, in words, what these expressions will match:

    1. (.)\1\1

      This matches any character repeated three times.

      reg_ex <- "(.)\\1\\1"
      str_view(c("Oooh", "Ahhh", "Awww", "Ergh"), reg_ex)

      Note that O and o are different.

    2. "(.)(.)\\2\\1"

      This matches any two characters repeated once in reverse order. e.g. abba

      reg_ex <- "(.)(.)\\2\\1"
      str_view(c("abba"), reg_ex)
      str_view(words, reg_ex, match=TRUE)
    3. (..)\1

      This matches two letters that appear twice. banana.

      str_view(fruit, "(..)\\1", match = TRUE)
    4. "(.).\\1.\\1"

      This matches a character followed by another char followed by the same character as the start, followed by another char, followed by the character. e.g. abaca

      str_view(words, "(.).\\1.\\1", match = TRUE)
    5. "(.)(.)(.).*\\3\\2\\1"

      This matches three characters followed by 0 or more other characters, ending with the 3 characters at the start in reverse order.

      reg_ex <- "(.)(.)(.).*\\3\\2\\1"
      str_view(c("bbccbb"), reg_ex)
      str_view(words, reg_ex, match=TRUE)
  2. Construct regular expressions to match words that:

    1. Start and end with the same character.
    reg_ex <- "^(.).*\\1$"
    str_view(words, reg_ex, match = TRUE)
    1. Contain a repeated pair of letters (e.g. “church” contains “ch” repeated twice.)
    reg_ex <- "(..).*\\1"
    str_view("church", reg_ex)
    str_view(words, reg_ex, match=TRUE)
    1. Contain one letter repeated in at least three places (e.g. “eleven” contains three “e”s.)
    reg_ex <- "(.).*\\1.*\\1"
    str_view(words, reg_ex, match = TRUE)

Detect Matches

Detect Matches

str_detect()

Use str_detect(). It returns a logical vector the same length as the input.

Since it is a logical vector and numerically TRUE == 1 and FALSE == 0 we can also use sum(), mean() to get information about matches found.

(x <- c("apple", "banana", "pear"))
#> [1] "apple"  "banana" "pear"
str_detect(x, "e")
#> [1]  TRUE FALSE  TRUE
x
#> [1] "apple"  "banana" "pear"
sum(str_detect(x, "e"))
#> [1] 2
# How many common words start with t?
sum(str_detect(words, "^t"))
#> [1] 65
# What proportion of common words end with a vowel?
mean(str_detect(words, "[aeiou]$"))
#> [1] 0.2765306
# Find all words containing at least one vowel, and negate
no_vowels_1 <- !str_detect(words, "[aeiou]")
# Find all words consisting only of consonants (non-vowels)
no_vowels_2 <- str_detect(words, "^[^aeiou]+$")
identical(no_vowels_1, no_vowels_2)
#> [1] TRUE
# you can also use `negate = TRUE`
no_vowels_3 <- str_detect(words, "[aeiou]", negate = TRUE)
identical(no_vowels_1, no_vowels_3)
#> [1] TRUE
identical(no_vowels_3, no_vowels_2)
#> [1] TRUE

str_subset()

We use str_detect() often to match patterns using the wrapper str_subset().

words[str_detect(words, "x$")]
#> [1] "box" "sex" "six" "tax"
# str_subset() is a wrapper around x[str_detect(x, pattern)]
str_subset(words, "x$")
#> [1] "box" "sex" "six" "tax"

filter(str_detect())

When we want to find matches in a column in a dataframe we can combine str_detect() with filter().

(df <- tibble(
  word = words,
  i = seq_along(word)
))
#> # A tibble: 980 x 2
#>    word         i
#>        
#>  1 a            1
#>  2 able         2
#>  3 about        3
#>  4 absolute     4
#>  5 accept       5
#>  6 account      6
#>  7 achieve      7
#>  8 across       8
#>  9 act          9
#> 10 active      10
#> # ... with 970 more rows

df %>%
  filter(str_detect(word, "x$"))
#> # A tibble: 4 x 2
#>   word      i
#>   
#> 1 box     108
#> 2 sex     747
#> 3 six     772
#> 4 tax     841

str_count()

Instead of using str_detect() which returns a TRUE OR FALSE we can use str_count() which gives us a number of matches in each string.

(x <- c("apple", "banana", "pear"))
#> [1] "apple"  "banana" "pear"
str_count(x, "e")
#> [1] 1 0 1
str_count(x, "a")
#> [1] 1 3 1
# On average, how many vowels per word?
mean(str_count(words, "[aeiou]"))
#> [1] 1.991837

We often use str_count() with mutate().

df %>% 
  mutate(vowels = str_count(word, "[aeiou]"),
         consonants = str_count(word, "[^aeiou]"))
#> # A tibble: 980 x 4
#>    word         i vowels consonants
#>    <chr>    <int>  <int>      <int>
#>  1 a            1      1          0
#>  2 able         2      2          2
#>  3 about        3      3          2
#>  4 absolute     4      4          4
#>  5 accept       5      2          4
#>  6 account      6      3          4
#>  7 achieve      7      4          3
#>  8 across       8      2          4
#>  9 act          9      1          2
#> 10 active      10      3          3
#> # ... with 970 more rows

Matches never overlap. For example, in "abababa", the pattern "aba" matches twice. You can think of it as placing a marker at the beginning of the string, then moving along looking for pattern, it sees a then b then a, so it has found one pattern == aba. The marker is lying at the 4th letter in the string. It proceeds from there to look for more occurrences of the pattern. b does not do it, so it skips over and goes to the 5th character a, then the 6th b, then the 7th a and has found another occurrence. Hence 2 occurrences found. I.e it moves sequentially over the string, and does not brute force every combination in the string.

how matching proceeds

str_count("abababa", "aba")
#> [1] 2
str_view_all("abababa", "aba")

Exercises

  1. For each of the following challenges, try solving it by using both a single regular expression, and a combination of multiple str_detect() calls.

    1. Find all words that start or end with x.

      reg_ex <- "(^x.*|.*x$)"
      str_view(words, reg_ex, match = TRUE)
      str_detect(c("xray", "box", "text", "vex"), reg_ex)
      #> [1]  TRUE  TRUE FALSE  TRUE
      
      reg_ex <- "(^x.*|.*x$)"
      str_detect(c("xray", "box", "text", "vex"), "^x") |
        str_detect(c("xray", "box", "text", "vex"), "x$")
      #> [1]  TRUE  TRUE FALSE  TRUE
    2. Find all words that start with a vowel and end with a consonant.

      reg_ex <- "^[aeiou].*[^aeiou]$"
      df %>% 
        filter(str_detect(word, reg_ex))
      #> # A tibble: 122 x 2
      #>    word        i
      #>    <chr>   <int>
      #>  1 about       3
      #>  2 accept      5
      #>  3 account     6
      #>  4 across      8
      #>  5 act         9
      #>  6 actual     11
      #>  7 add        12
      #>  8 address    13
      #>  9 admit      14
      #> 10 affect     16
      #> # ... with 112 more rows
    3. Are there any words that contain at least one of each different vowel?

      # https://stackoverflow.com/questions/54267095/what-is-the-regex-to-match-the-words-containing-all-the-vowels
      reg_ex <- "\\b(?=\\w*?a)(?=\\w*?e)(?=\\w*?i)(?=\\w*?o)(?=\\w*?u)[a-zA-Z]+\\b"
      str_detect(c("eunomia", "eutopia", "sequoia"), reg_ex)
      #> [1] TRUE TRUE TRUE
      str_view(c("eunomia", "eutopia", "sequoia"), reg_ex)
    4. What word has the highest number of vowels? What word has the highest proportion of vowels? (Hint: what is the denominator?)

      df %>% 
        mutate(vowels = str_count(word, "[aeiou]+"),
               len_word = str_length(word),
               prop_vowels = vowels / len_word) %>% 
        arrange(-prop_vowels)
      #> # A tibble: 980 x 5
      #>    word        i vowels len_word prop_vowels
      #>    <chr>   <int>  <int>    <int>       <dbl>
      #>  1 a           1      1        1       1    
      #>  2 age        22      2        3       0.667
      #>  3 ago        24      2        3       0.667
      #>  4 eye       296      2        3       0.667
      #>  5 one       577      2        3       0.667
      #>  6 use       912      2        3       0.667
      #>  7 aware      63      3        5       0.6  
      #>  8 unite     906      3        5       0.6  
      #>  9 america    36      4        7       0.571
      #> 10 educate   258      4        7       0.571
      #> # ... with 970 more rows
      
      df %>% 
        mutate(vowels = str_count(word, "[aeiou]+"),
               len_word = str_length(word),
               prop_vowels = vowels / len_word) %>% 
        arrange(-vowels, -prop_vowels)          
      #> # A tibble: 980 x 5
      #>    word         i vowels len_word prop_vowels
      #>    <chr>    <int>  <int>    <int>       <dbl>
      #>  1 america     36      4        7       0.571
      #>  2 educate    258      4        7       0.571
      #>  3 imagine    415      4        7       0.571
      #>  4 operate    580      4        7       0.571
      #>  5 absolute     4      4        8       0.5  
      #>  6 definite   220      4        8       0.5  
      #>  7 evidence   283      4        8       0.5  
      #>  8 exercise   288      4        8       0.5  
      #>  9 organize   585      4        8       0.5  
      #> 10 original   586      4        8       0.5  
      #> # ... with 970 more rows

      I see these are two different things. The highest number of vowels, is just the word with the most vowels. The proportion on the other hand is num_vowels_in_word / num_letters_in_word.

Extract Matches

Extract Matches

To extract the actual text of a match, use str_extract().

length(sentences)
#> [1] 720
head(sentences)
#> [1] "The birch canoe slid on the smooth planks." 
#> [2] "Glue the sheet to the dark blue background."
#> [3] "It's easy to tell the depth of a well."     
#> [4] "These days a chicken leg is a rare dish."   
#> [5] "Rice is often served in round bowls."       
#> [6] "The juice of lemons makes fine punch."

Let’s say we want to find all sentences that contain a colour.

colours <- c("red", "orange", "yellow", "green", "blue", "purple")
# make a match string by saying red|orange|...|purple
(colour_match <- str_c(colours, collapse = "|"))
#> [1] "red|orange|yellow|green|blue|purple"
has_colour <- str_subset(sentences, colour_match)
matches <- str_extract(has_colour, colour_match)
head(matches)
#> [1] "blue" "blue" "red"  "red"  "red"  "blue"
more <- sentences[str_count(sentences, colour_match) > 1]
str_view_all(more, colour_match)
  • To extract multiple matches, use str_extract_all().
  • This returns a list.
  • To get this in matrix format use simplify = TRUE

Exercises

  1. In the previous example, you might have noticed that the regular expression matched “flickered”, which is not a colour. Modify the regex to fix the problem.

    colours <- c("red", "orange", "yellow", "green", "blue", "purple")
    # make a match string by saying red|orange|...|purple
    (colour_match <- str_c(prefix = "\\b", colours, 
                           suffix = "\\b", collapse = "|"))
    #> [1] "\\bred\\b|\\borange\\b|\\byellow\\b|\\bgreen\\b|\\bblue\\b|\\bpurple\\b"
    more <- sentences[str_count(sentences, colour_match) > 1]
    str_view_all(more, colour_match)
  2. From the Harvard sentences data, extract:

    1. The first word from each sentence.

      reg_ex <- "^[A-Za-z']+\\b"
      first_word <- str_extract(sentences, reg_ex)
      head(first_word)
      #> [1] "The"   "Glue"  "It's"  "These" "Rice"  "The"
    2. All words ending in ing.

      reg_ex <- "\\b[a-zA-Z']+ing\\b"
      words_ <- str_extract_all(str_subset(sentences, reg_ex),
                                reg_ex, simplify = TRUE)
      head(words_)
      #>      [,1]     
      #> [1,] "spring" 
      #> [2,] "evening"
      #> [3,] "morning"
      #> [4,] "winding"
      #> [5,] "living" 
      #> [6,] "king"
    3. All plurals.

      Ok so some words end with s but are NOT plurals! For e.g. bass, mass etc.

      reg_ex <- "\\b[a-zA-Z]{4,}(es|ies|s)\\b"
      words_ <- str_extract_all(sentences, reg_ex,
                            simplify = TRUE)
      head(words_, 10)
      #>       [,1]        [,2]    [,3]
      #>  [1,] "planks"    ""      ""  
      #>  [2,] ""          ""      ""  
      #>  [3,] ""          ""      ""  
      #>  [4,] ""          ""      ""  
      #>  [5,] "bowls"     ""      ""  
      #>  [6,] "lemons"    "makes" ""  
      #>  [7,] ""          ""      ""  
      #>  [8,] ""          ""      ""  
      #>  [9,] "hours"     ""      ""  
      #> [10,] "stockings" ""      ""

sessionInfo()
#> R version 3.6.3 (2020-02-29)
#> Platform: x86_64-w64-mingw32/x64 (64-bit)
#> Running under: Windows 10 x64 (build 18363)
#> 
#> Matrix products: default
#> 
#> locale:
#> [1] LC_COLLATE=English_South Africa.1252  LC_CTYPE=English_South Africa.1252   
#> [3] LC_MONETARY=English_South Africa.1252 LC_NUMERIC=C                         
#> [5] LC_TIME=English_South Africa.1252    
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> other attached packages:
#>  [1] magrittr_1.5    flair_0.0.2     forcats_0.5.0   stringr_1.4.0  
#>  [5] dplyr_1.0.0     purrr_0.3.4     readr_1.3.1     tidyr_1.1.0    
#>  [9] tibble_3.0.3    ggplot2_3.3.0   tidyverse_1.3.0 workflowr_1.6.2
#> 
#> loaded via a namespace (and not attached):
#>  [1] tidyselect_1.1.0  xfun_0.13         haven_2.2.0       lattice_0.20-38  
#>  [5] colorspace_1.4-1  vctrs_0.3.2       generics_0.0.2    htmltools_0.5.0  
#>  [9] emo_0.0.0.9000    yaml_2.2.1        utf8_1.1.4        rlang_0.4.7      
#> [13] later_1.0.0       pillar_1.4.6      withr_2.2.0       glue_1.4.1       
#> [17] DBI_1.1.0         dbplyr_1.4.3      modelr_0.1.6      readxl_1.3.1     
#> [21] lifecycle_0.2.0   munsell_0.5.0     gtable_0.3.0      cellranger_1.1.0 
#> [25] rvest_0.3.5       htmlwidgets_1.5.1 evaluate_0.14     knitr_1.28       
#> [29] httpuv_1.5.2      fansi_0.4.1       broom_0.5.6       Rcpp_1.0.4.6     
#> [33] promises_1.1.0    backports_1.1.6   scales_1.1.0      jsonlite_1.7.0   
#> [37] fs_1.4.1          hms_0.5.3         digest_0.6.25     stringi_1.4.6    
#> [41] grid_3.6.3        rprojroot_1.3-2   cli_2.0.2         tools_3.6.3      
#> [45] crayon_1.3.4      whisker_0.4       pkgconfig_2.0.3   ellipsis_0.3.1   
#> [49] xml2_1.3.2        reprex_0.3.0      lubridate_1.7.8   rstudioapi_0.11  
#> [53] assertthat_0.2.1  rmarkdown_2.4     httr_1.4.2        R6_2.4.1         
#> [57] nlme_3.1-144      git2r_0.26.1      compiler_3.6.3