Last updated: 2020-11-10
Checks: 7 0
Knit directory: r4ds_book/
This reproducible R Markdown analysis was created with workflowr (version 1.6.2). The Checks tab describes the reproducibility checks that were applied when the results were created. The Past versions tab lists the development history.
Great! Since the R Markdown file has been committed to the Git repository, you know the exact version of the code that produced these results.
Great job! The global environment was empty. Objects defined in the global environment can affect the analysis in your R Markdown file in unknown ways. For reproduciblity it’s best to always run the code in an empty environment.
The command set.seed(20200814)
was run prior to running the code in the R Markdown file. Setting a seed ensures that any results that rely on randomness, e.g. subsampling or permutations, are reproducible.
Great job! Recording the operating system, R version, and package versions is critical for reproducibility.
Nice! There were no cached chunks for this analysis, so you can be confident that you successfully produced the results during this run.
Great job! Using relative paths to the files within your workflowr project makes it easier to run your code on other machines.
Great! You are using Git for version control. Tracking code development and connecting the code version to the results is critical for reproducibility.
The results in this page were generated with repository version 8864bd0. See the Past versions tab to see a history of the changes made to the R Markdown and HTML files.
Note that you need to be careful to ensure that all relevant files for the analysis have been committed to Git prior to generating the results (you can use wflow_publish
or wflow_git_commit
). workflowr only checks the R Markdown file, but you know if there are other scripts or data files that it depends on. Below is the status of the Git repository when the results were generated:
Ignored files:
Ignored: .Rproj.user/
Untracked files:
Untracked: analysis/images/
Untracked: code_snipp.txt
Untracked: data/at_health_facilities.csv
Untracked: data/infant_hiv.csv
Untracked: data/ranking.csv
Unstaged changes:
Modified: analysis/sample_exam1.Rmd
Note that any generated files, e.g. HTML, png, CSS, etc., are not included in this status report because it is ok for generated content to have uncommitted changes.
These are the previous versions of the repository in which changes were made to the R Markdown (analysis/ch15_functions.Rmd
) and HTML (docs/ch15_functions.html
) files. If you’ve configured a remote Git repository (see ?wflow_git_remote
), click on the hyperlinks in the table below to view the files as they were in that past version.
File | Version | Author | Date | Message |
---|---|---|---|---|
html | 86457fa | sciencificity | 2020-11-10 | Build site. |
html | 4879249 | sciencificity | 2020-11-09 | Build site. |
html | e423967 | sciencificity | 2020-11-08 | Build site. |
html | 0d223fb | sciencificity | 2020-11-08 | Build site. |
Rmd | b67f0a7 | sciencificity | 2020-11-08 | added ch13 |
html | ecd1d8e | sciencificity | 2020-11-07 | Build site. |
Rmd | 9440e66 | sciencificity | 2020-11-07 | finished ch15 |
df <- tibble::tibble(
a = rnorm(10),
b = rnorm(10),
c = rnorm(10),
d = rnorm(10)
)
df$a <- (df$a - min(df$a, na.rm = TRUE)) /
(max(df$a, na.rm = TRUE) - min(df$a, na.rm = TRUE))
df$b <- (df$b - min(df$b, na.rm = TRUE)) /
(max(df$b, na.rm = TRUE) - min(df$a, na.rm = TRUE))
df$c <- (df$c - min(df$c, na.rm = TRUE)) /
(max(df$c, na.rm = TRUE) - min(df$c, na.rm = TRUE))
df$d <- (df$d - min(df$d, na.rm = TRUE)) /
(max(df$d, na.rm = TRUE) - min(df$d, na.rm = TRUE))
Note the mistake in the copy and paste - we’re using df$a
for the calculation for df$b
!
Let’s isolate the main functionality that we are repeating.
(df$a - min(df$a, na.rm = TRUE)) /
(max(df$a, na.rm = TRUE) - min(df$a, na.rm = TRUE))
#> [1] 0.5358203 0.0000000 0.5941388 0.5494290 1.0000000 0.3947922 0.5101009
#> [8] 0.6675033 0.6034676 0.3683384
That’s all good and well, but to write a function you should generalise the functionality as a test first.
x <- df$a
(x - min(x, na.rm = TRUE)) /
(max(x, na.rm = TRUE) - min(x, na.rm = TRUE))
#> [1] 0.5358203 0.0000000 0.5941388 0.5494290 1.0000000 0.3947922 0.5101009
#> [8] 0.6675033 0.6034676 0.3683384
Even better, we’re using the range
when we use max
and min
. So let’s re-write using that.
rng <- range(x, na.rm=TRUE)
(x - rng[1]) /
(rng[2] - rng[1])
#> [1] 0.5358203 0.0000000 0.5941388 0.5494290 1.0000000 0.3947922 0.5101009
#> [8] 0.6675033 0.6034676 0.3683384
Let’s pull it all together in a function.
rescale01 <- function(x) {
# rescales a vector to lie between 0 and 1
rng <- range(x, na.rm = TRUE)
(x - rng[1]) /(rng[2] - rng[1])
}
rescale01(c(0,5,10))
#> [1] 0.0 0.5 1.0
Key steps:
Pick a name for the function.
List the inputs, or arguments, to the function inside function
. Example: function(x, y, z)
.
Place your code in body of the function, a {
block that immediately follows function(...)
.
Make it work with a small input.
Check the function with a range of inputs.
rescale01(c(-10,0,10))
#> [1] 0.0 0.5 1.0
rescale01(c(1,2,3,NA,5))
#> [1] 0.00 0.25 0.50 NA 1.00
Simplifying the original calcs:
df$a <- rescale01(df$a)
df$b <- rescale01(df$b)
df$c <- rescale01(df$c)
df$d <- rescale01(df$d)
Now changes can be made in a single place.
x <- c(1:10, Inf)
rescale01(x)
#> [1] 0 0 0 0 0 0 0 0 0 0 NaN
rescale01 <- function(x){
rng <- range(x, na.rm = TRUE, finite = TRUE)
(x - rng[1]) /(rng[2] - rng[1])
}
rescale01(x)
#> [1] 0.0000000 0.1111111 0.2222222 0.3333333 0.4444444 0.5555556 0.6666667
#> [8] 0.7777778 0.8888889 1.0000000 Inf
Why is TRUE
not a parameter to rescale01()
? What would happen if x
contained a single missing value, and na.rm
was FALSE
?
rescale01_miss <- function(x){
rng <- range(x, na.rm = FALSE)
(x - rng[1]) / (rng[2] - rng[1])
}
rescale01_miss(c(1:5, NA, 10))
#> [1] NA NA NA NA NA NA NA
rescale01_miss <- function(x){
rng <- range(x, na.rm = FALSE, finite = TRUE)
(x - rng[1]) / (rng[2] - rng[1])
}
rescale01_miss(c(1:5, NA, 10))
#> [1] 0.0000000 0.1111111 0.2222222 0.3333333 0.4444444 NA 1.0000000
If any value is NA the range function returns NA as min and max of range. If finite = TRUE
is set only the NA value returns NA.
According to the help page: “If finite is TRUE, the minimum and maximum of all finite values is computed, i.e., finite = TRUE
includes na.rm = TRUE
.”
In the second variant of rescale01()
, infinite values are left unchanged. Rewrite rescale01()
so that -Inf
is mapped to 0, and Inf
is mapped to 1.
rescale01_inf <- function(x){
rng <- range(x, na.rm = TRUE, finite = TRUE)
x[x == -Inf] <- 0
x[x == Inf] <- 1
(x - rng[1]) / (rng[2] - rng[1])
}
rescale01_inf(c(1, Inf, 0, -Inf))
#> [1] 1 1 0 0
Practice turning the following code snippets into functions. Think about what each function does. What would you call it? How many arguments does it need? Can you rewrite it to be more expressive or less duplicative?
mean(is.na(x))
x / sum(x, na.rm = TRUE)
sd(x, na.rm = TRUE) / mean(x, na.rm = TRUE)
mean(is.na(x))
finds what percentage of the data is NA.
x <- c(NA, 1:10, NA, 5)
mean(is.na(x))
#> [1] 0.1538462
mean_na <- function(x){
sum(x)/length(x)
}
mean_na(is.na(x))
#> [1] 0.1538462
x / sum(x, na.rm = TRUE)
converts each x to a percentage it contributes towards 100%, not considering the NAs.
contrib_to_one <- function(x) {
x / sum(x, na.rm = TRUE)
}
contrib_to_one(x)
#> [1] NA 0.01666667 0.03333333 0.05000000 0.06666667 0.08333333
#> [7] 0.10000000 0.11666667 0.13333333 0.15000000 0.16666667 NA
#> [13] 0.08333333
sd(x, na.rm = TRUE) / mean(x, na.rm = TRUE)
finds how the deviation as a proportion of the mean. sd
means standard deviation and is a measure of the variability in your data. The mean
is the average of your data.
deviation <- function(x, na.rm = TRUE) {
sd(x, na.rm = na.rm) / mean(x, na.rm = na.rm)
}
deviation(x)
#> [1] 0.5273097
Write your own functions to compute the variance and skewness of a numeric vector. Variance is defined as \[ \mathrm{Var}(x) = \frac{1}{n - 1} \sum_{i=1}^n (x_i - \bar{x}) ^2 \text{,} \] where \(\bar{x} = (\sum_i^n x_i) / n\) is the sample mean. Skewness is defined as \[ \mathrm{Skew}(x) = \frac{\frac{1}{n-2}\left(\sum_{i=1}^n(x_i - \bar x)^3\right)}{\mathrm{Var}(x)^{3/2}} \text{.} \]
variance <- function(x){
# find non na length of vector x
n <- sum(!is.na(x))
avg <- mean(x, na.rm = TRUE)
(1/(n-1))*(sum((x - avg)^2, na.rm=TRUE))
}
x <- c(1:10)
variance(x)
#> [1] 9.166667
var(x, na.rm = TRUE)
#> [1] 9.166667
y <- c(1:5, 1:5)
variance(y)
#> [1] 2.222222
var(y)
#> [1] 2.222222
z <- c(1:10, NA)
variance(z)
#> [1] 9.166667
var(z, na.rm = TRUE)
#> [1] 9.166667
skew <- function(x){
n <- sum(!is.na(x))
avg <- mean(x, na.rm = TRUE)
summation <- sum((x - avg)^3, na.rm = TRUE)
variance_val <- (variance(x))^(3/2)
((1/(n-2))*summation)/variance_val
}
a <- c(1:10, 1000)
skew(a)
#> [1] 3.014606
Write both_na()
, a function that takes two vectors of the same length and returns the number of positions that have an NA
in both vectors.
both_na <- function(x, y){
# find where x is NA - is.na(x) - returns a bunch of TRUE, FALSE
# find where y is NA - is.na(y) - returns a bunch of TRUE, FALSE
# where are they both NA - &
# sum where they are both NA
# e.g. (T, F, T) & (T, F, F) == 1; only T in same place once
sum(is.na(x) & is.na(y))
}
x <- c(1:10, NA, 12, NA)
y <- c(1:9, 10, NA, 12, 13)
both_na(x,y)
#> [1] 1
both_na(c(NA, 2, 4, 6, NA, 10, 12),
c(NA, 1, 3, 5, NA, 9, 11))
#> [1] 2
What do the following functions do? Why are they useful even though they are so short?
is_directory()
tells you whether a given path is a directory.
is_readable()
tells you whether you have permission to read a file.
is_directory <- function(x) file.info(x)$isdir
is_directory("C:/Personal")
#> [1] TRUE
is_directory("C:/Work/Learning/how-containers-work.pdf")
#> [1] FALSE
is_readable <- function(x) file.access(x, 4) == 0
is_readable("C:/Personal")
#> C:/Personal
#> TRUE
is_readable("C:/Work/Learning/how-containers-work.pdf")
#> C:/Work/Learning/how-containers-work.pdf
#> TRUE
Read the complete lyrics to “Little Bunny Foo Foo”. There’s a lot of duplication in this song. Extend the initial piping example to recreate the complete song, and use functions to reduce the duplication.
Repeat (x3) Little Bunny Foo Foo, Hopping through the forest, Scooping up the field mice, And bopping them on the head. (Spoken) Down came the Good Fairy, and she said, "Little Bunny Foo Foo, I don't want to see you, Scooping up the field mice And bopping them on the head." (Spoken) "I'll give you three (two / one) chance(s), And if you don't behave, I'm gonna turn you into a goon!" END Repeat The next day... or That evening... or Later that night... "I gave you three chances, And you didn't behave, And now I'm gonna turn you into a goon. POOF!" "And the moral of the story is: Hare today, goon tomorrow."
little_bunny <- function(bunny = ""){
str_glue("{bunny}Little Bunny Foo-Foo,\n\n")
}
hop <- function(bunny, where = "forest"){
str_glue("{bunny}Hopping through the {where},\n\n")
}
scoop <- function(bunny, what = "field mice"){
str_glue("{bunny}Scooping up the {what},\n\n")
}
bop <- function(bunny, where = "head"){
str_glue("{bunny}And bopping them on the {where}.\n\n\n")
}
chances <- function(bunny, num = "three"){
if(num == "one") {
str_glue("{bunny}I'll give you {num} chance,\n\n")
}
else {
str_glue("{bunny}I'll give you {num} chances,\n\n")
}
}
static_line1 <- function(bunny){
str_glue("{bunny}I don't want to see you,\n\n")
}
static_line2 <- function(bunny){
str_glue("{bunny}And if you don't behave,\nI'm gonna turn you into a goon!\n\n\n")
}
static_line3 <- function(bunny){
str_glue("{bunny}Down came the Good Fairy, and she said,\n\n")
}
foo_foo <- little_bunny()
foo_foo %>%
hop(where = "forest") %>%
scoop(what = "field mice") %>%
bop(where = "head") %>%
static_line3() %>%
little_bunny() %>%
static_line1() %>%
scoop(what = "field mice") %>%
bop(where = "head") %>%
chances(num = "three") %>%
static_line2() %>%
little_bunny() %>%
hop(where = "forest") %>%
scoop(what = "field mice") %>%
bop(where = "head") %>%
static_line3()%>%
little_bunny() %>%
static_line1()%>%
scoop(what = "field mice") %>%
bop(where = "head") %>%
chances(num = "two") %>%
static_line2() %>%
little_bunny() %>%
hop(where = "forest") %>%
scoop(what = "field mice") %>%
bop(where = "head") %>%
static_line3()%>%
little_bunny() %>%
static_line1() %>%
scoop(what = "field mice") %>%
bop(where = "head") %>%
chances(num = "one") %>%
static_line2() %>%
little_bunny() %>%
hop(where = "forest") %>%
scoop(what = "field mice") %>%
bop(where = "head") %>%
static_line3() %>%
str_glue("I gave you three chances,\nAnd you didn't behave,\nAnd now I'm gonna turn you into a goon. POOF!")
#> Little Bunny Foo-Foo,
#> Hopping through the forest,
#> Scooping up the field mice,
#> And bopping them on the head.
#>
#> Down came the Good Fairy, and she said,
#> Little Bunny Foo-Foo,
#> I don't want to see you,
#> Scooping up the field mice,
#> And bopping them on the head.
#>
#> I'll give you three chances,
#> And if you don't behave,
#> I'm gonna turn you into a goon!
#>
#> Little Bunny Foo-Foo,
#> Hopping through the forest,
#> Scooping up the field mice,
#> And bopping them on the head.
#>
#> Down came the Good Fairy, and she said,
#> Little Bunny Foo-Foo,
#> I don't want to see you,
#> Scooping up the field mice,
#> And bopping them on the head.
#>
#> I'll give you two chances,
#> And if you don't behave,
#> I'm gonna turn you into a goon!
#>
#> Little Bunny Foo-Foo,
#> Hopping through the forest,
#> Scooping up the field mice,
#> And bopping them on the head.
#>
#> Down came the Good Fairy, and she said,
#> Little Bunny Foo-Foo,
#> I don't want to see you,
#> Scooping up the field mice,
#> And bopping them on the head.
#>
#> I'll give you one chance,
#> And if you don't behave,
#> I'm gonna turn you into a goon!
#>
#> Little Bunny Foo-Foo,
#> Hopping through the forest,
#> Scooping up the field mice,
#> And bopping them on the head.
#>
#> Down came the Good Fairy, and she said,
#> I gave you three chances,
#> And you didn't behave,
#> And now I'm gonna turn you into a goon. POOF!
Read the source code for each of the following three functions, puzzle out what they do, and then brainstorm better names.
f1 <- function(string, prefix) {
substr(string, 1, nchar(prefix)) == prefix
}
f2 <- function(x) {
if (length(x) <= 1) return(NULL)
x[-length(x)]
}
f3 <- function(x, y) {
rep(y, length.out = length(x))
}
The first function f1
checks if the substring of string matches the prefix given.
match_prefix <- function(string, prefix) {
substr(string, 1, nchar(prefix)) == prefix
}
match_prefix("===The quick brown fox", "===")
#> [1] TRUE
match_prefix("```The quick brown fox", "===")
#> [1] FALSE
match_prefix(c("the man in the bowler hat",
"that quick brown fox",
"the one and only",
"that crafy ol' fox"),
"the")
#> [1] TRUE FALSE TRUE FALSE
The function removes the last item of vector x.
remove_last <- function(x) {
if (length(x) <= 1) return(NULL)
x[-length(x)]
}
remove_last(c(1:5, 2, 30, 45, NA))
#> [1] 1 2 3 4 5 2 30 45
remove_last(c(NA, 2, 30, 45, 1:5))
#> [1] NA 2 30 45 1 2 3 4
The function repeats the y vector as many times as the length of vector x.
rep_len <- function(x, y) {
rep(y, length.out = length(x))
}
rep_len(c(1:10), 1)
#> [1] 1 1 1 1 1 1 1 1 1 1
rep_len(c("a", "b", "abc", "abcd", "efghi"),
c("a", "b", "abc"))
#> [1] "a" "b" "abc" "a" "b"
Take a function that you’ve written recently and spend 5 minutes brainstorming a better name for it and its arguments.
Compare and contrast rnorm()
and MASS::mvrnorm()
. How could you make them more consistent?
rnorm() | MASS::mvrnorm() |
---|---|
univariate normal distribution | multivariate normal distribution |
n, mean, sd | n, mu, Sigma |
You could make mean / mu and sd / Sigma consistent by choosing one of these names. I would also make all names lower case so I would change Sigma to sigma.
Make a case for why norm_r()
, norm_d()
etc would be better than rnorm()
, dnorm()
. Make a case for the opposite.
For norm_r()
, norm_d()
the fact that you can type norm
and the autocomplete will give you a list of all norm_
functions is a win.
For rnorm()
, dnorm()
I would say it is more natural to say random normal distribution / density normal distribution than “normal random distribution” / “normal density distribution”.
An if
statement allows conditional execution of code.
if (condition) {
# code executed when condition is TRUE
} else {
# code executed when condition is FALSE
}
To get help on if
you need to surround it in backticks: ?`if`
.
The condition
must evaluate to either TRUE
or FALSE
. You will get a warning / error if it is a vector OR if it’s an NA
.
Use ||
(or) and &&
(and) to combine multiple logical expressions. These operators are “short-circuiting”
NEVER use |
or &
in an if
statement, since these are vectorised operations.
==
is also vectorised, which means you may get more than one output!
any()
or all()
.identical()
, BUT this is really strict and can result in unexpected output given how computers store numbers etc. Use dplyr::near()
for comparisons.is.na()
for NA checks.if (c(TRUE, FALSE)) {}
#> NULL
if (NA) {}
#> Error in if (NA) {: missing value where TRUE/FALSE needed
identical(0L, 0)
#> [1] FALSE
x <- sqrt(2) ^ 2
x
#> [1] 2
x == 2
#> [1] FALSE
x - 2
#> [1] 0.0000000000000004440892
Multiple if statements are also allowed.
if (this) {
# do that
} else if (that) {
# do something else
} else {
#
}
Another useful technique is the switch()
function for many if ... else
constructs.
perform_calc <- function(x, y, op) {
switch(op,
plus = x + y,
minus = x - y,
times = x * y,
divide = x / y,
stop("Unknown op!")
)
}
perform_calc(c(1,2,3), c(4,5,6), "plus")
#> [1] 5 7 9
perform_calc(c(1,2,3), c(4,5), "minus")
#> Warning in x - y: longer object length is not a multiple of shorter object
#> length
#> [1] -3 -3 -1
perform_calc(c(1,2,3), c(4,5,6), "power")
#> Error in perform_calc(c(1, 2, 3), c(4, 5, 6), "power"): Unknown op!
What’s the difference between if
and ifelse()
? Carefully read the help and construct three examples that illustrate the key differences.
if
tests just one item, so it has to be used in a loop or with any()
/ all()
if you want to ensure all in a vector meet a condition or any meet a condition.
ifelse
checks each item in the vector.
If you read the help [?ifelse
] it says use if
/else
construct for simple yes/no answers, and use ifelse
for over a set of values.
x <- c(1:5, 20, 25, 30)
if(length(x) > 10){
print(length(x))
TRUE
} else {
print(length(x))
FALSE
}
#> [1] 8
#> [1] FALSE
# You can use ifelse but kinda silly to use in this case
ifelse(length(x)>10, length(x), "smaller than 10 items in vector")
#> [1] "smaller than 10 items in vector"
# here is a better use
ifelse(x > 5, TRUE, FALSE)
#> [1] FALSE FALSE FALSE FALSE FALSE TRUE TRUE TRUE
# DOES NOT work for if --- notice the warning that only
# 1st is used
if (x > 5) TRUE else FALSE
#> Warning in if (x > 5) TRUE else FALSE: the condition has length > 1 and only the
#> first element will be used
#> [1] FALSE
Write a greeting function that says “good morning”, “good afternoon”, or “good evening”, depending on the time of day. (Hint: use a time argument that defaults to lubridate::now()
. That will make it easier to test your function.)
greeting <- function(){
time <- lubridate::now()
hr <- lubridate::hour(time)
if(hr >= 0 && hr < 12){
"good morning"
} else if (hr >= 12 && hr < 18){
"good afternoon"
} else {
"good evening"
}
}
greeting()
#> [1] "good afternoon"
Implement a fizzbuzz
function. It takes a single number as input. If the number is divisible by three, it returns “fizz”. If it’s divisible by five it returns “buzz”. If it’s divisible by three and five, it returns “fizzbuzz”. Otherwise, it returns the number. Make sure you first write working code before you create the function.
fizzbuzz <- function(x){
if((x %% 3 == 0) && (x %% 5 == 0)){
"fizzbuzz"
} else if (x %% 3 == 0) {
"fizz"
} else if (x %% 5 == 0){
"buzz"
} else {
x
}
}
fizzbuzz(5)
#> [1] "buzz"
fizzbuzz(9)
#> [1] "fizz"
fizzbuzz(30)
#> [1] "fizzbuzz"
fizzbuzz(10)
#> [1] "buzz"
How could you use cut()
to simplify this set of nested if-else statements?
temp <- 31
if (temp <= 0) {
"freezing"
} else if (temp <= 10) {
"cold"
} else if (temp <= 20) {
"cool"
} else if (temp <= 30) {
"warm"
} else {
"hot"
}
#> [1] "hot"
cut(temp, breaks = 10*(-3:10),
labels = c(rep("freezing", 3),
"cold",
"cool",
"warm",
rep("hot", 7)))
#> [1] hot
#> Levels: freezing cold cool warm hot
table(cut(temp, breaks = 10*(-3:10),
labels = c(rep("freezing", 3),
"cold",
"cool",
"warm",
rep("hot", 7))))
#>
#> freezing cold cool warm hot
#> 0 0 0 0 1
How would you change the call to cut()
if I’d used <
instead of <=
? What is the other chief advantage of cut()
for this problem? (Hint: what happens if you have many values in temp
?)
You would use right = FALSE
. cut
allows vectors of values whereas if else
constructs do not.
NOTE: If you look at jrnold’s solutions you note that he uses -Inf
and Inf
for the breaks. This is much better than my hacky solution above that contains boundaries at the bottom and top.
What happens if you use switch()
with numeric values?
Read more here.
switch
evaluates the EXPR
against the list item number. Below we have 3 and this evaluates to list item == 3, hence 6.
x <- 3
switch(x,
"2",
"4",
"6",
"8"
)
#> [1] "6"
What does this switch()
call do? What happens if x
is “e”?
If x is a or b, it returns “ab”; if x is c or d it returns cd. If x is e, no match is found and nothing is returned.
The docs say to have an unnamed value as default after the named values for no match ones. This can be done as per the last way - note the test.
x <- "c"
switch(x,
a = ,
b = "ab",
c = ,
d = "cd"
)
x <- "e"
switch(x,
a = ,
b = "ab",
c = ,
d = "cd"
)
x <- "e"
switch(x,
a = ,
b = "ab",
c = ,
d = "cd"
"test"
)
Experiment, then carefully read the documentation.
The data
to apply a function to should come first, followed by the arguments. Arguments should have default values as far as possible.
# Compute confidence interval around mean using normal approximation
mean_ci <- function(x, conf = 0.95) {
se <- sd(x) / sqrt(length(x))
alpha <- 1 - conf
mean(x) + se * qnorm(c(alpha / 2, 1 - alpha / 2))
}
x <- runif(100)
mean_ci(x)
#> [1] 0.4505862 0.5586331
mean_ci(x, conf = 0.99)
#> [1] 0.4336108 0.5756085
# Compute confidence interval around mean using normal approximation
mean_ci <- function(x, conf = 0.95) {
se <- sd(x) / sqrt(length(x))
alpha <- 1 - conf
mean(x) + se * qnorm(c(alpha / 2, 1 - alpha / 2))
}
x <- runif(100)
mean_ci(x)
#> [1] 0.4187221 0.5271327
mean_ci(x, conf = 0.99)
#> [1] 0.4016896 0.5441652
It’s good practice to check important preconditions, and throw an error (with stop()
), if they are not met.
wt_mean <- function(x, w) {
if (length(x) != length(w)) {
stop("`x` and `w` must be the same length", call. = FALSE)
}
sum(w * x) / sum(w)
}
wt_mean(1:6, 1:3)
#> Error: `x` and `w` must be the same length
A useful function is the built-in stopifnot()
which checks that each argument is TRUE
, and produces a generic error message if not.
wt_mean <- function(x, w, na.rm = FALSE) {
# I want na.rm to be logical: is.logical(na.rm)
# I want length na.rm to be 1: length(na.rm) == 1
# If it is NOT that STOP!
stopifnot(is.logical(na.rm), length(na.rm) == 1)
stopifnot(length(x) == length(w))
if (na.rm) {
miss <- is.na(x) | is.na(w)
x <- x[!miss]
w <- w[!miss]
}
sum(w * x) / sum(w)
}
wt_mean(1:6, 6:1, na.rm = "foo")
#> Error in wt_mean(1:6, 6:1, na.rm = "foo"): is.logical(na.rm) is not TRUE
NOTE: When using stopifnot()
you assert what should be true rather than checking for what might be wrong.
Many functions in R take an arbitrary number of inputs by using a special argument: ...
(pronounced dot-dot-dot). You can forward ...
to other functions.
sum(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
#> [1] 55
stringr::str_c("a", "b", "c", "d", "e", "f")
#> [1] "abcdef"
commas <- function(...) stringr::str_c(..., collapse = ", ")
commas(letters[1:10])
#> [1] "a, b, c, d, e, f, g, h, i, j"
rule <- function(..., pad = "-") {
title <- paste0(...)
# getOption("width") tells you how many chars can be printed
# nchar(title) tells you how long title is
width <- getOption("width") - nchar(title) - 5
cat(title, " ", stringr::str_dup(pad, width), "\n", sep = "")
}
rule("Important output")
#> Important output -----------------------------------------------------------
x <- c(1, 2)
sum(x, na.mr = TRUE)
#> [1] 4
Notice the result above is incorrect as well, since it sums 1 + 2 + TRUE
= 4, but it should be 3!
What does commas(letters, collapse = "-")
do? Why?
It throws an error. The reason is that both letters
and collapse
are passed into commas in the ...
argument.
commas(letters, collapse = "-")
#> Error in stringr::str_c(..., collapse = ", "): formal argument "collapse" matched by multiple actual arguments
It’d be nice if you could supply multiple characters to the pad
argument, e.g. rule("Title", pad = "-+")
. Why doesn’t this currently work? How could you fix it?
rule <- function(..., pad = "-") {
title <- paste0(...)
# getOption("width") tells you how many chars can be printed
# nchar(title) tells you how long title is
width <- getOption("width") - nchar(title) - 5
cat(title, " ", stringr::str_dup(pad, width), "\n", sep = "")
}
rule("Important output", pad = "-+")
#> Important output -+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
The width of the resulting string is too long this way.
rule <- function(..., pad = "-") {
title <- paste0(...)
# getOption("width") tells you how many chars can be printed
# nchar(title) tells you how long title is
width <- getOption("width") - nchar(title) - 5
cat(title, " ", stringr::str_dup(pad, width/nchar(pad)), "\n", sep = "")
}
rule("Important output", pad = "-+")
#> Important output -+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
What does the trim
argument to mean()
do? When might you use it?
The trim
argument removes / trims a certain percentage of the observations from each side. In the help page if we consider the examples, notice that the 0 and 50 are trimmed. So the values are sorted, and then the trimming occurs - this makes sense because we don’t want to just arbitrarily chop data, we want to remove outliers.
In the second example we sort 5, 10, 11, …, 19, 20. Then 5 and 20 trimmed.
x <- c(0:10, 50)
(xm <- mean(x))
#> [1] 8.75
c(xm, mean(x, trim = 0.10))
#> [1] 8.75 5.50
mean(c(1:10))
#> [1] 5.5
mean(c(1:10)) == mean(x, trim = 0.10)
#> [1] TRUE
x <- c(10:20, 5)
(xm <- mean(x))
#> [1] 14.16667
c(xm, mean(x, trim = 0.10))
#> [1] 14.16667 14.50000
mean(c(10:19))
#> [1] 14.5
mean(c(10:19)) == mean(x, trim = 0.10)
#> [1] TRUE
The default value for the method
argument to cor()
is c("pearson", "kendall", "spearman")
. What does that mean? What value is used by default?
The cor()
function can use any one of those values, by default the first is used which is pearson.
If you want to write your own pipeable functions, think about the return value. E.g. for dplyr and tidyr the object type is the data frame.
There are two basic types of pipeable functions: transformations and side-effects.
show_missings <- function(df) {
n <- sum(is.na(df))
cat("Missing values: ", n, "\n", sep = "")
invisible(df) # return df invisibly
}
# df does not get printed but it is there
show_missings(mtcars)
#> Missing values: 0
It’s still there, it’s just not printed by default.
x <- show_missings(mtcars)
#> Missing values: 0
class(x)
#> [1] "data.frame"
dim(x)
#> [1] 32 11
head(x)
#> mpg cyl disp hp drat wt qsec vs am gear carb
#> Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
#> Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
#> Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
#> Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
#> Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
#> Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1
We can use it in a pipe:
library(dplyr)
mtcars %>%
show_missings() %>%
mutate(mpg = ifelse(mpg < 20, NA, mpg)) %>%
show_missings()
#> Missing values: 0
#> Missing values: 18
Want to see what the code for a function is?
Type the function name with no parantheses.
E.g. lm
, factorial
lm
#> function (formula, data, subset, weights, na.action, method = "qr",
#> model = TRUE, x = FALSE, y = FALSE, qr = TRUE, singular.ok = TRUE,
#> contrasts = NULL, offset, ...)
#> {
#> ret.x <- x
#> ret.y <- y
#> cl <- match.call()
#> mf <- match.call(expand.dots = FALSE)
#> m <- match(c("formula", "data", "subset", "weights", "na.action",
#> "offset"), names(mf), 0L)
#> mf <- mf[c(1L, m)]
#> mf$drop.unused.levels <- TRUE
#> mf[[1L]] <- quote(stats::model.frame)
#> mf <- eval(mf, parent.frame())
#> if (method == "model.frame")
#> return(mf)
#> else if (method != "qr")
#> warning(gettextf("method = '%s' is not supported. Using 'qr'",
#> method), domain = NA)
#> mt <- attr(mf, "terms")
#> y <- model.response(mf, "numeric")
#> w <- as.vector(model.weights(mf))
#> if (!is.null(w) && !is.numeric(w))
#> stop("'weights' must be a numeric vector")
#> offset <- model.offset(mf)
#> mlm <- is.matrix(y)
#> ny <- if (mlm)
#> nrow(y)
#> else length(y)
#> if (!is.null(offset)) {
#> if (!mlm)
#> offset <- as.vector(offset)
#> if (NROW(offset) != ny)
#> stop(gettextf("number of offsets is %d, should equal %d (number of observations)",
#> NROW(offset), ny), domain = NA)
#> }
#> if (is.empty.model(mt)) {
#> x <- NULL
#> z <- list(coefficients = if (mlm) matrix(NA_real_, 0,
#> ncol(y)) else numeric(), residuals = y, fitted.values = 0 *
#> y, weights = w, rank = 0L, df.residual = if (!is.null(w)) sum(w !=
#> 0) else ny)
#> if (!is.null(offset)) {
#> z$fitted.values <- offset
#> z$residuals <- y - offset
#> }
#> }
#> else {
#> x <- model.matrix(mt, mf, contrasts)
#> z <- if (is.null(w))
#> lm.fit(x, y, offset = offset, singular.ok = singular.ok,
#> ...)
#> else lm.wfit(x, y, w, offset = offset, singular.ok = singular.ok,
#> ...)
#> }
#> class(z) <- c(if (mlm) "mlm", "lm")
#> z$na.action <- attr(mf, "na.action")
#> z$offset <- offset
#> z$contrasts <- attr(x, "contrasts")
#> z$xlevels <- .getXlevels(mt, mf)
#> z$call <- cl
#> z$terms <- mt
#> if (model)
#> z$model <- mf
#> if (ret.x)
#> z$x <- x
#> if (ret.y)
#> z$y <- y
#> if (!qr)
#> z$qr <- NULL
#> z
#> }
#> <bytecode: 0x000000002c19a108>
#> <environment: namespace:stats>
To figure out what arguments a function takes use args(func_name)
.
args(lm)
#> function (formula, data, subset, weights, na.action, method = "qr",
#> model = TRUE, x = FALSE, y = FALSE, qr = TRUE, singular.ok = TRUE,
#> contrasts = NULL, offset, ...)
#> NULL
args(dplyr::across)
#> function (.cols = everything(), .fns = NULL, ..., .names = NULL)
#> NULL
sessionInfo()
#> R version 3.6.3 (2020-02-29)
#> Platform: x86_64-w64-mingw32/x64 (64-bit)
#> Running under: Windows 10 x64 (build 18363)
#>
#> Matrix products: default
#>
#> locale:
#> [1] LC_COLLATE=English_South Africa.1252 LC_CTYPE=English_South Africa.1252
#> [3] LC_MONETARY=English_South Africa.1252 LC_NUMERIC=C
#> [5] LC_TIME=English_South Africa.1252
#>
#> attached base packages:
#> [1] stats graphics grDevices utils datasets methods base
#>
#> other attached packages:
#> [1] magrittr_1.5 flair_0.0.2 forcats_0.5.0 stringr_1.4.0
#> [5] dplyr_1.0.2 purrr_0.3.4 readr_1.4.0 tidyr_1.1.2
#> [9] tibble_3.0.3 ggplot2_3.3.2 tidyverse_1.3.0 workflowr_1.6.2
#>
#> loaded via a namespace (and not attached):
#> [1] tidyselect_1.1.0 xfun_0.13 haven_2.3.1 colorspace_1.4-1
#> [5] vctrs_0.3.2 generics_0.0.2 htmltools_0.5.0 yaml_2.2.1
#> [9] rlang_0.4.8 later_1.0.0 pillar_1.4.6 withr_2.2.0
#> [13] glue_1.4.2 DBI_1.1.0 dbplyr_2.0.0 modelr_0.1.8
#> [17] readxl_1.3.1 lifecycle_0.2.0 munsell_0.5.0 gtable_0.3.0
#> [21] cellranger_1.1.0 rvest_0.3.6 evaluate_0.14 knitr_1.28
#> [25] ps_1.3.2 httpuv_1.5.2 fansi_0.4.1 broom_0.7.2
#> [29] Rcpp_1.0.4.6 promises_1.1.0 backports_1.1.6 scales_1.1.0
#> [33] jsonlite_1.7.1 fs_1.5.0 hms_0.5.3 digest_0.6.27
#> [37] stringi_1.5.3 rprojroot_1.3-2 grid_3.6.3 cli_2.1.0
#> [41] tools_3.6.3 crayon_1.3.4 whisker_0.4 pkgconfig_2.0.3
#> [45] ellipsis_0.3.1 xml2_1.3.2 reprex_0.3.0 lubridate_1.7.9
#> [49] assertthat_0.2.1 rmarkdown_2.4 httr_1.4.2 rstudioapi_0.11
#> [53] R6_2.4.1 git2r_0.26.1 compiler_3.6.3