Last updated: 2019-10-02
Knit directory: wflow-r4ds/
This reproducible R Markdown analysis was created with workflowr (version 1.4.0.9001). The Checks tab describes the reproducibility checks that were applied when the results were created. The Past versions tab lists the development history.
Great! Since the R Markdown file has been committed to the Git repository, you know the exact version of the code that produced these results.
Great job! The global environment was empty. Objects defined in the global environment can affect the analysis in your R Markdown file in unknown ways. For reproducibility it’s best to always run the code in an empty environment.
The command set.seed(20190925) was run prior to running the code in the R Markdown file. Setting a seed ensures that any results that rely on randomness, e.g. subsampling or permutations, are reproducible.
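As a small illustration of why the seed matters, the sketch below draws the same random sample twice after resetting the seed. The variable names are made up for this example; only the seed value comes from the report above.

```r
# Reset the seed before each draw so both draws use the same random stream.
set.seed(20190925)
first_draw <- sample(1:100, 5)

set.seed(20190925)
second_draw <- sample(1:100, 5)

# The two draws match because they started from the same seed.
identical(first_draw, second_draw)
```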
Great job! Recording the operating system, R version, and package versions is critical for reproducibility.
Nice! There were no cached chunks for this analysis, so you can be confident that you successfully produced the results during this run.
Great job! Using relative paths to the files within your workflowr project makes it easier to run your code on other machines.
Great! You are using Git for version control. Tracking code development and connecting the code version to the results is critical for reproducibility. The version displayed above was the version of the Git repository at the time these results were generated.
Note that you need to be careful to ensure that all relevant files for the analysis have been committed to Git prior to generating the results (you can use wflow_publish or wflow_git_commit). workflowr only checks the R Markdown file, but you know if there are other scripts or data files that it depends on. Below is the status of the Git repository when the results were generated:
Ignored files:
Ignored: .Rhistory
Ignored: .Rproj.user/
Note that any generated files, e.g. HTML, png, CSS, etc., are not included in this status report because it is ok for generated content to have uncommitted changes.
These are the previous versions of the R Markdown and HTML files. If you’ve configured a remote Git repository (see ?wflow_git_remote), click on the hyperlinks in the table below to view them.
File | Version | Author | Date | Message |
---|---|---|---|---|
html | 5472b4d | John Blischak | 2019-10-02 | Build site. |
Rmd | a23c44c | John Blischak | 2019-10-02 | Chp 8 exercises on readr |
library(tidyverse)
p. 128
read_delim() with delim = "|".
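A minimal sketch of that answer, assuming a small pipe-delimited file written to a temporary path (the file contents and the name pipe_df are invented for illustration):

```r
library(readr)

# Write a tiny pipe-delimited file to a temporary location.
path <- tempfile(fileext = ".txt")
writeLines(c("a|b|c", "1|2|3", "4|5|6"), path)

# delim = "|" tells read_delim() to split fields on the pipe character.
# col_types = cols() asks readr to guess the types without printing them.
pipe_df <- read_delim(path, delim = "|", col_types = cols())
pipe_df
```

Reading from a temporary file rather than a literal string keeps the example independent of how a given readr version treats inline data.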
Apart from file, skip, and comment, what other arguments do read_csv() and read_tsv() have in common?
intersect(names(formals(read_csv)), names(formals(read_tsv)))
[1] "file" "col_names" "col_types"
[4] "locale" "na" "quoted_na"
[7] "quote" "comment" "trim_ws"
[10] "skip" "n_max" "guess_max"
[13] "progress" "skip_empty_rows"
In fact they share all the same arguments:
identical(names(formals(read_csv)), names(formals(read_tsv)))
[1] TRUE
read_csv() and read_tsv() are both wrappers to the internal function read_delimited():
names(formals(readr:::read_delimited))
[1] "file" "tokenizer" "col_names"
[4] "col_types" "locale" "skip"
[7] "skip_empty_rows" "comment" "n_max"
[10] "guess_max" "progress"
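The same formals() approach can show what the more general read_delim() exposes beyond read_csv(). This is only a sketch: the exact set of extra arguments depends on the installed readr version, so no output is shown, and extra_args is a name chosen for this example.

```r
library(readr)

# Arguments accepted by read_delim() that read_csv() does not have.
extra_args <- setdiff(names(formals(read_delim)), names(formals(read_csv)))
extra_args
```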
What are the most important arguments to read_fwf()?
names(formals(read_fwf))
[1] "file" "col_positions" "col_types"
[4] "locale" "na" "comment"
[7] "trim_ws" "skip" "n_max"
[10] "guess_max" "progress" "skip_empty_rows"
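To make the role of these arguments concrete, here is a rough sketch that reads a fixed-width file using fwf_widths() to describe the columns. The data, the widths, and the column names are all invented for illustration.

```r
library(readr)

# Each record: 6 characters of name, 2 of age, 2 of state.
path <- tempfile(fileext = ".txt")
writeLines(c("John  25NY", "Alice 31CA"), path)

# fwf_widths() builds the column specification that col_positions expects.
fwf_df <- read_fwf(
  path,
  col_positions = fwf_widths(c(6, 2, 2), c("name", "age", "state")),
  col_types = cols()
)
fwf_df
```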
col_positions
col_positions: Column positions, as created by fwf_empty(), fwf_widths() or fwf_positions(). To read in only selected fields, use fwf_positions(). If the width of the last column is variable (a ragged fwf file), supply the last end position as NA.
Sometimes strings in a CSV file contain commas. To prevent them from causing problems they need to be surrounded by a quoting character, like " or '. By convention, read_csv() assumes that the quoting character will be ", and if you want to change it you’ll need to use read_delim() instead. What arguments do you need to specify to read the following text into a data frame?
"x,y\n1,'a,b'"
Set quote to ':
read_delim("x,y\n1,'a,b'", delim = ",", quote = "'")
# A tibble: 1 x 2
x y
<dbl> <chr>
1 1 a,b
As of readr 1.1.0 (released in March 2017), you can just use read_csv():
read_csv("x,y\n1,'a,b'", quote = "'")
# A tibble: 1 x 2
x y
<dbl> <chr>
1 1 a,b
# 2 column names but 3 columns of data
read_csv("a,b\n1,2,3\n4,5,6")
Warning: 2 parsing failures.
row col expected actual file
1 -- 2 columns 3 columns literal data
2 -- 2 columns 3 columns literal data
# A tibble: 2 x 2
a b
<dbl> <dbl>
1 1 2
2 4 5
# Each row has a different number of columns
read_csv("a,b,c\n1,2\n1,2,3,4")
Warning: 2 parsing failures.
row col expected actual file
1 -- 3 columns 2 columns literal data
2 -- 3 columns 4 columns literal data
# A tibble: 2 x 3
a b c
<dbl> <dbl> <dbl>
1 1 2 NA
2 1 2 3
# There is an opening quote in the second row but no closing quote
read_csv("a,b\n\"1")
Warning: 2 parsing failures.
row col expected actual file
1 a closing quote at end of file literal data
1 -- 2 columns 1 columns literal data
# A tibble: 1 x 2
a b
<dbl> <chr>
1 1 <NA>
# Both columns are characters b/c they contain a mix of numbers and characters
read_csv("a,b\n1,2\na,b")
# A tibble: 2 x 2
a b
<chr> <chr>
1 1 2
2 a b
# The delimiter is a `;`, so everything is in one column
read_csv("a;b\n1;3")
# A tibble: 1 x 1
`a;b`
<chr>
1 1;3
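The parsing failures reported in the warnings above can also be inspected programmatically with problems(). The sketch below assumes the first malformed example has been saved to a temporary file; the exact columns of the problems table, and how the extra field is handled, may differ between readr versions.

```r
library(readr)

# Two column names but three fields in each data row.
path <- tempfile(fileext = ".csv")
writeLines(c("a,b", "1,2,3", "4,5,6"), path)

# Reading emits a warning about parsing failures.
bad_df <- read_csv(path, col_types = cols())

# problems() returns a data frame with one row per recorded failure.
probs <- problems(bad_df)
probs
```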
p. 136
What are the most important arguments to locale()?
Seems like a pretty context-dependent question. In this chapter, they use decimal_mark to accommodate different numeric styles, date_names to format the date names according to the conventions of a specific location, and encoding to specify the encoding used by the file. I think tz for time zone would also be useful.
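A brief sketch of how these arguments change parsing. The input strings are chosen for illustration, and the variable names are hypothetical.

```r
library(readr)

# decimal_mark: treat "," as the decimal separator.
num_decimal <- parse_double("1,23", locale = locale(decimal_mark = ","))

# grouping_mark: ignore "." used to group digits.
num_grouped <- parse_number("123.456.789", locale = locale(grouping_mark = "."))

# date_names: parse French month names.
date_fr <- parse_date("1 janvier 2015", "%d %B %Y", locale = locale("fr"))

# tz: interpret a date-time without an offset in a given time zone.
dt_chicago <- parse_datetime("2019-10-02 12:00:00",
                             locale = locale(tz = "America/Chicago"))

num_decimal
num_grouped
date_fr
dt_chicago
```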
What happens if you try and set decimal_mark and grouping_mark to the same character? What happens to the default value of grouping_mark when you set decimal_mark to ","? What happens to the default value of decimal_mark when you set the grouping_mark to "."?
locale(decimal_mark = ".", grouping_mark = ".")
Error: `decimal_mark` and `grouping_mark` must be different
locale(decimal_mark = ",")$grouping_mark
[1] "."
locale(grouping_mark = ".")$decimal_mark
[1] ","
I didn’t discuss the date_format and time_format options to locale(). What do they do? Construct an example that shows when they might be useful.
The date_format can be used to parse dates that are not in the default YYYY-MM-DD format:
parse_date("01/31/2000")
Warning: 1 parsing failure.
row col expected actual
1 -- date like 01/31/2000
[1] NA
# January 31, 2000
parse_date("01/31/2000", locale = locale(date_format = "%m/%d/%Y"))
[1] "2000-01-31"
According to the readr locales vignette, the argument time_format is not used, so it is never useful. But the vignette is outdated: time_format is used in exactly the same way as date_format.
parse_time("17:55:14")
17:55:14
parse_time("5:55:14 PM")
17:55:14
# Example of a non-standard time
parse_time("h5m55s14 PM")
Warning: 1 parsing failure.
row col expected actual
1 -- time like h5m55s14 PM
NA
parse_time("h5m55s14 PM", locale = locale(time_format = "h%Hm%Ms%S %p"))
17:55:14
You can create it by passing custom arguments to locale() and saving the result. Many languages are already supported:
(es <- locale("es"))
<locale>
Numbers: 123,456.78
Formats: %AD / %AT
Timezone: UTC
Encoding: UTF-8
<date_names>
Days: domingo (dom.), lunes (lun.), martes (mar.), miércoles (mié.),
jueves (jue.), viernes (vie.), sábado (sáb.)
Months: enero (ene.), febrero (feb.), marzo (mar.), abril (abr.), mayo
(may.), junio (jun.), julio (jul.), agosto (ago.),
septiembre (sept.), octubre (oct.), noviembre (nov.),
diciembre (dic.)
AM/PM: a. m./p. m.
str(es)
List of 7
$ date_names :List of 5
..$ mon : chr [1:12] "enero" "febrero" "marzo" "abril" ...
..$ mon_ab: chr [1:12] "ene." "feb." "mar." "abr." ...
..$ day : chr [1:7] "domingo" "lunes" "martes" "miércoles" ...
..$ day_ab: chr [1:7] "dom." "lun." "mar." "mié." ...
..$ am_pm : chr [1:2] "a. m." "p. m."
..- attr(*, "class")= chr "date_names"
$ date_format : chr "%AD"
$ time_format : chr "%AT"
$ decimal_mark : chr "."
$ grouping_mark: chr ","
$ tz : chr "UTC"
$ encoding : chr "UTF-8"
- attr(*, "class")= chr "locale"
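As a rough sketch of a locale with several custom settings saved for reuse, the example below combines German date names with a comma decimal mark and a day-first date format. The particular settings and the name de_locale are assumptions made for illustration, not something taken from the exercise.

```r
library(readr)

# A reusable locale object bundling several non-default settings.
de_locale <- locale(
  "de",
  decimal_mark = ",",
  grouping_mark = ".",
  date_format = "%d.%m.%Y",
  tz = "Europe/Berlin"
)

# The saved locale can then be passed to any parser or reader.
parsed_date <- parse_date("31.01.2000", locale = de_locale)
parsed_number <- parse_number("1.234,5", locale = de_locale)

parsed_date
parsed_number
```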
What’s the difference between read_csv() and read_csv2()?
read_csv2() uses ; for the field separator and , for the decimal point. This is common in some European countries.
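A minimal sketch of read_csv2() on semicolon-separated data with a decimal comma. The file contents and the name csv2_df are invented for this example.

```r
library(readr)

# Semicolon-separated fields, with "," as the decimal mark in the first column.
path <- tempfile(fileext = ".csv")
writeLines(c("a;b", "1,5;3"), path)

# read_csv2() splits on ";" and reads "1,5" as the number 1.5.
csv2_df <- read_csv2(path, col_types = cols())
csv2_df
```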
From the online book Programming with Unicode (CC BY-SA 3.0 license), the most popular encodings on the internet are:
1st (56%): ASCII
2nd (23%): Western Europe encodings (ISO 8859-1, ISO 8859-15 and cp1252)
3rd (8%): Chinese encodings (GB2312, …)
and then come Korean (EUC-KR), Cyrillic (cp1251, KOI8-R, …), East Europe (cp1250, ISO-8859-2), Arabic (cp1256, ISO-8859-6), etc.
(UTF-8 was not used on the web in 2001)
Note that I used DuckDuckGo for the online search :-)
See ?strptime for the available conversion specifiers (not sure whether to be proud or depressed that I remembered off the top of my head that %B was the full month name).
d1 <- "January 1, 2010"
parse_date(d1, "%B %d, %Y")
[1] "2010-01-01"
# Alternatively can specify date_format via locale argument
parse_date(d1, locale = locale(date_format = "%B %d, %Y"))
[1] "2010-01-01"
d2 <- "2015-Mar-07"
parse_date(d2, "%Y-%b-%d")
[1] "2015-03-07"
d3 <- "06-Jun-2017"
parse_date(d3, "%d-%b-%Y")
[1] "2017-06-06"
d4 <- c("August 19 (2015)", "July 1 (2015)")
parse_date(d4, "%B %d (%Y)")
[1] "2015-08-19" "2015-07-01"
d5 <- "12/30/14" # Dec 30, 2014
parse_date(d5, "%m/%d/%y")
[1] "2014-12-30"
t1 <- "1705"
parse_time(t1, "%H%M")
17:05:00
t2 <- "11:15:10.12 PM"
parse_time(t2, "%H:%M:%OS %p")
23:15:10.12
# Alternatively can specify time_format via locale argument
parse_time(t2, locale = locale(time_format = ("%H:%M:%OS %p")))
23:15:10.12
%OS is strange. Apparently it is R-specific, and I couldn’t get readr to accept the decimal argument:
Specific to R is %OSn, which for output gives the seconds truncated to 0 <= n <= 6 decimal places (and if %OS is not followed by a digit, it uses the setting of getOption("digits.secs"), or if that is unset, n = 0). Further, for strptime %OS will input seconds including fractional seconds. Note that %S does not read fractional parts on output.
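To see what the quoted documentation describes, the sketch below uses base R rather than readr. It parses a time with fractional seconds using %OS and then formats it with %OS2. The time value is chosen so that the fraction is exactly representable, because truncation can otherwise shift the last printed digit.

```r
# strptime() reads fractional seconds when the format uses %OS.
parsed <- strptime("23:15:10.25", format = "%H:%M:%OS", tz = "UTC")
parsed$sec

# On output, %OS2 prints the seconds with two decimal places.
formatted <- format(parsed, "%H:%M:%OS2")
formatted
```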
sessionInfo()
R version 3.6.1 (2019-07-05)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 18.04.3 LTS
Matrix products: default
BLAS: /usr/lib/x86_64-linux-gnu/atlas/libblas.so.3.10.3
LAPACK: /usr/lib/x86_64-linux-gnu/atlas/liblapack.so.3.10.3
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] forcats_0.4.0 stringr_1.4.0 dplyr_0.8.3 purrr_0.3.2
[5] readr_1.3.1 tidyr_1.0.0 tibble_2.1.3 ggplot2_3.2.1
[9] tidyverse_1.2.1
loaded via a namespace (and not attached):
[1] Rcpp_1.0.2 cellranger_1.1.0 pillar_1.4.2
[4] compiler_3.6.1 git2r_0.26.1.9000 workflowr_1.4.0.9001
[7] tools_3.6.1 zeallot_0.1.0 digest_0.6.21
[10] lubridate_1.7.4 jsonlite_1.6 evaluate_0.14
[13] lifecycle_0.1.0 nlme_3.1-141 gtable_0.3.0
[16] lattice_0.20-38 pkgconfig_2.0.3 rlang_0.4.0
[19] cli_1.1.0 rstudioapi_0.10 yaml_2.2.0
[22] haven_2.1.1 xfun_0.9 withr_2.1.2
[25] xml2_1.2.2 httr_1.4.1 knitr_1.25
[28] hms_0.5.1 generics_0.0.2 fs_1.3.1
[31] vctrs_0.2.0 rprojroot_1.2 grid_3.6.1
[34] tidyselect_0.2.5 glue_1.3.1 R6_2.4.0
[37] fansi_0.4.0 readxl_1.3.1 rmarkdown_1.15
[40] modelr_0.1.5 magrittr_1.5 whisker_0.4
[43] backports_1.1.4 scales_1.0.0 htmltools_0.3.6
[46] rvest_0.3.4 assertthat_0.2.1 colorspace_1.4-1
[49] utf8_1.1.4 stringi_1.4.3 lazyeval_0.2.2
[52] munsell_0.5.0 broom_0.5.2 crayon_1.3.4