Last updated: 2020-10-25
Knit directory: r4ds_book/
This reproducible R Markdown analysis was created with workflowr (version 1.6.2). The Checks tab describes the reproducibility checks that were applied when the results were created. The Past versions tab lists the development history.
Great! Since the R Markdown file has been committed to the Git repository, you know the exact version of the code that produced these results.
Great job! The global environment was empty. Objects defined in the global environment can affect the analysis in your R Markdown file in unknown ways. For reproducibility it’s best to always run the code in an empty environment.
The command set.seed(20200814) was run prior to running the code in the R Markdown file. Setting a seed ensures that any results that rely on randomness, e.g. subsampling or permutations, are reproducible.
Great job! Recording the operating system, R version, and package versions is critical for reproducibility.
Nice! There were no cached chunks for this analysis, so you can be confident that you successfully produced the results during this run.
Great job! Using relative paths to the files within your workflowr project makes it easier to run your code on other machines.
Great! You are using Git for version control. Tracking code development and connecting the code version to the results is critical for reproducibility.
The results in this page were generated with repository version 57f23a8. See the Past versions tab to see a history of the changes made to the R Markdown and HTML files.
Note that you need to be careful to ensure that all relevant files for the analysis have been committed to Git prior to generating the results (you can use wflow_publish or wflow_git_commit). workflowr only checks the R Markdown file, but you know if there are other scripts or data files that it depends on. Below is the status of the Git repository when the results were generated:
Ignored files:
Ignored: .Rproj.user/
Untracked files:
Untracked: VideoDecodeStats/
Untracked: analysis/images/
Untracked: code_snipp.txt
Note that any generated files, e.g. HTML, png, CSS, etc., are not included in this status report because it is ok for generated content to have uncommitted changes.
These are the previous versions of the repository in which changes were made to the R Markdown (analysis/ch9_tidy_data.Rmd) and HTML (docs/ch9_tidy_data.html) files. If you’ve configured a remote Git repository (see ?wflow_git_remote), click on the hyperlinks in the table below to view the files as they were in that past version.
File | Version | Author | Date | Message |
---|---|---|---|---|
Rmd | 57f23a8 | sciencificity | 2020-10-25 | added Ch9 |
options(scipen=10000)
library(tidyverse)
library(flair)
library(emo)
library(lubridate)
library(magrittr)
library(tidyquant)
theme_set(theme_tq())
Of the example tables tidyr::table1 to tidyr::table4b, only tidyr::table1 is tidy.
(
# practising the read_csv function to create table1
# just note however that table1 is in tidyr ;)
# tidyr::table1 etc.
# In all honesty, I only figured this out after "practising" :P
table1 <- read_csv("country, year, cases, population
Afghanistan, 1999, 745, 19987071
Afghanistan, 2000, 2666, 20595360
Brazil, 1999, 37737, 172006362
Brazil, 2000, 80488, 174504898
China, 1999, 212258, 1272915272
China, 2000, 213766, 1280428583")
)
# A tibble: 6 x 4
country year cases population
<chr> <dbl> <dbl> <dbl>
1 Afghanistan 1999 745 19987071
2 Afghanistan 2000 2666 20595360
3 Brazil 1999 37737 172006362
4 Brazil 2000 80488 174504898
5 China 1999 212258 1272915272
6 China 2000 213766 1280428583
table1 %>%
mutate(rate = cases/population * 10000)
# A tibble: 6 x 5
country year cases population rate
<chr> <dbl> <dbl> <dbl> <dbl>
1 Afghanistan 1999 745 19987071 0.373
2 Afghanistan 2000 2666 20595360 1.29
3 Brazil 1999 37737 172006362 2.19
4 Brazil 2000 80488 174504898 4.61
5 China 1999 212258 1272915272 1.67
6 China 2000 213766 1280428583 1.67
table1 %>%
count(year, wt=cases) # same as group_by and sum
# A tibble: 2 x 2
year n
<dbl> <dbl>
1 1999 250740
2 2000 296920
table1 %>%
group_by(year) %>%
summarise(sum(cases))
# A tibble: 2 x 2
year `sum(cases)`
<dbl> <dbl>
1 1999 250740
2 2000 296920
ggplot(table1, aes(year, cases)) +
geom_line(aes(group = country), colour = "grey50") +
geom_point(aes(colour = country)) +
scale_colour_tq()
Using prose, describe how the variables and observations are organised in each of the sample tables.
tidyr::table1
# A tibble: 6 x 4
country year cases population
<chr> <int> <int> <int>
1 Afghanistan 1999 745 19987071
2 Afghanistan 2000 2666 20595360
3 Brazil 1999 37737 172006362
4 Brazil 2000 80488 174504898
5 China 1999 212258 1272915272
6 China 2000 213766 1280428583
tidyr::table2
# A tibble: 12 x 4
country year type count
<chr> <int> <chr> <int>
1 Afghanistan 1999 cases 745
2 Afghanistan 1999 population 19987071
3 Afghanistan 2000 cases 2666
4 Afghanistan 2000 population 20595360
5 Brazil 1999 cases 37737
6 Brazil 1999 population 172006362
7 Brazil 2000 cases 80488
8 Brazil 2000 population 174504898
9 China 1999 cases 212258
10 China 1999 population 1272915272
11 China 2000 cases 213766
12 China 2000 population 1280428583
tidyr::table3
# A tibble: 6 x 3
country year rate
* <chr> <int> <chr>
1 Afghanistan 1999 745/19987071
2 Afghanistan 2000 2666/20595360
3 Brazil 1999 37737/172006362
4 Brazil 2000 80488/174504898
5 China 1999 212258/1272915272
6 China 2000 213766/1280428583
In table3, cases and population are combined in a single character column, rate.
tidyr::table4a
# A tibble: 3 x 3
country `1999` `2000`
* <chr> <int> <int>
1 Afghanistan 745 2666
2 Brazil 37737 80488
3 China 212258 213766
tidyr::table4b
# A tibble: 3 x 3
country `1999` `2000`
* <chr> <int> <int>
1 Afghanistan 19987071 20595360
2 Brazil 172006362 174504898
3 China 1272915272 1280428583
tidyr::table5
# A tibble: 6 x 4
country century year rate
* <chr> <chr> <chr> <chr>
1 Afghanistan 19 99 745/19987071
2 Afghanistan 20 00 2666/20595360
3 Brazil 19 99 37737/172006362
4 Brazil 20 00 80488/174504898
5 China 19 99 212258/1272915272
6 China 20 00 213766/1280428583
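In table5 the year is split across century and year, and cases and population are combined in rate (separated within the column by /).
Compute the rate for table2, and table4a + table4b. You will need to perform four operations:
Extract the number of TB cases per country per year.
Extract the matching population per country per year.
Divide cases by population, and multiply by 10000.
Store back in the appropriate place.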
Which representation is easiest to work with? Which is hardest? Why?
(
tbl1 <- tidyr::table2 %>%
filter(type == "cases") %>%
group_by(country, year) %>%
mutate(cases = count) %>%
ungroup() %>%
select(country, year, cases) %>%
arrange(country, year)
)
# A tibble: 6 x 3
country year cases
<chr> <int> <int>
1 Afghanistan 1999 745
2 Afghanistan 2000 2666
3 Brazil 1999 37737
4 Brazil 2000 80488
5 China 1999 212258
6 China 2000 213766
(
tbl2 <- tidyr::table2 %>%
filter(type == "population") %>%
group_by(country, year) %>%
mutate(population = count) %>%
ungroup() %>%
select(country_temp = country,
year_temp = year,
population) %>%
arrange(country_temp, year_temp)
)
# A tibble: 6 x 3
country_temp year_temp population
<chr> <int> <int>
1 Afghanistan 1999 19987071
2 Afghanistan 2000 20595360
3 Brazil 1999 172006362
4 Brazil 2000 174504898
5 China 1999 1272915272
6 China 2000 1280428583
(
tbl3 <- tbl1 %>%
bind_cols(tbl2) %>%
select(c(1:3,6)) %>%
mutate(rate = (cases / population) * 10000) %>%
arrange(country, year) %>%
select(country, year, rate) %>%
mutate(type = "rate",
count = rate) %>%
select(c(1,2,4,5))
)
# A tibble: 6 x 4
country year type count
<chr> <int> <chr> <dbl>
1 Afghanistan 1999 rate 0.373
2 Afghanistan 2000 rate 1.29
3 Brazil 1999 rate 2.19
4 Brazil 2000 rate 4.61
5 China 1999 rate 1.67
6 China 2000 rate 1.67
tidyr::table2 %>%
bind_rows(tbl3) %>%
mutate(count = round(count, 2)) %>%
arrange(country, year, type) %>%
gt::gt()
country | year | type | count |
---|---|---|---|
Afghanistan | 1999 | cases | 745.00 |
Afghanistan | 1999 | population | 19987071.00 |
Afghanistan | 1999 | rate | 0.37 |
Afghanistan | 2000 | cases | 2666.00 |
Afghanistan | 2000 | population | 20595360.00 |
Afghanistan | 2000 | rate | 1.29 |
Brazil | 1999 | cases | 37737.00 |
Brazil | 1999 | population | 172006362.00 |
Brazil | 1999 | rate | 2.19 |
Brazil | 2000 | cases | 80488.00 |
Brazil | 2000 | population | 174504898.00 |
Brazil | 2000 | rate | 4.61 |
China | 1999 | cases | 212258.00 |
China | 1999 | population | 1272915272.00 |
China | 1999 | rate | 1.67 |
China | 2000 | cases | 213766.00 |
China | 2000 | population | 1280428583.00 |
China | 2000 | rate | 1.67 |
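As a possible shortcut, the same four operations on table2 can be sketched with a join rather than bind_cols(), so rows are matched on country and year instead of relying on both pieces being sorted the same way. The names cases_tbl, pop_tbl, rate_tbl and table2_with_rate are made up for this sketch; it is one way to do it, not the only one.

```r
library(dplyr)
library(tidyr)  # provides the table2 example data

# 1. cases per country per year
cases_tbl <- table2 %>%
  filter(type == "cases") %>%
  select(country, year, cases = count)

# 2. matching population per country per year
pop_tbl <- table2 %>%
  filter(type == "population") %>%
  select(country, year, population = count)

# 3. match rows on the keys, then compute the rate per 10,000 people
rate_tbl <- inner_join(cases_tbl, pop_tbl, by = c("country", "year")) %>%
  transmute(country, year, type = "rate",
            count = cases / population * 10000)

# 4. store back in table2's long layout (count becomes a double)
table2_with_rate <- table2 %>%
  mutate(count = as.double(count)) %>%
  bind_rows(rate_tbl) %>%
  arrange(country, year, type)
```

Joining on explicit keys means a missing or reordered row would show up as a dropped or mismatched row rather than silently pairing the wrong cases with the wrong population.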
(
tbl1_cases <- tidyr::table4a %>%
select(country, `1999`) %>%
mutate(year = 1999,
cases = `1999`) %>%
select(country, year, cases)
)
# A tibble: 3 x 3
country year cases
<chr> <dbl> <int>
1 Afghanistan 1999 745
2 Brazil 1999 37737
3 China 1999 212258
(
tbl2_cases <- tidyr::table4a %>%
select(country, "2000") %>%
mutate(year = 2000,
cases = `2000`) %>%
select(country, year, cases)
)
# A tibble: 3 x 3
country year cases
<chr> <dbl> <int>
1 Afghanistan 2000 2666
2 Brazil 2000 80488
3 China 2000 213766
(
tbl_cases <- tbl1_cases %>%
bind_rows(tbl2_cases) %>%
arrange(country, year)
)
# A tibble: 6 x 3
country year cases
<chr> <dbl> <int>
1 Afghanistan 1999 745
2 Afghanistan 2000 2666
3 Brazil 1999 37737
4 Brazil 2000 80488
5 China 1999 212258
6 China 2000 213766
(
tbl1_pop <- tidyr::table4b %>%
select(country, `1999`) %>%
mutate(year = 1999,
population = `1999`) %>%
select(country, year, population)
)
# A tibble: 3 x 3
country year population
<chr> <dbl> <int>
1 Afghanistan 1999 19987071
2 Brazil 1999 172006362
3 China 1999 1272915272
(
tbl2_pop <- tidyr::table4b %>%
select(country, "2000") %>%
mutate(year = 2000,
population = `2000`) %>%
select(country, year, population)
)
# A tibble: 3 x 3
country year population
<chr> <dbl> <int>
1 Afghanistan 2000 20595360
2 Brazil 2000 174504898
3 China 2000 1280428583
(
tbl_pop <- tbl1_pop %>%
bind_rows(tbl2_pop) %>%
arrange(country, year)
)
# A tibble: 6 x 3
country year population
<chr> <dbl> <int>
1 Afghanistan 1999 19987071
2 Afghanistan 2000 20595360
3 Brazil 1999 172006362
4 Brazil 2000 174504898
5 China 1999 1272915272
6 China 2000 1280428583
(
tbl_rate <- tbl_cases %>%
bind_cols(tbl_pop) %>%
janitor::clean_names() %>%
select(country = country_1,
year = year_2,
cases, population) %>%
mutate(rate = cases / population * 10000)
)
# A tibble: 6 x 5
country year cases population rate
<chr> <dbl> <int> <int> <dbl>
1 Afghanistan 1999 745 19987071 0.373
2 Afghanistan 2000 2666 20595360 1.29
3 Brazil 1999 37737 172006362 2.19
4 Brazil 2000 80488 174504898 4.61
5 China 1999 212258 1272915272 1.67
6 China 2000 213766 1280428583 1.67
(
tbl_1999 <- tbl_rate %>%
select(country, year, rate) %>%
filter(year == 1999) %>%
mutate(`1999` = rate) %>%
select(country, `1999`)
)
# A tibble: 3 x 2
country `1999`
<chr> <dbl>
1 Afghanistan 0.373
2 Brazil 2.19
3 China 1.67
(
tbl_2000 <- tbl_rate %>%
select(country, year, rate) %>%
filter(year == 2000) %>%
mutate(`2000` = rate) %>%
select(country_temp = country, `2000`)
)
# A tibble: 3 x 2
country_temp `2000`
<chr> <dbl>
1 Afghanistan 1.29
2 Brazil 4.61
3 China 1.67
(
tbl_4c <-
tbl_1999 %>%
bind_cols(tbl_2000) %>%
select(country, `1999`, `2000`)
)
# A tibble: 3 x 3
country `1999` `2000`
<chr> <dbl> <dbl>
1 Afghanistan 0.373 1.29
2 Brazil 2.19 4.61
3 China 1.67 1.67
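For table4a + table4b, a shorter route might be to join the two wide tables on country and divide the matching year columns directly. This is only a sketch: table4c_alt and the suffixes _cases and _pop are names chosen here for illustration.

```r
library(dplyr)
library(tidyr)  # provides table4a and table4b

# Both tables have columns `1999` and `2000`, so the join adds the
# suffixes to tell the cases columns apart from the population columns.
table4c_alt <- inner_join(table4a, table4b, by = "country",
                          suffix = c("_cases", "_pop")) %>%
  transmute(
    country,
    `1999` = `1999_cases` / `1999_pop` * 10000,
    `2000` = `2000_cases` / `2000_pop` * 10000
  )

table4c_alt
```

This keeps the wide layout of table4a and table4b, which is handy for storing the result "back in the appropriate place", but it still needs one line per year column.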
Recreate the plot showing change in cases over time using table2 instead of table1. What do you need to do first?
tidyr::table1
# A tibble: 6 x 4
country year cases population
<chr> <int> <int> <int>
1 Afghanistan 1999 745 19987071
2 Afghanistan 2000 2666 20595360
3 Brazil 1999 37737 172006362
4 Brazil 2000 80488 174504898
5 China 1999 212258 1272915272
6 China 2000 213766 1280428583
ggplot(table1, aes(year, cases)) +
geom_line(aes(group = country), colour = "grey50") +
geom_point(aes(colour = country)) +
scale_colour_tq()
tidyr::table2
# A tibble: 12 x 4
country year type count
<chr> <int> <chr> <int>
1 Afghanistan 1999 cases 745
2 Afghanistan 1999 population 19987071
3 Afghanistan 2000 cases 2666
4 Afghanistan 2000 population 20595360
5 Brazil 1999 cases 37737
6 Brazil 1999 population 172006362
7 Brazil 2000 cases 80488
8 Brazil 2000 population 174504898
9 China 1999 cases 212258
10 China 1999 population 1272915272
11 China 2000 cases 213766
12 China 2000 population 1280428583
table2 %>%
filter(type == "cases") %>%
ggplot(aes(year, count)) +
geom_line(aes(group = country), colour = "grey50") +
geom_point(aes(colour = country)) +
scale_colour_tq()
Sometimes you will have to resolve one of two common problems:
One variable might be spread across multiple columns.
One observation might be scattered across multiple rows.
pivot_longer() makes datasets longer by increasing the number of rows and decreasing the number of columns.
table4a
# A tibble: 3 x 3
country `1999` `2000`
* <chr> <int> <int>
1 Afghanistan 745 2666
2 Brazil 37737 80488
3 China 212258 213766
table4a %>%
# gather(list out columns you want to gather like dplyr::select() style,
# key = what do you want to call the column
# these column names go into,
# value = the values of the columns will go here)
gather(`1999`, `2000`,
key = "year",
value = "cases" )
# A tibble: 6 x 3
country year cases
<chr> <chr> <int>
1 Afghanistan 1999 745
2 Brazil 1999 37737
3 China 1999 212258
4 Afghanistan 2000 2666
5 Brazil 2000 80488
6 China 2000 213766
(tidy_4a <- table4a %>%
# cols = list the columns you want to pivot
# names_to = what will you call the new column these
# column names go into
# values_to = the values in the columns will go here
pivot_longer(cols = c(`1999`, `2000`),
names_to = "year",
values_to = "cases"))
# A tibble: 6 x 3
country year cases
<chr> <chr> <int>
1 Afghanistan 1999 745
2 Afghanistan 2000 2666
3 Brazil 1999 37737
4 Brazil 2000 80488
5 China 1999 212258
6 China 2000 213766
table4b
# A tibble: 3 x 3
country `1999` `2000`
* <chr> <int> <int>
1 Afghanistan 19987071 20595360
2 Brazil 172006362 174504898
3 China 1272915272 1280428583
table4b %>%
gather(`1999`, `2000`,
key = "year",
value = "population")
# A tibble: 6 x 3
country year population
<chr> <chr> <int>
1 Afghanistan 1999 19987071
2 Brazil 1999 172006362
3 China 1999 1272915272
4 Afghanistan 2000 20595360
5 Brazil 2000 174504898
6 China 2000 1280428583
(tidy_4b <- table4b %>%
pivot_longer(cols = c(`1999`, `2000`),
names_to = "year",
values_to = "population"))
# A tibble: 6 x 3
country year population
<chr> <chr> <int>
1 Afghanistan 1999 19987071
2 Afghanistan 2000 20595360
3 Brazil 1999 172006362
4 Brazil 2000 174504898
5 China 1999 1272915272
6 China 2000 1280428583
left_join(tidy_4a, tidy_4b) %>%
arrange(country, year)
# A tibble: 6 x 4
country year cases population
<chr> <chr> <int> <int>
1 Afghanistan 1999 745 19987071
2 Afghanistan 2000 2666 20595360
3 Brazil 1999 37737 172006362
4 Brazil 2000 80488 174504898
5 China 1999 212258 1272915272
6 China 2000 213766 1280428583
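Note that year comes out of pivot_longer() as character, and that left_join() without by = picks the join columns itself (and prints a message saying so). A small variation, assuming tidyr 1.1.0 or later for names_transform, converts year to integer during the pivot and names the keys explicitly. The object names cases_long, pop_long and tidy_4 are made up for this sketch.

```r
library(dplyr)
library(tidyr)  # provides table4a and table4b

cases_long <- table4a %>%
  pivot_longer(c(`1999`, `2000`),
               names_to = "year",
               values_to = "cases",
               # convert the former column names to integer years
               names_transform = list(year = as.integer))

pop_long <- table4b %>%
  pivot_longer(c(`1999`, `2000`),
               names_to = "year",
               values_to = "population",
               names_transform = list(year = as.integer))

# state the join keys explicitly instead of relying on the default
tidy_4 <- left_join(cases_long, pop_long, by = c("country", "year"))
tidy_4
```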
pivot_wider() is the opposite of pivot_longer(). You use it when an observation is scattered across multiple rows.
table2
# A tibble: 12 x 4
country year type count
<chr> <int> <chr> <int>
1 Afghanistan 1999 cases 745
2 Afghanistan 1999 population 19987071
3 Afghanistan 2000 cases 2666
4 Afghanistan 2000 population 20595360
5 Brazil 1999 cases 37737
6 Brazil 1999 population 172006362
7 Brazil 2000 cases 80488
8 Brazil 2000 population 174504898
9 China 1999 cases 212258
10 China 1999 population 1272915272
11 China 2000 cases 213766
12 China 2000 population 1280428583
table2 %>%
# key = column with the variable name, here `type`
spread(key = type,
# value = column with the value that will be assigned
# to new columns
value = count)
# A tibble: 6 x 4
country year cases population
<chr> <int> <int> <int>
1 Afghanistan 1999 745 19987071
2 Afghanistan 2000 2666 20595360
3 Brazil 1999 37737 172006362
4 Brazil 2000 80488 174504898
5 China 1999 212258 1272915272
6 China 2000 213766 1280428583
table2 %>%
pivot_wider(names_from = type,
values_from = count)
# A tibble: 6 x 4
country year cases population
<chr> <int> <int> <int>
1 Afghanistan 1999 745 19987071
2 Afghanistan 2000 2666 20595360
3 Brazil 1999 37737 172006362
4 Brazil 2000 80488 174504898
5 China 1999 212258 1272915272
6 China 2000 213766 1280428583
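Once table2 is widened like this, the rate from the earlier exercise becomes a single mutate(). A minimal sketch (table2_rates is a name made up here):

```r
library(dplyr)
library(tidyr)  # provides table2

table2_rates <- table2 %>%
  pivot_wider(names_from = type, values_from = count) %>%
  # cases and population now sit side by side in each row
  mutate(rate = cases / population * 10000)

table2_rates
```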
Why are pivot_longer() and pivot_wider() not perfectly symmetrical?
Carefully consider the following example:
(stocks <- tibble(
year = c(2015, 2015, 2016, 2016),
half = c( 1, 2, 1, 2),
return = c(1.88, 0.59, 0.92, 0.17)
))
# A tibble: 4 x 3
year half return
<dbl> <dbl> <dbl>
1 2015 1 1.88
2 2015 2 0.59
3 2016 1 0.92
4 2016 2 0.17
stocks %>%
pivot_wider(names_from = year, values_from = return) %>%
pivot_longer(`2015`:`2016`, names_to = "year", values_to = "return")
# A tibble: 4 x 3
half year return
<dbl> <chr> <dbl>
1 1 2015 1.88
2 1 2016 0.92
3 2 2015 0.59
4 2 2016 0.17
(Hint: look at the variable types and think about column names.)
pivot_longer() has a names_ptypes argument, e.g. names_ptypes = list(year = double()). What does it do?
# vignette("pivot")
stocks %>%
pivot_wider(names_from = year, values_from = return)
# A tibble: 2 x 3
half `2015` `2016`
<dbl> <dbl> <dbl>
1 1 1.88 0.92
2 2 0.59 0.17
Let’s have a look at the first part. Here pivot_wider() takes the values of year and turns them into column names. That means 2015 and 2016 become new columns in the tibble, and each return is placed in the matching column (2015 or 2016) against the appropriate half. Because column names are always strings, this step turns year, which was a double, into two column names, 2015 and 2016, which are “character”.
(stocks_ <- stocks %>%
pivot_wider(names_from = year, values_from = return) %>%
pivot_longer(`2015`:`2016`, names_to = "year", values_to = "return"))
# A tibble: 4 x 3
half year return
<dbl> <chr> <dbl>
1 1 2015 1.88
2 1 2016 0.92
3 2 2015 0.59
4 2 2016 0.17
colnames(stocks)
[1] "year" "half" "return"
colnames(stocks_)
[1] "half" "year" "return"
Following on from that, pivot_longer() takes these new columns and collapses them into a single year column again. The type has changed, though: the years became column names in the pivot_wider() step, so they keep their “character” nature when they are made longer again. The final result is that year started off as a double (when we created it) but ends up as character (after the pivot_wider() and pivot_longer() steps).
The columns also get rearranged, since pivot_wider() spreads the year column into 2015 and 2016, which come after half in that initial step. When we subsequently pivot_longer(), half remains the first column, followed by the names_to = column (year in this case), and finally the values_to = column (return in this case).
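To see both differences side by side, here is a small sketch that compares the column classes before and after the round trip and then undoes the changes by hand. The names roundtrip and restored are made up for this example.

```r
library(dplyr)
library(tidyr)

stocks <- tibble(
  year   = c(2015, 2015, 2016, 2016),
  half   = c(   1,    2,    1,    2),
  return = c(1.88, 0.59, 0.92, 0.17)
)

roundtrip <- stocks %>%
  pivot_wider(names_from = year, values_from = return) %>%
  pivot_longer(`2015`:`2016`, names_to = "year", values_to = "return")

# column classes, in column order, before and after
vapply(stocks, class, character(1))
vapply(roundtrip, class, character(1))

# undo the type change, the column order and the row order by hand
restored <- roundtrip %>%
  mutate(year = as.double(year)) %>%
  select(year, half, return) %>%
  arrange(year, half)

all.equal(as.data.frame(stocks), as.data.frame(restored))
```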
Q: pivot_longer() has a names_ptypes argument, e.g. names_ptypes = list(year = double()). What does it do?
Okay so upon reading the help page and the info I expected that this function would convert my character column year created after the pivot_wider() step into a double, but instead it throws an error. 😕
stocks %>%
pivot_wider(names_from = year, values_from = return) %>%
pivot_longer(`2015`:`2016`,
names_to = "year",
names_ptypes = list(year = double()),
values_to = "return"
)
Error: Can't convert <character> to <double>.
We use this to confirm that the columns we create are of the type / class we expect - so here it provides a check it seems 🤷.
To transform the column from character to double you would need to use the names_transform argument.
(stocks_ptypes <- stocks %>%
pivot_wider(names_from = year, values_from = return) %>%
pivot_longer(`2015`:`2016`,
names_to = "year",
names_transform = list(year = as.double),
values_to = "return",
# is the value column of the type expected
values_ptypes = list(return = double())
))
# A tibble: 4 x 3
half year return
<dbl> <dbl> <dbl>
1 1 2015 1.88
2 1 2016 0.92
3 2 2015 0.59
4 2 2016 0.17
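Strangely, I would have expected that transforming the column from character to double (using names_transform) and then using names_ptypes to check that the column is indeed a double would be fine. It still throws an error, which suggests that in this version of tidyr (1.1.0) the names_ptypes check runs on the original character names before names_transform is applied, so my thinking about the order was flawed.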
stocks %>%
pivot_wider(names_from = year, values_from = return) %>%
pivot_longer(`2015`:`2016`,
names_to = "year",
names_transform = list(year = as.double),
names_ptypes = list(year = double()),
values_to = "return",
# is the value column of the type expected
values_ptypes = list(return = double())
)
Error: Can't convert <character> to <double>.
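One way to get both the conversion and a type check without depending on how pivot_longer() orders the two arguments is to transform inside the pivot and assert the type afterwards. This is a workaround sketch, not what names_ptypes itself does; stocks_checked is a made-up name.

```r
library(dplyr)
library(tidyr)

stocks <- tibble(
  year   = c(2015, 2015, 2016, 2016),
  half   = c(   1,    2,    1,    2),
  return = c(1.88, 0.59, 0.92, 0.17)
)

stocks_checked <- stocks %>%
  pivot_wider(names_from = year, values_from = return) %>%
  pivot_longer(`2015`:`2016`,
               names_to = "year",
               names_transform = list(year = as.double),
               values_to = "return")

# explicit check after the pivot: stops with an error if year is not double
stopifnot(is.double(stocks_checked$year))
```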
Why does this code fail?
table4a %>%
pivot_longer(c(1999, 2000), names_to = "year", values_to = "cases")
Error: Can't subset columns that don't exist.
x Locations 1999 and 2000 don't exist.
i There are only 3 columns.
# fixing it
table4a %>%
pivot_longer(c("1999", `2000`), names_to = "year", values_to = "cases")
# A tibble: 6 x 3
country year cases
<chr> <chr> <int>
1 Afghanistan 1999 745
2 Afghanistan 2000 2666
3 Brazil 1999 37737
4 Brazil 2000 80488
5 China 1999 212258
6 China 2000 213766
1999 and 2000 are non-syntactic column names, so they have to be surrounded by backticks (``) or quotation marks (""). Without them, tidyr treats the bare numbers as column positions 1999 and 2000, which don’t exist.
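Since the selection is tidyselect-style, another option is to avoid typing the awkward names at all, either by excluding the one column that should stay or by using real positions. A short sketch (by_exclusion and by_position are names made up here):

```r
library(dplyr)
library(tidyr)  # provides table4a

# pivot everything except country
by_exclusion <- table4a %>%
  pivot_longer(-country, names_to = "year", values_to = "cases")

# positions work too, as long as they are actual column positions
by_position <- table4a %>%
  pivot_longer(2:3, names_to = "year", values_to = "cases")

all.equal(as.data.frame(by_exclusion), as.data.frame(by_position))
```

Excluding country tends to be the more robust choice, because it keeps working if more year columns are added later.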
What would happen if you widen this table? Why? How could you add a new column to uniquely identify each value?
people <- tribble(
~name, ~names, ~values,
#-----------------|--------|------
"Phillip Woods", "age", 45,
"Phillip Woods", "height", 186,
"Phillip Woods", "age", 50,
"Jessica Cordero", "age", 37,
"Jessica Cordero", "height", 156
)
You get a warning, and the result has a list-column for each of age and height, because Phillip Woods has two different age values.
people %>%
pivot_wider(names_from = names,
values_from = "values")
# A tibble: 2 x 3
name age height
<chr> <list> <list>
1 Phillip Woods <dbl [2]> <dbl [1]>
2 Jessica Cordero <dbl [1]> <dbl [1]>
people2 <- tribble(
~name, ~names, ~values,
#-----------------|--------|------
"Phillip Woods", "age", 45,
"Phillip Woods", "height", 186,
"Phillip Woods", "age2", 50, # second age gets diff col name
"Jessica Cordero", "age", 37,
"Jessica Cordero", "height", 156
)
people2 %>%
pivot_wider(names_from = names,
values_from = "values")
# A tibble: 2 x 4
name age height age2
<chr> <dbl> <dbl> <dbl>
1 Phillip Woods 45 186 50
2 Jessica Cordero 37 156 NA
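Renaming the second age to age2 works, but the question asks for a new column that uniquely identifies each value. One possible sketch is to number the repeated measurements within each name and names combination, so that name plus the counter identifies a row. The column name obs is an assumption made for this example.

```r
library(dplyr)
library(tidyr)

people <- tribble(
  ~name,             ~names,   ~values,
  "Phillip Woods",   "age",    45,
  "Phillip Woods",   "height", 186,
  "Phillip Woods",   "age",    50,
  "Jessica Cordero", "age",    37,
  "Jessica Cordero", "height", 156
)

people_wide <- people %>%
  group_by(name, names) %>%
  mutate(obs = row_number()) %>%  # 1 for the first age, 2 for the second
  ungroup() %>%
  pivot_wider(names_from = names, values_from = values)

people_wide
```

With obs in place every name, obs, names combination is unique, so the columns come back as plain doubles and the second observation for Phillip Woods simply has a missing height.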
Tidy the simple tibble below. Do you need to make it wider or longer? What are the variables?
(preg <- tribble(
~pregnant, ~male, ~female,
"yes", NA, 10,
"no", 20, 12
))
# A tibble: 2 x 3
pregnant male female
<chr> <dbl> <dbl>
1 yes NA 10
2 no 20 12
We need to make it longer. The variables are pregnant (yes or no), sex (male or female), and the count of people in each combination of pregnant and sex.
preg %>%
pivot_longer(c('male', 'female'),
names_to = 'sex',
values_to = 'count')
# A tibble: 4 x 3
pregnant sex count
<chr> <chr> <dbl>
1 yes male NA
2 yes female 10
3 no male 20
4 no female 12
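The NA for pregnant males is arguably a combination that cannot occur rather than a missing measurement. If that assumption holds, values_drop_na = TRUE can drop the row during the pivot; this is a judgment call about the data, not something the exercise requires. preg_long is a name made up for this sketch.

```r
library(dplyr)
library(tidyr)

preg <- tribble(
  ~pregnant, ~male, ~female,
  "yes",     NA,    10,
  "no",      20,    12
)

preg_long <- preg %>%
  pivot_longer(c(male, female),
               names_to = "sex",
               values_to = "count",
               values_drop_na = TRUE)  # drops the yes/male row

preg_long
```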
sessionInfo()
R version 3.6.3 (2020-02-29)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 18363)
Matrix products: default
locale:
[1] LC_COLLATE=English_South Africa.1252 LC_CTYPE=English_South Africa.1252
[3] LC_MONETARY=English_South Africa.1252 LC_NUMERIC=C
[5] LC_TIME=English_South Africa.1252
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] tidyquant_1.0.0 quantmod_0.4.17
[3] TTR_0.23-6 PerformanceAnalytics_2.0.4
[5] xts_0.12-0 zoo_1.8-7
[7] magrittr_1.5 lubridate_1.7.8
[9] emo_0.0.0.9000 flair_0.0.2
[11] forcats_0.5.0 stringr_1.4.0
[13] dplyr_1.0.0 purrr_0.3.4
[15] readr_1.3.1 tidyr_1.1.0
[17] tibble_3.0.3 ggplot2_3.3.0
[19] tidyverse_1.3.0 workflowr_1.6.2
loaded via a namespace (and not attached):
[1] httr_1.4.2 sass_0.2.0 jsonlite_1.7.0 modelr_0.1.6
[5] assertthat_0.2.1 cellranger_1.1.0 yaml_2.2.1 pillar_1.4.6
[9] backports_1.1.6 lattice_0.20-38 glue_1.4.1 quadprog_1.5-8
[13] digest_0.6.25 promises_1.1.0 checkmate_2.0.0 rvest_0.3.5
[17] snakecase_0.11.0 colorspace_1.4-1 htmltools_0.5.0 httpuv_1.5.2
[21] pkgconfig_2.0.3 broom_0.5.6 haven_2.2.0 scales_1.1.0
[25] whisker_0.4 later_1.0.0 git2r_0.26.1 generics_0.0.2
[29] farver_2.0.3 ellipsis_0.3.1 withr_2.2.0 janitor_2.0.1
[33] cli_2.0.2 crayon_1.3.4 readxl_1.3.1 evaluate_0.14
[37] fs_1.4.1 fansi_0.4.1 nlme_3.1-144 xml2_1.3.2
[41] tools_3.6.3 hms_0.5.3 lifecycle_0.2.0 munsell_0.5.0
[45] reprex_0.3.0 compiler_3.6.3 rlang_0.4.7 grid_3.6.3
[49] gt_0.2.2 rstudioapi_0.11 labeling_0.3 rmarkdown_2.4
[53] gtable_0.3.0 DBI_1.1.0 curl_4.3 R6_2.4.1
[57] knitr_1.28 utf8_1.1.4 rprojroot_1.3-2 Quandl_2.10.0
[61] stringi_1.4.6 Rcpp_1.0.4.6 vctrs_0.3.2 dbplyr_1.4.3
[65] tidyselect_1.1.0 xfun_0.13