Chapter 9 - Tidy Data with tidyr

Last updated: 2020-10-25

These are the previous versions of the repository in which changes were made to the R Markdown (analysis/ch9_tidy_data.Rmd) and HTML (docs/ch9_tidy_data.html) files. If you’ve configured a remote Git repository (see ?wflow_git_remote), click on the hyperlinks in the table below to view the files as they were in that past version.

File	Version	Author	Date	Message
Rmd	57f23a8	sciencificity	2020-10-25	added Ch9

options(scipen=10000)
library(tidyverse)
library(flair)
library(emo)
library(lubridate)
library(magrittr)
library(tidyquant)
theme_set(theme_tq())

What’s tidy data anyway?

Each variable must have its own column.
Each observation must have its own row.
Each value must have its own cell.

Of the example tables tidyr::table1 to tidyr::table4b, only tidyr::table1 is tidy.

(
  # practising the read_csv function to create table1
  # just note however that table1 is in tidyr ;)
  # tidyr::table1 etc. 
  # In all honesty, I only figured this out after "practising" :P
  table1 <- read_csv("country, year, cases, population
                     Afghanistan, 1999, 745, 19987071
                     Afghanistan, 2000, 2666, 20595360
                     Brazil, 1999, 37737, 172006362
                     Brazil, 2000, 80488, 174504898
                     China, 1999, 212258, 1272915272
                     China, 2000, 213766, 1280428583")
)

# A tibble: 6 x 4
  country      year  cases population
  <chr>       <dbl>  <dbl>      <dbl>
1 Afghanistan  1999    745   19987071
2 Afghanistan  2000   2666   20595360
3 Brazil       1999  37737  172006362
4 Brazil       2000  80488  174504898
5 China        1999 212258 1272915272
6 China        2000 213766 1280428583

Working with tidy data

table1 %>% 
  mutate(rate = cases/population * 10000)

# A tibble: 6 x 5
  country      year  cases population  rate
  <chr>       <dbl>  <dbl>      <dbl> <dbl>
1 Afghanistan  1999    745   19987071 0.373
2 Afghanistan  2000   2666   20595360 1.29 
3 Brazil       1999  37737  172006362 2.19 
4 Brazil       2000  80488  174504898 4.61 
5 China        1999 212258 1272915272 1.67 
6 China        2000 213766 1280428583 1.67

table1 %>% 
  count(year, wt=cases) # same as group_by and sum

# A tibble: 2 x 2
   year      n
  <dbl>  <dbl>
1  1999 250740
2  2000 296920

table1 %>% 
  group_by(year) %>% 
  summarise(sum(cases))

# A tibble: 2 x 2
   year `sum(cases)`
  <dbl>        <dbl>
1  1999       250740
2  2000       296920

ggplot(table1, aes(year, cases)) +
  geom_line(aes(group = country), colour = "grey50") +
  geom_point(aes(colour = country)) +
  scale_colour_tq()

Exercises

Using prose, describe how the variables and observations are organised in each of the sample tables.

tidyr::table1

# A tibble: 6 x 4
  country      year  cases population
  <chr>       <int>  <int>      <int>
1 Afghanistan  1999    745   19987071
2 Afghanistan  2000   2666   20595360
3 Brazil       1999  37737  172006362
4 Brazil       2000  80488  174504898
5 China        1999 212258 1272915272
6 China        2000 213766 1280428583

Each column is a variable ✅
Each observation is in a row ✅
Each value is in a cell ✅
The table describes the number of cases, and the population count (each in its own column) for each country and year combination.

tidyr::table2

# A tibble: 12 x 4
   country      year type            count
   <chr>       <int> <chr>           <int>
 1 Afghanistan  1999 cases             745
 2 Afghanistan  1999 population   19987071
 3 Afghanistan  2000 cases            2666
 4 Afghanistan  2000 population   20595360
 5 Brazil       1999 cases           37737
 6 Brazil       1999 population  172006362
 7 Brazil       2000 cases           80488
 8 Brazil       2000 population  174504898
 9 China        1999 cases          212258
10 China        1999 population 1272915272
11 China        2000 cases          213766
12 China        2000 population 1280428583

Each column is a variable ❌
Each observation is in a row ❌
Each value is in a cell ✅
Each row contains either the number of cases or the population count for a country and year combination, so each observation (a country in a year) is scattered across two rows.

tidyr::table3

# A tibble: 6 x 3
  country      year rate             
* <chr>       <int> <chr>            
1 Afghanistan  1999 745/19987071     
2 Afghanistan  2000 2666/20595360    
3 Brazil       1999 37737/172006362  
4 Brazil       2000 80488/174504898  
5 China        1999 212258/1272915272
6 China        2000 213766/1280428583

Each column is a variable ❌
Each observation is in a row ✅
Each value is in a cell ❌
Each row considers the country and year combination, and shows the number of cases and the population count (separated by a /) in one variable named rate.

tidyr::table4a

# A tibble: 3 x 3
  country     `1999` `2000`
* <chr>        <int>  <int>
1 Afghanistan    745   2666
2 Brazil       37737  80488
3 China       212258 213766

Each column is a variable ❌
Each observation is in a row ❌
Each value is in a cell ✅

tidyr::table4b

# A tibble: 3 x 3
  country         `1999`     `2000`
* <chr>            <int>      <int>
1 Afghanistan   19987071   20595360
2 Brazil       172006362  174504898
3 China       1272915272 1280428583

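Each column is a variable ❌
Each observation is in a row ❌
Each value is in a cell ✅
Each table houses either the number of cases (table4a) or the population count (table4b) for each country, with a separate column for each year. The year values therefore appear as column names, and each row holds two observations.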

tidyr::table5

# A tibble: 6 x 4
  country     century year  rate             
* <chr>       <chr>   <chr> <chr>            
1 Afghanistan 19      99    745/19987071     
2 Afghanistan 20      00    2666/20595360    
3 Brazil      19      99    37737/172006362  
4 Brazil      20      00    80488/174504898  
5 China       19      99    212258/1272915272
6 China       20      00    213766/1280428583

Each column is a variable ❌
Each observation is in a row ✅
Each value is in a cell ❌
The table splits each year into separate century and year columns for each country and, like table3, combines the cases and population counts in one variable, rate (separated within the column by /).

Compute the rate for table2, and table4a + table4b. You will need to perform four operations:

Extract the number of TB cases per country per year.
Extract the matching population per country per year.
Divide cases by population, and multiply by 10000.
Store back in the appropriate place.

Which representation is easiest to work with? Which is hardest? Why?

(
  tbl1 <- tidyr::table2 %>% 
  filter(type == "cases") %>% 
  group_by(country, year) %>% 
  mutate(cases = count) %>% 
  ungroup() %>% 
  select(country, year, cases) %>% 
  arrange(country, year)
)

# A tibble: 6 x 3
  country      year  cases
  <chr>       <int>  <int>
1 Afghanistan  1999    745
2 Afghanistan  2000   2666
3 Brazil       1999  37737
4 Brazil       2000  80488
5 China        1999 212258
6 China        2000 213766

(
  tbl2 <- tidyr::table2 %>% 
  filter(type == "population") %>% 
  group_by(country, year) %>% 
  mutate(population = count) %>% 
  ungroup() %>% 
  select(country_temp = country, 
         year_temp = year, 
         population) %>% 
  arrange(country_temp, year_temp)
)

# A tibble: 6 x 3
  country_temp year_temp population
  <chr>            <int>      <int>
1 Afghanistan       1999   19987071
2 Afghanistan       2000   20595360
3 Brazil            1999  172006362
4 Brazil            2000  174504898
5 China             1999 1272915272
6 China             2000 1280428583

(
tbl3 <- tbl1 %>% 
  bind_cols(tbl2) %>% 
  select(c(1:3,6)) %>% 
  mutate(rate = (cases / population) * 10000) %>% 
  arrange(country, year) %>% 
  select(country, year, rate) %>% 
  mutate(type = "rate",
         count = rate) %>% 
  select(c(1,2,4,5))
)

# A tibble: 6 x 4
  country      year type  count
  <chr>       <int> <chr> <dbl>
1 Afghanistan  1999 rate  0.373
2 Afghanistan  2000 rate  1.29 
3 Brazil       1999 rate  2.19 
4 Brazil       2000 rate  4.61 
5 China        1999 rate  1.67 
6 China        2000 rate  1.67

tidyr::table2 %>% 
  bind_rows(tbl3) %>% 
  mutate(count = round(count, 2)) %>% 
  arrange(country, year, type) %>% 
  gt::gt()

country	year	type	count
Afghanistan	1999	cases	745.00
Afghanistan	1999	population	19987071.00
Afghanistan	1999	rate	0.37
Afghanistan	2000	cases	2666.00
Afghanistan	2000	population	20595360.00
Afghanistan	2000	rate	1.29
Brazil	1999	cases	37737.00
Brazil	1999	population	172006362.00
Brazil	1999	rate	2.19
Brazil	2000	cases	80488.00
Brazil	2000	population	174504898.00
Brazil	2000	rate	4.61
China	1999	cases	212258.00
China	1999	population	1272915272.00
China	1999	rate	1.67
China	2000	cases	213766.00
China	2000	population	1280428583.00
China	2000	rate	1.67
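
The four steps for table2 can also be sketched more compactly. This is an alternative to the step-by-step approach above, not a replacement for it. It assumes pivot_wider() and pivot_longer(), which are covered later in this chapter, and the name table2_rate is only introduced here for illustration.

```r
library(dplyr)
library(tidyr)

# Alternative sketch: widen table2 so cases and population sit side by side,
# compute the rate, then return to table2's long layout.
table2_rate <- tidyr::table2 %>%
  pivot_wider(names_from = type, values_from = count) %>%
  mutate(rate = cases / population * 10000) %>%
  pivot_longer(cols = c(cases, population, rate),
               names_to = "type",
               values_to = "count")

table2_rate
```

Widening first avoids relying on the row order of two separately filtered tables, which is what bind_cols() depends on.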

(
  tbl1_cases <- tidyr::table4a %>% 
  select(country, `1999`) %>% 
  mutate(year = 1999,
         cases  = `1999`) %>% 
  select(country, year, cases)
)

# A tibble: 3 x 3
  country      year  cases
  <chr>       <dbl>  <int>
1 Afghanistan  1999    745
2 Brazil       1999  37737
3 China        1999 212258

(
  tbl2_cases <- tidyr::table4a %>% 
  select(country, "2000") %>% 
  mutate(year = 2000, 
         cases = `2000`) %>% 
  select(country, year, cases)
)

# A tibble: 3 x 3
  country      year  cases
  <chr>       <dbl>  <int>
1 Afghanistan  2000   2666
2 Brazil       2000  80488
3 China        2000 213766

(
  tbl_cases <- tbl1_cases %>% 
  bind_rows(tbl2_cases) %>% 
  arrange(country, year)
)

# A tibble: 6 x 3
  country      year  cases
  <chr>       <dbl>  <int>
1 Afghanistan  1999    745
2 Afghanistan  2000   2666
3 Brazil       1999  37737
4 Brazil       2000  80488
5 China        1999 212258
6 China        2000 213766

(
  tbl1_pop <- tidyr::table4b %>% 
  select(country, `1999`) %>% 
  mutate(year = 1999,
         population  = `1999`) %>% 
  select(country, year, population)
)

# A tibble: 3 x 3
  country      year population
  <chr>       <dbl>      <int>
1 Afghanistan  1999   19987071
2 Brazil       1999  172006362
3 China        1999 1272915272

(
  tbl2_pop <- tidyr::table4b %>% 
  select(country, "2000") %>% 
  mutate(year = 2000, 
         population = `2000`) %>% 
  select(country, year, population)
)

# A tibble: 3 x 3
  country      year population
  <chr>       <dbl>      <int>
1 Afghanistan  2000   20595360
2 Brazil       2000  174504898
3 China        2000 1280428583

(
  tbl_pop <- tbl1_pop %>% 
  bind_rows(tbl2_pop) %>% 
  arrange(country, year)
)

# A tibble: 6 x 3
  country      year population
  <chr>       <dbl>      <int>
1 Afghanistan  1999   19987071
2 Afghanistan  2000   20595360
3 Brazil       1999  172006362
4 Brazil       2000  174504898
5 China        1999 1272915272
6 China        2000 1280428583

(
  tbl_rate <- tbl_cases %>% 
    bind_cols(tbl_pop) %>% 
    janitor::clean_names() %>% 
    select(country = country_1,
           year = year_2,
           cases, population) %>% 
    mutate(rate = cases / population * 10000)
)

# A tibble: 6 x 5
  country      year  cases population  rate
  <chr>       <dbl>  <int>      <int> <dbl>
1 Afghanistan  1999    745   19987071 0.373
2 Afghanistan  2000   2666   20595360 1.29 
3 Brazil       1999  37737  172006362 2.19 
4 Brazil       2000  80488  174504898 4.61 
5 China        1999 212258 1272915272 1.67 
6 China        2000 213766 1280428583 1.67

(
  tbl_1999 <- tbl_rate %>% 
    select(country, year, rate) %>% 
    filter(year == 1999) %>% 
    mutate(`1999` = rate) %>% 
    select(country, `1999`)
)

# A tibble: 3 x 2
  country     `1999`
  <chr>        <dbl>
1 Afghanistan  0.373
2 Brazil       2.19 
3 China        1.67

(
  tbl_2000 <- tbl_rate %>% 
    select(country, year, rate) %>% 
    filter(year == 2000) %>% 
    mutate(`2000` = rate) %>% 
    select(country_temp = country, `2000`)
)

# A tibble: 3 x 2
  country_temp `2000`
  <chr>         <dbl>
1 Afghanistan    1.29
2 Brazil         4.61
3 China          1.67

(
  tbl_4c <- 
    tbl_1999 %>% 
    bind_cols(tbl_2000) %>% 
    select(country, `1999`, `2000`)
)

# A tibble: 3 x 3
  country     `1999` `2000`
  <chr>        <dbl>  <dbl>
1 Afghanistan  0.373   1.29
2 Brazil       2.19    4.61
3 China        1.67    1.67
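
For table4a + table4b, a shorter sketch is possible because both tables list the same countries in the same order, so the year columns can be divided element-wise. This is an alternative under that assumption (checked with stopifnot()), and table4_rate is a name made up for this illustration.

```r
library(dplyr)
library(tidyr)

# Guard the assumption that both tables have matching country rows.
stopifnot(identical(tidyr::table4a$country, tidyr::table4b$country))

# Divide cases (table4a) by population (table4b), one year column at a time.
table4_rate <- tibble(
  country = tidyr::table4a$country,
  `1999`  = tidyr::table4a$`1999` / tidyr::table4b$`1999` * 10000,
  `2000`  = tidyr::table4a$`2000` / tidyr::table4b$`2000` * 10000
)

table4_rate
```

If the row order of the two tables could differ, joining on country first would be the safer choice.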

Recreate the plot showing change in cases over time using table2 instead of table1. What do you need to do first?

tidyr::table1

# A tibble: 6 x 4
  country      year  cases population
  <chr>       <int>  <int>      <int>
1 Afghanistan  1999    745   19987071
2 Afghanistan  2000   2666   20595360
3 Brazil       1999  37737  172006362
4 Brazil       2000  80488  174504898
5 China        1999 212258 1272915272
6 China        2000 213766 1280428583

ggplot(table1, aes(year, cases)) +
  geom_line(aes(group = country), colour = "grey50") +
  geom_point(aes(colour = country)) +
  scale_colour_tq()

tidyr::table2

# A tibble: 12 x 4
   country      year type            count
   <chr>       <int> <chr>           <int>
 1 Afghanistan  1999 cases             745
 2 Afghanistan  1999 population   19987071
 3 Afghanistan  2000 cases            2666
 4 Afghanistan  2000 population   20595360
 5 Brazil       1999 cases           37737
 6 Brazil       1999 population  172006362
 7 Brazil       2000 cases           80488
 8 Brazil       2000 population  174504898
 9 China        1999 cases          212258
10 China        1999 population 1272915272
11 China        2000 cases          213766
12 China        2000 population 1280428583

table2 %>% 
  filter(type == "cases") %>% 
  ggplot(aes(year, count)) +
  geom_line(aes(group = country), colour = "grey50") +
  geom_point(aes(colour = country)) +
  scale_colour_tq()

Sometimes you will have to resolve one of two common problems:

One variable might be spread across multiple columns.
One observation might be scattered across multiple rows.

Pivot Longer / Gather

pivot_longer() makes datasets longer by increasing the number of rows and decreasing the number of columns.

table4a


# A tibble: 3 x 3
  country     `1999` `2000`
* <chr>        <int>  <int>
1 Afghanistan    745   2666
2 Brazil       37737  80488
3 China       212258 213766

table4a %>% 
  # gather(list out columns you want to gather like dplyr::select() style,
  #        key = what do you want to call the column
  #              these column names go into,
  #        value = the values of the columns will go here)
  gather(`1999`, `2000`, 
         key = "year",
         value = "cases" )


# A tibble: 6 x 3
  country     year   cases
  <chr>       <chr>  <int>
1 Afghanistan 1999     745
2 Brazil      1999   37737
3 China       1999  212258
4 Afghanistan 2000    2666
5 Brazil      2000   80488
6 China       2000  213766

(tidy_4a <- table4a %>% 
  # cols = list the columns you want to pivot
  # names_to = what will you call the new column these
  #            column names go into
  # values_to = the values in the columns will go here
  pivot_longer(cols = c(`1999`, `2000`),
               names_to = "year",
               values_to = "cases"))


# A tibble: 6 x 3
  country     year   cases
  <chr>       <chr>  <int>
1 Afghanistan 1999     745
2 Afghanistan 2000    2666
3 Brazil      1999   37737
4 Brazil      2000   80488
5 China       1999  212258
6 China       2000  213766

table4b


# A tibble: 3 x 3
  country         `1999`     `2000`
* <chr>            <int>      <int>
1 Afghanistan   19987071   20595360
2 Brazil       172006362  174504898
3 China       1272915272 1280428583

table4b %>% 
  gather(`1999`, `2000`, 
         key = "year",
         value = "population")


# A tibble: 6 x 3
  country     year  population
  <chr>       <chr>      <int>
1 Afghanistan 1999    19987071
2 Brazil      1999   172006362
3 China       1999  1272915272
4 Afghanistan 2000    20595360
5 Brazil      2000   174504898
6 China       2000  1280428583

(tidy_4b <- table4b %>% 
  pivot_longer(cols = c(`1999`, `2000`),
               names_to = "year",
               values_to = "population"))


# A tibble: 6 x 3
  country     year  population
  <chr>       <chr>      <int>
1 Afghanistan 1999    19987071
2 Afghanistan 2000    20595360
3 Brazil      1999   172006362
4 Brazil      2000   174504898
5 China       1999  1272915272
6 China       2000  1280428583

# name the join keys explicitly rather than relying on the default
left_join(tidy_4a, tidy_4b, by = c("country", "year")) %>% 
  arrange(country, year)


# A tibble: 6 x 4
  country     year   cases population
  <chr>       <chr>  <int>      <int>
1 Afghanistan 1999     745   19987071
2 Afghanistan 2000    2666   20595360
3 Brazil      1999   37737  172006362
4 Brazil      2000   80488  174504898
5 China       1999  212258 1272915272
6 China       2000  213766 1280428583
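
Notice that year ends up as a character column after pivoting. A minimal sketch of one way to get an integer year instead is shown below, assuming tidyr 1.1.0 or later (where pivot_longer() has a names_transform argument). The names tidy_4a_int, tidy_4b_int and table4_joined are introduced only for this example.

```r
library(dplyr)
library(tidyr)

# Convert the former column names to integers while pivoting.
tidy_4a_int <- tidyr::table4a %>%
  pivot_longer(cols = c(`1999`, `2000`),
               names_to = "year",
               values_to = "cases",
               names_transform = list(year = as.integer))

tidy_4b_int <- tidyr::table4b %>%
  pivot_longer(cols = c(`1999`, `2000`),
               names_to = "year",
               values_to = "population",
               names_transform = list(year = as.integer))

# Join on both keys so each country/year pair gets cases and population.
table4_joined <- left_join(tidy_4a_int, tidy_4b_int,
                           by = c("country", "year"))

table4_joined
```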

Pivot Wider / Spread

pivot_wider() is the opposite of pivot_longer(). You use it when an observation is scattered across multiple rows.

table2


# A tibble: 12 x 4
   country      year type            count
   <chr>       <int> <chr>           <int>
 1 Afghanistan  1999 cases             745
 2 Afghanistan  1999 population   19987071
 3 Afghanistan  2000 cases            2666
 4 Afghanistan  2000 population   20595360
 5 Brazil       1999 cases           37737
 6 Brazil       1999 population  172006362
 7 Brazil       2000 cases           80488
 8 Brazil       2000 population  174504898
 9 China        1999 cases          212258
10 China        1999 population 1272915272
11 China        2000 cases          213766
12 China        2000 population 1280428583

table2 %>% 
  # key = column with the variable name, here `type`
  spread(key = type, 
  # value = column with the value that will be assigned
  # to new columns
         value = count)


# A tibble: 6 x 4
  country      year  cases population
  <chr>       <int>  <int>      <int>
1 Afghanistan  1999    745   19987071
2 Afghanistan  2000   2666   20595360
3 Brazil       1999  37737  172006362
4 Brazil       2000  80488  174504898
5 China        1999 212258 1272915272
6 China        2000 213766 1280428583

table2 %>% 
  pivot_wider(names_from = type,
              values_from = count)


# A tibble: 6 x 4
  country      year  cases population
  <chr>       <int>  <int>      <int>
1 Afghanistan  1999    745   19987071
2 Afghanistan  2000   2666   20595360
3 Brazil       1999  37737  172006362
4 Brazil       2000  80488  174504898
5 China        1999 212258 1272915272
6 China        2000 213766 1280428583

Exercises

Why are pivot_longer() and pivot_wider() not perfectly symmetrical?
Carefully consider the following example:

(stocks <- tibble(
  year   = c(2015, 2015, 2016, 2016),
  half  = c(   1,    2,     1,    2),
  return = c(1.88, 0.59, 0.92, 0.17)
))

# A tibble: 4 x 3
   year  half return
  <dbl> <dbl>  <dbl>
1  2015     1   1.88
2  2015     2   0.59
3  2016     1   0.92
4  2016     2   0.17

stocks %>% 
  pivot_wider(names_from = year, values_from = return) %>% 
  pivot_longer(`2015`:`2016`, names_to = "year", values_to = "return")

# A tibble: 4 x 3
   half year  return
  <dbl> <chr>  <dbl>
1     1 2015    1.88
2     1 2016    0.92
3     2 2015    0.59
4     2 2016    0.17

(Hint: look at the variable types and think about column names.)

pivot_longer() has a names_ptypes argument, e.g. names_ptypes = list(year = double()). What does it do?

# vignette("pivot")
stocks %>% 
    pivot_wider(names_from = year, values_from = return)

# A tibble: 2 x 3
   half `2015` `2016`
  <dbl>  <dbl>  <dbl>
1     1   1.88   0.92
2     2   0.59   0.17

Let’s have a look at the first part. Here pivot_wider() takes the values of year and turns them into column names. That means that 2015 and 2016 become new columns in our tibble, and each return gets pulled into the appropriate column (2015/2016) against the appropriate half. In doing so we changed year, which was a double, into the two column names 2015 and 2016, and column names are always “character”.

(stocks_ <- stocks %>% 
  pivot_wider(names_from = year, values_from = return) %>% 
  pivot_longer(`2015`:`2016`, names_to = "year", values_to = "return"))

# A tibble: 4 x 3
   half year  return
  <dbl> <chr>  <dbl>
1     1 2015    1.88
2     1 2016    0.92
3     2 2015    0.59
4     2 2016    0.17

colnames(stocks)

[1] "year"   "half"   "return"

colnames(stocks_)

[1] "half"   "year"   "return"

Following on from that, pivot_longer() takes these new columns and collapses them into a single year column again. Because the years were column names after the pivot_wider() step, they keep their “character” type when the data is made longer again. The final result is that year started off as a double (when we created it) but ends up as character (after the pivot_wider() and pivot_longer() steps).

The columns also get rearranged, since pivot_wider() spreads the year column into 2015 and 2016, which come after half in that initial step. When we subsequently call pivot_longer(), half remains the first column, followed by the names_to column (year in this case), and finally the values_to column (return in this case).
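
To make the two differences concrete, here is a small sketch that undoes them by hand after the round trip. The names round_trip and restored are made up for this example, and all.equal() is used only as a convenient comparison.

```r
library(dplyr)
library(tidyr)

stocks <- tibble(
  year   = c(2015, 2015, 2016, 2016),
  half   = c(1, 2, 1, 2),
  return = c(1.88, 0.59, 0.92, 0.17)
)

round_trip <- stocks %>%
  pivot_wider(names_from = year, values_from = return) %>%
  pivot_longer(`2015`:`2016`, names_to = "year", values_to = "return")

restored <- round_trip %>%
  mutate(year = as.double(year)) %>%  # undo the character conversion
  select(year, half, return) %>%      # restore the original column order
  arrange(year, half)                 # restore the original row order

all.equal(stocks, restored)
```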

Q: pivot_longer() has a names_ptypes argument, e.g. names_ptypes = list(year = double()). What does it do?

Upon reading the help page, I expected that this argument would convert the character column year, created after the pivot_wider() step, into a double. Instead it throws an error. 😕

stocks %>% 
  pivot_wider(names_from = year, values_from = return) %>% 
  pivot_longer(`2015`:`2016`, 
               names_to = "year", 
               names_ptypes = list(year = double()),
               values_to = "return"
               )

Error: Can't convert <character> to <double>.

We use names_ptypes to confirm that the columns we create are of the type / class we expect, so here it seems to act as a check rather than a conversion 🤷.

To transform the column from character to double you need to use names_transform instead.

(stocks_ptypes <- stocks %>% 
  pivot_wider(names_from = year, values_from = return) %>% 
  pivot_longer(`2015`:`2016`, 
               names_to = "year", 
               names_transform = list(year = as.double),
               values_to = "return",
               # is the value column of the type expected
               values_ptypes = list(return = double())
               ))

# A tibble: 4 x 3
   half  year return
  <dbl> <dbl>  <dbl>
1     1  2015   1.88
2     1  2016   0.92
3     2  2015   0.59
4     2  2016   0.17

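Strangely though, I expected that if I transform a column from character to double (using names_transform), and then use names_ptypes to check that my names column is indeed now a double, that would be fine. It still throws an error, so my thinking is flawed here. The error suggests that, in this version of tidyr (1.1.0), the names_ptypes check is applied to the original character names before names_transform runs.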

stocks %>% 
  pivot_wider(names_from = year, values_from = return) %>% 
  pivot_longer(`2015`:`2016`, 
               names_to = "year", 
               names_transform = list(year = as.double),
               names_ptypes = list(year = double()),
               values_to = "return",
               # is the value column of the type expected
               values_ptypes = list(return = double())
               )

Error: Can't convert <character> to <double>.

Why does this code fail?

table4a %>% 
  pivot_longer(c(1999, 2000), names_to = "year", values_to = "cases")

Error: Can't subset columns that don't exist.
x Locations 1999 and 2000 don't exist.
i There are only 3 columns.

# fixing it
table4a %>% 
  pivot_longer(c("1999", `2000`), names_to = "year", values_to = "cases")

# A tibble: 6 x 3
  country     year   cases
  <chr>       <chr>  <int>
1 Afghanistan 1999     745
2 Afghanistan 2000    2666
3 Brazil      1999   37737
4 Brazil      2000   80488
5 China       1999  212258
6 China       2000  213766

1999 and 2000 are non-syntactic column names, so they have to be surrounded by backticks (``) or quotation marks (""). Without them, tidyr interprets the bare numbers as column positions 1999 and 2000, which don’t exist.
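
A quick sketch to confirm that bare numbers are read as column positions: in table4a the year columns sit at positions 2 and 3, so selecting by those positions should give the same result as selecting by name. The names by_position and by_name are only for this illustration.

```r
library(tidyr)

# Bare numbers select columns by position (2nd and 3rd column here).
by_position <- tidyr::table4a %>%
  pivot_longer(c(2, 3), names_to = "year", values_to = "cases")

# Backticks select the same columns by their non-syntactic names.
by_name <- tidyr::table4a %>%
  pivot_longer(c(`1999`, `2000`), names_to = "year", values_to = "cases")

identical(by_position, by_name)
```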

What would happen if you widen this table? Why? How could you add a new column to uniquely identify each value?

people <- tribble(
  ~name,             ~names,  ~values,
  #-----------------|--------|------
  "Phillip Woods",   "age",       45,
  "Phillip Woods",   "height",   186,
  "Phillip Woods",   "age",       50,
  "Jessica Cordero", "age",       37,
  "Jessica Cordero", "height",   156
)

You get a warning, and the age and height columns become list-columns, because Phillip Woods has two different age values, so name and names no longer uniquely identify a value.

people %>% 
  pivot_wider(names_from = names,
              values_from = "values")

# A tibble: 2 x 3
  name            age       height   
  <chr>           <list>    <list>   
1 Phillip Woods   <dbl [2]> <dbl [1]>
2 Jessica Cordero <dbl [1]> <dbl [1]>

people2 <- tribble(
  ~name,             ~names,  ~values,
  #-----------------|--------|------
  "Phillip Woods",   "age",       45,
  "Phillip Woods",   "height",   186,
  "Phillip Woods",   "age2",      50, # second age gets diff col name
  "Jessica Cordero", "age",       37,
  "Jessica Cordero", "height",   156
)

people2 %>% 
  pivot_wider(names_from = names,
              values_from = "values")

# A tibble: 2 x 4
  name              age height  age2
  <chr>           <dbl>  <dbl> <dbl>
1 Phillip Woods      45    186    50
2 Jessica Cordero    37    156    NA
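
Renaming the second age works, but the question asks about adding a new column. One possible sketch, assuming a per-person, per-variable counter is an acceptable identifier (the column name obs and the object people_wide are my own choices here):

```r
library(dplyr)
library(tidyr)

people <- tribble(
  ~name,             ~names,   ~values,
  "Phillip Woods",   "age",         45,
  "Phillip Woods",   "height",     186,
  "Phillip Woods",   "age",         50,
  "Jessica Cordero", "age",         37,
  "Jessica Cordero", "height",     156
)

people_wide <- people %>%
  group_by(name, names) %>%
  mutate(obs = row_number()) %>%  # 1 for the first age/height, 2 for a repeat
  ungroup() %>%
  pivot_wider(names_from = names, values_from = values)

people_wide
```

With obs in place, name and obs together identify each row, so the repeated age lands in its own row and height is filled with NA where no second measurement exists.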

Tidy the simple tibble below. Do you need to make it wider or longer? What are the variables?

(preg <- tribble(
  ~pregnant, ~male, ~female,
  "yes",     NA,    10,
  "no",      20,    12
))

# A tibble: 2 x 3
  pregnant  male female
  <chr>    <dbl>  <dbl>
1 yes         NA     10
2 no          20     12

We need to make it longer. The variables are pregnant (yes or no), sex (male or female), and the count of people in each combination of the two.

preg %>% 
  pivot_longer(c('male', 'female'), 
               names_to = 'sex',
               values_to = 'count')

# A tibble: 4 x 3
  pregnant sex    count
  <chr>    <chr>  <dbl>
1 yes      male      NA
2 yes      female    10
3 no       male      20
4 no       female    12
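
The NA for pregnant males could also be dropped while pivoting. Whether that is appropriate is a judgment call. This sketch assumes the NA marks a combination that cannot occur rather than a missing measurement, and preg_long is a name introduced just for the example.

```r
library(tidyr)

preg <- tribble(
  ~pregnant, ~male, ~female,
  "yes",     NA,    10,
  "no",      20,    12
)

preg_long <- preg %>%
  pivot_longer(c(male, female),
               names_to = "sex",
               values_to = "count",
               values_drop_na = TRUE)  # drop rows whose count is NA

preg_long
```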

sessionInfo()

R version 3.6.3 (2020-02-29)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 18363)

Matrix products: default

locale:
[1] LC_COLLATE=English_South Africa.1252  LC_CTYPE=English_South Africa.1252   
[3] LC_MONETARY=English_South Africa.1252 LC_NUMERIC=C                         
[5] LC_TIME=English_South Africa.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] tidyquant_1.0.0            quantmod_0.4.17           
 [3] TTR_0.23-6                 PerformanceAnalytics_2.0.4
 [5] xts_0.12-0                 zoo_1.8-7                 
 [7] magrittr_1.5               lubridate_1.7.8           
 [9] emo_0.0.0.9000             flair_0.0.2               
[11] forcats_0.5.0              stringr_1.4.0             
[13] dplyr_1.0.0                purrr_0.3.4               
[15] readr_1.3.1                tidyr_1.1.0               
[17] tibble_3.0.3               ggplot2_3.3.0             
[19] tidyverse_1.3.0            workflowr_1.6.2           

loaded via a namespace (and not attached):
 [1] httr_1.4.2       sass_0.2.0       jsonlite_1.7.0   modelr_0.1.6    
 [5] assertthat_0.2.1 cellranger_1.1.0 yaml_2.2.1       pillar_1.4.6    
 [9] backports_1.1.6  lattice_0.20-38  glue_1.4.1       quadprog_1.5-8  
[13] digest_0.6.25    promises_1.1.0   checkmate_2.0.0  rvest_0.3.5     
[17] snakecase_0.11.0 colorspace_1.4-1 htmltools_0.5.0  httpuv_1.5.2    
[21] pkgconfig_2.0.3  broom_0.5.6      haven_2.2.0      scales_1.1.0    
[25] whisker_0.4      later_1.0.0      git2r_0.26.1     generics_0.0.2  
[29] farver_2.0.3     ellipsis_0.3.1   withr_2.2.0      janitor_2.0.1   
[33] cli_2.0.2        crayon_1.3.4     readxl_1.3.1     evaluate_0.14   
[37] fs_1.4.1         fansi_0.4.1      nlme_3.1-144     xml2_1.3.2      
[41] tools_3.6.3      hms_0.5.3        lifecycle_0.2.0  munsell_0.5.0   
[45] reprex_0.3.0     compiler_3.6.3   rlang_0.4.7      grid_3.6.3      
[49] gt_0.2.2         rstudioapi_0.11  labeling_0.3     rmarkdown_2.4   
[53] gtable_0.3.0     DBI_1.1.0        curl_4.3         R6_2.4.1        
[57] knitr_1.28       utf8_1.1.4       rprojroot_1.3-2  Quandl_2.10.0   
[61] stringi_1.4.6    Rcpp_1.0.4.6     vctrs_0.3.2      dbplyr_1.4.3    
[65] tidyselect_1.1.0 xfun_0.13

Vebash Naidoo

23/10/2020
