Data transformation with dplyr

Last updated: 2019-09-26

Checks: 7 passed, 0 failed

Knit directory: wflow-r4ds/

This reproducible R Markdown analysis was created with workflowr (version 1.4.0). The Checks tab describes the reproducibility checks that were applied when the results were created. The Past versions tab lists the development history.

R Markdown file: up-to-date

Great! Since the R Markdown file has been committed to the Git repository, you know the exact version of the code that produced these results.

Environment: empty

Great job! The global environment was empty. Objects defined in the global environment can affect the analysis in your R Markdown file in unknown ways. For reproducibility it’s best to always run the code in an empty environment.

Seed: set.seed(20190925)

The command set.seed(20190925) was run prior to running the code in the R Markdown file. Setting a seed ensures that any results that rely on randomness, e.g. subsampling or permutations, are reproducible.

Session information: recorded

Great job! Recording the operating system, R version, and package versions is critical for reproducibility.

Cache: none

Nice! There were no cached chunks for this analysis, so you can be confident that you successfully produced the results during this run.

File paths: relative

Great job! Using relative paths to the files within your workflowr project makes it easier to run your code on other machines.

Repository version: 5f7de6b

Great! You are using Git for version control. Tracking code development and connecting the code version to the results is critical for reproducibility. The version displayed above was the version of the Git repository at the time these results were generated.

Note that you need to be careful to ensure that all relevant files for the analysis have been committed to Git prior to generating the results (you can use wflow_publish or wflow_git_commit). workflowr only checks the R Markdown file, but you know if there are other scripts or data files that it depends on. Below is the status of the Git repository when the results were generated:


Ignored files:
    Ignored:    .Rhistory
    Ignored:    .Rproj.user/

Unstaged changes:
    Modified:   analysis/02-workflow-basics.Rmd
    Modified:   analysis/index.Rmd

Note that any generated files, e.g. HTML, png, CSS, etc., are not included in this status report because it is ok for generated content to have uncommitted changes.

These are the previous versions of the R Markdown and HTML files. If you’ve configured a remote Git repository (see ?wflow_git_remote), click on the hyperlinks in the table below to view them.

File	Version	Author	Date	Message
Rmd	5f7de6b	John Blischak	2019-09-26	Start chp 3 exercises on dplyr

Setup

library(nycflights13)
library(tidyverse)

── Attaching packages ──────────────────────────────────────────────────────────────────────────────────── tidyverse 1.2.1 ──

✔ ggplot2 3.2.1     ✔ purrr   0.3.2
✔ tibble  2.1.3     ✔ dplyr   0.8.3
✔ tidyr   1.0.0     ✔ stringr 1.4.0
✔ readr   1.3.1     ✔ forcats 0.4.0

── Conflicts ─────────────────────────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()

Filter rows with `filter()`

p. 49

Find all flights that

Had an arrival delay of two or more hours

filter(flights, arr_delay > 2 * 60)

# A tibble: 10,034 x 19
    year month   day dep_time sched_dep_time dep_delay arr_time
   <int> <int> <int>    <int>          <int>     <dbl>    <int>
 1  2013     1     1      811            630       101     1047
 2  2013     1     1      848           1835       853     1001
 3  2013     1     1      957            733       144     1056
 4  2013     1     1     1114            900       134     1447
 5  2013     1     1     1505           1310       115     1638
 6  2013     1     1     1525           1340       105     1831
 7  2013     1     1     1549           1445        64     1912
 8  2013     1     1     1558           1359       119     1718
 9  2013     1     1     1732           1630        62     2028
10  2013     1     1     1803           1620       103     2008
# … with 10,024 more rows, and 12 more variables: sched_arr_time <int>,
#   arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
#   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
#   minute <dbl>, time_hour <dttm>

Flew to Houston (IAH or HOU)

filter(flights, dest == "IAH" | dest == "HOU")

# A tibble: 9,313 x 19
    year month   day dep_time sched_dep_time dep_delay arr_time
   <int> <int> <int>    <int>          <int>     <dbl>    <int>
 1  2013     1     1      517            515         2      830
 2  2013     1     1      533            529         4      850
 3  2013     1     1      623            627        -4      933
 4  2013     1     1      728            732        -4     1041
 5  2013     1     1      739            739         0     1104
 6  2013     1     1      908            908         0     1228
 7  2013     1     1     1028           1026         2     1350
 8  2013     1     1     1044           1045        -1     1352
 9  2013     1     1     1114            900       134     1447
10  2013     1     1     1205           1200         5     1503
# … with 9,303 more rows, and 12 more variables: sched_arr_time <int>,
#   arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
#   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
#   minute <dbl>, time_hour <dttm>

Were operated by United, American, or Delta

filter(flights, carrier %in% c("UA", "AA", "DL"))

# A tibble: 139,504 x 19
    year month   day dep_time sched_dep_time dep_delay arr_time
   <int> <int> <int>    <int>          <int>     <dbl>    <int>
 1  2013     1     1      517            515         2      830
 2  2013     1     1      533            529         4      850
 3  2013     1     1      542            540         2      923
 4  2013     1     1      554            600        -6      812
 5  2013     1     1      554            558        -4      740
 6  2013     1     1      558            600        -2      753
 7  2013     1     1      558            600        -2      924
 8  2013     1     1      558            600        -2      923
 9  2013     1     1      559            600        -1      941
10  2013     1     1      559            600        -1      854
# … with 139,494 more rows, and 12 more variables: sched_arr_time <int>,
#   arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
#   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
#   minute <dbl>, time_hour <dttm>

Departed in summer (July, August, and September)

filter(flights, month %in% 7:9)

# A tibble: 86,326 x 19
    year month   day dep_time sched_dep_time dep_delay arr_time
   <int> <int> <int>    <int>          <int>     <dbl>    <int>
 1  2013     7     1        1           2029       212      236
 2  2013     7     1        2           2359         3      344
 3  2013     7     1       29           2245       104      151
 4  2013     7     1       43           2130       193      322
 5  2013     7     1       44           2150       174      300
 6  2013     7     1       46           2051       235      304
 7  2013     7     1       48           2001       287      308
 8  2013     7     1       58           2155       183      335
 9  2013     7     1      100           2146       194      327
10  2013     7     1      100           2245       135      337
# … with 86,316 more rows, and 12 more variables: sched_arr_time <int>,
#   arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
#   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
#   minute <dbl>, time_hour <dttm>

Arrived more than two hours late, but didn’t leave late

filter(flights, arr_delay > 2 * 60, dep_delay <= 0)

# A tibble: 29 x 19
    year month   day dep_time sched_dep_time dep_delay arr_time
   <int> <int> <int>    <int>          <int>     <dbl>    <int>
 1  2013     1    27     1419           1420        -1     1754
 2  2013    10     7     1350           1350         0     1736
 3  2013    10     7     1357           1359        -2     1858
 4  2013    10    16      657            700        -3     1258
 5  2013    11     1      658            700        -2     1329
 6  2013     3    18     1844           1847        -3       39
 7  2013     4    17     1635           1640        -5     2049
 8  2013     4    18      558            600        -2     1149
 9  2013     4    18      655            700        -5     1213
10  2013     5    22     1827           1830        -3     2217
# … with 19 more rows, and 12 more variables: sched_arr_time <int>,
#   arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
#   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
#   minute <dbl>, time_hour <dttm>

Were delayed by at least an hour, but made up over 30 minutes in flight

filter(flights, dep_delay >= 60, arr_delay < 30)

# A tibble: 206 x 19
    year month   day dep_time sched_dep_time dep_delay arr_time
   <int> <int> <int>    <int>          <int>     <dbl>    <int>
 1  2013     1     3     1850           1745        65     2148
 2  2013     1     3     1950           1845        65     2228
 3  2013     1     3     2015           1915        60     2135
 4  2013     1     6     1019            900        79     1558
 5  2013     1     7     1543           1430        73     1758
 6  2013     1    11     1020            920        60     1311
 7  2013     1    12     1706           1600        66     1949
 8  2013     1    12     1953           1845        68     2154
 9  2013     1    19     1456           1355        61     1636
10  2013     1    21     1531           1430        61     1843
# … with 196 more rows, and 12 more variables: sched_arr_time <int>,
#   arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
#   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
#   minute <dbl>, time_hour <dttm>
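
The filter above keeps flights that left at least an hour late and arrived less than 30 minutes late. A stricter reading of “made up over 30 minutes in flight” compares the two delays directly. Below is a minimal sketch of that reading on made-up data; the tibble `toy` and its values are purely illustrative and not taken from nycflights13.

```r
library(dplyr)

# Hypothetical toy data: delays in minutes
toy <- tibble(
  dep_delay = c(60, 120, 90, 30),
  arr_delay = c(20, 80, 70, -10)
)

# Left at least an hour late AND gained more than 30 minutes in the air
made_up <- filter(toy, dep_delay >= 60, dep_delay - arr_delay > 30)
made_up
```

The second toy row arrives 80 minutes late, so `arr_delay < 30` would drop it even though it made up 40 minutes in flight.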

Departed between midnight and 6am (inclusive)

filter(flights, hour <= 6)

# A tibble: 27,905 x 19
    year month   day dep_time sched_dep_time dep_delay arr_time
   <int> <int> <int>    <int>          <int>     <dbl>    <int>
 1  2013     1     1      517            515         2      830
 2  2013     1     1      533            529         4      850
 3  2013     1     1      542            540         2      923
 4  2013     1     1      544            545        -1     1004
 5  2013     1     1      554            600        -6      812
 6  2013     1     1      554            558        -4      740
 7  2013     1     1      555            600        -5      913
 8  2013     1     1      557            600        -3      709
 9  2013     1     1      557            600        -3      838
10  2013     1     1      558            600        -2      753
# … with 27,895 more rows, and 12 more variables: sched_arr_time <int>,
#   arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
#   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
#   minute <dbl>, time_hour <dttm>
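
Note that hour is derived from the scheduled departure time, and hour <= 6 also keeps flights scheduled as late as 6:59. If the question is read as actual departures from midnight through 6:00am, one possible alternative is to filter on dep_time instead. The sketch below assumes that midnight is recorded as 2400 in dep_time; the tibble `toy` is hypothetical.

```r
library(dplyr)

# Hypothetical toy data: actual departure times in HHMM format
toy <- tibble(dep_time = c(2400L, 15L, 600L, 601L, 659L, NA))

# Midnight (assumed to be stored as 2400) through 6:00am inclusive;
# rows with a missing dep_time are dropped by filter()
early <- filter(toy, dep_time <= 600 | dep_time == 2400)
early$dep_time
```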

Another useful dplyr filtering helper is between(). What does it do? Can you use it to simplify the code needed to answer the previous challenges?

between() selects numeric values between a minimum and maximum value (inclusive).
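
In other words, between(x, left, right) behaves like x >= left & x <= right. A small sketch on a made-up vector (the name `x` is only for illustration):

```r
library(dplyr)

x <- c(6, 7, 8, 9, 10, NA)

# Both boundaries are included; a missing input gives a missing result
between(x, 7, 9)

# The equivalent explicit comparison
x >= 7 & x <= 9
```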

While I think between() is useful, I don’t find these examples very compelling.

filter(flights, between(month, 7, 9))

# A tibble: 86,326 x 19
    year month   day dep_time sched_dep_time dep_delay arr_time
   <int> <int> <int>    <int>          <int>     <dbl>    <int>
 1  2013     7     1        1           2029       212      236
 2  2013     7     1        2           2359         3      344
 3  2013     7     1       29           2245       104      151
 4  2013     7     1       43           2130       193      322
 5  2013     7     1       44           2150       174      300
 6  2013     7     1       46           2051       235      304
 7  2013     7     1       48           2001       287      308
 8  2013     7     1       58           2155       183      335
 9  2013     7     1      100           2146       194      327
10  2013     7     1      100           2245       135      337
# … with 86,316 more rows, and 12 more variables: sched_arr_time <int>,
#   arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
#   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
#   minute <dbl>, time_hour <dttm>

filter(flights, between(hour, 0, 6))

# A tibble: 27,905 x 19
    year month   day dep_time sched_dep_time dep_delay arr_time
   <int> <int> <int>    <int>          <int>     <dbl>    <int>
 1  2013     1     1      517            515         2      830
 2  2013     1     1      533            529         4      850
 3  2013     1     1      542            540         2      923
 4  2013     1     1      544            545        -1     1004
 5  2013     1     1      554            600        -6      812
 6  2013     1     1      554            558        -4      740
 7  2013     1     1      555            600        -5      913
 8  2013     1     1      557            600        -3      709
 9  2013     1     1      557            600        -3      838
10  2013     1     1      558            600        -2      753
# … with 27,895 more rows, and 12 more variables: sched_arr_time <int>,
#   arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
#   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
#   minute <dbl>, time_hour <dttm>

How many flights have a missing dep_time? What other variables are missing? What might these rows represent?

filter(flights, is.na(dep_time))

# A tibble: 8,255 x 19
    year month   day dep_time sched_dep_time dep_delay arr_time
   <int> <int> <int>    <int>          <int>     <dbl>    <int>
 1  2013     1     1       NA           1630        NA       NA
 2  2013     1     1       NA           1935        NA       NA
 3  2013     1     1       NA           1500        NA       NA
 4  2013     1     1       NA            600        NA       NA
 5  2013     1     2       NA           1540        NA       NA
 6  2013     1     2       NA           1620        NA       NA
 7  2013     1     2       NA           1355        NA       NA
 8  2013     1     2       NA           1420        NA       NA
 9  2013     1     2       NA           1321        NA       NA
10  2013     1     2       NA           1545        NA       NA
# … with 8,245 more rows, and 12 more variables: sched_arr_time <int>,
#   arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
#   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
#   minute <dbl>, time_hour <dttm>

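There are 8,255 flights with a missing dep_time. I assume these are cancelled flights, since the recorded (as opposed to scheduled) departure and arrival columns, e.g. dep_delay and arr_time, are missing as well.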
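
One way to check which columns contain missing values is to count the NAs per column. This is a sketch on a small hypothetical data frame (`toy` is made up); the same call should work on flights.

```r
# Hypothetical toy data with two "cancelled" rows
toy <- data.frame(
  dep_time       = c(517L, NA, 542L, NA),
  sched_dep_time = c(515L, 1630L, 540L, 1935L),
  arr_time       = c(830L, NA, 923L, NA)
)

# is.na() returns a logical matrix; colSums() counts the TRUEs per column
na_counts <- colSums(is.na(toy))
na_counts
```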

Why is NA ^ 0 not missing? Why is NA | TRUE not missing? Why is FALSE & NA not missing? Can you figure out the general rule? (NA * 0 is a tricky counterexample!)

NA ^ 0 # same as NA ^ FALSE

[1] 1

NA | TRUE

[1] TRUE

FALSE & NA

[1] FALSE

NA * 0

[1] NA

I think NA | TRUE and FALSE & NA make sense. They are effectively short-circuiting the logic. For an “or” statement, a TRUE on either side is sufficient to render the entire statement true, regardless of whether the other value is missing. Vice versa for the “and”: a FALSE on either side is sufficient to make the entire statement false.

# have to be TRUE
NA | TRUE

[1] TRUE

TRUE | NA

[1] TRUE

# ambiguous
NA | FALSE

[1] NA

FALSE | NA

[1] NA

# have to be FALSE
NA & FALSE

[1] FALSE

FALSE & NA

[1] FALSE

# ambiguous
NA & TRUE

[1] NA

TRUE & NA

[1] NA

The NA ^ 0 case makes less sense to me. Note that ^ is the power operator in R, not “exclusive or” (that is xor()). From the docs for ?Arithmetic:

1 ^ y and y ^ 0 are 1, always.

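That did not make sense to me at first, but it fits the pattern of the logical cases: any value raised to the power 0 is 1, so the result does not depend on what the missing value is.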

NA ^ FALSE

[1] 1

NA ^ 1

[1] NA

1 ^ NA

[1] 1

1 ^ 1

[1] 1

0 ^ 0

[1] 1

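As far as I can tell, the general rule is that R returns a non-missing result only when the answer would be the same for every possible value of the NA. That holds for NA | TRUE, FALSE & NA and NA ^ 0, but not for NA * 0: x * 0 is 0 for every finite x, yet the missing value could be Inf or NaN, and Inf * 0 is NaN. Division involving NA stays missing for the same reason: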

0 / NA

[1] NA

NA / NA * 0

[1] NA
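
The rule can be probed by substituting concrete values for the missing one. A short sketch in base R:

```r
# x * 0 is 0 for finite x ...
c(-3, 0, 2.5) * 0

# ... but not when x is infinite or undefined
Inf * 0
NaN * 0

# whereas x ^ 0 is 1 for all of these, so NA ^ 0 can safely be 1
c(-3, 0, 2.5, Inf, NaN) ^ 0
```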

Arrange rows with `arrange()`

p. 51

How could you use arrange() to sort all missing values to the start? (Hint: use is.na()).

arrange(flights, desc(is.na(dep_time)))

# A tibble: 336,776 x 19
    year month   day dep_time sched_dep_time dep_delay arr_time
   <int> <int> <int>    <int>          <int>     <dbl>    <int>
 1  2013     1     1       NA           1630        NA       NA
 2  2013     1     1       NA           1935        NA       NA
 3  2013     1     1       NA           1500        NA       NA
 4  2013     1     1       NA            600        NA       NA
 5  2013     1     2       NA           1540        NA       NA
 6  2013     1     2       NA           1620        NA       NA
 7  2013     1     2       NA           1355        NA       NA
 8  2013     1     2       NA           1420        NA       NA
 9  2013     1     2       NA           1321        NA       NA
10  2013     1     2       NA           1545        NA       NA
# … with 336,766 more rows, and 12 more variables: sched_arr_time <int>,
#   arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
#   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
#   minute <dbl>, time_hour <dttm>
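
This works because is.na() returns a logical vector, and FALSE sorts before TRUE; wrapping it in desc() moves the TRUE (missing) rows to the top. A sketch on hypothetical toy data (`toy` and `id` are made up for illustration):

```r
library(dplyr)

toy <- tibble(id = 1:4, dep_time = c(517L, NA, 542L, NA))

# Rows with a missing dep_time come first; the rest keep their original order
sorted <- arrange(toy, desc(is.na(dep_time)))
sorted$id
```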

Sort flights to find the most delayed flights. Find the flights that left earliest.

arrange(flights, desc(dep_delay))

# A tibble: 336,776 x 19
    year month   day dep_time sched_dep_time dep_delay arr_time
   <int> <int> <int>    <int>          <int>     <dbl>    <int>
 1  2013     1     9      641            900      1301     1242
 2  2013     6    15     1432           1935      1137     1607
 3  2013     1    10     1121           1635      1126     1239
 4  2013     9    20     1139           1845      1014     1457
 5  2013     7    22      845           1600      1005     1044
 6  2013     4    10     1100           1900       960     1342
 7  2013     3    17     2321            810       911      135
 8  2013     6    27      959           1900       899     1236
 9  2013     7    22     2257            759       898      121
10  2013    12     5      756           1700       896     1058
# … with 336,766 more rows, and 12 more variables: sched_arr_time <int>,
#   arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
#   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
#   minute <dbl>, time_hour <dttm>

arrange(flights, hour, minute) # earliest scheduled departure in the AM (hour and minute come from sched_dep_time)

# A tibble: 336,776 x 19
    year month   day dep_time sched_dep_time dep_delay arr_time
   <int> <int> <int>    <int>          <int>     <dbl>    <int>
 1  2013     7    27       NA            106        NA       NA
 2  2013     1     2      458            500        -2      703
 3  2013     1     3      458            500        -2      650
 4  2013     1     4      456            500        -4      631
 5  2013     1     5      458            500        -2      640
 6  2013     1     6      458            500        -2      718
 7  2013     1     7      454            500        -6      637
 8  2013     1     8      454            500        -6      625
 9  2013     1     9      457            500        -3      647
10  2013     1    10      450            500       -10      634
# … with 336,766 more rows, and 12 more variables: sched_arr_time <int>,
#   arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
#   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
#   minute <dbl>, time_hour <dttm>

arrange(flights, dep_delay) # left earliest in relation to scheduled dep time

# A tibble: 336,776 x 19
    year month   day dep_time sched_dep_time dep_delay arr_time
   <int> <int> <int>    <int>          <int>     <dbl>    <int>
 1  2013    12     7     2040           2123       -43       40
 2  2013     2     3     2022           2055       -33     2240
 3  2013    11    10     1408           1440       -32     1549
 4  2013     1    11     1900           1930       -30     2233
 5  2013     1    29     1703           1730       -27     1947
 6  2013     8     9      729            755       -26     1002
 7  2013    10    23     1907           1932       -25     2143
 8  2013     3    30     2030           2055       -25     2213
 9  2013     3     2     1431           1455       -24     1601
10  2013     5     5      934            958       -24     1225
# … with 336,766 more rows, and 12 more variables: sched_arr_time <int>,
#   arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
#   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
#   minute <dbl>, time_hour <dttm>

Sort flights to find the fastest flights.

arrange(flights, air_time)

# A tibble: 336,776 x 19
    year month   day dep_time sched_dep_time dep_delay arr_time
   <int> <int> <int>    <int>          <int>     <dbl>    <int>
 1  2013     1    16     1355           1315        40     1442
 2  2013     4    13      537            527        10      622
 3  2013    12     6      922            851        31     1021
 4  2013     2     3     2153           2129        24     2247
 5  2013     2     5     1303           1315       -12     1342
 6  2013     2    12     2123           2130        -7     2211
 7  2013     3     2     1450           1500       -10     1547
 8  2013     3     8     2026           1935        51     2131
 9  2013     3    18     1456           1329        87     1533
10  2013     3    19     2226           2145        41     2305
# … with 336,766 more rows, and 12 more variables: sched_arr_time <int>,
#   arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
#   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
#   minute <dbl>, time_hour <dttm>
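
Sorting by air_time finds the flights that spent the least time in the air. If “fastest” is read as highest average speed instead, one option is to sort by distance per unit of air time. A sketch on hypothetical toy data, assuming distance is in miles and air_time in minutes:

```r
library(dplyr)

toy <- tibble(
  flight   = c("A", "B", "C"),
  distance = c(100, 1000, 500),  # miles
  air_time = c(30, 120, 90)      # minutes
)

# Highest average speed (miles per minute) first
by_speed <- arrange(toy, desc(distance / air_time))
by_speed$flight
```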

Which flights travelled the longest? Which travelled the shortest?

arrange(flights, desc(distance))

# A tibble: 336,776 x 19
    year month   day dep_time sched_dep_time dep_delay arr_time
   <int> <int> <int>    <int>          <int>     <dbl>    <int>
 1  2013     1     1      857            900        -3     1516
 2  2013     1     2      909            900         9     1525
 3  2013     1     3      914            900        14     1504
 4  2013     1     4      900            900         0     1516
 5  2013     1     5      858            900        -2     1519
 6  2013     1     6     1019            900        79     1558
 7  2013     1     7     1042            900       102     1620
 8  2013     1     8      901            900         1     1504
 9  2013     1     9      641            900      1301     1242
10  2013     1    10      859            900        -1     1449
# … with 336,766 more rows, and 12 more variables: sched_arr_time <int>,
#   arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
#   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
#   minute <dbl>, time_hour <dttm>

arrange(flights, distance)

# A tibble: 336,776 x 19
    year month   day dep_time sched_dep_time dep_delay arr_time
   <int> <int> <int>    <int>          <int>     <dbl>    <int>
 1  2013     7    27       NA            106        NA       NA
 2  2013     1     3     2127           2129        -2     2222
 3  2013     1     4     1240           1200        40     1333
 4  2013     1     4     1829           1615       134     1937
 5  2013     1     4     2128           2129        -1     2218
 6  2013     1     5     1155           1200        -5     1241
 7  2013     1     6     2125           2129        -4     2224
 8  2013     1     7     2124           2129        -5     2212
 9  2013     1     8     2127           2130        -3     2304
10  2013     1     9     2126           2129        -3     2217
# … with 336,766 more rows, and 12 more variables: sched_arr_time <int>,
#   arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
#   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
#   minute <dbl>, time_hour <dttm>

Select columns with `select()`

p. 54

Brainstorm as many ways as possible to select dep_time, dep_delay, arr_time, and arr_delay from flights.

select(flights, dep_time, dep_delay, arr_time, arr_delay)

# A tibble: 336,776 x 4
   dep_time dep_delay arr_time arr_delay
      <int>     <dbl>    <int>     <dbl>
 1      517         2      830        11
 2      533         4      850        20
 3      542         2      923        33
 4      544        -1     1004       -18
 5      554        -6      812       -25
 6      554        -4      740        12
 7      555        -5      913        19
 8      557        -3      709       -14
 9      557        -3      838        -8
10      558        -2      753         8
# … with 336,766 more rows

select(flights, dep_time, dep_delay:arr_time, arr_delay)

# A tibble: 336,776 x 4
   dep_time dep_delay arr_time arr_delay
      <int>     <dbl>    <int>     <dbl>
 1      517         2      830        11
 2      533         4      850        20
 3      542         2      923        33
 4      544        -1     1004       -18
 5      554        -6      812       -25
 6      554        -4      740        12
 7      555        -5      913        19
 8      557        -3      709       -14
 9      557        -3      838        -8
10      558        -2      753         8
# … with 336,766 more rows

select(flights, starts_with("dep"), starts_with("arr"))

# A tibble: 336,776 x 4
   dep_time dep_delay arr_time arr_delay
      <int>     <dbl>    <int>     <dbl>
 1      517         2      830        11
 2      533         4      850        20
 3      542         2      923        33
 4      544        -1     1004       -18
 5      554        -6      812       -25
 6      554        -4      740        12
 7      555        -5      913        19
 8      557        -3      709       -14
 9      557        -3      838        -8
10      558        -2      753         8
# … with 336,766 more rows

select(flights, dep_time, arr_time, ends_with("delay"))

# A tibble: 336,776 x 4
   dep_time arr_time dep_delay arr_delay
      <int>    <int>     <dbl>     <dbl>
 1      517      830         2        11
 2      533      850         4        20
 3      542      923         2        33
 4      544     1004        -1       -18
 5      554      812        -6       -25
 6      554      740        -4        12
 7      555      913        -5        19
 8      557      709        -3       -14
 9      557      838        -3        -8
10      558      753        -2         8
# … with 336,766 more rows

What happens if you include the name of a variable multiple times in a select() call?

The repeated name is ignored: the column appears only once, in the position where it is first mentioned:

select(flights, dep_time, dep_time)

# A tibble: 336,776 x 1
   dep_time
      <int>
 1      517
 2      533
 3      542
 4      544
 5      554
 6      554
 7      555
 8      557
 9      557
10      558
# … with 336,766 more rows

select(flights, dep_time, arr_time, dep_time)

# A tibble: 336,776 x 2
   dep_time arr_time
      <int>    <int>
 1      517      830
 2      533      850
 3      542      923
 4      544     1004
 5      554      812
 6      554      740
 7      555      913
 8      557      709
 9      557      838
10      558      753
# … with 336,766 more rows

What does the one_of() function do? Why might it be helpful in conjunction with this vector?

From docs:

one_of(): Matches variable names in a character vector.

Thus it serves a similar purpose to %in%. But %in% returns a logical vector (which could be used in filter()), whereas select() expects column names or integer positions:

vars <- c("year", "month", "day", "dep_delay", "arr_delay")
select(flights, one_of(vars))

# A tibble: 336,776 x 5
    year month   day dep_delay arr_delay
   <int> <int> <int>     <dbl>     <dbl>
 1  2013     1     1         2        11
 2  2013     1     1         4        20
 3  2013     1     1         2        33
 4  2013     1     1        -1       -18
 5  2013     1     1        -6       -25
 6  2013     1     1        -4        12
 7  2013     1     1        -5        19
 8  2013     1     1        -3       -14
 9  2013     1     1        -3        -8
10  2013     1     1        -2         8
# … with 336,766 more rows

# This would be how to replace one_of()
select(flights, which(colnames(flights) %in% vars))

# A tibble: 336,776 x 5
    year month   day dep_delay arr_delay
   <int> <int> <int>     <dbl>     <dbl>
 1  2013     1     1         2        11
 2  2013     1     1         4        20
 3  2013     1     1         2        33
 4  2013     1     1        -1       -18
 5  2013     1     1        -6       -25
 6  2013     1     1        -4        12
 7  2013     1     1        -5        19
 8  2013     1     1        -3       -14
 9  2013     1     1        -3        -8
10  2013     1     1        -2         8
# … with 336,766 more rows
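
In releases newer than the ones used here (tidyselect 1.0.0 and later, to the best of my knowledge), one_of() is superseded by all_of() and any_of(). The sketch below assumes a dplyr version that re-exports these helpers; the tibble `toy` and the column name "nope" are hypothetical.

```r
library(dplyr)

toy <- tibble(
  year = 2013L, month = 1L, day = 1L,
  dep_delay = 2, arr_delay = 11, carrier = "UA"
)
vars <- c("year", "month", "day", "dep_delay", "arr_delay")

# all_of() is strict: it errors if any name in vars is not a column
strict <- select(toy, all_of(vars))

# any_of() is lenient: names that are not columns are silently skipped
lenient <- select(toy, any_of(c(vars, "nope")))
names(lenient)
```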

Does the result of running the following code surprise you? How do the select helpers deal with case by default? How can you change that default?

select(flights, contains("TIME"))

# A tibble: 336,776 x 6
   dep_time sched_dep_time arr_time sched_arr_time air_time
      <int>          <int>    <int>          <int>    <dbl>
 1      517            515      830            819      227
 2      533            529      850            830      227
 3      542            540      923            850      160
 4      544            545     1004           1022      183
 5      554            600      812            837      116
 6      554            558      740            728      150
 7      555            600      913            854      158
 8      557            600      709            723       53
 9      557            600      838            846      140
10      558            600      753            745      138
# … with 336,766 more rows, and 1 more variable: time_hour <dttm>

The tidyselect helpers ignore case by default (ignore.case = TRUE):

formals(contains)$ignore.case

[1] TRUE

select(flights, contains("TIME", ignore.case = FALSE))

# A tibble: 336,776 x 0

sessionInfo()

R version 3.6.1 (2019-07-05)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 18.04.3 LTS

Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/atlas/libblas.so.3.10.3
LAPACK: /usr/lib/x86_64-linux-gnu/atlas/liblapack.so.3.10.3

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] forcats_0.4.0      stringr_1.4.0      dplyr_0.8.3       
 [4] purrr_0.3.2        readr_1.3.1        tidyr_1.0.0       
 [7] tibble_2.1.3       ggplot2_3.2.1      tidyverse_1.2.1   
[10] nycflights13_1.0.1

loaded via a namespace (and not attached):
 [1] tidyselect_0.2.5 xfun_0.9         haven_2.1.1      lattice_0.20-38 
 [5] colorspace_1.4-1 vctrs_0.2.0      generics_0.0.2   htmltools_0.3.6 
 [9] yaml_2.2.0       utf8_1.1.4       rlang_0.4.0      pillar_1.4.2    
[13] glue_1.3.1       withr_2.1.2      modelr_0.1.5     readxl_1.3.1    
[17] lifecycle_0.1.0  munsell_0.5.0    gtable_0.3.0     workflowr_1.4.0 
[21] cellranger_1.1.0 rvest_0.3.4      evaluate_0.14    knitr_1.25      
[25] fansi_0.4.0      broom_0.5.2      Rcpp_1.0.2       backports_1.1.4 
[29] scales_1.0.0     jsonlite_1.6     fs_1.3.1         hms_0.5.1       
[33] digest_0.6.21    stringi_1.4.3    grid_3.6.1       rprojroot_1.2   
[37] cli_1.1.0        tools_3.6.1      magrittr_1.5     lazyeval_0.2.2  
[41] crayon_1.3.4     whisker_0.4      pkgconfig_2.0.2  zeallot_0.1.0   
[45] xml2_1.2.2       lubridate_1.7.4  assertthat_0.2.1 rmarkdown_1.15  
[49] httr_1.4.1       rstudioapi_0.10  R6_2.4.0         nlme_3.1-141    
[53] git2r_0.26.1     compiler_3.6.1

Data transformation with dplyr

John Blischak

2019-09-26
