Last updated: 2019-09-26
Knit directory: wflow-r4ds/
This reproducible R Markdown analysis was created with workflowr (version 1.4.0). The Checks tab describes the reproducibility checks that were applied when the results were created. The Past versions tab lists the development history.
Great! Since the R Markdown file has been committed to the Git repository, you know the exact version of the code that produced these results.
Great job! The global environment was empty. Objects defined in the global environment can affect the analysis in your R Markdown file in unknown ways. For reproducibility it’s best to always run the code in an empty environment.
The command set.seed(20190925) was run prior to running the code in the R Markdown file. Setting a seed ensures that any results that rely on randomness, e.g. subsampling or permutations, are reproducible.
Great job! Recording the operating system, R version, and package versions is critical for reproducibility.
Nice! There were no cached chunks for this analysis, so you can be confident that you successfully produced the results during this run.
Great job! Using relative paths to the files within your workflowr project makes it easier to run your code on other machines.
Great! You are using Git for version control. Tracking code development and connecting the code version to the results is critical for reproducibility. The version displayed above was the version of the Git repository at the time these results were generated.
Note that you need to be careful to ensure that all relevant files for the analysis have been committed to Git prior to generating the results (you can use wflow_publish or wflow_git_commit). workflowr only checks the R Markdown file, but you know if there are other scripts or data files that it depends on. Below is the status of the Git repository when the results were generated:
Ignored files:
Ignored: .Rhistory
Ignored: .Rproj.user/
Unstaged changes:
Modified: analysis/02-workflow-basics.Rmd
Modified: analysis/index.Rmd
Note that any generated files, e.g. HTML, png, CSS, etc., are not included in this status report because it is ok for generated content to have uncommitted changes.
These are the previous versions of the R Markdown and HTML files. If you’ve configured a remote Git repository (see ?wflow_git_remote), click on the hyperlinks in the table below to view them.
File | Version | Author | Date | Message |
---|---|---|---|---|
Rmd | 5f7de6b | John Blischak | 2019-09-26 | Start chp 3 exercises on dplyr |
library(nycflights13)
library(tidyverse)
── Attaching packages ──────────────────────────────────────────────────────────────────────────────────── tidyverse 1.2.1 ──
✔ ggplot2 3.2.1 ✔ purrr 0.3.2
✔ tibble 2.1.3 ✔ dplyr 0.8.3
✔ tidyr 1.0.0 ✔ stringr 1.4.0
✔ readr 1.3.1 ✔ forcats 0.4.0
── Conflicts ─────────────────────────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
filter()
p. 49
Find all flights that
Had an arrival delay of two or more hours
filter(flights, arr_delay > 2 * 60)
# A tibble: 10,034 x 19
year month day dep_time sched_dep_time dep_delay arr_time
<int> <int> <int> <int> <int> <dbl> <int>
1 2013 1 1 811 630 101 1047
2 2013 1 1 848 1835 853 1001
3 2013 1 1 957 733 144 1056
4 2013 1 1 1114 900 134 1447
5 2013 1 1 1505 1310 115 1638
6 2013 1 1 1525 1340 105 1831
7 2013 1 1 1549 1445 64 1912
8 2013 1 1 1558 1359 119 1718
9 2013 1 1 1732 1630 62 2028
10 2013 1 1 1803 1620 103 2008
# … with 10,024 more rows, and 12 more variables: sched_arr_time <int>,
# arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
# origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
# minute <dbl>, time_hour <dttm>
Flew to Houston (IAH or HOU)
filter(flights, dest == "IAH" | dest == "HOU")
# A tibble: 9,313 x 19
year month day dep_time sched_dep_time dep_delay arr_time
<int> <int> <int> <int> <int> <dbl> <int>
1 2013 1 1 517 515 2 830
2 2013 1 1 533 529 4 850
3 2013 1 1 623 627 -4 933
4 2013 1 1 728 732 -4 1041
5 2013 1 1 739 739 0 1104
6 2013 1 1 908 908 0 1228
7 2013 1 1 1028 1026 2 1350
8 2013 1 1 1044 1045 -1 1352
9 2013 1 1 1114 900 134 1447
10 2013 1 1 1205 1200 5 1503
# … with 9,303 more rows, and 12 more variables: sched_arr_time <int>,
# arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
# origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
# minute <dbl>, time_hour <dttm>
Were operated by United, American, or Delta
filter(flights, carrier %in% c("UA", "AA", "DL"))
# A tibble: 139,504 x 19
year month day dep_time sched_dep_time dep_delay arr_time
<int> <int> <int> <int> <int> <dbl> <int>
1 2013 1 1 517 515 2 830
2 2013 1 1 533 529 4 850
3 2013 1 1 542 540 2 923
4 2013 1 1 554 600 -6 812
5 2013 1 1 554 558 -4 740
6 2013 1 1 558 600 -2 753
7 2013 1 1 558 600 -2 924
8 2013 1 1 558 600 -2 923
9 2013 1 1 559 600 -1 941
10 2013 1 1 559 600 -1 854
# … with 139,494 more rows, and 12 more variables: sched_arr_time <int>,
# arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
# origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
# minute <dbl>, time_hour <dttm>
Departed in summer (July, August, and September)
filter(flights, month %in% 7:9)
# A tibble: 86,326 x 19
year month day dep_time sched_dep_time dep_delay arr_time
<int> <int> <int> <int> <int> <dbl> <int>
1 2013 7 1 1 2029 212 236
2 2013 7 1 2 2359 3 344
3 2013 7 1 29 2245 104 151
4 2013 7 1 43 2130 193 322
5 2013 7 1 44 2150 174 300
6 2013 7 1 46 2051 235 304
7 2013 7 1 48 2001 287 308
8 2013 7 1 58 2155 183 335
9 2013 7 1 100 2146 194 327
10 2013 7 1 100 2245 135 337
# … with 86,316 more rows, and 12 more variables: sched_arr_time <int>,
# arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
# origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
# minute <dbl>, time_hour <dttm>
Arrived more than two hours late, but didn’t leave late
filter(flights, arr_delay > 2 * 60, dep_delay <= 0)
# A tibble: 29 x 19
year month day dep_time sched_dep_time dep_delay arr_time
<int> <int> <int> <int> <int> <dbl> <int>
1 2013 1 27 1419 1420 -1 1754
2 2013 10 7 1350 1350 0 1736
3 2013 10 7 1357 1359 -2 1858
4 2013 10 16 657 700 -3 1258
5 2013 11 1 658 700 -2 1329
6 2013 3 18 1844 1847 -3 39
7 2013 4 17 1635 1640 -5 2049
8 2013 4 18 558 600 -2 1149
9 2013 4 18 655 700 -5 1213
10 2013 5 22 1827 1830 -3 2217
# … with 19 more rows, and 12 more variables: sched_arr_time <int>,
# arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
# origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
# minute <dbl>, time_hour <dttm>
Were delayed by at least an hour, but made up over 30 minutes in flight
filter(flights, dep_delay >= 60, arr_delay < 30)
# A tibble: 206 x 19
year month day dep_time sched_dep_time dep_delay arr_time
<int> <int> <int> <int> <int> <dbl> <int>
1 2013 1 3 1850 1745 65 2148
2 2013 1 3 1950 1845 65 2228
3 2013 1 3 2015 1915 60 2135
4 2013 1 6 1019 900 79 1558
5 2013 1 7 1543 1430 73 1758
6 2013 1 11 1020 920 60 1311
7 2013 1 12 1706 1600 66 1949
8 2013 1 12 1953 1845 68 2154
9 2013 1 19 1456 1355 61 1636
10 2013 1 21 1531 1430 61 1843
# … with 196 more rows, and 12 more variables: sched_arr_time <int>,
# arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
# origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
# minute <dbl>, time_hour <dttm>
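The filter above keeps flights that left at least an hour late and arrived less than 30 minutes late. Another reading of "made up over 30 minutes in flight" compares the two delays directly, so a flight that left 120 minutes late and arrived 80 minutes late would also count. A minimal sketch of that reading, using a small made-up data frame (`toy` and its values are hypothetical, not taken from flights):

```r
library(dplyr)

# Hypothetical toy data, not taken from flights (delays in minutes)
toy <- data.frame(
  id        = 1:4,
  dep_delay = c(60, 90, 120, 45),
  arr_delay = c(20, 70, 80, 0)
)

# Left at least an hour late and gained more than 30 minutes in the air
made_up <- filter(toy, dep_delay >= 60, dep_delay - arr_delay > 30)
made_up$id
```

The same condition, `dep_delay - arr_delay > 30`, should carry over to flights unchanged, since it uses only the two delay columns.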
Departed between midnight and 6am (inclusive)
filter(flights, hour <= 6)
# A tibble: 27,905 x 19
year month day dep_time sched_dep_time dep_delay arr_time
<int> <int> <int> <int> <int> <dbl> <int>
1 2013 1 1 517 515 2 830
2 2013 1 1 533 529 4 850
3 2013 1 1 542 540 2 923
4 2013 1 1 544 545 -1 1004
5 2013 1 1 554 600 -6 812
6 2013 1 1 554 558 -4 740
7 2013 1 1 555 600 -5 913
8 2013 1 1 557 600 -3 709
9 2013 1 1 557 600 -3 838
10 2013 1 1 558 600 -2 753
# … with 27,895 more rows, and 12 more variables: sched_arr_time <int>,
# arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
# origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
# minute <dbl>, time_hour <dttm>
Another useful dplyr filtering helper is between(). What does it do? Can you use it to simplify the code needed to answer the previous challenges?
between() selects numeric values between a minimum and maximum value (inclusive).
While I think between() is useful, I don’t find these examples very compelling.
filter(flights, between(month, 7, 9))
# A tibble: 86,326 x 19
year month day dep_time sched_dep_time dep_delay arr_time
<int> <int> <int> <int> <int> <dbl> <int>
1 2013 7 1 1 2029 212 236
2 2013 7 1 2 2359 3 344
3 2013 7 1 29 2245 104 151
4 2013 7 1 43 2130 193 322
5 2013 7 1 44 2150 174 300
6 2013 7 1 46 2051 235 304
7 2013 7 1 48 2001 287 308
8 2013 7 1 58 2155 183 335
9 2013 7 1 100 2146 194 327
10 2013 7 1 100 2245 135 337
# … with 86,316 more rows, and 12 more variables: sched_arr_time <int>,
# arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
# origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
# minute <dbl>, time_hour <dttm>
filter(flights, between(hour, 0, 6))
# A tibble: 27,905 x 19
year month day dep_time sched_dep_time dep_delay arr_time
<int> <int> <int> <int> <int> <dbl> <int>
1 2013 1 1 517 515 2 830
2 2013 1 1 533 529 4 850
3 2013 1 1 542 540 2 923
4 2013 1 1 544 545 -1 1004
5 2013 1 1 554 600 -6 812
6 2013 1 1 554 558 -4 740
7 2013 1 1 555 600 -5 913
8 2013 1 1 557 600 -3 709
9 2013 1 1 557 600 -3 838
10 2013 1 1 558 600 -2 753
# … with 27,895 more rows, and 12 more variables: sched_arr_time <int>,
# arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
# origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
# minute <dbl>, time_hour <dttm>
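To check the "inclusive" claim, the sketch below compares between() with the equivalent pair of comparisons. The vector `month` is made up for illustration and includes an NA to show that missing values stay missing.

```r
library(dplyr)

# Hypothetical toy values, not taken from the flights data
month <- c(6, 7, 8, 9, 10, NA)

in_range <- between(month, 7, 9)      # both endpoints are included
by_hand  <- month >= 7 & month <= 9   # the comparison it stands in for

in_range
identical(in_range, by_hand)
```

Both boundary values (7 and 9) pass the test, and the NA input yields NA rather than FALSE.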
How many flights have a missing dep_time? What other variables are missing? What might these rows represent?
filter(flights, is.na(dep_time))
# A tibble: 8,255 x 19
year month day dep_time sched_dep_time dep_delay arr_time
<int> <int> <int> <int> <int> <dbl> <int>
1 2013 1 1 NA 1630 NA NA
2 2013 1 1 NA 1935 NA NA
3 2013 1 1 NA 1500 NA NA
4 2013 1 1 NA 600 NA NA
5 2013 1 2 NA 1540 NA NA
6 2013 1 2 NA 1620 NA NA
7 2013 1 2 NA 1355 NA NA
8 2013 1 2 NA 1420 NA NA
9 2013 1 2 NA 1321 NA NA
10 2013 1 2 NA 1545 NA NA
# … with 8,245 more rows, and 12 more variables: sched_arr_time <int>,
# arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
# origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
# minute <dbl>, time_hour <dttm>
I assume these are cancelled flights since all columns relating to departure and arrival times are missing.
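One way to answer "what other variables are missing" without scanning the printed columns is to count the NA values per column. The sketch below uses a small made-up data frame (`toy` is hypothetical); the same call, `colSums(is.na(flights))`, should work on the full data set.

```r
# Hypothetical toy data mimicking a few flights columns
toy <- data.frame(
  dep_time       = c(517, NA, 542, NA),
  arr_time       = c(830, NA, 923, NA),
  sched_dep_time = c(515, 1630, 540, 1935)
)

# is.na() returns a logical matrix; colSums() counts the TRUEs per column
na_counts <- colSums(is.na(toy))
na_counts
```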
Why is NA ^ 0 not missing? Why is NA | TRUE not missing? Why is FALSE & NA not missing? Can you figure out the general rule? (NA * 0 is a tricky counterexample!)
NA ^ 0 # same as NA ^ FALSE
[1] 1
NA | TRUE
[1] TRUE
FALSE & NA
[1] FALSE
NA * 0
[1] NA
I think the NA | TRUE and FALSE & NA results make sense. They are effectively short-circuiting the logic. For an “or” statement, a TRUE on one of the sides is sufficient to render the entire statement true, regardless of whether the other value is missing. Vice versa for the “and”: a FALSE on either side is sufficient to make the entire statement false.
# have to be TRUE
NA | TRUE
[1] TRUE
TRUE | NA
[1] TRUE
# ambiguous
NA | FALSE
[1] NA
FALSE | NA
[1] NA
# have to be FALSE
NA & FALSE
[1] FALSE
FALSE & NA
[1] FALSE
# ambiguous
NA & TRUE
[1] NA
TRUE & NA
[1] NA
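The eight cases above can also be generated in one go as full truth tables. This is just a compact restatement in base R; the names `vals`, `labels`, `or_table` and `and_table` are chosen here for illustration.

```r
vals   <- c(TRUE, FALSE, NA)
labels <- c("TRUE", "FALSE", "NA")

# outer() applies the operator to every pair of values
or_table  <- outer(vals, vals, "|")
and_table <- outer(vals, vals, "&")
dimnames(or_table)  <- list(labels, labels)
dimnames(and_table) <- list(labels, labels)

or_table    # NA survives only where the other operand is FALSE or NA
and_table   # NA survives only where the other operand is TRUE or NA
```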
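The NA ^ 0 result made less sense to me at first. Note that ^ is exponentiation in R, not “exclusive or” (that is xor()). From the docs for ?Arithmetic:
1 ^ y and y ^ 0 are 1, always.
So the result is the same whatever number the missing value stands for, and R can return 1.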
NA ^ FALSE
[1] 1
NA ^ 1
[1] NA
1 ^ NA
[1] 1
1 ^ 1
[1] 1
0 ^ 0
[1] 1
R only resolves a missing value when the result would be the same for every value it could stand for. That is not the case for arithmetic such as NA * 0: the product is 0 for any finite number, but Inf * 0 is NaN, so R cannot assume 0 and returns NA. Division behaves the same way.
0 / NA
[1] NA
NA / NA * 0
[1] NA
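A quick way to test the general rule is to substitute candidate values for the NA and see whether the result ever changes. The vector `candidates` below is an arbitrary, made-up selection that includes the special values Inf and NaN.

```r
# NA stands for "some unknown number"; try values it could be
candidates <- c(-2, 0, 5, Inf, NaN)

pow0   <- candidates ^ 0   # the same answer for every candidate
times0 <- candidates * 0   # finite values give 0, Inf and NaN do not

pow0
times0
```

Because `candidates ^ 0` is constant, NA ^ 0 can be resolved; because `candidates * 0` is not, NA * 0 has to stay NA.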
arrange()
p. 51
How could you use arrange() to sort all missing values to the start? (Hint: use is.na()).
arrange(flights, desc(is.na(dep_time)))
# A tibble: 336,776 x 19
year month day dep_time sched_dep_time dep_delay arr_time
<int> <int> <int> <int> <int> <dbl> <int>
1 2013 1 1 NA 1630 NA NA
2 2013 1 1 NA 1935 NA NA
3 2013 1 1 NA 1500 NA NA
4 2013 1 1 NA 600 NA NA
5 2013 1 2 NA 1540 NA NA
6 2013 1 2 NA 1620 NA NA
7 2013 1 2 NA 1355 NA NA
8 2013 1 2 NA 1420 NA NA
9 2013 1 2 NA 1321 NA NA
10 2013 1 2 NA 1545 NA NA
# … with 336,766 more rows, and 12 more variables: sched_arr_time <int>,
# arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
# origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
# minute <dbl>, time_hour <dttm>
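The is.na() trick is needed because arrange() places missing values last whether the sort is ascending or descending. A small sketch on made-up data (`toy` is hypothetical) shows the three cases side by side:

```r
library(dplyr)

# Hypothetical toy data, not taken from flights
toy <- data.frame(id = 1:4, dep_time = c(900, NA, 517, NA))

asc      <- arrange(toy, dep_time)                         # NA rows last
dsc      <- arrange(toy, desc(dep_time))                   # NA rows still last
na_first <- arrange(toy, desc(is.na(dep_time)), dep_time)  # NA rows first

na_first
```

Adding `dep_time` as a second sort key keeps the non-missing rows in ascending order after the NA block.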
Sort flights to find the most delayed flights. Find the flights that left earliest.
arrange(flights, desc(dep_delay))
# A tibble: 336,776 x 19
year month day dep_time sched_dep_time dep_delay arr_time
<int> <int> <int> <int> <int> <dbl> <int>
1 2013 1 9 641 900 1301 1242
2 2013 6 15 1432 1935 1137 1607
3 2013 1 10 1121 1635 1126 1239
4 2013 9 20 1139 1845 1014 1457
5 2013 7 22 845 1600 1005 1044
6 2013 4 10 1100 1900 960 1342
7 2013 3 17 2321 810 911 135
8 2013 6 27 959 1900 899 1236
9 2013 7 22 2257 759 898 121
10 2013 12 5 756 1700 896 1058
# … with 336,766 more rows, and 12 more variables: sched_arr_time <int>,
# arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
# origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
# minute <dbl>, time_hour <dttm>
arrange(flights, hour, minute) # left earliest in the AM
# A tibble: 336,776 x 19
year month day dep_time sched_dep_time dep_delay arr_time
<int> <int> <int> <int> <int> <dbl> <int>
1 2013 7 27 NA 106 NA NA
2 2013 1 2 458 500 -2 703
3 2013 1 3 458 500 -2 650
4 2013 1 4 456 500 -4 631
5 2013 1 5 458 500 -2 640
6 2013 1 6 458 500 -2 718
7 2013 1 7 454 500 -6 637
8 2013 1 8 454 500 -6 625
9 2013 1 9 457 500 -3 647
10 2013 1 10 450 500 -10 634
# … with 336,766 more rows, and 12 more variables: sched_arr_time <int>,
# arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
# origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
# minute <dbl>, time_hour <dttm>
arrange(flights, dep_delay) # left earliest in relation to scheduled dep time
# A tibble: 336,776 x 19
year month day dep_time sched_dep_time dep_delay arr_time
<int> <int> <int> <int> <int> <dbl> <int>
1 2013 12 7 2040 2123 -43 40
2 2013 2 3 2022 2055 -33 2240
3 2013 11 10 1408 1440 -32 1549
4 2013 1 11 1900 1930 -30 2233
5 2013 1 29 1703 1730 -27 1947
6 2013 8 9 729 755 -26 1002
7 2013 10 23 1907 1932 -25 2143
8 2013 3 30 2030 2055 -25 2213
9 2013 3 2 1431 1455 -24 1601
10 2013 5 5 934 958 -24 1225
# … with 336,766 more rows, and 12 more variables: sched_arr_time <int>,
# arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
# origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
# minute <dbl>, time_hour <dttm>
Sort flights to find the fastest flights.
arrange(flights, air_time)
# A tibble: 336,776 x 19
year month day dep_time sched_dep_time dep_delay arr_time
<int> <int> <int> <int> <int> <dbl> <int>
1 2013 1 16 1355 1315 40 1442
2 2013 4 13 537 527 10 622
3 2013 12 6 922 851 31 1021
4 2013 2 3 2153 2129 24 2247
5 2013 2 5 1303 1315 -12 1342
6 2013 2 12 2123 2130 -7 2211
7 2013 3 2 1450 1500 -10 1547
8 2013 3 8 2026 1935 51 2131
9 2013 3 18 1456 1329 87 1533
10 2013 3 19 2226 2145 41 2305
# … with 336,766 more rows, and 12 more variables: sched_arr_time <int>,
# arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
# origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
# minute <dbl>, time_hour <dttm>
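Sorting by air_time finds the flights with the shortest time in the air. If "fastest" is read as highest average speed instead, the sort key could be distance divided by air time. A sketch of both readings on made-up data (`toy` and its values are hypothetical):

```r
library(dplyr)

# Hypothetical toy data: distance in miles, air_time in minutes
toy <- data.frame(
  flight   = c("A", "B", "C"),
  distance = c(200, 2500, 1000),
  air_time = c(40, 300, 150),
  stringsAsFactors = FALSE
)

by_air_time <- arrange(toy, air_time)                   # shortest time aloft
by_speed    <- arrange(toy, desc(distance / air_time))  # highest miles per minute

by_air_time$flight
by_speed$flight
```

The two orderings differ here: the short hop "A" has the least air time but the lowest speed.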
Which flights travelled the longest? Which travelled the shortest?
arrange(flights, desc(distance))
# A tibble: 336,776 x 19
year month day dep_time sched_dep_time dep_delay arr_time
<int> <int> <int> <int> <int> <dbl> <int>
1 2013 1 1 857 900 -3 1516
2 2013 1 2 909 900 9 1525
3 2013 1 3 914 900 14 1504
4 2013 1 4 900 900 0 1516
5 2013 1 5 858 900 -2 1519
6 2013 1 6 1019 900 79 1558
7 2013 1 7 1042 900 102 1620
8 2013 1 8 901 900 1 1504
9 2013 1 9 641 900 1301 1242
10 2013 1 10 859 900 -1 1449
# … with 336,766 more rows, and 12 more variables: sched_arr_time <int>,
# arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
# origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
# minute <dbl>, time_hour <dttm>
arrange(flights, distance)
# A tibble: 336,776 x 19
year month day dep_time sched_dep_time dep_delay arr_time
<int> <int> <int> <int> <int> <dbl> <int>
1 2013 7 27 NA 106 NA NA
2 2013 1 3 2127 2129 -2 2222
3 2013 1 4 1240 1200 40 1333
4 2013 1 4 1829 1615 134 1937
5 2013 1 4 2128 2129 -1 2218
6 2013 1 5 1155 1200 -5 1241
7 2013 1 6 2125 2129 -4 2224
8 2013 1 7 2124 2129 -5 2212
9 2013 1 8 2127 2130 -3 2304
10 2013 1 9 2126 2129 -3 2217
# … with 336,766 more rows, and 12 more variables: sched_arr_time <int>,
# arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
# origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
# minute <dbl>, time_hour <dttm>
select()
p. 54
Brainstorm as many ways as possible to select dep_time, dep_delay, arr_time, and arr_delay from flights.
select(flights, dep_time, dep_delay, arr_time, arr_delay)
# A tibble: 336,776 x 4
dep_time dep_delay arr_time arr_delay
<int> <dbl> <int> <dbl>
1 517 2 830 11
2 533 4 850 20
3 542 2 923 33
4 544 -1 1004 -18
5 554 -6 812 -25
6 554 -4 740 12
7 555 -5 913 19
8 557 -3 709 -14
9 557 -3 838 -8
10 558 -2 753 8
# … with 336,766 more rows
select(flights, dep_time, dep_delay:arr_time, arr_delay)
# A tibble: 336,776 x 4
dep_time dep_delay arr_time arr_delay
<int> <dbl> <int> <dbl>
1 517 2 830 11
2 533 4 850 20
3 542 2 923 33
4 544 -1 1004 -18
5 554 -6 812 -25
6 554 -4 740 12
7 555 -5 913 19
8 557 -3 709 -14
9 557 -3 838 -8
10 558 -2 753 8
# … with 336,766 more rows
select(flights, starts_with("dep"), starts_with("arr"))
# A tibble: 336,776 x 4
dep_time dep_delay arr_time arr_delay
<int> <dbl> <int> <dbl>
1 517 2 830 11
2 533 4 850 20
3 542 2 923 33
4 544 -1 1004 -18
5 554 -6 812 -25
6 554 -4 740 12
7 555 -5 913 19
8 557 -3 709 -14
9 557 -3 838 -8
10 558 -2 753 8
# … with 336,766 more rows
select(flights, dep_time, arr_time, ends_with("delay"))
# A tibble: 336,776 x 4
dep_time arr_time dep_delay arr_delay
<int> <int> <dbl> <dbl>
1 517 830 2 11
2 533 850 4 20
3 542 923 2 33
4 544 1004 -1 -18
5 554 812 -6 -25
6 554 740 -4 12
7 555 913 -5 19
8 557 709 -3 -14
9 557 838 -3 -8
10 558 753 -2 8
# … with 336,766 more rows
What happens if you include the name of a variable multiple times in a select() call?
The column is added in the first place it is mentioned:
select(flights, dep_time, dep_time)
# A tibble: 336,776 x 1
dep_time
<int>
1 517
2 533
3 542
4 544
5 554
6 554
7 555
8 557
9 557
10 558
# … with 336,766 more rows
select(flights, dep_time, arr_time, dep_time)
# A tibble: 336,776 x 2
dep_time arr_time
<int> <int>
1 517 830
2 533 850
3 542 923
4 544 1004
5 554 812
6 554 740
7 555 913
8 557 709
9 557 838
10 558 753
# … with 336,766 more rows
What does the one_of() function do? Why might it be helpful in conjunction with this vector?
From the docs:
one_of(): Matches variable names in a character vector.
Thus it serves a similar purpose to %in%. The difference is that %in% returns a logical vector (which could be used in filter()), whereas select() accepts the integer positions of the columns, so the logical vector has to be converted with which():
vars <- c("year", "month", "day", "dep_delay", "arr_delay")
select(flights, one_of(vars))
# A tibble: 336,776 x 5
year month day dep_delay arr_delay
<int> <int> <int> <dbl> <dbl>
1 2013 1 1 2 11
2 2013 1 1 4 20
3 2013 1 1 2 33
4 2013 1 1 -1 -18
5 2013 1 1 -6 -25
6 2013 1 1 -4 12
7 2013 1 1 -5 19
8 2013 1 1 -3 -14
9 2013 1 1 -3 -8
10 2013 1 1 -2 8
# … with 336,766 more rows
# This would be how to replace one_of()
select(flights, which(colnames(flights) %in% vars))
# A tibble: 336,776 x 5
year month day dep_delay arr_delay
<int> <int> <int> <dbl> <dbl>
1 2013 1 1 2 11
2 2013 1 1 4 20
3 2013 1 1 2 33
4 2013 1 1 -1 -18
5 2013 1 1 -6 -25
6 2013 1 1 -4 12
7 2013 1 1 -5 19
8 2013 1 1 -3 -14
9 2013 1 1 -3 -8
10 2013 1 1 -2 8
# … with 336,766 more rows
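As an aside, releases of dplyr and tidyselect newer than the ones used for this page (dplyr 1.0.0 and later, as far as I know) provide all_of() and any_of() for selecting from a character vector. The sketch below assumes such a version is installed and uses a made-up one-row data frame (`toy` is hypothetical) that deliberately lacks arr_delay.

```r
library(dplyr)

# Hypothetical toy data; arr_delay is deliberately absent
toy  <- data.frame(year = 2013, month = 1, day = 1, dep_delay = 2)
vars <- c("year", "month", "day", "dep_delay", "arr_delay")

# any_of() silently skips names that are not present
picked <- select(toy, any_of(vars))
names(picked)

# all_of() is strict: a missing name raises an error
strict <- tryCatch(select(toy, all_of(vars)), error = function(e) "error")
strict
```

Which of the two to use depends on whether a missing column should be tolerated or treated as a mistake.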
Does the result of running the following code surprise you? How do the select helpers deal with case by default? How can you change that default?
select(flights, contains("TIME"))
# A tibble: 336,776 x 6
dep_time sched_dep_time arr_time sched_arr_time air_time
<int> <int> <int> <int> <dbl>
1 517 515 830 819 227
2 533 529 850 830 227
3 542 540 923 850 160
4 544 545 1004 1022 183
5 554 600 812 837 116
6 554 558 740 728 150
7 555 600 913 854 158
8 557 600 709 723 53
9 557 600 838 846 140
10 558 600 753 745 138
# … with 336,766 more rows, and 1 more variable: time_hour <dttm>
The tidyselect helpers ignore case by default (ignore.case = TRUE):
formals(contains)$ignore.case
[1] TRUE
select(flights, contains("TIME", ignore.case = FALSE))
# A tibble: 336,776 x 0
sessionInfo()
R version 3.6.1 (2019-07-05)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 18.04.3 LTS
Matrix products: default
BLAS: /usr/lib/x86_64-linux-gnu/atlas/libblas.so.3.10.3
LAPACK: /usr/lib/x86_64-linux-gnu/atlas/liblapack.so.3.10.3
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] forcats_0.4.0 stringr_1.4.0 dplyr_0.8.3
[4] purrr_0.3.2 readr_1.3.1 tidyr_1.0.0
[7] tibble_2.1.3 ggplot2_3.2.1 tidyverse_1.2.1
[10] nycflights13_1.0.1
loaded via a namespace (and not attached):
[1] tidyselect_0.2.5 xfun_0.9 haven_2.1.1 lattice_0.20-38
[5] colorspace_1.4-1 vctrs_0.2.0 generics_0.0.2 htmltools_0.3.6
[9] yaml_2.2.0 utf8_1.1.4 rlang_0.4.0 pillar_1.4.2
[13] glue_1.3.1 withr_2.1.2 modelr_0.1.5 readxl_1.3.1
[17] lifecycle_0.1.0 munsell_0.5.0 gtable_0.3.0 workflowr_1.4.0
[21] cellranger_1.1.0 rvest_0.3.4 evaluate_0.14 knitr_1.25
[25] fansi_0.4.0 broom_0.5.2 Rcpp_1.0.2 backports_1.1.4
[29] scales_1.0.0 jsonlite_1.6 fs_1.3.1 hms_0.5.1
[33] digest_0.6.21 stringi_1.4.3 grid_3.6.1 rprojroot_1.2
[37] cli_1.1.0 tools_3.6.1 magrittr_1.5 lazyeval_0.2.2
[41] crayon_1.3.4 whisker_0.4 pkgconfig_2.0.2 zeallot_0.1.0
[45] xml2_1.2.2 lubridate_1.7.4 assertthat_0.2.1 rmarkdown_1.15
[49] httr_1.4.1 rstudioapi_0.10 R6_2.4.0 nlme_3.1-141
[53] git2r_0.26.1 compiler_3.6.1