class: center, middle, inverse, title-slide .title[ # Data Literacy: Introduction to R ] .subtitle[ ## Relational Data ] .author[ ### Veronika Batzdorfer ] .date[ ### 2025-05-24 ] --- layout: true --- ## Combining data Combine different data sets (e.g., data on the same subjects from different sources). The simplest case is: - have the same **cases/rows** in the same order but different variables/columns - have the same **columns/variables** (with the same names) but different cases/rows then -> use `base R` functions .highlight[`rbind()`] to add/combine rows or .highlight[`cbind()`] to add/combine columns, or, better yet, `bind_rows()` or `bind_cols()` from `dplyr`. --- ## Relational data Use the so-called *two-table verbs* from `dplyr` which allow you to combine two (or more) tabular data sets in various ways. 1. Mutating joins 2. Filtering joins If you have ever worked with `SQL` code, many of the following things will look familiar. For a more, you can have a look at the [chapter on relational data](https://r4ds.had.co.nz/relational-data.html) in *R for Data Science*. --- ## Superhero data 🦇⚡🕸🧜♂️ We illustrate the different joins from `dplyr`, with the [Super Heroes data set from *Kaggle*](https://www.kaggle.com/claudiodavi/superhero-set). .pull-left[ .center[ **Superheroes** <table class="table" style="font-size: 16px; color: black; margin-left: auto; margin-right: auto;"> <thead> <tr> <th style="text-align:left;"> Name </th> <th style="text-align:left;"> Alignment </th> <th style="text-align:left;"> Race </th> <th style="text-align:left;"> Publisher </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> Aquaman </td> <td style="text-align:left;"> good </td> <td style="text-align:left;"> Atlantean </td> <td style="text-align:left;"> DC </td> </tr> <tr> <td style="text-align:left;"> Batgirl </td> <td style="text-align:left;"> good </td> <td style="text-align:left;"> Human </td> <td style="text-align:left;"> DC </td> </tr> <tr> <td style="text-align:left;"> Catwoman </td> <td style="text-align:left;"> good </td> <td style="text-align:left;"> Human </td> <td style="text-align:left;"> DC </td> </tr> <tr> <td style="text-align:left;"> Deadpool </td> <td style="text-align:left;"> neutral </td> <td style="text-align:left;"> Mutant </td> <td style="text-align:left;"> Marvel </td> </tr> <tr> <td style="text-align:left;"> Hellboy </td> <td style="text-align:left;"> good </td> <td style="text-align:left;"> Demon </td> <td style="text-align:left;"> Dark Horse </td> </tr> <tr> <td style="text-align:left;"> Magneto </td> <td style="text-align:left;"> bad </td> <td style="text-align:left;"> Mutant </td> <td style="text-align:left;"> Marvel </td> </tr> <tr> <td style="text-align:left;"> Poison Ivy </td> <td style="text-align:left;"> bad </td> <td style="text-align:left;"> Human </td> <td style="text-align:left;"> DC </td> </tr> </tbody> </table> ] ] .pull-right[ .center[ **Publishers** <table class="table" style="font-size: 16px; color: black; margin-left: auto; margin-right: auto;"> <thead> <tr> <th style="text-align:left;"> Publisher </th> <th style="text-align:right;"> Founded </th> <th style="text-align:left;"> Location </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> Marvel </td> <td style="text-align:right;"> 1939 </td> <td style="text-align:left;"> NYC (NY) </td> </tr> <tr> <td style="text-align:left;"> DC </td> <td style="text-align:right;"> 1934 </td> <td style="text-align:left;"> Burbank (CA) </td> </tr> <tr> <td style="text-align:left;"> Image </td> <td style="text-align:right;"> 1992 </td> <td style="text-align:left;"> Berkeley (CA) </td> </tr> </tbody> </table> ] ] --- ## Mutating joins > A mutating join [...] .highlight[first matches observations by their keys], then .highlight[copies across variables from one table to the other] ([R for Data Science](https://r4ds.had.co.nz/relational-data.html#mutating-joins)). - `inner_join()` - `left_join()` - `right_join()` - `full_join()` --- ## Inner join > All rows from `x` where there are matching values in `y`, and all columns from `x` and `y`. <img src="data:image/png;base64,#https://github.com/gadenbuie/tidyexplain/blob/main/images/inner-join.gif?raw=true" width="45%" style="display: block; margin: auto;" /> .small[ Animation by [Garrick Aden-Buie](https://github.com/gadenbuie/tidyexplain) ] --- ## Inner join example .pull-left[ .center[ <table class="table" style="font-size: 10px; color: black; margin-left: auto; margin-right: auto;"> <thead> <tr> <th style="text-align:left;"> Name </th> <th style="text-align:left;"> Alignment </th> <th style="text-align:left;"> Race </th> <th style="text-align:left;"> Publisher </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> Aquaman </td> <td style="text-align:left;"> good </td> <td style="text-align:left;"> Atlantean </td> <td style="text-align:left;"> DC </td> </tr> <tr> <td style="text-align:left;"> Batgirl </td> <td style="text-align:left;"> good </td> <td style="text-align:left;"> Human </td> <td style="text-align:left;"> DC </td> </tr> <tr> <td style="text-align:left;"> Catwoman </td> <td style="text-align:left;"> good </td> <td style="text-align:left;"> Human </td> <td style="text-align:left;"> DC </td> </tr> <tr> <td style="text-align:left;"> Deadpool </td> <td style="text-align:left;"> neutral </td> <td style="text-align:left;"> Mutant </td> <td style="text-align:left;"> Marvel </td> </tr> <tr> <td style="text-align:left;"> Hellboy </td> <td style="text-align:left;"> good </td> <td style="text-align:left;"> Demon </td> <td style="text-align:left;"> Dark Horse </td> </tr> <tr> <td style="text-align:left;"> Magneto </td> <td style="text-align:left;"> bad </td> <td style="text-align:left;"> Mutant </td> <td style="text-align:left;"> Marvel </td> </tr> <tr> <td style="text-align:left;"> Poison Ivy </td> <td style="text-align:left;"> bad </td> <td style="text-align:left;"> Human </td> <td style="text-align:left;"> DC </td> </tr> </tbody> </table> ] ] .pull-right[ .center[ <table class="table" style="font-size: 10px; color: black; margin-left: auto; margin-right: auto;"> <thead> <tr> <th style="text-align:left;"> Publisher </th> <th style="text-align:right;"> Founded </th> <th style="text-align:left;"> Location </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> Marvel </td> <td style="text-align:right;"> 1939 </td> <td style="text-align:left;"> NYC (NY) </td> </tr> <tr> <td style="text-align:left;"> DC </td> <td style="text-align:right;"> 1934 </td> <td style="text-align:left;"> Burbank (CA) </td> </tr> <tr> <td style="text-align:left;"> Image </td> <td style="text-align:right;"> 1992 </td> <td style="text-align:left;"> Berkeley (CA) </td> </tr> </tbody> </table> ] ] <br> ``` r heroes %>% inner_join(publishers, by = "Publisher") ``` <table class="table" style="font-size: 10px; color: black; margin-left: auto; margin-right: auto;"> <thead> <tr> <th style="text-align:left;"> Name </th> <th style="text-align:left;"> Alignment </th> <th style="text-align:left;"> Race </th> <th style="text-align:left;"> Publisher </th> <th style="text-align:right;"> Founded </th> <th style="text-align:left;"> Location </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> Aquaman </td> <td style="text-align:left;"> good </td> <td style="text-align:left;"> Atlantean </td> <td style="text-align:left;"> DC </td> <td style="text-align:right;"> 1934 </td> <td style="text-align:left;"> Burbank (CA) </td> </tr> <tr> <td style="text-align:left;"> Batgirl </td> <td style="text-align:left;"> good </td> <td style="text-align:left;"> Human </td> <td style="text-align:left;"> DC </td> <td style="text-align:right;"> 1934 </td> <td style="text-align:left;"> Burbank (CA) </td> </tr> <tr> <td style="text-align:left;"> Catwoman </td> <td style="text-align:left;"> good </td> <td style="text-align:left;"> Human </td> <td style="text-align:left;"> DC </td> <td style="text-align:right;"> 1934 </td> <td style="text-align:left;"> Burbank (CA) </td> </tr> <tr> <td style="text-align:left;"> Deadpool </td> <td style="text-align:left;"> neutral </td> <td style="text-align:left;"> Mutant </td> <td style="text-align:left;"> Marvel </td> <td style="text-align:right;"> 1939 </td> <td style="text-align:left;"> NYC (NY) </td> </tr> <tr> <td style="text-align:left;"> Magneto </td> <td style="text-align:left;"> bad </td> <td style="text-align:left;"> Mutant </td> <td style="text-align:left;"> Marvel </td> <td style="text-align:right;"> 1939 </td> <td style="text-align:left;"> NYC (NY) </td> </tr> <tr> <td style="text-align:left;"> Poison Ivy </td> <td style="text-align:left;"> bad </td> <td style="text-align:left;"> Human </td> <td style="text-align:left;"> DC </td> <td style="text-align:right;"> 1934 </td> <td style="text-align:left;"> Burbank (CA) </td> </tr> </tbody> </table> --- ## Left join > All rows from `x`, and all columns from `x` and `y`. Rows in `x` with no match in `y` will have `NA` values in the new columns. <img src="data:image/png;base64,#https://github.com/gadenbuie/tidyexplain/blob/main/images/left-join.gif?raw=true" width="45%" style="display: block; margin: auto;" /> .small[ Animation by [Garrick Aden-Buie](https://github.com/gadenbuie/tidyexplain) ] --- ## Left join example .pull-left[ .center[ <table class="table" style="font-size: 10px; color: black; margin-left: auto; margin-right: auto;"> <thead> <tr> <th style="text-align:left;"> Name </th> <th style="text-align:left;"> Alignment </th> <th style="text-align:left;"> Race </th> <th style="text-align:left;"> Publisher </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> Aquaman </td> <td style="text-align:left;"> good </td> <td style="text-align:left;"> Atlantean </td> <td style="text-align:left;"> DC </td> </tr> <tr> <td style="text-align:left;"> Batgirl </td> <td style="text-align:left;"> good </td> <td style="text-align:left;"> Human </td> <td style="text-align:left;"> DC </td> </tr> <tr> <td style="text-align:left;"> Catwoman </td> <td style="text-align:left;"> good </td> <td style="text-align:left;"> Human </td> <td style="text-align:left;"> DC </td> </tr> <tr> <td style="text-align:left;"> Deadpool </td> <td style="text-align:left;"> neutral </td> <td style="text-align:left;"> Mutant </td> <td style="text-align:left;"> Marvel </td> </tr> <tr> <td style="text-align:left;"> Hellboy </td> <td style="text-align:left;"> good </td> <td style="text-align:left;"> Demon </td> <td style="text-align:left;"> Dark Horse </td> </tr> <tr> <td style="text-align:left;"> Magneto </td> <td style="text-align:left;"> bad </td> <td style="text-align:left;"> Mutant </td> <td style="text-align:left;"> Marvel </td> </tr> <tr> <td style="text-align:left;"> Poison Ivy </td> <td style="text-align:left;"> bad </td> <td style="text-align:left;"> Human </td> <td style="text-align:left;"> DC </td> </tr> </tbody> </table> ] ] .pull-right[ .center[ <table class="table" style="font-size: 10px; color: black; margin-left: auto; margin-right: auto;"> <thead> <tr> <th style="text-align:left;"> Publisher </th> <th style="text-align:right;"> Founded </th> <th style="text-align:left;"> Location </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> Marvel </td> <td style="text-align:right;"> 1939 </td> <td style="text-align:left;"> NYC (NY) </td> </tr> <tr> <td style="text-align:left;"> DC </td> <td style="text-align:right;"> 1934 </td> <td style="text-align:left;"> Burbank (CA) </td> </tr> <tr> <td style="text-align:left;"> Image </td> <td style="text-align:right;"> 1992 </td> <td style="text-align:left;"> Berkeley (CA) </td> </tr> </tbody> </table> ] ] <br> .small[ ``` r heroes %>% left_join(publishers, by = "Publisher") ``` ] <table class="table" style="font-size: 10px; color: black; margin-left: auto; margin-right: auto;"> <thead> <tr> <th style="text-align:left;"> Name </th> <th style="text-align:left;"> Alignment </th> <th style="text-align:left;"> Race </th> <th style="text-align:left;"> Publisher </th> <th style="text-align:right;"> Founded </th> <th style="text-align:left;"> Location </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> Aquaman </td> <td style="text-align:left;"> good </td> <td style="text-align:left;"> Atlantean </td> <td style="text-align:left;"> DC </td> <td style="text-align:right;"> 1934 </td> <td style="text-align:left;"> Burbank (CA) </td> </tr> <tr> <td style="text-align:left;"> Batgirl </td> <td style="text-align:left;"> good </td> <td style="text-align:left;"> Human </td> <td style="text-align:left;"> DC </td> <td style="text-align:right;"> 1934 </td> <td style="text-align:left;"> Burbank (CA) </td> </tr> <tr> <td style="text-align:left;"> Catwoman </td> <td style="text-align:left;"> good </td> <td style="text-align:left;"> Human </td> <td style="text-align:left;"> DC </td> <td style="text-align:right;"> 1934 </td> <td style="text-align:left;"> Burbank (CA) </td> </tr> <tr> <td style="text-align:left;"> Deadpool </td> <td style="text-align:left;"> neutral </td> <td style="text-align:left;"> Mutant </td> <td style="text-align:left;"> Marvel </td> <td style="text-align:right;"> 1939 </td> <td style="text-align:left;"> NYC (NY) </td> </tr> <tr> <td style="text-align:left;"> Hellboy </td> <td style="text-align:left;"> good </td> <td style="text-align:left;"> Demon </td> <td style="text-align:left;"> Dark Horse </td> <td style="text-align:right;"> NA </td> <td style="text-align:left;"> NA </td> </tr> <tr> <td style="text-align:left;"> Magneto </td> <td style="text-align:left;"> bad </td> <td style="text-align:left;"> Mutant </td> <td style="text-align:left;"> Marvel </td> <td style="text-align:right;"> 1939 </td> <td style="text-align:left;"> NYC (NY) </td> </tr> <tr> <td style="text-align:left;"> Poison Ivy </td> <td style="text-align:left;"> bad </td> <td style="text-align:left;"> Human </td> <td style="text-align:left;"> DC </td> <td style="text-align:right;"> 1934 </td> <td style="text-align:left;"> Burbank (CA) </td> </tr> </tbody> </table> --- ## Right join > All rows from y, and all columns from `x` and `y`. Rows in `y` with no match in `x` will have `NA` values in the new columns. <img src="data:image/png;base64,#https://github.com/gadenbuie/tidyexplain/blob/main/images/right-join.gif?raw=true" width="45%" style="display: block; margin: auto;" /> .small[ Animation by [Garrick Aden-Buie](https://github.com/gadenbuie/tidyexplain) ] --- ## Right join example .pull-left[ .center[ <table class="table" style="font-size: 10px; color: black; margin-left: auto; margin-right: auto;"> <thead> <tr> <th style="text-align:left;"> Name </th> <th style="text-align:left;"> Alignment </th> <th style="text-align:left;"> Race </th> <th style="text-align:left;"> Publisher </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> Aquaman </td> <td style="text-align:left;"> good </td> <td style="text-align:left;"> Atlantean </td> <td style="text-align:left;"> DC </td> </tr> <tr> <td style="text-align:left;"> Batgirl </td> <td style="text-align:left;"> good </td> <td style="text-align:left;"> Human </td> <td style="text-align:left;"> DC </td> </tr> <tr> <td style="text-align:left;"> Catwoman </td> <td style="text-align:left;"> good </td> <td style="text-align:left;"> Human </td> <td style="text-align:left;"> DC </td> </tr> <tr> <td style="text-align:left;"> Deadpool </td> <td style="text-align:left;"> neutral </td> <td style="text-align:left;"> Mutant </td> <td style="text-align:left;"> Marvel </td> </tr> <tr> <td style="text-align:left;"> Hellboy </td> <td style="text-align:left;"> good </td> <td style="text-align:left;"> Demon </td> <td style="text-align:left;"> Dark Horse </td> </tr> <tr> <td style="text-align:left;"> Magneto </td> <td style="text-align:left;"> bad </td> <td style="text-align:left;"> Mutant </td> <td style="text-align:left;"> Marvel </td> </tr> <tr> <td style="text-align:left;"> Poison Ivy </td> <td style="text-align:left;"> bad </td> <td style="text-align:left;"> Human </td> <td style="text-align:left;"> DC </td> </tr> </tbody> </table> ] ] .pull-right[ .center[ <table class="table" style="font-size: 10px; color: black; margin-left: auto; margin-right: auto;"> <thead> <tr> <th style="text-align:left;"> Publisher </th> <th style="text-align:right;"> Founded </th> <th style="text-align:left;"> Location </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> Marvel </td> <td style="text-align:right;"> 1939 </td> <td style="text-align:left;"> NYC (NY) </td> </tr> <tr> <td style="text-align:left;"> DC </td> <td style="text-align:right;"> 1934 </td> <td style="text-align:left;"> Burbank (CA) </td> </tr> <tr> <td style="text-align:left;"> Image </td> <td style="text-align:right;"> 1992 </td> <td style="text-align:left;"> Berkeley (CA) </td> </tr> </tbody> </table> ] ] <br> .small[ ``` r heroes %>% right_join(publishers, by = "Publisher") ``` ] <table class="table" style="font-size: 10px; color: black; margin-left: auto; margin-right: auto;"> <thead> <tr> <th style="text-align:left;"> Name </th> <th style="text-align:left;"> Alignment </th> <th style="text-align:left;"> Race </th> <th style="text-align:left;"> Publisher </th> <th style="text-align:right;"> Founded </th> <th style="text-align:left;"> Location </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> Aquaman </td> <td style="text-align:left;"> good </td> <td style="text-align:left;"> Atlantean </td> <td style="text-align:left;"> DC </td> <td style="text-align:right;"> 1934 </td> <td style="text-align:left;"> Burbank (CA) </td> </tr> <tr> <td style="text-align:left;"> Batgirl </td> <td style="text-align:left;"> good </td> <td style="text-align:left;"> Human </td> <td style="text-align:left;"> DC </td> <td style="text-align:right;"> 1934 </td> <td style="text-align:left;"> Burbank (CA) </td> </tr> <tr> <td style="text-align:left;"> Catwoman </td> <td style="text-align:left;"> good </td> <td style="text-align:left;"> Human </td> <td style="text-align:left;"> DC </td> <td style="text-align:right;"> 1934 </td> <td style="text-align:left;"> Burbank (CA) </td> </tr> <tr> <td style="text-align:left;"> Deadpool </td> <td style="text-align:left;"> neutral </td> <td style="text-align:left;"> Mutant </td> <td style="text-align:left;"> Marvel </td> <td style="text-align:right;"> 1939 </td> <td style="text-align:left;"> NYC (NY) </td> </tr> <tr> <td style="text-align:left;"> Magneto </td> <td style="text-align:left;"> bad </td> <td style="text-align:left;"> Mutant </td> <td style="text-align:left;"> Marvel </td> <td style="text-align:right;"> 1939 </td> <td style="text-align:left;"> NYC (NY) </td> </tr> <tr> <td style="text-align:left;"> Poison Ivy </td> <td style="text-align:left;"> bad </td> <td style="text-align:left;"> Human </td> <td style="text-align:left;"> DC </td> <td style="text-align:right;"> 1934 </td> <td style="text-align:left;"> Burbank (CA) </td> </tr> <tr> <td style="text-align:left;"> NA </td> <td style="text-align:left;"> NA </td> <td style="text-align:left;"> NA </td> <td style="text-align:left;"> Image </td> <td style="text-align:right;"> 1992 </td> <td style="text-align:left;"> Berkeley (CA) </td> </tr> </tbody> </table> --- ## Full join > All rows and all columns from both `x` and `y`. Where there are not matching values, returns `NA` for the one missing. <img src="data:image/png;base64,#https://github.com/gadenbuie/tidyexplain/blob/main/images/full-join.gif?raw=true" width="45%" style="display: block; margin: auto;" /> .small[ Animation by [Garrick Aden-Buie](https://github.com/gadenbuie/tidyexplain) ] --- ## Full join example .pull-left[ .center[ <table class="table" style="font-size: 10px; color: black; margin-left: auto; margin-right: auto;"> <thead> <tr> <th style="text-align:left;"> Name </th> <th style="text-align:left;"> Alignment </th> <th style="text-align:left;"> Race </th> <th style="text-align:left;"> Publisher </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> Aquaman </td> <td style="text-align:left;"> good </td> <td style="text-align:left;"> Atlantean </td> <td style="text-align:left;"> DC </td> </tr> <tr> <td style="text-align:left;"> Batgirl </td> <td style="text-align:left;"> good </td> <td style="text-align:left;"> Human </td> <td style="text-align:left;"> DC </td> </tr> <tr> <td style="text-align:left;"> Catwoman </td> <td style="text-align:left;"> good </td> <td style="text-align:left;"> Human </td> <td style="text-align:left;"> DC </td> </tr> <tr> <td style="text-align:left;"> Deadpool </td> <td style="text-align:left;"> neutral </td> <td style="text-align:left;"> Mutant </td> <td style="text-align:left;"> Marvel </td> </tr> <tr> <td style="text-align:left;"> Hellboy </td> <td style="text-align:left;"> good </td> <td style="text-align:left;"> Demon </td> <td style="text-align:left;"> Dark Horse </td> </tr> <tr> <td style="text-align:left;"> Magneto </td> <td style="text-align:left;"> bad </td> <td style="text-align:left;"> Mutant </td> <td style="text-align:left;"> Marvel </td> </tr> <tr> <td style="text-align:left;"> Poison Ivy </td> <td style="text-align:left;"> bad </td> <td style="text-align:left;"> Human </td> <td style="text-align:left;"> DC </td> </tr> </tbody> </table> ] ] .pull-right[ .center[ <table class="table" style="font-size: 10px; color: black; margin-left: auto; margin-right: auto;"> <thead> <tr> <th style="text-align:left;"> Publisher </th> <th style="text-align:right;"> Founded </th> <th style="text-align:left;"> Location </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> Marvel </td> <td style="text-align:right;"> 1939 </td> <td style="text-align:left;"> NYC (NY) </td> </tr> <tr> <td style="text-align:left;"> DC </td> <td style="text-align:right;"> 1934 </td> <td style="text-align:left;"> Burbank (CA) </td> </tr> <tr> <td style="text-align:left;"> Image </td> <td style="text-align:right;"> 1992 </td> <td style="text-align:left;"> Berkeley (CA) </td> </tr> </tbody> </table> ] ] <br> .small[ ``` r heroes %>% full_join(publishers, by = "Publisher") ``` ] <table class="table" style="font-size: 9px; color: black; margin-left: auto; margin-right: auto;"> <thead> <tr> <th style="text-align:left;"> Name </th> <th style="text-align:left;"> Alignment </th> <th style="text-align:left;"> Race </th> <th style="text-align:left;"> Publisher </th> <th style="text-align:right;"> Founded </th> <th style="text-align:left;"> Location </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> Aquaman </td> <td style="text-align:left;"> good </td> <td style="text-align:left;"> Atlantean </td> <td style="text-align:left;"> DC </td> <td style="text-align:right;"> 1934 </td> <td style="text-align:left;"> Burbank (CA) </td> </tr> <tr> <td style="text-align:left;"> Batgirl </td> <td style="text-align:left;"> good </td> <td style="text-align:left;"> Human </td> <td style="text-align:left;"> DC </td> <td style="text-align:right;"> 1934 </td> <td style="text-align:left;"> Burbank (CA) </td> </tr> <tr> <td style="text-align:left;"> Catwoman </td> <td style="text-align:left;"> good </td> <td style="text-align:left;"> Human </td> <td style="text-align:left;"> DC </td> <td style="text-align:right;"> 1934 </td> <td style="text-align:left;"> Burbank (CA) </td> </tr> <tr> <td style="text-align:left;"> Deadpool </td> <td style="text-align:left;"> neutral </td> <td style="text-align:left;"> Mutant </td> <td style="text-align:left;"> Marvel </td> <td style="text-align:right;"> 1939 </td> <td style="text-align:left;"> NYC (NY) </td> </tr> <tr> <td style="text-align:left;"> Hellboy </td> <td style="text-align:left;"> good </td> <td style="text-align:left;"> Demon </td> <td style="text-align:left;"> Dark Horse </td> <td style="text-align:right;"> NA </td> <td style="text-align:left;"> NA </td> </tr> <tr> <td style="text-align:left;"> Magneto </td> <td style="text-align:left;"> bad </td> <td style="text-align:left;"> Mutant </td> <td style="text-align:left;"> Marvel </td> <td style="text-align:right;"> 1939 </td> <td style="text-align:left;"> NYC (NY) </td> </tr> <tr> <td style="text-align:left;"> Poison Ivy </td> <td style="text-align:left;"> bad </td> <td style="text-align:left;"> Human </td> <td style="text-align:left;"> DC </td> <td style="text-align:right;"> 1934 </td> <td style="text-align:left;"> Burbank (CA) </td> </tr> <tr> <td style="text-align:left;"> NA </td> <td style="text-align:left;"> NA </td> <td style="text-align:left;"> NA </td> <td style="text-align:left;"> Image </td> <td style="text-align:right;"> 1992 </td> <td style="text-align:left;"> Berkeley (CA) </td> </tr> </tbody> </table> --- ## Filtering joins > Filtering joins .highlight[match observations in the same way as mutating joins], but .highlight[affect the observations, not the variables] ([R for Data Science](https://r4ds.had.co.nz/relational-data.html#filtering-joins)). - `semi_join()` - `anti_join()` --- ## Semi join > All rows from `x` where there are matching values in `y`, keeping just columns from `x`. <img src="data:image/png;base64,#https://github.com/gadenbuie/tidyexplain/blob/main/images/semi-join.gif?raw=true" width="45%" style="display: block; margin: auto;" /> .small[ Animation by [Garrick Aden-Buie](https://github.com/gadenbuie/tidyexplain) ] --- ## Semi join example .pull-left[ .center[ <table class="table" style="font-size: 10px; color: black; margin-left: auto; margin-right: auto;"> <thead> <tr> <th style="text-align:left;"> Name </th> <th style="text-align:left;"> Alignment </th> <th style="text-align:left;"> Race </th> <th style="text-align:left;"> Publisher </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> Aquaman </td> <td style="text-align:left;"> good </td> <td style="text-align:left;"> Atlantean </td> <td style="text-align:left;"> DC </td> </tr> <tr> <td style="text-align:left;"> Batgirl </td> <td style="text-align:left;"> good </td> <td style="text-align:left;"> Human </td> <td style="text-align:left;"> DC </td> </tr> <tr> <td style="text-align:left;"> Catwoman </td> <td style="text-align:left;"> good </td> <td style="text-align:left;"> Human </td> <td style="text-align:left;"> DC </td> </tr> <tr> <td style="text-align:left;"> Deadpool </td> <td style="text-align:left;"> neutral </td> <td style="text-align:left;"> Mutant </td> <td style="text-align:left;"> Marvel </td> </tr> <tr> <td style="text-align:left;"> Hellboy </td> <td style="text-align:left;"> good </td> <td style="text-align:left;"> Demon </td> <td style="text-align:left;"> Dark Horse </td> </tr> <tr> <td style="text-align:left;"> Magneto </td> <td style="text-align:left;"> bad </td> <td style="text-align:left;"> Mutant </td> <td style="text-align:left;"> Marvel </td> </tr> <tr> <td style="text-align:left;"> Poison Ivy </td> <td style="text-align:left;"> bad </td> <td style="text-align:left;"> Human </td> <td style="text-align:left;"> DC </td> </tr> </tbody> </table> ] ] .pull-right[ .center[ <table class="table" style="font-size: 10px; color: black; margin-left: auto; margin-right: auto;"> <thead> <tr> <th style="text-align:left;"> Publisher </th> <th style="text-align:right;"> Founded </th> <th style="text-align:left;"> Location </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> Marvel </td> <td style="text-align:right;"> 1939 </td> <td style="text-align:left;"> NYC (NY) </td> </tr> <tr> <td style="text-align:left;"> DC </td> <td style="text-align:right;"> 1934 </td> <td style="text-align:left;"> Burbank (CA) </td> </tr> <tr> <td style="text-align:left;"> Image </td> <td style="text-align:right;"> 1992 </td> <td style="text-align:left;"> Berkeley (CA) </td> </tr> </tbody> </table> ] ] <br> ``` r heroes %>% semi_join(publishers, by = "Publisher") ``` <table class="table" style="font-size: 10px; color: black; margin-left: auto; margin-right: auto;"> <thead> <tr> <th style="text-align:left;"> Name </th> <th style="text-align:left;"> Alignment </th> <th style="text-align:left;"> Race </th> <th style="text-align:left;"> Publisher </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> Aquaman </td> <td style="text-align:left;"> good </td> <td style="text-align:left;"> Atlantean </td> <td style="text-align:left;"> DC </td> </tr> <tr> <td style="text-align:left;"> Batgirl </td> <td style="text-align:left;"> good </td> <td style="text-align:left;"> Human </td> <td style="text-align:left;"> DC </td> </tr> <tr> <td style="text-align:left;"> Catwoman </td> <td style="text-align:left;"> good </td> <td style="text-align:left;"> Human </td> <td style="text-align:left;"> DC </td> </tr> <tr> <td style="text-align:left;"> Deadpool </td> <td style="text-align:left;"> neutral </td> <td style="text-align:left;"> Mutant </td> <td style="text-align:left;"> Marvel </td> </tr> <tr> <td style="text-align:left;"> Magneto </td> <td style="text-align:left;"> bad </td> <td style="text-align:left;"> Mutant </td> <td style="text-align:left;"> Marvel </td> </tr> <tr> <td style="text-align:left;"> Poison Ivy </td> <td style="text-align:left;"> bad </td> <td style="text-align:left;"> Human </td> <td style="text-align:left;"> DC </td> </tr> </tbody> </table> --- ## Anti join > All rows from `x` where there are not matching values in `y`, keeping just columns from `x`. <img src="data:image/png;base64,#https://github.com/gadenbuie/tidyexplain/blob/main/images/anti-join.gif?raw=true" width="45%" style="display: block; margin: auto;" /> .small[ Animation by [Garrick Aden-Buie](https://github.com/gadenbuie/tidyexplain) ] --- ## Anti join example .pull-left[ .center[ <table class="table" style="font-size: 10px; color: black; margin-left: auto; margin-right: auto;"> <thead> <tr> <th style="text-align:left;"> Name </th> <th style="text-align:left;"> Alignment </th> <th style="text-align:left;"> Race </th> <th style="text-align:left;"> Publisher </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> Aquaman </td> <td style="text-align:left;"> good </td> <td style="text-align:left;"> Atlantean </td> <td style="text-align:left;"> DC </td> </tr> <tr> <td style="text-align:left;"> Batgirl </td> <td style="text-align:left;"> good </td> <td style="text-align:left;"> Human </td> <td style="text-align:left;"> DC </td> </tr> <tr> <td style="text-align:left;"> Catwoman </td> <td style="text-align:left;"> good </td> <td style="text-align:left;"> Human </td> <td style="text-align:left;"> DC </td> </tr> <tr> <td style="text-align:left;"> Deadpool </td> <td style="text-align:left;"> neutral </td> <td style="text-align:left;"> Mutant </td> <td style="text-align:left;"> Marvel </td> </tr> <tr> <td style="text-align:left;"> Hellboy </td> <td style="text-align:left;"> good </td> <td style="text-align:left;"> Demon </td> <td style="text-align:left;"> Dark Horse </td> </tr> <tr> <td style="text-align:left;"> Magneto </td> <td style="text-align:left;"> bad </td> <td style="text-align:left;"> Mutant </td> <td style="text-align:left;"> Marvel </td> </tr> <tr> <td style="text-align:left;"> Poison Ivy </td> <td style="text-align:left;"> bad </td> <td style="text-align:left;"> Human </td> <td style="text-align:left;"> DC </td> </tr> </tbody> </table> ] ] .pull-right[ .center[ <table class="table" style="font-size: 10px; color: black; margin-left: auto; margin-right: auto;"> <thead> <tr> <th style="text-align:left;"> Publisher </th> <th style="text-align:right;"> Founded </th> <th style="text-align:left;"> Location </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> Marvel </td> <td style="text-align:right;"> 1939 </td> <td style="text-align:left;"> NYC (NY) </td> </tr> <tr> <td style="text-align:left;"> DC </td> <td style="text-align:right;"> 1934 </td> <td style="text-align:left;"> Burbank (CA) </td> </tr> <tr> <td style="text-align:left;"> Image </td> <td style="text-align:right;"> 1992 </td> <td style="text-align:left;"> Berkeley (CA) </td> </tr> </tbody> </table> ] ] <br> ``` r heroes %>% anti_join(publishers, by = "Publisher") ``` <table class="table" style="font-size: 10px; color: black; margin-left: auto; margin-right: auto;"> <thead> <tr> <th style="text-align:left;"> Name </th> <th style="text-align:left;"> Alignment </th> <th style="text-align:left;"> Race </th> <th style="text-align:left;"> Publisher </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> Hellboy </td> <td style="text-align:left;"> good </td> <td style="text-align:left;"> Demon </td> <td style="text-align:left;"> Dark Horse </td> </tr> </tbody> </table> --- ## Expanding the joy of joins Even more join operations (including fuzzy joins) are possible with the [`powerjoin` package](https://github.com/moodymudskipper/powerjoin).