Introduction to Data Science - Workshop Edition

class: center, middle, inverse, title-slide

.title[
# Introduction to Data Science - Workshop Edition
]
.subtitle[
## Cleaning and wrangling data with <code>janitor</code> and <code>forcats</code>
]
.author[
### Luis Fernando Ramirez, Elena Dreyer and Shruti Kakade
]
.institute[
### Hertie School | <a href="https://github.com/intro-to-data-science-23/workshop-presentations">GRAD-C11/E1339</a>
]

---

<style type="text/css">
a, a > code {
  color: rgb(249, 38, 114); /* sets color of links */
  text-decoration: none; /* turns off underlining of links */
}

.title-slide {
  background-color: #23373B;
  border-top: 80px solid #23373B;
}

.title-slide h1  {
  color: #FFFFFF;
  font-size: 40px;
  text-shadow: none;
  font-weight: 400;
  text-align: left;
  margin-left: 15px;
  padding-top: 80px;
}

.title-slide h2  {
  margin-top: -25px;
  padding-bottom: -20px;
  color: #FFFFFF;
  text-shadow: none;
  font-weight: 300;
  font-size: 35px;
  text-align: left;
  margin-left: 15px;
}
.title-slide h3  {
  color: #FFFFFF;
  text-shadow: none;
  font-weight: 300;
  font-size: 25px;
  text-align: left;
  margin-left: 15px;
  margin-bottom: -30px;
}

hr, .title-slide h2::after, .mline h1::after {
  content: '';
  display: block;
  border: none;
  background-color: rgb(249, 38, 114);
  color: rgb(249, 38, 114);
  height: 2px;
}

h1 code, h2 code, h3 code, h4 code, h5 code {
    background-color: #f5f5f5;  /* This is a light gray color. Modify as needed. */
    color: #333;                /* This is almost black. Modify as desired. */
    padding: 2px 5px;           /* Adjusts the padding around the inline code */
}

.pull-left {
  padding-top: 0px;
}

.pull-left-narrow {
  float: left;
  width: 15%;
  padding-bottom: 5px
}

.pull-right-wide {
  float: right;
  width: 85%;
  padding-top: 0;
  padding-bottom: 5px/* Set to 0 or any value you feel looks right */
}

.pull-right-wide-2{
  float: right;
  width: 85%;
  padding-top: 30px;
  padding-bottom: 0px/* Set to 0 or any value you feel looks right */
}

.column {
  float: left;
  width: 20%;
  padding: 3px;
  box-sizing: border-box; /* Include padding in the width calculation */
}

</style>

# Table of Contents

.pull-left[
**janitor**
- Introduction to `janitor`
- Cleaning Names with `clean_names()`
- Tabulating Data with `tabyl()`
- Hands-on with `janitor`: Public Innovation Lab in Mexico
- Additional functionalities
]

.pull-right[
**forcats**
- Introduction to `forcats`
- Overview of Key `forcats` Functions
  - Inspect, Combine, and Modify Factors
  - Change Order, Value, and Add/Drop Levels
- Hands-on with `forcats`: A Small Example using `gss_cat`
]

---
class: inverse, center, middle

# Let's Get Started
---

# Overview

`janitor` is a package built by Sam Firke. Its key characteristics:
1. It offers simple functions for **cleaning** and **examining** data.
2. It is optimized for user-friendliness, with **beginner** and **intermediate** R coders as its main audience.
3. **Advanced** R users can use it to do their data wrangling faster.

> "Data scientists, according to interviews and expert estimates, spend from 50 percent to 80 percent of their time mired in this more mundane labor of collecting and preparing unruly digital data, before it can be explored for useful nuggets." - The New York Times

### Main Use-Cases:
#### Unstructured or raw data:

- perfectly **format** data.frame column names; 
- create and format **frequency tables of one, two, or three variables**; 
- provide other tools for **cleaning and examining data frames**.
---

# Overview

`janitor` is a **tidyverse-oriented package**. It plays nicely with the `%>%` pipe and is optimized for cleaning data brought in with the `readr` and `readxl` packages.

- It provides simple functions that can be used to quickly and effectively clean and manage data

### Let's get started
#### Step 1: Installing and loading the library:

```r
# install.packages("janitor")
library("janitor")
```

---

# First Use-Case: Cleaning Data

### Clean data frame names with `clean_names()`

Datasets do not always come with clear and concise column names. Column names should use only lowercase letters, numbers, and underscores, with underscores separating the words within a name.

- Function to be used: `clean_names()`
- It works in piped data frame workflows.

> "Variable and function names should use only lowercase letters, numbers, and underscores. Use underscores (so called snake_case) to separate words within a name." - [The tidyverse style guide](https://style.tidyverse.org/syntax.html)
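
As a minimal sketch of what this cleaning does to each individual name, `make_clean_names()` from `janitor` applies the same rules to a plain character vector. The names in `messy` are made up for illustration.

```r
library(janitor)

# Hypothetical messy column names, for illustration only
messy <- c("FIRST Name", "scored.goals.season", "CAPS", "CAPS")

# Convert to snake_case and make repeated names unique
cleaned <- make_clean_names(messy)
print(cleaned)
```

Repeated names are made unique by appending a numeric suffix, so the second `CAPS` becomes `caps_2`.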
 
---

# Some Examples

### Spaces, caps, and special characters in column names:

```r
# Create a test data frame whose variable names contain spaces, special characters,
# dots instead of "_", and capital letters.

dirty_data <- as.data.frame(matrix(ncol = 5))
names(dirty_data) <- c("FIRST Name", "Last Name" ,"cl@33#***","CAPS", "scored.goals.season")
clean_data <- dirty_data %>% 
  clean_names()

colnames(clean_data)
```

```
## [1] "first_name"          "last_name"           "cl_33_number"       
## [4] "caps"                "scored_goals_season"
```
---
# `janitor` Demonstration

### Using dataset from a public innovation lab in Mexico City

By using a real-life dataset, we want to demonstrate the perks `janitor` has to offer. First, we need to download [this dataset](https://datos.cdmx.gob.mx/dataset/interrupcion-legal-del-embarazo/resource/755e9b54-401d-40a6-8787-c45a8f97937e) from a Mexican public agency on legal pregnancy interruptions from 2016 to 2018 (updated in 2023).

```r
# loading the dataset
data <- read.csv("ile_2016_2018.csv", stringsAsFactors = FALSE, fileEncoding = "UTF-8")
# Include the path to the file if it is not in your working directory.
# (Column names are in Spanish.)

names(data)[1:5]
```

```
## [1] "anio"          "mes"           "fecha_ingreso" "referida"     
## [5] "estado_civil"
```

```r
data_clean_names <-  clean_names(data)

names(data_clean_names)[1:5]
```

```
## [1] "anio"          "mes"           "fecha_ingreso" "referida"     
## [5] "estado_civil"
```

---

# Second Use Case: Using the `tabyl` function
**This dataset is perfect for us to use the `tabyl` function to display neat tables on variables we want to analyze.**

Why not use `table()`? 
- `table()` is awkward in a `%>%` pipeline: it does not take a data frame plus bare column names. 
- `table()` doesn't give us data frames as outputs, whereas `tabyl` returns a data frame that works with the rest of the tidyverse.
- `table()` reports counts only, whereas `tabyl` adds proportions automatically.
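
The difference in return types can be checked directly. The sketch below uses the built-in `mtcars` data instead of the workshop dataset, so it runs without any download; that choice of data is an assumption made for illustration.

```r
library(janitor)

# Base R: table() returns an object of class "table" holding counts only
base_tab <- table(mtcars$cyl)
class(base_tab)

# janitor: tabyl() returns a data frame with counts and proportions
jan_tab <- tabyl(mtcars, cyl)
class(jan_tab)
names(jan_tab)
```

Because the `tabyl` result is a data frame, it can be passed straight on to functions that expect one.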

### Example 1: One-dimensional table

```r
# a one-dimensional table
tabyl(data_clean_names$anio)
```

```
##  data_clean_names$anio     n      percent
##                   2015     1 1.947268e-05
##                   2016 18038 3.512482e-01
##                   2017 17597 3.426607e-01
##                   2018 15718 3.060716e-01
```
---

# Second Use Case: Using the `tabyl` function

### Example 2: Two-way table

```r
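library(tidyr)  # drop_na() comes from tidyr

# Avoid naming the object `table`, which would mask base R's table()
two_way <- data_clean_names %>% 
  tabyl(anio, estado_civil) %>% 
  drop_na(anio)
print(two_way)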
```

```
##  anio Casada Divorciada Separada Soltera Unión libre Viuda NA_
##  2015      0          0        1       0           0     0   0
##  2016   2006        581      301    9741        5168    58 183
##  2017   1950        420      267    9658        5045    55 202
##  2018   1589        370      204    8880        4540    38  97
```

---

# Second Use Case: Using the `tabyl` function

### Now we add percentages
Adding percentages to this two-way table takes some additional code.

```r
# now with percentages and counts in the same table
table2 <- data_clean_names %>% 
  tabyl(anio, estado_civil) %>% 
  adorn_totals("col") %>%              # add a totals column (sum of each row)
  adorn_percentages("row") %>%         # percentages within each row
  adorn_pct_formatting(digits = 2) %>% # format as %, rounded to 2 digits
  adorn_ns() %>%                       # add the absolute counts
  adorn_title()                        # add the column variable name as a title row

head(table2)
```

```
##         estado_civil                                                        
##  anio         Casada  Divorciada      Separada        Soltera    Unión libre
##  2015  0.00%     (0) 0.00%   (0) 100.00%   (1)  0.00%     (0)  0.00%     (0)
##  2016 11.12% (2,006) 3.22% (581)   1.67% (301) 54.00% (9,741) 28.65% (5,168)
##  2017 11.08% (1,950) 2.39% (420)   1.52% (267) 54.88% (9,658) 28.67% (5,045)
##  2018 10.11% (1,589) 2.35% (370)   1.30% (204) 56.50% (8,880) 28.88% (4,540)
##                                         
##       Viuda         NA_            Total
##  0.00%  (0) 0.00%   (0) 100.00%      (1)
##  0.32% (58) 1.01% (183) 100.00% (18,038)
##  0.31% (55) 1.15% (202) 100.00% (17,597)
##  0.24% (38) 0.62%  (97) 100.00% (15,718)
```
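
The same `adorn_*` chain can be tried without downloading anything. This is a sketch on the built-in `mtcars` data, chosen here only for illustration; `dplyr` is loaded to make sure the `%>%` pipe is available.

```r
library(janitor)
library(dplyr)  # ensures the %>% pipe is available

cyl_by_gear <- mtcars %>%
  tabyl(cyl, gear) %>%                  # counts of cyl (rows) by gear (columns)
  adorn_totals("row") %>%               # add a "Total" row at the bottom
  adorn_percentages("row") %>%          # percentages within each row
  adorn_pct_formatting(digits = 1) %>%  # format as %, rounded to 1 digit
  adorn_ns()                            # append the absolute counts

print(cyl_by_gear)
```

The order matters: totals are added while the table still holds raw counts, and the counts are appended last, after the percentages have been formatted.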

---

# Second Use Case: Using the `tabyl` function
### With three variables
Often we need tables that relate more than two variables at once, and such tables need a cleaner layout to stay readable.

For this, we can create a three-way table with `tabyl` and use the `adorn_*` functions for a prettier design.

#### Three-way `tabyl`

```r
library(knitr)      # kable()
library(kableExtra) # kable_styling()

data_clean_names %>% 
  tabyl(resultado_ile, recibio_consejeria, anio) %>% 
  adorn_percentages("all") %>% 
  adorn_pct_formatting(digits = 1) %>%  
  adorn_title() %>% 
  kable() %>% 
  kable_styling(font_size = 8.5) # reduce the font size so the table fits
```

<table>
<tbody>
  <tr>
   <td>

<table>
 <thead>
  <tr>
   <th style="text-align:left;">  </th>
   <th style="text-align:left;"> recibio_consejeria </th>
   <th style="text-align:left;">  </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:left;"> resultado_ile </td>
   <td style="text-align:left;"> No </td>
   <td style="text-align:left;"> Si </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Completa </td>
   <td style="text-align:left;"> 0.0% </td>
   <td style="text-align:left;"> 0.0% </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Otro </td>
   <td style="text-align:left;"> 0.0% </td>
   <td style="text-align:left;"> 0.0% </td>
  </tr>
  <tr>
   <td style="text-align:left;"> NA </td>
   <td style="text-align:left;"> 0.0% </td>
   <td style="text-align:left;"> 100.0% </td>
  </tr>
</tbody>
</table>

</td>
   <td>

<table>
 <thead>
  <tr>
   <th style="text-align:left;">  </th>
   <th style="text-align:left;"> recibio_consejeria </th>
   <th style="text-align:left;">  </th>
   <th style="text-align:left;">  </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:left;"> resultado_ile </td>
   <td style="text-align:left;"> No </td>
   <td style="text-align:left;"> Si </td>
   <td style="text-align:left;"> NA_ </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Completa </td>
   <td style="text-align:left;"> 0.1% </td>
   <td style="text-align:left;"> 34.1% </td>
   <td style="text-align:left;"> 0.0% </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Otro </td>
   <td style="text-align:left;"> 0.0% </td>
   <td style="text-align:left;"> 2.6% </td>
   <td style="text-align:left;"> 0.0% </td>
  </tr>
  <tr>
   <td style="text-align:left;"> NA </td>
   <td style="text-align:left;"> 1.5% </td>
   <td style="text-align:left;"> 59.0% </td>
   <td style="text-align:left;"> 2.6% </td>
  </tr>
</tbody>
</table>

</td>
   <td>

<table>
 <thead>
  <tr>
   <th style="text-align:left;">  </th>
   <th style="text-align:left;"> recibio_consejeria </th>
   <th style="text-align:left;">  </th>
   <th style="text-align:left;">  </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:left;"> resultado_ile </td>
   <td style="text-align:left;"> No </td>
   <td style="text-align:left;"> Si </td>
   <td style="text-align:left;"> NA_ </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Completa </td>
   <td style="text-align:left;"> 0.0% </td>
   <td style="text-align:left;"> 0.0% </td>
   <td style="text-align:left;"> 0.0% </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Otro </td>
   <td style="text-align:left;"> 0.0% </td>
   <td style="text-align:left;"> 0.0% </td>
   <td style="text-align:left;"> 0.0% </td>
  </tr>
  <tr>
   <td style="text-align:left;"> NA </td>
   <td style="text-align:left;"> 0.8% </td>
   <td style="text-align:left;"> 96.0% </td>
   <td style="text-align:left;"> 3.1% </td>
  </tr>
</tbody>
</table>

</td>
   <td>

<table>
 <thead>
  <tr>
   <th style="text-align:left;">  </th>
   <th style="text-align:left;"> recibio_consejeria </th>
   <th style="text-align:left;">  </th>
   <th style="text-align:left;">  </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:left;"> resultado_ile </td>
   <td style="text-align:left;"> No </td>
   <td style="text-align:left;"> Si </td>
   <td style="text-align:left;"> NA_ </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Completa </td>
   <td style="text-align:left;"> 0.7% </td>
   <td style="text-align:left;"> 99.1% </td>
   <td style="text-align:left;"> 0.1% </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Otro </td>
   <td style="text-align:left;"> 0.0% </td>
   <td style="text-align:left;"> 0.1% </td>
   <td style="text-align:left;"> 0.0% </td>
  </tr>
  <tr>
   <td style="text-align:left;"> NA </td>
   <td style="text-align:left;"> 0.0% </td>
   <td style="text-align:left;"> 0.1% </td>
   <td style="text-align:left;"> 0.0% </td>
  </tr>
</tbody>
</table>

</td>
  </tr>
</tbody>
</table>
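
A three-way `tabyl` is not a single table: it is a list of two-way tables, one per level of the third variable, which is why several tables are printed side by side. A minimal sketch on the built-in `mtcars` data (an assumption made so the example runs without the workshop dataset):

```r
library(janitor)

# cyl by gear, split by the levels of am (0 and 1)
three_way <- tabyl(mtcars, cyl, gear, am)

class(three_way)   # a list, not a single data frame
names(three_way)   # one element per level of `am`
three_way[["1"]]   # the cyl-by-gear table for am == 1
```

Each element can be picked out by name and treated like any other two-way `tabyl`.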
---
# Additional Useful Functions

### Fixing dates
.pull-left[
`janitor` can also help with fixing dates. 
Excel stores dates as numbers, and the `excel_numeric_to_date()` function converts Excel's numeric format into R's date format.

```r
library(janitor)
# the number of days since a specific start date
# (January 1, 1900 or January 1, 1904, depending on the system)
excel_date <- 44197
r_date <- excel_numeric_to_date(excel_date)
print(r_date)
```

```
## [1] "2021-01-01"
```
]
.pull-right[
We also have the `convert_to_date()` function, which converts a mix of date and datetime formats to dates.

```r
excel_dates <- c(44197, 44228)  # Corresponding to 2021-01-01 and 2021-02-01
r_dates <- convert_to_date(excel_dates)
print(r_dates)  # Outputs "2021-01-01" and "2021-02-01"
```

```
## [1] "2021-01-01" "2021-02-01"
```
]
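
To make the "days since a start date" idea concrete, the sketch below reproduces the conversion in base R and compares it with `janitor`. It assumes the usual convention that, for the 1900 date system, day 0 corresponds to 1899-12-30 in R.

```r
library(janitor)

excel_serial <- 44197

# Base R equivalent under the 1900 date system (day 0 = 1899-12-30)
base_date <- as.Date(excel_serial, origin = "1899-12-30")

# janitor with the default ("modern") and the 1904-based date system
modern_date <- excel_numeric_to_date(excel_serial, date_system = "modern")
mac_date <- excel_numeric_to_date(excel_serial, date_system = "mac pre-2011")

print(c(base_date, modern_date, mac_date))
```

The same serial number lands on a later date under the 1904-based system, so it is worth checking which system a spreadsheet was saved with.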
---
# Additional Useful Functions 
### Rounding numbers

The `round_to_fraction()` function rounds decimals to the nearest fraction with a given denominator. 
Note that the base R `round()` function uses "banker's rounding", which means that halves are rounded to the nearest even number. 
Example:

```r
# Round to the nearest quarter (denominator = 4).
round_to_fraction(0.37, 4)  
```

```
## [1] 0.25
```

```r
numbers <- c(3.6, 2.4)
round(numbers)
```

```
## [1] 4 2
```
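
Banker's rounding only shows up for values that end exactly in .5. As a short sketch, `janitor` also provides `round_half_up()`, which always rounds halves away from zero; the input vector below is made up for illustration.

```r
library(janitor)

halves <- c(0.5, 1.5, 2.5)

round(halves)          # base R: halves go to the nearest even number
round_half_up(halves)  # janitor: halves are always rounded up
```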
---
# Additional Useful Functions

### Detecting duplicated records 
The `get_dupes()` function detects duplicated records.  
- It checks for duplicates in the specified columns (or in all columns if none are given).
- The output is a data frame with the duplicated rows and a column named `dupe_count`, which holds the number of times each combination of values appears in the dataset.

.pull-left[

```r
#EXAMPLE: Create a sample dataframe
df <- data.frame(ID = c(1, 2, 3, 4, 2, 5),
                 Name = c("Alice", "Bob", "Charlie", "David", "Bob", "Eve"))
```
]
.pull-right[

```r
# Identify duplicates in the ID column
get_dupes(df, ID)
```

```
##   ID dupe_count Name
## 1  2          2  Bob
## 2  2          2  Bob
```
]
.center[

```r
# Identify duplicates based on combinations of ID and Name
get_dupes(df, ID, Name)
```

```
##   ID Name dupe_count
## 1  2  Bob          2
## 2  2  Bob          2
```
]
---
# Further Useful Functions
### Dealing with NAs

- `remove_empty()`: Remove rows or columns that consist entirely of NAs from a dataframe.
- `dplyr::na_if()`: Convert a specific value to NA (older `janitor` versions offered `convert_to_NA()` for this, which has since been deprecated).

```r
library(dplyr)
#sample data
df <- data.frame(
  Name = c("Alice", "Bob", "Charlie", "David", "Eve", "Frank"),
  Age = c(25, NA, 30, 32, NA, NA),
  Gender = c("Female", "Male", "", "Male", "Female", ""),
  Alt_Gender = c("", "", "Male", "", "", "Male"),
  Comments = c("", "", "", "", "", ""))
#convert empty strings to NA first: remove_empty() only drops all-NA rows/columns
df_na <- df %>% 
  mutate(across(where(is.character), ~ na_if(.x, "")))
#remove empty rows and columns (here: the all-NA Comments column)
df_clean <- df_na %>% 
  remove_empty(c("rows", "cols"))
```

---
class: inverse, center, middle

# Now let's introduce `forcats`

---

# Introduction to Factors

Factors play an essential role when dealing with **categorical data** (variables that have a fixed and known set of possible values) in R.

> "A **factor** is an integer vector with a **levels** attribute that stores a set of mappings between integers and categorical values." - [Factors with forcats, RStudio](https://rstudio.github.io/cheatsheets/html/factors.html)

Key points about factors:
1. **Structured Representation**: Unlike character vectors, factors store an explicit order of levels, so sorting and plotting don't need to follow alphabetical order.
2. **Tidyverse Compatibility**: Factors seamlessly integrate into the tidyverse ecosystem.
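
The definition above can be inspected in base R. This is a small sketch with a made-up `sizes` factor: the integer codes and the `levels` attribute are visible directly, and sorting follows the level order.

```r
# A factor with an explicit, non-alphabetical level order
sizes <- factor(c("medium", "small", "large", "small"),
                levels = c("small", "medium", "large"))

as.integer(sizes)  # the underlying integer codes
levels(sizes)      # the mapping from codes to labels

# Sorting follows the level order, not the alphabet
as.character(sort(sizes))
sort(c("medium", "small", "large", "small"))
```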

### Where could data scientists come across factors?
#### Surveys and Questionnaires:

- Questions with **multiple-choice** answers (e.g., "How satisfied are you with our service?" with responses: "Very Satisfied", "Satisfied", "Neutral", "Dissatisfied", "Very Dissatisfied").
- **Demographic data** like gender (e.g., "Male", "Female", "Other", "Prefer not to say").
---

# Say Hello to `forcats`

`forcats` is a member of the tidyverse family, dedicated to dealing with **categorical variables** in R.

.pull-left-narrow[.center[
  <img src="https://forcats.tidyverse.org/logo.png" style="width:90px;">
]]
.pull-right-wide-2[
**Name Origin**: "for" + "cats" - because it's for categorical variables ;-) 
]
<div style="clear:both;"></div>

**Installation**: To install, simply run:

```r
install.packages("forcats")
# Or
install.packages("tidyverse")
```

**Key Functionalities**:
- Inspecting factor levels
- Changing order and value of levels
- Combining and expanding factors
- Handling missing values and unused levels

---
# Overview of (selected) `forcats` functions

.column[
**Inspect**
<br>
`fct_count()`
`fct_unique()`
]

.column[
**Combine**
<br>
`fct_c(...)`
]

.column[
**Change Order**
<br>
`fct_relevel()`
`fct_infreq()`
`fct_reorder()`
]

.column[
**Change Value(s)**
<br>
`fct_recode()`
`fct_lump_min()`
`fct_other()`
]

.column[
**Add or Drop**
<br>
`fct_drop()`
`fct_na_value_to_level()`
]

---

# Inspect Factors

.pull-left-narrow[
  <img src="img/fct_count.png" style="width:115px;">
]
.pull-right-wide[
`fct_count(f, sort = FALSE, prop = FALSE)`  
Count the number of values with each level.
]
<div style="clear:both;"></div>

**Example**:

```r
f <- factor(c("apple", "banana", "apple", "cherry", "banana", "apple"))
fct_count(f)
```

```
## # A tibble: 3 × 2
##   f          n
##   <fct>  <int>
## 1 apple      3
## 2 banana     2
## 3 cherry     1
```

---

# Inspect Factors (cont.)

**Example**:

```r
fruits <- factor(c("apple", "banana", "apple", "cherry", "banana", "apple"))
fct_count(fruits)
```

.pull-left-narrow[
<img src="img/fct_unique.png" style="width:125px;">
]
.pull-right-wide[
`fct_unique(f)` 
<br>
Return the unique values, removing duplicates.
]
<div style="clear:both;"></div>

**Example**:

```r
fct_unique(fruits)
```

```
## [1] apple  banana cherry
## Levels: apple banana cherry
```
---
# Combine Factors

.pull-left-narrow[
  <img src="img/fct_c.png" style="width:125px;">
]
.pull-right-wide[
`fct_c(...)`  
Combine factors with different levels.
]
<div style="clear:both;"></div>

**Example**:

```r
f1 <- factor(c("a", "c"))
f2 <- factor(c("b", "a"))
fct_c(f1, f2)
```

```
## [1] a c b a
## Levels: a c b
```
---
# Combine Factors (cont.)

.pull-left-narrow[
  <img src="img/fct_c.png" style="width:125px;">
]
.pull-right-wide[
`fct_c(...)`  
Combine factors with different levels.
]
<div style="clear:both;"></div>

**Example**:

```r
fct_c(f1, f2)
```

.pull-left-narrow[
<img src="img/fct_unify.png" style="width:120px;">
]
.pull-right-wide[
`fct_unify(fs, levels = lvls_union(fs))`
<br>
Standardize levels across a list of factors.
]

<div style="clear:both;"></div>
**Example**:

```r
fct_unify(list(f2, f1))
```

```
## [[1]]
## [1] b a
## Levels: a b c
## 
## [[2]]
## [1] a c
## Levels: a b c
```
---

# Change the Order of Levels

.pull-left-narrow[
  <img src="img/fct_infreq.png" style="width:120px;">
]
.pull-right-wide[
`fct_infreq(f, ordered = NA)` 
<br>
Reorder levels by the frequency in which they appear in the data (highest frequency first).
]
<div style="clear:both;"></div>

**Example**:

```r
fruits <- factor(c("apple", "banana", "apple", "cherry", "banana", "apple"))
fct_infreq(fruits)
```

```
## [1] apple  banana apple  cherry banana apple 
## Levels: apple banana cherry
```

---

# Change the Order of Levels (cont.)

**Example**:

```r
fruits <- factor(c("apple", "banana", "apple", "cherry", "banana", "apple"))
fct_infreq(fruits)
```

.pull-left-narrow[
<img src="img/fct_inorder.png" style="width:120px;">
]
.pull-right-wide[
`fct_inorder(f, ordered = NA)`
<br>
Reorder levels by the order in which they appear in the data.
]
<div style="clear:both;"></div>

**Example**:

```r
fct_inorder(fruits)
```

```
## [1] apple  banana apple  cherry banana apple 
## Levels: apple banana cherry
```
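
The overview also lists `fct_relevel()` and `fct_reorder()`. As a brief sketch of how they differ: the first moves levels by hand, the second orders them by a summary of another variable (the median by default). The `price` vector is made up for illustration.

```r
library(forcats)

fruits <- factor(c("apple", "banana", "apple", "cherry", "banana", "apple"))

# fct_relevel(): move "cherry" to the front by hand
by_hand <- fct_relevel(fruits, "cherry")
levels(by_hand)

# fct_reorder(): order levels by the median of another variable
# `price` is a hypothetical numeric vector, one value per observation
price <- c(3, 1, 3, 2, 1, 3)
by_price <- fct_reorder(fruits, price)
levels(by_price)
```

In both cases only the order of the levels changes; the values of the factor stay the same.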
---

# Change the Value of Levels

.pull-left-narrow[
  <img src="img/fct_recode.png" style="width:120px;">
]
.pull-right-wide[
`fct_recode(f, ...)`  
Manually change levels.
]
<div style="clear:both;"></div>

**Example**:

```r
fruits <- factor(c("apple", "banana", "apple", "cherry", "banana", "apple"))
fct_recode(fruits, cherry = "apple")
```

```
## [1] cherry banana cherry cherry banana cherry
## Levels: cherry banana
```
---

# Change the Value of Levels (cont.)

.pull-left-narrow[
  <img src="img/fct_recode.png" style="width:120px;">
]
.pull-right-wide[
`fct_recode(f, ...)`  
Manually change levels.
]
<div style="clear:both;"></div>

**Example**:

```r
fruits <- factor(c("apple", "banana", "apple", "cherry", "banana", "apple"))
fct_recode(fruits, cherry = "apple")
```

.pull-left-narrow[
  <img src="img/fct_other.png" style="width:120px;">
]
.pull-right-wide[
`fct_other(f, keep, drop, other_level = "Other")`  
Replace all levels not in `keep` with the value of `other_level`.
]
<div style="clear:both;"></div>

**Example**:

```r
fct_other(fruits, keep = c("banana", "cherry"))
```

```
## [1] Other  banana Other  cherry banana Other 
## Levels: banana cherry Other
```
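
Where `fct_other()` needs the levels spelled out, `fct_lump_min()` from the overview picks them by frequency. A minimal sketch, assuming a threshold of two observations:

```r
library(forcats)

fruits <- factor(c("apple", "banana", "apple", "cherry", "banana", "apple"))

# Lump every level that appears fewer than 2 times into "Other"
lumped <- fct_lump_min(fruits, min = 2)
lumped
fct_count(lumped)
```

Here only `cherry` appears fewer than two times, so it is the only level replaced by "Other".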
---
# Add or Drop Levels

.pull-left-narrow[
  <img src="img/fct_drop.png" style="width:120px;">
]
.pull-right-wide[
`fct_drop(f)`  
Drop unused levels from a factor.
]
<div style="clear:both;"></div>

**Example**:

```r
vehicles <- factor(c("car", "bike", "bus"), 
                   levels = c("car", "bike", "bus", "train"))
# "train" is an unused level
fct_drop(vehicles)
```

```
## [1] car  bike bus 
## Levels: car bike bus
```
---
# Add or Drop Levels (cont.)

.pull-left-narrow[
  <img src="img/fct_drop.png" style="width:120px;">
]
.pull-right-wide[
`fct_drop(f)`  
Drop unused levels from a factor.
]
<div style="clear:both;"></div>

**Example**:

```r
fct_drop(vehicles)
```

.pull-left-narrow[
<img src="img/fct_explicit_na.png" style="width:120px;">
]
.pull-right-wide[
`fct_na_value_to_level(f, level = "(Missing)")`
<br>
Convert NA to a specified level in a factor.
]

<div style="clear:both;"></div>
**Example**:

```r
vehicles_with_na <- factor(c("car", "bike", NA, "bus", NA, "car"), 
                           levels = c("car", "bike", "bus", "train"))
# Convert NA values to "train"
fct_na_value_to_level(vehicles_with_na, level = "train")
```

```
## [1] car   bike  train bus   train car  
## Levels: car bike bus train
```

---
# Application

### Aggregating factor levels in `gss_cat` data

The `gss_cat` dataset provides a sample from the General Social Survey, a US survey by the NORC at the University of Chicago.

```r
library(tidyverse)
head(gss_cat)
```

```
# # A tibble: 6 × 9
#    year marital         age race  rincome        partyid     relig denom tvhours
#   <int> <fct>         <int> <fct> <fct>          <fct>       <fct> <fct>   <int>
# 1  2000 Never married    26 White $8000 to 9999  Ind,near r… Prot… Sout…      12
# 2  2000 Divorced         48 White $8000 to 9999  Not str re… Prot… Bapt…      NA
# 3  2000 Widowed          67 White Not applicable Independent Prot… No d…       2
# 4  2000 Never married    39 White Not applicable Ind,near r… Orth… Not …       4
# 5  2000 Divorced         25 White Not applicable Not str de… None  Not …       1
# 6  2000 Married          25 White $20000 - 24999 Strong dem… Prot… Sout…      NA
```
---
# Application (cont.)

### Aggregating factor levels in `gss_cat` data

The `gss_cat` dataset provides a sample from the General Social Survey, a US survey by the NORC at the University of Chicago.

```r
relig_counts <- fct_count(gss_cat$relig)
head(relig_counts, 10)
```

```
# # A tibble: 10 × 2
#    f                           n
#    <fct>                   <int>
#  1 No answer                  93
#  2 Don't know                 15
#  3 Inter-nondenominational   109
#  4 Native american            23
#  5 Christian                 689
#  6 Orthodox-christian         95
#  7 Moslem/islam              104
#  8 Other eastern              32
#  9 Hinduism                   71
# 10 Buddhism                  147
```
---

# Application (cont.)

### Aggregating factor levels in `gss_cat` data

The `gss_cat` dataset provides a sample from the General Social Survey, a US survey by the NORC at the University of Chicago.

To simplify plots or tables, we often lump together the small groups of a factor. The `fct_lump_n()` function from the `forcats` package serves this purpose by grouping the smallest categories into "Other".

```r
library(forcats)
library(dplyr)

gss_cat %>%
  mutate(relig = fct_lump_n(relig, n = 4)) %>%
  count(relig) %>%
  filter(!relig == "None") %>%
  arrange(desc(n))
```

```
## # A tibble: 4 × 2
##   relig          n
##   <fct>      <int>
## 1 Protestant 10846
## 2 Catholic    5124
## 3 Other       1301
## 4 Christian    689
```
In this code:

- `fct_lump_n(relig, n = 4)` will keep the top 4 most frequent religions and lump the rest into the "Other" category.
- `filter(!relig == "None")` will remove the "None" category from the output.
- `arrange(desc(n))` sorts the rows by count in descending order.
---

# Application (cont.)

### Aggregating factor levels in `gss_cat` data

The `gss_cat` dataset provides a sample from the General Social Survey, a US survey by the NORC at the University of Chicago.

```r
library(forcats)
library(dplyr)

gss_cat %>%
  mutate(relig = fct_lump_n(relig, n = 4)) %>%
  count(relig) %>%
  filter(!relig == "None") %>%
  arrange(desc(n))
```
In this code:

- `fct_lump_n(relig, n = 4)` will keep the top 4 most frequent religions and lump the rest into the "Other" category.
- `filter(!relig == "None")` will remove the "None" category from the output.
- `arrange(desc(n))` sorts the rows by count in descending order.
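
Sorting the rows changes the printed table but not the factor itself. When the lumped factor is meant for a plot, one option is to reorder its levels instead; the sketch below combines `fct_lump_n()` with `fct_infreq()`, which is one possible approach rather than the only one.

```r
library(forcats)

# Lump to the 4 most frequent religions, then order levels by frequency
relig_top <- fct_infreq(fct_lump_n(gss_cat$relig, n = 4))

nlevels(gss_cat$relig)  # number of levels before lumping
levels(relig_top)       # four kept levels plus "Other", most frequent first
```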
---
class: inverse, center, middle

# Thank you for watching!

---
class: inverse, center, middle

# Ready to become an expert in `janitor` and `forcats`?

Scan the QR code below to join the quiz:

Or go to **menti.com** and enter the code **49 85 53 3**

---

# Resources

.pull-left[
#### `janitor`
- [Package janitor documentation](https://cran.r-project.org/web/packages/janitor/janitor.pdf)
- [GitHub Repository](https://github.com/sfirke/janitor)
   

**Additional Resources**

- [Cleaning and Exploring Data with the “janitor”](https://towardsdatascience.com/cleaning-and-exploring-data-with-the-janitor-package-ee4a3edf085e) 
- [RDocumentation - Janitor](https://www.rdocumentation.org/packages/janitor/versions/2.2.0r)
- [Tabyl: frequency tables for R users](https://towardsdatascience.com/tabyl-a-frequency-table-for-the-modern-r-user-e061cd48baef)
]

.pull-right[
#### `forcats`
- [Official R Documentation](https://forcats.tidyverse.org/) (including all sublinks)
- [Introduction to forcats (package vignette)](https://forcats.tidyverse.org/articles/forcats.html)
- Wickham, H., & Grolemund, G. (2017). Factors. In R for Data Science (1st ed.). O’Reilly Media, Inc. https://r4ds.had.co.nz/factors.html
- Wickham, H., & Grolemund, G. (2023). Factors. In R for Data Science (2nd ed.). O’Reilly Media, Inc. https://r4ds.hadley.nz/factors.html
- McNamara A, Horton NJ. 2017. Wrangling categorical data in R. PeerJ Preprints 5:e3163v2 https://doi.org/10.7287/peerj.preprints.3163v2

**Additional Resources**
<br>
[`forcats` cheatsheet](https://rstudio.github.io/cheatsheets/factors.pdf) *Credit for the nice images!*
]