Introduction to tidymodels
2026
1. Why: Understand why using tidymodels may be useful
2. Preprocessing: Learn the basics of preprocessing data using tidymodels
What is tidymodels? It is a unified, tidy framework for building, tuning, and evaluating statistical and machine-learning models in R
What problem does it solve?
data \(\Rightarrow\) preprocessing \(\Rightarrow\) model \(\Rightarrow\) tuning \(\Rightarrow\) validation \(\Rightarrow\) evaluation
To better understand it, let’s see it in action
Let’s begin by loading in some packages
For these examples we will use data from the modeldata package.
The dataset is (amazingly) named credit_data
Remember you should always familiarize yourself with the data you are working with
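A sketch of how this first look might be produced (the original chunk is not shown; the package loads below are assumptions, though glimpse() clearly generated the output that follows):

```r
# Load the modeling framework and the package containing the dataset
library(tidymodels)
library(modeldata)

# Load credit_data and get a column-by-column overview
data("credit_data")
glimpse(credit_data)
```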
Rows: 4,454
Columns: 14
$ Status <fct> good, good, bad, good, good, good, good, good, good, bad, go…
$ Seniority <int> 9, 17, 10, 0, 0, 1, 29, 9, 0, 0, 6, 7, 8, 19, 0, 0, 15, 33, …
$ Home <fct> rent, rent, owner, rent, rent, owner, owner, parents, owner,…
$ Time <int> 60, 60, 36, 60, 36, 60, 60, 12, 60, 48, 48, 36, 60, 36, 18, …
$ Age <int> 30, 58, 46, 24, 26, 36, 44, 27, 32, 41, 34, 29, 30, 37, 21, …
$ Marital <fct> married, widow, married, single, single, married, married, s…
$ Records <fct> no, no, yes, no, no, no, no, no, no, no, no, no, no, no, yes…
$ Job <fct> freelance, fixed, freelance, fixed, fixed, fixed, fixed, fix…
$ Expenses <int> 73, 48, 90, 63, 46, 75, 75, 35, 90, 90, 60, 60, 75, 75, 35, …
$ Income <int> 129, 131, 200, 182, 107, 214, 125, 80, 107, 80, 125, 121, 19…
$ Assets <int> 0, 0, 3000, 2500, 0, 3500, 10000, 0, 15000, 0, 4000, 3000, 5…
$ Debt <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2500, 260, 0, 0, 0, 2000…
$ Amount <int> 800, 1000, 2000, 900, 310, 650, 1600, 200, 1200, 1200, 1150,…
$ Price <int> 846, 1658, 2985, 1325, 910, 1645, 1800, 1093, 1957, 1468, 15…
Ugly data should be cleaned:
And we can take a deeper peek into the data (we are looking for NAs and distributions)
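One hedged sketch of this step: the lower-case names in the summary below suggest the column names were cleaned (e.g. with janitor::clean_names()) before skimming with skimr. The name credit_df is an assumption, carried through the rest of the examples.

```r
library(modeldata)  # credit_data
library(dplyr)      # the pipe and verbs
library(janitor)    # assumption: clean_names() for the lower-case names
library(skimr)      # assumption: skim() produced the summary below

data("credit_data")

# Clean the capitalized column names, then look for NAs and odd distributions
credit_df <- credit_data %>% clean_names()
credit_df %>% skim()
```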
| Name | Piped data |
| Number of rows | 4454 |
| Number of columns | 14 |
| _______________________ | |
| Column type frequency: | |
| factor | 5 |
| numeric | 9 |
| ________________________ | |
| Group variables | None |
Variable type: factor
| skim_variable | n_missing | complete_rate | ordered | n_unique | top_counts |
|---|---|---|---|---|---|
| status | 0 | 1 | FALSE | 2 | goo: 3200, bad: 1254 |
| home | 6 | 1 | FALSE | 6 | own: 2107, ren: 973, par: 783, oth: 319 |
| marital | 1 | 1 | FALSE | 5 | mar: 3241, sin: 977, sep: 130, wid: 67 |
| records | 0 | 1 | FALSE | 2 | no: 3681, yes: 773 |
| job | 2 | 1 | FALSE | 4 | fix: 2805, fre: 1024, par: 452, oth: 171 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| seniority | 0 | 1.00 | 7.99 | 8.17 | 0 | 2.00 | 5 | 12.0 | 48 | ▇▃▁▁▁ |
| time | 0 | 1.00 | 46.44 | 14.66 | 6 | 36.00 | 48 | 60.0 | 72 | ▁▂▅▃▇ |
| age | 0 | 1.00 | 37.08 | 10.98 | 18 | 28.00 | 36 | 45.0 | 68 | ▆▇▆▃▁ |
| expenses | 0 | 1.00 | 55.57 | 19.52 | 35 | 35.00 | 51 | 72.0 | 180 | ▇▃▁▁▁ |
| income | 381 | 0.91 | 141.69 | 80.75 | 6 | 90.00 | 125 | 170.0 | 959 | ▇▂▁▁▁ |
| assets | 47 | 0.99 | 5403.98 | 11574.42 | 0 | 0.00 | 3000 | 6000.0 | 300000 | ▇▁▁▁▁ |
| debt | 18 | 1.00 | 343.03 | 1245.99 | 0 | 0.00 | 0 | 0.0 | 30000 | ▇▁▁▁▁ |
| amount | 0 | 1.00 | 1038.92 | 474.55 | 100 | 700.00 | 1000 | 1300.0 | 5000 | ▇▆▁▁▁ |
| price | 0 | 1.00 | 1462.78 | 628.13 | 105 | 1117.25 | 1400 | 1691.5 | 11140 | ▇▁▁▁▁ |
Suppose we want to create an indicator for individuals whose credit status is "good". Easy enough; we can use the tidyverse to do that.
Solution.
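A sketch of one tidyverse solution (assuming the cleaned data frame is called credit_df, with a lower-case status column as in the summary above):

```r
# Assumes credit_df from above (dplyr loaded via tidymodels)
credit_df <- credit_df %>%
  mutate(status_good = if_else(status == "good", 1, 0)) %>%  # 1 if "good", else 0
  select(-status)                                            # drop the original factor
```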
That’s great! We have an outcome that we want to test our model on
But we still have some missing data that could be of use
Here’s where tidymodels can come in handy in the preprocessing part
How can tidymodels help us? Imagine we want to clean each variable: our tidyverse approach would take a lot of typing/copying/pasting, which can become difficult to track.
For example, you might want to:
tidymodels is a better way! We will use recipes and parsnip later on to help our modeling
Using tidymodels to Preprocess Data: Recipes
The tidymodels workflow begins with defining a recipe
This tells R which variable is your outcome and which variable(s) are your explanatory ones
Think of it as defining the formula and dataset in lm(y ~ x, data = df) without running the regression
We will try to predict which individuals have status_good = 1, where status_good is our outcome
Two important things before we move on:
1. No functions in your recipe: We are only defining the roles of the data. We will be able to make transformations later
2. All other variables: You can use . on the right-hand side of the ~ to say “the rest of the variables”. Saves you some time from writing everything down
Here is our first recipe. It’ll help us understand further.
# Define the recipe: status_good predicted by all other variables in the dataframe
recipe_all <- recipe(status_good ~ ., data = credit_df)
# What does recipe do?
recipe_all
── Recipe ──────────────────────────────────────────────────────────────────────
── Inputs
Number of variables by role
outcome: 1
predictor: 13
Tip
Note: If you have an “ID” variable, you can give it an “ID” role.
recipe(...) %>% update_role(id_var, new_role = "ID")
R now “knows” the roles each variable will play in our model.
Now we can start cleaning/transforming our dataset.
We can apply whatever steps we want to clean/pre-process/manipulate our data. The tidymodels universe has many possible steps for:
Imputation: step_impute_mean(), step_impute_mode(), step_impute_knn(), step_impute_bag()

Transformation: step_log(), step_poly(), step_mutate(), step_interact()

Dummies & Discretization: step_dummy(), step_num2factor(), step_discretize(), step_cut()

Many More: step_center(), step_normalize(), step_date(), step_lag()

To apply any of these, you also need to tell the step_*() function which variables to target. Useful selectors: all_vars(); all_predictors(), all_outcomes(), has_role(); all_nominal(), all_numeric(); starts_with(), contains(), etc.

Try writing a simple mean imputation for all variables that are
(1) predictors AND (2) numeric
Solution.
# Mean imputation for all numeric predictors
recipe_all %>% step_impute_mean(all_predictors() & all_numeric())
── Recipe ──────────────────────────────────────────────────────────────────────
── Inputs
Number of variables by role
outcome: 1
predictor: 13
── Operations
• Mean imputation for: all_predictors() & all_numeric()
Notice this is not a processed dataframe.
To get a processed dataframe, we need to add two more things to our recipe
prep(): estimates the parameters defined by the recipe's preprocessing steps
juice(): applies the preprocessing to the training dataset (we will talk about CV in tidymodels soon)
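As a sketch, the trained recipe printed below can be produced by piping the recipe (with its imputation step) into prep():

```r
# Estimate the imputation parameters (here, the column means) from the data
recipe_all %>%
  step_impute_mean(all_predictors() & all_numeric()) %>%
  prep()
```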
── Recipe ──────────────────────────────────────────────────────────────────────
── Inputs
Number of variables by role
outcome: 1
predictor: 13
── Training information
Training data contained 4454 data points and 415 incomplete rows.
── Operations
• Mean imputation for: seniority, time, age, expenses, income, ... | Trained
The output is still a recipe, but now it has trained the preprocessing operations on the training data and knows the number of observations and number of incomplete rows
Now we add (a dash of) juice() to get the preprocessed (cleaned) dataset
# Add juice()
credit_clean <- recipe_all %>%
step_impute_mean(all_predictors() & all_numeric()) %>%
prep() %>%
juice()
# Skim it
credit_clean %>% skim()

| Name | Piped data |
| Number of rows | 4454 |
| Number of columns | 14 |
| _______________________ | |
| Column type frequency: | |
| factor | 4 |
| numeric | 10 |
| ________________________ | |
| Group variables | None |
Variable type: factor
| skim_variable | n_missing | complete_rate | ordered | n_unique | top_counts |
|---|---|---|---|---|---|
| home | 6 | 1 | FALSE | 6 | own: 2107, ren: 973, par: 783, oth: 319 |
| marital | 1 | 1 | FALSE | 5 | mar: 3241, sin: 977, sep: 130, wid: 67 |
| records | 0 | 1 | FALSE | 2 | no: 3681, yes: 773 |
| job | 2 | 1 | FALSE | 4 | fix: 2805, fre: 1024, par: 452, oth: 171 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| seniority | 0 | 1 | 7.99 | 8.17 | 0 | 2.00 | 5 | 12.0 | 48 | ▇▃▁▁▁ |
| time | 0 | 1 | 46.44 | 14.66 | 6 | 36.00 | 48 | 60.0 | 72 | ▁▂▅▃▇ |
| age | 0 | 1 | 37.08 | 10.98 | 18 | 28.00 | 36 | 45.0 | 68 | ▆▇▆▃▁ |
| expenses | 0 | 1 | 55.57 | 19.52 | 35 | 35.00 | 51 | 72.0 | 180 | ▇▃▁▁▁ |
| income | 0 | 1 | 141.71 | 77.22 | 6 | 93.00 | 130 | 164.0 | 959 | ▇▂▁▁▁ |
| assets | 0 | 1 | 5403.98 | 11513.17 | 0 | 0.00 | 3500 | 6000.0 | 300000 | ▇▁▁▁▁ |
| debt | 0 | 1 | 343.03 | 1243.47 | 0 | 0.00 | 0 | 0.0 | 30000 | ▇▁▁▁▁ |
| amount | 0 | 1 | 1038.92 | 474.55 | 100 | 700.00 | 1000 | 1300.0 | 5000 | ▇▆▁▁▁ |
| price | 0 | 1 | 1462.78 | 628.13 | 105 | 1117.25 | 1400 | 1691.5 | 11140 | ▇▁▁▁▁ |
| status_good | 0 | 1 | 0.72 | 0.45 | 0 | 0.00 | 1 | 1.0 | 1 | ▃▁▁▁▇ |
No more missing values for our numeric predictors
However, we still have some missing values in our categorical predictors
We just gotta fix those too!
Let’s define our preprocessing as:
1. Impute the numeric predictors using the means
2. Impute the categorical predictors using knn (k = 5)
3. Create dummy variables for all categorical predictors
   (Note: step_dummy() drops the original column)
4. Interact the resulting home_ dummies with income
# Starting from scratch
recipe_all <- recipe(status_good ~ ., data = credit_df)
# Putting it all together
credit_clean <- recipe_all %>%
# 1. Impute the numerical predictors using the means
step_impute_mean(all_predictors() & all_numeric()) %>%
# 2. Impute the categorical variables using KNN (k = 5)
step_impute_knn(all_predictors() & all_nominal(), neighbors = 5) %>%
# 3. Create dummies for categorical variables
step_dummy(all_predictors() & all_nominal()) %>%
# 4. Interact the created dummies with `income`
step_interact(~income:starts_with("home")) %>%
# Prep
prep() %>%
# And Juice
juice()
credit_clean %>% skim()

| Name | Piped data |
| Number of rows | 4454 |
| Number of columns | 28 |
| _______________________ | |
| Column type frequency: | |
| numeric | 28 |
| ________________________ | |
| Group variables | None |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| seniority | 0 | 1 | 7.99 | 8.17 | 0 | 2.00 | 5 | 12.0 | 48 | ▇▃▁▁▁ |
| time | 0 | 1 | 46.44 | 14.66 | 6 | 36.00 | 48 | 60.0 | 72 | ▁▂▅▃▇ |
| age | 0 | 1 | 37.08 | 10.98 | 18 | 28.00 | 36 | 45.0 | 68 | ▆▇▆▃▁ |
| expenses | 0 | 1 | 55.57 | 19.52 | 35 | 35.00 | 51 | 72.0 | 180 | ▇▃▁▁▁ |
| income | 0 | 1 | 141.71 | 77.22 | 6 | 93.00 | 130 | 164.0 | 959 | ▇▂▁▁▁ |
| assets | 0 | 1 | 5403.98 | 11513.17 | 0 | 0.00 | 3500 | 6000.0 | 300000 | ▇▁▁▁▁ |
| debt | 0 | 1 | 343.03 | 1243.47 | 0 | 0.00 | 0 | 0.0 | 30000 | ▇▁▁▁▁ |
| amount | 0 | 1 | 1038.92 | 474.55 | 100 | 700.00 | 1000 | 1300.0 | 5000 | ▇▆▁▁▁ |
| price | 0 | 1 | 1462.78 | 628.13 | 105 | 1117.25 | 1400 | 1691.5 | 11140 | ▇▁▁▁▁ |
| status_good | 0 | 1 | 0.72 | 0.45 | 0 | 0.00 | 1 | 1.0 | 1 | ▃▁▁▁▇ |
| home_other | 0 | 1 | 0.07 | 0.26 | 0 | 0.00 | 0 | 0.0 | 1 | ▇▁▁▁▁ |
| home_owner | 0 | 1 | 0.47 | 0.50 | 0 | 0.00 | 0 | 1.0 | 1 | ▇▁▁▁▇ |
| home_parents | 0 | 1 | 0.18 | 0.38 | 0 | 0.00 | 0 | 0.0 | 1 | ▇▁▁▁▂ |
| home_priv | 0 | 1 | 0.06 | 0.23 | 0 | 0.00 | 0 | 0.0 | 1 | ▇▁▁▁▁ |
| home_rent | 0 | 1 | 0.22 | 0.41 | 0 | 0.00 | 0 | 0.0 | 1 | ▇▁▁▁▂ |
| marital_married | 0 | 1 | 0.73 | 0.45 | 0 | 0.00 | 1 | 1.0 | 1 | ▃▁▁▁▇ |
| marital_separated | 0 | 1 | 0.03 | 0.17 | 0 | 0.00 | 0 | 0.0 | 1 | ▇▁▁▁▁ |
| marital_single | 0 | 1 | 0.22 | 0.41 | 0 | 0.00 | 0 | 0.0 | 1 | ▇▁▁▁▂ |
| marital_widow | 0 | 1 | 0.02 | 0.12 | 0 | 0.00 | 0 | 0.0 | 1 | ▇▁▁▁▁ |
| records_yes | 0 | 1 | 0.17 | 0.38 | 0 | 0.00 | 0 | 0.0 | 1 | ▇▁▁▁▂ |
| job_freelance | 0 | 1 | 0.23 | 0.42 | 0 | 0.00 | 0 | 0.0 | 1 | ▇▁▁▁▂ |
| job_others | 0 | 1 | 0.04 | 0.19 | 0 | 0.00 | 0 | 0.0 | 1 | ▇▁▁▁▁ |
| job_partime | 0 | 1 | 0.10 | 0.30 | 0 | 0.00 | 0 | 0.0 | 1 | ▇▁▁▁▁ |
| income_x_home_other | 0 | 1 | 9.42 | 38.04 | 0 | 0.00 | 0 | 0.0 | 700 | ▇▁▁▁▁ |
| income_x_home_owner | 0 | 1 | 71.83 | 94.94 | 0 | 0.00 | 0 | 135.0 | 905 | ▇▁▁▁▁ |
| income_x_home_parents | 0 | 1 | 21.49 | 55.43 | 0 | 0.00 | 0 | 0.0 | 857 | ▇▁▁▁▁ |
| income_x_home_priv | 0 | 1 | 7.59 | 35.53 | 0 | 0.00 | 0 | 0.0 | 959 | ▇▁▁▁▁ |
| income_x_home_rent | 0 | 1 | 30.77 | 66.47 | 0 | 0.00 | 0 | 0.0 | 535 | ▇▁▁▁▁ |
If we did it correctly, after imputing missing values from the training data, we should have:
No more missing values
Created dummies
Created interactions between home and income
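A quick sanity check on the result (a sketch; assumes credit_clean from the chunk above):

```r
# Should be 0 missing cells after imputation
sum(is.na(credit_clean))

# The dummies and interactions should appear as numeric columns
credit_clean %>%
  select(starts_with("home_"), starts_with("income_x_home")) %>%
  names()
```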
Next, we will explore modeling with tidymodels.
EC 524 Week 02 Lab | Intro. to tidymodels