EC524: Lab 03

Introduction to tidymodels

Jose Rojas-Fallas

2026

Lab Agenda


1. Why: Understand why tidymodels may be useful


2. Preprocessing: Learn the basics of preprocessing data using tidymodels


Why tidymodels?

It is a unified, tidy framework for building, tuning, and evaluating statistical and machine-learning models in R


What problem does it solve?

  • It standardizes the modeling workflow

data \(\Rightarrow\) preprocessing \(\Rightarrow\) model \(\Rightarrow\) tuning \(\Rightarrow\) validation \(\Rightarrow\) evaluation


To better understand it, let’s see it in action

Setup

Let’s begin by loading in some packages

library(pacman)
p_load(tidyverse, modeldata, skimr, janitor, kknn, magrittr, tidymodels)


For these examples we will use data from the modeldata package.

The dataset is (amazingly) named credit_data

# Load credit dataset
data(credit_data)
# (Optional) Assign it a new name
credit_df <- credit_data

Remember you should always familiarize yourself with the data you are working with

# Glimpse at it
credit_df %>% glimpse()
Rows: 4,454
Columns: 14
$ Status    <fct> good, good, bad, good, good, good, good, good, good, bad, go…
$ Seniority <int> 9, 17, 10, 0, 0, 1, 29, 9, 0, 0, 6, 7, 8, 19, 0, 0, 15, 33, …
$ Home      <fct> rent, rent, owner, rent, rent, owner, owner, parents, owner,…
$ Time      <int> 60, 60, 36, 60, 36, 60, 60, 12, 60, 48, 48, 36, 60, 36, 18, …
$ Age       <int> 30, 58, 46, 24, 26, 36, 44, 27, 32, 41, 34, 29, 30, 37, 21, …
$ Marital   <fct> married, widow, married, single, single, married, married, s…
$ Records   <fct> no, no, yes, no, no, no, no, no, no, no, no, no, no, no, yes…
$ Job       <fct> freelance, fixed, freelance, fixed, fixed, fixed, fixed, fix…
$ Expenses  <int> 73, 48, 90, 63, 46, 75, 75, 35, 90, 90, 60, 60, 75, 75, 35, …
$ Income    <int> 129, 131, 200, 182, 107, 214, 125, 80, 107, 80, 125, 121, 19…
$ Assets    <int> 0, 0, 3000, 2500, 0, 3500, 10000, 0, 15000, 0, 4000, 3000, 5…
$ Debt      <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2500, 260, 0, 0, 0, 2000…
$ Amount    <int> 800, 1000, 2000, 900, 310, 650, 1600, 200, 1200, 1200, 1150,…
$ Price     <int> 846, 1658, 2985, 1325, 910, 1645, 1800, 1093, 1957, 1468, 15…

Look at Your Data

Ugly data should be cleaned:

# "Fix" variable names
credit_df %<>% clean_names()

And we can take a deeper peek at the data (we are looking for NAs and distributions)

credit_df %>% skim()
Data summary
Name Piped data
Number of rows 4454
Number of columns 14
_______________________
Column type frequency:
factor 5
numeric 9
________________________
Group variables None

Variable type: factor

skim_variable n_missing complete_rate ordered n_unique top_counts
status 0 1 FALSE 2 goo: 3200, bad: 1254
home 6 1 FALSE 6 own: 2107, ren: 973, par: 783, oth: 319
marital 1 1 FALSE 5 mar: 3241, sin: 977, sep: 130, wid: 67
records 0 1 FALSE 2 no: 3681, yes: 773
job 2 1 FALSE 4 fix: 2805, fre: 1024, par: 452, oth: 171

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
seniority 0 1.00 7.99 8.17 0 2.00 5 12.0 48 ▇▃▁▁▁
time 0 1.00 46.44 14.66 6 36.00 48 60.0 72 ▁▂▅▃▇
age 0 1.00 37.08 10.98 18 28.00 36 45.0 68 ▆▇▆▃▁
expenses 0 1.00 55.57 19.52 35 35.00 51 72.0 180 ▇▃▁▁▁
income 381 0.91 141.69 80.75 6 90.00 125 170.0 959 ▇▂▁▁▁
assets 47 0.99 5403.98 11574.42 0 0.00 3000 6000.0 300000 ▇▁▁▁▁
debt 18 1.00 343.03 1245.99 0 0.00 0 0.0 30000 ▇▁▁▁▁
amount 0 1.00 1038.92 474.55 100 700.00 1000 1300.0 5000 ▇▆▁▁▁
price 0 1.00 1462.78 628.13 105 1117.25 1400 1691.5 11140 ▇▁▁▁▁

Create Our Outcome Variable

Suppose we want to create an indicator for individuals whose credit status is "good". Easy enough; we can use the tidyverse to do that.

Solution.

# Create a dummy variable for "good" credit status
credit_df %<>% 
    # There are many ways to do this. You could also use ifelse() or case_when().
    mutate(status_good = 1 * (status == "good")) %>% 
    # Drop the old status variable
    select(-status)
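As the comment above notes, there are other ways to build the same dummy. A quick sketch of the ifelse() alternative, shown for comparison only (not meant to be run after the block above, since status has already been dropped):

```r
# Alternative: ifelse() instead of multiplying a logical by 1
credit_df <- credit_df %>%
    mutate(status_good = ifelse(status == "good", 1, 0)) %>%
    # Drop the old status variable
    select(-status)
```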

That’s great! We have an outcome that we want to test our model on


But we still have some missing data that could be of use


Here’s where tidymodels can come in handy in the preprocessing part

How Does tidymodels Help Us?

Imagine we want to clean each variable individually: our tidyverse approach would take a lot of typing/copying/pasting, which can become difficult to track.

For example, you might want to:

  • Standardize your numeric variables
  • Create dummies for your categorical variables
  • Remove variables that are perfectly determined by other variables (multicollinearity)
  • Impute missing values using KNN

tidymodels is a better way! We will use recipes and parsnip later on to help our modeling

Using tidymodels to Preprocess Data

Recipes

The tidymodels workflow begins with defining a recipe

  • This tells R which variable is your outcome and which variable(s) are your explanatory ones

  • Think of it as defining the formula and dataset within lm(y ~ x, data = df) without running the regression

What we will be doing is trying to predict which individuals have status_good = 1 \(\rightarrow\) where status_good is our outcome

Two important things before we move on:

1. No functions in your recipe: We are only defining the roles of the data. We will be able to make transformations later

2. All other variables: You can use . on the right-hand side of the ~ to say “the rest of the variables”. Saves you some time from writing everything down

Recipe

Here is our first recipe. It’ll help us understand further.

# Define the recipe: status_good predicted by all other variables in the dataframe
recipe_all <- recipe(status_good ~ ., data = credit_df)
# What does recipe do?
recipe_all
── Recipe ──────────────────────────────────────────────────────────────────────
── Inputs 
Number of variables by role
outcome:    1
predictor: 13

Recipe

Here is that recipe again; this time we inspect its structure.

# Define the recipe: status_good predicted by all other variables in the dataframe
recipe_all <- recipe(status_good ~ ., data = credit_df)
# What does recipe do?
recipe_all %>% str()


Tip

Note: If you have an “ID” variable, you can give it an “ID” role.

recipe(...) %>% update_role(id_var, new_role = "ID")
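For example, with a hypothetical customer_id column (credit_df does not have one, and df_with_id is an equally hypothetical dataframe), the role update would look like:

```r
# Sketch: `customer_id` is a hypothetical ID column in `df_with_id`.
# An "ID" role keeps the variable in the data without treating it
# as a predictor or an outcome.
recipe(status_good ~ ., data = df_with_id) %>%
    update_role(customer_id, new_role = "ID")
```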

Recipe

R now “knows” the roles each variable will play in our model.

Now we can start cleaning/transforming our dataset.

We can apply whatever steps we want to clean/pre-process/manipulate our data. The tidymodels universe has many possible steps for:

Imputation

  • step_impute_mean()
  • step_impute_mode()
  • step_impute_knn()
  • step_impute_bag()

Transformation

  • step_log()
  • step_poly()
  • step_mutate()
  • step_interact()

Dummies & Discretization

  • step_dummy()
  • step_num2factor()
  • step_discretize()
  • step_cut()

Many More

  • step_center()
  • step_normalize()
  • step_date()
  • step_lag()
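These steps chain onto a recipe just like any other pipe. For instance, a sketch combining a log transformation with normalization (illustrative only; we will not use these particular steps below):

```r
# Sketch: log-transform income (offset avoids log(0)),
# then center and scale all numeric predictors
recipe_all %>%
    step_log(income, offset = 1) %>%
    step_normalize(all_predictors() & all_numeric())
```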

Recipe

To apply any of these, you need to also tell the function step_*() which variables you want it to target.

  • All: everything()
  • Role: all_predictors(), all_outcomes(), has_role()
  • Type: all_nominal(), all_numeric()
  • Variable Names: Listing the name of the variables or using selectors starts_with(), contains(), etc.

Try writing a simple mean imputation for all variables that are
(1) predictors AND (2) numeric

Solution.

# Mean imputation for all numeric predictors

recipe_all %>% step_impute_mean(all_predictors() & all_numeric())
── Recipe ──────────────────────────────────────────────────────────────────────
── Inputs 
Number of variables by role
outcome:    1
predictor: 13
── Operations 
• Mean imputation for: all_predictors() & all_numeric()

Recipe

Notice this is not a processed dataframe.

To get a processed dataframe, we need to add two more things to our recipe

  • prep(): estimates the parameters defined by the recipes’ preprocessing
  • juice(): applies the preprocessing to the training dataset (we will talk about CV in tidymodels soon)
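juice() returns the preprocessed training data; to apply the same trained steps to new data (say, a test split), recipes provides bake(). A sketch, where new_df stands in for a hypothetical test set:

```r
# Train the preprocessing once...
trained <- recipe_all %>%
    step_impute_mean(all_predictors() & all_numeric()) %>%
    prep()
# ...then extract the preprocessed training data
train_clean <- juice(trained)
# ...or apply the trained steps to new data
test_clean <- bake(trained, new_data = new_df)
```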

prep() and juice()

# Applying prep()
recipe_all %>%
    step_impute_mean(all_predictors() & all_numeric()) %>%
    prep()
── Recipe ──────────────────────────────────────────────────────────────────────
── Inputs 
Number of variables by role
outcome:    1
predictor: 13
── Training information 
Training data contained 4454 data points and 415 incomplete rows.
── Operations 
• Mean imputation for: seniority, time, age, expenses, income, ... | Trained

The output is still a recipe, but now it has trained the preprocessing operations on the training data and knows the number of observations and number of incomplete rows

prep() and juice()

Now we add (a dash of) juice() to get the preprocessed (cleaned) dataset

# Add juice()
credit_clean <- recipe_all %>%
    step_impute_mean(all_predictors() & all_numeric()) %>%
    prep() %>%
    juice()
# Skim it 
credit_clean %>% skim()
Data summary
Name Piped data
Number of rows 4454
Number of columns 14
_______________________
Column type frequency:
factor 4
numeric 10
________________________
Group variables None

Variable type: factor

skim_variable n_missing complete_rate ordered n_unique top_counts
home 6 1 FALSE 6 own: 2107, ren: 973, par: 783, oth: 319
marital 1 1 FALSE 5 mar: 3241, sin: 977, sep: 130, wid: 67
records 0 1 FALSE 2 no: 3681, yes: 773
job 2 1 FALSE 4 fix: 2805, fre: 1024, par: 452, oth: 171

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
seniority 0 1 7.99 8.17 0 2.00 5 12.0 48 ▇▃▁▁▁
time 0 1 46.44 14.66 6 36.00 48 60.0 72 ▁▂▅▃▇
age 0 1 37.08 10.98 18 28.00 36 45.0 68 ▆▇▆▃▁
expenses 0 1 55.57 19.52 35 35.00 51 72.0 180 ▇▃▁▁▁
income 0 1 141.71 77.22 6 93.00 130 164.0 959 ▇▂▁▁▁
assets 0 1 5403.98 11513.17 0 0.00 3500 6000.0 300000 ▇▁▁▁▁
debt 0 1 343.03 1243.47 0 0.00 0 0.0 30000 ▇▁▁▁▁
amount 0 1 1038.92 474.55 100 700.00 1000 1300.0 5000 ▇▆▁▁▁
price 0 1 1462.78 628.13 105 1117.25 1400 1691.5 11140 ▇▁▁▁▁
status_good 0 1 0.72 0.45 0 0.00 1 1.0 1 ▃▁▁▁▇

Success! 🤔

No more missing values for our numeric predictors

However, we still have some missing values in our categorical predictors

We just gotta fix those too!


Let’s define our preprocessing as:

1. Impute the numeric predictors using the means

2. Impute the categorical predictors using knn (k = 5)

3. Create dummy variables for all categorical predictors

  • Note: by default, step_dummy() drops the original column

4. Interact the resulting home_ dummies with income

All Together Now

# Starting from scratch
recipe_all <- recipe(status_good ~ ., data = credit_df)
# Putting it all together
credit_clean <- recipe_all %>%
    # 1. Impute the numerical predictors using the means
    step_impute_mean(all_predictors() & all_numeric()) %>%
    # 2. Impute the categorical variables using KNN (k = 5)
    step_impute_knn(all_predictors() & all_nominal(), neighbors = 5) %>%
    # 3. Create dummies for categorical variables
    step_dummy(all_predictors() & all_nominal()) %>%
    # 4. Interact the created dummies with `income`
    step_interact(~income:starts_with("home")) %>%
    # Prep
    prep() %>%
    # And Juice
    juice()

credit_clean %>% skim()
Data summary
Name Piped data
Number of rows 4454
Number of columns 28
_______________________
Column type frequency:
numeric 28
________________________
Group variables None

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
seniority 0 1 7.99 8.17 0 2.00 5 12.0 48 ▇▃▁▁▁
time 0 1 46.44 14.66 6 36.00 48 60.0 72 ▁▂▅▃▇
age 0 1 37.08 10.98 18 28.00 36 45.0 68 ▆▇▆▃▁
expenses 0 1 55.57 19.52 35 35.00 51 72.0 180 ▇▃▁▁▁
income 0 1 141.71 77.22 6 93.00 130 164.0 959 ▇▂▁▁▁
assets 0 1 5403.98 11513.17 0 0.00 3500 6000.0 300000 ▇▁▁▁▁
debt 0 1 343.03 1243.47 0 0.00 0 0.0 30000 ▇▁▁▁▁
amount 0 1 1038.92 474.55 100 700.00 1000 1300.0 5000 ▇▆▁▁▁
price 0 1 1462.78 628.13 105 1117.25 1400 1691.5 11140 ▇▁▁▁▁
status_good 0 1 0.72 0.45 0 0.00 1 1.0 1 ▃▁▁▁▇
home_other 0 1 0.07 0.26 0 0.00 0 0.0 1 ▇▁▁▁▁
home_owner 0 1 0.47 0.50 0 0.00 0 1.0 1 ▇▁▁▁▇
home_parents 0 1 0.18 0.38 0 0.00 0 0.0 1 ▇▁▁▁▂
home_priv 0 1 0.06 0.23 0 0.00 0 0.0 1 ▇▁▁▁▁
home_rent 0 1 0.22 0.41 0 0.00 0 0.0 1 ▇▁▁▁▂
marital_married 0 1 0.73 0.45 0 0.00 1 1.0 1 ▃▁▁▁▇
marital_separated 0 1 0.03 0.17 0 0.00 0 0.0 1 ▇▁▁▁▁
marital_single 0 1 0.22 0.41 0 0.00 0 0.0 1 ▇▁▁▁▂
marital_widow 0 1 0.02 0.12 0 0.00 0 0.0 1 ▇▁▁▁▁
records_yes 0 1 0.17 0.38 0 0.00 0 0.0 1 ▇▁▁▁▂
job_freelance 0 1 0.23 0.42 0 0.00 0 0.0 1 ▇▁▁▁▂
job_others 0 1 0.04 0.19 0 0.00 0 0.0 1 ▇▁▁▁▁
job_partime 0 1 0.10 0.30 0 0.00 0 0.0 1 ▇▁▁▁▁
income_x_home_other 0 1 9.42 38.04 0 0.00 0 0.0 700 ▇▁▁▁▁
income_x_home_owner 0 1 71.83 94.94 0 0.00 0 135.0 905 ▇▁▁▁▁
income_x_home_parents 0 1 21.49 55.43 0 0.00 0 0.0 857 ▇▁▁▁▁
income_x_home_priv 0 1 7.59 35.53 0 0.00 0 0.0 959 ▇▁▁▁▁
income_x_home_rent 0 1 30.77 66.47 0 0.00 0 0.0 535 ▇▁▁▁▁

Great Success!

If we did it correctly, after imputing missing values from the training data, we should have:

  • No more missing values

  • Created dummies

  • Created interactions between home and income


Next, we will explore modeling with

tidymodels
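As a preview of that modeling step, here is a minimal parsnip sketch (the model choice is illustrative, not the one we will necessarily use): define a model specification, then fit it with the familiar formula interface.

```r
# Sketch: a logistic regression spec in parsnip,
# treating status_good as a factor outcome
model_spec <- logistic_reg() %>%
    set_engine("glm") %>%
    set_mode("classification")

# Fit on the preprocessed data from above
fit_obj <- model_spec %>%
    fit(factor(status_good) ~ ., data = credit_clean)
```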