EC524: Lab 03

Introduction to tidymodels

Jose Rojas-Fallas

2026

Lab Agenda


1. Why: Understand why tidymodels may be useful


2. Preprocessing: Learn the basics of preprocessing data using tidymodels


Why tidymodels?

It is a unified, tidy framework for building, tuning, and evaluating statistical and machine-learning models in R


What problem does it solve?

  • It standardizes the modeling workflow

data \(\Rightarrow\) preprocessing \(\Rightarrow\) model \(\Rightarrow\) tuning \(\Rightarrow\) validation \(\Rightarrow\) evaluation


To better understand it, let’s see it in action

Setup

Let’s begin by loading in some packages

library(pacman)
p_load(tidyverse, modeldata, skimr, janitor, kknn, magrittr, tidymodels)


For these examples we will use data from the modeldata package.

The dataset is (amazingly) named credit_data

# Load credit dataset
data(credit_data)
# (Optional) Assign it a new name
credit_df <- credit_data

Remember you should always familiarize yourself with the data you are working with

# Glimpse at it
credit_df %>% glimpse()
Rows: 4,454
Columns: 14
$ Status    <fct> good, good, bad, good, good, good, good, good, good, bad, go…
$ Seniority <int> 9, 17, 10, 0, 0, 1, 29, 9, 0, 0, 6, 7, 8, 19, 0, 0, 15, 33, …
$ Home      <fct> rent, rent, owner, rent, rent, owner, owner, parents, owner,…
$ Time      <int> 60, 60, 36, 60, 36, 60, 60, 12, 60, 48, 48, 36, 60, 36, 18, …
$ Age       <int> 30, 58, 46, 24, 26, 36, 44, 27, 32, 41, 34, 29, 30, 37, 21, …
$ Marital   <fct> married, widow, married, single, single, married, married, s…
$ Records   <fct> no, no, yes, no, no, no, no, no, no, no, no, no, no, no, yes…
$ Job       <fct> freelance, fixed, freelance, fixed, fixed, fixed, fixed, fix…
$ Expenses  <int> 73, 48, 90, 63, 46, 75, 75, 35, 90, 90, 60, 60, 75, 75, 35, …
$ Income    <int> 129, 131, 200, 182, 107, 214, 125, 80, 107, 80, 125, 121, 19…
$ Assets    <int> 0, 0, 3000, 2500, 0, 3500, 10000, 0, 15000, 0, 4000, 3000, 5…
$ Debt      <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2500, 260, 0, 0, 0, 2000…
$ Amount    <int> 800, 1000, 2000, 900, 310, 650, 1600, 200, 1200, 1200, 1150,…
$ Price     <int> 846, 1658, 2985, 1325, 910, 1645, 1800, 1093, 1957, 1468, 15…

Look at Your Data

Ugly data should be cleaned:

# "Fix" variable names
credit_df %<>% clean_names()

And we can take a deeper peek at the data (we are looking for NAs and distributions)

credit_df %>% skim()
Data summary
Name Piped data
Number of rows 4454
Number of columns 14
_______________________
Column type frequency:
factor 5
numeric 9
________________________
Group variables None

Variable type: factor

skim_variable n_missing complete_rate ordered n_unique top_counts
status 0 1 FALSE 2 goo: 3200, bad: 1254
home 6 1 FALSE 6 own: 2107, ren: 973, par: 783, oth: 319
marital 1 1 FALSE 5 mar: 3241, sin: 977, sep: 130, wid: 67
records 0 1 FALSE 2 no: 3681, yes: 773
job 2 1 FALSE 4 fix: 2805, fre: 1024, par: 452, oth: 171

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
seniority 0 1.00 7.99 8.17 0 2.00 5 12.0 48 ▇▃▁▁▁
time 0 1.00 46.44 14.66 6 36.00 48 60.0 72 ▁▂▅▃▇
age 0 1.00 37.08 10.98 18 28.00 36 45.0 68 ▆▇▆▃▁
expenses 0 1.00 55.57 19.52 35 35.00 51 72.0 180 ▇▃▁▁▁
income 381 0.91 141.69 80.75 6 90.00 125 170.0 959 ▇▂▁▁▁
assets 47 0.99 5403.98 11574.42 0 0.00 3000 6000.0 300000 ▇▁▁▁▁
debt 18 1.00 343.03 1245.99 0 0.00 0 0.0 30000 ▇▁▁▁▁
amount 0 1.00 1038.92 474.55 100 700.00 1000 1300.0 5000 ▇▆▁▁▁
price 0 1.00 1462.78 628.13 105 1117.25 1400 1691.5 11140 ▇▁▁▁▁

Create Our Outcome Variable

Suppose we want to create an indicator for individuals whose credit status is "good". Easy enough; we can use the tidyverse to do that.

Solution.

# Create a dummy variable for "good" credit status
credit_df %<>% 
    # There are many ways to do this. You could also use ifelse() or case_when().
    mutate(status_good = 1 * (status == "good")) %>% 
    # Drop the old status variable
    select(-status)
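As the comment above notes, there are other ways to build the same dummy. A quick sketch of the ifelse() alternative, shown for comparison only (not meant to be run after the block above, since status has already been dropped):

```r
# Alternative: ifelse() instead of multiplying a logical by 1
credit_df <- credit_df %>%
    mutate(status_good = ifelse(status == "good", 1, 0)) %>%
    # Drop the old status variable
    select(-status)
```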

That’s great! We have an outcome that we want to test our model on


But we still have some missing data that could be of use


Here’s where tidymodels can come in handy in the preprocessing part

How Does tidymodels Help Us?

Imagine we want to clean each variable individually: our tidyverse approach would take a lot of typing/copying/pasting, which can become difficult to track.

For example, you might want to:

  • Standardize your numeric variables
  • Create dummies for your categorical variables
  • Remove variables that are perfectly determined by other variables (multicollinearity)
  • Impute missing values using KNN

tidymodels is a better way! We will use recipes and parsnip later on to help our modeling

Using tidymodels to Preprocess Data

Recipes

The tidymodels workflow begins with defining a recipe

  • This tells R which variable is your outcome and which variable(s) are your explanatory ones

  • Think of it as defining the formula and dataset within lm(y ~ x, data = df) without running the regression

What we will be doing is trying to predict which individuals have status_good = 1 \(\rightarrow\) where status_good is our outcome

Two important things before we move on:

1. No functions in your recipe: We are only defining the roles of the data. We will be able to make transformations later

2. All other variables: You can use . on the right-hand side of the ~ to say “the rest of the variables”. Saves you some time from writing everything down

Recipe

Here is our first recipe. It’ll help us understand further.

# Define the recipe: status_good predicted by all other variables in the dataframe
recipe_all <- recipe(status_good ~ ., data = credit_df)
# What does recipe do?
recipe_all
── Recipe ──────────────────────────────────────────────────────────────────────
── Inputs 
Number of variables by role
outcome:    1
predictor: 13

Recipe

Here is that recipe again; this time we inspect its structure.

# Define the recipe: status_good predicted by all other variables in the dataframe
recipe_all <- recipe(status_good ~ ., data = credit_df)
# What does recipe do?
recipe_all %>% str()


Tip

Note: If you have an “ID” variable, you can give it an “ID” role.

recipe(...) %>% update_role(id_var, new_role = "ID")
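For example, with a hypothetical customer_id column (credit_df does not have one, and df_with_id is an equally hypothetical dataframe), the role update would look like:

```r
# Sketch: `customer_id` is a hypothetical ID column in `df_with_id`.
# An "ID" role keeps the variable in the data without treating it
# as a predictor or an outcome.
recipe(status_good ~ ., data = df_with_id) %>%
    update_role(customer_id, new_role = "ID")
```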

Recipe

R now “knows” the roles each variable will play in our model.

Now we can start cleaning/transforming our dataset.

We can apply whatever steps we want to clean/pre-process/manipulate our data. The tidymodels universe has many possible steps for:

Imputation

  • step_impute_mean()
  • step_impute_mode()
  • step_impute_knn()
  • step_impute_bag()

Transformation

  • step_log()
  • step_poly()
  • step_mutate()
  • step_interact()

Dummies & Discretization

  • step_dummy()
  • step_num2factor()
  • step_discretize()
  • step_cut()

Many More

  • step_center()
  • step_normalize()
  • step_date()
  • step_lag()
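These steps chain onto a recipe just like any other pipe. For instance, a sketch combining a log transformation with normalization (illustrative only; we will not use these particular steps below):

```r
# Sketch: log-transform income (offset avoids log(0)),
# then center and scale all numeric predictors
recipe_all %>%
    step_log(income, offset = 1) %>%
    step_normalize(all_predictors() & all_numeric())
```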

Recipe

To apply any of these, you need to also tell the function step_*() which variables you want it to target.

  • All: everything()
  • Role: all_predictors(), all_outcomes(), has_role()
  • Type: all_nominal(), all_numeric()
  • Variable Names: Listing the name of the variables or using selectors starts_with(), contains(), etc.

Try writing a simple mean imputation for all variables that are
(1) predictors AND (2) numeric

Solution.

# Mean imputation for all numeric predictors

recipe_all %>% step_impute_mean(all_predictors() & all_numeric())
── Recipe ──────────────────────────────────────────────────────────────────────
── Inputs 
Number of variables by role
outcome:    1
predictor: 13
── Operations 
• Mean imputation for: all_predictors() & all_numeric()

Recipe

Notice this is not a processed dataframe.

To get a processed dataframe, we need to add two more things to our recipe

  • prep(): estimates the parameters defined by the recipes’ preprocessing
  • juice(): applies the preprocessing to the training dataset (we will talk about CV in tidymodels soon)
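juice() returns the preprocessed training data; to apply the same trained steps to new data (say, a test split), recipes provides bake(). A sketch, where new_df stands in for a hypothetical test set:

```r
# Train the preprocessing once...
trained <- recipe_all %>%
    step_impute_mean(all_predictors() & all_numeric()) %>%
    prep()
# ...then extract the preprocessed training data
train_clean <- juice(trained)
# ...or apply the trained steps to new data
test_clean <- bake(trained, new_data = new_df)
```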

prep() and juice()

# Applying prep()
recipe_all %>%
    step_impute_mean(all_predictors() & all_numeric()) %>%
    prep()
── Recipe ──────────────────────────────────────────────────────────────────────
── Inputs 
Number of variables by role
outcome:    1
predictor: 13
── Training information 
Training data contained 4454 data points and 415 incomplete rows.
── Operations 
• Mean imputation for: seniority, time, age, expenses, income, ... | Trained

The output is still a recipe, but now it has trained the preprocessing operations on the training data and knows the number of observations and number of incomplete rows

prep() and juice()

Now we add (a dash of) juice() to get the preprocessed (cleaned) dataset

# Add juice()
credit_clean <- recipe_all %>%
    step_impute_mean(all_predictors() & all_numeric()) %>%
    prep() %>%
    juice()
# Skim it 
credit_clean %>% skim()
Data summary
Name Piped data
Number of rows 4454
Number of columns 14
_______________________
Column type frequency:
factor 4
numeric 10
________________________
Group variables None

Variable type: factor

skim_variable n_missing complete_rate ordered n_unique top_counts
home 6 1 FALSE 6 own: 2107, ren: 973, par: 783, oth: 319
marital 1 1 FALSE 5 mar: 3241, sin: 977, sep: 130, wid: 67
records 0 1 FALSE 2 no: 3681, yes: 773
job 2 1 FALSE 4 fix: 2805, fre: 1024, par: 452, oth: 171

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
seniority 0 1 7.99 8.17 0 2.00 5 12.0 48 ▇▃▁▁▁
time 0 1 46.44 14.66 6 36.00 48 60.0 72 ▁▂▅▃▇
age 0 1 37.08 10.98 18 28.00 36 45.0 68 ▆▇▆▃▁
expenses 0 1 55.57 19.52 35 35.00 51 72.0 180 ▇▃▁▁▁
income 0 1 141.71 77.22 6 93.00 130 164.0 959 ▇▂▁▁▁
assets 0 1 5403.98 11513.17 0 0.00 3500 6000.0 300000 ▇▁▁▁▁
debt 0 1 343.03 1243.47 0 0.00 0 0.0 30000 ▇▁▁▁▁
amount 0 1 1038.92 474.55 100 700.00 1000 1300.0 5000 ▇▆▁▁▁
price 0 1 1462.78 628.13 105 1117.25 1400 1691.5 11140 ▇▁▁▁▁
status_good 0 1 0.72 0.45 0 0.00 1 1.0 1 ▃▁▁▁▇

Success! 🤔

No more missing values for our numeric predictors

However, we still have some missing values in our categorical predictors

We just gotta fix those too!


Let’s define our preprocessing as:

1. Impute the numeric predictors using the means

2. Impute the categorical predictors using knn (k = 5)

3. Create dummy variables for all categorical predictors

  • Note: by default, step_dummy() drops the original column

4. Interact the resulting home_ dummies with income

All Together Now

# Starting from scratch
recipe_all <- recipe(status_good ~ ., data = credit_df)
# Putting it all together
credit_clean <- recipe_all %>%
    # 1. Impute the numerical predictors using the means
    step_impute_mean(all_predictors() & all_numeric()) %>%
    # 2. Impute the categorical variables using KNN (k = 5)
    step_impute_knn(all_predictors() & all_nominal(), neighbors = 5) %>%
    # 3. Create dummies for categorical variables
    step_dummy(all_predictors() & all_nominal()) %>%
    # 4. Interact the created dummies with `income`
    step_interact(~income:starts_with("home")) %>%
    # Prep
    prep() %>%
    # And Juice
    juice()

credit_clean %>% skim()
Data summary
Name Piped data
Number of rows 4454
Number of columns 28
_______________________
Column type frequency:
numeric 28
________________________
Group variables None

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
seniority 0 1 7.99 8.17 0 2.00 5 12.0 48 ▇▃▁▁▁
time 0 1 46.44 14.66 6 36.00 48 60.0 72 ▁▂▅▃▇
age 0 1 37.08 10.98 18 28.00 36 45.0 68 ▆▇▆▃▁
expenses 0 1 55.57 19.52 35 35.00 51 72.0 180 ▇▃▁▁▁
income 0 1 141.71 77.22 6 93.00 130 164.0 959 ▇▂▁▁▁
assets 0 1 5403.98 11513.17 0 0.00 3500 6000.0 300000 ▇▁▁▁▁
debt 0 1 343.03 1243.47 0 0.00 0 0.0 30000 ▇▁▁▁▁
amount 0 1 1038.92 474.55 100 700.00 1000 1300.0 5000 ▇▆▁▁▁
price 0 1 1462.78 628.13 105 1117.25 1400 1691.5 11140 ▇▁▁▁▁
status_good 0 1 0.72 0.45 0 0.00 1 1.0 1 ▃▁▁▁▇
home_other 0 1 0.07 0.26 0 0.00 0 0.0 1 ▇▁▁▁▁
home_owner 0 1 0.47 0.50 0 0.00 0 1.0 1 ▇▁▁▁▇
home_parents 0 1 0.18 0.38 0 0.00 0 0.0 1 ▇▁▁▁▂
home_priv 0 1 0.06 0.23 0 0.00 0 0.0 1 ▇▁▁▁▁
home_rent 0 1 0.22 0.41 0 0.00 0 0.0 1 ▇▁▁▁▂
marital_married 0 1 0.73 0.45 0 0.00 1 1.0 1 ▃▁▁▁▇
marital_separated 0 1 0.03 0.17 0 0.00 0 0.0 1 ▇▁▁▁▁
marital_single 0 1 0.22 0.41 0 0.00 0 0.0 1 ▇▁▁▁▂
marital_widow 0 1 0.02 0.12 0 0.00 0 0.0 1 ▇▁▁▁▁
records_yes 0 1 0.17 0.38 0 0.00 0 0.0 1 ▇▁▁▁▂
job_freelance 0 1 0.23 0.42 0 0.00 0 0.0 1 ▇▁▁▁▂
job_others 0 1 0.04 0.19 0 0.00 0 0.0 1 ▇▁▁▁▁
job_partime 0 1 0.10 0.30 0 0.00 0 0.0 1 ▇▁▁▁▁
income_x_home_other 0 1 9.42 38.04 0 0.00 0 0.0 700 ▇▁▁▁▁
income_x_home_owner 0 1 71.83 94.94 0 0.00 0 135.0 905 ▇▁▁▁▁
income_x_home_parents 0 1 21.49 55.43 0 0.00 0 0.0 857 ▇▁▁▁▁
income_x_home_priv 0 1 7.59 35.53 0 0.00 0 0.0 959 ▇▁▁▁▁
income_x_home_rent 0 1 30.77 66.47 0 0.00 0 0.0 535 ▇▁▁▁▁

Great Success!

If we did it correctly, after imputing missing values from the training data, we should have:

  • No more missing values

  • Created dummies

  • Created interactions between home and income


Next, we will explore modeling with

tidymodels
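As a preview of that modeling step, here is a minimal parsnip sketch (the model choice is illustrative, not the one we will necessarily use): define a model specification, then fit it with the familiar formula interface.

```r
# Sketch: a logistic regression spec in parsnip,
# treating status_good as a factor outcome
model_spec <- logistic_reg() %>%
    set_engine("glm") %>%
    set_mode("classification")

# Fit on the preprocessed data from above
fit_obj <- model_spec %>%
    fit(factor(status_good) ~ ., data = credit_clean)
```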