October 30, 2023
A non-linear, iterative process:
… this iteration between modeling and preprocessing is what makes Tidymodels so powerful.
3 types of models: descriptive, inferential, and predictive.
model <- lm(outcome ~ predictor1 + predictor2 + predictor3 + control1 + control2,
            data = synthetic_data)
summary(model)
## 
## Call:
## lm(formula = outcome ~ predictor1 + predictor2 + predictor3 + 
##     control1 + control2, data = synthetic_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.5231 -0.6038 -0.0456  0.6086  2.3181 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.11940    0.21154   0.564    0.574    
## predictor1   0.95068    0.10495   9.059 1.84e-14 ***
## predictor2   2.10331    0.09876  21.297  < 2e-16 ***
## predictor3   0.02312    0.10138   0.228    0.820    
## control1Yes  0.13202    0.19070   0.692    0.490    
## control2    -0.43514    0.32878  -1.323    0.189    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.9439 on 94 degrees of freedom
## Multiple R-squared:  0.8489, Adjusted R-squared:  0.8409 
## F-statistic: 105.6 on 5 and 94 DF,  p-value: < 2.2e-16
And after all that, having finally run your model(s)…
One of the strengths of R is that it encourages developers to create a user interface that fits their needs.
On the one hand, this means that R has immense functionality, which the community is continuously adding to…
… on the other hand, this means that R modeling is unwieldy at best, incoherent at worst.
What follows are three common methods for creating a scatter plot of two numeric variables. Separate groups of developers devised three distinct interfaces for the same task. Each has advantages and disadvantages. Each is different!
Keep in mind for comparison: in Python, when approaching a problem:
“There should be one – and preferably only one – obvious way to do it.” - The Zen of Python (PEP 20)
plot(data$x, data$y, main="Base R Plot", pch=19, col=rgb(0.2,0.4,0.6,0.6))
library(lattice)
xyplot(y ~ x, data = data,
       main = "Scatter Plot with Lattice",
       xlab = "Independent", ylab = "Dependent",
       col = "blue", pch = 16)
library(ggplot2)
ggplot(data, aes(x = x, y = y)) +
  geom_point(color = 'blue') +
  geom_smooth(method = 'lm', se = FALSE, color = 'red') +
  labs(
    title = "Generated Scatter Plot with Smooth Line",
    x = "Independent",
    y = "Dependent"
  ) +
  theme_minimal()
Tidymodels is a collection of R packages and a framework for modeling and machine learning that follows the principles of tidy data and integrates seamlessly with the tidyverse ecosystem.
It was developed to provide a consistent and organized way to perform machine learning tasks in R, making it easier for data scientists and analysts to build, evaluate, and deploy predictive models.
The Tidymodels universe, credit: tidymodels.org
Transitioning from the world of scattered functions and inconsistent methodologies, tidymodeling not only simplifies your workflow but also introduces an elegance to data preprocessing.
With tidymodeling, you no longer need to be a ‘jack of all trades’ — the package suite integrates essential tools, providing a cohesive experience.
Tidyverse syntax > Base R expressions…

Tidyverse integration: Built on the principles of the tidyverse, promoting consistent and user-friendly data manipulation. Familiarity with the tidyverse makes using Tidymodels seamless and consistent in data analysis and modeling workflows.
Consistency & workflow: Tidymodels ensures a structured workflow for modeling, encapsulating data pre-processing, model specification, tuning, and evaluation, enhancing organization and transparency.
Recipes for data pre-processing: The inclusion of the recipes package in Tidymodels allows structured, reproducible data pre-processing steps, beneficial for feature engineering and data transformation.
Model Agnosticism: Facilitates easy swapping of different models for experimentation and selection without extensive code adjustments.
Hyperparameter Tuning: Streamlined process for adjusting and finding the best model parameters.
Resampling & Cross-validation: Offers resampling methods, like cross-validation and bootstrapping, essential for reliable estimations of model generalization.
Extensive metrics: The yardstick package in Tidymodels provides a broad spectrum of evaluation metrics, simplifying model performance comparison and assessment (see the short sketch after this list).
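As a minimal sketch of the last two points, resamples can be created with rsample and evaluation metrics bundled with yardstick. The train_data object and its class outcome are hypothetical here; the split itself is shown later in this post.

library(tidymodels)

set.seed(123)

# 10-fold cross-validation and bootstrap resamples of a (hypothetical) training set
folds <- vfold_cv(train_data, v = 10)
boots <- bootstraps(train_data, times = 25)

# A reusable bundle of yardstick metrics for classification
class_metrics <- metric_set(accuracy, roc_auc)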
Pre-processing involves preparing the data for modeling. This includes tasks such as data splitting, feature engineering, and data transformation.
A ‘recipe’ prepares your data for fitting a model. It is a framework for getting your ‘ingredients’ (data) into shape for ‘cooking’ with (modeling).
Here we focus on 3 processes:
Data splitting
Feature engineering
Assigning roles
rsample: splits your dataset into training, validation, and test sets.
data_split <- initial_split(data, prop = 3/4)
train_data <- training(data_split)
test_data  <- testing(data_split)
How to ‘spend your data’ between training and testing with the proportion argument is up for debate, but we suggest roughly a 3:1 training-to-testing split, as here (prop = 3/4).
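If the outcome classes are imbalanced, the split can also be stratified so both sets see similar class proportions. A minimal sketch, assuming a hypothetical outcome column named outcome:

set.seed(222)  # make the split reproducible

# Stratify on the (hypothetical) outcome so its distribution is similar in both sets
data_split <- initial_split(data, prop = 3/4, strata = outcome)
train_data <- training(data_split)
test_data  <- testing(data_split)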
update_role(): keeps variables in the data but assigns them a custom role (here "ID"), so they are retained for reference without being used as predictors or the outcome.
hosp_rec <- recipe(patient_outcome ~ ., data = heart_cond_data) %>%
  update_role(patient, hospital, new_role = "ID")
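To confirm that patient and hospital now carry the "ID" role rather than acting as predictors, the recipe can be inspected:

# Lists each variable with its type, role, and source
summary(hosp_rec)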
step_{}: Feature engineering involves several preprocessing steps essential for transforming your raw data into a format better suited for modeling: creating or deriving new features from existing ones, selecting relevant features, and encoding data in ways that provide significant context or expose meaningful patterns.
hosp_rec <- recipe(patient_outcome ~ ., data = heart_cond_data) %>%
  update_role(patient, hospital, new_role = "ID") %>%
  step_date(date, features = c("dow", "month")) %>%
  step_holiday(date,
               holidays = timeDate::listHolidays("US"),
               keep_original_cols = FALSE) %>%
  step_dummy(all_nominal_predictors()) %>%
  step_zv(all_predictors())
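To see what these steps actually produce, the recipe can be estimated and applied to its own data; a minimal sketch for inspection only:

# prep() estimates each step (date features, holiday indicators, dummy levels, etc.);
# bake(new_data = NULL) returns the processed training data for inspection
hosp_prepped <- prep(hosp_rec)
bake(hosp_prepped, new_data = NULL)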
parsnip: Select (i) the model, (ii) the specific ‘engine’, and (iii) any initial hyperparameters.
lr_mod <- logistic_reg() %>%
  set_engine("glm")

rf_mod <- rand_forest(trees = 1000) %>%
  set_engine("ranger") %>%
  set_mode("classification")
workflows: A workflow is a framework structure that pairs a model (above) with a recipe (from pre-processing).
hosp_wflow <- workflow() %>%
  add_model(lr_mod) %>%
  add_recipe(hosp_rec)
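Because the workflow holds the model and recipe separately, swapping in the random forest specification from above is a one-line change; a sketch:

# Reuse the same recipe but exchange the logistic regression for the random forest
hosp_rf_wflow <- hosp_wflow %>%
  update_model(rf_mod)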
fit(): fit() can be used to prepare the recipe and train the model from the resulting predictors.

hosp_fit <- hosp_wflow %>%
  fit(data = train_data)
This object has the finalized recipe and fitted model objects inside.
Next we would use the trained workflow (hosp_fit) to predict with the unseen test data.
predict(hosp_fit, test_data)
predict() applies the recipe to the new data, then passes the result to the fitted model. We would then save the predictions as an object and evaluate them against the test data.
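As a sketch of that evaluation step (the .pred_class column is generated by augment(); the metrics shown are examples, not a prescribed set):

# augment() binds predicted classes (and class probabilities) onto the test data
hosp_preds <- augment(hosp_fit, test_data)

# Compare predictions with the observed outcome using yardstick
hosp_preds %>%
  accuracy(truth = patient_outcome, estimate = .pred_class)

hosp_preds %>%
  conf_mat(truth = patient_outcome, estimate = .pred_class)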
Post-processing involves selecting the best model configuration, and fine-tuning hyperparameters. We have used the prefixes ‘pre-’ and ‘post-’ for clarity, but it should be understood these are all part of a non-linear process.
Post-processing is outside the scope of this workshop, but at a very high level, it involves:
broom: to tidy up model results, making them easier to interpret and visualize.
yardstick: to calculate evaluation metrics like RMSE, MAE, or ROC AUC.
tune and dials: to fine-tune model hyperparameters to optimize performance based on evaluation metrics (a brief sketch follows).
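For illustration only, a minimal sketch of tuning the random forest from earlier with tune and dials. The grid ranges, fold count, and tree count are arbitrary choices, not recommendations:

# Mark hyperparameters for tuning in the model specification
rf_tune_spec <- rand_forest(mtry = tune(), min_n = tune(), trees = 500) %>%
  set_engine("ranger") %>%
  set_mode("classification")

rf_tune_wflow <- workflow() %>%
  add_model(rf_tune_spec) %>%
  add_recipe(hosp_rec)

# Candidate values built with dials, evaluated across cross-validation folds
rf_grid <- grid_regular(mtry(range = c(2, 8)), min_n(), levels = 3)
folds   <- vfold_cv(train_data, v = 5)

rf_results <- tune_grid(rf_tune_wflow, resamples = folds, grid = rf_grid)

collect_metrics(rf_results)                    # resampled performance per candidate
select_best(rf_results, metric = "roc_auc")    # best hyperparameter combination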
Recipes
# Sample data
data(mtcars)

# Starting the recipe
rec <- recipe(mpg ~ ., data = mtcars)

# Data preprocessing steps
rec <- rec %>%
  step_normalize(all_predictors()) %>%
  step_pca(all_numeric()) %>%
  step_dummy(all_nominal())

# Preprocessed data summary
summary(prepped <- prep(rec, training = mtcars))
## # A tibble: 5 × 4
##   variable type      role      source 
##   <chr>    <list>    <chr>     <chr>  
## 1 PC1      <chr [2]> predictor derived
## 2 PC2      <chr [2]> predictor derived
## 3 PC3      <chr [2]> predictor derived
## 4 PC4      <chr [2]> predictor derived
## 5 PC5      <chr [2]> predictor derived
The power of tidymodeling becomes evident in its holistic approach. It’s not just about individual steps; it’s the journey from raw data to insightful models, all within the same, consistent environment. Tidymodeling eliminates the disjointed processes often encountered in traditional methods, replacing them with a streamlined, intuitive workflow.
Install the complete tidymodels package set with:

install.packages("tidymodels")
Load the library:
library(tidymodels)
tidymodels_packages()
## [1] "broom" "cli" "conflicted" "dials" "dplyr" ## [6] "ggplot2" "hardhat" "infer" "modeldata" "parsnip" ## [11] "purrr" "recipes" "rlang" "rsample" "rstudioapi" ## [16] "tibble" "tidyr" "tune" "workflows" "workflowsets" ## [21] "yardstick" "tidymodels"
We see that we have actually loaded a number of packages (which could also be loaded individually). Core packages include recipes, parsnip, workflows, yardstick, etc.
dials: has tools to create and manage values of tuning parameters.

infer: is a modern approach to statistical inference.

parsnip: is a tidy, unified interface to creating models.

recipes: is a general data preprocessor with a modern interface. It can create model matrices that incorporate feature engineering, imputation, and other helpful tools.

rsample: has infrastructure for resampling data so that models can be assessed and empirically validated.

tune: contains the functions to optimize model hyper-parameters.

workflows: has methods to combine pre-processing steps and models into a single object.

yardstick: contains tools for evaluating models (e.g. accuracy, RMSE, etc.).