msm
Dynamic ForecastingThe HVT package is a collection of R functions to facilitate building topology preserving maps for rich multivariate data analysis. Tending towards a big data preponderance, a large number of rows. A collection of R functions for this typical workflow is organized below:
Data Compression: Vector quantization (VQ), HVQ (hierarchical vector quantization) using means or medians. This step compresses the rows (long data frame) using a compression objective.
Data Projection: Dimension projection of the compressed cells to 1D,2D or Interactive surface plot with the Sammons Non-linear Algorithm. This step creates topology preserving map (also called embeddings) coordinates into the desired output dimension.
Tessellation: Create cells required for object visualization using the Voronoi Tessellation method, package includes heatmap plots for hierarchical Voronoi tessellations (HVT). This step enables data insights, visualization, and interaction with the topology preserving map useful for semi-supervised tasks.
Scoring: Scoring new data sets or test data and recording their assignment using the map objects from the above steps, in a sequence of maps if required.
Temporal Analysis and Visualization: Collection of functions that analyzes time series data for its underlying patterns, calculation of transitioning probabilities and the visualizations for the flow of data over time.
Dynamic Forecasting: Simulate future states of dynamic systems using Monte Carlo simulations of Markov Chain (MSM), enabling predictions for time-series data.
Objective
This notebook is designed to systematically explore different
combinations of parameters — specifically, the number of cells in
trainHVT
function, the number of clusters (k) and the
number of nearest neighbors in the msm
function. The goal
is to evaluate the MAE (Mean Absolute Error) across all variables for
each combination and identify the configuration that yields the lowest
MAE, thereby indicating the most accurate model selection.
Number of Cells:
This value determines how many distinct cells should be formed at each level of the hierarchy during the vector quantization process in the trainHVT function.
Number of Clusters:
This determines the granularity of the clustering. It is recommended to provide value within the range of 2 to 9, as higher values may lead to overly fine-grained clusters that lose interpretability and practical significance.
Number of Nearest Neighbors:
The number of nearest neighbors is dependent on the cluster structure. For each cluster, it ranges from 2 to 7. This parameter helps to pick the next state for the problematic states in the simulation.
Since all the new functions for hyperparameter tuning approach is not in CRAN yet, we are sourcing the scripts from local.
script_dir <- "../R"
r_files <- list.files(script_dir, pattern = "\\.R$", full.names = TRUE)
invisible(lapply(r_files, function(file) { source(file, echo = FALSE); }))
Below is the function for more dynamic drop down display of data tables.
Let’s start with importing the dataset. The below code reads and displays the dataset.
entire_dataset <- read.csv("./sample_dataset/macro_raw_model_inputs_v5.csv")
entire_dataset <- entire_dataset %>% mutate(across(where(is.numeric), ~ round(., 4)))
dynamic_length_menu <- calculate_dynamic_length_menu(nrow(entire_dataset))
DT::datatable(entire_dataset,options = list(pageLength = 10,scrollX = TRUE, lengthMenu = dynamic_length_menu), rownames = FALSE)
Before proceeding, it is crucial to examine the structure of the dataset. This involves verifying the data types of the columns and resolving any inconsistencies. Make sure all data types are accurate and suitable for the intended functions.
## 'data.frame': 320 obs. of 18 variables:
## $ t : chr "12/1/98" "1/1/99" "2/1/99" "3/1/99" ...
## $ CPI_U_All_Items_Less_Food_Energy : num 175 176 176 176 176 ...
## $ Coincident_Index : num 80.1 80.3 80.8 80.8 80.9 81.2 81.4 81.7 82 81.9 ...
## $ Copper_ETF : num 0.663 0.641 0.624 0.622 0.717 ...
## $ SnP500_ETF : num 1229 1280 1238 1286 1335 ...
## $ Spot_Oil_Price_West_Texas_Intermediate: num 11.3 12.5 12 14.7 17.3 ...
## $ US_Dollar_Index : num 94.2 96.1 98.7 100.1 101 ...
## $ Unemployment_Rate : num 4.4 4.3 4.4 4.2 4.3 4.2 4.3 4.3 4.2 4.2 ...
## $ X10_Year_Treasury_Note_Yield : num 4.65 4.72 5 5.23 5.18 5.54 5.9 5.79 5.94 5.92 ...
## $ X2_Year_Treasury_Note_Yield : num 4.51 4.62 4.88 5.05 4.98 5.25 5.62 5.55 5.68 5.66 ...
## $ High_Yield_Effective_Yield : num 10.5 10.4 10.7 10.6 10.2 ...
## $ XLY : num 19.4 20.3 20.2 21.2 21.8 ...
## $ XLP : num 14.6 14.4 14.3 14.3 13.8 ...
## $ XLE : num 11.8 11 10.9 12.5 14.3 ...
## $ XLF : num 11.4 11.6 11.7 12.1 13 ...
## $ XLV : num 17.7 18.5 18.5 19 19.7 ...
## $ XLI : num 15.4 15.2 15.4 15.7 18 ...
## $ XLB : num 12.2 11.8 11.9 12.2 15.2 ...
Since the time column is in ‘Character’ format, we are changing it to ‘datetime’ (POSIXct) format.
We transform the data to compute the 12-month rate of change, which standardizes the features and brings them to a comparable scale, simplifying the analysis of their relative changes. Log difference reduces variability and removes trends, stabilizing the data for more accurate forecasting.
The Rate of change is calculated as follows:
\[ \text{Rate of Change} = \log(\text{Current Value}) - \log(\text{12-Month Lag Value}) \]
features_data <- entire_dataset %>% select(-t) %>% colnames()
entire_dataset[features_data] <- lapply(entire_dataset[features_data], function(x) as.numeric(as.character(x)))
invisible(lapply(features_data, function(col) {
entire_dataset[[col]] <<- entire_dataset[[col]] %>% log()
entire_dataset[[col]] <<- c(rep(NA, 12), round(diff(entire_dataset[[col]], 12),4))}))
entire_dataset <- entire_dataset %>% na.omit() %>% data.frame()
rownames(entire_dataset) <- NULL
After the log difference of 12 datapoints (months), the dataset ranges from December 1999 to July 2025. Below is the table displaying the transformed dataset.
entire_dataset$t <- as.character(entire_dataset$t)
dynamic_length_menu <- calculate_dynamic_length_menu(nrow(entire_dataset))
DT::datatable(entire_dataset,options = list(pageLength = 10,scrollX = TRUE,lengthMenu = dynamic_length_menu,columnDefs = list(list(width = '150px', targets = 0))),
class = "nowrap display",rownames = FALSE)
The entire dataset is split into train and test datasets in the 9:1
ratio. The train split is passed to the HVTMSMoptimization
function for model training and scoring. The test split is used as the
Ex-Post forecasting period.
train_size <- round(0.9 * nrow(entire_dataset))
train_dataset <- entire_dataset %>% slice_head(n = train_size)
test_dataset <- entire_dataset %>% slice_tail(n = nrow(entire_dataset) - train_size)
The train dataset is from 1999-12-01 to 2022-12-01. Below is the table displaying the train data.
train_dataset$t <- as.character(train_dataset$t)
dynamic_length_menu <- calculate_dynamic_length_menu(nrow(train_dataset))
DT::datatable(train_dataset,options = list(pageLength = 10,scrollX = TRUE,lengthMenu = dynamic_length_menu,
columnDefs = list(list(width = '150px', targets = 0))),class = "nowrap display",rownames = FALSE)
The test dataset is from 2023-01-01 to 2025-07-01. Below is the table displaying the test data.
test_dataset$t <- as.character(test_dataset$t)
dynamic_length_menu <- calculate_dynamic_length_menu(nrow(test_dataset))
DT::datatable(test_dataset,options = list(pageLength = 10,scrollX = TRUE,lengthMenu = dynamic_length_menu,
columnDefs = list(list(width = '150px', targets = 0))),class = "nowrap display",rownames = FALSE)
Case 1:
trainHVT()
= train dataset from 1999-12-01 to
2022-12-01
scoreHVT()
= train dataset from 1999-12-01 to
2022-12-01
# trainHVT
hvt.results <- trainHVT(
train_dataset[,-1],
n_cells = 13,
depth = 1,
quant.err = 0.25,
normalize = TRUE,
distance_metric = "L1_Norm",
error_metric = "max",
dim_reduction_method = "sammon")
# scoreHVT
scoring <- scoreHVT(train_dataset,hvt.results,analysis.plots = TRUE,names.column = train_dataset[,1])
scoring$scoredPredictedData$t <- format(train_dataset$t, "%Y-%m-%d")
scoring$scoredPredictedData <- scoring$scoredPredictedData %>% dplyr::select(c(t, everything()))
# Temporal Data
temporal_data <-scoring$scoredPredictedData %>% select(t,Cell.ID)
temporal_data$t <- as.POSIXct(temporal_data$t, format = "%Y-%m-%d")
# Transition Probability Matrix
prob_trans_matx <- getTransitionProbability(df = temporal_data,cellid_column = "Cell.ID",time_column = "t", type = "with_self_state")
#Ex-Post Forecasting
ex_post <- msm(state_time_data = temporal_data,
forecast_type = "ex-post",
transition_probability_matrix = prob_trans_matx,
initial_state <- temporal_data$Cell.ID[which.max(temporal_data$t)],
num_simulations = 5,
scoreHVT_results = scoring,
trainHVT_results = hvt.results,
actual_data = test_dataset,
raw_dataset = entire_dataset,
handle_problematic_states = FALSE,
k = 2,
n_nearest_neighbor = 5,
mae_metric = "median",
show_simulation = FALSE,
time_column = "t")
## Error in validate_inputs(): No temporal state data available for actual data period - states comparison will be skipped
Observation:
Using train dataset for trainHVT()
and
scoreHVT()
will still allow forecasting by taking the last
train state (transitions learned on train) as the initial state to
forecast. But the test set isn’t scored, so there are no test Cell.IDs
(“Actual States”) to compute ex-post MAE, therefore we discard this
setup for evaluation.
Case 2:
trainHVT()
= train dataset from 1999-12-01 to
2022-12-01
scoreHVT()
= test dataset from 2023-01-01 to
2025-07-01
# trainHVT
hvt.results <- trainHVT(
train_dataset[,-1],
n_cells = 13,
depth = 1,
quant.err = 0.25,
normalize = TRUE,
distance_metric = "L1_Norm",
error_metric = "max",
dim_reduction_method = "sammon")
# scoreHVT
scoring <- scoreHVT(test_dataset,hvt.results,analysis.plots = TRUE,names.column = test_dataset[,1])
scoring$scoredPredictedData$t <- format(test_dataset$t, "%Y-%m-%d")
scoring$scoredPredictedData <- scoring$scoredPredictedData %>% dplyr::select(c(t, everything()))
# temporal Data
temporal_data <-scoring$scoredPredictedData %>% select(t,Cell.ID)
temporal_data$t <- as.POSIXct(temporal_data$t, format = "%Y-%m-%d")
ids <- sort(unique(temporal_data$Cell.ID))
cat("The Unique Cell IDs in Temporal Data:", paste(ids, collapse = ", "), "\n")
## The Unique Cell IDs in Temporal Data: 3, 4, 5, 6, 7, 9
Observation:
Using train dataset for trainHVT()
and
test dataset for scoreHVT()
, the test
period maps to only a subset of trained cells and not all Cell.IDs
appear in temporal data. Creating probability matrix with this temporal
data will be invalid as simulations will be stuck between similar
cells.
Case 3:
trainHVT()
= train dataset from 1999-12-01 to
2022-12-01
scoreHVT()
= train + test dataset from 1999-12-01 to
2025-07-01
# trainHVT
hvt.results <- trainHVT(
train_dataset[,-1],
n_cells = 13,
depth = 1,
quant.err = 0.25,
normalize = TRUE,
distance_metric = "L1_Norm",
error_metric = "max",
dim_reduction_method = "sammon")
# scoreHVT
scoring <- scoreHVT(entire_dataset,hvt.results,analysis.plots = TRUE,names.column = entire_dataset[,1])
scoring$scoredPredictedData$t <- format(entire_dataset$t, "%Y-%m-%d")
scoring$scoredPredictedData <- scoring$scoredPredictedData %>% dplyr::select(c(t, everything()))
# Temporal Data
temporal_data <-scoring$scoredPredictedData %>% select(t,Cell.ID)
temporal_data$t <- as.POSIXct(temporal_data$t, format = "%Y-%m-%d")
ids <- sort(unique(temporal_data$Cell.ID))
cat("The Unique Cell IDs in Temporal Data:", paste(ids, collapse = ", "), "\n")
## The Unique Cell IDs in Temporal Data: 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13
Observation:
Using train dataset for trainHVT()
and
entire dataset(Train set + Test set) for
scoreHVT()
, every row (train and test) is mapped to trained
cells, so temporal_data has full cell coverage and continuous time as
printed above. This enables the Actual States overlay for the test
period and computes state-level residuals/MAE. This resolves the problem
observed in the other two cases.
Since Case 3 is the optimal scenario, where trainHVT is trained with the training set and scoreHVT is evaluated on the entire dataset, this case is considered for further hyperparameter tuning and forecasting.
Hyperparameter 1 – Number of Cells in
trainHVT
:
Hyperparameter 2 – Optimal Clusters:
msm
function when problematic states
are identified during the simulation process.k
clusters to guide the
selection of subsequent states.Hyperparameter 3 – Number of Nearest Neighbors:
k
, applied within the msm
function.Below is the workflow for the hyperparameter tuning:
Train and Score HVT Model: Loop through each
value of ncell
range, for every iteration, train HVT model
(trainHVT
) on the training dataset and
perform scoring (scoreHVT
) on entire
dataset to assign Cell.ID
and obtain centroid
coordinates.
Prepare Temporal Data and Compute Transition
Probabilities: Merge the assigned Cell.ID
from
scoring results with the training dataset’s time column (t
)
to form temporal data showing state transitions. Then, use this temporal
data to generate a transition probability matrix indicating the
likelihood of transitions between states.
Identify Problematic States and Run MSM
Simulations: For each ncell
:
Calculate MAE: Run MSM simulations for each
combination of cluster counts (k
) and nearest neighbors,
then calculate the MAE by comparing simulated predictions with actual
test data (expost_forecasting
) for all three MAE metrics
Mean, Median,
Mode.
Identify and Aggregate Best Parameters: Record
the MAE for every combination of ncell
, k
, and
Nearest Neighbor
values tested across all iterations.
Summarize all parameter combinations and their results comprehensively
across the complete set of iterations.
Visualize Results: Plot MAE performance across all optimal parameter combinations of ncell, k, and nearest neighbor. Highlight highest and lowest MAE values for optimal parameter selection.
Note: The expost_forecasting
period
will be fixed for all iterations.
The HVTMSMoptimization
function automates the entire
workflow described above. Rather than executing trainHVT, scoreHVT, and
msm separately, it encapsulates all these steps into a single process.
This function systematically explores the parameter space to identify
the optimal combination of clustering granularity and temporal modeling
parameters for superior forecasting performance.
entire_dataset
– The train dataset
containing timestamps and features; used for HVT model training,
scoring, and MSM simulations.expost_forecasting
– The test dataset
containing timestamps and actual feature values; used for scoring and to
calculate Mean Absolute Error (MAE) for MSM predictions.time_column
– Column name representing
timestamps; not included in HVT model training but utilized for building
temporal sequences and MSM analysis.ncell_range
– Possible numbers of
cells for tessellation in trainHVT
, min = 3.k_range
– Range of clusters in
msm
when problematic states are found, min = 2nn_range
– Range of nearest-neighbor
counts within a cluster to select as next state for problematic states,
min = 1.num_simulations
– Number of
simulations in MSM functionmae_metric
– Aggregation method for
MAE calculation (“mean”, “median”, “mode” or “all”)depth
– Integer specifying hierarchy
depth for vector quantizationquant.err
– Numeric quantization error
threshold for HVTnormalize
– Logical value to indicate
whether to normalize input datasetdistance_metric
– Distance measure
used (e.g., “L1_Norm”, “L2_Norm”)error_metric
– Error calculation
method for HVT (e.g., “max”, “mean”)dim_reduction_method
– Dimensionality
reduction technique (e.g., “sammon”, “tsne”, “umap”)parallel
– Enables parallel processing
across cores to speed up optimization (default: TRUE).For each variable MAE is calculates as:
\[\text{MAE} = \sum_{t=1}^{T} |\text{Actual}_{t} - \text{Predicted}_{t}|\] where,
T = Total number of forecast periods (the forecasting horizon)
t = Individual time period counter (goes from 1 to T)
For the states : The predicted states is selected based on mae_metric -mean/median/mode of all values across simulations for a particular timestamp, will be subtracted from the actual state which is the scored state from temporal data.
For the variables : The predicted states from the simulations are translated to respective feature centroids from the trainHVT (HVQ compression) and scaled to the raw dataset values by multiplying them by the standard deviation of the raw feature and adding the mean of the raw feature(if data is normalized during model training). The actual values is from the primary dataset which is log diffed in the section 3.3
Input Parameters for the HVTMSMoptimization function:
entire_dataset
: 1999-12-01 to 2022-12-01expost_forecasting
: 2023-01-01 to 2025-07-01ncell_range
: 10:Maximum cells (i.e., 308/2 = 154)k_range
: 2:9nn_range
: 2:7num_simulations
: 100mae_metric
: medianhvt_params
with default valuesmax_cells <- round((nrow(train_dataset)/2))
msm_model <- HVTMSMoptimization(
entire_dataset = train_dataset,
expost_forecasting = test_dataset,
time_column = "t",
ncell_range = 10:max_cells,
k_range = 2:9,
nn_range = 2:7,
num_simulations = 100,
mae_metric = "median",
hvt_params = list(
depth = 1,
quant.err = 0.2,
normalize = TRUE,
distance_metric = "L1_Norm",
error_metric = "max",
dim_reduction_method = "sammon"
), parallel = TRUE)
The table below displays all tested combinations of
parameters for each cell count Number of Cells
. For each
cell count, multiple combinations of cluster count (k) and nearest
neighbors were evaluated to find the optimal parameters for handling
problematic states.
Problematic states can be due to three cases:
Case 1: Absence Transitions: This occurs when a state lacks any outgoing transitions to other states or itself. Once the simulation reaches such a state, it becomes stuck because there is no path forward, effectively halting the simulation in the same state.
Case 2: Self-State Only Transitions: In this case, a state transitions exclusively to itself. When the simulation reaches such a state, it remains trapped in this state, unable to explore other states, causing stagnation.
Case 3: Cyclic Transitions: This refers to a situation where states transition back and forth between each other in a closed loop. While the simulation remains active, it is constrained within the cycle, preventing the exploration of states outside the cycle.
The status column categories listed below explain why MAE (Mean Absolute Error) values cannot be computed for certain cells:
MSM failed: Insufficient NN - The clustering process cannot proceed because the number of nearest neighbors requested for the simulation is greater than the number of valid neighbors available in the current cluster.
MSM failed: Only Problematic states - A problematic state is placed in a cluster where all other states are also problematic, leaving no valid neighbors for the nearest neighbor simulation.
Legend:
Green: Lowest MAE for each cell value in the given range (cell-level optimal performance)
Red: Lowest MAE achieved globally across all cells and combinations (overall best performance)
msm_model_median <- msm_model$median_mae$all_results
create_highlighted_results_table(msm_model_median)
Now that we have all the MAE values for each combination, we have sorted the below table to display the top 3 lowest MAE values.
The interactive plot shows MAE on the Y-axis and the number of cells on the X-axis. Hovering displays the number of cells, MAE, and the hyperparameters k and nn (nearest neighbors). The Red dot represents the highest MAE, and the Green dot represents the lowest MAE.
## NULL
From the above plot, we can see that the cell has lowest MAE which is , when the hyperparameters k is set to and number of nearest neighbor is set to .
Let’s run msm
function with the selected optimal model
from HVTMSMoptimization
function.
Note: Follow the same values given in
HVTMSMoptimization
function for similar results
# Temporal Data
temporal_data <-scoring$scoredPredictedData %>% select(t,Cell.ID)
temporal_data$t <- as.POSIXct(temporal_data$t, format = "%Y-%m-%d")
train_end_timestamp <- max(as.POSIXct(train_dataset$t), na.rm = TRUE)
temporal_train <- temporal_data %>%
dplyr::filter(t <= train_end_timestamp) %>%
dplyr::arrange(t)
temporal_test <- temporal_data %>%
dplyr::filter(t > train_end_timestamp) %>%
dplyr::arrange(t)
# Transition matrix
prob_trans_matx <- getTransitionProbability(
df = temporal_train,
cellid_column = "Cell.ID",
time_column = "t",
type = "with_self_state")
initial_state <- dplyr::last(temporal_train$Cell.ID)
Ex-Post forecast time period:
Transition Probability Matrix: 1999-12-01 to 2022-12-01
Initial timestamp: 2022-12-01
Initial state: 11
Ex-Post Forecast: 2023-01-01 to 2025-07-01
Residuals: Actual - Predicted Median
k: 2
Nearest Neighbor: 3
NOTE: We have selected Median
as the metric for
calculating residuals in HVTMSMoptimization
function, so
using the same here.
ex_post <- msm(
state_time_data = temporal_data,
forecast_type = "ex-post",
transition_probability_matrix = prob_trans_matx,
initial_state = initial_state,
num_simulations = 100,
scoreHVT_results = scoring,
trainHVT_results = hvt.results,
actual_data = test_dataset,
raw_dataset = entire_dataset,
k = k,
handle_problematic_states = TRUE,
n_nearest_neighbor = nn,
mae_metric = "median",
show_simulation = FALSE,
time_column = "t")
Below are the MAE (Mean Absolute Deviation) values for each feature and state.The Mean MAE is calculated from only the variables not the states.
mae_values_centroid <- sapply(ex_post[["plots"]][[1]], function(x) x[["mae"]])
mae_value_states <- ex_post[["plots"]][[2]][["mae"]]
mean_mae <- round(mean(mae_values_centroid),4)
mae_values <- c(mae_value_states,mae_values_centroid,mean_mae)
plot_labels <- c(
"States", "CPI_U_All_Items_Less_Food_Energy","Coincident_Index" ,"Copper_ETF" , "SnP500_ETF","Spot_Oil_Price_West_Texas_Intermediate","US_Dollar_Index" , "Unemployment_Rate","X10_Year_Treasury_Note_Yield" ,"X2_Year_Treasury_Note_Yield", "High_Yield_Effective_Yield" ,"XLY" ,"XLP","XLE" , "XLF", "XLV","XLI", "XLB" , "Mean of Variables' MAE")
data_1 <- data.frame(Plot = plot_labels,MAE = mae_values)
DT::datatable(data_1, rownames = FALSE, options = list(pageLength = 12))
Observation:
From the results, the Mean of Variables MAE is
0.1173, while the MAE for the same combination from
HVTMSMoptimization is 0.1068, and the two values are
quite close. Since the msm
process involves random number
generation for simulation to select the next state, an error
range of 0.01 to 0.1 occurs.
This notebook presents a systematic hyperparameter optimization framework for MSM dynamic forecasting that explored three critical parameters: number of cells (range: 10 to 138), optimal clusters k (range: 2 to 9), and nearest neighbors (range: 2 to 7). The HVTMSMoptimization function automated the complete workflow and identified the global minimum MAE of 0.1068 at 15 cells (k=2, nn=3) that effectively handled problematic states through clustering and neighbor selection strategies.
Dataset preparation and configuration involved loading macroeconomic data from December 1999 to July 2025, applying 12-month log difference transformation, and splitting into training (277 observations) and test sets (31 observations). Case 3 configuration was selected where trainHVT uses training data while scoreHVT processes the entire dataset, providing complete cell coverage for accurate MAE computation.
MSM integration within optimization ensures the HVTMSMoptimization function incorporates MSM simulations directly within the parameter search process for all simulation runs and MAE value calculations