Last updated: 2026-04-08

Checks: 7 0

Knit directory: Integrating-nir-genomic-kernel/

This reproducible R Markdown analysis was created with workflowr (version 1.7.2). The Checks tab describes the reproducibility checks that were applied when the results were created. The Past versions tab lists the development history.


Great! Since the R Markdown file has been committed to the Git repository, you know the exact version of the code that produced these results.

Great job! The global environment was empty. Objects defined in the global environment can affect the analysis in your R Markdown file in unknown ways. For reproduciblity it’s best to always run the code in an empty environment.

The command set.seed(20250829) was run prior to running the code in the R Markdown file. Setting a seed ensures that any results that rely on randomness, e.g. subsampling or permutations, are reproducible.

Great job! Recording the operating system, R version, and package versions is critical for reproducibility.

Nice! There were no cached chunks for this analysis, so you can be confident that you successfully produced the results during this run.

Great job! Using relative paths to the files within your workflowr project makes it easier to run your code on other machines.

Great! You are using Git for version control. Tracking code development and connecting the code version to the results is critical for reproducibility.

The results in this page were generated with repository version ee9b0ba. See the Past versions tab to see a history of the changes made to the R Markdown and HTML files.

Note that you need to be careful to ensure that all relevant files for the analysis have been committed to Git prior to generating the results (you can use wflow_publish or wflow_git_commit). workflowr only checks the R Markdown file, but you know if there are other scripts or data files that it depends on. Below is the status of the Git repository when the results were generated:


Ignored files:
    Ignored:    .Rhistory
    Ignored:    .Rproj.user/
    Ignored:    analysis/analysis.Rmd
    Ignored:    analysis/run_cv0_cv00_only.R
    Ignored:    analysis/run_cv1_cv2_only.R
    Ignored:    analysis/script_tabelas_resultados_pipeline_v2.R
    Ignored:    data/Article_documents/
    Ignored:    data/Maize-NIRS-GBS-main/
    Ignored:    output/Matrizes/Geno.all.rds
    Ignored:    output/Matrizes/NIR.all.rds
    Ignored:    output/Matrizes/figures/
    Ignored:    output/adjust_models.rar
    Ignored:    output/climate_results/additional_climate_plots.tiff
    Ignored:    output/climate_results/combined_additional_climate_plots.tiff
    Ignored:    output/climate_results/combined_climate_plot.tiff
    Ignored:    output/cv-schemes.tiff
    Ignored:    output/variance_components/bglr_runs/
    Ignored:    output/variance_components/figures/

Untracked files:
    Untracked:  output/results/

Note that any generated files, e.g. HTML, png, CSS, etc., are not included in this status report because it is ok for generated content to have uncommitted changes.


These are the previous versions of the repository in which changes were made to the R Markdown (analysis/analysis_prediction_run_outputs.Rmd) and HTML (docs/analysis_prediction_run_outputs.html) files. If you’ve configured a remote Git repository (see ?wflow_git_remote), click on the hyperlinks in the table below to view the files as they were in that past version.

File Version Author Date Message
Rmd d7caa48 WevertonGomesCosta 2026-04-07 update
Rmd bfbf03c WevertonGomesCosta 2026-04-07 update
Rmd e1fdde0 WevertonGomesCosta 2026-04-07 update
Rmd 0b2db18 WevertonGomesCosta 2026-04-07 update
Rmd 2d2b88e WevertonGomesCosta 2026-04-06 update
Rmd 5aac46d WevertonGomesCosta 2026-04-06 update
Rmd d1beaba WevertonGomesCosta 2026-04-06 update
html d1beaba WevertonGomesCosta 2026-04-06 update
Rmd 04efebb WevertonGomesCosta 2026-04-05 update
html 6b32571 WevertonGomesCosta 2026-04-05 update
Rmd 0a6059c WevertonGomesCosta 2026-04-03 update
html f6d5560 WevertonGomesCosta 2026-04-03 update files
Rmd 3c8baec WevertonGomesCosta 2026-04-03 update
Rmd 262e87a WevertonGomesCosta 2026-04-02 update
Rmd 9e064c5 WevertonGomesCosta 2026-04-02 update
Rmd 5581b38 WevertonGomesCosta 2026-04-02 update
Rmd cda6f85 WevertonGomesCosta 2026-04-02 update
html cda6f85 WevertonGomesCosta 2026-04-02 update
Rmd 00ef3ef WevertonGomesCosta 2026-04-01 update
Rmd 731cac2 WevertonGomesCosta 2026-04-01 update

Language / Idioma: English | Português

Module navigation / Navegação do módulo: Part I: setup, models and CV design

1. Introduction

This submodule is the execution counterpart of analysis_prediction.Rmd. Its role is to run the prediction jobs for the reduced catalog of 18 models, save the CSV files under output/results/, and inventory the outputs that will later be consumed by the visualization stage.

The execution contract used here has four operational rules:

  • CV1 and CV2 are derived from the same fit within each fold;
  • CV0 and CV00 are derived from the same fit for each held-out environment within each fold;
  • the fold partition is deterministic for each MODEL × REP × TRAIT task and is reused in both CV blocks;
  • output files are organized explicitly by validation scheme, environment when applicable, trait, repetition, and fold, with the fold represented both in the directory tree and in the file name.

The full fitting stage is preserved here for completeness. When the prediction stage needs to be rerun, the code shown below is the operational version used by the pipeline.

2. Packages and project paths

The execution module needs both the matrix objects created upstream and the packages used by the parallel fitting layer. The path logic below keeps the file runnable from either the project root or the analysis/ directory.

rm(list = ls())
options(scipen = 999)

library(readr)
Warning: pacote 'readr' foi compilado no R versão 4.5.2
library(dplyr)
Warning: pacote 'dplyr' foi compilado no R versão 4.5.2

Anexando pacote: 'dplyr'
Os seguintes objetos são mascarados por 'package:stats':

    filter, lag
Os seguintes objetos são mascarados por 'package:base':

    intersect, setdiff, setequal, union
library(stringr)
library(knitr)
library(BGLR)
library(foreach)
Warning: pacote 'foreach' foi compilado no R versão 4.5.2
library(doParallel)
Warning: pacote 'doParallel' foi compilado no R versão 4.5.2
Carregando pacotes exigidos: iterators
Carregando pacotes exigidos: parallel
library(parallel)
library(tidyr)
Warning: pacote 'tidyr' foi compilado no R versão 4.5.2
project_root <- getwd()

if (!dir.exists(file.path(project_root, "output")) &&
    dir.exists(file.path(project_root, "..", "output"))) {
  project_root <- normalizePath(file.path(project_root, ".."))
}

if (!dir.exists(file.path(project_root, "output"))) {
  stop("Could not find the folder 'output/'. Render this file from the project root or from analysis/.")
}

results_dir <- file.path(project_root, "output", "results")
analysis_tables_dir <- file.path(project_root, "output", "tables", "analysis_prediction")
bglr_runs_dir <- file.path(results_dir, "bglr_runs")

dir.create(results_dir, recursive = TRUE, showWarnings = FALSE)
dir.create(analysis_tables_dir, recursive = TRUE, showWarnings = FALSE)
dir.create(bglr_runs_dir, recursive = TRUE, showWarnings = FALSE)

3. Reload the execution objects from Part I

This section rebuilds the execution environment when the file is rendered on its own. It follows the contract of analysis_prediction.Rmd and recreates the reduced set of 18 ETA objects exactly as defined in Part I.

matrices_dir <- file.path(project_root, "output", "Matrizes")

required_matrix_files <- c(
  "Pheno.rds",
  "ZG.rds", "ZP.rds", "ZE.rds", "ZW.rds",
  "ZGZE.rds", "ZPZE.rds", "ZGZW.rds", "ZPZW.rds",
  "GGK.rds", "PGK.rds", "GAK.rds", "PAK.rds",
  "GGKE.rds", "PGKE.rds", "GAKE.rds", "PAKE.rds",
  "GGKW.rds", "PGKW.rds", "GAKW.rds", "PAKW.rds"
)

required_matrix_paths <- file.path(matrices_dir, required_matrix_files)

if (!all(file.exists(required_matrix_paths))) {
  missing_matrix_files <- required_matrix_files[!file.exists(required_matrix_paths)]
  stop(
    paste0(
      "The execution objects required by analysis_prediction_run_outputs were not found in output/Matrizes/. Missing files: ",
      paste(missing_matrix_files, collapse = ", ")
    )
  )
}

Pheno <- readRDS(file.path(matrices_dir, "Pheno.rds"))
ZG   <- readRDS(file.path(matrices_dir, "ZG.rds"))
ZP   <- readRDS(file.path(matrices_dir, "ZP.rds"))
ZE   <- readRDS(file.path(matrices_dir, "ZE.rds"))
ZW   <- readRDS(file.path(matrices_dir, "ZW.rds"))
ZGZE <- readRDS(file.path(matrices_dir, "ZGZE.rds"))
ZPZE <- readRDS(file.path(matrices_dir, "ZPZE.rds"))
ZGZW <- readRDS(file.path(matrices_dir, "ZGZW.rds"))
ZPZW <- readRDS(file.path(matrices_dir, "ZPZW.rds"))
GGK  <- readRDS(file.path(matrices_dir, "GGK.rds"))
PGK  <- readRDS(file.path(matrices_dir, "PGK.rds"))
GAK  <- readRDS(file.path(matrices_dir, "GAK.rds"))
PAK  <- readRDS(file.path(matrices_dir, "PAK.rds"))
GGKE <- readRDS(file.path(matrices_dir, "GGKE.rds"))
PGKE <- readRDS(file.path(matrices_dir, "PGKE.rds"))
GAKE <- readRDS(file.path(matrices_dir, "GAKE.rds"))
PAKE <- readRDS(file.path(matrices_dir, "PAKE.rds"))
GGKW <- readRDS(file.path(matrices_dir, "GGKW.rds"))
PGKW <- readRDS(file.path(matrices_dir, "PGKW.rds"))
GAKW <- readRDS(file.path(matrices_dir, "GAKW.rds"))
PAKW <- readRDS(file.path(matrices_dir, "PAKW.rds"))

if (!exists("Pedigree")) Pedigree <- unique(Pheno$Pedigree)

candidate_traits <- c("GY", "KW")
traits_to_run <- candidate_traits[candidate_traits %in% names(Pheno)]
if (length(traits_to_run) == 0) {
  stop("None of the expected traits (GY, KW) were found in Pheno.")
}

if (!exists("nIter"))   nIter   <- 5000
if (!exists("burnIn"))  burnIn  <- 1000
if (!exists("thin"))    thin    <- 10
if (!exists("num_rep")) num_rep <- 10
if (!exists("n_cores")) n_cores <- max(1L, parallel::detectCores() - 1L)

model_catalog <- data.frame(
  ModelNumber = 1:18,
  Model = paste0("Eta", 1:18),
  Description = c(
    "E + G + GE",
    "E + P + PE",
    "E + GGK + GGKE",
    "E + PGK + PGKE",
    "E + GAK + GAKE",
    "E + PAK + PAKE",
    "E + G + P + GE + PE",
    "E + GGK + PGK + GGKE + PGKE",
    "E + GAK + PAK + GAKE + PAKE",
    "E + W + G + GE + GW",
    "E + W + P + PE + PW",
    "E + W + G + P + GE + PE + GW + PW",
    "E + W + GGK + GGKE + GGKW",
    "E + W + PGK + PGKE + PGKW",
    "E + W + GGK + PGK + GGKE + PGKE + GGKW + PGKW",
    "E + W + GAK + GAKE + GAKW",
    "E + W + PAK + PAKE + PAKW",
    "E + W + GAK + PAK + GAKE + PAKE + GAKW + PAKW"
  ),
  stringsAsFactors = FALSE
)

env_design <- model.matrix(~ Pheno$Env - 1)

Eta_list <- vector("list", 18)
Eta_list[[1]]  <- list(list(X = env_design, model = "BRR"), list(K = ZG, model = "RKHS"), list(K = ZGZE, model = "RKHS"))
Eta_list[[2]]  <- list(list(X = env_design, model = "BRR"), list(K = ZP, model = "RKHS"), list(K = ZPZE, model = "RKHS"))
Eta_list[[3]]  <- list(list(X = env_design, model = "BRR"), list(K = GGK, model = "RKHS"), list(K = GGKE, model = "RKHS"))
Eta_list[[4]]  <- list(list(X = env_design, model = "BRR"), list(K = PGK, model = "RKHS"), list(K = PGKE, model = "RKHS"))
Eta_list[[5]]  <- list(list(X = env_design, model = "BRR"), list(K = GAK, model = "RKHS"), list(K = GAKE, model = "RKHS"))
Eta_list[[6]]  <- list(list(X = env_design, model = "BRR"), list(K = PAK, model = "RKHS"), list(K = PAKE, model = "RKHS"))
Eta_list[[7]]  <- list(list(X = env_design, model = "BRR"), list(K = ZG, model = "RKHS"), list(K = ZP, model = "RKHS"), list(K = ZGZE, model = "RKHS"), list(K = ZPZE, model = "RKHS"))
Eta_list[[8]]  <- list(list(X = env_design, model = "BRR"), list(K = GGK, model = "RKHS"), list(K = PGK, model = "RKHS"), list(K = GGKE, model = "RKHS"), list(K = PGKE, model = "RKHS"))
Eta_list[[9]]  <- list(list(X = env_design, model = "BRR"), list(K = GAK, model = "RKHS"), list(K = PAK, model = "RKHS"), list(K = GAKE, model = "RKHS"), list(K = PAKE, model = "RKHS"))
Eta_list[[10]] <- list(list(X = env_design, model = "BRR"), list(K = ZW, model = "RKHS"), list(K = ZG, model = "RKHS"), list(K = ZGZE, model = "RKHS"), list(K = ZGZW, model = "RKHS"))
Eta_list[[11]] <- list(list(X = env_design, model = "BRR"), list(K = ZW, model = "RKHS"), list(K = ZP, model = "RKHS"), list(K = ZPZE, model = "RKHS"), list(K = ZPZW, model = "RKHS"))
Eta_list[[12]] <- list(list(X = env_design, model = "BRR"), list(K = ZW, model = "RKHS"), list(K = ZG, model = "RKHS"), list(K = ZP, model = "RKHS"), list(K = ZGZE, model = "RKHS"), list(K = ZPZE, model = "RKHS"), list(K = ZGZW, model = "RKHS"), list(K = ZPZW, model = "RKHS"))
Eta_list[[13]] <- list(list(X = env_design, model = "BRR"), list(K = ZW, model = "RKHS"), list(K = GGK, model = "RKHS"), list(K = GGKE, model = "RKHS"), list(K = GGKW, model = "RKHS"))
Eta_list[[14]] <- list(list(X = env_design, model = "BRR"), list(K = ZW, model = "RKHS"), list(K = PGK, model = "RKHS"), list(K = PGKE, model = "RKHS"), list(K = PGKW, model = "RKHS"))
Eta_list[[15]] <- list(list(X = env_design, model = "BRR"), list(K = ZW, model = "RKHS"), list(K = GGK, model = "RKHS"), list(K = PGK, model = "RKHS"), list(K = GGKE, model = "RKHS"), list(K = PGKE, model = "RKHS"), list(K = GGKW, model = "RKHS"), list(K = PGKW, model = "RKHS"))
Eta_list[[16]] <- list(list(X = env_design, model = "BRR"), list(K = ZW, model = "RKHS"), list(K = GAK, model = "RKHS"), list(K = GAKE, model = "RKHS"), list(K = GAKW, model = "RKHS"))
Eta_list[[17]] <- list(list(X = env_design, model = "BRR"), list(K = ZW, model = "RKHS"), list(K = PAK, model = "RKHS"), list(K = PAKE, model = "RKHS"), list(K = PAKW, model = "RKHS"))
Eta_list[[18]] <- list(list(X = env_design, model = "BRR"), list(K = ZW, model = "RKHS"), list(K = GAK, model = "RKHS"), list(K = PAK, model = "RKHS"), list(K = GAKE, model = "RKHS"), list(K = PAKE, model = "RKHS"), list(K = GAKW, model = "RKHS"), list(K = PAKW, model = "RKHS"))

4. Define the execution contract

This block restores the operational objects used by the current workflow: the fixed leave-one-environment-out set, the number of folds, and the MODEL × REP × TRAIT task grid for the 18-model catalog. The fitting and summarization logic are written directly inside the CV blocks so that the execution path stays explicit.

envs_leaveout <- c("CS11_WS", "CS11_WW", "CS12_WS", "CS12_WW")
envs_leaveout <- envs_leaveout[envs_leaveout %in% unique(Pheno$Env)]

num_fold <- 5L
n_hybrids <- length(Pedigree)
combos <- expand.grid(
  MODEL = seq_along(Eta_list),
  REP = seq_len(num_rep),
  TRAIT = traits_to_run,
  stringsAsFactors = FALSE
)

5. Start the parallel backend

The backend below mirrors the original workflow: one worker receives one combination of model, repetition, and trait.

useCores <- max(1L, n_cores)
cl <- parallel::makeCluster(useCores)
doParallel::registerDoParallel(cl)

parallel::clusterEvalQ(cl, {
  library(BGLR)
  library(dplyr)
  library(plyr)
  NULL
})

cat("Running on ", useCores, " cores.")

6. Run prediction analyses for CV1 and CV2

This block implements genotype-based validation for the 18-model workflow. The deterministic fold map, the BGLR fit, and the environment-wise summary are all written directly inside the loop so that each step stays explicit. One fit is run per fold, and the resulting predictions are split into CV1 (held-out genotypes) and CV2 (training genotypes under the same fit). In the current standard, both the final CSV summaries and the BGLR fitting artifacts are stored in a rep/fold hierarchy.

foreach(i = seq_len(nrow(combos)),
        .packages = c("BGLR", "dplyr", "plyr")) %dopar% {
  MODEL <- combos$MODEL[i]
  rep_num <- combos$REP[i]
  TRAIT <- as.character(combos$TRAIT[i])

  pheno <- data.frame(Pheno)

  trait_id <- match(TRAIT, trait_reference_order)
  if (is.na(trait_id)) {
    trait_id <- 1L
  }

  seed_value <- as.integer(100000L * MODEL + 100L * rep_num + trait_id)
  old_seed_exists <- exists(".Random.seed", envir = .GlobalEnv, inherits = FALSE)
  if (old_seed_exists) {
    old_seed <- get(".Random.seed", envir = .GlobalEnv, inherits = FALSE)
  }

  set.seed(seed_value)
  obs_per_fold <- ceiling(length(Pedigree) / num_fold)
  folds <- sample(rep(seq_len(num_fold), each = obs_per_fold, length.out = length(Pedigree)))

  if (old_seed_exists) {
    assign(".Random.seed", old_seed, envir = .GlobalEnv)
  } else if (exists(".Random.seed", envir = .GlobalEnv, inherits = FALSE)) {
    rm(".Random.seed", envir = .GlobalEnv)
  }

  folds_df <- data.frame(Pedigree = Pedigree, fold = folds, stringsAsFactors = FALSE)

  for (fold_num in seq_len(num_fold)) {
    rep_label <- sprintf("rep%02d", rep_num)
    fold_label <- sprintf("fold%02d", fold_num)

    cv1_dir <- file.path(results_dir, "CV1", TRAIT, rep_label, fold_label)
    cv2_dir <- file.path(results_dir, "CV2", TRAIT, rep_label, fold_label)
    dir.create(cv1_dir, recursive = TRUE, showWarnings = FALSE)
    dir.create(cv2_dir, recursive = TRUE, showWarnings = FALSE)

    cv1_file <- file.path(cv1_dir, paste0("CV1_", TRAIT, "_Eta", MODEL, "_", rep_label, "_", fold_label, ".csv"))
    cv2_file <- file.path(cv2_dir, paste0("CV2_", TRAIT, "_Eta", MODEL, "_", rep_label, "_", fold_label, ".csv"))

    if (file.exists(cv1_file) && file.exists(cv2_file)) {
      next
    }

    train_geno <- folds_df$Pedigree[folds_df$fold != fold_num]

    yield <- pheno
    yield$Y2 <- NA
    yield$Y2[yield$Pedigree %in% train_geno] <- yield[[TRAIT]][yield$Pedigree %in% train_geno]

    run_dir <- file.path(bglr_runs_dir, "CV1", TRAIT, rep_label, fold_label, paste0("Eta", MODEL))
    dir.create(run_dir, recursive = TRUE, showWarnings = FALSE)

    save_prefix_full <- file.path(run_dir, paste0("CV1_", TRAIT, "_Eta", MODEL, "_", rep_label, "_", fold_label, "_"))

    y_t <- as.numeric(yield$Y2)
    start_time <- Sys.time()
    fit <- BGLR(y = y_t, ETA = Eta_list[[MODEL]], nIter = nIter, burnIn = burnIn, thin = thin, saveAt = save_prefix_full)
    end_time <- Sys.time()
    runtime_full <- as.numeric(difftime(end_time, start_time, units = "secs"))

    yield$yhat <- fit$yHat
    rm(fit)
    gc()

    df_test <- yield[!yield$Pedigree %in% train_geno, ]
    df_train <- yield[yield$Pedigree %in% train_geno, ]

    tmp1 <- df_test %>%
      dplyr::group_by(Env) %>%
      dplyr::summarize(cor = cor(.data[[TRAIT]], yhat, use = "complete.obs"), .groups = "drop") %>%
      as.data.frame()

    tmp2 <- df_train %>%
      dplyr::group_by(Env) %>%
      dplyr::summarize(cor = cor(.data[[TRAIT]], yhat, use = "complete.obs"), .groups = "drop") %>%
      as.data.frame()

    tmp1$Runtime_sec <- runtime_full
    tmp2$Runtime_sec <- runtime_full
    tmp1$Fold <- fold_label
    tmp2$Fold <- fold_label

    write.csv(tmp1, file = cv1_file, row.names = FALSE)
    write.csv(tmp2, file = cv2_file, row.names = FALSE)
  }

  NULL
}

7. Run prediction analyses for CV0 and CV00

This block implements leave-one-environment-out validation for the same 18-model workflow. For each task, the deterministic fold map is rebuilt with the same rule used in CV1/CV2, one fit is run for each env_leave × fold combination, and the predictions are split into CV0 (training genotypes under the environment mask) and CV00 (held-out genotypes under the same mask). As in the genotype-based block, the fold construction, the BGLR fit, and the environment-wise summary are all written directly inside the loop so that the execution path remains explicit. The outputs are written in a rep/fold hierarchy and the BGLR artifacts are preserved in matching persistent folders.

foreach(i = seq_len(nrow(combos)),
        .packages = c("BGLR", "dplyr", "plyr")) %dopar% {
  MODEL <- combos$MODEL[i]
  rep_num <- combos$REP[i]
  TRAIT <- as.character(combos$TRAIT[i])

  pheno <- data.frame(Pheno)

  trait_id <- match(TRAIT, trait_reference_order)
  if (is.na(trait_id)) {
    trait_id <- 1L
  }

  seed_value <- as.integer(100000L * MODEL + 100L * rep_num + trait_id)
  old_seed_exists <- exists(".Random.seed", envir = .GlobalEnv, inherits = FALSE)
  if (old_seed_exists) {
    old_seed <- get(".Random.seed", envir = .GlobalEnv, inherits = FALSE)
  }

  set.seed(seed_value)
  obs_per_fold <- ceiling(length(Pedigree) / num_fold)
  folds <- sample(rep(seq_len(num_fold), each = obs_per_fold, length.out = length(Pedigree)))

  if (old_seed_exists) {
    assign(".Random.seed", old_seed, envir = .GlobalEnv)
  } else if (exists(".Random.seed", envir = .GlobalEnv, inherits = FALSE)) {
    rm(".Random.seed", envir = .GlobalEnv)
  }

  folds_df <- data.frame(Pedigree = Pedigree, fold = folds, stringsAsFactors = FALSE)

  for (fold_num in seq_len(num_fold)) {
    train_geno <- folds_df$Pedigree[folds_df$fold != fold_num]
    rep_label <- sprintf("rep%02d", rep_num)
    fold_label <- sprintf("fold%02d", fold_num)

    for (env_idx in seq_along(envs_leaveout)) {
      env_name <- envs_leaveout[env_idx]

      cv00_dir <- file.path(results_dir, "CV00", env_name, TRAIT, rep_label, fold_label)
      cv0_dir <- file.path(results_dir, "CV0", env_name, TRAIT, rep_label, fold_label)
      dir.create(cv00_dir, recursive = TRUE, showWarnings = FALSE)
      dir.create(cv0_dir, recursive = TRUE, showWarnings = FALSE)

      cv00_file <- file.path(cv00_dir, paste0("CV00_", TRAIT, "_", env_name, "_Eta", MODEL, "_", rep_label, "_", fold_label, ".csv"))
      cv0_file <- file.path(cv0_dir, paste0("CV0_", TRAIT, "_", env_name, "_Eta", MODEL, "_", rep_label, "_", fold_label, ".csv"))

      if (file.exists(cv00_file) && file.exists(cv0_file)) {
        next
      }

      yield2 <- pheno
      yield2$Y2 <- NA
      yield2$Y2[yield2$Pedigree %in% train_geno] <- yield2[[TRAIT]][yield2$Pedigree %in% train_geno]
      yield2$Y2[yield2$Env == env_name] <- NA

      run_dir <- file.path(bglr_runs_dir, "CV0", env_name, TRAIT, rep_label, fold_label, paste0("Eta", MODEL))
      dir.create(run_dir, recursive = TRUE, showWarnings = FALSE)

      save_prefix_env <- file.path(run_dir, paste0("CV0_", env_name, "_", TRAIT, "_Eta", MODEL, "_", rep_label, "_", fold_label, "_"))

      y_t <- as.numeric(yield2$Y2)
      start_time <- Sys.time()
      fit <- BGLR(y = y_t, ETA = Eta_list[[MODEL]], nIter = nIter, burnIn = burnIn, thin = thin, saveAt = save_prefix_env)
      end_time <- Sys.time()
      runtime_env <- as.numeric(difftime(end_time, start_time, units = "secs"))

      yield2$yhat <- fit$yHat
      rm(fit)
      gc()

      df_test2 <- yield2[!yield2$Pedigree %in% train_geno & yield2$Env == env_name, , drop = FALSE]
      df_train2 <- yield2[yield2$Pedigree %in% train_geno & yield2$Env == env_name, , drop = FALSE]

      tmp00 <- df_test2 %>%
        dplyr::group_by(Env) %>%
        dplyr::summarize(cor = cor(.data[[TRAIT]], yhat, use = "complete.obs"), .groups = "drop") %>%
        as.data.frame()

      tmp0 <- df_train2 %>%
        dplyr::group_by(Env) %>%
        dplyr::summarize(cor = cor(.data[[TRAIT]], yhat, use = "complete.obs"), .groups = "drop") %>%
        as.data.frame()

      tmp00$Runtime_sec <- runtime_env
      tmp0$Runtime_sec <- runtime_env
      tmp00$Fold <- fold_label
      tmp0$Fold <- fold_label

      write.csv(tmp00, file = cv00_file, row.names = FALSE)
      write.csv(tmp0, file = cv0_file, row.names = FALSE)
    }
  }

  NULL
}

8. Stop the parallel backend

stopCluster(cl)
foreach::registerDoSEQ()

9. Inventory of the prediction output files

The inventories below summarize the CSV files produced by the execution stage using the current rep/fold folder contract for the 18-model workflow.

prediction_files <- character(0)

if (dir.exists(results_dir)) {
  prediction_files <- list.files(results_dir, pattern = "\\.csv$", full.names = TRUE, recursive = TRUE)
}

inventory_cv12 <- data.frame()
inventory_cvloo <- data.frame()

if (length(prediction_files) > 0) {
  results_root <- normalizePath(results_dir, winslash = "/")

  for (current_file in prediction_files) {
    current_norm <- normalizePath(current_file, winslash = "/")
    rel_path <- substring(current_norm, nchar(results_root) + 2)
    path_parts <- strsplit(rel_path, "/", fixed = TRUE)[[1]]
    current_name <- basename(current_file)
    current_name_no_ext <- sub("\\.csv$", "", current_name)
    current_parts <- strsplit(current_name_no_ext, "_")[[1]]

    if (length(path_parts) >= 5 && current_parts[1] %in% c("CV1", "CV2")) {
      inventory_cv12 <- dplyr::bind_rows(
        inventory_cv12,
        data.frame(
          relative_path = rel_path,
          file_name = current_name,
          CV = current_parts[1],
          Trait = path_parts[2],
          Rep = path_parts[3],
          Fold = path_parts[4],
          Model = current_parts[3],
          stringsAsFactors = FALSE
        )
      )
    }

    if (length(path_parts) >= 6 && current_parts[1] %in% c("CV0", "CV00")) {
      inventory_cvloo <- dplyr::bind_rows(
        inventory_cvloo,
        data.frame(
          relative_path = rel_path,
          file_name = current_name,
          CV = current_parts[1],
          Env_leave = path_parts[2],
          Trait = path_parts[3],
          Rep = path_parts[4],
          Fold = path_parts[5],
          Model = current_parts[length(current_parts) - 2],
          stringsAsFactors = FALSE
        )
      )
    }
  }
}

write_csv(inventory_cv12, file.path(analysis_tables_dir, "analysis_prediction_inventory_cv12.csv"))
write_csv(inventory_cvloo, file.path(analysis_tables_dir, "analysis_prediction_inventory_cvloo.csv"))

10. Counts of files by validation scheme and trait

counts_by_cv <- data.frame()
counts_by_trait <- data.frame()

if (nrow(inventory_cv12) > 0 || nrow(inventory_cvloo) > 0) {
  combined_inventory <- bind_rows(
    inventory_cv12 %>% mutate(Env_leave = NA_character_),
    inventory_cvloo
  )

  counts_by_cv <- combined_inventory %>%
    count(CV, name = "n_files") %>%
    arrange(CV)

  counts_by_trait <- combined_inventory %>%
    count(CV, Trait, name = "n_files") %>%
    arrange(CV, Trait)
}

write_csv(counts_by_cv, file.path(analysis_tables_dir, "analysis_prediction_counts_by_cv.csv"))
write_csv(counts_by_trait, file.path(analysis_tables_dir, "analysis_prediction_counts_by_trait.csv"))

if (nrow(counts_by_cv) > 0) {
  knitr::kable(counts_by_cv, caption = "Number of prediction files by validation scheme.", align = "cc")
} else {
  cat("Prediction counts by validation scheme are not available yet.
")
}
Number of prediction files by validation scheme.
CV n_files
CV0 7200
CV00 7200
CV1 1800
CV2 1800

11. Examples of the output path pattern

output_examples <- bind_rows(
  tibble::tibble(
    ExampleType = "CV1/CV2",
    relative_path = c(
      "CV1/GY/rep01/fold01/CV1_GY_Eta1_rep01_fold01.csv",
      "CV2/KW/rep10/fold05/CV2_KW_Eta18_rep10_fold05.csv"
    ),
    bglr_example = c(
      "bglr_runs/CV1/GY/rep01/fold01/Eta1/CV1_GY_Eta1_rep01_fold01_GE_RKHS_varU.dat",
      "bglr_runs/CV2/KW/rep10/fold05/Eta18/CV2_KW_Eta18_rep10_fold05_PAKW_RKHS_varU.dat"
    ),
    Meaning = c(
      "Prediction accuracy for CV1, trait GY, model Eta1, repetition 1, fold 1.",
      "Prediction accuracy for CV2, trait KW, model Eta18, repetition 10, fold 5."
    )
  ),
  tibble::tibble(
    ExampleType = "CV0/CV00",
    relative_path = c(
      "CV0/CS11_WS/GY/rep01/fold01/CV0_GY_CS11_WS_Eta1_rep01_fold01.csv",
      "CV00/CS12_WW/KW/rep10/fold02/CV00_KW_CS12_WW_Eta18_rep10_fold02.csv"
    ),
    bglr_example = c(
      "bglr_runs/CV0/CS11_WS/GY/rep01/fold01/Eta1/CV0_CS11_WS_GY_Eta1_rep01_fold01_GW_RKHS_varU.dat",
      "bglr_runs/CV00/CS12_WW/KW/rep10/fold02/Eta18/CV00_CS12_WW_KW_Eta18_rep10_fold02_PAKW_RKHS_varU.dat"
    ),
    Meaning = c(
      "Prediction accuracy for CV0, trait GY, held-out environment CS11_WS, model Eta1, repetition 1, fold 1.",
      "Prediction accuracy for CV00, trait KW, held-out environment CS12_WW, model Eta18, repetition 10, fold 2."
    )
  )
)

write_csv(output_examples, file.path(analysis_tables_dir, "analysis_prediction_output_examples.csv"))
knitr::kable(output_examples, caption = "Examples of output files stored under the current rep/fold layout, together with the scenario-specific prefixes used for the persistent BGLR artifacts.", align = "c")
Examples of output files stored under the current rep/fold layout, together with the scenario-specific prefixes used for the persistent BGLR artifacts.
ExampleType relative_path bglr_example Meaning
CV1/CV2 CV1/GY/rep01/fold01/CV1_GY_Eta1_rep01_fold01.csv bglr_runs/CV1/GY/rep01/fold01/Eta1/CV1_GY_Eta1_rep01_fold01_GE_RKHS_varU.dat Prediction accuracy for CV1, trait GY, model Eta1, repetition 1, fold 1.
CV1/CV2 CV2/KW/rep10/fold05/CV2_KW_Eta18_rep10_fold05.csv bglr_runs/CV2/KW/rep10/fold05/Eta18/CV2_KW_Eta18_rep10_fold05_PAKW_RKHS_varU.dat Prediction accuracy for CV2, trait KW, model Eta18, repetition 10, fold 5.
CV0/CV00 CV0/CS11_WS/GY/rep01/fold01/CV0_GY_CS11_WS_Eta1_rep01_fold01.csv bglr_runs/CV0/CS11_WS/GY/rep01/fold01/Eta1/CV0_CS11_WS_GY_Eta1_rep01_fold01_GW_RKHS_varU.dat Prediction accuracy for CV0, trait GY, held-out environment CS11_WS, model Eta1, repetition 1, fold 1.
CV0/CV00 CV00/CS12_WW/KW/rep10/fold02/CV00_KW_CS12_WW_Eta18_rep10_fold02.csv bglr_runs/CV00/CS12_WW/KW/rep10/fold02/Eta18/CV00_CS12_WW_KW_Eta18_rep10_fold02_PAKW_RKHS_varU.dat Prediction accuracy for CV00, trait KW, held-out environment CS12_WW, model Eta18, repetition 10, fold 2.

12. Note on temporary BGLR artifacts

In the current 18-model workflow, persistent saveAt artifacts are written under output/results/bglr_runs/ during each fit and preserved for later convergence diagnostics. The scenario-specific paths now separate CV1, CV2, CV0, and CV00, and the persisted filenames use the scenario prefix together with explicit effect labels instead of generic ETA_1, ETA_2, and similar placeholders. Therefore, this execution module inventories both the final CSV outputs stored under output/results/ and the hierarchical paths that locate the corresponding BGLR run folders.

13. Final note

This file closes the execution layer of the prediction stage while preserving consistency with the reduced 18-model preprocessing module documented in analysis_prediction.Rmd and with the current rep/fold output contract used by the downstream summaries.

Stage navigation

Previous stage: Prediction — Part I
Next stage: Visualization


sessionInfo()
R version 4.5.1 (2025-06-13 ucrt)
Platform: x86_64-w64-mingw32/x64
Running under: Windows 11 x64 (build 26200)

Matrix products: default
  LAPACK version 3.12.1

locale:
[1] LC_COLLATE=Portuguese_Brazil.utf8  LC_CTYPE=Portuguese_Brazil.utf8   
[3] LC_MONETARY=Portuguese_Brazil.utf8 LC_NUMERIC=C                      
[5] LC_TIME=Portuguese_Brazil.utf8    

time zone: America/Sao_Paulo
tzcode source: internal

attached base packages:
[1] parallel  stats     graphics  grDevices utils     datasets  methods  
[8] base     

other attached packages:
[1] tidyr_1.3.2       doParallel_1.0.17 iterators_1.0.14  foreach_1.5.2    
[5] BGLR_1.1.4        knitr_1.50        stringr_1.6.0     dplyr_1.2.0      
[9] readr_2.2.0      

loaded via a namespace (and not attached):
 [1] sass_0.4.10       generics_0.1.4    stringi_1.8.7     hms_1.1.4        
 [5] digest_0.6.39     magrittr_2.0.4    evaluate_1.0.5    fastmap_1.2.0    
 [9] rprojroot_2.1.1   workflowr_1.7.2   jsonlite_2.0.0    whisker_0.4.1    
[13] promises_1.5.0    purrr_1.2.1       truncnorm_1.0-9   codetools_0.2-20 
[17] jquerylib_0.1.4   cli_3.6.5         rlang_1.1.7       crayon_1.5.3     
[21] bit64_4.6.0-1     cachem_1.1.0      yaml_2.3.12       otel_0.2.0       
[25] tools_4.5.1       tzdb_0.5.0        httpuv_1.6.16     vctrs_0.7.1      
[29] R6_2.6.1          lifecycle_1.0.5   git2r_0.36.2      fs_1.6.6         
[33] bit_4.6.0         vroom_1.7.0       MASS_7.3-65       pkgconfig_2.0.3  
[37] pillar_1.11.1     bslib_0.9.0       later_1.4.4       glue_1.8.0       
[41] Rcpp_1.1.1        xfun_0.54         tibble_3.3.1      tidyselect_1.2.1 
[45] rstudioapi_0.17.1 htmltools_0.5.9   rmarkdown_2.30    compiler_4.5.1   

  1. Weverton Gomes da Costa, Postdoctoral Researcher, Department of Statistics - Universidade Federal de Viçosa, ↩︎

---
title: "Prediction analyses - run prediction and outputs"
author:
  - Costa, W. G.^[Weverton Gomes da Costa, Postdoctoral Researcher, Department of Statistics - Universidade Federal de Viçosa, wevertonufv@gmail.com]
date: "`r Sys.Date()`"
site: workflowr::wflow_site
url: https://wevertongomescosta.github.io/Integrating-nir-genomic-kernel/
output:
  workflowr::wflow_html:
    toc: true
    toc_depth: 3
    toc_float: true
    code_download: true
editor_options:
  chunk_output_type: console
github-repo: WevertonGomesCosta/Integrating-nir-genomic-kernel
---

**Language / Idioma:** [English](analysis_prediction_run_outputs.html) | [Português](analysis_prediction_run_outputs_pt.html)

**Module navigation / Navegação do módulo:** [Part I: setup, models and CV design](analysis_prediction.html)

## 1. Introduction

This submodule is the execution counterpart of `analysis_prediction.Rmd`. Its role is to run the prediction jobs for the reduced catalog of 18 models, save the CSV files under `output/results/`, and inventory the outputs that will later be consumed by the visualization stage.

The execution contract used here has four operational rules:

- `CV1` and `CV2` are derived from the same fit within each fold;
- `CV0` and `CV00` are derived from the same fit for each held-out environment within each fold;
- the fold partition is deterministic for each `MODEL × REP × TRAIT` task and is reused in both CV blocks;
- output files are organized explicitly by validation scheme, environment when applicable, trait, repetition, and fold, with the fold represented both in the directory tree and in the file name.

The full fitting stage is preserved here for completeness. When the prediction stage needs to be rerun, the code shown below is the operational version used by the pipeline.

## 2. Packages and project paths

The execution module needs both the matrix objects created upstream and the packages used by the parallel fitting layer. The path logic below keeps the file runnable from either the project root or the `analysis/` directory.


```{r setup}
rm(list = ls())
options(scipen = 999)

library(readr)
library(dplyr)
library(stringr)
library(knitr)
library(BGLR)
library(foreach)
library(doParallel)
library(parallel)
library(tidyr)

project_root <- getwd()

if (!dir.exists(file.path(project_root, "output")) &&
    dir.exists(file.path(project_root, "..", "output"))) {
  project_root <- normalizePath(file.path(project_root, ".."))
}

if (!dir.exists(file.path(project_root, "output"))) {
  stop("Could not find the folder 'output/'. Render this file from the project root or from analysis/.")
}

results_dir <- file.path(project_root, "output", "results")
analysis_tables_dir <- file.path(project_root, "output", "tables", "analysis_prediction")
bglr_runs_dir <- file.path(results_dir, "bglr_runs")

dir.create(results_dir, recursive = TRUE, showWarnings = FALSE)
dir.create(analysis_tables_dir, recursive = TRUE, showWarnings = FALSE)
dir.create(bglr_runs_dir, recursive = TRUE, showWarnings = FALSE)
```


## 3. Reload the execution objects from Part I

This section rebuilds the execution environment when the file is rendered on its own. It follows the contract of `analysis_prediction.Rmd` and recreates the reduced set of 18 ETA objects exactly as defined in Part I.

```{r load-required-objects, eval=FALSE}
matrices_dir <- file.path(project_root, "output", "Matrizes")

required_matrix_files <- c(
  "Pheno.rds",
  "ZG.rds", "ZP.rds", "ZE.rds", "ZW.rds",
  "ZGZE.rds", "ZPZE.rds", "ZGZW.rds", "ZPZW.rds",
  "GGK.rds", "PGK.rds", "GAK.rds", "PAK.rds",
  "GGKE.rds", "PGKE.rds", "GAKE.rds", "PAKE.rds",
  "GGKW.rds", "PGKW.rds", "GAKW.rds", "PAKW.rds"
)

required_matrix_paths <- file.path(matrices_dir, required_matrix_files)

if (!all(file.exists(required_matrix_paths))) {
  missing_matrix_files <- required_matrix_files[!file.exists(required_matrix_paths)]
  stop(
    paste0(
      "The execution objects required by analysis_prediction_run_outputs were not found in output/Matrizes/. Missing files: ",
      paste(missing_matrix_files, collapse = ", ")
    )
  )
}

Pheno <- readRDS(file.path(matrices_dir, "Pheno.rds"))
ZG   <- readRDS(file.path(matrices_dir, "ZG.rds"))
ZP   <- readRDS(file.path(matrices_dir, "ZP.rds"))
ZE   <- readRDS(file.path(matrices_dir, "ZE.rds"))
ZW   <- readRDS(file.path(matrices_dir, "ZW.rds"))
ZGZE <- readRDS(file.path(matrices_dir, "ZGZE.rds"))
ZPZE <- readRDS(file.path(matrices_dir, "ZPZE.rds"))
ZGZW <- readRDS(file.path(matrices_dir, "ZGZW.rds"))
ZPZW <- readRDS(file.path(matrices_dir, "ZPZW.rds"))
GGK  <- readRDS(file.path(matrices_dir, "GGK.rds"))
PGK  <- readRDS(file.path(matrices_dir, "PGK.rds"))
GAK  <- readRDS(file.path(matrices_dir, "GAK.rds"))
PAK  <- readRDS(file.path(matrices_dir, "PAK.rds"))
GGKE <- readRDS(file.path(matrices_dir, "GGKE.rds"))
PGKE <- readRDS(file.path(matrices_dir, "PGKE.rds"))
GAKE <- readRDS(file.path(matrices_dir, "GAKE.rds"))
PAKE <- readRDS(file.path(matrices_dir, "PAKE.rds"))
GGKW <- readRDS(file.path(matrices_dir, "GGKW.rds"))
PGKW <- readRDS(file.path(matrices_dir, "PGKW.rds"))
GAKW <- readRDS(file.path(matrices_dir, "GAKW.rds"))
PAKW <- readRDS(file.path(matrices_dir, "PAKW.rds"))

if (!exists("Pedigree")) Pedigree <- unique(Pheno$Pedigree)

candidate_traits <- c("GY", "KW")
traits_to_run <- candidate_traits[candidate_traits %in% names(Pheno)]
if (length(traits_to_run) == 0) {
  stop("None of the expected traits (GY, KW) were found in Pheno.")
}

if (!exists("nIter"))   nIter   <- 5000
if (!exists("burnIn"))  burnIn  <- 1000
if (!exists("thin"))    thin    <- 10
if (!exists("num_rep")) num_rep <- 10
if (!exists("n_cores")) n_cores <- max(1L, parallel::detectCores() - 1L)

model_catalog <- data.frame(
  ModelNumber = 1:18,
  Model = paste0("Eta", 1:18),
  Description = c(
    "E + G + GE",
    "E + P + PE",
    "E + GGK + GGKE",
    "E + PGK + PGKE",
    "E + GAK + GAKE",
    "E + PAK + PAKE",
    "E + G + P + GE + PE",
    "E + GGK + PGK + GGKE + PGKE",
    "E + GAK + PAK + GAKE + PAKE",
    "E + W + G + GE + GW",
    "E + W + P + PE + PW",
    "E + W + G + P + GE + PE + GW + PW",
    "E + W + GGK + GGKE + GGKW",
    "E + W + PGK + PGKE + PGKW",
    "E + W + GGK + PGK + GGKE + PGKE + GGKW + PGKW",
    "E + W + GAK + GAKE + GAKW",
    "E + W + PAK + PAKE + PAKW",
    "E + W + GAK + PAK + GAKE + PAKE + GAKW + PAKW"
  ),
  stringsAsFactors = FALSE
)

env_design <- model.matrix(~ Pheno$Env - 1)

Eta_list <- vector("list", 18)
Eta_list[[1]]  <- list(list(X = env_design, model = "BRR"), list(K = ZG, model = "RKHS"), list(K = ZGZE, model = "RKHS"))
Eta_list[[2]]  <- list(list(X = env_design, model = "BRR"), list(K = ZP, model = "RKHS"), list(K = ZPZE, model = "RKHS"))
Eta_list[[3]]  <- list(list(X = env_design, model = "BRR"), list(K = GGK, model = "RKHS"), list(K = GGKE, model = "RKHS"))
Eta_list[[4]]  <- list(list(X = env_design, model = "BRR"), list(K = PGK, model = "RKHS"), list(K = PGKE, model = "RKHS"))
Eta_list[[5]]  <- list(list(X = env_design, model = "BRR"), list(K = GAK, model = "RKHS"), list(K = GAKE, model = "RKHS"))
Eta_list[[6]]  <- list(list(X = env_design, model = "BRR"), list(K = PAK, model = "RKHS"), list(K = PAKE, model = "RKHS"))
Eta_list[[7]]  <- list(list(X = env_design, model = "BRR"), list(K = ZG, model = "RKHS"), list(K = ZP, model = "RKHS"), list(K = ZGZE, model = "RKHS"), list(K = ZPZE, model = "RKHS"))
Eta_list[[8]]  <- list(list(X = env_design, model = "BRR"), list(K = GGK, model = "RKHS"), list(K = PGK, model = "RKHS"), list(K = GGKE, model = "RKHS"), list(K = PGKE, model = "RKHS"))
Eta_list[[9]]  <- list(list(X = env_design, model = "BRR"), list(K = GAK, model = "RKHS"), list(K = PAK, model = "RKHS"), list(K = GAKE, model = "RKHS"), list(K = PAKE, model = "RKHS"))
Eta_list[[10]] <- list(list(X = env_design, model = "BRR"), list(K = ZW, model = "RKHS"), list(K = ZG, model = "RKHS"), list(K = ZGZE, model = "RKHS"), list(K = ZGZW, model = "RKHS"))
Eta_list[[11]] <- list(list(X = env_design, model = "BRR"), list(K = ZW, model = "RKHS"), list(K = ZP, model = "RKHS"), list(K = ZPZE, model = "RKHS"), list(K = ZPZW, model = "RKHS"))
Eta_list[[12]] <- list(list(X = env_design, model = "BRR"), list(K = ZW, model = "RKHS"), list(K = ZG, model = "RKHS"), list(K = ZP, model = "RKHS"), list(K = ZGZE, model = "RKHS"), list(K = ZPZE, model = "RKHS"), list(K = ZGZW, model = "RKHS"), list(K = ZPZW, model = "RKHS"))
Eta_list[[13]] <- list(list(X = env_design, model = "BRR"), list(K = ZW, model = "RKHS"), list(K = GGK, model = "RKHS"), list(K = GGKE, model = "RKHS"), list(K = GGKW, model = "RKHS"))
Eta_list[[14]] <- list(list(X = env_design, model = "BRR"), list(K = ZW, model = "RKHS"), list(K = PGK, model = "RKHS"), list(K = PGKE, model = "RKHS"), list(K = PGKW, model = "RKHS"))
Eta_list[[15]] <- list(list(X = env_design, model = "BRR"), list(K = ZW, model = "RKHS"), list(K = GGK, model = "RKHS"), list(K = PGK, model = "RKHS"), list(K = GGKE, model = "RKHS"), list(K = PGKE, model = "RKHS"), list(K = GGKW, model = "RKHS"), list(K = PGKW, model = "RKHS"))
Eta_list[[16]] <- list(list(X = env_design, model = "BRR"), list(K = ZW, model = "RKHS"), list(K = GAK, model = "RKHS"), list(K = GAKE, model = "RKHS"), list(K = GAKW, model = "RKHS"))
Eta_list[[17]] <- list(list(X = env_design, model = "BRR"), list(K = ZW, model = "RKHS"), list(K = PAK, model = "RKHS"), list(K = PAKE, model = "RKHS"), list(K = PAKW, model = "RKHS"))
Eta_list[[18]] <- list(list(X = env_design, model = "BRR"), list(K = ZW, model = "RKHS"), list(K = GAK, model = "RKHS"), list(K = PAK, model = "RKHS"), list(K = GAKE, model = "RKHS"), list(K = PAKE, model = "RKHS"), list(K = GAKW, model = "RKHS"), list(K = PAKW, model = "RKHS"))

```


## 4. Define the execution contract

This block restores the operational objects used by the current workflow: the fixed leave-one-environment-out set, the number of folds, and the `MODEL × REP × TRAIT` task grid for the 18-model catalog. The fitting and summarization logic are written directly inside the CV blocks so that the execution path stays explicit.

```{r define-run-objects, eval=FALSE}
envs_leaveout <- c("CS11_WS", "CS11_WW", "CS12_WS", "CS12_WW")
envs_leaveout <- envs_leaveout[envs_leaveout %in% unique(Pheno$Env)]

num_fold <- 5L
n_hybrids <- length(Pedigree)
combos <- expand.grid(
  MODEL = seq_along(Eta_list),
  REP = seq_len(num_rep),
  TRAIT = traits_to_run,
  stringsAsFactors = FALSE
)
```

## 5. Start the parallel backend

The backend below mirrors the original workflow: one worker receives one combination of model, repetition, and trait.

```{r start-parallel-backend, eval=FALSE}
useCores <- max(1L, n_cores)
cl <- parallel::makeCluster(useCores)
doParallel::registerDoParallel(cl)

parallel::clusterEvalQ(cl, {
  library(BGLR)
  library(dplyr)
  library(plyr)
  NULL
})

cat("Running on ", useCores, " cores.")
```


## 6. Run prediction analyses for CV1 and CV2

This block implements genotype-based validation for the 18-model workflow. The deterministic fold map, the BGLR fit, and the environment-wise summary are all written directly inside the loop so that each step stays explicit. One fit is run per fold, and the resulting predictions are split into `CV1` (held-out genotypes) and `CV2` (training genotypes under the same fit). In the current standard, both the final CSV summaries and the BGLR fitting artifacts are stored in a rep/fold hierarchy.

```{r run-predictions-cv12, eval=FALSE}
foreach(i = seq_len(nrow(combos)),
        .packages = c("BGLR", "dplyr", "plyr")) %dopar% {
  MODEL <- combos$MODEL[i]
  rep_num <- combos$REP[i]
  TRAIT <- as.character(combos$TRAIT[i])

  pheno <- data.frame(Pheno)

  trait_id <- match(TRAIT, trait_reference_order)
  if (is.na(trait_id)) {
    trait_id <- 1L
  }

  seed_value <- as.integer(100000L * MODEL + 100L * rep_num + trait_id)
  old_seed_exists <- exists(".Random.seed", envir = .GlobalEnv, inherits = FALSE)
  if (old_seed_exists) {
    old_seed <- get(".Random.seed", envir = .GlobalEnv, inherits = FALSE)
  }

  set.seed(seed_value)
  obs_per_fold <- ceiling(length(Pedigree) / num_fold)
  folds <- sample(rep(seq_len(num_fold), each = obs_per_fold, length.out = length(Pedigree)))

  if (old_seed_exists) {
    assign(".Random.seed", old_seed, envir = .GlobalEnv)
  } else if (exists(".Random.seed", envir = .GlobalEnv, inherits = FALSE)) {
    rm(".Random.seed", envir = .GlobalEnv)
  }

  folds_df <- data.frame(Pedigree = Pedigree, fold = folds, stringsAsFactors = FALSE)

  for (fold_num in seq_len(num_fold)) {
    rep_label <- sprintf("rep%02d", rep_num)
    fold_label <- sprintf("fold%02d", fold_num)

    cv1_dir <- file.path(results_dir, "CV1", TRAIT, rep_label, fold_label)
    cv2_dir <- file.path(results_dir, "CV2", TRAIT, rep_label, fold_label)
    dir.create(cv1_dir, recursive = TRUE, showWarnings = FALSE)
    dir.create(cv2_dir, recursive = TRUE, showWarnings = FALSE)

    cv1_file <- file.path(cv1_dir, paste0("CV1_", TRAIT, "_Eta", MODEL, "_", rep_label, "_", fold_label, ".csv"))
    cv2_file <- file.path(cv2_dir, paste0("CV2_", TRAIT, "_Eta", MODEL, "_", rep_label, "_", fold_label, ".csv"))

    if (file.exists(cv1_file) && file.exists(cv2_file)) {
      next
    }

    train_geno <- folds_df$Pedigree[folds_df$fold != fold_num]

    yield <- pheno
    yield$Y2 <- NA
    yield$Y2[yield$Pedigree %in% train_geno] <- yield[[TRAIT]][yield$Pedigree %in% train_geno]

    run_dir <- file.path(bglr_runs_dir, "CV1", TRAIT, rep_label, fold_label, paste0("Eta", MODEL))
    dir.create(run_dir, recursive = TRUE, showWarnings = FALSE)

    save_prefix_full <- file.path(run_dir, paste0("CV1_", TRAIT, "_Eta", MODEL, "_", rep_label, "_", fold_label, "_"))

    y_t <- as.numeric(yield$Y2)
    start_time <- Sys.time()
    fit <- BGLR(y = y_t, ETA = Eta_list[[MODEL]], nIter = nIter, burnIn = burnIn, thin = thin, saveAt = save_prefix_full)
    end_time <- Sys.time()
    runtime_full <- as.numeric(difftime(end_time, start_time, units = "secs"))

    yield$yhat <- fit$yHat
    rm(fit)
    gc()

    df_test <- yield[!yield$Pedigree %in% train_geno, ]
    df_train <- yield[yield$Pedigree %in% train_geno, ]

    tmp1 <- df_test %>%
      dplyr::group_by(Env) %>%
      dplyr::summarize(cor = cor(.data[[TRAIT]], yhat, use = "complete.obs"), .groups = "drop") %>%
      as.data.frame()

    tmp2 <- df_train %>%
      dplyr::group_by(Env) %>%
      dplyr::summarize(cor = cor(.data[[TRAIT]], yhat, use = "complete.obs"), .groups = "drop") %>%
      as.data.frame()

    tmp1$Runtime_sec <- runtime_full
    tmp2$Runtime_sec <- runtime_full
    tmp1$Fold <- fold_label
    tmp2$Fold <- fold_label

    write.csv(tmp1, file = cv1_file, row.names = FALSE)
    write.csv(tmp2, file = cv2_file, row.names = FALSE)
  }

  NULL
}
```
## 7. Run prediction analyses for CV0 and CV00

This block implements leave-one-environment-out validation for the same 18-model workflow. For each task, the deterministic fold map is rebuilt with the same rule used in `CV1/CV2`, one fit is run for each `env_leave × fold` combination, and the predictions are split into `CV0` (training genotypes under the environment mask) and `CV00` (held-out genotypes under the same mask). As in the genotype-based block, the fold construction, the BGLR fit, and the environment-wise summary are all written directly inside the loop so that the execution path remains explicit. The outputs are written in a rep/fold hierarchy and the BGLR artifacts are preserved in matching persistent folders.

```{r run-predictions-cv0-cv00, eval=FALSE}
foreach(i = seq_len(nrow(combos)),
        .packages = c("BGLR", "dplyr", "plyr")) %dopar% {
  MODEL <- combos$MODEL[i]
  rep_num <- combos$REP[i]
  TRAIT <- as.character(combos$TRAIT[i])

  pheno <- data.frame(Pheno)

  trait_id <- match(TRAIT, trait_reference_order)
  if (is.na(trait_id)) {
    trait_id <- 1L
  }

  seed_value <- as.integer(100000L * MODEL + 100L * rep_num + trait_id)
  old_seed_exists <- exists(".Random.seed", envir = .GlobalEnv, inherits = FALSE)
  if (old_seed_exists) {
    old_seed <- get(".Random.seed", envir = .GlobalEnv, inherits = FALSE)
  }

  set.seed(seed_value)
  obs_per_fold <- ceiling(length(Pedigree) / num_fold)
  folds <- sample(rep(seq_len(num_fold), each = obs_per_fold, length.out = length(Pedigree)))

  if (old_seed_exists) {
    assign(".Random.seed", old_seed, envir = .GlobalEnv)
  } else if (exists(".Random.seed", envir = .GlobalEnv, inherits = FALSE)) {
    rm(".Random.seed", envir = .GlobalEnv)
  }

  folds_df <- data.frame(Pedigree = Pedigree, fold = folds, stringsAsFactors = FALSE)

  for (fold_num in seq_len(num_fold)) {
    train_geno <- folds_df$Pedigree[folds_df$fold != fold_num]
    rep_label <- sprintf("rep%02d", rep_num)
    fold_label <- sprintf("fold%02d", fold_num)

    for (env_idx in seq_along(envs_leaveout)) {
      env_name <- envs_leaveout[env_idx]

      cv00_dir <- file.path(results_dir, "CV00", env_name, TRAIT, rep_label, fold_label)
      cv0_dir <- file.path(results_dir, "CV0", env_name, TRAIT, rep_label, fold_label)
      dir.create(cv00_dir, recursive = TRUE, showWarnings = FALSE)
      dir.create(cv0_dir, recursive = TRUE, showWarnings = FALSE)

      cv00_file <- file.path(cv00_dir, paste0("CV00_", TRAIT, "_", env_name, "_Eta", MODEL, "_", rep_label, "_", fold_label, ".csv"))
      cv0_file <- file.path(cv0_dir, paste0("CV0_", TRAIT, "_", env_name, "_Eta", MODEL, "_", rep_label, "_", fold_label, ".csv"))

      if (file.exists(cv00_file) && file.exists(cv0_file)) {
        next
      }

      yield2 <- pheno
      yield2$Y2 <- NA
      yield2$Y2[yield2$Pedigree %in% train_geno] <- yield2[[TRAIT]][yield2$Pedigree %in% train_geno]
      yield2$Y2[yield2$Env == env_name] <- NA

      run_dir <- file.path(bglr_runs_dir, "CV0", env_name, TRAIT, rep_label, fold_label, paste0("Eta", MODEL))
      dir.create(run_dir, recursive = TRUE, showWarnings = FALSE)

      save_prefix_env <- file.path(run_dir, paste0("CV0_", env_name, "_", TRAIT, "_Eta", MODEL, "_", rep_label, "_", fold_label, "_"))

      y_t <- as.numeric(yield2$Y2)
      start_time <- Sys.time()
      fit <- BGLR(y = y_t, ETA = Eta_list[[MODEL]], nIter = nIter, burnIn = burnIn, thin = thin, saveAt = save_prefix_env)
      end_time <- Sys.time()
      runtime_env <- as.numeric(difftime(end_time, start_time, units = "secs"))

      yield2$yhat <- fit$yHat
      rm(fit)
      gc()

      df_test2 <- yield2[!yield2$Pedigree %in% train_geno & yield2$Env == env_name, , drop = FALSE]
      df_train2 <- yield2[yield2$Pedigree %in% train_geno & yield2$Env == env_name, , drop = FALSE]

      tmp00 <- df_test2 %>%
        dplyr::group_by(Env) %>%
        dplyr::summarize(cor = cor(.data[[TRAIT]], yhat, use = "complete.obs"), .groups = "drop") %>%
        as.data.frame()

      tmp0 <- df_train2 %>%
        dplyr::group_by(Env) %>%
        dplyr::summarize(cor = cor(.data[[TRAIT]], yhat, use = "complete.obs"), .groups = "drop") %>%
        as.data.frame()

      tmp00$Runtime_sec <- runtime_env
      tmp0$Runtime_sec <- runtime_env
      tmp00$Fold <- fold_label
      tmp0$Fold <- fold_label

      write.csv(tmp00, file = cv00_file, row.names = FALSE)
      write.csv(tmp0, file = cv0_file, row.names = FALSE)
    }
  }

  NULL
}
```
## 8. Stop the parallel backend

```{r stop-parallel-backend, eval=FALSE}
stopCluster(cl)
foreach::registerDoSEQ()
```


## 9. Inventory of the prediction output files

The inventories below summarize the CSV files produced by the execution stage using the current rep/fold folder contract for the 18-model workflow.

```{r prediction-output-inventory}
prediction_files <- character(0)

if (dir.exists(results_dir)) {
  prediction_files <- list.files(results_dir, pattern = "\\.csv$", full.names = TRUE, recursive = TRUE)
}

inventory_cv12 <- data.frame()
inventory_cvloo <- data.frame()

if (length(prediction_files) > 0) {
  results_root <- normalizePath(results_dir, winslash = "/")

  for (current_file in prediction_files) {
    current_norm <- normalizePath(current_file, winslash = "/")
    rel_path <- substring(current_norm, nchar(results_root) + 2)
    path_parts <- strsplit(rel_path, "/", fixed = TRUE)[[1]]
    current_name <- basename(current_file)
    current_name_no_ext <- sub("\\.csv$", "", current_name)
    current_parts <- strsplit(current_name_no_ext, "_")[[1]]

    if (length(path_parts) >= 5 && current_parts[1] %in% c("CV1", "CV2")) {
      inventory_cv12 <- dplyr::bind_rows(
        inventory_cv12,
        data.frame(
          relative_path = rel_path,
          file_name = current_name,
          CV = current_parts[1],
          Trait = path_parts[2],
          Rep = path_parts[3],
          Fold = path_parts[4],
          Model = current_parts[3],
          stringsAsFactors = FALSE
        )
      )
    }

    if (length(path_parts) >= 6 && current_parts[1] %in% c("CV0", "CV00")) {
      inventory_cvloo <- dplyr::bind_rows(
        inventory_cvloo,
        data.frame(
          relative_path = rel_path,
          file_name = current_name,
          CV = current_parts[1],
          Env_leave = path_parts[2],
          Trait = path_parts[3],
          Rep = path_parts[4],
          Fold = path_parts[5],
          Model = current_parts[length(current_parts) - 2],
          stringsAsFactors = FALSE
        )
      )
    }
  }
}

write_csv(inventory_cv12, file.path(analysis_tables_dir, "analysis_prediction_inventory_cv12.csv"))
write_csv(inventory_cvloo, file.path(analysis_tables_dir, "analysis_prediction_inventory_cvloo.csv"))
```
## 10. Counts of files by validation scheme and trait

```{r prediction-output-counts}
counts_by_cv <- data.frame()
counts_by_trait <- data.frame()

if (nrow(inventory_cv12) > 0 || nrow(inventory_cvloo) > 0) {
  combined_inventory <- bind_rows(
    inventory_cv12 %>% mutate(Env_leave = NA_character_),
    inventory_cvloo
  )

  counts_by_cv <- combined_inventory %>%
    count(CV, name = "n_files") %>%
    arrange(CV)

  counts_by_trait <- combined_inventory %>%
    count(CV, Trait, name = "n_files") %>%
    arrange(CV, Trait)
}

write_csv(counts_by_cv, file.path(analysis_tables_dir, "analysis_prediction_counts_by_cv.csv"))
write_csv(counts_by_trait, file.path(analysis_tables_dir, "analysis_prediction_counts_by_trait.csv"))

if (nrow(counts_by_cv) > 0) {
  knitr::kable(counts_by_cv, caption = "Number of prediction files by validation scheme.", align = "cc")
} else {
  cat("Prediction counts by validation scheme are not available yet.
")
}
```


## 11. Examples of the output path pattern

```{r prediction-output-examples}
output_examples <- bind_rows(
  tibble::tibble(
    ExampleType = "CV1/CV2",
    relative_path = c(
      "CV1/GY/rep01/fold01/CV1_GY_Eta1_rep01_fold01.csv",
      "CV2/KW/rep10/fold05/CV2_KW_Eta18_rep10_fold05.csv"
    ),
    bglr_example = c(
      "bglr_runs/CV1/GY/rep01/fold01/Eta1/CV1_GY_Eta1_rep01_fold01_GE_RKHS_varU.dat",
      "bglr_runs/CV2/KW/rep10/fold05/Eta18/CV2_KW_Eta18_rep10_fold05_PAKW_RKHS_varU.dat"
    ),
    Meaning = c(
      "Prediction accuracy for CV1, trait GY, model Eta1, repetition 1, fold 1.",
      "Prediction accuracy for CV2, trait KW, model Eta18, repetition 10, fold 5."
    )
  ),
  tibble::tibble(
    ExampleType = "CV0/CV00",
    relative_path = c(
      "CV0/CS11_WS/GY/rep01/fold01/CV0_GY_CS11_WS_Eta1_rep01_fold01.csv",
      "CV00/CS12_WW/KW/rep10/fold02/CV00_KW_CS12_WW_Eta18_rep10_fold02.csv"
    ),
    bglr_example = c(
      "bglr_runs/CV0/CS11_WS/GY/rep01/fold01/Eta1/CV0_CS11_WS_GY_Eta1_rep01_fold01_GW_RKHS_varU.dat",
      "bglr_runs/CV00/CS12_WW/KW/rep10/fold02/Eta18/CV00_CS12_WW_KW_Eta18_rep10_fold02_PAKW_RKHS_varU.dat"
    ),
    Meaning = c(
      "Prediction accuracy for CV0, trait GY, held-out environment CS11_WS, model Eta1, repetition 1, fold 1.",
      "Prediction accuracy for CV00, trait KW, held-out environment CS12_WW, model Eta18, repetition 10, fold 2."
    )
  )
)

write_csv(output_examples, file.path(analysis_tables_dir, "analysis_prediction_output_examples.csv"))
knitr::kable(output_examples, caption = "Examples of output files stored under the current rep/fold layout, together with the scenario-specific prefixes used for the persistent BGLR artifacts.", align = "c")
```
## 12. Note on temporary `BGLR` artifacts

In the current 18-model workflow, persistent `saveAt` artifacts are written under `output/results/bglr_runs/` during each fit and preserved for later convergence diagnostics. The scenario-specific paths now separate `CV1`, `CV2`, `CV0`, and `CV00`, and the persisted filenames use the scenario prefix together with explicit effect labels instead of generic `ETA_1`, `ETA_2`, and similar placeholders. Therefore, this execution module inventories both the final CSV outputs stored under `output/results/` and the hierarchical paths that locate the corresponding BGLR run folders.

## 13. Final note

This file closes the execution layer of the prediction stage while preserving consistency with the reduced 18-model preprocessing module documented in `analysis_prediction.Rmd` and with the current rep/fold output contract used by the downstream summaries.

## Stage navigation

**Previous stage:** [Prediction — Part I](analysis_prediction.html)  
**Next stage:** [Visualization](visualization.html)

