Previous step

  1. Prepare a training dataset: Download all 2018-2020 trials afresh, combine them with the rest of the data, and re-run the “cleaning” and formatting pipeline.

Detect experimental designs

rm(list = ls())
dbdata <- readRDS(here::here("output", "IITA_CleanedTrialData_2020Dec03.rds"))
source(here::here("code", "gsFunctions.R"))
dbdata <- nestByTrials(dbdata)

The next step is to check the experimental design of each trial. If you are absolutely certain how the design variables are used in your dataset, you might not need this step.

Examples of reasons to do the step below:

One reason it is important to get this right is that the variance among complete blocks might differ from the variance among incomplete blocks. If we treat a mixture of complete and incomplete blocks as levels of the same random effect (replicate-within-trial), we assume they share a single variance.

Error variances might also be heterogeneous among trial types (which differ in the blocking scheme available) and/or plot sizes (MaxNOHAV).
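To make the complete vs. incomplete distinction concrete, here is a minimal sketch of one way a design could be guessed from the rep and block labels of a single trial. The helper `guess_design()`, its `min_prop` threshold and the toy data are hypothetical; this is not the logic of detectExptDesigns() in gsFunctions.R, which is not shown here.

```r
# Toy trial: 4 clones, 2 reps, each rep split into 2 blocks of 2 plots
trial <- data.frame(
  germplasmName = rep(c("A", "B", "C", "D"), times = 2),
  repInTrial    = rep(c("1", "2"), each = 4),
  blockInRep    = c("1", "1", "2", "2", "1", "1", "2", "2")
)

# Hypothetical helper (an assumption, NOT detectExptDesigns()):
# - "complete" if there is more than one rep and each rep holds nearly all clones
# - "incomplete" if blocks subdivide the reps and each block holds only a subset
guess_design <- function(df, min_prop = 0.9) {
  n_clones <- length(unique(df$germplasmName))
  # proportion of all clones present in each replicate
  prop_in_rep <- tapply(df$germplasmName, df$repInTrial,
                        function(g) length(unique(g)) / n_clones)
  complete <- length(prop_in_rep) > 1 && all(prop_in_rep >= min_prop)
  # block labels repeat across reps, so build a unique rep-by-block ID first
  block_id <- interaction(df$repInTrial, df$blockInRep, drop = TRUE)
  prop_in_block <- tapply(df$germplasmName, block_id,
                          function(g) length(unique(g)) / n_clones)
  incomplete <- nlevels(block_id) > length(prop_in_rep) &&
    all(prop_in_block < min_prop)
  c(CompleteBlocks = complete, IncompleteBlocks = incomplete)
}

guess_design(trial)
```

In this toy trial both flags come out TRUE: the reps are complete blocks and the blocks within them are incomplete, which is the case where rep and block effects should get separate variances.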

Detect designs

dbdata <- detectExptDesigns(dbdata)
dbdata %>% count(programName, CompleteBlocks, IncompleteBlocks) %>% rmarkdown::paged_table()

–> Save output file

saveRDS(dbdata, file = here::here("output", "IITA_ExptDesignsDetected_2020Dec03.rds"))

Get multi-trial BLUPs from raw data (two-stage)

Two-stage procedure:

  1. Fit mixed-model to multi-trial dataset and extract BLUPs, de-regressed BLUPs and weights. Include two rounds of outlier removal.
  2. Genomic prediction with drg-BLUPs from multi-trial analysis as input.

Work below represents Stage 1 of the Two-stage procedure.

Set-up training datasets

# activate multithread OpenBLAS for fast matrix algebra
library(tidyverse); library(magrittr)
traits <- c(...,
          "logDYLD", # <-- logDYLD now included.
          ...)

Nest by trait. The data are currently nested per trial and need to be regrouped so that each row holds all multi-trial records for one trait.

dbdata %<>%
    dplyr::select(-MaxNOHAV) %>%
    unnest(TrialData) %>%
    dplyr::select(programName, locationName, studyYear, TrialType, studyName,
        CompleteBlocks, IncompleteBlocks, yearInLoc, trialInLocYr, repInTrial,
        blockInRep, observationUnitDbId, germplasmName, FullSampleName, GID,
        all_of(traits), PropNOHAV) %>%
    mutate(IncompleteBlocks = ifelse(IncompleteBlocks == TRUE, "Yes", "No"),
        CompleteBlocks = ifelse(CompleteBlocks == TRUE, "Yes", "No")) %>%
    pivot_longer(cols = all_of(traits), names_to = "Trait", values_to = "Value") %>%
    # drop records with a missing phenotype or a missing genotype ID
    filter(!is.na(Value), !is.na(GID)) %>%
    nest(MultiTrialTraitData = c(-Trait))
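As an illustration of what this regrouping does, here is a small sketch on made-up records. The object names `toy`, `toy_traits` and `toy_nested` and all values are invented for the example. In this sketch, records with a missing value or a missing GID are dropped before nesting.

```r
library(dplyr)
library(tidyr)

# Made-up plot records: two traits measured on three plots
toy <- tibble(
  studyName = c("trialA", "trialA", "trialB"),
  GID       = c("g1", "g2", "g1"),
  logDYLD   = c(1.1, NA, 0.9),
  logFYLD   = c(2.9, 3.1, 2.7)
)
toy_traits <- c("logDYLD", "logFYLD")

toy_nested <- toy %>%
  # one row per plot-by-trait combination
  pivot_longer(cols = all_of(toy_traits), names_to = "Trait", values_to = "Value") %>%
  # in this sketch, drop records lacking a phenotype or a genotype ID
  filter(!is.na(Value), !is.na(GID)) %>%
  # one row per trait; everything else goes into a list-column
  nest(MultiTrialTraitData = c(-Trait))

toy_nested
```

The result has one row per trait, and each element of MultiTrialTraitData is a data frame with the plot records for that trait only, so traits with more missing data simply get shorter data frames.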

To fit the mixed model I used last year, I am again resorting to asreml. I fit random effects for rep and block only where complete and incomplete blocks, respectively, are indicated in the trial design variables. sommer should be able to fit the same model via its at() function, but I am having trouble with it, and sommer is much slower than lme4::lmer() or asreml(), even without a dense covariance matrix (i.e. a kinship).
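Written out, the model described here can be sketched as follows. The notation is my own shorthand and not taken from the asreml documentation. An independent, homogeneous residual is assumed, and rep and block labels are assumed to be unique to each trial.

```latex
\[
\mathbf{y} = \mathbf{X}\boldsymbol{\beta}
  + \mathbf{Z}_{g}\mathbf{g}
  + \mathbf{Z}_{t}\mathbf{t}
  + \mathbf{Z}_{r}\mathbf{r}
  + \mathbf{Z}_{b}\mathbf{b}
  + \mathbf{e}
\]
\[
\mathbf{g} \sim N(\mathbf{0}, \mathbf{I}\sigma^2_{g}), \quad
\mathbf{t} \sim N(\mathbf{0}, \mathbf{I}\sigma^2_{t}), \quad
\mathbf{r} \sim N(\mathbf{0}, \mathbf{I}\sigma^2_{r}), \quad
\mathbf{b} \sim N(\mathbf{0}, \mathbf{I}\sigma^2_{b}), \quad
\mathbf{e} \sim N(\mathbf{0}, \mathbf{I}\sigma^2_{e})
\]
```

Here \(\boldsymbol{\beta}\) holds the fixed effects (yearInLoc, plus PropNOHAV for some traits), \(\mathbf{g}\) the GID effects, and \(\mathbf{t}\) the trialInLocYr effects. \(\mathbf{Z}_{r}\) has non-zero rows only for plots in trials with CompleteBlocks == "Yes" and \(\mathbf{Z}_{b}\) only for plots in trials with IncompleteBlocks == "Yes", so rep and block variances are estimated separately and only from the trials where they apply.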

dbdata %<>% mutate(fixedFormula = ifelse(Trait %in% c("logFYLD", "logRTNO", "logTOPYLD"), 
    "Value ~ yearInLoc", "Value ~ yearInLoc + PropNOHAV"), randFormula = paste0("~idv(GID) + idv(trialInLocYr) + at(CompleteBlocks,'Yes'):repInTrial ", 
    "+ at(IncompleteBlocks,'Yes'):blockInRep"))
dbdata %>% mutate(Nobs = map_dbl(MultiTrialTraitData, nrow)) %>% select(Trait, Nobs, 
    fixedFormula, randFormula) %>% rmarkdown::paged_table()
# randFormula <- paste0("~vs(GID) + vs(trialInLocYr) + ",
#                       "vs(at(CompleteBlocks,'Yes'),repInTrial) + ",
#                       "vs(at(IncompleteBlocks,'Yes'),blockInRep)")
# library(sommer)
# fit <- mmer(fixed = Value ~ 1 + yearInLoc, random = as.formula(randFormula),
#             data = trainingdata, getPEV = TRUE)

Function to run asreml

Includes rounds of outlier removal and re-fitting.
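The fit / flag / drop / re-fit pattern can be sketched without an asreml license. The example below swaps in lm() on simulated data purely for illustration. The function name, the simulated data and the cutoff of 3.3 standardized residual units are assumptions of this sketch, not values confirmed by the text above.

```r
set.seed(1)
# Simulated data: 200 plots, one covariate, with 3 planted gross errors
d <- data.frame(x = rnorm(200))
d$y <- 1 + 0.5 * d$x + rnorm(200, sd = 0.3)
d$y[c(10, 50, 120)] <- d$y[c(10, 50, 120)] + 5

# Re-fit up to `rounds` times, each time dropping rows whose standardized
# residual exceeds `cutoff` in absolute value (cutoff is an assumed value)
fit_with_outlier_removal <- function(data, rounds = 2, cutoff = 3.3) {
  removed <- list()
  fit <- lm(y ~ x, data = data)
  for (i in seq_len(rounds)) {
    out_idx <- which(abs(as.numeric(scale(residuals(fit)))) > cutoff)
    if (length(out_idx) == 0) break      # nothing flagged: stop early
    removed[[i]] <- rownames(data)[out_idx]
    data <- data[-out_idx, , drop = FALSE]
    fit <- lm(y ~ x, data = data)        # re-fit on the cleaned data
  }
  list(fit = fit, removed = removed, n_used = nrow(data))
}

res <- fit_with_outlier_removal(d)
res$removed
res$n_used
```

Keeping the removed row names per round, as in `res$removed`, makes it easy to audit afterwards which plots were dropped and when.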

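fitASfunc <- function(fixedFormula, randFormula, MultiTrialTraitData, ...) {
  # test arguments for function
  # ----------------------
  # MultiTrialTraitData<-dbdata$MultiTrialTraitData[[7]]
  # #Trait<-dbdata$Trait[[3]]
  # fixedFormula<-dbdata$fixedFormula[[7]]
  # randFormula<-dbdata$randFormula[[7]]
  # test<-fitASfunc(fixedFormula,randFormula,MultiTrialTraitData)
  # ----------------------
  require(asreml)
  # the formulas are stored as character strings; asreml needs formula objects
  fixedFormula <- as.formula(fixedFormula)
  randFormula <- as.formula(randFormula)
  # fit asreml
  out <- asreml(fixed = fixedFormula,
                random = randFormula,
                data = MultiTrialTraitData,
                maxiter = 40, workspace = 800e6, na.method.X = "omit")
  #### extract residuals - Round 1
  # ASSUMPTION: the outlier rule is missing from this listing; a cutoff of
  # |standardized residual| > 3.3 is used here as a placeholder
  outliers1 <- which(abs(scale(out$residuals)) > 3.3)
  outliers2 <- integer(0) # so the check below works when round 2 is skipped
  if (length(outliers1) > 0) {
    #### remove outliers
    x <- MultiTrialTraitData[-outliers1, ]
    # re-fit
    out <- asreml(fixed = fixedFormula,
                  random = randFormula,
                  data = x,
                  maxiter = 40, workspace = 800e6, na.method.X = "omit")
    #### extract residuals - Round 2
    outliers2 <- which(abs(scale(out$residuals)) > 3.3)
    if (length(outliers2) > 0) {
      #### remove outliers
      x <- x[-outliers2, ]
      # final re-fit
      out <- asreml(fixed = fixedFormula,
                    random = randFormula,
                    data = x, maxiter = 40, workspace = 800e6, na.method.X = "omit")
    }
  }
  if(length(outliers1)==0){ outliers1<-NULL }
  if(length(outliers2)==0){ outliers2<-NULL }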
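  # Variance components of the final fit. ASSUMPTION: the row labels below
  # (asreml-R v3 style) are not shown in this listing and may need adjusting.
  varcomp <- summary(out, all = T)$varcomp
  Vg <- varcomp["GID!GID.var", "component"] # genetic variance
  Ve <- varcomp["R!variance", "component"]  # residual variance
  H2 <- Vg / (Vg + Ve)                      # broad-sense heritability
  blups <- summary(out, all = T)$coef.random %>%
    as.data.frame() %>%
    rownames_to_column(var = "GID") %>%
    dplyr::select(GID, solution, `std error`) %>%
    filter(grepl("GID", GID)) %>%
    rename(BLUP = solution) %>%
    mutate(PEV = `std error`^2, # asreml specific
           REL = 1 - (PEV / Vg), # Reliability
           drgBLUP = BLUP / REL, # deregressed BLUP
           WT = (1 - H2) / ((0.1 + (1 - REL) / REL) * H2)) # weight for use in Stage 2
  # ASSUMPTION: results are packaged as a one-row tibble so that unnest(fitAS) works
  out <- tibble(Vg = Vg, Ve = Ve, H2 = H2, blups = list(blups),
                outliers1 = list(outliers1), outliers2 = list(outliers2))
  return(out) }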
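A small worked example of the de-regression arithmetic may help when checking results by hand. All numbers below are made up, and `toy`, `Vg`, `Ve` and `H2` here are stand-ins rather than output from this analysis. The weight has the same form as the one proposed by Garrick et al. (2009), with 0.1 apparently playing the role of their constant c (the share of genetic variance not captured by markers).

```r
# Made-up inputs for three clones (not results from this analysis)
Vg <- 4                 # genetic variance
Ve <- 6                 # residual variance
H2 <- Vg / (Vg + Ve)    # broad-sense heritability
toy <- data.frame(GID       = c("g1", "g2", "g3"),
                  BLUP      = c(1.2, -0.8, 0.3),
                  std.error = c(1.0, 1.5, 1.9))

toy$PEV     <- toy$std.error^2        # prediction error variance
toy$REL     <- 1 - toy$PEV / Vg       # reliability
toy$drgBLUP <- toy$BLUP / toy$REL     # de-regressed BLUP
toy$WT      <- (1 - H2) / ((0.1 + (1 - toy$REL) / toy$REL) * H2)  # Stage 2 weight

toy
```

For g1, PEV = 1, REL = 1 - 1/4 = 0.75 and drgBLUP = 1.2 / 0.75 = 1.6. As the standard error grows from g1 to g3, reliability falls, the de-regressed BLUP is inflated more strongly relative to the BLUP, and the weight it receives in Stage 2 shrinks.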

Run asreml

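# future_pmap() comes from furrr and runs sequentially unless a plan is set;
# the choice of plan(multisession) here is an assumption
library(furrr); options(mc.cores = 14); plan(multisession)
dbdata %<>% mutate(fitAS = future_pmap(., fitASfunc))
dbdata %<>% select(-fixedFormula, -randFormula, -MultiTrialTraitData) %>% unnest(fitAS)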
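The row-wise mapping used here can be tried on a toy table. The sketch below uses purrr::pmap(), which has the same interface as furrr::future_pmap() but runs sequentially, and a hypothetical stand-in `toy_fit()` built on lm() in place of fitASfunc. The data are invented.

```r
library(dplyr)
library(purrr)
library(tidyr)

# One row per trait, with the data and the formula stored alongside it
toy <- tibble(
  Trait = c("t1", "t2"),
  MultiTrialTraitData = list(data.frame(Value = c(1, 2, 3)),
                             data.frame(Value = c(10, 20))),
  fixedFormula = c("Value ~ 1", "Value ~ 1")
)

# Hypothetical stand-in for fitASfunc: pmap() matches columns to arguments
# by name, and `...` absorbs columns the function does not use (here, Trait)
toy_fit <- function(fixedFormula, MultiTrialTraitData, ...) {
  fit <- lm(as.formula(fixedFormula), data = MultiTrialTraitData)
  tibble(intercept = unname(coef(fit)[1]), Nobs = nrow(MultiTrialTraitData))
}

toy_res <- toy %>%
  mutate(fit = pmap(., toy_fit)) %>%                # one model per row
  select(-fixedFormula, -MultiTrialTraitData) %>%
  unnest(fit)                                       # spread results into columns

toy_res
```

Because each call returns a one-row tibble, unnest() turns the list-column into ordinary columns next to Trait, which is the shape the saved output relies on.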

Output file

saveRDS(dbdata, file = here::here("output", "iita_blupsForModelTraining_twostage_asreml_2020Dec03.rds"))

Next step

  1. Genomic prediction: Predict GETGV, specifically, for all selection candidates using all available data.

