This reproducible R Markdown analysis was created with workflowr (version 1.6.2). The Checks tab describes the reproducibility checks that were applied when the results were created. The Past versions tab lists the development history.
Great job! The global environment was empty. Objects defined in the global environment can affect the analysis in your R Markdown file in unknown ways. For reproduciblity it’s best to always run the code in an empty environment.
The command set.seed(20191123) was run prior to running the code in the R Markdown file. Setting a seed ensures that any results that rely on randomness, e.g. subsampling or permutations, are reproducible.
Great! You are using Git for version control. Tracking code development and connecting the code version to the results is critical for reproducibility.
The results in this page were generated with repository version f73b05f. See the Past versions tab to see a history of the changes made to the R Markdown and HTML files.
Note that you need to be careful to ensure that all relevant files for the analysis have been committed to Git prior to generating the results (you can use wflow_publish or wflow_git_commit). workflowr only checks the R Markdown file, but you know if there are other scripts or data files that it depends on. Below is the status of the Git repository when the results were generated:
Note that any generated files, e.g. HTML, png, CSS, etc., are not included in this status report because it is ok for generated content to have uncommitted changes.
These are the previous versions of the repository in which changes were made to the R Markdown (analysis/Results.Rmd) and HTML (docs/Results.html) files. If you’ve configured a remote Git repository (see ?wflow_git_remote), click on the hyperlinks in the table below to view the files as they were in that past version.
Min. 1st Qu. Median Mean 3rd Qu. Max.
1.00 4.00 16.00 30.61 36.00 256.00
There were 3199 comprising 462 families, derived from 209 parents in our pedigree. Parents were used an average of 31 (median 16, range 1-256) times as sire and/or dam in the pedigree. The mean family size was 7 (median 4, range 1-72).
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.7639 0.8254 0.8364 0.8363 0.8476 0.9273
The average proportion homozygous was 0.84 (range 0.76-0.93) across the 3199 pedigree members (computed over 33370 variable SNP; Table S14).
As expected for a population under recurrent selection, the homozygosity rate increases (though only fractionally) from the C0 (mean 0.826), C1 (0.835), C2 (0.838), C3 (0.839) (Figure S01).
# A tibble: 2 x 5
Ntestparents Ntrainset Ntestset NcrossesToPredict Value
<dbl> <dbl> <dbl> <dbl> <chr>
1 41 1245 1003 143 Min
2 42 2323 2081 204 Max
Across the 5 replications of 5-fold cross-validation the average number of samples was 1833 (range 1245-2323) for training sets and 1494 (range 1003-2081) for testing sets. The 25 training-testing pairs set-up an average of 167 (range 143-204) crosses-to-predict (Table S02).
Summary of the BLUPs and sel. index
The correlation between phenotypic BLUPs for the two SI (stdSI and biofortSI; Table S01) was 0.43 (Figure S02). The correlation between DM and TCHART BLUPs, for which we had a priori expectations, was -0.29.
The usefulness criteria i.e. UCparent and UCvariety are predicted by:
UCparent=μBV+(iRS×σBV)
UCvariety=μTGV+(iVDP×σTGV)
Selection intensity
In order to combined predicted means and variances into a UC, we first calculated the realized intensity of within-family selection (iRS and iVDP). For UCparent we computed the iRS based on the proportion of progeny from each family, that themselves later appeared in the pedigree as parents. For UCclone we compute computed iVDP based on the proportion of family-members that had at least one plot at the penultimate stage of the VDP, the advanced yield trial (AYT). For completeness and as part of exploratory analysis, we computed the proportion selected at each of the VDP stages: clonal evaluation trial (CET), preliminary yield trial (PYT), advanced yield trial (AYT) and uniform yield trial (UYT).
The table below provides a quick summary of the number of families available with realized selection observed at each stage, plus the corresponding mean selection intensity and proportion selected across families.
There were 48 families with a mean intensity of 1.59 (mean 2% selected) that themselves had members who were parents in the pedigree. As expected, the number of available families and the proportion selected decreased (increasing selection intensity) from CET to UYT. We choose to focus on the AYT stage, which has 104 families, mean intensity 1.46 (mean 5% selected).
Population estimates of the importance of dominance variance
In this study, our focus is mainly on distinguishing among crosses, and the accuracy of cross-metric predictions. Detailed analysis of the additive-dominance genetic variance-covariance structure in cassava (sub)-populations is an important topic, which we mostly leave for future study. However, we make a brief examination of the genetic variance-covariance estimates associated with the overall population and component genetic groups. We report all variance-covariance estimates in TableS15 and complete BGLR output in the FTP repository associated with this study.
We found mostly consistent and significant (diff. from zero) effects of inbreeding depression associated especially (Figure 5, Table S16), with logFYLD (mean effect -2.75 across genetic groups, -3.88 across cross-validation folds), but also DM (-4.82 genetic groups, -7.85 cross-validation) and MCMDS (0.32 genetic groups, 1.27 cross-validation). This corresponds to higher homozygosity being associated with lower DM, lower yield and worse disease.
Exploring Untested Crosses
We made 8 predictions (2 SIs x 2 prediction targets [BV, TGV] x 2 criteria [Mean, UC = Mean + 2*SD]) prediction for each of 47,083 possible crosses of 306 parents.
Correlations among predictions
First, quickly evaluate the multivariate decision space encompassed by predictions of mean, SD, UC for BV and TGV.
The mean and variance have a low, but negative correlation. At the standardized intensity of 2.67 (1% selected), leads to a small negative correlation between SD and UC. The crosses with highest mean will mostly be those with highest UC. The crosses with highest mean will also have a small tendency to have smaller variance.
Nevertheless, the biggest differences in decision space have to do with the difference between using the Mean vs. including the SD via the UC.
Decision space - top 50 crosses?
What we next want to know, is how different the selections of crosses-to-make would be if we use different criteria, particularly the mean vs. the UC.
For each of the 8 predictions of 47,083 crosses, select the top 50 ranked crosses.
# A tibble: 4 x 3
Trait IsSelf n
<chr> <lgl> <int>
1 biofortSI FALSE 84
2 biofortSI TRUE 6
3 stdSI FALSE 105
4 stdSI TRUE 7
Only 202 unique crosses selected based on at least one of the 8 criteria. Of those 112 were selected for the StdSI (90 Biofort) including 7 (6) selfs on the StdSI (BiofortSI).
# A tibble: 1 x 2
CrossPrevMade n
<chr> <int>
1 No 202
None of the selected crosses have previously been tested.
Table: Summarize, by trait, the number of and relative contributions (number of matings) proposed for each parent selected in the group of top crosses.
Most often, cross-validation done to test genomic prediction accuracy uses validation data (the stand-in for “truth”) consisting of adjusted values, (e.g. BLUPs or BLUEs) for total individual performance, not including genomic relatedness information. In our study we set-up cross-validation folds that enable use to predict the GEBV and GETGV (GBLUPs) of validation family-members, and to subsequently compute their sample means, variances and usefulness. This approach has the added advantage of expanding the available sample size of validation progeny with complete data across traits. Nevertheless, we made some comparison to results using BLUPs that do not incorporate genomic relatedness information; in other words, independent and identically distributed (i.i.d.) BLUPs.
library(tidyverse); library(magrittr);# Table S10: Accuracies predicting the meanaccMeans<-readxl::read_xlsx(here::here("manuscript","SupplementaryTables.xlsx"),sheet ="TableS10")## Table S11: Accuracies predicting the variancesaccVars<-readxl::read_xlsx(here::here("manuscript","SupplementaryTables.xlsx"),sheet ="TableS11") %>%mutate(Component=ifelse(Trait1==Trait2,"Variance","Covariance"),TraitType=ifelse(grepl("SI",Trait1),"SI","ComponentTrait"))
Summarize accuracy differences between GBLUPs and iidBLUPs as validation data.
Min. 1st Qu. Median Mean 3rd Qu. Max.
-0.38849 -0.00209 0.07280 0.06568 0.13568 0.43723
Similar to the means, accuracy using GBLUP-validation-data appeared mostly higher compared to iidBLUPs (median difference GBLUPs-iidBLUPs = 0.07, interquartile range -0.002-0.14).
Compute Spearman rank correlations on a per Trait, per prediction type (BV vs TGV) basis.
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.04154 0.33269 0.52385 0.50785 0.66346 0.89154
The Spearman rank correlations of iidBLUP and GBLUP-validation-based accuracies was positive for family (co)variances, but smaller compared to family means (mean correlation 0.5, range 0.04-0.89).
Supplementary plots comparing validation-data accuracies for means and (co)variances were inspected (Figure S??-S??Figure ???). Based on this, we conclude that we would reach similar though more muted conclusions about which trait variances and trait-trait covariances are best or worst predicted, if restricted to iidBLUPs for validation data.
Accuracy considering only families with 10+ members?
In our primary analysis, we computed (co)variance prediction accuracies with weighted correlations, considering any family with more than one member. We also considered a more conservative alternative approach of including only families with 10(n=112)or20 (n=22).
The Spearman rank correlation between accuracy estimates when all vs. only families with more than 10 members was 0.89. There should therefore be good concordance with our primary conclusions, depending on the family size threshold we impose.
The median difference in accuracy (“threshold size families” minus “all families”) was 0.01. Considering only size 10 or greater families noticeably improved prediction accuracy for several trait variances and especially for two covariances (DM-TCHART and logFYLD-MCMDS) (Figures S???).
Comparing PMV and VPM predictions
Comparing posterior mean variance (PMV) to variance of posterior mean (VPM) predictions:
Variances and covariances estimates and cross-validation predictions were made with the PMV method, which is computationally more intensive than VPM.
Overall Spearman rank correlation between VPM and PMV variance component estimates.
We observed that population variance estimates based on PMV were consistently larger than VPM, but the correlation of those estimates is 0.98 (**Figure __**).
Compute the correlation between PMV and VPM predictions.
Using the predictions from the cross-validation results, we further observed that the PMV predictions were consistently larger and most notably that the correlation between PMV and VPM was very high (0.995). Some VPM prediction accuracies actually appear better than PMV predictions (Figure S???).
The critical point is that VPM and PMV predictions should have very similar rankings. In our primary analysis, we focus on the PMV results with the only exception being the exploratory predictions where we saved time/computation and used the VPM. If implementing mate selections via the usefulness criteria, choosing the VPM method would mostly have the consequence of shrinking the influence on selection decisions towards the mean.
Compare directional dominance to “classic” model
Our focus in this article was not in finding the optimal or most accurate prediction model for obtaining marker effects. However, genome-wide estimates of directional dominance have not previously been made in cassava. For this reason, we make some brief comparison to the standard or “classic” additive-dominance prediction model, where dominance effects are centered on zero.
Overall, the ranking of models and predictions between the two models were similar, as indicated by a rank correlation between model accuracy estimates of 0.98 for family means and 0.94 for variances and covariances.
Summarize accuracy differences between DirDom and Classic model: Means