Processing math: 100%
  • Raw data
  • BLUPs
  • Marker density and distribution
  • Pedigree validation
  • Parent-wise Cross-validation
    • Selection index accuracy
    • Cross-mean prediction accuracy
    • Cross variance param prediction accruacy
      • Cross variance pred. accuracy
      • Cross co-variance pred. accuracy
  • Genomic Predictions
    • Genetic trends (and gain estimates)
    • Parent selection
    • Genomic mate selection
  • [FUTURE] Compare PHG-to-Beagle data
    • [TO DO] PCA
    • [TO DO] Compare kinships
    • [TO DO] Prediction accuracy - PHG
    • [TO DO] Compare PHG-to-Beagle accuracy

Last updated: 2021-07-29

Checks: 7 0

Knit directory: implementGMSinCassava/

This reproducible R Markdown analysis was created with workflowr (version 1.6.2). The Checks tab describes the reproducibility checks that were applied when the results were created. The Past versions tab lists the development history.


Great! Since the R Markdown file has been committed to the Git repository, you know the exact version of the code that produced these results.

Great job! The global environment was empty. Objects defined in the global environment can affect the analysis in your R Markdown file in unknown ways. For reproduciblity it’s best to always run the code in an empty environment.

The command set.seed(20210504) was run prior to running the code in the R Markdown file. Setting a seed ensures that any results that rely on randomness, e.g. subsampling or permutations, are reproducible.

Great job! Recording the operating system, R version, and package versions is critical for reproducibility.

Nice! There were no cached chunks for this analysis, so you can be confident that you successfully produced the results during this run.

Great job! Using relative paths to the files within your workflowr project makes it easier to run your code on other machines.

Great! You are using Git for version control. Tracking code development and connecting the code version to the results is critical for reproducibility.

The results in this page were generated with repository version bc85a7d. See the Past versions tab to see a history of the changes made to the R Markdown and HTML files.

Note that you need to be careful to ensure that all relevant files for the analysis have been committed to Git prior to generating the results (you can use wflow_publish or wflow_git_commit). workflowr only checks the R Markdown file, but you know if there are other scripts or data files that it depends on. Below is the status of the Git repository when the results were generated:


Ignored files:
    Ignored:    .DS_Store
    Ignored:    .Rhistory
    Ignored:    .Rproj.user/
    Ignored:    analysis/.DS_Store
    Ignored:    analysis/accuracies.png
    Ignored:    analysis/fig2.png
    Ignored:    analysis/fig3.png
    Ignored:    analysis/fig4.png
    Ignored:    code/.DS_Store
    Ignored:    data/.DS_Store

Untracked files:
    Untracked:  accuracies.png
    Untracked:  analysis/docs/
    Untracked:  analysis/speedUpPredCrossVar.Rmd
    Untracked:  data/AllChrom_RefPanelAndGSprogeny_ReadyForGP_72719.bed
    Untracked:  data/AllChrom_RefPanelAndGSprogeny_ReadyForGP_72719.bim
    Untracked:  data/AllChrom_RefPanelAndGSprogeny_ReadyForGP_72719.fam
    Untracked:  data/AllChrom_RefPanelAndGSprogeny_ReadyForGP_72719.hap.gz
    Untracked:  data/AllChrom_RefPanelAndGSprogeny_ReadyForGP_72719.log
    Untracked:  data/AllChrom_RefPanelAndGSprogeny_ReadyForGP_72719.nosex
    Untracked:  data/AllChrom_RefPanelAndGSprogeny_ReadyForGP_72719.sample
    Untracked:  data/AllChrom_RefPanelAndGSprogeny_ReadyForGP_72719.vcf.gz
    Untracked:  data/DatabaseDownload_2021May04/
    Untracked:  data/blups_forCrossVal.rds
    Untracked:  data/config.txt
    Untracked:  data/config_mw.txt
    Untracked:  data/dosages_IITA_2021May13.rds
    Untracked:  data/dosages_IITA_filtered_2021May13.rds
    Untracked:  data/genmap_2021May13.rds
    Untracked:  data/haps_IITA_2021May13.rds
    Untracked:  data/haps_IITA_filtered_2021May13.rds
    Untracked:  data/recombFreqMat_1minus2c_2021May13.rds
    Untracked:  fig2.png
    Untracked:  fig3.png
    Untracked:  figure/
    Untracked:  output/Cassava_HMII_V3_Marning_imputation_6-18-21.samples
    Untracked:  output/IITA_CleanedTrialData_2021May10.rds
    Untracked:  output/IITA_ExptDesignsDetected_2021May10.rds
    Untracked:  output/IITA_blupsForModelTraining_twostage_asreml_2021May10.rds
    Untracked:  output/IITA_trials_NOT_identifiable.csv
    Untracked:  output/alphaAssignOutput_iita_pedigree.txt.dams
    Untracked:  output/alphaAssignOutput_iita_pedigree.txt.dams.full
    Untracked:  output/alphaAssignOutput_iita_pedigree.txt.pedigree
    Untracked:  output/alphaAssignOutput_iita_pedigree.txt.pedigree.top
    Untracked:  output/alphaAssignOutput_iita_pedigree.txt.sires
    Untracked:  output/alphaAssignOutput_iita_pedigree.txt.sires.full
    Untracked:  output/crossValPredsA.rds
    Untracked:  output/crossValPredsAD.rds
    Untracked:  output/cvAD_1rep_markerEffects.rds
    Untracked:  output/cvAD_1rep_meanPredAccuracy.rds
    Untracked:  output/cvAD_1rep_parentfolds.rds
    Untracked:  output/cvAD_1rep_predAccuracy.rds
    Untracked:  output/cvAD_1rep_predMeans.rds
    Untracked:  output/cvAD_1rep_predVars.rds
    Untracked:  output/cvAD_1rep_varPredAccuracy.rds
    Untracked:  output/cvAD_5rep5fold_markerEffects.rds
    Untracked:  output/cvAD_5rep5fold_meanPredAccuracy.rds
    Untracked:  output/cvAD_5rep5fold_parentfolds.rds
    Untracked:  output/cvAD_5rep5fold_predMeans.rds
    Untracked:  output/cvAD_5rep5fold_predVars.rds
    Untracked:  output/cvAD_5rep5fold_varPredAccuracy.rds
    Untracked:  output/cvDirDom_5rep5fold_markerEffects.rds
    Untracked:  output/cvDirDom_5rep5fold_meanPredAccuracy.rds
    Untracked:  output/cvDirDom_5rep5fold_parentfolds.rds
    Untracked:  output/cvDirDom_5rep5fold_predMeans.rds
    Untracked:  output/cvDirDom_5rep5fold_predVars.rds
    Untracked:  output/cvDirDom_5rep5fold_varPredAccuracy.rds
    Untracked:  output/cvMeanPredAccuracyA.rds
    Untracked:  output/cvMeanPredAccuracyAD.rds
    Untracked:  output/cvPredMeansA.rds
    Untracked:  output/cvPredMeansAD.rds
    Untracked:  output/cvVarPredAccuracyA.rds
    Untracked:  output/cvVarPredAccuracyAD.rds
    Untracked:  output/genomicMatePredictions_top121parents_ModelAD.csv
    Untracked:  output/genomicMatePredictions_top121parents_ModelAD.rds
    Untracked:  output/genomicMatePredictions_top121parents_ModelDirDom.csv
    Untracked:  output/genomicMatePredictions_top121parents_ModelDirDom.rds
    Untracked:  output/genomicPredictions_ModelAD.csv
    Untracked:  output/genomicPredictions_ModelAD.rds
    Untracked:  output/genomicPredictions_ModelDirDom.csv
    Untracked:  output/genomicPredictions_ModelDirDom.rds
    Untracked:  output/kinship_A_IITA_2021May13.rds
    Untracked:  output/kinship_D_IITA_2021May13.rds
    Untracked:  output/kinship_domGenotypic_IITA_2021July5.rds
    Untracked:  output/markEffsTest.rds
    Untracked:  output/markerEffects.rds
    Untracked:  output/markerEffectsA.rds
    Untracked:  output/markerEffectsAD.rds
    Untracked:  output/maxNOHAV_byStudy.csv
    Untracked:  output/obsCrossMeansAndVars.rds
    Untracked:  output/parentfolds.rds
    Untracked:  output/ped2check_genome.rds
    Untracked:  output/ped2genos.txt
    Untracked:  output/pednames2keep.txt
    Untracked:  output/pednames_Prune100_25_pt25.log
    Untracked:  output/pednames_Prune100_25_pt25.nosex
    Untracked:  output/pednames_Prune100_25_pt25.prune.in
    Untracked:  output/pednames_Prune100_25_pt25.prune.out
    Untracked:  output/potential_dams.txt
    Untracked:  output/potential_sires.txt
    Untracked:  output/predVarTest.rds
    Untracked:  output/samples2keep_IITA_2021May13.txt
    Untracked:  output/samples2keep_IITA_MAFpt01_prune50_25_pt98.log
    Untracked:  output/samples2keep_IITA_MAFpt01_prune50_25_pt98.nosex
    Untracked:  output/samples2keep_IITA_MAFpt01_prune50_25_pt98.prune.in
    Untracked:  output/samples2keep_IITA_MAFpt01_prune50_25_pt98.prune.out
    Untracked:  output/samples2keep_notInPHGdb.txt
    Untracked:  output/test_cvAD_markerEffects.rds
    Untracked:  output/test_cvAD_meanPredAccuracy.rds
    Untracked:  output/test_cvAD_parentfolds.rds
    Untracked:  output/test_cvAD_predAccuracy.rds
    Untracked:  output/test_cvAD_predMeans.rds
    Untracked:  output/test_cvAD_predVars.rds
    Untracked:  output/test_cvAD_varPredAccuracy.rds
    Untracked:  output/top50crosses.csv
    Untracked:  output/verified_ped.txt

Note that any generated files, e.g. HTML, png, CSS, etc., are not included in this status report because it is ok for generated content to have uncommitted changes.


These are the previous versions of the repository in which changes were made to the R Markdown (analysis/07-Results.Rmd) and HTML (docs/07-Results.html) files. If you’ve configured a remote Git repository (see ?wflow_git_remote), click on the hyperlinks in the table below to view the files as they were in that past version.

File Version Author Date Message
Rmd bc85a7d wolfemd 2021-07-29 Polished and ready to go.
html baa7d80 wolfemd 2021-07-29 Build site.
Rmd 5a53779 wolfemd 2021-07-29 Publish the nearly complete and polished version of the project.
html 934141c wolfemd 2021-07-14 Build site.
html cc1eb4b wolfemd 2021-07-14 Build site.
Rmd 772750a wolfemd 2021-07-14 DirDom model and selection index calc fully integrated functions.

The summaries below correspond to the results of analyses outlined here and linked above.

Raw data

Summary of the number of unique plots, locations, years, etc. in the cleaned plot-basis data (output/IITA_ExptDesignsDetected_2021May10.rds, download from FTP). See the data cleaning step here for details.

library(tidyverse); library(magrittr); library(ragg)
rawdata<-readRDS(file=here::here("output","IITA_ExptDesignsDetected_2021May10.rds"))
rawdata %>% 
  summarise(Nplots=nrow(.),
            across(c(locationName,studyYear,studyName,TrialType,GID), ~length(unique(.)),.names = "N_{.col}")) %>% 
  rmarkdown::paged_table()
ABCDEFGHIJ0123456789
Nplots
<int>
N_locationName
<int>
N_studyYear
<int>
N_studyName
<int>
N_TrialType
<int>
N_GID
<int>
4795881332186410254339

This is not the same number of clones as are expected to be genotyped-and-phenotyped.

Next, a break down of the plots based on the trial design and TrialType (really a grouping of the population that is breeding program specific), captured by two logical variables, CompleteBlocks and IncompleteBlocks.

rawdata %>% 
  count(TrialType,CompleteBlocks,IncompleteBlocks) %>% 
  spread(TrialType,n) %>% 
  rmarkdown::paged_table()
ABCDEFGHIJ0123456789
CompleteBlocks
<lgl>
IncompleteBlocks
<lgl>
AYT
<int>
CET
<int>
Conservation
<int>
CrossingBlock
<int>
ExpCET
<int>
GeneticGain
<int>
NCRP
<int>
PYT
<int>
FALSEFALSENA27913997557NA6729NA1361
FALSETRUE4031587NA98918655823NA801
TRUEFALSE41750460NANANA18310263034685
TRUETRUE6950320NANANA17755115217053

Next, look at breakdown of plots by TrialType (rows) and locations (columns):

rawdata %>% 
  count(locationName,TrialType) %>% 
  spread(locationName,n) %>% 
  rmarkdown::paged_table()
ABCDEFGHIJ0123456789
TrialType
<chr>
Abuja
<int>
Ago-Owu
<int>
Ibadan
<int>
Ikenne
<int>
Ilorin
<int>
Jos
<int>
Kano
<int>
Malam Madori
<int>
Mokwa
<int>
Ubiaja
<int>
Umudike
<int>
Warri
<int>
Zaria
<int>
AYT4076122331520829669252011016695544913215583957
CETNANA39085104391548410086524242257NANA4872
ConservationNANA997NANANANANANANANANANA
CrossingBlockNANA831NANANANANANA715NANANA
ExpCETNANA592700NANANANA329244NANANA
GeneticGainNA13212261NANA474NA2837920416947NANA6762
NCRP3362161076408NANANANA854336NANA556
PYT31414302928210168NA95555888150472149NA11621954
SNNANA18976010947NANANANANANANANANA
UYT1156744177774285208516898270914735636552221007139
traits<-c("MCMDS","DM","PLTHT","BRNHT1","BRLVLS","HI",
          "logDYLD", # <-- logDYLD now included. 
          "logFYLD","logTOPYLD","logRTNO","TCHART","LCHROMO","ACHROMO","BCHROMO")
rawdata %>% 
  select(locationName,studyYear,studyName,TrialType,any_of(traits)) %>% 
  pivot_longer(cols = any_of(traits), values_to = "Value", names_to = "Trait") %>% 
  ggplot(.,aes(x=Value,fill=Trait)) + geom_histogram() + facet_wrap(~Trait, scales='free') + 
  theme_bw() + scale_fill_viridis_d() + 
  labs(title = "Distribution of Raw Phenotypic Values")

Version Author Date
cc1eb4b wolfemd 2021-07-14

How many genotyped-and-phenotyped clones?

genotypedAndPhenotypedClones<-rawdata %>% 
  select(locationName,studyYear,studyName,TrialType,germplasmName,FullSampleName,GID,any_of(traits)) %>% 
  pivot_longer(cols = any_of(traits), values_to = "Value", names_to = "Trait") %>%
  filter(!is.na(Value),!is.na(FullSampleName)) %>%
  distinct(germplasmName,FullSampleName,GID)

There are 8149 genotyped-and-phenotyped clones!

genotypedAndPhenotypedClones %>% 
  rmarkdown::paged_table()
ABCDEFGHIJ0123456789
germplasmName
<chr>
FullSampleName
<chr>
GID
<chr>
IITA-TMS-IBA30572I30572:250253643I30572:250253643
IITA-TMS-IBA8200058I8200058:250304538I8200058:250304538
IITA-TMS-IBA9102324I9102324:250090766I9102324:250090766
TMEB117TMEB117:250253666TMEB117:250253666
TMEB1TMEB1:250253593TMEB1:250253593
IITA-TMS-IBA920326I920326:250090768I920326:250090768
IITA-TMS-IBA996012I996012:250300416I996012:250300416
IITA-TMS-IBA996016I996016:250090791I996016:250090791
IITA-TMS-IBA996017I996017:250300415I996017:250300415
IITA-TMS-IBA996069I996069:250300420I996069:250300420

BLUPs

Summarize the BLUPs from the training data, leveraging for each clone data across trials/locations without genomic information and to be used as input for genomic prediction downstream (data/blups_forCrossVal.rds, download from FTP). See the mixed-model analysis step here and a subsequent processing step here for details.

blups<-readRDS(file=here::here("data","blups_forCrossVal.rds")) 
gidWithBLUPs<-blups %>% select(Trait,blups) %>% unnest(blups) %$% unique(GID)
rawdata %>% 
  select(observationUnitDbId,GID,any_of(blups$Trait)) %>% 
  pivot_longer(cols = any_of(blups$Trait), 
               names_to = "Trait", 
               values_to = "Value",values_drop_na = T) %>% 
  filter(GID %in% gidWithBLUPs) %>% 
  group_by(Trait) %>% 
  summarize(Nplots=n()) %>% 
  ungroup() %>% 
  left_join(blups %>% 
              mutate(Nclones=map_dbl(blups,~nrow(.)),
                     avgREL=map_dbl(blups,~mean(.$REL)),
                     Vg=map_dbl(varcomp,~.["GID!GID.var","component"]),
                     Ve=map_dbl(varcomp,~.["R!variance","component"]),
                     H2=Vg/(Vg+Ve)) %>% 
              select(-blups,-varcomp)) %>% 
  mutate(across(is.numeric,~round(.,3))) %>% arrange(desc(H2)) %>% 
  rmarkdown::paged_table()
ABCDEFGHIJ0123456789
Trait
<chr>
Nplots
<dbl>
Nclones
<dbl>
avgREL
<dbl>
Vg
<dbl>
Ve
<dbl>
H2
<dbl>
BCHROMO2631151670.98154.8891.8150.968
MCMDS9638578900.8830.6510.1550.808
PLTHT1758043370.8401391.744384.1590.784
logFYLD10996874640.7110.2620.2110.553
BRLVLS1622943730.6570.6530.5330.551
HI9208674160.6720.0090.0090.504
logRTNO11108974410.6660.1370.1440.488
BRNHT11914042100.577211.561253.2630.455
DM8432267720.6049.70814.0580.408
logTOPYLD9358975310.5970.1130.1650.407
logDYLD8090766710.5970.1480.2200.403
  • Nplots, Nclones: the number of unique plots and clones per trait
  • avgREL: the mean reliability of BLUPs, where for the jth clone, RELj=1PEVjVg
  • Vg, Ve, H2: the genetic and residual variance components and the broad sense heritability (H2=VgVg+Ve).
blups %>% 
  select(Trait,blups) %>% 
  unnest(blups) %>% 
  ggplot(.,aes(x=drgBLUP,fill=Trait)) + geom_histogram() + facet_wrap(~Trait, scales='free') + 
  theme_bw() + scale_fill_viridis_d() + theme(legend.position = 'none') + 
  labs(title = "Distribution of de-regressed BLUP Values")

Version Author Date
baa7d80 wolfemd 2021-07-29
cc1eb4b wolfemd 2021-07-14
  • de-regressed BLUPs or drgBLUPj=BLUPjRELj
blups %>% 
  select(Trait,blups) %>% 
  unnest(blups) %>% 
  ggplot(.,aes(x=Trait,y=REL,fill=Trait)) + geom_boxplot(notch=T) + #facet_wrap(~Trait, scales='free') + 
  theme_bw() + scale_fill_viridis_d() + 
  theme(axis.text.x = element_text(angle=90),
        legend.position = 'none') + 
  labs(title = "Distribution of BLUP Reliabilities")

Version Author Date
baa7d80 wolfemd 2021-07-29
cc1eb4b wolfemd 2021-07-14

Marker density and distribution

Summarize the marker data (data/dosages_IITA_filtered_2021May13.rds, download from FTP). See the pre-processing steps here.

library(tidyverse); library(magrittr);
getwd()
[1] "/Users/mw489/Google Drive/NextGenGS/implementGMSinCassava"
snps<-readRDS(file=here::here("data","dosages_IITA_filtered_2021May13.rds"))
mrks<-colnames(snps) %>% 
  tibble(SNP_ID=.) %>% 
  separate(SNP_ID,c("Chr","Pos","Allele"),"_") %>% 
  mutate(Chr=as.integer(gsub("S","",Chr)),
         Pos=as.numeric(Pos))
rm(snps);
mrks %>% 
  ggplot(.,aes(x=Pos,fill=as.character(Chr))) + geom_histogram() + 
  facet_wrap(~Chr,scales = 'free') + theme_bw() + 
  scale_fill_viridis_d() + theme(legend.position = 'none',
                                 axis.text.x = element_text(angle=90))

Version Author Date
baa7d80 wolfemd 2021-07-29

In total, 34981 SNPs genome-wide. Broken down by chromosome:

mrks %>% count(Chr,name = "Nsnps") %>% rmarkdown::paged_table()
ABCDEFGHIJ0123456789
Chr
<int>
Nsnps
<int>
13468
22237
32282
41742
52053
61879
71185
82006
91691
101735
111915
121596
131510
142833
152019
161403
171673
181754

Pedigree validation

Introduced new downstream procedures (the parent-wise cross-validation, which rely on a trusted pedigree. To support this, introduced a new pedigree-validation step. The pedigree and validation results are summarized below.

The verified pedigree (output/verified_ped.txt), can be downloaded from the FTP here).

library(tidyverse); library(magrittr);
ped2check_genome<-readRDS(file=here::here("output","ped2check_genome.rds"))
ped2check_genome %<>% 
  select(IID1,IID2,Z0,Z1,Z2,PI_HAT)
ped2check<-read.table(file=here::here("output","ped2genos.txt"),
                      header = F, stringsAsFactors = F) %>% 
  rename(FullSampleName=V1,DamID=V2,SireID=V3)
ped2check %<>% 
  select(FullSampleName,DamID,SireID) %>% 
  inner_join(ped2check_genome %>% 
               rename(FullSampleName=IID1,DamID=IID2) %>% 
               bind_rows(ped2check_genome %>% 
                           rename(FullSampleName=IID2,DamID=IID1))) %>% 
  distinct %>% 
  mutate(FemaleParent=case_when(Z0<0.32 & Z1>0.67~"Confirm",
                                       SireID==DamID & PI_HAT>0.6 & Z0<0.3 & Z2>0.32~"Confirm",
                                       TRUE~"Reject")) %>% 
  select(-Z0,-Z1,-Z2,-PI_HAT) %>% 
  inner_join(ped2check_genome %>% 
               rename(FullSampleName=IID1,SireID=IID2) %>% 
               bind_rows(ped2check_genome %>% 
                           rename(FullSampleName=IID2,SireID=IID1))) %>% 
  distinct %>% 
  mutate(MaleParent=case_when(Z0<0.32 & Z1>0.67~"Confirm",
                                       SireID==DamID & PI_HAT>0.6 & Z0<0.3 & Z2>0.32~"Confirm",
                                       TRUE~"Reject")) %>% 
  select(-Z0,-Z1,-Z2,-PI_HAT)
rm(ped2check_genome)

ped2check %<>% 
  mutate(Cohort=NA,
         Cohort=ifelse(grepl("TMS18",FullSampleName,ignore.case = T),"TMS18",
                       ifelse(grepl("TMS15",FullSampleName,ignore.case = T),"TMS15",
                              ifelse(grepl("TMS14",FullSampleName,ignore.case = T),"TMS14",
                                     ifelse(grepl("TMS13|2013_",FullSampleName,ignore.case = T),"TMS13","GGetc")))))

Proportion of accessions with male, female or both parents in pedigree confirm-vs-rejected?

ped2check %>% 
  count(FemaleParent,MaleParent) %>% 
  mutate(Prop=round(n/sum(n),2))
  FemaleParent MaleParent    n Prop
1      Confirm    Confirm 4259 0.77
2      Confirm     Reject  563 0.10
3       Reject    Confirm  382 0.07
4       Reject     Reject  313 0.06

Proportion of accessions within each Cohort with pedigree records confirmed-vs-rejected?

ped2check %>% 
  count(Cohort,FemaleParent,MaleParent) %>% 
  spread(Cohort,n) %>% 
  mutate(across(is.numeric,~round(./sum(.),2))) %>% 
  rmarkdown::paged_table()
ABCDEFGHIJ0123456789
FemaleParent
<chr>
MaleParent
<chr>
GGetc
<dbl>
TMS13
<dbl>
TMS14
<dbl>
TMS15
<dbl>
TMS18
<dbl>
ConfirmConfirm0.690.780.870.700.66
ConfirmReject0.040.100.060.180.11
RejectConfirm0.120.040.070.090.14
RejectReject0.150.090.010.020.09

Use only fully-confirmed families / trios. Remove any without both parents confirmed.

ped<-read.table(here::here("output","verified_ped.txt"),
                header = T, stringsAsFactors = F) %>% 
  mutate(Cohort=NA,
         Cohort=ifelse(grepl("TMS18",FullSampleName,ignore.case = T),"TMS18",
                       ifelse(grepl("TMS15",FullSampleName,ignore.case = T),"TMS15",
                              ifelse(grepl("TMS14",FullSampleName,ignore.case = T),"TMS14",
                                     ifelse(grepl("TMS13|2013_",FullSampleName,ignore.case = T),"TMS13","GGetc")))))

Summary of family sizes

ped %>% 
  count(SireID,DamID) %$% summary(n)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   1.00    1.00    3.00    5.85    8.00   77.00 
ped %>% nrow(.) # 4259 pedigree entries
[1] 4259
ped %>% 
  count(Cohort,name = "Number of Verified Pedigree Entries")
  Cohort Number of Verified Pedigree Entries
1  GGetc                                  18
2  TMS13                                1786
3  TMS14                                1302
4  TMS15                                 589
5  TMS18                                 564
ped %>% 
  distinct(Cohort,SireID,DamID) %>% 
  count(Cohort,name = "Number of Families per Cohort")
  Cohort Number of Families per Cohort
1  GGetc                            16
2  TMS13                           120
3  TMS14                           233
4  TMS15                           197
5  TMS18                           164

730 families. Mean size 5.85, range 1-77.

Parent-wise Cross-validation

I have introduced a new procedure to assess the accuracy of genomic predictions of cross means and variances on a selection index, which is called (parent-wise cross-validation. The actual parent-wise cross-validation folds (output/parentfolds.rds) used are summarized below and can be downloaded here).

parentfolds<-readRDS(file=here::here("output","parentfolds.rds"))
summarized_parentfolds<-parentfolds %>% 
  mutate(Ntestparents=map_dbl(testparents,length),
         Ntrainset=map_dbl(trainset,length),
         Ntestset=map_dbl(testset,length),
         NcrossesToPredict=map_dbl(CrossesToPredict,nrow)) %>% 
  select(Repeat,Fold,starts_with("N"))
summarized_parentfolds %>% 
  rmarkdown::paged_table()
ABCDEFGHIJ0123456789
Repeat
<chr>
Fold
<chr>
Ntestparents
<dbl>
Ntrainset
<dbl>
Ntestset
<dbl>
NcrossesToPredict
<dbl>
Repeat1Fold15614662712237
Repeat1Fold25520602118218
Repeat1Fold35525801598164
Repeat1Fold45524791699165
Repeat1Fold55519612217195
Repeat2Fold15620132165212
Repeat2Fold25524121766168
Repeat2Fold35519252253195
Repeat2Fold45520532125174
Repeat2Fold55519122266213
summarized_parentfolds %>% summarize(across(is.numeric,median,.names = "median{.col}"))
# A tibble: 1 × 4
  medianNtestparents medianNtrainset medianNtestset medianNcrossesToPredict
               <dbl>           <dbl>          <dbl>                   <dbl>
1                 55            2053           2125                     195

Selection index weights (according to IITA)

# SELECTION INDEX WEIGHTS
## from IYR+IK
## note that not ALL predicted traits are on index
c(logFYLD=20,
  HI=10,
  DM=15,
  MCMDS=-10,
  logRTNO=12,
  logDYLD=20,
  logTOPYLD=15,
  PLTHT=10)
  logFYLD        HI        DM     MCMDS   logRTNO   logDYLD logTOPYLD     PLTHT 
       20        10        15       -10        12        20        15        10 

Selection index accuracy

library(ggdist)
accuracy_ad<-readRDS(here::here("output","cvAD_5rep5fold_predAccuracy.rds"))
accuracy_dirdom<-readRDS(here::here("output","cvDirDom_5rep5fold_predAccuracy.rds"))
accuracy<-accuracy_ad$meanPredAccuracy %>% 
  bind_rows(accuracy_dirdom$meanPredAccuracy) %>% 
  filter(Trait=="SELIND") %>% 
  mutate(VarComp=gsub("Mean","",predOf),
         predOf="Mean") %>% 
  bind_rows(accuracy_ad$varPredAccuracy %>% 
              bind_rows(accuracy_dirdom$varPredAccuracy) %>% 
              filter(Trait1=="SELIND") %>% 
              rename(Trait=Trait1) %>% 
              select(-Trait2) %>% 
              mutate(VarComp=gsub("Var","",predOf),
                     predOf="Var")) %>% 
  select(-predVSobs)

colors<-viridis::viridis(4)[1:2]

The figure below shows the ultimate summary of the cross-validation, the estimated accuracy predicting cross-means and cross-variances on the selection index. See further below for a break down by trait. Two models were tested and are compared: modelType=AD and modelType=DirDom. The y-axis “AccuracyEst” is the family-size weighted correlation between the predicted and observed cross means and variances. Predictions of breeding value (BV) and total genetic value (TGV) are distinguished in all plots and relate to the value of a cross for producing future parents and/or elite varieties, respectively among their offspring. Predictions of BV use only allele substitution effects (α). Predictions of TGV incorporate dominance effects/variance.

accuracy %>% 
  mutate(predOf=factor(predOf,levels=c("Mean","Var")),
         VarComp=factor(VarComp,levels=c("BV","TGV"))) %>% 
  ggplot(.,aes(x=VarComp,y=AccuracyEst,fill=VarComp)) +
  ggdist::stat_halfeye(adjust=0.5,.width = 0,fill='gray',width=0.75) +  
  geom_boxplot(width=0.12,notch = TRUE) +
  ggdist::stat_dots(side = "left",justification = 1.1,
                    binwidth = 0.03,dotsize=0.6) +
  theme_bw() + 
  scale_fill_manual(values = colors) + 
  geom_hline(yintercept = 0, color='black', size=0.9) +
  theme(axis.text.x = element_blank(),
        axis.title.x = element_blank(),
        strip.background = element_blank(),
        panel.grid.major = element_blank(),
        panel.grid.minor = element_blank(),
        axis.title = element_text(face='bold',color = 'black'),
        strip.text.x = element_text(face='bold',color='black',size=14),
        axis.text.y = element_text(face = 'bold',color='black'),
        legend.text = element_text(face='bold'),
        legend.position = 'bottom') +
  facet_grid(predOf~modelType,scales = 'free') + 
  #facet_wrap(~predOf+modelType,scales = 'free_y',nrow=1) + 
  labs(title="Selection Index Prediction Accuracy") + 
  coord_cartesian(xlim = c(1.2, NA))

Version Author Date
cc1eb4b wolfemd 2021-07-14

The DirDom model is at least as good, if not better than AD. Suggest proceeding to consider only DirDom model genomic predictions.

Prediction accuracy estimates are in output/ (here) with filenames ending _predAccuracy.rds.

For details on the cross-validation scheme, see the article (and for even more, the corresponding supplemental documentation here).

Cross-mean prediction accuracy

accuracy_ad$meanPredAccuracy %>% 
  bind_rows(accuracy_dirdom$meanPredAccuracy) %>% 
  select(-predVSobs) %>% 
  mutate(Trait=factor(Trait,levels=c("SELIND",blups$Trait)),
         predOf=factor(paste0(predOf,"_",modelType),levels=c("MeanBV_AD","MeanTGV_AD","MeanBV_DirDom","MeanTGV_DirDom"))) %>% 
  ggplot(.,aes(x=predOf,y=AccuracyEst,fill=predOf,color=modelType)) + 
  geom_boxplot(notch = TRUE, color='gray40') +
  # ggdist::stat_dots(side = "left", justification = 1.3,
  #                   binwidth = 0.03,dotsize=0.5,layout="swarm") +
  scale_fill_manual(values = viridis::viridis(4)[1:4]) + 
  scale_color_manual(values = viridis::viridis(4)[1:4]) + 
  geom_hline(yintercept = 0, color='black', size=0.8) +
  facet_grid(.~Trait) + 
  labs(title="Accuracy predicting cross-means") + 
  theme_bw() + 
  theme(axis.text.x = element_blank(),
        axis.title.x = element_blank(),
        strip.background = element_blank(),
        panel.grid.major = element_blank(),
        panel.grid.minor = element_blank(),
        legend.position='bottom',
        axis.text.y = element_text(face='bold'),
        axis.title.y = element_text(face='bold'),
        strip.text.x = element_text(face='bold'),
        plot.title = element_text(face='bold'),
        legend.title = element_text(face='bold'),
        legend.text = element_text(face='bold'),
        panel.spacing = unit(0.2, "lines"))

Version Author Date
baa7d80 wolfemd 2021-07-29
cc1eb4b wolfemd 2021-07-14
        # axis.text.y = element_text(face='bold',size=18),
        # axis.title.y = element_text(face='bold',size=18),
        # strip.text.x = element_text(face='bold', size=20),
        # plot.title = element_text(face='bold', size=24),
        # legend.title = element_text(face='bold', size=20),
        # legend.text = element_text(face='bold', size=20),

Cross variance param prediction accruacy

Cross variance pred. accuracy

accuracy_ad$varPredAccuracy %>% 
  bind_rows(accuracy_dirdom$varPredAccuracy) %>% 
  select(-predVSobs) %>% 
  filter(Trait1==Trait2) %>% 
  mutate(Trait1=factor(Trait1,levels=c("SELIND",blups$Trait)),
         predOf=factor(paste0(predOf,"_",modelType),
                       levels=c("VarBV_AD","VarTGV_AD",
                                "VarBV_DirDom","VarTGV_DirDom"))) %>% 
  ggplot(.,aes(x=predOf,y=AccuracyEst,fill=predOf,color=modelType)) + 
  geom_boxplot(notch = TRUE,color='gray40') +
  # ggdist::stat_dots(side = "left", justification = 1.3,layout='swarm',
  #                   binwidth = 0.03,dotsize=0.4) +
  scale_fill_manual(values = viridis::viridis(4)[1:4]) + 
  scale_color_manual(values = viridis::viridis(4)[1:4]) + 
  geom_hline(yintercept = 0, color='black', size=0.8) +
  facet_grid(.~Trait1) + 
  labs(title="Accuracy predicting cross-variances") + 
  theme_bw() + 
  theme(axis.text.x = element_blank(),
        axis.title.x = element_blank(),
        strip.background = element_blank(),
        panel.grid.major = element_blank(),
        panel.grid.minor = element_blank(),
        legend.position='bottom',
        axis.text.y = element_text(face='bold'),
        axis.title.y = element_text(face='bold'),
        strip.text.x = element_text(face='bold'),
        plot.title = element_text(face='bold'),
        legend.title = element_text(face='bold'),
        legend.text = element_text(face='bold'),
        panel.spacing = unit(0.2, "lines"))

Version Author Date
baa7d80 wolfemd 2021-07-29
cc1eb4b wolfemd 2021-07-14
        # axis.text.y = element_text(face='bold',size=18),
        # axis.title.y = element_text(face='bold',size=18),
        # strip.text.x = element_text(face='bold', size=20),
        # plot.title = element_text(face='bold', size=24),
        # legend.title = element_text(face='bold', size=20),
        # legend.text = element_text(face='bold', size=20),

Cross co-variance pred. accuracy

accuracy_ad$varPredAccuracy %>% 
  bind_rows(accuracy_dirdom$varPredAccuracy) %>% 
  select(-predVSobs) %>% 
  filter(Trait1!="SELIND",Trait2!="SELIND",
         Trait1!=Trait2) %>% 
  mutate(#Trait1=factor(Trait1,levels=c("SELIND",blups$Trait)),
         #Trait2=factor(Trait2,levels=c("SELIND",blups$Trait)),
         Trait1=factor(Trait1,levels=c(blups$Trait)),
         Trait2=factor(Trait2,levels=c(blups$Trait)),
         predOf=factor(paste0(predOf,"_",modelType),
                       levels=c("VarBV_AD","VarTGV_AD",
                                "VarBV_DirDom","VarTGV_DirDom"))) %>% 
  ggplot(.,aes(x=predOf,y=AccuracyEst,fill=predOf,color=modelType)) + 
  geom_boxplot(notch = TRUE) +
  scale_fill_manual(values = viridis::viridis(4)[1:4]) + 
  scale_color_manual(values = viridis::viridis(4)[1:4]) + 
  geom_hline(yintercept = 0, color='gray40', size=0.6) +
  facet_wrap(~Trait1+Trait2,nrow=5) + 
  labs(title="Accuracy predicting cross-covariances") + 
  theme_bw() + 
  theme(axis.text.x = element_blank(),
        axis.title.x = element_blank(),
        strip.background = element_blank(),
        panel.grid.major = element_blank(),
        panel.grid.minor = element_blank(),
        legend.position='bottom',
        axis.text.y = element_text(face='bold'),
        axis.title.y = element_text(face='bold'),
        strip.text.x = element_text(face='bold'),
        plot.title = element_text(face='bold'),
        legend.title = element_text(face='bold'),
        legend.text = element_text(face='bold'),
        panel.spacing = unit(0.2, "lines"))

Version Author Date
baa7d80 wolfemd 2021-07-29
cc1eb4b wolfemd 2021-07-14
rm(list=ls())

Genomic Predictions

After evaluating prediction accuracy, the genomic prediction step implements full-training dataset predictions and outputs tidy tables of selection criteria, including rankings on the selection index. For the sake of example, I selected 121 parents that were the union of parents ranking in the top 100 highest SELIND GEBV and/or GETGV as predicted by the DirDom and/or AD models. I predicted all 7381 crosses between these 121 pre-chosen parents and summarize those predictions below.

I find the accuracy results above compelling enough to focus on DirDom only below. In addition, below I focus on the selection index predictions (SELIND). Predictions of all component traits are also available. Feedback on presentation of results welcome!

Parent selection

library(tidyverse); library(magrittr); library(ggdist)
# crossPreds<-readRDS(file = here::here("output","genomicMatePredictions_top121parents_ModelDirDom.rds"))
# crossPreds<-crossPreds$tidyPreds[[1]]
crossPreds<-read.csv(here::here("output","genomicMatePredictions_top121parents_ModelAD.csv"), stringsAsFactors = F, header = T)

I predicted 7381 crosses of 121 parents originally selected as the union of elite parents predicted by both DirDom and AD models. So not all these parents are the absolute top in terms of SI GETGV and the DirDom model, which I will plot below. Since I made the predictions for those extra parents, they are plotted here.

for_selected_plot<-si_getgvs %>% 
  mutate(GeneticGroup=case_when(grepl("2013_|TMS13",GID)~"C1",
                                grepl("TMS14",GID)~"C2",
                                grepl("TMS15",GID)~"C3",
                                grepl("TMS18",GID)~"C4",
                                !grepl("2013_|TMS13|TMS14|TMS15|TMS18",GID)~"PreGS"))
for_selected_plot %>% 
  mutate(Cycle="AllGermplasm") %>% 
  bind_rows(for_selected_plot %>% 
              filter(GID %in% union(crossPreds$sireID,crossPreds$damID)) %>% 
              mutate(Cycle="SelectedParents")) %>% 
  mutate(Cycle=factor(Cycle,levels = c("AllGermplasm","SelectedParents")),
         GeneticGroup=factor(GeneticGroup,levels = c("PreGS","C1","C2","C3","C4"))) %>% 
  ggplot(.,aes(x=Cycle,y=SELIND,fill=Cycle)) + 
  ggdist::stat_halfeye(adjust=0.5,.width = 0,fill='gray',width=0.75) +  
  geom_boxplot(width=0.09,notch = TRUE) +
  ggdist::stat_dots(aes(color=GeneticGroup),side = "left",justification = 1.1,dotsize=.8) + 
  scale_fill_viridis_d() + scale_color_viridis_d() + 
  plottheme + 
  labs(title="Distribution of selection index GETGV in parents selected for mate predictions",
       subtitle="compared to the overall population")

Version Author Date
baa7d80 wolfemd 2021-07-29

Genomic mate selection

library(tidyverse); library(magrittr); library(ggdist)
# crossPreds<-readRDS(file = here::here("output","genomicMatePredictions_top121parents_ModelDirDom.rds"))
# crossPreds<-crossPreds$tidyPreds[[1]]
crossPreds<-read.csv(here::here("output","genomicMatePredictions_top121parents_ModelAD.csv"), stringsAsFactors = F, header = T)
forplot<-crossPreds %>% 
  filter(Trait=="SELIND") %>% 
  select(sireID,damID,CrossGroup,predOf,predMean,predSD)

cross_group_order<-crossPreds %>% 
  filter(Trait=="SELIND") %>% 
  distinct(sireGroup,damGroup) %>% 
  mutate(sireGroup=factor(sireGroup,levels=c("PreGS","C1","C2","C3","C4")),
         damGroup=factor(damGroup,levels=c("PreGS","C1","C2","C3","C4"))) %>% 
  arrange(desc(sireGroup),desc(damGroup)) %>% 
  mutate(CrossGroup=paste0(sireGroup,"x",damGroup)) %$% 
  CrossGroup

intensity<-function(propSel){ dnorm(qnorm(1-propSel))/propSel } # standardized selection intensity
stdSelInt=intensity(0.05); # stdSelInt [1] 2.062713
# qnorm(0.95); # [1] 1.644854
forplot %<>% 
  mutate(predUsefulness=predMean+(predSD*stdSelInt),
         CrossGroup=factor(CrossGroup,levels=c(cross_group_order)))

The standard budget for genotyping has been 2500 new clones per year.

Suppose we choose to create 50 families of 50 siblings each, from the the 7381 predicted crosses.

Quick digression: The input file has a pre-computed predUsefulness variable. I used stdSelInt=2.67 when making the predictions with the predictCrosses() function, but that corresponds to selecting the top 1% of each family. For a family of 50, the top 1% means only a single clone per family. In retrospect, I have decided to re-compute the predUsefulness targeting selection of the top 5% (or top 5 offspring) from each family. This corresponds to using stdSelInt = 2.0627128 and a selection threshold std. deviation of 1.6448536.

Crosses may be of interest for their predicted UCparent (predOf=="BV") and/or UCvariety (predOf=="TGV").

Each crossing nursery needs to produce both new exceptional parents and elite candidate cultivars. These will not necessarily be the same individuals or come from the same crosses.

# forplot %>% 
#   select(-predMean,-predSD) %>% 
#   spread(predOf,predUsefulness) %$% 
#   cor(BV,TGV) 
# [1] 0.999988
# The correlation between predUC BV and TGV is extremely high


# forplot %>% 
#   group_by(predOf) %>% 
#   slice_max(order_by = predUsefulness, n = 50) %>% ungroup() %>% 
#   distinct(sireID,damID) %>% nrow() # 50
# Also, the exact same 50 are ranked top predUC

In this case, the same 50 crosses are best for both UCparent (predOf=="BV") and UCvariety (predOf=="TGV").

First, display the entire set of predicted crosses, ranked by their selection index UCvariety. This will be more clear in the next plots with fewer families: the x-axis is simply the descending rank of predicted UCvariety for each cross. The y-axis shows the mean (dot) and distribution quantiles (linerange) based on the predicted mean and standard deviation of each cross. Crosses are color coded according to the “genetic group” of the parents.

forplot %>% 
  filter(predOf=="TGV") %>% 
  #slice_max(order_by = predUsefulness, n = 100) %>% 
  arrange(desc(predUsefulness)) %>% 
  mutate(Rank=1:nrow(.)) %>% 
  ggplot(aes(x = Rank, dist = "norm", 
             arg1 = predMean, arg2 = predSD, 
             fill=CrossGroup, color=CrossGroup),
         alpha=0.5) +
  stat_dist_pointinterval() +
  #stat_dist_gradientinterval(n=50) +
  scale_fill_viridis_d() + scale_color_viridis_d() + 
  plottheme + theme() + 
  labs(x = paste0("Cross Rank ",expression(bold("UC"["variety"]~" (TGV)"))),
       y = "Selection Index GETGV",
       title = "Predicted distribution (mean and variability) of all crosses")

Version Author Date
baa7d80 wolfemd 2021-07-29

Next, sub

forplot %>% 
  filter(predOf=="TGV") %>% 
  slice_max(order_by = predUsefulness, n = 50) %>% 
  arrange(desc(predUsefulness)) %>% 
  mutate(Rank=1:nrow(.)) %>% 
    ggplot(aes(x = Rank, dist = "norm", 
               arg1 = predMean, arg2 = predSD, 
               fill=CrossGroup, color=CrossGroup),
           alpha=0.5) +
    stat_dist_gradientinterval(n=100) +
    scale_fill_viridis_d() + scale_color_viridis_d() + 
  plottheme + theme(axis.text.x = element_text(face='bold'),
                    axis.title.x = element_text(face = 'bold')) + 
  labs(x = expression(bold("Rank on SELIND - UC"["variety"])),
       y = "Selection Index GETGV",
       title = "Predicted distribution of the top 50 crosses")

Version Author Date
baa7d80 wolfemd 2021-07-29

The best 5 crosses to make:

forplot %>% 
  filter(predOf=="TGV") %>% 
  slice_max(order_by = predUsefulness, n = 5) %>% rmarkdown::paged_table()
ABCDEFGHIJ0123456789
sireID
<chr>
damID
<chr>
CrossGroup
<fct>
predOf
<chr>
predMean
<dbl>
predSD
<dbl>
predUsefulness
<dbl>
TMS15F1471P0080:250482464TMS15F1471P0080:250482464C3xC3TGV273.616217.49895309.7115
TMS15F1471P0080:2504824642013_10139_12:250164378C3xC1TGV258.970220.60475301.4719
TMS15F1471P0080:2504824642013_10139_22:250164388C3xC1TGV259.264020.20805300.9474
2013_10139_12:2501643782013_10139_12:250164378C1xC1TGV244.324323.63646293.0795
2013_10139_22:2501643882013_10139_12:250164378C1xC1TGV244.618123.12648292.3214

Interestingly (?) no cross of C4 (TMS18) clones are yet recommended (in the top 50) on this ranking. The highest rank C4 cross is the 214th from top (see below) along with two other TMS18 x TMS15 crosses.

forplot %>% 
  filter(predOf=="TGV") %>% 
  arrange(desc(predUsefulness)) %>% 
  mutate(Rank=1:nrow(.)) %>% 
  filter(grepl("C4",CrossGroup)) %>% slice(1:10) %>% 
  rmarkdown::paged_table()
ABCDEFGHIJ0123456789
sireID
<chr>
damID
<chr>
CrossGroup
<fct>
predOf
<chr>
predMean
<dbl>
predSD
<dbl>
predUsefulness
<dbl>
Rank
<int>
TMS15F1471P0080:250482464TMS18F1457P0005_A19406C3xC4TGV215.893220.69547258.5820214
TMS15F1471P0080:250482464TMS18F1457P0009_A19236C3xC4TGV217.738919.57696258.1205215
TMS15F1471P0080:250482464TMS18F1399P0036_A18283C3xC4TGV209.132222.20299254.9306238
2013_10139_12:250164378TMS18F1457P0005_A19406C1xC4TGV201.247323.50480249.7309375
2013_10139_12:250164378TMS18F1457P0009_A19236C1xC4TGV203.093022.53923249.5849386
2013_10139_22:250164388TMS18F1457P0005_A19406C1xC4TGV201.541023.15062249.2941410
2013_10139_22:250164388TMS18F1457P0009_A19236C1xC4TGV203.386722.16216249.1009424
2013_10139_12:250164378TMS18F1399P0036_A18283C1xC4TGV194.486324.85371245.7523791
2013_10139_22:250164388TMS18F1399P0036_A18283C1xC4TGV194.780024.50146245.3195845
TMS15F1471P0080:250482464TMS18F1485P0024_A19068C3xC4TGV203.220520.04130244.56001004

In this last plot, the area under the top 5% of each crosses predicted distribution is highlighted. The mean of individuals from under the highlighted area is the UCvariety. There are also 50 dots for each cross illustrating the hypothetical outcome of creating 50 progeny.

forplot %>% 
  filter(predOf=="TGV") %>% 
  slice_max(order_by = predUsefulness, n = 50) %>% 
  arrange(desc(predUsefulness)) %>% 
  mutate(Rank=1:nrow(.)) %>% 
  ggplot(aes(x = Rank, dist = "norm", 
             arg1 = predMean, arg2 = predSD, 
             fill = CrossGroup,
             label = CrossGroup)) +
  stat_dist_gradientinterval(n=100,side='top',position = "dodgejust",
                         aes(fill = stat(y < (arg1+arg2*qnorm(0.95))))) +
  stat_dist_dotsinterval(n=50,side='both',position = "dodgejust",
                         aes(fill = stat(y < (arg1+arg2*qnorm(0.95))))) +
  scale_fill_viridis_d() + scale_color_viridis_d() + 
  plottheme + theme(axis.text.x = element_text(face='bold'),
                    axis.title.x = element_text(face = 'bold'),
                    legend.position = 'none') + 
  labs(x = expression(bold("Rank on SELIND - UC"["variety"])),
       y = "Selection Index GETGV",
       title = "Predicted distribution of the top 50 crosses")

Version Author Date
baa7d80 wolfemd 2021-07-29

Or for more clarity, just the top 5 crosses:

forplot %>% 
  filter(predOf=="TGV") %>% 
  slice_max(order_by = predUsefulness, n = 5) %>% 
  arrange(desc(predUsefulness)) %>% 
  mutate(Rank=1:nrow(.)) %>% 
  ggplot(aes(y = Rank, dist = "norm", 
             arg1 = predMean, arg2 = predSD, 
             label = paste0(sireID,"\n x ",damID)),
         alpha=0.5) +
  stat_dist_dotsinterval(n=50,side='top',position = "dodgejust",scale=0.85,
                         aes(fill = stat(x < (arg1+arg2*qnorm(0.95))))) +
  stat_dist_halfeye(position = "dodgejust",scale=1.25, alpha=0.5,
                    aes(fill = stat(x < (arg1+arg2*qnorm(0.95))))) +
  geom_label(aes(x=predMean),size=3) + 
  scale_fill_viridis_d() + scale_color_viridis_d() + 
  plottheme + theme(axis.text.x = element_text(face='bold'),
                    axis.title.x = element_text(face = 'bold'),
                    legend.position = 'none') + 
  labs(y = expression(bold("Rank on SELIND - UC"["variety"])),
       x = "Selection Index GETGV",
       title = "Predicted distribution of the top 5 crosses")

Version Author Date
baa7d80 wolfemd 2021-07-29

Table of Top 50 Crosses: 2 rows for each cross, one for predOf=="BV" one for predOf=="TGV".

top50crosses<-forplot %>% 
  group_by(predOf) %>% 
  slice_max(order_by = predUsefulness, n = 50) %>% 
  ungroup()
top50crosses %>% 
  write.csv(.,file = here::here("output","top50crosses.csv"), row.names = F)
top50crosses %>% 
  rmarkdown::paged_table()
ABCDEFGHIJ0123456789
sireID
<chr>
damID
<chr>
CrossGroup
<fct>
predOf
<chr>
predMean
<dbl>
predSD
<dbl>
predUsefulness
<dbl>
TMS15F1471P0080:250482464TMS15F1471P0080:250482464C3xC3BV273.616217.01585308.7150
TMS15F1471P0080:2504824642013_10139_12:250164378C3xC1BV258.970220.20874300.6551
TMS15F1471P0080:2504824642013_10139_22:250164388C3xC1BV259.264019.83811300.1843
2013_10139_12:2501643782013_10139_12:250164378C1xC1BV244.324322.96186291.6880
2013_10139_22:2501643882013_10139_12:250164378C1xC1BV244.618122.63635291.3104
2013_10139_22:2501643882013_10139_22:250164388C1xC1BV244.911822.30609290.9229
TMS15F1471P0080:2504824642013_10139_17:250164383C3xC1BV240.139021.43368284.3505
TMS15F1471P0080:2504824642013_10139_3:250164369C3xC1BV242.665319.18058282.2293
TMS15F1471P0080:2504824642013_10139_14:250164380C3xC1BV242.598018.89266281.5681
TMS15F1471P0080:2504824642013_10139_9:250164375C3xC1BV239.972120.11136281.4561

[FUTURE] Compare PHG-to-Beagle data

[TO DO] PCA

[TO DO] Compare kinships

[TO DO] Prediction accuracy - PHG

[TO DO] Compare PHG-to-Beagle accuracy


sessionInfo()
R version 4.1.0 (2021-05-18)
Platform: x86_64-apple-darwin17.0 (64-bit)
Running under: macOS Big Sur 10.16

Matrix products: default
BLAS:   /Library/Frameworks/R.framework/Versions/4.1/Resources/lib/libRblas.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/4.1/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] ggdist_3.0.0    ragg_1.1.3      magrittr_2.0.1  forcats_0.5.1  
 [5] stringr_1.4.0   dplyr_1.0.7     purrr_0.3.4     readr_2.0.0    
 [9] tidyr_1.1.3     tibble_3.1.3    ggplot2_3.3.5   tidyverse_1.3.1
[13] workflowr_1.6.2

loaded via a namespace (and not attached):
 [1] viridis_0.6.1        httr_1.4.2           sass_0.4.0          
 [4] jsonlite_1.7.2       viridisLite_0.4.0    splines_4.1.0       
 [7] here_1.0.1           modelr_0.1.8         bslib_0.2.5.1       
[10] assertthat_0.2.1     distributional_0.2.2 highr_0.9           
[13] cellranger_1.1.0     yaml_2.2.1           lattice_0.20-44     
[16] pillar_1.6.2         backports_1.2.1      glue_1.4.2          
[19] digest_0.6.27        promises_1.2.0.1     rvest_1.0.1         
[22] colorspace_2.0-2     Matrix_1.3-4         htmltools_0.5.1.1   
[25] httpuv_1.6.1         pkgconfig_2.0.3      broom_0.7.9         
[28] haven_2.4.1          scales_1.1.1         whisker_0.4         
[31] later_1.2.0          tzdb_0.1.2           git2r_0.28.0        
[34] mgcv_1.8-36          generics_0.1.0       farver_2.1.0        
[37] ellipsis_0.3.2       withr_2.4.2          cli_3.0.1           
[40] crayon_1.4.1         readxl_1.3.1         evaluate_0.14       
[43] fs_1.5.0             fansi_0.5.0          nlme_3.1-152        
[46] xml2_1.3.2           textshaping_0.3.5    tools_4.1.0         
[49] hms_1.1.0            lifecycle_1.0.0      munsell_0.5.0       
[52] reprex_2.0.0         compiler_4.1.0       jquerylib_0.1.4     
[55] systemfonts_1.0.2    rlang_0.4.11         grid_4.1.0          
[58] rstudioapi_0.13      labeling_0.4.2       rmarkdown_2.9       
[61] gtable_0.3.0         DBI_1.1.1            R6_2.5.0            
[64] gridExtra_2.3        lubridate_1.7.10     knitr_1.33          
[67] utf8_1.2.2           rprojroot_2.0.2      stringi_1.7.3       
[70] Rcpp_1.0.7           vctrs_0.3.8          dbplyr_2.1.1        
[73] tidyselect_1.1.1     xfun_0.24