Last updated: 2020-09-22

Checks: 6 1

Knit directory: R_gene_analysis/

This reproducible R Markdown analysis was created with workflowr (version 1.6.2.9000). The Checks tab describes the reproducibility checks that were applied when the results were created. The Past versions tab lists the development history.


The R Markdown file has unstaged changes. To know which version of the R Markdown file created these results, you’ll want to first commit it to the Git repo. If you’re still working on the analysis, you can ignore this warning. When you’re finished, you can run wflow_publish to commit the R Markdown file and build the HTML.

Great job! The global environment was empty. Objects defined in the global environment can affect the analysis in your R Markdown file in unknown ways. For reproduciblity it’s best to always run the code in an empty environment.

The command set.seed(20200917) was run prior to running the code in the R Markdown file. Setting a seed ensures that any results that rely on randomness, e.g. subsampling or permutations, are reproducible.

Great job! Recording the operating system, R version, and package versions is critical for reproducibility.

Nice! There were no cached chunks for this analysis, so you can be confident that you successfully produced the results during this run.

Great job! Using relative paths to the files within your workflowr project makes it easier to run your code on other machines.

Great! You are using Git for version control. Tracking code development and connecting the code version to the results is critical for reproducibility.

The results in this page were generated with repository version 50f995b. See the Past versions tab to see a history of the changes made to the R Markdown and HTML files.

Note that you need to be careful to ensure that all relevant files for the analysis have been committed to Git prior to generating the results (you can use wflow_publish or wflow_git_commit). workflowr only checks the R Markdown file, but you know if there are other scripts or data files that it depends on. Below is the status of the Git repository when the results were generated:


Ignored files:
    Ignored:    .Rhistory
    Ignored:    .Rproj.user/

Unstaged changes:
    Modified:   analysis/index.Rmd
    Modified:   analysis/total_numbers.Rmd

Note that any generated files, e.g. HTML, png, CSS, etc., are not included in this status report because it is ok for generated content to have uncommitted changes.


These are the previous versions of the repository in which changes were made to the R Markdown (analysis/total_numbers.Rmd) and HTML (docs/total_numbers.html) files. If you’ve configured a remote Git repository (see ?wflow_git_remote), click on the hyperlinks in the table below to view the files as they were in that past version.

File Version Author Date Message
Rmd 50f995b Philipp Bayer 2020-09-21 Add more analysis
html 50f995b Philipp Bayer 2020-09-21 Add more analysis
html e6e9b9b Philipp Bayer 2020-09-21 Build site.
Rmd c1fbbf9 Philipp Bayer 2020-09-21 wflow_publish("analysis/*")
Rmd c71005a Philipp Bayer 2020-09-18 lots changes
html c71005a Philipp Bayer 2020-09-18 lots changes
html 7d33bac Philipp Bayer 2020-09-18 Build site.
Rmd 695db1e Philipp Bayer 2020-09-18 wflow_publish(c(“analysis/eda.Rmd”, “analysis/first-analysis.Rmd”,

This is the same analysis as first-analysis, but with total numbers, not percentages genes lost

knitr::opts_chunk$set(warning = FALSE, message = FALSE) 
library(tidyverse)
library(patchwork)
library(ggsci)
library(dabestr)
library(dabestr)
library(cowplot)
library(ggsignif)
library(ggforce)

theme_set(theme_cowplot())

Introduction

npg_col = pal_npg("nrc")(9)
col_list <- c(`Wild-type`=npg_col[8],
   Landrace = npg_col[3],
  `Old cultivar`=npg_col[2],
  `Modern cultivar`=npg_col[4])

pav_table <- read_tsv('./data/soybean_pan_pav.matrix_gene.txt.gz')

NBS part

Let’s pull the NBS genes from the table

nbs <- read_tsv('./data/Lee.NBS.candidates.lst', col_names = c('Name', 'Class'))
nbs
# A tibble: 486 x 2
   Name                   Class
   <chr>                  <chr>
 1 UWASoyPan00953.t1      CN   
 2 GlymaLee.13G222900.1.p CN   
 3 GlymaLee.18G227000.1.p CN   
 4 GlymaLee.18G080600.1.p CN   
 5 GlymaLee.20G036200.1.p CN   
 6 UWASoyPan01876.t1      CN   
 7 UWASoyPan04211.t1      CN   
 8 GlymaLee.19G105400.1.p CN   
 9 GlymaLee.18G085100.1.p CN   
10 GlymaLee.11G142600.1.p CN   
# ... with 476 more rows
# have to remove the .t1s 
nbs$Name <- gsub('.t1','', nbs$Name)
nbs_pav_table <- pav_table %>% filter(Individual %in% nbs$Name)

Modern vs Old gene loss

groups <- read_csv('./data/Table_of_cultivar_groups.csv')
groups
# A tibble: 1,069 x 3
   `Data-storage-ID` `PI-ID`   `Group in violin table`
   <chr>             <chr>     <chr>                  
 1 SRR1533284        PI416890  landrace               
 2 SRR1533282        PI323576  landrace               
 3 SRR1533292        PI157421  landrace               
 4 SRR1533216        PI594615  landrace               
 5 SRR1533239        PI603336  landrace               
 6 USB-108           PI165675  landrace               
 7 HNEX-13           PI253665D landrace               
 8 USB-382           PI603549  landrace               
 9 SRR1533236        PI587552  landrace               
10 SRR1533332        PI567293  landrace               
# ... with 1,059 more rows

Which genes are present more or less in old / modern cultivars?

big_norm_count <- tibble(
  name = character(),
  landrace = numeric(),
  Modern_cultivar = numeric(),
  Old_cultivar = numeric(),
  `Wild-type` = numeric()
)

groups_list <- split(groups$`Group in violin table`, groups$`Data-storage-ID`)

for( i in 1:nrow(nbs_pav_table) ) {
  this_gene <- nbs_pav_table[i,]
  groups_count <- list()
  total_groups_count <- list()
  for (x in seq_along(nbs_pav_table)){
    if ( x == 1) next
    thisind <- colnames(nbs_pav_table)[x]
    thisind_group <- groups_list[[thisind]]
    if( is.null(thisind_group) ) next # no group for this individual
    pavs <- this_gene[[x]] # either 1 or 0
    
    if ( thisind_group %in% names(groups_count)) {
      # count the number of present genes
      groups_count[[thisind_group]] <- groups_count[[thisind_group]] + pavs
      # count the total number of individuals for this group
      total_groups_count[[thisind_group]] <- total_groups_count[[thisind_group]] + 1
    } else {
      groups_count[[thisind_group]] <-  pavs
      total_groups_count[[thisind_group]] <- 1
    }
  }
  
  norm_group_count <- list()
  for (m in seq_along(groups_count)) {
    thisname <- names(groups_count)[m]
    norm_group_count[[thisname]] <- groups_count[[thisname]] / total_groups_count[[thisname]] * 100
  }
  norm_group_count$Individual <- this_gene$Individual
  big_norm_count <- rbind(big_norm_count, as_tibble(norm_group_count))

}

# wow, I DO write R like Python

Let’s pull out the genes that are variable in any group

var_norm_count <- big_norm_count %>% 
  filter(landrace != 100 & 
           Modern_cultivar != 100 & 
           Old_cultivar != 100 & 
           `Wild-type` != 100)

var_norm_count <- left_join(var_norm_count, nbs, by=c('Individual'='Name'))
var_norm_count$Mod_minus_Old <- var_norm_count$Modern_cultivar - var_norm_count$Old_cultivar

The top 20 genes reduced the most in modern cultivars compared with old cultivars:

var_norm_count %>% 
  arrange(Mod_minus_Old) %>% 
  head(20) %>% 
  select(Individual, `Wild-type`, landrace, Old_cultivar, Modern_cultivar, Mod_minus_Old, Class) %>% 
  knitr::kable()
Individual Wild-type landrace Old_cultivar Modern_cultivar Mod_minus_Old Class
UWASoyPan03261 74.52229 62.102351 60.86957 26.573427 -34.296139 TX
UWASoyPan00953 83.43949 33.056708 43.47826 11.188811 -32.289450 CN
UWASoyPan00725 89.17197 92.392808 82.60870 51.748252 -30.860444 TX
UWASoyPan00316 91.08280 90.594744 80.43478 50.349650 -30.085132 NBS
UWASoyPan01530 80.89172 45.089903 47.82609 20.979021 -26.847066 NL
UWASoyPan00975 42.67516 19.778700 32.60870 7.692308 -24.916388 TX
UWASoyPan00155 85.35032 77.316736 63.04348 42.657343 -20.386136 NBS
UWASoyPan00772 54.77707 18.948824 30.43478 11.888112 -18.546671 NBS
UWASoyPan03402 62.42038 42.738589 36.95652 18.881119 -18.075403 NBS
UWASoyPan01320 45.85987 30.152144 23.91304 6.993007 -16.920036 NBS
UWASoyPan02799 65.60510 30.843707 19.56522 2.797203 -16.768015 NBS
GlymaLee.03G045500.1.p 82.16561 58.921162 67.39130 51.748252 -15.643053 OTHER
UWASoyPan01253 73.24841 24.481328 17.39130 3.496504 -13.894801 NBS
UWASoyPan03340 18.47134 26.279391 19.56522 6.293706 -13.271511 TX
GlymaLee.06G230600.1.p 78.34395 57.399723 56.52174 44.055944 -12.465795 TX
GlymaLee.06G228900.1.p 96.17834 74.688797 91.30435 80.419580 -10.884767 TX
UWASoyPan00251 60.50955 20.470263 21.73913 11.188811 -10.550319 NL
GlymaLee.03G045700.1.p 75.79618 67.496542 65.21739 55.244755 -9.972636 OTHER
GlymaLee.06G228600.1.p 90.44586 79.391425 82.60870 72.727273 -9.881423 TX
UWASoyPan00670 30.57325 6.915629 10.86957 2.097902 -8.771663 TX

So these are the NLR genes selected against during soybean breeding.

Let’s look at those without the TX ones:

var_norm_count %>% 
  arrange(Mod_minus_Old) %>% 
  select(Individual, `Wild-type`, landrace, Old_cultivar, Modern_cultivar, Mod_minus_Old, Class) %>% 
  filter(Class != 'TX') %>% 
  head(20) %>% 
  knitr::kable()
Individual Wild-type landrace Old_cultivar Modern_cultivar Mod_minus_Old Class
UWASoyPan00953 83.439490 33.0567082 43.478261 11.188811 -32.289450 CN
UWASoyPan00316 91.082802 90.5947441 80.434783 50.349650 -30.085132 NBS
UWASoyPan01530 80.891720 45.0899032 47.826087 20.979021 -26.847066 NL
UWASoyPan00155 85.350319 77.3167358 63.043478 42.657343 -20.386136 NBS
UWASoyPan00772 54.777070 18.9488243 30.434783 11.888112 -18.546671 NBS
UWASoyPan03402 62.420382 42.7385892 36.956522 18.881119 -18.075403 NBS
UWASoyPan01320 45.859873 30.1521438 23.913044 6.993007 -16.920036 NBS
UWASoyPan02799 65.605096 30.8437068 19.565217 2.797203 -16.768015 NBS
GlymaLee.03G045500.1.p 82.165605 58.9211618 67.391304 51.748252 -15.643053 OTHER
UWASoyPan01253 73.248408 24.4813278 17.391304 3.496504 -13.894801 NBS
UWASoyPan00251 60.509554 20.4702628 21.739130 11.188811 -10.550319 NL
GlymaLee.03G045700.1.p 75.796178 67.4965422 65.217391 55.244755 -9.972636 OTHER
UWASoyPan03194 45.222930 10.7883817 10.869565 2.097902 -8.771663 NBS
GlymaLee.03G045900.1.p 74.522293 65.5601660 63.043478 54.545454 -8.498024 OTHER
UWASoyPan00326 40.764331 19.0871369 17.391304 9.790210 -7.601095 CN
UWASoyPan02496 26.114650 14.6611342 13.043478 6.993007 -6.050471 CN
UWASoyPan01217 57.961783 14.3845090 10.869565 5.594406 -5.275160 NBS
GlymaLee.06G229300.1.p 86.624204 50.3457815 65.217391 60.839161 -4.378230 TN
UWASoyPan04757 6.369427 0.9681881 4.347826 0.000000 -4.347826 NBS
GlymaLee.03G042000.1.p 99.363057 88.3817427 91.304348 88.111888 -3.192460 CNL

The top 20 genes increased the most in modern cultivars:

var_norm_count %>% 
  arrange(desc(Mod_minus_Old)) %>% 
  head(20) %>% 
  select(Individual, `Wild-type`, landrace, Old_cultivar, Modern_cultivar, Mod_minus_Old, Class) %>% 
  knitr::kable()
Individual Wild-type landrace Old_cultivar Modern_cultivar Mod_minus_Old Class
GlymaLee.01G030900.1.p 81.52866 50.345782 34.782609 71.328671 36.546063 NL
GlymaLee.15G199500.1.p 85.98726 73.582296 69.565217 86.713287 17.148069 CN
GlymaLee.15G199200.1.p 92.99363 74.827109 73.913044 90.909091 16.996047 CNL
GlymaLee.06G232800.1.p 40.12739 47.579530 56.521739 73.426573 16.904834 NBS
GlymaLee.01G088400.1.p 49.68153 89.488243 82.608696 97.202797 14.594102 TNL
UWASoyPan05312 30.57325 8.575380 10.869565 23.776224 12.906659 NBS
UWASoyPan00005 36.94268 13.831259 8.695652 20.279720 11.584068 NBS
UWASoyPan01876 43.31210 15.629322 8.695652 20.279720 11.584068 CN
GlymaLee.10G034600.1.p 91.08280 91.839557 84.782609 95.104895 10.322286 NL
UWASoyPan01330 29.29936 25.172891 15.217391 23.776224 8.558832 NBS
UWASoyPan00202 50.31847 35.408022 26.086956 32.167832 6.080876 NBS
UWASoyPan00427 97.45223 84.232365 73.913044 79.720280 5.807236 NBS
GlymaLee.15G199300.1.p 96.17834 88.243430 93.478261 98.601399 5.123138 NL
GlymaLee.03G070700.1.p 65.60510 89.903181 89.130435 93.706294 4.575859 TNL
GlymaLee.06G229100.1.p 85.98726 38.174274 50.000000 53.146853 3.146853 TX
GlymaLee.07G070200.1.p 78.98089 91.286307 91.304348 94.405594 3.101247 NBS
UWASoyPan01418 68.78981 66.251729 71.739130 74.825175 3.086044 TX
GlymaLee.03G070600.1.p 66.87898 90.179806 91.304348 93.706294 2.401946 TNL
GlymaLee.16G175200.1.p 95.54140 96.127248 95.652174 97.902098 2.249924 TNL
UWASoyPan04967 35.03185 6.224066 2.173913 4.195804 2.021891 TX

As these genes have relatively high percentages in WT they must have been re-introduced by using WT in the breeding process.

Presence plotting per individual

names <- c()
presences <- c()

for (i in seq_along(nbs_pav_table)){
  if ( i == 1) next
  thisind <- colnames(nbs_pav_table)[i]
  pavs <- nbs_pav_table[[i]]
  presents <- sum(pavs)
  names <- c(names, thisind)
  presences <- c(presences, presents)
}
nbs_res_tibb <- new_tibble(list(names = names, presences = presences))

OK what do these presence percentages look like?

ggplot(data=nbs_res_tibb, aes(x=presences)) + geom_histogram(bins=25) 

On average, 446.0027027 of NBS genes are present in each individual.

Now let’s join the table of presences to the four different types so we can group these numbers.

nbs_joined_groups <- left_join(nbs_res_tibb, groups, by = c('names'='Data-storage-ID'))
nbs_joined_groups$`Group in violin table` <- gsub('landrace', 'Landrace', nbs_joined_groups$`Group in violin table`)
nbs_joined_groups$`Group in violin table` <- gsub('Modern_cultivar', 'Modern cultivar', nbs_joined_groups$`Group in violin table`)
nbs_joined_groups$`Group in violin table` <- gsub('Old_cultivar', 'Old cultivar', nbs_joined_groups$`Group in violin table`)

nbs_joined_groups$`Group in violin table` <- factor(nbs_joined_groups$`Group in violin table`, levels=c(NA, 'Wild-type', 'Landrace', 'Old cultivar', 'Modern cultivar'))
nbs_vio <- nbs_joined_groups %>% filter(!is.na(`Group in violin table`)) %>% 
  ggplot(aes(y=presences, x=`Group in violin table`, fill=`Group in violin table`)) + 
  geom_violin(draw_quantiles = c(0.5)) +
  geom_sina(alpha=0.5) +
  geom_smooth(aes(group=1), method='glm') +
  scale_fill_manual(values=col_list) +
  guides(fill = FALSE)

nbs_vio

nbs_joined_groups %>% filter(`Group in violin table` != 'NA') %>% 
  ggplot(aes(y=presences, x=`Group in violin table`, fill=`Group in violin table`)) + 
  geom_smooth(aes(group=1), method='lm', se = FALSE) +
  geom_jitter() +
  scale_fill_manual(values=col_list)+
  guides(fill = FALSE)

nbs_joined_groups %>% filter(!is.na(`PI-ID`)) %>% 
  group_by(`Group in violin table`) %>% 
  summarise(min_present = min(presences),
            max_present = max(presences),
            mean_present = mean(presences),
            median_present = median(presences),
            std_present = sd(presences)) %>% 
  knitr::kable()
Group in violin table min_present max_present mean_present median_present std_present
Wild-type 435 473 452.9490 453 7.170806
Landrace 429 465 444.8907 445 5.011672
Old cultivar 433 456 444.8696 445 5.200892
Modern cultivar 431 455 442.3147 442 4.047986

RLK part

Let’s do the same plot with RLKs

rlk <- read_tsv('./data/Lee.RLK.candidates.lst', col_names = c('Name', 'Class', 'Subtype'))
rlk
# A tibble: 1,173 x 3
   Name                   Class Subtype       
   <chr>                  <chr> <chr>         
 1 GlymaLee.01G001800.1.p RLK   lrr           
 2 GlymaLee.01G004900.1.p RLK   lrr           
 3 GlymaLee.01G007300.1.p RLK   lrr           
 4 GlymaLee.01G007400.1.p RLK   lrr           
 5 GlymaLee.01G012800.1.p RLK   other_receptor
 6 GlymaLee.01G018800.1.p RLK   lrr           
 7 GlymaLee.01G021100.1.p RLK   other_receptor
 8 GlymaLee.01G025500.1.p RLK   lysm          
 9 GlymaLee.01G026500.1.p RLK   other_receptor
10 GlymaLee.01G027000.1.p RLK   lrr           
# ... with 1,163 more rows
# have to remove the .t1s 
rlk$Name <- gsub('.t1','', rlk$Name)
rlk_pav_table <- pav_table %>% filter(Individual %in% rlk$Name)
rlk_pav_table
# A tibble: 1,173 x 1,111
   Individual `AB-01` `AB-02` `BR-01` `BR-02` `BR-03` `BR-04` `BR-05` `BR-06`
   <chr>        <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>
 1 GlymaLee.~       1       1       1       1       1       1       1       1
 2 GlymaLee.~       1       1       1       1       1       1       1       1
 3 GlymaLee.~       1       1       1       1       1       1       1       1
 4 GlymaLee.~       1       1       1       1       1       1       1       1
 5 GlymaLee.~       1       1       1       1       1       1       1       1
 6 GlymaLee.~       1       1       1       1       1       1       1       1
 7 GlymaLee.~       1       1       1       1       1       1       1       1
 8 GlymaLee.~       1       1       1       1       1       1       1       1
 9 GlymaLee.~       1       1       1       1       1       1       1       1
10 GlymaLee.~       1       1       1       1       1       1       1       1
# ... with 1,163 more rows, and 1,102 more variables: `BR-07` <dbl>,
#   `BR-08` <dbl>, `BR-09` <dbl>, `BR-10` <dbl>, `BR-11` <dbl>, `BR-12` <dbl>,
#   `BR-13` <dbl>, `BR-14` <dbl>, `BR-15` <dbl>, `BR-16` <dbl>, `BR-17` <dbl>,
#   `BR-18` <dbl>, `BR-20` <dbl>, `BR-23` <dbl>, `BR-24` <dbl>, `BR-29` <dbl>,
#   `BR-30` <dbl>, `BR-32` <dbl>, DT2000 <dbl>, ESS <dbl>, For <dbl>,
#   HN001 <dbl>, HN002 <dbl>, HN003 <dbl>, HN004 <dbl>, HN005 <dbl>,
#   HN006 <dbl>, HN007 <dbl>, HN008 <dbl>, HN009 <dbl>, HN010 <dbl>,
#   HN011 <dbl>, HN012 <dbl>, HN013 <dbl>, HN015 <dbl>, HN016B <dbl>,
#   HN017B <dbl>, HN018 <dbl>, HN019 <dbl>, HN021 <dbl>, HN022 <dbl>,
#   HN023 <dbl>, HN024 <dbl>, HN025 <dbl>, HN026 <dbl>, HN027 <dbl>,
#   HN028 <dbl>, HN029 <dbl>, HN030 <dbl>, HN031 <dbl>, HN032 <dbl>,
#   HN033 <dbl>, HN034 <dbl>, HN035 <dbl>, HN036 <dbl>, HN037 <dbl>,
#   HN038 <dbl>, HN039 <dbl>, HN040 <dbl>, HN041 <dbl>, HN042 <dbl>,
#   HN043 <dbl>, HN044 <dbl>, HN045 <dbl>, HN046 <dbl>, HN047 <dbl>,
#   HN048 <dbl>, HN049 <dbl>, HN050 <dbl>, HN051 <dbl>, HN052 <dbl>,
#   HN053 <dbl>, HN054 <dbl>, HN055 <dbl>, HN056 <dbl>, HN057 <dbl>,
#   HN058 <dbl>, HN059 <dbl>, HN060 <dbl>, HN061 <dbl>, HN062 <dbl>,
#   HN063 <dbl>, HN064 <dbl>, HN065 <dbl>, HN066 <dbl>, HN067 <dbl>,
#   HN068 <dbl>, HN069 <dbl>, HN070 <dbl>, HN071 <dbl>, HN072 <dbl>,
#   HN073 <dbl>, HN074 <dbl>, HN075 <dbl>, HN076 <dbl>, HN077 <dbl>,
#   HN078 <dbl>, HN079 <dbl>, HN080 <dbl>, HN081 <dbl>, ...
names <- c()
presences <- c()

for (i in seq_along(rlk_pav_table)){
  if ( i == 1) next
  thisind <- colnames(rlk_pav_table)[i]
  pavs <- rlk_pav_table[[i]]
  presents <- sum(pavs)
  names <- c(names, thisind)
  presences <- c(presences, presents)
}
rlk_res_tibb <- new_tibble(list(names = names, presences = presences))
rlk_res_tibb
# A tibble: 1,110 x 2
   names presences
   <chr>     <dbl>
 1 AB-01      1167
 2 AB-02      1162
 3 BR-01      1166
 4 BR-02      1165
 5 BR-03      1166
 6 BR-04      1167
 7 BR-05      1164
 8 BR-06      1167
 9 BR-07      1165
10 BR-08      1167
# ... with 1,100 more rows

OK what do these presence percentages look like?

ggplot(data=rlk_res_tibb, aes(x=presences)) + geom_histogram(bins=25) 

On average, 1163.5036036% of NBS genes are present in each individual.

Now let’s join the table of presences to the four different types so we can group these numbers.

rlk_joined_groups <- left_join(rlk_res_tibb, groups, by = c('names'='Data-storage-ID'))
rlk_joined_groups$`Group in violin table` <- gsub('landrace', 'Landrace', rlk_joined_groups$`Group in violin table`)
rlk_joined_groups$`Group in violin table` <- gsub('Modern_cultivar', 'Modern cultivar', rlk_joined_groups$`Group in violin table`)
rlk_joined_groups$`Group in violin table` <- gsub('Old_cultivar', 'Old cultivar', rlk_joined_groups$`Group in violin table`)

rlk_joined_groups$`Group in violin table` <- factor(rlk_joined_groups$`Group in violin table`, levels=c(NA, 'Wild-type', 'Landrace', 'Old cultivar', 'Modern cultivar'))
rlk_vio <- rlk_joined_groups %>% filter(`Group in violin table` != 'NA') %>% 
  ggplot(aes(y=presences, x=`Group in violin table`, fill=`Group in violin table`)) + 
  geom_violin(draw_quantiles = c(0.5)) +
  geom_sina(alpha=0.5) +
  geom_smooth(aes(group=1), method='lm', se = FALSE) +
  scale_fill_manual(values=col_list)+
  guides(fill = FALSE)
rlk_vio

rlk_joined_groups %>% filter(!is.na(`PI-ID`)) %>% 
  group_by(`Group in violin table`) %>% 
  summarise(min_present = min(presences),
            max_present = max(presences),
            mean_present = mean(presences),
            median_present = median(presences),
            std_present = sd(presences)) %>% 
  knitr::kable()
Group in violin table min_present max_present mean_present median_present std_present
Wild-type 1154 1170 1164.357 1165 2.554565
Landrace 1157 1168 1163.217 1163 1.499264
Old cultivar 1161 1166 1163.587 1164 1.407537
Modern cultivar 1159 1168 1163.490 1163 1.472122

RLP part

And now with RLPs

rlp <- read_tsv('./data/Lee.RLP.candidates.lst', col_names = c('Name', 'Class', 'Subtype'))
# have to remove the .t1s 
rlp$Name <- gsub('.t1','', rlp$Name)
rlp_pav_table <- pav_table %>% filter(Individual %in% rlp$Name)
names <- c()
presences <- c()

for (i in seq_along(rlp_pav_table)){
  if ( i == 1) next
  thisind <- colnames(rlp_pav_table)[i]
  pavs <- rlp_pav_table[[i]]
  presents <- sum(pavs)
  names <- c(names, thisind)
  presences <- c(presences, presents)
}
rlp_res_tibb <- new_tibble(list(names = names, presences = presences))

OK what do these presence percentages look like?

ggplot(data=rlp_res_tibb, aes(x=presences)) + geom_histogram(bins=25) 

On average, 172.1693694% of NBS genes are present in each individual.

Now let’s join the table of presences to the four different types so we can group these numbers.

rlp_joined_groups <- left_join(rlp_res_tibb, groups, by = c('names'='Data-storage-ID'))
rlp_joined_groups$`Group in violin table` <- gsub('landrace', 'Landrace', rlp_joined_groups$`Group in violin table`)
rlp_joined_groups$`Group in violin table` <- gsub('Modern_cultivar', 'Modern cultivar', rlp_joined_groups$`Group in violin table`)
rlp_joined_groups$`Group in violin table` <- gsub('Old_cultivar', 'Old cultivar', rlp_joined_groups$`Group in violin table`)

rlp_joined_groups$`Group in violin table` <- factor(rlp_joined_groups$`Group in violin table`, levels=c(NA, 'Wild-type', 'Landrace', 'Old cultivar', 'Modern cultivar'))
rlp_vio <- rlp_joined_groups %>% filter(`Group in violin table` != 'NA') %>% 
  ggplot(aes(y=presences, x=`Group in violin table`, fill=`Group in violin table`)) + 
  geom_violin(draw_quantiles = c(0.5)) +
  geom_sina(alpha=0.5) +
    geom_smooth(aes(group=1), method='lm', se = FALSE) +
  scale_fill_manual(values=col_list)+
  guides(fill = FALSE)
rlp_vio

rlp_joined_groups %>% filter(`Group in violin table` != 'NA') %>% 
  ggplot(aes(y=presences, x=`Group in violin table`, fill=`Group in violin table`)) + 
  geom_jitter() +
  #geom_sina(alpha=0.5) +
  scale_fill_manual(values=col_list)+
  guides(fill = FALSE) +
  ylim(c(87, 100))

rlp_joined_groups %>% filter(!is.na(`PI-ID`)) %>% 
  group_by(`Group in violin table`) %>% 
  summarise(min_present = min(presences),
            max_present = max(presences),
            mean_present = mean(presences),
            median_present = median(presences),
            std_present = sd(presences)) %>% 
  knitr::kable()
Group in violin table min_present max_present mean_present median_present std_present
Wild-type 168 177 173.4140 173 1.617392
Landrace 162 177 171.9668 172 1.661526
Old cultivar 169 176 171.8261 172 1.623499
Modern cultivar 169 175 171.8042 172 1.290587

Plotting together

nbs_vio + rlk_vio + rlp_vio

Stats - Dabayes

I want to know whether the groups are statistically significantly different. First let’s use dabestr

NBS

Let’s run dabestr first:

nbs_multi.two.group.unpaired <- 
  nbs_joined_groups %>% filter(!is.na(`PI-ID`)) %>% 
  dabest(`Group in violin table`, presences, 
         idx = list(c("Wild-type", "Landrace"),
                    c('Old cultivar', 'Modern cultivar')),
         paired = FALSE)
nbs_multi.two.group.unpaired
dabestr (Data Analysis with Bootstrap Estimation in R) v0.3.0
=============================================================

Good morning!
The current time is 11:44 AM on Tuesday September 22, 2020.

Dataset    :  .
The first five rows are:
# A tibble: 5 x 4
  names  presences `PI-ID`  `Group in violin table`
  <chr>      <dbl> <chr>    <fct>                  
1 AB-01        445 PI458020 Landrace               
2 AB-02        454 PI603713 Landrace               
3 DT2000       447 PI635999 Modern cultivar        
4 For          448 PI548645 Modern cultivar        
5 HN001        448 PI518664 Modern cultivar        

X Variable :  Group in violin table
Y Variable :  presences

Effect sizes(s) will be computed for:
  1. Landrace minus Wild-type
  2. Modern cultivar minus Old cultivar
nbs_multi.two.group.unpaired.meandiff <- mean_diff(nbs_multi.two.group.unpaired)
nbs_multi.two.group.unpaired.meandiff
dabestr (Data Analysis with Bootstrap Estimation in R) v0.3.0
=============================================================

Good morning!
The current time is 11:45 AM on Tuesday September 22, 2020.

Dataset    :  .
X Variable :  Group in violin table
Y Variable :  presences

Unpaired mean difference of Landrace (n = 723) minus Wild-type (n = 157)
 -8.06 [95CI  -9.24; -6.86]

Unpaired mean difference of Modern cultivar (n = 143) minus Old cultivar (n = 46)
 -2.55 [95CI  -4.25; -0.97]


5000 bootstrap resamples.
All confidence intervals are bias-corrected and accelerated.
plot(nbs_multi.two.group.unpaired.meandiff, color.column=`Group in violin table`,
     rawplot.ylabel = 'Presence (%)', show.legend=FALSE)

RLK

rlk_multi.two.group.unpaired <- 
  rlk_joined_groups %>% filter(!is.na(`PI-ID`)) %>% 
  dabest(`Group in violin table`, presences, 
         idx = list(c("Wild-type", "Landrace"),
                    c('Old cultivar', 'Modern cultivar')),
         paired = FALSE)
rlk_multi.two.group.unpaired
dabestr (Data Analysis with Bootstrap Estimation in R) v0.3.0
=============================================================

Good morning!
The current time is 11:45 AM on Tuesday September 22, 2020.

Dataset    :  .
The first five rows are:
# A tibble: 5 x 4
  names  presences `PI-ID`  `Group in violin table`
  <chr>      <dbl> <chr>    <fct>                  
1 AB-01       1167 PI458020 Landrace               
2 AB-02       1162 PI603713 Landrace               
3 DT2000      1165 PI635999 Modern cultivar        
4 For         1163 PI548645 Modern cultivar        
5 HN001       1163 PI518664 Modern cultivar        

X Variable :  Group in violin table
Y Variable :  presences

Effect sizes(s) will be computed for:
  1. Landrace minus Wild-type
  2. Modern cultivar minus Old cultivar
rlk_multi.two.group.unpaired.meandiff <- mean_diff(rlk_multi.two.group.unpaired)
rlk_multi.two.group.unpaired.meandiff
dabestr (Data Analysis with Bootstrap Estimation in R) v0.3.0
=============================================================

Good morning!
The current time is 11:45 AM on Tuesday September 22, 2020.

Dataset    :  .
X Variable :  Group in violin table
Y Variable :  presences

Unpaired mean difference of Landrace (n = 723) minus Wild-type (n = 157)
 -1.14 [95CI  -1.55; -0.717]

Unpaired mean difference of Modern cultivar (n = 143) minus Old cultivar (n = 46)
 -0.0974 [95CI  -0.562; 0.362]


5000 bootstrap resamples.
All confidence intervals are bias-corrected and accelerated.
plot(rlk_multi.two.group.unpaired.meandiff, color.column=`Group in violin table`,
     rawplot.ylabel = 'Presence (%)', show.legend=FALSE)

No difference between old and modern cultivars!

RLP

rlp_multi.two.group.unpaired <- 
  rlp_joined_groups %>% filter(!is.na(`PI-ID`)) %>% 
  dabest(`Group in violin table`, presences, 
         idx = list(c("Wild-type", "Landrace"),
                    c('Old cultivar', 'Modern cultivar')),
         paired = FALSE)
rlp_multi.two.group.unpaired
dabestr (Data Analysis with Bootstrap Estimation in R) v0.3.0
=============================================================

Good morning!
The current time is 11:45 AM on Tuesday September 22, 2020.

Dataset    :  .
The first five rows are:
# A tibble: 5 x 4
  names  presences `PI-ID`  `Group in violin table`
  <chr>      <dbl> <chr>    <fct>                  
1 AB-01        171 PI458020 Landrace               
2 AB-02        172 PI603713 Landrace               
3 DT2000       171 PI635999 Modern cultivar        
4 For          171 PI548645 Modern cultivar        
5 HN001        172 PI518664 Modern cultivar        

X Variable :  Group in violin table
Y Variable :  presences

Effect sizes(s) will be computed for:
  1. Landrace minus Wild-type
  2. Modern cultivar minus Old cultivar
rlp_multi.two.group.unpaired.meandiff <- mean_diff(rlp_multi.two.group.unpaired)
rlp_multi.two.group.unpaired.meandiff
dabestr (Data Analysis with Bootstrap Estimation in R) v0.3.0
=============================================================

Good morning!
The current time is 11:45 AM on Tuesday September 22, 2020.

Dataset    :  .
X Variable :  Group in violin table
Y Variable :  presences

Unpaired mean difference of Landrace (n = 723) minus Wild-type (n = 157)
 -1.45 [95CI  -1.74; -1.17]

Unpaired mean difference of Modern cultivar (n = 143) minus Old cultivar (n = 46)
 -0.0219 [95CI  -0.53; 0.477]


5000 bootstrap resamples.
All confidence intervals are bias-corrected and accelerated.
plot(rlp_multi.two.group.unpaired.meandiff, color.column=`Group in violin table`,
     rawplot.ylabel = 'Presence (%)', show.legend=FALSE)

Again, no difference between old and modern cultivars!

Stats - classic t-test

NBS

nbs_joined_groups %>% 
  filter( !is.na(`PI-ID`) ) %>%
    ggplot(aes(x=`Group in violin table`, y = presences,
               fill = `Group in violin table`)) + 
  geom_boxplot() +
  scale_fill_manual(values = col_list) + 
  theme_minimal_hgrid() +
  theme(axis.text.x = element_text(size=12),
        axis.text.y = element_text(size=12)) +
  geom_signif(comparisons = list(c('Wild-type', 'Landrace'),
                                 c('Old cultivar', 'Modern cultivar')), 
              map_signif_level = T) +
  guides(fill=FALSE) +
  ylab('Number of NLR genes') +
  xlab('Accession group')

RLP

rlp_joined_groups %>% 
  filter( !is.na(`PI-ID`) ) %>%
    ggplot(aes(x=`Group in violin table`, y = presences,
               fill = `Group in violin table`)) + 
  geom_boxplot() +
  scale_fill_manual(values = col_list) + 
  theme_minimal_hgrid() +
  theme(axis.text.x = element_text(size=12),
        axis.text.y = element_text(size=12)) +
  geom_signif(comparisons = list(c('Wild-type', 'Landrace'),
                                 c('Old cultivar', 'Modern cultivar')), 
              map_signif_level = T) +
  guides(fill=FALSE) +
  ylab('Number of RLP genes') +
  xlab('Accession group')

RLK

rlk_joined_groups %>% 
  filter( !is.na(`PI-ID`) ) %>%
    ggplot(aes(x=`Group in violin table`, y = presences,
               fill = `Group in violin table`)) + 
  geom_boxplot() +
  scale_fill_manual(values = col_list) + 
  theme_minimal_hgrid() +
  theme(axis.text.x = element_text(size=12),
        axis.text.y = element_text(size=12)) +
  geom_signif(comparisons = list(c('Wild-type', 'Landrace'),
                                 c('Old cultivar', 'Modern cultivar')), 
              map_signif_level = T) +
  guides(fill=FALSE) +
  ylab('Number of RLK genes') +
  xlab('Accession group')

Linking with yield

Can we link the trajectory of NLR genes with the trajectory of yield across the history of soybean breeding? let’s make a simple regression for now

protein <- read_tsv('./data/protein_phenotype.txt')
protein_join <- left_join(nbs_res_tibb, protein, by=c('names'='Line')) %>% filter(!is.na(Protein))
protein_join %>% ggplot(aes(x=presences, y=Protein)) + geom_hex() + geom_smooth() +
  xlab('NLR gene count')

summary(lm(Protein ~ presences, data = protein_join))

Call:
lm(formula = Protein ~ presences, data = protein_join)

Residuals:
     Min       1Q   Median       3Q      Max 
-11.8479  -2.1274  -0.3336   1.9959  10.0949 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) -7.98158    7.24125  -1.102    0.271    
presences    0.11786    0.01624   7.258 8.07e-13 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 3.106 on 960 degrees of freedom
Multiple R-squared:  0.05203,   Adjusted R-squared:  0.05104 
F-statistic: 52.69 on 1 and 960 DF,  p-value: 8.075e-13

Let’s look at seed weight:

seed_weight <- read_tsv('./data/Seed_weight_Phenotype.txt', col_names = c('names', 'wt'))
seed_join <- left_join(nbs_res_tibb, seed_weight) %>% filter(!is.na(wt))
seed_join %>% filter(wt > 5) %>%  ggplot(aes(x=presences, y=wt)) + geom_hex() + geom_smooth() +
  ylab('Seed weight') +
  xlab('NLR gene count')

summary(lm(wt ~ presences, data = seed_join))

Call:
lm(formula = wt ~ presences, data = seed_join)

Residuals:
     Min       1Q   Median       3Q      Max 
-12.2910  -2.8692   0.1462   2.7771  19.6962 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 91.40656   14.67990   6.227 8.28e-10 ***
presences   -0.17636    0.03298  -5.348 1.21e-07 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 4.714 on 690 degrees of freedom
Multiple R-squared:  0.0398,    Adjusted R-squared:  0.0384 
F-statistic:  28.6 on 1 and 690 DF,  p-value: 1.213e-07

And now let’s look at the oil phenotype:

oil <- read_tsv('./data/oil_phenotype.txt')
oil_join <- left_join(nbs_res_tibb, oil, by=c('names'='Line')) %>% filter(!is.na(Oil))
oil_join %>%  ggplot(aes(x=presences, y=Oil)) + geom_hex() + geom_smooth() +
  xlab('NLR gene count')

summary(lm(Oil ~ presences, data = oil_join))

Call:
lm(formula = Oil ~ presences, data = oil_join)

Residuals:
     Min       1Q   Median       3Q      Max 
-10.4376  -1.9081   0.4846   2.2401   9.0361 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 118.03941    7.31646   16.13   <2e-16 ***
presences    -0.22591    0.01641  -13.77   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 3.139 on 960 degrees of freedom
Multiple R-squared:  0.1649,    Adjusted R-squared:  0.1641 
F-statistic: 189.6 on 1 and 960 DF,  p-value: < 2.2e-16

OK there are many, many outliers here. Clearly I’ll have to do something fancier - for example, using the first two PCs as covariates might get rid of some of those outliers.

Boxplots per group

First, protein vs. the four groups:

nbs_joined_groups %>% 
  filter(!is.na(`Group in violin table`)) %>% 
  inner_join(protein, by=c('names'='Line')) %>% 
  ggplot(aes(x=`Group in violin table`, y=Protein, fill = `Group in violin table`)) + 
  geom_boxplot() +
  scale_fill_manual(values = col_list) + 
  theme_minimal_hgrid() +
  theme(axis.text.x = element_text(size=12),
        axis.text.y = element_text(size=12)) +
  geom_signif(comparisons = list(c('Wild-type', 'Landrace'),
                                 c('Old cultivar', 'Modern cultivar')), 
              map_signif_level = T) +
  guides(fill=FALSE) +
  ylab('Protein') +
  xlab('Accession group')

And seed weight:

nbs_joined_groups %>% 
  filter(!is.na(`Group in violin table`)) %>% 
  inner_join(seed_join) %>% 
  ggplot(aes(x=`Group in violin table`, y=wt, fill = `Group in violin table`)) + 
  geom_boxplot() +
  scale_fill_manual(values = col_list) + 
  theme_minimal_hgrid() +
  theme(axis.text.x = element_text(size=12),
        axis.text.y = element_text(size=12)) +
  geom_signif(comparisons = list(c('Wild-type', 'Landrace'),
                                 c('Old cultivar', 'Modern cultivar')), 
              map_signif_level = T) +
  guides(fill=FALSE) +
  ylab('Seed weight') +
  xlab('Accession group')

Wow, that’s breeding!

And finally, Oil content:

nbs_joined_groups %>% 
  filter(!is.na(`Group in violin table`)) %>% 
  inner_join(oil_join, by = 'names') %>% 
  ggplot(aes(x=`Group in violin table`, y=Oil, fill = `Group in violin table`)) + 
  geom_boxplot() +
  scale_fill_manual(values = col_list) + 
  theme_minimal_hgrid() +
  theme(axis.text.x = element_text(size=12),
        axis.text.y = element_text(size=12)) +
  geom_signif(comparisons = list(c('Wild-type', 'Landrace'),
                                 c('Old cultivar', 'Modern cultivar')), 
              map_signif_level = T) +
  guides(fill=FALSE) +
  ylab('Oil content') +
  xlab('Accession group')

Oha, a single star. That’s p < 0.05!


sessionInfo()
R version 3.6.3 (2020-02-29)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 17134)

Matrix products: default

locale:
[1] LC_COLLATE=English_Australia.1252  LC_CTYPE=English_Australia.1252   
[3] LC_MONETARY=English_Australia.1252 LC_NUMERIC=C                      
[5] LC_TIME=English_Australia.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] ggforce_0.3.1   ggsignif_0.6.0  cowplot_1.0.0   dabestr_0.3.0  
 [5] magrittr_1.5    ggsci_2.9       patchwork_1.0.0 forcats_0.5.0  
 [9] stringr_1.4.0   dplyr_1.0.0     purrr_0.3.4     readr_1.3.1    
[13] tidyr_1.1.0     tibble_3.0.2    ggplot2_3.3.2   tidyverse_1.3.0

loaded via a namespace (and not attached):
 [1] nlme_3.1-148         fs_1.5.0.9000        lubridate_1.7.9     
 [4] RColorBrewer_1.1-2   httr_1.4.2           rprojroot_1.3-2     
 [7] tools_3.6.3          backports_1.1.10     utf8_1.1.4          
[10] R6_2.4.1             vipor_0.4.5          DBI_1.1.0           
[13] mgcv_1.8-31          colorspace_1.4-1     withr_2.2.0         
[16] tidyselect_1.1.0     compiler_3.6.3       git2r_0.27.1        
[19] cli_2.0.2            rvest_0.3.5          xml2_1.3.2          
[22] labeling_0.3         scales_1.1.1         hexbin_1.28.1       
[25] digest_0.6.25        rmarkdown_2.3        pkgconfig_2.0.3     
[28] htmltools_0.5.0      dbplyr_1.4.4         highr_0.8           
[31] rlang_0.4.7          readxl_1.3.1         rstudioapi_0.11     
[34] farver_2.0.3         generics_0.0.2       jsonlite_1.7.1      
[37] Matrix_1.2-18        Rcpp_1.0.5           ggbeeswarm_0.6.0    
[40] munsell_0.5.0        fansi_0.4.1          lifecycle_0.2.0     
[43] stringi_1.5.3        whisker_0.4          yaml_2.2.1          
[46] MASS_7.3-51.6        plyr_1.8.6           grid_3.6.3          
[49] blob_1.2.1           promises_1.1.1       crayon_1.3.4        
[52] lattice_0.20-41      haven_2.3.1          splines_3.6.3       
[55] hms_0.5.3            knitr_1.29           pillar_1.4.4        
[58] boot_1.3-25          reprex_0.3.0         glue_1.4.2          
[61] evaluate_0.14        modelr_0.1.8         vctrs_0.3.1         
[64] tweenr_1.0.1         httpuv_1.5.4         cellranger_1.1.0    
[67] gtable_0.3.0         polyclip_1.10-0      assertthat_0.2.1    
[70] xfun_0.17            broom_0.5.6          later_1.1.0.1       
[73] beeswarm_0.2.3       workflowr_1.6.2.9000 ellipsis_0.3.1