Last updated: 2020-12-11

Checks: 7 0

Knit directory: R_gene_analysis/

This reproducible R Markdown analysis was created with workflowr (version 1.6.2.9000). The Checks tab describes the reproducibility checks that were applied when the results were created. The Past versions tab lists the development history.

R Markdown file: up-to-date

Great! Since the R Markdown file has been committed to the Git repository, you know the exact version of the code that produced these results.

Environment: empty

Great job! The global environment was empty. Objects defined in the global environment can affect the analysis in your R Markdown file in unknown ways. For reproduciblity it’s best to always run the code in an empty environment.

Seed: set.seed(20200917)

The command set.seed(20200917) was run prior to running the code in the R Markdown file. Setting a seed ensures that any results that rely on randomness, e.g. subsampling or permutations, are reproducible.

Session information: recorded

Great job! Recording the operating system, R version, and package versions is critical for reproducibility.

Cache: none

Nice! There were no cached chunks for this analysis, so you can be confident that you successfully produced the results during this run.

File paths: relative

Great job! Using relative paths to the files within your workflowr project makes it easier to run your code on other machines.

Repository version: 4b21bc2

Great! You are using Git for version control. Tracking code development and connecting the code version to the results is critical for reproducibility.

The results in this page were generated with repository version 4b21bc2. See the Past versions tab to see a history of the changes made to the R Markdown and HTML files.

Note that you need to be careful to ensure that all relevant files for the analysis have been committed to Git prior to generating the results (you can use wflow_publish or wflow_git_commit). workflowr only checks the R Markdown file, but you know if there are other scripts or data files that it depends on. Below is the status of the Git repository when the results were generated:


Ignored files:
    Ignored:    .Rhistory
    Ignored:    .Rproj.user/

Untracked files:
    Untracked:  data/Brec_R1.txt
    Untracked:  data/Brec_R2.txt
    Untracked:  data/CR15_R1.txt
    Untracked:  data/CR15_R2.txt
    Untracked:  data/CR_14_R1.txt
    Untracked:  data/CR_14_R2.txt
    Untracked:  data/KS_R1.txt
    Untracked:  data/KS_R2.txt
    Untracked:  data/NBS_PAV.txt.gz
    Untracked:  data/NLR_PAV_GD.txt
    Untracked:  data/NLR_PAV_GM.txt
    Untracked:  data/PAVs_newick.txt
    Untracked:  data/PPR1.txt
    Untracked:  data/PPR2.txt
    Untracked:  data/SNPs_newick.txt
    Untracked:  data/bac.txt
    Untracked:  data/brown.txt
    Untracked:  data/cy3.txt
    Untracked:  data/cy5.txt
    Untracked:  data/early.txt
    Untracked:  data/flowerings.txt
    Untracked:  data/foregeye.txt
    Untracked:  data/height.txt
    Untracked:  data/late.txt
    Untracked:  data/mature.txt
    Untracked:  data/motting.txt
    Untracked:  data/mvp.kin.bin
    Untracked:  data/mvp.kin.desc
    Untracked:  data/oil.txt
    Untracked:  data/pdh.txt
    Untracked:  data/protein.txt
    Untracked:  data/rust_tan.txt
    Untracked:  data/salt.txt
    Untracked:  data/seedq.txt
    Untracked:  data/seedweight.txt
    Untracked:  data/stem_termination.txt
    Untracked:  data/sudden.txt
    Untracked:  data/virus.txt
    Untracked:  data/yield.txt

Note that any generated files, e.g. HTML, png, CSS, etc., are not included in this status report because it is ok for generated content to have uncommitted changes.

These are the previous versions of the repository in which changes were made to the R Markdown (analysis/total_numbers.Rmd) and HTML (docs/total_numbers.html) files. If you’ve configured a remote Git repository (see ?wflow_git_remote), click on the hyperlinks in the table below to view the files as they were in that past version.

File	Version	Author	Date	Message
Rmd	4b21bc2	Philipp Bayer	2020-12-11	wflow_publish(“analysis/total_numbers.Rmd”)
html	e519122	Philipp Bayer	2020-12-11	Build site.
html	83dc49a	Philipp Bayer	2020-12-11	Trace deleted files
html	e059725	Philipp Bayer	2020-11-18	Build site.
Rmd	a9702e9	Philipp Bayer	2020-11-18	wflow_publish(“analysis/total_numbers.Rmd”)
Rmd	dae157b	Philipp Bayer	2020-09-24	Update of analysis
html	dae157b	Philipp Bayer	2020-09-24	Update of analysis
Rmd	1b346e9	Philipp Bayer	2020-09-22	More data!
html	1b346e9	Philipp Bayer	2020-09-22	More data!
html	0777e1d	Philipp Bayer	2020-09-22	update
Rmd	3adcc5a	Philipp Bayer	2020-09-22	wflow_git_commit(all = TRUE)
html	3adcc5a	Philipp Bayer	2020-09-22	wflow_git_commit(all = TRUE)
Rmd	50f995b	Philipp Bayer	2020-09-21	Add more analysis
html	50f995b	Philipp Bayer	2020-09-21	Add more analysis
html	e6e9b9b	Philipp Bayer	2020-09-21	Build site.
Rmd	c1fbbf9	Philipp Bayer	2020-09-21	wflow_publish("analysis/*")
Rmd	c71005a	Philipp Bayer	2020-09-18	lots changes
html	c71005a	Philipp Bayer	2020-09-18	lots changes
html	7d33bac	Philipp Bayer	2020-09-18	Build site.
Rmd	695db1e	Philipp Bayer	2020-09-18	wflow_publish(c(“analysis/eda.Rmd”, “analysis/first-analysis.Rmd”,

This is the same analysis as first-analysis, but with total numbers, not percentages genes lost

knitr::opts_chunk$set(warning = FALSE, message = FALSE) 
library(tidyverse)
library(patchwork)
library(ggsci)
library(dabestr)
library(dabestr)
library(cowplot)
library(ggsignif)
library(ggforce)

theme_set(theme_cowplot())

Introduction

npg_col = pal_npg("nrc")(9)
col_list <- c(`Wild`=npg_col[8],
   Landrace = npg_col[3],
  `Old cultivar`=npg_col[2],
  `Modern cultivar`=npg_col[4])

pav_table <- read_tsv('./data/soybean_pan_pav.matrix_gene.txt.gz')

NBS part

Let’s pull the NBS genes from the table

nbs <- read_tsv('./data/Lee.NBS.candidates.lst', col_names = c('Name', 'Class'))
nbs

# A tibble: 486 x 2
   Name                   Class
   <chr>                  <chr>
 1 UWASoyPan00953.t1      CN   
 2 GlymaLee.13G222900.1.p CN   
 3 GlymaLee.18G227000.1.p CN   
 4 GlymaLee.18G080600.1.p CN   
 5 GlymaLee.20G036200.1.p CN   
 6 UWASoyPan01876.t1      CN   
 7 UWASoyPan04211.t1      CN   
 8 GlymaLee.19G105400.1.p CN   
 9 GlymaLee.18G085100.1.p CN   
10 GlymaLee.11G142600.1.p CN   
# ... with 476 more rows

# have to remove the .t1s 
nbs$Name <- gsub('.t1','', nbs$Name)

nbs_pav_table <- pav_table %>% filter(Individual %in% nbs$Name)
write_delim(nbs_pav_table, 'data/NBS_PAV.txt.gz', delim='\t')

Modern vs Old gene loss

groups <- read_csv('./data/Table_of_cultivar_groups.csv')
groups

# A tibble: 1,069 x 3
   `Data-storage-ID` `PI-ID`   `Group in violin table`
   <chr>             <chr>     <chr>                  
 1 SRR1533284        PI416890  landrace               
 2 SRR1533282        PI323576  landrace               
 3 SRR1533292        PI157421  landrace               
 4 SRR1533216        PI594615  landrace               
 5 SRR1533239        PI603336  landrace               
 6 USB-108           PI165675  landrace               
 7 HNEX-13           PI253665D landrace               
 8 USB-382           PI603549  landrace               
 9 SRR1533236        PI587552  landrace               
10 SRR1533332        PI567293  landrace               
# ... with 1,059 more rows

Which genes are present more or less in old / modern cultivars?

big_norm_count <- tibble(
  name = character(),
  landrace = numeric(),
  Modern_cultivar = numeric(),
  Old_cultivar = numeric(),
  `Wild` = numeric()
)

groups_list <- split(groups$`Group in violin table`, groups$`Data-storage-ID`)

for( i in 1:nrow(nbs_pav_table) ) {
  this_gene <- nbs_pav_table[i,]
  groups_count <- list()
  total_groups_count <- list()
  for (x in seq_along(nbs_pav_table)){
    if ( x == 1) next
    thisind <- colnames(nbs_pav_table)[x]
    thisind_group <- groups_list[[thisind]]
    if( is.null(thisind_group) ) next # no group for this individual
    pavs <- this_gene[[x]] # either 1 or 0
    
    if ( thisind_group %in% names(groups_count)) {
      # count the number of present genes
      groups_count[[thisind_group]] <- groups_count[[thisind_group]] + pavs
      # count the total number of individuals for this group
      total_groups_count[[thisind_group]] <- total_groups_count[[thisind_group]] + 1
    } else {
      groups_count[[thisind_group]] <-  pavs
      total_groups_count[[thisind_group]] <- 1
    }
  }
  
  norm_group_count <- list()
  for (m in seq_along(groups_count)) {
    thisname <- names(groups_count)[m]
    norm_group_count[[thisname]] <- groups_count[[thisname]] / total_groups_count[[thisname]] * 100
  }
  norm_group_count$Individual <- this_gene$Individual
  big_norm_count <- rbind(big_norm_count, as_tibble(norm_group_count))

}

# wow, I DO write R like Python

Let’s pull out the genes that are variable in any group

var_norm_count <- big_norm_count %>% 
  filter(landrace != 100 & 
           Modern_cultivar != 100 & 
           Old_cultivar != 100 & 
           `Wild-type` != 100)

var_norm_count <- left_join(var_norm_count, nbs, by=c('Individual'='Name'))
var_norm_count$Mod_minus_Old <- var_norm_count$Modern_cultivar - var_norm_count$Old_cultivar

The top 20 genes reduced the most in modern cultivars compared with old cultivars:

var_norm_count %>% 
  arrange(Mod_minus_Old) %>% 
  head(20) %>% 
  select(Individual, `Wild-type`, landrace, Old_cultivar, Modern_cultivar, Mod_minus_Old, Class) %>% 
  knitr::kable()

Individual	Wild-type	landrace	Old_cultivar	Modern_cultivar	Mod_minus_Old	Class
UWASoyPan03261	74.52229	62.102351	60.86957	26.573427	-34.296139	TX
UWASoyPan00953	83.43949	33.056708	43.47826	11.188811	-32.289450	CN
UWASoyPan00725	89.17197	92.392808	82.60870	51.748252	-30.860444	TX
UWASoyPan00316	91.08280	90.594744	80.43478	50.349650	-30.085132	NBS
UWASoyPan01530	80.89172	45.089903	47.82609	20.979021	-26.847066	NL
UWASoyPan00975	42.67516	19.778700	32.60870	7.692308	-24.916388	TX
UWASoyPan00155	85.35032	77.316736	63.04348	42.657343	-20.386136	NBS
UWASoyPan00772	54.77707	18.948824	30.43478	11.888112	-18.546671	NBS
UWASoyPan03402	62.42038	42.738589	36.95652	18.881119	-18.075403	NBS
UWASoyPan01320	45.85987	30.152144	23.91304	6.993007	-16.920036	NBS
UWASoyPan02799	65.60510	30.843707	19.56522	2.797203	-16.768015	NBS
GlymaLee.03G045500.1.p	82.16561	58.921162	67.39130	51.748252	-15.643053	OTHER
UWASoyPan01253	73.24841	24.481328	17.39130	3.496504	-13.894801	NBS
UWASoyPan03340	18.47134	26.279391	19.56522	6.293706	-13.271511	TX
GlymaLee.06G230600.1.p	78.34395	57.399723	56.52174	44.055944	-12.465795	TX
GlymaLee.06G228900.1.p	96.17834	74.688797	91.30435	80.419580	-10.884767	TX
UWASoyPan00251	60.50955	20.470263	21.73913	11.188811	-10.550319	NL
GlymaLee.03G045700.1.p	75.79618	67.496542	65.21739	55.244755	-9.972636	OTHER
GlymaLee.06G228600.1.p	90.44586	79.391425	82.60870	72.727273	-9.881423	TX
UWASoyPan00670	30.57325	6.915629	10.86957	2.097902	-8.771663	TX

So these are the NLR genes selected against during soybean breeding.

Let’s look at those without the TX ones:

var_norm_count %>% 
  arrange(Mod_minus_Old) %>% 
  select(Individual, `Wild-type`, landrace, Old_cultivar, Modern_cultivar, Mod_minus_Old, Class) %>% 
  filter(Class != 'TX') %>% 
  head(20) %>% 
  knitr::kable()

Individual	Wild-type	landrace	Old_cultivar	Modern_cultivar	Mod_minus_Old	Class
UWASoyPan00953	83.439490	33.0567082	43.478261	11.188811	-32.289450	CN
UWASoyPan00316	91.082802	90.5947441	80.434783	50.349650	-30.085132	NBS
UWASoyPan01530	80.891720	45.0899032	47.826087	20.979021	-26.847066	NL
UWASoyPan00155	85.350319	77.3167358	63.043478	42.657343	-20.386136	NBS
UWASoyPan00772	54.777070	18.9488243	30.434783	11.888112	-18.546671	NBS
UWASoyPan03402	62.420382	42.7385892	36.956522	18.881119	-18.075403	NBS
UWASoyPan01320	45.859873	30.1521438	23.913044	6.993007	-16.920036	NBS
UWASoyPan02799	65.605096	30.8437068	19.565217	2.797203	-16.768015	NBS
GlymaLee.03G045500.1.p	82.165605	58.9211618	67.391304	51.748252	-15.643053	OTHER
UWASoyPan01253	73.248408	24.4813278	17.391304	3.496504	-13.894801	NBS
UWASoyPan00251	60.509554	20.4702628	21.739130	11.188811	-10.550319	NL
GlymaLee.03G045700.1.p	75.796178	67.4965422	65.217391	55.244755	-9.972636	OTHER
UWASoyPan03194	45.222930	10.7883817	10.869565	2.097902	-8.771663	NBS
GlymaLee.03G045900.1.p	74.522293	65.5601660	63.043478	54.545454	-8.498024	OTHER
UWASoyPan00326	40.764331	19.0871369	17.391304	9.790210	-7.601095	CN
UWASoyPan02496	26.114650	14.6611342	13.043478	6.993007	-6.050471	CN
UWASoyPan01217	57.961783	14.3845090	10.869565	5.594406	-5.275160	NBS
GlymaLee.06G229300.1.p	86.624204	50.3457815	65.217391	60.839161	-4.378230	TN
UWASoyPan04757	6.369427	0.9681881	4.347826	0.000000	-4.347826	NBS
GlymaLee.03G042000.1.p	99.363057	88.3817427	91.304348	88.111888	-3.192460	CNL

Let’s plot:

var_norm_count %>% 
  arrange(Mod_minus_Old) %>% 
  head(20) %>% 
  select(Individual, `Wild-type`, landrace, Old_cultivar, Modern_cultivar, Class) %>% 
  pivot_longer(!c(Individual, Class)) %>% 
  mutate(name = str_replace_all(name, 'landrace', 'Landrace')) %>%
  mutate(name = str_replace_all(name, 'Wild-type', 'Wild')) %>% 
  mutate(name = str_replace_all(name, 'Old_cultivar', 'Old cultivar')) %>% 
  mutate(name = str_replace_all(name, 'Modern_cultivar', 'Modern cultivar')) %>%  
  ggplot(aes(x=factor(name, levels=c('Wild', 'Landrace', 'Old cultivar', 'Modern cultivar')), y=value, group=Individual, color=Class)) + 
  geom_line(size=1.5) +
  xlab('Group') +
  ylab('Percentage presence of gene in group') +
  scale_color_brewer(palette = 'Dark2')

The top 20 genes increased the most in modern cultivars:

var_norm_count %>% 
  arrange(desc(Mod_minus_Old)) %>% 
  head(20) %>% 
  select(Individual, `Wild-type`, landrace, Old_cultivar, Modern_cultivar, Mod_minus_Old, Class) %>% 
  knitr::kable()

Individual	Wild-type	landrace	Old_cultivar	Modern_cultivar	Mod_minus_Old	Class
GlymaLee.01G030900.1.p	81.52866	50.345782	34.782609	71.328671	36.546063	NL
GlymaLee.15G199500.1.p	85.98726	73.582296	69.565217	86.713287	17.148069	CN
GlymaLee.15G199200.1.p	92.99363	74.827109	73.913044	90.909091	16.996047	CNL
GlymaLee.06G232800.1.p	40.12739	47.579530	56.521739	73.426573	16.904834	NBS
GlymaLee.01G088400.1.p	49.68153	89.488243	82.608696	97.202797	14.594102	TNL
UWASoyPan05312	30.57325	8.575380	10.869565	23.776224	12.906659	NBS
UWASoyPan00005	36.94268	13.831259	8.695652	20.279720	11.584068	NBS
UWASoyPan01876	43.31210	15.629322	8.695652	20.279720	11.584068	CN
GlymaLee.10G034600.1.p	91.08280	91.839557	84.782609	95.104895	10.322286	NL
UWASoyPan01330	29.29936	25.172891	15.217391	23.776224	8.558832	NBS
UWASoyPan00202	50.31847	35.408022	26.086956	32.167832	6.080876	NBS
UWASoyPan00427	97.45223	84.232365	73.913044	79.720280	5.807236	NBS
GlymaLee.15G199300.1.p	96.17834	88.243430	93.478261	98.601399	5.123138	NL
GlymaLee.03G070700.1.p	65.60510	89.903181	89.130435	93.706294	4.575859	TNL
GlymaLee.06G229100.1.p	85.98726	38.174274	50.000000	53.146853	3.146853	TX
GlymaLee.07G070200.1.p	78.98089	91.286307	91.304348	94.405594	3.101247	NBS
UWASoyPan01418	68.78981	66.251729	71.739130	74.825175	3.086044	TX
GlymaLee.03G070600.1.p	66.87898	90.179806	91.304348	93.706294	2.401946	TNL
GlymaLee.16G175200.1.p	95.54140	96.127248	95.652174	97.902098	2.249924	TNL
UWASoyPan04967	35.03185	6.224066	2.173913	4.195804	2.021891	TX

As these genes have relatively high percentages in WT they must have been re-introduced by using WT in the breeding process.

Let’s plot those too:

var_norm_count %>% 
  arrange(desc(Mod_minus_Old)) %>% 
  head(20) %>% 
  select(Individual, `Wild-type`, landrace, Old_cultivar, Modern_cultivar, Class) %>% 
  pivot_longer(!c(Individual, Class)) %>% 
  mutate(name = str_replace_all(name, 'landrace', 'Landrace')) %>% 
  mutate(name = str_replace_all(name, 'Old_cultivar', 'Old cultivar')) %>% 
  mutate(name = str_replace_all(name, 'Modern_cultivar', 'Modern cultivar')) %>%  
  mutate(name = str_replace_all(name, 'Wild-type', 'Wild')) %>% 
  ggplot(aes(x=factor(name, levels=c('Wild', 'Landrace', 'Old cultivar', 'Modern cultivar')), y=value, group=Individual, color=Class)) + 
  geom_line(size=1.5) +
  xlab('Group') +
  ylab('Percentage presence of gene in group') +
  scale_color_brewer(palette = 'Dark2')

Presence plotting per individual

names <- c()
presences <- c()

for (i in seq_along(nbs_pav_table)){
  if ( i == 1) next
  thisind <- colnames(nbs_pav_table)[i]
  pavs <- nbs_pav_table[[i]]
  presents <- sum(pavs)
  names <- c(names, thisind)
  presences <- c(presences, presents)
}
nbs_res_tibb <- new_tibble(list(names = names, presences = presences))

OK what do these presence percentages look like?

ggplot(data=nbs_res_tibb, aes(x=presences)) + geom_histogram(bins=25)

On average, 446.0027027 of NBS genes are present in each individual.

Now let’s join the table of presences to the four different types so we can group these numbers.

nbs_joined_groups <- left_join(nbs_res_tibb, groups, by = c('names'='Data-storage-ID'))

nbs_joined_groups$`Group in violin table` <- gsub('landrace', 'Landrace', nbs_joined_groups$`Group in violin table`)
nbs_joined_groups$`Group in violin table` <- gsub('Modern_cultivar', 'Modern cultivar', nbs_joined_groups$`Group in violin table`)
nbs_joined_groups$`Group in violin table` <- gsub('Old_cultivar', 'Old cultivar', nbs_joined_groups$`Group in violin table`)
nbs_joined_groups$`Group in violin table` <- gsub('Wild-type', 'Wild', nbs_joined_groups$`Group in violin table`)

nbs_joined_groups$`Group in violin table` <- factor(nbs_joined_groups$`Group in violin table`, levels=c(NA, 'Wild', 'Landrace', 'Old cultivar', 'Modern cultivar'))

nbs_vio <- nbs_joined_groups %>% filter(!is.na(`Group in violin table`)) %>% 
  ggplot(aes(y=presences, x=`Group in violin table`, fill=`Group in violin table`)) + 
  geom_violin(draw_quantiles = c(0.5)) +
  geom_sina(alpha=0.5) +
  geom_smooth(aes(group=1), method='glm') +
  scale_fill_manual(values=col_list) +
  guides(fill = FALSE)

nbs_vio

nbs_joined_groups %>% filter(`Group in violin table` != 'NA') %>% 
  ggplot(aes(y=presences, x=`Group in violin table`, fill=`Group in violin table`)) + 
  geom_smooth(aes(group=1), method='lm', se = FALSE) +
  geom_jitter() +
  scale_fill_manual(values=col_list)+
  guides(fill = FALSE)

nbs_joined_groups %>% filter(!is.na(`PI-ID`)) %>% 
  group_by(`Group in violin table`) %>% 
  summarise(min_present = min(presences),
            max_present = max(presences),
            mean_present = mean(presences),
            median_present = median(presences),
            std_present = sd(presences)) %>% 
  knitr::kable()

Group in violin table	min_present	max_present	mean_present	median_present	std_present
Wild	435	473	452.9490	453	7.170806
Landrace	429	465	444.8907	445	5.011672
Old cultivar	433	456	444.8696	445	5.200892
Modern cultivar	431	455	442.3147	442	4.047986

RLK part

Let’s do the same plot with RLKs

rlk <- read_tsv('./data/Lee.RLK.candidates.lst', col_names = c('Name', 'Class', 'Subtype'))
rlk

# A tibble: 1,173 x 3
   Name                   Class Subtype       
   <chr>                  <chr> <chr>         
 1 GlymaLee.01G001800.1.p RLK   lrr           
 2 GlymaLee.01G004900.1.p RLK   lrr           
 3 GlymaLee.01G007300.1.p RLK   lrr           
 4 GlymaLee.01G007400.1.p RLK   lrr           
 5 GlymaLee.01G012800.1.p RLK   other_receptor
 6 GlymaLee.01G018800.1.p RLK   lrr           
 7 GlymaLee.01G021100.1.p RLK   other_receptor
 8 GlymaLee.01G025500.1.p RLK   lysm          
 9 GlymaLee.01G026500.1.p RLK   other_receptor
10 GlymaLee.01G027000.1.p RLK   lrr           
# ... with 1,163 more rows

# have to remove the .t1s 
rlk$Name <- gsub('.t1','', rlk$Name)

rlk_pav_table <- pav_table %>% filter(Individual %in% rlk$Name)
rlk_pav_table

# A tibble: 1,173 x 1,111
   Individual `AB-01` `AB-02` `BR-01` `BR-02` `BR-03` `BR-04` `BR-05` `BR-06`
   <chr>        <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>
 1 GlymaLee.~       1       1       1       1       1       1       1       1
 2 GlymaLee.~       1       1       1       1       1       1       1       1
 3 GlymaLee.~       1       1       1       1       1       1       1       1
 4 GlymaLee.~       1       1       1       1       1       1       1       1
 5 GlymaLee.~       1       1       1       1       1       1       1       1
 6 GlymaLee.~       1       1       1       1       1       1       1       1
 7 GlymaLee.~       1       1       1       1       1       1       1       1
 8 GlymaLee.~       1       1       1       1       1       1       1       1
 9 GlymaLee.~       1       1       1       1       1       1       1       1
10 GlymaLee.~       1       1       1       1       1       1       1       1
# ... with 1,163 more rows, and 1,102 more variables: `BR-07` <dbl>,
#   `BR-08` <dbl>, `BR-09` <dbl>, `BR-10` <dbl>, `BR-11` <dbl>, `BR-12` <dbl>,
#   `BR-13` <dbl>, `BR-14` <dbl>, `BR-15` <dbl>, `BR-16` <dbl>, `BR-17` <dbl>,
#   `BR-18` <dbl>, `BR-20` <dbl>, `BR-23` <dbl>, `BR-24` <dbl>, `BR-29` <dbl>,
#   `BR-30` <dbl>, `BR-32` <dbl>, DT2000 <dbl>, ESS <dbl>, For <dbl>,
#   HN001 <dbl>, HN002 <dbl>, HN003 <dbl>, HN004 <dbl>, HN005 <dbl>,
#   HN006 <dbl>, HN007 <dbl>, HN008 <dbl>, HN009 <dbl>, HN010 <dbl>,
#   HN011 <dbl>, HN012 <dbl>, HN013 <dbl>, HN015 <dbl>, HN016B <dbl>,
#   HN017B <dbl>, HN018 <dbl>, HN019 <dbl>, HN021 <dbl>, HN022 <dbl>,
#   HN023 <dbl>, HN024 <dbl>, HN025 <dbl>, HN026 <dbl>, HN027 <dbl>,
#   HN028 <dbl>, HN029 <dbl>, HN030 <dbl>, HN031 <dbl>, HN032 <dbl>,
#   HN033 <dbl>, HN034 <dbl>, HN035 <dbl>, HN036 <dbl>, HN037 <dbl>,
#   HN038 <dbl>, HN039 <dbl>, HN040 <dbl>, HN041 <dbl>, HN042 <dbl>,
#   HN043 <dbl>, HN044 <dbl>, HN045 <dbl>, HN046 <dbl>, HN047 <dbl>,
#   HN048 <dbl>, HN049 <dbl>, HN050 <dbl>, HN051 <dbl>, HN052 <dbl>,
#   HN053 <dbl>, HN054 <dbl>, HN055 <dbl>, HN056 <dbl>, HN057 <dbl>,
#   HN058 <dbl>, HN059 <dbl>, HN060 <dbl>, HN061 <dbl>, HN062 <dbl>,
#   HN063 <dbl>, HN064 <dbl>, HN065 <dbl>, HN066 <dbl>, HN067 <dbl>,
#   HN068 <dbl>, HN069 <dbl>, HN070 <dbl>, HN071 <dbl>, HN072 <dbl>,
#   HN073 <dbl>, HN074 <dbl>, HN075 <dbl>, HN076 <dbl>, HN077 <dbl>,
#   HN078 <dbl>, HN079 <dbl>, HN080 <dbl>, HN081 <dbl>, ...

names <- c()
presences <- c()

for (i in seq_along(rlk_pav_table)){
  if ( i == 1) next
  thisind <- colnames(rlk_pav_table)[i]
  pavs <- rlk_pav_table[[i]]
  presents <- sum(pavs)
  names <- c(names, thisind)
  presences <- c(presences, presents)
}
rlk_res_tibb <- new_tibble(list(names = names, presences = presences))
rlk_res_tibb

# A tibble: 1,110 x 2
   names presences
   <chr>     <dbl>
 1 AB-01      1167
 2 AB-02      1162
 3 BR-01      1166
 4 BR-02      1165
 5 BR-03      1166
 6 BR-04      1167
 7 BR-05      1164
 8 BR-06      1167
 9 BR-07      1165
10 BR-08      1167
# ... with 1,100 more rows

OK what do these presence percentages look like?

ggplot(data=rlk_res_tibb, aes(x=presences)) + geom_histogram(bins=25)

On average, 1163.5036036% of NBS genes are present in each individual.

Now let’s join the table of presences to the four different types so we can group these numbers.

rlk_joined_groups <- left_join(rlk_res_tibb, groups, by = c('names'='Data-storage-ID'))

rlk_joined_groups$`Group in violin table` <- gsub('landrace', 'Landrace', rlk_joined_groups$`Group in violin table`)
rlk_joined_groups$`Group in violin table` <- gsub('Modern_cultivar', 'Modern cultivar', rlk_joined_groups$`Group in violin table`)
rlk_joined_groups$`Group in violin table` <- gsub('Old_cultivar', 'Old cultivar', rlk_joined_groups$`Group in violin table`)
rlk_joined_groups$`Group in violin table` <- gsub('Wild-type', 'Wild', rlk_joined_groups$`Group in violin table`)


rlk_joined_groups$`Group in violin table` <- factor(rlk_joined_groups$`Group in violin table`, levels=c(NA, 'Wild', 'Landrace', 'Old cultivar', 'Modern cultivar'))

rlk_vio <- rlk_joined_groups %>% filter(`Group in violin table` != 'NA') %>% 
  ggplot(aes(y=presences, x=`Group in violin table`, fill=`Group in violin table`)) + 
  geom_violin(draw_quantiles = c(0.5)) +
  geom_sina(alpha=0.5) +
  geom_smooth(aes(group=1), method='lm', se = FALSE) +
  scale_fill_manual(values=col_list)+
  guides(fill = FALSE)
rlk_vio

rlk_joined_groups %>% filter(!is.na(`PI-ID`)) %>% 
  group_by(`Group in violin table`) %>% 
  summarise(min_present = min(presences),
            max_present = max(presences),
            mean_present = mean(presences),
            median_present = median(presences),
            std_present = sd(presences)) %>% 
  knitr::kable()

Group in violin table	min_present	max_present	mean_present	median_present	std_present
Wild	1154	1170	1164.357	1165	2.554565
Landrace	1157	1168	1163.217	1163	1.499264
Old cultivar	1161	1166	1163.587	1164	1.407537
Modern cultivar	1159	1168	1163.490	1163	1.472122

RLP part

And now with RLPs

rlp <- read_tsv('./data/Lee.RLP.candidates.lst', col_names = c('Name', 'Class', 'Subtype'))
# have to remove the .t1s 
rlp$Name <- gsub('.t1','', rlp$Name)

rlp_pav_table <- pav_table %>% filter(Individual %in% rlp$Name)

names <- c()
presences <- c()

for (i in seq_along(rlp_pav_table)){
  if ( i == 1) next
  thisind <- colnames(rlp_pav_table)[i]
  pavs <- rlp_pav_table[[i]]
  presents <- sum(pavs)
  names <- c(names, thisind)
  presences <- c(presences, presents)
}
rlp_res_tibb <- new_tibble(list(names = names, presences = presences))

OK what do these presence percentages look like?

ggplot(data=rlp_res_tibb, aes(x=presences)) + geom_histogram(bins=25)

On average, 172.1693694% of NBS genes are present in each individual.

Now let’s join the table of presences to the four different types so we can group these numbers.

rlp_joined_groups <- left_join(rlp_res_tibb, groups, by = c('names'='Data-storage-ID'))

rlp_joined_groups$`Group in violin table` <- gsub('landrace', 'Landrace', rlp_joined_groups$`Group in violin table`)
rlp_joined_groups$`Group in violin table` <- gsub('Modern_cultivar', 'Modern cultivar', rlp_joined_groups$`Group in violin table`)
rlp_joined_groups$`Group in violin table` <- gsub('Old_cultivar', 'Old cultivar', rlp_joined_groups$`Group in violin table`)
rlp_joined_groups$`Group in violin table` <- gsub('Wild-type', 'Wild', rlp_joined_groups$`Group in violin table`)

rlp_joined_groups$`Group in violin table` <- factor(rlp_joined_groups$`Group in violin table`, levels=c(NA, 'Wild', 'Landrace', 'Old cultivar', 'Modern cultivar'))

rlp_vio <- rlp_joined_groups %>% filter(`Group in violin table` != 'NA') %>% 
  ggplot(aes(y=presences, x=`Group in violin table`, fill=`Group in violin table`)) + 
  geom_violin(draw_quantiles = c(0.5)) +
  geom_sina(alpha=0.5) +
    geom_smooth(aes(group=1), method='lm', se = FALSE) +
  scale_fill_manual(values=col_list)+
  guides(fill = FALSE)
rlp_vio

rlp_joined_groups %>% filter(`Group in violin table` != 'NA') %>% 
  ggplot(aes(y=presences, x=`Group in violin table`, fill=`Group in violin table`)) + 
  geom_jitter() +
  #geom_sina(alpha=0.5) +
  scale_fill_manual(values=col_list)+
  guides(fill = FALSE) +
  ylim(c(87, 100))

rlp_joined_groups %>% filter(!is.na(`PI-ID`)) %>% 
  group_by(`Group in violin table`) %>% 
  summarise(min_present = min(presences),
            max_present = max(presences),
            mean_present = mean(presences),
            median_present = median(presences),
            std_present = sd(presences)) %>% 
  knitr::kable()

Group in violin table	min_present	max_present	mean_present	median_present	std_present
Wild	168	177	173.4140	173	1.617392
Landrace	162	177	171.9668	172	1.661526
Old cultivar	169	176	171.8261	172	1.623499
Modern cultivar	169	175	171.8042	172	1.290587

Plotting together

nbs_vio + rlk_vio + rlp_vio

Stats - Dabayes

I want to know whether the groups are statistically significantly different. First let’s use dabestr

NBS

Let’s run dabestr first:

nbs_multi.two.group.unpaired <- 
  nbs_joined_groups %>% filter(!is.na(`PI-ID`)) %>% 
  dabest(`Group in violin table`, presences, 
         idx = list(c("Wild", "Landrace"),
                    c('Old cultivar', 'Modern cultivar')),
         paired = FALSE)
nbs_multi.two.group.unpaired

dabestr (Data Analysis with Bootstrap Estimation in R) v0.3.0
=============================================================

Good morning!
The current time is 11:21 AM on Friday December 11, 2020.

Dataset    :  .
The first five rows are:
# A tibble: 5 x 4
  names  presences `PI-ID`  `Group in violin table`
  <chr>      <dbl> <chr>    <fct>                  
1 AB-01        445 PI458020 Landrace               
2 AB-02        454 PI603713 Landrace               
3 DT2000       447 PI635999 Modern cultivar        
4 For          448 PI548645 Modern cultivar        
5 HN001        448 PI518664 Modern cultivar        

X Variable :  Group in violin table
Y Variable :  presences

Effect sizes(s) will be computed for:
  1. Landrace minus Wild
  2. Modern cultivar minus Old cultivar

nbs_multi.two.group.unpaired.meandiff <- mean_diff(nbs_multi.two.group.unpaired)
nbs_multi.two.group.unpaired.meandiff

dabestr (Data Analysis with Bootstrap Estimation in R) v0.3.0
=============================================================

Good morning!
The current time is 11:22 AM on Friday December 11, 2020.

Dataset    :  .
X Variable :  Group in violin table
Y Variable :  presences

Unpaired mean difference of Landrace (n = 723) minus Wild (n = 157)
 -8.06 [95CI  -9.24; -6.86]

Unpaired mean difference of Modern cultivar (n = 143) minus Old cultivar (n = 46)
 -2.55 [95CI  -4.25; -0.97]


5000 bootstrap resamples.
All confidence intervals are bias-corrected and accelerated.

plot(nbs_multi.two.group.unpaired.meandiff, color.column=`Group in violin table`,
     rawplot.ylabel = 'Presence (%)', show.legend=FALSE)

RLK

rlk_multi.two.group.unpaired <- 
  rlk_joined_groups %>% filter(!is.na(`PI-ID`)) %>% 
  dabest(`Group in violin table`, presences, 
         idx = list(c("Wild", "Landrace"),
                    c('Old cultivar', 'Modern cultivar')),
         paired = FALSE)
rlk_multi.two.group.unpaired

dabestr (Data Analysis with Bootstrap Estimation in R) v0.3.0
=============================================================

Good morning!
The current time is 11:22 AM on Friday December 11, 2020.

Dataset    :  .
The first five rows are:
# A tibble: 5 x 4
  names  presences `PI-ID`  `Group in violin table`
  <chr>      <dbl> <chr>    <fct>                  
1 AB-01       1167 PI458020 Landrace               
2 AB-02       1162 PI603713 Landrace               
3 DT2000      1165 PI635999 Modern cultivar        
4 For         1163 PI548645 Modern cultivar        
5 HN001       1163 PI518664 Modern cultivar        

X Variable :  Group in violin table
Y Variable :  presences

Effect sizes(s) will be computed for:
  1. Landrace minus Wild
  2. Modern cultivar minus Old cultivar

rlk_multi.two.group.unpaired.meandiff <- mean_diff(rlk_multi.two.group.unpaired)
rlk_multi.two.group.unpaired.meandiff

dabestr (Data Analysis with Bootstrap Estimation in R) v0.3.0
=============================================================

Good morning!
The current time is 11:22 AM on Friday December 11, 2020.

Dataset    :  .
X Variable :  Group in violin table
Y Variable :  presences

Unpaired mean difference of Landrace (n = 723) minus Wild (n = 157)
 -1.14 [95CI  -1.55; -0.717]

Unpaired mean difference of Modern cultivar (n = 143) minus Old cultivar (n = 46)
 -0.0974 [95CI  -0.562; 0.362]


5000 bootstrap resamples.
All confidence intervals are bias-corrected and accelerated.

plot(rlk_multi.two.group.unpaired.meandiff, color.column=`Group in violin table`,
     rawplot.ylabel = 'Presence (%)', show.legend=FALSE)

No difference between old and modern cultivars!

RLP

rlp_multi.two.group.unpaired <- 
  rlp_joined_groups %>% filter(!is.na(`PI-ID`)) %>% 
  dabest(`Group in violin table`, presences, 
         idx = list(c("Wild", "Landrace"),
                    c('Old cultivar', 'Modern cultivar')),
         paired = FALSE)
rlp_multi.two.group.unpaired

dabestr (Data Analysis with Bootstrap Estimation in R) v0.3.0
=============================================================

Good morning!
The current time is 11:22 AM on Friday December 11, 2020.

Dataset    :  .
The first five rows are:
# A tibble: 5 x 4
  names  presences `PI-ID`  `Group in violin table`
  <chr>      <dbl> <chr>    <fct>                  
1 AB-01        171 PI458020 Landrace               
2 AB-02        172 PI603713 Landrace               
3 DT2000       171 PI635999 Modern cultivar        
4 For          171 PI548645 Modern cultivar        
5 HN001        172 PI518664 Modern cultivar        

X Variable :  Group in violin table
Y Variable :  presences

Effect sizes(s) will be computed for:
  1. Landrace minus Wild
  2. Modern cultivar minus Old cultivar

rlp_multi.two.group.unpaired.meandiff <- mean_diff(rlp_multi.two.group.unpaired)
rlp_multi.two.group.unpaired.meandiff

dabestr (Data Analysis with Bootstrap Estimation in R) v0.3.0
=============================================================

Good morning!
The current time is 11:22 AM on Friday December 11, 2020.

Dataset    :  .
X Variable :  Group in violin table
Y Variable :  presences

Unpaired mean difference of Landrace (n = 723) minus Wild (n = 157)
 -1.45 [95CI  -1.74; -1.17]

Unpaired mean difference of Modern cultivar (n = 143) minus Old cultivar (n = 46)
 -0.0219 [95CI  -0.53; 0.477]


5000 bootstrap resamples.
All confidence intervals are bias-corrected and accelerated.

plot(rlp_multi.two.group.unpaired.meandiff, color.column=`Group in violin table`,
     rawplot.ylabel = 'Presence (%)', show.legend=FALSE)

Again, no difference between old and modern cultivars!

Stats - classic t-test

NBS

nbs_joined_groups %>% 
  filter( !is.na(`PI-ID`) ) %>%
    ggplot(aes(x=`Group in violin table`, y = presences,
               fill = `Group in violin table`)) + 
  geom_boxplot() +
  scale_fill_manual(values = col_list) + 
  theme_minimal_hgrid() +
  theme(axis.text.x = element_text(size=12),
        axis.text.y = element_text(size=12)) +
  geom_signif(comparisons = list(c('Wild', 'Landrace'),
                                 c('Old cultivar', 'Modern cultivar')), 
              map_signif_level = T) +
  guides(fill=FALSE) +
  ylab('Number of NLR genes') +
  xlab('Accession group')

RLP

rlp_joined_groups %>% 
  filter( !is.na(`PI-ID`) ) %>%
    ggplot(aes(x=`Group in violin table`, y = presences,
               fill = `Group in violin table`)) + 
  geom_boxplot() +
  scale_fill_manual(values = col_list) + 
  theme_minimal_hgrid() +
  theme(axis.text.x = element_text(size=12),
        axis.text.y = element_text(size=12)) +
  geom_signif(comparisons = list(c('Wild', 'Landrace'),
                                 c('Old cultivar', 'Modern cultivar')), 
              map_signif_level = T) +
  guides(fill=FALSE) +
  ylab('Number of RLP genes') +
  xlab('Accession group')

RLK

rlk_joined_groups %>% 
  filter( !is.na(`PI-ID`) ) %>%
    ggplot(aes(x=`Group in violin table`, y = presences,
               fill = `Group in violin table`)) + 
  geom_boxplot() +
  scale_fill_manual(values = col_list) + 
  theme_minimal_hgrid() +
  theme(axis.text.x = element_text(size=12),
        axis.text.y = element_text(size=12)) +
  geom_signif(comparisons = list(c('Wild', 'Landrace'),
                                 c('Old cultivar', 'Modern cultivar')), 
              map_signif_level = T) +
  guides(fill=FALSE) +
  ylab('Number of RLK genes') +
  xlab('Accession group')

Ratio of NLR/RLK plot

rlk_nbs_joined_groups <- rlk_joined_groups %>% inner_join(nbs_joined_groups, by=c('names'))
rlk_nbs_joined_groups$ratio <- rlk_nbs_joined_groups$presences.x / rlk_nbs_joined_groups$presences.y # RLK/NLR
rlk_nbs_joined_groups %>% 
  filter( !is.na(`PI-ID.x`) ) %>%
    ggplot(aes(x=`Group in violin table.x`, y = ratio,
               fill = `Group in violin table.x`)) + 
  geom_boxplot() +
  scale_fill_manual(values = col_list) + 
  theme_minimal_hgrid() +
  theme(axis.text.x = element_text(size=12),
        axis.text.y = element_text(size=12)) +
  geom_signif(comparisons = list(c('Wild', 'Landrace'),
                                 c('Old cultivar', 'Modern cultivar')), 
              map_signif_level = T) +
  guides(fill=FALSE) +
  ylab('Number of RLK divided by NLR') +
  xlab('Accession group')

Ratio of NLR/RLP plot

rlp_nbs_joined_groups <- rlp_joined_groups %>% inner_join(nbs_joined_groups, by=c('names'))
rlp_nbs_joined_groups$ratio <- rlp_nbs_joined_groups$presences.x / rlp_nbs_joined_groups$presences.y # RLP/NLR
rlp_nbs_joined_groups %>% 
  filter( !is.na(`PI-ID.x`) ) %>%
    ggplot(aes(x=`Group in violin table.x`, y = ratio,
               fill = `Group in violin table.x`)) + 
  geom_boxplot() +
  scale_fill_manual(values = col_list) + 
  theme_minimal_hgrid() +
  theme(axis.text.x = element_text(size=12),
        axis.text.y = element_text(size=12)) +
  geom_signif(comparisons = list(c('Wild', 'Landrace'),
                                 c('Old cultivar', 'Modern cultivar')), 
              map_signif_level = T) +
  guides(fill=FALSE) +
  ylab('Number of RLP divided by NLR') +
  xlab('Accession group')

sessionInfo()

R version 3.6.3 (2020-02-29)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 17134)

Matrix products: default

locale:
[1] LC_COLLATE=English_Australia.1252  LC_CTYPE=English_Australia.1252   
[3] LC_MONETARY=English_Australia.1252 LC_NUMERIC=C                      
[5] LC_TIME=English_Australia.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] ggforce_0.3.1        ggsignif_0.6.0       cowplot_1.0.0       
 [4] dabestr_0.3.0        magrittr_1.5         ggsci_2.9           
 [7] patchwork_1.0.0      forcats_0.5.0        stringr_1.4.0       
[10] dplyr_1.0.0          purrr_0.3.4          readr_1.3.1         
[13] tidyr_1.1.0          tibble_3.0.2         ggplot2_3.3.2       
[16] tidyverse_1.3.0      workflowr_1.6.2.9000

loaded via a namespace (and not attached):
 [1] nlme_3.1-148       fs_1.5.0.9000      lubridate_1.7.9    RColorBrewer_1.1-2
 [5] httr_1.4.2         rprojroot_1.3-2    tools_3.6.3        backports_1.1.10  
 [9] utf8_1.1.4         R6_2.4.1           vipor_0.4.5        DBI_1.1.0         
[13] mgcv_1.8-31        colorspace_1.4-1   withr_2.2.0        tidyselect_1.1.0  
[17] processx_3.4.4     compiler_3.6.3     git2r_0.27.1       cli_2.0.2         
[21] rvest_0.3.5        xml2_1.3.2         labeling_0.3       scales_1.1.1      
[25] callr_3.4.4        digest_0.6.25      rmarkdown_2.3      pkgconfig_2.0.3   
[29] htmltools_0.5.0    dbplyr_1.4.4       highr_0.8          rlang_0.4.7       
[33] readxl_1.3.1       rstudioapi_0.11    farver_2.0.3       generics_0.0.2    
[37] jsonlite_1.7.1     Matrix_1.2-18      Rcpp_1.0.5         ggbeeswarm_0.6.0  
[41] munsell_0.5.0      fansi_0.4.1        lifecycle_0.2.0    stringi_1.5.3     
[45] whisker_0.4        yaml_2.2.1         MASS_7.3-51.6      plyr_1.8.6        
[49] grid_3.6.3         blob_1.2.1         promises_1.1.1     crayon_1.3.4      
[53] lattice_0.20-41    haven_2.3.1        splines_3.6.3      hms_0.5.3         
[57] knitr_1.29         ps_1.3.4           pillar_1.4.4       boot_1.3-25       
[61] reprex_0.3.0       glue_1.4.2         evaluate_0.14      getPass_0.2-2     
[65] modelr_0.1.8       vctrs_0.3.1        tweenr_1.0.1       httpuv_1.5.4      
[69] cellranger_1.1.0   gtable_0.3.0       polyclip_1.10-0    assertthat_0.2.1  
[73] xfun_0.17          broom_0.5.6        later_1.1.0.1      beeswarm_0.2.3    
[77] ellipsis_0.3.1

R-gene count comparisons

Philipp Bayer

2020-09-18

Introduction

NBS part

Modern vs Old gene loss

Presence plotting per individual

RLK part

RLP part

Plotting together

Stats - Dabayes

NBS

RLK

RLP

Stats - classic t-test

NBS

RLP

RLK

Ratio of NLR/RLK plot

Ratio of NLR/RLP plot