Last updated: 2020-09-24
Checks: 6 1
Knit directory: R_gene_analysis/
This reproducible R Markdown analysis was created with workflowr (version 1.6.2.9000). The Checks tab describes the reproducibility checks that were applied when the results were created. The Past versions tab lists the development history.
The R Markdown file has unstaged changes. To know which version of the R Markdown file created these results, you’ll want to first commit it to the Git repo. If you’re still working on the analysis, you can ignore this warning. When you’re finished, you can run wflow_publish
to commit the R Markdown file and build the HTML.
Great job! The global environment was empty. Objects defined in the global environment can affect the analysis in your R Markdown file in unknown ways. For reproduciblity it’s best to always run the code in an empty environment.
The command set.seed(20200917)
was run prior to running the code in the R Markdown file. Setting a seed ensures that any results that rely on randomness, e.g. subsampling or permutations, are reproducible.
Great job! Recording the operating system, R version, and package versions is critical for reproducibility.
Nice! There were no cached chunks for this analysis, so you can be confident that you successfully produced the results during this run.
Great job! Using relative paths to the files within your workflowr project makes it easier to run your code on other machines.
Great! You are using Git for version control. Tracking code development and connecting the code version to the results is critical for reproducibility.
The results in this page were generated with repository version 1b346e9. See the Past versions tab to see a history of the changes made to the R Markdown and HTML files.
Note that you need to be careful to ensure that all relevant files for the analysis have been committed to Git prior to generating the results (you can use wflow_publish
or wflow_git_commit
). workflowr only checks the R Markdown file, but you know if there are other scripts or data files that it depends on. Below is the status of the Git repository when the results were generated:
Ignored files:
Ignored: .Rhistory
Ignored: .Rproj.user/
Untracked files:
Untracked: analysis/gwas.Rmd
Untracked: analysis/phylo.Rmd
Untracked: analysis/yield_link.Rmd
Untracked: code/transformToGAPIT.py
Untracked: data/Brec_R1.txt
Untracked: data/Brec_R2.txt
Untracked: data/CR15_R1.txt
Untracked: data/CR15_R2.txt
Untracked: data/CR_14_R1.txt
Untracked: data/CR_14_R2.txt
Untracked: data/KS_R1.txt
Untracked: data/KS_R2.txt
Untracked: data/NBS_PAV.txt.gz
Untracked: data/NLR_PAV_GD.txt
Untracked: data/NLR_PAV_GM.txt
Untracked: data/PPR1.txt
Untracked: data/PPR2.txt
Untracked: data/bac.txt
Untracked: data/brown.txt
Untracked: data/cy3.txt
Untracked: data/cy5.txt
Untracked: data/early.txt
Untracked: data/flowerings.txt
Untracked: data/foregeye.txt
Untracked: data/height.txt
Untracked: data/late.txt
Untracked: data/mature.txt
Untracked: data/motting.txt
Untracked: data/oil.txt
Untracked: data/pdh.txt
Untracked: data/protein.txt
Untracked: data/rust_tan.txt
Untracked: data/salt.txt
Untracked: data/seedq.txt
Untracked: data/seedweight.txt
Untracked: data/stem_termination.txt
Untracked: data/sudden.txt
Untracked: data/virus.txt
Untracked: data/yield.txt
Untracked: output/GAPIT/
Unstaged changes:
Modified: README.md
Modified: analysis/index.Rmd
Modified: analysis/total_numbers.Rmd
Note that any generated files, e.g. HTML, png, CSS, etc., are not included in this status report because it is ok for generated content to have uncommitted changes.
These are the previous versions of the repository in which changes were made to the R Markdown (analysis/total_numbers.Rmd
) and HTML (docs/total_numbers.html
) files. If you’ve configured a remote Git repository (see ?wflow_git_remote
), click on the hyperlinks in the table below to view the files as they were in that past version.
File | Version | Author | Date | Message |
---|---|---|---|---|
Rmd | 1b346e9 | Philipp Bayer | 2020-09-22 | More data! |
html | 1b346e9 | Philipp Bayer | 2020-09-22 | More data! |
html | 0777e1d | Philipp Bayer | 2020-09-22 | update |
Rmd | 3adcc5a | Philipp Bayer | 2020-09-22 | wflow_git_commit(all = TRUE) |
html | 3adcc5a | Philipp Bayer | 2020-09-22 | wflow_git_commit(all = TRUE) |
Rmd | 50f995b | Philipp Bayer | 2020-09-21 | Add more analysis |
html | 50f995b | Philipp Bayer | 2020-09-21 | Add more analysis |
html | e6e9b9b | Philipp Bayer | 2020-09-21 | Build site. |
Rmd | c1fbbf9 | Philipp Bayer | 2020-09-21 | wflow_publish("analysis/*") |
Rmd | c71005a | Philipp Bayer | 2020-09-18 | lots changes |
html | c71005a | Philipp Bayer | 2020-09-18 | lots changes |
html | 7d33bac | Philipp Bayer | 2020-09-18 | Build site. |
Rmd | 695db1e | Philipp Bayer | 2020-09-18 | wflow_publish(c(“analysis/eda.Rmd”, “analysis/first-analysis.Rmd”, |
This is the same analysis as first-analysis, but with total numbers, not percentages genes lost
knitr::opts_chunk$set(warning = FALSE, message = FALSE)
library(tidyverse)
library(patchwork)
library(ggsci)
library(dabestr)
library(dabestr)
library(cowplot)
library(ggsignif)
library(ggforce)
theme_set(theme_cowplot())
npg_col = pal_npg("nrc")(9)
col_list <- c(`Wild-type`=npg_col[8],
Landrace = npg_col[3],
`Old cultivar`=npg_col[2],
`Modern cultivar`=npg_col[4])
pav_table <- read_tsv('./data/soybean_pan_pav.matrix_gene.txt.gz')
Let’s pull the NBS genes from the table
nbs <- read_tsv('./data/Lee.NBS.candidates.lst', col_names = c('Name', 'Class'))
nbs
# A tibble: 486 x 2
Name Class
<chr> <chr>
1 UWASoyPan00953.t1 CN
2 GlymaLee.13G222900.1.p CN
3 GlymaLee.18G227000.1.p CN
4 GlymaLee.18G080600.1.p CN
5 GlymaLee.20G036200.1.p CN
6 UWASoyPan01876.t1 CN
7 UWASoyPan04211.t1 CN
8 GlymaLee.19G105400.1.p CN
9 GlymaLee.18G085100.1.p CN
10 GlymaLee.11G142600.1.p CN
# ... with 476 more rows
# have to remove the .t1s
nbs$Name <- gsub('.t1','', nbs$Name)
nbs_pav_table <- pav_table %>% filter(Individual %in% nbs$Name)
write_delim(nbs_pav_table, 'data/NBS_PAV.txt.gz', delim='\t')
groups <- read_csv('./data/Table_of_cultivar_groups.csv')
groups
# A tibble: 1,069 x 3
`Data-storage-ID` `PI-ID` `Group in violin table`
<chr> <chr> <chr>
1 SRR1533284 PI416890 landrace
2 SRR1533282 PI323576 landrace
3 SRR1533292 PI157421 landrace
4 SRR1533216 PI594615 landrace
5 SRR1533239 PI603336 landrace
6 USB-108 PI165675 landrace
7 HNEX-13 PI253665D landrace
8 USB-382 PI603549 landrace
9 SRR1533236 PI587552 landrace
10 SRR1533332 PI567293 landrace
# ... with 1,059 more rows
Which genes are present more or less in old / modern cultivars?
big_norm_count <- tibble(
name = character(),
landrace = numeric(),
Modern_cultivar = numeric(),
Old_cultivar = numeric(),
`Wild-type` = numeric()
)
groups_list <- split(groups$`Group in violin table`, groups$`Data-storage-ID`)
for( i in 1:nrow(nbs_pav_table) ) {
this_gene <- nbs_pav_table[i,]
groups_count <- list()
total_groups_count <- list()
for (x in seq_along(nbs_pav_table)){
if ( x == 1) next
thisind <- colnames(nbs_pav_table)[x]
thisind_group <- groups_list[[thisind]]
if( is.null(thisind_group) ) next # no group for this individual
pavs <- this_gene[[x]] # either 1 or 0
if ( thisind_group %in% names(groups_count)) {
# count the number of present genes
groups_count[[thisind_group]] <- groups_count[[thisind_group]] + pavs
# count the total number of individuals for this group
total_groups_count[[thisind_group]] <- total_groups_count[[thisind_group]] + 1
} else {
groups_count[[thisind_group]] <- pavs
total_groups_count[[thisind_group]] <- 1
}
}
norm_group_count <- list()
for (m in seq_along(groups_count)) {
thisname <- names(groups_count)[m]
norm_group_count[[thisname]] <- groups_count[[thisname]] / total_groups_count[[thisname]] * 100
}
norm_group_count$Individual <- this_gene$Individual
big_norm_count <- rbind(big_norm_count, as_tibble(norm_group_count))
}
# wow, I DO write R like Python
Let’s pull out the genes that are variable in any group
var_norm_count <- big_norm_count %>%
filter(landrace != 100 &
Modern_cultivar != 100 &
Old_cultivar != 100 &
`Wild-type` != 100)
var_norm_count <- left_join(var_norm_count, nbs, by=c('Individual'='Name'))
var_norm_count$Mod_minus_Old <- var_norm_count$Modern_cultivar - var_norm_count$Old_cultivar
The top 20 genes reduced the most in modern cultivars compared with old cultivars:
var_norm_count %>%
arrange(Mod_minus_Old) %>%
head(20) %>%
select(Individual, `Wild-type`, landrace, Old_cultivar, Modern_cultivar, Mod_minus_Old, Class) %>%
knitr::kable()
Individual | Wild-type | landrace | Old_cultivar | Modern_cultivar | Mod_minus_Old | Class |
---|---|---|---|---|---|---|
UWASoyPan03261 | 74.52229 | 62.102351 | 60.86957 | 26.573427 | -34.296139 | TX |
UWASoyPan00953 | 83.43949 | 33.056708 | 43.47826 | 11.188811 | -32.289450 | CN |
UWASoyPan00725 | 89.17197 | 92.392808 | 82.60870 | 51.748252 | -30.860444 | TX |
UWASoyPan00316 | 91.08280 | 90.594744 | 80.43478 | 50.349650 | -30.085132 | NBS |
UWASoyPan01530 | 80.89172 | 45.089903 | 47.82609 | 20.979021 | -26.847066 | NL |
UWASoyPan00975 | 42.67516 | 19.778700 | 32.60870 | 7.692308 | -24.916388 | TX |
UWASoyPan00155 | 85.35032 | 77.316736 | 63.04348 | 42.657343 | -20.386136 | NBS |
UWASoyPan00772 | 54.77707 | 18.948824 | 30.43478 | 11.888112 | -18.546671 | NBS |
UWASoyPan03402 | 62.42038 | 42.738589 | 36.95652 | 18.881119 | -18.075403 | NBS |
UWASoyPan01320 | 45.85987 | 30.152144 | 23.91304 | 6.993007 | -16.920036 | NBS |
UWASoyPan02799 | 65.60510 | 30.843707 | 19.56522 | 2.797203 | -16.768015 | NBS |
GlymaLee.03G045500.1.p | 82.16561 | 58.921162 | 67.39130 | 51.748252 | -15.643053 | OTHER |
UWASoyPan01253 | 73.24841 | 24.481328 | 17.39130 | 3.496504 | -13.894801 | NBS |
UWASoyPan03340 | 18.47134 | 26.279391 | 19.56522 | 6.293706 | -13.271511 | TX |
GlymaLee.06G230600.1.p | 78.34395 | 57.399723 | 56.52174 | 44.055944 | -12.465795 | TX |
GlymaLee.06G228900.1.p | 96.17834 | 74.688797 | 91.30435 | 80.419580 | -10.884767 | TX |
UWASoyPan00251 | 60.50955 | 20.470263 | 21.73913 | 11.188811 | -10.550319 | NL |
GlymaLee.03G045700.1.p | 75.79618 | 67.496542 | 65.21739 | 55.244755 | -9.972636 | OTHER |
GlymaLee.06G228600.1.p | 90.44586 | 79.391425 | 82.60870 | 72.727273 | -9.881423 | TX |
UWASoyPan00670 | 30.57325 | 6.915629 | 10.86957 | 2.097902 | -8.771663 | TX |
So these are the NLR genes selected against during soybean breeding.
Let’s look at those without the TX ones:
var_norm_count %>%
arrange(Mod_minus_Old) %>%
select(Individual, `Wild-type`, landrace, Old_cultivar, Modern_cultivar, Mod_minus_Old, Class) %>%
filter(Class != 'TX') %>%
head(20) %>%
knitr::kable()
Individual | Wild-type | landrace | Old_cultivar | Modern_cultivar | Mod_minus_Old | Class |
---|---|---|---|---|---|---|
UWASoyPan00953 | 83.439490 | 33.0567082 | 43.478261 | 11.188811 | -32.289450 | CN |
UWASoyPan00316 | 91.082802 | 90.5947441 | 80.434783 | 50.349650 | -30.085132 | NBS |
UWASoyPan01530 | 80.891720 | 45.0899032 | 47.826087 | 20.979021 | -26.847066 | NL |
UWASoyPan00155 | 85.350319 | 77.3167358 | 63.043478 | 42.657343 | -20.386136 | NBS |
UWASoyPan00772 | 54.777070 | 18.9488243 | 30.434783 | 11.888112 | -18.546671 | NBS |
UWASoyPan03402 | 62.420382 | 42.7385892 | 36.956522 | 18.881119 | -18.075403 | NBS |
UWASoyPan01320 | 45.859873 | 30.1521438 | 23.913044 | 6.993007 | -16.920036 | NBS |
UWASoyPan02799 | 65.605096 | 30.8437068 | 19.565217 | 2.797203 | -16.768015 | NBS |
GlymaLee.03G045500.1.p | 82.165605 | 58.9211618 | 67.391304 | 51.748252 | -15.643053 | OTHER |
UWASoyPan01253 | 73.248408 | 24.4813278 | 17.391304 | 3.496504 | -13.894801 | NBS |
UWASoyPan00251 | 60.509554 | 20.4702628 | 21.739130 | 11.188811 | -10.550319 | NL |
GlymaLee.03G045700.1.p | 75.796178 | 67.4965422 | 65.217391 | 55.244755 | -9.972636 | OTHER |
UWASoyPan03194 | 45.222930 | 10.7883817 | 10.869565 | 2.097902 | -8.771663 | NBS |
GlymaLee.03G045900.1.p | 74.522293 | 65.5601660 | 63.043478 | 54.545454 | -8.498024 | OTHER |
UWASoyPan00326 | 40.764331 | 19.0871369 | 17.391304 | 9.790210 | -7.601095 | CN |
UWASoyPan02496 | 26.114650 | 14.6611342 | 13.043478 | 6.993007 | -6.050471 | CN |
UWASoyPan01217 | 57.961783 | 14.3845090 | 10.869565 | 5.594406 | -5.275160 | NBS |
GlymaLee.06G229300.1.p | 86.624204 | 50.3457815 | 65.217391 | 60.839161 | -4.378230 | TN |
UWASoyPan04757 | 6.369427 | 0.9681881 | 4.347826 | 0.000000 | -4.347826 | NBS |
GlymaLee.03G042000.1.p | 99.363057 | 88.3817427 | 91.304348 | 88.111888 | -3.192460 | CNL |
Let’s plot:
var_norm_count %>%
arrange(Mod_minus_Old) %>%
head(20) %>%
select(Individual, `Wild-type`, landrace, Old_cultivar, Modern_cultivar, Class) %>%
pivot_longer(!c(Individual, Class)) %>%
mutate(name = str_replace_all(name, 'landrace', 'Landrace')) %>%
mutate(name = str_replace_all(name, 'Old_cultivar', 'Old cultivar')) %>%
mutate(name = str_replace_all(name, 'Modern_cultivar', 'Modern cultivar')) %>%
ggplot(aes(x=factor(name, levels=c('Wild-type', 'Landrace', 'Old cultivar', 'Modern cultivar')), y=value, group=Individual, color=Class)) +
geom_line(size=1.5) +
xlab('Group') +
ylab('Percentage presence of gene in group') +
scale_color_brewer(palette = 'Dark2')
The top 20 genes increased the most in modern cultivars:
var_norm_count %>%
arrange(desc(Mod_minus_Old)) %>%
head(20) %>%
select(Individual, `Wild-type`, landrace, Old_cultivar, Modern_cultivar, Mod_minus_Old, Class) %>%
knitr::kable()
Individual | Wild-type | landrace | Old_cultivar | Modern_cultivar | Mod_minus_Old | Class |
---|---|---|---|---|---|---|
GlymaLee.01G030900.1.p | 81.52866 | 50.345782 | 34.782609 | 71.328671 | 36.546063 | NL |
GlymaLee.15G199500.1.p | 85.98726 | 73.582296 | 69.565217 | 86.713287 | 17.148069 | CN |
GlymaLee.15G199200.1.p | 92.99363 | 74.827109 | 73.913044 | 90.909091 | 16.996047 | CNL |
GlymaLee.06G232800.1.p | 40.12739 | 47.579530 | 56.521739 | 73.426573 | 16.904834 | NBS |
GlymaLee.01G088400.1.p | 49.68153 | 89.488243 | 82.608696 | 97.202797 | 14.594102 | TNL |
UWASoyPan05312 | 30.57325 | 8.575380 | 10.869565 | 23.776224 | 12.906659 | NBS |
UWASoyPan00005 | 36.94268 | 13.831259 | 8.695652 | 20.279720 | 11.584068 | NBS |
UWASoyPan01876 | 43.31210 | 15.629322 | 8.695652 | 20.279720 | 11.584068 | CN |
GlymaLee.10G034600.1.p | 91.08280 | 91.839557 | 84.782609 | 95.104895 | 10.322286 | NL |
UWASoyPan01330 | 29.29936 | 25.172891 | 15.217391 | 23.776224 | 8.558832 | NBS |
UWASoyPan00202 | 50.31847 | 35.408022 | 26.086956 | 32.167832 | 6.080876 | NBS |
UWASoyPan00427 | 97.45223 | 84.232365 | 73.913044 | 79.720280 | 5.807236 | NBS |
GlymaLee.15G199300.1.p | 96.17834 | 88.243430 | 93.478261 | 98.601399 | 5.123138 | NL |
GlymaLee.03G070700.1.p | 65.60510 | 89.903181 | 89.130435 | 93.706294 | 4.575859 | TNL |
GlymaLee.06G229100.1.p | 85.98726 | 38.174274 | 50.000000 | 53.146853 | 3.146853 | TX |
GlymaLee.07G070200.1.p | 78.98089 | 91.286307 | 91.304348 | 94.405594 | 3.101247 | NBS |
UWASoyPan01418 | 68.78981 | 66.251729 | 71.739130 | 74.825175 | 3.086044 | TX |
GlymaLee.03G070600.1.p | 66.87898 | 90.179806 | 91.304348 | 93.706294 | 2.401946 | TNL |
GlymaLee.16G175200.1.p | 95.54140 | 96.127248 | 95.652174 | 97.902098 | 2.249924 | TNL |
UWASoyPan04967 | 35.03185 | 6.224066 | 2.173913 | 4.195804 | 2.021891 | TX |
As these genes have relatively high percentages in WT they must have been re-introduced by using WT in the breeding process.
Let’s plot those too:
var_norm_count %>%
arrange(desc(Mod_minus_Old)) %>%
head(20) %>%
select(Individual, `Wild-type`, landrace, Old_cultivar, Modern_cultivar, Class) %>%
pivot_longer(!c(Individual, Class)) %>%
mutate(name = str_replace_all(name, 'landrace', 'Landrace')) %>%
mutate(name = str_replace_all(name, 'Old_cultivar', 'Old cultivar')) %>%
mutate(name = str_replace_all(name, 'Modern_cultivar', 'Modern cultivar')) %>%
ggplot(aes(x=factor(name, levels=c('Wild-type', 'Landrace', 'Old cultivar', 'Modern cultivar')), y=value, group=Individual, color=Class)) +
geom_line(size=1.5) +
xlab('Group') +
ylab('Percentage presence of gene in group') +
scale_color_brewer(palette = 'Dark2')
names <- c()
presences <- c()
for (i in seq_along(nbs_pav_table)){
if ( i == 1) next
thisind <- colnames(nbs_pav_table)[i]
pavs <- nbs_pav_table[[i]]
presents <- sum(pavs)
names <- c(names, thisind)
presences <- c(presences, presents)
}
nbs_res_tibb <- new_tibble(list(names = names, presences = presences))
OK what do these presence percentages look like?
ggplot(data=nbs_res_tibb, aes(x=presences)) + geom_histogram(bins=25)
On average, 446.0027027 of NBS genes are present in each individual.
Now let’s join the table of presences to the four different types so we can group these numbers.
nbs_joined_groups <- left_join(nbs_res_tibb, groups, by = c('names'='Data-storage-ID'))
nbs_joined_groups$`Group in violin table` <- gsub('landrace', 'Landrace', nbs_joined_groups$`Group in violin table`)
nbs_joined_groups$`Group in violin table` <- gsub('Modern_cultivar', 'Modern cultivar', nbs_joined_groups$`Group in violin table`)
nbs_joined_groups$`Group in violin table` <- gsub('Old_cultivar', 'Old cultivar', nbs_joined_groups$`Group in violin table`)
nbs_joined_groups$`Group in violin table` <- factor(nbs_joined_groups$`Group in violin table`, levels=c(NA, 'Wild-type', 'Landrace', 'Old cultivar', 'Modern cultivar'))
nbs_vio <- nbs_joined_groups %>% filter(!is.na(`Group in violin table`)) %>%
ggplot(aes(y=presences, x=`Group in violin table`, fill=`Group in violin table`)) +
geom_violin(draw_quantiles = c(0.5)) +
geom_sina(alpha=0.5) +
geom_smooth(aes(group=1), method='glm') +
scale_fill_manual(values=col_list) +
guides(fill = FALSE)
nbs_vio
nbs_joined_groups %>% filter(`Group in violin table` != 'NA') %>%
ggplot(aes(y=presences, x=`Group in violin table`, fill=`Group in violin table`)) +
geom_smooth(aes(group=1), method='lm', se = FALSE) +
geom_jitter() +
scale_fill_manual(values=col_list)+
guides(fill = FALSE)
nbs_joined_groups %>% filter(!is.na(`PI-ID`)) %>%
group_by(`Group in violin table`) %>%
summarise(min_present = min(presences),
max_present = max(presences),
mean_present = mean(presences),
median_present = median(presences),
std_present = sd(presences)) %>%
knitr::kable()
Group in violin table | min_present | max_present | mean_present | median_present | std_present |
---|---|---|---|---|---|
Wild-type | 435 | 473 | 452.9490 | 453 | 7.170806 |
Landrace | 429 | 465 | 444.8907 | 445 | 5.011672 |
Old cultivar | 433 | 456 | 444.8696 | 445 | 5.200892 |
Modern cultivar | 431 | 455 | 442.3147 | 442 | 4.047986 |
Let’s do the same plot with RLKs
rlk <- read_tsv('./data/Lee.RLK.candidates.lst', col_names = c('Name', 'Class', 'Subtype'))
rlk
# A tibble: 1,173 x 3
Name Class Subtype
<chr> <chr> <chr>
1 GlymaLee.01G001800.1.p RLK lrr
2 GlymaLee.01G004900.1.p RLK lrr
3 GlymaLee.01G007300.1.p RLK lrr
4 GlymaLee.01G007400.1.p RLK lrr
5 GlymaLee.01G012800.1.p RLK other_receptor
6 GlymaLee.01G018800.1.p RLK lrr
7 GlymaLee.01G021100.1.p RLK other_receptor
8 GlymaLee.01G025500.1.p RLK lysm
9 GlymaLee.01G026500.1.p RLK other_receptor
10 GlymaLee.01G027000.1.p RLK lrr
# ... with 1,163 more rows
# have to remove the .t1s
rlk$Name <- gsub('.t1','', rlk$Name)
rlk_pav_table <- pav_table %>% filter(Individual %in% rlk$Name)
rlk_pav_table
# A tibble: 1,173 x 1,111
Individual `AB-01` `AB-02` `BR-01` `BR-02` `BR-03` `BR-04` `BR-05` `BR-06`
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 GlymaLee.~ 1 1 1 1 1 1 1 1
2 GlymaLee.~ 1 1 1 1 1 1 1 1
3 GlymaLee.~ 1 1 1 1 1 1 1 1
4 GlymaLee.~ 1 1 1 1 1 1 1 1
5 GlymaLee.~ 1 1 1 1 1 1 1 1
6 GlymaLee.~ 1 1 1 1 1 1 1 1
7 GlymaLee.~ 1 1 1 1 1 1 1 1
8 GlymaLee.~ 1 1 1 1 1 1 1 1
9 GlymaLee.~ 1 1 1 1 1 1 1 1
10 GlymaLee.~ 1 1 1 1 1 1 1 1
# ... with 1,163 more rows, and 1,102 more variables: `BR-07` <dbl>,
# `BR-08` <dbl>, `BR-09` <dbl>, `BR-10` <dbl>, `BR-11` <dbl>, `BR-12` <dbl>,
# `BR-13` <dbl>, `BR-14` <dbl>, `BR-15` <dbl>, `BR-16` <dbl>, `BR-17` <dbl>,
# `BR-18` <dbl>, `BR-20` <dbl>, `BR-23` <dbl>, `BR-24` <dbl>, `BR-29` <dbl>,
# `BR-30` <dbl>, `BR-32` <dbl>, DT2000 <dbl>, ESS <dbl>, For <dbl>,
# HN001 <dbl>, HN002 <dbl>, HN003 <dbl>, HN004 <dbl>, HN005 <dbl>,
# HN006 <dbl>, HN007 <dbl>, HN008 <dbl>, HN009 <dbl>, HN010 <dbl>,
# HN011 <dbl>, HN012 <dbl>, HN013 <dbl>, HN015 <dbl>, HN016B <dbl>,
# HN017B <dbl>, HN018 <dbl>, HN019 <dbl>, HN021 <dbl>, HN022 <dbl>,
# HN023 <dbl>, HN024 <dbl>, HN025 <dbl>, HN026 <dbl>, HN027 <dbl>,
# HN028 <dbl>, HN029 <dbl>, HN030 <dbl>, HN031 <dbl>, HN032 <dbl>,
# HN033 <dbl>, HN034 <dbl>, HN035 <dbl>, HN036 <dbl>, HN037 <dbl>,
# HN038 <dbl>, HN039 <dbl>, HN040 <dbl>, HN041 <dbl>, HN042 <dbl>,
# HN043 <dbl>, HN044 <dbl>, HN045 <dbl>, HN046 <dbl>, HN047 <dbl>,
# HN048 <dbl>, HN049 <dbl>, HN050 <dbl>, HN051 <dbl>, HN052 <dbl>,
# HN053 <dbl>, HN054 <dbl>, HN055 <dbl>, HN056 <dbl>, HN057 <dbl>,
# HN058 <dbl>, HN059 <dbl>, HN060 <dbl>, HN061 <dbl>, HN062 <dbl>,
# HN063 <dbl>, HN064 <dbl>, HN065 <dbl>, HN066 <dbl>, HN067 <dbl>,
# HN068 <dbl>, HN069 <dbl>, HN070 <dbl>, HN071 <dbl>, HN072 <dbl>,
# HN073 <dbl>, HN074 <dbl>, HN075 <dbl>, HN076 <dbl>, HN077 <dbl>,
# HN078 <dbl>, HN079 <dbl>, HN080 <dbl>, HN081 <dbl>, ...
names <- c()
presences <- c()
for (i in seq_along(rlk_pav_table)){
if ( i == 1) next
thisind <- colnames(rlk_pav_table)[i]
pavs <- rlk_pav_table[[i]]
presents <- sum(pavs)
names <- c(names, thisind)
presences <- c(presences, presents)
}
rlk_res_tibb <- new_tibble(list(names = names, presences = presences))
rlk_res_tibb
# A tibble: 1,110 x 2
names presences
<chr> <dbl>
1 AB-01 1167
2 AB-02 1162
3 BR-01 1166
4 BR-02 1165
5 BR-03 1166
6 BR-04 1167
7 BR-05 1164
8 BR-06 1167
9 BR-07 1165
10 BR-08 1167
# ... with 1,100 more rows
OK what do these presence percentages look like?
ggplot(data=rlk_res_tibb, aes(x=presences)) + geom_histogram(bins=25)
On average, 1163.5036036% of NBS genes are present in each individual.
Now let’s join the table of presences to the four different types so we can group these numbers.
rlk_joined_groups <- left_join(rlk_res_tibb, groups, by = c('names'='Data-storage-ID'))
rlk_joined_groups$`Group in violin table` <- gsub('landrace', 'Landrace', rlk_joined_groups$`Group in violin table`)
rlk_joined_groups$`Group in violin table` <- gsub('Modern_cultivar', 'Modern cultivar', rlk_joined_groups$`Group in violin table`)
rlk_joined_groups$`Group in violin table` <- gsub('Old_cultivar', 'Old cultivar', rlk_joined_groups$`Group in violin table`)
rlk_joined_groups$`Group in violin table` <- factor(rlk_joined_groups$`Group in violin table`, levels=c(NA, 'Wild-type', 'Landrace', 'Old cultivar', 'Modern cultivar'))
rlk_vio <- rlk_joined_groups %>% filter(`Group in violin table` != 'NA') %>%
ggplot(aes(y=presences, x=`Group in violin table`, fill=`Group in violin table`)) +
geom_violin(draw_quantiles = c(0.5)) +
geom_sina(alpha=0.5) +
geom_smooth(aes(group=1), method='lm', se = FALSE) +
scale_fill_manual(values=col_list)+
guides(fill = FALSE)
rlk_vio
rlk_joined_groups %>% filter(!is.na(`PI-ID`)) %>%
group_by(`Group in violin table`) %>%
summarise(min_present = min(presences),
max_present = max(presences),
mean_present = mean(presences),
median_present = median(presences),
std_present = sd(presences)) %>%
knitr::kable()
Group in violin table | min_present | max_present | mean_present | median_present | std_present |
---|---|---|---|---|---|
Wild-type | 1154 | 1170 | 1164.357 | 1165 | 2.554565 |
Landrace | 1157 | 1168 | 1163.217 | 1163 | 1.499264 |
Old cultivar | 1161 | 1166 | 1163.587 | 1164 | 1.407537 |
Modern cultivar | 1159 | 1168 | 1163.490 | 1163 | 1.472122 |
And now with RLPs
rlp <- read_tsv('./data/Lee.RLP.candidates.lst', col_names = c('Name', 'Class', 'Subtype'))
# have to remove the .t1s
rlp$Name <- gsub('.t1','', rlp$Name)
rlp_pav_table <- pav_table %>% filter(Individual %in% rlp$Name)
names <- c()
presences <- c()
for (i in seq_along(rlp_pav_table)){
if ( i == 1) next
thisind <- colnames(rlp_pav_table)[i]
pavs <- rlp_pav_table[[i]]
presents <- sum(pavs)
names <- c(names, thisind)
presences <- c(presences, presents)
}
rlp_res_tibb <- new_tibble(list(names = names, presences = presences))
OK what do these presence percentages look like?
ggplot(data=rlp_res_tibb, aes(x=presences)) + geom_histogram(bins=25)
On average, 172.1693694% of NBS genes are present in each individual.
Now let’s join the table of presences to the four different types so we can group these numbers.
rlp_joined_groups <- left_join(rlp_res_tibb, groups, by = c('names'='Data-storage-ID'))
rlp_joined_groups$`Group in violin table` <- gsub('landrace', 'Landrace', rlp_joined_groups$`Group in violin table`)
rlp_joined_groups$`Group in violin table` <- gsub('Modern_cultivar', 'Modern cultivar', rlp_joined_groups$`Group in violin table`)
rlp_joined_groups$`Group in violin table` <- gsub('Old_cultivar', 'Old cultivar', rlp_joined_groups$`Group in violin table`)
rlp_joined_groups$`Group in violin table` <- factor(rlp_joined_groups$`Group in violin table`, levels=c(NA, 'Wild-type', 'Landrace', 'Old cultivar', 'Modern cultivar'))
rlp_vio <- rlp_joined_groups %>% filter(`Group in violin table` != 'NA') %>%
ggplot(aes(y=presences, x=`Group in violin table`, fill=`Group in violin table`)) +
geom_violin(draw_quantiles = c(0.5)) +
geom_sina(alpha=0.5) +
geom_smooth(aes(group=1), method='lm', se = FALSE) +
scale_fill_manual(values=col_list)+
guides(fill = FALSE)
rlp_vio
rlp_joined_groups %>% filter(`Group in violin table` != 'NA') %>%
ggplot(aes(y=presences, x=`Group in violin table`, fill=`Group in violin table`)) +
geom_jitter() +
#geom_sina(alpha=0.5) +
scale_fill_manual(values=col_list)+
guides(fill = FALSE) +
ylim(c(87, 100))
rlp_joined_groups %>% filter(!is.na(`PI-ID`)) %>%
group_by(`Group in violin table`) %>%
summarise(min_present = min(presences),
max_present = max(presences),
mean_present = mean(presences),
median_present = median(presences),
std_present = sd(presences)) %>%
knitr::kable()
Group in violin table | min_present | max_present | mean_present | median_present | std_present |
---|---|---|---|---|---|
Wild-type | 168 | 177 | 173.4140 | 173 | 1.617392 |
Landrace | 162 | 177 | 171.9668 | 172 | 1.661526 |
Old cultivar | 169 | 176 | 171.8261 | 172 | 1.623499 |
Modern cultivar | 169 | 175 | 171.8042 | 172 | 1.290587 |
nbs_vio + rlk_vio + rlp_vio
I want to know whether the groups are statistically significantly different. First let’s use dabestr
Let’s run dabestr first:
nbs_multi.two.group.unpaired <-
nbs_joined_groups %>% filter(!is.na(`PI-ID`)) %>%
dabest(`Group in violin table`, presences,
idx = list(c("Wild-type", "Landrace"),
c('Old cultivar', 'Modern cultivar')),
paired = FALSE)
nbs_multi.two.group.unpaired
dabestr (Data Analysis with Bootstrap Estimation in R) v0.3.0
=============================================================
Good afternoon!
The current time is 14:29 PM on Thursday September 24, 2020.
Dataset : .
The first five rows are:
# A tibble: 5 x 4
names presences `PI-ID` `Group in violin table`
<chr> <dbl> <chr> <fct>
1 AB-01 445 PI458020 Landrace
2 AB-02 454 PI603713 Landrace
3 DT2000 447 PI635999 Modern cultivar
4 For 448 PI548645 Modern cultivar
5 HN001 448 PI518664 Modern cultivar
X Variable : Group in violin table
Y Variable : presences
Effect sizes(s) will be computed for:
1. Landrace minus Wild-type
2. Modern cultivar minus Old cultivar
nbs_multi.two.group.unpaired.meandiff <- mean_diff(nbs_multi.two.group.unpaired)
nbs_multi.two.group.unpaired.meandiff
dabestr (Data Analysis with Bootstrap Estimation in R) v0.3.0
=============================================================
Good afternoon!
The current time is 14:30 PM on Thursday September 24, 2020.
Dataset : .
X Variable : Group in violin table
Y Variable : presences
Unpaired mean difference of Landrace (n = 723) minus Wild-type (n = 157)
-8.06 [95CI -9.24; -6.86]
Unpaired mean difference of Modern cultivar (n = 143) minus Old cultivar (n = 46)
-2.55 [95CI -4.25; -0.97]
5000 bootstrap resamples.
All confidence intervals are bias-corrected and accelerated.
plot(nbs_multi.two.group.unpaired.meandiff, color.column=`Group in violin table`,
rawplot.ylabel = 'Presence (%)', show.legend=FALSE)
rlk_multi.two.group.unpaired <-
rlk_joined_groups %>% filter(!is.na(`PI-ID`)) %>%
dabest(`Group in violin table`, presences,
idx = list(c("Wild-type", "Landrace"),
c('Old cultivar', 'Modern cultivar')),
paired = FALSE)
rlk_multi.two.group.unpaired
dabestr (Data Analysis with Bootstrap Estimation in R) v0.3.0
=============================================================
Good afternoon!
The current time is 14:30 PM on Thursday September 24, 2020.
Dataset : .
The first five rows are:
# A tibble: 5 x 4
names presences `PI-ID` `Group in violin table`
<chr> <dbl> <chr> <fct>
1 AB-01 1167 PI458020 Landrace
2 AB-02 1162 PI603713 Landrace
3 DT2000 1165 PI635999 Modern cultivar
4 For 1163 PI548645 Modern cultivar
5 HN001 1163 PI518664 Modern cultivar
X Variable : Group in violin table
Y Variable : presences
Effect sizes(s) will be computed for:
1. Landrace minus Wild-type
2. Modern cultivar minus Old cultivar
rlk_multi.two.group.unpaired.meandiff <- mean_diff(rlk_multi.two.group.unpaired)
rlk_multi.two.group.unpaired.meandiff
dabestr (Data Analysis with Bootstrap Estimation in R) v0.3.0
=============================================================
Good afternoon!
The current time is 14:30 PM on Thursday September 24, 2020.
Dataset : .
X Variable : Group in violin table
Y Variable : presences
Unpaired mean difference of Landrace (n = 723) minus Wild-type (n = 157)
-1.14 [95CI -1.55; -0.717]
Unpaired mean difference of Modern cultivar (n = 143) minus Old cultivar (n = 46)
-0.0974 [95CI -0.562; 0.362]
5000 bootstrap resamples.
All confidence intervals are bias-corrected and accelerated.
plot(rlk_multi.two.group.unpaired.meandiff, color.column=`Group in violin table`,
rawplot.ylabel = 'Presence (%)', show.legend=FALSE)
No difference between old and modern cultivars!
rlp_multi.two.group.unpaired <-
rlp_joined_groups %>% filter(!is.na(`PI-ID`)) %>%
dabest(`Group in violin table`, presences,
idx = list(c("Wild-type", "Landrace"),
c('Old cultivar', 'Modern cultivar')),
paired = FALSE)
rlp_multi.two.group.unpaired
dabestr (Data Analysis with Bootstrap Estimation in R) v0.3.0
=============================================================
Good afternoon!
The current time is 14:30 PM on Thursday September 24, 2020.
Dataset : .
The first five rows are:
# A tibble: 5 x 4
names presences `PI-ID` `Group in violin table`
<chr> <dbl> <chr> <fct>
1 AB-01 171 PI458020 Landrace
2 AB-02 172 PI603713 Landrace
3 DT2000 171 PI635999 Modern cultivar
4 For 171 PI548645 Modern cultivar
5 HN001 172 PI518664 Modern cultivar
X Variable : Group in violin table
Y Variable : presences
Effect sizes(s) will be computed for:
1. Landrace minus Wild-type
2. Modern cultivar minus Old cultivar
rlp_multi.two.group.unpaired.meandiff <- mean_diff(rlp_multi.two.group.unpaired)
rlp_multi.two.group.unpaired.meandiff
dabestr (Data Analysis with Bootstrap Estimation in R) v0.3.0
=============================================================
Good afternoon!
The current time is 14:30 PM on Thursday September 24, 2020.
Dataset : .
X Variable : Group in violin table
Y Variable : presences
Unpaired mean difference of Landrace (n = 723) minus Wild-type (n = 157)
-1.45 [95CI -1.74; -1.17]
Unpaired mean difference of Modern cultivar (n = 143) minus Old cultivar (n = 46)
-0.0219 [95CI -0.53; 0.477]
5000 bootstrap resamples.
All confidence intervals are bias-corrected and accelerated.
plot(rlp_multi.two.group.unpaired.meandiff, color.column=`Group in violin table`,
rawplot.ylabel = 'Presence (%)', show.legend=FALSE)
Again, no difference between old and modern cultivars!
nbs_joined_groups %>%
filter( !is.na(`PI-ID`) ) %>%
ggplot(aes(x=`Group in violin table`, y = presences,
fill = `Group in violin table`)) +
geom_boxplot() +
scale_fill_manual(values = col_list) +
theme_minimal_hgrid() +
theme(axis.text.x = element_text(size=12),
axis.text.y = element_text(size=12)) +
geom_signif(comparisons = list(c('Wild-type', 'Landrace'),
c('Old cultivar', 'Modern cultivar')),
map_signif_level = T) +
guides(fill=FALSE) +
ylab('Number of NLR genes') +
xlab('Accession group')
rlp_joined_groups %>%
filter( !is.na(`PI-ID`) ) %>%
ggplot(aes(x=`Group in violin table`, y = presences,
fill = `Group in violin table`)) +
geom_boxplot() +
scale_fill_manual(values = col_list) +
theme_minimal_hgrid() +
theme(axis.text.x = element_text(size=12),
axis.text.y = element_text(size=12)) +
geom_signif(comparisons = list(c('Wild-type', 'Landrace'),
c('Old cultivar', 'Modern cultivar')),
map_signif_level = T) +
guides(fill=FALSE) +
ylab('Number of RLP genes') +
xlab('Accession group')
rlk_joined_groups %>%
filter( !is.na(`PI-ID`) ) %>%
ggplot(aes(x=`Group in violin table`, y = presences,
fill = `Group in violin table`)) +
geom_boxplot() +
scale_fill_manual(values = col_list) +
theme_minimal_hgrid() +
theme(axis.text.x = element_text(size=12),
axis.text.y = element_text(size=12)) +
geom_signif(comparisons = list(c('Wild-type', 'Landrace'),
c('Old cultivar', 'Modern cultivar')),
map_signif_level = T) +
guides(fill=FALSE) +
ylab('Number of RLK genes') +
xlab('Accession group')
sessionInfo()
R version 3.6.3 (2020-02-29)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 17134)
Matrix products: default
locale:
[1] LC_COLLATE=English_Australia.1252 LC_CTYPE=English_Australia.1252
[3] LC_MONETARY=English_Australia.1252 LC_NUMERIC=C
[5] LC_TIME=English_Australia.1252
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] ggforce_0.3.1 ggsignif_0.6.0 cowplot_1.0.0
[4] dabestr_0.3.0 magrittr_1.5 ggsci_2.9
[7] patchwork_1.0.0 forcats_0.5.0 stringr_1.4.0
[10] dplyr_1.0.0 purrr_0.3.4 readr_1.3.1
[13] tidyr_1.1.0 tibble_3.0.2 ggplot2_3.3.2
[16] tidyverse_1.3.0 workflowr_1.6.2.9000
loaded via a namespace (and not attached):
[1] nlme_3.1-148 fs_1.5.0.9000 lubridate_1.7.9 RColorBrewer_1.1-2
[5] httr_1.4.2 rprojroot_1.3-2 tools_3.6.3 backports_1.1.10
[9] utf8_1.1.4 R6_2.4.1 vipor_0.4.5 DBI_1.1.0
[13] mgcv_1.8-31 colorspace_1.4-1 withr_2.2.0 tidyselect_1.1.0
[17] processx_3.4.4 compiler_3.6.3 git2r_0.27.1 cli_2.0.2
[21] rvest_0.3.5 xml2_1.3.2 labeling_0.3 scales_1.1.1
[25] callr_3.4.4 digest_0.6.25 rmarkdown_2.3 pkgconfig_2.0.3
[29] htmltools_0.5.0 dbplyr_1.4.4 highr_0.8 rlang_0.4.7
[33] readxl_1.3.1 rstudioapi_0.11 farver_2.0.3 generics_0.0.2
[37] jsonlite_1.7.1 Matrix_1.2-18 Rcpp_1.0.5 ggbeeswarm_0.6.0
[41] munsell_0.5.0 fansi_0.4.1 lifecycle_0.2.0 stringi_1.5.3
[45] whisker_0.4 yaml_2.2.1 MASS_7.3-51.6 plyr_1.8.6
[49] grid_3.6.3 blob_1.2.1 promises_1.1.1 crayon_1.3.4
[53] lattice_0.20-41 haven_2.3.1 splines_3.6.3 hms_0.5.3
[57] knitr_1.29 ps_1.3.4 pillar_1.4.4 boot_1.3-25
[61] reprex_0.3.0 glue_1.4.2 evaluate_0.14 getPass_0.2-2
[65] modelr_0.1.8 vctrs_0.3.1 tweenr_1.0.1 httpuv_1.5.4
[69] cellranger_1.1.0 gtable_0.3.0 polyclip_1.10-0 assertthat_0.2.1
[73] xfun_0.17 broom_0.5.6 later_1.1.0.1 beeswarm_0.2.3
[77] ellipsis_0.3.1