Last updated: 2018-08-24
workflowr checks: (Click a bullet for more information) ✔ R Markdown file: up-to-date
Great! Since the R Markdown file has been committed to the Git repository, you know the exact version of the code that produced these results.
✔ Environment: empty
Great job! The global environment was empty. Objects defined in the global environment can affect the analysis in your R Markdown file in unknown ways. For reproduciblity it’s best to always run the code in an empty environment.
✔ Seed:
set.seed(20180807)
The command set.seed(20180807)
was run prior to running the code in the R Markdown file. Setting a seed ensures that any results that rely on randomness, e.g. subsampling or permutations, are reproducible.
✔ Session information: recorded
Great job! Recording the operating system, R version, and package versions is critical for reproducibility.
✔ Repository version: b1cf8d0
wflow_publish
or wflow_git_commit
). workflowr only checks the R Markdown file, but you know if there are other scripts or data files that it depends on. Below is the status of the Git repository when the results were generated:
Ignored files:
Ignored: .DS_Store
Ignored: .Rhistory
Ignored: .Rproj.user/
Ignored: data/raw/
Ignored: src/.DS_Store
Ignored: src/.ipynb_checkpoints/
Ignored: src/Rmd/.Rhistory
Untracked files:
Untracked: Snakefile_clonality
Untracked: Snakefile_somatic_calling
Untracked: code/analysis_for_garx.Rmd
Untracked: code/yuanhua/
Untracked: data/canopy/
Untracked: data/cell_assignment/
Untracked: data/de_analysis_FTv62/
Untracked: data/donor_info_070818.txt
Untracked: data/donor_info_core.csv
Untracked: data/donor_neutrality.tsv
Untracked: data/fdr10.annot.txt.gz
Untracked: data/high-vs-low-exomes.v62.ft.alldonors-filt_lenient.all_filt_sites.vep_most_severe_csq.txt
Untracked: data/high-vs-low-exomes.v62.ft.filt_lenient-alldonors.txt.gz
Untracked: data/human_H_v5p2.rdata
Untracked: data/human_c2_v5p2.rdata
Untracked: data/human_c6_v5p2.rdata
Untracked: data/neg-bin-rsquared-petr.csv
Untracked: data/neutralitytestr-petr.tsv
Untracked: data/sce_merged_donors_cardelino_donorid_all_qc_filt.rds
Untracked: data/sce_merged_donors_cardelino_donorid_all_with_qc_labels.rds
Untracked: data/sce_merged_donors_cardelino_donorid_unstim_qc_filt.rds
Untracked: data/sces/
Untracked: data/simulations/
Untracked: data/variance_components/
Untracked: docs/figure/overview_lines.Rmd/
Untracked: figures/
Untracked: metadata/
Untracked: output/differential_expression/
Untracked: output/donor_specific/
Untracked: output/line_info.tsv
Untracked: output/nvars_by_category_by_donor.tsv
Untracked: output/nvars_by_category_by_line.tsv
Untracked: output/variance_components/
Untracked: references/
Note that any generated files, e.g. HTML, png, CSS, etc., are not included in this status report because it is ok for generated content to have uncommitted changes.
File | Version | Author | Date | Message |
---|---|---|---|---|
html | d2e8b31 | davismcc | 2018-08-19 | Build site. |
html | 1489d32 | davismcc | 2018-08-17 | Add html files |
Rmd | d5e785d | davismcc | 2018-08-17 | Updating to use “line” instead of “donor” |
Rmd | bcee02b | davismcc | 2018-08-17 | Updating donors to lines in text. |
Rmd | 0862ede | davismcc | 2018-08-10 | Fixing some typos |
Rmd | a27cace | davismcc | 2018-08-10 | Updating analysis |
Rmd | 0c8f70d | davismcc | 2018-08-10 | Tidying up mutated genes analysis |
Rmd | 1cbadbd | davismcc | 2018-08-10 | Updating analyses. |
Rmd | de1daa9 | davismcc | 2018-08-09 | Adding analysis for DE between mutated and unmutated clones |
In this analysis, we look at the direct impact of somatic variants on the genes in which they are located. We combine information from the clonal trees about which variants tag which clones and VEP functional annotations about the expected consequence of each variant. For each gene with a somatic variant (and sufficient gene expression), we test for differential expression between cells that have the mutation (based on clone assignment) and those that do not. We can then compare these “mutated gene” differential expression results for variants with different categories of functional annotation.
Clonal trees were inferred with Canopy using variant allele frequency information from a set of strictly filtered somatic variants called from whole-exome sequencing data. We filtered somatic variants to only include variants that had read coverage in at least one single-cell for the relevant line. Assignment of cells to clones was done with cardelino.
Load VEP consequence information.
vep_best <- read_tsv("data/high-vs-low-exomes.v62.ft.alldonors-filt_lenient.all_filt_sites.vep_most_severe_csq.txt")
colnames(vep_best)[1] <- "Uploaded_variation"
## deduplicate dataframe
vep_best <- vep_best[!duplicated(vep_best[["Uploaded_variation"]]),]
There are 8503 variants, aggregated across all lines, and we can check the number that fall into the different VEP annotation categories.
table(vep_best[["Consequence"]])
3_prime_UTR_variant 5_prime_UTR_variant
181 121
downstream_gene_variant intergenic_variant
13 13
intron_variant mature_miRNA_variant
1539 3
missense_variant non_coding_transcript_exon_variant
3648 428
regulatory_region_variant splice_acceptor_variant
1 41
splice_donor_variant splice_region_variant
24 291
start_lost stop_gained
9 227
stop_lost stop_retained_variant
2 5
synonymous_variant upstream_gene_variant
1923 34
Load exome sites, that is, the somatic SNVs identified with Petr Danecek’s somatic variant calling approach.
exome_sites <- read_tsv("data/high-vs-low-exomes.v62.ft.filt_lenient-alldonors.txt.gz",
col_types = "ciccdcciiiiccccccccddcdcll", comment = "#",
col_names = TRUE)
exome_sites <- dplyr::select(exome_sites,
-c(clinsing, SIFT_Pdel, PolyPhen_Pdel, gene_name,
bcftools_csq))
exome_sites <- dplyr::mutate(
exome_sites,
chrom = paste0("chr", gsub("chr", "", chrom)),
var_id = paste0(chrom, ":", pos, "_", ref, "_", alt))
## deduplicate sites list
exome_sites <- exome_sites[!duplicated(exome_sites[["var_id"]]),]
Add variant consequences annotations to exome sites.
vep_best[["var_id"]] <- paste0("chr", vep_best[["Uploaded_variation"]])
vep_best[, c("var_id", "Location", "Consequence")]
# A tibble: 8,503 x 3
var_id Location Consequence
<chr> <chr> <chr>
1 chr1:874501_C_T 1:874501-874501 missense_variant
2 chr1:915025_G_A 1:915025-915025 missense_variant
3 chr1:916622_C_T 1:916622-916622 intron_variant
4 chr1:985731_G_A 1:985731-985731 intron_variant
5 chr1:985732_G_A 1:985732-985732 intron_variant
6 chr1:1019508_G_A 1:1019508-1019508 missense_variant
7 chr1:1019509_G_A 1:1019509-1019509 synonymous_variant
8 chr1:1116160_G_A 1:1116160-1116160 synonymous_variant
9 chr1:1158664_G_A 1:1158664-1158664 synonymous_variant
10 chr1:1178463_T_A 1:1178463-1178463 missense_variant
# ... with 8,493 more rows
exome_sites <- inner_join(exome_sites,
vep_best[, c("var_id", "Location", "Consequence")],
by = "var_id")
# com_vars <- intersect(exome_sites$var_id, csq_annos$var_id)
# length(com_vars)
# mean(exome_sites$var_id %in% csq_annos$var_id)
# mean(exome_sites$pos %in% csq_annos$Pos)
Load SingleCellExperiment objects with single-cell gene expression data.
params <- list()
params$callset <- "filt_lenient.cell_coverage_sites"
fls <- list.files("data/sces")
fls <- fls[grepl(params$callset, fls)]
lines <- gsub(".*ce_([a-z]+)_.*", "\\1", fls)
sce_unst_list <- list()
for (don in lines) {
sce_unst_list[[don]] <- readRDS(file.path("data/sces",
paste0("sce_", don, "_with_clone_assignments.", params$callset, ".rds")))
cat(paste("reading", don, ": ", ncol(sce_unst_list[[don]]), "cells.\n"))
}
reading euts : 79 cells.
reading fawm : 53 cells.
reading feec : 75 cells.
reading fikt : 39 cells.
reading garx : 70 cells.
reading gesg : 105 cells.
reading heja : 50 cells.
reading hipn : 62 cells.
reading ieki : 58 cells.
reading joxm : 79 cells.
reading kuco : 48 cells.
reading laey : 55 cells.
reading lexy : 63 cells.
reading naju : 44 cells.
reading nusw : 60 cells.
reading oaaz : 38 cells.
reading oilg : 90 cells.
reading pipw : 107 cells.
reading puie : 41 cells.
reading qayj : 97 cells.
reading qolg : 36 cells.
reading qonc : 58 cells.
reading rozh : 91 cells.
reading sehl : 30 cells.
reading ualf : 89 cells.
reading vass : 37 cells.
reading vils : 37 cells.
reading vuna : 71 cells.
reading wahn : 82 cells.
reading wetu : 77 cells.
reading xugn : 35 cells.
reading zoxy : 88 cells.
assignments_lst <- list()
for (don in lines) {
assignments_lst[[don]] <- as_data_frame(
colData(sce_unst_list[[don]])[,
c("donor_short_id", "highest_prob",
"assigned", "total_features",
"total_counts_endogenous", "num_processed")])
}
assignments <- do.call("rbind", assignments_lst)
We have SCE objects (expression data and cell assignments) for 32 lines, with 85% of cells confidently assigned to a clone.
Finally, we load transcriptome-wide differential expression results computed using the quasi-likelihood F test method in edgeR.
de_res <- readRDS(paste0("data/de_analysis_FTv62/",
params$callset,
".de_results_unstimulated_cells.rds"))
Next, we need to annotate the variants with information about which clones they are expected to be present in (obtained from the Canopy tree inference).
cell_assign_list <- list()
for (don in lines) {
cell_assign_list[[don]] <- readRDS(file.path("data/cell_assignment",
paste0("cardelino_results.", don, ".", params$callset, ".rds")))
cat(paste("reading", don, "\n"))
}
reading euts
reading fawm
reading feec
reading fikt
reading garx
reading gesg
reading heja
reading hipn
reading ieki
reading joxm
reading kuco
reading laey
reading lexy
reading naju
reading nusw
reading oaaz
reading oilg
reading pipw
reading puie
reading qayj
reading qolg
reading qonc
reading rozh
reading sehl
reading ualf
reading vass
reading vils
reading vuna
reading wahn
reading wetu
reading xugn
reading zoxy
get_sites_by_line <- function(sites_df, sce_list, assign_list) {
if (!identical(sort(names(sce_list)), sort(names(assign_list))))
stop("lines do not match between sce_list and assign_list.")
sites_by_line <- list()
for (don in names(sce_list)) {
config <- as_data_frame(assign_list[[don]]$tree$Z)
config[["var_id"]] <- rownames(assign_list[[don]]$tree$Z)
sites_line <- inner_join(sites_df, config)
sites_line[["clone_presence"]] <- ""
for (cln in colnames(assign_list[[don]]$tree$Z)[-1]) {
sites_line[["clone_presence"]][
as.logical(sites_line[[cln]])] <- paste(
sites_line[["clone_presence"]][
as.logical(sites_line[[cln]])], cln, sep = "&")
}
sites_line[["clone_presence"]] <- gsub("^&", "",
sites_line[["clone_presence"]])
## drop config columns as these won't match up between lines
keep_cols <- grep("^clone[0-9]$", colnames(sites_line), invert = TRUE)
sites_by_line[[don]] <- sites_line[, keep_cols]
}
do.call("bind_rows", sites_by_line)
}
sites_by_line <- get_sites_by_line(exome_sites, sce_unst_list, cell_assign_list)
sites_by_line <- dplyr::mutate(sites_by_line,
Consequence = gsub("_variant", "", Consequence))
Get DE results comparing mutated clone to all unmutated clones, but we first have to organise the data further.
Turn dataframes into GRanges objects to enable fast and robust overlaps.
## run DE for mutated cells vs unmutated cells using existing DE results
## filter out any remaining ERCC genes
for (don in names(de_res[["sce_list_unst"]]))
de_res[["sce_list_unst"]][[don]] <- de_res[["sce_list_unst"]][[don]][
!rowData(de_res[["sce_list_unst"]][[don]])$is_feature_control,]
sce_de_list_gr <- list()
for (don in names(de_res[["sce_list_unst"]])) {
sce_de_list_gr[[don]] <- makeGRangesFromDataFrame(
rowData(de_res[["sce_list_unst"]][[don]]),
start.field = "start_position",
end.field = "end_position",
keep.extra.columns = TRUE)
seqlevelsStyle(sce_de_list_gr[[don]]) <- "Ensembl"
}
sites_by_line <- dplyr::mutate(sites_by_line, end_pos = pos)
sites_by_line_gr <- makeGRangesFromDataFrame(
sites_by_line,
start.field = "pos",
end.field = "end_pos",
keep.extra.columns = TRUE)
seqlevelsStyle(sites_by_line_gr) <- "Ensembl"
Run DE testing between mutated and un-mutated clones for all affected genes for each line.
mut_genes_df_allcells_list <- list()
for (don in names(de_res[["sce_list_unst"]])) {
cat("working on ", don, "\n")
sites_tmp <- sites_by_line_gr[sites_by_line_gr$donor_short_id == don]
ov_tmp <- findOverlaps(sce_de_list_gr[[don]], sites_tmp)
sce_tmp <- de_res[["sce_list_unst"]][[don]][queryHits(ov_tmp),]
sites_tmp <- sites_tmp[subjectHits(ov_tmp)]
sites_tmp$gene <- rownames(sce_tmp)
dge_tmp <- de_res[["dge_list"]][[don]]
dge_tmp <- dge_tmp[intersect(rownames(dge_tmp), sites_tmp$gene),]
base_design <- dge_tmp$design[, !grepl("assigned", colnames(dge_tmp$design))]
de_tbl_tmp <- data.frame(line = don,
gene = sites_tmp$gene,
hgnc_symbol = gsub(".*_", "", sites_tmp$gene),
ensembl_gene_id = gsub("_.*", "", sites_tmp$gene),
var_id = sites_tmp$var_id,
location = sites_tmp$Location,
consequence = sites_tmp$Consequence,
clone_presence = sites_tmp$clone_presence,
logFC = NA, logCPM = NA, F = NA, PValue = NA,
comment = "", stringsAsFactors = FALSE)
for (i in seq_len(length(sites_tmp))) {
clones_tmp <- strsplit(sites_tmp$clone_presence[i], split = "&")[[1]]
mutatedclone <- as.numeric(sce_tmp$assigned %in% clones_tmp)
dsgn_tmp <- cbind(base_design, data.frame(mutatedclone))
if (sites_tmp$gene[i] %in% rownames(dge_tmp) && is.fullrank(dsgn_tmp)) {
qlfit_tmp <- glmQLFit(dge_tmp[sites_tmp$gene[i],], dsgn_tmp)
de_tmp <- glmQLFTest(qlfit_tmp, coef = ncol(dsgn_tmp))
de_tbl_tmp$logFC[i] <- de_tmp$table$logFC
de_tbl_tmp$logCPM[i] <- de_tmp$table$logCPM
de_tbl_tmp$F[i] <- de_tmp$table$F
de_tbl_tmp$PValue[i] <- de_tmp$table$PValue
}
if (!(sites_tmp$gene[i] %in% rownames(dge_tmp)))
de_tbl_tmp$comment[i] <- "gene did not pass DE filters"
if (!is.fullrank(dsgn_tmp))
de_tbl_tmp$comment[i] <- "insufficient cells assigned to clone"
}
mut_genes_df_allcells_list[[don]] <- de_tbl_tmp
}
working on euts
working on fawm
working on feec
working on fikt
working on garx
working on gesg
working on heja
working on hipn
working on ieki
working on joxm
working on kuco
working on laey
working on lexy
working on naju
working on nusw
working on oaaz
working on oilg
working on pipw
working on puie
working on qayj
working on qolg
working on qonc
working on rozh
working on sehl
working on ualf
working on vass
working on vuna
working on wahn
working on wetu
working on xugn
working on zoxy
mut_genes_df_allcells <- do.call("bind_rows", mut_genes_df_allcells_list)
dim(mut_genes_df_allcells)
[1] 1140 13
head(mut_genes_df_allcells)
line gene hgnc_symbol ensembl_gene_id
1 euts ENSG00000006327_TNFRSF12A TNFRSF12A ENSG00000006327
2 euts ENSG00000028528_SNX1 SNX1 ENSG00000028528
3 euts ENSG00000060642_PIGV PIGV ENSG00000060642
4 euts ENSG00000083168_KAT6A KAT6A ENSG00000083168
5 euts ENSG00000083520_DIS3 DIS3 ENSG00000083520
6 euts ENSG00000090060_PAPOLA PAPOLA ENSG00000090060
var_id location consequence clone_presence
1 chr16:3071779_A_T 16:3071779-3071779 stop_gained clone2
2 chr15:64388302_G_A 15:64388302-64388302 synonymous clone3
3 chr1:27121236_G_A 1:27121236-27121236 synonymous clone3
4 chr8:41798596_G_A 8:41798596-41798596 missense clone2
5 chr13:73335940_G_C 13:73335940-73335940 synonymous clone2&clone3
6 chr14:97000841_C_T 14:97000841-97000841 missense clone3
logFC logCPM F PValue
1 0.08487392 9.225735 0.1704411 0.6809820
2 NA NA NA NA
3 NA NA NA NA
4 -0.75250667 3.108107 1.9374726 0.1683502
5 -0.79956030 4.352695 1.1304492 0.2913337
6 NA NA NA NA
comment
1
2 insufficient cells assigned to clone
3 insufficient cells assigned to clone
4
5
6 insufficient cells assigned to clone
With this analysis, 1011 genes/variants could be tested for DE between mutated and unmutated clones.
We can recompute false discovery rates with IHW, simplify the VEP annotation categories (assigning all nonsense categories to “nonsense” and all splicing categories to “splicing”) and inspect the results.
## add FDRs for genes tested here for DE
ihw_res_all <- ihw(PValue ~ logCPM, data = mut_genes_df_allcells, alpha = 0.2)
mut_genes_df_allcells$FDR <- adj_pvalues(ihw_res_all)
## add simplified consequence categories
mut_genes_df_allcells$consequence_simplified <-
mut_genes_df_allcells$consequence
mut_genes_df_allcells$consequence_simplified[
mut_genes_df_allcells$consequence_simplified %in%
c("stop_retained", "start_lost", "stop_lost", "stop_gained")] <- "nonsense"
mut_genes_df_allcells$consequence_simplified[
mut_genes_df_allcells$consequence_simplified %in%
c("splice_donor", "splice_acceptor", "splice_region")] <- "splicing"
table(mut_genes_df_allcells$consequence_simplified)
3_prime_UTR 5_prime_UTR
27 12
intron missense
20 650
non_coding_transcript_exon nonsense
20 39
splicing synonymous
21 351
summary(mut_genes_df_allcells$FDR)
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
0.0000 0.5465 0.8892 0.7571 0.9376 0.9972 129
dplyr::arrange(mut_genes_df_allcells, FDR) %>% dplyr::select(-location) %>%
DT::datatable(., options = list(pageLength = 20))
There are few variants with a significant difference in expression between clones with and without the mutation (FDR < 10%).
Plot these results.
dplyr::filter(mut_genes_df_allcells, !is.na(logFC)) %>%
ggplot(aes(x = logFC, y = -log10(PValue),
fill = FDR < 0.1)) +
geom_point(colour = "gray40", pch = 21) +
geom_rug(sides = "b", alpha = 0.6) +
geom_vline(xintercept = 0, linetype = 2) +
# geom_text_repel(show.legend = FALSE) +
scale_fill_manual(values = c("gray70", "firebrick"),
label = c("N.S.", "FDR < 10%"),
name = "DE result") +
facet_wrap(~consequence, ncol = 5) +
guides(alpha = FALSE) +
theme_classic(16) +
theme(strip.background = element_rect(fill = "gray90"),
legend.position = c(0.9, 0.1))
Version | Author | Date |
---|---|---|
d2e8b31 | davismcc | 2018-08-19 |
ggsave("figures/mutated_genes/alllines_mutgenes_volcano_by_vep_anno_allcells.png",
height = 7, width = 16)
ggsave("figures/mutated_genes/alllines_mutgenes_volcano_by_vep_anno_allcells.pdf",
height = 7, width = 16)
ggsave("figures/mutated_genes/alllines_mutgenes_volcano_by_vep_anno_allcells.svg",
height = 7, width = 16)
The picture can look a little clearer with simplified annotation categories.
dplyr::filter(mut_genes_df_allcells, !is.na(logFC)) %>%
ggplot(aes(x = logFC, y = -log10(PValue),
fill = FDR < 0.2)) +
geom_point(colour = "gray40", pch = 21) +
geom_rug(sides = "b", alpha = 0.6) +
geom_vline(xintercept = 0, linetype = 2) +
# geom_text_repel(show.legend = FALSE) +
scale_fill_manual(values = c("gray70", "firebrick"),
label = c("N.S.", "FDR < 20%"),
name = "DE result") +
facet_wrap(~consequence_simplified, ncol = 3) +
guides(alpha = FALSE) +
theme_classic(20) +
theme(strip.background = element_rect(fill = "gray90"),
strip.text = element_text(size = 13),
legend.position = c(0.8, 0.15))
Version | Author | Date |
---|---|---|
d2e8b31 | davismcc | 2018-08-19 |
ggsave("figures/mutated_genes/alllines_mutgenes_volcano_by_simple_vep_anno_allcells.png",
height = 7, width = 10.5)
ggsave("figures/mutated_genes/alllines_mutgenes_volcano_by_simple_vep_anno_allcells.pdf",
height = 7, width = 10.5)
ggsave("figures/mutated_genes/alllines_mutgenes_volcano_by_simple_vep_anno_allcells.svg",
height = 7, width = 10.5)
Let us check the median logFC values for each simplified VEP annotation category and add boxplots for logFC to the volcano plots.
tmp2 <- mut_genes_df_allcells %>%
group_by(consequence_simplified) %>%
summarise(med = median(logFC, na.rm = TRUE),
nvars = n())
dplyr::arrange(tmp2, med)
# A tibble: 8 x 3
consequence_simplified med nvars
<chr> <dbl> <int>
1 intron -0.236 20
2 3_prime_UTR -0.193 27
3 nonsense -0.154 39
4 non_coding_transcript_exon -0.106 20
5 synonymous -0.0459 351
6 missense 0.0228 650
7 splicing 0.0355 21
8 5_prime_UTR 0.624 12
plist <- list()
simple_cons_vec <- c("nonsense", "splicing", "missense",
"synonymous", "intron",
"non_coding_transcript_exon",
"5_prime_UTR", "3_prime_UTR")
for (cons in simple_cons_vec) {
p <- dplyr::filter(mut_genes_df_allcells, !is.na(logFC),
consequence_simplified == cons) %>%
ggplot(aes(x = logFC, y = -log10(PValue),
fill = FDR < 0.2)) +
geom_point(colour = "gray40", pch = 21, size = 2) +
geom_rug(sides = "b", alpha = 0.6) +
geom_vline(xintercept = 0, linetype = 2) +
scale_fill_manual(values = c("gray70", "firebrick"),
label = c("N.S.", "FDR < 20%"),
name = "DE result") +
guides(alpha = FALSE) +
theme_classic(14) +
ggtitle(cons) +
xlim(1.05 * range(mut_genes_df_allcells$logFC, na.rm = TRUE)) +
ylim(c(0, 1.05 * max(-log10(mut_genes_df_allcells$PValue), na.rm = TRUE))) +
theme(strip.background = element_rect(fill = "gray90"),
strip.text = element_text(size = 12),
legend.position = "none",
plot.title = element_text(face = "bold"))
plist[[cons]] <- ggExtra::ggMarginal(
p, type = "boxplot", margins = "x", fill = "wheat",
colour = "firebrick4", outlier.size = 0, outlier.alpha = 0)
}
cowplot::plot_grid(plotlist = plist, nrow = 2)
Version | Author | Date |
---|---|---|
d2e8b31 | davismcc | 2018-08-19 |
ggsave("figures/mutated_genes/alllines_mutgenes_volcano_by_simple_vep_anno_allcells_with_boxplot.png",
height = 7, width = 18)
ggsave("figures/mutated_genes/alllines_mutgenes_volcano_by_simple_vep_anno_allcells_with_boxplot.pdf",
height = 7, width = 18)
ggsave("figures/mutated_genes/alllines_mutgenes_volcano_by_simple_vep_anno_allcells_with_boxplot.svg",
height = 7, width = 18)
tmp2 <- mut_genes_df_allcells %>%
group_by(consequence) %>%
summarise(med = median(logFC, na.rm = TRUE),
nvars = n())
mut_genes_df_allcells %>%
dplyr::filter(!is.na(logFC)) %>%
dplyr::mutate(consequence = factor(
consequence, levels(as.factor(consequence))[order(tmp2[["med"]])]),
de = ifelse(FDR < 0.2, "FDR < 0.2", "FDR > 0.2")) %>%
ggplot(aes(y = logFC, x = consequence)) +
geom_hline(yintercept = 0, linetype = 1, colour = "black") +
ggbeeswarm::geom_quasirandom(colour = "gray40", fill = "gray70", pch = 21) +
geom_boxplot(outlier.size = 0, outlier.alpha = 0, fill = "wheat",
colour = "firebrick4", width = 0.2, size = 1) +
scale_x_discrete(expand = c(.01, 0)) +
scale_y_continuous(expand = c(0, 0), name = "logFC") +
theme_ridges(16) +
coord_flip() +
theme(strip.background = element_rect(fill = "gray90")) +
guides(fill = FALSE, color = FALSE, point_color = FALSE)
Version | Author | Date |
---|---|---|
d2e8b31 | davismcc | 2018-08-19 |
ggsave("figures/mutated_genes/alllines_mutgenes_logfc-box_by_vep_anno_allcells.png",
height = 9, width = 10)
ggsave("figures/mutated_genes/alllines_mutgenes_logfc-box_by_vep_anno_allcells.pdf",
height = 9, width = 10)
ggsave("figures/mutated_genes/alllines_mutgenes_logfc-box_by_vep_anno_allcells.svg",
height = 9, width = 10)
We can also just look specifically at the boxplots of the logFC values across categories, first looking at the median logFC values and results from a t-test testing if the mean logFC value is different from zero. Positive logFC values indicate higher expression in the mutated clone, and negative logFC values lower expression in the mutated clone.
tmp3 <- mut_genes_df_allcells %>%
group_by(consequence_simplified) %>%
summarise(med = median(logFC, na.rm = TRUE),
nvars = n(),
t_coef = t.test(logFC, mu = 0)$estimate,
t_stat = t.test(logFC, mu = 0)$statistic,
t_pval = t.test(logFC, mu = 0)$p.value)
tmp3
# A tibble: 8 x 6
consequence_simplified med nvars t_coef t_stat t_pval
<chr> <dbl> <int> <dbl> <dbl> <dbl>
1 3_prime_UTR -0.193 27 0.0147 0.0332 0.974
2 5_prime_UTR 0.624 12 0.720 2.36 0.0397
3 intron -0.236 20 -0.405 -1.87 0.0781
4 missense 0.0228 650 -0.0411 -0.775 0.439
5 non_coding_transcript_exon -0.106 20 -0.260 -1.79 0.0908
6 nonsense -0.154 39 -0.287 -1.99 0.0558
7 splicing 0.0355 21 -0.231 -0.945 0.359
8 synonymous -0.0459 351 -0.112 -1.58 0.116
mut_genes_df_allcells %>%
dplyr::filter(!is.na(logFC)) %>%
dplyr::mutate(consequence_simplified = factor(
consequence_simplified,
levels(as.factor(consequence_simplified))[order(tmp3[["med"]])]),
de = ifelse(FDR < 0.2, "FDR < 0.2", "FDR > 0.2")) %>%
ggplot(aes(y = logFC, x = consequence_simplified)) +
geom_hline(yintercept = 0, linetype = 1, colour = "black") +
ggbeeswarm::geom_quasirandom(colour = "gray40", fill = "gray70", pch = 21) +
geom_boxplot(outlier.size = 0, outlier.alpha = 0, fill = "wheat",
colour = "firebrick4", width = 0.2, size = 1) +
scale_x_discrete(expand = c(.01, 0), name = "consequence") +
scale_y_continuous(expand = c(0, 0), name = "logFC") +
theme_ridges(20) +
coord_flip() +
theme(strip.background = element_rect(fill = "gray90")) +
guides(fill = FALSE, color = FALSE, point_color = FALSE)
Version | Author | Date |
---|---|---|
d2e8b31 | davismcc | 2018-08-19 |
ggsave("figures/mutated_genes/alllines_mutgenes_logfc-box_by_simple_vep_anno_allcells.png",
height = 7, width = 10)
ggsave("figures/mutated_genes/alllines_mutgenes_logfc-box_by_simple_vep_anno_allcells.pdf",
height = 7, width = 10)
ggsave("figures/mutated_genes/alllines_mutgenes_logfc-box_by_simple_vep_anno_allcells.svg",
height = 7, width = 10)
In the analysis above, the minimum allowed clone size (i.e. group for DE analysis) was 3 cells. Such potentially small group sizes could lead to noisy estimates of the logFC values that we looked at above. Let us repeat the above analysis but increasing the minimum number of cells for a group to 10 to try to obtain more accurate (on average) logFC estimates from the DE model.
Run DE testing between mutated and un-mutated clones for a more strictly filtered set of affected genes for each line.
mut_genes_df_allcells_filt_list <- list()
for (don in names(de_res[["sce_list_unst"]])) {
cat("working on ", don, "\n")
sites_tmp <- sites_by_line_gr[sites_by_line_gr$donor_short_id == don]
ov_tmp <- findOverlaps(sce_de_list_gr[[don]], sites_tmp)
sce_tmp <- de_res[["sce_list_unst"]][[don]][queryHits(ov_tmp),]
sites_tmp <- sites_tmp[subjectHits(ov_tmp)]
sites_tmp$gene <- rownames(sce_tmp)
dge_tmp <- de_res[["dge_list"]][[don]]
dge_tmp <- dge_tmp[intersect(rownames(dge_tmp), sites_tmp$gene),]
base_design <- dge_tmp$design[, !grepl("assigned", colnames(dge_tmp$design))]
de_tbl_tmp <- data.frame(line = don,
gene = sites_tmp$gene,
hgnc_symbol = gsub(".*_", "", sites_tmp$gene),
ensembl_gene_id = gsub("_.*", "", sites_tmp$gene),
var_id = sites_tmp$var_id,
location = sites_tmp$Location,
consequence = sites_tmp$Consequence,
clone_presence = sites_tmp$clone_presence,
logFC = NA, logCPM = NA, F = NA, PValue = NA,
comment = "", stringsAsFactors = FALSE)
for (i in seq_len(length(sites_tmp))) {
clones_tmp <- strsplit(sites_tmp$clone_presence[i], split = "&")[[1]]
mutatedclone <- as.numeric(sce_tmp$assigned %in% clones_tmp)
dsgn_tmp <- cbind(base_design, data.frame(mutatedclone))
if (sites_tmp$gene[i] %in% rownames(dge_tmp) && is.fullrank(dsgn_tmp)
&& (sum(mutatedclone) > 9.5) && (sum(!mutatedclone) > 9.5)) {
qlfit_tmp <- glmQLFit(dge_tmp[sites_tmp$gene[i],], dsgn_tmp)
de_tmp <- glmQLFTest(qlfit_tmp, coef = ncol(dsgn_tmp))
de_tbl_tmp$logFC[i] <- de_tmp$table$logFC
de_tbl_tmp$logCPM[i] <- de_tmp$table$logCPM
de_tbl_tmp$F[i] <- de_tmp$table$F
de_tbl_tmp$PValue[i] <- de_tmp$table$PValue
}
if ((sum(mutatedclone) < 9.5) || (sum(!mutatedclone) < 9.5))
de_tbl_tmp$comment[i] <- "minimum group size < 10 cells"
if (!(sites_tmp$gene[i] %in% rownames(dge_tmp)))
de_tbl_tmp$comment[i] <- "gene did not pass DE filters"
if (!is.fullrank(dsgn_tmp))
de_tbl_tmp$comment[i] <- "insufficient cells assigned to clone"
}
mut_genes_df_allcells_filt_list[[don]] <- de_tbl_tmp
}
working on euts
working on fawm
working on feec
working on fikt
working on garx
working on gesg
working on heja
working on hipn
working on ieki
working on joxm
working on kuco
working on laey
working on lexy
working on naju
working on nusw
working on oaaz
working on oilg
working on pipw
working on puie
working on qayj
working on qolg
working on qonc
working on rozh
working on sehl
working on ualf
working on vass
working on vuna
working on wahn
working on wetu
working on xugn
working on zoxy
mut_genes_df_allcells_filt <- do.call("bind_rows", mut_genes_df_allcells_filt_list)
With this analysis, 766 could be tested for DE between mutated and unmutated clones (245 fewer than with the lenient filtering above).
We can recompute false discovery rates with IHW, simplify the VEP annotation categories (assigning all nonsense categories to “nonsense” and all splicing categories to “splicing”) and inspect the results.
## add FDRs for genes tested here for DE
ihw_res_all <- ihw(PValue ~ logCPM, data = mut_genes_df_allcells_filt, alpha = 0.2)
mut_genes_df_allcells_filt$FDR <- adj_pvalues(ihw_res_all)
## add simplified consequence categories
mut_genes_df_allcells_filt$consequence_simplified <-
mut_genes_df_allcells_filt$consequence
mut_genes_df_allcells_filt$consequence_simplified[
mut_genes_df_allcells_filt$consequence_simplified %in%
c("stop_retained", "start_lost", "stop_lost", "stop_gained")] <- "nonsense"
mut_genes_df_allcells_filt$consequence_simplified[
mut_genes_df_allcells_filt$consequence_simplified %in%
c("splice_donor", "splice_acceptor", "splice_region")] <- "splicing"
table(mut_genes_df_allcells_filt$consequence_simplified)
3_prime_UTR 5_prime_UTR
27 12
intron missense
20 650
non_coding_transcript_exon nonsense
20 39
splicing synonymous
21 351
dplyr::arrange(mut_genes_df_allcells_filt, FDR) %>%
dplyr::select(-location, line, hgnc_symbol, var_id, consequence, FDR, PValue,
everything()) %>%
DT::datatable(., options = list(pageLength = 20))
Again, we’ll look specifically at the boxplots of the logFC values across categories. Positive logFC values indicate higher expression in the mutated clone, and negative logFC values lower expression in the mutated clone.
tmp5 <- mut_genes_df_allcells_filt %>%
group_by(consequence_simplified) %>%
summarise(med = median(logFC, na.rm = TRUE),
nvars = n()
)
dplyr::arrange(tmp5, med)
# A tibble: 8 x 3
consequence_simplified med nvars
<chr> <dbl> <int>
1 3_prime_UTR -0.136 27
2 nonsense -0.132 39
3 non_coding_transcript_exon -0.0931 20
4 intron -0.0821 20
5 synonymous -0.0312 351
6 missense 0.00622 650
7 splicing 0.0355 21
8 5_prime_UTR 0.624 12
mut_genes_df_allcells_filt %>%
dplyr::filter(!is.na(logFC)) %>%
dplyr::mutate(consequence_simplified = factor(
consequence_simplified,
levels(as.factor(consequence_simplified))[order(tmp5[["med"]])]),
de = ifelse(FDR < 0.2, "FDR < 0.2", "FDR > 0.2")) %>%
ggplot(aes(y = logFC, x = consequence_simplified, fill = consequence)) +
geom_hline(yintercept = 0, linetype = 1, colour = "black") +
ggbeeswarm::geom_quasirandom(colour = "gray40", pch = 21) +
geom_boxplot(outlier.size = 0, outlier.alpha = 0, fill = "wheat", alpha = 0.5,
colour = "firebrick4", width = 0.2, size = 1) +
scale_x_discrete(expand = c(0.1, 0.1), name = "consequence") +
scale_y_continuous(expand = c(0.1, 0.1), name = "logFC") +
theme_ridges(20) +
coord_flip() +
theme(strip.background = element_rect(fill = "gray90")) +
guides(color = FALSE, point_color = FALSE) +
scale_fill_brewer(palette = "Paired") +
theme(legend.position = "bottom")
Version | Author | Date |
---|---|---|
d2e8b31 | davismcc | 2018-08-19 |
ggsave("figures/mutated_genes/alllines_mutgenes_logfc-box_by_simple_vep_anno_allcells.png",
height = 9, width = 15)
ggsave("figures/mutated_genes/alllines_mutgenes_logfc-box_by_simple_vep_anno_allcells.pdf",
height = 9, width = 15)
ggsave("figures/mutated_genes/alllines_mutgenes_logfc-box_by_simple_vep_anno_allcells.svg",
height = 9, width = 15)
On the evidence of the plot above it does not look like stricter filtering of variants fundmentally changes the results. It still does not look possible to discern differences in the logFC distributions of mutated genes between different annotation categories for variants.
We can also get DE results comparing mutated clone to all unmutated clones, fitting the first couple of PCs in the model to see if this accounts for unwanted variation and increases power to detect differences in expression between mutated and unwanted clones for mutated genes.
## run DE for mutated cells vs unmutated cells using existing DE results
## fitting PCs as covariates in DE model
mut_genes_df_allcells_list_PCreg <- list()
for (don in names(de_res[["sce_list_unst"]])) {
cat("working on ", don, "\n")
sites_tmp <- sites_by_line_gr[sites_by_line_gr$donor_short_id == don]
ov_tmp <- findOverlaps(sce_de_list_gr[[don]], sites_tmp)
sce_tmp <- de_res[["sce_list_unst"]][[don]][queryHits(ov_tmp),]
sites_tmp <- sites_tmp[subjectHits(ov_tmp)]
sites_tmp$gene <- rownames(sce_tmp)
dge_tmp <- de_res[["dge_list"]][[don]]
dge_tmp <- dge_tmp[intersect(rownames(dge_tmp), sites_tmp$gene),]
base_design <- dge_tmp$design[, !grepl("assigned", colnames(dge_tmp$design))]
de_tbl_tmp <- data.frame(line = don,
gene = sites_tmp$gene,
hgnc_symbol = gsub(".*_", "", sites_tmp$gene),
ensembl_gene_id = gsub("_.*", "", sites_tmp$gene),
var_id = sites_tmp$var_id,
location = sites_tmp$Location,
consequence = sites_tmp$Consequence,
clone_presence = sites_tmp$clone_presence,
logFC = NA, logCPM = NA, F = NA, PValue = NA,
comment = "", stringsAsFactors = FALSE)
pca_tmp <- reducedDim(runPCA(
de_res[["sce_list_unst"]][[don]], ntop = 500, ncomponents = 5), "PCA")
for (i in seq_len(length(sites_tmp))) {
clones_tmp <- strsplit(sites_tmp$clone_presence[i], split = "&")[[1]]
mutatedclone <- as.numeric(sce_tmp$assigned %in% clones_tmp)
dsgn_tmp <- cbind(base_design, pca_tmp[, 1], data.frame(mutatedclone))
if (sites_tmp$gene[i] %in% rownames(dge_tmp) && is.fullrank(dsgn_tmp)) {
qlfit_tmp <- glmQLFit(dge_tmp[sites_tmp$gene[i],], dsgn_tmp)
de_tmp <- glmQLFTest(qlfit_tmp, coef = ncol(dsgn_tmp))
de_tbl_tmp$logFC[i] <- de_tmp$table$logFC
de_tbl_tmp$logCPM[i] <- de_tmp$table$logCPM
de_tbl_tmp$F[i] <- de_tmp$table$F
de_tbl_tmp$PValue[i] <- de_tmp$table$PValue
}
if (!(sites_tmp$gene[i] %in% rownames(dge_tmp)))
de_tbl_tmp$comment[i] <- "gene did not pass DE filters"
if (!is.fullrank(dsgn_tmp))
de_tbl_tmp$comment[i] <- "insufficient cells assigned to clone"
}
mut_genes_df_allcells_list_PCreg[[don]] <- de_tbl_tmp
}
working on euts
working on fawm
working on feec
working on fikt
working on garx
working on gesg
working on heja
working on hipn
working on ieki
working on joxm
working on kuco
working on laey
working on lexy
working on naju
working on nusw
working on oaaz
working on oilg
working on pipw
working on puie
working on qayj
working on qolg
working on qonc
working on rozh
working on sehl
working on ualf
working on vass
working on vuna
working on wahn
working on wetu
working on xugn
working on zoxy
mut_genes_df_allcells_PCreg <- do.call("bind_rows", mut_genes_df_allcells_list_PCreg)
## add FDRs for genes tested here for DE
ihw_res_all <- ihw(PValue ~ logCPM, data = mut_genes_df_allcells_PCreg, alpha = 0.2)
mut_genes_df_allcells_PCreg$FDR <- adj_pvalues(ihw_res_all)
## add simplified consequence categories
mut_genes_df_allcells_PCreg$consequence_simplified <-
mut_genes_df_allcells_PCreg$consequence
mut_genes_df_allcells_PCreg$consequence_simplified[
mut_genes_df_allcells_PCreg$consequence_simplified %in%
c("stop_retained", "start_lost", "stop_lost", "stop_gained")] <- "nonsense"
mut_genes_df_allcells_PCreg$consequence_simplified[
mut_genes_df_allcells_PCreg$consequence_simplified %in%
c("splice_donor", "splice_acceptor", "splice_region")] <- "splicing"
table(mut_genes_df_allcells_PCreg$consequence_simplified)
3_prime_UTR 5_prime_UTR
27 12
intron missense
20 650
non_coding_transcript_exon nonsense
20 39
splicing synonymous
21 351
dplyr::arrange(mut_genes_df_allcells_PCreg, FDR) %>%
dplyr::select(-location) %>%
DT::datatable(., options = list(pageLength = 20))
data_frame(PValue_default = mut_genes_df_allcells[["PValue"]],
PValue_PCreg = mut_genes_df_allcells_PCreg[["PValue"]],
sig_default = mut_genes_df_allcells[["FDR"]] < 0.1,
sig_PCreg = mut_genes_df_allcells_PCreg[["FDR"]] < 0.1) %>%
ggplot(aes(x = -log10(PValue_default), y = -log10(PValue_PCreg),
shape = sig_default, colour = sig_PCreg)) +
geom_point() +
geom_smooth(aes(group = 1), colour = "firebrick") +
geom_abline(slope = 1, intercept = 0) +
scale_color_manual(values = c("gray60", "black"))
Version | Author | Date |
---|---|---|
d2e8b31 | davismcc | 2018-08-19 |
There is thus very little change in DE results when regressing out 5 PCs in these DE models.
We can also check to see if accounting for cell cycle scores in the DE model increases power to detect differential expression between mutated and unmutated clones for mutated genes.
## run DE for mutated cells vs unmutated cells using existing DE results
## fitting PCs as covariates in DE model
mut_genes_df_allcells_list_CCreg <- list()
for (don in names(de_res[["sce_list_unst"]])) {
cat("working on ", don, "\n")
sites_tmp <- sites_by_line_gr[sites_by_line_gr$donor_short_id == don]
ov_tmp <- findOverlaps(sce_de_list_gr[[don]], sites_tmp)
sce_tmp <- de_res[["sce_list_unst"]][[don]][queryHits(ov_tmp),]
sites_tmp <- sites_tmp[subjectHits(ov_tmp)]
sites_tmp$gene <- rownames(sce_tmp)
dge_tmp <- de_res[["dge_list"]][[don]]
dge_tmp <- dge_tmp[intersect(rownames(dge_tmp), sites_tmp$gene),]
base_design <- dge_tmp$design[, !grepl("assigned", colnames(dge_tmp$design))]
de_tbl_tmp <- data.frame(line = don,
gene = sites_tmp$gene,
hgnc_symbol = gsub(".*_", "", sites_tmp$gene),
ensembl_gene_id = gsub("_.*", "", sites_tmp$gene),
var_id = sites_tmp$var_id,
location = sites_tmp$Location,
consequence = sites_tmp$Consequence,
clone_presence = sites_tmp$clone_presence,
logFC = NA, logCPM = NA, F = NA, PValue = NA,
comment = "", stringsAsFactors = FALSE)
G1 <- de_res[["sce_list_unst"]][[don]]$G1
S <- de_res[["sce_list_unst"]][[don]]$S
G2M <- de_res[["sce_list_unst"]][[don]]$G2M
for (i in seq_len(length(sites_tmp))) {
clones_tmp <- strsplit(sites_tmp$clone_presence[i], split = "&")[[1]]
mutatedclone <- as.numeric(sce_tmp$assigned %in% clones_tmp)
dsgn_tmp <- cbind(base_design, G1, S, G2M, data.frame(mutatedclone))
if (sites_tmp$gene[i] %in% rownames(dge_tmp) && is.fullrank(dsgn_tmp)) {
qlfit_tmp <- glmQLFit(dge_tmp[sites_tmp$gene[i],], dsgn_tmp)
de_tmp <- glmQLFTest(qlfit_tmp, coef = ncol(dsgn_tmp))
de_tbl_tmp$logFC[i] <- de_tmp$table$logFC
de_tbl_tmp$logCPM[i] <- de_tmp$table$logCPM
de_tbl_tmp$F[i] <- de_tmp$table$F
de_tbl_tmp$PValue[i] <- de_tmp$table$PValue
}
if (!(sites_tmp$gene[i] %in% rownames(dge_tmp)))
de_tbl_tmp$comment[i] <- "gene did not pass DE filters"
if (!is.fullrank(dsgn_tmp))
de_tbl_tmp$comment[i] <- "insufficient cells assigned to clone"
}
mut_genes_df_allcells_list_CCreg[[don]] <- de_tbl_tmp
}
working on euts
working on fawm
working on feec
working on fikt
working on garx
working on gesg
working on heja
working on hipn
working on ieki
working on joxm
working on kuco
working on laey
working on lexy
working on naju
working on nusw
working on oaaz
working on oilg
working on pipw
working on puie
working on qayj
working on qolg
working on qonc
working on rozh
working on sehl
working on ualf
working on vass
working on vuna
working on wahn
working on wetu
working on xugn
working on zoxy
mut_genes_df_allcells_CCreg <- do.call("bind_rows", mut_genes_df_allcells_list_CCreg)
## add FDRs for genes tested here for DE
ihw_res_all <- ihw(PValue ~ logCPM, data = mut_genes_df_allcells_CCreg, alpha = 0.2)
mut_genes_df_allcells_CCreg$FDR <- adj_pvalues(ihw_res_all)
## add simplified consequence categories
mut_genes_df_allcells_CCreg$consequence_simplified <-
mut_genes_df_allcells_CCreg$consequence
mut_genes_df_allcells_CCreg$consequence_simplified[
mut_genes_df_allcells_CCreg$consequence_simplified %in%
c("stop_retained", "start_lost", "stop_lost", "stop_gained")] <- "nonsense"
mut_genes_df_allcells_CCreg$consequence_simplified[
mut_genes_df_allcells_CCreg$consequence_simplified %in%
c("splice_donor", "splice_acceptor", "splice_region")] <- "splicing"
dplyr::arrange(mut_genes_df_allcells_CCreg, FDR) %>%
dplyr::select(-location) %>%
DT::datatable(., options = list(pageLength = 20))