Last updated: 2019-10-30
Checks: 7 0
Knit directory: fibroblast-clonality/
This reproducible R Markdown analysis was created with workflowr (version 1.4.0). The Checks tab describes the reproducibility checks that were applied when the results were created. The Past versions tab lists the development history.
Great! Since the R Markdown file has been committed to the Git repository, you know the exact version of the code that produced these results.
Great job! The global environment was empty. Objects defined in the global environment can affect the analysis in your R Markdown file in unknown ways. For reproduciblity it’s best to always run the code in an empty environment.
The command set.seed(20180807)
was run prior to running the code in the R Markdown file. Setting a seed ensures that any results that rely on randomness, e.g. subsampling or permutations, are reproducible.
Great job! Recording the operating system, R version, and package versions is critical for reproducibility.
Nice! There were no cached chunks for this analysis, so you can be confident that you successfully produced the results during this run.
Great job! Using relative paths to the files within your workflowr project makes it easier to run your code on other machines.
Great! You are using Git for version control. Tracking code development and connecting the code version to the results is critical for reproducibility. The version displayed above was the version of the Git repository at the time these results were generated.
Note that you need to be careful to ensure that all relevant files for the analysis have been committed to Git prior to generating the results (you can use wflow_publish
or wflow_git_commit
). workflowr only checks the R Markdown file, but you know if there are other scripts or data files that it depends on. Below is the status of the Git repository when the results were generated:
Ignored files:
Ignored: .DS_Store
Ignored: .Rhistory
Ignored: .Rproj.user/
Ignored: .vscode/
Ignored: code/.DS_Store
Ignored: code/selection/.DS_Store
Ignored: code/selection/.Rhistory
Ignored: code/selection/figures/
Ignored: data/.DS_Store
Ignored: logs/
Ignored: src/.DS_Store
Ignored: src/Rmd/.Rhistory
Untracked files:
Untracked: .dockerignore
Untracked: .dropbox
Untracked: .snakemake/
Untracked: Rplots.pdf
Untracked: Snakefile_clonality
Untracked: Snakefile_somatic_calling
Untracked: analysis/.ipynb_checkpoints/
Untracked: analysis/assess_mutect2_fibro-ipsc_variant_calls.ipynb
Untracked: analysis/cardelino_fig1b.R
Untracked: analysis/cardelino_fig2b.R
Untracked: code/analysis_for_garx.Rmd
Untracked: code/selection/data/
Untracked: code/selection/fit-dist.nb
Untracked: code/selection/result-figure.R
Untracked: code/yuanhua/
Untracked: data/Melanoma-RegevGarraway-DFCI-scRNA-Seq/
Untracked: data/PRJNA485423/
Untracked: data/canopy/
Untracked: data/cell_assignment/
Untracked: data/cnv/
Untracked: data/de_analysis_FTv62/
Untracked: data/donor_info_070818.txt
Untracked: data/donor_info_core.csv
Untracked: data/donor_neutrality.tsv
Untracked: data/exome-point-mutations/
Untracked: data/fdr10.annot.txt.gz
Untracked: data/human_H_v5p2.rdata
Untracked: data/human_c2_v5p2.rdata
Untracked: data/human_c6_v5p2.rdata
Untracked: data/neg-bin-rsquared-petr.csv
Untracked: data/neutralitytestr-petr.tsv
Untracked: data/raw/
Untracked: data/sce_merged_donors_cardelino_donorid_all_qc_filt.rds
Untracked: data/sce_merged_donors_cardelino_donorid_all_with_qc_labels.rds
Untracked: data/sce_merged_donors_cardelino_donorid_unstim_qc_filt.rds
Untracked: data/sces/
Untracked: data/selection/
Untracked: data/simulations/
Untracked: data/variance_components/
Untracked: docs/figure/analysis_for_joxm_cardelino-relax.Rmd/
Untracked: docs/figure/clone_prevalences_cardelino-relax.Rmd/
Untracked: docs/figure/differential_expression_cardelino-relax.Rmd/
Untracked: figures/
Untracked: output/differential_expression/
Untracked: output/differential_expression_cardelino-relax/
Untracked: output/donor_specific/
Untracked: output/line_info.tsv
Untracked: output/nvars_by_category_by_donor.tsv
Untracked: output/nvars_by_category_by_line.tsv
Untracked: output/variance_components/
Untracked: qolg_BIC.pdf
Untracked: references/
Untracked: reports/
Untracked: src/Rmd/DE_pathways_FTv62_callset_clones_pairwise_vs_base.unst_cells.carderelax.Rmd
Untracked: src/Rmd/Rplots.pdf
Untracked: src/Rmd/cell_assignment_cardelino-relax_template.Rmd
Untracked: tree.txt
Note that any generated files, e.g. HTML, png, CSS, etc., are not included in this status report because it is ok for generated content to have uncommitted changes.
These are the previous versions of the R Markdown and HTML files. If you’ve configured a remote Git repository (see ?wflow_git_remote
), click on the hyperlinks in the table below to view them.
File | Version | Author | Date | Message |
---|---|---|---|---|
Rmd | f3a24d1 | Davis McCarthy | 2019-10-30 | Fixing some nameclash bugs |
Rmd | 550176f | Davis McCarthy | 2019-10-30 | Updating analysis to reflect accepted ms |
html | 8729e02 | davismcc | 2018-11-09 | Build site. |
html | 0540cdb | davismcc | 2018-09-02 | Build site. |
html | f0ed980 | davismcc | 2018-08-31 | Build site. |
html | ca3438f | davismcc | 2018-08-29 | Build site. |
html | e573f2f | davismcc | 2018-08-27 | Build site. |
html | 9ec2a59 | davismcc | 2018-08-26 | Build site. |
html | 36acf15 | davismcc | 2018-08-25 | Build site. |
Rmd | d618fe5 | davismcc | 2018-08-25 | Updating analyses |
html | 090c1b9 | davismcc | 2018-08-24 | Build site. |
html | d2e8b31 | davismcc | 2018-08-19 | Build site. |
html | 1489d32 | davismcc | 2018-08-17 | Add html files |
Rmd | d5e785d | davismcc | 2018-08-17 | Updating to use “line” instead of “donor” |
Rmd | bcee02b | davismcc | 2018-08-17 | Updating donors to lines in text. |
Rmd | 0862ede | davismcc | 2018-08-10 | Fixing some typos |
Rmd | a27cace | davismcc | 2018-08-10 | Updating analysis |
Rmd | 0c8f70d | davismcc | 2018-08-10 | Tidying up mutated genes analysis |
Rmd | 1cbadbd | davismcc | 2018-08-10 | Updating analyses. |
Rmd | de1daa9 | davismcc | 2018-08-09 | Adding analysis for DE between mutated and unmutated clones |
In this analysis, we look at the direct impact of somatic variants on the genes in which they are located. We combine information from the clonal trees about which variants tag which clones and VEP functional annotations about the expected consequence of each variant. For each gene with a somatic variant (and sufficient gene expression), we test for differential expression between cells that have the mutation (based on clone assignment) and those that do not. We can then compare these “mutated gene” differential expression results for variants with different categories of functional annotation.
Clonal trees were inferred with Canopy using variant allele frequency information from a set of strictly filtered somatic variants called from whole-exome sequencing data. We filtered somatic variants to only include variants that had read coverage in at least one single-cell for the relevant line. Assignment of cells to clones was done with cardelino.
Load VEP consequence information.
vep_best <- read_tsv("data/exome-point-mutations/high-vs-low-exomes.v62.ft.alldonors-filt_lenient.all_filt_sites.vep_most_severe_csq.txt")
colnames(vep_best)[1] <- "Uploaded_variation"
## deduplicate dataframe
vep_best <- vep_best[!duplicated(vep_best[["Uploaded_variation"]]),]
There are 8503 variants, aggregated across all lines, and we can check the number that fall into the different VEP annotation categories.
3_prime_UTR_variant 5_prime_UTR_variant
181 121
downstream_gene_variant intergenic_variant
13 13
intron_variant mature_miRNA_variant
1539 3
missense_variant non_coding_transcript_exon_variant
3648 428
regulatory_region_variant splice_acceptor_variant
1 41
splice_donor_variant splice_region_variant
24 291
start_lost stop_gained
9 227
stop_lost stop_retained_variant
2 5
synonymous_variant upstream_gene_variant
1923 34
Load exome sites, that is, the somatic SNVs identified with Petr Danecek’s somatic variant calling approach.
exome_sites <- read_tsv("data/exome-point-mutations/high-vs-low-exomes.v62.ft.filt_lenient-alldonors.txt.gz",
col_types = "ciccdcciiiiccccccccddcdcll", comment = "#",
col_names = TRUE)
exome_sites <- dplyr::select(exome_sites,
-c(clinsing, SIFT_Pdel, PolyPhen_Pdel, gene_name,
bcftools_csq))
exome_sites <- dplyr::mutate(
exome_sites,
chrom = paste0("chr", gsub("chr", "", chrom)),
var_id = paste0(chrom, ":", pos, "_", ref, "_", alt))
## deduplicate sites list
exome_sites <- exome_sites[!duplicated(exome_sites[["var_id"]]),]
Add variant consequences annotations to exome sites.
vep_best[["var_id"]] <- paste0("chr", vep_best[["Uploaded_variation"]])
vep_best[, c("var_id", "Location", "Consequence")]
# A tibble: 8,503 x 3
var_id Location Consequence
<chr> <chr> <chr>
1 chr1:874501_C_T 1:874501-874501 missense_variant
2 chr1:915025_G_A 1:915025-915025 missense_variant
3 chr1:916622_C_T 1:916622-916622 intron_variant
4 chr1:985731_G_A 1:985731-985731 intron_variant
5 chr1:985732_G_A 1:985732-985732 intron_variant
6 chr1:1019508_G_A 1:1019508-1019508 missense_variant
7 chr1:1019509_G_A 1:1019509-1019509 synonymous_variant
8 chr1:1116160_G_A 1:1116160-1116160 synonymous_variant
9 chr1:1158664_G_A 1:1158664-1158664 synonymous_variant
10 chr1:1178463_T_A 1:1178463-1178463 missense_variant
# … with 8,493 more rows
exome_sites <- inner_join(exome_sites,
vep_best[, c("var_id", "Location", "Consequence")],
by = "var_id")
# com_vars <- intersect(exome_sites$var_id, csq_annos$var_id)
# length(com_vars)
# mean(exome_sites$var_id %in% csq_annos$var_id)
# mean(exome_sites$pos %in% csq_annos$Pos)
Load SingleCellExperiment objects with single-cell gene expression data.
params <- list()
params$callset <- "filt_lenient.cell_coverage_sites"
fls <- list.files("data/sces")
fls <- fls[grepl(paste0("carderelax.", params$callset), fls)]
lines <- gsub(".*ce_([a-z]+)_with_clone_assignments_carderelax.*", "\\1", fls)
sce_unst_list <- list()
for (don in lines) {
sce_unst_list[[don]] <- readRDS(file.path("data/sces",
paste0("sce_", don, "_with_clone_assignments_carderelax.",
params$callset, ".rds")))
cat(paste("reading", don, ": ", ncol(sce_unst_list[[don]]),
"unstimulated cells.\n"))
}
reading euts : 79 unstimulated cells.
reading fawm : 53 unstimulated cells.
reading feec : 75 unstimulated cells.
reading fikt : 39 unstimulated cells.
reading garx : 70 unstimulated cells.
reading gesg : 105 unstimulated cells.
reading heja : 50 unstimulated cells.
reading hipn : 62 unstimulated cells.
reading ieki : 58 unstimulated cells.
reading joxm : 79 unstimulated cells.
reading kuco : 48 unstimulated cells.
reading laey : 55 unstimulated cells.
reading lexy : 63 unstimulated cells.
reading naju : 44 unstimulated cells.
reading nusw : 60 unstimulated cells.
reading oaaz : 38 unstimulated cells.
reading oilg : 90 unstimulated cells.
reading pipw : 107 unstimulated cells.
reading puie : 41 unstimulated cells.
reading qayj : 97 unstimulated cells.
reading qolg : 36 unstimulated cells.
reading qonc : 58 unstimulated cells.
reading rozh : 91 unstimulated cells.
reading sehl : 30 unstimulated cells.
reading ualf : 89 unstimulated cells.
reading vass : 37 unstimulated cells.
reading vils : 37 unstimulated cells.
reading vuna : 71 unstimulated cells.
reading wahn : 82 unstimulated cells.
reading wetu : 77 unstimulated cells.
reading xugn : 35 unstimulated cells.
reading zoxy : 88 unstimulated cells.
assignments_lst <- list()
for (don in lines) {
assignments_lst[[don]] <- as_data_frame(
colData(sce_unst_list[[don]])[,
c("donor_short_id", "highest_prob",
"assigned", "total_features",
"total_counts_endogenous", "num_processed")])
}
assignments <- do.call("rbind", assignments_lst)
We have SCE objects (expression data and cell assignments) for 32 lines, with 88% of cells confidently assigned to a clone.
Finally, we load transcriptome-wide differential expression results computed using the quasi-likelihood F test method in edgeR.
Next, we need to annotate the variants with information about which clones they are expected to be present in (obtained from the Canopy tree inference).
cell_assign_list <- list()
for (don in lines) {
cell_assign_list[[don]] <- readRDS(file.path("data/cell_assignment",
paste0("cardelino_results_carderelax.", don, ".", params$callset, ".rds")))
cat(paste("reading", don, "\n"))
}
reading euts
reading fawm
reading feec
reading fikt
reading garx
reading gesg
reading heja
reading hipn
reading ieki
reading joxm
reading kuco
reading laey
reading lexy
reading naju
reading nusw
reading oaaz
reading oilg
reading pipw
reading puie
reading qayj
reading qolg
reading qonc
reading rozh
reading sehl
reading ualf
reading vass
reading vils
reading vuna
reading wahn
reading wetu
reading xugn
reading zoxy
get_sites_by_line <- function(sites_df, sce_list, assign_list) {
if (!identical(sort(names(sce_list)), sort(names(assign_list))))
stop("lines do not match between sce_list and assign_list.")
sites_by_line <- list()
for (don in names(sce_list)) {
config <- as_data_frame(assign_list[[don]]$tree$Z)
config[["var_id"]] <- rownames(assign_list[[don]]$tree$Z)
sites_line <- inner_join(sites_df, config)
sites_line[["clone_presence"]] <- ""
for (cln in colnames(assign_list[[don]]$tree$Z)[-1]) {
sites_line[["clone_presence"]][
as.logical(sites_line[[cln]])] <- paste(
sites_line[["clone_presence"]][
as.logical(sites_line[[cln]])], cln, sep = "&")
}
sites_line[["clone_presence"]] <- gsub("^&", "",
sites_line[["clone_presence"]])
## drop config columns as these won't match up between lines
keep_cols <- grep("^clone[0-9]$", colnames(sites_line), invert = TRUE)
sites_by_line[[don]] <- sites_line[, keep_cols]
}
do.call("bind_rows", sites_by_line)
}
sites_by_line <- get_sites_by_line(exome_sites, sce_unst_list, cell_assign_list)
sites_by_line <- dplyr::mutate(sites_by_line,
Consequence = gsub("_variant", "", Consequence))
Get DE results comparing mutated clone to all unmutated clones, but we first have to organise the data further.
Turn dataframes into GRanges objects to enable fast and robust overlaps.
## run DE for mutated cells vs unmutated cells using existing DE results
## filter out any remaining ERCC genes
for (don in names(de_res[["sce_list_unst"]]))
de_res[["sce_list_unst"]][[don]] <- de_res[["sce_list_unst"]][[don]][
!rowData(de_res[["sce_list_unst"]][[don]])$is_feature_control,]
sce_de_list_gr <- list()
for (don in names(de_res[["sce_list_unst"]])) {
sce_de_list_gr[[don]] <- makeGRangesFromDataFrame(
rowData(de_res[["sce_list_unst"]][[don]]),
start.field = "start_position",
end.field = "end_position",
keep.extra.columns = TRUE)
seqlevelsStyle(sce_de_list_gr[[don]]) <- "Ensembl"
}
sites_by_line <- dplyr::mutate(sites_by_line, end_pos = pos)
sites_by_line_gr <- makeGRangesFromDataFrame(
sites_by_line,
start.field = "pos",
end.field = "end_pos",
keep.extra.columns = TRUE)
seqlevelsStyle(sites_by_line_gr) <- "Ensembl"
Run DE testing between mutated and un-mutated clones for all affected genes for each line.
mut_genes_df_allcells_list <- list()
for (don in names(de_res[["sce_list_unst"]])) {
cat("working on ", don, "\n")
sites_tmp <- sites_by_line_gr[sites_by_line_gr$donor_short_id == don]
ov_tmp <- findOverlaps(sce_de_list_gr[[don]], sites_tmp)
sce_tmp <- de_res[["sce_list_unst"]][[don]][queryHits(ov_tmp),]
sites_tmp <- sites_tmp[subjectHits(ov_tmp)]
sites_tmp$gene <- rownames(sce_tmp)
dge_tmp <- de_res[["dge_list"]][[don]]
dge_tmp <- dge_tmp[intersect(rownames(dge_tmp), sites_tmp$gene),]
base_design <- dge_tmp$design[, !grepl("assigned", colnames(dge_tmp$design))]
de_tbl_tmp <- data.frame(line = don,
gene = sites_tmp$gene,
hgnc_symbol = gsub(".*_", "", sites_tmp$gene),
ensembl_gene_id = gsub("_.*", "", sites_tmp$gene),
var_id = sites_tmp$var_id,
location = sites_tmp$Location,
consequence = sites_tmp$Consequence,
clone_presence = sites_tmp$clone_presence,
logFC = NA, logCPM = NA, F = NA, PValue = NA,
comment = "", stringsAsFactors = FALSE)
for (i in seq_len(length(sites_tmp))) {
clones_tmp <- strsplit(sites_tmp$clone_presence[i], split = "&")[[1]]
mutatedclone <- as.numeric(sce_tmp$assigned %in% clones_tmp)
dsgn_tmp <- cbind(base_design, data.frame(mutatedclone))
if (sites_tmp$gene[i] %in% rownames(dge_tmp) && is.fullrank(dsgn_tmp)) {
qlfit_tmp <- glmQLFit(dge_tmp[sites_tmp$gene[i],], dsgn_tmp)
de_tmp <- glmQLFTest(qlfit_tmp, coef = ncol(dsgn_tmp))
de_tbl_tmp$logFC[i] <- de_tmp$table$logFC
de_tbl_tmp$logCPM[i] <- de_tmp$table$logCPM
de_tbl_tmp$F[i] <- de_tmp$table$F
de_tbl_tmp$PValue[i] <- de_tmp$table$PValue
}
if (!(sites_tmp$gene[i] %in% rownames(dge_tmp)))
de_tbl_tmp$comment[i] <- "gene did not pass DE filters"
if (!is.fullrank(dsgn_tmp))
de_tbl_tmp$comment[i] <- "insufficient cells assigned to clone"
}
mut_genes_df_allcells_list[[don]] <- de_tbl_tmp
}
working on euts
working on fawm
working on feec
working on fikt
working on garx
working on gesg
working on heja
working on hipn
working on ieki
working on joxm
working on kuco
working on laey
working on lexy
working on naju
working on nusw
working on oaaz
working on oilg
working on pipw
working on puie
working on qayj
working on qolg
working on qonc
working on rozh
working on sehl
working on ualf
working on vass
working on vuna
working on wahn
working on wetu
working on xugn
working on zoxy
mut_genes_df_allcells <- do.call("bind_rows", mut_genes_df_allcells_list)
dim(mut_genes_df_allcells)
[1] 1140 13
line gene hgnc_symbol ensembl_gene_id
1 euts ENSG00000006327_TNFRSF12A TNFRSF12A ENSG00000006327
2 euts ENSG00000028528_SNX1 SNX1 ENSG00000028528
3 euts ENSG00000060642_PIGV PIGV ENSG00000060642
4 euts ENSG00000083168_KAT6A KAT6A ENSG00000083168
5 euts ENSG00000083520_DIS3 DIS3 ENSG00000083520
6 euts ENSG00000090060_PAPOLA PAPOLA ENSG00000090060
var_id location consequence clone_presence
1 chr16:3071779_A_T 16:3071779-3071779 stop_gained clone2
2 chr15:64388302_G_A 15:64388302-64388302 synonymous clone3
3 chr1:27121236_G_A 1:27121236-27121236 synonymous clone3
4 chr8:41798596_G_A 8:41798596-41798596 missense clone2
5 chr13:73335940_G_C 13:73335940-73335940 synonymous clone2&clone3
6 chr14:97000841_C_T 14:97000841-97000841 missense clone3
logFC logCPM F PValue comment
1 0.0378859 9.221413 0.03458196 0.85301221
2 -0.1549842 6.353333 0.08000619 0.77812377
3 -2.8765803 4.367360 7.03122364 0.00990008
4 NA NA NA NA gene did not pass DE filters
5 -0.7635865 4.366939 0.96241345 0.32995811
6 0.0493242 7.067499 0.01151527 0.91485064
With this analysis, 1067 genes/variants could be tested for DE between mutated and unmutated clones.
We can recompute false discovery rates with IHW, simplify the VEP annotation categories (assigning all nonsense categories to “nonsense” and all splicing categories to “splicing”) and inspect the results.
## add FDRs for genes tested here for DE
ihw_res_all <- ihw(PValue ~ logCPM, data = mut_genes_df_allcells, alpha = 0.2)
mut_genes_df_allcells$FDR <- adj_pvalues(ihw_res_all)
## add simplified consequence categories
mut_genes_df_allcells$consequence_simplified <-
mut_genes_df_allcells$consequence
mut_genes_df_allcells$consequence_simplified[
mut_genes_df_allcells$consequence_simplified %in%
c("stop_retained", "start_lost", "stop_lost", "stop_gained")] <- "nonsense"
mut_genes_df_allcells$consequence_simplified[
mut_genes_df_allcells$consequence_simplified %in%
c("splice_donor", "splice_acceptor", "splice_region")] <- "splicing"
table(mut_genes_df_allcells$consequence_simplified)
3_prime_UTR 5_prime_UTR
27 12
intron missense
20 650
non_coding_transcript_exon nonsense
20 39
splicing synonymous
21 351
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
0.0000 0.7450 0.8918 0.7912 0.9408 1.0000 73
dplyr::arrange(mut_genes_df_allcells, FDR) %>% dplyr::select(-location) %>%
DT::datatable(., options = list(pageLength = 20))
There are few variants with a significant difference in expression between clones with and without the mutation (FDR < 10%).
Plot these results.
dplyr::filter(mut_genes_df_allcells, !is.na(logFC)) %>%
ggplot(aes(x = logFC, y = -log10(PValue),
fill = FDR < 0.1)) +
geom_point(colour = "gray40", pch = 21) +
geom_rug(sides = "b", alpha = 0.6) +
geom_vline(xintercept = 0, linetype = 2) +
# geom_text_repel(show.legend = FALSE) +
scale_fill_manual(values = c("gray70", "firebrick"),
label = c("N.S.", "FDR < 10%"),
name = "DE result") +
facet_wrap(~consequence, ncol = 5) +
guides(alpha = FALSE) +
theme_classic(16) +
theme(strip.background = element_rect(fill = "gray90"),
legend.position = c(0.9, 0.1))
Version | Author | Date |
---|---|---|
36acf15 | davismcc | 2018-08-25 |
ggsave("figures/mutated_genes/alllines_mutgenes_volcano_by_vep_anno_allcells.png",
height = 7, width = 16)
ggsave("figures/mutated_genes/alllines_mutgenes_volcano_by_vep_anno_allcells.pdf",
height = 7, width = 16)
ggsave("figures/mutated_genes/alllines_mutgenes_volcano_by_vep_anno_allcells.svg",
height = 7, width = 16)
The picture can look a little clearer with simplified annotation categories.
dplyr::filter(mut_genes_df_allcells, !is.na(logFC)) %>%
ggplot(aes(x = logFC, y = -log10(PValue),
fill = FDR < 0.2)) +
geom_point(colour = "gray40", pch = 21) +
geom_rug(sides = "b", alpha = 0.6) +
geom_vline(xintercept = 0, linetype = 2) +
# geom_text_repel(show.legend = FALSE) +
scale_fill_manual(values = c("gray70", "firebrick"),
label = c("N.S.", "FDR < 20%"),
name = "DE result") +
facet_wrap(~consequence_simplified, ncol = 3) +
guides(alpha = FALSE) +
theme_classic(20) +
theme(strip.background = element_rect(fill = "gray90"),
strip.text = element_text(size = 13),
legend.position = c(0.8, 0.15))
ggsave("figures/mutated_genes/alllines_mutgenes_volcano_by_simple_vep_anno_allcells.png",
height = 7, width = 10.5)
ggsave("figures/mutated_genes/alllines_mutgenes_volcano_by_simple_vep_anno_allcells.pdf",
height = 7, width = 10.5)
ggsave("figures/mutated_genes/alllines_mutgenes_volcano_by_simple_vep_anno_allcells.svg",
height = 7, width = 10.5)
Let us check the median logFC values for each simplified VEP annotation category and add boxplots for logFC to the volcano plots.
tmp2 <- mut_genes_df_allcells %>%
group_by(consequence_simplified) %>%
summarise(med = median(logFC, na.rm = TRUE),
nvars = n())
dplyr::arrange(tmp2, med)
# A tibble: 8 x 3
consequence_simplified med nvars
<chr> <dbl> <int>
1 non_coding_transcript_exon -0.385 20
2 nonsense -0.306 39
3 intron -0.176 20
4 3_prime_UTR -0.0998 27
5 synonymous -0.0745 351
6 missense -0.00875 650
7 splicing 0.0130 21
8 5_prime_UTR 0.609 12
plist <- list()
simple_cons_vec <- c("nonsense", "splicing", "missense",
"synonymous", "intron",
"non_coding_transcript_exon",
"5_prime_UTR", "3_prime_UTR")
for (cons in simple_cons_vec) {
p <- dplyr::filter(mut_genes_df_allcells, !is.na(logFC),
consequence_simplified == cons) %>%
ggplot(aes(x = logFC, y = -log10(PValue),
fill = FDR < 0.2)) +
geom_point(colour = "gray40", pch = 21, size = 2) +
geom_rug(sides = "b", alpha = 0.6) +
geom_vline(xintercept = 0, linetype = 2) +
scale_fill_manual(values = c("gray70", "firebrick"),
label = c("N.S.", "FDR < 20%"),
name = "DE result") +
guides(alpha = FALSE) +
theme_classic(14) +
ggtitle(cons) +
xlim(1.05 * range(mut_genes_df_allcells$logFC, na.rm = TRUE)) +
ylim(c(0, 1.05 * max(-log10(mut_genes_df_allcells$PValue), na.rm = TRUE))) +
theme(strip.background = element_rect(fill = "gray90"),
strip.text = element_text(size = 12),
legend.position = "none",
plot.title = element_text(face = "bold"))
plist[[cons]] <- ggExtra::ggMarginal(
p, type = "boxplot", margins = "x", fill = "wheat",
colour = "firebrick4", outlier.size = 0, outlier.alpha = 0)
}
cowplot::plot_grid(plotlist = plist, nrow = 2)
ggsave("figures/mutated_genes/alllines_mutgenes_volcano_by_simple_vep_anno_allcells_with_boxplot.png",
height = 7, width = 18)
ggsave("figures/mutated_genes/alllines_mutgenes_volcano_by_simple_vep_anno_allcells_with_boxplot.pdf",
height = 7, width = 18)
ggsave("figures/mutated_genes/alllines_mutgenes_volcano_by_simple_vep_anno_allcells_with_boxplot.svg",
height = 7, width = 18)
tmp2 <- mut_genes_df_allcells %>%
group_by(consequence) %>%
dplyr::summarise(med = median(logFC, na.rm = TRUE),
nvars = n())
mut_genes_df_allcells %>%
dplyr::filter(!is.na(logFC)) %>%
dplyr::mutate(consequence = factor(
consequence, levels(as.factor(consequence))[order(tmp2[["med"]])]),
de = ifelse(FDR < 0.2, "FDR < 0.2", "FDR > 0.2")) %>%
ggplot(aes(y = logFC, x = consequence)) +
geom_hline(yintercept = 0, linetype = 1, colour = "black") +
ggbeeswarm::geom_quasirandom(colour = "gray40", fill = "gray70", pch = 21) +
geom_boxplot(outlier.size = 0, outlier.alpha = 0, fill = "wheat",
colour = "firebrick4", width = 0.2, size = 1) +
scale_x_discrete(expand = c(.01, 0)) +
scale_y_continuous(expand = c(0, 0), name = "logFC") +
theme_ridges(16) +
coord_flip() +
theme(strip.background = element_rect(fill = "gray90")) +
guides(fill = FALSE, color = FALSE, point_color = FALSE)
ggsave("figures/mutated_genes/alllines_mutgenes_logfc-box_by_vep_anno_allcells.png",
height = 9, width = 10)
ggsave("figures/mutated_genes/alllines_mutgenes_logfc-box_by_vep_anno_allcells.pdf",
height = 9, width = 10)
ggsave("figures/mutated_genes/alllines_mutgenes_logfc-box_by_vep_anno_allcells.svg",
height = 9, width = 10)
We can also just look specifically at the boxplots of the logFC values across categories, first looking at the median logFC values and results from a t-test testing if the mean logFC value is different from zero. Positive logFC values indicate higher expression in the mutated clone, and negative logFC values lower expression in the mutated clone.
tmp3 <- mut_genes_df_allcells %>%
group_by(consequence_simplified) %>%
dplyr::summarise(med = median(logFC, na.rm = TRUE),
nvars = n(),
t_coef = t.test(logFC, mu = 0)$estimate,
t_stat = t.test(logFC, mu = 0)$statistic,
t_pval = t.test(logFC, mu = 0)$p.value)
tmp3
# A tibble: 8 x 6
consequence_simplified med nvars t_coef t_stat t_pval
<chr> <dbl> <int> <dbl> <dbl> <dbl>
1 3_prime_UTR -0.0998 27 0.0935 0.325 0.748
2 5_prime_UTR 0.609 12 0.732 2.71 0.0221
3 intron -0.176 20 -0.285 -1.72 0.102
4 missense -0.00875 650 -0.0623 -1.30 0.194
5 non_coding_transcript_exon -0.385 20 -0.197 -1.46 0.159
6 nonsense -0.306 39 -0.189 -1.68 0.101
7 splicing 0.0130 21 -0.223 -1.06 0.306
8 synonymous -0.0745 351 -0.115 -1.67 0.0962
mut_genes_df_allcells %>%
dplyr::filter(!is.na(logFC)) %>%
dplyr::mutate(consequence_simplified = factor(
consequence_simplified,
levels(as.factor(consequence_simplified))[order(tmp3[["med"]])]),
de = ifelse(FDR < 0.2, "FDR < 0.2", "FDR > 0.2")) %>%
ggplot(aes(y = logFC, x = consequence_simplified)) +
geom_hline(yintercept = 0, linetype = 1, colour = "black") +
ggbeeswarm::geom_quasirandom(colour = "gray40", fill = "gray70", pch = 21) +
geom_boxplot(outlier.size = 0, outlier.alpha = 0, fill = "wheat",
colour = "firebrick4", width = 0.2, size = 1) +
scale_x_discrete(expand = c(.01, 0), name = "consequence") +
scale_y_continuous(expand = c(0, 0), name = "logFC") +
theme_ridges(20) +
coord_flip() +
theme(strip.background = element_rect(fill = "gray90")) +
guides(fill = FALSE, color = FALSE, point_color = FALSE)
ggsave("figures/mutated_genes/alllines_mutgenes_logfc-box_by_simple_vep_anno_allcells.png",
height = 7, width = 10)
ggsave("figures/mutated_genes/alllines_mutgenes_logfc-box_by_simple_vep_anno_allcells.pdf",
height = 7, width = 10)
ggsave("figures/mutated_genes/alllines_mutgenes_logfc-box_by_simple_vep_anno_allcells.svg",
height = 7, width = 10)
In the analysis above, the minimum allowed clone size (i.e. group for DE analysis) was 3 cells. Such potentially small group sizes could lead to noisy estimates of the logFC values that we looked at above. Let us repeat the above analysis but increasing the minimum number of cells for a group to 10 to try to obtain more accurate (on average) logFC estimates from the DE model.
Run DE testing between mutated and un-mutated clones for a more strictly filtered set of affected genes for each line.
mut_genes_df_allcells_filt_list <- list()
for (don in names(de_res[["sce_list_unst"]])) {
cat("working on ", don, "\n")
sites_tmp <- sites_by_line_gr[sites_by_line_gr$donor_short_id == don]
ov_tmp <- findOverlaps(sce_de_list_gr[[don]], sites_tmp)
sce_tmp <- de_res[["sce_list_unst"]][[don]][queryHits(ov_tmp),]
sites_tmp <- sites_tmp[subjectHits(ov_tmp)]
sites_tmp$gene <- rownames(sce_tmp)
dge_tmp <- de_res[["dge_list"]][[don]]
dge_tmp <- dge_tmp[intersect(rownames(dge_tmp), sites_tmp$gene),]
base_design <- dge_tmp$design[, !grepl("assigned", colnames(dge_tmp$design))]
de_tbl_tmp <- data.frame(line = don,
gene = sites_tmp$gene,
hgnc_symbol = gsub(".*_", "", sites_tmp$gene),
ensembl_gene_id = gsub("_.*", "", sites_tmp$gene),
var_id = sites_tmp$var_id,
location = sites_tmp$Location,
consequence = sites_tmp$Consequence,
clone_presence = sites_tmp$clone_presence,
logFC = NA, logCPM = NA, F = NA, PValue = NA,
comment = "", stringsAsFactors = FALSE)
for (i in seq_len(length(sites_tmp))) {
clones_tmp <- strsplit(sites_tmp$clone_presence[i], split = "&")[[1]]
mutatedclone <- as.numeric(sce_tmp$assigned %in% clones_tmp)
dsgn_tmp <- cbind(base_design, data.frame(mutatedclone))
if (sites_tmp$gene[i] %in% rownames(dge_tmp) && is.fullrank(dsgn_tmp)
&& (sum(mutatedclone) > 9.5) && (sum(!mutatedclone) > 9.5)) {
qlfit_tmp <- glmQLFit(dge_tmp[sites_tmp$gene[i],], dsgn_tmp)
de_tmp <- glmQLFTest(qlfit_tmp, coef = ncol(dsgn_tmp))
de_tbl_tmp$logFC[i] <- de_tmp$table$logFC
de_tbl_tmp$logCPM[i] <- de_tmp$table$logCPM
de_tbl_tmp$F[i] <- de_tmp$table$F
de_tbl_tmp$PValue[i] <- de_tmp$table$PValue
}
if ((sum(mutatedclone) < 9.5) || (sum(!mutatedclone) < 9.5))
de_tbl_tmp$comment[i] <- "minimum group size < 10 cells"
if (!(sites_tmp$gene[i] %in% rownames(dge_tmp)))
de_tbl_tmp$comment[i] <- "gene did not pass DE filters"
if (!is.fullrank(dsgn_tmp))
de_tbl_tmp$comment[i] <- "insufficient cells assigned to clone"
}
mut_genes_df_allcells_filt_list[[don]] <- de_tbl_tmp
}
working on euts
working on fawm
working on feec
working on fikt
working on garx
working on gesg
working on heja
working on hipn
working on ieki
working on joxm
working on kuco
working on laey
working on lexy
working on naju
working on nusw
working on oaaz
working on oilg
working on pipw
working on puie
working on qayj
working on qolg
working on qonc
working on rozh
working on sehl
working on ualf
working on vass
working on vuna
working on wahn
working on wetu
working on xugn
working on zoxy
With this analysis, 830 could be tested for DE between mutated and unmutated clones (237 fewer than with the lenient filtering above).
We can recompute false discovery rates with IHW, simplify the VEP annotation categories (assigning all nonsense categories to “nonsense” and all splicing categories to “splicing”) and inspect the results.
## add FDRs for genes tested here for DE
ihw_res_all <- ihw(PValue ~ logCPM, data = mut_genes_df_allcells_filt, alpha = 0.2)
mut_genes_df_allcells_filt$FDR <- adj_pvalues(ihw_res_all)
## add simplified consequence categories
mut_genes_df_allcells_filt$consequence_simplified <-
mut_genes_df_allcells_filt$consequence
mut_genes_df_allcells_filt$consequence_simplified[
mut_genes_df_allcells_filt$consequence_simplified %in%
c("stop_retained", "start_lost", "stop_lost", "stop_gained")] <- "nonsense"
mut_genes_df_allcells_filt$consequence_simplified[
mut_genes_df_allcells_filt$consequence_simplified %in%
c("splice_donor", "splice_acceptor", "splice_region")] <- "splicing"
table(mut_genes_df_allcells_filt$consequence_simplified)
3_prime_UTR 5_prime_UTR
27 12
intron missense
20 650
non_coding_transcript_exon nonsense
20 39
splicing synonymous
21 351
dplyr::arrange(mut_genes_df_allcells_filt, FDR) %>%
dplyr::select(-location, line, hgnc_symbol, var_id, consequence, FDR, PValue,
everything()) %>%
DT::datatable(., options = list(pageLength = 20))
Again, we’ll look specifically at the boxplots of the logFC values across categories. Positive logFC values indicate higher expression in the mutated clone, and negative logFC values lower expression in the mutated clone.
tmp5 <- mut_genes_df_allcells_filt %>%
group_by(consequence_simplified) %>%
dplyr::summarise(med = median(logFC, na.rm = TRUE),
nvars = n()
)
dplyr::arrange(tmp5, med)
# A tibble: 8 x 3
consequence_simplified med nvars
<chr> <dbl> <int>
1 non_coding_transcript_exon -0.383 20
2 nonsense -0.280 39
3 3_prime_UTR -0.136 27
4 synonymous -0.0649 351
5 intron -0.0162 20
6 missense -0.00201 650
7 splicing 0.0324 21
8 5_prime_UTR 0.609 12
mut_genes_df_allcells_filt %>%
dplyr::filter(!is.na(logFC)) %>%
dplyr::mutate(consequence_simplified = factor(
consequence_simplified,
levels(as.factor(consequence_simplified))[order(tmp5[["med"]])]),
de = ifelse(FDR < 0.2, "FDR < 0.2", "FDR > 0.2")) %>%
ggplot(aes(y = logFC, x = consequence_simplified, fill = consequence)) +
geom_hline(yintercept = 0, linetype = 1, colour = "black") +
ggbeeswarm::geom_quasirandom(colour = "gray40", pch = 21) +
geom_boxplot(outlier.size = 0, outlier.alpha = 0, fill = "wheat", alpha = 0.5,
colour = "firebrick4", width = 0.2, size = 1) +
scale_x_discrete(expand = c(0.1, 0.1), name = "consequence") +
scale_y_continuous(expand = c(0.1, 0.1), name = "logFC") +
theme_ridges(20) +
coord_flip() +
theme(strip.background = element_rect(fill = "gray90")) +
guides(color = FALSE, point_color = FALSE) +
scale_fill_brewer(palette = "Paired") +
theme(legend.position = "bottom")
Version | Author | Date |
---|---|---|
d2e8b31 | davismcc | 2018-08-19 |
ggsave("figures/mutated_genes/alllines_mutgenes_logfc-box_by_simple_vep_anno_allcells.png",
height = 9, width = 15)
ggsave("figures/mutated_genes/alllines_mutgenes_logfc-box_by_simple_vep_anno_allcells.pdf",
height = 9, width = 15)
ggsave("figures/mutated_genes/alllines_mutgenes_logfc-box_by_simple_vep_anno_allcells.svg",
height = 9, width = 15)
On the evidence of the plot above it does not look like stricter filtering of variants fundmentally changes the results. It still does not look possible to discern differences in the logFC distributions of mutated genes between different annotation categories for variants.
We can also get DE results comparing mutated clone to all unmutated clones, fitting the first couple of PCs in the model to see if this accounts for unwanted variation and increases power to detect differences in expression between mutated and unwanted clones for mutated genes.
## run DE for mutated cells vs unmutated cells using existing DE results
## fitting PCs as covariates in DE model
mut_genes_df_allcells_list_PCreg <- list()
for (don in names(de_res[["sce_list_unst"]])) {
cat("working on ", don, "\n")
sites_tmp <- sites_by_line_gr[sites_by_line_gr$donor_short_id == don]
ov_tmp <- findOverlaps(sce_de_list_gr[[don]], sites_tmp)
sce_tmp <- de_res[["sce_list_unst"]][[don]][queryHits(ov_tmp),]
sites_tmp <- sites_tmp[subjectHits(ov_tmp)]
sites_tmp$gene <- rownames(sce_tmp)
dge_tmp <- de_res[["dge_list"]][[don]]
dge_tmp <- dge_tmp[intersect(rownames(dge_tmp), sites_tmp$gene),]
base_design <- dge_tmp$design[, !grepl("assigned", colnames(dge_tmp$design))]
de_tbl_tmp <- data.frame(line = don,
gene = sites_tmp$gene,
hgnc_symbol = gsub(".*_", "", sites_tmp$gene),
ensembl_gene_id = gsub("_.*", "", sites_tmp$gene),
var_id = sites_tmp$var_id,
location = sites_tmp$Location,
consequence = sites_tmp$Consequence,
clone_presence = sites_tmp$clone_presence,
logFC = NA, logCPM = NA, F = NA, PValue = NA,
comment = "", stringsAsFactors = FALSE)
pca_tmp <- reducedDim(runPCA(
de_res[["sce_list_unst"]][[don]], ntop = 500, ncomponents = 5), "PCA")
for (i in seq_len(length(sites_tmp))) {
clones_tmp <- strsplit(sites_tmp$clone_presence[i], split = "&")[[1]]
mutatedclone <- as.numeric(sce_tmp$assigned %in% clones_tmp)
dsgn_tmp <- cbind(base_design, pca_tmp[, 1], data.frame(mutatedclone))
if (sites_tmp$gene[i] %in% rownames(dge_tmp) && is.fullrank(dsgn_tmp)) {
qlfit_tmp <- glmQLFit(dge_tmp[sites_tmp$gene[i],], dsgn_tmp)
de_tmp <- glmQLFTest(qlfit_tmp, coef = ncol(dsgn_tmp))
de_tbl_tmp$logFC[i] <- de_tmp$table$logFC
de_tbl_tmp$logCPM[i] <- de_tmp$table$logCPM
de_tbl_tmp$F[i] <- de_tmp$table$F
de_tbl_tmp$PValue[i] <- de_tmp$table$PValue
}
if (!(sites_tmp$gene[i] %in% rownames(dge_tmp)))
de_tbl_tmp$comment[i] <- "gene did not pass DE filters"
if (!is.fullrank(dsgn_tmp))
de_tbl_tmp$comment[i] <- "insufficient cells assigned to clone"
}
mut_genes_df_allcells_list_PCreg[[don]] <- de_tbl_tmp
}
working on euts
working on fawm
working on feec
working on fikt
working on garx
working on gesg
working on heja
working on hipn
working on ieki
working on joxm
working on kuco
working on laey
working on lexy
working on naju
working on nusw
working on oaaz
working on oilg
working on pipw
working on puie
working on qayj
working on qolg
working on qonc
working on rozh
working on sehl
working on ualf
working on vass
working on vuna
working on wahn
working on wetu
working on xugn
working on zoxy
mut_genes_df_allcells_PCreg <- do.call("bind_rows", mut_genes_df_allcells_list_PCreg)
## add FDRs for genes tested here for DE
ihw_res_all <- ihw(PValue ~ logCPM, data = mut_genes_df_allcells_PCreg, alpha = 0.2)
mut_genes_df_allcells_PCreg$FDR <- adj_pvalues(ihw_res_all)
## add simplified consequence categories
mut_genes_df_allcells_PCreg$consequence_simplified <-
mut_genes_df_allcells_PCreg$consequence
mut_genes_df_allcells_PCreg$consequence_simplified[
mut_genes_df_allcells_PCreg$consequence_simplified %in%
c("stop_retained", "start_lost", "stop_lost", "stop_gained")] <- "nonsense"
mut_genes_df_allcells_PCreg$consequence_simplified[
mut_genes_df_allcells_PCreg$consequence_simplified %in%
c("splice_donor", "splice_acceptor", "splice_region")] <- "splicing"
table(mut_genes_df_allcells_PCreg$consequence_simplified)
3_prime_UTR 5_prime_UTR
27 12
intron missense
20 650
non_coding_transcript_exon nonsense
20 39
splicing synonymous
21 351
dplyr::arrange(mut_genes_df_allcells_PCreg, FDR) %>%
dplyr::select(-location) %>%
DT::datatable(., options = list(pageLength = 20))
tibble(PValue_default = mut_genes_df_allcells[["PValue"]],
PValue_PCreg = mut_genes_df_allcells_PCreg[["PValue"]],
sig_default = mut_genes_df_allcells[["FDR"]] < 0.1,
sig_PCreg = mut_genes_df_allcells_PCreg[["FDR"]] < 0.1) %>%
ggplot(aes(x = -log10(PValue_default), y = -log10(PValue_PCreg),
shape = sig_default, colour = sig_PCreg)) +
geom_point() +
geom_smooth(aes(group = 1), colour = "firebrick") +
geom_abline(slope = 1, intercept = 0) +
scale_color_manual(values = c("gray60", "black"))
Version | Author | Date |
---|---|---|
d2e8b31 | davismcc | 2018-08-19 |
There is thus very little change in DE results when regressing out 5 PCs in these DE models.
We can also check to see if accounting for cell cycle scores in the DE model increases power to detect differential expression between mutated and unmutated clones for mutated genes.
## run DE for mutated cells vs unmutated cells using existing DE results
## fitting PCs as covariates in DE model
mut_genes_df_allcells_list_CCreg <- list()
for (don in names(de_res[["sce_list_unst"]])) {
cat("working on ", don, "\n")
sites_tmp <- sites_by_line_gr[sites_by_line_gr$donor_short_id == don]
ov_tmp <- findOverlaps(sce_de_list_gr[[don]], sites_tmp)
sce_tmp <- de_res[["sce_list_unst"]][[don]][queryHits(ov_tmp),]
sites_tmp <- sites_tmp[subjectHits(ov_tmp)]
sites_tmp$gene <- rownames(sce_tmp)
dge_tmp <- de_res[["dge_list"]][[don]]
dge_tmp <- dge_tmp[intersect(rownames(dge_tmp), sites_tmp$gene),]
base_design <- dge_tmp$design[, !grepl("assigned", colnames(dge_tmp$design))]
de_tbl_tmp <- data.frame(line = don,
gene = sites_tmp$gene,
hgnc_symbol = gsub(".*_", "", sites_tmp$gene),
ensembl_gene_id = gsub("_.*", "", sites_tmp$gene),
var_id = sites_tmp$var_id,
location = sites_tmp$Location,
consequence = sites_tmp$Consequence,
clone_presence = sites_tmp$clone_presence,
logFC = NA, logCPM = NA, F = NA, PValue = NA,
comment = "", stringsAsFactors = FALSE)
G1 <- de_res[["sce_list_unst"]][[don]]$G1
S <- de_res[["sce_list_unst"]][[don]]$S
G2M <- de_res[["sce_list_unst"]][[don]]$G2M
for (i in seq_len(length(sites_tmp))) {
clones_tmp <- strsplit(sites_tmp$clone_presence[i], split = "&")[[1]]
mutatedclone <- as.numeric(sce_tmp$assigned %in% clones_tmp)
dsgn_tmp <- cbind(base_design, G1, S, G2M, data.frame(mutatedclone))
if (sites_tmp$gene[i] %in% rownames(dge_tmp) && is.fullrank(dsgn_tmp)) {
qlfit_tmp <- glmQLFit(dge_tmp[sites_tmp$gene[i],], dsgn_tmp)
de_tmp <- glmQLFTest(qlfit_tmp, coef = ncol(dsgn_tmp))
de_tbl_tmp$logFC[i] <- de_tmp$table$logFC
de_tbl_tmp$logCPM[i] <- de_tmp$table$logCPM
de_tbl_tmp$F[i] <- de_tmp$table$F
de_tbl_tmp$PValue[i] <- de_tmp$table$PValue
}
if (!(sites_tmp$gene[i] %in% rownames(dge_tmp)))
de_tbl_tmp$comment[i] <- "gene did not pass DE filters"
if (!is.fullrank(dsgn_tmp))
de_tbl_tmp$comment[i] <- "insufficient cells assigned to clone"
}
mut_genes_df_allcells_list_CCreg[[don]] <- de_tbl_tmp
}
working on euts
working on fawm
working on feec
working on fikt
working on garx
working on gesg
working on heja
working on hipn
working on ieki
working on joxm
working on kuco
working on laey
working on lexy
working on naju
working on nusw
working on oaaz
working on oilg
working on pipw
working on puie
working on qayj
working on qolg
working on qonc
working on rozh
working on sehl
working on ualf
working on vass
working on vuna
working on wahn
working on wetu
working on xugn
working on zoxy
mut_genes_df_allcells_CCreg <- do.call("bind_rows", mut_genes_df_allcells_list_CCreg)
## add FDRs for genes tested here for DE
ihw_res_all <- ihw(PValue ~ logCPM, data = mut_genes_df_allcells_CCreg, alpha = 0.2)
mut_genes_df_allcells_CCreg$FDR <- adj_pvalues(ihw_res_all)
## add simplified consequence categories
mut_genes_df_allcells_CCreg$consequence_simplified <-
mut_genes_df_allcells_CCreg$consequence
mut_genes_df_allcells_CCreg$consequence_simplified[
mut_genes_df_allcells_CCreg$consequence_simplified %in%
c("stop_retained", "start_lost", "stop_lost", "stop_gained")] <- "nonsense"
mut_genes_df_allcells_CCreg$consequence_simplified[
mut_genes_df_allcells_CCreg$consequence_simplified %in%
c("splice_donor", "splice_acceptor", "splice_region")] <- "splicing"
dplyr::arrange(mut_genes_df_allcells_CCreg, FDR) %>%
dplyr::select(-location) %>%
DT::datatable(., options = list(pageLength = 20))
tibble(PValue_default = mut_genes_df_allcells[["PValue"]],
PValue_CCreg = mut_genes_df_allcells_CCreg[["PValue"]],
sig_default = mut_genes_df_allcells[["FDR"]] < 0.1,
sig_PCreg = mut_genes_df_allcells_CCreg[["FDR"]] < 0.1) %>%
ggplot(aes(x = -log10(PValue_default), y = -log10(PValue_CCreg),
shape = sig_default, colour = sig_PCreg)) +
geom_point() +
geom_smooth(aes(group = 1), colour = "firebrick") +
geom_abline(slope = 1, intercept = 0) +
scale_color_manual(values = c("gray60", "black"))
Version | Author | Date |
---|---|---|
d2e8b31 | davismcc | 2018-08-19 |
Accounting for cell cycle scores (inferred with cyclone) in the DE model does not drastically change DE results, but perhaps does reduce power to find differences in expression between mutated and unmutated clones.
First, we can look at the number of variants annotated to the different VEP categories and assigned to combinations of clones, aggregated across all lines.
sites_by_line %>% group_by(Consequence, clone_presence) %>%
dplyr::summarise(n_vars = n()) %>%
ggplot(aes(x = n_vars, y = reorder(Consequence, n_vars, max),
colour = reorder(Consequence, n_vars, max))) +
geom_point(size = 5) +
geom_segment(aes(x = 0, y = Consequence, xend = n_vars, yend = Consequence)) +
facet_wrap(~clone_presence) +
scale_color_manual(values = colorRampPalette(brewer.pal(8, "Accent"))(18)) +
guides(colour = FALSE) +
ggtitle("Clone tagging variants by consequence class: all lines") +
xlab("number of variants") + ylab("consequence") +
theme_bw(16)
We can also produce a similar plot, but showing the total number variants in each annotation category for each line separately.
sites_by_line %>% group_by(Consequence, donor_short_id) %>%
dplyr::summarise(n_vars = n()) %>%
ggplot(aes(x = n_vars, y = reorder(Consequence, n_vars, max),
colour = reorder(Consequence, n_vars, max))) +
geom_point(size = 5) +
geom_segment(aes(x = 0, y = Consequence, xend = n_vars, yend = Consequence)) +
facet_wrap(~donor_short_id) +
# scale_color_brewer(palette = "Set2") +
scale_color_manual(values = colorRampPalette(brewer.pal(8, "Accent"))(18)) +
guides(colour = FALSE) +
ggtitle("Clone tagging variants by consequence class: all lines") +
xlab("number of variants") + ylab("consequence") +
theme_bw(16)
It’s also of interest to know if there are genes with multiple mutations in a line.
df_genes_multi_vars <- mut_genes_df_allcells %>%
dplyr::mutate(pos = as.numeric(gsub("chr.+:([0-9]+)_.*", "\\1", var_id))) %>%
group_by(line, hgnc_symbol) %>%
dplyr::summarise(n_variants_in_gene = n(),
max_dist_btw_vars = range(pos)[2] - range(pos)[1])
df_genes_multi_vars %>%
dplyr::filter(n_variants_in_gene > 1.5, max_dist_btw_vars > 1.5)
# A tibble: 4 x 4
# Groups: line [2]
line hgnc_symbol n_variants_in_gene max_dist_btw_vars
<chr> <chr> <int> <dbl>
1 joxm COL1A1 2 1815
2 joxm TBC1D9B 3 29054
3 qonc MAPK12 3 27
4 qonc WIPI2 2 1187
There are only 4 genes, across all lines, that have more than one (non-dinucleotide) somatic variant.
idx <- (df_genes_multi_vars$n_variants_in_gene > 1.5 &
df_genes_multi_vars$max_dist_btw_vars > 1.5)
mut_genes_df_allcells %>%
dplyr::filter(line %in% df_genes_multi_vars[["line"]][idx],
hgnc_symbol %in% df_genes_multi_vars[["hgnc_symbol"]][idx]) %>%
DT::datatable()
In the case of the COL1A1 gene in joxm
, there are two splicing variants (one a splice-acceptor variant and one a more general splice-region variant), and cells in the mutated clones have significantly lower expression than the unmutated clones for this gene. For the remaining genes with more than one somatic variant, there is no significant difference in gene expression between the mutated and unmutated clones.
Given the results of the transcriptome-wide differential expression analyses, it would be interesting to know if cell cycle genes specifically harbour mutations. To explore this, we first obtain the set of genes belonging to cell cycle gene sets in the MSigDB c2 gene set collection.
load(file.path("data/human_c2_v5p2.rdata"))
cellcycle_entrezid <- Hs.c2[grep("CELL_CYCL", names(Hs.c2))] %>%
unlist %>% unique
mapped_genes <- mappedkeys(org.Hs.egSYMBOL)
cellcycle_entrezid <- cellcycle_entrezid[cellcycle_entrezid %in% mapped_genes]
xx <- unlist(as.list(org.Hs.egSYMBOL[cellcycle_entrezid]))
xx <- sort(unique(xx))
This yields a list of 1243 genes annotated to be relevant to the cell cycle.
Let us look at the number of variants used for clonal reconstruction (a strictly filtered set) across VEP annotation categories that lie in cell cycle genes.
df_cc_vars <- mut_genes_df_allcells %>%
dplyr::filter(hgnc_symbol %in% xx) %>%
group_by(line, consequence_simplified) %>%
dplyr::summarise(n_vars = n())
df_cc_vars <- left_join(
expand.grid(line = levels(as.factor(mut_genes_df_allcells$line)),
consequence_simplified = levels(as.factor(
mut_genes_df_allcells$consequence_simplified))),
df_cc_vars)
df_cc_vars[is.na(df_cc_vars)] <- 0
ggplot(df_cc_vars,
aes(x = n_vars, y = reorder(consequence_simplified, n_vars, max),
colour = reorder(consequence_simplified, n_vars, max),
fill = reorder(consequence_simplified, n_vars, max))) +
geom_segment(aes(x = 0, y = consequence_simplified, xend = n_vars,
yend = consequence_simplified)) +
geom_point(size = 5, colour = "gray50", shape = 21) +
facet_wrap(~line, ncol = 4) +
scale_color_manual(values = colorRampPalette(brewer.pal(8, "Accent"))(18)) +
scale_fill_manual(values = colorRampPalette(brewer.pal(8, "Accent"))(18)) +
guides(colour = FALSE, fill = FALSE) +
ggtitle("Variants in proliferation genes: all lines") +
xlab("number of variants") + ylab("consequence") +
theme_bw(16)
Version | Author | Date |
---|---|---|
36acf15 | davismcc | 2018-08-25 |
We can also look at the load of mutations in genes annotated to MSigDB c2 proliferation gene sets.
prolif_entrezid <- Hs.c2[grep("PROLIF", names(Hs.c2))] %>%
unlist %>% unique
prolif_entrezid <- prolif_entrezid[prolif_entrezid %in% mapped_genes]
x2 <- unlist(as.list(org.Hs.egSYMBOL[prolif_entrezid]))
x2 <- sort(unique(x2))
This yields a list of 1219 genes annotated to be relevant to the cell cycle.
df_prolif_vars <- mut_genes_df_allcells %>%
dplyr::filter(hgnc_symbol %in% x2) %>%
group_by(line, consequence_simplified) %>%
dplyr::summarise(n_vars = n())
df_prolif_vars <- left_join(
expand.grid(line = levels(as.factor(mut_genes_df_allcells$line)),
consequence_simplified = levels(as.factor(
mut_genes_df_allcells$consequence_simplified))),
df_prolif_vars)
df_prolif_vars[is.na(df_prolif_vars)] <- 0
ggplot(df_prolif_vars,
aes(x = n_vars, y = reorder(consequence_simplified, n_vars, max),
colour = reorder(consequence_simplified, n_vars, max),
fill = reorder(consequence_simplified, n_vars, max))) +
geom_segment(aes(x = 0, y = consequence_simplified, xend = n_vars,
yend = consequence_simplified)) +
geom_point(size = 5, colour = "gray50", shape = 21) +
facet_wrap(~line, ncol = 4) +
scale_color_manual(values = colorRampPalette(brewer.pal(8, "Accent"))(18)) +
scale_fill_manual(values = colorRampPalette(brewer.pal(8, "Accent"))(18)) +
guides(colour = FALSE, fill = FALSE) +
ggtitle("Variants in cell cycle genes: all lines") +
xlab("number of variants") + ylab("consequence") +
theme_bw(16)
Version | Author | Date |
---|---|---|
36acf15 | davismcc | 2018-08-25 |
Finally, create a data.frame that contains the counts of variants annotated to different classes of genes and VEP classes, and save to file for use in other analyses.
colnames(df_cc_vars) <- c("line", "consequence", "n_vars_cellcycle_genes")
colnames(df_prolif_vars) <- c("line", "consequence", "n_vars_proliferation_genes")
df_all_vars <- mut_genes_df_allcells %>%
group_by(line, consequence_simplified) %>%
dplyr::summarise(n_vars_all_genes = n())
df_all_vars <- left_join(
expand.grid(line = levels(as.factor(mut_genes_df_allcells$line)),
consequence_simplified = levels(as.factor(
mut_genes_df_allcells$consequence_simplified))),
df_all_vars)
df_all_vars[is.na(df_all_vars)] <- 0
colnames(df_all_vars) <- c("line", "consequence", "n_vars_all_genes")
df_nvars <- inner_join(inner_join(df_all_vars, df_cc_vars), df_prolif_vars) %>%
as_tibble
readr::write_tsv(df_nvars, "output/nvars_by_category_by_line.tsv")
df_nvars %>%
DT::datatable(., options = list(pageLength = 20))
We can also look at the total mutational load for each of the lines.
# A tibble: 31 x 2
line nvars
<chr> <dbl>
1 euts 49
2 fawm 20
3 feec 24
4 fikt 27
5 garx 95
6 gesg 45
7 heja 32
8 hipn 11
9 ieki 15
10 joxm 116
11 kuco 9
12 laey 63
13 lexy 11
14 naju 15
15 nusw 13
16 oaaz 24
17 oilg 38
18 pipw 56
19 puie 18
20 qayj 16
21 qolg 28
22 qonc 20
23 rozh 16
24 sehl 29
25 ualf 48
26 vass 101
27 vuna 35
28 wahn 115
29 wetu 19
30 xugn 18
31 zoxy 14
We find a small number of somatic variants to associate with differential expression between mutated and unmutated cells/clones. Unfortunately, in this set of variants and lines (with relatively small sample size in terms of numbers of cells), we cannot make any claims about different effects on expression of somatic variants with different functional annotations.
─ Session info ──────────────────────────────────────────────────────────
setting value
version R version 3.6.0 (2019-04-26)
os Ubuntu 18.04.3 LTS
system x86_64, linux-gnu
ui X11
language (EN)
collate en_AU.UTF-8
ctype en_AU.UTF-8
tz Australia/Melbourne
date 2019-10-30
─ Packages ──────────────────────────────────────────────────────────────
package * version date lib source
AnnotationDbi * 1.46.1 2019-08-20 [1] Bioconductor
assertthat 0.2.1 2019-03-21 [1] CRAN (R 3.6.0)
backports 1.1.4 2019-04-10 [1] CRAN (R 3.6.0)
beeswarm 0.2.3 2016-04-25 [1] CRAN (R 3.6.0)
Biobase * 2.44.0 2019-05-02 [1] Bioconductor
BiocGenerics * 0.30.0 2019-05-02 [1] Bioconductor
BiocNeighbors 1.2.0 2019-05-02 [1] Bioconductor
BiocParallel * 1.18.1 2019-08-06 [1] Bioconductor
BiocSingular 1.0.0 2019-05-02 [1] Bioconductor
bit 1.1-14 2018-05-29 [1] CRAN (R 3.6.0)
bit64 0.9-7 2017-05-08 [1] CRAN (R 3.6.0)
bitops 1.0-6 2013-08-17 [1] CRAN (R 3.6.0)
blob 1.2.0 2019-07-09 [1] CRAN (R 3.6.0)
broom 0.5.2 2019-04-07 [1] CRAN (R 3.6.0)
callr 3.3.2 2019-09-22 [1] CRAN (R 3.6.0)
cellranger 1.1.0 2016-07-27 [1] CRAN (R 3.6.0)
cli 1.1.0 2019-03-19 [1] CRAN (R 3.6.0)
colorspace 1.4-1 2019-03-18 [1] CRAN (R 3.6.0)
cowplot * 1.0.0 2019-07-11 [1] CRAN (R 3.6.0)
crayon 1.3.4 2017-09-16 [1] CRAN (R 3.6.0)
crosstalk 1.0.0 2016-12-21 [1] CRAN (R 3.6.0)
DBI 1.0.0 2018-05-02 [1] CRAN (R 3.6.0)
DelayedArray * 0.10.0 2019-05-02 [1] Bioconductor
DelayedMatrixStats 1.6.1 2019-09-08 [1] Bioconductor
desc 1.2.0 2018-05-01 [1] CRAN (R 3.6.0)
devtools 2.2.1 2019-09-24 [1] CRAN (R 3.6.0)
digest 0.6.21 2019-09-20 [1] CRAN (R 3.6.0)
dplyr * 0.8.3 2019-07-04 [1] CRAN (R 3.6.0)
DT 0.9 2019-09-17 [1] CRAN (R 3.6.0)
edgeR * 3.26.8 2019-09-01 [1] Bioconductor
ellipsis 0.3.0 2019-09-20 [1] CRAN (R 3.6.0)
evaluate 0.14 2019-05-28 [1] CRAN (R 3.6.0)
fansi 0.4.0 2018-10-05 [1] CRAN (R 3.6.0)
farver 1.1.0 2018-11-20 [1] CRAN (R 3.6.0)
fdrtool 1.2.15 2015-07-08 [1] CRAN (R 3.6.0)
forcats * 0.4.0 2019-02-17 [1] CRAN (R 3.6.0)
fs 1.3.1 2019-05-06 [1] CRAN (R 3.6.0)
gdtools * 0.2.0 2019-09-03 [1] CRAN (R 3.6.0)
generics 0.0.2 2018-11-29 [1] CRAN (R 3.6.0)
GenomeInfoDb * 1.20.0 2019-05-02 [1] Bioconductor
GenomeInfoDbData 1.2.1 2019-04-30 [1] Bioconductor
GenomicRanges * 1.36.1 2019-09-06 [1] Bioconductor
ggbeeswarm 0.6.0 2017-08-07 [1] CRAN (R 3.6.0)
ggExtra 0.9 2019-08-27 [1] CRAN (R 3.6.0)
ggforce * 0.3.1 2019-08-20 [1] CRAN (R 3.6.0)
ggplot2 * 3.2.1 2019-08-10 [1] CRAN (R 3.6.0)
ggrepel * 0.8.1 2019-05-07 [1] CRAN (R 3.6.0)
ggridges * 0.5.1 2018-09-27 [1] CRAN (R 3.6.0)
git2r 0.26.1 2019-06-29 [1] CRAN (R 3.6.0)
glue 1.3.1 2019-03-12 [1] CRAN (R 3.6.0)
gridExtra 2.3 2017-09-09 [1] CRAN (R 3.6.0)
gtable 0.3.0 2019-03-25 [1] CRAN (R 3.6.0)
haven 2.1.1 2019-07-04 [1] CRAN (R 3.6.0)
hms 0.5.1 2019-08-23 [1] CRAN (R 3.6.0)
htmltools 0.3.6 2017-04-28 [1] CRAN (R 3.6.0)
htmlwidgets 1.3 2018-09-30 [1] CRAN (R 3.6.0)
httpuv 1.5.2 2019-09-11 [1] CRAN (R 3.6.0)
httr 1.4.1 2019-08-05 [1] CRAN (R 3.6.0)
IHW * 1.12.0 2019-05-02 [1] Bioconductor
IRanges * 2.18.3 2019-09-24 [1] Bioconductor
irlba 2.3.3 2019-02-05 [1] CRAN (R 3.6.0)
jsonlite 1.6 2018-12-07 [1] CRAN (R 3.6.0)
knitr 1.25 2019-09-18 [1] CRAN (R 3.6.0)
labeling 0.3 2014-08-23 [1] CRAN (R 3.6.0)
later 0.8.0 2019-02-11 [1] CRAN (R 3.6.0)
lattice 0.20-38 2018-11-04 [4] CRAN (R 3.5.1)
lazyeval 0.2.2 2019-03-15 [1] CRAN (R 3.6.0)
lifecycle 0.1.0 2019-08-01 [1] CRAN (R 3.6.0)
limma * 3.40.6 2019-07-26 [1] Bioconductor
locfit 1.5-9.1 2013-04-20 [1] CRAN (R 3.6.0)
lpsymphony 1.12.0 2019-05-02 [1] Bioconductor (R 3.6.0)
lubridate 1.7.4 2018-04-11 [1] CRAN (R 3.6.0)
magrittr 1.5 2014-11-22 [1] CRAN (R 3.6.0)
MASS 7.3-51.1 2018-11-01 [4] CRAN (R 3.5.1)
Matrix 1.2-17 2019-03-22 [1] CRAN (R 3.6.0)
matrixStats * 0.55.0 2019-09-07 [1] CRAN (R 3.6.0)
memoise 1.1.0 2017-04-21 [1] CRAN (R 3.6.0)
mgcv 1.8-28 2019-03-21 [4] CRAN (R 3.5.3)
mime 0.7 2019-06-11 [1] CRAN (R 3.6.0)
miniUI 0.1.1.1 2018-05-18 [1] CRAN (R 3.6.0)
modelr 0.1.5 2019-08-08 [1] CRAN (R 3.6.0)
munsell 0.5.0 2018-06-12 [1] CRAN (R 3.6.0)
nlme 3.1-139 2019-04-09 [4] CRAN (R 3.5.3)
org.Hs.eg.db * 3.8.2 2019-05-01 [1] Bioconductor
pillar 1.4.2 2019-06-29 [1] CRAN (R 3.6.0)
pkgbuild 1.0.5 2019-08-26 [1] CRAN (R 3.6.0)
pkgconfig 2.0.3 2019-09-22 [1] CRAN (R 3.6.0)
pkgload 1.0.2 2018-10-29 [1] CRAN (R 3.6.0)
plyr 1.8.4 2016-06-08 [1] CRAN (R 3.6.0)
polyclip 1.10-0 2019-03-14 [1] CRAN (R 3.6.0)
prettyunits 1.0.2 2015-07-13 [1] CRAN (R 3.6.0)
processx 3.4.1 2019-07-18 [1] CRAN (R 3.6.0)
promises 1.0.1 2018-04-13 [1] CRAN (R 3.6.0)
ps 1.3.0 2018-12-21 [1] CRAN (R 3.6.0)
purrr * 0.3.2 2019-03-15 [1] CRAN (R 3.6.0)
R6 2.4.0 2019-02-14 [1] CRAN (R 3.6.0)
RColorBrewer * 1.1-2 2014-12-07 [1] CRAN (R 3.6.0)
Rcpp 1.0.2 2019-07-25 [1] CRAN (R 3.6.0)
RCurl 1.95-4.12 2019-03-04 [1] CRAN (R 3.6.0)
readr * 1.3.1 2018-12-21 [1] CRAN (R 3.6.0)
readxl 1.3.1 2019-03-13 [1] CRAN (R 3.6.0)
remotes 2.1.0 2019-06-24 [1] CRAN (R 3.6.0)
rlang * 0.4.0 2019-06-25 [1] CRAN (R 3.6.0)
rmarkdown 1.15 2019-08-21 [1] CRAN (R 3.6.0)
rprojroot 1.3-2 2018-01-03 [1] CRAN (R 3.6.0)
RSQLite 2.1.2 2019-07-24 [1] CRAN (R 3.6.0)
rstudioapi 0.10 2019-03-19 [1] CRAN (R 3.6.0)
rsvd 1.0.2 2019-07-29 [1] CRAN (R 3.6.0)
rvest 0.3.4 2019-05-15 [1] CRAN (R 3.6.0)
S4Vectors * 0.22.1 2019-09-09 [1] Bioconductor
scales 1.0.0 2018-08-09 [1] CRAN (R 3.6.0)
scater * 1.12.2 2019-05-24 [1] Bioconductor
sessioninfo 1.1.1 2018-11-05 [1] CRAN (R 3.6.0)
shiny 1.3.2 2019-04-22 [1] CRAN (R 3.6.0)
SingleCellExperiment * 1.6.0 2019-05-02 [1] Bioconductor
slam 0.1-45 2019-02-26 [1] CRAN (R 3.6.0)
stringi 1.4.3 2019-03-12 [1] CRAN (R 3.6.0)
stringr * 1.4.0 2019-02-10 [1] CRAN (R 3.6.0)
SummarizedExperiment * 1.14.1 2019-07-31 [1] Bioconductor
superheat * 0.1.0 2017-02-04 [1] CRAN (R 3.6.0)
svglite 1.2.2 2019-05-17 [1] CRAN (R 3.6.0)
systemfonts 0.1.1 2019-07-01 [1] CRAN (R 3.6.0)
testthat 2.2.1 2019-07-25 [1] CRAN (R 3.6.0)
tibble * 2.1.3 2019-06-06 [1] CRAN (R 3.6.0)
tidyr * 1.0.0 2019-09-11 [1] CRAN (R 3.6.0)
tidyselect 0.2.5 2018-10-11 [1] CRAN (R 3.6.0)
tidyverse * 1.2.1 2017-11-14 [1] CRAN (R 3.6.0)
tweenr 1.0.1 2018-12-14 [1] CRAN (R 3.6.0)
usethis 1.5.1 2019-07-04 [1] CRAN (R 3.6.0)
utf8 1.1.4 2018-05-24 [1] CRAN (R 3.6.0)
vctrs 0.2.0 2019-07-05 [1] CRAN (R 3.6.0)
vipor 0.4.5 2017-03-22 [1] CRAN (R 3.6.0)
viridis * 0.5.1 2018-03-29 [1] CRAN (R 3.6.0)
viridisLite * 0.3.0 2018-02-01 [1] CRAN (R 3.6.0)
whisker 0.4 2019-08-28 [1] CRAN (R 3.6.0)
withr 2.1.2 2018-03-15 [1] CRAN (R 3.6.0)
workflowr 1.4.0 2019-06-08 [1] CRAN (R 3.6.0)
xfun 0.9 2019-08-21 [1] CRAN (R 3.6.0)
xml2 1.2.2 2019-08-09 [1] CRAN (R 3.6.0)
xtable 1.8-4 2019-04-21 [1] CRAN (R 3.6.0)
XVector 0.24.0 2019-05-02 [1] Bioconductor
yaml 2.2.0 2018-07-25 [1] CRAN (R 3.6.0)
zeallot 0.1.0 2018-01-28 [1] CRAN (R 3.6.0)
zlibbioc 1.30.0 2019-05-02 [1] Bioconductor
[1] /home/AD.SVI.EDU.AU/dmccarthy/R/x86_64-pc-linux-gnu-library/3.6
[2] /usr/local/lib/R/site-library
[3] /usr/lib/R/site-library
[4] /usr/lib/R/library