Does intron retention down regulate gene expression

Last updated: 2022-10-26

Checks: 6 1

Knit directory: ChromatinSplicingQTLs/analysis/

This reproducible R Markdown analysis was created with workflowr (version 1.6.2). The Checks tab describes the reproducibility checks that were applied when the results were created. The Past versions tab lists the development history.

R Markdown file: uncommitted changes

The R Markdown file has unstaged changes. To know which version of the R Markdown file created these results, you’ll want to first commit it to the Git repo. If you’re still working on the analysis, you can ignore this warning. When you’re finished, you can run wflow_publish to commit the R Markdown file and build the HTML.

Environment: empty

Great job! The global environment was empty. Objects defined in the global environment can affect the analysis in your R Markdown file in unknown ways. For reproduciblity it’s best to always run the code in an empty environment.

Seed: set.seed(20191126)

The command set.seed(20191126) was run prior to running the code in the R Markdown file. Setting a seed ensures that any results that rely on randomness, e.g. subsampling or permutations, are reproducible.

Session information: recorded

Great job! Recording the operating system, R version, and package versions is critical for reproducibility.

Cache: none

Nice! There were no cached chunks for this analysis, so you can be confident that you successfully produced the results during this run.

File paths: relative

Great job! Using relative paths to the files within your workflowr project makes it easier to run your code on other machines.

Repository version: 8df7e43

Great! You are using Git for version control. Tracking code development and connecting the code version to the results is critical for reproducibility.

The results in this page were generated with repository version 8df7e43. See the Past versions tab to see a history of the changes made to the R Markdown and HTML files.

Note that you need to be careful to ensure that all relevant files for the analysis have been committed to Git prior to generating the results (you can use wflow_publish or wflow_git_commit). workflowr only checks the R Markdown file, but you know if there are other scripts or data files that it depends on. Below is the status of the Git repository when the results were generated:


Ignored files:
    Ignored:    .DS_Store
    Ignored:    .Rhistory
    Ignored:    .Rproj.user/
    Ignored:    analysis/.Rhistory
    Ignored:    code/.DS_Store
    Ignored:    code/.RData
    Ignored:    code/._.DS_Store
    Ignored:    code/._README.md
    Ignored:    code/._report.html
    Ignored:    code/.ipynb_checkpoints/
    Ignored:    code/.snakemake/
    Ignored:    code/APA_Processing/
    Ignored:    code/Alignments/
    Ignored:    code/ChromHMM/
    Ignored:    code/ENCODE/
    Ignored:    code/ExpressionAnalysis/
    Ignored:    code/FastqFastp/
    Ignored:    code/FastqFastpSE/
    Ignored:    code/Genotypes/
    Ignored:    code/H3K36me3_CutAndTag.pdf
    Ignored:    code/IntronSlopes/
    Ignored:    code/Metaplots/
    Ignored:    code/Misc/
    Ignored:    code/MiscCountTables/
    Ignored:    code/Multiqc/
    Ignored:    code/Multiqc_chRNA/
    Ignored:    code/NonCodingRNA/
    Ignored:    code/NonCodingRNA_annotation/
    Ignored:    code/PeakCalling/
    Ignored:    code/Phenotypes/
    Ignored:    code/PlotGruberQTLs/
    Ignored:    code/PlotQTLs/
    Ignored:    code/ProCapAnalysis/
    Ignored:    code/QC/
    Ignored:    code/QTL_SNP_Enrichment/
    Ignored:    code/QTLs/
    Ignored:    code/RPKM_tables/
    Ignored:    code/ReferenceGenome/
    Ignored:    code/Rplots.pdf
    Ignored:    code/Session.vim
    Ignored:    code/SplicingAnalysis/
    Ignored:    code/TODO
    Ignored:    code/Tehranchi/
    Ignored:    code/bigwigs/
    Ignored:    code/bigwigs_FromNonWASPFilteredReads/
    Ignored:    code/config/.DS_Store
    Ignored:    code/config/._.DS_Store
    Ignored:    code/config/.ipynb_checkpoints/
    Ignored:    code/dag.pdf
    Ignored:    code/dag.png
    Ignored:    code/dag.svg
    Ignored:    code/debug.ipynb
    Ignored:    code/debug_python.ipynb
    Ignored:    code/deepTools/
    Ignored:    code/featureCounts/
    Ignored:    code/gwas_summary_stats/
    Ignored:    code/hyprcoloc/
    Ignored:    code/igv_session.xml
    Ignored:    code/log
    Ignored:    code/logs/
    Ignored:    code/notebooks/.ipynb_checkpoints/
    Ignored:    code/pi1/
    Ignored:    code/rules/.ipynb_checkpoints/
    Ignored:    code/rules/OldRules/
    Ignored:    code/rules/notebooks/
    Ignored:    code/scratch/
    Ignored:    code/scripts/.ipynb_checkpoints/
    Ignored:    code/scripts/GTFtools_0.8.0/
    Ignored:    code/scripts/__pycache__/
    Ignored:    code/scripts/liftOverBedpe/liftOverBedpe.py
    Ignored:    code/snakemake.dryrun.log
    Ignored:    code/snakemake.log
    Ignored:    code/snakemake.sbatch.log
    Ignored:    code/test.introns.bed
    Ignored:    code/test.introns2.bed
    Ignored:    data/.DS_Store
    Ignored:    data/._.DS_Store
    Ignored:    data/._20220414203249_JASPAR2022_combined_matrices_25818_jaspar.txt
    Ignored:    data/GWAS_catalog_summary_stats_sources/._list_gwas_summary_statistics_6_Apr_2022-10.csv
    Ignored:    data/GWAS_catalog_summary_stats_sources/._list_gwas_summary_statistics_6_Apr_2022-11.csv
    Ignored:    data/GWAS_catalog_summary_stats_sources/._list_gwas_summary_statistics_6_Apr_2022-2.csv
    Ignored:    data/GWAS_catalog_summary_stats_sources/._list_gwas_summary_statistics_6_Apr_2022-3.csv
    Ignored:    data/GWAS_catalog_summary_stats_sources/._list_gwas_summary_statistics_6_Apr_2022-4.csv
    Ignored:    data/GWAS_catalog_summary_stats_sources/._list_gwas_summary_statistics_6_Apr_2022-5.csv
    Ignored:    data/GWAS_catalog_summary_stats_sources/._list_gwas_summary_statistics_6_Apr_2022-6.csv
    Ignored:    data/GWAS_catalog_summary_stats_sources/._list_gwas_summary_statistics_6_Apr_2022-7.csv
    Ignored:    data/GWAS_catalog_summary_stats_sources/._list_gwas_summary_statistics_6_Apr_2022-8.csv
    Ignored:    data/GWAS_catalog_summary_stats_sources/._list_gwas_summary_statistics_6_Apr_2022.csv

Untracked files:
    Untracked:  code/snakemake_profiles/slurm/__pycache__/

Unstaged changes:
    Modified:   analysis/20221012_IntronRetentionAndExpressionConcordance.Rmd
    Modified:   code/scripts/GenometracksByGenotype

Note that any generated files, e.g. HTML, png, CSS, etc., are not included in this status report because it is ok for generated content to have uncommitted changes.

These are the previous versions of the repository in which changes were made to the R Markdown (analysis/20221012_IntronRetentionAndExpressionConcordance.Rmd) and HTML (docs/20221012_IntronRetentionAndExpressionConcordance.html) files. If you’ve configured a remote Git repository (see ?wflow_git_remote), click on the hyperlinks in the table below to view the files as they were in that past version.

File	Version	Author	Date	Message
Rmd	8df7e43	Benjmain Fair	2022-10-26	update nb
html	8df7e43	Benjmain Fair	2022-10-26	update nb
Rmd	61a245f	Benjmain Fair	2022-10-24	update nb
Rmd	c17b6b2	Benjmain Fair	2022-10-19	add un qqnorm phenotypes
html	c17b6b2	Benjmain Fair	2022-10-19	add un qqnorm phenotypes
Rmd	ef85e53	Benjmain Fair	2022-10-12	update index
Rmd	3201a19	Benjmain Fair	2022-10-12	add IR notebook
html	3201a19	Benjmain Fair	2022-10-12	add IR notebook

Intro

I previously showed that effect size directions for intron retention QTLs puzzingly positively correct with eQTLs (more intron retention = more expression). Somehow I think intron retention QTLs are often picking up on chromatin effects. Let’s break up the intron retention QTLs into groups by the location of the SNP (in a enhancer/promoter vs a splice site) and reassess the concordance of expression effects. I can also look at the same idea with normal leafcutter sQTLs, comparing introns by their annotation type. For example increase in splicing of annotated or “basic” tagged introns might generally increase expression, versus increase in unannoated or “NMD” tagged introns might decrease expression.

library(tidyverse)
library(data.table)
library(qvalue)

First let’s make some basic plots establishing that chRNA has more unannoated and NMD-specific splicing.

NMD.transcript.introns <- read_tsv("../code/SplicingAnalysis/Annotations/NMD/NMD_trancsript_introns.bed.gz", col_names=c("chrom", "start", "stop", "name", "score", "strand")) %>%
  mutate(stop=stop+1) %>%
  unite(intron, chrom:stop, strand)

Non.NMD.transcript.introns <- read_tsv("../code/SplicingAnalysis/Annotations/NMD/NonNMD_trancsript_introns.bed.gz", col_names=c("chrom", "start", "stop", "name", "score", "strand")) %>%
  mutate(stop=stop+1) %>%
  unite(intron, chrom:stop, strand)


NMD.specific.introns <- setdiff(NMD.transcript.introns$intron, Non.NMD.transcript.introns$intron)


SpliceJunctionCountTables <- Sys.glob("../code/SplicingAnalysis/leafcutter/NormalizedPsiTables/PSI.JunctionCounts.*.bed.gz") %>%
  setNames(str_replace(., "../code/SplicingAnalysis/leafcutter/NormalizedPsiTables/PSI.JunctionCounts.(.+?).bed.gz", "\\1")) %>%
  lapply(read_tsv)

lapply(SpliceJunctionCountTables, dim)

$Expression.Splicing
[1] 198246    468

$MetabolicLabelled.30min
[1] 198246     72

$MetabolicLabelled.60min
[1] 198246     72

$chRNA.Expression.Splicing
[1] 198246     93

# SpliceJunctionCountTables$Expression.Splicing %>%
#   mutate(Intron=paste(`#Chrom`, start, end, strand, sep="_")) %>%
#   mutate(Is.NMD.Intron = Intron %in% NMD.specific.introns) %>%
#   group_by(Is.NMD.Intron) %>%
#   summarise_if(is.numeric, sum, na.rm = TRUE) %>%
#   dplyr::select(-start, -end) %>%
#   column_to_rownames("Is.NMD.Intron") %>%
#   t()

Sum.NMD.Intron.Counts <- function(df){
  df %>%
  mutate(Intron=paste(`#Chrom`, start, end, strand, sep="_")) %>%
  mutate(Is.NMD.Intron = Intron %in% NMD.specific.introns) %>%
  group_by(Is.NMD.Intron) %>%
  summarise_if(is.numeric, sum, na.rm = TRUE) %>%
  dplyr::select(-start, -end) %>%
  column_to_rownames("Is.NMD.Intron") %>%
  t() %>%
  as.data.frame() %>%
  rownames_to_column("IndID") %>%
  return()
}

lapply(SpliceJunctionCountTables, Sum.NMD.Intron.Counts) %>%
  bind_rows(.id="Source") %>%
  dplyr::rename(c("NotNMD"="FALSE", "NMD"="TRUE")) %>%
  mutate(Fraction = NMD/(NMD+NotNMD)*100) %>%
  ggplot(aes(x=Source, y=Fraction, color=Source)) +
  geom_boxplot(outlier.position="none") +
  geom_jitter() +
  theme_bw()

Version	Author	Date
c17b6b2	Benjmain Fair	2022-10-19

lapply(SpliceJunctionCountTables, Sum.NMD.Intron.Counts) %>%
  bind_rows(.id="Source") %>%
  dplyr::rename(c("NotNMD"="FALSE", "NMD"="TRUE")) %>%
  mutate(TotalJunctionCounts = NMD+NotNMD) %>%
  ggplot(aes(x=Source, y=TotalJunctionCounts, color=Source)) +
  geom_boxplot(outlier.position="none") +
  geom_jitter() +
  theme_bw() +
  labs(title="Total juncs per dataset", y="Total num juncs")

Version	Author	Date
c17b6b2	Benjmain Fair	2022-10-19

Update:

Yang is asking for the following plots to follow up on this one:

same plot with unannotated junctions

instead of overall, compute % for each gene separately, and plot the CDF

Intron.Annotations.basic <- read_tsv("../code/SplicingAnalysis/regtools_annotate_combined/basic.bed.gz") %>%
  filter(known_junction ==1) %>%
  unite(intron, chrom, start, end, strand)
Introns.Annotations.comprehensive <- read_tsv("../code/SplicingAnalysis/regtools_annotate_combined/comprehensive.bed.gz") %>%
  filter(known_junction ==1) %>%
  unite(intron, chrom, start, end, strand)
Introns.Annotations.all <- read_tsv("../code/SplicingAnalysis/regtools_annotate_combined/comprehensive.bed.gz") %>%
  unite(intron, chrom, start, end, strand)

AddIntronAnnotations <- function(df){
  df %>%
  mutate(intron = str_replace(junc, "^(.+):(.+?):(.+?):clu.+?_([+-])$", "\\1_\\2_\\3_\\4")) %>%
  mutate(IntronAnnotation = case_when(
    intron %in% NMD.specific.introns ~ "Annotated NMD",
    intron %in% Intron.Annotations.basic$intron ~ "Annotated basic",
    intron %in% Introns.Annotations.comprehensive$intron ~ "Annotated Not basic",
    TRUE ~ "Unannotated"
  )) %>%
  dplyr::select(1:6, intron, IntronAnnotation, everything())
}

Long.table <- lapply(SpliceJunctionCountTables, AddIntronAnnotations) %>%
  lapply(pivot_longer,names_to="Sample", values_to="Count", -c(1:8)) %>%
  bind_rows(.id="Dataset")

And now make those plots

P.i.dat <- Long.table %>%
  group_by(Sample, Dataset, IntronAnnotation) %>%
  summarise(SumCounts = sum(Count)) %>%
  ungroup() %>%
  group_by(Sample, Dataset) %>%
  mutate(Percent = SumCounts / sum(SumCounts) * 100) %>%
  ungroup()
P.i <- ggplot(P.i.dat, aes(x=Dataset, y=Percent, color=IntronAnnotation)) +
  geom_jitter(alpha=0.2, size=0.5) +
  geom_boxplot(outlier.shape=NA, color='black', fill=NA) +
  facet_wrap(~IntronAnnotation, scales="free_y") +
  theme_bw() +
  theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust=1)) +
  labs(y="Percent of splice junction reads")
P.i

Version	Author	Date
8df7e43	Benjmain Fair	2022-10-26
3201a19	Benjmain Fair	2022-10-12

genes <- read_tsv("../code/ExpressionAnalysis/polyA/ExpressedGeneList.txt", col_names=c("chrom", "start", "stop", "name")) %>%
  mutate(name = str_replace(name, "^(E.+?)\\..+$", "\\1"))

p.ii.dat.joined <- Introns.Annotations.all %>%
  separate_rows(gene_id, sep = ",") %>%
  mutate(gene_id = str_replace(gene_id, "^(E.+?)\\..+$", "\\1")) %>%
  filter(gene_id %in% genes$name) %>%
  inner_join(Long.table)

p.ii.dat <- p.ii.dat.joined %>%
  group_by(gene_id, Dataset, IntronAnnotation) %>%
  summarise(SumCounts = sum(Count, na.rm=F)) %>%
  ungroup() %>%
  group_by(Dataset, gene_id) %>%
  mutate(Percent = SumCounts / sum(SumCounts, na.rm=F) * 100) %>%
  ungroup()

p.ii <- ggplot(p.ii.dat, aes(x=Percent, color=Dataset)) +
  stat_ecdf() +
  facet_wrap(~IntronAnnotation) +
  theme_bw() +
  labs(y="ecdf (14000 ProteinCodingGenes)", x="Percent splice junctions within gene")
p.ii

Version	Author	Date
8df7e43	Benjmain Fair	2022-10-26
3201a19	Benjmain Fair	2022-10-12

ggsave("figure/orig/SpliceJunctionsByTypeAndDataset.pdf", P.i, height=4, width=5.5)

ggsave("figure/orig/SpliceJunctionsByGeneTypeAndDataset.pdf", p.ii, height=4, width=5.5)

let’s prcoeed to reading in other files

PhenotypeAliases <- read_tsv("../data/Phenotypes_recode_for_Plotting.txt")

PC.ShortAliases <- PhenotypeAliases %>%
  dplyr::select(PC, ShorterAlias) %>% deframe()

coloc.results <- read_tsv("../code/hyprcoloc/Results/ForColoc/MolColocStandard/results.txt.gz")

coloc.results.tidycolocalized <- read_tsv("../code/hyprcoloc/Results/ForColoc/MolColocStandard/tidy_results_OnlyColocalized.txt.gz") %>%
  separate(phenotype_full, into=c("PC", "P"), sep=";")

finemap.snps.annotated <- read_tsv("../code/QTL_SNP_Enrichment/FinemapIntersections/MolColocStandard.bed.gz", col_names=c("SNPchrom", "SNPstart", "SNPstop", "SNP_iteration_locus", "FinemapPP", "AnnotationChrom", "AnnotationStart", "AnnotatioStop", "AnnotationClass", "Overlap")) %>%
  dplyr::select(-Overlap)

Count chromatin-colocalizing and splicing-colocalizing eQTLs, and recreate previous observation about concordant effects

coloc.results.tidycolocalized %>%
  group_by(Locus, snp) %>%
  filter(any(PC=="Expression.Splicing.Subset_YRI")) %>%
  summarise(
    ContainsChromatinEqtl = any(PC %in% c("H3K27AC", "H3K4ME1", "H3K4ME3")),
    ContainsSqtl = any(PC %in% c("polyA.Splicing.Subset_YRI", "chRNA.Splicing", "chRNA.IER"))
    ) %>%
  ggplot(aes(x=1, fill=paste(ContainsChromatinEqtl, ContainsSqtl))) +
  geom_bar() +
  labs(title="More chromatin localization with eQTLs than splicing", y="Number of colocalizing eQTLs") +
  theme_classic() +
  theme(axis.title.x=element_blank(),
        axis.text.x=element_blank(),
        axis.ticks.x=element_blank())

Version	Author	Date
8df7e43	Benjmain Fair	2022-10-26
3201a19	Benjmain Fair	2022-10-12

coloc.results.tidycolocalized %>%
  group_by(Locus, snp) %>%
  filter(any(PC=="Expression.Splicing.Subset_YRI")) %>%
  ungroup() %>%
  filter(PC %in% c("Expression.Splicing.Subset_YRI","H3K27AC", "H3K4ME1", "H3K4ME3", "polyA.Splicing.Subset_YRI", "chRNA.Splicing", "chRNA.IER")) %>%
  left_join(., ., by=c("Locus", "snp")) %>%
  filter(!((P.x == P.y) & (PC.x == PC.y))) %>% 
  group_by(Locus, snp) %>%
  mutate(
    ContainsChromatinEqtl = any(PC.x %in% c("H3K27AC", "H3K4ME1", "H3K4ME3")),
    ContainsSqtl = any(PC.x %in% c("polyA.Splicing.Subset_YRI", "chRNA.Splicing", "chRNA.IER"))
    ) %>%
  ungroup() %>%
  mutate(Contains.eQTL_Contains.sQTL = paste(ContainsChromatinEqtl, ContainsSqtl)) %>%
  group_by(Contains.eQTL_Contains.sQTL, PC.x, PC.y) %>%
  summarise(cor = cor(beta.x, beta.y, method="spearman")) %>%
  mutate(PC.x = recode(PC.x, !!!PC.ShortAliases)) %>%
  mutate(PC.y = recode(PC.y, !!!PC.ShortAliases)) %>%
  ggplot(aes(x=PC.x, y=PC.y, fill=cor)) +
  geom_raster() +
  scale_fill_gradient2() +
  facet_wrap(~Contains.eQTL_Contains.sQTL) +
  scale_x_discrete(expand=c(0,0)) +
  scale_y_discrete(expand=c(0,0)) +
  theme_classic() +
  theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust=1)) +
  labs(x="TraitA Phenotype class", y="TraitB Phenotype class", fill="Spearman cor", title="Effect size correlation",
       caption = "FacetTitle: IsChromatinQTL IsSplicingQTL")

Version	Author	Date
8df7e43	Benjmain Fair	2022-10-26

Now split QTLs by location of SNP as either in splice site, in enhancer/promoter, or neither

#annotation types
finemap.snps.annotated$AnnotationClass %>% unique()

 [1] "SpliceBranchpointRegion_0" "10_Txn_Elongation"        
 [3] "6_Weak_Enhancer"           "SpliceDonor_0"            
 [5] "2_Weak_Promoter"           "11_Weak_Txn"              
 [7] "1_Active_Promoter"         "12_Repressed"             
 [9] "4_Strong_Enhancer"         "13_Heterochrom/lo"        
[11] "SpliceAcceptor_0"          "SpliceBranchpointRegion_1"
[13] "ncRNA_coRNA"               "ncRNA_pseudo"             
[15] "9_Txn_Transition"          "SpliceAcceptor_1"         
[17] "ncRNA_srtRNA"              "8_Insulator"              
[19] "ncRNA_uaRNA"               "5_Strong_Enhancer"        
[21] "7_Weak_Enhancer"           "SpliceDonor_1"            
[23] "PAS_Region"                "ncRNA_incRNA"             
[25] "3_Poised_Promoter"         "."                        
[27] "14_Repetitive/CNV"         "ncRNA_ctRNA"              
[29] "15_Repetitive/CNV"         "ncRNA_lncRNA"             
[31] "ncRNA_rtRNA"               "ncRNA_snoRNA"

coloc.results.tidycolocalized %>%
  group_by(Locus, snp) %>%
  filter(any(PC=="Expression.Splicing.Subset_YRI")) %>%
  ungroup() %>%
  filter(PC %in% c("Expression.Splicing.Subset_YRI","H3K27AC", "H3K4ME1", "H3K4ME3", "polyA.Splicing.Subset_YRI", "chRNA.Splicing", "chRNA.IER")) %>%
  left_join(., ., by=c("Locus", "snp")) %>%
  filter(!((P.x == P.y) & (PC.x == PC.y))) %>% 
  group_by(Locus, snp) %>%
  mutate(
    ContainsChromatinEqtl = any(PC.x %in% c("H3K27AC", "H3K4ME1", "H3K4ME3")),
    ContainsSqtl = any(PC.x %in% c("polyA.Splicing.Subset_YRI", "chRNA.Splicing", "chRNA.IER"))
    ) %>%
  ungroup() %>%
  mutate(Contains.eQTL_Contains.sQTL = paste(ContainsChromatinEqtl, ContainsSqtl)) %>%
  dplyr::select(-iteration.y) %>%
  left_join(
    finemap.snps.annotated %>%
      dplyr::select(SNP_iteration_locus, FinemapPP, AnnotationClass) %>%
      separate(SNP_iteration_locus, into=c("snp", "iteration.x", "Locus"), convert=T, sep="_")
  ) %>%
  filter(FinemapPP >0.25) %>%
  mutate(AnnotationSuperclass = case_when(
    str_detect(AnnotationClass, "Splice.+_1") ~ "Annotated SS",
    str_detect(AnnotationClass, "Splice.+_0$") ~ "Unannotated SS",
    str_detect(AnnotationClass, "Enhancer") ~ "Enhancer",
    str_detect(AnnotationClass, "Promoter") ~ "Promoter",
    TRUE ~ "Other"
  )) %>%
  mutate(PC.x = recode(PC.x, !!!PC.ShortAliases)) %>%
  mutate(PC.y = recode(PC.y, !!!PC.ShortAliases)) %>%
  mutate(Contains.eQTL_Contains.sQTL = recode(Contains.eQTL_Contains.sQTL,
                                              !!!c("TRUE TRUE"="hQTL+sQTL",
                                                   "TRUE FALSE"="hQTL only",
                                                   "FALSE TRUE"="sQTL only"))) %>%
  group_by(Contains.eQTL_Contains.sQTL, PC.x, PC.y, AnnotationSuperclass) %>%
  summarise(cor = cor(beta.x, beta.y, method="spearman"), n=n()) %>%
  ggplot(aes(x=PC.x, y=PC.y, fill=cor)) +
  geom_raster() +
  geom_text(aes(label=n), size=1.5) +
  scale_fill_gradient2() +
  facet_grid(rows=vars(Contains.eQTL_Contains.sQTL), cols=vars(AnnotationSuperclass)) +
  scale_x_discrete(expand=c(0,0)) +
  scale_y_discrete(expand=c(0,0)) +
  theme_classic() +
  theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust=1)) +
  labs(x="TraitA Phenotype class", y="TraitB Phenotype class", fill="Spearman cor", title="Effect size correlation for eQTLs that coloc with something",
       caption = str_wrap("FacetCols=Annotation class of SNP (Onlyfinemap PP >0.25 included). FacetRows=Type of eQTL"))

Version	Author	Date
8df7e43	Benjmain Fair	2022-10-26
c17b6b2	Benjmain Fair	2022-10-19

I’m quite surprised just by the numbers of hQTL eQTLs with high finemap PP in unannoated splice site regions. This is clearly a really big annotation set that also includes lots of enhancer regions. This unannoated splice site region was defined as at least one spliced read mapping across all the data. Perhaps I shouldn’t be surprised, as any region is sufficeintly transcribed (including enhancers) will at some rate have some splice sites.

And even among the 7 intron retention QTLs that coloc with an eQTL but not hQTL, and are finemapped to annotated splice sites, the the direction of effects is if anything opposite what I was expecting

coloc.results.tidycolocalized %>%
  group_by(Locus, snp) %>%
  filter(any(PC=="Expression.Splicing.Subset_YRI")) %>%
  ungroup() %>%
  filter(PC %in% c("Expression.Splicing.Subset_YRI","H3K27AC", "H3K4ME1", "H3K4ME3", "polyA.Splicing.Subset_YRI", "chRNA.Splicing", "chRNA.IER")) %>%
  left_join(., ., by=c("Locus", "snp")) %>%
  filter(!((P.x == P.y) & (PC.x == PC.y))) %>% 
  group_by(Locus, snp) %>%
  mutate(
    ContainsChromatinEqtl = any(PC.x %in% c("H3K27AC", "H3K4ME1", "H3K4ME3")),
    ContainsSqtl = any(PC.x %in% c("polyA.Splicing.Subset_YRI", "chRNA.Splicing", "chRNA.IER"))
    ) %>%
  ungroup() %>%
  mutate(Contains.eQTL_Contains.sQTL = paste(ContainsChromatinEqtl, ContainsSqtl)) %>%
  dplyr::select(-iteration.y) %>%
  left_join(
    finemap.snps.annotated %>%
      dplyr::select(SNP_iteration_locus, FinemapPP, AnnotationClass) %>%
      separate(SNP_iteration_locus, into=c("snp", "iteration.x", "Locus"), convert=T, sep="_")
  ) %>%
  filter(FinemapPP >0.25) %>%
  mutate(AnnotationSuperclass = case_when(
    str_detect(AnnotationClass, "Splice.+_1") ~ "Annotated SS",
    str_detect(AnnotationClass, "Splice.+_0$") ~ "Unannotated SS",
    str_detect(AnnotationClass, "Enhancer") ~ "Enhancer",
    str_detect(AnnotationClass, "Promoter") ~ "Promoter",
    TRUE ~ "Other"
  )) %>%
  mutate(PC.x = recode(PC.x, !!!PC.ShortAliases)) %>%
  mutate(PC.y = recode(PC.y, !!!PC.ShortAliases)) %>%
  mutate(Contains.eQTL_Contains.sQTL = recode(Contains.eQTL_Contains.sQTL,
                                              !!!c("TRUE TRUE"="hQTL+sQTL",
                                                   "TRUE FALSE"="hQTL only",
                                                   "FALSE TRUE"="sQTL only"))) %>%
  filter(Contains.eQTL_Contains.sQTL=="sQTL only" & AnnotationSuperclass=="Annotated SS") %>%
  ggplot(aes(x=beta.x, y=beta.y)) +
  geom_point() +
  geom_vline(xintercept=0) +
  geom_hline(yintercept=0) +
  facet_grid(rows=vars(PC.x), cols=vars(PC.y)) +
  theme_bw() +
  labs(x="TraitA Phenotype class", y="TraitB Phenotype class", title="Effect size correlation for eQTLs/sQTLs that are not hQTLs",
       caption = str_wrap("Only SNPs with finemap PP > 0.25 in annotated splice site SNP included"))

Version	Author	Date
c17b6b2	Benjmain Fair	2022-10-19

Note there are only a few points of chRNA.IR intron retention with SNPs in splice sites that coloc with eQTL.

Let’s check how these plots look for NMD-specific introns…

dat.to.plot <- coloc.results.tidycolocalized %>%
  group_by(Locus, snp) %>%
  filter(any(PC=="Expression.Splicing.Subset_YRI")) %>%
  ungroup() %>%
  filter(PC %in% c("Expression.Splicing.Subset_YRI","H3K27AC", "H3K4ME1", "H3K4ME3", "polyA.Splicing.Subset_YRI", "chRNA.Splicing", "chRNA.IER")) %>%
  left_join(., ., by=c("Locus", "snp")) %>%
  filter(!((P.x == P.y) & (PC.x == PC.y))) %>% 
  group_by(Locus, snp) %>%
  mutate(
    ContainsChromatinEqtl = any(PC.x %in% c("H3K27AC", "H3K4ME1", "H3K4ME3")),
    ContainsSqtl = any(PC.x %in% c("polyA.Splicing.Subset_YRI", "chRNA.Splicing", "chRNA.IER"))
    ) %>%
  ungroup() %>%
  mutate(Contains.eQTL_Contains.sQTL = paste(ContainsChromatinEqtl, ContainsSqtl)) %>%
  dplyr::select(-iteration.y) %>%
  left_join(
    finemap.snps.annotated %>%
      dplyr::select(SNP_iteration_locus, FinemapPP, AnnotationClass) %>%
      separate(SNP_iteration_locus, into=c("snp", "iteration.x", "Locus"), convert=T, sep="_")
  ) %>%
  filter(FinemapPP >0.25) %>%
  mutate(AnnotationSuperclass = case_when(
    str_detect(AnnotationClass, "Splice.+_1") ~ "Annotated SS",
    str_detect(AnnotationClass, "Splice.+_0$") ~ "Unannotated SS",
    str_detect(AnnotationClass, "Enhancer") ~ "Enhancer",
    str_detect(AnnotationClass, "Promoter") ~ "Promoter",
    TRUE ~ "Other"
  )) %>%
  mutate(PC.x = recode(PC.x, !!!PC.ShortAliases)) %>%
  mutate(PC.y = recode(PC.y, !!!PC.ShortAliases)) %>%
  mutate(Contains.eQTL_Contains.sQTL = recode(Contains.eQTL_Contains.sQTL,
                                              !!!c("TRUE TRUE"="hQTL+sQTL",
                                                   "TRUE FALSE"="hQTL only",
                                                   "FALSE TRUE"="sQTL only"))) %>%
  filter(Contains.eQTL_Contains.sQTL=="sQTL only") %>%
  mutate(New.x = str_replace(P.x, "^(.+?):(.+?):(.+?):.+?_([-+])$", "chr\\1_\\2_\\3_\\4")) %>%
  mutate(New.y = str_replace(P.x, "^(.+?):(.+?):(.+?):.+?_([-+])$", "chr\\1_\\2_\\3_\\4")) %>%
  mutate(ContainsNMDJunc = (New.x %in% NMD.specific.introns) | (New.y %in% NMD.specific.introns)) %>%
  arrange(ContainsNMDJunc)
ggplot(dat.to.plot, aes(x=beta.x, y=beta.y, color=ContainsNMDJunc)) +
  geom_point(alpha=0.5) +
  geom_vline(xintercept=0) +
  geom_hline(yintercept=0) +
  geom_smooth(method='lm') +
  facet_grid(rows=vars(PC.x), cols=vars(PC.y)) +
  theme_bw() +
  labs(x="TraitA Phenotype class", y="TraitB Phenotype class", caption="Effect size correlation for eQTLs/sQTLs that are not hQTLs", title="sQTLs that increase annotated NMD junctions coloc with downward eQTLs")

Version	Author	Date
8df7e43	Benjmain Fair	2022-10-26
c17b6b2	Benjmain Fair	2022-10-19

So there’s clearly a negative correlation between NMD-specific intron splicing and expression. But there’s only so many NMD introns annotated, and I expect a lot of splice changes due to genetic variation to be unannotated. Additionally, a lot of the red dots might the necessary down-regulated junctions when a NMD-junction gets up-regulated.

Let’s plot all the blue points with pygenometracks

dir.create("../code/scratch/PlotNMDsQTLeQTLs/")

dat.to.plot.filtered <- dat.to.plot %>%
  filter(ContainsNMDJunc) %>%
  filter(PC.y %in% "Expression_polyA") %>%
  distinct(Locus, .keep_all=T)

#Write out all juncs in clusters
dat.to.plot.filtered %>%
  mutate(gid = str_replace(P.x, "^(.+?):.+?:.+?:(.+?)", "chr\\1_\\2")) %>%
  inner_join(
    SpliceJunctionCountTables$Expression.Splicing %>% dplyr::select(junc, gid)
  ) %>%
  dplyr::select(junc) %>%
  separate(junc, into=c("chrom", "start", "stop", "name"), sep=":", convert=T) %>%
  arrange(chrom, start, stop) %>%
  write_tsv("../code/scratch/PlotNMDsQTLeQTLs/.juncs.bed")

# # Write out NMD introns
# dat.to.plot.filtered %>%
#   dplyr::select(junc = P.x) %>%
#   separate(junc, into=c("chrom", "start", "stop", "name"), sep=":", convert=T) %>%
#   mutate(chrom=paste0("chr", chrom)) %>%
#   dplyr::select(chrom, start=newstart, stop) %>%
#   arrange(chrom, start, stop) %>%
#   write_tsv("../code/scratch/PlotNMDsQTLeQTLs/.NMD.introns.bed", col_names = F)
# 
# # Write out ini file for NMD introns track
# fileConn<-file("../code/scratch/PlotNMDsQTLeQTLs/.NMD.introns.ini")
# writeLines(c("[NMD.introns]","file_type = bed", "type = vlines", "color=red","file = scratch/PlotNMDsQTLeQTLs/.NMD.introns.bed"), con=fileConn)
# close(fileConn)

# Write out NMD introns
dat.to.plot.filtered %>%
  dplyr::select(junc = P.x) %>%
  separate(junc, into=c("chrom", "start", "stop", "name"), sep=":", convert=T) %>%
  mutate(chrom=paste0("chr", chrom)) %>%
  arrange(chrom, start, stop) %>%
  write_tsv("../code/scratch/PlotNMDsQTLeQTLs/.NMD.introns.bed", col_names = F)

# Write out ini file for NMD introns track
fileConn<-file("../code/scratch/PlotNMDsQTLeQTLs/.NMD.introns.ini")
writeLines(c("[NMD.introns]","file_type = bed", "color=red","file = scratch/PlotNMDsQTLeQTLs/.NMD.introns.bed"), con=fileConn)
close(fileConn)

# Write out bash script
dat.to.plot.filtered %>%
  separate(P.x, into=c("chrom", "start", "stop", "name"), sep=":", convert=T) %>%
  mutate(chrom=paste0("chr", chrom)) %>%
  mutate(min=start-10000, max=stop+10000) %>%
  mutate(SnpPos = str_replace(snp, "^(.+?:.+?):.+?:.+?$", "chr\\1")) %>%
  mutate(rn = str_pad(row_number(), 2, pad=0)) %>%
  mutate(cmd = str_glue("python scripts/GenometracksByGenotype/AggregateBigwigsForPlotting.py --VCF Genotypes/1KG_GRCh38/Autosomes.vcf.gz --SnpPos {SnpPos} --GroupSettingsFile Metaplots/bwGroups.tsv --BigwigList Metaplots/bwList.tsv --Normalization WholeGenome --Region {chrom}:{min}-{max} --BigwigListType KeyFile --OutputPrefix scratch/PlotNMDsQTLeQTLs/. -v --Bed12GenesToIni scripts/GenometracksByGenotype/PremadeTracks/gencode.v26.FromGTEx.genes.bed12.gz  --FilterJuncsByBed scratch/PlotNMDsQTLeQTLs/.juncs.bed\npyGenomeTracks --tracks <(cat scratch/PlotNMDsQTLeQTLs/.tracks.ini scratch/PlotNMDsQTLeQTLs/.NMD.introns.ini) --out scratch/PlotNMDsQTLeQTLs/Plot_{rn}_{Locus}.pdf --region {chrom}:{min}-{max} --trackLabelFraction 0.15\n\n")) %>%
  select(cmd) %>%
  write.table("../code/scratch/PlotNMDsQTLeQTLs/MakePlots.sh", quote=F, row.names = F, col.names = F)

..And plot them

conda activate GenometracksByGenotype
cd /project2/yangili1/bjf79/ChromatinSplicingQTLs/code
bash scratch/PlotNMDsQTLeQTLs/MakePlots.sh

And Let’s also expand these results by taking all NMD sQTLs and consider the beta on gene expression…

sQTLs <- c("../code/QTLs/QTLTools/polyA.Splicing.Subset_YRI/PermutationPassForColoc.txt.gz", "../code/QTLs/QTLTools/chRNA.Splicing/PermutationPassForColoc.txt.gz") %>%
  setNames(str_replace(., "../code/QTLs/QTLTools/(.+?)/PermutationPassForColoc.txt.gz", "\\1")) %>%
  lapply(fread) %>%
  bind_rows(.id="source") %>%
  mutate(gene = str_replace(phe_id, ".+:(EN.+?)$", "\\1")) %>%
  mutate(intron = str_replace(phe_id, "^(.+):(.+?):(.+?):clu.+?_([+-]):EN.+$", "chr\\1_\\2_\\3_\\4")) %>%
  group_by(source) %>%
  mutate(q = qvalue(adj_emp_pval)$qvalues) %>%
  ungroup()


eQTLs <- c("../code/QTLs/QTLTools/Expression.Splicing.Subset_YRI/PermutationPassForColoc.txt.gz", "../code/QTLs/QTLTools/chRNA.Expression.Splicing/PermutationPassForColoc.txt.gz") %>%
  setNames(str_replace(., "../code/QTLs/QTLTools/(.+?)/PermutationPassForColoc.txt.gz", "\\1")) %>%
  lapply(fread) %>%
  bind_rows(.id="source") %>%
  mutate(gene = str_replace(phe_id, ".+:(EN.+?)$", "\\1")) %>%
  group_by(source) %>%
  mutate(q = qvalue(adj_emp_pval)$qvalues) %>%
  ungroup()

H3K4me3.QTLs <- c("../code/QTLs/QTLTools/H3K4ME3/PermutationPassForColoc.txt.gz") %>%
  setNames(str_replace(., "../code/QTLs/QTLTools/(.+?)/PermutationPassForColoc.txt.gz", "\\1")) %>%
  lapply(fread) %>%
  bind_rows(.id="source") %>%
  mutate(gene = str_replace(phe_id, ".+:(EN.+?)$", "\\1")) %>%
  mutate(gene_peak = str_replace(phe_id, "^(.+?):(.+?)$", "\\2;\\1")) %>%
  group_by(source) %>%
  mutate(q = qvalue(adj_emp_pval)$qvalues) %>%
  ungroup()

PeaksToTSS <- Sys.glob("../code/Misc/PeaksClosestToTSS/*_assigned.tsv.gz") %>%
  setNames(str_replace(., "../code/Misc/PeaksClosestToTSS/(.+?)_assigned.tsv.gz", "\\1")) %>%
  lapply(read_tsv) %>%
  bind_rows(.id="ChromatinMark") %>%
  mutate(GenePeakPair = paste(gene, peak, sep = ";")) %>%
  distinct(ChromatinMark, peak, gene, .keep_all=T)

inner_join(eQTLs, sQTLs, by="gene", suffix=c(".eQTL", ".sQTL")) %>%
  mutate(Is.NMD.intron = intron %in% NMD.specific.introns) %>%
  filter(q.eQTL < 0.1 & q.sQTL < 0.1) %>%
  mutate(source.eQTL = recode(source.eQTL, !!!c("polyA.Expression.YRI"="Expression.Splicing.Subset_YRI", "chRNA.Expression"="chRNA.Expression.Splicing"))) %>%
  arrange(Is.NMD.intron) %>%
  ggplot(aes(x=slope.eQTL, y=slope.sQTL, color=Is.NMD.intron)) +
  geom_point(alpha=0.5) +
  geom_vline(xintercept=0) +
  geom_hline(yintercept=0) +
  geom_smooth(method='lm') +
  facet_grid(rows = vars(source.eQTL), cols=vars(source.sQTL)) +
  theme_bw() +
  labs(title="genetic effects of NMD introns and host gene expression", x="beta, eQTL", y="beta, sQTL", caption="Only FDR<10% sQTLs and eQTLs plotted")

Version	Author	Date
8df7e43	Benjmain Fair	2022-10-26
c17b6b2	Benjmain Fair	2022-10-19

inner_join(eQTLs, H3K4me3.QTLs, by="gene", suffix=c(".eQTL", ".sQTL")) %>%
  filter(gene_peak %in% PeaksToTSS$GenePeakPair) %>%
  filter(q.eQTL < 0.1 & q.sQTL < 0.1) %>%
  mutate(source.eQTL = recode(source.eQTL, !!!c("polyA.Expression.YRI"="Expression.Splicing.Subset_YRI", "chRNA.Expression"="chRNA.Expression.Splicing"))) %>%
  ggplot(aes(x=slope.eQTL, y=slope.sQTL)) +
  geom_point(alpha=0.5) +
  geom_vline(xintercept=0) +
  geom_hline(yintercept=0) +
  geom_smooth(method='lm') +
  facet_grid(rows = vars(source.eQTL), cols=vars(source.sQTL)) +
  theme_bw() +
  labs(title="genetic effects of H3K4ME promoters and host gene expression", x="beta, eQTL", y="beta, H3K4ME3 QTL", caption="Only FDR<10% H3K4ME3 QTLs and eQTLs plotted")

Version	Author	Date
8df7e43	Benjmain Fair	2022-10-26
c17b6b2	Benjmain Fair	2022-10-19

Wait, upon further thought, I think I didn’t make the above plots correctly. I was just plotting the beta for the topSNP for each feature, not the same SNP between sQTL and eQTL, so the directions of SNP effect (which are always ALT/REF) might not be polarized the same way.

TopSNPEffects.ByPairs <- fread("../code/pi1/PairwisePi1Traits.P.all.txt.gz")

TopSNPEffects.ByPairs %>%
  filter(
      (PC1 %in% c("chRNA.Expression.Splicing", "Expression.Splicing.Subset_YRI")) &
      (PC2 %in% c("polyA.Splicing.Subset_YRI", "chRNA.Splicing")) &
      (FDR.x < 0.1) &
      (FDR.y < 0.1)) %>%
  mutate(source.eQTL = recode(PC1, !!!c("Expression.Splicing.Subset_YRI"="polyA.Expression.YRI", "chRNA.Expression.Splicing"="chRNA.Expression"))) %>%
  rename(source.sQTL=PC2) %>%
  mutate(intron = str_replace(P2, "^(.+):(.+?):(.+?):clu.+?_([+-])$", "chr\\1_\\2_\\3_\\4")) %>%
  mutate(Is.NMD.intron = intron %in% NMD.specific.introns) %>%
  arrange(Is.NMD.intron) %>%
  ggplot(aes(x=beta.x, y=trait.x.beta.in.y
, color=Is.NMD.intron)) +
  geom_point(alpha=0.5) +
  geom_vline(xintercept=0) +
  geom_hline(yintercept=0) +
  geom_smooth(method='lm') +
  facet_grid(rows = vars(source.eQTL), cols=vars(source.sQTL)) +
  theme_bw() +
  labs(title="genetic effects of NMD introns and host gene expression", x="beta, eQTL", y="beta, sQTL", caption="Only FDR<10% sQTLs and eQTLs plotted. sQTL effects are relative to top eQTL SNP")

Version	Author	Date
8df7e43	Benjmain Fair	2022-10-26
c17b6b2	Benjmain Fair	2022-10-19

TopSNPEffects.ByPairs %>%
  filter(
      (PC2 %in% c("chRNA.Expression.Splicing", "Expression.Splicing.Subset_YRI")) &
      (PC1 %in% c("polyA.Splicing.Subset_YRI", "chRNA.Splicing")) &
      (FDR.x < 0.1) &
      (FDR.y < 0.1)) %>%
  mutate(source.eQTL = recode(PC2, !!!c("Expression.Splicing.Subset_YRI"="polyA.Expression.YRI", "chRNA.Expression.Splicing"="chRNA.Expression"))) %>%
  rename(source.sQTL=PC1) %>%
  mutate(intron = str_replace(P1, "^(.+):(.+?):(.+?):clu.+?_([+-])$", "chr\\1_\\2_\\3_\\4")) %>%
  mutate(Is.NMD.intron = intron %in% NMD.specific.introns) %>%
  arrange(Is.NMD.intron) %>%
  ggplot(aes(y=beta.x, x=trait.x.beta.in.y
, color=Is.NMD.intron)) +
  geom_point(alpha=0.5) +
  geom_vline(xintercept=0) +
  geom_hline(yintercept=0) +
  geom_smooth(method='lm') +
  facet_grid(rows = vars(source.eQTL), cols=vars(source.sQTL)) +
  theme_bw() +
  labs(title="genetic effects of NMD introns and host gene expression", x="beta, eQTL", y="beta, sQTL", caption="Only FDR<10% sQTLs and eQTLs plotted. eQTL effects are relative to top sQTL SNP")

Version	Author	Date
8df7e43	Benjmain Fair	2022-10-26

I think I’ll want to further break down introns into - annotatic basic transcripts - annotated transripts (not NMD) - annotated NMD - unannotated

First read in the annotations…

Intron.Annotations.basic <- read_tsv("../code/SplicingAnalysis/regtools_annotate_combined/basic.bed.gz") %>%
  filter(known_junction ==1) %>%
  unite(intron, chrom, start, end, strand)
Introns.Annotations.comprehensive <- read_tsv("../code/SplicingAnalysis/regtools_annotate_combined/comprehensive.bed.gz") %>%
  filter(known_junction ==1) %>%
  unite(intron, chrom, start, end, strand)

Now remake the plots

TopSNPEffects.ByPairs %>%
  filter(
      (PC1 %in% c("chRNA.Expression.Splicing", "Expression.Splicing.Subset_YRI")) &
      (PC2 %in% c("polyA.Splicing.Subset_YRI", "chRNA.Splicing")) &
      (FDR.x < 0.1) &
      (FDR.y < 0.1)) %>%
  mutate(source.eQTL = recode(PC1, !!!c("Expression.Splicing.Subset_YRI"="polyA.Expression.YRI", "chRNA.Expression.Splicing"="chRNA.Expression"))) %>%
  rename(source.sQTL=PC2) %>%
  mutate(intron = str_replace(P2, "^(.+):(.+?):(.+?):clu.+?_([+-])$", "chr\\1_\\2_\\3_\\4")) %>%
  mutate(IntronAnnotation = case_when(
    intron %in% NMD.specific.introns ~ "Annotated NMD",
    intron %in% Intron.Annotations.basic$intron ~ "Annotated basic",
    intron %in% Introns.Annotations.comprehensive$intron ~ "Annotated Not basic",
    TRUE ~ "Unannotated"
  )) %>%
  count(IntronAnnotation) %>%
  ggplot(aes(x=IntronAnnotation, y=n, fill=IntronAnnotation)) +
  geom_col() +
  theme_bw() +
  theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust=1)) +
  labs(title="Number sQTLs that are in an eGene", x="Intron Type", caption="FDR10% for sQTL and eQTL")

Version	Author	Date
8df7e43	Benjmain Fair	2022-10-26
c17b6b2	Benjmain Fair	2022-10-19

TopSNPEffects.ByPairs %>%
  filter(
      (PC1 %in% c("chRNA.Expression.Splicing", "Expression.Splicing.Subset_YRI")) &
      (PC2 %in% c("polyA.Splicing.Subset_YRI", "chRNA.Splicing")) &
      (FDR.x < 0.1) &
      (FDR.y < 0.1)) %>%
  mutate(source.eQTL = recode(PC1, !!!c("Expression.Splicing.Subset_YRI"="polyA.Expression.YRI", "chRNA.Expression.Splicing"="chRNA.Expression"))) %>%
  rename(source.sQTL=PC2) %>%
  mutate(intron = str_replace(P2, "^(.+):(.+?):(.+?):clu.+?_([+-])$", "chr\\1_\\2_\\3_\\4")) %>%
  mutate(IntronAnnotation = case_when(
    intron %in% NMD.specific.introns ~ "Annotated NMD",
    intron %in% Intron.Annotations.basic$intron ~ "Annotated basic",
    intron %in% Introns.Annotations.comprehensive$intron ~ "Annotated Not basic",
    TRUE ~ "Unannotated"
  )) %>%
  ggplot(aes(y=beta.x, x=trait.x.beta.in.y
, color=IntronAnnotation)) +
  geom_point(alpha=0.5) +
  geom_vline(xintercept=0) +
  geom_hline(yintercept=0) +
  geom_smooth(method='lm') +
  facet_grid(rows = vars(source.eQTL), cols=vars(source.sQTL)) +
  theme_bw() +
  labs(title="genetic effects of NMD introns and host gene expression", x="beta, sQTL", y="beta, eQTL", caption="Only FDR<10% sQTLs and eQTLs plotted. sQTL effects are relative to top eQTL SNP")

Version	Author	Date
8df7e43	Benjmain Fair	2022-10-26

TopSNPEffects.ByPairs %>%
  filter(
      (PC2 %in% c("chRNA.Expression.Splicing", "Expression.Splicing.Subset_YRI")) &
      (PC1 %in% c("polyA.Splicing.Subset_YRI", "chRNA.Splicing")) &
      (FDR.x < 0.1) &
      (FDR.y < 0.1)) %>%
  mutate(source.eQTL = recode(PC2, !!!c("Expression.Splicing.Subset_YRI"="polyA.Expression.YRI", "chRNA.Expression.Splicing"="chRNA.Expression"))) %>%
  rename(source.sQTL=PC1) %>%
  mutate(intron = str_replace(P1, "^(.+):(.+?):(.+?):clu.+?_([+-])$", "chr\\1_\\2_\\3_\\4")) %>%
  mutate(IntronAnnotation = case_when(
    intron %in% NMD.specific.introns ~ "Annotated NMD",
    intron %in% Intron.Annotations.basic$intron ~ "Annotated basic",
    intron %in% Introns.Annotations.comprehensive$intron ~ "Annotated Not basic",
    TRUE ~ "Unannotated"
  )) %>%
  ggplot(aes(y=beta.x, x=trait.x.beta.in.y
, color=IntronAnnotation)) +
  geom_point(alpha=0.5) +
  geom_vline(xintercept=0) +
  geom_hline(yintercept=0) +
  geom_smooth(method='lm') +
  facet_grid(rows = vars(source.eQTL), cols=vars(source.sQTL)) +
  theme_bw() +
  labs(title="genetic effects of NMD introns and host gene expression", x="beta, sQTL", y="beta, eQTL", caption="Only FDR<10% sQTLs and eQTLs plotted. eQTL effects are relative to top sQTL SNP")

Version	Author	Date
8df7e43	Benjmain Fair	2022-10-26

There’s a few things I could do to improve this… one being that the effect sizes are standardized and therefore hard to interpret. Plotting eQTL effect sizes as logFC and splicing effect sizes as deltaPSI could be more interpretable. Also, Yang suggesting exploring these ideas by plotting things as QQ plots… let’s do that now…

For example, let’s plot eQTL P-values for different groups of SNPs, including sQTLs, hQTLs, etc…

test.SNPs <- paste0("../code/QTLs/QTLTools/", c("chRNA.Expression.Splicing", "Expression.Splicing.Subset_YRI"), "/NominalPassForColoc.RandomSamplePvals.txt.gz") %>%
  setNames(str_replace(., "../code/QTLs/QTLTools/(.+?)/NominalPassForColoc.RandomSamplePvals.txt.gz", "\\1")) %>%
  lapply(read_tsv, col_names=c("trait.x.p.in.y")) %>%
  bind_rows(.id="PC2") %>%
  mutate(PC1 = "TestSNPs") %>%
  group_by(PC2) %>%
  sample_n(5000) %>%
  ungroup()

TopSNPEffects.ByPairs %>%
  filter(
      (PC2 %in% c("chRNA.Expression.Splicing", "Expression.Splicing.Subset_YRI")) &
      (PC1 %in% c("polyA.Splicing.Subset_YRI", "chRNA.Splicing", "H3K4ME3", "H3K27AC")) &
      (FDR.x < 0.1)) %>%
  bind_rows(test.SNPs) %>%
  mutate(intron = case_when(
    PC1 %in% c("polyA.Splicing.Subset_YRI", "chRNA.Splicing") ~ str_replace(P1, "^(.+):(.+?):(.+?):clu.+?_([+-])$", "chr\\1_\\2_\\3_\\4"),
    TRUE ~ NA_character_)) %>%
  mutate(IntronAnnotation = case_when(
    intron %in% NMD.specific.introns ~ "Annotated NMD",
    intron %in% Intron.Annotations.basic$intron ~ "Annotated basic",
    intron %in% Introns.Annotations.comprehensive$intron ~ "Annotated Not basic",
    !is.na(intron) ~ "Unannotated",
    TRUE ~ ""
  )) %>%
  mutate(SnpSet = paste(PC1, IntronAnnotation)) %>%
  group_by(SnpSet, PC2) %>%
  mutate(ExpectedP = percent_rank(trait.x.p.in.y)) %>%
  ungroup() %>%
  ggplot(aes(x=-log10(ExpectedP), y=-log10(trait.x.p.in.y), color=SnpSet)) +
  geom_abline() +
  geom_point() +
  facet_wrap(~PC2) +
  theme_bw() +
  labs(title="QQ plot of eQTL P-values", color="xQTL SNP", y="-log10(P)")

Version	Author	Date
8df7e43	Benjmain Fair	2022-10-26

Ok, hmm, is that even informative.. Well there is eQTL enrichment in all the types of QTLs tested. Let’s try breaking out the QQ plots by concordance of effect directions… I’ll plot the two mirror image QQ plots by concordance of effect directions

QQ.plot.dat <- TopSNPEffects.ByPairs %>%
  filter(
      (PC2 %in% c("chRNA.Expression.Splicing", "Expression.Splicing.Subset_YRI")) &
      (PC1 %in% c("polyA.Splicing.Subset_YRI", "chRNA.Splicing", "H3K4ME3", "H3K27AC")) &
      (FDR.x < 0.1)) %>%
  mutate(intron = case_when(
    PC1 %in% c("polyA.Splicing.Subset_YRI", "chRNA.Splicing") ~ str_replace(P1, "^(.+):(.+?):(.+?):clu.+?_([+-])$", "chr\\1_\\2_\\3_\\4"),
    TRUE ~ NA_character_)) %>%
  mutate(IntronAnnotation = case_when(
    intron %in% NMD.specific.introns ~ "Annotated NMD",
    intron %in% Intron.Annotations.basic$intron ~ "Annotated basic",
    intron %in% Introns.Annotations.comprehensive$intron ~ "Annotated Not basic",
    !is.na(intron) ~ "Unannotated",
    TRUE ~ ""
  )) %>%
  mutate(SnpSet = paste(PC1, IntronAnnotation)) %>%
  mutate(Concordance = if_else(sign(beta.x)==sign(beta.y), "Concordant", "Discordant")) %>%
  bind_rows(
    test.SNPs %>%
      mutate(Concordance = "Concordant", SnpSet=PC1),
    test.SNPs %>%
      mutate(Concordance = "Discordant", SnpSet=PC1)
    ) %>%
  group_by(SnpSet, PC2, Concordance) %>%
  mutate(ExpectedP = percent_rank(trait.x.p.in.y)) %>%
  ungroup() %>%
  mutate(Sign = recode(Concordance, !!!c("Discordant"=-1, "Concordant"=1))) %>%
  mutate(PC2 = recode(PC2, !!!c("chRNA.Expression.Splicing"="chRNA", "Expression.Splicing.Subset_YRI"="polyA"))) %>%
  arrange(desc(trait.x.p.in.y))
ggplot(QQ.plot.dat, aes(x=-log10(ExpectedP)*Sign, y=-log10(trait.x.p.in.y)*Sign, color=SnpSet)) +
  geom_abline() +
  geom_point() +
  facet_wrap(~PC2) +
  theme_bw() +
  labs(title="QQ plot of eQTL P-values", color="xQTL SNP", y="Observed -log10(P)", x="Theoretical -log10(P)", caption=str_wrap("If eQTL and xQTL beta signs are same, right-side-up. If opposite, plot upside-down", 30))

Version	Author	Date
8df7e43	Benjmain Fair	2022-10-26

I like these plots but there is just too many colors… Let’s just plot sQTLs from polyA, to cut down on colors. Then seperately we can do just sQTLs from chRNA…

QQ.plot.dat %>%
  filter(!str_detect(SnpSet, "chRNA")) %>%
ggplot(aes(x=-log10(ExpectedP)*Sign, y=-log10(trait.x.p.in.y)*Sign, color=SnpSet)) +
  geom_abline() +
  geom_point() +
  facet_wrap(~PC2) +
  theme_bw() +
  labs(title="QQ plot of eQTL P-values", color="xQTL SNP", y="Observed -log10(P)", x="Theoretical -log10(P)", caption=str_wrap("If eQTL and xQTL beta signs are same, right-side-up. If opposite, plot upside-down", 30))

Version	Author	Date
8df7e43	Benjmain Fair	2022-10-26

QQ.plot.dat %>%
  filter(!str_detect(SnpSet, "polyA")) %>%
ggplot(aes(x=-log10(ExpectedP)*Sign, y=-log10(trait.x.p.in.y)*Sign, color=SnpSet)) +
  geom_abline() +
  geom_point() +
  facet_wrap(~PC2) +
  theme_bw() +
  labs(title="QQ plot of eQTL P-values", color="xQTL SNP", y="Observed -log10(P)", x="Theoretical -log10(P)", caption=str_wrap("If eQTL and xQTL beta signs are same, right-side-up. If opposite, plot upside-down", 30))

Version	Author	Date
8df7e43	Benjmain Fair	2022-10-26

enrichment of NMD sQTLs in eQTLs

Another way to show this, which would look more significant I think is to ask what % of non-eQTL sQTL affect NMD junctions of all non-eQTL sQTLs versus the same for sQTLs that are eQTLs

TopSNPEffects.ByPairs %>%
  filter(
      (PC1 %in% c("chRNA.Expression.Splicing", "Expression.Splicing.Subset_YRI"))) %>%
  # filter(
  #     (PC2 %in% c("chRNA.Expression.Splicing", "Expression.Splicing.Subset_YRI")) &
  #     (PC1 %in% c("polyA.Splicing.Subset_YRI", "chRNA.Splicing")) ) %>%
  filter(FDR.x < 0.1) %>%
  distinct(P1) %>% nrow()

[1] 5358

TopSNPEffects.ByPairs %>%
  filter(
      (PC1 %in% c("polyA.Splicing.Subset_YRI", "chRNA.Splicing")) ) %>%
  distinct(GeneLocus) %>% nrow()

[1] 2621

datToPlot <- TopSNPEffects.ByPairs %>%
  filter(
      (PC2 %in% c("chRNA.Expression.Splicing", "Expression.Splicing.Subset_YRI")) &
      (PC1 %in% c("polyA.Splicing.Subset_YRI", "chRNA.Splicing")) ) %>%
  filter(FDR.x < 0.1) %>%
  mutate(intron = str_replace(P1, "^(.+):(.+?):(.+?):clu.+?_([+-])$", "chr\\1_\\2_\\3_\\4")) %>%
  mutate(IntronAnnotation = case_when(
    intron %in% NMD.specific.introns ~ "Annotated NMD",
    intron %in% Intron.Annotations.basic$intron ~ "Annotated basic",
    intron %in% Introns.Annotations.comprehensive$intron ~ "Annotated Not basic",
    TRUE ~ "Unannotated"
  )) %>%
  mutate(Is_HostGene_eQTL = if_else(FDR.y < 0.1, "eGene", "not eGene")) %>%
  count(IntronAnnotation, Is_HostGene_eQTL) %>%
  group_by(Is_HostGene_eQTL) %>%
  mutate(Percent = n/sum(n)*100) %>%
  mutate(rn = row_number()) %>%
  ungroup()

ggplot(datToPlot, aes(x=Is_HostGene_eQTL, y=n, fill=IntronAnnotation)) +
  geom_col(position="stack") +
  geom_label(aes(label=signif(Percent, 3), vjust=rn), y=Inf) +
  theme_bw() +
  labs(y="Number sQTLs", caption="1406 eGenes identified in YRI with a test intron")

Version	Author	Date
8df7e43	Benjmain Fair	2022-10-26

Let’s also break the original plot by ncRNAs for SNP regions

coloc.results.tidycolocalized %>%
  group_by(Locus, snp) %>%
  filter(any(PC=="Expression.Splicing.Subset_YRI")) %>%
  ungroup() %>%
  filter(PC %in% c("Expression.Splicing.Subset_YRI","H3K27AC", "H3K4ME1", "H3K4ME3", "polyA.Splicing.Subset_YRI", "chRNA.Splicing", "chRNA.IER")) %>%
  left_join(., ., by=c("Locus", "snp")) %>%
  filter(!((P.x == P.y) & (PC.x == PC.y))) %>% 
  group_by(Locus, snp) %>%
  mutate(
    ContainsChromatinEqtl = any(PC.x %in% c("H3K27AC", "H3K4ME1", "H3K4ME3")),
    ContainsSqtl = any(PC.x %in% c("polyA.Splicing.Subset_YRI", "chRNA.Splicing", "chRNA.IER"))
    ) %>%
  ungroup() %>%
  mutate(Contains.eQTL_Contains.sQTL = paste(ContainsChromatinEqtl, ContainsSqtl)) %>%
  dplyr::select(-iteration.y) %>%
  left_join(
    finemap.snps.annotated %>%
      dplyr::select(SNP_iteration_locus, FinemapPP, AnnotationClass) %>%
      separate(SNP_iteration_locus, into=c("snp", "iteration.x", "Locus"), convert=T, sep="_")
  ) %>%
  filter(FinemapPP >0.25) %>%
  mutate(AnnotationSuperclass = case_when(
    str_detect(AnnotationClass, "Splice.+_1") ~ "Annotated SS",
    str_detect(AnnotationClass, "Splice.+_0$") ~ "Unannotated SS",
    str_detect(AnnotationClass, "Enhancer") ~ "Enhancer",
    str_detect(AnnotationClass, "Promoter") ~ "Promoter",
    str_detect(AnnotationClass, "ncRNA") ~ AnnotationClass,
    TRUE ~ "Other"
  )) %>%
  group_by(Contains.eQTL_Contains.sQTL, PC.x, PC.y, AnnotationSuperclass) %>%
  summarise(cor = cor(beta.x, beta.y, method="spearman"), n=n()) %>%
  mutate(PC.x = recode(PC.x, !!!PC.ShortAliases)) %>%
  mutate(PC.y = recode(PC.y, !!!PC.ShortAliases)) %>%
  mutate(Contains.eQTL_Contains.sQTL = recode(Contains.eQTL_Contains.sQTL,
                                              !!!c("TRUE TRUE"="hQTL+sQTL",
                                                   "TRUE FALSE"="hQTL only",
                                                   "FALSE TRUE"="sQTL only"))) %>%
  ggplot(aes(x=PC.x, y=PC.y, fill=cor)) +
  geom_raster() +
  geom_text(aes(label=n), size=1.5) +
  scale_fill_gradient2() +
  facet_grid(rows=vars(Contains.eQTL_Contains.sQTL), cols=vars(AnnotationSuperclass)) +
  scale_x_discrete(expand=c(0,0)) +
  scale_y_discrete(expand=c(0,0)) +
  theme_classic() +
  theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust=1)) +
  labs(x="TraitA Phenotype class", y="TraitB Phenotype class", fill="Spearman cor", title="Effect size correlation for eQTLs that coloc with something",
       caption = str_wrap("FacetCols=Annotation class of SNP (Onlyfinemap PP >0.25 included). FacetRows=Type of eQTL"))

Version	Author	Date
8df7e43	Benjmain Fair	2022-10-26

Perhaps we should plot the same ideas but breaking out the introns into classes by whether we have some good reason to think it would be an NMD target (ie unannoated, and or annotated with nonsense mediated decay transcript tag)

coloc.results.tidycolocalized %>%
  group_by(Locus, snp) %>%
  filter(any(PC=="Expression.Splicing.Subset_YRI")) %>%
  ungroup() %>%
  filter(PC %in% c("Expression.Splicing.Subset_YRI","H3K27AC", "H3K4ME1", "H3K4ME3", "polyA.Splicing.Subset_YRI", "chRNA.Splicing", "chRNA.IER")) %>%
  left_join(., ., by=c("Locus", "snp")) %>%
  filter(!((P.x == P.y) & (PC.x == PC.y))) %>% 
  group_by(Locus, snp) %>%
  mutate(
    ContainsChromatinEqtl = any(PC.x %in% c("H3K27AC", "H3K4ME1", "H3K4ME3")),
    ContainsSqtl = any(PC.x %in% c("polyA.Splicing.Subset_YRI", "chRNA.Splicing", "chRNA.IER"))
    ) %>%
  ungroup() %>%
  mutate(Contains.eQTL_Contains.sQTL = paste(ContainsChromatinEqtl, ContainsSqtl)) %>%
  dplyr::select(-iteration.y) %>%
  left_join(
    finemap.snps.annotated %>%
      dplyr::select(SNP_iteration_locus, FinemapPP, AnnotationClass) %>%
      separate(SNP_iteration_locus, into=c("snp", "iteration.x", "Locus"), convert=T, sep="_")
  ) %>%
  filter(FinemapPP >0.25) %>%
  mutate(AnnotationSuperclass = case_when(
    str_detect(AnnotationClass, "Splice.+_1") ~ "Annotated SS",
    str_detect(AnnotationClass, "Splice.+_0$") ~ "Unannotated SS",
    str_detect(AnnotationClass, "Enhancer") ~ "Enhancer",
    str_detect(AnnotationClass, "Promoter") ~ "Promoter",
    TRUE ~ "Other"
  )) %>%
  mutate(PC.x = recode(PC.x, !!!PC.ShortAliases)) %>%
  mutate(PC.y = recode(PC.y, !!!PC.ShortAliases)) %>%
  mutate(Contains.eQTL_Contains.sQTL = recode(Contains.eQTL_Contains.sQTL,
                                              !!!c("TRUE TRUE"="hQTL+sQTL",
                                                   "TRUE FALSE"="hQTL only",
                                                   "FALSE TRUE"="sQTL only"))) %>%
  filter(Contains.eQTL_Contains.sQTL=="sQTL only" & AnnotationSuperclass=="Annotated SS") %>%
  ggplot(aes(x=beta.x, y=beta.y)) +
  geom_point() +
  geom_vline(xintercept=0) +
  geom_hline(yintercept=0) +
  facet_grid(rows=vars(PC.x), cols=vars(PC.y)) +
  theme_bw() +
  labs(x="TraitA Phenotype class", y="TraitB Phenotype class", title="Effect size correlation for eQTLs/sQTLs that are not hQTLs",
       caption = str_wrap("Only SNPs with finemap PP > 0.25 in annotated splice site SNP included"))

Version	Author	Date
8df7e43	Benjmain Fair	2022-10-26

Write out bed files for meta QTL plot

As a side thing, I wrote a script to make metaQTL plots. That is, I lump the high/high allele genotypes for many QTLs and plot a metagene plot, along with the heterzygotes, and the low/low genotypes. This could be useful as an alternative to looking through a bunch of pygenometracks QTL plots. I think this list of eQTLs that coloc with hQTL only versus those that coloc with sQTL only would be a good test plot to look at. For example, I expect to see promoter effects at the hQTL eGenes but not so much at the sQTL eGenes, and definitely not at the sQTL eGenes with high finemapping probability as splice sites

sessionInfo()

R version 3.6.1 (2019-07-05)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: CentOS Linux 7 (Core)

Matrix products: default
BLAS/LAPACK: /software/openblas-0.2.19-el7-x86_64/lib/libopenblas_haswellp-r0.2.19.so

locale:
 [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C         LC_TIME=C           
 [4] LC_COLLATE=C         LC_MONETARY=C        LC_MESSAGES=C       
 [7] LC_PAPER=C           LC_NAME=C            LC_ADDRESS=C        
[10] LC_TELEPHONE=C       LC_MEASUREMENT=C     LC_IDENTIFICATION=C 

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] qvalue_2.16.0     data.table_1.14.2 forcats_0.4.0     stringr_1.4.0    
 [5] dplyr_1.0.9       purrr_0.3.4       readr_1.3.1       tidyr_1.2.0      
 [9] tibble_3.1.7      ggplot2_3.3.6     tidyverse_1.3.0  

loaded via a namespace (and not attached):
 [1] httr_1.4.4        jsonlite_1.6      splines_3.6.1     R.utils_2.9.0    
 [5] modelr_0.1.8      assertthat_0.2.1  highr_0.9         cellranger_1.1.0 
 [9] yaml_2.2.0        pillar_1.7.0      backports_1.4.1   lattice_0.20-38  
[13] glue_1.6.2        digest_0.6.20     promises_1.0.1    rvest_0.3.5      
[17] colorspace_1.4-1  htmltools_0.5.3   httpuv_1.5.1      Matrix_1.2-18    
[21] R.oo_1.22.0       plyr_1.8.4        pkgconfig_2.0.2   broom_1.0.0      
[25] haven_2.3.1       scales_1.1.0      whisker_0.3-2     later_0.8.0      
[29] git2r_0.26.1      mgcv_1.8-40       generics_0.1.3    farver_2.1.0     
[33] ellipsis_0.3.2    withr_2.5.0       cli_3.3.0         magrittr_1.5     
[37] crayon_1.3.4      readxl_1.3.1      evaluate_0.15     R.methodsS3_1.7.1
[41] fs_1.5.2          fansi_0.4.0       nlme_3.1-140      xml2_1.3.2       
[45] tools_3.6.1       hms_0.5.3         lifecycle_1.0.1   munsell_0.5.0    
[49] reprex_0.3.0      compiler_3.6.1    rlang_1.0.5       grid_3.6.1       
[53] rstudioapi_0.14   labeling_0.3      rmarkdown_1.13    gtable_0.3.0     
[57] DBI_1.1.0         reshape2_1.4.3    R6_2.4.0          lubridate_1.7.4  
[61] knitr_1.39        fastmap_1.1.0     utf8_1.1.4        workflowr_1.6.2  
[65] rprojroot_2.0.2   stringi_1.4.3     Rcpp_1.0.5        vctrs_0.4.1      
[69] dbplyr_1.4.2      tidyselect_1.1.2  xfun_0.31