20230324_MakeFigs

Last updated: 2023-03-30

Checks: 6 1

Knit directory: ChromatinSplicingQTLs/analysis/

This reproducible R Markdown analysis was created with workflowr (version 1.7.0). The Checks tab describes the reproducibility checks that were applied when the results were created. The Past versions tab lists the development history.

R Markdown file: uncommitted changes

The R Markdown file has unstaged changes. To know which version of the R Markdown file created these results, you’ll want to first commit it to the Git repo. If you’re still working on the analysis, you can ignore this warning. When you’re finished, you can run wflow_publish to commit the R Markdown file and build the HTML.

Environment: empty

Great job! The global environment was empty. Objects defined in the global environment can affect the analysis in your R Markdown file in unknown ways. For reproduciblity it’s best to always run the code in an empty environment.

Seed: set.seed(20191126)

The command set.seed(20191126) was run prior to running the code in the R Markdown file. Setting a seed ensures that any results that rely on randomness, e.g. subsampling or permutations, are reproducible.

Session information: recorded

Great job! Recording the operating system, R version, and package versions is critical for reproducibility.

Cache: none

Nice! There were no cached chunks for this analysis, so you can be confident that you successfully produced the results during this run.

File paths: relative

Great job! Using relative paths to the files within your workflowr project makes it easier to run your code on other machines.

Repository version: ddbf222

Great! You are using Git for version control. Tracking code development and connecting the code version to the results is critical for reproducibility.

The results in this page were generated with repository version ddbf222. See the Past versions tab to see a history of the changes made to the R Markdown and HTML files.

Note that you need to be careful to ensure that all relevant files for the analysis have been committed to Git prior to generating the results (you can use wflow_publish or wflow_git_commit). workflowr only checks the R Markdown file, but you know if there are other scripts or data files that it depends on. Below is the status of the Git repository when the results were generated:


Ignored files:
    Ignored:    .DS_Store
    Ignored:    .Rhistory
    Ignored:    .Rproj.user/
    Ignored:    analysis/.Rhistory
    Ignored:    code/.DS_Store
    Ignored:    code/.RData
    Ignored:    code/._report.html
    Ignored:    code/.ipynb_checkpoints/
    Ignored:    code/.snakemake/
    Ignored:    code/APA_Processing/
    Ignored:    code/Alignments/
    Ignored:    code/ChromHMM/
    Ignored:    code/ENCODE/
    Ignored:    code/ExpressionAnalysis/
    Ignored:    code/FastqFastp/
    Ignored:    code/FastqFastpSE/
    Ignored:    code/FastqSE/
    Ignored:    code/FineMapping/
    Ignored:    code/Genotypes/
    Ignored:    code/H3K36me3_CutAndTag.pdf
    Ignored:    code/IntronSlopes/
    Ignored:    code/LR.bed
    Ignored:    code/LR.seq.bed
    Ignored:    code/LongReads/
    Ignored:    code/Metaplots/
    Ignored:    code/Misc/
    Ignored:    code/MiscCountTables/
    Ignored:    code/Multiqc/
    Ignored:    code/Multiqc_chRNA/
    Ignored:    code/NonCodingRNA/
    Ignored:    code/NonCodingRNA_annotation/
    Ignored:    code/PairwisePi1Traits.P.all.txt.gz
    Ignored:    code/PeakCalling/
    Ignored:    code/Phenotypes/
    Ignored:    code/PlotGruberQTLs/
    Ignored:    code/PlotQTLs/
    Ignored:    code/ProCapAnalysis/
    Ignored:    code/QC/
    Ignored:    code/QTL_SNP_Enrichment/
    Ignored:    code/QTLs/
    Ignored:    code/RPKM_tables/
    Ignored:    code/ReadLengthMapExperiment/
    Ignored:    code/ReadLengthMapExperimentResults/
    Ignored:    code/ReadLengthMapExperimentSpliceCounts/
    Ignored:    code/ReferenceGenome/
    Ignored:    code/Rplots.pdf
    Ignored:    code/Session.vim
    Ignored:    code/SmallMolecule/
    Ignored:    code/SplicingAnalysis/
    Ignored:    code/TODO
    Ignored:    code/Tehranchi/
    Ignored:    code/bigwigs/
    Ignored:    code/bigwigs_FromNonWASPFilteredReads/
    Ignored:    code/config/.DS_Store
    Ignored:    code/config/._.DS_Store
    Ignored:    code/config/.ipynb_checkpoints/
    Ignored:    code/config/config.local.yaml
    Ignored:    code/dag.pdf
    Ignored:    code/dag.png
    Ignored:    code/dag.svg
    Ignored:    code/debug.ipynb
    Ignored:    code/debug_python.ipynb
    Ignored:    code/deepTools/
    Ignored:    code/featureCounts/
    Ignored:    code/featureCountsBasicGtf/
    Ignored:    code/gwas_summary_stats/
    Ignored:    code/hyprcoloc/
    Ignored:    code/igv_session.xml
    Ignored:    code/isoseqbams/
    Ignored:    code/log
    Ignored:    code/logs/
    Ignored:    code/notebooks/.ipynb_checkpoints/
    Ignored:    code/pi1/
    Ignored:    code/rules/.ipynb_checkpoints/
    Ignored:    code/rules/OldRules/
    Ignored:    code/rules/notebooks/
    Ignored:    code/scratch/
    Ignored:    code/scripts/.ipynb_checkpoints/
    Ignored:    code/scripts/GTFtools_0.8.0/
    Ignored:    code/scripts/__pycache__/
    Ignored:    code/scripts/liftOverBedpe/liftOverBedpe.py
    Ignored:    code/snakemake.dryrun.log
    Ignored:    code/snakemake.log
    Ignored:    code/snakemake.sbatch.log
    Ignored:    code/snakemake_profiles/slurm/__pycache__/
    Ignored:    code/test.introns.bed
    Ignored:    code/test.introns2.bed
    Ignored:    code/test.log
    Ignored:    code/tracks.xml
    Ignored:    data/.DS_Store
    Ignored:    data/._.DS_Store
    Ignored:    data/._20220414203249_JASPAR2022_combined_matrices_25818_jaspar.txt
    Ignored:    data/GWAS_catalog_summary_stats_sources/._list_gwas_summary_statistics_6_Apr_2022-10.csv
    Ignored:    data/GWAS_catalog_summary_stats_sources/._list_gwas_summary_statistics_6_Apr_2022-11.csv
    Ignored:    data/GWAS_catalog_summary_stats_sources/._list_gwas_summary_statistics_6_Apr_2022-2.csv
    Ignored:    data/GWAS_catalog_summary_stats_sources/._list_gwas_summary_statistics_6_Apr_2022-3.csv
    Ignored:    data/GWAS_catalog_summary_stats_sources/._list_gwas_summary_statistics_6_Apr_2022-4.csv
    Ignored:    data/GWAS_catalog_summary_stats_sources/._list_gwas_summary_statistics_6_Apr_2022-5.csv
    Ignored:    data/GWAS_catalog_summary_stats_sources/._list_gwas_summary_statistics_6_Apr_2022-6.csv
    Ignored:    data/GWAS_catalog_summary_stats_sources/._list_gwas_summary_statistics_6_Apr_2022-7.csv
    Ignored:    data/GWAS_catalog_summary_stats_sources/._list_gwas_summary_statistics_6_Apr_2022-8.csv
    Ignored:    data/GWAS_catalog_summary_stats_sources/._list_gwas_summary_statistics_6_Apr_2022.csv
    Ignored:    data/Metaplots/.DS_Store

Untracked files:
    Untracked:  analysis/test_rmd_format.Rmd

Unstaged changes:
    Modified:   analysis/20230324_MakeFigs.Rmd
    Modified:   code/Snakefile
    Modified:   code/rules/QTL_SNP_Enrichment.smk
    Modified:   code/rules/common.py
    Modified:   code/scripts/CalculatePi1_GetAscertainmentP_AllPairs.py
    Modified:   code/scripts/CalculatePi1_GetTraitPairs_AllTraits.R
    Modified:   code/scripts/GenometracksByGenotype

Note that any generated files, e.g. HTML, png, CSS, etc., are not included in this status report because it is ok for generated content to have uncommitted changes.

These are the previous versions of the repository in which changes were made to the R Markdown (analysis/20230324_MakeFigs.Rmd) and HTML (docs/20230324_MakeFigs.html) files. If you’ve configured a remote Git repository (see ?wflow_git_remote), click on the hyperlinks in the table below to view the files as they were in that past version.

File	Version	Author	Date	Message
Rmd	ddbf222	Benjmain Fair	2023-03-27	add pi1 heatmaps nb
html	ddbf222	Benjmain Fair	2023-03-27	add pi1 heatmaps nb

knitr::opts_chunk$set(echo = TRUE, warning = F, message = F)

library(tidyverse)

── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──

✔ ggplot2 3.3.6     ✔ purrr   0.3.4
✔ tibble  3.1.7     ✔ dplyr   1.0.9
✔ tidyr   1.2.0     ✔ stringr 1.4.0
✔ readr   2.1.2     ✔ forcats 0.5.1

── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()

library(RColorBrewer)
library(data.table)


Attaching package: 'data.table'

The following objects are masked from 'package:dplyr':

    between, first, last

The following object is masked from 'package:purrr':

    transpose

library(edgeR)

Loading required package: limma

# Set theme
theme_set(
  theme_classic() +
  theme(text=element_text(size=16,  family="Helvetica")))

# I use layer a lot, to rotate long x-axis labels
Rotate_x_labels <- theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust=1))

#test plot
ggplot(mtcars, aes(x=mpg, y=cyl)) +
  geom_point()

Version	Author	Date
ddbf222	Benjmain Fair	2023-03-27

Figure2

A

pi heatmap of gene-feature QTLs. Num QTLs in margin.

Only include hQTLs at promoter, chRNA, 4sU, and polyA.

library(qvalue)
library(viridis)

PeaksToTSS <- Sys.glob("../code/Misc/PeaksClosestToTSS/*_assigned.tsv.gz") %>%
  setNames(str_replace(., "../code/Misc/PeaksClosestToTSS/(.+?)_assigned.tsv.gz", "\\1")) %>%
  lapply(read_tsv) %>%
  bind_rows(.id="ChromatinMark") %>%
  mutate(GenePeakPair = paste(gene, peak, sep = ";")) %>%
  distinct(ChromatinMark, peak, gene, .keep_all=T)

PC1.filter <- c("Expression.Splicing", "MetabolicLabelled.30min", "MetabolicLabelled.60min", "H3K36ME3", "H3K27AC", "H3K4ME3", "H3K4ME1", "Expression.Splicing.Subset_YRI", "chRNA.Expression.Splicing")

dat <- fread("../code/pi1/PairwisePi1Traits.P.all.txt.gz") %>%
  filter((PC1 %in% PC1.filter) & (PC2 %in% PC1.filter))

dat$PC1 %>% unique()

[1] "Expression.Splicing"            "Expression.Splicing.Subset_YRI"
[3] "chRNA.Expression.Splicing"      "MetabolicLabelled.30min"       
[5] "MetabolicLabelled.60min"        "H3K27AC"                       
[7] "H3K4ME3"                        "H3K4ME1"                       
[9] "H3K36ME3"

PhenotypeAliases <- read_tsv("../data/Phenotypes_recode_for_Plotting.txt")

Many genes have multiple promoters that should all be accounted for as potential transcriptional effects. So calculating pi1 can be tricky. I want to ask the question, “What fraction of eQTLs have a promoter hQTL in at least one of the promoters?”… I have already solved this problem… For each gene, get lowest P value for across tests (ie all the promoters) and use that a a test statistic to come up with a new Pvalue, to calculate pi1. The test statistic distribution under the null can be approximate by drawing from uniform n times.

First we need to approximate that null using runif…

MaxSampleSizeToCreateANull <- 250
NSamplesToEstimateDistribution <- 10000
NullSimulatedTestStats <- matrix(nrow=MaxSampleSizeToCreateANull, ncol=NSamplesToEstimateDistribution)
rownames(NullSimulatedTestStats) <- paste0("runif_samplesize", 1:MaxSampleSizeToCreateANull)
colnames(NullSimulatedTestStats) <- paste0("Sample_", 1:NSamplesToEstimateDistribution)


for (i in 1:MaxSampleSizeToCreateANull){
  SampleSizeFromUniform <- i
  SampledDat <- matrix(runif(SampleSizeFromUniform*NSamplesToEstimateDistribution), nrow=NSamplesToEstimateDistribution)
  SampleDatNullTestStatistics <- -log10(apply(SampledDat, 1, min))
  NullSimulatedTestStats[i,] <- SampleDatNullTestStatistics
}


NullSimulatedTestStats %>%
  as.data.frame() %>%
  rownames_to_column("runif_samplesize") %>%
  slice(1:250) %>%
  mutate(runif_samplesize = as.numeric(str_remove(runif_samplesize, "runif_samplesize"))) %>%
  gather(key="Sample", value="value", -runif_samplesize) %>%
  ggplot(aes(x=value, color=runif_samplesize, group=runif_samplesize)) +
  geom_density() +
  scale_color_viridis_c() +
  theme_bw() +
  labs(x="Test statistic under null\n-log10(MinP) across n draws from uniform", color="n")

Version	Author	Date
ddbf222	Benjmain Fair	2023-03-27

ecdf.functions <- apply(NullSimulatedTestStats, 1, ecdf)
ecdf.functions[[1]](1)

[1] 0.9034

Make a function to calculate pi1

# pi0est function breaks sometimes. this wrapper function returns 1 when that happens
CalculatePi1 <- function (dat.in, ...) {
  return(tryCatch(1-pi0est(dat.in$Pvals.For.Pi1, ...)$pi0, error=function(e) 1))
}

And now, recode peaks as promoters, and calculate test statistics.

Recode.PCs = c("ActivatingMark_HostGenePromoter"="Promoter marks*", "H3K36ME3"="H3K36ME3", "chRNA.Expression.Splicing"="chRNA", "MetabolicLabelled.30min"="4sU 30min", "MetabolicLabelled.60min"="4sU 60min", "Expression.Splicing.Subset_YRI"="polyA YRI", "Expression.Splicing"="polyA")

ActivatingMarksToConsider <- c("H3K27AC")

ProcessPi1HeatmapDat <- function(FDR.cutoff=0.1){
  dat.split <- dat %>%
  filter(FDR.x < FDR.cutoff) %>%
  mutate(PC1 = case_when(
    paste(GeneLocus, P1, sep=";") %in% PeaksToTSS$GenePeakPair & PC1 %in% ActivatingMarksToConsider ~ "ActivatingMark_HostGenePromoter",
    P1 %in% PeaksToTSS$peak ~ "ActivatingMark_OtherGenePromoter",
    PC1 %in% c("H3K4ME1", "H3K4ME3", "H3K27AC") ~ "ActivatingMark_Enhancer",
    TRUE ~ PC1
  )) %>%
  mutate(PC2 = case_when(
    paste(GeneLocus, P2, sep=";") %in% PeaksToTSS$GenePeakPair & PC2 %in% ActivatingMarksToConsider ~ "ActivatingMark_HostGenePromoter",
    P2 %in% PeaksToTSS$peak ~ "ActivatingMark_OtherGenePromoter",
    PC2 %in% c("H3K4ME1", "H3K4ME3", "H3K27AC") ~ "ActivatingMark_Enhancer",
    TRUE ~ PC2
  )) %>%
  group_by(PC1, P1, PC2) %>%
  mutate(test.stat.obs = -log10(min(trait.x.p.in.y))) %>%
  ungroup() %>%
  add_count(PC1, P1, PC2) %>%
  filter(n<=250) %>%
  group_by(PC1, PC2) %>%
  rowwise() %>%
  mutate(Pvals.For.Pi1 = 1-ecdf.functions[[n]](test.stat.obs)) %>%
  ungroup() %>%
  select(PC1, PC2, Pvals.For.Pi1) %>%
  filter(!PC1==PC2) %>%
  split(paste(.$PC1, .$PC2, sep = ";"))
dat.out <- lapply(dat.split, CalculatePi1, pi0.method="bootstrap") %>%
  unlist() %>%
  data.frame(pi1=.) %>%
  rownames_to_column("PC1_PC2") %>%
  separate(PC1_PC2, into=c("PC1", "PC2"), sep=';') %>%
  filter((PC1%in% c(PC1.filter, "ActivatingMark_HostGenePromoter") & (PC2%in% c(PC1.filter, "ActivatingMark_HostGenePromoter") ))) %>%
  mutate(TextColor = case_when(
    pi1 >= 0.7 ~ "white",
    TRUE ~ "black"
  )) %>%
mutate(PC1 = recode(PC1, !!!Recode.PCs)) %>%
mutate(PC2 = recode(PC2, !!!Recode.PCs)) %>%
mutate(PC1 = factor(PC1, levels=Recode.PCs)) %>%
mutate(PC2 = factor(PC2, levels=Recode.PCs))
P <- ggplot(dat.out, aes(x=PC1, y=PC2, fill=pi1, color=TextColor)) +
  geom_raster() +
  geom_text(aes(label=signif(pi1*100, 2))) +
  scale_color_identity() +
  scale_fill_viridis_c(option="A", direction = -1, limits=c(0,1)) +
  coord_flip() +
  scale_x_discrete(limits=rev) +
  theme_classic() +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1)) +
  labs(x="Discovery QTL phenotype", y="Phenotype assessed for overlap", caption=paste("*Promoter marks include", paste(ActivatingMarksToConsider, collapse=",")))
return(list(dat=dat, P=P))
}

ProcessPi1HeatmapDat()$P

Version	Author	Date
ddbf222	Benjmain Fair	2023-03-27

Redo, but also consider both H3K27AC and H3K4ME3 at promoter…

ActivatingMarksToConsider <- c("H3K27AC", "H3K4ME3")

ProcessPi1HeatmapDat()$P

Version	Author	Date
ddbf222	Benjmain Fair	2023-03-27

Redo but consider H3K4ME3, H3K4ME1, and H3K27AC

ActivatingMarksToConsider <- c("H3K27AC", "H3K4ME3", "H3K4ME1")

ProcessPi1HeatmapDat()$P

Version	Author	Date
ddbf222	Benjmain Fair	2023-03-27

In theory things should only go up as you include H3K4ME1, but that is not the case. This phenomena is something I’ve noticed before, and I think it’s because adding non-overlapping phenotypes to promoter marks only modestly increases the lowest P-value, but more greatly can increase the n number of features used to determine the null.

So, let’s now include H3K4ME1. This is more thought of as an enhancer rather than promoter mark anyway.

So now let’s redo, but only include H3K27AC and H3K4ME3 but use lower FDR cutoffs.

ActivatingMarksToConsider <- c("H3K27AC", "H3K4ME3")

ProcessPi1HeatmapDat(FDR.cutoff = 0.05)$P

Version	Author	Date
ddbf222	Benjmain Fair	2023-03-27

ProcessPi1HeatmapDat(FDR.cutoff = 0.01)$P

Version	Author	Date
ddbf222	Benjmain Fair	2023-03-27

Ok, cool. I mean slightly different versions. I’ll leave it to Yang as to which version should be the main figure, and perhaps we can include alternate versions as supplements. In every case though I notice that the top right corner is dark, suggesting histone effects can be picked up pretty well in downstream assays (ie polyA), but the bottom left corner is light, because of post-transcriptional mechanisms of gene regulation.

save final prettier versions, with axes flipped (yang likes discovery on x-axis)
save alternate versions for supplement, with different FDR thresholds

B

Now I need a way to classify eQTLs as transcriptional or post-transcriptional. Carlos has previously used effect sizes in polyA vs chRNA. I may do something similar. First let’s some scatter plots of effect sizes

FDR.cutoff <- 0.1
ActivatingMarksToConsider <- c("H3K27AC", "H3K4ME3")

dat.recoded <- dat %>%
  filter(FDR.x < FDR.cutoff) %>%
  mutate(PC1 = case_when(
    paste(GeneLocus, P1, sep=";") %in% PeaksToTSS$GenePeakPair & PC1 %in% ActivatingMarksToConsider ~ PC1,
    P1 %in% PeaksToTSS$peak ~ "ActivatingMark_OtherGenePromoter",
    PC1 %in% c("H3K4ME1", "H3K4ME3", "H3K27AC") ~ "ActivatingMark_Enhancer",
    TRUE ~ PC1
  )) %>%
  mutate(PC2 = case_when(
    paste(GeneLocus, P2, sep=";") %in% PeaksToTSS$GenePeakPair & PC2 %in% ActivatingMarksToConsider ~ PC2,
    P1 %in% PeaksToTSS$peak ~ "ActivatingMark_OtherGenePromoter",
    PC2 %in% c("H3K4ME1", "H3K4ME3", "H3K27AC") ~ "ActivatingMark_Enhancer",
    TRUE ~ PC2
  ))

dat.recoded %>%
  filter(PC1 == "Expression.Splicing") %>%
  filter(!str_detect(PC2, "ActivatingMark_Enhancer")) %>%
  filter(!str_detect(PC2, "ActivatingMark_OtherGenePromoter")) %>%
  group_by(PC1, P1, PC2) %>%
  slice(which.min(trait.x.p.in.y)) %>%
  ungroup() %>%
  ggplot(aes(x=beta.x, y=x.beta.in.y)) +
  geom_point(alpha=0.1) +
  facet_wrap(~PC2) +
  labs(x="eQTL beta", y="facetQTL beta")

I think what I want to do is to first consider eQTLs in polyA, then perform one-sided test for significant differences in beta for corresponding feature in chRNA/promoterMarks. For simplicity, I think I can use Z-test, like this, but first, to check that this approach is reasonable, I think I’m going to check that Z-scores from random trait pairs closely follow normal distribution as they should under null. Also, I note that winners curse will bias things since I am conditioning on significant QTLs for polyA, then assessing difference in betas between polyA vs chRNA/promoter. But since only the polyA beta is subject to winners curse bias, it will turn out being overly convervative for the SNP enrichments I’m trying to highlight (because some transcriptional QTLs might erronesouly be classified as non-trancriptional) so i think that’s acceptable.

PC1.filter = "Expression.Splicing"

dat.ZsFromAllTraitPairs <- dat.recoded %>%
  filter(PC1 == PC1.filter) %>%
  filter(!str_detect(PC2, "ActivatingMark_Enhancer")) %>%
  filter(!str_detect(PC2, "ActivatingMark_OtherGenePromoter")) %>%
  group_by(PC1, P1, PC2) %>%
  slice(which.min(trait.x.p.in.y)) %>%
  ungroup() %>%
  filter(!is.na(x.beta.in.y)) %>%
  distinct(PC1, P1, PC2, .keep_all=T) %>%
  mutate(Z = (abs(beta.x)-abs(x.beta.in.y)*sign(beta.x)*sign(x.beta.in.y))/sqrt(x.beta_se.in.y**2 + beta_se.x**2)) %>%
  mutate(OneSidedZTest.P = pnorm(Z, lower.tail=F)) %>%
  group_by(PC1, PC2) %>%
  mutate(q = qvalue(OneSidedZTest.P)$qvalues) %>%
  mutate(PolyASpecific = cut(q, breaks=c(0, 0.01, 0.05, 0.1, 1), include.lowest=T, ordered_result = T))


dat.ZsFromAllTraitPairs %>%
ggplot(aes(x=beta.x, y=x.beta.in.y, color=PolyASpecific)) +
  geom_point(alpha=0.1) +
  scale_color_manual(values = rev(brewer.pal(4, "Reds"))) +
  facet_wrap(~PC2) +
  guides(colour = guide_legend(override.aes = list(alpha = 1))) +
  labs(color="PolyA-specific FDR")

dat.ZsFromAllTraitPairs %>%
ggplot(aes(x=beta.x, y=x.beta.in.y, color=PolyASpecific=="[0,0.01]")) +
  geom_point(alpha=0.1) +
  facet_wrap(~PC2) +
  guides(colour = guide_legend(override.aes = list(alpha = 1))) +
  labs(color="PolyA-specific FDR < 0.01")

dat.ZsFromAllTraitPairs %>%
  count(PC1, PC2, PolyASpecific) %>%
  ggplot(aes(x=PC2, y=n, fill=fct_rev(PolyASpecific))) +
  geom_col(position="stack") +
  scale_fill_manual(values = brewer.pal(4, "Reds")) +
  Rotate_x_labels

dat.ZsFromAllTraitPairs %>%
  count(PC1, PC2, PolyASpecific) %>%
  ggplot(aes(x=PC2, y=n, fill=fct_rev(PolyASpecific))) +
  geom_col(position="fill") +
  scale_fill_manual(values = brewer.pal(4, "Reds")) +
  Rotate_x_labels

Okay, keep in mind that there is both biology and power issues at play… Biologically, I would expect very few polyA-specific eQTLs when comparing to 4sU, yet there are basically the same amount of polyA-specific eQTLs in that comparison as when comparing to chRNA. One could chalk that up to the 4sU dataset being so underpowered, that a lot more eQTLs would be considered polyA specific. Similarly, that could explain why so many are polyA-specific when comparing to the relatively underpowered H3K36me3 dataset.

Let’s do the same but anchor on YRI eQTLs..

PC1.filter = "Expression.Splicing.Subset_YRI"

dat.ZsFromAllTraitPairs <- dat.recoded %>%
  filter(PC1 == PC1.filter) %>%
  filter(!str_detect(PC2, "ActivatingMark_Enhancer")) %>%
  filter(!str_detect(PC2, "ActivatingMark_OtherGenePromoter")) %>%
  group_by(PC1, P1, PC2) %>%
  slice(which.min(trait.x.p.in.y)) %>%
  ungroup() %>%
  filter(!is.na(x.beta.in.y)) %>%
  distinct(PC1, P1, PC2, .keep_all=T) %>%
  mutate(Z = (abs(beta.x)-abs(x.beta.in.y)*sign(beta.x)*sign(x.beta.in.y))/sqrt(x.beta_se.in.y**2 + beta_se.x**2)) %>%
  mutate(OneSidedZTest.P = pnorm(Z, lower.tail=F)) %>%
  group_by(PC1, PC2) %>%
  mutate(q = qvalue(OneSidedZTest.P)$qvalues)

dat.ZsFromAllTraitPairs %>%
  mutate(PolyASpecific = cut(q, breaks=c(0, 0.01, 0.05, 0.1, 1), include.lowest=T, ordered_result = T)) %>%
ggplot(aes(x=beta.x, y=x.beta.in.y, color=PolyASpecific)) +
  geom_point(alpha=0.1) +
  scale_color_manual(values = rev(brewer.pal(4, "Reds"))) +
  facet_wrap(~PC2) +
  guides(colour = guide_legend(override.aes = list(alpha = 1))) +
  labs(color="PolyA-specific FDR")

dat.ZsFromAllTraitPairs %>%
  mutate(PolyASpecific = cut(q, breaks=c(0, 0.01, 0.05, 0.1, 1), include.lowest=T, ordered_result = T)) %>%
  # mutate(PolyASpecific = cut(OneSidedZTest.P, breaks=c(0, 0.01, 0.05, 0.1, 1), include.lowest=T, ordered_result = T)) %>%
  count(PC1, PC2, PolyASpecific) %>%
  ggplot(aes(x=PC2, y=n, fill=fct_rev(PolyASpecific))) +
  geom_col(position="stack") +
  scale_fill_manual(values = brewer.pal(4, "Reds")) +
  Rotate_x_labels

dat.ZsFromAllTraitPairs %>%
  mutate(PolyASpecific = cut(q, breaks=c(0, 0.01, 0.05, 0.1, 1), include.lowest=T, ordered_result = T)) %>%
  # mutate(PolyASpecific = cut(OneSidedZTest.P, breaks=c(0, 0.01, 0.05, 0.1, 1), include.lowest=T, ordered_result = T)) %>%
  count(PC1, PC2, PolyASpecific) %>%
  ggplot(aes(x=PC2, y=n, fill=fct_rev(PolyASpecific))) +
  geom_col(position="fill") +
  scale_fill_manual(values = brewer.pal(4, "Reds")) +
  Rotate_x_labels

Let’s explore Carlos’ susie finemapping output

susie.finemap <- fread("/project2/yangili1/cfbuenabadn/ChromatinSplicingQTLs/code/FineMapping/susie_runs_Geuvadis/susie_output.tab")
 
susie.finemap %>%
  distinct(cs_gene, cs_names) %>%
  count(cs_gene) %>%
  ggplot(aes(x=n)) +
  geom_bar() +
  labs(y="Num genes", x="Num eQTL signals")

susie.finemap %>%
  group_by(cs_gene, cs_names) %>%
  summarise(PIP.sum = sum(cs_pip)) %>%
  ungroup() %>%
  ggplot(aes(x=PIP.sum)) +
  stat_ecdf() +
  labs(x="Sum PIP in CI", title="These are 95% CIs") +
  coord_cartesian(xlim=c(0.9,1))

susie.finemap %>%
  count(cs_gene, cs_names) %>%
  ggplot(aes(x=n)) +
  stat_ecdf() +
  coord_cartesian(xlim=c(0,20)) +
  labs(x="Num SNPs in CI")

Ok great, next I will intersect the finemapped SNPs with annotations, and sum up the number of SNPs in each annotation (weighted by PIP), and compare transcriptional vs post-transcriptional eQTLs. I think I’ll do some work in the snakemake to accomplish this and come back to it…

FigS2C

sQTL QQ plot, with different colors for post-transcriptional eQTLs to transcriptional eQTLs (like what Carlos has done in red and black)

transcriptional.PC2s <- c("chRNA.Expression.Splicing", "H3K27AC", "H3K4ME3")

transcriptional.eQTLs <- dat.ZsFromAllTraitPairs %>%
  filter(PC2 %in% transcriptional.PC2s) %>%
  mutate(IsPolyASpecific = q < 0.01 & trait.x.p.in.y>0.01 )

count(transcriptional.eQTLs, IsPolyASpecific, PC2)

# A tibble: 6 × 4
# Groups:   PC1, PC2 [3]
  PC1                            PC2                       IsPolyASpecific     n
  <chr>                          <chr>                     <lgl>           <int>
1 Expression.Splicing.Subset_YRI H3K27AC                   FALSE            1965
2 Expression.Splicing.Subset_YRI H3K27AC                   TRUE             1405
3 Expression.Splicing.Subset_YRI H3K4ME3                   FALSE            2217
4 Expression.Splicing.Subset_YRI H3K4ME3                   TRUE             1361
5 Expression.Splicing.Subset_YRI chRNA.Expression.Splicing FALSE            2326
6 Expression.Splicing.Subset_YRI chRNA.Expression.Splicing TRUE             1587

PC1.filter <- c("Expression.Splicing", "Expression.Splicing.Subset_YRI")

PC1.sQTL.filter <- c("polyA.Splicing", "polyA.Splicing.Subset_YRI", "chRNA.Splicing")


dat.forsQTLQQ <- fread("../code/pi1/PairwisePi1Traits.P.all.txt.gz") %>%
  filter((PC1 %in% PC1.filter) & (PC2 %in% PC1.sQTL.filter))

dat.forsQTLQQ %>%
  inner_join(
    transcriptional.eQTLs %>%
    dplyr::select(ClassificationPC = PC2, IsPolyASpecific, P1, PC1)
  ) %>%
  group_by(PC1, PC2, IsPolyASpecific, ClassificationPC ) %>%
  mutate(ExpectedP = percent_rank(trait.x.p.in.y)) %>%
  ungroup() %>%
  ggplot(aes(x=-log10(ExpectedP), y=-log10(trait.x.p.in.y), color=IsPolyASpecific)) +
  # geom_abline(slope=1, intercept=1)
  geom_point() +
  geom_abline(slope=1) +
  facet_grid(PC2~ClassificationPC, scales = "free") +
  labs(x="Expected -log10P", y="Observed -log10P", title="sQTL QQ for top eQTL SNPs", caption="Column facets: criteria to determine polyA-specific eQTLs\nRow facets: sQTL dataset")

Fig2c

The eQTL QQplot of transcriptional

sessionInfo()

R version 4.2.0 (2022-04-22)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: CentOS Linux 7 (Core)

Matrix products: default
BLAS/LAPACK: /software/openblas-0.3.13-el7-x86_64/lib/libopenblas_haswellp-r0.3.13.so

locale:
 [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C         LC_TIME=C           
 [4] LC_COLLATE=C         LC_MONETARY=C        LC_MESSAGES=C       
 [7] LC_PAPER=C           LC_NAME=C            LC_ADDRESS=C        
[10] LC_TELEPHONE=C       LC_MEASUREMENT=C     LC_IDENTIFICATION=C 

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] viridis_0.6.2      viridisLite_0.4.0  qvalue_2.28.0      edgeR_3.38.4      
 [5] limma_3.52.4       data.table_1.14.2  RColorBrewer_1.1-3 forcats_0.5.1     
 [9] stringr_1.4.0      dplyr_1.0.9        purrr_0.3.4        readr_2.1.2       
[13] tidyr_1.2.0        tibble_3.1.7       ggplot2_3.3.6      tidyverse_1.3.1   

loaded via a namespace (and not attached):
 [1] fs_1.5.2          lubridate_1.8.0   bit64_4.0.5       httr_1.4.3       
 [5] rprojroot_2.0.3   tools_4.2.0       backports_1.4.1   bslib_0.3.1      
 [9] utf8_1.2.2        R6_2.5.1          DBI_1.1.2         colorspace_2.0-3 
[13] withr_2.5.0       tidyselect_1.1.2  gridExtra_2.3     bit_4.0.4        
[17] compiler_4.2.0    git2r_0.30.1      cli_3.3.0         rvest_1.0.2      
[21] xml2_1.3.3        labeling_0.4.2    sass_0.4.1        scales_1.2.0     
[25] digest_0.6.29     rmarkdown_2.14    R.utils_2.11.0    pkgconfig_2.0.3  
[29] htmltools_0.5.2   dbplyr_2.1.1      fastmap_1.1.0     highr_0.9        
[33] rlang_1.0.2       readxl_1.4.0      rstudioapi_0.13   jquerylib_0.1.4  
[37] farver_2.1.0      generics_0.1.2    jsonlite_1.8.0    vroom_1.5.7      
[41] R.oo_1.24.0       magrittr_2.0.3    Rcpp_1.0.8.3      munsell_0.5.0    
[45] fansi_1.0.3       lifecycle_1.0.1   R.methodsS3_1.8.1 stringi_1.7.6    
[49] whisker_0.4       yaml_2.3.5        plyr_1.8.7        grid_4.2.0       
[53] parallel_4.2.0    promises_1.2.0.1  crayon_1.5.1      lattice_0.20-45  
[57] haven_2.5.0       splines_4.2.0     hms_1.1.1         locfit_1.5-9.7   
[61] knitr_1.39        pillar_1.7.0      reshape2_1.4.4    reprex_2.0.1     
[65] glue_1.6.2        evaluate_0.15     modelr_0.1.8      vctrs_0.4.1      
[69] tzdb_0.3.0        httpuv_1.6.5      cellranger_1.1.0  gtable_0.3.0     
[73] assertthat_0.2.1  xfun_0.30         broom_0.8.0       later_1.3.0      
[77] workflowr_1.7.0   ellipsis_0.3.2

20230324_MakeFigs

2023-03-24

Intro

Figure2

A

B

FigS2C

Fig2c