MakeFinalFigs

Last updated: 2023-05-29

Checks: 6 1

Knit directory: ChromatinSplicingQTLs/analysis/

This reproducible R Markdown analysis was created with workflowr (version 1.7.0). The Checks tab describes the reproducibility checks that were applied when the results were created. The Past versions tab lists the development history.

R Markdown file: uncommitted changes

The R Markdown file has unstaged changes. To know which version of the R Markdown file created these results, you’ll want to first commit it to the Git repo. If you’re still working on the analysis, you can ignore this warning. When you’re finished, you can run wflow_publish to commit the R Markdown file and build the HTML.

Environment: empty

Great job! The global environment was empty. Objects defined in the global environment can affect the analysis in your R Markdown file in unknown ways. For reproduciblity it’s best to always run the code in an empty environment.

Seed: set.seed(20191126)

The command set.seed(20191126) was run prior to running the code in the R Markdown file. Setting a seed ensures that any results that rely on randomness, e.g. subsampling or permutations, are reproducible.

Session information: recorded

Great job! Recording the operating system, R version, and package versions is critical for reproducibility.

Cache: none

Nice! There were no cached chunks for this analysis, so you can be confident that you successfully produced the results during this run.

File paths: relative

Great job! Using relative paths to the files within your workflowr project makes it easier to run your code on other machines.

Repository version: d49ae42

Great! You are using Git for version control. Tracking code development and connecting the code version to the results is critical for reproducibility.

The results in this page were generated with repository version d49ae42. See the Past versions tab to see a history of the changes made to the R Markdown and HTML files.

Note that you need to be careful to ensure that all relevant files for the analysis have been committed to Git prior to generating the results (you can use wflow_publish or wflow_git_commit). workflowr only checks the R Markdown file, but you know if there are other scripts or data files that it depends on. Below is the status of the Git repository when the results were generated:


Ignored files:
    Ignored:    .DS_Store
    Ignored:    .Rhistory
    Ignored:    .Rproj.user/
    Ignored:    analysis/.Rhistory
    Ignored:    code/.DS_Store
    Ignored:    code/.RData
    Ignored:    code/._report.html
    Ignored:    code/.ipynb_checkpoints/
    Ignored:    code/.snakemake/
    Ignored:    code/APA_Processing/
    Ignored:    code/Alignments/
    Ignored:    code/ChromHMM/
    Ignored:    code/ENCODE/
    Ignored:    code/ExpressionAnalysis/
    Ignored:    code/ExtractPhenotypeBedByGenotype.py
    Ignored:    code/FastqFastp/
    Ignored:    code/FastqFastpSE/
    Ignored:    code/FastqSE/
    Ignored:    code/FineMapping/
    Ignored:    code/Genotypes/
    Ignored:    code/H3K36me3_CutAndTag.pdf
    Ignored:    code/IntronSlopes/
    Ignored:    code/LR.bed
    Ignored:    code/LR.seq.bed
    Ignored:    code/LongReads/
    Ignored:    code/Metaplots/
    Ignored:    code/Misc/
    Ignored:    code/MiscCountTables/
    Ignored:    code/Multiqc/
    Ignored:    code/Multiqc_chRNA/
    Ignored:    code/NonCodingRNA/
    Ignored:    code/NonCodingRNA_annotation/
    Ignored:    code/PairwisePi1Traits.P.all.txt.gz
    Ignored:    code/PeakCalling/
    Ignored:    code/Phenotypes/
    Ignored:    code/PlotGruberQTLs/
    Ignored:    code/PlotQTLs/
    Ignored:    code/ProCapAnalysis/
    Ignored:    code/QC/
    Ignored:    code/QTL_SNP_Enrichment/
    Ignored:    code/QTLs/
    Ignored:    code/RPKM_tables/
    Ignored:    code/ReadLengthMapExperiment/
    Ignored:    code/ReadLengthMapExperimentResults/
    Ignored:    code/ReadLengthMapExperimentSpliceCounts/
    Ignored:    code/ReferenceGenome/
    Ignored:    code/Rplots.pdf
    Ignored:    code/Session.vim
    Ignored:    code/SmallMolecule/
    Ignored:    code/SplicingAnalysis/
    Ignored:    code/TODO
    Ignored:    code/Tehranchi/
    Ignored:    code/alias/
    Ignored:    code/bigwigs/
    Ignored:    code/bigwigs_FromNonWASPFilteredReads/
    Ignored:    code/config/.DS_Store
    Ignored:    code/config/._.DS_Store
    Ignored:    code/config/.ipynb_checkpoints/
    Ignored:    code/config/config.local.yaml
    Ignored:    code/dag.pdf
    Ignored:    code/dag.png
    Ignored:    code/dag.svg
    Ignored:    code/data/
    Ignored:    code/debug.ipynb
    Ignored:    code/debug_python.ipynb
    Ignored:    code/deepTools/
    Ignored:    code/featureCounts/
    Ignored:    code/featureCountsBasicGtf/
    Ignored:    code/genome_config.yaml
    Ignored:    code/gwas_summary_stats/
    Ignored:    code/hyprcoloc/
    Ignored:    code/igv_session.xml
    Ignored:    code/isoseqbams/
    Ignored:    code/log
    Ignored:    code/logs/
    Ignored:    code/notebooks/.ipynb_checkpoints/
    Ignored:    code/pi1/
    Ignored:    code/polyA.Splicing.Subset_YRI.NominalPassForColoc.bed.bgz
    Ignored:    code/rules/.ipynb_checkpoints/
    Ignored:    code/rules/OldRules/
    Ignored:    code/rules/notebooks/
    Ignored:    code/salmontest/
    Ignored:    code/scratch/
    Ignored:    code/scripts/.Plot_mbv.R.swp
    Ignored:    code/scripts/.ipynb_checkpoints/
    Ignored:    code/scripts/GTFtools_0.8.0/
    Ignored:    code/scripts/__pycache__/
    Ignored:    code/scripts/liftOverBedpe/liftOverBedpe.py
    Ignored:    code/snakemake.dryrun.log
    Ignored:    code/snakemake.log
    Ignored:    code/snakemake.sbatch.log
    Ignored:    code/snakemake_profiles/slurm/__pycache__/
    Ignored:    code/test.introns.bed
    Ignored:    code/test.introns2.bed
    Ignored:    code/test.log
    Ignored:    code/tracks.xml
    Ignored:    data/.DS_Store
    Ignored:    data/GWAS_catalog_summary_stats_sources/._list_gwas_summary_statistics_6_Apr_2022-10.csv
    Ignored:    data/GWAS_catalog_summary_stats_sources/._list_gwas_summary_statistics_6_Apr_2022-11.csv
    Ignored:    data/GWAS_catalog_summary_stats_sources/._list_gwas_summary_statistics_6_Apr_2022-2.csv
    Ignored:    data/GWAS_catalog_summary_stats_sources/._list_gwas_summary_statistics_6_Apr_2022-3.csv
    Ignored:    data/GWAS_catalog_summary_stats_sources/._list_gwas_summary_statistics_6_Apr_2022-4.csv
    Ignored:    data/GWAS_catalog_summary_stats_sources/._list_gwas_summary_statistics_6_Apr_2022-5.csv
    Ignored:    data/GWAS_catalog_summary_stats_sources/._list_gwas_summary_statistics_6_Apr_2022-6.csv
    Ignored:    data/GWAS_catalog_summary_stats_sources/._list_gwas_summary_statistics_6_Apr_2022-7.csv
    Ignored:    data/GWAS_catalog_summary_stats_sources/._list_gwas_summary_statistics_6_Apr_2022-8.csv
    Ignored:    data/GWAS_catalog_summary_stats_sources/._list_gwas_summary_statistics_6_Apr_2022.csv
    Ignored:    data/Metaplots/.DS_Store

Untracked files:
    Untracked:  analysis/20230528_IdentifyBrianaPacBio.Rmd
    Untracked:  code/scripts/Plot_mbv_NoPredictedID.R

Unstaged changes:
    Modified:   analysis/MakeFinalFigs_Fig2.Rmd
    Modified:   code/Snakefile
    Modified:   code/config/ColocRunWildcards.tsv
    Modified:   code/rules/LongReads.smk
    Modified:   code/scripts/Plot_mbv.R

Note that any generated files, e.g. HTML, png, CSS, etc., are not included in this status report because it is ok for generated content to have uncommitted changes.

These are the previous versions of the repository in which changes were made to the R Markdown (analysis/MakeFinalFigs_Fig2.Rmd) and HTML (docs/MakeFinalFigs_Fig2.html) files. If you’ve configured a remote Git repository (see ?wflow_git_remote), click on the hyperlinks in the table below to view the files as they were in that past version.

File	Version	Author	Date	Message
Rmd	b8b6d6d	Benjmain Fair	2023-05-24	add new long table
Rmd	5946b5a	Benjmain Fair	2023-05-22	update site

knitr::opts_chunk$set(echo = TRUE, warning = F, message = F)

library(tidyverse)

── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──

✔ ggplot2 3.3.6     ✔ purrr   0.3.4
✔ tibble  3.1.7     ✔ dplyr   1.0.9
✔ tidyr   1.2.0     ✔ stringr 1.4.0
✔ readr   2.1.2     ✔ forcats 0.5.1

── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()

library(RColorBrewer)
library(data.table)


Attaching package: 'data.table'

The following objects are masked from 'package:dplyr':

    between, first, last

The following object is masked from 'package:purrr':

    transpose

library(readxl)
library(qvalue)
library(ggrepel)
library(knitr)
library(ggbreak)

ggbreak v0.1.1

If you use ggbreak in published research, please cite the following
paper:

S Xu, M Chen, T Feng, L Zhan, L Zhou, G Yu. Use ggbreak to effectively
utilize plotting space to deal with large datasets and outliers.
Frontiers in Genetics. 2021, 12:774846. doi: 10.3389/fgene.2021.774846

# Set theme
theme_set(
  theme_classic() +
  theme(text=element_text(size=16,  family="Helvetica")))

# I use layer a lot, to rotate long x-axis labels
Rotate_x_labels <- theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust=1))

List

Number QTLs

PhenotypeWildcardsAll <- Sys.glob("../code/QTLs/QTLTools/*/PermutationPass.FDR_Added.txt.gz") %>%
  str_replace("../code/QTLs/QTLTools/(.+?)/PermutationPass.FDR_Added.txt.gz", "\\1")

PhenotypeWildcards_ToIncluide <- c("APA_Nuclear", "APA_Total", "Expression.Splicing.Subset_YRI", "Expression.Splicing", "H3K27AC", "H3K36ME3", "H3K4ME3", "H3K4ME1", "MetabolicLabelled.30min", "MetabolicLabelled.60min", "chRNA.Expression.Splicing", "chRNA.Splicing", "polyA.Splicing", "polyA.Splicing.Subset_YRI", "MetabolicLabelled.30min.Splicing", "MetabolicLabelled.60min.Splicing")

GroupedPhenotypeWildcards_ToIncluide <- c( "chRNA.Splicing", "polyA.Splicing.Subset_YRI", "polyA.Splicing",  "MetabolicLabelled.30min.Splicing", "MetabolicLabelled.60min.Splicing")

UngroupedPhenotypeWildcards_ToIncluide <- setdiff(PhenotypeWildcards_ToIncluide, GroupedPhenotypeWildcards_ToIncluide)

Grouped.dat <- setNames(paste0("../code/QTLs/QTLTools/",GroupedPhenotypeWildcards_ToIncluide,"/GroupedPermutationPass.FDR_Added.txt.gz"), GroupedPhenotypeWildcards_ToIncluide) %>%
  lapply(fread) %>%
  bind_rows(.id="PC")

Ungrouped.dat <- setNames(paste0("../code/QTLs/QTLTools/",UngroupedPhenotypeWildcards_ToIncluide,"/PermutationPass.FDR_Added.txt.gz"), UngroupedPhenotypeWildcards_ToIncluide) %>%
  lapply(fread) %>%
  bind_rows(.id="PC")

Full.dat <- bind_rows(
  Grouped.dat, Ungrouped.dat
)

First plot number of individuals for each assay

SampleSize <- read_tsv("../code/QC/SampleSize.PerPhenotype.tsv", col_names=c("fn", "dummy", "n")) %>%
  mutate(PC = str_replace(fn,"QTLs/QTLTools/(.+?)/OnlyFirstReps.sorted.qqnorm.bed.gz", "\\1")) %>%
  filter(PC %in% c("Expression.Splicing", "Expression.Splicing.Subset_YRI", "chRNA.Expression.Splicing", "MetabolicLabelled.30min", "MetabolicLabelled.60min", "H3K27AC", "H3K4ME3", "H3K4ME1", "H3K36ME3", "APA_Nuclear", "APA_Total"))

SampleSize %>%
  dplyr::select(PC, n) %>%
  kable()

PC	n
Expression.Splicing	462
Expression.Splicing.Subset_YRI	89
chRNA.Expression.Splicing	86
MetabolicLabelled.30min	66
MetabolicLabelled.60min	66
H3K27AC	72
H3K4ME3	72
H3K4ME1	72
H3K36ME3	94
APA_Nuclear	52
APA_Total	52

SourcePublications <- setNames(c("Lappalainen et al", "Lappalainen et al", "This study", "Li et al", "Li et al", "Grubert et al", "Grubert et al", "Grubert et al", "This study", "Mittleman et al", "Mittleman et al"), SampleSize$PC)

Labels =  setNames(c("polyA RNA-seq*", "polyA RNA-seq", "chRNA-seq", "30m 4sU RNA-seq", "60m 4sU RNA-seq", "H3K4me1 ChIP-seq", "H3K27Ac ChIP-seq", "H3K4me3 ChIP-seq", "H3K36me3 Cut&Tag", "nuclear RNA 3'seq", "Total RNA 3'seq"), SampleSize$PC)

SampleSizes <- SampleSize %>%
  filter(!PC %in% "Expression.Splicing.Subset_YRI") %>%
  mutate(publication = recode(PC, !!!SourcePublications)) %>%
  mutate(Assay = factor(PC, levels=c("H3K4ME1", "H3K27AC", "H3K4ME3", "H3K36ME3", "chRNA.Expression.Splicing", "APA_Nuclear", "APA_Total", "MetabolicLabelled.30min","MetabolicLabelled.60min", "Expression.Splicing"))) %>%
  ggplot(aes(x=Assay, y=n)) +
  geom_col() +
  # geom_text(aes(label=n), color="black") +
  geom_text(aes(label=publication),angle=90, color="black", y=0, hjust=-0.05) +
  scale_x_discrete(labels=Labels) +
  scale_y_continuous(expand=c(0,0)) +
  scale_y_break(c(100,425), scales="fixed", ticklabels=c(425, 450, 475), expand=F, space=0.4) +
  Rotate_x_labels +
  labs(x="Assay", y="Sample size", caption=str_wrap("*Includes samples non-YRI donors, though some supplementary analyses only utilize the 89 YRI samples in this dataset", 30)) +
  theme(axis.text.x=element_text(size=rel(0.5)))

P.SampleSizes <- ggplotify::as.ggplot(print(SampleSizes))

P.SampleSizes

Now number of QTLs

relabel.df <- read_tsv("../data/Phenotypes_recode_for_Plotting.txt")

relabel.df %>% head()

# A tibble: 6 × 9
  PC        Alias ShorterAlias ShorterAlias2 Shortest Include Assay Alias_202305
  <chr>     <chr> <chr>        <chr>         <chr>    <lgl>   <chr> <chr>       
1 APA_Nucl… APA_… APA_Nuclear  APA           APA QTL… TRUE    Nucl… APA (Nuclea…
2 APA_Total APA_… APA_Total    APA           APA QTL… TRUE    Tota… APA (Total …
3 chRNA.Ex… Expr… ncRNA_chRNA  ncRNA_chRNA   <NA>     TRUE    <NA>  <NA>        
4 chRNA.Ex… Expr… ncRNA_chRNA  ncRNA_chRNA   <NA>     FALSE   <NA>  <NA>        
5 chRNA.Ex… chRN… ncRNA_chRNA  ncRNA_chRNA   <NA>     FALSE   <NA>  <NA>        
6 chRNA.Ex… chRN… ncRNA_chRNA  ncRNA_chRNA   <NA>     FALSE   <NA>  <NA>        
# … with 1 more variable: OrderFactor <dbl>

Full.dat$PC %>% unique()

 [1] "chRNA.Splicing"                   "polyA.Splicing.Subset_YRI"       
 [3] "polyA.Splicing"                   "MetabolicLabelled.30min.Splicing"
 [5] "MetabolicLabelled.60min.Splicing" "APA_Nuclear"                     
 [7] "APA_Total"                        "Expression.Splicing.Subset_YRI"  
 [9] "Expression.Splicing"              "H3K27AC"                         
[11] "H3K36ME3"                         "H3K4ME3"                         
[13] "H3K4ME1"                          "MetabolicLabelled.30min"         
[15] "MetabolicLabelled.60min"          "chRNA.Expression.Splicing"

MolQTL_Labels <- relabel.df %>% dplyr::select(PC, Shortest) %>% deframe()
MolQTL_AssayLabels <- relabel.df %>% dplyr::select(PC, Assay) %>% deframe()
Factor_Order <- relabel.df %>% dplyr::select(PC, OrderFactor) %>% deframe()

NumQTLs.dat <- Full.dat %>%
  filter(PC %in% c("APA_Total", "APA_Nuclear")) %>%
  separate(phe_id, into=c("gene", "region", "peak"), sep="_") %>%
  arrange(adj_emp_pval) %>% 
  group_by(gene) %>% slice(1) %>%
  bind_rows(
    Full.dat %>%
      filter(!PC %in% c("APA_Total", "APA_Nuclear"))
  ) %>%
  group_by(PC) %>%
  summarise(
    FDR_10 = sum(q<0.1, na.rm=T),
    FDR_05 = sum(q<0.05,  na.rm=T),
    FDR_01 = sum(q<0.01,  na.rm=T),
    TestFeats=n()) %>%
  gather("FDR", "n", -PC, -TestFeats) %>%
  mutate(PC_relabeled = recode(PC, !!!MolQTL_Labels)) %>%
  mutate(Assay_relabeled = recode(PC, !!!MolQTL_AssayLabels)) %>%
  mutate(FinalLabels = paste(PC_relabeled, Assay_relabeled, sep="\n")) %>%
  mutate(OrderFactor = recode(PC, !!!Factor_Order)) %>%
  mutate(FinalLabels = fct_reorder(FinalLabels, OrderFactor))

Num.QTLs <- ggplot(NumQTLs.dat, aes(x=FinalLabels, y=n, fill=FDR)) +
  geom_col(position="identity", width=0.8) +
  Rotate_x_labels +
  scale_y_continuous(trans="log10", breaks=10**(1:5), labels=c(10, 100, "1K", "10K", "100K")) +
  # geom_text_repel(
  #   data = . %>%
  #     gather("FDR","n",-PC),
  #     aes(label=n, y=n), angle=90, hjust=-0.05,
  #   min.segment.length = unit(0, 'lines'),
  #   color="black") +
  geom_text(
    data = . %>%
      filter(FDR== "FDR_10"),
      aes(label=str_glue("{n} / {TestFeats}")), angle=90, hjust=0,
    color="black") +
  coord_cartesian(ylim=c(100,1E5)) +
  scale_fill_manual(values=c("FDR_10"="#ffeda0", "FDR_05"="#feb24c", "FDR_01"="#f03b20"), labels=c("FDR_10"="10%", "FDR_05"="5%", "FDR_01"="1%")) +
  # scale_x_discrete(labels=c())
  labs(x="molQTL type, assay", y="Number QTLs", fill="FDR", caption=str_wrap("*Each sQTL/APA-QTL count here represents a likely independent genetic effect that may effect multiple introns/APA-sites within a local leafcutter cluster of introns/APA-sites", 40)) +
  theme(axis.text.x = element_text(size=9, hjust=1, angle=70))

Num.QTLs

NumQTLs.dat %>% kable()

PC	TestFeats	FDR	n	PC_relabeled	Assay_relabeled	FinalLabels
APA_Nuclear	4941	FDR_10	277	APA QTLs*	Nuclear RNA 3’seq	APA QTLs*
Nuclear RNA 3’seq	7.0
APA_Total	2997	FDR_10	133	APA QTLs*	Total RNA 3’seq	APA QTLs*
Total RNA 3’seq	14.0
Expression.Splicing	14000	FDR_10	11966	Expression; eQTLs	polyA RNA-seq	Expression; eQTLs
polyA RNA-seq	10.0
Expression.Splicing.Subset_YRI	14000	FDR_10	3970	Expression; eQTLs (YRI only)	polyA RNA-seq	Expression; eQTLs (YRI only)
polyA RNA-seq	11.0
H3K27AC	100858	FDR_10	10222	H3K27Ac; hQTLs	H3K27Ac ChIP-seq	H3K27Ac; hQTLs
H3K27Ac ChIP-seq	2.0
H3K36ME3	14000	FDR_10	754	H3K36me3; hQTLs	H3K36me3 ChIP-seq	H3K36me3; hQTLs
H3K36me3 ChIP-seq	4.0
H3K4ME1	182749	FDR_10	6328	H3K4me1; hQTLs	H3K4me1 ChIP-seq	H3K4me1; hQTLs
H3K4me1 ChIP-seq	1.0
H3K4ME3	55769	FDR_10	4602	H3K4me3; hQTLs	H3K4me3 ChIP-seq	H3K4me3; hQTLs
H3K4me3 ChIP-seq	3.0
MetabolicLabelled.30min	14000	FDR_10	955	Expression; eQTLs	30min 4sU RNA-seq	Expression; eQTLs
30min 4sU RNA-seq	8.5
MetabolicLabelled.30min.Splicing	35099	FDR_10	390	Splicing	30min 4sU RNA-seq	Splicing
30min 4sU RNA-seq	8.0
MetabolicLabelled.60min	13993	FDR_10	960	Expression; eQTLs	60min 4sU RNA-seq	Expression; eQTLs
60min 4sU RNA-seq	9.0
MetabolicLabelled.60min.Splicing	35648	FDR_10	464	Splicing	60min 4sU RNA-seq	Splicing
60min 4sU RNA-seq	9.5
chRNA.Expression.Splicing	14000	FDR_10	2976	Expression; eQTLs	chRNA-seq	Expression; eQTLs
chRNA-seq	5.0
chRNA.Splicing	41406	FDR_10	3052	Splicing; sQTLs*	chRNA-seq	Splicing; sQTLs*
chRNA-seq	6.0
polyA.Splicing	38280	FDR_10	8637	Splicing; sQTLs*	polyA RNA-seq	Splicing; sQTLs*
polyA RNA-seq	12.0
polyA.Splicing.Subset_YRI	38280	FDR_10	2295	Splicing; sQTLs* (YRI only)	polyA RNA-seq	Splicing; sQTLs* (YRI only)
polyA RNA-seq	13.0
APA_Nuclear	4941	FDR_05	217	APA QTLs*	Nuclear RNA 3’seq	APA QTLs*
Nuclear RNA 3’seq	7.0
APA_Total	2997	FDR_05	84	APA QTLs*	Total RNA 3’seq	APA QTLs*
Total RNA 3’seq	14.0
Expression.Splicing	14000	FDR_05	10403	Expression; eQTLs	polyA RNA-seq	Expression; eQTLs
polyA RNA-seq	10.0
Expression.Splicing.Subset_YRI	14000	FDR_05	2834	Expression; eQTLs (YRI only)	polyA RNA-seq	Expression; eQTLs (YRI only)
polyA RNA-seq	11.0
H3K27AC	100858	FDR_05	7380	H3K27Ac; hQTLs	H3K27Ac ChIP-seq	H3K27Ac; hQTLs
H3K27Ac ChIP-seq	2.0
H3K36ME3	14000	FDR_05	540	H3K36me3; hQTLs	H3K36me3 ChIP-seq	H3K36me3; hQTLs
H3K36me3 ChIP-seq	4.0
H3K4ME1	182749	FDR_05	4383	H3K4me1; hQTLs	H3K4me1 ChIP-seq	H3K4me1; hQTLs
H3K4me1 ChIP-seq	1.0
H3K4ME3	55769	FDR_05	3246	H3K4me3; hQTLs	H3K4me3 ChIP-seq	H3K4me3; hQTLs
H3K4me3 ChIP-seq	3.0
MetabolicLabelled.30min	14000	FDR_05	596	Expression; eQTLs	30min 4sU RNA-seq	Expression; eQTLs
30min 4sU RNA-seq	8.5
MetabolicLabelled.30min.Splicing	35099	FDR_05	274	Splicing	30min 4sU RNA-seq	Splicing
30min 4sU RNA-seq	8.0
MetabolicLabelled.60min	13993	FDR_05	640	Expression; eQTLs	60min 4sU RNA-seq	Expression; eQTLs
60min 4sU RNA-seq	9.0
MetabolicLabelled.60min.Splicing	35648	FDR_05	362	Splicing	60min 4sU RNA-seq	Splicing
60min 4sU RNA-seq	9.5
chRNA.Expression.Splicing	14000	FDR_05	2039	Expression; eQTLs	chRNA-seq	Expression; eQTLs
chRNA-seq	5.0
chRNA.Splicing	41406	FDR_05	2345	Splicing; sQTLs*	chRNA-seq	Splicing; sQTLs*
chRNA-seq	6.0
polyA.Splicing	38280	FDR_05	7344	Splicing; sQTLs*	polyA RNA-seq	Splicing; sQTLs*
polyA RNA-seq	12.0
polyA.Splicing.Subset_YRI	38280	FDR_05	1766	Splicing; sQTLs* (YRI only)	polyA RNA-seq	Splicing; sQTLs* (YRI only)
polyA RNA-seq	13.0
APA_Nuclear	4941	FDR_01	114	APA QTLs*	Nuclear RNA 3’seq	APA QTLs*
Nuclear RNA 3’seq	7.0
APA_Total	2997	FDR_01	32	APA QTLs*	Total RNA 3’seq	APA QTLs*
Total RNA 3’seq	14.0
Expression.Splicing	14000	FDR_01	8172	Expression; eQTLs	polyA RNA-seq	Expression; eQTLs
polyA RNA-seq	10.0
Expression.Splicing.Subset_YRI	14000	FDR_01	1581	Expression; eQTLs (YRI only)	polyA RNA-seq	Expression; eQTLs (YRI only)
polyA RNA-seq	11.0
H3K27AC	100858	FDR_01	4394	H3K27Ac; hQTLs	H3K27Ac ChIP-seq	H3K27Ac; hQTLs
H3K27Ac ChIP-seq	2.0
H3K36ME3	14000	FDR_01	292	H3K36me3; hQTLs	H3K36me3 ChIP-seq	H3K36me3; hQTLs
H3K36me3 ChIP-seq	4.0
H3K4ME1	182749	FDR_01	2422	H3K4me1; hQTLs	H3K4me1 ChIP-seq	H3K4me1; hQTLs
H3K4me1 ChIP-seq	1.0
H3K4ME3	55769	FDR_01	1903	H3K4me3; hQTLs	H3K4me3 ChIP-seq	H3K4me3; hQTLs
H3K4me3 ChIP-seq	3.0
MetabolicLabelled.30min	14000	FDR_01	285	Expression; eQTLs	30min 4sU RNA-seq	Expression; eQTLs
30min 4sU RNA-seq	8.5
MetabolicLabelled.30min.Splicing	35099	FDR_01	160	Splicing	30min 4sU RNA-seq	Splicing
30min 4sU RNA-seq	8.0
MetabolicLabelled.60min	13993	FDR_01	302	Expression; eQTLs	60min 4sU RNA-seq	Expression; eQTLs
60min 4sU RNA-seq	9.0
MetabolicLabelled.60min.Splicing	35648	FDR_01	217	Splicing	60min 4sU RNA-seq	Splicing
60min 4sU RNA-seq	9.5
chRNA.Expression.Splicing	14000	FDR_01	1162	Expression; eQTLs	chRNA-seq	Expression; eQTLs
chRNA-seq	5.0
chRNA.Splicing	41406	FDR_01	1661	Splicing; sQTLs*	chRNA-seq	Splicing; sQTLs*
chRNA-seq	6.0
polyA.Splicing	38280	FDR_01	5684	Splicing; sQTLs*	polyA RNA-seq	Splicing; sQTLs*
polyA RNA-seq	12.0
polyA.Splicing.Subset_YRI	38280	FDR_01	1178	Splicing; sQTLs* (YRI only)	polyA RNA-seq	Splicing; sQTLs* (YRI only)
polyA RNA-seq	13.0

NumQTLs.dat %>%
  filter(FDR=="FDR_10") %>%
  pull(n) %>% sum()

[1] 57981

NumQTLs.dat %>%
  filter(FDR=="FDR_10") %>%
  pull(TestFeats) %>% sum()

[1] 620020

Box plots for TTC3

first do some quick data searching to figure out exactly the names of the phenotypes I want to make boxpltos for…

dat <- fread("../code/pi1/PairwisePi1Traits.P.all.txt.gz")

PeaksToTSS <- Sys.glob("../code/Misc/PeaksClosestToTSS/*_assigned.tsv.gz") %>%
  setNames(str_replace(., "../code/Misc/PeaksClosestToTSS/(.+?)_assigned.tsv.gz", "\\1")) %>%
  lapply(read_tsv) %>%
  bind_rows(.id="ChromatinMark") %>%
  mutate(GenePeakPair = paste(gene, peak, sep = ";")) %>%
  distinct(ChromatinMark, peak, gene, .keep_all=T)

PeaksToTSS %>%
  filter(ChromatinMark == "H3K4ME3") %>%
  filter(str_detect(gene, "ENSG00000075234"))

dat %>%
  filter(PC1 == "Expression.Splicing") %>%
  filter(str_detect(P1, "ENSG00000075234")) %>%
  # pull(PC2) %>% unique()
  filter(PC2 == "polyA.Splicing")

dat %>%
  filter(PC1 == "Expression.Splicing") %>%
  filter(str_detect(P1, "ENSG00000075234")) %>%
  filter(P2 %in% c("H3K4ME3_peak_31179", "22:46292328:46292791:clu_48515_+"))

So I’m looking to make boxplots for H3K4ME3_peak_31179 (hQTL), ENSG00000075234.17 (polyA eQTL), and chRNA sQTL and polyA sQTL for 22:46292328:46292791:clu_48515_+

TopSNP: “22:46291323:G:A”

Now use my helper script to get genotype/phenotype data for the top eSNP for those traits..

cd ../code/scripts/GenometracksByGenotype

python ExtractPhenotypeBedByGenotype.py --Bed ../../QTLs/QTLTools/Expression.Splicing/OnlyFirstReps.sorted.qqnorm.bed.gz --VCF ../../Genotypes/1KG_GRCh38/Autosomes.vcf.gz --FeatureName ENSG00000075234.17 --SnpPos chr22:46291323 > ../../scratch/Boxplot.dat.eQTL.tsv

python ExtractPhenotypeBedByGenotype.py --Bed ../../QTLs/QTLTools/polyA.Splicing/OnlyFirstReps.sorted.qqnorm.bed.gz --VCF ../../Genotypes/1KG_GRCh38/Autosomes.vcf.gz --FeatureName 22:46292328:46292791:clu_48515_+  --SnpPos chr22:46291323 > ../../scratch/Boxplot.dat.polyAsQTL.tsv

python ExtractPhenotypeBedByGenotype.py --Bed ../../QTLs/QTLTools/polyA.Splicing/OnlyFirstReps.sorted.qqnorm.bed.gz --VCF ../../Genotypes/1KG_GRCh38/Autosomes.vcf.gz --FeatureName 22:46292328:46292791:clu_48515_+  --SnpPos chr22:46291323 > ../../scratch/Boxplot.dat.polyAsQTL.tsv

python ExtractPhenotypeBedByGenotype.py --Bed ../../QTLs/QTLTools/H3K4ME3/OnlyFirstReps.sorted.qqnorm.bed.gz --VCF ../../Genotypes/1KG_GRCh38/Autosomes.vcf.gz --FeatureName H3K4ME3_peak_31179  --SnpPos chr22:46291323 > ../../scratch/Boxplot.dat.PromoterhQTL.tsv

Now let’s read in that data and do boxplots…

dat.for.boxplots <- Sys.glob("../code/scratch/Boxplot.dat.*.tsv") %>%
  setNames(str_replace(., "../code/scratch/Boxplot.dat.(.+?).tsv", "\\1")) %>%
  lapply(read_tsv, col_names=c("SampleID", "P", "genotype", "Ref", "Alt", "ID"), skip=1) %>%
  bind_rows(.id="PC")

dat.for.boxplots %>%
  filter(!genotype==3) %>%
  mutate(genotype= factor(genotype, levels=c(0,1,2))) %>%
  ggplot(aes(x=genotype, y=P, group=genotype)) +
  geom_boxplot(outlier.shape = NA, aes(color=genotype)) +
  # geom_jitter(width=0.25, alpha=0.1) +
  geom_smooth(method='lm', aes(group=1), se=F, color="gray") +
  scale_x_discrete(labels=c("0"="GG", "1"="GA", "2"="AA")) +
  scale_color_manual(values=c("0"="#f03b20", "1"="#7a0177", "2"="#08519c")) +
  # scale_x_discrete() +
  facet_wrap(~PC) +
  theme(legend.position="none") +
  labs(x=NULL, color=NULL)

dat.for.boxplots$PC %>% unique()

dat.for.boxplots %>%
  filter(!genotype==3) %>%
  filter(PC=="PromoterhQTL") %>%
  mutate(genotype= factor(genotype, levels=c(0,1,2))) %>%
  ggplot(aes(x=genotype, y=P, group=genotype)) +
  geom_boxplot(outlier.shape = NA, aes(color=genotype)) +
  # geom_jitter(width=0.25, alpha=0.1) +
  geom_smooth(method='lm', aes(group=1), se=F, color="gray") +
  scale_x_discrete(labels=c("0"="GG", "1"="GA", "2"="AA")) +
  scale_color_manual(values=c("0"="#f03b20", "1"="#7a0177", "2"="#08519c")) +
  geom_text(x=-Inf, y=Inf, label=" beta=0.15\n P=0.48", hjust=0, vjust=1, size=5.5) +
  # scale_x_discrete() +
  theme(legend.position="none") +
  Rotate_x_labels +
  labs(x=NULL, color=NULL, y="H3K4me3 @ promoter")
ggsave("/project2/yangili1/carlos_and_ben_shared/rough_figs/OriginalSubplots/2023_FinalFig2_ExampleBoxplot_hQTL.pdf", height=3, width=2)

dat.for.boxplots %>%
  filter(!genotype==3) %>%
  filter(PC=="eQTL") %>%
  mutate(genotype= factor(genotype, levels=c(0,1,2))) %>%
  ggplot(aes(x=genotype, y=P, group=genotype)) +
  geom_boxplot(outlier.shape = NA, aes(color=genotype)) +
  # geom_jitter(width=0.25, alpha=0.1) +
  geom_smooth(method='lm', aes(group=1), se=F, color="gray") +
  scale_x_discrete(labels=c("0"="GG", "1"="GA", "2"="AA")) +
  scale_color_manual(values=c("0"="#f03b20", "1"="#7a0177", "2"="#08519c")) +
  geom_text(x=-Inf, y=Inf, label=" beta=1.1\n P<5E-43", hjust=0, vjust=1, size=5.5) +
  # scale_x_discrete() +
  theme(legend.position="none") +
  Rotate_x_labels +
  labs(x=NULL, color=NULL, y="TTC38 expression")
ggsave("/project2/yangili1/carlos_and_ben_shared/rough_figs/OriginalSubplots/2023_FinalFig2_ExampleBoxplot_eQTL.pdf", height=3, width=2)

dat.for.boxplots %>%
  filter(!genotype==3) %>%
  filter(PC=="polyAsQTL") %>%
  mutate(genotype= factor(genotype, levels=c(0,1,2))) %>%
  ggplot(aes(x=genotype, y=P, group=genotype)) +
  geom_boxplot(outlier.shape = NA, aes(color=genotype)) +
  # geom_jitter(width=0.25, alpha=0.1) +
  geom_smooth(method='lm', aes(group=1), se=F, color="gray") +
  scale_x_discrete(labels=c("0"="GG", "1"="GA", "2"="AA")) +
  scale_color_manual(values=c("0"="#f03b20", "1"="#7a0177", "2"="#08519c")) +
  geom_text(x=-Inf, y=Inf, label=" beta=-1.1\n P<3e-24", hjust=0, vjust=1, size=5.5) +
  # scale_x_discrete() +
  theme(legend.position="none") +
  Rotate_x_labels +
  labs(x=NULL, color=NULL, y="Splicing of cassette exon")
ggsave("/project2/yangili1/carlos_and_ben_shared/rough_figs/OriginalSubplots/2023_FinalFig2_ExampleBoxplot_polyAsQTL.pdf", height=3, width=2)

dat.for.boxplots %>%
  filter(!genotype==3) %>%
  filter(PC=="chRNAsQTL") %>%
  mutate(genotype= factor(genotype, levels=c(0,1,2))) %>%
  ggplot(aes(x=genotype, y=P, group=genotype)) +
  geom_boxplot(outlier.shape = NA, aes(color=genotype)) +
  # geom_jitter(width=0.25, alpha=0.1) +
  geom_smooth(method='lm', aes(group=1), se=F, color="gray") +
  scale_x_discrete(labels=c("0"="GG", "1"="GA", "2"="AA")) +
  scale_color_manual(values=c("0"="#f03b20", "1"="#7a0177", "2"="#08519c")) +
  geom_text(x=-Inf, y=Inf, label=" beta=-1.3\n P<4e-19", hjust=0, vjust=1, size=5.5) +
  # scale_x_discrete() +
  theme(legend.position="none") +
  Rotate_x_labels +
  labs(x=NULL, color=NULL, y="Splicing of cassette exon")
ggsave("/project2/yangili1/carlos_and_ben_shared/rough_figs/OriginalSubplots/2023_FinalFig2_ExampleBoxplot_chRNAsQTL.pdf", height=3, width=2)

Classification of unannotated and unproductive juncs

Actually a supplement related to fig1

juncs.long <- fread("../code/SplicingAnalysis/CombinedJuncTables/YRI.tsv.gz")

IntronAnnotations <- read_tsv("../data/IntronAnnotationsFromYang.tsv.gz") %>%
  mutate(stop = end-1)

juncs.long.summary <- juncs.long %>%
  dplyr::select(chrom, start, stop, strand, Dataset, Count) %>%
  inner_join(IntronAnnotations) %>%
  group_by(Dataset, SemiSupergroupAnnotations, SuperAnnotation) %>%
  summarise(Sum=sum(Count))


P.PercentAnnotated <- juncs.long.summary %>%
  ungroup() %>%
  filter(Dataset %in% c("Expression.Splicing", "chRNA.Expression.Splicing")) %>%
  filter(!str_detect(SuperAnnotation, "Noncoding")) %>%
  group_by(Dataset, SuperAnnotation) %>%
  summarise(TotalSum = sum(Sum)) %>%
  ungroup() %>%
  mutate(PlotGroup = if_else(SuperAnnotation=="AnnotatedJunc_ProductiveCodingGene", "Annotated productive (basic tag)", "Other")) %>%
  group_by(Dataset) %>%
  mutate(DatasetSum = sum(TotalSum)) %>%
  ungroup() %>%
  group_by(Dataset, PlotGroup) %>%
  summarise(Percent=sum(TotalSum)/DatasetSum) %>%
  ungroup() %>%
  distinct() %>%
  mutate(Dataset = factor(Dataset, levels=c("Expression.Splicing", "chRNA.Expression.Splicing"))) %>%
  ggplot(aes(x=Dataset, y=Percent*100, fill=PlotGroup)) +
  geom_col() +
  scale_x_discrete(labels=c("Expression.Splicing"="polyA RNA", "chRNA.Expression.Splicing"="chRNA")) +
  scale_y_continuous(expand=c(0,0)) +
  scale_fill_manual(values=c("Annotated productive (basic tag)"="#1f78b4", "Other"="#969696")) + 
  scale_y_break(c(5, 95), expand=F, space=0.5) +
  labs(y="Percent of splice junction reads", fill="Intron classification") +
  theme(legend.position="none")
P.PercentAnnotated <- ggplotify::as.ggplot(print(P.PercentAnnotated))

P.PercentAnnotated

P.PercentOfUnannotated <- juncs.long.summary %>%
  ungroup() %>%
  filter(Dataset %in% c("chRNA.Expression.Splicing")) %>%
  filter(!str_detect(SuperAnnotation, "Noncoding")) %>%
  filter(SuperAnnotation!="AnnotatedJunc_ProductiveCodingGene") %>%
  mutate(PercentOfNon_AnnotatedProductive = Sum/sum(Sum)*100) %>%
  arrange(desc(SuperAnnotation), PercentOfNon_AnnotatedProductive) %>%
  mutate(n=row_number()) %>%
  mutate(SemiSupergroupAnnotations=fct_reorder(SemiSupergroupAnnotations, n)) %>%
  ggplot(aes(x=Dataset, y=PercentOfNon_AnnotatedProductive, fill=SemiSupergroupAnnotations)) +
  geom_col()
P.PercentOfUnannotated

One more version… with fixed scale… borrowed code from here

position_stack_and_nudge <- function(x = 0, y = 0, vjust = 1, reverse = FALSE) {
  ggproto(NULL, PositionStackAndNudge,
    x = x,
    y = y,
    vjust = vjust,
    reverse = reverse
  )
}

#' @rdname ggplot2-ggproto
#' @format NULL
#' @usage NULL
#' @noRd
PositionStackAndNudge <- ggproto("PositionStackAndNudge", PositionStack,
  x = 0,
  y = 0,

  setup_params = function(self, data) {
    c(
        list(x = self$x, y = self$y),
        ggproto_parent(PositionStack, self)$setup_params(data)
    )
  },

  compute_layer = function(self, data, params, panel) {
    # operate on the stacked positions (updated in August 2020)
    data = ggproto_parent(PositionStack, self)$compute_layer(data, params, panel)

    x_orig <- data$x
    y_orig <- data$y
    # transform only the dimensions for which non-zero nudging is requested
    if (any(params$x != 0)) {
      if (any(params$y != 0)) {
        data <- transform_position(data, function(x) x + params$x, function(y) y + params$y)
      } else {
        data <- transform_position(data, function(x) x + params$x, NULL)
      }
    } else if (any(params$y != 0)) {
      data <- transform_position(data, function(x) x, function(y) y + params$y)
    }
    data$nudge_x <- data$x
    data$nudge_y <- data$y
    data$x <- x_orig
    data$y <- y_orig

    data
  },

  compute_panel = function(self, data, params, scales) {
      ggproto_parent(PositionStack, self)$compute_panel(data, params, scales)
  }
)

AnnotationColors <- IntronAnnotations %>%
  distinct(SemiSupergroupAnnotations, SuperAnnotation) %>%
  mutate(Color = recode(SuperAnnotation, !!!c(
    "AnnotatedJunc_NoncodingGene"="#6a3d9a",
    "UnannotatedJunc_NoncodingGene"="#cab2d6",
    "AnnotatedJunc_ProductiveCodingGene"="#1f78b4",
    "UnannotatedJunc_ProductiveCodingGene"="#a6cee3",
    "AnnotatedJunc_UnproductiveCodingGene"="#e31a1c",
    "UnannotatedJunc_UnproductiveCodingGene"="#fb9a99")))

P.PercentOfUnannotated <- juncs.long.summary %>%
  ungroup() %>%
  filter(Dataset %in% c("chRNA.Expression.Splicing")) %>%
  filter(!str_detect(SuperAnnotation, "Noncoding")) %>%
  filter(SuperAnnotation!="AnnotatedJunc_ProductiveCodingGene") %>%
  mutate(PercentOfNon_AnnotatedProductive = Sum/sum(Sum)*100) %>%
  mutate(SemiSupergroupAnnotations=if_else(Sum <= 996286 & SuperAnnotation=="UnannotatedJunc_UnproductiveCodingGene", "Other predicted non-productive", SemiSupergroupAnnotations)) %>%
  group_by(Dataset, SemiSupergroupAnnotations, SuperAnnotation) %>%
  summarise(PercentOfNon_AnnotatedProductive=sum(PercentOfNon_AnnotatedProductive)) %>%
  ungroup() %>%
  arrange(desc(SuperAnnotation), PercentOfNon_AnnotatedProductive) %>%
  mutate(n=row_number()) %>%
  mutate(SemiSupergroupAnnotations=fct_reorder(SemiSupergroupAnnotations, n)) %>%
  ggplot(aes(x=Dataset, y=PercentOfNon_AnnotatedProductive, fill=SemiSupergroupAnnotations)) +
  geom_col(color='black') +
  geom_text_repel(
      aes(label = SemiSupergroupAnnotations, x=1.4),
      max.overlaps = Inf,
      min.segment.length=0,
      label.padding=1,
      position = position_stack_and_nudge(vjust = 0.5, x = 1),
      direction = 'both',
      arrow=NULL
  ) +
  scale_y_continuous(expand=c(0,0)) +
  scale_x_discrete(expand=c(0,0)) +
  scale_fill_manual(values=AnnotationColors %>% dplyr::select(SemiSupergroupAnnotations, Color) %>% deframe(), na.value="#fb9a99") +
  theme(legend.position='none') +
  theme(axis.title.x=element_blank(),
        axis.text.x=element_blank(),
        axis.ticks.x=element_blank()) +
  labs(y="Percent junction reads amongst other")

P.PercentOfUnannotated

P.PercentJuncsLegend.dat <- 
  AnnotationColors %>%
  distinct(SuperAnnotation, .keep_all=T) %>%
  filter(!str_detect(SuperAnnotation, "Noncoding"))

P.PercentJuncsLegend <- ggplot(P.PercentJuncsLegend.dat, aes(x=SuperAnnotation, y=1, fill=SuperAnnotation)) +
  geom_col() +
  scale_fill_manual(values=P.PercentJuncsLegend.dat %>%
                      dplyr::select(SuperAnnotation, Color) %>% deframe(), labels=c("AnnotatedJunc_UnproductiveCodingGene"="Annotated unproductive", "AnnotatedJunc_ProductiveCodingGene"="Annotated productive", "UnannotatedJunc_ProductiveCodingGene"="Unannotated unproductive", "UnannotatedJunc_UnproductiveCodingGene"="Unannotated productive")) +
  labs(fill="Intron classification")

Number eQTLs per gene, hQTLs, sQTLs

dat <- fread("../code/pi1/PairwisePi1Traits.P.all.txt.gz")

dat$PC2 %>% unique()

 [1] "Expression.Splicing.Subset_YRI" "chRNA.Expression.Splicing"     
 [3] "MetabolicLabelled.30min"        "MetabolicLabelled.60min"       
 [5] "H3K27AC"                        "H3K4ME3"                       
 [7] "H3K4ME1"                        "H3K36ME3"                      
 [9] "APA_Nuclear"                    "APA_Total"                     
[11] "polyA.Splicing"                 "polyA.Splicing.Subset_YRI"     
[13] "chRNA.Splicing"                 "ProCap"                        
[15] "Expression.Splicing"

NumMolQTLsPerGene <- dat %>%
  filter(PC1=="Expression.Splicing" & FDR.x < 0.1) %>%
  mutate(P2 = case_when(
    PC2 %in% c("polyA.Splicing", "chRNA.Splicing", "polyA.Splicing.Subset_YRI") ~ str_extract(P2, "clu_.+$"),
    TRUE ~ P2
  )) %>%
  group_by(PC1, PC2, P2, GeneLocus) %>%
  slice(which.min(FDR.y)) %>%
  ungroup() %>%
  mutate(IsPc2SigFDR = FDR.y < 0.1) %>%
  mutate(IsPc2SigPermutationP = p_permutation.y < 0.01) %>%
  group_by(PC2, GeneLocus) %>%
  summarise(Num_PC2_QTLs_FDR = sum(IsPc2SigFDR),
            Num_PC2_QTLs_PermutationP = sum(IsPc2SigPermutationP)) %>%
  ungroup()

NumMolQTLsPerGene %>%
  filter(PC2 %in% c("H3K27AC", "polyA.Splicing")) %>%
  ggplot(aes(x=Num_PC2_QTLs_PermutationP)) +
  stat_ecdf() +
  facet_wrap(~PC2) +
  coord_cartesian(xlim=c(0,5))

NumMolQTLsPerGene %>%
  filter(PC2 %in% c("H3K27AC", "polyA.Splicing")) %>%
  ggplot(aes(x=Num_PC2_QTLs_FDR)) +
  stat_ecdf() +
  facet_wrap(~PC2) +
  coord_cartesian(xlim=c(0,5))

NumMolQTLsPerGene %>%
  ggplot(aes(x=Num_PC2_QTLs_PermutationP)) +
  stat_ecdf() +
  facet_wrap(~PC2) +
  coord_cartesian(xlim=c(0,5))

Ok, now let’s plot that nicer for saving…

eGenes <- dat %>%
  filter(PC1=="Expression.Splicing" & FDR.x < 0.1) %>%
  pull(P1) %>% unique()

P.Number_hQTLsForColoc <- NumMolQTLsPerGene %>%
  filter(GeneLocus %in% eGenes) %>%
  # filter(PC2 %in% c("H3K27AC", "polyA.Splicing")) %>%
  filter(PC2=="H3K27AC") %>%
  mutate(n = if_else(Num_PC2_QTLs_PermutationP>5, as.integer(5), Num_PC2_QTLs_PermutationP)) %>%
  count(PC2, n) %>%
  ggplot(aes(x=n, y=nn)) +
  geom_col() +
  scale_x_continuous(breaks=0:5, labels=c(0:4, ">4")) +
  labs(x="Number H3K27AC hQTLs w/in 100kb of gene", y="Number eGenes")
NumMolQTLsPerGene %>%
  filter(GeneLocus %in% eGenes) %>%
  # filter(PC2 %in% c("H3K27AC", "polyA.Splicing")) %>%
  filter(PC2=="H3K27AC") %>%
  mutate(n = if_else(Num_PC2_QTLs_PermutationP>5, as.integer(5), Num_PC2_QTLs_PermutationP)) %>%
  count(PC2, n)

# A tibble: 6 × 3
  PC2         n    nn
  <chr>   <int> <int>
1 H3K27AC     0  4241
2 H3K27AC     1  3313
3 H3K27AC     2  1942
4 H3K27AC     3  1064
5 H3K27AC     4   608
6 H3K27AC     5   756

P.Number_sQTLsForColoc <- NumMolQTLsPerGene %>%
  filter(GeneLocus %in% eGenes) %>%
  # filter(PC2 %in% c("H3K27AC", "polyA.Splicing")) %>%
  filter(PC2=="polyA.Splicing") %>%
  mutate(n = if_else(Num_PC2_QTLs_PermutationP>5, as.integer(5), Num_PC2_QTLs_PermutationP)) %>%
  count(PC2, n) %>%
  ggplot(aes(x=n, y=nn)) +
  geom_col() +
  scale_x_continuous(breaks=0:5, labels=c(0:4, ">4")) +
  labs(x="Number sQTLs within gene", y="Number eGenes")
NumMolQTLsPerGene %>%
  filter(GeneLocus %in% eGenes) %>%
  # filter(PC2 %in% c("H3K27AC", "polyA.Splicing")) %>%
  filter(PC2=="polyA.Splicing") %>%
  mutate(n = if_else(Num_PC2_QTLs_PermutationP>5, as.integer(5), Num_PC2_QTLs_PermutationP)) %>%
  count(PC2, n)

# A tibble: 6 × 3
  PC2                n    nn
  <chr>          <int> <int>
1 polyA.Splicing     0  4895
2 polyA.Splicing     1  2926
3 polyA.Splicing     2  1022
4 polyA.Splicing     3   305
5 polyA.Splicing     4   104
6 polyA.Splicing     5   124

P.Number_hQTLsForColoc

P.Number_sQTLsForColoc

Same plots, but only considering eGenes discovered in YRI subset

eGenes <- dat %>%
  filter(PC1=="Expression.Splicing.Subset_YRI" & FDR.x < 0.1) %>%
  pull(P1) %>% unique()

P.Number_hQTLsForColoc <- NumMolQTLsPerGene %>%
  filter(GeneLocus %in% eGenes) %>%
  # filter(PC2 %in% c("H3K27AC", "polyA.Splicing")) %>%
  filter(PC2=="H3K27AC") %>%
  mutate(n = if_else(Num_PC2_QTLs_PermutationP>5, as.integer(5), Num_PC2_QTLs_PermutationP)) %>%
  count(PC2, n) %>%
  ggplot(aes(x=n, y=nn)) +
  geom_col() +
  scale_x_continuous(breaks=0:5, labels=c(0:4, ">4")) +
  scale_y_continuous(expand=c(0,0)) +
  labs(x="Number H3K27AC hQTLs w/in 100kb of gene", y="Number eGenes")
NumMolQTLsPerGene %>%
  filter(GeneLocus %in% eGenes) %>%
  # filter(PC2 %in% c("H3K27AC", "polyA.Splicing")) %>%
  filter(PC2=="H3K27AC") %>%
  mutate(n = if_else(Num_PC2_QTLs_PermutationP>5, as.integer(5), Num_PC2_QTLs_PermutationP)) %>%
  count(PC2, n)

# A tibble: 6 × 3
  PC2         n    nn
  <chr>   <int> <int>
1 H3K27AC     0  1193
2 H3K27AC     1  1032
3 H3K27AC     2   668
4 H3K27AC     3   374
5 H3K27AC     4   238
6 H3K27AC     5   375

P.Number_sQTLsForColoc <- NumMolQTLsPerGene %>%
  filter(GeneLocus %in% eGenes) %>%
  # filter(PC2 %in% c("H3K27AC", "polyA.Splicing")) %>%
  filter(PC2=="polyA.Splicing") %>%
  mutate(n = if_else(Num_PC2_QTLs_PermutationP>5, as.integer(5), Num_PC2_QTLs_PermutationP)) %>%
  count(PC2, n) %>%
  ggplot(aes(x=n, y=nn)) +
  geom_col() +
  scale_x_continuous(breaks=0:5, labels=c(0:4, ">4")) +
  scale_y_continuous(expand=c(0,0)) +
  labs(x="Number sQTLs within gene", y="Number eGenes")
NumMolQTLsPerGene %>%
  filter(GeneLocus %in% eGenes) %>%
  # filter(PC2 %in% c("H3K27AC", "polyA.Splicing")) %>%
  filter(PC2=="polyA.Splicing") %>%
  mutate(n = if_else(Num_PC2_QTLs_PermutationP>5, as.integer(5), Num_PC2_QTLs_PermutationP)) %>%
  count(PC2, n)

# A tibble: 6 × 3
  PC2                n    nn
  <chr>          <int> <int>
1 polyA.Splicing     0  1411
2 polyA.Splicing     1  1050
3 polyA.Splicing     2   467
4 polyA.Splicing     3   168
5 polyA.Splicing     4    68
6 polyA.Splicing     5    85

P.Number_hQTLsForColoc

P.Number_sQTLsForColoc

coloc.dat <- read_tsv("../code/hyprcoloc/Results/ForColoc/MolColocNonRedundantFullSplicing/tidy_results_OnlyColocalized.txt.gz") %>%
  separate(phenotype_full, into=c("PC", "P"), sep=";")

Dat.TopQTLInFamily <- dat %>%
  filter(PC1=="Expression.Splicing" & FDR.x < 0.1) %>%
  mutate(P2 = case_when(
    PC2 %in% c("polyA.Splicing", "chRNA.Splicing", "polyA.Splicing.Subset_YRI") ~ str_extract(P2, "clu_.+$"),
    TRUE ~ P2
  )) %>%
  group_by(PC1, PC2, P2, GeneLocus) %>%
  slice(which.min(FDR.y)) %>%
  ungroup() %>%
  group_by(PC1, PC2, GeneLocus) %>%
  slice(which.min(FDR.y)) %>%
  ungroup()

coloc.dat %>%
  mutate(P = case_when(
    PC %in% c("polyA.Splicing", "chRNA.Splicing", "polyA.Splicing.Subset_YRI") ~ str_extract(P, "clu_.+$"),
    TRUE ~ P
  )) %>%
  left_join(
    Dat.TopQTLInFamily %>%
      dplyr::select(Locus=GeneLocus, PC=PC2, P=P2) %>%
      mutate(TopMolQTLAroundGene = T)
  ) %>%
  replace_na(list(TopMolQTLAroundGene=F)) %>%
  group_by(Locus, snp) %>%
  filter(any(PC=="Expression.Splicing.Subset_YRI")) %>%
  ungroup() %>%
  group_by(Locus, PC) %>%
  summarise(ContainsTopMolQTL = any(TopMolQTLAroundGene)) %>%
  ungroup() %>%
  group_by(PC) %>%
  summarise(FracContainsTop = sum(ContainsTopMolQTL)/n()) %>%
  ungroup()

# A tibble: 21 × 2
   PC                             FracContainsTop
   <chr>                                    <dbl>
 1 APA_Nuclear                              0.939
 2 APA_Total                                1    
 3 CTCF                                     0    
 4 Expression.Splicing.Subset_YRI           0.999
 5 H3K27AC                                  0.701
 6 H3K36ME3                                 1    
 7 H3K4ME1                                  0.612
 8 H3K4ME3                                  0.763
 9 MetabolicLabelled.30min                  1    
10 MetabolicLabelled.30min.IER              0    
# … with 11 more rows

Same analysis without renaming to match all in cluster

Dat.TopQTLInFamily2 <- dat %>%
  filter(PC1=="Expression.Splicing" & FDR.x < 0.1) %>%
  # mutate(P2 = case_when(
  #   PC2 %in% c("polyA.Splicing", "chRNA.Splicing", "polyA.Splicing.Subset_YRI") ~ str_extract(P2, "clu_.+$"),
  #   TRUE ~ P2
  # )) %>%
  # group_by(PC1, PC2, P2, GeneLocus) %>%
  # slice(which.min(FDR.y)) %>%
  # ungroup() %>%
  group_by(PC1, PC2, GeneLocus) %>%
  slice(which.min(FDR.y)) %>%
  ungroup()

dat.FractionNonPrimaryMolQTLColocs <- coloc.dat %>%
  # mutate(P = case_when(
  #   PC %in% c("polyA.Splicing", "chRNA.Splicing", "polyA.Splicing.Subset_YRI") ~ str_extract(P, "clu_.+$"),
  #   TRUE ~ P
  # )) %>%
  left_join(
    Dat.TopQTLInFamily2 %>%
      dplyr::select(Locus=GeneLocus, PC=PC2, P=P2) %>%
      mutate(TopMolQTLAroundGene = T)
  ) %>%
  replace_na(list(TopMolQTLAroundGene=F)) %>%
  group_by(Locus, snp) %>%
  filter(any(PC=="Expression.Splicing.Subset_YRI")) %>%
  ungroup() %>%
  group_by(Locus, PC) %>%
  summarise(ContainsTopMolQTL = any(TopMolQTLAroundGene)) %>%
  ungroup() %>%
  group_by(PC) %>%
  summarise(FracContainsTop = sum(ContainsTopMolQTL)/n()) %>%
  ungroup()

kable(dat.FractionNonPrimaryMolQTLColocs)

PC	FracContainsTop
APA_Nuclear	0.9393939
APA_Total	1.0000000
CTCF	0.0000000
Expression.Splicing.Subset_YRI	0.9992717
H3K27AC	0.7008850
H3K36ME3	1.0000000
H3K4ME1	0.6124661
H3K4ME3	0.7628866
MetabolicLabelled.30min	1.0000000
MetabolicLabelled.30min.IER	0.0000000
MetabolicLabelled.30min.Splicing	0.0000000
MetabolicLabelled.60min	1.0000000
MetabolicLabelled.60min.IER	0.0000000
MetabolicLabelled.60min.Splicing	0.0000000
ProCap	0.7437811
chRNA.Expression.Splicing	1.0000000
chRNA.Expression_ncRNA	0.0000000
chRNA.IER	0.0000000
chRNA.Splicing	0.7853403
polyA.IER	0.0000000
polyA.Splicing	0.7064057

P.NonPrimaryMolQTLs_OftenColoc <- dat.FractionNonPrimaryMolQTLColocs %>%
  filter(PC %in% c("H3K27AC", "polyA.Splicing")) %>%
  mutate(Complement = 1-FracContainsTop) %>%
  gather("ContainsTop", "Fraction",-PC) %>%
  ggplot(aes(x=PC, y=Fraction, fill=ContainsTop)) +
  geom_col(position="fill") +
  scale_y_continuous(limits=c(0, 1), expand=c(0,0)) +
  scale_fill_manual(values=c("FracContainsTop"="#636363", "Complement"="#f0f0f0"), labels=c("eGene colocalizes with top molQTL", "colocalizes with non top molQTL")) +
  scale_x_discrete(labels=c("H3K27AC hQTL", "polyA RNA sQTL")) +
  labs(x=NULL, y=str_wrap("Fraction of eQTL/molQTL colocalizations", 30), fill=NULL) +
  Rotate_x_labels

P.NonPrimaryMolQTLs_OftenColoc

Number of eQTLs that coloc with at least one molQTL

eGenes <- dat %>%
  filter(PC1=="Expression.Splicing.Subset_YRI" & FDR.x < 0.1) %>%
  pull(P1) %>% unique()

coloc.dat$PC %>% unique()

 [1] "polyA.Splicing"                   "MetabolicLabelled.30min"         
 [3] "MetabolicLabelled.60min"          "H3K27AC"                         
 [5] "H3K4ME3"                          "H3K4ME1"                         
 [7] "chRNA.IER"                        "ProCap"                          
 [9] "CTCF"                             "chRNA.Expression_ncRNA"          
[11] "Expression.Splicing.Subset_YRI"   "chRNA.Splicing"                  
[13] "MetabolicLabelled.30min.Splicing" "MetabolicLabelled.60min.Splicing"
[15] "chRNA.Expression.Splicing"        "H3K36ME3"                        
[17] "APA_Nuclear"                      "APA_Total"                       
[19] "polyA.IER"                        "MetabolicLabelled.30min.IER"     
[21] "MetabolicLabelled.60min.IER"

coloc.dat %>%
    group_by(Locus, snp) %>%
    filter(any(PC=="Expression.Splicing.Subset_YRI")) %>%
    ungroup() %>%
    count(Locus, PC) %>%
    count(PC) %>%
  mutate(PC_relabeled = recode(PC, !!!MolQTL_Labels)) %>%
  filter(!is.na(PC_relabeled) | PC=="Any hQTL") %>%
  filter(!PC_relabeled %in% c("Splicing", "Expression; eQTLs (YRI only)")) %>%
  mutate(Assay_relabeled = recode(PC, !!!MolQTL_AssayLabels)) %>%
  mutate(FinalLabels = paste(PC_relabeled, Assay_relabeled, sep="\n")) %>%
  mutate(OrderFactor = recode(PC, !!!Factor_Order)) %>%
  mutate(FinalLabels = fct_reorder(FinalLabels, OrderFactor)) %>%
  ggplot(aes(x=FinalLabels, y=n)) +
  geom_col() +
  scale_y_continuous(
    name="Number eGenes with colocalized molQTL", 
    sec.axis = sec_axis(~ . / length(eGenes), name = "% eGenes with colocalized molQTL")
  ) +
  Rotate_x_labels +
  labs(x="Molecular QTL type, Assay")

dat.eGeneColocsWith <- bind_rows(
  coloc.dat %>%
    group_by(Locus, snp) %>%
    filter(any(PC=="Expression.Splicing.Subset_YRI")) %>%
    ungroup() %>%
    count(Locus, PC) %>%
    count(PC),
  coloc.dat %>%
    mutate(PC = case_when(
      PC %in% c("H3K4ME1", "H3K4ME3", "H3K27AC", "H3K36ME3") ~ "Any hQTL",
      TRUE ~ PC
    )) %>%
    group_by(Locus, snp) %>%
    filter(any(PC=="Expression.Splicing.Subset_YRI")) %>%
    ungroup() %>%
    count(Locus, PC) %>%
    count(PC)
) %>%
  mutate(PC_relabeled = recode(PC, !!!MolQTL_Labels)) %>%
  filter(!is.na(PC_relabeled) | PC=="Any hQTL") %>%
  filter(!PC_relabeled %in% c("Splicing", "Expression; eQTLs (YRI only)")) %>%
  mutate(Assay_relabeled = recode(PC, !!!MolQTL_AssayLabels)) %>%
  mutate(FinalLabels = paste(PC_relabeled, Assay_relabeled, sep="\n")) %>%
  mutate(OrderFactor = recode(PC, !!!Factor_Order))

P.eGeneColocsWith <- dat.eGeneColocsWith %>%
  mutate(FinalLabels = str_remove_all(FinalLabels, "\\*")) %>%
  mutate(FinalLabels = case_when(
    FinalLabels == "Any hQTL\nAny hQTL" ~ "Any hQTL*",
    TRUE ~ FinalLabels
  )) %>%
  mutate(OrderFactor = case_when(
    FinalLabels == "Any hQTL*" ~ 4.5,
    TRUE ~ OrderFactor
  )) %>%
  mutate(FinalLabels = fct_reorder(FinalLabels, OrderFactor)) %>%
  distinct(.keep_all=T) %>%
  ggplot(aes(x=FinalLabels, y=n)) +
  geom_col() +
  scale_y_continuous(
    name="Number eGenes with colocalized molQTL", 
    sec.axis = sec_axis(~ . / length(eGenes), name = "% eGenes with colocalized molQTL")
  ) +
  Rotate_x_labels +
  labs(x="Molecular QTL type, Assay")
P.eGeneColocsWith

dat.eGeneColocsWith

# A tibble: 19 × 6
   PC                     n PC_relabeled Assay_relabeled FinalLabels OrderFactor
   <chr>              <int> <chr>        <chr>           <chr>             <dbl>
 1 APA_Nuclear           33 APA QTLs*    Nuclear RNA 3'… "APA QTLs*…         7  
 2 APA_Total             38 APA QTLs*    Total RNA 3'seq "APA QTLs*…        14  
 3 H3K27AC              565 H3K27Ac; hQ… H3K27Ac ChIP-s… "H3K27Ac; …         2  
 4 H3K36ME3             284 H3K36me3; h… H3K36me3 ChIP-… "H3K36me3;…         4  
 5 H3K4ME1              369 H3K4me1; hQ… H3K4me1 ChIP-s… "H3K4me1; …         1  
 6 H3K4ME3              485 H3K4me3; hQ… H3K4me3 ChIP-s… "H3K4me3; …         3  
 7 MetabolicLabelled…   433 Expression;… 30min 4sU RNA-… "Expressio…         8.5
 8 MetabolicLabelled…   456 Expression;… 60min 4sU RNA-… "Expressio…         9  
 9 chRNA.Expression.…   647 Expression;… chRNA-seq       "Expressio…         5  
10 chRNA.Splicing       191 Splicing; s… chRNA-seq       "Splicing;…         6  
11 polyA.Splicing       562 Splicing; s… polyA RNA-seq   "Splicing;…        12  
12 APA_Nuclear           33 APA QTLs*    Nuclear RNA 3'… "APA QTLs*…         7  
13 APA_Total             38 APA QTLs*    Total RNA 3'seq "APA QTLs*…        14  
14 Any hQTL             828 Any hQTL     Any hQTL        "Any hQTL\…        NA  
15 MetabolicLabelled…   433 Expression;… 30min 4sU RNA-… "Expressio…         8.5
16 MetabolicLabelled…   456 Expression;… 60min 4sU RNA-… "Expressio…         9  
17 chRNA.Expression.…   647 Expression;… chRNA-seq       "Expressio…         5  
18 chRNA.Splicing       191 Splicing; s… chRNA-seq       "Splicing;…         6  
19 polyA.Splicing       562 Splicing; s… polyA RNA-seq   "Splicing;…        12

Number of eQTLs that coloc with at least one molQTL (among non hQTL eQTLs)

eGenes.NohQTLColoc <-
  coloc.dat %>%
  group_by(Locus, snp) %>%
  filter(any(PC=="Expression.Splicing.Subset_YRI") & !any(PC %in% c("H3K4ME3", "H3K27AC", "H3K4ME1", "H3K36ME3"))) %>%
  pull(Locus) %>% unique()

dat.eGeneColocsWith.NohQTL <-
  coloc.dat %>%
  group_by(Locus, snp) %>%
  filter(any(PC=="Expression.Splicing.Subset_YRI") & !any(PC %in% c("H3K4ME3", "H3K27AC", "H3K4ME1", "H3K36ME3"))) %>%
  ungroup() %>%
  count(Locus, PC) %>%
  count(PC) %>%
  mutate(PC_relabeled = recode(PC, !!!MolQTL_Labels)) %>%
  filter(!is.na(PC_relabeled) | PC=="Any hQTL") %>%
  filter(!PC_relabeled %in% c("Splicing", "Expression; eQTLs (YRI only)")) %>%
  mutate(Assay_relabeled = recode(PC, !!!MolQTL_AssayLabels)) %>%
  mutate(FinalLabels = paste(PC_relabeled, Assay_relabeled, sep="\n")) %>%
  mutate(OrderFactor = recode(PC, !!!Factor_Order))

P.eGeneColocsWith.nohQTL <- dat.eGeneColocsWith.NohQTL %>%
  mutate(FinalLabels = str_remove_all(FinalLabels, "\\*")) %>%
  mutate(FinalLabels = case_when(
    FinalLabels == "Any hQTL\nAny hQTL" ~ "Any hQTL*",
    TRUE ~ FinalLabels
  )) %>%
  mutate(OrderFactor = case_when(
    FinalLabels == "Any hQTL*" ~ 4.5,
    TRUE ~ OrderFactor
  )) %>%
  distinct(.keep_all=T) %>%
  mutate(FinalLabels = fct_reorder(FinalLabels, OrderFactor)) %>%
  ggplot(aes(x=FinalLabels, y=n)) +
  geom_col() +
  scale_y_continuous(
    name="Number eGenes with colocalized molQTL", 
    sec.axis = sec_axis(~ . / length(eGenes.NohQTLColoc), name = "% eGenes with colocalized molQTL\nAmong those that did not coloc with hQTL")
  ) +
  Rotate_x_labels +
  labs(x="Molecular QTL type, Assay")
P.eGeneColocsWith.nohQTL

dat.eGeneColocsWith.NohQTL

# A tibble: 7 × 6
  PC                      n PC_relabeled Assay_relabeled FinalLabels OrderFactor
  <chr>               <int> <chr>        <chr>           <chr>             <dbl>
1 APA_Nuclear            16 APA QTLs*    Nuclear RNA 3'… "APA QTLs*…         7  
2 APA_Total              21 APA QTLs*    Total RNA 3'seq "APA QTLs*…        14  
3 MetabolicLabelled.…   153 Expression;… 30min 4sU RNA-… "Expressio…         8.5
4 MetabolicLabelled.…   162 Expression;… 60min 4sU RNA-… "Expressio…         9  
5 chRNA.Expression.S…   213 Expression;… chRNA-seq       "Expressio…         5  
6 chRNA.Splicing         96 Splicing; s… chRNA-seq       "Splicing;…         6  
7 polyA.Splicing        258 Splicing; s… polyA RNA-seq   "Splicing;…        12

Plot of number of eQTLs that coloc with an hQTL, versus those that coloc with an sQTL (and not an hQTL)

data.frame(name="Non hQTL", n=length(eGenes.NohQTLColoc))

      name   n
1 Non hQTL 545

dat.SimplifiedNumberColocs <- bind_rows(
  dat.eGeneColocsWith %>%
    filter(PC=="Any hQTL") %>%
    mutate(name="hQTL*", order=1) %>%
    dplyr::select(name, n, order),
  # dat.eGeneColocsWith %>%
  #   filter(PC=="polyA.Splicing") %>%
  #   mutate(name="sQTL", order=1.5) %>%
  #   dplyr::select(name, n, order),
  dat.eGeneColocsWith.NohQTL %>%
    filter(PC=="APA_Total") %>%
    mutate(name="APA QTL**", order=4) %>%
    dplyr::select(name, n, order),
  dat.eGeneColocsWith.NohQTL %>%
    filter(PC=="polyA.Splicing") %>%
    mutate(name="sQTL**", order=3) %>%
    dplyr::select(name, n, order),
) %>%
  add_row(name="Any molQTL**", n=length(eGenes.NohQTLColoc), order=2) %>%
  mutate(name = fct_reorder(name, order))

P.SimplifiedNumberColocs <- ggplot(dat.SimplifiedNumberColocs, aes(x=name, y=n)) +
  geom_col(fill="#bdbdbd") +
  scale_y_continuous(expand=c(0,0), breaks=c(0, 200, 400, 600, 800), labels=c(0, "", "", "", 800)) +
  geom_text(aes(label=name), y=10, angle=90, y=-Inf, hjust=0, size=8) +
  labs(x=NULL, y="# eGenes coloc w/ molQTL") +
  theme(axis.title.x=element_blank(),
        axis.text.x=element_blank(),
        axis.ticks.x=element_blank())
P.SimplifiedNumberColocs

kable(dat.SimplifiedNumberColocs)

name	n	order
hQTL*	828	1
APA QTL**	21	4
sQTL**	258	3
Any molQTL**	545	2

And finally, a plot about whether the sQTLs that do coloc with eQTL are enriched for non-productive juncs.

IntronAnnotations <- read_tsv("../data/IntronAnnotationsFromYang.tsv.gz")

ClusterLevelQTLAnnotations <- read_delim("../code/QTLs/QTLTools/polyA.Splicing/PermutationPassForColoc.txt.gz", delim=' ') %>%
  mutate(IntronID = str_replace(phe_id, "(^.+?clu.+?):EN.+$", "\\1")) %>%
  separate(IntronID, into=c("chrom", "start", "end", "cluster"), remove=F, convert=T, sep=":") %>%
  mutate(chrom=paste0("chr", chrom)) %>%
  mutate(q = qvalue(adj_beta_pval)$qvalues) %>%
  left_join(IntronAnnotations) %>%
  group_by(cluster) %>%
  mutate(ContainsUnproductive = any(SuperAnnotation %in% c("AnnotatedJunc_UnproductiveCodingGene", "UnannotatedJunc_UnproductiveCodingGene") & q<0.1)) %>%
  ungroup()

ClusterLevelQTLAnnotations %>%
  distinct(cluster, .keep_all=T) %>%
  count(ContainsUnproductive)

# A tibble: 2 × 2
  ContainsUnproductive     n
  <lgl>                <int>
1 FALSE                31259
2 TRUE                  4490

dat %>%
  filter(PC1=="Expression.Splicing" & FDR.x < 0.1) %>%
  filter(PC2=="polyA.Splicing" & p_permutation.y<0.01)

                       PC1                 P1          GeneLocus
    1: Expression.Splicing ENSG00000198793.13 ENSG00000198793.13
    2: Expression.Splicing ENSG00000127481.15 ENSG00000127481.15
    3: Expression.Splicing ENSG00000127481.15 ENSG00000127481.15
    4: Expression.Splicing ENSG00000160094.15 ENSG00000160094.15
    5: Expression.Splicing ENSG00000116922.14 ENSG00000116922.14
   ---                                                          
15055: Expression.Splicing ENSG00000197128.11 ENSG00000197128.11
15056: Expression.Splicing ENSG00000197128.11 ENSG00000197128.11
15057: Expression.Splicing ENSG00000197128.11 ENSG00000197128.11
15058: Expression.Splicing ENSG00000197128.11 ENSG00000197128.11
15059: Expression.Splicing ENSG00000197128.11 ENSG00000197128.11
       p_permutation.x    beta.x beta_se.x singletrait_topvar.x
    1:     1.91282e-02 -0.119659 0.0283057       1:11324733:C:T
    2:     4.07612e-05  0.125678 0.0219205       1:19123840:T:C
    3:     4.07612e-05  0.125678 0.0219205       1:19123840:T:C
    4:     2.74415e-02  0.181904 0.0442285       1:33257610:G:A
    5:     4.57030e-13  0.332516 0.0384069       1:37683521:C:T
   ---                                                         
15055:     8.41409e-30  0.677943 0.0490979      19:57468799:C:T
15056:     8.41409e-30  0.677943 0.0490979      19:57468799:C:T
15057:     8.41409e-30  0.677943 0.0490979      19:57468799:C:T
15058:     8.41409e-30  0.677943 0.0490979      19:57468799:C:T
15059:     8.41409e-30  0.677943 0.0490979      19:57468799:C:T
       singletrait_topvar_chr.x singletrait_topvar_pos.x        FDR.x
    1:                     chr1                 11324733 6.570641e-03
    2:                     chr1                 19123840 2.177979e-05
    3:                     chr1                 19123840 2.177979e-05
    4:                     chr1                 33257610 9.127945e-03
    5:                     chr1                 37683521 5.503866e-13
   ---                                                               
15055:                    chr19                 57468799 3.275936e-29
15056:                    chr19                 57468799 3.275936e-29
15057:                    chr19                 57468799 3.275936e-29
15058:                    chr19                 57468799 3.275936e-29
15059:                    chr19                 57468799 3.275936e-29
                  PC2                               P2 p_permutation.y
    1: polyA.Splicing    1:11213566:11228668:clu_262_-     4.49478e-03
    2: polyA.Splicing    1:19151859:19152313:clu_356_-     8.08673e-03
    3: polyA.Splicing    1:19151804:19152313:clu_356_-     2.36350e-03
    4: polyA.Splicing   1:33256654:33270487:clu_3059_+     3.48650e-03
    5: polyA.Splicing    1:37689883:37690002:clu_644_-     4.23258e-03
   ---                                                                
15055: polyA.Splicing 19:57475786:57476380:clu_44453_-     7.50154e-03
15056: polyA.Splicing 19:57475146:57475660:clu_44453_-     5.51692e-03
15057: polyA.Splicing 19:57467085:57476634:clu_44453_-     5.08816e-05
15058: polyA.Splicing 19:57456182:57476634:clu_44453_-     1.47938e-05
15059: polyA.Splicing 19:57455743:57476634:clu_44453_-     5.45558e-06
          beta.y beta_se.y singletrait_topvar.y singletrait_topvar_chr.y
    1:  0.371906 0.0695477      1:11264089:CA:C                     chr1
    2: -1.516750 0.2875090       1:19304026:G:C                     chr1
    3:  0.877172 0.1666940     1:19264724:CTT:C                     chr1
    4:  1.162510 0.2540520       1:33239322:G:C                     chr1
    5:  0.349353 0.0767724       1:37681494:T:C                     chr1
   ---                                                                  
15055: -0.209531 0.0459838      19:57449940:G:A                    chr19
15056:  0.312532 0.0719654      19:57469325:A:G                    chr19
15057: -0.279717 0.0478831     19:57450880:A:AC                    chr19
15058: -0.258663 0.0458791      19:57449940:G:A                    chr19
15059: -0.259654 0.0430860      19:57449940:G:A                    chr19
       singletrait_topvar_pos.y        FDR.y trait.x.p.in.y x.beta.in.y
    1:                 11264085 0.0453688625    3.25581e-01  0.05805960
    2:                 19304026 0.0731896350    8.65210e-01  0.01348910
    3:                 19264723 0.0264276619    4.43321e-01 -0.03353030
    4:                 33239322 0.0368014521    8.87217e-01 -0.00999069
    5:                 37681494 0.0431957225    1.39465e-04  0.28945300
   ---                                                                 
15055:                 57449940 0.0689080148    3.01964e-03 -0.14960100
15056:                 57469325 0.0535979864    9.87470e-04 -0.21649800
15057:                 57450880 0.0008509990    1.05411e-07 -0.27950300
15058:                 57449940 0.0002733075    2.71867e-06 -0.23710600
15059:                 57449940 0.0001085746    3.63617e-09 -0.27939000
       x.beta_se.in.y
    1:      0.0589955
    2:      0.0794210
    3:      0.0437003
    4:      0.0704026
    5:      0.0753311
   ---               
15055:      0.0501670
15056:      0.0652881
15057:      0.0517077
15058:      0.0498955
15059:      0.0464090

Clusters.ThatColocWith_sQTL_not_hQTL <- coloc.dat %>%
  group_by(Locus, snp) %>%
  filter(any(PC=="Expression.Splicing.Subset_YRI") & !any(PC %in% c("H3K4ME3", "H3K27AC", "H3K4ME1"))) %>%
  ungroup() %>%
  filter(PC=="polyA.Splicing") %>%
  separate(P, into=c("chrom", "start", "end", "cluster"), remove=F, convert=T, sep=":") %>%
  distinct(cluster)

Clusters.AttemptColoc_butOther <- dat %>%
  filter(PC1=="Expression.Splicing" & FDR.x < 0.1) %>%
  filter(PC2=="polyA.Splicing" & p_permutation.y<0.01) %>%
  distinct(P2) %>%
  separate(P2, into=c("chrom", "start", "end", "cluster"), remove=F, convert=T, sep=":") %>%
  distinct(cluster) %>%
  filter(!cluster %in% Clusters.ThatColocWith_sQTL_not_hQTL$cluster)

dat.EnrichmentOfUnrpductiveClusters <- bind_rows(
  Clusters.AttemptColoc_butOther %>%
    mutate(Category="DoesNotColocWith_eGene"),
  Clusters.ThatColocWith_sQTL_not_hQTL %>%
    mutate(Category="DoesColocWith_eGene")
) %>%
  left_join(
    ClusterLevelQTLAnnotations %>%
      distinct(cluster, ContainsUnproductive))

test.results <- dat.EnrichmentOfUnrpductiveClusters %>%
  count(Category, ContainsUnproductive) %>%
  pivot_wider(names_from="Category", values_from ="n") %>%
  mutate(ContainsUnproductive=!ContainsUnproductive) %>%
  column_to_rownames("ContainsUnproductive") %>%
  as.matrix() %>%
  fisher.test(alternative="less")
test.results


    Fisher's Exact Test for Count Data

data:  .
p-value = 1.982e-15
alternative hypothesis: true odds ratio is less than 1
95 percent confidence interval:
 0.0000000 0.5214947
sample estimates:
odds ratio 
 0.4311536

P.EnrichmentOfUnrpductiveClusters <- dat.EnrichmentOfUnrpductiveClusters %>%
  count(Category, ContainsUnproductive) %>%
  ggplot(aes(x=Category, fill=ContainsUnproductive, y=n)) +
  geom_col() +
  scale_fill_manual(values=c("FALSE"="#377eb8", "TRUE"="#e41a1c"), labels=c("FALSE"="Productive sQTL", "TRUE"="Unproductive sQTL")) +
  scale_x_discrete(labels=c(str_wrap("sQTLs that coloc with eQTL and not hQTL", 20), str_wrap("Other sQTLs tested\nfor colocalization", 20))) +
  labs(caption=paste("One sided Fisher test P:", format.pval(test.results$p.value, 3), "\nOR:", round(1/test.results$estimate, 3)), y="Number of sQTLs", fill="sQTL type", x="") +
  Rotate_x_labels
P.EnrichmentOfUnrpductiveClusters

ggsave("/project2/yangili1/carlos_and_ben_shared/rough_figs/OriginalSubplots/2023_FinalFig_S_SampleSize.pdf",P.SampleSizes, width=6, height=6)

ggsave("/project2/yangili1/carlos_and_ben_shared/rough_figs/OriginalSubplots/2023_FinalFig_S_NumQTLs.pdf",NumQTLs, width=8, height=6)

ggsave("/project2/yangili1/carlos_and_ben_shared/rough_figs/OriginalSubplots/2023_FinalFig_S_PercentAnnotated.pdf",P.PercentAnnotated, width=4, height=6)
ggsave("/project2/yangili1/carlos_and_ben_shared/rough_figs/OriginalSubplots/2023_FinalFig_S_PercentOfUnannotated.pdf",P.PercentOfUnannotated, width=6, height=6)
ggsave("/project2/yangili1/carlos_and_ben_shared/rough_figs/OriginalSubplots/2023_FinalFig_S_PercentAnnotatedLegend.pdf",P.PercentJuncsLegend, width=8, height=6)


ggsave("/project2/yangili1/carlos_and_ben_shared/rough_figs/OriginalSubplots/2023_FinalFig_S_Multiple_hQTLs_PerGene.pdf",P.Number_hQTLsForColoc, width=6, height=6)
ggsave("/project2/yangili1/carlos_and_ben_shared/rough_figs/OriginalSubplots/2023_FinalFig_S_Multiple_sQTLs_PerGene.pdf",P.Number_sQTLsForColoc, width=6, height=6)

ggsave("/project2/yangili1/carlos_and_ben_shared/rough_figs/OriginalSubplots/2023_FinalFig_S_NonPrimaryMolQTLsColoc.pdf",P.NonPrimaryMolQTLs_OftenColoc, width=6, height=6)

ggsave("/project2/yangili1/carlos_and_ben_shared/rough_figs/OriginalSubplots/2023_FinalFig_S_EnrichmentOfUnrpoductiveSqtls.pdf",P.EnrichmentOfUnrpductiveClusters, width=6, height=6)

ggsave("/project2/yangili1/carlos_and_ben_shared/rough_figs/OriginalSubplots/2023_FinalFig_S_eGene_MolQTL_Colocs.pdf",P.eGeneColocsWith, width=10, height=7)
ggsave("/project2/yangili1/carlos_and_ben_shared/rough_figs/OriginalSubplots/2023_FinalFig_S_eGene_MolQTL_NonhQTL_Colocs.pdf",P.eGeneColocsWith.nohQTL, width=10, height=7)


ggsave("/project2/yangili1/carlos_and_ben_shared/rough_figs/OriginalSubplots/2023_FinalFig_S_eGene_MolQTL_Simplified.pdf",P.SimplifiedNumberColocs, width=2.7, height=2.7)

sessionInfo()

R version 4.2.0 (2022-04-22)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: CentOS Linux 7 (Core)

Matrix products: default
BLAS/LAPACK: /software/openblas-0.3.13-el7-x86_64/lib/libopenblas_haswellp-r0.3.13.so

locale:
 [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C         LC_TIME=C           
 [4] LC_COLLATE=C         LC_MONETARY=C        LC_MESSAGES=C       
 [7] LC_PAPER=C           LC_NAME=C            LC_ADDRESS=C        
[10] LC_TELEPHONE=C       LC_MEASUREMENT=C     LC_IDENTIFICATION=C 

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] ggbreak_0.1.1      knitr_1.39         ggrepel_0.9.1      qvalue_2.28.0     
 [5] readxl_1.4.0       data.table_1.14.2  RColorBrewer_1.1-3 forcats_0.5.1     
 [9] stringr_1.4.0      dplyr_1.0.9        purrr_0.3.4        readr_2.1.2       
[13] tidyr_1.2.0        tibble_3.1.7       ggplot2_3.3.6      tidyverse_1.3.1   

loaded via a namespace (and not attached):
 [1] nlme_3.1-157       fs_1.5.2           lubridate_1.8.0    bit64_4.0.5       
 [5] httr_1.4.3         rprojroot_2.0.3    tools_4.2.0        backports_1.4.1   
 [9] bslib_0.3.1        utf8_1.2.2         R6_2.5.1           mgcv_1.8-40       
[13] DBI_1.1.2          colorspace_2.0-3   withr_2.5.0        tidyselect_1.1.2  
[17] bit_4.0.4          compiler_4.2.0     git2r_0.30.1       cli_3.3.0         
[21] rvest_1.0.2        xml2_1.3.3         labeling_0.4.2     sass_0.4.1        
[25] scales_1.2.0       digest_0.6.29      yulab.utils_0.0.6  rmarkdown_2.14    
[29] R.utils_2.11.0     pkgconfig_2.0.3    htmltools_0.5.2    highr_0.9         
[33] dbplyr_2.1.1       fastmap_1.1.0      rlang_1.0.2        rstudioapi_0.13   
[37] farver_2.1.0       gridGraphics_0.5-1 jquerylib_0.1.4    generics_0.1.2    
[41] jsonlite_1.8.0     vroom_1.5.7        R.oo_1.24.0        magrittr_2.0.3    
[45] ggplotify_0.1.0    Matrix_1.5-3       patchwork_1.1.1    Rcpp_1.0.8.3      
[49] munsell_0.5.0      fansi_1.0.3        lifecycle_1.0.1    R.methodsS3_1.8.1 
[53] stringi_1.7.6      whisker_0.4        yaml_2.3.5         plyr_1.8.7        
[57] grid_4.2.0         parallel_4.2.0     promises_1.2.0.1   crayon_1.5.1      
[61] lattice_0.20-45    haven_2.5.0        splines_4.2.0      hms_1.1.1         
[65] pillar_1.7.0       reshape2_1.4.4     reprex_2.0.1       glue_1.6.2        
[69] evaluate_0.15      ggfun_0.0.9        modelr_0.1.8       vctrs_0.4.1       
[73] tzdb_0.3.0         httpuv_1.6.5       cellranger_1.1.0   gtable_0.3.0      
[77] assertthat_0.2.1   xfun_0.30          broom_0.8.0        later_1.3.0       
[81] aplot_0.1.10       workflowr_1.7.0    ellipsis_0.3.2

MakeFinalFigs_Fig2

2023-05-18

List

Number QTLs

Box plots for TTC3

Classification of unannotated and unproductive juncs