Differential Splicing

Introduction
Results

Last updated: 2022-09-27

Checks: 5 2

Knit directory: 20211209_JingxinRNAseq/analysis/

This reproducible R Markdown analysis was created with workflowr (version 1.6.2). The Checks tab describes the reproducibility checks that were applied when the results were created. The Past versions tab lists the development history.

R Markdown file: uncommitted changes

The R Markdown file has unstaged changes. To know which version of the R Markdown file created these results, you’ll want to first commit it to the Git repo. If you’re still working on the analysis, you can ignore this warning. When you’re finished, you can run wflow_publish to commit the R Markdown file and build the HTML.

Environment: empty

Great job! The global environment was empty. Objects defined in the global environment can affect the analysis in your R Markdown file in unknown ways. For reproduciblity it’s best to always run the code in an empty environment.

Seed: set.seed(19900924)

The command set.seed(19900924) was run prior to running the code in the R Markdown file. Setting a seed ensures that any results that rely on randomness, e.g. subsampling or permutations, are reproducible.

Session information: recorded

Great job! Recording the operating system, R version, and package versions is critical for reproducibility.

Cache: none

Nice! There were no cached chunks for this analysis, so you can be confident that you successfully produced the results during this run.

File paths: absolute

Using absolute paths to the files within your workflowr project makes it difficult for you and others to run your code on a different machine. Change the absolute path(s) below to the suggested relative path(s) to make your code more reproducible.

absolute	relative
/project2/yangili1/bjf79/20211209_JingxinRNAseq/code/bigwigs/unstranded/(.+?).bw	../code/bigwigs/unstranded/(.+?).bw

Repository version: b2293ec

Great! You are using Git for version control. Tracking code development and connecting the code version to the results is critical for reproducibility.

The results in this page were generated with repository version b2293ec. See the Past versions tab to see a history of the changes made to the R Markdown and HTML files.

Note that you need to be careful to ensure that all relevant files for the analysis have been committed to Git prior to generating the results (you can use wflow_publish or wflow_git_commit). workflowr only checks the R Markdown file, but you know if there are other scripts or data files that it depends on. Below is the status of the Git repository when the results were generated:


Ignored files:
    Ignored:    .DS_Store
    Ignored:    .Rhistory
    Ignored:    .Rproj.user/
    Ignored:    ._.DS_Store
    Ignored:    analysis/.RData
    Ignored:    analysis/.Rhistory
    Ignored:    analysis/20220707_TitrationSeries_DE_testing.nb.html
    Ignored:    code/%
    Ignored:    code/.DS_Store
    Ignored:    code/._.DS_Store
    Ignored:    code/._DOCK7.pdf
    Ignored:    code/._DOCK7_DMSO1.pdf
    Ignored:    code/._DOCK7_SM2_1.pdf
    Ignored:    code/._FKTN_DMSO_1.pdf
    Ignored:    code/._FKTN_SM2_1.pdf
    Ignored:    code/._MAPT.pdf
    Ignored:    code/._PKD1_DMSO_1.pdf
    Ignored:    code/._PKD1_SM2_1.pdf
    Ignored:    code/.snakemake/
    Ignored:    code/1KG_HighCoverageCalls.samplelist.txt
    Ignored:    code/5ssSeqs.tab
    Ignored:    code/Alignments/
    Ignored:    code/ChemCLIP/
    Ignored:    code/ClinVar/
    Ignored:    code/DE_testing/
    Ignored:    code/DE_tests.mat.counts.gz
    Ignored:    code/DE_tests.txt.gz
    Ignored:    code/DoseResponseData/
    Ignored:    code/Fastq/
    Ignored:    code/FastqFastp/
    Ignored:    code/FragLenths/
    Ignored:    code/Meme/
    Ignored:    code/Multiqc/
    Ignored:    code/OMIM/
    Ignored:    code/OldBigWigs/
    Ignored:    code/PhyloP/
    Ignored:    code/QC/
    Ignored:    code/ReferenceGenomes/
    Ignored:    code/Session.vim
    Ignored:    code/SplicingAnalysis/
    Ignored:    code/TracksSession
    Ignored:    code/bigwigs/
    Ignored:    code/featureCounts/
    Ignored:    code/geena/
    Ignored:    code/hg38ToMm39.over.chain.gz
    Ignored:    code/igv_session.template.xml
    Ignored:    code/igv_session.xml
    Ignored:    code/log
    Ignored:    code/logs/
    Ignored:    code/scratch/
    Ignored:    code/test.txt.gz
    Ignored:    code/testPlottingWithMyScript.ForJingxin.sh
    Ignored:    code/testPlottingWithMyScript.ForJingxin2.sh
    Ignored:    code/testPlottingWithMyScript.ForJingxin3.sh
    Ignored:    code/testPlottingWithMyScript.ForJingxin4.sh
    Ignored:    code/testPlottingWithMyScript.sh
    Ignored:    data/._Hijikata_TableS1_41598_2017_8902_MOESM2_ESM.xls
    Ignored:    data/._Hijikata_TableS2_41598_2017_8902_MOESM3_ESM.xls
    Ignored:    output/._PioritizedIntronTargets.pdf

Unstaged changes:
    Modified:   analysis/20211216_DifferentialSplicing.Rmd
    Modified:   analysis/20220919_chRNA_IntronRetention.Rmd
    Modified:   analysis/20220922_ExploreSubparRNAseq.Rmd
    Modified:   analysis/20220923_ExploreSpecificityEstimates.Rmd

Note that any generated files, e.g. HTML, png, CSS, etc., are not included in this status report because it is ok for generated content to have uncommitted changes.

These are the previous versions of the repository in which changes were made to the R Markdown (analysis/20211216_DifferentialSplicing.Rmd) and HTML (docs/20211216_DifferentialSplicing.html) files. If you’ve configured a remote Git repository (see ?wflow_git_remote), click on the hyperlinks in the table below to view the files as they were in that past version.

File	Version	Author	Date	Message
Rmd	0d59720	Benjmain Fair	2022-09-08	update site
Rmd	db9a4f0	Benjmain Fair	2022-08-01	big update
html	db9a4f0	Benjmain Fair	2022-08-01	big update
Rmd	4a410f5	Benjmain Fair	2022-03-08	initial commit
html	4a410f5	Benjmain Fair	2022-03-08	initial commit

Introduction

Our collaborator, Jingxin, is interested in how small molecular drugs might effect splicing as eventually be used as treatments, like risdiplasm to treat SMA. Although antisense oligos have also been used to treat causal splicing defects in some SMA forms, small molecules may have general advatage over antisense oligos in in terms of efficient delivery to the necessary tissues. His lab has treated cells with 4 small molecules (SM2, CUOME, A10, and C2) and a DMSO control, 3 replicates for each treatment and DMSO. The goal with this experiment is to explore the splicing effects of these small molecular drugs, trying to better understand their molecular effects. Splicing was measured by RNA-seq which I have aligned with STAR and measured splicing and differential splicing with leafcutter. I also performed differential expression testing, by counting reads, fitting expression models and testing for differences with DMSO using edgeR.

Results

Differential expression analysis

I have processed data in a snakemake pipeline. In this Rmarkdown document I’ll plot the results of these analyses. First load necessary libraries and read in differential expression testing data…øØ

library(tidyverse)

── Attaching packages ─────────────────────────────────────── tidyverse 1.3.0 ──

✔ ggplot2 3.3.6     ✔ purrr   0.3.4
✔ tibble  3.1.7     ✔ dplyr   1.0.9
✔ tidyr   1.2.0     ✔ stringr 1.4.0
✔ readr   1.3.1     ✔ forcats 0.4.0

── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()

library(gplots)


Attaching package: 'gplots'

The following object is masked from 'package:stats':

    lowess

library(edgeR)

Loading required package: limma

library(ggrepel)
library(RColorBrewer)
library(GGally)

Registered S3 method overwritten by 'GGally':
  method from   
  +.gg   ggplot2

library(readxl)

sample.list <- read_tsv("../code/bigwigs/BigwigList.tsv",
                        col_names = c("SampleName", "bigwig", "group", "strand")) %>%
  filter(strand==".") %>%
  dplyr::select(-strand) %>%
  mutate(old.sample.name = str_replace(bigwig, "/project2/yangili1/bjf79/20211209_JingxinRNAseq/code/bigwigs/unstranded/(.+?).bw", "\\1")) %>%
  separate(SampleName, into=c("treatment", "dose.nM", "cell.type", "libType", "rep"), convert=T, remove=F, sep="_") %>%
  left_join(
    read_tsv("../code/bigwigs/BigwigList.groups.tsv", col_names = c("group", "color", "bed", "supergroup")),
    by="group"
  )

Parsed with column specification:
cols(
  SampleName = col_character(),
  bigwig = col_character(),
  group = col_character(),
  strand = col_character()
)

Parsed with column specification:
cols(
  group = col_character(),
  color = col_character(),
  bed = col_character(),
  supergroup = col_character()
)

Jingxin was particularly interested in the DE and differential splicing results for the genes MAPT (encoding tau protein) and SNCA (alpha synuclein). Originally I tested only genes with average expression of CountsPerMillion>2, resulting in ~11000 genes to test. But MAPT and SNCA are quite lowly expressed in this cell type (fibroblasts, I think? …need to ask Jingxin). So I redid the DE testing with more permissive filters to allow testing of those genes. Here are the results…

First let’s look at some basic quality control (reads mapped to genes per sample, clustering of samples by gene expression):

GeneCounts <- read_tsv("../code/DE_testing/Counts.mat.txt.gz") %>%
  column_to_rownames("Geneid") %>%
  DGEList()

Parsed with column specification:
cols(
  Geneid = col_character(),
  `A10-1` = col_double(),
  `A10-2` = col_double(),
  `A10-3` = col_double(),
  `C2-1` = col_double(),
  `C2-2` = col_double(),
  `C2-3` = col_double(),
  `CUOME-1` = col_double(),
  `CUOME-2` = col_double(),
  `CUOME-3` = col_double(),
  `DMSO-1` = col_double(),
  `DMSO-2` = col_double(),
  `DMSO-3` = col_double(),
  `SM2-1` = col_double(),
  `SM2-2` = col_double(),
  `SM2-3` = col_double()
)

#Genic read counts per sample
GeneCounts$samples %>%
  as.data.frame() %>%
  rownames_to_column("Sample") %>%
  ggplot(aes(x=Sample, y=lib.size)) +
  geom_col() +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1)) +
  ggtitle("Gene-mapping read counts per sample")

Version	Author	Date
4a410f5	Benjmain Fair	2022-03-08

#Define genes of interest
GenesOfInterest <- c("MAPT", "SNCA", "SMN2", "HTT", "GALC", "FOXM1")

# Raw cpm for genes of interest:
GeneCounts %>%
  cpm() %>%
  as.data.frame() %>%
  rownames_to_column("Geneid") %>%
  separate(Geneid, into=c("EnsemblID", "GeneSymbol"), sep = "_") %>%
  filter(GeneSymbol %in% GenesOfInterest) %>%
  select(-EnsemblID) %>%
  gather(key="Sample", value="CountsPerMillion", -GeneSymbol) %>%
  mutate(Treatment = str_replace(Sample, "(.+?)-.+?", "\\1")) %>%
  ggplot(aes(x=Treatment, y=CountsPerMillion)) +
  geom_jitter(width=0.1) +
  scale_y_continuous(trans="log10") +
  facet_wrap(~GeneSymbol, scales = "free_y") +
  theme_classic()+
  ggtitle("CPM per sample for some genes of interest")

Version	Author	Date
db9a4f0	Benjmain Fair	2022-08-01
4a410f5	Benjmain Fair	2022-03-08

GeneCounts %>%
  cpm() %>%
  as.data.frame() %>%
  rownames_to_column("Geneid") %>%
  separate(Geneid, into=c("EnsemblID", "GeneSymbol"), sep = "_") %>%
  filter(GeneSymbol %in% c("HTT", "SNCA", "MAPT", "SMN2")) %>%
  select(-EnsemblID) %>%
  gather(key="Sample", value="CountsPerMillion", -GeneSymbol) %>%
  mutate(Treatment = str_replace(Sample, "(.+?)-.+?", "\\1")) %>%
  ggplot(aes(x=Treatment, y=CountsPerMillion)) +
  geom_jitter(width=0.1) +
  scale_y_continuous(trans="log10") +
  facet_wrap(~GeneSymbol, scales = "free_y") +
  theme_classic()+
  ggtitle("CPM for neuronal genes of interest")

Version	Author	Date
db9a4f0	Benjmain Fair	2022-08-01
4a410f5	Benjmain Fair	2022-03-08

mean.cpm <- GeneCounts %>%
    cpm(log=T) %>%
    apply(1, mean)

hist(mean.cpm)

Version	Author	Date
db9a4f0	Benjmain Fair	2022-08-01
4a410f5	Benjmain Fair	2022-03-08

ExpressedGeneCounts <- GeneCounts[mean.cpm>1,]

#Speaman correlation matrix heatmap for CountsPerMillion expression
ExpressedGeneCounts %>%
  cpm(log=T) %>%
  cor(method="pearson") %>%
  heatmap.2(trace="none")

Version	Author	Date
db9a4f0	Benjmain Fair	2022-08-01

Samples sequenced to similar depths, plenty of reads per sample. High correlation as expected for RNA-seq (Pearson R > 0.9) Samples cluster as expected, based on correlation of gene expression (CountsPerMillion). That is, each treatment clusters with its other replicates, and CUOME clusters closest to DMSO.

And now the differential testing results:

DE.results <- read_tsv("../code/DE_testing/Results.txt.gz") %>%
  separate(Geneid, into=c("EnsemblID", "GeneSymbol"), sep = "_")

Parsed with column specification:
cols(
  treatment = col_character(),
  Geneid = col_character(),
  logFC = col_double(),
  logCPM = col_double(),
  F = col_double(),
  PValue = col_double(),
  FDR = col_double()
)

head(DE.results)

# A tibble: 6 × 8
  treatment EnsemblID          GeneSymbol logFC logCPM     F   PValue      FDR
  <chr>     <chr>              <chr>      <dbl>  <dbl> <dbl>    <dbl>    <dbl>
1 C2        ENSG00000102401.20 ARMCX3     -4.16   5.72 5292. 3.15e-23 4.23e-19
2 C2        ENSG00000138413.13 IDH1       -4.42   6.77 5056. 4.70e-23 4.23e-19
3 C2        ENSG00000115866.11 DARS1      -3.49   5.91 4866. 6.56e-23 4.23e-19
4 C2        ENSG00000115415.20 STAT1      -3.97   9.91 4356. 1.73e-22 8.37e-19
5 C2        ENSG00000179889.19 PDXDC1     -3.49   5.18 3866. 4.91e-22 1.90e-18
6 C2        ENSG00000198231.13 DDX42      -2.64   5.72 2774. 8.91e-21 2.87e-17

#Total number tests
DE.results %>%
  drop_na() %>%
  count(treatment)

# A tibble: 4 × 2
  treatment     n
  <chr>     <int>
1 A10       19354
2 C2        19354
3 CUOME     19354
4 SM2       19354

DE.results %>% pull(treatment) %>% unique()

[1] "C2"    "CUOME" "SM2"   "A10"

DE.results %>%
  select(treatment, GeneSymbol, logFC, logCPM, PValue, FDR) %>%
  filter(GeneSymbol %in% GenesOfInterest)

# A tibble: 24 × 6
   treatment GeneSymbol   logFC  logCPM   PValue      FDR
   <chr>     <chr>        <dbl>   <dbl>    <dbl>    <dbl>
 1 C2        GALC       -1.81    3.81   1.09e-14 1.66e-12
 2 C2        HTT        -0.728   6.07   1.14e- 7 1.19e- 6
 3 C2        FOXM1      -0.153   7.01   1.01e- 2 2.45e- 2
 4 C2        SMN2        0.208   4.83   1.41e- 2 3.27e- 2
 5 C2        MAPT       -0.517   0.0804 1.72e- 2 3.90e- 2
 6 C2        SNCA       -0.483  -2.27   3.72e- 1 4.88e- 1
 7 CUOME     GALC       -0.655   3.81   5.89e- 8 1.63e- 6
 8 CUOME     SMN2       -0.347   4.83   2.90e- 4 1.85e- 3
 9 CUOME     SNCA        1.20   -2.27   2.20e- 2 6.24e- 2
10 CUOME     FOXM1      -0.0871  7.01   1.19e- 1 2.32e- 1
# … with 14 more rows

DE.results %>%
  mutate(Label = case_when(
    GeneSymbol %in% GenesOfInterest ~ paste(GeneSymbol, format.pval(PValue
, digits=2)),
    TRUE ~ NA_character_)) %>%
  mutate(`FDR<0.01`=FDR<0.01) %>%
  ggplot(aes(x=logFC, y=-log10(PValue), color=`FDR<0.01`, label=Label)) +
  geom_point(alpha=0.2) +
  geom_label_repel(segment.color = 'black', min.segment.length = 0, size=3) +
  facet_wrap(~treatment) +
  theme_classic() +
  ggtitle("Differential expression volcano") +
  labs(caption="Genes of interest labelled with nominal P")

Version	Author	Date
db9a4f0	Benjmain Fair	2022-08-01
4a410f5	Benjmain Fair	2022-03-08

DE.results %>%
  mutate(Label = case_when(
    GeneSymbol %in% c("HTT", "SNCA", "MAPT", "SMN2") ~ paste(GeneSymbol, format.pval(PValue
, digits=2)),
    TRUE ~ NA_character_)) %>%
  mutate(`FDR<0.01`=FDR<0.01) %>%
  arrange(desc(Label)) %>%
  ggplot(aes(y=logFC, x=logCPM, color=`FDR<0.01`, label=Label)) +
  geom_point(alpha=0.2) +
  geom_label_repel(segment.color = 'black', min.segment.length = 0, size=3) +
  facet_wrap(~treatment) +
  theme_classic() +
  ggtitle("Differential expression MA-plot for neuronal genes of interest") +
  labs(caption="Genes of interest labelled with nominal P")

Version	Author	Date
db9a4f0	Benjmain Fair	2022-08-01
4a410f5	Benjmain Fair	2022-03-08

DE.results %>%
  mutate(Label = case_when(
    GeneSymbol %in% GenesOfInterest ~ paste(GeneSymbol, format.pval(PValue
, digits=2)),
    TRUE ~ NA_character_)) %>%
  mutate(`FDR<0.01`=FDR<0.01) %>%
  arrange(desc(Label)) %>%
  ggplot(aes(y=logFC, x=logCPM, color=`FDR<0.01`, label=Label)) +
  geom_point(alpha=0.2) +
  geom_label_repel(segment.color = 'black', min.segment.length = 0, size=3) +
  facet_wrap(~treatment) +
  theme_classic() +
  ggtitle("Differential expression MA-plot") +
  labs(caption="Genes of interest labelled with nominal P")

Version	Author	Date
db9a4f0	Benjmain Fair	2022-08-01
4a410f5	Benjmain Fair	2022-03-08

#Number significant tests
DE.results %>%
  select(treatment, FDR, logFC) %>%
  mutate(
    FDR_0.10 = FDR< 0.1,
    FDR_0.05 = FDR< 0.05,
    FDR_0.01 = FDR< 0.01,
    FCGreaterThan2 = abs(logFC) > 1
    ) %>%
  gather(key="FDR", value = "TestResults", FDR_0.10, FDR_0.05, FDR_0.01) %>%
  mutate(FDR=recode(FDR, FDR_0.01="FDR<0.01", FDR_0.05="FDR<0.05", FDR_0.10="FDR<0.10")) %>%
  group_by(FDR, treatment, FCGreaterThan2) %>%
  summarise(NumDiscoveries=sum(TestResults)) %>%
  ungroup() %>%
  mutate(FCGreaterThan2=if_else(FCGreaterThan2, "FoldChange > 2", "FoldChange < 2")) %>%
  ggplot(aes(x=treatment, y=NumDiscoveries, fill=FDR)) +
  geom_col(position="dodge") +
  facet_wrap(~FCGreaterThan2) +
  theme_bw() +
  ggtitle("Number differential expression discoveries")

`summarise()` has grouped output by 'FDR', 'treatment'. You can override using
the `.groups` argument.

Version	Author	Date
db9a4f0	Benjmain Fair	2022-08-01

Note the colors on these plots are for FDR, taking into account multiple test problem. See table of Pvalues for results for these specific genes Jingxin is interested in… The C2 treatment somewhat downregulates MAPT (P=0.039, log2FC=-0.517). SM2 very strongly downregulates MAPT (P=2.2E-7, logFC=-1.97)

Let’s also plot the pairwise logFC of expression:

Limits = 5

upper_point <- function(data, mapping, ...) { 
  ggplot(data = data, mapping = mapping, ...) + 
    geom_point(..., alpha=0.01) + 
    scale_y_continuous(limits = c(-Limits, Limits)) +
    scale_x_continuous(limits = c(-Limits, Limits)) +
    geom_abline() +
    geom_text_repel(aes(label=Label), color='red', segment.color = 'red', min.segment.length = 0, size=3, max.overlaps=20) +
    theme_bw()
}


diag_limitrange <- function(data, mapping, ...) { 
  ggplot(data = data, mapping = mapping, ...) + 
    geom_density(...) + 
    coord_cartesian(xlim=c(-Limits,Limits)) +
    theme_bw()
}


DE.results %>%
  mutate(Label = case_when(
    GeneSymbol %in% GenesOfInterest ~ GeneSymbol,
    TRUE ~ NA_character_)) %>%
  select(Label, treatment, logFC, EnsemblID) %>%
  # group_by(treatment) %>%
  # mutate(rn = row_number()) %>%
  # ungroup() %>%
  pivot_wider(id_cols = c("EnsemblID", "Label"), names_from = "treatment" , values_from = "logFC") %>%
  tidyr::unnest() %>%
  arrange(Label) %>%
  mutate(color=if_else(is.na(Label), "black", "red")) %>%
  mutate(alpha=if_else(is.na(Label), 0.01, 1)) %>%
  ggpairs(
    title="treatment pair expression logFC and Spearman coef",
    columns = 3:6,
    upper=list(continuous = wrap("cor", method = "spearman", hjust=0.7)),
    lower=list(continuous = upper_point),
    diag=list(continuous = diag_limitrange))

Version	Author	Date
db9a4f0	Benjmain Fair	2022-08-01
4a410f5	Benjmain Fair	2022-03-08

Ok, so based on differential expression, SM2 and A10 are quite similar (correlation coefficient of logFC = 0.802).

Now I will move onto the splicing analyses…

Differential splicing analysis

Basic quality control and differntial splicing test results

First, an overview of the data from a splicing perspective by looking at the leafcutter splice junction count table.

counts <- read.table("../code/SplicingAnalysis/leafcutter/clustering/autosomes/leafcutter_perind_numers.counts.gz", sep = ' ',  header=T) %>%
  rownames_to_column("intron") %>%
  gather(key="Sample", value="Count", -intron)

head(counts)

                    intron Sample Count
1 chr1:17055:17233:clu_1_-  A10.1    92
2 chr1:17055:17606:clu_1_-  A10.1     8
3 chr1:17055:17915:clu_1_-  A10.1    10
4 chr1:17368:17606:clu_1_-  A10.1    50
5 chr1:17742:17915:clu_1_-  A10.1    68
6 chr1:18061:18268:clu_2_-  A10.1    43

First, let’s check evenness of junction coverage across samples. Sum total splice junction counts.

counts %>%
  group_by(Sample) %>%
  summarise(TotalJunctionCounts = sum(Count)) %>%
  ggplot(aes(x=Sample, y=TotalJunctionCounts)) +
  geom_col() +
  theme_classic() +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1)) +
  ggtitle("Splice junction read counts per sample")

Version	Author	Date
4a410f5	Benjmain Fair	2022-03-08

All samples are reasonbly well matched by this metric. Now let’s check how samples cluster looking at sperman correlation coefficient of intron excision ratios

PSI.table <- counts %>%
  mutate(Cluster = str_replace(intron, ".+:(.+)$", "\\1")) %>%
  group_by(Sample, Cluster) %>%
  mutate(ClusterSum = sum(Count)) %>%
  mutate(PSI = Count/ClusterSum * 100) %>%
  ungroup()

PSI.table %>%
  select(intron, Sample, PSI) %>%
  spread(key = Sample, value = PSI) %>%
  column_to_rownames("intron") %>%
  drop_na() %>%
  cor(method="spearman") %>%
  heatmap.2(trace="none")

Version	Author	Date
4a410f5	Benjmain Fair	2022-03-08

Conditions cluster together, except for a smaple of CUOME and DMSO which is expected as I was told we expect CUOME to not have much effect on splicing, similar to DMSO control. Good. Now let’s read in the leafcutter differential splicing results.

cluster_sig_files <- list.files("../code/SplicingAnalysis/leafcutter/differential_splicing", "*_cluster_significance.txt", full.names = T)
effect_sizes_files <- list.files("../code/SplicingAnalysis/leafcutter/differential_splicing", "*_effect_sizes.txt", full.names = T)
treatments <- str_replace(cluster_sig_files, ".+/(.+?)_cluster_significance.txt", "\\1")

cluster.sig <- map(cluster_sig_files, read_tsv) %>%
  set_names(cluster_sig_files) %>%
  bind_rows(.id="f") %>%
  mutate(treatment = str_replace(f, ".+/(.+?)_cluster_significance.txt", "\\1")) %>%
  select(-f)

Parsed with column specification:
cols(
  status = col_character(),
  loglr = col_double(),
  df = col_double(),
  p = col_double(),
  cluster = col_character(),
  p.adjust = col_double()
)
Parsed with column specification:
cols(
  status = col_character(),
  loglr = col_double(),
  df = col_double(),
  p = col_double(),
  cluster = col_character(),
  p.adjust = col_double()
)
Parsed with column specification:
cols(
  status = col_character(),
  loglr = col_double(),
  df = col_double(),
  p = col_double(),
  cluster = col_character(),
  p.adjust = col_double()
)
Parsed with column specification:
cols(
  status = col_character(),
  loglr = col_double(),
  df = col_double(),
  p = col_double(),
  cluster = col_character(),
  p.adjust = col_double()
)

effect_sizes <- map(effect_sizes_files, read_tsv) %>%
  set_names(effect_sizes_files) %>%
  bind_rows(.id="f") %>%
  mutate(treatment = str_replace(f, ".+/(.+?)_effect_sizes.txt", "\\1")) %>%
  select(-f) %>%
  unite("psi_treatment", treatments, sep=" ", na.rm=T) %>%
  mutate(psi_treatment = as.numeric(psi_treatment),
         cluster = str_replace(intron, "(.+?:).+:(.+?)", "\\1\\2"))

Parsed with column specification:
cols(
  intron = col_character(),
  logef = col_double(),
  A10 = col_double(),
  DMSO = col_double(),
  deltapsi = col_double()
)

Parsed with column specification:
cols(
  intron = col_character(),
  logef = col_double(),
  C2 = col_double(),
  DMSO = col_double(),
  deltapsi = col_double()
)

Parsed with column specification:
cols(
  intron = col_character(),
  logef = col_double(),
  CUOME = col_double(),
  DMSO = col_double(),
  deltapsi = col_double()
)

Parsed with column specification:
cols(
  intron = col_character(),
  logef = col_double(),
  SM2 = col_double(),
  DMSO = col_double(),
  deltapsi = col_double()
)

Note: Using an external vector in selections is ambiguous.
ℹ Use `all_of(treatments)` instead of `treatments` to silence this message.
ℹ See <https://tidyselect.r-lib.org/reference/faq-external-vector.html>.
This message is displayed once per session.

head(effect_sizes)

# A tibble: 6 × 7
  intron                  logef psi_treatment    DMSO deltapsi treatment cluster
  <chr>                   <dbl>         <dbl>   <dbl>    <dbl> <chr>     <chr>  
1 chr1:17055:17233:clu…  0.354         0.337  0.379    0.0414  A10       chr1:c…
2 chr1:17055:17606:clu…  0.0208        0.0359 0.0289  -0.00700 A10       chr1:c…
3 chr1:17055:17915:clu… -0.774         0.0266 0.00966 -0.0169  A10       chr1:c…
4 chr1:17368:17606:clu…  0.126         0.276  0.247   -0.0293  A10       chr1:c…
5 chr1:17742:17915:clu…  0.274         0.324  0.336    0.0118  A10       chr1:c…
6 chr1:3635074:3635160… -0.141         0.556  0.499   -0.0573  A10       chr1:c…

Oops, manually inspecting the effect size signs looks like they are all reversed. Usually I think of deltapsi as treatment - control. Here it is the opposite. Let’s flip the sign. then combine the effect sizes table with the cluster significance and plot some results

effect_sizes <- effect_sizes %>%
  mutate(deltapsi = deltapsi*-1,
         logef = logef *-1)


#Combine effect sizes and significance for all samples into single df
leafcutter.ds <- effect_sizes %>%
  left_join(cluster.sig, by=c("treatment", "cluster"))

Plot distribution of how many introns per cluster

leafcutter.ds %>%
  count(treatment, cluster) %>%
  ggplot(aes(x=n)) +
  stat_ecdf() +
  coord_cartesian(xlim=c(0,5)) +
  facet_wrap(~treatment) +
  ylab("ecdf") +
  theme_bw() +
  ggtitle("Distribution of number introns per leafcutter cluster")

Version	Author	Date
4a410f5	Benjmain Fair	2022-03-08

Ok so the median number of introns per cluster is 3…

Now, let’s plot some results of the differential splicing tests…

# Plot histogram of P-values
leafcutter.ds %>%
  distinct(treatment, cluster, .keep_all=T) %>%
  ggplot(aes(x=p)) +
  geom_histogram() +
  facet_wrap(~treatment) +
  theme_bw() +
  ggtitle("differential splicing test P-value histograms")

`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Version	Author	Date
4a410f5	Benjmain Fair	2022-03-08

#Make volcano plot of deltaPSI (max within cluster) and signifance
leafcutter.ds %>%
  group_by(treatment, cluster) %>%
  filter(abs(deltapsi) == max(abs(deltapsi))) %>%
  ungroup() %>%
  ggplot(aes(x=deltapsi, y=-log10(p.adjust), color=p.adjust<0.01)) +
  geom_point(alpha=0.05) +
  xlab("max effect size within cluster\nDeltaPSI") +
  facet_wrap(~treatment) +
  theme_bw() +
  ggtitle("Differential splicing volcano")

Version	Author	Date
4a410f5	Benjmain Fair	2022-03-08

Ok, clearly there are many significant results, at a range of effect sizes. As expected, the CUOME sample has the least signficantly differentially spliced clusters. With so many points its hard to really see how many significant clusters there are. Let’s plot this a different way.

#Number of tests
leafcutter.ds %>%
  distinct(treatment, cluster) %>%
  ggplot(aes(x=treatment)) +
  ylab("total differential splicing tests") +
  geom_bar() +
  theme_bw() +
  ggtitle("Number differential splicing tests per treatment") +
  labs(caption = "Each leafcutter cluster of introns comprises a single test")

Version	Author	Date
4a410f5	Benjmain Fair	2022-03-08

#Number of significant tests
leafcutter.ds %>%
  select(treatment, cluster, deltapsi, p.adjust) %>%
  group_by(treatment, cluster) %>%
  filter(abs(deltapsi) == max(abs(deltapsi))) %>%
  ungroup() %>%
  mutate(
    FDR_0.10 = p.adjust< 0.1,
    FDR_0.05 = p.adjust< 0.05,
    FDR_0.01 = p.adjust< 0.01,
    deltapsi_0.1 = abs(deltapsi) > 0.1
  ) %>%
  gather(key="FDR", value = "TestResults", FDR_0.10, FDR_0.05, FDR_0.01) %>%
  mutate(FDR=recode(FDR, FDR_0.01="FDR<0.01", FDR_0.05="FDR<0.05", FDR_0.10="FDR<0.10")) %>%
  group_by(FDR, deltapsi_0.1, treatment) %>%
  summarise(NumDiscoveries=sum(TestResults)) %>%
  ungroup() %>%
  mutate(deltapsi_0.1=if_else(deltapsi_0.1, "deltaPSI > 0.1", "deltaPSI < 0.1")) %>%
  ggplot(aes(x=treatment, y=NumDiscoveries, fill=FDR)) +
  geom_col(position="dodge") +
  facet_wrap(~deltapsi_0.1) +
  theme_bw() +
  ggtitle("Number differential splicing discoveries")

`summarise()` has grouped output by 'FDR', 'deltapsi_0.1'. You can override
using the `.groups` argument.

Version	Author	Date
4a410f5	Benjmain Fair	2022-03-08

Ok, so we still get over 1000 significant clusters for the three main treatments even if we restrict to FDR<0.01 and deltapsi > 10%.

Yang was curious about this plot and increasing the resolution of the PSI effect size groups (instead of just <0.1 an >0.1)… Let’s plot that. Same plot as above but faceting by different PSI ranges.

leafcutter.ds %>%
  select(treatment, cluster, deltapsi, p.adjust) %>%
  group_by(treatment, cluster) %>%
  filter(abs(deltapsi) == max(abs(deltapsi))) %>%
  ungroup() %>%
  mutate(
    FDR_0.10 = p.adjust< 0.1,
    FDR_0.05 = p.adjust< 0.05,
    FDR_0.01 = p.adjust< 0.01,
    deltapsi_group = cut(abs(deltapsi), breaks=c(0,0.1, 0.2, 0.5, 1))
  ) %>%
  gather(key="FDR", value = "TestResults", FDR_0.10, FDR_0.05, FDR_0.01) %>%
  mutate(FDR=recode(FDR, FDR_0.01="FDR<0.01", FDR_0.05="FDR<0.05", FDR_0.10="FDR<0.10")) %>%
  group_by(FDR, deltapsi_group, treatment) %>%
  summarise(NumDiscoveries=sum(TestResults)) %>%
  ungroup() %>%
  ggplot(aes(x=treatment, y=NumDiscoveries, fill=FDR)) +
  geom_col(position="dodge") +
  facet_wrap(~deltapsi_group) +
  theme_bw() +
  ggtitle("Number differential splicing discoveries")

`summarise()` has grouped output by 'FDR', 'deltapsi_group'. You can override
using the `.groups` argument.

Version	Author	Date
4a410f5	Benjmain Fair	2022-03-08

Ok, that is an important way to visualize the significant results I think. Notably, there are only hundreds to a thousand or so differential splicing results with delta-PSI > 0.2 for A10, C2, and SM2. Virtually no changes of that size for CUOME.

To ask Jingxin:

How were these compunds dosed… Since these are analogs, these may have generally similar effects. If you just upped the dosage of CUOME would you get something like SM2?

What distinguishes the affected introns?

Now let’s try to understand the molecular splicing defect better by looking if there is any meaningful assymetries in splice effect size/direction by alternative splice type (alt 3’ss, alt 5’ss, etc) among introns in significant clusters .

First let’s read in the output of regtools annotate which annotates whether the intron is novel splice donor, splice acceptor, or known… I used the ‘basic’ gencode gtf, which limits transcripts to the ‘basic’ annotated ones and not so many transcripts. I also scored 5’ss and extracted the 5’ss sequence which I want to merge in with the annotations data frame

ColorKey <- sample.list %>%
  filter(cell.type=="Fibroblast") %>%
  filter(!treatment=="DMSO") %>%
  distinct(treatment, .keep_all=T) %>%
  dplyr::select(treatment, color) %>%
  deframe()

SpliceTypeAnnotations <- read_tsv("../code/SplicingAnalysis/leafcutter/JuncfilesMerged.annotated.basic.bed.gz") %>%
  unite(Intron, chrom, start, end) %>%
  distinct(Intron, .keep_all = T) %>%
  select(Intron, strand, exons_skipped, anchor, gene_names, gene_ids, splice_site, strand) %>%
  mutate(IntronType = recode(anchor, A="Alt 5'ss", D="Alt 3'ss", DA="Annotated", N="New Intron", NDA="New Splice Site combination"))

Parsed with column specification:
cols(
  chrom = col_character(),
  start = col_double(),
  end = col_double(),
  name = col_character(),
  score = col_double(),
  strand = col_character(),
  splice_site = col_character(),
  acceptors_skipped = col_double(),
  exons_skipped = col_double(),
  donors_skipped = col_double(),
  anchor = col_character(),
  known_donor = col_double(),
  known_acceptor = col_double(),
  known_junction = col_double(),
  gene_names = col_character(),
  gene_ids = col_character(),
  transcripts = col_character()
)

head(SpliceTypeAnnotations)

# A tibble: 6 × 8
  Intron  strand exons_skipped anchor gene_names gene_ids splice_site IntronType
  <chr>   <chr>          <dbl> <chr>  <chr>      <chr>    <chr>       <chr>     
1 chr1_1… +                  1 DA     DDX11L1    ENSG000… GT-AG       Annotated 
2 chr1_1… +                  6 N      <NA>       <NA>     GT-AG       New Intron
3 chr1_1… -                 21 N      <NA>       <NA>     GT-AG       New Intron
4 chr1_1… -                  0 N      <NA>       <NA>     GT-AG       New Intron
5 chr1_1… -                  0 N      <NA>       <NA>     GT-AG       New Intron
6 chr1_1… -                  0 N      <NA>       <NA>     GT-AG       New Intron

FivePrimeSS <- read_tsv("../code/SplicingAnalysis/leafcutter/JuncfilesMerged.annotated.basic.bed.5ss.tab.gz", col_names = c("Intron", "DonorSeq", "DonorScore")) %>%
  mutate(Intron = str_replace(Intron, "(.+)_.+?::.+$", "\\1"))

Parsed with column specification:
cols(
  Intron = col_character(),
  DonorSeq = col_character(),
  DonorScore = col_double()
)

SpliceTypeAnnotations <- SpliceTypeAnnotations %>%
  inner_join(FivePrimeSS, by="Intron")

leafcutter.ds.annotated <- leafcutter.ds %>%
  mutate(Intron=str_replace(intron, "(.+?):(.+?):(.+?):.+", "\\1_\\2_\\3")) %>%
  inner_join(SpliceTypeAnnotations, by="Intron") %>%
  mutate(UpstreamOfDonor2BaseSeq = substr(DonorSeq, 3, 4)) 

leafcutter.ds.annotated %>%
  filter(p.adjust < 0.01) %>%
  ggplot(aes(x=deltapsi, fill=treatment)) +
  geom_histogram(position="dodge") +
  xlab("Effect size\ndeltaPSI") +
  facet_wrap(~IntronType, scales="free_y") +
  scale_fill_manual(values=ColorKey) +
  theme_light() +
  ggtitle("Distribution of splicing effect sizes") +
  labs(caption=str_wrap("Alt 3'ss means 5'ss but not 3'ss in annotation file (Gencode primary transcripts). Alt 5'ss is similar. New intron means neither splice site is in annotation. New splice site combination means both splice sites are annotated but not together", 60))

`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Version	Author	Date
4a410f5	Benjmain Fair	2022-03-08

#Same data plotted as ecdf
leafcutter.ds.annotated %>%
  filter(p.adjust < 0.01) %>%
  ggplot(aes(x=deltapsi, color=treatment)) +
  stat_ecdf() +
  scale_color_manual(values=ColorKey) +
  ylab("CumulativeFraction") +
  facet_wrap(~IntronType) +
  theme_light() +
  ggtitle("Distribution of splicing effect sizes, as ecdf")

Version	Author	Date
4a410f5	Benjmain Fair	2022-03-08

It seems that direction of effects for all types of splice sites are symmetric about 0. In contrast, I have made this plot before for SF3B1 mutants which assymetrically increase alt 3’ss as has been observed by many others.

Maybe this plot makes sense to plot effect sizes and log-ratios of intron excision ratio changes upon treatment, rather than deltapsi. I’m concerned that alternative 3’ss for example, might in general have lower PSI, focusing on deltaPSI might mask strong molecular effects in terms of log-ratio.

But first Let’s write out this table for Jingxin

leafcutter.ds.annotated %>%
  rename(psi_DMSO=DMSO) %>%
  write_tsv("../output/leafcutter_merged.tsv.gz")

Ok, back to plotting…

leafcutter.ds.annotated %>%
  filter(p.adjust < 0.01) %>%
  ggplot(aes(x=logef, color=treatment)) +
  stat_ecdf() +
  scale_color_manual(values=ColorKey) +
  coord_cartesian(xlim=c(-2, 2)) +
  ylab("CumulativeFraction") +
  facet_wrap(~IntronType) +
  theme_light() +
  ggtitle("Distribution of splicing effect sizes (expressed as log(FoldChange), logef, as ecdf")

Version	Author	Date
db9a4f0	Benjmain Fair	2022-08-01
4a410f5	Benjmain Fair	2022-03-08

Okay, actually this way of looking at the data makes more sense I think… The treatments (besides CUOME) generally increase alternative splicing and annotated basic splice sites increase in relative usage.

But let’s also plot the same thing as a histogram just to get a sense of the absolute count of significant detected effects in each class…

leafcutter.ds.annotated %>%
  filter(p.adjust < 0.01) %>%
  ggplot(aes(x=logef, fill=treatment)) +
  geom_histogram(breaks=seq(-2,2,0.2), position="dodge") +
  coord_cartesian(xlim=c(-2, 2)) +
  scale_fill_manual(values=ColorKey) +
  ylab("Count") +
  facet_grid(rows = vars(IntronType), cols = vars(treatment), scales="free_y") +
  theme_light() +
  ggtitle("Histogram of splicing effect sizes")

Version	Author	Date
db9a4f0	Benjmain Fair	2022-08-01
4a410f5	Benjmain Fair	2022-03-08

Ok, I can see that there are a similar number of upregulated 3’ss as there are 5’ss.

Let’s go back to the cumulative distribution plots to look at effect size assymetries but for other ‘classes’ of introns based on the canonically (5’ss GT … AG 3’ss) sequence:

leafcutter.ds.annotated %>%
  filter(p.adjust < 0.01) %>%
  ggplot(aes(x=logef, color=treatment)) +
  scale_color_manual(values=ColorKey) +
  stat_ecdf() +
  coord_cartesian(xlim=c(-2, 2)) +
  ylab("CumulativeFraction") +
  facet_wrap(~splice_site) +
  theme_light() +
  ggtitle("Ecdf of sig splicing effect sizes, by 5'ss-3'ss dinucleotides")

Version	Author	Date
db9a4f0	Benjmain Fair	2022-08-01
4a410f5	Benjmain Fair	2022-03-08

Nothing particularly notable here, though, there aren’t many non GT-AG introns to begin with so the ecdf is very jaggety.

Let’s make the same plot but for different -2 to 0 positions relative to the splice donor, since risdiplasm (and these analogs) presumably can bind at this region at some splice donors which have a register change like at SMN2.

leafcutter.ds.annotated %>%
  filter(p.adjust < 0.01) %>%
  ggplot(aes(x=logef, color=treatment)) +
  stat_ecdf() +
  geom_vline(xintercept = 0, linetype="dashed") +
  geom_vline(
    data = . %>%
      group_by(UpstreamOfDonor2BaseSeq, treatment) %>%
      summarise(med=median(logef, na.rm=T)),
    aes(xintercept=med, color=treatment),
    linetype='dashed'
  ) +
  scale_color_manual(values=ColorKey) +
  coord_cartesian(xlim=c(-2, 2)) +
  ylab("CumulativeFraction") +
  facet_wrap(~UpstreamOfDonor2BaseSeq) +
  theme_light() +
  ggtitle("Effect sizes for NN|GT introns, grouped by NN") +
  labs(caption="Only included introns in FDR<0.01 significant clusters")

`summarise()` has grouped output by 'UpstreamOfDonor2BaseSeq'. You can override
using the `.groups` argument.

Version	Author	Date
db9a4f0	Benjmain Fair	2022-08-01
4a410f5	Benjmain Fair	2022-03-08

leafcutter.ds.annotated %>%
  ggplot(aes(x=logef, color=treatment)) +
  stat_ecdf() +
  geom_vline(xintercept = 0, linetype="dashed") +
  geom_vline(
    data = . %>%
      group_by(UpstreamOfDonor2BaseSeq, treatment) %>%
      summarise(med=median(logef, na.rm=T)),
    aes(xintercept=med, color=treatment),
    linetype='dashed'
  ) +
  scale_color_manual(values=ColorKey) +
  coord_cartesian(xlim=c(-0.8, 0.8)) +
  ylab("CumulativeFraction") +
  facet_wrap(~UpstreamOfDonor2BaseSeq) +
  theme_light() +
  ggtitle("Effect sizes for NN|GT introns, grouped by NN") +
  labs(caption="All introns included")

`summarise()` has grouped output by 'UpstreamOfDonor2BaseSeq'. You can override
using the `.groups` argument.

Version	Author	Date
db9a4f0	Benjmain Fair	2022-08-01

Woah. there is a huge effect for splice sites with GA upstream of 5’ss.

Let’s plot the exact same thing as a histogram, so we can get a sense of the absolute number of events in each of these groups:

leafcutter.ds.annotated %>%
  filter(p.adjust < 0.01) %>%
  ggplot(aes(x=logef, fill=treatment)) +
  geom_histogram(breaks=seq(-2,2,0.2)) +
  geom_vline(xintercept = 0, linetype="dashed") +
  coord_cartesian(xlim=c(-2, 2)) +
  ylab("Number Signif Introns") +
  scale_fill_manual(values=ColorKey) +
  facet_wrap(~UpstreamOfDonor2BaseSeq) +
  theme_light() +
  ggtitle("Effect sizes for NN|GT introns, grouped by NN") +
  labs(caption="Only included introns in FDR<0.01 significant clusters")

Version	Author	Date
db9a4f0	Benjmain Fair	2022-08-01
4a410f5	Benjmain Fair	2022-03-08

Ok, so even though GA effects are clearly strong, most significant effects, and in fact, even most signficant effects in the positive direction are not from GA|GT splice sites, since AG|GT is just more common.

Let’s plot the relative abundance of annotated introns with these sequences to begin with:

#Frequency table of -2:-1 dinucleotides at 5'ss
DinucletodieFreqs <- leafcutter.ds.annotated %>%
  filter(treatment == "A10") %>%
  filter(IntronType == "Annotated") %>%
  count(UpstreamOfDonor2BaseSeq) %>%
  mutate(Percent = n/sum(n)*100)
DinucletodieFreqs

# A tibble: 16 × 3
   UpstreamOfDonor2BaseSeq     n Percent
   <chr>                   <int>   <dbl>
 1 AA                       2004   4.39 
 2 AC                        475   1.04 
 3 AG                      25367  55.6  
 4 AT                        943   2.07 
 5 CA                       1057   2.32 
 6 CC                        342   0.750
 7 CG                       2967   6.50 
 8 CT                        817   1.79 
 9 GA                        668   1.46 
10 GC                        241   0.528
11 GG                       4166   9.13 
12 GT                        436   0.956
13 TA                        306   0.671
14 TC                        197   0.432
15 TG                       5218  11.4  
16 TT                        416   0.912

ggplot(DinucletodieFreqs, aes(x=UpstreamOfDonor2BaseSeq, y=Percent)) +
  geom_col() +
  xlab("NN dinucleotide") +
  theme_classic() +
  ggtitle("Frequency of NN|GT at annotated 5'ss")

Version	Author	Date
db9a4f0	Benjmain Fair	2022-08-01
4a410f5	Benjmain Fair	2022-03-08

I wonder a lot of the GA|GT introns are in clusters with non GA|GT introns which will naturally have significant effects on cluster-based intron excision ratio. Let’s investigate that, to get a better sense of what proportion of significant changes can be attributed in to a primary effect (on a GA|GT intron). To restate the question another way: A lot of the non GA|GT splicing effects will just be downregulated introns due to an upregulated GA|GT intron in the same cluster… So I want to know what fraction of significant splicing changes can be attributed to a primary effect on a GA|GT intron.

For this I will plot the P-value histograms of clusters containing GA|GT versus that do not.

leafcutter.ds.annotated %>%
  group_by(cluster, treatment, p) %>%
  summarise(GA_in_cluster = ifelse(any("GA" %in% UpstreamOfDonor2BaseSeq), "GA|GT in cluster", "GA|GT not in cluster")) %>%
  ggplot(aes(x=p)) +
  geom_histogram() +
  facet_grid(rows=vars(treatment), cols = vars(GA_in_cluster)) +
  theme_light() +
  ggtitle("P values for clusters, grouped by whether they contain GA|GT intron")

`summarise()` has grouped output by 'cluster', 'treatment'. You can override
using the `.groups` argument.
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Version	Author	Date
db9a4f0	Benjmain Fair	2022-08-01
4a410f5	Benjmain Fair	2022-03-08

Ok, it appears that even in SM2, most of the significant changes are in intron clusters that don’t have an GA|GT intron. Also of note, nearly all clusters with a GA|GT intron have a significant change.

Another way to look at this would be to look at effect size. leafcutter is quite sensitive such that even small delta-psi splicing changes may have cery low P-values. So perhaps a lot of the small P-values in the non GA|GT clusters represent relatively small effect sizes

Let’s plot the histogram of effect sizes for GA|GT versus non GA|GT containing clusters… Trying to ask: “What fraction of large effect sizes singificant changes are for GA|GT containing clusters”

leafcutter.ds.annotated %>%
  group_by(cluster, treatment, p.adjust) %>%
  mutate(GA_in_cluster = ifelse(any("GA" %in% UpstreamOfDonor2BaseSeq), "GA|GT in cluster", "GA|GT not in cluster")) %>%
  filter(abs(logef) == max(abs(logef))) %>%
  filter(p.adjust < 0.01) %>%
  ggplot(aes(x=logef, fill=GA_in_cluster)) +
  geom_histogram(breaks=seq(-5,5,0.2)) +
  coord_cartesian(xlim=c(-5, 5)) +
  ylab("Number Signif Introns") +
  xlab("max(logef) in cluster") +
  # facet_grid(rows=vars(treatment), cols = vars(GA_in_cluster), scales = "free_y") +
  # facet_grid(cols = vars(GA_in_cluster), scales = "free_y") +
  facet_grid(rows = vars(treatment), cols = , scales = "free_y") +
  ggtitle("GA|GT introns only start explain most splicing changes at large effect sizes")

Version	Author	Date
db9a4f0	Benjmain Fair	2022-08-01
4a410f5	Benjmain Fair	2022-03-08

Ok, so even if you consider only the max effect in each cluster (as a way of removing redundant non-positive effect sizes of non GA|GT introns in GA|GT-containing clusters), non GA|GT clusters still make up the large majority of large effects. That is to say, I believe there are more large-effect (possibly indirect) splicing effects than direct effects on GA|GT. I suppose many of the splicing effects on non-GA|GT introns could still be direct, for example, by promoting snRNA:5’ss duplex pairing at other bulged nucleotides besides the -1A position. This paper and this review, fig1 has a nice figure of other snRNA:5’ss pairings with bulged nucleotides that might be worth exploring at some point… However, the GA|GT effect is so strong that I will continue to focus on that class of introns to learn more…

Note that in the above plot, I plotted the max(logef in cluster) and it is not always positive, as I expect for actual GA|GT introns. What is going on. One hypothesis is that a lot of the Alt 5’ss GA|GT introns are paired with a similarly effect alt 3’ss to create a new cassette exon which is also upregulated. Let’s look at the GA|GT-containing clusters and plot the effects, colored by type of splice site.

leafcutter.ds.annotated %>%
  group_by(cluster, treatment, p.adjust) %>%
  mutate(GA_in_cluster = ifelse(any("GA" %in% UpstreamOfDonor2BaseSeq), "GA|GT in cluster", "GA|GT not in cluster")) %>%
  filter(abs(logef) == max(abs(logef))) %>%
  filter(p.adjust < 0.01) %>%
  filter(GA_in_cluster == "GA|GT in cluster") %>%
  ggplot(aes(x=logef, fill=IntronType)) +
  geom_histogram(breaks=seq(-5,5,0.2)) +
  coord_cartesian(xlim=c(-5, 5)) +
  ylab("Number Signif Introns") +
  xlab("max(logef) in cluster") +
  # facet_grid(rows=vars(treatment), cols = vars(GA_in_cluster), scales = "free_y") +
  # facet_grid(cols = vars(GA_in_cluster), scales = "free_y") +
  facet_wrap(~treatment,  scales = "free_y") +
  theme_light() +
  ggtitle("Alt 5'ss might be a part of cryptic exon activation")

Version	Author	Date
db9a4f0	Benjmain Fair	2022-08-01
4a410f5	Benjmain Fair	2022-03-08

Ok, I see that the max effect for GA|GT introns is almost just as common to be alt 3’ss, suggesting cryptic exon activation is common mode. And that comes at the expense of downregulation of annotated introns. Maybe a simpler way to plot the same idea is to plot the distribution or median affect size all types of introns in GA|GT containing clusters:

leafcutter.ds.annotated %>%
  group_by(cluster, treatment, p.adjust) %>%
  mutate(GA_in_cluster = ifelse(any("GA" %in% UpstreamOfDonor2BaseSeq), "GA|GT in cluster", "GA|GT not in cluster")) %>%
  ungroup() %>%
  filter(GA_in_cluster=="GA|GT in cluster") %>%
  group_by(treatment, IntronType) %>%
  summarise(MedianEffect = median(deltapsi)) %>%
  ggplot(aes(x=IntronType, fill=IntronType, y=MedianEffect)) +
  geom_col() +
  facet_wrap(~treatment) +
  theme_bw() +
  theme(axis.title.x=element_blank(),
        axis.text.x=element_blank(),
        axis.ticks.x=element_blank()) +
  ggtitle("Alt 5'ss might be a part of cryptic exon activation") +
  labs(caption="GA|GT intron clusters have similar median effect of alt 3'ss, consistent with cryptic exon")

`summarise()` has grouped output by 'treatment'. You can override using the
`.groups` argument.

Version	Author	Date
db9a4f0	Benjmain Fair	2022-08-01
4a410f5	Benjmain Fair	2022-03-08

leafcutter.ds.annotated %>%
  group_by(cluster, treatment, p.adjust) %>%
  mutate(GA_in_cluster = ifelse(any("GA" %in% UpstreamOfDonor2BaseSeq), "GA|GT in cluster", "GA|GT not in cluster")) %>%
  ungroup() %>%
  filter(GA_in_cluster=="GA|GT in cluster") %>%
  ggplot(aes(x=IntronType, fill=IntronType, y=deltapsi)) +
  geom_violin() +
  coord_cartesian(ylim=c(-0.1, 0.1)) +
  facet_wrap(~treatment) +
  theme_bw() +
  theme(axis.title.x=element_blank(),
        axis.text.x=element_blank(),
        axis.ticks.x=element_blank()) +
  ggtitle("Alt 5'ss might be a part of cryptic exon activation") +
  labs(caption="GA|GT intron clusters have similar median effect of alt 3'ss, consistent with cryptic exon")

Version	Author	Date
db9a4f0	Benjmain Fair	2022-08-01
4a410f5	Benjmain Fair	2022-03-08

Ok, actually maybe a better way to plot this is to just look at the scatter plot of correlation of delta-psi for GA|GT introns, and alt 3’ss in the same cluster if it exists. So specifically, I’ll try plotting the delta-psi effect size for each pair of introns within a cluster, grouped by the type of intron. So in the case of cryptic exon activation, i expect alt 3’ss to be paired with similar positive effect size change with alt 5’ss, and negative effect size for annotated.

ToCor <- leafcutter.ds.annotated %>%
  group_by(cluster, treatment, p.adjust) %>%
  filter(any("GA" %in% UpstreamOfDonor2BaseSeq & IntronType=="Alt 5'ss" & sign(logef)==1 & p.adjust<0.01)) %>%
  ungroup() %>%
  add_count(cluster, treatment, p.adjust) %>%
  # filter(n==3) %>%
  select(intron, cluster, Effect=deltapsi, treatment, IntronType, n) %>%
  # pivot_wider(id_cols=c("cluster", "intron") ,names_from="IntronType", values_from="logef")
  left_join(., ., by = c("treatment", "cluster", "n")) %>% 
  filter(intron.x != intron.y) %>% 
  rowwise() %>%
  mutate(name = toString(sort(c(intron.x,intron.y)))) %>% 
  distinct(name, cluster, .keep_all = T) %>%
  mutate(IntronType_name = toString(sort(c(IntronType.x,IntronType.y))))

Woah… There is a lot going on and I think it is because the clusters with too many introns are hard to interpret. Let’s consider just looking at clusters with a few introns.

ToCor %>% distinct(cluster, n) %>%
  ggplot(aes(x=n)) +
  stat_ecdf() +
  coord_cartesian(xlim=c(0,10)) +
  ylab("ecdf") +
  xlab("num introns in cluster") +
  theme_classic() +
  ggtitle("About half of significant GA-GT clusters have <= 5 introns")

Version	Author	Date
db9a4f0	Benjmain Fair	2022-08-01
4a410f5	Benjmain Fair	2022-03-08

ToCor %>%
  filter(n <= 5) %>%
  ggplot(aes(x=Effect.x, y=Effect.y)) +
  geom_point(alpha=0.05) +
  geom_hline(yintercept = 0) +
  geom_vline(xintercept = 0) +
  facet_grid(cols=vars(IntronType_name), rows=vars(n), labeller = labeller(IntronType_name = label_wrap_gen(10))) +
  ggtitle("DeltaPsi Correlation of within-cluster effect-sizes by intron-type") +
  labs(caption="Significant GA-GT-containing clusters with n introns included") +
  theme_bw()

Version	Author	Date
db9a4f0	Benjmain Fair	2022-08-01

Ok, So I feel confident now that what usually happens is that when there is a cluster with a GA|GT intron, it gets increased as an alt 5’ss, concominant with an increase in alt 3’ss to create a new exon, and sometimes a new intron comes too… and all this happens at the expense of annotated introns.

Let’s check this another way. Out of all upregulated cryptic 5’ss, how many are in a cluster with an upregulated cryptic 3’ss within 500bp of the upregulated 5’ss.

#ToDO
ToCor %>%
  filter()

Check if upregulated 3’ss is as common as upregulated 5’ss

leafcutter.ds.annotated %>%
  group_by(cluster, treatment, p.adjust) %>%
  filter(p.adjust < 0.01) %>%
  # filter((any(IntronType=="Alt 3'ss" & sign(logef)==1)) | (any(IntronType=="Alt 5'ss" & sign(logef)==1)) ) %>%
  filter(any("GA" %in% UpstreamOfDonor2BaseSeq)) %>%
  ungroup() %>%
  add_count(cluster, treatment, p.adjust) %>%
  filter(n==2) %>%
  # filter(treatment == "SM2") %>%
  group_by(cluster, treatment, p.adjust, n) %>%
  arrange(IntronType) %>%
  summarise(ClusterConfiguration = paste(IntronType, collapse = ", ")) %>%
  ungroup() %>%
  count(treatment, n, ClusterConfiguration, name="NumClusters") %>%
  ggplot(aes(y=NumClusters, x=ClusterConfiguration)) +
  geom_col() +
  theme_classic() +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1)) +
  facet_wrap(~treatment) +
  ggtitle("Sig clusters (n=2 introns) with GA|GT intron are usually cryptic 5'ss, as expected")

`summarise()` has grouped output by 'cluster', 'treatment', 'p.adjust'. You can
override using the `.groups` argument.

Version	Author	Date
db9a4f0	Benjmain Fair	2022-08-01
4a410f5	Benjmain Fair	2022-03-08

leafcutter.ds.annotated %>%
  group_by(cluster, treatment, p.adjust) %>%
  filter(p.adjust < 0.01) %>%
  filter((any(IntronType=="Alt 3'ss" & sign(logef)==1)) | (any(IntronType=="Alt 5'ss" & sign(logef)==1)) ) %>%
  # filter(any("GA" %in% UpstreamOfDonor2BaseSeq)) %>%
  ungroup() %>%
  add_count(cluster, treatment, p.adjust) %>%
  filter(n==2) %>%
  # filter(treatment == "SM2") %>%
  group_by(cluster, treatment, p.adjust, n) %>%
  arrange(IntronType) %>%
  summarise(ClusterConfiguration = paste(IntronType, collapse = ", ")) %>%
  ungroup() %>%
  count(treatment, n, ClusterConfiguration, name="NumClusters") %>%
  ggplot(aes(y=NumClusters, x=ClusterConfiguration)) +
  geom_col() +
  theme_classic() +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1)) +
  facet_wrap(~treatment) +
  ggtitle("There are about as many upregulated cryptic 3'ss as upregulated cryptic 5'ss")

`summarise()` has grouped output by 'cluster', 'treatment', 'p.adjust'. You can
override using the `.groups` argument.

Version	Author	Date
db9a4f0	Benjmain Fair	2022-08-01
4a410f5	Benjmain Fair	2022-03-08

leafcutter.ds.annotated %>%
  group_by(cluster, treatment, p.adjust) %>%
  filter(p.adjust < 0.01) %>%
  filter((any(IntronType=="Alt 3'ss" & sign(logef)==1)) | (any(IntronType=="Alt 5'ss" & sign(logef)==1)) ) %>%
  filter(any("GA" %in% UpstreamOfDonor2BaseSeq)) %>%
  ungroup() %>%
  add_count(cluster, treatment, p.adjust) %>%
  filter(n==2) %>%
  # filter(treatment == "SM2") %>%
  group_by(cluster, treatment, p.adjust, n) %>%
  arrange(IntronType) %>%
  summarise(ClusterConfiguration = paste(IntronType, collapse = ", ")) %>%
  ungroup() %>%
  count(treatment, n, ClusterConfiguration, name="NumClusters") %>%
  ggplot(aes(y=NumClusters, x=ClusterConfiguration)) +
  geom_col() +
  theme_classic() +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1)) +
  facet_wrap(~treatment)

`summarise()` has grouped output by 'cluster', 'treatment', 'p.adjust'. You can
override using the `.groups` argument.

Version	Author	Date
db9a4f0	Benjmain Fair	2022-08-01
4a410f5	Benjmain Fair	2022-03-08

leafcutter.ds.annotated %>%
  filter(treatment == "SM2") %>%
  filter(p.adjust < 0.01) %>%
  count(cluster) %>%
  ggplot(aes(x=n)) +
  stat_ecdf() +
  ylab("Cumulative fraction clusters") +
  xlab("Number introns in cluster") +
  ggtitle("Significant introns (FDR<0.01) in clusters that can be hard to interpret") +
  coord_cartesian(xlim=c(0,10)) +
  theme_classic()

Version	Author	Date
db9a4f0	Benjmain Fair	2022-08-01

leafcutter.ds.annotated %>%
  group_by(cluster, treatment, p.adjust) %>%
  filter(p.adjust < 0.01) %>%
  # filter((any(IntronType=="Alt 3'ss" & sign(logef)==1)) | (any(IntronType=="Alt 5'ss" & sign(logef)==1)) ) %>%
  # filter(any("GA" %in% UpstreamOfDonor2BaseSeq)) %>%
  ungroup() %>%
  add_count(cluster, treatment, p.adjust) %>%
  filter(n %in% c(2,3,4 )) %>%
  filter(treatment == "SM2") %>%
  group_by(cluster, treatment, p.adjust, n) %>%
  arrange(IntronType) %>%
  summarise(ClusterConfiguration = paste(IntronType, collapse = ", ")) %>%
  ungroup() %>%
  count(treatment, n, ClusterConfiguration, name="NumClusters") %>%
  mutate(Clusterwise_AS_Type = case_when(
        str_detect(ClusterConfiguration, "New Splice Site combination") & str_detect(ClusterConfiguration, "Alt 3'ss") ~ NA_character_,
    str_detect(ClusterConfiguration, "New Splice Site combination") & str_detect(ClusterConfiguration, "Alt 5'ss") ~ NA_character_,
    str_detect(ClusterConfiguration,"Alt 5'ss") & !str_detect(ClusterConfiguration,"Alt 3'ss") ~ "Alt 5'ss cluster",
    str_detect(ClusterConfiguration, "Alt 3'ss") & !str_detect(ClusterConfiguration,"Alt 5'ss") ~ "Alt 3'ss cluster",
    str_detect(ClusterConfiguration, "Alt 3'ss") & str_detect(ClusterConfiguration,"Alt 3'ss") ~ "Cryptic exon",
    str_detect(ClusterConfiguration, "Annotated, Annotated, Annotation") ~ "Annotated cassette exon",
    str_detect(ClusterConfiguration, "New Splice Site combination") ~ "Annotated cassette exon",
    TRUE ~ NA_character_
  )) %>%
  ggplot(aes(y=NumClusters, x=ClusterConfiguration, fill=Clusterwise_AS_Type)) +
  geom_col() +
  theme_classic() +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1)) +
  facet_wrap(~treatment) +
  ggtitle("exon-skipping drives ~50% of significant clusters")

`summarise()` has grouped output by 'cluster', 'treatment', 'p.adjust'. You can
override using the `.groups` argument.

Version	Author	Date
db9a4f0	Benjmain Fair	2022-08-01

Are most cryptic 5’ss clusters with 2 introns at GA|GT introns

leafcutter.ds.annotated %>%
  group_by(cluster, treatment, p.adjust) %>%
  filter(p.adjust < 0.01) %>%
  filter((any(IntronType=="Alt 5'ss" & sign(logef)==1)) ) %>%
  mutate(DoesClusterContainGAGT.Intron = any("GA" %in% UpstreamOfDonor2BaseSeq)) %>%
  ungroup() %>%
  add_count(cluster, treatment, p.adjust) %>%
  filter(n==2) %>%
  count(cluster, treatment, DoesClusterContainGAGT.Intron, name="NumClusters") %>%
  # filter(treatment == "SM2") %>%
  ggplot(aes(y=NumClusters, x=treatment, fill=DoesClusterContainGAGT.Intron)) +
  geom_col() +
  theme_classic() +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1))

Version	Author	Date
db9a4f0	Benjmain Fair	2022-08-01
4a410f5	Benjmain Fair	2022-03-08

Quick double check that the Gencode basic annotations that I have been working off are usually but not always 1 transcript per gene

#What is the gencode:
transcripts <- read_tsv("/project2/yangili1/bjf79/ChromatinSplicingQTLs/code/ReferenceGenome/Annotations/GTFTools_BasicAnnotations/gencode.v34.chromasomal.tss.bed", col_names = c("chr", "start", "end", "strand", "transcript", "gene")) %>%
  distinct(transcript, gene)

transcripts %>%
  count(gene) %>%
  pull(n) %>% median()

transcripts %>%
  count(gene) %>%
  ggplot(aes(x=n)) +
  coord_cartesian(xlim=c(0,10)) +
  geom_bar()

Ok, yes,the slight majority of genes are a single transcript per gene in this annotation.

Checking effects of NNGA|GT introns

Now let’s look at more than jus the GA|GT. For example, filtering for GA|GT, let’s make the same plot for the two nucleotides upstream (NNGA|GT)

leafcutter.ds.annotated %>%
  filter(p.adjust < 0.01) %>%
  filter(UpstreamOfDonor2BaseSeq == "GA") %>%
  mutate(UpstreamOfGAGT = substr(DonorSeq, 1, 2)) %>%
  ggplot(aes(x=logef, color=treatment)) +
  stat_ecdf() +
  geom_vline(xintercept = 0, linetype="dashed") +
  geom_vline(
    data = . %>%
      group_by(UpstreamOfGAGT, treatment) %>%
      summarise(med=median(logef, na.rm=T)),
    aes(xintercept=med, color=treatment),
    linetype='dashed'
  ) +
  scale_color_manual(values=ColorKey) +
  ggtitle("Effect sizes for NNGA|GT introns, grouped by NN") +
  coord_cartesian(xlim=c(-8, 8)) +
  ylab("CumulativeFraction") +
  facet_wrap(~UpstreamOfGAGT) +
  theme_light() +
  labs(caption="Only FDR<0.01 GA|GT introns included")

`summarise()` has grouped output by 'UpstreamOfGAGT'. You can override using
the `.groups` argument.

Version	Author	Date
db9a4f0	Benjmain Fair	2022-08-01
4a410f5	Benjmain Fair	2022-03-08

leafcutter.ds.annotated %>%
  filter(UpstreamOfDonor2BaseSeq == "GA") %>%
  mutate(UpstreamOfGAGT = substr(DonorSeq, 1, 2)) %>%
  ggplot(aes(x=logef, color=treatment)) +
  stat_ecdf() +
  geom_vline(xintercept = 0, linetype="dashed") +
  geom_vline(
    data = . %>%
      group_by(UpstreamOfGAGT, treatment) %>%
      summarise(med=median(logef, na.rm=T)),
    aes(xintercept=med, color=treatment),
    linetype='dashed'
  ) +
  scale_color_manual(values=ColorKey) +
  ggtitle("Effect sizes for NNGA|GT introns, grouped by NN") +
  coord_cartesian(xlim=c(-4, 4)) +
  ylab("CumulativeFraction") +
  facet_wrap(~UpstreamOfGAGT) +
  theme_light() +
  labs(caption="All GA|GT introns included")

`summarise()` has grouped output by 'UpstreamOfGAGT'. You can override using
the `.groups` argument.

Version	Author	Date
db9a4f0	Benjmain Fair	2022-08-01

leafcutter.ds.annotated %>%
  filter(p.adjust < 0.01) %>%
  filter(UpstreamOfDonor2BaseSeq == "GA") %>%
  mutate(UpstreamOfGAGT = substr(DonorSeq, 1, 2)) %>%
  ggplot(aes(x=logef, color=treatment)) +
  geom_histogram(breaks=seq(-8,8,0.2)) +
  geom_vline(xintercept = 0, linetype="dashed") +
  scale_color_manual(values=ColorKey) +
  ggtitle("Effect sizes for NNGA|GT introns, grouped by NN") +
  coord_cartesian(xlim=c(-8, 8)) +
  ylab("Number Signif Introns") +
  facet_wrap(~UpstreamOfGAGT) +
  theme_light() +
  labs(caption="Only FDR<0.01 GA|GT introns included")

Version	Author	Date
db9a4f0	Benjmain Fair	2022-08-01

I made some of these plots with filtering for only significant introns, and others including all introns. I think all introns might be the less biased way to do this analysis, to avoid winners curse effects.

Ok, perhaps for all the drugs except for CUOME, AGGA|GT might be the most effected sequence… Note that we are looking into exonic sequences. There could be confounding factors from coding restraints. For example, nucleotides that are more likely to introduce stop codons might have weaker measured effects due to NMD dampening. Let’s approach this motif analysis a bit differently, maybe with MEME suite tools or something. I’ll get to that later…

Let’s also check if affected GA|GT introns tend to be weaker in terms of the other part of the motif. I have scored 5’ss based on PWM (log-odds ratio). This log-odds ratio is formally defined as the log ratio of the probability of seeing the sequence in a 5’ss versus background (where background is uniform nucleotide probabilities):

\(log_2(\frac{Pr(Motif)}{Pr(Background)})\)

and the resulting PWM can also be used to create this weblogo:

PWM

Let’s investigate how drug effects on GA|GT 5’ss introns is correlated with motif score. But first, to get some intuitions on this motif score, let’s plot the distribution for each class of splice sites annotations. I expect a weaker score generally for cryptic 5’ss.

SpliceTypeAnnotations %>%
  ggplot(aes(x=DonorScore, color=IntronType)) +
  stat_ecdf() +
  ylab("CumulativeFraction") +
  coord_cartesian(xlim=c(2,12)) +
  theme_classic() +
  ggtitle("Alt 5'ss gnerally have weaker 5'ss motifs")

Version	Author	Date
db9a4f0	Benjmain Fair	2022-08-01
4a410f5	Benjmain Fair	2022-03-08

Ok, it looks like the score typically ranges from 5-10 or so, and unannotated 5’ss introns (and “New introns” which by definition have both an unannotated 5’ss and unannoted 3’ss) have weaker scores. Similarly, let’s plot the distribution of scores for GA|GT versus non GA|GT introns. We could techincally just look up the effect of this from the PWM, but let’s plot the calculated distributoin of scores anyway…

SpliceTypeAnnotations %>%
  mutate(UpstreamOfDonor2BaseSeq = substr(DonorSeq, 3, 4))  %>%
  mutate(IsGAGTIntron = ifelse(UpstreamOfDonor2BaseSeq == "GA", "GA|GT", "!GAGT")) %>%
  ggplot(aes(x=DonorScore, color=IsGAGTIntron)) +
  stat_ecdf() +
  ylab("CumulativeFraction") +
  coord_cartesian(xlim=c(2,12)) +
  theme_classic() +
  ggtitle("GA|GT introns cause weaker 5'ss motif score")

Version	Author	Date
db9a4f0	Benjmain Fair	2022-08-01
4a410f5	Benjmain Fair	2022-03-08

Ok, that makes sense, GA|GT is a pretty weak 5’ss motif to begin with (best is AG|GT)…

Now let’s see if the more “affected” GA|GT introns by the treatments tend to have weaker scores.

Here I am plotting the effect size estimate of ALL GA|GT 5’ss introns as a function of log-odds 5’ss score. I negative correlation would make sense with the hypothesis that weak 5’ss will be more affected.

leafcutter.ds.annotated %>%
  filter(UpstreamOfDonor2BaseSeq == "GA") %>%
  # filter(IntronType == "Alt 5'ss") %>%
  mutate(NominallySignificant = p<0.05) %>%
  ggplot(aes(x=DonorScore, y=logef, color=NominallySignificant, group=treatment)) +
  geom_point(alpha=0.3) +
  geom_smooth(method="lm") +
  geom_hline(yintercept = 0, linetype="dashed") +
  facet_wrap(~treatment) +
  coord_cartesian(xlim=c(2,12), ylim=c(-6,6)) +
  theme_classic() +
  ggtitle("Stronger 5'ss are less affected by treatment")

`geom_smooth()` using formula 'y ~ x'

Version	Author	Date
db9a4f0	Benjmain Fair	2022-08-01
4a410f5	Benjmain Fair	2022-03-08

Ok, there is definitely some negative correlation, except for maybe in CUOME. I decided to color points by if the splicing event is nominally significant, which nearly all points are in this set of GA|GT introns. The correlation is quite weak, and even otherwise very strong GA|GT splice sites are affected. That is, this trend doesn’t explain much of the variance. Let’s also plot the same thing but for different classes of introns.

leafcutter.ds.annotated %>%
  filter(UpstreamOfDonor2BaseSeq == "GA") %>%
  mutate(NominallySignificant = p<0.05) %>%
  ggplot(aes(x=DonorScore, y=logef, color=IntronType)) +
  geom_point(alpha=0.3) +
  geom_smooth(method="lm") +
  geom_hline(yintercept = 0, linetype="dashed") +
  facet_wrap(~treatment) +
  coord_cartesian(xlim=c(2,12), ylim=c(-6,6)) +
  theme_classic() +
  ggtitle("5'ss score x logef effect interaction with intron type")

`geom_smooth()` using formula 'y ~ x'

Version	Author	Date
db9a4f0	Benjmain Fair	2022-08-01
4a410f5	Benjmain Fair	2022-03-08

Ok, there is a downward trend for “new splice site combination”, and “annotated”, and there isn’t so much a downward trend for alt 5’ss, though the effect sizes for those are so obviously positive. A summarised interpretation of this:

Treatment nearly always increases crpytic GA|GT 5’SS, which tend to already have a weak splice site.
Annotated 5’ss GA|GT tend to also increase in splicing, especially if they are otherwise weak 5’ splice sites, such that sometimes new splice sites combinations (eg exon skipping) can arise.

Now let’s more carefully explore the differences between the drug treatments.

How do the treatments affect splicing differently

So far, it seems CUOME still has a direct GA|GT splicing effect but just much weaker than SM2, and the other drugs are somewhere in between. I wonder to what extent the splicing changes are more or less identical as assessed by the order of the drugs ranked by effect on any given GA|GT splice site. Let’s randomly sample some introns and plot to explore… I’ll plot both the logef and the deltapsi since i’m not sure what metric makes more sense for this analysis…

sample_n_of <- function(data, size, ...) {
  dots <- quos(...)
  
  group_ids <- data %>% 
    group_by(!!! dots) %>% 
    group_indices()
  
  sampled_groups <- sample(unique(group_ids), size)
  
  data %>% 
    filter(group_ids %in% sampled_groups)
}


set.seed(0)
leafcutter.ds.annotated %>%
  filter(UpstreamOfDonor2BaseSeq == "GA") %>%
  group_by(intron) %>%
  mutate(IsSigInAny = any(p.adjust < 0.01)) %>%
  ungroup() %>%
  filter(IsSigInAny) %>%
  add_count(intron) %>%
  filter(n==4) %>%
  sample_n_of(10, intron) %>%
  gather(key="EffectSizeMetric", value="EffectSize", logef, deltapsi) %>%
  ggplot(aes(x=treatment, y=EffectSize, color=EffectSizeMetric)) +
  geom_point() +
  ylab("splicing logef") +
  xlab("treatments, randomly selected significant GA|GT introns") +
  facet_wrap(~intron + EffectSizeMetric, scales = "free_y", ncol=4) +
  theme_classic() +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1)) +
  ggtitle("Are there common rank-order patterns of treatment effects?")

Version	Author	Date
db9a4f0	Benjmain Fair	2022-08-01
4a410f5	Benjmain Fair	2022-03-08

Ok a couple of clear patterns emerge. For example, there are quite a few introns where effect of SM2>C2>A10>CUOME, or something roughly like that. And sometimes the patterns look a bit different whether you consider deltapsi or logef. Let’s make sure I understand exactly what the logef output by leafcutter is… I thought it would be \(log_2\frac{PSI_{treatment}}{PSI_{DMSO}}\) . To make sure, let’s plot the logef calculated from the expression above (from the leafcutter PSI and deltapsi output values), versus the leafcutter logef value:

leafcutter.ds.annotated %>%
  mutate(logef_calculated_from_psi = log2(psi_treatment/DMSO)) %>%
  ggplot(aes(x=logef, y= logef_calculated_from_psi)) +
  xlim(c(-6,6)) +
  ylim(c(-6,6)) +
  # geom_hex(bins=100) +
  geom_point(alpha=0.1) +
  coord_cartesian(xlim=c(-6, 6), ylim=c(-6, 6)) +
  theme_classic() +
  ggtitle("Verification that leafcutter deltapsi is related to logef") +
  labs(caption="Read markdown text above and below for description")

Version	Author	Date
db9a4f0	Benjmain Fair	2022-08-01
4a410f5	Benjmain Fair	2022-03-08

Ok, yes, I think that expression is correct, though there is likely rounding errors and other reasons they won’t exactly match… But in general I think the logef and deltapsi values from leafcutter are related in the way that matches my understanding.

In any case, I’ll continue to plot the effects both ways for this analysis of differences in treatment effects.

It will be useful to look at more than 10 introns simultaneously. I think maybe a useful way to look at the data will be to standardize the effects with intron, the plot many lines overlayed. Notably, here I only filtered for introns that are significant (p.adjust<0.01) in at least one condition, meaning some of these effect sizes are from non-significant changes in any particular treatment and the effect size estimates may be really noisy. I can consider filtering for effects that are significant in all treatments later, as a way to focus on the more robust measurements.

leafcutter.ds.annotated %>%
  filter(UpstreamOfDonor2BaseSeq == "GA") %>%
  group_by(intron) %>%
  mutate(IsSigInAny = any(p.adjust < 0.01)) %>%
  ungroup() %>%
  filter(IsSigInAny) %>%
  add_count(intron) %>%
  filter(n==4) %>%
  gather(key="EffectSizeMetric", value="EffectSize", logef, deltapsi) %>%
  group_by(EffectSizeMetric, intron) %>%
  mutate(StandardizedEffect = scale(EffectSize)) %>%
  ggplot(aes(x=treatment, y=StandardizedEffect, group=intron)) +
  geom_line(alpha=0.1) +
  ylab("standardized effect size") +
  xlab("treatments, all significant GA|GT introns") +
  facet_wrap(~EffectSizeMetric) +
  theme_classic() +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1)) +
  ggtitle("All scaled significant effects")

Version	Author	Date
db9a4f0	Benjmain Fair	2022-08-01
4a410f5	Benjmain Fair	2022-03-08

Woah. interesting plot but hard to look at. There are a lot of different patterns. and the arbitrary ordering of the treatments can draw your eyes to a particular patterns, even if it is not the most important takeaway… For example, my eyes are drawn to the density of lines that go up while passing from CUOME to SM2, whereas maybe I don’t get to directly see the effects between A10 and CUOME. So I certainly need a less arbitrary way of analyzing these patterns. What I eventually think could be a fruitful approach is to cluster introns based on these patterns, and then do motif analysis (eg MEME) to understand why some introns are in a cluster that is more affected by treatment X and least by Y, for example.

I should note that technically this practice of plotting effect sizes after filtering for significance introduces biases in effect size estimates due to winner’s curse effects… I am essentially double dipping the data to do both significance testing and effect size estimation. In practice I don’t know if these winner’s curse effects biases are the same for all treatments, which have different Pvalue distributions and different numbers of FDR significant introns. Perhap’s I should consider only filters by nominal Pvalues, so hopefully the winner’s curse bias isn’t so dependent on Pvalue distribution/FDR estimation.

So to (hopefully) cut down a bit on the density while select the more robust effect size estimates in a less biased manner, let’s replot but filter for introns that are that nominally significant in all treatments.

leafcutter.ds.annotated %>%
  filter(UpstreamOfDonor2BaseSeq == "GA") %>%
  group_by(intron) %>%
  mutate(IsNominallySigAll = all(p < 0.01)) %>%
  ungroup() %>%
  filter(IsNominallySigAll) %>%
  add_count(intron) %>%
  filter(n==4) %>%
  gather(key="EffectSizeMetric", value="EffectSize", logef, deltapsi) %>%
  group_by(EffectSizeMetric, intron) %>%
  mutate(StandardizedEffect = scale(EffectSize)) %>%
  ggplot(aes(x=treatment, y=StandardizedEffect, group=intron)) +
  geom_line(alpha=0.1) +
  ylab("standardized effect size") +
  xlab("treatments, all significant GA|GT introns") +
  facet_wrap(~EffectSizeMetric) +
  theme_classic() +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1)) +
  labs(caption="All introns nominally significant P<0.01")

Version	Author	Date
db9a4f0	Benjmain Fair	2022-08-01
4a410f5	Benjmain Fair	2022-03-08

Ok, I should still consider clustering samples, since it fixes the arbitrary ordering problem. Let’s try a hiearchal clustering first, and visualize the dendrogram. There may be obvious groups from this.

d.deltapsi <- leafcutter.ds.annotated %>%
  filter(UpstreamOfDonor2BaseSeq == "GA") %>%
  group_by(intron) %>%
  mutate(IsNominallySigAll = all(p < 0.01)) %>%
  ungroup() %>%
  filter(IsNominallySigAll) %>%
  add_count(intron) %>%
  filter(n==4) %>%
  gather(key="EffectSizeMetric", value="EffectSize", logef, deltapsi) %>%
  group_by(EffectSizeMetric, intron) %>%
  mutate(StandardizedEffect = scale(EffectSize)) %>%
  ungroup() %>%
  filter(EffectSizeMetric=="deltapsi") %>%
  select(StandardizedEffect, intron, treatment) %>%
  spread(key = "treatment", value="StandardizedEffect") %>%
  dist()

d.logef <- leafcutter.ds.annotated %>%
  filter(UpstreamOfDonor2BaseSeq == "GA") %>%
  group_by(intron) %>%
  mutate(IsNominallySigAll = all(p < 0.01)) %>%
  ungroup() %>%
  filter(IsNominallySigAll) %>%
  add_count(intron) %>%
  filter(n==4) %>%
  gather(key="EffectSizeMetric", value="EffectSize", logef, deltapsi) %>%
  group_by(EffectSizeMetric, intron) %>%
  mutate(StandardizedEffect = scale(EffectSize)) %>%
  ungroup() %>%
  filter(EffectSizeMetric=="logef") %>%
  select(StandardizedEffect, intron, treatment) %>%
  spread(key = "treatment", value="StandardizedEffect") %>%
  dist()


ward <- hclust(d.deltapsi, method="ward.D")
complete <- hclust(d.deltapsi)

plot(ward)

Version	Author	Date
db9a4f0	Benjmain Fair	2022-08-01
4a410f5	Benjmain Fair	2022-03-08

plot(complete)

Version	Author	Date
db9a4f0	Benjmain Fair	2022-08-01

I wasn’t sure what was the best way to do the clustering so I tried two methods, and will get some sense of the similarity by grouping the tree into four and looking at the concordance of those

NumGroups <- 4

groups.ward <- cutree(ward, k=NumGroups)
groups.complete <- cutree(complete, k=NumGroups)

data.frame(groups.ward, groups.complete) %>%
  ggplot(aes(x=groups.ward, fill=as.factor(groups.complete))) +
  geom_bar() +
  ggtitle("cluster groupings are sensitive to clustering method")

Version	Author	Date
db9a4f0	Benjmain Fair	2022-08-01
4a410f5	Benjmain Fair	2022-03-08

Ok, so the choice of clustering method does seem to make a bit of a difference, in the sense that if you cut the dendrogram into 4 groups, the resulting groups won’t be quite the same depending on these two clustering methods… Let’s plot the clustering of one of the ways as a heatmap showing the delta psi effect sizes

Let’s see if it makes a difference if I do the same clustering algorithm but compute distance matrix based on deltapsi vs logef:

groups.deltapsi <- hclust(d.deltapsi) %>% cutree(k=NumGroups)
groups.logef <- hclust(d.logef) %>% cutree(k=NumGroups)

data.frame(groups.deltapsi, groups.logef) %>%
  ggplot(aes(x=groups.ward, fill=as.factor(groups.complete))) +
  geom_bar() +
   ggtitle("cluster groupings are sensitive to using deltapsi vs logef")

Version	Author	Date
db9a4f0	Benjmain Fair	2022-08-01
4a410f5	Benjmain Fair	2022-03-08

Ok, so even whether I use deltapsi or logef makes the clusters not so robust. basically, how I cluster introns into these groups is very subjective. Let’s try one more thing before giving up on this clustering idea… Let’s try plotting the clustering of samples and introns based on deltapsi from each replicate, for these same set of nominally significant in all GA|GT introns. I think visually seeing the heatmap can be useful, no matter the clustering method.

GAGTIntronsToPlot <- leafcutter.ds.annotated %>%
  filter(UpstreamOfDonor2BaseSeq == "GA") %>%
  group_by(intron) %>%
  mutate(IsNominallySigAll = all(p < 0.01)) %>%
  ungroup() %>%
  filter(IsNominallySigAll) %>%
  add_count(intron) %>%
  filter(n==4) %>%
  pull(intron) %>% unique()

PSI.table %>%
  select(intron, PSI, Sample) %>%
  filter(intron %in% GAGTIntronsToPlot) %>%
  spread(key="Sample", value="PSI") %>%
  mutate(DMSO.Avg = (DMSO.1 + DMSO.2 + DMSO.3)/3) %>%
  mutate_at(vars(-intron), funs(. - DMSO.Avg)) %>%
  select(-DMSO.Avg) %>%
  column_to_rownames("intron") %>%
  as.matrix() %>%
  heatmap.2(trace="none", col=brewer.pal(11,"Spectral"))

Version	Author	Date
db9a4f0	Benjmain Fair	2022-08-01
4a410f5	Benjmain Fair	2022-03-08

PSI.table %>%
  select(intron, PSI, Sample) %>%
  filter(intron %in% GAGTIntronsToPlot) %>%
  spread(key="Sample", value="PSI") %>%
  mutate(DMSO.Avg = (DMSO.1 + DMSO.2 + DMSO.3)/3) %>%
  mutate_at(vars(-intron), funs(. - DMSO.Avg)) %>%
  select(-DMSO.Avg) %>%
  column_to_rownames("intron") %>%
  as.matrix() %>%
  heatmap.2(trace="none", col=brewer.pal(11,"Spectral"), hclustfun = function(x) hclust(x,method = 'ward.D'))

Version	Author	Date
db9a4f0	Benjmain Fair	2022-08-01

PSI.table %>%
  select(intron, PSI, Sample) %>%
  filter(intron %in% GAGTIntronsToPlot) %>%
  spread(key="Sample", value="PSI") %>%
  mutate(DMSO.Avg = (DMSO.1 + DMSO.2 + DMSO.3)/3) %>%
  mutate_at(vars(-intron), funs(. - DMSO.Avg)) %>%
  select(-DMSO.Avg) %>%
  column_to_rownames("intron") %>%
  as.matrix() %>%
  t() %>%
  scale(center=F) %>%
  t() %>%
  heatmap.2(trace="none", col=brewer.pal(11,"Spectral"))

Version	Author	Date
db9a4f0	Benjmain Fair	2022-08-01

Ok, for the first heatmap plot, each row is an intron, each column is a sample, and the color is where I computed the delta-PSI as \(\Delta PSI = \frac{PSI_{treat}}{mean(PSI_{DMSO})}\). For the second for the second version of the plot, it’s the same thing but with the “Ward” clustering method instead of the default. In the third, it’s the same as the first but I scaled each row before plotting and clustering (divide by standard deviation).

In this way, all the clustering of samples is relatively similar. The clustering of introns I think is not. And it could just be my eyes overinterpreting these clustering, but I feel like perhaps there is a small group of introns that makes C2 look more like SM2 than A10. It would be interesting to compare some of these clustering patterns to the molecular structures.

I will hold off on trying to cluster introns to do motif analysis until later, if ever, since I don’t currently see a robust way to do that.

For now maybe another way to understand similarites and differences in effects between treatments is to visualize a regression between splicing effect sizes between treatments…

First let’s just plot the scatter plot between logef between a pair of treatments for all GA|GT introns measured.

leafcutter.ds.annotated %>%
  filter(UpstreamOfDonor2BaseSeq == "GA") %>%
  select(intron, logef, treatment) %>%
  spread("treatment", "logef") %>%
ggplot(aes(x=A10, y=SM2)) +
  geom_point(alpha=0.3) +
  coord_cartesian(xlim=c(-6, 6), ylim=c(-6, 6)) +
  theme_classic() +
  ggtitle("Splicing b/n these treatments are well correlated") +
  labs(caption="GA|GT introns only")

Version	Author	Date
db9a4f0	Benjmain Fair	2022-08-01
4a410f5	Benjmain Fair	2022-03-08

Wow, that is a striking correlation. And a lot of effects that are very high in SM2 but low in A10 (or vice versa) are likely just due noisy effect size estimates from low junction read counts. In particular, all the points lie roughly along either axis I think are suspiciously likely to be technical artefects from low junction read counts. Let’s plot the same thing for all pairs of treatments, and calculate spearman correlation coefficient:

lower_limitrange_point <- function(data, mapping, ...) { 
  ggplot(data = data, mapping = mapping, ...) + 
    geom_point(..., alpha=0.3) + 
    scale_y_continuous(limits = c(-5, 5)) +
    scale_x_continuous(limits = c(-5, 5)) +
    geom_abline() +
    theme_bw()
}

lower_limitrange_hex <- function(data, mapping, ...) { 
  ggplot(data = data, mapping = mapping, ...) + 
    xlim(c(-5,5)) +
    ylim(c(-5,5)) +
    geom_abline() +
    geom_hex(..., bins=50) + 
    scale_fill_viridis_c(option="C") +
    theme_bw()
}

diag_limitrange <- function(data, mapping, ...) { 
  ggplot(data = data, mapping = mapping, ...) + 
    geom_density(...) + 
    coord_cartesian(xlim=c(-5,5)) +
    theme_bw()
}


leafcutter.ds.annotated %>%
  filter(UpstreamOfDonor2BaseSeq == "GA") %>%
  select(intron, logef, treatment) %>%
  spread("treatment", "logef") %>%
  select(-intron) %>%
  ggpairs(
    title="Pairwise splicing effect correlation. Only GA|GT introns",
    upper=list(continuous = wrap("cor", method = "spearman", hjust=0.7)),
    lower=list(continuous = lower_limitrange_hex),
    diag=list(continuous = diag_limitrange)
    )

Version	Author	Date
db9a4f0	Benjmain Fair	2022-08-01
4a410f5	Benjmain Fair	2022-03-08

Ok, the spearman correlation between about 0.4 to about 0.85 depending on the pairwise comparison… And some of the lack of correlation is certainly just technical noise… In general I would say these treatments are quite similar. Let’s make the same plot for the “indirect effects”, that is, for introns in clusters without a GA|GT intron:

leafcutter.ds.annotated %>%
  group_by(cluster, treatment) %>%
  filter(!any("GA" %in% UpstreamOfDonor2BaseSeq)) %>%
  ungroup() %>%
  select(intron, logef, treatment) %>%
  spread("treatment", "logef") %>%
  select(-intron) %>%
  ggpairs(
    title="Pairwise splicing effect correlation. All introns",
    upper=list(continuous = wrap("cor", method = "spearman", hjust=0.7)),
    lower=list(continuous = lower_limitrange_hex),
    diag=list(continuous = diag_limitrange)
    )

Version	Author	Date
db9a4f0	Benjmain Fair	2022-08-01
4a410f5	Benjmain Fair	2022-03-08

So there is some (more modest) correlation even when considering introns that don’t have a GA|GT intron in the cluster. Let’s revisit the idea that some of these non GA|GT intron splicing effects might be ‘direct’ effects on 5’ss:U1snRNA pairing at a different bulge. For example, I think another candidate sequence to explore besides gA|gt (where A represents a bulged nucleotide from u1 snRNA) would be ag|gtAaagt, since it also contains a bulged adenosine but would otherwise pair perfectly to u1.

Permuted null.

Let’s see if any of introns of these hand-selected candidate sequence were tested for leafcutter effects and if they have a general increase in splicing, as a signature of direct effects. First I will make a plot for the established effect of gA|gtaag introns with a bulged “A”, then I will make some analogous plots for other bulged snRNA:5’ss pairings described in the literature or hypothesized by me.

PlotEffectSizeDistribution <- function(data){
  ggplot(data, aes(x=logef, color=treatment)) +
  stat_ecdf() +
  geom_vline(xintercept = 0, linetype="dashed") +
  scale_color_manual(values=ColorKey) +
  coord_cartesian(xlim=c(-2, 2)) +
  ylab("CumulativeFraction") +
  xlab("log2(SplicingFoldChange)") +
  # facet_wrap(~treatment) +
  theme_light()  +
  theme(legend.position = "none")
}

leafcutter.ds.annotated %>%
  filter(str_detect(DonorSeq, "\\w\\wAGGTAAG\\w")) %>% 
  PlotEffectSizeDistribution +
  ggtitle("Effects for nnag|guaagn introns; (optimal u1:5'ss pairing)")

Version	Author	Date
db9a4f0	Benjmain Fair	2022-08-01
4a410f5	Benjmain Fair	2022-03-08

leafcutter.ds.annotated %>%
  filter(str_detect(DonorSeq, "\\w\\wGAGTAAG\\w")) %>% 
  PlotEffectSizeDistribution +
  ggtitle("Effects for nngA|guaagn introns; (bulged A)")

Version	Author	Date
db9a4f0	Benjmain Fair	2022-08-01
4a410f5	Benjmain Fair	2022-03-08

leafcutter.ds.annotated %>%
  filter(str_detect(DonorSeq, "\\w\\wGTGTAAG\\w")) %>% 
  PlotEffectSizeDistribution +
  ggtitle("Effects for nngT|guaagn introns; (bulged T)")

Version	Author	Date
db9a4f0	Benjmain Fair	2022-08-01
4a410f5	Benjmain Fair	2022-03-08

pre-mRNAs with sequence most consistent with bulged base pairing like in my hypothetical example (ag|gtAaagt):

leafcutter.ds.annotated %>%
  filter(str_detect(DonorSeq, "\\w\\w\\wGGTAAAGT")) %>%
  PlotEffectSizeDistribution +
  ggtitle("Effects for nnng|guAaagu introns; (bulged A)")

Version	Author	Date
db9a4f0	Benjmain Fair	2022-08-01
4a410f5	Benjmain Fair	2022-03-08

pre-mRNAs with sequence most consistent with shifted base pairing like Fig1C:

leafcutter.ds.annotated %>%
  filter(str_detect(DonorSeq, "\\w\\wCAGTTAAG\\w")) %>%
  PlotEffectSizeDistribution +
  ggtitle("Effects for nnca|guuaagn introns; (shifted register)")

Version	Author	Date
db9a4f0	Benjmain Fair	2022-08-01
4a410f5	Benjmain Fair	2022-03-08

pre-mRNAs with sequence most consistent with bulged base pairing like Fig1D:

leafcutter.ds.annotated %>%
  filter(str_detect(DonorSeq, "\\w\\w\\wGGTTAAG\\w")) %>%
  PlotEffectSizeDistribution +
  ggtitle("Effects for nnng|guUaagn introns; (bulge at T)")

Version	Author	Date
db9a4f0	Benjmain Fair	2022-08-01
4a410f5	Benjmain Fair	2022-03-08

Ok interesting. I think maybe there is also an effect for introns with this sequence G|GTTAAG in which the the second U would be bulged. Similar to gA|gt, SM2 has the stronger efect and CUOME the weakest.

Let’s check one more bulged pairing that is also described in the review:

pre-mRNAs with sequence most consistent with bulged base pairing like Fig1E:

leafcutter.ds.annotated %>%
  filter(str_detect(DonorSeq, "\\w\\w\\wGGTAGTA\\w")) %>%
  PlotEffectSizeDistribution +
  ggtitle("Effects for nnng|gua.guan introns; (bulge in snRNA at .)")

Version	Author	Date
db9a4f0	Benjmain Fair	2022-08-01
4a410f5	Benjmain Fair	2022-03-08

leafcutter.ds.annotated %>%
  filter(str_detect(DonorSeq, "^\\w\\w\\w\\wGTAA\\wGT")) %>%
  PlotEffectSizeDistribution +
  ggtitle("Effects for nnnn|guaaNgn introns; (bulged N)")

Version	Author	Date
db9a4f0	Benjmain Fair	2022-08-01

So from this crude sequence analysis of handpicked 5’ss sequences: gA|gt has an obvious and strong increase in splicing, and as has been described in the literature it is due to stabilization of the -1A in the snRNA:5’ss duplex. Here I also see a smaller but believable (imo) effect for g|gtTaag where the +3T would be bulged and possibily stabilized by the small molecules… But since T in this case would be bulged and un-paired from u1, maybe other nucleotides (eg A) have a stronger effect if they are at this position. So let’s take this one step furher to look the effect of the N nucleotide in introns that match g|gtNaag:

leafcutter.ds.annotated %>%
  filter(str_detect(DonorSeq, "\\w\\w\\wGGT\\wAAG\\w")) %>%
  mutate(N = substr(DonorSeq, 7,7)) %>%
  ggplot(aes(x=logef, color=treatment)) +
  stat_ecdf() +
  scale_color_manual(values=ColorKey) +
  geom_vline(xintercept = 0, linetype="dashed") +
  coord_cartesian(xlim=c(-2, 2)) +
  ylab("CumulativeFraction") +
  facet_grid(cols = vars(treatment), rows=vars(N)) +
  ggtitle("Effects for nnng|guNaagu introns; (bulged N)") +
  theme_light()

Version	Author	Date
db9a4f0	Benjmain Fair	2022-08-01

Ok, it seems I really only see that effect for g|gtTaag and not other nucleotides at that position. Let’s summarise all these results about testing specific sequences of introns for assymetrical effects, into a single heatmap.

PlotOut <- leafcutter.ds.annotated %>%
  # filter(IntronType == "Alt 5'ss") %>%
  select(intron, treatment, logef, DonorSeq) %>%
  mutate(
    `11_nnng|guUaag` = str_detect(DonorSeq, "^\\w\\w\\wGGTTAAG"),
    `2_nngA|gunnnn` = str_detect(DonorSeq, "^\\w\\wGAGT"),
    `3_nngA|guaagn` = str_detect(DonorSeq, "^\\w\\wGAGTAAG"),
    `4_nagA|guaagn` = str_detect(DonorSeq, "^\\wAGAGTAAG\\w"),
    `5_nGgA|guaagn` = str_detect(DonorSeq, "^\\wGGAGTAAG\\w"),
    `6_nCgA|guaagn` = str_detect(DonorSeq, "^\\wCGAGTAAG\\w"),
    `7_nUgA|guaagn` = str_detect(DonorSeq, "^\\wTGAGTAAG\\w"),
    `7.2_nagC|guaagn` = str_detect(DonorSeq, "^\\wGACGTAAG\\w"),
    `7.3_nagG|guaagn` = str_detect(DonorSeq, "^\\wGAGGTAAG\\w"),
    `8_nngU|guaagn` = str_detect(DonorSeq, "^\\w\\wGTGTAAG"),
    `9_nagU|guaagn` = str_detect(DonorSeq, "^\\wAGTGTAAG"),
    `1_nnag|guaagn` = str_detect(DonorSeq, "^\\w\\wAGGTAAG\\w"),
    `10_nnng|guVaag` = str_detect(DonorSeq, "^\\w\\w\\wGGT[ACG]AAG"),
    `9.5_nnnn|guaaHg` = str_detect(DonorSeq, "^\\w\\w\\w\\wGTAA[ACT]G")
  ) %>%
  filter_at(vars(-c("intron", "treatment", "logef", "DonorSeq")), any_vars(. == T)) %>%
  gather(key="Pattern", value="IsMatch", -c("intron", "treatment", "logef", "DonorSeq")) %>%
  filter(IsMatch) %>%
  group_by(Pattern, treatment) %>%
  summarise(median = median(logef),
            p = wilcox.test(logef, alternative="greater")$p.value,
            n = sum(IsMatch)) %>%
  ungroup() %>%
  separate(Pattern, into = c("PlotOrder", "Pattern"), sep = "_", convert=T) %>%
  arrange(PlotOrder) %>%
  ggplot(aes(x=reorder(Pattern, desc(PlotOrder)), y=treatment, fill=median)) +
  # geom_point(aes(size=-log10(p), color=median)) +
  geom_tile() +
  geom_text(aes(label=paste0("p=",format.pval(p, digits=2), "\nn=", n) ), size=3) +
  # scale_fill_viridis_c(option = "E", limits = c(-1.3, 1.3)) +
  # scale_fill_distiller(palette = "Spectral", limits = c(-1.3, 1.3)) +
  scale_fill_gradient2(name="Median log2(FoldChange)") +
  scale_x_discrete(expand = c(0, 0)) +
  scale_y_discrete(expand = c(0, 0)) +
  xlab("5'Splice Site Pattern,\npredicted bulges and mismatches capitalized") +
  coord_flip() +
  theme_classic() +
  theme(axis.text.y=element_text(size=16,  family="mono")) +
  theme(legend.position="bottom") +
  ggtitle("Summary heatmap of median effects by 5'ss pattern")

`summarise()` has grouped output by 'Pattern'. You can override using the
`.groups` argument.

PlotOut

Version	Author	Date
db9a4f0	Benjmain Fair	2022-08-01

ggsave("../code/scratch/BulgeTests.table.pdf", PlotOut, height=7, width=7)

TODO: also check for systematic bulge effects at 3’ss and bp. for example, the consensus bp that perfectly complements u2 snRNA would be CTAAC. What if we had CTAAAC… would those cryptic 3’ss get activated. Or at the 3’ss, would TAAG be preferentially activated

How does differential splicing relate to expression?

Another thing Jingxin suggested plotting is to reproduce something his postdoc noticed: that splicing changes are generally associated with a non-symetrical signal in expression change.

Let’s merge the differential splicing and dfferential expression results by host gene to investigate:

Before merging tables, I am curious how often introns overlap more than one gene, since these would be sort of ambiguous cases on how I should deal with them. How many genes can each intron be attribued to:

SpliceTypeAnnotations %>%
  head(10000) %>%
    separate_rows(gene_names, sep = ",") %>%
  count(Intron) %>%
  ggplot(aes(x=n)) +
  geom_histogram() +
  ggtitle("How many host-genes annotated for each intron")

`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Version	Author	Date
db9a4f0	Benjmain Fair	2022-08-01

Ok, the vast majority of introns that can be attributed to a gene are attributed to exactly one host gene.

Before merging the results, let’s still filter out introns that can’t be mapped to any gene or that map to more than one host gene… Then let’s plot a scatter plot to see any potential correlation between splicing effects and expression effects. Here I’ll filter for the max-within-intron delta-psi

DatToReproduceSplicingExpressionEffect <- leafcutter.ds.annotated %>%
  filter((!is.na(gene_names)) & str_detect(gene_ids, ",", negate=T)) %>%
  mutate(gene_ids=str_replace(gene_ids, "\\..+$", "")) %>%
  inner_join(
    (DE.results %>%
       mutate(gene_ids=str_replace(EnsemblID, "\\..+$", ""))), 
    by=c("treatment","gene_ids"))


# DE.results %>% pull(treatment) %>% unique()

DatToReproduceSplicingExpressionEffect %>%
  group_by(treatment, cluster) %>%
  filter(abs(deltapsi) == max(abs(deltapsi))) %>%
  ungroup() %>%
  filter(p.adjust < 0.01) %>%
  ggplot(aes(x=deltapsi, y=logFC)) +
  geom_point(alpha=0.05) +
  facet_wrap(~treatment) +
  theme_bw() +
  ggtitle("Scatter of host-gene expression change vs splicing change")

Version	Author	Date
db9a4f0	Benjmain Fair	2022-08-01
4a410f5	Benjmain Fair	2022-03-08

First of all I clearly see some assymetrical effects in the sense that there are more downregulated genes than up in general.

I think this could replicate what Jingxin’s postdoc, Jungxin Zhao found by plotting volcano plots. Reproducing exactly what he plotted might take a little more work since it requires thinking about exon inclusion/skipping which is a bit different from leafcutter intron-centric measurements… But let’s explore this correlation b/n expression and splicing a bit more… Firstly let’s do the same but filter just for the likely “direct” targets that have GA|GT 5’ splice sites

DatToReproduceSplicingExpressionEffect %>%
  mutate(IsGAGTIntron = ifelse(UpstreamOfDonor2BaseSeq == "GA", "GA|GT", "!GAGT")) %>%
  mutate(IsSplicingChangeSignificant = ifelse(p.adjust < 0.01, "Sig", "NS")) %>%
  mutate(IsSigIsGAGT = paste(IsGAGTIntron, IsSplicingChangeSignificant)) %>%
  ggplot(aes(x=deltapsi, y=logFC)) +
  geom_point(alpha=0.1) +
  geom_smooth(method="lm") +
  facet_grid(rows = vars(treatment), cols = vars(IsSigIsGAGT)) +
  theme_bw() +
  ggtitle("Increase in GA|GT intron-splicing generally with downregulation")

`geom_smooth()` using formula 'y ~ x'

Version	Author	Date
db9a4f0	Benjmain Fair	2022-08-01
4a410f5	Benjmain Fair	2022-03-08

Ok, perhaps there is some assymetry here, wherein the GA|GT 5’ splice sites, as plotted above are generally upregulated in splicing, while there host genes are more likely to be downregulated at expression level. Perhaps the model is that the GA splice sites generally induce cryptic splicing that mediates NMD.

Let’s investigate a bit more by plotting whether the introns are annotated vs unannotated (cryptic). Let’s also facet by whether the 5’ss is GA|GT rather than filter for that.

DatToReproduceSplicingExpressionEffect %>%
  filter(treatment=="SM2") %>%
  mutate(IsSplicingChangeSignificant = ifelse(p.adjust < 0.01, "Sig", "NS")) %>%
  ggplot(aes(x=deltapsi, y=logFC)) +
  geom_point(alpha=0.3, aes(color=IntronType)) +
  geom_hline(yintercept = 0, linetype="dashed") +
  geom_vline(xintercept = 0, linetype="dashed") +
  ylab("Host gene expression\nlog2(FoldChange)") +
  xlab("Splicing change\nDeltaPSI") +
  # facet_wrap(~IsSplicingChangeSignificant) +
  theme_bw() +
  ggtitle("Increase in cryptic splice sites associated with host-gene downregulation")

Version	Author	Date
db9a4f0	Benjmain Fair	2022-08-01
4a410f5	Benjmain Fair	2022-03-08

DatToReproduceSplicingExpressionEffect %>%
  mutate(IsGAGTIntron = ifelse(UpstreamOfDonor2BaseSeq == "GA", "GA|GT", "!GAGT")) %>%
  mutate(IsSplicingChangeSignificant = ifelse(p.adjust < 0.01, "Sig", "NS")) %>%
  mutate(IsSigIsGAGT = paste(IsGAGTIntron, IsSplicingChangeSignificant)) %>%
  ggplot(aes(x=deltapsi, y=logFC)) +
  geom_point(alpha=0.3, aes(color=IntronType)) +
  geom_hline(yintercept = 0, linetype="dashed") +
  geom_vline(xintercept = 0, linetype="dashed") +
  ylab("Host gene expression\nlog2(FoldChange)") +
  xlab("Splicing change\nDeltaPSI") +
  facet_grid(rows = vars(treatment), cols = vars(IsSigIsGAGT)) +
  theme_bw() +
  theme(legend.position="bottom") +
  ggtitle("Increase in cryptic splice sites associated with host-gene downregulation")

I think this plot is pretty consistent with that model about cryptic 5’ss that induce host transcript downregulation, possibly by NMD. Upregulation of cryptic 3’ss also tend to have a host transcript downregulation effect, but the direct effects (those at GA|GT 5’ss) tend to overwhelmingly be activation of cryptic 5’ss.

Let’s consider different ways to present similar information graphically.

let’s try, a volcano plot of DE expression, plotted by significant splice types, just for SM2:

DatToReproduceSplicingExpressionEffect %>%
  filter(treatment == "SM2") %>%
  mutate(
    DE = case_when(
      FDR < 0.01 & sign(logFC) == 1 ~ "Up",
      FDR < 0.01 & sign(logFC) == -1 ~ "Down",
      TRUE ~ "NS"
    )
  ) %>%
  ggplot(aes(x=deltapsi, y=-log10(p.adjust), color=DE )) +
  geom_point(alpha=0.2) +
  facet_wrap(~IntronType) +
  theme_classic() +
  ggtitle("Increase in cryptic splice sites associated with host-gene downregulation")

Version	Author	Date
db9a4f0	Benjmain Fair	2022-08-01
4a410f5	Benjmain Fair	2022-03-08

Lastly, let’s try visualizing as stacked bar graphs showing the proportion of up vs down vs not-significant host gene changes upon up/down/NS change in splicing types. I think this is an important way to show the data for the technical detail that cryptic splice sites are just harder to identify decreases for since they are already rare by nature. But if you categorize the cryptic splice sites that you DO see decreases for, and see different host gene changes in that group versus the cryptic splice sites that increase, then I will be more convinced that this isn’t technical detail is the whole explanation.

DatToReproduceSplicingExpressionEffect %>%
  filter(treatment == "SM2") %>%
  group_by(treatment, cluster) %>%
  filter(abs(deltapsi)==max(abs(deltapsi))) %>%
  filter(abs(deltapsi) > 0.05) %>%
  ungroup() %>%
  mutate(
    SplicingSign = case_when(
      p.adjust < 0.01 & sign(deltapsi) == 1 ~ "SplicingIncrease",
      p.adjust < 0.01 & sign(deltapsi) == -1 ~ "SplicingDecrease",
      p.adjust > 0.01 ~ "Splicing NS"
    ),
    ExpressionSign = case_when(
      FDR < 0.01 & sign(logFC) == 1 ~ "ExpressionIncrease",
      FDR < 0.01 & sign(logFC) == -1 ~ "ExpressionDecrease",
      FDR > 0.01 ~ "Expression NS"
    )
  ) %>%
  count(IntronType, treatment, SplicingSign, ExpressionSign) %>%
  mutate(ExpressionSign = fct_relevel(ExpressionSign, "ExpressionIncrease", "Expression NS", "ExpressionDecrease")) %>%
  ggplot(aes(
    x=SplicingSign,
    y=n,
    fill=ExpressionSign
  )) +
    geom_col(position="fill") +
    ylab("Fraction") +
    facet_wrap(~IntronType) +
    scale_x_discrete(name="SplicingChange", limits=c("SplicingDecrease","Splicing NS","SplicingIncrease")) +
    scale_fill_brewer(name="HostGeneExpressionChange", limits=c("ExpressionIncrease", "Expression NS", "ExpressionDecrease"), palette = "Spectral") +
    theme_classic() +
    theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1)) +
    ggtitle("Increase in cryptic splice sites associated with host-gene downregulation")

Version	Author	Date
4a410f5	Benjmain Fair	2022-03-08

Checking known splice targets from the literature

Next, we can also use these splice site annotations to match significant splicing changes to host genes. Let’s check on the SNCA and MAPT genes Jingxin is interseted in. I’ll also search for GAPDH introns as a positive control to make sure the code works as expected.

SpliceTypeAnnotations %>%
  filter(str_detect(gene_names, pattern=GenesOfInterest)) %>%
  sample_n(10)

# A tibble: 10 × 10
   Intron strand exons_skipped anchor gene_names gene_ids splice_site IntronType
   <chr>  <chr>          <dbl> <chr>  <chr>      <chr>    <chr>       <chr>     
 1 chr4_… +                  1 NDA    HTT        ENSG000… GT-AG       New Splic…
 2 chr4_… +                  0 DA     HTT        ENSG000… GT-AG       Annotated 
 3 chr5_… +                  1 A      SNCAIP     ENSG000… GT-AG       Alt 5'ss  
 4 chr4_… +                  0 A      HTT        ENSG000… GT-AG       Alt 5'ss  
 5 chr12… -                  1 NDA    FOXM1      ENSG000… GT-AG       New Splic…
 6 chr12… -                  0 D      FOXM1      ENSG000… GT-AG       Alt 3'ss  
 7 chr12… -                  2 A      FOXM1      ENSG000… GT-AG       Alt 5'ss  
 8 chr14… -                  0 A      GALC       ENSG000… GT-AG       Alt 5'ss  
 9 chr5_… +                  1 DA     SNCAIP     ENSG000… GT-AG       Annotated 
10 chr4_… +                  1 NDA    HTT        ENSG000… GT-AG       New Splic…
# … with 2 more variables: DonorSeq <chr>, DonorScore <dbl>

SpliceTypeAnnotations %>%
  filter(str_detect(gene_names, pattern=GenesOfInterest)) %>%
  inner_join(
    (
      leafcutter.ds %>%
        mutate(Intron=str_replace(intron, "(.+?):(.+?):(.+?):.+", "\\1_\\2_\\3"))
    ), by="Intron")

# A tibble: 19 × 22
   Intron strand exons_skipped anchor gene_names gene_ids splice_site IntronType
   <chr>  <chr>          <dbl> <chr>  <chr>      <chr>    <chr>       <chr>     
 1 chr12… -                  0 DA     FOXM1      ENSG000… GT-AG       Annotated 
 2 chr12… -                  0 DA     FOXM1      ENSG000… GT-AG       Annotated 
 3 chr12… -                  0 DA     FOXM1      ENSG000… GT-AG       Annotated 
 4 chr12… -                  0 DA     FOXM1      ENSG000… GT-AG       Annotated 
 5 chr14… -                  0 DA     GALC       ENSG000… GT-AG       Annotated 
 6 chr14… -                  0 DA     GALC       ENSG000… GT-AG       Annotated 
 7 chr14… -                  0 DA     GALC       ENSG000… GT-AG       Annotated 
 8 chr4_… +                  0 DA     HTT        ENSG000… GT-AG       Annotated 
 9 chr4_… +                  0 DA     HTT        ENSG000… GT-AG       Annotated 
10 chr4_… +                  0 DA     HTT        ENSG000… GT-AG       Annotated 
11 chr4_… +                  0 DA     HTT        ENSG000… GT-AG       Annotated 
12 chr4_… +                  0 DA     HTT        ENSG000… GT-AG       Annotated 
13 chr4_… +                  0 DA     HTT        ENSG000… GT-AG       Annotated 
14 chr4_… +                  0 DA     HTT        ENSG000… GT-AG       Annotated 
15 chr4_… +                  0 DA     HTT        ENSG000… GT-AG       Annotated 
16 chr5_… +                  0 DA     SMN2       ENSG000… GT-AG       Annotated 
17 chr5_… +                  0 DA     SMN2       ENSG000… GT-AG       Annotated 
18 chr5_… +                  0 DA     SMN2       ENSG000… GT-AG       Annotated 
19 chr5_… +                  0 DA     SMN2       ENSG000… GT-AG       Annotated 
# … with 14 more variables: DonorSeq <chr>, DonorScore <dbl>, intron <chr>,
#   logef <dbl>, psi_treatment <dbl>, DMSO <dbl>, deltapsi <dbl>,
#   treatment <chr>, cluster <chr>, status <chr>, loglr <dbl>, df <dbl>,
#   p <dbl>, p.adjust <dbl>

It appears none of the junctions in SNCA or MAPT genes passed filters for testing. They may have been two lowly expressed to test for leafcutter, and I was fairly permissive already for filtering junctions for differential splicing testing. I think these genes just aren’t well expressed enough to really glean splicing information at these specific genes. If I want to analyze them at all I will have to go back to more raw data before leafcutter testing which does some filtering.

Let’s read in the leafcutter count table and search for these genes:

PSI.table %>%
  mutate(Intron=str_replace(intron, "(.+?):(.+?):(.+?):.+", "\\1_\\2_\\3")) %>%
  inner_join(SpliceTypeAnnotations, by="Intron") %>%
  filter(str_detect(gene_names, pattern=c("SNCA", "MAPT")))

# A tibble: 0 × 16
# … with 16 variables: intron <chr>, Sample <chr>, Count <int>, Cluster <chr>,
#   ClusterSum <int>, PSI <dbl>, Intron <chr>, strand <chr>,
#   exons_skipped <dbl>, anchor <chr>, gene_names <chr>, gene_ids <chr>,
#   splice_site <chr>, IntronType <chr>, DonorSeq <chr>, DonorScore <dbl>

These genes also apparently did not pass default leafcutter clustering parameters. Let’s read in a redo of the splice junction count table with more permissive leafcutter parameters for filtering lowly expressed junctions.

counts.permissive <- read.table("../code/SplicingAnalysis/leafcutter/clustering_permissive/autosomes/leafcutter_perind_numers.counts.gz", sep = ' ',  header=T) %>%
  rownames_to_column("intron") %>%
  gather(key="Sample", value="Count", -intron)

PSI.table.permissive <- counts.permissive %>%
  mutate(Cluster = str_replace(intron, ".+:(.+)$", "\\1")) %>%
  group_by(Sample, Cluster) %>%
  mutate(ClusterSum = sum(Count)) %>%
  mutate(PSI = Count/ClusterSum * 100) %>%
  ungroup()

PSI.table.permissive %>%
  mutate(Intron=str_replace(intron, "(.+?):(.+?):(.+?):.+", "\\1_\\2_\\3")) %>%
  inner_join(SpliceTypeAnnotations, by="Intron") %>%
  filter(str_detect(gene_names, pattern=c("SNCA", "MAPT"))) %>% head()

# A tibble: 6 × 16
  intron       Sample Count Cluster ClusterSum   PSI Intron strand exons_skipped
  <chr>        <chr>  <int> <chr>        <int> <dbl> <chr>  <chr>          <dbl>
1 chr4:898356… A10.1      3 clu_13…          3 100   chr4_… -                  2
2 chr5:122311… A10.1      0 clu_14…          8   0   chr5_… +                  1
3 chr5:122312… A10.1      5 clu_14…          8  62.5 chr5_… +                  1
4 chr5:122389… A10.1      0 clu_14…          8   0   chr5_… +                  0
5 chr5:122391… A10.1      2 clu_14…          7  28.6 chr5_… +                  1
6 chr5:122401… A10.1      0 clu_14…          7   0   chr5_… +                  0
# … with 7 more variables: anchor <chr>, gene_names <chr>, gene_ids <chr>,
#   splice_site <chr>, IntronType <chr>, DonorSeq <chr>, DonorScore <dbl>

Ok. So now we have some junctions for SNCA and MAPT… Let’s plot admixture plots for all the GA|GT-containing clusters overlapping all the genes of interest:

PSI.table.permissive.genesofinterest <- PSI.table.permissive %>%
  mutate(Intron=str_replace(intron, "(.+?):(.+?):(.+?):.+", "\\1_\\2_\\3")) %>%
  inner_join(SpliceTypeAnnotations, by="Intron") %>%
  mutate(UpstreamOfDonor2BaseSeq = substr(DonorSeq, 3, 4)) %>%
  mutate(IsGAGT = UpstreamOfDonor2BaseSeq== "GA") %>%
  filter(gene_names %in% c("SNCA", "MAPT")) %>%
  group_by(Cluster) %>%
  mutate(GA_in_cluster = ifelse(any("GA" %in% UpstreamOfDonor2BaseSeq), "GA|GT in cluster", "GA|GT not in cluster")) %>%
  ungroup()


colourCount = length(unique(PSI.table.permissive.genesofinterest$intron))
getPalette = colorRampPalette(brewer.pal(9, "Set1"))

PSI.table.permissive.genesofinterest %>%
  mutate(Cluster_label = paste(Cluster, gene_names)) %>%
  sample_frac() %>%
  ggplot(aes(x=Sample, y=Count, fill=Intron, color=IsGAGT)) +
  geom_col() +
  facet_wrap(~Cluster_label) +
  scale_fill_manual(values = getPalette(colourCount)) +
  theme_classic() +
  # theme(legend.position="none") +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1))

Version	Author	Date
4a410f5	Benjmain Fair	2022-03-08

Ok, none of the introns in SNCA or MAPT that we detected are GAGT introns (which I assume the ones that were previously documented in the literature are) so we just can’t measure those splicing of those introns in this dataset, despite that we have differential expression results that are consistent with the literature.

Let’s try looking at GA|GT splicing in the other genes of interest noted in the literature too (eg HTT, etc)…

I’m going to turn this block off for now… The plotting looks like a mess because there is too many introns being plotted, and I’ll have to manually go thru the literature to find the exact splice sites of interset to focus on for plotting.

PSI.table.permissive.genesofinterest <- PSI.table.permissive %>%
  mutate(Intron=str_replace(intron, "(.+?):(.+?):(.+?):.+", "\\1_\\2_\\3")) %>%
  inner_join(SpliceTypeAnnotations, by="Intron") %>%
  mutate(UpstreamOfDonor2BaseSeq = substr(DonorSeq, 3, 4)) %>%
  mutate(IsGAGT = UpstreamOfDonor2BaseSeq== "GA") %>%
  filter(gene_names %in% GenesOfInterest) %>%
  group_by(Cluster) %>%
  mutate(GA_in_cluster = ifelse(any("GA" %in% UpstreamOfDonor2BaseSeq), "GA|GT in cluster", "GA|GT not in cluster")) %>%
  ungroup()

colourCount = length(unique(PSI.table.permissive.genesofinterest$intron))
getPalette = colorRampPalette(brewer.pal(9, "Set1"))

PSI.table.permissive.genesofinterest %>%
  filter(GA_in_cluster == "GA|GT in cluster") %>%
  mutate(Cluster_label = paste(Cluster, gene_names)) %>%
  sample_frac() %>%
  ggplot(aes(x=Sample, y=PSI, fill=Intron, color=IsGAGT)) +
  geom_col() +
  facet_wrap(~Cluster_label) +
  scale_fill_manual(values = getPalette(colourCount)) +
  theme_classic() +
  # theme(legend.position="none") +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1))

Using ClinVar to prioritize potential therapeutic targets

So finally, I think it will be interesting to look at Pangolin expected splicing changes in response to ClinVar pathogenic mutations, and see if any of these drugs work in counteracting direction across any genes. Much like the Nature communications paper (I forget the authors, I’ll link it later here.). I should note this straightforward thinking of treatment to combat pathogenic splicing changes isn’t the only way these could be used therapeutically. For example, in SMN2, there is a naturally occuring poison exon that can be skipped by treatment, which leads to SMN2 upregulation and partial rescue of pathogenic loss-of-function mutations in SMN1. So in addition to this direct approach of looking at pathogenic splicing mutations to oppose, I should also consider pipeline that uses information about poison exons, and the mode of inheritance of diseases.

In any case, as for the straightforward approach about reversing pathogenic splicing mutations, let’s first read in the ClinVar data with Pangolin splicing effect predictions, which I eventually hope to use to help annotate more potentially clinically relevant splicing events

Explore ClinVar molecular and clinical significant annotations

# table of gwas considered
ClinVar <- read_tsv("../code/scratch/test.txt.gz", col_names = c("Chr", "Pos", "Sig", "Condition", "MolecularSig", "Pangolin", "OMIM")) %>%
  separate(MolecularSig, into=c("MolSigCode", "MolSigDescription"), sep="\\|") %>%
  separate(Pangolin, into=c("Gene", "GreatestUp", "GreatestDown"), sep="\\|") %>%
  separate(GreatestUp, into=c("PangolinRelPosUp", "PangolinChangeUp"), sep=":", convert=T) %>%
  separate(GreatestDown, into=c("PangolinRelPosDown", "PangolinChangeDown"), sep=":", convert=T) %>%
  mutate(PangolinChangeDown = as.numeric(PangolinChangeDown)) %>%
  filter(!is.na(PangolinChangeDown)) %>%
  mutate(PangolinAbsPosUp = Pos + PangolinRelPosUp) %>%
  mutate(PangolinAbsPosDown = Pos + PangolinRelPosDown)

Parsed with column specification:
cols(
  Chr = col_double(),
  Pos = col_double(),
  Sig = col_character(),
  Condition = col_character(),
  MolecularSig = col_character(),
  Pangolin = col_character(),
  OMIM = col_character()
)

ClinVar %>% pull(MolSigDescription) %>% unique()

 [1] "intron_variant"                      "synonymous_variant"                 
 [3] "missense_variant"                    "initiatior_codon_variant"           
 [5] "frameshift_variant"                  "nonsense"                           
 [7] "inframe_deletion"                    "inframe_insertion"                  
 [9] "3_prime_UTR_variant"                 "non-coding_transcript_variant"      
[11] "5_prime_UTR_variant"                 "splice_donor_variant"               
[13] "splice_acceptor_variant"             "inframe_indel"                      
[15] "stop_lost"                           "genic_upstream_transcript_variant"  
[17] "no_sequence_alteration"              "genic_downstream_transcript_variant"

MolSigsToLump <- c("inframe_insertion", "non-coding_transcript_variant", "inframe_indel", "genic_downstream_transcript_variant", "5_prime_UTR_variant", "inframe_deletion", "3_prime_UTR_variant", "genic_upstream_transcript_variant" , "no_sequence_alteration", "intron_variant")
MolSigsSplicing <- c("splice_donor_variant", "splice_acceptor_variant")

ClinVar <- ClinVar %>%
  mutate(MolSigDescription_simplified = case_when(
    MolSigDescription %in% MolSigsToLump ~ "Other small effects",
    MolSigDescription %in% MolSigsSplicing ~ "SpliceSiteVariant",
    T ~ MolSigDescription
  ))

ClinVar %>%
  count(Sig, MolSigDescription_simplified) %>%
  ggplot(aes(x=Sig, y=n, fill=MolSigDescription_simplified)) +
  geom_col(pos="stack") +
  ylab("Num ClinVar mutations") +
  theme_classic() +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1)) +
  ggtitle("Pathogenic/LikelyPathogenic variants enriched for SpiceSiteVariants")

Version	Author	Date
db9a4f0	Benjmain Fair	2022-08-01

ClinVar %>%
  count(Sig, MolSigDescription_simplified) %>%
  ggplot(aes(x=Sig, y=n, fill=MolSigDescription_simplified)) +
  geom_col(pos="fill") +
  ylab("Fraction ClinVar mutations") +
  theme_classic() +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1)) +
    ggtitle("Pathogenic/LikelyPathogenic variants enriched for SpiceSiteVariants")

Ok that was a peek at the ClinVar database. Eventually I will fold in the Pangolin splicing change predictions on pathogenic or likely pathogenic variants in splice sites, to see if any of the treatments switch splicing in counteracticting directions.

Explore Pangolin splicing effect predictions

As a santiy check that Pangolin results make sense, let’s subset the clinvar mutations annotated as splice site mutations, and check the max predicted Pangolin splicing change effect size.

ClinVar %>%
  mutate(SplicingVariant = MolSigDescription %in% MolSigsSplicing) %>%
  select(SplicingVariant, PangolinChangeDown, PangolinChangeUp) %>%
  gather(key="PangolinDirection", "EffectSize", -SplicingVariant) %>%
  mutate(SplicingVariant=if_else(SplicingVariant, "SpliceSiteVariant", "NotSpliceSiteVariant")) %>%
  ggplot(aes(x=PangolinDirection, y=EffectSize)) +
  geom_violin() +
  ylab("Pangolin predicted splicing effect size") +
  facet_wrap(~SplicingVariant) +
  theme_bw() +
  ggtitle("ClinVar SpliceSiteVariant annotations have large predicted splicing effects")

Version	Author	Date
db9a4f0	Benjmain Fair	2022-08-01
4a410f5	Benjmain Fair	2022-03-08

Ok, that makes sense. ClinVar splice site variants have much larger Pangolin predicted effects on splicing.

First, I want to be sure I understand what the ClinVar “splice_donor_variant” and “splice_acceptor_variant” codes really mean (is it only variants in the essential GT-AG, or does it extend farther into the intronic motifs?)… I tried looking thru documentation online for this, but couldn’t find a detailed description. So let’s plot the distribution of distances between the max Pangolin effected splice sites and the variant:

ClinVar %>%
  filter(MolSigDescription %in% MolSigsSplicing) %>%
  select(Pos, PangolinRelPosUp, PangolinRelPosDown, MolSigDescription) %>%
  gather(key="PangolinDirection", "RelativeDist", PangolinRelPosUp, PangolinRelPosDown) %>%
  group_by(Pos, PangolinDirection, MolSigDescription) %>%
  filter(abs(RelativeDist) == min(abs(RelativeDist))) %>%
  ungroup() %>%
  distinct(.keep_all = T) %>%
  ggplot(aes(x=RelativeDist)) +
  geom_histogram(breaks=seq(-10,10)) +
  coord_cartesian(xlim=c(-10,10)) +
  facet_wrap(~PangolinDirection+MolSigDescription) +
  theme_classic() +
  ggtitle("I think SpliceSiteVariants refers only to GT-AG")

Version	Author	Date
db9a4f0	Benjmain Fair	2022-08-01

It’s hard to exactly interpret this, especially since I am considering relative position without regard to strand, so there are sort of mirrored effects going on… But since for both “splice_donor_variant” and “splice_acceptor_variant” observations, there is such a strong signal at -2 to -1 I think it is likely that both of those codes refer to just the essential GT and AG 5’ and 3’ splice sites.

So we may also want to consider some of the ClinVar variants with strong Pangolin splicing effects even if they aren’t SpliteSiteVariants. For example, I think that variants coded as “intron_variant” may include important variants at the a splice site motif (besides the essential GT-AG, which would be coded as “splice_donor_variant” and “splice_acceptor_variant”). So first, let’s see how many of these other categories there are, within the pathogenic or likely pathogenic categories that have Pangolin predicted splicing effects…

ClinVar$Sig %>% table()

.
                                     Affects 
                                          96 
                                      Benign 
                                      137204 
                        Benign/Likely_benign 
                                       19475 
Conflicting_interpretations_of_pathogenicity 
                                       41246 
                               Likely_benign 
                                      211331 
                           Likely_pathogenic 
                                       32730 
                                  Pathogenic 
                                       69290 
                Pathogenic/Likely_pathogenic 
                                        8277 
                      Uncertain_significance 
                                      317711 
                                 association 
                                         188 
                         confers_sensitivity 
                                           2 
                               drug_response 
                                        1546 
                                not_provided 
                                        8937 
                                       other 
                                         835 
                                  protective 
                                          20 
                                 risk_factor 
                                         278

PlotTypesOfPathogenicVariantsAndPangolinEffects <- ClinVar %>%
  filter(Sig %in% c("Pathogenic", "Pathogenic/Likely_pathogenic", "Likely_pathogenic")) %>%
  group_by(Chr, Pos) %>%
  mutate(LargestPangolinChange = max(abs(PangolinChangeDown), abs(PangolinChangeUp))) %>%
  ungroup() %>%
  ggplot(aes(x=LargestPangolinChange, fill=MolSigDescription_simplified)) +
  geom_histogram() +
  ylab("Number variants") +
  theme_bw() +
  ggtitle("Some Non 'SpliceSiteVariants' have appreciable predicted splicing effects")

PlotTypesOfPathogenicVariantsAndPangolinEffects

`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Version	Author	Date
db9a4f0	Benjmain Fair	2022-08-01

## Zooming in...
PlotTypesOfPathogenicVariantsAndPangolinEffects +
  coord_cartesian(ylim=c(0, 5000))

`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

PlotTypesOfPathogenicVariantsAndPangolinEffects +
  coord_cartesian(ylim=c(0, 5000))

`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

It seems that pangolin does identify some “Pathogenic” or “Likely pathogenic” variants with potentially meaningful predicted splicing effects (say >0.25) that are not SpliceSiteVariants… But splice site variants seems to capture most of these large effects.

Here I’ll plot the types of molecular changes for likely pathogenic SNPs with Pangolin changes > 0.1, as the ClinVar variants to consider merging with leafcutter results…

ClinVar %>%
  filter(Sig %in% c("Pathogenic", "Pathogenic/Likely_pathogenic", "Likely_pathogenic")) %>%
  group_by(Chr, Pos) %>%
  mutate(LargestPangolinChange = max(abs(PangolinChangeDown), abs(PangolinChangeUp))) %>%
  ungroup() %>%
  filter(LargestPangolinChange > 0.1) %>%
  ggplot(aes(x=MolSigDescription)) +
  geom_bar() +
  ylab("Number variants") +
  theme_bw() +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1)) +
  ggtitle("Number variants with PredictedSplicingChange > 0.1")

Version	Author	Date
db9a4f0	Benjmain Fair	2022-08-01

Hmmm, some of these variants that Pangolin likely picks up on are definitely not directly splicing related and not good candidates for reversal by drug treatment. For example, frameshift mutations. So I’ll keep all the variants that have Pangolin changes > 1 AND are not in some of those categories:

MolecularCategoriesToInclude <- c("splice_acceptor_variant", "splice_donor_variant", "synonymous_variant", "intron_variant", "missense_variant")
SigCategoriesToInclude <- c("Pathogenic", "Pathogenic/Likely_pathogenic", "Likely_pathogenic")


ClinVar %>%
  filter(Sig %in% SigCategoriesToInclude) %>%
  group_by(Chr, Pos) %>%
  mutate(LargestPangolinChange = max(abs(PangolinChangeDown), abs(PangolinChangeUp))) %>%
  ungroup() %>%
  filter(LargestPangolinChange > 0.1) %>%
  filter(MolSigDescription %in% MolecularCategoriesToInclude) %>%
  ggplot(aes(x=LargestPangolinChange, fill=MolSigDescription)) +
  geom_histogram() +
  ylab("Number variants") +
  theme_bw() +
  ggtitle("Variants including to merge with diffSplicing results")

`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Version	Author	Date
db9a4f0	Benjmain Fair	2022-08-01

Prioritize splicing-pathogenic introns that might be counteracted by treatment

Ok, now that I’ve decided what ClinVar variants to consider merging with the differential splicing (and possibly expression) results… I will merge the differential splicing results with ClinVar. I’ll plot the Pangolin effect on splicing for a pathogenic condition associated variant as a black dot, and add the treatment’s measured in vitro effect on the same splice site for splice site changes that are significant by leafcutter. I expect there to be so many results that I’ll just randomly select a smaller number for visual presentation.

Some notes and filters for exactly how I do this:

ClinVar variants with Pangolin predicted splicing change <0.1 \(\Delta PSI\) (PSI refers to a specific splice site) were excluded.
Only ClinVar variants annotated in the following molecular categories were included: splice_acceptor_variant, splice_donor_variant, synonymous_variant, intron_variant, missense_variant
Only ClinVar variants annotated in the following clinical significance categories were included: Pathogenic, Pathogenic/Likely_pathogenic, Likely_pathogenic
Pangolin splicing effects of a mutation are reported as the two splice sites with the greatest up and greatest down change. Here I considered only the splice site with the greater absolute value Pangolin effect, then to estimate the effect of the drug treatment on the same splice site, I summed up the deltapsi for all introns that contain that splice site
After estimateing the treatment effect on the Pangolin most effected splice site, to prioritize the most likely direct effects, only splice sites that are in a cluster with GA|GT were considered, and with nominal p<0.01 (not using FDR/p.adjust because it is likely overly conservative, since genomewide FDR estimation uses higher proportion of more negative tests from non GA|GT introns)
In the data frame table that I construct, for each (splice site):(treatment):(condition) trio that passes the above filters, in the same row I report DE results for the host gene and leafcutter results for the GA|GT intron with the largest deltapsi (which doesn’t necessarily have to contain the same exact splice site as the Pangolin effect; it need only be in the same cluster).

ClinVar %>% pull(Sig) %>% unique()

 [1] "Uncertain_significance"                      
 [2] "Likely_benign"                               
 [3] "Benign"                                      
 [4] "Likely_pathogenic"                           
 [5] "Pathogenic"                                  
 [6] "Affects"                                     
 [7] "Conflicting_interpretations_of_pathogenicity"
 [8] "Benign/Likely_benign"                        
 [9] "not_provided"                                
[10] "Pathogenic/Likely_pathogenic"                
[11] "risk_factor"                                 
[12] "association"                                 
[13] "other"                                       
[14] "drug_response"                               
[15] "protective"                                  
[16] "confers_sensitivity"

ClinVar %>%
  count(Chr, Pos, Gene, Condition) %>%
  arrange(desc(n))

# A tibble: 820,677 × 5
     Chr       Pos Gene               Condition                                n
   <dbl>     <dbl> <chr>              <chr>                                <int>
 1    15  43386344 ENSG00000137822.13 ['not_provided']                        19
 2     3  37025629 ENSG00000076242.16 ['Hereditary_nonpolyposis_colorecta…    17
 3     3  25638314 ENSG00000077097.16 ['not_provided']                        13
 4    11   2884860 ENSG00000129757.15 ['Beckwith-Wiedemann_syndrome']         12
 5    14  95099771 ENSG00000100697.15 ['DICER1-related_pleuropulmonary_bl…    11
 6    18  51084012 ENSG00000141646.14 ['Myhre_syndrome|Hereditary_hemorrh…    10
 7     2 165294040 ENSG00000136531.19 ['Seizures', '_benign_familial_infa…     9
 8     2 206149933 ENSG00000023228.14 ['not_provided']                         9
 9    11  94419465 ENSG00000020922.13 ['Ataxia-telangiectasia-like_disord…     9
10    17  18130817 ENSG00000091536.19 ['not_provided']                         9
# … with 820,667 more rows

# Merge ClinVar table with leafcutter and DE results in a way that makes sense
ClinVarPangolinResults.merged.with.leafcutter <- ClinVar %>%
  select(Sig, MolSigDescription, Chr, Pos, Condition, Gene, PangolinChangeUp, PangolinAbsPosUp, PangolinChangeDown, PangolinAbsPosDown) %>%
  filter(Sig %in% SigCategoriesToInclude) %>%
  filter(MolSigDescription %in% MolecularCategoriesToInclude) %>%
  # group_by(Chr, Pos) %>%
  mutate(LargestPangolinChange = pmax(abs(PangolinChangeDown), abs(PangolinChangeUp))) %>%
  # ungroup() %>%
  filter(LargestPangolinChange > 0.1) %>%
  arrange(desc(LargestPangolinChange)) %>%
  distinct(Chr, Pos, Condition, .keep_all = T) %>%
  unite(PangolinUp, PangolinChangeUp, PangolinAbsPosUp) %>%
  unite(PangolinDown, PangolinChangeDown, PangolinAbsPosDown) %>%
  gather("PangolinDirection", "EffectSize_SpliceSitePos", PangolinUp, PangolinDown) %>%
  separate(EffectSize_SpliceSitePos, into = c("PangolinEffectSize", "SpliceSitePos"), sep = "_", convert=T) %>%
  unite(SpliceSitePos, Chr, SpliceSitePos) %>%
  group_by(Pos) %>%
  filter(abs(PangolinEffectSize) == max(abs(PangolinEffectSize))) %>%
  ungroup() %>%
  # distinct(SpliceSitePos, .keep_all=T) %>%
  inner_join(
    x=(
      leafcutter.ds.annotated %>%
      group_by(cluster) %>%
      mutate(GA_in_cluster = any(UpstreamOfDonor2BaseSeq == "GA")) %>%
      ungroup() %>%
      filter(GA_in_cluster) %>%
      separate(intron, into=c("chr", "start", "stop", "cluster"), sep = ":", remove=F ) %>%
      mutate(chr = str_remove(chr, "chr") ) %>%
      gather(key="start_or_stop", "SpliceSitePos", start, stop) %>%
      mutate(SpliceSitePos = paste(chr, SpliceSitePos, sep = "_"))
    ),
    y=.,
    by="SpliceSitePos"
  ) %>%
  mutate(DonorOrAcceptor = case_when(
    strand == "+" & start_or_stop == "start" ~ "5'ss",
    strand == "-" & start_or_stop == "stop" ~ "5'ss",
    strand == "+" & start_or_stop == "stop" ~ "3'ss",
    strand == "-" & start_or_stop == "start" ~ "3'ss"
  )) %>%
  mutate(Condition = str_remove_all(Condition, "[\\[\\]']")) %>%
  group_by(SpliceSitePos, treatment) %>%
  mutate(TreatmentEffect = sum(deltapsi)) %>%
  ungroup() %>%
  # arrange() #First by GA, then by deltapsi
  distinct(SpliceSitePos, treatment, .keep_all = T) %>%
  select(cluster,treatment, GA_in_cluster:TreatmentEffect) %>%
  inner_join( 
    (DatToReproduceSplicingExpressionEffect %>%
       arrange(desc(UpstreamOfDonor2BaseSeq=="GA"), desc(deltapsi)) %>%
       distinct(treatment, cluster, .keep_all = T) %>%
       separate(cluster, into=c("chrom", "cluster"), sep = ":")
    ),
    by = c("cluster", "treatment")
  ) %>%
  filter(p < 0.01) %>%
  filter(Condition != "['not_provided']") %>%
  mutate(EffectSum = PangolinEffectSize + TreatmentEffect) %>%
  mutate(ID = str_glue("{DonorOrAcceptor} in {GeneSymbol}: chr{SpliceSitePos}\n{Condition}"))

hist(ClinVarPangolinResults.merged.with.leafcutter$LargestPangolinChange, breaks=30, xlim=c(0,1))

head(ClinVarPangolinResults.merged.with.leafcutter)

# A tibble: 6 × 46
  cluster     treatment GA_in_cluster start_or_stop SpliceSitePos Sig           
  <chr>       <chr>     <lgl>         <chr>         <chr>         <chr>         
1 clu_1286_-  A10       TRUE          start         1_224418416   Likely_pathog…
2 clu_14283_- A10       TRUE          start         10_124408965  Likely_pathog…
3 clu_15979_- A10       TRUE          start         12_21499615   Likely_pathog…
4 clu_17868_+ A10       TRUE          start         13_102649486  Likely_pathog…
5 clu_18910_+ A10       TRUE          start         15_40413087   Pathogenic    
6 clu_19100_+ A10       TRUE          start         15_67066360   Likely_pathog…
# … with 40 more variables: MolSigDescription <chr>, Pos <dbl>,
#   Condition <chr>, Gene <chr>, LargestPangolinChange <dbl>,
#   PangolinDirection <chr>, PangolinEffectSize <dbl>, DonorOrAcceptor <chr>,
#   TreatmentEffect <dbl>, intron <chr>, logef <dbl>, psi_treatment <dbl>,
#   DMSO <dbl>, deltapsi <dbl>, chrom <chr>, status <chr>, loglr <dbl>,
#   df <dbl>, p <dbl>, p.adjust <dbl>, Intron <chr>, strand <chr>,
#   exons_skipped <dbl>, anchor <chr>, gene_names <chr>, gene_ids <chr>, …

dim(ClinVarPangolinResults.merged.with.leafcutter)

[1] 226  46

ToPlot <- ClinVarPangolinResults.merged.with.leafcutter %>%
  sample_n(50)
ggplot(ToPlot, aes(x=ID, xend=ID, color=treatment)) +
    geom_segment( aes(y=PangolinEffectSize, yend=EffectSum)) +
    geom_point(aes(y = PangolinEffectSize), size=3, color="black") +
    geom_point(aes(y = EffectSum)) +
    geom_hline(aes(yintercept=0), linetype="dashed") +
    scale_x_discrete(labels=setNames(substr(ToPlot$Condition, 1, 20), ToPlot$ID)) +
    facet_grid(rows=vars(treatment), scales = "free", space="free") +
    ylim(c(-1, 0.25)) +
    xlab("ClinVar Condition") +
    ylab("Pangolin + treatment\nsplicing effect\nSplice site deltaPSI") +
    coord_flip() +
    theme_light() +
    theme(legend.position="none") +
  ggtitle("Random sample of ClinVar data and sig leafcutter results merged")

Ok, so that was a random sample. From ~250 rows overall where a significant leafcutter change could be linked to a ClinVar splice site predicted to be changed by Pangolin. There’s a few things that I notice that could be improved for better filtering and prestentation of the most relevant/interesting results: Most of these clinVar splice site mutations have predicted effects that are almost complete knock out of splice site, presumably mutating the near essential GT-AG sequences. I don’t forsee any of these drugs rescuing splice sites that are mutated at GT-AG… More likely is that more modest splice site mutations that have less Pangolin predicted effects can be rescused. Therefore, I will filter out the results where Pangolin predicts a >95% decrease in splice site usage, to focus what I believe to be the more ‘fixable’ splicing defects. I may also consider filtering just for the cases where the drug effect is in opposite direction as Pangolin prediction. Secondly, I notice that there are multiple lines for the same condition. Nextly, there are still quate a few uninformative condition annotations (‘not_provided’, ‘See cases’) can can be filtered out.

First let’s filter those out.

#What are the most represented conditions. Some of these are porbably the unformative ones I should filter out
ClinVarPangolinResults.merged.with.leafcutter %>%
  count(Condition) %>%
  arrange(desc(n)) %>%
  head()

# A tibble: 6 × 2
  Condition                                                                    n
  <chr>                                                                    <int>
1 not_provided                                                                43
2 Primary_hyperoxaluria, _type_II                                             16
3 Ellis-van_Creveld_syndrome                                                   8
4 Hereditary_cancer-predisposing_syndrome                                      6
5 Joubert_syndrome_15                                                          6
6 Microcephaly, _normal_intelligence_and_immunodeficiency|Hereditary_canc…     6

ClinVarPangolinResults.merged.with.leafcutter <- ClinVarPangolinResults.merged.with.leafcutter %>%
  filter(! Condition %in% c("not_provided", "See_cases") )

let’s also filter out the ones with Pangolin effects greater than 95% decrease, filter for ones that work in opposite direction as treatment, and then plot.

# not consolidating
ToPlot <- ClinVarPangolinResults.merged.with.leafcutter %>%
  filter(!sign(PangolinEffectSize) == sign(TreatmentEffect)) %>%
  filter(PangolinEffectSize > -0.95) %>%
  filter(TreatmentEffect>0.05)

ToPlot

# A tibble: 5 × 46
  cluster     treatment GA_in_cluster start_or_stop SpliceSitePos Sig           
  <chr>       <chr>     <lgl>         <chr>         <chr>         <chr>         
1 clu_9397_-  A10       TRUE          start         6_42704611    Pathogenic    
2 clu_6594_+  C2        TRUE          start         4_80275012    Pathogenic    
3 clu_14792_- A10       TRUE          stop          11_68760256   Pathogenic/Li…
4 clu_11153_- A10       TRUE          stop          8_18064457    Likely_pathog…
5 clu_14792_- SM2       TRUE          stop          11_68760256   Pathogenic/Li…
# … with 40 more variables: MolSigDescription <chr>, Pos <dbl>,
#   Condition <chr>, Gene <chr>, LargestPangolinChange <dbl>,
#   PangolinDirection <chr>, PangolinEffectSize <dbl>, DonorOrAcceptor <chr>,
#   TreatmentEffect <dbl>, intron <chr>, logef <dbl>, psi_treatment <dbl>,
#   DMSO <dbl>, deltapsi <dbl>, chrom <chr>, status <chr>, loglr <dbl>,
#   df <dbl>, p <dbl>, p.adjust <dbl>, Intron <chr>, strand <chr>,
#   exons_skipped <dbl>, anchor <chr>, gene_names <chr>, gene_ids <chr>, …

ggplot(ToPlot, aes(x=ID, xend=ID, color=treatment)) +
    geom_segment( aes(y=PangolinEffectSize, yend=EffectSum)) +
    geom_point(aes(y = PangolinEffectSize), size=3, color="black") +
    geom_point(aes(y = EffectSum)) +
    geom_hline(aes(yintercept=0), linetype="dashed") +
    scale_x_discrete(labels=setNames(str_wrap(str_replace(ToPlot$ID, "not_provided", ""), width=30), ToPlot$ID)) +
    facet_grid(rows=vars(treatment), scales = "free", space="free") +
    ylim(c(-1, 0.25)) +
    xlab("SpliceSite in Gene: ClinVarVariant\nCondition") +
    ylab("Pangolin + treatment\nsplicing effect\n(Splice site deltaPSI)") +
    coord_flip() +
    theme_light() +
    theme(legend.position="none") +
  labs(caption="Only GA|GT introns with meaningful antagonist effects included")

Ok, actually let’s filter for ones where the treatment effect is at least say 0.05. Let’s also plot the effected splice site in the label

Finally, let’s plot the splicing admixture plot for that splice cluster that appears significant in multiple conditions:

# clu_613_-

Mode <- function(x) {
  ux <- unique(x)
  ux[which.max(tabulate(match(x, ux)))]
}

PSI.table %>%
  filter(Cluster %in% c("clu_613_-", "clu_12905_+", "clu_20571_-")) %>%
  mutate(Intron=str_replace(intron, "(.+?):(.+?):(.+?):.+", "\\1_\\2_\\3")) %>%
  inner_join(SpliceTypeAnnotations, by="Intron") %>%
  left_join(
    (ToPlot %>%
       arrange(cluster, deltapsi) %>%
       distinct(cluster, .keep_all = T)
     ),
    by = c("Cluster"="cluster")
    ) %>%
  group_by(Cluster) %>%
  mutate(Gene = Mode(gene_names)) %>%
  ungroup() %>%
  mutate(ContainsPangolinMostAffectedSpliceSite = intron.x == intron.y) %>%
  mutate(Cluster_label = paste0(Cluster, " ", Gene, "\n", Condition)) %>%
  ggplot(aes(x=Sample, y=PSI, fill=Intron, color=ContainsPangolinMostAffectedSpliceSite)) +
  geom_col() +
  facet_wrap(~Cluster_label) +
  scale_fill_manual(values = getPalette(17)) +
  theme_classic() +
  # theme(legend.position="none") +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1))

PSI.table %>%
  filter(Cluster %in% c("clu_613_-")) %>%
  mutate(Intron=str_replace(intron, "(.+?):(.+?):(.+?):.+", "\\1_\\2_\\3")) %>%
  inner_join(SpliceTypeAnnotations, by="Intron") %>%
  left_join(
    (ToPlot %>%
       arrange(cluster, deltapsi) %>%
       distinct(cluster, .keep_all = T)
     ),
    by = c("Cluster"="cluster")
    ) %>%
  group_by(Cluster) %>%
  mutate(Gene = Mode(gene_names)) %>%
  ungroup() %>%
  mutate(ContainsPangolinMostAffectedSpliceSite = intron.x == intron.y) %>%
  mutate(Cluster_label = paste0(Cluster, " ", Gene, "\n", Condition)) %>%
  ggplot(aes(x=Sample, y=PSI, fill=Intron)) +
  geom_col() +
  facet_wrap(~Cluster_label) +
  scale_fill_manual(values = getPalette(17)) +
  theme_classic() +
  # theme(legend.position="none") +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1))

ToPlot %>%
  select(intron) %>%
  distinct() %>%
  separate(intron, into=c("chrom", "start", "stop", "cluster")) %>%
  write_tsv("../output/PioritizedIntronTargets.bed", col_names = F)

PlotOut <- PSI.table %>%
  filter(Cluster %in% c("clu_613_-", "clu_12905_+", "clu_20571_-")) %>%
  mutate(Intron=str_replace(intron, "(.+?):(.+?):(.+?):.+", "\\1_\\2_\\3")) %>%
  inner_join(SpliceTypeAnnotations, by="Intron") %>%
  left_join(
    (ToPlot %>%
       arrange(cluster, deltapsi) %>%
       distinct(cluster, .keep_all = T)
     ),
    by = c("Cluster"="cluster")
    ) %>%
  group_by(Cluster) %>%
  mutate(Gene = Mode(gene_names)) %>%
  ungroup() %>%
  mutate(ContainsPangolinMostAffectedSpliceSite = intron.x == intron.y) %>%
  mutate(Cluster_label = paste0(Cluster, " ", Gene, "\n", Condition)) %>%
  ggplot(aes(x=Sample, y=PSI, fill=Intron, color=ContainsPangolinMostAffectedSpliceSite)) +
  geom_col() +
  facet_wrap(~Cluster_label) +
  scale_fill_manual(values = getPalette(17)) +
  theme_classic() +
  # theme(legend.position="none") +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1)) +
  theme(strip.text.x = element_text(size = 6))

ggsave("../output/PioritizedIntronTargets.pdf", PlotOut, width=14, height=6)

PSI.table %>%
  filter(Cluster %in% c("clu_613_-")) %>%
  mutate(Intron=str_replace(intron, "(.+?):(.+?):(.+?):.+", "\\1_\\2_\\3")) %>%
  inner_join(SpliceTypeAnnotations, by="Intron") %>%
  left_join(
    (ToPlot %>%
       arrange(cluster, deltapsi) %>%
       distinct(cluster, .keep_all = T)
     ),
    by = c("Cluster"="cluster")
    )

TODO:

Consider using the ClinVar database a different ways:

Some ideas:

identify ClinVar variants that cause frame-shifts and early stop codons, and intersect with leafcutter results for the ones that might skip those exons, and hopefully rescue expression. If the ClinVar variant is not present in the experimental (likely healthy-donor) fibroblasts (and they presumably are not), we perhaps wouldn’t expect expression changes to be present in our DE results
Similarly, get list of ClinVar mutations with dominant negative inheritence mode, and identify splicing changes that might exon-skip the mutation.
get list (perhaps OMIM?) of genes and disease-causal-effect directions and intersect with leafcutter results and DE results that are consistent with a rescue effect.

Prioritize treatment effects that may suppress dominant negative mutations in exons

First, to explore the molecular characteristics of autosomal dominant conditions, let’s plot the molecular siginficance code distribution for conditions annotated with “autosomal_dominant” in the condition description…

ClinVar <- ClinVar %>%
  filter(Sig %in% SigCategoriesToInclude) %>%
  mutate(ModeOfInheritence = case_when(
    str_detect(Condition, "utosomal_dominant") ~ "Autosomal_dominant",
    str_detect(Condition, "utosomal_recessive") ~ "Autosomal_recessive",
    TRUE ~ "other"))

ClinVar %>%
  filter(Sig %in% SigCategoriesToInclude) %>%
  filter(str_detect(Condition, "gain_of")) %>%
  pull(Condition) %>% unique()

[1] "['Hyper-IgE_recurrent_infection_syndrome_1', '_autosomal_dominant|Autoimmune_disease', '_multisystem', '_infantile-onset', '_1|STAT3_gain_of_function|not_provided']"
[2] "['Hyper-IgE_recurrent_infection_syndrome_1', '_autosomal_dominant|Inherited_Immunodeficiency_Diseases|Inborn_genetic_diseases|STAT3_gain_of_function|not_provided']" 
[3] "['Hyper-IgE_recurrent_infection_syndrome_1', '_autosomal_dominant|STAT3_gain_of_function|not_provided']"                                                             
[4] "['Hyper-IgE_recurrent_infection_syndrome_1', '_autosomal_dominant|Hyper-IgE_syndrome|STAT3_gain_of_function|not_provided']"                                          
[5] "['Hyper-IgE_recurrent_infection_syndrome_1', '_autosomal_dominant|STAT3_gain_of_function']"                                                                          
[6] "['Hyper-IgE_recurrent_infection_syndrome_1', '_autosomal_dominant|Inherited_Immunodeficiency_Diseases|STAT3_gain_of_function|not_provided']"                         
[7] "['Hyper-IgE_recurrent_infection_syndrome_1', '_autosomal_dominant|Autoimmune_disease', '_multisystem', '_infantile-onset', '_1|STAT3_gain_of_function']"

ClinVar %>%
  filter(Sig %in% SigCategoriesToInclude) %>%
  count(ModeOfInheritence) %>%
  ggplot(aes(x=ModeOfInheritence, y=n)) +
  geom_col() +
  ylab("Number ClinVar mutations") +
  scale_y_continuous(trans="log10") +
  theme_classic() +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1)) +
  ggtitle("Mode of inheritence in ClinVar/OMIM")

Version	Author	Date
db9a4f0	Benjmain Fair	2022-08-01

Using OMIM annotations for searching potential targets

AR <- read_excel("../data/Hijikata_TableS1_41598_2017_8902_MOESM2_ESM.xls", col_types="text") %>%
  filter(Inheritance=="AR") %>%
  mutate(ModeInheritence = "AutosomalRecessive")

AD <- excel_sheets("../data/Hijikata_TableS2_41598_2017_8902_MOESM3_ESM.xls") %>% 
  set_names() %>% 
  map(read_excel, path = "../data/Hijikata_TableS2_41598_2017_8902_MOESM3_ESM.xls", col_types="text") %>%
  bind_rows(.id = "ModeInheritence")

OmimTable <- read_tsv("../code/OMIM/mimTitles.txt", skip=2)

Parsed with column specification:
cols(
  `# Prefix` = col_character(),
  `MIM Number` = col_double(),
  `Preferred Title; symbol` = col_character(),
  `Alternative Title(s); symbol(s)` = col_character(),
  `Included Title(s); symbols` = col_character()
)

HijikataMappings <- bind_rows(AR, AD) %>%
  mutate(ModeInheritence = recode(ModeInheritence, DN="DominantNegative", GF="GainOfFunction", HI="Haploinsufficient")) %>%
  mutate(MIM_phenotype=as.numeric(MIM_phenotype)) %>%
  left_join(OmimTable, by=c("MIM_phenotype"="MIM Number"))

HijikataMappings %>%
  filter(Gene %in% GenesOfInterest)

# A tibble: 2 × 12
  Gene  RefSeq   Length MIM_gene MIM_phenotype Inheritance ModeInheritence PMID 
  <chr> <chr>    <chr>  <chr>            <dbl> <chr>       <chr>           <chr>
1 GALC  NP_0001… 685    606890          245200 AR          AutosomalReces… <NA> 
2 SNCA  NP_0011… 140    163890          168601 AD          GainOfFunction  1602…
# … with 4 more variables: `# Prefix` <chr>, `Preferred Title; symbol` <chr>,
#   `Alternative Title(s); symbol(s)` <chr>, `Included Title(s); symbols` <chr>

HijikataMappings %>%
  ggplot(aes(x=ModeInheritence)) +
  geom_bar() +
  ylab("Number OMIM condition:gene pairs") +
  theme_classic() +
  theme(axis.text.x = element_text(angle = 45, hjust=1))

DE_and_DS_OMIM_Map <- HijikataMappings %>%
  filter(ModeInheritence %in% c("DominantNegative", "GainOfFunction")) %>%
  left_join(
      leafcutter.ds.annotated %>%
      filter((!is.na(gene_names)) & str_detect(gene_ids, ",", negate=T)) %>%
      mutate(gene_ids=str_replace(gene_ids, "\\..+$", "")) %>%
        full_join(
          (DE.results %>%
             mutate(gene_ids=str_replace(EnsemblID, "\\..+$", ""))), 
        by=c("treatment","gene_ids")),
    by=c("Gene"="GeneSymbol")) %>%
  mutate(IsDownregulate = (FDR < 0.01 & sign(logFC)==-1)) %>%
  mutate(IsCrypticSplicingSig = p.adjust < 0.01 & IntronType %in% c("Alt 5'ss", "Alt 3'ss") & sign(logef)==1) %>%
  group_by(treatment, Gene, MIM_phenotype) %>%
  arrange(desc(IsDownregulate), desc(IsCrypticSplicingSig)) %>%
  ungroup() %>%
  distinct(treatment, Gene, MIM_phenotype, .keep_all = T)

DE_and_DS_OMIM_Map %>%
  distinct(treatment, Gene, MIM_phenotype, .keep_all = T) %>%
  count(treatment, IsCrypticSplicingSig, IsDownregulate) %>%
  drop_na() %>%
  mutate(treatment_effect = case_when(
    IsCrypticSplicingSig & IsDownregulate ~ "Downregulation & crpytic splicing detected",
    IsCrypticSplicingSig & !IsDownregulate ~ "Crpytic splicing detected",
    !IsCrypticSplicingSig & IsDownregulate ~ "Downregulation detected",
    !IsCrypticSplicingSig & !IsDownregulate ~ "Neither"
  )) %>%
  mutate(treatment_effect= factor(treatment_effect, levels = c("Downregulation & crpytic splicing detected", "Crpytic splicing detected", "Downregulation detected", "Neither"))) %>%
  ggplot(aes(x=treatment, y=n, fill=treatment_effect)) +
    geom_col() +
    ylab("Number OMIM condition:gene pairs") +
    xlab("treatment") +
    scale_fill_discrete() +
    theme_classic()

DE_and_DS_OMIM_Map %>%
  distinct(treatment, Gene, MIM_phenotype, .keep_all = T) %>%
  filter(treatment == "SM2") %>%
  filter(ModeInheritence=="GainOfFunction" & IsDownregulate & IsCrypticSplicingSig) %>%
  # filter(UpstreamOfDonor2BaseSeq == "GA" ) %>%
  select(Gene, `Preferred Title; symbol`, UpstreamOfDonor2BaseSeq, psi_treatment, intron, logFC) %>%
  select(Gene, `Preferred Title; symbol`)

# A tibble: 9 × 2
  Gene   `Preferred Title; symbol`                                              
  <chr>  <chr>                                                                  
1 AKT3   MEGALENCEPHALY-POLYMICROGYRIA-POLYDACTYLY-HYDROCEPHALUS SYNDROME 2; MP…
2 ACVR1  FIBRODYSPLASIA OSSIFICANS PROGRESSIVA; FOP                             
3 STAT1  IMMUNODEFICIENCY 31C; IMD31C                                           
4 PDGFRB MYOFIBROMATOSIS, INFANTILE, 1; IMF1                                    
5 RNF170 ATAXIA, SENSORY, 1, AUTOSOMAL DOMINANT; SNAX1                          
6 PTDSS1 LENZ-MAJEWSKI HYPEROSTOTIC DWARFISM; LMHD                              
7 KITLG  HYPERPIGMENTATION WITH OR WITHOUT HYPOPIGMENTATION, FAMILIAL PROGRESSI…
8 PIEZO1 DEHYDRATED HEREDITARY STOMATOCYTOSIS 1 WITH OR WITHOUT PSEUDOHYPERKALE…
9 SNTA1  LONG QT SYNDROME 12; LQT12

  # distinct(Gene, MIM_phenotype)

  # filter(UpstreamOfDonor2BaseSeq == "AG") %>%
  # filter(IntronType == "Alt 5'ss") %>%
  # filter( sign(logef)==1)

DE_and_DS_OMIM_Map %>%
  distinct(treatment, Gene, MIM_phenotype, .keep_all = T) %>%
  filter(treatment == "SM2") %>%
  filter(ModeInheritence=="GainOfFunction" & IsDownregulate & IsCrypticSplicingSig) %>%
  # filter(UpstreamOfDonor2BaseSeq == "GA" ) %>%
  select(Gene, `Preferred Title; symbol`, UpstreamOfDonor2BaseSeq, psi_treatment, intron, logFC)  %>%
  write_tsv("../output/FibroblastOMIM_GOF_Downregulated_AndCrypticSpliced_GeneList.tsv")

Now let’s write out some info to help me plot stuff in a useful way with my pygenometracks scripts. More specifically, write out a bash script which I will then run as follows in the next code chunk

genes.to.plot <- DE_and_DS_OMIM_Map %>%
  distinct(treatment, Gene, MIM_phenotype, .keep_all = T) %>%
  filter(treatment == "SM2") %>%
  filter(ModeInheritence=="GainOfFunction" & IsDownregulate & IsCrypticSplicingSig)

# Method1 for selecting introns to plot sashimi
leafcutter.ds.annotated %>%
  distinct(intron, cluster) %>%
  filter(cluster %in% genes.to.plot$cluster) %>%
  separate(intron, into=c("chrom", "start", "stop", "cluster2"), sep=":", remove=F, convert=T) %>%
  select(chrom, start, stop, intron) %>%
  arrange(chrom, start, stop) %>%
  write_tsv("../code/scratch/ToPlot.introns.bed")

# Method2 for selecting introns to plot sashimi
all.samples.5ss <- read_tsv("../code/SplicingAnalysis/FullSpliceSiteAnnotations/JuncfilesMerged.annotated.basic.bed.5ss.tab.gz", col_names = c("intron", "seq", "score")) %>%
  mutate(intron = str_replace(intron, "^(.+?)::.+$", "\\1")) %>%
  separate(intron, into=c("chrom", "start", "stop", "strand"), sep="_", convert=T, remove=F)

all.samples.intron.annotations <- read_tsv("../code/SplicingAnalysis/FullSpliceSiteAnnotations/JuncfilesMerged.annotated.basic.bed.gz")

all.samples.PSI <- read_tsv("../code/SplicingAnalysis/leafcutter_all_samples/PSI.table.bed.gz")
  
all.samples.PSI %>%
  select(1:6) %>%
  inner_join(all.samples.5ss, by=c("#Chrom"="chrom", "start", "end"="stop", "strand")) %>%
  inner_join(all.samples.intron.annotations, by=c("#Chrom"="chrom", "start", "end", "strand") ) %>%
  filter(str_detect(seq, "^\\w{2}GAGT")) %>%
  filter(gene_names %in% genes.to.plot$Gene) %>%
  write_tsv("../code/scratch/ToPlot.introns.bed")

# Method1 for plot boundaries
cluster_boundaries <- leafcutter.ds.annotated %>%
  distinct(intron, cluster) %>%
  filter(cluster %in% genes.to.plot$cluster) %>%
  separate(intron, into=c("chrom", "start", "stop", "cluster2"), sep=":", remove=F, convert=T) %>%
  group_by(cluster, chrom) %>%
  summarise(min = min(start) - 5000, max= max(stop)+5000)

genes.to.plot %>%
  left_join(cluster_boundaries, by="cluster") %>%
  mutate(cmd = str_glue("python scripts/GenometracksByGenotype/AggregateBigwigsForPlotting.py --GroupSettingsFile <(awk '$1 ~ /_u$/' bigwigs/BigwigList.groups.tsv ) --BigwigList bigwigs/BigwigList.tsv --Normalization WholeGenome --Region {chrom}:{min}-{max} --BigwigListType KeyFile --OutputPrefix scratch/ -vv --TracksTemplate scripts/GenometracksByGenotype/tracks_templates/GeneralPurposeColoredByGenotypeWithSupergroups.ini --Bed12GenesToIni scripts/GenometracksByGenotype/PremadeTracks/gencode.v26.FromGTEx.genes.bed12.gz  --FilterJuncsByBed scratch/ToPlot.introns.bed\npyGenomeTracks --tracks scratch/tracks.ini --out scratch/ForJingxin_{Gene}.png --region {chrom}:{min}-{max} --trackLabelFraction 0.15\n\n")) %>%
  select(cmd) %>%
  write.table("../code/testPlottingWithMyScript.ForJingxin.sh", quote=F, row.names = F, col.names = F)


#Method2 for plot boundaries
genes <- read_tsv("../code/scripts/GenometracksByGenotype/PremadeTracks/gencode.v26.FromGTEx.genes.bed12.gz", col_names = c("chrom", "start", "stop", "name", "score", "strand", 1:7)) %>%
  select(1:6)

genes.to.plot %>%
  left_join(genes, by=c("Gene"="name")) %>%
  mutate(min=start-1000, max=stop+1000) %>%
  mutate(cmd = str_glue("python scripts/GenometracksByGenotype/AggregateBigwigsForPlotting.py --GroupSettingsFile <(awk '$1 ~ /_u$/' bigwigs/BigwigList.groups.tsv ) --BigwigList bigwigs/BigwigList.tsv --Normalization WholeGenome --Region {chrom}:{min}-{max} --BigwigListType KeyFile --OutputPrefix scratch/ -vv --TracksTemplate scripts/GenometracksByGenotype/tracks_templates/GeneralPurposeColoredByGenotypeWithSupergroups.ini --Bed12GenesToIni scripts/GenometracksByGenotype/PremadeTracks/gencode.v26.FromGTEx.genes.bed12.gz  --FilterJuncsByBed scratch/ToPlot.introns.bed\npyGenomeTracks --tracks scratch/tracks.ini --out scratch/ForJingxin_{Gene}.png --region {chrom}:{min}-{max} --trackLabelFraction 0.15\n\n")) %>%
  select(cmd) %>%
  write.table("../code/testPlottingWithMyScript.ForJingxin.sh", quote=F, row.names = F, col.names = F)

Run script as follows

conda activate GenometracksByGenotype
cd ../code
bash testPlottingWithMyScript.ForJingxin.sh

sessionInfo()

R version 3.6.1 (2019-07-05)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: CentOS Linux 7 (Core)

Matrix products: default
BLAS/LAPACK: /software/openblas-0.2.19-el7-x86_64/lib/libopenblas_haswellp-r0.2.19.so

locale:
 [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C         LC_TIME=C           
 [4] LC_COLLATE=C         LC_MONETARY=C        LC_MESSAGES=C       
 [7] LC_PAPER=C           LC_NAME=C            LC_ADDRESS=C        
[10] LC_TELEPHONE=C       LC_MEASUREMENT=C     LC_IDENTIFICATION=C 

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] readxl_1.3.1       GGally_1.4.0       RColorBrewer_1.1-2 ggrepel_0.9.1     
 [5] edgeR_3.26.5       limma_3.40.6       gplots_3.0.1.1     forcats_0.4.0     
 [9] stringr_1.4.0      dplyr_1.0.9        purrr_0.3.4        readr_1.3.1       
[13] tidyr_1.2.0        tibble_3.1.7       ggplot2_3.3.6      tidyverse_1.3.0   

loaded via a namespace (and not attached):
 [1] nlme_3.1-140       bitops_1.0-6       fs_1.5.2           lubridate_1.7.4   
 [5] httr_1.4.4         rprojroot_2.0.2    tools_3.6.1        backports_1.4.1   
 [9] utf8_1.1.4         R6_2.4.0           KernSmooth_2.23-15 DBI_1.1.0         
[13] mgcv_1.8-40        colorspace_1.4-1   withr_2.5.0        tidyselect_1.1.2  
[17] compiler_3.6.1     git2r_0.26.1       cli_3.3.0          rvest_0.3.5       
[21] xml2_1.3.2         labeling_0.3       caTools_1.17.1.2   scales_1.1.0      
[25] hexbin_1.27.3      digest_0.6.20      rmarkdown_1.13     pkgconfig_2.0.2   
[29] htmltools_0.5.3    dbplyr_1.4.2       fastmap_1.1.0      highr_0.9         
[33] rlang_1.0.5        rstudioapi_0.14    generics_0.1.3     farver_2.1.0      
[37] jsonlite_1.6       gtools_3.9.2.2     magrittr_1.5       Matrix_1.2-18     
[41] Rcpp_1.0.5         munsell_0.5.0      fansi_0.4.0        lifecycle_1.0.1   
[45] stringi_1.4.3      whisker_0.3-2      yaml_2.2.0         plyr_1.8.4        
[49] grid_3.6.1         gdata_2.18.0       promises_1.0.1     crayon_1.3.4      
[53] lattice_0.20-38    haven_2.3.1        splines_3.6.1      hms_0.5.3         
[57] locfit_1.5-9.1     knitr_1.39         pillar_1.7.0       reprex_0.3.0      
[61] glue_1.6.2         evaluate_0.15      modelr_0.1.8       vctrs_0.4.1       
[65] httpuv_1.5.1       cellranger_1.1.0   gtable_0.3.0       reshape_0.8.8     
[69] assertthat_0.2.1   xfun_0.31          broom_1.0.0        later_0.8.0       
[73] viridisLite_0.3.0  workflowr_1.6.2    ellipsis_0.3.2