Last updated: 2023-03-02

Checks: 5 2

Knit directory: 20211209_JingxinRNAseq/analysis/

This reproducible R Markdown analysis was created with workflowr (version 1.7.0). The Checks tab describes the reproducibility checks that were applied when the results were created. The Past versions tab lists the development history.


The R Markdown file has unstaged changes. To know which version of the R Markdown file created these results, you’ll want to first commit it to the Git repo. If you’re still working on the analysis, you can ignore this warning. When you’re finished, you can run wflow_publish to commit the R Markdown file and build the HTML.

Great job! The global environment was empty. Objects defined in the global environment can affect the analysis in your R Markdown file in unknown ways. For reproduciblity it’s best to always run the code in an empty environment.

The command set.seed(19900924) was run prior to running the code in the R Markdown file. Setting a seed ensures that any results that rely on randomness, e.g. subsampling or permutations, are reproducible.

Great job! Recording the operating system, R version, and package versions is critical for reproducibility.

Nice! There were no cached chunks for this analysis, so you can be confident that you successfully produced the results during this run.

Using absolute paths to the files within your workflowr project makes it difficult for you and others to run your code on a different machine. Change the absolute path(s) below to the suggested relative path(s) to make your code more reproducible.

absolute relative
/project2/yangili1/bjf79/20211209_JingxinRNAseq/code/bigwigs/unstranded/(.+?).bw ../code/bigwigs/unstranded/(.+?).bw

Great! You are using Git for version control. Tracking code development and connecting the code version to the results is critical for reproducibility.

The results in this page were generated with repository version 32a80fd. See the Past versions tab to see a history of the changes made to the R Markdown and HTML files.

Note that you need to be careful to ensure that all relevant files for the analysis have been committed to Git prior to generating the results (you can use wflow_publish or wflow_git_commit). workflowr only checks the R Markdown file, but you know if there are other scripts or data files that it depends on. Below is the status of the Git repository when the results were generated:


Ignored files:
    Ignored:    .DS_Store
    Ignored:    .Rhistory
    Ignored:    .Rproj.user/
    Ignored:    analysis/.RData
    Ignored:    analysis/.Rhistory
    Ignored:    analysis/20220707_TitrationSeries_DE_testing.nb.html
    Ignored:    code/.DS_Store
    Ignored:    code/._DOCK7.pdf
    Ignored:    code/._DOCK7_DMSO1.pdf
    Ignored:    code/._DOCK7_SM2_1.pdf
    Ignored:    code/._FKTN_DMSO_1.pdf
    Ignored:    code/._FKTN_SM2_1.pdf
    Ignored:    code/._MAPT.pdf
    Ignored:    code/._PKD1_DMSO_1.pdf
    Ignored:    code/._PKD1_SM2_1.pdf
    Ignored:    code/.snakemake/
    Ignored:    code/1KG_HighCoverageCalls.samplelist.txt
    Ignored:    code/5ssSeqs.tab
    Ignored:    code/Alignments/
    Ignored:    code/ChemCLIP/
    Ignored:    code/ClinVar/
    Ignored:    code/DE_testing/
    Ignored:    code/DE_tests.mat.counts.gz
    Ignored:    code/DE_tests.txt.gz
    Ignored:    code/DataNotToCommit/
    Ignored:    code/DoseResponseData/
    Ignored:    code/Fastq/
    Ignored:    code/FastqFastp/
    Ignored:    code/FragLenths/
    Ignored:    code/Meme/
    Ignored:    code/Multiqc/
    Ignored:    code/OMIM/
    Ignored:    code/OldBigWigs/
    Ignored:    code/PhyloP/
    Ignored:    code/QC/
    Ignored:    code/ReferenceGenomes/
    Ignored:    code/Session.vim
    Ignored:    code/SplicingAnalysis/
    Ignored:    code/TracksSession
    Ignored:    code/bigwigs/
    Ignored:    code/featureCounts/
    Ignored:    code/figs/
    Ignored:    code/geena/
    Ignored:    code/hg38ToMm39.over.chain.gz
    Ignored:    code/igv_session.template.xml
    Ignored:    code/igv_session.xml
    Ignored:    code/log
    Ignored:    code/logs/
    Ignored:    code/rstudio-server.job
    Ignored:    code/scratch/
    Ignored:    code/test.txt.gz
    Ignored:    code/testPlottingWithMyScript.ForJingxin.sh
    Ignored:    code/testPlottingWithMyScript.ForJingxin2.sh
    Ignored:    code/testPlottingWithMyScript.ForJingxin3.sh
    Ignored:    code/testPlottingWithMyScript.ForJingxin4.sh
    Ignored:    code/testPlottingWithMyScript.sh
    Ignored:    data/~$52CompoundsTempPlateLayoutForPipettingConvenience.xlsx
    Ignored:    output/._PioritizedIntronTargets.pdf

Unstaged changes:
    Modified:   analysis/20220221_ExploreExpOf52.Rmd

Note that any generated files, e.g. HTML, png, CSS, etc., are not included in this status report because it is ok for generated content to have uncommitted changes.


These are the previous versions of the repository in which changes were made to the R Markdown (analysis/20220221_ExploreExpOf52.Rmd) and HTML (docs/20220221_ExploreExpOf52.html) files. If you’ve configured a remote Git repository (see ?wflow_git_remote), click on the hyperlinks in the table below to view the files as they were in that past version.

File Version Author Date Message
Rmd 32a80fd Benjmain Fair 2023-03-01 updated exp52 nb
html 32a80fd Benjmain Fair 2023-03-01 updated exp52 nb
Rmd e996344 Benjmain Fair 2023-02-28 update 52exp nb
html e996344 Benjmain Fair 2023-02-28 update 52exp nb
Rmd 0e2a360 Benjmain Fair 2023-02-24 update exp52 notebook
html 0e2a360 Benjmain Fair 2023-02-24 update exp52 notebook
Rmd a8c9152 Benjmain Fair 2023-02-23 update site
html a8c9152 Benjmain Fair 2023-02-23 update site
Rmd 94cef29 Benjmain Fair 2023-02-22 added Rmd for 52Exp

knitr::opts_chunk$set(echo = TRUE, warning = F, message = F)

library(tidyverse)
── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
✔ ggplot2 3.3.6     ✔ purrr   0.3.4
✔ tibble  3.1.7     ✔ dplyr   1.0.9
✔ tidyr   1.2.0     ✔ stringr 1.4.0
✔ readr   2.1.2     ✔ forcats 0.5.1
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
library(RColorBrewer)
library(data.table)

Attaching package: 'data.table'
The following objects are masked from 'package:dplyr':

    between, first, last
The following object is masked from 'package:purrr':

    transpose
library(edgeR)
Loading required package: limma
library(qvalue)
library(Heatplus)
library(gplots)

Attaching package: 'gplots'
The following object is masked from 'package:stats':

    lowess
library(ggrepel)
library(knitr)
library(ggnewscale)

# define some useful funcs
sample_n_of <- function(data, size, ...) {
  dots <- quos(...)
  
  group_ids <- data %>% 
    group_by(!!! dots) %>% 
    group_indices()
  
  sampled_groups <- sample(unique(group_ids), size)
  
  data %>% 
    filter(group_ids %in% sampled_groups)
}


# Set theme
theme_set(
  theme_classic() +
  theme(text=element_text(size=16,  family="Helvetica")))

# I use layer a lot, to rotate long x-axis labels
Rotate_x_labels <- theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust=1))

#test plot
ggplot(mtcars, aes(x=mpg, y=cyl)) +
  geom_point()

Version Author Date
a8c9152 Benjmain Fair 2023-02-23

Intro

Jingxin’s lab has synthesized (or maybe contracted out synthesis of) hundred(s) branaplam/risdiplam derivatives. They screened them for splice-modifying activity using a SMN2 minigene I believe, resulting in 52 compounds that they send Gabi and I for further experiments: We recieved the 52 compounds in a 96 well plate, and in subsequent plots we are naming them just by the well position they were shipped in (ie A01, A02, …, E04). The molar concentration of each molecule was determined by Jingxin’s lab and shipped at 500x concentration. I think molar concentration is based on EC90 of SMN2 splice modifying activity in their initial screen. We grew LCLs cells to a density of approximately 1M cells per mL, and split into seperate flasks each containing 7mL: 2 seperate replicate flasks for each molecule treatment, and 6 flasks of DMSO control (total of 110 samples), and applied 14uL of treatment. The following day, cells were collected by centrifugation, supernate was discarded and stored in 1mL of Trizol at -20C for futher processing. We extracted RNA by phase seperation (Trizol manufacturer’s instructions). The top phase was mixed with an equal volume of ethanol, and applied to Zymo5 nucleic-acid purification columns. After two washes, we performed on-column DNAse I treatment, followed by an extra application of binding buffer and two more washes, and finally elution in water. Qubit quantification was used to quantify RNA before using NEBNext ultra directional ii RNAseq kits with NEB’s polyA capture kit. There are a couple optional steps in the kit preparation: We fragmented the RNA for only5 minutes, and performed size selection (step 6.3) to preferentially obtain longer insert sizes. The libraries were intially sequenced on a MiSeq (~20M reads total), to determine optimal re-pooling volumes (aiming for equal representation of each sample) and to confirm long insert sizes to justify the 150+150 paired end sequencing that we ended up doing on the NovaSeq S4 (~10B reads total). In practice, I noticed that one sample had very very low number of reads, and will likely have to be dropped from analysis (see below). Also, sample pooling volumes weren’t as perfect as I would have hoped, but I think they are acceptable. see below.

In the remainder of this notebook, I will explore this data is ways similar to my previous notebooks regarding Jingxin’s RNAseq - analyses that I can do quickly. The pre-processing that I have done leading up to this notebook includes read trimming (fastp), alignment with STAR (I used single pass mode, but perhaps before publication it might be worth aligning all samples and re-aligning samples for STAR’s “two-pass” procedure for most sensitive identification of unannotated junctions), gene counting with featureCounts, differential expression analysis with edgeR (a separate contrast for each treatment, comparing to the same 6 DMSO replicates ), splicing quantification and differential testing with leafcutter (again, comparing each treatment to 6 DMSO replicates).

Reads per sample

Let’s just consider protein coding genes, as written out in my reformatted featureCounts output, which I will read in here. I will occasionally read in data from previous experiments too for comparison.

# read in sample metadata


Metadata.ExpOf52 <- read_tsv("../code/config/samples.52MoleculeExperiment.tsv") %>%
  mutate(cell.type = "LCL", libType="polyA", rep=BioRep, old.sample.name=SampleID, dose.nM=NA) %>%
  mutate(treatment = if_else(Treatment=="DMSO", "DMSO", paste0("W", Treatment))) %>%
  dplyr::select(treatment, cell.type, dose.nM, libType, rep, old.sample.name) %>%
  mutate(SampleName = paste(treatment, dose.nM, cell.type, libType, rep, sep = "_"))

Metadata.PreviousExperiments <- read_tsv("../code/bigwigs/BigwigList.tsv",col_names = c("SampleName", "bigwig", "group", "strand")) %>%
  filter(strand==".") %>%
  dplyr::select(-strand) %>%
  mutate(old.sample.name = str_replace(bigwig, "/project2/yangili1/bjf79/20211209_JingxinRNAseq/code/bigwigs/unstranded/(.+?).bw", "\\1")) %>%
  separate(SampleName, into=c("treatment", "dose.nM", "cell.type", "libType", "rep"), convert=T, remove=F, sep="_") %>%
  left_join(
    read_tsv("../code/bigwigs/BigwigList.groups.tsv", col_names = c("group", "color", "bed", "supergroup")),
    by="group"
  )

FullMetadata <- bind_rows(Metadata.ExpOf52, Metadata.PreviousExperiments) %>%
  mutate(Experiment = case_when(
    cell.type == "Fibroblast" ~ "Single high dose fibroblast",
    startsWith(old.sample.name, "TitrationExp") ~ "Dose response titration",
    startsWith(old.sample.name, "chRNA") ~ "nascent RNA profiling",
    startsWith(old.sample.name, "NewMolecule") ~ "Single high dose LCL"
  )) %>%
  mutate(color = case_when(
    treatment == "DMSO" ~ "#969696",
    Experiment == "Single high dose LCL" ~ "#252525",
    TRUE ~ color
  )) %>%
  mutate(leafcutter.name = str_replace_all(old.sample.name, "-", "."))



gene.counts <- read_tsv("../code/DE_testing/ExpOf52_Counts.mat.txt.gz") %>%
  column_to_rownames("Geneid") %>%
  DGEList() %>%
  calcNormFactors()

counts.plot.dat <- gene.counts$samples %>%
  rownames_to_column("old.sample.name") %>%
  inner_join(FullMetadata, by="old.sample.name") %>%
  mutate(dose.nM = case_when(
    treatment == "DMSO" ~ "NA",
    cell.type == "Fibroblast" ~ "CC50 dose",
    Experiment == "Single high dose LCL" ~ "SMN_EC90 dose",
    TRUE ~ as.character(dose.nM)
  )) %>%
  mutate(label = dose.nM) %>%
  arrange(Experiment, treatment, dose.nM)

counts.plot.labels <- counts.plot.dat %>%
  dplyr::select(old.sample.name, label) %>% deframe()

counts.plot.dat %>%
  filter(Experiment=="Single high dose LCL")
      old.sample.name group.x  lib.size norm.factors treatment cell.type
1   NewMolecule.E05-1       1 135983671    0.9401592      DMSO       LCL
2   NewMolecule.E06-2       1 317779082    0.6907965      DMSO       LCL
3   NewMolecule.E07-3       1  72297626    0.9693137      DMSO       LCL
4   NewMolecule.E05-4       1 129001545    0.8532237      DMSO       LCL
5   NewMolecule.E06-5       1 182905676    0.9465600      DMSO       LCL
6   NewMolecule.E07-6       1 126772689    0.7745714      DMSO       LCL
7   NewMolecule.A01-1       1 183728666    0.9614895      WA01       LCL
8   NewMolecule.A01-2       1 198632154    0.6962443      WA01       LCL
9   NewMolecule.A02-1       1 312182548    0.9826652      WA02       LCL
10  NewMolecule.A02-2       1 183395930    0.9897056      WA02       LCL
11  NewMolecule.A03-1       1 177704482    0.9400929      WA03       LCL
12  NewMolecule.A03-2       1 249934473    0.9561665      WA03       LCL
13  NewMolecule.A04-1       1 174653379    0.9222987      WA04       LCL
14  NewMolecule.A04-2       1 548033015    0.7589615      WA04       LCL
15  NewMolecule.A05-1       1 178435056    0.8614118      WA05       LCL
16  NewMolecule.A05-2       1 153251808    0.7788587      WA05       LCL
17  NewMolecule.A06-1       1 190563753    1.0080036      WA06       LCL
18  NewMolecule.A06-2       1 151314210    0.6000523      WA06       LCL
19  NewMolecule.A07-1       1 123503847    0.7953700      WA07       LCL
20  NewMolecule.A07-2       1 163742739    0.7842479      WA07       LCL
21  NewMolecule.A08-1       1 128532357    0.9945259      WA08       LCL
22  NewMolecule.A08-2       1 191110810    0.8467907      WA08       LCL
23  NewMolecule.A09-1       1 250617437    0.9637614      WA09       LCL
24  NewMolecule.A09-2       1 198092385    0.8870051      WA09       LCL
25  NewMolecule.A10-1       1 233224920    0.9467818      WA10       LCL
26  NewMolecule.A10-2       1 125874155    0.8707373      WA10       LCL
27  NewMolecule.A11-1       1 151421732    0.7724431      WA11       LCL
28  NewMolecule.A11-2       1 159203721    0.6239473      WA11       LCL
29  NewMolecule.A12-1       1 242979912    0.8638066      WA12       LCL
30  NewMolecule.A12-2       1 197958602    0.6814641      WA12       LCL
31  NewMolecule.B01-1       1 176245134    0.8257129      WB01       LCL
32  NewMolecule.B01-2       1 144824255    0.7994645      WB01       LCL
33  NewMolecule.B02-1       1 159560647    0.9525037      WB02       LCL
34  NewMolecule.B02-2       1 181332941    0.8442828      WB02       LCL
35  NewMolecule.B03-1       1 501440283    0.9212623      WB03       LCL
36  NewMolecule.B03-2       1 161562099    0.8031914      WB03       LCL
37  NewMolecule.B04-1       1 363641016    0.9203852      WB04       LCL
38  NewMolecule.B04-2       1 262007104    0.9186215      WB04       LCL
39  NewMolecule.B05-1       1 134091835    0.9803464      WB05       LCL
40  NewMolecule.B05-2       1 129596579    0.9316498      WB05       LCL
41  NewMolecule.B06-1       1 159743971    0.9776352      WB06       LCL
42  NewMolecule.B06-2       1  84886553    0.9970722      WB06       LCL
43  NewMolecule.B07-1       1 160357588    0.8486932      WB07       LCL
44  NewMolecule.B07-2       1 137933937    0.7173181      WB07       LCL
45  NewMolecule.B08-1       1 172609974    1.0109620      WB08       LCL
46  NewMolecule.B08-2       1  82813127    0.9109179      WB08       LCL
47  NewMolecule.B09-1       1 178767085    0.9536717      WB09       LCL
48  NewMolecule.B09-2       1 264104038    0.8378014      WB09       LCL
49  NewMolecule.B10-1       1 238413032    0.8317911      WB10       LCL
50  NewMolecule.B10-2       1 110987864    0.8812357      WB10       LCL
51  NewMolecule.B11-1       1 174607137    0.8259890      WB11       LCL
52  NewMolecule.B11-2       1 168465700    0.7937547      WB11       LCL
53  NewMolecule.B12-1       1 181271838    0.9129708      WB12       LCL
54  NewMolecule.B12-2       1 172779925    0.6964442      WB12       LCL
55  NewMolecule.C01-1       1 244342093    0.9665939      WC01       LCL
56  NewMolecule.C01-2       1 138058013    0.6168776      WC01       LCL
57  NewMolecule.C02-1       1 154946142    0.9809795      WC02       LCL
58  NewMolecule.C02-2       1 187610831    0.7841767      WC02       LCL
59  NewMolecule.C03-1       1 165446125    0.8161006      WC03       LCL
60  NewMolecule.C03-2       1 174266943    0.7833399      WC03       LCL
61  NewMolecule.C04-1       1 148019761    0.9639406      WC04       LCL
62  NewMolecule.C04-2       1   4025684    0.8837338      WC04       LCL
63  NewMolecule.C05-1       1 165455144    0.8796405      WC05       LCL
64  NewMolecule.C05-2       1 174718878    0.7735656      WC05       LCL
65  NewMolecule.C06-1       1 140277168    0.9390193      WC06       LCL
66  NewMolecule.C06-2       1 168324527    0.7628401      WC06       LCL
67  NewMolecule.C07-1       1 261153647    0.9823486      WC07       LCL
68  NewMolecule.C07-2       1 143163993    0.8017761      WC07       LCL
69  NewMolecule.C08-1       1 147676792    0.9694391      WC08       LCL
70  NewMolecule.C08-2       1 108832959    0.9102188      WC08       LCL
71  NewMolecule.C09-1       1 408007760    0.9806656      WC09       LCL
72  NewMolecule.C09-2       1 173089550    0.8327315      WC09       LCL
73  NewMolecule.C10-1       1 167604989    0.9559976      WC10       LCL
74  NewMolecule.C10-2       1 137455261    0.8389724      WC10       LCL
75  NewMolecule.C11-1       1 187805337    0.8355439      WC11       LCL
76  NewMolecule.C11-2       1 177066236    0.7906897      WC11       LCL
77  NewMolecule.C12-1       1 171294673    0.8092100      WC12       LCL
78  NewMolecule.C12-2       1 207161226    0.9439126      WC12       LCL
79  NewMolecule.D01-1       1 148474840    0.8973409      WD01       LCL
80  NewMolecule.D01-2       1 176919098    0.8387012      WD01       LCL
81  NewMolecule.D02-1       1 182860841    0.9258999      WD02       LCL
82  NewMolecule.D02-2       1 186819768    0.9105906      WD02       LCL
83  NewMolecule.D03-1       1 131745383    0.8897832      WD03       LCL
84  NewMolecule.D03-2       1 186120074    0.8657665      WD03       LCL
85  NewMolecule.D04-1       1 147445482    0.9188042      WD04       LCL
86  NewMolecule.D04-2       1 170662427    0.8789052      WD04       LCL
87  NewMolecule.D05-1       1 109693197    0.9849873      WD05       LCL
88  NewMolecule.D05-2       1  55187567    0.9279879      WD05       LCL
89  NewMolecule.D06-1       1 122174618    0.9893905      WD06       LCL
90  NewMolecule.D06-2       1 211272103    0.7724558      WD06       LCL
91  NewMolecule.D07-1       1 186303795    0.9672297      WD07       LCL
92  NewMolecule.D07-2       1 167009833    0.7753588      WD07       LCL
93  NewMolecule.D08-1       1 183462792    1.0004535      WD08       LCL
94  NewMolecule.D08-2       1 146479099    1.0033560      WD08       LCL
95  NewMolecule.D09-1       1 268002864    0.9646019      WD09       LCL
96  NewMolecule.D09-2       1  91945124    0.9215091      WD09       LCL
97  NewMolecule.D10-1       1 278370909    0.9590191      WD10       LCL
98  NewMolecule.D10-2       1 116412533    0.8802938      WD10       LCL
99  NewMolecule.D11-1       1 175024946    0.9210652      WD11       LCL
100 NewMolecule.D11-2       1 106706050    0.8053986      WD11       LCL
101 NewMolecule.D12-1       1  82466922    0.8892412      WD12       LCL
102 NewMolecule.D12-2       1 243296548    0.7777244      WD12       LCL
103 NewMolecule.E01-1       1 123440553    0.9716875      WE01       LCL
104 NewMolecule.E01-2       1 128461448    0.7058736      WE01       LCL
105 NewMolecule.E02-1       1 440121482    0.9524480      WE02       LCL
106 NewMolecule.E02-2       1 127160501    0.7928684      WE02       LCL
107 NewMolecule.E03-1       1 128754472    0.9062433      WE03       LCL
108 NewMolecule.E03-2       1 185922881    0.8547261      WE03       LCL
109 NewMolecule.E04-1       1 468869018    0.9580410      WE04       LCL
110 NewMolecule.E04-2       1 168588670    0.9200381      WE04       LCL
          dose.nM libType rep          SampleName bigwig group.y   color  bed
1              NA   polyA   1 DMSO_NA_LCL_polyA_1   <NA>    <NA> #969696 <NA>
2              NA   polyA   2 DMSO_NA_LCL_polyA_2   <NA>    <NA> #969696 <NA>
3              NA   polyA   3 DMSO_NA_LCL_polyA_3   <NA>    <NA> #969696 <NA>
4              NA   polyA   4 DMSO_NA_LCL_polyA_4   <NA>    <NA> #969696 <NA>
5              NA   polyA   5 DMSO_NA_LCL_polyA_5   <NA>    <NA> #969696 <NA>
6              NA   polyA   6 DMSO_NA_LCL_polyA_6   <NA>    <NA> #969696 <NA>
7   SMN_EC90 dose   polyA   1 WA01_NA_LCL_polyA_1   <NA>    <NA> #252525 <NA>
8   SMN_EC90 dose   polyA   2 WA01_NA_LCL_polyA_2   <NA>    <NA> #252525 <NA>
9   SMN_EC90 dose   polyA   1 WA02_NA_LCL_polyA_1   <NA>    <NA> #252525 <NA>
10  SMN_EC90 dose   polyA   2 WA02_NA_LCL_polyA_2   <NA>    <NA> #252525 <NA>
11  SMN_EC90 dose   polyA   1 WA03_NA_LCL_polyA_1   <NA>    <NA> #252525 <NA>
12  SMN_EC90 dose   polyA   2 WA03_NA_LCL_polyA_2   <NA>    <NA> #252525 <NA>
13  SMN_EC90 dose   polyA   1 WA04_NA_LCL_polyA_1   <NA>    <NA> #252525 <NA>
14  SMN_EC90 dose   polyA   2 WA04_NA_LCL_polyA_2   <NA>    <NA> #252525 <NA>
15  SMN_EC90 dose   polyA   1 WA05_NA_LCL_polyA_1   <NA>    <NA> #252525 <NA>
16  SMN_EC90 dose   polyA   2 WA05_NA_LCL_polyA_2   <NA>    <NA> #252525 <NA>
17  SMN_EC90 dose   polyA   1 WA06_NA_LCL_polyA_1   <NA>    <NA> #252525 <NA>
18  SMN_EC90 dose   polyA   2 WA06_NA_LCL_polyA_2   <NA>    <NA> #252525 <NA>
19  SMN_EC90 dose   polyA   1 WA07_NA_LCL_polyA_1   <NA>    <NA> #252525 <NA>
20  SMN_EC90 dose   polyA   2 WA07_NA_LCL_polyA_2   <NA>    <NA> #252525 <NA>
21  SMN_EC90 dose   polyA   1 WA08_NA_LCL_polyA_1   <NA>    <NA> #252525 <NA>
22  SMN_EC90 dose   polyA   2 WA08_NA_LCL_polyA_2   <NA>    <NA> #252525 <NA>
23  SMN_EC90 dose   polyA   1 WA09_NA_LCL_polyA_1   <NA>    <NA> #252525 <NA>
24  SMN_EC90 dose   polyA   2 WA09_NA_LCL_polyA_2   <NA>    <NA> #252525 <NA>
25  SMN_EC90 dose   polyA   1 WA10_NA_LCL_polyA_1   <NA>    <NA> #252525 <NA>
26  SMN_EC90 dose   polyA   2 WA10_NA_LCL_polyA_2   <NA>    <NA> #252525 <NA>
27  SMN_EC90 dose   polyA   1 WA11_NA_LCL_polyA_1   <NA>    <NA> #252525 <NA>
28  SMN_EC90 dose   polyA   2 WA11_NA_LCL_polyA_2   <NA>    <NA> #252525 <NA>
29  SMN_EC90 dose   polyA   1 WA12_NA_LCL_polyA_1   <NA>    <NA> #252525 <NA>
30  SMN_EC90 dose   polyA   2 WA12_NA_LCL_polyA_2   <NA>    <NA> #252525 <NA>
31  SMN_EC90 dose   polyA   1 WB01_NA_LCL_polyA_1   <NA>    <NA> #252525 <NA>
32  SMN_EC90 dose   polyA   2 WB01_NA_LCL_polyA_2   <NA>    <NA> #252525 <NA>
33  SMN_EC90 dose   polyA   1 WB02_NA_LCL_polyA_1   <NA>    <NA> #252525 <NA>
34  SMN_EC90 dose   polyA   2 WB02_NA_LCL_polyA_2   <NA>    <NA> #252525 <NA>
35  SMN_EC90 dose   polyA   1 WB03_NA_LCL_polyA_1   <NA>    <NA> #252525 <NA>
36  SMN_EC90 dose   polyA   2 WB03_NA_LCL_polyA_2   <NA>    <NA> #252525 <NA>
37  SMN_EC90 dose   polyA   1 WB04_NA_LCL_polyA_1   <NA>    <NA> #252525 <NA>
38  SMN_EC90 dose   polyA   2 WB04_NA_LCL_polyA_2   <NA>    <NA> #252525 <NA>
39  SMN_EC90 dose   polyA   1 WB05_NA_LCL_polyA_1   <NA>    <NA> #252525 <NA>
40  SMN_EC90 dose   polyA   2 WB05_NA_LCL_polyA_2   <NA>    <NA> #252525 <NA>
41  SMN_EC90 dose   polyA   1 WB06_NA_LCL_polyA_1   <NA>    <NA> #252525 <NA>
42  SMN_EC90 dose   polyA   2 WB06_NA_LCL_polyA_2   <NA>    <NA> #252525 <NA>
43  SMN_EC90 dose   polyA   1 WB07_NA_LCL_polyA_1   <NA>    <NA> #252525 <NA>
44  SMN_EC90 dose   polyA   2 WB07_NA_LCL_polyA_2   <NA>    <NA> #252525 <NA>
45  SMN_EC90 dose   polyA   1 WB08_NA_LCL_polyA_1   <NA>    <NA> #252525 <NA>
46  SMN_EC90 dose   polyA   2 WB08_NA_LCL_polyA_2   <NA>    <NA> #252525 <NA>
47  SMN_EC90 dose   polyA   1 WB09_NA_LCL_polyA_1   <NA>    <NA> #252525 <NA>
48  SMN_EC90 dose   polyA   2 WB09_NA_LCL_polyA_2   <NA>    <NA> #252525 <NA>
49  SMN_EC90 dose   polyA   1 WB10_NA_LCL_polyA_1   <NA>    <NA> #252525 <NA>
50  SMN_EC90 dose   polyA   2 WB10_NA_LCL_polyA_2   <NA>    <NA> #252525 <NA>
51  SMN_EC90 dose   polyA   1 WB11_NA_LCL_polyA_1   <NA>    <NA> #252525 <NA>
52  SMN_EC90 dose   polyA   2 WB11_NA_LCL_polyA_2   <NA>    <NA> #252525 <NA>
53  SMN_EC90 dose   polyA   1 WB12_NA_LCL_polyA_1   <NA>    <NA> #252525 <NA>
54  SMN_EC90 dose   polyA   2 WB12_NA_LCL_polyA_2   <NA>    <NA> #252525 <NA>
55  SMN_EC90 dose   polyA   1 WC01_NA_LCL_polyA_1   <NA>    <NA> #252525 <NA>
56  SMN_EC90 dose   polyA   2 WC01_NA_LCL_polyA_2   <NA>    <NA> #252525 <NA>
57  SMN_EC90 dose   polyA   1 WC02_NA_LCL_polyA_1   <NA>    <NA> #252525 <NA>
58  SMN_EC90 dose   polyA   2 WC02_NA_LCL_polyA_2   <NA>    <NA> #252525 <NA>
59  SMN_EC90 dose   polyA   1 WC03_NA_LCL_polyA_1   <NA>    <NA> #252525 <NA>
60  SMN_EC90 dose   polyA   2 WC03_NA_LCL_polyA_2   <NA>    <NA> #252525 <NA>
61  SMN_EC90 dose   polyA   1 WC04_NA_LCL_polyA_1   <NA>    <NA> #252525 <NA>
62  SMN_EC90 dose   polyA   2 WC04_NA_LCL_polyA_2   <NA>    <NA> #252525 <NA>
63  SMN_EC90 dose   polyA   1 WC05_NA_LCL_polyA_1   <NA>    <NA> #252525 <NA>
64  SMN_EC90 dose   polyA   2 WC05_NA_LCL_polyA_2   <NA>    <NA> #252525 <NA>
65  SMN_EC90 dose   polyA   1 WC06_NA_LCL_polyA_1   <NA>    <NA> #252525 <NA>
66  SMN_EC90 dose   polyA   2 WC06_NA_LCL_polyA_2   <NA>    <NA> #252525 <NA>
67  SMN_EC90 dose   polyA   1 WC07_NA_LCL_polyA_1   <NA>    <NA> #252525 <NA>
68  SMN_EC90 dose   polyA   2 WC07_NA_LCL_polyA_2   <NA>    <NA> #252525 <NA>
69  SMN_EC90 dose   polyA   1 WC08_NA_LCL_polyA_1   <NA>    <NA> #252525 <NA>
70  SMN_EC90 dose   polyA   2 WC08_NA_LCL_polyA_2   <NA>    <NA> #252525 <NA>
71  SMN_EC90 dose   polyA   1 WC09_NA_LCL_polyA_1   <NA>    <NA> #252525 <NA>
72  SMN_EC90 dose   polyA   2 WC09_NA_LCL_polyA_2   <NA>    <NA> #252525 <NA>
73  SMN_EC90 dose   polyA   1 WC10_NA_LCL_polyA_1   <NA>    <NA> #252525 <NA>
74  SMN_EC90 dose   polyA   2 WC10_NA_LCL_polyA_2   <NA>    <NA> #252525 <NA>
75  SMN_EC90 dose   polyA   1 WC11_NA_LCL_polyA_1   <NA>    <NA> #252525 <NA>
76  SMN_EC90 dose   polyA   2 WC11_NA_LCL_polyA_2   <NA>    <NA> #252525 <NA>
77  SMN_EC90 dose   polyA   1 WC12_NA_LCL_polyA_1   <NA>    <NA> #252525 <NA>
78  SMN_EC90 dose   polyA   2 WC12_NA_LCL_polyA_2   <NA>    <NA> #252525 <NA>
79  SMN_EC90 dose   polyA   1 WD01_NA_LCL_polyA_1   <NA>    <NA> #252525 <NA>
80  SMN_EC90 dose   polyA   2 WD01_NA_LCL_polyA_2   <NA>    <NA> #252525 <NA>
81  SMN_EC90 dose   polyA   1 WD02_NA_LCL_polyA_1   <NA>    <NA> #252525 <NA>
82  SMN_EC90 dose   polyA   2 WD02_NA_LCL_polyA_2   <NA>    <NA> #252525 <NA>
83  SMN_EC90 dose   polyA   1 WD03_NA_LCL_polyA_1   <NA>    <NA> #252525 <NA>
84  SMN_EC90 dose   polyA   2 WD03_NA_LCL_polyA_2   <NA>    <NA> #252525 <NA>
85  SMN_EC90 dose   polyA   1 WD04_NA_LCL_polyA_1   <NA>    <NA> #252525 <NA>
86  SMN_EC90 dose   polyA   2 WD04_NA_LCL_polyA_2   <NA>    <NA> #252525 <NA>
87  SMN_EC90 dose   polyA   1 WD05_NA_LCL_polyA_1   <NA>    <NA> #252525 <NA>
88  SMN_EC90 dose   polyA   2 WD05_NA_LCL_polyA_2   <NA>    <NA> #252525 <NA>
89  SMN_EC90 dose   polyA   1 WD06_NA_LCL_polyA_1   <NA>    <NA> #252525 <NA>
90  SMN_EC90 dose   polyA   2 WD06_NA_LCL_polyA_2   <NA>    <NA> #252525 <NA>
91  SMN_EC90 dose   polyA   1 WD07_NA_LCL_polyA_1   <NA>    <NA> #252525 <NA>
92  SMN_EC90 dose   polyA   2 WD07_NA_LCL_polyA_2   <NA>    <NA> #252525 <NA>
93  SMN_EC90 dose   polyA   1 WD08_NA_LCL_polyA_1   <NA>    <NA> #252525 <NA>
94  SMN_EC90 dose   polyA   2 WD08_NA_LCL_polyA_2   <NA>    <NA> #252525 <NA>
95  SMN_EC90 dose   polyA   1 WD09_NA_LCL_polyA_1   <NA>    <NA> #252525 <NA>
96  SMN_EC90 dose   polyA   2 WD09_NA_LCL_polyA_2   <NA>    <NA> #252525 <NA>
97  SMN_EC90 dose   polyA   1 WD10_NA_LCL_polyA_1   <NA>    <NA> #252525 <NA>
98  SMN_EC90 dose   polyA   2 WD10_NA_LCL_polyA_2   <NA>    <NA> #252525 <NA>
99  SMN_EC90 dose   polyA   1 WD11_NA_LCL_polyA_1   <NA>    <NA> #252525 <NA>
100 SMN_EC90 dose   polyA   2 WD11_NA_LCL_polyA_2   <NA>    <NA> #252525 <NA>
101 SMN_EC90 dose   polyA   1 WD12_NA_LCL_polyA_1   <NA>    <NA> #252525 <NA>
102 SMN_EC90 dose   polyA   2 WD12_NA_LCL_polyA_2   <NA>    <NA> #252525 <NA>
103 SMN_EC90 dose   polyA   1 WE01_NA_LCL_polyA_1   <NA>    <NA> #252525 <NA>
104 SMN_EC90 dose   polyA   2 WE01_NA_LCL_polyA_2   <NA>    <NA> #252525 <NA>
105 SMN_EC90 dose   polyA   1 WE02_NA_LCL_polyA_1   <NA>    <NA> #252525 <NA>
106 SMN_EC90 dose   polyA   2 WE02_NA_LCL_polyA_2   <NA>    <NA> #252525 <NA>
107 SMN_EC90 dose   polyA   1 WE03_NA_LCL_polyA_1   <NA>    <NA> #252525 <NA>
108 SMN_EC90 dose   polyA   2 WE03_NA_LCL_polyA_2   <NA>    <NA> #252525 <NA>
109 SMN_EC90 dose   polyA   1 WE04_NA_LCL_polyA_1   <NA>    <NA> #252525 <NA>
110 SMN_EC90 dose   polyA   2 WE04_NA_LCL_polyA_2   <NA>    <NA> #252525 <NA>
    supergroup           Experiment   leafcutter.name         label
1         <NA> Single high dose LCL NewMolecule.E05.1            NA
2         <NA> Single high dose LCL NewMolecule.E06.2            NA
3         <NA> Single high dose LCL NewMolecule.E07.3            NA
4         <NA> Single high dose LCL NewMolecule.E05.4            NA
5         <NA> Single high dose LCL NewMolecule.E06.5            NA
6         <NA> Single high dose LCL NewMolecule.E07.6            NA
7         <NA> Single high dose LCL NewMolecule.A01.1 SMN_EC90 dose
8         <NA> Single high dose LCL NewMolecule.A01.2 SMN_EC90 dose
9         <NA> Single high dose LCL NewMolecule.A02.1 SMN_EC90 dose
10        <NA> Single high dose LCL NewMolecule.A02.2 SMN_EC90 dose
11        <NA> Single high dose LCL NewMolecule.A03.1 SMN_EC90 dose
12        <NA> Single high dose LCL NewMolecule.A03.2 SMN_EC90 dose
13        <NA> Single high dose LCL NewMolecule.A04.1 SMN_EC90 dose
14        <NA> Single high dose LCL NewMolecule.A04.2 SMN_EC90 dose
15        <NA> Single high dose LCL NewMolecule.A05.1 SMN_EC90 dose
16        <NA> Single high dose LCL NewMolecule.A05.2 SMN_EC90 dose
17        <NA> Single high dose LCL NewMolecule.A06.1 SMN_EC90 dose
18        <NA> Single high dose LCL NewMolecule.A06.2 SMN_EC90 dose
19        <NA> Single high dose LCL NewMolecule.A07.1 SMN_EC90 dose
20        <NA> Single high dose LCL NewMolecule.A07.2 SMN_EC90 dose
21        <NA> Single high dose LCL NewMolecule.A08.1 SMN_EC90 dose
22        <NA> Single high dose LCL NewMolecule.A08.2 SMN_EC90 dose
23        <NA> Single high dose LCL NewMolecule.A09.1 SMN_EC90 dose
24        <NA> Single high dose LCL NewMolecule.A09.2 SMN_EC90 dose
25        <NA> Single high dose LCL NewMolecule.A10.1 SMN_EC90 dose
26        <NA> Single high dose LCL NewMolecule.A10.2 SMN_EC90 dose
27        <NA> Single high dose LCL NewMolecule.A11.1 SMN_EC90 dose
28        <NA> Single high dose LCL NewMolecule.A11.2 SMN_EC90 dose
29        <NA> Single high dose LCL NewMolecule.A12.1 SMN_EC90 dose
30        <NA> Single high dose LCL NewMolecule.A12.2 SMN_EC90 dose
31        <NA> Single high dose LCL NewMolecule.B01.1 SMN_EC90 dose
32        <NA> Single high dose LCL NewMolecule.B01.2 SMN_EC90 dose
33        <NA> Single high dose LCL NewMolecule.B02.1 SMN_EC90 dose
34        <NA> Single high dose LCL NewMolecule.B02.2 SMN_EC90 dose
35        <NA> Single high dose LCL NewMolecule.B03.1 SMN_EC90 dose
36        <NA> Single high dose LCL NewMolecule.B03.2 SMN_EC90 dose
37        <NA> Single high dose LCL NewMolecule.B04.1 SMN_EC90 dose
38        <NA> Single high dose LCL NewMolecule.B04.2 SMN_EC90 dose
39        <NA> Single high dose LCL NewMolecule.B05.1 SMN_EC90 dose
40        <NA> Single high dose LCL NewMolecule.B05.2 SMN_EC90 dose
41        <NA> Single high dose LCL NewMolecule.B06.1 SMN_EC90 dose
42        <NA> Single high dose LCL NewMolecule.B06.2 SMN_EC90 dose
43        <NA> Single high dose LCL NewMolecule.B07.1 SMN_EC90 dose
44        <NA> Single high dose LCL NewMolecule.B07.2 SMN_EC90 dose
45        <NA> Single high dose LCL NewMolecule.B08.1 SMN_EC90 dose
46        <NA> Single high dose LCL NewMolecule.B08.2 SMN_EC90 dose
47        <NA> Single high dose LCL NewMolecule.B09.1 SMN_EC90 dose
48        <NA> Single high dose LCL NewMolecule.B09.2 SMN_EC90 dose
49        <NA> Single high dose LCL NewMolecule.B10.1 SMN_EC90 dose
50        <NA> Single high dose LCL NewMolecule.B10.2 SMN_EC90 dose
51        <NA> Single high dose LCL NewMolecule.B11.1 SMN_EC90 dose
52        <NA> Single high dose LCL NewMolecule.B11.2 SMN_EC90 dose
53        <NA> Single high dose LCL NewMolecule.B12.1 SMN_EC90 dose
54        <NA> Single high dose LCL NewMolecule.B12.2 SMN_EC90 dose
55        <NA> Single high dose LCL NewMolecule.C01.1 SMN_EC90 dose
56        <NA> Single high dose LCL NewMolecule.C01.2 SMN_EC90 dose
57        <NA> Single high dose LCL NewMolecule.C02.1 SMN_EC90 dose
58        <NA> Single high dose LCL NewMolecule.C02.2 SMN_EC90 dose
59        <NA> Single high dose LCL NewMolecule.C03.1 SMN_EC90 dose
60        <NA> Single high dose LCL NewMolecule.C03.2 SMN_EC90 dose
61        <NA> Single high dose LCL NewMolecule.C04.1 SMN_EC90 dose
62        <NA> Single high dose LCL NewMolecule.C04.2 SMN_EC90 dose
63        <NA> Single high dose LCL NewMolecule.C05.1 SMN_EC90 dose
64        <NA> Single high dose LCL NewMolecule.C05.2 SMN_EC90 dose
65        <NA> Single high dose LCL NewMolecule.C06.1 SMN_EC90 dose
66        <NA> Single high dose LCL NewMolecule.C06.2 SMN_EC90 dose
67        <NA> Single high dose LCL NewMolecule.C07.1 SMN_EC90 dose
68        <NA> Single high dose LCL NewMolecule.C07.2 SMN_EC90 dose
69        <NA> Single high dose LCL NewMolecule.C08.1 SMN_EC90 dose
70        <NA> Single high dose LCL NewMolecule.C08.2 SMN_EC90 dose
71        <NA> Single high dose LCL NewMolecule.C09.1 SMN_EC90 dose
72        <NA> Single high dose LCL NewMolecule.C09.2 SMN_EC90 dose
73        <NA> Single high dose LCL NewMolecule.C10.1 SMN_EC90 dose
74        <NA> Single high dose LCL NewMolecule.C10.2 SMN_EC90 dose
75        <NA> Single high dose LCL NewMolecule.C11.1 SMN_EC90 dose
76        <NA> Single high dose LCL NewMolecule.C11.2 SMN_EC90 dose
77        <NA> Single high dose LCL NewMolecule.C12.1 SMN_EC90 dose
78        <NA> Single high dose LCL NewMolecule.C12.2 SMN_EC90 dose
79        <NA> Single high dose LCL NewMolecule.D01.1 SMN_EC90 dose
80        <NA> Single high dose LCL NewMolecule.D01.2 SMN_EC90 dose
81        <NA> Single high dose LCL NewMolecule.D02.1 SMN_EC90 dose
82        <NA> Single high dose LCL NewMolecule.D02.2 SMN_EC90 dose
83        <NA> Single high dose LCL NewMolecule.D03.1 SMN_EC90 dose
84        <NA> Single high dose LCL NewMolecule.D03.2 SMN_EC90 dose
85        <NA> Single high dose LCL NewMolecule.D04.1 SMN_EC90 dose
86        <NA> Single high dose LCL NewMolecule.D04.2 SMN_EC90 dose
87        <NA> Single high dose LCL NewMolecule.D05.1 SMN_EC90 dose
88        <NA> Single high dose LCL NewMolecule.D05.2 SMN_EC90 dose
89        <NA> Single high dose LCL NewMolecule.D06.1 SMN_EC90 dose
90        <NA> Single high dose LCL NewMolecule.D06.2 SMN_EC90 dose
91        <NA> Single high dose LCL NewMolecule.D07.1 SMN_EC90 dose
92        <NA> Single high dose LCL NewMolecule.D07.2 SMN_EC90 dose
93        <NA> Single high dose LCL NewMolecule.D08.1 SMN_EC90 dose
94        <NA> Single high dose LCL NewMolecule.D08.2 SMN_EC90 dose
95        <NA> Single high dose LCL NewMolecule.D09.1 SMN_EC90 dose
96        <NA> Single high dose LCL NewMolecule.D09.2 SMN_EC90 dose
97        <NA> Single high dose LCL NewMolecule.D10.1 SMN_EC90 dose
98        <NA> Single high dose LCL NewMolecule.D10.2 SMN_EC90 dose
99        <NA> Single high dose LCL NewMolecule.D11.1 SMN_EC90 dose
100       <NA> Single high dose LCL NewMolecule.D11.2 SMN_EC90 dose
101       <NA> Single high dose LCL NewMolecule.D12.1 SMN_EC90 dose
102       <NA> Single high dose LCL NewMolecule.D12.2 SMN_EC90 dose
103       <NA> Single high dose LCL NewMolecule.E01.1 SMN_EC90 dose
104       <NA> Single high dose LCL NewMolecule.E01.2 SMN_EC90 dose
105       <NA> Single high dose LCL NewMolecule.E02.1 SMN_EC90 dose
106       <NA> Single high dose LCL NewMolecule.E02.2 SMN_EC90 dose
107       <NA> Single high dose LCL NewMolecule.E03.1 SMN_EC90 dose
108       <NA> Single high dose LCL NewMolecule.E03.2 SMN_EC90 dose
109       <NA> Single high dose LCL NewMolecule.E04.1 SMN_EC90 dose
110       <NA> Single high dose LCL NewMolecule.E04.2 SMN_EC90 dose
ReadsPerDataset <- ggplot(counts.plot.dat, aes(x=old.sample.name, y=lib.size/2E6, fill=color)) +
  geom_col() +
  scale_fill_identity() +
  scale_x_discrete(name="dose (nanomolar)", label=counts.plot.labels) +
  scale_y_continuous(expand=c(0,0)) +
  theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust=1, size=3)) +
  theme(strip.text.x = element_text(size = 8)) +
  facet_grid(~Experiment, scales = "free_x", space = "free_x", labeller = label_wrap_gen(15)) +
  labs(title="RNA-seq datasets", y="Read count (M)")
ReadsPerDataset

Version Author Date
a8c9152 Benjmain Fair 2023-02-23

So for the most part, these libraries were sequenced deeper than any of our previous experiments with Jingxin. Let’s make note of that one outlier sample that was clearly not sequenced well. This sample will probably have to be excluded from further analysis…

counts.plot.dat %>%
  arrange(lib.size) %>%
  head(1)
    old.sample.name group.x lib.size norm.factors treatment cell.type
1 NewMolecule.C04-2       1  4025684    0.8837338      WC04       LCL
        dose.nM libType rep          SampleName bigwig group.y   color  bed
1 SMN_EC90 dose   polyA   2 WC04_NA_LCL_polyA_2   <NA>    <NA> #252525 <NA>
  supergroup           Experiment   leafcutter.name         label
1       <NA> Single high dose LCL NewMolecule.C04.2 SMN_EC90 dose

Ok so rep2 of the molecule in well C04 will probably have to be excluded. 4M reads isn’t enough to get much out of, and I sort of worry whether those 4M reads are even from the corresponding library versus barcode contamination from other samples on the lane…

Gene expression PCA

First let’s perform PCA with all samples, including fibroblasts. I know this isn’t necessarily the most interpretable, but I am just curious what the first few PCs will look like…

SamplesToInclude <- FullMetadata %>%
  pull(old.sample.name)

CPM <- gene.counts %>%
  cpm(log=T, prior.count=T) %>%
  as.data.frame() %>%
  rownames_to_column("Geneid") %>%
  dplyr::select(Geneid, all_of(SamplesToInclude))

Top14K_ExpressedGenes <- (CPM %>%
  column_to_rownames("Geneid") %>%
  apply(1, mean) %>%
  sort(decreasing=T))[1:14000] %>%
  names()

pca.results <- CPM %>%
  filter(Geneid %in% Top14K_ExpressedGenes) %>%
  column_to_rownames("Geneid") %>%
  scale() %>% t() %>% prcomp(scale=T)

summary(pca.results)
Importance of components:
                           PC1     PC2     PC3      PC4      PC5      PC6
Standard deviation     72.6439 62.2561 42.4165 30.06520 18.23340 14.95023
Proportion of Variance  0.3769  0.2768  0.1285  0.06457  0.02375  0.01596
Cumulative Proportion   0.3769  0.6538  0.7823  0.84686  0.87061  0.88657
                            PC7      PC8      PC9    PC10    PC11    PC12
Standard deviation     12.10732 11.51993 10.52554 8.64035 8.05303 7.52829
Proportion of Variance  0.01047  0.00948  0.00791 0.00533 0.00463 0.00405
Cumulative Proportion   0.89704  0.90652  0.91443 0.91977 0.92440 0.92845
                          PC13    PC14    PC15    PC16    PC17    PC18   PC19
Standard deviation     7.43240 6.07546 5.90323 5.71783 5.45541 4.70763 4.5829
Proportion of Variance 0.00395 0.00264 0.00249 0.00234 0.00213 0.00158 0.0015
Cumulative Proportion  0.93239 0.93503 0.93752 0.93985 0.94198 0.94356 0.9451
                          PC20   PC21    PC22    PC23    PC24    PC25    PC26
Standard deviation     4.34842 4.2642 4.18983 4.05111 3.70776 3.66428 3.52860
Proportion of Variance 0.00135 0.0013 0.00125 0.00117 0.00098 0.00096 0.00089
Cumulative Proportion  0.94641 0.9477 0.94897 0.95014 0.95112 0.95208 0.95297
                          PC27    PC28    PC29    PC30    PC31    PC32   PC33
Standard deviation     3.46988 3.42074 3.30835 3.25718 3.20878 3.19963 3.1289
Proportion of Variance 0.00086 0.00084 0.00078 0.00076 0.00074 0.00073 0.0007
Cumulative Proportion  0.95383 0.95466 0.95545 0.95620 0.95694 0.95767 0.9584
                          PC34    PC35    PC36    PC37    PC38    PC39    PC40
Standard deviation     3.09510 3.07249 3.04599 3.02764 2.96070 2.91136 2.87777
Proportion of Variance 0.00068 0.00067 0.00066 0.00065 0.00063 0.00061 0.00059
Cumulative Proportion  0.95905 0.95973 0.96039 0.96105 0.96167 0.96228 0.96287
                          PC41    PC42    PC43    PC44    PC45    PC46    PC47
Standard deviation     2.85418 2.81059 2.76061 2.75270 2.74102 2.70975 2.66896
Proportion of Variance 0.00058 0.00056 0.00054 0.00054 0.00054 0.00052 0.00051
Cumulative Proportion  0.96345 0.96402 0.96456 0.96510 0.96564 0.96616 0.96667
                         PC48    PC49    PC50    PC51    PC52    PC53    PC54
Standard deviation     2.6481 2.62696 2.59712 2.56767 2.55087 2.53274 2.49007
Proportion of Variance 0.0005 0.00049 0.00048 0.00047 0.00046 0.00046 0.00044
Cumulative Proportion  0.9672 0.96767 0.96815 0.96862 0.96908 0.96954 0.96998
                          PC55    PC56    PC57    PC58    PC59    PC60    PC61
Standard deviation     2.48213 2.47583 2.46033 2.44854 2.41859 2.40764 2.38683
Proportion of Variance 0.00044 0.00044 0.00043 0.00043 0.00042 0.00041 0.00041
Cumulative Proportion  0.97042 0.97086 0.97129 0.97172 0.97214 0.97255 0.97296
                         PC62    PC63    PC64    PC65    PC66    PC67    PC68
Standard deviation     2.3676 2.34748 2.33774 2.32323 2.30991 2.29735 2.28203
Proportion of Variance 0.0004 0.00039 0.00039 0.00039 0.00038 0.00038 0.00037
Cumulative Proportion  0.9734 0.97375 0.97415 0.97453 0.97491 0.97529 0.97566
                          PC69    PC70    PC71    PC72    PC73    PC74    PC75
Standard deviation     2.26430 2.25797 2.25197 2.23527 2.21472 2.20971 2.19590
Proportion of Variance 0.00037 0.00036 0.00036 0.00036 0.00035 0.00035 0.00034
Cumulative Proportion  0.97603 0.97639 0.97675 0.97711 0.97746 0.97781 0.97815
                          PC76    PC77    PC78    PC79    PC80    PC81    PC82
Standard deviation     2.19124 2.16918 2.15949 2.14600 2.13137 2.11768 2.10960
Proportion of Variance 0.00034 0.00034 0.00033 0.00033 0.00032 0.00032 0.00032
Cumulative Proportion  0.97850 0.97883 0.97917 0.97950 0.97982 0.98014 0.98046
                          PC83    PC84    PC85    PC86    PC87   PC88   PC89
Standard deviation     2.10420 2.09067 2.07932 2.07158 2.06890 2.0552 2.0510
Proportion of Variance 0.00032 0.00031 0.00031 0.00031 0.00031 0.0003 0.0003
Cumulative Proportion  0.98077 0.98109 0.98140 0.98170 0.98201 0.9823 0.9826
                         PC90    PC91    PC92    PC93    PC94    PC95    PC96
Standard deviation     2.0325 2.02628 2.02095 2.01505 2.01453 2.00140 1.99207
Proportion of Variance 0.0003 0.00029 0.00029 0.00029 0.00029 0.00029 0.00028
Cumulative Proportion  0.9829 0.98320 0.98349 0.98378 0.98407 0.98436 0.98464
                          PC97    PC98    PC99   PC100   PC101   PC102   PC103
Standard deviation     1.97879 1.96978 1.96142 1.95843 1.94605 1.93103 1.92697
Proportion of Variance 0.00028 0.00028 0.00027 0.00027 0.00027 0.00027 0.00027
Cumulative Proportion  0.98492 0.98520 0.98547 0.98574 0.98602 0.98628 0.98655
                         PC104   PC105   PC106   PC107   PC108   PC109   PC110
Standard deviation     1.91522 1.90976 1.90458 1.90223 1.89465 1.89031 1.88708
Proportion of Variance 0.00026 0.00026 0.00026 0.00026 0.00026 0.00026 0.00025
Cumulative Proportion  0.98681 0.98707 0.98733 0.98759 0.98784 0.98810 0.98835
                         PC111   PC112   PC113   PC114   PC115   PC116   PC117
Standard deviation     1.86688 1.86214 1.85424 1.85032 1.83938 1.82714 1.82318
Proportion of Variance 0.00025 0.00025 0.00025 0.00024 0.00024 0.00024 0.00024
Cumulative Proportion  0.98860 0.98885 0.98910 0.98934 0.98958 0.98982 0.99006
                         PC118   PC119   PC120   PC121   PC122   PC123   PC124
Standard deviation     1.80790 1.80509 1.79744 1.78928 1.77623 1.77575 1.76304
Proportion of Variance 0.00023 0.00023 0.00023 0.00023 0.00023 0.00023 0.00022
Cumulative Proportion  0.99029 0.99052 0.99075 0.99098 0.99121 0.99143 0.99166
                         PC125   PC126   PC127   PC128   PC129   PC130   PC131
Standard deviation     1.75581 1.74527 1.74072 1.73710 1.72480 1.71605 1.71063
Proportion of Variance 0.00022 0.00022 0.00022 0.00022 0.00021 0.00021 0.00021
Cumulative Proportion  0.99188 0.99209 0.99231 0.99253 0.99274 0.99295 0.99316
                         PC132   PC133  PC134  PC135  PC136  PC137   PC138
Standard deviation     1.70481 1.69772 1.6895 1.6757 1.6641 1.6591 1.64981
Proportion of Variance 0.00021 0.00021 0.0002 0.0002 0.0002 0.0002 0.00019
Cumulative Proportion  0.99336 0.99357 0.9938 0.9940 0.9942 0.9944 0.99456
                         PC139   PC140   PC141   PC142   PC143   PC144   PC145
Standard deviation     1.64177 1.63931 1.62764 1.61481 1.61010 1.60438 1.59635
Proportion of Variance 0.00019 0.00019 0.00019 0.00019 0.00019 0.00018 0.00018
Cumulative Proportion  0.99476 0.99495 0.99514 0.99532 0.99551 0.99569 0.99587
                         PC146   PC147   PC148   PC149   PC150   PC151   PC152
Standard deviation     1.59185 1.58410 1.57662 1.56689 1.55334 1.55064 1.54392
Proportion of Variance 0.00018 0.00018 0.00018 0.00018 0.00017 0.00017 0.00017
Cumulative Proportion  0.99606 0.99624 0.99641 0.99659 0.99676 0.99693 0.99710
                         PC153   PC154   PC155   PC156   PC157   PC158   PC159
Standard deviation     1.52924 1.52064 1.51619 1.50377 1.49740 1.48478 1.47533
Proportion of Variance 0.00017 0.00017 0.00016 0.00016 0.00016 0.00016 0.00016
Cumulative Proportion  0.99727 0.99743 0.99760 0.99776 0.99792 0.99808 0.99823
                         PC160   PC161   PC162   PC163   PC164   PC165   PC166
Standard deviation     1.46910 1.45457 1.44128 1.42765 1.41422 1.39923 1.38427
Proportion of Variance 0.00015 0.00015 0.00015 0.00015 0.00014 0.00014 0.00014
Cumulative Proportion  0.99839 0.99854 0.99869 0.99883 0.99898 0.99912 0.99925
                         PC167   PC168   PC169   PC170   PC171   PC172
Standard deviation     1.37542 1.37181 1.35146 1.31545 1.27766 1.22693
Proportion of Variance 0.00014 0.00013 0.00013 0.00012 0.00012 0.00011
Cumulative Proportion  0.99939 0.99952 0.99965 0.99978 0.99989 1.00000
                           PC173
Standard deviation     1.839e-14
Proportion of Variance 0.000e+00
Cumulative Proportion  1.000e+00
pca.results.to.plot <- pca.results$x %>%
  as.data.frame() %>%
  rownames_to_column("old.sample.name") %>%
  dplyr::select(old.sample.name, PC1:PC6) %>%
  left_join(FullMetadata, by="old.sample.name")

ggplot(pca.results.to.plot, aes(x=PC1, y=PC2, color=color, shape=Experiment)) +
  geom_point(size=3) +
  scale_color_identity() +
  theme_bw() +
  labs(title = "PCA using top 14K genes\nAll samples")

Version Author Date
a8c9152 Benjmain Fair 2023-02-23
ggplot(pca.results.to.plot, aes(x=PC3, y=PC4, color=color, shape=Experiment)) +
  geom_point(size=3) +
  scale_color_identity() +
  theme_bw() +
  labs(title = "PCA using top 14K genes\nAll samples")

Version Author Date
a8c9152 Benjmain Fair 2023-02-23
ggplot(pca.results.to.plot, aes(x=PC5, y=PC6, color=color, shape=Experiment)) +
  geom_point(size=3) +
  scale_color_identity() +
  theme_bw() +
  labs(title = "PCA using top 14K genes\nAll samples")

Version Author Date
a8c9152 Benjmain Fair 2023-02-23

Let’s repeat but just include the dose titration experiment samples and this recent experiment of 52 molecules…

SamplesToInclude <- FullMetadata %>%
  filter(Experiment %in% c("Single high dose LCL", "Dose response titration")) %>%
  pull(old.sample.name)

CPM <- gene.counts %>%
  cpm(log=T, prior.count=T) %>%
  as.data.frame() %>%
  rownames_to_column("Geneid") %>%
  dplyr::select(Geneid, all_of(SamplesToInclude))

Top14K_ExpressedGenes <- (CPM %>%
  column_to_rownames("Geneid") %>%
  apply(1, mean) %>%
  sort(decreasing=T))[1:14000] %>%
  names()

pca.results <- CPM %>%
  filter(Geneid %in% Top14K_ExpressedGenes) %>%
  column_to_rownames("Geneid") %>%
  scale() %>% t() %>% prcomp(scale=T)

summary(pca.results)
Importance of components:
                           PC1     PC2      PC3      PC4      PC5      PC6
Standard deviation     83.1894 55.0561 24.77276 24.28695 19.94844 14.70160
Proportion of Variance  0.4943  0.2165  0.04383  0.04213  0.02842  0.01544
Cumulative Proportion   0.4943  0.7108  0.75467  0.79680  0.82522  0.84066
                            PC7     PC8      PC9    PC10    PC11    PC12
Standard deviation     13.00748 11.7139 10.86657 9.32844 8.28935 7.53691
Proportion of Variance  0.01209  0.0098  0.00843 0.00622 0.00491 0.00406
Cumulative Proportion   0.85275  0.8626  0.87098 0.87720 0.88211 0.88616
                          PC13    PC14    PC15    PC16    PC17    PC18   PC19
Standard deviation     7.24574 6.92730 6.29126 5.97731 5.69311 5.48199 5.2932
Proportion of Variance 0.00375 0.00343 0.00283 0.00255 0.00232 0.00215 0.0020
Cumulative Proportion  0.88991 0.89334 0.89617 0.89872 0.90104 0.90318 0.9052
                          PC20    PC21    PC22    PC23    PC24    PC25    PC26
Standard deviation     5.04198 4.97192 4.91997 4.79399 4.69313 4.61243 4.55639
Proportion of Variance 0.00182 0.00177 0.00173 0.00164 0.00157 0.00152 0.00148
Cumulative Proportion  0.90700 0.90877 0.91050 0.91214 0.91371 0.91523 0.91671
                          PC27    PC28   PC29    PC30    PC31    PC32    PC33
Standard deviation     4.54571 4.46480 4.4291 4.38057 4.30252 4.24036 4.19862
Proportion of Variance 0.00148 0.00142 0.0014 0.00137 0.00132 0.00128 0.00126
Cumulative Proportion  0.91819 0.91961 0.9210 0.92238 0.92371 0.92499 0.92625
                          PC34    PC35   PC36    PC37    PC38    PC39    PC40
Standard deviation     4.17229 4.12458 4.0955 4.07422 4.06222 3.98905 3.96266
Proportion of Variance 0.00124 0.00122 0.0012 0.00119 0.00118 0.00114 0.00112
Cumulative Proportion  0.92749 0.92871 0.9299 0.93109 0.93227 0.93341 0.93453
                         PC41    PC42    PC43    PC44    PC45    PC46    PC47
Standard deviation     3.9299 3.89363 3.89227 3.82138 3.81024 3.77689 3.75136
Proportion of Variance 0.0011 0.00108 0.00108 0.00104 0.00104 0.00102 0.00101
Cumulative Proportion  0.9356 0.93672 0.93780 0.93884 0.93988 0.94090 0.94190
                          PC48    PC49    PC50    PC51    PC52    PC53    PC54
Standard deviation     3.72866 3.72341 3.69228 3.66300 3.65323 3.62103 3.60460
Proportion of Variance 0.00099 0.00099 0.00097 0.00096 0.00095 0.00094 0.00093
Cumulative Proportion  0.94289 0.94388 0.94486 0.94582 0.94677 0.94771 0.94864
                          PC55    PC56   PC57    PC58    PC59    PC60    PC61
Standard deviation     3.58833 3.57787 3.5416 3.51746 3.50854 3.48229 3.46762
Proportion of Variance 0.00092 0.00091 0.0009 0.00088 0.00088 0.00087 0.00086
Cumulative Proportion  0.94955 0.95047 0.9514 0.95225 0.95313 0.95399 0.95485
                          PC62    PC63    PC64    PC65    PC66    PC67   PC68
Standard deviation     3.45692 3.43346 3.42045 3.39272 3.38289 3.36342 3.3474
Proportion of Variance 0.00085 0.00084 0.00084 0.00082 0.00082 0.00081 0.0008
Cumulative Proportion  0.95571 0.95655 0.95738 0.95821 0.95902 0.95983 0.9606
                          PC69    PC70    PC71    PC72    PC73    PC74    PC75
Standard deviation     3.32739 3.31967 3.30262 3.28465 3.28166 3.25750 3.23148
Proportion of Variance 0.00079 0.00079 0.00078 0.00077 0.00077 0.00076 0.00075
Cumulative Proportion  0.96142 0.96221 0.96299 0.96376 0.96453 0.96529 0.96603
                          PC76    PC77    PC78    PC79    PC80    PC81    PC82
Standard deviation     3.22940 3.21138 3.19718 3.18893 3.17672 3.16148 3.14200
Proportion of Variance 0.00074 0.00074 0.00073 0.00073 0.00072 0.00071 0.00071
Cumulative Proportion  0.96678 0.96751 0.96825 0.96897 0.96969 0.97041 0.97111
                         PC83   PC84    PC85    PC86    PC87    PC88    PC89
Standard deviation     3.1396 3.1281 3.11272 3.09873 3.07685 3.03933 3.03570
Proportion of Variance 0.0007 0.0007 0.00069 0.00069 0.00068 0.00066 0.00066
Cumulative Proportion  0.9718 0.9725 0.97321 0.97389 0.97457 0.97523 0.97589
                          PC90    PC91    PC92    PC93    PC94    PC95    PC96
Standard deviation     3.02075 3.01671 2.99872 2.98609 2.97603 2.95891 2.94129
Proportion of Variance 0.00065 0.00065 0.00064 0.00064 0.00063 0.00063 0.00062
Cumulative Proportion  0.97654 0.97719 0.97783 0.97847 0.97910 0.97973 0.98034
                          PC97    PC98   PC99   PC100   PC101   PC102   PC103
Standard deviation     2.91912 2.91648 2.8908 2.87791 2.86451 2.84156 2.83717
Proportion of Variance 0.00061 0.00061 0.0006 0.00059 0.00059 0.00058 0.00057
Cumulative Proportion  0.98095 0.98156 0.9822 0.98275 0.98333 0.98391 0.98449
                         PC104   PC105   PC106   PC107   PC108   PC109   PC110
Standard deviation     2.82141 2.79989 2.77712 2.76955 2.75960 2.74038 2.73493
Proportion of Variance 0.00057 0.00056 0.00055 0.00055 0.00054 0.00054 0.00053
Cumulative Proportion  0.98505 0.98561 0.98617 0.98671 0.98726 0.98779 0.98833
                         PC111   PC112   PC113   PC114   PC115  PC116  PC117
Standard deviation     2.72667 2.71699 2.69161 2.67425 2.67144 2.6408 2.6329
Proportion of Variance 0.00053 0.00053 0.00052 0.00051 0.00051 0.0005 0.0005
Cumulative Proportion  0.98886 0.98939 0.98990 0.99041 0.99092 0.9914 0.9919
                         PC118   PC119   PC120   PC121   PC122   PC123   PC124
Standard deviation     2.62226 2.60273 2.57230 2.56833 2.53999 2.53069 2.50099
Proportion of Variance 0.00049 0.00048 0.00047 0.00047 0.00046 0.00046 0.00045
Cumulative Proportion  0.99241 0.99289 0.99337 0.99384 0.99430 0.99475 0.99520
                         PC125   PC126   PC127   PC128   PC129   PC130  PC131
Standard deviation     2.48410 2.47870 2.46989 2.41798 2.40858 2.39057 2.3799
Proportion of Variance 0.00044 0.00044 0.00044 0.00042 0.00041 0.00041 0.0004
Cumulative Proportion  0.99564 0.99608 0.99652 0.99693 0.99735 0.99776 0.9982
                        PC132   PC133   PC134   PC135   PC136     PC137
Standard deviation     2.3681 2.33293 2.28534 2.21770 2.13166 3.325e-14
Proportion of Variance 0.0004 0.00039 0.00037 0.00035 0.00032 0.000e+00
Cumulative Proportion  0.9986 0.99895 0.99932 0.99968 1.00000 1.000e+00
pca.results.to.plot <- pca.results$x %>%
  as.data.frame() %>%
  rownames_to_column("old.sample.name") %>%
  dplyr::select(old.sample.name, PC1:PC6) %>%
  left_join(FullMetadata, by="old.sample.name")

ggplot(pca.results.to.plot, aes(x=PC1, y=PC2, color=color, shape=Experiment)) +
  geom_point(size=3) +
  scale_color_identity() +
  theme_bw() +
  labs(title = "PCA using top 14K genes\nTitration experiment and ExpOf52")

Version Author Date
a8c9152 Benjmain Fair 2023-02-23
ggplot(pca.results.to.plot, aes(x=PC3, y=PC4, color=color, shape=Experiment)) +
  geom_point(size=3) +
  scale_color_identity() +
  theme_bw() +
  labs(title = "PCA using top 14K genes\nTitration experiment and ExpOf52")

Version Author Date
a8c9152 Benjmain Fair 2023-02-23
ggplot(pca.results.to.plot, aes(x=PC5, y=PC6, color=color, shape=Experiment)) +
  geom_point(size=3) +
  scale_color_identity() +
  theme_bw() +
  labs(title = "PCA using top 14K genes\nTitration experiment and ExpOf52")

Version Author Date
a8c9152 Benjmain Fair 2023-02-23
pca.results.to.plot %>%
  filter(PC5>200)
    old.sample.name      PC1      PC2      PC3      PC4      PC5      PC6
1 NewMolecule.C04-2 78.41428 41.81794 7.094914 17.13685 217.0802 48.79238
  treatment cell.type dose.nM libType rep          SampleName bigwig group
1      WC04       LCL      NA   polyA   2 WC04_NA_LCL_polyA_2   <NA>  <NA>
    color  bed supergroup           Experiment   leafcutter.name
1 #252525 <NA>       <NA> Single high dose LCL NewMolecule.C04.2

Ok… some interesting patterns. That one outlier in PC5 is the sample with low read depth that I want to exclude. It seems like the lab/batch effects are quite strong, the first two PCs have nothing to do with dose. (Lighter gray points are the DMSO controls in both experiments).. Maybe PC3 is related to this? But it’s not going to be so straightforward to integrate this experiment of 52 data with the previous dose titration experiment also done in LCLs. Well at least previously when I perform PCA based on splicing junction excision ratios these batch and tissue effects seem to dissappear in the PCA, so maybe those splicing comparisons might be more fair without more careful integration. Now let’s try PCA again but now just looking at the samples in this recent experiment of 52 molecules, to check at least that the DMSO samples cluster together…

SamplesToInclude <- FullMetadata %>%
  filter(Experiment %in% c("Single high dose LCL")) %>%
  filter(!old.sample.name == "NewMolecule.C04-2") %>%
  pull(old.sample.name)

CPM <- gene.counts %>%
  cpm(log=T, prior.count=T) %>%
  as.data.frame() %>%
  rownames_to_column("Geneid") %>%
  dplyr::select(Geneid, all_of(SamplesToInclude))

Top14K_ExpressedGenes <- (CPM %>%
  column_to_rownames("Geneid") %>%
  apply(1, mean) %>%
  sort(decreasing=T))[1:14000] %>%
  names()

pca.results <- CPM %>%
  filter(Geneid %in% Top14K_ExpressedGenes) %>%
  column_to_rownames("Geneid") %>%
  scale() %>% t() %>% prcomp(scale=T)

summary(pca.results)
Importance of components:
                          PC1      PC2      PC3     PC4      PC5      PC6
Standard deviation     90.959 35.44498 25.33393 18.9319 16.70679 13.71600
Proportion of Variance  0.591  0.08974  0.04584  0.0256  0.01994  0.01344
Cumulative Proportion   0.591  0.68071  0.72655  0.7521  0.77209  0.78553
                           PC7      PC8      PC9    PC10    PC11    PC12
Standard deviation     13.2829 11.50160 10.55358 9.41743 8.79989 8.34584
Proportion of Variance  0.0126  0.00945  0.00796 0.00633 0.00553 0.00498
Cumulative Proportion   0.7981  0.80758  0.81553 0.82187 0.82740 0.83238
                          PC13    PC14    PC15   PC16    PC17    PC18    PC19
Standard deviation     8.08116 7.48997 7.28621 7.0990 6.92833 6.66918 6.61082
Proportion of Variance 0.00466 0.00401 0.00379 0.0036 0.00343 0.00318 0.00312
Cumulative Proportion  0.83704 0.84105 0.84484 0.8484 0.85187 0.85505 0.85817
                          PC20    PC21    PC22    PC23    PC24    PC25    PC26
Standard deviation     6.51037 6.40642 6.33377 6.27073 6.12692 6.07714 5.94719
Proportion of Variance 0.00303 0.00293 0.00287 0.00281 0.00268 0.00264 0.00253
Cumulative Proportion  0.86119 0.86413 0.86699 0.86980 0.87248 0.87512 0.87765
                          PC27    PC28    PC29    PC30    PC31    PC32   PC33
Standard deviation     5.87452 5.82229 5.78341 5.75898 5.68407 5.60317 5.5522
Proportion of Variance 0.00247 0.00242 0.00239 0.00237 0.00231 0.00224 0.0022
Cumulative Proportion  0.88011 0.88253 0.88492 0.88729 0.88960 0.89184 0.8940
                          PC34    PC35    PC36    PC37    PC38    PC39    PC40
Standard deviation     5.53457 5.40381 5.40268 5.35630 5.33855 5.27676 5.24385
Proportion of Variance 0.00219 0.00209 0.00208 0.00205 0.00204 0.00199 0.00196
Cumulative Proportion  0.89623 0.89832 0.90040 0.90245 0.90449 0.90647 0.90844
                          PC41    PC42    PC43    PC44    PC45    PC46    PC47
Standard deviation     5.21234 5.20952 5.18056 5.13364 5.12455 5.08510 5.05057
Proportion of Variance 0.00194 0.00194 0.00192 0.00188 0.00188 0.00185 0.00182
Cumulative Proportion  0.91038 0.91232 0.91424 0.91612 0.91799 0.91984 0.92166
                          PC48    PC49    PC50    PC51    PC52    PC53    PC54
Standard deviation     5.02761 4.96764 4.91994 4.90446 4.85925 4.81311 4.80058
Proportion of Variance 0.00181 0.00176 0.00173 0.00172 0.00169 0.00165 0.00165
Cumulative Proportion  0.92347 0.92523 0.92696 0.92868 0.93036 0.93202 0.93367
                          PC55    PC56    PC57    PC58    PC59    PC60    PC61
Standard deviation     4.78097 4.76876 4.74355 4.70452 4.67266 4.66706 4.64767
Proportion of Variance 0.00163 0.00162 0.00161 0.00158 0.00156 0.00156 0.00154
Cumulative Proportion  0.93530 0.93692 0.93853 0.94011 0.94167 0.94323 0.94477
                          PC62    PC63    PC64    PC65    PC66    PC67    PC68
Standard deviation     4.63563 4.59928 4.56367 4.54641 4.53988 4.52220 4.49649
Proportion of Variance 0.00153 0.00151 0.00149 0.00148 0.00147 0.00146 0.00144
Cumulative Proportion  0.94630 0.94781 0.94930 0.95078 0.95225 0.95371 0.95516
                          PC69    PC70    PC71   PC72    PC73    PC74    PC75
Standard deviation     4.48580 4.46001 4.43564 4.4250 4.39833 4.35304 4.33560
Proportion of Variance 0.00144 0.00142 0.00141 0.0014 0.00138 0.00135 0.00134
Cumulative Proportion  0.95659 0.95801 0.95942 0.9608 0.96220 0.96355 0.96490
                          PC76   PC77   PC78    PC79    PC80    PC81    PC82
Standard deviation     4.28467 4.2674 4.2613 4.23460 4.21085 4.19227 4.17845
Proportion of Variance 0.00131 0.0013 0.0013 0.00128 0.00127 0.00126 0.00125
Cumulative Proportion  0.96621 0.9675 0.9688 0.97009 0.97135 0.97261 0.97385
                          PC83    PC84   PC85    PC86    PC87    PC88    PC89
Standard deviation     4.14368 4.14020 4.1010 4.07135 4.05284 3.99964 3.95561
Proportion of Variance 0.00123 0.00122 0.0012 0.00118 0.00117 0.00114 0.00112
Cumulative Proportion  0.97508 0.97631 0.9775 0.97869 0.97986 0.98101 0.98212
                          PC90    PC91    PC92    PC93    PC94    PC95    PC96
Standard deviation     3.93752 3.90476 3.89645 3.86114 3.84045 3.82990 3.81073
Proportion of Variance 0.00111 0.00109 0.00108 0.00106 0.00105 0.00105 0.00104
Cumulative Proportion  0.98323 0.98432 0.98541 0.98647 0.98752 0.98857 0.98961
                          PC97    PC98    PC99   PC100  PC101   PC102   PC103
Standard deviation     3.75910 3.71712 3.68479 3.61557 3.5502 3.51461 3.48465
Proportion of Variance 0.00101 0.00099 0.00097 0.00093 0.0009 0.00088 0.00087
Cumulative Proportion  0.99062 0.99160 0.99257 0.99351 0.9944 0.99529 0.99616
                         PC104   PC105   PC106   PC107   PC108     PC109
Standard deviation     3.44532 3.39627 3.26300 3.17868 3.10257 4.622e-14
Proportion of Variance 0.00085 0.00082 0.00076 0.00072 0.00069 0.000e+00
Cumulative Proportion  0.99701 0.99783 0.99859 0.99931 1.00000 1.000e+00
pca.results.to.plot <- pca.results$x %>%
  as.data.frame() %>%
  rownames_to_column("old.sample.name") %>%
  dplyr::select(old.sample.name, PC1:PC6) %>%
  left_join(FullMetadata, by="old.sample.name")

ggplot(pca.results.to.plot, aes(x=PC1, y=PC2, color=color, shape=Experiment)) +
  geom_text(size=3, aes(label=treatment)) +
  scale_color_identity() +
  theme_bw() +
  labs(title = "PCA using top 14K genes\nExpOf52")

Version Author Date
a8c9152 Benjmain Fair 2023-02-23
ggplot(pca.results.to.plot, aes(x=PC3, y=PC4, color=color, shape=Experiment)) +
  geom_text(size=3, aes(label=treatment)) +
  scale_color_identity() +
  theme_bw() +
  labs(title = "PCA using top 14K genes\nExpOf52")

Version Author Date
a8c9152 Benjmain Fair 2023-02-23
ggplot(pca.results.to.plot, aes(x=PC5, y=PC6, color=color, shape=Experiment)) +
  geom_text(size=3, aes(label=treatment)) +
  scale_color_identity() +
  theme_bw() +
  labs(title = "PCA using top 14K genes\nExpOf52")

Version Author Date
a8c9152 Benjmain Fair 2023-02-23

Ok, even though I am looking at only the first 6 PCs, it seems pretty clear that the DMSO samples don’t cleanly cluster together. Perhaps some of these molecules weren’t very active.

Now just for curiosity, let’s just include the fibroblast and the dose titration experiment.

SamplesToInclude <- FullMetadata %>%
  filter(Experiment %in% c("Dose response titration", "Single high dose fibroblast")) %>%
  pull(old.sample.name)

CPM <- gene.counts %>%
  cpm(log=T, prior.count=T) %>%
  as.data.frame() %>%
  rownames_to_column("Geneid") %>%
  dplyr::select(Geneid, all_of(SamplesToInclude))

Top14K_ExpressedGenes <- (CPM %>%
  column_to_rownames("Geneid") %>%
  apply(1, mean) %>%
  sort(decreasing=T))[1:14000] %>%
  names()

pca.results <- CPM %>%
  filter(Geneid %in% Top14K_ExpressedGenes) %>%
  column_to_rownames("Geneid") %>%
  scale() %>% t() %>% prcomp(scale=T)

summary(pca.results)
Importance of components:
                           PC1     PC2      PC3      PC4      PC5      PC6
Standard deviation     95.5166 43.8246 29.44168 20.92836 16.18447 14.81627
Proportion of Variance  0.6517  0.1372  0.06192  0.03129  0.01871  0.01568
Cumulative Proportion   0.6517  0.7889  0.85077  0.88206  0.90077  0.91645
                            PC7      PC8     PC9    PC10    PC11    PC12   PC13
Standard deviation     12.69624 10.41019 9.97714 8.53642 8.01215 6.54003 6.1490
Proportion of Variance  0.01151  0.00774 0.00711 0.00521 0.00459 0.00306 0.0027
Cumulative Proportion   0.92796  0.93570 0.94281 0.94802 0.95260 0.95566 0.9584
                          PC14    PC15    PC16    PC17    PC18    PC19    PC20
Standard deviation     5.81417 5.60050 5.26490 5.17174 5.10546 5.04332 4.95929
Proportion of Variance 0.00241 0.00224 0.00198 0.00191 0.00186 0.00182 0.00176
Cumulative Proportion  0.96078 0.96302 0.96500 0.96691 0.96877 0.97058 0.97234
                          PC21    PC22    PC23    PC24    PC25    PC26    PC27
Standard deviation     4.92047 4.86882 4.79843 4.69269 4.66221 4.60350 4.56525
Proportion of Variance 0.00173 0.00169 0.00164 0.00157 0.00155 0.00151 0.00149
Cumulative Proportion  0.97407 0.97576 0.97741 0.97898 0.98053 0.98205 0.98354
                          PC28    PC29    PC30   PC31    PC32   PC33    PC34
Standard deviation     4.38148 4.37470 4.29248 4.2711 4.13309 4.1052 4.06510
Proportion of Variance 0.00137 0.00137 0.00132 0.0013 0.00122 0.0012 0.00118
Cumulative Proportion  0.98491 0.98627 0.98759 0.9889 0.99011 0.9913 0.99250
                          PC35    PC36    PC37    PC38    PC39    PC40    PC41
Standard deviation     4.04592 3.98561 3.94879 3.89471 3.83228 3.76342 3.62769
Proportion of Variance 0.00117 0.00113 0.00111 0.00108 0.00105 0.00101 0.00094
Cumulative Proportion  0.99367 0.99480 0.99592 0.99700 0.99805 0.99906 1.00000
                            PC42
Standard deviation     4.242e-14
Proportion of Variance 0.000e+00
Cumulative Proportion  1.000e+00
pca.results.to.plot <- pca.results$x %>%
  as.data.frame() %>%
  rownames_to_column("old.sample.name") %>%
  dplyr::select(old.sample.name, PC1:PC6) %>%
  left_join(FullMetadata, by="old.sample.name")

ggplot(pca.results.to.plot, aes(x=PC1, y=PC2, color=color, shape=Experiment)) +
  geom_point(size=3) +
  scale_color_identity() +
  theme_bw() +
  labs(title = "PCA using top 14K genes\nTitrationExp and FibroblastExp")

Version Author Date
a8c9152 Benjmain Fair 2023-02-23
ggplot(pca.results.to.plot, aes(x=PC3, y=PC4, color=color, shape=Experiment)) +
  geom_point(size=3) +
  scale_color_identity() +
  theme_bw() +
  labs(title = "PCA using top 14K genes\nTitrationExp and FibroblastExp")

Version Author Date
a8c9152 Benjmain Fair 2023-02-23
ggplot(pca.results.to.plot, aes(x=PC5, y=PC6, color=color, shape=Experiment)) +
  geom_point(size=3) +
  scale_color_identity() +
  theme_bw() +
  labs(title = "PCA using top 14K genes\nTitrationExp and FibroblastExp")

Version Author Date
a8c9152 Benjmain Fair 2023-02-23

Wow, so in that, the second PC seems to pretty obviously reflect dose. I am quite surprised we don’t see something so obvious with the new samples… Somewhat concerning. Of course, perhaps it’s not too surprising considering this experiment just has so many more samples, only a few of which are DMSO, that maybe I shouldn’t expect some dose effect in the first few PCs… One other thing I have done that was helpful last time was to perform PCA on the titration series then project new samples in the same PC space. I’ll try that later…

Now Let’s check for some splicing effects. Maybe that will be more illuminating…

Verify genome-wide GA-GT splice site induction

Let’s start by checking for global enrichment of GA|GT juncs… If this doesn’t show an effect I’ll be very skeptical about the usefulness of this data.

leafcutter.sig <- Sys.glob("../code/SplicingAnalysis/leafcutter/differential_splicing/ExpOf52_*_cluster_significance.txt") %>%
  setNames(str_replace(., "../code/SplicingAnalysis/leafcutter/differential_splicing/ExpOf52_(.+?)_cluster_significance.txt", "W\\1")) %>%
  lapply(fread) %>%
  bind_rows(.id="treatment")

leafcutter.effects <- Sys.glob("../code/SplicingAnalysis/leafcutter/differential_splicing/ExpOf52_*_effect_sizes.txt") %>%
  setNames(str_replace(., "../code/SplicingAnalysis/leafcutter/differential_splicing/ExpOf52_(.+?)_effect_sizes.txt", "W\\1")) %>%
  lapply(fread, col.names=c("intron", "logef", "treatment_PSI", "DMSO_PSI", "deltapsi")) %>%
  bind_rows(.id="treatment") %>%
  mutate(name = str_replace(intron, "(.+?):(.+?):(.+?):clu.+([+-])", "\\1_\\2_\\3_\\4")) %>%
  mutate(deltapsi = deltapsi * -1, logef=logef*-1) %>%
  mutate(cluster = str_replace(intron, "(.+?):.+?:.+?:(clu.+[+-])", "\\1:\\2"))


Introns <- read_tsv("../code/SplicingAnalysis/FullSpliceSiteAnnotations/JuncfilesMerged.annotated.basic.bed.5ss.tab.gz", col_names = c("name", "seq", "score")) %>%
  separate(name, into=c("name", "pos"), sep = "::")

Introns.annotations <- read_tsv("../code/SplicingAnalysis/FullSpliceSiteAnnotations/JuncfilesMerged.annotated.basic.bed.gz") %>%
  mutate(name = paste(chrom, start, end, strand, sep = "_")) %>%
  left_join(Introns, by="name")


leafcutter.effects %>%
  inner_join(
    Introns.annotations %>%
    mutate(IsGAGT = if_else(str_detect(seq, "^\\w\\wGAGT"), "GA-GT", "not GA-GT")),
    by="name") %>%
  # count(IsGAGT, treatment)
  ggplot(aes(x=logef, group=treatment)) +
  stat_ecdf() +
  coord_cartesian(xlim=c(-2,2)) +
  facet_wrap(~IsGAGT) +
  theme_bw() +
  labs(title="Distribution of log(FC) for all tested introns\nExpOf52", y="ecdf", x="Splicing log(FC)")

Version Author Date
a8c9152 Benjmain Fair 2023-02-23

Ok so I see some subtle general up-regulation of GA-GT introns… That’s reassuring. But the degree to which all samples have about the same general effect size is surprising to me… I expected some of the compounds to be less active… Even though we normalized concentration based on SMN2 minigene EC90, I suspected a bit of noise in that estimate anyway. Let’s compare this to the analogous plot from the previous fibroblast dataset where I also did differential splicing with leafcutter to estimate splicing log(FC)…

leafcutter.effects.fibroblasts <- list.files(path="../code/SplicingAnalysis/leafcutter/differential_splicing", pattern="^[A-Z]+[0-9]*_effect_sizes.txt", full.names=T) %>%
  setNames(str_replace(., "../code/SplicingAnalysis/leafcutter/differential_splicing/(.+?)_effect_sizes.txt", "\\1")) %>%
  lapply(fread, col.names=c("intron", "logef", "treatment_PSI", "DMSO_PSI", "deltapsi")) %>%
  bind_rows(.id="treatment") %>%
  mutate(name = str_replace(intron, "(.+?):(.+?):(.+?):clu.+([+-])", "\\1_\\2_\\3_\\4")) %>%
  mutate(deltapsi = deltapsi * -1, logef=logef*-1) %>%
  mutate(cluster = str_replace(intron, "(.+?):.+?:.+?:(clu.+[+-])", "\\1:\\2"))

leafcutter.sig.fibroblasts <-list.files(path="../code/SplicingAnalysis/leafcutter/differential_splicing", pattern="^[A-Z]+[0-9]*_cluster_significance.txt", full.names=T) %>%
  setNames(str_replace(., "../code/SplicingAnalysis/leafcutter/differential_splicing/(.+?)_cluster_significance.txt", "\\1")) %>%
  lapply(fread) %>%
  bind_rows(.id="treatment")


Fibroblast.colors <- FullMetadata %>%
  filter(cell.type=="Fibroblast") %>%
  distinct(treatment, .keep_all=T) %>%
  dplyr::select(treatment, color) %>%
  deframe()

leafcutter.effects.fibroblasts %>%
  inner_join(
    Introns.annotations %>%
    mutate(IsGAGT = if_else(str_detect(seq, "^\\w\\wGAGT"), "GA-GT", "not GA-GT")),
    by="name") %>%
  # count(IsGAGT, treatment)
  ggplot(aes(x=logef, color=treatment)) +
  stat_ecdf() +
  scale_color_manual(values=Fibroblast.colors) +
  coord_cartesian(xlim=c(-2,2)) +
  facet_wrap(~IsGAGT) +
  theme_bw() +
  labs(title="Distribution of log(FC) for all tested introns\nFibroblast Exp", y="ecdf", x="Splicing log(FC)")

Version Author Date
a8c9152 Benjmain Fair 2023-02-23

Ok, so maybe the degree of these 52 treatments are roughly on the order of the CUOME treatment in fibroblast data. Not too strong. I’m still not totally sure this is even real…

Ok, so now let’s doing PCA just using GA|GT juncs… I think this should really work. Let’s start by including all samples. If I don’t really see DMSO control pop out from the rest, I will be a bit worried.

Splicing correlation matrix and PCA

Actually before doing PCA, to get a broad overview of results, let’s calculate the correlation coefficient across samples based on splicing. For sake of quickly getting an overview, and picking reasonable GA-GT introns (that aren’t too noisy to measure) let’s just consider the 500ish GA-GT introns I previously modelled. These are a pretty ‘good’ set of differentially spliced GA-GT introns based on risdiplam,branaplam,C2C5…

#Read in splice junction count table

PSI <- fread("../code/SplicingAnalysis/leafcutter_all_samples/leafcutter_perind_numers.bed.gz", sep = '\t' ) %>%
  dplyr::select(-"NewMolecule.C04.2")

Modelled.introns <- read_tsv("../output/EC50Estimtes.FromPSI.txt.gz") %>%
  dplyr::select(`#Chrom`, start, end, strand=strand.y)


colnames(PSI)
  [1] "#Chrom"             "start"              "end"               
  [4] "junc"               "gid"                "strand"            
  [7] "A10.1"              "A10.2"              "A10.3"             
 [10] "C2.1"               "C2.2"               "C2.3"              
 [13] "CUOME.1"            "CUOME.2"            "CUOME.3"           
 [16] "DMSO.1"             "DMSO.2"             "DMSO.3"            
 [19] "SM2.1"              "SM2.2"              "SM2.3"             
 [22] "TitrationExpBran.1" "TitrationExpBran.2" "TitrationExpBran.3"
 [25] "TitrationExpBran.4" "TitrationExpBran.5" "TitrationExpBran.6"
 [28] "TitrationExpBran.7" "TitrationExpBran.8" "TitrationExpC2C5.1"
 [31] "TitrationExpC2C5.2" "TitrationExpC2C5.3" "TitrationExpC2C5.4"
 [34] "TitrationExpC2C5.5" "TitrationExpC2C5.6" "TitrationExpC2C5.7"
 [37] "TitrationExpC2C5.8" "TitrationExpDMSO.1" "TitrationExpDMSO.2"
 [40] "TitrationExpDMSO.3" "TitrationExpRis.1"  "TitrationExpRis.2" 
 [43] "TitrationExpRis.3"  "TitrationExpRis.4"  "TitrationExpRis.5" 
 [46] "TitrationExpRis.6"  "TitrationExpRis.7"  "TitrationExpRis.8" 
 [49] "chRNA_1"            "chRNA_2"            "chRNA_3"           
 [52] "chRNA_4"            "chRNA_5"            "chRNA_6"           
 [55] "chRNA_7"            "chRNA_8"            "chRNA_9"           
 [58] "chRNA_10"           "chRNA_11"           "chRNA_12"          
 [61] "chRNA_13"           "chRNA_14"           "chRNA_15"          
 [64] "chRNA_16"           "chRNA_17"           "chRNA_18"          
 [67] "chRNA_19"           "chRNA_20"           "chRNA_21"          
 [70] "NewMolecule.A01.1"  "NewMolecule.A02.1"  "NewMolecule.A03.1" 
 [73] "NewMolecule.A04.1"  "NewMolecule.A05.1"  "NewMolecule.A06.1" 
 [76] "NewMolecule.A07.1"  "NewMolecule.A08.1"  "NewMolecule.A09.1" 
 [79] "NewMolecule.A10.1"  "NewMolecule.A11.1"  "NewMolecule.A12.1" 
 [82] "NewMolecule.B01.1"  "NewMolecule.B02.1"  "NewMolecule.B03.1" 
 [85] "NewMolecule.B04.1"  "NewMolecule.B05.1"  "NewMolecule.B06.1" 
 [88] "NewMolecule.B07.1"  "NewMolecule.B08.1"  "NewMolecule.B09.1" 
 [91] "NewMolecule.B10.1"  "NewMolecule.B11.1"  "NewMolecule.B12.1" 
 [94] "NewMolecule.C01.1"  "NewMolecule.C02.1"  "NewMolecule.C03.1" 
 [97] "NewMolecule.C04.1"  "NewMolecule.C05.1"  "NewMolecule.C06.1" 
[100] "NewMolecule.C07.1"  "NewMolecule.C08.1"  "NewMolecule.C09.1" 
[103] "NewMolecule.C10.1"  "NewMolecule.C11.1"  "NewMolecule.C12.1" 
[106] "NewMolecule.D01.1"  "NewMolecule.D02.1"  "NewMolecule.D03.1" 
[109] "NewMolecule.D04.1"  "NewMolecule.D05.1"  "NewMolecule.D06.1" 
[112] "NewMolecule.D07.1"  "NewMolecule.D08.1"  "NewMolecule.D09.1" 
[115] "NewMolecule.D10.1"  "NewMolecule.D11.1"  "NewMolecule.D12.1" 
[118] "NewMolecule.E01.1"  "NewMolecule.E02.1"  "NewMolecule.E03.1" 
[121] "NewMolecule.E04.1"  "NewMolecule.E05.1"  "NewMolecule.E06.2" 
[124] "NewMolecule.E07.3"  "NewMolecule.A01.2"  "NewMolecule.A02.2" 
[127] "NewMolecule.A03.2"  "NewMolecule.A04.2"  "NewMolecule.A05.2" 
[130] "NewMolecule.A06.2"  "NewMolecule.A07.2"  "NewMolecule.A08.2" 
[133] "NewMolecule.A09.2"  "NewMolecule.A10.2"  "NewMolecule.A11.2" 
[136] "NewMolecule.A12.2"  "NewMolecule.B01.2"  "NewMolecule.B02.2" 
[139] "NewMolecule.B03.2"  "NewMolecule.B04.2"  "NewMolecule.B05.2" 
[142] "NewMolecule.B06.2"  "NewMolecule.B07.2"  "NewMolecule.B08.2" 
[145] "NewMolecule.B09.2"  "NewMolecule.B10.2"  "NewMolecule.B11.2" 
[148] "NewMolecule.B12.2"  "NewMolecule.C01.2"  "NewMolecule.C02.2" 
[151] "NewMolecule.C03.2"  "NewMolecule.C05.2"  "NewMolecule.C06.2" 
[154] "NewMolecule.C07.2"  "NewMolecule.C08.2"  "NewMolecule.C09.2" 
[157] "NewMolecule.C10.2"  "NewMolecule.C11.2"  "NewMolecule.C12.2" 
[160] "NewMolecule.D01.2"  "NewMolecule.D02.2"  "NewMolecule.D03.2" 
[163] "NewMolecule.D04.2"  "NewMolecule.D05.2"  "NewMolecule.D06.2" 
[166] "NewMolecule.D07.2"  "NewMolecule.D08.2"  "NewMolecule.D09.2" 
[169] "NewMolecule.D10.2"  "NewMolecule.D11.2"  "NewMolecule.D12.2" 
[172] "NewMolecule.E01.2"  "NewMolecule.E02.2"  "NewMolecule.E03.2" 
[175] "NewMolecule.E04.2"  "NewMolecule.E05.4"  "NewMolecule.E06.5" 
[178] "NewMolecule.E07.6" 
PSI.GAGT.Only <- PSI %>%
  inner_join(Modelled.introns) %>%
  dplyr::select(junc, A10.1:NewMolecule.E07.6) %>%
  drop_na()


PSI.GAGT.Only.cormat <- PSI.GAGT.Only %>%
  column_to_rownames("junc") %>%
  cor(method='s')

heatmap.2(PSI.GAGT.Only.cormat, trace='none')

Version Author Date
a8c9152 Benjmain Fair 2023-02-23

And now let’s do PCA based on these introns…

pca.results <- PSI.GAGT.Only %>%
  column_to_rownames("junc") %>%
  scale() %>% t() %>% prcomp(scale=T)

summary(pca.results)
Importance of components:
                           PC1     PC2     PC3     PC4     PC5     PC6     PC7
Standard deviation     17.8068 10.8573 8.17051 7.30330 5.95972 5.17196 4.98289
Proportion of Variance  0.2909  0.1081 0.06125 0.04893 0.03259 0.02454 0.02278
Cumulative Proportion   0.2909  0.3991 0.46029 0.50923 0.54181 0.56635 0.58913
                           PC8     PC9    PC10    PC11    PC12   PC13    PC14
Standard deviation     4.65906 4.15658 4.07768 3.91171 3.76570 3.7066 3.51788
Proportion of Variance 0.01991 0.01585 0.01525 0.01404 0.01301 0.0126 0.01135
Cumulative Proportion  0.60905 0.62490 0.64015 0.65419 0.66720 0.6798 0.69116
                          PC15    PC16    PC17    PC18    PC19    PC20    PC21
Standard deviation     3.35281 3.25740 3.16906 3.12639 3.09122 3.02097 3.00494
Proportion of Variance 0.01031 0.00973 0.00921 0.00897 0.00877 0.00837 0.00828
Cumulative Proportion  0.70147 0.71121 0.72042 0.72939 0.73815 0.74653 0.75481
                          PC22    PC23    PC24    PC25    PC26    PC27    PC28
Standard deviation     2.91815 2.83474 2.79655 2.72807 2.67897 2.57623 2.56316
Proportion of Variance 0.00781 0.00737 0.00717 0.00683 0.00658 0.00609 0.00603
Cumulative Proportion  0.76262 0.77000 0.77717 0.78400 0.79058 0.79667 0.80270
                         PC29    PC30    PC31    PC32    PC33    PC34    PC35
Standard deviation     2.4933 2.42762 2.38281 2.32648 2.22686 2.14178 2.12131
Proportion of Variance 0.0057 0.00541 0.00521 0.00497 0.00455 0.00421 0.00413
Cumulative Proportion  0.8084 0.81381 0.81902 0.82398 0.82853 0.83274 0.83687
                          PC36    PC37    PC38    PC39    PC40    PC41   PC42
Standard deviation     2.11105 2.07792 2.04445 1.99915 1.96822 1.95962 1.8972
Proportion of Variance 0.00409 0.00396 0.00383 0.00367 0.00355 0.00352 0.0033
Cumulative Proportion  0.84096 0.84492 0.84876 0.85242 0.85598 0.85950 0.8628
                          PC43   PC44    PC45   PC46    PC47    PC48    PC49
Standard deviation     1.87969 1.8689 1.82108 1.8094 1.77040 1.72243 1.70644
Proportion of Variance 0.00324 0.0032 0.00304 0.0030 0.00288 0.00272 0.00267
Cumulative Proportion  0.86604 0.8692 0.87229 0.8753 0.87817 0.88089 0.88356
                          PC50    PC51    PC52    PC53    PC54    PC55   PC56
Standard deviation     1.70530 1.65973 1.63623 1.60037 1.59820 1.58595 1.5491
Proportion of Variance 0.00267 0.00253 0.00246 0.00235 0.00234 0.00231 0.0022
Cumulative Proportion  0.88623 0.88876 0.89121 0.89356 0.89591 0.89821 0.9004
                          PC57    PC58    PC59    PC60    PC61   PC62    PC63
Standard deviation     1.52812 1.50637 1.49055 1.46312 1.44493 1.4382 1.41790
Proportion of Variance 0.00214 0.00208 0.00204 0.00196 0.00192 0.0019 0.00184
Cumulative Proportion  0.90256 0.90464 0.90668 0.90864 0.91056 0.9124 0.91430
                          PC64    PC65   PC66    PC67    PC68    PC69    PC70
Standard deviation     1.40272 1.38000 1.3627 1.35727 1.33489 1.32463 1.30675
Proportion of Variance 0.00181 0.00175 0.0017 0.00169 0.00163 0.00161 0.00157
Cumulative Proportion  0.91610 0.91785 0.9196 0.92125 0.92288 0.92449 0.92606
                          PC71    PC72    PC73    PC74    PC75    PC76    PC77
Standard deviation     1.29953 1.29335 1.28311 1.26960 1.25385 1.24148 1.23204
Proportion of Variance 0.00155 0.00153 0.00151 0.00148 0.00144 0.00141 0.00139
Cumulative Proportion  0.92761 0.92914 0.93065 0.93213 0.93357 0.93499 0.93638
                          PC78    PC79    PC80   PC81    PC82    PC83    PC84
Standard deviation     1.21811 1.20481 1.20317 1.1894 1.17549 1.17356 1.15990
Proportion of Variance 0.00136 0.00133 0.00133 0.0013 0.00127 0.00126 0.00123
Cumulative Proportion  0.93774 0.93907 0.94040 0.9417 0.94297 0.94423 0.94546
                          PC85   PC86    PC87    PC88    PC89    PC90    PC91
Standard deviation     1.15034 1.1443 1.13021 1.12064 1.11255 1.10493 1.10202
Proportion of Variance 0.00121 0.0012 0.00117 0.00115 0.00114 0.00112 0.00111
Cumulative Proportion  0.94668 0.9479 0.94905 0.95020 0.95134 0.95246 0.95357
                          PC92    PC93    PC94    PC95    PC96    PC97    PC98
Standard deviation     1.09981 1.07942 1.06509 1.05582 1.05330 1.03539 1.02657
Proportion of Variance 0.00111 0.00107 0.00104 0.00102 0.00102 0.00098 0.00097
Cumulative Proportion  0.95468 0.95575 0.95679 0.95781 0.95883 0.95982 0.96078
                          PC99   PC100   PC101  PC102   PC103   PC104   PC105
Standard deviation     1.01193 1.01054 1.00006 0.9921 0.98595 0.98306 0.97870
Proportion of Variance 0.00094 0.00094 0.00092 0.0009 0.00089 0.00089 0.00088
Cumulative Proportion  0.96172 0.96266 0.96358 0.9645 0.96537 0.96626 0.96714
                         PC106   PC107   PC108  PC109  PC110   PC111   PC112
Standard deviation     0.96883 0.96234 0.95920 0.9351 0.9323 0.92778 0.91778
Proportion of Variance 0.00086 0.00085 0.00084 0.0008 0.0008 0.00079 0.00077
Cumulative Proportion  0.96800 0.96885 0.96969 0.9705 0.9713 0.97208 0.97285
                         PC113   PC114   PC115  PC116  PC117   PC118   PC119
Standard deviation     0.91312 0.89464 0.88541 0.8759 0.8726 0.86852 0.86330
Proportion of Variance 0.00076 0.00073 0.00072 0.0007 0.0007 0.00069 0.00068
Cumulative Proportion  0.97362 0.97435 0.97507 0.9758 0.9765 0.97717 0.97785
                         PC120   PC121   PC122   PC123   PC124   PC125   PC126
Standard deviation     0.85787 0.85264 0.84570 0.84076 0.82717 0.82139 0.82103
Proportion of Variance 0.00068 0.00067 0.00066 0.00065 0.00063 0.00062 0.00062
Cumulative Proportion  0.97853 0.97919 0.97985 0.98050 0.98112 0.98174 0.98236
                         PC127   PC128   PC129   PC130   PC131   PC132   PC133
Standard deviation     0.81732 0.80003 0.79463 0.79015 0.77880 0.77161 0.76729
Proportion of Variance 0.00061 0.00059 0.00058 0.00057 0.00056 0.00055 0.00054
Cumulative Proportion  0.98297 0.98356 0.98414 0.98471 0.98527 0.98582 0.98636
                         PC134   PC135   PC136   PC137   PC138   PC139   PC140
Standard deviation     0.75267 0.74685 0.74292 0.73295 0.72225 0.71640 0.71031
Proportion of Variance 0.00052 0.00051 0.00051 0.00049 0.00048 0.00047 0.00046
Cumulative Proportion  0.98688 0.98739 0.98789 0.98839 0.98887 0.98934 0.98980
                         PC141   PC142   PC143   PC144   PC145   PC146  PC147
Standard deviation     0.70834 0.69835 0.69170 0.68614 0.68132 0.66946 0.6641
Proportion of Variance 0.00046 0.00045 0.00044 0.00043 0.00043 0.00041 0.0004
Cumulative Proportion  0.99026 0.99071 0.99115 0.99158 0.99200 0.99242 0.9928
                        PC148   PC149   PC150   PC151   PC152   PC153   PC154
Standard deviation     0.6576 0.65180 0.64783 0.63299 0.62971 0.62458 0.61835
Proportion of Variance 0.0004 0.00039 0.00039 0.00037 0.00036 0.00036 0.00035
Cumulative Proportion  0.9932 0.99361 0.99399 0.99436 0.99472 0.99508 0.99543
                         PC155   PC156   PC157   PC158   PC159   PC160   PC161
Standard deviation     0.61050 0.60903 0.58801 0.58311 0.57708 0.56122 0.55556
Proportion of Variance 0.00034 0.00034 0.00032 0.00031 0.00031 0.00029 0.00028
Cumulative Proportion  0.99577 0.99611 0.99643 0.99674 0.99705 0.99734 0.99762
                         PC162   PC163   PC164   PC165   PC166   PC167   PC168
Standard deviation     0.54262 0.53499 0.53018 0.52517 0.51671 0.51438 0.50128
Proportion of Variance 0.00027 0.00026 0.00026 0.00025 0.00024 0.00024 0.00023
Cumulative Proportion  0.99789 0.99815 0.99841 0.99866 0.99891 0.99915 0.99938
                         PC169   PC170   PC171     PC172
Standard deviation     0.48872 0.48374 0.44743 1.373e-14
Proportion of Variance 0.00022 0.00021 0.00018 0.000e+00
Cumulative Proportion  0.99960 0.99982 1.00000 1.000e+00
pca.results.to.plot <- pca.results$x %>%
  as.data.frame() %>%
  rownames_to_column("leafcutter.name") %>%
  dplyr::select(leafcutter.name, PC1:PC6) %>%
  left_join(FullMetadata, by="leafcutter.name")

ggplot(pca.results.to.plot, aes(x=PC1, y=PC2, color=color, shape=Experiment)) +
  geom_point(size=3) +
  scale_color_identity() +
  theme_bw() +
  labs(title = "PCA using 1090 GAGT introns\nAll samples")

Version Author Date
a8c9152 Benjmain Fair 2023-02-23
ggplot(pca.results.to.plot, aes(x=PC3, y=PC4, color=color, shape=Experiment)) +
  geom_point(size=3) +
  scale_color_identity() +
  theme_bw() +
  labs(title = "PCA using 1090 GAGT introns\nAll samples")

Version Author Date
a8c9152 Benjmain Fair 2023-02-23
ggplot(pca.results.to.plot, aes(x=PC5, y=PC6, color=color, shape=Experiment)) +
  geom_point(size=3) +
  scale_color_identity() +
  theme_bw() +
  labs(title = "PCA using 1090 GAGT introns\nAll samples")

Version Author Date
a8c9152 Benjmain Fair 2023-02-23

That is actually kind of reassuring. The first PC is definitely related to dose. And all DMSO samples cluster together. However, again, the effective dosage seems a bit weak compared to fibroblast experiments… Looks like most of the samples are roughly equivalent to ~100nM risdiplam, so still perhaps on the high end of what would be realistic clinical concentration but I think a slightly higher dose would have been more informative. In this particular PCA, it looks like PC3 sepearates branaplam vs risdiplam, and it’s not totally clear how these molecules are different along that axis… I could try, as mentioned before just doing PCA using the dose titration experiments then projecting the new samples in that space… But first let’s do a PCA just using the samples from the new experiment..

SamplesToInclude <- FullMetadata %>%
  filter(Experiment %in% c("Single high dose LCL")) %>%
  filter(!old.sample.name == "NewMolecule.C04-2") %>%
  pull(leafcutter.name)

pca.results <- PSI.GAGT.Only %>%
  dplyr::select(junc, all_of(SamplesToInclude)) %>%
  column_to_rownames("junc") %>%
  scale() %>% t() %>% prcomp(scale=T)

summary(pca.results)
Importance of components:
                           PC1     PC2     PC3     PC4     PC5     PC6     PC7
Standard deviation     17.3652 8.02663 7.63797 4.89303 4.62299 4.47221 4.29379
Proportion of Variance  0.2767 0.05911 0.05352 0.02196 0.01961 0.01835 0.01691
Cumulative Proportion   0.2767 0.33576 0.38928 0.41125 0.43085 0.44920 0.46612
                          PC8     PC9    PC10    PC11    PC12    PC13    PC14
Standard deviation     3.9200 3.84117 3.69235 3.60719 3.54779 3.47655 3.44963
Proportion of Variance 0.0141 0.01354 0.01251 0.01194 0.01155 0.01109 0.01092
Cumulative Proportion  0.4802 0.49375 0.50626 0.51820 0.52974 0.54083 0.55175
                          PC15    PC16    PC17    PC18    PC19    PC20    PC21
Standard deviation     3.33808 3.32645 3.30802 3.26244 3.24349 3.17997 3.10580
Proportion of Variance 0.01022 0.01015 0.01004 0.00976 0.00965 0.00928 0.00885
Cumulative Proportion  0.56197 0.57212 0.58216 0.59193 0.60158 0.61086 0.61971
                          PC22    PC23    PC24    PC25    PC26   PC27    PC28
Standard deviation     3.04656 3.01811 2.97941 2.92856 2.91952 2.8791 2.87597
Proportion of Variance 0.00852 0.00836 0.00814 0.00787 0.00782 0.0076 0.00759
Cumulative Proportion  0.62822 0.63658 0.64472 0.65259 0.66041 0.6680 0.67560
                          PC29    PC30    PC31    PC32    PC33    PC34   PC35
Standard deviation     2.84331 2.80557 2.77632 2.74624 2.73980 2.69767 2.6825
Proportion of Variance 0.00742 0.00722 0.00707 0.00692 0.00689 0.00668 0.0066
Cumulative Proportion  0.68302 0.69024 0.69731 0.70423 0.71112 0.71780 0.7244
                          PC36    PC37   PC38    PC39   PC40    PC41    PC42
Standard deviation     2.67372 2.66036 2.6208 2.58876 2.5790 2.55420 2.51670
Proportion of Variance 0.00656 0.00649 0.0063 0.00615 0.0061 0.00599 0.00581
Cumulative Proportion  0.73096 0.73745 0.7438 0.74990 0.7560 0.76199 0.76780
                          PC43    PC44    PC45    PC46   PC47    PC48    PC49
Standard deviation     2.50155 2.48335 2.48004 2.45997 2.4489 2.42392 2.41933
Proportion of Variance 0.00574 0.00566 0.00564 0.00555 0.0055 0.00539 0.00537
Cumulative Proportion  0.77354 0.77920 0.78484 0.79039 0.7959 0.80128 0.80665
                          PC50    PC51    PC52    PC53    PC54    PC55    PC56
Standard deviation     2.38644 2.36665 2.33817 2.31661 2.30298 2.29857 2.27662
Proportion of Variance 0.00522 0.00514 0.00502 0.00492 0.00487 0.00485 0.00476
Cumulative Proportion  0.81188 0.81702 0.82203 0.82695 0.83182 0.83667 0.84142
                          PC57    PC58    PC59   PC60    PC61    PC62    PC63
Standard deviation     2.26629 2.25224 2.22404 2.2140 2.17655 2.16350 2.15976
Proportion of Variance 0.00471 0.00465 0.00454 0.0045 0.00435 0.00429 0.00428
Cumulative Proportion  0.84613 0.85079 0.85533 0.8598 0.86417 0.86846 0.87274
                          PC64    PC65    PC66    PC67    PC68    PC69    PC70
Standard deviation     2.12927 2.10607 2.09074 2.07866 2.05863 2.05530 2.02185
Proportion of Variance 0.00416 0.00407 0.00401 0.00396 0.00389 0.00388 0.00375
Cumulative Proportion  0.87690 0.88097 0.88498 0.88895 0.89283 0.89671 0.90046
                          PC71    PC72   PC73    PC74    PC75    PC76    PC77
Standard deviation     2.00272 1.99787 1.9816 1.96398 1.94003 1.93185 1.92771
Proportion of Variance 0.00368 0.00366 0.0036 0.00354 0.00345 0.00342 0.00341
Cumulative Proportion  0.90414 0.90780 0.9114 0.91494 0.91840 0.92182 0.92523
                          PC78    PC79    PC80    PC81    PC82    PC83    PC84
Standard deviation     1.88660 1.86190 1.85577 1.83633 1.82108 1.80387 1.78728
Proportion of Variance 0.00327 0.00318 0.00316 0.00309 0.00304 0.00299 0.00293
Cumulative Proportion  0.92849 0.93168 0.93483 0.93793 0.94097 0.94396 0.94689
                          PC85    PC86    PC87    PC88    PC89    PC90    PC91
Standard deviation     1.77038 1.76254 1.73891 1.72961 1.69315 1.68904 1.67390
Proportion of Variance 0.00288 0.00285 0.00277 0.00274 0.00263 0.00262 0.00257
Cumulative Proportion  0.94976 0.95261 0.95539 0.95813 0.96076 0.96338 0.96595
                          PC92   PC93    PC94    PC95    PC96    PC97    PC98
Standard deviation     1.64617 1.6161 1.59282 1.58560 1.55184 1.52377 1.51716
Proportion of Variance 0.00249 0.0024 0.00233 0.00231 0.00221 0.00213 0.00211
Cumulative Proportion  0.96843 0.9708 0.97316 0.97547 0.97767 0.97980 0.98192
                          PC99   PC100   PC101   PC102   PC103   PC104   PC105
Standard deviation     1.49663 1.46361 1.46094 1.45131 1.43022 1.42266 1.39131
Proportion of Variance 0.00205 0.00197 0.00196 0.00193 0.00188 0.00186 0.00178
Cumulative Proportion  0.98397 0.98594 0.98789 0.98983 0.99170 0.99356 0.99534
                         PC106   PC107   PC108     PC109
Standard deviation     1.35016 1.32492 1.22667 3.724e-14
Proportion of Variance 0.00167 0.00161 0.00138 0.000e+00
Cumulative Proportion  0.99701 0.99862 1.00000 1.000e+00
pca.results.to.plot <- pca.results$x %>%
  as.data.frame() %>%
  rownames_to_column("leafcutter.name") %>%
  dplyr::select(leafcutter.name, PC1:PC6) %>%
  left_join(FullMetadata, by="leafcutter.name")

ggplot(pca.results.to.plot, aes(x=PC1, y=PC2, color=color, shape=Experiment)) +
  geom_text(size=3, aes(label=treatment)) +
  scale_color_identity() +
  theme_bw() +
  labs(title = "PCA using 1090 GAGT introns\nExpOf52 samples")

Version Author Date
a8c9152 Benjmain Fair 2023-02-23
ggplot(pca.results.to.plot, aes(x=PC3, y=PC4, color=color, shape=Experiment)) +
  geom_text(size=3, aes(label=treatment)) +
  scale_color_identity() +
  theme_bw() +
  labs(title = "PCA using 1090 GAGT introns\nExpOf52 samples")

Version Author Date
a8c9152 Benjmain Fair 2023-02-23
ggplot(pca.results.to.plot, aes(x=PC5, y=PC6, color=color, shape=Experiment)) +
  geom_text(size=3, aes(label=treatment)) +
  scale_color_identity() +
  theme_bw() +
  labs(title = "PCA using 1090 GAGT introns\nExpOf52 samples")

Version Author Date
a8c9152 Benjmain Fair 2023-02-23

Ok, reassuringly, again DMSO samples stick out in the first PC, and I see a few biological replicates are close together.

Ok, now I want to try projecting the old titration series data into this same PCA spaace, which might help interpretability.

SamplesToProject <- FullMetadata %>%
  filter(!Experiment %in% c("Single high dose LCL")) %>%
  filter(!old.sample.name == "NewMolecule.C04-2") %>%
  pull(leafcutter.name)

pca.results.to.plot_AdditionProjections <- PSI.GAGT.Only %>%
  dplyr::select(junc, all_of(SamplesToProject)) %>%
  column_to_rownames("junc") %>%
  scale() %>% t() %>% predict(pca.results, .) %>%
  as.data.frame() %>%
  rownames_to_column("leafcutter.name") %>%
  dplyr::select(leafcutter.name, PC1:PC6) %>%
  left_join(FullMetadata, by="leafcutter.name")


TreatmentColorsForLabels.FibroblastColorsAdded <-
FullMetadata %>%
  group_by(treatment) %>%
  filter(dose.nM == max(dose.nM) | treatment == "DMSO" | cell.type=="Fibroblast") %>%
  ungroup() %>%
  distinct(treatment, .keep_all=T) %>%
  arrange(dose.nM) %>%
  mutate(vjust=row_number()*1.2)

TreatmentColorsLabels.Layer.FibroblastColorsAdded <- geom_text(
    data = TreatmentColorsForLabels.FibroblastColorsAdded, 
    aes(label=treatment, color=color, vjust=vjust),
    y=Inf, x=Inf, hjust=1.05
  )

  
ggplot(pca.results.to.plot, aes(x=PC1, y=PC2, color=color, shape=Experiment)) +
  geom_text(size=3, aes(label=treatment)) +
  geom_point(data = pca.results.to.plot_AdditionProjections) +
  TreatmentColorsLabels.Layer.FibroblastColorsAdded +
  scale_color_identity() +
  theme_bw() +
  labs(title = "PCA using 1090 GAGT introns\nExpOf52 samples", caption="Additional samples projected")

Version Author Date
a8c9152 Benjmain Fair 2023-02-23
ggplot(pca.results.to.plot, aes(x=PC3, y=PC4, color=color, shape=Experiment)) +
  geom_text(size=3, aes(label=treatment)) +
  geom_point(data = pca.results.to.plot_AdditionProjections) +
  TreatmentColorsLabels.Layer.FibroblastColorsAdded +
  scale_color_identity() +
  theme_bw() +
  labs(title = "PCA using 1090 GAGT introns\nAll samples")

Version Author Date
a8c9152 Benjmain Fair 2023-02-23

I think these PCA are interseting… it’s still hard to make sense of it all… And this isn’t really the way to identify new axes of specifity since the introns were selected precisely because of their effect in risdiplam/branaplam/C2C5 in previous experiments… Better would be to identify all differentially spliced introns and perform a PCA on that, also projecting the old risdiplam/branaplam/C2C5 data in the PCA for help with interpretation… Let’s work towards that….

Leafcutter differential splicing

First, let’s explore the leafcutter differential splicing results… From previous plots, I suspect the effective dosage of these new data to be much less than the fibroblast data, and perhaps we will get much fewer significant differential splicing events… First Let’s check it out, the number of differentially spliced clusters, some P value histograms, etc.

leafcutter.sig %>%
  sample_n_of(12, treatment) %>%
  filter(!is.na(p)) %>%
  ggplot(aes(x=p)) +
    geom_histogram() +
    facet_wrap(~treatment, scales="free") +
  labs(caption="P value histograms for some random splicing contrasts", title="Exp of 52")

Version Author Date
a8c9152 Benjmain Fair 2023-02-23

…and compare that to the fibroblast tests…

leafcutter.sig.fibroblasts %>%
  filter(!is.na(p)) %>%
  ggplot(aes(x=p)) +
    geom_histogram() +
    facet_wrap(~treatment, scales="free") +
  labs(caption="P value histograms for some random splicing contrasts", title="fibroblast experiment")

Version Author Date
a8c9152 Benjmain Fair 2023-02-23

ok, so a lot these will probably have even less significant hits than the fibroblast CUOME contrast.

Let’s see how many hits at an FDR of 10%…

bind_rows(
bind_rows(leafcutter.sig.fibroblasts, leafcutter.sig) %>%
  filter(p.adjust < 0.01) %>%
  count(treatment) %>%
  mutate(FDR = 0.01),
bind_rows(leafcutter.sig.fibroblasts, leafcutter.sig) %>%
  filter(p.adjust < 0.05) %>%
  count(treatment) %>%
  mutate(FDR = 0.05),
bind_rows(leafcutter.sig.fibroblasts, leafcutter.sig) %>%
  filter(p.adjust < 0.1) %>%
  count(treatment) %>%
  mutate(FDR = 0.1)) %>%
  left_join(FullMetadata, by='treatment') %>%
  ggplot(aes(x=treatment, y=n)) +
  geom_bar(stat="identity",position = "identity", alpha=.15) +
  facet_grid(~Experiment, scales = "free", space = "free_x", labeller = label_wrap_gen(10)) +
  Rotate_x_labels +
  theme(axis.text.x = element_text(size=6)) +
  theme(strip.text.x = element_text(size = 5)) +
  labs(title="Number of leafcutter signigicant clusters", x="treatment", caption="FDR 1%, 5%, 10%")

Version Author Date
a8c9152 Benjmain Fair 2023-02-23
bind_rows(leafcutter.sig.fibroblasts, leafcutter.sig) %>%
  filter(p.adjust < 2) %>%
  count(treatment) %>%
  left_join(FullMetadata, by='treatment') %>%
  ggplot(aes(x=treatment, y=n)) +
  geom_bar(stat="identity",position = "identity", alpha=1) +
  facet_grid(~Experiment, scales = "free", space = "free_x", labeller = label_wrap_gen(10)) +
  Rotate_x_labels +
  theme(axis.text.x = element_text(size=6)) +
  theme(strip.text.x = element_text(size = 5)) +
  labs(title="Number of leafcutter tests", x="treatment")

Version Author Date
a8c9152 Benjmain Fair 2023-02-23

Ok, most treatments have even less hits than the CUOME contrast. In principle, sharing information across samples could help with this. I think this might be important towards finding difference amongst samples and identification of new specificities. Like mash could be useful. But that takes a bit of time to implement… leafcutter doesn’t just output standard errors, which I think are required. I suppose one hackish solution would be to perform a contrast with all treatments combined, vs DMSO… then identify significant clusters. To find more molecule-specific effects, I could then look at the nominal P value from the non aggregated treatment contrast.

Following this idea, I did a contrast with all molecules vs DMSO… Here are the leafcutter results…

leafcutter.effects.MergedContrast <- read_tsv("../code/SplicingAnalysis/MergedExp52_Contrast/_effect_sizes.txt")
leafcutter.sig.MergedContrast <- read_tsv("../code/SplicingAnalysis/MergedExp52_Contrast/_cluster_significance.txt")

bind_rows(
bind_rows(leafcutter.sig.fibroblasts, leafcutter.sig, leafcutter.sig.MergedContrast) %>%
  filter(p.adjust < 0.01) %>%
  count(treatment) %>%
  mutate(FDR = 0.01),
bind_rows(leafcutter.sig.fibroblasts, leafcutter.sig, leafcutter.sig.MergedContrast) %>%
  filter(p.adjust < 0.05) %>%
  count(treatment) %>%
  mutate(FDR = 0.05),
bind_rows(leafcutter.sig.fibroblasts, leafcutter.sig, leafcutter.sig.MergedContrast) %>%
  filter(p.adjust < 0.1) %>%
  count(treatment) %>%
  mutate(FDR = 0.1)) %>%
  left_join(FullMetadata, by='treatment') %>%
  replace_na(list(treatment="Merged", Experiment="Single high dose LCL")) %>%
  ggplot(aes(x=treatment, y=n)) +
  geom_bar(stat="identity",position = "identity", alpha=.15) +
  facet_grid(~Experiment, scales = "free", space = "free_x", labeller = label_wrap_gen(10)) +
  Rotate_x_labels +
  theme(axis.text.x = element_text(size=6)) +
  theme(strip.text.x = element_text(size = 5)) +
  labs(title="Number of leafcutter signigicant clusters", x="treatment", caption="FDR 1%, 5%, 10%")

Version Author Date
a8c9152 Benjmain Fair 2023-02-23
bind_rows(leafcutter.sig.fibroblasts, leafcutter.sig, leafcutter.sig.MergedContrast) %>%
  filter(p.adjust < 2) %>%
  count(treatment) %>%
  left_join(FullMetadata, by='treatment') %>%
  replace_na(list(treatment="Merged", Experiment="Single high dose LCL")) %>%
  ggplot(aes(x=treatment, y=n)) +
  geom_bar(stat="identity",position = "identity", alpha=1) +
  facet_grid(~Experiment, scales = "free", space = "free_x", labeller = label_wrap_gen(10)) +
  Rotate_x_labels +
  theme(axis.text.x = element_text(size=6)) +
  theme(strip.text.x = element_text(size = 5)) +
  labs(title="Number of leafcutter tests", x="treatment")

Version Author Date
a8c9152 Benjmain Fair 2023-02-23

Wow. That did surprisingly little. I don’t have the time to troubleshoot if there is a bug or if this is a legitamite result of doing this merged contrast.

So now let’s try counting how many unique clusters there are amongst all the attempted contrasts. Let’s be permissive and consider FDR 10%

#Number of total hits (not necessarily unique)
leafcutter.sig %>%
  filter(p.adjust < 0.1) %>% nrow()
[1] 32162
#Number of unique cluster hits amongst all hits
leafcutter.sig %>%
  filter(p.adjust < 0.1) %>%
  distinct(cluster) %>% nrow()
[1] 10803

Ok, so a lot of the hits are unique to some contrasts, in the sense that I get 10803/32162 unique clusters with FDR<10%. I think what I will do next is do the PCA, but just include introns that are FDR<10% in at least one contrast, and also consider some minimal effect size, and also consider just GAGT introns within those clusters.

IntronsToInclude <- leafcutter.sig %>%
  filter(p.adjust < 0.1) %>%
  inner_join(leafcutter.effects, by=c("treatment", "cluster")) %>%
  filter(deltapsi > 0.01) %>%
  inner_join(Introns.annotations, by='name') %>%
  filter(str_detect(seq, "^\\w\\wGAGT")) %>%
  pull(intron)



length(IntronsToInclude)
[1] 6383
PSI.IntronsToInclude <- PSI %>%
  filter(junc %in% IntronsToInclude) %>%
  dplyr::select(junc, A10.1:NewMolecule.E07.6) %>%
  drop_na()

dim(PSI.IntronsToInclude)
[1] 475 173

Ok, so only 475 introns left… This probably won’t change anything from my previous PCA with ~1000 GAGT introns… Well let’s follow through anyway…

SamplesToInclude <- FullMetadata %>%
  filter(Experiment %in% c("Single high dose LCL")) %>%
  filter(!old.sample.name == "NewMolecule.C04-2") %>%
  pull(leafcutter.name)


pca.results <- PSI.IntronsToInclude %>%
  dplyr::select(junc, all_of(SamplesToInclude)) %>%
  column_to_rownames("junc") %>%
  scale() %>% t() %>% prcomp(scale=T)

summary(pca.results)
Importance of components:
                           PC1     PC2     PC3     PC4     PC5     PC6     PC7
Standard deviation     11.6045 6.86645 5.16078 3.74247 3.39272 3.03012 2.81941
Proportion of Variance  0.2835 0.09926 0.05607 0.02949 0.02423 0.01933 0.01673
Cumulative Proportion   0.2835 0.38276 0.43883 0.46832 0.49255 0.51188 0.52862
                           PC8     PC9   PC10    PC11    PC12    PC13    PC14
Standard deviation     2.76866 2.66673 2.5319 2.46868 2.42788 2.38280 2.30959
Proportion of Variance 0.01614 0.01497 0.0135 0.01283 0.01241 0.01195 0.01123
Cumulative Proportion  0.54476 0.55973 0.5732 0.58605 0.59846 0.61042 0.62165
                          PC15    PC16    PC17   PC18   PC19    PC20    PC21
Standard deviation     2.24564 2.21327 2.18855 2.1681 2.1462 2.10843 2.08510
Proportion of Variance 0.01062 0.01031 0.01008 0.0099 0.0097 0.00936 0.00915
Cumulative Proportion  0.63226 0.64258 0.65266 0.6626 0.6723 0.68161 0.69076
                          PC22    PC23    PC24    PC25    PC26    PC27    PC28
Standard deviation     2.02730 2.01845 1.95205 1.91161 1.86509 1.84638 1.83184
Proportion of Variance 0.00865 0.00858 0.00802 0.00769 0.00732 0.00718 0.00706
Cumulative Proportion  0.69942 0.70799 0.71602 0.72371 0.73103 0.73821 0.74527
                          PC29    PC30    PC31    PC32    PC33    PC34    PC35
Standard deviation     1.82624 1.81628 1.76089 1.73452 1.69435 1.68276 1.66454
Proportion of Variance 0.00702 0.00695 0.00653 0.00633 0.00604 0.00596 0.00583
Cumulative Proportion  0.75230 0.75924 0.76577 0.77210 0.77815 0.78411 0.78994
                          PC36    PC37   PC38    PC39    PC40    PC41    PC42
Standard deviation     1.65906 1.62985 1.6161 1.59049 1.57398 1.56278 1.55085
Proportion of Variance 0.00579 0.00559 0.0055 0.00533 0.00522 0.00514 0.00506
Cumulative Proportion  0.79574 0.80133 0.8068 0.81215 0.81737 0.82251 0.82757
                          PC43    PC44    PC45    PC46    PC47    PC48    PC49
Standard deviation     1.54608 1.53399 1.53190 1.52334 1.51515 1.46995 1.45829
Proportion of Variance 0.00503 0.00495 0.00494 0.00489 0.00483 0.00455 0.00448
Cumulative Proportion  0.83260 0.83756 0.84250 0.84738 0.85222 0.85677 0.86124
                          PC50    PC51   PC52    PC53    PC54    PC55    PC56
Standard deviation     1.43807 1.41947 1.4118 1.39282 1.37367 1.36640 1.35806
Proportion of Variance 0.00435 0.00424 0.0042 0.00408 0.00397 0.00393 0.00388
Cumulative Proportion  0.86560 0.86984 0.8740 0.87812 0.88209 0.88602 0.88991
                          PC57   PC58    PC59    PC60    PC61    PC62    PC63
Standard deviation     1.34707 1.3249 1.31220 1.28503 1.27454 1.26128 1.24039
Proportion of Variance 0.00382 0.0037 0.00362 0.00348 0.00342 0.00335 0.00324
Cumulative Proportion  0.89373 0.8974 0.90105 0.90452 0.90794 0.91129 0.91453
                         PC64    PC65    PC66    PC67    PC68    PC69    PC70
Standard deviation     1.2326 1.21995 1.20051 1.18288 1.17106 1.16242 1.14475
Proportion of Variance 0.0032 0.00313 0.00303 0.00295 0.00289 0.00284 0.00276
Cumulative Proportion  0.9177 0.92086 0.92390 0.92684 0.92973 0.93257 0.93533
                          PC71    PC72    PC73    PC74    PC75    PC76    PC77
Standard deviation     1.13640 1.12952 1.10683 1.10248 1.08830 1.07329 1.06585
Proportion of Variance 0.00272 0.00269 0.00258 0.00256 0.00249 0.00243 0.00239
Cumulative Proportion  0.93805 0.94074 0.94332 0.94588 0.94837 0.95079 0.95319
                          PC78    PC79    PC80    PC81    PC82    PC83    PC84
Standard deviation     1.05289 1.03925 1.02051 1.01535 0.99384 0.96728 0.96611
Proportion of Variance 0.00233 0.00227 0.00219 0.00217 0.00208 0.00197 0.00196
Cumulative Proportion  0.95552 0.95779 0.95999 0.96216 0.96424 0.96621 0.96817
                          PC85    PC86    PC87    PC88    PC89    PC90    PC91
Standard deviation     0.95319 0.93714 0.92882 0.89236 0.88891 0.87637 0.86367
Proportion of Variance 0.00191 0.00185 0.00182 0.00168 0.00166 0.00162 0.00157
Cumulative Proportion  0.97008 0.97193 0.97375 0.97542 0.97709 0.97870 0.98028
                          PC92    PC93    PC94    PC95    PC96    PC97    PC98
Standard deviation     0.85877 0.85077 0.83298 0.81957 0.81333 0.80289 0.76496
Proportion of Variance 0.00155 0.00152 0.00146 0.00141 0.00139 0.00136 0.00123
Cumulative Proportion  0.98183 0.98335 0.98481 0.98623 0.98762 0.98898 0.99021
                         PC99   PC100   PC101   PC102   PC103   PC104   PC105
Standard deviation     0.7542 0.74544 0.71378 0.70698 0.70179 0.68516 0.66594
Proportion of Variance 0.0012 0.00117 0.00107 0.00105 0.00104 0.00099 0.00093
Cumulative Proportion  0.9914 0.99258 0.99365 0.99470 0.99574 0.99673 0.99766
                         PC106   PC107   PC108     PC109
Standard deviation     0.64548 0.59660 0.58250 8.508e-15
Proportion of Variance 0.00088 0.00075 0.00071 0.000e+00
Cumulative Proportion  0.99854 0.99929 1.00000 1.000e+00
pca.results.to.plot <- pca.results$x %>%
  as.data.frame() %>%
  rownames_to_column("leafcutter.name") %>%
  dplyr::select(leafcutter.name, PC1:PC6) %>%
  left_join(FullMetadata, by="leafcutter.name")

SamplesToProject <- FullMetadata %>%
  filter(!Experiment %in% c("Single high dose LCL")) %>%
  filter(!old.sample.name == "NewMolecule.C04-2") %>%
  pull(leafcutter.name)

pca.results.to.plot_AdditionProjections <- PSI.IntronsToInclude %>%
  dplyr::select(junc, all_of(SamplesToProject)) %>%
  column_to_rownames("junc") %>%
  scale() %>% t() %>% predict(pca.results, .) %>%
  as.data.frame() %>%
  rownames_to_column("leafcutter.name") %>%
  dplyr::select(leafcutter.name, PC1:PC6) %>%
  left_join(FullMetadata, by="leafcutter.name")


TreatmentColorsForLabels.FibroblastColorsAdded <-
FullMetadata %>%
  group_by(treatment) %>%
  filter(dose.nM == max(dose.nM) | treatment == "DMSO" | cell.type=="Fibroblast") %>%
  ungroup() %>%
  distinct(treatment, .keep_all=T) %>%
  arrange(dose.nM) %>%
  mutate(vjust=row_number()*1.2)

TreatmentColorsLabels.Layer.FibroblastColorsAdded <- geom_text(
    data = TreatmentColorsForLabels.FibroblastColorsAdded, 
    aes(label=treatment, color=color, vjust=vjust),
    y=Inf, x=Inf, hjust=1.05
  )

  
ggplot(pca.results.to.plot, aes(x=PC1, y=PC2, color=color, shape=Experiment)) +
  geom_text(size=3, aes(label=treatment)) +
  geom_point(data = pca.results.to.plot_AdditionProjections) +
  TreatmentColorsLabels.Layer.FibroblastColorsAdded +
  scale_color_identity() +
  theme_bw() +
  labs(title = "PCA using 475 GAGT introns\nAll samples", caption="Introns based on being significant in any contrast")

Version Author Date
a8c9152 Benjmain Fair 2023-02-23
ggplot(pca.results.to.plot, aes(x=PC3, y=PC4, color=color, shape=Experiment)) +
  geom_text(size=3, aes(label=treatment)) +
  geom_point(data = pca.results.to.plot_AdditionProjections) +
  TreatmentColorsLabels.Layer.FibroblastColorsAdded +
  scale_color_identity() +
  theme_bw() +
  labs(title = "PCA using 475 GAGT introns\nAll samples", caption="Introns based on being significant in any contrast")

Version Author Date
a8c9152 Benjmain Fair 2023-02-23

Ok this is actually somewhat interpretable. Again the main axis of specificity seems to be along branaplam vs risdiplam effects, and there is plenty of stuff in between. WD07 is actually 500nM risdiplam, which looks fitting in relation to previous experiments. I think perhaps WC05 (SMSM32) and WD08 (4F-Py) the most interesting point I see so far… as both replicates doesn’t neatly lie on either the risdiplam axis or the branaplam axis. Just browsing these molecules structure that ZT shared with me, these seem by my eye pretty structurally distinct. I would plot the structures of these in the PCA but ggimage is kind of tricky to install in my current R environment, so that may have to wait for later

  • update: I turned out making that plot with ggimage in a seperate notebook that I run in a different R environment… I wrote out the data for that plot in the block below and wrote a seperate notebook to just read in that data and plot…
FullMetadata %>%
  write_tsv("../output/20230223_FullMetadata.tsv")


bind_rows(pca.results.to.plot,
          pca.results.to.plot_AdditionProjections) %>%
  write_tsv("../output/20230223_PCA_Projections.tsv.gz")
 

PSI.IntronsToInclude %>%
  write_tsv("../output/20230223_IntronsToInclude.tsv.gz")

Let’s make sure these three molecules don’t have very different read depth or something…

pca.results.to.plot %>%
  inner_join(
    counts.plot.dat %>%
      dplyr::select(SampleName, lib.size)
    ) %>%
ggplot( aes(x=PC1, y=PC2, shape=Experiment)) +
  geom_text(size=3, aes(label=treatment, color=lib.size)) +
  geom_point(data = pca.results.to.plot_AdditionProjections) +
  scale_color_viridis_c(trans='log10') +
  # scale_color_identity() +
  theme_bw() +
  labs(title = "PCA using 475 GAGT introns\nAll samples", caption="Introns based on being significant in any contrast")

Version Author Date
0e2a360 Benjmain Fair 2023-02-24

Ok, that makes me feel better these three interesting candidate molecules aren’t just outliers in read depth.

Next I want to plot a heatmap of the correlation matrix based on these introns, I wonder if any clear clustering patterns will come out of this… Part of what I want to get out of this is to answer if these three outliers (WD08, WC05, WB12) are similar to each other, or if they are as distant from eachother as they are from risdiplam/branaplam.

SamplesToInclude <- FullMetadata %>%
  filter(Experiment %in% c("Single high dose LCL", "Dose response titration")) %>%
  filter(!old.sample.name == "NewMolecule.C04-2") %>%
  pull(leafcutter.name)

PSI.IntronsToInclude.cormat <- PSI.IntronsToInclude %>%
  dplyr::select(junc, all_of(SamplesToInclude)) %>%
  column_to_rownames("junc") %>%
  cor(method='s')

GetColorPallete <- function(MyVector, PalletteName = "Dark2"){
  UniqueFactors <- MyVector %>% unique()
  return (setNames(brewer.pal(n=length(UniqueFactors), name=PalletteName), UniqueFactors))
}



MetadataForCormat <- data.frame(leafcutter.name = PSI.IntronsToInclude.cormat %>%
  rownames()) %>%
  left_join(FullMetadata) %>%
  mutate(color = if_else(
    treatment %in% c("WD08", "WC05", "WB12"),
    "#e34a33",
    color
  )) %>%
  mutate(ExperimentColor = recode(Experiment, !!!GetColorPallete(FullMetadata$Experiment)
))


  
heatmap.2(PSI.IntronsToInclude.cormat, trace='none', RowSideColors = MetadataForCormat$color, ColSideColors = MetadataForCormat$color, labRow = MetadataForCormat$treatment, cexRow=0.2, labCol = MetadataForCormat$treatment, cexCol=0.2)

Version Author Date
0e2a360 Benjmain Fair 2023-02-24
pdf("../code/scratch/Heatmap.pdf")
heatmap.2(PSI.IntronsToInclude.cormat, trace='none', RowSideColors = MetadataForCormat$color, ColSideColors = MetadataForCormat$color, labRow = MetadataForCormat$treatment, cexRow=0.2, labCol = MetadataForCormat$treatment, cexCol=0.2)
dev.off()

Ok samples mostly just are clustering by effective dose, so you can’t really see the specificity effect here like you can in the second PC above… Well, you can see in the hierarchal clustering that both reps of WC05, a rep of WD08, and a rep of WB12 cluster together… And far away on another branch are the other replicates of WD08 and WB12…

so maybe their effects are similar? It’s just really hard to tell.

As a slightly less biased way of making this PCA and possibly reproducing the potential uniqueness of these 3 molecules, let’s try changing the criteria for including introns to be more inclusive.

IntronsToInclude <- leafcutter.sig %>%
  filter(p.adjust < 0.1) %>%
  inner_join(leafcutter.effects, by=c("treatment", "cluster")) %>%
  filter(abs(deltapsi) > 0.01) %>%
  distinct(name, .keep_all=T) %>%
  inner_join(Introns.annotations, by='name') %>%
  # filter(str_detect(seq, "^\\w\\wGAGT")) %>%
  pull(intron)

length(IntronsToInclude)
[1] 47065
PSI.IntronsToInclude <- PSI %>%
  filter(junc %in% IntronsToInclude) %>%
  dplyr::select(junc, A10.1:NewMolecule.E07.6) %>%
  drop_na()

dim(PSI.IntronsToInclude)
[1] 23416   173
SamplesToInclude <- FullMetadata %>%
  filter(Experiment %in% c("Single high dose LCL")) %>%
  filter(!old.sample.name == "NewMolecule.C04-2") %>%
  pull(leafcutter.name)


pca.results <- PSI.IntronsToInclude %>%
  dplyr::select(junc, all_of(SamplesToInclude)) %>%
  column_to_rownames("junc") %>%
  scale() %>% t() %>% prcomp(scale=T)


pca.results.to.plot <- pca.results$x %>%
  as.data.frame() %>%
  rownames_to_column("leafcutter.name") %>%
  dplyr::select(leafcutter.name, PC1:PC6) %>%
  left_join(FullMetadata, by="leafcutter.name") %>%
  mutate(color = if_else(
    treatment %in% c("WD08", "WC05", "WB12"),
    "#e34a33",
    color
  )) 

SamplesToProject <- FullMetadata %>%
  filter(!Experiment %in% c("Single high dose LCL")) %>%
  filter(!old.sample.name == "NewMolecule.C04-2") %>%
  pull(leafcutter.name)

pca.results.to.plot_AdditionProjections <- PSI.IntronsToInclude %>%
  dplyr::select(junc, all_of(SamplesToProject)) %>%
  column_to_rownames("junc") %>%
  scale() %>% t() %>% predict(pca.results, .) %>%
  as.data.frame() %>%
  rownames_to_column("leafcutter.name") %>%
  dplyr::select(leafcutter.name, PC1:PC6) %>%
  left_join(FullMetadata, by="leafcutter.name")

  
ggplot(pca.results.to.plot, aes(x=PC1, y=PC2, color=color, shape=Experiment)) +
  geom_text(size=3, aes(label=treatment)) +
  geom_point(data = pca.results.to.plot_AdditionProjections) +
  TreatmentColorsLabels.Layer.FibroblastColorsAdded +
  scale_color_identity() +
  theme_bw() +
  labs(title = "PCA using 23416 introns\nAll samples", caption="Introns based on being significant in any contrast")

Version Author Date
0e2a360 Benjmain Fair 2023-02-24
pca.results.to.plot %>%
  inner_join(
    counts.plot.dat %>%
      dplyr::select(SampleName, lib.size)
    ) %>%
ggplot(aes(x=PC1, y=PC2, shape=Experiment)) +
  geom_text(size=3, aes(label=treatment, color=lib.size)) +
  geom_point(data = pca.results.to.plot_AdditionProjections) +
  scale_color_viridis_c(trans='log10') +
  # scale_color_identity() +
  theme_bw() +
  labs(title = "PCA using 23416 introns\nAll samples", caption="Introns based on being significant in any contrast. Colored by depth")

Version Author Date
0e2a360 Benjmain Fair 2023-02-24
ggplot(pca.results.to.plot, aes(x=PC3, y=PC4, color=color, shape=Experiment)) +
  geom_text(size=3, aes(label=treatment)) +
  geom_point(data = pca.results.to.plot_AdditionProjections) +
  TreatmentColorsLabels.Layer.FibroblastColorsAdded +
  scale_color_identity() +
  theme_bw() +
  labs(title = "PCA using 23416 introns\nAll samples", caption="Introns based on being significant in any contrast")

Version Author Date
0e2a360 Benjmain Fair 2023-02-24
ggplot(pca.results.to.plot, aes(x=PC4, y=PC5, color=color, shape=Experiment)) +
  geom_text(size=3, aes(label=treatment)) +
  geom_point(data = pca.results.to.plot_AdditionProjections) +
  TreatmentColorsLabels.Layer.FibroblastColorsAdded +
  scale_color_identity() +
  theme_bw() +
  labs(title = "PCA using 23416 introns\nAll samples", caption="Introns based on being significant in any contrast")

Version Author Date
0e2a360 Benjmain Fair 2023-02-24

Ok interesting… when you look at all significant introns with abs(deltapsi) > 0.01, the first few PCs measure hidden technical variables that don’t seem to cluster biological replicates very well. deeper PCs seem to work better to find the dosage axis (PC4) branaplam vs risdiplam axis (PC5) and biological replicates still are close to one another… But in this space, a different set of molecules starts to show up, and perhaps on a different ‘arm’ of the data. I still see D08 sticking out.. WE02 also sticks out here along that axis, and WE02 was previously along the branaplam axis and is structurally most like branaplam.

Now I’m a bit concerned about the robustness of the PCA to interpret it as being a distinct thing.. Let’s try changing up the intron critera a tad a again and see how robust the PCA is…

IntronsToInclude <- leafcutter.sig %>%
  filter(p.adjust < 0.1) %>%
  inner_join(leafcutter.effects, by=c("treatment", "cluster")) %>%
  filter(logef > 0) %>%
  distinct(name, .keep_all=T) %>%
  inner_join(Introns.annotations, by='name') %>%
  filter(str_detect(seq, "^\\w\\wGAGT")) %>%
  pull(intron)

length(IntronsToInclude)
[1] 1138
PSI.IntronsToInclude <- PSI %>%
  filter(junc %in% IntronsToInclude) %>%
  dplyr::select(junc, A10.1:NewMolecule.E07.6) %>%
  drop_na()

dim(PSI.IntronsToInclude)
[1] 632 173
SamplesToInclude <- FullMetadata %>%
  filter(Experiment %in% c("Single high dose LCL")) %>%
  filter(!old.sample.name == "NewMolecule.C04-2") %>%
  pull(leafcutter.name)


pca.results <- PSI.IntronsToInclude %>%
  dplyr::select(junc, all_of(SamplesToInclude)) %>%
  column_to_rownames("junc") %>%
  scale() %>% t() %>% prcomp(scale=T)


pca.results.to.plot <- pca.results$x %>%
  as.data.frame() %>%
  rownames_to_column("leafcutter.name") %>%
  dplyr::select(leafcutter.name, PC1:PC6) %>%
  left_join(FullMetadata, by="leafcutter.name") %>%
  mutate(color = if_else(
    treatment %in% c("WD08", "WC05", "WB12"),
    "#e34a33",
    color
  )) 

SamplesToProject <- FullMetadata %>%
  filter(!Experiment %in% c("Single high dose LCL")) %>%
  filter(!old.sample.name == "NewMolecule.C04-2") %>%
  pull(leafcutter.name)

pca.results.to.plot_AdditionProjections <- PSI.IntronsToInclude %>%
  dplyr::select(junc, all_of(SamplesToProject)) %>%
  column_to_rownames("junc") %>%
  scale() %>% t() %>% predict(pca.results, .) %>%
  as.data.frame() %>%
  rownames_to_column("leafcutter.name") %>%
  dplyr::select(leafcutter.name, PC1:PC6) %>%
  left_join(FullMetadata, by="leafcutter.name")

  
ggplot(pca.results.to.plot, aes(x=PC1, y=PC2, color=color, shape=Experiment)) +
  geom_text(size=3, aes(label=treatment)) +
  geom_point(data = pca.results.to.plot_AdditionProjections) +
  TreatmentColorsLabels.Layer.FibroblastColorsAdded +
  scale_color_identity() +
  theme_bw() +
  labs(title = "PCA using 632 upregulated GAGT introns\nAll samples", caption="Introns based on being significant in any contrast")

Version Author Date
0e2a360 Benjmain Fair 2023-02-24
pca.results.to.plot %>%
  inner_join(
    counts.plot.dat %>%
      dplyr::select(SampleName, lib.size)
    ) %>%
ggplot(aes(x=PC1, y=PC2, shape=Experiment)) +
  geom_text(size=3, aes(label=treatment, color=lib.size)) +
  geom_point(data = pca.results.to.plot_AdditionProjections) +
  scale_color_viridis_c(trans='log10') +
  # scale_color_identity() +
  theme_bw() +
  labs(title = "PCA using 632 upregulated GAGT introns\nAll samples", caption="Introns based on being significant in any contrast. Colored by depth")

Version Author Date
0e2a360 Benjmain Fair 2023-02-24
ggplot(pca.results.to.plot, aes(x=PC3, y=PC4, color=color, shape=Experiment)) +
  geom_text(size=3, aes(label=treatment)) +
  geom_point(data = pca.results.to.plot_AdditionProjections) +
  TreatmentColorsLabels.Layer.FibroblastColorsAdded +
  scale_color_identity() +
  theme_bw() +
  labs(title = "PCA using 632 upregulated GAGT introns\nAll samples", caption="Introns based on being significant in any contrast")

Version Author Date
0e2a360 Benjmain Fair 2023-02-24
ggplot(pca.results.to.plot, aes(x=PC3, y=PC4, color=color, shape=Experiment)) +
  geom_text(size=3, aes(label=treatment)) +
  # geom_point(data = pca.results.to.plot_AdditionProjections) +
  TreatmentColorsLabels.Layer.FibroblastColorsAdded +
  scale_color_identity() +
  theme_bw() +
  labs(title = "PCA using 632 upregulated GAGT introns\nAll samples", caption="Introns based on being significant in any contrast")

Version Author Date
0e2a360 Benjmain Fair 2023-02-24
ggplot(pca.results.to.plot, aes(x=PC4, y=PC5, color=color, shape=Experiment)) +
  geom_text(size=3, aes(label=treatment)) +
  geom_point(data = pca.results.to.plot_AdditionProjections) +
  TreatmentColorsLabels.Layer.FibroblastColorsAdded +
  scale_color_identity() +
  theme_bw() +
  labs(title = "PCA using 632 upregulated GAGT introns\nAll samples", caption="Introns based on being significant in any contrast")

Version Author Date
0e2a360 Benjmain Fair 2023-02-24
ggplot(pca.results.to.plot, aes(x=PC4, y=PC5, color=color, shape=Experiment)) +
  geom_text(size=3, aes(label=treatment)) +
  # geom_point(data = pca.results.to.plot_AdditionProjections) +
  TreatmentColorsLabels.Layer.FibroblastColorsAdded +
  scale_color_identity() +
  theme_bw() +
  labs(title = "PCA using 632 upregulated GAGT introns\nAll samples", caption="Introns based on being significant in any contrast")

Version Author Date
0e2a360 Benjmain Fair 2023-02-24

Hmm, ok much like the first PCA. Of course these are surely mostly overlapping introns. But at least is sort of comforting. I thought looking at deeper PCs might reveal something… Mostly reveals that interpreting PCs is not always straightforward.

Also, if C05, D08, and B12 have unique properties, it is not so clear if are similar or distinct from each other.

First let’s write out these PCA, for later use…

pca.results.to.plot %>%
  write_tsv("../output/20230227_PCA_Projections.tsv.gz")

Investigation of the C05, D08, and B12

Let’s check the behavior of the hits (and all samples) on some particular introns or genes… The HTT, MYB, SMN2, STAT1

First let’s glance at the differential expression test results (which also include effect size estimates for these genes). Since gene expression changes are generally less noisy to estimate than splicing, we can also use these as proxies for risdiplam vs branaplam-specific effects for a set of known genes.

diff.expression <- fread("../code/DE_testing/ExpOf52_Results.txt.gz") %>%
  mutate(treatment = str_replace(treatment, "group(.+?)", "W\\1")) %>%
  separate(Geneid, into=c("EnsemblID", "symbol"), sep='_')

diff.expression %>%
  filter(symbol %in% c("STAT1", "HTT", "MYB", "SMN2", "SMN1", "GALC", "FOXM1")) %>%
  left_join(FullMetadata) %>%
  mutate(color = if_else(
    treatment %in% c("WD08", "WC05", "WB12"),
    "#e34a33",
    color
  )) %>%
  mutate(Sig = case_when(
    PValue < 0.001 ~ "!",
    PValue < 0.05 ~ "*",
    TRUE ~ NA_character_
  )) %>%
  ggplot(aes(x=treatment, y=logFC, fill=color)) +
  geom_col() +
  geom_text(aes(label=Sig), y=Inf, vjust=1) +
  scale_fill_identity() +
  facet_wrap(~symbol) +
  Rotate_x_labels +
  theme(axis.text.x = element_text(size=3))

Version Author Date
e996344 Benjmain Fair 2023-02-28
0e2a360 Benjmain Fair 2023-02-24

Hmm.. Offhand I can’t really remember which if any of these genes (besides HTT) should be branaplam or risdiplam-specific. I think one helpful way to plot this will be to incorporate titration series data somehow, but that is a bit of work. Another way will be to color all samples (some of which are clearly on the risdiplam or branaplam axis) by the degree to which they were on those axes in PC space. Let’s come up with a color scheme based on that idea…

For example, looking at the previous PCA plot…

If PC2 > 0 then blue, ifelse PC2 < 0, then green And PC2 is an alpha fill scale.

NewColors <- pca.results.to.plot %>%
  filter(Experiment == "Single high dose LCL") %>%
  group_by(treatment) %>%
  summarise(PC1 = mean(PC1), PC2=mean(PC2)) %>%
  mutate(NewColor = case_when(
    PC2 > 0 ~ "#08519c",
    PC2 < 0 ~ "#006d2c",
  ))

MinPC1 <- min(NewColors$PC1)
MaxPC1 <- max(NewColors$PC1)

NewColors <- NewColors %>%
  mutate(alpha = (PC1 - MinPC1)/(MaxPC1 - MinPC1)) %>%
  mutate(NewColor = if_else(
    treatment %in% c("WD08", "WC05", "WB12"),
    "#e34a33",
    NewColor
  ))

diff.expression %>%
  filter(symbol %in% c("STAT1", "HTT", "MYB", "SMN2", "SMN1", "GALC", "FOXM1")) %>%
  left_join(NewColors) %>%
  mutate(PC2 = if_else(NewColor == "#e34a33", 0, PC2)) %>%
  mutate(treatment = fct_reorder(treatment, PC2 * alpha)) %>%
  mutate(Sig = case_when(
    PValue < 0.001 ~ "!",
    PValue < 0.05 ~ "*",
    TRUE ~ NA_character_
  )) %>%
  ggplot(aes(x=treatment, y=logFC, fill=NewColor)) +
  geom_col(aes(alpha=alpha)) +
  geom_text(aes(label=Sig), y=Inf, vjust=1) +
  scale_fill_identity() +
  facet_wrap(~symbol) +
  Rotate_x_labels +
  theme(axis.text.x = element_text(size=6), legend.position='none') +
  labs(title="Expression changes in genes of interest", y="log2FC", caption=str_wrap("P<0.05 = *\nP<0.001 = !\nColor and  fill transparency based on PC1 & PC2 to distinguish risdiplam/branaplam axes", 30))

Version Author Date
32a80fd Benjmain Fair 2023-03-01
e996344 Benjmain Fair 2023-02-28
0e2a360 Benjmain Fair 2023-02-24
ggsave("../code/scratch/scratch.pdf", height=6, width=10)

Ok, so based on these subset of genes, the red molecules have some risdiplam-like effects and some branaplam-like effects. Let’s look more globally at the DE results..

XaxisColors <- NewColors %>%
  mutate(PC2 = if_else(NewColor == "#e34a33", 0, PC2)) %>%
  mutate(treatment = fct_reorder(treatment, PC2 * alpha)) %>%
  arrange(treatment) %>% pull(NewColor)


NumberDE.genes <- bind_rows(
diff.expression %>%
  filter(FDR < 0.01) %>%
  count(treatment) %>%
  mutate(FDR = 0.01),
diff.expression %>%
  filter(FDR < 0.05) %>%
  count(treatment) %>%
  mutate(FDR = 0.05),
diff.expression %>%
  filter(FDR < 0.1) %>%
  count(treatment) %>%
  mutate(FDR = 0.1)) %>%
  left_join(NewColors, by='treatment') %>%
  mutate(PC2 = if_else(NewColor == "#e34a33", 0, PC2)) %>%
  mutate(treatment = fct_reorder(treatment, PC2 * alpha)) %>%
  ggplot(aes(x=treatment, y=n)) +
  geom_bar(stat="identity",position = "identity", alpha=.15) +
  geom_text(y=Inf,  vjust=1, label="*", aes(color=NewColor, alpha=alpha)) +
  Rotate_x_labels +
  scale_color_identity() +
  theme(axis.text.x = element_text(size=6)) +
  theme(legend.position='none') +
  labs(title="Number of DE genes", x="treatment", caption=str_wrap("FDR 1%, 5%, 10%\nAsterisk Color for PC1/PC2 to approximate branaplam/risdiplam axes", 30))
NumberDE.genes

Version Author Date
32a80fd Benjmain Fair 2023-03-01
e996344 Benjmain Fair 2023-02-28
0e2a360 Benjmain Fair 2023-02-24

Ok wow WD08 is a real outlier there. Tons of DE genes… It also had the most leafcutter hits from this experiment. I’ll have to look more into that later. First let’s zoom in on the y=axis to better look at the other contrasts.

NumberDE.genes +
  coord_cartesian(ylim=c(0,20))

Version Author Date
32a80fd Benjmain Fair 2023-03-01
e996344 Benjmain Fair 2023-02-28
0e2a360 Benjmain Fair 2023-02-24

Ok so a typical contrast has more like 5-10 DE genes at a reasonable FDR threshold. It’s also worth noting that the branaplam like molecules, when roughly have similar amounts of differential splicing, have much less differential expression.

Let me go back to WD08… I’m wondering if this sample just had a crazy amount of read depth or something..

counts.plot.dat %>%
  filter(Experiment == "Single high dose LCL") %>%
  dplyr::select(SampleName, lib.size, treatment) %>%
  left_join(NewColors) %>%
  mutate(PC2 = if_else(NewColor == "#e34a33", 0, PC2)) %>%
  mutate(SampleName = fct_reorder(SampleName, PC2 * alpha)) %>%
  ggplot(aes(x=SampleName, y=lib.size)) +
  geom_col(aes(fill=NewColor, alpha=alpha)) +
  Rotate_x_labels +
  scale_fill_identity() +
  theme(axis.text.x = element_text(size=6)) +
  theme(legend.position='none')

Version Author Date
32a80fd Benjmain Fair 2023-03-01
e996344 Benjmain Fair 2023-02-28
0e2a360 Benjmain Fair 2023-02-24

Ok, so those three treatments, included WD08 are not huge outliers in terms of sequencing depth. Maybe there is something genuinely different about it that makes it have many DE genes and splicing changes.

Let’s go back to looking at genomewide GA-GT activation levels…

leafcutter.effects %>%
  inner_join(
    Introns.annotations %>%
    mutate(IsGAGT = if_else(str_detect(seq, "^\\w\\wGAGT"), "GA-GT", "not GA-GT")),
    by="name") %>%
  group_by(IsGAGT, treatment) %>%
  summarise(medLogef = median(logef)) %>%
  left_join(NewColors) %>%
  mutate(PC2 = if_else(NewColor == "#e34a33", 0, PC2)) %>%
  mutate(treatment = fct_reorder(treatment, PC2 * alpha)) %>%
  ggplot(aes(x=treatment, y=medLogef, fill=NewColor, alpha=alpha)) +
  geom_col() +
  scale_fill_identity() +
  coord_flip() +
  facet_wrap(~IsGAGT) +
  theme(axis.text.y = element_text(size=6)) +
  theme(legend.position='none') +
  labs(title="3 treatments in question still activate GA-GT\nto similar levels as branaplam/risdiplam -like treatments", y="median splicing logFC")

Version Author Date
32a80fd Benjmain Fair 2023-03-01
e996344 Benjmain Fair 2023-02-28
0e2a360 Benjmain Fair 2023-02-24

Ok I have to do this for all dinucleotides…

leafcutter.effects %>%
  inner_join(
    Introns.annotations %>%
    mutate(UpstreamOfDonor2BaseSeq = substr(seq, 3, 4)),
    by="name") %>%
  group_by(UpstreamOfDonor2BaseSeq, treatment) %>%
  summarise(medLogef = median(logef)) %>%
  left_join(NewColors) %>%
  mutate(PC2 = if_else(NewColor == "#e34a33", 0, PC2)) %>%
  mutate(treatment = fct_reorder(treatment, PC2 * alpha)) %>%
  ggplot(aes(y=treatment, x=UpstreamOfDonor2BaseSeq, fill=medLogef)) +
  geom_raster() +
  scale_fill_viridis_c() +
  theme(axis.text.y = element_text(size=6, colour = XaxisColors))

Version Author Date
32a80fd Benjmain Fair 2023-03-01
e996344 Benjmain Fair 2023-02-28
0e2a360 Benjmain Fair 2023-02-24

Ok, well firstly it is nice to see preferential activation of GA across the board. And there isn’t any other dinucleotide that pops out in the three red samples in question. And the degree of GA-GT activation doesn’t seem all that different from other treatments similar in PC1. I think this is particularly interesting for WD08 which at 3-4x as many differentially spliced clusters and ~100x as many DE genes as a typical treatment. I’m leaning towards thinking this molecule just might elicit non-splicing related toxic effects. Let’s count the number of differentially spliced NNGT introns for each effect to better understand this… I’m wondering whether those 3-4x as many differentially spliced introns are from non GAGT introns (eg indirect effects).

# leafcutter.effects %>%
#   inner_join(
#     Introns.annotations %>%
#     mutate(UpstreamOfDonor2BaseSeq = substr(seq, 3, 4)),
#     by="name") %>%
#   group_by(UpstreamOfDonor2BaseSeq, treatment) %>%
#   summarise(medLogef = median(logef)) %>%
#   left_join(NewColors) %>%
#   mutate(PC2 = if_else(NewColor == "#e34a33", 0, PC2)) %>%
#   mutate(treatment = fct_reorder(treatment, PC2 * alpha)) %>%
#   ggplot(aes(y=treatment, x=UpstreamOfDonor2BaseSeq, fill=medLogef)) +
#   geom_raster() +
#   scale_fill_viridis_c() +
#   theme(axis.text.y = element_text(size=6, colour = XaxisColors))

leafcutter.effects_and_sig <- leafcutter.effects %>%
  inner_join(leafcutter.sig) %>%
  inner_join(
    Introns.annotations %>%
    mutate(UpstreamOfDonor2BaseSeq = substr(seq, 3, 4)),
    by="name")

leafcutter.effects_and_sig %>%
  group_by(cluster, treatment) %>%
  mutate(ContainsGAGT = any(UpstreamOfDonor2BaseSeq=="GA")) %>%
  ungroup() %>%
  filter(p.adjust < 0.1) %>%
  distinct(cluster, treatment, .keep_all=T) %>%
  count(ContainsGAGT, treatment) %>%
  left_join(NewColors) %>%
  mutate(PC2 = if_else(NewColor == "#e34a33", 0, PC2)) %>%
  mutate(treatment = fct_reorder(treatment, PC2 * alpha)) %>%
  ggplot(aes(x=treatment, y=n, fill=ContainsGAGT)) +
  geom_col(position="stack") +
  coord_flip() +
  theme(axis.text.y = element_text(size=6, colour = XaxisColors), legend.position='bottom') +
  labs(title="WD08 has many non-GAGT effects", y="Number significant clusters, FDR<10%")

Version Author Date
32a80fd Benjmain Fair 2023-03-01
e996344 Benjmain Fair 2023-02-28
0e2a360 Benjmain Fair 2023-02-24

Ok, so now I am sort of worried the WD08 effect might be indirect effects. There is much more AS but much of it is not at GAGT introns, and also there is tons of tons of DE.

The other two interesting molecules were WC05 and WB12. Those ones I still don’t understand as well. Looking at the PCA again, WC05 is the larger deparature from the rsidiplam/branaplam axes than WB12. I’ll keep a particular eye on WC05.

I’m going to look at the -3 and -4 positions now. So I’ll just filter for the GAGT introns, and make a similar plot as the heatmap I plotted above for the -3 and -4 position …

leafcutter.effects_and_sig %>%
  filter(UpstreamOfDonor2BaseSeq == "GA") %>%
  mutate(NNGAGT = substr(seq, 1, 2)) %>%
  group_by(NNGAGT, treatment) %>%
  summarise(medLogef = median(logef)) %>%
  left_join(NewColors) %>%
  mutate(PC2 = if_else(NewColor == "#e34a33", 0, PC2)) %>%
  mutate(treatment = fct_reorder(treatment, PC2 * alpha)) %>%
  ggplot(aes(y=treatment, x=NNGAGT, fill=medLogef)) +
  geom_raster() +
  scale_fill_viridis_c() +
  theme(axis.text.y = element_text(size=6, colour = XaxisColors))

Version Author Date
32a80fd Benjmain Fair 2023-03-01
e996344 Benjmain Fair 2023-02-28
0e2a360 Benjmain Fair 2023-02-24

Ok, so similar to previous Bhattacharya et al, and our dose titration series, we can see the -4 and -3 positions distinguish the branaplam-like molecules vs the risdiplam-like molecules. WC05 and WB12 don’t look particularly special on this plot. If anything I think WC05 may look a bit more like risdiplam in this respect (I’m looking in particular at the AT column), consistent with its structural similarity with risdiplam. WC04 pops out as interesting, even though it didn’t necessarily pop out from the PCA. I do notice in the above plots that WC04 had a lot of non GAGT diff splicing and a larger number of DE genes. Could also be some indirect effects.

Let’s go back to the DE results. For now I’ll plot volcano plots of some select molecules to see what the top genes are for some samples of interest: WC05, WB12, and some reference samples to represent risdiplam like effects (WB09, WB10) and branaplam like effects (WE02, WE04)

diff.expression %>%
  filter(treatment %in% c("WC05", "WB12", "WE04", "WE02", "WB09","WB10")) %>%
  mutate(IsSig = FDR<0.1) %>%
  mutate(label = if_else(
    IsSig,
    symbol,
    NA_character_
  )) %>%
  mutate(treatment = recode(treatment,
    "WE02" = "WE02; branaplam-like",
    "WE04" = "WE04; branaplam-like",
    "WB09" = "WB09; risdiplam-like",
    "WB10" = "WB10; risdiplam-like",
  )) %>%
  ggplot(aes(x=logFC, y=-log10(PValue), color=IsSig)) +
  geom_point() +
  scale_color_manual(values=c(`TRUE`="red", `FALSE`="black")) +
  geom_text_repel(aes(label=label), size=3, max.overlaps=) +
  facet_wrap(~treatment, scales = "free") +
  labs("DE results, focusing on WB12 and WC05", color="FDR<10%")

Version Author Date
32a80fd Benjmain Fair 2023-03-01
e996344 Benjmain Fair 2023-02-28
0e2a360 Benjmain Fair 2023-02-24

Ok honing in on some genes, in some ways WC05 and WB12 are more risdiplam-like (DENND5A), and in some other ways they are more branaplam like (CRYL1), and in some ways they might be like niether (ZNF837)…

Let’s plot these genes and a few others more closely…

diff.expression %>%
  filter(symbol %in% c("HTT", "RCC1", "CRYL1", "ZNF837", "DENND5A", "FHOD3", "STAT1", "PITPNB")) %>%
  left_join(NewColors) %>%
  mutate(PC2 = if_else(NewColor == "#e34a33", 0, PC2)) %>%
  mutate(treatment = fct_reorder(treatment, PC2 * alpha)) %>%
  mutate(Sig = case_when(
    PValue < 0.001 ~ "!",
    PValue < 0.05 ~ "*",
    TRUE ~ NA_character_
  )) %>%
  ggplot(aes(x=treatment, y=logFC, fill=NewColor)) +
  geom_col(aes(alpha=alpha)) +
  geom_text(aes(label=Sig), y=Inf, vjust=1) +
  scale_fill_identity() +
  facet_wrap(~symbol) +
  Rotate_x_labels +
  theme(axis.text.x = element_text(size=5), legend.position='none') +
  labs(title="Expression changes in genes of interest", y="log2FC", caption=str_wrap("P<0.05 = *\nP<0.001 = !\nColor and  fill transparency based on PC1 & PC2 to distinguish risdiplam/branaplam axes", 30))

Version Author Date
32a80fd Benjmain Fair 2023-03-01
e996344 Benjmain Fair 2023-02-28

There are a few treatments in this plot that also catch my eye in HTT and DENND5A… Let’s look at those closer..

diff.expression %>%
  filter(symbol %in% c("HTT", "DENND5A")) %>%
  left_join(NewColors) %>%
  mutate(PC2 = if_else(NewColor == "#e34a33" | treatment %in% c("WC10", "WC11", "WC12"), 0, PC2)) %>%
  mutate(treatment = fct_reorder(treatment, PC2 * alpha)) %>%
  mutate(Sig = case_when(
    PValue < 0.001 ~ "!",
    PValue < 0.05 ~ "*",
    TRUE ~ NA_character_
  )) %>%
  ggplot(aes(x=treatment, y=logFC, fill=NewColor)) +
  geom_col(aes(alpha=alpha)) +
  geom_text(aes(label=Sig), y=Inf, vjust=1) +
  scale_fill_identity() +
  facet_wrap(~symbol) +
  Rotate_x_labels +
  theme(axis.text.x = element_text(size=5), legend.position='none') +
  labs(title="Expression changes in genes of interest", y="log2FC", caption=str_wrap("P<0.05 = *\nP<0.001 = !\nColor and  fill transparency based on PC1 & PC2 to distinguish risdiplam/branaplam axes. C10, C11, C12 reordered by reds", 30))

Version Author Date
32a80fd Benjmain Fair 2023-03-01
e996344 Benjmain Fair 2023-02-28

Let’s repeat but just consider more systematically any DE gene found in C05 or B12 with FDR<20%

diff.expression %>%
  filter(treatment %in% c("WC05", "WB12")) %>%
  filter( FDR<0.2) %>%
  distinct(EnsemblID) %>%
  left_join(diff.expression) %>%
  left_join(NewColors) %>%
  mutate(PC2 = if_else(NewColor == "#e34a33", 0, PC2)) %>%
  mutate(treatment = fct_reorder(treatment, PC2 * alpha)) %>%
  mutate(Sig = case_when(
    PValue < 0.001 ~ "!",
    PValue < 0.05 ~ "*",
    TRUE ~ NA_character_
  )) %>%
  ggplot(aes(x=treatment, y=logFC, fill=NewColor)) +
  geom_col(aes(alpha=alpha)) +
  geom_text(aes(label=Sig), y=Inf, vjust=1) +
  scale_fill_identity() +
  facet_wrap(~symbol) +
  Rotate_x_labels +
  theme(axis.text.x = element_text(size=5), legend.position='none') +
  labs(title="Expression changes in genes of interest", y="log2FC", caption=str_wrap("P<0.05 = *\nP<0.001 = !\nColor and  fill transparency based on PC1 & PC2 to distinguish risdiplam/branaplam axes", 30))

Version Author Date
32a80fd Benjmain Fair 2023-03-01
# ggsave("../code/scratch/scratch.pdf", height=8, width=12)

Ok, based on this I’m not sure there are any astoundingly specific DE changes with WC05 or WB12… Sure in some ways it is unique becase it may be risdiplam like, and in some other ways it may be more branaplam like, and there are some changes in effect sizes, but no totally new DE discoveries.

I think it might be better to just report this to Jingxin before trying to make a list of things to verify by qPCR… That doesn’t seem to interesting to me given these results.

Well, actually, let’s at least do the analgous analysis but for splicing. And if that looks at all promising, then I can consider doing a more careful differential splicing or differential expression to directly compare the WC05 or WB12 treatments to more risdiplam like treatments or more branaplam like treatments to find what makes them unique.

Still to do that…*

Next thing I am curious about is following up on the WC04 effect at the -4 and -3 position. It seemed to have a preference for CGGA|GT and GGGA|GT that was not found in either risdiplam or branaplam like molecules. First let me verify this is really a believable effect and not just noise due to sampling a very low number of introns or something.

First let’s see how many introns that median effect was based on and if it is significant…

For sake of not making the plot too big, let’s just include a few hand picked dinucleotides based on previous plot.

leafcutter.effects_and_sig %>%
  filter(UpstreamOfDonor2BaseSeq == "GA") %>%
  mutate(NNGAGT = substr(seq, 1, 2)) %>%
  filter(NNGAGT %in% c("AA","AT", "CA","CG", "GG", "TA", "TG", "TT")) %>%
  group_by(NNGAGT, treatment) %>%
  summarise(
    medLogef = median(logef),
    n=n(),
    MannWhitneyP=format.pval(wilcox.test(logef, alternative="greater")$p.value, 2)) %>%
  left_join(NewColors) %>%
  mutate(PC2 = if_else(NewColor == "#e34a33", 0, PC2)) %>%
  mutate(treatment = fct_reorder(treatment, PC2 * alpha)) %>%
  ggplot(aes(y=treatment, x=NNGAGT, fill=medLogef)) +
  geom_raster() +
  geom_text(color='red', size=3, aes(label = paste(n, MannWhitneyP, sep='|'))) +
  scale_fill_viridis_c() +
  theme(axis.text.y = element_text(size=6, colour = XaxisColors))

Version Author Date
32a80fd Benjmain Fair 2023-03-01
e996344 Benjmain Fair 2023-02-28

Oh ok, actually those weird C04 effects are because that was the treatment where one of the replicates was a failed sample, that I have been excluding from some (but mistakenly not all) analyses.

Let’s also plot the effects for that new bulge (guUaag) that I hypothesized in previous experiments is affected by these molecules (albeit to a lesser extent than gA|gu bulge).

leafcutter.effects_and_sig %>%
  dplyr::select(treatment, intron, seq, logef) %>%
  pivot_wider(names_from = "treatment", values_from = "logef") %>%
  mutate(
    `1_ncag|guaagn` = str_detect(seq, "^\\wCAGGTAAG\\w"),
    `2_nagA|guaagn` = str_detect(seq, "^\\wAGAGTAAG\\w"),
    `3_nagG|guaagn` = str_detect(seq, "^\\wAGGGTAAG\\w"),
    `4_nagU|guaagn` = str_detect(seq, "^\\wAGTGTAAG\\w"),
    # `5.1_uagA|guaagn` = str_detect(seq, "^TAGAGTAAG\\w"),
    # `5.2_aagA|guaagn` = str_detect(seq, "^AAGAGTAAG\\w"),
    `5.3_cagA|guaagn` = str_detect(seq, "^CAGAGTAAG\\w"),
    `5.4_augA|guaagn` = str_detect(seq, "^ATGAGTAAG\\w"),
    `6_nnng|guUaag` = str_detect(seq, "^\\w\\w\\wGGTTAAG"),
    `7_nnng|guAaag` = str_detect(seq, "^\\w\\w\\wGGTAAAG"),
  ) %>%
  # filter_at(vars(matches("^\\d.+_.+$",perl=T)),any_vars(. == T)) %>%
  gather(key="Pattern", value="IsMatch", matches("^\\d.+",perl=T)) %>%
  # pull(Pattern) %>% unique()
  filter(IsMatch) %>%
  gather(key="treatment", value="logef", matches("^W.+")) %>%
  group_by(Pattern, treatment) %>%
  summarise(medLogef = median(logef, na.rm = T),
            p = format.pval(wilcox.test(logef, alternative="greater")$p.value, 2),
            n = sum(IsMatch)) %>%
  ungroup() %>%
  separate(Pattern, into = c("PlotOrder", "Pattern"), sep = "_", convert=T) %>%
  arrange(PlotOrder) %>%
  left_join(NewColors) %>%
  mutate(PC2 = if_else(NewColor == "#e34a33", 0, PC2)) %>%
  mutate(treatment = fct_reorder(treatment, PC2 * alpha)) %>%
  ggplot(aes(y=treatment, x=Pattern, fill=medLogef)) +
  geom_raster() +
  geom_text(color='red', size=2, aes(label = paste(n, p, sep='|'))) +
  scale_fill_gradient2(name=str_wrap("Median dose-response effect", 10)) +
  Rotate_x_labels +
  theme(axis.text.y = element_text(size=6, colour = XaxisColors))

Version Author Date
32a80fd Benjmain Fair 2023-03-01
e996344 Benjmain Fair 2023-02-28

Ok, there really isn’t anything to be said of that. We just lack power to say anything about the hypothesized effect on the guUaag bulge. We Again, we can’t even really see the effects at the GT|GT introns which I previously saw at high concentrations and have been documented in other lit.

Ok, I think the last thing I want to try is to find the C05/B12-specific splicing changes by doing a direct contrast with branaplam-like and risdiplam-like molecules..

Let’s replot the PCA, and highlight the samples I proposed picking for leafcutter contrasts. I’ll do 3 contrasts for each sample of interest, labelled RisdiLike_High, RisdiLike_Med, and BranadLike_Med. I’ll write out the leafcutter groups files for those contrasts, perform the contrast (in the snakemake), and then explore results in later code chunks:

dir.create("../output/CustomLeafcutterGroupFiles")

Samples_of_interest <- c("WD08", "WC05", "WB12")
RisdiLike_Med_Samples <- c("WB06", "WB09", "WB10")
RisdiLike_High_Samples <- c("WD05", "WD07", "WB09")
BranaLike_Med_Samples <- c("WE04", "WE02", "WD10")


NewColors %>%
  dplyr::select(treatment, NewColor, alpha) %>%
  inner_join(pca.results.to.plot) %>%
ggplot(aes(x=PC1, y=PC2, color=NewColor)) +
  geom_point(size=3, aes(alpha=alpha)) +
  geom_text_repel(data = . %>%
            filter(treatment %in% c(Samples_of_interest, RisdiLike_Med_Samples)),
          aes(label=treatment)) +
  # geom_point(data = pca.results.to.plot_AdditionProjections) +
  scale_color_identity() +
  theme_bw() +
  labs(title = "PCA using 632 upregulated GAGT introns\nAll samples", caption="Samples in RisdiLike_Med contrast labelled")

Version Author Date
32a80fd Benjmain Fair 2023-03-01
NewColors %>%
  dplyr::select(treatment, NewColor, alpha) %>%
  inner_join(pca.results.to.plot) %>%
ggplot(aes(x=PC1, y=PC2, color=NewColor)) +
  geom_point(size=3, aes(alpha=alpha)) +
  geom_text_repel(data = . %>%
            filter(treatment %in% c(Samples_of_interest, RisdiLike_High_Samples)),
          aes(label=treatment)) +
  # geom_point(data = pca.results.to.plot_AdditionProjections) +
  scale_color_identity() +
  theme_bw() +
  labs(title = "PCA using 632 upregulated GAGT introns\nAll samples", caption="Samples in RisdiLike_High contrast labelled")

Version Author Date
32a80fd Benjmain Fair 2023-03-01
NewColors %>%
  dplyr::select(treatment, NewColor, alpha) %>%
  inner_join(pca.results.to.plot) %>%
ggplot(aes(x=PC1, y=PC2, color=NewColor)) +
  geom_point(size=3, aes(alpha=alpha)) +
  geom_text_repel(data = . %>%
            filter(treatment %in% c(Samples_of_interest, BranaLike_Med_Samples)),
          aes(label=treatment)) +
  # geom_point(data = pca.results.to.plot_AdditionProjections) +
  scale_color_identity() +
  theme_bw() +
  labs(title = "PCA using 632 upregulated GAGT introns\nAll samples", caption="Samples in BranaLike_Med contrast labelled")

Version Author Date
32a80fd Benjmain Fair 2023-03-01
ControlSamples <- data.frame(BranaLike_Med_Samples, RisdiLike_High_Samples, RisdiLike_Med_Samples) %>%
  gather(key="ContrastName", value = "treatment") %>%
  mutate(Group = "Control") %>%
  left_join(FullMetadata) %>%
  dplyr::select(old.sample.name, Group, ContrastName)

for (treatment in Samples_of_interest){
  data.frame(treatment=treatment, ContrastName = c("BranaLike_Med_Samples", "RisdiLike_High_Samples", "RisdiLike_Med_Samples")) %>%
    left_join(FullMetadata) %>%
    dplyr::select(old.sample.name, ContrastName) %>%
    mutate(Group = "Treatment") %>%
    bind_rows(ControlSamples) %>%
    arrange(ContrastName, Group, old.sample.name) %>%
    group_by(ContrastName) %>%
    group_walk(~ write_tsv(.x, paste0("../output/CustomLeafcutterGroupFiles/Contrast_", treatment, "_vs_",.y$ContrastName, "groups.tsv"), col_names = F))
}

And then I added rules to the snakemake pipeline to run these contrasts:

snakemake GatherCustomContrasts

Now let’s explore the results..

leafcutter.effects.CustomContrasts <- list.files(path="../code/SplicingAnalysis/Exp52_CustomContrasts", pattern="^.+_effect_sizes.txt", full.names=T) %>%
  setNames(str_replace(., "../code/SplicingAnalysis/Exp52_CustomContrasts/(.+?)_effect_sizes.txt", "\\1")) %>%
  lapply(fread) %>%
  bind_rows(.id="Contrast") %>%
  mutate(name = str_replace(intron, "(.+?):(.+?):(.+?):clu.+([+-])", "\\1_\\2_\\3_\\4")) %>%
  mutate(cluster = str_replace(intron, "(.+?):.+?:.+?:(clu.+[+-])", "\\1:\\2"))

leafcutter.sig.CustomContrasts <-list.files(path="../code/SplicingAnalysis/Exp52_CustomContrasts", pattern="^.+_cluster_significance.txt", full.names=T) %>%
  setNames(str_replace(., "../code/SplicingAnalysis/Exp52_CustomContrasts/(.+?)_cluster_significance.txt", "\\1")) %>%
  lapply(fread) %>%
  bind_rows(.id="Contrast")

leafcutter.effects.Merged.CustomContrasts <- leafcutter.effects.CustomContrasts %>%
  inner_join(leafcutter.sig.CustomContrasts, by=c('cluster', 'Contrast')) %>%
  left_join(Introns.annotations, by="name") %>%
  mutate(
    TestTreatment = str_replace(Contrast,"^Contrast_(.+?)_vs_(.+?)_Samplesgroups$", "\\1"),
    ControlTreatment = str_replace(Contrast,"^Contrast_(.+?)_vs_(.+?)_Samplesgroups$", "\\2"),
    UpstreamOfDonor2BaseSeq = substr(seq, 3, 4)
  )

leafcutter.effects.Merged.CustomContrasts$TestTreatment %>% unique()
[1] "WB12" "WC05" "WD08"
leafcutter.effects.Merged.CustomContrasts %>%
  group_by(cluster, TestTreatment, ControlTreatment) %>%
  mutate(ContainsGAGT = any(UpstreamOfDonor2BaseSeq=="GA")) %>%
  ungroup() %>%
  filter(p.adjust < 0.1) %>%
  distinct(cluster, TestTreatment, ControlTreatment, .keep_all=T) %>%
  count(ContainsGAGT, TestTreatment, ControlTreatment) %>%
  ggplot(aes(x=ControlTreatment, y=n, fill=ContainsGAGT)) +
  geom_col(position="stack") +
  facet_wrap(~TestTreatment, scales="free") +
  theme(legend.position='bottom') +
  Rotate_x_labels +
  labs(title=str_wrap("contrasts vs brana-like and risdi-like treatments", 40), y="Number significant clusters, FDR<10%")

Version Author Date
32a80fd Benjmain Fair 2023-03-01
e996344 Benjmain Fair 2023-02-28

Ok, so there are fair number of significant differences. That was totally expected. but of course I am particulary interseted in the GAGT introns that are more activated in C05/D08/B12

leafcutter.effects.Merged.CustomContrasts %>%
  group_by(cluster, TestTreatment, ControlTreatment) %>%
  mutate(ContainsGAGT = any(UpstreamOfDonor2BaseSeq=="GA")) %>%
  ungroup() %>%
  filter(p.adjust < 0.1) %>%
  filter(ContainsGAGT) %>%
  mutate(sign = sign(logef)) %>%
  count(sign, TestTreatment, ControlTreatment) %>%
  ggplot(aes(x=ControlTreatment, y=n, fill=as.factor(sign))) +
  geom_col(position="stack") +
  facet_wrap(~TestTreatment, scales="free") +
  theme(legend.position='bottom') +
  Rotate_x_labels +
  labs(title=str_wrap("contrasts vs brana-like and risdi-like treatments", 40), y="Number significant GAGT introns, FDR<10%")

Version Author Date
32a80fd Benjmain Fair 2023-03-01
e996344 Benjmain Fair 2023-02-28

Ok, about the same number of upregulated vs downregulated introns in each contrast. What I am really after I should say are the introns that are upregulated in all contrasts. So I’ll filter for GA|GT introns that are FDR<10% significant in at least one contrast, and nominally significant in all contrasts (P<0.05), and with a effect size direction favoring upregulation in the treatment of interest.

IntronsActivatedByTreatmentsOfInterest <- leafcutter.effects.Merged.CustomContrasts %>%
  filter(UpstreamOfDonor2BaseSeq == "GA") %>%
  group_by(intron, TestTreatment) %>%
  filter(any(p.adjust < 0.1)) %>%
  filter(all(p < 0.05)) %>%
  filter(all(logef > 0)) %>%
  filter(all(deltapsi > 0.05)) %>%
  ungroup()

IntronsActivatedByTreatmentsOfInterest %>%
  dplyr::select(intron, Treatment, Control, TestTreatment, ControlTreatment ) %>%
  group_by(TestTreatment, intron) %>%
  mutate(MeanTreatment = mean(Treatment)) %>%
  distinct(intron, TestTreatment, ControlTreatment, .keep_all=T) %>%
  dplyr::select(-Treatment) %>%
  pivot_wider(names_from = "ControlTreatment", values_from="Control") %>%
  gather(key="Group", value="IntronicPSI", -c("intron", "TestTreatment")) %>%
  mutate(Group= relevel(factor(Group), ref="MeanTreatment")) %>%
  mutate(label = paste(TestTreatment, intron, sep='\n')) %>%
  ggplot(aes(x=Group, y=IntronicPSI, fill=Group)) +
  geom_col() +
  facet_wrap(~label, scales="free") +
  theme(axis.title.x=element_blank(),
        axis.text.x=element_blank(),
        axis.ticks.x=element_blank(),
        strip.text.x = element_text(size = 5)) +
  labs(title="GA-GT introns most specific to samples of interest", caption=str_wrap("DeltaPSI>5% in all contrasts, FDR<10% in at least one contrast, P<0.05 in all contrasts", 40))

Version Author Date
32a80fd Benjmain Fair 2023-03-01
e996344 Benjmain Fair 2023-02-28
# ggsave("../code/scratch/scratch.pdf", height=6, width=12)

Ok these are the kinds of introns I’m looking for, that Jingxin can consider validating as specific to these three molecules of interest. Let’s look the original data for each sample for these sets of introns… And then double check in the genome browser.

IntronsActivatedByTreatmentsOfInterest.vec <- IntronsActivatedByTreatmentsOfInterest %>% pull(intron) %>% unique()

AllSamplesInContrasts <- c(Samples_of_interest, RisdiLike_Med_Samples, RisdiLike_High_Samples, BranaLike_Med_Samples)


PSI %>%
  filter(junc %in% IntronsActivatedByTreatmentsOfInterest.vec) %>%
  dplyr::select(junc, contains("NewMolecule")) %>%
  gather(key = "leafcutter.name", value="IntronicPSI", -junc) %>%
  left_join(FullMetadata, by="leafcutter.name") %>%
  inner_join(NewColors) %>%
  mutate(PC2 = if_else(NewColor == "#e34a33" | treatment %in% c("WC10", "WC11", "WC12"), 0, PC2)) %>%
  mutate(treatment = fct_reorder(treatment, PC2 * alpha)) %>%
  filter(treatment %in% AllSamplesInContrasts) %>%
  # sample_n_of(5, junc) %>%
  ggplot(aes(x=treatment, y=IntronicPSI, color=NewColor)) +
  geom_text(aes(alpha=alpha, label=treatment), size=3) +
  facet_wrap(~junc, scales="free") +
  scale_color_identity() +
  theme(axis.title.x=element_blank(),
        axis.text.x=element_blank(),
        axis.ticks.x=element_blank(),
        strip.text.x = element_text(size = 5),
        legend.position='none') +
  labs(title="Candidate treatment-of-interest-specific effects", x="Treatments, roughly in order", y="Intronic PSI")

Version Author Date
32a80fd Benjmain Fair 2023-03-01
e996344 Benjmain Fair 2023-02-28

Ok, I guess that seems reasonable based on the previous plots. I don’t think there is a bug in my code.

Ok, let’s include all samples in these plots now, to see to what extent these candidates are just specific to the molecule of interest.

PSI %>%
  filter(junc %in% IntronsActivatedByTreatmentsOfInterest.vec) %>%
  dplyr::select(junc, contains("NewMolecule")) %>%
  gather(key = "leafcutter.name", value="IntronicPSI", -junc) %>%
  left_join(FullMetadata, by="leafcutter.name") %>%
  inner_join(NewColors) %>%
  mutate(PC2 = if_else(NewColor == "#e34a33" | treatment %in% c("WC10", "WC11", "WC12"), 0, PC2)) %>%
  mutate(treatment = fct_reorder(treatment, PC2 * alpha)) %>%
  # sample_n_of(5, junc) %>%
  ggplot(aes(x=treatment, y=IntronicPSI, color=NewColor)) +
  geom_text(aes(alpha=alpha, label=treatment), size=3) +
  facet_wrap(~junc, scales="free") +
  scale_color_identity() +
  theme(axis.title.x=element_blank(),
        axis.text.x=element_blank(),
        axis.ticks.x=element_blank(),
        strip.text.x = element_text(size = 5),
        legend.position='none') +
    labs(title="Candidate treatment-of-interest-specific effects", x="Treatments, roughly in order", y="Intronic PSI")

Version Author Date
32a80fd Benjmain Fair 2023-03-01
e996344 Benjmain Fair 2023-02-28

Ok, there are a few D08 specific effects, but perhaps that isn’t surprising given previous plots that show so many effects in this particular treatment.

In any case, as promised, I’ll write out some of the most promising things, including FHOD3, as junctions worthy of validation… Maybe the better approach for identifying the worthy candidate list this is just to forget these specific contrasts, and look at the DE plots I made, and intersect with potential GA|GT intron drivers…

GenesToHighlight <- c("HSPA1L", "CRYL1", "FHOD3", "ZFYVE26")

diff.expression %>%
  filter(symbol %in% GenesToHighlight) %>%
  left_join(NewColors) %>%
  mutate(PC2 = if_else(NewColor == "#e34a33", 0, PC2)) %>%
  mutate(treatment = fct_reorder(treatment, PC2 * alpha)) %>%
  mutate(Sig = case_when(
    PValue < 0.001 ~ "!",
    PValue < 0.05 ~ "*",
    TRUE ~ NA_character_
  )) %>%
  ggplot(aes(x=treatment, y=logFC, fill=NewColor)) +
  geom_col(aes(alpha=alpha)) +
  geom_text(aes(label=Sig), y=Inf, vjust=1) +
  scale_fill_identity() +
  facet_wrap(~symbol) +
  Rotate_x_labels +
  theme(axis.text.x = element_text(size=5), legend.position='none') +
  labs(title="Expression changes in genes of interest", y="log2FC", caption=str_wrap("P<0.05 = *\nP<0.001 = !\nColor and  fill transparency based on PC1 & PC2 to distinguish risdiplam/branaplam axes", 30))

Version Author Date
32a80fd Benjmain Fair 2023-03-01

These 4 genes are examples of how these red treatments may sometimes be more branaplam like on some genes, and more risdiplam like on others. Now let’s try and find the GA-GT introns drivers for these genes…

For convenience let’s look from the data table I wrote out from estimating EC50s from the dose response experiment

GA.GT.introns_FromDoseResponseExperiment <- read_tsv("../output/EC50Estimtes.FromPSI.txt.gz")

GA.GT.introns_FromDoseResponseExperiment %>%
  separate_rows(gene_names, sep=",") %>%
  filter(gene_names %in% GenesToHighlight)
# A tibble: 1 × 34
  junc        `#Chrom`  start    end gid   strand.y seq   Donor.score gene_names
  <chr>       <chr>     <dbl>  <dbl> <chr> <chr>    <chr>       <dbl> <chr>     
1 chr13:2040… chr13    2.04e7 2.04e7 chr1… -        GAGA…        5.45 CRYL1     
# … with 25 more variables: gene_ids <chr>, SpliceDonor <chr>,
#   UpstreamSpliceAcceptor <chr>, IntronType <chr>, Steepness <dbl>,
#   LowerLimit <dbl>, UpperLimit <dbl>, ED50_Branaplam <dbl>, ED50_C2C5 <dbl>,
#   ED50_Risdiplam <dbl>, spearman.coef.Branaplam <dbl>,
#   spearman.coef.C2C5 <dbl>, spearman.coef.Risdiplam <dbl>,
#   EC.Ratio.Test.Estimate_Branaplam.C2C5 <dbl>,
#   EC.Ratio.Test.Estimate_Branaplam.Risdiplam <dbl>, …

Ok actually that didn’t find much… And according to that previous data, CRYL1 GA-GT intron as actually more risdiplam-specific… weird. Ok, let’s just go back to trying to find the GA-GT intron from the leafcutter results…

leafcutter.effects_and_sig %>%
  filter(UpstreamOfDonor2BaseSeq=="GA") %>%
  separate_rows(gene_names, sep=",") %>%
  filter(gene_names %in% GenesToHighlight) %>%
  distinct(intron, gene_names) %>%
  kable()
intron gene_names
chr13:20404741:20405904:clu_31084_- CRYL1
chr18:36742468:36742737:clu_40538_+ FHOD3
leafcutter.effects_and_sig %>%
  filter(UpstreamOfDonor2BaseSeq=="GA") %>%
  separate_rows(gene_names, sep=",") %>%
  filter(gene_names %in% GenesToHighlight) %>%
  # distinct(intron, .keep_all=T) %>%
  dplyr::select(intron, treatment_PSI, DMSO_PSI, treatment, gene_names) %>%
  group_by(intron) %>%
  mutate(DMSO_PSI = mean(DMSO_PSI)) %>%
  ungroup() %>%
  pivot_wider(names_from="treatment", values_from="treatment_PSI", names_glue="{treatment}_PSI") %>%
  dplyr::select(intron, gene_names, everything()) %>%
  gather(key="treatment", value="IntronPSI", -c("gene_names", "intron")) %>%
  mutate(treatment = str_replace(treatment, "^(.+?)_PSI", "\\1")) %>%
  mutate(label=paste(intron, gene_names, sep='\n')) %>%
  left_join(NewColors) %>%
  mutate(PC2 = if_else(NewColor == "#e34a33", 0, PC2)) %>%
  mutate(treatment = fct_reorder(treatment, PC2 * alpha)) %>%
  ggplot(aes(x=treatment, y=IntronPSI, fill=NewColor)) +
  geom_col(aes(alpha=alpha)) +
  scale_fill_identity() +
  facet_wrap(~label, scales="free") +
  Rotate_x_labels +
  theme(legend.position='none', axis.text.x = element_text(size=5))

Version Author Date
32a80fd Benjmain Fair 2023-03-01

Ok, so I recovered the same CRYL1 intron, and also the FHOD3 intron. Still couldn’t find the intron for the other two genes… I think I am satisfied enough just to say these two introns might be good to validate molecule-specific effects, and also probably any number of the risdiplam-specific or branaplam-specific introns that I previously shared from the dose-response data.

Also, looking at the genome browser of these two introns listed above it becomes obvious what the exon coordinates are if you were to design qPCR probes for validation… FHOD3 is particularly interesting… Much like I previously observed in HTT, the actual intron with the GA-GT 5’ss seems to be mostly retained in the treatment - it is the upstream cryptic intron that more obviously gets excised in the treatment. This suggests that the treatment may aid in U1 binding and activation of exon definition without necessarily activating the 5’ss bound to the molecule. This may also somewhat explain my previous observation that these molecules induce much cryptic 3’ss that aren’t in a leafcutter cluster with a cryptic 5’ss.

Here is an screenshot of what I mean.

[FHOD3]

PCA on subset of scaffolds

Jingxin shared a powerpoint in which he color coded samples based on their scaffolds… Using these and some of my custom observations, I wrote factors for each treatment. I’m going to save that data replot PCA a few different ways for exploration.

TreatmentFactors <- read_csv("../code/DataNotToCommit/CompoundsAndManualAnnotationFactors.csv") %>%
  mutate(treatment = paste0("W", Well))

First let me replot the PCA the way I have been plotting it previously with colors consistent with previous plots

pca.results.to.plot %>%
  left_join(
    NewColors %>%
      dplyr::select(NewColor, alpha, treatment)) %>%
ggplot(aes(x=PC1, y=PC2, color=NewColor, alpha=alpha)) +
  geom_text(size=2, aes(label=treatment)) +
  scale_color_identity() +
  theme_bw() +
  labs(title = "PCA using 632 upregulated GAGT introns\nAll samples", caption="Colored consistent in other plots")

Version Author Date
32a80fd Benjmain Fair 2023-03-01
pca.results.to.plot %>%
  left_join(TreatmentFactors) %>%
ggplot(aes(x=PC1, y=PC2, color=Factor)) +
  geom_text(size=2, aes(label=treatment)) +
  theme_bw() +
  labs(title = "PCA using 632 upregulated GAGT introns\nAll samples", caption="Colored by structural groupings from Jingxins")

Version Author Date
32a80fd Benjmain Fair 2023-03-01
pca.results.to.plot %>%
  left_join(TreatmentFactors) %>%
ggplot(aes(x=PC3, y=PC4, color=Factor)) +
  geom_text(size=2, aes(label=treatment)) +
  theme_bw() +
  labs(title = "PCA using 632 upregulated GAGT introns\nAll samples", caption="Colored by structural groupings from Jingxins")

Version Author Date
32a80fd Benjmain Fair 2023-03-01

Ok, one thing I want to try is to repeat the PCA but only exclude the branaplam-based molecules.. I think I will project the dose-response experiment data to aid in interpretation.

NonBranaplamBasedMolecules <- TreatmentFactors %>%
  filter(!Factor=="G") %>% pull(treatment)

SamplesToInclude <- 
  FullMetadata %>%
  filter(Experiment %in% c("Single high dose LCL")) %>%
  filter(!old.sample.name == "NewMolecule.C04-2") %>%
  filter(treatment %in% c("DMSO", NonBranaplamBasedMolecules)) %>%
  pull(leafcutter.name)
  

pca.results <- PSI.IntronsToInclude %>%
  dplyr::select(junc, all_of(SamplesToInclude)) %>%
  column_to_rownames("junc") %>%
  scale() %>% t() %>% prcomp(scale=T)


pca.results.to.plot <- pca.results$x %>%
  as.data.frame() %>%
  rownames_to_column("leafcutter.name") %>%
  dplyr::select(leafcutter.name, PC1:PC6) %>%
  left_join(FullMetadata %>%
              dplyr::select(leafcutter.name, treatment, SampleName)) %>%
  left_join(TreatmentFactors, by="treatment") %>%
  mutate(ToHighlight = if_else(
    treatment %in% c("WB12", "WC05", "WD08"),
    1,
    0.5
    ))

SamplesToProject <- FullMetadata %>%
  filter(Experiment %in% c("Dose response titration")) %>%
  pull(leafcutter.name)

pca.results.to.plot_AdditionProjections <- PSI.IntronsToInclude %>%
  dplyr::select(junc, all_of(SamplesToProject)) %>%
  column_to_rownames("junc") %>%
  scale() %>% t() %>% predict(pca.results, .) %>%
  as.data.frame() %>%
  rownames_to_column("leafcutter.name") %>%
  dplyr::select(leafcutter.name, PC1:PC6) %>%
  left_join(FullMetadata, by="leafcutter.name")

P <- ggplot(pca.results.to.plot, aes(x=PC1, y=PC2)) +
  geom_text(size=3, aes(label=treatment, color=Factor, alpha=ToHighlight)) +
  scale_alpha_continuous(range=c(0.5,1)) +
  theme_bw() +
  labs(title = "PCA using 632 upregulated GAGT introns\nNon branaplam skeletons", caption="Colored by structural groupings from Jingxins")
P

Version Author Date
32a80fd Benjmain Fair 2023-03-01
P +
  new_scale_colour() +
  geom_point(data = pca.results.to.plot_AdditionProjections,
             aes(color=color)) +
  scale_color_identity()

Version Author Date
32a80fd Benjmain Fair 2023-03-01
P <- ggplot(pca.results.to.plot, aes(x=PC3, y=PC4)) +
  geom_text(size=3, aes(label=treatment, color=Factor, alpha=ToHighlight)) +
  scale_alpha_continuous(range=c(0.5,1)) +
  theme_bw() +
  labs(title = "PCA using 632 upregulated GAGT introns\nNon branaplam skeletons", caption="Colored by structural groupings from Jingxins")
P

Version Author Date
32a80fd Benjmain Fair 2023-03-01
P +
  new_scale_colour() +
  geom_point(data = pca.results.to.plot_AdditionProjections,
             aes(color=color)) +
  scale_color_identity()

ggplot(pca.results.to.plot, aes(x=PC5, y=PC6)) +
  geom_text(size=3, aes(label=treatment, color=Factor, alpha=ToHighlight)) +
  scale_alpha_continuous(range=c(0.5,1)) +
  new_scale_colour() +
  geom_point(data = pca.results.to.plot_AdditionProjections,
             aes(color=color)) +
  scale_color_identity() +
  theme_bw() +
  labs(title = "PCA using 632 upregulated GAGT introns\nNon branaplam skeletons", caption="Colored by structural groupings from Jingxins")

pca.results.to.plot %>%
  inner_join(
    counts.plot.dat %>%
      dplyr::select(SampleName, lib.size)
    ) %>%
ggplot(aes(x=PC5, y=PC6, color=lib.size)) +
  geom_text(size=3, aes(label=treatment)) +
  scale_alpha_continuous(range=c(0.5,1)) +
  scale_color_viridis_c() +
  theme_bw() +
  labs(title = "PCA using 632 upregulated GAGT introns\nNon branaplam skeletons", caption="Colored by structural groupings from Jingxins")

And let’s do the same for branaplam based molecules

BranaplamBasedMolecules <- TreatmentFactors %>%
  filter(Factor=="G") %>% pull(treatment)

SamplesToInclude <- 
  FullMetadata %>%
  filter(Experiment %in% c("Single high dose LCL")) %>%
  filter(!old.sample.name == "NewMolecule.C04-2") %>%
  filter(treatment %in% c("DMSO", BranaplamBasedMolecules)) %>%
  pull(leafcutter.name)
  

pca.results <- PSI.IntronsToInclude %>%
  dplyr::select(junc, all_of(SamplesToInclude)) %>%
  column_to_rownames("junc") %>%
  scale() %>% t() %>% prcomp(scale=T)


pca.results.to.plot <- pca.results$x %>%
  as.data.frame() %>%
  rownames_to_column("leafcutter.name") %>%
  dplyr::select(leafcutter.name, PC1:PC6) %>%
  left_join(FullMetadata %>%
              dplyr::select(leafcutter.name, treatment, SampleName)) %>%
  left_join(TreatmentFactors, by="treatment") %>%
  mutate(ToHighlight = if_else(
    treatment %in% c("WB12", "WC05", "WD08"),
    1,
    0.5
    ))

SamplesToProject <- FullMetadata %>%
  filter(Experiment %in% c("Dose response titration")) %>%
  pull(leafcutter.name)

pca.results.to.plot_AdditionProjections <- PSI.IntronsToInclude %>%
  dplyr::select(junc, all_of(SamplesToProject)) %>%
  column_to_rownames("junc") %>%
  scale() %>% t() %>% predict(pca.results, .) %>%
  as.data.frame() %>%
  rownames_to_column("leafcutter.name") %>%
  dplyr::select(leafcutter.name, PC1:PC6) %>%
  left_join(FullMetadata, by="leafcutter.name")

P <- ggplot(pca.results.to.plot, aes(x=PC1, y=PC2)) +
  geom_text(size=3, aes(label=treatment, color=Factor, alpha=ToHighlight)) +
  scale_alpha_continuous(range=c(0.5,1)) +
  theme_bw() +
  labs(title = "PCA using 632 upregulated GAGT introns\nNon branaplam skeletons", caption="Colored by structural groupings from Jingxins")
P

Version Author Date
32a80fd Benjmain Fair 2023-03-01
P +
  new_scale_colour() +
  geom_point(data = pca.results.to.plot_AdditionProjections,
             aes(color=color)) +
  scale_color_identity()

Version Author Date
32a80fd Benjmain Fair 2023-03-01
P <- ggplot(pca.results.to.plot, aes(x=PC3, y=PC4)) +
  geom_text(size=3, aes(label=treatment, color=Factor, alpha=ToHighlight)) +
  scale_alpha_continuous(range=c(0.5,1)) +
  theme_bw() +
  labs(title = "PCA using 632 upregulated GAGT introns\nNon branaplam skeletons", caption="Colored by structural groupings from Jingxins")
P

P +
  new_scale_colour() +
  geom_point(data = pca.results.to.plot_AdditionProjections,
             aes(color=color)) +
  scale_color_identity()

ggplot(pca.results.to.plot, aes(x=PC5, y=PC6)) +
  geom_text(size=3, aes(label=treatment, color=Factor, alpha=ToHighlight)) +
  scale_alpha_continuous(range=c(0.5,1)) +
  new_scale_colour() +
  geom_point(data = pca.results.to.plot_AdditionProjections,
             aes(color=color)) +
  scale_color_identity() +
  theme_bw() +
  labs(title = "PCA using 632 upregulated GAGT introns\nNon branaplam skeletons", caption="Colored by structural groupings from Jingxins")

pca.results.to.plot %>%
  inner_join(
    counts.plot.dat %>%
      dplyr::select(SampleName, lib.size)
    ) %>%
ggplot(aes(x=PC5, y=PC6, color=lib.size)) +
  geom_text(size=3, aes(label=treatment)) +
  scale_alpha_continuous(range=c(0.5,1)) +
  scale_color_viridis_c() +
  theme_bw() +
  labs(title = "PCA using 632 upregulated GAGT introns\nNon branaplam skeletons", caption="Colored by structural groupings from Jingxins")

I’m re-questioning whether C05 and B12 are special in any worthwhile way


sessionInfo()
R version 4.2.0 (2022-04-22)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: CentOS Linux 7 (Core)

Matrix products: default
BLAS/LAPACK: /software/openblas-0.3.13-el7-x86_64/lib/libopenblas_haswellp-r0.3.13.so

locale:
 [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C         LC_TIME=C           
 [4] LC_COLLATE=C         LC_MONETARY=C        LC_MESSAGES=C       
 [7] LC_PAPER=C           LC_NAME=C            LC_ADDRESS=C        
[10] LC_TELEPHONE=C       LC_MEASUREMENT=C     LC_IDENTIFICATION=C 

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] ggnewscale_0.4.8   knitr_1.39         ggrepel_0.9.1      gplots_3.1.3      
 [5] Heatplus_3.4.0     qvalue_2.28.0      edgeR_3.38.4       limma_3.52.4      
 [9] data.table_1.14.2  RColorBrewer_1.1-3 forcats_0.5.1      stringr_1.4.0     
[13] dplyr_1.0.9        purrr_0.3.4        readr_2.1.2        tidyr_1.2.0       
[17] tibble_3.1.7       ggplot2_3.3.6      tidyverse_1.3.1   

loaded via a namespace (and not attached):
 [1] bitops_1.0-7       fs_1.5.2           bit64_4.0.5        lubridate_1.8.0   
 [5] httr_1.4.3         rprojroot_2.0.3    tools_4.2.0        backports_1.4.1   
 [9] bslib_0.3.1        utf8_1.2.2         R6_2.5.1           KernSmooth_2.23-20
[13] DBI_1.1.2          colorspace_2.0-3   withr_2.5.0        tidyselect_1.1.2  
[17] bit_4.0.4          compiler_4.2.0     git2r_0.30.1       cli_3.3.0         
[21] rvest_1.0.2        xml2_1.3.3         labeling_0.4.2     sass_0.4.1        
[25] caTools_1.18.2     scales_1.2.0       digest_0.6.29      R.utils_2.11.0    
[29] rmarkdown_2.14     pkgconfig_2.0.3    htmltools_0.5.2    dbplyr_2.1.1      
[33] fastmap_1.1.0      highr_0.9          rlang_1.0.2        readxl_1.4.0      
[37] rstudioapi_0.13    jquerylib_0.1.4    generics_0.1.2     farver_2.1.0      
[41] jsonlite_1.8.0     vroom_1.5.7        gtools_3.9.2       R.oo_1.24.0       
[45] magrittr_2.0.3     Rcpp_1.0.8.3       munsell_0.5.0      fansi_1.0.3       
[49] R.methodsS3_1.8.1  lifecycle_1.0.1    stringi_1.7.6      whisker_0.4       
[53] yaml_2.3.5         plyr_1.8.7         grid_4.2.0         parallel_4.2.0    
[57] promises_1.2.0.1   crayon_1.5.1       lattice_0.20-45    haven_2.5.0       
[61] splines_4.2.0      hms_1.1.1          locfit_1.5-9.7     pillar_1.7.0      
[65] reshape2_1.4.4     reprex_2.0.1       glue_1.6.2         evaluate_0.15     
[69] modelr_0.1.8       vctrs_0.4.1        tzdb_0.3.0         httpuv_1.6.5      
[73] cellranger_1.1.0   gtable_0.3.0       assertthat_0.2.1   xfun_0.30         
[77] broom_0.8.0        later_1.3.0        viridisLite_0.4.0  workflowr_1.7.0   
[81] ellipsis_0.3.2