IdentifyTargetsForVerification

Intro

Last updated: 2022-09-08

Checks: 5 2

Knit directory: 20211209_JingxinRNAseq/analysis/

This reproducible R Markdown analysis was created with workflowr (version 1.6.2). The Checks tab describes the reproducibility checks that were applied when the results were created. The Past versions tab lists the development history.

R Markdown file: uncommitted changes

The R Markdown is untracked by Git. To know which version of the R Markdown file created these results, you’ll want to first commit it to the Git repo. If you’re still working on the analysis, you can ignore this warning. When you’re finished, you can run wflow_publish to commit the R Markdown file and build the HTML.

Environment: empty

Great job! The global environment was empty. Objects defined in the global environment can affect the analysis in your R Markdown file in unknown ways. For reproduciblity it’s best to always run the code in an empty environment.

Seed: set.seed(19900924)

The command set.seed(19900924) was run prior to running the code in the R Markdown file. Setting a seed ensures that any results that rely on randomness, e.g. subsampling or permutations, are reproducible.

Session information: recorded

Great job! Recording the operating system, R version, and package versions is critical for reproducibility.

Cache: none

Nice! There were no cached chunks for this analysis, so you can be confident that you successfully produced the results during this run.

File paths: absolute

Using absolute paths to the files within your workflowr project makes it difficult for you and others to run your code on a different machine. Change the absolute path(s) below to the suggested relative path(s) to make your code more reproducible.

absolute	relative
/project2/yangili1/bjf79/20211209_JingxinRNAseq/code/bigwigs/unstranded/(.+?).bw	../code/bigwigs/unstranded/(.+?).bw

Repository version: db9a4f0

Great! You are using Git for version control. Tracking code development and connecting the code version to the results is critical for reproducibility.

The results in this page were generated with repository version db9a4f0. See the Past versions tab to see a history of the changes made to the R Markdown and HTML files.

Note that you need to be careful to ensure that all relevant files for the analysis have been committed to Git prior to generating the results (you can use wflow_publish or wflow_git_commit). workflowr only checks the R Markdown file, but you know if there are other scripts or data files that it depends on. Below is the status of the Git repository when the results were generated:


Ignored files:
    Ignored:    .DS_Store
    Ignored:    .Rhistory
    Ignored:    .Rproj.user/
    Ignored:    ._.DS_Store
    Ignored:    analysis/.RData
    Ignored:    analysis/.Rhistory
    Ignored:    analysis/20220707_TitrationSeries_DE_testing.nb.html
    Ignored:    code/.DS_Store
    Ignored:    code/._.DS_Store
    Ignored:    code/._DOCK7.pdf
    Ignored:    code/._DOCK7_DMSO1.pdf
    Ignored:    code/._DOCK7_SM2_1.pdf
    Ignored:    code/._FKTN_DMSO_1.pdf
    Ignored:    code/._FKTN_SM2_1.pdf
    Ignored:    code/._MAPT.pdf
    Ignored:    code/._PKD1_DMSO_1.pdf
    Ignored:    code/._PKD1_SM2_1.pdf
    Ignored:    code/.snakemake/
    Ignored:    code/5ssSeqs.tab
    Ignored:    code/Alignments/
    Ignored:    code/ChemCLIP/
    Ignored:    code/ClinVar/
    Ignored:    code/DE_testing/
    Ignored:    code/DE_tests.mat.counts.gz
    Ignored:    code/DE_tests.txt.gz
    Ignored:    code/Fastq/
    Ignored:    code/FastqFastp/
    Ignored:    code/FragLenths/
    Ignored:    code/Meme/
    Ignored:    code/Multiqc/
    Ignored:    code/OMIM/
    Ignored:    code/OldBigWigs/
    Ignored:    code/QC/
    Ignored:    code/Session.vim
    Ignored:    code/SplicingAnalysis/
    Ignored:    code/TracksSession
    Ignored:    code/bigwigs/
    Ignored:    code/featureCounts/
    Ignored:    code/geena/
    Ignored:    code/igv_session.template.xml
    Ignored:    code/igv_session.xml
    Ignored:    code/log
    Ignored:    code/logs/
    Ignored:    code/scratch/
    Ignored:    code/scripts/.DS_Store
    Ignored:    code/test.txt.gz
    Ignored:    code/testPlottingWithMyScript.ForJingxin.sh
    Ignored:    code/testPlottingWithMyScript.ForJingxin2.sh
    Ignored:    code/testPlottingWithMyScript.sh
    Ignored:    data/._Hijikata_TableS1_41598_2017_8902_MOESM2_ESM.xls
    Ignored:    data/._Hijikata_TableS2_41598_2017_8902_MOESM3_ESM.xls
    Ignored:    output/._PioritizedIntronTargets.pdf

Untracked files:
    Untracked:  analysis/20220804_MakeBigwigListTsv.Rmd
    Untracked:  analysis/20220809_geena.Rmd
    Untracked:  analysis/20220822_Check_chRNA.Rmd
    Untracked:  analysis/20220907_IdentifyTargetsForVerification.Rmd
    Untracked:  code/scripts/Collapse_Juncfiles.R
    Untracked:  data/Hijikata_Table2_combined.csv
    Untracked:  data/Risdplam_Branaplam_specific_exons 5'ss.xlsx
    Untracked:  output/FibroblastOMIM_GOF_Downregulated_AndCrypticSpliced_GeneList.tsv

Unstaged changes:
    Modified:   analysis/20211216_DifferentialSplicing.Rmd
    Modified:   analysis/20220629_FirstGlanceTitrationSeries.Rmd
    Modified:   analysis/20220712_FitDoseResponseModels.Rmd
    Modified:   analysis/20220726_Check_chRNATestSeq.Rmd
    Modified:   code/Snakefile
    Modified:   code/config/config.yaml
    Modified:   code/config/samples.chRNAPilotSequencing.tsv
    Modified:   code/rules/Process_chRNA.smk
    Modified:   code/rules/RNASeqProcessing.smk
    Modified:   code/rules/common.smk
    Modified:   code/scripts/GenometracksByGenotype
    Modified:   output/QC/ReadCountsAndJunctionsPerSamples.tsv

Staged changes:
    Modified:   .gitmodules
    New:        code/scripts/GenometracksByGenotype

Note that any generated files, e.g. HTML, png, CSS, etc., are not included in this status report because it is ok for generated content to have uncommitted changes.

There are no past versions. Publish this analysis with wflow_publish() to start tracking its development.

Intro

Yang shared with me a message from Jingxin:

We are now designing lots of splicing assays. Previously, Ben provided 9 genes that are related to pathogenic gain-of-function mutations. I believe the analysis is based on our previous fibroblast result. We will need the coordinates of the AS events in these 9 genes. I hope that this can be done before next Tuesday (our next order for DNAs). 1 AKT3 MEGALENCEPHALY-POLYMICROGYRIA-POLYDACTYLY-HYDROCEPHALUS SYNDROME … 2 ACVR1 FIBRODYSPLASIA OSSIFICANS PROGRESSIVA; FOP
3 STAT1 IMMUNODEFICIENCY 31C; IMD31C
4 PDGFRB MYOFIBROMATOSIS, INFANTILE, 1; IMF1
5 RNF170 ATAXIA, SENSORY, 1, AUTOSOMAL DOMINANT; SNAX1
6 PTDSS1 LENZ-MAJEWSKI HYPEROSTOTIC DWARFISM; LMHD
7 KITLG HYPERPIGMENTATION WITH OR WITHOUT HYPOPIGMENTATION, FAMILIAL PROG… 8 PIEZO1 DEHYDRATED HEREDITARY STOMATOCYTOSIS 1 WITH OR WITHOUT PSEUDOHYPE… 9 SNTA1 LONG QT SYNDROME 12; LQT12

The most intriguing part of the dose-response (DR) study is the selectivity difference between risdiplam and branaplam. Due to the similarity between risdiplam and C2-C5-1, we can even leave C2-C5-1 out for simplicity. We need a list of more genes that are (a) risdiplam selective and (b) branaplam selective, with complete coordinates. The criteria I can think of are as follows:

Number of validated reads (leave the low-quality DR curves out).

EC50 difference. This can be defined by delta Psi at a certain concentration (or an average of a few concentrations).

Only consider exon inclusion in the presence of drugs.

Let’s start with point #1. Previously, based on the fibroblast RNA-seq data (cells treated with either A10, C2, CUOME, or SM2), I made a short list of genes that are OMIM genes annotated as gain-of-function or dominant-negative, which also in the RNA-seq are observed to be downregulated by treatment and also contain a detected significantly upregulated unannotated (cryptic 3’ss, or cryptic 5’ss) splice junction. It is not totally clear whether the significant detected upregulated junction is the causal mechanism for downregulation (eg like the poison exon in HTT), or whether the detected intron is a GA|GT intron, etc. To accomplish Jingxin’s point number one, I will manually inspect that list of 10 genes, looking for the hypothesized causal splice junction that creates a NMD-target transcript. Here I will show my work on how I go about this:

#load libraries
library(tidyverse)
library(edgeR)

#Read in short list of genes
genes.to.plot <- read_tsv("../output/FibroblastOMIM_GOF_Downregulated_AndCrypticSpliced_GeneList.tsv")
knitr::kable(genes.to.plot)

Gene	Preferred Title; symbol	UpstreamOfDonor2BaseSeq	psi_treatment	intron	logFC
AKT3	MEGALENCEPHALY-POLYMICROGYRIA-POLYDACTYLY-HYDROCEPHALUS SYNDROME 2; MPPH2	AG	0.0756435	chr1:243545028:243545510:clu_1364_-	-0.9947345
ACVR1	FIBRODYSPLASIA OSSIFICANS PROGRESSIVA; FOP	AT	0.1008161	chr2:157774187:157774275:clu_3398_-	-1.4052270
STAT1	IMMUNODEFICIENCY 31C; IMD31C	AG	0.1767480	chr2:190974932:190975310:clu_3511_-	-3.7853949
PDGFRB	MYOFIBROMATOSIS, INFANTILE, 1; IMF1	TG	0.0325590	chr5:150137053:150137400:clu_8376_-	-0.6589689
RNF170	ATAXIA, SENSORY, 1, AUTOSOMAL DOMINANT; SNAX1	GA	0.0265984	chr8:42874006:42887802:clu_11248_-	-0.7246948
PTDSS1	LENZ-MAJEWSKI HYPEROSTOTIC DWARFISM; LMHD	AG	0.0862211	chr8:96331095:96331377:clu_11937_+	-0.9047179
KITLG	HYPERPIGMENTATION WITH OR WITHOUT HYPOPIGMENTATION, FAMILIAL PROGRESSIVE; FPHH	AG	0.1110230	chr12:88497181:88503848:clu_16317_-	-1.0638706
PIEZO1	DEHYDRATED HEREDITARY STOMATOCYTOSIS 1 WITH OR WITHOUT PSEUDOHYPERKALEMIA AND/OR PERINATAL EDEMA; DHS1	GT	0.0130891	chr16:88716252:88716361:clu_21073_-	-1.0658404
SNTA1	LONG QT SYNDROME 12; LQT12	TG	0.0840857	chr20:33408888:33410103:clu_25106_-	-0.5575608

First let’s confirm a dose-dependent drug response for these genes.

dat <- read_tsv("../code/featureCounts/Counts.titration_series.txt", comment="#") %>%
  rename_at(vars(-(1:6)), ~str_replace(.x, "Alignments/STAR_Align/(.+?)/Aligned.sortedByCoord.out.bam", "\\1"))


ExpressedGenes.rpkm <- dat %>%
  dplyr::select(Geneid, everything(), -c(2:6)) %>%
  column_to_rownames("Geneid") %>%
  DGEList() %>%
  calcNormFactors() %>%
  rpkm(prior.count=0.1, gene.length = dat$Length)

sample.list <- read_tsv("../code/bigwigs/BigwigList.tsv",
                        col_names = c("SampleName", "bigwig", "group", "strand")) %>%
  filter(strand==".") %>%
  mutate(old.sample.name = str_replace(bigwig, "/project2/yangili1/bjf79/20211209_JingxinRNAseq/code/bigwigs/unstranded/(.+?).bw", "\\1")) %>%
  separate(SampleName, into=c("treatment", "dose.nM", "cell.type", "libType", "rep"), convert=T, remove=F) %>%
  left_join(
    read_tsv("../code/bigwigs/BigwigList.groups.tsv", col_names = c("group", "color", "bed", "supergroup")),
    by="group"
  )

gene_list <- read_tsv("../data/Genes.list.txt")

ExpressedGenes.rpkm %>%
  as.data.frame() %>%
  rownames_to_column("gene") %>%
  mutate(gene=str_remove(gene, "\\..+$")) %>%
  left_join(gene_list, by=c("gene"="ensembl_gene_id")) %>%
  filter(hgnc_symbol %in% genes.to.plot$Gene) %>%
  gather("sample", "rpkm", -gene, -hgnc_symbol) %>%
  inner_join(sample.list, by=c("sample"="old.sample.name")) %>%
  filter(!is.na(dose.nM)) %>%
  ggplot(aes(x=dose.nM, y=rpkm, color=treatment)) +
  geom_line() +
  scale_x_continuous(trans='log10') +
  facet_wrap(~hgnc_symbol, scales = "free_y") +
  theme_bw()

Note that some of these don’t even really have a dose-dependent response consistent with the fibroblast data from which this short list of genes was derived… For example, all of these genes were in part chosen because they went DOWN in the treatment samples in fibroblast data. In contrast, KITLG seems to go up. Other genes, like PDGFRB and RNF170 don’t perhaps are consistent, but they are expressed so low in the LCL data that the dose-response effect is a bit questionable. In any case, for all of these, let’s plot the raw coverage data and try to hone in on some causal introns, knowing that it may be impossible to identify for a lot of these genes.

# Read in list of introns detected across all experiments that pass leafcutter's default clustering processing thresholds
all.samples.5ss <- read_tsv("../code/SplicingAnalysis/FullSpliceSiteAnnotations/JuncfilesMerged.annotated.basic.bed.5ss.tab.gz", col_names = c("intron", "seq", "score")) %>%
  mutate(intron = str_replace(intron, "^(.+?)::.+$", "\\1")) %>%
  separate(intron, into=c("chrom", "start", "stop", "strand"), sep="_", convert=T, remove=F)

all.samples.intron.annotations <- read_tsv("../code/SplicingAnalysis/FullSpliceSiteAnnotations/JuncfilesMerged.annotated.basic.bed.gz")

all.samples.PSI <- read_tsv("../code/SplicingAnalysis/leafcutter_all_samples/PSI.table.bed.gz")

# Write out the splice junctions that are GA|GT introns. I will highlight these in shashimi plots for each of the 10 genes.  
SpliceJunctionToHighlight <- all.samples.PSI %>%
  select(1:6) %>%
  inner_join(all.samples.5ss, by=c("#Chrom"="chrom", "start", "end"="stop", "strand")) %>%
  inner_join(all.samples.intron.annotations, by=c("#Chrom"="chrom", "start", "end", "strand") ) %>%
  filter(str_detect(seq, "^\\w{2}GAGT")) %>%
  filter(gene_names %in% genes.to.plot$Gene)
SpliceJunctionToHighlight %>%
  write_tsv("../code/scratch/ToPlot.introns.bed")
SpliceJunctionToHighlight %>% select(1:4, gene_names) %>%
  knitr::kable()

#Chrom	start	end	junc	gene_names
chr12	88516490	88518195	chr12:88516490:88518195:clu_23943_-	KITLG
chr16	88720532	88720660	chr16:88720532:88720660:clu_30902_-	PIEZO1
chr2	190977025	190978528	chr2:190977025:190978528:clu_5141_-	STAT1
chr2	191010004	191014018	chr2:191010004:191014018:clu_5142_-	STAT1
chr2	191013678	191014018	chr2:191013678:191014018:clu_5142_-	STAT1
chr8	42874006	42887802	chr8:42874006:42887802:clu_16606_-	RNF170

# Write out GA|GT splice sites, I will highlight these with a vertical bar
SpliceJunctionToHighlight %>%
  mutate(SpliceDonor = case_when(
    strand == "+" ~ start,
    strand == "-" ~ end
  )) %>%
  mutate(SpliceDonorEnd = SpliceDonor + 1) %>%
  select(`#Chrom`, SpliceDonor, SpliceDonorEnd) %>%
  write_tsv("../code/scratch/ToPlot.SpliceDonors.bed")

# write out pygenometracks ini file for GA|GT splice sites track
fileConn<-file("../code/scratch/ToPlot.SpliceDonors.ini")
writeLines(c("[vlines]","type = vlines", "file = scratch/ToPlot.SpliceDonors.bed"), con=fileConn)
close(fileConn)

# Read in list of gene coordinates
genes <- read_tsv("../code/scripts/GenometracksByGenotype/PremadeTracks/gencode.v26.FromGTEx.genes.bed12.gz", col_names = c("chrom", "start", "stop", "name", "score", "strand", 1:7)) %>%
  select(1:6)

# Write out shell commands that use a helper script of mine for plotting the data nicely.
genes.to.plot %>%
  left_join(genes, by=c("Gene"="name")) %>%
  mutate(min=start-1000, max=stop+1000) %>%
  mutate(cmd = str_glue("python scripts/GenometracksByGenotype/AggregateBigwigsForPlotting.py --GroupSettingsFile <(awk '$1 ~ /_u$/' bigwigs/BigwigList.groups.tsv ) --BigwigList bigwigs/BigwigList.tsv --Normalization WholeGenome --Region {chrom}:{min}-{max} --BigwigListType KeyFile --OutputPrefix scratch/ -vv --TracksTemplate scripts/GenometracksByGenotype/tracks_templates/GeneralPurposeColoredByGenotypeWithSupergroups.ini --Bed12GenesToIni scripts/GenometracksByGenotype/PremadeTracks/gencode.v26.FromGTEx.genes.bed12.gz  --FilterJuncsByBed scratch/ToPlot.introns.bed\npyGenomeTracks --tracks <(cat scratch/tracks.ini scratch/ToPlot.SpliceDonors.ini) --out ../docs/assets/ForJingxin_{Gene}.png --region {chrom}:{min}-{max} --trackLabelFraction 0.15\n\n")) %>%
  select(cmd) %>%
  write.table("../code/testPlottingWithMyScript.ForJingxin.sh", quote=F, row.names = F, col.names = F)

Note that for only 4/10 of these genes could we even reasonably detect any GA|GT splice junctions. For the other 6 genes, the exact mechanism of down-regulation by treatment may not be identifiable from our existing data.

Let’s plot the dose-response curves for all these GA|GT splice junctions:

splicing.table <- read_tsv("../code/SplicingAnalysis/leafcutter_all_samples/PSI.table.bed.gz")

splicing.table %>%
  inner_join(all.samples.intron.annotations, by=c("#Chrom"="chrom", "start", "end", "strand")) %>%
  inner_join(all.samples.5ss, by=c("#Chrom"="chrom", "start", "end"="stop", "strand")) %>%
  filter(gene_names %in% genes.to.plot$Gene) %>%
  mutate(IsGAGT = str_detect(seq, "^\\w{2}GAGT")) %>%
  group_by(gid) %>%
  filter(any(IsGAGT)) %>%
  ungroup() %>%
  filter(IsGAGT) %>%
  mutate(facettitle = paste(junc, gene_names,IsGAGT, sep="\n")) %>%
  select(-c(1:6)) %>%
  gather("sample", "PSI", A10_NA_Fibroblast_polyA_1:Branaplam_31.6_LCL_chRNA_3) %>%
  inner_join(sample.list, by=c("sample"="SampleName")) %>%
  filter(!is.na(dose.nM)) %>%
  filter(libType=="polyA" & cell.type=="LCL") %>%
  ggplot(aes(x=dose.nM, y=PSI, color=treatment)) +
  geom_line() +
  scale_x_continuous(trans='log10') +
  facet_wrap(~facettitle, scales = "free_y") +
  theme_bw()

Perhaps the KITLG splice junction is not expressed enough to make an interpretable dose-response curves. But for RNF170, STAT1, and PIEZO1 at least there is one GA|GT junction that does indeed go up in dose-dependent manner…

Just out of curiosity, let’s look across all junctions (including the non-GA|GT) quantified and look for dose-dependent responses. Here the facet titles are the intron coordinates, the host gene, then whether the intron is in an intron cluster with any GA|GT. So basically I am curious if I can find any reliable dose-dependent splicing responses in these genes that aren’t due to any detectable GA|GT intron.

splicing.table %>%
  inner_join(all.samples.intron.annotations, by=c("#Chrom"="chrom", "start", "end", "strand")) %>%
  inner_join(all.samples.5ss, by=c("#Chrom"="chrom", "start", "end"="stop", "strand")) %>%
  filter(gene_names %in% genes.to.plot$Gene) %>%
  mutate(IsGAGT = str_detect(seq, "^\\w{2}GAGT")) %>%
  group_by(gid) %>%
  mutate(ClusterContainsGAGT = any(IsGAGT)) %>%
  ungroup() %>%
  mutate(facettitle = paste(junc, gene_names,ClusterContainsGAGT, sep="\n")) %>%
  select(-c(1:6)) %>%
  gather("sample", "PSI", A10_NA_Fibroblast_polyA_1:Branaplam_31.6_LCL_chRNA_3) %>%
  inner_join(sample.list, by=c("sample"="SampleName")) %>%
  filter(!is.na(dose.nM)) %>%
  filter(libType=="polyA" & cell.type=="LCL") %>%
  ggplot(aes(x=dose.nM, y=PSI, color=treatment)) +
  geom_line() +
  scale_x_continuous(trans='log10') +
  facet_wrap(~facettitle, scales = "free_y") +
  theme_bw()

Out of these 150 or so dose-response curves, the most reliable looking ones are the ones in STAT1 and RNF170 which are involved in a GA|GT-containing splice cluster. So not so much evidence (at least from this limited sample) for direct effects at non GA|GT introns.

In any case, for all 10 genes, let’s plot the normalized read coverage across all our experiments, and specifically, for the 4 genes of interest, let’s also zoom in on the GA|GT intron of interest…

conda activate GenometracksByGenotype
cd ../code
bash testPlottingWithMyScript.ForJingxin.sh

Results in some of the following gene-wise plots… I’m just going to highlight the 4/10 here with any GA|GT introns…

KITLG PIEZO1 RNF170 STAT1

Let’s zoom in a bit on the introns of interest…

SpliceJunctionToHighlight %>% select(1:4, gene_names) %>%
  distinct(gene_names, .keep_all=T) %>%
  mutate(min=start-2000, max=end+2000) %>%
  mutate(chrom=`#Chrom`) %>%
  mutate(cmd = str_glue("python scripts/GenometracksByGenotype/AggregateBigwigsForPlotting.py --GroupSettingsFile <(awk '$1 ~ /_u$/' bigwigs/BigwigList.groups.tsv ) --BigwigList bigwigs/BigwigList.tsv --Normalization WholeGenome --Region {chrom}:{min}-{max} --BigwigListType KeyFile --OutputPrefix scratch/ -vv --TracksTemplate scripts/GenometracksByGenotype/tracks_templates/GeneralPurposeColoredByGenotypeWithSupergroups.ini --Bed12GenesToIni scripts/GenometracksByGenotype/PremadeTracks/gencode.v26.FromGTEx.genes.bed12.gz\npyGenomeTracks --tracks <(cat scratch/tracks.ini scratch/ToPlot.SpliceDonors.ini) --out ../docs/assets/ForJingxin_{gene_names}_ZoomInOnIntron.png --region {chrom}:{min}-{max} --trackLabelFraction 0.15\n\n")) %>%
  select(cmd) %>%
  write.table("../code/testPlottingWithMyScript.ForJingxin2.sh", quote=F, row.names = F, col.names = F)

…After running that script to plot zoomed in sashimi plots, I get these four images:

Note that KITLG poison exon seems identifiable in the fibroblast data only, as it seems poorly covered/expressed in the LCL data. So though I can’t see a dose-dependent response based on the LCL experiment, it is really obvious from the fibroblast data that I have the poison exon identified
The STAT1 poison example is also pretty clear, and note how much more splicing there is of the poison exon in terms of observed PSI upon treatment in the chomatin associated RNA-seq data versus the polyA RNA-seq titration series data (both in LCLs)
The PIEZO1 example is a bit confusing. The splice junction of interest does show a dose-dependent response, and it is a GA|GT splice junction, and similar to the STAT1 example, the GA|GT junction is much more spliced in the chRNA compared to the polyA-RNA, consistent with it being an NMD target. Note that this intron is a alt 5’ss (instead of associated with a poison cassette exon) and based on the converage tracks, this junction seems quite rare relative to the transcript expression. -RNF170 example also contains only one detected GA|GT junction but it is not associated with a novel cassette exon but rather an alt 5’ss within an exon. However, it is not clear that it is used an any appreciable rate that is could point to it being the causal mechanism for the dose-dependent gene response. However, the splicing for one GA|GT does indeed follow a dose-dependent pattern, so maybe it is right.

Finally, let’s write out these relevant junctions coordinates for Jingxin. First to hone in on the correct junction coordinates, let me replot all of the dose response curves for clusters that contain a GA|GT intron in these 4 genes…

splicing.table %>%
  inner_join(all.samples.intron.annotations, by=c("#Chrom"="chrom", "start", "end", "strand")) %>%
  inner_join(all.samples.5ss, by=c("#Chrom"="chrom", "start", "end"="stop", "strand")) %>%
  filter(gene_names %in% c("KITLG", "STAT1", "PIEZO1", "RNF170")) %>%
  mutate(IsGAGT = str_detect(seq, "^\\w{2}GAGT")) %>%
  group_by(gid) %>%
  mutate(ClusterContainsGAGT = any(IsGAGT)) %>%
  ungroup() %>%
  filter(ClusterContainsGAGT) %>%
  mutate(facettitle = paste(junc, gene_names,IsGAGT, sep="\n")) %>%
  select(-c(1:6)) %>%
  gather("sample", "PSI", A10_NA_Fibroblast_polyA_1:Branaplam_31.6_LCL_chRNA_3) %>%
  inner_join(sample.list, by=c("sample"="SampleName")) %>%
  filter(!is.na(dose.nM)) %>%
  filter(libType=="polyA" & cell.type=="LCL") %>%
  ggplot(aes(x=dose.nM, y=PSI, color=treatment)) +
  geom_line() +
  scale_x_continuous(trans='log10') +
  facet_wrap(~facettitle, scales = "free_y") +
  theme_bw() +
  theme(strip.text.x = element_text(size = 7))

Based on manual inspection, let’s replot, highlighting just the introns involved that Jingxin probably would want coordinates for:

splicing.table %>%
  inner_join(all.samples.intron.annotations, by=c("#Chrom"="chrom", "start", "end", "strand")) %>%
  inner_join(all.samples.5ss, by=c("#Chrom"="chrom", "start", "end"="stop", "strand")) %>%
  filter(gene_names %in% c("KITLG", "STAT1", "PIEZO1", "RNF170")) %>%
  mutate(IsGAGT = str_detect(seq, "^\\w{2}GAGT")) %>%
  group_by(gid) %>%
  mutate(ClusterContainsGAGT = any(IsGAGT)) %>%
  ungroup() %>%
  filter(ClusterContainsGAGT) %>%
  mutate(IntronType = recode(anchor, A="Alt 5'ss", D="Alt 3'ss", DA="Annotated", N="New Intron", NDA="New Splice Site combination")) %>%
  select(-c("A10_NA_Fibroblast_polyA_1":"Branaplam_31.6_LCL_chRNA_3")) %>%
  select(junc, gene_names, IsGAGT, IntronType)

# A tibble: 22 × 4
   junc                                gene_names IsGAGT IntronType
   <chr>                               <chr>      <lgl>  <chr>     
 1 chr12:88516490:88518149:clu_23943_- KITLG      FALSE  Alt 5'ss  
 2 chr12:88516490:88518195:clu_23943_- KITLG      TRUE   Alt 5'ss  
 3 chr12:88516490:88518697:clu_23943_- KITLG      FALSE  Annotated 
 4 chr12:88518308:88518697:clu_23943_- KITLG      FALSE  Alt 3'ss  
 5 chr16:88720525:88720616:clu_30902_- PIEZO1     FALSE  Alt 3'ss  
 6 chr16:88720532:88720616:clu_30902_- PIEZO1     FALSE  Annotated 
 7 chr16:88720532:88720660:clu_30902_- PIEZO1     TRUE   Alt 5'ss  
 8 chr2:190977025:190978528:clu_5141_- STAT1      TRUE   Alt 5'ss  
 9 chr2:190977025:190978856:clu_5141_- STAT1      FALSE  Annotated 
10 chr2:190978606:190978856:clu_5141_- STAT1      FALSE  Alt 3'ss  
# … with 12 more rows

# Based on the dose response curves and which introns are GAGT and manual inspection, I will note these introns:
JuncstionsToNote <- c("chr16:88720532:88720616:clu_30902_-", "chr16:88720532:88720660:clu_30902_-",
                      "chr2:190977025:190978528:clu_5141_-", "chr2:190978606:190978856:clu_5141_-", "chr2:190977025:190978856:clu_5141_-",
                      "chr8:42874006:42887802:clu_16606_-", "chr8:42874006:42887728:clu_16606_-",
                      "chr12:88516490:88518195:clu_23943_-", "chr12:88518308:88518697:clu_23943_-", "chr12:88516490:88518697:clu_23943_-")

splicing.table %>%
  inner_join(all.samples.intron.annotations, by=c("#Chrom"="chrom", "start", "end", "strand")) %>%
  inner_join(all.samples.5ss, by=c("#Chrom"="chrom", "start", "end"="stop", "strand")) %>%
  filter(gene_names %in% c("KITLG", "STAT1", "PIEZO1", "RNF170")) %>%
  mutate(IsGAGT = str_detect(seq, "^\\w{2}GAGT")) %>%
  mutate(IntronType = recode(anchor, A="Alt 5'ss", D="Alt 3'ss", DA="Annotated", N="New Intron", NDA="New Splice Site combination")) %>%
  mutate(facettitle = paste(junc, gene_names,IsGAGT, IntronType, sep="\n")) %>%
  filter(junc %in% JuncstionsToNote) %>%
  select(-c(1:6)) %>%
  gather("sample", "PSI", A10_NA_Fibroblast_polyA_1:Branaplam_31.6_LCL_chRNA_3) %>%
  inner_join(sample.list, by=c("sample"="SampleName")) %>%
  filter(!is.na(dose.nM)) %>%
  filter(libType=="polyA" & cell.type=="LCL") %>%
  ggplot(aes(x=dose.nM, y=PSI, color=treatment)) +
  geom_line() +
  scale_x_continuous(trans='log10') +
  facet_wrap(~facettitle, scales = "free_y") +
  theme_bw() +
  theme(strip.text.x = element_text(size = 5))

splicing.table %>%
  inner_join(all.samples.intron.annotations, by=c("#Chrom"="chrom", "start", "end", "strand")) %>%
  inner_join(all.samples.5ss, by=c("#Chrom"="chrom", "start", "end"="stop", "strand")) %>%
  filter(junc %in% JuncstionsToNote) %>%
  mutate(IsGAGT = str_detect(seq, "^\\w{2}GAGT")) %>%
  mutate(IntronType = recode(anchor, A="Alt 5'ss", D="Alt 3'ss", DA="Annotated", N="New Intron", NDA="New Splice Site combination")) %>%
  select(-c("A10_NA_Fibroblast_polyA_1":"Branaplam_31.6_LCL_chRNA_3")) %>%
  select(junc, gene_names, IsGAGT, IntronType) %>%
  knitr::kable()

junc	gene_names	IsGAGT	IntronType
chr12:88516490:88518195:clu_23943_-	KITLG	TRUE	Alt 5’ss
chr12:88516490:88518697:clu_23943_-	KITLG	FALSE	Annotated
chr12:88518308:88518697:clu_23943_-	KITLG	FALSE	Alt 3’ss
chr16:88720532:88720616:clu_30902_-	PIEZO1	FALSE	Annotated
chr16:88720532:88720660:clu_30902_-	PIEZO1	TRUE	Alt 5’ss
chr2:190977025:190978528:clu_5141_-	STAT1	TRUE	Alt 5’ss
chr2:190977025:190978856:clu_5141_-	STAT1	FALSE	Annotated
chr2:190978606:190978856:clu_5141_-	STAT1	FALSE	Alt 3’ss
chr8:42874006:42887728:clu_16606_-	RNF170	FALSE	Annotated
chr8:42874006:42887802:clu_16606_-	RNF170	TRUE	Alt 5’ss

Ok, so I am done with point number1 in Jingxin’s request. I will address point number 2 in a new notebook…

sessionInfo()

R version 3.6.1 (2019-07-05)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: CentOS Linux 7 (Core)

Matrix products: default
BLAS/LAPACK: /software/openblas-0.2.19-el7-x86_64/lib/libopenblas_haswellp-r0.2.19.so

locale:
 [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C         LC_TIME=C           
 [4] LC_COLLATE=C         LC_MONETARY=C        LC_MESSAGES=C       
 [7] LC_PAPER=C           LC_NAME=C            LC_ADDRESS=C        
[10] LC_TELEPHONE=C       LC_MEASUREMENT=C     LC_IDENTIFICATION=C 

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] edgeR_3.26.5    limma_3.40.6    forcats_0.4.0   stringr_1.4.0  
 [5] dplyr_1.0.9     purrr_0.3.4     readr_1.3.1     tidyr_1.2.0    
 [9] tibble_3.1.7    ggplot2_3.3.6   tidyverse_1.3.0

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.5       locfit_1.5-9.1   lubridate_1.7.4  lattice_0.20-38 
 [5] assertthat_0.2.1 rprojroot_2.0.2  digest_0.6.20    utf8_1.1.4      
 [9] R6_2.4.0         cellranger_1.1.0 backports_1.4.1  reprex_0.3.0    
[13] evaluate_0.15    httr_1.4.1       highr_0.9        pillar_1.7.0    
[17] rlang_1.0.3      readxl_1.3.1     rstudioapi_0.10  rmarkdown_1.13  
[21] labeling_0.3     munsell_0.5.0    broom_1.0.0      compiler_3.6.1  
[25] httpuv_1.5.1     modelr_0.1.8     xfun_0.31        pkgconfig_2.0.2 
[29] htmltools_0.3.6  tidyselect_1.1.2 workflowr_1.6.2  fansi_0.4.0     
[33] crayon_1.3.4     dbplyr_1.4.2     withr_2.4.1      later_0.8.0     
[37] grid_3.6.1       jsonlite_1.6     gtable_0.3.0     lifecycle_1.0.1 
[41] DBI_1.1.0        git2r_0.26.1     magrittr_1.5     scales_1.1.0    
[45] cli_3.3.0        stringi_1.4.3    farver_2.1.0     fs_1.3.1        
[49] promises_1.0.1   xml2_1.3.2       ellipsis_0.3.2   generics_0.1.3  
[53] vctrs_0.4.1      tools_3.6.1      glue_1.6.2       hms_0.5.3       
[57] yaml_2.2.0       colorspace_1.4-1 rvest_0.3.5      knitr_1.39      
[61] haven_2.3.1