Last updated: 2019-05-23

Checks: 6 0

Knit directory: apaQTL/analysis/

This reproducible R Markdown analysis was created with workflowr (version 1.3.0). The Checks tab describes the reproducibility checks that were applied when the results were created. The Past versions tab lists the development history.


Great! Since the R Markdown file has been committed to the Git repository, you know the exact version of the code that produced these results.

Great job! The global environment was empty. Objects defined in the global environment can affect the analysis in your R Markdown file in unknown ways. For reproduciblity it’s best to always run the code in an empty environment.

The command set.seed(20190411) was run prior to running the code in the R Markdown file. Setting a seed ensures that any results that rely on randomness, e.g. subsampling or permutations, are reproducible.

Great job! Recording the operating system, R version, and package versions is critical for reproducibility.

Nice! There were no cached chunks for this analysis, so you can be confident that you successfully produced the results during this run.

Great! You are using Git for version control. Tracking code development and connecting the code version to the results is critical for reproducibility. The version displayed above was the version of the Git repository at the time these results were generated.

Note that you need to be careful to ensure that all relevant files for the analysis have been committed to Git prior to generating the results (you can use wflow_publish or wflow_git_commit). workflowr only checks the R Markdown file, but you know if there are other scripts or data files that it depends on. Below is the status of the Git repository when the results were generated:


Ignored files:
    Ignored:    .DS_Store
    Ignored:    .Rhistory
    Ignored:    .Rproj.user/
    Ignored:    output/.DS_Store

Untracked files:
    Untracked:  .Rprofile
    Untracked:  ._.DS_Store
    Untracked:  .gitignore
    Untracked:  _workflowr.yml
    Untracked:  analysis/._PASdescriptiveplots.Rmd
    Untracked:  analysis/._cuttoffPercUsage.Rmd
    Untracked:  analysis/cuttoffPercUsage.Rmd
    Untracked:  apaQTL.Rproj
    Untracked:  code/._BothFracDTPlotGeneRegions_normalized.sh
    Untracked:  code/._FC_UTR.sh
    Untracked:  code/._FC_newPeaks_olddata.sh
    Untracked:  code/._LC_samplegroups.py
    Untracked:  code/._SnakefilePAS
    Untracked:  code/._SnakefilefiltPAS
    Untracked:  code/._TESplots100bp.sh
    Untracked:  code/._TESplots150bp.sh
    Untracked:  code/._TESplots200bp.sh
    Untracked:  code/._aAPAqtl_nominal39ind.sh
    Untracked:  code/._apaQTLCorrectPvalMakeQQ.R
    Untracked:  code/._apaQTL_Nominal.sh
    Untracked:  code/._apaQTL_permuted.sh
    Untracked:  code/._assignNucIntonpeak2intronlocs.sh
    Untracked:  code/._assignTotIntronpeak2intronlocs.sh
    Untracked:  code/._bam2BW_5primemost.sh
    Untracked:  code/._bed2saf.py
    Untracked:  code/._bothFrac_FC.sh
    Untracked:  code/._callPeaksYL.py
    Untracked:  code/._chooseAnno2SAF.py
    Untracked:  code/._chooseSignalSite
    Untracked:  code/._chooseSignalSite.py
    Untracked:  code/._cluster.json
    Untracked:  code/._clusterPAS.json
    Untracked:  code/._clusterfiltPAS.json
    Untracked:  code/._codingdms2bed.py
    Untracked:  code/._config.yaml
    Untracked:  code/._config2.yaml
    Untracked:  code/._configOLD.yaml
    Untracked:  code/._convertNumeric.py
    Untracked:  code/._dag.pdf
    Untracked:  code/._encodeRNADTplots.sh
    Untracked:  code/._extractGenotypes.py
    Untracked:  code/._fc2leafphen.py
    Untracked:  code/._filter5perc.R
    Untracked:  code/._filter5percPheno.py
    Untracked:  code/._filterpeaks.py
    Untracked:  code/._finalPASbed2SAF.py
    Untracked:  code/._fix4su304corr.py
    Untracked:  code/._fix4su604corr.py
    Untracked:  code/._fix4sukalisto.py
    Untracked:  code/._fixFChead.py
    Untracked:  code/._fixFChead_bothfrac.py
    Untracked:  code/._fixH3k12ac.py
    Untracked:  code/._fixRNAhead4corr.py
    Untracked:  code/._fixRNAkalisto.py
    Untracked:  code/._fixgroupedtranscript.py
    Untracked:  code/._fixhead_netseqfc.py
    Untracked:  code/._grouptranscripts.py
    Untracked:  code/._make5percPeakbed.py
    Untracked:  code/._makeFileID.py
    Untracked:  code/._makePheno.py
    Untracked:  code/._makeSAFbothfrac5perc.py
    Untracked:  code/._makegencondeTSSfile.py
    Untracked:  code/._mergeAllBam.sh
    Untracked:  code/._mergeBW_norm.sh
    Untracked:  code/._mergeByFracBam.sh
    Untracked:  code/._mergePeaks.sh
    Untracked:  code/._namePeaks.py
    Untracked:  code/._netseqFC.sh
    Untracked:  code/._peak2PAS.py
    Untracked:  code/._peakFC.sh
    Untracked:  code/._pheno2countonly.R
    Untracked:  code/._qtlsPvalOppFrac.py
    Untracked:  code/._quantassign2parsedpeak.py
    Untracked:  code/._removeloc_pheno.py
    Untracked:  code/._run_leafcutterDiffIso.sh
    Untracked:  code/._selectNominalPvalues.py
    Untracked:  code/._snakemakePAS.batch
    Untracked:  code/._snakemakefiltPAS.batch
    Untracked:  code/._submit-snakemakePAS.sh
    Untracked:  code/._submit-snakemakefiltPAS.sh
    Untracked:  code/._subset_diffisopheno.py
    Untracked:  code/._subtractExons.sh
    Untracked:  code/._utrdms2saf.py
    Untracked:  code/.snakemake/
    Untracked:  code/APAqtl_nominal.err
    Untracked:  code/APAqtl_nominal.out
    Untracked:  code/APAqtl_nominal_39.err
    Untracked:  code/APAqtl_nominal_39.out
    Untracked:  code/APAqtl_permuted.err
    Untracked:  code/APAqtl_permuted.out
    Untracked:  code/BothFracDTPlotGeneRegions.err
    Untracked:  code/BothFracDTPlotGeneRegions.out
    Untracked:  code/BothFracDTPlotGeneRegions_norm.err
    Untracked:  code/BothFracDTPlotGeneRegions_norm.out
    Untracked:  code/BothFracDTPlotGeneRegions_normalized.sh
    Untracked:  code/DistPAS2Sig.py
    Untracked:  code/EncodeRNADTPlotGeneRegions.err
    Untracked:  code/EncodeRNADTPlotGeneRegions.out
    Untracked:  code/FC_UTR.err
    Untracked:  code/FC_UTR.out
    Untracked:  code/FC_UTR.sh
    Untracked:  code/FC_newPAS_olddata.err
    Untracked:  code/FC_newPAS_olddata.out
    Untracked:  code/FC_newPeaks_olddata.sh
    Untracked:  code/LC_samplegroups.py
    Untracked:  code/README.md
    Untracked:  code/Rplots.pdf
    Untracked:  code/TESplots100bp.err
    Untracked:  code/TESplots100bp.out
    Untracked:  code/TESplots100bp.sh
    Untracked:  code/TESplots150bp.err
    Untracked:  code/TESplots150bp.out
    Untracked:  code/TESplots150bp.sh
    Untracked:  code/TESplots200bp.err
    Untracked:  code/TESplots200bp.out
    Untracked:  code/TESplots200bp.sh
    Untracked:  code/Upstream100Bases_general.py
    Untracked:  code/aAPAqtl_nominal39ind.sh
    Untracked:  code/apaQTLCorrectPvalMakeQQ_4pc.R
    Untracked:  code/apaQTL_Nominal_4pc.sh
    Untracked:  code/apaQTL_permuted.4pc.sh
    Untracked:  code/assignNucIntonpeak2intronlocs.sh
    Untracked:  code/assignPeak2Intronicregion.err
    Untracked:  code/assignPeak2Intronicregion.out
    Untracked:  code/assignTotIntronpeak2intronlocs.sh
    Untracked:  code/assigntotPeak2Intronicregion.err
    Untracked:  code/assigntotPeak2Intronicregion.out
    Untracked:  code/bam2BW_5primemost.sh
    Untracked:  code/bam2bw.err
    Untracked:  code/bam2bw.out
    Untracked:  code/bam2bw_5primemost.err
    Untracked:  code/bam2bw_5primemost.out
    Untracked:  code/bothFrac_FC.err
    Untracked:  code/bothFrac_FC.out
    Untracked:  code/bothFrac_FC.sh
    Untracked:  code/codingdms2bed.py
    Untracked:  code/dag.pdf
    Untracked:  code/dagPAS.pdf
    Untracked:  code/dagfiltPAS.pdf
    Untracked:  code/encodeRNADTplots.sh
    Untracked:  code/extractGenotypes.py
    Untracked:  code/fc2leafphen.py
    Untracked:  code/finalPASbed2SAF.py
    Untracked:  code/findbuginpeaks.R
    Untracked:  code/fix4su304corr.py
    Untracked:  code/fix4su604corr.py
    Untracked:  code/fix4sukalisto.py
    Untracked:  code/fixFChead_bothfrac.py
    Untracked:  code/fixFChead_summary.py
    Untracked:  code/fixH3k12ac.py
    Untracked:  code/fixRNAhead4corr.py
    Untracked:  code/fixRNAkalisto.py
    Untracked:  code/fixgroupedtranscript.py
    Untracked:  code/fixhead_netseqfc.py
    Untracked:  code/get100upPAS.py
    Untracked:  code/getSeq100up.sh
    Untracked:  code/getseq100up.err
    Untracked:  code/getseq100up.out
    Untracked:  code/grouptranscripts.err
    Untracked:  code/grouptranscripts.out
    Untracked:  code/grouptranscripts.py
    Untracked:  code/log/
    Untracked:  code/makeSAFbothfrac5perc.py
    Untracked:  code/makegencondeTSSfile.py
    Untracked:  code/mergeBW_norm.sh
    Untracked:  code/mergeBWnorm.err
    Untracked:  code/mergeBWnorm.out
    Untracked:  code/netseqFC.err
    Untracked:  code/netseqFC.out
    Untracked:  code/netseqFC.sh
    Untracked:  code/qtlsPvalOppFrac.py
    Untracked:  code/removeloc_pheno.py
    Untracked:  code/run_DistPAS2Sig.err
    Untracked:  code/run_DistPAS2Sig.out
    Untracked:  code/run_distPAS2Sig.sh
    Untracked:  code/run_leafcutterDiffIso.sh
    Untracked:  code/run_leafcutter_ds.err
    Untracked:  code/run_leafcutter_ds.out
    Untracked:  code/selectNominalPvalues.py
    Untracked:  code/snakePASlog.out
    Untracked:  code/snakefiltPASlog.out
    Untracked:  code/subset_diffisopheno.py
    Untracked:  code/subtractExons.err
    Untracked:  code/subtractExons.out
    Untracked:  code/subtractExons.sh
    Untracked:  code/transcriptdm2bed.py
    Untracked:  code/utrdms2saf.py
    Untracked:  data/CompareOldandNew/
    Untracked:  data/DTmatrix/
    Untracked:  data/DiffIso/
    Untracked:  data/EncodeRNA/
    Untracked:  data/GeuvadisRNA/
    Untracked:  data/PAS/
    Untracked:  data/QTLGenotypes/
    Untracked:  data/QTLoverlap/
    Untracked:  data/README.md
    Untracked:  data/RNAseq/
    Untracked:  data/Reads2UTR/
    Untracked:  data/SignalSiteFiles/
    Untracked:  data/ThirtyNineIndQtl_nominal/
    Untracked:  data/apaQTLNominal/
    Untracked:  data/apaQTLNominal_4pc/
    Untracked:  data/apaQTLPermuted/
    Untracked:  data/apaQTLPermuted_4pc/
    Untracked:  data/apaQTLs/
    Untracked:  data/assignedPeaks/
    Untracked:  data/bam/
    Untracked:  data/bam_clean/
    Untracked:  data/bam_waspfilt/
    Untracked:  data/bed_10up/
    Untracked:  data/bed_clean/
    Untracked:  data/bed_clean_sort/
    Untracked:  data/bed_waspfilter/
    Untracked:  data/bedsort_waspfilter/
    Untracked:  data/bothFrac_FC/
    Untracked:  data/bw_norm/
    Untracked:  data/exampleQTLs/
    Untracked:  data/fastq/
    Untracked:  data/filterPeaks/
    Untracked:  data/fourSU/
    Untracked:  data/h3k27ac/
    Untracked:  data/highdiffsiggenes.txt
    Untracked:  data/inclusivePeaks/
    Untracked:  data/inclusivePeaks_FC/
    Untracked:  data/intron_analysis/
    Untracked:  data/mergedBG/
    Untracked:  data/mergedBW_byfrac/
    Untracked:  data/mergedBW_norm/
    Untracked:  data/mergedBam/
    Untracked:  data/mergedbyFracBam/
    Untracked:  data/netseq/
    Untracked:  data/nuc_10up/
    Untracked:  data/nuc_10upclean/
    Untracked:  data/peakCoverage/
    Untracked:  data/peaks_5perc/
    Untracked:  data/phenotype/
    Untracked:  data/phenotype_5perc/
    Untracked:  data/sigDiffGenes.txt
    Untracked:  data/sort/
    Untracked:  data/sort_clean/
    Untracked:  data/sort_waspfilter/
    Untracked:  nohup.out
    Untracked:  output/._.DS_Store
    Untracked:  output/._meanCorrelationPhenotypes.svg
    Untracked:  output/dtPlots/
    Untracked:  output/fastqc/
    Untracked:  output/meanCorrelationPhenotypes.svg

Unstaged changes:
    Modified:   analysis/PASusageQC.Rmd
    Modified:   analysis/Readdistagainstfeatures.Rmd
    Modified:   analysis/choosePCs.Rmd
    Modified:   analysis/corrbetweenind.Rmd
    Modified:   analysis/nascenttranscription.Rmd
    Modified:   analysis/rerunQTL_changePC.Rmd
    Modified:   analysis/rna_netseq_h3k12ac.Rmd
    Modified:   code/BothFracDTPlotGeneRegions.sh
    Modified:   code/Snakefile
    Deleted:    code/Upstream10Bases_general.py
    Modified:   code/apaQTLCorrectPvalMakeQQ.R
    Modified:   code/apaQTL_permuted.sh
    Modified:   code/apaQTLsnake.err
    Modified:   code/bed2saf.py
    Modified:   code/cluster.json
    Modified:   code/config.yaml
    Deleted:    code/test.txt

Note that any generated files, e.g. HTML, png, CSS, etc., are not included in this status report because it is ok for generated content to have uncommitted changes.


These are the previous versions of the R Markdown and HTML files. If you’ve configured a remote Git repository (see ?wflow_git_remote), click on the hyperlinks in the table below to view them.

File Version Author Date Message
Rmd 2632f11 brimittleman 2019-05-23 add tot enriched
html b454fc6 brimittleman 2019-05-23 Build site.
Rmd fc5727e brimittleman 2019-05-23 add boxplots by pau
html ab26926 brimittleman 2019-05-22 Build site.
Rmd 6bfa078 brimittleman 2019-05-22 by delta pau
html ace11cc brimittleman 2019-05-22 Build site.
html 57d8a8c brimittleman 2019-05-22 Build site.
Rmd a259a17 brimittleman 2019-05-22 fix bug with utrs
html 365e817 brimittleman 2019-05-21 Build site.
html d859f02 brimittleman 2019-05-21 Build site.
Rmd 82fdc65 brimittleman 2019-05-21 add by length
html 801ca1b brimittleman 2019-05-20 Build site.
Rmd a455701 brimittleman 2019-05-20 analysis plot
html d89772d brimittleman 2019-05-15 Build site.
Rmd ee92964 brimittleman 2019-05-15 start ideas for inton analysis

I am interested in understanding where in the introns the nuclear peaks are. Are they closer to the three prime or five prime edge of the intron. This may help us understand if NMD is contributing to the loss of transcripts between the nuclear and total fraction.

I need to create an annotation with introns that do not overlap. For this I will use line up all of the exons for a gene then take the open spaces as introns.

library(tidyverse)
── Attaching packages ───────────────────────────────────────────────── tidyverse 1.2.1 ──
✔ ggplot2 3.1.1       ✔ purrr   0.3.2  
✔ tibble  2.1.1       ✔ dplyr   0.8.0.1
✔ tidyr   0.8.3       ✔ stringr 1.3.1  
✔ readr   1.3.1       ✔ forcats 0.3.0  
── Conflicts ──────────────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
library(workflowr)
This is workflowr version 1.3.0
Run ?workflowr for help getting started
library(cowplot)

Attaching package: 'cowplot'
The following object is masked from 'package:ggplot2':

    ggsave
nucIntronicPeaks=read.table("../data/peaks_5perc/APApeak_Peaks_GeneLocAnno.Nuclear.5perc.fc", stringsAsFactors = F, header = F,col.names = c("chr", "start", "end", "gene", "loc", "strand", "peak", "avgUsage")) %>% filter(loc=="intron")

nucIntronicPeaksBed=nucIntronicPeaks %>% mutate(ID=paste(peak,gene,loc, sep=":")) %>% dplyr::select(chr, start, end, ID, avgUsage, strand)

write.table(nucIntronicPeaksBed, "../data/intron_analysis/NuclearIntronicPeaks.bed", col.names = F, row.names = F, quote = F,sep="\t")

I will need to assign each of these to an intron in the new annotation.

The genome annotation file, Transcript2GeneName.dms has the information i need. I will need to parse this file. I need all exons for a gene (longest transcript) The file has the exon starts and ends for each transcript.

I will remove the exon locations for full transcripts using bedtools subtract.

Create transcript file.I will select all of the transcripts in the dms file and merge by gene name. Then I can subtract the exons

python transcriptdm2bed.py

Sort the output, group by transcript and fix order of columns.

sort -k1,1 -k2,2n /project2/gilad/briana/genome_anotation_data/RefSeq_annotations/AllTranscriptsbyName.bed > /project2/gilad/briana/genome_anotation_data/RefSeq_annotations/AllTranscriptsbyName.sort.bed

sbatch grouptranscripts.py

python fixgroupedtranscript.py

I want to subract any exon or UTR seqeunce. I have an annotation bed file I will use:

exonandUTRs=read.table("../../genome_anotation_data/RefSeq_annotations/ncbiRefSeq_FormatedallAnnotation.sort.bed", col.names = c("CHR", "start", "end", "ID", "score", "strand"),stringsAsFactors = F)%>% separate(ID, into=c("loc", "gene"),sep=":") %>% filter(loc!="intron") %>% dplyr::select(-loc) %>% mutate(CHR=paste("chr", CHR, sep=""))


write.table(exonandUTRs, file="../data/intron_analysis/ExonandUTRloc.bed", quote=F, col.names = F, row.names = F, sep="\t")
sort -k1,1 -k2,2n ../data/intron_analysis/ExonandUTRloc.bed > ../data/intron_analysis/ExonandUTRloc.sort.bed
sbatch subtractExons.sh

sort:

sort -k1,1 -k2,2n /project2/gilad/briana/apaQTL/data/intron_analysis/transcriptsMinusExons.bed > /project2/gilad/briana/apaQTL/data/intron_analysis/transcriptsMinusExons.sort.bed

Next I will map the intronic peaks on these positions.

sbatch assignNucIntonpeak2intronlocs.sh

Plot percentage of intron where PAS is.

pas2intron=read.table("../data/intron_analysis/IntronPeaksontoIntrons.bed",col.names = c("intronCHR", "intronStart", "intronEnd", "gene", "score", "strand", "peakCHR", "peakStart", "peakEnd", "PeakID", "meanUsage", "peakStrand")) %>% mutate(PASloc=ifelse(strand=="+", peakEnd, peakStart)) %>% dplyr::select(intronStart, intronEnd, gene, strand, PeakID, PASloc ,meanUsage) %>% mutate(intronLength=intronEnd-intronStart , distance2PAS= ifelse(strand=="+", PASloc-intronStart, intronEnd-PASloc), propIntron=distance2PAS/intronLength)
nuclearplot=ggplot(pas2intron, aes(x=propIntron)) + geom_histogram(bins=50, aes(y=..count../33345)) + labs(title="PAS position within intron \n for Nuclear PAS", y="Proportion of Intronic PAS", x="Proportion of Intronic Length")

Facet by usage 0-25, 25-50, 50-75, 75-1

pas2intron_usage=pas2intron %>% mutate(UsageCat=ifelse(meanUsage<=.5, "low", "high"))

ggplot(pas2intron_usage, aes(x=propIntron, fill=UsageCat)) + geom_histogram(bins=50, aes(y=..count../33345)) + labs(title="PAS position within intron", y="Proportion of Intronic PAS", x="Proportion of Intronic Length") + facet_grid(~UsageCat)

Version Author Date
57d8a8c brimittleman 2019-05-22
d859f02 brimittleman 2019-05-21

Look at different intron lengths:

First i want to look at the distribution of intorn lengths:

ggplot(pas2intron_usage, aes(x=log10(intronLength))) + geom_density()

Version Author Date
57d8a8c brimittleman 2019-05-22
d859f02 brimittleman 2019-05-21

I will look at above and below the mean intron length:

meanIntronlength=mean(pas2intron_usage$intronLength)

pas2intron_length=pas2intron %>% mutate(LengthCat=ifelse(intronLength<=meanIntronlength, "bottom", "top"))


ggplot(pas2intron_length, aes(x=propIntron, fill=LengthCat)) + geom_histogram(bins=50, aes(y=..count../33345)) + labs(title="PAS position within intron", y="Proportion of Intronic PAS", x="Proportion of Intronic Length") + facet_grid(~LengthCat)

Version Author Date
57d8a8c brimittleman 2019-05-22
ggplot(pas2intron_length, aes(x=distance2PAS, fill=LengthCat)) + geom_histogram(bins=50, aes(y=..count../33345)) + labs(title="PAS position within intron", y="Proportion of Intronic PAS", x="Proportion of Intronic Length") + facet_grid(~LengthCat)

Version Author Date
57d8a8c brimittleman 2019-05-22

Look at quartiles:

summary(pas2intron_usage$intronLength)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
    106    3929    9220   22248   24094 1102540 
pas2intron_length2=pas2intron %>% mutate(LengthCat=ifelse(intronLength<=3929, "first", ifelse(intronLength>3929 &intronLength<=9220, "second", ifelse(intronLength>9220 &intronLength<=24094, "third", "fourth"))))

pas2intron_length2$LengthCat <- factor(pas2intron_length2$LengthCat, levels=c("first", "second", "third", "fourth"))
                                         
ggplot(pas2intron_length2, aes(x=propIntron, fill=LengthCat)) + geom_histogram(bins=50, aes(y=..count../33345)) + labs(title="PAS position within intron \n Nuclear intronic PAS", y="Proportion of Intronic PAS", x="Proportion of Intronic Length") + facet_grid(~LengthCat)+theme(axis.text.x = element_text(angle = 90, hjust = 1))

Version Author Date
ace11cc brimittleman 2019-05-22
57d8a8c brimittleman 2019-05-22
ggplot(pas2intron_length2, aes(x=log(distance2PAS)), by=LengthCat, col=LengthCat) + stat_ecdf(aes(col=LengthCat)) 

Version Author Date
ace11cc brimittleman 2019-05-22
57d8a8c brimittleman 2019-05-22

Look at distribution in total fraction:

totIntronicPeaks=read.table("../data/peaks_5perc/APApeak_Peaks_GeneLocAnno.Total.5perc.fc", stringsAsFactors = F, header = F,col.names = c("chr", "start", "end", "gene", "loc", "strand", "peak", "avgUsage")) %>% filter(loc=="intron")

totIntronicPeaksBed=totIntronicPeaks %>% mutate(ID=paste(peak,gene,loc, sep=":")) %>% dplyr::select(chr, start, end, ID, avgUsage, strand)

write.table(totIntronicPeaksBed, "../data/intron_analysis/TotalIntronicPeaks.bed", col.names = F, row.names = F, quote = F,sep="\t")

map these to the intron file

sbatch assignTotIntronpeak2intronlocs.sh
pas2intronTot=read.table("../data/intron_analysis/TotalIntronPeaksontoIntrons.bed",col.names = c("intronCHR", "intronStart", "intronEnd", "gene", "score", "strand", "peakCHR", "peakStart", "peakEnd", "PeakID", "meanUsage", "peakStrand")) %>% mutate(PASloc=ifelse(strand=="+", peakEnd, peakStart)) %>% dplyr::select(intronStart, intronEnd, gene, strand, PeakID, PASloc ,meanUsage) %>% mutate(intronLength=intronEnd-intronStart , distance2PAS= ifelse(strand=="+", PASloc-intronStart, intronEnd-PASloc), propIntron=distance2PAS/intronLength)

nrow(pas2intronTot)
[1] 31954
totalplot=ggplot(pas2intronTot, aes(x=propIntron)) + geom_histogram(bins=50, aes(y=..count../31954)) + labs(title="PAS position within intron \nfor total PAS", y="Proportion of Intronic PAS", x="Proportion of Intronic Length")
plot_grid(totalplot, nuclearplot)

Version Author Date
57d8a8c brimittleman 2019-05-22
summary(pas2intronTot$intronLength)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
    106    3785    8872   21032   22928 1102540 
pas2intron_totlength2=pas2intronTot %>% mutate(LengthCat=ifelse(intronLength<=3785, "first", ifelse(intronLength>3785 &intronLength<=8872, "second", ifelse(intronLength>8872 &intronLength<=22928, "third", "fourth"))))

pas2intron_totlength2$LengthCat <- factor(pas2intron_totlength2$LengthCat, levels=c("first", "second", "third", "fourth"))

                                         
ggplot(pas2intron_totlength2, aes(x=propIntron, fill=LengthCat)) + geom_histogram(bins=50, aes(y=..count../31954)) + labs(title="PAS position within intron \n Total intronic PAS", y="Proportion of Intronic PAS", x="Proportion of Intronic Length") + facet_grid(~LengthCat) + theme(axis.text.x = element_text(angle = 90, hjust = 1))

Version Author Date
ace11cc brimittleman 2019-05-22
57d8a8c brimittleman 2019-05-22

Look by differences in \(\Delta\) PAU.

effectsize=read.table("../data/DiffIso/TN_diff_isoform_AllChrom_effect_sizes.txt", stringsAsFactors = F, col.names=c('PAS',  'logef' ,'Nuclear', 'Total','deltaPAU')) %>% filter(PAS != "intron") %>% dplyr::select(PAS, deltaPAU)
pas2intronNuc=read.table("../data/intron_analysis/IntronPeaksontoIntrons.bed",col.names = c("intronCHR", "intronStart", "intronEnd", "gene", "score", "strand", "peakCHR", "peakStart", "peakEnd", "PeakID", "meanUsage", "peakStrand")) %>% mutate(PAS=paste(peakCHR, peakStart, peakEnd, gene, sep=":")) %>% inner_join(effectsize, by="PAS") %>% mutate(PASloc=ifelse(strand=="+", peakEnd, peakStart)) %>% dplyr::select(intronStart, intronEnd, gene, strand, PeakID, PASloc ,meanUsage, deltaPAU) %>% mutate(intronLength=intronEnd-intronStart , distance2PAS= ifelse(strand=="+", PASloc-intronStart, intronEnd-PASloc), propIntron=distance2PAS/intronLength)

pas2intronNuc$deltaPAU= as.numeric(pas2intronNuc$deltaPAU)

Plot relationship between delta PAU and distance to proporiton of intron

ggplot(pas2intronNuc, aes(x=propIntron, y=deltaPAU)) + geom_point() + geom_density2d(col="red")

Version Author Date
ab26926 brimittleman 2019-05-22
pas2intronNuc_cat=pas2intronNuc %>% mutate(Fraction=ifelse(deltaPAU <= -0.2, "NuclearSpecific", ifelse(deltaPAU>-.2 &deltaPAU < .2, "NotDiff", "TotalEnriched"))) %>% mutate(LengthCat=ifelse(intronLength<=3929, "first", ifelse(intronLength>3929 &intronLength<=9220, "second", ifelse(intronLength>9220 &intronLength<=24094, "third", "fourth"))))

pas2intronNuc_cat$LengthCat <- factor(pas2intronNuc_cat$LengthCat, levels=c("first", "second", "third", "fourth"))

Plot this:

ggplot(pas2intronNuc_cat, aes(y=propIntron, x=Fraction, fill=Fraction)) + geom_boxplot(alpha=.5) + facet_grid(~LengthCat)  + theme(axis.text.x = element_text(angle = 90, hjust = 1)) 

Version Author Date
b454fc6 brimittleman 2019-05-23
ab26926 brimittleman 2019-05-22
ggplot(pas2intronNuc_cat, aes(x=propIntron, by=Fraction, fill=Fraction)) + geom_density(alpha=.5) + facet_grid(~LengthCat)  + theme(axis.text.x = element_text(angle = 90, hjust = 1))

Version Author Date
b454fc6 brimittleman 2019-05-23
ggplot(pas2intronNuc_cat, aes(y=propIntron, x=Fraction, fill=Fraction)) + geom_boxplot(alpha=.5)  + theme(axis.text.x = element_text(angle = 90, hjust = 1)) 

pas2intronTotal=read.table("../data/intron_analysis/TotalIntronPeaksontoIntrons.bed",col.names = c("intronCHR", "intronStart", "intronEnd", "gene", "score", "strand", "peakCHR", "peakStart", "peakEnd", "PeakID", "meanUsage", "peakStrand")) %>% mutate(PAS=paste(peakCHR, peakStart, peakEnd, gene, sep=":")) %>% inner_join(effectsize, by="PAS") %>% mutate(PASloc=ifelse(strand=="+", peakEnd, peakStart)) %>% dplyr::select(intronStart, intronEnd, gene, strand, PeakID, PASloc ,meanUsage, deltaPAU) %>% mutate(intronLength=intronEnd-intronStart , distance2PAS= ifelse(strand=="+", PASloc-intronStart, intronEnd-PASloc), propIntron=distance2PAS/intronLength)

pas2intronTotal$deltaPAU= as.numeric(pas2intronTotal$deltaPAU)


pas2intronTotal_cat=pas2intronTotal %>% mutate(Fraction=ifelse(deltaPAU <= -0.2, "NuclearSpecific", ifelse(deltaPAU>-.2 &deltaPAU < .2, "NotDiff", "TotalEnriched"))) %>% mutate(LengthCat=ifelse(intronLength<=3785, "first", ifelse(intronLength>3785 &intronLength<=8872, "second", ifelse(intronLength>8872 &intronLength<=22928, "third", "fourth"))))



pas2intronTotal_cat$LengthCat <- factor(pas2intronTotal_cat$LengthCat, levels=c("first", "second", "third", "fourth"))

ggplot(pas2intronTotal_cat, aes(y=propIntron, x=Fraction, fill=Fraction)) + geom_boxplot(alpha=.5) + facet_grid(~LengthCat)  + theme(axis.text.x = element_text(angle = 90, hjust = 1)) 


sessionInfo()
R version 3.5.1 (2018-07-02)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Scientific Linux 7.4 (Nitrogen)

Matrix products: default
BLAS/LAPACK: /software/openblas-0.2.19-el7-x86_64/lib/libopenblas_haswellp-r0.2.19.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] cowplot_0.9.4   workflowr_1.3.0 forcats_0.3.0   stringr_1.3.1  
 [5] dplyr_0.8.0.1   purrr_0.3.2     readr_1.3.1     tidyr_0.8.3    
 [9] tibble_2.1.1    ggplot2_3.1.1   tidyverse_1.2.1

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.0       cellranger_1.1.0 pillar_1.3.1     compiler_3.5.1  
 [5] git2r_0.23.0     plyr_1.8.4       tools_3.5.1      digest_0.6.18   
 [9] lubridate_1.7.4  jsonlite_1.6     evaluate_0.12    nlme_3.1-137    
[13] gtable_0.2.0     lattice_0.20-38  pkgconfig_2.0.2  rlang_0.3.1     
[17] cli_1.0.1        rstudioapi_0.10  yaml_2.2.0       haven_1.1.2     
[21] withr_2.1.2      xml2_1.2.0       httr_1.3.1       knitr_1.20      
[25] hms_0.4.2        generics_0.0.2   fs_1.2.6         rprojroot_1.3-2 
[29] grid_3.5.1       tidyselect_0.2.5 glue_1.3.0       R6_2.3.0        
[33] readxl_1.1.0     rmarkdown_1.10   reshape2_1.4.3   modelr_0.1.2    
[37] magrittr_1.5     whisker_0.3-2    MASS_7.3-51.1    backports_1.1.2 
[41] scales_1.0.0     htmltools_0.3.6  rvest_0.3.2      assertthat_0.2.0
[45] colorspace_1.3-2 labeling_0.3     stringi_1.2.4    lazyeval_0.2.1  
[49] munsell_0.5.0    broom_0.5.1      crayon_1.3.4