Last updated: 2019-03-06
Checks: 6 0
Knit directory: threeprimeseq/analysis/
This reproducible R Markdown analysis was created with workflowr (version 1.2.0). The Report tab describes the reproducibility checks that were applied when the results were created. The Past versions tab lists the development history.
Great! Since the R Markdown file has been committed to the Git repository, you know the exact version of the code that produced these results.
Great job! The global environment was empty. Objects defined in the global environment can affect the analysis in your R Markdown file in unknown ways. For reproduciblity it’s best to always run the code in an empty environment.
The command set.seed(12345)
was run prior to running the code in the R Markdown file. Setting a seed ensures that any results that rely on randomness, e.g. subsampling or permutations, are reproducible.
Great job! Recording the operating system, R version, and package versions is critical for reproducibility.
Nice! There were no cached chunks for this analysis, so you can be confident that you successfully produced the results during this run.
Great! You are using Git for version control. Tracking code development and connecting the code version to the results is critical for reproducibility. The version displayed above was the version of the Git repository at the time these results were generated.
Note that you need to be careful to ensure that all relevant files for the analysis have been committed to Git prior to generating the results (you can use wflow_publish
or wflow_git_commit
). workflowr only checks the R Markdown file, but you know if there are other scripts or data files that it depends on. Below is the status of the Git repository when the results were generated:
Ignored files:
Ignored: .DS_Store
Ignored: .Rhistory
Ignored: .Rproj.user/
Ignored: data/.DS_Store
Ignored: data/perm_QTL_trans_noMP_5percov/
Ignored: output/.DS_Store
Untracked files:
Untracked: KalistoAbundance18486.txt
Untracked: analysis/4suDataIGV.Rmd
Untracked: analysis/DirectionapaQTL.Rmd
Untracked: analysis/EmpDistforOverlaps.Rmd
Untracked: analysis/EvaleQTLs.Rmd
Untracked: analysis/YL_QTL_test.Rmd
Untracked: analysis/groSeqAnalysis.Rmd
Untracked: analysis/ncbiRefSeq_sm.sort.mRNA.bed
Untracked: analysis/snake.config.notes.Rmd
Untracked: analysis/verifyBAM.Rmd
Untracked: analysis/verifybam_dubs.Rmd
Untracked: code/PeaksToCoverPerReads.py
Untracked: code/strober_pc_pve_heatmap_func.R
Untracked: data/18486.genecov.txt
Untracked: data/APApeaksYL.total.inbrain.bed
Untracked: data/AllPeak_counts/
Untracked: data/ApaQTLs/
Untracked: data/ApaQTLs_otherPhen/
Untracked: data/ChromHmmOverlap/
Untracked: data/DistTXN2Peak_genelocAnno/
Untracked: data/FeatureoverlapPeaks/
Untracked: data/GM12878.chromHMM.bed
Untracked: data/GM12878.chromHMM.txt
Untracked: data/LianoglouLCL/
Untracked: data/LocusZoom/
Untracked: data/LocusZoom_Unexp/
Untracked: data/LocusZoom_proc/
Untracked: data/MatchedSnps/
Untracked: data/NuclearApaQTLs.txt
Untracked: data/PeakCounts/
Untracked: data/PeakCounts_noMP_5perc/
Untracked: data/PeakCounts_noMP_genelocanno/
Untracked: data/PeakUsage/
Untracked: data/PeakUsage_noMP/
Untracked: data/PeakUsage_noMP_GeneLocAnno/
Untracked: data/PeaksUsed/
Untracked: data/PeaksUsed_noMP_5percCov/
Untracked: data/PolyA_DB/
Untracked: data/QTL_overlap/
Untracked: data/RNAkalisto/
Untracked: data/RefSeq_annotations/
Untracked: data/Replicates_usage/
Untracked: data/TotalApaQTLs.txt
Untracked: data/Totalpeaks_filtered_clean.bed
Untracked: data/UnderstandPeaksQC/
Untracked: data/WASP_STAT/
Untracked: data/YL-SP-18486-T-combined-genecov.txt
Untracked: data/YL-SP-18486-T_S9_R1_001-genecov.txt
Untracked: data/YL_QTL_test/
Untracked: data/apaExamp/
Untracked: data/apaExamp_proc/
Untracked: data/apaQTL_examp_noMP/
Untracked: data/bedgraph_peaks/
Untracked: data/bin200.5.T.nuccov.bed
Untracked: data/bin200.Anuccov.bed
Untracked: data/bin200.nuccov.bed
Untracked: data/clean_peaks/
Untracked: data/comb_map_stats.csv
Untracked: data/comb_map_stats.xlsx
Untracked: data/comb_map_stats_39ind.csv
Untracked: data/combined_reads_mapped_three_prime_seq.csv
Untracked: data/diff_iso_GeneLocAnno/
Untracked: data/diff_iso_proc/
Untracked: data/diff_iso_trans/
Untracked: data/eQTLs_Lietal/
Untracked: data/ensemble_to_genename.txt
Untracked: data/example_gene_peakQuant/
Untracked: data/explainProtVar/
Untracked: data/filtPeakOppstrand_cov_noMP_GeneLocAnno_5perc/
Untracked: data/filtered_APApeaks_merged_allchrom_refseqTrans.closest2End.bed
Untracked: data/filtered_APApeaks_merged_allchrom_refseqTrans.closest2End.noties.bed
Untracked: data/first50lines_closest.txt
Untracked: data/gencov.test.csv
Untracked: data/gencov.test.txt
Untracked: data/gencov_zero.test.csv
Untracked: data/gencov_zero.test.txt
Untracked: data/gene_cov/
Untracked: data/joined
Untracked: data/leafcutter/
Untracked: data/merged_combined_YL-SP-threeprimeseq.bg
Untracked: data/molPheno_noMP/
Untracked: data/mol_overlap/
Untracked: data/mol_pheno/
Untracked: data/nom_QTL/
Untracked: data/nom_QTL_opp/
Untracked: data/nom_QTL_trans/
Untracked: data/nuc6up/
Untracked: data/nuc_10up/
Untracked: data/other_qtls/
Untracked: data/pQTL_otherphen/
Untracked: data/pacbio_cov/
Untracked: data/peakPerRefSeqGene/
Untracked: data/peaks4DT/
Untracked: data/perm_QTL/
Untracked: data/perm_QTL_GeneLocAnno_noMP_5percov/
Untracked: data/perm_QTL_GeneLocAnno_noMP_5percov_3UTR/
Untracked: data/perm_QTL_diffWindow/
Untracked: data/perm_QTL_opp/
Untracked: data/perm_QTL_trans/
Untracked: data/perm_QTL_trans_filt/
Untracked: data/protAndAPAAndExplmRes.Rda
Untracked: data/protAndAPAlmRes.Rda
Untracked: data/protAndExpressionlmRes.Rda
Untracked: data/reads_mapped_three_prime_seq.csv
Untracked: data/smash.cov.results.bed
Untracked: data/smash.cov.results.csv
Untracked: data/smash.cov.results.txt
Untracked: data/smash_testregion/
Untracked: data/ssFC200.cov.bed
Untracked: data/temp.file1
Untracked: data/temp.file2
Untracked: data/temp.gencov.test.txt
Untracked: data/temp.gencov_zero.test.txt
Untracked: data/threePrimeSeqMetaData.csv
Untracked: data/threePrimeSeqMetaData55Ind.txt
Untracked: data/threePrimeSeqMetaData55Ind.xlsx
Untracked: data/threePrimeSeqMetaData55Ind_noDup.txt
Untracked: data/threePrimeSeqMetaData55Ind_noDup.xlsx
Untracked: data/threePrimeSeqMetaData55Ind_noDup_WASPMAP.txt
Untracked: data/threePrimeSeqMetaData55Ind_noDup_WASPMAP.xlsx
Untracked: output/LZ/
Untracked: output/deeptools_plots/
Untracked: output/picard/
Untracked: output/plots/
Untracked: output/qual.fig2.pdf
Unstaged changes:
Modified: analysis/28ind.peak.explore.Rmd
Modified: analysis/CompareLianoglouData.Rmd
Modified: analysis/NewPeakPostMP.Rmd
Modified: analysis/apaQTLoverlapGWAS.Rmd
Modified: analysis/cleanupdtseq.internalpriming.Rmd
Modified: analysis/coloc_apaQTLs_protQTLs.Rmd
Modified: analysis/dif.iso.usage.leafcutter.Rmd
Modified: analysis/diff_iso_pipeline.Rmd
Modified: analysis/explainpQTLs.Rmd
Modified: analysis/explore.filters.Rmd
Modified: analysis/fixBWChromNames.Rmd
Modified: analysis/flash2mash.Rmd
Modified: analysis/mispriming_approach.Rmd
Modified: analysis/overlapMolQTL.Rmd
Modified: analysis/overlapMolQTL.opposite.Rmd
Modified: analysis/overlap_qtls.Rmd
Modified: analysis/peakOverlap_oppstrand.Rmd
Modified: analysis/peakQCPPlots.Rmd
Modified: analysis/pheno.leaf.comb.Rmd
Modified: analysis/pipeline_55Ind.Rmd
Modified: analysis/swarmPlots_QTLs.Rmd
Modified: analysis/test.max2.Rmd
Modified: analysis/test.smash.Rmd
Modified: analysis/understandPeaks.Rmd
Modified: analysis/unexplainedeQTL_analysis.Rmd
Modified: code/Snakefile
Note that any generated files, e.g. HTML, png, CSS, etc., are not included in this status report because it is ok for generated content to have uncommitted changes.
These are the previous versions of the R Markdown and HTML files. If you’ve configured a remote Git repository (see ?wflow_git_remote
), click on the hyperlinks in the table below to view them.
File | Version | Author | Date | Message |
---|---|---|---|---|
Rmd | e1c4295 | Briana Mittleman | 2019-03-06 | add messing with meme- start own script |
html | 8a73f91 | Briana Mittleman | 2019-03-02 | Build site. |
Rmd | 49f21df | Briana Mittleman | 2019-03-02 | modify analysis for len 6 add qtl match |
html | 901e191 | Briana Mittleman | 2019-03-02 | Build site. |
Rmd | a96297c | Briana Mittleman | 2019-03-02 | add signal site enrich analysis |
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC310884/ (50bp up)
I will use homer to look for enriched signals in my signal sites.
I need to get the 50 base pairs upstream of my peaks and the shuffled peaks. This should encompase the signal site. This is similar to the 10 bp upstream script I used to identify evidence of mispriming. I want it to take any peak file and conver the bed to the 50bp uptream.
Upstream50Bases.py
#python
def main(Fin, Fout):
outBed=open(Fout, "w")
chrom_lengths=open("/project2/gilad/briana/genome_anotation_data/chrom_lengths2.sort.bed","r")
#make a dictionary with chrom lengths
length_dic={}
for i in chrom_lengths:
chrom, start, end = i.split()
length_dic[str(chrom)]=int(end)
#write file
for ln in open(Fin):
chrom, start, end, name, score, strand = ln.split()
chrom=str(chrom)
if strand=="+":
start_new=int(start)-50
if start_new <= 1:
start_new = 0
end_new= int(start)
if end_new == 0:
end_new=1
outBed.write("%s\t%d\t%d\t%s\t%s\t%s\n"%(chrom, start_new, end_new, name, score, strand))
if strand == "-":
start_new=int(end)
end_new=int(end) + 50
outBed.write("%s\t%d\t%d\t%s\t%s\t%s\n"%(chrom, start_new, end_new, name, score, strand))
outBed.close()
if __name__ == "__main__":
import sys
inFile = sys.argv[1]
outFile=sy.argv[2]
main(inFile, outFile)
RUn this for both the peak and shuff file with the fixed peak strands:
/project2/gilad/briana/threeprimeseq/data/FeatureoverlapPeaks/shuffled_FilterPeaks.sort.bed
/project2/gilad/briana/threeprimeseq/data/peaks4DT/APAPeaks_5percCov_fixedStrand.bed
run_get50up.sh
#!/bin/bash
#SBATCH --job-name=run_get50up
#SBATCH --account=pi-yangili1
#SBATCH --time=36:00:00
#SBATCH --output=run_get50upt.out
#SBATCH --error=run_get50up.err
#SBATCH --partition=broadwl
#SBATCH --mem=16G
#SBATCH --mail-type=END
module load Anaconda3
source activate three-prime-env
python Upstream50Bases.py /project2/gilad/briana/threeprimeseq/data/FeatureoverlapPeaks/shuffled_FilterPeaks.sort.bed /project2/gilad/briana/threeprimeseq/data/SignalEnrich/fiftyup_shuffled_FilterPeaks.sort.bed
python Upstream50Bases.py /project2/gilad/briana/threeprimeseq/data/peaks4DT/APAPeaks_5percCov_fixedStrand.bed /project2/gilad/briana/threeprimeseq/data/SignalEnrich/fiftyup_APAPeaks_5percCov_fixedStrand.bed
pabp (signal site) is top known
Run homer enrichement
-h hypergeometic enrichment
run it in downloaded version for now- will fix when environment solves and this is in anaconda env.
signalSiteEnrich.sh
#!/bin/bash
#SBATCH --job-name=signalSiteEnrich
#SBATCH --account=pi-yangili1
#SBATCH --time=36:00:00
#SBATCH --output=signalSiteEnrich.out
#SBATCH --error=signalSiteEnrich.err
#SBATCH --partition=bigmem2
#SBATCH --mem=100G
#SBATCH --mail-type=END
source deactivate three-prime-env
module unload Anaconda3
findMotifsGenome.pl /project2/gilad/briana/threeprimeseq/data/peaks4DT/APAPeaks_5percCov_fixedStrand.bed /project2/gilad/briana/genome_anotation_data/genome/Homo_sapiens.GRCh37.75.dna_sm.all.fa /project2/gilad/briana/threeprimeseq/data/SignalEnrich/ -size -300,100 -h -bg /project2/gilad/briana/threeprimeseq/data/FeatureoverlapPeaks/shuffled_FilterPeaks.sort.bed -len 6
I see the correct enrichment but I want to understand where the enrichment is compared to the peaks. Iam worried the peaks are really close to eachother and maybe the enrichment was from a signal site of another peak.
Homer gives some results:
Tpos: average position of motif in target sequences (0 = start of sequences)
This result is only there for the new res. THe known res arnt there. I can run homer specfically looking for the motifs in the https://www.ncbi.nlm.nih.gov/pmc/articles/PMC310884/ paper.
I need to decide on a: Log odds detection threshold, used to determine bound vs. unbound sites (mandatory) example: 8.059752
I cau use the seq2profile.pl for this (here I can say allow 0 mismatch )
Motif file:
/project2/gilad/briana/threeprimeseq/data/SignalEnrich_MyMotifs/
makeMotifFiles.sh
#!/bin/bash
#SBATCH --job-name=makeMotifFiles
#SBATCH --account=pi-yangili1
#SBATCH --time=36:00:00
#SBATCH --output=makeMotifFiles.out
#SBATCH --error=makeMotifFiles.err
#SBATCH --partition=bigmem2
#SBATCH --mem=100G
#SBATCH --mail-type=END
source ~/.bash_profile
~/homer/bin/seq2profile.pl AATAAA AATAAA-Sig1 > /project2/gilad/briana/threeprimeseq/data/SignalEnrich_MyMotifs/AATAAA-Sig1.motif
~/homer/bin/seq2profile.pl ATTAAA ATTAAA-Sig2 > /project2/gilad/briana/threeprimeseq/data/SignalEnrich_MyMotifs/ATTAAA-Sig2.motif
~/homer/bin/seq2profile.pl AGTAAA AGTAAA-Sig3 > /project2/gilad/briana/threeprimeseq/data/SignalEnrich_MyMotifs/AGTAAA-Sig3.motif
~/homer/bin/seq2profile.pl TATAAA TATAAA-Sig4 > /project2/gilad/briana/threeprimeseq/data/SignalEnrich_MyMotifs/TATAAA-Sig4.motif
~/homer/bin/seq2profile.pl CATAAA CATAAA-Sig5 > /project2/gilad/briana/threeprimeseq/data/SignalEnrich_MyMotifs/CATAAA-Sig5.motif
~/homer/bin/seq2profile.pl GATAAA GATAAA-Sig6 > /project2/gilad/briana/threeprimeseq/data/SignalEnrich_MyMotifs/GATAAA-Sig6.motif
~/homer/bin/seq2profile.pl AATATA AATATA-Sig7 > /project2/gilad/briana/threeprimeseq/data/SignalEnrich_MyMotifs/AATATA-Sig7.motif
~/homer/bin/seq2profile.pl AATACA AATACA-Sig8 > /project2/gilad/briana/threeprimeseq/data/SignalEnrich_MyMotifs/AATACA-Sig8.motif
~/homer/bin/seq2profile.pl AATAGA AATAGA-Sig9 > /project2/gilad/briana/threeprimeseq/data/SignalEnrich_MyMotifs/AATAGA-Sig9.motif
~/homer/bin/seq2profile.pl AAAAAG AAAAAG-Sig10 > /project2/gilad/briana/threeprimeseq/data/SignalEnrich_MyMotifs/AAAAAG-Sig10.motif
~/homer/bin/seq2profile.pl ACTAAA ACTAAA-Sig11 > /project2/gilad/briana/threeprimeseq/data/SignalEnrich_MyMotifs/ACTAAA-Sig11.motif
I can then run the motif finding on these
I want to find inidivdual matches for these with http://meme-suite.org/doc/fimo.html
I need to convert these with motif2meme
#R
source("/project2/gilad/briana/Motif2Meme/motif2meme.R")
motif2meme("/project2/gilad/briana/threeprimeseq/data/SignalEnrich_MyMotifs/AATAAA-Sig1.motif")
Read 7 items Error in index.start:index.end : NA/NaN argument
Try to make my own: /project2/gilad/briana/threeprimeseq/data/SignalEnrich_MyMotifs/AATAAA_meme.motif
MEME version 4
ALPHABET= ACGT
strands: +
Background letter frequencies (from
A 0.303 C 0.183 G 0.209 T 0.306
MOTIF AATAAA
letter-probability matrix: alength= 4 w= 6 nsites= 6 E= 0
0.997 0.001 0.001 0.001
0.001 0.997 0.001 0.001
0.001 0.001 0.001 0.997
0.997 0.001 0.001 0.001
0.997 0.001 0.001 0.001
0.997 0.001 0.001 0.001
I also need to make a fasta formated file for the regions. (start with the 50 upstream)
takes the bedtools nuc res and makes a fasta
interactively
#example
# >peakid
#seq
inFile=open("/project2/gilad/briana/threeprimeseq/data/SignalEnrich/Seq_fiftyup_APAPeaks_5percCov_fixedStrand.bed", "r")
outFile=open("/project2/gilad/briana/threeprimeseq/data/SignalEnrich/Seq_fiftyup_APAPeaks_5percCov_fixedStrand.fasta", "w")
for i, ln in enumerate(inFile):
if i >0:
id=ln.split()[3]
seq=ln.split()[15]
outFile.write(">%s\n%s\n"%(id, seq))
outFile.close()
add this to conda environment “meme” Test this one the one motif:
fimo --o /project2/gilad/briana/threeprimeseq/data/SignalEnrich_MyMotifs/ /project2/gilad/briana/threeprimeseq/data/SignalEnrich_MyMotifs/AATAAA_meme.motif /project2/gilad/briana/threeprimeseq/data/SignalEnrich/Seq_fiftyup_APAPeaks_5percCov_fixedStrand.fasta
If this doesnt work I will write my own python script to look for the motifs in the sequences. I can then look through the sequences and see if any of those are there and if so which. I can look for each in an heiarchical way. I can then find the location of the signal. I can use the .find command for this in python. It will give me the first index if findes the string.
I can make a second dictionary with the resuts. Anytime a signal is found, Ican add the location of the signal to a list in the value for the dictionary.
change region I am looking at before I do this - move to new analysis and try a new method for this analysis
sigsite=['AATAAA', 'ATTAAA' ,'AGTAAA' ,'TATAAA' ,'CATAAA' ,'GATAAA' ,'AATATA' ,'AATACA' ,'AATAGA' ,'AAAAAG','ACTAAA']
#start results dic, key is site, value is empty list
Res_dic={}
for i in sigsite:
Res_dic[i]=[]
upstreamSeq=open("/project2/gilad/briana/threeprimeseq/data/SignalEnrich/Seq_fiftyup_APAPeaks_5percCov_fixedStrand.fasta", "r")
for i,ln in enumerate(upstreamSeq):
if i % 2 ==0:
if sigsite[0] in ln:
if sigsite[1] in ln:
if sigsite[2] in ln:
if sigsite[3] in ln:
if sigsite[4] in ln:
if sigsite[5] in ln:
if sigsite[6] in ln:
if sigsite[7] in ln:
if sigsite[8] in ln:
if sigsite[9] in ln:
if sigsite[10] in ln:
#not finished code
/project2/gilad/briana/threeprimeseq/data/ApaQTLs/Nuclear.apaQTLs.sort.bed /project2/gilad/briana/threeprimeseq/data/ApaQTLs/Total.apaQTLs.sort.bed
cat /project2/gilad/briana/threeprimeseq/data/ApaQTLs/Nuclear.apaQTLs.sort.bed /project2/gilad/briana/threeprimeseq/data/ApaQTLs/Total.apaQTLs.sort.bed | sort -k1,1 -k2,2n > /project2/gilad/briana/threeprimeseq/data/ApaQTLs/All.apaQTLs.sort.bed
cat /project2/gilad/briana/threeprimeseq/data/MatchedSnp/Nuclear_matched_snps_sort.bed /project2/gilad/briana/threeprimeseq/data/MatchedSnp/Total_matched_snps_sort.bed | sort -k1,1 -k2,2n > /project2/gilad/briana/threeprimeseq/data/MatchedSnp/All_matched_snps_sort.bed
signalSiteEnrich_QTLs.sh
#!/bin/bash
#SBATCH --job-name=signalSiteEnrich_QTL
#SBATCH --account=pi-yangili1
#SBATCH --time=36:00:00
#SBATCH --output=signalSiteEnrich_QTL.out
#SBATCH --error=signalSiteEnrich_QTL.err
#SBATCH --partition=bigmem2
#SBATCH --mem=100G
#SBATCH --mail-type=END
#source deactivate three-prime-env
#module unload Anaconda3
~/homer/bin/findMotifsGenome.pl /project2/gilad/briana/threeprimeseq/data/ApaQTLs/All.apaQTLs.sort.bed /project2/gilad/briana/genome_anotation_data/genome/Homo_sapiens.GRCh37.75.dna_sm.all.fa /project2/gilad/briana/threeprimeseq/data/SignalEnrich_QTL/ -size 100 -h -bg /project2/gilad/briana/threeprimeseq/data/MatchedSnp/All_matched_snps_sort.bed -len 6
results dont mean much at the moment
/project2/gilad/briana/threeprimeseq/data/peaks4DT/APAPeaks_5percCov_fixedStrand_SigUsageNuc.bed
Make a shuffled file of this for background:
Do this interactively
import pybedtools
peaks= pybedtools.BedTool("/project2/gilad/briana/threeprimeseq/data/peaks4DT/APAPeaks_5percCov_fixedStrand_SigUsageNuc.bed")
peaks_shuf=peaks.shuffle(genome='hg19')
peaks_shuf.saveas("/project2/gilad/briana/threeprimeseq/data/SignalEnrich_nucSpec/shuffled_MoreSigUsageNuclear.bed")
sort -k1,1 -k2,2n /project2/gilad/briana/threeprimeseq/data/SignalEnrich_nucSpec/shuffled_MoreSigUsageNuclear.bed | sed 's/^chr//' > /project2/gilad/briana/threeprimeseq/data/SignalEnrich_nucSpec/shuffled_MoreSigUsageNuclear.sort.bed
signalSiteEnrich_nucSpec.sh
#!/bin/bash
#SBATCH --job-name=signalSiteEnrich_nucSpec
#SBATCH --account=pi-yangili1
#SBATCH --time=36:00:00
#SBATCH --output=signalSiteEnrich_nucSpec.out
#SBATCH --error=signalSiteEnrich_nucSpec.err
#SBATCH --partition=bigmem2
#SBATCH --mem=100G
#SBATCH --mail-type=END
source deactivate three-prime-env
module unload Anaconda3
findMotifsGenome.pl /project2/gilad/briana/threeprimeseq/data/peaks4DT/APAPeaks_5percCov_fixedStrand_SigUsageNuc.bed /project2/gilad/briana/genome_anotation_data/genome/Homo_sapiens.GRCh37.75.dna_sm.all.fa /project2/gilad/briana/threeprimeseq/data/SignalEnrich_nucSpec/ -size -300,100 -h -bg /project2/gilad/briana/threeprimeseq/data/SignalEnrich_nucSpec/shuffled_MoreSigUsageNuclear.sort.bed -len 6
Subset these by those in introns:
/project2/gilad/briana/threeprimeseq/data/mergedPeaks_noMP_GeneLoc/Filtered_APApeaks_merged_allchrom_noMP.sort.named.noCHR_geneLocParsed_withAnno.SAF
I want to subset only those in the /project2/gilad/briana/threeprimeseq/data/peaks4DT/APAPeaks_5percCov_fixedStrand_SigUsageNuc.bed file that have an intron annoations:
SigUsageNuc_subsetInron.py
anno=open("/project2/gilad/briana/threeprimeseq/data/mergedPeaks_noMP_GeneLoc/Filtered_APApeaks_merged_allchrom_noMP.sort.named.noCHR_geneLocParsed_withAnno.SAF","r")
inbed="/project2/gilad/briana/threeprimeseq/data/peaks4DT/APAPeaks_5percCov_fixedStrand_SigUsageNuc.bed"
outBed=open("/project2/gilad/briana/threeprimeseq/data/peaks4DT/APAPeaks_5percCov_fixedStrand_SigUsageNuc_Intron.bed","w")
#make dic with intron peaks
intron={}
for i,ln in enumerate(anno):
if i >0:
peak=ln.split()[0].split(":")[0]
loc=ln.split()[0].split(":")[-1]
if loc == "intron":
intron[peak]=""
for ln in open(inbed, "r"):
peak=ln.split()[3].split(":")[1]
if peak in intron.keys():
outBed.write(ln)
outBed.close()
Of these 519 are intron. I now need to do the shuff on these:
Do this interactively
import pybedtools
peaks= pybedtools.BedTool("/project2/gilad/briana/threeprimeseq/data/peaks4DT/APAPeaks_5percCov_fixedStrand_SigUsageNuc_Intron.bed")
peaks_shuf=peaks.shuffle(genome='hg19')
peaks_shuf.saveas("/project2/gilad/briana/threeprimeseq/data/SignalEnrich_nucSpecIntron/shuffled_MoreSigUsageNuclear_intron.bed")
sort -k1,1 -k2,2n /project2/gilad/briana/threeprimeseq/data/SignalEnrich_nucSpecIntron/shuffled_MoreSigUsageNuclear_intron.bed | sed 's/^chr//' > /project2/gilad/briana/threeprimeseq/data/SignalEnrich_nucSpecIntron/shuffled_MoreSigUsageNuclear_intron.sort.bed
signalSiteEnrich_nucSpec_intron.sh
#!/bin/bash
#SBATCH --job-name=signalSiteEnrich_nucSpec_intron
#SBATCH --account=pi-yangili1
#SBATCH --time=36:00:00
#SBATCH --output=signalSiteEnrich_nucSpec_intron.out
#SBATCH --error=signalSiteEnrich_nucSpec_intron.err
#SBATCH --partition=bigmem2
#SBATCH --mem=100G
#SBATCH --mail-type=END
source ~/.bash_profile
~/homer/bin/findMotifsGenome.pl /project2/gilad/briana/threeprimeseq/data/peaks4DT/APAPeaks_5percCov_fixedStrand_SigUsageNuc_Intron.bed /project2/gilad/briana/genome_anotation_data/genome/Homo_sapiens.GRCh37.75.dna_sm.all.fa /project2/gilad/briana/threeprimeseq/data/SignalEnrich_nucSpecIntron/ -size -300,100 -h -bg /project2/gilad/briana/threeprimeseq/data/SignalEnrich_nucSpecIntron/shuffled_MoreSigUsageNuclear_intron.sort.bed -len 6
I saw enrichment but now I need to work on location of the enrichment. I have the 50bp upstream of the PAS from earlier in this analysis. I can use bedtools nuc to get the actual sequences for these regions.
nuc50upPeaks.sh
#!/bin/bash
#SBATCH --job-name=nuc50upPeaks
#SBATCH --account=pi-yangili1
#SBATCH --time=8:00:00
#SBATCH --output=nuc50upPeaks.out
#SBATCH --error=nuc50upPeaks.err
#SBATCH --partition=broadwl
#SBATCH --mem=36G
#SBATCH --mail-type=END
module load Anaconda3
source activate three-prime-env
bedtools nuc -s -seq -fi /project2/gilad/briana/genome_anotation_data/genome/Homo_sapiens.GRCh37.75.dna_sm.all.fa -bed /project2/gilad/briana/threeprimeseq/data/SignalEnrich/fiftyup_APAPeaks_5percCov_fixedStrand.bed > /project2/gilad/briana/threeprimeseq/data/SignalEnrich/Seq_fiftyup_APAPeaks_5percCov_fixedStrand.bed
sessionInfo()
R version 3.5.1 (2018-07-02)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS 10.14.1
Matrix products: default
BLAS: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRblas.0.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRlapack.dylib
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
attached base packages:
[1] stats graphics grDevices utils datasets methods base
loaded via a namespace (and not attached):
[1] workflowr_1.2.0 Rcpp_0.12.19 lattice_0.20-35 digest_0.6.17
[5] rprojroot_1.3-2 grid_3.5.1 jsonlite_1.6 backports_1.1.2
[9] git2r_0.24.0 magrittr_1.5 evaluate_0.13 stringi_1.2.4
[13] fs_1.2.6 whisker_0.3-2 Matrix_1.2-14 reticulate_1.10
[17] rmarkdown_1.11 tools_3.5.1 stringr_1.4.0 glue_1.3.0
[21] yaml_2.2.0 compiler_3.5.1 htmltools_0.3.6 knitr_1.20