Last updated: 2018-09-24
I need to create a processing pipeline that I can run each time I get more individuals. It will do the following:

1. Combine all total and nuclear libraries (as a bigwig/genome coverage)
2. Call peaks with Yang's script
3. Filter peaks with Yang's script
4. Clean peaks
5. Run featureCounts on these peaks for all of the individuals

I can do this in my Snakefile. First, I added the necessary tools to my environment.
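The exact additions aren't recorded here, but the tools invoked below are bedtools, bedGraphToBigWig, bigWigMerge, and featureCounts, so a plausible way to add them to three-prime-env (the bioconda package names are an assumption on my part):

```bash
# Assumed bioconda package names for the tools used in this pipeline.
conda install -n three-prime-env -c bioconda \
    bedtools ucsc-bedgraphtobigwig ucsc-bigwigmerge subread
```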
I want to create a bedgraph for each file. I will add a rule to my Snakefile that does this and puts the output in the bedgraph directory.
I want to request more memory for these rules in the cluster.json:

```json
"bedgraph" :
    {
        "mem" : 16000
    },
"bedgraph_5" :
    {
        "mem" : 16000
    }
```
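For reference, a cluster.json like this is consumed when Snakemake is invoked with a cluster command. The project's actual submit wrapper isn't shown here, so the invocation below is a sketch under that assumption:

```bash
# Hypothetical invocation; adjust -j and the sbatch flags to the cluster.
snakemake -j 8 \
    --cluster-config cluster.json \
    --cluster "sbatch --partition=broadwl --account=pi-yangili1 --mem={cluster.mem}"
```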
I will use the bedGraphToBigWig tool. I added the following to my Snakefile:

```python
# Add to the directory definitions
dir_bedgraph = dir_data + "bedgraph/"
dir_bigwig = dir_data + "bigwig/"
dir_sortbg = dir_data + "bedgraph_sort/"
dir_bedgraph_5 = dir_data + "bedgraph_5prime/"

# Add to rule all (entries are comma-separated)
expand(dir_bedgraph + "{samples}.split.bg", samples=samples),
expand(dir_sortbg + "{samples}.sort.bg", samples=samples),
expand(dir_bigwig + "{samples}.bw", samples=samples),
expand(dir_bedgraph_5 + "{samples}.5.bg", samples=samples)

# Rules
rule bedgraph_5:
    input:
        bam = dir_sort + "{samples}-sort.bam"
    output: dir_bedgraph_5 + "{samples}.5.bg"
    shell: "bedtools genomecov -ibam {input.bam} -bg -5 > {output}"

rule bedgraph:
    input:
        bam = dir_sort + "{samples}-sort.bam"
    output: dir_bedgraph + "{samples}.split.bg"
    shell: "bedtools genomecov -ibam {input.bam} -bg -split > {output}"

rule sort_bg:
    input: dir_bedgraph + "{samples}.split.bg"
    output: dir_sortbg + "{samples}.sort.bg"
    shell: "sort -k1,1 -k2,2n {input} > {output}"

rule bg_to_bw:
    input:
        bg = dir_sortbg + "{samples}.sort.bg",  # comma needed between the two inputs
        len = chrom_length
    output: dir_bigwig + "{samples}.bw"
    shell: "bedGraphToBigWig {input.bg} {input.len} {output}"
```

The next step will take all of the files in the bigwig directory and merge them. To do this I will create a script that builds a list of all of the files and then passes that list to the merge tool.
mergeBW.sh:

```bash
#!/bin/bash
#SBATCH --job-name=mergeBW
#SBATCH --account=pi-yangili1
#SBATCH --time=24:00:00
#SBATCH --output=mergeBW.out
#SBATCH --error=mergeBW.err
#SBATCH --partition=broadwl
#SBATCH --mem=40G
#SBATCH --mail-type=END

module load Anaconda3
source activate three-prime-env

# Note: tail -n +2 drops the first file in the listing from the merge list.
ls -d -1 /project2/gilad/briana/threeprimeseq/data/bigwig/* | tail -n +2 > /project2/gilad/briana/threeprimeseq/data/list_bw/list_of_bigwig.txt

bigWigMerge -inList /project2/gilad/briana/threeprimeseq/data/list_bw/list_of_bigwig.txt /project2/gilad/briana/threeprimeseq/data/mergedBW/merged_combined_YL-SP-threeprimeseq.bg
```
The result of this script will be a merged bedgraph of all of the files.
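bigWigMerge writes four columns (chromosome, start, end, summed coverage), so the merged file can be spot-checked with:

```bash
head -n 5 /project2/gilad/briana/threeprimeseq/data/mergedBW/merged_combined_YL-SP-threeprimeseq.bg
```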
```r
library(workflowr)
library(ggplot2)
library(dplyr)
```

bg_to_cov.py expands the merged bedgraph into per-base coverage, writing one line per position:

```python
#!/usr/bin/env python
def main(inFile, outFile):
    fout = open(outFile,'w')
    for ind,ln in enumerate(open(inFile)):
      print(ind)
      chrom, start, end, count = ln.split()
      i2=int(start)
      while i2 < int(end):
        fout.write("%s\t%d\t%s\n"%(chrom, i2 + 1, count))
        fout.flush()
        i2 += 1
    fout.close()    
    
if __name__ == "__main__":
    import sys
    inFile = sys.argv[1]
    outFile = sys.argv[2]
    main(inFile, outFile)
```

I run this from a bash script; the input and output files are passed to the python script as arguments:

```bash
#!/bin/bash
#SBATCH --job-name=run_bgtocov
#SBATCH --account=pi-yangili1
#SBATCH --time=24:00:00
#SBATCH --output=run_bgtocov.out
#SBATCH --error=run_bgtocov.err
#SBATCH --partition=broadwl
#SBATCH --mem=12G
#SBATCH --mail-type=END
module load Anaconda3
source activate three-prime-env 
python bg_to_cov.py "/project2/gilad/briana/threeprimeseq/data/mergedBW/merged_combined_YL-SP-threeprimeseq.bg" "/project2/gilad/briana/threeprimeseq/data/mergedBW/merged_combined_YL-SP-threeprimeseq.coverage.txt"
```

Sort the result with:

```bash
sort -k1,1 -k2,2n merged_combined_YL-SP-threeprimeseq.coverage.txt > merged_combined_YL-SP-threeprimeseq.coverage.sort.txt
```
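To see what this transformation does, a tiny illustrative run (the file names here are made up):

```bash
printf "1\t100\t103\t5\n" > tiny.bg
python bg_to_cov.py tiny.bg tiny.cov.txt
# (the script also prints each input line index as progress output)
cat tiny.cov.txt
# 1    101    5
# 1    102    5
# 1    103    5
```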
Peaks are then called from the sorted coverage, one chromosome at a time, with Yang's script callPeaksYL_GEN.py:

```python
def main(inFile, outFile, ctarget):
    fout = open(outFile,'w')
    mincount = 10
    ov = 20
    current_peak = []
    
    currentChrom = None
    prevPos = 0
    for ln in open(inFile):
        chrom, pos, count = ln.split()
        if chrom != ctarget: continue
        count = float(count)
        if currentChrom == None:
            currentChrom = chrom
            
        if count == 0 or currentChrom != chrom or int(pos) > prevPos + 1:
            if len(current_peak) > 0:
                print (current_peak)
                M = max([x[1] for x in current_peak])
                if M > mincount:
                    all_peaks = refine_peak(current_peak, M, M*0.1,M*0.05)
                    #refined_peaks = [(x[0][0],x[-1][0], np.mean([y[1] for y in x])) for x in all_peaks]  
                    rpeaks = [(int(x[0][0])-ov,int(x[-1][0])+ov, np.mean([y[1] for y in x])) for x in all_peaks]
                    if len(rpeaks) > 1:
                        for clu in cluster_intervals(rpeaks)[0]:
                            M = max([x[2] for x in clu])
                            merging = []
                            for x in clu:
                                if x[2] > M *0.5:
                                    #print x, M
                                    merging.append(x)
                            c, s,e,mean =  chrom, min([x[0] for x in merging])+ov, max([x[1] for x in merging])-ov, np.mean([x[2] for x in merging])
                            #print c,s,e,mean
                            fout.write("chr%s\t%d\t%d\t%d\t+\t.\n"%(c,s,e,mean))
                            fout.flush()
                    elif len(rpeaks) == 1:
                        s,e,mean = rpeaks[0]
                        fout.write("chr%s\t%d\t%d\t%f\t+\t.\n"%(chrom,s+ov,e-ov,mean))
                        print("chr%s"%chrom+"\t%d\t%d\t%f\t+\t.\n"%rpeaks[0])
                    #print refined_peaks
            current_peak = [(pos,count)]
        else:
            current_peak.append((pos,count))
        currentChrom = chrom
        prevPos = int(pos)
def refine_peak(current_peak, M, thresh, noise, minpeaksize=30):
    
    cpeak = []
    opeak = []
    allcpeaks = []
    allopeaks = []
    for pos, count in current_peak:
        if count > thresh:
            cpeak.append((pos,count))
            opeak = []
            continue
        elif count > noise: 
            opeak.append((pos,count))
        else:
            if len(opeak) > minpeaksize:
                allopeaks.append(opeak) 
            opeak = []
        if len(cpeak) > minpeaksize:
            allcpeaks.append(cpeak)
            cpeak = []
        
    if len(cpeak) > minpeaksize:
        allcpeaks.append(cpeak)
    if len(opeak) > minpeaksize:
        allopeaks.append(opeak)
    allpeaks = allcpeaks
    for opeak in allopeaks:
        M = max([x[1] for x in opeak])
        allpeaks += refine_peak(opeak, M, M*0.3, noise)
    #print [(x[0],x[-1]) for x in allcpeaks], [(x[0],x[-1]) for x in allopeaks], [(x[0],x[-1]) for x in allpeaks]
    #print '---\n'
    return(allpeaks)
if __name__ == "__main__":
    import numpy as np
    from misc_helper import *
    import sys
    chrom = sys.argv[1]
    inFile = "/project2/gilad/briana/threeprimeseq/data/mergedBW/merged_combined_YL-SP-threeprimeseq.coverage.sort.txt" # "/project2/yangili1/threeprimeseq/gencov/TotalBamFiles.split.genomecov.bed"
    outFile = "/project2/gilad/briana/threeprimeseq/data/mergedPeaks/APApeaks_chr%s.bed"%chrom
    main(inFile, outFile, chrom)
```

w_getpeakYLGEN.sh runs the peak caller over each autosome:

```bash
#!/bin/bash
#SBATCH --job-name=w_getpeakYLgen
#SBATCH --account=pi-yangili1
#SBATCH --time=24:00:00
#SBATCH --output=w_getpeakYLgen.out
#SBATCH --error=w_getpeakYLgen.err
#SBATCH --partition=broadwl
#SBATCH --mem=12G
#SBATCH --mail-type=END
module load Anaconda3
source activate three-prime-env
for i in $(seq 1 22); do 
  python callPeaksYL_GEN.py $i
done
```

Run the file with `sbatch w_getpeakYLGEN.sh`.
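Once the per-chromosome jobs finish, a quick check that every chromosome produced a peak file of reasonable size:

```bash
wc -l /project2/gilad/briana/threeprimeseq/data/mergedPeaks/APApeaks_chr*.bed
```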
After I have the peaks I will need to use Yang's filter peak function.
Update each of the following scripts:
First, combine the per-chromosome peak files:

```bash
cat /project2/gilad/briana/threeprimeseq/data/mergedPeaks/*.bed > /project2/gilad/briana/threeprimeseq/data/mergedPeaks_comb/APApeaks_merged_allchrom.bed
```

bed2saf.py converts the merged bed into SAF format for featureCounts:

```python
fout = file("/project2/gilad/briana/threeprimeseq/data/mergedPeaks_comb/APApeaks_merged_allchrom.SAF",'w')
fout.write("GeneID\tChr\tStart\tEnd\tStrand\n")
for ln in open("/project2/gilad/briana/threeprimeseq/data/mergedPeaks_comb/APApeaks_merged_allchrom.bed"):
    chrom, start, end, score, strand, score2 = ln.split()
    ID = "peak_%s_%s_%s"%(chrom,start, end)
    fout.write("%s\t%s\t%s\t%s\t+\n"%(ID+"_+", chrom.replace("chr",""), start, end))
    fout.write("%s\t%s\t%s\t%s\t-\n"%(ID+"_-", chrom.replace("chr",""), start, end))
fout.close()
```

Run this with run_bed2saf.sh. I did it this way because I need to load python 2 rather than using the conda environment.
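The SAF file has one line per peak and strand, in featureCounts' five-column SAF layout; illustrative first lines (coordinates are made up):

```bash
head -n 3 /project2/gilad/briana/threeprimeseq/data/mergedPeaks_comb/APApeaks_merged_allchrom.SAF
# GeneID                 Chr  Start  End   Strand
# peak_chr1_1000_1200_+  1    1000   1200  +
# peak_chr1_1000_1200_-  1    1000   1200  -
```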
peak_fc.sh then quantifies every sorted bam over these peaks with featureCounts:

```bash
#!/bin/bash
#SBATCH --job-name=peak_fc
#SBATCH --account=pi-yangili1
#SBATCH --time=24:00:00
#SBATCH --output=peak_fc.out
#SBATCH --error=peak_fc.err
#SBATCH --partition=broadwl
#SBATCH --mem=12G
#SBATCH --mail-type=END
module load Anaconda3
source activate three-prime-env
featureCounts -a /project2/gilad/briana/threeprimeseq/data/mergedPeaks_comb/APApeaks_merged_allchrom.SAF -F SAF -o /project2/gilad/briana/threeprimeseq/data/mergedPeaks_comb/APAquant.fc /project2/gilad/briana/threeprimeseq/data/sort/*-sort.bam -s 1
```
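featureCounts also writes an assignment summary next to the counts file; it is worth checking how many reads were assigned to peaks:

```bash
cat /project2/gilad/briana/threeprimeseq/data/mergedPeaks_comb/APAquant.fc.summary
```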
Next comes Yang's filter_peaks.py. I should run this in a bash script with python 2 as well (run_filterPeak.sh):

```bash
#!/bin/bash
#SBATCH --job-name=filter_peak
#SBATCH --account=pi-yangili1
#SBATCH --time=24:00:00
#SBATCH --output=filter_peak.out
#SBATCH --error=filter_peak.err
#SBATCH --partition=broadwl
#SBATCH --mem=12G
#SBATCH --mail-type=END
module load python  
python filter_peaks.py
```

Name the peaks for the cleanup:

```bash
x=$(wc -l < /project2/gilad/briana/threeprimeseq/data/mergedPeaks_comb/filtered_APApeaks_merged_allchrom.bed)
seq 1 $x > peak.num.txt
paste /project2/gilad/briana/threeprimeseq/data/mergedPeaks_comb/filtered_APApeaks_merged_allchrom.bed peak.num.txt | column -s $'\t' -t > temp
awk '{print $1 "\t" $2 "\t" $3 "\t" $7 "\t" $4 "\t" $5 "\t" $6}' temp > /project2/gilad/briana/threeprimeseq/data/mergedPeaks_comb/filtered_APApeaks_merged_allchrom.named.bed
```

cleanupdtseq.R classifies the putative peaks:

```r
#!/usr/bin/env Rscript
# usage: ./cleanupdtseq.R in_bedfile outfile cutoff
# This script takes a putative peak file, an output file name, and a cutoff for
# classification, and writes out the file with all of the sequences classified.
# optparse manages the input arguments: I want to be able to input the 6-up nuc
# file and write out a filtered file.
# This script needs to run outside of the conda env; module load R in the bash
# script when I submit it.
library(optparse)
library(dplyr)
library(tidyr)
library(ggplot2)
library(cleanUpdTSeq)
library(GenomicRanges)
library(BSgenome.Hsapiens.UCSC.hg19)
option_list = list(
  make_option(c("-f", "--file"), action="store", default=NA, type='character',
              help="input file"),
  make_option(c("-o", "--output"), action="store", default=NA, type='character',
              help="output file"),
  make_option(c("-c", "--cutoff"), action="store", default=NA, type='double',
              help="assignment cuttoff")
)
  
opt_parser <- OptionParser(option_list=option_list)
opt <- parse_args(opt_parser)
# Interrupt execution if no input file is supplied
# (optparse's default here is NA, not NULL, so test with is.na)
if (is.na(opt$file)){
  print_help(opt_parser)
  stop("Need input file", call.=FALSE)
}
# Input file for the test data
testSet <- read.table(file = opt$file, sep="\t", col.names =c("chr", "start", "end", "PeakName", "Cov", "Strand", "score"))
peaks <- BED2GRangesSeq(testSet, withSeq=FALSE)
# Build feature vectors against the human genome
testSet.NaiveBayes <- buildFeatureVector(peaks, BSgenomeName=Hsapiens,
                                         upstream=40, downstream=30, 
                                         wordSize=6, alphabet=c("ACGT"),
                                         sampleType="unknown", 
                                         replaceNAdistance=30, 
                                         method="NaiveBayes",
                                         ZeroBasedIndex=1, fetchSeq=TRUE)
# Classify sites with the built-in classifier
data(classifier)
testResults <- predictTestSet(testSet.NaiveBayes=testSet.NaiveBayes,
                              classifier=classifier,
                              outputFile=NULL, 
                              assignmentCutoff=opt$cutoff)
true_peaks=testResults %>% filter(pred.class==1)
# Write results
write.table(true_peaks, file=opt$output, quote = F, row.names = F, col.names = T)
```

I will create a bash script, cleanup_comb.sh, to run the cleanupdtseq.R code:

```bash
#!/bin/bash
#SBATCH --job-name=cleanup_comb
#SBATCH --account=pi-yangili1
#SBATCH --time=24:00:00
#SBATCH --output=cleanup_comb.out
#SBATCH --error=cleanup_comb.err
#SBATCH --partition=broadwl
#SBATCH --mem=12G
#SBATCH --mail-type=END
module load R
Rscript cleanupdtseq.R -f /project2/gilad/briana/threeprimeseq/data/mergedPeaks_comb/filtered_APApeaks_merged_allchrom.named.bed -o /project2/gilad/briana/threeprimeseq/data/clean.peaks_comb/truePeaks_clean.bed -c .5
```
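Based on the column layout that filter_peaksClean.R reads below (predicted class in column 4, one header line), the number of peaks classified as true can be checked with:

```bash
awk 'NR > 1 && $4 == 1' /project2/gilad/briana/threeprimeseq/data/clean.peaks_comb/truePeaks_clean.bed | wc -l
```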
After this, filter_peaksClean.R keeps only the peaks classified as true; run it with run_filter_peaksClean.sh.

```r
library(dplyr)
clean=read.table("/project2/gilad/briana/threeprimeseq/data/clean.peaks_comb/truePeaks_clean.bed", header=F, col.names=c("PeakName", "probFalse", "probTrue", "predClass", "UP", "Down"), skip=1)
peaks=read.table("/project2/gilad/briana/threeprimeseq/data/mergedPeaks_comb/filtered_APApeaks_merged_allchrom.named.bed", header=F, col.names=c("Chr", "Start", "End", "PeakName", "Cov", "Strand", "Score"))
  
true_peaks=clean %>% filter(predClass==1) 
# Join on true_peaks (not clean), so only peaks classified as true are kept
true_peak_bed=semi_join(peaks, true_peaks, by="PeakName")
write.table(true_peak_bed, file="/project2/gilad/briana/threeprimeseq/data/clean.peaks_comb/APApeaks_combined_clean.bed", row.names = F, col.names = F, quote = F)
```

run_filter_peaksClean.sh:

```bash
#!/bin/bash
#SBATCH --job-name=filter_clean
#SBATCH --account=pi-yangili1
#SBATCH --time=24:00:00
#SBATCH --output=filter_clean.out
#SBATCH --error=filter_clean.err
#SBATCH --partition=broadwl
#SBATCH --mem=12G
#SBATCH --mail-type=END
module load R
Rscript filter_peaksClean.R
```

I may have to run bed-to-SAF again on the cleaned peaks, with bed2saf.peaks.py:

```python
from misc_helper import *
fout = file("/project2/gilad/briana/threeprimeseq/data/clean.peaks_comb/APApeaks_combined_clean.saf",'w')
fout.write("GeneID\tChr\tStart\tEnd\tStrand\n")
for ln in open("/project2/gilad/briana/threeprimeseq/data/clean.peaks_comb/APApeaks_combined_clean.bed"):
    chrom, start, end, name, score, strand, score2 = ln.split()
    ID = "peak_%s_%s_%s"%(chrom,start, end)
    fout.write("%s\t%s\t%s\t%s\t+\n"%(ID+"_+", chrom.replace("chr",""), start, end))
    fout.write("%s\t%s\t%s\t%s\t-\n"%(ID+"_-", chrom.replace("chr",""), start, end))
fout.close()
```

run_bed2saf_peaks.sh:

```bash
#!/bin/bash
#SBATCH --job-name=bed2saf_peaks
#SBATCH --account=pi-yangili1
#SBATCH --time=24:00:00
#SBATCH --output=bed2saf_peak.out
#SBATCH --error=bed2saf_peak.err
#SBATCH --partition=broadwl
#SBATCH --mem=12G
#SBATCH --mail-type=END
module load python
python bed2saf.peaks.py
```

Finally, clean_peak_fc.sh runs featureCounts on the cleaned peaks:

```bash
#!/bin/bash
#SBATCH --job-name=clean_peak_fc
#SBATCH --account=pi-yangili1
#SBATCH --time=24:00:00
#SBATCH --output=clean_peak_fc.out
#SBATCH --error=clean_peak_fc.err
#SBATCH --partition=broadwl
#SBATCH --mem=12G
#SBATCH --mail-type=END
module load Anaconda3
source activate three-prime-env
featureCounts -a /project2/gilad/briana/threeprimeseq/data/clean.peaks_comb/APApeaks_combined_clean.saf -F SAF -o /project2/gilad/briana/threeprimeseq/data/clean_peaks_comb_quant/APAquant.fc.cleanpeaks.fc /project2/gilad/briana/threeprimeseq/data/sort/*-sort.bam -s 1
```

I should make this a Snakefile! The full order of operations is:
1. `mergeBW.sh`
2. `run_bgtocov.sh`
3. Sort the coverage:

   ```bash
   sort -k1,1 -k2,2n /project2/gilad/briana/threeprimeseq/data/mergedBW/merged_combined_YL-SP-threeprimeseq.coverage.txt > /project2/gilad/briana/threeprimeseq/data/mergedBW/merged_combined_YL-SP-threeprimeseq.coverage.sort.txt
   ```

4. `w_getpeakYLGEN.sh`
5. Combine the per-chromosome peaks:

   ```bash
   cat /project2/gilad/briana/threeprimeseq/data/mergedPeaks/*.bed > /project2/gilad/briana/threeprimeseq/data/mergedPeaks_comb/APApeaks_merged_allchrom.bed
   ```

6. `run_bed2saf.sh`
7. `peak_fc.sh`
8. `run_filterPeak.sh`
9. Name the peaks:

   ```bash
   x=$(wc -l < /project2/gilad/briana/threeprimeseq/data/mergedPeaks_comb/filtered_APApeaks_merged_allchrom.bed)
   seq 1 $x > /project2/gilad/briana/threeprimeseq/data/mergedPeaks_comb/peak.num.txt
   paste /project2/gilad/briana/threeprimeseq/data/mergedPeaks_comb/filtered_APApeaks_merged_allchrom.bed /project2/gilad/briana/threeprimeseq/data/mergedPeaks_comb/peak.num.txt | column -s $'\t' -t > /project2/gilad/briana/threeprimeseq/data/mergedPeaks_comb/temp
   awk '{print $1 "\t" $2 "\t" $3 "\t" $7 "\t" $4 "\t" $5 "\t" $6}' /project2/gilad/briana/threeprimeseq/data/mergedPeaks_comb/temp > /project2/gilad/briana/threeprimeseq/data/mergedPeaks_comb/filtered_APApeaks_merged_allchrom.named.bed
   ```

10. To clean the peaks:
    - `cleanup_comb.sh`
    - `run_filter_peaksClean.sh`
    - `run_bed2saf_peaks.sh`
    - `clean_peak_fc.sh`

A sketch of a wrapper that chains these steps appears after this list.
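Since these steps are still launched by hand, here is a minimal sketch of a driver that submits them in order with SLURM job dependencies. The wrapper itself is hypothetical, as are sort_cov.sh and name_peaks.sh (stand-ins for the inline sort and peak-naming commands above); the other script names are the ones from this page.

```bash
#!/bin/bash
# run_pipeline.sh -- hypothetical driver; each job waits for the previous one.
set -e
jid=$(sbatch --parsable mergeBW.sh)
jid=$(sbatch --parsable --dependency=afterok:$jid run_bgtocov.sh)
jid=$(sbatch --parsable --dependency=afterok:$jid sort_cov.sh)      # assumed wrapper for the sort step
jid=$(sbatch --parsable --dependency=afterok:$jid w_getpeakYLGEN.sh)
jid=$(sbatch --parsable --dependency=afterok:$jid run_bed2saf.sh)   # the cat step above must run first
jid=$(sbatch --parsable --dependency=afterok:$jid peak_fc.sh)
jid=$(sbatch --parsable --dependency=afterok:$jid run_filterPeak.sh)
jid=$(sbatch --parsable --dependency=afterok:$jid name_peaks.sh)    # assumed wrapper for the naming step
jid=$(sbatch --parsable --dependency=afterok:$jid cleanup_comb.sh)
jid=$(sbatch --parsable --dependency=afterok:$jid run_filter_peaksClean.sh)
jid=$(sbatch --parsable --dependency=afterok:$jid run_bed2saf_peaks.sh)
sbatch --dependency=afterok:$jid clean_peak_fc.sh
```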
Alternatively, comb_gencov.sh merges all of the sorted bam files and computes per-base genome coverage in one step:

```bash
#!/bin/bash
#SBATCH --job-name=comb_gencov
#SBATCH --account=pi-yangili1
#SBATCH --time=24:00:00
#SBATCH --output=comb_gencov.out
#SBATCH --error=comb_gencov.err
#SBATCH --partition=bigmem2
#SBATCH --mem=100G
#SBATCH --mail-type=END
module load Anaconda3
source activate three-prime-env 
samtools merge /project2/gilad/briana/threeprimeseq/data/comb_bam/all_total.nuc_comb.bam  /project2/gilad/briana/threeprimeseq/data/sort/*.bam
bedtools genomecov -ibam /project2/gilad/briana/threeprimeseq/data/comb_bam/all_total.nuc_comb.bam -d -split > /project2/gilad/briana/threeprimeseq/data/comb_bam/all_total.nuc_comb.split.genomecov.bed
```

I will need to run mergeBW.sh and run_bgtocov.sh, then sort with:

```bash
sort -k1,1 -k2,2n merged_combined_YL-SP-threeprimeseq.coverage.txt > merged_combined_YL-SP-threeprimeseq.coverage.sort.txt
```

Then I call peaks with the updated callpeaks script from Yang (get_APA_peaks.py), which I run with w_getpeakYLGEN.sh.
sessionInfo()

```
R version 3.5.1 (2018-07-02)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS Sierra 10.12.6
Matrix products: default
BLAS: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRblas.0.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRlapack.dylib
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     
other attached packages:
[1] dplyr_0.7.6     ggplot2_3.0.0   workflowr_1.1.1
loaded via a namespace (and not attached):
 [1] Rcpp_0.12.18      compiler_3.5.1    pillar_1.3.0     
 [4] git2r_0.23.0      plyr_1.8.4        bindr_0.1.1      
 [7] R.methodsS3_1.7.1 R.utils_2.7.0     tools_3.5.1      
[10] digest_0.6.16     evaluate_0.11     tibble_1.4.2     
[13] gtable_0.2.0      pkgconfig_2.0.2   rlang_0.2.2      
[16] yaml_2.2.0        bindrcpp_0.2.2    withr_2.1.2      
[19] stringr_1.3.1     knitr_1.20        rprojroot_1.3-2  
[22] grid_3.5.1        tidyselect_0.2.4  glue_1.3.0       
[25] R6_2.2.2          rmarkdown_1.10    purrr_0.2.5      
[28] magrittr_1.5      whisker_0.3-2     backports_1.1.2  
[31] scales_1.0.0      htmltools_0.3.6   assertthat_0.2.0 
[34] colorspace_1.3-2  stringi_1.2.4     lazyeval_0.2.1   
[37] munsell_0.5.0     crayon_1.3.4      R.oo_1.22.0
```
This reproducible R Markdown analysis was created with workflowr 1.1.1