Last updated: 2018-07-25
workflowr checks: (Click a bullet for more information) ✔ R Markdown file: up-to-date
Great! Since the R Markdown file has been committed to the Git repository, you know the exact version of the code that produced these results.
✔ Environment: empty
Great job! The global environment was empty. Objects defined in the global environment can affect the analysis in your R Markdown file in unknown ways. For reproduciblity it’s best to always run the code in an empty environment.
✔ Seed:
set.seed(12345)
The command set.seed(12345)
was run prior to running the code in the R Markdown file. Setting a seed ensures that any results that rely on randomness, e.g. subsampling or permutations, are reproducible.
✔ Session information: recorded
Great job! Recording the operating system, R version, and package versions is critical for reproducibility.
✔ Repository version: 2d41f11
wflow_publish
or wflow_git_commit
). workflowr only checks the R Markdown file, but you know if there are other scripts or data files that it depends on. Below is the status of the Git repository when the results were generated:
Ignored files:
Ignored: .DS_Store
Ignored: .Rhistory
Ignored: .Rproj.user/
Ignored: output/.DS_Store
Untracked files:
Untracked: data/18486.genecov.txt
Untracked: data/APApeaksYL.total.inbrain.bed
Untracked: data/YL-SP-18486-T_S9_R1_001-genecov.txt
Untracked: data/bedgraph_peaks/
Untracked: data/bin200.5.T.nuccov.bed
Untracked: data/bin200.Anuccov.bed
Untracked: data/bin200.nuccov.bed
Untracked: data/gene_cov/
Untracked: data/leafcutter/
Untracked: data/nuc6up/
Untracked: data/reads_mapped_three_prime_seq.csv
Untracked: data/smash.cov.results.bed
Untracked: data/smash.cov.results.csv
Untracked: data/smash.cov.results.txt
Untracked: data/smash_testregion/
Untracked: data/ssFC200.cov.bed
Untracked: output/picard/
Untracked: output/plots/
Untracked: output/qual.fig2.pdf
Unstaged changes:
Modified: analysis/dif.iso.usage.leafcutter.Rmd
Modified: analysis/explore.filters.Rmd
Modified: analysis/test.max2.Rmd
Modified: code/Snakefile
Note that any generated files, e.g. HTML, png, CSS, etc., are not included in this status report because it is ok for generated content to have uncommitted changes.
File | Version | Author | Date | Message |
---|---|---|---|---|
Rmd | 2d41f11 | Briana Mittleman | 2018-07-25 | add expand smash to index |
I want to run smash on a whole chromosome to see what regions I can do it on to get the whole genome. First I am going to try chromosome 22.
I want to create a bedgraph for the combined nuclear and total files.
/project2/gilad/briana/threeprimeseq/data/macs2/TotalBamFiles.sort.bam
/project2/gilad/briana/threeprimeseq/data/macs2/NuclearBamFiles.bam
#!/bin/bash
#SBATCH --job-name=5gencov_comb
#SBATCH --account=pi-yangili1
#SBATCH --time=24:00:00
#SBATCH --output=5gencov_comb.out
#SBATCH --error=5gencov_com.err
#SBATCH --partition=broadwl
#SBATCH --mem=40G
#SBATCH --mail-type=END
module load Anaconda3
source activate three-prime-env
bedtools genomecov -d -5 -ibam /project2/gilad/briana/threeprimeseq/data/macs2/TotalBamFiles.sort.bam > /project2/gilad/briana/threeprimeseq/data/genomecov/gencov5prime.combinedTotal.bed
bedtools genomecov -d -5 -ibam /project2/gilad/briana/threeprimeseq/data/macs2/NuclearBamFiles.sort.bam > /project2/gilad/briana/threeprimeseq/data/genomecov/gencov5prime.combinedNuclear.bed
I now need to subset the bed files to chr22.
#!/bin/bash
#SBATCH --job-name=subset22
#SBATCH --account=pi-yangili1
#SBATCH --time=24:00:00
#SBATCH --output=subset22.out
#SBATCH --error=subset22.err
#SBATCH --partition=broadwl
#SBATCH --mem=40G
#SBATCH --mail-type=END
module load Anaconda3
source activate three-prime-env
awk '$1==22 {print}' /project2/gilad/briana/threeprimeseq/data/genomecov/gencov5prime.combinedNuclear.bed > /project2/gilad/briana/threeprimeseq/data/genomecov_chr22/gencov5prime.combinedNuclear.chr22.bed
awk '$1==22 {print}' /project2/gilad/briana/threeprimeseq/data/genomecov/gencov5prime.combinedTotal.bed > /project2/gilad/briana/threeprimeseq/data/genomecov_chr22/gencov5prime.combinedTotal.chr22.bed
Chromosome 22 is 51304566 bases. I need this to satisfy the \(2^{x}\) criteria. I can use the log rule, \(log_{2}length=x\)
log2(51304566)
[1] 25.61258
2^26
[1] 67108864
zeros_to_add=67108864 -51304566
I will use 2^26, 67108864. This means I need to add 1.580429810^{7} 0s to the matrix. I can do this by making a matrix with the correct number of zeros and row binding it.
zero_matrix=matrix(0.1, zeros_to_add)
I will write and R script that I can run on the cluster. The file will take in the genomecoverage file and will output the graph and the smash results in a bedgraph format.
#!/bin/rscripts
# usage: ./run_smash.R gencoverage, outfile_plot, outfile_bedgraph
#this script takes the genomecov file and a name for an output plot (.png) and an output bedgraph (.bg)
#use optparse for management of input arguments I want to be able to imput the 6up nuc file and write out a filter file
#script needs to run outside of conda env. should module load R in bash script when I submit it
library(optparse)
library(dplyr)
library(tidyr)
library(ggplot2)
library(scales)
library(smashr)
option_list = list(
make_option(c("-f", "--file"), action="store", default=NA, type='character',
help="input file"),
make_option(c("-p", "--plot"), action="store", default=NA, type='character',
help="output plot file"),
make_option(c("-o", "--output"), action="store", default=NA, type='character',
help="output file")
)
opt_parser <- OptionParser(option_list=option_list)
opt <- parse_args(opt_parser)
#interrupt execution if no file is supplied
if (is.null(opt$file)){
print_help(opt_parser)
stop("Need input file", call.=FALSE)
}
#Functions for this script
criteria=function(x){
#function takes a number and makes the matrix with 0s
exp=log2(x)
exp_round=ceiling(exp)
zerosadd= 2^exp_round - x
seq_0=rep(0, zerosadd)
return(seq_0)
}
#import bedgraph
names=c("Chr", "Pos", "Count")
cov=read.table(file = opt$file, col.names = names)
chromosome=cov[1,1]
#prepare data by adding 0s
zero_seq=criteria(nrow(cov))
data_vec=as.vector(cov$Count)
data_zero_vec=c(data_vec, zero_seq)
data_zero_matrix=matrix(data_zero_vec, 1, length(data_zero_vec))
#run smash
smash_res=smash.poiss(data_zero_matrix[1,],post.var=TRUE)
#create and save plot
pos=1:length(data_vec)
png(opt$plot)
finalplot=plot(pos,smash_res$est[1:length(data_vec)],type='l',xlab="position",ylab="intensity", main="SMASH results")
dev.off()
#create bedgraph and write it out
bedgraph=cbind(lapply(cov$Chr, function(x) paste("chr", x, sep="")), cov$Pos, cov$Pos + 1, smash_res$est[1:length(data_vec)])
write.table(bedgraph, file=opt$output, quote = F, row.names = F, col.names = F)
To run this I will have to create a batch script similar to the following.
#!/bin/bash
#SBATCH --job-name=run.run_smash
#SBATCH --account=pi-yangili1
#SBATCH --time=8:00:00
#SBATCH --output=run_runsmash.out
#SBATCH --error=run_runsmash.err
#SBATCH --partition=broadwl
#SBATCH --mem=20G
#SBATCH --mail-type=END
module load R
Rscript run_smash.R -f /project2/gilad/briana/threeprimeseq/data/genomecov_chr22/gencov5prime.combinedTotal.chr22.bed -p /project2/gilad/briana/threeprimeseq/data/smash.chr22/smooth.combinedTotal.chr22.png -o /project2/gilad/briana/threeprimeseq/data/smash.chr22/smooth.combinedTotal.chr22.bg
sessionInfo()
R version 3.5.1 (2018-07-02)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS Sierra 10.12.6
Matrix products: default
BLAS: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRblas.0.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRlapack.dylib
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
attached base packages:
[1] stats graphics grDevices utils datasets methods base
loaded via a namespace (and not attached):
[1] workflowr_1.1.1 Rcpp_0.12.18 digest_0.6.15
[4] rprojroot_1.3-2 R.methodsS3_1.7.1 backports_1.1.2
[7] git2r_0.23.0 magrittr_1.5 evaluate_0.11
[10] stringi_1.2.4 whisker_0.3-2 R.oo_1.22.0
[13] R.utils_2.6.0 rmarkdown_1.10 tools_3.5.1
[16] stringr_1.3.1 yaml_2.1.19 compiler_3.5.1
[19] htmltools_0.3.6 knitr_1.20
This reproducible R Markdown analysis was created with workflowr 1.1.1