Last updated: 2018-07-25
workflowr checks: (Click a bullet for more information) ✔ R Markdown file: up-to-date
Great! Since the R Markdown file has been committed to the Git repository, you know the exact version of the code that produced these results.
✔ Environment: empty
Great job! The global environment was empty. Objects defined in the global environment can affect the analysis in your R Markdown file in unknown ways. For reproduciblity it’s best to always run the code in an empty environment.
✔ Seed:
set.seed(12345)
The command set.seed(12345)
was run prior to running the code in the R Markdown file. Setting a seed ensures that any results that rely on randomness, e.g. subsampling or permutations, are reproducible.
✔ Session information: recorded
Great job! Recording the operating system, R version, and package versions is critical for reproducibility.
✔ Repository version: d8394a3
wflow_publish
or wflow_git_commit
). workflowr only checks the R Markdown file, but you know if there are other scripts or data files that it depends on. Below is the status of the Git repository when the results were generated:
Ignored files:
Ignored: .DS_Store
Ignored: .Rhistory
Ignored: .Rproj.user/
Ignored: output/.DS_Store
Untracked files:
Untracked: analysis/expand.smash.Rmd
Untracked: data/18486.genecov.txt
Untracked: data/APApeaksYL.total.inbrain.bed
Untracked: data/YL-SP-18486-T_S9_R1_001-genecov.txt
Untracked: data/bedgraph_peaks/
Untracked: data/bin200.5.T.nuccov.bed
Untracked: data/bin200.Anuccov.bed
Untracked: data/bin200.nuccov.bed
Untracked: data/gene_cov/
Untracked: data/leafcutter/
Untracked: data/nuc6up/
Untracked: data/reads_mapped_three_prime_seq.csv
Untracked: data/smash.cov.results.bed
Untracked: data/smash.cov.results.csv
Untracked: data/smash.cov.results.txt
Untracked: data/smash_testregion/
Untracked: data/ssFC200.cov.bed
Untracked: output/picard/
Untracked: output/plots/
Untracked: output/qual.fig2.pdf
Unstaged changes:
Modified: analysis/dif.iso.usage.leafcutter.Rmd
Modified: analysis/explore.filters.Rmd
Modified: analysis/test.max2.Rmd
Modified: code/Snakefile
Note that any generated files, e.g. HTML, png, CSS, etc., are not included in this status report because it is ok for generated content to have uncommitted changes.
File | Version | Author | Date | Message |
---|---|---|---|---|
Rmd | d8394a3 | Briana Mittleman | 2018-07-25 | start clean up code analysis |
Install new packages:
source("https://bioconductor.org/biocLite.R")
biocLite("BSgenome.Hsapiens.UCSC.hg19")
Load Packages:
library(workflowr)
This is workflowr version 1.1.1
Run ?workflowr for help getting started
library(cleanUpdTSeq)
Loading required package: BiocGenerics
Loading required package: parallel
Attaching package: 'BiocGenerics'
The following objects are masked from 'package:parallel':
clusterApply, clusterApplyLB, clusterCall, clusterEvalQ,
clusterExport, clusterMap, parApply, parCapply, parLapply,
parLapplyLB, parRapply, parSapply, parSapplyLB
The following objects are masked from 'package:stats':
IQR, mad, sd, var, xtabs
The following objects are masked from 'package:base':
anyDuplicated, append, as.data.frame, basename, cbind,
colMeans, colnames, colSums, dirname, do.call, duplicated,
eval, evalq, Filter, Find, get, grep, grepl, intersect,
is.unsorted, lapply, lengths, Map, mapply, match, mget, order,
paste, pmax, pmax.int, pmin, pmin.int, Position, rank, rbind,
Reduce, rowMeans, rownames, rowSums, sapply, setdiff, sort,
table, tapply, union, unique, unsplit, which, which.max,
which.min
Loading required package: BSgenome
Loading required package: S4Vectors
Loading required package: stats4
Attaching package: 'S4Vectors'
The following object is masked from 'package:base':
expand.grid
Loading required package: IRanges
Loading required package: GenomeInfoDb
Loading required package: GenomicRanges
Loading required package: Biostrings
Loading required package: XVector
Attaching package: 'Biostrings'
The following object is masked from 'package:base':
strsplit
Loading required package: rtracklayer
Loading required package: BSgenome.Drerio.UCSC.danRer7
Loading required package: seqinr
Attaching package: 'seqinr'
The following object is masked from 'package:Biostrings':
translate
Loading required package: e1071
library(GenomicRanges)
library(BSgenome.Hsapiens.UCSC.hg19)
I am also going to install cleanUpdTSeq on my midway account because I will want to write scripts using this package that can take in any bedfile and will write out the file with the classification results. I can also have the cutoff option be a parameter that will change.
The test set should have chr, start, end, name, score, strand.
#!/bin/rscripts
# usage: ./cleanupdtseq.R in_bedfile, outfile, cuttoff
#this script takes a putative peak file, and output file name and a cuttoff for classification and outputs the file with all of the seqs classified.
#use optparse for management of input arguments I want to be able to imput the 6up nuc file and write out a filter file
#script needs to run outside of conda env. should module load R in bash script when I submit it
library(optparse)
library(dplyr)
library(tidyr)
library(ggplot2)
library(cleanUpdTSeq)
library(GenomicRanges)
library(BSgenome.Hsapiens.UCSC.hg19)
option_list = list(
make_option(c("-f", "--file"), action="store", default=NA, type='character',
help="input file"),
make_option(c("-o", "--output"), action="store", default=NA, type='character',
help="output file"),
make_option(c("-c", "--cutoff"), action="store", default=NA, type='double',
help="assignment cuttoff")
)
opt_parser <- OptionParser(option_list=option_list)
opt <- parse_args(opt_parser)
#interrupt execution if no file is supplied
if (is.null(opt$file)){
print_help(opt_parser)
stop("Need input file", call.=FALSE)
}
#imput file for test data
testSet <- read.table(file = opt$file, sep="\t", header=TRUE)
peaks <- BED2GRangesSeq(testSet, withSeq=FALSE)
#build vector with human genome
testSet.NaiveBayes <- buildFeatureVector(peaks, BSgenomeName=Hsapiens,
upstream=40, downstream=30,
wordSize=6, alphabet=c("ACGT"),
sampleType="unknown",
replaceNAdistance=30,
method="NaiveBayes",
ZeroBasedIndex=1, fetchSeq=TRUE)
#classfy sites with built in classsifer
data(classifier)
testResults <- predictTestSet(testSet.NaiveBayes=testSet.NaiveBayes,
classifier=classifier,
outputFile=NULL,
assignmentCutoff=opt$cutoff)
#write results
write.table(testResults, file=opt$output, quote = F, row.names = F, col.names = T)
I will need to module load R in the bash script that writes this.
#!/bin/bash
#SBATCH --job-name=clean_filteredpeakstotal
#SBATCH --account=pi-yangili1
#SBATCH --time=24:00:00
#SBATCH --output=clean_filteredpeakstotal.out
#SBATCH --error=clean_filteredpeakstotal.err
#SBATCH --partition=broadwl
#SBATCH --mem=40G
#SBATCH --mail-type=END
module load R
Rscript cleanupdtseq.R -f /project2/gilad/briana/threeprimeseq/data/clean.peaks/APAfiltered_named.bed -o /project2/gilad/briana/threeprimeseq/data/clean.peaks/clean_APAfilteredTotal.txt -c .5
#add names to bed file with peaks
#awk '{print $1 "\t" $2 "\t" $3 "\t" $1 ":" $2 ":" $3 "\t" $4 "\t" $5 "\t" $6}' APAfiltered.bed > APAfiltered_named.bed
seq 1 199880 > peak.num.txt
paste APAfiltered.bed peak.num.txt | column -s $'\t' -t > temp
awk '{print $1 "\t" $2 "\t" $3 "\t" $7 "\t" $4 "\t" $5 "\t" $6}' temp > APAfiltered_named.bed
!sq
sessionInfo()
R version 3.5.1 (2018-07-02)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS Sierra 10.12.6
Matrix products: default
BLAS: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRblas.0.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRlapack.dylib
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
attached base packages:
[1] stats4 parallel stats graphics grDevices utils datasets
[8] methods base
other attached packages:
[1] BSgenome.Hsapiens.UCSC.hg19_1.4.0 cleanUpdTSeq_1.18.0
[3] e1071_1.6-8 seqinr_3.4-5
[5] BSgenome.Drerio.UCSC.danRer7_1.4.0 BSgenome_1.48.0
[7] rtracklayer_1.40.3 Biostrings_2.48.0
[9] XVector_0.20.0 GenomicRanges_1.32.6
[11] GenomeInfoDb_1.16.0 IRanges_2.14.10
[13] S4Vectors_0.18.3 BiocGenerics_0.26.0
[15] workflowr_1.1.1
loaded via a namespace (and not attached):
[1] Rcpp_0.12.18 compiler_3.5.1
[3] git2r_0.23.0 class_7.3-14
[5] R.methodsS3_1.7.1 bitops_1.0-6
[7] R.utils_2.6.0 tools_3.5.1
[9] zlibbioc_1.26.0 digest_0.6.15
[11] lattice_0.20-35 evaluate_0.11
[13] Matrix_1.2-14 DelayedArray_0.6.2
[15] yaml_2.1.19 GenomeInfoDbData_1.1.0
[17] stringr_1.3.1 knitr_1.20
[19] ade4_1.7-11 rprojroot_1.3-2
[21] grid_3.5.1 Biobase_2.40.0
[23] XML_3.98-1.12 BiocParallel_1.14.2
[25] rmarkdown_1.10 magrittr_1.5
[27] whisker_0.3-2 MASS_7.3-50
[29] backports_1.1.2 Rsamtools_1.32.2
[31] htmltools_0.3.6 matrixStats_0.54.0
[33] GenomicAlignments_1.16.0 SummarizedExperiment_1.10.1
[35] stringi_1.2.4 RCurl_1.95-4.11
[37] R.oo_1.22.0
This reproducible R Markdown analysis was created with workflowr 1.1.1