Last updated: 2018-06-07
workflowr checks: (Click a bullet for more information) ✔ R Markdown file: up-to-date
Great! Since the R Markdown file has been committed to the Git repository, you know the exact version of the code that produced these results.
✔ Environment: empty
Great job! The global environment was empty. Objects defined in the global environment can affect the analysis in your R Markdown file in unknown ways. For reproduciblity it’s best to always run the code in an empty environment.
✔ Seed:
set.seed(12345)
The command set.seed(12345)
was run prior to running the code in the R Markdown file. Setting a seed ensures that any results that rely on randomness, e.g. subsampling or permutations, are reproducible.
✔ Session information: recorded
Great job! Recording the operating system, R version, and package versions is critical for reproducibility.
✔ Repository version: bb0305b
wflow_publish
or wflow_git_commit
). workflowr only checks the R Markdown file, but you know if there are other scripts or data files that it depends on. Below is the status of the Git repository when the results were generated:
Ignored files:
Ignored: .Rhistory
Ignored: .Rproj.user/
Untracked files:
Untracked: data/18486.genecov.txt
Untracked: data/YL-SP-18486-T_S9_R1_001-genecov.txt
Untracked: data/gene_cov/
Untracked: data/leafcutter/
Untracked: data/reads_mapped_three_prime_seq.csv
Untracked: data/ssFC200.cov.bed
Untracked: output/plots/
Unstaged changes:
Modified: analysis/dif.iso.usage.leafcutter.Rmd
Modified: code/Snakefile
Note that any generated files, e.g. HTML, png, CSS, etc., are not included in this status report because it is ok for generated content to have uncommitted changes.
File | Version | Author | Date | Message |
---|---|---|---|---|
Rmd | bb0305b | Briana Mittleman | 2018-06-07 | coverage questions |
I am going to change the filters and see if we get more genes with polA than we did in the first differential isoform usage anaylsis.
First, I will change it to 1 read in 5 individuals rather than 5 reads in 5 individuals.
library(dplyr)
Warning: package 'dplyr' was built under R version 3.4.4
Attaching package: 'dplyr'
The following objects are masked from 'package:stats':
filter, lag
The following objects are masked from 'package:base':
intersect, setdiff, setequal, union
library(ggplot2)
library(reshape2)
Warning: package 'reshape2' was built under R version 3.4.3
library(workflowr)
Loading required package: rmarkdown
This is workflowr version 1.0.1
Run ?workflowr for help getting started
library(tidyr)
Attaching package: 'tidyr'
The following object is masked from 'package:reshape2':
smiths
#import data and correct row names
cov_all=read.table("../data/ssFC200.cov.bed", header = T, stringsAsFactors = FALSE)
names=c("Geneid","Chr", "Start", "End", "Strand", "Length", "N_18486","T_18486","N_18497","T_18497","N_18500","T_18500","N_18505",'T_18505',"N_18508","T_18508","N_18853","T_18853","N_18870","T_18870","N_19128","T_19128","N_19141","T_19141","N_19193","T_19193","N_19209","T_19209","N_19223","N_19225","T_19225","T_19223","N_19238","T_19238","N_19239","T_19239","N_19257","T_19257")
colnames(cov_all)= names
#convert to leaf format
cov_all_anno=cov_all %>% separate(col=Geneid, into=c("bin","gene"), sep=".E")
cov_all_anno$gene= paste( "E", cov_all_anno$gene, sep="" )
bin_loc=paste(cov_all_anno$Start, cov_all_anno$End, cov_all_anno$Strand,sep=".")
leaf_all_anno=paste(cov_all_anno$Chr,bin_loc, cov_all_anno$gene, sep=":")
leaf_all=cbind(leaf_all_anno,cov_all_anno[,8:39])
Create a function where I can filter the number of reads and individuals for the filtering.
apa_genes=function(reads, ind) {
leaf_all_nuc= leaf_all %>% select(contains("N_"))
keep.nuc.leaf=rowSums(leaf_all_nuc>=reads) >= ind
leaf_nuc_filt=leaf_all[keep.nuc.leaf,]
leaf_all_tot= leaf_all %>% select(contains("T_"))
keep.tot.leaf=rowSums(leaf_all_tot>=reads) >= ind
leaf_tot_filt=leaf_all[keep.tot.leaf,]
leaf_all_filt=union(leaf_nuc_filt,leaf_tot_filt)
genes.anno=data.frame(x=leaf_all_filt$leaf_all_anno) %>% separate(col=x, into=c("chr","bin","gene"), sep=":")
n_genes= n_distinct(genes.anno$gene)
num_gene=genes.anno %>% group_by(gene) %>% select(gene) %>% tally() %>% filter(n>1)
return(nrow(num_gene))
}
current_filter=apa_genes(5,5)
Warning: package 'bindrcpp' was built under R version 3.4.4
one_read=apa_genes(1,5)
one_read_oneind=apa_genes(1,1)
I need to compare gene counts for RNA seq and 3’ seq. I can use the protein coding coverage files that were created using snakemake. (/project2/gilad/briana/threeprimeseq/data/gene_cov)
For RNA seq I need to run the snakemake rule for this file:
/project2/gilad/yangili/LCLs/bams/RNAseqGeuvadis_STAR_18486.final.sort.bam
I will need to run, bamtobed, sortbed, and bedtools coverage. This script is rnaseq_cov.sh
#!/bin/bash
#SBATCH --job-name=rna_cov
#SBATCH --time=8:00:00
#SBATCH --output=rna_cov.out
#SBATCH --error=rna_cov.err
#SBATCH --partition=broadwl
#SBATCH --mem=20G
#SBATCH --mail-type=END
module load Anaconda3
source activate three-prime-env
#input is a bam
sample=$1
describer=$(echo ${sample} | sed -e 's/.*\RNAseqGeuvadis_STAR_//' | sed -e "s/.final.sort.bam$//")
bedtools bamtobed -i $1 > /project2/gilad/briana/threeprimeseq/data/rnaseq_bed/${describer}.bed
sort -k1,1 -k2,2n /project2/gilad/briana/threeprimeseq/data/rnaseq_bed/${describer}.bed > /project2/gilad/briana/threeprimeseq/data/rnaseq_sorted_bed/${describer}.sort.bed
bedtools coverage -counts -sorted -a /project2/gilad/briana/genome_anotation_data/gencode.v19.annotation.proteincodinggene.sort.chr.bed -b /project2/gilad/briana/threeprimeseq/data/rnaseq_sorted_bed/${describer}.sort.bed > /project2/gilad/briana/threeprimeseq/data/rnaseq_cov/${describer}.genecov.txt
Import the data:
rnaseq=read.table("../data/18486.genecov.txt")
names(rnaseq)=c("Chr", "start", "end", "gene", "score", "strand", "count")
threeprime=read.table("../data/YL-SP-18486-T_S9_R1_001-genecov.txt")
names(threeprime)=c("Chr", "start", "end", "gene", "score", "strand", "count")
Join the data on the gene names.
rnaseq_sm=rnaseq %>% select("gene", "count")
threeprime_sm=threeprime %>% select("gene", "count")
gene_cov=rnaseq_sm %>% left_join(threeprime_sm, by= "gene")
names(gene_cov)= c("gene", "rnaseq", "threeprime")
lm(gene_cov$rnaseq ~gene_cov$threeprime)
Call:
lm(formula = gene_cov$rnaseq ~ gene_cov$threeprime)
Coefficients:
(Intercept) gene_cov$threeprime
2952.907 2.888
ggplot(gene_cov,aes(x=log10(rnaseq), y=log10(threeprime)))+ geom_point(na.rm=TRUE, size=.1) + geom_density2d(na.rm = TRUE, size = 1, colour = 'red') + labs(y='Log(three prime seq gene Count)', x='Log(RNA seq gene Count)', title="Correlation between three prime seq and rna seq read counts") + xlab('Log(RNA seq gene Count)') + geom_smooth(method="lm")
Warning: Removed 7062 rows containing non-finite values (stat_smooth).
summary(gene_cov$rnaseq)
Min. 1st Qu. Median Mean 3rd Qu. Max.
0 6 611 3619 3658 1075054
summary(gene_cov$threeprime)
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.0 0.0 31.0 230.5 185.0 139213.0
Look at home many genes have values in RNA seq and not in three prime seq.
rnaseq.great.0= gene_cov %>%filter(threeprime==0) %>% filter(rnaseq>0) %>% select(gene)
rnaseq_det=rnaseq %>% mutate(det_in_three=ifelse(gene %in% rnaseq.great.0$gene, "Not", "Det" )) %>% mutate(count_cor=count/(end-start))
ggplot(rnaseq_det, aes(y=log10(count_cor), x=det_in_three)) + geom_violin() + labs(y="log10 standardized RNA seq count", x="Detected in Three prime seq", title="Comparing RNA seq counts for genes where detected and not detected in three prime seq")
Warning: Removed 3420 rows containing non-finite values (stat_ydensity).
sessionInfo()
R version 3.4.2 (2017-09-28)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS Sierra 10.12.6
Matrix products: default
BLAS: /Library/Frameworks/R.framework/Versions/3.4/Resources/lib/libRblas.0.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.4/Resources/lib/libRlapack.dylib
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] bindrcpp_0.2.2 tidyr_0.7.2 workflowr_1.0.1 rmarkdown_1.8.5
[5] reshape2_1.4.3 ggplot2_2.2.1 dplyr_0.7.5
loaded via a namespace (and not attached):
[1] Rcpp_0.12.17 compiler_3.4.2 pillar_1.1.0
[4] git2r_0.21.0 plyr_1.8.4 bindr_0.1.1
[7] R.methodsS3_1.7.1 R.utils_2.6.0 tools_3.4.2
[10] digest_0.6.15 evaluate_0.10.1 tibble_1.4.2
[13] gtable_0.2.0 pkgconfig_2.0.1 rlang_0.2.1
[16] yaml_2.1.19 stringr_1.3.1 knitr_1.18
[19] rprojroot_1.3-2 grid_3.4.2 tidyselect_0.2.4
[22] glue_1.2.0 R6_2.2.2 purrr_0.2.5
[25] magrittr_1.5 whisker_0.3-2 backports_1.1.2
[28] scales_0.5.0 htmltools_0.3.6 MASS_7.3-48
[31] assertthat_0.2.0 colorspace_1.3-2 labeling_0.3
[34] stringi_1.2.2 lazyeval_0.2.1 munsell_0.4.3
[37] R.oo_1.22.0
This reproducible R Markdown analysis was created with workflowr 1.0.1