Last updated: 2018-06-07
workflowr checks: (Click a bullet for more information) ✔ R Markdown file: up-to-date 
Great! Since the R Markdown file has been committed to the Git repository, you know the exact version of the code that produced these results.
 ✔ Environment: empty 
Great job! The global environment was empty. Objects defined in the global environment can affect the analysis in your R Markdown file in unknown ways. For reproduciblity it’s best to always run the code in an empty environment.
 ✔ Seed: 
set.seed(12345) 
The command set.seed(12345) was run prior to running the code in the R Markdown file. Setting a seed ensures that any results that rely on randomness, e.g. subsampling or permutations, are reproducible.
 ✔ Session information: recorded 
Great job! Recording the operating system, R version, and package versions is critical for reproducibility.
 ✔ Repository version: bb0305b 
wflow_publish or wflow_git_commit). workflowr only checks the R Markdown file, but you know if there are other scripts or data files that it depends on. Below is the status of the Git repository when the results were generated:
Ignored files:
    Ignored:    .Rhistory
    Ignored:    .Rproj.user/
Untracked files:
    Untracked:  data/18486.genecov.txt
    Untracked:  data/YL-SP-18486-T_S9_R1_001-genecov.txt
    Untracked:  data/gene_cov/
    Untracked:  data/leafcutter/
    Untracked:  data/reads_mapped_three_prime_seq.csv
    Untracked:  data/ssFC200.cov.bed
    Untracked:  output/plots/
Unstaged changes:
    Modified:   analysis/dif.iso.usage.leafcutter.Rmd
    Modified:   code/Snakefile
| File | Version | Author | Date | Message | 
|---|---|---|---|---|
| Rmd | bb0305b | Briana Mittleman | 2018-06-07 | coverage questions | 
I am going to change the filters and see if we get more genes with polA than we did in the first differential isoform usage anaylsis.
First, I will change it to 1 read in 5 individuals rather than 5 reads in 5 individuals.
library(dplyr)Warning: package 'dplyr' was built under R version 3.4.4
Attaching package: 'dplyr'The following objects are masked from 'package:stats':
    filter, lagThe following objects are masked from 'package:base':
    intersect, setdiff, setequal, unionlibrary(ggplot2)
library(reshape2)Warning: package 'reshape2' was built under R version 3.4.3library(workflowr)Loading required package: rmarkdownThis is workflowr version 1.0.1
Run ?workflowr for help getting startedlibrary(tidyr)
Attaching package: 'tidyr'The following object is masked from 'package:reshape2':
    smiths#import data and correct row names 
cov_all=read.table("../data/ssFC200.cov.bed", header = T, stringsAsFactors = FALSE)
names=c("Geneid","Chr", "Start", "End", "Strand", "Length", "N_18486","T_18486","N_18497","T_18497","N_18500","T_18500","N_18505",'T_18505',"N_18508","T_18508","N_18853","T_18853","N_18870","T_18870","N_19128","T_19128","N_19141","T_19141","N_19193","T_19193","N_19209","T_19209","N_19223","N_19225","T_19225","T_19223","N_19238","T_19238","N_19239","T_19239","N_19257","T_19257")
colnames(cov_all)= names#convert to leaf format
cov_all_anno=cov_all %>% separate(col=Geneid, into=c("bin","gene"), sep=".E") 
cov_all_anno$gene=  paste( "E",  cov_all_anno$gene, sep="" )
bin_loc=paste(cov_all_anno$Start, cov_all_anno$End, cov_all_anno$Strand,sep=".")
leaf_all_anno=paste(cov_all_anno$Chr,bin_loc, cov_all_anno$gene, sep=":")
leaf_all=cbind(leaf_all_anno,cov_all_anno[,8:39])Create a function where I can filter the number of reads and individuals for the filtering.
apa_genes=function(reads, ind) {
  leaf_all_nuc= leaf_all %>% select(contains("N_"))
  keep.nuc.leaf=rowSums(leaf_all_nuc>=reads) >= ind
  leaf_nuc_filt=leaf_all[keep.nuc.leaf,]
  leaf_all_tot= leaf_all %>% select(contains("T_"))
  keep.tot.leaf=rowSums(leaf_all_tot>=reads) >= ind
  leaf_tot_filt=leaf_all[keep.tot.leaf,]
  leaf_all_filt=union(leaf_nuc_filt,leaf_tot_filt)
  
  genes.anno=data.frame(x=leaf_all_filt$leaf_all_anno) %>%  separate(col=x, into=c("chr","bin","gene"), sep=":")
  n_genes= n_distinct(genes.anno$gene) 
  num_gene=genes.anno %>% group_by(gene) %>% select(gene) %>% tally() %>% filter(n>1) 
  return(nrow(num_gene))
}current_filter=apa_genes(5,5)Warning: package 'bindrcpp' was built under R version 3.4.4one_read=apa_genes(1,5)
one_read_oneind=apa_genes(1,1)I need to compare gene counts for RNA seq and 3’ seq. I can use the protein coding coverage files that were created using snakemake. (/project2/gilad/briana/threeprimeseq/data/gene_cov)
For RNA seq I need to run the snakemake rule for this file:
/project2/gilad/yangili/LCLs/bams/RNAseqGeuvadis_STAR_18486.final.sort.bam
I will need to run, bamtobed, sortbed, and bedtools coverage. This script is rnaseq_cov.sh
#!/bin/bash
#SBATCH --job-name=rna_cov
#SBATCH --time=8:00:00
#SBATCH --output=rna_cov.out
#SBATCH --error=rna_cov.err
#SBATCH --partition=broadwl
#SBATCH --mem=20G
#SBATCH --mail-type=END
module load Anaconda3  
source activate three-prime-env
#input is a bam 
sample=$1
describer=$(echo ${sample} | sed -e 's/.*\RNAseqGeuvadis_STAR_//' | sed -e "s/.final.sort.bam$//")
bedtools bamtobed -i $1 > /project2/gilad/briana/threeprimeseq/data/rnaseq_bed/${describer}.bed
sort -k1,1 -k2,2n /project2/gilad/briana/threeprimeseq/data/rnaseq_bed/${describer}.bed > /project2/gilad/briana/threeprimeseq/data/rnaseq_sorted_bed/${describer}.sort.bed
bedtools coverage -counts -sorted -a /project2/gilad/briana/genome_anotation_data/gencode.v19.annotation.proteincodinggene.sort.chr.bed -b /project2/gilad/briana/threeprimeseq/data/rnaseq_sorted_bed/${describer}.sort.bed > /project2/gilad/briana/threeprimeseq/data/rnaseq_cov/${describer}.genecov.txt
Import the data:
rnaseq=read.table("../data/18486.genecov.txt")
names(rnaseq)=c("Chr", "start", "end", "gene", "score", "strand", "count")
threeprime=read.table("../data/YL-SP-18486-T_S9_R1_001-genecov.txt")
names(threeprime)=c("Chr", "start", "end", "gene", "score", "strand", "count")Join the data on the gene names.
rnaseq_sm=rnaseq %>% select("gene", "count")
threeprime_sm=threeprime %>% select("gene", "count")
gene_cov=rnaseq_sm %>% left_join(threeprime_sm, by= "gene") 
names(gene_cov)= c("gene", "rnaseq", "threeprime")
lm(gene_cov$rnaseq ~gene_cov$threeprime)
Call:
lm(formula = gene_cov$rnaseq ~ gene_cov$threeprime)
Coefficients:
        (Intercept)  gene_cov$threeprime  
           2952.907                2.888  ggplot(gene_cov,aes(x=log10(rnaseq), y=log10(threeprime)))+ geom_point(na.rm=TRUE, size=.1) + geom_density2d(na.rm = TRUE, size = 1, colour = 'red') + labs(y='Log(three prime seq gene Count)', x='Log(RNA seq gene Count)', title="Correlation between three prime seq and rna seq read counts") + xlab('Log(RNA seq gene Count)') + geom_smooth(method="lm")Warning: Removed 7062 rows containing non-finite values (stat_smooth).
summary(gene_cov$rnaseq)   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
      0       6     611    3619    3658 1075054 summary(gene_cov$threeprime)    Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
     0.0      0.0     31.0    230.5    185.0 139213.0 Look at home many genes have values in RNA seq and not in three prime seq.
rnaseq.great.0= gene_cov %>%filter(threeprime==0) %>% filter(rnaseq>0) %>% select(gene)
rnaseq_det=rnaseq %>% mutate(det_in_three=ifelse(gene %in% rnaseq.great.0$gene, "Not", "Det" )) %>% mutate(count_cor=count/(end-start))
ggplot(rnaseq_det, aes(y=log10(count_cor), x=det_in_three)) + geom_violin() + labs(y="log10 standardized RNA seq count", x="Detected in Three prime seq", title="Comparing RNA seq counts for genes where detected and not detected in three prime seq")Warning: Removed 3420 rows containing non-finite values (stat_ydensity).
sessionInfo()R version 3.4.2 (2017-09-28)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS Sierra 10.12.6
Matrix products: default
BLAS: /Library/Frameworks/R.framework/Versions/3.4/Resources/lib/libRblas.0.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.4/Resources/lib/libRlapack.dylib
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     
other attached packages:
[1] bindrcpp_0.2.2  tidyr_0.7.2     workflowr_1.0.1 rmarkdown_1.8.5
[5] reshape2_1.4.3  ggplot2_2.2.1   dplyr_0.7.5    
loaded via a namespace (and not attached):
 [1] Rcpp_0.12.17      compiler_3.4.2    pillar_1.1.0     
 [4] git2r_0.21.0      plyr_1.8.4        bindr_0.1.1      
 [7] R.methodsS3_1.7.1 R.utils_2.6.0     tools_3.4.2      
[10] digest_0.6.15     evaluate_0.10.1   tibble_1.4.2     
[13] gtable_0.2.0      pkgconfig_2.0.1   rlang_0.2.1      
[16] yaml_2.1.19       stringr_1.3.1     knitr_1.18       
[19] rprojroot_1.3-2   grid_3.4.2        tidyselect_0.2.4 
[22] glue_1.2.0        R6_2.2.2          purrr_0.2.5      
[25] magrittr_1.5      whisker_0.3-2     backports_1.1.2  
[28] scales_0.5.0      htmltools_0.3.6   MASS_7.3-48      
[31] assertthat_0.2.0  colorspace_1.3-2  labeling_0.3     
[34] stringi_1.2.2     lazyeval_0.2.1    munsell_0.4.3    
[37] R.oo_1.22.0      
This reproducible R Markdown analysis was created with workflowr 1.0.1