Last updated: 2018-09-04
workflowr checks: (Click a bullet for more information) ✔ R Markdown file: up-to-date
Great! Since the R Markdown file has been committed to the Git repository, you know the exact version of the code that produced these results.
✔ Environment: empty
Great job! The global environment was empty. Objects defined in the global environment can affect the analysis in your R Markdown file in unknown ways. For reproduciblity it’s best to always run the code in an empty environment.
✔ Seed:
set.seed(12345)
The command set.seed(12345)
was run prior to running the code in the R Markdown file. Setting a seed ensures that any results that rely on randomness, e.g. subsampling or permutations, are reproducible.
✔ Session information: recorded
Great job! Recording the operating system, R version, and package versions is critical for reproducibility.
✔ Repository version: deaa5b0
wflow_publish
or wflow_git_commit
). workflowr only checks the R Markdown file, but you know if there are other scripts or data files that it depends on. Below is the status of the Git repository when the results were generated:
Ignored files:
Ignored: .DS_Store
Ignored: .Rhistory
Ignored: .Rproj.user/
Ignored: analysis/figure/
Ignored: output/.DS_Store
Untracked files:
Untracked: analysis/ncbiRefSeq_sm.sort.mRNA.bed
Untracked: analysis/snake.config.notes.Rmd
Untracked: data/18486.genecov.txt
Untracked: data/APApeaksYL.total.inbrain.bed
Untracked: data/RNAkalisto/
Untracked: data/Totalpeaks_filtered_clean.bed
Untracked: data/YL-SP-18486-T-combined-genecov.txt
Untracked: data/YL-SP-18486-T_S9_R1_001-genecov.txt
Untracked: data/bedgraph_peaks/
Untracked: data/bin200.5.T.nuccov.bed
Untracked: data/bin200.Anuccov.bed
Untracked: data/bin200.nuccov.bed
Untracked: data/clean_peaks/
Untracked: data/comb_map_stats.csv
Untracked: data/comb_map_stats.xlsx
Untracked: data/combined_reads_mapped_three_prime_seq.csv
Untracked: data/gencov.test.csv
Untracked: data/gencov.test.txt
Untracked: data/gencov_zero.test.csv
Untracked: data/gencov_zero.test.txt
Untracked: data/gene_cov/
Untracked: data/joined
Untracked: data/leafcutter/
Untracked: data/merged_combined_YL-SP-threeprimeseq.bg
Untracked: data/nom_QTL/
Untracked: data/nom_QTL_opp/
Untracked: data/nuc6up/
Untracked: data/peakPerRefSeqGene/
Untracked: data/perm_QTL/
Untracked: data/perm_QTL_opp/
Untracked: data/reads_mapped_three_prime_seq.csv
Untracked: data/smash.cov.results.bed
Untracked: data/smash.cov.results.csv
Untracked: data/smash.cov.results.txt
Untracked: data/smash_testregion/
Untracked: data/ssFC200.cov.bed
Untracked: data/temp.file1
Untracked: data/temp.file2
Untracked: data/temp.gencov.test.txt
Untracked: data/temp.gencov_zero.test.txt
Untracked: output/picard/
Untracked: output/plots/
Untracked: output/qual.fig2.pdf
Unstaged changes:
Modified: analysis/28ind.peak.explore.Rmd
Modified: analysis/cleanupdtseq.internalpriming.Rmd
Modified: analysis/dif.iso.usage.leafcutter.Rmd
Modified: analysis/diff_iso_pipeline.Rmd
Modified: analysis/explore.filters.Rmd
Modified: analysis/peak.cov.pipeline.Rmd
Modified: analysis/peakOverlap_oppstrand.Rmd
Modified: analysis/pheno.leaf.comb.Rmd
Modified: analysis/test.max2.Rmd
Modified: code/Snakefile
Note that any generated files, e.g. HTML, png, CSS, etc., are not included in this status report because it is ok for generated content to have uncommitted changes.
File | Version | Author | Date | Message |
---|---|---|---|---|
Rmd | deaa5b0 | Briana Mittleman | 2018-09-04 | compare TPM for genes with no peaks |
html | 2e39f7a | Briana Mittleman | 2018-08-30 | Build site. |
Rmd | a2a7cd9 | Briana Mittleman | 2018-08-30 | add kalisto code |
html | cbec2f6 | Briana Mittleman | 2018-08-29 | Build site. |
Rmd | 6b818cb | Briana Mittleman | 2018-08-29 | try gencode anno |
html | c6dc97b | brimittleman | 2018-08-28 | Build site. |
Rmd | fa818a1 | brimittleman | 2018-08-28 | first processing figure |
I will use this analysis to work on vizualising some of the processing steps of this analysis.
I want to create a figure similar to the one I created in https://brimittleman.github.io/comparative_threeprime/characterize.ortho.peaks.html. I will use the count distinct function from bedtools map. For this I am using the RefSeq mRNA annotations.
#!/bin/bash
#SBATCH --job-name=refseq_countdistinct
#SBATCH --account=pi-yangili1
#SBATCH --time=24:00:00
#SBATCH --output=refseq_countdistinct.out
#SBATCH --error=refseq_countdistinct.err
#SBATCH --partition=broadwl
#SBATCH --mem=12G
#SBATCH --mail-type=END
bedtools map -c 4 -s -o count_distinct -a /project2/gilad/briana/genome_anotation_data/refseq.ProteinCoding.noCHR.bed -b /project2/gilad/briana/threeprimeseq/data/mergedPeaks_comb/filtered_APApeaks_merged_allchrom.named.fixed.bed > /project2/gilad/briana/threeprimeseq/data/peakPerRefseqGene/filtered_APApeaks_perRefseqGene.txt
#try opp strand
bedtools map -c 4 -S -o count_distinct -a /project2/gilad/briana/genome_anotation_data/refseq.ProteinCoding.noCHR.bed -b /project2/gilad/briana/threeprimeseq/data/mergedPeaks_comb/filtered_APApeaks_merged_allchrom.named.fixed.bed > /project2/gilad/briana/threeprimeseq/data/peakPerRefseqGene/filtered_APApeaks_perRefseqGene_oppStrand.txt
library(tidyverse)
── Attaching packages ─────────────────────────────────────── tidyverse 1.2.1 ──
✔ ggplot2 3.0.0 ✔ purrr 0.2.5
✔ tibble 1.4.2 ✔ dplyr 0.7.6
✔ tidyr 0.8.1 ✔ stringr 1.3.1
✔ readr 1.1.1 ✔ forcats 0.3.0
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
library(workflowr)
This is workflowr version 1.1.1
Run ?workflowr for help getting started
library(reshape2)
Attaching package: 'reshape2'
The following object is masked from 'package:tidyr':
smiths
library(cowplot)
Attaching package: 'cowplot'
The following object is masked from 'package:ggplot2':
ggsave
names=c("Chr", "Start", "End", "Name", "Score", "Strand", "numPeaks")
peakpergene=read.table("../data/peakPerRefSeqGene/filtered_APApeaks_perRefseqGene.txt", stringsAsFactors = F, header = F, col.names = names) %>% mutate(onePeak=ifelse(numPeaks==1, 1, 0 )) %>% mutate(multPeaks=ifelse(numPeaks > 1, 1, 0 ))
genes1peak=sum(peakpergene$onePeak)/nrow(peakpergene)
genesMultpeak=sum(peakpergene$multPeaks)/nrow(peakpergene)
genes0peak= 1- genes1peak - genesMultpeak
perPeak= c(round(genes0peak,digits = 3), round(genes1peak,digits = 3),round(genesMultpeak, digits = 3))
Category=c("Zero", "One", "Multiple")
perPeakdf=as.data.frame(cbind(Category,as.numeric(perPeak)))
Plot these proportions:
lab1=paste("Genes =", genes0peak*nrow(peakpergene), sep=" ")
lab2=paste("Genes =", sum(peakpergene$onePeak), sep=" ")
lab3=paste("Genes =", sum(peakpergene$multPeaks), sep=" ")
genepeakplot=ggplot(perPeakdf, aes(x="", y=perPeak, fill=Category)) + geom_bar(stat="identity")+ labs(title="Characterize genes by number of PAS", y="Proportion of Protein Coding gene", x="")+ scale_fill_brewer(palette="Paired") + coord_cartesian(ylim=c(0,1)) + annotate("text", x="", y= .35, label=lab1) + annotate("text", x="", y= .78, label=lab2) + annotate("text", x="", y= .92, label=lab3)
genepeakplot
Version | Author | Date |
---|---|---|
cbec2f6 | Briana Mittleman | 2018-08-29 |
c6dc97b | brimittleman | 2018-08-28 |
This includes for than 1 isoform for different genes. I am going to go back to the original refseq file and resegment it. Column 13 is the gene name. Column 2 needs to start with NM because that is mRNA.
grep "NM" ncbiRefSeq.txt | awk '{print $3 "\t" $5 "\t" $6 "\t" $2 "\t" $13 "\t" $4}' > ncbiRefSeq.mRNA.named.bed
I can write a script that writes only the longest isoform for each gene.
outfile=open("refseq.ProteinCoding.bed", "w")
infile=open("ncbiRefSeq.mRNA.named.bed", "r")
lines=infile.readlines()
lot_lines=len(lines)
for n,ln in enumerate(lines):
chrom, start, end, mRNA, gene, strand = ln.split()
#if first line
if n == 0:
#first line condition
SE_list=[]
cur_gene=gene
SE_list.append(int(start))
SE_list.append(int(end))
elif n == lot_lines-1:
#last line condition
if gene == cur_gene:
SE_list.append(int(start))
SE_list.append(int(end))
SE_list.sort()
outfile.write("%s\t%d\t%d\t%s\t.\t%s\n"%(chrom, SE_list[0], SE_list[-1], gene, strand))
else:
outfile.write("%s\t%d\t%d\t%s\t.\t%s\n"%(chrom, int(start), int(end), gene, strand))
elif gene == cur_gene:
SE_list.append(int(start))
SE_list.append(int(end))
elif gene != cur_gene:
#write out the last line but with the start end from the SE list
prevline=lines[n-1]
chrom2, start2, end2, mRNA2, gene2, strand2 = prevline.split()
outfile.write("%s\t%d\t%d\t%s\t.\t%s\n"%(chrom2, SE_list[0], SE_list[-1], gene2, strand2))
cur_gene=gene
SE_list=[int(start), int(end)]
outfile.close()
I can check this by maknig sure there is 1 line for all of the unique names in the in file.
awk '{print $5}' ncbiRefSeq.mRNA.named.bed | sort | uniq | wc -l
#19243
wc -l refseq.ProteinCoding.bed
#20298
sed 's/^chr//' refseq.ProteinCoding.bed > refseq.ProteinCoding.noCHR.bed
There is still a problem with the script. The problem is when the same gene name is on extra haplotypes. I could remove all of the lines in the file that have _ in the first column. These are on contigs or specfic haplotypes. They will not map to our peaks anyway. I also need to remove the chr.
This still seems lower than previos APA estimates. I had used gencode estimates before. I am gonig to run this analysis again with those gene.
#!/bin/bash
#SBATCH --job-name=gencode_countdistinct
#SBATCH --account=pi-yangili1
#SBATCH --time=24:00:00
#SBATCH --output=gencode_countdistinct.out
#SBATCH --error=gencode_countdistinct.err
#SBATCH --partition=broadwl
#SBATCH --mem=12G
#SBATCH --mail-type=END
bedtools map -c 4 -s -o count_distinct -a /project2/gilad/briana/genome_anotation_data/gencode.v19.annotation.proteincodinggene.sort.bed -b /project2/gilad/briana/threeprimeseq/data/mergedPeaks_comb/filtered_APApeaks_merged_allchrom.named.fixed.bed >
Gpeakpergene=read.table("../data/peakPerRefSeqGene/filtered_APApeaks_perGencodeGene.txt", stringsAsFactors = F, header = F, col.names = names) %>% mutate(onePeak=ifelse(numPeaks==1, 1, 0 )) %>% mutate(multPeaks=ifelse(numPeaks > 1, 1, 0 ))
Ggenes1peak=sum(Gpeakpergene$onePeak)/nrow(Gpeakpergene)
GgenesMultpeak=sum(Gpeakpergene$multPeaks)/nrow(Gpeakpergene)
Ggenes0peak= 1- Ggenes1peak - GgenesMultpeak
GperPeak= c(round(Ggenes0peak,digits = 3), round(Ggenes1peak,digits = 3),round(GgenesMultpeak, digits = 3))
GperPeakdf=as.data.frame(cbind(Category,as.numeric(GperPeak)))
Plot these proportions:
Glab1=paste("Genes =", Ggenes0peak*nrow(Gpeakpergene), sep=" ")
Glab2=paste("Genes =", sum(Gpeakpergene$onePeak), sep=" ")
Glab3=paste("Genes =", sum(Gpeakpergene$multPeaks), sep=" ")
Ggenepeakplot=ggplot(GperPeakdf, aes(x="", y=perPeak, fill=Category)) + geom_bar(stat="identity")+ labs(title="Characterize Gencode genes by number of PAS", y="Proportion of Protein Coding gene", x="")+ scale_fill_brewer(palette="Paired") + coord_cartesian(ylim=c(0,1)) + annotate("text", x="", y= .35, label=Glab1) + annotate("text", x="", y= .78, label=Glab2) + annotate("text", x="", y= .92, label=Glab3)
Ggenepeakplot
Version | Author | Date |
---|---|---|
cbec2f6 | Briana Mittleman | 2018-08-29 |
These results are still lower than expected. This is because all of my previous analysis mapped the genes to the peaks as which were closest in the upstream direction. Here I am saying the peak must overlap the gene.
I should again look at some of the genes with the top counts in RNA seq and the 0 peaks.
I am going to run feaureCounts on 18486 guevardis with the refseq annotation with the named genes. I need to make this a SAF file.
from misc_helper import *
fout = file("/project2/gilad/briana/genome_anotation_data/refseq.ProteinCoding.noCHR.SAF",'w')
fout.write("GeneID\tChr\tStart\tEnd\tStrand\n")
for ln in open("/project2/gilad/briana/genome_anotation_data/refseq.ProteinCoding.noCHR.bed"):
chrom, start, end, gene, score, strand = ln.split()
start_i=int(start)
end_i=int(end)
fout.write("%s\t%s\t%d\t%d\t%s\n"%(gene, chrom, start_i, end_i, strand))
fout.close()
#!/bin/bash
#SBATCH --job-name=fc_RNAseq_refseq
#SBATCH --account=pi-yangili1
#SBATCH --time=24:00:00
#SBATCH --output=fc_RNAseq_refseq.out
#SBATCH --error=fc_RNAseq_refseq.err
#SBATCH --partition=broadwl
#SBATCH --mem=12G
#SBATCH --mail-type=END
module load Anaconda3
source activate three-prime-env
# outdir: /project2/gilad/briana/comparitive_threeprime/data/PeakwGene_quant
featureCounts -a /project2/gilad/briana/genome_anotation_data/refseq.ProteinCoding.noCHR.SAF -F SAF -o /project2/gilad/briana/threeprimeseq/data/peakPerRefseqGene/refseq18486exp.quant /project2/yangili1/LCL/RNAseqGeuvadisBams/RNAseqGeuvadis_STAR_18486.final.bam -s 1
Now I can upload the results and compare them to the peak counts in these genes.
namesRNA=c("Name", "Chr", "Start", "End", "Strand", "Length", "RNAseq")
RNAseqrefseq=read.table("../data/peakPerRefSeqGene/refseq18486exp.quant", header=T, stringsAsFactors = F, col.names = namesRNA)
RNAseqrefseq$Start=as.integer(RNAseqrefseq$Start)
Warning: NAs introduced by coercion
RNAseqrefseq$End=as.integer(RNAseqrefseq$End)
Warning: NAs introduced by coercion
Join the peakpergene dataframe with this dataframe.
refPeakandRNA=peakpergene %>% inner_join(RNAseqrefseq, by=c("Name", "Chr", "Start", "End", "Strand"))
refPeakandRNA_noPeak=refPeakandRNA %>% filter(RNAseq!=0) %>% filter(numPeaks==0) %>% arrange(desc(RNAseq)) %>% select(Name, Start, End, Chr, RNAseq, numPeaks)
This doesnt make much sense. Seems like the peaks are on the opposite strand for the top genes. I am gonig to force opposite strandedness and see what happens.
Opeakpergene=read.table("../data/peakPerRefSeqGene/filtered_APApeaks_perRefseqGene_oppStrand.txt", stringsAsFactors = F, header = F, col.names = names) %>% mutate(onePeak=ifelse(numPeaks==1, 1, 0 )) %>% mutate(multPeaks=ifelse(numPeaks > 1, 1, 0 ))
Ogenes1peak=sum(Opeakpergene$onePeak)/nrow(Opeakpergene)
OgenesMultpeak=sum(Opeakpergene$multPeaks)/nrow(Opeakpergene)
Ogenes0peak= 1- Ogenes1peak - OgenesMultpeak
OperPeak= c(round(Ogenes0peak,digits = 3), round(Ogenes1peak,digits = 3),round(OgenesMultpeak, digits = 3))
OperPeakdf=as.data.frame(cbind(Category,OperPeak))
OperPeakdf$OperPeak=as.numeric(as.character(OperPeakdf$OperPeak))
Plot these proportions:
Olab1=paste("Genes =", Ogenes0peak*nrow(Opeakpergene), sep=" ")
Olab2=paste("Genes =", sum(Opeakpergene$onePeak), sep=" ")
Olab3=paste("Genes =", sum(Opeakpergene$multPeaks), sep=" ")
Ogenepeakplot=ggplot(OperPeakdf, aes(x="", y=OperPeak, by=Category, fill=Category)) + geom_bar(stat="identity")+ labs(title="Characterize Refseq genes by number of PAS- Oppstrand", y="Proportion of Protein Coding gene", x="")+ scale_fill_brewer(palette="Paired") + coord_cartesian(ylim=c(0,1)) + annotate("text", x="", y= .2, label=Olab1) + annotate("text", x="", y= .4, label=Olab2) + annotate("text", x="", y= .9, label=Olab3)
Ogenepeakplot
Version | Author | Date |
---|---|---|
2e39f7a | Briana Mittleman | 2018-08-30 |
This makes more sense now.
refPeakandRNA_withO=Opeakpergene %>% inner_join(RNAseqrefseq, by=c("Name", "Chr", "Start", "End", "Strand"))
refPeakandRNA_noPeakw_withO=refPeakandRNA_withO %>% filter(RNAseq!=0) %>% filter(numPeaks==0) %>% arrange(desc(RNAseq)) %>% select(Name, Start, End, Chr, RNAseq, numPeaks)
plot(sort(log10(refPeakandRNA_withO$RNAseq), decreasing = T), main="Distribution of RNA expression counts 18486", ylab="log10 Gene count", xlab="Refseq Gene")
points(sort(log10(refPeakandRNA_noPeakw_withO$RNAseq), decreasing = T), col="Red")
legend("topright", legend=c("All Gene", "Gene without Peak"), col=c("black", "red"),pch=19)
Version | Author | Date |
---|---|---|
2e39f7a | Briana Mittleman | 2018-08-30 |
Run Kalisto on the this RNA seq line and look at this plot with the kalisto output expression TPM. I added Kallisto to the three-prime-env.
Kallisto step:
This needs to be based on a transcriptome. I will use the protein coding transcripts from https://www.gencodegenes.org/releases/28lift37.html.
#!/bin/bash
#SBATCH --job-name=kallisto_index18486
#SBATCH --account=pi-yangili1
#SBATCH --time=24:00:00
#SBATCH --output=kallisto_index18486.out
#SBATCH --error=kallisto_index18486.err
#SBATCH --partition=broadwl
#SBATCH --mem=50G
#SBATCH --mail-type=END
module load Anaconda3
source activate three-prime-env
kallisto index --make-unique -i /project2/gilad/briana/threeprimeseq/data/RNAseqKallisto/RNA18486_index /project2/gilad/briana/genome_anotation_data/gencode.v28lift37.pc_transcripts.fa
#!/bin/bash
#SBATCH --job-name=kallisto_quant18486
#SBATCH --account=pi-yangili1
#SBATCH --time=24:00:00
#SBATCH --output=kallisto_quant18486.out
#SBATCH --error=kallisto_quant18486.err
#SBATCH --partition=broadwl
#SBATCH --mem=12G
#SBATCH --mail-type=END
module load Anaconda3
source activate three-prime-env
kallisto quant -i /project2/gilad/briana/threeprimeseq/data/RNAseqKallisto/RNA18486_index -o /project2/gilad/briana/threeprimeseq/data/RNAseqKallisto/ /project2/yangili1/LCL/RNAseq/RNA.18486_1.fastq.gz /project2/yangili1/LCL/RNAseq/RNA.18486_2.fastq.gz
Convert to readable with TPM:
kallisto h5dump abundance.h5 -o /project2/gilad/briana/threeprimeseq/data/RNAseqKallisto
This is the gencode annotation. I want to do this with the refseq transcriptome. https://www.ncbi.nlm.nih.gov/projects/genome/guide/human/
kallisto_refseqindex18486.sh
#!/bin/bash
#SBATCH --job-name=kallisto_refseqindex18486
#SBATCH --account=pi-yangili1
#SBATCH --time=24:00:00
#SBATCH --output=kallisto_refseqindex18486.out
#SBATCH --error=kallisto_refseqindex18486.err
#SBATCH --partition=broadwl
#SBATCH --mem=50G
#SBATCH --mail-type=END
module load Anaconda3
source activate three-prime-env
kallisto index --make-unique -i /project2/gilad/briana/threeprimeseq/data/RNAseqKallisto/RNA18486_refseq_index /project2/gilad/briana/genome_anotation_data/GRCh37_latest_rna.fna
#!/bin/bash
#SBATCH --job-name=kallisto_refseq_quant18486
#SBATCH --account=pi-yangili1
#SBATCH --time=24:00:00
#SBATCH --output=kallisto_refseq_quant18486.out
#SBATCH --error=kallisto_refseq_quant18486.err
#SBATCH --partition=broadwl
#SBATCH --mem=12G
#SBATCH --mail-type=END
module load Anaconda3
source activate three-prime-env
kallisto quant -i /project2/gilad/briana/threeprimeseq/data/RNAseqKallisto/RNA18486_refseq_index -o /project2/gilad/briana/threeprimeseq/data/RNAseqKallisto/refseq/project2/yangili1/LCL/RNAseq/RNA.18486_1.fastq.gz /project2/yangili1/LCL/RNAseq/RNA.18486_2.fastq.gz
I will use tximport to convert from the transcripts that are quantified in kalisto.
#source("https://bioconductor.org/biocLite.R")
#biocLite("tximport")
#biocLite("TxDb.Hsapiens.UCSC.hg19.knownGene")
library(tximport)
library("TxDb.Hsapiens.UCSC.hg19.knownGene")
Loading required package: GenomicFeatures
Loading required package: BiocGenerics
Loading required package: parallel
Attaching package: 'BiocGenerics'
The following objects are masked from 'package:parallel':
clusterApply, clusterApplyLB, clusterCall, clusterEvalQ,
clusterExport, clusterMap, parApply, parCapply, parLapply,
parLapplyLB, parRapply, parSapply, parSapplyLB
The following objects are masked from 'package:dplyr':
combine, intersect, setdiff, union
The following objects are masked from 'package:stats':
IQR, mad, sd, var, xtabs
The following objects are masked from 'package:base':
anyDuplicated, append, as.data.frame, basename, cbind,
colMeans, colnames, colSums, dirname, do.call, duplicated,
eval, evalq, Filter, Find, get, grep, grepl, intersect,
is.unsorted, lapply, lengths, Map, mapply, match, mget, order,
paste, pmax, pmax.int, pmin, pmin.int, Position, rank, rbind,
Reduce, rowMeans, rownames, rowSums, sapply, setdiff, sort,
table, tapply, union, unique, unsplit, which, which.max,
which.min
Loading required package: S4Vectors
Loading required package: stats4
Attaching package: 'S4Vectors'
The following objects are masked from 'package:dplyr':
first, rename
The following object is masked from 'package:tidyr':
expand
The following object is masked from 'package:base':
expand.grid
Loading required package: IRanges
Attaching package: 'IRanges'
The following objects are masked from 'package:dplyr':
collapse, desc, slice
The following object is masked from 'package:purrr':
reduce
Loading required package: GenomeInfoDb
Loading required package: GenomicRanges
Loading required package: AnnotationDbi
Loading required package: Biobase
Welcome to Bioconductor
Vignettes contain introductory material; view with
'browseVignettes()'. To cite Bioconductor, see
'citation("Biobase")', and for packages 'citation("pkgname")'.
Attaching package: 'AnnotationDbi'
The following object is masked from 'package:dplyr':
select
Import Kalisto resutls:
#I need to make a gene to transcript ID with the transcript id and gene id columns
tx2gene=read.table("../data/RNAkalisto/ncbiRefSeq.txn2gene.txt" ,header= F, sep="\t", stringsAsFactors = F)
txi.kallisto.tsv <- tximport("../data/RNAkalisto/abundance.tsv", type = "kallisto", tx2gene = tx2gene)
Note: importing `abundance.h5` is typically faster than `abundance.tsv`
reading in files with read_tsv
1
removing duplicated transcript rows from tx2gene
transcripts missing from tx2gene: 99
summarizing abundance
summarizing counts
summarizing length
txi.kallisto.tsv$abundance= as.data.frame(txi.kallisto.tsv$abundance) %>% rownames_to_column(var="Name")
colnames(txi.kallisto.tsv$abundance)= c("Name", "TPM")
Now I want to join this with the RNA seq data so I am looking at the expression tpm rather than counts.
refPeakandRNA_withO_TPM=refPeakandRNA_withO %>% inner_join(txi.kallisto.tsv$abundance, by="Name") %>% filter(TPM>0)
refPeakandRNA_noPeakw_withO_TPM=refPeakandRNA_noPeakw_withO %>% inner_join(txi.kallisto.tsv$abundance, by="Name") %>% filter(TPM >0)
I can now replot the genes without peaks by TPM for the RNA seq rather than count.
plot(sort(log10(refPeakandRNA_withO_TPM$TPM), decreasing = T), main="Distribution of RNA expression 18486", ylab="log10 TPM", xlab="Refseq Gene")
points(sort(log10(refPeakandRNA_noPeakw_withO_TPM$TPM), decreasing = T), col="Red")
legend("topright", legend=c("All Genes", "Genes without Peak"), col=c("black", "red"),pch=19)
I can use this to look at some of the highest expressed genes that we do not have peaks for.
HIST2H2AA4: no coverage at location
HIST1H2AC: no coverage at location
BOP1: Not in the protein coding gene file. Are 2 peaks.
GSTM1: no coverage at location
NPIPA1: no coverage at location
SLX1A: difficult to interpret due to overlapping genes in the region
HIST1H2BJ: no coverage at location
MTX1: peak in the original filtered peaks, not in the refseq gene - lost due to direction, the peak goes the same was as the gene. probably noise
GALE - looks like there is a peak but we are not detecting it. May be too close to the next peak at the 3’ end of LYPLA2 gene.
HGH1: no coverage at location
MSMP: difficult to interpret due to overlapping genes in the region
sessionInfo()
R version 3.5.1 (2018-07-02)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS Sierra 10.12.6
Matrix products: default
BLAS: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRblas.0.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRlapack.dylib
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
attached base packages:
[1] stats4 parallel stats graphics grDevices utils datasets
[8] methods base
other attached packages:
[1] TxDb.Hsapiens.UCSC.hg19.knownGene_3.2.2
[2] GenomicFeatures_1.32.2
[3] AnnotationDbi_1.42.1
[4] Biobase_2.40.0
[5] GenomicRanges_1.32.6
[6] GenomeInfoDb_1.16.0
[7] IRanges_2.14.11
[8] S4Vectors_0.18.3
[9] BiocGenerics_0.26.0
[10] tximport_1.8.0
[11] bindrcpp_0.2.2
[12] cowplot_0.9.3
[13] reshape2_1.4.3
[14] workflowr_1.1.1
[15] forcats_0.3.0
[16] stringr_1.3.1
[17] dplyr_0.7.6
[18] purrr_0.2.5
[19] readr_1.1.1
[20] tidyr_0.8.1
[21] tibble_1.4.2
[22] ggplot2_3.0.0
[23] tidyverse_1.2.1
loaded via a namespace (and not attached):
[1] nlme_3.1-137 matrixStats_0.54.0
[3] bitops_1.0-6 lubridate_1.7.4
[5] bit64_0.9-7 RColorBrewer_1.1-2
[7] progress_1.2.0 httr_1.3.1
[9] rprojroot_1.3-2 tools_3.5.1
[11] backports_1.1.2 R6_2.2.2
[13] DBI_1.0.0 lazyeval_0.2.1
[15] colorspace_1.3-2 withr_2.1.2
[17] tidyselect_0.2.4 prettyunits_1.0.2
[19] bit_1.1-14 compiler_3.5.1
[21] git2r_0.23.0 cli_1.0.0
[23] rvest_0.3.2 xml2_1.2.0
[25] DelayedArray_0.6.5 rtracklayer_1.40.6
[27] labeling_0.3 scales_1.0.0
[29] digest_0.6.16 Rsamtools_1.32.3
[31] rmarkdown_1.10 R.utils_2.7.0
[33] XVector_0.20.0 pkgconfig_2.0.2
[35] htmltools_0.3.6 rlang_0.2.2
[37] readxl_1.1.0 rstudioapi_0.7
[39] RSQLite_2.1.1 bindr_0.1.1
[41] jsonlite_1.5 BiocParallel_1.14.2
[43] R.oo_1.22.0 RCurl_1.95-4.11
[45] magrittr_1.5 GenomeInfoDbData_1.1.0
[47] Matrix_1.2-14 Rcpp_0.12.18
[49] munsell_0.5.0 R.methodsS3_1.7.1
[51] stringi_1.2.4 whisker_0.3-2
[53] yaml_2.2.0 SummarizedExperiment_1.10.1
[55] zlibbioc_1.26.0 plyr_1.8.4
[57] grid_3.5.1 blob_1.1.1
[59] crayon_1.3.4 lattice_0.20-35
[61] Biostrings_2.48.0 haven_1.1.2
[63] hms_0.4.2 knitr_1.20
[65] pillar_1.3.0 biomaRt_2.36.1
[67] XML_3.98-1.16 glue_1.3.0
[69] evaluate_0.11 modelr_0.1.2
[71] cellranger_1.1.0 gtable_0.2.0
[73] assertthat_0.2.0 broom_0.5.0
[75] GenomicAlignments_1.16.0 memoise_1.1.0
This reproducible R Markdown analysis was created with workflowr 1.1.1