Peak coverage along Gene examples

Last updated: 2018-11-14

workflowr checks: (Click a bullet for more information)

✔ R Markdown file: up-to-date

Great! Since the R Markdown file has been committed to the Git repository, you know the exact version of the code that produced these results.
✔ Environment: empty

Great job! The global environment was empty. Objects defined in the global environment can affect the analysis in your R Markdown file in unknown ways. For reproduciblity it’s best to always run the code in an empty environment.
✔ Seed: set.seed(12345)

The command set.seed(12345) was run prior to running the code in the R Markdown file. Setting a seed ensures that any results that rely on randomness, e.g. subsampling or permutations, are reproducible.
✔ Session information: recorded

Great job! Recording the operating system, R version, and package versions is critical for reproducibility.

✔ Repository version: f3efe0f

Great! You are using Git for version control. Tracking code development and connecting the code version to the results is critical for reproducibility. The version displayed above was the version of the Git repository at the time these results were generated.

Note that you need to be careful to ensure that all relevant files for the analysis have been committed to Git prior to generating the results (you can use wflow_publish or wflow_git_commit). workflowr only checks the R Markdown file, but you know if there are other scripts or data files that it depends on. Below is the status of the Git repository when the results were generated:


Ignored files:
    Ignored:    .DS_Store
    Ignored:    .Rhistory
    Ignored:    .Rproj.user/
    Ignored:    data/.DS_Store
    Ignored:    output/.DS_Store

Untracked files:
    Untracked:  KalistoAbundance18486.txt
    Untracked:  analysis/ncbiRefSeq_sm.sort.mRNA.bed
    Untracked:  analysis/snake.config.notes.Rmd
    Untracked:  analysis/verifyBAM.Rmd
    Untracked:  data/18486.genecov.txt
    Untracked:  data/APApeaksYL.total.inbrain.bed
    Untracked:  data/ChromHmmOverlap/
    Untracked:  data/GM12878.chromHMM.bed
    Untracked:  data/GM12878.chromHMM.txt
    Untracked:  data/NuclearApaQTLs.txt
    Untracked:  data/PeaksUsed/
    Untracked:  data/RNAkalisto/
    Untracked:  data/TotalApaQTLs.txt
    Untracked:  data/Totalpeaks_filtered_clean.bed
    Untracked:  data/YL-SP-18486-T-combined-genecov.txt
    Untracked:  data/YL-SP-18486-T_S9_R1_001-genecov.txt
    Untracked:  data/apaExamp/
    Untracked:  data/bedgraph_peaks/
    Untracked:  data/bin200.5.T.nuccov.bed
    Untracked:  data/bin200.Anuccov.bed
    Untracked:  data/bin200.nuccov.bed
    Untracked:  data/clean_peaks/
    Untracked:  data/comb_map_stats.csv
    Untracked:  data/comb_map_stats.xlsx
    Untracked:  data/comb_map_stats_39ind.csv
    Untracked:  data/combined_reads_mapped_three_prime_seq.csv
    Untracked:  data/diff_iso_trans/
    Untracked:  data/ensemble_to_genename.txt
    Untracked:  data/example_gene_peakQuant/
    Untracked:  data/filtered_APApeaks_merged_allchrom_refseqTrans.closest2End.bed
    Untracked:  data/filtered_APApeaks_merged_allchrom_refseqTrans.closest2End.noties.bed
    Untracked:  data/first50lines_closest.txt
    Untracked:  data/gencov.test.csv
    Untracked:  data/gencov.test.txt
    Untracked:  data/gencov_zero.test.csv
    Untracked:  data/gencov_zero.test.txt
    Untracked:  data/gene_cov/
    Untracked:  data/joined
    Untracked:  data/leafcutter/
    Untracked:  data/merged_combined_YL-SP-threeprimeseq.bg
    Untracked:  data/mol_overlap/
    Untracked:  data/mol_pheno/
    Untracked:  data/nom_QTL/
    Untracked:  data/nom_QTL_opp/
    Untracked:  data/nom_QTL_trans/
    Untracked:  data/nuc6up/
    Untracked:  data/other_qtls/
    Untracked:  data/peakPerRefSeqGene/
    Untracked:  data/perm_QTL/
    Untracked:  data/perm_QTL_opp/
    Untracked:  data/perm_QTL_trans/
    Untracked:  data/reads_mapped_three_prime_seq.csv
    Untracked:  data/smash.cov.results.bed
    Untracked:  data/smash.cov.results.csv
    Untracked:  data/smash.cov.results.txt
    Untracked:  data/smash_testregion/
    Untracked:  data/ssFC200.cov.bed
    Untracked:  data/temp.file1
    Untracked:  data/temp.file2
    Untracked:  data/temp.gencov.test.txt
    Untracked:  data/temp.gencov_zero.test.txt
    Untracked:  output/picard/
    Untracked:  output/plots/
    Untracked:  output/qual.fig2.pdf

Unstaged changes:
    Modified:   analysis/28ind.peak.explore.Rmd
    Modified:   analysis/39indQC.Rmd
    Modified:   analysis/chromHmm_enrichment.Rmd
    Modified:   analysis/cleanupdtseq.internalpriming.Rmd
    Modified:   analysis/coloc_apaQTLs_protQTLs.Rmd
    Modified:   analysis/dif.iso.usage.leafcutter.Rmd
    Modified:   analysis/diff_iso_pipeline.Rmd
    Modified:   analysis/explore.filters.Rmd
    Modified:   analysis/flash2mash.Rmd
    Modified:   analysis/overlapMolQTL.Rmd
    Modified:   analysis/overlap_qtls.Rmd
    Modified:   analysis/peakOverlap_oppstrand.Rmd
    Modified:   analysis/pheno.leaf.comb.Rmd
    Modified:   analysis/swarmPlots_QTLs.Rmd
    Modified:   analysis/test.max2.Rmd
    Modified:   code/Snakefile

Note that any generated files, e.g. HTML, png, CSS, etc., are not included in this status report because it is ok for generated content to have uncommitted changes.

Expand here to see past versions:

File	Version	Author	Date	Message
Rmd	f3efe0f	Briana Mittleman	2018-11-14	add prop cov plots
html	d4ff15b	Briana Mittleman	2018-11-14	Build site.
Rmd	acc74e7	Briana Mittleman	2018-11-14	add example by genotype plots
html	d254fd2	Briana Mittleman	2018-11-13	Build site.
Rmd	4299126	Briana Mittleman	2018-11-13	investigate which peak is sig
html	e2da5c4	Briana Mittleman	2018-11-13	Build site.
Rmd	f8d8c94	Briana Mittleman	2018-11-13	add plots for peak coverage
html	24821f2	Briana Mittleman	2018-11-12	Build site.
Rmd	7d1bd9a	Briana Mittleman	2018-11-12	add code for looking at sig gene peaks

The quantified peak files are:

/project2/gilad/briana/threeprimeseq/data/filtPeakOppstrand_cov/filtered_APApeaks_merged_allchrom_refseqGenes.Transcript_sm_quant.Nuclear_fixed.fc
/project2/gilad/briana/threeprimeseq/data/filtPeakOppstrand_cov/filtered_APApeaks_merged_allchrom_refseqGenes.Transcript_sm_quant.Total_fixed.fc

I want to grep specific genes and look at the read distribution for peaks along a gene. In these files the peakIDs stil have the peak locations. Before I ran the QTL analysis I changed the final coverage (ex /project2/gilad/briana/threeprimeseq/data/phenotypes_filtPeakTranscript/filtered_APApeaks_merged_allchrom_refseqGenes.Transcript_sm_quant.Nuclear.pheno_fixed.txt.gz.qqnorm_chr$i.gz) to have the TSS as the ID.

Librarys

library(workflowr)

This is workflowr version 1.1.1
Run ?workflowr for help getting started

library(reshape2)
library(tidyverse)

── Attaching packages ─────────────────────────────────────────────────────────────────────────── tidyverse 1.2.1 ──

✔ ggplot2 3.0.0     ✔ purrr   0.2.5
✔ tibble  1.4.2     ✔ dplyr   0.7.6
✔ tidyr   0.8.1     ✔ stringr 1.3.1
✔ readr   1.1.1     ✔ forcats 0.3.0

── Conflicts ────────────────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()

library(VennDiagram)

Loading required package: grid

Loading required package: futile.logger

library(data.table)


Attaching package: 'data.table'

The following objects are masked from 'package:dplyr':

    between, first, last

The following object is masked from 'package:purrr':

    transpose

The following objects are masked from 'package:reshape2':

    dcast, melt

library(ggpubr)

Loading required package: magrittr


Attaching package: 'magrittr'

The following object is masked from 'package:purrr':

    set_names

The following object is masked from 'package:tidyr':

    extract


Attaching package: 'ggpubr'

The following object is masked from 'package:VennDiagram':

    rotate

library(cowplot)


Attaching package: 'cowplot'

The following object is masked from 'package:ggpubr':

    get_legend

The following object is masked from 'package:ggplot2':

    ggsave

nuc_names=c('Geneid',   'Chr',  'Start',    'End',  'Strand',   'Length',   'NA18486'   ,'NA18497', 'NA18500'   ,'NA18505', 'NA18508'   ,'NA18511', 'NA18519',  'NA18520',  'NA18853','NA18858',    'NA18861',  'NA18870'   ,'NA18909'  ,'NA18912'  ,'NA18916', 'NA19092'   ,'NA19093', 'NA19119',  'NA19128'   ,'NA19130', 'NA19131'   ,'NA19137', 'NA19140',  'NA19141'   ,'NA19144', 'NA19152'   ,'NA19153', 'NA19160'   ,'NA19171', 'NA19193'   ,'NA19200', 'NA19207',  'NA19209',  'NA19210',  'NA19223'   ,'NA19225', 'NA19238'   ,'NA19239', 'NA19257')


tot_names=c('Geneid',   'Chr',  'Start',    'End',  'Strand',   'Length',   'NA18486'   ,'NA18497', 'NA18500'   ,'NA18505', 'NA18508'   ,'NA18511', 'NA18519',  'NA18520',  'NA18853','NA18858',    'NA18861',  'NA18870'   ,'NA18909'  ,'NA18912'  ,'NA18916', 'NA19092'   ,'NA19093', 'NA19119',  'NA19128'   ,'NA19130', 'NA19131'   ,'NA19137', 'NA19140',  'NA19141'   ,'NA19144', 'NA19152'   ,'NA19153', 'NA19160'   ,'NA19171', 'NA19193'   ,'NA19200', 'NA19207',  'NA19209',  'NA19210',  'NA19223'   ,'NA19225', 'NA19238'   ,'NA19239', 'NA19257')

NuclearAPA=read.table("../data/perm_QTL_trans/filtered_APApeaks_merged_allchrom_refseqGenes_pheno_Nuclear_transcript_permResBH.txt", stringsAsFactors = F, header=T)  %>% mutate(sig=ifelse(-log10(bh)>=1, 1,0 )) %>%  separate(pid, sep = ":", into=c("chr", "start", "end", "id")) %>% separate(id, sep = "_", into=c("gene", "strand", "peak")) %>% filter(sig==1)
totalAPA=read.table("../data/perm_QTL_trans/filtered_APApeaks_merged_allchrom_refseqGenes_pheno_Total_transcript_permResBH.txt", stringsAsFactors = F, header=T)  %>% mutate(sig=ifelse(-log10(bh)>=1, 1,0 )) %>%  separate(pid, sep = ":", into=c("chr", "start", "end", "id")) %>% separate(id, sep = "_", into=c("gene", "strand", "peak")) %>% filter(sig==1)

examples to look at Nuclear: IRF5, HSF1, NOL9,DCAF16,

Total: NBEAL2, SACM1L, COX7A2L


#nuclear
grep IRF5 /project2/gilad/briana/threeprimeseq/data/filtPeakOppstrand_cov/filtered_APApeaks_merged_allchrom_refseqGenes.Transcript_sm_quant.Nuclear_fixed.fc > /project2/gilad/briana/threeprimeseq/data/example_gene_peakQuant/IRF5_NuclearCov_peaks.txt

grep HSF1 /project2/gilad/briana/threeprimeseq/data/filtPeakOppstrand_cov/filtered_APApeaks_merged_allchrom_refseqGenes.Transcript_sm_quant.Nuclear_fixed.fc > /project2/gilad/briana/threeprimeseq/data/example_gene_peakQuant/HSF1_NuclearCov_peaks.txt


grep NOL9 /project2/gilad/briana/threeprimeseq/data/filtPeakOppstrand_cov/filtered_APApeaks_merged_allchrom_refseqGenes.Transcript_sm_quant.Nuclear_fixed.fc > /project2/gilad/briana/threeprimeseq/data/example_gene_peakQuant/NOL9_NuclearCov_peaks.txt

grep DCAF16 /project2/gilad/briana/threeprimeseq/data/filtPeakOppstrand_cov/filtered_APApeaks_merged_allchrom_refseqGenes.Transcript_sm_quant.Nuclear_fixed.fc > /project2/gilad/briana/threeprimeseq/data/example_gene_peakQuant/DCAF16_NuclearCov_peaks.txt

grep PPP4C /project2/gilad/briana/threeprimeseq/data/filtPeakOppstrand_cov/filtered_APApeaks_merged_allchrom_refseqGenes.Transcript_sm_quant.Nuclear_fixed.fc > /project2/gilad/briana/threeprimeseq/data/example_gene_peakQuant/PPP4C_NuclearCov_peaks.txt



#total
grep NBEAL2 /project2/gilad/briana/threeprimeseq/data/filtPeakOppstrand_cov/filtered_APApeaks_merged_allchrom_refseqGenes.Transcript_sm_quant.Total_fixed.fc > /project2/gilad/briana/threeprimeseq/data/example_gene_peakQuant/NBEAL2_TotalCov_peaks.txt

grep SACM1L /project2/gilad/briana/threeprimeseq/data/filtPeakOppstrand_cov/filtered_APApeaks_merged_allchrom_refseqGenes.Transcript_sm_quant.Total_fixed.fc > /project2/gilad/briana/threeprimeseq/data/example_gene_peakQuant/SACM1L_TotalCov_peaks.txt



grep TESK1 /project2/gilad/briana/threeprimeseq/data/filtPeakOppstrand_cov/filtered_APApeaks_merged_allchrom_refseqGenes.Transcript_sm_quant.Total_fixed.fc > /project2/gilad/briana/threeprimeseq/data/example_gene_peakQuant/TESK1_TotalCov_peaks.txt  


grep DGCR14 /project2/gilad/briana/threeprimeseq/data/filtPeakOppstrand_cov/filtered_APApeaks_merged_allchrom_refseqGenes.Transcript_sm_quant.Total_fixed.fc > /project2/gilad/briana/threeprimeseq/data/example_gene_peakQuant/DGCR14_TotalCov_peaks.txt

Copy these to my computer so I can work with them here. I am going to want to make a function that makes the histogram reproducibly for anyfile. I will need to know how many bins to include in the histogram. First I will make the graph for one example then I will make it more general.

Files are in /Users/bmittleman1/Documents/Gilad_lab/threeprimeseq/data/example_gene_peakQuant

Start wit a small file.

pos=c(3,4,7:39)
PPP4c=read.table("../data/example_gene_peakQuant/PPP4C_NuclearCov_peaks.txt", stringsAsFactors = F, col.names = nuc_names) %>% select(pos) 

PPP4c$peaks=seq(0, (nrow(PPP4c)-1))

PPP4c_melt=melt(PPP4c, id.vars=c('peaks','Start','End'))

Plot:

ggplot(PPP4c_melt, aes(x=peaks, y=value, by=variable, fill=variable)) + geom_histogram(stat="identity")

Warning: Ignoring unknown parameters: binwidth, bins, pad

Expand here to see past versions of unnamed-chunk-6-1.png:

Version	Author	Date
d4ff15b	Briana Mittleman	2018-11-14
e2da5c4	Briana Mittleman	2018-11-13

Try with actual location as the center of the peak.

pos=c(3,4,7:39)
PPP4c_2=read.table("../data/example_gene_peakQuant/PPP4C_NuclearCov_peaks.txt", stringsAsFactors = F, col.names = tot_names) %>% select(pos)
PPP4c_2$peaks=seq(0, (nrow(PPP4c_2)-1))
PPP4c_2= PPP4c_2 %>% mutate(PeakCenter=(Start+ (End-Start)/2))
PPP4c2_melt=melt(PPP4c_2, id.vars=c('peaks','PeakCenter', "Start", "End"))
colnames(PPP4c2_melt)= c('peaks','PeakCenter', "Start", "End", "Individual", "ReadCount")

Plot:

ggplot(PPP4c2_melt, aes(x=PeakCenter, y=ReadCount, by=Individual, fill=Individual)) + geom_histogram(stat="identity") + labs(title="Peak Coverage and Location PP4c")

Warning: Ignoring unknown parameters: binwidth, bins, pad

Expand here to see past versions of unnamed-chunk-8-1.png:

Version	Author	Date
d4ff15b	Briana Mittleman	2018-11-14
e2da5c4	Briana Mittleman	2018-11-13

Generalize this for more genes:

makePeakLocplot=function(file, geneName,fraction){
  pos=c(3,4,7:39)
  if (fraction=="Total"){
  gene=read.table(file, stringsAsFactors = F, col.names = tot_names) %>% select(pos)
  }
  else{
    gene=read.table(file, stringsAsFactors = F, col.names = nuc_names) %>% select(pos)
  }
  
  gene$peaks=seq(0, (nrow(gene)-1))
  gene= gene %>% mutate(PeakCenter=(Start+ (End-Start)/2))
  gene_melt=melt(gene, id.vars=c('peaks','PeakCenter', "Start", "End"))
  colnames(gene_melt)= c('peaks','PeakCenter', "Start", "End", "Individual", "ReadCount")
  finalplot=ggplot(gene_melt, aes(x=PeakCenter, y=ReadCount, by=Individual, fill=Individual)) + geom_histogram(stat="identity", show.legend = FALSE) + labs(title=paste("Peak Coverage and Location", geneName, sep = " ")) 
  return(finalplot)
}

Try for another gene:

makePeakLocplot("../data/example_gene_peakQuant/PPP4C_NuclearCov_peaks.txt",'PPP4c',"Nuclear")

Warning: Ignoring unknown parameters: binwidth, bins, pad

Expand here to see past versions of unnamed-chunk-10-1.png:

Version	Author	Date
e2da5c4	Briana Mittleman	2018-11-13

makePeakLocplot("../data/example_gene_peakQuant/DCAF16_NuclearCov_peaks.txt",'DCAF16',"Nuclear")

Warning: Ignoring unknown parameters: binwidth, bins, pad

Expand here to see past versions of unnamed-chunk-11-1.png:

Version	Author	Date
e2da5c4	Briana Mittleman	2018-11-13

makePeakLocplot("../data/example_gene_peakQuant/DGCR14_TotalCov_peaks.txt",'DGCR14',"Total")

Warning: Ignoring unknown parameters: binwidth, bins, pad

Expand here to see past versions of unnamed-chunk-12-1.png:

Version	Author	Date
e2da5c4	Briana Mittleman	2018-11-13

makePeakLocplot("../data/example_gene_peakQuant/IRF5_NuclearCov_peaks.txt",'IRF5',"Nuclear")

Warning: Ignoring unknown parameters: binwidth, bins, pad

Expand here to see past versions of unnamed-chunk-13-1.png:

Version	Author	Date
e2da5c4	Briana Mittleman	2018-11-13

makePeakLocplot("../data/example_gene_peakQuant/HSF1_NuclearCov_peaks.txt",'HSF1',"Nuclear")

Warning: Ignoring unknown parameters: binwidth, bins, pad

Expand here to see past versions of unnamed-chunk-14-1.png:

Version	Author	Date
e2da5c4	Briana Mittleman	2018-11-13

makePeakLocplot("../data/example_gene_peakQuant/NOL9_NuclearCov_peaks.txt",'NOL9',"Nuclear")

Warning: Ignoring unknown parameters: binwidth, bins, pad

Expand here to see past versions of unnamed-chunk-15-1.png:

Version	Author	Date
e2da5c4	Briana Mittleman	2018-11-13

makePeakLocplot("../data/example_gene_peakQuant/SACM1L_TotalCov_peaks.txt",'SACM1L',"Total")

Warning: Ignoring unknown parameters: binwidth, bins, pad

Expand here to see past versions of unnamed-chunk-16-1.png:

Version	Author	Date
e2da5c4	Briana Mittleman	2018-11-13

Make a function to do this by peak number (ignoring direction)

makePeakNumplot=function(file, geneName,fraction){
  pos=c(7:39)
  if (fraction=="Total"){
  gene=read.table(file, stringsAsFactors = F, col.names = tot_names) %>% select(pos)
  }
  else{
    gene=read.table(file, stringsAsFactors = F, col.names = nuc_names) %>% select(pos)
  }
  
  gene$peaks=seq(0, (nrow(gene)-1))
  gene_melt=melt(gene, id.vars=c('peaks'))
  colnames(gene_melt)= c('peaks',"Individual", "ReadCount")
  finalplot=ggplot(gene_melt, aes(x=peaks, y=ReadCount, by=Individual, fill=Individual)) + geom_histogram(stat="identity", show.legend = FALSE) + labs(title=paste("Peak Coverage", geneName, sep = " ")) 
  return(finalplot)
}

I can plot them next to eachother using cowplot

ppp4c_loc=makePeakLocplot("../data/example_gene_peakQuant/PPP4C_NuclearCov_peaks.txt",'PPP4c',"Nuclear")

Warning: Ignoring unknown parameters: binwidth, bins, pad

ppp4c_num=makePeakNumplot("../data/example_gene_peakQuant/PPP4C_NuclearCov_peaks.txt",'PPP4c',"Nuclear")

Warning: Ignoring unknown parameters: binwidth, bins, pad

plot_grid(ppp4c_loc,ppp4c_num)

Expand here to see past versions of unnamed-chunk-18-1.png:

Version	Author	Date
e2da5c4	Briana Mittleman	2018-11-13

dcaf16_loc=makePeakLocplot("../data/example_gene_peakQuant/DCAF16_NuclearCov_peaks.txt",'DCAF16',"Nuclear")

Warning: Ignoring unknown parameters: binwidth, bins, pad

dcaf16_num=makePeakNumplot("../data/example_gene_peakQuant/DCAF16_NuclearCov_peaks.txt",'DCAF16',"Nuclear")

Warning: Ignoring unknown parameters: binwidth, bins, pad

plot_grid(dcaf16_loc,dcaf16_num)

Expand here to see past versions of unnamed-chunk-19-1.png:

Version	Author	Date
e2da5c4	Briana Mittleman	2018-11-13

dgcr14_loc=makePeakLocplot("../data/example_gene_peakQuant/DGCR14_TotalCov_peaks.txt",'DGCR14',"Total")

Warning: Ignoring unknown parameters: binwidth, bins, pad

dgcr14_num=makePeakNumplot("../data/example_gene_peakQuant/DGCR14_TotalCov_peaks.txt",'DGCR14',"Total")

Warning: Ignoring unknown parameters: binwidth, bins, pad

plot_grid(dgcr14_loc,dgcr14_num)

Expand here to see past versions of unnamed-chunk-20-1.png:

Version	Author	Date
e2da5c4	Briana Mittleman	2018-11-13

irf5_loc=makePeakLocplot("../data/example_gene_peakQuant/IRF5_NuclearCov_peaks.txt",'IRF5',"Nuclear")

Warning: Ignoring unknown parameters: binwidth, bins, pad

irf5_num=makePeakNumplot("../data/example_gene_peakQuant/IRF5_NuclearCov_peaks.txt",'IRF5',"Nuclear")

Warning: Ignoring unknown parameters: binwidth, bins, pad

plot_grid(irf5_loc,irf5_num)

Expand here to see past versions of unnamed-chunk-21-1.png:

Version	Author	Date
e2da5c4	Briana Mittleman	2018-11-13

HSF1_loc=makePeakLocplot("../data/example_gene_peakQuant/HSF1_NuclearCov_peaks.txt",'HSF1',"Nuclear")

Warning: Ignoring unknown parameters: binwidth, bins, pad

HSF1_num=makePeakNumplot("../data/example_gene_peakQuant/HSF1_NuclearCov_peaks.txt",'HSF1',"Nuclear")

Warning: Ignoring unknown parameters: binwidth, bins, pad

plot_grid(HSF1_loc,HSF1_num)

Expand here to see past versions of unnamed-chunk-22-1.png:

Version	Author	Date
e2da5c4	Briana Mittleman	2018-11-13

NOL9_loc=makePeakLocplot("../data/example_gene_peakQuant/NOL9_NuclearCov_peaks.txt",'NOL9',"Nuclear")

Warning: Ignoring unknown parameters: binwidth, bins, pad

NOL9_num=makePeakNumplot("../data/example_gene_peakQuant/NOL9_NuclearCov_peaks.txt",'NOL9',"Nuclear")

Warning: Ignoring unknown parameters: binwidth, bins, pad

plot_grid(NOL9_loc,NOL9_num)

Expand here to see past versions of unnamed-chunk-23-1.png:

Version	Author	Date
e2da5c4	Briana Mittleman	2018-11-13

SACM1L_loc=makePeakLocplot("../data/example_gene_peakQuant/SACM1L_TotalCov_peaks.txt",'SACM1L',"Total")

Warning: Ignoring unknown parameters: binwidth, bins, pad

SACM1L_num=makePeakNumplot("../data/example_gene_peakQuant/SACM1L_TotalCov_peaks.txt",'SACM1L',"Total")

Warning: Ignoring unknown parameters: binwidth, bins, pad

plot_grid(SACM1L_loc,SACM1L_num)

Expand here to see past versions of unnamed-chunk-24-1.png:

Version	Author	Date
e2da5c4	Briana Mittleman	2018-11-13

NBEAL2_loc=makePeakLocplot("../data/example_gene_peakQuant/NBEAL2_TotalCov_peaks.txt",'NBEAL2',"Total")

Warning: Ignoring unknown parameters: binwidth, bins, pad

NBEAL2_num=makePeakNumplot("../data/example_gene_peakQuant/NBEAL2_TotalCov_peaks.txt",'NBEAL2',"Total")

Warning: Ignoring unknown parameters: binwidth, bins, pad

plot_grid(NBEAL2_loc,NBEAL2_num)

Expand here to see past versions of unnamed-chunk-25-1.png:

Version	Author	Date
d254fd2	Briana Mittleman	2018-11-13

Which Peak is Sig

It would be interesting to know which peak in these gene plots is associated with the QTL.

Nuclear: * IRF5 : peak305794-7:128635754, peak305795,128681297, peak305798-7:128661132

IRF5_all=read.table("../data/example_gene_peakQuant/IRF5_NuclearCov_peaks.txt", col.names = nuc_names)

peak305794-peak 4 peak305795-peak 5 peak305798-peak 6

HSF1: peak323832- 8:145516593

HSF1_all=read.table("../data/example_gene_peakQuant/HSF1_NuclearCov_peaks.txt", col.names = nuc_names)

The QTL is the first peak. (peak 0)

NOL9: peak702- 1:6604621

NOL9_all=read.table("../data/example_gene_peakQuant/NOL9_NuclearCov_peaks.txt", col.names = nuc_names)

QTL is peak 7 in the graph

DCAF16: peak236311- 4:17797455

DCAF16_all=read.table("../data/example_gene_peakQuant/DCAF16_NuclearCov_peaks.txt", col.names = nuc_names)

QTL is peak 3 in graph

PPP4C: peak122195-16:30482494

pprc_all=read.table("../data/example_gene_peakQuant/PPP4C_NuclearCov_peaks.txt", col.names = nuc_names)

The QTL peak is the lower expressed peak (peak1 in graph)

Total: * NBEAL2: peak216374- 3:47080127

NBEAL2_all=read.table("../data/example_gene_peakQuant/NBEAL2_TotalCov_peaks.txt", col.names = tot_names)

peak 15 in graph

SACM1L: peak216084-3:45780980, peak216086-3:45780980, peak216087-3:45790569

SACM1L_all=read.table("../data/example_gene_peakQuant/SACM1L_TotalCov_peaks.txt", col.names = tot_names)

peak216084-12

peak216086 - 14 (major peak)

peak216087 -15

DGCR14: peak204736-22:18647341

DGCR14_all=read.table("../data/example_gene_peakQuant/DGCR14_TotalCov_peaks.txt", col.names = tot_names)

peak204736- peak 7

This has shown me that most of the QTL peaks are not the major/most used peak. This leads me to beleive I would get different QTLs if I made one metric per gene because I may ont be able to capture these effects.

Seperate by genotype

It would be good to look at these seperated by genotype.

IRF5 : peak305794-7:128635754, peak305795,7:128681297, peak305798-7:128661132

geno_names=c('CHROM', 'POS', 'snpID', 'REF', 'ALT', 'QUAL', 'FILTER', 'INFO', 'FORMAT', 'NA18486', 'NA18487', 'NA18488', 'NA18489', 'NA18498', 'NA18499', 'NA18501', 'NA18502', 'NA18504', 'NA18505', 'NA18507', 'NA18508', 'NA18510', 'NA18511', 'NA18516', 'NA18517', 'NA18519', 'NA18520', 'NA18522', 'NA18523', 'NA18852', 'NA18853', 'NA18855', 'NA18856', 'NA18858', 'NA18859', 'NA18861', 'NA18862', 'NA18867', 'NA18868', 'NA18870', 'NA18871', 'NA18873', 'NA18874', 'NA18907', 'NA18909', 'NA18910', 'NA18912', 'NA18913', 'NA18916', 'NA18917', 'NA18923', 'NA18924', 'NA18933', 'NA18934', 'NA19093', 'NA19095', 'NA19096', 'NA19098', 'NA19099', 'NA19101', 'NA19102', 'NA19107', 'NA19108', 'NA19113', 'NA19114', 'NA19116', 'NA19117', 'NA19118', 'NA19119', 'NA19121', 'NA19122', 'NA19127', 'NA19128', 'NA19129', 'NA19130', 'NA19131', 'NA19137', 'NA19138', 'NA19140', 'NA19141', 'NA19143', 'NA19144', 'NA19146', 'NA19147', 'NA19149', 'NA19150', 'NA19152', 'NA19153', 'NA19159', 'NA19160', 'NA19171', 'NA19172', 'NA19175', 'NA19176', 'NA19184', 'NA19185', 'NA19189', 'NA19190', 'NA19197', 'NA19198', 'NA19200', 'NA19201', 'NA19203', 'NA19204', 'NA19206', 'NA19207', 'NA19209', 'NA19210', 'NA19213', 'NA19214', 'NA19222', 'NA19223', 'NA19225', 'NA19226', 'NA19235', 'NA19236', 'NA19238', 'NA19239', 'NA19247', 'NA19248', 'NA19256', 'NA19257')
#the samples I ran the QTLs for
samples=c('NA18486','NA18505','NA18508','NA18511','NA18519','NA18520','NA18853','NA18858','NA18861','NA18870','NA18909','NA18912','NA18916','NA19093','NA19119','NA19128','NA19130','NA19131','NA19137','NA19140','NA19141','NA19144','NA19152','NA19153','NA19160','NA19171','NA19200','NA19207','NA19209','NA19210','NA19223','NA19225','NA19238','NA19239','NA19257')

#grep the genotpe file results to /project2/gilad/briana/threeprimeseq/data/example_gene_peakQuant/
grep 7:128635754 chr7.dose.filt.vcf > /project2/gilad/briana/threeprimeseq/data/example_gene_peakQuant/Genotypes_7:128635754.txt
grep 7:128681297 chr7.dose.filt.vcf > /project2/gilad/briana/threeprimeseq/data/example_gene_peakQuant/Genotypes_7:128681297.txt
grep 7:128661132 chr7.dose.filt.vcf > /project2/gilad/briana/threeprimeseq/data/example_gene_peakQuant/Genotypes_7:128661132.txt

Transfer to computer:

Make a function to take a file and format it the way I can use it.

#this gives me 35x1 data frame with the genotpes for each ind at this snp.  
prepare_genotpes=function(file, genName=geno_names, samp=samples){
  geno=read.table(file, col.names=genName, stringsAsFactors = F) %>% select(one_of(samp))
  geno_dose=apply(geno, 2, function(y)sapply(y, function(x)as.integer(strsplit(x,":")[[1]][[2]])))
  geno_dose=as.data.frame(geno_dose) %>% rownames_to_column(var="individual")
  return(geno_dose)
}

chr7_128681297= prepare_genotpes("../data/example_gene_peakQuant/Genotypes_7:128681297.txt")

I want a dataframe that has individual, genotype, then all of the peaks. I also need to remove individuals not in that genotype file.

IRF5_pheno=IRF5_all%>% select(one_of(samples))
row.names(IRF5_pheno)=paste("IRF5_peak", seq(1,nrow(IRF5_all)),sep="_")

IRF5_pheno= IRF5_pheno %>% t
IRF5_pheno= as.data.frame(IRF5_pheno) %>% rownames_to_column(var="individual")
IRF5_pheno_geno=IRF5_pheno %>% inner_join(chr7_128681297, by="individual")
IRF5_pheno_geno_melt= melt(IRF5_pheno_geno, id.vars=c("geno_dose", "individual")) %>% group_by(variable,geno_dose) %>% summarise(mean=mean(value),sd=sd(value))

IRF5_pheno_geno_melt$geno_dose=as.factor(IRF5_pheno_geno_melt$geno_dose)

irf5_readplot=ggplot(IRF5_pheno_geno_melt,aes(x=variable, y=mean, by=geno_dose, fill=geno_dose)) + geom_bar(stat="identity") +theme(axis.text.x = element_text(angle = 90, hjust = 1)) + labs(y="Mean read count", x="Peak", title="IRF5 peaks by chr7:128681297 genotype") +  annotate("pointrange", x = 6, y = 750, ymin = 750, ymax = 750,
  colour = "black", size = 1.5)

Try this with a different gene.

#grep the genotpe file results to /project2/gilad/briana/threeprimeseq/data/example_gene_peakQuant/
grep 16:30482494 chr16.dose.filt.vcf > /project2/gilad/briana/threeprimeseq/data/example_gene_peakQuant/Genotypes_16:30482494.txt

chr16_30482494= prepare_genotpes("../data/example_gene_peakQuant/Genotypes_16:30482494.txt")


pprc_pheno=pprc_all%>% select(one_of(samples))
row.names(pprc_pheno)=paste("PPRC_peak", seq(1,nrow(pprc_all)),sep="_")

pprc_pheno= pprc_pheno %>% t
pprc_pheno= as.data.frame(pprc_pheno) %>% rownames_to_column(var="individual")
pprc_pheno_geno=pprc_pheno %>% inner_join(chr16_30482494, by="individual")
pprc_pheno_geno_melt= melt(pprc_pheno_geno, id.vars=c("geno_dose", "individual")) %>% group_by(variable,geno_dose) %>% summarise(mean=mean(value),sd=sd(value))

pprc_pheno_geno_melt$geno_dose=as.factor(pprc_pheno_geno_melt$geno_dose)

Plot

pprc_countplot=ggplot(pprc_pheno_geno_melt,aes(x=variable, y=mean, by=geno_dose, fill=geno_dose)) + geom_bar(stat="identity") +theme(axis.text.x = element_text(angle = 90, hjust = 1))+ labs(y="Mean read count", x="Peak", title="PPRC peaks by 16:30482494 genotype") +  annotate("pointrange", x = 2, y = 20, ymin = 20, ymax = 20,colour = "black", size = .5)

I want to see if this looks similar when I use the normalized usage from leafcutter (what the QTLs actually ran on)

I am going to grab both the perc. and normalized.



#nuclear
less /project2/gilad/briana/threeprimeseq/data/phenotypes_filtPeakTranscript/filtered_APApeaks_merged_allchrom_refseqGenes.Transcript_sm_quant.Nuclear.pheno_fixed.txt.gz.qqnorm_chr7.gz | grep IRF5 > /project2/gilad/briana/threeprimeseq/data/example_gene_peakQuant/IRF5_NuclearPropCovNorm_peaks.txt

grep IRF5 /project2/gilad/briana/threeprimeseq/data/phenotypes_filtPeakTranscript/filtered_APApeaks_merged_allchrom_refseqGenes.Transcript_sm_quant.Nuclear.pheno_fixed.txt.gz.phen_chr7 > /project2/gilad/briana/threeprimeseq/data/example_gene_peakQuant/IRF5_NuclearPropCov_peaks.txt


less /project2/gilad/briana/threeprimeseq/data/phenotypes_filtPeakTranscript/filtered_APApeaks_merged_allchrom_refseqGenes.Transcript_sm_quant.Nuclear.pheno_fixed.txt.gz.qqnorm_chr16.gz | grep PPP4C > /project2/gilad/briana/threeprimeseq/data/example_gene_peakQuant/PPP4C_NuclearPropCovNorm_peaks.txt

grep PPP4C /project2/gilad/briana/threeprimeseq/data/phenotypes_filtPeakTranscript/filtered_APApeaks_merged_allchrom_refseqGenes.Transcript_sm_quant.Nuclear.pheno_fixed.txt.gz.phen_chr16 > /project2/gilad/briana/threeprimeseq/data/example_gene_peakQuant/PPP4C_NuclearPropCov_peaks.txt

cov_names=c('chr',  'start',    'end',  'PeakID','NA18486'  ,'NA18497', 'NA18500'   ,'NA18505', 'NA18508'   ,'NA18511', 'NA18519',  'NA18520',  'NA18853','NA18858',    'NA18861',  'NA18870'   ,'NA18909'  ,'NA18912'  ,'NA18916', 'NA19092'   ,'NA19093', 'NA19119',  'NA19128'   ,'NA19130', 'NA19131'   ,'NA19137', 'NA19140',  'NA19141'   ,'NA19144', 'NA19152'   ,'NA19153', 'NA19160'   ,'NA19171', 'NA19193'   ,'NA19200', 'NA19207',  'NA19209',  'NA19210',  'NA19223'   ,'NA19225', 'NA19238'   ,'NA19239', 'NA19257')

IRF5cov_all=read.table("../data/example_gene_peakQuant/IRF5_NuclearPropCov_peaks.txt", col.names = cov_names)
IRF5covNorm_all=read.table("../data/example_gene_peakQuant/IRF5_NuclearPropCovNorm_peaks.txt", col.names = cov_names)


IRF5cov_pheno=IRF5cov_all%>% select(one_of(samples))
row.names(IRF5cov_pheno)=paste("IRF5_peak", seq(1,nrow(IRF5cov_all)),sep="_")

IRF5cov_pheno= IRF5cov_pheno %>% t
IRF5cov_pheno= as.data.frame(IRF5cov_pheno) %>% rownames_to_column(var="individual")
IRF5cov_pheno_geno=IRF5cov_pheno %>% inner_join(chr7_128681297, by="individual")
IRF5cov_pheno_geno_melt= melt(IRF5cov_pheno_geno, id.vars=c("geno_dose", "individual")) %>% group_by(variable,geno_dose) %>% summarise(mean=mean(value),sd=sd(value))

IRF5cov_pheno_geno_melt$geno_dose=as.factor(IRF5cov_pheno_geno_melt$geno_dose)


irf5_covplot=ggplot(IRF5cov_pheno_geno_melt,aes(x=variable, y=mean, by=geno_dose, fill=geno_dose)) + geom_bar(stat="identity",position = "dodge") +theme(axis.text.x = element_text(angle = 90, hjust = 1)) +labs(y="Mean Coverage Prop", x="Peak", title="IRF5 peaks by chr7:128681297 genotype") + annotate("pointrange", x = 6, y = .75, ymin = .750, ymax = .750,colour = "black", size = .5)

IRF5covNorm_pheno=IRF5covNorm_all%>% select(one_of(samples))
row.names(IRF5covNorm_pheno)=paste("IRF5_peak", seq(1,nrow(IRF5covNorm_all)),sep="_")

IRF5covNorm_pheno= IRF5covNorm_pheno %>% t
IRF5covNorm_pheno= as.data.frame(IRF5covNorm_pheno) %>% rownames_to_column(var="individual")
IRF5covNorm_pheno_geno=IRF5covNorm_pheno %>% inner_join(chr7_128681297, by="individual")
IRF5covNorm_pheno_geno_melt= melt(IRF5covNorm_pheno_geno, id.vars=c("geno_dose", "individual")) %>% group_by(variable,geno_dose) %>% summarise(mean=mean(value),sd=sd(value))

IRF5covNorm_pheno_geno_melt$geno_dose=as.factor(IRF5covNorm_pheno_geno_melt$geno_dose)


irf5_normplot=ggplot(IRF5covNorm_pheno_geno_melt,aes(x=variable, y=mean, by=geno_dose, fill=geno_dose)) + geom_bar(stat="identity",position = "dodge") +theme(axis.text.x = element_text(angle = 90, hjust = 1)) +labs(y="Mean Normalized Coverage Prop", x="Peak", title="IRF5 peaks by chr7:128681297 genotype") + annotate("pointrange", x = 6, y = .75, ymin = .750, ymax = .750,colour = "black", size = .5)

pprccov_all=read.table("../data/example_gene_peakQuant/PPP4C_NuclearPropCov_peaks.txt", col.names = cov_names)
pprccovNorm_all=read.table("../data/example_gene_peakQuant/PPP4C_NuclearPropCovNorm_peaks.txt", col.names = cov_names)


pprcCov_pheno=pprccov_all%>% select(one_of(samples))
row.names(pprcCov_pheno)=paste("PPRC_peak", seq(1,nrow(pprccov_all)),sep="_")

pprcCov_pheno= pprcCov_pheno %>% t
pprcCov_pheno= as.data.frame(pprcCov_pheno) %>% rownames_to_column(var="individual")
pprcCov_pheno_geno=pprcCov_pheno %>% inner_join(chr16_30482494, by="individual")
pprcCov_pheno_geno_melt= melt(pprcCov_pheno_geno, id.vars=c("geno_dose", "individual")) %>% group_by(variable,geno_dose) %>% summarise(mean=mean(value),sd=sd(value))

pprcCov_pheno_geno_melt$geno_dose=as.factor(pprcCov_pheno_geno_melt$geno_dose)

pprc_covplot=ggplot(pprcCov_pheno_geno_melt,aes(x=variable, y=mean, by=geno_dose, fill=geno_dose)) + geom_bar(stat="identity",position = 'dodge') +theme(axis.text.x = element_text(angle = 90, hjust = 1))+ labs(y="Mean coverage proportion", x="Peak", title="PPRC peaks by 16:30482494 genotype") + annotate("pointrange", x = 2, y = .4, ymin = .4, ymax = .4,colour = "black", size = .5)

pprcCovNorm_pheno=pprccovNorm_all%>% select(one_of(samples))
row.names(pprcCovNorm_pheno)=paste("PPRC_peak", seq(1,nrow(pprccovNorm_all)),sep="_")

pprcCovNorm_pheno= pprcCovNorm_pheno %>% t
pprcCovNorm_pheno= as.data.frame(pprcCovNorm_pheno) %>% rownames_to_column(var="individual")
pprcCovNorm_pheno_geno=pprcCovNorm_pheno %>% inner_join(chr16_30482494, by="individual")
pprcCovNorm_pheno_geno_melt= melt(pprcCovNorm_pheno_geno, id.vars=c("geno_dose", "individual")) %>% group_by(variable,geno_dose) %>% summarise(mean=mean(value),sd=sd(value))

pprcCovNorm_pheno_geno_melt$geno_dose=as.factor(pprcCovNorm_pheno_geno_melt$geno_dose)

pprc_normplot=ggplot(pprcCovNorm_pheno_geno_melt,aes(x=variable, y=mean, by=geno_dose, fill=geno_dose)) + geom_bar(stat="identity",position = 'dodge') +theme(axis.text.x = element_text(angle = 90, hjust = 1))+ labs(y="Mean coverage proportion", x="Peak", title="PPRC peaks by 16:30482494 genotype") + annotate("pointrange", x = 2, y = 1, ymin = 1, ymax = 1,colour = "black", size = .5)

plot_grid(irf5_readplot,irf5_covplot,irf5_normplot)

plot_grid(pprc_countplot,pprc_covplot,pprc_normplot)

Session information

sessionInfo()

R version 3.5.1 (2018-07-02)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS Sierra 10.12.6

Matrix products: default
BLAS: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRblas.0.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] grid      stats     graphics  grDevices utils     datasets  methods  
[8] base     

other attached packages:
 [1] bindrcpp_0.2.2      cowplot_0.9.3       ggpubr_0.1.8       
 [4] magrittr_1.5        data.table_1.11.8   VennDiagram_1.6.20 
 [7] futile.logger_1.4.3 forcats_0.3.0       stringr_1.3.1      
[10] dplyr_0.7.6         purrr_0.2.5         readr_1.1.1        
[13] tidyr_0.8.1         tibble_1.4.2        ggplot2_3.0.0      
[16] tidyverse_1.2.1     reshape2_1.4.3      workflowr_1.1.1    

loaded via a namespace (and not attached):
 [1] tidyselect_0.2.4     haven_1.1.2          lattice_0.20-35     
 [4] colorspace_1.3-2     htmltools_0.3.6      yaml_2.2.0          
 [7] rlang_0.2.2          R.oo_1.22.0          pillar_1.3.0        
[10] glue_1.3.0           withr_2.1.2          R.utils_2.7.0       
[13] lambda.r_1.2.3       modelr_0.1.2         readxl_1.1.0        
[16] bindr_0.1.1          plyr_1.8.4           munsell_0.5.0       
[19] gtable_0.2.0         cellranger_1.1.0     rvest_0.3.2         
[22] R.methodsS3_1.7.1    evaluate_0.11        labeling_0.3        
[25] knitr_1.20           broom_0.5.0          Rcpp_0.12.19        
[28] formatR_1.5          backports_1.1.2      scales_1.0.0        
[31] jsonlite_1.5         hms_0.4.2            digest_0.6.17       
[34] stringi_1.2.4        rprojroot_1.3-2      cli_1.0.1           
[37] tools_3.5.1          lazyeval_0.2.1       futile.options_1.0.1
[40] crayon_1.3.4         whisker_0.3-2        pkgconfig_2.0.2     
[43] xml2_1.2.0           lubridate_1.7.4      assertthat_0.2.0    
[46] rmarkdown_1.10       httr_1.3.1           rstudioapi_0.8      
[49] R6_2.3.0             nlme_3.1-137         git2r_0.23.0        
[52] compiler_3.5.1

This reproducible R Markdown analysis was created with workflowr 1.1.1

Peak coverage along Gene examples

Briana Mittleman

11/12/2018

Which Peak is Sig

Seperate by genotype

Session information