Last updated: 2022-03-07

Checks: 7 0

Knit directory: chipseq-cross-species/

This reproducible R Markdown analysis was created with workflowr (version 1.7.0). The Checks tab describes the reproducibility checks that were applied when the results were created. The Past versions tab lists the development history.

R Markdown file: up-to-date

Great! Since the R Markdown file has been committed to the Git repository, you know the exact version of the code that produced these results.

Environment: empty

Great job! The global environment was empty. Objects defined in the global environment can affect the analysis in your R Markdown file in unknown ways. For reproduciblity it’s best to always run the code in an empty environment.

Seed: set.seed(20220209)

The command set.seed(20220209) was run prior to running the code in the R Markdown file. Setting a seed ensures that any results that rely on randomness, e.g. subsampling or permutations, are reproducible.

Session information: recorded

Great job! Recording the operating system, R version, and package versions is critical for reproducibility.

Cache: none

Nice! There were no cached chunks for this analysis, so you can be confident that you successfully produced the results during this run.

File paths: relative

Great job! Using relative paths to the files within your workflowr project makes it easier to run your code on other machines.

Repository version: 3031d6c

Great! You are using Git for version control. Tracking code development and connecting the code version to the results is critical for reproducibility.

The results in this page were generated with repository version 3031d6c. See the Past versions tab to see a history of the changes made to the R Markdown and HTML files.

Note that you need to be careful to ensure that all relevant files for the analysis have been committed to Git prior to generating the results (you can use wflow_publish or wflow_git_commit). workflowr only checks the R Markdown file, but you know if there are other scripts or data files that it depends on. Below is the status of the Git repository when the results were generated:


Ignored files:
    Ignored:    .RData
    Ignored:    .Rhistory
    Ignored:    analysis/.RData
    Ignored:    analysis/.Rhistory
    Ignored:    analysis/figure/
    Ignored:    data/genomic_data/
    Ignored:    output/annotations/
    Ignored:    output/bam_files/
    Ignored:    output/filtered_peaks/
    Ignored:    output/logs/
    Ignored:    output/peaks/
    Ignored:    output/plots/
    Ignored:    output/qc/A-1_input.SeqDepthNorm.bw
    Ignored:    output/qc/A-1_input.SeqDepthNorm_dunnart_downSampled.bw
    Ignored:    output/qc/A-1_input.ccurve.txt
    Ignored:    output/qc/A-1_input.extrap.txt
    Ignored:    output/qc/A-1_input_R1_trimmed.fastq.gz
    Ignored:    output/qc/A-1_input_est_lib_complex_metrics.txt
    Ignored:    output/qc/A-2_H3K4me3.SeqDepthNorm.bw
    Ignored:    output/qc/A-2_H3K4me3.SeqDepthNorm_dunnart_downSampled.bw
    Ignored:    output/qc/A-2_H3K4me3.ccurve.txt
    Ignored:    output/qc/A-2_H3K4me3.extrap.txt
    Ignored:    output/qc/A-2_H3K4me3_R1_trimmed.fastq.gz
    Ignored:    output/qc/A-2_H3K4me3_est_lib_complex_metrics.txt
    Ignored:    output/qc/A-2_H3K4me3_vs_A-1_input.frip_default.txt
    Ignored:    output/qc/A-2_H3K4me3_vs_B-1_input.frip_default.txt
    Ignored:    output/qc/A-3_H3K27ac.SeqDepthNorm.bw
    Ignored:    output/qc/A-3_H3K27ac.SeqDepthNorm_dunnart_downSampled.bw
    Ignored:    output/qc/A-3_H3K27ac.ccurve.txt
    Ignored:    output/qc/A-3_H3K27ac.extrap.txt
    Ignored:    output/qc/A-3_H3K27ac_R1_trimmed.fastq.gz
    Ignored:    output/qc/A-3_H3K27ac_est_lib_complex_metrics.txt
    Ignored:    output/qc/A-3_H3K27ac_vs_A-1_input.frip_default.txt
    Ignored:    output/qc/A-3_H3K27ac_vs_B-1_input.frip_default.txt
    Ignored:    output/qc/B-1_input.SeqDepthNorm.bw
    Ignored:    output/qc/B-1_input.SeqDepthNorm_dunnart_downSampled.bw
    Ignored:    output/qc/B-1_input.ccurve.txt
    Ignored:    output/qc/B-1_input.extrap.txt
    Ignored:    output/qc/B-1_input_R1_trimmed.fastq.gz
    Ignored:    output/qc/B-1_input_est_lib_complex_metrics.txt
    Ignored:    output/qc/B-2_H3K4me3.SeqDepthNorm.bw
    Ignored:    output/qc/B-2_H3K4me3.SeqDepthNorm_dunnart_downSampled.bw
    Ignored:    output/qc/B-2_H3K4me3.ccurve.txt
    Ignored:    output/qc/B-2_H3K4me3.extrap.txt
    Ignored:    output/qc/B-2_H3K4me3_R1_trimmed.fastq.gz
    Ignored:    output/qc/B-2_H3K4me3_est_lib_complex_metrics.txt
    Ignored:    output/qc/B-2_H3K4me3_vs_A-1_input.frip_default.txt
    Ignored:    output/qc/B-2_H3K4me3_vs_B-1_input.frip_default.txt
    Ignored:    output/qc/B-3_H3K27ac.SeqDepthNorm.bw
    Ignored:    output/qc/B-3_H3K27ac.SeqDepthNorm_dunnart_downSampled.bw
    Ignored:    output/qc/B-3_H3K27ac.ccurve.txt
    Ignored:    output/qc/B-3_H3K27ac.extrap.txt
    Ignored:    output/qc/B-3_H3K27ac_R1_trimmed.fastq.gz
    Ignored:    output/qc/B-3_H3K27ac_est_lib_complex_metrics.txt
    Ignored:    output/qc/B-3_H3K27ac_vs_A-1_input.frip_default.txt
    Ignored:    output/qc/B-3_H3K27ac_vs_B-1_input.frip_default.txt
    Ignored:    output/qc/bamPEFragmentSize_rawcounts.tab
    Ignored:    output/qc/multiBAM_fingerprint_metrics.txt
    Ignored:    output/qc/multiBAM_fingerprint_rawcounts.txt
    Ignored:    output/qc/multibamsum.npz
    Ignored:    output/qc/multibamsum.tab
    Ignored:    output/qc/multibamsum_dunnart_downSampled.npz
    Ignored:    output/qc/multibamsum_dunnart_downSampled.tab
    Ignored:    output/qc/pearsoncor_multibamsum_matrix.txt
    Ignored:    results/

Untracked files:
    Untracked:  .snakemake/
    Untracked:  Rplots.pdf
    Untracked:  code/basic_wrapper.slurm
    Untracked:  code/configs/metadata.tsv
    Untracked:  code/slurmjob.stderr
    Untracked:  code/slurmjob.stdout
    Untracked:  code/trimfastq.py
    Untracked:  data/raw_reads/
    Untracked:  geneLengthvsPeakCount.pdf
    Untracked:  output/qc/H3K27ac_overlap_default.frip
    Untracked:  output/rnaseq/
    Untracked:  slurm-33664196.out
    Untracked:  slurm-33664197.out
    Untracked:  slurm-33664198.out
    Untracked:  slurm-33664199.out
    Untracked:  slurm-33664200.out
    Untracked:  slurm-33664201.out
    Untracked:  slurm-33664202.out
    Untracked:  slurm-33664203.out
    Untracked:  slurm-33664204.out
    Untracked:  slurm-33664205.out
    Untracked:  slurm-33664206.out
    Untracked:  slurm-33664207.out
    Untracked:  slurm-33665413.out
    Untracked:  slurm-33665414.out
    Untracked:  slurm-33665415.out
    Untracked:  slurm-33665416.out
    Untracked:  slurm-33665417.out
    Untracked:  slurm-33665418.out
    Untracked:  slurm-33665419.out
    Untracked:  slurm-33665420.out
    Untracked:  slurm-33665421.out
    Untracked:  slurm-33665422.out
    Untracked:  slurm-33665527.out
    Untracked:  slurm-33665528.out
    Untracked:  slurm-33665529.out
    Untracked:  slurm-33665530.out
    Untracked:  slurm-33665531.out
    Untracked:  slurm-33665532.out
    Untracked:  slurm-33665533.out
    Untracked:  slurm-33665534.out
    Untracked:  slurm-33665535.out
    Untracked:  slurm-33665536.out
    Untracked:  test.pdf

Unstaged changes:
    Deleted:    code/dunnart_peak_calling/trimfastq.py
    Modified:   output/qc/H3K4me3_overlap_default.frip
    Deleted:    output/qc/bamPEFragmentSize_hist.png
    Deleted:    output/qc/multiBAM_fingerprint.png
    Deleted:    output/qc/pearsoncor_multibamsum.png

Note that any generated files, e.g. HTML, png, CSS, etc., are not included in this status report because it is ok for generated content to have uncommitted changes.

These are the previous versions of the repository in which changes were made to the R Markdown (analysis/dunnart_peak_characterisation.Rmd) and HTML (docs/dunnart_peak_characterisation.html) files. If you’ve configured a remote Git repository (see ?wflow_git_remote), click on the hyperlinks in the table below to view the files as they were in that past version.

File	Version	Author	Date	Message
Rmd	3031d6c	lecook	2022-03-07	wflow_publish(“analysis/dunnart_peak_characterisation.Rmd”)
html	95a6294	lecook	2022-03-03	Build site.
Rmd	faabd95	lecook	2022-03-03	updated
Rmd	732d337	lecook	2022-03-03	update
Rmd	c591390	lecook	2022-03-03	update
Rmd	0be985e	lecook	2022-03-03	update
Rmd	4b8f21e	lecook	2022-03-03	update
Rmd	2103ae7	lecook	2022-03-03	updates
Rmd	1c82c33	lecook	2022-03-03	update
Rmd	ad31704	lecook	2022-03-03	updated
Rmd	9c5a61a	lecook	2022-02-27	Merge branch ‘master’ of https://github.com/lecook/chipseq-cross-species into master
Rmd	1ffd9c3	lecook	2022-02-27	update
Rmd	84a98f9	lecook	2022-02-27	change directories
Rmd	8d4d8a5	lecook	2022-02-27	convert code to workflowr
Rmd	38f2f83	lecook	2022-02-25	add code to workflowr
Rmd	1fa92d9	lecook	2022-02-23	add analysis files

Introduction

This analysis looks through various features of the dunnart peaks including peak intensity, lengths, distance to TSS and nearest gene calls.

Load libraries

library(GenomicRanges)
library(GenomicFeatures)
library(rtracklayer)
library(AnnotationForge)
library(clusterProfiler)
library(ChIPseeker)
library(ggpubr)
library(AnnotationDbi)
library(AnnotationHub)
library(org.Mm.eg.db)
library(plyr)
library(tidyverse)
library(RColorBrewer)
library(ggplot2)
library(data.table)
library(ggridges)
library(reshape2)
library(RColorBrewer)
library(VennDiagram)
library(viridis)
library(hrbrthemes)
library(gghalves)
library(multcompView)
library(factoextra)
library(NbClust)
library(simplifyEnrichment)
library(formatR)

Set up

Set fonts, plot and output directories etc.

plot_dir = "output/plots/"
fullPeak_dir = "output/peaks/"
annot_dir = "output/annotations/"
filterPeaks_dir = "output/filtered_peaks/"

## Set the fonts up so that each plot is the saved the same way.
font = theme(axis.text.x = element_text(size = 25),
        axis.text.y = element_text(size = 25),  
        axis.title.x = element_text(size = 25),
        axis.title.y = element_text(size = 25), 
        legend.title=element_text(size=25), legend.text=element_text(size=25))

Peak features

Peak counts

fileList = fullPeak_dir
combined.tbl.stack = "dunnart_combined_stacked.pdf"
peak.length.plot = "dunnart_fullPeaks_length.pdf"
peak.intensity.plot = "dunnart_fullPeaks_intensity.pdf"

files <- list.files(fileList, pattern = "*r_peaks.narrowPeak|*_only.narrowPeak|*overlap_default.narrowPeak|*and_H3K27ac.narrowPeak", full.names = T) # create list of files in directory
files <- as.list(files)
data <- lapply(files, function(x) data.table::fread(x, header = FALSE, sep = "\t", quote = "", na.strings = c("", "NA")))
names(data) <- c("enhancers","promoters","H3K27ac_only","H3K27ac","H3K4me3_H3K27ac", "H3K4me3_only","H3K4me3")

df1 <- Map(dplyr::mutate, data, group = names(data))
df1$H3K27ac_only$cre <- "enhancers"
df1$H3K4me3_only$cre <- "promoters"
df1$H3K4me3_H3K27ac$cre <- "promoters"

fullPeaks <- rbind(df1$enhancers, df1$promoters, df1$H3K27ac, df1$H3K4me3)
combinedPeaks <- rbind(df1$H3K27ac_only, df1$H3K4me3_H3K27ac, df1$H3K4me3_only)
        
## Plot stacked bar graph with number of peaks for each histone mark
combined.tbl <- with(combinedPeaks, table(group, cre)) # nolint # nolint
combined.tbl <- as.data.frame(combined.tbl)

p <- ggplot(combined.tbl, aes(factor(cre), Freq, fill = group)) + 
      geom_bar(position = position_stack(reverse = TRUE), stat = "identity") +
      scale_fill_manual(values = c("#FF9D40",  "#37B6CE", "#04859D"))  + 
      theme_minimal() + ylab("Number of peaks") + xlab("")

pdf(file = paste0(plot_dir, combined.tbl.stack), width = 10, height = 7)
print(p)
dev.off()

svg 
  2

Average peak lengths

# Peak lengths for H3K4me3 and H3K27ac
fullPeaks$length <- (fullPeaks$V3 - fullPeaks$V2)
level_order <- c('promoter', 'enhancer', 'H3K4me3', 'H3K27ac')

## Plot peak lengths
p <- ggplot(fullPeaks, aes(factor(group), y = log10(length))) + 
  geom_violin(aes(fill = factor(group), 
                  color = factor(group))) + 
  geom_boxplot(aes(color = factor(group)), 
               width = .15, outlier.shape = NA, 
               position = position_dodge(width = .1)) + 
  theme_bw() + xlab("") + ylab("Log10 Peak Length") + 
  scale_color_manual(values = c("#E9C46A","#E9C46A","#264653","#2A9D8F")) +
  scale_fill_manual(values = c("#F1DAA2","#F1DAA2","#AEBABF","#7AC2B9"))
  
p

## Save as a pdf
pdf(file = paste0(plot_dir, peak.length.plot), width = 10, height = 7)
    print(p +  scale_color_manual(values = c("#E9C46A","#E9C46A","#264653","#2A9D8F")) +
    scale_fill_manual(values = c("#F1DAA2","#F1DAA2","#AEBABF","#7AC2B9")) + coord_flip())
dev.off()

svg 
  2

Average peak intensity

## Plot peak intensity
p <- ggplot(fullPeaks, aes(factor(group), y = log10(V7))) + 
  geom_half_violin(aes(fill = factor(group), color = factor(group)),
                     side = "r", 
                     position = "dodge") +
  geom_boxplot(aes(color = factor(group)),
               width = .15, 
               outlier.shape = NA,
               fill = c("#FCF8EC","#FCF8EC","#F5F6F7","#E4F3F1"),
               position = position_dodge(width = .1)) +
  coord_cartesian(xlim = c(1.2, NA), clip = "off") + theme_bw() + 
  xlab("") + ylab("Log10 Peak Intensity") +  
  scale_color_manual(values  = c("#E9C46A","#E9C46A","#264653","#2A9D8F")) +
  scale_fill_manual(values = c("#F1DAA2","#F1DAA2","#AEBABF","#7AC2B9")) 

p

pdf(file = paste0(plot_dir, peak.intensity.plot), width = 10, height = 7)
    print(p +  scale_color_manual(values = c("#E9C46A","#E9C46A","#264653","#2A9D8F")) +
    scale_fill_manual(values = c("#F1DAA2","#F1DAA2","#AEBABF","#7AC2B9")) + coord_flip())
dev.off()

svg 
  2

    print(p +  scale_color_manual(values = c("#E9C46A","#E9C46A","#264653","#2A9D8F")) +
    scale_fill_manual(values = c("#F1DAA2","#F1DAA2","#AEBABF","#7AC2B9")) + coord_flip())

Annotate dunnart peaks

The easiest way to call the nearest genes for the peaks in the dunnart is to use the ChIPseeker package (Guangchuang Yu 2021) as it allows easy integration of non-model organism genomes and has well documented instructions on incorporating BYO genomes with the package. To use the ChIPseeker to annotate peaks, firstly a txdb is needed for the dunnart annotation file. A TxDb class is a container for storing transcript annotations. The dunnart genome doesn’t have a de novo annotation so instead the Tasmanian devil annotation (RefSeq) has been lifted over to the dunnart genome using LiftOff (https://github.com/agshumate/Liftoff).

Build the annotation files for the dunnart

smiCra_txdb <- makeTxDbFromGFF("data/genomic_data/Scras_dunnart_assem1.0_pb-ont-illsr_flyeassem_red-rd-scfitr2_pil2xwgs2_60chr2.gff")

Gene ID conversion tables

For downstream analyses, conversion tables between gene databases and between species is needed. This is because the ENSEMBL/ENTREZ IDs for the Tasmanian Devil have fewer links to databases such as GO terms etc. For this I have two conversion tables: 1. Converts Tasmanian devil RefSeq to Tasmanian Devil ENSEMBL IDs 2. Convert Tasmanian Devil ENSEMBL IDs to mouse ENSEMBL IDs

Additionally I have a list of background genes for calculating GO enrichment. This background list includes all devil genes that have an orthologous mouse gene ID in the ensembl database.

## geneID conversion tables
df2 <- read.table("output/annotations/devil_to_mouse_ensembl.txt", header=TRUE, sep="\t") ## conversion table for devil ENSEMBL to mouse ENSEMBL
df3 <- read.table("output/annotations/refseq_to_ensembl.txt", header=TRUE, sep="\t") ## convertsion table for devil refseq to devil ENSEMBL
bg = fread("output/annotations/go_background_orthologousENSEMBL_biomart.txt", header = FALSE)
bg = unlist(bg, use.names = FALSE)

Annotate peak files with ChIPseeker

annotatePeaks <- function(peak, outFile, outFile1, outFile2, GOenrich, kegg, backg){
  
  # Annotate peak file based on dunnart GFF
  peakAnno <- annotatePeak(peak = peak, tssRegion = c(-3000, 3000), TxDb = smiCra_txdb)

  # Write annotation to file
  write.table(peakAnno, file=paste0(annot_dir, outFile), sep="\t", quote=F, row.names=F)
  
  peakAnnoDF <- as.data.frame(peakAnno, row.names = NULL)
  # # Convert refseq IDs and geneIDs to devil ensembl IDs
  df2 <- read.table("output/annotations/devil_to_mouse_ensembl.txt", header=TRUE, sep="\t") ## conversion table for devil ENSEMBL to mouse ENSEMBL
  df3 <- read.table("output/annotations/refseq_to_ensembl.txt", header=TRUE, sep="\t") ## convertsion table for devil refseq to devil ENSEMBL

  peakAnnoDF$ensemblgeneID <- df2$Gene.stable.ID[match(unlist(peakAnnoDF$geneId), df2$Gene.name)]
  peakAnnoDF$ensemblgeneID = replace(peakAnnoDF$ensemblgeneID,is.na(peakAnnoDF$ensemblgeneID),"-")
  peakAnnoDF$transcriptIdAltered <- gsub("\\..*","", peakAnnoDF$transcriptId)
  peakAnnoDF$refseqID <- df3$Ensembl.Gene.ID[match(unlist(peakAnnoDF$transcriptIdAltered), df3$RefSeq.mRNA.Accession)]
  peakAnnoDF$refseqID = replace(peakAnnoDF$refseqID,is.na(peakAnnoDF$refseqID),"-")
  peakAnnoDF$combined <- ifelse(peakAnnoDF$refseqID == "-", peakAnnoDF$ensemblgeneID, peakAnnoDF$refseqID)
  peakAnnoDF$combined[peakAnnoDF$combined == as.character("-")] <- NA

  peakAnnoDF <- peakAnnoDF[!is.na(peakAnnoDF$combined),]
  write.table(peakAnnoDF, paste0(annot_dir, outFile2), sep="\t", quote=F, row.names=F)
  # # Convert devil ensembl to mouse ensembl
  peakAnnoDF$mouseensembl <- df2$Mouse.gene.stable.ID[match(unlist(peakAnnoDF$combined), df2$Gene.stable.ID)]
  # # Write annotation with converted IDs 
  peakAnnoDF$mouseensembl[peakAnnoDF$mouseensembl == as.character("")] <- NA
  peakAnnoDF <- peakAnnoDF[!is.na(peakAnnoDF$mouseensembl),]
  write.table(peakAnnoDF, paste0(annot_dir, outFile1), sep="\t", quote=F, row.names=F)

  # # GO enrichment analysis
  GO <- enrichGO(gene = peakAnnoDF$mouseensembl,
                  keyType = "ENSEMBL",
                  OrgDb = org.Mm.eg.db,
                  ont = "BP",
                  universe = backg,
                  pAdjustMethod = "fdr",
                  qvalueCutoff = 0.01,
                  readable = TRUE)
  write.table(GO, paste0(annot_dir, GOenrich), 
              sep="\t", quote=F, row.names=F)
  
  # convert mouse ensembl to entrezID
  suppressWarnings(gene.df <- bitr(geneID=peakAnnoDF$mouseensembl, fromType="ENSEMBL", toType="ENTREZID", OrgDb="org.Mm.eg.db" ))
  
  bg.entrez <- bitr(geneID=backg, fromType="ENSEMBL", toType="ENTREZID", OrgDb="org.Mm.eg.db" )
  
  bg.entrez = bg.entrez$ENTREZID
  
  # ## kegg analysis
  kk <- enrichKEGG(gene = gene.df$ENTREZID, 
                   organism = "mmu", pAdjustMethod = "fdr",
                   qvalueCutoff = 0.01, universe = bg.entrez)
  write.table(kk, paste0(annot_dir, kegg), 
              sep="\t", 
              quote=F,
              row.names=F)
}


### Enhancer-associated peaks annotation
annotatePeaks(peak = paste0(fullPeak_dir, "dunnart_enhancer_peaks.narrowPeak"), outFile = "dunnart_enhancer_annotation.txt", backg = unlist(fread("output/annotations/go_background_orthologousENSEMBL_biomart.txt", header = FALSE), use.names = FALSE),
  outFile1 = "dunnart_enhancer_annotationConvertedIDs.txt", GOenrich = "enhancer_mm10GOenrich.txt", 
  kegg="enhancer_KEGG.txt", outFile2 = "dunnart_enhancer_annotationConvertedIDs_t.devil.txt")

>> loading peak file...              2022-03-07 15:52:54 
>> preparing features information...         2022-03-07 15:52:55 
>> identifying nearest features...       2022-03-07 15:52:55 
>> calculating distance from peak to TSS...  2022-03-07 15:52:56 
>> assigning genomic annotation...       2022-03-07 15:52:56 
>> assigning chromosome lengths          2022-03-07 15:53:05 
>> done...                   2022-03-07 15:53:05

### Promoter-associated peaks annotation
annotatePeaks(peak = paste0(fullPeak_dir, "dunnart_promoter_peaks.narrowPeak"), outFile = "dunnart_promoter_annotation.txt", backg = unlist(fread("output/annotations/go_background_orthologousENSEMBL_biomart.txt", header = FALSE), use.names = FALSE),
  outFile1 = "dunnart_promoter_annotationConvertedIDs.txt" , GOenrich = "promoter_mm10GOenrich.txt", 
  kegg="promoter_KEGG.txt", outFile2 = "dunnart_promoter_annotationConvertedIDs_t.devil.txt")

>> loading peak file...              2022-03-07 15:53:55 
>> preparing features information...         2022-03-07 15:53:55 
>> identifying nearest features...       2022-03-07 15:53:55 
>> calculating distance from peak to TSS...  2022-03-07 15:53:56 
>> assigning genomic annotation...       2022-03-07 15:53:56 
>> assigning chromosome lengths          2022-03-07 15:53:58 
>> done...                   2022-03-07 15:53:58

### H3K4me3-associated peaks annotation
annotatePeaks(peak = paste0(fullPeak_dir, "H3K4me3_overlap_default.narrowPeak"), outFile = "dunnart_H3K4me3_annotation.txt", backg = unlist(fread("output/annotations/go_background_orthologousENSEMBL_biomart.txt", header = FALSE), use.names = FALSE),
  outFile1 = "dunnart_H3K4me3_annotationConvertedIDs.txt" , GOenrich = "H3K4me3_mm10GOenrich.txt", 
  kegg="H3K4me3_KEGG.txt", outFile2 = "dunnart_H3K4me3_annotationConvertedIDs_t.devil.txt")

>> loading peak file...              2022-03-07 15:54:22 
>> preparing features information...         2022-03-07 15:54:23 
>> identifying nearest features...       2022-03-07 15:54:23 
>> calculating distance from peak to TSS...  2022-03-07 15:54:23 
>> assigning genomic annotation...       2022-03-07 15:54:23 
>> assigning chromosome lengths          2022-03-07 15:54:25 
>> done...                   2022-03-07 15:54:25

### H3K27ac-associated peaks annotation
annotatePeaks(peak = paste0(fullPeak_dir, "H3K27ac_overlap_default.narrowPeak"), outFile = "dunnart_H3K27ac_annotation.txt", backg = unlist(fread("output/annotations/go_background_orthologousENSEMBL_biomart.txt", header = FALSE), use.names = FALSE),
  outFile1 = "dunnart_H3K27ac_annotationConvertedIDs.txt" , GOenrich = "H3K27ac_mm10GOenrich.txt", 
  kegg="H3K27ac_KEGG.txt", outFile2 = "dunnart_H3K27ac_annotationConvertedIDs_t.devil.txt")

>> loading peak file...              2022-03-07 15:54:49 
>> preparing features information...         2022-03-07 15:54:50 
>> identifying nearest features...       2022-03-07 15:54:50 
>> calculating distance from peak to TSS...  2022-03-07 15:54:51 
>> assigning genomic annotation...       2022-03-07 15:54:51 
>> assigning chromosome lengths          2022-03-07 15:54:53 
>> done...                   2022-03-07 15:54:53

Distance to the nearest TSS

Now see where the peaks are located in relation to the TSS. Promoters should be reasonably close to the TSS and enhancers more distal to the TSS. Plot distance to TSS for all unfiltered peaks

fileList = annot_dir
enhancerTSSplot="dunnart_enhancerTSS.pdf"
promoterTSSplot="dunnart_promoterTSS.pdf"
TSSplot = "dunnart_TSS.pdf"

files =list.files(fileList, pattern= "annotation.txt", full.names=T) # create list of files in directory
files = as.list(files)
data = lapply(files, function(x) fread(x, header=TRUE, sep="\t", quote = "", na.strings=c("", "NA"))) # read in all files
names(data) = c("enhancers","H3K27ac","H3K4me3","promoter")
df1 = Map(mutate, data, mark = names(data))
enhProm = list(as.data.table(df1$promoter), as.data.table(df1$enhancers))
enhProm = suppressWarnings(lapply(enhProm, 
                                  function(x)
  x[,log10_abs_dist:=log10(abs(distanceToTSS)+1)][,log10_abs_dist:=ifelse(distanceToTSS<0,-log10_abs_dist,log10_abs_dist)]))

names(enhProm) = c("promoters","enhancers")

## Enhancers
p <- ggplot(enhProm$enhancers, aes(x=log10_abs_dist)) + 
    geom_density(alpha=0.5, color="#FF9D40", fill="#FF9D40") + theme_bw() + 
  ggtitle("Enhancer-associated peaks")

## Save as PDF
pdf(file = paste0(plot_dir, enhancerTSSplot), width=10, height = 7)
    print(p)
dev.off()

svg 
  2

## Promoters
p <- ggplot(enhProm$promoters, aes(x=log10_abs_dist)) + 
    geom_density(alpha=0.5, color="#2A9D8F", fill="#2A9D8F") + theme_bw() + 
  ggtitle("Promoter-associated peaks")

p

Version	Author	Date
95a6294	lecook	2022-03-03

## Save as PDF
pdf(file = paste0(plot_dir, promoterTSSplot), width=10, height = 7)
    print(p)
dev.off()

svg 
  2

## Overlay both on the same plots
enhProm = rbindlist(enhProm,)
p <- ggplot(enhProm, aes(x=log10_abs_dist, fill=factor(mark), color=factor(mark))) + 
  geom_density(alpha=0.5) +scale_color_manual(values=c("#E9C46A", "#2A9D8F")) +
  scale_fill_manual(values=c("#E9C46A", "#2A9D8F")) + 
  scale_x_continuous(limits=c(-7, 7)) + theme_bw()
p

Version	Author	Date
95a6294	lecook	2022-03-03

## Save as PDF
pdf(file = paste0(plot_dir, TSSplot), width=10, height = 7)
    print(p + font)
dev.off()

svg 
  2

From this we can see that the promoter peaks have a large amount of peaks a long way from the TSS - suggests that these are either actually enhancers or they represent unannotated transcripts. Probably a mixture of both based on comparison with mouse peaks (where the annotation is better) there are not as many peaks in this distal regions. To look further into this I assessed the relationship between various peak features and the distances to the nearest genes as potential explanation for the distally-located promoter-associated peaks.

Assessing distal promoter-associated peaks

Gene length vs number of peaks per gene

peak.counts <- enhProm %>% as.data.frame() %>% group_by(geneId, mark, geneLength) %>%
    summarise(count = n(), .groups = 'drop') %>% as.data.table()

p <- ggscatter(
  peak.counts, x = "count", y = "geneLength",
  color = "mark", palette = "jco", add="reg.line",
  conf.int = TRUE, # Add confidence interval
  cor.coef = TRUE, # Add correlation coefficient. see ?stat_cor
  cor.coeff.args = list(method = "pearson", label.x = 3, label.sep = "\n"), xlab = "Gene Length",
  ylab = "Number of peaks per gene"
  ) +
  facet_wrap(~mark) + theme(panel.spacing = unit(3, "lines"), plot.margin = margin(2, 2, 2, 2, "cm"))
p

pdf(paste0(plot_dir, "geneLengthvsPeakCount.pdf"), width=10, height = 15)
print(p)
dev.off()

svg 
  2

#FIX
#p <- ggplot(peak, aes(abs(distanceToTSS), y=log10_length, color=cre)) + 
#  geom_point(alpha=0.3) +  theme_bw() + scale_color_manual(values=c("#F1DAA2", "#7AC2B9"))

#pdf(paste0(plot_dir, "peakLengthVsDist.pdf"), width=10, height = 7)

#dev.off()

Peak intensity for distal enhancers marked by H3K4me3

# subset log10_abs_dist between 3.5 and 7 (3000bp and 10 million bps)
#H3K4me3 = fread("H3K4me3_annotation.txt", sep="\t", quote="", header=TRUE)
#H3K27ac = fread("H3K27ac_annotation.txt", sep="\t", quote="", header=TRUE)

#H3K4me3.subset.greater = subset(H3K4me3, abs(distanceToTSS) >=3000)
#H3K4me3.subset.greater$mark = "H3K4me3 > 3kb"
#H3K4me3.subset.less = subset(H3K4me3, abs(distanceToTSS) < 3000)
#H3K4me3.subset.less$mark = "H3K4me3 < 3kb"
#H3K27ac.subset = subset(H3K27ac, abs(distanceToTSS) >=3000)
#H3K27ac.subset$mark = "H3K27ac"
#data = rbind(H3K4me3.subset.less, H3K4me3.subset.greater, H3K27ac.subset)
#stat <- compare_means(V7 ~ mark, data = data)
#mean(H3K4me3.subset.greater$V7)
#mean(H3K4me3.subset.less$V7)

#enhancer <- fread("dunnart_enhancer_annotation.txt", sep="\t", quote="", header=TRUE)
#promoter <- fread("dunnart_promoter_annotation.txt", sep="\t", quote="", header=TRUE)
#enhancer$cre <- "enhancer"
#promoter$cre <- "promoter"
#enhancer = enhancer[,log10_abs_dist:=log10(abs(distanceToTSS)+1)]
#promoter = promoter[,log10_abs_dist:=log10(abs(distanceToTSS)+1)]

#promoter.subset=subset(promoter, log10_abs_dist >= 3.5)
#promoter = subset(promoter, log10_abs_dist <3.5)
#promoter.subset$cre = "subset"
#data = rbind(promoter, promoter.subset)
#data = data[,log10_intensity:=log10(V7)]

#p <- ggplot(data, aes(factor(cre), y = log10_intensity)) + 
  #geom_violin(aes(fill=factor(cre), color=factor(cre)),
   # position = "dodge"
   # )+
 # geom_boxplot(aes(color=factor(cre)),
  #  outlier.shape = NA,
  #  width = .15, 
  #  notch = TRUE,
  #  fill=c("#FCF8EC","#E4F3F1"),
  #  position = position_dodge(width=.1)
  #) + theme_bw() + xlab("") + ylab("Peak intensity") 
#pdf(paste0(plot_dir, "distalH3K4me3PeakIntensity.pdf"), width=9, height = 8)
#p +  scale_color_manual(values = c("#E9C46A","#2A9D8F")) +
#scale_fill_manual(values = c("#F1DAA2","#7AC2B9")) + stat_compare_means(method = "wilcox.test" )
#dev.off()

Number of peaks per gene

#enhancer <- fread("annotations/dunnart_enhancer_annotation.txt", sep="\t", quote="", header=TRUE)
#promoter <- fread("annotations/dunnart_promoter_annotation.txt", sep="\t", quote="", header=TRUE)
#enhancer$cre <- "enhancer"
#promoter$cre <- "promoter"
##peak <- rbind(enhancer, promoter)
#peak.counts <- enhancer %>% as.data.frame() %>% group_by(geneId, cre) %>%
#    summarise(count = n(),.groups = 'drop') %>% as.data.table()
#peak.counts = peak.counts[,log10_abs_dist:=log10(count)][,log10_abs_dist:=ifelse(count<0,-log10_abs_dist,log10_abs_dist)]
#pick <- peak.counts %>% filter(count > 50)

#p <- ggplot(peak.counts, aes(factor(cre), y = log10_abs_dist)) + 
 # geom_violin(aes(fill=factor(cre), color=factor(cre))) +
  #geom_boxplot(aes(color=factor(cre)),
   # width = .15, 
    #outlier.shape = NA,
    #fill=c("#FCF8EC","#E4F3F1"),
    #position = position_dodge(width=.1)
  #)  +
  #geom_text(data = pick, aes(factor(cre), log10_abs_dist), label = pick$geneId, check_overlap = TRUE, size=3, position = position_jitter(w = 0.3)) +
 # theme_bw() + xlab("") + ylab("Log 10 number of peaks") 
#pdf(paste0(plot_dir, "numberPeaksPerGene.pdf"), width=9, height = 8)
#p +  scale_color_manual(values = c("#E9C46A","#2A9D8F")) +
#scale_fill_manual(values = c("#F1DAA2","#7AC2B9"))
#dev.off()

Gene length and number of peaks per gene

#p <- ggplot(peak.counts, aes(x=log10_abs_length, y = log10(n))) + 
#  geom_point()+ theme_bw() + xlab("Gene Length") + ylab("Log 10 number of peaks") 
#pdf(paste0(plot_dir, "numberPeaksvsGeneLength.pdf"), width=9, height = 8)
#p
#dev.off()

k-means clustering

Use k-means clustering to group the peaks to decide on a cut off for “promoter” peaks. This will be more conservative for what we identify as promoters. We can use k-means clustering with the MacQueen algorithm (morissette_k-means_2013) to perform unsupervised clustering on promoter-associated peaks into 3 clusters based on distance to the closest TSS.

set.seed(3)

file = paste0(annot_dir, "dunnart_promoter_annotation.txt")
plot1 = paste0(plot_dir, "promoter_kmeans_bar.pdf")
plot2 = paste0(plot_dir, "promoter_kmeans_histogram.pdf")
output = "promoter_kmeans_peaks.txt"

data = fread(file, header=TRUE, sep="\t", quote = "") # read in all files
data = data[,log10_dist:=log10(abs(distanceToTSS)+1.1)][,log10_dist:=ifelse(distanceToTSS<0,-log10_dist,log10_dist)]
data = data[,abs_dist:=log10(abs(distanceToTSS)+1.1)]

data = data %>% dplyr::select("V4", "width", "V7", "distanceToTSS", "log10_dist", "abs_dist", "annotation")
  
## plot the number of peaks in each cluster
## Using the MacQueen algorithm instead of the default Lloyd 
## The algorithm is more efficient as it updates centroids more often and usually needs to
## perform one complete pass through the cases to converge on a solution.
df = data[,5]
cre.kmeans = kmeans(df, 3, iter.max=100, nstart=25, algorithm="MacQueen")  
cre.kmeans_table = data.frame(cre.kmeans$size, cre.kmeans$centers)
cre.kmeans_df = data.frame(Cluster = cre.kmeans$cluster, data)

p <- ggplot(data = cre.kmeans_df, aes(y = Cluster, 
                                      fill=as.factor(Cluster), 
                                      color=as.factor(Cluster))) +
  geom_bar()  + theme(plot.title = element_text(hjust = 0.5)) + 
  theme_bw() + scale_color_manual(values = c('#9EBCDA','#8C6BB1', "#4D004B")) + 
  scale_fill_manual(values = c('#9EBCDA','#8C6BB1', "#4D004B")) + theme_bw() 
p + ggtitle("Number of peaks in clusters")

Version	Author	Date
95a6294	lecook	2022-03-03

pdf(plot1, width=10, height = 7)
print(p + font)
dev.off()

svg 
  2

p <- ggplot(cre.kmeans_df, aes(x=log10_dist, 
                               fill=as.factor(Cluster), 
                               color=as.factor(Cluster))) +
  geom_histogram(binwidth=0.1, position = 'identity') +
  theme_bw() + scale_color_manual(values = c('#9EBCDA','#8C6BB1', "#4D004B")) + 
  scale_fill_manual(values = c('#9EBCDA','#8C6BB1', "#4D004B")) 
p + ggtitle("Histogram of clustered peaks")

Version	Author	Date
95a6294	lecook	2022-03-03

pdf(plot2, width = 10, height = 7)
  print(p + font)
dev.off()

svg 
  2

write.table(cre.kmeans_df, paste0(filterPeaks_dir, output), sep="\t", quote=F, row.names=F)

Filter for cluster 1 and make final promoter/enhancer plots

peakWidthPlot = paste0(plot_dir, "filteredPeaks_peakWidthPlot.pdf")
plotTSS = paste0(plot_dir,  "filteredPeaks_peakDistanceToTSS.pdf")
peakIntensityPlot = paste0(plot_dir, "filteredPeaks_peakIntensityPlot.pdf")
annotation = paste0(plot_dir, "filteredPeaks_peakAnnotation.pdf")
x = paste0(filterPeaks_dir, "promoter_kmeans_peaks.txt")
enhancer = paste0(annot_dir, "dunnart_enhancer_annotation.txt")
promoter = paste0(annot_dir, "dunnart_promoter_annotation.txt")

enhancer.annot= fread(enhancer, header=TRUE, sep="\t", quote = "") 
enhancer.annot = enhancer.annot[,log10_dist:=log10(abs(distanceToTSS)+1)][,log10_dist:=ifelse(distanceToTSS<0,-log10_dist,log10_dist)]
enhancer.annot = enhancer.annot[,abs_dist:=abs(distanceToTSS)]
promoter.annot = fread(promoter, header=TRUE, sep="\t", quote = "")
promoter.annot = promoter.annot[,log10_dist:=log10(abs(distanceToTSS)+1)][,log10_dist:=ifelse(distanceToTSS<0,-log10_dist,log10_dist)]
promoter.annot = promoter.annot[,abs_dist:=abs(distanceToTSS)]

promoter.annot$label = "promoter"
enhancer.annot$label = "enhancer"
enhancer.annot = enhancer.annot %>% dplyr::select("V4", "width", "V7", "distanceToTSS", 
                                                  "log10_dist", "abs_dist", "annotation", "label")
promoter.annot = promoter.annot %>% dplyr::select("V4", "width", "V7", "distanceToTSS", 
                                                  "log10_dist", "abs_dist", "annotation", "label")

promoter.kmeans = fread(x, header=TRUE, sep="\t", quote = "") 
promoter.kmeans$cre = "promoter"

for (i in 1:nrow(promoter.kmeans)) {
   if(promoter.kmeans$Cluster[i]=='1' & promoter.kmeans$cre[i]=='promoter') {
           promoter.kmeans[i, 'label'] <- "promoter cluster 1"
          }
          else if(promoter.kmeans$Cluster[i]=='2' & promoter.kmeans$cre[i]=='promoter') {
            promoter.kmeans[i, 'label'] <- "promoter cluster 2"
            }
            else if(promoter.kmeans$Cluster[i]=='3' & promoter.kmeans$cre[i]=='promoter') {
              promoter.kmeans[i, 'label'] <- "promoter cluster 3"
              }
}

data2 = promoter.kmeans %>% dplyr::select("V4", "width", "V7", "distanceToTSS", "log10_dist", "abs_dist", "annotation", "label")
all_peaks = rbind(data2, enhancer.annot, promoter.annot)
all_peaks = as.data.table(all_peaks)
all_peaks = all_peaks[,log10_intensity:=log10(V7)]
all_peaks = all_peaks[,log10_width:=log10(width)]

Plot distance to TSS

colors = c("#E9C46A","#e34a33", "#fdbb84","#fee8c8", "#2A9D8F")

p <- ggplot(all_peaks, aes(x=log10_dist, colour = factor(label), fill=factor(label))) + 
  geom_density(aes(x = log10_dist, y = ..density..), position="stack", alpha=0.7) + 
  theme_bw() + scale_fill_manual(values = colors) + scale_color_manual(values = colors)

pdf(plotTSS, width=10, height = 7)
  print(p + font)
dev.off()

svg 
  2

my_comparisons = list( c("promoter cluster 1", "promoter cluster 2"), c("promoter cluster 1", "promoter cluster 3"), c("promoter cluster 1", "promoter"), c("promoter cluster 1", "enhancer") )
  
## Plot peak intensity
p <- ggplot(all_peaks, aes(factor(label), y = log10_intensity)) + 
  geom_violin(aes(fill = factor(label), 
                  color = factor(label)), 
              position = "dodge") +
  geom_boxplot(aes(color = factor(label)),
               outlier.shape = NA,
               width = .15, 
               notch = TRUE,
               position = position_dodge(width=.1)) + theme_bw() + 
  xlab("") + ylab("Log 10 peak intensity") + scale_color_manual(values = colors) +
  scale_fill_manual(values = colors) + stat_compare_means(comparisons = my_comparisons)


p +  ggtitle("Peak intensity for clustered promoter-associated peaks compared to unfiltered peaks")

pdf(paste0(plot_dir, "test.pdf"),width=10, height = 8)
  print(p + font)
dev.off()

svg 
  2

## Plot peak width
p <- ggplot(all_peaks, aes(factor(label), y = log10_width)) + 
  geom_violin(aes(fill = factor(label), 
                  color = factor(label)), 
              position = "dodge") +
  geom_boxplot(aes(color=factor(label)),
               outlier.shape = NA,
               width = .15, 
               notch = TRUE,
               position = position_dodge(width=.1)) + 
  theme_bw() + xlab("") + ylab("Log 10 peak width") + scale_color_manual(values =colors) +
  scale_fill_manual(values = colors) + stat_compare_means(comparisons = my_comparisons) 

p + ggtitle("Peak widths for clustered promoter-associated peaks compared to unfiltered peaks")

pdf(peakWidthPlot, width=9, height = 8)
  print(p + font)
dev.off()

svg 
  2

## Plot genomic regions
df2 = rbind(promoter.kmeans)
df2 = map_df(df2, ~ gsub(" ().*", "", .x))# remove everything after the first space
tbl = with(df2, table(annotation, Cluster)) %>% as.data.frame()

p <- ggplot(tbl, aes(factor(annotation), Freq, fill=factor(Cluster))) + 
  geom_bar(position=position_stack(reverse = TRUE), stat="identity") +
  #geom_text(data=subset(data,Freq != 0),aes(label=Freq, y=pos), size=3)+
  scale_fill_manual(values = c('#9EBCDA','#8C6BB1', "#4D004B"))  + 
  theme_minimal() + ylab("Number of peaks") + xlab("") 

p + ggtitle("Genomic features of clusters")

pdf(annotation)
print(p + font)
dev.off()

svg 
  2

Extract cluster groups from narrowPeak files and save separately

promoter.kmeans = fread(paste0(filterPeaks_dir, "promoter_kmeans_peaks.txt"))
cluster1 <- promoter.kmeans$V4[promoter.kmeans$Cluster==1]
cluster2 <- promoter.kmeans$V4[promoter.kmeans$Cluster==2]
cluster3 <- promoter.kmeans$V4[promoter.kmeans$Cluster==3]

promoter = paste0(fullPeak_dir, "dunnart_promoter_peaks.narrowPeak")
out1 = paste0(filterPeaks_dir, "cluster1_promoters.narrowPeak")
out2 = paste0(filterPeaks_dir, "cluster2_promoters.narrowPeak")
out3 = paste0(filterPeaks_dir, "cluster3_promoters.narrowPeak")
cluster1 = cluster1
cluster2 = cluster2
cluster3 = cluster3

file= fread(promoter, header=FALSE, sep="\t", quote = "") 
write.table(subset(file, V4 %in% cluster1), out1, quote=FALSE, col.names=FALSE, row.names=FALSE, sep="\t")
write.table(subset(file, V4 %in% cluster2), out2, quote=FALSE, col.names=FALSE, row.names=FALSE, sep="\t")
write.table(subset(file, V4 %in% cluster3), out3, quote=FALSE, col.names=FALSE, row.names=FALSE, sep="\t")

Anotate promoter clusters

annotatePeaks(peak = paste0(filterPeaks_dir, "cluster1_promoters.narrowPeak"), 
  outFile = "cluster1_promoter_annotation.txt", outFile1 = "cluster1_promoter_annotationConvertedIDs.txt" , backg = unlist(fread("output/annotations/go_background_orthologousENSEMBL_biomart.txt", header = FALSE), use.names = FALSE), 
  GOenrich = "cluster1_mm10GOenrich.txt", 
  kegg="cluster1_KEGG.txt", outFile2 = "cluster1_annotationConvertedIDs_t.devil.txt")

>> loading peak file...              2022-03-07 15:57:55 
>> preparing features information...         2022-03-07 15:57:55 
>> identifying nearest features...       2022-03-07 15:57:55 
>> calculating distance from peak to TSS...  2022-03-07 15:57:55 
>> assigning genomic annotation...       2022-03-07 15:57:55 
>> assigning chromosome lengths          2022-03-07 15:57:56 
>> done...                   2022-03-07 15:57:56

annotatePeaks(peak = paste0(filterPeaks_dir, "cluster2_promoters.narrowPeak"), 
  outFile = "cluster2_promoter_annotation.txt", outFile1 = "cluster2_promoter_annotationConvertedIDs.txt" , backg = unlist(fread("output/annotations/go_background_orthologousENSEMBL_biomart.txt", header = FALSE), use.names = FALSE), 
  GOenrich = "cluster2_mm10GOenrich.txt", 
  kegg="cluster2_KEGG.txt", outFile2 = "cluster2_annotationConvertedIDs_t.devil.txt")

>> loading peak file...              2022-03-07 15:58:17 
>> preparing features information...         2022-03-07 15:58:17 
>> identifying nearest features...       2022-03-07 15:58:17 
>> calculating distance from peak to TSS...  2022-03-07 15:58:18 
>> assigning genomic annotation...       2022-03-07 15:58:18 
>> assigning chromosome lengths          2022-03-07 15:58:19 
>> done...                   2022-03-07 15:58:19

annotatePeaks(peak = paste0(filterPeaks_dir, "cluster3_promoters.narrowPeak"), 
  outFile = "cluster3_promoter_annotation.txt", outFile1 = "cluster3_promoter_annotationConvertedIDs.txt" , backg = unlist(fread("output/annotations/go_background_orthologousENSEMBL_biomart.txt", header = FALSE), use.names = FALSE), 
  GOenrich = "cluster3_mm10GOenrich.txt", 
  kegg="cluster3_KEGG.txt", outFile2 = "cluster3_annotationConvertedIDs_t.devil.txt")

>> loading peak file...              2022-03-07 15:58:40 
>> preparing features information...         2022-03-07 15:58:40 
>> identifying nearest features...       2022-03-07 15:58:40 
>> calculating distance from peak to TSS...  2022-03-07 15:58:40 
>> assigning genomic annotation...       2022-03-07 15:58:40 
>> assigning chromosome lengths          2022-03-07 15:58:41 
>> done...                   2022-03-07 15:58:41

Assess distance to TSS for clusters

fileList = annot_dir
enhancerTSSplot="dunnart_enhancerTSS.pdf"
promoterTSSplot="dunnart_filtered_promoterTSS.pdf"
TSSplot = "dunnart_all_TSS.pdf"


files =list.files(fileList, pattern= "_annotation.txt", full.names=T) # create list of files in directory
files = as.list(files)
data = lapply(files, function(x) fread(x, header=TRUE, sep="\t", quote = "", na.strings=c("", "NA"))) # read in all files
names(data) = c("cluster1", "cluster2", "cluster3", "enhancers","H3K27ac","H3K4me3","promoters")
enhProm = Map(mutate, data, mark = names(data))
#promoter_5kb = x
#promoter_5kb$mark = "promoter_5kb"
#enhProm = list(df1$enhancers, df1$promoters)
enhProm = suppressWarnings(lapply(enhProm, function(x) x[,log10_abs_dist:=log10(abs(distanceToTSS)+1)][,log10_abs_dist:=ifelse(distanceToTSS<0,-log10_abs_dist,log10_abs_dist)]))

enhProm = suppressWarnings(lapply(enhProm, function(x) x[,abs_dist:=abs(distanceToTSS)]))
#names(enhProm) = c("enhancers","promoters", "promoter_5kb")
level_order = c("cluster1", "cluster2", "cluster3", "promoters", "enhancers", "H3K4me3", "H3K27ac")

p <- ggplot(enhProm$enhancers, aes(x=log10_abs_dist, 
                                    color=factor(mark), 
                                    fill=factor(mark))) + 
        geom_density(alpha=0.5) + theme_bw() + scale_color_manual(values = c('#E9C46A')) + 
        scale_fill_manual(values = c('#E9C46A'))

p + ggtitle("Distance to the nearest TSS for enhancer-associated peaks")

pdf(file = paste0(plot_dir, enhancerTSSplot), width=10, height = 7)
print(p + font)
dev.off()

svg 
  2

enhProm = rbind(enhProm$promoters, enhProm$cluster1, enhProm$cluster2, enhProm$cluster3 ,use.names=TRUE)

p <- ggplot(enhProm, aes(x=log10_abs_dist, 
                         color = factor(mark, level=level_order), 
                         fill = factor(mark, level=level_order))) + 
  geom_density(position = "stack", alpha=0.8) + theme_bw() 

p + ggtitle("Distance to the nearest TSS for promoter-associated peaks")

pdf(file = paste0(plot_dir, TSSplot), width=15, height = 7)
print(p + font)
dev.off()

svg 
  2

Now look at CpG% and GC% for these groups

Run Homer on the clusters

Load modules and create genome db for use within homer
Annotate peaks to extract GC and CpG content information using annotatePeaks.pl

module load perl
module load gcc
module load web_proxy
module load wget

Load genome for use with HOMER tools This will allow the genome to be called within the tools

loadGenome.pl -name smiCra1 -org null -fasta data/genomic_data/Scras_dunnart_assem1.0_pb-ont-illsr_flyeassem_red-rd-scfitr2_pil2xwgs2_60chr.fasta -gff data/genomic_data/Scras_dunnart_assem1.0_pb-ont-illsr_flyeassem_red-rd-scfitr2_pil2xwgs2_60chr2.gff

Annotate peaks to extract GC and CpG content information

annotatePeaks.pl output/filtered_peaks/cluster1_promoters.narrowPeak smiCra1 -gff data/genomic_data/Scras_dunnart_assem1.0_pb-ont-illsr_flyeassem_red-rd-scfitr2_pil2xwgs2_60chr2.gff -CpG -cons > output/annotations/cluster1_promoters_homerAnnot.txt
annotatePeaks.pl output/filtered_peaks/cluster2_promoters.narrowPeak smiCra1 -gff Scras_dunnart_assem1.0_pb-ont-illsr_flyeassem_red-rd-scfitr2_pil2xwgs2_60chr2.gff -CpG -cons > output/annotations/cluster2_promoters_homerAnnot.txt
annotatePeaks.pl output/filtered_peaks/cluster3_promoters.narrowPeak smiCra1 -gff Scras_dunnart_assem1.0_pb-ont-illsr_flyeassem_red-rd-scfitr2_pil2xwgs2_60chr2.gff -CpG -cons > output/annotations/cluster3_promoters_homerAnnot.txt

annotatePeaks.pl output/peaks/dunnart_promoter_peaks.narrowPeak smiCra1 -gff data/genomic_data/Scras_dunnart_assem1.0_pb-ont-illsr_flyeassem_red-rd-scfitr2_pil2xwgs2_60chr2.gff -CpG -cons > output/annotations/dunnart_promoters_homerAnnot.txt
annotatePeaks.pl output/peaks/dunnart_enhancer_peaks.narrowPeak smiCra1 -gff data/genomic_data/Scras_dunnart_assem1.0_pb-ont-illsr_flyeassem_red-rd-scfitr2_pil2xwgs2_60chr2.gff -CpG -cons > output/annotations/dunnart_enhancers_homerAnnot.txt

fileList = annot_dir
gcPlot = paste0(plot_dir, "dunnart_GC.pdf")
cpgPlot = paste0(plot_dir, "dunnart_cpg.pdf")

files =list.files(fileList, pattern= "homerAnnot.txt", full.names=T) # create list of files in directory
files = as.list(files)
data = lapply(files, function(x) fread(x, header=TRUE, sep="\t", quote = "", na.strings=c("", "NA"))) # read in all files
names(data) = c("cluster1","cluster2","cluster3","enhancers", "promoters")
df1 = Map(mutate, data, group = names(data))
df1 = lapply(df1, function(x) x %>% dplyr::select("Chr", "Start", "End", 
                                                  "Peak Score", "Distance to TSS", 
                                                  "CpG%", "GC%", "group") %>% as.data.table())
df1 = rbind(df1$cluster1, df1$cluster2, df1$cluster3, df1$promoters, df1$enhancers)
colnames(df1)[5:7] <- c("distanceToTSS","CpG", "GC")
df1[is.na(df1)] <- 0
df1 = df1[,log10_abs_dist:=log10(abs(distanceToTSS)+1)][,log10_abs_dist:=ifelse(distanceToTSS<0,-log10_abs_dist,log10_abs_dist)]

my_comparisons = list( c("cluster1", "cluster2"), c("cluster1", "cluster3"), 
                       c("cluster1", "promoters"), c("cluster1", "enhancers"), 
                       c("promoters", "enhancers"), c("promoters", "cluster2"), 
                       c("promoters", "cluster3"), c("enhancers", "cluster2"), 
                       c("enhancers", "cluster3"), c("cluster2", "cluster3") )

p <- ggplot(df1, aes(factor(group), y = GC)) + 
  geom_violin(aes(fill=factor(group), 
                  color=factor(group)), 
              position = "dodge") +
  geom_boxplot(aes(color=factor(group)),
               outlier.shape = NA,
               width = .15, 
               notch = TRUE,
               fill=c("#E7EEF6","#D4C7E2","#D3BFD2", "#FCF8EC","#E4F3F1"),
               position = position_dodge(width=.1)) + 
  theme_bw() + xlab("") + ylab("GC content") + 
  scale_color_manual(values = c('#9EBCDA','#8C6BB1','#4D004B', '#E9C46A','#2A9D8F')) +
  scale_fill_manual(values = c("#B6CDE3","#A990C4","#7A4078" ,"#F1DAA2","#7AC2B9"))

p + ggtitle("GC content")

pdf(gcPlot, width=9, height = 8)
    print(p + font + stat_compare_means(p.adjust.method = "fdr", comparisons = my_comparisons))
dev.off()

svg 
  2

## Plot CpG
p <- ggplot(df1, aes(factor(group), y = CpG)) + 
  geom_violin(aes(fill=factor(group), 
                  color=factor(group)),
              position = "dodge") +
  geom_boxplot(aes(color=factor(group)),
               outlier.shape = NA,
               width = .15, 
               notch = TRUE,
               fill=c("#E7EEF6","#D4C7E2","#D3BFD2", "#FCF8EC","#E4F3F1"),
               position = position_dodge(width=.1)) + 
  theme_bw() + xlab("") + ylab("CpG") + 
  scale_color_manual(values = c('#9EBCDA','#8C6BB1','#4D004B', '#E9C46A','#2A9D8F')) +
  scale_fill_manual(values = c("#B6CDE3","#A990C4","#7A4078" ,"#F1DAA2","#7AC2B9"))

p + ggtitle("CpG content")

pdf(cpgPlot, width = 9, height = 8)
  print(p + font + stat_compare_means(p.adjust.method = "fdr", comparisons = my_comparisons))
  dev.off()

svg 
  2

GO enrichment for high-confidence promoter-associated peaks and enhancer-associated peaks

promoter = fread(paste0(annot_dir, "cluster1_promoter_annotationConvertedIDs.txt"))
enhancer = fread(paste0(annot_dir, "dunnart_enhancer_annotationConvertedIDs.txt"))
data = list(promoter, enhancer)
genes = lapply(data, function(i) as.data.frame(i)$mouseensembl)
genes = lapply(genes, function(i) unique(i))
names(genes) = c("promoter-associated peaks", "enhancer-associated peaks")
go_cluster <- setReadable(
      compareCluster(
        geneCluster = genes, 
        fun = enrichGO, 
        ont="BP",
        universe = bg, 
        keyType="ENSEMBL", 
        pvalueCutoff = 0.001, 
        OrgDb = org.Mm.eg.db),
      OrgDb = org.Mm.eg.db, 
      keyType="ENSEMBL")

Plot GO enrichment

p <- dotplot(go_cluster, showCategory = 10) + scale_color_viridis() +
  theme(axis.text.x = element_text(angle = 45, hjust=1))

pdf(paste0(plot_dir, "dunnart_go_cluster.pdf"), width = 11, height = 9)
print(p)
dev.off()

svg 
  2

write.table(go_cluster@compareClusterResult, file = paste0(annot_dir, "dunnart_peak_mm10_enrichGO_compareCluster_results.txt"), sep="\t", quote=F, row.names=F)

suppressWarnings(genes_ent <- lapply(genes, function(x) bitr(geneID=x, fromType="ENSEMBL", toType="ENTREZID", OrgDb="org.Mm.eg.db" )))

genes_ent = lapply(genes_ent, function(x) x$ENTREZID)

suppressWarnings(bg_ent <- bitr(geneID=bg, fromType="ENSEMBL", toType="ENTREZID", OrgDb="org.Mm.eg.db" ))
bg_entrez = bg_ent$ENTREZID

kegg_cluster <- 
  compareCluster(
  geneCluster = genes_ent, 
  fun = 'enrichKEGG', 
  universe = bg_entrez, 
  pvalueCutoff = 0.001, 
  organism = 'mmu')

p <- dotplot(kegg_cluster, showCategory = 10) + scale_color_viridis() + 
  theme(axis.text.x = element_text(angle = 45, hjust=1))

pdf(paste0(plot_dir, "dunnart_kegg_cluster.pdf"), width = 8, height = 6)
print(p)
dev.off()

svg 
  2

split <- split(as.data.frame(go_cluster@compareClusterResult), 
               f = factor(go_cluster@compareClusterResult$Cluster))
go_terms = lapply(split, function(x) x$ID)

pdf(paste0(plot_dir, "dunnart_simplifyGO.pdf"), width = 10, height = 14)
simplifyGOFromMultipleLists(go_terms, 
                                 db = org.Mm.eg.db,  
                                 measure = "Wang", 
                                 method = "binary_cut")

334/334 GO IDs left for clustering.
Cluster 334 terms by 'binary_cut'... 27 clusters, used 0.1066508 secs.

dev.off()

svg 
  2

sessionInfo()

R version 4.1.0 (2021-05-18)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Red Hat Enterprise Linux

Matrix products: default
BLAS/LAPACK: /usr/local/easybuild-2019/easybuild/software/compiler/gcc/10.2.0/openblas/0.3.12/lib/libopenblas_haswellp-r0.3.12.so

locale:
 [1] LC_CTYPE=en_AU.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_AU.UTF-8        LC_COLLATE=en_AU.UTF-8    
 [5] LC_MONETARY=en_AU.UTF-8    LC_MESSAGES=en_AU.UTF-8   
 [7] LC_PAPER=en_AU.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_AU.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] grid      stats4    stats     graphics  grDevices utils     datasets 
[8] methods   base     

other attached packages:
 [1] org.Hs.eg.db_3.14.0      formatR_1.11             simplifyEnrichment_1.4.0
 [4] NbClust_3.0              factoextra_1.0.7         multcompView_0.1-8      
 [7] gghalves_0.1.1           hrbrthemes_0.8.0         viridis_0.6.2           
[10] viridisLite_0.4.0        VennDiagram_1.7.1        futile.logger_1.4.3     
[13] reshape2_1.4.4           ggridges_0.5.3           data.table_1.14.0       
[16] RColorBrewer_1.1-2       forcats_0.5.1            stringr_1.4.0           
[19] dplyr_1.0.8              purrr_0.3.4              readr_1.4.0             
[22] tidyr_1.1.3              tibble_3.1.2             tidyverse_1.3.1         
[25] plyr_1.8.6               org.Mm.eg.db_3.14.0      AnnotationHub_3.2.2     
[28] BiocFileCache_2.2.1      dbplyr_2.1.1             ggpubr_0.4.0            
[31] ggplot2_3.3.3            ChIPseeker_1.30.3        clusterProfiler_4.2.2   
[34] AnnotationForge_1.36.0   rtracklayer_1.54.0       GenomicFeatures_1.46.5  
[37] AnnotationDbi_1.56.2     Biobase_2.54.0           GenomicRanges_1.46.1    
[40] GenomeInfoDb_1.30.1      IRanges_2.28.0           S4Vectors_0.32.3        
[43] BiocGenerics_0.40.0      workflowr_1.7.0         

loaded via a namespace (and not attached):
  [1] rappdirs_0.3.3                         
  [2] bit64_4.0.5                            
  [3] knitr_1.33                             
  [4] DelayedArray_0.20.0                    
  [5] doParallel_1.0.16                      
  [6] KEGGREST_1.34.0                        
  [7] RCurl_1.98-1.3                         
  [8] generics_0.1.0                         
  [9] callr_3.7.0                            
 [10] lambda.r_1.2.4                         
 [11] RSQLite_2.2.7                          
 [12] shadowtext_0.1.1                       
 [13] bit_4.0.4                              
 [14] enrichplot_1.14.2                      
 [15] xml2_1.3.2                             
 [16] lubridate_1.7.10                       
 [17] httpuv_1.6.1                           
 [18] ggsci_2.9                              
 [19] SummarizedExperiment_1.24.0            
 [20] assertthat_0.2.1                       
 [21] xfun_0.23                              
 [22] hms_1.1.0                              
 [23] jquerylib_0.1.4                        
 [24] evaluate_0.14                          
 [25] promises_1.2.0.1                       
 [26] fansi_0.5.0                            
 [27] restfulr_0.0.13                        
 [28] progress_1.2.2                         
 [29] caTools_1.18.2                         
 [30] readxl_1.3.1                           
 [31] igraph_1.2.6                           
 [32] DBI_1.1.1                              
 [33] ellipsis_0.3.2                         
 [34] backports_1.2.1                        
 [35] RcppParallel_5.1.4                     
 [36] biomaRt_2.50.3                         
 [37] MatrixGenerics_1.6.0                   
 [38] vctrs_0.3.8                            
 [39] abind_1.4-5                            
 [40] cachem_1.0.5                           
 [41] withr_2.4.2                            
 [42] ggforce_0.3.3                          
 [43] GenomicAlignments_1.30.0               
 [44] treeio_1.18.1                          
 [45] prettyunits_1.1.1                      
 [46] cluster_2.1.2                          
 [47] DOSE_3.20.1                            
 [48] ape_5.5                                
 [49] lazyeval_0.2.2                         
 [50] crayon_1.4.1                           
 [51] labeling_0.4.2                         
 [52] slam_0.1-48                            
 [53] pkgconfig_2.0.3                        
 [54] tweenr_1.0.2                           
 [55] nlme_3.1-152                           
 [56] rlang_1.0.2                            
 [57] lifecycle_1.0.1                        
 [58] downloader_0.4                         
 [59] filelock_1.0.2                         
 [60] extrafontdb_1.0                        
 [61] modelr_0.1.8                           
 [62] cellranger_1.1.0                       
 [63] rprojroot_2.0.2                        
 [64] polyclip_1.10-0                        
 [65] matrixStats_0.61.0                     
 [66] Matrix_1.3-4                           
 [67] aplot_0.1.2                            
 [68] carData_3.0-4                          
 [69] boot_1.3-28                            
 [70] reprex_2.0.0                           
 [71] GlobalOptions_0.1.2                    
 [72] whisker_0.4                            
 [73] processx_3.5.2                         
 [74] png_0.1-7                              
 [75] rjson_0.2.20                           
 [76] bitops_1.0-7                           
 [77] getPass_0.2-2                          
 [78] KernSmooth_2.23-20                     
 [79] Biostrings_2.62.0                      
 [80] blob_1.2.1                             
 [81] shape_1.4.6                            
 [82] qvalue_2.26.0                          
 [83] rstatix_0.7.0                          
 [84] gridGraphics_0.5-1                     
 [85] ggsignif_0.6.1                         
 [86] scales_1.1.1                           
 [87] memoise_2.0.0                          
 [88] magrittr_2.0.1                         
 [89] gplots_3.1.1                           
 [90] zlibbioc_1.40.0                        
 [91] compiler_4.1.0                         
 [92] scatterpie_0.1.7                       
 [93] BiocIO_1.4.0                           
 [94] clue_0.3-59                            
 [95] plotrix_3.8-1                          
 [96] Rsamtools_2.10.0                       
 [97] cli_2.5.0                              
 [98] XVector_0.34.0                         
 [99] patchwork_1.1.1                        
[100] ps_1.6.0                               
[101] mgcv_1.8-36                            
[102] MASS_7.3-54                            
[103] tidyselect_1.1.1                       
[104] stringi_1.6.2                          
[105] highr_0.9                              
[106] yaml_2.2.1                             
[107] GOSemSim_2.20.0                        
[108] ggrepel_0.9.1                          
[109] sass_0.4.0                             
[110] fastmatch_1.1-0                        
[111] tools_4.1.0                            
[112] parallel_4.1.0                         
[113] rio_0.5.26                             
[114] circlize_0.4.12                        
[115] rstudioapi_0.13                        
[116] foreach_1.5.1                          
[117] foreign_0.8-81                         
[118] TxDb.Hsapiens.UCSC.hg19.knownGene_3.2.2
[119] git2r_0.28.0                           
[120] gridExtra_2.3                          
[121] farver_2.1.0                           
[122] ggraph_2.0.5                           
[123] proxyC_0.2.4                           
[124] digest_0.6.27                          
[125] BiocManager_1.30.16                    
[126] shiny_1.6.0                            
[127] Rcpp_1.0.6                             
[128] car_3.0-10                             
[129] broom_0.7.6                            
[130] BiocVersion_3.14.0                     
[131] later_1.2.0                            
[132] httr_1.4.2                             
[133] gdtools_0.2.4                          
[134] ComplexHeatmap_2.10.0                  
[135] colorspace_2.0-1                       
[136] rvest_1.0.0                            
[137] XML_3.99-0.6                           
[138] fs_1.5.0                               
[139] splines_4.1.0                          
[140] yulab.utils_0.0.4                      
[141] tidytree_0.3.9                         
[142] graphlayouts_0.7.1                     
[143] ggplotify_0.1.0                        
[144] systemfonts_1.0.4                      
[145] xtable_1.8-4                           
[146] jsonlite_1.7.2                         
[147] ggtree_3.2.1                           
[148] futile.options_1.0.1                   
[149] tidygraph_1.2.0                        
[150] NLP_0.2-1                              
[151] ggfun_0.0.5                            
[152] R6_2.5.0                               
[153] tm_0.7-8                               
[154] pillar_1.6.1                           
[155] htmltools_0.5.1.1                      
[156] mime_0.10                              
[157] glue_1.4.2                             
[158] fastmap_1.1.0                          
[159] BiocParallel_1.28.3                    
[160] codetools_0.2-18                       
[161] interactiveDisplayBase_1.32.0          
[162] fgsea_1.20.0                           
[163] utf8_1.2.1                             
[164] lattice_0.20-44                        
[165] bslib_0.2.5.1                          
[166] curl_4.3.1                             
[167] gtools_3.8.2                           
[168] magick_2.7.2                           
[169] zip_2.2.0                              
[170] GO.db_3.14.0                           
[171] openxlsx_4.2.3                         
[172] Rttf2pt1_1.3.8                         
[173] rmarkdown_2.8                          
[174] munsell_0.5.0                          
[175] GetoptLong_1.0.5                       
[176] DO.db_2.9                              
[177] GenomeInfoDbData_1.2.7                 
[178] iterators_1.0.13                       
[179] haven_2.4.1                            
[180] gtable_0.3.0                           
[181] extrafont_0.17

Dunnart peak characterisation

lecook

2022-02-23