Last updated: 2022-05-26
Checks: 6 1
Knit directory:
apocrine_signature_mdamb453/
This reproducible R Markdown analysis was created with workflowr (version 1.7.0). The Checks tab describes the reproducibility checks that were applied when the results were created. The Past versions tab lists the development history.
The R Markdown file has unstaged changes. To know which version of
the R Markdown file created these results, you’ll want to first commit
it to the Git repo. If you’re still working on the analysis, you can
ignore this warning. When you’re finished, you can run
wflow_publish to commit the R Markdown file and build the
HTML.
Great job! The global environment was empty. Objects defined in the global environment can affect the analysis in your R Markdown file in unknown ways. For reproduciblity it’s best to always run the code in an empty environment.
The command set.seed(20220427) was run prior to running
the code in the R Markdown file. Setting a seed ensures that any results
that rely on randomness, e.g. subsampling or permutations, are
reproducible.
Great job! Recording the operating system, R version, and package versions is critical for reproducibility.
Nice! There were no cached chunks for this analysis, so you can be confident that you successfully produced the results during this run.
Great job! Using relative paths to the files within your workflowr project makes it easier to run your code on other machines.
Great! You are using Git for version control. Tracking code development and connecting the code version to the results is critical for reproducibility.
The results in this page were generated with repository version 8836090. See the Past versions tab to see a history of the changes made to the R Markdown and HTML files.
Note that you need to be careful to ensure that all relevant files for
the analysis have been committed to Git prior to generating the results
(you can use wflow_publish or
wflow_git_commit). workflowr only checks the R Markdown
file, but you know if there are other scripts or data files that it
depends on. Below is the status of the Git repository when the results
were generated:
Ignored files:
Ignored: .Rhistory
Ignored: .Rproj.user/
Ignored: data/bigwig/
Untracked files:
Untracked: output/all4_h3k27ac.fa
Untracked: output/all4_no_h3k27ac.fa
Untracked: output/ar_foxa1_gata3_no_tfap2b.fa
Untracked: output/ar_foxa1_gata3_tfap2b.fa
Untracked: output/meme/
Untracked: output/streme/
Unstaged changes:
Modified: analysis/motif_analysis.Rmd
Note that any generated files, e.g. HTML, png, CSS, etc., are not included in this status report because it is ok for generated content to have uncommitted changes.
These are the previous versions of the repository in which changes were
made to the R Markdown (analysis/motif_analysis.Rmd) and
HTML (docs/motif_analysis.html) files. If you’ve configured
a remote Git repository (see ?wflow_git_remote), click on
the hyperlinks in the table below to view the files as they were in that
past version.
| File | Version | Author | Date | Message |
|---|---|---|---|---|
| Rmd | 8836090 | Steve Pederson | 2022-05-26 | Added motif centrality |
| html | 8836090 | Steve Pederson | 2022-05-26 | Added motif centrality |
This step takes the peaks defined previously and
As AR is only found bound in the presence of DHT, all peaks are taken using DHT-treatment as the reference condition.
library(memes)
library(universalmotif)
library(tidyverse)
library(plyranges)
library(extraChIPs)
library(rtracklayer)
library(magrittr)
library(BSgenome.Hsapiens.UCSC.hg19)
library(scales)
library(pander)
library(glue)
library(cowplot)
library(parallel)
library(gt)
options(meme_bin = "/opt/meme/bin/")
check_meme_install()
theme_set(theme_bw())
panderOptions("big.mark", ",")
stringSetToViews <- function(x) {
stopifnot(is(x, "XStringSet"))
w <- unique(width(x))
stopifnot(length(w) == 1)
n <- length(x)
starts <- (seq_along(x) - 1) * w + 1
seq <- unlist(x)
Views(subject = seq, start = starts, width = w)
}
getBestMatch <- function(x, subject, min.score = "80%", rc = TRUE, ...) {
pwm <- NULL
if (is.matrix(x)) pwm <- x
if (is(x, "universalmotif")) pwm <- x@motif
stopifnot(!is.null(pwm))
stopifnot(is(subject, "XStringViews"))
w <- unique(width(subject))
stopifnot(length(w) == 1)
matches <- matchPWM(
pwm, subject, min.score = min.score, with.score = TRUE, ...
)
df <- as.data.frame(matches)
df$score <- mcols(matches)$score
df$id <- ceiling(df$start/ w)
df$pos <- df$start - (df$id - 1) * w
if (rc) {
rc_matches <- matchPWM(
pwm, subject = reverseComplement(subject),
min.score = min.score, with.score = TRUE,
...
)
rc_df <- as.data.frame(rc_matches)
rc_df$score <- mcols(rc_matches)$score
rc_df$start <- max(end(subject)) - end(rc_matches)
rc_df$end <- max(end(subject)) - start(rc_matches)
rc_df$id <- ceiling(rc_df$start/ w)
rc_df$pos <- rc_df$start - (rc_df$id - 1) * w + 1
df <- bind_rows(df, rc_df)
}
df <- arrange(df, id, desc(score))
df <- distinct(df, id, .keep_all = TRUE)
df <- dplyr::select(df, id, pos, seq)
as_tibble(df)
}
hg19 <- BSgenome.Hsapiens.UCSC.hg19
sq19 <- seqinfo(hg19)
all_gr <- here::here("data", "annotations", "all_gr.rds") %>%
read_rds() %>%
lapply(
function(x) {
seqinfo(x) <- sq19
x
}
)
counts <- here::here("data", "rnaseq", "counts.out.gz") %>%
read_tsv(skip = 1) %>%
dplyr::select(Geneid, ends_with("bam")) %>%
dplyr::rename(gene_id = Geneid)
detected <- all_gr$gene %>%
as_tibble() %>%
distinct(gene_id, gene_name) %>%
left_join(counts) %>%
pivot_longer(
cols = ends_with("bam"), names_to = "sample", values_to = "counts"
) %>%
mutate(detected = counts > 0) %>%
group_by(gene_id, gene_name) %>%
summarise(detected = mean(detected) > 0.25, .groups = "drop") %>%
dplyr::filter(detected) %>%
dplyr::select(starts_with("gene"))
Pre-existing RNA-seq data was again loaded and the set of 21,173 detected genes was defined as those with \(\geq 1\) aligned read in at least 25% of samples. This dataset represented the DHT response in MDA-MB-453 cells, however, given the existence of both Vehicle and DHT conditions, represents a good representative measure of the polyadenylated components of the transcriptome.
features <- here::here("data", "h3k27ac") %>%
list.files(full.names = TRUE, pattern = "bed$") %>%
sapply(import.bed, seqinfo = sq19) %>%
lapply(granges) %>%
setNames(basename(names(.))) %>%
setNames(str_remove_all(names(.), "s.bed")) %>%
GRangesList() %>%
unlist() %>%
names_to_column("feature") %>%
sort()
The set of H3K27ac-defined features was again used as a marker of regulatory activity, with 16,544 and 15,286 regions defined as enhancers or promoters respectively.
meme_db <- here::here("data", "external", "JASPAR2022_CORE_Hsapiens.meme") %>%
read_meme() %>%
to_df() %>%
as_tibble() %>%
mutate(
across(everything(), vctrs::vec_proxy),
altname = str_remove(altname, "MA[0-9]+\\.[0-9]*\\."),
organism = "Homo sapiens"
)
meme_db_detected <- meme_db %>%
mutate(
gene_name = str_split(altname, pattern = "::", n = 2)
) %>%
dplyr::filter(
vapply(
gene_name,
function(x) all(x %in% detected$gene_name),
logical(1)
)
)
The motif database was initially defined as all 727 CORE motifs found in Homo sapiens, as contained in JASPAR2022. In order to minimise irrelevant results, only the 465 motifs associated with a detected gene were then retained.
dht_peaks <- here::here("data", "peaks") %>%
list.files(recursive = TRUE, pattern = "oracle", full.names = TRUE) %>%
sapply(read_rds, simplify = FALSE) %>%
lapply(function(x) x[["DHT"]]) %>%
lapply(setNames, nm = c()) %>%
setNames(str_extract_all(names(.), "AR|FOXA1|GATA3|TFAP2B")) %>%
lapply(
function(x) {
seqinfo(x) <- sq19
x
}
)
targets <- names(dht_peaks)
dht_consensus <- dht_peaks %>%
lapply(granges) %>%
GRangesList() %>%
unlist() %>%
reduce() %>%
mutate(
AR = overlapsAny(., dht_peaks$AR),
FOXA1 = overlapsAny(., dht_peaks$FOXA1),
GATA3 = overlapsAny(., dht_peaks$GATA3),
TFAP2B = overlapsAny(., dht_peaks$TFAP2B),
promoter = bestOverlap(., features, var = "feature", missing = "None") == "promoter",
enhancer = bestOverlap(., features, var = "feature", missing = "None") == "enhancer",
H3K27ac = promoter | enhancer
)
a <- dht_peaks %>%
lapply(mutate, w = width) %>%
lapply(as_tibble) %>%
bind_rows(.id = "target") %>%
ggplot(aes(w, stat(density), colour = target)) +
geom_density() +
scale_x_log10(breaks = c(100, 300, 600, 1e3, 2e3, 3e3), labels = comma) +
labs(
x = "Peak Width (bp)", y = "Density", colour = "Target"
)
b <- dht_peaks %>%
lapply(mutate, w = width) %>%
lapply(as_tibble) %>%
bind_rows(.id = "target") %>%
mutate(skew = 0.5 - peak / w) %>%
ggplot(aes(skew, stat(density), colour = target)) +
geom_density() +
scale_x_continuous(labels = percent) +
labs(
x = "% Width Between Peak Summit and Peak Centre",
y = "Density",
colour = "ChIP Target"
)
plot_grid(
a + theme(legend.position = "none"),
b + theme(legend.position = "none"),
get_legend(b),
rel_widths = c(1, 1, 0.2), nrow = 1,
labels = c("A", "B")
)
Distributions of A) peak width and B) summit location with peaks.
| Version | Author | Date |
|---|---|---|
| 8836090 | Steve Pederson | 2022-05-26 |
grl_by_targets <- dht_consensus %>%
as_tibble() %>%
pivot_longer(
all_of(targets), names_to = "target", values_to = "detected"
) %>%
dplyr::filter(detected) %>%
group_by(range, H3K27ac) %>%
summarise(
targets = paste(sort(target), collapse = " "),
.groups = "drop"
) %>%
colToRanges(var = "range", seqinfo = sq19) %>%
sort() %>%
splitAsList(.$targets)
Union ranges were grouped by the combination of detected AR, FOXA1, GATA3 and TFAP2B binding, then sequences obtained corresponding to each union range.
grl_by_targets %>%
lapply(
function(x) {
tibble(
`N Sites` = length(x),
`% H3K27ac` = mean(x$H3K27ac),
`Median Width` = floor(median(width(x)))
)
}
) %>%
bind_rows(.id = "Targets") %>%
mutate(
`N Targets` = str_count(Targets, " ") + 1,
`% H3K27ac` = percent(`% H3K27ac`, 0.1)
) %>%
dplyr::select(contains("Targets"), everything()) %>%
arrange(`N Targets`, Targets) %>%
pander(
justify = "lrrrr",
caption = "Summary of union ranges and which combinations of the targets were detected."
)
| Targets | N Targets | N Sites | % H3K27ac | Median Width |
|---|---|---|---|---|
| AR | 1 | 395 | 13.9% | 241 |
| FOXA1 | 1 | 59,404 | 12.8% | 319 |
| GATA3 | 1 | 4,284 | 19.6% | 234 |
| TFAP2B | 1 | 6,655 | 80.1% | 301 |
| AR FOXA1 | 2 | 3,543 | 17.8% | 505 |
| AR GATA3 | 2 | 158 | 16.5% | 310 |
| AR TFAP2B | 2 | 135 | 55.6% | 414 |
| FOXA1 GATA3 | 2 | 9,760 | 18.1% | 479 |
| FOXA1 TFAP2B | 2 | 3,849 | 64.6% | 513 |
| GATA3 TFAP2B | 2 | 396 | 60.6% | 368 |
| AR FOXA1 GATA3 | 3 | 6,518 | 25.2% | 683 |
| AR FOXA1 TFAP2B | 3 | 1,786 | 62.9% | 639 |
| AR GATA3 TFAP2B | 3 | 61 | 55.7% | 419 |
| FOXA1 GATA3 TFAP2B | 3 | 1,441 | 52.0% | 575 |
| AR FOXA1 GATA3 TFAP2B | 4 | 6,336 | 58.2% | 814 |
meme_db_targets <- meme_db_detected %>%
dplyr::filter(altname %in% targets) %>%
to_list()
seq_by_targets <- grl_by_targets %>%
get_sequence(hg19)
ame_by_targets <- seq_by_targets %>%
runAme(database = meme_db_targets)
AME from the meme-suite was then run on each sequence to detect the presence of motifs associated with each target within each set of sequences.
ame_by_targets %>%
bind_rows(.id = "targets") %>%
mutate(
motif_id = as.factor(motif_id),
targets = as.factor(targets),
TFAP2B = ifelse(
str_detect(targets, "TFAP2B"), "TFAP2B", "No TFAP2B"
)
) %>%
dplyr::filter(adj.pvalue < 0.05) %>%
plot_ame_heatmap(
id = motif_id,
value = tp_percent,
group = targets
) +
geom_label(
aes(label = percent(tp_percent/100, 0.1)),
size = 3,
show.legend = FALSE
) +
facet_grid(TFAP2B ~ motif_alt_id, scales = "free", space = "free") +
scale_x_discrete(expand = expansion(c(0, 0))) +
scale_y_discrete(expand = expansion(c(0, 0))) +
labs(
x = "Detected Motif",
y = "Detected Targets",
fill = expression(widehat("TP%"))
) +
theme(
axis.text.x = element_text(angle = 0, hjust = 0.5, size = 10),
axis.text.y = element_text(angle = 0, hjust = 1, size = 10),
panel.grid = element_blank()
)
Estimated percentage of true positive matches to all binding motifs associated with each ChIP target. Only values with an enrichment adjusted p-value < 0.05 are shown, such that missing values are able to be confidently assumed as not enriched for the motif. With the possible exception of GATA3 binding in the presence of AR and FOXA1 only, all provide supporting evidence for direct binding of each target to it’s motif
| Version | Author | Date |
|---|---|---|
| 8836090 | Steve Pederson | 2022-05-26 |
seq_width <- 200
min_score <- percent(0.75, accuracy = 1)
In order to assess any association in binding patterns, the binding peak (i.e.) summit of each ChIP target was taken as a viewpoint and the centrality of the best matching sequence motif for that target was assessed. Sequences were extended 200bp from each summit and the best sequence match to each binding motif were retained. Sequence matches with a score > 75% of the highest possible score were initially included before reducing matches to those with the highest score.
The motifs obtained form the larger database were the 6 motifs defined as candidate binding sites for each of the 4 ChIP targets.
meme_db_targets %>%
lapply(
function(x) {
tibble(
motif_id = x@name,
target = x@altname,
IC = x@icscore,
length = ncol(x@motif)
)
}
) %>%
bind_rows() %>%
arrange(target, desc(IC)) %>%
dplyr::rename(
`Motif ID` = motif_id, Target = target, `Length (bp)` = length
) %>%
pander(
justify = "llrr",
caption = paste(
"Motifs use for checking defined target binding sites. Motifs are ranked",
"within each ChIP target by information content (IC)."
)
)
| Motif ID | Target | IC | Length (bp) |
|---|---|---|---|
| MA0007.2 | AR | 13.27 | 15 |
| MA0148.4 | FOXA1 | 11.41 | 12 |
| MA0037.3 | GATA3 | 9.749 | 8 |
| MA0813.1 | TFAP2B | 14.31 | 13 |
| MA0812.1 | TFAP2B | 12.56 | 11 |
| MA0811.1 | TFAP2B | 10.86 | 12 |
plot_grid(
plot_grid(
plotlist = lapply(
meme_db_targets %>%
to_df() %>%
arrange(altname) %>%
dplyr::filter(altname != "TFAP2B") %>%
to_list() %>%
lapply(
function(x){
x@name <- paste0(
x@name, ": ", x@altname, " (IC = ",
round(x@icscore, 1), ")"
)
x
}
),
function(x) {
view_motifs(x) +
ggtitle(x@name) +
theme(plot.title = element_text(hjust = 0.5, size = 11))
}
),
labels = c("A", "B", "C"),
ncol = 1
),
plot_grid(
meme_db_targets %>%
to_df() %>%
dplyr::filter(altname == "TFAP2B") %>%
to_list() %>%
lapply(
function(x){
x@name <- paste0(
x@name, ": ", x@altname, " (IC = ",
round(x@icscore, 1), ")"
)
x
}
) %>%
view_motifs() +
theme(plot.title = element_text(hjust = 0.5, size = 9))
),
ncol = 2,
labels = c("", "D")
)
All motifs matching the set of ChIP targets. Motifc in A-C are not aligned with each other, whilst all TFAP2B motifs are shown using the best alignment.
ar_summits <- dht_peaks$AR %>%
resize(width = 1, fix = 'start') %>%
shift(shift = .$peak) %>%
resize(width = seq_width, fix = 'center') %>%
granges() %>%
mutate(
AR = overlapsAny(., subset(dht_consensus, AR)),
FOXA1 = overlapsAny(., subset(dht_consensus, FOXA1)),
GATA3 = overlapsAny(., subset(dht_consensus, GATA3)),
TFAP2B = overlapsAny(., subset(dht_consensus, TFAP2B)),
H3K27ac = overlapsAny(., features)
)
seq_ar_summits <- ar_summits %>%
get_sequence(hg19)
views_ar_summits <- stringSetToViews(seq_ar_summits)
matches_ar_summits <- meme_db_targets %>%
lapply(
getBestMatch,
subject = views_ar_summits,
min.score = min_score
) %>%
setNames(
vapply(meme_db_targets, function(x) x@name, character(1))
) %>%
bind_rows(.id = "name") %>%
left_join(
dplyr::select(meme_db_detected, name, altname),
by = "name"
) %>%
left_join(
ar_summits %>%
as_tibble() %>%
mutate(id =seq_along(range)),
by = "id"
) %>%
arrange(id) %>%
dplyr::select(range, any_of(targets), H3K27ac, name, altname, seq, pos)
matches_ar_summits %>%
mutate(name = glue("{name}\n({altname})")) %>%
ggplot(aes(pos, stat(density), colour = altname)) +
geom_density(aes(linetype = H3K27ac)) +
geom_text(
aes(x = seq_width/2, y = 0.001, label = lab),
data = . %>%
group_by(name) %>%
summarise(
n = dplyr::n(), .groups = "drop") %>%
mutate(lab = glue("n = {comma(n)}\n({percent(n/length(ar_summits), 0.1)})")),
colour = "black",
inherit.aes = FALSE
) +
facet_wrap(~name) +
scale_linetype_manual(values = c(2, 1)) +
labs(
x = "Best Match Position In Sequence",
y = "Density",
colour = "Motif Target"
)
Position of the best match for each motif using sequences centred at the summits of AR binding sites. Sites where H3K27ac signal was not found are shown as dashed lines. No visual difference in AR motif centrality was detected due to H3K27ac signal. A slight tendency for the best FOXA1 to also appear centrally was noted.
Across all motifs which were considered to represent possible matches to the binding site of AR, matches to one or more of the motifs were found in 13,432 (69.2%) of the 19,406 queried sequences.
foxa1_summits <- dht_peaks$FOXA1 %>%
resize(width = 1, fix = 'start') %>%
shift(shift = .$peak) %>%
resize(width = seq_width, fix = 'center') %>%
granges() %>%
mutate(
AR = overlapsAny(., subset(dht_consensus, AR)),
FOXA1 = overlapsAny(., subset(dht_consensus, FOXA1)),
GATA3 = overlapsAny(., subset(dht_consensus, GATA3)),
TFAP2B = overlapsAny(., subset(dht_consensus, TFAP2B)),
H3K27ac = overlapsAny(., features)
)
seq_foxa1_summits <- foxa1_summits %>%
get_sequence(hg19)
views_foxa1_summits <- stringSetToViews(seq_foxa1_summits)
matches_foxa1_summits <- meme_db_targets %>%
lapply(
getBestMatch,
subject = views_foxa1_summits,
min.score = min_score
) %>%
setNames(
vapply(meme_db_targets, function(x) x@name, character(1))
) %>%
bind_rows(.id = "name") %>%
left_join(
dplyr::select(meme_db_detected, name, altname),
by = "name"
) %>%
left_join(
foxa1_summits %>%
as_tibble() %>%
mutate(id =seq_along(range)),
by = "id"
) %>%
arrange(id) %>%
dplyr::select(range, any_of(targets), H3K27ac, name, altname, seq, pos)
matches_foxa1_summits %>%
mutate(name = glue("{name}\n({altname})")) %>%
ggplot(aes(pos, stat(density), colour = altname)) +
geom_density(aes(linetype = H3K27ac)) +
geom_text(
aes(x = seq_width/2, y = 0.001, label = lab),
data = . %>%
group_by(name) %>%
summarise(
n = dplyr::n(), .groups = "drop") %>%
mutate(lab = glue("n = {comma(n)}\n({percent(n/length(foxa1_summits), 0.1)})")),
colour = "black",
inherit.aes = FALSE
) +
facet_wrap(~name) +
scale_linetype_manual(values = c(2, 1)) +
labs(
x = "Best Match Position In Sequence",
y = "Density",
colour = "Motif Target"
)
Position of the best match for each motif using sequences centred at the summits of FOXA1 binding sites. Sites where H3K27ac signal was not found are shown as dashed lines. No visual difference in FOXA1 motif centrality was detected due to H3K27ac signal. A slight tendency for the best FOXA1 to also appear centrally was noted.
Across all motifs which were considered to represent possible matches to the binding site of FOXA1, matches to one or more of the motifs were found in 87,963 (94.8%) of the 92,751 queried sequences.
gata3_summits <- dht_peaks$GATA3 %>%
resize(width = 1, fix = 'start') %>%
shift(shift = .$peak) %>%
resize(width = seq_width, fix = 'center') %>%
granges() %>%
mutate(
AR = overlapsAny(., subset(dht_consensus, AR)),
FOXA1 = overlapsAny(., subset(dht_consensus, FOXA1)),
GATA3 = overlapsAny(., subset(dht_consensus, GATA3)),
TFAP2B = overlapsAny(., subset(dht_consensus, TFAP2B)),
H3K27ac = overlapsAny(., features)
)
seq_gata3_summits <- gata3_summits %>%
get_sequence(hg19)
views_gata3_summits <- stringSetToViews(seq_gata3_summits)
matches_gata3_summits <- meme_db_targets %>%
lapply(
getBestMatch,
subject = views_gata3_summits,
min.score = min_score
) %>%
setNames(
vapply(meme_db_targets, function(x) x@name, character(1))
) %>%
bind_rows(.id = "name") %>%
left_join(
dplyr::select(meme_db_detected, name, altname),
by = "name"
) %>%
left_join(
gata3_summits %>%
as_tibble() %>%
mutate(id =seq_along(range)),
by = "id"
) %>%
arrange(id) %>%
dplyr::select(range, any_of(targets), H3K27ac, name, altname, seq, pos)
matches_gata3_summits %>%
mutate(name = glue("{name}\n({altname})")) %>%
ggplot(aes(pos, stat(density), colour = altname)) +
geom_density(aes(linetype = H3K27ac)) +
geom_text(
aes(x = seq_width/2, y = 0.001, label = lab),
data = . %>%
group_by(name) %>%
summarise(
n = dplyr::n(), .groups = "drop") %>%
mutate(lab = glue("n = {comma(n)}\n({percent(n/length(gata3_summits), 0.1)})")),
colour = "black",
inherit.aes = FALSE
) +
facet_wrap(~name) +
scale_linetype_manual(values = c(2, 1)) +
labs(
x = "Best Match Position In Sequence",
y = "Density",
colour = "Motif Target"
)
Position of the best match for each motif using sequences centred at the summits of GATA3 binding sites. Sites where H3K27ac signal was not found are shown as dashed lines. No visual difference in GATA3 motif centrality was detected due to H3K27ac signal. A slight tendency for the best FOXA1 to also appear centrally was noted.
Across all motifs which were considered to represent possible matches to the binding site of GATA3, matches to one or more of the motifs were found in 29,441 (99.2%) of the 29,666 queried sequences.
tfap2b_summits <- dht_peaks$TFAP2B %>%
resize(width = 1, fix = 'start') %>%
shift(shift = .$peak) %>%
resize(width = seq_width, fix = 'center') %>%
granges() %>%
mutate(
AR = overlapsAny(., subset(dht_consensus, AR)),
FOXA1 = overlapsAny(., subset(dht_consensus, FOXA1)),
GATA3 = overlapsAny(., subset(dht_consensus, GATA3)),
H3K27ac = overlapsAny(., features)
)
seq_tfap2b_summits <- tfap2b_summits %>%
get_sequence(hg19)
views_tfap2b_summits <- stringSetToViews(seq_tfap2b_summits)
matches_tfap2b_summits <- meme_db_targets %>%
lapply(
getBestMatch,
subject = views_tfap2b_summits,
min.score = min_score
) %>%
setNames(
vapply(meme_db_targets, function(x) x@name, character(1))
) %>%
bind_rows(.id = "name") %>%
left_join(
dplyr::select(meme_db_detected, name, altname),
by = "name"
) %>%
left_join(
tfap2b_summits %>%
as_tibble() %>%
mutate(id =seq_along(range)),
by = "id"
) %>%
arrange(id) %>%
dplyr::select(range, any_of(targets), H3K27ac, name, altname, seq, pos)
matches_tfap2b_summits %>%
mutate(name = glue("{name}\n({altname})")) %>%
ggplot(aes(pos, stat(density), colour = altname)) +
geom_density(aes(linetype = H3K27ac)) +
geom_text(
aes(x = seq_width/2, y = 0.001, label = lab),
data = . %>%
group_by(name) %>%
summarise(
n = dplyr::n(), .groups = "drop") %>%
mutate(lab = glue("n = {comma(n)}\n({percent(n/length(tfap2b_summits), 0.1)})")),
colour = "black",
inherit.aes = FALSE
) +
facet_wrap(~name) +
scale_linetype_manual(values = c(2, 1)) +
labs(
x = "Best Match Position In Sequence",
y = "Density",
colour = "Motif Target"
)
Position of the best match for each motif using sequences centred at the summits of TFAP2B binding sites. Sites where H3K27ac signal was not found are shown as dashed lines. No visual difference in TFAP2B motif centrality was detected. All other motifs appeared to be uniformly distributed amongst the sequences.
| Version | Author | Date |
|---|---|---|
| 8836090 | Steve Pederson | 2022-05-26 |
Across all motifs which were considered to represent possible matches to the binding site of TFAP2\(\beta\), matches to one or more of the motifs were found in 20,382 (97.0%) of the 21,008 queried sequences.
The above steps essentially perform checks on the data and verify that the expected binding motifs are generally driving binding of each target. A further step would be to determine any enriched binding motifs within the union peaks which are able to be confidently assigned to putative co-factors. For this section, key comparisons were chosen as this enables detection of cofactors which may be driving key binding patterns. Novel sequence motifs were first defined using STREME, which uses a suffix tree based approach to determining enrichment of sequence motifs.
A comparison between binding sites where all 4 targets were found was performed based on this showing regulatory activity (overlap with H3K27ac-derived features) and no regulatory activity (i.e. no H3K27ac overlap).
seq_all4_by_h3k27ac <- dht_consensus %>%
filter(!!!syms(targets)) %>%
# resize(width = 800, fix = 'center') %>%
split(.$H3K27ac) %>%
setNames(c("no_h3k27ac", "h3k27ac")) %>%
get_sequence(hg19) %>%
as.list()
streme_all4_by_h3k27ac_rds <- here::here(
"output", "streme", "all4_by_h3k27ac.rds"
)
if (!dir.exists(dirname(streme_all4_by_h3k27ac_rds))) dir.create(
dirname(streme_all4_by_h3k27ac_rds)
)
if (!file.exists(streme_all4_by_h3k27ac_rds)) {
streme_all4_by_h3k27ac <- runStreme(
input = seq_all4_by_h3k27ac$h3k27ac,
control = seq_all4_by_h3k27ac$no_h3k27ac,
thresh = 0.01
) %>%
as_tibble() %>%
mutate(
eval = case_when(
is.na(eval) ~ pval * nrow(.),
!is.na(eval) ~ eval
)
)
write_rds(
x = streme_all4_by_h3k27ac,
file = streme_all4_by_h3k27ac_rds
)
} else {
streme_all4_by_h3k27ac <- read_rds(streme_all4_by_h3k27ac_rds)
}
tomtom_all4_by_h3k27ac <- streme_all4_by_h3k27ac %>%
to_list() %>%
runTomTom(
database = to_list(meme_db_detected),
thresh = 0.1
)
streme_to_keep_all4_by_h3k27ac <- tomtom_all4_by_h3k27ac %>%
lapply(function(x) x$best_match_name) %>%
vapply(length, integer(1)) %>%
is_greater_than(0)
Using the default setting of STREME, 40 candidate sequence motifs were detected. TomTom was subsequently used to match candidate motifs with those in the database of detected sequences. To ensure only high-quality candidates were retained, matches with a p-value > 0.1 for similarity were discarded. This strategy gave 7 candidate motifs able to be matched to the database and retained as high-quality motifs.
pl <- tomtom_all4_by_h3k27ac[streme_to_keep_all4_by_h3k27ac] %>%
bind_rows() %>%
arrange(desc(nsites)) %>%
to_list() %>%
lapply(
function(x) {
nm <- paste0(
x@altname,
" (IC: ", round(x@icscore, 1), "; n = ", x@nsites, ")\n",
x@extrainfo[["best_match_altname"]]
)
view_motifs(x) +
ggtitle(nm) +
theme(
plot.title = element_text(hjust = 0.5, size = 8),
axis.text = element_text(size = 8),
axis.title = element_text(size = 8)
)
}
)
plot_grid(plotlist = pl)
All motifs detected by STREME using a threshold of \(p \leq 0.01\) and with high-quality matches to the database of motifs. The name of the best matching motif is given as the second line of the title, although this may not be the true TF which binds to the motif. Motifs are ordered by the number of times they appear in the sequences, noting the motifs with high information content tend to lead to fewer hits.
tomtom_all4_by_h3k27ac[streme_to_keep_all4_by_h3k27ac] %>%
lapply(
function(x) {
tt <- x$tomtom[[1]]
mt <- lapply(
tt$match_motif,
function(um) {
nm <- paste0(um@name, " (", um@altname, ")")
um@name <- nm
um
}
)
x$tomtom[[1]]$match_motif <- mt
x
}
) %>%
lapply(
function(x) {
mutate(x, tomtom_logos = view_tomtom_hits(x))
}
) %>%
bind_rows() %>%
# dplyr::slice(1:2) %>%
dplyr::select(
motif, id = altname, IC = icscore, nsites,
best_match_name, best_match_altname, best_match_motif,
tomtom_logos, tomtom, pval , eval
) %>%
mutate(
tomtom_logos = tomtom_logos %>%
lapply(
function(x) {
n <- length(levels(x$data$motif.id))
ggplot_image(x, height = px(60 + n*60), aspect_ratio = 180 / (100 + n*30))
}
),
`All Matches` = vapply(
tomtom,
function(x) {
paste(x$match_altname, collapse = ", ")
},
character(1)
)
) %>%
dplyr::select(
`STREME ID` = id, `Best Match` = best_match_altname, `All Matches`,
IC, `Found In` = nsites, `P-value` = pval, `E-value` = eval,
`TOMTOM Matches` = tomtom_logos
) %>%
gt(
caption = paste(
"All STREME-detected motifs with putative matches in the motif database.",
"The motif database was restricted to motifs associated with detected genes.",
"The information content (IC) is given along with the number of sequences",
"the motif was found in. The E-value is equivalent to an adjusted p-value",
"for the likelihood of the detected motif being non-artefactual."
),
id = "tomtom"
) %>%
fmt_markdown(contains("TOMTOM")) %>%
fmt_number(contains("val"), decimals = 4) %>%
fmt_number("Found In", decimals = 0, use_seps = TRUE) %>%
fmt_number("IC", decimals = 2) %>%
tab_spanner(
label = "Database Matches",
columns = all_of(c("Best Match", "All Matches"))
) %>%
tab_spanner(
label = "Statistics",
columns = c("IC", "Found In", "P-value", "E-value")
) %>%
text_transform(
locations = cells_body(columns = "All Matches"),
fn = function(x) str_replace_all(x, ", ", "<br>")
) %>%
tab_options(
table.width = pct(100),
table.font.size = px(12)
) %>%
opt_css(
css = "
#tomtom .gt_row{
vertical-align: top
}
"
)
| STREME ID | Database Matches | Statistics | TOMTOM Matches | ||||
|---|---|---|---|---|---|---|---|
| Best Match | All Matches | IC | Found In | P-value | E-value | ||
| STREME-1 | ZNF281 | ZNF281 ZNF148 |
11.84 | 921 | 0.0001 | 0.0024 | |
| STREME-10 | ZNF331 | ZNF331 | 13.75 | 282 | 0.0035 | 0.0840 | |
| STREME-13 | SP1 | SP1 SP2 KLF10 KLF12 |
11.79 | 1,260 | 0.0044 | 0.1056 | |
| STREME-15 | ZNF740 | ZNF740 KLF4 |
11.50 | 1,093 | 0.0071 | 0.1704 | |
| STREME-18 | ZNF148 | ZNF148 | 10.48 | 848 | 0.0088 | 0.2112 | |
| STREME-22 | GABPA | GABPA ELF3 ELF1 |
10.43 | 852 | 0.0680 | 1.6320 | |
| STREME-23 | ZNF449 | ZNF449 | 13.75 | 184 | 0.0870 | 2.0880 | |
sessionInfo()
R version 4.2.0 (2022-04-22)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 20.04.4 LTS
Matrix products: default
BLAS: /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.9.0
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.9.0
locale:
[1] LC_CTYPE=en_AU.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_AU.UTF-8 LC_COLLATE=en_AU.UTF-8
[5] LC_MONETARY=en_AU.UTF-8 LC_MESSAGES=en_AU.UTF-8
[7] LC_PAPER=en_AU.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_AU.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] parallel stats4 stats graphics grDevices utils datasets
[8] methods base
other attached packages:
[1] gt_0.6.0 cowplot_1.1.1
[3] glue_1.6.2 pander_0.6.5
[5] scales_1.2.0 BSgenome.Hsapiens.UCSC.hg19_1.4.3
[7] BSgenome_1.64.0 Biostrings_2.64.0
[9] XVector_0.36.0 magrittr_2.0.3
[11] rtracklayer_1.56.0 extraChIPs_1.0.0
[13] SummarizedExperiment_1.26.1 Biobase_2.56.0
[15] MatrixGenerics_1.8.0 matrixStats_0.62.0
[17] BiocParallel_1.30.2 plyranges_1.16.0
[19] GenomicRanges_1.48.0 GenomeInfoDb_1.32.2
[21] IRanges_2.30.0 S4Vectors_0.34.0
[23] BiocGenerics_0.42.0 forcats_0.5.1
[25] stringr_1.4.0 dplyr_1.0.9
[27] purrr_0.3.4 readr_2.1.2
[29] tidyr_1.2.0 tibble_3.1.7
[31] ggplot2_3.3.6 tidyverse_1.3.1
[33] universalmotif_1.14.0 memes_1.4.1
[35] workflowr_1.7.0
loaded via a namespace (and not attached):
[1] rappdirs_0.3.3 R.methodsS3_1.8.1
[3] csaw_1.30.1 bit64_4.0.5
[5] knitr_1.39 DelayedArray_0.22.0
[7] R.utils_2.11.0 data.table_1.14.2
[9] rpart_4.1.16 KEGGREST_1.36.0
[11] RCurl_1.98-1.6 AnnotationFilter_1.20.0
[13] doParallel_1.0.17 generics_0.1.2
[15] GenomicFeatures_1.48.1 callr_3.7.0
[17] EnrichedHeatmap_1.26.0 usethis_2.1.6
[19] RSQLite_2.2.14 commonmark_1.8.0
[21] bit_4.0.4 tzdb_0.3.0
[23] xml2_1.3.3 lubridate_1.8.0
[25] httpuv_1.6.5 assertthat_0.2.1
[27] xfun_0.31 hms_1.1.1
[29] jquerylib_0.1.4 evaluate_0.15
[31] promises_1.2.0.1 fansi_1.0.3
[33] restfulr_0.0.13 progress_1.2.2
[35] dbplyr_2.1.1 readxl_1.4.0
[37] igraph_1.3.1 DBI_1.1.2
[39] htmlwidgets_1.5.4 ellipsis_0.3.2
[41] backports_1.4.1 biomaRt_2.52.0
[43] vctrs_0.4.1 here_1.0.1
[45] ensembldb_2.20.1 cachem_1.0.6
[47] withr_2.5.0 ggforce_0.3.3
[49] Gviz_1.40.1 checkmate_2.1.0
[51] vroom_1.5.7 GenomicAlignments_1.32.0
[53] prettyunits_1.1.1 cluster_2.1.3
[55] lazyeval_0.2.2 crayon_1.5.1
[57] edgeR_3.38.1 pkgconfig_2.0.3
[59] labeling_0.4.2 tweenr_1.0.2
[61] pkgload_1.2.4 ProtGenerics_1.28.0
[63] nnet_7.3-17 rlang_1.0.2
[65] lifecycle_1.0.1 filelock_1.0.2
[67] BiocFileCache_2.4.0 modelr_0.1.8
[69] dichromat_2.0-0.1 cellranger_1.1.0
[71] rprojroot_2.0.3 polyclip_1.10-0
[73] Matrix_1.4-1 ggseqlogo_0.1
[75] reprex_2.0.1 base64enc_0.1-3
[77] whisker_0.4 GlobalOptions_0.1.2
[79] processx_3.5.3 png_0.1-7
[81] rjson_0.2.21 GenomicInteractions_1.30.0
[83] bitops_1.0-7 cmdfun_1.0.2
[85] getPass_0.2-2 R.oo_1.24.0
[87] blob_1.2.3 shape_1.4.6
[89] jpeg_0.1-9 memoise_2.0.1
[91] zlibbioc_1.42.0 compiler_4.2.0
[93] scatterpie_0.1.7 BiocIO_1.6.0
[95] RColorBrewer_1.1-3 clue_0.3-60
[97] Rsamtools_2.12.0 cli_3.3.0
[99] ps_1.7.0 htmlTable_2.4.0
[101] Formula_1.2-4 MASS_7.3-56
[103] ggside_0.2.0.9990 tidyselect_1.1.2
[105] stringi_1.7.6 highr_0.9
[107] yaml_2.3.5 locfit_1.5-9.5
[109] latticeExtra_0.6-29 ggrepel_0.9.1
[111] grid_4.2.0 sass_0.4.1
[113] VariantAnnotation_1.42.1 tools_4.2.0
[115] circlize_0.4.15 rstudioapi_0.13
[117] foreach_1.5.2 foreign_0.8-82
[119] git2r_0.30.1 metapod_1.4.0
[121] gridExtra_2.3 farver_2.1.0
[123] digest_0.6.29 Rcpp_1.0.8.3
[125] broom_0.8.0 later_1.3.0
[127] httr_1.4.3 AnnotationDbi_1.58.0
[129] biovizBase_1.44.0 ComplexHeatmap_2.12.0
[131] colorspace_2.0-3 rvest_1.0.2
[133] brio_1.1.3 XML_3.99-0.9
[135] fs_1.5.2 splines_4.2.0
[137] jsonlite_1.8.0 ggfun_0.0.6
[139] testthat_3.1.4 R6_2.5.1
[141] Hmisc_4.7-0 pillar_1.7.0
[143] htmltools_0.5.2 fastmap_1.1.0
[145] codetools_0.2-18 utf8_1.2.2
[147] lattice_0.20-45 bslib_0.3.1
[149] curl_4.3.2 survival_3.2-13
[151] limma_3.52.1 rmarkdown_2.14
[153] InteractionSet_1.24.0 desc_1.4.1
[155] munsell_0.5.0 GetoptLong_1.0.5
[157] GenomeInfoDbData_1.2.8 iterators_1.0.14
[159] haven_2.5.0 gtable_0.3.0