Last updated: 2022-05-26

Checks: 6 1

Knit directory: apocrine_signature_mdamb453/

This reproducible R Markdown analysis was created with workflowr (version 1.7.0). The Checks tab describes the reproducibility checks that were applied when the results were created. The Past versions tab lists the development history.

R Markdown file: uncommitted changes

The R Markdown file has unstaged changes. To know which version of the R Markdown file created these results, you’ll want to first commit it to the Git repo. If you’re still working on the analysis, you can ignore this warning. When you’re finished, you can run wflow_publish to commit the R Markdown file and build the HTML.

Environment: empty

Great job! The global environment was empty. Objects defined in the global environment can affect the analysis in your R Markdown file in unknown ways. For reproduciblity it’s best to always run the code in an empty environment.

Seed: set.seed(20220427)

The command set.seed(20220427) was run prior to running the code in the R Markdown file. Setting a seed ensures that any results that rely on randomness, e.g. subsampling or permutations, are reproducible.

Session information: recorded

Great job! Recording the operating system, R version, and package versions is critical for reproducibility.

Cache: none

Nice! There were no cached chunks for this analysis, so you can be confident that you successfully produced the results during this run.

File paths: relative

Great job! Using relative paths to the files within your workflowr project makes it easier to run your code on other machines.

Repository version: 8836090

Great! You are using Git for version control. Tracking code development and connecting the code version to the results is critical for reproducibility.

The results in this page were generated with repository version 8836090. See the Past versions tab to see a history of the changes made to the R Markdown and HTML files.

Note that you need to be careful to ensure that all relevant files for the analysis have been committed to Git prior to generating the results (you can use wflow_publish or wflow_git_commit). workflowr only checks the R Markdown file, but you know if there are other scripts or data files that it depends on. Below is the status of the Git repository when the results were generated:


Ignored files:
    Ignored:    .Rhistory
    Ignored:    .Rproj.user/
    Ignored:    data/bigwig/

Untracked files:
    Untracked:  output/all4_h3k27ac.fa
    Untracked:  output/all4_no_h3k27ac.fa
    Untracked:  output/ar_foxa1_gata3_no_tfap2b.fa
    Untracked:  output/ar_foxa1_gata3_tfap2b.fa
    Untracked:  output/meme/
    Untracked:  output/streme/

Unstaged changes:
    Modified:   analysis/motif_analysis.Rmd

Note that any generated files, e.g. HTML, png, CSS, etc., are not included in this status report because it is ok for generated content to have uncommitted changes.

These are the previous versions of the repository in which changes were made to the R Markdown (analysis/motif_analysis.Rmd) and HTML (docs/motif_analysis.html) files. If you’ve configured a remote Git repository (see ?wflow_git_remote), click on the hyperlinks in the table below to view the files as they were in that past version.

File	Version	Author	Date	Message
Rmd	8836090	Steve Pederson	2022-05-26	Added motif centrality
html	8836090	Steve Pederson	2022-05-26	Added motif centrality

Introduction

This step takes the peaks defined previously and

Assesses motif enrichment for all ChIP targets within the peaks
Detect additional binding motifs

As AR is only found bound in the presence of DHT, all peaks are taken using DHT-treatment as the reference condition.

library(memes)
library(universalmotif)
library(tidyverse)
library(plyranges)
library(extraChIPs)
library(rtracklayer)
library(magrittr)
library(BSgenome.Hsapiens.UCSC.hg19)
library(scales)
library(pander)
library(glue)
library(cowplot)
library(parallel)
library(gt)

options(meme_bin = "/opt/meme/bin/")
check_meme_install()
theme_set(theme_bw())
panderOptions("big.mark", ",")

stringSetToViews <- function(x) {
  stopifnot(is(x, "XStringSet"))
  w <- unique(width(x))
  stopifnot(length(w) == 1)
  n <- length(x)
  starts <- (seq_along(x) - 1) * w + 1
  seq <- unlist(x)
  Views(subject = seq, start = starts, width = w)
}
getBestMatch <- function(x, subject, min.score = "80%", rc = TRUE, ...) {
  pwm <- NULL
  if (is.matrix(x)) pwm <- x
  if (is(x, "universalmotif")) pwm <- x@motif
  stopifnot(!is.null(pwm))
  
  stopifnot(is(subject, "XStringViews"))
  w <- unique(width(subject))
  stopifnot(length(w) == 1)
  
  matches <- matchPWM(
    pwm, subject, min.score = min.score, with.score = TRUE, ...
  )
  df <- as.data.frame(matches)
  df$score <- mcols(matches)$score
  df$id <- ceiling(df$start/ w)
  df$pos <- df$start - (df$id - 1) * w 
  
  if (rc) {
    rc_matches <-  matchPWM(
      pwm, subject = reverseComplement(subject),
      min.score = min.score, with.score = TRUE, 
      ...
    )
    rc_df <- as.data.frame(rc_matches)
    rc_df$score <- mcols(rc_matches)$score
    rc_df$start <- max(end(subject)) - end(rc_matches)
    rc_df$end <- max(end(subject)) - start(rc_matches)
    rc_df$id <- ceiling(rc_df$start/ w)
    rc_df$pos <- rc_df$start - (rc_df$id - 1) * w + 1
    df <- bind_rows(df, rc_df)
  }
  
  df <- arrange(df, id, desc(score))
  df <- distinct(df, id, .keep_all = TRUE)
  df <- dplyr::select(df, id, pos, seq)
  as_tibble(df)
  
}

hg19 <- BSgenome.Hsapiens.UCSC.hg19
sq19 <- seqinfo(hg19)

all_gr <- here::here("data", "annotations", "all_gr.rds") %>%
  read_rds() %>% 
  lapply(
    function(x) {
      seqinfo(x) <- sq19
      x
    }
  )
counts <- here::here("data", "rnaseq", "counts.out.gz") %>%
  read_tsv(skip = 1) %>%
  dplyr::select(Geneid, ends_with("bam")) %>%
  dplyr::rename(gene_id = Geneid)
detected <- all_gr$gene %>%
  as_tibble() %>%
  distinct(gene_id, gene_name) %>%
  left_join(counts) %>%
  pivot_longer(
    cols = ends_with("bam"), names_to = "sample", values_to = "counts"
  ) %>%
  mutate(detected = counts > 0) %>%
  group_by(gene_id, gene_name) %>%
  summarise(detected = mean(detected) > 0.25, .groups = "drop") %>%
  dplyr::filter(detected) %>%
  dplyr::select(starts_with("gene"))

Pre-existing RNA-seq data was again loaded and the set of 21,173 detected genes was defined as those with $\geq 1$ aligned read in at least 25% of samples. This dataset represented the DHT response in MDA-MB-453 cells, however, given the existence of both Vehicle and DHT conditions, represents a good representative measure of the polyadenylated components of the transcriptome.

features <- here::here("data", "h3k27ac") %>%
  list.files(full.names = TRUE, pattern = "bed$") %>%
  sapply(import.bed, seqinfo = sq19) %>%
  lapply(granges) %>%
  setNames(basename(names(.))) %>%
  setNames(str_remove_all(names(.), "s.bed")) %>%
  GRangesList() %>%
  unlist() %>%
  names_to_column("feature") %>%
  sort()

The set of H3K27ac-defined features was again used as a marker of regulatory activity, with 16,544 and 15,286 regions defined as enhancers or promoters respectively.

meme_db <- here::here("data", "external", "JASPAR2022_CORE_Hsapiens.meme") %>%
  read_meme() %>%
  to_df() %>%
  as_tibble() %>%
  mutate(
    across(everything(), vctrs::vec_proxy),
    altname = str_remove(altname, "MA[0-9]+\\.[0-9]*\\."),
    organism = "Homo sapiens"
  )
meme_db_detected <- meme_db %>%
  mutate(
    gene_name = str_split(altname, pattern = "::", n = 2)
  ) %>%
  dplyr::filter(
    vapply(
      gene_name,
      function(x) all(x %in% detected$gene_name),
      logical(1)
    )
  )

The motif database was initially defined as all 727 CORE motifs found in Homo sapiens, as contained in JASPAR2022. In order to minimise irrelevant results, only the 465 motifs associated with a detected gene were then retained.

dht_peaks <- here::here("data", "peaks") %>%
  list.files(recursive = TRUE, pattern = "oracle", full.names = TRUE) %>%
  sapply(read_rds, simplify = FALSE) %>%
  lapply(function(x) x[["DHT"]]) %>%
  lapply(setNames, nm = c()) %>%
  setNames(str_extract_all(names(.), "AR|FOXA1|GATA3|TFAP2B")) %>% 
  lapply(
    function(x) {
      seqinfo(x) <- sq19
      x
    }
  )
targets <- names(dht_peaks)
dht_consensus <- dht_peaks %>%
  lapply(granges) %>%
  GRangesList() %>%
  unlist() %>%
  reduce() %>%
  mutate(
    AR = overlapsAny(., dht_peaks$AR),
    FOXA1 = overlapsAny(., dht_peaks$FOXA1),
    GATA3 = overlapsAny(., dht_peaks$GATA3),
    TFAP2B = overlapsAny(., dht_peaks$TFAP2B),
    promoter = bestOverlap(., features, var = "feature", missing = "None") == "promoter",
    enhancer = bestOverlap(., features, var = "feature", missing = "None") == "enhancer",
    H3K27ac = promoter | enhancer
  )

a <- dht_peaks %>% 
  lapply(mutate, w = width) %>% 
  lapply(as_tibble) %>% 
  bind_rows(.id = "target") %>% 
  ggplot(aes(w, stat(density), colour = target)) +
  geom_density() +
  scale_x_log10(breaks = c(100, 300, 600, 1e3, 2e3, 3e3), labels = comma) +
  labs(
    x = "Peak Width (bp)", y = "Density", colour = "Target"
  )
b <- dht_peaks %>% 
  lapply(mutate, w = width) %>% 
  lapply(as_tibble) %>% 
  bind_rows(.id = "target") %>% 
  mutate(skew = 0.5 - peak / w) %>% 
  ggplot(aes(skew, stat(density), colour = target)) +
  geom_density() +
  scale_x_continuous(labels = percent) +
  labs(
    x = "% Width Between Peak Summit and Peak Centre",
    y = "Density",
    colour = "ChIP Target"
  )
plot_grid(
  a + theme(legend.position = "none"), 
  b + theme(legend.position = "none"),
  get_legend(b),
  rel_widths = c(1, 1, 0.2), nrow = 1,
  labels = c("A", "B")
)

Distributions of A) peak width and B) summit location with peaks.

Version	Author	Date
8836090	Steve Pederson	2022-05-26

Enrichment of Expected Motifs

Analysis of Motif Enrichment (AME)

grl_by_targets <- dht_consensus %>% 
  as_tibble() %>% 
  pivot_longer(
    all_of(targets), names_to = "target", values_to = "detected"
  ) %>% 
  dplyr::filter(detected) %>% 
  group_by(range, H3K27ac) %>% 
  summarise(
    targets = paste(sort(target), collapse = " "),
    .groups = "drop"
  ) %>% 
  colToRanges(var = "range", seqinfo = sq19) %>% 
  sort() %>% 
  splitAsList(.$targets)

Union ranges were grouped by the combination of detected AR, FOXA1, GATA3 and TFAP2B binding, then sequences obtained corresponding to each union range.

grl_by_targets %>% 
  lapply(
    function(x) {
      tibble(
        `N Sites` = length(x), 
        `% H3K27ac` = mean(x$H3K27ac),
        `Median Width` = floor(median(width(x)))
      )
    }
  ) %>% 
  bind_rows(.id = "Targets") %>% 
  mutate(
    `N Targets` = str_count(Targets, " ") + 1,
    `% H3K27ac` = percent(`% H3K27ac`, 0.1)
  ) %>% 
  dplyr::select(contains("Targets"), everything()) %>% 
  arrange(`N Targets`, Targets) %>% 
  pander(
    justify = "lrrrr",
    caption = "Summary of union ranges and which combinations of the targets were detected."
  )

Summary of union ranges and which combinations of the targets were detected.
Targets	N Targets	N Sites	% H3K27ac	Median Width
AR	1	395	13.9%	241
FOXA1	1	59,404	12.8%	319
GATA3	1	4,284	19.6%	234
TFAP2B	1	6,655	80.1%	301
AR FOXA1	2	3,543	17.8%	505
AR GATA3	2	158	16.5%	310
AR TFAP2B	2	135	55.6%	414
FOXA1 GATA3	2	9,760	18.1%	479
FOXA1 TFAP2B	2	3,849	64.6%	513
GATA3 TFAP2B	2	396	60.6%	368
AR FOXA1 GATA3	3	6,518	25.2%	683
AR FOXA1 TFAP2B	3	1,786	62.9%	639
AR GATA3 TFAP2B	3	61	55.7%	419
FOXA1 GATA3 TFAP2B	3	1,441	52.0%	575
AR FOXA1 GATA3 TFAP2B	4	6,336	58.2%	814

meme_db_targets <- meme_db_detected %>% 
  dplyr::filter(altname %in% targets) %>% 
  to_list()
seq_by_targets <- grl_by_targets %>% 
  get_sequence(hg19)
ame_by_targets <- seq_by_targets %>% 
  runAme(database = meme_db_targets)

AME from the meme-suite was then run on each sequence to detect the presence of motifs associated with each target within each set of sequences.

ame_by_targets %>% 
  bind_rows(.id = "targets") %>% 
  mutate(
    motif_id = as.factor(motif_id),
    targets = as.factor(targets),
    TFAP2B = ifelse(
      str_detect(targets, "TFAP2B"), "TFAP2B", "No TFAP2B"
    )
  ) %>% 
  dplyr::filter(adj.pvalue < 0.05) %>% 
  plot_ame_heatmap(
    id = motif_id,
    value = tp_percent,
    group = targets
  ) +
  geom_label(
    aes(label = percent(tp_percent/100, 0.1)),
    size = 3,
    show.legend = FALSE
  ) +
  facet_grid(TFAP2B ~ motif_alt_id, scales = "free", space = "free") +
  scale_x_discrete(expand = expansion(c(0, 0))) +
  scale_y_discrete(expand = expansion(c(0, 0))) +
  labs(
    x = "Detected Motif",
    y = "Detected Targets",
    fill = expression(widehat("TP%"))
  ) +
  theme(
    axis.text.x = element_text(angle = 0, hjust = 0.5, size = 10),
    axis.text.y = element_text(angle = 0, hjust = 1, size = 10),
    panel.grid = element_blank()
  )

Estimated percentage of true positive matches to all binding motifs associated with each ChIP target. Only values with an enrichment adjusted p-value < 0.05 are shown, such that missing values are able to be confidently assumed as not enriched for the motif. With the possible exception of GATA3 binding in the presence of AR and FOXA1 only, all provide supporting evidence for direct binding of each target to it’s motif

Version	Author	Date
8836090	Steve Pederson	2022-05-26

Motif Centrality Relative to Each Target

seq_width <- 200
min_score <- percent(0.75, accuracy = 1)

In order to assess any association in binding patterns, the binding peak (i.e.) summit of each ChIP target was taken as a viewpoint and the centrality of the best matching sequence motif for that target was assessed. Sequences were extended 200bp from each summit and the best sequence match to each binding motif were retained. Sequence matches with a score > 75% of the highest possible score were initially included before reducing matches to those with the highest score.

The motifs obtained form the larger database were the 6 motifs defined as candidate binding sites for each of the 4 ChIP targets.

meme_db_targets %>% 
  lapply(
    function(x) {
      tibble(
        motif_id = x@name,
        target = x@altname,
        IC = x@icscore,
        length = ncol(x@motif)
        )
    }
  ) %>% 
  bind_rows() %>% 
  arrange(target, desc(IC)) %>% 
  dplyr::rename(
    `Motif ID` = motif_id, Target = target, `Length (bp)` = length
  ) %>% 
  pander(
    justify = "llrr",
    caption = paste(
      "Motifs use for checking defined target binding sites. Motifs are ranked",
      "within each ChIP target by information content (IC)."
    )
  )

Motifs use for checking defined target binding sites. Motifs are ranked within each ChIP target by information content (IC).
Motif ID	Target	IC	Length (bp)
MA0007.2	AR	13.27	15
MA0148.4	FOXA1	11.41	12
MA0037.3	GATA3	9.749	8
MA0813.1	TFAP2B	14.31	13
MA0812.1	TFAP2B	12.56	11
MA0811.1	TFAP2B	10.86	12

plot_grid(
  plot_grid(
    plotlist = lapply(
      meme_db_targets %>% 
        to_df() %>% 
        arrange(altname) %>% 
        dplyr::filter(altname != "TFAP2B") %>% 
        to_list() %>% 
        lapply(
          function(x){
            x@name <- paste0(
              x@name, ": ", x@altname, " (IC = ",
              round(x@icscore, 1), ")"
            )
            x
          }
        ),
      function(x) {
        view_motifs(x) +
          ggtitle(x@name) +
          theme(plot.title = element_text(hjust = 0.5, size = 11))
      }
    ),
    labels = c("A", "B", "C"),
    ncol = 1
  ),
  plot_grid(
    meme_db_targets %>% 
      to_df() %>% 
      dplyr::filter(altname == "TFAP2B") %>% 
      to_list() %>% 
      lapply(
          function(x){
            x@name <- paste0(
              x@name, ": ", x@altname, " (IC = ",
              round(x@icscore, 1), ")"
            )
            x
          }
      ) %>% 
      view_motifs() +
      theme(plot.title = element_text(hjust = 0.5, size = 9))
  ),
  ncol = 2,
  labels = c("", "D")
)

All motifs matching the set of ChIP targets. Motifc in A-C are not aligned with each other, whilst all TFAP2B motifs are shown using the best alignment.

AR

ar_summits <- dht_peaks$AR %>% 
  resize(width = 1, fix = 'start') %>% 
  shift(shift = .$peak) %>% 
  resize(width = seq_width, fix = 'center') %>% 
  granges() %>% 
  mutate(
    AR = overlapsAny(., subset(dht_consensus, AR)),
    FOXA1 = overlapsAny(., subset(dht_consensus, FOXA1)),
    GATA3 = overlapsAny(., subset(dht_consensus, GATA3)),
    TFAP2B = overlapsAny(., subset(dht_consensus, TFAP2B)),
    H3K27ac = overlapsAny(., features)
  ) 
seq_ar_summits <- ar_summits %>% 
  get_sequence(hg19)
views_ar_summits <- stringSetToViews(seq_ar_summits)
matches_ar_summits <- meme_db_targets %>% 
  lapply(
    getBestMatch, 
    subject = views_ar_summits,
    min.score = min_score
  ) %>% 
  setNames(
    vapply(meme_db_targets, function(x) x@name, character(1))
  ) %>% 
  bind_rows(.id = "name") %>% 
  left_join(
    dplyr::select(meme_db_detected, name, altname),
    by = "name"
  ) %>% 
  left_join(
    ar_summits %>% 
      as_tibble() %>% 
      mutate(id =seq_along(range)),
    by = "id"
  ) %>% 
  arrange(id) %>% 
  dplyr::select(range, any_of(targets), H3K27ac, name, altname, seq, pos)

matches_ar_summits  %>% 
  mutate(name = glue("{name}\n({altname})")) %>% 
  ggplot(aes(pos, stat(density), colour = altname)) +
  geom_density(aes(linetype = H3K27ac)) +
  geom_text(
    aes(x = seq_width/2, y = 0.001, label = lab),
    data = . %>% 
      group_by(name) %>% 
      summarise(
        n = dplyr::n(), .groups = "drop") %>% 
      mutate(lab = glue("n = {comma(n)}\n({percent(n/length(ar_summits), 0.1)})")),
    colour = "black",
    inherit.aes = FALSE
  ) +
  facet_wrap(~name) +
  scale_linetype_manual(values = c(2, 1)) +
  labs(
    x = "Best Match Position In Sequence",
    y = "Density",
    colour = "Motif Target"
  )

Position of the best match for each motif using sequences centred at the summits of AR binding sites. Sites where H3K27ac signal was not found are shown as dashed lines. No visual difference in AR motif centrality was detected due to H3K27ac signal. A slight tendency for the best FOXA1 to also appear centrally was noted.

Across all motifs which were considered to represent possible matches to the binding site of AR, matches to one or more of the motifs were found in 13,432 (69.2%) of the 19,406 queried sequences.

FOXA1

foxa1_summits <- dht_peaks$FOXA1 %>% 
  resize(width = 1, fix = 'start') %>% 
  shift(shift = .$peak) %>% 
  resize(width = seq_width, fix = 'center') %>% 
  granges() %>% 
  mutate(
    AR = overlapsAny(., subset(dht_consensus, AR)),
    FOXA1 = overlapsAny(., subset(dht_consensus, FOXA1)),
    GATA3 = overlapsAny(., subset(dht_consensus, GATA3)),
    TFAP2B = overlapsAny(., subset(dht_consensus, TFAP2B)),
    H3K27ac = overlapsAny(., features)
  ) 
seq_foxa1_summits <- foxa1_summits %>% 
  get_sequence(hg19)
views_foxa1_summits <- stringSetToViews(seq_foxa1_summits)
matches_foxa1_summits <- meme_db_targets %>% 
  lapply(
    getBestMatch, 
    subject = views_foxa1_summits,
    min.score = min_score
  ) %>% 
  setNames(
    vapply(meme_db_targets, function(x) x@name, character(1))
  ) %>% 
  bind_rows(.id = "name") %>% 
  left_join(
    dplyr::select(meme_db_detected, name, altname),
    by = "name"
  ) %>% 
  left_join(
    foxa1_summits %>% 
      as_tibble() %>% 
      mutate(id =seq_along(range)),
    by = "id"
  ) %>% 
  arrange(id) %>% 
  dplyr::select(range, any_of(targets), H3K27ac, name, altname, seq, pos)

matches_foxa1_summits  %>% 
  mutate(name = glue("{name}\n({altname})")) %>% 
  ggplot(aes(pos, stat(density), colour = altname)) +
  geom_density(aes(linetype = H3K27ac)) +
  geom_text(
    aes(x = seq_width/2, y = 0.001, label = lab),
    data = . %>% 
      group_by(name) %>% 
      summarise(
        n = dplyr::n(), .groups = "drop") %>% 
      mutate(lab = glue("n = {comma(n)}\n({percent(n/length(foxa1_summits), 0.1)})")),
    colour = "black",
    inherit.aes = FALSE
  ) +
  facet_wrap(~name) +
  scale_linetype_manual(values = c(2, 1)) +
  labs(
    x = "Best Match Position In Sequence",
    y = "Density",
    colour = "Motif Target"
  )

Position of the best match for each motif using sequences centred at the summits of FOXA1 binding sites. Sites where H3K27ac signal was not found are shown as dashed lines. No visual difference in FOXA1 motif centrality was detected due to H3K27ac signal. A slight tendency for the best FOXA1 to also appear centrally was noted.

Across all motifs which were considered to represent possible matches to the binding site of FOXA1, matches to one or more of the motifs were found in 87,963 (94.8%) of the 92,751 queried sequences.

GATA3

gata3_summits <- dht_peaks$GATA3 %>% 
  resize(width = 1, fix = 'start') %>% 
  shift(shift = .$peak) %>% 
  resize(width = seq_width, fix = 'center') %>% 
  granges() %>% 
  mutate(
    AR = overlapsAny(., subset(dht_consensus, AR)),
    FOXA1 = overlapsAny(., subset(dht_consensus, FOXA1)),
    GATA3 = overlapsAny(., subset(dht_consensus, GATA3)),
    TFAP2B = overlapsAny(., subset(dht_consensus, TFAP2B)),
    H3K27ac = overlapsAny(., features)
  ) 
seq_gata3_summits <- gata3_summits %>% 
  get_sequence(hg19)
views_gata3_summits <- stringSetToViews(seq_gata3_summits)
matches_gata3_summits <- meme_db_targets %>% 
  lapply(
    getBestMatch, 
    subject = views_gata3_summits,
    min.score = min_score
  ) %>% 
  setNames(
    vapply(meme_db_targets, function(x) x@name, character(1))
  ) %>% 
  bind_rows(.id = "name") %>% 
  left_join(
    dplyr::select(meme_db_detected, name, altname),
    by = "name"
  ) %>% 
  left_join(
    gata3_summits %>% 
      as_tibble() %>% 
      mutate(id =seq_along(range)),
    by = "id"
  ) %>% 
  arrange(id) %>% 
  dplyr::select(range, any_of(targets), H3K27ac, name, altname, seq, pos)

matches_gata3_summits  %>% 
  mutate(name = glue("{name}\n({altname})")) %>% 
  ggplot(aes(pos, stat(density), colour = altname)) +
  geom_density(aes(linetype = H3K27ac)) +
  geom_text(
    aes(x = seq_width/2, y = 0.001, label = lab),
    data = . %>% 
      group_by(name) %>% 
      summarise(
        n = dplyr::n(), .groups = "drop") %>% 
      mutate(lab = glue("n = {comma(n)}\n({percent(n/length(gata3_summits), 0.1)})")),
    colour = "black",
    inherit.aes = FALSE
  ) +
  facet_wrap(~name) +
  scale_linetype_manual(values = c(2, 1)) +
  labs(
    x = "Best Match Position In Sequence",
    y = "Density",
    colour = "Motif Target"
  )

Position of the best match for each motif using sequences centred at the summits of GATA3 binding sites. Sites where H3K27ac signal was not found are shown as dashed lines. No visual difference in GATA3 motif centrality was detected due to H3K27ac signal. A slight tendency for the best FOXA1 to also appear centrally was noted.

Across all motifs which were considered to represent possible matches to the binding site of GATA3, matches to one or more of the motifs were found in 29,441 (99.2%) of the 29,666 queried sequences.

TFAP2B

tfap2b_summits <- dht_peaks$TFAP2B %>% 
  resize(width = 1, fix = 'start') %>% 
  shift(shift = .$peak) %>% 
  resize(width = seq_width, fix = 'center') %>% 
  granges() %>% 
  mutate(
    AR = overlapsAny(., subset(dht_consensus, AR)),
    FOXA1 = overlapsAny(., subset(dht_consensus, FOXA1)),
    GATA3 = overlapsAny(., subset(dht_consensus, GATA3)),
    H3K27ac = overlapsAny(., features)
  ) 
seq_tfap2b_summits <- tfap2b_summits %>% 
  get_sequence(hg19)
views_tfap2b_summits <- stringSetToViews(seq_tfap2b_summits)
matches_tfap2b_summits <- meme_db_targets %>% 
  lapply(
    getBestMatch, 
    subject = views_tfap2b_summits,
    min.score = min_score
  ) %>% 
  setNames(
    vapply(meme_db_targets, function(x) x@name, character(1))
  ) %>% 
  bind_rows(.id = "name") %>% 
  left_join(
    dplyr::select(meme_db_detected, name, altname),
    by = "name"
  ) %>% 
  left_join(
    tfap2b_summits %>% 
      as_tibble() %>% 
      mutate(id =seq_along(range)),
    by = "id"
  ) %>% 
  arrange(id) %>% 
  dplyr::select(range, any_of(targets), H3K27ac, name, altname, seq, pos)

matches_tfap2b_summits  %>% 
  mutate(name = glue("{name}\n({altname})")) %>% 
  ggplot(aes(pos, stat(density), colour = altname)) +
  geom_density(aes(linetype = H3K27ac)) +
  geom_text(
    aes(x = seq_width/2, y = 0.001, label = lab),
    data = . %>% 
      group_by(name) %>% 
      summarise(
        n = dplyr::n(), .groups = "drop") %>% 
      mutate(lab = glue("n = {comma(n)}\n({percent(n/length(tfap2b_summits), 0.1)})")),
    colour = "black",
    inherit.aes = FALSE
  ) +
  facet_wrap(~name) +
  scale_linetype_manual(values = c(2, 1)) +
  labs(
    x = "Best Match Position In Sequence",
    y = "Density",
    colour = "Motif Target"
  )

Position of the best match for each motif using sequences centred at the summits of TFAP2B binding sites. Sites where H3K27ac signal was not found are shown as dashed lines. No visual difference in TFAP2B motif centrality was detected. All other motifs appeared to be uniformly distributed amongst the sequences.

Version	Author	Date
8836090	Steve Pederson	2022-05-26

Across all motifs which were considered to represent possible matches to the binding site of TFAP2$\beta$, matches to one or more of the motifs were found in 20,382 (97.0%) of the 21,008 queried sequences.

Motif Discovery

The above steps essentially perform checks on the data and verify that the expected binding motifs are generally driving binding of each target. A further step would be to determine any enriched binding motifs within the union peaks which are able to be confidently assigned to putative co-factors. For this section, key comparisons were chosen as this enables detection of cofactors which may be driving key binding patterns. Novel sequence motifs were first defined using STREME, which uses a suffix tree based approach to determining enrichment of sequence motifs.

All 4 Targets Bound: H3K27ac Vs No H3K27ac

A comparison between binding sites where all 4 targets were found was performed based on this showing regulatory activity (overlap with H3K27ac-derived features) and no regulatory activity (i.e. no H3K27ac overlap).

seq_all4_by_h3k27ac <- dht_consensus %>%
  filter(!!!syms(targets)) %>%
  # resize(width = 800, fix = 'center') %>%
  split(.$H3K27ac) %>%
  setNames(c("no_h3k27ac", "h3k27ac")) %>%
  get_sequence(hg19) %>%
  as.list()
streme_all4_by_h3k27ac_rds <- here::here(
  "output", "streme", "all4_by_h3k27ac.rds"
)
if (!dir.exists(dirname(streme_all4_by_h3k27ac_rds))) dir.create(
  dirname(streme_all4_by_h3k27ac_rds)
)
if (!file.exists(streme_all4_by_h3k27ac_rds)) {
  streme_all4_by_h3k27ac <- runStreme(
    input = seq_all4_by_h3k27ac$h3k27ac,
    control = seq_all4_by_h3k27ac$no_h3k27ac,
    thresh = 0.01
  ) %>% 
    as_tibble() %>% 
    mutate(
      eval = case_when(
        is.na(eval) ~ pval * nrow(.),
        !is.na(eval) ~ eval
      )
    )
  write_rds(
    x = streme_all4_by_h3k27ac,
    file = streme_all4_by_h3k27ac_rds
  )
} else {
  streme_all4_by_h3k27ac <- read_rds(streme_all4_by_h3k27ac_rds)
}
tomtom_all4_by_h3k27ac <- streme_all4_by_h3k27ac %>%
  to_list() %>%
  runTomTom(
    database = to_list(meme_db_detected),
    thresh = 0.1
  )
streme_to_keep_all4_by_h3k27ac <- tomtom_all4_by_h3k27ac %>% 
  lapply(function(x) x$best_match_name) %>% 
  vapply(length, integer(1)) %>% 
  is_greater_than(0)

Using the default setting of STREME, 40 candidate sequence motifs were detected. TomTom was subsequently used to match candidate motifs with those in the database of detected sequences. To ensure only high-quality candidates were retained, matches with a p-value > 0.1 for similarity were discarded. This strategy gave 7 candidate motifs able to be matched to the database and retained as high-quality motifs.

pl <- tomtom_all4_by_h3k27ac[streme_to_keep_all4_by_h3k27ac] %>%
  bind_rows() %>% 
  arrange(desc(nsites)) %>% 
  to_list() %>% 
  lapply(
    function(x) {
      nm <- paste0(
        x@altname, 
        " (IC: ", round(x@icscore, 1), "; n = ", x@nsites, ")\n",
        x@extrainfo[["best_match_altname"]]
      )
      view_motifs(x) +
        ggtitle(nm) +
        theme(
          plot.title = element_text(hjust = 0.5, size = 8),
          axis.text = element_text(size = 8),
          axis.title = element_text(size = 8)
        )
    }
  )
plot_grid(plotlist = pl)

$All motifs detected by STREME using a threshold of $p \leq 0.01$ and with high-quality matches to the database of motifs. The name of the best matching motif is given as the second line of the title, although this may not be the true TF which binds to the motif. Motifs are ordered by the number of times they appear in the sequences, noting the motifs with high information content tend to lead to fewer hits.$

All motifs detected by STREME using a threshold of $p \leq 0.01$ and with high-quality matches to the database of motifs. The name of the best matching motif is given as the second line of the title, although this may not be the true TF which binds to the motif. Motifs are ordered by the number of times they appear in the sequences, noting the motifs with high information content tend to lead to fewer hits.

tomtom_all4_by_h3k27ac[streme_to_keep_all4_by_h3k27ac] %>% 
  lapply(
    function(x) {
      tt <- x$tomtom[[1]]
      mt <- lapply(
        tt$match_motif,
        function(um) {
          nm <- paste0(um@name, " (", um@altname, ")")
          um@name <- nm
          um
        }
      )
      x$tomtom[[1]]$match_motif <- mt
      x
    }
  ) %>% 
  lapply(
    function(x) {
      mutate(x, tomtom_logos = view_tomtom_hits(x))
    }
  ) %>% 
  bind_rows() %>% 
  # dplyr::slice(1:2) %>%
  dplyr::select(
    motif, id = altname, IC = icscore, nsites,
    best_match_name, best_match_altname, best_match_motif,
    tomtom_logos, tomtom, pval , eval
  ) %>% 
  mutate(
    tomtom_logos = tomtom_logos %>% 
      lapply(
        function(x) {
          n <- length(levels(x$data$motif.id))
          ggplot_image(x, height = px(60 + n*60), aspect_ratio = 180 / (100 + n*30))
        }
      ),
    `All Matches` = vapply(
      tomtom,
      function(x) {
        paste(x$match_altname, collapse = ", ")
      },
      character(1)
    )
  ) %>% 
  dplyr::select(
    `STREME ID` = id, `Best Match` = best_match_altname, `All Matches`,
    IC, `Found In` = nsites, `P-value` = pval, `E-value` = eval,
    `TOMTOM Matches` = tomtom_logos
  ) %>% 
  gt(
    caption = paste(
      "All STREME-detected motifs with putative matches in the motif database.",
      "The motif database was restricted to motifs associated with detected genes.",
      "The information content (IC) is given along with the number of sequences",
      "the motif was found in. The E-value is equivalent to an adjusted p-value",
      "for the likelihood of the detected motif being non-artefactual."
    ),
    id = "tomtom"
  ) %>% 
  fmt_markdown(contains("TOMTOM")) %>% 
  fmt_number(contains("val"), decimals = 4) %>% 
  fmt_number("Found In", decimals = 0, use_seps = TRUE) %>% 
  fmt_number("IC", decimals = 2) %>% 
  tab_spanner(
    label = "Database Matches",
    columns = all_of(c("Best Match", "All Matches"))
  ) %>% 
  tab_spanner(
    label = "Statistics",
    columns = c("IC", "Found In", "P-value", "E-value")
  ) %>% 
  text_transform(
    locations = cells_body(columns = "All Matches"),
    fn = function(x) str_replace_all(x, ", ", "<br>")
  ) %>% 
  tab_options(
    table.width = pct(100),
    table.font.size = px(12)
  ) %>% 
  opt_css(
    css = "
    #tomtom .gt_row{
      vertical-align: top
    }
    "
  )

All STREME-detected motifs with putative matches in the motif database. The motif database was restricted to motifs associated with detected genes. The information content (IC) is given along with the number of sequences the motif was found in. The E-value is equivalent to an adjusted p-value for the likelihood of the detected motif being non-artefactual.
STREME ID	Database Matches		Statistics				TOMTOM Matches
STREME ID	Best Match	All Matches	IC	Found In	P-value	E-value	TOMTOM Matches
STREME-1	ZNF281	ZNF281 ZNF148	11.84	921	0.0001	0.0024
STREME-10	ZNF331	ZNF331	13.75	282	0.0035	0.0840
STREME-13	SP1	SP1 SP2 KLF10 KLF12	11.79	1,260	0.0044	0.1056
STREME-15	ZNF740	ZNF740 KLF4	11.50	1,093	0.0071	0.1704
STREME-18	ZNF148	ZNF148	10.48	848	0.0088	0.2112
STREME-22	GABPA	GABPA ELF3 ELF1	10.43	852	0.0680	1.6320
STREME-23	ZNF449	ZNF449	13.75	184	0.0870	2.0880

sessionInfo()

R version 4.2.0 (2022-04-22)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 20.04.4 LTS

Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.9.0
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.9.0

locale:
 [1] LC_CTYPE=en_AU.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_AU.UTF-8        LC_COLLATE=en_AU.UTF-8    
 [5] LC_MONETARY=en_AU.UTF-8    LC_MESSAGES=en_AU.UTF-8   
 [7] LC_PAPER=en_AU.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_AU.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] parallel  stats4    stats     graphics  grDevices utils     datasets 
[8] methods   base     

other attached packages:
 [1] gt_0.6.0                          cowplot_1.1.1                    
 [3] glue_1.6.2                        pander_0.6.5                     
 [5] scales_1.2.0                      BSgenome.Hsapiens.UCSC.hg19_1.4.3
 [7] BSgenome_1.64.0                   Biostrings_2.64.0                
 [9] XVector_0.36.0                    magrittr_2.0.3                   
[11] rtracklayer_1.56.0                extraChIPs_1.0.0                 
[13] SummarizedExperiment_1.26.1       Biobase_2.56.0                   
[15] MatrixGenerics_1.8.0              matrixStats_0.62.0               
[17] BiocParallel_1.30.2               plyranges_1.16.0                 
[19] GenomicRanges_1.48.0              GenomeInfoDb_1.32.2              
[21] IRanges_2.30.0                    S4Vectors_0.34.0                 
[23] BiocGenerics_0.42.0               forcats_0.5.1                    
[25] stringr_1.4.0                     dplyr_1.0.9                      
[27] purrr_0.3.4                       readr_2.1.2                      
[29] tidyr_1.2.0                       tibble_3.1.7                     
[31] ggplot2_3.3.6                     tidyverse_1.3.1                  
[33] universalmotif_1.14.0             memes_1.4.1                      
[35] workflowr_1.7.0                  

loaded via a namespace (and not attached):
  [1] rappdirs_0.3.3             R.methodsS3_1.8.1         
  [3] csaw_1.30.1                bit64_4.0.5               
  [5] knitr_1.39                 DelayedArray_0.22.0       
  [7] R.utils_2.11.0             data.table_1.14.2         
  [9] rpart_4.1.16               KEGGREST_1.36.0           
 [11] RCurl_1.98-1.6             AnnotationFilter_1.20.0   
 [13] doParallel_1.0.17          generics_0.1.2            
 [15] GenomicFeatures_1.48.1     callr_3.7.0               
 [17] EnrichedHeatmap_1.26.0     usethis_2.1.6             
 [19] RSQLite_2.2.14             commonmark_1.8.0          
 [21] bit_4.0.4                  tzdb_0.3.0                
 [23] xml2_1.3.3                 lubridate_1.8.0           
 [25] httpuv_1.6.5               assertthat_0.2.1          
 [27] xfun_0.31                  hms_1.1.1                 
 [29] jquerylib_0.1.4            evaluate_0.15             
 [31] promises_1.2.0.1           fansi_1.0.3               
 [33] restfulr_0.0.13            progress_1.2.2            
 [35] dbplyr_2.1.1               readxl_1.4.0              
 [37] igraph_1.3.1               DBI_1.1.2                 
 [39] htmlwidgets_1.5.4          ellipsis_0.3.2            
 [41] backports_1.4.1            biomaRt_2.52.0            
 [43] vctrs_0.4.1                here_1.0.1                
 [45] ensembldb_2.20.1           cachem_1.0.6              
 [47] withr_2.5.0                ggforce_0.3.3             
 [49] Gviz_1.40.1                checkmate_2.1.0           
 [51] vroom_1.5.7                GenomicAlignments_1.32.0  
 [53] prettyunits_1.1.1          cluster_2.1.3             
 [55] lazyeval_0.2.2             crayon_1.5.1              
 [57] edgeR_3.38.1               pkgconfig_2.0.3           
 [59] labeling_0.4.2             tweenr_1.0.2              
 [61] pkgload_1.2.4              ProtGenerics_1.28.0       
 [63] nnet_7.3-17                rlang_1.0.2               
 [65] lifecycle_1.0.1            filelock_1.0.2            
 [67] BiocFileCache_2.4.0        modelr_0.1.8              
 [69] dichromat_2.0-0.1          cellranger_1.1.0          
 [71] rprojroot_2.0.3            polyclip_1.10-0           
 [73] Matrix_1.4-1               ggseqlogo_0.1             
 [75] reprex_2.0.1               base64enc_0.1-3           
 [77] whisker_0.4                GlobalOptions_0.1.2       
 [79] processx_3.5.3             png_0.1-7                 
 [81] rjson_0.2.21               GenomicInteractions_1.30.0
 [83] bitops_1.0-7               cmdfun_1.0.2              
 [85] getPass_0.2-2              R.oo_1.24.0               
 [87] blob_1.2.3                 shape_1.4.6               
 [89] jpeg_0.1-9                 memoise_2.0.1             
 [91] zlibbioc_1.42.0            compiler_4.2.0            
 [93] scatterpie_0.1.7           BiocIO_1.6.0              
 [95] RColorBrewer_1.1-3         clue_0.3-60               
 [97] Rsamtools_2.12.0           cli_3.3.0                 
 [99] ps_1.7.0                   htmlTable_2.4.0           
[101] Formula_1.2-4              MASS_7.3-56               
[103] ggside_0.2.0.9990          tidyselect_1.1.2          
[105] stringi_1.7.6              highr_0.9                 
[107] yaml_2.3.5                 locfit_1.5-9.5            
[109] latticeExtra_0.6-29        ggrepel_0.9.1             
[111] grid_4.2.0                 sass_0.4.1                
[113] VariantAnnotation_1.42.1   tools_4.2.0               
[115] circlize_0.4.15            rstudioapi_0.13           
[117] foreach_1.5.2              foreign_0.8-82            
[119] git2r_0.30.1               metapod_1.4.0             
[121] gridExtra_2.3              farver_2.1.0              
[123] digest_0.6.29              Rcpp_1.0.8.3              
[125] broom_0.8.0                later_1.3.0               
[127] httr_1.4.3                 AnnotationDbi_1.58.0      
[129] biovizBase_1.44.0          ComplexHeatmap_2.12.0     
[131] colorspace_2.0-3           rvest_1.0.2               
[133] brio_1.1.3                 XML_3.99-0.9              
[135] fs_1.5.2                   splines_4.2.0             
[137] jsonlite_1.8.0             ggfun_0.0.6               
[139] testthat_3.1.4             R6_2.5.1                  
[141] Hmisc_4.7-0                pillar_1.7.0              
[143] htmltools_0.5.2            fastmap_1.1.0             
[145] codetools_0.2-18           utf8_1.2.2                
[147] lattice_0.20-45            bslib_0.3.1               
[149] curl_4.3.2                 survival_3.2-13           
[151] limma_3.52.1               rmarkdown_2.14            
[153] InteractionSet_1.24.0      desc_1.4.1                
[155] munsell_0.5.0              GetoptLong_1.0.5          
[157] GenomeInfoDbData_1.2.8     iterators_1.0.14          
[159] haven_2.5.0                gtable_0.3.0

Motif Analysis

Steve Pederson

2022-05-25