Last updated: 2025-09-23
Checks: 7 0
Knit directory: DXR_continue/
This reproducible R Markdown analysis was created with workflowr (version 1.7.1). The Checks tab describes the reproducibility checks that were applied when the results were created. The Past versions tab lists the development history.
Great! Since the R Markdown file has been committed to the Git repository, you know the exact version of the code that produced these results.
Great job! The global environment was empty. Objects defined in the global environment can affect the analysis in your R Markdown file in unknown ways. For reproduciblity it’s best to always run the code in an empty environment.
The command set.seed(20250701) was run prior to running
the code in the R Markdown file. Setting a seed ensures that any results
that rely on randomness, e.g. subsampling or permutations, are
reproducible.
Great job! Recording the operating system, R version, and package versions is critical for reproducibility.
Nice! There were no cached chunks for this analysis, so you can be confident that you successfully produced the results during this run.
Great job! Using relative paths to the files within your workflowr project makes it easier to run your code on other machines.
Great! You are using Git for version control. Tracking code development and connecting the code version to the results is critical for reproducibility.
The results in this page were generated with repository version 6b65c26. See the Past versions tab to see a history of the changes made to the R Markdown and HTML files.
Note that you need to be careful to ensure that all relevant files for
the analysis have been committed to Git prior to generating the results
(you can use wflow_publish or
wflow_git_commit). workflowr only checks the R Markdown
file, but you know if there are other scripts or data files that it
depends on. Below is the status of the Git repository when the results
were generated:
Ignored files:
Ignored: .Rhistory
Ignored: .Rproj.user/
Ignored: data/Cormotif_data/
Ignored: data/DER_data/
Ignored: data/Other_paper_data/
Ignored: data/TE_annotation/
Ignored: data/alignment_summary.txt
Ignored: data/all_peak_final_dataframe.txt
Ignored: data/cell_line_info_.tsv
Ignored: data/full_summary_QC_metrics.txt
Ignored: data/motif_lists/
Ignored: data/number_frag_peaks_summary.txt
Untracked files:
Untracked: analysis/Top2a_Top2b_expression.Rmd
Untracked: analysis/maps_and_plots.Rmd
Untracked: analysis/proteomics.Rmd
Untracked: code/making_analysis_file_summary.R
Unstaged changes:
Modified: analysis/Outlier_removal.Rmd
Note that any generated files, e.g. HTML, png, CSS, etc., are not included in this status report because it is ok for generated content to have uncommitted changes.
These are the previous versions of the repository in which changes were
made to the R Markdown (analysis/repeatmasker_data.Rmd) and
HTML (docs/repeatmasker_data.html) files. If you’ve
configured a remote Git repository (see ?wflow_git_remote),
click on the hyperlinks in the table below to view the files as they
were in that past version.
| File | Version | Author | Date | Message |
|---|---|---|---|---|
| Rmd | 6b65c26 | reneeisnowhere | 2025-09-23 | wflow_publish("analysis/repeatmasker_data.Rmd") |
| html | fe42ebc | reneeisnowhere | 2025-09-16 | Build site. |
| Rmd | 6e58d85 | reneeisnowhere | 2025-09-16 | first setup |
| html | 967b7a8 | reneeisnowhere | 2025-09-16 | Build site. |
| Rmd | 48f3b75 | reneeisnowhere | 2025-09-16 | first setup |
| Rmd | 06d2af1 | reneeisnowhere | 2025-09-16 | wflow_git_commit("analysis/repeatmasker_data.Rmd") |
libraries
library(tidyverse)
library(DT)
library(genomation)
library(GenomicRanges)
library(BiocParallel)
library(TxDb.Hsapiens.UCSC.hg38.knownGene)
library(ChIPseeker)
library(plyranges)
library(ggsignif)
library(ggpubr)
Loading data files
repeatmasker <- read_delim("data/Other_paper_data/repeatmasker_20250911.txt",
delim = "\t", escape_double = FALSE,
trim_ws = TRUE)
peakAnnoList_H3K27ac <- readRDS("data/motif_lists/H3K27ac_annotated_peaks.RDS")
peakAnnoList_H3K27me3 <- readRDS("data/motif_lists/H3K27me3_annotated_peaks.RDS")
peakAnnoList_H3K36me3 <- readRDS("data/motif_lists/H3K36me3_annotated_peaks.RDS")
peakAnnoList_H3K9me3 <- readRDS("data/motif_lists/H3K9me3_annotated_peaks.RDS")
rmskr_std_gr <- repeatmasker %>%
makeGRangesFromDataFrame(., seqnames="genoName", keep.extra.columns=TRUE,
start.field="genoStart",
end.field="genoEnd",
starts.in.df.are.0based=TRUE) %>%
keepStandardChromosomes(., pruning.mode = "coarse")
##checking chromosomes
#seqlevels(rmskr_std_gr)
###making granges lists out of motifs
H3K27ac_sets_gr <- lapply(peakAnnoList_H3K27ac, function(df) {
as_granges(df)
})
H3K27me3_sets_gr <- lapply(peakAnnoList_H3K27me3, function(df) {
as_granges(df)
})
H3K36me3_sets_gr <- lapply(peakAnnoList_H3K36me3, function(df) {
as_granges(df)
})
H3K9me3_sets_gr <- lapply(peakAnnoList_H3K9me3, function(df) {
as_granges(df)
})
## this is now taking the overlaps and widths of the TE and roi, then calculating percentage overlaps for each (pct_te_overlap and pct_roi_overlap)
# H3K27ac_ols <- lapply(H3K27ac_sets_gr, function(df) {
# overlaps <- join_overlap_inner(df, rmskr_std_gr)
# # overlap base pairs
# })
H3K27ac_ols <- lapply(H3K27ac_sets_gr, function(df) {
# 1️⃣ find overlaps
hits <- findOverlaps(df, rmskr_std_gr)
# 2️⃣ intersect the overlapping ranges
overlaps_gr <- pintersect(df[queryHits(hits)], rmskr_std_gr[subjectHits(hits)])
# 3️⃣ convert to data frame
overlaps_df <- as.data.frame(overlaps_gr)
# 4️⃣ add overlap length
overlaps_df$overlap_bp <- width(overlaps_gr)
# 5️⃣ ROI lengths & % coverage
roi_widths <- width(df)
names(roi_widths) <- df$Peakid
overlaps_df$roi_len <- roi_widths[as.character(overlaps_df$Peakid)]
overlaps_df$pct_roi_overlap <- (overlaps_df$overlap_bp / overlaps_df$roi_len) * 100
# 6️⃣ TE lengths & % coverage
te_widths <- width(rmskr_std_gr)
overlaps_df$te_len <- te_widths[subjectHits(hits)]
overlaps_df$pct_te_overlap <- (overlaps_df$overlap_bp / overlaps_df$te_len) * 100
# 7️⃣ Add TE metadata from rmsk GRanges
overlaps_df$repName <- rmskr_std_gr$repName[subjectHits(hits)]
overlaps_df$repClass <- rmskr_std_gr$repClass[subjectHits(hits)]
overlaps_df$repFamily <- rmskr_std_gr$repFamily[subjectHits(hits)]
overlaps_df$milliDiv <- rmskr_std_gr$milliDiv[subjectHits(hits)]
overlaps_df$milliDel <- rmskr_std_gr$milliDel[subjectHits(hits)]
overlaps_df$milliIns <- rmskr_std_gr$milliIns[subjectHits(hits)]
overlaps_df
})
H3K27ac_ols$Set_1 %>%
as.data.frame() %>%
group_by(Peakid,repClass) %>%
summarise() %>%
group_by(Peakid) %>% tally %>%
group_by(n) %>% tally
# A tibble: 8 × 2
n nn
<int> <int>
1 1 40854
2 2 30144
3 3 12604
4 4 4120
5 5 1127
6 6 171
7 7 5
8 8 1
H3K27ac_ols$Set_1 %>%
as.data.frame() %>%
group_by(Peakid) %>%
summarize(.,
repClass=paste0(unique(repClass), collapse=";"))
# A tibble: 89,026 × 2
Peakid repClass
<chr> <chr>
1 chr10:100005111-100005718 LINE;SINE
2 chr10:100006009-100007728 SINE;LINE
3 chr10:100008216-100011014 SINE;Simple_repeat;Unknown;LINE
4 chr10:100043350-100044800 SINE;DNA;Low_complexity
5 chr10:100046129-100046853 LTR;SINE
6 chr10:100184879-100187424 DNA
7 chr10:100199091-100199540 SINE
8 chr10:10020802-10021452 SINE;LTR
9 chr10:100208189-100209822 SINE;LTR
10 chr10:100226355-100227311 LINE
# ℹ 89,016 more rows
H3K27ac_ols$Set_2 %>%
as.data.frame() %>%
group_by(Peakid,repClass) %>%
summarise() %>%
group_by(Peakid) %>% tally %>%
group_by(n) %>% tally
# A tibble: 6 × 2
n nn
<int> <int>
1 1 513
2 2 411
3 3 195
4 4 38
5 5 8
6 6 1
H3K27ac_ols$Set_2 %>%
as.data.frame() %>%
group_by(Peakid) %>%
summarize(.,
repClass=paste0(unique(repClass), collapse=";"))
# A tibble: 1,166 × 2
Peakid repClass
<chr> <chr>
1 chr10:103196972-103198089 LINE;SINE;Low_complexity
2 chr10:103898324-103899678 LINE;SINE
3 chr10:104775314-104776084 SINE;LINE
4 chr10:105604321-105604967 SINE;LINE;LTR;Simple_repeat
5 chr10:11058814-11060583 SINE;DNA
6 chr10:110845092-110847119 SINE;LINE;Simple_repeat
7 chr10:110865305-110866546 SINE
8 chr10:110946871-110947695 LINE
9 chr10:113326142-113327412 SINE
10 chr10:113524679-113525161 SINE
# ℹ 1,156 more rows
H3K27ac_ols$Set_3 %>%
as.data.frame() %>%
group_by(Peakid,repClass) %>%
summarise() %>%
group_by(Peakid) %>% tally %>%
group_by(n) %>% tally
# A tibble: 6 × 2
n nn
<int> <int>
1 1 2303
2 2 2054
3 3 974
4 4 317
5 5 67
6 6 9
H3K27ac_ols$Set_3 %>%
as.data.frame() %>%
group_by(Peakid) %>%
summarize(.,
repClass=paste0(unique(repClass), collapse=";"))
# A tibble: 5,724 × 2
Peakid repClass
<chr> <chr>
1 chr10:100738942-100739973 Simple_repeat
2 chr10:100790295-100791877 Low_complexity;Simple_repeat
3 chr10:101132954-101134846 Simple_repeat
4 chr10:101157763-101159378 RC;LTR
5 chr10:101227112-101229171 Low_complexity;Simple_repeat
6 chr10:101230211-101230842 Simple_repeat
7 chr10:101557797-101558624 SINE
8 chr10:101995380-101996279 LINE
9 chr10:102143712-102144701 LINE
10 chr10:102158807-102161487 SINE;Simple_repeat;LTR
# ℹ 5,714 more rows
H3K27ac_ols$all_H3K27ac %>%
as.data.frame() %>%
group_by(Peakid,repClass) %>%
summarise() %>%
group_by(Peakid) %>% tally %>%
group_by(n) %>% tally
# A tibble: 8 × 2
n nn
<int> <int>
1 1 55458
2 2 41836
3 3 17564
4 4 5519
5 5 1416
6 6 203
7 7 7
8 8 1
H3K27ac_ols$all_H3K27ac %>%
as.data.frame() %>%
group_by(Peakid) %>%
summarize(.,
repClass=paste0(unique(repClass), collapse=";"))
# A tibble: 122,004 × 2
Peakid repClass
<chr> <chr>
1 chr10:100005111-100005718 LINE;SINE
2 chr10:100006009-100007728 SINE;LINE
3 chr10:100008216-100011014 SINE;Simple_repeat;Unknown;LINE
4 chr10:100043350-100044800 SINE;DNA;Low_complexity
5 chr10:100046129-100046853 LTR;SINE
6 chr10:100179689-100180264 SINE
7 chr10:100184879-100187424 DNA
8 chr10:100199091-100199540 SINE
9 chr10:10020802-10021452 SINE;LTR
10 chr10:100208189-100209822 SINE;LTR
# ℹ 121,994 more rows
length(H3K27ac_sets_gr$all_H3K27ac)
[1] 150071
H3K27me3_ols <- lapply(H3K27me3_sets_gr, function(df) {
# 1️⃣ find overlaps
hits <- findOverlaps(df, rmskr_std_gr)
# 2️⃣ intersect the overlapping ranges
overlaps_gr <- pintersect(df[queryHits(hits)], rmskr_std_gr[subjectHits(hits)])
# 3️⃣ convert to data frame
overlaps_df <- as.data.frame(overlaps_gr)
# 4️⃣ add overlap length
overlaps_df$overlap_bp <- width(overlaps_gr)
# 5️⃣ ROI lengths & % coverage
roi_widths <- width(df)
names(roi_widths) <- df$Peakid
overlaps_df$roi_len <- roi_widths[as.character(overlaps_df$Peakid)]
overlaps_df$pct_roi_overlap <- (overlaps_df$overlap_bp / overlaps_df$roi_len) * 100
# 6️⃣ TE lengths & % coverage
te_widths <- width(rmskr_std_gr)
overlaps_df$te_len <- te_widths[subjectHits(hits)]
overlaps_df$pct_te_overlap <- (overlaps_df$overlap_bp / overlaps_df$te_len) * 100
# 7️⃣ Add TE metadata from rmsk GRanges
overlaps_df$repName <- rmskr_std_gr$repName[subjectHits(hits)]
overlaps_df$repClass <- rmskr_std_gr$repClass[subjectHits(hits)]
overlaps_df$repFamily <- rmskr_std_gr$repFamily[subjectHits(hits)]
overlaps_df$milliDiv <- rmskr_std_gr$milliDiv[subjectHits(hits)]
overlaps_df$milliDel <- rmskr_std_gr$milliDel[subjectHits(hits)]
overlaps_df$milliIns <- rmskr_std_gr$milliIns[subjectHits(hits)]
overlaps_df
})
H3K27me3_ols$Set_1 %>%
as.data.frame() %>%
group_by(Peakid,repClass) %>%
summarise() %>%
group_by(Peakid) %>% tally %>%
group_by(n) %>% tally
# A tibble: 8 × 2
n nn
<int> <int>
1 1 59299
2 2 34090
3 3 12830
4 4 5071
5 5 1774
6 6 379
7 7 35
8 8 1
H3K27me3_ols$Set_1 %>%
as.data.frame() %>%
group_by(Peakid) %>%
summarize(.,
repClass=paste0(unique(repClass), collapse=";"))
# A tibble: 113,479 × 2
Peakid repClass
<chr> <chr>
1 chr10:100046102-100046460 LTR;SINE
2 chr10:100054231-100054520 SINE
3 chr10:100078858-100079126 LINE
4 chr10:100084784-100089581 SINE;LINE;LTR;Low_complexity
5 chr10:100113114-100113630 SINE
6 chr10:100113867-100114131 SINE
7 chr10:100114436-100115458 LINE
8 chr10:100127919-100128315 LTR;LINE
9 chr10:100129022-100129787 LINE;SINE
10 chr10:100229003-100229494 Simple_repeat
# ℹ 113,469 more rows
H3K27me3_ols$Set_2 %>%
as.data.frame() %>%
group_by(Peakid,repClass) %>%
summarise() %>%
group_by(Peakid) %>% tally %>%
group_by(n) %>% tally
# A tibble: 3 × 2
n nn
<int> <int>
1 1 104
2 2 41
3 3 8
H3K27me3_ols$Set_2 %>%
as.data.frame() %>%
group_by(Peakid) %>%
summarize(.,
repClass=paste0(unique(repClass), collapse=";"))
# A tibble: 153 × 2
Peakid repClass
<chr> <chr>
1 chr10:104424708-104425325 LINE;SINE
2 chr10:113524664-113525075 SINE
3 chr10:129076374-129076858 Simple_repeat
4 chr10:130134682-130135251 LINE
5 chr10:132175296-132177004 LINE;Simple_repeat;tRNA
6 chr10:133324189-133324746 Simple_repeat
7 chr10:1398800-1399241 LINE
8 chr10:27256098-27256641 SINE;LTR
9 chr10:29968825-29969238 Simple_repeat
10 chr10:30348778-30349159 Simple_repeat
# ℹ 143 more rows
H3K27me3_ols$all_H3K27me3 %>%
as.data.frame() %>%
group_by(Peakid,repClass) %>%
summarise() %>%
group_by(Peakid) %>% tally %>%
group_by(n) %>% tally
# A tibble: 8 × 2
n nn
<int> <int>
1 1 59987
2 2 34344
3 3 12893
4 4 5078
5 5 1775
6 6 379
7 7 35
8 8 1
H3K27me3_ols$all_H3K27me3 %>%
as.data.frame() %>%
group_by(Peakid) %>%
summarize(.,
repClass=paste0(unique(repClass), collapse=";"))
# A tibble: 114,492 × 2
Peakid repClass
<chr> <chr>
1 chr10:100046102-100046460 LTR;SINE
2 chr10:100054231-100054520 SINE
3 chr10:100078858-100079126 LINE
4 chr10:100084784-100089581 SINE;LINE;LTR;Low_complexity
5 chr10:100113114-100113630 SINE
6 chr10:100113867-100114131 SINE
7 chr10:100114436-100115458 LINE
8 chr10:100127919-100128315 LTR;LINE
9 chr10:100129022-100129787 LINE;SINE
10 chr10:100229003-100229494 Simple_repeat
# ℹ 114,482 more rows
length(H3K27me3_sets_gr$all_H3K27me3)
[1] 150464
H3K36me3_ols <- lapply(H3K36me3_sets_gr, function(df) {
# 1️⃣ find overlaps
hits <- findOverlaps(df, rmskr_std_gr)
# 2️⃣ intersect the overlapping ranges
overlaps_gr <- pintersect(df[queryHits(hits)], rmskr_std_gr[subjectHits(hits)])
# 3️⃣ convert to data frame
overlaps_df <- as.data.frame(overlaps_gr)
# 4️⃣ add overlap length
overlaps_df$overlap_bp <- width(overlaps_gr)
# 5️⃣ ROI lengths & % coverage
roi_widths <- width(df)
names(roi_widths) <- df$Peakid
overlaps_df$roi_len <- roi_widths[as.character(overlaps_df$Peakid)]
overlaps_df$pct_roi_overlap <- (overlaps_df$overlap_bp / overlaps_df$roi_len) * 100
# 6️⃣ TE lengths & % coverage
te_widths <- width(rmskr_std_gr)
overlaps_df$te_len <- te_widths[subjectHits(hits)]
overlaps_df$pct_te_overlap <- (overlaps_df$overlap_bp / overlaps_df$te_len) * 100
# 7️⃣ Add TE metadata from rmsk GRanges
overlaps_df$repName <- rmskr_std_gr$repName[subjectHits(hits)]
overlaps_df$repClass <- rmskr_std_gr$repClass[subjectHits(hits)]
overlaps_df$repFamily <- rmskr_std_gr$repFamily[subjectHits(hits)]
overlaps_df$milliDiv <- rmskr_std_gr$milliDiv[subjectHits(hits)]
overlaps_df$milliDel <- rmskr_std_gr$milliDel[subjectHits(hits)]
overlaps_df$milliIns <- rmskr_std_gr$milliIns[subjectHits(hits)]
overlaps_df
})
H3K36me3_ols$Set_1 %>%
as.data.frame() %>%
group_by(Peakid,repClass) %>%
summarise() %>%
group_by(Peakid) %>% tally %>%
group_by(n) %>% tally
# A tibble: 7 × 2
n nn
<int> <int>
1 1 61918
2 2 24560
3 3 5127
4 4 1062
5 5 192
6 6 22
7 7 1
H3K36me3_ols$Set_1 %>%
as.data.frame() %>%
group_by(Peakid) %>%
summarize(.,
repClass=paste0(unique(repClass), collapse=";"))
# A tibble: 92,882 × 2
Peakid repClass
<chr> <chr>
1 chr10:100009339-100010339 SINE;Simple_repeat
2 chr10:100139015-100139306 LINE
3 chr10:100147443-100147738 LINE
4 chr10:100196474-100197370 SINE;DNA;LINE
5 chr10:100198988-100199626 SINE
6 chr10:100201468-100201937 DNA;SINE
7 chr10:100202789-100203153 SINE;LINE
8 chr10:100203832-100204198 LTR
9 chr10:100208051-100208777 SINE;LTR
10 chr10:100214861-100215206 LINE;SINE
# ℹ 92,872 more rows
H3K36me3_ols$Set_2 %>%
as.data.frame() %>%
group_by(Peakid,repClass) %>%
summarise() %>%
group_by(Peakid) %>% tally %>%
group_by(n) %>% tally
# A tibble: 6 × 2
n nn
<int> <int>
1 1 628
2 2 359
3 3 148
4 4 43
5 5 18
6 6 5
H3K36me3_ols$Set_2 %>%
as.data.frame() %>%
group_by(Peakid) %>%
summarize(.,
repClass=paste0(unique(repClass), collapse=";"))
# A tibble: 1,201 × 2
Peakid repClass
<chr> <chr>
1 chr10:110729433-110730934 LINE;SINE;Simple_repeat
2 chr10:110803615-110804488 SINE;LTR
3 chr10:114151594-114151957 DNA
4 chr10:117139090-117139567 Simple_repeat
5 chr10:117221071-117221394 DNA
6 chr10:118698372-118698876 SINE;LTR
7 chr10:122207358-122208228 DNA
8 chr10:123157127-123157564 LINE
9 chr10:124613974-124614749 SINE
10 chr10:124670751-124671297 LINE;LTR
# ℹ 1,191 more rows
H3K36me3_ols$all_H3K36me3 %>%
as.data.frame() %>%
group_by(Peakid,repClass) %>%
summarise() %>%
group_by(Peakid) %>% tally %>%
group_by(n) %>% tally
# A tibble: 7 × 2
n nn
<int> <int>
1 1 86790
2 2 34976
3 3 7463
4 4 1502
5 5 284
6 6 39
7 7 3
H3K36me3_ols$all_H3K36me3 %>%
as.data.frame() %>%
group_by(Peakid) %>%
summarize(.,
repClass=paste0(unique(repClass), collapse=";"))
# A tibble: 131,057 × 2
Peakid repClass
<chr> <chr>
1 chr10:100009339-100010339 SINE;Simple_repeat
2 chr10:100139015-100139306 LINE
3 chr10:100147443-100147738 LINE
4 chr10:100153009-100153887 SINE;LINE
5 chr10:100173135-100173409 DNA
6 chr10:100186943-100187317 DNA
7 chr10:100196474-100197370 SINE;DNA;LINE
8 chr10:100198988-100199626 SINE
9 chr10:100201468-100201937 DNA;SINE
10 chr10:100202789-100203153 SINE;LINE
# ℹ 131,047 more rows
length(H3K36me3_sets_gr$all_H3K36me3)
[1] 186724
H3K9me3_ols <- lapply(H3K9me3_sets_gr, function(df) {
# 1️⃣ find overlaps
hits <- findOverlaps(df, rmskr_std_gr)
# 2️⃣ intersect the overlapping ranges
overlaps_gr <- pintersect(df[queryHits(hits)], rmskr_std_gr[subjectHits(hits)])
# 3️⃣ convert to data frame
overlaps_df <- as.data.frame(overlaps_gr)
# 4️⃣ add overlap length
overlaps_df$overlap_bp <- width(overlaps_gr)
# 5️⃣ ROI lengths & % coverage
roi_widths <- width(df)
names(roi_widths) <- df$Peakid
overlaps_df$roi_len <- roi_widths[as.character(overlaps_df$Peakid)]
overlaps_df$pct_roi_overlap <- (overlaps_df$overlap_bp / overlaps_df$roi_len) * 100
# 6️⃣ TE lengths & % coverage
te_widths <- width(rmskr_std_gr)
overlaps_df$te_len <- te_widths[subjectHits(hits)]
overlaps_df$pct_te_overlap <- (overlaps_df$overlap_bp / overlaps_df$te_len) * 100
# 7️⃣ Add TE metadata from rmsk GRanges
overlaps_df$repName <- rmskr_std_gr$repName[subjectHits(hits)]
overlaps_df$repClass <- rmskr_std_gr$repClass[subjectHits(hits)]
overlaps_df$repFamily <- rmskr_std_gr$repFamily[subjectHits(hits)]
overlaps_df$milliDiv <- rmskr_std_gr$milliDiv[subjectHits(hits)]
overlaps_df$milliDel <- rmskr_std_gr$milliDel[subjectHits(hits)]
overlaps_df$milliIns <- rmskr_std_gr$milliIns[subjectHits(hits)]
overlaps_df
})
H3K9me3_ols$Set_1 %>%
as.data.frame() %>%
group_by(Peakid,repClass) %>%
summarise() %>%
group_by(Peakid) %>% tally %>%
group_by(n) %>% tally
# A tibble: 6 × 2
n nn
<int> <int>
1 1 107939
2 2 32414
3 3 4976
4 4 574
5 5 69
6 6 2
H3K9me3_ols$Set_1 %>%
as.data.frame() %>%
group_by(Peakid) %>%
summarize(.,
repClass=paste0(unique(repClass), collapse=";"))
# A tibble: 145,974 × 2
Peakid repClass
<chr> <chr>
1 chr10:100009524-100010017 Simple_repeat
2 chr10:100025478-100025829 LTR
3 chr10:100027743-100029126 LTR;SINE
4 chr10:100046119-100046433 LTR;SINE
5 chr10:100086477-100087363 LINE;LTR;SINE
6 chr10:100095279-100095524 LINE
7 chr10:100097281-100098436 SINE;LTR
8 chr10:100133860-100134292 LINE;Simple_repeat;LTR
9 chr10:10020906-10021393 SINE;LTR
10 chr10:100229039-100229275 Simple_repeat
# ℹ 145,964 more rows
H3K9me3_ols$Set_2 %>%
as.data.frame() %>%
group_by(Peakid,repClass) %>%
summarise() %>%
group_by(Peakid) %>% tally %>%
group_by(n) %>% tally
# A tibble: 4 × 2
n nn
<int> <int>
1 1 1362
2 2 266
3 3 26
4 4 2
H3K9me3_ols$Set_2 %>%
as.data.frame() %>%
group_by(Peakid) %>%
summarize(.,
repClass=paste0(unique(repClass), collapse=";"))
# A tibble: 1,656 × 2
Peakid repClass
<chr> <chr>
1 chr10:101006128-101006429 SINE
2 chr10:101777876-101778344 LINE
3 chr10:101951337-101951612 SINE
4 chr10:102418095-102418389 Simple_repeat
5 chr10:104329400-104329651 LINE
6 chr10:1101897-1102240 LINE
7 chr10:110227445-110227710 Low_complexity
8 chr10:114786610-114786850 Retroposon
9 chr10:117531528-117532237 SINE
10 chr10:11893983-11894609 SINE
# ℹ 1,646 more rows
H3K9me3_ols$Set_3 %>%
as.data.frame() %>%
group_by(Peakid,repClass) %>%
summarise() %>%
group_by(Peakid) %>% tally %>%
group_by(n) %>% tally
# A tibble: 5 × 2
n nn
<int> <int>
1 1 405
2 2 119
3 3 26
4 4 4
5 5 1
H3K9me3_ols$Set_3 %>%
as.data.frame() %>%
group_by(Peakid) %>%
summarize(.,
repClass=paste0(unique(repClass), collapse=";"))
# A tibble: 555 × 2
Peakid repClass
<chr> <chr>
1 chr10:114252483-114252701 SINE
2 chr10:115903919-115904555 LTR
3 chr10:124611053-124611641 SINE
4 chr10:1298991-1300057 SINE;LTR
5 chr10:1300214-1301077 LINE;SINE
6 chr10:132172535-132172962 LINE;Simple_repeat
7 chr10:132279104-132279514 SINE
8 chr10:132363329-132364326 Simple_repeat
9 chr10:132757314-132757720 DNA
10 chr10:24446759-24447078 SINE
# ℹ 545 more rows
H3K9me3_ols$all_H3K9me3 %>%
as.data.frame() %>%
group_by(Peakid,repClass) %>%
summarise() %>%
group_by(Peakid) %>% tally %>%
group_by(n) %>% tally
# A tibble: 6 × 2
n nn
<int> <int>
1 1 117258
2 2 34507
3 3 5210
4 4 591
5 5 70
6 6 2
H3K9me3_ols$all_H3K9me3 %>%
as.data.frame() %>%
group_by(Peakid) %>%
summarize(.,
repClass=paste0(unique(repClass), collapse=";"))
# A tibble: 157,638 × 2
Peakid repClass
<chr> <chr>
1 chr10:100006461-100006612 SINE;LINE
2 chr10:100009524-100010017 Simple_repeat
3 chr10:100025478-100025829 LTR
4 chr10:100027743-100029126 LTR;SINE
5 chr10:100046119-100046433 LTR;SINE
6 chr10:100086477-100087363 LINE;LTR;SINE
7 chr10:100095279-100095524 LINE
8 chr10:100097281-100098436 SINE;LTR
9 chr10:100133860-100134292 LINE;Simple_repeat;LTR
10 chr10:10020906-10021393 SINE;LTR
# ℹ 157,628 more rows
length(H3K9me3_sets_gr$all_H3K9me3)
[1] 218647
Below I am setting up the data frames, first I take all regions and annotate with the overlapping dataframe. Then I make new columns, TE_status, SINE_status, LINE_status, LTR_status, DNA_status, SVA_status. These columns mark Peakids if they have this attribute. Last, I am annotating a cluster column to add which set the Peakid exists in by histone. (field will be Set1, Set2, Set3, no_cluster)
All_H3K27ac <- H3K27ac_sets_gr$all_H3K27ac %>% as.data.frame()
All_H3K27me3 <- H3K27me3_sets_gr$all_H3K27me3 %>% as.data.frame()
All_H3K36me3 <- H3K36me3_sets_gr$all_H3K36me3 %>% as.data.frame()
All_H3K9me3 <- H3K9me3_sets_gr$all_H3K9me3 %>% as.data.frame()
H3K27ac_lookup <- imap_dfr(peakAnnoList_H3K27ac[1:3], ~
tibble(Peakid = .x@anno$Peakid, cluster = .y)
)
H3K27me3_lookup <- imap_dfr(peakAnnoList_H3K27me3[1:2], ~
tibble(Peakid = .x@anno$Peakid, cluster = .y)
)
H3K36me3_lookup <- imap_dfr(peakAnnoList_H3K36me3[1:2], ~
tibble(Peakid = .x@anno$Peakid, cluster = .y)
)
H3K9me3_lookup <- imap_dfr(peakAnnoList_H3K9me3[1:3], ~
tibble(Peakid = .x@anno$Peakid, cluster = .y)
)
annotated_H3K27ac <- All_H3K27ac %>%
left_join(., (H3K27ac_ols$all_H3K27ac %>%
as.data.frame() %>%
group_by(Peakid) %>%
summarize(.,
repClass=paste0(unique(repClass), collapse=";"),
repFamily=paste0(unique(repFamily),collapse=";"),
repName=paste0(unique(repName),collapse=";"))))%>%
mutate(TE_status = if_else(is.na(repClass), "wnot_TE","TE")) %>%
mutate(SINE_status = case_when(is.na(repClass)~ "wnot_SINE",
str_detect(repClass, "SINE") ~ "SINE",
TRUE ~"wnot_SINE")) %>%
mutate(LINE_status = case_when(is.na(repClass)~ "wnot_LINE",
str_detect(repClass, "LINE") ~ "LINE",
TRUE ~"wnot_LINE")) %>%
mutate(LTR_status = case_when(is.na(repClass)~ "wnot_LTR",
str_detect(repClass, "LTR") ~ "LTR",
TRUE ~"wnot_LTR")) %>%
mutate(DNA_status = case_when(is.na(repClass)~ "wnot_DNA",
str_detect(repClass, "DNA") ~ "DNA",
TRUE ~"wnot_DNA")) %>%
mutate(SVA_status = case_when(is.na(repClass)~ "wnot_SVA",
str_detect(repClass, "Retroposon") ~ "SVA",
TRUE ~"wnot_SVA")) %>%
left_join(H3K27ac_lookup, by = "Peakid") %>%
mutate(cluster = if_else(is.na(cluster), "not_assigned", cluster))
annotated_H3K27me3 <- All_H3K27me3 %>%
left_join(., (H3K27me3_ols$all_H3K27me3 %>%
as.data.frame() %>%
group_by(Peakid) %>%
summarize(.,
repClass=paste0(unique(repClass), collapse=";"),
repFamily=paste0(unique(repFamily),collapse=";"),
repName=paste0(unique(repName),collapse=";"))))%>%
mutate(TE_status = if_else(is.na(repClass), "wnot_TE","TE")) %>%
mutate(SINE_status = case_when(is.na(repClass)~ "wnot_SINE",
str_detect(repClass, "SINE") ~ "SINE",
TRUE ~"wnot_SINE")) %>%
mutate(LINE_status = case_when(is.na(repClass)~ "wnot_LINE",
str_detect(repClass, "LINE") ~ "LINE",
TRUE ~"wnot_LINE")) %>%
mutate(LTR_status = case_when(is.na(repClass)~ "wnot_LTR",
str_detect(repClass, "LTR") ~ "LTR",
TRUE ~"wnot_LTR")) %>%
mutate(DNA_status = case_when(is.na(repClass)~ "wnot_DNA",
str_detect(repClass, "DNA") ~ "DNA",
TRUE ~"wnot_DNA")) %>%
mutate(SVA_status = case_when(is.na(repClass)~ "wnot_SVA",
str_detect(repClass, "Retroposon") ~ "SVA",
TRUE ~"wnot_SVA")) %>%
left_join(H3K27me3_lookup, by = "Peakid") %>%
mutate(cluster = if_else(is.na(cluster), "not_assigned", cluster))
annotated_H3K36me3 <- All_H3K36me3 %>%
left_join(., (H3K36me3_ols$all_H3K36me3 %>%
as.data.frame() %>%
group_by(Peakid) %>%
summarize(.,
repClass=paste0(unique(repClass), collapse=";"),
repFamily=paste0(unique(repFamily),collapse=";"),
repName=paste0(unique(repName),collapse=";"))))%>%
mutate(TE_status = if_else(is.na(repClass), "wnot_TE","TE")) %>%
mutate(SINE_status = case_when(is.na(repClass)~ "wnot_SINE",
str_detect(repClass, "SINE") ~ "SINE",
TRUE ~"wnot_SINE")) %>%
mutate(LINE_status = case_when(is.na(repClass)~ "wnot_LINE",
str_detect(repClass, "LINE") ~ "LINE",
TRUE ~"wnot_LINE")) %>%
mutate(LTR_status = case_when(is.na(repClass)~ "wnot_LTR",
str_detect(repClass, "LTR") ~ "LTR",
TRUE ~"wnot_LTR")) %>%
mutate(DNA_status = case_when(is.na(repClass)~ "wnot_DNA",
str_detect(repClass, "DNA") ~ "DNA",
TRUE ~"wnot_DNA")) %>%
mutate(SVA_status = case_when(is.na(repClass)~ "wnot_SVA",
str_detect(repClass, "Retroposon") ~ "SVA",
TRUE ~"wnot_SVA")) %>%
left_join(H3K36me3_lookup, by = "Peakid") %>%
mutate(cluster = if_else(is.na(cluster), "not_assigned", cluster))
annotated_H3K9me3 <- All_H3K9me3 %>%
left_join(., (H3K9me3_ols$all_H3K9me3 %>%
as.data.frame() %>%
group_by(Peakid) %>%
summarize(.,
repClass=paste0(unique(repClass), collapse=";"),
repFamily=paste0(unique(repFamily),collapse=";"),
repName=paste0(unique(repName),collapse=";"))))%>%
mutate(TE_status = if_else(is.na(repClass), "wnot_TE","TE")) %>%
mutate(SINE_status = case_when(is.na(repClass)~ "wnot_SINE",
str_detect(repClass, "SINE") ~ "SINE",
TRUE ~"wnot_SINE")) %>%
mutate(LINE_status = case_when(is.na(repClass)~ "wnot_LINE",
str_detect(repClass, "LINE") ~ "LINE",
TRUE ~"wnot_LINE")) %>%
mutate(LTR_status = case_when(is.na(repClass)~ "wnot_LTR",
str_detect(repClass, "LTR") ~ "LTR",
TRUE ~"wnot_LTR")) %>%
mutate(DNA_status = case_when(is.na(repClass)~ "wnot_DNA",
str_detect(repClass, "DNA") ~ "DNA",
TRUE ~"wnot_DNA")) %>%
mutate(SVA_status = case_when(is.na(repClass)~ "wnot_SVA",
str_detect(repClass, "Retroposon") ~ "SVA",
TRUE ~"wnot_SVA")) %>%
left_join(H3K9me3_lookup, by = "Peakid") %>%
mutate(cluster = if_else(is.na(cluster), "not_assigned", cluster))
# annotated_tables <- list("annotated_H3K27ac"=annotated_H3K27ac,
# "annotated_H3K27me3"=annotated_H3K27me3,
# "annotated_H3K36me3"=annotated_H3K36me3,
# "annotated_H3K9me3"=annotated_H3K9me3)
#
# saveRDS(annotated_tables,"data/TE_annotation/annotated_tables_1bp_ol.RDS")
50% counting histones
annotated_H3K27ac_50_per <- All_H3K27ac %>%
left_join(., (H3K27ac_ols$all_H3K27ac %>%
as.data.frame() %>%
dplyr::filter(pct_te_overlap>50) %>%
group_by(Peakid) %>%
summarize(.,
repClass=paste0(unique(repClass[pct_te_overlap>50]), collapse=";"),
repFamily=paste0(unique(repFamily[pct_te_overlap>50]),collapse=";"),
repName=paste0(unique(repName[pct_te_overlap>50]),collapse=";"))))%>%
mutate(TE_status = if_else(is.na(repClass), "wnot_TE","TE")) %>%
mutate(SINE_status = case_when(is.na(repClass)~ "wnot_SINE",
str_detect(repClass, "SINE") ~ "SINE",
TRUE ~"wnot_SINE")) %>%
mutate(LINE_status = case_when(is.na(repClass)~ "wnot_LINE",
str_detect(repClass, "LINE") ~ "LINE",
TRUE ~"wnot_LINE")) %>%
mutate(LTR_status = case_when(is.na(repClass)~ "wnot_LTR",
str_detect(repClass, "LTR") ~ "LTR",
TRUE ~"wnot_LTR")) %>%
mutate(DNA_status = case_when(is.na(repClass)~ "wnot_DNA",
str_detect(repClass, "DNA") ~ "DNA",
TRUE ~"wnot_DNA")) %>%
mutate(SVA_status = case_when(is.na(repClass)~ "wnot_SVA",
str_detect(repClass, "Retroposon") ~ "SVA",
TRUE ~"wnot_SVA")) %>%
left_join(H3K27ac_lookup, by = "Peakid") %>%
mutate(cluster = if_else(is.na(cluster), "not_assigned", cluster))
annotated_H3K27me3_50_per <- All_H3K27me3 %>%
left_join(., (H3K27me3_ols$all_H3K27me3 %>%
as.data.frame() %>%
dplyr::filter(pct_te_overlap>50) %>%
group_by(Peakid) %>%
summarize(.,
repClass=paste0(unique(repClass[pct_te_overlap>50]), collapse=";"),
repFamily=paste0(unique(repFamily[pct_te_overlap>50]),collapse=";"),
repName=paste0(unique(repName[pct_te_overlap>50]),collapse=";"))))%>%
mutate(TE_status = if_else(is.na(repClass), "wnot_TE","TE")) %>%
mutate(SINE_status = case_when(is.na(repClass)~ "wnot_SINE",
str_detect(repClass, "SINE") ~ "SINE",
TRUE ~"wnot_SINE")) %>%
mutate(LINE_status = case_when(is.na(repClass)~ "wnot_LINE",
str_detect(repClass, "LINE") ~ "LINE",
TRUE ~"wnot_LINE")) %>%
mutate(LTR_status = case_when(is.na(repClass)~ "wnot_LTR",
str_detect(repClass, "LTR") ~ "LTR",
TRUE ~"wnot_LTR")) %>%
mutate(DNA_status = case_when(is.na(repClass)~ "wnot_DNA",
str_detect(repClass, "DNA") ~ "DNA",
TRUE ~"wnot_DNA")) %>%
mutate(SVA_status = case_when(is.na(repClass)~ "wnot_SVA",
str_detect(repClass, "Retroposon") ~ "SVA",
TRUE ~"wnot_SVA")) %>%
left_join(H3K27me3_lookup, by = "Peakid") %>%
mutate(cluster = if_else(is.na(cluster), "not_assigned", cluster))
annotated_H3K36me3_50_per <- All_H3K36me3 %>%
left_join(., (H3K36me3_ols$all_H3K36me3 %>%
as.data.frame() %>%
dplyr::filter(pct_te_overlap>50) %>%
group_by(Peakid) %>%
summarize(.,
repClass=paste0(unique(repClass[pct_te_overlap>50]), collapse=";"),
repFamily=paste0(unique(repFamily[pct_te_overlap>50]),collapse=";"),
repName=paste0(unique(repName[pct_te_overlap>50]),collapse=";"))))%>%
mutate(TE_status = if_else(is.na(repClass), "wnot_TE","TE")) %>%
mutate(SINE_status = case_when(is.na(repClass)~ "wnot_SINE",
str_detect(repClass, "SINE") ~ "SINE",
TRUE ~"wnot_SINE")) %>%
mutate(LINE_status = case_when(is.na(repClass)~ "wnot_LINE",
str_detect(repClass, "LINE") ~ "LINE",
TRUE ~"wnot_LINE")) %>%
mutate(LTR_status = case_when(is.na(repClass)~ "wnot_LTR",
str_detect(repClass, "LTR") ~ "LTR",
TRUE ~"wnot_LTR")) %>%
mutate(DNA_status = case_when(is.na(repClass)~ "wnot_DNA",
str_detect(repClass, "DNA") ~ "DNA",
TRUE ~"wnot_DNA")) %>%
mutate(SVA_status = case_when(is.na(repClass)~ "wnot_SVA",
str_detect(repClass, "Retroposon") ~ "SVA",
TRUE ~"wnot_SVA")) %>%
left_join(H3K36me3_lookup, by = "Peakid") %>%
mutate(cluster = if_else(is.na(cluster), "not_assigned", cluster))
annotated_H3K9me3_50_per <- All_H3K9me3 %>%
left_join(., (H3K9me3_ols$all_H3K9me3 %>%
as.data.frame() %>%
dplyr::filter(pct_te_overlap>50) %>%
group_by(Peakid) %>%
summarize(.,
repClass=paste0(unique(repClass[pct_te_overlap>50]), collapse=";"),
repFamily=paste0(unique(repFamily[pct_te_overlap>50]),collapse=";"),
repName=paste0(unique(repName[pct_te_overlap>50]),collapse=";"))))%>%
mutate(TE_status = if_else(is.na(repClass), "wnot_TE","TE")) %>%
mutate(SINE_status = case_when(is.na(repClass)~ "wnot_SINE",
str_detect(repClass, "SINE") ~ "SINE",
TRUE ~"wnot_SINE")) %>%
mutate(LINE_status = case_when(is.na(repClass)~ "wnot_LINE",
str_detect(repClass, "LINE") ~ "LINE",
TRUE ~"wnot_LINE")) %>%
mutate(LTR_status = case_when(is.na(repClass)~ "wnot_LTR",
str_detect(repClass, "LTR") ~ "LTR",
TRUE ~"wnot_LTR")) %>%
mutate(DNA_status = case_when(is.na(repClass)~ "wnot_DNA",
str_detect(repClass, "DNA") ~ "DNA",
TRUE ~"wnot_DNA")) %>%
mutate(SVA_status = case_when(is.na(repClass)~ "wnot_SVA",
str_detect(repClass, "Retroposon") ~ "SVA",
TRUE ~"wnot_SVA")) %>%
left_join(H3K9me3_lookup, by = "Peakid") %>%
mutate(cluster = if_else(is.na(cluster), "not_assigned", cluster))
# annotated_tables_50_per <- list("annotated_H3K27ac_50_per"=annotated_H3K27ac_50_per,
# "annotated_H3K27me3_50_per"=annotated_H3K27me3_50_per,
# "annotated_H3K36me3_50_per"=annotated_H3K36me3_50_per,
# "annotated_H3K9me3_50_per"=annotated_H3K9me3_50_per)
#
# saveRDS(annotated_tables_50_per,"data/TE_annotation/annotated_tables_50per_ol.RDS")
te_cols <- c("TE_status","SINE_status","LINE_status","DNA_status","LTR_status","SVA_status")
H3K27ac_counts <- map(te_cols, ~ {
annotated_H3K27ac %>%
group_by(cluster, !!sym(.x)) %>% # dynamically use column
tally() %>%
pivot_wider(
id_cols = cluster,
names_from = !!sym(.x),
values_from = n
)
}) %>% set_names(te_cols)
H3K27ac_counts_long <- map2_dfr(
H3K27ac_counts, # the named list of wide tables
names(H3K27ac_counts), # names of TE columns
~ .x %>%
pivot_longer(
cols = -cluster, # everything except 'cluster'
names_to = "status", # new column for TE status (e.g., TE / wnot_TE)
values_to = "count"
) %>%
mutate(TE_type = .y) # add TE_type column based on list element name
)
H3K27ac_counts_long <- H3K27ac_counts_long %>%
group_by(TE_type) %>%
tidyr::complete(cluster, status, fill = list(count = 0)) %>%
ungroup()
H3K27me3_counts <- map(te_cols, ~ {
annotated_H3K27me3 %>%
group_by(cluster, !!sym(.x)) %>% # dynamically use column
tally() %>%
pivot_wider(
id_cols = cluster,
names_from = !!sym(.x),
values_from = n
)
}) %>% set_names(te_cols)
H3K27me3_counts_long <- map2_dfr(
H3K27me3_counts, # the named list of wide tables
names(H3K27me3_counts), # names of TE columns
~ .x %>%
pivot_longer(
cols = -cluster, # everything except 'cluster'
names_to = "status", # new column for TE status (e.g., TE / not_TE)
values_to = "count"
) %>%
mutate(TE_type = .y) # add TE_type column based on list element name
)
H3K27me3_counts_long <- H3K27me3_counts_long %>%
group_by(TE_type) %>%
tidyr::complete(cluster, status, fill = list(count = 0)) %>%
ungroup()
H3K36me3_counts <- map(te_cols, ~ {
annotated_H3K36me3 %>%
group_by(cluster, !!sym(.x)) %>% # dynamically use column
tally() %>%
pivot_wider(
id_cols = cluster,
names_from = !!sym(.x),
values_from = n
)
}) %>% set_names(te_cols)
H3K36me3_counts_long <- map2_dfr(
H3K36me3_counts, # the named list of wide tables
names(H3K36me3_counts), # names of TE columns
~ .x %>%
pivot_longer(
cols = -cluster, # everything except 'cluster'
names_to = "status", # new column for TE status (e.g., TE / not_TE)
values_to = "count"
) %>%
mutate(TE_type = .y) # add TE_type column based on list element name
)
H3K36me3_counts_long <- H3K36me3_counts_long %>%
group_by(TE_type) %>%
tidyr::complete(cluster, status, fill = list(count = 0)) %>%
ungroup()
H3K9me3_counts <- map(te_cols, ~ {
annotated_H3K9me3 %>%
group_by(cluster, !!sym(.x)) %>% # dynamically use column
tally() %>%
pivot_wider(
id_cols = cluster,
names_from = !!sym(.x),
values_from = n
)
}) %>% set_names(te_cols)
H3K9me3_counts_long <- map2_dfr(
H3K9me3_counts, # the named list of wide tables
names(H3K9me3_counts), # names of TE columns
~ .x %>%
pivot_longer(
cols = -cluster, # everything except 'cluster'
names_to = "status", # new column for TE status (e.g., TE / wnot_TE)
values_to = "count"
) %>%
mutate(TE_type = .y) # add TE_type column based on list element name
)
H3K9me3_counts_long <- H3K9me3_counts_long %>%
group_by(TE_type) %>%
tidyr::complete(cluster, status, fill = list(count = 0)) %>%
ungroup()
te_cols <- c("TE_status","SINE_status","LINE_status","DNA_status","LTR_status","SVA_status")
H3K27ac_counts_50per <- map(te_cols, ~ {
annotated_H3K27ac_50_per %>%
group_by(cluster, !!sym(.x)) %>% # dynamically use column
tally() %>%
pivot_wider(
id_cols = cluster,
names_from = !!sym(.x),
values_from = n
)
}) %>% set_names(te_cols)
H3K27ac_counts_long_50per <- map2_dfr(
H3K27ac_counts_50per, # the named list of wide tables
names(H3K27ac_counts_50per), # names of TE columns
~ .x %>%
pivot_longer(
cols = -cluster, # everything except 'cluster'
names_to = "status", # new column for TE status (e.g., TE / wnot_TE)
values_to = "count"
) %>%
mutate(TE_type = .y) # add TE_type column based on list element name
)
H3K27ac_counts_long_50per <- H3K27ac_counts_long_50per %>%
group_by(TE_type) %>%
tidyr::complete(cluster, status, fill = list(count = 0)) %>%
ungroup()
H3K27me3_counts_50per <- map(te_cols, ~ {
annotated_H3K27me3_50_per %>%
group_by(cluster, !!sym(.x)) %>% # dynamically use column
tally() %>%
pivot_wider(
id_cols = cluster,
names_from = !!sym(.x),
values_from = n
)
}) %>% set_names(te_cols)
H3K27me3_counts_long_50per <- map2_dfr(
H3K27me3_counts_50per, # the named list of wide tables
names(H3K27me3_counts_50per), # names of TE columns
~ .x %>%
pivot_longer(
cols = -cluster, # everything except 'cluster'
names_to = "status", # new column for TE status (e.g., TE / not_TE)
values_to = "count"
) %>%
mutate(TE_type = .y) # add TE_type column based on list element name
)
H3K27me3_counts_long_50per <- H3K27me3_counts_long_50per %>%
group_by(TE_type) %>%
tidyr::complete(cluster, status, fill = list(count = 0)) %>%
ungroup()
H3K36me3_counts_50per <- map(te_cols, ~ {
annotated_H3K36me3_50_per %>%
group_by(cluster, !!sym(.x)) %>% # dynamically use column
tally() %>%
pivot_wider(
id_cols = cluster,
names_from = !!sym(.x),
values_from = n
)
}) %>% set_names(te_cols)
H3K36me3_counts_long_50per <- map2_dfr(
H3K36me3_counts_50per, # the named list of wide tables
names(H3K36me3_counts_50per), # names of TE columns
~ .x %>%
pivot_longer(
cols = -cluster, # everything except 'cluster'
names_to = "status", # new column for TE status (e.g., TE / not_TE)
values_to = "count"
) %>%
mutate(TE_type = .y) # add TE_type column based on list element name
)
H3K36me3_counts_long_50per <- H3K36me3_counts_long_50per %>%
group_by(TE_type) %>%
tidyr::complete(cluster, status, fill = list(count = 0)) %>%
ungroup()
H3K9me3_counts_50per <- map(te_cols, ~ {
annotated_H3K9me3_50_per %>%
group_by(cluster, !!sym(.x)) %>% # dynamically use column
tally() %>%
pivot_wider(
id_cols = cluster,
names_from = !!sym(.x),
values_from = n
)
}) %>% set_names(te_cols)
H3K9me3_counts_long_50per <- map2_dfr(
H3K9me3_counts_50per, # the named list of wide tables
names(H3K9me3_counts_50per), # names of TE columns
~ .x %>%
pivot_longer(
cols = -cluster, # everything except 'cluster'
names_to = "status", # new column for TE status (e.g., TE / wnot_TE)
values_to = "count"
) %>%
mutate(TE_type = .y) # add TE_type column based on list element name
)
H3K9me3_counts_long_50per <- H3K9me3_counts_long_50per %>%
group_by(TE_type) %>%
tidyr::complete(cluster, status, fill = list(count = 0)) %>%
ungroup()
making the function
# # function to test enrichment for one TE type, one cluster pair
# test_pair_TE <- function(df_long, te_name, cluster1, cluster2) {
#
# # filter for TE type
# sub_df <- df_long %>% filter(TE_type == te_name)
#
# # get counts for cluster1 and cluster2
# c1 <- sub_df %>% dplyr::filter(cluster == cluster1)
# c2 <- sub_df %>% dplyr::filter(cluster == cluster2)
#
# # 2x2 table: rows = clusters, cols = TE / wnot_TE
# mat <- matrix(
# c(c2$count, sum(c2$count) - c2$count,
# c1$count, sum(c1$count) - c1$count),
# nrow = 2,
# byrow = TRUE,
# dimnames = list(
# cluster = c(cluster2, cluster1),
# category = c("TE", "wnot_TE")
# )
# )
#
# ft <- fisher.test(mat)
#
# tibble(
# TE_type = te_name,
# comparison = paste(cluster2, "vs", cluster1),
# odds_ratio = ft$estimate,
# lower_CI = ft$conf.int[1],
# upper_CI = ft$conf.int[2],
# p_value = ft$p.value
# )
# }
# Generic pairwise Fisher test
test_pair_TE_generic <- function(df_long, te_name, cluster1, cluster2) {
sub_df <- df_long %>% filter(TE_type == te_name)
# assume "status" column has TE vs wnot_TE automatically
statuses <- unique(sub_df$status)
if(length(statuses) != 2) {
# ensure we have exactly two categories, fill missing with 0
sub_df <- sub_df %>%
complete(cluster, status, fill = list(count = 0))
statuses <- unique(sub_df$status)
}
# extract counts for cluster1
c1_counts <- sub_df %>%
filter(cluster == cluster1) %>%
arrange(match(status, statuses)) %>% # ensure same order
pull(count)
# extract counts for cluster2
c2_counts <- sub_df %>%
filter(cluster == cluster2) %>%
arrange(match(status, statuses)) %>%
pull(count)
# build 2x2 matrix
mat <- matrix(
c(c2_counts, c1_counts),
nrow = 2,
byrow = TRUE,
dimnames = list(
cluster = c(cluster2, cluster1),
category = statuses
)
)
ft <- tryCatch(
fisher.test(mat, workspace = 2e8),
error = function(e) fisher.test(mat, simulate.p.value = TRUE, B = 1e5)
)
tibble(
TE_type = te_name,
comparison = paste(cluster2, "vs", cluster1),
odds_ratio = ft$estimate,
lower_CI = ft$conf.int[1],
upper_CI = ft$conf.int[2],
p_value = ft$p.value
)
}
applying to dataframes
TE_types <- te_cols
cluster_pairs_3 <- list(
c("Set_2","Set_1"),
c("Set_3","Set_1")
# add more pairs if needed or
)
cluster_pairs_2 <- list(
c("Set_2","Set_1"))
H3K27ac_results_1bp <- map_dfr(TE_types, function(te){
map_dfr(cluster_pairs_3, function(pair){
test_pair_TE_generic(H3K27ac_counts_long, te, cluster1 = pair[2], cluster2 = pair[1])
})
})
htmltools::tagList(
htmltools::h3("H3K27ac_results_1bp"),
datatable(H3K27ac_results_1bp,options = list(
scrollY = "600px",
scrollCollapse = TRUE,
pageLength = 20,
autoWidth = TRUE
),
rownames = FALSE))
H3K27ac_counts_long %>%
dplyr::filter(cluster != "not_assigned") %>%
ggplot(aes(x=cluster, y=count, fill=status)) +
geom_bar(stat="identity", position = "fill") +
facet_wrap(~TE_type, scales="free_y")+
ggtitle("H3K27ac 1 bp overlap enrichment")+
theme_bw() +
theme(
axis.text.x = element_text(angle = 45, hjust = 1),
strip.text = element_text(face = "bold")
)

| Version | Author | Date |
|---|---|---|
| 967b7a8 | reneeisnowhere | 2025-09-16 |
###
H3K27ac_results_50per <- map_dfr(TE_types, function(te){
map_dfr(cluster_pairs_3, function(pair){
test_pair_TE_generic(H3K27ac_counts_long_50per, te, cluster1 = pair[2], cluster2 = pair[1])
})
})
htmltools::tagList(
htmltools::h3("H3K27ac_results 50% overlap"),
datatable(H3K27ac_results_50per,
options = list(
scrollY = "600px",
scrollCollapse = TRUE,
pageLength = 20,
autoWidth = TRUE
),
rownames = FALSE))
H3K27ac_counts_long_50per %>%
dplyr::filter(cluster != "not_assigned") %>%
ggplot(aes(x=cluster, y=count, fill=status)) +
geom_bar(stat="identity", position = "fill") +
facet_wrap(~TE_type, scales="free_y")+
ggtitle("H3K27ac 50% TE overlap enrichment")+
theme_bw() +
theme(
axis.text.x = element_text(angle = 45, hjust = 1),
strip.text = element_text(face = "bold")
)

| Version | Author | Date |
|---|---|---|
| 967b7a8 | reneeisnowhere | 2025-09-16 |
H3K27me3_results_1bp <- map_dfr(TE_types, function(te){
map_dfr(cluster_pairs_2, function(pair){
test_pair_TE_generic(H3K27me3_counts_long, te, cluster1 = pair[2], cluster2 = pair[1])
})
})
htmltools::tagList(
htmltools::h3("H3K27me3_results_1bp"),
datatable(H3K27me3_results_1bp,options = list(
scrollY = "600px",
scrollCollapse = TRUE,
pageLength = 20,
autoWidth = TRUE
),
rownames = FALSE))
H3K27me3_counts_long %>%
dplyr::filter(cluster != "not_assigned") %>%
ggplot(aes(x=cluster, y=count, fill=status)) +
geom_bar(stat="identity", position = "fill") +
facet_wrap(~TE_type, scales="free_y")+
ggtitle("H3K27me3 1 bp overlap enrichment")+
theme_bw() +
theme(
axis.text.x = element_text(angle = 45, hjust = 1),
strip.text = element_text(face = "bold")
)

| Version | Author | Date |
|---|---|---|
| 967b7a8 | reneeisnowhere | 2025-09-16 |
###
H3K27me3_results_50per <- map_dfr(TE_types, function(te){
map_dfr(cluster_pairs_2, function(pair){
test_pair_TE_generic(H3K27me3_counts_long_50per, te, cluster1 = pair[2], cluster2 = pair[1])
})
})
htmltools::tagList(
htmltools::h3("H3K27me3_results 50% overlap"),
datatable(H3K27me3_results_50per,options = list(
scrollY = "600px",
scrollCollapse = TRUE,
pageLength = 20,
autoWidth = TRUE
),
rownames = FALSE))
H3K27me3_counts_long_50per %>%
dplyr::filter(cluster != "not_assigned") %>%
ggplot(aes(x=cluster, y=count, fill=status)) +
geom_bar(stat="identity", position = "fill") +
facet_wrap(~TE_type, scales="free_y")+
ggtitle("H3K27me3 50% TE overlap enrichment")+
theme_bw() +
theme(
axis.text.x = element_text(angle = 45, hjust = 1),
strip.text = element_text(face = "bold")
)

| Version | Author | Date |
|---|---|---|
| 967b7a8 | reneeisnowhere | 2025-09-16 |
H3K36me3_results_1bp <- map_dfr(TE_types, function(te){
map_dfr(cluster_pairs_2, function(pair){
test_pair_TE_generic(H3K36me3_counts_long, te, cluster1 = pair[2], cluster2 = pair[1])
})
})
htmltools::tagList(
htmltools::h3("H3K36me3_results_1bp"),
datatable(H3K36me3_results_1bp,options = list(
scrollY = "600px",
scrollCollapse = TRUE,
pageLength = 20,
autoWidth = TRUE
),
rownames = FALSE))
H3K36me3_counts_long %>%
dplyr::filter(cluster != "not_assigned") %>%
ggplot(aes(x=cluster, y=count, fill=status)) +
geom_bar(stat="identity", position = "fill") +
facet_wrap(~TE_type, scales="free_y")+
ggtitle("H3K36me3 1 bp overlap enrichment")+
theme_bw() +
theme(
axis.text.x = element_text(angle = 45, hjust = 1),
strip.text = element_text(face = "bold")
)

| Version | Author | Date |
|---|---|---|
| 967b7a8 | reneeisnowhere | 2025-09-16 |
###
H3K36me3_results_50per <- map_dfr(TE_types, function(te){
map_dfr(cluster_pairs_2, function(pair){
test_pair_TE_generic(H3K36me3_counts_long_50per, te, cluster1 = pair[2], cluster2 = pair[1])
})
})
htmltools::tagList(
htmltools::h3("H3K36me3_results 50% overlap"),
datatable(H3K36me3_results_50per,options = list(
scrollY = "600px",
scrollCollapse = TRUE,
pageLength = 20,
autoWidth = TRUE
),
rownames = FALSE))
H3K36me3_counts_long_50per %>%
dplyr::filter(cluster != "not_assigned") %>%
ggplot(aes(x=cluster, y=count, fill=status)) +
geom_bar(stat="identity", position = "fill") +
facet_wrap(~TE_type, scales="free_y")+
ggtitle("H3K36me3 50% TE overlap enrichment")+
theme_bw() +
theme(
axis.text.x = element_text(angle = 45, hjust = 1),
strip.text = element_text(face = "bold")
)

| Version | Author | Date |
|---|---|---|
| 967b7a8 | reneeisnowhere | 2025-09-16 |
H3K9me3_results_1bp <- map_dfr(TE_types, function(te){
map_dfr(cluster_pairs_3, function(pair){
test_pair_TE_generic(H3K9me3_counts_long, te, cluster1 = pair[2], cluster2 = pair[1])
})
})
htmltools::tagList(
htmltools::h3("H3K9me3_results_1bp"),
datatable(H3K9me3_results_1bp,options = list(
scrollY = "600px",
scrollCollapse = TRUE,
pageLength = 20,
autoWidth = TRUE
),
rownames = FALSE))
H3K9me3_counts_long %>%
dplyr::filter(cluster != "not_assigned") %>%
ggplot(aes(x=cluster, y=count, fill=status)) +
geom_bar(stat="identity", position = "fill") +
facet_wrap(~TE_type, scales="free_y")+
ggtitle("H3K9me3 1 bp overlap enrichment")+
theme_bw() +
theme(
axis.text.x = element_text(angle = 45, hjust = 1),
strip.text = element_text(face = "bold")
)

| Version | Author | Date |
|---|---|---|
| 967b7a8 | reneeisnowhere | 2025-09-16 |
###
H3K9me3_results_50per <- map_dfr(TE_types, function(te){
map_dfr(cluster_pairs_3, function(pair){
test_pair_TE_generic(H3K9me3_counts_long_50per, te, cluster1 = pair[2], cluster2 = pair[1])
})
})
htmltools::tagList(
htmltools::h3("H3K9me3_results 50% overlap"),
datatable(H3K9me3_results_50per,options = list(
scrollY = "600px",
scrollCollapse = TRUE,
pageLength = 20,
autoWidth = TRUE
),
rownames = FALSE))
H3K9me3_counts_long_50per %>%
dplyr::filter(cluster != "not_assigned") %>%
ggplot(aes(x=cluster, y=count, fill=status)) +
geom_bar(stat="identity", position = "fill") +
facet_wrap(~TE_type, scales="free_y")+
ggtitle("H3K9me3 50% TE overlap enrichment")+
theme_bw() +
theme(
axis.text.x = element_text(angle = 45, hjust = 1),
strip.text = element_text(face = "bold")
)

| Version | Author | Date |
|---|---|---|
| 967b7a8 | reneeisnowhere | 2025-09-16 |
H3K27ac_ols$all_H3K27ac %>%
left_join(., H3K27ac_lookup) %>%
mutate(cluster = if_else(is.na(cluster), "not_assigned", cluster)) %>%
dplyr::filter(cluster != "not_assigned") %>%
mutate(special_type=case_when(repClass=="SINE"~"SINE",
repClass=="LINE"~"LINE",
repClass=="LTR"~"LTR",
repClass=="DNA"~"DNA",
repClass=="Retroposon"~"SVA",
is.na(repClass)~"not_TE",
TRUE~"Other")) %>%
mutate(per_div=milliDiv/10) %>%
ggplot(aes(x=per_div))+
geom_histogram(aes(fill=cluster, alpha = 0.4) ) +
facet_wrap(~special_type, scales="free_y") +
theme_classic()

H3K27ac_ols$all_H3K27ac %>%
as.data.frame() %>%
left_join(., H3K27ac_lookup) %>%
mutate(cluster = if_else(is.na(cluster), "not_assigned", cluster)) %>%
dplyr::filter(cluster != "not_assigned") %>%
mutate(special_type=case_when(str_detect(repClass,"SINE")~"SINE",
str_detect(repClass, "LINE") ~ "LINE",
str_detect(repClass, "LTR") ~ "LTR",
str_detect(repClass, "DNA") ~ "DNA",
str_detect(repClass, "Retroposon") ~ "SVA",
is.na(repClass)~"not_TE",
TRUE~"Other")) %>%
dplyr::filter(special_type != "not_TE") %>%
mutate(per_div=milliDiv/10) %>%
ggplot(aes(y=per_div,x=special_type))+
geom_boxplot(aes(fill=cluster) ) +
# facet_wrap(~special_type, scales="free_y") +
theme_classic()+
ggtitle("Divergence from overlapping TE H3K27ac 1bp")

H3K27ac_ols$all_H3K27ac %>%
as.data.frame() %>%
dplyr::filter(pct_te_overlap> 50) %>%
left_join(., H3K27ac_lookup) %>%
mutate(cluster = if_else(is.na(cluster), "not_assigned", cluster)) %>%
dplyr::filter(cluster != "not_assigned") %>%
mutate(special_type=case_when(str_detect(repClass,"SINE")~"SINE",
str_detect(repClass, "LINE") ~ "LINE",
str_detect(repClass, "LTR") ~ "LTR",
str_detect(repClass, "DNA") ~ "DNA",
str_detect(repClass, "Retroposon") ~ "SVA",
is.na(repClass)~"not_TE",
TRUE~"Other")) %>%
dplyr::filter(special_type != "not_TE") %>%
mutate(per_div=milliDiv/10) %>%
ggplot(aes(y=per_div,x=special_type, fill=cluster))+
# geom_boxplot(aes(fill=cluster) ) +
geom_boxplot()+
stat_compare_means(
aes(group = cluster), # tell ggpubr we want to compare clusters
method = "wilcox.test", # same test as before
comparisons = list(c("Set1", "Set2"), c("Set1", "Set3")),
label = "p.signif", # show stars instead of p-values
hide.ns = TRUE # hide non-significant comparisons
) +
theme_classic()+
# facet_wrap(~ special_type, scales = "free_y") +
ggtitle("Divergence from overlapping TE H3K27ac 50% overlap")

### try to calculate wilcox.test
comparisons <- list(
c("Set_1", "Set_2"),
c("Set_1", "Set_3")
)
#
H3K27ac_wilcox_results <- H3K27ac_ols$all_H3K27ac %>%
as.data.frame() %>%
left_join(., H3K27ac_lookup) %>%
mutate(cluster = if_else(is.na(cluster), "not_assigned", cluster)) %>%
dplyr::filter(cluster != "not_assigned") %>%
mutate(special_type=case_when(str_detect(repClass,"SINE")~"SINE",
str_detect(repClass, "LINE") ~ "LINE",
str_detect(repClass, "LTR") ~ "LTR",
str_detect(repClass, "DNA") ~ "DNA",
str_detect(repClass, "Retroposon") ~ "SVA",
is.na(repClass)~"not_TE",
TRUE~"Other")) %>%
dplyr::filter(special_type != "not_TE") %>%
mutate(per_div=milliDiv/10) %>%
filter(cluster %in% c("Set_1", "Set_2", "Set_3")) %>%
group_by(special_type) %>%
group_map(~ {
stype <- unique(.x$special_type) # capture the TE type
map_dfr(comparisons, function(comp) {
dat <- .x %>% filter(cluster %in% comp)
test <- wilcox.test(per_div ~ cluster, data = dat)
tibble(
special_type = stype,
comparison = paste(comp, collapse = "_vs_"),
p.value = test$p.value,
W = test$statistic,
median_Set1 = median(dat$per_div[dat$cluster == "Set_1"]),
median_other = median(dat$per_div[dat$cluster == comp[2]])
)
})
}, .keep = TRUE) %>%
bind_rows()
H3K27ac_wilcox_results_per50 <- H3K27ac_ols$all_H3K27ac %>%
as.data.frame() %>%
dplyr::filter(pct_te_overlap> 50) %>%
left_join(., H3K27ac_lookup) %>%
mutate(cluster = if_else(is.na(cluster), "not_assigned", cluster)) %>%
dplyr::filter(cluster != "not_assigned") %>%
mutate(special_type=case_when(str_detect(repClass,"SINE")~"SINE",
str_detect(repClass, "LINE") ~ "LINE",
str_detect(repClass, "LTR") ~ "LTR",
str_detect(repClass, "DNA") ~ "DNA",
str_detect(repClass, "Retroposon") ~ "SVA",
is.na(repClass)~"not_TE",
TRUE~"Other")) %>%
dplyr::filter(special_type != "not_TE") %>%
mutate(per_div=milliDiv/10) %>%
filter(cluster %in% c("Set_1", "Set_2", "Set_3")) %>%
group_by(special_type) %>%
group_map(~ {
stype <- unique(.x$special_type) # capture the TE type
map_dfr(comparisons, function(comp) {
dat <- .x %>% filter(cluster %in% comp)
test <- wilcox.test(per_div ~ cluster, data = dat)
tibble(
special_type = stype,
comparison = paste(comp, collapse = "_vs_"),
p.value = test$p.value,
W = test$statistic,
median_Set1 = median(dat$per_div[dat$cluster == "Set_1"]),
median_other = median(dat$per_div[dat$cluster == comp[2]])
)
})
}, .keep = TRUE) %>%
bind_rows()
H3K27me3_ols$all_H3K27me3 %>%
left_join(., H3K27me3_lookup) %>%
mutate(cluster = if_else(is.na(cluster), "not_assigned", cluster)) %>%
dplyr::filter(cluster != "not_assigned") %>%
mutate(special_type=case_when(repClass=="SINE"~"SINE",
repClass=="LINE"~"LINE",
repClass=="LTR"~"LTR",
repClass=="DNA"~"DNA",
repClass=="Retroposon"~"SVA",
is.na(repClass)~"not_TE",
TRUE~"Other")) %>%
mutate(per_div=milliDiv/10) %>%
ggplot(aes(x=per_div))+
geom_histogram(aes(fill=cluster, alpha = 0.4) ) +
facet_wrap(~special_type, scales="free_y") +
theme_classic()

H3K27me3_ols$all_H3K27me3 %>%
as.data.frame() %>%
left_join(., H3K27me3_lookup) %>%
mutate(cluster = if_else(is.na(cluster), "not_assigned", cluster)) %>%
dplyr::filter(cluster != "not_assigned") %>%
mutate(special_type=case_when(str_detect(repClass,"SINE")~"SINE",
str_detect(repClass, "LINE") ~ "LINE",
str_detect(repClass, "LTR") ~ "LTR",
str_detect(repClass, "DNA") ~ "DNA",
str_detect(repClass, "Retroposon") ~ "SVA",
is.na(repClass)~"not_TE",
TRUE~"Other")) %>%
dplyr::filter(special_type != "not_TE") %>%
mutate(per_div=milliDiv/10) %>%
ggplot(aes(y=per_div,x=special_type))+
geom_boxplot(aes(fill=cluster) ) +
# facet_wrap(~special_type, scales="free_y") +
theme_classic()+
ggtitle("Divergence from overlapping TE H3K27me3 1bp")

H3K27me3_ols$all_H3K27me3 %>%
as.data.frame() %>%
dplyr::filter(pct_te_overlap> 50) %>%
left_join(., H3K27me3_lookup) %>%
mutate(cluster = if_else(is.na(cluster), "not_assigned", cluster)) %>%
dplyr::filter(cluster != "not_assigned") %>%
mutate(special_type=case_when(str_detect(repClass,"SINE")~"SINE",
str_detect(repClass, "LINE") ~ "LINE",
str_detect(repClass, "LTR") ~ "LTR",
str_detect(repClass, "DNA") ~ "DNA",
str_detect(repClass, "Retroposon") ~ "SVA",
is.na(repClass)~"not_TE",
TRUE~"Other")) %>%
dplyr::filter(special_type != "not_TE") %>%
mutate(per_div=milliDiv/10) %>%
ggplot(aes(y=per_div,x=special_type))+
geom_boxplot(aes(fill=cluster) ) +
# facet_wrap(~special_type, scales="free_y") +
theme_classic()+
ggtitle("Divergence from overlapping TE H3K27me3 50% overlap")

H3K27me3_wilcox_results_per50 <- H3K27me3_ols$all_H3K27me3 %>%
as.data.frame() %>%
dplyr::filter(pct_te_overlap> 50) %>%
left_join(., H3K27me3_lookup) %>%
mutate(cluster = if_else(is.na(cluster), "not_assigned", cluster)) %>%
dplyr::filter(cluster != "not_assigned") %>%
mutate(special_type=case_when(str_detect(repClass,"SINE")~"SINE",
str_detect(repClass, "LINE") ~ "LINE",
str_detect(repClass, "LTR") ~ "LTR",
str_detect(repClass, "DNA") ~ "DNA",
str_detect(repClass, "Retroposon") ~ "SVA",
is.na(repClass)~"not_TE",
TRUE~"Other")) %>%
dplyr::filter(special_type != "not_TE") %>%
mutate(per_div=milliDiv/10) %>%
filter(cluster %in% c("Set_1", "Set_2")) %>%
group_by(special_type) %>%
group_map(~ {
stype <- unique(.x$special_type) # capture the TE type
map_dfr(comparisons, function(comp) {
dat <- .x %>% filter(cluster %in% comp)
# Skip if both clusters are not present
if(length(unique(dat$cluster)) < 2) return(NULL)
test <- wilcox.test(per_div ~ cluster, data = dat)
tibble(
special_type = stype,
comparison = paste(comp, collapse = "_vs_"),
p.value = test$p.value,
W = test$statistic,
median_Set1 = median(dat$per_div[dat$cluster == "Set_1"]),
median_other = median(dat$per_div[dat$cluster == comp[2]])
)
})
}, .keep = TRUE) %>%
bind_rows()
H3K36me3_ols$all_H3K36me3 %>%
left_join(., H3K36me3_lookup) %>%
mutate(cluster = if_else(is.na(cluster), "not_assigned", cluster)) %>%
dplyr::filter(cluster != "not_assigned") %>%
mutate(special_type=case_when(repClass=="SINE"~"SINE",
repClass=="LINE"~"LINE",
repClass=="LTR"~"LTR",
repClass=="DNA"~"DNA",
repClass=="Retroposon"~"SVA",
is.na(repClass)~"not_TE",
TRUE~"Other")) %>%
mutate(per_div=milliDiv/10) %>%
ggplot(aes(x=per_div))+
geom_histogram(aes(fill=cluster, alpha = 0.4) ) +
facet_wrap(~special_type, scales="free_y") +
theme_classic()

H3K36me3_ols$all_H3K36me3 %>%
as.data.frame() %>%
left_join(., H3K36me3_lookup) %>%
mutate(cluster = if_else(is.na(cluster), "not_assigned", cluster)) %>%
dplyr::filter(cluster != "not_assigned") %>%
mutate(special_type=case_when(str_detect(repClass,"SINE")~"SINE",
str_detect(repClass, "LINE") ~ "LINE",
str_detect(repClass, "LTR") ~ "LTR",
str_detect(repClass, "DNA") ~ "DNA",
str_detect(repClass, "Retroposon") ~ "SVA",
is.na(repClass)~"not_TE",
TRUE~"Other")) %>%
dplyr::filter(special_type != "not_TE") %>%
mutate(per_div=milliDiv/10) %>%
ggplot(aes(y=per_div,x=special_type))+
geom_boxplot(aes(fill=cluster) ) +
# facet_wrap(~special_type, scales="free_y") +
theme_classic()+
ggtitle("Divergence from overlapping TE H3K36me3 1bp")

H3K36me3_ols$all_H3K36me3 %>%
as.data.frame() %>%
dplyr::filter(pct_te_overlap> 50) %>%
left_join(., H3K36me3_lookup) %>%
mutate(cluster = if_else(is.na(cluster), "not_assigned", cluster)) %>%
dplyr::filter(cluster != "not_assigned") %>%
mutate(special_type=case_when(str_detect(repClass,"SINE")~"SINE",
str_detect(repClass, "LINE") ~ "LINE",
str_detect(repClass, "LTR") ~ "LTR",
str_detect(repClass, "DNA") ~ "DNA",
str_detect(repClass, "Retroposon") ~ "SVA",
is.na(repClass)~"not_TE",
TRUE~"Other")) %>%
dplyr::filter(special_type != "not_TE") %>%
mutate(per_div=milliDiv/10) %>%
ggplot(aes(y=per_div,x=special_type))+
geom_boxplot(aes(fill=cluster) ) +
# facet_wrap(~special_type, scales="free_y") +
theme_classic()+
ggtitle("Divergence from overlapping TE H3K36me3 50% overlap")

H3K36me3_ols$all_H3K36me3 %>%
as.data.frame() %>%
dplyr::filter(pct_te_overlap> 50) %>%
left_join(., H3K36me3_lookup) %>%
mutate(cluster = if_else(is.na(cluster), "not_assigned", cluster)) %>%
dplyr::filter(cluster != "not_assigned") %>%
mutate(special_type=case_when(str_detect(repClass,"SINE")~"SINE",
str_detect(repClass, "LINE") ~ "LINE",
str_detect(repClass, "LTR") ~ "LTR",
str_detect(repClass, "DNA") ~ "DNA",
str_detect(repClass, "Retroposon") ~ "SVA",
is.na(repClass)~"not_TE",
TRUE~"Other")) %>%
dplyr::filter(special_type != "not_TE") %>%
mutate(per_div=milliDiv/10) %>%
ggplot(aes(y=per_div,x=special_type))+
geom_boxplot(aes(fill=cluster) ) +
# facet_wrap(~special_type, scales="free_y") +
theme_classic()+
ggtitle("Divergence from overlapping TE H3K36me3 50% overlap")

H3K36me3_wilcox_results_per50 <- H3K36me3_ols$all_H3K36me3 %>%
as.data.frame() %>%
dplyr::filter(pct_te_overlap> 50) %>%
left_join(., H3K36me3_lookup) %>%
mutate(cluster = if_else(is.na(cluster), "not_assigned", cluster)) %>%
dplyr::filter(cluster != "not_assigned") %>%
mutate(special_type=case_when(str_detect(repClass,"SINE")~"SINE",
str_detect(repClass, "LINE") ~ "LINE",
str_detect(repClass, "LTR") ~ "LTR",
str_detect(repClass, "DNA") ~ "DNA",
str_detect(repClass, "Retroposon") ~ "SVA",
is.na(repClass)~"not_TE",
TRUE~"Other")) %>%
dplyr::filter(special_type != "not_TE") %>%
mutate(per_div=milliDiv/10) %>%
filter(cluster %in% c("Set_1", "Set_2")) %>%
group_by(special_type) %>%
group_map(~ {
stype <- unique(.x$special_type) # capture the TE type
map_dfr(comparisons, function(comp) {
dat <- .x %>% filter(cluster %in% comp)
# Skip if both clusters are not present
if(length(unique(dat$cluster)) < 2) return(NULL)
test <- wilcox.test(per_div ~ cluster, data = dat)
tibble(
special_type = stype,
comparison = paste(comp, collapse = "_vs_"),
p.value = test$p.value,
W = test$statistic,
median_Set1 = median(dat$per_div[dat$cluster == "Set_1"]),
median_other = median(dat$per_div[dat$cluster == comp[2]])
)
})
}, .keep = TRUE) %>%
bind_rows()
H3K9me3_ols$all_H3K9me3 %>%
left_join(., H3K9me3_lookup) %>%
mutate(cluster = if_else(is.na(cluster), "not_assigned", cluster)) %>%
mutate(special_type=case_when(repClass=="SINE"~"SINE",
repClass=="LINE"~"LINE",
repClass=="LTR"~"LTR",
repClass=="DNA"~"DNA",
repClass=="Retroposon"~"SVA",
is.na(repClass)~"not_TE",
TRUE~"Other")) %>%
mutate(per_div=milliDiv/10) %>%
ggplot(aes(x=per_div))+
geom_histogram(aes(fill=cluster, alpha = 0.4) ) +
facet_wrap(~special_type, scales="free_y") +
theme_classic()

H3K9me3_ols$all_H3K9me3 %>%
as.data.frame() %>%
left_join(., H3K9me3_lookup) %>%
mutate(cluster = if_else(is.na(cluster), "not_assigned", cluster)) %>%
dplyr::filter(cluster != "not_assigned") %>%
mutate(special_type=case_when(str_detect(repClass,"SINE")~"SINE",
str_detect(repClass, "LINE") ~ "LINE",
str_detect(repClass, "LTR") ~ "LTR",
str_detect(repClass, "DNA") ~ "DNA",
str_detect(repClass, "Retroposon") ~ "SVA",
is.na(repClass)~"not_TE",
TRUE~"Other")) %>%
dplyr::filter(special_type != "not_TE") %>%
mutate(per_div=milliDiv/10) %>%
ggplot(aes(y=per_div,x=special_type))+
geom_boxplot(aes(fill=cluster) ) +
# facet_wrap(~special_type, scales="free_y") +
theme_classic()+
ggtitle("Divergence from overlapping TE H3K9me3 1bp")

H3K9me3_ols$all_H3K9me3 %>%
as.data.frame() %>%
dplyr::filter(pct_te_overlap> 50) %>%
left_join(., H3K9me3_lookup) %>%
mutate(cluster = if_else(is.na(cluster), "not_assigned", cluster)) %>%
dplyr::filter(cluster != "not_assigned") %>%
mutate(special_type=case_when(str_detect(repClass,"SINE")~"SINE",
str_detect(repClass, "LINE") ~ "LINE",
str_detect(repClass, "LTR") ~ "LTR",
str_detect(repClass, "DNA") ~ "DNA",
str_detect(repClass, "Retroposon") ~ "SVA",
is.na(repClass)~"not_TE",
TRUE~"Other")) %>%
dplyr::filter(special_type != "not_TE") %>%
mutate(per_div=milliDiv/10) %>%
ggplot(aes(y=per_div,x=special_type))+
geom_boxplot(aes(fill=cluster) ) +
# facet_wrap(~special_type, scales="free_y") +
theme_classic()+
ggtitle("Divergence from overlapping TE H3K9me3 50% overlap")

H3K9me3_wilcox_results <- H3K9me3_ols$all_H3K9me3 %>%
as.data.frame() %>%
left_join(., H3K9me3_lookup) %>%
mutate(cluster = if_else(is.na(cluster), "not_assigned", cluster)) %>%
dplyr::filter(cluster != "not_assigned") %>%
mutate(special_type=case_when(str_detect(repClass,"SINE")~"SINE",
str_detect(repClass, "LINE") ~ "LINE",
str_detect(repClass, "LTR") ~ "LTR",
str_detect(repClass, "DNA") ~ "DNA",
str_detect(repClass, "Retroposon") ~ "SVA",
is.na(repClass)~"not_TE",
TRUE~"Other")) %>%
dplyr::filter(special_type != "not_TE") %>%
mutate(per_div=milliDiv/10) %>%
filter(cluster %in% c("Set_1", "Set_2")) %>%
group_by(special_type) %>%
group_map(~ {
stype <- unique(.x$special_type) # capture the TE type
map_dfr(comparisons, function(comp) {
dat <- .x %>% filter(cluster %in% comp)
# Skip if both clusters are not present
if(length(unique(dat$cluster)) < 2) return(NULL)
test <- wilcox.test(per_div ~ cluster, data = dat)
tibble(
special_type = stype,
comparison = paste(comp, collapse = "_vs_"),
p.value = test$p.value,
W = test$statistic,
median_Set1 = median(dat$per_div[dat$cluster == "Set_1"]),
median_other = median(dat$per_div[dat$cluster == comp[2]])
)
})
}, .keep = TRUE) %>%
bind_rows()
H3K9me3_wilcox_results_per50 <- H3K9me3_ols$all_H3K9me3 %>%
as.data.frame() %>%
dplyr::filter(pct_te_overlap> 50) %>%
left_join(., H3K9me3_lookup) %>%
mutate(cluster = if_else(is.na(cluster), "not_assigned", cluster)) %>%
dplyr::filter(cluster != "not_assigned") %>%
mutate(special_type=case_when(str_detect(repClass,"SINE")~"SINE",
str_detect(repClass, "LINE") ~ "LINE",
str_detect(repClass, "LTR") ~ "LTR",
str_detect(repClass, "DNA") ~ "DNA",
str_detect(repClass, "Retroposon") ~ "SVA",
is.na(repClass)~"not_TE",
TRUE~"Other")) %>%
dplyr::filter(special_type != "not_TE") %>%
mutate(per_div=milliDiv/10) %>%
filter(cluster %in% c("Set_1", "Set_2")) %>%
group_by(special_type) %>%
group_map(~ {
stype <- unique(.x$special_type) # capture the TE type
map_dfr(comparisons, function(comp) {
dat <- .x %>% filter(cluster %in% comp)
# Skip if both clusters are not present
if(length(unique(dat$cluster)) < 2) return(NULL)
test <- wilcox.test(per_div ~ cluster, data = dat)
tibble(
special_type = stype,
comparison = paste(comp, collapse = "_vs_"),
p.value = test$p.value,
W = test$statistic,
median_Set1 = median(dat$per_div[dat$cluster == "Set_1"]),
median_other = median(dat$per_div[dat$cluster == comp[2]])
)
})
}, .keep = TRUE) %>%
bind_rows()
H3K27ac_repclass_sum <- H3K27ac_ols$all_H3K27ac %>%
left_join(., H3K27ac_lookup) %>%
mutate(cluster = if_else(is.na(cluster), "not_assigned", cluster)) %>%
mutate(special_type=case_when(repClass=="SINE"~"SINE",
repClass=="LINE"~"LINE",
repClass=="LTR"~"LTR",
repClass=="DNA"~"DNA",
repClass=="Retroposon"~"SVA",
is.na(repClass)~"not_TE",
TRUE~"Other"))%>%
dplyr::filter(cluster != "not_assigned") %>%
group_by(Peakid, special_type) %>%
summarize(width= min(width),
min_te_len=min(te_len),
min_roi_len=min(roi_len),
min_overlap=min(overlap_bp),
min_dist_TSS=min(distanceToTSS),
max_dist_TSS=max(distanceToTSS),
cluster=paste0(unique(cluster), collapse = ":"))
H3K27ac_repclass_sum %>%
ggplot(aes(x=special_type,y=min_te_len))+
geom_boxplot() +
facet_wrap(~cluster)+
theme_classic()+
ggtitle("Overlapping minimum TE length per class in H3K27ac")

H3K27ac_sets_gr$all_H3K27ac %>%
as.data.frame() %>%
left_join(., H3K27ac_lookup) %>%
mutate(cluster = if_else(is.na(cluster), "not_assigned", cluster)) %>%
ggplot(aes(x=cluster, y=width))+
geom_boxplot()+
theme_classic()+
ggtitle("Region of interest length in H3K27ac")

H3K27ac_repclass_sum %>%
ggplot(aes(x=cluster,y=width))+
geom_boxplot() +
facet_wrap(~special_type)+
theme_classic()+
ggtitle("Minimum overlapping TE & ROI width length per class in H3K27ac")

H3K27ac_repclass_sum %>%
ggplot(aes(x=special_type,y=min_dist_TSS))+
geom_boxplot() +
facet_wrap(~cluster)+
theme_classic()+
ggtitle("Distance to TSS per class in H3K27ac of overlapping ROIs")+
coord_cartesian(ylim=c(-35000,35000))

H3K27me3_repclass_sum <- H3K27me3_ols$all_H3K27me3 %>%
left_join(., H3K27me3_lookup) %>%
mutate(cluster = if_else(is.na(cluster), "not_assigned", cluster)) %>%
mutate(special_type=case_when(repClass=="SINE"~"SINE",
repClass=="LINE"~"LINE",
repClass=="LTR"~"LTR",
repClass=="DNA"~"DNA",
repClass=="Retroposon"~"SVA",
is.na(repClass)~"not_TE",
TRUE~"Other"))%>%
dplyr::filter(cluster != "not_assigned") %>%
group_by(Peakid, special_type) %>%
summarize(width= min(width),
min_te_len=min(te_len),
min_roi_len=min(roi_len),
min_overlap=min(overlap_bp),
min_dist_TSS=min(distanceToTSS),
max_dist_TSS=max(distanceToTSS),
cluster=paste0(unique(cluster), collapse = ":"))
H3K27me3_repclass_sum %>%
ggplot(aes(x=special_type,y=min_te_len))+
geom_boxplot() +
facet_wrap(~cluster)+
theme_classic()+
ggtitle("Overlapping minimum TE length per class in H3K27me3")

H3K27me3_sets_gr$all_H3K27me3 %>%
as.data.frame() %>%
left_join(., H3K27me3_lookup) %>%
mutate(cluster = if_else(is.na(cluster), "not_assigned", cluster)) %>%
ggplot(aes(x=cluster, y=width))+
geom_boxplot()+
theme_classic()+
ggtitle("Region of interest length in H3K27me3")

H3K27me3_repclass_sum %>%
ggplot(aes(x=special_type,y=width))+
geom_boxplot() +
facet_wrap(~cluster)+
theme_classic()+
ggtitle("Minimum overlapping TE & ROI width length per class in H3K27me3")

H3K27me3_repclass_sum %>%
ggplot(aes(x=special_type,y=min_dist_TSS))+
geom_boxplot() +
facet_wrap(~cluster)+
theme_classic()+
ggtitle("Distance to TSS per class in H3K27me3 of overlapping ROIs")

# H3K27me3_sets_gr$all_H3K27me3 %>%
# as.data.frame() %>%
# dplyr::filter(width>60000)
#
H3K27me3_ols$all_H3K27me3 %>%
as.data.frame() %>%
# mutate(medi=median(roi_len))
dplyr::filter(roi_len>60000)
seqnames start end width strand Peakid
1 chr17 48584907 48584961 55 + chr17:48584861-48649905
2 chr17 48586185 48586249 65 + chr17:48584861-48649905
3 chr17 48586438 48586471 34 + chr17:48584861-48649905
4 chr17 48587783 48587832 50 + chr17:48584861-48649905
5 chr17 48590814 48590833 20 + chr17:48584861-48649905
6 chr17 48591679 48591702 24 + chr17:48584861-48649905
7 chr17 48591989 48592024 36 + chr17:48584861-48649905
8 chr17 48593811 48593839 29 + chr17:48584861-48649905
9 chr17 48595767 48595804 38 + chr17:48584861-48649905
10 chr17 48595805 48595831 27 + chr17:48584861-48649905
11 chr17 48600992 48601039 48 + chr17:48584861-48649905
12 chr17 48602377 48602409 33 + chr17:48584861-48649905
13 chr17 48603475 48603517 43 + chr17:48584861-48649905
14 chr17 48604820 48604896 77 + chr17:48584861-48649905
15 chr17 48605180 48605207 28 + chr17:48584861-48649905
16 chr17 48606351 48606392 42 + chr17:48584861-48649905
17 chr17 48606816 48606859 44 + chr17:48584861-48649905
18 chr17 48607704 48607742 39 + chr17:48584861-48649905
19 chr17 48607793 48607859 67 + chr17:48584861-48649905
20 chr17 48608141 48608452 312 + chr17:48584861-48649905
21 chr17 48611613 48611651 39 + chr17:48584861-48649905
22 chr17 48611652 48611670 19 + chr17:48584861-48649905
23 chr17 48612063 48612097 35 + chr17:48584861-48649905
24 chr17 48612615 48612756 142 + chr17:48584861-48649905
25 chr17 48613536 48613561 26 + chr17:48584861-48649905
26 chr17 48614113 48614148 36 + chr17:48584861-48649905
27 chr17 48614720 48614740 21 + chr17:48584861-48649905
28 chr17 48614758 48614797 40 + chr17:48584861-48649905
29 chr17 48614849 48614884 36 + chr17:48584861-48649905
30 chr17 48614991 48615014 24 + chr17:48584861-48649905
31 chr17 48615214 48615385 172 + chr17:48584861-48649905
32 chr17 48616131 48616167 37 + chr17:48584861-48649905
33 chr17 48617534 48617551 18 + chr17:48584861-48649905
34 chr17 48618425 48618448 24 + chr17:48584861-48649905
35 chr17 48619224 48619270 47 + chr17:48584861-48649905
36 chr17 48625316 48625427 112 + chr17:48584861-48649905
37 chr17 48625916 48625976 61 + chr17:48584861-48649905
38 chr17 48626377 48626411 35 + chr17:48584861-48649905
39 chr17 48626805 48626842 38 + chr17:48584861-48649905
40 chr17 48627335 48627366 32 + chr17:48584861-48649905
41 chr17 48627793 48627835 43 + chr17:48584861-48649905
42 chr17 48633058 48633110 53 + chr17:48584861-48649905
43 chr17 48633497 48633575 79 + chr17:48584861-48649905
44 chr17 48634188 48634487 300 - chr17:48584861-48649905
45 chr17 48634496 48634548 53 + chr17:48584861-48649905
46 chr17 48635094 48635148 55 + chr17:48584861-48649905
47 chr17 48635771 48635920 150 - chr17:48584861-48649905
48 chr17 48636528 48636553 26 + chr17:48584861-48649905
49 chr17 48636898 48637099 202 - chr17:48584861-48649905
50 chr17 48638058 48638085 28 + chr17:48584861-48649905
51 chr17 48638401 48638418 18 + chr17:48584861-48649905
52 chr17 48638418 48638446 29 + chr17:48584861-48649905
53 chr17 48638466 48638613 148 + chr17:48584861-48649905
54 chr17 48639192 48639269 78 - chr17:48584861-48649905
55 chr17 48640842 48640983 142 + chr17:48584861-48649905
56 chr17 48641720 48641756 37 + chr17:48584861-48649905
57 chr17 48647022 48647169 148 + chr17:48584861-48649905
58 chr17 48647332 48647388 57 + chr17:48584861-48649905
59 chr17 48647999 48648087 89 + chr17:48584861-48649905
60 chr17 48648093 48648256 164 - chr17:48584861-48649905
61 chr17 48648287 48648578 292 - chr17:48584861-48649905
62 chr17 48648759 48649063 305 - chr17:48584861-48649905
63 chr17 48649065 48649207 143 - chr17:48584861-48649905
64 chr17 48649242 48649905 664 - chr17:48584861-48649905
annotation geneChr geneStart geneEnd geneLength geneStrand geneId
1 Promoter (<=1kb) 17 48628675 48634932 6258 1 100874351
2 Promoter (<=1kb) 17 48628675 48634932 6258 1 100874351
3 Promoter (<=1kb) 17 48628675 48634932 6258 1 100874351
4 Promoter (<=1kb) 17 48628675 48634932 6258 1 100874351
5 Promoter (<=1kb) 17 48628675 48634932 6258 1 100874351
6 Promoter (<=1kb) 17 48628675 48634932 6258 1 100874351
7 Promoter (<=1kb) 17 48628675 48634932 6258 1 100874351
8 Promoter (<=1kb) 17 48628675 48634932 6258 1 100874351
9 Promoter (<=1kb) 17 48628675 48634932 6258 1 100874351
10 Promoter (<=1kb) 17 48628675 48634932 6258 1 100874351
11 Promoter (<=1kb) 17 48628675 48634932 6258 1 100874351
12 Promoter (<=1kb) 17 48628675 48634932 6258 1 100874351
13 Promoter (<=1kb) 17 48628675 48634932 6258 1 100874351
14 Promoter (<=1kb) 17 48628675 48634932 6258 1 100874351
15 Promoter (<=1kb) 17 48628675 48634932 6258 1 100874351
16 Promoter (<=1kb) 17 48628675 48634932 6258 1 100874351
17 Promoter (<=1kb) 17 48628675 48634932 6258 1 100874351
18 Promoter (<=1kb) 17 48628675 48634932 6258 1 100874351
19 Promoter (<=1kb) 17 48628675 48634932 6258 1 100874351
20 Promoter (<=1kb) 17 48628675 48634932 6258 1 100874351
21 Promoter (<=1kb) 17 48628675 48634932 6258 1 100874351
22 Promoter (<=1kb) 17 48628675 48634932 6258 1 100874351
23 Promoter (<=1kb) 17 48628675 48634932 6258 1 100874351
24 Promoter (<=1kb) 17 48628675 48634932 6258 1 100874351
25 Promoter (<=1kb) 17 48628675 48634932 6258 1 100874351
26 Promoter (<=1kb) 17 48628675 48634932 6258 1 100874351
27 Promoter (<=1kb) 17 48628675 48634932 6258 1 100874351
28 Promoter (<=1kb) 17 48628675 48634932 6258 1 100874351
29 Promoter (<=1kb) 17 48628675 48634932 6258 1 100874351
30 Promoter (<=1kb) 17 48628675 48634932 6258 1 100874351
31 Promoter (<=1kb) 17 48628675 48634932 6258 1 100874351
32 Promoter (<=1kb) 17 48628675 48634932 6258 1 100874351
33 Promoter (<=1kb) 17 48628675 48634932 6258 1 100874351
34 Promoter (<=1kb) 17 48628675 48634932 6258 1 100874351
35 Promoter (<=1kb) 17 48628675 48634932 6258 1 100874351
36 Promoter (<=1kb) 17 48628675 48634932 6258 1 100874351
37 Promoter (<=1kb) 17 48628675 48634932 6258 1 100874351
38 Promoter (<=1kb) 17 48628675 48634932 6258 1 100874351
39 Promoter (<=1kb) 17 48628675 48634932 6258 1 100874351
40 Promoter (<=1kb) 17 48628675 48634932 6258 1 100874351
41 Promoter (<=1kb) 17 48628675 48634932 6258 1 100874351
42 Promoter (<=1kb) 17 48628675 48634932 6258 1 100874351
43 Promoter (<=1kb) 17 48628675 48634932 6258 1 100874351
44 Promoter (<=1kb) 17 48628675 48634932 6258 1 100874351
45 Promoter (<=1kb) 17 48628675 48634932 6258 1 100874351
46 Promoter (<=1kb) 17 48628675 48634932 6258 1 100874351
47 Promoter (<=1kb) 17 48628675 48634932 6258 1 100874351
48 Promoter (<=1kb) 17 48628675 48634932 6258 1 100874351
49 Promoter (<=1kb) 17 48628675 48634932 6258 1 100874351
50 Promoter (<=1kb) 17 48628675 48634932 6258 1 100874351
51 Promoter (<=1kb) 17 48628675 48634932 6258 1 100874351
52 Promoter (<=1kb) 17 48628675 48634932 6258 1 100874351
53 Promoter (<=1kb) 17 48628675 48634932 6258 1 100874351
54 Promoter (<=1kb) 17 48628675 48634932 6258 1 100874351
55 Promoter (<=1kb) 17 48628675 48634932 6258 1 100874351
56 Promoter (<=1kb) 17 48628675 48634932 6258 1 100874351
57 Promoter (<=1kb) 17 48628675 48634932 6258 1 100874351
58 Promoter (<=1kb) 17 48628675 48634932 6258 1 100874351
59 Promoter (<=1kb) 17 48628675 48634932 6258 1 100874351
60 Promoter (<=1kb) 17 48628675 48634932 6258 1 100874351
61 Promoter (<=1kb) 17 48628675 48634932 6258 1 100874351
62 Promoter (<=1kb) 17 48628675 48634932 6258 1 100874351
63 Promoter (<=1kb) 17 48628675 48634932 6258 1 100874351
64 Promoter (<=1kb) 17 48628675 48634932 6258 1 100874351
transcriptId distanceToTSS hit overlap_bp roi_len pct_roi_overlap
1 ENST00000480386.1 0 TRUE 55 65045 0.08455685
2 ENST00000480386.1 0 TRUE 65 65045 0.09993082
3 ENST00000480386.1 0 TRUE 34 65045 0.05227150
4 ENST00000480386.1 0 TRUE 50 65045 0.07686986
5 ENST00000480386.1 0 TRUE 20 65045 0.03074794
6 ENST00000480386.1 0 TRUE 24 65045 0.03689753
7 ENST00000480386.1 0 TRUE 36 65045 0.05534630
8 ENST00000480386.1 0 TRUE 29 65045 0.04458452
9 ENST00000480386.1 0 TRUE 38 65045 0.05842109
10 ENST00000480386.1 0 TRUE 27 65045 0.04150972
11 ENST00000480386.1 0 TRUE 48 65045 0.07379506
12 ENST00000480386.1 0 TRUE 33 65045 0.05073411
13 ENST00000480386.1 0 TRUE 43 65045 0.06610808
14 ENST00000480386.1 0 TRUE 77 65045 0.11837958
15 ENST00000480386.1 0 TRUE 28 65045 0.04304712
16 ENST00000480386.1 0 TRUE 42 65045 0.06457068
17 ENST00000480386.1 0 TRUE 44 65045 0.06764548
18 ENST00000480386.1 0 TRUE 39 65045 0.05995849
19 ENST00000480386.1 0 TRUE 67 65045 0.10300561
20 ENST00000480386.1 0 TRUE 312 65045 0.47966792
21 ENST00000480386.1 0 TRUE 39 65045 0.05995849
22 ENST00000480386.1 0 TRUE 19 65045 0.02921055
23 ENST00000480386.1 0 TRUE 35 65045 0.05380890
24 ENST00000480386.1 0 TRUE 142 65045 0.21831040
25 ENST00000480386.1 0 TRUE 26 65045 0.03997233
26 ENST00000480386.1 0 TRUE 36 65045 0.05534630
27 ENST00000480386.1 0 TRUE 21 65045 0.03228534
28 ENST00000480386.1 0 TRUE 40 65045 0.06149589
29 ENST00000480386.1 0 TRUE 36 65045 0.05534630
30 ENST00000480386.1 0 TRUE 24 65045 0.03689753
31 ENST00000480386.1 0 TRUE 172 65045 0.26443232
32 ENST00000480386.1 0 TRUE 37 65045 0.05688370
33 ENST00000480386.1 0 TRUE 18 65045 0.02767315
34 ENST00000480386.1 0 TRUE 24 65045 0.03689753
35 ENST00000480386.1 0 TRUE 47 65045 0.07225767
36 ENST00000480386.1 0 TRUE 112 65045 0.17218848
37 ENST00000480386.1 0 TRUE 61 65045 0.09378123
38 ENST00000480386.1 0 TRUE 35 65045 0.05380890
39 ENST00000480386.1 0 TRUE 38 65045 0.05842109
40 ENST00000480386.1 0 TRUE 32 65045 0.04919671
41 ENST00000480386.1 0 TRUE 43 65045 0.06610808
42 ENST00000480386.1 0 TRUE 53 65045 0.08148205
43 ENST00000480386.1 0 TRUE 79 65045 0.12145438
44 ENST00000480386.1 0 TRUE 300 65045 0.46121916
45 ENST00000480386.1 0 TRUE 53 65045 0.08148205
46 ENST00000480386.1 0 TRUE 55 65045 0.08455685
47 ENST00000480386.1 0 TRUE 150 65045 0.23060958
48 ENST00000480386.1 0 TRUE 26 65045 0.03997233
49 ENST00000480386.1 0 TRUE 202 65045 0.31055423
50 ENST00000480386.1 0 TRUE 28 65045 0.04304712
51 ENST00000480386.1 0 TRUE 18 65045 0.02767315
52 ENST00000480386.1 0 TRUE 29 65045 0.04458452
53 ENST00000480386.1 0 TRUE 148 65045 0.22753478
54 ENST00000480386.1 0 TRUE 78 65045 0.11991698
55 ENST00000480386.1 0 TRUE 142 65045 0.21831040
56 ENST00000480386.1 0 TRUE 37 65045 0.05688370
57 ENST00000480386.1 0 TRUE 148 65045 0.22753478
58 ENST00000480386.1 0 TRUE 57 65045 0.08763164
59 ENST00000480386.1 0 TRUE 89 65045 0.13682835
60 ENST00000480386.1 0 TRUE 164 65045 0.25213314
61 ENST00000480386.1 0 TRUE 292 65045 0.44891998
62 ENST00000480386.1 0 TRUE 305 65045 0.46890614
63 ENST00000480386.1 0 TRUE 143 65045 0.21984780
64 ENST00000480386.1 0 TRUE 664 65045 1.02083173
te_len pct_te_overlap repName repClass repFamily milliDiv
1 55 100.00000 (C)n Simple_repeat Simple_repeat 209
2 65 100.00000 GA-rich Low_complexity Low_complexity 299
3 34 100.00000 (GGC)n Simple_repeat Simple_repeat 241
4 50 100.00000 (AAACA)n Simple_repeat Simple_repeat 92
5 20 100.00000 (T)n Simple_repeat Simple_repeat 0
6 24 100.00000 (GGA)n Simple_repeat Simple_repeat 0
7 36 100.00000 (AAT)n Simple_repeat Simple_repeat 128
8 29 100.00000 G-rich Low_complexity Low_complexity 196
9 38 100.00000 (TTA)n Simple_repeat Simple_repeat 153
10 27 100.00000 (ATC)n Simple_repeat Simple_repeat 0
11 48 100.00000 (GT)n Simple_repeat Simple_repeat 0
12 33 100.00000 (GGTGGG)n Simple_repeat Simple_repeat 63
13 43 100.00000 (TTTTC)n Simple_repeat Simple_repeat 133
14 77 100.00000 (TC)n Simple_repeat Simple_repeat 181
15 28 100.00000 (TCT)n Simple_repeat Simple_repeat 83
16 42 100.00000 A-rich Low_complexity Low_complexity 221
17 44 100.00000 (TG)n Simple_repeat Simple_repeat 271
18 39 100.00000 (T)n Simple_repeat Simple_repeat 112
19 67 100.00000 (CTCTCC)n Simple_repeat Simple_repeat 262
20 312 100.00000 AluYm1 SINE Alu 48
21 39 100.00000 (A)n Simple_repeat Simple_repeat 148
22 19 100.00000 A-rich Low_complexity Low_complexity 142
23 35 100.00000 (GCCTG)n Simple_repeat Simple_repeat 161
24 142 100.00000 GA-rich Low_complexity Low_complexity 237
25 26 100.00000 (GAGGG)n Simple_repeat Simple_repeat 81
26 36 100.00000 (AAGAAAG)n Simple_repeat Simple_repeat 89
27 21 100.00000 (GGC)n Simple_repeat Simple_repeat 49
28 40 100.00000 G-rich Low_complexity Low_complexity 174
29 36 100.00000 A-rich Low_complexity Low_complexity 173
30 24 100.00000 (C)n Simple_repeat Simple_repeat 137
31 172 100.00000 GA-rich Low_complexity Low_complexity 291
32 37 100.00000 (CCCTG)n Simple_repeat Simple_repeat 117
33 18 100.00000 (A)n Simple_repeat Simple_repeat 0
34 24 100.00000 (ATT)n Simple_repeat Simple_repeat 139
35 47 100.00000 (GT)n Simple_repeat Simple_repeat 0
36 112 100.00000 (CCGGCCC)n Simple_repeat Simple_repeat 209
37 61 100.00000 (CGCCGG)n Simple_repeat Simple_repeat 246
38 35 100.00000 (C)n Simple_repeat Simple_repeat 196
39 38 100.00000 G-rich Low_complexity Low_complexity 154
40 32 100.00000 G-rich Low_complexity Low_complexity 216
41 43 100.00000 (TCCTGC)n Simple_repeat Simple_repeat 90
42 53 100.00000 GA-rich Low_complexity Low_complexity 253
43 79 100.00000 (GGC)n Simple_repeat Simple_repeat 305
44 300 100.00000 AluSx3 SINE Alu 60
45 53 100.00000 (TTTTC)n Simple_repeat Simple_repeat 146
46 55 100.00000 (GCTCAG)n Simple_repeat Simple_repeat 235
47 150 100.00000 Charlie4z DNA hAT-Charlie 324
48 26 100.00000 (T)n Simple_repeat Simple_repeat 0
49 202 100.00000 MIRc SINE MIR 316
50 28 100.00000 (TG)n Simple_repeat Simple_repeat 159
51 18 100.00000 (TC)n Simple_repeat Simple_repeat 0
52 29 100.00000 (CA)n Simple_repeat Simple_repeat 74
53 148 100.00000 MIRb SINE MIR 299
54 78 100.00000 L2a LINE L2 187
55 142 100.00000 Charlie4z DNA hAT-Charlie 202
56 37 100.00000 (CAGGGGCGG)n Simple_repeat Simple_repeat 28
57 148 100.00000 (GA)n Simple_repeat Simple_repeat 81
58 57 100.00000 G-rich Low_complexity Low_complexity 283
59 89 100.00000 MIR1_Amn SINE MIR 247
60 164 100.00000 L1M5 LINE L1 181
61 292 100.00000 AluSx SINE Alu 123
62 305 100.00000 AluSp SINE Alu 79
63 143 100.00000 AluSp SINE Alu 56
64 749 88.65154 L1ME3 LINE L1 236
milliDel milliIns
1 0 0
2 46 0
3 0 0
4 0 40
5 0 0
6 0 83
7 0 28
8 0 0
9 0 0
10 0 0
11 62 0
12 30 0
13 47 22
14 52 12
15 36 35
16 24 0
17 0 0
18 0 0
19 45 14
20 0 6
21 0 0
22 52 33
23 57 0
24 42 34
25 0 0
26 28 0
27 0 0
28 27 0
29 28 27
30 0 0
31 17 0
32 27 0
33 0 0
34 0 0
35 0 0
36 11 101
37 33 32
38 0 0
39 0 26
40 0 0
41 47 89
42 38 18
43 0 51
44 13 0
45 38 73
46 18 54
47 58 33
48 0 0
49 36 59
50 0 0
51 0 0
52 0 0
53 62 74
54 0 39
55 70 162
56 0 0
57 20 7
58 18 17
59 128 0
60 70 24
61 3 0
62 29 3
63 14 0
64 75 42
H3K36me3_repclass_sum <- H3K36me3_ols$all_H3K36me3 %>%
left_join(., H3K36me3_lookup) %>%
mutate(cluster = if_else(is.na(cluster), "not_assigned", cluster)) %>%
mutate(special_type=case_when(repClass=="SINE"~"SINE",
repClass=="LINE"~"LINE",
repClass=="LTR"~"LTR",
repClass=="DNA"~"DNA",
repClass=="Retroposon"~"SVA",
is.na(repClass)~"not_TE",
TRUE~"Other"))%>%
dplyr::filter(cluster != "not_assigned") %>%
group_by(Peakid, special_type) %>%
summarize(width= min(width),
min_te_len=min(te_len),
min_roi_len=min(roi_len),
min_overlap=min(overlap_bp),
min_dist_TSS=min(distanceToTSS),
max_dist_TSS=max(distanceToTSS),
cluster=paste0(unique(cluster), collapse = ":"))
H3K36me3_repclass_sum %>%
ggplot(aes(x=special_type,y=min_te_len))+
geom_boxplot() +
facet_wrap(~cluster)+
theme_classic()+
ggtitle("Overlapping minimum TE length per class in H3K36me3")

H3K36me3_sets_gr$all_H3K36me3 %>%
as.data.frame() %>%
left_join(., H3K36me3_lookup) %>%
mutate(cluster = if_else(is.na(cluster), "not_assigned", cluster)) %>%
ggplot(aes(x=cluster, y=width))+
geom_boxplot()+
theme_classic()+
ggtitle("Region of interest length in H3K36me3")

H3K36me3_repclass_sum %>%
ggplot(aes(x=special_type,y=width))+
geom_boxplot() +
facet_wrap(~cluster)+
theme_classic()+
ggtitle("Minimum overlapping TE & ROI width length per class in H3K36me3")

H3K36me3_repclass_sum %>%
ggplot(aes(x=special_type,y=min_dist_TSS))+
geom_boxplot() +
facet_wrap(~cluster)+
theme_classic()+
ggtitle("Distance to TSS per class in H3K36me3 of overlapping ROIs")

# H3K36me3_sets_gr$all_H3K36me3 %>%
# as.data.frame() %>%
# dplyr::filter(width>60000)
#
# H3K36me3_ols$all_H3K36me3 %>%
# as.data.frame() %>%
# mutate(medi=median(roi_len))
# dplyr::filter(roi_len>60000)
H3K9me3_repclass_sum <- H3K9me3_ols$all_H3K9me3 %>%
left_join(., H3K9me3_lookup) %>%
mutate(cluster = if_else(is.na(cluster), "not_assigned", cluster)) %>%
mutate(special_type=case_when(repClass=="SINE"~"SINE",
repClass=="LINE"~"LINE",
repClass=="LTR"~"LTR",
repClass=="DNA"~"DNA",
repClass=="Retroposon"~"SVA",
is.na(repClass)~"not_TE",
TRUE~"Other"))%>%
dplyr::filter(cluster != "not_assigned") %>%
group_by(Peakid, special_type) %>%
summarize(width= min(width),
min_te_len=min(te_len),
min_roi_len=min(roi_len),
min_overlap=min(overlap_bp),
min_dist_TSS=min(distanceToTSS),
max_dist_TSS=max(distanceToTSS),
cluster=paste0(unique(cluster), collapse = ":"))
H3K9me3_repclass_sum %>%
ggplot(aes(x=special_type,y=min_te_len))+
geom_boxplot() +
facet_wrap(~cluster)+
theme_classic()+
ggtitle("Overlapping minimum TE length per class in H3K9me3")

H3K9me3_sets_gr$all_H3K9me3 %>%
as.data.frame() %>%
left_join(., H3K9me3_lookup) %>%
mutate(cluster = if_else(is.na(cluster), "not_assigned", cluster)) %>%
ggplot(aes(x=cluster, y=width))+
geom_boxplot()+
theme_classic()+
ggtitle("Region of interest length in H3K9me3")

H3K9me3_repclass_sum %>%
ggplot(aes(x=special_type,y=width))+
geom_boxplot() +
facet_wrap(~cluster)+
theme_classic()+
ggtitle("Minimum overlapping TE & ROI width length per class in H3K9me3")

H3K9me3_repclass_sum %>%
ggplot(aes(x=special_type,y=min_dist_TSS))+
geom_boxplot() +
facet_wrap(~cluster)+
theme_classic()+
ggtitle("Distance to TSS per class in H3K9me3 of overlapping ROIs")

# H3K9me3_sets_gr$all_H3K9me3 %>%
# as.data.frame() %>%
# dplyr::filter(width>60000)
#
# H3K27ac_ols$all_H3K27ac %>%
# as.data.frame() %>%
# mutate(medi=median(roi_len))
# dplyr::filter(roi_len>60000)
sessionInfo()
R version 4.4.2 (2024-10-31 ucrt)
Platform: x86_64-w64-mingw32/x64
Running under: Windows 11 x64 (build 26100)
Matrix products: default
locale:
[1] LC_COLLATE=English_United States.utf8
[2] LC_CTYPE=English_United States.utf8
[3] LC_MONETARY=English_United States.utf8
[4] LC_NUMERIC=C
[5] LC_TIME=English_United States.utf8
time zone: America/Chicago
tzcode source: internal
attached base packages:
[1] stats4 grid stats graphics grDevices utils datasets
[8] methods base
other attached packages:
[1] ggpubr_0.6.1
[2] ggsignif_0.6.4
[3] plyranges_1.26.0
[4] ChIPseeker_1.42.1
[5] TxDb.Hsapiens.UCSC.hg38.knownGene_3.20.0
[6] GenomicFeatures_1.58.0
[7] AnnotationDbi_1.68.0
[8] Biobase_2.66.0
[9] BiocParallel_1.40.2
[10] GenomicRanges_1.58.0
[11] GenomeInfoDb_1.42.3
[12] IRanges_2.40.1
[13] S4Vectors_0.44.0
[14] BiocGenerics_0.52.0
[15] genomation_1.38.0
[16] DT_0.33
[17] lubridate_1.9.4
[18] forcats_1.0.0
[19] stringr_1.5.1
[20] dplyr_1.1.4
[21] purrr_1.1.0
[22] readr_2.1.5
[23] tidyr_1.3.1
[24] tibble_3.3.0
[25] ggplot2_3.5.2
[26] tidyverse_2.0.0
[27] workflowr_1.7.1
loaded via a namespace (and not attached):
[1] splines_4.4.2
[2] later_1.4.2
[3] BiocIO_1.16.0
[4] bitops_1.0-9
[5] ggplotify_0.1.2
[6] R.oo_1.27.1
[7] XML_3.99-0.18
[8] lifecycle_1.0.4
[9] rstatix_0.7.2
[10] rprojroot_2.1.1
[11] processx_3.8.6
[12] lattice_0.22-7
[13] vroom_1.6.5
[14] crosstalk_1.2.2
[15] backports_1.5.0
[16] magrittr_2.0.3
[17] sass_0.4.10
[18] rmarkdown_2.29
[19] jquerylib_0.1.4
[20] yaml_2.3.10
[21] plotrix_3.8-4
[22] httpuv_1.6.16
[23] ggtangle_0.0.7
[24] cowplot_1.2.0
[25] DBI_1.2.3
[26] RColorBrewer_1.1-3
[27] abind_1.4-8
[28] zlibbioc_1.52.0
[29] R.utils_2.13.0
[30] RCurl_1.98-1.17
[31] yulab.utils_0.2.1
[32] rappdirs_0.3.3
[33] git2r_0.36.2
[34] GenomeInfoDbData_1.2.13
[35] enrichplot_1.26.6
[36] ggrepel_0.9.6
[37] tidytree_0.4.6
[38] codetools_0.2-20
[39] DelayedArray_0.32.0
[40] DOSE_4.0.1
[41] tidyselect_1.2.1
[42] aplot_0.2.8
[43] UCSC.utils_1.2.0
[44] farver_2.1.2
[45] matrixStats_1.5.0
[46] GenomicAlignments_1.42.0
[47] jsonlite_2.0.0
[48] Formula_1.2-5
[49] tools_4.4.2
[50] treeio_1.30.0
[51] TxDb.Hsapiens.UCSC.hg19.knownGene_3.2.2
[52] Rcpp_1.1.0
[53] glue_1.8.0
[54] SparseArray_1.6.2
[55] xfun_0.52
[56] qvalue_2.38.0
[57] MatrixGenerics_1.18.1
[58] withr_3.0.2
[59] fastmap_1.2.0
[60] boot_1.3-32
[61] callr_3.7.6
[62] caTools_1.18.3
[63] digest_0.6.37
[64] timechange_0.3.0
[65] R6_2.6.1
[66] gridGraphics_0.5-1
[67] seqPattern_1.38.0
[68] colorspace_2.1-1
[69] GO.db_3.20.0
[70] gtools_3.9.5
[71] dichromat_2.0-0.1
[72] RSQLite_2.4.3
[73] R.methodsS3_1.8.2
[74] utf8_1.2.6
[75] generics_0.1.4
[76] data.table_1.17.8
[77] rtracklayer_1.66.0
[78] httr_1.4.7
[79] htmlwidgets_1.6.4
[80] S4Arrays_1.6.0
[81] whisker_0.4.1
[82] pkgconfig_2.0.3
[83] gtable_0.3.6
[84] blob_1.2.4
[85] impute_1.80.0
[86] XVector_0.46.0
[87] htmltools_0.5.8.1
[88] carData_3.0-5
[89] fgsea_1.32.4
[90] scales_1.4.0
[91] png_0.1-8
[92] ggfun_0.2.0
[93] knitr_1.50
[94] rstudioapi_0.17.1
[95] tzdb_0.5.0
[96] reshape2_1.4.4
[97] rjson_0.2.23
[98] nlme_3.1-168
[99] curl_7.0.0
[100] cachem_1.1.0
[101] KernSmooth_2.23-26
[102] parallel_4.4.2
[103] restfulr_0.0.16
[104] pillar_1.11.0
[105] vctrs_0.6.5
[106] gplots_3.2.0
[107] promises_1.3.3
[108] car_3.1-3
[109] evaluate_1.0.5
[110] cli_3.6.5
[111] compiler_4.4.2
[112] Rsamtools_2.22.0
[113] rlang_1.1.6
[114] crayon_1.5.3
[115] labeling_0.4.3
[116] ps_1.9.1
[117] getPass_0.2-4
[118] plyr_1.8.9
[119] fs_1.6.6
[120] stringi_1.8.7
[121] gridBase_0.4-7
[122] Biostrings_2.74.1
[123] lazyeval_0.2.2
[124] GOSemSim_2.32.0
[125] Matrix_1.7-3
[126] BSgenome_1.74.0
[127] hms_1.1.3
[128] patchwork_1.3.2
[129] bit64_4.6.0-1
[130] KEGGREST_1.46.0
[131] SummarizedExperiment_1.36.0
[132] igraph_2.1.4
[133] broom_1.0.9
[134] memoise_2.0.1
[135] bslib_0.9.0
[136] ggtree_3.14.0
[137] fastmatch_1.1-6
[138] bit_4.6.0
[139] ape_5.8-1