Last updated: 2025-09-17
Checks: 5 2
Knit directory:
locust-comparative-genomics/
This reproducible R Markdown analysis was created with workflowr (version 1.7.1). The Checks tab describes the reproducibility checks that were applied when the results were created. The Past versions tab lists the development history.
The R Markdown file has unstaged changes. To know which version of
the R Markdown file created these results, you’ll want to first commit
it to the Git repo. If you’re still working on the analysis, you can
ignore this warning. When you’re finished, you can run
wflow_publish to commit the R Markdown file and build the
HTML.
Great job! The global environment was empty. Objects defined in the global environment can affect the analysis in your R Markdown file in unknown ways. For reproduciblity it’s best to always run the code in an empty environment.
The command set.seed(20221025) was run prior to running
the code in the R Markdown file. Setting a seed ensures that any results
that rely on randomness, e.g. subsampling or permutations, are
reproducible.
Great job! Recording the operating system, R version, and package versions is critical for reproducibility.
Nice! There were no cached chunks for this analysis, so you can be confident that you successfully produced the results during this run.
Using absolute paths to the files within your workflowr project makes it difficult for you and others to run your code on a different machine. Change the absolute path(s) below to the suggested relative path(s) to make your code more reproducible.
| absolute | relative |
|---|---|
| /Users/maevatecher/Documents/GitHub/locust-comparative-genomics/data/cafe5_results/Results_26May_iqtree/Base_change.tab | data/cafe5_results/Results_26May_iqtree/Base_change.tab |
| /Users/maevatecher/Documents/GitHub/locust-comparative-genomics/data/orthofinder/Polyneoptera | data/orthofinder/Polyneoptera |
| /Users/maevatecher/Documents/GitHub/locust-comparative-genomics/data | data |
| /Users/maevatecher/Documents/GitHub/locust-comparative-genomics/data/cafe5_results/Orthogroups.GeneCount.tsv | data/cafe5_results/Orthogroups.GeneCount.tsv |
Great! You are using Git for version control. Tracking code development and connecting the code version to the results is critical for reproducibility.
The results in this page were generated with repository version b7729ad. See the Past versions tab to see a history of the changes made to the R Markdown and HTML files.
Note that you need to be careful to ensure that all relevant files for
the analysis have been committed to Git prior to generating the results
(you can use wflow_publish or
wflow_git_commit). workflowr only checks the R Markdown
file, but you know if there are other scripts or data files that it
depends on. Below is the status of the Git repository when the results
were generated:
Ignored files:
Ignored: .DS_Store
Ignored: .Rhistory
Ignored: analysis/.DS_Store
Ignored: analysis/.Rhistory
Ignored: analysis/2_signatures-selection_cache/
Ignored: analysis/figure/
Ignored: code/.DS_Store
Ignored: code/scripts/.DS_Store
Ignored: code/scripts/pal2nal.v14/.DS_Store
Ignored: data/.DS_Store
Ignored: data/DEG_results/.DS_Store
Ignored: data/DEG_results/Bulk_RNAseq/.DS_Store
Ignored: data/DEG_results/Bulk_RNAseq/americana/.DS_Store
Ignored: data/DEG_results/Bulk_RNAseq/cancellata/.DS_Store
Ignored: data/DEG_results/Bulk_RNAseq/cubense/.DS_Store
Ignored: data/DEG_results/Bulk_RNAseq/gregaria/.DS_Store
Ignored: data/DEG_results/Bulk_RNAseq/nitens/.DS_Store
Ignored: data/HYPHY_selection/.DS_Store
Ignored: data/HYPHY_selection/ParsedABSRELResults_unlabeled/.DS_Store
Ignored: data/HYPHY_selection/pathway_enrichment/.DS_Store
Ignored: data/HYPHY_selection/pathway_enrichment/americana/
Ignored: data/HYPHY_selection/pathway_enrichment/cancellata/
Ignored: data/HYPHY_selection/pathway_enrichment/cubense/
Ignored: data/HYPHY_selection/pathway_enrichment/nitens/
Ignored: data/HYPHY_selection/pathway_enrichment/piceifrons/
Ignored: data/WGCNA/.DS_Store
Ignored: data/WGCNA/input/.DS_Store
Ignored: data/WGCNA/input/Bulk_RNAseq/.DS_Store
Ignored: data/WGCNA/output/.DS_Store
Ignored: data/WGCNA/output/Bulk_RNAseq/.DS_Store
Ignored: data/WGCNA/output/Bulk_RNAseq/gregaria/.DS_Store
Ignored: data/behavioral_data/.DS_Store
Ignored: data/behavioral_data/Raw_data/.DS_Store
Ignored: data/cafe5_results/.DS_Store
Ignored: data/list/.DS_Store
Ignored: data/list/Bulk_RNAseq/.DS_Store
Ignored: data/list/GO_Annotations/.DS_Store
Ignored: data/list/GO_Annotations/DesertLocustR/.DS_Store
Ignored: data/list/excluded_loci/.DS_Store
Ignored: data/orthofinder/.DS_Store
Ignored: data/orthofinder/Polyneoptera/.DS_Store
Ignored: data/orthofinder/Polyneoptera/Results_I2_iqtree/.DS_Store
Ignored: data/orthofinder/Polyneoptera/Results_I2_iqtree/Orthogroups/.DS_Store
Ignored: data/orthofinder/Polyneoptera/Results_I2_withDaust/.DS_Store
Ignored: data/orthofinder/Polyneoptera/Results_I2_withDaust/Orthogroups/.DS_Store
Ignored: data/orthofinder/Schistocerca/.DS_Store
Ignored: data/orthofinder/Schistocerca/Results_I2/.DS_Store
Ignored: data/orthofinder/Schistocerca/Results_I2/Orthogroups/.DS_Store
Ignored: data/overlap/.DS_Store
Ignored: data/pathway_enrichment/.DS_Store
Ignored: data/pathway_enrichment/OLD/.DS_Store
Ignored: data/pathway_enrichment/OLD/custom_sgregaria_orgdb/.DS_Store
Ignored: data/pathway_enrichment/REVIGO_results/.DS_Store
Ignored: data/pathway_enrichment/REVIGO_results/BP/.DS_Store
Ignored: data/pathway_enrichment/REVIGO_results/CC/.DS_Store
Ignored: data/pathway_enrichment/REVIGO_results/MF/.DS_Store
Ignored: data/pathway_enrichment/americana/.DS_Store
Ignored: data/pathway_enrichment/cancellata/.DS_Store
Ignored: data/pathway_enrichment/gregaria/.DS_Store
Ignored: data/pathway_enrichment/nitens/Thorax/
Ignored: data/pathway_enrichment/piceifrons/.DS_Store
Ignored: data/readcounts/.DS_Store
Ignored: data/readcounts/Bulk_RNAseq/.DS_Store
Ignored: data/readcounts/RNAi/.DS_Store
Untracked files:
Untracked: analysis/KEGG_expansion_diagnostics.csv
Untracked: data/HYPHY_selection/functional_pathways/
Untracked: data/RefSeq/
Untracked: data/cafe5_results/Base_change_FILE/
Untracked: data/cafe5_results/Gene_count_FILE/
Unstaged changes:
Modified: analysis/2_gene-family.Rmd
Modified: analysis/2_psmc-analysis.Rmd
Modified: analysis/2_signatures-selection.Rmd
Note that any generated files, e.g. HTML, png, CSS, etc., are not included in this status report because it is ok for generated content to have uncommitted changes.
These are the previous versions of the repository in which changes were
made to the R Markdown (analysis/2_gene-family.Rmd) and
HTML (docs/2_gene-family.html) files. If you’ve configured
a remote Git repository (see ?wflow_git_remote), click on
the hyperlinks in the table below to view the files as they were in that
past version.
| File | Version | Author | Date | Message |
|---|---|---|---|---|
| html | 36f043d | Maeva TECHER | 2025-06-05 | Build site. |
| Rmd | 4e391c3 | Maeva TECHER | 2025-05-30 | add new analysis orthology, synteny |
| html | 4e391c3 | Maeva TECHER | 2025-05-30 | add new analysis orthology, synteny |
| Rmd | b982319 | Maeva TECHER | 2025-03-03 | update font |
| html | b982319 | Maeva TECHER | 2025-03-03 | update font |
| html | 84bca53 | Maeva TECHER | 2025-02-27 | Build site. |
| html | d0b7267 | Maeva TECHER | 2025-02-27 | Build site. |
| Rmd | faf2db3 | Maeva TECHER | 2025-01-13 | update markdown |
| html | faf2db3 | Maeva TECHER | 2025-01-13 | update markdown |
| html | 3fa8e62 | Maeva TECHER | 2024-11-09 | updated analysis |
| html | edb70fe | Maeva TECHER | 2024-11-08 | overlap and deg results created |
| html | ba35b82 | Maeva A. TECHER | 2024-06-20 | Build site. |
| html | bf74631 | Maeva A. TECHER | 2024-05-14 | Build site. |
| Rmd | 41bd625 | Maeva A. TECHER | 2024-05-14 | wflow_publish("analysis/2_gene-family.Rmd") |
| html | 0837617 | Maeva A. TECHER | 2024-01-30 | Build site. |
| html | f701a01 | Maeva A. TECHER | 2024-01-30 | reupdate |
| html | 7b901bb | Maeva A. TECHER | 2024-01-24 | Build site. |
| html | 8c22d7c | Maeva A. TECHER | 2024-01-24 | Build site. |
| html | 1b09cbe | Maeva A. TECHER | 2024-01-24 | remove |
| html | deba29e | Maeva A. TECHER | 2023-12-18 | Build site. |
| Rmd | 53877fa | Maeva A. TECHER | 2023-12-18 | add pages |
We will be analyzing the gene family expansion and contraction using
CAFE5
developed by the Hahn lab. We can use the output files generated by
OrthoFinder after a quick manipulation to format the input
needed to estimate the gene gain and loss. CAFE5 requires two input: -
Input 1: The number of genes per species in each gene family (here
orthogroup). - Input 2: An ultrametric time tree with the same species
name.
For input 1, we will be using the GeneCount matrix generated by
OrthoFinder2 which is located in the
Results_DATE/Orthogroups output folder. The file of
interest is the Orthogroups.GeneCount.tsv matrix which is
formatted as shown below:

However, whether it is the Polyneoptera run (13 species) or Schistocerca only (6 species), we encounter an issue for downstream analysis as some orthogroups have too many genes or a too high size variation which can cause difficulty for model convergence. We found a large number of duplication events in several orthogroups which can bias the overal picture. We will then apply some filtering rules:
For Polyneoptera - We remove all orthogroups with > 400 genes - We remove all orthogroups with > 100 size variation
For Schistocerca - We remove all orthogroups with > X genes - We remove all orthogroups with > X size variation
We can do it manually or use the following script
python 1_filter_large_variation_OGs.py:
srun --ntasks 1 --cpus-per-task 8 --mem 50G --time 03:00:00 --pty bash
ml GCCcore/14.2.0 Python/3.13.1
#python -m venv --system-site-packages ./venv
#source ./venv/bin/activate
#pip install --upgrade pip
#pip install pandas
import pandas as pd
# Input/Output file paths
input_file = "Orthogroups.GeneCount.tsv"
output_file = "Orthogroups_drop4CAFE.tsv"
# Read TSV (tab-separated)
df = pd.read_csv(input_file, sep="\t")
# Drop orthogroups with more than 450 genes in total
df_filtered = df[df.iloc[:, 1:].sum(axis=1) <= 200] # For Schistocerca
#df_filtered = df[df.iloc[:, 1:].sum(axis=1) <= 300] # For Polyneoptera
# Drop orthogroups where the max difference across species is >100
df_filtered = df_filtered[df_filtered.iloc[:, 1:].max(axis=1) - df_filtered.iloc[:, 1:].min(axis=1) <= 50] # For Schistocerca
#df_filtered = df_filtered[df_filtered.iloc[:, 1:].max(axis=1) - df_filtered.iloc[:, 1:].min(axis=1) <= 100] # For Polyneoptera
# Write the filtered table to a TSV file
df_filtered.to_csv(output_file, sep="\t", index=False)
print(f"Filtered orthogroups written to: {output_file}")
Then we need to format the file more by removing the
Total column, adding a column Description with
values = null.
# Step 1: Add "(null)" as the first column
awk -F'\t' '{print "(null)\t"$0}' Orthogroups_drop4CAFE.tsv > tmp.tsv
# Step 2: Rename the first header column to "Desc"
sed -i '1s/^(null)/Desc/' tmp.tsv
# Step 3: Remove the last column (Total), even if column numbers vary
awk -F'\t' '{$NF=""; print $0}' tmp.tsv | rev | sed 's/^\s*//g' | rev | tr ' ' '\t' > Orthogroups_readytoCAFE.tsv
# Optional: remove intermediate file
rm tmp.tsv
# We clean the species name so it is easier for plotting later
sed '1s/_filteredTranscripts//g' Orthogroups_readytoCAFE.tsv > Orthogroups_readytoCAFE_cleaned.tsv
For input 2, we will copy the tree file generated by
OrthoFinder unless we specify our own calibrated tree. We
will take SpeciesTree_rooted.txt from the
Results_DATE/SpeciesTree output folder.
If we modify the header (names of species) of the GeneCount file,
this must be reflected in the species tree file as well. For CAFE, the
tree must be ultrametric in the past I used directly the script make_ultrametric.py
available from Orthofinder:
# Orthofinder needs to be loaded for it
# On Grace we may have to create a virtual environment if numpy is not installed
# python -m venv --system-site-packages ./venv
# source venv/bin/activate
# pip install numpy
#ml GCC/12.3.0 OpenMPI/4.1.5 OrthoFinder/2.5.5
#python ./make_ultrametric.py SpeciesTree_rooted.txt
However, I noticed that the tree does not rescale always properly and
also if the branches are too short like for Schistocerca,
CAFE will throw the inf error as the model can
not converge as there is not enough time for gene family evolution. So
now, as we do not have a calibrated tree, I rescale the topology of the
tree (which was run with bootstraps) with a factor of 10 and make
ultrametric in R:
ml GCC/13.2.0 OpenMPI/4.1.6 R_tamu/4.4.1
export R_LIBS=$SCRATCH/R_LIBS_USER/
We then launch R:
library(ape)
# Set your input/output
input_tree <- "SpeciesTree_rooted.txt"
output_tree <- "rescaled_ultrametric_SpeciesTree.txt"
# Read the tree
tree <- read.tree(input_tree)
# Force ultrametric (can adjust lambda or try other methods if needed)
ultra_tree <- chronos(tree, lambda = 1)
# Optional: check it worked
stopifnot(is.ultrametric(ultra_tree))
# 🔁 RESCALE: Multiply all branch lengths by a factor (e.g., ×100)
scaling_factor <- 10
ultra_tree$edge.length <- ultra_tree$edge.length * scaling_factor
# Write the rescaled ultrametric tree
write.tree(ultra_tree, file = output_tree)
cat("✅ Tree written to:", output_tree, "\n")
As we did earlier for the Orthogroups file, we want to clean the name so it is easier for plotting later:
sed 's/_filteredTranscripts//g' rescaled_ultrametric_SpeciesTree.txt > rescaled_ultrametric_SpeciesTree_cleaned.txt
CAFE5module purge
ml GCC/12.2.0 CAFE5/5.0.0
cafe5 -t rescaled_ultrametric_SpeciesTree_cleaned.txt -i Orthogroups_readytoCAFE_cleaned.tsv
When successful we can look at the results folder and
extract the results from Base_clade_results

We can plot the results for all families and for each gene family by
using cafeplotter.
For that we run the following commands and output examples are shown below:
source /venv/bin/activate
# pip install cafeplotter
cafeplotter -i results -o plots


We are interested in knowing the function of the gene family expansion per species. First we need to extract the list of genes expanding for each species. We are working on two different files.
The first file will be looking at each gene family that expanded and
those who contracted using the Base_change.tab from CAFE5
which shows the increase and decrease for each node of the tree for each
species.
The second file will be the actual gene counts that was used as input
Orthogroups.GeneCount.tsv and we will look only at families
with at least 10 copies per species.
library(dplyr)
library(tidyr)
library(readr)
library(purrr)
# Load files
base_change <- read.delim("/Users/maevatecher/Documents/GitHub/locust-comparative-genomics/data/cafe5_results/Results_26May_iqtree/Base_change.tab", sep="\t", header=TRUE)
# Current names
names(base_change)
[1] "FamilyID" "Samer.0." "Sscub.1." "X.2." "Spice.3." "X.4."
[7] "Scanc.5." "X.6." "Snite.7." "Gbima.8." "Glong.9." "X.10."
[13] "Sgreg.11." "X.12." "Asimp.13." "Lmigr.14." "X.15." "X.16."
[19] "X.17." "Brsri.18." "X.19." "Csecu.20." "Pamer.21." "X.22."
[25] "X.23." "X.24."
# Clean up: remove junk "X.*" columns
base_change <- base_change %>%
dplyr::select(-matches("^X")) # drop all columns starting with X
# Rename properly
colnames(base_change) <- sub("\\..*", "", colnames(base_change)) # keep only part before first dot
# make sure FamilyID column is consistently called Orthogroup
colnames(base_change)[1] <- "Orthogroup"
# Now FamilyID stays, and others become clean species codes
head(base_change)
Orthogroup Samer Sscub Spice Scanc Snite Gbima Glong Sgreg Asimp Lmigr Brsri
1 OG0000002 -1 0 -1 2 0 -5 21 0 -4 -1 -2
2 OG0000004 0 -1 -1 0 -1 1 -4 0 1 1 2
3 OG0000005 1 0 0 0 -4 -6 2 1 10 -2 -4
4 OG0000006 0 0 0 -1 0 -4 0 0 0 1 -2
5 OG0000007 0 1 0 2 0 -2 0 6 7 -4 1
6 OG0000009 0 4 3 -1 -1 0 0 -1 0 0 1
Csecu Pamer
1 6 -7
2 -2 -2
3 -1 -1
4 -7 26
5 -6 7
6 0 0
# -------------------------
# Load orthogroup map
# -------------------------
ortho_dir <- "/Users/maevatecher/Documents/GitHub/locust-comparative-genomics/data/orthofinder/Polyneoptera"
input_file <- file.path(ortho_dir, "Results_I2_iqtree/Orthogroups_genesproteinbiotype_13species_annotated_May2025.csv")
ortho_map <- read.csv(input_file, header = TRUE, stringsAsFactors = FALSE)
# -------------------------
# Species translation
# -------------------------
species_translate <- c(
Sgreg = "gregaria",
Scanc = "cancellata",
Spice = "piceifrons",
Samer = "americana",
Sscub = "cubense",
Snite = "nitens",
Lmigr = "locusta"
)
# -------------------------
# Build expanded_genes tibble
# -------------------------
# -------------------------
# Build expanded + contracted genes tibble
# -------------------------
cafe_genes <- map_dfr(names(species_translate), function(sp) {
message("Collecting OGs for ", sp)
# expansions (>0) and contractions (<0)
expanded_ogs <- base_change %>%
filter(!!sym(sp) > 0) %>%
mutate(ChangeType = "Expansion") %>%
pull(Orthogroup)
contracted_ogs <- base_change %>%
filter(!!sym(sp) < 0) %>%
mutate(ChangeType = "Contraction") %>%
pull(Orthogroup)
all_ogs <- tibble(
Orthogroup = c(expanded_ogs, contracted_ogs),
ChangeType = c(rep("Expansion", length(expanded_ogs)),
rep("Contraction", length(contracted_ogs)))
)
# If Lmigr -> reuse Sgreg’s genes
if (sp == "Lmigr") {
proxy_sp <- "Sgreg"
} else {
proxy_sp <- sp
}
ortho_map %>%
filter(Species == proxy_sp, Orthogroup %in% all_ogs$Orthogroup) %>%
left_join(all_ogs, by = "Orthogroup") %>%
mutate(Species = sp) %>%
dplyr::select(Species, Orthogroup, GeneID, ChangeType) %>%
distinct()
})
# Preview
head(cafe_genes)
Species Orthogroup GeneID ChangeType
1 Sgreg OG0000005 LOC126281744 Expansion
2 Sgreg OG0000005 LOC126281746 Expansion
3 Sgreg OG0000005 LOC126282022 Expansion
4 Sgreg OG0000005 LOC126282417 Expansion
5 Sgreg OG0000005 LOC126335687 Expansion
6 Sgreg OG0000005 LOC126336625 Expansion
Then we use the same function we used for DEGs GO terms and KEGG enrichment:
# === Paths and Constants ===
workDir <- "/Users/maevatecher/Documents/GitHub/locust-comparative-genomics/data"
GODir <- file.path(workDir, "list", "GO_Annotations")
RefDir <- file.path(workDir, "RefSeq")
enrichDir <- file.path(workDir, "cafe5_results/Base_change_FILE")
species_list <- c("gregaria", "cancellata", "piceifrons", "americana", "cubense", "nitens", "locusta")
# === Load Required Libraries ===
library(data.table)
library(dplyr)
library(readr)
library(clusterProfiler)
library(GO.db)
library(rtracklayer)
library(DesertLocustR) # Local installation
gff_map <- c(
gregaria = "GCF_023897955.1_iqSchGreg1.2_genomic.gff",
cancellata = "GCF_023864275.1_iqSchCanc2.1_genomic.gff",
piceifrons = "GCF_021461385.2_iqSchPice1.1_genomic.gff",
americana = "GCF_021461395.2_iqSchAmer2.1_genomic.gff",
cubense = "GCF_023864345.2_iqSchSeri2.2_genomic.gff",
nitens = "GCF_023898315.1_iqSchNite1.1_genomic.gff",
locusta = "GCF_023897955.1_iqSchGreg1.2_genomic.gff"
)
annot_map <- c(
gregaria = "EggNog_Arthropoda_one2one.emapper.annotations",
cancellata = "GCF_023864275.1_iqSchCanc2.1_Arthopoda_one2one.emapper.annotations",
piceifrons = "GCF_021461385.2_iqSchPice1.1_Arthopoda_one2one.emapper.annotations",
americana = "GCF_021461395.2_iqSchAmer2.1_Arthopoda_one2one.emapper.annotations",
cubense = "GCF_023864345.2_iqSchSeri2.2_Arthopoda_one2one.emapper.annotations",
nitens = "GCF_023898315.1_iqSchNite1.1_Arthopoda_one2one.emapper.annotations",
locusta = "EggNog_Arthropoda_one2one.emapper.annotations"
)
# GO enrichment
enrich_GO <- function(dge_genes.df, term2gene, term2name, pval, qval){
genes <- rownames(dge_genes.df)
enricher(
genes,
TERM2GENE = term2gene,
TERM2NAME = term2name,
pvalueCutoff = pval,
pAdjustMethod = "BH",
qvalueCutoff = qval
)
}
# KEGG preparation
assign_kegg_ids <- function(sig_genes.df){
if (is.vector(sig_genes.df)) {
sig_genes.df <- data.frame(X.query = sig_genes.df, stringsAsFactors = FALSE)
} else {
sig_genes.df$X.query <- rownames(sig_genes.df)
}
dge_with_kegg_ids <- left_join(sig_genes.df, kegg_final, by = "X.query")
dge_with_kegg_ids$KEGG_ko[grepl("^K", dge_with_kegg_ids$KEGG_ko)]
}
# KEGG enrichment
enrich_KEGG <- function(dge_genes.df, pval_cutoff = 0.05, qval_cutoff = 0.2) {
gene_with_kegg_ids <- assign_kegg_ids(dge_genes.df)
enrichKEGG(
gene = gene_with_kegg_ids,
organism = "ko",
pvalueCutoff = pval_cutoff,
qvalueCutoff = qval_cutoff,
pAdjustMethod = "BH"
)
}
run_GO_enrichment_selected <- function(
gene_list,
go_table,
term2name,
species,
suffix,
ontology,
output_dir,
show_n = 30,
top_n = 30
) {
if (length(gene_list) == 0) return(NULL)
if (!dir.exists(output_dir)) {
dir.create(output_dir, recursive = TRUE)
}
# Make sure column names are correct for clusterProfiler::enricher()
go_table_fixed <- go_table[, 1:2]
colnames(go_table_fixed) <- c("go_id", "gene_id")
term2name_fixed <- term2name[, 1:2]
colnames(term2name_fixed) <- c("go_id", "name")
# Run enrichment
go_result <- enricher(
gene = gene_list,
TERM2GENE = go_table_fixed,
TERM2NAME = term2name_fixed,
pvalueCutoff = 0.05,
qvalueCutoff = 0.2
)
if (!is.null(go_result) &&
inherits(go_result, "enrichResult") &&
nrow(go_result@result) > 0 &&
sum(!is.na(go_result@result$Description)) > 0) {
# Save dotplot
try({
pdf(file = file.path(output_dir, paste0("GO_", ontology, "_dotplot_", species, "_", suffix, ".pdf")),
width = 8, height = 6)
print(dotplot(go_result, showCategory = min(show_n, nrow(go_result@result))) +
ggtitle(paste(ontology, suffix)))
dev.off()
}, silent = TRUE)
# Export top terms with log10(p)
species_enrich_ready <- go_result@result[, c("ID", "p.adjust")]
species_enrich_ready$logp <- -log10(species_enrich_ready$p.adjust)
species_enrich_ready <- species_enrich_ready[order(-species_enrich_ready$logp), ]
species_enrich_ready <- head(species_enrich_ready, n = top_n)[, c("ID", "logp")]
write.table(species_enrich_ready,
file = file.path(output_dir, paste0("enrich_", ontology, "_GOs_", species, "_", suffix, ".txt")),
sep = "\t", quote = FALSE, row.names = FALSE, col.names = FALSE)
# Also export the full table if needed
write.csv(go_result@result,
file = file.path(output_dir, paste0("GO_enrichment_full_", ontology, "_", species, "_", suffix, ".csv")),
row.names = FALSE)
# ✅ Return enrichment results with Species + ONTOLOGY
df <- go_result@result %>%
dplyr::mutate(Species = species, ONTOLOGY = ontology)
return(df)
} else {
message(paste0("⚠️ No GO enrichment result to plot/export for ", species, " - ", suffix, " [", ontology, "]"))
return(NULL)
}
}
run_KEGG_enrichment_selected <- function(gene_list, species, suffix, output_dir,
show_n = 40, top_n = 40) {
if (length(gene_list) == 0) return(NULL)
kegg_result <- enrich_KEGG(gene_list, pval_cutoff = 0.05, qval_cutoff = 0.2)
if (!is.null(kegg_result) && inherits(kegg_result, "enrichResult") &&
nrow(kegg_result@result) > 0) {
try({
pdf(file = file.path(output_dir, paste0("KEGG_dotplot_", species, "_", suffix, ".pdf")),
width = 8, height = 6)
print(dotplot(kegg_result, showCategory = min(show_n, nrow(kegg_result@result))) +
ggtitle(paste("KEGG", suffix)))
dev.off()
}, silent = TRUE)
# Full result
write.csv(kegg_result@result,
file = file.path(output_dir, paste0("KEGG_enrichment_", species, "_", suffix, ".csv")),
row.names = FALSE)
# Top KEGG terms
species_enrich_kegg <- kegg_result@result[, c("ID", "p.adjust")]
species_enrich_kegg$logp <- -log10(species_enrich_kegg$p.adjust)
species_enrich_kegg <- species_enrich_kegg[order(-species_enrich_kegg$logp), ][1:min(nrow(species_enrich_kegg), top_n), ]
species_enrich_kegg <- species_enrich_kegg[, c("ID", "logp")]
write.table(species_enrich_kegg[, c("ID", "logp")],
file = file.path(output_dir, paste0("enrich_KEGG_", species, "_", suffix, ".txt")),
sep = "\t", quote = FALSE, row.names = FALSE, col.names = FALSE)
# ✅ return the full result with Species
df <- as.data.frame(kegg_result@result)
df$Species <- species
return(df)
} else {
message(paste("⚠️ No KEGG enrichment result to plot/export for", species, "-", suffix))
return(NULL)
}
}
GO_terms_list <- list()
ontologies_list <- list()
term2name_list <- list()
kegg_final_list <- list()
for (sp in species_list) {
message("Preparing annotations for ", sp)
sp_code <- species_translate[sp]
eggnog_path <- file.path(GODir, annot_map[[sp]])
gff_path <- file.path(RefDir, gff_map[[sp]])
output_dir <- file.path(enrichDir, sp)
dir.create(output_dir, recursive = TRUE, showWarnings = FALSE)
eggnog_annots <- read.delim(eggnog_path, sep = "\t", skip = 4)
eggnog_annots <- eggnog_annots[1:(nrow(eggnog_annots) - 3), ]
gff.df <- as.data.frame(import(gff_path))
protein_2_gene <- unique(gff.df[c("Name", "gene")])
protein_2_gene_df <- subset(protein_2_gene, grepl("^XP", protein_2_gene$Name))
eggnog_annots$Name <- eggnog_annots$X.query
eggnog_annots <- left_join(eggnog_annots, protein_2_gene_df, by = "Name")
eggnog_annots$X.query <- eggnog_annots$gene
# GO
GO_terms <- data.table(eggnog_annots[, c("X.query", "GOs")])
GO_terms <- GO_terms[, .(GOs = unlist(strsplit(GOs, ","))), by = X.query]
term2name <- GO_terms[, .(GOs, X.query)]
term2name$Names <- mapIds(GO.db, keys = term2name$GOs, column = "TERM", keytype = "GOID")
term2name$Ontology <- mapIds(GO.db, keys = term2name$GOs, column = "ONTOLOGY", keytype = "GOID")
term2name <- as.data.frame(term2name)
go_bp <- term2name[term2name$Ontology == "BP", c("GOs", "X.query")]
go_mf <- term2name[term2name$Ontology == "MF", c("GOs", "X.query")]
go_cc <- term2name[term2name$Ontology == "CC", c("GOs", "X.query")]
term2name_filtered <- term2name[!is.na(term2name$Names), c("GOs", "Names")]
ontologies <- list(BP = go_bp, MF = go_mf, CC = go_cc)
# KEGG
KO_terms <- data.table(eggnog_annots[, c("X.query", "KEGG_ko")])
KO_terms$KEGG_ko <- gsub("ko:", "", as.character(KO_terms$KEGG_ko))
KO_terms <- KO_terms[, .(KEGG_ko = unlist(strsplit(KEGG_ko, ","))), by = X.query]
kegg_final <- KO_terms[, .(KEGG_ko, X.query)]
# Store per species
GO_terms_list[[sp]] <- GO_terms
ontologies_list[[sp]] <- ontologies
term2name_list[[sp]] <- term2name_filtered
kegg_final_list[[sp]] <- kegg_final
}
We run the loop using the the list of selected genes extracted in step 1:
suffix <- "cafe"
species_translate <- c(
gregaria = "Sgreg",
cancellata = "Scanc",
piceifrons = "Spice",
americana = "Samer",
cubense = "Sscub",
nitens = "Snite",
locusta = "Lmigr" # proxy annotation from gregaria
)
# ===== Loop through each species =====
go_results_all <- list()
kegg_results_all <- list()
# initialize diagnostic log
diagnostic_log <- data.frame(
Species = character(),
ChangeType = character(),
CAFE_genes = integer(),
GO_overlap = integer(),
KEGG_overlap = integer(),
stringsAsFactors = FALSE
)
for (sp in species_list) {
sp_code <- species_translate[sp]
for (chg in c("Expansion", "Contraction")) {
message("Processing ", sp, " (", chg, ")")
output_dir <- file.path(enrichDir, sp, chg)
dir.create(output_dir, recursive = TRUE, showWarnings = FALSE)
# 1. Extract cafe GeneIDs for this species
species_genes <- cafe_genes %>%
filter(Species == sp_code, ChangeType == chg) %>%
pull(GeneID) %>%
unique()
n_cafe <- length(species_genes)
message("→ ", n_cafe, " Gene families with ", chg, " in ", sp)
# 2. Check overlaps with GO and KEGG annotations
go_annot <- GO_terms_list[[sp]]$X.query
kegg_annot <- kegg_final_list[[sp]]$X.query
go_overlap <- sum(species_genes %in% go_annot)
kegg_overlap <- sum(species_genes %in% kegg_annot)
message("→ Overlap counts: GO = ", go_overlap, ", KEGG = ", kegg_overlap)
if (kegg_overlap == 0) {
message(" ⚠️ No KEGG matches for ", sp,
". Example missing IDs: ",
paste(head(setdiff(species_genes, kegg_annot), 5), collapse = ", "))
}
# Add to diagnostic log
diagnostic_log <- rbind(diagnostic_log, data.frame(
Species = sp,
ChangeType = chg,
CAFE_genes = n_cafe,
GO_overlap = go_overlap,
KEGG_overlap = kegg_overlap
))
# 3. Run GO enrichment (BP, MF, CC separately)
selected_genes_go <- species_genes[species_genes %in% go_annot]
go_by_onto <- list()
for (onto in names(ontologies_list[[sp]])) {
go_res <- run_GO_enrichment_selected(
gene_list = selected_genes_go,
go_table = ontologies_list[[sp]][[onto]],
term2name = term2name_list[[sp]],
species = paste0(sp, "_", chg),
suffix = suffix,
ontology = onto,
output_dir = output_dir
)
if (!is.null(go_res)) {
go_res <- as.data.frame(go_res) %>%
mutate(Species = sp, ChangeType = chg, ONTOLOGY = onto)
go_by_onto[[onto]] <- go_res
}
}
go_results_all[[paste0(sp, "_", chg)]] <- go_by_onto
# 4. Run KEGG enrichment
selected_genes_kegg <- species_genes[species_genes %in% kegg_annot]
kegg_final <<- kegg_final_list[[sp]]
kegg_res <- run_KEGG_enrichment_selected(
gene_list = selected_genes_kegg,
species = paste0(sp, "_", chg),
suffix = suffix,
output_dir = output_dir
)
if (!is.null(kegg_res)) {
kegg_results_all[[paste0(sp, "_", chg)]] <- kegg_res %>%
mutate(Species = sp, ChangeType = chg)
}
}
}
# save diagnostic log
write.csv(diagnostic_log,
file.path(enrichDir, "diagnostic_overlap_log.csv"),
row.names = FALSE)
print(diagnostic_log)
Species ChangeType CAFE_genes GO_overlap KEGG_overlap
1 gregaria Expansion 1178 927 927
2 gregaria Contraction 123 108 108
3 cancellata Expansion 831 640 640
4 cancellata Contraction 215 177 177
5 piceifrons Expansion 1057 793 793
6 piceifrons Contraction 177 150 150
7 americana Expansion 687 568 568
8 americana Contraction 184 135 135
9 cubense Expansion 824 623 623
10 cubense Contraction 189 171 171
11 nitens Expansion 830 599 599
12 nitens Contraction 238 215 215
13 locusta Expansion 2162 2013 2013
14 locusta Contraction 843 704 704
We need to make a heat map with all species:
library(dplyr)
library(tidyr)
library(pheatmap)
library(tibble)
# 1. Flatten all species results
all_go_results <- bind_rows(
lapply(go_results_all, function(sp_list) bind_rows(sp_list))
)
# 2. Loop through each ontology
for (onto in c("BP", "MF", "CC")) {
# Filter for this ontology
onto_res <- all_go_results %>% filter(ONTOLOGY == onto)
if (nrow(onto_res) == 0) {
message("⚠️ No results for ", onto)
next
}
# 3. Take top 10 per species
top_go <- onto_res %>%
group_by(Species, ChangeType) %>%
slice_min(order_by = p.adjust, n = 15, with_ties = FALSE) %>%
ungroup()
# Add signed score: expansions positive, contractions negative
top_go <- top_go %>%
mutate(score = -log10(p.adjust) * ifelse(ChangeType == "Contraction", -1, 1))
# Build matrix
# Build matrix
heatmap_mat <- top_go %>%
dplyr::select(Description, Species, ChangeType, score) %>%
pivot_wider(names_from = Description, values_from = score, values_fill = 0) %>%
mutate(RowID = paste(Species, ChangeType, sep = "_")) %>%
column_to_rownames("RowID")
# 🔑 Ensure only numeric columns remain
heatmap_mat <- heatmap_mat %>%
dplyr::select(where(is.numeric)) %>%
as.matrix()
# Define desired species order
species_order <- c("locusta", "gregaria", "nitens",
"cancellata", "piceifrons", "americana", "cubense")
row_order <- unlist(lapply(species_order, function(sp) c(paste0(sp, "_Expansion"),
paste0(sp, "_Contraction"))))
heatmap_mat <- heatmap_mat[row_order[row_order %in% rownames(heatmap_mat)], , drop = FALSE]
# Set color scale: 0 = white, gradient for >0
max_val <- max(abs(heatmap_mat), na.rm = TRUE)
breaks <- seq(-max_val, max_val, length.out = 101)
colors <- colorRampPalette(c("blue", "white", "orange"))(100)
# Save heatmap
out_pdf <- file.path(enrichDir, paste0("GO_", onto, "_heatmap_top15_ExpVsCon.pdf"))
pdf(out_pdf, width = 12, height = 6)
pheatmap(
mat = heatmap_mat,
cluster_rows = FALSE, # keep species order fixed
cluster_cols = FALSE, # cluster GO terms
scale = "none",
fontsize_row = 12,
fontsize_col = 6,
angle_col = 90,
color = colors,
breaks = breaks,
main = paste("Top 10 GO",
ifelse(onto == "BP", "Biological Process",
ifelse(onto == "MF", "Molecular Function", "Cellular Component")),
"terms per species")
)
dev.off()
}
This code is to make a KEGG pathway enrichment:
library(dplyr)
library(tidyr)
library(ggplot2)
# === Collect KEGG results ===
all_kegg_results <- bind_rows(
lapply(names(kegg_results_all), function(sp) {
kegg_results_all[[sp]]
})
)
# === Take top 15 per species (expansion + contraction separately) ===
top_kegg <- all_kegg_results %>%
group_by(Species, ChangeType) %>%
slice_min(order_by = p.adjust, n = 15, with_ties = FALSE) %>%
ungroup()
# === Clean dataframe ===
heatmap_df_kegg <- top_kegg %>%
dplyr::select(Species, category, subcategory, Count, ChangeType) %>%
filter(!is.na(category), !is.na(subcategory)) %>%
mutate(Count = as.numeric(Count))
# === Species order ===
species_order <- c("cubense", "americana", "piceifrons",
"cancellata", "nitens", "gregaria", "locusta")
heatmap_df_kegg$Species <- factor(heatmap_df_kegg$Species, levels = species_order)
# === Subcategory order within each category ===
heatmap_df_kegg <- heatmap_df_kegg %>%
group_by(category) %>%
mutate(subcategory = factor(subcategory, levels = unique(subcategory))) %>%
ungroup()
# === Shared max value for scales ===
max_val <- max(heatmap_df_kegg$Count, na.rm = TRUE)
# === Expansion heatmap (red) ===
out_pdf_exp <- file.path(enrichDir, "KEGG_subcategory_faceted_heatmap_Expansion.pdf")
pdf(out_pdf_exp, width = 14, height = 8)
ggplot(filter(heatmap_df_kegg, ChangeType == "Expansion"),
aes(x = subcategory, y = Species, fill = Count)) +
geom_tile(color = "black", size = 0.3) +
scale_fill_gradientn(colors = c("white", "turquoise3", "blue"),
limits = c(0, 26),
name = "# Genes Expanded") +
facet_wrap(~ category, scales = "free_x", nrow = 1) +
theme_minimal(base_size = 12) +
theme(
axis.text.x = element_text(angle = 90, hjust = 1, vjust = 1, size = 12),
axis.text.y = element_text(size = 12),
panel.grid = element_blank(),
strip.text = element_text(face = "bold", size = 8)
) +
labs(title = "",
x = "", y = "")
dev.off()
quartz_off_screen
2
# === Contraction heatmap (blue) ===
out_pdf_con <- file.path(enrichDir, "KEGG_subcategory_faceted_heatmap_Contraction.pdf")
pdf(out_pdf_con, width = 14, height = 8)
ggplot(filter(heatmap_df_kegg, ChangeType == "Contraction"),
aes(x = subcategory, y = Species, fill = Count)) +
geom_tile(color = "black", size = 0.3) +
scale_fill_gradientn(colors = c("white", "orange", "darkred"),
limits = c(0, 15),
name = "# Genes Contracted") +
facet_wrap(~ category, scales = "free_x", nrow = 1) +
theme_minimal(base_size = 12) +
theme(
axis.text.x = element_text(angle = 45, hjust = 1, vjust = 1, size = 12),
axis.text.y = element_text(size = 9),
panel.grid = element_blank(),
strip.text = element_text(face = "bold", size = 8)
) +
labs(title = "Top 15 KEGG subcategories per species (Contractions)",
x = "Subcategory", y = "Species")
dev.off()
quartz_off_screen
2
library(dplyr)
library(tidyr)
library(readr)
library(purrr)
# Load files
gene_count <- read.delim("/Users/maevatecher/Documents/GitHub/locust-comparative-genomics/data/cafe5_results/Orthogroups.GeneCount.tsv", sep = "\t", header = TRUE, check.names = FALSE)
# 2. Remove `_filteredTranscripts` suffix from column names
colnames(gene_count) <- gsub("_filteredTranscripts", "", colnames(gene_count))
# 3. Pivot longer for easier handling (Orthogroup–Species–Count format)
gene_count_long <- gene_count %>%
pivot_longer(
cols = -Orthogroup,
names_to = "Species",
values_to = "Count"
)
# ✅ Now gene_count_long has clean species names
head(gene_count_long)
# A tibble: 6 × 3
Orthogroup Species Count
<chr> <chr> <int>
1 OG0000000 Asimp 6
2 OG0000000 Brsri 10
3 OG0000000 Csecu 20
4 OG0000000 Gbima 1
5 OG0000000 Glong 17
6 OG0000000 Lmigr 12
# -------------------------
# Load orthogroup map
# -------------------------
ortho_dir <- "/Users/maevatecher/Documents/GitHub/locust-comparative-genomics/data/orthofinder/Polyneoptera"
input_file <- file.path(ortho_dir, "Results_I2_iqtree/Orthogroups_genesproteinbiotype_13species_annotated_May2025.csv")
ortho_map <- read.csv(input_file, header = TRUE, stringsAsFactors = FALSE)
# -------------------------
# Species translation
# -------------------------
species_translate <- c(
Sgreg = "gregaria",
Scanc = "cancellata",
Spice = "piceifrons",
Samer = "americana",
Sscub = "cubense",
Snite = "nitens",
Lmigr = "locusta"
)
# -------------------------
# Build expanded_genes tibble
# -------------------------
cafe_genes <- map_dfr(names(species_translate), function(sp) {
message("Collecting expanded OGs for ", sp)
expanded_ogs <- gene_count %>%
filter(!!sym(sp) > 10) %>%
pull(Orthogroup)
# If Lmigr -> reuse Sgreg’s genes
if (sp == "Lmigr") {
proxy_sp <- "Sgreg"
} else {
proxy_sp <- sp
}
ortho_map %>%
filter(Species == proxy_sp, Orthogroup %in% expanded_ogs) %>%
mutate(Species = sp) %>% # keep label as Lmigr
dplyr::select(Species, Orthogroup, GeneID) %>%
distinct()
})
# Preview
head(cafe_genes)
Species Orthogroup GeneID
1 Sgreg OG0000000 LOC126284328
2 Sgreg OG0000000 LOC126284627
3 Sgreg OG0000000 LOC126285012
4 Sgreg OG0000000 LOC126291426
5 Sgreg OG0000000 LOC126335618
6 Sgreg OG0000000 LOC126335743
Then we use the same function we used for DEGs GO terms and KEGG enrichment:
# === Paths and Constants ===
workDir <- "/Users/maevatecher/Documents/GitHub/locust-comparative-genomics/data"
GODir <- file.path(workDir, "list", "GO_Annotations")
RefDir <- file.path(workDir, "RefSeq")
enrichDir <- file.path(workDir, "cafe5_results/Gene_count_FILE")
species_list <- c("gregaria", "cancellata", "piceifrons", "americana", "cubense", "nitens", "locusta")
# === Load Required Libraries ===
library(data.table)
library(dplyr)
library(readr)
library(clusterProfiler)
library(GO.db)
library(rtracklayer)
library(DesertLocustR) # Local installation
gff_map <- c(
gregaria = "GCF_023897955.1_iqSchGreg1.2_genomic.gff",
cancellata = "GCF_023864275.1_iqSchCanc2.1_genomic.gff",
piceifrons = "GCF_021461385.2_iqSchPice1.1_genomic.gff",
americana = "GCF_021461395.2_iqSchAmer2.1_genomic.gff",
cubense = "GCF_023864345.2_iqSchSeri2.2_genomic.gff",
nitens = "GCF_023898315.1_iqSchNite1.1_genomic.gff",
locusta = "GCF_023897955.1_iqSchGreg1.2_genomic.gff"
)
annot_map <- c(
gregaria = "EggNog_Arthropoda_one2one.emapper.annotations",
cancellata = "GCF_023864275.1_iqSchCanc2.1_Arthopoda_one2one.emapper.annotations",
piceifrons = "GCF_021461385.2_iqSchPice1.1_Arthopoda_one2one.emapper.annotations",
americana = "GCF_021461395.2_iqSchAmer2.1_Arthopoda_one2one.emapper.annotations",
cubense = "GCF_023864345.2_iqSchSeri2.2_Arthopoda_one2one.emapper.annotations",
nitens = "GCF_023898315.1_iqSchNite1.1_Arthopoda_one2one.emapper.annotations",
locusta = "EggNog_Arthropoda_one2one.emapper.annotations"
)
# GO enrichment
# GO enrichment
enrich_GO <- function(dge_genes.df, term2gene, term2name, pval, qval){
genes <- rownames(dge_genes.df)
enricher(
genes,
TERM2GENE = term2gene,
TERM2NAME = term2name,
pvalueCutoff = pval,
pAdjustMethod = "BH",
qvalueCutoff = qval
)
}
# KEGG preparation
assign_kegg_ids <- function(sig_genes.df){
if (is.vector(sig_genes.df)) {
sig_genes.df <- data.frame(X.query = sig_genes.df, stringsAsFactors = FALSE)
} else {
sig_genes.df$X.query <- rownames(sig_genes.df)
}
dge_with_kegg_ids <- left_join(sig_genes.df, kegg_final, by = "X.query")
dge_with_kegg_ids$KEGG_ko[grepl("^K", dge_with_kegg_ids$KEGG_ko)]
}
# KEGG enrichment
enrich_KEGG <- function(dge_genes.df, pval_cutoff = 0.05, qval_cutoff = 0.2) {
gene_with_kegg_ids <- assign_kegg_ids(dge_genes.df)
enrichKEGG(
gene = gene_with_kegg_ids,
organism = "ko",
pvalueCutoff = pval_cutoff,
qvalueCutoff = qval_cutoff,
pAdjustMethod = "BH"
)
}
run_GO_enrichment_selected <- function(
gene_list,
go_table,
term2name,
species,
suffix,
ontology,
output_dir,
show_n = 30,
top_n = 30
) {
if (length(gene_list) == 0) return(NULL)
if (!dir.exists(output_dir)) {
dir.create(output_dir, recursive = TRUE)
}
# Make sure column names are correct for clusterProfiler::enricher()
go_table_fixed <- go_table[, 1:2]
colnames(go_table_fixed) <- c("go_id", "gene_id")
term2name_fixed <- term2name[, 1:2]
colnames(term2name_fixed) <- c("go_id", "name")
# Run enrichment
go_result <- enricher(
gene = gene_list,
TERM2GENE = go_table_fixed,
TERM2NAME = term2name_fixed,
pvalueCutoff = 0.05,
qvalueCutoff = 0.2
)
if (!is.null(go_result) &&
inherits(go_result, "enrichResult") &&
nrow(go_result@result) > 0 &&
sum(!is.na(go_result@result$Description)) > 0) {
# Save dotplot
try({
pdf(file = file.path(output_dir, paste0("GO_", ontology, "_dotplot_", species, "_", suffix, ".pdf")),
width = 8, height = 6)
print(dotplot(go_result, showCategory = min(show_n, nrow(go_result@result))) +
ggtitle(paste(ontology, suffix)))
dev.off()
}, silent = TRUE)
# Export top terms with log10(p)
species_enrich_ready <- go_result@result[, c("ID", "p.adjust")]
species_enrich_ready$logp <- -log10(species_enrich_ready$p.adjust)
species_enrich_ready <- species_enrich_ready[order(-species_enrich_ready$logp), ]
species_enrich_ready <- head(species_enrich_ready, n = top_n)[, c("ID", "logp")]
write.table(species_enrich_ready,
file = file.path(output_dir, paste0("enrich_", ontology, "_GOs_", species, "_", suffix, ".txt")),
sep = "\t", quote = FALSE, row.names = FALSE, col.names = FALSE)
# Also export the full table if needed
write.csv(go_result@result,
file = file.path(output_dir, paste0("GO_enrichment_full_", ontology, "_", species, "_", suffix, ".csv")),
row.names = FALSE)
# ✅ Return enrichment results with Species + ONTOLOGY
df <- go_result@result %>%
dplyr::mutate(Species = species, ONTOLOGY = ontology)
return(df)
} else {
message(paste0("⚠️ No GO enrichment result to plot/export for ", species, " - ", suffix, " [", ontology, "]"))
return(NULL)
}
}
run_KEGG_enrichment_selected <- function(gene_list, species, suffix, output_dir,
show_n = 40, top_n = 40) {
if (length(gene_list) == 0) return(NULL)
kegg_result <- enrich_KEGG(gene_list, pval_cutoff = 0.05, qval_cutoff = 0.2)
if (!is.null(kegg_result) && inherits(kegg_result, "enrichResult") &&
nrow(kegg_result@result) > 0) {
try({
pdf(file = file.path(output_dir, paste0("KEGG_dotplot_", species, "_", suffix, ".pdf")),
width = 8, height = 6)
print(dotplot(kegg_result, showCategory = min(show_n, nrow(kegg_result@result))) +
ggtitle(paste("KEGG", suffix)))
dev.off()
}, silent = TRUE)
# Full result
write.csv(kegg_result@result,
file = file.path(output_dir, paste0("KEGG_enrichment_", species, "_", suffix, ".csv")),
row.names = FALSE)
# Top KEGG terms
species_enrich_kegg <- kegg_result@result[, c("ID", "p.adjust")]
species_enrich_kegg$logp <- -log10(species_enrich_kegg$p.adjust)
species_enrich_kegg <- species_enrich_kegg[order(-species_enrich_kegg$logp), ][1:min(nrow(species_enrich_kegg), top_n), ]
species_enrich_kegg <- species_enrich_kegg[, c("ID", "logp")]
write.table(species_enrich_kegg[, c("ID", "logp")],
file = file.path(output_dir, paste0("enrich_KEGG_", species, "_", suffix, ".txt")),
sep = "\t", quote = FALSE, row.names = FALSE, col.names = FALSE)
# ✅ return the full result with Species
df <- as.data.frame(kegg_result@result)
df$Species <- species
return(df)
} else {
message(paste("⚠️ No KEGG enrichment result to plot/export for", species, "-", suffix))
return(NULL)
}
}
GO_terms_list <- list()
ontologies_list <- list()
term2name_list <- list()
kegg_final_list <- list()
for (sp in species_list) {
message("Preparing annotations for ", sp)
sp_code <- species_translate[sp]
eggnog_path <- file.path(GODir, annot_map[[sp]])
gff_path <- file.path(RefDir, gff_map[[sp]])
output_dir <- file.path(enrichDir, sp)
dir.create(output_dir, recursive = TRUE, showWarnings = FALSE)
eggnog_annots <- read.delim(eggnog_path, sep = "\t", skip = 4)
eggnog_annots <- eggnog_annots[1:(nrow(eggnog_annots) - 3), ]
gff.df <- as.data.frame(import(gff_path))
protein_2_gene <- unique(gff.df[c("Name", "gene")])
protein_2_gene_df <- subset(protein_2_gene, grepl("^XP", protein_2_gene$Name))
eggnog_annots$Name <- eggnog_annots$X.query
eggnog_annots <- left_join(eggnog_annots, protein_2_gene_df, by = "Name")
eggnog_annots$X.query <- eggnog_annots$gene
# GO
GO_terms <- data.table(eggnog_annots[, c("X.query", "GOs")])
GO_terms <- GO_terms[, .(GOs = unlist(strsplit(GOs, ","))), by = X.query]
term2name <- GO_terms[, .(GOs, X.query)]
term2name$Names <- mapIds(GO.db, keys = term2name$GOs, column = "TERM", keytype = "GOID")
term2name$Ontology <- mapIds(GO.db, keys = term2name$GOs, column = "ONTOLOGY", keytype = "GOID")
term2name <- as.data.frame(term2name)
go_bp <- term2name[term2name$Ontology == "BP", c("GOs", "X.query")]
go_mf <- term2name[term2name$Ontology == "MF", c("GOs", "X.query")]
go_cc <- term2name[term2name$Ontology == "CC", c("GOs", "X.query")]
term2name_filtered <- term2name[!is.na(term2name$Names), c("GOs", "Names")]
ontologies <- list(BP = go_bp, MF = go_mf, CC = go_cc)
# KEGG
KO_terms <- data.table(eggnog_annots[, c("X.query", "KEGG_ko")])
KO_terms$KEGG_ko <- gsub("ko:", "", as.character(KO_terms$KEGG_ko))
KO_terms <- KO_terms[, .(KEGG_ko = unlist(strsplit(KEGG_ko, ","))), by = X.query]
kegg_final <- KO_terms[, .(KEGG_ko, X.query)]
# Store per species
GO_terms_list[[sp]] <- GO_terms
ontologies_list[[sp]] <- ontologies
term2name_list[[sp]] <- term2name_filtered
kegg_final_list[[sp]] <- kegg_final
}
We run the loop using the the list of selected genes extracted in step 1:
suffix <- "cafe"
species_translate <- c(
gregaria = "Sgreg",
cancellata = "Scanc",
piceifrons = "Spice",
americana = "Samer",
cubense = "Sscub",
nitens = "Snite",
locusta = "Lmigr" # proxy annotation from gregaria
)
# ===== Loop through each species =====
go_results_all <- list()
kegg_results_all <- list()
# initialize diagnostic log
diagnostic_log <- data.frame(
Species = character(),
CAFE_genes = integer(),
GO_overlap = integer(),
KEGG_overlap = integer(),
stringsAsFactors = FALSE
)
for (sp in species_list) {
message("Processing ", sp)
sp_code <- species_translate[sp]
output_dir <- file.path(enrichDir, sp)
dir.create(output_dir, recursive = TRUE, showWarnings = FALSE)
# 1. Extract cafe GeneIDs for this species
species_genes <- cafe_genes %>%
filter(Species == sp_code) %>%
pull(GeneID) %>%
unique()
n_cafe <- length(species_genes)
message("→ ", n_cafe, " Gene families expanding in ", sp)
# 2. Check overlaps with GO and KEGG annotations
go_annot <- GO_terms_list[[sp]]$X.query
kegg_annot <- kegg_final_list[[sp]]$X.query
go_overlap <- sum(species_genes %in% go_annot)
kegg_overlap <- sum(species_genes %in% kegg_annot)
message("→ Overlap counts: GO = ", go_overlap, ", KEGG = ", kegg_overlap)
if (kegg_overlap == 0) {
message(" ⚠️ No KEGG matches for ", sp,
". Example missing IDs: ",
paste(head(setdiff(species_genes, kegg_annot), 5), collapse = ", "))
}
# Add to diagnostic log
diagnostic_log <- rbind(diagnostic_log, data.frame(
Species = sp,
CAFE_genes = n_cafe,
GO_overlap = go_overlap,
KEGG_overlap = kegg_overlap
))
# 3. Run GO enrichment (BP, MF, CC separately)
selected_genes_go <- species_genes[species_genes %in% go_annot]
go_by_onto <- list()
for (onto in names(ontologies_list[[sp]])) {
go_res <- run_GO_enrichment_selected(
gene_list = selected_genes_go,
go_table = ontologies_list[[sp]][[onto]], # term2gene mapping
term2name = term2name_list[[sp]], # term2name mapping
species = sp,
suffix = suffix,
ontology = onto,
output_dir = output_dir
)
if (!is.null(go_res)) {
go_res <- as.data.frame(go_res) %>%
mutate(Species = sp, ONTOLOGY = onto) # 👈 add both species + ontology here
go_by_onto[[onto]] <- go_res
}
}
go_results_all[[sp]] <- go_by_onto
# 4. Run KEGG enrichment
selected_genes_kegg <- species_genes[species_genes %in% kegg_annot]
kegg_final <<- kegg_final_list[[sp]] # required by assign_kegg_ids()
kegg_results_all[[sp]] <- run_KEGG_enrichment_selected(
gene_list = selected_genes_kegg,
species = sp,
suffix = suffix,
output_dir = output_dir
)
}
# save diagnostic log
write.csv(diagnostic_log, file.path(enrichDir, "diagnostic_overlap_log.csv"), row.names = FALSE)
print(diagnostic_log)
Species CAFE_genes GO_overlap KEGG_overlap
1 gregaria 817 500 500
2 cancellata 916 569 569
3 piceifrons 1146 588 588
4 americana 1413 825 825
5 cubense 965 620 620
6 nitens 1169 472 472
7 locusta 352 272 272
We need to make a heat map with all species:
library(dplyr)
library(tidyr)
library(pheatmap)
library(tibble)
# 1. Flatten all species results
all_go_results <- bind_rows(lapply(go_results_all, bind_rows))
# 2. Loop through each ontology
for (onto in c("BP", "MF", "CC")) {
# Filter for this ontology
onto_res <- all_go_results %>% filter(ONTOLOGY == onto)
# Skip if empty
if (nrow(onto_res) == 0) {
message("⚠️ No results for ", onto)
next
}
# 3. Take top 10 per species
top_go <- onto_res %>%
group_by(Species) %>%
slice_min(order_by = p.adjust, n = 15, with_ties = FALSE) %>%
ungroup()
# 4. Build matrix of -log10(p.adjust)
heatmap_mat <- top_go %>%
mutate(score = -log10(p.adjust)) %>%
dplyr::select(Description, Species, score) %>%
pivot_wider(names_from = Species, values_from = score, values_fill = 0) %>%
column_to_rownames("Description") %>%
as.matrix()
# Define desired species order
species_order <- c("locusta", "gregaria", "nitens",
"cancellata", "piceifrons", "americana", "cubense")
# Flip matrix: species as rows, GO terms as columns
heatmap_mat_flipped <- t(heatmap_mat)
# Reorder rows to match your desired order
heatmap_mat_flipped <- heatmap_mat_flipped[species_order, , drop = FALSE]
# Set color scale: 0 = white, gradient for >0
max_val <- max(heatmap_mat_flipped, na.rm = TRUE)
breaks <- c(0, seq(1e-6, max_val, length.out = 100))
colors <- c("white", colorRampPalette(c("orange", "darkred"))(length(breaks) - 2))
# Save heatmap
out_pdf <- file.path(enrichDir, paste0("GO_", onto, "_heatmap_top15.pdf"))
pdf(out_pdf, width = 12, height = 6)
pheatmap(
mat = heatmap_mat_flipped,
cluster_rows = FALSE, # keep species order fixed
cluster_cols = FALSE, # cluster GO terms
scale = "none",
fontsize_row = 12,
fontsize_col = 6,
angle_col = 90,
color = colors,
breaks = breaks,
main = paste("Top 10 GO",
ifelse(onto == "BP", "Biological Process",
ifelse(onto == "MF", "Molecular Function", "Cellular Component")),
"terms per species")
)
dev.off()
}
This code is to make a KEGG pathway enrichment:
library(dplyr)
library(tidyr)
library(pheatmap)
# === Collect KEGG results ===
all_kegg_results <- bind_rows(
lapply(names(kegg_results_all), function(sp) {
kegg_results_all[[sp]] # now each is already a df with Species
})
)
# === Take top 10 per species ===
top_kegg <- all_kegg_results %>%
group_by(Species) %>%
slice_min(order_by = p.adjust, n = 15, with_ties = FALSE) %>%
ungroup()
library(dplyr)
library(tidyr)
library(ggplot2)
# === Start from top_kegg ===
heatmap_df_kegg <- top_kegg %>%
dplyr::select(Species, category, subcategory, Count) %>%
filter(!is.na(category), !is.na(subcategory)) %>%
group_by(Species, category, subcategory) %>%
summarise(Count = max(Count, na.rm = TRUE), .groups = "drop")
# Fix Species order (rows)
species_order <- c( "cubense", "americana","piceifrons","cancellata","nitens", "gregaria","locusta")
heatmap_df_kegg$Species <- factor(heatmap_df_kegg$Species, levels = species_order)
# Optional: order subcategories alphabetically within each category
heatmap_df_kegg <- heatmap_df_kegg %>%
group_by(category) %>%
mutate(subcategory = factor(subcategory, levels = unique(subcategory))) %>%
ungroup()
# === Set color scale (white for 0, gradient for >0) ===
max_val <- max(heatmap_df_kegg$Count, na.rm = TRUE)
colors <- c("white", colorRampPalette(c("orange", "darkred"))(10))
# === Plot with ggplot2 ===
out_pdf <- file.path(enrichDir, "KEGG_subcategory_faceted_heatmap_Count.pdf")
pdf(out_pdf, width = 14, height = 8)
ggplot(heatmap_df_kegg, aes(x = subcategory, y = Species, fill = Count)) +
geom_tile(color = "black", size = 0.3, na.rm = FALSE) +
scale_fill_gradientn(colors = colors, limits = c(0, max_val)) +
facet_wrap(~ category, scales = "free_x", nrow = 1) +
theme_minimal(base_size = 12) +
theme(
axis.text.x = element_text(angle = 45, hjust = 1, vjust = 1, size = 8),
axis.text.y = element_text(size = 9),
panel.grid = element_blank(),
strip.text = element_text(face = "bold", size = 8)
) +
labs(
title = "Top 20 KEGG subcategories per species (colored by gene count)",
x = "Subcategory",
y = "Species",
fill = "# Genes"
)
dev.off()
quartz_off_screen
2
sessionInfo()
R version 4.4.2 (2024-10-31)
Platform: aarch64-apple-darwin20
Running under: macOS Sequoia 15.6.1
Matrix products: default
BLAS: /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRblas.0.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRlapack.dylib; LAPACK version 3.12.0
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
time zone: Asia/Tokyo
tzcode source: internal
attached base packages:
[1] stats4 stats graphics grDevices utils datasets methods
[8] base
other attached packages:
[1] ggplot2_3.5.2 tibble_3.3.0 pheatmap_1.0.13
[4] DesertLocustR_0.1.0 remotes_2.5.0 Biostrings_2.74.1
[7] XVector_0.46.0 AnnotationHub_3.14.0 BiocFileCache_2.14.0
[10] dbplyr_2.5.0 rtracklayer_1.66.0 GenomicRanges_1.58.0
[13] GenomeInfoDb_1.42.3 GO.db_3.20.0 AnnotationDbi_1.68.0
[16] IRanges_2.40.1 S4Vectors_0.44.0 Biobase_2.66.0
[19] BiocGenerics_0.52.0 clusterProfiler_4.14.6 data.table_1.17.6
[22] purrr_1.0.4 readr_2.1.5 tidyr_1.3.1
[25] dplyr_1.1.4
loaded via a namespace (and not attached):
[1] RColorBrewer_1.1-3 rstudioapi_0.17.1
[3] jsonlite_2.0.0 magrittr_2.0.3
[5] ggtangle_0.0.6 farver_2.1.2
[7] rmarkdown_2.29 fs_1.6.6
[9] BiocIO_1.16.0 zlibbioc_1.52.0
[11] vctrs_0.6.5 memoise_2.0.1
[13] Rsamtools_2.22.0 RCurl_1.98-1.17
[15] ggtree_3.14.0 htmltools_0.5.8.1
[17] S4Arrays_1.6.0 curl_6.4.0
[19] SparseArray_1.6.2 gridGraphics_0.5-1
[21] sass_0.4.10 bslib_0.9.0
[23] plyr_1.8.9 cachem_1.1.0
[25] GenomicAlignments_1.42.0 whisker_0.4.1
[27] igraph_2.1.4 lifecycle_1.0.4
[29] pkgconfig_2.0.3 Matrix_1.7-3
[31] R6_2.6.1 fastmap_1.2.0
[33] gson_0.1.0 GenomeInfoDbData_1.2.13
[35] MatrixGenerics_1.18.1 digest_0.6.37
[37] aplot_0.2.7 enrichplot_1.26.6
[39] colorspace_2.1-1 patchwork_1.3.1
[41] rprojroot_2.0.4 RSQLite_2.4.1
[43] labeling_0.4.3 filelock_1.0.3
[45] httr_1.4.7 abind_1.4-8
[47] compiler_4.4.2 bit64_4.6.0-1
[49] withr_3.0.2 BiocParallel_1.40.2
[51] DBI_1.2.3 R.utils_2.13.0
[53] rappdirs_0.3.3 DelayedArray_0.32.0
[55] rjson_0.2.23 tools_4.4.2
[57] ape_5.8-1 httpuv_1.6.16
[59] R.oo_1.27.1 glue_1.8.0
[61] restfulr_0.0.15 nlme_3.1-168
[63] GOSemSim_2.32.0 promises_1.3.3
[65] grid_4.4.2 reshape2_1.4.4
[67] fgsea_1.32.4 generics_0.1.4
[69] gtable_0.3.6 tzdb_0.5.0
[71] R.methodsS3_1.8.2 hms_1.1.3
[73] utf8_1.2.6 BiocVersion_3.20.0
[75] ggrepel_0.9.6 pillar_1.10.2
[77] stringr_1.5.1 yulab.utils_0.2.0
[79] later_1.4.2 splines_4.4.2
[81] treeio_1.30.0 lattice_0.22-7
[83] bit_4.6.0 tidyselect_1.2.1
[85] knitr_1.50 git2r_0.36.2
[87] SummarizedExperiment_1.36.0 xfun_0.52
[89] matrixStats_1.5.0 stringi_1.8.7
[91] UCSC.utils_1.2.0 workflowr_1.7.1
[93] lazyeval_0.2.2 ggfun_0.1.9
[95] yaml_2.3.10 evaluate_1.0.4
[97] codetools_0.2-20 qvalue_2.38.0
[99] BiocManager_1.30.26 ggplotify_0.1.2
[101] cli_3.6.5 jquerylib_0.1.4
[103] dichromat_2.0-0.1 Rcpp_1.0.14
[105] png_0.1-8 XML_3.99-0.18
[107] parallel_4.4.2 blob_1.2.4
[109] DOSE_4.0.1 bitops_1.0-9
[111] tidytree_0.4.6 scales_1.4.0
[113] crayon_1.5.3 rlang_1.1.6
[115] cowplot_1.1.3 fastmatch_1.1-6
[117] KEGGREST_1.46.0