Last updated: 2025-03-03

Checks: 5 2

Knit directory: locust-comparative-genomics/

This reproducible R Markdown analysis was created with workflowr (version 1.7.1). The Checks tab describes the reproducibility checks that were applied when the results were created. The Past versions tab lists the development history.

R Markdown file: uncommitted changes

The R Markdown file has unstaged changes. To know which version of the R Markdown file created these results, you’ll want to first commit it to the Git repo. If you’re still working on the analysis, you can ignore this warning. When you’re finished, you can run wflow_publish to commit the R Markdown file and build the HTML.

Environment: empty

Great job! The global environment was empty. Objects defined in the global environment can affect the analysis in your R Markdown file in unknown ways. For reproduciblity it’s best to always run the code in an empty environment.

Seed: set.seed(20221025)

The command set.seed(20221025) was run prior to running the code in the R Markdown file. Setting a seed ensures that any results that rely on randomness, e.g. subsampling or permutations, are reproducible.

Session information: recorded

Great job! Recording the operating system, R version, and package versions is critical for reproducibility.

Cache: none

Nice! There were no cached chunks for this analysis, so you can be confident that you successfully produced the results during this run.

File paths: absolute

Using absolute paths to the files within your workflowr project makes it difficult for you and others to run your code on a different machine. Change the absolute path(s) below to the suggested relative path(s) to make your code more reproducible.

absolute	relative
/Users/maevatecher/Documents/GitHub/locust-comparative-genomics/data	data

Repository version: f8bda68

Great! You are using Git for version control. Tracking code development and connecting the code version to the results is critical for reproducibility.

The results in this page were generated with repository version f8bda68. See the Past versions tab to see a history of the changes made to the R Markdown and HTML files.

Note that you need to be careful to ensure that all relevant files for the analysis have been committed to Git prior to generating the results (you can use wflow_publish or wflow_git_commit). workflowr only checks the R Markdown file, but you know if there are other scripts or data files that it depends on. Below is the status of the Git repository when the results were generated:


Ignored files:
    Ignored:    .DS_Store
    Ignored:    analysis/.DS_Store
    Ignored:    analysis/.Rhistory
    Ignored:    analysis/figure/
    Ignored:    data/.DS_Store
    Ignored:    data/DEG_results/.DS_Store
    Ignored:    data/DEG_results/Bulk_RNAseq/.DS_Store
    Ignored:    data/DEG_results/Bulk_RNAseq/americana/.DS_Store
    Ignored:    data/DEG_results/Bulk_RNAseq/cancellata/.DS_Store
    Ignored:    data/DEG_results/Bulk_RNAseq/cubense/.DS_Store
    Ignored:    data/DEG_results/Bulk_RNAseq/davidO/.DS_Store
    Ignored:    data/DEG_results/Bulk_RNAseq/gregaria/.DS_Store
    Ignored:    data/DEG_results/Bulk_RNAseq/nitens/.DS_Store
    Ignored:    data/DEG_results/Bulk_RNAseq/piceifrons/.DS_Store
    Ignored:    data/DEG_results/RNAi/.DS_Store
    Ignored:    data/DEG_results/RNAi/All/.DS_Store
    Ignored:    data/DEG_results/RNAi/All_control/.DS_Store
    Ignored:    data/DEG_results/RNAi/All_no_rRNA/.DS_Store
    Ignored:    data/DEG_results/RNAi/Head/.DS_Store
    Ignored:    data/DEG_results/RNAi/Head_control/.DS_Store
    Ignored:    data/DEG_results/RNAi/Head_no_rRNA/.DS_Store
    Ignored:    data/DEG_results/RNAi/Thorax/.DS_Store
    Ignored:    data/DEG_results/RNAi/Thorax_no_rRNA/.DS_Store
    Ignored:    data/DEG_results/gregaria/
    Ignored:    data/DEG_results/single_cell/.DS_Store
    Ignored:    data/WGCNA/.DS_Store
    Ignored:    data/WGCNA/input/.DS_Store
    Ignored:    data/WGCNA/input/Bulk_RNAseq/.DS_Store
    Ignored:    data/WGCNA/output/
    Ignored:    data/behavioral_data/.DS_Store
    Ignored:    data/behavioral_data/Raw_data/.DS_Store
    Ignored:    data/custom_sgregaria_orgdb/.DS_Store
    Ignored:    data/list/.DS_Store
    Ignored:    data/list/Bulk_RNAseq/.DS_Store
    Ignored:    data/list/GO_Annotations/.DS_Store
    Ignored:    data/orthofinder/.DS_Store
    Ignored:    data/orthofinder/Polyneoptera/.DS_Store
    Ignored:    data/orthofinder/Polyneoptera/Results_I2/.DS_Store
    Ignored:    data/orthofinder/Polyneoptera/Results_I2/Orthogroups/.DS_Store
    Ignored:    data/orthofinder/Polyneoptera/Results_I5/.DS_Store
    Ignored:    data/orthofinder/Polyneoptera/Results_I5/Orthogroups/.DS_Store
    Ignored:    data/orthofinder/Schistocerca/.DS_Store
    Ignored:    data/orthofinder/Schistocerca/Results_I2/.DS_Store
    Ignored:    data/orthofinder/Schistocerca/Results_I2/Orthogroups/.DS_Store
    Ignored:    data/orthofinder/Schistocerca/Results_I5/.DS_Store
    Ignored:    data/orthofinder/Schistocerca/Results_I5/Orthogroups/.DS_Store
    Ignored:    data/overlap/.DS_Store
    Ignored:    data/overlap/Bulk_RNAseq/.DS_Store
    Ignored:    data/overlap/Bulk_RNAseq/cancellata/
    Ignored:    data/readcounts/.DS_Store
    Ignored:    data/readcounts/Bulk_RNAseq/.DS_Store
    Ignored:    data/readcounts/RNAi/.DS_Store

Untracked files:
    Untracked:  data/DEG_results/Bulk_RNAseq/GO10_enrichment_Thorax_gregaria_custom.csv
    Untracked:  data/DEG_results/Bulk_RNAseq/GO10_enrichment_Thorax_piceifrons_custom.csv
    Untracked:  data/DEG_results/RNAi/All_GFP/
    Untracked:  data/DEG_results/RNAi/All_GFP_no_rRNA/
    Untracked:  data/DEG_results/RNAi/Head_GFP/
    Untracked:  data/DEG_results/RNAi/Head_GFP_no_rRNA/
    Untracked:  data/DEG_results/RNAi/Thorax_GFP/
    Untracked:  data/DEG_results/RNAi/Thorax_GFP_no_rRNA/
    Untracked:  data/RefSeq/
    Untracked:  data/list/RNAi/All_RNAi_GFPCRDsample_list.csv
    Untracked:  data/list/RNAi/Head_RNAi_GFPCRDsample_list.csv
    Untracked:  data/list/RNAi/Thorax_RNAi_GFPCRDsample_list.csv

Unstaged changes:
    Modified:   analysis/3_compiling_tables.Rmd
    Modified:   analysis/3_go-enrichment.Rmd
    Modified:   analysis/3_overlap-venn.Rmd
    Modified:   analysis/4_RNAi_degs.Rmd
    Modified:   analysis/_site.yml
    Modified:   data/DEG_results/RNAi/All_control/GFP_vs_CONTROL/DEG_sigresults_GFP_vs_CONTROL.csv
    Modified:   data/DEG_results/RNAi/All_control_no_rRNA/GFP_vs_CONTROL/DEG_sigresults_GFP_vs_CONTROL.csv
    Modified:   data/DEG_results/RNAi/Head/UNCH_vs_GFP/volcano_plot_UNCH_vs_GFP.tiff
    Modified:   data/DEG_results/RNAi/Head_control/GFP_vs_CONTROL/DEG_sigresults_GFP_vs_CONTROL.csv
    Modified:   data/DEG_results/RNAi/Head_control_no_rRNA/GFP_vs_CONTROL/DEG_sigresults_GFP_vs_CONTROL.csv
    Modified:   data/DEG_results/RNAi/Head_no_rRNA/Hex1_vs_GFP/volcano_plot_Hex1_vs_GFP.tiff
    Modified:   data/DEG_results/RNAi/Head_no_rRNA/Hex2_vs_GFP/volcano_plot_Hex2_vs_GFP.tiff
    Modified:   data/DEG_results/RNAi/Head_no_rRNA/JHMT_vs_GFP/volcano_plot_JHMT_vs_GFP.tiff
    Modified:   data/DEG_results/RNAi/Head_no_rRNA/MIOX_vs_GFP/volcano_plot_MIOX_vs_GFP.tiff
    Modified:   data/DEG_results/RNAi/Head_no_rRNA/PCA_vst_Gene_labelled.png
    Modified:   data/DEG_results/RNAi/Head_no_rRNA/UNCH_vs_GFP/volcano_plot_UNCH_vs_GFP.tiff
    Modified:   data/DEG_results/RNAi/Head_no_rRNA/sva_stripchart_SV1.png
    Modified:   data/DEG_results/RNAi/Head_no_rRNA/sva_stripchart_SV2.png
    Modified:   data/DEG_results/RNAi/Head_no_rRNA/sva_stripchart_SV3.png
    Modified:   data/DEG_results/RNAi/Thorax_control/GFP_vs_CONTROL/DEG_sigresults_GFP_vs_CONTROL.csv
    Modified:   data/DEG_results/RNAi/Thorax_control_no_rRNA/GFP_vs_CONTROL/DEG_sigresults_GFP_vs_CONTROL.csv
    Deleted:    data/list/RNAi/Head_RNAi_noninjectedsample_list2.csv
    Modified:   data/orthofinder/Schistocerca/Results_I2/Orthogroups_genesproteinbiotype_Schistocerca_Jan2025.csv
    Modified:   data/overlap/summaryPolyneoptera_DEGs_Orthogroups_Feb2025.csv
    Modified:   data/overlap/summaryPolyneoptera_DEGs_Orthogroups_togregaria_Feb2025.csv
    Modified:   data/overlap/summarySchistocerca_DEGs_Orthogroups_Feb2025.csv
    Modified:   data/overlap/summarySchistocerca_DEGs_Orthogroups_togregaria_Feb2025.csv

Note that any generated files, e.g. HTML, png, CSS, etc., are not included in this status report because it is ok for generated content to have uncommitted changes.

These are the previous versions of the repository in which changes were made to the R Markdown (analysis/3_go-enrichment.Rmd) and HTML (docs/3_go-enrichment.html) files. If you’ve configured a remote Git repository (see ?wflow_git_remote), click on the hyperlinks in the table below to view the files as they were in that past version.

File	Version	Author	Date	Message
Rmd	89984c0	Maeva TECHER	2025-02-19	Add overlap update
Rmd	3746422	Maeva TECHER	2025-02-12	Add RNAi
html	3746422	Maeva TECHER	2025-02-12	Add RNAi
Rmd	1fddc47	Maeva TECHER	2025-02-03	Go enrichment
html	1fddc47	Maeva TECHER	2025-02-03	Go enrichment
Rmd	faf2db3	Maeva TECHER	2025-01-13	update markdown
Rmd	616f6d6	Maeva TECHER	2025-01-07	remove old files
html	616f6d6	Maeva TECHER	2025-01-07	remove old files
Rmd	0f0ac1f	Maeva TECHER	2024-11-19	update deseq2
html	0f0ac1f	Maeva TECHER	2024-11-19	update deseq2
Rmd	fe6dae9	Maeva TECHER	2024-11-19	changes ESA
html	fe6dae9	Maeva TECHER	2024-11-19	changes ESA
Rmd	3fa8e62	Maeva TECHER	2024-11-09	updated analysis
html	3fa8e62	Maeva TECHER	2024-11-09	updated analysis
Rmd	edb70fe	Maeva TECHER	2024-11-08	overlap and deg results created
html	edb70fe	Maeva TECHER	2024-11-08	overlap and deg results created
html	ba35b82	Maeva A. TECHER	2024-06-20	Build site.
html	d605bd3	Maeva A. TECHER	2024-05-16	Build site.
Rmd	9f04a80	Maeva A. TECHER	2024-05-16	wflow_publish("analysis/3_go-enrichment.Rmd")
html	d7b2c58	Maeva A. TECHER	2024-05-16	Build site.
Rmd	f5a78da	Maeva A. TECHER	2024-05-16	wflow_publish("analysis/3_go-enrichment.Rmd")
html	a32a56d	Maeva A. TECHER	2024-05-15	Build site.
Rmd	ebc0f04	Maeva A. TECHER	2024-05-15	wflow_publish("analysis/3_go-enrichment.Rmd")

Once we have shortlisted some genes of interest—whether they are obtained from top differentially expressed genes (DEGs), weighted gene co-expression network analysis (WGCNA) modules, or other comparative genomics analyses (e.g., signatures of selection, gene family expansion)—we want to determine if certain functions are enriched in our subset. For example, we hypothesize that although locusts have evolved similar traits, they may have diverged in their strategies to respond to the environment. Therefore, we expect to see DEGs involved in divergent biological processes, molecular function, and cellular components between S. gregaria, S. piceifrons and S. cancellata.

To test that, we need to look for Gene Ontology (GO) terms that can provide us a bird’s-eye of the related functions associated with our genes of interests.

1. Blast2Go file

To create the GO association file with each of our genome, we are using the paid version of OmicsBox with the integrated workflow Blast2Go. We details below our step-by-step with one Schistocerca genome, but followed the same process for all six RefSeq.

Step 1: Load Genome (fasta + GFF)

Step 2: Run Blast

We choose the More Sensitive mode of blastx from the Diamond Blast mode which allows to align large lists of nucelotide or protein sequences against up-to-date public sequence collections. Diamond Blast has a very similar accuracy compared to the NCBI Blast with a much higher throughput. All our association files were run against the Database (NR (2024-07-11)).

2. GO term enrichment

2.1. Load necessary libraries

library(topGO)
library(dplyr) 
library(ggplot2)
library(tidyr)
library(tibble)

# Define working directory and species
workDir <- "/Users/maevatecher/Documents/GitHub/locust-comparative-genomics/data"
species <- "gregaria"

# Step 1: Load DESeq2 results for the species
deg_file <- file.path(workDir, "DEG_results/Bulk_RNAseq/", paste0(species, "/Head/DESeq2_sigresults_Head_", species ,".csv"))
deg_data <- read.csv(deg_file, stringsAsFactors = FALSE)
names(deg_data)[names(deg_data) == "X"] <- "GeneID"

# Separate DEGs into upregulated and downregulated
upregulated_genes <- subset(deg_data, padj < 0.05 & log2FoldChange > 1)$GeneID
downregulated_genes <- subset(deg_data, padj < 0.05 & log2FoldChange < -1)$GeneID

# Load the custom annotation file
custom_annot_file <- file.path(workDir, "list/GO_Annotations", paste0("blast2go_", species, "_custom.txt")) 
custom_annot_df <- read.table(custom_annot_file, sep = "\t", header = TRUE, quote = "", fill = TRUE, stringsAsFactors = FALSE)

# Prepare gene-to-GO mapping for topGO
colnames(custom_annot_df) <- c("GeneID", "Description", "GO_Extended")
#custom_annot_df <- custom_annot_df %>% 
#    separate(GO_Extended, into = c("Category", "GO_ID", "GO_Term"), sep = " ", extra = "merge") %>%
#    mutate(Category = substr(Category, 1, 1))

library(data.table)

# Convert to data.table (if not already)
setDT(custom_annot_df)

# Split `GO_Extended` column efficiently
custom_annot_df[, c("Category", "GO_ID", "GO_Term") := tstrsplit(GO_Extended, " ", fixed = TRUE, keep = 1:3)]

# Extract first letter of `Category`
custom_annot_df[, Category := substr(Category, 1, 1)]


gene2GO <- custom_annot_df %>% 
  group_by(GeneID) %>% 
  summarize(GOterms = list(unique(GO_ID))) %>% 
  deframe()

# Function to run topGO analysis by ontology
run_topGO <- function(ontology, gene_set, gene2GO) {
  all_genes <- factor(as.integer(names(gene2GO) %in% gene_set), levels = c(0, 1))
  names(all_genes) <- names(gene2GO)
  GOdata <- new("topGOdata", ontology = ontology, allGenes = all_genes, annot = annFUN.gene2GO, gene2GO = gene2GO)
  resultFisher <- runTest(GOdata, algorithm = "classic", statistic = "fisher")
  GenTable(GOdata, classicFisher = resultFisher, orderBy = "classicFisher", topNodes = 10)
}

# Run topGO for each ontology category and regulation type
allRes_up_BP <- run_topGO("BP", upregulated_genes, gene2GO) %>% mutate(Regulation = "Upregulated", ontology = "BP")
allRes_up_MF <- run_topGO("MF", upregulated_genes, gene2GO) %>% mutate(Regulation = "Upregulated", ontology = "MF")
allRes_up_CC <- run_topGO("CC", upregulated_genes, gene2GO) %>% mutate(Regulation = "Upregulated", ontology = "CC")

allRes_down_BP <- run_topGO("BP", downregulated_genes, gene2GO) %>% mutate(Regulation = "Downregulated", ontology = "BP")
allRes_down_MF <- run_topGO("MF", downregulated_genes, gene2GO) %>% mutate(Regulation = "Downregulated", ontology = "MF")
allRes_down_CC <- run_topGO("CC", downregulated_genes, gene2GO) %>% mutate(Regulation = "Downregulated", ontology = "CC")

# Combine all results with ontology labels
allRes <- bind_rows(
  allRes_up_BP, allRes_up_MF, allRes_up_CC,
  allRes_down_BP, allRes_down_MF, allRes_down_CC
)

# Check if ontology is retained
head(allRes)

       GO.ID                                        Term Annotated Significant
1 GO:0019310                  inositol catabolic process         6           4
2 GO:0046164                   alcohol catabolic process         7           4
3 GO:0046174                    polyol catabolic process         7           4
4 GO:1901616 organic hydroxy compound catabolic proce...         7           4
5 GO:0006020                  inositol metabolic process         8           4
6 GO:0055085                     transmembrane transport       649          26
  Expected classicFisher  Regulation ontology
1     0.11       1.3e-06 Upregulated       BP
2     0.12       3.0e-06 Upregulated       BP
3     0.12       3.0e-06 Upregulated       BP
4     0.12       3.0e-06 Upregulated       BP
5     0.14       5.9e-06 Upregulated       BP
6    11.37       2.9e-05 Upregulated       BP

# Visualization with ggplot2
allRes$classicFisher <- as.numeric(as.character(allRes$classicFisher))
allRes$FoldEnrichment <- allRes$Significant / allRes$Expected

# Plot with ggplot2 using facet_wrap by ontology
ggplot(allRes, aes(x = reorder(Term, -log10(classicFisher)), y = -log10(classicFisher), size = Significant, color = FoldEnrichment)) +
    geom_point() +
    facet_wrap(~ ontology, scales = "free_y") +
    coord_flip() +
    labs(
        x = "GO Term", 
        y = "-log10(p-value)", 
        size = "Gene Count", 
        color = "Fold Enrichment", 
        title = "Top 10 Enriched GO Terms by Ontology",
        subtitle = "Head transcriptomes S. gregaria"
    ) +
    theme_minimal() +
    theme(
        plot.title = element_text(size = 16, face = "bold"),
        plot.subtitle = element_text(size = 12, face = "bold.italic"),
        axis.text.y = element_text(size = 8, face = "bold.italic", color = "black")
    ) +
    scale_size_continuous(range = c(3, 10)) +
    scale_color_viridis_c(option = "D")

Version	Author	Date
1fddc47	Maeva TECHER	2025-02-03
faf2db3	Maeva TECHER	2025-01-13
fe6dae9	Maeva TECHER	2024-11-19
3fa8e62	Maeva TECHER	2024-11-09
edb70fe	Maeva TECHER	2024-11-08

# Plotting code up and downregulated
ggplot(allRes, aes(x = reorder(Term, -log10(classicFisher)), y = -log10(classicFisher), size = Significant, color = Regulation)) +
    geom_point() +
    facet_wrap(~ ontology, scales = "free_y") +
    coord_flip() +
    labs(
        x = "GO Term", 
        y = "-log10(p-value)", 
        size = "Gene Count", 
        color = "Regulation", 
        title = "Top 10 Enriched GO Terms by Ontology",
        subtitle = "Head transcriptomes S. gregaria"
    ) +
    theme_minimal() +
    theme(
        plot.title = element_text(size = 16, face = "bold"),
        plot.subtitle = element_text(size = 12, face = "italic"),
        axis.text.y = element_text(size = 8, face = "bold", color = "black")
    ) +
    scale_size_continuous(range = c(3, 10)) +
    scale_color_manual(values = c("Upregulated" = "red", "Downregulated" = "blue"))

Version	Author	Date
1fddc47	Maeva TECHER	2025-02-03
faf2db3	Maeva TECHER	2025-01-13
fe6dae9	Maeva TECHER	2024-11-19
3fa8e62	Maeva TECHER	2024-11-09
edb70fe	Maeva TECHER	2024-11-08

# Bar Plot for top GO terms per ontology
ggplot(allRes, aes(x = FoldEnrichment, y = reorder(Term, FoldEnrichment), color = -log10(classicFisher), size = Significant)) +
    geom_point() +
    facet_wrap(~ ontology, scales = "free_y") +
    labs(
        x = "Fold Enrichment", 
        y = "GO Term", 
        color = "-log10(p-value)", 
        size = "Gene Count",
        title = "GO Term Enrichment: Fold Enrichment vs p-value"
    ) +
    theme_minimal() +
    theme(
        plot.title = element_text(size = 16, face = "bold"),
        plot.subtitle = element_text(size = 12, face = "bold.italic"),
        axis.text.y = element_text(size = 8, face = "bold", color = "black")
    ) +
    scale_fill_viridis_c(option = "C")

Version	Author	Date
1fddc47	Maeva TECHER	2025-02-03
faf2db3	Maeva TECHER	2025-01-13
fe6dae9	Maeva TECHER	2024-11-19

library(pheatmap)

# Aggregate data to ensure unique combinations of Term and ontology
# Keep the row with the smallest classicFisher value for each Term and ontology pair
heatmap_data <- allRes %>%
    dplyr::group_by(Term, ontology) %>%
    dplyr::slice_min(order_by = classicFisher, n = 1) %>%
    dplyr::ungroup() %>%
    dplyr::select(Term, ontology, classicFisher, Regulation) %>%
    tidyr::spread(ontology, classicFisher) %>%
    tibble::column_to_rownames("Term")

# Verify heatmap data
str(heatmap_data)

'data.frame':   56 obs. of  4 variables:
 $ Regulation: chr  "Upregulated" "Downregulated" "Downregulated" "Downregulated" ...
 $ BP        : num  3.0e-06 3.5e-04 1.9e-04 4.1e-04 NA 1.9e-04 4.1e-04 NA NA NA ...
 $ CC        : num  NA NA NA NA NA NA NA NA 0.167 0.00242 ...
 $ MF        : num  NA NA NA NA 3.1e-04 NA NA 1.9e-06 NA NA ...

# Ensure Regulation exists
if (!"Regulation" %in% colnames(heatmap_data)) {
  stop("Error: 'Regulation' column is missing from heatmap_data!")
}

# Create annotation data frame
annotation <- data.frame(Regulation = heatmap_data$Regulation)
rownames(annotation) <- rownames(heatmap_data)

# Ensure no NA values
annotation <- na.omit(annotation)

# Define annotation colors
annotation_colors <- list(
  Regulation = c("Upregulated" = "red", "Downregulated" = "blue")
)

# Ensure annotation_colors matches annotation values
if (!all(unique(annotation$Regulation) %in% names(annotation_colors$Regulation))) {
  stop("Error: annotation_colors does not match all values in annotation$Regulation")
}

# Create heatmap matrix
heatmap_matrix <- as.matrix(-log10(heatmap_data %>%
  dplyr::select(-Regulation)))

# Replace NA values
heatmap_matrix[is.na(heatmap_matrix)] <- -log10(1)

# Plot heatmap
pheatmap(
  heatmap_matrix,
  cluster_rows = TRUE,
  cluster_cols = FALSE,
  color = colorRampPalette(c("white", "darkblue"))(50),
  annotation_row = annotation,
  main = "GO Term Enrichment Heatmap by Ontology",
  annotation_colors = annotation_colors
)

Version	Author	Date
1fddc47	Maeva TECHER	2025-02-03
faf2db3	Maeva TECHER	2025-01-13
fe6dae9	Maeva TECHER	2024-11-19

# Define working directory and species list
workDir <- "/Users/maevatecher/Documents/GitHub/locust-comparative-genomics/data"
species_list <- c("gregaria", "piceifrons", "cancellata", "americana", "cubense", "nitens")

# Function to run the GO analysis for a given species
run_GO_analysis_for_species <- function(species) {
  # Load DESeq2 results for the species
  deg_file <- file.path(workDir, "DEG_results/Bulk_RNAseq/", paste0(species, "/Head/DESeq2_results_Head_", species ,".csv"))
  deg_data <- read.csv(deg_file, stringsAsFactors = FALSE)
  names(deg_data)[names(deg_data) == "X"] <- "GeneID"
  
  # Separate DEGs into upregulated and downregulated
  upregulated_genes <- subset(deg_data, padj < 0.05 & log2FoldChange > 1)$GeneID
  downregulated_genes <- subset(deg_data, padj < 0.05 & log2FoldChange < -1)$GeneID
  
  # Load the custom annotation file
  custom_annot_file <- file.path(workDir, "list/GO_Annotations", paste0("blast2go_", species, "_custom.txt"))
  custom_annot_df <- read.table(custom_annot_file, sep = "\t", header = TRUE, quote = "", fill = TRUE, stringsAsFactors = FALSE)
  
  # Prepare gene-to-GO mapping for topGO
  colnames(custom_annot_df) <- c("GeneID", "Description", "GO_Extended")

library(data.table)

# Convert to data.table (if not already)
setDT(custom_annot_df)

# Split `GO_Extended` column efficiently
custom_annot_df[, c("Category", "GO_ID", "GO_Term") := tstrsplit(GO_Extended, " ", fixed = TRUE, keep = 1:3)]

# Extract first letter of `Category`
custom_annot_df[, Category := substr(Category, 1, 1)]

  gene2GO <- custom_annot_df %>%
    group_by(GeneID) %>%
    summarize(GOterms = list(unique(GO_ID))) %>%
    deframe()
  
  # Function to run topGO analysis by ontology
  run_topGO <- function(ontology, gene_set, gene2GO) {
    all_genes <- factor(as.integer(names(gene2GO) %in% gene_set), levels = c(0, 1))
    names(all_genes) <- names(gene2GO)
    GOdata <- new("topGOdata", ontology = ontology, allGenes = all_genes, annot = annFUN.gene2GO, gene2GO = gene2GO)
    resultFisher <- runTest(GOdata, algorithm = "classic", statistic = "fisher")
    GenTable(GOdata, classicFisher = resultFisher, orderBy = "classicFisher", topNodes = 10)
  }
  
  # Run topGO for each ontology category and regulation type
  allRes_up_BP <- run_topGO("BP", upregulated_genes, gene2GO) %>% mutate(Regulation = "Upregulated", ontology = "BP")
  allRes_up_MF <- run_topGO("MF", upregulated_genes, gene2GO) %>% mutate(Regulation = "Upregulated", ontology = "MF")
  allRes_up_CC <- run_topGO("CC", upregulated_genes, gene2GO) %>% mutate(Regulation = "Upregulated", ontology = "CC")
  
  allRes_down_BP <- run_topGO("BP", downregulated_genes, gene2GO) %>% mutate(Regulation = "Downregulated", ontology = "BP")
  allRes_down_MF <- run_topGO("MF", downregulated_genes, gene2GO) %>% mutate(Regulation = "Downregulated", ontology = "MF")
  allRes_down_CC <- run_topGO("CC", downregulated_genes, gene2GO) %>% mutate(Regulation = "Downregulated", ontology = "CC")
  
  # Combine all results with ontology labels
  allRes <- bind_rows(
    allRes_up_BP, allRes_up_MF, allRes_up_CC,
    allRes_down_BP, allRes_down_MF, allRes_down_CC
  )
  
  # Calculate FoldEnrichment and convert p-values
  allRes$classicFisher <- as.numeric(as.character(allRes$classicFisher))
  allRes$FoldEnrichment <- allRes$Significant / allRes$Expected
  
  # Export results for this species
  output_file <- file.path(workDir, "DEG_results/Bulk_RNAseq", paste0("GO10_enrichment_Head_", species, "_custom.csv"))
  write.csv(allRes, output_file, row.names = FALSE)
  
  return(allRes)
}

# Name each element in species_list
names(species_list) <- species_list

# Run the analysis for each species
results_list <- lapply(species_list, run_GO_analysis_for_species)

# Combine all results into a single table if desired
combined_results <- bind_rows(results_list, .id = "Species")

# Define working directory and species list
workDir <- "/Users/maevatecher/Documents/GitHub/locust-comparative-genomics/data"
species_list <- c("gregaria", "piceifrons", "cancellata", "americana", "cubense", "nitens")

# Function to run the GO analysis for a given species
run_GO_analysis_for_species <- function(species) {
  # Load DESeq2 results for the species
  deg_file <- file.path(workDir, "DEG_results/Bulk_RNAseq/", paste0(species, "/Thorax/DESeq2_results_Thorax_", species ,".csv"))
  deg_data <- read.csv(deg_file, stringsAsFactors = FALSE)
  names(deg_data)[names(deg_data) == "X"] <- "GeneID"
  
  # Separate DEGs into upregulated and downregulated
  upregulated_genes <- subset(deg_data, padj < 0.05 & log2FoldChange > 1)$GeneID
  downregulated_genes <- subset(deg_data, padj < 0.05 & log2FoldChange < -1)$GeneID
  
  # Load the custom annotation file
  custom_annot_file <- file.path(workDir, "list/GO_Annotations", paste0("blast2go_", species, "_custom.txt"))
  custom_annot_df <- read.table(custom_annot_file, sep = "\t", header = TRUE, quote = "", fill = TRUE, stringsAsFactors = FALSE)
  
  # Prepare gene-to-GO mapping for topGO
  colnames(custom_annot_df) <- c("GeneID", "Description", "GO_Extended")

library(data.table)

# Convert to data.table (if not already)
setDT(custom_annot_df)

# Split `GO_Extended` column efficiently
custom_annot_df[, c("Category", "GO_ID", "GO_Term") := tstrsplit(GO_Extended, " ", fixed = TRUE, keep = 1:3)]

# Extract first letter of `Category`
custom_annot_df[, Category := substr(Category, 1, 1)]

  gene2GO <- custom_annot_df %>%
    group_by(GeneID) %>%
    summarize(GOterms = list(unique(GO_ID))) %>%
    deframe()
  
  # Function to run topGO analysis by ontology
  run_topGO <- function(ontology, gene_set, gene2GO) {
    all_genes <- factor(as.integer(names(gene2GO) %in% gene_set), levels = c(0, 1))
    names(all_genes) <- names(gene2GO)
    GOdata <- new("topGOdata", ontology = ontology, allGenes = all_genes, annot = annFUN.gene2GO, gene2GO = gene2GO)
    resultFisher <- runTest(GOdata, algorithm = "classic", statistic = "fisher")
    GenTable(GOdata, classicFisher = resultFisher, orderBy = "classicFisher", topNodes = 10)
  }
  
  # Run topGO for each ontology category and regulation type
  allRes_up_BP <- run_topGO("BP", upregulated_genes, gene2GO) %>% mutate(Regulation = "Upregulated", ontology = "BP")
  allRes_up_MF <- run_topGO("MF", upregulated_genes, gene2GO) %>% mutate(Regulation = "Upregulated", ontology = "MF")
  allRes_up_CC <- run_topGO("CC", upregulated_genes, gene2GO) %>% mutate(Regulation = "Upregulated", ontology = "CC")
  
  allRes_down_BP <- run_topGO("BP", downregulated_genes, gene2GO) %>% mutate(Regulation = "Downregulated", ontology = "BP")
  allRes_down_MF <- run_topGO("MF", downregulated_genes, gene2GO) %>% mutate(Regulation = "Downregulated", ontology = "MF")
  allRes_down_CC <- run_topGO("CC", downregulated_genes, gene2GO) %>% mutate(Regulation = "Downregulated", ontology = "CC")
  
  # Combine all results with ontology labels
  allRes <- bind_rows(
    allRes_up_BP, allRes_up_MF, allRes_up_CC,
    allRes_down_BP, allRes_down_MF, allRes_down_CC
  )
  
  # Calculate FoldEnrichment and convert p-values
  allRes$classicFisher <- as.numeric(as.character(allRes$classicFisher))
  allRes$FoldEnrichment <- allRes$Significant / allRes$Expected
  
  # Export results for this species
  output_file <- file.path(workDir, "DEG_results/Bulk_RNAseq", paste0("GO10_enrichment_Thorax_", species, "_custom.csv"))
  write.csv(allRes, output_file, row.names = FALSE)
  
  return(allRes)
}

# Name each element in species_list
names(species_list) <- species_list

# Run the analysis for each species
results_list <- lapply(species_list, run_GO_analysis_for_species)

# Combine all results into a single table if desired
combined_results <- bind_rows(results_list, .id = "Species")

sessionInfo()

R version 4.4.2 (2024-10-31)
Platform: aarch64-apple-darwin20
Running under: macOS Sequoia 15.3

Matrix products: default
BLAS:   /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRblas.0.dylib 
LAPACK: /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRlapack.dylib;  LAPACK version 3.12.0

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

time zone: Asia/Tokyo
tzcode source: internal

attached base packages:
[1] stats4    stats     graphics  grDevices utils     datasets  methods  
[8] base     

other attached packages:
 [1] pheatmap_1.0.12      data.table_1.17.0    tibble_3.2.1        
 [4] tidyr_1.3.1          ggplot2_3.5.1        dplyr_1.1.4         
 [7] topGO_2.58.0         SparseM_1.84-2       GO.db_3.20.0        
[10] AnnotationDbi_1.68.0 IRanges_2.40.1       S4Vectors_0.44.0    
[13] Biobase_2.66.0       graph_1.84.1         BiocGenerics_0.52.0 

loaded via a namespace (and not attached):
 [1] KEGGREST_1.46.0         gtable_0.3.6            xfun_0.51              
 [4] bslib_0.9.0             lattice_0.22-6          vctrs_0.6.5            
 [7] tools_4.4.2             generics_0.1.3          RSQLite_2.3.9          
[10] blob_1.2.4              pkgconfig_2.0.3         RColorBrewer_1.1-3     
[13] lifecycle_1.0.4         GenomeInfoDbData_1.2.13 farver_2.1.2           
[16] compiler_4.4.2          stringr_1.5.1           git2r_0.35.0           
[19] Biostrings_2.74.1       munsell_0.5.1           httpuv_1.6.15          
[22] GenomeInfoDb_1.42.3     htmltools_0.5.8.1       sass_0.4.9             
[25] yaml_2.3.10             later_1.4.1             pillar_1.10.1          
[28] crayon_1.5.3            jquerylib_0.1.4         whisker_0.4.1          
[31] cachem_1.1.0            tidyselect_1.2.1        digest_0.6.37          
[34] stringi_1.8.4           purrr_1.0.4             labeling_0.4.3         
[37] rprojroot_2.0.4         fastmap_1.2.0           grid_4.4.2             
[40] colorspace_2.1-1        cli_3.6.4               magrittr_2.0.3         
[43] withr_3.0.2             scales_1.3.0            UCSC.utils_1.2.0       
[46] promises_1.3.2          bit64_4.6.0-1           rmarkdown_2.29         
[49] XVector_0.46.0          httr_1.4.7              matrixStats_1.5.0      
[52] bit_4.5.0.1             workflowr_1.7.1         png_0.1-8              
[55] memoise_2.0.1           evaluate_1.0.3          knitr_1.49             
[58] viridisLite_0.4.2       rlang_1.1.5             Rcpp_1.0.14            
[61] glue_1.8.0              DBI_1.2.3               rstudioapi_0.17.1      
[64] jsonlite_1.9.0          R6_2.6.1                fs_1.6.5               
[67] zlibbioc_1.52.0

Functional enrichment analysis

Maeva Techer

2025-03-03

1. Blast2Go file

Step 1: Load Genome (fasta + GFF)

Step 2: Run Blast

2. GO term enrichment

2.1. Load necessary libraries