Last updated: 2025-06-05

Checks: 7 0

Knit directory: locust-comparative-genomics/

This reproducible R Markdown analysis was created with workflowr (version 1.7.1). The Checks tab describes the reproducibility checks that were applied when the results were created. The Past versions tab lists the development history.

R Markdown file: up-to-date

Great! Since the R Markdown file has been committed to the Git repository, you know the exact version of the code that produced these results.

Environment: empty

Great job! The global environment was empty. Objects defined in the global environment can affect the analysis in your R Markdown file in unknown ways. For reproduciblity it’s best to always run the code in an empty environment.

Seed: set.seed(20221025)

The command set.seed(20221025) was run prior to running the code in the R Markdown file. Setting a seed ensures that any results that rely on randomness, e.g. subsampling or permutations, are reproducible.

Session information: recorded

Great job! Recording the operating system, R version, and package versions is critical for reproducibility.

Cache: none

Nice! There were no cached chunks for this analysis, so you can be confident that you successfully produced the results during this run.

File paths: relative

Great job! Using relative paths to the files within your workflowr project makes it easier to run your code on other machines.

Repository version: 22234e2

Great! You are using Git for version control. Tracking code development and connecting the code version to the results is critical for reproducibility.

The results in this page were generated with repository version 22234e2. See the Past versions tab to see a history of the changes made to the R Markdown and HTML files.

Note that you need to be careful to ensure that all relevant files for the analysis have been committed to Git prior to generating the results (you can use wflow_publish or wflow_git_commit). workflowr only checks the R Markdown file, but you know if there are other scripts or data files that it depends on. Below is the status of the Git repository when the results were generated:


Ignored files:
    Ignored:    .DS_Store
    Ignored:    analysis/.DS_Store
    Ignored:    analysis/.Rhistory
    Ignored:    code/.DS_Store
    Ignored:    code/scripts/.DS_Store
    Ignored:    code/scripts/pal2nal.v14/.DS_Store
    Ignored:    data/.DS_Store
    Ignored:    data/DEG_results/.DS_Store
    Ignored:    data/DEG_results/Bulk_RNAseq/.DS_Store
    Ignored:    data/DEG_results/Bulk_RNAseq/americana/.DS_Store
    Ignored:    data/DEG_results/Bulk_RNAseq/cancellata/.DS_Store
    Ignored:    data/DEG_results/Bulk_RNAseq/cancellata/Thorax/.DS_Store
    Ignored:    data/DEG_results/Bulk_RNAseq/cubense/.DS_Store
    Ignored:    data/DEG_results/Bulk_RNAseq/davidO/.DS_Store
    Ignored:    data/DEG_results/Bulk_RNAseq/gregaria/.DS_Store
    Ignored:    data/DEG_results/Bulk_RNAseq/nitens/.DS_Store
    Ignored:    data/DEG_results/Bulk_RNAseq/piceifrons/.DS_Store
    Ignored:    data/DEG_results/RNAi/.DS_Store
    Ignored:    data/DEG_results/RNAi/All/.DS_Store
    Ignored:    data/DEG_results/RNAi/All_GFP/.DS_Store
    Ignored:    data/DEG_results/RNAi/All_control/.DS_Store
    Ignored:    data/DEG_results/RNAi/All_no_rRNA/.DS_Store
    Ignored:    data/DEG_results/RNAi/Head/.DS_Store
    Ignored:    data/DEG_results/RNAi/Head_control/.DS_Store
    Ignored:    data/DEG_results/RNAi/Head_no_rRNA/.DS_Store
    Ignored:    data/DEG_results/RNAi/Thorax/.DS_Store
    Ignored:    data/DEG_results/RNAi/Thorax_no_rRNA/.DS_Store
    Ignored:    data/DEG_results/gregaria/
    Ignored:    data/DEG_results/single_cell/.DS_Store
    Ignored:    data/WGCNA/.DS_Store
    Ignored:    data/WGCNA/input/.DS_Store
    Ignored:    data/WGCNA/input/Bulk_RNAseq/.DS_Store
    Ignored:    data/WGCNA/output/.DS_Store
    Ignored:    data/WGCNA/output/Bulk_RNAseq/.DS_Store
    Ignored:    data/behavioral_data/.DS_Store
    Ignored:    data/behavioral_data/Raw_data/.DS_Store
    Ignored:    data/list/.DS_Store
    Ignored:    data/list/Bulk_RNAseq/.DS_Store
    Ignored:    data/list/GO_Annotations/.DS_Store
    Ignored:    data/list/excluded_loci/.DS_Store
    Ignored:    data/orthofinder/.DS_Store
    Ignored:    data/orthofinder/Polyneoptera/.DS_Store
    Ignored:    data/orthofinder/Polyneoptera/Results_I2_iqtree/.DS_Store
    Ignored:    data/orthofinder/Polyneoptera/Results_I2_withDaust/.DS_Store
    Ignored:    data/orthofinder/Polyneoptera/Results_I2_withDaust/Orthogroups/.DS_Store
    Ignored:    data/orthofinder/Schistocerca/.DS_Store
    Ignored:    data/orthofinder/Schistocerca/Results_I2/.DS_Store
    Ignored:    data/orthofinder/Schistocerca/Results_I2/Orthogroups/.DS_Store
    Ignored:    data/overlap/.DS_Store
    Ignored:    data/overlap/Bulk_RNAseq/.DS_Store
    Ignored:    data/overlap/Bulk_RNAseq/cancellata/
    Ignored:    data/pathway_enrichment/.DS_Store
    Ignored:    data/pathway_enrichment/custom_sgregaria_orgdb/.DS_Store
    Ignored:    data/readcounts/.DS_Store
    Ignored:    data/readcounts/Bulk_RNAseq/.DS_Store
    Ignored:    data/readcounts/RNAi/.DS_Store

Untracked files:
    Untracked:  data/RefSeq/

Note that any generated files, e.g. HTML, png, CSS, etc., are not included in this status report because it is ok for generated content to have uncommitted changes.

These are the previous versions of the repository in which changes were made to the R Markdown (analysis/2_gene-family.Rmd) and HTML (docs/2_gene-family.html) files. If you’ve configured a remote Git repository (see ?wflow_git_remote), click on the hyperlinks in the table below to view the files as they were in that past version.

File	Version	Author	Date	Message
Rmd	4e391c3	Maeva TECHER	2025-05-30	add new analysis orthology, synteny
html	4e391c3	Maeva TECHER	2025-05-30	add new analysis orthology, synteny
Rmd	b982319	Maeva TECHER	2025-03-03	update font
html	b982319	Maeva TECHER	2025-03-03	update font
html	84bca53	Maeva TECHER	2025-02-27	Build site.
html	d0b7267	Maeva TECHER	2025-02-27	Build site.
Rmd	faf2db3	Maeva TECHER	2025-01-13	update markdown
html	faf2db3	Maeva TECHER	2025-01-13	update markdown
html	3fa8e62	Maeva TECHER	2024-11-09	updated analysis
html	edb70fe	Maeva TECHER	2024-11-08	overlap and deg results created
html	ba35b82	Maeva A. TECHER	2024-06-20	Build site.
html	bf74631	Maeva A. TECHER	2024-05-14	Build site.
Rmd	41bd625	Maeva A. TECHER	2024-05-14	wflow_publish("analysis/2_gene-family.Rmd")
html	0837617	Maeva A. TECHER	2024-01-30	Build site.
html	f701a01	Maeva A. TECHER	2024-01-30	reupdate
html	7b901bb	Maeva A. TECHER	2024-01-24	Build site.
html	8c22d7c	Maeva A. TECHER	2024-01-24	Build site.
html	1b09cbe	Maeva A. TECHER	2024-01-24	remove
html	deba29e	Maeva A. TECHER	2023-12-18	Build site.
Rmd	53877fa	Maeva A. TECHER	2023-12-18	add pages

Running CAFE on OrthoFinder output

We will be analyzing the gene family expansion and contraction using CAFE5 developed by the Hahn lab. We can use the output files generated by OrthoFinder after a quick manipulation to format the input needed to estimate the gene gain and loss. CAFE5 requires two input: - Input 1: The number of genes per species in each gene family (here orthogroup). - Input 2: An ultrametric time tree with the same species name.

1. Formatting the Gene Count Matrix

For input 1, we will be using the GeneCount matrix generated by OrthoFinder2 which is located in the Results_DATE/Orthogroups output folder. The file of interest is the Orthogroups.GeneCount.tsv matrix which is formatted as shown below:

module_spider

However, whether it is the Polyneoptera run (13 species) or Schistocerca only (6 species), we encounter an issue for downstream analysis as some orthogroups have too many genes or a too high size variation which can cause difficulty for model convergence. We found a large number of duplication events in several orthogroups which can bias the overal picture. We will then apply some filtering rules:

For Polyneoptera - We remove all orthogroups with > 400 genes - We remove all orthogroups with > 100 size variation

For Schistocerca - We remove all orthogroups with > X genes - We remove all orthogroups with > X size variation

We can do it manually or use the following script python 1_filter_large_variation_OGs.py:

srun --ntasks 1 --cpus-per-task 8 --mem 50G --time 03:00:00 --pty bash

ml GCCcore/14.2.0 Python/3.13.1
#python -m venv --system-site-packages ./venv 
#source ./venv/bin/activate
#pip install --upgrade pip
#pip install pandas

import pandas as pd

# Input/Output file paths
input_file = "Orthogroups.GeneCount.tsv"
output_file = "Orthogroups_drop4CAFE.tsv"

# Read TSV (tab-separated)
df = pd.read_csv(input_file, sep="\t")

# Drop orthogroups with more than 450 genes in total
df_filtered = df[df.iloc[:, 1:].sum(axis=1) <= 200] # For Schistocerca
#df_filtered = df[df.iloc[:, 1:].sum(axis=1) <= 300] # For Polyneoptera

# Drop orthogroups where the max difference across species is >100
df_filtered = df_filtered[df_filtered.iloc[:, 1:].max(axis=1) - df_filtered.iloc[:, 1:].min(axis=1) <= 50] # For Schistocerca
#df_filtered = df_filtered[df_filtered.iloc[:, 1:].max(axis=1) - df_filtered.iloc[:, 1:].min(axis=1) <= 100] # For Polyneoptera

# Write the filtered table to a TSV file
df_filtered.to_csv(output_file, sep="\t", index=False)
print(f"Filtered orthogroups written to: {output_file}")

Then we need to format the file more by removing the Total column, adding a column Description with values = null.

# Step 1: Add "(null)" as the first column
awk -F'\t' '{print "(null)\t"$0}' Orthogroups_drop4CAFE.tsv > tmp.tsv

# Step 2: Rename the first header column to "Desc"
sed -i '1s/^(null)/Desc/' tmp.tsv

# Step 3: Remove the last column (Total), even if column numbers vary
awk -F'\t' '{$NF=""; print $0}' tmp.tsv | rev | sed 's/^\s*//g' | rev | tr ' ' '\t' > Orthogroups_readytoCAFE.tsv

# Optional: remove intermediate file
rm tmp.tsv

# We clean the species name so it is easier for plotting later
sed '1s/_filteredTranscripts//g' Orthogroups_readytoCAFE.tsv > Orthogroups_readytoCAFE_cleaned.tsv

2. Formatting the Species Tree

For input 2, we will copy the tree file generated by OrthoFinder unless we specify our own calibrated tree. We will take SpeciesTree_rooted.txt from the Results_DATE/SpeciesTree output folder.

If we modify the header (names of species) of the GeneCount file, this must be reflected in the species tree file as well. For CAFE, the tree must be ultrametric in the past I used directly the script make_ultrametric.py available from Orthofinder:

# Orthofinder needs to be loaded for it
# On Grace we may have to create a virtual environment if numpy is not installed
# python -m venv --system-site-packages ./venv
# source venv/bin/activate
# pip install numpy 

#ml GCC/12.3.0  OpenMPI/4.1.5 OrthoFinder/2.5.5
#python ./make_ultrametric.py SpeciesTree_rooted.txt

However, I noticed that the tree does not rescale always properly and also if the branches are too short like for Schistocerca, CAFE will throw the inf error as the model can not converge as there is not enough time for gene family evolution. So now, as we do not have a calibrated tree, I rescale the topology of the tree (which was run with bootstraps) with a factor of 10 and make ultrametric in R:

ml GCC/13.2.0  OpenMPI/4.1.6 R_tamu/4.4.1
export R_LIBS=$SCRATCH/R_LIBS_USER/

We then launch R:

library(ape)

# Set your input/output
input_tree <- "SpeciesTree_rooted.txt"
output_tree <- "rescaled_ultrametric_SpeciesTree.txt"

# Read the tree
tree <- read.tree(input_tree)

# Force ultrametric (can adjust lambda or try other methods if needed)
ultra_tree <- chronos(tree, lambda = 1)

# Optional: check it worked
stopifnot(is.ultrametric(ultra_tree))

# 🔁 RESCALE: Multiply all branch lengths by a factor (e.g., ×100)
scaling_factor <- 10
ultra_tree$edge.length <- ultra_tree$edge.length * scaling_factor

# Write the rescaled ultrametric tree
write.tree(ultra_tree, file = output_tree)

cat("✅ Tree written to:", output_tree, "\n")

As we did earlier for the Orthogroups file, we want to clean the name so it is easier for plotting later:

sed 's/_filteredTranscripts//g' rescaled_ultrametric_SpeciesTree.txt > rescaled_ultrametric_SpeciesTree_cleaned.txt

3. Running `CAFE5`

module purge
ml GCC/12.2.0 CAFE5/5.0.0

cafe5 -t rescaled_ultrametric_SpeciesTree_cleaned.txt -i Orthogroups_readytoCAFE_cleaned.tsv

When successful we can look at the results folder and extract the results from Base_clade_results

cafe results

4. Plot results

We can plot the results for all families and for each gene family by using cafeplotter.

For that we run the following commands and output examples are shown below:

source /venv/bin/activate
# pip install cafeplotter

cafeplotter -i results -o plots

cafe results

sessionInfo()

R version 4.4.2 (2024-10-31)
Platform: aarch64-apple-darwin20
Running under: macOS Sequoia 15.5

Matrix products: default
BLAS:   /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRblas.0.dylib 
LAPACK: /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRlapack.dylib;  LAPACK version 3.12.0

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

time zone: Asia/Tokyo
tzcode source: internal

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] workflowr_1.7.1

loaded via a namespace (and not attached):
 [1] vctrs_0.6.5       httr_1.4.7        cli_3.6.5         knitr_1.49       
 [5] rlang_1.1.6       xfun_0.51         stringi_1.8.4     processx_3.8.6   
 [9] promises_1.3.2    jsonlite_1.9.1    glue_1.8.0        rprojroot_2.0.4  
[13] git2r_0.35.0      htmltools_0.5.8.1 httpuv_1.6.15     ps_1.9.0         
[17] sass_0.4.9        rmarkdown_2.29    jquerylib_0.1.4   tibble_3.2.1     
[21] evaluate_1.0.3    fastmap_1.2.0     yaml_2.3.10       lifecycle_1.0.4  
[25] whisker_0.4.1     stringr_1.5.1     compiler_4.4.2    fs_1.6.5         
[29] pkgconfig_2.0.3   Rcpp_1.0.14       rstudioapi_0.17.1 later_1.4.1      
[33] digest_0.6.37     R6_2.6.1          pillar_1.10.2     callr_3.7.6      
[37] magrittr_2.0.3    bslib_0.9.0       tools_4.4.2       cachem_1.1.0     
[41] getPass_0.2-4

Gene family expansion/contraction

Maeva Techer

2025-06-05

Running CAFE on OrthoFinder output

1. Formatting the Gene Count Matrix

2. Formatting the Species Tree

3. Running `CAFE5`

4. Plot results

Gene family expansion/contraction

Maeva Techer

2025-06-05

Running CAFE on OrthoFinder output

1. Formatting the Gene Count Matrix

2. Formatting the Species Tree

3. Running CAFE5

4. Plot results

3. Running `CAFE5`