Last updated: 2025-06-05
Checks: 7 0
Knit directory:
locust-comparative-genomics/
This reproducible R Markdown analysis was created with workflowr (version 1.7.1). The Checks tab describes the reproducibility checks that were applied when the results were created. The Past versions tab lists the development history.
Great! Since the R Markdown file has been committed to the Git repository, you know the exact version of the code that produced these results.
Great job! The global environment was empty. Objects defined in the global environment can affect the analysis in your R Markdown file in unknown ways. For reproduciblity it’s best to always run the code in an empty environment.
The command set.seed(20221025) was run prior to running
the code in the R Markdown file. Setting a seed ensures that any results
that rely on randomness, e.g. subsampling or permutations, are
reproducible.
Great job! Recording the operating system, R version, and package versions is critical for reproducibility.
Nice! There were no cached chunks for this analysis, so you can be confident that you successfully produced the results during this run.
Great job! Using relative paths to the files within your workflowr project makes it easier to run your code on other machines.
Great! You are using Git for version control. Tracking code development and connecting the code version to the results is critical for reproducibility.
The results in this page were generated with repository version 22234e2. See the Past versions tab to see a history of the changes made to the R Markdown and HTML files.
Note that you need to be careful to ensure that all relevant files for
the analysis have been committed to Git prior to generating the results
(you can use wflow_publish or
wflow_git_commit). workflowr only checks the R Markdown
file, but you know if there are other scripts or data files that it
depends on. Below is the status of the Git repository when the results
were generated:
Ignored files:
Ignored: .DS_Store
Ignored: analysis/.DS_Store
Ignored: analysis/.Rhistory
Ignored: code/.DS_Store
Ignored: code/scripts/.DS_Store
Ignored: code/scripts/pal2nal.v14/.DS_Store
Ignored: data/.DS_Store
Ignored: data/DEG_results/.DS_Store
Ignored: data/DEG_results/Bulk_RNAseq/.DS_Store
Ignored: data/DEG_results/Bulk_RNAseq/americana/.DS_Store
Ignored: data/DEG_results/Bulk_RNAseq/cancellata/.DS_Store
Ignored: data/DEG_results/Bulk_RNAseq/cancellata/Thorax/.DS_Store
Ignored: data/DEG_results/Bulk_RNAseq/cubense/.DS_Store
Ignored: data/DEG_results/Bulk_RNAseq/davidO/.DS_Store
Ignored: data/DEG_results/Bulk_RNAseq/gregaria/.DS_Store
Ignored: data/DEG_results/Bulk_RNAseq/nitens/.DS_Store
Ignored: data/DEG_results/Bulk_RNAseq/piceifrons/.DS_Store
Ignored: data/DEG_results/RNAi/.DS_Store
Ignored: data/DEG_results/RNAi/All/.DS_Store
Ignored: data/DEG_results/RNAi/All_GFP/.DS_Store
Ignored: data/DEG_results/RNAi/All_control/.DS_Store
Ignored: data/DEG_results/RNAi/All_no_rRNA/.DS_Store
Ignored: data/DEG_results/RNAi/Head/.DS_Store
Ignored: data/DEG_results/RNAi/Head_control/.DS_Store
Ignored: data/DEG_results/RNAi/Head_no_rRNA/.DS_Store
Ignored: data/DEG_results/RNAi/Thorax/.DS_Store
Ignored: data/DEG_results/RNAi/Thorax_no_rRNA/.DS_Store
Ignored: data/DEG_results/gregaria/
Ignored: data/DEG_results/single_cell/.DS_Store
Ignored: data/WGCNA/.DS_Store
Ignored: data/WGCNA/input/.DS_Store
Ignored: data/WGCNA/input/Bulk_RNAseq/.DS_Store
Ignored: data/WGCNA/output/.DS_Store
Ignored: data/WGCNA/output/Bulk_RNAseq/.DS_Store
Ignored: data/behavioral_data/.DS_Store
Ignored: data/behavioral_data/Raw_data/.DS_Store
Ignored: data/list/.DS_Store
Ignored: data/list/Bulk_RNAseq/.DS_Store
Ignored: data/list/GO_Annotations/.DS_Store
Ignored: data/list/excluded_loci/.DS_Store
Ignored: data/orthofinder/.DS_Store
Ignored: data/orthofinder/Polyneoptera/.DS_Store
Ignored: data/orthofinder/Polyneoptera/Results_I2_iqtree/.DS_Store
Ignored: data/orthofinder/Polyneoptera/Results_I2_withDaust/.DS_Store
Ignored: data/orthofinder/Polyneoptera/Results_I2_withDaust/Orthogroups/.DS_Store
Ignored: data/orthofinder/Schistocerca/.DS_Store
Ignored: data/orthofinder/Schistocerca/Results_I2/.DS_Store
Ignored: data/orthofinder/Schistocerca/Results_I2/Orthogroups/.DS_Store
Ignored: data/overlap/.DS_Store
Ignored: data/overlap/Bulk_RNAseq/.DS_Store
Ignored: data/overlap/Bulk_RNAseq/cancellata/
Ignored: data/pathway_enrichment/.DS_Store
Ignored: data/pathway_enrichment/custom_sgregaria_orgdb/.DS_Store
Ignored: data/readcounts/.DS_Store
Ignored: data/readcounts/Bulk_RNAseq/.DS_Store
Ignored: data/readcounts/RNAi/.DS_Store
Untracked files:
Untracked: data/RefSeq/
Note that any generated files, e.g. HTML, png, CSS, etc., are not included in this status report because it is ok for generated content to have uncommitted changes.
These are the previous versions of the repository in which changes were
made to the R Markdown (analysis/2_gene-family.Rmd) and
HTML (docs/2_gene-family.html) files. If you’ve configured
a remote Git repository (see ?wflow_git_remote), click on
the hyperlinks in the table below to view the files as they were in that
past version.
| File | Version | Author | Date | Message |
|---|---|---|---|---|
| Rmd | 4e391c3 | Maeva TECHER | 2025-05-30 | add new analysis orthology, synteny |
| html | 4e391c3 | Maeva TECHER | 2025-05-30 | add new analysis orthology, synteny |
| Rmd | b982319 | Maeva TECHER | 2025-03-03 | update font |
| html | b982319 | Maeva TECHER | 2025-03-03 | update font |
| html | 84bca53 | Maeva TECHER | 2025-02-27 | Build site. |
| html | d0b7267 | Maeva TECHER | 2025-02-27 | Build site. |
| Rmd | faf2db3 | Maeva TECHER | 2025-01-13 | update markdown |
| html | faf2db3 | Maeva TECHER | 2025-01-13 | update markdown |
| html | 3fa8e62 | Maeva TECHER | 2024-11-09 | updated analysis |
| html | edb70fe | Maeva TECHER | 2024-11-08 | overlap and deg results created |
| html | ba35b82 | Maeva A. TECHER | 2024-06-20 | Build site. |
| html | bf74631 | Maeva A. TECHER | 2024-05-14 | Build site. |
| Rmd | 41bd625 | Maeva A. TECHER | 2024-05-14 | wflow_publish("analysis/2_gene-family.Rmd") |
| html | 0837617 | Maeva A. TECHER | 2024-01-30 | Build site. |
| html | f701a01 | Maeva A. TECHER | 2024-01-30 | reupdate |
| html | 7b901bb | Maeva A. TECHER | 2024-01-24 | Build site. |
| html | 8c22d7c | Maeva A. TECHER | 2024-01-24 | Build site. |
| html | 1b09cbe | Maeva A. TECHER | 2024-01-24 | remove |
| html | deba29e | Maeva A. TECHER | 2023-12-18 | Build site. |
| Rmd | 53877fa | Maeva A. TECHER | 2023-12-18 | add pages |
We will be analyzing the gene family expansion and contraction using
CAFE5
developed by the Hahn lab. We can use the output files generated by
OrthoFinder after a quick manipulation to format the input
needed to estimate the gene gain and loss. CAFE5 requires two input: -
Input 1: The number of genes per species in each gene family (here
orthogroup). - Input 2: An ultrametric time tree with the same species
name.
For input 1, we will be using the GeneCount matrix generated by
OrthoFinder2 which is located in the
Results_DATE/Orthogroups output folder. The file of
interest is the Orthogroups.GeneCount.tsv matrix which is
formatted as shown below:

However, whether it is the Polyneoptera run (13 species) or Schistocerca only (6 species), we encounter an issue for downstream analysis as some orthogroups have too many genes or a too high size variation which can cause difficulty for model convergence. We found a large number of duplication events in several orthogroups which can bias the overal picture. We will then apply some filtering rules:
For Polyneoptera - We remove all orthogroups with > 400 genes - We remove all orthogroups with > 100 size variation
For Schistocerca - We remove all orthogroups with > X genes - We remove all orthogroups with > X size variation
We can do it manually or use the following script
python 1_filter_large_variation_OGs.py:
srun --ntasks 1 --cpus-per-task 8 --mem 50G --time 03:00:00 --pty bash
ml GCCcore/14.2.0 Python/3.13.1
#python -m venv --system-site-packages ./venv
#source ./venv/bin/activate
#pip install --upgrade pip
#pip install pandas
import pandas as pd
# Input/Output file paths
input_file = "Orthogroups.GeneCount.tsv"
output_file = "Orthogroups_drop4CAFE.tsv"
# Read TSV (tab-separated)
df = pd.read_csv(input_file, sep="\t")
# Drop orthogroups with more than 450 genes in total
df_filtered = df[df.iloc[:, 1:].sum(axis=1) <= 200] # For Schistocerca
#df_filtered = df[df.iloc[:, 1:].sum(axis=1) <= 300] # For Polyneoptera
# Drop orthogroups where the max difference across species is >100
df_filtered = df_filtered[df_filtered.iloc[:, 1:].max(axis=1) - df_filtered.iloc[:, 1:].min(axis=1) <= 50] # For Schistocerca
#df_filtered = df_filtered[df_filtered.iloc[:, 1:].max(axis=1) - df_filtered.iloc[:, 1:].min(axis=1) <= 100] # For Polyneoptera
# Write the filtered table to a TSV file
df_filtered.to_csv(output_file, sep="\t", index=False)
print(f"Filtered orthogroups written to: {output_file}")
Then we need to format the file more by removing the
Total column, adding a column Description with
values = null.
# Step 1: Add "(null)" as the first column
awk -F'\t' '{print "(null)\t"$0}' Orthogroups_drop4CAFE.tsv > tmp.tsv
# Step 2: Rename the first header column to "Desc"
sed -i '1s/^(null)/Desc/' tmp.tsv
# Step 3: Remove the last column (Total), even if column numbers vary
awk -F'\t' '{$NF=""; print $0}' tmp.tsv | rev | sed 's/^\s*//g' | rev | tr ' ' '\t' > Orthogroups_readytoCAFE.tsv
# Optional: remove intermediate file
rm tmp.tsv
# We clean the species name so it is easier for plotting later
sed '1s/_filteredTranscripts//g' Orthogroups_readytoCAFE.tsv > Orthogroups_readytoCAFE_cleaned.tsv
For input 2, we will copy the tree file generated by
OrthoFinder unless we specify our own calibrated tree. We
will take SpeciesTree_rooted.txt from the
Results_DATE/SpeciesTree output folder.
If we modify the header (names of species) of the GeneCount file,
this must be reflected in the species tree file as well. For CAFE, the
tree must be ultrametric in the past I used directly the script make_ultrametric.py
available from Orthofinder:
# Orthofinder needs to be loaded for it
# On Grace we may have to create a virtual environment if numpy is not installed
# python -m venv --system-site-packages ./venv
# source venv/bin/activate
# pip install numpy
#ml GCC/12.3.0 OpenMPI/4.1.5 OrthoFinder/2.5.5
#python ./make_ultrametric.py SpeciesTree_rooted.txt
However, I noticed that the tree does not rescale always properly and
also if the branches are too short like for Schistocerca,
CAFE will throw the inf error as the model can
not converge as there is not enough time for gene family evolution. So
now, as we do not have a calibrated tree, I rescale the topology of the
tree (which was run with bootstraps) with a factor of 10 and make
ultrametric in R:
ml GCC/13.2.0 OpenMPI/4.1.6 R_tamu/4.4.1
export R_LIBS=$SCRATCH/R_LIBS_USER/
We then launch R:
library(ape)
# Set your input/output
input_tree <- "SpeciesTree_rooted.txt"
output_tree <- "rescaled_ultrametric_SpeciesTree.txt"
# Read the tree
tree <- read.tree(input_tree)
# Force ultrametric (can adjust lambda or try other methods if needed)
ultra_tree <- chronos(tree, lambda = 1)
# Optional: check it worked
stopifnot(is.ultrametric(ultra_tree))
# 🔁 RESCALE: Multiply all branch lengths by a factor (e.g., ×100)
scaling_factor <- 10
ultra_tree$edge.length <- ultra_tree$edge.length * scaling_factor
# Write the rescaled ultrametric tree
write.tree(ultra_tree, file = output_tree)
cat("✅ Tree written to:", output_tree, "\n")
As we did earlier for the Orthogroups file, we want to clean the name so it is easier for plotting later:
sed 's/_filteredTranscripts//g' rescaled_ultrametric_SpeciesTree.txt > rescaled_ultrametric_SpeciesTree_cleaned.txt
CAFE5module purge
ml GCC/12.2.0 CAFE5/5.0.0
cafe5 -t rescaled_ultrametric_SpeciesTree_cleaned.txt -i Orthogroups_readytoCAFE_cleaned.tsv
When successful we can look at the results folder and
extract the results from Base_clade_results

We can plot the results for all families and for each gene family by
using cafeplotter.
For that we run the following commands and output examples are shown below:
source /venv/bin/activate
# pip install cafeplotter
cafeplotter -i results -o plots


sessionInfo()
R version 4.4.2 (2024-10-31)
Platform: aarch64-apple-darwin20
Running under: macOS Sequoia 15.5
Matrix products: default
BLAS: /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRblas.0.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRlapack.dylib; LAPACK version 3.12.0
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
time zone: Asia/Tokyo
tzcode source: internal
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] workflowr_1.7.1
loaded via a namespace (and not attached):
[1] vctrs_0.6.5 httr_1.4.7 cli_3.6.5 knitr_1.49
[5] rlang_1.1.6 xfun_0.51 stringi_1.8.4 processx_3.8.6
[9] promises_1.3.2 jsonlite_1.9.1 glue_1.8.0 rprojroot_2.0.4
[13] git2r_0.35.0 htmltools_0.5.8.1 httpuv_1.6.15 ps_1.9.0
[17] sass_0.4.9 rmarkdown_2.29 jquerylib_0.1.4 tibble_3.2.1
[21] evaluate_1.0.3 fastmap_1.2.0 yaml_2.3.10 lifecycle_1.0.4
[25] whisker_0.4.1 stringr_1.5.1 compiler_4.4.2 fs_1.6.5
[29] pkgconfig_2.0.3 Rcpp_1.0.14 rstudioapi_0.17.1 later_1.4.1
[33] digest_0.6.37 R6_2.6.1 pillar_1.10.2 callr_3.7.6
[37] magrittr_2.0.3 bslib_0.9.0 tools_4.4.2 cachem_1.1.0
[41] getPass_0.2-4