Last updated: 2025-06-05

Knit directory: locust-comparative-genomics/

This reproducible R Markdown analysis was created with workflowr (version 1.7.1). The Checks tab describes the reproducibility checks that were applied when the results were created. The Past versions tab lists the development history.

R Markdown file: up-to-date

Great! Since the R Markdown file has been committed to the Git repository, you know the exact version of the code that produced these results.

Environment: empty

Great job! The global environment was empty. Objects defined in the global environment can affect the analysis in your R Markdown file in unknown ways. For reproducibility it’s best to always run the code in an empty environment.

Seed: set.seed(20221025)

The command set.seed(20221025) was run prior to running the code in the R Markdown file. Setting a seed ensures that any results that rely on randomness, e.g. subsampling or permutations, are reproducible.

Session information: recorded

Great job! Recording the operating system, R version, and package versions is critical for reproducibility.

Cache: none

Nice! There were no cached chunks for this analysis, so you can be confident that you successfully produced the results during this run.

File paths: relative

Great job! Using relative paths to the files within your workflowr project makes it easier to run your code on other machines.

Repository version: 9a03ca6

Great! You are using Git for version control. Tracking code development and connecting the code version to the results is critical for reproducibility.

The results in this page were generated with repository version 9a03ca6. See the Past versions tab to see a history of the changes made to the R Markdown and HTML files.

Note that you need to be careful to ensure that all relevant files for the analysis have been committed to Git prior to generating the results (you can use wflow_publish or wflow_git_commit). workflowr only checks the R Markdown file, but you know if there are other scripts or data files that it depends on. Below is the status of the Git repository when the results were generated:


Ignored files:
    Ignored:    .DS_Store (project root and subdirectories of analysis/, code/ and data/)
    Ignored:    analysis/.Rhistory
    Ignored:    analysis/figure/
    Ignored:    data/DEG_results/gregaria/
    Ignored:    data/overlap/Bulk_RNAseq/cancellata/

Untracked files:
    Untracked:  data/RefSeq/

Note that any generated files, e.g. HTML, png, CSS, etc., are not included in this status report because it is ok for generated content to have uncommitted changes.

These are the previous versions of the repository in which changes were made to the R Markdown (analysis/2_signatures-selection.Rmd) and HTML (docs/2_signatures-selection.html) files. If you’ve configured a remote Git repository (see ?wflow_git_remote), click on the hyperlinks in the table below to view the files as they were in that past version.

File	Version	Author	Date	Message
html	9a03ca6	Maeva TECHER	2025-06-05	Update website
html	17484e8	Maeva TECHER	2025-06-05	Build site.
html	3e696d6	Maeva TECHER	2025-06-05	Adding ortho heatmap
Rmd	4e391c3	Maeva TECHER	2025-05-30	add new analysis orthology, synteny
html	4e391c3	Maeva TECHER	2025-05-30	add new analysis orthology, synteny
Rmd	cacc1db	Maeva TECHER	2025-05-02	updates files
html	cacc1db	Maeva TECHER	2025-05-02	updates files
Rmd	b982319	Maeva TECHER	2025-03-03	update font
html	b982319	Maeva TECHER	2025-03-03	update font
html	f6a4762	Maeva TECHER	2025-02-27	Build site.
Rmd	e55bac6	Maeva TECHER	2025-01-26	Updating the github
html	e55bac6	Maeva TECHER	2025-01-26	Updating the github
html	faf2db3	Maeva TECHER	2025-01-13	update markdown
html	6954b9b	Maeva TECHER	2025-01-13	Build site.
Rmd	8df3d7c	Maeva TECHER	2025-01-13	changes
Rmd	b80db34	Maeva TECHER	2025-01-13	Adding selection analysis part
html	b80db34	Maeva TECHER	2025-01-13	Adding selection analysis part
html	3fa8e62	Maeva TECHER	2024-11-09	updated analysis
html	edb70fe	Maeva TECHER	2024-11-08	overlap and deg results created
html	ba35b82	Maeva A. TECHER	2024-06-20	Build site.
html	acfa0db	Maeva A. TECHER	2024-05-14	Build site.
Rmd	2c5b31c	Maeva A. TECHER	2024-05-14	wflow_publish("analysis/2_signatures-selection.Rmd")
html	0837617	Maeva A. TECHER	2024-01-30	Build site.
html	f701a01	Maeva A. TECHER	2024-01-30	reupdate
html	6e878be	Maeva A. TECHER	2024-01-24	Build site.
html	1b09cbe	Maeva A. TECHER	2024-01-24	remove
html	4ae7db7	Maeva A. TECHER	2023-12-18	Build site.
Rmd	53877fa	Maeva A. TECHER	2023-12-18	add pages

Testing for signatures of selection using HyPhy: aBSREL, BUSTED and RELAX

Note: We used OrthoFinder results, PAL2NAL and HyPhy to identify signatures of selection in orthologous genes. This part follows the well-curated FormicidaeMolecularEvolution pipeline by Megan Barkdull (Assistant Curator of Entomology at the Natural History Museum of Los Angeles County). The workflow is largely copied from her GitHub repository; we describe our modifications below.

We will be running three methods on our tree:

Are certain species (branches) in the Schistocerca phylogeny subject to episodic positive selection, that is, positive selection acting on a subset of sites? For this analysis, we will use aBSREL (adaptive Branch-Site Random Effects Likelihood), the preferred method for detecting episodic selection on individual branches within the locust phylogeny.
Has a gene experienced positive selection at any site in a locust species or group of species? To answer this question, we will apply BUSTED (Branch-Site Unrestricted Statistical Test for Episodic Diversification). This method works well for datasets with fewer than 10 taxa and helps identify positive selection events associated with species or groups.
Have selection pressures on genes been relaxed or intensified in a subset of Schistocerca species? For this, we will use RELAX, which is not designed to detect positive selection but rather to determine whether selection pressures have been relaxed or intensified along a specified set of “test” branches.
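As a rough orientation, the three questions map onto three HyPhy subcommands. The dry-run sketch below only prints one command per method and does not run HyPhy. The file names og.fasta and og.nwk are hypothetical placeholders, and the options shown are the ones we understand the HyPhy command line to accept; the wrapper scripts used later in this page may pass different arguments.

```shell
#!/bin/bash
# Dry-run sketch: print (do not execute) one HyPhy command per method.
# og.fasta / og.nwk are hypothetical file names; "Foreground" is the
# branch label that marks locust species in the labelled phylogenies.
aln="og.fasta"
tree="og.nwk"
label="Foreground"

hyphy_cmd() {
  case "$1" in
    absrel) echo "hyphy absrel --alignment $aln --tree $tree --branches $label" ;;
    busted) echo "hyphy busted --alignment $aln --tree $tree --branches $label" ;;
    relax)  echo "hyphy relax --alignment $aln --tree $tree --test $label" ;;
    *)      echo "unknown method: $1" >&2; return 1 ;;
  esac
}

for method in absrel busted relax; do
  hyphy_cmd "$method"
done
```

For the unlabelled (exploratory) runs, the branch option would simply be left out so that all branches are considered.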

1. Parsing orthogroups files for PAL2NAL

M. Barkdull's script is used unchanged; it requires R with the phylotools package installed. This step reorders the OrthoFinder FASTA output: instead of one file per orthogroup, it produces one file per species that contains the sequences of that species from all orthogroups, in a consistent order. These files are the input for PAL2NAL, a program that converts a multiple sequence alignment of proteins and the corresponding DNA sequences (here, CDS) into a codon alignment.

srun --ntasks 1 --cpus-per-task 8 --mem 50G --time 04:00:00 --pty bash

ml GCC/13.2.0  OpenMPI/4.1.6 R_tamu/4.4.1 MCScanX/2024.19.19
export R_LIBS=$SCRATCH/R_LIBS_USER/

# Example for Schistocerca only  
./scripts/DataMSA.R ./scripts/inputurls_Schistocerca_Jan2025.txt /scratch/group/songlab/maeva/LocustsGenomeEvolution/Schistocerca_I2/5_OrthoFinder/fasta/Results_Jan15_I2/MultipleSequenceAlignments/
  
# Example for Polyneoptera
./scripts/DataMSA.R ./scripts/inputurls_13polyneoptera_May2025.txt /scratch/group/songlab/maeva/LocustsGenomeEvolution/Polyneoptera_FINAL/5_OrthoFinder/fasta/Results_May26_iqtree/MultipleSequenceAlignments/

On a successful run, the script ends by printing its completion messages.

Once all the files are obtained, the next step filters the CDS files so that they contain only the genes present in the protein alignment files that will be passed to PAL2NAL. This is necessary because some genes were not assigned to any orthogroup.

ml GCC/13.2.0  OpenMPI/4.1.6 R_tamu/4.4.1
export R_LIBS=$SCRATCH/R_LIBS_USER/

# Example for Schistocerca only  
./scripts/FilteringCDSbyMSA.R ./scripts/inputurls_Schistocerca_Jan2025.txt 

# Example for Polyneoptera
./scripts/FilteringCDSbyMSA.R ./scripts/inputurls_13polyneoptera_May2025.txt
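The idea behind this filtering can be sketched in a few lines of awk. This is not the content of FilteringCDSbyMSA.R, only a minimal illustration that assumes the protein and CDS files use identical header lines; the function name keep_matching and the toy records are hypothetical.

```shell
#!/bin/bash
# keep_matching HEADER_SOURCE TARGET
# Print the FASTA records of TARGET whose header line also occurs in
# HEADER_SOURCE. Headers are compared as whole lines (exact match).
# Assumes HEADER_SOURCE is not empty.
keep_matching() {
  awk 'NR == FNR { if (/^>/) keep[$0] = 1; next }
       /^>/      { p = ($0 in keep) }
       p' "$1" "$2"
}

# Toy example: g3 has a CDS but no protein in the alignment, so it is dropped.
tmp=$(mktemp -d)
printf '>g1\nMK\n>g2\nML\n' > "$tmp/proteins.fasta"
printf '>g1\nATGAAA\n>g2\nATGCTG\n>g3\nATGGGG\n' > "$tmp/cds.fasta"
keep_matching "$tmp/proteins.fasta" "$tmp/cds.fasta"
rm -r "$tmp"
```

Matching whole header lines avoids keeping a gene such as g10 just because g1 is in the list.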

Some files show a discrepancy of one “>” header line between the protein and CDS files (caused by a concatenation error that I could not troubleshoot), so we run the script doublecheckCDSbyMAS, which I wrote to remove the extra entry.

# Example for Schistocerca only 
./scripts/doublecheckCDSbyMAS ./scripts/inputurls_Schistocerca_Jan2025.txt 

# Example for Polyneoptera
./scripts/doublecheckCDSbyMAS ./scripts/inputurls_13polyneoptera_May2025.txt 

# You can also check if there is a difference with the following quick steps
grep ">" ./6_1_SpeciesMSA/proteins_Sscub.fasta | sort > proteins_Sscub_names.txt
grep ">" ./6_2_FilteredCDS/filtered_Sscub_cds.fasta | sort > cds_Sscub_names.txt
diff proteins_Sscub_names.txt cds_Sscub_names.txt

The following is the content of doublecheckCDSbyMAS:

#!/bin/bash

# Check if input file is provided
if [ "$#" -lt 1 ]; then
  echo "Usage: $0 <input_file>"
  exit 1
fi

# Input file containing species information
input_file=$1

# Directories
protein_dir="./6_1_SpeciesMSA"
cds_dir="./6_2_FilteredCDS"
backup_dir="$protein_dir/backup"
log_file="./cleaning_check.log"

# Create necessary directories
mkdir -p "$backup_dir"
rm -f "$log_file"  # Clear previous logs

# Extract species abbreviations (no header in the file)
species_list=$(awk -F',' '{print $4}' "$input_file")

# Loop through each species
for species in $species_list; do
  protein_file="$protein_dir/proteins_${species}.fasta"
  cds_file="$cds_dir/filtered_${species}_cds.fasta"
  cleaned_protein_file="$protein_dir/proteins_${species}_cleaned.fasta"
  cleaned_cds_file="$cds_dir/filtered_${species}_cds_cleaned.fasta"

  echo "Processing species: $species"

  # Check if protein and CDS files exist
  if [[ -f "$protein_file" && -f "$cds_file" ]]; then
    # Backup the original protein file
    cp "$protein_file" "$backup_dir/proteins_${species}.fasta.bak"
    echo "Backup created for: $protein_file -> $backup_dir/proteins_${species}.fasta.bak"

    # Cleaning Step: Align sequence headers between protein and CDS files
    grep ">" "$protein_file" | sort > proteins_names.txt
    grep ">" "$cds_file" | sort > cds_names.txt

    # Identify common sequence headers
    comm -12 proteins_names.txt cds_names.txt > common_names.txt

    # Check if common_names.txt is empty (indicating no matching headers)
    if [[ ! -s common_names.txt ]]; then
      echo "ERROR: No common sequence headers found for species: $species" >> "$log_file"
      echo "ERROR: Cleaning failed for species: $species due to no matching sequence headers."
      continue
    fi

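    # Filter protein file: keep records whose header matches exactly.
    # awk avoids the "--" separators that grep -A inserts between groups,
    # partial header matches from grep -F, and truncated multi-line sequences.
    awk 'NR == FNR { keep[$0] = 1; next } /^>/ { p = ($0 in keep) } p' \
      common_names.txt "$protein_file" > "$cleaned_protein_file" || {
      echo "ERROR: Failed to clean protein file for species: $species" >> "$log_file"
      continue
    }

    # Filter CDS file with the same exact-header filter
    awk 'NR == FNR { keep[$0] = 1; next } /^>/ { p = ($0 in keep) } p' \
      common_names.txt "$cds_file" > "$cleaned_cds_file" || {
      echo "ERROR: Failed to clean CDS file for species: $species" >> "$log_file"
      continue
    }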

    # Replace the original files with cleaned versions
    mv "$cleaned_protein_file" "$protein_file"
    mv "$cleaned_cds_file" "$cds_file"

    # Perform grep check to validate cleaning
    grep ">" "$protein_file" | sort > proteins_names_cleaned.txt
    grep ">" "$cds_file" | sort > cds_names_cleaned.txt
    diff_output=$(diff proteins_names_cleaned.txt cds_names_cleaned.txt)

    if [[ -z "$diff_output" ]]; then
      echo "Check passed for species: $species" >> "$log_file"
      echo "Protein and CDS sequence names match for species: $species."
    else
      echo "Check failed for species: $species" >> "$log_file"
      echo "Protein and CDS sequence names mismatch for species: $species." >> "$log_file"
      echo "$diff_output" >> "$log_file"
    fi

  else
    echo "ERROR: Missing files for species: $species" >> "$log_file"
    echo "ERROR: Protein or CDS file missing for species: $species. Skipping."
  fi
done

# Cleanup temporary files
rm -f proteins_names.txt cds_names.txt common_names.txt proteins_names_cleaned.txt cds_names_cleaned.txt

echo "All species processed. Logs saved to $log_file."

2. Generating codon-aware nucleotide alignments

PAL2NAL is installed on Grace as a module, and the same version is also available in the scripts directory of this repository. We use the files generated in the previous steps as input to obtain codon-aware alignments.

# Example for Schistocerca only 
./scripts/DataRunPAL2NAL ./scripts/inputurls_Schistocerca_Jan2025.txt  

# Example for Polyneoptera 
./scripts/DataRunPAL2NAL ./scripts/inputurls_13polyneoptera_May2025.txt
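To make "codon-aware" concrete, the sketch below threads an unaligned CDS onto an aligned protein sequence: every amino acid takes the next three nucleotides and every protein gap becomes a three-nucleotide gap. This is only the core idea, written as a hypothetical helper (thread_codons); PAL2NAL itself also checks that codons match the protein, handles frameshifts and supports several codon tables.

```shell
#!/bin/bash
# thread_codons ALIGNED_PROTEIN CDS
# Map an unaligned CDS onto a gapped protein sequence, codon by codon.
thread_codons() {
  awk -v prot="$1" -v cds="$2" 'BEGIN {
    out = ""; pos = 1
    for (i = 1; i <= length(prot); i++) {
      if (substr(prot, i, 1) == "-") {
        out = out "---"                  # protein gap -> codon-sized gap
      } else {
        out = out substr(cds, pos, 3)    # residue -> its codon
        pos += 3
      }
    }
    print out
  }'
}

thread_codons "M-KL" "ATGAAACTG"   # prints ATG---AAACTG
```

Because gaps are always inserted in multiples of three, the reading frame is preserved, which is what the codon models in HyPhy require.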

3. Assembling nucleotide sequence orthogroups for input in HyPhy

From M. Barkdull: For some models like BUSTED, we need files that contain orthologous nucleotide sequences from each species. Therefore, we must recombine our codon-aware alignments in a step that is the inverse of previous steps. To do this, use the R script ./scripts/DataSubsetCDS.R. Run with the command:

ml GCC/13.2.0  OpenMPI/4.1.6 R_tamu/4.4.1
export R_LIBS=$SCRATCH/R_LIBS_USER/

# Example for Schistocerca only 
./scripts/DataSubsetCDS.R ./scripts/inputurls_Schistocerca_Jan2025.txt /scratch/group/songlab/maeva/LocustsGenomeEvolution/Schistocerca_I2/5_OrthoFinder/fasta/Results_Jan15_I2/MultipleSequenceAlignments/
  
# Example for Polyneoptera 
 ./scripts/DataSubsetCDS.R ./scripts/inputurls_13polyneoptera_May2025.txt /scratch/group/songlab/maeva/LocustsGenomeEvolution/Polyneoptera_FINAL/5_OrthoFinder/fasta/Results_May26_iqtree/MultipleSequenceAlignments/

From M. Barkdull: BUSTED will not run on sequences which contain stop codons, even if these are reasonable, terminal stop codons. HyPhy includes a utility which will mask these terminal stop codons in the orthogroups (there should be few-to-no other stop codons, because our alignments are codon-aware). To execute this step, use the following:

module purge  
ml GCC/13.3.0  OpenMPI/5.0.3 HyPhy/2.5.71

./scripts/DataRemoveStopCodons

# for large groups launch it with sbatch
sbatch ./scripts/DataRemoveStopCodons
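The effect of masking a terminal stop codon can be illustrated as follows. This sketch is not the HyPhy utility called by DataRemoveStopCodons; it is a hypothetical awk helper that assumes single-line sequences whose stop codon sits at the very end of the line, and it uses gap characters as the mask, which is an assumption made for illustration.

```shell
#!/bin/bash
# mask_terminal_stop: read FASTA on stdin and replace a trailing
# TAA/TAG/TGA with "---" when the sequence length is a multiple of three.
mask_terminal_stop() {
  awk '/^>/ { print; next }
       {
         n = length($0)
         last = toupper(substr($0, n - 2))
         if (n >= 3 && n % 3 == 0 && (last == "TAA" || last == "TAG" || last == "TGA"))
           print substr($0, 1, n - 3) "---"
         else
           print
       }'
}

printf '>a\nATGAAATAA\n>b\nATGAAACTG\n' | mask_terminal_stop
```

Sequence a loses its TAA, while sequence b, which has no terminal stop, is printed unchanged. For real data, HyPhy's own utility remains the tool to use.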

4. Preparing labeled phylogenies

Before running selection analyses with HyPhy, note that some methods, such as RELAX, require a phylogeny with labelled branches. The labels define the branch sets to be tested and make it possible to compare selection pressures between them.

We therefore modified the script LabellingPhylogeniesHYPHY.R

ml GCC/13.2.0  OpenMPI/4.1.6 R_tamu/4.4.1
export R_LIBS=$SCRATCH/R_LIBS_USER/

# Example for Schistocerca only   
./scripts/LabellingPhylogeniesHYPHY.R /scratch/group/songlab/maeva/LocustsGenomeEvolution/Schistocerca_I2/5_OrthoFinder/fasta/Results_Jan15_I2/Resolved_Gene_Trees/ Locusts.txt Locusts

# Example for Polyneoptera 
./scripts/LabellingPhylogeniesHYPHY.R /scratch/group/songlab/maeva/LocustsGenomeEvolution/Polyneoptera_FINAL/5_OrthoFinder/fasta/Results_May26_iqtree/Resolved_Gene_Trees/ Locusts.txt Locusts

On a successful run, the locust species in the output trees are labelled with {Foreground}.
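For readers unfamiliar with HyPhy branch labels, the sketch below shows what such a label looks like in a Newick string. It is a hypothetical helper (label_tips), not LabellingPhylogeniesHYPHY.R: it only tags the listed tip names, whereas the real script may also label internal branches. The toy tree and the species codes other than Sscub are made up for the example.

```shell
#!/bin/bash
# label_tips NEWICK TIP [TIP...]
# Append {Foreground} to each listed tip name in a Newick string.
label_tips() {
  tree="$1"
  shift
  for tip in "$@"; do
    # A tip name is preceded by "(" or "," and followed by ",", ":" or ")"
    tree=$(printf '%s\n' "$tree" |
      sed -E "s/([(,])${tip}([,:)])/\1${tip}{Foreground}\2/g")
  done
  printf '%s\n' "$tree"
}

label_tips "((Sgreg:0.1,Sscub:0.1):0.05,(Spice:0.2,Samer:0.2):0.05);" Sgreg Spice
# prints ((Sgreg{Foreground}:0.1,Sscub:0.1):0.05,(Spice{Foreground}:0.2,Samer:0.2):0.05);
```

Branches without a label are treated by HyPhy as the background (or reference) set.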

5. Annotating proteins with InterProScan and orthogroups with KinFin

As part of the process, we want to make sure that the genes under selection can be interpreted biologically through functional annotation and GO enrichment analysis. To achieve this, we use InterProScan to annotate individual genes and KinFin to aggregate these gene-level annotations into functional categories for entire orthogroups. This approach matches the orthogroup-level focus of our aBSREL, BUSTED and RELAX analyses and provides insight into the functional relevance of selective pressures.

For that we run the following command:

# Example for Schistocerca only 
./scripts/RunningInterProScan_modif ./scripts/inputurls_Schistocerca_Jan2025.txt /scratch/group/songlab/maeva/LocustsGenomeEvolution/Schistocerca_I2/5_OrthoFinder/fasta/

# Example for Polyneoptera   
./scripts/RunningInterProScan_modif ./scripts/inputurls_13polyneoptera_Jan2025.txt /scratch/group/songlab/maeva/LocustsGenomeEvolution/Polyneoptera_I2/5_OrthoFinder/fasta/
  
# We updated InterProScan to the most recent version: interproscan-5.72-103.0
  
# We also commented out ax.set_facecolor('white') on lines 681 and 1754 of ./kinfin/src/kinfin.py

Here are the details of ./scripts/RunningInterProScan_modif

#!/bin/bash

## SLURM Job Specifications
#SBATCH --job-name=interproscan         # Set the job name
#SBATCH --time=4-00:00:00              # Set the wall clock limit to 4 days
#SBATCH --ntasks=1                     # Request 1 task
#SBATCH --cpus-per-task=12             # Request 12 CPUs for the task
#SBATCH --mem=100G                     # Request 100GB memory
#SBATCH --output=interproscan_%j.out   # Standard output log
#SBATCH --error=interproscan_%j.err    # Standard error log

# Ensure the script receives correct arguments
if [ "$#" -ne 2 ]; then
  echo "Usage: $0 <input_file> <path_to_proteins_directory>"
  exit 1
fi

input_file=$1
proteins_dir=$2

# Load necessary modules
ml Java/11.0.2
ml WebProxy

export http_proxy=http://10.73.132.63:8080
export https_proxy=http://10.73.132.63:8080

# Main working directories
interpro_dir="./11_InterProScan/interproscan-5.72-103.0"
output_dir="$interpro_dir/out"
backup_dir="./11_InterProScan/backup"

# Create necessary directories
mkdir -p "$output_dir"
mkdir -p "$backup_dir"

# Iterate through the input file to process each species
while read -r line; do
  # Extract the species abbreviation
  name=$(echo "$line" | awk -F',' '{print $4}')
  protein_name="${name}_filteredTranscripts.fasta"
  
  echo "Processing species: $name"

  # Check if the protein file exists
  protein_path="$proteins_dir/$protein_name"
  if [ ! -f "$protein_path" ]; then
    echo "Protein file $protein_name not found in $proteins_dir. Skipping."
    continue
  fi

  # Check if the species has already been annotated
  annotated_file="$output_dir/${protein_name}.tsv"
  if [ -f "$annotated_file" ]; then
    echo "$annotated_file exists; skipping $name."
    continue
  fi

  # Backup original protein file and clean it
  cp "$protein_path" "$backup_dir/${protein_name}.bak"
  cp "$protein_path" "$interpro_dir/$protein_name"
  sed -i'.original' -e "s|\*||g" "$interpro_dir/$protein_name"
  rm "$interpro_dir/${protein_name}.original"

  # Run InterProScan
  echo "Running InterProScan for $protein_name..."
  cd "$interpro_dir"
  ./interproscan.sh -i "$protein_name" -d out/ -t p --goterms -appl Pfam -f TSV
  cd - > /dev/null

done < "$input_file"

# Combine all annotated results into a single file
cat "$output_dir"/*.tsv > "$interpro_dir/all_proteins.tsv"
echo "Annotation completed. Combined results stored in $interpro_dir/all_proteins.tsv."

# KinFin Preparation
kinfin_dir="./11_InterProScan/kinfin"
if [ ! -d "$kinfin_dir" ]; then
  echo "KinFin not installed. Please install KinFin and rerun this step."
  exit 1
fi

# Convert InterProScan results to KinFin-compatible format
echo "Preparing InterProScan results for KinFin..."
"$kinfin_dir/scripts/iprs2table.py" -i "$interpro_dir/all_proteins.tsv" --domain_sources Pfam

# Copy Orthofinder files to KinFin directory
cp 5_OrthoFinder/fasta/OrthoFinder/Results*/Orthogroups/Orthogroups.txt "$kinfin_dir/"
cp 5_OrthoFinder/fasta/OrthoFinder/Results*/WorkingDirectory/SequenceIDs.txt "$kinfin_dir/"
cp 5_OrthoFinder/fasta/OrthoFinder/Results*/WorkingDirectory/SpeciesIDs.txt "$kinfin_dir/"

# Create KinFin configuration file
echo '#IDX,TAXON' > "$kinfin_dir/config.txt"
sed 's/: /,/g' "$kinfin_dir/SpeciesIDs.txt" | cut -f 1 -d"." >> "$kinfin_dir/config.txt"

# Run KinFin functional annotation
echo "Running KinFin functional annotation..."
"$kinfin_dir/kinfin" --cluster_file "$kinfin_dir/Orthogroups.txt" \
  --config_file "$kinfin_dir/config.txt" \
  --sequence_ids_file "$kinfin_dir/SequenceIDs.txt" \
  --functional_annotation functional_annotation.txt

echo "Functional annotation completed."
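The KinFin configuration step in the script is compact, so here is what it does to a small example. The sketch assumes that SpeciesIDs.txt follows the usual OrthoFinder layout of "index: file name"; the two file names are hypothetical and the function name species_ids_to_config is ours.

```shell
#!/bin/bash
# species_ids_to_config SPECIES_IDS_FILE
# Turn OrthoFinder's "0: name.fasta" lines into KinFin's "#IDX,TAXON" table,
# using the same sed and cut calls as the script above.
species_ids_to_config() {
  echo '#IDX,TAXON'
  sed 's/: /,/g' "$1" | cut -f 1 -d"."
}

tmp=$(mktemp -d)
printf '0: Sgreg_filteredTranscripts.fasta\n1: Sscub_filteredTranscripts.fasta\n' \
  > "$tmp/SpeciesIDs.txt"
species_ids_to_config "$tmp/SpeciesIDs.txt"
rm -r "$tmp"
```

Note that cut keeps everything before the first dot, so a species file name that contains an additional dot would be truncated at that point.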

6. aBSREL

We will perform aBSREL analysis using both unlabelled and labelled phylogenies.

The unlabelled phylogenies will allow for an exploratory analysis, testing all Schistocerca for positive selection. While this approach provides a broad overview, it comes at the cost of reduced statistical power due to multiple testing.
In contrast, the labelled phylogenies will focus specifically on locust species compared to all other species, enabling us to determine whether locusts experience heightened selective pressures relative to other groups.
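The cost of testing every branch can be made concrete with a Holm-Bonferroni adjustment, the kind of correction applied when many branches are tested at once. aBSREL reports corrected p-values on its own, so the sketch below is purely illustrative; the function name holm_adjust and the three p-values are hypothetical.

```shell
#!/bin/bash
# holm_adjust: read one raw p-value per line on stdin and print
# "raw<TAB>adjusted" in ascending order (Holm-Bonferroni step-down).
holm_adjust() {
  sort -g | awk '
    { p[NR] = $1 }
    END {
      m = NR; running = 0
      for (i = 1; i <= m; i++) {
        adj = (m - i + 1) * p[i]           # multiply by number of remaining tests
        if (adj > 1) adj = 1               # cap at 1
        if (adj < running) adj = running   # keep adjusted values monotone
        running = adj
        printf "%s\t%.4g\n", p[i], adj
      }
    }'
}

printf '0.01\n0.04\n0.03\n' | holm_adjust
```

With three branches tested, a raw p-value of 0.03 becomes 0.06 and is no longer significant at 0.05. Restricting the test to the labelled locust branches reduces the number of tests and therefore the size of this penalty.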

Note: Recent versions of OrthoFinder list single-copy orthologues as hierarchical orthogroups, prefixing each name with N0.HOG instead of OG, for example: N0.HOG0000086 N0.HOG0000090 N0.HOG0000212 N0.HOG0000220 N0.HOG0000478 N0.HOG0000479 N0.HOG0000503 N0.HOG0000505

We therefore strip this prefix before running our scripts, using the command:

sed 's/^N0\.HOG/OG/' Orthogroups_SingleCopyOrthologues.txt > Orthogroups_SingleCopyOrthologues_renamed.txt
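A quick way to check the substitution is to run it on a couple of identifiers and confirm that no N0.HOG prefix remains. The wrapper name rename_hogs is hypothetical; the sed expression is the one used above.

```shell
#!/bin/bash
# rename_hogs: rewrite a leading "N0.HOG" to "OG" on every line of stdin.
rename_hogs() {
  sed 's/^N0\.HOG/OG/'
}

printf 'N0.HOG0000086\nN0.HOG0000090\n' | rename_hogs
# prints:
# OG0000086
# OG0000090

# Count the lines that still carry the old prefix (0 is expected).
printf 'N0.HOG0000086\nN0.HOG0000090\n' | rename_hogs | grep -c '^N0\.HOG' || true
```

The substitution only rewrites the prefix; it assumes that the files used downstream are named with the same numeric part.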

# For unlabelled phylogeny
sbatch ./scripts/RunaBSREL_May2025.sh \
/scratch/group/songlab/maeva/LocustsGenomeEvolution/Schistocerca_I2/5_OrthoFinder/fasta/Results_Jan15_I2/Resolved_Gene_Trees/ \
/scratch/group/songlab/maeva/LocustsGenomeEvolution/Schistocerca_I2/5_OrthoFinder/fasta/Results_Jan15_I2/Orthogroups/Orthogroups_SingleCopyOrthologues.txt 

# For labelled phylogeny
sbatch ./scripts/RunaBSREL_labeled_May2025.sh \
/scratch/group/songlab/maeva/LocustsGenomeEvolution/Schistocerca_I2/9_1_LabelledPhylogenies/Locusts \
Locusts \
/scratch/group/songlab/maeva/LocustsGenomeEvolution/Schistocerca_I2/5_OrthoFinder/fasta/Results_Jan15_I2/Orthogroups/Orthogroups_SingleCopyOrthologues.txt 

################################
# Polyneoptera
# For unlabelled phylogeny
sbatch ./scripts/RunaBSREL_May2025.sh \
/scratch/group/songlab/maeva/LocustsGenomeEvolution/Polyneoptera_FINAL/5_OrthoFinder/fasta/Results_May26_iqtree/Resolved_Gene_Trees/ \
/scratch/group/songlab/maeva/LocustsGenomeEvolution/Polyneoptera_FINAL/5_OrthoFinder/fasta/Results_May26_iqtree/Orthogroups/Orthogroups_SingleCopyOrthologues_renamed.txt 

# For labelled phylogeny
sbatch ./scripts/RunaBSREL_labeled_May2025.sh \
/scratch/group/songlab/maeva/LocustsGenomeEvolution/Polyneoptera_FINAL/9_1_LabelledPhylogenies/Locusts \
Locusts \
/scratch/group/songlab/maeva/LocustsGenomeEvolution/Polyneoptera_FINAL/5_OrthoFinder/fasta/Results_May26_iqtree/Orthogroups/Orthogroups_SingleCopyOrthologues_renamed.txt

To parse the results, run:

ml GCC/13.2.0  OpenMPI/4.1.6 R_tamu/4.4.1
export R_LIBS=$SCRATCH/R_LIBS_USER/

Rscript ./scripts/Parsing_aBSRELresulsr_unlabel.R

7. BUSTED

We will perform BUSTED analysis using both unlabelled and labelled phylogenies. The unlabelled phylogenies allow a gene-wide exploratory analysis that treats the entire Schistocerca tree as foreground.

# For unlabelled phylogeny
sbatch scripts/RunBUSTED_May2025.sh \
/scratch/group/songlab/maeva/LocustsGenomeEvolution/Schistocerca_I2/5_OrthoFinder/fasta/Results_Jan15_I2/Resolved_Gene_Trees \
/scratch/group/songlab/maeva/LocustsGenomeEvolution/Schistocerca_I2/5_OrthoFinder/fasta/Results_Jan15_I2/Orthogroups/Orthogroups_SingleCopyOrthologues.txt 

# For labelled phylogeny
sbatch ./scripts/RunBUSTED_labeled_May2025.sh \
/scratch/group/songlab/maeva/LocustsGenomeEvolution/Schistocerca_I2/9_1_LabelledPhylogenies/Locusts \
Locusts \
/scratch/group/songlab/maeva/LocustsGenomeEvolution/Schistocerca_I2/5_OrthoFinder/fasta/Results_Jan15_I2/Orthogroups/Orthogroups_SingleCopyOrthologues.txt 

################################
# Polyneoptera
# For unlabelled phylogeny
sbatch scripts/RunBUSTED_May2025.sh  \
/scratch/group/songlab/maeva/LocustsGenomeEvolution/Polyneoptera_FINAL/5_OrthoFinder/fasta/Results_May26_iqtree/Resolved_Gene_Trees/ \
/scratch/group/songlab/maeva/LocustsGenomeEvolution/Polyneoptera_FINAL/5_OrthoFinder/fasta/Results_May26_iqtree/Orthogroups/Orthogroups_SingleCopyOrthologues_renamed.txt 

# For labelled phylogeny
sbatch ./scripts/RunBUSTED_labeled_May2025.sh \
/scratch/group/songlab/maeva/LocustsGenomeEvolution/Polyneoptera_FINAL/9_1_LabelledPhylogenies/Locusts \
Locusts \
/scratch/group/songlab/maeva/LocustsGenomeEvolution/Polyneoptera_FINAL/5_OrthoFinder/fasta/Results_May26_iqtree/Orthogroups/Orthogroups_SingleCopyOrthologues_renamed.txt

8. RELAX

We will perform RELAX analysis using labelled phylogenies only. RELAX compares a designated set of test branches against reference branches, so it is run on the phylogenies in which the locust branches carry the Foreground label.

# For labelled phylogeny
sbatch ./scripts/RunRELAX_labeled_May2025.sh \
/scratch/group/songlab/maeva/LocustsGenomeEvolution/Schistocerca_I2/9_1_LabelledPhylogenies/Locusts \
Locusts \
/scratch/group/songlab/maeva/LocustsGenomeEvolution/Schistocerca_I2/5_OrthoFinder/fasta/Results_Jan15_I2/Orthogroups/Orthogroups_SingleCopyOrthologues.txt 

################################
# Polyneoptera
# For labelled phylogeny
sbatch ./scripts/RunRELAX_labeled_May2025.sh \
/scratch/group/songlab/maeva/LocustsGenomeEvolution/Polyneoptera_FINAL/9_1_LabelledPhylogenies/Locusts \
Locusts \
/scratch/group/songlab/maeva/LocustsGenomeEvolution/Polyneoptera_FINAL/5_OrthoFinder/fasta/Results_May26_iqtree/Orthogroups/Orthogroups_SingleCopyOrthologues_renamed.txt

sessionInfo()

R version 4.4.2 (2024-10-31)
Platform: aarch64-apple-darwin20
Running under: macOS Sequoia 15.5

Matrix products: default
BLAS:   /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRblas.0.dylib 
LAPACK: /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRlapack.dylib;  LAPACK version 3.12.0

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

time zone: Asia/Tokyo
tzcode source: internal

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] workflowr_1.7.1

loaded via a namespace (and not attached):
 [1] vctrs_0.6.5       httr_1.4.7        cli_3.6.5         knitr_1.49       
 [5] rlang_1.1.6       xfun_0.51         stringi_1.8.4     processx_3.8.6   
 [9] promises_1.3.2    jsonlite_1.9.1    glue_1.8.0        rprojroot_2.0.4  
[13] git2r_0.35.0      htmltools_0.5.8.1 httpuv_1.6.15     ps_1.9.0         
[17] sass_0.4.9        rmarkdown_2.29    jquerylib_0.1.4   tibble_3.2.1     
[21] evaluate_1.0.3    fastmap_1.2.0     yaml_2.3.10       lifecycle_1.0.4  
[25] whisker_0.4.1     stringr_1.5.1     compiler_4.4.2    fs_1.6.5         
[29] pkgconfig_2.0.3   Rcpp_1.0.14       rstudioapi_0.17.1 later_1.4.1      
[33] digest_0.6.37     R6_2.6.1          pillar_1.10.2     callr_3.7.6      
[37] magrittr_2.0.3    bslib_0.9.0       tools_4.4.2       cachem_1.1.0     
[41] getPass_0.2-4

Signatures of selection: Did locusts or specific genes experience positive selection?

Maeva Techer

2025-06-05