Last updated: 2025-07-01

Checks: 7 passed, 0 failed

Knit directory: locust-comparative-genomics/

This reproducible R Markdown analysis was created with workflowr (version 1.7.1). The Checks tab describes the reproducibility checks that were applied when the results were created. The Past versions tab lists the development history.


Great! Since the R Markdown file has been committed to the Git repository, you know the exact version of the code that produced these results.

Great job! The global environment was empty. Objects defined in the global environment can affect the analysis in your R Markdown file in unknown ways. For reproducibility it’s best to always run the code in an empty environment.

The command set.seed(20221025) was run prior to running the code in the R Markdown file. Setting a seed ensures that any results that rely on randomness, e.g. subsampling or permutations, are reproducible.

Great job! Recording the operating system, R version, and package versions is critical for reproducibility.

Nice! There were no cached chunks for this analysis, so you can be confident that you successfully produced the results during this run.

Great job! Using relative paths to the files within your workflowr project makes it easier to run your code on other machines.

Great! You are using Git for version control. Tracking code development and connecting the code version to the results is critical for reproducibility.

The results in this page were generated with repository version 0c8f4b0. See the Past versions tab to see a history of the changes made to the R Markdown and HTML files.

Note that you need to be careful to ensure that all relevant files for the analysis have been committed to Git prior to generating the results (you can use wflow_publish or wflow_git_commit). workflowr only checks the R Markdown file, but you know if there are other scripts or data files that it depends on. Below is the status of the Git repository when the results were generated:


Ignored files:
    Ignored:    .DS_Store
    Ignored:    analysis/.DS_Store
    Ignored:    analysis/.Rhistory
    Ignored:    analysis/figure/
    Ignored:    code/.DS_Store
    Ignored:    code/scripts/.DS_Store
    Ignored:    code/scripts/pal2nal.v14/.DS_Store
    Ignored:    data/.DS_Store
    Ignored:    data/DEG_results/.DS_Store
    Ignored:    data/DEG_results/Bulk_RNAseq/.DS_Store
    Ignored:    data/DEG_results/Bulk_RNAseq/americana/.DS_Store
    Ignored:    data/DEG_results/Bulk_RNAseq/cancellata/.DS_Store
    Ignored:    data/DEG_results/Bulk_RNAseq/cubense/.DS_Store
    Ignored:    data/DEG_results/Bulk_RNAseq/gregaria/.DS_Store
    Ignored:    data/DEG_results/Bulk_RNAseq/nitens/.DS_Store
    Ignored:    data/HYPHY_selection/.DS_Store
    Ignored:    data/HYPHY_selection/ParsedABSRELResults_unlabeled/
    Ignored:    data/HYPHY_selection/pathway_enrichment/.DS_Store
    Ignored:    data/HYPHY_selection/pathway_enrichment/americana/
    Ignored:    data/HYPHY_selection/pathway_enrichment/cancellata/
    Ignored:    data/HYPHY_selection/pathway_enrichment/cubense/
    Ignored:    data/HYPHY_selection/pathway_enrichment/nitens/
    Ignored:    data/HYPHY_selection/pathway_enrichment/piceifrons/
    Ignored:    data/WGCNA/.DS_Store
    Ignored:    data/WGCNA/input/.DS_Store
    Ignored:    data/WGCNA/input/Bulk_RNAseq/.DS_Store
    Ignored:    data/WGCNA/output/.DS_Store
    Ignored:    data/WGCNA/output/Bulk_RNAseq/.DS_Store
    Ignored:    data/WGCNA/output/Bulk_RNAseq/gregaria/.DS_Store
    Ignored:    data/behavioral_data/.DS_Store
    Ignored:    data/behavioral_data/Raw_data/.DS_Store
    Ignored:    data/cafe5_results/.DS_Store
    Ignored:    data/list/.DS_Store
    Ignored:    data/list/Bulk_RNAseq/.DS_Store
    Ignored:    data/list/GO_Annotations/.DS_Store
    Ignored:    data/list/GO_Annotations/DesertLocustR/.DS_Store
    Ignored:    data/list/excluded_loci/.DS_Store
    Ignored:    data/orthofinder/.DS_Store
    Ignored:    data/orthofinder/Polyneoptera/.DS_Store
    Ignored:    data/orthofinder/Polyneoptera/Results_I2_iqtree/.DS_Store
    Ignored:    data/orthofinder/Polyneoptera/Results_I2_iqtree/Orthogroups/.DS_Store
    Ignored:    data/orthofinder/Polyneoptera/Results_I2_withDaust/.DS_Store
    Ignored:    data/orthofinder/Polyneoptera/Results_I2_withDaust/Orthogroups/.DS_Store
    Ignored:    data/orthofinder/Schistocerca/.DS_Store
    Ignored:    data/orthofinder/Schistocerca/Results_I2/.DS_Store
    Ignored:    data/orthofinder/Schistocerca/Results_I2/Orthogroups/.DS_Store
    Ignored:    data/overlap/.DS_Store
    Ignored:    data/pathway_enrichment/.DS_Store
    Ignored:    data/pathway_enrichment/OLD/.DS_Store
    Ignored:    data/pathway_enrichment/OLD/custom_sgregaria_orgdb/.DS_Store
    Ignored:    data/pathway_enrichment/REVIGO_results/.DS_Store
    Ignored:    data/pathway_enrichment/REVIGO_results/BP/.DS_Store
    Ignored:    data/pathway_enrichment/REVIGO_results/CC/.DS_Store
    Ignored:    data/pathway_enrichment/REVIGO_results/MF/.DS_Store
    Ignored:    data/pathway_enrichment/cancellata/.DS_Store
    Ignored:    data/pathway_enrichment/gregaria/.DS_Store
    Ignored:    data/pathway_enrichment/nitens/Thorax/
    Ignored:    data/pathway_enrichment/piceifrons/.DS_Store
    Ignored:    data/readcounts/.DS_Store
    Ignored:    data/readcounts/Bulk_RNAseq/.DS_Store
    Ignored:    data/readcounts/RNAi/.DS_Store

Untracked files:
    Untracked:  data/RefSeq/

Unstaged changes:
    Modified:   data/orthofinder/Polyneoptera/Results_I2_iqtree/Orthogroups/Orthogroups_UnassignedGenes_reprocessed.tsv
    Modified:   data/orthofinder/Polyneoptera/Results_I2_iqtree/Orthogroups/Orthogroups_reprocessed.tsv

Note that any generated files, e.g. HTML, png, CSS, etc., are not included in this status report because it is ok for generated content to have uncommitted changes.


These are the previous versions of the repository in which changes were made to the R Markdown (analysis/2_hic-snps-phylogeny.Rmd) and HTML (docs/2_hic-snps-phylogeny.html) files. If you’ve configured a remote Git repository (see ?wflow_git_remote), click on the hyperlinks in the table below to view the files as they were in that past version.

File Version Author Date Message
html 0c8f4b0 Maeva TECHER 2025-07-01 Build site.
Rmd a2d2955 Maeva TECHER 2025-07-01 Updated wgcna and compiling
html a2d2955 Maeva TECHER 2025-07-01 Updated wgcna and compiling
html 3d6f00f Maeva TECHER 2025-06-05 Build site.
html 3e696d6 Maeva TECHER 2025-06-05 Adding ortho heatmap
html 285370e Maeva TECHER 2025-06-04 Build site.
Rmd cacc1db Maeva TECHER 2025-05-02 updates files

1. Generating the data

Erez, Olga, and their team created Hi-C libraries from 44 samples of the genus Schistocerca from Hojun’s ethanol collection. We want to use the SNPs shared across these species to assess the phylogenetic relationships within Schistocerca using genome-wide variants.

[ASK OLGA TO ADD MORE DETAILS ABOUT SEQUENCING AND READS PREPROCESSING]. The reads were then mapped against the S. piceifrons genome and hard-filtered using DRAGEN at BCM.

From the VCF header we can see that DRAGEN, the variant caller used, was run with the following commands:

DRAGENCommandLine=<ID=HashTableBuild,Version="SW: 01.003.044.3.10.11, HashTableVersion: 8",
CommandLineOptions="dragen --build-hash-table true --ht-num-threads 40 --ht-mem-limit 32Gb --ht-reference ./reference/GCF_021461385.2_iqSchPice1.1_genomic.fna --output-dir ./reference"

DRAGENCommandLine=<ID=dragen,Version="SW: 07.021.624.3.10.11, HW: 07.021.624",Date="Mon Feb 10 16:06:12 CST 2025",
CommandLineOptions="-r ./reference --fastq-list all_fastq.csv --fastq-list-sample-id HIC17106 --output-directory ./vc_HIC17106 --output-file-prefix HIC17106 --enable-auto-multifile false --combine-samples-by-name false --enable-map-align true --enable-sort false --enable-variant-caller true --enable-duplicate-marking true --enable-map-align-output false --output-format bam --Aligner.unpaired-pen=0 --enable-vcf-indexing false"

The data was then hard-filtered: SNP QUAL must be at least 10.41, indel QUAL at least 7.83, and the depth at each variant must be greater than 1.

FILTER=<ID=DRAGENSnpHardQUAL,Description="Set if true:QUAL < 10.41">
FILTER=<ID=DRAGENIndelHardQUAL,Description="Set if true:QUAL < 7.83">
FILTER=<ID=LowDepth,Description="Set if true:DP <= 1">
FILTER=<ID=PloidyConflict,Description="Genotype call from variant caller not consistent with chromosome ploidy">
FILTER=<ID=RMxNRepeatRegion,Description="Site filtered because all or part of the variant allele is a repeat of the reference">
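As a sketch of what the SNP hard-QUAL rule amounts to, here is an illustrative awk pass over a made-up, tab-separated VCF body (QUAL in column 6; the records are fabricated for the demo):

```shell
# demo VCF body: CHROM POS ID REF ALT QUAL (tab-separated, made up)
printf '1\t100\t.\tA\tT\t25.3\n1\t200\t.\tG\tC\t8.2\n' > body.tsv

# append the filter verdict: QUAL < 10.41 fails DRAGENSnpHardQUAL
awk -F'\t' '{ print $0 "\t" ($6 < 10.41 ? "DRAGENSnpHardQUAL" : "PASS") }' body.tsv
```

Here the second record (QUAL 8.2) is flagged while the first passes.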

2. Downloading the hard-filtered data

The data was shared via Dropbox. To transfer it directly to the GRACE cluster, we followed the steps below:

  1. We copied the link address of each library into a file_links.txt and replaced dl=0 with dl=1:
https://www.dropbox.com/scl/fo/a3hvph0rop31gtaoucj8v/AC5daUrN-dBAydjSKdvq3KE/HIC17106.hard-filtered.vcf.gz?rlkey=0bkducf1e7pxil6fi3sdb7fat&dl=0
https://www.dropbox.com/scl/fo/a3hvph0rop31gtaoucj8v/AIun42Q_haxfprCT6QjHmUA/HIC17107.hard-filtered.vcf.gz?rlkey=0bkducf1e7pxil6fi3sdb7fat&dl=0
https://www.dropbox.com/scl/fo/a3hvph0rop31gtaoucj8v/AIaUDoe9_0kzXsr1wCYKoHo/HIC17108.hard-filtered.vcf.gz?rlkey=0bkducf1e7pxil6fi3sdb7fat&dl=0
(...)
https://www.dropbox.com/scl/fo/a3hvph0rop31gtaoucj8v/AC5daUrN-dBAydjSKdvq3KE/HIC17106.hard-filtered.vcf.gz?rlkey=0bkducf1e7pxil6fi3sdb7fat&dl=1
https://www.dropbox.com/scl/fo/a3hvph0rop31gtaoucj8v/AIun42Q_haxfprCT6QjHmUA/HIC17107.hard-filtered.vcf.gz?rlkey=0bkducf1e7pxil6fi3sdb7fat&dl=1
https://www.dropbox.com/scl/fo/a3hvph0rop31gtaoucj8v/AIaUDoe9_0kzXsr1wCYKoHo/HIC17108.hard-filtered.vcf.gz?rlkey=0bkducf1e7pxil6fi3sdb7fat&dl=1
(...)
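The dl=0 to dl=1 swap can be done in one pass with sed rather than by hand (GNU sed shown; the URL below is a placeholder, not a real share link):

```shell
# demo input: one placeholder link (the real file holds all 44 libraries)
printf 'https://example.com/HIC17106.hard-filtered.vcf.gz?rlkey=demo&dl=0\n' > file_links.txt

# flip the trailing download flag in place so wget fetches the file directly
sed -i 's/dl=0$/dl=1/' file_links.txt
```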
  2. We can download each file individually with the following command:
wget --no-check-certificate "https://www.dropbox.com/scl/fo/a3hvph0rop31gtaoucj8v/AKkVh4m_l_VtHP8zTHcPlyM?e=2&preview=HIC17106.hard-filtered.vcf.gz&rlkey=0bkducf1e7pxil6fi3sdb7fat&st=wihwy6ah&dl=1" -O HIC17106.hard-filtered.vcf.gz

or use directly the file_links.txt:

wget --no-check-certificate -i file_links.txt

# the downloaded files keep their URL query strings, so strip everything after '?'
for file in *\?*; do
    clean=$(echo "$file" | sed "s/'//g" | sed 's/\?.*//')
    mv "$file" "$clean"
done
  3. Then we need to build an index for each file before running the downstream pipeline:
srun --ntasks 1 --cpus-per-task 16 --mem 50G --time 04:00:00 --pty bash
ml GCC/13.2.0 BCFtools/1.19

# Loop over each .vcf.gz file
for vcf in *.hard-filtered.vcf.gz; do
  echo "Indexing $vcf..."
  bcftools index --csi --threads 16 "$vcf"
done
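Before launching the pipeline, it can help to verify that every VCF actually got its index. A small sketch that just reports any missing .csi files:

```shell
# report any hard-filtered VCF that lacks a .csi index next to it
for vcf in *.hard-filtered.vcf.gz; do
  [ -f "${vcf}.csi" ] || echo "missing index: $vcf"
done
```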

3. Merging and filtering common SNPs

We created a Snakemake pipeline to streamline the subsequent VCF merging, common-SNP filtering, and FASTA file generation. We just need to launch the first part of our Snakemake file using sbatch snakemake.slurm:

#!/bin/bash

#SBATCH --job-name=snakemake
##SBATCH --partition=short
#SBATCH --time="7-00:00:00"
#SBATCH --mem=50GB
#SBATCH --cpus-per-task=1
#SBATCH --ntasks=1

#module load GCC/11.2.0  OpenMPI/4.1.1 snakemake/6.10.0
ml GCC/11.3.0  OpenMPI/4.1.4 snakemake/7.22.0

snakemake --latency-wait 3600 --restart-times 1 -j 200 -p \
--max-jobs-per-second 1 \
--cluster-config cluster.json \
--cluster "sbatch -p {cluster.partition} --ntasks {cluster.ntasks} --cpus-per-task {cluster.cpus-per-task} -t {cluster.time} --mem {cluster.mem}" \
--notemp --nolock \
--rerun-incomplete

At this stage we are interested in the rules that merge the VCFs and keep only biallelic SNPs with a minimum allele count of 1. All indels are filtered out, and we compute statistics to decide on further filtering of sites with high missingness.

### =================================================================
### MERGE VCFs
### =================================================================

rule merge_vcfs:
    input:
        vcfs = expand("{workdir}/00-raw-vcf/{sample}.hard-filtered.vcf.gz",
                      sample=config["VCF_SAMPLES"], workdir=config["WORKDir"]),
        indices = expand("{workdir}/00-raw-vcf/{sample}.hard-filtered.vcf.gz.csi",
                         sample=config["VCF_SAMPLES"], workdir=config["WORKDir"])
    output:
        vcf = config["WORKDir"] + "/01-merged-vcf/all_samples.vcf.gz",
        index = config["WORKDir"] + "/01-merged-vcf/all_samples.vcf.gz.csi"
    shell:
        """
        module purge
        ml GCC/13.2.0 BCFtools/1.19

        bcftools merge --force-samples -Oz -o {output.vcf} {input.vcfs}
        bcftools index -c {output.vcf}
        """

### =================================================================
### FILTER MERGED VCF (biallelic SNPs, no indels, QUAL >= 30, depth 4-14, missingness < 30%)
### =================================================================

rule filter_snps:
    input:
        vcf = config["WORKDir"] + "/01-merged-vcf/all_samples.vcf.gz"
    output:
        vcf = config["WORKDir"] + "/02-filtered-vcf/filtered_snps.vcf.gz",
        index = config["WORKDir"] + "/02-filtered-vcf/filtered_snps.vcf.gz.csi"
    params: 
        cutoff = "--max-meanDP 14",
        filters = "--remove-indels --minDP 4 --minQ 30 --max-missing 0.7 --max-alleles 2",
        span = "--chr NC_060138.1 --chr NC_060139.1 --chr NC_060140.1 --chr NC_060141.1 --chr NC_060142.1 --chr NC_060143.1 --chr NC_060144.1 --chr NC_060145.1 --chr NC_060146.1 --chr NC_060147.1 --chr NC_060148.1 --chr NC_060149.1"
    shell:
        """
        module purge
        ml GCC/13.2.0 VCFtools/0.1.16 BCFtools/1.19

        vcftools --gzvcf {input.vcf} {params.span} {params.filters} {params.cutoff} --recode --recode-INFO-all --stdout | bgzip --threads 4 > {output.vcf}
        bcftools index -c {output.vcf}
        """

### =================================================================
### SNP STATS REPORT
### =================================================================

rule stats_snps:
    input:
        vcf = config["WORKDir"] + "/02-filtered-vcf/filtered_snps.vcf.gz"
    output:
        depth = config["WORKDir"] + "/02-filtered-vcf/snps.idepth",
        site_depth = config["WORKDir"] + "/02-filtered-vcf/snps.ldepth.mean",
        miss_indv = config["WORKDir"] + "/02-filtered-vcf/snps.imiss",
        miss_site = config["WORKDir"] + "/02-filtered-vcf/snps.lmiss"
    params: 
        span = "--chr NC_060138.1 --chr NC_060139.1 --chr NC_060140.1 --chr NC_060141.1 --chr NC_060142.1 --chr NC_060143.1 --chr NC_060144.1 --chr NC_060145.1 --chr NC_060146.1 --chr NC_060147.1 --chr NC_060148.1 --chr NC_060149.1"
    shell:
        """
        module purge
        ml GCC/13.2.0 VCFtools/0.1.16

        echo "Calculating depth..."
        vcftools --gzvcf {input.vcf} {params.span} --depth --out {config[WORKDir]}/02-filtered-vcf/snps

        echo "Calculating site mean depth..."
        vcftools --gzvcf {input.vcf} {params.span} --site-mean-depth --out {config[WORKDir]}/02-filtered-vcf/snps

        echo "Calculating individual missing data..."
        vcftools --gzvcf {input.vcf} {params.span} --missing-indv --out {config[WORKDir]}/02-filtered-vcf/snps

        echo "Calculating site missing data..."
        vcftools --gzvcf {input.vcf} {params.span} --missing-site --out {config[WORKDir]}/02-filtered-vcf/snps
        """
So we have merged the SNPs. To decide on sensible thresholds, we next look at the depth and missingness distributions.

module purge
ml GCC/12.3.0 vcflib/1.0.9-R-4.3.2 BCFtools/1.18

We subsample the SNPs (keeping roughly 2% of sites) so the statistics can be computed quickly:

bcftools view all_samples.vcf.gz | vcfrandomsample -r 0.02 > all_samples_subset.vcf
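A quick sanity check on the subset is to count the variant records, i.e. every non-header line. The demo below fabricates a tiny VCF just to show the command:

```shell
# demo: tiny VCF with a two-line header and three records
printf '##fileformat=VCFv4.2\n#CHROM\tPOS\n1\t100\n1\t200\n2\t50\n' > all_samples_subset.vcf

# count variant records (every line not starting with '#')
grep -vc '^#' all_samples_subset.vcf   # → 3
```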


module purge
ml GCC/13.2.0  OpenMPI/4.1.6 R_tamu/4.4.1
export R_LIBS=$SCRATCH/R_LIBS_USER/

Then, in an R session:
library(tidyverse)
library(ggplot2)

# Load and parse data
var_depth <- read_delim("all_samples.ldepth.mean", delim = "\t", 
                        col_names = c("chr", "pos", "mean_depth", "var_depth"), skip = 1)

# Compute the mean
summary(var_depth$mean_depth)
mean_val <- mean(var_depth$mean_depth, na.rm = TRUE)

# Base plot with vertical mean line
a <- ggplot(var_depth, aes(mean_depth)) +
  geom_density(fill = "dodgerblue1", colour = "black", alpha = 0.3) +
  geom_vline(xintercept = mean_val, colour = "red", linetype = "dashed") +
  theme_light()

# Open a multi-page PDF
pdf("mean_depth_density.pdf", width = 6, height = 4)

# Page 1: full density plot
print(a)

# Page 2: zoomed in
a_zoom <- a + xlim(0, 100)
print(a_zoom)

# Close device
dev.off()


###############

# Load missingness data
var_miss <- read_delim("all_samples.lmiss", delim = "\t", 
                       col_names = c("chr", "pos", "nchr", "nfiltered", "nmiss", "fmiss"), 
                       skip = 1)

# Calculate mean missingness
summary(var_miss$fmiss)
mean_fmiss <- mean(var_miss$fmiss, na.rm = TRUE)

# Base plot with vertical mean line
a <- ggplot(var_miss, aes(fmiss)) + 
  geom_density(fill = "dodgerblue1", colour = "black", alpha = 0.3) +
  geom_vline(xintercept = mean_fmiss, colour = "red", linetype = "dashed") +
  theme_light()

# Open multi-page PDF
pdf("variant_missingness.pdf", width = 6, height = 4)

# Page 1: full density plot
print(a)

# Page 2: zoomed in
a_zoom <- a + xlim(0, 1)  # fmiss is a fraction (0–1), not 0–100
print(a_zoom)

# Close device
dev.off()
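The mean per-site missingness computed in R above can be spot-checked from the shell, assuming the standard vcftools .lmiss layout (F_MISS in column 6; the demo file here is made up):

```shell
# demo .lmiss: header row plus two sites with F_MISS 0.10 and 0.30
printf 'CHR\tPOS\tN_DATA\tN_GENOTYPE_FILTERED\tN_MISS\tF_MISS\n1\t100\t88\t0\t4\t0.10\n1\t200\t88\t0\t8\t0.30\n' > demo.lmiss

# mean F_MISS across sites, skipping the header
awk 'NR > 1 { s += $6; n++ } END { printf "%.2f\n", s / n }' demo.lmiss   # → 0.20
```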

###############

# Load individual depth data
ind_depth <- read_delim("all_samples.idepth", delim = "\t",
                        col_names = c("ind", "nsites", "depth"), skip = 1)

# Calculate mean depth per individual
mean_depth <- mean(ind_depth$depth, na.rm = TRUE)

# Create histogram with vertical mean line
a <- ggplot(ind_depth, aes(depth)) +
  geom_histogram(fill = "dodgerblue1", colour = "black", alpha = 0.3, bins = 30) +
  geom_vline(xintercept = mean_depth, colour = "red", linetype = "dashed") +
  theme_light()

# Open PDF for two pages
pdf("individual_depth_distribution.pdf", width = 6, height = 4)

# Page 1: full view
print(a)

# Page 2: zoomed in (e.g., 0 to 100)
a_zoom <- a + xlim(0, 100)
print(a_zoom)

# Close the PDF device
dev.off()

##################

# Load individual missingness data
ind_miss <- read_delim("all_samples.imiss", delim = "\t",
                       col_names = c("ind", "ndata", "nfiltered", "nmiss", "fmiss"),
                       skip = 1)

# Calculate mean missingness
mean_fmiss <- mean(ind_miss$fmiss, na.rm = TRUE)

# Create histogram with vertical line
a <- ggplot(ind_miss, aes(fmiss)) +
  geom_histogram(fill = "dodgerblue1", colour = "black", alpha = 0.3, bins = 30) +
  geom_vline(xintercept = mean_fmiss, colour = "red", linetype = "dashed") +
  theme_light()

# Save to multi-page PDF
pdf("individual_missingness_distribution.pdf", width = 6, height = 4)

# Page 1: full histogram
print(a)

# Page 2: zoomed-in view
a_zoom <- a + xlim(0, 1)
print(a_zoom)

# Close PDF device
dev.off()

sessionInfo()
R version 4.4.2 (2024-10-31)
Platform: aarch64-apple-darwin20
Running under: macOS Sequoia 15.5

Matrix products: default
BLAS:   /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRblas.0.dylib 
LAPACK: /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRlapack.dylib;  LAPACK version 3.12.0

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

time zone: Asia/Tokyo
tzcode source: internal

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] workflowr_1.7.1

loaded via a namespace (and not attached):
 [1] vctrs_0.6.5       httr_1.4.7        cli_3.6.5         knitr_1.50       
 [5] rlang_1.1.6       xfun_0.52         stringi_1.8.7     processx_3.8.6   
 [9] promises_1.3.3    jsonlite_2.0.0    glue_1.8.0        rprojroot_2.0.4  
[13] git2r_0.36.2      htmltools_0.5.8.1 httpuv_1.6.16     ps_1.9.1         
[17] sass_0.4.10       rmarkdown_2.29    tibble_3.3.0      jquerylib_0.1.4  
[21] evaluate_1.0.4    fastmap_1.2.0     yaml_2.3.10       lifecycle_1.0.4  
[25] whisker_0.4.1     stringr_1.5.1     compiler_4.4.2    fs_1.6.6         
[29] pkgconfig_2.0.3   Rcpp_1.0.14       rstudioapi_0.17.1 later_1.4.2      
[33] digest_0.6.37     R6_2.6.1          pillar_1.10.2     callr_3.7.6      
[37] magrittr_2.0.3    bslib_0.9.0       tools_4.4.2       cachem_1.1.0     
[41] getPass_0.2-4