Last updated: 2025-06-05
Checks: 6 1
Knit directory:
locust-comparative-genomics/
This reproducible R Markdown analysis was created with workflowr (version 1.7.1). The Checks tab describes the reproducibility checks that were applied when the results were created. The Past versions tab lists the development history.
Great! Since the R Markdown file has been committed to the Git repository, you know the exact version of the code that produced these results.
Great job! The global environment was empty. Objects defined in the global environment can affect the analysis in your R Markdown file in unknown ways. For reproduciblity it’s best to always run the code in an empty environment.
The command set.seed(20221025) was run prior to running
the code in the R Markdown file. Setting a seed ensures that any results
that rely on randomness, e.g. subsampling or permutations, are
reproducible.
Great job! Recording the operating system, R version, and package versions is critical for reproducibility.
Nice! There were no cached chunks for this analysis, so you can be confident that you successfully produced the results during this run.
Using absolute paths to the files within your workflowr project makes it difficult for you and others to run your code on a different machine. Change the absolute path(s) below to the suggested relative path(s) to make your code more reproducible.
| absolute | relative |
|---|---|
| /Users/maevatecher/Documents/GitHub/locust-comparative-genomics/data/metadata/Stats_RNAseq_QC_19Feb2024.txt | data/metadata/Stats_RNAseq_QC_19Feb2024.txt |
Great! You are using Git for version control. Tracking code development and connecting the code version to the results is critical for reproducibility.
The results in this page were generated with repository version fb90bdd. See the Past versions tab to see a history of the changes made to the R Markdown and HTML files.
Note that you need to be careful to ensure that all relevant files for
the analysis have been committed to Git prior to generating the results
(you can use wflow_publish or
wflow_git_commit). workflowr only checks the R Markdown
file, but you know if there are other scripts or data files that it
depends on. Below is the status of the Git repository when the results
were generated:
Ignored files:
Ignored: .DS_Store
Ignored: analysis/.DS_Store
Ignored: analysis/.Rhistory
Ignored: code/.DS_Store
Ignored: code/scripts/.DS_Store
Ignored: code/scripts/pal2nal.v14/.DS_Store
Ignored: data/.DS_Store
Ignored: data/DEG_results/.DS_Store
Ignored: data/DEG_results/Bulk_RNAseq/.DS_Store
Ignored: data/DEG_results/Bulk_RNAseq/americana/.DS_Store
Ignored: data/DEG_results/Bulk_RNAseq/cancellata/.DS_Store
Ignored: data/DEG_results/Bulk_RNAseq/cancellata/Thorax/.DS_Store
Ignored: data/DEG_results/Bulk_RNAseq/cubense/.DS_Store
Ignored: data/DEG_results/Bulk_RNAseq/davidO/.DS_Store
Ignored: data/DEG_results/Bulk_RNAseq/gregaria/.DS_Store
Ignored: data/DEG_results/Bulk_RNAseq/nitens/.DS_Store
Ignored: data/DEG_results/Bulk_RNAseq/piceifrons/.DS_Store
Ignored: data/DEG_results/RNAi/.DS_Store
Ignored: data/DEG_results/RNAi/All/.DS_Store
Ignored: data/DEG_results/RNAi/All_GFP/.DS_Store
Ignored: data/DEG_results/RNAi/All_control/.DS_Store
Ignored: data/DEG_results/RNAi/All_no_rRNA/.DS_Store
Ignored: data/DEG_results/RNAi/Head/.DS_Store
Ignored: data/DEG_results/RNAi/Head_control/.DS_Store
Ignored: data/DEG_results/RNAi/Head_no_rRNA/.DS_Store
Ignored: data/DEG_results/RNAi/Thorax/.DS_Store
Ignored: data/DEG_results/RNAi/Thorax_no_rRNA/.DS_Store
Ignored: data/DEG_results/gregaria/
Ignored: data/DEG_results/single_cell/.DS_Store
Ignored: data/WGCNA/.DS_Store
Ignored: data/WGCNA/input/.DS_Store
Ignored: data/WGCNA/input/Bulk_RNAseq/.DS_Store
Ignored: data/WGCNA/output/.DS_Store
Ignored: data/WGCNA/output/Bulk_RNAseq/.DS_Store
Ignored: data/behavioral_data/.DS_Store
Ignored: data/behavioral_data/Raw_data/.DS_Store
Ignored: data/list/.DS_Store
Ignored: data/list/Bulk_RNAseq/.DS_Store
Ignored: data/list/GO_Annotations/.DS_Store
Ignored: data/list/excluded_loci/.DS_Store
Ignored: data/orthofinder/.DS_Store
Ignored: data/orthofinder/Polyneoptera/.DS_Store
Ignored: data/orthofinder/Polyneoptera/Results_I2_iqtree/.DS_Store
Ignored: data/orthofinder/Polyneoptera/Results_I2_withDaust/.DS_Store
Ignored: data/orthofinder/Polyneoptera/Results_I2_withDaust/Orthogroups/.DS_Store
Ignored: data/orthofinder/Schistocerca/.DS_Store
Ignored: data/orthofinder/Schistocerca/Results_I2/.DS_Store
Ignored: data/orthofinder/Schistocerca/Results_I2/Orthogroups/.DS_Store
Ignored: data/overlap/.DS_Store
Ignored: data/overlap/Bulk_RNAseq/.DS_Store
Ignored: data/overlap/Bulk_RNAseq/cancellata/
Ignored: data/pathway_enrichment/.DS_Store
Ignored: data/pathway_enrichment/custom_sgregaria_orgdb/.DS_Store
Ignored: data/readcounts/.DS_Store
Ignored: data/readcounts/Bulk_RNAseq/.DS_Store
Ignored: data/readcounts/RNAi/.DS_Store
Untracked files:
Untracked: data/RefSeq/
Note that any generated files, e.g. HTML, png, CSS, etc., are not included in this status report because it is ok for generated content to have uncommitted changes.
These are the previous versions of the repository in which changes were
made to the R Markdown (analysis/3_seq-data-qc.Rmd) and
HTML (docs/3_seq-data-qc.html) files. If you’ve configured
a remote Git repository (see ?wflow_git_remote), click on
the hyperlinks in the table below to view the files as they were in that
past version.
| File | Version | Author | Date | Message |
|---|---|---|---|---|
| Rmd | b982319 | Maeva TECHER | 2025-03-03 | update font |
| html | b982319 | Maeva TECHER | 2025-03-03 | update font |
| html | 474315f | Maeva TECHER | 2025-02-27 | Build site. |
| Rmd | faf2db3 | Maeva TECHER | 2025-01-13 | update markdown |
| html | faf2db3 | Maeva TECHER | 2025-01-13 | update markdown |
| html | 3fa8e62 | Maeva TECHER | 2024-11-09 | updated analysis |
| Rmd | edb70fe | Maeva TECHER | 2024-11-08 | overlap and deg results created |
| html | edb70fe | Maeva TECHER | 2024-11-08 | overlap and deg results created |
| html | 7f1d1fe | Maeva TECHER | 2024-11-01 | Build site. |
| Rmd | f01f1cf | Maeva TECHER | 2024-11-01 | Adding new files and docs |
| html | f01f1cf | Maeva TECHER | 2024-11-01 | Adding new files and docs |
| html | ba35b82 | Maeva A. TECHER | 2024-06-20 | Build site. |
| html | b9f148d | Maeva A. TECHER | 2024-05-15 | Build site. |
| Rmd | bba0fce | Maeva A. TECHER | 2024-05-15 | wflow_publish("analysis/3_seq-data-qc.Rmd") |
| html | 178254e | Maeva A. TECHER | 2024-05-14 | Build site. |
| Rmd | be09a11 | Maeva A. TECHER | 2024-05-14 | update markdown |
| html | be09a11 | Maeva A. TECHER | 2024-05-14 | update markdown |
| html | d1cebea | Maeva A. TECHER | 2024-02-20 | Build site. |
| Rmd | 55f9385 | Maeva A. TECHER | 2024-02-20 | wflow_publish("analysis/3_seq-data-qc.Rmd") |
| html | df94db2 | Maeva A. TECHER | 2024-02-20 | adding markdown qc |
| Rmd | f6b4961 | Maeva A. TECHER | 2024-02-20 | wflow_publish("analysis/3_seq-data-qc.Rmd") |
| html | e39d280 | Maeva A. TECHER | 2024-01-30 | Build site. |
| html | f701a01 | Maeva A. TECHER | 2024-01-30 | reupdate |
| html | be046c6 | Maeva A. TECHER | 2024-01-24 | Build site. |
| html | 1b09cbe | Maeva A. TECHER | 2024-01-24 | remove |
| html | 141b63c | Maeva A. TECHER | 2023-12-18 | Build site. |
| Rmd | 53877fa | Maeva A. TECHER | 2023-12-18 | add pages |
At this point, we will have organized both SRA and de novo
paired-end sequencing data within our working
directory (for us: located on Grace cluster at
/scratch/group/songlab/maeva/headthor-locusts-rna/data) and
will be ready to run the Snakemake pipeline on it. As a reminder, each
library was build from total RNA extracted and ribodepleted (mRNA,
lncRNA, circRNA) from bulk tissues (either head or thorax) from a single
specimen reared under isolated or crowded conditions.
library("knitr")
library("rmdformats")
library("tidyverse")
library("DT") # for making interactive search table
library("plotly") # for interactive plots
library("ggthemes") # for theme_calc
library("reshape2")
library("readr")
library("ggplot2")
## Global options
options(max.print="10000")
knitr::opts_chunk$set(
echo = TRUE,
message = FALSE,
warning = FALSE,
cache = FALSE,
comment = FALSE,
prompt = FALSE,
tidy = TRUE
)
opts_knit$set(width=75)
The first step is to ensure that we have enough reads per library and
remove any potential outlier resulting from library preparation or
sequencing failure. For that, we will assess each .fastq file with
FASTQC. However, considering we are working with a
significant sample size, we will compile the results using
MULTIQC.
The aggregated results can be conveniently viewed by opening the HTML report in a web browser.
# On Grace cluster at Texas A&M University
module load GCC/12.2.0 OpenMPI/4.1.4 MultiQC/1.14
multiqc --title 'TYPE THE TITLE YOU WANT' -v /PATHTODIRECTORY
montana <- read_table("/Users/maevatecher/Documents/GitHub/locust-comparative-genomics/data/metadata/Stats_RNAseq_QC_19Feb2024.txt", col_names = TRUE, guess_max = 1000)
head(montana)
FALSE # A tibble: 6 × 7
FALSE Sample_Name Species Perc_Dups Perc_GC M_Seqs Unique_Reads Duplicate_Reads
FALSE <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
FALSE 1 SAMER_G_Crd_SRR… americ… 0.71 0.46 23.7 6958831 16765516
FALSE 2 SAMER_G_Crd_SRR… americ… 0.67 0.47 23.7 7784916 15939431
FALSE 3 SAMER_G_Crd_SRR… americ… 0.71 0.48 15.7 4570667 11085054
FALSE 4 SAMER_G_Crd_SRR… americ… 0.67 0.48 15.7 5145265 10510456
FALSE 5 SAMER_G_Crd_SRR… americ… 0.84 0.49 37.2 6049864 31125514
FALSE 6 SAMER_G_Crd_SRR… americ… 0.8 0.49 37.2 7301268 29874110
# Convert values to millions
montana <- montana %>%
mutate_at(vars(contains("Reads")), list(~ ./1000000))
# Pivot the data
montana_long <- montana %>%
pivot_longer(cols = contains("Reads"), names_to = "Variable", values_to = "Value")
# Define colors
colors.reads <- c("Duplicate_Reads" = "black", "Unique_Reads" = "deepskyblue")
# Plot the stacked bar plot with values in millions and custom colors
ggplot(montana_long, aes(x = Sample_Name, y = Value, fill = Variable)) +
geom_bar(stat = "identity", position = "stack") +
facet_wrap(~Species, scales = "free") +
labs(title = "Stacked Bar Plot of Unique Reads and Duplicate Reads by Sample",
x = NULL, # Remove x-axis label
y = "Millions of Reads") +
theme_minimal() +
theme(axis.text.x = element_blank()) + # Remove x-axis labels
scale_fill_manual(values = colors.reads) + # Set custom colors
geom_hline(yintercept = 30, linetype = "dashed", color = "red") +
geom_hline(yintercept = 50, linetype = "dashed", color = "gray70") +
ylim(0, 62) # Set y-axis limits

These plots show that most of our samples have over 30 million reads per sample and that most of these reads are considered duplicates. However, it is possible that the “duplicate” status come from the over expression of certain genes in Schistocerca.
library(ggConvexHull)
# Define custom colors for each species
species_colors <- c("americana" = "forestgreen", "cubense" = "yellow3", "gregaria" = "orange", "nitens" = "blue", "piceifrons" = "red2", cancellata = "deeppink")
p <- ggplot(montana, aes(x = Perc_Dups, y = M_Seqs, color = Species)) +
geom_point(size =2) +
scale_color_manual(values = species_colors) + # Apply custom colors
labs(title = "M_Seqs vs % Dups by Species",
x = "Percentage of Duplicates",
y = "Millions of Reads") +
theme_minimal()
p + geom_convexhull(aes(fill = Species, color = Species), alpha = 0.2) +
scale_fill_manual(values = species_colors)

After checking the initial sequence quality, we can determine whether
any parameters adjustments are needed. There are several tools for
trimming and removing adapters but we used fastp and
trim_galore to remove our contaminants. Below are the two
Snakemake rules we are using. We decided to go ahead with
trim_galore as it was able to remove adapter-dimers the
best in our dataset.
########################################
# Snakefile rule
########################################
rule trimming_fastp:
input:
read1 = WORKDir + "/00-{species}-reads/{locust}_1.fastq.gz",
read2 = WORKDir + "/00-{species}-reads/{locust}_2.fastq.gz"
output:
reportjson = WORKDir + "/01-{species}-trimmed-fastp/TrimQC/{locust}_fastp.json",
reporthtml = WORKDir + "/01-{species}-trimmed-fastp/TrimQC/{locust}_fastp.html",
tread1 = WORKDir + "/01-{species}-trimmed-fastp/{locust}_1.trimmed.fastq.gz",
tread2 = WORKDir + "/01-{species}-trimmed-fastp/{locust}_2.trimmed.fastq.gz"
shell:
"""
module purge
ml GCC/11.2.0 fastp/0.23.2
fastp --thread 16 \\
--in1 {input.read1} \\
--in2 {input.read2} \\
--out1 {output.tread1} \\
--out2 {output.tread2} \\
--trim_front1 2 \\
--trim_front2 2 \\
--detect_adapter_for_pe \\
-l 50 \\
--json {output.reportjson} \\
--html {output.reporthtml}
"""
rule trimming_tgalore:
input:
read1 = WORKDir + "/00-{species}-reads/{locust}_1.fastq.gz",
read2 = WORKDir + "/00-{species}-reads/{locust}_2.fastq.gz"
output:
tread1 = WORKDir + "/01-{species}-trimmed-tgalore/{locust}_1_val_1.fq.gz",
tread2 = WORKDir + "/01-{species}-trimmed-tgalore/{locust}_2_val_2.fq.gz"
shell:
"""
module purge
ml GCCcore/11.2.0 Trim_Galore/0.6.7
trim_galore --trim-n \\
--cores 16 \\
--quality 20 \\
--clip_R1 2 \\
--clip_R2 2 \\
--nextera \\
--output_dir {WORKDir}/01-{wildcards.species}-trimmed-tgalore/ \\
--fastqc \\
--paired {input.read1} {input.read2}
"""
########################################
# Parameters in the cluster.json file
########################################
"trimming_fastp": {
"cpus-per-task" : 16,
"partition" : "medium",
"ntasks": 1,
"mem" : "40G",
"time": "0-12:00:00"
},
"trimming_tgalore": {
"cpus-per-task" : 16,
"partition" : "medium",
"ntasks": 1,
"mem" : "40G",
"time": "0-12:00:00"
},
For trimming with fastp, we use the following
parameters:
--in1 {input.read1} and
--in2 {input.read2}: Specifies the input
files in FASTQ format.
--out1 {output.tread1} and
--out2 {output.tread2}: Defines the output
files for the trimmed reads.
--trim_front1 2 and
--trim_front2 2: Trims the first two
nucleotides from the start (5’ end) of each read in both files.
--detect_adapter_for_pe: Enables
automatic adapter detection for paired-end sequencing reads. This
setting ensures that adapter sequences, if present, are detected and
trimmed from the reads.
-l 50: Sets the minimum length for
reads after trimming. Any read shorter than 50 nucleotides after
trimming is discarded.
--json {output.reportjson} and
--html {output.reporthtml}: Specifies a
JSON/HTML format report file to summarize trimming statistics and
quality control metrics.
For trimming with trim_galore, we use the following
parameters:
--paired {input.read1} {input.read2}:
Specifies that the input is paired-end data.--output_dir {WORKDir}/01-{wildcards.species}-trimmed-tgalore/:
Specifies the output directory where trimmed files will be saved.
{wildcards.species} is a placeholder for the species name,
allowing for organized output by species.--clip_R1 2 and
--clip_R2 2: Clips the first two bases
from the 5’ end of both reads (R1 and R2) in paired-end sequencing.--nextera: Specifies the use of
Nextera adapter sequences for trimming, which are specific to Nextera
library preparations. We use this option because with FastQC we saw
contaminants of these adapters in some libraries.--quality 20: Trims low-quality bases
from the end of each read, keeping bases with a Phred score of 20 or
higher.--fastqc: Runs FastQC after trimming
to generate quality reports, allowing you to assess the quality of the
trimmed reads.If you run the two rules above, you do not need to run the following Snakemake portion, but I put it in case we need to perform extra and customized quality control checks after trimming to ensure that the clipping and filtering of sequences were not overly aggressive. Given the number of sequences we work with, manually inspecting each file immediately would be time-consuming. Instead, we adopt a random sampling approach across different species, rearing conditions, and tissues to see that the trimming process worked well.
########################################
# Snakefile rule
########################################
# Quality control step after trimming: checked for adapter content in particular and quality scores
rule trim_fastqc:
input:
read1 = OUTdir + "/trimming/{locust}_trim1P_1.fastq.gz",
read2 = OUTdir + "/trimming/{locust}_trim2P_2.fastq.gz",
output:
htmlqc1 = OUTdir + "/trimming/{locust}_trim1P_1_fastqc.html",
htmlqc2 = OUTdir + "/trimming/{locust}_trim2P_2_fastqc.html",
shell:
"""
module load FastQC/0.11.9-Java-11
fastqc {input.read1}
fastqc {input.read2}
"""
########################################
# Parameters in the cluster.json file
########################################
"trim_fastqc":
{
"cpus-per-task" : 2,
"partition" : "medium",
"ntasks": 1,
"mem" : "500M",
"time": "0-03:00:00"
},
EXAMPLE OF READS QUALITY BEFORE TRIMMING

EXAMPLE OF READS QUALITY AFTER TRIMMING
We can see that the sequence length has changed and that the 5’ and 3’
end positions with lower quality have been removed.

EXAMPLE OF READS QUALITY BEFORE TRIMMING

EXAMPLE OF READS QUALITY AFTER TRIMMING
We can see that most detected adapter sequences have been adequately
removed after trimming.

Contamination is likely to occur throughout various stages of experiments, including tissue acquisition, RNA extraction and library preparation. One can hope to reduces as much as possible its impact on the final sequencing data which can affect the success rate of reads mapping.
We decided to screen for microbes sequences present in the trimmed
paired-end reads FASTQ.gz files using the tool Kaiju.
Kaiju translates metatranscriptomics sequencing reads into
six possible reading frames and searches for maximum exact matches of
amino acid sequences in a given annotation protein database.
We used the most extensive microbial database nr_euk
which encompass the subset of NCBI BLAST nr database containing all
proteins belonging to Archaea, Bacteria, Viruses, Fungi and microbial
Eukaryotes.
# enable proxy to allow compute node connection to internet
module load WebProxy
## Downloading the 2023-05-10 database from Kaiju webserver
wget --no-check-certificate https://kaiju-idx.s3.eu-central-1.amazonaws.com/2023/kaiju_db_nr_euk_2023-05-10.tgz
To run Kaiju we used the following
Snakemake rule:
########################################
# Snakefile rule
########################################
rule kaiju:
input:
database = KAIJUdir + "/kaiju_db_nr_euk.fmi",
taxonid = KAIJUdir + "/nodes.dmp",
taxonnames = KAIJUdir + "/names.dmp",
trimmed_read1 = WORKDir + "/01-{species}-trimmed-tgalore/{locust}_1_val_1.fq.gz",
trimmed_read2 = WORKDir + "/01-{species}-trimmed-tgalore/{locust}_2_val_2.fq.gz"
output:
kaijuout = WORKDir + "/KAIJU/{species}-kaiju/{locust}_kaiju.out",
classification = WORKDir + "/KAIJU/{species}-kaiju/{locust}_kaiju.tsv"
shell:
"""
module purge
ml GCC/8.3.0 OpenMPI/3.1.4 Kaiju/1.7.3-Python-3.7.4
kaiju -z 12 -v \\
-a greedy \\
-f {input.database} \\
-t {input.taxonid} \\
-i {input.trimmed_read1} \\
-j {input.trimmed_read2} \\
-o {output.kaijuout}
kaiju2table -v \\
-t {input.taxonid} \\
-n {input.taxonnames} \\
-r phylum \\
-o {output.classification} {output.kaijuout}
"""
########################################
# Parameters in the cluster.json file
########################################
"kaiju":
{
"cpus-per-task" : 12,
"partition" : "medium",
"ntasks": 2,
"mem" : "200G",
"time": "0-12:00:00"
},
-z 12: Sets the number of 12 CPU
threads to use for parallel processing.
-v: Enables verbose mode, which
provides detailed output information as the program runs.
-a greedy: Chooses the
greedy algorithm for sequence alignment in Kaiju. This
algorithm searches for the longest exact match in the database and
allows for more sensitive, although potentially slower, matching
compared to other options.
-f {input.database}: Specifies the
database file to use for classification, typically a precompiled
.fmi file containing microbial or taxonomic protein
sequences.
-t {input.taxonid}: Points to the
taxonomy node file (nodes.dmp), which Kaiju uses to assign
taxonomic IDs based on the sequence matches. This file is part of the
NCBI taxonomy data.
-i {input.trimmed_read1} and
-j {input.trimmed_read2}: Specifies the
input FASTQ files for paired-end reads. -i is for the
forward read file, and -j is for the reverse read
file.
-o {output.kaijuout}: Defines the
output file for Kaiju results. This file contains the classifications
and matches for each read.
Parameters for kaiju2table:
-n {input.taxonnames}: Points to
the taxonomy names file (names.dmp), which Kaiju uses to
convert taxonomic IDs into scientific names in the output
table.
-r phylum: Specifies the taxonomic
rank at which results are summarized. Setting it to phylum
means that results will be grouped at the phylum level in the output
table.
-o {output.classification}:
Specifies the output file for the generated classification table, where
each row corresponds to a read’s classification and its taxonomic path
to the specified rank (phylum).
The output produced here allows us to see the percentage of reads that map to unclassified (likely our locust host here) and the percentage of microbial contamination ranked in a phylum level.
Kaiju output can be exported to be view in a interactive
and hierarchical multi-layered pie-charts using Krona. The
results are generated by a .html page. We followed the Kaiju tutorial
on the Github page:
########################################
# Snakefile rule
########################################
rule krona:
input:
kaijuout = WORKDir + "/KAIJU/{species}-kaiju/{locust}_kaiju.out",
taxonid = KAIJUdir + "/nodes.dmp",
taxonnames = KAIJUdir + "/names.dmp",
output:
conversion = WORKDir + "/KAIJU/{species}-kaiju/{locust}_krona.out",
webreport = WORKDir + "/KAIJU/{species}-kaiju/{locust}_krona.html",
shell:
"""
module purge
ml GCCcore/8.2.0 KronaTools/2.7.1
kaiju2krona \\
-t {input.taxonid} \\
-n {input.taxonnames} \\
-i {input.kaijuout} \\
-o {output.conversion}
ktImportText -o {output.webreport} {output.conversion}
"""
########################################
# Parameters in the cluster.json file
########################################
"krona":
{
"cpus-per-task" : 2,
"partition" : "short",
"ntasks": 1,
"mem" : "500M",
"time": "0-0:10:00"
},
-t {input.taxonid}: Specifies the
taxonomy node file (nodes.dmp), which contains the
hierarchical structure of taxonomic IDs. This file is used by Kaiju to
organize and interpret taxonomic relationships for each classified
read.
-n {input.taxonnames}: Specifies
the taxonomy names file (names.dmp), which maps taxonomic
IDs to scientific names. This file allows Kaiju to output the actual
names of taxa instead of numerical IDs.
-i {input.kaijuout}: Indicates the
input file with classification results generated by the
kaiju command. This file contains the taxonomic assignments
for each read.
-o {output.conversion}: Specifies
the output file to be used as input for Krona. This file will be
formatted to include taxonomic information that Krona can interpret,
allowing for a hierarchical visualization of microbial content.
-o {output.webreport}: Specifies
the output HTML file that Krona will generate. This file will contain an
interactive, multi-layered pie chart for visualizing the microbial
classification data at different taxonomic levels.
The combination of Kraken2 and Bracken is
another option to classify the reads compared to a database. Before
launching the snakemake, you will need to create a database:
# enable proxy to allow compute node connection to internet
module load WebProxy
## Downloading the Kraken database
wget --no-check-certificate https://genome-idx.s3.amazonaws.com/kraken/k2_core_nt_20240904.tar.gz
Then you can launch this Snakemake by referencing to
KRAKENDir.
########################################
# Snakefile rule
########################################
rule kraken2:
input:
trimmed_read1 = WORKDir + "/01-{species}-trimmed-tgalore/{locust}_1_val_1.fq.gz",
trimmed_read2 = WORKDir + "/01-{species}-trimmed-tgalore/{locust}_2_val_2.fq.gz",
params:
report = WORKDir + "/KRAKEN2/{species}-kraken2/{locust}_report",
output:
taxoreport = WORKDir + "/KRAKEN2/{species}-kraken2/{locust}.kraken"),
shell:
"""
module purge
ml GCC/11.2.0 OpenMPI/4.1.1 Kraken2/2.1.3
kraken2 \\
--use-names \\
--threads 16 \\
--db {KRAKENDir} \\
--report {params.report} \\
--gzip-compressed \\
--paired {input.read1} {input.read2} > {output.taxoreport}
"""
rule braken:
input:
report = directory(WORKDir + "/KRAKEN2/{species}-kraken2/{locust}_report")
output:
taxonomy = WORKDir + "/KRAKEN2/{species}-kraken2/{locust}_GENUS.bracken")
shell:
"""
module purge
ml GCCcore/10.3.0 Bracken/2.9
bracken -v \\
-d {KRAKENDir} \\
-i {input.report} \\
-l G \\
-o {output.taxonomy}
"""
########################################
# Parameters in the cluster.json file
########################################
--paired {input.read1} {input.read2}:
Specifies that the input data is paired-end sequencing reads, with
{input.read1} as the forward read file and
{input.read2} as the reverse read file.
--gzip-compressed: Indicates that
the input files are gzip-compressed, so Kraken2 will decompress them on
the fly.
--use-names: This option tells
Kraken2 to display the taxonomic names in the output rather than just
taxonomic IDs, making it easier to interpret the results
directly.
--db {KRAKENDir}: Specifies the
path to the Kraken2 database, where {KRAKENDir} is a
placeholder for the directory containing the database files. This
database contains the reference sequences and taxonomy information
Kraken2 uses to classify the reads.
--report {params.report}: Defines
the output file for the Kraken2 report, which summarizes the
classification results, typically showing the percentage of reads
classified at each taxonomic level.
> {output.taxoreport}: Redirects
the standard output of Kraken2 to a file,
{output.taxoreport}, which will contain the detailed
classification information for each read. This includes the assigned
taxonomic classification and confidence scores.
sessionInfo()
FALSE R version 4.4.2 (2024-10-31)
FALSE Platform: aarch64-apple-darwin20
FALSE Running under: macOS Sequoia 15.5
FALSE
FALSE Matrix products: default
FALSE BLAS: /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRblas.0.dylib
FALSE LAPACK: /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRlapack.dylib; LAPACK version 3.12.0
FALSE
FALSE locale:
FALSE [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
FALSE
FALSE time zone: Asia/Tokyo
FALSE tzcode source: internal
FALSE
FALSE attached base packages:
FALSE [1] stats graphics grDevices utils datasets methods base
FALSE
FALSE other attached packages:
FALSE [1] ggConvexHull_0.1.0 reshape2_1.4.4 ggthemes_5.1.0 plotly_4.10.4
FALSE [5] DT_0.33 lubridate_1.9.4 forcats_1.0.0 stringr_1.5.1
FALSE [9] dplyr_1.1.4 purrr_1.0.4 readr_2.1.5 tidyr_1.3.1
FALSE [13] tibble_3.2.1 ggplot2_3.5.2 tidyverse_2.0.0 rmdformats_1.0.4
FALSE [17] knitr_1.49 workflowr_1.7.1
FALSE
FALSE loaded via a namespace (and not attached):
FALSE [1] gtable_0.3.6 xfun_0.51 bslib_0.9.0 htmlwidgets_1.6.4
FALSE [5] processx_3.8.6 callr_3.7.6 tzdb_0.4.0 vctrs_0.6.5
FALSE [9] tools_4.4.2 ps_1.9.0 generics_0.1.3 pkgconfig_2.0.3
FALSE [13] data.table_1.17.0 RColorBrewer_1.1-3 lifecycle_1.0.4 compiler_4.4.2
FALSE [17] farver_2.1.2 git2r_0.35.0 getPass_0.2-4 httpuv_1.6.15
FALSE [21] htmltools_0.5.8.1 sass_0.4.9 yaml_2.3.10 lazyeval_0.2.2
FALSE [25] later_1.4.1 pillar_1.10.2 crayon_1.5.3 jquerylib_0.1.4
FALSE [29] whisker_0.4.1 cachem_1.1.0 tidyselect_1.2.1 digest_0.6.37
FALSE [33] stringi_1.8.4 bookdown_0.42 labeling_0.4.3 rprojroot_2.0.4
FALSE [37] fastmap_1.2.0 grid_4.4.2 cli_3.6.5 magrittr_2.0.3
FALSE [41] utf8_1.2.4 dichromat_2.0-0.1 withr_3.0.2 scales_1.4.0
FALSE [45] promises_1.3.2 timechange_0.3.0 rmarkdown_2.29 httr_1.4.7
FALSE [49] hms_1.1.3 evaluate_1.0.3 viridisLite_0.4.2 rlang_1.1.6
FALSE [53] Rcpp_1.0.14 glue_1.8.0 formatR_1.14 rstudioapi_0.17.1
FALSE [57] jsonlite_1.9.1 R6_2.6.1 plyr_1.8.9 fs_1.6.5