Last updated: 2024-05-15

Checks: 6 1

Knit directory: locust-comparative-genomics/

This reproducible R Markdown analysis was created with workflowr (version 1.7.1). The Checks tab describes the reproducibility checks that were applied when the results were created. The Past versions tab lists the development history.

R Markdown file: up-to-date

Great! Since the R Markdown file has been committed to the Git repository, you know the exact version of the code that produced these results.

Environment: empty

Great job! The global environment was empty. Objects defined in the global environment can affect the analysis in your R Markdown file in unknown ways. For reproduciblity it’s best to always run the code in an empty environment.

Seed: set.seed(20221025)

The command set.seed(20221025) was run prior to running the code in the R Markdown file. Setting a seed ensures that any results that rely on randomness, e.g. subsampling or permutations, are reproducible.

Session information: recorded

Great job! Recording the operating system, R version, and package versions is critical for reproducibility.

Cache: none

Nice! There were no cached chunks for this analysis, so you can be confident that you successfully produced the results during this run.

File paths: absolute

Using absolute paths to the files within your workflowr project makes it difficult for you and others to run your code on a different machine. Change the absolute path(s) below to the suggested relative path(s) to make your code more reproducible.

absolute	relative
/Users/alphamanae/Documents/GitHub/locust-comparative-genomics	.
/Users/alphamanae/Documents/GitHub/locust-comparative-genomics/data	data
/Users/alphamanae/Documents/GitHub/locust-comparative-genomics/data/gregaria/deg_counts/STAR_newparams/gregaria/featurecounts/finalcounts/	data/gregaria/deg_counts/STAR_newparams/gregaria/featurecounts/finalcounts
/Users/alphamanae/Documents/GitHub/locust-comparative-genomics/data/list/HeadSGREG.txt	data/list/HeadSGREG.txt
/Users/alphamanae/Documents/GitHub/locust-comparative-genomics/data/list/gregaria_lncrna_list.txt	data/list/gregaria_lncrna_list.txt

Repository version: 0ac8380

Great! You are using Git for version control. Tracking code development and connecting the code version to the results is critical for reproducibility.

The results in this page were generated with repository version 0ac8380. See the Past versions tab to see a history of the changes made to the R Markdown and HTML files.

Note that you need to be careful to ensure that all relevant files for the analysis have been committed to Git prior to generating the results (you can use wflow_publish or wflow_git_commit). workflowr only checks the R Markdown file, but you know if there are other scripts or data files that it depends on. Below is the status of the Git repository when the results were generated:


Ignored files:
    Ignored:    .DS_Store
    Ignored:    .RData
    Ignored:    .Rhistory
    Ignored:    .Rproj.user/
    Ignored:    analysis/.DS_Store
    Ignored:    analysis/.Rhistory
    Ignored:    data/.DS_Store
    Ignored:    data/.Rhistory
    Ignored:    data/DEseq2_SAMER_SAMER_HEAD_STARnew_features/.DS_Store
    Ignored:    data/DEseq2_SAMER_SAMER_THORAX_STARnew_features/.DS_Store
    Ignored:    data/DEseq2_SAMER_SGREG_HEAD_STARnew_features/.DS_Store
    Ignored:    data/DEseq2_SAMER_SGREG_THORAX_STARnew_features/.DS_Store
    Ignored:    data/DEseq2_SCANC_SCANC_HEAD_STARnew_features/.DS_Store
    Ignored:    data/DEseq2_SCANC_SCANC_THORAX_STARnew_features/.DS_Store
    Ignored:    data/DEseq2_SCANC_SGREG_HEAD_STARnew_features/.DS_Store
    Ignored:    data/DEseq2_SCANC_SGREG_THORAX_STARnew_features/.DS_Store
    Ignored:    data/DEseq2_SCUBE_SCUBE_HEAD_STARnew_features/.DS_Store
    Ignored:    data/DEseq2_SCUBE_SCUBE_THORAX_STARnew_features/.DS_Store
    Ignored:    data/DEseq2_SCUBE_SGREG_HEAD_STARnew_features/.DS_Store
    Ignored:    data/DEseq2_SCUBE_SGREG_THORAX_STARnew_features/.DS_Store
    Ignored:    data/DEseq2_SGREG_SGREG_HEAD_STARnew_features/.DS_Store
    Ignored:    data/DEseq2_SGREG_SGREG_THORAX_STARnew_features/.DS_Store
    Ignored:    data/DEseq2_SNITE_SGREG_HEAD_STARnew_features/.DS_Store
    Ignored:    data/DEseq2_SNITE_SGREG_THORAX_STARnew_features/.DS_Store
    Ignored:    data/DEseq2_SNITE_SNITE_HEAD_STARnew_features/.DS_Store
    Ignored:    data/DEseq2_SNITE_SNITE_THORAX_STARnew_features/.DS_Store
    Ignored:    data/DEseq2_SPICE_SGREG_HEAD_STARnew_features/.DS_Store
    Ignored:    data/DEseq2_SPICE_SGREG_THORAX_STARnew_features/.DS_Store
    Ignored:    data/DEseq2_SPICE_SPICE_HEAD_STARnew_features/.DS_Store
    Ignored:    data/DEseq2_SPICE_SPICE_THORAX_STARnew_features/.DS_Store
    Ignored:    data/americana/.DS_Store
    Ignored:    data/americana/deg_counts/.DS_Store
    Ignored:    data/americana/deg_counts/STAR/.DS_Store
    Ignored:    data/americana/deg_counts/STAR/americana/.DS_Store
    Ignored:    data/americana/deg_counts/STAR/americana/featurecounts/.DS_Store
    Ignored:    data/americana/deg_counts/STAR/gregaria/.DS_Store
    Ignored:    data/americana/deg_counts/STAR/gregaria/featurecounts/.DS_Store
    Ignored:    data/americana/deg_counts/STAR_newparams/.DS_Store
    Ignored:    data/americana/deg_counts/STAR_newparams/americana/.DS_Store
    Ignored:    data/americana/deg_counts/STAR_newparams/americana/featurecounts/.DS_Store
    Ignored:    data/americana/deg_counts/STAR_newparams/gregaria/.DS_Store
    Ignored:    data/americana/deg_counts/STAR_newparams/gregaria/featurecounts/.DS_Store
    Ignored:    data/cancellata/.DS_Store
    Ignored:    data/cancellata/deg_counts/.DS_Store
    Ignored:    data/cancellata/deg_counts/STAR/.DS_Store
    Ignored:    data/cancellata/deg_counts/STAR_newparams/.DS_Store
    Ignored:    data/cubense/.DS_Store
    Ignored:    data/cubense/deg_counts/.DS_Store
    Ignored:    data/cubense/deg_counts/STAR/.DS_Store
    Ignored:    data/cubense/deg_counts/STAR/cubense/.DS_Store
    Ignored:    data/cubense/deg_counts/STAR/cubense/featurecounts/
    Ignored:    data/cubense/deg_counts/STAR/gregaria/
    Ignored:    data/cubense/deg_counts/STAR_newparams/.DS_Store
    Ignored:    data/cubense/deg_counts/STAR_newparams/cubense/.DS_Store
    Ignored:    data/cubense/deg_counts/STAR_newparams/gregaria/.DS_Store
    Ignored:    data/gregaria/.DS_Store
    Ignored:    data/gregaria/deg_counts/.DS_Store
    Ignored:    data/gregaria/deg_counts/STAR/.DS_Store
    Ignored:    data/gregaria/deg_counts/STAR_newparams/.DS_Store
    Ignored:    data/gregaria/deg_counts/STAR_newparams/gregaria/.DS_Store
    Ignored:    data/gregaria/deg_counts/STAR_newparams/gregaria/featurecounts/.DS_Store
    Ignored:    data/list/.DS_Store
    Ignored:    data/metadata/.DS_Store
    Ignored:    data/nitens/.DS_Store
    Ignored:    data/nitens/deg_counts/.DS_Store
    Ignored:    data/nitens/deg_counts/STAR_newparams/.DS_Store
    Ignored:    data/nitens/deg_counts/STAR_newparams/gregaria/.DS_Store
    Ignored:    data/nitens/deg_counts/STAR_newparams/gregaria/featurecounts/.DS_Store
    Ignored:    data/nitens/deg_counts/STAR_newparams/nitens/.DS_Store
    Ignored:    data/nitens/deg_counts/STAR_newparams/nitens/featurecounts/.DS_Store
    Ignored:    data/piceifrons/.DS_Store
    Ignored:    data/piceifrons/deg_counts/.DS_Store
    Ignored:    data/piceifrons/deg_counts/STAR/.DS_Store
    Ignored:    data/piceifrons/deg_counts/STAR/gregaria/.DS_Store
    Ignored:    data/piceifrons/deg_counts/STAR/gregaria/featurecounts/.DS_Store
    Ignored:    data/piceifrons/deg_counts/STAR/piceifrons/.DS_Store
    Ignored:    data/piceifrons/deg_counts/STAR/piceifrons/featurecounts/.DS_Store
    Ignored:    data/piceifrons/deg_counts/STAR_newparams/.DS_Store
    Ignored:    data/piceifrons/deg_counts/STAR_newparams/gregaria/.DS_Store
    Ignored:    data/piceifrons/deg_counts/STAR_newparams/piceifrons/.DS_Store
    Ignored:    figures/
    Ignored:    tables/

Untracked files:
    Untracked:  data/DEseq2_SGRE_HEAD/

Note that any generated files, e.g. HTML, png, CSS, etc., are not included in this status report because it is ok for generated content to have uncommitted changes.

These are the previous versions of the repository in which changes were made to the R Markdown (analysis/3_deg-analysis.Rmd) and HTML (docs/3_deg-analysis.html) files. If you’ve configured a remote Git repository (see ?wflow_git_remote), click on the hyperlinks in the table below to view the files as they were in that past version.

File	Version	Author	Date	Message
Rmd	0ac8380	Maeva A. TECHER	2024-05-15	wflow_publish("analysis/3_deg-analysis.Rmd")
html	da47a64	Maeva A. TECHER	2024-05-14	Build site.
html	a39899b	Maeva A. TECHER	2024-05-14	Build site.
Rmd	b02e6e9	Maeva A. TECHER	2024-05-14	wflow_publish("analysis/3_deg-analysis.Rmd")
html	e39d280	Maeva A. TECHER	2024-01-29	Build site.
html	f701a01	Maeva A. TECHER	2024-01-29	reupdate
html	a3d138b	Maeva A. TECHER	2024-01-29	Build site.
Rmd	915285e	Maeva A. TECHER	2024-01-29	wflow_publish("analysis/3_deg-analysis.Rmd")
html	959a4e9	Maeva A. TECHER	2024-01-24	Build site.
html	1b09cbe	Maeva A. TECHER	2024-01-24	remove
html	478e72d	Maeva A. TECHER	2023-12-18	Build site.
Rmd	53877fa	Maeva A. TECHER	2023-12-18	add pages

Here we present the workflow example with the head data from S. piceifrons mapped to its own genome. We applied the same process for each species ( S. americana, S. serialis cubense, S. nitens, S. gregaria, S. cancellata and S. piceifrons), tissue (head or thorax) and each mapping strategy (mapped to sharedgregaria RefSeq or own RefSeq).

For the analysis of differentially expressed genes, we will follow some guidelines from an online RNA course tutorial that uses either DESeq2 or edgeR on STAR output. We also adapted some script lines from Foquet et al. 2021 code. Our aim will be to combine both results and take only the overlapping DEGs detected by these two software.

Both DESeq2 and edgeR identify differentially expressed genes by starting with the hypothesis that no genes are differentially expressed. They work similarly but differs slightly in the normalization process. Where DEseq2 use a “geometric” normalization, edgeR uses a weighted mean of log ratios-based method. If you want to understand the process and algorithms for each of these R package, I recommend to check their bioconductor page.

DESeq2 Bioconductor here.
edgeR Bioconductor here.

Warning: For sample size over 8 per condition, please be aware that both methods have been found to overestimate DEGs with false positive in human population samples. You can read more about it with Li et al. 2022

1. DESeq2 analysis using STAR input

1.1. Preparing input files

Raw count matrices

We generated this in the precedent section.

Transcript-to-gene annotation file

We need to format the .gtf annotation file for input in DESeq2.

Below is the example with S. piceifrons

# Download annotation and place it into the folder refgenomes
wget https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/021/461/385/GCF_021461385.2_iqSchPice1.1/GCF_021461385.2_iqSchPice1.1_genomic.gtf.gz

# first column is the transcript ID, second column is the gene ID, third column is the gene symbol
zcat GCF_021461385.2_iqSchPice1.1_genomic.gtf.gz | awk -F "\t" 'BEGIN{OFS="\t"}{if($3=="transcript"){split($9, a, "\""); print a[4],a[2],a[8]}}' > tx2gene.piceifrons.csv

Sample sheet

DESeq2 needs a sample sheet that describes the samples characteristics: SampleName, FileName (…counts.txt), and subsequently anything that can be used for statistical design such as RearingCondition, replicates, tissue, time points, etc. in the form.

The design indicates how to model the samples: in the model we need to specify what we want to measure and what we want to control.

1.2. Lazy DEG analysis

Load libraries

Here we will be using the html report function of DESeq2 to export all the results, however, the option to do it step by step will be also provided below.

We start by loading all the required R packages.

#(install first from CRAN or Bioconductor)
library("knitr")
library("rmdformats")
library("tidyverse")
library("data.table")
library("DT")  # for making interactive search table
library("plotly") # for interactive plots
library("ggthemes") # for theme_calc
library("reshape2")
library("DESeq2")
library("apeglm")
library("ggpubr")
library("ggplot2")
library("ggrepel")
library("EnhancedVolcano")
library("SARTools")
library("pheatmap")

Set parameters

We need to specify the path of our working directories, where are located our gene counts files and where the .html report will be created:

## PARAMETERS for running the script
homeDir <- "/Users/alphamanae/Documents/GitHub/locust-comparative-genomics"
workDir <- "/Users/alphamanae/Documents/GitHub/locust-comparative-genomics/data"        # Working directory

# path to the directory containing raw counts files
rawDir <- "/Users/alphamanae/Documents/GitHub/locust-comparative-genomics/data/gregaria/deg_counts/STAR_newparams/gregaria/featurecounts/finalcounts/" 

## Details to create the html report
projectName <- "SGRE_HEAD"                      # name of the project
author <- "Maeva TECHER"                        # author of the statistical analysis/report

## Create all the needed directories for the html report
setwd(workDir)
Dirname <- paste("DEseq2_", projectName, sep = "")
dir.create(Dirname)
setwd(Dirname)
workDir_DEseq2 <- getwd()

We then define the parameters that we will be running with DESeq2:

## PARAMETERS for running DEseq2
tresh_logfold <- 1                            # Treshold for log2(foldchange) in final DE-files
tresh_padj <- 0.05                            # Treshold for adjusted p-valued in final DE-files
alpha_DEseq2 <- 0.05                  # threshold of statistical significance
pAdjustMethod_DEseq2 <- "BH"          # p-value adjustment method: "BH" (default) or "BY"
featuresToRemove <- c(NULL)           # names of the features to be removed, NULL if none or if using Idxstats
varInt <- "RearingCondition"          # factor of interest
condRef <- "Crowded"                 # reference biological condition
batch <- NULL                         # blocking factor: NULL (default) or "batch" for example  
fitType <- "parametric"               # mean-variance relationship: "parametric" (default) or "local"
cooksCutoff <- TRUE                   # TRUE/FALSE to perform the outliers detection (default is TRUE)
independentFiltering <- TRUE          # TRUE/FALSE to perform independent filtering (default is TRUE)
typeTrans <- "rlog"                     # transformation for PCA/clustering: "VST" or "rlog"
locfunc <- "median"

# Path and name of targetfile containing conditions and file names
targetFile <- "/Users/alphamanae/Documents/GitHub/locust-comparative-genomics/data/list/HeadSGREG.txt"           
colors <- c("#B31B21", "#1465AC", "green3", "orange")

Here our targetFile is our metadata which should have a column for each: sample ID, the file name (exactly the same as in your folder, be careful with ” ” or “_“) and the conditions. Below we show an example with the files for piceifrons head:

If everything has been loaded properly, the next chunk of code should not show any warning and we can simpy proceed to the actual DEG analysis.

# checking parameters
setwd(workDir_DEseq2)

checkParameters.DESeq2(projectName=projectName,author=author,targetFile=targetFile,
                       rawDir=rawDir,featuresToRemove=featuresToRemove,varInt=varInt,
                       condRef=condRef,batch=batch,fitType=fitType,cooksCutoff=cooksCutoff,
                       independentFiltering=independentFiltering,alpha=alpha_DEseq2,pAdjustMethod=pAdjustMethod_DEseq2,
                       typeTrans=typeTrans,locfunc=locfunc,colors=colors)

HTML report

Here if you just want a quick report without being able to tailor step by step, you can use the code below and obtain several graphs and details for your DEG analysis.

## import the sample sheet that indicates Rearing Conditions and Tissue origins
setwd(workDir_DEseq2)

# loading target file
target <- loadTargetFile(targetFile=targetFile, varInt=varInt, condRef=condRef, batch=batch)

# loading counts
counts <- loadCountData(target=target, rawDir=rawDir, featuresToRemove=featuresToRemove)

# description plots
majSequences <- descriptionPlots(counts=counts, group=target[,varInt], col=colors)

# analysis with DESeq2
out.DESeq2 <- run.DESeq2(counts=counts, target=target, varInt=varInt, batch=batch,
                         locfunc=locfunc, fitType=fitType, pAdjustMethod=pAdjustMethod_DEseq2,
                         cooksCutoff=cooksCutoff, independentFiltering=independentFiltering, alpha=alpha_DEseq2)

# PCA + Clustering
exploreCounts(object=out.DESeq2$dds, group=target[,varInt], typeTrans=typeTrans, col=colors)

# summary of the analysis (boxplots, dispersions, diag size factors, export table, nDiffTotal, histograms, MA plot)
summaryResults <- summarizeResults.DESeq2(out.DESeq2, group=target[,varInt], col=colors,
                                          independentFiltering=independentFiltering,
                                          cooksCutoff=cooksCutoff, alpha=alpha_DEseq2)

# generating HTML report
writeReport.DESeq2(target=target, counts=counts, out.DESeq2=out.DESeq2, summaryResults=summaryResults,
                   majSequences=majSequences, workDir=workDir_DEseq2, projectName=projectName, author=author,
                   targetFile=targetFile, rawDir=rawDir, featuresToRemove=featuresToRemove, varInt=varInt,
                   condRef=condRef, batch=batch, fitType=fitType, cooksCutoff=cooksCutoff,
                   independentFiltering=independentFiltering, alpha=alpha_DEseq2, pAdjustMethod=pAdjustMethod_DEseq2,
                   typeTrans=typeTrans, locfunc=locfunc, colors=colors)

1.3. Step-by-step DEG analysis

Load libraries

Follow the same process section 1.2

Set parameters

Here we only need to specify the path of our raw counts and metadata files.

## PARAMETERS for running the script
homeDir <- "/Users/alphamanae/Documents/GitHub/locust-comparative-genomics"
workDir <- "/Users/alphamanae/Documents/GitHub/locust-comparative-genomics/data"        # Working directory

# path to the directory containing raw counts files
rawDir <- "/Users/alphamanae/Documents/GitHub/locust-comparative-genomics/data/gregaria/deg_counts/STAR_newparams/gregaria/featurecounts/finalcounts/"

We then define the parameters that we will be running with DESeq2:

## PARAMETERS for running DEseq2
tresh_logfold <- 1                            # Treshold for log2(foldchange) in final DE-files
tresh_padj <- 0.05                            # Treshold for adjusted p-valued in final DE-files
alpha_DEseq2 <- 0.05                  # threshold of statistical significance
pAdjustMethod_DEseq2 <- "BH"          # p-value adjustment method: "BH" (default) or "BY"
featuresToRemove <- c(NULL)           # names of the features to be removed, NULL if none or if using Idxstats
varInt <- "RearingCondition"          # factor of interest
condRef <- "Crowded"                 # reference biological condition
batch <- NULL                         # blocking factor: NULL (default) or "batch" for example  
fitType <- "parametric"               # mean-variance relationship: "parametric" (default) or "local"
cooksCutoff <- TRUE                   # TRUE/FALSE to perform the outliers detection (default is TRUE)
independentFiltering <- TRUE          # TRUE/FALSE to perform independent filtering (default is TRUE)
typeTrans <- "rlog"                     # transformation for PCA/clustering: "VST" or "rlog"
locfunc <- "median"

We then import the metadata table and format it as such as there is no error downstream.

# Get in the home directory and load the file of interest
setwd(homeDir)
sampletable <- fread("data/list/HeadSGREG.txt")

## add the sample names as row names (it is needed for some of the DESeq functions)
rownames(sampletable) <- sampletable$SampleName

## Make sure discriminant variables are factor
sampletable$RearingCondition <- as.factor(sampletable$RearingCondition)
sampletable$Tissue <- as.factor(sampletable$Tissue)
#sampletable$Species <- as.factor(sampletable$Species)
dim(sampletable)

Then we obtain the output from STAR GeneCount and import here individually using the sampletable as a reference to fetch them. We also filter the lowly expressed genes to avoid noisy data.

## Import count files
satoshi <- DESeqDataSetFromHTSeqCount(sampleTable = sampletable,
                        directory = "data/gregaria/deg_counts/STAR_newparams/gregaria/featurecounts/finalcounts",
                        design = ~ RearingCondition )
satoshi

# keep genes for which sums of raw counts across experimental samples is > 5
satoshi <- satoshi[rowSums(counts(satoshi)) > 1, ]
nrow(satoshi)

Run the model Fit DESeq2

Run DESeq2 analysis using DESeq, which performs (1) estimation of size factors, (2) estimation of dispersion, then (3) Negative Binomial GLM fitting and Wald statistics. The results tables (log2 fold changes and p-values) can be generated using the results function (copied from online chapter).

One of the thing we want to do with the step-by-step method is to remove some of special genomic features such as long -non-coding RNA genes or ribosomal RNA. For that we created a list beforehand by parsing the .gff file and selecting field of interest for each species.

IF WE WANT TO EXCLUDE SOME LOCI USE THIS CODE:

# Read the list of loci from a text file to later remove it
loci_to_exclude <- readLines("/Users/alphamanae/Documents/GitHub/locust-comparative-genomics/data/list/gregaria_lncrna_list.txt")

# Subset the rows of the DESeqDataSet to exclude the loci in the 'loci_to_exclude' list
satoshi_filtered <- satoshi[!(rownames(satoshi) %in% loci_to_exclude), ]

shigeru <- DESeq(satoshi_filtered)
cbind(resultsNames(shigeru))

IF WE DO NOT WANT TO EXCLUDE ANY LOCI USE THIS CODE:

# Fit the statistical model
shigeru <- DESeq(satoshi)
cbind(resultsNames(shigeru))

Then we will observe the final result by keeping DEGs with an adjusted p-value cutoff > 0.05.

# Here we plot the adjusted p-value means corrected for multiple testing (FDR padj)
res_shigeru <- results(shigeru)

# We only keep genes with an adjusted p-value cutoff > 0.05 by changing the default significance cut-off
sum(res_shigeru$padj < tresh_padj, na.rm = TRUE)

brock <- results(shigeru, name = "RearingCondition_Isolated_vs_Crowded", alpha = alpha_DEseq2)
summary(brock)

# Details of what is each column meaning in our final result 
mcols(brock)$description
head(brock)

1.4. Visualizing and exploring the results

This cam be done only with the step-by-step method. The html report can output similar figures but with less possibility for tailoring.

PCA of the samples

We transformed the data for visualization by comparing both recommended rlog (Regularized log) or vst (Variance Stabilizing Transformation) transformations. Both options produce log2 scale data which has been normalized by the DESeq2 method with respect to library size.

# Try with the vst transformation
shigeru_vst <- vst(shigeru)
shigeru_rlog <- rlog(shigeru)

Here we will plot the data using the rlog approach:

# Create the pca on the defined groups
pcaData <- plotPCA(object = shigeru_rlog, intgroup = c("RearingCondition2"),returnData=TRUE)

# Store the information for each axis variance in %
percentVar <- round(100 * attr(pcaData, "percentVar"))

# Make sure that the discriminant variable are in factor for using shape and color later on
#pcaData$RearingCondition<-factor(pcaData$RearingCondition,levels=c("Crowded","Isolated"), labels=c("Crowded for 72h mother","Isolated mother"))
pcaData$RearingCondition2<-factor(pcaData$RearingCondition2,levels=c("GregariousCrowded","GregariousIsolated", "SolitariousCrowded", "SolitariousIsolated"), labels=c("Gregarious control","Gregarious isolated for 72h", "Solitarious crowded for 72h", "Solitarious control"))
levels(pcaData$RearingCondition2)
#pcaData$Tissue<-factor(pcaData1$Tissue,levels=c("Whole_body","Optical_lobes"), labels=c("Hatchling", "OLB"))
#levels(pcaData$Tissue)

library(ggdist)

p <- ggplot(pcaData, aes(PC1, PC2, color= RearingCondition2)) +
  geom_point(size=12) +
  xlab(paste0("PC1: ", percentVar[1], "% variance")) +
  ylab(paste0("PC2: ", percentVar[2], "% variance")) + 
  scale_color_manual(values = c("orange", "red2", "blue", "forestgreen")) +
  #coord_fixed() +
  theme_bw() +
  theme(legend.title = element_blank()) + 
  theme(legend.text = element_text(face="bold", size=24)) +
  theme(axis.text = element_text(size=20)) +
  theme(axis.title = element_text(size=20))

p + geom_convexhull(aes(fill = RearingCondition2, color = RearingCondition2), alpha = 0.5) +
    scale_fill_manual(values = c("orange", "red2", "blue", "forestgreen"))

  #stat_ellipse(aes(group = RearingCondition, fill = RearingCondition), level = 0.95, geom = "polygon", alpha = 0.2)


  #ggtitle("PCA on STAR read counts from accessory glands (using 13472 genes)", 
  #        subtitle = "rlog transformation in DEseq2") +
  #theme(plot.title = element_text(size = 18, face = "bold", margin = margin(t = 10)))+
  #theme(plot.subtitle = element_text(size = 16, face = "italic"))

#ggplot(pcaData, aes(PC1, PC2, color= Tissue, shape = Species)) +
#  geom_point(size=4)+xlab(paste0("PC1: ",percentVar[1],"% variance")) +
#  ylab(paste0("PC2: ",percentVar[2],"% variance")) + 
#  scale_color_manual(values = c("blue3","red2","magenta2", "black", "red4", "green4", "deepskyblue", "darkorchid1", "darkorange3", "darkorange1" , "gray40", "gray", "darkorange2"))+
#  geom_text_repel(aes(label = name), nudge_x = -1, nudge_y = 0.2, size = 3) +
#  coord_fixed()+
#  theme_bw()+
#  theme(legend.title = element_blank())+ 
#  theme(legend.text = element_text(face="bold", size=15))+
#  theme(axis.text = element_text(size=15))+
#  theme(axis.title = element_text(size=16))+ 
#  ggtitle("PCA on S. piceifrons thorax tissues", subtitle = "rlog transformation") + 
#  xlab(paste0("PC1: ",percentVar[1],"% variance")) +
#  ylab(paste0("PC2: ",percentVar[2],"% variance"))

Plot the PCA vsd

# Create the pca on the defined groups
pcaData <- plotPCA(object = shigeru_vst, intgroup = c("RearingCondition"),returnData=TRUE)

# Store the information for each axis variance in %
percentVar <- round(100 * attr(pcaData, "percentVar"))

# Make sure that the discriminant variable are in factor for using shape and color later on
pcaData$RearingCondition<-factor(pcaData$RearingCondition,levels=c("Crowded","Isolated"), labels=c("Crowded","Isolated"))
levels(pcaData$RearingCondition)

#pcaData$Tissue<-factor(pcaData$Tissue,levels=c("Antenna_lobes","Antennae", "Fat_Body", "Head", "Maxillary", "Metathoracic_ganglia", "Mushroom_body", "Optical_lobes", "Pleuropodium", "T3_leg" , "Thorax", "Whole_body", "Whole_egg"), labels=c("Antenna lobes","Antennae","Fat Body", "Whole Head", "Maxillary", "Metathoracic ganglia", "Mushroom body", "Optical lobes", "Pleuropodium", "T3 leg" , "Thorax", "Whole body", "Egg"))
#levels(pcaData$Tissue)

ggplot(pcaData, aes(PC1, PC2, color= RearingCondition)) +
  geom_point(size=4)+xlab(paste0("PC1: ",percentVar[1],"% variance")) +
  ylab(paste0("PC2: ",percentVar[2],"% variance")) + 
  scale_color_manual(values = c("blue", "red")) +
  geom_text_repel(aes(label = name), nudge_x = -1, nudge_y = 0.2, size = 4) +
  coord_fixed()+
  theme_bw()+
  theme(legend.title = element_blank())+ 
  theme(legend.text = element_text(face="bold", size=15))+
  theme(axis.text = element_text(size=15))+
  theme(axis.title = element_text(size=16))+ 
  ggtitle("PCA on Schisto tissues", subtitle = "vst transformation") + 
  xlab(paste0("PC1: ",percentVar[1],"% variance")) +
  ylab(paste0("PC2: ",percentVar[2],"% variance"))


#ggplot(pcaData, aes(PC1, PC2, shape= Species, fill= Tissue )) +
#  geom_point(size=5)+xlab(paste0("PC1: ",percentVar[1],"% variance")) +
#  ylab(paste0("PC2: ",percentVar[2],"% variance")) + 
#  scale_fill_manual(values = c("blue3","red2","magenta2", "black", "red4", "green4", "deepskyblue", "cyan", "darkorange3", "darkorange1" , "gray40", "gray", "darkorange2"))+
#  scale_shape_manual(values=c(21, 23, 24, 25, 21))+
#   geom_text_repel(aes(label = Tissue), nudge_x = -1, nudge_y = 0.2, size = 3) +
#  coord_fixed()+
# theme_bw()+
#  theme(legend.title = element_blank())+ 
#  theme(panel.border = element_blank())+
#  theme(legend.text = element_text(face="bold", size=15))+
#  theme(axis.text = element_text(size=15))+
#  theme(axis.title = element_text(size=16))+ 
#  ggtitle("PCA on mixed tissues", subtitle = "vst transformation") + 
#  xlab(paste0("PC1: ",percentVar[1],"% variance")) +
#  ylab(paste0("PC2: ",percentVar[2],"% variance"))

Sample Matrix Distance

Using also the transformed data, we check the distance between samples and see how they correlate to each others.

Heatmap using rlog

# calculate between-sample distance matrix
sampleDistMatrix.rlog <- as.matrix(dist(t(assay(shigeru_rlog))))

metadata <- sampletable[,c("RearingCondition2", "Tissue")]
rownames(metadata) <- sampletable$SampleName


pheatmap(sampleDistMatrix.rlog, annotation_col=metadata, main = "Ovaries tissue heatmap, rlog transformation")

Heatmap using vst

# calculate between-sample distance matrix
sampleDistMatrix.vst <- as.matrix(dist(t(assay(shigeru_vst))))

pheatmap(sampleDistMatrix.vst, annotation_col=metadata, main = "Ovaries tissue heatmap, vst transformation")

MA-plot

This plot allows us to show the log2 fold changes over the mean of normalized counts for all the samples. Points will be colored in red if the adjusted p-value is less than 0.05 and the log2 fold change is bigger than 1. In blue, will be the reverse for the log2 fold change.

To generate more accurate log2 foldchange estimates, DESeq2 allows (and recommends) the shrinkage of the LFC estimates toward zero when the information for a gene is low, which could include:

-Low counts
-High dispersion values

# include the log2FoldChange shrinkage use to visualize gene ranking
de_shrink <- lfcShrink(dds = shigeru, coef="RearingCondition_Isolated_vs_Crowded", type="apeglm")
head(de_shrink)

# Ma plot
maplot <-ggmaplot(de_shrink, fdr = 0.05, fc = 1, size = 2, palette = c("#B31B21", "#1465AC", "darkgray"), genenames = as.vector(rownames(de_shrink$name)), top = 0,legend="top",label.select = NULL) +
coord_cartesian(xlim = c(0, 20)) + 
scale_y_continuous(limits=c(-12, 12)) + 
theme(axis.text.x = element_text(size=16),axis.text.y = element_text(size=15),axis.title.x = element_text(size=17),axis.title.y = element_text(size=17),axis.line = element_line(size = 1, colour="gray20"),axis.ticks = element_line(size = 1, colour="gray20"))+
guides(color = guide_legend(override.aes = list(size = c(3,3,3))))+
theme(legend.position = c(0.70, 0.12),legend.text=element_text(size=14,face="bold"),legend.background = element_rect(fill="transparent"))+
theme(plot.title = element_text(size=18, colour="gray30", face="bold",hjust=0.06, vjust=-5))+
labs(title="MA-plot for the shrunken log2 fold changes in the AGY")
maplot

Volcano Plot

The EnhancedVolcano helps visualise the resulst of differential expression analysis.

keyvals <-ifelse( 
  res_shigeru$log2FoldChange >= 1 & res_shigeru$padj <= 0.05, '#B31B21', 
  ifelse(res_shigeru$log2FoldChange <= -1 & res_shigeru$padj <= 0.05, '#1465AC', 'darkgray')) 

keyvals[is.na(keyvals)] <-'darkgray' 
names(keyvals)[keyvals == "#B31B21"] <-'Upregulated' 
names(keyvals)[keyvals == "#1465AC"] <-'Downregulated' 
names(keyvals)[keyvals == 'darkgray'] <-'NS'

EnhancedVolcano(res_shigeru,
                lab = rownames(res_shigeru),
                x = 'log2FoldChange',
                y = 'padj',
                pCutoff = 0.05,
                FCcutoff = 1,
                pointSize = 3,
                labSize = 5,
                colAlpha = 4/5,
                colCustom = keyvals,
                drawConnectors = TRUE)

Normalized Matrix distance

We then normalize the result by extracting only significant genes with a fold change of 1.

resorted_deresults <- res_shigeru[order(res_shigeru$padj),]

## Select only the genes that have a padj > 0.05 and with minimum log2FoldChange of 1
sig <- resorted_deresults[!is.na(resorted_deresults$padj) &
                            resorted_deresults$padj < tresh_padj &
                            abs(resorted_deresults$log2FoldChange) >=3,]
selected <- rownames(sig);selected

## Norm transform the data from DEseq2 run
ntd <- normTransform(satoshi)

## Plot the relation among samples considering only the significant genes
pheatmap(assay(ntd)[selected,], 
         annotation_col=metadata,
         cluster_rows = TRUE, 
         show_rownames = TRUE, 
         cluster_cols = TRUE, 
         cutree_rows = 4,
         cutree_cols = 3,
         labels_col= colData(satoshi)$SampleName)

sessionInfo()

R version 4.3.1 (2023-06-16)
Platform: x86_64-apple-darwin20 (64-bit)
Running under: macOS Sonoma 14.4.1

Matrix products: default
BLAS:   /Library/Frameworks/R.framework/Versions/4.3-x86_64/Resources/lib/libRblas.0.dylib 
LAPACK: /Library/Frameworks/R.framework/Versions/4.3-x86_64/Resources/lib/libRlapack.dylib;  LAPACK version 3.11.0

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

time zone: America/Chicago
tzcode source: internal

attached base packages:
[1] stats4    stats     graphics  grDevices utils     datasets  methods  
[8] base     

other attached packages:
 [1] pheatmap_1.0.12             SARTools_1.8.1             
 [3] kableExtra_1.4.0            edgeR_3.42.4               
 [5] limma_3.56.2                ashr_2.2-63                
 [7] EnhancedVolcano_1.18.0      ggrepel_0.9.5              
 [9] ggpubr_0.6.0                apeglm_1.22.1              
[11] DESeq2_1.40.2               SummarizedExperiment_1.30.2
[13] Biobase_2.60.0              MatrixGenerics_1.12.3      
[15] matrixStats_1.3.0           GenomicRanges_1.52.1       
[17] GenomeInfoDb_1.36.4         IRanges_2.34.1             
[19] S4Vectors_0.38.2            BiocGenerics_0.46.0        
[21] reshape2_1.4.4              ggthemes_5.1.0             
[23] plotly_4.10.4               DT_0.33                    
[25] data.table_1.15.4           lubridate_1.9.3            
[27] forcats_1.0.0               stringr_1.5.1              
[29] dplyr_1.1.4                 purrr_1.0.2                
[31] readr_2.1.5                 tidyr_1.3.1                
[33] tibble_3.2.1                ggplot2_3.5.1              
[35] tidyverse_2.0.0             rmdformats_1.0.4           
[37] knitr_1.45                  workflowr_1.7.1            

loaded via a namespace (and not attached):
  [1] RColorBrewer_1.1-3      ggdendro_0.2.0          rstudioapi_0.16.0      
  [4] jsonlite_1.8.8          magrittr_2.0.3          rmarkdown_2.26         
  [7] fs_1.6.4                zlibbioc_1.46.0         vctrs_0.6.5            
 [10] memoise_2.0.1           RCurl_1.98-1.14         SQUAREM_2021.1         
 [13] rstatix_0.7.2           mixsqp_0.3-54           htmltools_0.5.8.1      
 [16] S4Arrays_1.0.6          truncnorm_1.0-9         broom_1.0.5            
 [19] sass_0.4.9              bslib_0.7.0             htmlwidgets_1.6.4      
 [22] plyr_1.8.9              cachem_1.0.8            whisker_0.4.1          
 [25] lifecycle_1.0.4         pkgconfig_2.0.3         Matrix_1.6-5           
 [28] R6_2.5.1                fastmap_1.1.1           GenomeInfoDbData_1.2.10
 [31] digest_0.6.35           numDeriv_2016.8-1.1     GGally_2.2.1           
 [34] colorspace_2.1-0        AnnotationDbi_1.62.2    ps_1.7.6               
 [37] rprojroot_2.0.4         irlba_2.3.5.1           RSQLite_2.3.6          
 [40] invgamma_1.1            fansi_1.0.6             timechange_0.3.0       
 [43] httr_1.4.7              abind_1.4-5             compiler_4.3.1         
 [46] bit64_4.0.5             withr_3.0.0             backports_1.4.1        
 [49] BiocParallel_1.34.2     DBI_1.2.2               carData_3.0-5          
 [52] ggstats_0.6.0           highr_0.10              ggsignif_0.6.4         
 [55] MASS_7.3-60.0.1         DelayedArray_0.26.7     tools_4.3.1            
 [58] httpuv_1.6.15           glue_1.7.0              callr_3.7.6            
 [61] promises_1.3.0          grid_4.3.1              getPass_0.2-4          
 [64] generics_0.1.3          gtable_0.3.5            tzdb_0.4.0             
 [67] hms_1.1.3               xml2_1.3.6              car_3.1-2              
 [70] utf8_1.2.4              XVector_0.40.0          pillar_1.9.0           
 [73] emdbook_1.3.13          genefilter_1.82.1       later_1.3.2            
 [76] splines_4.3.1           lattice_0.22-6          survival_3.6-4         
 [79] bit_4.0.5               annotate_1.78.0         tidyselect_1.2.1       
 [82] locfit_1.5-9.9          Biostrings_2.68.1       git2r_0.33.0           
 [85] gridExtra_2.3           bookdown_0.39           svglite_2.1.3          
 [88] xfun_0.43               stringi_1.8.4           lazyeval_0.2.2         
 [91] yaml_2.3.8              evaluate_0.23           codetools_0.2-20       
 [94] bbmle_1.0.25.1          cli_3.6.2               xtable_1.8-4           
 [97] systemfonts_1.0.6       munsell_0.5.1           processx_3.8.4         
[100] jquerylib_0.1.4         Rcpp_1.0.12             png_0.1-8              
[103] coda_0.19-4.1           XML_3.99-0.16.1         bdsmatrix_1.3-7        
[106] parallel_4.3.1          blob_1.2.4              bitops_1.0-7           
[109] viridisLite_0.4.2       mvtnorm_1.2-4           scales_1.3.0           
[112] crayon_1.5.2            rlang_1.1.3             KEGGREST_1.40.1

Differential Gene Expression with DESeq2 and edgeR

Maeva Techer

2024-05-15

1. DESeq2 analysis using STAR input

1.1. Preparing input files

Raw count matrices

Transcript-to-gene annotation file

Sample sheet

1.2. Lazy DEG analysis

Load libraries

Set parameters

HTML report

1.3. Step-by-step DEG analysis

Load libraries

Set parameters

Run the model Fit DESeq2

1.4. Visualizing and exploring the results

PCA of the samples

Sample Matrix Distance

MA-plot

Volcano Plot

Normalized Matrix distance