Last updated: 2024-01-29

Checks: 7 0

Knit directory: locust-comparative-genomics/

This reproducible R Markdown analysis was created with workflowr (version 1.7.1). The Checks tab describes the reproducibility checks that were applied when the results were created. The Past versions tab lists the development history.

R Markdown file: up-to-date

Great! Since the R Markdown file has been committed to the Git repository, you know the exact version of the code that produced these results.

Environment: empty

Great job! The global environment was empty. Objects defined in the global environment can affect the analysis in your R Markdown file in unknown ways. For reproduciblity it’s best to always run the code in an empty environment.

Seed: set.seed(20221025)

The command set.seed(20221025) was run prior to running the code in the R Markdown file. Setting a seed ensures that any results that rely on randomness, e.g. subsampling or permutations, are reproducible.

Session information: recorded

Great job! Recording the operating system, R version, and package versions is critical for reproducibility.

Cache: none

Nice! There were no cached chunks for this analysis, so you can be confident that you successfully produced the results during this run.

File paths: relative

Great job! Using relative paths to the files within your workflowr project makes it easier to run your code on other machines.

Repository version: 0837617

Great! You are using Git for version control. Tracking code development and connecting the code version to the results is critical for reproducibility.

The results in this page were generated with repository version 0837617. See the Past versions tab to see a history of the changes made to the R Markdown and HTML files.

Note that you need to be careful to ensure that all relevant files for the analysis have been committed to Git prior to generating the results (you can use wflow_publish or wflow_git_commit). workflowr only checks the R Markdown file, but you know if there are other scripts or data files that it depends on. Below is the status of the Git repository when the results were generated:


Ignored files:
    Ignored:    .DS_Store
    Ignored:    .RData
    Ignored:    .Rhistory
    Ignored:    .Rproj.user/
    Ignored:    analysis/.DS_Store
    Ignored:    analysis/.Rhistory
    Ignored:    data/.DS_Store
    Ignored:    data/.Rhistory
    Ignored:    data/metadata/.DS_Store
    Ignored:    figures/

Note that any generated files, e.g. HTML, png, CSS, etc., are not included in this status report because it is ok for generated content to have uncommitted changes.

These are the previous versions of the repository in which changes were made to the R Markdown (analysis/3_deg-analysis.Rmd) and HTML (docs/3_deg-analysis.html) files. If you’ve configured a remote Git repository (see ?wflow_git_remote), click on the hyperlinks in the table below to view the files as they were in that past version.

File	Version	Author	Date	Message
html	f701a01	Maeva A. TECHER	2024-01-29	reupdate
html	a3d138b	Maeva A. TECHER	2024-01-29	Build site.
Rmd	915285e	Maeva A. TECHER	2024-01-29	wflow_publish("analysis/3_deg-analysis.Rmd")
html	959a4e9	Maeva A. TECHER	2024-01-24	Build site.
html	1b09cbe	Maeva A. TECHER	2024-01-24	remove
html	478e72d	Maeva A. TECHER	2023-12-18	Build site.
Rmd	53877fa	Maeva A. TECHER	2023-12-18	add pages

EdgeR

Load required R libraries

Here we present the workflow example with the head data from S. piceifrons

EdgeR analysis using STAR input

Preparing input files

Raw count matrices

We generated this in the precedent section.

Transcript-to-gene annotation file

Below is the example with S. piceifrons

# Download annotation and place it into the folder refgenomes
wget https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/021/461/385/GCF_021461385.2_iqSchPice1.1/GCF_021461385.2_iqSchPice1.1_genomic.gtf.gz

# first column is the transcript ID, second column is the gene ID, third column is the gene symbol
zcat GCF_021461385.2_iqSchPice1.1_genomic.gtf.gz | awk -F "\t" 'BEGIN{OFS="\t"}{if($3=="transcript"){split($9, a, "\""); print a[4],a[2],a[8]}}' > tx2gene.piceifrons.csv

Sample sheet

Differential expression analysis

Import input count data

We start by reading the sample sheet.


###############################
#  MOSTLY FOR THE HMTL REPORT
###############################
## PARAMETERS for running the script
homeDir <- "/Users/alphamanae/Documents/GitHub/locust-phase-transition-RNAseq"
workDir <- "/Users/alphamanae/Documents/GitHub/locust-phase-transition-RNAseq/data/piceifrons"          # Working directory
projectName <- "SGREG_HEAD"                         # name of the project
author <- "Maeva TECHER"                        # author of the statistical analysis/report

## Create all the needed directories
setwd(workDir)
Dirname <- paste("edgeR_", projectName, sep = "")
dir.create(Dirname)
setwd(Dirname)
workDir_edgeR <- getwd()
# path to the directory containing raw counts files
rawDir <- "/Users/alphamanae/Documents/GitHub/locust-phase-transition-RNAseq/data/piceifrons/STAR_counts_4thcol"  

## PARAMETERS for running EdgeR
tresh_logfold <- 1                            # Treshold for log2(foldchange) in final DE-files
tresh_padj <- 0.05                            # Treshold for adjusted p-valued in final DE-files
alpha_edgeR <- 0.05                   # threshold of statistical significance
pAdjustMethod_edgeR <- "BH"           # p-value adjustment method: "BH" (default) or "BY"
featuresToRemove <- c(NULL)           # names of the features to be removed, NULL if none or if using Idxstats
varInt <- "RearingCondition"          # factor of interest
condRef <- "Isolated"                 # reference biological condition
batch <- NULL                         # blocking factor: NULL (default) or "batch" for example  
cpmCutoff <- 1                        # counts-per-million cut-off to filter low counts
gene.selection <- "pairwise"          # selection of the features in MDSPlot
normalizationMethod <- "TMM"          # normalization method: "TMM" (default), "RLE" (DESeq) or "upperquartile"

# Path and name of targetfile containing conditions and file names
targetFile <- "/Users/alphamanae/Documents/GitHub/locust-phase-transition-RNAseq/data/piceifrons/list/HeadSPICE.txt"    
colors <- c("#B31B21", "#1465AC")


# checking parameters
setwd(workDir_edgeR)
checkParameters.edgeR(projectName=projectName,author=author,targetFile=targetFile,
                      rawDir=rawDir,featuresToRemove=featuresToRemove,varInt=varInt,
                      condRef=condRef,batch=batch,alpha=alpha_edgeR,pAdjustMethod=pAdjustMethod_edgeR,
                      cpmCutoff=cpmCutoff,gene.selection=gene.selection,
                      normalizationMethod=normalizationMethod,colors=colors)

We followed the comprehensive tutorial for edgeR

Filtering and normalizing


# loading target file
target <- loadTargetFile(targetFile=targetFile, varInt=varInt, condRef=condRef, batch=batch)

# loading counts
counts <- loadCountData(target=target, rawDir=rawDir, featuresToRemove=featuresToRemove)

## Create DGEList data class
dge <- DGEList(counts = counts, group = target[,varInt])
dge

## Filter out the genes with low counts
keep <- filterByExpr(y = dge)
dge <- dge[keep, , keep.lib.sizes=FALSE]
dge

## Normalization and effective library sizes
dge <- calcNormFactors(object = dge)
dge

Run the model fit with EdgeR


## Dispersion estimation
dge <- estimateDisp(y = dge)

## Testing for differential gene expression
et <- exactTest(object = dge)

## Extract the table with adjusted p-values (FDR = padj for DEseq2)
top_degs = topTags(object = et, n = "Inf")
top_degs

## Obtain the summary
summary(decideTests(object = et, lfc = 1))

## Export the table
#write.csv(as.data.frame(top_degs), file="condition_infected_vs_control_dge.csv")

DGE Visualization

Volcano plot


sakura <- as.data.frame(top_degs)

keyvals <-ifelse( 
  sakura$logFC >= 1 & sakura$FDR <= 0.05, '#B31B21', 
  ifelse(sakura$logFC <= -1 & sakura$FDR <= 0.05, '#1465AC', 'darkgray')) 

keyvals[is.na(keyvals)] <-'darkgray' 
names(keyvals)[keyvals == "#B31B21"] <-'Upregulated' 
names(keyvals)[keyvals == "#1465AC"] <-'Downregulated' 
names(keyvals)[keyvals == 'darkgray'] <-'NS'

EnhancedVolcano(sakura,
                lab = rownames(sakura),
                x = 'logFC',
                y = 'FDR',
                pCutoff = 0.05,
                FCcutoff = 1,
                pointSize = 3,
                labSize = 4,
                colAlpha = 4/5,
                colCustom = keyvals,
                drawConnectors = TRUE)

MA plot


plotMD(object = et, p.value = 0.05)

Create a html report with EdgeR

setwd(workDir_edgeR)
# description plots
majSequences <- descriptionPlots(counts=counts, group=target[,varInt], col=colors)

# edgeR analysis
out.edgeR <- run.edgeR(counts=counts, target=target, varInt=varInt, condRef=condRef,
                       batch=batch, cpmCutoff=cpmCutoff, normalizationMethod=normalizationMethod,
                       pAdjustMethod=pAdjustMethod_edgeR)

# MDS + clustering
exploreCounts(object=out.edgeR$dge, group=target[,varInt], gene.selection=gene.selection, col=colors)

# summary of the analysis (boxplots, dispersions, export table, nDiffTotal, histograms, MA plot)
summaryResults <- summarizeResults.edgeR(out.edgeR, group=target[,varInt], counts=counts, alpha=alpha_edgeR, col=colors)

# generating HTML report
writeReport.edgeR(target=target, counts=counts, out.edgeR=out.edgeR, summaryResults=summaryResults,
                  majSequences=majSequences, workDir=workDir_edgeR, projectName=projectName, author=author,
                  targetFile=targetFile, rawDir=rawDir, featuresToRemove=featuresToRemove, varInt=varInt,
                  condRef=condRef, batch=batch, alpha=alpha_edgeR, pAdjustMethod=pAdjustMethod_edgeR, colors=colors,
                  gene.selection=gene.selection, normalizationMethod=normalizationMethod, cpmCutoff =cpmCutoff)

#DeSeq2

Load required R libraries

#(install first from CRAN or Bioconductor)
library("knitr")
library("rmdformats")
library("tidyverse")
library("DT")  # for making interactive search table
library("plotly") # for interactive plots
library("ggthemes") # for theme_calc
library("reshape2")
library("DESeq2")
library("data.table")
library("apeglm")
library("ggpubr")
library("ggplot2")
library("ggrepel")
library("EnhancedVolcano")
library("SARTools")
library("pheatmap")

## Global options
options(max.print="10000")
knitr::opts_chunk$set(
    echo = TRUE,
    message = FALSE,
    warning = FALSE,
    cache = FALSE,
    comment = FALSE,
    prompt = FALSE,
    tidy = TRUE
)
opts_knit$set(width=75)

Here we present the workflow example with the head data from S. piceifrons

For the analysis of differentially expressed genes, we will follow some guidelines from an online RNA course tutorial that uses either DESeq2 or edgeR on STAR output. We also adapted some script lines from Foquet et al. 2021 code.

DESeq2 tests for differential expression using negative binomial generalized linear models. DESeq2 (as edgeR) is based on the hypothesis that most genes are not differentially expressed. The package takes as an input raw counts (i.e. non normalized counts): the DESeq2 model internally corrects for library size, so giving as an input normalized count would be incorrect.

DESeq2 analysis using STAR input

Preparing input files

Raw count matrices

We generated this in the precedent section.

Transcript-to-gene annotation file

Below is the example with S. piceifrons

# Download annotation and place it into the folder refgenomes
wget https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/021/461/385/GCF_021461385.2_iqSchPice1.1/GCF_021461385.2_iqSchPice1.1_genomic.gtf.gz

# first column is the transcript ID, second column is the gene ID, third column is the gene symbol
zcat GCF_021461385.2_iqSchPice1.1_genomic.gtf.gz | awk -F "\t" 'BEGIN{OFS="\t"}{if($3=="transcript"){split($9, a, "\""); print a[4],a[2],a[8]}}' > tx2gene.piceifrons.csv

Sample sheet

DESeq2 needs a sample sheet that describes the samples characteristics: SampleName, FileName (…counts.txt), and subsequently anything that can be used for statistical design such as RearingCondition, replicates, tissue, time points, etc. in the form.

The design indicates how to model the samples: in the model we need to specify what we want to measure and what we want to control.

Differential expression analysis

Import input count data

We start by reading the sample sheet.


###############################
#  MOSTLY FOR THE HMTL REPORT
###############################
## PARAMETERS for running the script
homeDir <- "/Users/alphamanae/Documents/GitHub/locust-phase-transition-RNAseq"
workDir <- "/Users/alphamanae/Documents/GitHub/locust-phase-transition-RNAseq/data/piceifrons"          # Working directory
projectName <- "SPICE_HEAD"                         # name of the project
author <- "Maeva TECHER"                        # author of the statistical analysis/report

## Create all the needed directories
setwd(workDir)
Dirname <- paste("DEseq2_", projectName, sep = "")
dir.create(Dirname)
setwd(Dirname)
workDir_DEseq2 <- getwd()
# path to the directory containing raw counts files
rawDir <- "/Users/alphamanae/Documents/GitHub/locust-phase-transition-RNAseq/data/piceifrons/STAR_counts_4thcol"              

## PARAMETERS for running DEseq2
tresh_logfold <- 1                            # Treshold for log2(foldchange) in final DE-files
tresh_padj <- 0.05                            # Treshold for adjusted p-valued in final DE-files
alpha_DEseq2 <- 0.05                  # threshold of statistical significance
pAdjustMethod_DEseq2 <- "BH"          # p-value adjustment method: "BH" (default) or "BY"
featuresToRemove <- c(NULL)           # names of the features to be removed, NULL if none or if using Idxstats
varInt <- "RearingCondition"          # factor of interest
condRef <- "Isolated"                 # reference biological condition
batch <- NULL                         # blocking factor: NULL (default) or "batch" for example  
fitType <- "parametric"               # mean-variance relationship: "parametric" (default) or "local"
cooksCutoff <- TRUE                   # TRUE/FALSE to perform the outliers detection (default is TRUE)
independentFiltering <- TRUE          # TRUE/FALSE to perform independent filtering (default is TRUE)
typeTrans <- "VST"                      # transformation for PCA/clustering: "VST" or "rlog"
locfunc <- "median"

# Path and name of targetfile containing conditions and file names
targetFile <- "/Users/alphamanae/Documents/GitHub/locust-phase-transition-RNAseq/data/piceifrons/list/HeadSPICE.txt"           
colors <- c("#B31B21", "#1465AC")

# checking parameters
setwd(workDir_DEseq2)
checkParameters.DESeq2(projectName=projectName,author=author,targetFile=targetFile,
                       rawDir=rawDir,featuresToRemove=featuresToRemove,varInt=varInt,
                       condRef=condRef,batch=batch,fitType=fitType,cooksCutoff=cooksCutoff,
                       independentFiltering=independentFiltering,alpha=alpha,pAdjustMethod=pAdjustMethod_DEseq2,
                       typeTrans=typeTrans,locfunc=locfunc,colors=colors)



###############################
###############################

setwd(homeDir)
sampletable <- fread("data/piceifrons/list/HeadSPICE.txt")
## add the sample names as row names (it is needed for some of the DESeq functions)
rownames(sampletable) <- sampletable$SampleName

## Make sure discriminant variables are factor
sampletable$RearingCondition <- as.factor(sampletable$RearingCondition)
sampletable$Tissue <- as.factor(sampletable$Tissue)
dim(sampletable)

Then we obtain the output from STAR GeneCount and import here individually using the sampletable as a reference to fetch them. We also filter the lowly expressed genes to avoid noisy data.


## Import count files
satoshi <- DESeqDataSetFromHTSeqCount(sampleTable = sampletable,
                        directory = "data/piceifrons/STAR_counts_4thcol",
                        design = ~ RearingCondition)
satoshi

# keep genes for which sums of raw counts across experimental samples is > 5
satoshi <- satoshi[rowSums(counts(satoshi)) > 5, ]
nrow(satoshi)

#set a standard to be compared to (hatchling) ONLY IF WE HAVE A CONTROL
#satoshi$Tissue <- relevel(satoshi$Tissue, ref = "Whole_body")

Run the model Fit DESeq2

Run DESeq2 analysis using DESeq, which performs (1) estimation of size factors, (2) estimation of dispersion, then (3) Negative Binomial GLM fitting and Wald statistics. The results tables (log2 fold changes and p-values) can be generated using the results function (copied from online chapter)

Fit the statistical model

shigeru <- DESeq(satoshi) cbind(resultsNames(shigeru))

Here we plot the adjusted p-value means corrected for multiple testing (FDR padj)

res_shigeru <- results(shigeru)

We only keep genes with an adjusted p-value cutoff > 0.05 by changing the default significance cut-off

sum(res_shigeru$padj < tresh_padj, na.rm = TRUE)

brock <- results(shigeru, name = “RearingCondition_Isolated_vs_Crowded”, alpha = alpha_DEseq2) summary(brock)

Details of what is each column meaning in our final result

mcols(brock)$description head(brock)


For this data set after FDR filtering of 0.05, we have 254 genes up-regulates and 404 genes down-regulated in crowded versus solitary individuals.

## Visualizing and exploring the results
### PCA of the samples

We transformed the data for visualization by comparing both recommended **rlog** (Regularized log) or **vst** (Variance Stabilizing Transformation) transformations. Both options produce log2 scale data which has been normalized by the DESeq2 method with respect to library size.

Try with the vst transformation

shigeru_vst <- vst(shigeru) shigeru_rlog <- rlog(shigeru)


Plot the PCA rlog

Create the pca on the defined groups

pcaData <- plotPCA(object = shigeru_rlog, intgroup = c(“RearingCondition”, “Tissue”),returnData=TRUE)

Store the information for each axis variance in %

percentVar <- round(100 * attr(pcaData, “percentVar”))

Make sure that the discriminant variable are in factor for using shape and color later on

pcaData$RearingCondition<-factor(pcaData$RearingCondition,levels=c(“Crowded”,“Isolated”), labels=c(“crowded piceifrons”,“isolated piceifrons”)) levels(pcaData$RearingCondition) #pcaData$Tissue<-factor(pcaData1$Tissue,levels=c("Whole_body","Optical_lobes"), labels=c("Hatchling", "OLB")) #levels(pcaData$Tissue)

ggplot(pcaData, aes(PC1, PC2, color= RearingCondition)) + geom_point(size=4)+xlab(paste0(“PC1:”,percentVar[1],“% variance”)) + ylab(paste0(“PC2:”,percentVar[2],“% variance”)) + scale_color_manual(values = c(“blue”, “red”)) + geom_text_repel(aes(label = name), nudge_x = -1, nudge_y = 0.2, size = 3) + coord_fixed()+ theme_bw()+ theme(legend.title = element_blank())+ theme(legend.text = element_text(face=“bold”, size=15))+ theme(axis.text = element_text(size=15))+ theme(axis.title = element_text(size=16))+ ggtitle(“PCA on S. piceifrons head tissues”, subtitle = “rlog transformation”) + xlab(paste0(“PC1:”,percentVar[1],“% variance”)) + ylab(paste0(“PC2:”,percentVar[2],“% variance”))


Plot the PCA vsd

Create the pca on the defined groups

pcaData <- plotPCA(object = shigeru_vst, intgroup = c(“RearingCondition”, “Tissue”),returnData=TRUE)

Store the information for each axis variance in %

percentVar <- round(100 * attr(pcaData, “percentVar”))

Make sure that the discriminant variable are in factor for using shape and color later on

ggplot(pcaData, aes(PC1, PC2, color= RearingCondition)) + geom_point(size=4)+xlab(paste0(“PC1:”,percentVar[1],“% variance”)) + ylab(paste0(“PC2:”,percentVar[2],“% variance”)) + scale_color_manual(values = c(“blue”, “red”)) + geom_text_repel(aes(label = name), nudge_x = -1, nudge_y = 0.2, size = 3) + coord_fixed()+ theme_bw()+ theme(legend.title = element_blank())+ theme(legend.text = element_text(face=“bold”, size=15))+ theme(axis.text = element_text(size=15))+ theme(axis.title = element_text(size=16))+ ggtitle(“PCA on S. piceifrons head tissues”, subtitle = “vst transformation”) + xlab(paste0(“PC1:”,percentVar[1],“% variance”)) + ylab(paste0(“PC2:”,percentVar[2],“% variance”))


### Sample Matrix Distance

Using also the transformed data, we check the distance between samples and see how they correlate to each others.

Heatmap using rlog

``

# calculate between-sample distance matrix
sampleDistMatrix.rlog <- as.matrix(dist(t(assay(shigeru_rlog))))

metadata <- sampletable[,c("RearingCondition", "Tissue")]
rownames(metadata) <- sampletable$SampleName


pheatmap(sampleDistMatrix.rlog, annotation_col=metadata, main = "Head tissue heatmap, rlog transformation")

Heatmap using vst


# calculate between-sample distance matrix
sampleDistMatrix.vst <- as.matrix(dist(t(assay(shigeru_vst))))

pheatmap(sampleDistMatrix.vst, annotation_col=metadata, main = "Head tissue heatmap, vst transformation")

MA-plot

This plot allows us to show the log2 fold changes over the mean of normalized counts for all the samples. Points will be colored in red if the adjusted p-value is less than 0.05 and the log2 fold change is bigger than 1. In blue, will be the reverse for the log2 fold change.

To generate more accurate log2 foldchange estimates, DESeq2 allows (and recommends) the shrinkage of the LFC estimates toward zero when the information for a gene is low, which could include:

-Low counts
-High dispersion values


# include the log2FoldChange shrinkage use to visualize gene ranking
de_shrink <- lfcShrink(dds = shigeru, coef="RearingCondition_Isolated_vs_Crowded", type="apeglm")
head(de_shrink)

# Ma plot
maplot <-ggmaplot(de_shrink, fdr = 0.05, fc = 1, size = 2, palette = c("#B31B21", "#1465AC", "darkgray"), genenames = as.vector(rownames(de_shrink$name)), top = 0,legend="top",label.select = NULL) +
coord_cartesian(xlim = c(0, 20)) + 
scale_y_continuous(limits=c(-12, 12)) + 
theme(axis.text.x = element_text(size=16),axis.text.y = element_text(size=15),axis.title.x = element_text(size=17),axis.title.y = element_text(size=17),axis.line = element_line(size = 1, colour="gray20"),axis.ticks = element_line(size = 1, colour="gray20"))+
guides(color = guide_legend(override.aes = list(size = c(3,3,3))))+
theme(legend.position = c(0.70, 0.12),legend.text=element_text(size=14,face="bold"),legend.background = element_rect(fill="transparent"))+
theme(plot.title = element_text(size=18, colour="gray30", face="bold",hjust=0.06, vjust=-5))+
labs(title="MA-plot for the shrunken log2 fold changes in the head")
maplot

Volcano Plot

The EnhancedVolcano helps visualise the resulst of differential expression analysis.

keyvals <-ifelse( 
  res_shigeru$log2FoldChange >= 1 & res_shigeru$padj <= 0.05, '#B31B21', 
  ifelse(res_shigeru$log2FoldChange <= -1 & res_shigeru$padj <= 0.05, '#1465AC', 'darkgray')) 

keyvals[is.na(keyvals)] <-'darkgray' 
names(keyvals)[keyvals == "#B31B21"] <-'Upregulated' 
names(keyvals)[keyvals == "#1465AC"] <-'Downregulated' 
names(keyvals)[keyvals == 'darkgray'] <-'NS'

EnhancedVolcano(res_shigeru,
                lab = rownames(res_shigeru),
                x = 'log2FoldChange',
                y = 'padj',
                pCutoff = 0.05,
                FCcutoff = 1,
                pointSize = 3,
                labSize = 4,
                colAlpha = 4/5,
                colCustom = keyvals,
                drawConnectors = TRUE)

Normalized Matrix distance

We then normalize the result by extracting only significant genes with a fold change of 1.


resorted_deresults <- res_shigeru[order(res_shigeru$padj),]

## Select only the genes that have a padj > 0.05 and with minimum log2FoldChange of 1
sig <- resorted_deresults[!is.na(resorted_deresults$padj) &
                            resorted_deresults$padj < tresh_padj &
                            abs(resorted_deresults$log2FoldChange) >= tresh_logfold,]
selected <- rownames(sig);selected

## Norm transform the data from DEseq2 run
ntd <- normTransform(satoshi)

## Plot the relation among samples considering only the significant genes
pheatmap(assay(ntd)[selected,], 
         cluster_rows = TRUE, 
         show_rownames = TRUE, 
         cluster_cols = TRUE, 
         cutree_rows = 4,
         cutree_cols = 3,
         labels_col= colData(satoshi)$SampleName)

Create a hmtl report with DEseq2


## import the sample sheet that indicates Rearing Conditions and Tissue origins
setwd(workDir_DEseq2)
# loading target file
target <- loadTargetFile(targetFile=targetFile, varInt=varInt, condRef=condRef, batch=batch)

# loading counts
counts <- loadCountData(target=target, rawDir=rawDir, featuresToRemove=featuresToRemove)

# description plots
majSequences <- descriptionPlots(counts=counts, group=target[,varInt], col=colors)

# analysis with DESeq2
out.DESeq2 <- run.DESeq2(counts=counts, target=target, varInt=varInt, batch=batch,
                         locfunc=locfunc, fitType=fitType, pAdjustMethod=pAdjustMethod_DEseq2,
                         cooksCutoff=cooksCutoff, independentFiltering=independentFiltering, alpha=alpha_DEseq2)

# PCA + Clustering
exploreCounts(object=out.DESeq2$dds, group=target[,varInt], typeTrans=typeTrans, col=colors)

# summary of the analysis (boxplots, dispersions, diag size factors, export table, nDiffTotal, histograms, MA plot)
summaryResults <- summarizeResults.DESeq2(out.DESeq2, group=target[,varInt], col=colors,
                                          independentFiltering=independentFiltering,
                                          cooksCutoff=cooksCutoff, alpha=alpha_DEseq2)

# generating HTML report
writeReport.DESeq2(target=target, counts=counts, out.DESeq2=out.DESeq2, summaryResults=summaryResults,
                   majSequences=majSequences, workDir=workDir_DEseq2, projectName=projectName, author=author,
                   targetFile=targetFile, rawDir=rawDir, featuresToRemove=featuresToRemove, varInt=varInt,
                   condRef=condRef, batch=batch, fitType=fitType, cooksCutoff=cooksCutoff,
                   independentFiltering=independentFiltering, alpha=alpha_DEseq2, pAdjustMethod=pAdjustMethod_DEseq2,
                   typeTrans=typeTrans, locfunc=locfunc, colors=colors)

sessionInfo()

FALSE R version 4.3.1 (2023-06-16)
FALSE Platform: x86_64-apple-darwin20 (64-bit)
FALSE Running under: macOS Sonoma 14.1.2
FALSE 
FALSE Matrix products: default
FALSE BLAS:   /Library/Frameworks/R.framework/Versions/4.3-x86_64/Resources/lib/libRblas.0.dylib 
FALSE LAPACK: /Library/Frameworks/R.framework/Versions/4.3-x86_64/Resources/lib/libRlapack.dylib;  LAPACK version 3.11.0
FALSE 
FALSE locale:
FALSE [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
FALSE 
FALSE time zone: America/Chicago
FALSE tzcode source: internal
FALSE 
FALSE attached base packages:
FALSE [1] stats4    stats     graphics  grDevices utils     datasets  methods  
FALSE [8] base     
FALSE 
FALSE other attached packages:
FALSE  [1] pheatmap_1.0.12             SARTools_1.8.1             
FALSE  [3] kableExtra_1.3.4            edgeR_3.42.4               
FALSE  [5] limma_3.56.2                ashr_2.2-63                
FALSE  [7] EnhancedVolcano_1.18.0      ggrepel_0.9.5              
FALSE  [9] ggpubr_0.6.0                apeglm_1.22.1              
FALSE [11] data.table_1.14.10          DESeq2_1.40.2              
FALSE [13] SummarizedExperiment_1.30.2 Biobase_2.60.0             
FALSE [15] MatrixGenerics_1.12.3       matrixStats_1.2.0          
FALSE [17] GenomicRanges_1.52.1        GenomeInfoDb_1.36.4        
FALSE [19] IRanges_2.34.1              S4Vectors_0.38.2           
FALSE [21] BiocGenerics_0.46.0         reshape2_1.4.4             
FALSE [23] ggthemes_5.0.0              plotly_4.10.4              
FALSE [25] DT_0.31                     lubridate_1.9.3            
FALSE [27] forcats_1.0.0               stringr_1.5.1              
FALSE [29] dplyr_1.1.4                 purrr_1.0.2                
FALSE [31] readr_2.1.5                 tidyr_1.3.0                
FALSE [33] tibble_3.2.1                ggplot2_3.4.4              
FALSE [35] tidyverse_2.0.0             rmdformats_1.0.4           
FALSE [37] knitr_1.45                  workflowr_1.7.1            
FALSE 
FALSE loaded via a namespace (and not attached):
FALSE   [1] RColorBrewer_1.1-3      ggdendro_0.1.23         rstudioapi_0.15.0      
FALSE   [4] jsonlite_1.8.8          magrittr_2.0.3          rmarkdown_2.25         
FALSE   [7] fs_1.6.3                zlibbioc_1.46.0         vctrs_0.6.5            
FALSE  [10] memoise_2.0.1           RCurl_1.98-1.14         SQUAREM_2021.1         
FALSE  [13] webshot_0.5.5           rstatix_0.7.2           mixsqp_0.3-54          
FALSE  [16] htmltools_0.5.7         S4Arrays_1.0.6          truncnorm_1.0-9        
FALSE  [19] broom_1.0.5             sass_0.4.8              bslib_0.6.1            
FALSE  [22] htmlwidgets_1.6.4       plyr_1.8.9              cachem_1.0.8           
FALSE  [25] whisker_0.4.1           lifecycle_1.0.4         pkgconfig_2.0.3        
FALSE  [28] Matrix_1.6-5            R6_2.5.1                fastmap_1.1.1          
FALSE  [31] GenomeInfoDbData_1.2.10 digest_0.6.34           numDeriv_2016.8-1.1    
FALSE  [34] GGally_2.2.0            colorspace_2.1-0        AnnotationDbi_1.62.2   
FALSE  [37] ps_1.7.5                rprojroot_2.0.4         irlba_2.3.5.1          
FALSE  [40] RSQLite_2.3.4           invgamma_1.1            fansi_1.0.6            
FALSE  [43] timechange_0.2.0        httr_1.4.7              abind_1.4-5            
FALSE  [46] compiler_4.3.1          bit64_4.0.5             withr_2.5.2            
FALSE  [49] backports_1.4.1         BiocParallel_1.34.2     DBI_1.2.1              
FALSE  [52] carData_3.0-5           ggstats_0.5.1           ggsignif_0.6.4         
FALSE  [55] MASS_7.3-60.0.1         DelayedArray_0.26.7     tools_4.3.1            
FALSE  [58] httpuv_1.6.13           glue_1.7.0              callr_3.7.3            
FALSE  [61] promises_1.2.1          grid_4.3.1              getPass_0.2-4          
FALSE  [64] generics_0.1.3          gtable_0.3.4            tzdb_0.4.0             
FALSE  [67] hms_1.1.3               xml2_1.3.6              car_3.1-2              
FALSE  [70] utf8_1.2.4              XVector_0.40.0          pillar_1.9.0           
FALSE  [73] emdbook_1.3.13          genefilter_1.82.1       later_1.3.2            
FALSE  [76] splines_4.3.1           lattice_0.22-5          survival_3.5-7         
FALSE  [79] bit_4.0.5               annotate_1.78.0         tidyselect_1.2.0       
FALSE  [82] locfit_1.5-9.8          Biostrings_2.68.1       git2r_0.33.0           
FALSE  [85] gridExtra_2.3           bookdown_0.37           svglite_2.1.3          
FALSE  [88] xfun_0.41               stringi_1.8.3           lazyeval_0.2.2         
FALSE  [91] yaml_2.3.8              evaluate_0.23           codetools_0.2-19       
FALSE  [94] bbmle_1.0.25.1          cli_3.6.2               xtable_1.8-4           
FALSE  [97] systemfonts_1.0.5       munsell_0.5.0           processx_3.8.3         
FALSE [100] jquerylib_0.1.4         Rcpp_1.0.12             png_0.1-8              
FALSE [103] coda_0.19-4             XML_3.99-0.16           bdsmatrix_1.3-6        
FALSE [106] parallel_4.3.1          blob_1.2.4              bitops_1.0-7           
FALSE [109] viridisLite_0.4.2       mvtnorm_1.2-4           scales_1.3.0           
FALSE [112] crayon_1.5.2            rlang_1.1.3             formatR_1.14           
FALSE [115] KEGGREST_1.40.1         rvest_1.0.3

Why study gene expression changes during the locust phase transition?

Maeva Techer

2024-01-29

EdgeR

Load required R libraries

EdgeR analysis using STAR input

Preparing input files

Raw count matrices

Transcript-to-gene annotation file

Sample sheet

Differential expression analysis

Import input count data

Filtering and normalizing

Run the model fit with EdgeR

DGE Visualization

Volcano plot

MA plot

Create a html report with EdgeR

Load required R libraries

DESeq2 analysis using STAR input

Preparing input files

Raw count matrices

Transcript-to-gene annotation file

Sample sheet

Differential expression analysis

Import input count data

Run the model Fit DESeq2

Fit the statistical model

Here we plot the adjusted p-value means corrected for multiple testing (FDR padj)

We only keep genes with an adjusted p-value cutoff > 0.05 by changing the default significance cut-off

Details of what is each column meaning in our final result

Try with the vst transformation

Create the pca on the defined groups

Store the information for each axis variance in %

Make sure that the discriminant variable are in factor for using shape and color later on

Create the pca on the defined groups

Store the information for each axis variance in %

Make sure that the discriminant variable are in factor for using shape and color later on

MA-plot

Volcano Plot

Normalized Matrix distance

Create a hmtl report with DEseq2