Last updated: 2024-05-15
Checks: 6 1
Knit directory:
locust-comparative-genomics/
This reproducible R Markdown analysis was created with workflowr (version 1.7.1). The Checks tab describes the reproducibility checks that were applied when the results were created. The Past versions tab lists the development history.
Great! Since the R Markdown file has been committed to the Git repository, you know the exact version of the code that produced these results.
Great job! The global environment was empty. Objects defined in the global environment can affect the analysis in your R Markdown file in unknown ways. For reproduciblity it’s best to always run the code in an empty environment.
The command set.seed(20221025) was run prior to running
the code in the R Markdown file. Setting a seed ensures that any results
that rely on randomness, e.g. subsampling or permutations, are
reproducible.
Great job! Recording the operating system, R version, and package versions is critical for reproducibility.
Nice! There were no cached chunks for this analysis, so you can be confident that you successfully produced the results during this run.
Using absolute paths to the files within your workflowr project makes it difficult for you and others to run your code on a different machine. Change the absolute path(s) below to the suggested relative path(s) to make your code more reproducible.
| absolute | relative |
|---|---|
| /Users/alphamanae/Documents/GitHub/locust-comparative-genomics | . |
| /Users/alphamanae/Documents/GitHub/locust-comparative-genomics/data | data |
| /Users/alphamanae/Documents/GitHub/locust-comparative-genomics/data/gregaria/deg_counts/STAR_newparams/gregaria/featurecounts/finalcounts/ | data/gregaria/deg_counts/STAR_newparams/gregaria/featurecounts/finalcounts |
| /Users/alphamanae/Documents/GitHub/locust-comparative-genomics/data/list/HeadSGREG.txt | data/list/HeadSGREG.txt |
| /Users/alphamanae/Documents/GitHub/locust-comparative-genomics/data/list/gregaria_lncrna_list.txt | data/list/gregaria_lncrna_list.txt |
Great! You are using Git for version control. Tracking code development and connecting the code version to the results is critical for reproducibility.
The results in this page were generated with repository version 0ac8380. See the Past versions tab to see a history of the changes made to the R Markdown and HTML files.
Note that you need to be careful to ensure that all relevant files for
the analysis have been committed to Git prior to generating the results
(you can use wflow_publish or
wflow_git_commit). workflowr only checks the R Markdown
file, but you know if there are other scripts or data files that it
depends on. Below is the status of the Git repository when the results
were generated:
Ignored files:
Ignored: .DS_Store
Ignored: .RData
Ignored: .Rhistory
Ignored: .Rproj.user/
Ignored: analysis/.DS_Store
Ignored: analysis/.Rhistory
Ignored: data/.DS_Store
Ignored: data/.Rhistory
Ignored: data/DEseq2_SAMER_SAMER_HEAD_STARnew_features/.DS_Store
Ignored: data/DEseq2_SAMER_SAMER_THORAX_STARnew_features/.DS_Store
Ignored: data/DEseq2_SAMER_SGREG_HEAD_STARnew_features/.DS_Store
Ignored: data/DEseq2_SAMER_SGREG_THORAX_STARnew_features/.DS_Store
Ignored: data/DEseq2_SCANC_SCANC_HEAD_STARnew_features/.DS_Store
Ignored: data/DEseq2_SCANC_SCANC_THORAX_STARnew_features/.DS_Store
Ignored: data/DEseq2_SCANC_SGREG_HEAD_STARnew_features/.DS_Store
Ignored: data/DEseq2_SCANC_SGREG_THORAX_STARnew_features/.DS_Store
Ignored: data/DEseq2_SCUBE_SCUBE_HEAD_STARnew_features/.DS_Store
Ignored: data/DEseq2_SCUBE_SCUBE_THORAX_STARnew_features/.DS_Store
Ignored: data/DEseq2_SCUBE_SGREG_HEAD_STARnew_features/.DS_Store
Ignored: data/DEseq2_SCUBE_SGREG_THORAX_STARnew_features/.DS_Store
Ignored: data/DEseq2_SGREG_SGREG_HEAD_STARnew_features/.DS_Store
Ignored: data/DEseq2_SGREG_SGREG_THORAX_STARnew_features/.DS_Store
Ignored: data/DEseq2_SNITE_SGREG_HEAD_STARnew_features/.DS_Store
Ignored: data/DEseq2_SNITE_SGREG_THORAX_STARnew_features/.DS_Store
Ignored: data/DEseq2_SNITE_SNITE_HEAD_STARnew_features/.DS_Store
Ignored: data/DEseq2_SNITE_SNITE_THORAX_STARnew_features/.DS_Store
Ignored: data/DEseq2_SPICE_SGREG_HEAD_STARnew_features/.DS_Store
Ignored: data/DEseq2_SPICE_SGREG_THORAX_STARnew_features/.DS_Store
Ignored: data/DEseq2_SPICE_SPICE_HEAD_STARnew_features/.DS_Store
Ignored: data/DEseq2_SPICE_SPICE_THORAX_STARnew_features/.DS_Store
Ignored: data/americana/.DS_Store
Ignored: data/americana/deg_counts/.DS_Store
Ignored: data/americana/deg_counts/STAR/.DS_Store
Ignored: data/americana/deg_counts/STAR/americana/.DS_Store
Ignored: data/americana/deg_counts/STAR/americana/featurecounts/.DS_Store
Ignored: data/americana/deg_counts/STAR/gregaria/.DS_Store
Ignored: data/americana/deg_counts/STAR/gregaria/featurecounts/.DS_Store
Ignored: data/americana/deg_counts/STAR_newparams/.DS_Store
Ignored: data/americana/deg_counts/STAR_newparams/americana/.DS_Store
Ignored: data/americana/deg_counts/STAR_newparams/americana/featurecounts/.DS_Store
Ignored: data/americana/deg_counts/STAR_newparams/gregaria/.DS_Store
Ignored: data/americana/deg_counts/STAR_newparams/gregaria/featurecounts/.DS_Store
Ignored: data/cancellata/.DS_Store
Ignored: data/cancellata/deg_counts/.DS_Store
Ignored: data/cancellata/deg_counts/STAR/.DS_Store
Ignored: data/cancellata/deg_counts/STAR_newparams/.DS_Store
Ignored: data/cubense/.DS_Store
Ignored: data/cubense/deg_counts/.DS_Store
Ignored: data/cubense/deg_counts/STAR/.DS_Store
Ignored: data/cubense/deg_counts/STAR/cubense/.DS_Store
Ignored: data/cubense/deg_counts/STAR/cubense/featurecounts/
Ignored: data/cubense/deg_counts/STAR/gregaria/
Ignored: data/cubense/deg_counts/STAR_newparams/.DS_Store
Ignored: data/cubense/deg_counts/STAR_newparams/cubense/.DS_Store
Ignored: data/cubense/deg_counts/STAR_newparams/gregaria/.DS_Store
Ignored: data/gregaria/.DS_Store
Ignored: data/gregaria/deg_counts/.DS_Store
Ignored: data/gregaria/deg_counts/STAR/.DS_Store
Ignored: data/gregaria/deg_counts/STAR_newparams/.DS_Store
Ignored: data/gregaria/deg_counts/STAR_newparams/gregaria/.DS_Store
Ignored: data/gregaria/deg_counts/STAR_newparams/gregaria/featurecounts/.DS_Store
Ignored: data/list/.DS_Store
Ignored: data/metadata/.DS_Store
Ignored: data/nitens/.DS_Store
Ignored: data/nitens/deg_counts/.DS_Store
Ignored: data/nitens/deg_counts/STAR_newparams/.DS_Store
Ignored: data/nitens/deg_counts/STAR_newparams/gregaria/.DS_Store
Ignored: data/nitens/deg_counts/STAR_newparams/gregaria/featurecounts/.DS_Store
Ignored: data/nitens/deg_counts/STAR_newparams/nitens/.DS_Store
Ignored: data/nitens/deg_counts/STAR_newparams/nitens/featurecounts/.DS_Store
Ignored: data/piceifrons/.DS_Store
Ignored: data/piceifrons/deg_counts/.DS_Store
Ignored: data/piceifrons/deg_counts/STAR/.DS_Store
Ignored: data/piceifrons/deg_counts/STAR/gregaria/.DS_Store
Ignored: data/piceifrons/deg_counts/STAR/gregaria/featurecounts/.DS_Store
Ignored: data/piceifrons/deg_counts/STAR/piceifrons/.DS_Store
Ignored: data/piceifrons/deg_counts/STAR/piceifrons/featurecounts/.DS_Store
Ignored: data/piceifrons/deg_counts/STAR_newparams/.DS_Store
Ignored: data/piceifrons/deg_counts/STAR_newparams/gregaria/.DS_Store
Ignored: data/piceifrons/deg_counts/STAR_newparams/piceifrons/.DS_Store
Ignored: figures/
Ignored: tables/
Untracked files:
Untracked: data/DEseq2_SGRE_HEAD/
Note that any generated files, e.g. HTML, png, CSS, etc., are not included in this status report because it is ok for generated content to have uncommitted changes.
These are the previous versions of the repository in which changes were
made to the R Markdown (analysis/3_deg-analysis.Rmd) and
HTML (docs/3_deg-analysis.html) files. If you’ve configured
a remote Git repository (see ?wflow_git_remote), click on
the hyperlinks in the table below to view the files as they were in that
past version.
| File | Version | Author | Date | Message |
|---|---|---|---|---|
| Rmd | 0ac8380 | Maeva A. TECHER | 2024-05-15 | wflow_publish("analysis/3_deg-analysis.Rmd") |
| html | da47a64 | Maeva A. TECHER | 2024-05-14 | Build site. |
| html | a39899b | Maeva A. TECHER | 2024-05-14 | Build site. |
| Rmd | b02e6e9 | Maeva A. TECHER | 2024-05-14 | wflow_publish("analysis/3_deg-analysis.Rmd") |
| html | e39d280 | Maeva A. TECHER | 2024-01-29 | Build site. |
| html | f701a01 | Maeva A. TECHER | 2024-01-29 | reupdate |
| html | a3d138b | Maeva A. TECHER | 2024-01-29 | Build site. |
| Rmd | 915285e | Maeva A. TECHER | 2024-01-29 | wflow_publish("analysis/3_deg-analysis.Rmd") |
| html | 959a4e9 | Maeva A. TECHER | 2024-01-24 | Build site. |
| html | 1b09cbe | Maeva A. TECHER | 2024-01-24 | remove |
| html | 478e72d | Maeva A. TECHER | 2023-12-18 | Build site. |
| Rmd | 53877fa | Maeva A. TECHER | 2023-12-18 | add pages |
Here we present the workflow example with the head data from S. piceifrons mapped to its own genome. We applied the same process for each species ( S. americana, S. serialis cubense, S. nitens, S. gregaria, S. cancellata and S. piceifrons), tissue (head or thorax) and each mapping strategy (mapped to sharedgregaria RefSeq or own RefSeq).
For the analysis of differentially expressed genes, we will follow
some guidelines from an online RNA course tutorial
that uses either DESeq2 or edgeR on STAR
output. We also adapted some script lines from Foquet et
al. 2021 code. Our aim will be
to combine both results and take only the overlapping DEGs detected by
these two software.
Both DESeq2 and edgeR identify
differentially expressed genes by starting with the hypothesis that no
genes are differentially expressed. They work similarly but differs
slightly in the normalization process. Where DEseq2 use a
“geometric” normalization, edgeR uses a weighted mean of
log ratios-based method. If you want to understand the process and
algorithms for each of these R package, I recommend to check their
bioconductor page.
Warning: For sample size over 8 per condition, please be aware that both methods have been found to overestimate DEGs with false positive in human population samples. You can read more about it with Li et al. 2022
We generated this in the precedent section.
We need to format the .gtf annotation file for input in
DESeq2.
Below is the example with S. piceifrons
# Download annotation and place it into the folder refgenomes
wget https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/021/461/385/GCF_021461385.2_iqSchPice1.1/GCF_021461385.2_iqSchPice1.1_genomic.gtf.gz
# first column is the transcript ID, second column is the gene ID, third column is the gene symbol
zcat GCF_021461385.2_iqSchPice1.1_genomic.gtf.gz | awk -F "\t" 'BEGIN{OFS="\t"}{if($3=="transcript"){split($9, a, "\""); print a[4],a[2],a[8]}}' > tx2gene.piceifrons.csv
DESeq2 needs a sample sheet that describes the samples characteristics: SampleName, FileName (…counts.txt), and subsequently anything that can be used for statistical design such as RearingCondition, replicates, tissue, time points, etc. in the form.
The design indicates how to model the samples: in the model we need to specify what we want to measure and what we want to control.
Here we will be using the html report function of DESeq2 to export all the results, however, the option to do it step by step will be also provided below.
We start by loading all the required R packages.
#(install first from CRAN or Bioconductor)
library("knitr")
library("rmdformats")
library("tidyverse")
library("data.table")
library("DT") # for making interactive search table
library("plotly") # for interactive plots
library("ggthemes") # for theme_calc
library("reshape2")
library("DESeq2")
library("apeglm")
library("ggpubr")
library("ggplot2")
library("ggrepel")
library("EnhancedVolcano")
library("SARTools")
library("pheatmap")
We need to specify the path of our working directories, where are located our gene counts files and where the .html report will be created:
## PARAMETERS for running the script
homeDir <- "/Users/alphamanae/Documents/GitHub/locust-comparative-genomics"
workDir <- "/Users/alphamanae/Documents/GitHub/locust-comparative-genomics/data" # Working directory
# path to the directory containing raw counts files
rawDir <- "/Users/alphamanae/Documents/GitHub/locust-comparative-genomics/data/gregaria/deg_counts/STAR_newparams/gregaria/featurecounts/finalcounts/"
## Details to create the html report
projectName <- "SGRE_HEAD" # name of the project
author <- "Maeva TECHER" # author of the statistical analysis/report
## Create all the needed directories for the html report
setwd(workDir)
Dirname <- paste("DEseq2_", projectName, sep = "")
dir.create(Dirname)
setwd(Dirname)
workDir_DEseq2 <- getwd()
We then define the parameters that we will be running with
DESeq2:
## PARAMETERS for running DEseq2
tresh_logfold <- 1 # Treshold for log2(foldchange) in final DE-files
tresh_padj <- 0.05 # Treshold for adjusted p-valued in final DE-files
alpha_DEseq2 <- 0.05 # threshold of statistical significance
pAdjustMethod_DEseq2 <- "BH" # p-value adjustment method: "BH" (default) or "BY"
featuresToRemove <- c(NULL) # names of the features to be removed, NULL if none or if using Idxstats
varInt <- "RearingCondition" # factor of interest
condRef <- "Crowded" # reference biological condition
batch <- NULL # blocking factor: NULL (default) or "batch" for example
fitType <- "parametric" # mean-variance relationship: "parametric" (default) or "local"
cooksCutoff <- TRUE # TRUE/FALSE to perform the outliers detection (default is TRUE)
independentFiltering <- TRUE # TRUE/FALSE to perform independent filtering (default is TRUE)
typeTrans <- "rlog" # transformation for PCA/clustering: "VST" or "rlog"
locfunc <- "median"
# Path and name of targetfile containing conditions and file names
targetFile <- "/Users/alphamanae/Documents/GitHub/locust-comparative-genomics/data/list/HeadSGREG.txt"
colors <- c("#B31B21", "#1465AC", "green3", "orange")
Here our targetFile is our metadata which should have a
column for each: sample ID, the file name (exactly the same as in your
folder, be careful with ” ” or “_“) and the conditions. Below we show an
example with the files for piceifrons head:

If everything has been loaded properly, the next chunk of code should not show any warning and we can simpy proceed to the actual DEG analysis.
# checking parameters
setwd(workDir_DEseq2)
checkParameters.DESeq2(projectName=projectName,author=author,targetFile=targetFile,
rawDir=rawDir,featuresToRemove=featuresToRemove,varInt=varInt,
condRef=condRef,batch=batch,fitType=fitType,cooksCutoff=cooksCutoff,
independentFiltering=independentFiltering,alpha=alpha_DEseq2,pAdjustMethod=pAdjustMethod_DEseq2,
typeTrans=typeTrans,locfunc=locfunc,colors=colors)
Here if you just want a quick report without being able to tailor step by step, you can use the code below and obtain several graphs and details for your DEG analysis.
## import the sample sheet that indicates Rearing Conditions and Tissue origins
setwd(workDir_DEseq2)
# loading target file
target <- loadTargetFile(targetFile=targetFile, varInt=varInt, condRef=condRef, batch=batch)
# loading counts
counts <- loadCountData(target=target, rawDir=rawDir, featuresToRemove=featuresToRemove)
# description plots
majSequences <- descriptionPlots(counts=counts, group=target[,varInt], col=colors)
# analysis with DESeq2
out.DESeq2 <- run.DESeq2(counts=counts, target=target, varInt=varInt, batch=batch,
locfunc=locfunc, fitType=fitType, pAdjustMethod=pAdjustMethod_DEseq2,
cooksCutoff=cooksCutoff, independentFiltering=independentFiltering, alpha=alpha_DEseq2)
# PCA + Clustering
exploreCounts(object=out.DESeq2$dds, group=target[,varInt], typeTrans=typeTrans, col=colors)
# summary of the analysis (boxplots, dispersions, diag size factors, export table, nDiffTotal, histograms, MA plot)
summaryResults <- summarizeResults.DESeq2(out.DESeq2, group=target[,varInt], col=colors,
independentFiltering=independentFiltering,
cooksCutoff=cooksCutoff, alpha=alpha_DEseq2)
# generating HTML report
writeReport.DESeq2(target=target, counts=counts, out.DESeq2=out.DESeq2, summaryResults=summaryResults,
majSequences=majSequences, workDir=workDir_DEseq2, projectName=projectName, author=author,
targetFile=targetFile, rawDir=rawDir, featuresToRemove=featuresToRemove, varInt=varInt,
condRef=condRef, batch=batch, fitType=fitType, cooksCutoff=cooksCutoff,
independentFiltering=independentFiltering, alpha=alpha_DEseq2, pAdjustMethod=pAdjustMethod_DEseq2,
typeTrans=typeTrans, locfunc=locfunc, colors=colors)
Follow the same process section 1.2
Here we only need to specify the path of our raw counts and metadata files.
## PARAMETERS for running the script
homeDir <- "/Users/alphamanae/Documents/GitHub/locust-comparative-genomics"
workDir <- "/Users/alphamanae/Documents/GitHub/locust-comparative-genomics/data" # Working directory
# path to the directory containing raw counts files
rawDir <- "/Users/alphamanae/Documents/GitHub/locust-comparative-genomics/data/gregaria/deg_counts/STAR_newparams/gregaria/featurecounts/finalcounts/"
We then define the parameters that we will be running with
DESeq2:
## PARAMETERS for running DEseq2
tresh_logfold <- 1 # Treshold for log2(foldchange) in final DE-files
tresh_padj <- 0.05 # Treshold for adjusted p-valued in final DE-files
alpha_DEseq2 <- 0.05 # threshold of statistical significance
pAdjustMethod_DEseq2 <- "BH" # p-value adjustment method: "BH" (default) or "BY"
featuresToRemove <- c(NULL) # names of the features to be removed, NULL if none or if using Idxstats
varInt <- "RearingCondition" # factor of interest
condRef <- "Crowded" # reference biological condition
batch <- NULL # blocking factor: NULL (default) or "batch" for example
fitType <- "parametric" # mean-variance relationship: "parametric" (default) or "local"
cooksCutoff <- TRUE # TRUE/FALSE to perform the outliers detection (default is TRUE)
independentFiltering <- TRUE # TRUE/FALSE to perform independent filtering (default is TRUE)
typeTrans <- "rlog" # transformation for PCA/clustering: "VST" or "rlog"
locfunc <- "median"
We then import the metadata table and format it as such as there is no error downstream.
# Get in the home directory and load the file of interest
setwd(homeDir)
sampletable <- fread("data/list/HeadSGREG.txt")
## add the sample names as row names (it is needed for some of the DESeq functions)
rownames(sampletable) <- sampletable$SampleName
## Make sure discriminant variables are factor
sampletable$RearingCondition <- as.factor(sampletable$RearingCondition)
sampletable$Tissue <- as.factor(sampletable$Tissue)
#sampletable$Species <- as.factor(sampletable$Species)
dim(sampletable)
Then we obtain the output from STAR GeneCount and import
here individually using the sampletable as a reference to fetch them. We
also filter the lowly expressed genes to avoid noisy data.
## Import count files
satoshi <- DESeqDataSetFromHTSeqCount(sampleTable = sampletable,
directory = "data/gregaria/deg_counts/STAR_newparams/gregaria/featurecounts/finalcounts",
design = ~ RearingCondition )
satoshi
# keep genes for which sums of raw counts across experimental samples is > 5
satoshi <- satoshi[rowSums(counts(satoshi)) > 1, ]
nrow(satoshi)
Run DESeq2 analysis using DESeq, which performs (1)
estimation of size factors, (2) estimation of dispersion, then (3)
Negative Binomial GLM fitting and Wald statistics. The results tables
(log2 fold changes and p-values) can be generated using the results
function (copied
from online chapter).
One of the thing we want to do with the step-by-step method is to remove some of special genomic features such as long -non-coding RNA genes or ribosomal RNA. For that we created a list beforehand by parsing the .gff file and selecting field of interest for each species.
IF WE WANT TO EXCLUDE SOME LOCI USE THIS CODE:
# Read the list of loci from a text file to later remove it
loci_to_exclude <- readLines("/Users/alphamanae/Documents/GitHub/locust-comparative-genomics/data/list/gregaria_lncrna_list.txt")
# Subset the rows of the DESeqDataSet to exclude the loci in the 'loci_to_exclude' list
satoshi_filtered <- satoshi[!(rownames(satoshi) %in% loci_to_exclude), ]
shigeru <- DESeq(satoshi_filtered)
cbind(resultsNames(shigeru))
IF WE DO NOT WANT TO EXCLUDE ANY LOCI USE THIS CODE:
# Fit the statistical model
shigeru <- DESeq(satoshi)
cbind(resultsNames(shigeru))
Then we will observe the final result by keeping DEGs with an adjusted p-value cutoff > 0.05.
# Here we plot the adjusted p-value means corrected for multiple testing (FDR padj)
res_shigeru <- results(shigeru)
# We only keep genes with an adjusted p-value cutoff > 0.05 by changing the default significance cut-off
sum(res_shigeru$padj < tresh_padj, na.rm = TRUE)
brock <- results(shigeru, name = "RearingCondition_Isolated_vs_Crowded", alpha = alpha_DEseq2)
summary(brock)
# Details of what is each column meaning in our final result
mcols(brock)$description
head(brock)
This cam be done only with the step-by-step method. The html report can output similar figures but with less possibility for tailoring.
We transformed the data for visualization by comparing both recommended rlog (Regularized log) or vst (Variance Stabilizing Transformation) transformations. Both options produce log2 scale data which has been normalized by the DESeq2 method with respect to library size.
# Try with the vst transformation
shigeru_vst <- vst(shigeru)
shigeru_rlog <- rlog(shigeru)
Here we will plot the data using the rlog approach:
# Create the pca on the defined groups
pcaData <- plotPCA(object = shigeru_rlog, intgroup = c("RearingCondition2"),returnData=TRUE)
# Store the information for each axis variance in %
percentVar <- round(100 * attr(pcaData, "percentVar"))
# Make sure that the discriminant variable are in factor for using shape and color later on
#pcaData$RearingCondition<-factor(pcaData$RearingCondition,levels=c("Crowded","Isolated"), labels=c("Crowded for 72h mother","Isolated mother"))
pcaData$RearingCondition2<-factor(pcaData$RearingCondition2,levels=c("GregariousCrowded","GregariousIsolated", "SolitariousCrowded", "SolitariousIsolated"), labels=c("Gregarious control","Gregarious isolated for 72h", "Solitarious crowded for 72h", "Solitarious control"))
levels(pcaData$RearingCondition2)
#pcaData$Tissue<-factor(pcaData1$Tissue,levels=c("Whole_body","Optical_lobes"), labels=c("Hatchling", "OLB"))
#levels(pcaData$Tissue)
library(ggdist)
p <- ggplot(pcaData, aes(PC1, PC2, color= RearingCondition2)) +
geom_point(size=12) +
xlab(paste0("PC1: ", percentVar[1], "% variance")) +
ylab(paste0("PC2: ", percentVar[2], "% variance")) +
scale_color_manual(values = c("orange", "red2", "blue", "forestgreen")) +
#coord_fixed() +
theme_bw() +
theme(legend.title = element_blank()) +
theme(legend.text = element_text(face="bold", size=24)) +
theme(axis.text = element_text(size=20)) +
theme(axis.title = element_text(size=20))
p + geom_convexhull(aes(fill = RearingCondition2, color = RearingCondition2), alpha = 0.5) +
scale_fill_manual(values = c("orange", "red2", "blue", "forestgreen"))
#stat_ellipse(aes(group = RearingCondition, fill = RearingCondition), level = 0.95, geom = "polygon", alpha = 0.2)
#ggtitle("PCA on STAR read counts from accessory glands (using 13472 genes)",
# subtitle = "rlog transformation in DEseq2") +
#theme(plot.title = element_text(size = 18, face = "bold", margin = margin(t = 10)))+
#theme(plot.subtitle = element_text(size = 16, face = "italic"))
#ggplot(pcaData, aes(PC1, PC2, color= Tissue, shape = Species)) +
# geom_point(size=4)+xlab(paste0("PC1: ",percentVar[1],"% variance")) +
# ylab(paste0("PC2: ",percentVar[2],"% variance")) +
# scale_color_manual(values = c("blue3","red2","magenta2", "black", "red4", "green4", "deepskyblue", "darkorchid1", "darkorange3", "darkorange1" , "gray40", "gray", "darkorange2"))+
# geom_text_repel(aes(label = name), nudge_x = -1, nudge_y = 0.2, size = 3) +
# coord_fixed()+
# theme_bw()+
# theme(legend.title = element_blank())+
# theme(legend.text = element_text(face="bold", size=15))+
# theme(axis.text = element_text(size=15))+
# theme(axis.title = element_text(size=16))+
# ggtitle("PCA on S. piceifrons thorax tissues", subtitle = "rlog transformation") +
# xlab(paste0("PC1: ",percentVar[1],"% variance")) +
# ylab(paste0("PC2: ",percentVar[2],"% variance"))
Plot the PCA vsd
# Create the pca on the defined groups
pcaData <- plotPCA(object = shigeru_vst, intgroup = c("RearingCondition"),returnData=TRUE)
# Store the information for each axis variance in %
percentVar <- round(100 * attr(pcaData, "percentVar"))
# Make sure that the discriminant variable are in factor for using shape and color later on
pcaData$RearingCondition<-factor(pcaData$RearingCondition,levels=c("Crowded","Isolated"), labels=c("Crowded","Isolated"))
levels(pcaData$RearingCondition)
#pcaData$Tissue<-factor(pcaData$Tissue,levels=c("Antenna_lobes","Antennae", "Fat_Body", "Head", "Maxillary", "Metathoracic_ganglia", "Mushroom_body", "Optical_lobes", "Pleuropodium", "T3_leg" , "Thorax", "Whole_body", "Whole_egg"), labels=c("Antenna lobes","Antennae","Fat Body", "Whole Head", "Maxillary", "Metathoracic ganglia", "Mushroom body", "Optical lobes", "Pleuropodium", "T3 leg" , "Thorax", "Whole body", "Egg"))
#levels(pcaData$Tissue)
ggplot(pcaData, aes(PC1, PC2, color= RearingCondition)) +
geom_point(size=4)+xlab(paste0("PC1: ",percentVar[1],"% variance")) +
ylab(paste0("PC2: ",percentVar[2],"% variance")) +
scale_color_manual(values = c("blue", "red")) +
geom_text_repel(aes(label = name), nudge_x = -1, nudge_y = 0.2, size = 4) +
coord_fixed()+
theme_bw()+
theme(legend.title = element_blank())+
theme(legend.text = element_text(face="bold", size=15))+
theme(axis.text = element_text(size=15))+
theme(axis.title = element_text(size=16))+
ggtitle("PCA on Schisto tissues", subtitle = "vst transformation") +
xlab(paste0("PC1: ",percentVar[1],"% variance")) +
ylab(paste0("PC2: ",percentVar[2],"% variance"))
#ggplot(pcaData, aes(PC1, PC2, shape= Species, fill= Tissue )) +
# geom_point(size=5)+xlab(paste0("PC1: ",percentVar[1],"% variance")) +
# ylab(paste0("PC2: ",percentVar[2],"% variance")) +
# scale_fill_manual(values = c("blue3","red2","magenta2", "black", "red4", "green4", "deepskyblue", "cyan", "darkorange3", "darkorange1" , "gray40", "gray", "darkorange2"))+
# scale_shape_manual(values=c(21, 23, 24, 25, 21))+
# geom_text_repel(aes(label = Tissue), nudge_x = -1, nudge_y = 0.2, size = 3) +
# coord_fixed()+
# theme_bw()+
# theme(legend.title = element_blank())+
# theme(panel.border = element_blank())+
# theme(legend.text = element_text(face="bold", size=15))+
# theme(axis.text = element_text(size=15))+
# theme(axis.title = element_text(size=16))+
# ggtitle("PCA on mixed tissues", subtitle = "vst transformation") +
# xlab(paste0("PC1: ",percentVar[1],"% variance")) +
# ylab(paste0("PC2: ",percentVar[2],"% variance"))
Using also the transformed data, we check the distance between samples and see how they correlate to each others.
Heatmap using rlog
# calculate between-sample distance matrix
sampleDistMatrix.rlog <- as.matrix(dist(t(assay(shigeru_rlog))))
metadata <- sampletable[,c("RearingCondition2", "Tissue")]
rownames(metadata) <- sampletable$SampleName
pheatmap(sampleDistMatrix.rlog, annotation_col=metadata, main = "Ovaries tissue heatmap, rlog transformation")
Heatmap using vst
# calculate between-sample distance matrix
sampleDistMatrix.vst <- as.matrix(dist(t(assay(shigeru_vst))))
pheatmap(sampleDistMatrix.vst, annotation_col=metadata, main = "Ovaries tissue heatmap, vst transformation")
This plot allows us to show the log2 fold changes over the mean of normalized counts for all the samples. Points will be colored in red if the adjusted p-value is less than 0.05 and the log2 fold change is bigger than 1. In blue, will be the reverse for the log2 fold change.
To generate more accurate log2 foldchange estimates, DESeq2 allows (and recommends) the shrinkage of the LFC estimates toward zero when the information for a gene is low, which could include:
-Low counts
-High dispersion values
# include the log2FoldChange shrinkage use to visualize gene ranking
de_shrink <- lfcShrink(dds = shigeru, coef="RearingCondition_Isolated_vs_Crowded", type="apeglm")
head(de_shrink)
# Ma plot
maplot <-ggmaplot(de_shrink, fdr = 0.05, fc = 1, size = 2, palette = c("#B31B21", "#1465AC", "darkgray"), genenames = as.vector(rownames(de_shrink$name)), top = 0,legend="top",label.select = NULL) +
coord_cartesian(xlim = c(0, 20)) +
scale_y_continuous(limits=c(-12, 12)) +
theme(axis.text.x = element_text(size=16),axis.text.y = element_text(size=15),axis.title.x = element_text(size=17),axis.title.y = element_text(size=17),axis.line = element_line(size = 1, colour="gray20"),axis.ticks = element_line(size = 1, colour="gray20"))+
guides(color = guide_legend(override.aes = list(size = c(3,3,3))))+
theme(legend.position = c(0.70, 0.12),legend.text=element_text(size=14,face="bold"),legend.background = element_rect(fill="transparent"))+
theme(plot.title = element_text(size=18, colour="gray30", face="bold",hjust=0.06, vjust=-5))+
labs(title="MA-plot for the shrunken log2 fold changes in the AGY")
maplot
The EnhancedVolcano helps visualise the resulst of
differential expression analysis.
keyvals <-ifelse(
res_shigeru$log2FoldChange >= 1 & res_shigeru$padj <= 0.05, '#B31B21',
ifelse(res_shigeru$log2FoldChange <= -1 & res_shigeru$padj <= 0.05, '#1465AC', 'darkgray'))
keyvals[is.na(keyvals)] <-'darkgray'
names(keyvals)[keyvals == "#B31B21"] <-'Upregulated'
names(keyvals)[keyvals == "#1465AC"] <-'Downregulated'
names(keyvals)[keyvals == 'darkgray'] <-'NS'
EnhancedVolcano(res_shigeru,
lab = rownames(res_shigeru),
x = 'log2FoldChange',
y = 'padj',
pCutoff = 0.05,
FCcutoff = 1,
pointSize = 3,
labSize = 5,
colAlpha = 4/5,
colCustom = keyvals,
drawConnectors = TRUE)
We then normalize the result by extracting only significant genes with a fold change of 1.
resorted_deresults <- res_shigeru[order(res_shigeru$padj),]
## Select only the genes that have a padj > 0.05 and with minimum log2FoldChange of 1
sig <- resorted_deresults[!is.na(resorted_deresults$padj) &
resorted_deresults$padj < tresh_padj &
abs(resorted_deresults$log2FoldChange) >=3,]
selected <- rownames(sig);selected
## Norm transform the data from DEseq2 run
ntd <- normTransform(satoshi)
## Plot the relation among samples considering only the significant genes
pheatmap(assay(ntd)[selected,],
annotation_col=metadata,
cluster_rows = TRUE,
show_rownames = TRUE,
cluster_cols = TRUE,
cutree_rows = 4,
cutree_cols = 3,
labels_col= colData(satoshi)$SampleName)
sessionInfo()
R version 4.3.1 (2023-06-16)
Platform: x86_64-apple-darwin20 (64-bit)
Running under: macOS Sonoma 14.4.1
Matrix products: default
BLAS: /Library/Frameworks/R.framework/Versions/4.3-x86_64/Resources/lib/libRblas.0.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/4.3-x86_64/Resources/lib/libRlapack.dylib; LAPACK version 3.11.0
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
time zone: America/Chicago
tzcode source: internal
attached base packages:
[1] stats4 stats graphics grDevices utils datasets methods
[8] base
other attached packages:
[1] pheatmap_1.0.12 SARTools_1.8.1
[3] kableExtra_1.4.0 edgeR_3.42.4
[5] limma_3.56.2 ashr_2.2-63
[7] EnhancedVolcano_1.18.0 ggrepel_0.9.5
[9] ggpubr_0.6.0 apeglm_1.22.1
[11] DESeq2_1.40.2 SummarizedExperiment_1.30.2
[13] Biobase_2.60.0 MatrixGenerics_1.12.3
[15] matrixStats_1.3.0 GenomicRanges_1.52.1
[17] GenomeInfoDb_1.36.4 IRanges_2.34.1
[19] S4Vectors_0.38.2 BiocGenerics_0.46.0
[21] reshape2_1.4.4 ggthemes_5.1.0
[23] plotly_4.10.4 DT_0.33
[25] data.table_1.15.4 lubridate_1.9.3
[27] forcats_1.0.0 stringr_1.5.1
[29] dplyr_1.1.4 purrr_1.0.2
[31] readr_2.1.5 tidyr_1.3.1
[33] tibble_3.2.1 ggplot2_3.5.1
[35] tidyverse_2.0.0 rmdformats_1.0.4
[37] knitr_1.45 workflowr_1.7.1
loaded via a namespace (and not attached):
[1] RColorBrewer_1.1-3 ggdendro_0.2.0 rstudioapi_0.16.0
[4] jsonlite_1.8.8 magrittr_2.0.3 rmarkdown_2.26
[7] fs_1.6.4 zlibbioc_1.46.0 vctrs_0.6.5
[10] memoise_2.0.1 RCurl_1.98-1.14 SQUAREM_2021.1
[13] rstatix_0.7.2 mixsqp_0.3-54 htmltools_0.5.8.1
[16] S4Arrays_1.0.6 truncnorm_1.0-9 broom_1.0.5
[19] sass_0.4.9 bslib_0.7.0 htmlwidgets_1.6.4
[22] plyr_1.8.9 cachem_1.0.8 whisker_0.4.1
[25] lifecycle_1.0.4 pkgconfig_2.0.3 Matrix_1.6-5
[28] R6_2.5.1 fastmap_1.1.1 GenomeInfoDbData_1.2.10
[31] digest_0.6.35 numDeriv_2016.8-1.1 GGally_2.2.1
[34] colorspace_2.1-0 AnnotationDbi_1.62.2 ps_1.7.6
[37] rprojroot_2.0.4 irlba_2.3.5.1 RSQLite_2.3.6
[40] invgamma_1.1 fansi_1.0.6 timechange_0.3.0
[43] httr_1.4.7 abind_1.4-5 compiler_4.3.1
[46] bit64_4.0.5 withr_3.0.0 backports_1.4.1
[49] BiocParallel_1.34.2 DBI_1.2.2 carData_3.0-5
[52] ggstats_0.6.0 highr_0.10 ggsignif_0.6.4
[55] MASS_7.3-60.0.1 DelayedArray_0.26.7 tools_4.3.1
[58] httpuv_1.6.15 glue_1.7.0 callr_3.7.6
[61] promises_1.3.0 grid_4.3.1 getPass_0.2-4
[64] generics_0.1.3 gtable_0.3.5 tzdb_0.4.0
[67] hms_1.1.3 xml2_1.3.6 car_3.1-2
[70] utf8_1.2.4 XVector_0.40.0 pillar_1.9.0
[73] emdbook_1.3.13 genefilter_1.82.1 later_1.3.2
[76] splines_4.3.1 lattice_0.22-6 survival_3.6-4
[79] bit_4.0.5 annotate_1.78.0 tidyselect_1.2.1
[82] locfit_1.5-9.9 Biostrings_2.68.1 git2r_0.33.0
[85] gridExtra_2.3 bookdown_0.39 svglite_2.1.3
[88] xfun_0.43 stringi_1.8.4 lazyeval_0.2.2
[91] yaml_2.3.8 evaluate_0.23 codetools_0.2-20
[94] bbmle_1.0.25.1 cli_3.6.2 xtable_1.8-4
[97] systemfonts_1.0.6 munsell_0.5.1 processx_3.8.4
[100] jquerylib_0.1.4 Rcpp_1.0.12 png_0.1-8
[103] coda_0.19-4.1 XML_3.99-0.16.1 bdsmatrix_1.3-7
[106] parallel_4.3.1 blob_1.2.4 bitops_1.0-7
[109] viridisLite_0.4.2 mvtnorm_1.2-4 scales_1.3.0
[112] crayon_1.5.2 rlang_1.1.3 KEGGREST_1.40.1