This reproducible R Markdown analysis was created with workflowr (version 1.6.2). The Checks tab describes the reproducibility checks that were applied when the results were created. The Past versions tab lists the development history.
Great job! The global environment was empty. Objects defined in the global environment can affect the analysis in your R Markdown file in unknown ways. For reproduciblity it’s best to always run the code in an empty environment.
The command set.seed(20210202) was run prior to running the code in the R Markdown file. Setting a seed ensures that any results that rely on randomness, e.g. subsampling or permutations, are reproducible.
Great! You are using Git for version control. Tracking code development and connecting the code version to the results is critical for reproducibility.
The results in this page were generated with repository version 9caed99. See the Past versions tab to see a history of the changes made to the R Markdown and HTML files.
Note that you need to be careful to ensure that all relevant files for the analysis have been committed to Git prior to generating the results (you can use wflow_publish or wflow_git_commit). workflowr only checks the R Markdown file, but you know if there are other scripts or data files that it depends on. Below is the status of the Git repository when the results were generated:
Note that any generated files, e.g. HTML, png, CSS, etc., are not included in this status report because it is ok for generated content to have uncommitted changes.
These are the previous versions of the repository in which changes were made to the R Markdown (analysis/rna_seq_collier_2017.Rmd) and HTML (docs/rna_seq_collier_2017.html) files. If you’ve configured a remote Git repository (see ?wflow_git_remote), click on the hyperlinks in the table below to view the files as they were in that past version.
Data: Used public data from (Collier et al. 2017): H9 embryonic stem cells in Naïve and Primed state, three biological replicates.
Sample info
dataset_id
gsm_id
description
filename
run_id
input
layout
experiment
26
Collier_2017
GSM2449055
hESC_H9_naive_1
hESC_H9_naive_1
SRR5151102
-
single
RNA-seq
27
Collier_2017
GSM2449056
hESC_H9_naive_2
hESC_H9_naive_2
SRR5151103
-
single
RNA-seq
28
Collier_2017
GSM2449057
hESC_H9_naive_3
hESC_H9_naive_3
SRR5151104
-
single
RNA-seq
29
Collier_2017
GSM2449058
hESC_H9_primed_1
hESC_H9_primed_1
SRR5151105
-
single
RNA-seq
30
Collier_2017
GSM2449059
hESC_H9_primed_2
hESC_H9_primed_2
SRR5151106
-
single
RNA-seq
31
Collier_2017
GSM2449060
hESC_H9_primed_3
hESC_H9_primed_3
SRR5151107
-
single
RNA-seq
Primary analysis
RNA-seq data was analysed by a standard RNA-seq pipeline v2.0, available on nf-core (Ewels et al. 2019) with mostly default settings. For more information about the specific steps involved in this analysis, refer to the repository documentation: https://nf-co.re/rnaseq/2.0.
Reference genome
Due to some difficulties in using hg38 with iGenomes, and since the rest of the analysis has been done on such reference (more specifically, GRCh38.p12). An annotation was downloaded from Ensembl and used as matching custom --fasta and --gtf parameters:
RNA-seq analysis was run on Uppmax HPC environment using standard parameters: nextflow run nf-core/rnaseq -profile uppmax -revision 2.0 --public_data_ids ids_file.txt. ids_file.txt contained only the GSM ids as plain text.
This run would produce the corresponding FASTQ files and then a further run with the generated samplesheet_grouped.tsv file, the pipeline was run.
Note on pipeline version
There is a newer RNA-seq version: https://github.com/nf-core/rnaseq/releases/tag/3.0. This affects the quantification, so it may be the case that results are better using Salmon instead of featureCounts. However at this point there is a problem with Singularity on Uppmax and this version cannot be easily run. If necessary, I will re-do this analysis in the future, once these problems are solved.
Downstream analysis
Helper functions
color_list <- c(Naive=global_palette$teal,
Primed=global_palette$coral)
#' Select by quantile
#'
#' @param counts Counts table.
#' @param length_cutoff Gene length cutoff.
#' @param value_cutoff Value cutoff
#' @param min_q Minimum quantile to consider
#' @param max_q Maximum quantile to consider
#' @param field Which field to use (default = TPM).
#'
#' @return filtered data.frame
quantile_group <- function(counts, min_q, max_q, value_cutoff = 0, length_cutoff = 10000, field = "TPM") {
counts$length <- (counts$End - counts$Start)
counts <- dplyr::filter(counts, .data[[field]] > value_cutoff & length > length_cutoff)
q <- quantile(counts[, field], probs = c(min_q, max_q))
dplyr::filter(counts, .data[[field]] >= q[1] & .data[[field]] <= q[2])
}
#' Make a venn diagram out ouf a list of selected gene counts
#'
#' @param gene_counts of count tables
#' @param title Plot title
#' @param color Base color
#' @return ggplot object
genes_venn <- function(gene_counts, title, color) {
field_list <- list(r1 = gene_counts[[1]]$Gene.Name,
r2 = gene_counts[[2]]$Gene.Name,
r3 = gene_counts[[3]]$Gene.Name)
clist <- rep(color, 3)
ggvenn(field_list, fill_color = clist, fill_alpha = 0.5, text_size = 5, stroke_color = "#ffffff") +
ggtitle(title)
}
intersect_genes_3 <- function(glist) {
intersect(intersect(glist[[1]]$Gene.Name,
glist[[2]]$Gene.Name),
glist[[3]]$Gene.Name)
}
#' Subset a named granges by name
#'
#' @param names Names list
#' @param granges GRanges to subset.
#'
#' @return GRanges
select_genes <- function(names, granges) {
sort(granges[granges$name %in% names, ])
}
#' Get GRanges with gene symbols for hg38
#'
#' @return A GRanges where name field is Gene symbol
genes_hg38 <- function() {
# This matches gene IDs to symbols
x <- org.Hs.egSYMBOL
# Get the gene names that are mapped to an entrez gene identifier
mapped_genes <- mappedkeys(x)
# Convert to a list
xx <- as.list(x[mapped_genes])
# Apparently there are some pathway crap that puts NAs in the list
xx <- xx[!is.na(xx)]
gene_id_to_name <- unlist(xx)
genes <- genes(TxDb.Hsapiens.UCSC.hg38.knownGene)
genes$name <- gene_id_to_name[genes$gene_id]
genes
}
#' Get GRanges with gene symbols for hg38
#'
#' @return A GRanges where name field is Gene symbol
genes_hg38_ensembl <- function() {
# This matches gene IDs to symbols
edb <- EnsDb.Hsapiens.v86
genes(edb)
}
export_gene_set <- function(replicates, name, path, annot_func, name_field) {
gene_names <- intersect_genes_3(replicates)
genes_all <- annot_func()
genes_all <- genes_all[, name_field]
names(mcols(genes_all)) <- "name"
result <- select_genes(gene_names, genes_all)
out_file <- file.path(path, name)
export(result, out_file)
}
Gene counts
Gene counts output file from featureCounts are used.
As expected, there are many zero values in all the distributions, and long tails since there is no upper limit in RNA-seq data.
Stringtie output
stringtie(Kovaka et al. 2019)https://ccb.jhu.edu/software/stringtie/) is a tool for quantitation of full-length transcripts (taking variants into account) per gene.
This output is also provided as standard output for RNA-seq pipeline v2.0.
Collier, Amanda J, Sarita P Panula, John Paul Schell, Peter Chovanec, Alvaro Plaza Reyes, Sophie Petropoulos, Anne E Corcoran, et al. 2017. “Comprehensive Cell Surface Protein Profiling Identifies Specific Markers of Human Naive and Primed Pluripotent States.”Cell Stem Cell 20 (6): 874–90.
Ewels, Philip A, Alexander Peltzer, Sven Fillinger, Johannes Alneberg, Harshil Patel, Andreas Wilm, Maxime Ulysse Garcia, Paolo Di Tommaso, and Sven Nahnsen. 2019. “Nf-Core: Community Curated Bioinformatics Pipelines.”BioRxiv, 610741.
Kovaka, Sam, Aleksey V Zimin, Geo M Pertea, Roham Razaghi, Steven L Salzberg, and Mihaela Pertea. 2019. “Transcriptome Assembly from Long-Read RNA-Seq Alignments with StringTie2.”Genome Biology 20 (1): 1–13.