GSVA: gene set variation analysis

Last updated: 2023-10-23

Checks: 7 0

Knit directory: muse/

This reproducible R Markdown analysis was created with workflowr (version 1.7.1). The Checks tab describes the reproducibility checks that were applied when the results were created. The Past versions tab lists the development history.

R Markdown file: up-to-date

Great! Since the R Markdown file has been committed to the Git repository, you know the exact version of the code that produced these results.

Environment: empty

Great job! The global environment was empty. Objects defined in the global environment can affect the analysis in your R Markdown file in unknown ways. For reproduciblity it’s best to always run the code in an empty environment.

Seed: set.seed(20200712)

The command set.seed(20200712) was run prior to running the code in the R Markdown file. Setting a seed ensures that any results that rely on randomness, e.g. subsampling or permutations, are reproducible.

Session information: recorded

Great job! Recording the operating system, R version, and package versions is critical for reproducibility.

Cache: none

Nice! There were no cached chunks for this analysis, so you can be confident that you successfully produced the results during this run.

File paths: relative

Great job! Using relative paths to the files within your workflowr project makes it easier to run your code on other machines.

Repository version: 115752b

Great! You are using Git for version control. Tracking code development and connecting the code version to the results is critical for reproducibility.

The results in this page were generated with repository version 115752b. See the Past versions tab to see a history of the changes made to the R Markdown and HTML files.

Note that you need to be careful to ensure that all relevant files for the analysis have been committed to Git prior to generating the results (you can use wflow_publish or wflow_git_commit). workflowr only checks the R Markdown file, but you know if there are other scripts or data files that it depends on. Below is the status of the Git repository when the results were generated:


Ignored files:
    Ignored:    .Rhistory
    Ignored:    .Rproj.user/
    Ignored:    r_packages_4.3.0/
    Ignored:    r_packages_4.3.1/

Untracked files:
    Untracked:  analysis/cell_ranger.Rmd
    Untracked:  analysis/complex_heatmap.Rmd
    Untracked:  analysis/sleuth.Rmd
    Untracked:  analysis/tss_xgboost.Rmd
    Untracked:  code/multiz100way/
    Untracked:  data/HG00702_SH089_CHSTrio.chr1.vcf.gz
    Untracked:  data/HG00702_SH089_CHSTrio.chr1.vcf.gz.tbi
    Untracked:  data/ncrna_NONCODE[v3.0].fasta.tar.gz
    Untracked:  data/ncrna_noncode_v3.fa
    Untracked:  data/netmhciipan.out.gz
    Untracked:  data/test
    Untracked:  export/davetang039sblog.WordPress.2023-06-30.xml
    Untracked:  export/output/
    Untracked:  women.json

Unstaged changes:
    Modified:   analysis/graph.Rmd

Note that any generated files, e.g. HTML, png, CSS, etc., are not included in this status report because it is ok for generated content to have uncommitted changes.

These are the previous versions of the repository in which changes were made to the R Markdown (analysis/gsva.Rmd) and HTML (docs/gsva.html) files. If you’ve configured a remote Git repository (see ?wflow_git_remote), click on the hyperlinks in the table below to view the files as they were in that past version.

File	Version	Author	Date	Message
Rmd	115752b	Dave Tang	2023-10-23	Technical notes
html	14e3e6b	Dave Tang	2023-10-23	Build site.
Rmd	9b9327a	Dave Tang	2023-10-23	No progress bar
html	b0af093	Dave Tang	2023-10-23	Build site.
Rmd	aeb90e1	Dave Tang	2023-10-23	GSVA

Following the vignette.

Gene set variation analysis (GSVA) is a particular type of gene set enrichment method that works on single samples and enables pathway-centric analyses of molecular data by performing a conceptually simple but powerful change in the functional unit of analysis, from genes to gene sets. The GSVA package provides the implementation of four single-sample gene set enrichment methods, concretely zscore, plage, ssGSEA and its own called GSVA. While this methodology was initially developed for gene expression data, it can be applied to other types of molecular profiling data. In this vignette we illustrate how to use the GSVA package with bulk microarray and RNA-seq expression data.

Gene set variation analysis (GSVA) provides an estimate of pathway activity by transforming an input gene-by-sample expression data matrix into a corresponding gene-set-by-sample expression data matrix. This resulting expression data matrix can be then used with classical analytical methods such as differential expression, classification, survival analysis, clustering or correlation analysis in a pathway-centric manner. One can also perform sample-wise comparisons between pathways and other molecular data types such as microRNA expression or binding data, copy-number variation (CNV) data or single nucleotide polymorphisms (SNPs).

Package

Install GSVA. (Dependencies are listed in the Imports section in the DESCRIPTION file.)

if (!require("BiocManager", quietly = TRUE))
  install.packages("BiocManager")

if (!require("GSVA", quietly = TRUE))
  BiocManager::install("GSVA")

Load package.

library(GSVA)

Quick start

Generate example expression matrix.

p <- 10000
n <- 30
set.seed(1984)
X <- matrix(
  rnorm(p*n),
  nrow=p,
  dimnames=list(paste0("g", 1:p), paste0("s", 1:n))
)
X[1:5, 1:5]

           s1         s2         s3          s4           s5
g1  0.4092032  1.4676435  0.3515056  1.53512312 -1.279009469
g2 -0.3230250 -1.8501416 -0.9198650  1.40036448  0.086613315
g3  0.6358523  1.6084120  1.6380322  0.23799146  0.216628121
g4 -1.8461288 -0.2928844  0.4651573 -0.09766558 -0.009887299
g5  0.9536474 -0.4816006  0.1807824  1.03141311  0.206414282

Generate 100 gene sets that are contain from 10 to up to 100 genes sampled from 1:p.

set.seed(1984)
gs <- as.list(sample(10:100, size=100, replace=TRUE))

gs <- lapply(gs, function(n, p){
  paste0("g", sample(1:p, size=n, replace=FALSE))
}, p)
names(gs) <- paste0("gs", 1:length(gs))

sapply(gs, length)

  gs1   gs2   gs3   gs4   gs5   gs6   gs7   gs8   gs9  gs10  gs11  gs12  gs13 
   49    29    67    90    94    87    41    26    86    77    97    90    45 
 gs14  gs15  gs16  gs17  gs18  gs19  gs20  gs21  gs22  gs23  gs24  gs25  gs26 
   47    54    83    11    75    95    99    94    89    93    50    49    87 
 gs27  gs28  gs29  gs30  gs31  gs32  gs33  gs34  gs35  gs36  gs37  gs38  gs39 
   36    61    84    99    58    30    63    29    35    29    69    41    46 
 gs40  gs41  gs42  gs43  gs44  gs45  gs46  gs47  gs48  gs49  gs50  gs51  gs52 
   38    17    48    72    15    81   100    93    37    99    89    43    36 
 gs53  gs54  gs55  gs56  gs57  gs58  gs59  gs60  gs61  gs62  gs63  gs64  gs65 
   84    83    40    72    90    86    37    23    69    96    20    93    36 
 gs66  gs67  gs68  gs69  gs70  gs71  gs72  gs73  gs74  gs75  gs76  gs77  gs78 
   21    46    76    71    57    48    25    73    26    46    29    53    69 
 gs79  gs80  gs81  gs82  gs83  gs84  gs85  gs86  gs87  gs88  gs89  gs90  gs91 
   69    42    76    30    16    49    35    12    83    99    88    66    10 
 gs92  gs93  gs94  gs95  gs96  gs97  gs98  gs99 gs100 
   51    82    73    97    59    59    42    10    64

Calculate GSVA enrichment scores using the gsva() function, which does all the work and requires the following two input arguments:

A normalised gene expression dataset, which can be provided in one of the following containers:
- A matrix of expression values with genes corresponding to rows and samples corresponding to columns.
- An ExpressionSet object; see package Biobase.
- A SummarizedExperiment object, see package SummarizedExperiment.
A collection of gene sets; which can be provided in one of the following containers:
- A list object where each element corresponds to a gene set defined by a vector of gene identifiers, and the element names correspond to the names of the gene sets.
- A GeneSetCollection object; see package GSEABase.

The first argument to the gsva() function is the gene expression data matrix and the second the collection of gene sets. The gsva() function can take the input expression data and gene sets using different specialized containers that facilitate the access and manipulation of molecular and phenotype data, as well as their associated metadata. Another advanced features include the use of on-disk and parallel backends to enable, respectively, using GSVA on large molecular data sets and speed up computing time.

The gsva() function will apply the following filters before the actual calculations take place:

Discard genes in the input expression data matrix with constant expression.
Discard genes in the input gene sets that do not map to a gene in the input gene expression data matrix.
Discard gene sets that, after applying the previous filters, do not meet a minimum and maximum size, which by default is 1 for the minimum size and Inf for the maximum size.

When method="gsva" is used (the default), the following parameters can be tuned:

kcdf: The first step of the GSVA algorithm brings gene expression profiles to a common scale by calculating an expression statistic through a non-parametric estimation of the CDF across samples. Such a non-parametric estimation employs a kernel function and the kcdf parameter allows the user to specify three possible values for that function:

“Gaussian”, the default value, which is suitable for continuous expression data, such as microarray fluorescent units in logarithmic scale and RNA-seq log-CPMs, log-RPKMs or log-TPMs units of expression;
“Poisson”, which is suitable for integer counts, such as those derived from RNA-seq alignments;
“none”, which will enforce a direct estimation of the CDF without a kernel function.

mx.diff: The last step of the GSVA algorithm calculates the gene set enrichment score from two Kolmogorov-Smirnov random walk statistics. This parameter is a logical flag that allows the user to specify two possible ways to do such calculation:

TRUE, the default value, where the enrichment score is calculated as the magnitude difference between the largest positive and negative random walk deviations;
FALSE, where the enrichment score is calculated as the maximum distance of the random walk from zero.

abs.ranking: Logical flag used only when mx.diff=TRUE. By default, abs.ranking=FALSE and it implies that a modified Kuiper statistic is used to calculate enrichment scores, taking the magnitude difference between the largest positive and negative random walk deviations. When abs.ranking=TRUE the original Kuiper statistic is used, by which the largest positive and negative random walk deviations are added together. In this case, gene sets with genes enriched on either extreme (high or low) will be regarded as highly activated.
tau: Exponent defining the weight of the tail in the random walk. By default tau=1. When method="ssgsea", this parameter is also used and its default value becomes then tau=0.25 to match the methodology described in (Barbie et al. 2009).

In general, the default values for the previous parameters are suitable for most analysis settings, which usually consist of some kind of normalized continuous expression values.

es_gsva <- gsva(X, gs, verbose=FALSE)
dim(es_gsva)

[1] 100  30

Median enrichment scores.

apply(es_gsva, 2, median)

          s1           s2           s3           s4           s5           s6 
 0.009061514 -0.008458424 -0.005713628 -0.021937531  0.006417182  0.019422486 
          s7           s8           s9          s10          s11          s12 
 0.010140618  0.005812097  0.006495584  0.008887644  0.024577619 -0.031697634 
         s13          s14          s15          s16          s17          s18 
 0.001642502  0.008919786 -0.022622470  0.027695420 -0.015799537 -0.011108686 
         s19          s20          s21          s22          s23          s24 
-0.013956442 -0.015493300 -0.004809844 -0.014081494  0.026845336  0.023895676 
         s25          s26          s27          s28          s29          s30 
 0.006358240  0.010642450 -0.012690144 -0.005999451 -0.005058572 -0.012403422

ssgsea (Barbie et al. 2009). Single sample GSEA (ssGSEA) is a non-parametric method that calculates a gene set enrichment score per sample as the normalized difference in empirical cumulative distribution functions (CDFs) of gene expression ranks inside and outside the gene set. By default, the implementation in the GSVA package follows the last step described in (Barbie et al. 2009, online methods, pg. 2) by which pathway scores are normalized, dividing them by the range of calculated values. This normalization step may be switched off using the argument ssgsea.norm in the call to the gsva() function; see below.

es_ssgsea <- gsva(X, gs, method = "ssgsea", verbose=FALSE)

[1] "Calculating ranks..."
[1] "Calculating absolute values from ranks..."
[1] "Normalizing..."

apply(es_ssgsea, 2, median)

       s1        s2        s3        s4        s5        s6        s7        s8 
0.1056332 0.1105148 0.1149790 0.1045303 0.1192256 0.1274702 0.1192217 0.1287768 
       s9       s10       s11       s12       s13       s14       s15       s16 
0.1316429 0.1257247 0.1293241 0.1083643 0.1222038 0.1012195 0.1018693 0.1174057 
      s17       s18       s19       s20       s21       s22       s23       s24 
0.1103184 0.1112410 0.1164373 0.1126160 0.1212393 0.1067666 0.1167117 0.1418775 
      s25       s26       s27       s28       s29       s30 
0.1221909 0.1235449 0.1112199 0.1012381 0.1254514 0.1210392

sessionInfo()

R version 4.3.1 (2023-06-16)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 22.04.3 LTS

Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3 
LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.20.so;  LAPACK version 3.10.0

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

time zone: Etc/UTC
tzcode source: system (glibc)

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] GSVA_1.48.3         BiocManager_1.30.22 workflowr_1.7.1    

loaded via a namespace (and not attached):
 [1] blob_1.2.4                  Biostrings_2.68.1          
 [3] bitops_1.0-7                fastmap_1.1.1              
 [5] SingleCellExperiment_1.22.0 RCurl_1.98-1.12            
 [7] promises_1.2.1              rsvd_1.0.5                 
 [9] XML_3.99-0.14               digest_0.6.33              
[11] lifecycle_1.0.3             processx_3.8.2             
[13] KEGGREST_1.40.1             RSQLite_2.3.1              
[15] magrittr_2.0.3              compiler_4.3.1             
[17] rlang_1.1.1                 sass_0.4.7                 
[19] tools_4.3.1                 utf8_1.2.3                 
[21] yaml_2.3.7                  knitr_1.44                 
[23] S4Arrays_1.0.6              bit_4.0.5                  
[25] DelayedArray_0.26.7         abind_1.4-5                
[27] BiocParallel_1.34.2         HDF5Array_1.28.1           
[29] BiocGenerics_0.46.0         grid_4.3.1                 
[31] stats4_4.3.1                fansi_1.0.5                
[33] git2r_0.32.0                beachmat_2.16.0            
[35] xtable_1.8-4                Rhdf5lib_1.22.1            
[37] SummarizedExperiment_1.30.2 cli_3.6.1                  
[39] rmarkdown_2.25              crayon_1.5.2               
[41] rstudioapi_0.15.0           httr_1.4.7                 
[43] DelayedMatrixStats_1.22.6   DBI_1.1.3                  
[45] cachem_1.0.8                rhdf5_2.44.0               
[47] stringr_1.5.0               zlibbioc_1.46.0            
[49] parallel_4.3.1              AnnotationDbi_1.62.2       
[51] XVector_0.40.0              matrixStats_1.0.0          
[53] vctrs_0.6.4                 Matrix_1.5-4.1             
[55] jsonlite_1.8.7              BiocSingular_1.16.0        
[57] callr_3.7.3                 IRanges_2.34.1             
[59] S4Vectors_0.38.2            bit64_4.0.5                
[61] irlba_2.3.5.1               GSEABase_1.62.0            
[63] jquerylib_0.1.4             annotate_1.78.0            
[65] glue_1.6.2                  codetools_0.2-19           
[67] ps_1.7.5                    stringi_1.7.12             
[69] later_1.3.1                 GenomeInfoDb_1.36.4        
[71] GenomicRanges_1.52.1        ScaledMatrix_1.8.1         
[73] tibble_3.2.1                pillar_1.9.0               
[75] htmltools_0.5.6.1           rhdf5filters_1.12.1        
[77] graph_1.78.0                GenomeInfoDbData_1.2.10    
[79] R6_2.5.1                    sparseMatrixStats_1.12.2   
[81] rprojroot_2.0.3             evaluate_0.22              
[83] Biobase_2.60.0              lattice_0.21-8             
[85] png_0.1-8                   memoise_2.0.1              
[87] httpuv_1.6.11               bslib_0.5.1                
[89] Rcpp_1.0.11                 whisker_0.4.1              
[91] xfun_0.40                   fs_1.6.3                   
[93] MatrixGenerics_1.12.3       getPass_0.2-2              
[95] pkgconfig_2.0.3