Creating a correlation matrix with R

Last updated: 2024-04-10

Checks: 7 0

Knit directory: muse/

This reproducible R Markdown analysis was created with workflowr (version 1.7.1). The Checks tab describes the reproducibility checks that were applied when the results were created. The Past versions tab lists the development history.

R Markdown file: up-to-date

Great! Since the R Markdown file has been committed to the Git repository, you know the exact version of the code that produced these results.

Environment: empty

Great job! The global environment was empty. Objects defined in the global environment can affect the analysis in your R Markdown file in unknown ways. For reproduciblity it’s best to always run the code in an empty environment.

Seed: set.seed(20200712)

The command set.seed(20200712) was run prior to running the code in the R Markdown file. Setting a seed ensures that any results that rely on randomness, e.g. subsampling or permutations, are reproducible.

Session information: recorded

Great job! Recording the operating system, R version, and package versions is critical for reproducibility.

Cache: none

Nice! There were no cached chunks for this analysis, so you can be confident that you successfully produced the results during this run.

File paths: relative

Great job! Using relative paths to the files within your workflowr project makes it easier to run your code on other machines.

Repository version: 7fac102

Great! You are using Git for version control. Tracking code development and connecting the code version to the results is critical for reproducibility.

The results in this page were generated with repository version 7fac102. See the Past versions tab to see a history of the changes made to the R Markdown and HTML files.

Note that you need to be careful to ensure that all relevant files for the analysis have been committed to Git prior to generating the results (you can use wflow_publish or wflow_git_commit). workflowr only checks the R Markdown file, but you know if there are other scripts or data files that it depends on. Below is the status of the Git repository when the results were generated:


Ignored files:
    Ignored:    .Rproj.user/
    Ignored:    r_packages_4.3.2/
    Ignored:    r_packages_4.3.3/

Note that any generated files, e.g. HTML, png, CSS, etc., are not included in this status report because it is ok for generated content to have uncommitted changes.

These are the previous versions of the repository in which changes were made to the R Markdown (analysis/cor_mat.Rmd) and HTML (docs/cor_mat.html) files. If you’ve configured a remote Git repository (see ?wflow_git_remote), click on the hyperlinks in the table below to view the files as they were in that past version.

File	Version	Author	Date	Message
Rmd	7fac102	Dave Tang	2024-04-10	Creating a correlation matrix with R

Incentive

Let \(A\) be a \(m \times n\) matrix, where \(a_{ij}\) are elements of \(A\), where \(i\) is the \(i_{th}\) row and \(j\) is the \(j_{th}\) column.

\[ A = \begin{bmatrix} a_{11} & \cdots & a_{1j} & \cdots & a_{1n} \\ \vdots & \ddots & \vdots && \vdots \\ a_{i1} & \cdots & a_{ij} & \cdots & a_{in} \\ \vdots && \vdots & \ddots & \vdots \\ a_{m1} & \cdots & a_{mj} & \cdots & a_{mn} \end{bmatrix} \]

If the matrix \(A\) contained transcript expression data, then \(a_{ij}\) is the expression level of the \(i^{th}\) transcript in the \(j^{th}\) assay. The elements of the \(i^{th}\) row of \(A\) form the transcriptional response of the \(i^{th}\) transcript. The elements of the \(j^{th}\) column of \(A\) form the expression profile of the \(j^{th}\) assay.

Transcripts that have a similar transcriptional response may indicate that they are co-expressed together and could have related biological functions. A simple way of looking at co-expression is through correlation, i.e., correlating all pairs of transcriptional responses, which results in a correlation matrix.

Getting started

Let’s get started with a small \(10 \times 10\) matrix. The code below will create random matrix with numbers ranging from 1 to 100.

set.seed(12345)
my_rows <- 10
my_cols <- 10
x <- runif(n = my_rows * my_cols, min = 1, max = 100)
A <- matrix(
  data = runif(100,1,100),
  nrow = my_rows,
  ncol = my_cols,
  byrow = TRUE
)
image(A)

We will calculate the Spearman’s rank correlation coefficient, which is a more robust measure of correlation.

Below are the first and second rows of matrix A.

A[1, ]

 [1] 30.15207 62.10811 97.45314 62.20299 52.61555 90.34788 64.10799 86.56581
 [9] 25.86066 22.29184

A[2, ]

 [1] 61.338128 38.961019 75.771833 38.593888 79.702234 90.663422 98.418591
 [8] 59.206849  1.936941 32.758415

Use cor() to calculate the correlation between row 1 and 2.

cor(A[1,], A[2,], method = "spearman")

[1] 0.6121212

If we provide cor() with a matrix, it will calculate all correlations between columns. If we are interested in row correlations, we need to transpose the matrix.

cor_mat <- cor(t(A), method = "spearman")
cor_mat

             [,1]        [,2]        [,3]        [,4]        [,5]        [,6]
 [1,]  1.00000000  0.61212121 -0.15151515  0.33333333  0.11515152 -0.06666667
 [2,]  0.61212121  1.00000000  0.41818182  0.52727273  0.12727273 -0.05454545
 [3,] -0.15151515  0.41818182  1.00000000  0.40606061  0.39393939 -0.18787879
 [4,]  0.33333333  0.52727273  0.40606061  1.00000000 -0.27272727 -0.34545455
 [5,]  0.11515152  0.12727273  0.39393939 -0.27272727  1.00000000  0.35757576
 [6,] -0.06666667 -0.05454545 -0.18787879 -0.34545455  0.35757576  1.00000000
 [7,]  0.16363636  0.10303030 -0.01818182 -0.18787879  0.45454545  0.55151515
 [8,] -0.21212121 -0.13939394 -0.16363636 -0.41818182 -0.12727273  0.50303030
 [9,]  0.51515152  0.57575758  0.10303030  0.40606061 -0.04242424 -0.62424242
[10,] -0.27272727 -0.10303030 -0.22424242 -0.01818182 -0.62424242 -0.63636364
             [,7]       [,8]        [,9]       [,10]
 [1,]  0.16363636 -0.2121212  0.51515152 -0.27272727
 [2,]  0.10303030 -0.1393939  0.57575758 -0.10303030
 [3,] -0.01818182 -0.1636364  0.10303030 -0.22424242
 [4,] -0.18787879 -0.4181818  0.40606061 -0.01818182
 [5,]  0.45454545 -0.1272727 -0.04242424 -0.62424242
 [6,]  0.55151515  0.5030303 -0.62424242 -0.63636364
 [7,]  1.00000000 -0.2363636 -0.46666667 -0.58787879
 [8,] -0.23636364  1.0000000 -0.36969697 -0.13939394
 [9,] -0.46666667 -0.3696970  1.00000000  0.36969697
[10,] -0.58787879 -0.1393939  0.36969697  1.00000000

The correlation of row 1 with row 1, row 2 with row 2, and so on are all 1 because we are correlating identical rows. The correlation of row 1 with row 2 and row 2 with row 1 are the same because the order does not matter with correlation; this value is also the same as the one we manually calculated before, which is a good sanity check.

Larger matrices

For 10 rows, we needed to calculate 45 correlations. We can use the choose() function to return the number of comparisons.

choose(10, 2)

[1] 45

Let’s plot the number of comparisons as the number of rows doubles.

double_num <- function(n, l){
  if(n == 1){
    return(c(l, list(l[[length(l)]]*2)))
  }
  return(double_num(n-1, c(l, l[[length(l)]]*2)))
}

x <- unlist(double_num(10, list(10)))
y <- sapply(x, function(x) choose(x, 2))

ggplot(data.frame(x = x, y = y), aes(x, y)) +
  geom_point() +
  theme_minimal()

Since the number of comparisons increases exponentially, we will need to use more cores to calculate all pairwise comparisons if we want results in a reasonable time.

Cell network

Dataset with 2,700 cells.

pbmc3k <- read_csv(file = "https://davetang.org/file/pbmc3k/pbmc3k.csv.gz", show_col_types = FALSE)

New names:
• `` -> `...1`

rn <- pull(pbmc3k[, 1])
pbmc3k <- pbmc3k[, -1]
pbmc3k <- as.matrix(pbmc3k)
row.names(pbmc3k) <- rn
pbmc3k <- pbmc3k[rowSums(pbmc3k)>9, ]
pbmc3k <- pbmc3k[, colSums(pbmc3k)>200 ]
dim(pbmc3k)

[1] 11259  2700

Number of comparisons.

choose(ncol(pbmc3k), 2)

[1] 3643650

system.time(
  pbmc3k_cor <- cor(pbmc3k, method = "spearman")
)

   user  system elapsed 
 41.650   2.282  43.938

Dimensions of the correlation matrix.

dim(pbmc3k_cor)

[1] 2700 2700

Gene network

We will use pnas_expression.txt from the study ” Determination of tag density required for digital transcriptome analysis: application to an androgen-sensitive prostate cancer model”.

pnas_exp <- read.table(
  file = "https://davetang.org/file/pnas_expression.txt",
  header = TRUE,
  row.names = 1
)
pnas_exp_rn <- row.names(pnas_exp)

dim(pnas_exp)

[1] 37435     8

We won’t be using the gene lengths, so we will remove the len column.

pnas_exp <- pnas_exp[,-pnas_exp$len]
head(pnas_exp)

                lane1 lane2 lane3 lane4 lane5 lane6 lane8
ENSG00000215696     0     0     0     0     0     0     0
ENSG00000215700     0     0     0     0     0     0     0
ENSG00000215699     0     0     0     0     0     0     0
ENSG00000215784     0     0     0     0     0     0     0
ENSG00000212914     0     0     0     0     0     0     0
ENSG00000212042     0     0     0     0     0     0     0

Top 10 most highly expressed genes.

rs <- rowSums(pnas_exp)
top10 <- head(sort(rs, decreasing = TRUE), 10)

pnas_exp[names(top10), ]

                lane1 lane2 lane3 lane4 lane5 lane6 lane8
ENSG00000157214  5270  6208  7359  7521  9806  9786  4418
ENSG00000131051  4141  4926  5615  5548  8762  8479  3496
ENSG00000212875  4652  5409  6445  6668  5623  5727  2875
ENSG00000122566  3775  4131  5427  5831  6959  7069  2505
ENSG00000142875  3166  3735  3920  4375  8123  7758  3880
ENSG00000132570  4070  4704  5471  5548  6027  5665  2613
ENSG00000115053  4352  4612  5880  5892  4744  4948  1892
ENSG00000008128  3467  4300  4514  5018  5800  5814  2079
ENSG00000162669  4212  5000  5516  5812  4328  4172  1632
ENSG00000115648   940  1084  1317  1345  9730  9942  3272

Normalise each column by its “depth”, then center and scale.

pnas_exp_norm <- apply(pnas_exp, 2, function(x) x / sum(x) * 1000000)
pnas_exp_norm <- apply(pnas_exp_norm, 2, scale)
row.names(pnas_exp_norm) <- pnas_exp_rn
pnas_exp_norm[names(top10), ]

                    lane1     lane2     lane3     lane4    lane5    lane6
ENSG00000157214 41.702924 42.345887 40.954637 40.769545 41.78237 41.50440
ENSG00000131051 32.724307 33.557386 31.197773 30.017650 37.31179 35.93325
ENSG00000212875 36.788145 36.868498 35.841235 36.121108 23.87006 24.20273
ENSG00000122566 29.813613 28.107418 30.146001 31.559863 29.59104 29.92307
ENSG00000142875 24.970409 25.392717 21.715040 23.625368 34.57548 32.85996
ENSG00000132570 32.159665 32.035508 30.392160 30.017650 25.60005 23.93845
ENSG00000115053 34.402330 31.404820 32.680324 31.892284 20.10603 20.88221
ENSG00000008128 27.364177 29.265965 25.038192 27.129407 24.62800 24.57357
ENSG00000162669 33.288950 34.064678 30.643914 31.456322 18.32465 17.57447
ENSG00000115648  7.267664  7.219303  7.152472  7.113335 41.45693 42.16936
                   lane8
ENSG00000157214 48.15434
ENSG00000131051 38.06333
ENSG00000212875 31.26667
ENSG00000122566 27.21714
ENSG00000142875 42.26609
ENSG00000132570 28.39916
ENSG00000115053 20.50804
ENSG00000008128 22.55470
ENSG00000162669 17.66242
ENSG00000115648 35.61172

Subset to the 500 most highly variable genes.

my_vars <- apply(pnas_exp_norm, 1, var)
hvgs <- head(sort(my_vars, decreasing = TRUE), 500)
pnas_exp_norm <- pnas_exp_norm[names(hvgs), ]
head(pnas_exp_norm)

                     lane1      lane2      lane3      lane4     lane5     lane6
ENSG00000115648  7.2676637  7.2193031  7.1524721  7.1133353 41.456929 42.169359
ENSG00000106070 31.9449412 30.7261446 34.3251164 34.5788950  6.193259  6.214789
ENSG00000064042 10.6873315 10.9143126 12.0476880 11.7508733 33.239445 33.115712
ENSG00000142192 13.5423565 12.6487048 13.5694009 14.2903476 34.232908 32.979311
ENSG00000155368  3.8479958  3.2089496  2.5425773  2.7809702 19.035487 19.087676
ENSG00000096060  0.3090372  0.3297214  0.3718987  0.3995318 16.813041 15.677639
                    lane8
ENSG00000115648 35.611717
ENSG00000106070  4.397453
ENSG00000064042 31.135338
ENSG00000142192 31.485568
ENSG00000155368 22.127855
ENSG00000096060 15.681431

Create correlation matrix.

pnas_exp_cor <- cor(t(pnas_exp_norm), method="spearman")
dim(pnas_exp_cor)

[1] 500 500

Install the {network} and {sna} packages.

install.packages(c("network", "sna"))

Load libraries.

library(network)
library(sna)

The network::network() function takes a matrix giving the network structure in adjacency, incidence, or edgelist form or otherwise, an object of class network.

We will convert the correlation matrix into a adjacency matrix.

pnas_exp_cor[upper.tri(pnas_exp_cor)] <- 42
my_index <- which(pnas_exp_cor < 42, arr.ind = TRUE)
pnas_exp_cor_adj <- cbind(my_index, pnas_exp_cor[my_index])
colnames(pnas_exp_cor_adj) <- c('row', 'col', 'spearman')
to_keep <- which(pnas_exp_cor_adj[, 'row'] != pnas_exp_cor_adj[, 'col'], arr.ind = TRUE)
pnas_exp_cor_adj <- pnas_exp_cor_adj[to_keep, ]
dim(pnas_exp_cor_adj)

[1] 124750      3

Distribution of the correlations.

summary(pnas_exp_cor_adj[, 'spearman'])

     Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
-1.000000 -0.714286  0.000000  0.003178  0.714286  1.000000

Keep only the higher correlations.

pnas_exp_cor_adj_sub <- pnas_exp_cor_adj[abs(pnas_exp_cor_adj[, 'spearman']) >= 0.95, ]
dim(pnas_exp_cor_adj_sub)

[1] 5264    3

Create network.

net <- network::network(pnas_exp_cor_adj_sub, directed = FALSE)
net

 Network attributes:
  vertices = 500 
  directed = FALSE 
  hyper = FALSE 
  loops = FALSE 
  multiple = FALSE 
  bipartite = FALSE 
  total edges= 5264 
    missing edges= 0 
    non-missing edges= 5264 

 Vertex attribute names: 
    vertex.names 

 Edge attribute names not shown

Component analysis

comp_dist <- component.dist(net)
class(comp_dist)

[1] "list"

Delete genes not connected with others

delete.vertices(net, which(comp_dist$csize[comp_dist$membership] == 1))
net

 Network attributes:
  vertices = 475 
  directed = FALSE 
  hyper = FALSE 
  loops = FALSE 
  multiple = FALSE 
  bipartite = FALSE 
  total edges= 5264 
    missing edges= 0 
    non-missing edges= 5264 

 Vertex attribute names: 
    vertex.names 

 Edge attribute names not shown

Plot network.

plot(net)

sessionInfo()

R version 4.3.3 (2024-02-29)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 22.04.4 LTS

Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3 
LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.20.so;  LAPACK version 3.10.0

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

time zone: Etc/UTC
tzcode source: system (glibc)

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] sna_2.7-2            statnet.common_4.9.0 network_1.18.2      
 [4] lubridate_1.9.3      forcats_1.0.0        stringr_1.5.1       
 [7] dplyr_1.1.4          purrr_1.0.2          readr_2.1.5         
[10] tidyr_1.3.1          tibble_3.2.1         ggplot2_3.5.0       
[13] tidyverse_2.0.0      workflowr_1.7.1     

loaded via a namespace (and not attached):
 [1] gtable_0.3.4      xfun_0.43         bslib_0.7.0       processx_3.8.4   
 [5] lattice_0.22-5    callr_3.7.6       tzdb_0.4.0        vctrs_0.6.5      
 [9] tools_4.3.3       ps_1.7.6          generics_0.1.3    curl_5.2.1       
[13] parallel_4.3.3    fansi_1.0.6       highr_0.10        pkgconfig_2.0.3  
[17] lifecycle_1.0.4   compiler_4.3.3    farver_2.1.1      git2r_0.33.0     
[21] munsell_0.5.1     getPass_0.2-4     httpuv_1.6.15     htmltools_0.5.8.1
[25] sass_0.4.9        yaml_2.3.8        later_1.3.2       pillar_1.9.0     
[29] crayon_1.5.2      jquerylib_0.1.4   whisker_0.4.1     cachem_1.0.8     
[33] tidyselect_1.2.1  digest_0.6.35     stringi_1.8.3     labeling_0.4.3   
[37] rprojroot_2.0.4   fastmap_1.1.1     grid_4.3.3        colorspace_2.1-0 
[41] cli_3.6.2         magrittr_2.0.3    utf8_1.2.4        withr_3.0.0      
[45] scales_1.3.0      promises_1.3.0    bit64_4.0.5       timechange_0.3.0 
[49] rmarkdown_2.26    httr_1.4.7        bit_4.0.5         hms_1.1.3        
[53] coda_0.19-4.1     evaluate_0.23     knitr_1.46        rlang_1.1.3      
[57] Rcpp_1.0.12       glue_1.7.0        rstudioapi_0.16.0 vroom_1.6.5      
[61] jsonlite_1.8.8    R6_2.5.1          fs_1.6.3