Matrix to adjacency list in R

Last updated: 2022-11-04

Checks: 7 0

Knit directory: muse/

This reproducible R Markdown analysis was created with workflowr (version 1.7.0). The Checks tab describes the reproducibility checks that were applied when the results were created. The Past versions tab lists the development history.

R Markdown file: up-to-date

Great! Since the R Markdown file has been committed to the Git repository, you know the exact version of the code that produced these results.

Environment: empty

Great job! The global environment was empty. Objects defined in the global environment can affect the analysis in your R Markdown file in unknown ways. For reproduciblity it’s best to always run the code in an empty environment.

Seed: set.seed(20200712)

The command set.seed(20200712) was run prior to running the code in the R Markdown file. Setting a seed ensures that any results that rely on randomness, e.g. subsampling or permutations, are reproducible.

Session information: recorded

Great job! Recording the operating system, R version, and package versions is critical for reproducibility.

Cache: none

Nice! There were no cached chunks for this analysis, so you can be confident that you successfully produced the results during this run.

File paths: relative

Great job! Using relative paths to the files within your workflowr project makes it easier to run your code on other machines.

Repository version: 6def24c

Great! You are using Git for version control. Tracking code development and connecting the code version to the results is critical for reproducibility.

The results in this page were generated with repository version 6def24c. See the Past versions tab to see a history of the changes made to the R Markdown and HTML files.

Note that you need to be careful to ensure that all relevant files for the analysis have been committed to Git prior to generating the results (you can use wflow_publish or wflow_git_commit). workflowr only checks the R Markdown file, but you know if there are other scripts or data files that it depends on. Below is the status of the Git repository when the results were generated:


Ignored files:
    Ignored:    .Rhistory
    Ignored:    .Rproj.user/
    Ignored:    r_packages_4.1.2/
    Ignored:    r_packages_4.2.0/

Untracked files:
    Untracked:  analysis/cell_ranger.Rmd
    Untracked:  data/ncrna_NONCODE[v3.0].fasta.tar.gz
    Untracked:  data/ncrna_noncode_v3.fa

Note that any generated files, e.g. HTML, png, CSS, etc., are not included in this status report because it is ok for generated content to have uncommitted changes.

These are the previous versions of the repository in which changes were made to the R Markdown (analysis/adj_list.Rmd) and HTML (docs/adj_list.html) files. If you’ve configured a remote Git repository (see ?wflow_git_remote), click on the hyperlinks in the table below to view the files as they were in that past version.

File	Version	Author	Date	Message
Rmd	6def24c	Dave Tang	2022-11-04	Adjacency list

An adjacency list is simply an unordered list that describes connections between vertices (or nodes) and is a commonly used input format for graphs. In this post, I use the pivot_longer function from the tidyr package to create an adjacency list from a correlation matrix. I will use the geneData dataset, which contains real but anonymised microarray expression data, from the Biobase package as an example. Finally, I will show some features of the igraph package.

In the first section, I will load the dataset, calculate the correlations, and finally create the adjacency list.

Data

Install Biobase (if necessary), a package that contains base functions for Bioconductor.

if (!require("BiocManager", quietly = TRUE))
  install.packages("BiocManager")

Bioconductor version '3.15' is out-of-date; the current release version '3.16'
  is available with R version '4.2'; see https://bioconductor.org/install

if (!require("Biobase", quietly = TRUE))
  BiocManager::install("Biobase")


Attaching package: 'BiocGenerics'

The following objects are masked from 'package:dplyr':

    combine, intersect, setdiff, union

The following objects are masked from 'package:stats':

    IQR, mad, sd, var, xtabs

The following objects are masked from 'package:base':

    anyDuplicated, append, as.data.frame, basename, cbind, colnames,
    dirname, do.call, duplicated, eval, evalq, Filter, Find, get, grep,
    grepl, intersect, is.unsorted, lapply, Map, mapply, match, mget,
    order, paste, pmax, pmax.int, pmin, pmin.int, Position, rank,
    rbind, Reduce, rownames, sapply, setdiff, sort, table, tapply,
    union, unique, unsplit, which.max, which.min

Welcome to Bioconductor

    Vignettes contain introductory material; view with
    'browseVignettes()'. To cite Bioconductor, see
    'citation("Biobase")', and for packages 'citation("pkgname")'.

library(Biobase)

A description of the data is provided with ?geneData:

The geneData data.frame has 500 rows and 26 columns. It consists of a subset of real expression data from an Affymetrix U95v2 chip. The data are anonymous. The covariate data geneCov and geneCovariate are made up. The standard error data seD is also made up.

data("geneData")
geneData[1:6, 1:6]

                       A         B        C        D        E       F
AFFX-MurIL2_at  192.7420  85.75330 176.7570 135.5750 64.49390 76.3569
AFFX-MurIL10_at  97.1370 126.19600  77.9216  93.3713 24.39860 85.5088
AFFX-MurIL4_at   45.8192   8.83135  33.0632  28.7072  5.94492 28.2925
AFFX-MurFAS_at   22.5445   3.60093  14.6883  12.3397 36.86630 11.2568
AFFX-BioB-5_at   96.7875  30.43800  46.1271  70.9319 56.17440 42.6756
AFFX-BioB-M_at   89.0730  25.84610  57.2033  69.9766 49.58220 26.1262

The cor function can be used to calculate the correlations of each sample (columns) to every other sample (all-vs-all).

my_cor_mat <- cor(geneData)
dim(my_cor_mat)

[1] 26 26

Since the correlation between A and B is the same as the correlation between B and A, we will remove the values in the lower triangle of the matrix as well as the self correlations.

my_cor_mat[1:6, 1:6]

          A         B         C         D         E         F
A 1.0000000 0.9471591 0.9191233 0.9403129 0.9689690 0.9621362
B 0.9471591 1.0000000 0.8888108 0.9108174 0.9703906 0.9425248
C 0.9191233 0.8888108 1.0000000 0.8788107 0.9077245 0.9078971
D 0.9403129 0.9108174 0.8788107 1.0000000 0.9477058 0.9017977
E 0.9689690 0.9703906 0.9077245 0.9477058 1.0000000 0.9531459
F 0.9621362 0.9425248 0.9078971 0.9017977 0.9531459 1.0000000

my_cor_mat[lower.tri(my_cor_mat, diag = TRUE)] <- NA
my_cor_mat[1:6, 1:6]

   A         B         C         D         E         F
A NA 0.9471591 0.9191233 0.9403129 0.9689690 0.9621362
B NA        NA 0.8888108 0.9108174 0.9703906 0.9425248
C NA        NA        NA 0.8788107 0.9077245 0.9078971
D NA        NA        NA        NA 0.9477058 0.9017977
E NA        NA        NA        NA        NA 0.9531459
F NA        NA        NA        NA        NA        NA

Adjacency list

We will use pivot_longer to generate the adjacency list.

my_cor_df <- as_tibble(my_cor_mat, rownames = "sample1")

pivot_longer(
  data = my_cor_df,
  cols = -sample1,
  names_to = "sample2",
  values_to = "cor"
) %>%
  filter(!is.na(cor)) -> my_adj_list

head(my_adj_list)

# A tibble: 6 × 3
  sample1 sample2   cor
  <chr>   <chr>   <dbl>
1 A       B       0.947
2 A       C       0.919
3 A       D       0.940
4 A       E       0.969
5 A       F       0.962
6 A       G       0.953

All samples are positively correlated to each other.

summary(my_adj_list$cor)

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
 0.6645  0.8945  0.9284  0.9140  0.9522  0.9923

Getting started with igraph

The igraph package is very useful for working with and visualising graph data.

library(igraph)


Attaching package: 'igraph'

The following objects are masked from 'package:BiocGenerics':

    normalize, path, union

The following objects are masked from 'package:dplyr':

    as_data_frame, groups, union

The following objects are masked from 'package:purrr':

    compose, simplify

The following object is masked from 'package:tidyr':

    crossing

The following object is masked from 'package:tibble':

    as_data_frame

The following objects are masked from 'package:stats':

    decompose, spectrum

The following object is masked from 'package:base':

    union

net <- graph_from_data_frame(my_adj_list, directed = FALSE)
E(net)$weight <- my_adj_list$cor
plot(net, layout = layout_components(net), edge.width = E(net)$weight, vertex.shape="none")

Plot only samples that are highly correlated to each other.

net_high_cor <- graph_from_data_frame(my_adj_list %>% filter(cor > 0.95), directed = FALSE)
E(net_high_cor)$weight <- my_adj_list %>% filter(cor > 0.95) %>% pull(cor)
plot(net_high_cor, layout = layout_components(net_high_cor), edge.width = E(net_high_cor)$weight, vertex.shape="none")

Louvain clustering.

cluster_louvain(net_high_cor, weights = E(net_high_cor)$weight)

IGRAPH clustering multi level, groups: 3, mod: 0.23
+ groups:
  $`1`
  [1] "A" "G" "O" "U" "J" "V"
  
  $`2`
   [1] "B" "E" "F" "H" "I" "K" "L" "P" "W" "Y" "N"
  
  $`3`
  [1] "D" "M" "Q" "S" "T" "X"

Detect communities or subgraphs using the Newman-Girvan algorithm, which is based on edge betweenness.

ceb <- cluster_edge_betweenness(net_high_cor)

Warning in cluster_edge_betweenness(net_high_cor): At core/community/
edge_betweenness.c:492 : Membership vector will be selected based on the highest
modularity score.

Warning in cluster_edge_betweenness(net_high_cor): At core/community/
edge_betweenness.c:497 : Modularity calculation with weighted edge betweenness
community detection might not make sense -- modularity treats edge weights as
similarities while edge betwenness treats them as distances.

plot(ceb, net_high_cor)

Use membership to get the clusters.

membership(ceb)

A B D E F G H I K L M O P Q S T U W J V Y X N 
1 1 2 1 1 3 1 1 1 1 1 4 1 2 2 1 3 1 5 3 1 2 1

Plot a dendrogram.

dendPlot(ceb, mode = "hclust")

igraph functions

Get the edges of a graph.

E(net_high_cor)

+ 89/89 edges from cd1acd1 (vertex names):
 [1] A--E A--F A--G A--I A--J A--K A--L A--M A--O A--P A--U A--V A--W A--Y B--E
[16] B--I B--K B--P B--W B--Y D--M D--Q D--S D--X E--F E--I E--K E--L E--M E--P
[31] E--T E--W E--X E--Y F--K F--M F--P F--W F--Y G--U G--V H--I H--K H--L H--P
[46] H--T H--W I--K I--L I--M I--P I--Q I--T I--W I--Y K--L K--M K--N K--P K--Q
[61] K--T K--W K--Y L--M L--N L--P L--Q L--T L--W L--Y M--P M--S M--T M--W M--X
[76] M--Y O--U O--Y P--T P--W P--Y Q--S Q--T Q--X S--X T--W T--X U--V W--Y

Get the vertices.

V(net_high_cor)

+ 23/23 vertices, named, from cd1acd1:
 [1] A B D E F G H I K L M O P Q S T U W J V Y X N

The proportion of present edges from all possible edges in the network (1).

edge_density(net, loops = FALSE)

[1] 1

edge_density(net_high_cor, loops = FALSE)

[1] 0.3517787

Make a full connected graph.

full_graph <- make_full_graph(26)

Degree distribution of the vertices.

degree(net_high_cor, mode = "all")

 A  B  D  E  F  G  H  I  K  L  M  O  P  Q  S  T  U  W  J  V  Y  X  N 
14  6  4 12  7  3  6 12 14 12 13  3 12  7  4 10  4 12  1  3 11  6  2

Number of cliques.

length(cliques(net_high_cor))

[1] 965

Largest clique.

largest_cliques(net_high_cor)

[[1]]
+ 9/23 vertices, named, from cd1acd1:
[1] L A Y W P M K I E

Matrix to adjacency list in R

2022-11-04

Data

Adjacency list

Getting started with igraph

igraph functions

Further reading