The Rand Index

Last updated: 2024-06-19

Checks: 7 0

Knit directory: muse/

This reproducible R Markdown analysis was created with workflowr (version 1.7.1). The Checks tab describes the reproducibility checks that were applied when the results were created. The Past versions tab lists the development history.

R Markdown file: up-to-date

Great! Since the R Markdown file has been committed to the Git repository, you know the exact version of the code that produced these results.

Environment: empty

Great job! The global environment was empty. Objects defined in the global environment can affect the analysis in your R Markdown file in unknown ways. For reproduciblity it’s best to always run the code in an empty environment.

Seed: set.seed(20200712)

The command set.seed(20200712) was run prior to running the code in the R Markdown file. Setting a seed ensures that any results that rely on randomness, e.g. subsampling or permutations, are reproducible.

Session information: recorded

Great job! Recording the operating system, R version, and package versions is critical for reproducibility.

Cache: none

Nice! There were no cached chunks for this analysis, so you can be confident that you successfully produced the results during this run.

File paths: relative

Great job! Using relative paths to the files within your workflowr project makes it easier to run your code on other machines.

Repository version: 8755da5

Great! You are using Git for version control. Tracking code development and connecting the code version to the results is critical for reproducibility.

The results in this page were generated with repository version 8755da5. See the Past versions tab to see a history of the changes made to the R Markdown and HTML files.

Note that you need to be careful to ensure that all relevant files for the analysis have been committed to Git prior to generating the results (you can use wflow_publish or wflow_git_commit). workflowr only checks the R Markdown file, but you know if there are other scripts or data files that it depends on. Below is the status of the Git repository when the results were generated:


Ignored files:
    Ignored:    .Rhistory
    Ignored:    .Rproj.user/
    Ignored:    r_packages_4.3.3/
    Ignored:    r_packages_4.4.0/

Note that any generated files, e.g. HTML, png, CSS, etc., are not included in this status report because it is ok for generated content to have uncommitted changes.

These are the previous versions of the repository in which changes were made to the R Markdown (analysis/rand_index.Rmd) and HTML (docs/rand_index.html) files. If you’ve configured a remote Git repository (see ?wflow_git_remote), click on the hyperlinks in the table below to view the files as they were in that past version.

File	Version	Author	Date	Message
Rmd	8755da5	Dave Tang	2024-06-19	Set seed before k-means clustering
html	01c75b7	Dave Tang	2024-06-19	Build site.
Rmd	77a8ff4	Dave Tang	2024-06-19	The Rand index

I’ve been looking for ways to compare clustering results and through my searching I came across something called the Rand index. In this short post, I explain how this index is calculated.

From the Wikipedia page you can see that the Rand index, R, is calculated by:

\[ R = \frac{a + b}{{n}\choose{2}} \]

Ignoring the numerator for now, notice that the denominator is a binomial coefficient. \({n}\choose{2}\) is the number of unordered pairs (order does not matter) in a set of \(n\) elements. For example, if you have set of 4 elements \(\{a, b, c, d\}\), there are 6 unordered pairs: \(\{a, b\}\), \(\{a, c\}\), \(\{a, d\}\), \(\{b, c\}\), \(\{b, d\}\), and \(\{c, d\}\).

The \(a\) in the formula refers to the number of times a pair of elements belongs to a same cluster across two different clustering results and the \(b\) refers to the number of times a pair of elements are in different clusters across two different clustering results. It will be easier to understand the Rand index with a simple example.

Say we have a set of six elements: \(\{a, b, c, d, e, f\}\). Clustering method 1 (CM1) forms three clusters; the first two items are in group 1, the third and fourth are in group 2, and the fifth and sixth are in group 3: \(\{1, 1, 2, 2, 3, 3\}\). Clustering method 2 (CM2) forms two clusters; the first three items are in group 1 and the last three items are in group 2: \(\{1, 1, 1, 2, 2, 2\}\). What’s the Rand index of these two clustering results?

To manually calculate the Rand index, we need to go through every unordered pair to work out \(a\) and \(b\); let’s work out \(a\) first. There are 15 unordered pairs in a set of six elements:

\(\{a, b\}\)
\(\{a, c\}\)
\(\{a, d\}\)
\(\{a, e\}\)
\(\{a, f\}\)
\(\{b, c\}\)
\(\{b, d\}\)
\(\{b, e\}\)
\(\{b, f\}\)
\(\{c, d\}\)
\(\{c, e\}\)
\(\{c, f\}\)
\(\{d, e\}\)
\(\{d, f\}\)
\(\{e, f\}\)

\(a\) is every time a pair of elements is grouped together by the two clustering methods. \(\{a, b\}\) and \(\{e, f\}\) are clustered together by CM1 and CM2, so \(a\) = 2.

\(b\) is every time a pair of elements is not grouped together by the two clustering methods. \(\{a, d\}\), \(\{a, e\}\), \(\{a, f\}\), \(\{b, d\}\), \(\{b, e\}\), \(\{b, f\}\), \(\{c, e\}\), and \(\{c, f\}\) are not clustered together by CM1 and CM2, so \(b\) = 8.

\(a\) and \(b\) are the times that both clustering methods agree. Therefore, the Rand index is:

\[ R = \frac{2 + 8}{{6}\choose{2}} = \frac{10}{15} = 0.667 \]

Using the {fossil} package and the rand.index() function, we can work out the Rand index in R.

if(!require("fossil")){
  install.packages("fossil")
}

# CM1
a <- rep(1:3, each = 2)
a

[1] 1 1 2 2 3 3

# CM2
b <- rep(1:2, each = 3)
b

[1] 1 1 1 2 2 2

fossil::rand.index(a, b)

[1] 0.6666667

Let’s take a look at the rand.index() function.

fossil::rand.index

function (group1, group2) 
{
    x <- abs(sapply(group1, function(x) x - group1))
    x[x > 1] <- 1
    y <- abs(sapply(group2, function(x) x - group2))
    y[y > 1] <- 1
    sg <- sum(abs(x - y))/2
    bc <- choose(dim(x)[1], 2)
    ri <- 1 - sg/bc
    return(ri)
}
<bytecode: 0x556b1f3e5088>
<environment: namespace:fossil>

Let’s go through the code step by step to see what it is doing. First we’ll create two sets of clustering results.

group1 <- rep(1:3, each = 2)
group2 <- rep(1:2, each = 3)

group1

[1] 1 1 2 2 3 3

group2

[1] 1 1 1 2 2 2

The line of code below is a nice way to create a matrix that indicates whether two elements belong to the same cluster (value = 0).

x <- abs(sapply(group1, function(x) x - group1))
x

     [,1] [,2] [,3] [,4] [,5] [,6]
[1,]    0    0    1    1    2    2
[2,]    0    0    1    1    2    2
[3,]    1    1    0    0    1    1
[4,]    1    1    0    0    1    1
[5,]    2    2    1    1    0    0
[6,]    2    2    1    1    0    0

Since we only care whether an element belongs to the same cluster or not, we simply transform values greater than 1 (belonging to a different cluster) into 1.

x[x > 1] <- 1
x

     [,1] [,2] [,3] [,4] [,5] [,6]
[1,]    0    0    1    1    1    1
[2,]    0    0    1    1    1    1
[3,]    1    1    0    0    1    1
[4,]    1    1    0    0    1    1
[5,]    1    1    1    1    0    0
[6,]    1    1    1    1    0    0

This is repeated for the second group.

y <- abs(sapply(group2, function(x) x - group2))
y[y > 1] <- 1

Each row and column number in x and y correspond to the element number in group1 and group2, respectively. A zero indicates a match, i.e. that a particular element is in the same cluster as another element, and a one indicates that they belong to different clusters. The x and y matrices therefore list all the pairs (twice) and whether or not they belong to the same cluster.

Now the function can compare the two different groups of clusters to get the number of disagreements.

# divide by two because the pairs are counted twice
# sum to get all the disagreements
sg <- sum(abs(x - y))/2
sg

[1] 5

Get the total number of pairs.

bc <- choose(dim(x)[1], 2)
bc

[1] 15

Finally to get the Rand Index, which reflects the number of agreements, we subtract the fraction of disagreements from 1.

ri <- 1 - sg/bc
ri

[1] 0.6666667

Assessing clustering results

We can use the Rand index to assess a clustering approach. I’ll illustrate using k-means clustering using the infamous iris dataset.

# check out the data
head(iris)

  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa
4          4.6         3.1          1.5         0.2  setosa
5          5.0         3.6          1.4         0.2  setosa
6          5.4         3.9          1.7         0.4  setosa

# true labels
true_label <- as.numeric(iris$Species)
true_label

  [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 [38] 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
 [75] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 3 3 3 3 3 3 3 3 3 3 3
[112] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
[149] 3 3

# perform k-means clustering
set.seed(1984)
my_kmeans <- kmeans(x = iris[,-5], centers = 3)
# clustering results
my_kmeans$cluster

  [1] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
 [38] 2 2 2 2 2 2 2 2 2 2 2 2 2 3 3 1 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
 [75] 3 3 3 1 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 1 3 1 1 1 1 3 1 1 1 1
[112] 1 1 3 3 1 1 1 1 3 1 3 1 3 1 1 3 3 1 1 1 1 1 3 1 1 1 1 3 1 1 1 3 1 1 1 3 1
[149] 1 3

fossil::rand.index(true_label, my_kmeans$cluster)

[1] 0.8797315

From the Wikipedia article:

The Rand index has a value between 0 and 1, with 0 indicating that the two data clusterings do not agree on any pair of points and 1 indicating that the data clusterings are exactly the same.

The Rand index suggests that the k-means clustering of the iris data using sepal and petal measurements is similar to the real “clustering” of the data.

sessionInfo()

R version 4.4.0 (2024-04-24)
Platform: x86_64-pc-linux-gnu
Running under: Ubuntu 22.04.4 LTS

Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3 
LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.20.so;  LAPACK version 3.10.0

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

time zone: Etc/UTC
tzcode source: system (glibc)

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] fossil_0.4.0     shapefiles_0.7.2 foreign_0.8-86   maps_3.4.2      
 [5] sp_2.1-4         lubridate_1.9.3  forcats_1.0.0    stringr_1.5.1   
 [9] dplyr_1.1.4      purrr_1.0.2      readr_2.1.5      tidyr_1.3.1     
[13] tibble_3.2.1     ggplot2_3.5.1    tidyverse_2.0.0  workflowr_1.7.1 

loaded via a namespace (and not attached):
 [1] sass_0.4.9        utf8_1.2.4        generics_0.1.3    lattice_0.22-6   
 [5] stringi_1.8.4     hms_1.1.3         digest_0.6.35     magrittr_2.0.3   
 [9] timechange_0.3.0  evaluate_0.23     grid_4.4.0        fastmap_1.2.0    
[13] rprojroot_2.0.4   jsonlite_1.8.8    processx_3.8.4    whisker_0.4.1    
[17] ps_1.7.6          promises_1.3.0    httr_1.4.7        fansi_1.0.6      
[21] scales_1.3.0      jquerylib_0.1.4   cli_3.6.2         rlang_1.1.3      
[25] munsell_0.5.1     withr_3.0.0       cachem_1.1.0      yaml_2.3.8       
[29] tools_4.4.0       tzdb_0.4.0        colorspace_2.1-0  httpuv_1.6.15    
[33] vctrs_0.6.5       R6_2.5.1          lifecycle_1.0.4   git2r_0.33.0     
[37] fs_1.6.4          pkgconfig_2.0.3   callr_3.7.6       pillar_1.9.0     
[41] bslib_0.7.0       later_1.3.2       gtable_0.3.5      glue_1.7.0       
[45] Rcpp_1.0.12       xfun_0.44         tidyselect_1.2.1  rstudioapi_0.16.0
[49] knitr_1.46        htmltools_0.5.8.1 rmarkdown_2.27    compiler_4.4.0   
[53] getPass_0.2-4