Last updated: 2023-10-11

Checks: 7 passed, 0 failed

Knit directory: muse/

This reproducible R Markdown analysis was created with workflowr (version 1.7.0). The Checks tab describes the reproducibility checks that were applied when the results were created. The Past versions tab lists the development history.


Great! Since the R Markdown file has been committed to the Git repository, you know the exact version of the code that produced these results.

Great job! The global environment was empty. Objects defined in the global environment can affect the analysis in your R Markdown file in unknown ways. For reproducibility it’s best to always run the code in an empty environment.

The command set.seed(20200712) was run prior to running the code in the R Markdown file. Setting a seed ensures that any results that rely on randomness, e.g. subsampling or permutations, are reproducible.

Great job! Recording the operating system, R version, and package versions is critical for reproducibility.

Nice! There were no cached chunks for this analysis, so you can be confident that you successfully produced the results during this run.

Great job! Using relative paths to the files within your workflowr project makes it easier to run your code on other machines.

Great! You are using Git for version control. Tracking code development and connecting the code version to the results is critical for reproducibility.

The results in this page were generated with repository version 7c1bdcd. See the Past versions tab to see a history of the changes made to the R Markdown and HTML files.

Note that you need to be careful to ensure that all relevant files for the analysis have been committed to Git prior to generating the results (you can use wflow_publish or wflow_git_commit). workflowr only checks the R Markdown file, but you know if there are other scripts or data files that it depends on. Below is the status of the Git repository when the results were generated:


Ignored files:
    Ignored:    .Rhistory
    Ignored:    .Rproj.user/
    Ignored:    r_packages_4.3.0/

Untracked files:
    Untracked:  analysis/cell_ranger.Rmd
    Untracked:  analysis/complex_heatmap.Rmd
    Untracked:  analysis/sleuth.Rmd
    Untracked:  analysis/tss_xgboost.Rmd
    Untracked:  code/multiz100way/
    Untracked:  data/HG00702_SH089_CHSTrio.chr1.vcf.gz
    Untracked:  data/HG00702_SH089_CHSTrio.chr1.vcf.gz.tbi
    Untracked:  data/ncrna_NONCODE[v3.0].fasta.tar.gz
    Untracked:  data/ncrna_noncode_v3.fa
    Untracked:  data/netmhciipan.out.gz
    Untracked:  data/test
    Untracked:  export/davetang039sblog.WordPress.2023-06-30.xml
    Untracked:  export/output/
    Untracked:  women.json

Unstaged changes:
    Modified:   analysis/graph.Rmd

Note that any generated files, e.g. HTML, png, CSS, etc., are not included in this status report because it is ok for generated content to have uncommitted changes.


These are the previous versions of the repository in which changes were made to the R Markdown (analysis/tissue.Rmd) and HTML (docs/tissue.html) files. If you’ve configured a remote Git repository (see ?wflow_git_remote), click on the hyperlinks in the table below to view the files as they were in that past version.

File Version Author Date Message
Rmd 7c1bdcd Dave Tang 2023-10-11 Tissue specificity

A key measure in information theory is entropy, which is:

“The amount of uncertainty involved in a random process; the lower the uncertainty, the lower the entropy.”

For example, there is lower entropy in a fair coin flip versus a fair die roll, since there are more possible outcomes with a die roll \(\{1, 2, 3, 4, 5, 6\}\) compared to a coin flip \(\{H, T\}\).

Entropy is measured in bits; a bit is a single binary value, either 0 or 1. Since a fair coin toss has two equally likely outcomes, each toss carries one bit of information. However, if the coin is not fair, meaning that it is biased towards either heads or tails, there is less uncertainty, i.e. lower entropy; if a coin lands on heads 60% of the time, we are more certain of heads than with a fair coin (50% heads).

There is a nice example on the Information Content Wikipedia page explaining the relationship between entropy (uncertainty) and information content.

For example, quoting a character (the Hippy Dippy Weatherman) of comedian George Carlin, “Weather forecast for tonight: dark. Continued dark overnight, with widely scattered light by morning.” Assuming one not residing near the Earth’s poles or polar circles, the amount of information conveyed in that forecast is zero because it is known, in advance of receiving the forecast, that darkness always comes with the night.

There is no uncertainty in the above statement hence that piece of information has 0 bits.
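Formally, the information content (or self-information) of an outcome \(x\) is \(I(x) = -\log_b p(x)\). Since darkness at night has probability 1, the forecast conveys \(-\log_2 1 = 0\) bits.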

Mathematically, the Shannon entropy is defined as:

\[ H(X) = -\sum_{i=1}^{n} p(x_i) \log_{b} p(x_i) \]
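For a uniform distribution over \(n\) outcomes, every \(p(x_i) = 1/n\), and the sum simplifies to \(\log_b n\):

\[ -\sum_{i=1}^{n} \frac{1}{n} \log_{2} \frac{1}{n} = \log_{2} n \]

So a fair coin has \(\log_2 2 = 1\) bit of entropy and a fair die has \(\log_2 6 \approx 2.58\) bits, matching the coin-versus-die comparison above.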

Let’s test this out in R using the coin flip example above.

First, let's define the entropy function according to the formula above, using base 2 so the result is in bits.

entropy <- function(x){
  # x is a vector of probabilities; assumes all values are positive,
  # since log2(0) is -Inf
  -sum(log2(x)*x)
}
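As a quick sanity check, the function reproduces the values discussed above: a fair coin yields 1 bit, a 60/40 biased coin yields less, and a fair die yields \(\log_2 6\) bits.

entropy(c(0.5, 0.5))  # fair coin: 1 bit
entropy(c(0.6, 0.4))  # biased coin: ~0.971 bits
entropy(rep(1/6, 6))  # fair die: ~2.585 bits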

Generate 100 fair coin tosses.

set.seed(1984)
fair_res <- rbinom(n = 100, size = 1, prob = 0.5)
prop.table(table(fair_res))
fair_res
   0    1 
0.48 0.52 

Calculate the Shannon entropy of the fair coin tosses. The result is slightly below 1 bit because the empirical proportions (0.48/0.52) are not exactly 50/50.

entropy(as.vector(prop.table(table(fair_res))))
[1] 0.9988455

Generate 100 biased coin tosses.

set.seed(1984)
unfair_res <- rbinom(n = 100, size = 1, prob = 0.2)
prop.table(table(unfair_res))
unfair_res
   0    1 
0.76 0.24 

Calculate the Shannon entropy of the biased coin tosses.

entropy(as.vector(prop.table(table(unfair_res))))
[1] 0.7950403

A biased coin is more predictable, i.e. has less uncertainty, and therefore has lower entropy than a fair coin.

In fact, a fair coin has the highest entropy. This makes sense because a 50/50 split is the most uncertain!

x <- seq(0.05, 0.95, 0.05)
y <- 1 - x
e <- sapply(seq_along(x), function(i) entropy(c(x[i], y[i])))

plot(x, e, xlab = "Probability of heads", ylab = "Entropy", pch = 16, xaxt = 'n')
axis(side=1, at=x)
abline(v = 0.5, lty = 3)
abline(h = 1, lty = 3)
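A quick numerical check confirms that the maximum occurs at \(p = 0.5\):

x[which.max(e)]
[1] 0.5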

Tissue specificity

What does all this have to do with measuring tissue specificity? I came across this paper: Promoter features related to tissue specificity as measured by Shannon entropy, and it spurred me to learn about entropy. Basically, if a gene is expressed in a tissue-specific manner, we are more certain of where it is expressed and hence there is lower entropy.

Let’s begin by defining a Shannon entropy function for use with tissue expression values. The code is from the R-help mailing list. This function includes a simple normalisation step: each positive value is divided by the sum of all values so that they behave like probabilities. In addition, if any value is less than 0 or if the sum of all values is less than or equal to zero, the function returns NA.

shannon.entropy <- function(p){
   # return NA for invalid input (negative values or a non-positive sum)
   if (min(p) < 0 || sum(p) <= 0) return(NA)
   # keep the positive values and normalise them so they sum to 1
   p.norm <- p[p>0]/sum(p)
   -sum(log2(p.norm)*p.norm)
}
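A few toy inputs illustrate its behaviour: raw values are normalised into probabilities, zeros are dropped, and invalid input returns NA.

shannon.entropy(c(1, 1, 1, 1))  # uniform across 4 samples: 2 bits
shannon.entropy(c(2, 0, 0, 0))  # all expression in one sample: 0 bits
shannon.entropy(c(-1, 2))       # negative value: NA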

Let’s imagine we have 30 samples and a gene that is ubiquitously expressed.

set.seed(1984)
all_gene <- rnorm(n = 30, mean = 50, sd = 15)
shannon.entropy(all_gene)
[1] 4.854579

A ubiquitously expressed gene that is highly expressed in one sample.

all_gene_one_higher <- all_gene
all_gene_one_higher[30] <- 100
shannon.entropy(all_gene_one_higher)
[1] 4.842564

Higher expression in half of the samples.

set.seed(1984)
half_gene <- c(
  rnorm(n = 15, mean = 50, sd = 10),
  rnorm(n = 15, mean = 5, sd = 1)
)
shannon.entropy(half_gene)
[1] 4.319041

Expression in only one sample. After normalisation this is a single probability of 1, so the entropy is \(-1 \times \log_2 1 = 0\) bits.

one_gene <- rep(0, 29)
one_gene[30] <- 50
shannon.entropy(one_gene)
[1] 0

High expression in three samples, with low background expression in the rest.

three_gene <- c(rep(1,27), 25, 65, 100)
shannon.entropy(three_gene)
[1] 2.360925

Equal expression in all samples; note that the Shannon entropy will be the same regardless of the expression strength.

all_gene_equal <- rep(50, 30)
shannon.entropy(all_gene_equal)
[1] 4.906891
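With equal expression the normalised distribution is uniform over the 30 samples, so the entropy is exactly \(\log_2 30\):

log2(30)
[1] 4.906891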

Plot the expression patterns for the 6 scenarios.

plot_entropy <- function(x){
  # bar plot of expression values, titled with the Shannon entropy
  barplot(x, main = round(shannon.entropy(x), 3), xlab = 'Samples', ylab = 'Expression')
}

par(mfrow=c(2,3))
# invisible() suppresses the matrix of bar midpoints that sapply would otherwise print
invisible(sapply(list(all_gene, all_gene_one_higher, half_gene, one_gene, three_gene, all_gene_equal), plot_entropy))

(Figure: six bar plots, one per expression pattern, each titled with its Shannon entropy.)

Summary

Equal expression amongst the 30 libraries resulted in a Shannon entropy of ~4.91 bits; like the fair coin, a uniform distribution gives the maximum possible entropy. The value is close to 5 bits because \(\log_2 30 \approx 4.91\), i.e. we would need 5 bits to encode 30 equally likely outcomes. The more specifically a gene is expressed, the less uncertainty there is about where it is expressed, and therefore the lower the entropy.


sessionInfo()
R version 4.3.0 (2023-04-21)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 22.04.2 LTS

Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3 
LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.20.so;  LAPACK version 3.10.0

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

time zone: Etc/UTC
tzcode source: system (glibc)

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] workflowr_1.7.0

loaded via a namespace (and not attached):
 [1] jsonlite_1.8.5   compiler_4.3.0   highr_0.10       promises_1.2.0.1
 [5] Rcpp_1.0.10      stringr_1.5.0    git2r_0.32.0     callr_3.7.3     
 [9] later_1.3.1      jquerylib_0.1.4  yaml_2.3.7       fastmap_1.1.1   
[13] R6_2.5.1         knitr_1.43       tibble_3.2.1     rprojroot_2.0.3 
[17] bslib_0.5.0      pillar_1.9.0     rlang_1.1.1      utf8_1.2.3      
[21] cachem_1.0.8     stringi_1.7.12   httpuv_1.6.11    xfun_0.39       
[25] getPass_0.2-2    fs_1.6.2         sass_0.4.6       cli_3.6.1       
[29] magrittr_2.0.3   ps_1.7.5         digest_0.6.31    processx_3.8.1  
[33] rstudioapi_0.14  lifecycle_1.0.3  vctrs_0.6.2      evaluate_0.21   
[37] glue_1.6.2       whisker_0.4.1    fansi_1.0.4      rmarkdown_2.22  
[41] httr_1.4.6       tools_4.3.0      pkgconfig_2.0.3  htmltools_0.5.5