tree_pca

Last updated: 2020-08-03

Checks: 7 0

Knit directory: misc/analysis/

This reproducible R Markdown analysis was created with workflowr (version 1.6.2). The Checks tab describes the reproducibility checks that were applied when the results were created. The Past versions tab lists the development history.

R Markdown file: up-to-date

Great! Since the R Markdown file has been committed to the Git repository, you know the exact version of the code that produced these results.

Environment: empty

Great job! The global environment was empty. Objects defined in the global environment can affect the analysis in your R Markdown file in unknown ways. For reproduciblity it’s best to always run the code in an empty environment.

Seed: set.seed(1)

The command set.seed(1) was run prior to running the code in the R Markdown file. Setting a seed ensures that any results that rely on randomness, e.g. subsampling or permutations, are reproducible.

Session information: recorded

Great job! Recording the operating system, R version, and package versions is critical for reproducibility.

Cache: none

Nice! There were no cached chunks for this analysis, so you can be confident that you successfully produced the results during this run.

File paths: relative

Great job! Using relative paths to the files within your workflowr project makes it easier to run your code on other machines.

Repository version: d41fc88

Great! You are using Git for version control. Tracking code development and connecting the code version to the results is critical for reproducibility.

The results in this page were generated with repository version d41fc88. See the Past versions tab to see a history of the changes made to the R Markdown and HTML files.

Note that you need to be careful to ensure that all relevant files for the analysis have been committed to Git prior to generating the results (you can use wflow_publish or wflow_git_commit). workflowr only checks the R Markdown file, but you know if there are other scripts or data files that it depends on. Below is the status of the Git repository when the results were generated:


Ignored files:
    Ignored:    .DS_Store
    Ignored:    .Rhistory
    Ignored:    .Rproj.user/
    Ignored:    analysis/.RData
    Ignored:    analysis/.Rhistory
    Ignored:    analysis/ALStruct_cache/
    Ignored:    data/.Rhistory
    Ignored:    data/pbmc/

Untracked files:
    Untracked:  .dropbox
    Untracked:  Icon
    Untracked:  analysis/GHstan.Rmd
    Untracked:  analysis/GTEX-cogaps.Rmd
    Untracked:  analysis/PACS.Rmd
    Untracked:  analysis/Rplot.png
    Untracked:  analysis/SPCAvRP.rmd
    Untracked:  analysis/admm_02.Rmd
    Untracked:  analysis/admm_03.Rmd
    Untracked:  analysis/compare-transformed-models.Rmd
    Untracked:  analysis/cormotif.Rmd
    Untracked:  analysis/cp_ash.Rmd
    Untracked:  analysis/eQTL.perm.rand.pdf
    Untracked:  analysis/eb_prepilot.Rmd
    Untracked:  analysis/eb_var.Rmd
    Untracked:  analysis/ebpmf1.Rmd
    Untracked:  analysis/flash_test_tree.Rmd
    Untracked:  analysis/flash_tree.Rmd
    Untracked:  analysis/ieQTL.perm.rand.pdf
    Untracked:  analysis/lasso_em_03.Rmd
    Untracked:  analysis/m6amash.Rmd
    Untracked:  analysis/mash_bhat_z.Rmd
    Untracked:  analysis/mash_ieqtl_permutations.Rmd
    Untracked:  analysis/mixsqp.Rmd
    Untracked:  analysis/mr.ash_lasso_init.Rmd
    Untracked:  analysis/mr.mash.test.Rmd
    Untracked:  analysis/mr_ash_modular.Rmd
    Untracked:  analysis/mr_ash_parameterization.Rmd
    Untracked:  analysis/mr_ash_pen.Rmd
    Untracked:  analysis/mr_ash_ridge.Rmd
    Untracked:  analysis/nejm.Rmd
    Untracked:  analysis/normalize.Rmd
    Untracked:  analysis/pbmc.Rmd
    Untracked:  analysis/poisson_transform.Rmd
    Untracked:  analysis/pseudodata.Rmd
    Untracked:  analysis/qrnotes.txt
    Untracked:  analysis/ridge_iterative_02.Rmd
    Untracked:  analysis/ridge_iterative_splitting.Rmd
    Untracked:  analysis/samps/
    Untracked:  analysis/sc_bimodal.Rmd
    Untracked:  analysis/shrinkage_comparisons_changepoints.Rmd
    Untracked:  analysis/susie_en.Rmd
    Untracked:  analysis/susie_z_investigate.Rmd
    Untracked:  analysis/svd-timing.Rmd
    Untracked:  analysis/temp.RDS
    Untracked:  analysis/temp.Rmd
    Untracked:  analysis/test-figure/
    Untracked:  analysis/test.Rmd
    Untracked:  analysis/test.Rpres
    Untracked:  analysis/test.md
    Untracked:  analysis/test_qr.R
    Untracked:  analysis/test_sparse.Rmd
    Untracked:  analysis/z.txt
    Untracked:  code/multivariate_testfuncs.R
    Untracked:  code/rqb.hacked.R
    Untracked:  data/4matthew/
    Untracked:  data/4matthew2/
    Untracked:  data/E-MTAB-2805.processed.1/
    Untracked:  data/ENSG00000156738.Sim_Y2.RDS
    Untracked:  data/GDS5363_full.soft.gz
    Untracked:  data/GSE41265_allGenesTPM.txt
    Untracked:  data/Muscle_Skeletal.ACTN3.pm1Mb.RDS
    Untracked:  data/Thyroid.FMO2.pm1Mb.RDS
    Untracked:  data/bmass.HaemgenRBC2016.MAF01.Vs2.MergedDataSources.200kRanSubset.ChrBPMAFMarkerZScores.vs1.txt.gz
    Untracked:  data/bmass.HaemgenRBC2016.Vs2.NewSNPs.ZScores.hclust.vs1.txt
    Untracked:  data/bmass.HaemgenRBC2016.Vs2.PreviousSNPs.ZScores.hclust.vs1.txt
    Untracked:  data/eb_prepilot/
    Untracked:  data/finemap_data/fmo2.sim/b.txt
    Untracked:  data/finemap_data/fmo2.sim/dap_out.txt
    Untracked:  data/finemap_data/fmo2.sim/dap_out2.txt
    Untracked:  data/finemap_data/fmo2.sim/dap_out2_snp.txt
    Untracked:  data/finemap_data/fmo2.sim/dap_out_snp.txt
    Untracked:  data/finemap_data/fmo2.sim/data
    Untracked:  data/finemap_data/fmo2.sim/fmo2.sim.config
    Untracked:  data/finemap_data/fmo2.sim/fmo2.sim.k
    Untracked:  data/finemap_data/fmo2.sim/fmo2.sim.k4.config
    Untracked:  data/finemap_data/fmo2.sim/fmo2.sim.k4.snp
    Untracked:  data/finemap_data/fmo2.sim/fmo2.sim.ld
    Untracked:  data/finemap_data/fmo2.sim/fmo2.sim.snp
    Untracked:  data/finemap_data/fmo2.sim/fmo2.sim.z
    Untracked:  data/finemap_data/fmo2.sim/pos.txt
    Untracked:  data/logm.csv
    Untracked:  data/m.cd.RDS
    Untracked:  data/m.cdu.old.RDS
    Untracked:  data/m.new.cd.RDS
    Untracked:  data/m.old.cd.RDS
    Untracked:  data/mainbib.bib.old
    Untracked:  data/mat.csv
    Untracked:  data/mat.txt
    Untracked:  data/mat_new.csv
    Untracked:  data/matrix_lik.rds
    Untracked:  data/paintor_data/
    Untracked:  data/temp.txt
    Untracked:  data/y.txt
    Untracked:  data/y_f.txt
    Untracked:  data/zscore_jointLCLs_m6AQTLs_susie_eQTLpruned.rds
    Untracked:  data/zscore_jointLCLs_random.rds
    Untracked:  explore_udi.R
    Untracked:  output/fit.k10.rds
    Untracked:  output/fit.varbvs.RDS
    Untracked:  output/glmnet.fit.RDS
    Untracked:  output/test.bv.txt
    Untracked:  output/test.gamma.txt
    Untracked:  output/test.hyp.txt
    Untracked:  output/test.log.txt
    Untracked:  output/test.param.txt
    Untracked:  output/test2.bv.txt
    Untracked:  output/test2.gamma.txt
    Untracked:  output/test2.hyp.txt
    Untracked:  output/test2.log.txt
    Untracked:  output/test2.param.txt
    Untracked:  output/test3.bv.txt
    Untracked:  output/test3.gamma.txt
    Untracked:  output/test3.hyp.txt
    Untracked:  output/test3.log.txt
    Untracked:  output/test3.param.txt
    Untracked:  output/test4.bv.txt
    Untracked:  output/test4.gamma.txt
    Untracked:  output/test4.hyp.txt
    Untracked:  output/test4.log.txt
    Untracked:  output/test4.param.txt
    Untracked:  output/test5.bv.txt
    Untracked:  output/test5.gamma.txt
    Untracked:  output/test5.hyp.txt
    Untracked:  output/test5.log.txt
    Untracked:  output/test5.param.txt

Unstaged changes:
    Modified:   analysis/ash_delta_operator.Rmd
    Modified:   analysis/ash_pois_bcell.Rmd
    Modified:   analysis/lasso_em.Rmd
    Modified:   analysis/minque.Rmd
    Modified:   analysis/mr_missing_data.Rmd
    Modified:   analysis/ridge_admm.Rmd

Note that any generated files, e.g. HTML, png, CSS, etc., are not included in this status report because it is ok for generated content to have uncommitted changes.

These are the previous versions of the repository in which changes were made to the R Markdown (analysis/tree_pca_02.Rmd) and HTML (docs/tree_pca_02.html) files. If you’ve configured a remote Git repository (see ?wflow_git_remote), click on the hyperlinks in the table below to view the files as they were in that past version.

File	Version	Author	Date	Message
Rmd	d41fc88	Matthew Stephens	2020-08-03	workflowr::wflow_publish(“tree_pca_02.Rmd”)

Introduction

Following up on a previous analysis to look at how PCA behaves on some tree-structure covariance matrices.

Tree-structure covariance - 4 tips

I look at the simple 4-tip tree where the root first splits into 2 branches, and then each branch splits into two tips.

This function produces the covariance matrix of this tree, with branch lengths given by b. The first two elements of b give the left and right branches of the first split, etc (There are 6 branches in all.)

Sigma4_fun = function(b){
  Sigma4 = matrix(0,nrow=4, ncol=4) # shared branch
  Sigma4[1:2,1:2] = Sigma4[1:2,1:2] + b[1] # left
  Sigma4[3:4,3:4] = Sigma4[3:4,3:4] + b[2] # right split
  for(i in 1:4)
    Sigma4[i,i] = Sigma4[i,i] + b[2+i]
  Sigma4
}

Even trees

Here we assume all the branch lengths are the same. We see that the PCA (eigenvectors) reproduce the tree.

Sigma4_even = Sigma4_fun(rep(1,6))
Sigma4_even

     [,1] [,2] [,3] [,4]
[1,]    2    1    0    0
[2,]    1    2    0    0
[3,]    0    0    2    1
[4,]    0    0    1    2

eigen(Sigma4_even)

eigen() decomposition
$values
[1] 3 3 1 1

$vectors
          [,1]      [,2]       [,3]       [,4]
[1,] 0.0000000 0.7071068  0.0000000  0.7071068
[2,] 0.0000000 0.7071068  0.0000000 -0.7071068
[3,] 0.7071068 0.0000000  0.7071068  0.0000000
[4,] 0.7071068 0.0000000 -0.7071068  0.0000000

However, note that the equal eigenvalues (3,3) and (1,1) mean that there is ambiguity about these eigenvectors. Any linear cobination of these last two eigenvectors will also be an eigenvector. For example:

v=c(-1,1,-1,1)
Sigma4_even %*% v

     [,1]
[1,]   -1
[2,]    1
[3,]   -1
[4,]    1

This means that if the truth is a tree like this, the representation learned could be quite prior dependent. It makes it more attractive to make assumptions like sparsity, or even a hierarchical structure on the loadings. (One long-term question is whether a sparsity assumption is enough, or whether we really need to put in a hierarchical assumption to recover trees.)

The effects of centering

We can also center the covariance matrix before eigen-decomposition, which is analogous to including a constant factor. This way each eigenvector represents a “split” of the samples in the tree (rather than the top two branches being put in two different eigen vectors).

Sigma4_even_centered = Sigma4_even-mean(Sigma4_even)
eigen(Sigma4_even_centered)

eigen() decomposition
$values
[1] 3.000000e+00 1.000000e+00 1.000000e+00 2.664535e-15

$vectors
     [,1]          [,2]          [,3] [,4]
[1,] -0.5  7.071068e-01  0.000000e+00  0.5
[2,] -0.5 -7.071068e-01  6.824628e-17  0.5
[3,]  0.5 -9.992007e-16 -7.071068e-01  0.5
[4,]  0.5 -1.054712e-15  7.071068e-01  0.5

Balanced trees

By a (locally-)balanced tree I mean one where every split is balanced (although not all splits are the same). We see similar results in this case to the even tree (but this is not true for the locally-balanced 8 tip tree!)

Sigma4_balanced = Sigma4_fun(c(1,1,2,2,3,3))
Sigma4_balanced

     [,1] [,2] [,3] [,4]
[1,]    3    1    0    0
[2,]    1    3    0    0
[3,]    0    0    4    1
[4,]    0    0    1    4

eigen(Sigma4_balanced)

eigen() decomposition
$values
[1] 5 4 3 2

$vectors
          [,1]      [,2]       [,3]       [,4]
[1,] 0.0000000 0.7071068  0.0000000  0.7071068
[2,] 0.0000000 0.7071068  0.0000000 -0.7071068
[3,] 0.7071068 0.0000000  0.7071068  0.0000000
[4,] 0.7071068 0.0000000 -0.7071068  0.0000000

eigen(Sigma4_balanced-mean(Sigma4_balanced))

eigen() decomposition
$values
[1]  4.55488611  3.00000000  2.00000000 -0.05488611

$vectors
           [,1]          [,2]          [,3]       [,4]
[1,]  0.4424561  0.000000e+00  7.071068e-01 -0.5515729
[2,]  0.4424561 -5.985912e-17 -7.071068e-01 -0.5515729
[3,] -0.5515729 -7.071068e-01 -3.885781e-16 -0.4424561
[4,] -0.5515729  7.071068e-01 -3.330669e-16 -0.4424561

Un-balanced trees

With arbitrary branch lengths we don’t see the nice exact results. For example, the first eigen-vector, the loadings are not equal for all the samples that split the same way. However the general patterns are somewhat similar.

Sigma4_unbalanced = Sigma4_fun(c(1,2,3,4,5,6))
Sigma4_unbalanced

     [,1] [,2] [,3] [,4]
[1,]    4    1    0    0
[2,]    1    5    0    0
[3,]    0    0    7    2
[4,]    0    0    2    8

eigen(Sigma4_unbalanced)

eigen() decomposition
$values
[1] 9.561553 5.618034 5.438447 3.381966

$vectors
          [,1]      [,2]       [,3]       [,4]
[1,] 0.0000000 0.5257311  0.0000000  0.8506508
[2,] 0.0000000 0.8506508  0.0000000 -0.5257311
[3,] 0.6154122 0.0000000 -0.7882054  0.0000000
[4,] 0.7882054 0.0000000  0.6154122  0.0000000

eigen(Sigma4_unbalanced-mean(Sigma4_unbalanced))

eigen() decomposition
$values
[1]  8.085568  5.441261  3.529426 -0.556255

$vectors
           [,1]        [,2]        [,3]       [,4]
[1,]  0.3181096  0.06396412  0.66046264 -0.6771292
[2,]  0.3959714  0.10834478 -0.74306199 -0.5285139
[3,] -0.4823772 -0.77854288 -0.09273996 -0.3906181
[4,] -0.7136702  0.61485038 -0.05520215 -0.3310386

Un-balanced trees: correlation matrix

Part of the issue is that with arbitrary branch lengths the different nodes do not have the same variance. Taking the correlation matrix instead of the covariance matrix seemms to make the results for irregular trees closer to the “regular” trees.

eigen(cov2cor(Sigma4_unbalanced)-mean(cov2cor(Sigma4_unbalanced)))

eigen() decomposition
$values
[1]  1.245816442  0.776393202  0.732738758 -0.000382422

$vectors
           [,1]          [,2]          [,3]       [,4]
[1,]  0.4911644 -7.071068e-01  0.000000e+00 -0.5086821
[2,]  0.4911644  7.071068e-01 -6.636048e-17 -0.5086821
[3,] -0.5086821  1.665335e-16 -7.071068e-01 -0.4911644
[4,] -0.5086821  3.330669e-16  7.071068e-01 -0.4911644

Tree-structure covariance - 8 tips

Here I create an 8-tip tree by joining two 4-tip trees with a top splitting branch whose branch lengths are given by b_top.

Sigma8_fun = function(b_top,b_left,b_right){
  Sigma8 = matrix(0,nrow=8,ncol=8)
  Sigma8[1:4,1:4] = b_top[1] + Sigma4_fun(b_left) # left
  Sigma8[5:8,5:8] = b_top[2] + Sigma4_fun(b_right) # right split
  Sigma8
}

Balanced trees

Note that for the 8-tip case, the (locally-)balanced trees don’t give such clean results as the 4-tip case. In particular, the first eigenvector varies among samples that split the same way. And the second eigenvector is not really reflecting the tree!

Sigma8_balanced = Sigma8_fun(c(1,1),c(2,2,3,3,4,4),c(5,5,6,6,7,7))
Sigma8_balanced

     [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8]
[1,]    6    3    1    1    0    0    0    0
[2,]    3    6    1    1    0    0    0    0
[3,]    1    1    7    3    0    0    0    0
[4,]    1    1    3    7    0    0    0    0
[5,]    0    0    0    0   12    6    1    1
[6,]    0    0    0    0    6   12    1    1
[7,]    0    0    0    0    1    1   13    6
[8,]    0    0    0    0    1    1    6   13

eigen(Sigma8_balanced)

eigen() decomposition
$values
[1] 20.561553 16.438447 11.561553  7.438447  7.000000  6.000000  4.000000
[8]  3.000000

$vectors
          [,1]       [,2]       [,3]       [,4]          [,5]          [,6]
[1,] 0.0000000  0.0000000 -0.4351621  0.5573454  0.000000e+00  0.000000e+00
[2,] 0.0000000  0.0000000 -0.4351621  0.5573454  0.000000e+00  0.000000e+00
[3,] 0.0000000  0.0000000 -0.5573454 -0.4351621  0.000000e+00  0.000000e+00
[4,] 0.0000000  0.0000000 -0.5573454 -0.4351621  0.000000e+00  0.000000e+00
[5,] 0.4351621  0.5573454  0.0000000  0.0000000  0.000000e+00  7.071068e-01
[6,] 0.4351621  0.5573454  0.0000000  0.0000000  1.369272e-17 -7.071068e-01
[7,] 0.5573454 -0.4351621  0.0000000  0.0000000 -7.071068e-01 -3.191891e-16
[8,] 0.5573454 -0.4351621  0.0000000  0.0000000  7.071068e-01 -3.191891e-16
              [,7]          [,8]
[1,]  0.000000e+00  7.071068e-01
[2,] -2.643072e-17 -7.071068e-01
[3,] -7.071068e-01 -3.885781e-16
[4,]  7.071068e-01 -3.885781e-16
[5,]  0.000000e+00  0.000000e+00
[6,]  0.000000e+00  0.000000e+00
[7,]  0.000000e+00  0.000000e+00
[8,]  0.000000e+00  0.000000e+00

eigen(Sigma8_balanced-mean(Sigma8_balanced))

eigen() decomposition
$values
[1] 17.394206 16.294806  7.514771  7.000000  6.000000  4.000000  3.000000
[8] -1.203784

$vectors
           [,1]       [,2]        [,3]          [,4]          [,5]
[1,] -0.2134273  0.1091374  0.48446566  0.000000e+00 -4.100270e-16
[2,] -0.2134273  0.1091374  0.48446566  6.824628e-17 -4.909405e-18
[3,] -0.2361463  0.1222948 -0.51396174 -5.556812e-17  1.251287e-15
[4,] -0.2361463  0.1222948 -0.51396174  2.681578e-17 -5.824634e-16
[5,]  0.1717947 -0.6346232 -0.02512398 -6.742347e-18  7.071068e-01
[6,]  0.1717947 -0.6346232 -0.02512398 -2.757250e-17 -7.071068e-01
[7,]  0.6075939  0.2653043 -0.02247523 -7.071068e-01  2.335306e-16
[8,]  0.6075939  0.2653043 -0.02247523  7.071068e-01  2.335306e-16
              [,6]          [,7]       [,8]
[1,] -9.969376e-17  7.071068e-01 -0.4558847
[2,] -1.193919e-16 -7.071068e-01 -0.4558847
[3,]  7.071068e-01 -8.881784e-16 -0.4063524
[4,] -7.071068e-01 -5.273559e-16 -0.4063524
[5,] -6.840255e-16 -2.220446e-16 -0.2590535
[6,]  9.257979e-16 -5.551115e-16 -0.2590535
[7,] -1.618737e-16 -2.775558e-16 -0.2448228
[8,] -1.618737e-16 -2.775558e-16 -0.2448228

Using the correlation matrix instead helps make the results cleaner and more “tree-like”:

eigen(cov2cor(Sigma8_balanced)-mean(cov2cor(Sigma8_balanced)))

eigen() decomposition
$values
[1]  1.711116942  1.320360834  1.154943654  0.571428571  0.538461538
[6]  0.500000000  0.500000000 -0.003206439

$vectors
           [,1]        [,2]        [,3]          [,4]          [,5]
[1,] -0.3900132 -0.01923418  0.48904193 -3.038644e-16  0.000000e+00
[2,] -0.3900132 -0.01923418  0.48904193  2.368025e-16 -4.303177e-17
[3,] -0.3428881 -0.01237841 -0.50960415  7.071068e-01  2.720817e-15
[4,] -0.3428881 -0.01237841 -0.50960415 -7.071068e-01 -2.691877e-15
[5,]  0.3556323 -0.49240356  0.02097782 -5.244414e-15 -7.324422e-16
[6,]  0.3556323 -0.49240356  0.02097782  5.691283e-15  7.938681e-16
[7,]  0.3222470  0.50696701  0.02648651  2.471636e-15 -7.071068e-01
[8,]  0.3222470  0.50696701  0.02648651 -2.968457e-15  7.071068e-01
              [,6]          [,7]       [,8]
[1,]  2.174194e-02  7.067724e-01 -0.3291774
[2,] -2.174194e-02 -7.067724e-01 -0.3291774
[3,]  7.133183e-15  1.040834e-15 -0.3501116
[4,] -7.219919e-15  1.290634e-15 -0.3501116
[5,]  7.067724e-01 -2.174194e-02 -0.3614199
[6,] -7.067724e-01  2.174194e-02 -0.3614199
[7,] -4.839878e-16  3.330669e-16 -0.3720749
[8,]  9.037909e-16  3.330669e-16 -0.3720749

Add an outgroup

Here I add a 9th “outgroup” to an even 8-tip tree to see what happens. I try eigen-decomposition of correlation matrix both with and without centering. In both cases the first eigen-vector has no weight on the 9th population. This might not get in the way of partition-based approaches because in both cases the first eigenvector partitions the samples consistent with the unrooted tree. However, it may complicate the issue of taking account of admixed populations; how to distinguish admixed vs an outgroup?

Sigma8_even = Sigma8_fun(c(1,1),rep(1,6),rep(1,6))
Sigma9 = rbind(cbind(1+Sigma8_even, rep(0,8)),rep(0,9))
Sigma9[9,9] = 1
Sigma9

      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9]
 [1,]    4    3    2    2    1    1    1    1    0
 [2,]    3    4    2    2    1    1    1    1    0
 [3,]    2    2    4    3    1    1    1    1    0
 [4,]    2    2    3    4    1    1    1    1    0
 [5,]    1    1    1    1    4    3    2    2    0
 [6,]    1    1    1    1    3    4    2    2    0
 [7,]    1    1    1    1    2    2    4    3    0
 [8,]    1    1    1    1    2    2    3    4    0
 [9,]    0    0    0    0    0    0    0    0    1

Sigma9_cor = cov2cor(Sigma9)
Sigma9_cor

      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9]
 [1,] 1.00 0.75 0.50 0.50 0.25 0.25 0.25 0.25    0
 [2,] 0.75 1.00 0.50 0.50 0.25 0.25 0.25 0.25    0
 [3,] 0.50 0.50 1.00 0.75 0.25 0.25 0.25 0.25    0
 [4,] 0.50 0.50 0.75 1.00 0.25 0.25 0.25 0.25    0
 [5,] 0.25 0.25 0.25 0.25 1.00 0.75 0.50 0.50    0
 [6,] 0.25 0.25 0.25 0.25 0.75 1.00 0.50 0.50    0
 [7,] 0.25 0.25 0.25 0.25 0.50 0.50 1.00 0.75    0
 [8,] 0.25 0.25 0.25 0.25 0.50 0.50 0.75 1.00    0
 [9,] 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00    1

eigen(Sigma9_cor)

eigen() decomposition
$values
[1] 3.75 1.75 1.00 0.75 0.75 0.25 0.25 0.25 0.25

$vectors
            [,1]       [,2] [,3]          [,4]          [,5]          [,6]
 [1,] -0.3535534  0.3535534    0  0.000000e+00  5.000000e-01  7.071068e-01
 [2,] -0.3535534  0.3535534    0 -2.296293e-16  5.000000e-01 -7.071068e-01
 [3,] -0.3535534  0.3535534    0  1.442622e-16 -5.000000e-01  1.665335e-16
 [4,] -0.3535534  0.3535534    0  3.633327e-16 -5.000000e-01  3.885781e-16
 [5,] -0.3535534 -0.3535534    0 -5.000000e-01 -7.632783e-16  2.775558e-16
 [6,] -0.3535534 -0.3535534    0 -5.000000e-01 -5.967449e-16  2.220446e-16
 [7,] -0.3535534 -0.3535534    0  5.000000e-01  1.804112e-16 -1.665335e-16
 [8,] -0.3535534 -0.3535534    0  5.000000e-01 -7.632783e-16  2.775558e-16
 [9,]  0.0000000  0.0000000    1  0.000000e+00  0.000000e+00  0.000000e+00
               [,7]          [,8]          [,9]
 [1,]  0.000000e+00  0.000000e+00  0.000000e+00
 [2,]  4.018513e-17  4.592587e-16 -5.740733e-18
 [3,]  8.357616e-03  1.342745e-01  6.941905e-01
 [4,] -8.357616e-03 -1.342745e-01 -6.941905e-01
 [5,] -6.977392e-01  1.139040e-01 -1.363166e-02
 [6,]  6.977392e-01 -1.139040e-01  1.363166e-02
 [7,] -1.144119e-01 -6.848330e-01  1.338420e-01
 [8,]  1.144119e-01  6.848330e-01 -1.338420e-01
 [9,]  0.000000e+00  0.000000e+00  0.000000e+00

eigen(Sigma9_cor-mean(Sigma9_cor))

eigen() decomposition
$values
[1]  1.7500000  1.7358440  0.7500000  0.7500000  0.2500000  0.2500000  0.2500000
[8]  0.2500000 -0.4302884

$vectors
               [,1]       [,2]          [,3]          [,4]          [,5]
 [1,] -3.535534e-01 -0.2540634  4.983457e-01 -4.063962e-02  1.342371e-02
 [2,] -3.535534e-01 -0.2540634  4.983457e-01 -4.063962e-02 -1.342371e-02
 [3,] -3.535534e-01 -0.2540634 -4.983457e-01  4.063962e-02 -6.078290e-01
 [4,] -3.535534e-01 -0.2540634 -4.983457e-01  4.063962e-02  6.078290e-01
 [5,]  3.535534e-01 -0.2540634  4.063962e-02  4.983457e-01  3.522214e-01
 [6,]  3.535534e-01 -0.2540634  4.063962e-02  4.983457e-01 -3.522214e-01
 [7,]  3.535534e-01 -0.2540634 -4.063962e-02 -4.983457e-01 -7.939662e-02
 [8,]  3.535534e-01 -0.2540634 -4.063962e-02 -4.983457e-01  7.939662e-02
 [9,] -3.019807e-14  0.6954238 -1.693090e-15  2.983724e-16 -8.153200e-17
               [,6]          [,7]          [,8]       [,9]
 [1,] -2.763869e-03  0.000000e+00  7.069739e-01 -0.2458694
 [2,]  2.763869e-03 -1.791617e-16 -7.069739e-01 -0.2458694
 [3,] -3.160485e-01 -1.747888e-01  1.030561e-02 -0.2458694
 [4,]  3.160485e-01  1.747888e-01 -1.030561e-02 -0.2458694
 [5,] -4.436719e-01 -4.231127e-01 -8.422330e-03 -0.2458694
 [6,]  4.436719e-01  4.231127e-01  8.422330e-03 -0.2458694
 [7,]  4.508447e-01 -5.389105e-01  3.270096e-03 -0.2458694
 [8,] -4.508447e-01  5.389105e-01 -3.270096e-03 -0.2458694
 [9,] -6.982262e-17 -7.597605e-17  1.110223e-15 -0.7185999

Relationship between distance and covariance

This section is a bit of a diversion and may be ignored… I just wanted to remind myself about the relationship between the pairwise distances and covariances.

Basically if \(D\) is a distance matrix, and \(L\) is the centering matrix (so premultiplying by \(L\) subtracts the) then \(W=-LDL'\) is symmetric positive semi-definite, and closely related to the covariance matrix. See Petkova et al for more discussion….

D4_fun = function(b){
  D4 = matrix(0,nrow=4, ncol=4) 
  D4[1:2,3:4] = b[2] + b[3] # top split
  D4[1,-1] = D4[1,-1] + b[4]
  D4[2,-2] = D4[2,-2] + b[5]
  D4[3,-3] = D4[3,-3] + b[6]
  D4[4,-4] = D4[4,-4] + b[7]
  D4 + t(D4)
}

D4_even = D4_fun(rep(1,7))
D4_even

     [,1] [,2] [,3] [,4]
[1,]    0    2    4    4
[2,]    2    0    4    4
[3,]    4    4    0    2
[4,]    4    4    2    0

L = diag(4) - matrix(1/4,nrow=4,ncol=4)
W = -L %*% D4_even %*% t(L)
W

     [,1] [,2] [,3] [,4]
[1,]  2.5  0.5 -1.5 -1.5
[2,]  0.5  2.5 -1.5 -1.5
[3,] -1.5 -1.5  2.5  0.5
[4,] -1.5 -1.5  0.5  2.5

0.5*W - Sigma4_even

      [,1]  [,2]  [,3]  [,4]
[1,] -0.75 -0.75 -0.75 -0.75
[2,] -0.75 -0.75 -0.75 -0.75
[3,] -0.75 -0.75 -0.75 -0.75
[4,] -0.75 -0.75 -0.75 -0.75

eigen(W)

eigen() decomposition
$values
[1] 6.000000e+00 2.000000e+00 2.000000e+00 5.329071e-15

$vectors
     [,1]          [,2]          [,3] [,4]
[1,] -0.5  7.071068e-01  0.000000e+00  0.5
[2,] -0.5 -7.071068e-01  6.824628e-17  0.5
[3,]  0.5 -9.992007e-16 -7.071068e-01  0.5
[4,]  0.5 -1.054712e-15  7.071068e-01  0.5

D4_unbalanced_noshare = D4_fun(c(0,2,3,4,5,6,7))
D4_unbalanced_noshare

     [,1] [,2] [,3] [,4]
[1,]    0    9   15   16
[2,]    9    0   16   17
[3,]   15   16    0   13
[4,]   16   17   13    0

W = -L %*% D4_unbalanced_noshare %*% t(L)
W

      [,1]  [,2]  [,3]  [,4]
[1,]  9.25  0.75 -4.75 -5.25
[2,]  0.75 10.25 -5.25 -5.75
[3,] -4.75 -5.25 11.25 -1.25
[4,] -5.25 -5.75 -1.25 12.25

eigen(W)

eigen() decomposition
$values
[1] 2.110302e+01 1.293894e+01 8.958048e+00 3.552714e-14

$vectors
           [,1]        [,2]        [,3] [,4]
[1,] -0.4561794  0.03209393  0.73543887  0.5
[2,] -0.5383516  0.05393445 -0.67621645  0.5
[3,]  0.4358975 -0.74741486 -0.03693718  0.5
[4,]  0.5586334  0.66138648 -0.02228524  0.5

sessionInfo()

R version 3.6.0 (2019-04-26)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS Mojave 10.14.6

Matrix products: default
BLAS:   /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRblas.0.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.5      rstudioapi_0.11 whisker_0.4     knitr_1.29     
 [5] magrittr_1.5    workflowr_1.6.2 R6_2.4.1        rlang_0.4.7    
 [9] stringr_1.4.0   tools_3.6.0     xfun_0.16       git2r_0.27.1   
[13] htmltools_0.5.0 ellipsis_0.3.1  yaml_2.2.1      digest_0.6.25  
[17] rprojroot_1.3-2 tibble_3.0.3    lifecycle_0.2.0 crayon_1.3.4   
[21] later_1.1.0.1   vctrs_0.3.2     fs_1.4.2        promises_1.1.1 
[25] glue_1.4.1      evaluate_0.14   rmarkdown_2.3   stringi_1.4.6  
[29] compiler_3.6.0  pillar_1.4.6    backports_1.1.8 httpuv_1.5.4   
[33] pkgconfig_2.0.3

tree_pca_02

Matthew Stephens

2020-07-29

Introduction

Tree-structure covariance - 4 tips

Even trees

The effects of centering

Balanced trees

Un-balanced trees

Un-balanced trees: correlation matrix

Tree-structure covariance - 8 tips

Balanced trees

Add an outgroup

Relationship between distance and covariance