Last updated: 2018-10-22
workflowr checks:
✔ R Markdown file: up-to-date
Great! Since the R Markdown file has been committed to the Git repository, you know the exact version of the code that produced these results.
✔ Environment: empty
Great job! The global environment was empty. Objects defined in the global environment can affect the analysis in your R Markdown file in unknown ways. For reproducibility it’s best to always run the code in an empty environment.
✔ Seed: set.seed(1)
The command set.seed(1) was run prior to running the code in the R Markdown file. Setting a seed ensures that any results that rely on randomness, e.g. subsampling or permutations, are reproducible.
✔ Session information: recorded
Great job! Recording the operating system, R version, and package versions is critical for reproducibility.
✔ Repository version: 9abd1d0
Great! You are using Git for version control. Tracking code development and connecting the code version to the results is critical for reproducibility. The version displayed above was the version of the Git repository at the time these results were generated.
Note that you need to be careful to ensure that all relevant files for the analysis have been committed to Git prior to generating the results (you can use wflow_publish or wflow_git_commit). workflowr only checks the R Markdown file, but you know if there are other scripts or data files that it depends on. Below is the status of the Git repository when the results were generated:
Ignored files:
Ignored: .DS_Store
Ignored: .Rhistory
Ignored: .Rproj.user/
Ignored: analysis/.DS_Store
Ignored: analysis/.Rhistory
Ignored: analysis/include/.DS_Store
Ignored: code/.DS_Store
Ignored: data/.DS_Store
Ignored: docs/.DS_Store
Ignored: output/.DS_Store
Untracked files:
Untracked: analysis/Classify.Rmd
Untracked: analysis/EstimateCorMaxEMGD.Rmd
Untracked: analysis/EstimateCorMaxGD.Rmd
Untracked: analysis/HierarchicalFlashSim.Rmd
Untracked: analysis/MashLowSignalGTEx4.Rmd
Untracked: analysis/Mash_GTEx.Rmd
Untracked: analysis/MeanAsh.Rmd
Untracked: analysis/OutlierDetection.Rmd
Untracked: analysis/OutlierDetection2.Rmd
Untracked: analysis/OutlierDetection3.Rmd
Untracked: analysis/OutlierDetection4.Rmd
Untracked: analysis/mash_missing_row.Rmd
Untracked: code/GTExNullModel.R
Untracked: code/MashClassify.R
Untracked: code/MashCorResult.R
Untracked: code/MashCormVResult.R
Untracked: code/MashNULLCorResult.R
Untracked: code/MashSource.R
Untracked: code/Weight_plot.R
Untracked: code/addemV.R
Untracked: code/estimate_cor.R
Untracked: code/generateDataV.R
Untracked: code/johnprocess.R
Untracked: code/mV.R
Untracked: code/sim_mean_sig.R
Untracked: code/summary.R
Untracked: data/Blischak_et_al_2015/
Untracked: data/scale_data.rds
Untracked: docs/figure/Classify.Rmd/
Untracked: docs/figure/OutlierDetection.Rmd/
Untracked: docs/figure/OutlierDetection2.Rmd/
Untracked: docs/figure/OutlierDetection3.Rmd/
Untracked: docs/figure/Test.Rmd/
Untracked: docs/figure/mash_missing_whole_row_5.Rmd/
Untracked: docs/include/
Untracked: output/AddEMV/
Untracked: output/CovED_UKBio_strong.rds
Untracked: output/CovED_UKBio_strong_Z.rds
Untracked: output/Flash_UKBio_strong.rds
Untracked: output/GTExNULLres/
Untracked: output/GTEx_2.5_nullData.rds
Untracked: output/GTEx_2.5_nullModel.rds
Untracked: output/GTEx_2.5_nullPermData.rds
Untracked: output/GTEx_2.5_nullPermModel.rds
Untracked: output/GTEx_3.5_nullData.rds
Untracked: output/GTEx_3.5_nullModel.rds
Untracked: output/GTEx_3.5_nullPermData.rds
Untracked: output/GTEx_3.5_nullPermModel.rds
Untracked: output/GTEx_3_nullData.rds
Untracked: output/GTEx_3_nullModel.rds
Untracked: output/GTEx_3_nullPermData.rds
Untracked: output/GTEx_3_nullPermModel.rds
Untracked: output/GTEx_4.5_nullData.rds
Untracked: output/GTEx_4.5_nullModel.rds
Untracked: output/GTEx_4.5_nullPermData.rds
Untracked: output/GTEx_4.5_nullPermModel.rds
Untracked: output/GTEx_4_nullData.rds
Untracked: output/GTEx_4_nullModel.rds
Untracked: output/GTEx_4_nullPermData.rds
Untracked: output/GTEx_4_nullPermModel.rds
Untracked: output/MASH.10.em2.result.rds
Untracked: output/MASH.10.mle.result.rds
Untracked: output/MashCorSim--midway/
Untracked: output/Mash_EE_Cov_0_plusR1.rds
Untracked: output/UKBio_mash_model.rds
Untracked: output/mVIterations/
Untracked: output/mVUlist/
Untracked: output/result.em.rds
Unstaged changes:
Modified: analysis/EstimateCorMaxEM2.Rmd
Modified: analysis/Mash_UKBio.Rmd
Modified: analysis/mash_missing_samplesize.Rmd
Modified: output/Flash_T2_0.rds
Modified: output/Flash_T2_0_mclust.rds
Modified: output/Mash_model_0_plusR1.rds
Modified: output/PresiAddVarCol.rds
Note that any generated files, e.g. HTML, png, CSS, etc., are not included in this status report because it is ok for generated content to have uncommitted changes.
File | Version | Author | Date | Message |
---|---|---|---|---|
Rmd | 9abd1d0 | zouyuxin | 2018-10-22 | wflow_publish("analysis/EstimateCorMaxMV.Rmd") |
html | 3889d07 | zouyuxin | 2018-10-22 | Build site. |
Rmd | 6a7014a | zouyuxin | 2018-10-22 | wflow_publish("analysis/EstimateCorMaxMV.Rmd") |
html | 8f39f1c | zouyuxin | 2018-10-09 | Build site. |
Rmd | 52d66f3 | zouyuxin | 2018-10-09 | wflow_publish("analysis/EstimateCorMaxMV.Rmd") |
html | 4e61be5 | zouyuxin | 2018-10-09 | Build site. |
Rmd | cd82d7e | zouyuxin | 2018-10-09 | wflow_publish("analysis/EstimateCorMaxMV.Rmd") |
html | e3b067e | zouyuxin | 2018-10-09 | Build site. |
Rmd | 032212c | zouyuxin | 2018-10-09 | wflow_publish("analysis/EstimateCorMaxMV.Rmd") |
library(mashr)
Loading required package: ashr
source('../code/generateDataV.R')
source('../code/summary.R')
We use an EM algorithm to update \(\rho\), the off-diagonal correlation parameters of \(V\).
\(B\) is the \(n\times R\) matrix of true effects, with rows \(b_{i}\), and \(\hat{B}\) is the corresponding matrix of observed estimates \(\hat{b}_{i}\). \(\mathbf{z}\) is a length-\(n\) vector.
\[ P(\hat{B},B|\rho, \pi) = \prod_{i=1}^{n} \left[N(\hat{b}_{i}; b_{i}, S_{i}VS_{i})\sum_{p=0}^{P} \pi_{p} N(b_{i}; 0, \Sigma_{p})\right] \]
\[ \begin{align*} \mathbb{E}_{B|\hat{B}} \log P(\hat{B},B|\rho, \pi) &= \sum_{i=1}^{n} \mathbb{E}_{b_{i}|\hat{b}_{i}}\left[ \log N(\hat{b}_{i}; b_{i}, S_{i}VS_{i}) + \log \sum_{p=0}^{P} \pi_{p} N(b_{i}; 0, \Sigma_{p}) \right] \\ &= \sum_{i=1}^{n} \mathbb{E}_{b_{i}|\hat{b}_{i}}\log N(\hat{b}_{i}; b_{i}, S_{i}VS_{i}) + \sum_{i=1}^{n}\mathbb{E}_{b_{i}|\hat{b}_{i}}\log \sum_{p=0}^{P} \pi_{p} N(b_{i}; 0, \Sigma_{p}) \end{align*} \]
Only the first term depends on \(V\). Let \(\mu_{i} = \mathbb{E}_{b_{i}|\hat{b}_{i}}(b_{i})\). \[ \begin{align*} \log N(\hat{b}_{i}; b_{i}, S_{i}VS_{i}) &= -\frac{R}{2}\log 2\pi -\frac{1}{2}\log |S_{i}VS_{i}| - \frac{1}{2}(\hat{b}_{i}-b_{i})^{T}S_{i}^{-1}V^{-1}S_{i}^{-1}(\hat{b}_{i}-b_{i}) \\ &= -\frac{R}{2}\log 2\pi -\frac{1}{2}\log |S_{i}VS_{i}| - \frac{1}{2}\hat{b}_{i}^{T}S_{i}^{-1}V^{-1}S_{i}^{-1}\hat{b}_{i} + \frac{1}{2}b_{i}^{T}S_{i}^{-1}V^{-1}S_{i}^{-1}\hat{b}_{i} + \frac{1}{2}\hat{b}_{i}^{T}S_{i}^{-1}V^{-1}S_{i}^{-1}b_{i} -\frac{1}{2}b_{i}^{T}S_{i}^{-1}V^{-1}S_{i}^{-1}b_{i} \\ \mathbb{E}_{b_{i}|\hat{b}_{i}} \log N(\hat{b}_{i}; b_{i}, S_{i}VS_{i}) &= -\frac{R}{2}\log 2\pi -\frac{1}{2}\log |S_{i}VS_{i}| - \frac{1}{2}\hat{b}_{i}^{T}S_{i}^{-1}V^{-1}S_{i}^{-1}\hat{b}_{i} + \frac{1}{2}\mu_{i}^{T}S_{i}^{-1}V^{-1}S_{i}^{-1}\hat{b}_{i} + \frac{1}{2}\hat{b}_{i}^{T}S_{i}^{-1}V^{-1}S_{i}^{-1}\mu_{i} -\frac{1}{2}tr(S_{i}^{-1}V^{-1}S_{i}^{-1}\mathbb{E}_{b_{i}|\hat{b}_{i}}(b_{i}b_{i}^{T})) \end{align*} \]
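In practice, \(\mu_{i}\) and the posterior second moment \(\mathbb{E}_{b_{i}|\hat{b}_{i}}(b_{i}b_{i}^{T})\) come from the current mash fit: \(\mu_{i}\) is the posterior mean, and the second moment is the posterior covariance plus \(\mu_{i}\mu_{i}^{T}\). A minimal sketch (assuming a mash fit m.model obtained with outputlevel = 3, as in fit_mash_V below):
# Posterior quantities needed for the E-step, taken from an existing mash fit.
post.mean = m.model$result$PosteriorMean            # rows are mu_i = E(b_i | bhat_i); n x R matrix
post.cov  = m.model$result$PosteriorCov              # R x R x n array of Cov(b_i | bhat_i)
i = 1                                                # e.g. the first effect
second.moment.i = post.cov[,,i] + tcrossprod(post.mean[i,])   # E(b_i b_i^T | bhat_i)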
Maximize with respect to V:
There is a constraint on \(V\): its diagonal must equal 1. Write \(V = DCD\), where \(C\) is a covariance matrix and \(D = diag(1/\sqrt{C_{jj}})\).
\[ f(C) = \sum_{i=1}^{n} -\frac{R}{2}\log 2\pi -\frac{1}{2}\log |C|- \log |DS_{i}| - \frac{1}{2}\hat{b}_{i}^{T}S_{i}^{-1}D^{-1}C^{-1}D^{-1}S_{i}^{-1}\hat{b}_{i} + \frac{1}{2}\mu_{i}^{T}S_{i}^{-1}D^{-1}C^{-1}D^{-1}S_{i}^{-1}\hat{b}_{i} + \frac{1}{2}\hat{b}_{i}^{T}S_{i}^{-1}D^{-1}C^{-1}D^{-1}S_{i}^{-1}\mu_{i} -\frac{1}{2}tr(S_{i}^{-1}D^{-1}C^{-1}D^{-1}S_{i}^{-1}\mathbb{E}_{b_{i}|\hat{b}_{i}}(b_{i}b_{i}^{T})) \]
\[ \begin{align*} f(C)' &= \sum_{i=1}^{n} -\frac{1}{2}C^{-1} + \frac{1}{2}C^{-1}D^{-1}S_{i}^{-1}\hat{b}_{i}\hat{b}_{i}^{T}S_{i}^{-1}D^{-1}C^{-1} - \frac{1}{2} C^{-1}D^{-1}S_{i}^{-1}\mu_{i}\hat{b}_{i}^{T}S_{i}^{-1}D^{-1}C^{-1} - \frac{1}{2}C^{-1}D^{-1}S_{i}^{-1}\hat{b}_{i}\mu_{i}^{T}S_{i}^{-1}D^{-1}C^{-1} + \frac{1}{2} C^{-1}D^{-1}S_{i}^{-1}\mathbb{E}(b_{i}b_{i}^{T}|\hat{b}_{i})S_{i}^{-1}D^{-1}C^{-1} = 0 \\ 0 &= \sum_{i=1}^{n} -\frac{1}{2}C + \frac{1}{2}D^{-1}S_{i}^{-1}\hat{b}_{i}\hat{b}_{i}^{T}S_{i}^{-1}D^{-1} - \frac{1}{2}D^{-1}S_{i}^{-1}\mu_{i}\hat{b}_{i}^{T}S_{i}^{-1}D^{-1} - \frac{1}{2}D^{-1}S_{i}^{-1}\hat{b}_{i}\mu_{i}^{T}S_{i}^{-1}D^{-1} + \frac{1}{2} D^{-1}S_{i}^{-1}\mathbb{E}(b_{i}b_{i}^{T}|\hat{b}_{i})S_{i}^{-1}D^{-1} \\ \hat{C} &= \frac{1}{n} \sum_{i=1}^{n} \left[D^{-1}S_{i}^{-1}\hat{b}_{i}\hat{b}_{i}^{T}S_{i}^{-1}D^{-1} - D^{-1}S_{i}^{-1}\mu_{i}\hat{b}_{i}^{T}S_{i}^{-1}D^{-1} - D^{-1}S_{i}^{-1}\hat{b}_{i}\mu_{i}^{T}S_{i}^{-1}D^{-1} + D^{-1}S_{i}^{-1}\mathbb{E}(b_{i}b_{i}^{T}|\hat{b}_{i})S_{i}^{-1}D^{-1} \right] \\ &= \frac{1}{n} \sum_{i=1}^{n} \mathbb{E}\left[ (D^{-1}S_{i}^{-1}(\hat{b}_{i} - b_{i}))(D^{-1}S_{i}^{-1}(\hat{b}_{i} - b_{i}))^{T} | \hat{b}_{i}\right] \\ &= D^{-1}\frac{1}{n} \sum_{i=1}^{n} S_{i}^{-1}\mathbb{E}\left[ (\hat{b}_{i} - b_{i})(\hat{b}_{i} - b_{i})^{T} | \hat{b}_{i}\right] S_{i}^{-1} D^{-1} \end{align*} \]
We can update C and V as \[ \hat{C}_{(t+1)} = \hat{D}^{-1}_{(t)}\frac{1}{n} \left[\sum_{i=1}^{n} S_{i}^{-1}\mathbb{E}\left[ (\hat{b}_{i} - b_{i})(\hat{b}_{i} - b_{i})^{T} | \hat{b}_{i}\right]S_{i}^{-1} \right] \hat{D}^{-1}_{(t)} \\ \hat{D}_{(t+1)} = diag(1/\sqrt{\hat{C}_{(t+1)jj}}) \\ \hat{V}_{(t+1)} = \hat{D}_{(t+1)}\hat{C}_{(t+1)}\hat{D}_{(t+1)} \]
The resulting \(\hat{V}_{(t+1)}\) is equivalently obtained from \[ \hat{C}_{(t+1)} = \frac{1}{n}\sum_{i=1}^{n} S_{i}^{-1}\mathbb{E}\left[ (\hat{b}_{i} - b_{i})(\hat{b}_{i} - b_{i})^{T} | \hat{b}_{i}\right]S_{i}^{-1} \\ \hat{D}_{(t+1)} = diag(1/\sqrt{\hat{C}_{(t+1)jj}}) \\ \hat{V}_{(t+1)} = \hat{D}_{(t+1)}\hat{C}_{(t+1)}\hat{D}_{(t+1)} \]
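In code, the renormalization by \(\hat{D}\) is simply cov2cor(): rescaling \(\hat{C}\) to unit diagonal gives \(\hat{V}\), which is why the implementation below applies cov2cor() to the output of E_V. A small sketch with hypothetical numbers:
# Rescaling C to unit diagonal via D C D is exactly cov2cor(C).
C = matrix(c(2, 0.8, 0.8, 1.5), 2, 2)       # a toy unnormalized "C" (illustrative values only)
D = diag(1/sqrt(diag(C)))
V.manual  = D %*% C %*% D                    # D C D, as in the update above
V.cov2cor = cov2cor(C)
all.equal(unname(V.manual), unname(V.cov2cor))   # TRUE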
It is hard to estimate \(\boldsymbol{\pi}\) from the second term, \(\sum_{i=1}^{n}\mathbb{E}_{b_{i}|\hat{b}_{i}}\log \sum_{p=0}^{P} \pi_{p} N(b_{i}; 0, \Sigma_{p})\).
Instead, given \(V\), we estimate \(\boldsymbol{\pi}\) by maximizing the marginal log likelihood, which is a convex problem.
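To make the weight-update step concrete, here is an illustrative sketch (not the mashr implementation; the hypothetical helper estimate_pi is introduced here for exposition only) of the classical EM fixed-point updates for the mixture weights, given a precomputed likelihood matrix. In the implementation below this step is delegated to mashr::mash, which estimates the weights internally for the fixed \(V\).
# Illustrative sketch: EM updates for mixture weights pi given a fixed n x P
# matrix of component likelihoods L[i, p] = N(bhat_i; 0, S_i V S_i + Sigma_p).
estimate_pi = function(L, n.iter = 100){
  P = ncol(L)
  pi.hat = rep(1/P, P)                  # start from uniform weights
  for(iter in 1:n.iter){
    w = sweep(L, 2, pi.hat, `*`)        # weight each component by its current pi
    w = w / rowSums(w)                  # posterior responsibilities for each sample
    pi.hat = colMeans(w)                # update: average responsibilities
  }
  pi.hat
}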
Algorithm:
Input: X, Ulist, init_V
Given V, estimate pi by max loglikelihood (convex problem)
Compute loglikelihood
delta = 1
while delta > tol:
    M step: update C
    Convert to V
    Given V, estimate pi by max loglikelihood (convex problem)
    Compute loglikelihood
    Update delta
penalty <- function(prior, pi_s){
  subset <- (prior != 1.0)
  sum((prior-1)[subset]*log(pi_s[subset]))
}
mixture.MV <- function(mash.data, Ulist, init_V=diag(ncol(mash.data$Bhat)), max_iter = 500, tol=1e-5, prior = c('nullbiased', 'uniform'), cor = TRUE, track_fit = FALSE){
  prior <- match.arg(prior)
  tracking = list()
  m.model = fit_mash_V(mash.data, Ulist, V = init_V, prior=prior)
  pi_s = get_estimated_pi(m.model, dimension = 'all')
  prior.v <- mashr:::set_prior(length(pi_s), prior)
  # compute loglikelihood
  log_liks <- numeric(max_iter+1)
  log_liks[1] <- get_loglik(m.model)+penalty(prior.v, pi_s)
  V = init_V
  result = list(V = V, logliks = log_liks[1], mash.model = m.model)
  for(i in 1:max_iter){
    if(track_fit){
      tracking[[i]] = result
    }
    # max_V
    V = E_V(mash.data, m.model)
    if(cor){
      V = cov2cor(V)
    }
    m.model = fit_mash_V(mash.data, Ulist, V, prior=prior)
    pi_s = get_estimated_pi(m.model, dimension = 'all')
    log_liks[i+1] <- get_loglik(m.model)+penalty(prior.v, pi_s)
    result = list(V = V, logliks = log_liks[1:(i+1)], mash.model = m.model)
    # Update delta
    delta.ll <- log_liks[i+1] - log_liks[i]
    if(delta.ll<=tol) break;
  }
  if(track_fit){
    result$trace = tracking
  }
  return(result)
}
E_V = function(mash.data, m.model){
  # M-step: compute the unnormalized update
  #   C = (1/n) sum_i S_i^{-1} E[(bhat_i - b_i)(bhat_i - b_i)^T | bhat_i] S_i^{-1}
  # using the posterior mean and covariance from the current mash fit.
  n = mashr:::n_effects(mash.data)
  Z = mash.data$Bhat/mash.data$Shat
  post.m.shat = m.model$result$PosteriorMean / mash.data$Shat      # mu_i scaled by S_i^{-1}
  post.sec.shat = plyr::laply(1:n, function(i) (t(m.model$result$PosteriorCov[,,i]/mash.data$Shat[i,])/mash.data$Shat[i,]) + tcrossprod(post.m.shat[i,])) # scaled E(b_i b_i^T | bhat_i); n x R x R array
  temp1 = crossprod(Z)                                             # sum_i S_i^{-1} bhat_i bhat_i^T S_i^{-1}
  temp2 = crossprod(post.m.shat, Z) + crossprod(Z, post.m.shat)    # cross terms involving mu_i
  temp3 = unname(plyr::aaply(post.sec.shat, c(2,3), sum))          # sum of scaled second moments
  (temp1 - temp2 + temp3)/n
}
fit_mash_V <- function(mash.data, Ulist, V, prior=c('nullbiased', 'uniform')){
  m.data = mashr::mash_set_data(Bhat=mash.data$Bhat, Shat=mash.data$Shat, V = V, alpha = mash.data$alpha)
  m.model = mashr::mash(m.data, Ulist, prior=prior, verbose = FALSE, outputlevel = 3)
  return(m.model)
}
set.seed(1)
n = 4000; p = 2
Sigma = matrix(c(1,0.5,0.5,1),p,p)
U0 = matrix(0,2,2)
U1 = U0; U1[1,1] = 1
U2 = U0; U2[2,2] = 1
U3 = matrix(1,2,2)
Utrue = list(U0=U0, U1=U1, U2=U2, U3=U3)
data = generate_data(n, p, Sigma, Utrue)
m.data = mash_set_data(data$Bhat, data$Shat)
U.c = cov_canonical(m.data)
result.mV <- mixture.MV(m.data, U.c)
The estimated \(V\) is
result.mV$V
[,1] [,2]
[1,] 1.0000000 0.5087773
[2,] 0.5087773 1.0000000
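As a sanity check (a sketch, not part of the original output): the true residual correlation used in the simulation is 0.5 (the off-diagonal of Sigma), whereas the naive correlation of the observed z-scores mixes the residual correlation with correlation induced by the non-null effects, so it need not match the EM estimate.
# Compare the EM estimate with the truth and with the naive z-score correlation.
Sigma[1,2]                          # true residual correlation used to simulate the data
cor(m.data$Bhat / m.data$Shat)      # naive correlation of z-scores (signal + noise)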
m.mV = result.mV$mash.model
null.ind = which(apply(data$B,1,sum) == 0)
The log likelihood is -12302.52. There are 26 significant samples and 0 false positives. The RRMSE is 0.5822283.
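The code producing these summary numbers is not shown above; a plausible reconstruction using objects defined earlier is sketched below (the helpers actually used, e.g. those in code/summary.R, may differ, and the RRMSE line uses one common definition: the RMSE of the posterior means relative to the RMSE of the raw estimates).
# Sketch only: plausible equivalents of the hidden summary computations.
get_loglik(m.mV)                                   # log likelihood of the final mash fit
sig.ind = get_significant_results(m.mV)            # effects significant at the default lfsr threshold
length(sig.ind)                                    # number of significant samples
sum(sig.ind %in% null.ind)                         # false positives: significant but truly null
sqrt(mean((data$B - m.mV$result$PosteriorMean)^2) / mean((data$B - data$Bhat)^2))  # RRMSE (assumed definition)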
sessionInfo()
R version 3.5.1 (2018-07-02)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS 10.14
Matrix products: default
BLAS: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRblas.0.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRlapack.dylib
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] mashr_0.2.18.0454 ashr_2.2-7
loaded via a namespace (and not attached):
[1] Rcpp_0.12.19 knitr_1.20 whisker_0.3-2
[4] magrittr_1.5 workflowr_1.1.1 REBayes_1.3
[7] MASS_7.3-50 pscl_1.5.2 doParallel_1.0.14
[10] SQUAREM_2017.10-1 lattice_0.20-35 foreach_1.4.4
[13] plyr_1.8.4 stringr_1.3.1 tools_3.5.1
[16] parallel_3.5.1 grid_3.5.1 R.oo_1.22.0
[19] rmeta_3.0 git2r_0.23.0 htmltools_0.3.6
[22] iterators_1.0.10 assertthat_0.2.0 abind_1.4-5
[25] yaml_2.2.0 rprojroot_1.3-2 digest_0.6.18
[28] Matrix_1.2-14 codetools_0.2-15 R.utils_2.7.0
[31] evaluate_0.12 rmarkdown_1.10 stringi_1.2.4
[34] compiler_3.5.1 Rmosek_8.0.69 backports_1.1.2
[37] R.methodsS3_1.7.1 mvtnorm_1.0-8 truncnorm_1.0-8
This reproducible R Markdown analysis was created with workflowr 1.1.1