Last updated: 2018-08-23
workflowr checks: ✔ R Markdown file: up-to-date
Great! Since the R Markdown file has been committed to the Git repository, you know the exact version of the code that produced these results.
✔ Environment: empty
Great job! The global environment was empty. Objects defined in the global environment can affect the analysis in your R Markdown file in unknown ways. For reproducibility it’s best to always run the code in an empty environment.
✔ Seed: set.seed(20180714)
The command set.seed(20180714) was run prior to running the code in the R Markdown file. Setting a seed ensures that any results that rely on randomness, e.g. subsampling or permutations, are reproducible.
✔ Session information: recorded
Great job! Recording the operating system, R version, and package versions is critical for reproducibility.
✔ Repository version: 9778769
Note that you need to ensure that all relevant files for the analysis have been committed to Git prior to generating the results (you can use wflow_publish or wflow_git_commit). workflowr only checks the R Markdown file, but you know if there are other scripts or data files that it depends on. Below is the status of the Git repository when the results were generated:
Ignored files:
Ignored: .DS_Store
Ignored: .Rhistory
Ignored: .Rproj.user/
Ignored: docs/.DS_Store
Ignored: docs/figure/.DS_Store
Untracked files:
Untracked: data/greedy19.rds
Note that any generated files, e.g. HTML, png, CSS, etc., are not included in this status report because it is ok for generated content to have uncommitted changes.
File | Version | Author | Date | Message |
---|---|---|---|---|
Rmd | 9778769 | Jason Willwerscheid | 2018-08-23 | wflow_publish(“analysis/arbitraryV.Rmd”) |
Here I examine whether it is possible to fit a FLASH model with an arbitrary error covariance matrix using an idea suggested here by Matthew Stephens.
That is, I want to fit the model
\[ Y = LF' + E, \]
where the columns of \(E\) are distributed i.i.d.
\[ E_{\bullet j} \sim N(0, V). \]
Equivalently, letting \(\lambda_{min}\) be the smallest eigenvalue of \(V\) and letting \(W = V - \lambda_{min} I_n\) (so that, in particular, \(W\) is positive semi-definite), write
\[ Y = LF' + E^{(1)} + E^{(2)}, \]
with the columns of \(E^{(1)}\) distributed i.i.d.
\[ E^{(1)}_{\bullet j} \sim N(0, W) \]
and the elements of \(E^{(2)}\) distributed i.i.d.
\[ E^{(2)}_{ij} \sim N(0, \lambda_{min}). \]
Notice that by taking the eigendecomposition of \(W\),
\[ W = \sum_{k = 1}^n \lambda_k w_k w_k', \]
and letting each \(f_k \in \mathbb{R}^p\) have i.i.d. entries
\[ f_{kj} \sim N(0, \lambda_k), \]
one can write
\[ E^{(1)} = w_1 f_1' + \ldots + w_n f_n'. \]
Thus, one should be able to fit the desired model by adding fixed loadings \(w_1, \ldots, w_n\), by fixing the priors on the corresponding factors at \(N(0, \lambda_1), \ldots, N(0, \lambda_n)\), and by taking \(\tau = 1 / \lambda_{min}\) (with var_type = "zero").
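As a quick numerical check of this decomposition (a minimal sketch, not part of the original analysis; the matrix V below is just an arbitrary positive-definite matrix), one can simulate \(E = E^{(1)} + E^{(2)}\) using the eigendecomposition of \(W\) and confirm that the empirical covariance of the columns of \(E\) is close to \(V\):
set.seed(1)
n <- 5
p <- 1e5
A <- matrix(rnorm(n^2), nrow=n)
V <- crossprod(A) / n
lambda.min <- min(eigen(V, symmetric=TRUE, only.values=TRUE)$values)
W.eigen <- eigen(V - diag(lambda.min, n), symmetric=TRUE)
# E1 = w_1 f_1' + ... + w_n f_n', where f_k has i.i.d. N(0, lambda_k) entries
# (max(lam, 0) guards against a tiny negative eigenvalue due to rounding):
F.mat <- sapply(W.eigen$values, function(lam) rnorm(p, sd=sqrt(max(lam, 0))))
E1 <- W.eigen$vectors %*% t(F.mat)
# E2 has i.i.d. N(0, lambda.min) entries:
E2 <- matrix(rnorm(n * p, sd=sqrt(lambda.min)), nrow=n)
E <- E1 + E2
# Largest absolute difference between the empirical column covariance and V
# (should be near zero for large p):
max(abs(tcrossprod(E) / p - V))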
First I need a function that will generate random covariance matrices. I normalize the matrices so that the largest eigenvalue is equal to one. Further, I ensure that the smallest eigenvalue is bounded below by some constant. (If the covariance matrix is poorly conditioned, then the final backfit can be very slow, and in practice, we would not expect these eigenvalues to be terribly small.)
rand.V <- function(n, lambda.min=0.25) {
A <- matrix(rnorm(n^2), nrow=n, ncol=n)
V <- A %*% t(A)
max.eigen <- max(eigen(V, symmetric=TRUE, only.values=TRUE)$values)
d <- max.eigen * lambda.min / (1 - lambda.min)
# Add diagonal matrix to improve conditioning and then normalize:
V <- (V + diag(rep(d, n))) / (max.eigen + d)
return(V)
}
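As a quick sanity check (illustrative only, not part of the original analysis), the largest eigenvalue of a matrix returned by rand.V should equal one and the smallest should be bounded below by lambda.min:
set.seed(1)
V.test <- rand.V(n=10, lambda.min=0.25)
range(eigen(V.test, symmetric=TRUE, only.values=TRUE)$values)
# Smallest eigenvalue is at least 0.25; largest is exactly 1.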
The next function simulates data from the rank-zero FLASH model \(Y = E\), with the columns of \(E\) distributed i.i.d. \(N(0, V)\).
sim.E <- function(V, p) {
n <- nrow(V)
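# MASS::mvrnorm draws p samples from N(0, V); transposing gives an n x p
# matrix whose columns are i.i.d. N(0, V):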
return(t(MASS::mvrnorm(p, rep(0, n), V)))
}
The following function fits a FLASH model using the approach outlined above.
fit.fixed.V <- function(Y, V, verbose=TRUE, backfit=FALSE, tol=1e-2) {
n <- nrow(V)
lambda.min <- min(eigen(V, symmetric=TRUE, only.values=TRUE)$values)
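# Fix the residual standard deviation at sqrt(lambda.min); with var_type = "zero"
# below, this fixed value is used as-is rather than being estimated: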
data <- flash_set_data(Y, S = sqrt(lambda.min))
W.eigen <- eigen(V - diag(rep(lambda.min, n)), symmetric=TRUE)
# The rank of W is at most n - 1, so we can drop the last eigenval/vec:
W.eigen$values <- W.eigen$values[-n]
W.eigen$vectors <- W.eigen$vectors[, -n, drop=FALSE]
fl <- flash_add_fixed_loadings(data, LL=W.eigen$vectors, init_fn="udv_svd")
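# Fix the prior on the factor paired with each fixed loading w_k at N(0, lambda_k)
# (ebnm_pn parameterizes the normal component by its precision a = 1/lambda_k):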
ebnm_param_f <- lapply(as.list(W.eigen$values),
function(eigenval) {
list(g = list(a=1/eigenval, pi0=0), fixg = TRUE)
})
ebnm_param_l <- lapply(vector("list", n - 1),
function(k) {list()})
fl <- flash_backfit(data, fl, var_type="zero", ebnm_fn="ebnm_pn",
ebnm_param=(list(f = ebnm_param_f, l = ebnm_param_l)),
nullcheck=FALSE, verbose=verbose, tol=tol)
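# Greedily add up to 50 more factor/loading pairs to capture any remaining structure: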
fl <- flash_add_greedy(data, Kmax=50, f_init=fl, var_type="zero",
init_fn="udv_svd", ebnm_fn="ebnm_pn",
verbose=verbose, tol=tol)
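# Optionally backfit all factor/loading pairs, warmstarting the priors for the
# newly added ones: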
if (backfit) {
n.added <- flash_get_k(fl) - (n - 1)
ebnm_param_f <- c(ebnm_param_f,
lapply(vector("list", n.added),
function(k) {list(warmstart=TRUE)}))
ebnm_param_l <- c(ebnm_param_l,
lapply(vector("list", n.added),
function(k) {list(warmstart=TRUE)}))
fl <- flash_backfit(data, fl, var_type="zero", ebnm_fn="ebnm_pn",
ebnm_param=(list(f = ebnm_param_f, l = ebnm_param_l)),
nullcheck=FALSE, verbose=verbose, tol=tol)
}
return(fl)
}
devtools::load_all("/Users/willwerscheid/GitHub/flashr/")
Loading flashr
devtools::load_all("/Users/willwerscheid/GitHub/ebnm/")
Loading ebnm
n <- 20
p <- 500
set.seed(666)
V <- rand.V(n=n)
Y <- sim.E(V, p=p)
fl <- fit.fixed.V(Y, V)
Backfitting 19 factor/loading(s) (stop when difference in obj. is < 1.00e-02):
Iteration Objective Obj Diff
1 -9917.69 Inf
2 -9917.69 0.00e+00
Fitting factor/loading 20 (stop when difference in obj. is < 1.00e-02):
Iteration Objective Obj Diff
1 -9948.16 Inf
2 -9932.50 1.57e+01
3 -9917.69 1.48e+01
4 -9917.69 0.00e+00
Performing nullcheck...
Deleting factor 20 increases objective by 4.66e-03. Factor zeroed out.
Nullcheck complete. Objective: -9917.69
Here, after backfitting the fixed loadings corresponding to the eigenvectors of \(W\), FLASH (correctly) fails to find any additional structure in the data. In contrast, fitting FLASH without paying attention to the fact that \(V \ne I\) gives misleading results:
bad.fl <- flash_add_greedy(Y, Kmax=50)
Fitting factor/loading 1 (stop when difference in obj. is < 1.00e-02):
Iteration Objective Obj Diff
1 -10046.63 Inf
2 -10034.53 1.21e+01
3 -10033.07 1.46e+00
4 -10032.20 8.68e-01
5 -10031.47 7.23e-01
6 -10030.91 5.61e-01
7 -10030.58 3.38e-01
8 -10030.30 2.79e-01
9 -10029.90 4.00e-01
10 -10029.34 5.58e-01
11 -10028.99 3.52e-01
12 -10028.89 9.62e-02
13 -10028.86 2.89e-02
14 -10028.85 1.47e-02
15 -10028.84 9.14e-03
Performing nullcheck...
Deleting factor 1 decreases objective by 3.89e+01. Factor retained.
Nullcheck complete. Objective: -10028.84
Fitting factor/loading 2 (stop when difference in obj. is < 1.00e-02):
Iteration Objective Obj Diff
1 -10000.11 Inf
2 -9988.22 1.19e+01
3 -9986.88 1.33e+00
4 -9986.23 6.52e-01
5 -9985.81 4.21e-01
6 -9985.49 3.16e-01
7 -9985.23 2.58e-01
8 -9985.01 2.21e-01
9 -9984.82 1.93e-01
10 -9984.65 1.68e-01
11 -9984.51 1.47e-01
12 -9984.37 1.32e-01
13 -9984.25 1.23e-01
14 -9984.13 1.21e-01
15 -9984.01 1.20e-01
16 -9983.89 1.16e-01
17 -9983.79 1.00e-01
18 -9983.72 7.57e-02
19 -9983.67 5.04e-02
20 -9983.64 3.13e-02
21 -9983.62 1.89e-02
22 -9983.61 1.14e-02
23 -9983.60 6.88e-03
Performing nullcheck...
Deleting factor 2 decreases objective by 4.52e+01. Factor retained.
Nullcheck complete. Objective: -9983.6
Fitting factor/loading 3 (stop when difference in obj. is < 1.00e-02):
Iteration Objective Obj Diff
1 -9976.87 Inf
2 -9963.19 1.37e+01
3 -9961.15 2.03e+00
4 -9960.17 9.81e-01
5 -9959.64 5.28e-01
6 -9959.14 5.03e-01
7 -9958.78 3.65e-01
8 -9958.62 1.57e-01
9 -9958.55 6.56e-02
10 -9958.52 3.34e-02
11 -9958.50 1.97e-02
12 -9958.49 1.26e-02
13 -9958.48 8.11e-03
Performing nullcheck...
Deleting factor 3 decreases objective by 2.51e+01. Factor retained.
Nullcheck complete. Objective: -9958.48
Fitting factor/loading 4 (stop when difference in obj. is < 1.00e-02):
Iteration Objective Obj Diff
1 -9981.09 Inf
2 -9965.90 1.52e+01
3 -9964.03 1.87e+00
4 -9963.31 7.24e-01
5 -9962.84 4.67e-01
6 -9962.46 3.87e-01
7 -9962.12 3.38e-01
8 -9961.82 2.97e-01
9 -9961.56 2.59e-01
10 -9961.34 2.24e-01
11 -9961.15 1.91e-01
12 -9960.99 1.60e-01
13 -9960.86 1.32e-01
14 -9960.75 1.08e-01
15 -9960.66 8.79e-02
16 -9960.59 7.14e-02
17 -9960.53 5.80e-02
18 -9960.48 4.72e-02
19 -9960.44 3.86e-02
20 -9960.41 3.16e-02
21 -9960.39 2.59e-02
22 -9960.37 2.13e-02
23 -9960.35 1.76e-02
24 -9960.33 1.46e-02
25 -9960.32 1.21e-02
26 -9960.31 1.00e-02
27 -9960.30 8.33e-03
Performing nullcheck...
Deleting factor 4 increases objective by 1.82e+00. Factor zeroed out.
Nullcheck complete. Objective: -9958.48
The following function simulates data from the rank-one FLASH model \(Y = \ell d f' + E\). The parameters pi0.l and pi0.f give the expected proportion of null entries in \(\ell\) and \(f\). Since \(\ell\) and \(f\) are normalized to have length one, \(d\) measures how large the factor/loading pair is, and thus how easy it is to find (recall that \(V\) is normalized so that its largest eigenvalue is equal to one).
sim.rank1 <- function(V, p, pi0.l=0.5, pi0.f=0.8, d=5^2) {
E <- sim.E(V, p)
n <- nrow(V)
# Nonnull entries of l are normally distributed:
l <- rnorm(n) * rbinom(n, 1, 1 - pi0.l)
# Nonnull entries of f are also normally distributed:
f <- rnorm(p) * rbinom(p, 1, 1 - pi0.f)
# Normalize l and f:
l <- l / sqrt(sum(l^2))
f <- f / sqrt(sum(f^2))
LF <- outer(l, f) * d
return(list(Y = LF + E, l = l, f = f))
}
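As a small illustration (not part of the original analysis), the normalization means that \(\ell\) and \(f\) have unit length and that \(\ell d f'\) has Frobenius norm \(d\):
set.seed(1)
dat <- sim.rank1(rand.V(n=20), p=500, d=25)
c(sqrt(sum(dat$l^2)), sqrt(sum(dat$f^2)))  # both equal to 1
sqrt(sum((outer(dat$l, dat$f) * 25)^2))    # Frobenius norm of l d f', equal to d = 25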
Here, the procedure outlined above correctly finds the additional rank-one structure. Running FLASH as is, however, yields structure of higher rank:
set.seed(999)
V <- rand.V(n=n)
data <- sim.rank1(V, p=p)
fl <- fit.fixed.V(data$Y, V)
Backfitting 19 factor/loading(s) (stop when difference in obj. is < 1.00e-02):
Iteration Objective Obj Diff
1 -10626.81 Inf
2 -10626.81 0.00e+00
Fitting factor/loading 20 (stop when difference in obj. is < 1.00e-02):
Iteration Objective Obj Diff
1 -10296.26 Inf
2 -10294.19 2.07e+00
3 -10294.15 4.51e-02
4 -10294.14 1.80e-03
Performing nullcheck...
Deleting factor 20 decreases objective by 3.33e+02. Factor retained.
Nullcheck complete. Objective: -10294.14
Fitting factor/loading 21 (stop when difference in obj. is < 1.00e-02):
Iteration Objective Obj Diff
1 -10326.73 Inf
2 -10299.97 2.68e+01
3 -10294.15 5.82e+00
4 -10294.15 0.00e+00
Performing nullcheck...
Deleting factor 21 increases objective by 2.53e-03. Factor zeroed out.
Nullcheck complete. Objective: -10294.14
bad.fl <- flash_add_greedy(data$Y, Kmax=50)
Fitting factor/loading 1 (stop when difference in obj. is < 1.00e-02):
Iteration Objective Obj Diff
1 -10290.91 Inf
2 -10276.06 1.48e+01
3 -10274.56 1.49e+00
4 -10274.09 4.75e-01
5 -10273.93 1.55e-01
6 -10273.88 4.93e-02
7 -10273.87 1.58e-02
8 -10273.86 5.21e-03
Performing nullcheck...
Deleting factor 1 decreases objective by 1.71e+02. Factor retained.
Nullcheck complete. Objective: -10273.86
Fitting factor/loading 2 (stop when difference in obj. is < 1.00e-02):
Iteration Objective Obj Diff
1 -10229.69 Inf
2 -10218.30 1.14e+01
3 -10217.19 1.11e+00
4 -10216.49 6.96e-01
5 -10215.86 6.26e-01
6 -10215.26 6.00e-01
7 -10214.68 5.81e-01
8 -10214.12 5.61e-01
9 -10213.59 5.35e-01
10 -10213.08 5.03e-01
11 -10212.62 4.66e-01
12 -10212.19 4.26e-01
13 -10211.81 3.84e-01
14 -10211.47 3.41e-01
15 -10211.17 3.01e-01
16 -10210.90 2.63e-01
17 -10210.68 2.28e-01
18 -10210.48 1.97e-01
19 -10210.31 1.69e-01
20 -10210.16 1.46e-01
21 -10210.04 1.25e-01
22 -10209.93 1.08e-01
23 -10209.84 9.37e-02
24 -10209.75 8.13e-02
25 -10209.68 7.09e-02
26 -10209.62 6.21e-02
27 -10209.57 5.47e-02
28 -10209.52 4.84e-02
29 -10209.48 4.30e-02
30 -10209.44 3.83e-02
31 -10209.40 3.43e-02
32 -10209.37 3.09e-02
33 -10209.34 2.78e-02
34 -10209.32 2.51e-02
35 -10209.30 2.28e-02
36 -10209.28 2.06e-02
37 -10209.26 1.87e-02
38 -10209.24 1.70e-02
39 -10209.22 1.55e-02
40 -10209.21 1.41e-02
41 -10209.20 1.28e-02
42 -10209.19 1.16e-02
43 -10209.18 1.06e-02
44 -10209.17 9.59e-03
Performing nullcheck...
Deleting factor 2 decreases objective by 6.47e+01. Factor retained.
Nullcheck complete. Objective: -10209.17
Fitting factor/loading 3 (stop when difference in obj. is < 1.00e-02):
Iteration Objective Obj Diff
1 -10173.09 Inf
2 -10161.29 1.18e+01
3 -10160.15 1.15e+00
4 -10159.47 6.78e-01
5 -10158.92 5.47e-01
6 -10158.46 4.61e-01
7 -10158.07 3.89e-01
8 -10157.75 3.25e-01
9 -10157.48 2.68e-01
10 -10157.26 2.17e-01
11 -10157.09 1.75e-01
12 -10156.95 1.39e-01
13 -10156.84 1.10e-01
14 -10156.75 8.74e-02
15 -10156.68 6.92e-02
16 -10156.63 5.47e-02
17 -10156.58 4.32e-02
18 -10156.55 3.50e-02
19 -10156.52 2.99e-02
20 -10156.49 2.66e-02
21 -10156.47 2.43e-02
22 -10156.44 2.25e-02
23 -10156.42 2.08e-02
24 -10156.40 1.89e-02
25 -10156.39 1.68e-02
26 -10156.37 1.43e-02
27 -10156.36 1.17e-02
28 -10156.35 9.28e-03
Performing nullcheck...
Deleting factor 3 decreases objective by 5.28e+01. Factor retained.
Nullcheck complete. Objective: -10156.35
Fitting factor/loading 4 (stop when difference in obj. is < 1.00e-02):
Iteration Objective Obj Diff
1 -10146.50 Inf
2 -10134.17 1.23e+01
3 -10133.44 7.32e-01
4 -10133.29 1.47e-01
5 -10133.24 5.09e-02
6 -10133.22 2.32e-02
7 -10133.21 1.20e-02
8 -10133.20 6.59e-03
Performing nullcheck...
Deleting factor 4 decreases objective by 2.31e+01. Factor retained.
Nullcheck complete. Objective: -10133.2
Fitting factor/loading 5 (stop when difference in obj. is < 1.00e-02):
Iteration Objective Obj Diff
1 -10156.31 Inf
2 -10142.29 1.40e+01
3 -10141.41 8.88e-01
4 -10141.26 1.48e-01
5 -10141.22 4.25e-02
6 -10141.20 1.89e-02
7 -10141.19 1.09e-02
8 -10141.18 7.15e-03
Performing nullcheck...
Deleting factor 5 increases objective by 7.98e+00. Factor zeroed out.
Nullcheck complete. Objective: -10133.2
To check that the new approach gives reasonable results, one can calculate the angle between the estimated \(\ell\) and the true \(\ell\) (and likewise for \(f\)):
ldf <- flash_get_ldf(fl, drop_zero_factors=FALSE)
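# Both vectors have unit length, so acos(abs(dot product)) gives the angle;
# abs() is used because the sign of a factor/loading pair is not identifiable.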
l.angle <- acos(abs(sum(ldf$l[, n] * data$l)))
f.angle <- acos(abs(sum(ldf$f[, n] * data$f)))
round(c(l.angle, f.angle), digits=2)
[1] 0.43 0.37
These results are not terrible, but an additional backfit can improve upon them:
fl.b <- fit.fixed.V(data$Y, V, verbose=FALSE, backfit=TRUE)
ldf <- flash_get_ldf(fl.b, drop_zero_factors=FALSE)
l.angle <- acos(abs(sum(ldf$l[, n] * data$l)))
f.angle <- acos(abs(sum(ldf$f[, n] * data$f)))
round(c(l.angle, f.angle), digits=2)
[1] 0.16 0.33
I include code below that can be used to verify that the above results are typical. Since these experiments can take a long time to run, I do not run them here.
rank0.experiment <- function(ntests, n, p, lambda.min=0.25, seeds=1:ntests) {
est.rank <- bad.rank <- rep(NA, ntests)
for (i in seq_along(seeds)) {
set.seed(seeds[i])
V <- rand.V(n, lambda.min)
Y <- sim.E(V, p)
fl <- fit.fixed.V(Y, V, verbose=FALSE)
k <- flash_get_k(fl)
est.rank[i] <- k - (n - 1)
bad.fl <- flash_add_greedy(Y, Kmax=50, verbose=FALSE)
bad.rank[i] <- flash_get_nfactors(bad.fl)
}
return(list(est.rank = est.rank, bad.rank = bad.rank))
}
rank1.experiment <- function(ntests, n, p, lambda.min=0.25, d=5^2,
seeds=1:ntests) {
est.rank <- bad.rank <- rep(NA, ntests)
l.angle <- f.angle <- rep(NA, ntests)
for (i in seq_along(seeds)) {
set.seed(seeds[i])
V <- rand.V(n, lambda.min)
data <- sim.rank1(V, p, d=d)
fl <- fit.fixed.V(data$Y, V, verbose=FALSE, backfit=TRUE)
k <- flash_get_k(fl)
est.rank[i] <- k - (n - 1)
ldf <- flash_get_ldf(fl, drop_zero_factors=FALSE)
if (est.rank[i] >= 1) {
l.angle[i] <- acos(abs(sum(ldf$l[, n] * data$l)))
f.angle[i] <- acos(abs(sum(ldf$f[, n] * data$f)))
}
bad.fl <- flash_add_greedy(data$Y, Kmax=50, verbose=FALSE)
bad.rank[i] <- flash_get_nfactors(bad.fl)
}
return(list(est.rank = est.rank, bad.rank = bad.rank,
l.angle = l.angle, f.angle = f.angle))
}
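For reference, a hypothetical invocation of these experiments (not run here, and not part of the original analysis; ntests = 20 is an arbitrary choice) might look like the following, with the same n and p as in the examples above:
# Not run (slow):
res0 <- rank0.experiment(ntests=20, n=20, p=500)
table(res0$est.rank)  # ideally concentrated at 0
table(res0$bad.rank)  # the naive fit tends to report spurious structure
res1 <- rank1.experiment(ntests=20, n=20, p=500)
table(res1$est.rank)  # ideally concentrated at 1
summary(res1$l.angle)
summary(res1$f.angle)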
I have set the parameters lambda.min and d favorably for this investigation. If lambda.min is closer to 1, then the errors will be more nearly independent, and the usual FLASH model will not fare so poorly. It would be worthwhile to investigate whether the approach detailed here beats the usual FLASH fit in such cases.
Further, I have set d to be quite large. In the above simulations, the true loading and factor are each five times larger (in terms of Euclidean length) than the largest eigenvalue of the error covariance matrix. It would be interesting to see what the detection threshold is as a function of n, p, and lambda.min.
Finally, notice that when \(\lambda_{min} = 1\), the approach detailed above is just the usual FLASH fit, so both of these proposed investigations would help to establish some continuity between the two.
sessionInfo()
R version 3.4.3 (2017-11-30)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS High Sierra 10.13.6
Matrix products: default
BLAS: /Library/Frameworks/R.framework/Versions/3.4/Resources/lib/libRblas.0.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.4/Resources/lib/libRlapack.dylib
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] ebnm_0.1-13 flashr_0.5-14
loaded via a namespace (and not attached):
[1] Rcpp_0.12.17 pillar_1.2.1 plyr_1.8.4
[4] compiler_3.4.3 git2r_0.21.0 workflowr_1.0.1
[7] R.methodsS3_1.7.1 R.utils_2.6.0 iterators_1.0.9
[10] tools_3.4.3 testthat_2.0.0 digest_0.6.15
[13] tibble_1.4.2 evaluate_0.10.1 memoise_1.1.0
[16] gtable_0.2.0 lattice_0.20-35 rlang_0.2.0
[19] Matrix_1.2-12 foreach_1.4.4 commonmark_1.4
[22] yaml_2.1.17 parallel_3.4.3 withr_2.1.1.9000
[25] stringr_1.3.0 roxygen2_6.0.1.9000 xml2_1.2.0
[28] knitr_1.20 devtools_1.13.4 rprojroot_1.3-2
[31] grid_3.4.3 R6_2.2.2 rmarkdown_1.8
[34] ggplot2_2.2.1 ashr_2.2-10 magrittr_1.5
[37] whisker_0.3-2 backports_1.1.2 scales_0.5.0
[40] codetools_0.2-15 htmltools_0.3.6 MASS_7.3-48
[43] assertthat_0.2.0 softImpute_1.4 colorspace_1.3-2
[46] stringi_1.1.6 lazyeval_0.2.1 munsell_0.4.3
[49] doParallel_1.0.11 pscl_1.5.2 truncnorm_1.0-8
[52] SQUAREM_2017.10-1 R.oo_1.21.0
This reproducible R Markdown analysis was created with workflowr 1.0.1