Last updated: 2019-01-12


Update 1/12/19: There are multiple errors in the formulae derived below, which I haven’t bothered to correct. For the correct formulae, see flashier (in particular, see factor_init.R, factor_update.R, tau.R, and objective.R). Once one becomes accustomed to n-mode products (which are used throughout flashier), the code reads much more easily than this document.

Introduction

Currently, FLASH does not perform well on large datasets. The reason is that it maintains and manipulates at least three matrices (Rk, R2, and R2k), each of which is the same size as the data matrix Y. When var_type is by_row or by_column, a fourth large matrix, tau, is maintained, and when there is missing data, still another large matrix is stored (a copy of Y with missing data imputed as zero).

Let \(Y \in \mathbb{R}^{n \times p}\), and let \(k\) be the rank of the flash fit. I show here that when \(k^2 < \min(n, p)\), FLASH can be implemented without maintaining any matrices of the same dimensions as \(Y\). Even when \(k\) is large, such a matrix needs to be created only once during factor updates and once during loading updates, and it is needed only temporarily.

I outline two improved implementations. In the first, the only \(n \times p\) matrices that are required are the data matrix \(Y\) with NAs set to zero and, if there is missing data, a matrix \(Z\) with zeroes where data is missing and ones elsewhere. The original data matrix is not needed. I never modify or transpose \(Y\) or \(Z\), so the total memory requirements should not be much larger than the memory needed to store \(Y\) and \(Z\).

The second assumes that one is willing to maintain a matrix of residuals \(R\), initialized at \(Y\) (\(Z\) is kept to ensure that \(R\) has zeroes where data is missing). The primary disadvantages are that one cannot take advantage of sparsity in \(Y\) and that some care is required to update \(R\) correctly and at the right time, but this approach is potentially much faster than the first. (Update 1/12/19: this approach rarely provides a noticeable speedup over the first.)

Initializing factors

To initialize a new factor, we currently form a matrix of residuals, run an SVD-type algorithm on it, and then initialize using the leading singular vector. However, we don’t need to resort to an external package here. Since we only require one singular vector, the minimization problem is simple. When there is no missing data, the problem is to solve: \[ (l_k, f_k) = {\arg \min}_{l_k, f_k} \| R_{-k} - l_k f_k' \|_F^2 = {\arg \min}_{l_k, f_k} \| Y - L_{-k}F_{-k}' - l_k f_k' \|_F^2 \] It is easy to derive an alternating algorithm by differentiating with respect to \(l_k\) and setting the result equal to zero: \[ l_k = \frac{R_{-k}f_k}{f_k^Tf_k} = \frac{Yf_k - L_{-k}(F_{-k}'f_k)}{f_k^Tf_k} \] And similarly: \[ f_k^T = \frac{l_k^TR_{-k}}{l_k^Tl_k} = \frac{l_k^TY - (l_k^TL_{-k})F_{-k}'}{l_k^Tl_k} \]

Each of these updates can be performed very quickly; further, since we don’t need to solve the problem exactly, we can stop early. The same procedure is likely sufficient to (approximately) initialize factors when there is not a lot of missing data.
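For concreteness, here is a minimal R sketch of these alternating updates; the function name, the random initialization, and the fixed iteration count are illustrative choices, not flashr’s API:

    # Initialize a new factor by alternating least squares (no missing data).
    # Y is n x p; L (n x k) and F (p x k) hold the existing loadings and factors.
    init_new_factor <- function(Y, L, F, n_iter = 10) {
      l <- rnorm(nrow(Y))  # random starting loading
      for (i in seq_len(n_iter)) {
        # f_k' = l_k' R_{-k} / l_k' l_k, with R_{-k} = Y - L F' formed implicitly
        f <- drop(crossprod(Y, l) - F %*% crossprod(L, l)) / sum(l^2)
        # l_k = R_{-k} f_k / f_k' f_k
        l <- drop(Y %*% f - L %*% crossprod(F, f)) / sum(f^2)
      }
      list(l = l, f = f)
    }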

But we can also derive efficient updates that are exact for missing data. In this case, the problem is to solve: \[ (l_k, f_k) = {\arg \min}_{l_k, f_k} \| Z \odot (R_{-k} - l_k f_k') \|_F^2 = {\arg \min}_{l_k, f_k} \| Y - Z \odot L_{-k}F_{-k}' - Z \odot l_k f_k' \|_F^2 \]

The terms on the right-hand side that depend on \(l_k^{(i)}\) are \[\begin{aligned} -2 &(Z \odot R_{-k})_{i \bullet}(Z_{i \bullet}^T \odot l_k^{(i)} f_k) + (Z_{i \bullet}^T \odot l_k^{(i)} f_k)^T (Z_{i \bullet}^T \odot l_k^{(i)} f_k) \\ &= -2l_k^{(i)} (Z_{i \bullet} \odot (R_{-k})_{i \bullet})f_k + l_k^{(i)2} (Z_{i \bullet}^T \odot f_k)^T f_k \end{aligned}\] Differentiating and setting the result equal to zero gives (in vector form) \[ l_k = (Z \odot R_{-k})f_k / Zf_k^2, \] where division and squaring are elementwise. Write \[ (Z \odot R_{-k})f_k = Yf_k - (Z \odot L_{-k}F_{-k}')f_k\] and note that the \(i\)th entry of \((Z \odot L_{-k}F_{-k}')f_k\) is \[\sum_j z_{ij} \sum_{\ell: \ell \ne k} l_{i \ell} f_{j \ell} f_{jk} = \sum_{\ell: \ell \ne k} l_{i \ell} \sum_j z_{ij} f_{j \ell} f_{jk} = \sum_{\ell: \ell \ne k} l_{i \ell} \sum_j z_{ij} \tilde{f}_{j \ell},\] where \(\tilde{F} = f_k \odot_b F_{-k}\), with \(\odot_b\) denoting elementwise multiplication using broadcasting (that is, the \(\ell\)th column of \(\tilde{F}\) is formed by taking the elementwise product of \(f_k\) and the \(\ell\)th column of \(F_{-k}\)). Thus, \[ (Z \odot R_{-k})f_k = Yf_k - \text{rowSums}(L_{-k} \odot Z(f_k \odot_b F_{-k}))\]

In sum, exact updates can be obtained for missing data by alternately setting \[ l_k = (Yf_k - \text{rowSums}(L_{-k} \odot Z(f_k \odot_b F_{-k}))) / Zf_k^2\] and \[ f_k^T = (l_k^T Y - \text{colSums}(F_{-k}^T \odot (l_k \odot_b L_{-k})^TZ)) / l_k^{2T}Z \]
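As a minimal sketch of the exact \(l_k\) update (illustrative names; L_mk and F_mk stand for \(L_{-k}\) and \(F_{-k}\), Y has NAs set to zero, and Z is one where data is observed and zero elsewhere):

    update_lk <- function(Y, Z, L_mk, F_mk, fk) {
      Ftilde <- fk * F_mk                          # f_k broadcast into the columns of F_{-k}
      numer  <- drop(Y %*% fk) - rowSums(L_mk * (Z %*% Ftilde))
      denom  <- drop(Z %*% fk^2)                   # Z f_k^2 (elementwise square)
      numer / denom
    }

The \(f_k\) update is symmetric, with the roles of rows and columns reversed.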

This algorithm requires three matrix multiplications involving an \(n \times p\) matrix, with a flop count of \(2(k + 1)np\) per update. The algorithm for no missing data requires only one, with a flop count of \(2np\) per update, so we can expect this initialization to be slower by a factor of \(k + 1\). Still, note that the matrix of residuals never needs to be formed. Further, if there is a lot of missing data, then \(Y\) and \(Z\) can be stored as sparse matrices. In that case, the largest dense matrix that needs to be manipulated will be of dimension \(\max(n, p) \times (k - 1)\).

Alternative implementation

If one maintains \(R\), then the updates are extremely simple: alternate between setting \[ l_k = Rf_k / f_k^T f_k\ (\text{or } l_k = Rf_k / Z f_k^2) \] and \[ f_k^T = l_k^T R / l_k^T l_k\ (\text{or } f_k^T = l_k^T R / l_k^{2T} Z) \]
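In R, each update is then a single product with \(R\) (a sketch with illustrative names):

    lk <- drop(R %*% fk) / drop(Z %*% fk^2)                  # or / sum(fk^2) with no missing data
    fk <- drop(crossprod(R, lk)) / drop(crossprod(Z, lk^2))  # or / sum(lk^2)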

Updating loadings

Calculating the EBNM s2s

The old algorithm calculates s2 = 1/(tau %*% f$EF2[, k]), where the entries of tau corresponding to missing data have been set to zero.

For var_type = constant or var_type = by_row, s2 may be calculated as \[ 1 / (\tau \odot Z(EF^2)_{\bullet k}) \] If there is no missing data, then s2 is simply \[1 / (\tau \odot \text{sum}((EF^2)_{\bullet k})) \]

For var_type = by_column, s2 is \[1 / Z(\tau \odot (EF^2)_{\bullet k}) \] or the scalar \[ 1 / \text{sum}(\tau \odot (EF^2)_{\bullet k}) \] if there is no missing data.

This is essentially the same as before, but does not require storing tau as a matrix.
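A minimal sketch of these computations (illustrative names; EF2 stores \(EF^2\) as a \(p \times k\) matrix, and tau is a scalar, an \(n\)-vector, or a \(p\)-vector according to var_type):

    s2 <- 1 / (tau * drop(Z %*% EF2[, k]))   # constant / by_row, with missing data
    s2 <- 1 / (tau * sum(EF2[, k]))          # constant / by_row, no missing data
    s2 <- 1 / drop(Z %*% (tau * EF2[, k]))   # by_column, with missing data
    s2 <- 1 / sum(tau * EF2[, k])            # by_column, no missing data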

Calculating the EBNM xs

The old algorithm calculates x = ((Rk * tau) %*% f$EF[, k]) * s2, which requires storing and updating Rk. For var_type = constant and var_type = by_row, x is

\[ \tau \odot (Y - \sum_{i: i \ne k} Z \odot (EL)_{\bullet i}(EF)_{\bullet i}') (EF)_{\bullet k} \odot \text{s2}. \]

Note that \((Z \odot (EL)_{\bullet i} (EF)_{\bullet i}') (EF)_{\bullet k}\) can be written as \[ (EL)_{\bullet i} \odot Z ((EF)_{\bullet i} \odot (EF)_{\bullet k}) \] so \[\left(\sum_{i: i \ne k} Z \odot (EL)_{\bullet i}(EF)_{\bullet i}'\right) (EF)_{\bullet k} = \text{rowSums} \left((EL)_{\bullet -k} \odot Z((EF)_{\bullet k} \odot_b (EF)_{\bullet -k}) \right).\]
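A minimal sketch of this computation for var_type = constant or by_row (illustrative names; EL and EF hold the posterior means, Y has NAs set to zero, and s2 is assumed already computed as above):

    EFk  <- EF[, k]
    rest <- rowSums(EL[, -k, drop = FALSE] * (Z %*% (EFk * EF[, -k, drop = FALSE])))
    x    <- tau * (drop(Y %*% EFk) - rest) * s2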

The performance savings are not likely to be substantial unless there is no missing data, but this method does not require the formation of any new \(n \times p\) matrices.

For var_type = by_column, x is \[ (Y - \sum_{i: i \ne k} Z \odot (EL)_{\bullet i}(EF)_{\bullet i}') (\tau \odot (EF)_{\bullet k}) \odot \text{s2}\] or \[ (Y(\tau \odot (EF)_{\bullet k}) - \text{rowSums} \left((EL)_{\bullet -k} \odot Z(\tau \odot (EF)_{\bullet k} \odot_b (EF)_{\bullet -k}) \right)) \odot \text{s2}\]

These expressions can again be simplified when there is no missing data.

An idea for parallelizing the EBNM problems

The above could also be implemented as follows:

  1. Calculate the \(n \times k\) matrix \(W = (Y - Z \odot (EL) (EF)')(EF)\). When there is missing data, this does require the temporary formation of a new \(n \times p\) matrix, but it does not need to be stored.

  2. \((Z \odot (EL)_{\bullet k} (EF)_{\bullet k}') (EF)_{\bullet k}\) can be written as \[ (EL)_{\bullet k} \odot Z ((EF)_{\bullet k}^2). \] Form the \(n \times k\) matrix \(U = (EL) \odot Z(EF)^2\) in one go.

  3. For \(k = 1\), calculate \[ X_{\bullet 1} = \tau \odot (W_{\bullet 1} + U_{\bullet 1}) \odot S_{\bullet 1}\] and solve the EBNM problem. This changes \((EL)_{\bullet k}\), so \(W\) needs to be updated:

\[\begin{aligned} W^{\text{new}} &= W^{\text{old}} + \left( Z \odot ((EL)_{\bullet k}^{\text{old}} - (EL)_{\bullet k}^{\text{new}}) (EF)_{\bullet k}' \right) (EF) \\ &= W^{\text{old}} + ((EL)_{\bullet k}^{\text{old}} - (EL)_{\bullet k}^{\text{new}}) \odot Z((EF)_{\bullet k} \odot_b (EF)). \end{aligned}\]

Then repeat for \(k = 2, 3, \ldots\)

But if one simply omits the update of \(W\), parallelization is simple. If the updates of \((EL)\) are small, then this omission might be justified.
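A minimal sketch of the whole scheme (illustrative names throughout; ebnm_solve is a hypothetical stand-in for the EBNM solver, and S2 is assumed to be an \(n \times k\) matrix of s2 values):

    W <- Y %*% EF - (Z * (EL %*% t(EF))) %*% EF   # n x p matrix formed once, temporarily
    U <- EL * (Z %*% EF^2)                        # n x k, formed in one go
    for (k in seq_len(ncol(EF))) {
      x      <- tau * (W[, k] + U[, k]) * S2[, k]
      EL_new <- ebnm_solve(x, sqrt(S2[, k]))      # hypothetical EBNM solver
      # update W to reflect the change in (EL)_k; omit this line to parallelize:
      W <- W + (EL[, k] - EL_new) * (Z %*% (EF[, k] * EF))
      EL[, k] <- EL_new   # U would also need refreshing before a second pass
    }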

Alternative implementation

If \(R\) is stored rather than \(Y\), then the values of x for constant and by_row precision types are \[ \tau \odot (R + Z \odot (EL)_{\bullet k}(EF)_{\bullet k}') (EF)_{\bullet k} \odot \text{s2} = \tau \odot (R(EF)_{\bullet k} + (EL)_{\bullet k} \odot Z(EF)_{\bullet k}^2) \odot \text{s2}\]

For by_column, x can be obtained as \[ (R(\tau \odot (EF)_{\bullet k}) + (EL)_{\bullet k} \odot Z(\tau \odot (EF)_{\bullet k}^2)) \odot \text{s2}\]

The above implementation requires multiplying an \(n \times p\) matrix by a \(p \times 1\) matrix and, when there is missing data, multiplying an \(n \times p\) matrix by a \(p \times (k - 1)\) matrix. In this implementation, both of the matrices on the right-hand side are \(p \times 1\), so the calculation should be faster by a factor of \(k / 2\) when there is missing data (with no improvement when there is none).

Of course, the residuals also need to be updated after updating the loadings. This requires up to \(3np\) flops: \[ R^{\text{new}} = R^{\text{old}} - Z \odot ((EL)_{\bullet k}^{\text{new}} - (EL)_{\bullet k}^{\text{old}}) (EF)_{\bullet k}' \]
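A sketch of this variant for constant or by_row precision (illustrative names; ebnm_solve is hypothetical):

    x       <- tau * (drop(R %*% EF[, k]) + EL[, k] * drop(Z %*% EF[, k]^2)) * s2
    EL_new  <- ebnm_solve(x, sqrt(s2))                    # hypothetical EBNM solver
    R       <- R - Z * ((EL_new - EL[, k]) %o% EF[, k])   # rank-one update, up to 3np flops
    EL[, k] <- EL_new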

Updating factors

Factor updates can easily be obtained by reversing the roles of, on the one hand, \(EL\) and \(EF\) and, on the other, var_type = by_row and var_type = by_column.

Updating \(\tau\)

\(\tau\) can be updated easily and efficiently if it is done every time a single factor or loading is updated. Recall that most updates to \(\tau\) involve some operation on the matrix of squared residuals: var_type = constant takes the overall mean, by_row takes row means, and by_column takes column means. Write \[\text{R2} = R^2 + Z \odot (EL^2) (EF^2)' - Z \odot (EL)^2 ((EF)^2)' \]

For var_type = constant, \[\begin{aligned} \frac{n_{\text{nonmissing}}}{\tau} = \text{sum}(\text{R2}) &= \text{sum}(R^2) + \text{sum} ((EL^2) \odot Z(EF^2)) - \text{sum} ((EL)^2 \odot Z((EF)^2)) \\ &= \text{sum}(R^2) + \text{sum} ((EF^2)^T \odot (EL^2)^T Z) - \text{sum} (((EF)^2)^T \odot ((EL)^2)^T Z) \end{aligned}\]

Similar expressions can be derived for by_row (resp. by_column) by replacing the sums with rowSums in the first equality (resp. with colSums in the second equality) and by replacing the scalar number of non-missing values \(n_{\text{nonmissing}}\) with a vector of the number of non-missing values per row (resp. per column).

Since at most one factor or loading has changed between updates of \(\tau\), one can further simplify. For example, take the case var_type = constant. After a single loading update, \[\begin{aligned} n_{\text{nonmissing}} \left( \frac{1}{\tau} - \frac{1}{\tilde{\tau}} \right) = \Delta (\text{sum}(R^2)) + (\Delta EL_k^2)^T Z(EF_k^2) - (\Delta (EL_k)^2)^T Z((EF_k)^2), \end{aligned}\] where \(\Delta\) denotes the new value minus the old value.
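As a sketch (illustrative names; dEL2 is \(\Delta EL_k^2\), dELsq is \(\Delta (EL_k)^2\), and delta_sumR2 is \(\Delta(\text{sum}(R^2))\), obtained as discussed next):

    # Incremental update of the constant tau after a single loading update.
    d_total <- delta_sumR2 +
      sum(dEL2 * drop(Z %*% EF2[, k])) -
      sum(dELsq * drop(Z %*% EF[, k]^2))
    tau <- n_nonmissing / (n_nonmissing / tau + d_total)  # 1/tau_new = 1/tau_old + d_total/n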

If \(R\) is stored, then \(\Delta (\text{sum}(R^2))\) is probably most easily obtained by calculating and storing \(\text{sum}(R^2)\) every time \(R\) is updated (at a cost of \(2np\) flops). Alternatively, writing \[ \begin{aligned} \Delta R^2 &= (R_{-k} - Z \odot L_k^{\text{new}}F_k')^2 - (R_{-k} - Z \odot L_k^{\text{old}}F_k')^2 \\ &= -2R_{-k} \odot (L_k^{\text{new}} - L_k^{\text{old}}) F_k' + (Z \odot L_k^{\text{new}}F_k')^2 - (Z \odot L_k^{\text{old}}F_k')^2 \end{aligned}\] and noting that \[ \text{sum}((Z \odot L_k^{\text{new}}F_k')^2) = ((L_k^{\text{new}})^2)^T ZF_k^2, \] one has that \[ \Delta (\text{sum}(R^2)) = ((L_k^{\text{new}})^2 - (L_k^{\text{old}})^2)^T ZF_k^2 -2(L_k^{\text{new}} - L_k^{\text{old}})^T R_{-k} F_k \] Finally, replacing \(R_{-k}\) with \(Y - \sum_{i: i \ne k} Z \odot L_i F_i'\): \[\begin{aligned} \Delta(\text{sum}(R^2)) &= ((L_k^{\text{new}})^2 - (L_k^{\text{old}})^2)^T ZF_k^2 - 2(L_k^{\text{new}} - L_k^{\text{old}})^T (Y F_k - \text{rowSums} (L_{-k} \odot Z(F_k \odot_b F_{-k}))) \end{aligned} \] (See “Calculating the EBNM xs” above for a more detailed derivation of the last step.) This calculation requires approximately \(npk\) flops, so I would expect it to be slower than simply calculating \(\text{sum}(R^2)\) after each update, especially as \(k\) gets large.
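A sketch of the direct computation of \(\Delta(\text{sum}(R^2))\) when \(R\) is not stored (illustrative names; Lk_old and Lk_new are the loading before and after the update, Fk the corresponding factor, and L_mk and F_mk the remaining columns):

    delta_sumR2 <- sum((Lk_new^2 - Lk_old^2) * drop(Z %*% Fk^2)) -
      2 * sum((Lk_new - Lk_old) *
                (drop(Y %*% Fk) - rowSums(L_mk * (Z %*% (Fk * F_mk)))))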

I omit the details for by_row and by_column and the simplifications obtained when there is no missing data; these are easy to derive from the above.

Updating objective

The part of the objective that is not stored in the KL fields is calculated as e_loglik = -0.5 * sum(log(2 * pi / tau) + tau * R2) (here, R2 has NAs where data is missing so that the entries do not contribute to the sum).

One should observe, however, that if the objective is calculated immediately after updating \(\tau\), then this expression has a very simple form. With var_type = constant, for example, \[\frac{1}{\tau} = \frac{\text{sum}(\text{R2})}{n_{\text{nonmissing}}},\] so \[\text{sum}(\tau \cdot \text{R2}) = \tau \cdot \text{sum}(\text{R2}) = n_{\text{nonmissing}}\]

Indeed, one can verify that for constant, by_row, and by_column, e_loglik is simply \[ -\frac{1}{2} \text{sum} \left(n_{\text{nonmissing}} \odot \left(\log \left(\frac{2 \pi}{\tau} \right) + 1 \right) \right), \] where \(n_{\text{nonmissing}}\) is a scalar, an \(n\)-vector of non-missing counts per row (for by_row), or a \(p\)-vector of non-missing counts per column (for by_column).
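In R this is a one-liner (n_nonmissing being a scalar or a vector, matching tau):

    e_loglik <- -0.5 * sum(n_nonmissing * (log(2 * pi / tau) + 1))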

If var_type is zero and S is constant (say, S = 1), then one can maintain a scalar value of tau as if the precision were constant (using the above updates). Since it will still be true that, for this pseudo-\(\tau\), \(\text{sum}(\text{R2}) = n_{\text{nonmissing}} / \tau\), e_loglik can then be calculated as \[ -\frac{1}{2} \text{sum} \left(n_{\text{nonmissing}} \odot \left(\log \left(2 \pi S^2 \right) + \frac{1}{\tau S^2} \right) \right) \] This also works if S is a row or column vector.

If S is a full matrix and \(R\) is being stored, then still another approach is possible: one can store \(\text{sum}(\log(2 \pi S^2))\) and then calculate \(\text{sum}(\text{R2} / S^2)\) as \[ \text{sum} \left( (1 / S^2) \odot R^2 \right) + \text{sum}(EL^2 \odot (1 / S^2) EF^2) - \text{sum}((EL)^2 \odot (1 / S^2) (EF)^2),\] where the matrix \((1 / S^2)\) has zeroes where data is missing. This is of course similar to the formula for \(1 / \tau\) when var_type = constant, and similar simple updates that depend only on updated factors and loadings are possible. For example, one can update the objective after a single loading update by calculating \[ \Delta(\text{sum}((1 / S^2) \odot R^2)) + (\Delta EL_k^2)^T (1 / S^2)(EF_k^2) - (\Delta (EL_k)^2)^T (1 / S^2)((EF_k)^2)\] (Due to the form of these updates, I recommend storing 1 / S^2 rather than S, with zeroes where data is missing.)
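A sketch of this bookkeeping (illustrative names; S2inv is \(1 / S^2\) with zeroes where data is missing, Z marks observed entries, and EL2 and EF2 hold \(EL^2\) and \(EF^2\)):

    const_part <- sum(log(2 * pi / S2inv[Z == 1]))   # sum(log(2 pi S^2)); store once
    quad_part  <- sum(S2inv * R^2) +
      sum(EL2 * (S2inv %*% EF2)) -
      sum(EL^2 * (S2inv %*% EF^2))
    e_loglik   <- -0.5 * (const_part + quad_part)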

Session information

sessionInfo()
R version 3.4.3 (2017-11-30)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS High Sierra 10.13.6

Matrix products: default
BLAS: /Library/Frameworks/R.framework/Versions/3.4/Resources/lib/libRblas.0.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.4/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

loaded via a namespace (and not attached):
 [1] workflowr_1.0.1   Rcpp_1.0.0        digest_0.6.18    
 [4] rprojroot_1.3-2   R.methodsS3_1.7.1 backports_1.1.2  
 [7] magrittr_1.5      git2r_0.21.0      evaluate_0.12    
[10] stringi_1.2.4     whisker_0.3-2     R.oo_1.21.0      
[13] R.utils_2.6.0     rmarkdown_1.10    tools_3.4.3      
[16] stringr_1.3.1     xfun_0.4          yaml_2.2.0       
[19] compiler_3.4.3    htmltools_0.3.6   knitr_1.20.22    
