A more memory-efficient backfitting algorithm

Last updated: 2018-11-05

workflowr checks: (Click a bullet for more information)

✔ R Markdown file: up-to-date

Great! Since the R Markdown file has been committed to the Git repository, you know the exact version of the code that produced these results.
✔ Environment: empty

Great job! The global environment was empty. Objects defined in the global environment can affect the analysis in your R Markdown file in unknown ways. For reproduciblity it’s best to always run the code in an empty environment.
✔ Seed: set.seed(20180714)

The command set.seed(20180714) was run prior to running the code in the R Markdown file. Setting a seed ensures that any results that rely on randomness, e.g. subsampling or permutations, are reproducible.
✔ Session information: recorded

Great job! Recording the operating system, R version, and package versions is critical for reproducibility.
✔ Repository version: 1eda9af
Great! You are using Git for version control. Tracking code development and connecting the code version to the results is critical for reproducibility. The version displayed above was the version of the Git repository at the time these results were generated.

Note that you need to be careful to ensure that all relevant files for the analysis have been committed to Git prior to generating the results (you can use wflow_publish or wflow_git_commit). workflowr only checks the R Markdown file, but you know if there are other scripts or data files that it depends on. Below is the status of the Git repository when the results were generated:
```
Ignored files:
    Ignored:    .DS_Store
    Ignored:    .Rhistory
    Ignored:    .Rproj.user/
    Ignored:    docs/.DS_Store
    Ignored:    docs/figure/.DS_Store

Untracked files:
    Untracked:  analysis/gd_notes.Rmd
```
Note that any generated files, e.g. HTML, png, CSS, etc., are not included in this status report because it is ok for generated content to have uncommitted changes.

Expand here to see past versions:

File	Version	Author	Date	Message
Rmd	1eda9af	Jason Willwerscheid	2018-11-05	wflow_publish(“analysis/matrix_ops.Rmd”)

Updating $\sigma^2$

Let $Y$ have zeroes where data is missing and let $Z$ be an indicator matrix with ones where data is non-missing and zeroes where data is missing. Write

\[ \begin{aligned} R^2 = Y^2 &- 2 Y \odot (EL) (EF)' + Z \odot ((EL) (EF)')^2 \\ &+ Z \odot (EL^2) (EF^2)' - Z \odot (EL)^2 ((EF)^2)' \end{aligned} \]

var_type = constant

Let $m$ be the number of non-missing entries in $Y$. We can write

\[\begin{aligned} \sigma^2 = \frac{1}{m} ( \text{sum}(Y^2) &- 2 \cdot \text{sum}((EL) \odot Y (EF)) + \text{sum}((EL) \odot Z(EF))^2 \\ &+ \text{sum}((EL^2) \odot Z (EF^2)) - \text{sum}((EL)^2 \odot Z (EF)^2) ) \end{aligned}\]

The most expensive operations are the four multiplications of an $n \times p$ matrix by a $p \times k$ matrix. The three involving $Z$ are very cheap if there is no missing data. The old implementation required three multiplications of a $n \times k$ by a $k \times p$ matrix, so there is not likely to be any speedup unless there is missing data. Much more importantly, though, the new implementation does not require the formation of any new $n \times p$ matrices.

var_type = by_row

Now let $m$ be an $n$-vector, where $m_i$ denotes the number of non-missing entries in row $Y_{i \bullet}$. $\sigma^2$ is now an $n$-vector:

\[\begin{aligned} \sigma^2 = (1 / m) \odot (\text{rowSums}(Y^2) &- 2 \cdot \text{rowSums}((EL) \odot Y (EF)) + \text{rowSums}((EL) \odot Z(EF))^2 \\ &+ \text{rowSums}((EL^2) \odot Z (EF^2)) - \text{rowSums}((EL)^2 \odot Z (EF)^2)) \end{aligned}\]

var_type = by_column

With $m$ a $p$-vector, where $m_j$ denotes the number of non-missing entries in column $Y_{\bullet j}$:

\[\begin{aligned} \sigma^2 = (1 / m) \odot (\text{colSums}(Y^2) &- 2 \cdot \text{rowSums}(t(EF) \odot (t(EL))Y) + \text{rowSums}(t(EF) \odot (t(EL))Z)^2 \\ &+ \text{rowSums}(t(EF^2) \odot (t(EL^2))Z)) - \text{rowSums}(t(EF)^2 \odot t(EL)^2 Z)) \end{aligned}\]

(I write it in this form to avoid transposing $Y$.)

Greedy updates

In the course of optimizing a single factor/loading pair, one can exploit the fact that only one column in each of $EL$, $EF$, $EL^2$, and $EF^2$ change.

Updating loadings

Instead of updating loading 1, then factor 1, then loading 2, etc., one can much more efficiently update all loadings (potentially in parallel), then all factors.

Calculating the EBNM `s2`s

The old algorithm calculates s2 = 1/(tau %*% f$EF2[, k]), where the entries of tau corresponding to missing data have been set to zero.

Let $S$ be the matrix of s2s, with the $i$th column giving the s2s for the $i$th loading. Then for var_type = constant or var_type = by_row, $S$ may be calculated as \[ S = \sigma^2 / Z(EF^2), \] where “$/$” is elementwise for var_type = by_row. Again the calculation is simplified if there is no missing data: here, $Z(EF^2)$ is simply a matrix where each row is the $k$-vector $\text{colSums}(EF^2)$.

For var_type = by_column, \[ S = 1 / Z(\tau \odot_b EF^2), \] where $\odot_b$ denotes that broadcasting is to be performed (in the sense that R uses it).

This is essentially the same as before, but tau does not need to be stored as a matrix.

Calculating the EBNM `x`s

The old algorithm calculates x = ((Rk * tau) %*% f$EF[, k]) * s2, which requires storing and updating Rk. For var_type = constant and var_type = by_row, one may write:

\[ X_{\bullet k} = \tau \odot (Y - \sum_{i: i \ne k} Z \odot (EL)_{\bullet i}(EF)_{\bullet i}') (EF)_{\bullet k} \odot S_{\bullet k}. \]

Note that $(Z \odot (EL)_{\bullet i} (EF)_{\bullet i}') (EF)_{\bullet k}$ can be written as \[ (EL)_{\bullet i} \odot Z ((EF)_{\bullet i} \odot (EF)_{\bullet k}) \] so \[\left(\sum_{i: i \ne k} Z \odot (EL)_{\bullet i}(EF)_{\bullet i}'\right) (EF)_{\bullet k} = \text{rowSums} \left((EL)_{\bullet -k} \odot Z((EF)_{\bullet k} \odot_b (EF)_{\bullet -k}) \right).\]

Again, the performance savings are not likely to be substantial unless there is no missing data, but this method does not require the formation of any new $n \times p$ matrices.

For var_type = by_column, write \[ X_{\bullet k} = (Y - \sum_{i: i \ne k} Z \odot (EL)_{\bullet i}(EF)_{\bullet i}') (\tau \odot (EF)_{\bullet k}) \odot S_{\bullet k}. \]

An idea for parallelizing the EBNM problems

The above could also be implemented as follows:

Calculate the $n \times k$ matrix $W = (Y - Z \odot (EL) (EF)')(EF)$. When there is missing data, this does require the temporary formation of a new $n \times p$ matrix, but it does not need to be stored.
$(Z \odot (EL)_{\bullet k} (EF)_{\bullet k}') (EF)_{\bullet k}$ can be written as \[ (EL)_{\bullet k} \odot Z ((EF)_{\bullet k}^2). \] Form the $n \times k$ matrix $U = (EL) \odot Z(EF)^2$ in one go.
For $k = 1$, calculate \[ X_{\bullet 1} = \tau \odot (W_{\bullet 1} + U_{\bullet 1}) \odot S_{\bullet 1}\] and solve the EBNM problem. This changes $(EL)_{\bullet k}$, so $W$ needs to be updated:

\[\begin{aligned} W^{\text{new}} &= W^{\text{old}} + \left( Z \odot ((EL)_{\bullet k}^{\text{old}} - (EL)_{\bullet k}^{\text{new}}) (EF)_{\bullet k}' \right) (EF) \\ &= W^{\text{old}} + ((EL)_{\bullet k}^{\text{old}} - (EL)_{\bullet k}^{\text{new}}) \odot Z((EF)_{\bullet k} \odot_b (EF)), \end{aligned}\]

Then repeat for $k = 2, 3, \ldots$

But if one simply omits the update of of $W$, parallelization is simple. If the updates of $(EL)$ are small, then this omission might be justified.

Updating factors

Factor updates can easily be obtained by reversing the roles of, on the one hand, $EL$ and $EF$ and, on the other, var_type = by_row and var_type = by_column.

Summary

With this implementation, the only $n \times p$ matrices that ever need to be handled are $Y$ and, if data is missing, $Z$ (and $Z$ can be stored as a matrix of integers). So it should be possible to fit a flash object using only slightly more memory than that required to store the data itself.

Session information

sessionInfo()

R version 3.4.3 (2017-11-30)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS High Sierra 10.13.6

Matrix products: default
BLAS: /Library/Frameworks/R.framework/Versions/3.4/Resources/lib/libRblas.0.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.4/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

loaded via a namespace (and not attached):
 [1] workflowr_1.0.1   Rcpp_0.12.19      digest_0.6.15    
 [4] rprojroot_1.3-2   R.methodsS3_1.7.1 backports_1.1.2  
 [7] git2r_0.21.0      magrittr_1.5      evaluate_0.10.1  
[10] stringi_1.2.4     whisker_0.3-2     R.oo_1.21.0      
[13] R.utils_2.6.0     rmarkdown_1.8     tools_3.4.3      
[16] stringr_1.3.0     yaml_2.1.17       compiler_3.4.3   
[19] htmltools_0.3.6   knitr_1.20

This reproducible R Markdown analysis was created with workflowr 1.0.1

A more memory-efficient backfitting algorithm

Jason Willwerscheid

11/4/2018

Updating \(\sigma^2\)

var_type = constant

var_type = by_row

var_type = by_column

Greedy updates

Updating loadings

Calculating the EBNM `s2`s

Calculating the EBNM `x`s

An idea for parallelizing the EBNM problems

Updating factors

Summary

Session information

A more memory-efficient backfitting algorithm

Jason Willwerscheid

11/4/2018

Updating \(\sigma^2\)

var_type = constant

var_type = by_row

var_type = by_column

Greedy updates

Updating loadings

Calculating the EBNM s2s

Calculating the EBNM xs

An idea for parallelizing the EBNM problems

Updating factors

Summary

Session information

Calculating the EBNM `s2`s

Calculating the EBNM `x`s