Last updated: 2018-05-03
workflowr checks:

✔ R Markdown file: up-to-date
Great! Since the R Markdown file has been committed to the Git repository, you know the exact version of the code that produced these results.
✔ Environment: empty
Great job! The global environment was empty. Objects defined in the global environment can affect the analysis in your R Markdown file in unknown ways. For reproducibility it’s best to always run the code in an empty environment.
✔ Seed: set.seed(20180411)

The command set.seed(20180411) was run prior to running the code in the R Markdown file. Setting a seed ensures that any results that rely on randomness, e.g. subsampling or permutations, are reproducible.
✔ Session information: recorded
Great job! Recording the operating system, R version, and package versions is critical for reproducibility.
✔ Repository version: 196d0e3

To ensure reproducibility, all relevant files for the analysis should be committed to Git prior to generating the results (e.g. with wflow_publish or wflow_git_commit). workflowr only checks the R Markdown file, but you know if there are other scripts or data files that it depends on. Below is the status of the Git repository when the results were generated:
Ignored files:
Ignored: .DS_Store
Ignored: .Rhistory
Ignored: .Rproj.user/
Ignored: .sos/
Ignored: exams/
Untracked files:
Untracked: analysis/pca_cell_cycle.Rmd
Untracked: analysis/ridge_mle.Rmd
Untracked: docs/figure/pca_cell_cycle.Rmd/
Unstaged changes:
Modified: analysis/cell_cycle.Rmd
Modified: analysis/density_est_cell_cycle.Rmd
Modified: analysis/eb_vs_soft.Rmd
Modified: analysis/eight_schools.Rmd
Modified: analysis/glmnet_intro.Rmd
Note that any generated files, e.g. HTML, png, CSS, etc., are not included in this status report because it is ok for generated content to have uncommitted changes.
| File | Version | Author | Date | Message |
|---|---|---|---|---|
| Rmd | 196d0e3 | stephens999 | 2018-05-03 | wflow_publish("analysis/bayes_normal_means.Rmd") |
In a previous homework you implemented Empirical Bayes (EB) shrinkage for the normal means problem with a normal prior. That is, we have data \(X=(X_1,\dots,X_n)\): \[X_j | \theta_j, s_j \sim N(\theta_j, s_j^2)\] and assume \[\theta_j | \mu,\sigma \sim N(\mu,\sigma^2) \quad j=1,\dots,n.\]
The EB approach involved two steps: first, estimate the prior parameters \((\mu,\sigma)\) by maximizing the likelihood \(p(X | \mu,\sigma)\); second, compute the posterior distribution of each \(\theta_j\) with those estimates plugged in, \(p(\theta_j | X, \hat{\mu}, \hat{\sigma})\).
The EB approach can be criticized for ignoring uncertainty in the estimates of \(\mu\) and \(\sigma\). Here we will use MCMC to do a fully Bayesian analysis that takes account of this uncertainty.
To make this easier we will first re-parameterize, using \(\eta = \log(\sigma)\), so \(\eta\) can take any value on the real line.
We will use a uniform prior on \((\mu,\eta)\), \(p(\mu,\eta) \propto 1\), in the range \(\mu \in [-a,a]\) and \(\eta \in [-b,b]\). You can use \(a=10^6\) and \(b=10\). (Because \(\eta\) is on the log scale, \(b=10\) covers a wide range of possible standard deviations.) Thus the posterior distribution on \((\mu,\eta)\) is given by \[p(\mu,\eta | X) \propto p(X | \mu, \eta) I(|\mu|<a) I(|\eta|<b)\]
where \(I\) denotes an indicator function.
Modify your log-likelihood computation code from your previous homework to compute the log-likelihood for \((\mu,\eta)\) given data \(X\) (and standard deviations \(s\)).
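As a hint at what this might look like: integrating out \(\theta_j\) gives the marginal distribution \(X_j | \mu, \eta \sim N(\mu, s_j^2 + e^{2\eta})\), so the log-likelihood is a sum of normal log-densities. A minimal sketch in R (the name `loglik` and its argument layout are illustrative choices, not part of the assignment):

```r
# Log-likelihood for (mu, eta) under the normal means model.
# Marginally, X_j | mu, eta ~ N(mu, s_j^2 + exp(2 * eta)).
loglik <- function(mu, eta, x, s) {
  sigma2 <- exp(2 * eta)  # sigma^2, since eta = log(sigma)
  sum(dnorm(x, mean = mu, sd = sqrt(s^2 + sigma2), log = TRUE))
}
```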
Use this to implement an MH algorithm to sample from \(\pi(\mu,\eta) \propto p(X | \mu,\eta) I(|\mu|<a) I(|\eta|<b)\). Note: in computing the MH acceptance probability you need to compute a ratio \(L_1/L_2\). For numerical stability you should always compute this ratio as \(\exp(l_1 - l_2)\), where \(l_i = \log(L_i)\), rather than computing \(L_1\) and \(L_2\) directly and then taking their ratio. (If both \(L_1\) and \(L_2\) are very small, they may be 0 to machine precision, which causes problems if you try to compute \(L_1/L_2\) directly.)
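Here is a minimal random-walk Metropolis–Hastings sketch along these lines, assuming the `loglik` function above. The proposal standard deviations, iteration count, and function names are illustrative and will likely need tuning:

```r
# Log of the (unnormalized) target pi(mu, eta); -Inf outside the prior box.
log_target <- function(mu, eta, x, s, a = 1e6, b = 10) {
  if (abs(mu) > a || abs(eta) > b) return(-Inf)
  loglik(mu, eta, x, s)
}

# Random-walk MH sampler; returns a matrix with one row per iteration.
mh_sample <- function(x, s, niter = 10000, mu0 = 0, eta0 = 0,
                      sd_mu = 1, sd_eta = 0.5) {
  chain <- matrix(NA, niter, 3,
                  dimnames = list(NULL, c("mu", "eta", "logpi")))
  mu <- mu0; eta <- eta0
  lp <- log_target(mu, eta, x, s)
  for (t in 1:niter) {
    mu_new  <- mu  + rnorm(1, 0, sd_mu)   # symmetric proposals
    eta_new <- eta + rnorm(1, 0, sd_eta)
    lp_new  <- log_target(mu_new, eta_new, x, s)
    # accept with probability exp(l1 - l2), computed on the log scale
    if (log(runif(1)) < lp_new - lp) {
      mu <- mu_new; eta <- eta_new; lp <- lp_new
    }
    chain[t, ] <- c(mu, eta, lp)
  }
  chain
}
```

Because the acceptance test compares `log(runif(1))` with `lp_new - lp`, the ratio \(L_1/L_2\) is never formed explicitly, which is exactly the numerical point made above.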
Apply your MH algorithm to simulated data where you know the answer. Run your MH algorithm multiple (at least 3) times from different initializations. For each run, plot how the value of \(\log \pi(\mu^t,\eta^t)\) changes with iteration \(t\). You should see that it starts from a low value (assuming you initialized to something that is not consistent with the data) and then gradually increases until it settles down to a “steady state” behavior. Use these plots to help decide how many iterations to run your algorithm to get reliable results (i.e. so that results from different runs look similar) and how many iterations to discard as “burn-in”. Compare your posterior distributions of \(\mu\) and \(\eta\) with the true values you simulated (the distributions should cover the true values unless you did something wrong or are unlucky!).
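As an illustration of this workflow (all settings below, including the true values and initializations, are arbitrary example choices), one might simulate data and overlay the \(\log \pi\) traces from three differently initialized runs of the sketch above:

```r
# Simulate data with known (mu, sigma) and run three chains.
set.seed(1)
n <- 100; mu_true <- 5; sigma_true <- 2
s <- rep(1, n)
theta <- rnorm(n, mu_true, sigma_true)
x <- rnorm(n, theta, s)

inits <- list(c(-20, 3), c(0, 0), c(20, -3))  # (mu0, eta0) pairs
chains <- lapply(inits, function(init)
  mh_sample(x, s, niter = 10000, mu0 = init[1], eta0 = init[2]))

# Overlay log pi(mu^t, eta^t) traces to assess convergence/burn-in.
yl <- range(sapply(chains, function(ch) ch[, "logpi"]))
plot(chains[[1]][, "logpi"], type = "l", ylim = yl,
     xlab = "iteration t", ylab = expression(log ~ pi(mu^t, eta^t)))
for (k in 2:3) lines(chains[[k]][, "logpi"], col = k)
```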
Repeat part 3 for the “8 schools data” here (omitting the comparisons with the true values, which of course you do not know here).
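In case it is useful, these are the estimated treatment effects and standard errors for the eight schools as given in Rubin (1981); the call below reuses the illustrative `mh_sample` sketch from above:

```r
# Eight schools data (Rubin, 1981): estimated effects and standard errors.
x_schools <- c(28, 8, -3, 7, -1, 1, 18, 12)
s_schools <- c(15, 10, 16, 11, 9, 11, 10, 18)
chain_schools <- mh_sample(x_schools, s_schools, niter = 20000)
```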
Note that the posterior distribution on \(\theta_j\) is given by: \[p(\theta_j | X) = \int p(\theta_j | X, \mu, \eta)\, p(\mu,\eta | X)\, d\mu\, d\eta,\] which is the expectation of \(p(\theta_j | X, \mu, \eta)\) over the posterior \(p(\mu,\eta | X)\). Computing posterior distributions like this is sometimes referred to as “integrating out uncertainty in” \(\mu,\eta\). (It is useful to compare this with the EB approach of just plugging in the maximum likelihood estimates and computing \(p(\theta_j | X, \hat{\mu},\hat{\eta})\). Notice that the two will produce similar results if the posterior distribution \(p(\mu,\eta | X)\) is very concentrated around the mle.)
Given \(T\) samples \(\mu^1,\eta^1,\dots,\mu^T, \eta^T\) from the posterior distribution \(p(\mu,\eta | X)\) you can approximate this expectation by \[p(\theta_j | X) \approx (1/T)\sum_t p(\theta_j | X, \mu^t, \eta^t).\] So you can approximate the posterior mean by \[E(\theta_j | X) \approx (1/T)\sum_t E(\theta_j | X, \mu^t, \eta^t).\]
Using the same idea, give an expression to approximate the posterior second moment \(E(\theta^2_j | X)\), and so approximate the posterior variance (and hence posterior standard deviation).
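As a hint at one way to organize this computation: the conditional posterior \(\theta_j | X, \mu, \sigma\) is normal with variance \(v_j = (1/s_j^2 + 1/\sigma^2)^{-1}\) and mean \(m_j = v_j (X_j/s_j^2 + \mu/\sigma^2)\) (the standard conjugate formulas), and \(E(\theta_j^2 | X, \mu, \eta) = v_j + m_j^2\) can be averaged over the samples in the same way as the mean. A sketch assuming the chain matrix produced by `mh_sample` above (the burn-in length and names are illustrative):

```r
# Approximate posterior mean and sd of each theta_j from samples of (mu, eta).
posterior_theta <- function(chain, x, s, burn = 1000) {
  keep <- chain[-(1:burn), , drop = FALSE]
  E1 <- E2 <- numeric(length(x))  # running first and second moments
  for (t in seq_len(nrow(keep))) {
    sigma2 <- exp(2 * keep[t, "eta"])
    v <- 1 / (1 / s^2 + 1 / sigma2)            # conditional posterior variance
    m <- v * (x / s^2 + keep[t, "mu"] / sigma2) # conditional posterior mean
    E1 <- E1 + m
    E2 <- E2 + (v + m^2)  # E(theta^2 | X, mu, eta) = Var + mean^2
  }
  E1 <- E1 / nrow(keep); E2 <- E2 / nrow(keep)
  data.frame(mean = E1, sd = sqrt(E2 - E1^2))
}

# Example usage with the eight schools chain from above:
# posterior_theta(chain_schools, x_schools, s_schools)
```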
sessionInfo()
R version 3.3.2 (2016-10-31)
Platform: x86_64-apple-darwin13.4.0 (64-bit)
Running under: OS X El Capitan 10.11.6
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
attached base packages:
[1] stats graphics grDevices utils datasets methods base
loaded via a namespace (and not attached):
[1] workflowr_1.0.1 Rcpp_0.12.16 digest_0.6.15
[4] rprojroot_1.3-2 R.methodsS3_1.7.1 backports_1.1.2
[7] git2r_0.21.0 magrittr_1.5 evaluate_0.10.1
[10] stringi_1.1.7 whisker_0.3-2 R.oo_1.22.0
[13] R.utils_2.6.0 rmarkdown_1.9 tools_3.3.2
[16] stringr_1.3.0 yaml_2.1.18 htmltools_0.3.6
[19] knitr_1.20
This reproducible R Markdown analysis was created with workflowr 1.0.1