Preprocessed Adult Height GWAS Summary Data

Last updated: 2020-06-23

Repository version: c4df22a

File	Version	Author	Date	Message
Rmd	c4df22a	Xiang Zhu	2020-06-23	wflow_publish("rmd/height2014_data.Rmd")

This page provides information on the preprocessed adult height genome-wide association study (GWAS) summary statistics and estimated linkage disequilibrium (LD) matrices, which were created and analyzed in the following publication.

Zhu, Xiang; Stephens, Matthew. Bayesian large-scale multiple regression with summary statistics from genome-wide association studies. Ann. Appl. Stat. 11 (2017), no. 3, 1561–1592. DOI:10.1214/17-AOAS1046.

This dataset is publicly available at https://doi.org/10.5281/zenodo.1443565; this DOI can be cited in a journal’s “Data availability” section.

If you find this dataset useful in your research, please cite the publication listed above, Zhu and Stephens (2017).

GWAS summary statistics

The folder mat_files contains the GWAS summary statistics for each of the 22 autosomes.

>> load mat_files/height2014.chr22.mat;
>> whos
  Name             Size                Bytes  Class     Attributes

  H            15599x758            94592336  double
  Nsnp         15599x1                 62396  int32
  betahat      15599x1                124792  double
  chr          15599x1                 62396  int32
  cummap       15599x1                124792  double
  pos18        15599x1                 62396  int32
  pos19        15599x1                 62396  int32
  se           15599x1                124792  double

Most variable names above are self-explanatory. The single-SNP GWAS summary statistics {betahat, se} are those published in Wood et al. (2014). The matrix H contains the phased haplotypes of 379 European-ancestry individuals from the 1000 Genomes Project Phase 1, which is why it has 758 = 2 × 379 columns (one per haplotype) and one row per SNP. The vector cummap contains the genetic map from the HapMap Release 24 European-ancestry population. Note that {betahat, se, H} must all use the SAME allele coding; otherwise RSS results will be severely distorted.
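As a rough illustration of the variables described above, the following MATLAB sketch runs a few consistency checks on one chromosome. It assumes the working directory contains the mat_files folder. The checks themselves are only suggestions and are not part of the RSS software; in particular, the expectation that cummap is non-decreasing is an assumption based on its name.

```matlab
% Sketch: basic consistency checks on one chromosome's summary data.
% Assumes the current folder contains mat_files/ as described above.
S = load('mat_files/height2014.chr22.mat');

p = numel(S.betahat);                 % number of SNPs on this chromosome

% Every per-SNP vector should have one entry per SNP, and H one row per SNP.
assert(numel(S.se) == p && numel(S.Nsnp) == p && numel(S.chr) == p);
assert(numel(S.pos18) == p && numel(S.pos19) == p && numel(S.cummap) == p);
assert(size(S.H, 1) == p);

% H stores two haplotypes per individual, so its column count must be even
% (758 = 2 x 379 for this dataset).
assert(mod(size(S.H, 2), 2) == 0);

% Standard errors must be positive for z-scores to be well defined.
assert(all(S.se > 0));
z = S.betahat ./ S.se;                % single-SNP z-scores, for illustration

% Assumption (not stated in the text): cummap is cumulative along the
% chromosome, so it should be non-decreasing if SNPs are sorted by position.
map_sorted = all(diff(S.cummap) >= 0);

fprintf('%d SNPs, %d haplotypes, max |z| = %.2f, map sorted: %d\n', ...
        p, size(S.H, 2), max(abs(z)), map_sorted);
```

These checks do not verify allele coding. Confirming that {betahat, se, H} use the same alleles requires the allele labels from the original sources.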

Estimated LD matrices

The folder estimated_ld_sparse contains the estimated LD matrices for each of the 22 autosomes. The estimation method is detailed in Wen and Stephens (2010) and is implemented in get_corr.m.

Files R.chr*.mat were generated by setting cutoff=1e-8 in get_corr.m. Files R.chr*.3.mat were generated by setting cutoff=1e-3. These LD matrices are stored as sparse matrices to save space.

>> load estimated_ld_sparse/R.chr22.mat;
>> whos R
  Name          Size                   Bytes  Class     Attributes

  R         15599x15599            592534032  double    sparse
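
To see how the two cutoff values affect storage, one could compare the two files for the same chromosome. The sketch below assumes that the second file is named R.chr22.3.mat (following the R.chr*.3.mat pattern) and that it stores its matrix under the same variable name, R. Neither assumption is confirmed by the text above.

```matlab
% Sketch: compare the sparsity of the two LD estimates for chromosome 22.
% Assumptions: the file name R.chr22.3.mat and the variable name R inside it.
A = load('estimated_ld_sparse/R.chr22.mat');    % generated with cutoff = 1e-8
B = load('estimated_ld_sparse/R.chr22.3.mat');  % generated with cutoff = 1e-3

% Both matrices should be sparse and of the same size.
assert(issparse(A.R) && issparse(B.R));
assert(isequal(size(A.R), size(B.R)));

% Fraction of stored nonzero entries in each matrix.
density = @(M) nnz(M) / numel(M);
fprintf('cutoff = 1e-8: %.4f of entries are nonzero\n', density(A.R));
fprintf('cutoff = 1e-3: %.4f of entries are nonzero\n', density(B.R));

% Largest absolute entry-wise difference between the two estimates.
max_diff = full(max(max(abs(A.R - B.R))));
fprintf('max |difference| = %.2e\n', max_diff);
```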

Note that if you run the RSS MCMC programs (rss/src) with these LD data, you should first convert them to full matrices: R=full(R). I recommend full LD matrices for MCMC because each iteration involves multiple matrix indexing operations (i.e., selecting the SNPs to include in or exclude from the current regression model), and in my experiments full-matrix indexing is much faster than sparse-matrix indexing (at least in MATLAB).
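
The indexing cost can be checked on your own machine with a small timing experiment. The sketch below uses a synthetic sparse matrix rather than the real LD data. The matrix size, density, subset size and number of repetitions are arbitrary illustrative choices, and timings will vary by machine and MATLAB version.

```matlab
% Sketch: time repeated sub-matrix indexing on a sparse vs. a full matrix.
% All sizes below are arbitrary choices for illustration.
rng(1);
p    = 2000;                          % matrix dimension (synthetic)
k    = 50;                            % number of rows/columns selected
nrep = 200;                           % number of indexing operations

Rs = sprandsym(p, 0.05) + speye(p);   % synthetic sparse symmetric matrix
Rf = full(Rs);                        % the same matrix in full storage

% Draw the random index sets once so both timings use identical subsets.
idx = zeros(nrep, k);
for i = 1:nrep
    idx(i, :) = randperm(p, k);
end

tic;
for i = 1:nrep
    sub_s = Rs(idx(i, :), idx(i, :)); % sparse indexing
end
t_sparse = toc;

tic;
for i = 1:nrep
    sub_f = Rf(idx(i, :), idx(i, :)); % full indexing
end
t_full = toc;

fprintf('sparse: %.4f s, full: %.4f s\n', t_sparse, t_full);
```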

However, if you run the RSS VB programs (rss/src_vb) with these LD data, do NOT convert them to full matrices: the RSS VB programs require sparse LD matrices as input.
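
A minimal sketch of preparing R for either family of programs is given below. The flag use_vb is a hypothetical name introduced only for this example and is not part of the RSS code. The memory figure in the comment follows from the matrix size shown in the whos output above.

```matlab
% Sketch: prepare the chromosome 22 LD matrix for MCMC or VB programs.
load('estimated_ld_sparse/R.chr22.mat', 'R');

use_vb = false;   % hypothetical flag: true for rss/src_vb, false for rss/src

if use_vb
    % VB programs expect a sparse LD matrix, so keep R as loaded.
    assert(issparse(R), 'RSS VB programs require a sparse LD matrix.');
else
    % MCMC programs index R many times per iteration, so use full storage.
    % A full 15599 x 15599 double matrix needs 15599^2 * 8 bytes, i.e.
    % about 1.95 GB, so make sure enough memory is available.
    R = full(R);
    assert(~issparse(R));
end
```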

Xiang Zhu

GWAS summary statistics

Estimated LD matrices