Input Data Formats for RSS Methods

Last updated: 2021-03-05

All RSS methods to date require the input of GWAS summary statistics and ancestry-matching LD estimates. Some RSS methods further require the input of genomic annotations.

GWAS summary statistics

This section is adapted from Box 1 of Winkler et al. (2014).

The following columns are required for any RSS analysis:

snp: identifier of genetic variant, character string such as rs12498742;
chr: chromosome of the genetic variant, one of the autosomes chr1,…,chr22;
pos: physical position of the genetic variant, in base pairs;
a1: effect allele, a single upper case character A, C, G or T;
a2: the other (non-effect) allele, a single upper case character A, C, G or T;
betahat: estimated effect size of genetic variant under the single-marker model;
se: estimated standard error of betahat.
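
As a rough illustration of the required columns listed above, the following R sketch checks a summary statistics data frame before analysis. The helper `check_sumstats` and the toy values are hypothetical and are not part of the RSS code base. The check that `a1` differs from `a2` is an added assumption.

```r
# Hypothetical helper (not part of the RSS software): verify that a data
# frame of GWAS summary statistics has the required columns and value types.
check_sumstats <- function(ss) {
  required <- c("snp", "chr", "pos", "a1", "a2", "betahat", "se")
  missing_cols <- setdiff(required, names(ss))
  if (length(missing_cols) > 0) {
    stop("missing required columns: ", paste(missing_cols, collapse = ", "))
  }
  alleles <- c("A", "C", "G", "T")
  stopifnot(
    is.character(ss$snp),
    all(ss$chr %in% paste0("chr", 1:22)),  # autosomes only
    is.numeric(ss$pos), all(ss$pos > 0),
    all(ss$a1 %in% alleles),
    all(ss$a2 %in% alleles),
    all(ss$a1 != ss$a2),                   # assumption: the two alleles differ
    is.numeric(ss$betahat),
    is.numeric(ss$se), all(ss$se > 0)
  )
  invisible(TRUE)
}

# Toy data with made-up values, for illustration only.
ss <- data.frame(
  snp = c("rs1", "rs2"),
  chr = c("chr1", "chr2"),
  pos = c(1000, 2000),
  a1 = c("A", "G"),
  a2 = c("T", "C"),
  betahat = c(0.10, -0.05),
  se = c(0.02, 0.03),
  stringsAsFactors = FALSE
)
check_sumstats(ss)
```

The function stops with an error on the first violated condition, so it can be placed at the top of an analysis script.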

The following columns are optional, but they can be helpful for sanity checks:

strand: strand on which the alleles are reported, a single character - or +;
n: number of individuals analyzed (i.e., sample size) for the genetic variant;
maf: minor allele frequency, numeric between 0 and 1;
p: p-value of genetic variant association, numeric between 0 and 1;
info: other information (e.g., imputation quality) about genetic variants.
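
One way the optional columns can support sanity checks is sketched below in R. It assumes that the reported p-value came from a two-sided test with a normal approximation, which may not hold for every study. The toy values and the 0.5 tolerance on the log10 scale are arbitrary choices for illustration.

```r
# Toy summary statistics with made-up values.
ss <- data.frame(
  snp = c("rs1", "rs2", "rs3"),
  betahat = c(0.10, -0.05, 0.30),
  se = c(0.02, 0.03, 0.10),
  maf = c(0.25, 0.40, 0.10),
  stringsAsFactors = FALSE
)

# p-value implied by betahat and se, assuming a two-sided z-test.
z <- ss$betahat / ss$se
p_implied <- 2 * pnorm(-abs(z))

# Stand-in for the reported p column; the second entry is deliberately wrong.
ss$p <- c(p_implied[1], 0.5, p_implied[3])

# Flag variants whose reported and implied p-values differ by more than
# half an order of magnitude (arbitrary tolerance).
discordant <- abs(log10(ss$p) - log10(p_implied)) > 0.5

# Flag frequencies outside the stated range [0, 1].
bad_maf <- ss$maf < 0 | ss$maf > 1

ss$snp[discordant]
```

Variants flagged this way are worth inspecting by hand, since a mismatch can indicate a mislabeled column or a different test statistic.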

It is crucial that [a1, betahat, se] are defined consistently: betahat and se must come from a model in which the genotype counts the copies of a1. Below is a toy example. Consider two SNPs (rs1, rs2) and four individuals (i1, i2, i3, i4):

IND, i1, i2, i3, i4
rs1, AT, TT, AT, AA
rs2, CG, CC, GG, GC

If the effect alleles (a1) of these two SNPs are A and G respectively, then the genotype data of rs1 are X[, 1]=[1, 0, 1, 2] and the genotype data of rs2 are X[, 2]=[1, 0, 2, 1]. Further, the single-SNP summary statistics of rs1 and rs2 are generated as follows.

(betahat[1], se[1]) <- single.SNP.model(y, X[, 1])
(betahat[2], se[2]) <- single.SNP.model(y, X[, 2])
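
The two lines above are pseudocode. A minimal runnable R sketch follows, assuming that the single-SNP model is a simple linear regression of a quantitative phenotype on the genotype. The function `single_snp_lm` is a hypothetical stand-in for `single.SNP.model`, and the phenotype `y` is simulated here rather than taken from this page.

```r
set.seed(1)

# Genotypes from the toy example: number of copies of the effect allele a1.
X <- cbind(rs1 = c(1, 0, 1, 2), rs2 = c(1, 0, 2, 1))
y <- rnorm(4)  # simulated phenotype, for illustration only

# Hypothetical stand-in for single.SNP.model: simple linear regression.
single_snp_lm <- function(y, x) {
  coefs <- summary(lm(y ~ x))$coefficients
  c(betahat = coefs["x", "Estimate"], se = coefs["x", "Std. Error"])
}

# Summary statistics for both SNPs with a1 = A (rs1) and a1 = G (rs2).
sumstats <- t(apply(X, 2, function(x) single_snp_lm(y, x)))

# Same data for rs1, but coded with T as the effect allele.
fit_a <- single_snp_lm(y, X[, "rs1"])
fit_t <- single_snp_lm(y, 2 - X[, "rs1"])
```

Recoding rs1 with T as the effect allele flips the sign of betahat and leaves se unchanged, which is why a1 must match the coding that produced betahat.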

Finally, when providing the chr and pos columns, please confirm which human genome assembly (release and version) they are based on. For example, if 1000 Genomes Project Phase 3 data are used to generate ancestry-matching LD estimates, then the chr and pos columns should be based on UCSC hg19/GRCh37.

LD estimates

The ancestry-matching LD estimates are often derived from the phased haplotypes of the 1000 Genomes Project Phase 3. Because the 1000 Genomes data are publicly available, computing the LD estimates only requires the list of genetic variants, their physical positions and effect alleles (i.e., [snp, chr, pos, a1] from the summary statistics file).

The script import_1000g_vcf.sh illustrates how to extract phased haplotypes of select genetic variants from 1000 Genomes Phase 3 VCF format data and save them in IMPUTE reference-panel format *.impute.hap.

The scripts get_corr.m and get_corr.R illustrate how to compute LD estimates in MATLAB and R respectively.
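
To give a sense of what an LD estimate is, the R sketch below computes a plain sample correlation matrix from a toy haplotype matrix. This is only an illustration under simple assumptions. The scripts get_corr.m and get_corr.R may use a different estimator, and the haplotype values here are made up.

```r
# Toy haplotype matrix: rows are haplotypes, columns are variants.
# An entry is 1 if the haplotype carries the effect allele a1, 0 otherwise.
hap <- rbind(
  c(0, 0, 1),
  c(1, 1, 0),
  c(1, 1, 1),
  c(0, 0, 0),
  c(1, 0, 1),
  c(0, 1, 0)
)
colnames(hap) <- c("rs1", "rs2", "rs3")

# Sample correlation between variants, used here as a simple LD estimate.
R <- cor(hap)

# Coding rs1 by its other allele flips the sign of its correlations.
r12_flipped <- cor(1 - hap[, "rs1"], hap[, "rs2"])
```

The sign flip in the last line shows why the effect alleles used for the LD estimates have to agree with a1 in the summary statistics file.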

If you have internal genotype data that can be used to estimate the LD matrix, you can first organize them in the same VCF format as the 1000 Genomes Phase 3 data, and then reuse the scripts above. Again, please make sure that the physical positions and effect alleles of the internal genotype data are consistent with [chr, pos, a1] provided in the GWAS summary statistics file.

Genomic annotations

The most statistician-friendly format of genomic annotation data might look like this:

 snp   chr   pos ann1 ann2 ann3
 rs1  chr2 52877    0    0    0
 rs2  chr1 50670    0    1    0
 rs3 chr14   854    0    1    1
 rs4  chr4 99620    1    1    1
 rs5 chr16 71537    0    0    0
 rs6 chr22 39741    0    0    0
 rs7  chr6 89331    1    0    0

where ann1, ann2 and ann3 are three types of annotations; 1 indicates that the SNP is annotated and 0 that it is not.

Alternatively, a list of annotated SNPs can be saved as a separate file. For example:

> cat ann3.txt
 snp   chr   pos
 rs3 chr14   854
 rs4  chr4 99620

> cat ann2.txt
 snp   chr   pos
 rs2  chr1 50670
 rs3 chr14   854
 rs4  chr4 99620

> cat ann1.txt
 snp  chr   pos
 rs4 chr4 99620
 rs7 chr6 89331
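
The two formats carry the same information. The R sketch below, using the toy table from this section, shows one way to convert the indicator table into per-annotation lists and back. The object names are hypothetical.

```r
# The toy annotation table from this section.
ann <- data.frame(
  snp = paste0("rs", 1:7),
  chr = c("chr2", "chr1", "chr14", "chr4", "chr16", "chr22", "chr6"),
  pos = c(52877, 50670, 854, 99620, 71537, 39741, 89331),
  ann1 = c(0, 0, 0, 1, 0, 0, 1),
  ann2 = c(0, 1, 1, 1, 0, 0, 0),
  ann3 = c(0, 0, 1, 1, 0, 0, 0),
  stringsAsFactors = FALSE
)

# Indicator table -> one list of annotated SNPs per annotation.
ann_names <- c("ann1", "ann2", "ann3")
ann_lists <- lapply(ann_names, function(a) {
  ann[ann[[a]] == 1, c("snp", "chr", "pos")]
})
names(ann_lists) <- ann_names

# List of annotated SNPs -> indicator column (shown for ann1).
ann1_back <- as.numeric(ann$snp %in% ann_lists$ann1$snp)
```

Each element of `ann_lists` could then be saved with `write.table` to produce files like those shown above.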

Sometimes the annotations are based on genes (e.g., biological pathways) or genomic regions (e.g., regulatory elements). For these region-based annotations, it is easier to provide a list of annotated regions as follows:

 ensembl_gene_id chromosome_name start_position end_position
 ENSG00000000938               1       27938575     27961788
 ENSG00000008438              19       46522411     46526323
 ENSG00000008516              16        3096682      3110727
 ENSG00000066336              11       47376411     47400127
 ENSG00000077984              20       24929866     24940564
 ENSG00000085265               9      137801431    137809809

As with the LD estimates, please make sure that the physical positions ([snp, chr, pos] or [chromosome_name, start_position, end_position]) in the annotation file are consistent with [snp, chr, pos] in the summary statistics file.
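
A region-based annotation can be turned into a SNP-level indicator by checking which SNPs fall inside an annotated region. The R sketch below is one simple way to do this. It assumes that both files use the same genome assembly, that region boundaries are inclusive, and that no flanking window is added. The SNP identifiers and positions are made up, and the two gene rows are taken from the table above.

```r
# Hypothetical SNPs (identifiers and positions are made up).
snps <- data.frame(
  snp = c("rsA", "rsB", "rsC"),
  chr = c("chr1", "chr19", "chr1"),
  pos = c(27940000, 46530000, 100),
  stringsAsFactors = FALSE
)

# Two annotated regions from the table above.
genes <- data.frame(
  ensembl_gene_id = c("ENSG00000000938", "ENSG00000008438"),
  chromosome_name = c("1", "19"),
  start_position = c(27938575, 46522411),
  end_position = c(27961788, 46526323),
  stringsAsFactors = FALSE
)

# Harmonize chromosome labels: "chr1" in the SNP file vs "1" in the gene file.
snp_chr <- sub("^chr", "", snps$chr)

# A SNP is annotated if it lies within any region on the same chromosome.
in_region <- vapply(seq_len(nrow(snps)), function(i) {
  any(genes$chromosome_name == snp_chr[i] &
        genes$start_position <= snps$pos[i] &
        genes$end_position >= snps$pos[i])
}, logical(1))

snps$annotated <- as.integer(in_region)
```

Note the chromosome label mismatch handled by `sub`: the SNP-level files in this section use labels such as chr1, while the region table uses 1.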

Xiang Zhu
