Input Data Formats for RSS Methods

Last updated: 2020-06-23

Past versions of the R Markdown file (rmd/input_data.Rmd):

File	Version	Author	Date	Message
Rmd	5782e93	Xiang Zhu	2020-06-23	wflow_publish("rmd/input_data.Rmd")

GWAS summary statistics

This section is adapted from Box 1 of Winkler et al. (2014).

The following columns are required for any RSS analysis:

snp: identifier of the genetic variant, a character string such as rs12498742;
chr: chromosome of the genetic variant, such as chr1, …, chr22, chrX, chrY;
pos: physical position of the genetic variant, in base pairs;
a1: effect allele, i.e. the allele whose effect on the trait is reported in betahat, a single upper-case character A, C, G or T;
a2: the other (non-effect) allele, a single upper-case character A, C, G or T;
betahat: estimated effect size of the genetic variant under the single-marker model;
se: estimated standard error of betahat.
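The required columns listed above can be checked mechanically before any analysis. The following is a minimal sketch in R, assuming the summary statistics have been read into a data frame; the helper `check_sumstat()` and the toy values are hypothetical and are not part of any RSS software.

```r
# Hypothetical toy summary statistics; all values are made up for illustration.
sumstat <- data.frame(
  snp     = c("rs1", "rs2"),
  chr     = c("chr1", "chr2"),
  pos     = c(50670L, 52877L),
  a1      = c("A", "G"),
  a2      = c("T", "C"),
  betahat = c(0.12, -0.05),
  se      = c(0.04, 0.03),
  stringsAsFactors = FALSE
)

# check_sumstat() is a hypothetical helper that enforces the column
# definitions given in the list above.
check_sumstat <- function(df) {
  required <- c("snp", "chr", "pos", "a1", "a2", "betahat", "se")
  missing_cols <- setdiff(required, names(df))
  if (length(missing_cols) > 0) {
    stop("missing required columns: ", paste(missing_cols, collapse = ", "))
  }
  alleles <- c("A", "C", "G", "T")
  stopifnot(
    is.character(df$snp),
    all(df$a1 %in% alleles),          # single upper-case nucleotide
    all(df$a2 %in% alleles),
    all(df$a1 != df$a2),              # effect and non-effect alleles differ
    is.numeric(df$pos), all(df$pos > 0),
    is.numeric(df$betahat),
    is.numeric(df$se), all(df$se > 0) # standard errors must be positive
  )
  invisible(TRUE)
}

check_sumstat(sumstat)
```

The check stops with an error at the first violated condition, which is usually enough to locate a malformed column.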

The following columns are optional, but they can be very helpful for sanity checks:

strand: strand on which the alleles are reported, a single character - or +;
n: number of individuals analyzed (a.k.a. sample size) for the genetic variant;
maf: minor allele frequency, numeric between 0 and 1;
p: p-value of genetic variant association, numeric between 0 and 1;
info: other information about genetic variants.
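As one illustration of such a sanity check, the reported p-value can be compared with the p-value implied by betahat and se. The sketch below assumes a two-sided Wald test with a normal reference distribution, which may differ from the test used in the original GWAS, so a mismatch is a flag for closer inspection and not proof of an error. The toy values and the tolerance are hypothetical.

```r
# Hypothetical values for two SNPs; `p` is the p-value reported in the file.
sumstat <- data.frame(
  snp     = c("rs1", "rs2"),
  betahat = c(0.12, -0.05),
  se      = c(0.04, 0.03),
  p       = c(0.0027, 0.0956),
  stringsAsFactors = FALSE
)

# Two-sided p-value implied by the z-score betahat / se
# (normal approximation; an assumption of this sketch).
z <- sumstat$betahat / sumstat$se
p_implied <- 2 * pnorm(-abs(z))

# Flag SNPs whose reported and implied p-values disagree noticeably.
# The tolerance 1e-3 is an arbitrary choice for this toy example.
mismatch <- abs(sumstat$p - p_implied) > 1e-3
sumstat$snp[mismatch]
```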

It is essential that a1, betahat and se refer to the same allele coding: betahat must be the estimated effect of each additional copy of a1, and se must be its standard error. Below is a toy example. Consider two SNPs (rs1, rs2) and four individuals (i1, i2, i3, i4):

IND, i1, i2, i3, i4
rs1, AT, TT, AT, AA
rs2, CG, CC, GG, GC

If the effect alleles (a1) of these two SNPs are A and G respectively, then the genotype data of rs1 are X[, 1]=[1, 0, 1, 2], and the genotype data of rs2 are X[, 2]=[1, 0, 2, 1]. Further, the single-SNP summary statistics of rs1 and rs2 are given by:

(betahat[1], se[1]) <- single.SNP.model(y, X[, 1])
(betahat[2], se[2]) <- single.SNP.model(y, X[, 2])
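The toy example can be made concrete in R. The sketch below codes the genotypes as counts of the effect allele and fits a simple linear regression for each SNP. The phenotype values `y`, the helpers `count_allele()` and `single_snp_model()`, and the use of `lm()` with an intercept as the single-SNP model are all assumptions made for illustration.

```r
# Genotypes from the toy table above (rows: individuals i1, ..., i4).
geno <- data.frame(
  rs1 = c("AT", "TT", "AT", "AA"),
  rs2 = c("CG", "CC", "GG", "GC"),
  row.names = c("i1", "i2", "i3", "i4"),
  stringsAsFactors = FALSE
)
a1 <- c(rs1 = "A", rs2 = "G")  # effect alleles

# Hypothetical helper: number of copies of `allele` in each genotype.
count_allele <- function(genotypes, allele) {
  vapply(strsplit(genotypes, ""), function(g) sum(g == allele), numeric(1))
}

# Genotype matrix X: one column per SNP, coded as copies of a1.
X <- sapply(names(a1), function(s) count_allele(geno[[s]], a1[[s]]))

# Hypothetical phenotype values for the four individuals.
y <- c(0.5, -1.2, 0.3, 1.9)

# Hypothetical single-SNP model: simple linear regression of y on x.
single_snp_model <- function(y, x) {
  fit <- summary(lm(y ~ x))$coefficients
  c(betahat = fit["x", "Estimate"], se = fit["x", "Std. Error"])
}

fit1 <- single_snp_model(y, X[, 1])
fit2 <- single_snp_model(y, X[, 2])

# Counting T instead of A for rs1 recodes x as 2 - x:
# betahat changes sign while se stays the same.
fit1_flipped <- single_snp_model(y, 2 - X[, 1])
```

The last line shows why a1 and betahat must be matched: reporting the other allele as a1 without also changing the sign of betahat silently reverses the direction of the effect.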

Finally, when providing the chr and pos columns, please state explicitly which human genome assembly (and version) they are based on. For example, if 1000 Genomes Project Phase 3 data are used to estimate LD, please ensure that the chr and pos columns are based on UCSC hg19/GRCh37.

LD matrix estimates

All RSS methods to date also require the input of an estimated LD matrix.

The LD estimates are often derived from the phased haplotypes of the 1000 Genomes Project Phase 3. Because the 1000 Genomes data are publicly available, estimating LD from them only requires the list of genetic variants, their physical positions and their effect alleles (i.e. [snp, chr, pos, a1] from the summary statistics file).

If internal genotype data are available for estimating the LD matrix, please organize them in the same VCF format as the 1000 Genomes Phase 3 data. Again, make sure that the physical positions and effect alleles of the internal genotype data are consistent with the [chr, pos, a1] columns provided in the GWAS summary statistics file.
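To see why the effect alleles matter for LD, consider the following R sketch. It uses simulated allele counts for three hypothetical SNPs and takes the plain sample correlation of genotype counts as the LD estimate. Both choices are simplifications made for illustration: in practice the data would come from phased haplotypes or VCF files, and a different LD estimator may be preferred.

```r
set.seed(1)
n <- 200  # number of simulated individuals

# Simulated allele counts (0/1/2) for three hypothetical SNPs.
x1 <- rbinom(n, 2, 0.3)
x2 <- ifelse(runif(n) < 0.8, x1, rbinom(n, 2, 0.3))  # mostly copies x1
x3 <- rbinom(n, 2, 0.5)                              # independent of x1, x2
X <- cbind(rs1 = x1, rs2 = x2, rs3 = x3)

# Sample correlation matrix of allele counts as a simple LD estimate.
R <- cor(X)

# Counting the other allele of rs2 recodes its column as 2 - x.
X_flip <- X
X_flip[, "rs2"] <- 2 - X[, "rs2"]
R_flip <- cor(X_flip)

# Correlations involving rs2 change sign; all other entries are unchanged.
R["rs1", "rs2"]
R_flip["rs1", "rs2"]
```

If the LD matrix and the summary statistics count different alleles for the same SNP, the signs of the corresponding LD entries no longer agree with the signs of betahat.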

Genomic annotations

Annotation data are only required if you want to use RSS for enrichment analyses. The most statistician-friendly format of genomic annotation data might look like this:

 snp   chr   pos ann1 ann2 ann3
 rs1  chr2 52877    0    0    0
 rs2  chr1 50670    0    1    0
 rs3 chr14   854    0    1    1
 rs4  chr4 99620    1    1    1
 rs5 chr16 71537    0    0    0
 rs6 chr22 39741    0    0    0
 rs7  chr6 89331    1    0    0

where ann1, ann2 and ann3 are three types of annotations; 1 indicates that the SNP is annotated and 0 that it is not.

Alternatively, a list of annotated SNPs can be saved as a separate file. For example:

> cat ann3.txt
 snp   chr   pos
 rs3 chr14   854
 rs4  chr4 99620

> cat ann2.txt
 snp   chr   pos
 rs2  chr1 50670
 rs3 chr14   854
 rs4  chr4 99620

> cat ann1.txt
 snp  chr   pos
 rs4 chr4 99620
 rs7 chr6 89331
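The two formats carry the same information, and one can be converted into the other. The R sketch below rebuilds the indicator table from the three SNP lists above. To stay self-contained it holds the lists in memory; reading them from ann1.txt, ann2.txt and ann3.txt (for example with `read.table()`) is left out.

```r
# The seven SNPs from the table above.
snps <- data.frame(
  snp = paste0("rs", 1:7),
  chr = c("chr2", "chr1", "chr14", "chr4", "chr16", "chr22", "chr6"),
  pos = c(52877, 50670, 854, 99620, 71537, 39741, 89331),
  stringsAsFactors = FALSE
)

# SNP identifiers listed in ann1.txt, ann2.txt and ann3.txt.
ann_lists <- list(
  ann1 = c("rs4", "rs7"),
  ann2 = c("rs2", "rs3", "rs4"),
  ann3 = c("rs3", "rs4")
)

# One 0/1 indicator column per annotation.
ann_matrix <- sapply(ann_lists, function(ids) as.integer(snps$snp %in% ids))
ann_table <- cbind(snps, ann_matrix)
ann_table
```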

Sometimes the annotations are based on genes or genomic regions (e.g. biological pathways). For these annotations, it is easier to provide a list of annotated regions:

 ensembl_gene_id chromosome_name start_position end_position
 ENSG00000000938               1       27938575     27961788
 ENSG00000008438              19       46522411     46526323
 ENSG00000008516              16        3096682      3110727
 ENSG00000066336              11       47376411     47400127
 ENSG00000077984              20       24929866     24940564
 ENSG00000085265               9      137801431    137809809

For all these annotation files, please make sure that the SNP identifiers and physical positions ([snp, chr, pos] or [chromosome_name, start_position, end_position]) are consistent with [snp, chr, pos] in the summary statistics file, in particular that they are based on the same genome assembly.
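Region-based annotations can be turned into SNP-level indicators by checking which SNPs fall inside an annotated region. The R sketch below uses three of the gene regions listed above together with hypothetical SNPs (rsA, rsB, rsC are made-up identifiers and positions). It assumes that both tables use the same genome assembly and that a SNP is annotated only if it lies within the exact gene boundaries. Note that the chromosome labels differ between the two tables ("chr1" versus "1") and must be harmonized first.

```r
# Three of the gene regions listed above.
genes <- data.frame(
  ensembl_gene_id = c("ENSG00000000938", "ENSG00000008438", "ENSG00000008516"),
  chromosome_name = c("1", "19", "16"),
  start_position  = c(27938575, 46522411, 3096682),
  end_position    = c(27961788, 46526323, 3110727),
  stringsAsFactors = FALSE
)

# Hypothetical SNPs; identifiers and positions are made up for illustration.
snps <- data.frame(
  snp = c("rsA", "rsB", "rsC"),
  chr = c("chr1", "chr16", "chr16"),
  pos = c(27950000, 3100000, 5000000),
  stringsAsFactors = FALSE
)

# Harmonize chromosome labels: "chr1" -> "1".
snp_chr <- sub("^chr", "", snps$chr)

# A SNP is annotated if it lies inside at least one gene region
# on the same chromosome.
in_gene <- vapply(seq_len(nrow(snps)), function(i) {
  any(genes$chromosome_name == snp_chr[i] &
      genes$start_position <= snps$pos[i] &
      genes$end_position >= snps$pos[i])
}, logical(1))

snps$ann <- as.integer(in_gene)
snps
```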

