Last updated: 2021-03-05
This reproducible R Markdown analysis was created with workflowr (version 1.6.2). The Checks tab describes the reproducibility checks that were applied when the results were created. The Past versions tab lists the development history.
The results in this page were generated with repository version 06cfb1d. See the Past versions tab to see a history of the changes made to the R Markdown and HTML files.
These are the previous versions of the repository in which changes were made to the R Markdown (`rmd/input_data.Rmd`) and HTML (`docs/input_data.html`) files. If you’ve configured a remote Git repository (see `?wflow_git_remote`), click on the hyperlinks in the table below to view the files as they were in that past version.
File | Version | Author | Date | Message |
---|---|---|---|---|
Rmd | 06cfb1d | Xiang Zhu | 2021-03-05 | wflow_publish(“rmd/input_data.Rmd”) |
html | bab3f58 | Xiang Zhu | 2020-06-24 | Build site. |
html | 1daecfb | Xiang Zhu | 2020-06-23 | Build site. |
Rmd | 5782e93 | Xiang Zhu | 2020-06-23 | wflow_publish(“rmd/input_data.Rmd”) |
All RSS methods to date require the input of GWAS summary statistics and ancestry-matching LD estimates. Some RSS methods further require the input of genomic annotations.
This section is modified from Box 1 of Winkler (2014).
The following columns are required for any RSS analysis:

- `snp`: identifier of genetic variant, character string such as `rs12498742`;
- `chr`: autosome number of genetic variant such as `chr1`, …, `chr22`;
- `pos`: physical position, in base pair, of genetic variant;
- `a1`: effect allele, a single upper case character `A`, `C`, `G` or `T`;
- `a2`: the other (non-effect) allele, a single upper case character `A`, `C`, `G` or `T`;
- `betahat`: estimated effect size of genetic variant under the single-marker model;
- `se`: estimated standard error of `betahat`.

The following columns are optional, but they can be helpful for sanity checks:

- `strand`: strand on which the alleles are reported, a single character `-` or `+`;
- `n`: number of individuals analyzed (i.e., sample size) for the genetic variant;
- `maf`: minor allele frequency, numeric between 0 and 1;
- `p`: p-value of genetic variant association, numeric between 0 and 1;
- `info`: other information (e.g., imputation quality) about genetic variants.

It is crucial to make sure that `[a1, betahat, se]` are consistently defined. Below is a toy example. Consider two SNPs (`rs1`, `rs2`) and four individuals (`i1`, `i2`, `i3`, `i4`):
IND, i1, i2, i3, i4
rs1, AT, TT, AT, AA
rs2, CG, CC, GG, GC
If the effect alleles (`a1`) of these two SNPs are `A` and `G` respectively, then the genotype data of `rs1` are `X[, 1]=[1, 0, 1, 2]` and the genotype data of `rs2` are `X[, 2]=[1, 0, 2, 1]`. Further, the single-SNP summary statistics of `rs1` and `rs2` are generated as follows.
(betahat[1], se[1]) <- single.SNP.model(y, X[, 1])
(betahat[2], se[2]) <- single.SNP.model(y, X[, 2])
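The toy example can be made concrete with a short R sketch. The phenotype vector `y` below is made up for illustration, and `count_a1` and `single_snp_model` are hypothetical helper names; simple linear regression stands in for the single-marker model, which is an assumption rather than a statement of how any particular GWAS was run. The point of the sketch is that recoding a SNP with the other allele as `a1` flips the sign of `betahat` and leaves `se` unchanged, so `[a1, betahat, se]` must always be reported together.

```r
# Toy genotypes from the example (one entry per individual i1..i4).
geno <- data.frame(
  rs1 = c("AT", "TT", "AT", "AA"),
  rs2 = c("CG", "CC", "GG", "GC"),
  stringsAsFactors = FALSE
)

# Count copies of the effect allele a1 in each two-letter genotype.
count_a1 <- function(genotypes, a1) {
  vapply(strsplit(genotypes, ""),
         function(alleles) sum(alleles == a1),
         integer(1))
}

# Genotype matrix with a1 = A for rs1 and a1 = G for rs2.
X <- cbind(rs1 = count_a1(geno$rs1, "A"),
           rs2 = count_a1(geno$rs2, "G"))

# Hypothetical phenotype values, made up for illustration only.
y <- c(0.3, -1.2, 0.8, 1.9)

# Simple linear regression of y on one SNP; returns betahat and se.
single_snp_model <- function(y, x) {
  fit <- summary(lm(y ~ x))$coefficients
  c(betahat = fit["x", "Estimate"], se = fit["x", "Std. Error"])
}

stats_a1 <- single_snp_model(y, X[, "rs1"])

# Same SNP, but with T (instead of A) treated as the effect allele.
stats_flip <- single_snp_model(y, count_a1(geno$rs1, "T"))
```

Comparing `stats_a1` with `stats_flip` shows the two effect estimates differ only in sign, which is why a mismatch between `a1` and `betahat` silently reverses the direction of an association.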
Finally, when providing the `chr` and `pos` columns, please confirm which human genome assembly release they are based on. For example, if 1000 Genomes Project Phase 3 data are used to generate ancestry-matching LD estimates, then the `chr` and `pos` columns should be based on UCSC hg19/GRCh37.
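The column requirements listed above lend themselves to an automated check before any analysis is run. The following is a minimal sketch, not part of the RSS software: `check_sumstats` is a hypothetical helper name, the toy data frame is made up, and the checks cover only the rules stated in this section plus the fact that a standard error must be positive.

```r
required_cols <- c("snp", "chr", "pos", "a1", "a2", "betahat", "se")

# Returns a character vector of problems; an empty vector means
# that none of the checks below failed.
check_sumstats <- function(d) {
  missing_cols <- setdiff(required_cols, names(d))
  if (length(missing_cols) > 0) {
    stop("missing required columns: ", paste(missing_cols, collapse = ", "))
  }
  problems <- character(0)
  if (!all(d$chr %in% paste0("chr", 1:22))) {
    problems <- c(problems, "chr must be one of chr1, ..., chr22")
  }
  bases <- c("A", "C", "G", "T")
  if (!all(d$a1 %in% bases) || !all(d$a2 %in% bases)) {
    problems <- c(problems, "a1 and a2 must be a single upper case A, C, G or T")
  }
  if (any(!is.finite(d$betahat))) {
    problems <- c(problems, "betahat must be finite")
  }
  if (any(!is.finite(d$se) | d$se <= 0)) {
    problems <- c(problems, "se must be finite and positive")
  }
  problems
}

# Made-up summary statistics that satisfy the required format.
good <- data.frame(
  snp = c("rs1", "rs2"),
  chr = c("chr1", "chr22"),
  pos = c(1000, 2000),
  a1 = c("A", "G"),
  a2 = c("T", "C"),
  betahat = c(0.12, -0.05),
  se = c(0.05, 0.02),
  stringsAsFactors = FALSE
)

# The same data with two deliberate mistakes.
bad <- good
bad$chr[1] <- "1"  # missing the "chr" prefix
bad$se[2] <- 0     # a standard error cannot be zero

problems_good <- check_sumstats(good)
problems_bad <- check_sumstats(bad)
```

Such a check cannot detect a wrong genome assembly or a swapped effect allele; those still need to be confirmed against the documentation of the original GWAS.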
The ancestry-matching LD estimates are often derived from the phased haplotypes of the 1000 Genomes Project Phase 3 data. Because the 1000 Genomes data are publicly available, computing the LD estimates only requires the list of genetic variants, their physical positions and effect alleles (i.e., `[snp, chr, pos, a1]` from the summary statistics file).
The script `import_1000g_vcf.sh` illustrates how to extract phased haplotypes of select genetic variants from 1000 Genomes Phase 3 VCF format data and save them in IMPUTE reference-panel format `*.impute.hap`.
The scripts `get_corr.m` and `get_corr.R` illustrate how to compute LD estimates in MATLAB and R respectively.
If you have internal genotype data that can be used to estimate the LD matrix, you can first organize the genotype data in the same VCF format as the 1000 Genomes Phase 3 data, and then reuse my scripts above. Again, please make sure that the physical positions and effect alleles of the internal genotype data are consistent with the `[chr, pos, a1]` provided in the GWAS summary statistics file.
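To illustrate why the effect alleles must match, here is a rough sketch of an LD computation on a small made-up haplotype matrix. It assumes one row per SNP and one column per haplotype, coded 1 for the effect allele `a1` and 0 otherwise, and it uses the plain sample correlation from base R. This is an assumption for illustration only; `get_corr.m` and `get_corr.R` are not reproduced here and may use a different estimator.

```r
# Hypothetical phased haplotypes: rows are SNPs, columns are haplotypes.
# An entry is 1 if the haplotype carries the effect allele a1, else 0.
hap <- rbind(
  rs1 = c(0, 1, 0, 1, 1, 0, 0, 1),
  rs2 = c(0, 1, 0, 1, 0, 0, 1, 1),
  rs3 = c(1, 0, 1, 0, 0, 1, 1, 0)
)

# Sample correlation between SNPs (cor works on columns, hence t()).
R <- cor(t(hap))

# Recode rs1 as if the other allele had been declared the effect allele.
hap_flip <- hap
hap_flip["rs1", ] <- 1 - hap_flip["rs1", ]
R_flip <- cor(t(hap_flip))
```

Every off-diagonal entry involving `rs1` changes sign between `R` and `R_flip`. An LD matrix computed with a different `a1` than the one used for `betahat` therefore has the wrong sign pattern, even though each input looks valid on its own.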
The most statistician-friendly format of genomic annotation data might look like this:
snp chr pos ann1 ann2 ann3
rs1 chr2 52877 0 0 0
rs2 chr1 50670 0 1 0
rs3 chr14 854 0 1 1
rs4 chr4 99620 1 1 1
rs5 chr16 71537 0 0 0
rs6 chr22 39741 0 0 0
rs7 chr6 89331 1 0 0
where `ann1`, `ann2` and `ann3` are three types of annotations; `1` indicates that the SNP is annotated and `0` that it is not.
Alternatively, a list of annotated SNPs can be saved as a separate file. For example:
> cat ann3.txt
snp chr pos
rs3 chr14 854
rs4 chr4 99620
> cat ann2.txt
snp chr pos
rs2 chr1 50670
rs3 chr14 854
rs4 chr4 99620
> cat ann1.txt
snp chr pos
rs4 chr4 99620
rs7 chr6 89331
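The two layouts carry the same information, and one can be derived from the other. Below is a minimal sketch in R, assuming the wide annotation table has been read into a data frame named `ann` (a name chosen here for illustration); it builds one SNP list per annotation.

```r
# Wide annotation table: one row per SNP, one 0/1 column per annotation.
ann <- data.frame(
  snp = paste0("rs", 1:7),
  chr = c("chr2", "chr1", "chr14", "chr4", "chr16", "chr22", "chr6"),
  pos = c(52877, 50670, 854, 99620, 71537, 39741, 89331),
  ann1 = c(0, 0, 0, 1, 0, 0, 1),
  ann2 = c(0, 1, 1, 1, 0, 0, 0),
  ann3 = c(0, 0, 1, 1, 0, 0, 0),
  stringsAsFactors = FALSE
)

ann_cols <- c("ann1", "ann2", "ann3")

# For each annotation, keep only the SNPs marked with 1.
ann_lists <- lapply(ann_cols, function(a) {
  ann[ann[[a]] == 1, c("snp", "chr", "pos")]
})
names(ann_lists) <- ann_cols

# Each list could then be saved to its own file, for example:
# write.table(ann_lists$ann1, "ann1.txt", quote = FALSE, row.names = FALSE)
```

Here `ann_lists$ann1`, `ann_lists$ann2` and `ann_lists$ann3` hold the same SNPs as the files `ann1.txt`, `ann2.txt` and `ann3.txt` shown above.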
Sometimes the annotations are based on genes (e.g., biological pathways) or genomic regions (e.g., regulatory elements). For these region-based annotations, it is easier to provide a list of annotated regions as follows:
ensembl_gene_id chromosome_name start_position end_position
ENSG00000000938 1 27938575 27961788
ENSG00000008438 19 46522411 46526323
ENSG00000008516 16 3096682 3110727
ENSG00000066336 11 47376411 47400127
ENSG00000077984 20 24929866 24940564
ENSG00000085265 9 137801431 137809809
As with the LD estimates, please make sure that the physical positions (`[snp, chr, pos]` or `[chromosome_name, start_position, end_position]`) in the annotation file are consistent with `[snp, chr, pos]` in the summary statistics file.
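A region-based annotation is turned into a SNP-level annotation by checking which SNP positions fall inside an annotated region. The sketch below assumes that both files use the same genome assembly; the SNP identifiers and positions are hypothetical, and `in_any_region` is an illustrative helper name. Note that the region table writes chromosomes as `1`, `19`, ... while the summary statistics use `chr1`, `chr19`, ..., so the prefix has to be harmonized before comparing.

```r
# First three regions from the example region list.
regions <- data.frame(
  ensembl_gene_id = c("ENSG00000000938", "ENSG00000008438", "ENSG00000008516"),
  chromosome_name = c("1", "19", "16"),
  start_position = c(27938575, 46522411, 3096682),
  end_position = c(27961788, 46526323, 3110727),
  stringsAsFactors = FALSE
)

# Hypothetical SNPs, in the [snp, chr, pos] format of the summary statistics.
snps <- data.frame(
  snp = c("snpA", "snpB", "snpC"),
  chr = c("chr1", "chr16", "chr2"),
  pos = c(27940000, 3100000, 1000),
  stringsAsFactors = FALSE
)

# TRUE if a SNP lies inside at least one region on the same chromosome.
in_any_region <- function(chr, pos, regions) {
  chr_plain <- sub("^chr", "", chr)  # "chr1" -> "1" to match the region file
  vapply(seq_along(pos), function(i) {
    any(regions$chromosome_name == chr_plain[i] &
          regions$start_position <= pos[i] &
          regions$end_position >= pos[i])
  }, logical(1))
}

snps$annotated <- as.integer(in_any_region(snps$chr, snps$pos, regions))
```

The resulting `annotated` column uses the same 0/1 coding as the SNP-level annotation table. If the two files were based on different assemblies, the comparison would still run but the overlaps would be meaningless.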
devtools::session_info()
─ Session info ───────────────────────────────────────────────────────────────
setting value
version R version 4.0.4 (2021-02-15)
os macOS Big Sur 10.16
system x86_64, darwin17.0
ui X11
language (EN)
collate en_US.UTF-8
ctype en_US.UTF-8
tz America/New_York
date 2021-03-05
─ Packages ───────────────────────────────────────────────────────────────────
package * version date lib source
assertthat 0.2.1 2019-03-21 [1] CRAN (R 4.0.2)
cachem 1.0.4 2021-02-13 [1] CRAN (R 4.0.2)
callr 3.5.1 2020-10-13 [1] CRAN (R 4.0.2)
cli 2.3.1 2021-02-23 [1] CRAN (R 4.0.2)
crayon 1.4.1 2021-02-08 [1] CRAN (R 4.0.2)
desc 1.2.0 2018-05-01 [1] CRAN (R 4.0.2)
devtools 2.3.2 2020-09-18 [1] CRAN (R 4.0.2)
digest 0.6.27 2020-10-24 [1] CRAN (R 4.0.2)
ellipsis 0.3.1 2020-05-15 [1] CRAN (R 4.0.2)
evaluate 0.14 2019-05-28 [1] CRAN (R 4.0.1)
fansi 0.4.2 2021-01-15 [1] CRAN (R 4.0.2)
fastmap 1.1.0 2021-01-25 [1] CRAN (R 4.0.2)
fs 1.5.0 2020-07-31 [1] CRAN (R 4.0.2)
git2r 0.28.0 2021-01-10 [1] CRAN (R 4.0.2)
glue 1.4.2 2020-08-27 [1] CRAN (R 4.0.2)
htmltools 0.5.1.1 2021-01-22 [1] CRAN (R 4.0.2)
httpuv 1.5.5 2021-01-13 [1] CRAN (R 4.0.2)
knitr 1.31 2021-01-27 [1] CRAN (R 4.0.2)
later 1.1.0.1 2020-06-05 [1] CRAN (R 4.0.2)
lifecycle 1.0.0 2021-02-15 [1] CRAN (R 4.0.2)
magrittr 2.0.1 2020-11-17 [1] CRAN (R 4.0.2)
memoise 2.0.0 2021-01-26 [1] CRAN (R 4.0.2)
pillar 1.5.0 2021-02-22 [1] CRAN (R 4.0.2)
pkgbuild 1.2.0 2020-12-15 [1] CRAN (R 4.0.2)
pkgconfig 2.0.3 2019-09-22 [1] CRAN (R 4.0.2)
pkgload 1.2.0 2021-02-23 [1] CRAN (R 4.0.2)
prettyunits 1.1.1 2020-01-24 [1] CRAN (R 4.0.2)
processx 3.4.5 2020-11-30 [1] CRAN (R 4.0.2)
promises 1.2.0.1 2021-02-11 [1] CRAN (R 4.0.2)
ps 1.6.0 2021-02-28 [1] CRAN (R 4.0.4)
purrr 0.3.4 2020-04-17 [1] CRAN (R 4.0.2)
R6 2.5.0 2020-10-28 [1] CRAN (R 4.0.2)
Rcpp 1.0.6 2021-01-15 [1] CRAN (R 4.0.2)
remotes 2.2.0 2020-07-21 [1] CRAN (R 4.0.2)
rlang 0.4.10 2020-12-30 [1] CRAN (R 4.0.2)
rmarkdown 2.7 2021-02-19 [1] CRAN (R 4.0.2)
rprojroot 2.0.2 2020-11-15 [1] CRAN (R 4.0.2)
sessioninfo 1.1.1 2018-11-05 [1] CRAN (R 4.0.2)
stringi 1.5.3 2020-09-09 [1] CRAN (R 4.0.2)
stringr 1.4.0 2019-02-10 [1] CRAN (R 4.0.2)
testthat 3.0.2 2021-02-14 [1] CRAN (R 4.0.2)
tibble 3.1.0 2021-02-25 [1] CRAN (R 4.0.2)
usethis 2.0.1 2021-02-10 [1] CRAN (R 4.0.2)
utf8 1.1.4 2018-05-24 [1] CRAN (R 4.0.2)
vctrs 0.3.6 2020-12-17 [1] CRAN (R 4.0.2)
whisker 0.4 2019-08-28 [1] CRAN (R 4.0.2)
withr 2.4.1 2021-01-26 [1] CRAN (R 4.0.2)
workflowr * 1.6.2 2020-04-30 [1] CRAN (R 4.0.2)
xfun 0.21 2021-02-10 [1] CRAN (R 4.0.2)
yaml 2.2.1 2020-02-01 [1] CRAN (R 4.0.2)
[1] /Library/Frameworks/R.framework/Versions/4.0/Resources/library