Last updated: 2022-08-18
Checks: 6 1
Knit directory: RatXcan_Training/
This reproducible R Markdown analysis was created with workflowr (version 1.6.2). The Checks tab describes the reproducibility checks that were applied when the results were created. The Past versions tab lists the development history.
The R Markdown file has unstaged changes. To know which version of
the R Markdown file created these results, you’ll want to first commit
it to the Git repo. If you’re still working on the analysis, you can
ignore this warning. When you’re finished, you can run
wflow_publish
to commit the R Markdown file and build the
HTML.
Great job! The global environment was empty. Objects defined in the global environment can affect the analysis in your R Markdown file in unknown ways. For reproduciblity it’s best to always run the code in an empty environment.
The command set.seed(20220711)
was run prior to running
the code in the R Markdown file. Setting a seed ensures that any results
that rely on randomness, e.g. subsampling or permutations, are
reproducible.
Great job! Recording the operating system, R version, and package versions is critical for reproducibility.
Nice! There were no cached chunks for this analysis, so you can be confident that you successfully produced the results during this run.
Great job! Using relative paths to the files within your workflowr project makes it easier to run your code on other machines.
Great! You are using Git for version control. Tracking code development and connecting the code version to the results is critical for reproducibility.
The results in this page were generated with repository version f65b5a1. See the Past versions tab to see a history of the changes made to the R Markdown and HTML files.
Note that you need to be careful to ensure that all relevant files for
the analysis have been committed to Git prior to generating the results
(you can use wflow_publish
or
wflow_git_commit
). workflowr only checks the R Markdown
file, but you know if there are other scripts or data files that it
depends on. Below is the status of the Git repository when the results
were generated:
Ignored files:
Ignored: .Rhistory
Ignored: .Rproj.user/
Ignored: analysis/.Rhistory
Untracked files:
Untracked: .DS_Store
Untracked: .gitignore
Untracked: README.html
Untracked: analysis/.DS_Store
Untracked: rsconnect/
Unstaged changes:
Modified: analysis/GEMMA_BSLMM.Rmd
Deleted: output/README.md
Staged changes:
New: code/remove_trailing_tab.sh
Note that any generated files, e.g. HTML, png, CSS, etc., are not included in this status report because it is ok for generated content to have uncommitted changes.
These are the previous versions of the repository in which changes were
made to the R Markdown (analysis/GEMMA_BSLMM.Rmd
) and HTML
(docs/GEMMA_BSLMM.html
) files. If you’ve configured a
remote Git repository (see ?wflow_git_remote
), click on the
hyperlinks in the table below to view the files as they were in that
past version.
File | Version | Author | Date | Message |
---|---|---|---|---|
Rmd | f65b5a1 | sabrina-mi | 2022-08-10 | wflow_rename("analysis/Rat_Heritability.Rmd", "analysis/GEMMA_BSLMM.Rmd") |
GEMMA generates heritability and predictability values for a given phenotype. In our analysis, we use gene expression values in place of a phenotype to estimate the heritability. It takes genotype, gene expression, and an estimated relatedness matrix as input.
The “bimbam” file we are writing does not exactly follow the BIMBAM mean genotype file format, but it contains all the necessary information to create subseted BIMBAM files for estimating the genetic relatedness matrix (GRM).
library(tidyverse)
"%&%" = function(a,b) paste(a,b,sep="")
load("~/Box/imlab-data/Projects/PTRS-PGRS-Rosetta/Data-From-Abe-Palmer-Lab/Rdata/genoGex.RData")
bim.dir <- "/Users/sabrinami/Github/Rat_Genomics_Paper_Pipeline/data/bimbam/"
bimfile <- bim.dir %&% "Ac_bimbam"
geno_Ac = geno[,match(colnames(gexAc)[-1], colnames(geno))]
snps <- paste(phyMap$Chr, phyMap$Pos, phyMap$Ref, phyMap$Alt, sep = "_")
Ac_bimbam <- cbind(phyMap$Chr, phyMap$Pos, snps, phyMap$Ref, phyMap$Alt, geno_Ac)
write.table(Ac_bimbam, file = bimfile,quote=F,col.names=F,row.names=F)
For each gene, we subset the initial “bimbam” file to cis eQTL variants for the genotype input and use gene expression values for the phenotype input. The output is a GRM estimating relatedness between the rats. BSLMM takes the GRM as input to control for individaul relatedness in association tests between genetic variance and gene expression.
pheno.dir <- "/Users/sabrinami/Github/Rat_Genomics_Paper_Pipeline/GEMMA/Ac/phenotype_files/"
ge.dir <- "/Users/sabrinami/Github/Rat_Genomics_Paper_Pipeline/GEMMA/Ac/genotype_files/"
grm.dir <- "/Users/sabrinami/Github/Rat_Genomics_Paper_Pipeline/GEMMA/Ac/output/"
bim.dir <- "/Users/sabrinami/Github/Rat_Genomics_Paper_Pipeline/data/bimbam/"
#Read in bimbam file
bimfile <- bim.dir %&% "Ac_bimbam" ###get SNP position information###
bim <- read.table(bimfile)
ensidlist <- gexAc$EnsemblGeneID
setwd("/Users/sabrinami/Github/Rat_Genomics_Paper_Pipeline/GEMMA/Ac/")
for(i in 1:length(ensidlist)){
cat(i,"/",length(ensidlist),"\n")
gene <- ensidlist[i]
geneinfo <- gtf[match(gene, gtf$Gene),]
chr <- geneinfo[1]
c <- chr$Chr
start <- geneinfo$Start - 1e6 ### 1Mb lower bound for cis-eQTLS
end <- geneinfo$End + 1e6 ### 1Mb upper bound for cis-eQTLs
chrsnps <- subset(bim, bim[,1]==c) ### pull snps on same chr
cissnps <- subset(chrsnps,chrsnps[,2]>=start & chrsnps[,2]<=end) ### pull cis-SNP info
snplist <- cissnps[,3:ncol(cissnps)]
write.table(snplist, file= ge.dir %&% "tmp.Ac.geno" %&% gene, quote=F,col.names=F,row.names=F)
geneexp <- cbind(as.numeric(gexAc[i,-1]))
write.table(geneexp, file= pheno.dir %&% "tmp.pheno." %&% gene, col.names=F, row.names = F, quote=F) #output pheno for gemma
runGEMMAgrm <- "gemma -g " %&% ge.dir %&% "tmp.Ac.geno" %&% gene %&% " -p " %&% pheno.dir %&% "tmp.pheno." %&% gene %&% " -gk -o /grm_Ac_" %&% gene
system(runGEMMAgrm)
}
GEMMA creates a new folder, output
, in the working
directory the GRMs. They should have file extension
cXX.txt
. When we run BSLMM, GEMMA writes to the same
output
folder, so we rename to separate the two.
mv /Users/sabrinami/Github/Rat_Genomics_Paper_Pipeline/GEMMA/Ac/output /Users/sabrinami/Github/Rat_Genomics_Paper_Pipeline/GEMMA/Ac/GRMs
Running BSLMM is more computationally expensive than estimating GRMs, but parallelizable if we run it in CRI. We use Alvaro’s Badger script to create a PBS batch job for each gene.
cd /gpfs/data/im-lab/nas40t2/Github/badger
/gpfs/data/im-lab/nas40t2/lab_software/miniconda/envs/ukb_env/bin/python src/badger.py \
-yaml_configuration_file /gpfs/data/im-lab/nas40t2/Github/RatXcan_Training/code/GEMMA_submission.yaml \
-parsimony 9
There should be a folder
/gpfs/data/im-lab/nas40t2/natasha/rat_genomics/GEMMA/Ac/jobs
that contains all the job scripts. CRI has a job limit of 10,000, but we
have about 15,000 genes. We can work around it by repeating the chunk
below.
cd /gpfs/data/im-lab/nas40t2/natasha/rat_genomics/GEMMA/Ac
for job in jobs/gemma*; do
if (($(qstat | wc -l) > 10000)); then
break
else
qsub $job
mv $job completed_jobs/
fi
done
GEMMA creates
The BSLMM output is a table with columns: h
,
pve
, rho
, pge
, pi
,
n_gamma
. PVE is the proportion variance explained by causal
variants and PGE is the proportion of genetic variance explained by
sparse effects. We interpret them as estimates for heritability and
sparsity, respectively.
For some bizarre reason, the output table contains an extra tab at
the end of each row, so we need to trim it in order for R to be able to
parse it. We use a simple shell script, /gpfs/data/im-lab/nas40t2/Github/RatXcan_Training/code/remove_trailing_tab.sh
.
cd /gpfs/data/im-lab/nas40t2/Github/RatXcan_Training/code
chmod +x remove_trailing_tab.sh
DIR=/gpfs/data/im-lab/nas40t2/natasha/rat_genomics/GEMMA/Ac
mkdir $DIR/hyp
for file in $DIR/output/*.hyp.txt; do
./remove_trailing_tab.sh $file
done
Now we should have a new directory, hyp
, that contains
all of the .hyp.txt
files in the correct format.
ge.dir <- "/gpfs/data/im-lab/nas40t2/natasha/rat_genomics/GEMMA/Ac/hyp"
files <- list.files(path = ge.dir, pattern = ".hyp.txt", full.names = TRUE)
Each gene results in whole columns of PVE values, so we summarize the
distribution in a new dataframe. (BSLMM employs the Metropolis-Hastings
algorithm, a way to estimate the probability distributions of PVE and
PGE given the observed data. Under a very, very loose interpretation,
BSLMM simulates different values of the parameters, computes the
probability they result in our observed data, and repeats until it
settles on the most likely sampled value of the parameters. Each row in
the .hyp.txt
file represents an iteration of the M-H
algorithm.)
library(stringr)
library(LearnBayes)
PVE_df <- as.data.frame(matrix(NA, 0, 4))
for(i in 1:length(files)){
gene <- str_extract(files[i], "ENSRNOG([\\d]+)")
df <- read.delim(files[i])
q1 <- quantile(df$pve, 0.1)
q2 <- quantile(df$pve, 0.9)
quantile1=list(p=.1 ,x=q1)
quantile2=list(p=.9, x=q2)
if(quantile1$x != quantile2$x) {
prior <- beta.select(quantile1, quantile2)
credible_set <- list(qbeta(0.025,prior[1], prior[2]), qbeta(0.975,prior[1], prior[2]))
PVE_df[i, 1] <- gene
PVE_df[i, 2] <- qbeta(0.5, prior[1], prior[2])
PVE_df[i, 3] <- credible_set[1]
PVE_df[i, 4] <- credible_set[2]
}
else
i = i+1
}
write_tsv(PVE_df, "/gpfs/data/im-lab/nas40t2/natasha/rat_genomics/GEMMA/Ac/PVE_estimates.txt", col_names = FALSE )
And repeat for PGE.
PGE_df <- as.data.frame(matrix(NA, 0, 4))
for(i in 1:length(files)){
gene <- str_extract(files[i], "ENSRNOG([\\d]+)")
df <- read.delim(files[i])
q1 <- quantile(df$pge, 0.5)
q2 <- quantile(df$pge, 0.9)
quantile1=list(p=.5 ,x=q1)
quantile2=list(p=.9, x=q2)
if(quantile1$x != quantile2$x) {
prior <- beta.select(quantile1, quantile2)
credible_set <- list(qbeta(0.025,prior[1], prior[2]), qbeta(0.975,prior[1], prior[2]))
PGE_df[i, 1] <- gene
PGE_df[i, 2] <- qbeta(0.5, prior[1], prior[2])
PGE_df[i, 3] <- credible_set[1]
PGE_df[i, 4] <- credible_set[2]
}
else
i = i+1
}
write_tsv(PGE_df, "/gpfs/data/im-lab/nas40t2/natasha/rat_genomics/GEMMA/Ac/PGE_estimates.txt", col_names = FALSE )
sessionInfo()
R version 4.0.3 (2020-10-10)
Platform: x86_64-apple-darwin17.0 (64-bit)
Running under: macOS Big Sur 10.16
Matrix products: default
BLAS: /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRblas.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRlapack.dylib
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
attached base packages:
[1] stats graphics grDevices utils datasets methods base
loaded via a namespace (and not attached):
[1] Rcpp_1.0.8.3 rstudioapi_0.11 whisker_0.4 knitr_1.39
[5] magrittr_1.5 workflowr_1.6.2 R6_2.4.1 rlang_1.0.2
[9] fastmap_1.1.0 stringr_1.4.0 tools_4.0.3 xfun_0.31
[13] cli_3.3.0 git2r_0.27.1 jquerylib_0.1.4 htmltools_0.5.2
[17] ellipsis_0.3.2 rprojroot_1.3-2 yaml_2.2.1 digest_0.6.27
[21] tibble_3.0.4 lifecycle_0.2.0 crayon_1.3.4 later_1.1.0.1
[25] sass_0.4.1 vctrs_0.4.1 fs_1.5.0 promises_1.1.1
[29] glue_1.6.2 evaluate_0.15 rmarkdown_2.14 stringi_1.5.3
[33] compiler_4.0.3 bslib_0.3.1 pillar_1.4.6 backports_1.1.10
[37] jsonlite_1.7.1 httpuv_1.5.4 pkgconfig_2.0.3