Last updated: 2022-08-18

Checks: 6 1

Knit directory: RatXcan_Training/

This reproducible R Markdown analysis was created with workflowr (version 1.6.2). The Checks tab describes the reproducibility checks that were applied when the results were created. The Past versions tab lists the development history.


The R Markdown file has unstaged changes. To know which version of the R Markdown file created these results, you’ll want to first commit it to the Git repo. If you’re still working on the analysis, you can ignore this warning. When you’re finished, you can run wflow_publish to commit the R Markdown file and build the HTML.

Great job! The global environment was empty. Objects defined in the global environment can affect the analysis in your R Markdown file in unknown ways. For reproduciblity it’s best to always run the code in an empty environment.

The command set.seed(20220711) was run prior to running the code in the R Markdown file. Setting a seed ensures that any results that rely on randomness, e.g. subsampling or permutations, are reproducible.

Great job! Recording the operating system, R version, and package versions is critical for reproducibility.

Nice! There were no cached chunks for this analysis, so you can be confident that you successfully produced the results during this run.

Great job! Using relative paths to the files within your workflowr project makes it easier to run your code on other machines.

Great! You are using Git for version control. Tracking code development and connecting the code version to the results is critical for reproducibility.

The results in this page were generated with repository version f65b5a1. See the Past versions tab to see a history of the changes made to the R Markdown and HTML files.

Note that you need to be careful to ensure that all relevant files for the analysis have been committed to Git prior to generating the results (you can use wflow_publish or wflow_git_commit). workflowr only checks the R Markdown file, but you know if there are other scripts or data files that it depends on. Below is the status of the Git repository when the results were generated:


Ignored files:
    Ignored:    .Rhistory
    Ignored:    .Rproj.user/
    Ignored:    analysis/.Rhistory

Untracked files:
    Untracked:  .DS_Store
    Untracked:  .gitignore
    Untracked:  README.html
    Untracked:  analysis/.DS_Store
    Untracked:  rsconnect/

Unstaged changes:
    Modified:   analysis/GEMMA_BSLMM.Rmd
    Deleted:    output/README.md

Staged changes:
    New:        code/remove_trailing_tab.sh

Note that any generated files, e.g. HTML, png, CSS, etc., are not included in this status report because it is ok for generated content to have uncommitted changes.


These are the previous versions of the repository in which changes were made to the R Markdown (analysis/GEMMA_BSLMM.Rmd) and HTML (docs/GEMMA_BSLMM.html) files. If you’ve configured a remote Git repository (see ?wflow_git_remote), click on the hyperlinks in the table below to view the files as they were in that past version.

File Version Author Date Message
Rmd f65b5a1 sabrina-mi 2022-08-10 wflow_rename("analysis/Rat_Heritability.Rmd", "analysis/GEMMA_BSLMM.Rmd")

GEMMA generates heritability and predictability values for a given phenotype. In our analysis, we use gene expression values in place of a phenotype to estimate the heritability. It takes genotype, gene expression, and an estimated relatedness matrix as input.

Format Genotype Data

The “bimbam” file we are writing does not exactly follow the BIMBAM mean genotype file format, but it contains all the necessary information to create subseted BIMBAM files for estimating the genetic relatedness matrix (GRM).

library(tidyverse)
"%&%" = function(a,b) paste(a,b,sep="")

load("~/Box/imlab-data/Projects/PTRS-PGRS-Rosetta/Data-From-Abe-Palmer-Lab/Rdata/genoGex.RData")

bim.dir <- "/Users/sabrinami/Github/Rat_Genomics_Paper_Pipeline/data/bimbam/"
bimfile <- bim.dir %&% "Ac_bimbam"

geno_Ac = geno[,match(colnames(gexAc)[-1], colnames(geno))]
snps <- paste(phyMap$Chr, phyMap$Pos, phyMap$Ref, phyMap$Alt, sep = "_")
Ac_bimbam <- cbind(phyMap$Chr, phyMap$Pos, snps, phyMap$Ref, phyMap$Alt, geno_Ac)
write.table(Ac_bimbam, file = bimfile,quote=F,col.names=F,row.names=F)

Generate GRMs

For each gene, we subset the initial “bimbam” file to cis eQTL variants for the genotype input and use gene expression values for the phenotype input. The output is a GRM estimating relatedness between the rats. BSLMM takes the GRM as input to control for individaul relatedness in association tests between genetic variance and gene expression.

pheno.dir <- "/Users/sabrinami/Github/Rat_Genomics_Paper_Pipeline/GEMMA/Ac/phenotype_files/"
ge.dir <- "/Users/sabrinami/Github/Rat_Genomics_Paper_Pipeline/GEMMA/Ac/genotype_files/"
grm.dir <- "/Users/sabrinami/Github/Rat_Genomics_Paper_Pipeline/GEMMA/Ac/output/"
bim.dir <- "/Users/sabrinami/Github/Rat_Genomics_Paper_Pipeline/data/bimbam/"
#Read in bimbam file
bimfile <- bim.dir %&% "Ac_bimbam" ###get SNP position information###

bim <- read.table(bimfile)
ensidlist <- gexAc$EnsemblGeneID
setwd("/Users/sabrinami/Github/Rat_Genomics_Paper_Pipeline/GEMMA/Ac/")
for(i in 1:length(ensidlist)){
  cat(i,"/",length(ensidlist),"\n")
  gene <- ensidlist[i]
  geneinfo <- gtf[match(gene, gtf$Gene),]
  chr <- geneinfo[1]
  c <- chr$Chr
  start <- geneinfo$Start - 1e6 ### 1Mb lower bound for cis-eQTLS
  end <- geneinfo$End + 1e6 ### 1Mb upper bound for cis-eQTLs
  chrsnps <- subset(bim, bim[,1]==c) ### pull snps on same chr
  cissnps <- subset(chrsnps,chrsnps[,2]>=start & chrsnps[,2]<=end) ### pull cis-SNP info
  snplist <- cissnps[,3:ncol(cissnps)]
  write.table(snplist, file= ge.dir %&% "tmp.Ac.geno" %&% gene, quote=F,col.names=F,row.names=F)
  geneexp <- cbind(as.numeric(gexAc[i,-1]))
  write.table(geneexp, file= pheno.dir %&% "tmp.pheno." %&% gene, col.names=F, row.names = F, quote=F) #output pheno for gemma
  runGEMMAgrm <- "gemma -g " %&%  ge.dir %&% "tmp.Ac.geno" %&% gene %&% " -p " %&% pheno.dir %&% "tmp.pheno." %&%  gene  %&%  " -gk -o /grm_Ac_" %&% gene
  system(runGEMMAgrm)

}

GEMMA creates a new folder, output, in the working directory the GRMs. They should have file extension cXX.txt. When we run BSLMM, GEMMA writes to the same output folder, so we rename to separate the two.

mv /Users/sabrinami/Github/Rat_Genomics_Paper_Pipeline/GEMMA/Ac/output /Users/sabrinami/Github/Rat_Genomics_Paper_Pipeline/GEMMA/Ac/GRMs

Run BSLMM

Running BSLMM is more computationally expensive than estimating GRMs, but parallelizable if we run it in CRI. We use Alvaro’s Badger script to create a PBS batch job for each gene.

cd /gpfs/data/im-lab/nas40t2/Github/badger
/gpfs/data/im-lab/nas40t2/lab_software/miniconda/envs/ukb_env/bin/python src/badger.py \
-yaml_configuration_file /gpfs/data/im-lab/nas40t2/Github/RatXcan_Training/code/GEMMA_submission.yaml \
-parsimony 9

There should be a folder /gpfs/data/im-lab/nas40t2/natasha/rat_genomics/GEMMA/Ac/jobs that contains all the job scripts. CRI has a job limit of 10,000, but we have about 15,000 genes. We can work around it by repeating the chunk below.

cd /gpfs/data/im-lab/nas40t2/natasha/rat_genomics/GEMMA/Ac
for job in jobs/gemma*; do
if  (($(qstat | wc -l) > 10000)); then
break
else
qsub $job
mv $job completed_jobs/
fi
done

Read Heritability and Sparsity Estimates

GEMMA creates

The BSLMM output is a table with columns: h, pve, rho, pge, pi, n_gamma. PVE is the proportion variance explained by causal variants and PGE is the proportion of genetic variance explained by sparse effects. We interpret them as estimates for heritability and sparsity, respectively.

For some bizarre reason, the output table contains an extra tab at the end of each row, so we need to trim it in order for R to be able to parse it. We use a simple shell script, /gpfs/data/im-lab/nas40t2/Github/RatXcan_Training/code/remove_trailing_tab.sh.

cd /gpfs/data/im-lab/nas40t2/Github/RatXcan_Training/code
chmod +x remove_trailing_tab.sh
DIR=/gpfs/data/im-lab/nas40t2/natasha/rat_genomics/GEMMA/Ac
mkdir $DIR/hyp
for file in $DIR/output/*.hyp.txt; do
  ./remove_trailing_tab.sh $file
done

Now we should have a new directory, hyp, that contains all of the .hyp.txt files in the correct format.

ge.dir <- "/gpfs/data/im-lab/nas40t2/natasha/rat_genomics/GEMMA/Ac/hyp"
files <- list.files(path = ge.dir, pattern = ".hyp.txt", full.names = TRUE)

Each gene results in whole columns of PVE values, so we summarize the distribution in a new dataframe. (BSLMM employs the Metropolis-Hastings algorithm, a way to estimate the probability distributions of PVE and PGE given the observed data. Under a very, very loose interpretation, BSLMM simulates different values of the parameters, computes the probability they result in our observed data, and repeats until it settles on the most likely sampled value of the parameters. Each row in the .hyp.txt file represents an iteration of the M-H algorithm.)

library(stringr)
library(LearnBayes)

PVE_df <- as.data.frame(matrix(NA, 0, 4)) 

for(i in 1:length(files)){
  gene <- str_extract(files[i], "ENSRNOG([\\d]+)")
  df <- read.delim(files[i])
  
  q1 <- quantile(df$pve, 0.1)
  q2 <- quantile(df$pve, 0.9)
  quantile1=list(p=.1 ,x=q1)
  quantile2=list(p=.9, x=q2)
  if(quantile1$x != quantile2$x) {
  prior <- beta.select(quantile1, quantile2)
  credible_set <- list(qbeta(0.025,prior[1],  prior[2]), qbeta(0.975,prior[1],  prior[2]))
  
  PVE_df[i, 1] <- gene
  PVE_df[i, 2] <- qbeta(0.5, prior[1],  prior[2])
  PVE_df[i, 3] <- credible_set[1]
  PVE_df[i, 4] <- credible_set[2]
  }
  else 
    i = i+1
}

write_tsv(PVE_df, "/gpfs/data/im-lab/nas40t2/natasha/rat_genomics/GEMMA/Ac/PVE_estimates.txt", col_names = FALSE )

And repeat for PGE.

PGE_df <- as.data.frame(matrix(NA, 0, 4))
for(i in 1:length(files)){
  gene <- str_extract(files[i], "ENSRNOG([\\d]+)")
  df <- read.delim(files[i])
  
  q1 <- quantile(df$pge, 0.5)
  q2 <- quantile(df$pge, 0.9)
  quantile1=list(p=.5 ,x=q1)
  quantile2=list(p=.9, x=q2)
  if(quantile1$x != quantile2$x) {
  prior <- beta.select(quantile1, quantile2)
  credible_set <- list(qbeta(0.025,prior[1],  prior[2]), qbeta(0.975,prior[1],  prior[2]))
  
  PGE_df[i, 1] <- gene
  PGE_df[i, 2] <- qbeta(0.5, prior[1],  prior[2])
  PGE_df[i, 3] <- credible_set[1]
  PGE_df[i, 4] <- credible_set[2]
  }
  else 
    i = i+1
}

write_tsv(PGE_df, "/gpfs/data/im-lab/nas40t2/natasha/rat_genomics/GEMMA/Ac/PGE_estimates.txt", col_names = FALSE )

sessionInfo()
R version 4.0.3 (2020-10-10)
Platform: x86_64-apple-darwin17.0 (64-bit)
Running under: macOS Big Sur 10.16

Matrix products: default
BLAS:   /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRblas.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.8.3     rstudioapi_0.11  whisker_0.4      knitr_1.39      
 [5] magrittr_1.5     workflowr_1.6.2  R6_2.4.1         rlang_1.0.2     
 [9] fastmap_1.1.0    stringr_1.4.0    tools_4.0.3      xfun_0.31       
[13] cli_3.3.0        git2r_0.27.1     jquerylib_0.1.4  htmltools_0.5.2 
[17] ellipsis_0.3.2   rprojroot_1.3-2  yaml_2.2.1       digest_0.6.27   
[21] tibble_3.0.4     lifecycle_0.2.0  crayon_1.3.4     later_1.1.0.1   
[25] sass_0.4.1       vctrs_0.4.1      fs_1.5.0         promises_1.1.1  
[29] glue_1.6.2       evaluate_0.15    rmarkdown_2.14   stringi_1.5.3   
[33] compiler_4.0.3   bslib_0.3.1      pillar_1.4.6     backports_1.1.10
[37] jsonlite_1.7.1   httpuv_1.5.4     pkgconfig_2.0.3