Last updated: 2020-10-14
Checks: 7 passed, 0 failed
Knit directory: GradLog/
This reproducible R Markdown analysis was created with workflowr (version 1.6.2). The Checks tab describes the reproducibility checks that were applied when the results were created. The Past versions tab lists the development history.
Great! Since the R Markdown file has been committed to the Git repository, you know the exact version of the code that produced these results.
Great job! The global environment was empty. Objects defined in the global environment can affect the analysis in your R Markdown file in unknown ways. For reproducibility it’s best to always run the code in an empty environment.
The command set.seed(20201014) was run prior to running the code in the R Markdown file. Setting a seed ensures that any results that rely on randomness, e.g. subsampling or permutations, are reproducible.
Great job! Recording the operating system, R version, and package versions is critical for reproducibility.
Nice! There were no cached chunks for this analysis, so you can be confident that you successfully produced the results during this run.
Great job! Using relative paths to the files within your workflowr project makes it easier to run your code on other machines.
Great! You are using Git for version control. Tracking code development and connecting the code version to the results is critical for reproducibility.
The results in this page were generated with repository version 0e7a8fb. See the Past versions tab to see a history of the changes made to the R Markdown and HTML files.
Note that you need to be careful to ensure that all relevant files for the analysis have been committed to Git prior to generating the results (you can use wflow_publish or wflow_git_commit). workflowr only checks the R Markdown file, but you know if there are other scripts or data files that it depends on. Below is the status of the Git repository when the results were generated:
Ignored files:
Ignored: .DS_Store
Ignored: .Rproj.user/
Ignored: analysis/.DS_Store
Ignored: analysis/image.png
Ignored: code/.DS_Store
Ignored: data/.DS_Store
Ignored: output/.DS_Store
Untracked files:
Untracked: code/coexpressed.gene.network.R
Untracked: code/fdr.R
Untracked: code/image.png
Untracked: code/make_geno.R
Untracked: code/make_geno.sh
Untracked: code/p.null.R
Untracked: code/regress_out_covariates.R
Untracked: code/remove.genes.R
Untracked: code/z.null.R
Untracked: data/image.png
Note that any generated files, e.g. HTML, png, CSS, etc., are not included in this status report because it is ok for generated content to have uncommitted changes.
These are the previous versions of the repository in which changes were made to the R Markdown (analysis/Trans.Rmd) and HTML (docs/Trans.html) files. If you’ve configured a remote Git repository (see ?wflow_git_remote), click on the hyperlinks in the table below to view the files as they were in that past version.
| File | Version | Author | Date | Message |
|---|---|---|---|---|
| html | 4382474 | Lili Wang | 2020-10-14 | Build site. |
| html | f79b80d | Lili Wang | 2020-10-14 | Build site. |
| Rmd | d7821ad | Lili Wang | 2020-10-14 | wflow_publish("./analysis/*.Rmd") |
module load plink
for CHR in `seq 22`
do
# convert bfiles to .raw files
plink \
--bfile /project2/xuanyao/data/DGN/genotype/chr${CHR}_QCed \
--recodeA \
--out /project2/xuanyao/data/DGN/txt_file_of_DGN/chr${CHR}
# extract genotype meta
awk \
'BEGIN{print "id\tchr\tpos"}{printf("%s\tchr%s\t%s\n",$2,$1,$4)}' \
/project2/xuanyao/data/DGN/genotype/chr${CHR}_QCed.bim > \
/project2/xuanyao/data/DGN/txt_file_of_DGN/chr${CHR}_genotype_meta.txt
# convert genotype to txt file
Rscript --no-restore --no-save /project2/xuanyao/data/DGN/make_geno.R $CHR
done
make_geno.R
rm(list = ls())
args = commandArgs(trailing = T)
chr = args[1]
dat = read.table(paste("/project2/xuanyao/data/DGN/txt_file_of_DGN/chr",chr,".raw",sep=""),header=T,check.names=F)
geno = as.matrix(dat[,7:ncol(dat)])
rownames(geno) = as.character(dat$IID)
colnames(geno) = unlist(lapply(strsplit(colnames(dat)[7:ncol(dat)],"_"),"[",1))
write.table(t(geno),
paste("/project2/xuanyao/data/DGN/txt_file_of_DGN/chr", chr, ".genotype.matrix.eqtl.txt", sep = ""),
sep = "\t", quote = FALSE)
rm(list = ls())
# import expression and covariates data
load("./data/dgn.rdata")
covariates_tec = read.table("./data/Technical_factors.txt",
sep = '\t',
header = T, row.names = 1,
stringsAsFactors = F)
covariates_bio = read.table("./data/Biological_and_hidden_factors.txt",
sep = '\t',
header = T, row.names = 1,
stringsAsFactors = F)
sample_names = read.table("./data/samples.txt",
header = T,
stringsAsFactors = F)
extract_residual <- function(y, x){
return(lm(y ~ x)$residuals)
}
rownames(ex) = sample_names$sample; colnames(ex) = gnames
covariates_tec_used = as.matrix(covariates_tec[rownames(ex), ])
covariates_bio_used = as.matrix(covariates_bio[rownames(ex), ])
# regress out technical covariates, then biological/hidden covariates
ex_tec_regressed = apply(ex, 2, function(y) extract_residual(y, covariates_tec_used))
ex_tec_bio_regressed = apply(ex_tec_regressed, 2, function(e) extract_residual(e, covariates_bio_used))
saveRDS(ex_tec_bio_regressed, "./data/ex_var_regressed.rds")
print("Done!")
Remove pseudogenes, genes with mappability \(<0.9\), and genes involved in gene pairs with cross-mappability higher than 1.
rm(list = ls())
library(WGCNA)
source("functions_for_additional_analysis.R")
options(stringsAsFactors = FALSE)
# expression input
datExpr = readRDS("./data/ex_var_regressed.rds")
# remove pseudogenes
pseudogenes = read.table("./result/pseudogenes.txt", sep ="\t", stringsAsFactors = F)
ind_pseudo = colnames(datExpr) %in% unique(pseudogenes$V1)
# remove low mappability genes
mappability = read.table("./result/mappability.txt", sep ="\t", stringsAsFactors = F, header = T)
ind_map = colnames(datExpr) %in% unique(mappability$gene_name)
# remove cross-mappable genes
cross_mappability = readRDS("./result/cross_mappable_genes.rds")
ind_cross_map = unique(names(cross_mappability[!is.na(cross_mappability)]))
ind_cross_map = colnames(datExpr) %in% ind_cross_map
# all removed genes
ind_remove = ind_pseudo | ind_map | ind_cross_map
Cluster the co-expressed genes using WGCNA.
# Genes left after removing low-quality genes
datExpr = datExpr[, !ind_remove]
# Parameter specification
minModuleSize = 30
MEDissThres = 0.15
if_plot_adjacency_mat_parameter_selection = F
if_plot_only_tree = F
if_plot_color_and_tree = F
if_plot_eigengene_heatmap_tree = F
if_heatmap_of_network = T
### Step1: network construction ###
# determine the parameter in the adjacency function: pickSoftThreshold() OR pickHardThreshold()
powers = c(c(1:10), seq(from = 12, to=20, by=2))
sft = pickSoftThreshold(datExpr, powerVector = powers, verbose = 5)
softPower = sft$powerEstimate
# network construction
adjacency = adjacency(datExpr, power = softPower)
TOM = TOMsimilarity(adjacency)
dissTOM = 1-TOM
### Step2: module detection ###
# tree construction using hierarchical clustering based on TOM
geneTree = hclust(as.dist(dissTOM), method = "average")
# branch cutting using dynamic tree cut
dynamicMods = cutreeDynamic(dendro = geneTree, distM = dissTOM,
deepSplit = 2, pamRespectsDendro = FALSE,
minClusterSize = minModuleSize)
dynamicColors = labels2colors(dynamicMods)
# eigengenes
MEList = moduleEigengenes(datExpr, colors = dynamicColors)
MEs = MEList$eigengenes
# Call an automatic merging function
merge = mergeCloseModules(datExpr, dynamicColors, cutHeight = MEDissThres, verbose = 3)
mergedColors = merge$colors
mergedMEs = merge$newMEs
moduleLabels = match(mergedColors, c("grey", standardColors(50)))-1
names(moduleLabels) = colnames(datExpr)
# Save results
result = list(moduleColors = mergedColors,
moduleLabels = moduleLabels,
MEs = mergedMEs,
old_moduleColors = dynamicColors,
old_moduleLabels = dynamicMods,
old_MEs = MEs,
geneTree = geneTree,
ind_remove = ind_remove)
saveRDS(result, file = "./result/DGN_coexpressed_gene_network.rds")
Here is the code for computing z-scores for module \(i\) and chromosome \(j\), using randomly permuted samples. Calculating observed z-scores (z-scores based on the original, real data) is the same, except the line genotype = genotype[ind.perm, ] is removed.
rm(list = ls())
parameter = commandArgs(trailingOnly = T)
module_id = parameter[1]
chr = parameter[2]
library(foreach)
library(doParallel)
# data input
data_dir = "/project2/xuanyao/data/DGN/"
datExpr = readRDS(paste0(data_dir, "txt_file_of_DGN/ex_var_regressed.rds"))
gene_pos = read.table(paste0(data_dir, "txt_file_of_DGN/expression_meta_hg19.txt"),
sep = '\t',
header = T,
stringsAsFactors = F)
gene_result = readRDS(file = paste0(data_dir, "txt_file_of_DGN/DGN_coexpressed_gene_network.rds"))
header = (chr == 1)
genotype = read.table(paste0(data_dir, "txt_file_of_DGN/chr", chr, ".genotype.matrix.eqtl.txt"),
header = header, row.names=NULL,
stringsAsFactors = F)
genotype = genotype[!duplicated(genotype[, 1]), ]
rownames(genotype) = genotype[, 1]
genotype = t(genotype[, -1])
# Use only the genes that are in the module and have position info in the meta file
gene_in_cluster = data.frame("gene_name"=names(gene_result$moduleLabels)[gene_result$moduleLabels == module_id], stringsAsFactors = F)
gene_w_pos = merge(gene_in_cluster, gene_pos, by.x = 1, by.y = 1)
exp.data = datExpr[, gene_w_pos$gene_name]
print("Input part is done!")
# run parallel on snps
cores = as.numeric(Sys.getenv("SLURM_NTASKS_PER_NODE"))
cl = makeCluster(cores, outfile = "")
registerDoParallel(cl)
M = ncol(genotype); K = ncol(exp.data); N = nrow(exp.data); ind.perm = sample(1:N)
genotype = genotype[ind.perm, ]
chunk.size = M %/% as.numeric(cores)
system.time(
z.null <- foreach(i=1:cores, .combine='rbind') %dopar%
{
res = matrix(nrow = chunk.size, ncol = K,
dimnames = list(colnames(genotype)[((i-1)*chunk.size+1):(i*chunk.size)], colnames(exp.data)))
for(j in 1:chunk.size){
snp = genotype[, j + (i-1)*chunk.size]
res[j, ] = apply(exp.data, 2, function(x) summary(lm(x~snp))$coefficients[2, 3] )
if(j %% 100 == 0){print(j+(i-1)*chunk.size)}
}
res
}
)
stopCluster(cl)
print("Parallel part is done!")
# Compute z-scores for the snps left over from the parallel chunks
Nsnps_left = M %% as.numeric(cores)
Nsnps_done = (chunk.size)*as.numeric(cores)
if(Nsnps_left != 0){
res = matrix(nrow = Nsnps_left, ncol = K,
dimnames = list(colnames(genotype)[Nsnps_done+1:(Nsnps_left)], colnames(exp.data)) )
for(j in 1:Nsnps_left){
snp = genotype[, j + Nsnps_done]
res[j, ] = apply(exp.data, 2, function(x) summary(lm(x~snp))$coefficients[2, 3] )
print(Nsnps_done+j)
}
z.null = rbind(z.null, res)
}
print("All part is done!")
saveRDS(z.null, file = paste0("./NULL/z.null/z.null.module", module_id, ".chr", chr, ".rds"))
Here is the code that uses PCO to calculate p-values. It uses only PCs with eigenvalue \(\lambda > 1\).
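Before the script itself, here is a minimal sketch of what the \(\lambda > 1\) truncation amounts to (a hypothetical illustration, not the sourced ModifiedPCOMerged implementation): project the z-score vector onto the eigenvectors of Sigma and keep only the components whose eigenvalue exceeds 1.

```r
# Hypothetical sketch of the eigenvalue > 1 truncation (not the sourced
# ModifiedPCOMerged code): keep only PCs of Sigma with lambda > 1 and
# standardize the projected z-scores, which are N(0, 1) under the null.
truncated_pc_scores <- function(z.vec, Sigma) {
  eig <- eigen(Sigma, symmetric = TRUE)
  keep <- eig$values > 1                                  # truncation rule
  scores <- crossprod(eig$vectors[, keep, drop = FALSE], z.vec)
  drop(scores / sqrt(eig$values[keep]))                   # standardized PC scores
}
```

The standardized scores can then be combined by the PCO omnibus test over the retained PCs only.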
rm(list = ls())
parameter = commandArgs(trailingOnly = T)
module_id = parameter[1]
chr = parameter[2]
p.method = parameter[3] #"truncPCO" #"PCO"
script_dir = "./NULL/script/"
zscore_dir = "./NULL/z.null/"
# These files are modified PCO functions that only use PCs with eigenvalue \lambda > 1
source(paste0(script_dir, "ModifiedPCOMerged.R"))
source(paste0(script_dir, "liu.r"))
source(paste0(script_dir, "liumod.R"))
source(paste0(script_dir, "davies.R"))
source(paste0(script_dir, "SigmaMetaEstimate.R"))
library(foreach)
library(doParallel)
library(mvtnorm)
library(MPAT)
# zscore input
z.null = readRDS(file = paste0(zscore_dir, "z.null.module", module_id, ".chr", chr, ".rds"))
# input
data_dir = "/project2/xuanyao/data/DGN/txt_file_of_DGN/"
gene_all_pos = read.table(paste0(data_dir, "expression_meta_hg19.txt"),
sep = '\t',
header = T,
stringsAsFactors = F)
datExpr = readRDS(paste0(data_dir, "ex_var_regressed.rds"))
gene_module = readRDS(file = paste0(data_dir, "DGN_coexpressed_gene_network.rds"))
# Extract genes within the module that have meta info and are not on the chromosome for which we are calculating p-values.
# This step ensures that any signals are due to trans genetic effects.
gene_cluster = data.frame("gene_name" = names(gene_module$moduleLabels[gene_module$moduleLabels == module_id]), stringsAsFactors = F)
gene_pos = merge(gene_cluster, gene_all_pos, by.x = 1, by.y = 1)
gene_trans = gene_pos[gene_pos$chrom != paste0("chr", chr), "gene_name"]
# inputs for PCO
Sigma = cor(datExpr[, gene_trans])
SigmaO = ModifiedSigmaOEstimate(Sigma,simNum=1000)
z.null_trans = z.null[, gene_trans]; rm(z.null)
# Do parallel across snps
cores = as.numeric(Sys.getenv("SLURM_NTASKS_PER_NODE"))
cl = makeCluster(cores, outfile = "")
registerDoParallel(cl)
Nsnps = nrow(z.null_trans); snps = rownames(z.null_trans)
chunk.size = Nsnps %/% as.numeric(cores)
Nsnps_left = Nsnps %% as.numeric(cores)
Nsnps_done = (chunk.size)*as.numeric(cores)
system.time(
p.null <- foreach(i=1:cores, .combine='c') %dopar%
{
library(MPAT)
library(mvtnorm)
res <- rep(NA, chunk.size); names(res) <- snps[((i-1)*chunk.size+1):(i*chunk.size)]
for(x in 1:chunk.size) {
res[x] <- ModifiedPCOMerged(Z.vec=z.null_trans[x + (i-1)*chunk.size, ],
Sigma=Sigma, SigmaO=SigmaO,method = "davies")
if(x %% 100 == 0){print(x + (i-1)*chunk.size)}
}
res
}
)
stopCluster(cl)
print("Parallel part is done!")
# Do PCO on snps that are left in the parallel part.
if(Nsnps_left != 0){
res_left = rep(NA, Nsnps_left); names(res_left) = snps[1:(Nsnps_left) + Nsnps_done]
for(k in 1:(Nsnps_left)){
res_left[k] = ModifiedPCOMerged(Z.vec=z.null_trans[k + Nsnps_done, ],
Sigma=Sigma, SigmaO=SigmaO,method = "davies")
print(k + Nsnps_done)
}
p.null = c(p.null, res_left)
}
print("All part is done!")
saveRDS(p.null, file = paste0("./NULL/p.null/p.null.module", module_id, ".chr", chr, ".", p.method, ".rds"))
rm(list = ls())
library(qvalue)
# estimate the p-value cut-off to control the false discovery rate at a certain level
parameter = commandArgs(trailingOnly = T)
fdr.level = as.numeric(parameter[1])
adjust.method = parameter[2]
# input pvalues for all 18 gene modules and 22 chromosomes
pvalue_pc = NULL
for(gene_cluster_id in 1:18){
for(chr in 1:22){
tmp_pc = readRDS(file = paste0("./TruncPCO/GeneCluster", gene_cluster_id, "/pvalue_pco_", chr, ".rds"))
names(tmp_pc) = paste0("C", gene_cluster_id, ":", names(tmp_pc))
pvalue_pc = c(pvalue_pc, tmp_pc)
cat("Gene cluster:", gene_cluster_id, ". Chr:", chr, "\n")
print(object.size(pvalue_pc))
}
}
# histogram of all pvalues
png("./FDR/hist_of_all_pvalues.png")
hist(pvalue_pc)
dev.off()
print("Histogram of pvalues saved!")
# adjust pvalues
if(adjust.method == "BH"){
p.adjusted = p.adjust(pvalue_pc, method = "BH")
print("BH correction of pvalue done!")
signal_q = p.adjusted[p.adjusted<fdr.level]
signal_name = names(signal_q)
signal_summary = cbind(pvalue_pc[signal_name], signal_q)
colnames(signal_summary) = c("pvalue.PCO", "qvalue.PCO")
write.table(signal_summary, file = "./FDR/transeQTL.Summary.BH.txt",
quote = FALSE)
print("Signals written in the file!")
saveRDS(p.adjusted, file = "./FDR/qvalueofPC.BH.rds")
print("qvalue results saved!")
}else if(adjust.method == "qvalue"){
qobj_pc = qvalue(p = pvalue_pc, fdr.level=fdr.level)
print("FDR analysis of pc has been done!")
signal_q = qobj_pc$qvalues[qobj_pc$significant]
signal_name = names(signal_q)
signal_summary = cbind(qobj_pc$pvalues[signal_name], signal_q)
colnames(signal_summary) = c("pvalue.PCO", "qvalue.PCO")
write.table(signal_summary, file = "./FDR/transeQTL.Summary.q.txt",
quote = FALSE)
print("Signals written in the file!")
saveRDS(qobj_pc, file = "./FDR/qvalueofPC.q.rds")
print("qvalue results saved!")
}
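As a quick sanity check of the BH branch above, with toy p-values (illustration only):

```r
# Toy illustration of the BH adjustment used above: for sorted p-values,
# q_(i) = min over j >= i of p_(j) * m / j, with m the number of tests.
p = c(0.001, 0.01, 0.02, 0.5, 0.9)
q = p.adjust(p, method = "BH")
# smallest p-value: q = 0.001 * 5 / 1 = 0.005
# at fdr.level = 0.05, the first three tests are declared significant
```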
Use all PCs. Adjust p-values using qvalue.
Use PCs with eigenvalue \(\lambda>1\). Adjust p-values using the empirical null.
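A minimal sketch of the empirical-null idea (hypothetical helper, not the exact script used here): compare the observed p-values against the permutation-based null p-values produced above, estimating the FDR at a cutoff t as the expected number of null discoveries over the number of observed discoveries.

```r
# Hypothetical helper: empirical FDR at cutoff t, using permutation-based
# null p-values (e.g. from the p.null script) against observed p-values.
empirical_fdr <- function(p_obs, p_null, t) {
  # expected false discoveries at t, rescaled to the observed set size
  expected_false = sum(p_null <= t) * length(p_obs) / length(p_null)
  observed_sig = sum(p_obs <= t)
  if (observed_sig == 0) return(0)
  min(1, expected_false / observed_sig)
}
```

Scanning t over a grid and taking the largest t with empirical FDR below the target gives the p-value cutoff.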
Use the first PC (eLife method).
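A minimal sketch of this approach (hypothetical illustration): instead of combining z-scores across all module genes, summarize the module by its first principal component and test each SNP against that single trait.

```r
# Hypothetical sketch of the first-PC approach: summarize the module
# expression by its first principal component, then test each SNP
# against that single synthetic trait with a simple linear model.
first_pc_pvalue <- function(expr_mat, snp) {
  pc1 = prcomp(expr_mat, center = TRUE, scale. = TRUE)$x[, 1]
  summary(lm(pc1 ~ snp))$coefficients[2, 4]   # SNP p-value
}
```

This trades power for simplicity: signals loading on later PCs of the module are missed, which is the motivation for the PCO-based methods above.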
Single-variant method (MatrixEQTL).
Prepare files
Run MatrixeQTL
sessionInfo()