Last updated: 2020-11-03
Checks: 6 1
Knit directory: GradLog/
This reproducible R Markdown analysis was created with workflowr (version 1.6.2). The Checks tab describes the reproducibility checks that were applied when the results were created. The Past versions tab lists the development history.
The R Markdown file has unstaged changes. To know which version of the R Markdown file created these results, you’ll want to first commit it to the Git repo. If you’re still working on the analysis, you can ignore this warning. When you’re finished, you can run wflow_publish to commit the R Markdown file and build the HTML.
Great job! The global environment was empty. Objects defined in the global environment can affect the analysis in your R Markdown file in unknown ways. For reproduciblity it’s best to always run the code in an empty environment.
The command set.seed(20201014) was run prior to running the code in the R Markdown file. Setting a seed ensures that any results that rely on randomness, e.g. subsampling or permutations, are reproducible.
Great job! Recording the operating system, R version, and package versions is critical for reproducibility.
Nice! There were no cached chunks for this analysis, so you can be confident that you successfully produced the results during this run.
Great job! Using relative paths to the files within your workflowr project makes it easier to run your code on other machines.
Great! You are using Git for version control. Tracking code development and connecting the code version to the results is critical for reproducibility.
The results in this page were generated with repository version a167ba4. See the Past versions tab to see a history of the changes made to the R Markdown and HTML files.
Note that you need to be careful to ensure that all relevant files for the analysis have been committed to Git prior to generating the results (you can use wflow_publish or wflow_git_commit). workflowr only checks the R Markdown file, but you know if there are other scripts or data files that it depends on. Below is the status of the Git repository when the results were generated:
Ignored files:
Ignored: .DS_Store
Ignored: .Rhistory
Ignored: .Rproj.user/
Ignored: analysis/.DS_Store
Ignored: code/.DS_Store
Ignored: data/.DS_Store
Ignored: output/.DS_Store
Untracked files:
Untracked: analysis/Log.Rmd
Untracked: data/GO.organizedResult.csv
Unstaged changes:
Modified: analysis/Trans.Rmd
Modified: analysis/_site.yml
Modified: analysis/index.Rmd
Note that any generated files, e.g. HTML, png, CSS, etc., are not included in this status report because it is ok for generated content to have uncommitted changes.
These are the previous versions of the repository in which changes were made to the R Markdown (analysis/Trans.Rmd) and HTML (docs/Trans.html) files. If you’ve configured a remote Git repository (see ?wflow_git_remote), click on the hyperlinks in the table below to view the files as they were in that past version.
| File | Version | Author | Date | Message |
|---|---|---|---|---|
| html | 2a04808 | Lili Wang | 2020-10-19 | Build site. |
| Rmd | c830222 | Lili Wang | 2020-10-19 | Update all code used for breast cancer TCGA data. |
| html | c830222 | Lili Wang | 2020-10-19 | Update all code used for breast cancer TCGA data. |
| html | 5ccca4c | Lili Wang | 2020-10-17 | Build site. |
| Rmd | 2c4d881 | Lili Wang | 2020-10-17 | rewrite the code chunk using the ‘code’ parameter |
| html | a62f2e4 | Lili Wang | 2020-10-17 | Build site. |
| Rmd | eb3f712 | Lili Wang | 2020-10-17 | test using cat to display code |
| html | f05b42c | Lili Wang | 2020-10-17 | Build site. |
| Rmd | f9b0a15 | Lili Wang | 2020-10-17 | Rewrite make.geno to convert the genotype binary files to txt |
| html | 5a86aaf | Lili Wang | 2020-10-14 | Build site. |
| html | 4382474 | Lili Wang | 2020-10-14 | Build site. |
| html | f79b80d | Lili Wang | 2020-10-14 | Build site. |
| Rmd | d7821ad | Lili Wang | 2020-10-14 | wflow_publish("./analysis/*.Rmd") |
module load plink
module load R/3.6.1
cd /scratch/midway2/liliw1/
# dataset DGN: dir_plink=/project2/xuanyao/data/DGN
# dataset DGN: dir_txt=/project2/xuanyao/data/DGN/txt_file_of_DGN
# dataset DGN: dir_script=/project2/xuanyao/data/DGN
# dataset DGN: awk 'BEGIN{print "id\tchr\tpos"}{printf("%s\tchr%s\t%s\n",$2,$1,$4)}'
dir_plink=/project2/xuanyao/xuanyao/cancer_gbat/breast_cancer
dir_txt=/project2/xuanyao/llw/breastcancerTCGA/txt_file
dir_script=/scratch/midway2/liliw1/TCGA
for CHR in `seq 22`
do
# convert bfiles to .raw files
plink \
--bfile $dir_plink/genotype/chr${CHR}_QCed \
--recodeA \
--out $dir_txt/chr${CHR}
# extract genotype meta
awk \
'BEGIN{print "id\tchr\tpos"}{key=sprintf("%s:%s", $1, $4);printf("%s\tchr%s\t%s\n",key,$1,$4)}' \
$dir_plink/genotype/chr${CHR}_QCed.bim > \
$dir_txt/chr${CHR}_genotype_meta.txt
# convert genotype to txt file
Rscript --no-restore --no-save $dir_script/make.geno.R $CHR
done
make.geno.R
rm(list = ls())
args = commandArgs(trailing = T)
chr = args[1]
# dataset DGN: dir_txt = "/project2/xuanyao/data/DGN/txt_file_of_DGN/"
# dataset DGN: colnames(geno) = unlist(lapply(strsplit(colnames(genotype)[7:ncol(genotype)],"_"),"[",1))
dir_txt = "/project2/xuanyao/llw/breastcancerTCGA/txt_file"
genotype = read.table(paste0(dir_txt, "chr",chr,".raw"),
header = T, check.names = F)
snp.meta = read.table(paste0(dir_txt, "chr",chr,"_genotype_meta.txt"),
header = TRUE, stringsAsFactors = FALSE)
geno = as.matrix(genotype[,7:ncol(genotype)])
rownames(geno) = as.character(genotype$IID)
colnames(geno) = snp.meta$id
write.table(t(geno),
paste0(dir_txt, "chr", chr,".genotype.matrix.eqtl.txt"),
sep = "\t", quote = FALSE)
cat(paste0("chr", chr, " is done!"))
rm(list = ls())
dir_data = "/project2/xuanyao/llw/breastcancerTCGA/txt_file"
setwd(dir_data)
extract_residual <- function(y, x){
return(lm(y ~ x)$residuals)
}
# import expression and covariates data
load("./tumor.rdata")
cov_exp = read.table("./tumor.exp_pc.txt",
header = TRUE, row.names = NULL, stringsAsFactors = FALSE)
cov_snp = read.table("./tumor.snp_pc.txt",
header = FALSE, row.names = NULL, stringsAsFactors = FALSE)
colnames(ex) = gnames
# regress out covariates
ex_cov_regressed = apply(ex, 2, function(y) extract_residual(y, as.matrix(cbind(cov_exp, cov_snp))))
saveRDS(ex_cov_regressed, "./ex_var_regressed.rds")
print("Done!")
Remove seudogenes, genes with mappability\(<0.9\), and gene pairs with cross-mappability higher than 1.
The code here is for the breast cancer TCGA data. So the genes on chromosomes X and Y are also removed.
rm(list = ls())
options(stringsAsFactors = FALSE)
# dataDGN: dir_data =
# dataDGN: gene.name = colnames(datExpr)
dir_data = "/project2/xuanyao/llw/breastcancerTCGA/txt_file"
setwd(dir_data)
### Data preparation ###
datExpr = readRDS("./ex_var_regressed.rds")
gene.name = sapply(colnames(datExpr), function(x) strsplit(x, "\\|")[[1]][1])
# remove genes on chromosome X & Y
gene.meta = read.table("./tumor.gene_pos.txt",
header = TRUE,
stringsAsFactors = F)
rownames(gene.meta) = gene.meta$gene
ind_chrX = gene.meta[colnames(datExpr), "chr"] == "chrX" | gene.meta[colnames(datExpr), "chr"] == "chrY"
# remove pseudogenes
pseudogenes = read.table("./pseudogenes.txt", sep ="\t", stringsAsFactors = F)
ind_pseudo = gene.name %in% unique(pseudogenes$V1)
# remove low mappability genes
mappability = read.table("./mappability.txt", sep ="\t", stringsAsFactors = F, header = T)
ind_map = gene.name %in% unique(mappability$gene_name)
# remove cross-mappable genes
cross_mappability = readRDS("./cross_mappable_genes.rds")
cross_map_genes = unique(names(cross_mappability[!is.na(cross_mappability)]))
ind_cross_map = gene.name %in% cross_map_genes
ind_remove = ind_chrX | ind_pseudo | ind_map | ind_cross_map
sum(ind_chrX)
sum(ind_pseudo)
sum(ind_map)
sum(ind_cross_map)
sum(ind_remove)
write.table(as.matrix(data.frame("gene" = colnames(datExpr),
"gene.name" = gene.name,
"ind_remove" = ind_remove,
"ind_chrX" = ind_chrX,
"pseudo" = ind_pseudo,
"lowmap" = ind_map,
"crossmap" = ind_cross_map)),
"genes_rm_info.txt",
row.names = FALSE, col.names = TRUE, quote = FALSE)
Cluster the co-expressed genes by WGCNA.
rm(list = ls())
library(WGCNA)
dir_data = "/project2/xuanyao/llw/breastcancerTCGA/txt_file"
setwd(dir_data)
# Data preparation
datExpr = readRDS("./ex_var_regressed.rds")
rm_info = read.table("./genes_rm_info.txt",
header = TRUE, row.names = NULL,
stringsAsFactors = FALSE)
datExpr = datExpr[, !rm_info$ind_remove]
# Run WGCNA
### Parameter specification ###
minModuleSize = 30
MEDissThres = 0.15
if_plot_adjacency_mat_parameter_selection = F
if_plot_only_tree = F
if_plot_color_and_tree = F
if_plot_eigengene_heatmap_tree = F
if_heatmap_of_network = T
### Step1: network construction ###
# determine the paramter in adjacency function: pickSoftThreshold() OR pickHardThreshold()
powers = c(c(1:10), seq(from = 12, to=20, by=2))
sft = pickSoftThreshold(datExpr, powerVector = powers, verbose = 5)
softPower = sft$powerEstimate
# network construction
adjacency = adjacency(datExpr, power = softPower)
TOM = TOMsimilarity(adjacency)
dissTOM = 1-TOM
### Step2: module detection ###
# tree construction using hierarchical clustering based on TOM
geneTree = hclust(as.dist(dissTOM), method = "average")
# branch cutting using dynamic tree cut
dynamicMods = cutreeDynamic(dendro = geneTree, distM = dissTOM,
deepSplit = 2, pamRespectsDendro = FALSE,
minClusterSize = minModuleSize)
dynamicColors = labels2colors(dynamicMods)
# eigene genes
MEList = moduleEigengenes(datExpr, colors = dynamicColors)
MEs = MEList$eigengenes
# Call an automatic merging function
merge = mergeCloseModules(datExpr, dynamicColors, cutHeight = MEDissThres, verbose = 3)
mergedColors = merge$colors
mergedMEs = merge$newMEs
moduleLabels = match(mergedColors, c("grey", standardColors(100)))-1
names(moduleLabels) = colnames(datExpr)
# Save results
result = list(moduleColors = mergedColors,
moduleLabels = moduleLabels,
MEs = mergedMEs,
old_moduleColors = dynamicColors,
old_moduleLabels = dynamicMods,
old_MEs = MEs,
geneTree = geneTree,
ind_remove = rm_info$ind_remove)
saveRDS(result, file = "./coexp.module.rds")
Here is the code of how I compute null zscores for module \(i\) and chromosome \(j\), using randomly permuted samples. Calculating observed zscores (zscores based on the original real data) is similar, only need to remove line genotype = genotype[ind.perm, ].
Update the code to calculate the observed zscores from TCGA data.
rm(list = ls())
parameter = commandArgs(trailingOnly = T)
module_id = parameter[1]
chr = parameter[2]
setwd("/scratch/midway2/liliw1/TCGA/")
# dataset DGN: dir_data = "/project2/xuanyao/data/DGN/"
# dataset DGN: gene_meta = read.table(paste0(dir_data, "expression_meta_hg19.txt"), ...)
# dataset DGN: gene_module = readRDS(file = paste0(dir_data, "DGN_coexpressed_gene_network.rds"))
# dataset DGN: header = ifelse(chr==1, TRUE, FALSE) when reading genotype
library(foreach)
library(doParallel)
# data input
dir_data = "/project2/xuanyao/llw/breastcancerTCGA/txt_file/"
datExpr = readRDS(paste0(dir_data, "ex_var_regressed.rds"))
gene_meta = read.table(paste0(dir_data, "tumor.gene_pos.txt"),
sep = '\t',
header = T,
stringsAsFactors = F)
gene_module = readRDS(file = paste0(dir_data, "coexp.module.rds"))
genotype = read.table(paste0(dir_data, "chr", chr, ".genotype.matrix.eqtl.txt"),
header = TRUE, row.names=NULL,
stringsAsFactors = F)
genotype = genotype[!duplicated(genotype[, 1]), ]
rownames(genotype) = genotype[, 1]
genotype = t(genotype[, -1])
# Use only the genes in the module and have postion info in the meta file
gene_in_cluster = data.frame("gene_name"=names(gene_module$moduleLabels)[gene_module$moduleLabels == module_id], stringsAsFactors = F)
gene_w_pos = merge(gene_in_cluster, gene_meta, by.x = 1, by.y = 1)
exp.data = datExpr[, gene_w_pos$gene_name]
print("Input part is done!")
# run parallel on snps
cores=(Sys.getenv("SLURM_NTASKS_PER_NODE"))
cl = makeCluster(as.numeric(cores), outfile="")
registerDoParallel(cl)
M = ncol(genotype); K = ncol(exp.data); N = nrow(exp.data)
#ind.perm = sample(1:N); genotype = genotype[ind.perm, ]
chunk.size = M %/% as.numeric(cores)
system.time(
res.all <- foreach(i=1:cores, .combine='rbind') %dopar%
{
res = matrix(nrow = chunk.size, ncol = K,
dimnames = list(colnames(genotype)[((i-1)*chunk.size+1):(i*chunk.size)], colnames(exp.data)))
for(j in 1:chunk.size){
snp = genotype[, j + (i-1)*chunk.size]
res[j, ] = apply(exp.data, 2, function(x) summary(lm(x~snp))$coefficients[2, 3] )
if(j %% 100 == 0){print(j+(i-1)*chunk.size)}
}
res
}
)
stopCluster(cl)
print("Parallel part is done!")
# Run PCO on snps left in the parallel part
Nsnps_left = M %% as.numeric(cores)
Nsnps_done = (chunk.size)*as.numeric(cores)
if(Nsnps_left != 0){
res = matrix(nrow = Nsnps_left, ncol = K,
dimnames = list(colnames(genotype)[Nsnps_done+1:(Nsnps_left)], colnames(exp.data)) )
for(j in 1:Nsnps_left){
snp = genotype[, j + Nsnps_done]
res[j, ] = apply(exp.data, 2, function(x) summary(lm(x~snp))$coefficients[2, 3] )
print(Nsnps_done+j)
}
res.all = rbind(res.all, res)
}
print("All part is done!")
saveRDS(res.all, file = paste0("./z/z.module", module_id, ".chr", chr, ".rds"))
Here is the code using PCO to calculate pvalues. This uses only PC’s with eigenvalue \(\lambda > 1\). Modified PCO code in ./script/*.
rm(list = ls())
parameter = commandArgs(trailingOnly = T)
module_id = parameter[1]
chr = parameter[2]
p.method = parameter[3] #"truncPCO" #"PCO"
setwd("/scratch/midway2/liliw1/TCGA/")
# dataset DGN: dir_data = "/project2/xuanyao/data/DGN/txt_file_of_DGN/"
# dataset DGN: gene_module = readRDS(file = paste0(dir_data, "DGN_coexpressed_gene_network.rds"))
# dataset DGN: gene_meta = read.table(paste0(dir_data, "expression_meta_hg19.txt"), ...)
source(paste0("./script/ModifiedPCOMerged.R"))
source(paste0("./script/liu.r"))
source(paste0("./script/liumod.R"))
source(paste0("./script/davies.R"))
source(paste0("./script/SigmaMetaEstimate.R"))
library(foreach)
library(doParallel)
library(mvtnorm)
library(MPAT)
# zscore input
z_mat = readRDS(file = paste0("./z/z.module", module_id, ".chr", chr, ".rds"))
# data input
dir_data = "/project2/xuanyao/llw/breastcancerTCGA/txt_file/"
datExpr = readRDS(paste0(dir_data, "ex_var_regressed.rds"))
gene_meta = read.table(paste0(dir_data, "tumor.gene_pos.txt"),
sep = '\t',
header = T,
stringsAsFactors = F)
gene_module = readRDS(file = paste0(dir_data, "coexp.module.rds"))
# Use only the genes in the module & have postion info in the meta file & not on chromosome chr
gene_in_cluster = data.frame("gene_name" = names(gene_module$moduleLabels[gene_module$moduleLabels == module_id]), stringsAsFactors = F)
gene_w_pos = merge(gene_in_cluster, gene_meta, by.x = 1, by.y = 1)
gene_trans = gene_w_pos[gene_w_pos[,2] != paste0("chr", chr), "gene_name"]
Sigma = cor(datExpr[, gene_trans])
SigmaO = ModifiedSigmaOEstimate(Sigma,simNum=1000)
z_mat_trans = z_mat[, gene_trans]; rm(z_mat)
# Do parallel
cores=(Sys.getenv("SLURM_NTASKS_PER_NODE"))
cl = makeCluster(as.numeric(cores), outfile="")
registerDoParallel(cl)
Nsnps = nrow(z_mat_trans); snps = rownames(z_mat_trans)
chunk.size = Nsnps %/% as.numeric(cores)
Nsnps_left = Nsnps %% as.numeric(cores)
Nsnps_done = (chunk.size)*as.numeric(cores)
system.time(
res.all <- foreach(i=1:cores, .combine='c') %dopar%
{
library(MPAT)
library(mvtnorm)
res <- rep(NA, chunk.size); names(res) <- snps[((i-1)*chunk.size+1):(i*chunk.size)]
for(x in 1:chunk.size) {
res[x] <- ModifiedPCOMerged(Z.vec=z_mat_trans[x + (i-1)*chunk.size, ],
Sigma=Sigma, SigmaO=SigmaO,method = "davies")
if(x %% 100 == 0){print(x + (i-1)*chunk.size)}
}
res
}
)
stopCluster(cl)
print("Parallel part is done!")
if(Nsnps_left != 0){
res_left = rep(NA, Nsnps_left); names(res_left) = snps[1:(Nsnps_left) + Nsnps_done]
for(k in 1:(Nsnps_left)){
res_left[k] = ModifiedPCOMerged(Z.vec=z_mat_trans[k + Nsnps_done, ],
Sigma=Sigma, SigmaO=SigmaO,method = "davies")
print(k + Nsnps_done)
}
res.all = c(res.all, res_left)
}
print("All part is done!")
saveRDS(res.all, file = paste0("./p/p.module", module_id, ".chr", chr, ".", p.method, ".rds"))
rm(list = ls())
library(qvalue)
# estimate the p-value cut-off to control the false discovery rate at a certain level
parameter = commandArgs(trailingOnly = T)
fdr.level = parameter[1]
adjust.method = parameter[2]
# input pvalues for all 18 gene modules and 22 chromosomes
pvalue_pc = NULL
for(gene_cluster_id in 1:18){
for(chr in 1:22){
tmp_pc = readRDS(file = paste0("./TruncPCO/GeneCluster", gene_cluster_id, "/pvalue_pco_", chr, ".rds"))
names(tmp_pc) = paste0("C", gene_cluster_id, ":", names(tmp_pc))
pvalue_pc = c(pvalue_pc, tmp_pc)
cat("Gene cluster:", gene_cluster_id, ". Chr:", chr, "\n")
print(object.size(pvalue_pc))
}
}
# histogram of all pvalues
png("./FDR/Hist of all pvalues")
hist(pvalue_pc)
dev.off()
print("Histogram of pvalues saved!")
# adjust pvalues
if(adjust.method == "BH"){
p.adjusted = p.adjust(pvalue_pc, method = "BH")
print("BH correction of pvalue done!")
signal_q = p.adjusted[p.adjusted<fdr.level]
signal_name = names(signal_q)
signal_summary = cbind(pvalue_pc[signal_name], signal_q)
colnames(signal_summary) = c("pvalue.PCO", "qvalue.PCO")
write.table(signal_summary, file = "./FDR/transeQTL.Summary.BH.txt",
quote = FALSE)
print("Signals written in the file!")
saveRDS(p.adjusted, file = "./FDR/qvalueofPC.BH.rds")
print("qvalue results saved!")
}else if(adjust.method == "qvalue"){
qobj_pc = qvalue(p = pvalue_pc, fdr.level=fdr.level)
print("FDR analysis of pc has been done!")
signal_q = qobj_pc$qvalues[qobj_pc$significant]
signal_name = names(signal_q)
signal_summary = cbind(qobj_pc$pvalues[signal_name], signal_q)
colnames(signal_summary) = c("pvalue.PCO", "qvalue.PCO")
write.table(signal_summary, file = "./FDR/transeQTL.Summary.q.txt",
quote = FALSE)
print("Signals written in the file!")
saveRDS(qobj_pc, file = "./FDR/qvalueofPC.q.rds")
print("qvalue results saved!")
}
Use all PC’s. Adjust pvalues using qvalue.
Use PC’s with eigenvalue \(\lambda>1\). Adjust pvalues using empirical null.
Use the first PC. (elife method)
Single-variate method. (MatrixeQTL)
Prepare files
Run MatrixeQTL
sessionInfo()