Last updated: 2022-01-03

Checks: 7 0

Knit directory: GradLog/

This reproducible R Markdown analysis was created with workflowr (version 1.6.2). The Checks tab describes the reproducibility checks that were applied when the results were created. The Past versions tab lists the development history.


Great! Since the R Markdown file has been committed to the Git repository, you know the exact version of the code that produced these results.

Great job! The global environment was empty. Objects defined in the global environment can affect the analysis in your R Markdown file in unknown ways. For reproduciblity it’s best to always run the code in an empty environment.

The command set.seed(20201014) was run prior to running the code in the R Markdown file. Setting a seed ensures that any results that rely on randomness, e.g. subsampling or permutations, are reproducible.

Great job! Recording the operating system, R version, and package versions is critical for reproducibility.

Nice! There were no cached chunks for this analysis, so you can be confident that you successfully produced the results during this run.

Great job! Using relative paths to the files within your workflowr project makes it easier to run your code on other machines.

Great! You are using Git for version control. Tracking code development and connecting the code version to the results is critical for reproducibility.

The results in this page were generated with repository version 68ded6f. See the Past versions tab to see a history of the changes made to the R Markdown and HTML files.

Note that you need to be careful to ensure that all relevant files for the analysis have been committed to Git prior to generating the results (you can use wflow_publish or wflow_git_commit). workflowr only checks the R Markdown file, but you know if there are other scripts or data files that it depends on. Below is the status of the Git repository when the results were generated:


Ignored files:
    Ignored:    .DS_Store
    Ignored:    .Rhistory
    Ignored:    .Rproj.user/
    Ignored:    analysis/.DS_Store
    Ignored:    analysis/test_log.Rmd
    Ignored:    code/.DS_Store
    Ignored:    data/.DS_Store
    Ignored:    output/.DS_Store

Note that any generated files, e.g. HTML, png, CSS, etc., are not included in this status report because it is ok for generated content to have uncommitted changes.


These are the previous versions of the repository in which changes were made to the R Markdown (analysis/Trans.Rmd) and HTML (docs/Trans.html) files. If you’ve configured a remote Git repository (see ?wflow_git_remote), click on the hyperlinks in the table below to view the files as they were in that past version.

File Version Author Date Message
html 51b1a4c Lili Wang 2021-01-04 Build site.
html c4d150c Lili Wang 2020-11-16 Build site.
html d13875e Lili Wang 2020-11-16 Build site.
html 5bc34ec Lili Wang 2020-11-03 Build site.
Rmd 8fe4184 Lili Wang 2020-11-03 Build site.
html 8fe4184 Lili Wang 2020-11-03 Build site.
html 2a04808 Lili Wang 2020-10-19 Build site.
Rmd c830222 Lili Wang 2020-10-19 Update all code used for breast cancer TCGA data.
html c830222 Lili Wang 2020-10-19 Update all code used for breast cancer TCGA data.
html 5ccca4c Lili Wang 2020-10-17 Build site.
Rmd 2c4d881 Lili Wang 2020-10-17 rewrite the code chunk using the ‘code’ parameter
html a62f2e4 Lili Wang 2020-10-17 Build site.
Rmd eb3f712 Lili Wang 2020-10-17 test using cat to display code
html f05b42c Lili Wang 2020-10-17 Build site.
Rmd f9b0a15 Lili Wang 2020-10-17 Rewrite make.geno to convert the genotype binary files to txt
html 5a86aaf Lili Wang 2020-10-14 Build site.
html 4382474 Lili Wang 2020-10-14 Build site.
html f79b80d Lili Wang 2020-10-14 Build site.
Rmd d7821ad Lili Wang 2020-10-14 wflow_publish("./analysis/*.Rmd")

Outine

Preprocess data

Regress out covariates from expression data

rm(list = ls())

dir_data = "/project2/xuanyao/llw/breastcancerTCGA/txt_file"
setwd(dir_data)

extract_residual <- function(y, x){
  return(lm(y ~ x)$residuals)
}


# import expression and covariates data
load("./tumor.rdata")
cov_exp = read.table("./tumor.exp_pc.txt",
                     header = TRUE, row.names = NULL, stringsAsFactors = FALSE)
cov_snp = read.table("./tumor.snp_pc.txt",
                     header = FALSE, row.names = NULL, stringsAsFactors = FALSE)

colnames(ex) = gnames


# regress out covariates
ex_cov_regressed = apply(ex, 2, function(y) extract_residual(y, as.matrix(cbind(cov_exp, cov_snp))))
saveRDS(ex_cov_regressed, "./ex_var_regressed.rds")

print("Done!")

Remove low-quality genes

Remove seudogenes, genes with mappability\(<0.9\), and gene pairs with cross-mappability higher than 1.

The code here is for the breast cancer TCGA data. So the genes on chromosomes X and Y are also removed.

rm(list = ls())

options(stringsAsFactors = FALSE)

# dataDGN: dir_data =
# dataDGN: gene.name = colnames(datExpr)

dir_data = "/project2/xuanyao/llw/breastcancerTCGA/txt_file"
setwd(dir_data)

### Data preparation ###
datExpr = readRDS("./ex_var_regressed.rds")
gene.name = sapply(colnames(datExpr), function(x) strsplit(x, "\\|")[[1]][1])

# remove genes on chromosome X & Y
gene.meta = read.table("./tumor.gene_pos.txt",
                       header = TRUE,
                       stringsAsFactors = F)
rownames(gene.meta) = gene.meta$gene
ind_chrX = gene.meta[colnames(datExpr), "chr"] == "chrX" | gene.meta[colnames(datExpr), "chr"] == "chrY"

# remove pseudogenes
pseudogenes = read.table("./pseudogenes.txt", sep ="\t", stringsAsFactors = F)
ind_pseudo = gene.name %in% unique(pseudogenes$V1)

# remove low mappability genes
mappability = read.table("./mappability.txt", sep ="\t", stringsAsFactors = F, header = T)
ind_map = gene.name %in% unique(mappability$gene_name)

# remove cross-mappable genes
cross_mappability = readRDS("./cross_mappable_genes.rds")
cross_map_genes = unique(names(cross_mappability[!is.na(cross_mappability)]))
ind_cross_map = gene.name %in% cross_map_genes

ind_remove = ind_chrX | ind_pseudo | ind_map | ind_cross_map

sum(ind_chrX)
sum(ind_pseudo)
sum(ind_map)
sum(ind_cross_map)
sum(ind_remove)


write.table(as.matrix(data.frame("gene" = colnames(datExpr),
                                 "gene.name" = gene.name,
                                 "ind_remove" = ind_remove,
                                 "ind_chrX" = ind_chrX,
                                 "pseudo" = ind_pseudo,
                                 "lowmap" = ind_map,
                                 "crossmap" = ind_cross_map)),
            "genes_rm_info.txt",
            row.names = FALSE, col.names = TRUE, quote = FALSE)

Coexpressed gene modules

Cluster the co-expressed genes by WGCNA.

rm(list = ls())

library(WGCNA)

dir_data = "/project2/xuanyao/llw/breastcancerTCGA/txt_file"
setwd(dir_data)

# Data preparation
datExpr = readRDS("./ex_var_regressed.rds")
rm_info = read.table("./genes_rm_info.txt",
                     header = TRUE, row.names = NULL,
                     stringsAsFactors = FALSE)
datExpr = datExpr[, !rm_info$ind_remove]


# Run WGCNA
### Parameter specification ###
minModuleSize = 30
MEDissThres = 0.15
if_plot_adjacency_mat_parameter_selection = F
if_plot_only_tree = F
if_plot_color_and_tree = F
if_plot_eigengene_heatmap_tree = F
if_heatmap_of_network = T


### Step1: network construction ###
# determine the paramter in adjacency function: pickSoftThreshold() OR pickHardThreshold()
powers = c(c(1:10), seq(from = 12, to=20, by=2))
sft = pickSoftThreshold(datExpr, powerVector = powers, verbose = 5)
softPower = sft$powerEstimate

# network construction
adjacency = adjacency(datExpr, power = softPower)
TOM = TOMsimilarity(adjacency)
dissTOM = 1-TOM


### Step2: module detection ###
# tree construction using hierarchical clustering based on TOM
geneTree = hclust(as.dist(dissTOM), method = "average")

# branch cutting using dynamic tree cut
dynamicMods = cutreeDynamic(dendro = geneTree, distM = dissTOM,
                            deepSplit = 2, pamRespectsDendro = FALSE,
                            minClusterSize = minModuleSize)
dynamicColors = labels2colors(dynamicMods)

# eigene genes
MEList = moduleEigengenes(datExpr, colors = dynamicColors)
MEs = MEList$eigengenes

# Call an automatic merging function
merge = mergeCloseModules(datExpr, dynamicColors, cutHeight = MEDissThres, verbose = 3)
mergedColors = merge$colors
mergedMEs = merge$newMEs

moduleLabels = match(mergedColors, c("grey", standardColors(100)))-1
names(moduleLabels) = colnames(datExpr)

# Save results
result = list(moduleColors = mergedColors,
              moduleLabels = moduleLabels,
              MEs = mergedMEs,
              old_moduleColors = dynamicColors,
              old_moduleLabels = dynamicMods,
              old_MEs = MEs,
              geneTree = geneTree,
              ind_remove = rm_info$ind_remove)
saveRDS(result, file = "./coexp.module.rds")

Calculate zscores

Here is the code of how I compute null zscores for module \(i\) and chromosome \(j\), using randomly permuted samples. Calculating observed zscores (zscores based on the original real data) is similar, only need to remove line genotype = genotype[ind.perm, ].

Update the code to calculate the observed zscores from TCGA data.

rm(list = ls())

parameter = commandArgs(trailingOnly = T)
module_id = parameter[1]
chr = parameter[2]

setwd("/scratch/midway2/liliw1/TCGA/")

# dataset DGN: dir_data = "/project2/xuanyao/data/DGN/"
# dataset DGN: gene_meta = read.table(paste0(dir_data, "expression_meta_hg19.txt"), ...)
# dataset DGN: gene_module = readRDS(file = paste0(dir_data, "DGN_coexpressed_gene_network.rds"))
# dataset DGN: header = ifelse(chr==1, TRUE, FALSE) when reading genotype


library(foreach)
library(doParallel)

# data input
dir_data = "/project2/xuanyao/llw/breastcancerTCGA/txt_file/"
datExpr = readRDS(paste0(dir_data, "ex_var_regressed.rds"))
gene_meta = read.table(paste0(dir_data, "tumor.gene_pos.txt"),
                      sep = '\t',
                      header = T,
                      stringsAsFactors = F)
gene_module = readRDS(file = paste0(dir_data, "coexp.module.rds"))
genotype = read.table(paste0(dir_data, "chr", chr, ".genotype.matrix.eqtl.txt"),
                      header = TRUE, row.names=NULL,
                      stringsAsFactors = F)
genotype = genotype[!duplicated(genotype[, 1]), ]
rownames(genotype) = genotype[, 1]
genotype = t(genotype[, -1])


# Use only the genes in the module and have postion info in the meta file
gene_in_cluster = data.frame("gene_name"=names(gene_module$moduleLabels)[gene_module$moduleLabels == module_id], stringsAsFactors = F)
gene_w_pos = merge(gene_in_cluster, gene_meta, by.x = 1, by.y = 1)
exp.data = datExpr[, gene_w_pos$gene_name]
print("Input part is done!")

# run parallel on snps
cores=(Sys.getenv("SLURM_NTASKS_PER_NODE"))
cl = makeCluster(as.numeric(cores), outfile="")
registerDoParallel(cl)

M = ncol(genotype); K = ncol(exp.data); N = nrow(exp.data)
#ind.perm = sample(1:N); genotype = genotype[ind.perm, ]
chunk.size = M %/% as.numeric(cores)
system.time(
  res.all <- foreach(i=1:cores, .combine='rbind') %dopar%
    {
      res = matrix(nrow = chunk.size, ncol = K,
                   dimnames = list(colnames(genotype)[((i-1)*chunk.size+1):(i*chunk.size)], colnames(exp.data)))

      for(j in 1:chunk.size){
        snp = genotype[, j + (i-1)*chunk.size]
        res[j, ] = apply(exp.data, 2, function(x) summary(lm(x~snp))$coefficients[2, 3] )

        if(j %% 100 == 0){print(j+(i-1)*chunk.size)}
      }
      res
    }
)
stopCluster(cl)
print("Parallel part is done!")

# Run PCO on snps left in the parallel part
Nsnps_left = M %% as.numeric(cores)
Nsnps_done = (chunk.size)*as.numeric(cores)
if(Nsnps_left != 0){
  res = matrix(nrow = Nsnps_left, ncol = K,
               dimnames = list(colnames(genotype)[Nsnps_done+1:(Nsnps_left)], colnames(exp.data)) )

  for(j in 1:Nsnps_left){
    snp = genotype[, j + Nsnps_done]
    res[j, ] = apply(exp.data, 2, function(x) summary(lm(x~snp))$coefficients[2, 3] )

    print(Nsnps_done+j)
  }

  res.all = rbind(res.all, res)
}
print("All part is done!")


saveRDS(res.all, file = paste0("./z/z.module", module_id, ".chr", chr, ".rds"))

Calculate pvalues

Here is the code using PCO to calculate pvalues. This uses only PC’s with eigenvalue \(\lambda > 1\). Modified PCO code in ./script/*.

rm(list = ls())

parameter = commandArgs(trailingOnly = T)
module_id = parameter[1]
chr = parameter[2]
p.method = parameter[3] #"truncPCO" #"PCO"


setwd("/scratch/midway2/liliw1/TCGA/")

# dataset DGN: dir_data = "/project2/xuanyao/data/DGN/txt_file_of_DGN/"
# dataset DGN: gene_module = readRDS(file = paste0(dir_data, "DGN_coexpressed_gene_network.rds"))
# dataset DGN: gene_meta = read.table(paste0(dir_data, "expression_meta_hg19.txt"), ...)


source(paste0("./script/ModifiedPCOMerged.R"))
source(paste0("./script/liu.r"))
source(paste0("./script/liumod.R"))
source(paste0("./script/davies.R"))
source(paste0("./script/SigmaMetaEstimate.R"))
library(foreach)
library(doParallel)
library(mvtnorm)
library(MPAT)


# zscore input
z_mat = readRDS(file = paste0("./z/z.module", module_id, ".chr", chr, ".rds"))


# data input
dir_data = "/project2/xuanyao/llw/breastcancerTCGA/txt_file/"
datExpr = readRDS(paste0(dir_data, "ex_var_regressed.rds"))
gene_meta = read.table(paste0(dir_data, "tumor.gene_pos.txt"),
                      sep = '\t',
                      header = T,
                      stringsAsFactors = F)
gene_module = readRDS(file = paste0(dir_data, "coexp.module.rds"))


# Use only the genes in the module & have postion info in the meta file & not on chromosome chr
gene_in_cluster = data.frame("gene_name" = names(gene_module$moduleLabels[gene_module$moduleLabels == module_id]), stringsAsFactors = F)
gene_w_pos = merge(gene_in_cluster, gene_meta, by.x = 1, by.y = 1)
gene_trans = gene_w_pos[gene_w_pos[,2] != paste0("chr", chr), "gene_name"]


Sigma = cor(datExpr[, gene_trans])
SigmaO = ModifiedSigmaOEstimate(Sigma,simNum=1000)
z_mat_trans = z_mat[, gene_trans]; rm(z_mat)


# Do parallel
cores=(Sys.getenv("SLURM_NTASKS_PER_NODE"))
cl = makeCluster(as.numeric(cores), outfile="")
registerDoParallel(cl)

Nsnps = nrow(z_mat_trans); snps = rownames(z_mat_trans)
chunk.size = Nsnps %/% as.numeric(cores)
Nsnps_left = Nsnps %% as.numeric(cores)
Nsnps_done = (chunk.size)*as.numeric(cores)

system.time(
  res.all <- foreach(i=1:cores, .combine='c') %dopar%
    {
      library(MPAT)
      library(mvtnorm)

      res <- rep(NA, chunk.size); names(res) <- snps[((i-1)*chunk.size+1):(i*chunk.size)]
      for(x in 1:chunk.size) {
        res[x] <- ModifiedPCOMerged(Z.vec=z_mat_trans[x + (i-1)*chunk.size, ],
                                                       Sigma=Sigma, SigmaO=SigmaO,method = "davies")

        if(x %% 100 == 0){print(x + (i-1)*chunk.size)}
      }

      res
    }
)
stopCluster(cl)

print("Parallel part is done!")


if(Nsnps_left != 0){
  res_left = rep(NA, Nsnps_left); names(res_left) = snps[1:(Nsnps_left) + Nsnps_done]
  for(k in 1:(Nsnps_left)){
    res_left[k] = ModifiedPCOMerged(Z.vec=z_mat_trans[k + Nsnps_done, ],
                               Sigma=Sigma, SigmaO=SigmaO,method = "davies")

    print(k + Nsnps_done)
  }

  res.all = c(res.all, res_left)
}

print("All part is done!")


saveRDS(res.all, file = paste0("./p/p.module", module_id, ".chr", chr, ".", p.method, ".rds"))

Multi-testing correction

Pvalue adjustment based on uniform null pvalue

rm(list = ls())

library(qvalue)

# estimate the p-value cut-off to control the false discovery rate at a certain level

parameter = commandArgs(trailingOnly = T)
fdr.level = parameter[1]
adjust.method = parameter[2]

# input pvalues for all 18 gene modules and 22 chromosomes
pvalue_pc = NULL
for(gene_cluster_id in 1:18){
  for(chr in 1:22){
    tmp_pc = readRDS(file = paste0("./TruncPCO/GeneCluster", gene_cluster_id, "/pvalue_pco_", chr, ".rds"))
    names(tmp_pc) = paste0("C", gene_cluster_id, ":", names(tmp_pc))
    pvalue_pc = c(pvalue_pc, tmp_pc)

    cat("Gene cluster:", gene_cluster_id, ". Chr:", chr, "\n")
    print(object.size(pvalue_pc))
  }
}

# histogram of all pvalues
png("./FDR/Hist of all pvalues")
hist(pvalue_pc)
dev.off()
print("Histogram of pvalues saved!")


# adjust pvalues
if(adjust.method == "BH"){
  p.adjusted = p.adjust(pvalue_pc, method = "BH")
  print("BH correction of pvalue done!")


  signal_q = p.adjusted[p.adjusted<fdr.level]
  signal_name = names(signal_q)
  signal_summary = cbind(pvalue_pc[signal_name], signal_q)
  colnames(signal_summary) = c("pvalue.PCO", "qvalue.PCO")
  write.table(signal_summary, file = "./FDR/transeQTL.Summary.BH.txt",
              quote = FALSE)
  print("Signals written in the file!")


  saveRDS(p.adjusted, file = "./FDR/qvalueofPC.BH.rds")
  print("qvalue results saved!")

}else if(adjust.method == "qvalue"){
  qobj_pc = qvalue(p = pvalue_pc, fdr.level=fdr.level)
  print("FDR analysis of pc has been done!")


  signal_q = qobj_pc$qvalues[qobj_pc$significant]
  signal_name = names(signal_q)
  signal_summary = cbind(qobj_pc$pvalues[signal_name], signal_q)
  colnames(signal_summary) = c("pvalue.PCO", "qvalue.PCO")
  write.table(signal_summary, file = "./FDR/transeQTL.Summary.q.txt",
              quote = FALSE)
  print("Signals written in the file!")


  saveRDS(qobj_pc, file = "./FDR/qvalueofPC.q.rds")
  print("qvalue results saved!")

}

Pvalue adjustment using empirical null pvalue

Post analysis of signals

  • Use all PC’s. Adjust pvalues using qvalue.

  • Use PC’s with eigenvalue \(\lambda>1\). Adjust pvalues using empirical null.

  • Use the first PC. (elife method)

  • Single-variate method. (MatrixeQTL)

Others

Use only the first PC. Use all PC’s.

Find phseudogenes, poor-mapped genes, and cross-mappable genes

Run MatrixeQTL

  • Prepare files

  • Run MatrixeQTL


sessionInfo()