Trans

Last updated: 2020-11-03

Checks: 6 1

Knit directory: GradLog/

This reproducible R Markdown analysis was created with workflowr (version 1.6.2). The Checks tab describes the reproducibility checks that were applied when the results were created. The Past versions tab lists the development history.

R Markdown file: uncommitted changes

The R Markdown file has unstaged changes. To know which version of the R Markdown file created these results, you’ll want to first commit it to the Git repo. If you’re still working on the analysis, you can ignore this warning. When you’re finished, you can run wflow_publish to commit the R Markdown file and build the HTML.

Environment: empty

Great job! The global environment was empty. Objects defined in the global environment can affect the analysis in your R Markdown file in unknown ways. For reproduciblity it’s best to always run the code in an empty environment.

Seed: set.seed(20201014)

The command set.seed(20201014) was run prior to running the code in the R Markdown file. Setting a seed ensures that any results that rely on randomness, e.g. subsampling or permutations, are reproducible.

Session information: recorded

Great job! Recording the operating system, R version, and package versions is critical for reproducibility.

Cache: none

Nice! There were no cached chunks for this analysis, so you can be confident that you successfully produced the results during this run.

File paths: relative

Great job! Using relative paths to the files within your workflowr project makes it easier to run your code on other machines.

Repository version: a167ba4

Great! You are using Git for version control. Tracking code development and connecting the code version to the results is critical for reproducibility.

The results in this page were generated with repository version a167ba4. See the Past versions tab to see a history of the changes made to the R Markdown and HTML files.

Note that you need to be careful to ensure that all relevant files for the analysis have been committed to Git prior to generating the results (you can use wflow_publish or wflow_git_commit). workflowr only checks the R Markdown file, but you know if there are other scripts or data files that it depends on. Below is the status of the Git repository when the results were generated:


Ignored files:
    Ignored:    .DS_Store
    Ignored:    .Rhistory
    Ignored:    .Rproj.user/
    Ignored:    analysis/.DS_Store
    Ignored:    code/.DS_Store
    Ignored:    data/.DS_Store
    Ignored:    output/.DS_Store

Untracked files:
    Untracked:  analysis/Log.Rmd
    Untracked:  data/GO.organizedResult.csv

Unstaged changes:
    Modified:   analysis/Trans.Rmd
    Modified:   analysis/_site.yml
    Modified:   analysis/index.Rmd

Note that any generated files, e.g. HTML, png, CSS, etc., are not included in this status report because it is ok for generated content to have uncommitted changes.

These are the previous versions of the repository in which changes were made to the R Markdown (analysis/Trans.Rmd) and HTML (docs/Trans.html) files. If you’ve configured a remote Git repository (see ?wflow_git_remote), click on the hyperlinks in the table below to view the files as they were in that past version.

File	Version	Author	Date	Message
html	2a04808	Lili Wang	2020-10-19	Build site.
Rmd	c830222	Lili Wang	2020-10-19	Update all code used for breast cancer TCGA data.
html	c830222	Lili Wang	2020-10-19	Update all code used for breast cancer TCGA data.
html	5ccca4c	Lili Wang	2020-10-17	Build site.
Rmd	2c4d881	Lili Wang	2020-10-17	rewrite the code chunk using the ‘code’ parameter
html	a62f2e4	Lili Wang	2020-10-17	Build site.
Rmd	eb3f712	Lili Wang	2020-10-17	test using cat to display code
html	f05b42c	Lili Wang	2020-10-17	Build site.
Rmd	f9b0a15	Lili Wang	2020-10-17	Rewrite make.geno to convert the genotype binary files to txt
html	5a86aaf	Lili Wang	2020-10-14	Build site.
html	4382474	Lili Wang	2020-10-14	Build site.
html	f79b80d	Lili Wang	2020-10-14	Build site.
Rmd	d7821ad	Lili Wang	2020-10-14	wflow_publish("./analysis/*.Rmd")

Outine

Preprocess data

Convert plink files of genotype to .txt

module load plink
module load R/3.6.1

cd /scratch/midway2/liliw1/

# dataset DGN: dir_plink=/project2/xuanyao/data/DGN
# dataset DGN: dir_txt=/project2/xuanyao/data/DGN/txt_file_of_DGN
# dataset DGN: dir_script=/project2/xuanyao/data/DGN
# dataset DGN: awk 'BEGIN{print "id\tchr\tpos"}{printf("%s\tchr%s\t%s\n",$2,$1,$4)}'

dir_plink=/project2/xuanyao/xuanyao/cancer_gbat/breast_cancer
dir_txt=/project2/xuanyao/llw/breastcancerTCGA/txt_file
dir_script=/scratch/midway2/liliw1/TCGA


for CHR in `seq 22`
do
# convert bfiles to .raw files
plink \
--bfile $dir_plink/genotype/chr${CHR}_QCed \
--recodeA \
--out $dir_txt/chr${CHR}

# extract genotype meta
awk \
'BEGIN{print "id\tchr\tpos"}{key=sprintf("%s:%s", $1, $4);printf("%s\tchr%s\t%s\n",key,$1,$4)}' \
$dir_plink/genotype/chr${CHR}_QCed.bim > \
$dir_txt/chr${CHR}_genotype_meta.txt

# convert genotype to txt file
Rscript --no-restore --no-save $dir_script/make.geno.R $CHR

done

make.geno.R

rm(list = ls())

args = commandArgs(trailing = T)
chr = args[1]

# dataset DGN: dir_txt = "/project2/xuanyao/data/DGN/txt_file_of_DGN/"
# dataset DGN: colnames(geno) = unlist(lapply(strsplit(colnames(genotype)[7:ncol(genotype)],"_"),"[",1))

dir_txt = "/project2/xuanyao/llw/breastcancerTCGA/txt_file"
genotype = read.table(paste0(dir_txt, "chr",chr,".raw"),
                 header = T, check.names = F)
snp.meta = read.table(paste0(dir_txt, "chr",chr,"_genotype_meta.txt"),
                      header = TRUE, stringsAsFactors = FALSE)

geno = as.matrix(genotype[,7:ncol(genotype)])
rownames(geno) = as.character(genotype$IID)
colnames(geno) = snp.meta$id


write.table(t(geno),
            paste0(dir_txt, "chr", chr,".genotype.matrix.eqtl.txt"),
            sep = "\t", quote = FALSE)

cat(paste0("chr", chr, " is done!"))

Regress out covariates from expression data

rm(list = ls())

dir_data = "/project2/xuanyao/llw/breastcancerTCGA/txt_file"
setwd(dir_data)

extract_residual <- function(y, x){
  return(lm(y ~ x)$residuals)
}


# import expression and covariates data
load("./tumor.rdata")
cov_exp = read.table("./tumor.exp_pc.txt",
                     header = TRUE, row.names = NULL, stringsAsFactors = FALSE)
cov_snp = read.table("./tumor.snp_pc.txt",
                     header = FALSE, row.names = NULL, stringsAsFactors = FALSE)

colnames(ex) = gnames


# regress out covariates
ex_cov_regressed = apply(ex, 2, function(y) extract_residual(y, as.matrix(cbind(cov_exp, cov_snp))))
saveRDS(ex_cov_regressed, "./ex_var_regressed.rds")

print("Done!")

Remove low-quality genes

Remove seudogenes, genes with mappability\(<0.9\), and gene pairs with cross-mappability higher than 1.

The code here is for the breast cancer TCGA data. So the genes on chromosomes X and Y are also removed.

rm(list = ls())

options(stringsAsFactors = FALSE)

# dataDGN: dir_data =
# dataDGN: gene.name = colnames(datExpr)

dir_data = "/project2/xuanyao/llw/breastcancerTCGA/txt_file"
setwd(dir_data)

### Data preparation ###
datExpr = readRDS("./ex_var_regressed.rds")
gene.name = sapply(colnames(datExpr), function(x) strsplit(x, "\\|")[[1]][1])

# remove genes on chromosome X & Y
gene.meta = read.table("./tumor.gene_pos.txt",
                       header = TRUE,
                       stringsAsFactors = F)
rownames(gene.meta) = gene.meta$gene
ind_chrX = gene.meta[colnames(datExpr), "chr"] == "chrX" | gene.meta[colnames(datExpr), "chr"] == "chrY"

# remove pseudogenes
pseudogenes = read.table("./pseudogenes.txt", sep ="\t", stringsAsFactors = F)
ind_pseudo = gene.name %in% unique(pseudogenes$V1)

# remove low mappability genes
mappability = read.table("./mappability.txt", sep ="\t", stringsAsFactors = F, header = T)
ind_map = gene.name %in% unique(mappability$gene_name)

# remove cross-mappable genes
cross_mappability = readRDS("./cross_mappable_genes.rds")
cross_map_genes = unique(names(cross_mappability[!is.na(cross_mappability)]))
ind_cross_map = gene.name %in% cross_map_genes

ind_remove = ind_chrX | ind_pseudo | ind_map | ind_cross_map

sum(ind_chrX)
sum(ind_pseudo)
sum(ind_map)
sum(ind_cross_map)
sum(ind_remove)


write.table(as.matrix(data.frame("gene" = colnames(datExpr),
                                 "gene.name" = gene.name,
                                 "ind_remove" = ind_remove,
                                 "ind_chrX" = ind_chrX,
                                 "pseudo" = ind_pseudo,
                                 "lowmap" = ind_map,
                                 "crossmap" = ind_cross_map)),
            "genes_rm_info.txt",
            row.names = FALSE, col.names = TRUE, quote = FALSE)

Coexpressed gene modules

Cluster the co-expressed genes by WGCNA.

rm(list = ls())

library(WGCNA)

dir_data = "/project2/xuanyao/llw/breastcancerTCGA/txt_file"
setwd(dir_data)

# Data preparation
datExpr = readRDS("./ex_var_regressed.rds")
rm_info = read.table("./genes_rm_info.txt",
                     header = TRUE, row.names = NULL,
                     stringsAsFactors = FALSE)
datExpr = datExpr[, !rm_info$ind_remove]


# Run WGCNA
### Parameter specification ###
minModuleSize = 30
MEDissThres = 0.15
if_plot_adjacency_mat_parameter_selection = F
if_plot_only_tree = F
if_plot_color_and_tree = F
if_plot_eigengene_heatmap_tree = F
if_heatmap_of_network = T


### Step1: network construction ###
# determine the paramter in adjacency function: pickSoftThreshold() OR pickHardThreshold()
powers = c(c(1:10), seq(from = 12, to=20, by=2))
sft = pickSoftThreshold(datExpr, powerVector = powers, verbose = 5)
softPower = sft$powerEstimate

# network construction
adjacency = adjacency(datExpr, power = softPower)
TOM = TOMsimilarity(adjacency)
dissTOM = 1-TOM


### Step2: module detection ###
# tree construction using hierarchical clustering based on TOM
geneTree = hclust(as.dist(dissTOM), method = "average")

# branch cutting using dynamic tree cut
dynamicMods = cutreeDynamic(dendro = geneTree, distM = dissTOM,
                            deepSplit = 2, pamRespectsDendro = FALSE,
                            minClusterSize = minModuleSize)
dynamicColors = labels2colors(dynamicMods)

# eigene genes
MEList = moduleEigengenes(datExpr, colors = dynamicColors)
MEs = MEList$eigengenes

# Call an automatic merging function
merge = mergeCloseModules(datExpr, dynamicColors, cutHeight = MEDissThres, verbose = 3)
mergedColors = merge$colors
mergedMEs = merge$newMEs

moduleLabels = match(mergedColors, c("grey", standardColors(100)))-1
names(moduleLabels) = colnames(datExpr)

# Save results
result = list(moduleColors = mergedColors,
              moduleLabels = moduleLabels,
              MEs = mergedMEs,
              old_moduleColors = dynamicColors,
              old_moduleLabels = dynamicMods,
              old_MEs = MEs,
              geneTree = geneTree,
              ind_remove = rm_info$ind_remove)
saveRDS(result, file = "./coexp.module.rds")

Calculate zscores

Here is the code of how I compute null zscores for module \(i\) and chromosome \(j\), using randomly permuted samples. Calculating observed zscores (zscores based on the original real data) is similar, only need to remove line genotype = genotype[ind.perm, ].

Update the code to calculate the observed zscores from TCGA data.

rm(list = ls())

parameter = commandArgs(trailingOnly = T)
module_id = parameter[1]
chr = parameter[2]

setwd("/scratch/midway2/liliw1/TCGA/")

# dataset DGN: dir_data = "/project2/xuanyao/data/DGN/"
# dataset DGN: gene_meta = read.table(paste0(dir_data, "expression_meta_hg19.txt"), ...)
# dataset DGN: gene_module = readRDS(file = paste0(dir_data, "DGN_coexpressed_gene_network.rds"))
# dataset DGN: header = ifelse(chr==1, TRUE, FALSE) when reading genotype


library(foreach)
library(doParallel)

# data input
dir_data = "/project2/xuanyao/llw/breastcancerTCGA/txt_file/"
datExpr = readRDS(paste0(dir_data, "ex_var_regressed.rds"))
gene_meta = read.table(paste0(dir_data, "tumor.gene_pos.txt"),
                      sep = '\t',
                      header = T,
                      stringsAsFactors = F)
gene_module = readRDS(file = paste0(dir_data, "coexp.module.rds"))
genotype = read.table(paste0(dir_data, "chr", chr, ".genotype.matrix.eqtl.txt"),
                      header = TRUE, row.names=NULL,
                      stringsAsFactors = F)
genotype = genotype[!duplicated(genotype[, 1]), ]
rownames(genotype) = genotype[, 1]
genotype = t(genotype[, -1])


# Use only the genes in the module and have postion info in the meta file
gene_in_cluster = data.frame("gene_name"=names(gene_module$moduleLabels)[gene_module$moduleLabels == module_id], stringsAsFactors = F)
gene_w_pos = merge(gene_in_cluster, gene_meta, by.x = 1, by.y = 1)
exp.data = datExpr[, gene_w_pos$gene_name]
print("Input part is done!")

# run parallel on snps
cores=(Sys.getenv("SLURM_NTASKS_PER_NODE"))
cl = makeCluster(as.numeric(cores), outfile="")
registerDoParallel(cl)

M = ncol(genotype); K = ncol(exp.data); N = nrow(exp.data)
#ind.perm = sample(1:N); genotype = genotype[ind.perm, ]
chunk.size = M %/% as.numeric(cores)
system.time(
  res.all <- foreach(i=1:cores, .combine='rbind') %dopar%
    {
      res = matrix(nrow = chunk.size, ncol = K,
                   dimnames = list(colnames(genotype)[((i-1)*chunk.size+1):(i*chunk.size)], colnames(exp.data)))

      for(j in 1:chunk.size){
        snp = genotype[, j + (i-1)*chunk.size]
        res[j, ] = apply(exp.data, 2, function(x) summary(lm(x~snp))$coefficients[2, 3] )

        if(j %% 100 == 0){print(j+(i-1)*chunk.size)}
      }
      res
    }
)
stopCluster(cl)
print("Parallel part is done!")

# Run PCO on snps left in the parallel part
Nsnps_left = M %% as.numeric(cores)
Nsnps_done = (chunk.size)*as.numeric(cores)
if(Nsnps_left != 0){
  res = matrix(nrow = Nsnps_left, ncol = K,
               dimnames = list(colnames(genotype)[Nsnps_done+1:(Nsnps_left)], colnames(exp.data)) )

  for(j in 1:Nsnps_left){
    snp = genotype[, j + Nsnps_done]
    res[j, ] = apply(exp.data, 2, function(x) summary(lm(x~snp))$coefficients[2, 3] )

    print(Nsnps_done+j)
  }

  res.all = rbind(res.all, res)
}
print("All part is done!")


saveRDS(res.all, file = paste0("./z/z.module", module_id, ".chr", chr, ".rds"))

Calculate pvalues

Here is the code using PCO to calculate pvalues. This uses only PC’s with eigenvalue \(\lambda > 1\). Modified PCO code in ./script/*.

rm(list = ls())

parameter = commandArgs(trailingOnly = T)
module_id = parameter[1]
chr = parameter[2]
p.method = parameter[3] #"truncPCO" #"PCO"


setwd("/scratch/midway2/liliw1/TCGA/")

# dataset DGN: dir_data = "/project2/xuanyao/data/DGN/txt_file_of_DGN/"
# dataset DGN: gene_module = readRDS(file = paste0(dir_data, "DGN_coexpressed_gene_network.rds"))
# dataset DGN: gene_meta = read.table(paste0(dir_data, "expression_meta_hg19.txt"), ...)


source(paste0("./script/ModifiedPCOMerged.R"))
source(paste0("./script/liu.r"))
source(paste0("./script/liumod.R"))
source(paste0("./script/davies.R"))
source(paste0("./script/SigmaMetaEstimate.R"))
library(foreach)
library(doParallel)
library(mvtnorm)
library(MPAT)


# zscore input
z_mat = readRDS(file = paste0("./z/z.module", module_id, ".chr", chr, ".rds"))


# data input
dir_data = "/project2/xuanyao/llw/breastcancerTCGA/txt_file/"
datExpr = readRDS(paste0(dir_data, "ex_var_regressed.rds"))
gene_meta = read.table(paste0(dir_data, "tumor.gene_pos.txt"),
                      sep = '\t',
                      header = T,
                      stringsAsFactors = F)
gene_module = readRDS(file = paste0(dir_data, "coexp.module.rds"))


# Use only the genes in the module & have postion info in the meta file & not on chromosome chr
gene_in_cluster = data.frame("gene_name" = names(gene_module$moduleLabels[gene_module$moduleLabels == module_id]), stringsAsFactors = F)
gene_w_pos = merge(gene_in_cluster, gene_meta, by.x = 1, by.y = 1)
gene_trans = gene_w_pos[gene_w_pos[,2] != paste0("chr", chr), "gene_name"]


Sigma = cor(datExpr[, gene_trans])
SigmaO = ModifiedSigmaOEstimate(Sigma,simNum=1000)
z_mat_trans = z_mat[, gene_trans]; rm(z_mat)


# Do parallel
cores=(Sys.getenv("SLURM_NTASKS_PER_NODE"))
cl = makeCluster(as.numeric(cores), outfile="")
registerDoParallel(cl)

Nsnps = nrow(z_mat_trans); snps = rownames(z_mat_trans)
chunk.size = Nsnps %/% as.numeric(cores)
Nsnps_left = Nsnps %% as.numeric(cores)
Nsnps_done = (chunk.size)*as.numeric(cores)

system.time(
  res.all <- foreach(i=1:cores, .combine='c') %dopar%
    {
      library(MPAT)
      library(mvtnorm)

      res <- rep(NA, chunk.size); names(res) <- snps[((i-1)*chunk.size+1):(i*chunk.size)]
      for(x in 1:chunk.size) {
        res[x] <- ModifiedPCOMerged(Z.vec=z_mat_trans[x + (i-1)*chunk.size, ],
                                                       Sigma=Sigma, SigmaO=SigmaO,method = "davies")

        if(x %% 100 == 0){print(x + (i-1)*chunk.size)}
      }

      res
    }
)
stopCluster(cl)

print("Parallel part is done!")


if(Nsnps_left != 0){
  res_left = rep(NA, Nsnps_left); names(res_left) = snps[1:(Nsnps_left) + Nsnps_done]
  for(k in 1:(Nsnps_left)){
    res_left[k] = ModifiedPCOMerged(Z.vec=z_mat_trans[k + Nsnps_done, ],
                               Sigma=Sigma, SigmaO=SigmaO,method = "davies")

    print(k + Nsnps_done)
  }

  res.all = c(res.all, res_left)
}

print("All part is done!")


saveRDS(res.all, file = paste0("./p/p.module", module_id, ".chr", chr, ".", p.method, ".rds"))

Multi-testing correction

Pvalue adjustment based on uniform null pvalue

rm(list = ls())

library(qvalue)

# estimate the p-value cut-off to control the false discovery rate at a certain level

parameter = commandArgs(trailingOnly = T)
fdr.level = parameter[1]
adjust.method = parameter[2]

# input pvalues for all 18 gene modules and 22 chromosomes
pvalue_pc = NULL
for(gene_cluster_id in 1:18){
  for(chr in 1:22){
    tmp_pc = readRDS(file = paste0("./TruncPCO/GeneCluster", gene_cluster_id, "/pvalue_pco_", chr, ".rds"))
    names(tmp_pc) = paste0("C", gene_cluster_id, ":", names(tmp_pc))
    pvalue_pc = c(pvalue_pc, tmp_pc)

    cat("Gene cluster:", gene_cluster_id, ". Chr:", chr, "\n")
    print(object.size(pvalue_pc))
  }
}

# histogram of all pvalues
png("./FDR/Hist of all pvalues")
hist(pvalue_pc)
dev.off()
print("Histogram of pvalues saved!")


# adjust pvalues
if(adjust.method == "BH"){
  p.adjusted = p.adjust(pvalue_pc, method = "BH")
  print("BH correction of pvalue done!")


  signal_q = p.adjusted[p.adjusted<fdr.level]
  signal_name = names(signal_q)
  signal_summary = cbind(pvalue_pc[signal_name], signal_q)
  colnames(signal_summary) = c("pvalue.PCO", "qvalue.PCO")
  write.table(signal_summary, file = "./FDR/transeQTL.Summary.BH.txt",
              quote = FALSE)
  print("Signals written in the file!")


  saveRDS(p.adjusted, file = "./FDR/qvalueofPC.BH.rds")
  print("qvalue results saved!")

}else if(adjust.method == "qvalue"){
  qobj_pc = qvalue(p = pvalue_pc, fdr.level=fdr.level)
  print("FDR analysis of pc has been done!")


  signal_q = qobj_pc$qvalues[qobj_pc$significant]
  signal_name = names(signal_q)
  signal_summary = cbind(qobj_pc$pvalues[signal_name], signal_q)
  colnames(signal_summary) = c("pvalue.PCO", "qvalue.PCO")
  write.table(signal_summary, file = "./FDR/transeQTL.Summary.q.txt",
              quote = FALSE)
  print("Signals written in the file!")


  saveRDS(qobj_pc, file = "./FDR/qvalueofPC.q.rds")
  print("qvalue results saved!")

}

Pvalue adjustment using empirical null pvalue

Post analysis of signals

Use all PC’s. Adjust pvalues using qvalue.
Use PC’s with eigenvalue \(\lambda>1\). Adjust pvalues using empirical null.
Use the first PC. (elife method)
Single-variate method. (MatrixeQTL)

Others

Use only the first PC. Use all PC’s.

Find phseudogenes, poor-mapped genes, and cross-mappable genes

Run MatrixeQTL

Prepare files
Run MatrixeQTL

sessionInfo()