Last updated: 2022-04-03

Checks: 5 1

Knit directory: diff_driver_analysis-main/

This reproducible R Markdown analysis was created with workflowr (version 1.7.0). The Checks tab describes the reproducibility checks that were applied when the results were created. The Past versions tab lists the development history.


Great job! The global environment was empty. Objects defined in the global environment can affect the analysis in your R Markdown file in unknown ways. For reproduciblity it’s best to always run the code in an empty environment.

The command set.seed(20181210) was run prior to running the code in the R Markdown file. Setting a seed ensures that any results that rely on randomness, e.g. subsampling or permutations, are reproducible.

Great job! Recording the operating system, R version, and package versions is critical for reproducibility.

Nice! There were no cached chunks for this analysis, so you can be confident that you successfully produced the results during this run.

Great job! Using relative paths to the files within your workflowr project makes it easier to run your code on other machines.

Tracking code development and connecting the code version to the results is critical for reproducibility. To start using Git, open the Terminal and type git init in your project directory.


This project is not being versioned with Git. To obtain the full reproducibility benefits of using workflowr, please see ?wflow_start.


Methods

There are several approaches to identify genes associated with certain phenotypes.

1. Multi-linear regression (ANOVA)

This was the approach used by domainXplorer (Porta-Pardo and Godzik, 2016). The software is available here.

Let \(E_i\) be some form of immune score for tumor sample \(i\), \(T_i\) is the tumor type for sample \(i\), \(N_i\) is the total number of mutations in sample \(i\) (or mutation rate)., \(D_{jg}\) is the binary indicator of the mutation status for gene \(g\) in sample \(i\). For each \(i\), we have the following model (for simplicity, omiting \(i\))

\[ E = \beta_0 + \beta_1 * T + \beta_2* N + \beta_3 *D_g. \] In the next step, p value is obtained by ANOVA test.

In Ding et al 2018 cell, domainXplorer was used to identify driver events that correlate with the presence of different immune cell types. So \(E\) is the cell type fraction. In Rooney et al 2015 cell, \(E\) is rank transformed CYT (cytolytic activity defined based on gene expression).

2. Modeling of count data (CountLRT)

We propose a new approach to address this problem. Let \(Y_{ij}\) be the number of mutations at position \(j\), sample \(i\), \(Y_{ij} \in \{0,1\}\). \(\mu_{ij}\) is background mutation rate at position \(j\), sample \(i\).

\[Y_{ij} \sim Bernoulli(\mu_{ij}\delta_{ij})\] Here \(\delta_{ij}\) is the deviation of observed mutation rate from the background mutation rate. It is dependent on the sample’s immune score \(E_i\) in the following way:

\[ log(\delta_{ij})=\beta_{ij}^0 + \beta_g*E_{i} \]

\(\beta_g\) is gene \(g\)’s coefficient for \(E_i\). For the null model, \(\beta_g=0\). For the alternative model, we could get Maximum Likelihood Estimation for \(\beta_g\) and get Bayes Factor for gene \(g\).

In our approach, we can take into account confounders by modeling \(\mu_{ij}\) and \(\beta_{ij}^0\).

3. Binomial test

This method only works for binary \(E_i\) data. In this approach, we aggregate all positions in a gene and count the number of mutations in \(E_i=1\) (\(N_1\)) and \(E_i=0\) (\(N_2\)) groups, we assess if \(N_1\) deviates from \(Binom(N_1+N_2, \frac{\#\{E_i=1\}} {\#\{E_i=1\} + \#\{E_i=0\}})\).

4. logistic regression (gene level)

This method only works for binary \(E_i\) data.

\[log(\frac{p(E_i=1)}{1-p(E_i=1)})= \beta * D_i +\beta_0\]

5. Fisher’s exact test

This method only works for binary \(E_i\) data. In this approach, we aggregate all positions in a gene and count the number of mutations in \(E_i=1\) (\(N_1\)) and \(E_i=0\) (\(N_2\)) groups then use Fisher exact test a p value.

Simulation

In this simulation, we assume no gene specific effect of background mutation rate, i.e. no \(\lambda_g\) term for background mutation rate. Within the same gene, we assume all non-silent positions share the same selection mutation rate across positions, i.e does not distinguish functional positions from the rest non-silent positions.

Generate data

  • Generate \(E_i\). For a given sample size \(N\), we assign \(N/2\) samples with \(E=1\), the rest have \(E=0\)

  • Generate \(Y_{ij}\). We randomly pick one gene, ERBB3. For \(Y_{ij}\) in this gene, we simulate silent mutations based on realistic parameters estimated from UCS data using driverMAPS. We simulate nonsilent mutations with \(\beta_{ij}^0=0\) (this implies that on average this gene is not selected) or \(\beta_{ij}^0=1\) (this implies that on average this gene is positively selected). We simulate with \(\beta_{g(j)}=\{0,0.2,0.8\}\). \(\beta_{g(j)}=0\) indicates that this gene is not associated with the phenotype. Repeat this process 1000 times.

Run analysis
We run five methods for each of the 5000 times and obtain p values (for method 2, p values are obtained by likelihood ratio tests). Compare number of rounds with p value <0.01 for these methods.

We assume all samples come from the same tumor type and the mutation rate in each sample is the same. For method 2, we didn’t estimate mutation rate for each sample, but used the mutation rate from simulation.

Results

gene is not associated with phenotype ( \(\beta_g=0\) ).

In this simulation, the mutation data was generated with \(\beta_{ij}^0=0\) and \(\beta_g=0\). We treat all position \(i\) with the same relative risk,i.e. didn’t account for different functional impacts of among positions.

null

gene with low mutation rate and weakly associated with phenotype ( \(\beta_{ij}^0=0\) and \(\beta_g=0.2\) ).

In this simulation, the mutation data was generated with \(\beta_{ij}^0=0\) and \(\beta_g=0.2\).

lowweak

gene with low mutation rate and strongly associated with phenotype ( \(\beta_{ij}^0=0\) and \(\beta_g=0.8\) ).

In this simulation, the mutation data was generated with \(\beta_{ij}^0=0\) and \(\beta_g=0.8\).

lowstrong

gene with high mutation rate and weakly associated with phenotype ( \(\beta_{ij}^0=1\) and \(\beta_g=0.2\) ).

In this simulation, the mutation data was generated with \(\beta_{ij}^0=1\) and \(\beta_g=0.2\).

highweak

gene with high mutation rate and strongly associated with phenotype ( \(\beta_{ij}^0=1\) and \(\beta_g=0.8\) ).

In this simulation, the mutation data was generated with \(\beta_{ij}^0=1\) and \(\beta_g=0.8\).

highstrong


R version 4.1.2 (2021-11-01)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19043)

Matrix products: default

locale:
[1] LC_COLLATE=English_United States.1252 
[2] LC_CTYPE=English_United States.1252   
[3] LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C                          
[5] LC_TIME=English_United States.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.8       rstudioapi_0.13  knitr_1.37       magrittr_2.0.2  
 [5] workflowr_1.7.0  R6_2.5.1         rlang_1.0.1      fastmap_1.1.0   
 [9] fansi_1.0.2      stringr_1.4.0    tools_4.1.2      xfun_0.29       
[13] utf8_1.2.2       cli_3.1.1        git2r_0.29.0     jquerylib_0.1.4 
[17] htmltools_0.5.2  ellipsis_0.3.2   rprojroot_2.0.2  yaml_2.2.2      
[21] digest_0.6.29    tibble_3.1.6     lifecycle_1.0.1  crayon_1.5.0    
[25] later_1.3.0      sass_0.4.0       vctrs_0.3.8      promises_1.2.0.1
[29] fs_1.5.2         glue_1.6.1       evaluate_0.14    rmarkdown_2.11  
[33] stringi_1.7.6    bslib_0.3.1      compiler_4.1.2   pillar_1.7.0    
[37] jsonlite_1.7.3   httpuv_1.6.5     pkgconfig_2.0.3