Log 2025

Last updated: 2025-08-20

Checks: 7 0

Knit directory: qltan_postdoc_note/

This reproducible R Markdown analysis was created with workflowr (version 1.7.1). The Checks tab describes the reproducibility checks that were applied when the results were created. The Past versions tab lists the development history.

R Markdown file: up-to-date

Great! Since the R Markdown file has been committed to the Git repository, you know the exact version of the code that produced these results.

Environment: empty

Great job! The global environment was empty. Objects defined in the global environment can affect the analysis in your R Markdown file in unknown ways. For reproduciblity it’s best to always run the code in an empty environment.

Seed: set.seed(20250707)

The command set.seed(20250707) was run prior to running the code in the R Markdown file. Setting a seed ensures that any results that rely on randomness, e.g. subsampling or permutations, are reproducible.

Session information: recorded

Great job! Recording the operating system, R version, and package versions is critical for reproducibility.

Cache: none

Nice! There were no cached chunks for this analysis, so you can be confident that you successfully produced the results during this run.

File paths: relative

Great job! Using relative paths to the files within your workflowr project makes it easier to run your code on other machines.

Repository version: 6715fd7

Great! You are using Git for version control. Tracking code development and connecting the code version to the results is critical for reproducibility.

The results in this page were generated with repository version 6715fd7. See the Past versions tab to see a history of the changes made to the R Markdown and HTML files.

Note that you need to be careful to ensure that all relevant files for the analysis have been committed to Git prior to generating the results (you can use wflow_publish or wflow_git_commit). workflowr only checks the R Markdown file, but you know if there are other scripts or data files that it depends on. Below is the status of the Git repository when the results were generated:


Ignored files:
    Ignored:    .Rhistory

Note that any generated files, e.g. HTML, png, CSS, etc., are not included in this status report because it is ok for generated content to have uncommitted changes.

These are the previous versions of the repository in which changes were made to the R Markdown (analysis/2025.Rmd) and HTML (docs/2025.html) files. If you’ve configured a remote Git repository (see ?wflow_git_remote), click on the hyperlinks in the table below to view the files as they were in that past version.

File	Version	Author	Date	Message
Rmd	6715fd7	hmutanqilong	2025-08-20	Update_2025_08_20
html	6715fd7	hmutanqilong	2025-08-20	Update_2025_08_20
html	e871100	hmutanqilong	2025-08-20	Build site.
Rmd	f587a2c	hmutanqilong	2025-08-20	Update_2025_08_20
html	f587a2c	hmutanqilong	2025-08-20	Update_2025_08_20
html	1b21f64	hmutanqilong	2025-08-20	Build site.
html	102dcb4	hmutanqilong	2025-08-20	Update:20250820
html	3dfb98d	hmutanqilong	2025-08-20	Build site.
Rmd	c354798	hmutanqilong	2025-08-20	Update_2025_08_20
html	4167755	hmutanqilong	2025-07-08	Build site.
Rmd	de7deac	hmutanqilong	2025-07-08	Update
html	681c965	hmutanqilong	2025-07-08	Build site.
Rmd	bf2c2a4	hmutanqilong	2025-07-08	Publish finished local site

Core Gene Discovery

1. some term definitions and descriptions

Perturbation: A change to the genome of a cell carried out by the CRISPR-Cas9 system or one of its variants. Common perturbations include CRISPRko (knockout via nuclease-active Cas9), CRISPRi (inactivation via dCas9 tethered to a repressive domain), CRISPRa (activation via dCas9 tethered to an activating domain).
gRNA: An RNA guiding the Cas9 to its target in order to perturb it.
Target: The genomic element targeted by a gRNA, including gene TSS, enhancer. Typically, mulitple gRNAs are designed to perturb one given target.
Targeting gRNA: A gRNA that is intended to perturb a genomic target.
Non-targeting gRNA: A gRNA which maps to a location whose perturbation is known to have no effect on target gene. In general, we treat Non-targeting gRNAs negative controls.
Response: A molecular readout (gene expression) in the single-cell CRISPR screen, which is the response of perturbation of interest.
Multiplicity of infection (MOI): The MOI of a CRSIPR screen can be categorized as low or high and also can also be numerically quantified. A high MOI dataset is the experiment that aims to insert multiple gRNAs into each cell. A low MOI dataset aims to insert a single gRNA into each cell. The MOI reflects the average of gRNAs delivered per cell. if MOI of 10 means that on average each cell was infected by 10 gRNAs.

some descriptions of sceptre:

main task is “Does perturbing a target impact a response?”

Target-response pair: A pair consisting of a target and a response.For example, a target-response pair could be a specific enhancer coupled to a specific gene. Each target-response pair corresponds to a hypothesis to be tested: Does perturbing the target impact the response?
Negative control pair: A target-response pair in which the target consits of one or more non-targeting gRNAs. Negative control pairs are used to assess the calibration of a statistical analysis method.
Positive control pair: A target-response pair in which the target is a genomic element such as enhancer or gene known to have an effect on the response. Positive control pairs are used to assess the power of a statistical analysis method.
Cis pair: A kind of discovery target-response pair in which the target and response are located in close proximity on the same chromosome.
Trans pair: A a kind of discovery target-response pair for which the target and response are not necessarily located in close proximity or on the same chromosome as one another.

July 19

what I did
- exploring some mechanisms of sceptre package
- comparing NTC guides’s results between sceptre and limma
- run trans analysis of Replogle et al K562 KD8 genome-wide perturb-seq (gwps) dataset

1. some mechanisms of sceptre package

1.1 construct cis gRNA_target-response pairs:

Why dose sceptre exclude all pairs containing each positive control gRNA target from the cis pairs when using construct_cis_pairs(sceptre_object, positive_control_pairs = pc_pairs, distance_threshold = 5e6)?

sceptre authors’ thought: a gRNA target used for a positive control pair is a known biological effector. Its purpose is to produce a large and predictable effect. Including this gRNA target’s pairs in the discovery analysis, such as gRNA-other gene (A, B) pairs, violates the purpose of the discovery analysis. The analysis of these additional pairs is no longer a pure exploration of novel cis-regulatory relationships, but an analysis of the pleiotropic or off-target effects of a known, powerfull perturbation.

In statistics, 1) contaminating the Null Distribution: If a positive control gRNA target is included in the analysis, its strong effect on its target gene (and potentially other genes) will compromise the whole resampleing process. The null distribution will be generated by resampling a powerfull biological effector, which can make the variance of null distribution inflated. The simulation of the null hypothesis would be contaminated by a non-null reality. 2) Distorting the FDR: FDR depends on the overall p-value distribution of test. if we include a set of known, extremely significant p-values from positive control targets (whose p-values may be close to zero) in this distribution will be distorted.

1.2 calibration mechanism

In the calibration step, sceptre samples groups of non-targeting gRNAs to calibrate. The idea is to get enough NTC gRNA target-response pairs to check whether the test is well calibrated. It also depends on the number of qc passed discovery pairs (here is cis pairs). It follows this equation:

\[ \text{n_NTC_gRNAs}= \frac{ \text{MULT_FACTOR} \times \text{n_calibration_pairs} } { \text{p_hat} \times \text{n_genes} } \]

We set MULT_FACTOR to 5 as a redundancy factor.

p_hat represents the estimated probability that an NTC gRNA–gene pair is valid, which is 0.96 in our case.

The total number of genes (n_genes) is 7,991.

For n_calibration_pairs, we ranged it from 101,920 (the real number of cis pairs) up to 1,000,000 so that all NTC gRNAs could be covered.

code below is my simulation result:

suppressPackageStartupMessages(library(tidyverse))

unable to deduce timezone name from 'America/Chicago'

suppressPackageStartupMessages(library(scales))

compute_n_grna_groups <- function(n_calibration_pairs,
                                  n_genes = 7991,
                                  p_hat = 0.96,
                                  N_POSSIBLE_GROUPS_THRESHOLD = 100L,
                                  n_possible_groups = 514L,
                                  MULT_FACTOR = 5) {
  raw <- ceiling(MULT_FACTOR * n_calibration_pairs / (p_hat * n_genes))
  lower <- pmax(raw, N_POSSIBLE_GROUPS_THRESHOLD)
  out <- ifelse(lower >= n_possible_groups, n_possible_groups, lower)
  as.integer(out)
}

p_hat <- 0.96
n_genes <- 7991
N_POSSIBLE_GROUPS_THRESHOLD <- 100L
n_possible_groups <- 514L
MULT_FACTOR <- 5

lower <- N_POSSIBLE_GROUPS_THRESHOLD * p_hat * n_genes / MULT_FACTOR
upper <- n_possible_groups * p_hat * n_genes / MULT_FACTOR

x_vals <- seq(101920, 1100000, by = 5000)

dat <- data.frame(n_calibration_pairs = x_vals)
dat$n_grna_groups <- compute_n_grna_groups(
  dat$n_calibration_pairs,
  n_genes = n_genes,
  p_hat = p_hat,
  N_POSSIBLE_GROUPS_THRESHOLD = N_POSSIBLE_GROUPS_THRESHOLD,
  n_possible_groups = n_possible_groups,
  MULT_FACTOR = MULT_FACTOR
)

p <- ggplot(dat, aes(x = n_calibration_pairs, y = n_grna_groups)) +
  geom_line(linewidth = 1) +
  geom_hline(yintercept = N_POSSIBLE_GROUPS_THRESHOLD, linetype = 2) + # 上下限
  geom_hline(yintercept = n_possible_groups, linetype = 2) +
  geom_vline(xintercept = lower, linetype = 3) +
  geom_vline(xintercept = upper, linetype = 3) +
  annotate("text", x = lower, y = N_POSSIBLE_GROUPS_THRESHOLD + 20,
           label = sprintf("lower ≈ %.0f", lower), angle = 90, vjust = -0.5, hjust = 0) +
  annotate("text", x = upper, y = n_possible_groups - 20,
           label = sprintf("upper ≈ %.0f", upper), angle = 90, vjust = 1.1, hjust = 1) +
  labs(
    x = "n_calibration_pairs",
    y = "n_grna_groups",
    title = "p_hat=0.96, n_genes=7991",
  ) +
  scale_y_continuous(breaks = c(100, 200, 300, 400, 514)) +
  scale_x_continuous(labels = label_number()) +
  theme_classic(base_size = 13)

print(p)

Version	Author	Date
3dfb98d	hmutanqilong	2025-08-20

2. Comparison of NTC gRNA alibration between sceptre and limma

We compared calibration results of non-targeting gRNAs between limma and sceptre by aligning shared ntc gRNA–gene pairs (~1M) and visualizing overlap with Venn diagrams, QQ plots, and scatter plots to assess concordance and differences in significance.

In sceptre analysis, we manually set n_calibration_pairs = 1000000, which included all 514 NTC gRNAs in the calibration. On average, each NTC gRNA was paired with 1,946 genes as target–response pairs. We totally found 24 significant ntc gRNA pairs at alpha=0.1.

In the limma analysis, every NTC gRNA was paired with the entire set of genes without any restriction. So we obtained 1000,000 shared ntc gRNA pairs.

knitr::include_graphics("/assets/2025/figure/limma_sceptre_shared_pairs.png", error = FALSE)

venn diagram of shared ntc gRNA pairs

knitr::include_graphics(
  c("../docs/2025/figure/limma_all_pairs_qqplot.png", 
    "../docs/2025/figure/sceptre_all_pairs_qqplot.png"), error = FALSE)

QQ plot stratified by ntc gRNAs

knitr::include_graphics(
  c("../docs/2025/figure/limma_vs_sceptre_shared_pairs_qqplot.png",
    "../docs/2025/figure/limma_vs_sceptre_shared_pairs_scatter_plot.png"), error = FALSE)

shared pairs QQ plot and scatter plot: limma v.s sceptre

3. simulation of batch as continuous or categorial varibale

Design:

We simulated a null setting where gene expression does not depend on batch. We generated counts for 300 batches × 20 cells per batch from Poisson(μ = 10). For each replicate (500 in total) we fit three glm regressons:

y ~ 1 (intercept-only, the true model)
y ~ batch with batch as continuous
y ~ factor(batch) with batch as categorical

From each Poisson fit we took the fitted means and then estimated the NB size parameter . I calculated residual deviance/df, , and RMSE between and true .

Results: Across 50 replicates, all three models had deviance/df = 1 and very large , suggesting that near-Poisson residuals and no material overdispersion. the p value of in numeric model was with an approximately uniform p-value distribution. For mean estimation, the value of intercept-only is similar with the value of numeric model (small RMSE), but factor had a much larger RMSE. I guess it fits a separate mean per batch from small per-batch sample sizes (only 20 cells).

set.seed(1)
NSIM  <- 500
N_BATCH <- 300
CELLS_PER_BATCH <- 20 #每批细胞数量(均衡)
MU_TRUE <- 10 # 各批次真实均值（完全相同）
THETA_CAP <- 1000 # theta上限 sceptre setting is 1000
THETA_LO  <- 1e-2 # theta下限


library(MASS) # theta.md


Attaching package: 'MASS'

The following object is masked from 'package:dplyr':

    select

library(tidyverse)
library(patchwork)


Attaching package: 'patchwork'

The following object is masked from 'package:MASS':

    area

library(scales)

# helper function to estimate theta
theta_est <- function(y, mu, dfr, cap = THETA_CAP, lo = THETA_LO) {
  # th_list <- sceptre:::estimate_theta(
  #   y = y, mu = mu, dfr = dfr,
  #   limit = 50, eps = (.Machine$double.eps)^(1/4)
  # )
  # th <- th_list[[1]]
  # th <- max(min(th, cap), lo)
  # th
  
  th <- tryCatch(MASS::theta.md(y, mu, dfr), error = function(e) NA_real_)
  th <- max(min(th, cap), lo)
  th
}

# one time simulation
# real: 各批次间无差异，即同均值。
# aim: 此时若把batch加入到协变量中作为连续性变量，是否会使poisson 估计的mu有偏？

fit_once_equal_mu <- function(n_batch = N_BATCH, cells_per_batch = CELLS_PER_BATCH, mu_true = MU_TRUE) {
  # 构造均衡批次
  batch <- rep(seq_len(n_batch), each = cells_per_batch)
  
  # 真实情况：所有批次均值完全相同
  y <- rpois(length(batch), lambda = mu_true)
  dat <- data.frame(y = y, batch = batch)
  
  # A) 真模型：截距-only
  m_int  <- glm(y ~ 1, data = dat, family = poisson())
  mu_int <- fitted(m_int)
  th_int <- theta_est(dat$y, mu_int, df.residual(m_int))
  out_int <- tibble(
    model   = "Intercept-only (true)",
    theta   = th_int,
    dev     = deviance(m_int),
    df      = df.residual(m_int),
    dev_df  = dev / df,
    beta    = NA_real_, se_beta = NA_real_, z_beta = NA_real_, p_beta = NA_real_,
    RMSE_mu = sqrt(mean((mu_int - mu_true)^2))
  )
  
  # B) 把 batch 当作连续
  m_num  <- glm(y ~ batch, data = dat, family = poisson())
  mu_num <- fitted(m_num)
  th_num <- theta_est(dat$y, mu_num, df.residual(m_num))
  co <- coef(summary(m_num))
  beta_hat <- unname(co[2, "Estimate"])
  se_hat <- unname(co[2, "Std. Error"])
  z_hat <- unname(co[2, "z value"])
  p_hat <- unname(co[2, "Pr(>|z|)"])
  out_num <- tibble(
    model   = "Numeric (batch as continuous)",
    theta   = th_num,
    dev     = deviance(m_num),
    df      = df.residual(m_num),
    dev_df  = dev / df,
    beta    = beta_hat, se_beta = se_hat, z_beta = z_hat, p_beta = p_hat,
    RMSE_mu = sqrt(mean((mu_num - mu_true)^2))
  )
  
  # C) 把 batch 当作因子 or dummy matrix
  m_fac  <- glm(y ~ factor(batch), data = dat, family = poisson())
  mu_fac <- fitted(m_fac)
  th_fac <- theta_est(dat$y, mu_fac, df.residual(m_fac))
  out_fac <- tibble(
    model = "Factor (batch as categorical)",
    theta = th_fac,
    dev = deviance(m_fac),
    df = df.residual(m_fac),
    dev_df = dev / df,
    beta = NA_real_, se_beta = NA_real_, z_beta = NA_real_, p_beta = NA_real_,
    RMSE_mu = sqrt(mean((mu_fac - mu_true)^2))
  )
  
  bind_rows(out_int, out_num, out_fac)
}

NSIM <- 50
res <- bind_rows(replicate(NSIM, fit_once_equal_mu(), simplify = FALSE)) %>%
  group_by(model) %>% mutate(sim = row_number()) %>% ungroup()

summ <- res %>%
  group_by(model) %>%
  summarise(
    SIM_n = n(),
    mean_dev_df = mean(dev_df, na.rm = TRUE),
    sd_dev_df = sd(dev_df,   na.rm = TRUE),
    median_theta = median(theta, na.rm = TRUE),
    prop_theta_cap = mean(theta >= THETA_CAP, na.rm = TRUE),
    mean_RMSE_mu = mean(RMSE_mu, na.rm = TRUE),
    .groups = "drop"
  )

kableExtra::kable(summ)

model	SIM_n	mean_dev_df	sd_dev_df	median_theta	prop_theta_cap	mean_RMSE_mu
Factor (batch as categorical)	50	1.019100	0.0176873	430.5795	0.24	0.7100287
Intercept-only (true)	50	1.018741	0.0173974	423.0253	0.24	0.0319833
Numeric (batch as continuous)	50	1.018792	0.0174404	436.1709	0.26	0.0480753

trans analysis

library(arrow)

Some features are not enabled in this build of Arrow. Run `arrow_info()` for more information.


Attaching package: 'arrow'

The following object is masked from 'package:utils':

    timestamp

library(duckplyr)

The duckplyr package is configured to fall back to dplyr when it encounters an
incompatibility. Fallback events can be collected and uploaded for analysis to
guide future development. By default, data will be collected but no data will
be uploaded.
ℹ Automatic fallback uploading is not controlled and therefore disabled, see
  `?duckplyr::fallback()`.
✔ Number of reports ready for upload: 8264.
→ Review with `duckplyr::fallback_review()`, upload with
  `duckplyr::fallback_upload()`.
ℹ Configure automatic uploading with `duckplyr::fallback_config()`.

✔ Overwriting dplyr methods with duckplyr methods.
ℹ Turn off with `duckplyr::methods_restore()`.

# db <- open_dataset('/project/xuanyao/qilong/Project/Core_Gene_Programs/Johnathan/sceptre/K562_gwps_all_in_one/sceptre_outputs_trans/trans_results/')
# write_parquet(db, '/project/xuanyao/qilong/Project/Core_Gene_Programs/Johnathan/sceptre/K562_gwps_all_in_one/sceptre_outputs_trans/Reploge_2022_Cell_K562_gwps_KD8_all_trans_results.parquet')

db <- read_parquet_duckdb('/project/xuanyao/qilong/Project/Core_Gene_Programs/Johnathan/sceptre/K562_gwps_all_in_one/sceptre_outputs_trans/Reploge_2022_Cell_K562_gwps_KD8_all_trans_results.parquet')

tmp <- db %>% filter(response_id=='ENSG00000206172' & pass_qc == T & grna_target == 'ENSG00000105610') 

kableExtra::kable(tmp)

response_id	grna_id	grna_target	n_nonzero_trt	n_nonzero_cntrl	pass_qc	p_value	log_2_fold_change
ENSG00000206172	gRNA:ENSG00000105610	ENSG00000105610	22	18518	TRUE	0.0008608	-0.9257564

sessionInfo()

R version 4.1.0 (2021-05-18)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: CentOS Linux 8

Matrix products: default
BLAS/LAPACK: /software/openblas-0.3.13-el8-x86_64/lib/libopenblas_skylakexp-r0.3.13.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] duckplyr_1.1.0       arrow_20.0.0.2       patchwork_1.3.1.9000
 [4] MASS_7.3-60          scales_1.3.0         forcats_0.5.1       
 [7] stringr_1.5.1        dplyr_1.1.4          purrr_1.0.2         
[10] readr_1.4.0          tidyr_1.3.0          tibble_3.2.1        
[13] ggplot2_3.5.1        tidyverse_1.3.1      workflowr_1.7.1     

loaded via a namespace (and not attached):
 [1] httr_1.4.2        sass_0.4.9        bit64_4.6.0-1     jsonlite_1.8.7   
 [5] viridisLite_0.4.2 modelr_0.1.8      bslib_0.7.0       assertthat_0.2.1 
 [9] getPass_0.2-2     highr_0.11        cellranger_1.1.0  yaml_2.2.1       
[13] collections_0.3.8 pillar_1.10.2     backports_1.4.1   glue_1.6.2       
[17] digest_0.6.33     promises_1.2.0.1  rvest_1.0.0       colorspace_2.1-0 
[21] htmltools_0.5.8.1 httpuv_1.6.1      pkgconfig_2.0.3   broom_0.7.8      
[25] haven_2.4.1       processx_3.8.4    svglite_2.0.0     whisker_0.4.1    
[29] later_1.2.0       git2r_0.32.0      generics_0.1.3    farver_2.1.1     
[33] ellipsis_0.3.2    cachem_1.0.8      withr_2.5.2       cli_3.6.1        
[37] magrittr_2.0.3    crayon_1.5.2      readxl_1.3.1      memoise_2.0.1    
[41] evaluate_0.23     ps_1.7.5          fs_1.6.3          xml2_1.3.2       
[45] tools_4.1.0       hms_1.1.0         lifecycle_1.0.4   munsell_0.5.0    
[49] reprex_2.0.0      callr_3.7.6       kableExtra_1.4.0  compiler_4.1.0   
[53] jquerylib_0.1.4   duckdb_1.3.1      systemfonts_1.1.0 rlang_1.1.6      
[57] grid_4.1.0        rstudioapi_0.15.0 labeling_0.4.3    rmarkdown_2.27   
[61] gtable_0.3.6      DBI_1.2.3         R6_2.5.1          lubridate_1.7.10 
[65] knitr_1.48        fastmap_1.1.1     bit_4.0.5         rprojroot_2.0.4  
[69] stringi_1.6.2     Rcpp_1.0.12       vctrs_0.6.4       dbplyr_2.1.1     
[73] tidyselect_1.2.0  xfun_0.45