Last updated: 2021-10-11
Checks: 6 1
Knit directory: funcFinemapping/
This reproducible R Markdown analysis was created with workflowr (version 1.6.2). The Checks tab describes the reproducibility checks that were applied when the results were created. The Past versions tab lists the development history.
Great! Since the R Markdown file has been committed to the Git repository, you know the exact version of the code that produced these results.
Great job! The global environment was empty. Objects defined in the global environment can affect the analysis in your R Markdown file in unknown ways. For reproduciblity it's best to always run the code in an empty environment.
The command set.seed(20210404)
was run prior to running the code in the R Markdown file. Setting a seed ensures that any results that rely on randomness, e.g. subsampling or permutations, are reproducible.
Great job! Recording the operating system, R version, and package versions is critical for reproducibility.
Nice! There were no cached chunks for this analysis, so you can be confident that you successfully produced the results during this run.
Using absolute paths to the files within your workflowr project makes it difficult for you and others to run your code on a different machine. Change the absolute path(s) below to the suggested relative path(s) to make your code more reproducible.
absolute | relative |
---|---|
~/projects/funcFinemapping/output/scz_lipid_joint_enrichment.png | output/scz_lipid_joint_enrichment.png |
~/projects/funcFinemapping/output/scz_bmi_joint_enrichment.png | output/scz_bmi_joint_enrichment.png |
~/projects/funcFinemapping/output/scz_immune_joint_enrichment.png | output/scz_immune_joint_enrichment.png |
Great! You are using Git for version control. Tracking code development and connecting the code version to the results is critical for reproducibility.
The results in this page were generated with repository version 232f57c. See the Past versions tab to see a history of the changes made to the R Markdown and HTML files.
Note that you need to be careful to ensure that all relevant files for the analysis have been committed to Git prior to generating the results (you can use wflow_publish
or wflow_git_commit
). workflowr only checks the R Markdown file, but you know if there are other scripts or data files that it depends on. Below is the status of the Git repository when the results were generated:
Ignored files:
Ignored: .ipynb_checkpoints/
Ignored: analysis/ldsc_results.nb.html
Ignored: analysis/results.nb.html
Ignored: analysis/snp_finemapping_results.nb.html
Ignored: analysis/splicing.nb.html
Untracked files:
Untracked: SNPs_categories,png
Untracked: SNPs_categories.png
Untracked: analysis/asthma_prelim_results.Rmd
Untracked: analysis/asthma_results_cp.Rmd
Untracked: analysis/enhancer_gene_feature.Rmd
Untracked: analysis/feedback.Rmd
Untracked: analysis/gene_finemapping_results.Rmd
Untracked: analysis/learn_susie.Rmd
Untracked: analysis/notes.Rmd
Untracked: analysis/snp_finemapping_results.Rmd
Untracked: analysis/splicing.Rmd
Untracked: code/.ipynb_checkpoints/
Untracked: code/ldsc_regression.sh
Untracked: code/make_plots.R
Untracked: code/run_ldsc.sh
Untracked: code/run_ldsc_with_bed.sh
Untracked: code/run_torus.sh
Untracked: code/split_vcf.sh
Untracked: data/num_overlaps_finemapped_SNPs_and_ctcf.txt
Untracked: data/scz_2018
Untracked: data/torus_enrichment_novel_annot.est
Untracked: data/torus_joint_enrichment.est
Untracked: data/torus_joint_refined_enrichment.est
Untracked: enhancer_gene_feature.rmd
Untracked: fig1_panels.pdf
Untracked: fig2.pdf
Untracked: fig_panel2.pdf
Untracked: gene_mapping.pdf
Untracked: output/background_SNPs_annotated_percent.txt
Untracked: panel_figure2.pdf
Untracked: test.txt
Unstaged changes:
Modified: analysis/index.Rmd
Note that any generated files, e.g. HTML, png, CSS, etc., are not included in this status report because it is ok for generated content to have uncommitted changes.
These are the previous versions of the repository in which changes were made to the R Markdown (analysis/enrichment_analysis.Rmd
) and HTML (docs/enrichment_analysis.html
) files. If you've configured a remote Git repository (see ?wflow_git_remote
), click on the hyperlinks in the table below to view the files as they were in that past version.
File | Version | Author | Date | Message |
---|---|---|---|---|
Rmd | 232f57c | Jing Gu | 2021-10-11 | MTSplice on SCZ GWAS |
html | 5318fb4 | Jing Gu | 2021-10-09 | Build site. |
Rmd | 68fc455 | Jing Gu | 2021-10-09 | test MTSplice |
html | 91a2b39 | Jing Gu | 2021-10-09 | Build site. |
Rmd | ab86da7 | Jing Gu | 2021-10-09 | test MTSplice |
html | af50415 | Jing Gu | 2021-10-06 | Build site. |
Rmd | 670affb | Jing Gu | 2021-10-06 | workflowr::wflow_publish(c("analysis/enrichment_analysis.Rmd", |
html | 2a20236 | Jing Gu | 2021-09-23 | Build site. |
Rmd | 356fe37 | Jing Gu | 2021-09-23 | computed percent of random SNPs above spliceAI thresholds |
html | 639aacf | Jing Gu | 2021-09-08 | Build site. |
Rmd | a9e1278 | Jing Gu | 2021-09-08 | run spliceai against baselines |
html | c046e9e | Jing Gu | 2021-09-08 | Build site. |
html | bc419c1 | Jing Gu | 2021-09-08 | Build site. |
Rmd | 76ea7d1 | Jing Gu | 2021-09-08 | run spliceai against baselines |
html | dec3c68 | Jing Gu | 2021-09-01 | Build site. |
Rmd | eaa3eee | Jing Gu | 2021-09-01 | compute standardized effect sizes |
html | 0032b8c | Jing Gu | 2021-09-01 | Build site. |
Rmd | 4db9eb3 | Jing Gu | 2021-09-01 | compute standardized effect sizes |
html | 05f73e9 | Jing Gu | 2021-09-01 | Build site. |
Rmd | bed838e | Jing Gu | 2021-09-01 | update evaluations on slicing |
Fine-mapping with functional annotations as priors has shown improved results in identifying causal variants. This project is to evaluate the utility of novel annotation features and adopt ones that can improve fine mapping results.
GWAS summary statistics
Schizopherenia - Pardinas et al., 2018
GWAS QC Procedures
After filtering, there are around 6 million variants remained.
context-dependent tolerance scores
QCs for brain OCR ATAC-seq peaks
There are very few peaks with size larger than 5kb across different brain ATAC-Seq profiles. The removal of large peaks did not improve the enrichment results.
Overlaps between two sets of genomic features were identified using bedtools intersect
. The constrained sequences were counted to be overlapped when at least 20%(>=2 bp) intersect with peaks called from ATAC-Seq profiles. No reciprocal option used here due to the much shorter length of CDTS sequence.
b_cell | iN_Dopa | iN_GABA | iN_Glut | iPSC | NPC | t_cell | |
---|---|---|---|---|---|---|---|
CDTS_1% | 11.3% | 46.9% | 55.6% | 45.4% | 62.2% | 44.2% | 9.4% |
CDTS_5% | 3.5% | 19% | 21.9% | 15.7% | 25.9% | 14.9% | 2.5% |
CDTS_50% | 0.4% | 5.8% | 6.6% | 3.4% | 7.8% | 2.9% | 0.1% |
Example of top 1% CDTS sequences and genes nearby
Conserved sequences found in the noncoding regions nearby gene YPEL4, which is a member of the highly conserved YPEL gene family.
Summary-table1:
Overall, CDTS at top percentiles are more enriched in open chromatin regions from brain than immune cells. CDTS at 50% threshold can be regarded as a negative control, which implies that less constrained sequence bins have much less overlapping with OCRs in neurons.
Genomic regions were extracted from gencode and processed based on the post from Dave Tang (https://davetang.org/muse/2013/01/18/defining-genomic-regions/). The non-coding CDTS at top 5% were obtained by combining CDTS in top 5% bins completely overlapping both intergenic and intron regions. In total, 91% of original CDTS at top 5% bins are located in the non-coding regions. The intersection of CDTS bins and genomic regions were obtained using the following code: bedtools intersect -a in.bed -b *.regions.bed -wa -wb -f 1
cds | exon | gene | intergenic | introns | UTR | |
---|---|---|---|---|---|---|
CDTS_1% | 16.6% | 38.2% | 79.6% | 19.7% | 61.8% | 15.7% |
CDTS_5% | 7.7% | 18.4% | 75.7% | 23.7% | 67.4% | 7.3% |
cds | exon | gene | intergenic | introns | UTR | |
---|---|---|---|---|---|---|
CDTS5%_iN-Glut | 7.3% | 30.9% | 76.3% | 23.7% | 68.2% | 16.5% |
CDTS5%_iPSC | 7.4% | 26.4% | 75% | 25% | 66.8% | 12.7% |
CDTS5%_B-cell | 7.5% | 36.1% | 74.1% | 25.9% | 65.7% | 20.6% |
Summary-table2
1. Overall, CDTS top percentile bins are mainly located in intron regions, which is consistent with the results in the paper.
2. Around 20% of CDTS bins overlap with exons or intergenic regions.
3. We may look into other annotations such as promoter and enhancers.
Summary-table3
1. The compositions of the CDTS-OCR overlaps in terms of genomic regions are quite consistent across cell types.
2. Generally, there are twice amount of CDTS-OCR overlaps resided in introns than in exons.
Enrichment analysis
Version | Author | Date |
---|---|---|
05f73e9 | Jing Gu | 2021-09-01 |
The enrichment estimate has a confience level above zero for CDTS and positive controls. This shows SNPs associated with SCZ are on average ~ 9 fold enriched in genomic bins with up to 5 percentile of CDTS.
Compare with intra-species constraints
top5%_CDTS top5%_CADD
top5%_CDTS 1.000 0.108
top5%_CADD 0.154 1.000
top10%_CDTS top10%_LINSIGHT top10%_GERP
top10%_CDTS 1.00 0.01 0.087
top10%_LINSIGHT 0.12 1.00 NA
top10%_GERP 0.12 NA 1.000
Summary:
Overall, there is a maximum of 15.4% overlapping between the bases within the top CDTS bins and those ranked among top annotation scores.
Only 1% of bases from top 10% CDTS bins have LINSIGHT scores above 10%, probably due to many missing predictions in LINSIGHT.
The correlation table shows the pair-wise correlations between ranks for the conservation scores on GWAS SNPs across all the annotations. CDTS seems to be uncorrelated to other conservation scores. CADD has a moderate correlation separately with LINSIGHT and GERP. A low correlation observed between GERP and LINSIGHT.
joint enrichment analysis over conservation-related annotations
Version | Author | Date |
---|---|---|
05f73e9 | Jing Gu | 2021-09-01 |
With other conservation annotations as predictors in the model, we can see CDTS within top 5 percentile still shows around 8 fold enrichment.
contact maps generated with ref/alt at rs339331
Summary statistics - SNP contact difference scores (SCD)
Only two SNPs are within 95% credible sets
GWAS loci: * 11 24367339 24412992 rs1899543 * 3 17221017 17888256 rs11409090
By overlapping SNPs with a cumulative 95% of posterior probablity and CTCF peaks, I found only 57 out of 6104 the SNPs are within CTCF peaks. The amount of overlapped SNPs increases to 71 if 100bp flanking region incluode or 133 if 500bp flanking region included. flank ctcf_pred ctcfl_pred ctcf_peaks
ctcf_pred | ctcfl_pred | ctcf_peaks | |
---|---|---|---|
flank0 | 39 | 113 | 57 |
flank10 | 72 | 271 | 59 |
flank100 | 361 | 1243 | 71 |
flank500 | 1349 | 3359 | 133 |
Joint enrichment for CTCF-related annotations
Version | Author | Date |
---|---|---|
05f73e9 | Jing Gu | 2021-09-01 |
procedure
bedtools merge -i merged.bed
to merge overlapped regions in the concatenated bed file
Version | Author | Date |
---|---|---|
05f73e9 | Jing Gu | 2021-09-01 |
m6A regions across tissues
No differences in enrichment of SCZ risk variants across tissues
Joint enrichment analysis
Version | Author | Date |
---|---|---|
05f73e9 | Jing Gu | 2021-09-01 |
SpliceAI is a deep CNN method with a ResNets architecture. It provides a general prediction of splicing with precision from an arbitrary pre-mRNA sequence. For each nucleotide in the pre-mRNA transcript, SpliceAI examines 5kb away from it to predict the probability of this nucleotide being acceptor, donor or neither.
Binary annotations at different cutoffs
SCZ
*The percentage of random SNPs above each spliceAI threshold was listed in the parenthesis.
Across traits and cutoffs
For traits such as aFib, allergy, ibdCD, we observed that variants with predicted scores at a cutoff of 0.1 are more enriched with GWAS-associated SNPs. However, other traits require the predicted scores to be as low as 0.05 to show enrichment with GWAS-associated SNPs. Sample size may not be the main factor for the enrichment based on the results from BMI and height. It is interesting to see the enrichment of splicing variants in immune related disorders.
Joint enrichment for spliceAI scores at 0.01 threshold
Version | Author | Date |
---|---|---|
bc419c1 | Jing Gu | 2021-09-08 |
Version | Author | Date |
---|---|---|
bc419c1 | Jing Gu | 2021-09-08 |
Background: From literature, only a limited fraction between 7 to 21% of splicing QTLs are tissue-specific (Meme et al.2015). The quantity for measuring variant effect on splicing is the inclusion level of the exon, denoted by percent spliced in (PSI \(\Psi\)). It was defined as the fratcion of the number of overlapped reads (IR) out of total reads, which are the sum of overlapped reads (IR) and split reads (ER) that are aligned to multiple exons due to alternative splicing. \[ \Psi = \frac{IR}{IR+ES} \] When tissue-specific splicing was predicted using TSplice, the author found out less than 3.4% tested pairs of tissue-specific splicing scores deviated by 20% from the tissue-averaged splicing scores. In terms of the performance, TSplice's predictions is positively correlated with measurements (mean \(\rho=0.22\)). By combining the models of tissue-specific and tissue-indenpendent effects on splicing (called as MTSplice), they were able to achieve significant improvement, especially in 10 out of 12 brain tissues. However, the improvement in relevant mean square error was modest.
The main score predicted by the tissue-independent model(MMSplice) is delta_logit_psi (\(\Delta\text{logit}(\Psi)\)), which denotes the difference in the the inclusion level of the exon between ref and alt alleles. The score is on a logit scale. The \(\Delta\text{logit}(\Psi)\) score for each tissue predicted by the combined model (MTSplice) are included in the same output table. As multiple exons were examined for each variant, we retain the highest absolute value of delta_logit_psi across the variant-exon pairs. Lastly, it is recommended that effect of a variant with \(|\Delta\text{logit}(\Psi)|\) >=2 is considered to be strong.
MMSplice - not tissue-specific
This test run was based on splicing annotations generated using a tissue-independent model. AFib's Risk variants were significantly enriched in \(|\Delta\text{logit}(\Psi)|\) >= 0.6, which indicates risk variants tend to exert small effects on splicing. Across the cutoffs with higher values, AFib's risk variants were depleted in splicing annotations.
cutoff(absolute value) num_gwasVar_in_annot num_randomVar_in_annot
1 0.2 42169 51611
2 0.6 8643 10906
3 1.0 3841 4888
4 1.4 2444 3130
5 1.8 1751 2276
prop_randomVar_in_annot
1 0.0062
2 0.0013
3 0.0006
4 0.0004
5 0.0003
Tissue: Atrial Appendage-Heart
MTSplice - tissue-specific + tissue-independent
Very similar to tissue-independent predictions
MMSplice - not tissue-specific This test run was based on splicing annotations generated using a tissue-independent model. For SCZ, we observed the enrichment estimates take positive values and their standard deviations were smaller compared to AFib's risk variants.
Number of Variants in tisse-specific splicing annotations We counted the number of variants with the predicted splicing scores greater than each cutoff across multiple brain tissues. We observed greater variations across tissues at a lower cutoff.
Warning in melt(tbl[-1, ], "cutoff"): The melt generic in data.table has been
passed a data.frame and will attempt to redirect to the relevant reshape2
method; please note that reshape2 is deprecated, and this redirection is now
deprecated as well. To continue using melt methods from reshape2 while both
libraries are attached, e.g. melt.list, you can prepend the namespace like
reshape2::melt(tbl[-1, ]). In the next version, this warning will become an
error.
Enrichment of SCZ risk variants in brain-specific annotations Though the tissue-indenpendent predictions of variants effects on splicing show no significant enrichment with risk variants, we see different levels of enrichments on the combined predictions across multiple brain tissues as well as whole blood.
df<-read.table("output/splicing/torus_enrichment_scz_mtsplice_brain.est", header=F)
colnames(df)<-c("raw_term", "estimate", "low", "high")
df["label"]<-unlist(lapply(df$raw_term, function(i){strsplit(i, "_")[[1]][1]}))
df["counts"]<-c(as.numeric(tbl[tbl$cutoff=="0.6", 1:(dim(tbl)[2]-1)]), 10884)
df["term"]<-sprintf("%s(%s)", df$label, df$counts)
snp_enrichment_plot(df)
Schizophrenia
Procedures
Schizophrenia By performing S-LDSC analysis, we see DNA methylation and CDTS annotations shows 6-8 fold enrichment with SCZ variants conditional on all other annotations.
Across traits
Notes
If the lower end of 95% confidence interval of an estimate is negative, it will be assigned as 0.05 (-4.3 as log2 based). Overall, we see Torus and S-LDSC generates similar enrichment estimates, except for ibd_UC trait. There seems to be some discrepancy in the estimates, which needs a closer look. We so see the performance of annotations varies across traits.
SCZ variants are not enriched in protein coding regions, while variants associated with all other traits are enriched in protein coding regions.
# gwas<-readRDS("~/cluster/data/gwas/2_Filtered/scz_2018.rds")
# head(gwas)
# intergenic<-read.table("/home/jinggu/resources/genomes/intergenic_hg19_sorted.bed", header = F)
# snpRanges <- make_ranges(paste0("chr", gwas$chr), gwas$pos, gwas$pos)
# snpRanges <- plyranges::mutate(snpRanges, snp=gwas$snp)
# bedRanges<-make_ranges(intergenic$V1, intergenic$V2, intergenic$V3)
# overlaps<-IRanges::subsetByOverlaps(snpRanges, bedRanges)
# subgwas<-subset(gwas, subset=gwas$snp %in% overlaps$snp)
sessionInfo()
R version 4.0.4 (2021-02-15)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Scientific Linux 7.4 (Nitrogen)
Matrix products: default
BLAS/LAPACK: /software/openblas-0.3.13-el7-x86_64/lib/libopenblas_haswellp-r0.3.13.so
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] ggpubr_0.4.0 gridExtra_2.3 ggplot2_3.3.3 knitr_1.31
[5] data.table_1.14.0 workflowr_1.6.2
loaded via a namespace (and not attached):
[1] Rcpp_1.0.7 tidyr_1.1.2 assertthat_0.2.1 rprojroot_2.0.2
[5] digest_0.6.27 utf8_1.2.2 plyr_1.8.6 R6_2.5.1
[9] cellranger_1.1.0 backports_1.2.1 evaluate_0.14 highr_0.8
[13] pillar_1.5.0 rlang_0.4.11 curl_4.3 readxl_1.3.1
[17] car_3.0-10 whisker_0.4 jquerylib_0.1.3 rmarkdown_2.7
[21] labeling_0.4.2 stringr_1.4.0 foreign_0.8-81 munsell_0.5.0
[25] broom_0.7.5 compiler_4.0.4 httpuv_1.5.5 xfun_0.21
[29] pkgconfig_2.0.3 htmltools_0.5.1.1 tidyselect_1.1.1 tibble_3.0.6
[33] rio_0.5.16 fansi_0.5.0 crayon_1.4.1 dplyr_1.0.4
[37] withr_2.4.2 later_1.1.0.1 grid_4.0.4 jsonlite_1.7.2
[41] gtable_0.3.0 lifecycle_1.0.0 DBI_1.1.1 git2r_0.28.0
[45] magrittr_2.0.1 scales_1.1.1 zip_2.1.1 stringi_1.5.3
[49] carData_3.0-4 reshape2_1.4.4 farver_2.1.0 ggsignif_0.6.1
[53] fs_1.5.0 promises_1.2.0.1 bslib_0.2.4 ellipsis_0.3.2
[57] generics_0.1.0 vctrs_0.3.8 cowplot_1.1.1 openxlsx_4.2.3
[61] tools_4.0.4 forcats_0.5.1 glue_1.4.2 purrr_0.3.4
[65] hms_1.0.0 abind_1.4-5 yaml_2.2.1 colorspace_2.0-2
[69] rstatix_0.7.0 haven_2.3.1 sass_0.3.1