Fine-mapping that uses functional annotations as priors has been shown to improve the identification of causal variants. This project evaluates the utility of novel annotation features and adopts those that improve fine-mapping results.


GWAS summary statistics

Schizophrenia - Pardinas et al., 2018

  • 40675 cases and 64643 controls
  • CLOZUK sample + PGC sample (independent)
  • 179 independent GWAS significant SNPs mapped to 145 independent loci
  • SNPs were imputed using a combination of the 1KGPp3 and UK 10K datasets.
  • SNPs were filtered by imputation INFO > 0.6 and MAF > 0.01
  • LD-score regression analysis: An LD reference was generated from 1KGPp3 after restricting this dataset to strictly unrelated individuals and retaining only markers with MAF > 0.01.

GWAS QC Procedures

  • The current procedure is based on Alan's finemappeR pipeline.
  • Criteria for filtering GWAS SNPs:
  1. Remove all non-biallelic SNPs
  2. Remove all SNPs with strand-ambiguous alleles (SNPs with A/T or C/G alleles)
  3. Remove SNPs without rs IDs, or with duplicated rs IDs or base pair positions
  4. Remove SNPs not in the reference panels
  5. Remove SNPs whose base pair positions or alleles do not match the reference panels
  6. Remove all SNPs on chromosomes X, Y, and MT

After filtering, around 6 million variants remained.
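
The six filtering criteria above can be sketched in a few lines of Python. This is only an illustration, not the finemappeR implementation: the record layout (`rsid`, `chrom`, `pos`, `a1`, `a2`), the `ref_panel` dictionary and the function name `qc_filter` are all assumptions, and the choice to drop every copy of a duplicated rs ID or position is one possible reading of criterion 3.

```python
from collections import Counter

BASES = {"A", "C", "G", "T"}
AMBIGUOUS = {("A", "T"), ("T", "A"), ("C", "G"), ("G", "C")}
AUTOSOMES = {str(c) for c in range(1, 23)}


def qc_filter(snps, ref_panel):
    """Apply the six QC criteria to a list of SNP records.

    snps: list of dicts with keys rsid, chrom, pos, a1, a2 (hypothetical layout).
    ref_panel: dict mapping rsid -> (chrom, pos, frozenset of the two alleles).
    """
    rsid_counts = Counter(s["rsid"] for s in snps)
    pos_counts = Counter((s["chrom"], s["pos"]) for s in snps)
    kept = []
    for s in snps:
        a1, a2 = s["a1"], s["a2"]
        # 1. keep biallelic SNPs only: two distinct single-base alleles
        if a1 not in BASES or a2 not in BASES or a1 == a2:
            continue
        # 2. drop strand-ambiguous allele pairs (A/T, C/G)
        if (a1, a2) in AMBIGUOUS:
            continue
        # 3. drop missing rs IDs and any duplicated rs ID or position
        #    (assumption: all copies of a duplicate are removed)
        rsid = s["rsid"]
        if not rsid or not rsid.startswith("rs"):
            continue
        if rsid_counts[rsid] > 1 or pos_counts[(s["chrom"], s["pos"])] > 1:
            continue
        # 4. drop SNPs absent from the reference panel
        ref = ref_panel.get(rsid)
        if ref is None:
            continue
        # 5. drop SNPs whose position or alleles disagree with the panel
        if ref[0] != s["chrom"] or ref[1] != s["pos"] or ref[2] != frozenset((a1, a2)):
            continue
        # 6. keep autosomes only (drops X, Y and MT)
        if s["chrom"] not in AUTOSOMES:
            continue
        kept.append(s)
    return kept


# Toy data for illustration only.
ref_panel = {
    "rs1": ("1", 100, frozenset("AG")),
    "rs2": ("1", 200, frozenset("AT")),
    "rs3": ("X", 300, frozenset("CT")),
    "rs4": ("2", 400, frozenset("CT")),
}
snps = [
    {"rsid": "rs1", "chrom": "1", "pos": 100, "a1": "A", "a2": "G"},  # passes
    {"rsid": "rs2", "chrom": "1", "pos": 200, "a1": "A", "a2": "T"},  # ambiguous
    {"rsid": "rs3", "chrom": "X", "pos": 300, "a1": "C", "a2": "T"},  # chromosome X
    {"rsid": "rs4", "chrom": "2", "pos": 401, "a1": "C", "a2": "T"},  # position mismatch
    {"rsid": "rs5", "chrom": "3", "pos": 500, "a1": "C", "a2": "T"},  # not in panel
]
kept = qc_filter(snps, ref_panel)
```

In this toy input only `rs1` survives all six filters.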

Plots for GWAS summary statistics

  1. Sequence constraints:
    • Context-dependent tolerance scores (CDTS), expressed in percentiles
    • A score was computed for each 10 bp bin in the genome.
    • The lower the score, the more intolerant the bin is to variation.


  1. GWAS summary statistics were pre-processed to remove sex chromosomes, indels, and ambiguous and duplicated SNPs.
  2. Currently, genotypes from 1KG European samples are used to compute LD between SNPs.
  3. SNPs in the GWAS summary statistics were matched with the reference panel and assigned to a total of 1687 independent LD blocks.
  4. TORUS was run to perform genome-wide enrichment analyses.
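
Step 3 above, assigning each SNP to an LD block, can be sketched with a binary search over block start positions. This is a minimal sketch under stated assumptions: the block table layout (sorted, non-overlapping, half-open `[start, end)` intervals per chromosome) and the name `assign_blocks` are hypothetical and not taken from the pipeline.

```python
from bisect import bisect_right


def assign_blocks(snps, blocks):
    """Return the LD block id for each SNP, or None if it falls in no block.

    snps: list of (chrom, pos).
    blocks: dict chrom -> list of (start, end, block_id), sorted by start,
            non-overlapping, half-open [start, end) (assumed convention).
    """
    starts = {chrom: [b[0] for b in bl] for chrom, bl in blocks.items()}
    assigned = []
    for chrom, pos in snps:
        chrom_blocks = blocks.get(chrom, [])
        # index of the last block whose start is <= pos
        i = bisect_right(starts.get(chrom, []), pos) - 1
        if i >= 0 and chrom_blocks[i][0] <= pos < chrom_blocks[i][1]:
            assigned.append(chrom_blocks[i][2])
        else:
            assigned.append(None)
    return assigned


# Toy blocks for illustration only.
blocks = {"1": [(0, 1000, 1), (1000, 2500, 2)]}
block_ids = assign_blocks([("1", 500), ("1", 1000), ("1", 2500), ("2", 10)], blocks)
```

SNPs that land outside every block (here position 2500 on chromosome 1 and anything on chromosome 2) get `None` and would need to be dropped before the enrichment step.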


All variants were categorized by whether or not they fall in genomic bins with CDTS at or below the 1st percentile or the 5th percentile.
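
A minimal sketch of this categorization, assuming each variant is mapped to the 10 bp bin that contains it and that bin percentiles are available as a lookup table. The bin coordinate convention (0-based bin starts, 1-based variant positions), the handling of bins without a score, and the names `bin_start` and `annotate_cdts` are assumptions for illustration.

```python
def bin_start(pos, bin_size=10):
    """0-based start of the bin containing a 1-based position (assumed convention)."""
    return ((pos - 1) // bin_size) * bin_size


def annotate_cdts(variants, cdts_percentile, cutoffs=(1.0, 5.0)):
    """Return (chrom, pos, flag_1pct, flag_5pct) for each variant.

    variants: list of (chrom, pos).
    cdts_percentile: dict (chrom, bin_start) -> CDTS percentile of that bin.
    """
    rows = []
    for chrom, pos in variants:
        pct = cdts_percentile.get((chrom, bin_start(pos)))
        # Assumption: bins without a CDTS score are treated as unconstrained.
        flags = tuple(int(pct is not None and pct <= c) for c in cutoffs)
        rows.append((chrom, pos) + flags)
    return rows


# Toy percentiles for illustration only.
cdts_percentile = {("1", 100): 0.5, ("1", 110): 3.0, ("1", 120): 40.0}
variants = [("1", 101), ("1", 110), ("1", 111), ("1", 125), ("1", 500)]
annotated = annotate_cdts(variants, cdts_percentile)
```

Because the 1 percent set is nested in the 5 percent set, any variant flagged at the 1 percent cutoff is also flagged at the 5 percent cutoff.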

Examine the CDTS feature

Check the proportion of variants under high sequence constraint that also have functional annotations in brain

Summary of the percentage of genetic variants at or below the 1st/5th percentile of CDTS that overlap OCRs in brain
          iN_Dopa  iN_GABA  iN_Glut  iPSC   NPC    Any_OCR
CDTS_1%   53.7%    62.6%    53.2%    70.8%  51.9%  76.4%
CDTS_5%   24.4%    27.8%    21.6%    33.9%  20.6%  40.4%

Check the percentage of constrained sequences that overlap open chromatin regions from neurons

  • Overlaps between the two sets of genomic features were identified using bedtools intersect. A constrained sequence was counted as overlapping when at least 20% of it (>= 2 bp) intersected peaks called from ATAC-seq profiles.
Summary of the overlap between constrained sequences and OCRs in brain
          iN_Dopa  iN_GABA  iN_Glut  iPSC   NPC
CDTS_1%   46.9%    55.6%    45.4%    62.2%  44.2%
CDTS_5%   18%      21.3%    17.4%    23.8%  16.9%
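
The 20% overlap rule used for the table above can be written out explicitly. The sketch below is plain Python rather than the bedtools call itself (whose exact options are not given here); it assumes half-open BED-style intervals on a single chromosome, and the function names are hypothetical.

```python
def overlap_bp(a_start, a_end, b_start, b_end):
    """Number of bases shared by two half-open intervals."""
    return max(0, min(a_end, b_end) - max(a_start, b_start))


def bin_overlaps_peaks(bin_iv, peaks, min_frac=0.2):
    """True if a single peak covers at least min_frac of the bin."""
    start, end = bin_iv
    needed = min_frac * (end - start)  # 2 bp for a 10 bp bin
    return any(overlap_bp(start, end, ps, pe) >= needed for ps, pe in peaks)


def percent_overlapping(bins, peaks, min_frac=0.2):
    """Percentage of bins that pass the overlap rule."""
    hits = sum(bin_overlaps_peaks(b, peaks, min_frac) for b in bins)
    return 100.0 * hits / len(bins)


# Toy intervals for illustration only.
bins = [(100, 110), (200, 210)]
peaks = [(108, 150), (209, 300)]
pct = percent_overlapping(bins, peaks)
```

The first bin shares 2 bp with a peak and is counted; the second shares only 1 bp and is not. A linear scan over peaks is fine for a toy example; for genome-wide inputs an interval tool such as bedtools is the practical choice.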

Enrichment analysis for sequence constraints

Version Author Date
76883f1 Jing Gu 2021-04-14

The confidence intervals of the enrichment estimates lie above zero for CDTS and for the positive controls. SNPs associated with SCZ are on average ~9-fold enriched in genomic bins at or below the 5th percentile of CDTS.
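
To relate an estimate with an interval above zero to a fold enrichment, the sketch below exponentiates the estimate and its interval bounds. It assumes the estimates are reported on the natural-log scale, which is what an interval "above zero" suggests; the numbers are hypothetical and are not the values from this analysis.

```python
import math


def to_fold(log_estimate, ci_low, ci_high):
    """Convert a natural-log enrichment estimate and its CI to the fold scale."""
    return tuple(math.exp(v) for v in (log_estimate, ci_low, ci_high))


# Hypothetical numbers for illustration only.
fold, fold_low, fold_high = to_fold(2.2, 1.5, 2.9)

# A log-scale interval above 0 corresponds to a fold-scale interval above 1.
excludes_no_enrichment = fold_low > 1.0
```

On this scale a log estimate of about 2.2 corresponds to roughly 9-fold enrichment.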

Compare with other conservation annotations

  • LINSIGHT
    • predicts how likely noncoding nucleotide sites are to have deleterious fitness consequences and hence to be phenotypically important
    • the genome-wide average of LINSIGHT scores was ~0.07 (range: 0.03-0.99)
    • the estimated mean LINSIGHT score for conserved TFBSs was 0.24, which was used as the cutoff for whether a nucleotide site is conserved
    • 2.5% of GWAS SNPs are above the LINSIGHT threshold.

  • CADD - Combined Annotation-Dependent Depletion
    • provides metrics of deleteriousness
    • PHRED-scaled score [-10 log10(P)]
    • the top 5 percent was chosen as the cutoff, i.e. the top 5% of all possible reference genome SNVs
  • GERP - Genomic Evolutionary Rate Profiling
    • produces position-specific estimates of evolutionary constraint
    • constraint intensity is quantified as a "rejection score" (RS) ranging from -12.3 to 6.17
    • UCSC suggests an RS threshold of 2, which provides high sensitivity while remaining strongly enriched for truly constrained sites
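
The thresholds listed above can be turned into binary annotations as sketched below. The PHRED value for a top 5 percent cutoff follows directly from the formula -10 log10(P); everything else (the field names, the choice of strict versus non-strict inequalities, and the function names) is an assumption for illustration.

```python
import math


def phred_cutoff(top_fraction):
    """PHRED-scaled score that marks the given top fraction: -10 * log10(P)."""
    return -10 * math.log10(top_fraction)


CADD_CUTOFF = phred_cutoff(0.05)  # about 13.01
LINSIGHT_CUTOFF = 0.24            # mean score of conserved TFBSs
GERP_RS_CUTOFF = 2                # threshold suggested by UCSC


def binarize(site):
    """Map raw scores to 0/1 annotations.

    site: dict with hypothetical keys linsight, gerp_rs, cadd_phred.
    Assumption: LINSIGHT and GERP use strict '>' and CADD uses '>='.
    """
    return {
        "LINSIGHT": int(site["linsight"] > LINSIGHT_CUTOFF),
        "GERP": int(site["gerp_rs"] > GERP_RS_CUTOFF),
        "CADD": int(site["cadd_phred"] >= CADD_CUTOFF),
    }


# Toy scores for illustration only.
example = binarize({"linsight": 0.30, "gerp_rs": 1.5, "cadd_phred": 15.0})
```

With these toy scores the site is flagged by LINSIGHT and CADD but not by GERP.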
Summary of the pair-wise correlations between conservation annotations
          CDTS       LINSIGHT   GERP       CADD
CDTS      1.0000000  0.0401113  0.0207126  0.0321172
LINSIGHT  0.0401113  1.0000000  0.4004946  0.5376002
GERP      0.0207126  0.4004946  1.0000000  0.2748316
CADD      0.0321172  0.5376002  0.2748316  1.0000000

The table shows the pairwise correlations between the binary annotations. With the current thresholds, the CDTS annotation (5 percent cutoff) is essentially uncorrelated with the other binary annotations (r of about 0.02 to 0.04). In contrast, LINSIGHT, GERP and CADD are moderately correlated with one another (r of about 0.27 to 0.54).
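
For reference, a correlation between two binary annotations can be computed as an ordinary Pearson correlation on the 0/1 vectors (the phi coefficient). The sketch below assumes that this is the kind of correlation reported in the table, which is not stated explicitly; the vectors are toy data.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences.

    For 0/1 vectors this is the phi coefficient.
    """
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


# Toy binary annotations over four variants, for illustration only.
annot_a = [1, 1, 0, 0]
annot_b = [1, 0, 1, 0]
annot_c = [1, 1, 1, 0]

r_ab = pearson(annot_a, annot_b)
r_ac = pearson(annot_a, annot_c)
```

The function is undefined when an annotation is constant across all variants (zero variance), so such annotations would need to be excluded first.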

Joint TORUS enrichment analysis over conservation-related annotations

Version Author Date
e597d4e Jing Gu 2021-04-21

With the other conservation annotations included as predictors in the model, CDTS at the 5th-percentile cutoff still shows around 8-fold enrichment.

