{ "cells": [ { "cell_type": "code", "execution_count": 1, "id": "74846eb0", "metadata": { "execution": { "iopub.execute_input": "2021-09-23T15:04:50.457724Z", "iopub.status.busy": "2021-09-23T15:04:50.456036Z", "iopub.status.idle": "2021-09-23T15:04:50.981763Z", "shell.execute_reply": "2021-09-23T15:04:50.980817Z" }, "tags": [ "snakemake-job-properties" ] }, "outputs": [], "source": [ "\n", "######## snakemake preamble start (automatically inserted, do not edit) ########\n", "import sys; sys.path.extend(['/project2/gilad/bjf79_project1/envs/main/lib/python3.9/site-packages', '/project2/yangili1/bjf79/ChromatinSplicingQTLs/code/notebooks']); import pickle; snakemake = pickle.loads(b'\\x80\\x04\\x95\\x9f\\x05\\x00\\x00\\x00\\x00\\x00\\x00\\x8c\\x10snakemake.script\\x94\\x8c\\tSnakemake\\x94\\x93\\x94)\\x81\\x94}\\x94(\\x8c\\x05input\\x94\\x8c\\x0csnakemake.io\\x94\\x8c\\nInputFiles\\x94\\x93\\x94)\\x81\\x94(\\x8c=QTLs/QTLTools/polyA.Splicing/PermutationPass.FDR_Added.txt.gz\\x94\\x8c?QTLs/QTLTools/polyA.Splicing/OnlyFirstReps.sorted.qqnorm.bed.gz\\x94\\x8c>QTLs/QTLTools/Expression.Splicing/Genotypes/WholeGenome.vcf.gz\\x94e}\\x94(\\x8c\\x06_names\\x94}\\x94\\x8c\\x12_allowed_overrides\\x94]\\x94(\\x8c\\x05index\\x94\\x8c\\x04sort\\x94eh\\x12\\x8c\\tfunctools\\x94\\x8c\\x07partial\\x94\\x93\\x94h\\x06\\x8c\\x19Namedlist._used_attribute\\x94\\x93\\x94\\x85\\x94R\\x94(h\\x18)}\\x94\\x8c\\x05_name\\x94h\\x12sNt\\x94bh\\x13h\\x16h\\x18\\x85\\x94R\\x94(h\\x18)}\\x94h\\x1ch\\x13sNt\\x94bub\\x8c\\x06output\\x94h\\x06\\x8c\\x0bOutputFiles\\x94\\x93\\x94)\\x81\\x94\\x8cOQTLs/QTLTools/polyA.Splicing/PermutationPass.FDR_Added.SS_SNPs.Annotated.txt.gz\\x94a}\\x94(h\\x0e}\\x94h\\x10]\\x94(h\\x12h\\x13eh\\x12h\\x16h\\x18\\x85\\x94R\\x94(h\\x18)}\\x94h\\x1ch\\x12sNt\\x94bh\\x13h\\x16h\\x18\\x85\\x94R\\x94(h\\x18)}\\x94h\\x1ch\\x13sNt\\x94bub\\x8c\\x06params\\x94h\\x06\\x8c\\x06Params\\x94\\x93\\x94)\\x81\\x94}\\x94(h\\x0e}\\x94h\\x10]\\x94(h\\x12h\\x13eh\\x12h\\x16h\\x18\\x85\\x94R\\x94(h\\x18)}\\x94h\\x1ch\\x12sNt\\x94bh\\x13h\\x16h\\x18\\x85\\x94R\\x94(h\\x18)}\\x94h\\x1ch\\x13sNt\\x94bub\\x8c\\twildcards\\x94h\\x06\\x8c\\tWildcards\\x94\\x93\\x94)\\x81\\x94}\\x94(h\\x0e}\\x94h\\x10]\\x94(h\\x12h\\x13eh\\x12h\\x16h\\x18\\x85\\x94R\\x94(h\\x18)}\\x94h\\x1ch\\x12sNt\\x94bh\\x13h\\x16h\\x18\\x85\\x94R\\x94(h\\x18)}\\x94h\\x1ch\\x13sNt\\x94bub\\x8c\\x07threads\\x94K\\x01\\x8c\\tresources\\x94h\\x06\\x8c\\tResources\\x94\\x93\\x94)\\x81\\x94(K\\x01K\\x01e}\\x94(h\\x0e}\\x94(\\x8c\\x06_cores\\x94K\\x00N\\x86\\x94\\x8c\\x06_nodes\\x94K\\x01N\\x86\\x94uh\\x10]\\x94(h\\x12h\\x13eh\\x12h\\x16h\\x18\\x85\\x94R\\x94(h\\x18)}\\x94h\\x1ch\\x12sNt\\x94bh\\x13h\\x16h\\x18\\x85\\x94R\\x94(h\\x18)}\\x94h\\x1ch\\x13sNt\\x94bhWK\\x01hYK\\x01ub\\x8c\\x03log\\x94h\\x06\\x8c\\x03Log\\x94\\x93\\x94)\\x81\\x94\\x8c4../docs/20210921_CountSpliceSiteSNPsInSQTLs.py.ipynb\\x94a}\\x94(h\\x0e}\\x94\\x8c\\x08notebook\\x94K\\x00N\\x86\\x94sh\\x10]\\x94(h\\x12h\\x13eh\\x12h\\x16h\\x18\\x85\\x94R\\x94(h\\x18)}\\x94h\\x1ch\\x12sNt\\x94bh\\x13h\\x16h\\x18\\x85\\x94R\\x94(h\\x18)}\\x94h\\x1ch\\x13sNt\\x94bhkhhub\\x8c\\x06config\\x94}\\x94(\\x8c\\x07scratch\\x94\\x8c\\x17/scratch/midway2/bjf79/\\x94\\x8c\\x07samples\\x94\\x8c\\x0bsamples.tsv\\x94\\x8c\\naspera_key\\x94\\x8c8/home/bjf79/.aspera/connect/etc/asperaweb_id_dsa.openssh\\x94\\x8c\\x0bsQTL_chunks\\x94M\\xe8\\x03u\\x8c\\x04rule\\x94\\x8c\\x1fCount_sQTLs_with_SpliceSiteSNPs\\x94\\x8c\\x0fbench_iteration\\x94N\\x8c\\tscriptdir\\x94\\x8c=/project2/yangili1/bjf79/ChromatinSplicingQTLs/code/notebooks\\x94ub.'); from snakemake.logging import logger; logger.printshellcmds = True; import os; os.chdir('/project2/yangili1/bjf79/ChromatinSplicingQTLs/code');\n", "######## snakemake preamble end #########\n" ] }, { "cell_type": "markdown", "id": "abandoned-failing", "metadata": {}, "source": [ "# Intro\n", "\n", "In this notebook, I was planning to better understand how many spliceQTLs are due to SNPs in core splice sites (5'ss or 3'ss). As a simple first pass at this, I will look at the QTLtools permutation pass output in which I performed permutation pass on leafcutter intron phenotypes, using the smallest P-value amongst any intron within a cluster as the test statistic. Therefore, this QTLtools output has a single nominal P value (and single FDR corrected Qvalue) for each cluster of introns. So the way I plan to go about this, is to check for each cluster, check if any of the introns in the cluster overlap splice sites, then look for a pattern between the cluster P-value and whether any of the cluster's introns contain a splice site.\n", "\n", "Now onto the code..." ] }, { "cell_type": "markdown", "id": "quality-europe", "metadata": {}, "source": [ "## import libs\n" ] }, { "cell_type": "code", "execution_count": 2, "id": "5285cdf6", "metadata": { "execution": { "iopub.execute_input": "2021-09-23T15:04:50.988508Z", "iopub.status.busy": "2021-09-23T15:04:50.987589Z", "iopub.status.idle": "2021-09-23T15:04:50.990411Z", "shell.execute_reply": "2021-09-23T15:04:50.989646Z" } }, "outputs": [], "source": [ "import os" ] }, { "cell_type": "markdown", "id": "2a55b3f5", "metadata": {}, "source": [ "I think jupyter notebook will open a notebook and change to the parent directory of the notebook while you are editing. But I want this notebook to execute as if it will be executed in a snakemake, so while I am editing in jupyter notebook, I will change to that snakemake's parent directory, and be sure to comment out that change directory command" ] }, { "cell_type": "code", "execution_count": 3, "id": "a1f691bc", "metadata": { "execution": { "iopub.execute_input": "2021-09-23T15:04:50.995482Z", "iopub.status.busy": "2021-09-23T15:04:50.994710Z", "iopub.status.idle": "2021-09-23T15:05:21.963809Z", "shell.execute_reply": "2021-09-23T15:05:21.962939Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "/project2/yangili1/bjf79/ChromatinSplicingQTLs/code\n" ] } ], "source": [ "# os.chdir(\"../\")\n", "print(os.getcwd())\n", "\n", "import pandas as pd\n", "import gzip\n", "import vcf\n", "from plotnine import *" ] }, { "cell_type": "markdown", "id": "swedish-bottle", "metadata": {}, "source": [ "## helper functions" ] }, { "cell_type": "code", "execution_count": 4, "id": "hungarian-meeting", "metadata": { "execution": { "iopub.execute_input": "2021-09-23T15:05:21.975427Z", "iopub.status.busy": "2021-09-23T15:05:21.974311Z", "iopub.status.idle": "2021-09-23T15:05:21.977752Z", "shell.execute_reply": "2021-09-23T15:05:21.976819Z" } }, "outputs": [], "source": [ "def count_iterable(i):\n", " return sum(1 for e in i)\n", "\n", "def Count_SNPs_Around_Pos(vcf_reader, chrom, pos, upstream_bp, downstream_bp, strand=\"+\"):\n", " \"\"\"\n", " for a vcf_reader object, return an integer of the number of variants that intersect a window\n", " defined by a upstream_bp (normally negative), and a downstream_bp (normally positive),\n", " relative to a pos on a chrom. Reverse upstream_bp and downstream_bp if strand not +\n", " \"\"\"\n", " if strand==\"+\":\n", " return count_iterable(vcf_reader.fetch(chrom, pos+upstream_bp, pos+downstream_bp))\n", " else:\n", " return count_iterable(vcf_reader.fetch(chrom, pos-downstream_bp, pos-upstream_bp))\n", "\n", "def PickArg2IfStrandNegative(arg1, arg2, strand):\n", " if strand == \"+\":\n", " return arg1\n", " else:\n", " return arg2" ] }, { "cell_type": "markdown", "id": "fifth-figure", "metadata": {}, "source": [ "## read clusters\n", "\n", "Read introns from QTLtools bedgz input, which has intron coordinates and cluster ID for each intron. Then read in QTLtools permutation test output (permutation tested grouped by cluster) to get Q value for each cluster being a QTL." ] }, { "cell_type": "code", "execution_count": 5, "id": "hawaiian-hollywood", "metadata": { "execution": { "iopub.execute_input": "2021-09-23T15:05:21.985664Z", "iopub.status.busy": "2021-09-23T15:05:21.985001Z", "iopub.status.idle": "2021-09-23T15:05:36.206183Z", "shell.execute_reply": "2021-09-23T15:05:36.207112Z" } }, "outputs": [ { "data": { "text/html": [ "
\n", " | #Chr | \n", "start | \n", "end | \n", "pid | \n", "gid | \n", "grp_id | \n", "q | \n", "
---|---|---|---|---|---|---|---|
0 | \n", "chr1 | \n", "15947 | \n", "16607 | \n", "1:15947:16607:clu_1_- | \n", "chr1_clu_1_- | \n", "chr1_clu_1_- | \n", "0.51821 | \n", "
1 | \n", "chr1 | \n", "16310 | \n", "16607 | \n", "1:16310:16607:clu_1_- | \n", "chr1_clu_1_- | \n", "chr1_clu_1_- | \n", "0.51821 | \n", "
2 | \n", "chr1 | \n", "17055 | \n", "17915 | \n", "1:17055:17915:clu_2_- | \n", "chr1_clu_2_- | \n", "chr1_clu_2_- | \n", "0.79293 | \n", "
3 | \n", "chr1 | \n", "17055 | \n", "17233 | \n", "1:17055:17233:clu_2_- | \n", "chr1_clu_2_- | \n", "chr1_clu_2_- | \n", "0.79293 | \n", "
4 | \n", "chr1 | \n", "17368 | \n", "17915 | \n", "1:17368:17915:clu_2_- | \n", "chr1_clu_2_- | \n", "chr1_clu_2_- | \n", "0.79293 | \n", "
... | \n", "... | \n", "... | \n", "... | \n", "... | \n", "... | \n", "... | \n", "... | \n", "
143404 | \n", "chr9 | \n", "137798914 | \n", "137800880 | \n", "9:137798914:137800880:clu_20895_+ | \n", "chr9_clu_20895_+ | \n", "chr9_clu_20895_+ | \n", "0.76432 | \n", "
143405 | \n", "chr9 | \n", "137800575 | \n", "137800880 | \n", "9:137800575:137800880:clu_20895_+ | \n", "chr9_clu_20895_+ | \n", "chr9_clu_20895_+ | \n", "0.76432 | \n", "
143406 | \n", "chr9 | \n", "137800984 | \n", "137802409 | \n", "9:137800984:137802409:clu_20894_+ | \n", "chr9_clu_20894_+ | \n", "chr9_clu_20894_+ | \n", "0.43322 | \n", "
143407 | \n", "chr9 | \n", "137800984 | \n", "137802857 | \n", "9:137800984:137802857:clu_20894_+ | \n", "chr9_clu_20894_+ | \n", "chr9_clu_20894_+ | \n", "0.43322 | \n", "
143408 | \n", "chr9 | \n", "137800984 | \n", "137811461 | \n", "9:137800984:137811461:clu_20894_+ | \n", "chr9_clu_20894_+ | \n", "chr9_clu_20894_+ | \n", "0.43322 | \n", "
143409 rows × 7 columns
\n", "\n", " | #Chr | \n", "start | \n", "end | \n", "pid | \n", "gid | \n", "grp_id | \n", "q | \n", "strand | \n", "DonorPos | \n", "AcceptorPos | \n", "
---|---|---|---|---|---|---|---|---|---|---|
0 | \n", "chr1 | \n", "15947 | \n", "16607 | \n", "1:15947:16607:clu_1_- | \n", "chr1_clu_1_- | \n", "chr1_clu_1_- | \n", "0.51821 | \n", "- | \n", "16606 | \n", "15947 | \n", "
1 | \n", "chr1 | \n", "16310 | \n", "16607 | \n", "1:16310:16607:clu_1_- | \n", "chr1_clu_1_- | \n", "chr1_clu_1_- | \n", "0.51821 | \n", "- | \n", "16606 | \n", "16310 | \n", "
2 | \n", "chr1 | \n", "17055 | \n", "17915 | \n", "1:17055:17915:clu_2_- | \n", "chr1_clu_2_- | \n", "chr1_clu_2_- | \n", "0.79293 | \n", "- | \n", "17914 | \n", "17055 | \n", "
3 | \n", "chr1 | \n", "17055 | \n", "17233 | \n", "1:17055:17233:clu_2_- | \n", "chr1_clu_2_- | \n", "chr1_clu_2_- | \n", "0.79293 | \n", "- | \n", "17232 | \n", "17055 | \n", "
4 | \n", "chr1 | \n", "17368 | \n", "17915 | \n", "1:17368:17915:clu_2_- | \n", "chr1_clu_2_- | \n", "chr1_clu_2_- | \n", "0.79293 | \n", "- | \n", "17914 | \n", "17368 | \n", "
... | \n", "... | \n", "... | \n", "... | \n", "... | \n", "... | \n", "... | \n", "... | \n", "... | \n", "... | \n", "... | \n", "
143404 | \n", "chr9 | \n", "137798914 | \n", "137800880 | \n", "9:137798914:137800880:clu_20895_+ | \n", "chr9_clu_20895_+ | \n", "chr9_clu_20895_+ | \n", "0.76432 | \n", "+ | \n", "137798914 | \n", "137800879 | \n", "
143405 | \n", "chr9 | \n", "137800575 | \n", "137800880 | \n", "9:137800575:137800880:clu_20895_+ | \n", "chr9_clu_20895_+ | \n", "chr9_clu_20895_+ | \n", "0.76432 | \n", "+ | \n", "137800575 | \n", "137800879 | \n", "
143406 | \n", "chr9 | \n", "137800984 | \n", "137802409 | \n", "9:137800984:137802409:clu_20894_+ | \n", "chr9_clu_20894_+ | \n", "chr9_clu_20894_+ | \n", "0.43322 | \n", "+ | \n", "137800984 | \n", "137802408 | \n", "
143407 | \n", "chr9 | \n", "137800984 | \n", "137802857 | \n", "9:137800984:137802857:clu_20894_+ | \n", "chr9_clu_20894_+ | \n", "chr9_clu_20894_+ | \n", "0.43322 | \n", "+ | \n", "137800984 | \n", "137802856 | \n", "
143408 | \n", "chr9 | \n", "137800984 | \n", "137811461 | \n", "9:137800984:137811461:clu_20894_+ | \n", "chr9_clu_20894_+ | \n", "chr9_clu_20894_+ | \n", "0.43322 | \n", "+ | \n", "137800984 | \n", "137811460 | \n", "
143409 rows × 10 columns
\n", "