Last updated: 2020-09-18
Checks: 7 0
Knit directory: gene_level_fine_mapping/
This reproducible R Markdown analysis was created with workflowr (version 1.6.1). The Checks tab describes the reproducibility checks that were applied when the results were created. The Past versions tab lists the development history.
Great! Since the R Markdown file has been committed to the Git repository, you know the exact version of the code that produced these results.
Great job! The global environment was empty. Objects defined in the global environment can affect the analysis in your R Markdown file in unknown ways. For reproduciblity it’s best to always run the code in an empty environment.
The command set.seed(20200622) was run prior to running the code in the R Markdown file. Setting a seed ensures that any results that rely on randomness, e.g. subsampling or permutations, are reproducible.
Great job! Recording the operating system, R version, and package versions is critical for reproducibility.
Nice! There were no cached chunks for this analysis, so you can be confident that you successfully produced the results during this run.
Great job! Using relative paths to the files within your workflowr project makes it easier to run your code on other machines.
Great! You are using Git for version control. Tracking code development and connecting the code version to the results is critical for reproducibility.
The results in this page were generated with repository version 464e714. See the Past versions tab to see a history of the changes made to the R Markdown and HTML files.
Note that you need to be careful to ensure that all relevant files for the analysis have been committed to Git prior to generating the results (you can use wflow_publish or wflow_git_commit). workflowr only checks the R Markdown file, but you know if there are other scripts or data files that it depends on. Below is the status of the Git repository when the results were generated:
Ignored files:
Ignored: .DS_Store
Ignored: .Rhistory
Ignored: .Rproj.user/
Ignored: analysis/.DS_Store
Ignored: analysis/.RData
Ignored: analysis/.Rhistory
Untracked files:
Untracked: analysis/atac_eqtl.Rmd
Untracked: data/hic_eqtl.RData
Untracked: data/train_add_hic.RData
Untracked: data/train_all.RData
Unstaged changes:
Modified: analysis/add_hic_feature.Rmd
Modified: analysis/assess_distal_observation_filtered.Rmd
Note that any generated files, e.g. HTML, png, CSS, etc., are not included in this status report because it is ok for generated content to have uncommitted changes.
These are the previous versions of the repository in which changes were made to the R Markdown (analysis/add_tss.Rmd) and HTML (docs/add_tss.html) files. If you’ve configured a remote Git repository (see ?wflow_git_remote), click on the hyperlinks in the table below to view the files as they were in that past version.
| File | Version | Author | Date | Message |
|---|---|---|---|---|
| Rmd | 464e714 | yunqiyang0215 | 2020-09-18 | wflow_publish(“analysis/add_tss.Rmd”) |
tss = read.table('/Users/nicholeyang/Desktop/Rotation2/gene-level fine mapping/data/gencode.v28.hg38.sga')
# load fine-mapping positive set
load("/Users/nicholeyang/Desktop/Rotation2/gene-level fine mapping/data/dt_pos3.RData")
load('/Users/nicholeyang/Desktop/Rotation2/gene-level fine mapping/data/noncausal_set.RData')
head(dt_pos3)
gene_id gene rank variant_id pip
1 ENSG00000000457 ENSG00000000457.13 1 chr1_169891332_G_A_b38 0.939551
2 ENSG00000000460 ENSG00000000460.16 1 chr1_169661963_G_A_b38 0.580082
3 ENSG00000001561 ENSG00000001561.6 2 chr6_46129743_G_GT_b38 0.513630
5 ENSG00000001561 ENSG00000001561.6 1 chr6_46130021_C_G_b38 0.657995
7 ENSG00000001629 ENSG00000001629.9 1 chr7_92245996_C_T_b38 0.520558
8 ENSG00000002016 ENSG00000002016.17 1 chr12_949572_C_G_b38 0.522047
log10_abf cluster_id SNP_chr SNP_loc gene_name seqnames start end
1 20.142 1 1 169891332 SCYL3 1 169849631 169894267
2 5.656 1 1 169661963 C1orf112 1 169662007 169854080
3 22.020 3 6 46129743 ENPP4 6 46129989 46146688
5 5.590 2 6 46130021 ENPP4 6 46129989 46146688
7 7.253 1 7 92245996 ANKIB1 7 92245974 92401383
8 23.668 2 12 949572 RAD52 12 911736 990053
width
1 44637
2 192074
3 16700
5 16700
7 155410
8 78318
head(neg3)
seqnames start end width strand type gene_id gene_name
1565 1 169367970 169396540 28571 + gene ENSG00000117475 BLZF1
1846 1 169662007 169854080 192074 + gene ENSG00000000460 C1orf112
2370 1 169394870 169460669 65800 - gene ENSG00000117477 CCDC181
5214 1 169511951 169586588 74638 - gene ENSG00000198734 F5
8155 1 169921326 170085208 163883 - gene ENSG00000075945 KIFAP3
9432 1 170145959 170168866 22908 + gene ENSG00000203740 METTL11B
gene_biotype var_id
1565 protein_coding chr1_169891332_G_A_b38
1846 protein_coding chr1_169891332_G_A_b38
2370 protein_coding chr1_169891332_G_A_b38
5214 protein_coding chr1_169891332_G_A_b38
8155 protein_coding chr1_169891332_G_A_b38
9432 protein_coding chr1_169891332_G_A_b38
colnames(tss) = c("organism", 'source', 'tss_position','strand','chr', 'gene_info')
gene_info = strsplit(as.character(tss$gene_info), '.', fixed = TRUE)
tss$gene_id = unlist(lapply(gene_info, function(x) x[1]))
tss$gene_name = unlist(lapply(gene_info, function(x) x[3]))
head(tss)
organism source tss_position strand chr gene_info
1 NC_000001.11 ENSEMBL 959290 - 1 ENST00000327044..NOC2L
2 NC_000001.11 ENSEMBL 960587 + 1 ENST00000338591..KLHL17
3 NC_000001.11 ENSEMBL 961449 + 1 ENST00000463212..KLHL17
4 NC_000001.11 ENSEMBL 966497 + 1 ENST00000379410..PLEKHN1
5 NC_000001.11 ENSEMBL 966502 + 1 ENST00000379407..PLEKHN1
6 NC_000001.11 ENSEMBL 976641 - 1 ENST00000479361..PERM1
gene_id gene_name
1 ENST00000327044 NOC2L
2 ENST00000338591 KLHL17
3 ENST00000463212 KLHL17
4 ENST00000379410 PLEKHN1
5 ENST00000379407 PLEKHN1
6 ENST00000479361 PERM1
sub_dt3 = dt_pos3[ , c('gene_id','gene','variant_id','pip','SNP_chr','SNP_loc','gene_name','seqnames','start','end','width')]
dt_tss = merge(sub_dt3, tss[, c('gene_id','gene_name','chr','tss_position','strand')], by = 'gene_name', all.x = TRUE)
head(dt_tss)
gene_name gene_id.x gene variant_id pip
1 A1BG ENSG00000121410 ENSG00000121410.11 chr19_58359927_G_A_b38 0.940509
2 A1BG ENSG00000121410 ENSG00000121410.11 chr19_58359927_G_A_b38 0.940509
3 A1BG ENSG00000121410 ENSG00000121410.11 chr19_58359927_G_A_b38 0.940509
4 A1BG ENSG00000121410 ENSG00000121410.11 chr19_58330182_C_T_b38 0.788969
5 A1BG ENSG00000121410 ENSG00000121410.11 chr19_58330182_C_T_b38 0.788969
6 A1BG ENSG00000121410 ENSG00000121410.11 chr19_58330182_C_T_b38 0.788969
SNP_chr SNP_loc seqnames start end width gene_id.y chr
1 19 58359927 19 58345178 58353492 8315 ENST00000263100 1
2 19 58359927 19 58345178 58353492 8315 ENST00000596924 1
3 19 58359927 19 58345178 58353492 8315 ENST00000598345 1
4 19 58330182 19 58345178 58353492 8315 ENST00000263100 1
5 19 58330182 19 58345178 58353492 8315 ENST00000596924 1
6 19 58330182 19 58345178 58353492 8315 ENST00000598345 1
tss_position strand
1 58353499 -
2 58347634 -
3 58347657 -
4 58353499 -
5 58347634 -
6 58347657 -
# find the nearest tss to the snp
dt_tss$dist_to_snp = abs(as.numeric(dt_tss$SNP_loc) - as.numeric(dt_tss$tss_position))
dt_tss = dt_tss[order(dt_tss$gene_id.x, dt_tss$dist_to_snp) , ]
sub_dt_tss = dt_tss[, c('gene_name', 'variant_id', 'pip', 'SNP_loc', 'strand')]
dt_pos_add_tss = dt_tss[!duplicated(sub_dt_tss), ]
head(dt_pos_add_tss)
gene_name gene_id.x gene variant_id
7968 SCYL3 ENSG00000000457 ENSG00000000457.13 chr1_169891332_G_A_b38
1327 C1orf112 ENSG00000000460 ENSG00000000460.16 chr1_169661963_G_A_b38
3020 ENPP4 ENSG00000001561 ENSG00000001561.6 chr6_46130021_C_G_b38
3021 ENPP4 ENSG00000001561 ENSG00000001561.6 chr6_46129743_G_GT_b38
610 ANKIB1 ENSG00000001629 ENSG00000001629.9 chr7_92245996_C_T_b38
7380 RAD52 ENSG00000002016 ENSG00000002016.17 chr12_949572_C_G_b38
pip SNP_chr SNP_loc seqnames start end width
7968 0.939551 1 169891332 1 169849631 169894267 44637
1327 0.580082 1 169661963 1 169662007 169854080 192074
3020 0.657995 6 46130021 6 46129989 46146688 16700
3021 0.513630 6 46129743 6 46129989 46146688 16700
610 0.520558 7 92245996 7 92245974 92401383 155410
7380 0.522047 12 949572 12 911736 990053 78318
gene_id.y chr tss_position strand dist_to_snp
7968 ENST00000367770 1 169888888 - 2444
1327 ENST00000496973 1 169795043 + 133080
3020 ENST00000321037 1 46129993 + 28
3021 ENST00000321037 1 46129993 + 250
610 ENST00000265742 1 92246234 + 238
7380 ENST00000461568 2 949665 - 93
save(dt_pos_add_tss, file = '/Users/nicholeyang/Desktop/Rotation2/gene-level fine mapping/data/dt_pos_add_tss.RData')
SNP_loc = strsplit(as.character(neg3$var_id), '_')
neg3$SNP_loc = unlist(lapply(SNP_loc, function(x) x[2]))
neg_tss = merge(neg3, tss[, c('gene_id','gene_name','chr','tss_position','strand')], by = 'gene_name', all.x = TRUE)
# find the nearest tss to the snp
neg_tss$dist_to_snp = abs(as.numeric(neg_tss$SNP_loc) - as.numeric(neg_tss$tss_position))
neg_tss = neg_tss[order(neg_tss$gene_id.x, neg_tss$dist_to_snp) , ]
sub_neg_tss = neg_tss[, c('gene_name', 'var_id', 'SNP_loc', 'strand.x', 'seqnames')]
dt_neg_tss = neg_tss[!duplicated(sub_neg_tss), ]
head(dt_neg_tss)
gene_name seqnames start end width strand.x type
37846 DPM1 20 50934867 50958555 23689 - gene
120661 SCYL3 1 169849631 169894267 44637 - gene
120665 SCYL3 1 169849631 169894267 44637 - gene
120663 SCYL3 1 169849631 169894267 44637 - gene
15473 C1orf112 1 169662007 169854080 192074 + gene
15472 C1orf112 1 169662007 169854080 192074 + gene
gene_id.x gene_biotype var_id SNP_loc
37846 ENSG00000000419 protein_coding chr20_50794732_A_G_b38 50794732
120661 ENSG00000000457 protein_coding chr1_169661963_G_A_b38 169661963
120665 ENSG00000000457 protein_coding chr1_169605649_C_T_b38 169605649
120663 ENSG00000000457 protein_coding chr1_169443797_C_T_b38 169443797
15473 ENSG00000000460 protein_coding chr1_169891332_G_A_b38 169891332
15472 ENSG00000000460 protein_coding chr1_169605649_C_T_b38 169605649
gene_id.y chr tss_position strand.y dist_to_snp
37846 ENST00000371588 2 50958550 - 163818
120661 ENST00000367770 1 169888888 - 226925
120665 ENST00000367770 1 169888888 - 283239
120663 ENST00000367770 1 169888888 - 445091
15473 ENST00000413811 1 169795921 + 95411
15472 ENST00000496973 1 169795043 + 189394
save(dt_neg_tss, file = '/Users/nicholeyang/Desktop/Rotation2/gene-level fine mapping/data/dt_neg_tss.RData')
sessionInfo()
R version 3.6.3 (2020-02-29)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS Catalina 10.15.5
Matrix products: default
BLAS: /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRblas.0.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRlapack.dylib
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] workflowr_1.6.1
loaded via a namespace (and not attached):
[1] Rcpp_1.0.4 rprojroot_1.3-2 digest_0.6.25 later_1.0.0
[5] R6_2.4.1 backports_1.1.5 git2r_0.26.1 magrittr_1.5
[9] evaluate_0.14 stringi_1.4.6 rlang_0.4.5 fs_1.3.2
[13] promises_1.1.0 whisker_0.4 rmarkdown_2.1 tools_3.6.3
[17] stringr_1.4.0 glue_1.3.2 httpuv_1.5.2 xfun_0.12
[21] yaml_2.2.1 compiler_3.6.3 htmltools_0.4.0 knitr_1.28