tss

Last updated: 2020-09-18

Checks: 7 0

Knit directory: gene_level_fine_mapping/

This reproducible R Markdown analysis was created with workflowr (version 1.6.1). The Checks tab describes the reproducibility checks that were applied when the results were created. The Past versions tab lists the development history.

R Markdown file: up-to-date

Great! Since the R Markdown file has been committed to the Git repository, you know the exact version of the code that produced these results.

Environment: empty

Great job! The global environment was empty. Objects defined in the global environment can affect the analysis in your R Markdown file in unknown ways. For reproduciblity it’s best to always run the code in an empty environment.

Seed: set.seed(20200622)

The command set.seed(20200622) was run prior to running the code in the R Markdown file. Setting a seed ensures that any results that rely on randomness, e.g. subsampling or permutations, are reproducible.

Session information: recorded

Great job! Recording the operating system, R version, and package versions is critical for reproducibility.

Cache: none

Nice! There were no cached chunks for this analysis, so you can be confident that you successfully produced the results during this run.

File paths: relative

Great job! Using relative paths to the files within your workflowr project makes it easier to run your code on other machines.

Repository version: 464e714

Great! You are using Git for version control. Tracking code development and connecting the code version to the results is critical for reproducibility.

The results in this page were generated with repository version 464e714. See the Past versions tab to see a history of the changes made to the R Markdown and HTML files.

Note that you need to be careful to ensure that all relevant files for the analysis have been committed to Git prior to generating the results (you can use wflow_publish or wflow_git_commit). workflowr only checks the R Markdown file, but you know if there are other scripts or data files that it depends on. Below is the status of the Git repository when the results were generated:


Ignored files:
    Ignored:    .DS_Store
    Ignored:    .Rhistory
    Ignored:    .Rproj.user/
    Ignored:    analysis/.DS_Store
    Ignored:    analysis/.RData
    Ignored:    analysis/.Rhistory

Untracked files:
    Untracked:  analysis/atac_eqtl.Rmd
    Untracked:  data/hic_eqtl.RData
    Untracked:  data/train_add_hic.RData
    Untracked:  data/train_all.RData

Unstaged changes:
    Modified:   analysis/add_hic_feature.Rmd
    Modified:   analysis/assess_distal_observation_filtered.Rmd

Note that any generated files, e.g. HTML, png, CSS, etc., are not included in this status report because it is ok for generated content to have uncommitted changes.

These are the previous versions of the repository in which changes were made to the R Markdown (analysis/add_tss.Rmd) and HTML (docs/add_tss.html) files. If you’ve configured a remote Git repository (see ?wflow_git_remote), click on the hyperlinks in the table below to view the files as they were in that past version.

File	Version	Author	Date	Message
Rmd	464e714	yunqiyang0215	2020-09-18	wflow_publish(“analysis/add_tss.Rmd”)

add distance of snp to target gene tss

tss = read.table('/Users/nicholeyang/Desktop/Rotation2/gene-level fine mapping/data/gencode.v28.hg38.sga')

# load fine-mapping positive set
load("/Users/nicholeyang/Desktop/Rotation2/gene-level fine mapping/data/dt_pos3.RData")

load('/Users/nicholeyang/Desktop/Rotation2/gene-level fine mapping/data/noncausal_set.RData')

head(dt_pos3)

          gene_id               gene rank             variant_id      pip
1 ENSG00000000457 ENSG00000000457.13    1 chr1_169891332_G_A_b38 0.939551
2 ENSG00000000460 ENSG00000000460.16    1 chr1_169661963_G_A_b38 0.580082
3 ENSG00000001561  ENSG00000001561.6    2 chr6_46129743_G_GT_b38 0.513630
5 ENSG00000001561  ENSG00000001561.6    1  chr6_46130021_C_G_b38 0.657995
7 ENSG00000001629  ENSG00000001629.9    1  chr7_92245996_C_T_b38 0.520558
8 ENSG00000002016 ENSG00000002016.17    1   chr12_949572_C_G_b38 0.522047
  log10_abf cluster_id SNP_chr   SNP_loc gene_name seqnames     start       end
1    20.142          1       1 169891332     SCYL3        1 169849631 169894267
2     5.656          1       1 169661963  C1orf112        1 169662007 169854080
3    22.020          3       6  46129743     ENPP4        6  46129989  46146688
5     5.590          2       6  46130021     ENPP4        6  46129989  46146688
7     7.253          1       7  92245996    ANKIB1        7  92245974  92401383
8    23.668          2      12    949572     RAD52       12    911736    990053
   width
1  44637
2 192074
3  16700
5  16700
7 155410
8  78318

head(neg3)

     seqnames     start       end  width strand type         gene_id gene_name
1565        1 169367970 169396540  28571      + gene ENSG00000117475     BLZF1
1846        1 169662007 169854080 192074      + gene ENSG00000000460  C1orf112
2370        1 169394870 169460669  65800      - gene ENSG00000117477   CCDC181
5214        1 169511951 169586588  74638      - gene ENSG00000198734        F5
8155        1 169921326 170085208 163883      - gene ENSG00000075945    KIFAP3
9432        1 170145959 170168866  22908      + gene ENSG00000203740  METTL11B
       gene_biotype                 var_id
1565 protein_coding chr1_169891332_G_A_b38
1846 protein_coding chr1_169891332_G_A_b38
2370 protein_coding chr1_169891332_G_A_b38
5214 protein_coding chr1_169891332_G_A_b38
8155 protein_coding chr1_169891332_G_A_b38
9432 protein_coding chr1_169891332_G_A_b38

process the tss file

colnames(tss) = c("organism", 'source', 'tss_position','strand','chr', 'gene_info')

gene_info = strsplit(as.character(tss$gene_info), '.', fixed = TRUE)
tss$gene_id = unlist(lapply(gene_info, function(x) x[1]))
tss$gene_name = unlist(lapply(gene_info, function(x) x[3]))

head(tss)

      organism  source tss_position strand chr                gene_info
1 NC_000001.11 ENSEMBL       959290      -   1   ENST00000327044..NOC2L
2 NC_000001.11 ENSEMBL       960587      +   1  ENST00000338591..KLHL17
3 NC_000001.11 ENSEMBL       961449      +   1  ENST00000463212..KLHL17
4 NC_000001.11 ENSEMBL       966497      +   1 ENST00000379410..PLEKHN1
5 NC_000001.11 ENSEMBL       966502      +   1 ENST00000379407..PLEKHN1
6 NC_000001.11 ENSEMBL       976641      -   1   ENST00000479361..PERM1
          gene_id gene_name
1 ENST00000327044     NOC2L
2 ENST00000338591    KLHL17
3 ENST00000463212    KLHL17
4 ENST00000379410   PLEKHN1
5 ENST00000379407   PLEKHN1
6 ENST00000479361     PERM1

merge tss file with positive snp-gene set

sub_dt3 = dt_pos3[ , c('gene_id','gene','variant_id','pip','SNP_chr','SNP_loc','gene_name','seqnames','start','end','width')]

dt_tss = merge(sub_dt3, tss[, c('gene_id','gene_name','chr','tss_position','strand')], by = 'gene_name', all.x = TRUE)

head(dt_tss)

  gene_name       gene_id.x               gene             variant_id      pip
1      A1BG ENSG00000121410 ENSG00000121410.11 chr19_58359927_G_A_b38 0.940509
2      A1BG ENSG00000121410 ENSG00000121410.11 chr19_58359927_G_A_b38 0.940509
3      A1BG ENSG00000121410 ENSG00000121410.11 chr19_58359927_G_A_b38 0.940509
4      A1BG ENSG00000121410 ENSG00000121410.11 chr19_58330182_C_T_b38 0.788969
5      A1BG ENSG00000121410 ENSG00000121410.11 chr19_58330182_C_T_b38 0.788969
6      A1BG ENSG00000121410 ENSG00000121410.11 chr19_58330182_C_T_b38 0.788969
  SNP_chr  SNP_loc seqnames    start      end width       gene_id.y chr
1      19 58359927       19 58345178 58353492  8315 ENST00000263100   1
2      19 58359927       19 58345178 58353492  8315 ENST00000596924   1
3      19 58359927       19 58345178 58353492  8315 ENST00000598345   1
4      19 58330182       19 58345178 58353492  8315 ENST00000263100   1
5      19 58330182       19 58345178 58353492  8315 ENST00000596924   1
6      19 58330182       19 58345178 58353492  8315 ENST00000598345   1
  tss_position strand
1     58353499      -
2     58347634      -
3     58347657      -
4     58353499      -
5     58347634      -
6     58347657      -

# find the nearest tss to the snp 
dt_tss$dist_to_snp = abs(as.numeric(dt_tss$SNP_loc) - as.numeric(dt_tss$tss_position))
dt_tss = dt_tss[order(dt_tss$gene_id.x, dt_tss$dist_to_snp) , ]

sub_dt_tss = dt_tss[, c('gene_name', 'variant_id', 'pip', 'SNP_loc', 'strand')]
dt_pos_add_tss = dt_tss[!duplicated(sub_dt_tss), ]

head(dt_pos_add_tss)

     gene_name       gene_id.x               gene             variant_id
7968     SCYL3 ENSG00000000457 ENSG00000000457.13 chr1_169891332_G_A_b38
1327  C1orf112 ENSG00000000460 ENSG00000000460.16 chr1_169661963_G_A_b38
3020     ENPP4 ENSG00000001561  ENSG00000001561.6  chr6_46130021_C_G_b38
3021     ENPP4 ENSG00000001561  ENSG00000001561.6 chr6_46129743_G_GT_b38
610     ANKIB1 ENSG00000001629  ENSG00000001629.9  chr7_92245996_C_T_b38
7380     RAD52 ENSG00000002016 ENSG00000002016.17   chr12_949572_C_G_b38
          pip SNP_chr   SNP_loc seqnames     start       end  width
7968 0.939551       1 169891332        1 169849631 169894267  44637
1327 0.580082       1 169661963        1 169662007 169854080 192074
3020 0.657995       6  46130021        6  46129989  46146688  16700
3021 0.513630       6  46129743        6  46129989  46146688  16700
610  0.520558       7  92245996        7  92245974  92401383 155410
7380 0.522047      12    949572       12    911736    990053  78318
           gene_id.y chr tss_position strand dist_to_snp
7968 ENST00000367770   1    169888888      -        2444
1327 ENST00000496973   1    169795043      +      133080
3020 ENST00000321037   1     46129993      +          28
3021 ENST00000321037   1     46129993      +         250
610  ENST00000265742   1     92246234      +         238
7380 ENST00000461568   2       949665      -          93

save(dt_pos_add_tss, file = '/Users/nicholeyang/Desktop/Rotation2/gene-level fine mapping/data/dt_pos_add_tss.RData')

merge tss file with negative snp-gene set

SNP_loc = strsplit(as.character(neg3$var_id), '_')
neg3$SNP_loc = unlist(lapply(SNP_loc, function(x) x[2]))

neg_tss = merge(neg3, tss[, c('gene_id','gene_name','chr','tss_position','strand')], by = 'gene_name', all.x = TRUE)

# find the nearest tss to the snp 
neg_tss$dist_to_snp = abs(as.numeric(neg_tss$SNP_loc) - as.numeric(neg_tss$tss_position))
neg_tss = neg_tss[order(neg_tss$gene_id.x, neg_tss$dist_to_snp) , ]

sub_neg_tss = neg_tss[, c('gene_name', 'var_id', 'SNP_loc', 'strand.x', 'seqnames')]
dt_neg_tss = neg_tss[!duplicated(sub_neg_tss), ]

head(dt_neg_tss)

       gene_name seqnames     start       end  width strand.x type
37846       DPM1       20  50934867  50958555  23689        - gene
120661     SCYL3        1 169849631 169894267  44637        - gene
120665     SCYL3        1 169849631 169894267  44637        - gene
120663     SCYL3        1 169849631 169894267  44637        - gene
15473   C1orf112        1 169662007 169854080 192074        + gene
15472   C1orf112        1 169662007 169854080 192074        + gene
             gene_id.x   gene_biotype                 var_id   SNP_loc
37846  ENSG00000000419 protein_coding chr20_50794732_A_G_b38  50794732
120661 ENSG00000000457 protein_coding chr1_169661963_G_A_b38 169661963
120665 ENSG00000000457 protein_coding chr1_169605649_C_T_b38 169605649
120663 ENSG00000000457 protein_coding chr1_169443797_C_T_b38 169443797
15473  ENSG00000000460 protein_coding chr1_169891332_G_A_b38 169891332
15472  ENSG00000000460 protein_coding chr1_169605649_C_T_b38 169605649
             gene_id.y chr tss_position strand.y dist_to_snp
37846  ENST00000371588   2     50958550        -      163818
120661 ENST00000367770   1    169888888        -      226925
120665 ENST00000367770   1    169888888        -      283239
120663 ENST00000367770   1    169888888        -      445091
15473  ENST00000413811   1    169795921        +       95411
15472  ENST00000496973   1    169795043        +      189394

save(dt_neg_tss, file = '/Users/nicholeyang/Desktop/Rotation2/gene-level fine mapping/data/dt_neg_tss.RData')

sessionInfo()

R version 3.6.3 (2020-02-29)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS Catalina 10.15.5

Matrix products: default
BLAS:   /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRblas.0.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] workflowr_1.6.1

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.4      rprojroot_1.3-2 digest_0.6.25   later_1.0.0    
 [5] R6_2.4.1        backports_1.1.5 git2r_0.26.1    magrittr_1.5   
 [9] evaluate_0.14   stringi_1.4.6   rlang_0.4.5     fs_1.3.2       
[13] promises_1.1.0  whisker_0.4     rmarkdown_2.1   tools_3.6.3    
[17] stringr_1.4.0   glue_1.3.2      httpuv_1.5.2    xfun_0.12      
[21] yaml_2.2.1      compiler_3.6.3  htmltools_0.4.0 knitr_1.28

tss

Yunqi Yang

7/10/2020

add distance of snp to target gene tss

process the tss file

merge tss file with positive snp-gene set

merge tss file with negative snp-gene set