Sequence composition and Random Forests

Introduction

Packages

RefSeq

Transcription Start Site

Last updated: 2022-04-22

Checks: 7 0

Knit directory: muse/

This reproducible R Markdown analysis was created with workflowr (version 1.7.0). The Checks tab describes the reproducibility checks that were applied when the results were created. The Past versions tab lists the development history.

R Markdown file: up-to-date

Great! Since the R Markdown file has been committed to the Git repository, you know the exact version of the code that produced these results.

Environment: empty

Great job! The global environment was empty. Objects defined in the global environment can affect the analysis in your R Markdown file in unknown ways. For reproduciblity it’s best to always run the code in an empty environment.

Seed: set.seed(20200712)

The command set.seed(20200712) was run prior to running the code in the R Markdown file. Setting a seed ensures that any results that rely on randomness, e.g. subsampling or permutations, are reproducible.

Session information: recorded

Great job! Recording the operating system, R version, and package versions is critical for reproducibility.

Cache: none

Nice! There were no cached chunks for this analysis, so you can be confident that you successfully produced the results during this run.

File paths: relative

Great job! Using relative paths to the files within your workflowr project makes it easier to run your code on other machines.

Repository version: e372c81

Great! You are using Git for version control. Tracking code development and connecting the code version to the results is critical for reproducibility.

The results in this page were generated with repository version e372c81. See the Past versions tab to see a history of the changes made to the R Markdown and HTML files.

Note that you need to be careful to ensure that all relevant files for the analysis have been committed to Git prior to generating the results (you can use wflow_publish or wflow_git_commit). workflowr only checks the R Markdown file, but you know if there are other scripts or data files that it depends on. Below is the status of the Git repository when the results were generated:


Ignored files:
    Ignored:    .Rhistory
    Ignored:    .Rproj.user/
    Ignored:    r_packages_4.1.2/

Untracked files:
    Untracked:  analysis/cell_ranger.Rmd

Note that any generated files, e.g. HTML, png, CSS, etc., are not included in this status report because it is ok for generated content to have uncommitted changes.

These are the previous versions of the repository in which changes were made to the R Markdown (analysis/tss_predict.Rmd) and HTML (docs/tss_predict.html) files. If you’ve configured a remote Git repository (see ?wflow_git_remote), click on the hyperlinks in the table below to view the files as they were in that past version.

File	Version	Author	Date	Message
Rmd	e372c81	Dave Tang	2022-04-22	TSS

Introduction

Transcriptional starting sites (TSSs) demarcate the first position in the DNA sequence that gets transcribed into RNA. In this work, we will try to build a classifier to try to predict TSSs. We will train our classifier using the NCBI Reference Sequence Database also known as RefSeq.

Packages

Function for loading required Bioconductor packages (and install them first, if missing).

load_bioc_package <- function(x){
  if(!require(x, character.only = TRUE, quietly = TRUE)){
    BiocManager::install(x, character.only = TRUE)
    library(x, character.only = TRUE)
  }
}

RefSeq

We will use the Bioconductor package biomaRt to download the entire collection of RefSeq sequences.

load_bioc_package('biomaRt')

Find the human gene set.

ensembl <- useMart('ensembl')
biomaRt::listDatasets(ensembl) %>%
  filter(grepl('human', description, TRUE))

                dataset              description    version
1 hsapiens_gene_ensembl Human genes (GRCh38.p13) GRCh38.p13

Find RefSeq attributes.

ensembl <- useMart('ensembl', dataset = 'hsapiens_gene_ensembl')
listAttributes(ensembl) %>%
  filter(grepl('refseq', description, TRUE))

                           name                                  description
1        transcript_mane_select        RefSeq match transcript (MANE Select)
2 transcript_mane_plus_clinical RefSeq match transcript (MANE Plus Clinical)
3                   refseq_mrna                               RefSeq mRNA ID
4         refseq_mrna_predicted                     RefSeq mRNA predicted ID
5                  refseq_ncrna                              RefSeq ncRNA ID
6        refseq_ncrna_predicted                    RefSeq ncRNA predicted ID
7                refseq_peptide                            RefSeq peptide ID
8      refseq_peptide_predicted                  RefSeq peptide predicted ID
          page
1 feature_page
2 feature_page
3 feature_page
4 feature_page
5 feature_page
6 feature_page
7 feature_page
8 feature_page

Fetch all RefSeq mRNAs on assembled chromosomes.

my_chr <- c(1:22, 'X', 'Y')

my_refseq <- getBM(
  attributes='refseq_mrna',
  filters = 'chromosome_name',
  values = my_chr,
  mart = ensembl
)

Number of RefSeq IDs.

dim(my_refseq)

[1] 61953     1

Build table. Note that when working with transcripts, use the attributes transcript_start and transcript_end, and not the attributes start_position and end_position; using those will give you the Ensembl gene coordinates and not the RefSeq coordinates, which is what we want! (I have retrieved the Ensembl coordinates for illustrative purposes.)

my_att <- c(
  'refseq_mrna',
  'chromosome_name',
  'transcript_start',
  'transcript_end',
  'start_position',
  'end_position',
  'strand'
)

my_refseq_loci <- getBM(
  attributes = my_att,
  filters = c('refseq_mrna', 'chromosome_name'),
  values = list(refseq_mrna = my_refseq$refseq_mrna, chromosome_name = my_chr),
  mart = ensembl
)

Check out the table.

head(my_refseq_loci)

  refseq_mrna chromosome_name transcript_start transcript_end start_position
1   NM_000016               1         75724709       75763679       75724431
2   NM_000028               1         99850489       99924020       99850361
3   NM_000036               1        114673098      114695546      114673090
4   NM_000069               1        201039512      201112426      201039512
5   NM_000081               1        235661041      235866906      235661041
6   NM_000098               1         53196824       53214197       53196792
  end_position strand
1     75787575      1
2     99924023      1
3    114695618     -1
4    201112451     -1
5    235883640     -1
6     53214197      1

Check for duplicated entries.

table(duplicated(my_refseq_loci$refseq_mrna))


FALSE  TRUE 
61953    25

Removed duplicated entries.

my_refseq_loci_uniq <- my_refseq_loci[!duplicated(my_refseq_loci$refseq_mrna),]
dim(my_refseq_loci_uniq)

[1] 61953     7

We will modify chromosome_name and strand to make it compatible with BSgenome.Hsapiens.UCSC.hg38.

my_refseq_loci_uniq %>%
  mutate(strand = ifelse(strand == 1, yes = '+', no = '-')) %>%
  mutate(chromosome_name = sub("^", "chr", chromosome_name)) -> my_refseq_loci_uniq

head(my_refseq_loci_uniq)

  refseq_mrna chromosome_name transcript_start transcript_end start_position
1   NM_000016            chr1         75724709       75763679       75724431
2   NM_000028            chr1         99850489       99924020       99850361
3   NM_000036            chr1        114673098      114695546      114673090
4   NM_000069            chr1        201039512      201112426      201039512
5   NM_000081            chr1        235661041      235866906      235661041
6   NM_000098            chr1         53196824       53214197       53196792
  end_position strand
1     75787575      +
2     99924023      +
3    114695618      -
4    201112451      -
5    235883640      -
6     53214197      +

Transcription Start Site

The TSS is transcript_start when the transcript is on the + strand and is transcript_end on the - strand. We want to retrieve sequence upstream and downstream of the TSS, so we will subtract and add coordinates accordingly.

my_span <- 2

my_refseq_loci_uniq %>%
  dplyr::select(-c(start_position, end_position)) %>%
  mutate(tss_start = if_else(strand == "+", transcript_start - my_span, transcript_end - my_span)) %>%
  mutate(tss_end = if_else(strand == "+", transcript_start + my_span, transcript_end + my_span)) -> my_tss

head(my_tss)

  refseq_mrna chromosome_name transcript_start transcript_end strand tss_start
1   NM_000016            chr1         75724709       75763679      +  75724707
2   NM_000028            chr1         99850489       99924020      +  99850487
3   NM_000036            chr1        114673098      114695546      - 114695544
4   NM_000069            chr1        201039512      201112426      - 201112424
5   NM_000081            chr1        235661041      235866906      - 235866904
6   NM_000098            chr1         53196824       53214197      +  53196822
    tss_end
1  75724711
2  99850491
3 114695548
4 201112428
5 235866908
6  53196826

sessionInfo()

R version 4.1.2 (2021-11-01)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 20.04.3 LTS

Matrix products: default
BLAS/LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.8.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] biomaRt_2.50.3  forcats_0.5.1   stringr_1.4.0   dplyr_1.0.8    
 [5] purrr_0.3.4     readr_2.1.2     tidyr_1.2.0     tibble_3.1.6   
 [9] ggplot2_3.3.5   tidyverse_1.3.1 workflowr_1.7.0

loaded via a namespace (and not attached):
 [1] bitops_1.0-7           fs_1.5.2               lubridate_1.8.0       
 [4] bit64_4.0.5            filelock_1.0.2         progress_1.2.2        
 [7] httr_1.4.2             GenomeInfoDb_1.30.1    rprojroot_2.0.3       
[10] tools_4.1.2            backports_1.4.1        bslib_0.3.1           
[13] utf8_1.2.2             R6_2.5.1               DBI_1.1.2             
[16] BiocGenerics_0.40.0    colorspace_2.0-3       withr_2.5.0           
[19] prettyunits_1.1.1      tidyselect_1.1.2       processx_3.5.3        
[22] curl_4.3.2             bit_4.0.4              compiler_4.1.2        
[25] git2r_0.30.1           cli_3.2.0              rvest_1.0.2           
[28] Biobase_2.54.0         xml2_1.3.3             sass_0.4.1            
[31] scales_1.2.0           callr_3.7.0            rappdirs_0.3.3        
[34] digest_0.6.29          rmarkdown_2.13         XVector_0.34.0        
[37] pkgconfig_2.0.3        htmltools_0.5.2        dbplyr_2.1.1          
[40] fastmap_1.1.0          rlang_1.0.2            readxl_1.4.0          
[43] rstudioapi_0.13        RSQLite_2.2.12         jquerylib_0.1.4       
[46] generics_0.1.2         jsonlite_1.8.0         RCurl_1.98-1.6        
[49] magrittr_2.0.3         GenomeInfoDbData_1.2.7 Rcpp_1.0.8.3          
[52] munsell_0.5.0          S4Vectors_0.32.4       fansi_1.0.3           
[55] lifecycle_1.0.1        stringi_1.7.6          whisker_0.4           
[58] yaml_2.3.5             zlibbioc_1.40.0        BiocFileCache_2.2.1   
[61] grid_4.1.2             blob_1.2.3             promises_1.2.0.1      
[64] crayon_1.5.1           Biostrings_2.62.0      haven_2.5.0           
[67] hms_1.1.1              KEGGREST_1.34.0        knitr_1.38            
[70] ps_1.6.0               pillar_1.7.0           stats4_4.1.2          
[73] reprex_2.0.1           XML_3.99-0.9           glue_1.6.2            
[76] evaluate_0.15          getPass_0.2-2          modelr_0.1.8          
[79] vctrs_0.4.1            png_0.1-7              tzdb_0.3.0            
[82] httpuv_1.6.5           cellranger_1.1.0       gtable_0.3.0          
[85] assertthat_0.2.1       cachem_1.0.6           xfun_0.30             
[88] broom_0.8.0            later_1.3.0            AnnotationDbi_1.56.2  
[91] memoise_2.0.1          IRanges_2.28.0         ellipsis_0.3.2