Last updated: 2022-04-22
Checks: 7 0
Knit directory: muse/
This reproducible R Markdown analysis was created with workflowr (version 1.7.0). The Checks tab describes the reproducibility checks that were applied when the results were created. The Past versions tab lists the development history.
Great! Since the R Markdown file has been committed to the Git repository, you know the exact version of the code that produced these results.
Great job! The global environment was empty. Objects defined in the global environment can affect the analysis in your R Markdown file in unknown ways. For reproduciblity it’s best to always run the code in an empty environment.
The command set.seed(20200712)
was run prior to running the code in the R Markdown file. Setting a seed ensures that any results that rely on randomness, e.g. subsampling or permutations, are reproducible.
Great job! Recording the operating system, R version, and package versions is critical for reproducibility.
Nice! There were no cached chunks for this analysis, so you can be confident that you successfully produced the results during this run.
Great job! Using relative paths to the files within your workflowr project makes it easier to run your code on other machines.
Great! You are using Git for version control. Tracking code development and connecting the code version to the results is critical for reproducibility.
The results in this page were generated with repository version e372c81. See the Past versions tab to see a history of the changes made to the R Markdown and HTML files.
Note that you need to be careful to ensure that all relevant files for the analysis have been committed to Git prior to generating the results (you can use wflow_publish
or wflow_git_commit
). workflowr only checks the R Markdown file, but you know if there are other scripts or data files that it depends on. Below is the status of the Git repository when the results were generated:
Ignored files:
Ignored: .Rhistory
Ignored: .Rproj.user/
Ignored: r_packages_4.1.2/
Untracked files:
Untracked: analysis/cell_ranger.Rmd
Note that any generated files, e.g. HTML, png, CSS, etc., are not included in this status report because it is ok for generated content to have uncommitted changes.
These are the previous versions of the repository in which changes were made to the R Markdown (analysis/tss_predict.Rmd
) and HTML (docs/tss_predict.html
) files. If you’ve configured a remote Git repository (see ?wflow_git_remote
), click on the hyperlinks in the table below to view the files as they were in that past version.
File | Version | Author | Date | Message |
---|---|---|---|---|
Rmd | e372c81 | Dave Tang | 2022-04-22 | TSS |
Transcriptional starting sites (TSSs) demarcate the first position in the DNA sequence that gets transcribed into RNA. In this work, we will try to build a classifier to try to predict TSSs. We will train our classifier using the NCBI Reference Sequence Database also known as RefSeq.
Function for loading required Bioconductor packages (and install them first, if missing).
load_bioc_package <- function(x){
if(!require(x, character.only = TRUE, quietly = TRUE)){
BiocManager::install(x, character.only = TRUE)
library(x, character.only = TRUE)
}
}
We will use the Bioconductor package biomaRt
to download the entire collection of RefSeq sequences.
load_bioc_package('biomaRt')
Find the human gene set.
ensembl <- useMart('ensembl')
biomaRt::listDatasets(ensembl) %>%
filter(grepl('human', description, TRUE))
dataset description version
1 hsapiens_gene_ensembl Human genes (GRCh38.p13) GRCh38.p13
Find RefSeq attributes.
ensembl <- useMart('ensembl', dataset = 'hsapiens_gene_ensembl')
listAttributes(ensembl) %>%
filter(grepl('refseq', description, TRUE))
name description
1 transcript_mane_select RefSeq match transcript (MANE Select)
2 transcript_mane_plus_clinical RefSeq match transcript (MANE Plus Clinical)
3 refseq_mrna RefSeq mRNA ID
4 refseq_mrna_predicted RefSeq mRNA predicted ID
5 refseq_ncrna RefSeq ncRNA ID
6 refseq_ncrna_predicted RefSeq ncRNA predicted ID
7 refseq_peptide RefSeq peptide ID
8 refseq_peptide_predicted RefSeq peptide predicted ID
page
1 feature_page
2 feature_page
3 feature_page
4 feature_page
5 feature_page
6 feature_page
7 feature_page
8 feature_page
Fetch all RefSeq mRNAs on assembled chromosomes.
my_chr <- c(1:22, 'X', 'Y')
my_refseq <- getBM(
attributes='refseq_mrna',
filters = 'chromosome_name',
values = my_chr,
mart = ensembl
)
Number of RefSeq IDs.
dim(my_refseq)
[1] 61953 1
Build table. Note that when working with transcripts, use the attributes transcript_start
and transcript_end
, and not the attributes start_position
and end_position
; using those will give you the Ensembl gene coordinates and not the RefSeq coordinates, which is what we want! (I have retrieved the Ensembl coordinates for illustrative purposes.)
my_att <- c(
'refseq_mrna',
'chromosome_name',
'transcript_start',
'transcript_end',
'start_position',
'end_position',
'strand'
)
my_refseq_loci <- getBM(
attributes = my_att,
filters = c('refseq_mrna', 'chromosome_name'),
values = list(refseq_mrna = my_refseq$refseq_mrna, chromosome_name = my_chr),
mart = ensembl
)
Check out the table.
head(my_refseq_loci)
refseq_mrna chromosome_name transcript_start transcript_end start_position
1 NM_000016 1 75724709 75763679 75724431
2 NM_000028 1 99850489 99924020 99850361
3 NM_000036 1 114673098 114695546 114673090
4 NM_000069 1 201039512 201112426 201039512
5 NM_000081 1 235661041 235866906 235661041
6 NM_000098 1 53196824 53214197 53196792
end_position strand
1 75787575 1
2 99924023 1
3 114695618 -1
4 201112451 -1
5 235883640 -1
6 53214197 1
Check for duplicated entries.
table(duplicated(my_refseq_loci$refseq_mrna))
FALSE TRUE
61953 25
Removed duplicated entries.
my_refseq_loci_uniq <- my_refseq_loci[!duplicated(my_refseq_loci$refseq_mrna),]
dim(my_refseq_loci_uniq)
[1] 61953 7
We will modify chromosome_name
and strand
to make it compatible with BSgenome.Hsapiens.UCSC.hg38
.
my_refseq_loci_uniq %>%
mutate(strand = ifelse(strand == 1, yes = '+', no = '-')) %>%
mutate(chromosome_name = sub("^", "chr", chromosome_name)) -> my_refseq_loci_uniq
head(my_refseq_loci_uniq)
refseq_mrna chromosome_name transcript_start transcript_end start_position
1 NM_000016 chr1 75724709 75763679 75724431
2 NM_000028 chr1 99850489 99924020 99850361
3 NM_000036 chr1 114673098 114695546 114673090
4 NM_000069 chr1 201039512 201112426 201039512
5 NM_000081 chr1 235661041 235866906 235661041
6 NM_000098 chr1 53196824 53214197 53196792
end_position strand
1 75787575 +
2 99924023 +
3 114695618 -
4 201112451 -
5 235883640 -
6 53214197 +
The TSS is transcript_start
when the transcript is on the +
strand and is transcript_end
on the -
strand. We want to retrieve sequence upstream and downstream of the TSS, so we will subtract and add coordinates accordingly.
my_span <- 2
my_refseq_loci_uniq %>%
dplyr::select(-c(start_position, end_position)) %>%
mutate(tss_start = if_else(strand == "+", transcript_start - my_span, transcript_end - my_span)) %>%
mutate(tss_end = if_else(strand == "+", transcript_start + my_span, transcript_end + my_span)) -> my_tss
head(my_tss)
refseq_mrna chromosome_name transcript_start transcript_end strand tss_start
1 NM_000016 chr1 75724709 75763679 + 75724707
2 NM_000028 chr1 99850489 99924020 + 99850487
3 NM_000036 chr1 114673098 114695546 - 114695544
4 NM_000069 chr1 201039512 201112426 - 201112424
5 NM_000081 chr1 235661041 235866906 - 235866904
6 NM_000098 chr1 53196824 53214197 + 53196822
tss_end
1 75724711
2 99850491
3 114695548
4 201112428
5 235866908
6 53196826
sessionInfo()
R version 4.1.2 (2021-11-01)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 20.04.3 LTS
Matrix products: default
BLAS/LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.8.so
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] biomaRt_2.50.3 forcats_0.5.1 stringr_1.4.0 dplyr_1.0.8
[5] purrr_0.3.4 readr_2.1.2 tidyr_1.2.0 tibble_3.1.6
[9] ggplot2_3.3.5 tidyverse_1.3.1 workflowr_1.7.0
loaded via a namespace (and not attached):
[1] bitops_1.0-7 fs_1.5.2 lubridate_1.8.0
[4] bit64_4.0.5 filelock_1.0.2 progress_1.2.2
[7] httr_1.4.2 GenomeInfoDb_1.30.1 rprojroot_2.0.3
[10] tools_4.1.2 backports_1.4.1 bslib_0.3.1
[13] utf8_1.2.2 R6_2.5.1 DBI_1.1.2
[16] BiocGenerics_0.40.0 colorspace_2.0-3 withr_2.5.0
[19] prettyunits_1.1.1 tidyselect_1.1.2 processx_3.5.3
[22] curl_4.3.2 bit_4.0.4 compiler_4.1.2
[25] git2r_0.30.1 cli_3.2.0 rvest_1.0.2
[28] Biobase_2.54.0 xml2_1.3.3 sass_0.4.1
[31] scales_1.2.0 callr_3.7.0 rappdirs_0.3.3
[34] digest_0.6.29 rmarkdown_2.13 XVector_0.34.0
[37] pkgconfig_2.0.3 htmltools_0.5.2 dbplyr_2.1.1
[40] fastmap_1.1.0 rlang_1.0.2 readxl_1.4.0
[43] rstudioapi_0.13 RSQLite_2.2.12 jquerylib_0.1.4
[46] generics_0.1.2 jsonlite_1.8.0 RCurl_1.98-1.6
[49] magrittr_2.0.3 GenomeInfoDbData_1.2.7 Rcpp_1.0.8.3
[52] munsell_0.5.0 S4Vectors_0.32.4 fansi_1.0.3
[55] lifecycle_1.0.1 stringi_1.7.6 whisker_0.4
[58] yaml_2.3.5 zlibbioc_1.40.0 BiocFileCache_2.2.1
[61] grid_4.1.2 blob_1.2.3 promises_1.2.0.1
[64] crayon_1.5.1 Biostrings_2.62.0 haven_2.5.0
[67] hms_1.1.1 KEGGREST_1.34.0 knitr_1.38
[70] ps_1.6.0 pillar_1.7.0 stats4_4.1.2
[73] reprex_2.0.1 XML_3.99-0.9 glue_1.6.2
[76] evaluate_0.15 getPass_0.2-2 modelr_0.1.8
[79] vctrs_0.4.1 png_0.1-7 tzdb_0.3.0
[82] httpuv_1.6.5 cellranger_1.1.0 gtable_0.3.0
[85] assertthat_0.2.1 cachem_1.0.6 xfun_0.30
[88] broom_0.8.0 later_1.3.0 AnnotationDbi_1.56.2
[91] memoise_2.0.1 IRanges_2.28.0 ellipsis_0.3.2