Last updated: 2023-02-13
Checks: 7 0
Knit directory: muse/
This reproducible R Markdown analysis was created with workflowr (version 1.7.0). The Checks tab describes the reproducibility checks that were applied when the results were created. The Past versions tab lists the development history.
Great! Since the R Markdown file has been committed to the Git repository, you know the exact version of the code that produced these results.
Great job! The global environment was empty. Objects defined in the global environment can affect the analysis in your R Markdown file in unknown ways. For reproduciblity it’s best to always run the code in an empty environment.
The command set.seed(20200712)
was run prior to running
the code in the R Markdown file. Setting a seed ensures that any results
that rely on randomness, e.g. subsampling or permutations, are
reproducible.
Great job! Recording the operating system, R version, and package versions is critical for reproducibility.
Nice! There were no cached chunks for this analysis, so you can be confident that you successfully produced the results during this run.
Great job! Using relative paths to the files within your workflowr project makes it easier to run your code on other machines.
Great! You are using Git for version control. Tracking code development and connecting the code version to the results is critical for reproducibility.
The results in this page were generated with repository version 1242888. See the Past versions tab to see a history of the changes made to the R Markdown and HTML files.
Note that you need to be careful to ensure that all relevant files for
the analysis have been committed to Git prior to generating the results
(you can use wflow_publish
or
wflow_git_commit
). workflowr only checks the R Markdown
file, but you know if there are other scripts or data files that it
depends on. Below is the status of the Git repository when the results
were generated:
Ignored files:
Ignored: .Rhistory
Ignored: .Rproj.user/
Ignored: r_packages_4.1.2/
Ignored: r_packages_4.2.0/
Untracked files:
Untracked: analysis/cell_ranger.Rmd
Untracked: analysis/tss_xgboost.Rmd
Untracked: data/HG00702_SH089_CHSTrio.chr1.vcf.gz
Untracked: data/HG00702_SH089_CHSTrio.chr1.vcf.gz.tbi
Untracked: data/ncrna_NONCODE[v3.0].fasta.tar.gz
Untracked: data/ncrna_noncode_v3.fa
Unstaged changes:
Modified: analysis/graph.Rmd
Note that any generated files, e.g. HTML, png, CSS, etc., are not included in this status report because it is ok for generated content to have uncommitted changes.
These are the previous versions of the repository in which changes were
made to the R Markdown (analysis/genomic_ranges.Rmd
) and
HTML (docs/genomic_ranges.html
) files. If you’ve configured
a remote Git repository (see ?wflow_git_remote
), click on
the hyperlinks in the table below to view the files as they were in that
past version.
File | Version | Author | Date | Message |
---|---|---|---|---|
Rmd | 1242888 | Dave Tang | 2023-02-13 | GenomicRanges |
From the introductory article for GenomicRanges:
The GenomicRanges package serves as the foundation for representing genomic locations within the Bioconductor project. In the Bioconductor package hierarchy, it builds upon the IRanges (infrastructure) package and provides support for the BSgenome (infrastructure), Rsamtools (I/O), ShortRead (I/O & QA), rtracklayer (I/O), GenomicFeatures (infrastructure), GenomicAlignments (sequence reads), VariantAnnotation (called variants), and many other Bioconductor packages.
This package lays a foundation for genomic analysis by introducing three classes (GRanges, GPos, and GRangesList), which are used to represent genomic ranges, genomic positions, and groups of genomic ranges. This vignette focuses on the GRanges and GRangesList classes and their associated methods.
To being, install GenomicRanges.
if (!require("BiocManager", quietly = TRUE))
install.packages("BiocManager")
if (!require("GenomicRanges", quietly = TRUE))
BiocManager::install("GenomicRanges")
library(GenomicRanges)
The introduction article starts with creating a GRanges
object:
The GRanges class represents a collection of genomic features that each have a single start and end location on the genome. This includes features such as contiguous binding sites, transcripts, and exons. These objects can be created by using the GRanges constructor function.
gr <- GRanges(
seqnames = Rle(c("chr1", "chr2", "chr1", "chr3"), c(1, 3, 2, 4)),
ranges = IRanges(101:110, end = 111:120, names = letters[1:10]),
strand = Rle(strand(c("-", "+", "*", "+", "-")), c(1, 2, 2, 3, 2)),
score = 1:10,
GC = seq(1, 0, length=10)
)
class(gr)
[1] "GRanges"
attr(,"package")
[1] "GenomicRanges"
The GRanges()
function was used to specify a list of
sequence names, their ranges, strand, score, and GC content. For the
seqnames
and strand
, the Rle()
function was used; RLE is an abbreviation for run-length
encoding, which is a way of representing and compressing data. The
function saves us from typing
c("chr1", "chr2", "chr2", "chr2", "chr1", "chr1", "chr3", "chr3", "chr3", "chr3")
,
which would also work.
The IRanges()
function was used to create a vector
representation of sequence ranges; 10 ranges were created and named
using the first ten letters of the alphabet.
Rle()
was also used to indicate the strandedness of the
ranges.
Metadata can also be added to a GRanges
object and in
the example, a score and the GC content were included.
The genomic coordinates (seqnames, ranges, and strand) are displayed on the left-hand side and the metadata columns (annotations) are displayed on the right-hand side.
gr
GRanges object with 10 ranges and 2 metadata columns:
seqnames ranges strand | score GC
<Rle> <IRanges> <Rle> | <integer> <numeric>
a chr1 101-111 - | 1 1.000000
b chr2 102-112 + | 2 0.888889
c chr2 103-113 + | 3 0.777778
d chr2 104-114 * | 4 0.666667
e chr1 105-115 * | 5 0.555556
f chr1 106-116 + | 6 0.444444
g chr3 107-117 + | 7 0.333333
h chr3 108-118 + | 8 0.222222
i chr3 109-119 - | 9 0.111111
j chr3 110-120 - | 10 0.000000
-------
seqinfo: 3 sequences from an unspecified genome; no seqlengths
Each component of the genomic coordinates can be extracted using the
accessor functions that have the same name as the column names. For
example to retrieve the ranges we use the ranges()
function.
ranges(gr)
IRanges object with 10 ranges and 0 metadata columns:
start end width
<integer> <integer> <integer>
a 101 111 11
b 102 112 11
c 103 113 11
d 104 114 11
e 105 115 11
f 106 116 11
g 107 117 11
h 108 118 11
i 109 119 11
j 110 120 11
Metadata is extracted using the mcols()
function.
mcols(gr)
DataFrame with 10 rows and 2 columns
score GC
<integer> <numeric>
a 1 1.000000
b 2 0.888889
c 3 0.777778
d 4 0.666667
e 5 0.555556
f 6 0.444444
g 7 0.333333
h 8 0.222222
i 9 0.111111
j 10 0.000000
The Browser
Extensible Data (BED) format is quite ubiquitous in bioinformatics,
so it is useful to know how a GRanges
object can be created
from a BED file.
We first create a data frame from a small BED file hosted on my web
server and then we create a GRanges
object using the data
frame. Since BED is 0-based, we add one to make it 1-based.
data <- read.table(
"https://davetang.org/file/granges.bed",
col.names = c('chr','start','end','id','score','strand')
)
bed <- with(data, GRanges(chr, IRanges(start+1L, end), strand, score, refseq=id))
bed
GRanges object with 6 ranges and 2 metadata columns:
seqnames ranges strand | score refseq
<Rle> <IRanges> <Rle> | <integer> <character>
[1] chr1 66999825-67210768 + | 0 NM_032291
[2] chr1 33546714-33585995 + | 0 NM_052998
[3] chr1 48998527-50489626 - | 0 NM_032785
[4] chr1 16767167-16786584 + | 0 NM_001145278
[5] chr1 16767167-16786584 + | 0 NM_001145277
[6] chr1 8384390-8404227 + | 0 NM_001080397
-------
seqinfo: 1 sequence from an unspecified genome; no seqlengths
The elementMetadata
function can be used to fetch
metadata.
elementMetadata(bed)
DataFrame with 6 rows and 2 columns
score refseq
<integer> <character>
1 0 NM_032291
2 0 NM_052998
3 0 NM_032785
4 0 NM_001145278
5 0 NM_001145277
6 0 NM_001080397
The with()
function used to create the
GRanges
object is nice because it saves
us some typing; we can directly refer to the column names.
bed <- with(
data,
GRanges(
chr,
IRanges(start+1, end),
strand,
score,
refseq=id
)
)
bed2 <- GRanges(
data$chr,
IRanges(data$start+1L, data$end),
data$strand,
score = data$score,
refseq = data$id
)
identical(bed, bed2)
[1] TRUE
Let’s load another BED file to demonstrate how we can intersect ranges.
data2 <- read.table(
"https://davetang.org/file/granges2.bed",
col.names = c('chr','start','end','id','score','strand')
)
bed2 <- with(data2, GRanges(chr, IRanges(start+1, end), strand, score, refseq = id))
bed2
GRanges object with 4 ranges and 2 metadata columns:
seqnames ranges strand | score refseq
<Rle> <IRanges> <Rle> | <integer> <character>
[1] chr1 66999823 + | 0 1_no
[2] chr1 33585997 + | 0 2_no
[3] chr1 16786584 + | 0 3_yes
[4] chr1 8384390 + | 0 4_yes
-------
seqinfo: 1 sequence from an unspecified genome; no seqlengths
We can use the intersect
function to perform an
intersect on the two GRanges
objects; note that
ignore.strand=FALSE
is the default, which means the strand
information is taken into account (but not illustrated in my
example).
GenomicRanges::intersect(bed, bed2)
GRanges object with 2 ranges and 0 metadata columns:
seqnames ranges strand
<Rle> <IRanges> <Rle>
[1] chr1 8384390 +
[2] chr1 16786584 +
-------
seqinfo: 1 sequence from an unspecified genome; no seqlengths
Note that the metadata from both BED files is lost. Thanks to a suggestion
by Daniel, we can use the subsetByOverlaps()
function
to find overlapping genomic ranges and returns results that retain the
metadata. This function returns an additional result because when
considering the refseq
metadata, there are two unique
ranges.
subsetByOverlaps(bed, bed2)
GRanges object with 3 ranges and 2 metadata columns:
seqnames ranges strand | score refseq
<Rle> <IRanges> <Rle> | <integer> <character>
[1] chr1 16767167-16786584 + | 0 NM_001145278
[2] chr1 16767167-16786584 + | 0 NM_001145277
[3] chr1 8384390-8404227 + | 0 NM_001080397
-------
seqinfo: 1 sequence from an unspecified genome; no seqlengths
In the previous example, subsetByOverlaps()
does not
indicate which features overlapped, only that there is an overlap. In
this section, we will demonstrate the use of the
GenomicRanges
functions countOverlaps()
,
findOverlaps()
, queryHits()
, and
subjectHits()
.
my_ranges <- GRanges(
seqnames = rep('chr1',6),
ranges = IRanges(
start = c(66999824,33546713,48998526,16767166,16767166,8384389),
end = c(67210768,33585995,50489626,16786584,16786584,8404227)
),
strand = c('+','+','-','+','+','+'),
score = rep(0,6),
refseq = c('NM_032291','NM_052998','NM_032785','NM_001145278','NM_001145277','NM_001080397')
)
my_snps <- GRanges(
seqnames = rep('chr1',3),
ranges = IRanges(
start = c(66999900,33546793,8384389),
end = c(66999901,33546794,8384390)
),
strand = c('+','+','+'),
id = c('snp1','snp2','snp3'),
score = rep(0,3)
)
The function countOverlaps()
counts the overlaps with
respect to the first GRanges
object.
countOverlaps(my_ranges, my_snps)
[1] 1 1 0 0 0 1
The function findOverlaps()
returns the matching indexes
between two GRanges
objects; a subject and a query.
my_overlaps <- findOverlaps(
subject = my_ranges,
query = my_snps
)
my_overlaps
Hits object with 3 hits and 0 metadata columns:
queryHits subjectHits
<integer> <integer>
[1] 1 1
[2] 2 2
[3] 3 6
-------
queryLength: 3 / subjectLength: 6
Finally, we create a data frame that displays the matches by their
metadata; the functions queryHits()
and
subjectHits()
are used to retrieve the match indexes from
the findOverlaps()
result and these indexes are used to
retrieve the metadata. This process is similar to using the base R
function match()
.
my_query <- queryHits(my_overlaps)
my_subject <- subjectHits(my_overlaps)
data.frame(
subject = my_ranges[my_subject]$refseq,
query = my_snps[my_query]$id
)
subject query
1 NM_032291 snp1
2 NM_052998 snp2
3 NM_001080397 snp3
sessionInfo()
R version 4.2.0 (2022-04-22)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 20.04.4 LTS
Matrix products: default
BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/liblapack.so.3
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] stats4 stats graphics grDevices utils datasets methods
[8] base
other attached packages:
[1] GenomicRanges_1.50.2 GenomeInfoDb_1.34.7 IRanges_2.32.0
[4] S4Vectors_0.36.1 BiocGenerics_0.44.0 BiocManager_1.30.19
[7] workflowr_1.7.0
loaded via a namespace (and not attached):
[1] Rcpp_1.0.10 XVector_0.38.0 compiler_4.2.0
[4] pillar_1.8.1 bslib_0.4.2 later_1.3.0
[7] git2r_0.30.1 jquerylib_0.1.4 zlibbioc_1.44.0
[10] bitops_1.0-7 tools_4.2.0 getPass_0.2-2
[13] digest_0.6.31 jsonlite_1.8.4 evaluate_0.20
[16] lifecycle_1.0.3 tibble_3.1.8 pkgconfig_2.0.3
[19] rlang_1.0.6 cli_3.6.0 rstudioapi_0.14
[22] yaml_2.3.7 xfun_0.36 fastmap_1.1.0
[25] GenomeInfoDbData_1.2.9 httr_1.4.4 stringr_1.5.0
[28] knitr_1.42 fs_1.5.2 vctrs_0.5.2
[31] sass_0.4.5 rprojroot_2.0.3 glue_1.6.2
[34] R6_2.5.1 processx_3.8.0 fansi_1.0.3
[37] rmarkdown_2.20 callr_3.7.3 magrittr_2.0.3
[40] whisker_0.4 ps_1.7.2 promises_1.2.0.1
[43] htmltools_0.5.4 httpuv_1.6.8 utf8_1.2.2
[46] stringi_1.7.12 RCurl_1.98-1.10 cachem_1.0.6