Last updated: 2022-11-08
Checks: 7 0
Knit directory: muse/
This reproducible R Markdown analysis was created with workflowr (version 1.7.0). The Checks tab describes the reproducibility checks that were applied when the results were created. The Past versions tab lists the development history.
Great! Since the R Markdown file has been committed to the Git repository, you know the exact version of the code that produced these results.
Great job! The global environment was empty. Objects defined in the global environment can affect the analysis in your R Markdown file in unknown ways. For reproduciblity it’s best to always run the code in an empty environment.
The command set.seed(20200712)
was run prior to running
the code in the R Markdown file. Setting a seed ensures that any results
that rely on randomness, e.g. subsampling or permutations, are
reproducible.
Great job! Recording the operating system, R version, and package versions is critical for reproducibility.
Nice! There were no cached chunks for this analysis, so you can be confident that you successfully produced the results during this run.
Great job! Using relative paths to the files within your workflowr project makes it easier to run your code on other machines.
Great! You are using Git for version control. Tracking code development and connecting the code version to the results is critical for reproducibility.
The results in this page were generated with repository version ce8e447. See the Past versions tab to see a history of the changes made to the R Markdown and HTML files.
Note that you need to be careful to ensure that all relevant files for
the analysis have been committed to Git prior to generating the results
(you can use wflow_publish
or
wflow_git_commit
). workflowr only checks the R Markdown
file, but you know if there are other scripts or data files that it
depends on. Below is the status of the Git repository when the results
were generated:
Ignored files:
Ignored: .Rhistory
Ignored: .Rproj.user/
Ignored: analysis/figure/
Ignored: r_packages_4.1.2/
Ignored: r_packages_4.2.0/
Untracked files:
Untracked: analysis/cell_ranger.Rmd
Untracked: data/ncrna_NONCODE[v3.0].fasta.tar.gz
Untracked: data/ncrna_noncode_v3.fa
Untracked: data/tss.rds
Note that any generated files, e.g. HTML, png, CSS, etc., are not included in this status report because it is ok for generated content to have uncommitted changes.
These are the previous versions of the repository in which changes were
made to the R Markdown (analysis/tss_predict.Rmd
) and HTML
(docs/tss_predict.html
) files. If you’ve configured a
remote Git repository (see ?wflow_git_remote
), click on the
hyperlinks in the table below to view the files as they were in that
past version.
File | Version | Author | Date | Message |
---|---|---|---|---|
Rmd | ce8e447 | Dave Tang | 2022-11-08 | Update notebook |
html | 9ec0dec | Dave Tang | 2022-04-26 | Build site. |
Rmd | 56b1a79 | Dave Tang | 2022-04-26 | Positional di-nuc freq |
html | fd116a1 | Dave Tang | 2022-04-24 | Build site. |
Rmd | 0616464 | Dave Tang | 2022-04-24 | Random forest model |
html | fd41c75 | Dave Tang | 2022-04-22 | Build site. |
Rmd | e372c81 | Dave Tang | 2022-04-22 | TSS |
Transcriptional starting sites (TSSs) demarcate the first position in the DNA sequence that gets transcribed into RNA. In this work, we will build a Random Forest classifier that predicts whether a sequence of nucleotides is a TSSs or not. We will train our classifier using TSS sequences based on the NCBI Reference Sequence Database (RefSeq database). Building an accurate classifier will allow us to predict whether an unknown sequence is likely a TSS or not.
First we’ll define a function for loading required packages (and to install them first, if missing).
load_package <- function(x, source = "bioc"){
if(!require(x, character.only = TRUE, quietly = TRUE)){
if (source == "bioc"){
BiocManager::install(x, character.only = TRUE)
} else if (source == "cran"){
install.packages(x, character.only = TRUE)
} else {
stop("Unrecognised source")
}
}
library(x, character.only = TRUE)
}
We will use the Bioconductor package biomaRt
to download
the entire collection of RefSeq sequences.
load_package('biomaRt')
Find the human gene set.
ensembl <- useMart('ensembl', host = params$host)
biomaRt::listDatasets(ensembl) %>%
filter(grepl('human', description, TRUE))
dataset description version
1 hsapiens_gene_ensembl Human genes (GRCh38.p13) GRCh38.p13
List RefSeq attributes for use with getBM
.
biomaRt::listDatasets(ensembl) %>%
filter(grepl('human', description, TRUE)) %>%
pull(dataset) -> my_dataset
ensembl <- useMart('ensembl', dataset = my_dataset, host = params$host)
listAttributes(ensembl) %>%
filter(grepl('refseq', description, TRUE))
name description
1 transcript_mane_select RefSeq match transcript (MANE Select)
2 transcript_mane_plus_clinical RefSeq match transcript (MANE Plus Clinical)
3 refseq_mrna RefSeq mRNA ID
4 refseq_mrna_predicted RefSeq mRNA predicted ID
5 refseq_ncrna RefSeq ncRNA ID
6 refseq_ncrna_predicted RefSeq ncRNA predicted ID
7 refseq_peptide RefSeq peptide ID
8 refseq_peptide_predicted RefSeq peptide predicted ID
page
1 feature_page
2 feature_page
3 feature_page
4 feature_page
5 feature_page
6 feature_page
7 feature_page
8 feature_page
Fetch all RefSeq mRNA IDs on assembled chromosomes.
my_chr <- c(1:22, 'X', 'Y')
my_refseq <- getBM(
attributes='refseq_mrna',
filters = 'chromosome_name',
values = my_chr,
mart = ensembl
)
head(my_refseq)
refseq_mrna
1 NM_004676
2 NM_004654
3 NM_001002758
4 NM_004681
5 NM_001278612
6 NM_004678
Number of RefSeq IDs.
length(my_refseq$refseq_mrna)
[1] 62674
We will perform another getBM
query to build a table
with the start and end coordinates of each mRNA/transcript. Note that
when working with transcripts, use the attributes
transcript_start
and transcript_end
, and not
the attributes start_position
and
end_position
; using those will give you the Ensembl gene
coordinates and not the RefSeq coordinates, which is what we want! (I
have retrieved the Ensembl coordinates for illustrative purposes.)
my_att <- c(
'refseq_mrna',
'chromosome_name',
'transcript_start',
'transcript_end',
'start_position',
'end_position',
'strand'
)
my_refseq_loci <- getBM(
attributes = my_att,
filters = c('refseq_mrna', 'chromosome_name'),
values = list(refseq_mrna = my_refseq$refseq_mrna, chromosome_name = my_chr),
mart = ensembl
)
dim(my_refseq_loci)
[1] 62704 7
Check out the table.
head(my_refseq_loci)
refseq_mrna chromosome_name transcript_start transcript_end start_position
1 NM_000015 8 18391282 18401218 18391282
2 NM_000046 5 78777209 78985310 78777209
3 NM_000071 21 43053191 43075835 43053191
4 NM_000084 X 50067576 50092406 49922596
5 NM_000099 20 23633657 23637955 23626706
6 NM_000100 21 43773950 43776308 43772511
end_position strand
1 18401218 1
2 78986087 -1
3 43076943 -1
4 50099235 1
5 23638473 -1
6 43776330 -1
Check for duplicated entries.
table(duplicated(my_refseq_loci$refseq_mrna))
FALSE TRUE
62674 30
Removed duplicated entries.
my_refseq_loci_uniq <- my_refseq_loci[!duplicated(my_refseq_loci$refseq_mrna),]
dim(my_refseq_loci_uniq)
[1] 62674 7
We will modify chromosome_name
and strand
to make it compatible with BSgenome.Hsapiens.UCSC.hg38
.
my_refseq_loci_uniq %>%
mutate(strand = ifelse(strand == 1, yes = '+', no = '-')) %>%
mutate(chromosome_name = sub("^", "chr", chromosome_name)) -> my_refseq_loci_uniq
head(my_refseq_loci_uniq)
refseq_mrna chromosome_name transcript_start transcript_end start_position
1 NM_000015 chr8 18391282 18401218 18391282
2 NM_000046 chr5 78777209 78985310 78777209
3 NM_000071 chr21 43053191 43075835 43053191
4 NM_000084 chrX 50067576 50092406 49922596
5 NM_000099 chr20 23633657 23637955 23626706
6 NM_000100 chr21 43773950 43776308 43772511
end_position strand
1 18401218 +
2 78986087 -
3 43076943 -
4 50099235 +
5 23638473 -
6 43776330 -
The TSS is defined by transcript_start
when the
transcript is on the +
strand and by
transcript_end
if the strand is -
. We want to
retrieve nucleotides upstream and downstream of the TSS, so we will
subtract and add coordinates accordingly.
my_tss <- list()
my_tss$loci <- my_refseq_loci_uniq %>%
dplyr::select(-c(start_position, end_position)) %>%
mutate(tss_start = if_else(strand == "+", transcript_start - params$span, transcript_end - params$span)) %>%
mutate(tss_end = if_else(strand == "+", transcript_start + params$span, transcript_end + params$span))
head(my_tss$loci)
refseq_mrna chromosome_name transcript_start transcript_end strand tss_start
1 NM_000015 chr8 18391282 18401218 + 18391280
2 NM_000046 chr5 78777209 78985310 - 78985308
3 NM_000071 chr21 43053191 43075835 - 43075833
4 NM_000084 chrX 50067576 50092406 + 50067574
5 NM_000099 chr20 23633657 23637955 - 23637953
6 NM_000100 chr21 43773950 43776308 - 43776306
tss_end
1 18391284
2 78985312
3 43075837
4 50067578
5 23637957
6 43776310
Non-TSS sequences will be generated by randomly sampling hg38, which
is available via the BSgenome.Hsapiens.UCSC.hg38
package.
load_package('BSgenome.Hsapiens.UCSC.hg38')
Save chromosome names and their sizes.
hg38_ref <- seqnames(BSgenome.Hsapiens.UCSC.hg38)
# remove unassembled and mitochondrial sequences
hg38_chr <- hg38_ref[!grepl("_", hg38_ref)]
hg38_chr <- hg38_chr[!grepl("chrM", hg38_chr)]
hg38_chr_size <- sapply(hg38_chr, function(x){
length(BSgenome.Hsapiens.UCSC.hg38[[x]])
})
head(hg38_chr_size)
chr1 chr2 chr3 chr4 chr5 chr6
248956422 242193529 198295559 190214555 181538259 170805979
Sample chromosome to match the number of TSSs.
set.seed(1984)
n <- nrow(my_tss$loci)
sstrand <- sample(c("+", "-"), n, TRUE)
sstart <- sapply(my_tss$loci$chromosome_name, function(x){
sample(hg38_chr_size[x] - (params$span * 2), 1)
})
send <- sstart + (params$span*2)
my_random <- list()
my_random$loci <- data.frame(
chr = my_tss$loci$chromosome_name,
start = sstart,
end = send,
strand = sstrand
)
head(my_random$loci)
chr start end strand
1 chr8 130846725 130846729 -
2 chr5 136222485 136222489 -
3 chr21 17034887 17034891 -
4 chrX 122209123 122209127 -
5 chr20 27691322 27691326 +
6 chr21 10253649 10253653 +
We will use the GenomicRanges
package to remove randomly
sampled regions that may have overlapped with TSS regions.
load_package("GenomicRanges")
First we will create a GRanges
object using
my_refseq_loci
.
my_tss$gr <- with(
my_tss$loci,
GRanges(chromosome_name,
IRanges(tss_start, tss_end, names = refseq_mrna),
strand
)
)
head(my_tss$gr)
GRanges object with 6 ranges and 0 metadata columns:
seqnames ranges strand
<Rle> <IRanges> <Rle>
NM_000015 chr8 18391280-18391284 +
NM_000046 chr5 78985308-78985312 -
NM_000071 chr21 43075833-43075837 -
NM_000084 chrX 50067574-50067578 +
NM_000099 chr20 23637953-23637957 -
NM_000100 chr21 43776306-43776310 -
-------
seqinfo: 24 sequences from an unspecified genome; no seqlengths
Next we will create a GRanges
object for the random
loci.
my_random$gr <- with(
my_random$loci,
GRanges(chr,
IRanges(start, end, names = paste(chr, start, end, sep = "_")),
strand
)
)
head(my_random$gr)
GRanges object with 6 ranges and 0 metadata columns:
seqnames ranges strand
<Rle> <IRanges> <Rle>
chr8_130846725_130846729 chr8 130846725-130846729 -
chr5_136222485_136222489 chr5 136222485-136222489 -
chr21_17034887_17034891 chr21 17034887-17034891 -
chrX_122209123_122209127 chrX 122209123-122209127 -
chr20_27691322_27691326 chr20 27691322-27691326 +
chr21_10253649_10253653 chr21 10253649-10253653 +
-------
seqinfo: 24 sequences from an unspecified genome; no seqlengths
The countOverlaps
function will let us know the number
of overlaps between random regions and TSSs.
table(countOverlaps(my_random$gr, my_tss$gr))
0 1
62670 4
Remove overlapping random sites.
my_random$loci_no <- my_random$loci[countOverlaps(my_random$gr, my_tss$gr) == 0, ]
dim(my_random$loci_no)
[1] 62670 4
We will also check whether TSS regions overlap each other.
table(countOverlaps(my_tss$gr, my_tss$gr, minoverlap = 1))
1 2 3 4 5 6 7 8 9 10 11 12 13
25962 11653 6071 4042 2700 1935 1499 1104 1006 834 652 577 514
14 15 16 17 18 19 20 21 22 23 24 25 26
447 296 414 371 255 268 238 188 155 97 120 50 70
27 28 29 30 31 32 33 34 35 36 38 43 44
62 140 58 60 31 64 99 34 70 108 76 43 44
45 46 50 54 72
45 46 50 54 72
We will also create a set of non-overlapping TSS regions (the
conditional is == 1
because we are comparing regions
against itself and 1 is the self overlap), where a TSS does not overlap
by a single bp.
my_tss$loci_no <- my_tss$loci[countOverlaps(my_tss$gr, my_tss$gr, minoverlap = 1) == 1, ]
dim(my_tss$loci_no)
[1] 25962 7
At this point, we have our TSS and random coordinates but we are
missing the nucleotide sequences that correspond to the defined
coordinates. The getSeq
function can be used to fetch the
sequence.
my_random$seq <- getSeq(
BSgenome.Hsapiens.UCSC.hg38,
names = my_random$loci$chr,
start = my_random$loci$start,
end = my_random$loci$end,
strand = my_random$loci$strand
)
head(my_random$seq)
DNAStringSet object of length 6:
width seq names
[1] 5 AAGGA chr8
[2] 5 ATGTG chr5
[3] 5 TAGTG chr21
[4] 5 AGATG chrX
[5] 5 CAGAA chr20
[6] 5 NNNNN chr21
Remove entries with N’s.
my_random$seq <- my_random$seq[! grepl("N", my_random$seq)]
head(my_random$seq)
DNAStringSet object of length 6:
width seq names
[1] 5 AAGGA chr8
[2] 5 ATGTG chr5
[3] 5 TAGTG chr21
[4] 5 AGATG chrX
[5] 5 CAGAA chr20
[6] 5 TCCCT chr13
Obtain TSS sequences.
my_tss$seq <- getSeq(
BSgenome.Hsapiens.UCSC.hg38,
names = my_tss$loci$chromosome_name,
start = my_tss$loci$tss_start,
end = my_tss$loci$tss_end,
strand = my_tss$loci$strand
)
my_tss$seq <- my_tss$seq[! grepl("N", my_tss$seq)]
head(my_tss$seq)
DNAStringSet object of length 6:
width seq names
[1] 5 GCACT chr8
[2] 5 TCATT chr5
[3] 5 GGGTC chr21
[4] 5 CAGGG chrX
[5] 5 TCACG chr20
[6] 5 TCGCC chr21
Obtain non-overlapping TSS sequences.
my_tss$seq_no <- getSeq(
BSgenome.Hsapiens.UCSC.hg38,
names = my_tss$loci_no$chromosome_name,
start = my_tss$loci_no$tss_start,
end = my_tss$loci_no$tss_end,
strand = my_tss$loci_no$strand
)
my_tss$seq_no <- my_tss$seq_no[! grepl("N", my_tss$seq_no)]
head(my_tss$seq_no)
DNAStringSet object of length 6:
width seq names
[1] 5 GCACT chr8
[2] 5 TCATT chr5
[3] 5 CAGGG chrX
[4] 5 TCACG chr20
[5] 5 TCGCC chr21
[6] 5 TTCTT chr13
The dinucleotideFrequency
function can be used to
calculate the di-nucleotide frequency given a sequence string.
my_random$di <- dinucleotideFrequency(my_random$seq)
colSums(my_random$di)
AA AC AG AT CA CC CG CT GA GC GG GT TA
23283 12211 16924 18043 17744 12813 2661 16981 14501 10691 12848 12093 15212
TC TG TT
14454 17551 23314
my_tss$di <- dinucleotideFrequency(my_tss$seq)
colSums(my_tss$di)
AA AC AG AT CA CC CG CT GA GC GG GT TA
8572 11739 20484 8622 24740 21756 16399 15887 16045 24409 19844 14664 6980
TC TG TT
15055 14683 10817
my_tss$di_no <- dinucleotideFrequency(my_tss$seq_no)
colSums(my_tss$di_no)
AA AC AG AT CA CC CG CT GA GC GG GT TA
3703 4981 8592 3758 10473 8977 6313 6640 6704 10003 8061 5764 2884
TC TG TT
6443 6151 4401
Plot di-nucleotide frequencies between TSS and random regions.
my_df <- data.frame(
x = colMeans(my_tss$di),
y = colMeans(my_random$di),
di = colnames(my_tss$di)
)
ggplot(my_df, aes(x, y, label = di)) +
geom_point() +
geom_text(nudge_x = 0.005) +
geom_abline(slope = 1, lty = 3) +
labs(x = "TSS", y = "Random") +
theme_bw()
Plot di-nucleotide frequencies between overlapping and non-overlapping TSS regions.
my_df <- data.frame(
x = colMeans(my_tss$di),
y = colMeans(my_tss$di_no),
di = colnames(my_tss$di)
)
ggplot(my_df, aes(x, y, label = di)) +
geom_point() +
geom_text(nudge_x = 0.005) +
geom_abline(slope = 1, lty = 3) +
labs(x = "TSS", y = "TSS no overlap") +
theme_bw()
Version | Author | Date |
---|---|---|
9ec0dec | Dave Tang | 2022-04-26 |
A Random Forest classifier generally performs well and we will use it to train a model to predict whether a sequence is a TSS or not based on di-nucleotide frequencies.
load_package('randomForest', source = "cran")
Combine TSS and random data. (We saw that overlapping and non-overlapping TSSs had similar di-nucleotide frequencies, so we will use the overlapping set.)
as.data.frame(my_tss$di) %>%
mutate(class = 'tss') -> my_data
as.data.frame(my_random$di) %>%
mutate(class = 'random') %>%
rbind(my_data) %>%
mutate(class = factor(class)) -> my_data
dim(my_data)
[1] 123005 17
Check out my_data
.
my_data[c(1:3, (nrow(my_data)-2):(nrow(my_data))), ]
AA AC AG AT CA CC CG CT GA GC GG GT TA TC TG TT class
1 1 0 1 0 0 0 0 0 1 0 1 0 0 0 0 0 random
2 0 0 0 1 0 0 0 0 0 0 0 1 0 0 2 0 random
3 0 0 1 0 0 0 0 0 0 0 0 1 1 0 1 0 random
123003 0 0 1 1 0 0 0 0 1 0 0 0 0 0 1 0 tss
123004 0 0 0 0 0 1 0 1 0 0 0 0 0 1 0 1 tss
123005 0 1 0 0 0 1 0 1 0 0 0 0 0 0 0 1 tss
Split 80/20.
set.seed(1984)
idx <- sample(nrow(my_data), nrow(my_data)*0.8)
train <- my_data[idx, ]
test <- my_data[-idx, ]
prop.table(table(train$class))
random tss
0.4904577 0.5095423
prop.table(table(test$class))
random tss
0.4905492 0.5094508
Train Random Forest.
my_rf <- randomForest(
class ~ .,
data = train,
importance = TRUE,
do.trace = 100,
ntree = 500
)
ntree OOB 1 2
100: 27.13% 26.86% 27.38%
200: 27.04% 26.68% 27.38%
300: 27.03% 26.65% 27.39%
400: 26.95% 26.69% 27.21%
500: 26.95% 26.68% 27.20%
my_rf
Call:
randomForest(formula = class ~ ., data = train, importance = TRUE, do.trace = 100, ntree = 500)
Type of random forest: classification
Number of trees: 500
No. of variables tried at each split: 4
OOB estimate of error rate: 26.95%
Confusion matrix:
random tss class.error
random 35385 12878 0.2668297
tss 13640 36501 0.2720329
Train using more features.
my_rf_6 <- randomForest(
class ~ .,
data = train,
importance = TRUE,
do.trace = 100,
ntree = 500,
mtry = 6
)
ntree OOB 1 2
100: 27.03% 27.31% 26.75%
200: 27.09% 27.74% 26.46%
300: 27.06% 27.79% 26.35%
400: 27.02% 27.81% 26.25%
500: 27.09% 27.91% 26.30%
my_rf_6
Call:
randomForest(formula = class ~ ., data = train, importance = TRUE, do.trace = 100, ntree = 500, mtry = 6)
Type of random forest: classification
Number of trees: 500
No. of variables tried at each split: 6
OOB estimate of error rate: 27.09%
Confusion matrix:
random tss class.error
random 34794 13469 0.2790751
tss 13185 36956 0.2629585
Feature importance.
rn <- round(importance(my_rf), 2)
rn[order(rn[,3], decreasing=TRUE),]
random tss MeanDecreaseAccuracy MeanDecreaseGini
AA 35.80 64.89 83.54 1499.41
CG 58.97 57.37 77.89 2750.53
CA -42.96 72.08 76.85 832.48
AT -7.48 68.13 74.31 886.97
AC -2.52 52.71 73.42 447.21
GG -7.25 58.49 72.85 494.45
TG 8.80 50.53 70.33 535.25
CT 21.70 42.34 68.66 504.98
AG -20.31 52.35 62.32 554.21
GT 9.17 45.51 61.79 421.48
GA -22.80 53.08 61.59 400.75
TT 13.61 51.23 58.04 1061.71
CC -0.12 53.14 57.82 676.20
TA -18.31 54.24 53.68 599.27
GC 27.85 43.24 53.18 873.14
TC 15.60 34.67 52.01 416.18
Plot feature importance.
varImpPlot(my_rf)
Use our Random Forest classifier to make predictions on the test data.
data.predict <- predict(my_rf, test)
prop.table(table(observed = test$class, predict = data.predict), 1)
predict
observed random tss
random 0.7365761 0.2634239
tss 0.2675337 0.7324663
Use pROC
package to calculate the area under the ROC
curve.
load_package("pROC", source = "cran")
Area under the ROC curve.
roc(my_rf$y, my_rf$votes[, 'random'])
Setting levels: control = random, case = tss
Setting direction: controls > cases
Call:
roc.default(response = my_rf$y, predictor = my_rf$votes[, "random"])
Data: my_rf$votes[, "random"] in 48263 controls (my_rf$y random) > 50141 cases (my_rf$y tss).
Area under the curve: 0.7908
Plot ROC with the verification
package.
load_package('verification', source = "cran")
Plot ROC.
aucc <- roc.area(as.integer(train$class=='tss'), my_rf$votes[,2])$A
roc.plot(as.integer(train$class=='tss'), my_rf$votes[,2], main="")
Warning in roc.plot.default(as.integer(train$class == "tss"), my_rf$votes[, :
Large amount of unique predictions used as thresholds. Consider specifying
thresholds.
legend("bottomright", bty="n", sprintf("Area Under the Curve (AUC) = %1.3f", aucc))
title(main="OOB ROC Curve Random Forest for predicting TSS")
The di-nucleotide frequencies were useful in predicting TSSs but we lost the positional information of each di-nucleotide and the relative position of a di-nucleotide may be important.
To confirm this, we can use the nucleotideFrequencyAt
function to return the single nucleotide frequency at a specific
location.
nucleotideFrequencyAt(x = my_tss$seq, at = 3)
A C G T
26030 9426 24056 3162
We will use sapply
to calculate the nucleotide
frequencies at all sites.
nuc_freq <- sapply(1:((params$span*2)+1), function(x) nucleotideFrequencyAt(my_tss$seq, x))
nuc_freq_norm <- apply(nuc_freq, 2, function(x) x/sum(x))
colnames(nuc_freq_norm) <- c('m2', 'm1', 't', 'p1', 'p2')
nuc_freq_norm
m2 m1 t p1 p2
A 0.1252194 0.09062769 0.41532374 0.1573061 0.2356320
C 0.3273925 0.51676931 0.15039729 0.2624533 0.2344832
G 0.3261161 0.15159396 0.38382742 0.3345247 0.2694419
T 0.2212720 0.24100903 0.05045154 0.2457159 0.2604429
There is a higher CT (pyrimidine) ratio one position before the TSS
(m1
[minus 1] column) and a higher AG (purine) ratio at the
TSS (t
column) forming the classic pyrimidine–purine (PyPu)
di-nucleotide at a TSS.
Given this positional bias, we will calculate di-nucleotide
frequencies at all positions using the
dinucleotideFrequency
function.
dinuc_freq_tss <- lapply(1:(params$span*2), function(x){
dinucleotideFrequency(subseq(my_tss$seq, x, x+1))
})
dinuc_freq_random <- lapply(1:(params$span*2), function(x){
dinucleotideFrequency(subseq(my_random$seq, x, x+1))
})
dinuc_freq_tss_df <- do.call(cbind.data.frame, dinuc_freq_tss)
dinuc_freq_random_df <- do.call(cbind.data.frame, dinuc_freq_random)
my_colname <- paste0(
'X',
rep(1:(params$span*2), each = 16),
"_",
colnames(dinuc_freq_tss_df)[1:16]
)
colnames(dinuc_freq_tss_df) <- my_colname
colnames(dinuc_freq_random_df) <- my_colname
head(dinuc_freq_tss_df)
X1_AA X1_AC X1_AG X1_AT X1_CA X1_CC X1_CG X1_CT X1_GA X1_GC X1_GG X1_GT X1_TA
1 0 0 0 0 0 0 0 0 0 1 0 0 0
2 0 0 0 0 0 0 0 0 0 0 0 0 0
3 0 0 0 0 0 0 0 0 0 0 1 0 0
4 0 0 0 0 1 0 0 0 0 0 0 0 0
5 0 0 0 0 0 0 0 0 0 0 0 0 0
6 0 0 0 0 0 0 0 0 0 0 0 0 0
X1_TC X1_TG X1_TT X2_AA X2_AC X2_AG X2_AT X2_CA X2_CC X2_CG X2_CT X2_GA X2_GC
1 0 0 0 0 0 0 0 1 0 0 0 0 0
2 1 0 0 0 0 0 0 1 0 0 0 0 0
3 0 0 0 0 0 0 0 0 0 0 0 0 0
4 0 0 0 0 0 1 0 0 0 0 0 0 0
5 1 0 0 0 0 0 0 1 0 0 0 0 0
6 1 0 0 0 0 0 0 0 0 1 0 0 0
X2_GG X2_GT X2_TA X2_TC X2_TG X2_TT X3_AA X3_AC X3_AG X3_AT X3_CA X3_CC X3_CG
1 0 0 0 0 0 0 0 1 0 0 0 0 0
2 0 0 0 0 0 0 0 0 0 1 0 0 0
3 1 0 0 0 0 0 0 0 0 0 0 0 0
4 0 0 0 0 0 0 0 0 0 0 0 0 0
5 0 0 0 0 0 0 0 1 0 0 0 0 0
6 0 0 0 0 0 0 0 0 0 0 0 0 0
X3_CT X3_GA X3_GC X3_GG X3_GT X3_TA X3_TC X3_TG X3_TT X4_AA X4_AC X4_AG X4_AT
1 0 0 0 0 0 0 0 0 0 0 0 0 0
2 0 0 0 0 0 0 0 0 0 0 0 0 0
3 0 0 0 0 1 0 0 0 0 0 0 0 0
4 0 0 0 1 0 0 0 0 0 0 0 0 0
5 0 0 0 0 0 0 0 0 0 0 0 0 0
6 0 0 1 0 0 0 0 0 0 0 0 0 0
X4_CA X4_CC X4_CG X4_CT X4_GA X4_GC X4_GG X4_GT X4_TA X4_TC X4_TG X4_TT
1 0 0 0 1 0 0 0 0 0 0 0 0
2 0 0 0 0 0 0 0 0 0 0 0 1
3 0 0 0 0 0 0 0 0 0 1 0 0
4 0 0 0 0 0 0 1 0 0 0 0 0
5 0 0 1 0 0 0 0 0 0 0 0 0
6 0 1 0 0 0 0 0 0 0 0 0 0
Combine TSS and random data.
dinuc_freq_tss_df %>%
mutate(class = 'tss') -> my_data2
dinuc_freq_random_df %>%
mutate(class = 'random') %>%
rbind(my_data2) %>%
mutate(class = factor(class)) -> my_data2
saveRDS(object = my_data2, file = "data/tss.rds")
dim(my_data2)
[1] 123005 65
Split 80/20.
set.seed(1984)
idx <- sample(nrow(my_data2), nrow(my_data2)*0.8)
train2 <- my_data2[idx, ]
test2 <- my_data2[-idx, ]
dim(train2)
[1] 98404 65
dim(test2)
[1] 24601 65
Train Random Forest.
my_rf2 <- randomForest(
class ~ .,
data = train2,
importance = TRUE,
do.trace = 100,
ntree = 500
)
ntree OOB 1 2
100: 23.61% 25.44% 21.85%
200: 23.57% 25.23% 21.97%
300: 23.57% 25.17% 22.03%
400: 23.59% 25.20% 22.05%
500: 23.59% 25.11% 22.12%
my_rf2
Call:
randomForest(formula = class ~ ., data = train2, importance = TRUE, do.trace = 100, ntree = 500)
Type of random forest: classification
Number of trees: 500
No. of variables tried at each split: 8
OOB estimate of error rate: 23.59%
Confusion matrix:
random tss class.error
random 36143 12120 0.2511240
tss 11091 39050 0.2211962
Feature importance.
rn2 <- round(importance(my_rf2), 2)
head(rn2[order(rn2[,3], decreasing=TRUE),])
random tss MeanDecreaseAccuracy MeanDecreaseGini
X4_CG 27.77 41.79 47.49 308.70
X3_CA 35.17 16.16 38.48 424.26
X1_AT 41.21 10.50 36.20 345.93
X2_CG 38.74 19.75 33.97 1943.38
X2_CA 29.40 20.47 33.35 1830.46
X4_TG 0.24 27.13 31.65 112.98
Feature importance plot.
varImpPlot(my_rf2)
Version | Author | Date |
---|---|---|
9ec0dec | Dave Tang | 2022-04-26 |
Compare prediction results of our first model and our second model.
# results from first model
data.predict <- predict(my_rf, test, type = "vote")
auroc <- roc.area(as.integer(test$class=='tss'), data.predict[,2])$A
roc.plot(as.integer(test$class=='tss'), data.predict[,2], main="ROC")
legend("bottomright", bty="n", sprintf("Area Under the Curve (AUC) = %1.3f", auroc))
# results from second model
data.predict2 <- predict(my_rf2, test2, type = "vote")
auroc2 <- roc.area(as.integer(test2$class=='tss'), data.predict2[,2])$A
roc.plot(as.integer(test2$class=='tss'), data.predict2[,2], main="ROC")
legend("bottomright", bty="n", sprintf("Area Under the Curve (AUC) = %1.3f", auroc2))
By including the positional information of the di-nucleotides, we have a better TSS classifier, which makes sense since not all the bases upstream and downstream of the TSS have a nucleotide preference and thus were diluting the signal in our first model where we calculated the overall di-nucleotide frequency.
sessionInfo()
R version 4.2.0 (2022-04-22)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 20.04.4 LTS
Matrix products: default
BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/liblapack.so.3
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] stats4 stats graphics grDevices utils datasets methods
[8] base
other attached packages:
[1] verification_1.42 dtw_1.23-1
[3] proxy_0.4-27 CircStats_0.2-6
[5] MASS_7.3-58.1 boot_1.3-28
[7] fields_14.1 viridis_0.6.2
[9] viridisLite_0.4.1 spam_2.9-1
[11] pROC_1.18.0 randomForest_4.7-1.1
[13] BSgenome.Hsapiens.UCSC.hg38_1.4.4 BSgenome_1.66.1
[15] rtracklayer_1.58.0 Biostrings_2.66.0
[17] XVector_0.38.0 GenomicRanges_1.50.1
[19] GenomeInfoDb_1.34.2 IRanges_2.32.0
[21] S4Vectors_0.36.0 BiocGenerics_0.44.0
[23] biomaRt_2.54.0 forcats_0.5.2
[25] stringr_1.4.1 dplyr_1.0.10
[27] purrr_0.3.5 readr_2.1.3
[29] tidyr_1.2.1 tibble_3.1.8
[31] ggplot2_3.4.0 tidyverse_1.3.2
[33] workflowr_1.7.0
loaded via a namespace (and not attached):
[1] readxl_1.4.1 backports_1.4.1
[3] BiocFileCache_2.6.0 plyr_1.8.7
[5] BiocParallel_1.32.0 digest_0.6.30
[7] htmltools_0.5.3 fansi_1.0.3
[9] magrittr_2.0.3 memoise_2.0.1
[11] googlesheets4_1.0.1 tzdb_0.3.0
[13] modelr_0.1.9 matrixStats_0.62.0
[15] timechange_0.1.1 prettyunits_1.1.1
[17] colorspace_2.0-3 blob_1.2.3
[19] rvest_1.0.3 rappdirs_0.3.3
[21] haven_2.5.1 xfun_0.34
[23] callr_3.7.3 crayon_1.5.2
[25] RCurl_1.98-1.9 jsonlite_1.8.3
[27] glue_1.6.2 gtable_0.3.1
[29] gargle_1.2.1 zlibbioc_1.44.0
[31] DelayedArray_0.24.0 maps_3.4.1
[33] scales_1.2.1 DBI_1.1.3
[35] Rcpp_1.0.9 progress_1.2.2
[37] bit_4.0.4 dotCall64_1.0-2
[39] httr_1.4.4 ellipsis_0.3.2
[41] pkgconfig_2.0.3 XML_3.99-0.12
[43] farver_2.1.1 sass_0.4.2
[45] dbplyr_2.2.1 utf8_1.2.2
[47] tidyselect_1.2.0 labeling_0.4.2
[49] rlang_1.0.6 later_1.3.0
[51] AnnotationDbi_1.60.0 munsell_0.5.0
[53] cellranger_1.1.0 tools_4.2.0
[55] cachem_1.0.6 cli_3.4.1
[57] generics_0.1.3 RSQLite_2.2.18
[59] broom_1.0.1 evaluate_0.17
[61] fastmap_1.1.0 yaml_2.3.6
[63] processx_3.8.0 knitr_1.40
[65] bit64_4.0.5 fs_1.5.2
[67] KEGGREST_1.38.0 whisker_0.4
[69] xml2_1.3.3 compiler_4.2.0
[71] rstudioapi_0.14 filelock_1.0.2
[73] curl_4.3.3 png_0.1-7
[75] reprex_2.0.2 bslib_0.4.1
[77] stringi_1.7.8 highr_0.9
[79] ps_1.7.2 lattice_0.20-45
[81] Matrix_1.5-1 vctrs_0.5.0
[83] pillar_1.8.1 lifecycle_1.0.3
[85] jquerylib_0.1.4 bitops_1.0-7
[87] httpuv_1.6.6 R6_2.5.1
[89] BiocIO_1.8.0 promises_1.2.0.1
[91] gridExtra_2.3 codetools_0.2-18
[93] assertthat_0.2.1 SummarizedExperiment_1.28.0
[95] rprojroot_2.0.3 rjson_0.2.21
[97] withr_2.5.0 GenomicAlignments_1.34.0
[99] Rsamtools_2.14.0 GenomeInfoDbData_1.2.9
[101] parallel_4.2.0 hms_1.1.2
[103] grid_4.2.0 rmarkdown_2.17
[105] MatrixGenerics_1.10.0 googledrive_2.0.0
[107] git2r_0.30.1 getPass_0.2-2
[109] Biobase_2.58.0 lubridate_1.9.0
[111] restfulr_0.0.15