Sequence composition and Random Forests

Last updated: 2022-04-26

Checks: 7 0

Knit directory: muse/

This reproducible R Markdown analysis was created with workflowr (version 1.7.0). The Checks tab describes the reproducibility checks that were applied when the results were created. The Past versions tab lists the development history.

R Markdown file: up-to-date

Great! Since the R Markdown file has been committed to the Git repository, you know the exact version of the code that produced these results.

Environment: empty

Great job! The global environment was empty. Objects defined in the global environment can affect the analysis in your R Markdown file in unknown ways. For reproduciblity it’s best to always run the code in an empty environment.

Seed: set.seed(20200712)

The command set.seed(20200712) was run prior to running the code in the R Markdown file. Setting a seed ensures that any results that rely on randomness, e.g. subsampling or permutations, are reproducible.

Session information: recorded

Great job! Recording the operating system, R version, and package versions is critical for reproducibility.

Cache: none

Nice! There were no cached chunks for this analysis, so you can be confident that you successfully produced the results during this run.

File paths: relative

Great job! Using relative paths to the files within your workflowr project makes it easier to run your code on other machines.

Repository version: 56b1a79

Great! You are using Git for version control. Tracking code development and connecting the code version to the results is critical for reproducibility.

The results in this page were generated with repository version 56b1a79. See the Past versions tab to see a history of the changes made to the R Markdown and HTML files.

Note that you need to be careful to ensure that all relevant files for the analysis have been committed to Git prior to generating the results (you can use wflow_publish or wflow_git_commit). workflowr only checks the R Markdown file, but you know if there are other scripts or data files that it depends on. Below is the status of the Git repository when the results were generated:


Ignored files:
    Ignored:    .Rhistory
    Ignored:    .Rproj.user/
    Ignored:    r_packages_4.1.2/

Untracked files:
    Untracked:  analysis/cell_ranger.Rmd

Unstaged changes:
    Modified:   .Rprofile

Note that any generated files, e.g. HTML, png, CSS, etc., are not included in this status report because it is ok for generated content to have uncommitted changes.

These are the previous versions of the repository in which changes were made to the R Markdown (analysis/tss_predict.Rmd) and HTML (docs/tss_predict.html) files. If you’ve configured a remote Git repository (see ?wflow_git_remote), click on the hyperlinks in the table below to view the files as they were in that past version.

File	Version	Author	Date	Message
Rmd	56b1a79	Dave Tang	2022-04-26	Positional di-nuc freq
html	fd116a1	Dave Tang	2022-04-24	Build site.
Rmd	0616464	Dave Tang	2022-04-24	Random forest model
html	fd41c75	Dave Tang	2022-04-22	Build site.
Rmd	e372c81	Dave Tang	2022-04-22	TSS

Introduction

Transcriptional starting sites (TSSs) demarcate the first position in the DNA sequence that gets transcribed into RNA. In this work, we will try to build a classifier to try to predict TSSs. We will train our classifier using the NCBI Reference Sequence Database also known as RefSeq.

Packages

Function for loading required Bioconductor packages (and install them first, if missing).

load_package <- function(x, source = "bioc"){
  if(!require(x, character.only = TRUE, quietly = TRUE)){
    if (source == "bioc"){
      BiocManager::install(x, character.only = TRUE)
    } else if (source == "cran"){
      install.packages(x, character.only = TRUE)
    } else {
      stop("Unrecognised source")
    }
    library(x, character.only = TRUE)
  }
}

RefSeq

We will use the Bioconductor package biomaRt to download the entire collection of RefSeq sequences.

load_package('biomaRt')

Find the human gene set.

ensembl <- useMart('ensembl', host = params$host)
biomaRt::listDatasets(ensembl) %>%
  filter(grepl('human', description, TRUE))

                dataset              description    version
1 hsapiens_gene_ensembl Human genes (GRCh38.p13) GRCh38.p13

Find RefSeq attributes.

ensembl <- useMart('ensembl', dataset = 'hsapiens_gene_ensembl', host = params$host)
listAttributes(ensembl) %>%
  filter(grepl('refseq', description, TRUE))

                           name                                  description
1        transcript_mane_select        RefSeq match transcript (MANE Select)
2 transcript_mane_plus_clinical RefSeq match transcript (MANE Plus Clinical)
3                   refseq_mrna                               RefSeq mRNA ID
4         refseq_mrna_predicted                     RefSeq mRNA predicted ID
5                  refseq_ncrna                              RefSeq ncRNA ID
6        refseq_ncrna_predicted                    RefSeq ncRNA predicted ID
7                refseq_peptide                            RefSeq peptide ID
8      refseq_peptide_predicted                  RefSeq peptide predicted ID
          page
1 feature_page
2 feature_page
3 feature_page
4 feature_page
5 feature_page
6 feature_page
7 feature_page
8 feature_page

Fetch all RefSeq mRNAs on assembled chromosomes.

my_chr <- c(1:22, 'X', 'Y')

my_refseq <- getBM(
  attributes='refseq_mrna',
  filters = 'chromosome_name',
  values = my_chr,
  mart = ensembl
)

Number of RefSeq IDs.

dim(my_refseq)

[1] 61953     1

Build table. Note that when working with transcripts, use the attributes transcript_start and transcript_end, and not the attributes start_position and end_position; using those will give you the Ensembl gene coordinates and not the RefSeq coordinates, which is what we want! (I have retrieved the Ensembl coordinates for illustrative purposes.)

my_att <- c(
  'refseq_mrna',
  'chromosome_name',
  'transcript_start',
  'transcript_end',
  'start_position',
  'end_position',
  'strand'
)

my_refseq_loci <- getBM(
  attributes = my_att,
  filters = c('refseq_mrna', 'chromosome_name'),
  values = list(refseq_mrna = my_refseq$refseq_mrna, chromosome_name = my_chr),
  mart = ensembl
)

Check out the table.

head(my_refseq_loci)

  refseq_mrna chromosome_name transcript_start transcript_end start_position
1   NM_000015               8         18391282       18401218       18391282
2   NM_000021              14         73136507       73223691       73136418
3   NM_000053              13         51932669       52011450       51930436
4   NM_000059              13         32315508       32400268       32315086
5   NM_000071              21         43053191       43075835       43053191
6   NM_000084               X         50067576       50092406       49922596
  end_position strand
1     18401218      1
2     73223691      1
3     52012125     -1
4     32400268      1
5     43076943     -1
6     50099235      1

Check for duplicated entries.

table(duplicated(my_refseq_loci$refseq_mrna))


FALSE  TRUE 
61953    25

Removed duplicated entries.

my_refseq_loci_uniq <- my_refseq_loci[!duplicated(my_refseq_loci$refseq_mrna),]
dim(my_refseq_loci_uniq)

[1] 61953     7

We will modify chromosome_name and strand to make it compatible with BSgenome.Hsapiens.UCSC.hg38.

my_refseq_loci_uniq %>%
  mutate(strand = ifelse(strand == 1, yes = '+', no = '-')) %>%
  mutate(chromosome_name = sub("^", "chr", chromosome_name)) -> my_refseq_loci_uniq

head(my_refseq_loci_uniq)

  refseq_mrna chromosome_name transcript_start transcript_end start_position
1   NM_000015            chr8         18391282       18401218       18391282
2   NM_000021           chr14         73136507       73223691       73136418
3   NM_000053           chr13         51932669       52011450       51930436
4   NM_000059           chr13         32315508       32400268       32315086
5   NM_000071           chr21         43053191       43075835       43053191
6   NM_000084            chrX         50067576       50092406       49922596
  end_position strand
1     18401218      +
2     73223691      +
3     52012125      -
4     32400268      +
5     43076943      -
6     50099235      +

Transcription Start Site

The TSS is transcript_start when the transcript is on the + strand and is transcript_end on the - strand. We want to retrieve sequence upstream and downstream of the TSS, so we will subtract and add coordinates accordingly.

my_tss <- list()

my_tss$loci <- my_refseq_loci_uniq %>%
  dplyr::select(-c(start_position, end_position)) %>%
  mutate(tss_start = if_else(strand == "+", transcript_start - params$span, transcript_end - params$span)) %>%
  mutate(tss_end = if_else(strand == "+", transcript_start + params$span, transcript_end + params$span))

head(my_tss$loci)

  refseq_mrna chromosome_name transcript_start transcript_end strand tss_start
1   NM_000015            chr8         18391282       18401218      +  18391280
2   NM_000021           chr14         73136507       73223691      +  73136505
3   NM_000053           chr13         51932669       52011450      -  52011448
4   NM_000059           chr13         32315508       32400268      +  32315506
5   NM_000071           chr21         43053191       43075835      -  43075833
6   NM_000084            chrX         50067576       50092406      +  50067574
   tss_end
1 18391284
2 73136509
3 52011452
4 32315510
5 43075837
6 50067578

Random loci

Sample a bunch of random sequences in hg38.

load_package('BSgenome.Hsapiens.UCSC.hg38')

Save chromosomes and their sizes.

hg38_ref <- seqnames(BSgenome.Hsapiens.UCSC.hg38)
hg38_chr <- hg38_ref[!grepl("_", hg38_ref)]
hg38_chr <- hg38_chr[!grepl("chrM", hg38_chr)]

hg38_chr_size <- sapply(hg38_chr, function(x){
  length(BSgenome.Hsapiens.UCSC.hg38[[x]])
})

head(hg38_chr_size)

     chr1      chr2      chr3      chr4      chr5      chr6 
248956422 242193529 198295559 190214555 181538259 170805979

Sample chromosome based on length of chromosome.

set.seed(1984)
n <- nrow(my_tss$loci)
schr <- sample(
  hg38_chr,
  n,
  TRUE,
  prob = hg38_chr_size/sum(hg38_chr_size)
)

sstrand <- sample(c("+", "-"), n, TRUE)
sstart <- sapply(schr, function(x){
  sample(hg38_chr_size[x] - (params$span * 2), 1)
})
send <- sstart + (params$span*2)

my_random <- list()

my_random$loci <- data.frame(
  chr = schr,
  start = sstart,
  end = send,
  strand = sstrand
)

head(my_random$loci)

    chr     start       end strand
1 chr10 132635027 132635031      -
2  chr7 115323451 115323455      -
3  chr6  27058086  27058090      -
4  chr5 164216413 164216417      +
5 chr13   1446812   1446816      +
6 chr17  52786687  52786691      -

Removing overlaps

Since some of the randomly sampled regions may overlap/intersect with TSSs, I used the GenomicRanges package to remove these overlapping regions:

load_package("GenomicRanges")

Create a GRanges object given an object, my_refseq_loci

my_tss$gr <- with(
  my_tss$loci,
  GRanges(chromosome_name,
          IRanges(tss_start, tss_end, names = refseq_mrna),
          strand
  )
)

head(my_tss$gr)

GRanges object with 6 ranges and 0 metadata columns:
            seqnames            ranges strand
               <Rle>         <IRanges>  <Rle>
  NM_000015     chr8 18391280-18391284      +
  NM_000021    chr14 73136505-73136509      +
  NM_000053    chr13 52011448-52011452      -
  NM_000059    chr13 32315506-32315510      +
  NM_000071    chr21 43075833-43075837      -
  NM_000084     chrX 50067574-50067578      +
  -------
  seqinfo: 24 sequences from an unspecified genome; no seqlengths

Create random GRanges object.

my_random$gr <- with(
  my_random$loci,
  GRanges(chr,
          IRanges(start, end, names = paste(chr, start, end, sep = "_")),
          strand
  )
)

head(my_random$gr)

GRanges object with 6 ranges and 0 metadata columns:
                            seqnames              ranges strand
                               <Rle>           <IRanges>  <Rle>
  chr10_132635027_132635031    chr10 132635027-132635031      -
   chr7_115323451_115323455     chr7 115323451-115323455      -
     chr6_27058086_27058090     chr6   27058086-27058090      -
   chr5_164216413_164216417     chr5 164216413-164216417      +
      chr13_1446812_1446816    chr13     1446812-1446816      +
    chr17_52786687_52786691    chr17   52786687-52786691      -
  -------
  seqinfo: 24 sequences from an unspecified genome; no seqlengths

Overlap between random regions and TSS’s. Note that there are cases where a random region overlaps more than one TSS, which indicates that TSS regions are overlapping.

table(countOverlaps(my_random$gr, my_tss$gr))


    0     1     4     5 
61947     4     1     1

Remove overlapping random sites.

my_random$loci_no <- my_random$loci[countOverlaps(my_random$gr, my_tss$gr) == 0, ]
dim(my_random$loci_no)

[1] 61947     4

Check overlapping TSS regions.

table(countOverlaps(my_tss$gr, my_tss$gr, minoverlap = 1))


    1     2     3     4     5     6     7     8     9    10    11    12    13 
25851 11455  5969  3918  2691  1911  1455  1051  1015   834   663   557   475 
   14    15    16    17    18    19    20    21    22    23    24    25    26 
  430   258   363   394   280   258   220   209   155    92    96    75    96 
   27    28    29    30    31    32    33    35    36    38    43    44    45 
   62   168    87    30    62    64    66   105   108    76    43    44    45 
   46    50    54    72 
   46    50    54    72

Create a set of non-overlapping TSS regions (the conditional is == 1 because we are comparing regions against itself and 1 is the self overlap), where a TSS does not overlap by a single bp.

my_tss$loci_no <- my_tss$loci[countOverlaps(my_tss$gr, my_tss$gr, minoverlap = 1) == 1, ]
dim(my_tss$loci_no)

[1] 25851     7

Sequence

Now to fetch the sequences and calculate the dinucleotide frequencies:

my_random$seq <- getSeq(
  BSgenome.Hsapiens.UCSC.hg38,
  names = my_random$loci$chr,
  start = my_random$loci$start,
  end = my_random$loci$end,
  strand = my_random$loci$strand
)

head(my_random$seq)

DNAStringSet object of length 6:
    width seq                                               names               
[1]     5 CATCA                                             chr10
[2]     5 CTAGT                                             chr7
[3]     5 AGAAT                                             chr6
[4]     5 AATGT                                             chr5
[5]     5 NNNNN                                             chr13
[6]     5 TTACT                                             chr17

Remove entries with N’s.

my_random$seq <- my_random$seq[! grepl("N", my_random$seq)]
head(my_random$seq)

DNAStringSet object of length 6:
    width seq                                               names               
[1]     5 CATCA                                             chr10
[2]     5 CTAGT                                             chr7
[3]     5 AGAAT                                             chr6
[4]     5 AATGT                                             chr5
[5]     5 TTACT                                             chr17
[6]     5 CCTCC                                             chr1

Obtain TSS sequences.

my_tss$seq <- getSeq(
  BSgenome.Hsapiens.UCSC.hg38,
  names = my_tss$loci$chromosome_name,
  start = my_tss$loci$tss_start,
  end = my_tss$loci$tss_end,
  strand = my_tss$loci$strand
)

my_tss$seq <- my_tss$seq[! grepl("N", my_tss$seq)]
head(my_tss$seq)

DNAStringSet object of length 6:
    width seq                                               names               
[1]     5 GCACT                                             chr8
[2]     5 TCGGA                                             chr14
[3]     5 CCCTC                                             chr13
[4]     5 GCAGA                                             chr13
[5]     5 GGGTC                                             chr21
[6]     5 CAGGG                                             chrX

Obtain TSS sequences non-overlapping.

my_tss$seq_no <- getSeq(
  BSgenome.Hsapiens.UCSC.hg38,
  names = my_tss$loci_no$chromosome_name,
  start = my_tss$loci_no$tss_start,
  end = my_tss$loci_no$tss_end,
  strand = my_tss$loci_no$strand
)

my_tss$seq_no <- my_tss$seq_no[! grepl("N", my_tss$seq_no)]
head(my_tss$seq_no)

DNAStringSet object of length 6:
    width seq                                               names               
[1]     5 GCACT                                             chr8
[2]     5 TCGGA                                             chr14
[3]     5 CCCTC                                             chr13
[4]     5 GCAGA                                             chr13
[5]     5 CAGGG                                             chrX
[6]     5 TCACG                                             chr20

Calculate the di-nucleotide frequency.

my_random$di <- dinucleotideFrequency(my_random$seq)
colSums(my_random$di)

   AA    AC    AG    AT    CA    CC    CG    CT    GA    GC    GG    GT    TA 
23038 11832 16208 18247 16874 12151  2429 16470 13965 10067 12230 12207 15480 
   TC    TG    TT 
14141 17269 23132

my_tss$di <- dinucleotideFrequency(my_tss$seq)
colSums(my_tss$di)

   AA    AC    AG    AT    CA    CC    CG    CT    GA    GC    GG    GT    TA 
 8678 11634 20201  8509 24443 21320 16226 15711 15760 24256 19408 14537  6959 
   TC    TG    TT 
14903 14539 10728

my_tss$di_no <- dinucleotideFrequency(my_tss$seq_no)
colSums(my_tss$di_no)

   AA    AC    AG    AT    CA    CC    CG    CT    GA    GC    GG    GT    TA 
 3721  4956  8551  3729 10443  8911  6271  6625  6646  9977  8019  5738  2861 
   TC    TG    TT 
 6439  6118  4399

Plot di-nucleotide frequencies between TSS and random regions.

my_df <- data.frame(
  x = colMeans(my_tss$di),
  y = colMeans(my_random$di),
  di = colnames(my_tss$di)
)

ggplot(my_df, aes(x, y, label = di)) +
  geom_point() +
  geom_text(nudge_x = 0.005) +
  geom_abline(slope = 1, lty = 3) +
  labs(x = "TSS", y = "Random") +
  theme_bw()

Version	Author	Date
fd116a1	Dave Tang	2022-04-24

Plot di-nucleotide frequencies between overlapping and non-overlapping TSS regions.

my_df <- data.frame(
  x = colMeans(my_tss$di),
  y = colMeans(my_tss$di_no),
  di = colnames(my_tss$di)
)

ggplot(my_df, aes(x, y, label = di)) +
  geom_point() +
  geom_text(nudge_x = 0.005) +
  geom_abline(slope = 1, lty = 3) +
  labs(x = "TSS", y = "TSS no overlap") +
  theme_bw()

Random Forests

Use Random Forests to train a predictor using the TSSs and random di-nucleotide frequencies.

load_package('randomForest', source = "cran")

Combine TSS and random data.

as.data.frame(my_tss$di) %>%
  mutate(class = 'tss') -> my_data

as.data.frame(my_random$di) %>%
  mutate(class = 'random') %>%
  rbind(my_data) %>%
  mutate(class = factor(class)) -> my_data

dim(my_data)

[1] 120888     17

Split 80/20.

set.seed(1984)
idx <- sample(nrow(my_data), nrow(my_data)*0.8)

train <- my_data[idx, ]
test <- my_data[-idx, ]

dim(train)

[1] 96710    17

dim(test)

[1] 24178    17

Train Random Forest.

my_rf <- randomForest(
  class ~ .,
  data = train,
  importance = TRUE,
  do.trace = 100,
  ntree = 200
)

ntree      OOB      1      2
  100:  26.89% 26.77% 27.00%
  200:  26.85% 26.47% 27.22%

my_rf


Call:
 randomForest(formula = class ~ ., data = train, importance = TRUE,      do.trace = 100, ntree = 200) 
               Type of random forest: classification
                     Number of trees: 200
No. of variables tried at each split: 4

        OOB estimate of  error rate: 26.85%
Confusion matrix:
       random   tss class.error
random  34622 12462   0.2646759
tss     13508 36118   0.2721960

Feature importance.

rn <- round(importance(my_rf), 2)
rn[order(rn[,3], decreasing=TRUE),]

   random   tss MeanDecreaseAccuracy MeanDecreaseGini
AA  21.74 40.17                52.33          1505.47
AT  -1.87 44.72                48.11           939.64
GG  -3.66 35.79                47.02           478.09
CA -26.74 45.55                46.95           858.20
CG  34.59 34.79                46.47          2599.94
CT  15.54 27.60                45.84           488.97
AC  -0.84 34.24                45.43           442.64
GT   8.18 29.41                43.86           418.54
TG   5.15 32.84                39.78           545.91
AG -13.89 34.59                39.47           539.19
TT  10.58 31.98                39.28          1103.73
GA -15.40 36.52                38.76           394.84
TA -10.21 32.41                35.69           623.41
CC   2.66 33.37                34.07           699.46
TC  12.13 21.00                32.92           428.78
GC  18.90 25.14                31.12          1014.36

As a plot.

varImpPlot(my_rf)

Version	Author	Date
fd116a1	Dave Tang	2022-04-24

Predict.

data.predict <- predict(my_rf, test)
prop.table(table(observed = test$class, predict = data.predict), 1)

        predict
observed    random       tss
  random 0.7326808 0.2673192
  tss    0.2746005 0.7253995

Use pROC package to get area under curve.

load_package("pROC", source = "cran")

Check ROC.

roc(my_rf$y, my_rf$votes[, 'random'])

Setting levels: control = random, case = tss

Setting direction: controls > cases


Call:
roc.default(response = my_rf$y, predictor = my_rf$votes[, "random"])

Data: my_rf$votes[, "random"] in 47084 controls (my_rf$y random) > 49626 cases (my_rf$y tss).
Area under the curve: 0.7945

Plot ROC with the verification package.

load_package('verification', source = "cran")

Plot ROC.

aucc <- roc.area(as.integer(train$class=='tss'), my_rf$votes[,2])$A
roc.plot(as.integer(train$class=='tss'), my_rf$votes[,2], main="")

Warning in roc.plot.default(as.integer(train$class == "tss"), my_rf$votes[, :
Large amount of unique predictions used as thresholds. Consider specifying
thresholds.

legend("bottomright", bty="n", sprintf("Area Under the Curve (AUC) = %1.3f", aucc))
title(main="OOB ROC Curve Random Forest for predicting TSS")

Version	Author	Date
fd116a1	Dave Tang	2022-04-24

Nucleotide frequency across TSS

The nucleotideFrequencyAt function will return the single nucleotide frequency at a specific location.

nucleotideFrequencyAt(my_tss$seq, 3)

    A     C     G     T 
25775  9399 23652  3127

Use sapply to calculate at all sites.

There is a higher CT (pyrimidine) ratio one position before the TSS (second column) and a higher AG (purine) ratio at the TSS (third column) forming the classic pyrimidine–purine (PyPu) di-nucleotide at a TSS.

nuc_freq <- sapply(1:((params$span*2)+1), function(x) nucleotideFrequencyAt(my_tss$seq, x))
apply(nuc_freq, 2, function(x) x/sum(x))

       [,1]       [,2]       [,3]      [,4]      [,5]
A 0.1251755 0.09239262 0.41604119 0.1576679 0.2352267
C 0.3242781 0.51500331 0.15171178 0.2631834 0.2340968
G 0.3270221 0.15193776 0.38177328 0.3330912 0.2691234
T 0.2235243 0.24066631 0.05047375 0.2460575 0.2615531

Calculate di-nucleotide frequencies at all sites.

dinuc_freq_tss <- lapply(1:(params$span*2), function(x){
  dinucleotideFrequency(subseq(my_tss$seq, x, x+1))
})

dinuc_freq_random <- lapply(1:(params$span*2), function(x){
  dinucleotideFrequency(subseq(my_random$seq, x, x+1))
})

dinuc_freq_tss_df <- do.call(cbind.data.frame, dinuc_freq_tss)
dinuc_freq_random_df <- do.call(cbind.data.frame, dinuc_freq_random)

my_colname <- paste0(
  'X',
  rep(1:(params$span*2), each = 16),
  "_",
  colnames(dinuc_freq_tss_df)[1:16]
)

colnames(dinuc_freq_tss_df) <- my_colname
colnames(dinuc_freq_random_df) <- my_colname

head(dinuc_freq_tss_df)

  X1_AA X1_AC X1_AG X1_AT X1_CA X1_CC X1_CG X1_CT X1_GA X1_GC X1_GG X1_GT X1_TA
1     0     0     0     0     0     0     0     0     0     1     0     0     0
2     0     0     0     0     0     0     0     0     0     0     0     0     0
3     0     0     0     0     0     1     0     0     0     0     0     0     0
4     0     0     0     0     0     0     0     0     0     1     0     0     0
5     0     0     0     0     0     0     0     0     0     0     1     0     0
6     0     0     0     0     1     0     0     0     0     0     0     0     0
  X1_TC X1_TG X1_TT X2_AA X2_AC X2_AG X2_AT X2_CA X2_CC X2_CG X2_CT X2_GA X2_GC
1     0     0     0     0     0     0     0     1     0     0     0     0     0
2     1     0     0     0     0     0     0     0     0     1     0     0     0
3     0     0     0     0     0     0     0     0     1     0     0     0     0
4     0     0     0     0     0     0     0     1     0     0     0     0     0
5     0     0     0     0     0     0     0     0     0     0     0     0     0
6     0     0     0     0     0     1     0     0     0     0     0     0     0
  X2_GG X2_GT X2_TA X2_TC X2_TG X2_TT X3_AA X3_AC X3_AG X3_AT X3_CA X3_CC X3_CG
1     0     0     0     0     0     0     0     1     0     0     0     0     0
2     0     0     0     0     0     0     0     0     0     0     0     0     0
3     0     0     0     0     0     0     0     0     0     0     0     0     0
4     0     0     0     0     0     0     0     0     1     0     0     0     0
5     1     0     0     0     0     0     0     0     0     0     0     0     0
6     0     0     0     0     0     0     0     0     0     0     0     0     0
  X3_CT X3_GA X3_GC X3_GG X3_GT X3_TA X3_TC X3_TG X3_TT X4_AA X4_AC X4_AG X4_AT
1     0     0     0     0     0     0     0     0     0     0     0     0     0
2     0     0     0     1     0     0     0     0     0     0     0     0     0
3     1     0     0     0     0     0     0     0     0     0     0     0     0
4     0     0     0     0     0     0     0     0     0     0     0     0     0
5     0     0     0     0     1     0     0     0     0     0     0     0     0
6     0     0     0     1     0     0     0     0     0     0     0     0     0
  X4_CA X4_CC X4_CG X4_CT X4_GA X4_GC X4_GG X4_GT X4_TA X4_TC X4_TG X4_TT
1     0     0     0     1     0     0     0     0     0     0     0     0
2     0     0     0     0     1     0     0     0     0     0     0     0
3     0     0     0     0     0     0     0     0     0     1     0     0
4     0     0     0     0     1     0     0     0     0     0     0     0
5     0     0     0     0     0     0     0     0     0     1     0     0
6     0     0     0     0     0     0     1     0     0     0     0     0

Combine TSS and random data.

dinuc_freq_tss_df %>%
  mutate(class = 'tss') -> my_data2

dinuc_freq_random_df %>%
  mutate(class = 'random') %>%
  rbind(my_data2) %>%
  mutate(class = factor(class)) -> my_data2

dim(my_data2)

[1] 120888     65

Split 80/20.

set.seed(1984)
idx <- sample(nrow(my_data2), nrow(my_data2)*0.8)

train2 <- my_data2[idx, ]
test2 <- my_data2[-idx, ]

dim(train2)

[1] 96710    65

dim(test2)

[1] 24178    65

Train Random Forest.

my_rf2 <- randomForest(
  class ~ .,
  data = train2,
  importance = TRUE,
  do.trace = 100,
  ntree = 200
)

ntree      OOB      1      2
  100:  23.53% 25.82% 21.36%
  200:  23.56% 26.09% 21.16%

my_rf2


Call:
 randomForest(formula = class ~ ., data = train2, importance = TRUE,      do.trace = 100, ntree = 200) 
               Type of random forest: classification
                     Number of trees: 200
No. of variables tried at each split: 8

        OOB estimate of  error rate: 23.56%
Confusion matrix:
       random   tss class.error
random  34802 12282   0.2608529
tss     10501 39125   0.2116028

Feature importance.

rn2 <- round(importance(my_rf2), 2)
head(rn2[order(rn2[,3], decreasing=TRUE),])

      random   tss MeanDecreaseAccuracy MeanDecreaseGini
X4_CG  16.35 23.54                26.19           299.22
X1_AT  28.87  7.19                24.64           373.22
X2_CG  29.80 12.67                23.00          2091.22
X3_CA  23.48  7.72                22.45           323.41
X4_TA  18.52  9.46                20.97           142.35
X1_TG  19.69  5.41                20.80           328.67

As a plot.

varImpPlot(my_rf2)

Compare prediction results of our first model and our second model.

# results from first model
data.predict <- predict(my_rf, test)
prop.table(table(observed = test$class, predict = data.predict), 1)

        predict
observed    random       tss
  random 0.7326808 0.2673192
  tss    0.2747627 0.7252373

# results from second model
data.predict2 <- predict(my_rf2, test2)
prop.table(table(observed = test2$class, predict = data.predict2), 1)

        predict
observed    random       tss
  random 0.7383343 0.2616657
  tss    0.2103513 0.7896487

By including the positional information of the di-nucleotides, we have a better TSS classifier, which makes sense since not all the bases upstream and downstream of the TSS have a nucleotide preference and thus were diluting the signal in our first model where we calculated the overall di-nucleotide frequency.

sessionInfo()

R version 4.1.2 (2021-11-01)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 20.04.3 LTS

Matrix products: default
BLAS/LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.8.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats4    stats     graphics  grDevices utils     datasets  methods  
[8] base     

other attached packages:
 [1] verification_1.42                 dtw_1.22-3                       
 [3] proxy_0.4-26                      CircStats_0.2-6                  
 [5] MASS_7.3-54                       boot_1.3-28                      
 [7] fields_13.3                       viridis_0.6.2                    
 [9] viridisLite_0.4.0                 spam_2.8-0                       
[11] pROC_1.18.0                       randomForest_4.7-1               
[13] BSgenome.Hsapiens.UCSC.hg38_1.4.4 BSgenome_1.62.0                  
[15] rtracklayer_1.54.0                Biostrings_2.62.0                
[17] XVector_0.34.0                    GenomicRanges_1.46.1             
[19] GenomeInfoDb_1.30.1               IRanges_2.28.0                   
[21] S4Vectors_0.32.4                  BiocGenerics_0.40.0              
[23] biomaRt_2.50.3                    forcats_0.5.1                    
[25] stringr_1.4.0                     dplyr_1.0.7                      
[27] purrr_0.3.4                       readr_2.1.1                      
[29] tidyr_1.1.4                       tibble_3.1.6                     
[31] ggplot2_3.3.5                     tidyverse_1.3.1                  
[33] workflowr_1.7.0                  

loaded via a namespace (and not attached):
  [1] colorspace_2.0-2            rjson_0.2.21               
  [3] ellipsis_0.3.2              rprojroot_2.0.2            
  [5] fs_1.5.2                    rstudioapi_0.13            
  [7] farver_2.1.0                bit64_4.0.5                
  [9] AnnotationDbi_1.56.2        fansi_1.0.2                
 [11] lubridate_1.8.0             xml2_1.3.3                 
 [13] cachem_1.0.6                knitr_1.37                 
 [15] jsonlite_1.7.3              Rsamtools_2.10.0           
 [17] broom_0.7.11                dbplyr_2.1.1               
 [19] png_0.1-7                   compiler_4.1.2             
 [21] httr_1.4.2                  backports_1.4.1            
 [23] assertthat_0.2.1            Matrix_1.3-4               
 [25] fastmap_1.1.0               cli_3.1.1                  
 [27] later_1.3.0                 htmltools_0.5.2            
 [29] prettyunits_1.1.1           tools_4.1.2                
 [31] dotCall64_1.0-1             gtable_0.3.0               
 [33] glue_1.6.1                  GenomeInfoDbData_1.2.7     
 [35] maps_3.4.0                  rappdirs_0.3.3             
 [37] Rcpp_1.0.8                  Biobase_2.54.0             
 [39] cellranger_1.1.0            jquerylib_0.1.4            
 [41] vctrs_0.3.8                 xfun_0.29                  
 [43] ps_1.6.0                    rvest_1.0.2                
 [45] lifecycle_1.0.1             restfulr_0.0.13            
 [47] XML_3.99-0.9                getPass_0.2-2              
 [49] zlibbioc_1.40.0             scales_1.1.1               
 [51] hms_1.1.1                   promises_1.2.0.1           
 [53] MatrixGenerics_1.6.0        parallel_4.1.2             
 [55] SummarizedExperiment_1.24.0 yaml_2.2.2                 
 [57] curl_4.3.2                  gridExtra_2.3              
 [59] memoise_2.0.1               stringi_1.7.6              
 [61] RSQLite_2.2.12              highr_0.9                  
 [63] BiocIO_1.4.0                filelock_1.0.2             
 [65] BiocParallel_1.28.3         rlang_1.0.0                
 [67] pkgconfig_2.0.3             bitops_1.0-7               
 [69] matrixStats_0.62.0          evaluate_0.14              
 [71] lattice_0.20-45             labeling_0.4.2             
 [73] GenomicAlignments_1.30.0    bit_4.0.4                  
 [75] processx_3.5.2              tidyselect_1.1.1           
 [77] plyr_1.8.7                  magrittr_2.0.2             
 [79] R6_2.5.1                    generics_0.1.1             
 [81] DelayedArray_0.20.0         DBI_1.1.2                  
 [83] pillar_1.6.5                haven_2.4.3                
 [85] whisker_0.4                 withr_2.4.3                
 [87] KEGGREST_1.34.0             RCurl_1.98-1.6             
 [89] modelr_0.1.8                crayon_1.4.2               
 [91] utf8_1.2.2                  BiocFileCache_2.2.1        
 [93] tzdb_0.2.0                  rmarkdown_2.11             
 [95] progress_1.2.2              grid_4.1.2                 
 [97] readxl_1.3.1                blob_1.2.2                 
 [99] callr_3.7.0                 git2r_0.29.0               
[101] reprex_2.0.1                digest_0.6.29              
[103] httpuv_1.6.5                munsell_0.5.0