Support Vector Machine

Last updated: 2019-09-05

Knit directory: polymeRID/

This reproducible R Markdown analysis was created with workflowr (version 1.4.0.9001).

Seed: set.seed(20190729)

The command set.seed(20190729) was run prior to running the code in the R Markdown file. Setting a seed ensures that any results that rely on randomness, e.g. subsampling or permutations, are reproducible.

Repository version: a848def

These are the previous versions of the R Markdown and HTML files.

File	Version	Author	Date	Message
Rmd	a848def	goergen95	2019-09-05	changed citation style
html	070e93f	goergen95	2019-08-22	Build site.
Rmd	26b0062	goergen95	2019-08-21	updated svm_exploration.html
html	26b0062	goergen95	2019-08-21	updated svm_exploration.html
html	f2ee83c	goergen95	2019-08-19	Build site.
html	d960dc2	goergen95	2019-08-19	included calibration
html	b846f0b	goergen95	2019-08-19	Build site.
Rmd	de84a71	goergen95	2019-08-19	large update for website
html	de84a71	goergen95	2019-08-19	large update for website
Rmd	6bef5e6	goergen95	2019-08-14	confusion matrix output in rf_exploration
html	6bef5e6	goergen95	2019-08-14	confusion matrix output in rf_exploration
html	2385fbc	goergen95	2019-08-14	republish for layout change
Rmd	293fd73	goergen95	2019-08-14	first step on svm_exploration
html	293fd73	goergen95	2019-08-14	first step on svm_exploration

Overview

Support Vector Machine (SVM) is a non-parametric classification method that was initially designed for binary classification problems and was developed in its current form by Boser et al. (2004). A detailed overview of the SVM algorithm is given by Burges (1998). The principal idea behind SVM is to find the hyperplane that separates two classes by the largest possible margin, and training consists of maximizing this margin. Only the observations closest to the separating hyperplane determine its position; these observations are called support vectors.

Classes that cannot be separated linearly in the original variables can still be handled by implicitly mapping the data into a higher-dimensional feature space. This mapping is expressed through a kernel function, of which four groups are mainly used: linear, polynomial, radial and sigmoid functions (Burges, 1998). In this project only the radial basis function was used. Multi-class problems are addressed by combining several binary classifiers and taking a majority vote over their decisions; the e1071 implementation used below does this with one classifier per pair of classes (one-against-one).

SVM needs careful tuning of its parameters. For radial basis functions these are the regularization parameter C and the kernel width γ. The regularization parameter is also called the “penalty value”, as it is a constant that penalizes misclassified observations. The optimal values may change when different representations of the data are presented to the algorithm.
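For reference, the two tuning parameters can be located in the standard textbook form of the soft-margin problem and the radial basis kernel. This is a generic sketch rather than anything specific to this project; the symbols w, b, ξ_i and φ are introduced here only for illustration.

```latex
\min_{w,\,b,\,\xi}\quad \tfrac{1}{2}\lVert w\rVert^{2} + C\sum_{i=1}^{n}\xi_i
\qquad\text{subject to}\qquad
y_i\,\bigl(w^{\top}\phi(x_i)+b\bigr)\ \ge\ 1-\xi_i,\quad \xi_i\ \ge\ 0

K(x,x') \;=\; \phi(x)^{\top}\phi(x') \;=\; \exp\!\bigl(-\gamma\,\lVert x-x'\rVert^{2}\bigr)
```

A larger C puts more weight on the slack terms ξ_i, so margin violations are penalized more heavily. A larger γ makes the kernel decay faster with distance, so each support vector influences a smaller neighbourhood.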

Adding Noise

Different pre-processing techniques may emphasize different features of the patterns to be learned by an algorithm. To assess this, different transformations of the data were presented to the SVM algorithm. Additionally, the raw signal was jittered to test which transformation still delivers a high classification accuracy in the presence of noise. For this purpose, we define a function which adds noise to the raw data and returns a list with one element per noise level.

addNoise = function(data, levels = c(0), category="class"){
  data.return = list()
  index = which(names(data) == category)
  for (n in levels){
    tmp = as.matrix(data[ , -index])
    if (n == 0){
      tmp = data
    }else{
      tmp = as.data.frame(jitter(tmp, n))
      tmp[category] = data[category]
    }
    data.return[[paste("noise", n, sep="")]] = tmp
  }
  return(data.return)
}

data = read.csv(file = paste0(ref, "reference_database.csv"), header = TRUE)
noisy_data = addNoise(data, levels = c(0,10,100,250,500), category = "class")

# individual elements can be selected by using [[ and referring to the index or the name
head(noisy_data[["noise100"]])[1:3,1:3]

  wvn3992.63003826141 wvn3990.70123147964 wvn3988.77242469788
1        -0.010328834          0.01829306         0.001603137
2         0.005237827          0.01683373        -0.013921331
3        -0.011751581          0.01958805         0.002650597
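To get a feeling for what the noise levels mean, it can help to look at how base R's jitter() turns its factor argument into a noise amplitude. The sketch below uses a toy vector rather than the reference spectra. When no amount is given, jitter() adds uniform noise within plus or minus factor/5 times the smallest gap between the distinct (rounded) values of its input, so the levels are relative to the data and not absolute absorbance units.

```r
# Toy vector on [0, 1]; the values are illustrative, not project data.
set.seed(1)
x <- seq(0, 1, by = 0.01)

# Largest deviation actually observed for each jitter factor used above.
noise_levels <- c(10, 100, 250, 500)
max_dev <- sapply(noise_levels, function(n) max(abs(jitter(x, n) - x)))
names(max_dev) <- paste0("noise", noise_levels)

# Theoretical bound: factor / 5 * d, with d the smallest gap between
# distinct values (for this x, jitter() rounds to 3 decimals first).
d <- min(diff(sort(unique(round(x, 3)))))
bound <- noise_levels / 5 * d

print(rbind(max_dev, bound))
```

For a whole matrix of spectra, d is taken over all values of the matrix at once, so the effective amplitude depends on how finely the absorbance values are resolved.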

Data Pre-processing

Another function takes the noisy_data object as input and applies specific data transformations. These are normalization, which centers and scales the input data, different forms of the Savitzky-Golay filter (Savitzky and Golay, 1964), and first and second derivatives of the raw spectra. The function iterates through the elements of the noisy_data object and returns each specified transformation as a list element below the respective noise level. The exemplary function below applies only normalization, standard filtering and the first derivative. The implementation of the function used in the project can be found here.

createTrainingSet = function(data, category = "class",
                             SGpara = list(p=3,w=11), lag = 15){
  
  data.return = list()
  for (noise in names(data)){
    
    tmp = as.data.frame(data[[noise]])
    classes = tmp[,category]
    tmp = tmp[!names(tmp) %in% category]
    
    # original data
    data.return[[noise]][["raw"]] = as.data.frame(data[[noise]])
    
    # normalised data
    data_norm = preprocess(tmp, type="norm")
    data_norm[category] = classes
    data.return[[noise]][["norm"]] = data_norm
    
    # SG-filtered data
    data_sg = preprocess(tmp, type="sg", SGpara = SGpara)
    data_sg[category] = classes
    data.return[[noise]][["sg"]] = data_sg
    
    # first derivative of original data
    data_rawd1 = preprocess(tmp, type="raw.d1", lag = lag)
    data_rawd1[category] = classes
    data.return[[noise]][["raw.d1"]] = data_rawd1
    
  }
  return(data.return)
}

# applying the function
test_dataset = createTrainingSet(noisy_data, category = "class")

# individual transformations at a certain noise level can be accessed with [[
head(test_dataset[["noise500"]][["raw.d1"]])[1:3,1:3]

  wvn3963.69793653488 wvn3961.76912975311 wvn3959.84032297134
1          0.08295627          0.02234302          0.17399975
2          0.04253842          0.07918397          0.03849980
3         -0.05579848          0.11670539          0.02503375
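The preprocess() helper called above is not shown on this page. The following is a minimal base R sketch of what the three transformations could look like, under stated assumptions: preprocess_sketch is a hypothetical name, normalization is assumed to work column-wise, and the first derivative is assumed to be a lagged difference. The last assumption is consistent with the raw.d1 output above, whose first column sits 15 wavenumber steps into the spectrum, but that is an inference and not the project's documented behaviour.

```r
# Hypothetical stand-in for the project's preprocess(); base R only.
preprocess_sketch <- function(x, type = c("norm", "sg", "raw.d1"),
                              SGpara = list(p = 3, w = 11), lag = 15) {
  type <- match.arg(type)
  x <- as.matrix(x)
  switch(type,
    # centre and scale every wavenumber column (assumed orientation)
    norm = scale(x, center = TRUE, scale = TRUE),
    sg = {
      # Savitzky-Golay smoothing weights: least-squares fit of a polynomial
      # of degree p in a window of width w, evaluated at the window centre
      h <- (SGpara$w - 1) / 2
      A <- outer(-h:h, 0:SGpara$p, "^")
      wts <- (solve(crossprod(A)) %*% t(A))[1, ]
      out <- t(apply(x, 1, function(s) {
        as.numeric(stats::filter(s, rev(wts), sides = 2))
      }))
      # drop the h undefined points at each edge
      out[, (h + 1):(ncol(x) - h), drop = FALSE]
    },
    # lagged difference along each spectrum (assumed derivative definition)
    raw.d1 = t(apply(x, 1, diff, lag = lag))
  )
}

# Toy spectra: three cubic trends over 40 points (illustrative only).
toy <- t(sapply(1:3, function(k) k * ((1:40) / 10)^3))
sg <- preprocess_sketch(toy, "sg")
d1 <- preprocess_sketch(toy, "raw.d1")
dim(sg)  # 3 x 30: five columns are lost at each edge for w = 11
dim(d1)  # 3 x 25: lag = 15 shortens each spectrum by 15 points
```

A cubic trend is a convenient check because a filter with polynomial order 3 should reproduce it exactly away from the edges.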

Dimensionality Reduction

The database of Primpke et al. (2018) currently holds 1863 variables for each observation. Most of these variables do not hold relevant information to distinguish between different types of particles. To shorten the computation time, one can use dimensionality reduction techniques such as principal component analysis (PCA). PCA has already been used to transform spectral data of microplastics in marine ecosystems (Jung et al., 2018; Lorenzo-Navarro et al., 2018) and has been applied successfully to FTIR spectroscopy data (Ami et al., 2013; Fu et al., 2014; Hori and Sugiyama, 2003; Mueller et al., 2013; Nieuwoudt et al., 2004). PCA takes the input data for a given number of observations and performs an orthogonal transformation to derive uncorrelated principal components from the possibly correlated variables. Both redundancies in the data and the presence of noise can be accounted for this way. At the same time, the number of variables can be reduced significantly, which speeds up the training process. Below we apply a PCA to the raw data as an example only.

library(factoextra)
tmp = test_dataset[["noise0"]][["raw"]]
pca = prcomp(tmp[ ,-1864]) # omitting class variable
var_info = factoextra::get_eigenvalue(pca)
# setting a threshold of 99% explained variance
threshold = 99
thresInd = which(var_info$cumulative.variance.percent>=threshold)[1]
pca_data = pca$x[,1:thresInd]

We can use the index thresInd we just defined to look at the principal components that together explain 99% of the variance in the data.

	eigenvalue	variance.percent	cumulative.variance.percent
Dim.1	1.5618496	57.7479523	57.74795
Dim.2	0.3875503	14.3293173	72.07727
Dim.3	0.2226548	8.2324559	80.30973
Dim.4	0.1823539	6.7423696	87.05209
Dim.5	0.0897978	3.3201919	90.37229
Dim.6	0.0538395	1.9906665	92.36295
Dim.7	0.0443676	1.6404525	94.00341
Dim.8	0.0324243	1.1988595	95.20227
Dim.9	0.0297937	1.1015939	96.30386
Dim.10	0.0215566	0.7970366	97.10090
Dim.11	0.0173647	0.6420450	97.74294
Dim.12	0.0145994	0.5397977	98.28274
Dim.13	0.0113917	0.4211964	98.70394
Dim.14	0.0069264	0.2560972	98.96003
Dim.15	0.0042735	0.1580104	99.11804
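The three columns of this table can also be derived directly from the prcomp object, which makes it easier to see how the threshold index is chosen. The sketch below uses the built-in USArrests data as a stand-in for the spectra, so the numbers it prints are unrelated to the table above.

```r
# Built-in data as a stand-in for the spectra (illustrative only).
pca <- prcomp(USArrests)

# The same three quantities that factoextra::get_eigenvalue() reports.
eigenvalue <- pca$sdev^2
variance_percent <- 100 * eigenvalue / sum(eigenvalue)
cumulative_percent <- cumsum(variance_percent)

# First component at which the cumulative share reaches the threshold.
threshold <- 99
thresInd <- which(cumulative_percent >= threshold)[1]

print(round(cumulative_percent, 2))
print(thresInd)
```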

We effectively reduced the number of variables from 1863 to 15 while retaining 99% of the variance found in the original data. When it comes to machine learning, however, it is important to realize that this new data set is not fit to be used in a training process. If we now randomly split the observations into a training and a testing set, the two sets are no longer independent, because information from the testing set has already influenced the outcome of the PCA. Therefore, the data set needs to be split before applying the PCA. The analysis is done on the training data only, and the same orthogonal transformation is then applied to the testing data. This way it can be ensured that the test set is truly independent of the training process.
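What "applying the same transformation" means can be made explicit with a small sketch. It again uses built-in data (iris) instead of the spectra, and the number of retained components k is arbitrary. The point is that predict() on a prcomp object centres the new observations with the training means and multiplies them by the training rotation, so nothing is re-estimated from the test rows.

```r
set.seed(1)
X <- as.matrix(iris[, 1:4])              # stand-in data, not the spectra
train_id <- sample(nrow(X), size = nrow(X) / 2)

pca <- prcomp(X[train_id, ])             # fitted on the training rows only
k <- 2                                   # components kept (arbitrary here)

# projection of the held-out rows via predict()
test_scores <- predict(pca, X[-train_id, ])[, 1:k]

# the same projection written out by hand
manual <- sweep(X[-train_id, ], 2, pca$center) %*% pca$rotation[, 1:k]

all.equal(test_scores, manual, check.attributes = FALSE)
```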

Cross Validation

We apply a 10-fold cross-validation which is repeated five times. The following code takes a complete data set as input, applies a splitting function from the caret package, builds the PCA on the training set and finally applies the same transformation to the validation set. Here, it is only applied to the raw data. Note that each fold is a random, class-stratified split into 50% training and 50% validation data drawn with caret::createDataPartition, so the validation sets of different folds can overlap, unlike in classical k-fold cross-validation.

folds = 10
repeats = 5
split_percentage = 0.5
threshold = 99
tmp = test_dataset[["noise0"]][["raw"]]

set.seed(42) # ensure reproducibility
# draw `folds` stratified random splits for each repeat
fold_index = lapply(1:repeats, function(i){
  caret::createDataPartition(y = tmp$class, times = folds, p = split_percentage)
})
fold_index = do.call(c, fold_index)

pcaData = list()
for (rep in 1:repeats){
  rep_index = fold_index[(rep*folds-folds+1):(rep*folds)] # jumps to the correct number of folds forward in each repeat
  
  pcadata_fold = lapply(1:folds,function(x){
    
    # splitting for current fold
    training = tmp[unlist(rep_index[x]),]
    validation = tmp[-unlist(rep_index[x]),]
    
    # keep response
    responseTrain = training$class
    responseVal = validation$class
    
    # apply PCA
    pca = prcomp(training[,1:1863])
    varInfo = factoextra::get_eigenvalue(pca)
    thresInd = which(varInfo$cumulative.variance.percent >= threshold)[1]
    pca_training = pca$x[ ,1:thresInd]
    pca_validation = predict(pca, validation)[ ,1:thresInd]
    
    training = as.data.frame(pca_training)
    training$response = responseTrain
    validation = as.data.frame(pca_validation)
    validation$response = responseVal
    foldtmp = list(training, validation)
    names(foldtmp) = c("training","validation")
    return(foldtmp)
  })
  names(pcadata_fold) = paste("fold", 1:folds, sep ="")
  pcaData[[paste0("repeat",rep)]] = pcadata_fold
}

We now have a list object with one element per repeat. Below the level of repeats, the individual folds can be accessed. Within each fold, the split data set can be accessed by referring to "training" and "validation".

pcaData[["repeat5"]][["fold10"]][["training"]][1:3,1:3]

         PC1        PC2         PC3
1 -0.4779164 -0.7786770 -0.02761266
2 -0.4869471 -0.7181236 -0.02492981
4 -0.4661699 -0.7463479 -0.11424932

pcaData[["repeat5"]][["fold10"]][["validation"]][1:3,1:3]

         PC1        PC2         PC3
3 -0.5240744 -0.5931111  0.05961383
5 -0.4718366 -0.6995376 -0.09273887
7 -0.5255215 -0.5928214  0.05184969

summary(pcaData[["repeat5"]][["fold10"]][["training"]]$response)

FIBRE   FUR  HDPE  LDPE    PA    PE   PES   PET    PP    PS   PUR  WOOD 
   14    12     5     6     7     4     8     5     6     4     4     2

summary(pcaData[["repeat5"]][["fold10"]][["validation"]]$response)

FIBRE   FUR  HDPE  LDPE    PA    PE   PES   PET    PP    PS   PUR  WOOD 
   13    11     5     5     7     4     7     4     6     3     3     2
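The class counts above show that every polymer class is represented in both halves. A stratified split of this kind can be sketched in base R as follows. stratified_split is a hypothetical helper written for illustration, not the caret implementation, and iris species stand in for the polymer classes. Rounding the per-class sample size up is an assumption, although it would be consistent with the counts above (for example 14 of the 27 FIBRE spectra in training).

```r
# Hypothetical helper: one stratified random split, base R only.
stratified_split <- function(y, p = 0.5) {
  idx_by_class <- split(seq_along(y), y)
  train <- lapply(idx_by_class, function(idx) {
    # sample a share p of the row indices within each class
    idx[sample.int(length(idx), size = ceiling(p * length(idx)))]
  })
  sort(unlist(train, use.names = FALSE))
}

set.seed(42)
y <- iris$Species                      # stand-in labels, not the polymer classes
splits <- replicate(10, stratified_split(y, p = 0.5), simplify = FALSE)

train_id <- splits[[1]]
table(y[train_id])    # 25 per species
table(y[-train_id])   # 25 per species
```

Because every split is drawn independently, an observation can land in the validation half of several folds, which is the difference from a partition-based k-fold scheme.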

Parameter Tuning

For the SVM algorithm, a simple grid search for the optimal configuration of the regularization parameter C and the kernel width γ was implemented: a grid of candidate values is defined and a model is calculated for each possible combination. Due to limited computational capacity, we restricted the search to 25 combinations. The code below fits a model for each combination, evaluates how well it classifies the training data in terms of the Kappa score, and then uses the best model to predict the validation data.

training = pcaData[["repeat1"]][["fold1"]][["training"]]
validation = pcaData[["repeat1"]][["fold1"]][["validation"]]
# the response is the last column; note that 1:ncol(x)-1 would be parsed as (1:ncol(x))-1
x_train = training[ , 1:(ncol(training)-1)]
y_train = training$response
x_test = validation[ , 1:(ncol(validation)-1)]
y_test = validation$response


tuneGrid = expand.grid(gamma =seq(0.1,1,0.2),cost = seq(1,5,1) )
accuracy = c() # collects the Kappa score of each model on the training data
models = list()
for ( i in 1:nrow(tuneGrid)){
  model = e1071::svm(x = x_train, y = y_train,
                    kernel = "radial",
                    gamma = tuneGrid$gamma[i],
                    cost = tuneGrid$cost[i])
  pred = predict(model, x_train)
  conf = caret::confusionMatrix(pred, y_train)
  accuracy = c(accuracy, conf$overall["Kappa"])
  models[[i]] = model
}

best_model = models[[which(accuracy == max(accuracy))[1]]]
prediction = predict(best_model, x_test)
confMat = caret::confusionMatrix(prediction, y_test)
print(confMat$table)

          Reference
Prediction FIBRE FUR HDPE LDPE PA PE PES PET PP PS PUR WOOD
     FIBRE    12   0    0    0  0  0   5   2  0  0   0    2
     FUR       1  11    0    0  0  0   0   0  0  0   0    0
     HDPE      0   0    3    0  0  0   0   0  0  0   0    0
     LDPE      0   0    0    5  0  2   0   0  0  0   0    0
     PA        0   0    0    0  7  0   0   0  0  0   0    0
     PE        0   0    1    0  0  2   0   0  0  0   0    0
     PES       0   0    0    0  0  0   2   1  0  0   0    0
     PET       0   0    1    0  0  0   0   1  0  2   0    0
     PP        0   0    0    0  0  0   0   0  6  0   0    0
     PS        0   0    0    0  0  0   0   0  0  1   0    0
     PUR       0   0    0    0  0  0   0   0  0  0   3    0
     WOOD      0   0    0    0  0  0   0   0  0  0   0    0
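The overall accuracy and the Kappa score that summarise this table can be recomputed by hand, which also shows what Kappa corrects for. The sketch below copies the counts from the confusion matrix above and applies the usual definition of Cohen's kappa (observed agreement adjusted for the agreement expected by chance); the variable names are chosen here for illustration.

```r
classes <- c("FIBRE", "FUR", "HDPE", "LDPE", "PA", "PE",
             "PES", "PET", "PP", "PS", "PUR", "WOOD")

# rows = prediction, columns = reference, copied from the table above
cm <- matrix(c(
  12,  0, 0, 0, 0, 0, 5, 2, 0, 0, 0, 2,
   1, 11, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
   0,  0, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0,
   0,  0, 0, 5, 0, 2, 0, 0, 0, 0, 0, 0,
   0,  0, 0, 0, 7, 0, 0, 0, 0, 0, 0, 0,
   0,  0, 1, 0, 0, 2, 0, 0, 0, 0, 0, 0,
   0,  0, 0, 0, 0, 0, 2, 1, 0, 0, 0, 0,
   0,  0, 1, 0, 0, 0, 0, 1, 0, 2, 0, 0,
   0,  0, 0, 0, 0, 0, 0, 0, 6, 0, 0, 0,
   0,  0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0,
   0,  0, 0, 0, 0, 0, 0, 0, 0, 0, 3, 0,
   0,  0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
), nrow = 12, byrow = TRUE,
dimnames = list(Prediction = classes, Reference = classes))

n <- sum(cm)
observed <- sum(diag(cm)) / n                      # overall accuracy
expected <- sum(rowSums(cm) * colSums(cm)) / n^2   # agreement expected by chance
kappa_hat <- (observed - expected) / (1 - expected)

round(c(accuracy = observed, kappa = kappa_hat), 3)  # accuracy 0.757, kappa 0.723
```

For this fold, 53 of the 70 validation spectra are classified correctly, and most of the errors come from PES, PET and WOOD spectra that are assigned to FIBRE.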

The process of splitting the data set into training and validation sets was automated by putting the above code into a function which can be found here. Finally, this function was applied to the different pre-processing types, as discussed before.

source("code/functions.R")
wavenumbers = readRDS(paste0(ref,"wavenumbers.rds"))
# add noise to data
noisyData = addNoise(data,levels = c(0,10,100,250,500), category = "class")

# preprocessing
testDataset = createTrainingSet(noisyData, category = "class",
                                SGpara = list(p=3, w=11), lag=15,
                                type = c("raw", "norm", "sg", "sg.d1", "sg.d2",
                                  "sg.norm", "sg.norm.d1", "sg.norm.d2",
                                  "raw.d1", "raw.d2", "norm.d1", "norm.d2"))


types = names(testDataset[[1]])

levels = lapply(names(testDataset), function(x){
  rep(x, length(types))
})
levels = unlist(levels)

results = data.frame(level=levels, type = types, kappa = rep(0,length(levels)))

for (level in unique(levels)){
  for (type in types){

    print(paste0("Level: ",level," Type: ",type))
    tmpData = testDataset[[level]][[type]]
    tmpData[which(wavenumbers<=2420 & wavenumbers>=2200)] = 0 # setting CO2 window to 0
    tmpModel = pcaCV(tmpData, folds = 10, repeats = 5, threshold = 99, metric = "Kappa", p=0.5, method="svm")
    saveRDS(tmpModel,file = paste0(output,"svm/model_",level,"_",type,"_",round(tmpModel[[1]],2),".rds"))
    results[which(results$level==level & results$type==type),"kappa"] = as.numeric(tmpModel[[1]])
    print(results)

  }
}
saveRDS(results, paste0(output,"svm/exploration.rds"))
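The loop above blanks the CO2 window by position, which relies on the columns of each data set lining up with the wavenumbers vector. As a hedged alternative, the window could be selected from the column names themselves. The sketch below assumes that spectral columns follow the "wvn<wavenumber>" naming seen in the outputs earlier on this page; maskCO2 is a hypothetical helper and the toy values are illustrative only. Whether positional masking is a problem for transformations that shorten the spectra depends on how pcaCV and the full createTrainingSet handle the columns, which is not shown here.

```r
# Hypothetical helper; assumes spectral columns are named "wvn<wavenumber>".
maskCO2 <- function(df, lower = 2200, upper = 2420) {
  is_wvn <- grepl("^wvn", names(df))
  wn <- rep(NA_real_, ncol(df))
  wn[is_wvn] <- as.numeric(sub("^wvn", "", names(df)[is_wvn]))
  # columns whose wavenumber falls inside the CO2 window
  in_window <- !is.na(wn) & wn >= lower & wn <= upper
  df[in_window] <- 0
  df
}

# Toy one-row data set (illustrative values, not project data).
toy <- data.frame(wvn2500.1 = 0.3, wvn2400.5 = 0.2, wvn2300.2 = 0.1,
                  wvn2100.9 = 0.4, class = "PE")
masked <- maskCO2(toy)
print(masked)
```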

Results

The plot below shows the Kappa scores the algorithm achieved during training for different representations of the data at increasing noise levels.

It can be observed that the Kappa score decreases markedly with higher noise levels. All data transformations yield very low accuracies when the noise increases. In the absence of significant noise, however, the simple Savitzky-Golay filter, the raw data, and the first and second derivatives yield a Kappa score of about 0.75. Calculating the slope of the Kappa score between noise levels reveals the transformations with the most stable results. Note that we only use the Kappa scores at noise levels 0 and 10 to calculate the slope, since higher noise levels yielded unsatisfactory accuracies.

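# Kappa scores at the two lowest noise levels
noise0 = results[results$level == "noise0", ]
noise10 = results[results$level == "noise10", ]
# change in Kappa from noise level 0 to 10 (negative values mean a loss);
# match() aligns the rows by type instead of relying on row order
slopes = noise10$kappa[match(noise0$type, noise10$type)] - noise0$kappa
types = unique(results$type)
df = data.frame(type = noise0$type, slope = slopes, row.names = NULL)
# sort from the smallest to the largest decrease
df = df[order(-df$slope),]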

    type         slope
3   sg           -0.0007272
1   raw          -0.0011139
4   sg.d1        -0.0148920
10  raw.d2       -0.0161989
9   raw.d1       -0.0167638
5   sg.d2        -0.0642191
7   sg.norm.d1   -0.2300057
8   sg.norm.d2   -0.2324091
2   norm         -0.5853342
6   sg.norm      -0.6006121

This confirms that the average decrease in Kappa score is smallest for the Savitzky-Golay filtered data, followed by the raw data. The first derivative of the filtered data shows the next smallest decrease, but its overall Kappa score is lower than that of the derivatives of the unfiltered data.
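Since stability and overall score level both matter, it can help to look at the slope and the baseline Kappa side by side. The sketch below shows one way to do this, assuming a data frame shaped like `results`; the stand-in `results_toy` and all Kappa values in it are invented and are not the values of this analysis.

```r
# Hypothetical stand-in for `results`; the Kappa values are invented
results_toy <- data.frame(
  level = rep(c("noise0", "noise10"), each = 3),
  type  = rep(c("raw", "sg", "norm"), times = 2),
  kappa = c(0.75, 0.74, 0.70,
            0.74, 0.74, 0.20)
)

noise0  <- results_toy[results_toy$level == "noise0",  c("type", "kappa")]
noise10 <- results_toy[results_toy$level == "noise10", c("type", "kappa")]

# merge() aligns the rows by type, so the result does not depend on row order
cmp <- merge(noise0, noise10, by = "type", suffixes = c("_0", "_10"))
cmp$slope <- cmp$kappa_10 - cmp$kappa_0

# Most stable first; ties are broken by the higher baseline Kappa
cmp <- cmp[order(-cmp$slope, -cmp$kappa_0), ]
```

The resulting table keeps `kappa_0` next to `slope`, so a transformation with a flat slope but a low baseline is easy to spot.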

Citations on this page

Ami, D., Mereghetti, P., Maria, S., 2013. Multivariate Analysis for Fourier Transform Infrared Spectra of Complex Biological Systems and Processes. Multivariate Analysis in Management, Engineering and the Sciences. https://doi.org/10.5772/53850

Boser, B.E., Guyon, I.M., Vapnik, V.N., 1992. A training algorithm for optimal margin classifiers, in: Proceedings of the Fifth Annual Workshop on Computational Learning Theory, COLT ’92. ACM, New York, NY, USA, pp. 144–152. https://doi.org/10.1145/130385.130401

Burges, C.J., 1998. A tutorial on support vector machines for pattern recognition. Data Mining and Knowledge Discovery 2, 121–167. https://doi.org/10.1023/A:1009715923555

Fu, Y., Toyoda, K., Ihara, I., 2014. Application of ATR-FTIR spectroscopy and principal component analysis in characterization of 15-acetyldeoxynivalenol in corn oil. Engineering in Agriculture, Environment and Food 7, 163–168. https://doi.org/10.1016/j.eaef.2014.07.001

Hori, R., Sugiyama, J., 2003. A combined FT-IR microscopy and principal component analysis on softwood cell walls. Carbohydrate Polymers 52, 449–453. https://doi.org/10.1016/S0144-8617(03)00013-4

Jung, M.R., Horgen, F.D., Orski, S.V., Rodriguez C., V., Beers, K.L., Balazs, G.H., Jones, T.T., Work, T.M., Brignac, K.C., Royer, S.J., Hyrenbach, K.D., Jensen, B.A., Lynch, J.M., 2018. Validation of ATR FT-IR to identify polymers of plastic marine debris, including those ingested by marine organisms. Marine Pollution Bulletin 127, 704–716. https://doi.org/10.1016/j.marpolbul.2017.12.061

Lorenzo-Navarro, J., Castrillón-Santana, M., Gómez, M., Herrera, A., Marín-Reyes, P.A., 2018. Automatic Counting and Classification of Microplastic Particles. https://doi.org/10.5220/0006725006460652

Mueller, D., Ferrão, M.F., Marder, L., Ben da Costa, A., de Cássia de Souza Schneider, R., 2013. Fourier transform infrared spectroscopy (FTIR) and multivariate analysis for identification of different vegetable oils used in biodiesel production. Sensors (Switzerland) 13, 4258–4271. https://doi.org/10.3390/s130404258

Nieuwoudt, H.H., Prior, B.A., Pretorius, I.S., Manley, M., Bauer, F.F., 2004. Principal component analysis applied to Fourier transform infrared spectroscopy for the design of calibration sets for glycerol prediction models in wine and for the detection and classification of outlier samples. Journal of Agricultural and Food Chemistry 52, 3726–3735. https://doi.org/10.1021/jf035431q

Primpke, S., Wirth, M., Lorenz, C., Gerdts, G., 2018. Reference database design for the automated analysis of microplastic samples based on Fourier transform infrared (FTIR) spectroscopy. Analytical and Bioanalytical Chemistry 410, 5131–5141. https://doi.org/10.1007/s00216-018-1156-x

Savitzky, A., Golay, M.J., 1964. Smoothing and Differentiation of Data by Simplified Least Squares Procedures. Analytical Chemistry 36, 1627–1639. https://doi.org/10.1021/ac60214a047

sessionInfo()

R version 3.6.1 (2019-07-05)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Linux Mint 19.1

Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.7.1
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.7.1

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=de_DE.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=de_DE.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=de_DE.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] plotly_4.9.0              knitr_1.24               
 [3] factoextra_1.0.5          tensorflow_1.14.0        
 [5] abind_1.4-5               e1071_1.7-2              
 [7] keras_2.2.4.1             workflowr_1.4.0.9001     
 [9] baseline_1.2-1            gridExtra_2.3            
[11] stringr_1.4.0             prospectr_0.1.3          
[13] RcppArmadillo_0.9.600.4.0 openxlsx_4.1.0.1         
[15] magrittr_1.5              ggplot2_3.2.0            
[17] reshape2_1.4.3            dplyr_0.8.3              

loaded via a namespace (and not attached):
 [1] httr_1.4.1         tidyr_0.8.3        viridisLite_0.3.0 
 [4] jsonlite_1.6       splines_3.6.1      foreach_1.4.7     
 [7] prodlim_2018.04.18 shiny_1.3.2        assertthat_0.2.1  
[10] stats4_3.6.1       highr_0.8          yaml_2.2.0        
[13] ggrepel_0.8.1      ipred_0.9-9        pillar_1.4.2      
[16] backports_1.1.4    lattice_0.20-38    glue_1.3.1        
[19] reticulate_1.13    digest_0.6.20      promises_1.0.1    
[22] colorspace_1.4-1   recipes_0.1.6      httpuv_1.5.1      
[25] htmltools_0.3.6    Matrix_1.2-17      plyr_1.8.4        
[28] timeDate_3043.102  pkgconfig_2.0.2    SparseM_1.77      
[31] caret_6.0-84       xtable_1.8-4       purrr_0.3.2       
[34] scales_1.0.0       whisker_0.3-2      later_0.8.0       
[37] gower_0.2.1        lava_1.6.5         git2r_0.26.1      
[40] tibble_2.1.3       generics_0.0.2     withr_2.1.2       
[43] nnet_7.3-12        lazyeval_0.2.2     mime_0.7          
[46] survival_2.44-1.1  crayon_1.3.4       evaluate_0.14     
[49] fs_1.3.1           nlme_3.1-140       MASS_7.3-51.4     
[52] class_7.3-15       tools_3.6.1        data.table_1.12.2 
[55] munsell_0.5.0      zip_2.0.3          compiler_3.6.1    
[58] rlang_0.4.0        grid_3.6.1         iterators_1.0.12  
[61] htmlwidgets_1.3    crosstalk_1.0.0    labeling_0.3      
[64] base64enc_0.1-3    rmarkdown_1.14     gtable_0.3.0      
[67] ModelMetrics_1.2.2 codetools_0.2-16   R6_2.4.0          
[70] tfruns_1.4         lubridate_1.7.4    zeallot_0.1.0     
[73] rprojroot_1.3-2    stringi_1.4.3      Rcpp_1.0.2        
[76] rpart_4.1-15       tidyselect_0.2.5   xfun_0.8