Last updated: 2019-08-14
Knit directory: polymeRID/
This reproducible R Markdown analysis was created with workflowr (version 1.4.0.9001). The Checks tab describes the reproducibility checks that were applied when the results were created. The Past versions tab lists the development history.
The command set.seed(20190729) was run prior to running the code in the R Markdown file. Setting a seed ensures that any results that rely on randomness, e.g. subsampling or permutations, are reproducible.
These are the previous versions of the R Markdown and HTML files. If you’ve configured a remote Git repository (see ?wflow_git_remote), click on the hyperlinks in the table below to view them.
File | Version | Author | Date | Message |
---|---|---|---|---|
html | 2385fbc | goergen95 | 2019-08-14 | republish for layout change |
Rmd | 293fd73 | goergen95 | 2019-08-14 | first step on svm_exploration |
html | 293fd73 | goergen95 | 2019-08-14 | first step on svm_exploration |
Support Vector Machine (SVM) is a non-parametric classification method that was initially designed for binary classification problems and was developed in its current form by Boser, Guyon, and Vapnik (2004). A detailed overview of the SVM algorithm is found in Burges (1998). The principal idea behind SVM is to find an optimal hyperplane that separates two classes from one another by the largest possible margin. The algorithm is optimised by iteratively maximizing this margin while considering only the observations of both classes that lie closest to it. These specific observations are called support vectors. Data that are not linearly separable in the original space can be processed by mapping them into a higher-dimensional feature space through a specified mapping function. This function is called the kernel function, and four groups are mainly used: linear, polynomial, radial and sigmoid functions (Burges 1998). In this project only the radial basis function was used. Multi-class problems are addressed by calculating an optimal margin following the one-against-all pattern and conducting a majority vote at the end of the calculations. SVMs need some tuning parameters. These are the regularization parameter C and the kernel width γ. The regularization parameter is also called the penalty value, as it is a constant that penalizes misclassified observations. The optimal values of both parameters might change if different representations of the data are presented to the algorithm.
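To make the role of the kernel width γ more tangible, the following minimal sketch computes a radial basis function kernel matrix by hand, using the form exp(-gamma * |u - v|^2) given in the e1071::svm documentation. The helper name rbf_kernel and the three toy points are assumptions made for illustration only; they are not part of the project code.

```r
# Hypothetical helper: RBF kernel K(u, v) = exp(-gamma * ||u - v||^2)
rbf_kernel = function(X, gamma) {
  X = as.matrix(X)
  # squared Euclidean distances between all pairs of rows
  d2 = as.matrix(dist(X))^2
  exp(-gamma * d2)
}

# three toy observations in two dimensions
X = rbind(c(0, 0), c(1, 0), c(0, 2))
K_narrow = rbf_kernel(X, gamma = 1)    # similarity decays quickly with distance
K_wide   = rbf_kernel(X, gamma = 0.1)  # similarity decays slowly with distance
```

With a large γ only very close observations count as similar, so the decision boundary can follow individual points closely; a small γ gives a smoother boundary.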
Different levels of data preprocessing might emphasize different features of the patterns to be learned by an algorithm. To explore this, different transformations of the data were presented to the SVM algorithm. Additionally, the raw data signal was jittered to test which transformation might prove beneficial in delivering high classification accuracies even in the presence of noise. To test this, we define a function which adds noise to the raw data and returns a list with one element per noise level applied.
# adds uniform noise (via jitter) to all spectral columns, once per noise level
addNoise = function(data, levels = c(0), category="class"){
  data.return = list()
  index = which(names(data) == category)  # position of the class column
  for (n in levels){
    if (n == 0){
      tmp = data  # level 0: keep the data unchanged
    }else{
      tmp = as.data.frame(jitter(as.matrix(data[ , -index]), n))
      tmp[category] = data[category]  # re-attach the class labels
    }
    data.return[[paste("noise", n, sep="")]] = tmp
  }
  return(data.return)
}
data = read.csv(file = paste0(ref, "reference_database.csv"), header = TRUE)
noisy_data = addNoise(data, levels = c(0,10,100,250,500), category = "class")
# individual elements can be selected with [[ by referring to the index or the name
head(noisy_data[["noise100"]])[1:3,1:3]
wvn3992.63003826141 wvn3990.70123147964 wvn3988.77242469788
1 -0.010328834 0.01829306 0.001603137
2 0.005237827 0.01683373 -0.013921331
3 -0.011751581 0.01958805 0.002650597
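The noise levels passed to addNoise are forwarded as the factor argument of base R's jitter. As documented for jitter, when no amount is given the noise is drawn uniformly from ±factor/5 · d, where d is roughly the smallest difference between distinct values of the input. Because jitter treats a matrix as one long vector, d refers to the whole matrix. The sketch below uses a made-up toy vector to illustrate this bound.

```r
set.seed(1)
x = c(0, 1, 2, 3)  # toy values; smallest gap between distinct values is d = 1

# upper bound of the uniform noise for a given factor (d assumed known)
noise_bound = function(factor, d = 1) factor / 5 * d

x10  = jitter(x, factor = 10)   # each value moves by at most 10/5 * 1 = 2
x100 = jitter(x, factor = 100)  # each value moves by at most 100/5 * 1 = 20

max_dev10  = max(abs(x10 - x))
max_dev100 = max(abs(x100 - x))
```

The absolute noise amplitude therefore depends on the resolution of the data as well as on the chosen level, so the levels 10 to 500 are relative rather than physical units.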
Then, in another user-defined function that takes the noisy_data object as input, the specified data transformations are applied. These are normalization, which centers and scales the input data, different forms of the Savitzky-Golay filter (Savitzky and Golay 1964), and the first and second derivatives of a raw spectrum. The function iterates through the noise-level elements in the noisy_data object and returns each specified transformation as a list element below the noise level. The exemplary function below applies the preprocessing for normalization, standard filtering and the first derivative only. The implementation of the function used in the project can be found here.
createTrainingSet = function(data, category = "class",
                             SGpara = list(p=3,w=11), lag = 15){
  data.return = list()
  for (noise in names(data)){
    tmp = as.data.frame(data[[noise]])
    classes = tmp[,category]
    tmp = tmp[!names(tmp) %in% category]
    # original data
    data.return[[noise]][["raw"]] = as.data.frame(data[[noise]])
    # normalised data
    data_norm = preprocess(tmp, type="norm")
    data_norm[category] = classes
    data.return[[noise]][["norm"]] = data_norm
    # SG-filtered data
    data_sg = preprocess(tmp, type="sg", SGpara = SGpara)
    data_sg[category] = classes
    data.return[[noise]][["sg"]] = data_sg
    # first derivative of original data
    data_rawd1 = preprocess(tmp, type="raw.d1", lag = lag)
    data_rawd1[category] = classes
    data.return[[noise]][["raw.d1"]] = data_rawd1
  }
  return(data.return)
}
# applying the function
test_dataset = createTrainingSet(noisy_data, category = "class")
# individual transformations at a certain noise level can be accessed with [[
head(test_dataset[["noise500"]][["raw.d1"]])[1:3,1:3]
wvn3963.69793653488 wvn3961.76912975311 wvn3959.84032297134
1 0.08295627 0.02234302 0.17399975
2 0.04253842 0.07918397 0.03849980
3 -0.05579848 0.11670539 0.02503375
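The preprocess helper called above is defined in the project code and is not shown here. As a rough idea of what two of its transformations might look like, here is a base-R sketch. The name preprocess_sketch, the column-wise normalization and the lagged-difference derivative are assumptions for illustration and may differ from the project's implementation.

```r
# Hypothetical stand-in for two of the transformations (not the project's preprocess)
preprocess_sketch = function(x, type = c("norm", "raw.d1"), lag = 15) {
  type = match.arg(type)
  x = as.matrix(x)
  out = switch(type,
    # centre and scale every wavenumber column (assumed to be column-wise)
    norm = scale(x, center = TRUE, scale = TRUE),
    # lagged difference along the wavenumber axis; drops the first `lag` columns
    raw.d1 = t(diff(t(x), lag = lag))
  )
  as.data.frame(out)
}

# toy spectra: 2 observations, 10 wavenumber columns
spectra = matrix(1:20, nrow = 2, byrow = TRUE,
                 dimnames = list(NULL, paste0("wvn", 1:10)))
d1     = preprocess_sketch(spectra, type = "raw.d1", lag = 3)
normed = preprocess_sketch(spectra, type = "norm")
```

In this sketch the derivative loses the first lag columns, which would be consistent with the printed raw.d1 output starting at a later wavenumber than the raw data.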
The database of Primpke et al. (2018) currently holds 1863 variables for each observation. Most of these data points do not carry information relevant for distinguishing between different types of particles. To shorten the computation time, one can use dimensionality reduction techniques such as principal component analysis (PCA). PCA has been used before to transform spectral data of micro-plastics in marine ecosystems (Jung et al. 2018; Lorenzo-Navarro et al. 2018). PCA takes the input data for a given number of observations and, by applying an orthogonal transformation, converts the possibly correlated variables into uncorrelated principal components. This way, both redundancies in the data and possible noise can be accounted for. PCA has previously been applied successfully to FTIR spectrometer data (Hori and Sugiyama 2003; Nieuwoudt et al. 2004; Mueller et al. 2013; Ami, Mereghetti, and Maria 2013; Fu2014). At the same time, PCA can significantly reduce the number of variables and thus speed up the training process. Below we apply a PCA to the raw data as an example.
library(factoextra)
tmp = test_dataset[["noise0"]][["raw"]]
pca = prcomp(tmp[ ,-1864]) # omitting class variable
var_info = factoextra::get_eigenvalue(pca)
# setting a threshold of 99% explained variance
threshold = 99
thresInd = which(var_info$cumulative.variance.percent>=threshold)[1]
pca_data = pca$x[,1:thresInd]
We can use the index variable thresInd we just defined to take a look at all the principal components that together explain 99% of the variance in the data set.
component | eigenvalue | variance.percent | cumulative.variance.percent |
---|---|---|---|
Dim.1 | 1.5618496 | 57.7479523 | 57.74795 |
Dim.2 | 0.3875503 | 14.3293173 | 72.07727 |
Dim.3 | 0.2226548 | 8.2324559 | 80.30973 |
Dim.4 | 0.1823539 | 6.7423696 | 87.05209 |
Dim.5 | 0.0897978 | 3.3201919 | 90.37229 |
Dim.6 | 0.0538395 | 1.9906665 | 92.36295 |
Dim.7 | 0.0443676 | 1.6404525 | 94.00341 |
Dim.8 | 0.0324243 | 1.1988595 | 95.20227 |
Dim.9 | 0.0297937 | 1.1015939 | 96.30386 |
Dim.10 | 0.0215566 | 0.7970366 | 97.10090 |
Dim.11 | 0.0173647 | 0.6420450 | 97.74294 |
Dim.12 | 0.0145994 | 0.5397977 | 98.28274 |
Dim.13 | 0.0113917 | 0.4211964 | 98.70394 |
Dim.14 | 0.0069264 | 0.2560972 | 98.96003 |
Dim.15 | 0.0042735 | 0.1580104 | 99.11804 |
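The three columns of this table can also be derived directly from the prcomp object: the eigenvalues are the squared standard deviations of the components, and the percentages follow from their share of the total. The sketch below reproduces the threshold selection in base R on the built-in iris data, which is used here purely as a stand-in for the spectral data.

```r
# PCA on a small built-in data set (stand-in for the spectra)
pca_toy = prcomp(iris[, 1:4])

eigenvalue = pca_toy$sdev^2                   # variance carried by each component
var_pct    = eigenvalue / sum(eigenvalue) * 100
cum_pct    = cumsum(var_pct)

# first component at which the cumulative variance reaches the threshold
threshold = 99
k = which(cum_pct >= threshold)[1]
scores = pca_toy$x[, 1:k, drop = FALSE]       # reduced data set
```

The drop = FALSE guard keeps scores a matrix even if a single component already reaches the threshold.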
We effectively reduced the number of variables from 1863 to 15, which still carry 99% of the variance found in the original data set. However, when it comes to machine learning, it is important to realise that this new data set is not fit to be used in a training process. If we now randomly split the observations into training and test sets, we effectively mix up the two, because information from the test set has already influenced the outcome of the PCA. Therefore, the data set needs to be split before the PCA. The analysis is done on the training data only, and the same orthogonal transformation is then applied to the test data. This way it can be ensured that the test set is truly independent of the training process. Here, we apply a 10-fold cross-validation which is repeated 5 times. The following code takes a complete data set as input, applies a splitting function from the caret package, builds the PCA upon the training set and finally applies the same transformation to the test set. We apply it to the raw data only. Also, in each fold we randomly split the data into 50% training and 50% test.
folds = 10
repeats = 5
split_percentage = 0.5
threshold = 99
tmp = test_dataset[["noise0"]][["raw"]]
set.seed(42) # ensure reproducibility
# one set of random partitions per repeat
fold_index = lapply(1:repeats, function(i){
  caret::createDataPartition(y = tmp$class, times = folds, p = split_percentage)
})
fold_index = do.call(c, fold_index)
pcaData = list()
for (rep in 1:repeats){
  # jumps to the correct number of folds forward in each repeat
  rep_index = fold_index[(rep*folds-folds+1):(rep*folds)]
  pcadata_fold = lapply(1:folds,function(x){
    # splitting for current fold
    training = tmp[unlist(rep_index[x]),]
    validation = tmp[-unlist(rep_index[x]),]
    # keep response
    responseTrain = training$class
    responseVal = validation$class
    # apply PCA to the training data only
    pca = prcomp(training[,1:1863])
    varInfo = factoextra::get_eigenvalue(pca)
    thresInd = which(varInfo$cumulative.variance.percent >= threshold)[1]
    pca_training = pca$x[ ,1:thresInd]
    # project the validation data with the training rotation
    pca_validation = predict(pca, validation)[ ,1:thresInd]
    training = as.data.frame(pca_training)
    training$response = responseTrain
    validation = as.data.frame(pca_validation)
    validation$response = responseVal
    foldtmp = list(training, validation)
    names(foldtmp) = c("training","validation")
    return(foldtmp)
  })
  names(pcadata_fold) = paste("fold", 1:folds, sep ="")
  pcaData[[paste0("repeat",rep)]] = pcadata_fold
}
We now have a list object with as many elements as there are repeats. Below each repeat element we can access the individual folds. There we find two elements, which we can access by referring to "training" and "validation".
pcaData[["repeat5"]][["fold10"]][["training"]][1:3,1:3]
PC1 PC2 PC3
1 -0.4779164 -0.7786770 -0.02761266
2 -0.4869471 -0.7181236 -0.02492981
4 -0.4661699 -0.7463479 -0.11424932
pcaData[["repeat5"]][["fold10"]][["validation"]][1:3,1:3]
PC1 PC2 PC3
3 -0.5240744 -0.5931111 0.05961383
5 -0.4718366 -0.6995376 -0.09273887
7 -0.5255215 -0.5928214 0.05184969
summary(pcaData[["repeat5"]][["fold10"]][["training"]]$response)
FIBRE FUR HDPE LDPE PA PE PES PET PP PS PUR WOOD
14 12 5 6 7 4 8 5 6 4 4 2
summary(pcaData[["repeat5"]][["fold10"]][["validation"]]$response)
FIBRE FUR HDPE LDPE PA PE PES PET PP PS PUR WOOD
13 11 5 5 7 4 7 4 6 3 3 2
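The step that keeps the validation data independent is the call to predict on the fitted prcomp object: it centres the new observations with the means of the training data and multiplies them with the training rotation matrix. The following sketch, again using the built-in iris data as a stand-in, checks that this is equivalent to doing the projection by hand.

```r
set.seed(42)
idx   = sample(nrow(iris), 75)   # random 50/50 split of the toy data
train = iris[idx, 1:4]
valid = iris[-idx, 1:4]

# PCA is fitted on the training rows only
pca_train = prcomp(train)

# projection of the validation rows via predict()
proj_predict = predict(pca_train, valid)

# the same projection by hand: subtract training means, apply training rotation
proj_manual = sweep(as.matrix(valid), 2, pca_train$center) %*% pca_train$rotation

max_diff = max(abs(proj_predict - proj_manual))
```

No statistic of the validation rows enters the transformation, which is why the validation scores are not centred around zero in general.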
For the SVM algorithm we also implemented a simple search for the optimal values of the regularization parameter C and the kernel width γ. We did this by setting up a search grid for the parameters and evaluating each possible combination. Note that, due to limited computing capacity, we restricted the search to 25 combinations only. The code below fits a model for each combination, evaluates its capacity to correctly classify the training data and then uses the best model to classify the validation data.
training = pcaData[["repeat1"]][["fold1"]][["training"]]
validation = pcaData[["repeat1"]][["fold1"]][["validation"]]
# all columns except the last one (response) are predictors
x_train = training[ ,1:(ncol(training)-1)]
y_train = training$response
x_test = validation[ ,1:(ncol(validation)-1)]
y_test = validation$response
# 5 gamma values x 5 cost values = 25 combinations
tuneGrid = expand.grid(gamma = seq(0.1,1,0.2), cost = seq(1,5,1))
accuracy = c()
models = list()
for (i in 1:nrow(tuneGrid)){
  model = e1071::svm(x = x_train, y = y_train,
                     kernel = "radial",
                     gamma = tuneGrid$gamma[i],
                     cost = tuneGrid$cost[i])
  # kappa score on the training data
  pred = predict(model, x_train)
  conf = caret::confusionMatrix(pred, y_train)
  accuracy = c(accuracy, conf$overall["Kappa"])
  models[[i]] = model
}
# first model with the highest training kappa
best_model = models[[which(accuracy == max(accuracy))[1]]]
prediction = predict(best_model, x_test)
confMat = caret::confusionMatrix(prediction, y_test)
print(confMat$table)
Reference
Prediction FIBRE FUR HDPE LDPE PA PE PES PET PP PS PUR WOOD
FIBRE 12 0 0 0 0 0 5 2 0 0 0 2
FUR 1 11 0 0 0 0 0 0 0 0 0 0
HDPE 0 0 3 0 0 0 0 0 0 0 0 0
LDPE 0 0 0 5 0 2 0 0 0 0 0 0
PA 0 0 0 0 7 0 0 0 0 0 0 0
PE 0 0 1 0 0 2 0 0 0 0 0 0
PES 0 0 0 0 0 0 2 1 0 0 0 0
PET 0 0 1 0 0 0 0 1 0 2 0 0
PP 0 0 0 0 0 0 0 0 6 0 0 0
PS 0 0 0 0 0 0 0 0 0 1 0 0
PUR 0 0 0 0 0 0 0 0 0 0 3 0
WOOD 0 0 0 0 0 0 0 0 0 0 0 0
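The model selection above relies on the "Kappa" entry of the confusion matrix statistics. Cohen's kappa compares the observed agreement between prediction and reference with the agreement expected by chance from the row and column totals. The sketch below computes it by hand for a made-up 2x2 table; the helper name cohen_kappa and the toy counts are assumptions for illustration.

```r
# Hypothetical helper: unweighted Cohen's kappa from a square confusion table
cohen_kappa = function(tab) {
  tab = as.matrix(tab)
  n = sum(tab)
  p_observed = sum(diag(tab)) / n                       # share of correct predictions
  p_expected = sum(rowSums(tab) * colSums(tab)) / n^2   # agreement expected by chance
  (p_observed - p_expected) / (1 - p_expected)
}

# toy confusion table (made-up counts): rows = prediction, columns = reference
toy = matrix(c(20, 5, 10, 15), nrow = 2,
             dimnames = list(Prediction = c("A", "B"), Reference = c("A", "B")))
kappa_toy = cohen_kappa(toy)  # observed 0.7, expected 0.5, kappa 0.4
```

Unlike plain accuracy, kappa corrects for the agreement that unequal class frequencies would produce by chance, which matters here because classes such as FIBRE and FUR are much larger than WOOD or PS.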
This code is integrated into a function that applies this search pattern to all folds of all repeats and can be found here. Finally, we can apply this function to the different levels of preprocessing discussed before and collect the kappa scores in the results object.
source("code/functions.R")
wavenumbers = readRDS(paste0(ref,"wavenumbers.rds"))
# add noise to data
noisyData = addNoise(data,levels = c(0,10,100,250,500), category = "class")
# preprocessing
testDataset = createTrainingSet(noisyData, category = "class",
SGpara = list(p=3, w=11), lag=15,
type = c("raw", "norm", "sg", "sg.d1", "sg.d2",
"sg.norm", "sg.norm.d1", "sg.norm.d2",
"raw.d1", "raw.d2", "norm.d1", "norm.d2"))
types = names(testDataset[[1]])
levels = lapply(names(testDataset), function(x){
rep(x, length(types))
})
levels = unlist(levels)
results = data.frame(level=levels, type = types, kappa = rep(0,length(levels)))
for (level in unique(levels)){
for (type in types){
print(paste0("Level: ",level," Type: ",type))
tmpData = testDataset[[level]][[type]]
tmpData[which(wavenumbers<=2420 & wavenumbers>=2200)] = 0 # setting CO2 window to 0
tmpModel = pcaCV(tmpData, folds = 10, repeats = 5, threshold = 99, metric = "Kappa", p=0.5, method="svm")
saveRDS(tmpModel,file = paste0(output,"svm/model_",level,"_",type,"_",round(tmpModel[[1]],2),".rds"))
results[which(results$level==level & results$type==type),"kappa"] = as.numeric(tmpModel[[1]])
print(results)
}
}
saveRDS(results, paste0(output,"svm/exploration.rds"))
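The CO2 masking in the loop selects columns by their position in the wavenumbers vector, which assumes that the columns of every transformation line up with that vector. For transformations that drop leading columns, such as the derivatives printed earlier, this may not hold. As a possible alternative, the sketch below selects the window by parsing the wavenumber from the column names. The helper name mask_window and the toy data frame are assumptions for illustration; only the wvn prefix is taken from the printed column names.

```r
# Hypothetical helper: set all spectral columns inside a wavenumber window to 0
mask_window = function(df, lower = 2200, upper = 2420, prefix = "wvn") {
  is_spec = startsWith(names(df), prefix)
  wn = rep(NA_real_, ncol(df))
  # parse the numeric wavenumber from names such as "wvn2400.1"
  wn[is_spec] = as.numeric(sub(prefix, "", names(df)[is_spec], fixed = TRUE))
  in_window = !is.na(wn) & wn >= lower & wn <= upper
  df[in_window] = 0
  df
}

# toy spectrum with made-up values
toy_spec = data.frame(wvn2500.5 = 0.3, wvn2400.1 = 0.8, wvn2300 = 0.9,
                      wvn2100 = 0.2, class = "PE")
masked = mask_window(toy_spec)
```

Non-spectral columns such as class are left untouched because they do not carry the prefix.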
We can now take a look at the kappa scores the algorithm achieved during training for different representations of the data and at increasing noise levels.
We can observe that the kappa score drops markedly at higher noise levels. All data transformations yield very low accuracies when the noise increases. In the absence of significant noise, however, the simple Savitzky-Golay filter, the raw data and the first- and second-order derivatives yield a kappa score of about 0.75. We can look at this more quantitatively by calculating the average slope for each data transformation and ordering the data frame from the smallest to the largest decrease. Note that we only use the kappa scores at noise levels 0 and 10 to calculate the average slope.
noise0 = results[results$level == "noise0", ]
noise10 = results[results$level == "noise10", ]
slopes = noise10$kappa - noise0$kappa
types = unique(results$type)
df = data.frame(type = types, slope = slopes, row.names = NULL)
df = df[order(-slopes),]
row | type | slope |
---|---|---|
3 | sg | -0.0007272 |
1 | raw | -0.0011139 |
4 | sg.d1 | -0.0148920 |
10 | raw.d2 | -0.0161989 |
9 | raw.d1 | -0.0167638 |
5 | sg.d2 | -0.0642191 |
7 | sg.norm.d1 | -0.2300057 |
8 | sg.norm.d2 | -0.2324091 |
2 | norm | -0.5853342 |
6 | sg.norm | -0.6006121 |
We can now confirm that the average decrease in kappa score is lowest for the Savitzky-Golay filtered data, followed by the raw data. The first derivative of the filtered data shows the next smallest decrease, but it has to be noted that the overall kappa level of this transformation is lower than that of the derivatives of the unfiltered data.
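The slope calculation subtracts two vectors and therefore relies on the rows for noise levels 0 and 10 being in the same order of transformation types. A slightly more defensive variant, sketched below, pairs the rows explicitly by type with merge before taking the difference. The kappa values in toy_results are made up for illustration and are not results of this analysis.

```r
# made-up kappa scores, with the types deliberately in a different order per level
toy_results = data.frame(
  level = rep(c("noise0", "noise10"), each = 3),
  type  = c("raw", "norm", "sg", "sg", "raw", "norm"),
  kappa = c(0.75, 0.70, 0.76, 0.755, 0.74, 0.10)
)

n0  = toy_results[toy_results$level == "noise0",  c("type", "kappa")]
n10 = toy_results[toy_results$level == "noise10", c("type", "kappa")]

# pair the two noise levels by transformation type
paired = merge(n0, n10, by = "type", suffixes = c(".noise0", ".noise10"))
paired$slope = paired$kappa.noise10 - paired$kappa.noise0

# smallest decrease first
paired = paired[order(-paired$slope), ]
```

With matching row order both approaches give the same slopes; the merge only guards against a reordered results object.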
Ami, Diletta, Paolo Mereghetti, and Silvia Maria. 2013. “Multivariate Analysis for Fourier Transform Infrared Spectra of Complex Biological Systems and Processes.” Multivariate Analysis in Management, Engineering and the Sciences. https://doi.org/10.5772/53850.
Boser, Bernhard E., Isabelle M. Guyon, and Vladimir N. Vapnik. 2004. “A training algorithm for optimal margin classifiers.” In Proceedings of the Fifth Annual Workshop on Computational Learning Theory, 144–52. COLT ’92. New York, NY, USA: ACM. https://doi.org/10.1145/130385.130401.
Burges, Christopher J.C. 1998. “A tutorial on support vector machines for pattern recognition.” Data Mining and Knowledge Discovery 2 (2): 121–67. https://doi.org/10.1023/A:1009715923555.
Hori, Ritsuko, and Junji Sugiyama. 2003. “A combined FT-IR microscopy and principal component analysis on softwood cell walls.” Carbohydrate Polymers 52 (4): 449–53. https://doi.org/10.1016/S0144-8617(03)00013-4.
Jung, Melissa R., F. David Horgen, Sara V. Orski, Viviana Rodriguez C., Kathryn L. Beers, George H. Balazs, T. Todd Jones, et al. 2018. “Validation of ATR FT-IR to identify polymers of plastic marine debris, including those ingested by marine organisms.” Marine Pollution Bulletin 127 (December 2017). Elsevier: 704–16. https://doi.org/10.1016/j.marpolbul.2017.12.061.
Mueller, Daniela, Marco Flôres Ferrão, Luciano Marder, Adilson Ben da Costa, and Rosana de Cássia de Souza Schneider. 2013. “Fourier transform infrared spectroscopy (FTIR) and multivariate analysis for identification of different vegetable oils used in biodiesel production.” Sensors (Switzerland) 13 (4): 4258–71. https://doi.org/10.3390/s130404258.
Nieuwoudt, Helene H., Bernard A. Prior, Isak S. Pretorius, Marena Manley, and Florian F. Bauer. 2004. “Principal component analysis applied to Fourier transform infrared spectroscopy for the design of calibration sets for glycerol prediction models in wine and for the detection and classification of outlier samples.” Journal of Agricultural and Food Chemistry 52 (12): 3726–35. https://doi.org/10.1021/jf035431q.
Primpke, Sebastian, Marisa Wirth, Claudia Lorenz, and Gunnar Gerdts. 2018. “Reference database design for the automated analysis of microplastic samples based on Fourier transform infrared (FTIR) spectroscopy.” Analytical and Bioanalytical Chemistry 410 (21): 5131–41. https://doi.org/10.1007/s00216-018-1156-x.
Savitzky, Abraham, and Marcel J.E. Golay. 1964. “Smoothing and Differentiation of Data by Simplified Least Squares Procedures.” Analytical Chemistry 36 (8): 1627–39. https://doi.org/10.1021/ac60214a047.
sessionInfo()
R version 3.6.1 (2019-07-05)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Linux Mint 19.1
Matrix products: default
BLAS: /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.7.1
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.7.1
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=de_DE.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=de_DE.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=de_DE.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] plotly_4.9.0 knitr_1.24
[3] factoextra_1.0.5 tensorflow_1.14.0
[5] abind_1.4-5 e1071_1.7-2
[7] keras_2.2.4.1 workflowr_1.4.0.9001
[9] baseline_1.2-1 gridExtra_2.3
[11] stringr_1.4.0 prospectr_0.1.3
[13] RcppArmadillo_0.9.600.4.0 openxlsx_4.1.0.1
[15] magrittr_1.5 ggplot2_3.2.0
[17] reshape2_1.4.3 dplyr_0.8.3
loaded via a namespace (and not attached):
[1] httr_1.4.1 tidyr_0.8.3 viridisLite_0.3.0
[4] jsonlite_1.6 splines_3.6.1 foreach_1.4.7
[7] prodlim_2018.04.18 shiny_1.3.2 assertthat_0.2.1
[10] stats4_3.6.1 highr_0.8 yaml_2.2.0
[13] ggrepel_0.8.1 ipred_0.9-9 pillar_1.4.2
[16] backports_1.1.4 lattice_0.20-38 glue_1.3.1
[19] reticulate_1.13 digest_0.6.20 promises_1.0.1
[22] colorspace_1.4-1 recipes_0.1.6 httpuv_1.5.1
[25] htmltools_0.3.6 Matrix_1.2-17 plyr_1.8.4
[28] timeDate_3043.102 pkgconfig_2.0.2 SparseM_1.77
[31] caret_6.0-84 xtable_1.8-4 purrr_0.3.2
[34] scales_1.0.0 whisker_0.3-2 later_0.8.0
[37] gower_0.2.1 lava_1.6.5 git2r_0.26.1
[40] tibble_2.1.3 generics_0.0.2 withr_2.1.2
[43] nnet_7.3-12 lazyeval_0.2.2 mime_0.7
[46] survival_2.44-1.1 crayon_1.3.4 evaluate_0.14
[49] fs_1.3.1 nlme_3.1-140 MASS_7.3-51.4
[52] class_7.3-15 tools_3.6.1 data.table_1.12.2
[55] munsell_0.5.0 zip_2.0.3 compiler_3.6.1
[58] rlang_0.4.0 grid_3.6.1 iterators_1.0.12
[61] htmlwidgets_1.3 crosstalk_1.0.0 labeling_0.3
[64] base64enc_0.1-3 rmarkdown_1.14 gtable_0.3.0
[67] ModelMetrics_1.2.2 codetools_0.2-16 R6_2.4.0
[70] tfruns_1.4 lubridate_1.7.4 zeallot_0.1.0
[73] rprojroot_1.3-2 stringi_1.4.3 Rcpp_1.0.2
[76] rpart_4.1-15 tidyselect_0.2.5 xfun_0.8