Last updated: 2019-08-22
Knit directory: polymeRID/
This reproducible R Markdown analysis was created with workflowr (version 1.4.0.9001). The Checks tab describes the reproducibility checks that were applied when the results were created. The Past versions tab lists the development history.
Great! Since the R Markdown file has been committed to the Git repository, you know the exact version of the code that produced these results.
Great job! The global environment was empty. Objects defined in the global environment can affect the analysis in your R Markdown file in unknown ways. For reproducibility it’s best to always run the code in an empty environment.
The command set.seed(20190729) was run prior to running the code in the R Markdown file. Setting a seed ensures that any results that rely on randomness, e.g. subsampling or permutations, are reproducible.
Great job! Recording the operating system, R version, and package versions is critical for reproducibility.
Nice! There were no cached chunks for this analysis, so you can be confident that you successfully produced the results during this run.
Great job! Using relative paths to the files within your workflowr project makes it easier to run your code on other machines.
Great! You are using Git for version control. Tracking code development and connecting the code version to the results is critical for reproducibility. The version displayed above was the version of the Git repository at the time these results were generated.
Note that you need to be careful to ensure that all relevant files for the analysis have been committed to Git prior to generating the results (you can use wflow_publish or wflow_git_commit). workflowr only checks the R Markdown file, but you know if there are other scripts or data files that it depends on. Below is the status of the Git repository when the results were generated:
Unstaged changes:
Modified: analysis/calibration.Rmd
Modified: analysis/classification.Rmd
Modified: analysis/cnn_crossvalidation.Rmd
Modified: analysis/cnn_exploration.Rmd
Note that any generated files, e.g. HTML, png, CSS, etc., are not included in this status report because it is ok for generated content to have uncommitted changes.
These are the previous versions of the R Markdown and HTML files. If you’ve configured a remote Git repository (see ?wflow_git_remote), click on the hyperlinks in the table below to view them.
File | Version | Author | Date | Message |
---|---|---|---|---|
Rmd | 0e16e95 | goergen95 | 2019-08-21 | updated rf_exploration.html |
html | 0e16e95 | goergen95 | 2019-08-21 | updated rf_exploration.html |
html | f2ee83c | goergen95 | 2019-08-19 | Build site. |
html | d960dc2 | goergen95 | 2019-08-19 | included calibration |
html | b846f0b | goergen95 | 2019-08-19 | Build site. |
Rmd | de84a71 | goergen95 | 2019-08-19 | large update for website |
html | de84a71 | goergen95 | 2019-08-19 | large update for website |
Rmd | 6bef5e6 | goergen95 | 2019-08-14 | confusion matrix output in rf_exploration |
html | 6bef5e6 | goergen95 | 2019-08-14 | confusion matrix output in rf_exploration |
html | 2385fbc | goergen95 | 2019-08-14 | republish for layout change |
Rmd | 293fd73 | goergen95 | 2019-08-14 | first step on svm_exploration |
html | 293fd73 | goergen95 | 2019-08-14 | first step on svm_exploration |
Rmd | 8b6e72a | goergen95 | 2019-08-14 | update rf_exploration |
html | 8b6e72a | goergen95 | 2019-08-14 | update rf_exploration |
Rmd | 5d28ce0 | goergen95 | 2019-08-14 | changed citation note |
html | 5d28ce0 | goergen95 | 2019-08-14 | changed citation note |
Rmd | c0c64ae | goergen95 | 2019-08-14 | proceed of rf_exploration |
html | c0c64ae | goergen95 | 2019-08-14 | proceed of rf_exploration |
Rmd | 90bd244 | goergen95 | 2019-08-14 | forward on rf_exploration |
html | 90bd244 | goergen95 | 2019-08-14 | forward on rf_exploration |
Rmd | c3f088e | goergen95 | 2019-08-13 | started exploration tab |
html | c3f088e | goergen95 | 2019-08-13 | started exploration tab |
Random Forest (RF) is a machine-learning algorithm based on the concept of traditional decision trees. Its popular implementation was developed by Breiman (2001). It is an ensemble classifier that has been reported to be the primary choice among the different types of ensemble classifiers due to its easy handling and high classification accuracy (Sagi and Rokach 2018). It is a non-parametric classification method in which each split of a tree is chosen from a randomly selected subset of variables to divide the input data into finer sub-categories. In RF, a user-specified number of trees is grown, and the final class decision for an observation is made by a simple majority vote of all trees in the forest. Internal accuracy assessment of the RF classifier is traditionally obtained through an out-of-bag (OOB) error estimation. In many cases, this error estimate is considered robust enough that no independent validation data set is used to test the generalization capacity. Here, we kept the number of trees fixed at 500 since no substantial gain in accuracy was observed above that threshold. In addition to the OOB estimate, we evaluated the RF model on a randomly chosen validation set of 50% of the original database.
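The two ideas mentioned above, majority voting and out-of-bag observations, can be illustrated with a few lines of base R. This is only a minimal sketch: the vote matrix is made up for illustration and no actual trees are grown.

```r
# Hypothetical votes of 5 trees (columns) for 4 observations (rows)
votes = matrix(c("PET", "PET", "PP",  "PET", "PS",
                 "PP",  "PP",  "PP",  "PS",  "PP",
                 "PS",  "PS",  "PET", "PS",  "PS",
                 "PET", "PP",  "PP",  "PP",  "PET"),
               nrow = 4, byrow = TRUE)

# Majority vote: the most frequent class in each row wins
majority = apply(votes, 1, function(v) names(which.max(table(v))))
print(majority)

# Out-of-bag idea: a bootstrap sample leaves out roughly a third of the
# observations, which can be used to test the tree grown on that sample
set.seed(1)
n = 1000
in_bag = sample(n, replace = TRUE)
oob_fraction = mean(!(seq_len(n) %in% in_bag))
print(oob_fraction)
```

In a real forest every tree has its own bootstrap sample, so each observation is out-of-bag for a different subset of trees.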
The representation of the data presented to RF is of high importance here. Different techniques of data pre-processing might emphasize different features of the patterns to be learned by an algorithm. To grasp this, different transformations of the data were presented to the RF algorithm. Additionally, the raw data signal was jittered to test which transformation might prove beneficial in delivering a high classification accuracy even in the presence of noise. To test this, we define a function which adds noise to the raw data and returns a list with the number of elements equal to the levels of noise applied.
addNoise = function(data, levels = c(0), category="class"){
  data.return = list()
  index = which(names(data) == category)
  for (n in levels){
    # numeric spectra without the class column
    tmp = as.matrix(data[ , -index])
    if (n == 0){
      # noise level 0 returns the unchanged input data
      tmp = data
    }else{
      # jitter the spectra and re-attach the class column
      tmp = as.data.frame(jitter(tmp, n))
      tmp[category] = data[category]
    }
    data.return[[paste("noise", n, sep="")]] = tmp
  }
  return(data.return)
}
data = read.csv(file = paste0(ref, "reference_database.csv"), header = TRUE)
noisy_data = addNoise(data, levels = c(0,10,100,250,500), category = "class")
# individual elements can be selected by using [[ and referring to the index or the name
head(noisy_data[["noise100"]])[1:3,1:3]
wvn3992.63003826141 wvn3990.70123147964 wvn3988.77242469788
1 -0.010328834 0.01829306 0.001603137
2 0.005237827 0.01683373 -0.013921331
3 -0.011751581 0.01958805 0.002650597
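To get a feeling for what the noise levels mean, the following sketch applies base R's jitter() with the factors 10, 100 and 500 to a made-up signal and reports the largest deviation from the original values. The signal x is a hypothetical stand-in for a spectrum; the absolute numbers depend on the spacing of the input values and will differ for the reference database.

```r
set.seed(42)
x = seq(0, 1, length.out = 200)   # hypothetical noise-free signal
factors = c(10, 100, 500)

# largest absolute deviation introduced by jitter() for each factor
max_dev = sapply(factors, function(f) max(abs(jitter(x, factor = f) - x)))
print(data.frame(factor = factors, max_deviation = max_dev))
```

Since jitter() adds uniformly distributed noise whose amplitude scales with the factor argument, the deviation grows with the noise level.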
In another function, which uses the noisy_data object as input, specific data transformations are applied. These are normalization, which centers and scales the input data, different forms of the Savitzky-Golay filter (Savitzky and Golay 1964), and first and second derivatives of the raw spectra. The function iterates through the elements in the noisy_data object and returns each specified transformation in a list element below the noise level. The exemplary function below applies the pre-processing for normalization, standard filtering and first derivative only. The implementation of the function used in the project can be found here.
createTrainingSet = function(data, category = "class",
                             SGpara = list(p=3,w=11), lag = 15){
  data.return = list()
  for (noise in names(data)){
    tmp = as.data.frame(data[[noise]])
    classes = tmp[,category]
    tmp = tmp[!names(tmp) %in% category]
    # original data
    data.return[[noise]][["raw"]] = as.data.frame(data[[noise]])
    # normalised data
    data_norm = preprocess(tmp, type="norm")
    data_norm[category] = classes
    data.return[[noise]][["norm"]] = data_norm
    # SG-filtered data
    data_sg = preprocess(tmp, type="sg", SGpara = SGpara)
    data_sg[category] = classes
    data.return[[noise]][["sg"]] = data_sg
    # first derivative of original data
    data_rawd1 = preprocess(tmp, type="raw.d1", lag = lag)
    data_rawd1[category] = classes
    data.return[[noise]][["raw.d1"]] = data_rawd1
  }
  return(data.return)
}
# applying the function
test_dataset = createTrainingSet(noisy_data, category = "class")
# individual transformations at a certain noise level can be accessed with [[
head(test_dataset[["noise500"]][["raw.d1"]])[1:3,1:3]
wvn3963.69793653488 wvn3961.76912975311 wvn3959.84032297134
1 0.08295627 0.02234302 0.17399975
2 0.04253842 0.07918397 0.03849980
3 -0.05579848 0.11670539 0.02503375
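The preprocess() function itself is not shown on this page. As a rough illustration of what a Savitzky-Golay smoother with polynomial order p = 3 and window size w = 11 does, the following base R sketch derives the filter weights from a local least-squares polynomial fit. The names sg_weights and sg_smooth are hypothetical helpers and the project's actual implementation may differ.

```r
# Weights of a Savitzky-Golay smoother: least-squares fit of a polynomial
# of order p within a window of w points, evaluated at the window centre
sg_weights = function(p = 3, w = 11) {
  half = (w - 1) / 2
  A = outer(-half:half, 0:p, "^")      # design matrix of the local polynomial
  solve(t(A) %*% A, t(A))[1, ]         # row giving the fitted centre value
}

# Apply the weights as a centred moving filter (window edges become NA)
sg_smooth = function(y, p = 3, w = 11) {
  as.numeric(stats::filter(y, sg_weights(p, w), sides = 2))
}

wts = sg_weights(3, 11)
print(round(wts, 4))

# A cubic signal is reproduced exactly by an order-3 filter
t_axis = seq(-1, 1, length.out = 50)
cubic = 1 + 2 * t_axis - t_axis^3
smoothed = sg_smooth(cubic)
inner = 6:45
print(max(abs(smoothed[inner] - cubic[inner])))
```

The first and last five values are NA because the window does not fit there; how the edges are handled in the project code is not covered by this sketch.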
The database of Primpke et al. (2018) currently contains 1863 variables for each observation. Most of these data points do not hold relevant information to distinguish between different types of particles. To shorten the computation time, one can use dimensionality reduction techniques such as principal component analysis (PCA). PCA has already been used to transform spectral data of microplastics in marine ecosystems (Jung et al. 2018; Lorenzo-Navarro et al. 2018). PCA takes the input data for a given number of observations and performs an orthogonal transformation to derive uncorrelated principal components from the possibly correlated variables. Both redundancies in the data and the presence of noise can be accounted for this way. PCA has previously been successfully applied to FTIR-spectroscopy data (Hori and Sugiyama 2003; Nieuwoudt et al. 2004; Mueller et al. 2013; Ami, Mereghetti, and Maria 2013; Fu, Toyoda, and Ihara 2014). At the same time, applying PCA significantly reduces the number of variables and thus speeds up the training process. Below we apply a PCA to the raw data as an example only.
library(factoextra)
tmp = test_dataset[["noise0"]][["raw"]]
pca = prcomp(tmp[ ,-1864]) # omitting class variable
var_info = factoextra::get_eigenvalue(pca)
# setting a threshold of 99% explained variance
threshold = 99
thresInd = which(var_info$cumulative.variance.percent>=threshold)[1]
pca_data = pca$x[,1:thresInd]
We can use the index variable thresInd we just defined to take a look at all the principal components which together explain 99% of the variance of the data.
component | eigenvalue | variance.percent | cumulative.variance.percent |
---|---|---|---|
Dim.1 | 1.5618496 | 57.7479523 | 57.74795 |
Dim.2 | 0.3875503 | 14.3293173 | 72.07727 |
Dim.3 | 0.2226548 | 8.2324559 | 80.30973 |
Dim.4 | 0.1823539 | 6.7423696 | 87.05209 |
Dim.5 | 0.0897978 | 3.3201919 | 90.37229 |
Dim.6 | 0.0538395 | 1.9906665 | 92.36295 |
Dim.7 | 0.0443676 | 1.6404525 | 94.00341 |
Dim.8 | 0.0324243 | 1.1988595 | 95.20227 |
Dim.9 | 0.0297937 | 1.1015939 | 96.30386 |
Dim.10 | 0.0215566 | 0.7970366 | 97.10090 |
Dim.11 | 0.0173647 | 0.6420450 | 97.74294 |
Dim.12 | 0.0145994 | 0.5397977 | 98.28274 |
Dim.13 | 0.0113917 | 0.4211964 | 98.70394 |
Dim.14 | 0.0069264 | 0.2560972 | 98.96003 |
Dim.15 | 0.0042735 | 0.1580104 | 99.11804 |
We effectively reduced the number of variables from 1863 to 15, which still retain 99% of the variance we find in the original data. When it comes to machine learning, however, it is important to realize that this new data set is not fit to be used in a training process. If we now randomly split the observations into a training and testing set, we effectively mix up these two sets because information of the testing set has already influenced the outcome of the PCA. Therefore, the data set needs to be split before applying the PCA. The analysis is done on the training data only and then the same orthogonal transformation is applied to the testing data. This way it can be ensured that the test set is truly independent of the training process.
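The principle can be sketched in a few lines of base R. The example uses the built-in iris data as a stand-in for the spectral database, which is an assumption for illustration only. The PCA is fitted on the training half and the test half is only projected onto the resulting components.

```r
set.seed(42)
X = as.matrix(iris[, 1:4])                 # stand-in for the spectra
train_idx = sample(nrow(X), size = nrow(X) / 2)

# fit the PCA on the training observations only
pca = prcomp(X[train_idx, ])
cum_var = cumsum(pca$sdev^2) / sum(pca$sdev^2) * 100
n_comp = which(cum_var >= 99)[1]

# scores of the training data and projection of the unseen test data
train_scores = pca$x[, 1:n_comp, drop = FALSE]
test_scores = predict(pca, X[-train_idx, ])[, 1:n_comp, drop = FALSE]

# predict() centres the new data with the training means and rotates it
manual = scale(X[-train_idx, ], center = pca$center, scale = FALSE) %*% pca$rotation
print(n_comp)
print(dim(test_scores))
```

The manual projection shows that the test observations never contribute to the centring values or to the rotation matrix.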
We apply a 10-fold cross-validation which is repeated five times. The following code takes a complete data set as input, applies a splitting function from the caret package, then builds the PCA upon the training set and finally applies the same transformation to the testing set. Here, it is only applied to the raw data. Note that each fold is drawn as an independent random split into a 50% training and a 50% testing set rather than as one of ten disjoint partitions.
folds = 10
repeats = 5
split_percentage = 0.5
threshold = 99
tmp = test_dataset[["noise0"]][["raw"]]
set.seed(42) # ensure reproducibility
# one set of random stratified partitions per repeat
fold_index = lapply(1:repeats, function(i){
  caret::createDataPartition(y = tmp$class, times = folds, p = split_percentage)
})
fold_index = do.call(c, fold_index)
pcaData = list()
for (rep in 1:repeats){
  rep_index = fold_index[(rep*folds-folds+1):(rep*folds)] # jumps to the correct number of folds forward in each repeat
  pcadata_fold = lapply(1:folds,function(x){
    # splitting for current fold
    training = tmp[unlist(rep_index[x]),]
    validation = tmp[-unlist(rep_index[x]),]
    # keep response
    responseTrain = training$class
    responseVal = validation$class
    # apply PCA
    pca = prcomp(training[,1:1863])
    varInfo = factoextra::get_eigenvalue(pca)
    thresInd = which(varInfo$cumulative.variance.percent >= threshold)[1]
    pca_training = pca$x[ ,1:thresInd]
    pca_validation = predict(pca, validation)[ ,1:thresInd]
    training = as.data.frame(pca_training)
    training$response = responseTrain
    validation = as.data.frame(pca_validation)
    validation$response = responseVal
    foldtmp = list(training, validation)
    names(foldtmp) = c("training","validation")
    return(foldtmp)
  })
  names(pcadata_fold) = paste("fold", 1:folds, sep ="")
  pcaData[[paste0("repeat",rep)]] = pcadata_fold
}
We now have a list object with the number of elements equivalent to the repeats. Below the level of repeats, individual folds can be accessed. There, the split data set can be accessed by referring to "training" and "validation".
pcaData[["repeat5"]][["fold10"]][["training"]][1:3,1:3]
PC1 PC2 PC3
1 -0.4779164 -0.7786770 -0.02761266
2 -0.4869471 -0.7181236 -0.02492981
4 -0.4661699 -0.7463479 -0.11424932
pcaData[["repeat5"]][["fold10"]][["validation"]][1:3,1:3]
PC1 PC2 PC3
3 -0.5240744 -0.5931111 0.05961383
5 -0.4718366 -0.6995376 -0.09273887
7 -0.5255215 -0.5928214 0.05184969
summary(pcaData[["repeat5"]][["fold10"]][["training"]]$response)
FIBRE FUR HDPE LDPE PA PE PES PET PP PS PUR WOOD
14 12 5 6 7 4 8 5 6 4 4 2
summary(pcaData[["repeat5"]][["fold10"]][["validation"]]$response)
FIBRE FUR HDPE LDPE PA PE PES PET PP PS PUR WOOD
13 11 5 5 7 4 7 4 6 3 3 2
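The class counts above show that the split keeps the class proportions roughly equal in both sets. A base R approximation of such a stratified split is sketched below. The helper stratified_split and the toy class vector are hypothetical, and the rounding rule (rounding up within each class) is an assumption of this sketch, not a description of caret's internals.

```r
# Draw a fraction p of the indices within each class separately
stratified_split = function(y, p = 0.5) {
  idx_by_class = split(seq_along(y), y)
  train = lapply(idx_by_class, function(idx) {
    idx[sample.int(length(idx), size = ceiling(p * length(idx)))]
  })
  sort(unlist(train, use.names = FALSE))
}

set.seed(42)
y = factor(rep(c("PET", "PP", "WOOD"), times = c(9, 12, 4)))  # toy classes
train_idx = stratified_split(y, p = 0.5)

print(table(y[train_idx]))    # training counts per class
print(table(y[-train_idx]))   # validation counts per class
```

Sampling within each class guarantees that even rare classes such as WOOD appear in both the training and the validation set.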
For the RF algorithm, only the parameter mtry needs a search pattern since we hold the number of trees constant at a value of 500. mtry specifies the number of variables considered at each split within a tree. We took the square root of the number of variables divided by 3 as the first mtry value, the square root itself as the second and the maximum number of variables as the third value. The optimal parameter is then selected to build the final model and its performance is evaluated with the test data.
training = pcaData[["repeat1"]][["fold1"]][["training"]]
validation =pcaData[["repeat1"]][["fold1"]][["validation"]]
x_train = training[ ,1:(ncol(training)-1)] # all columns except the response
y_train = training$response
x_test = validation[ ,1:(ncol(validation)-1)]
y_test = validation$response
first = floor(sqrt(ncol(x_train)))/3
if(first <= 1) first <- 1
second = floor(sqrt(ncol(x_train)))
last = ncol(x_train)
mtries = c(first,second,last)
# one model slot and one Kappa score per candidate mtry value
Mods = vector("list", length(mtries))
kappas = numeric(length(mtries))
for (i in seq_along(mtries)){
  Mods[[i]] = randomForest::randomForest(x_train,
                                         y_train,
                                         ntree=500,
                                         mtry=mtries[i])
  pred = predict(Mods[[i]], x_test)
  conf = caret::confusionMatrix(pred, y_test)
  kappas[i] = conf$overall["Kappa"]
}
# keep the model with the highest Kappa score (the first one in case of ties)
best_model = Mods[[which.max(kappas)]]
prediction = predict(best_model, x_test)
confMat = caret::confusionMatrix(prediction, y_test)
print(confMat$table)
Reference
Prediction FIBRE FUR HDPE LDPE PA PE PES PET PP PS PUR WOOD
FIBRE 8 0 0 0 0 0 0 0 0 0 0 1
FUR 2 11 0 0 0 0 0 0 0 0 0 0
HDPE 0 0 3 0 0 0 0 0 0 0 0 0
LDPE 0 0 0 5 0 0 0 0 0 0 0 0
PA 0 0 0 0 7 0 0 0 0 0 0 0
PE 0 0 1 0 0 4 0 0 0 0 0 0
PES 0 0 0 0 0 0 6 0 0 0 0 0
PET 0 0 1 0 0 0 1 4 0 0 0 0
PP 0 0 0 0 0 0 0 0 6 0 0 0
PS 0 0 0 0 0 0 0 0 0 3 0 0
PUR 0 0 0 0 0 0 0 0 0 0 3 0
WOOD 3 0 0 0 0 0 0 0 0 0 0 1
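Model selection in this analysis is based on the Kappa score, which corrects the observed accuracy for the agreement expected by chance. As a minimal sketch, Cohen's Kappa can be computed from any confusion matrix as follows. The helper cohen_kappa and the 2x2 example table are made up for illustration; caret::confusionMatrix reports the same statistic in its overall results.

```r
# Cohen's Kappa from a square confusion matrix (predictions in rows,
# reference classes in columns)
cohen_kappa = function(tab) {
  n = sum(tab)
  p_observed = sum(diag(tab)) / n                       # overall accuracy
  p_expected = sum(rowSums(tab) * colSums(tab)) / n^2   # chance agreement
  (p_observed - p_expected) / (1 - p_expected)
}

# hypothetical two-class confusion matrix
tab = matrix(c(20,  5,
               10, 15),
             nrow = 2, byrow = TRUE,
             dimnames = list(Prediction = c("A", "B"), Reference = c("A", "B")))
kappa_value = cohen_kappa(tab)
print(kappa_value)
```

In this toy example the accuracy is 0.7 and the chance agreement is 0.5, which gives a Kappa of 0.4.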
The process of splitting the data set into training and testing was automated by putting the above code in a function which can be found here. Finally, this function was applied to the different pre-processing levels, as discussed before.
source("code/functions.R")
wavenumbers = readRDS(paste0(ref,"wavenumbers.rds"))
# add noise to data
noisyData = addNoise(data,levels = c(0,10,100,250,500), category = "class")
# preprocessing
testDataset = createTrainingSet(noisyData, category = "class",
SGpara = list(p=3, w=11), lag=15,
type = c("raw", "norm", "sg", "sg.d1", "sg.d2",
"sg.norm", "sg.norm.d1", "sg.norm.d2",
"raw.d1", "raw.d2", "norm.d1", "norm.d2"))
types = names(testDataset[[1]])
levels = lapply(names(testDataset), function(x){
rep(x, length(types))
})
levels = unlist(levels)
results = data.frame(level=levels,type = types, kappa = rep(0,length(levels)))
for (level in unique(levels)){
  for (type in types){
    print(paste0("Level: ",level," Type: ",type))
    tmpData = testDataset[[level]][[type]]
    tmpData[which(wavenumbers<=2420 & wavenumbers>=2200)] = 0 # setting CO2 window to 0
    tmpModel = pcaCV(tmpData, folds = 10, repeats = 5, threshold = 99, metric = "Kappa", p=0.5, method="rf")
    saveRDS(tmpModel,file = paste0(output,"rf/model_",level,"_",type,"_",round(tmpModel[[1]],2),".rds"))
    results[which(results$level==level & results$type==type),"kappa"] = as.numeric(tmpModel[[1]])
    print(results)
  }
}
# save the results table (saveRDS needs the object as its first argument)
saveRDS(results, file = paste0(output,"rf/exploration.rds"))
The plot below shows the Kappa scores the algorithm achieved during training for different representations of the data at increasing noise levels.
It can be observed that the Kappa score is reduced substantially at higher noise levels. However, some data transformations are able to maintain a relatively high level of accuracy even in the presence of noise. Among the best transformations are the simple Savitzky-Golay filter and the same filter applied to the normalized data. Similarly robust results are observed for the raw and the normalized data. The other transformations do not show the same level of robustness to noise. Looking at this more quantitatively by calculating the change in Kappa score between the lowest and the highest noise level reveals the transformations with the most stable results.
noise0 = results[results$level == "noise0", ]
noise500 = results[results$level == "noise500", ]
slopes = noise500$kappa - noise0$kappa
types = unique(results$type)
df = data.frame(type = types, slope = slopes, row.names = NULL)
df = df[order(-slopes),]
row | type | slope |
---|---|---|
3 | sg | -0.1249425 |
6 | sg.norm | -0.1589274 |
1 | raw | -0.2333635 |
2 | norm | -0.2954945 |
9 | raw.d1 | -0.4136018 |
10 | raw.d2 | -0.5027853 |
4 | sg.d1 | -0.6313268 |
5 | sg.d2 | -0.6585208 |
8 | sg.norm.d2 | -0.8103958 |
7 | sg.norm.d1 | -0.8190748 |
This confirms that the decrease in Kappa score is lowest for the Savitzky-Golay filter applied to the raw data and to the normalized data, followed by the raw data and the normalized data themselves. The numbers can be interpreted as the loss in Kappa score when the noise level increases from 0 to 500.
Ami, Diletta, Paolo Mereghetti, and Silvia Maria. 2013. “Multivariate Analysis for Fourier Transform Infrared Spectra of Complex Biological Systems and Processes.” Multivariate Analysis in Management, Engineering and the Sciences. https://doi.org/10.5772/53850.
Breiman, L. 2001. “Random forests.” Machine Learning 45 (1): 5–32. https://doi.org/10.1023/A:1010933404324.
Fu, Yongwei, Kiyohiko Toyoda, and Ikko Ihara. 2014. “Application of ATR-FTIR spectroscopy and principal component analysis in characterization of 15-acetyldeoxynivalenol in corn oil.” Engineering in Agriculture, Environment and Food 7 (4). Elsevier: 163–68. https://doi.org/10.1016/j.eaef.2014.07.001.
Hori, Ritsuko, and Junji Sugiyama. 2003. “A combined FT-IR microscopy and principal component analysis on softwood cell walls.” Carbohydrate Polymers 52 (4): 449–53. https://doi.org/10.1016/S0144-8617(03)00013-4.
Jung, Melissa R., F. David Horgen, Sara V. Orski, Viviana Rodriguez C., Kathryn L. Beers, George H. Balazs, T. Todd Jones, et al. 2018. “Validation of ATR FT-IR to identify polymers of plastic marine debris, including those ingested by marine organisms.” Marine Pollution Bulletin 127 (December 2017). Elsevier: 704–16. https://doi.org/10.1016/j.marpolbul.2017.12.061.
Mueller, Daniela, Marco Flôres Ferrão, Luciano Marder, Adilson Ben da Costa, and Rosana de Cássia de Souza Schneider. 2013. “Fourier transform infrared spectroscopy (FTIR) and multivariate analysis for identification of different vegetable oils used in biodiesel production.” Sensors (Switzerland) 13 (4): 4258–71. https://doi.org/10.3390/s130404258.
Nieuwoudt, Helene H., Bernard A. Prior, Isak S. Pretorius, Marena Manley, and Florian F. Bauer. 2004. “Principal component analysis applied to Fourier transform infrared spectroscopy for the design of calibration sets for glycerol prediction models in wine and for the detection and classification of outlier samples.” Journal of Agricultural and Food Chemistry 52 (12): 3726–35. https://doi.org/10.1021/jf035431q.
Primpke, Sebastian, Marisa Wirth, Claudia Lorenz, and Gunnar Gerdts. 2018. “Reference database design for the automated analysis of microplastic samples based on Fourier transform infrared (FTIR) spectroscopy.” Analytical and Bioanalytical Chemistry 410 (21). Analytical; Bioanalytical Chemistry: 5131–41. https://doi.org/10.1007/s00216-018-1156-x.
Sagi, Omer, and Lior Rokach. 2018. “Ensemble learning: A survey.” Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery 8 (4): 1–18. https://doi.org/10.1002/widm.1249.
Savitzky, Abraham, and Marcel J.E. Golay. 1964. “Smoothing and Differentiation of Data by Simplified Least Squares Procedures.” Analytical Chemistry 36 (8): 1627–39. https://doi.org/10.1021/ac60214a047.
sessionInfo()
R version 3.6.1 (2019-07-05)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Linux Mint 19.1
Matrix products: default
BLAS: /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.7.1
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.7.1
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=de_DE.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=de_DE.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=de_DE.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] plotly_4.9.0 knitr_1.24
[3] factoextra_1.0.5 tensorflow_1.14.0
[5] abind_1.4-5 e1071_1.7-2
[7] keras_2.2.4.1 workflowr_1.4.0.9001
[9] baseline_1.2-1 gridExtra_2.3
[11] stringr_1.4.0 prospectr_0.1.3
[13] RcppArmadillo_0.9.600.4.0 openxlsx_4.1.0.1
[15] magrittr_1.5 ggplot2_3.2.0
[17] reshape2_1.4.3 dplyr_0.8.3
loaded via a namespace (and not attached):
[1] httr_1.4.1 tidyr_0.8.3 viridisLite_0.3.0
[4] jsonlite_1.6 splines_3.6.1 foreach_1.4.7
[7] prodlim_2018.04.18 shiny_1.3.2 assertthat_0.2.1
[10] stats4_3.6.1 highr_0.8 yaml_2.2.0
[13] ggrepel_0.8.1 ipred_0.9-9 pillar_1.4.2
[16] backports_1.1.4 lattice_0.20-38 glue_1.3.1
[19] reticulate_1.13 digest_0.6.20 promises_1.0.1
[22] randomForest_4.6-14 colorspace_1.4-1 recipes_0.1.6
[25] httpuv_1.5.1 htmltools_0.3.6 Matrix_1.2-17
[28] plyr_1.8.4 timeDate_3043.102 pkgconfig_2.0.2
[31] SparseM_1.77 caret_6.0-84 xtable_1.8-4
[34] purrr_0.3.2 scales_1.0.0 whisker_0.3-2
[37] later_0.8.0 gower_0.2.1 lava_1.6.5
[40] git2r_0.26.1 tibble_2.1.3 generics_0.0.2
[43] withr_2.1.2 nnet_7.3-12 lazyeval_0.2.2
[46] mime_0.7 survival_2.44-1.1 crayon_1.3.4
[49] evaluate_0.14 fs_1.3.1 nlme_3.1-140
[52] MASS_7.3-51.4 class_7.3-15 tools_3.6.1
[55] data.table_1.12.2 munsell_0.5.0 zip_2.0.3
[58] compiler_3.6.1 rlang_0.4.0 grid_3.6.1
[61] iterators_1.0.12 htmlwidgets_1.3 crosstalk_1.0.0
[64] labeling_0.3 base64enc_0.1-3 rmarkdown_1.14
[67] gtable_0.3.0 ModelMetrics_1.2.2 codetools_0.2-16
[70] R6_2.4.0 tfruns_1.4 lubridate_1.7.4
[73] zeallot_0.1.0 rprojroot_1.3-2 stringi_1.4.3
[76] Rcpp_1.0.2 rpart_4.1-15 tidyselect_0.2.5
[79] xfun_0.8