Last updated: 2024-09-06
Checks: 7 passed, 0 failed
Knit directory: myproject/
This reproducible R Markdown analysis was created with workflowr (version 1.7.1). The Checks tab describes the reproducibility checks that were applied when the results were created. The Past versions tab lists the development history.
Great! Since the R Markdown file has been committed to the Git repository, you know the exact version of the code that produced these results.
Great job! The global environment was empty. Objects defined in the global environment can affect the analysis in your R Markdown file in unknown ways. For reproducibility it’s best to always run the code in an empty environment.
The command set.seed(20240905) was run prior to running
the code in the R Markdown file. Setting a seed ensures that any results
that rely on randomness, e.g. subsampling or permutations, are
reproducible.
Great job! Recording the operating system, R version, and package versions is critical for reproducibility.
Nice! There were no cached chunks for this analysis, so you can be confident that you successfully produced the results during this run.
Great job! Using relative paths to the files within your workflowr project makes it easier to run your code on other machines.
Great! You are using Git for version control. Tracking code development and connecting the code version to the results is critical for reproducibility.
The results in this page were generated with repository version 9cfbfa1. See the Past versions tab to see a history of the changes made to the R Markdown and HTML files.
Note that you need to be careful to ensure that all relevant files for
the analysis have been committed to Git prior to generating the results
(you can use wflow_publish or
wflow_git_commit). workflowr only checks the R Markdown
file, but you know if there are other scripts or data files that it
depends on. Below is the status of the Git repository when the results
were generated:
Ignored files:
Ignored: .BidstackAds-b51a10b9/
Ignored: .RData
Ignored: .Rhistory
Ignored: .Trash/
Ignored: .android/
Ignored: Templates/
Ignored: Untitled Folder/
Ignored: ilifu/
Ignored: sql/
Untracked files:
Untracked: Library/
Untracked: .AliView/
Untracked: .CFUserTextEncoding
Untracked: .DS_Store
Untracked: .IdentityService/
Untracked: .R/
Untracked: .RDataTmp
Untracked: .Rapp.history
Untracked: .ServiceHub/
Untracked: .Xauthority
Untracked: .anaconda/
Untracked: .anyconnect
Untracked: .aspnet/
Untracked: .azcopy/
Untracked: .bash_history
Untracked: .bash_profile
Untracked: .bashrc
Untracked: .bidstack-device-id
Untracked: .cache/
Untracked: .cisco/
Untracked: .conda/
Untracked: .condarc
Untracked: .config/
Untracked: .continuum/
Untracked: .cups/
Untracked: .docker/
Untracked: .dotnet/
Untracked: .dropbox/
Untracked: .gitconfig
Untracked: .gitignore
Untracked: .globusonline/
Untracked: .gsutil/
Untracked: .idlerc/
Untracked: .ipynb_checkpoints/
Untracked: .ipython/
Untracked: .jupyter/
Untracked: .keras/
Untracked: .lesshst
Untracked: .local/
Untracked: .matplotlib/
Untracked: .mono/
Untracked: .nuget/
Untracked: .nuuid.ini
Untracked: .oracle_jre_usage/
Untracked: .pdfbox.cache
Untracked: .python_history
Untracked: .sqlite_history
Untracked: .ssh/
Untracked: .tcshrc
Untracked: .templateengine/
Untracked: .test.txt.swp
Untracked: .viminfo
Untracked: .vscode/
Untracked: .wget-hsts
Untracked: .wine/
Untracked: .wing101-9
Untracked: .xonshrc
Untracked: .zprofile
Untracked: .zprofile.pysave
Untracked: .zsh_history
Untracked: .zsh_sessions/
Untracked: .zshrc
Untracked: Applications (Parallels)/
Untracked: Applications/
Untracked: Chunk.R
Untracked: Desktop/
Untracked: Documents/
Untracked: Downloads/
Untracked: Dropbox/
Untracked: Library/
Untracked: Movies/
Untracked: Music/
Untracked: Parallels/
Untracked: Pedigree.R
Untracked: Pictures/
Untracked: PlayOnMac's virtual drives
Untracked: Projects/
Untracked: Public/
Untracked: Untitled.ipynb
Untracked: _TyranoGameData/
Untracked: anaconda3/
Untracked: annotation/
Untracked: bcftools-1.9.tar.bz2
Untracked: bcftools-1.9/
Untracked: clumping_EX.txt
Untracked: df.csv
Untracked: df_R_tut.csv
Untracked: eval "$(ssh-agent -s)"
Untracked: eval "$(ssh-agent -s)".pub
Untracked: gfortran-4.8.2-darwin13.tar.bz2
Untracked: gnomad.genomes.v3.1.1.sites.chr1.vcf.bgz
Untracked: myenv/
Untracked: text.txt
Untracked: venv/
Note that any generated files, e.g. HTML, png, CSS, etc., are not included in this status report because it is ok for generated content to have uncommitted changes.
There are no past versions. Publish this analysis with
wflow_publish() to start tracking its development.
Supervised learning is learning in which we teach or train the machine using well-labeled data, meaning data that is already tagged with the correct answer. The machine learns from a labeled training dataset, and the trained algorithm is then expected to produce correct outputs for new, unseen data.
Supervised learning comprises:
Regression, where the output is numerical
Classification, where the output is categorical
Regression analysis is a statistical technique used to model and analyze the relationship between a dependent variable and one (univariate regression) or more (multivariate regression) independent variables.
Simple linear regression involves a single independent variable and fits the equation
\[y = b_0 + b_1 x\]
where:
\(y\) is the dependent variable
\(x\) is the independent variable
\(b_0\) is the intercept
\(b_1\) is the slope of the linear graph
The caret library provides functions for data partitioning, model training, and model evaluation:
library(caret)
# Load the dataset
df <- read.csv('data/tumor_size_patient_survival.csv')
# Display the first rows
head(df)
tumor_size patient_survival
1 26.3 279.9
2 12.2 347.4
3 41.5 227.3
4 21.5 330.1
5 28.0 292.1
6 22.2 310.6
Functions like head(), summary(), and str() can be used to get an overview of the data.
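For a quick look at the structure and summary statistics of the dataset loaded above:
# Column names, types, and dimensions
str(df)
# Per-column summary statistics (min, quartiles, mean, max)
summary(df)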
The preprocessing step involves (a sketch of these steps follows the list):
handling missing values, either by removing them or by mean, median, or mode imputation
encoding categorical variables
normalising and standardising numerical features
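A minimal sketch of these steps, assuming a data frame dat with a numeric column x and a categorical column group (all names hypothetical):
# Median imputation for missing numeric values
dat$x[is.na(dat$x)] <- median(dat$x, na.rm = TRUE)
# Encode a categorical variable as a factor so that models create dummy variables
dat$group <- as.factor(dat$group)
# Centre and scale numeric features with caret's preProcess()
prep <- preProcess(dat["x"], method = c("center", "scale"))
dat["x"] <- predict(prep, dat["x"])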
Typically the dataset is split into 75% for training and 25% for testing. Evaluating on held-out data gauges how well the model generalises and guards against overfitting.
set.seed(45) # for reproducibility
trainIndex <- createDataPartition(df$patient_survival, p = 0.75, list = FALSE)
trainData <- df[trainIndex, ]
testData <- df[-trainIndex, ]
Model training involves fitting the model to the training set using the lm() function:
model <- lm(patient_survival ~ tumor_size, data = trainData)
# Extract coefficients
coefficients <- coef(model)
coefficients
(Intercept) tumor_size
395.077138 -3.853336
The linear equation fitted to the training data is
\[\widehat{\text{patient\_survival}} = 395.08 - 3.85 \times \text{tumor\_size}\]
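Plugging in a hypothetical tumor size of 30 mm gives 395.08 − 3.85 × 30 ≈ 279.5 days; predict() performs the same calculation:
# Predicted survival (in days) for a 30 mm tumor
predict(model, newdata = data.frame(tumor_size = 30))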
Model evaluation involves assessing the performance of the model on the test set. Common evaluation metrics include (definitions follow the list):
Mean Absolute Error (MAE)
Mean Squared Error (MSE)
Root Mean Squared Error (RMSE)
R-squared (R²) score
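For reference, with \(y_i\) the observed values, \(\hat{y}_i\) the predictions, \(\bar{y}\) the mean of the observations, and \(n\) the number of test samples:
\[\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n} \lvert y_i - \hat{y}_i \rvert, \qquad \mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}, \qquad R^2 = 1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2}\]
MSE is simply the square of RMSE.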
test_predictions <- predict(model, newdata = testData)
mae <- MAE(test_predictions, testData$patient_survival)
rmse <- RMSE(test_predictions, testData$patient_survival)
r2_score <- summary(model)$r.squared  # R-squared of the linear fit on the training data
cat('MAE on test set (in days): ', mae, "\n",
'RMSE on test set (in days): ', rmse, "\n",
'R-Squared Score: ', r2_score)
MAE on test set (in days): 13.04724
RMSE on test set (in days): 15.93548
R-Squared Score: 0.820814
library(ggplot2)
# Add a column to differentiate between training and test data
trainData$dataset <- "Training"
testData$dataset <- "Test"
# Combine train and test data into a single dataframe for plotting
combinedData <- rbind(trainData, testData)
# Create a scatter plot with regression line for both training and test sets
ggplot(combinedData, aes(x = tumor_size, y = patient_survival, color = dataset, shape = dataset)) +
geom_point(alpha = 0.7) +
geom_smooth(data = trainData, aes(x = tumor_size, y = patient_survival), method = "lm", se = FALSE, color = "#00008B") +
labs(title = "Relationship between Tumor Size and Patient Survival",
x = "Tumor Size (mm)",
y = "Patient Survival (Days)") +
theme_minimal() +
scale_color_manual(values = c("Training" = "blue", "Test" = "red")) +
scale_shape_manual(values = c("Training" = 16, "Test" = 16)) +
guides(color = guide_legend(title = "Dataset"),
shape = guide_legend(title = "Dataset"))
`geom_smooth()` using formula = 'y ~ x'
[Figure: scatter plot of tumor size (mm) vs. patient survival (days), training points in blue and test points in red, with the fitted regression line.]
Most real-life scenarios are characterised by multivariate or high-dimensional features, where more than one independent variable influences the target or dependent variable. Multivariate algorithms fit the model
\[y = b_0 + b_1 x_1 + b_2 x_2 + \cdots + b_n x_n\]
The mpg dataset from the ggplot2 package can be used for multivariate regression. It includes information on car attributes. We will choose some relevant attributes to predict hwy, highway miles per gallon (MPG).
# Load the dataset
library(ggplot2)
data(mpg)
df <- mpg
head(df)
# A tibble: 6 × 11
manufacturer model displ year cyl trans drv cty hwy fl class
<chr> <chr> <dbl> <int> <int> <chr> <chr> <int> <int> <chr> <chr>
1 audi a4 1.8 1999 4 auto(l5) f 18 29 p compa…
2 audi a4 1.8 1999 4 manual(m5) f 21 29 p compa…
3 audi a4 2 2008 4 manual(m6) f 20 31 p compa…
4 audi a4 2 2008 4 auto(av) f 21 30 p compa…
5 audi a4 2.8 1999 6 auto(l5) f 16 26 p compa…
6 audi a4 2.8 1999 6 manual(m5) f 18 26 p compa…
We can choose the predictors displ (engine displacement), cyl (number of cylinders), year (model year), and class (type of car). Since class is categorical, lm() encodes it as dummy variables, using the 2seater level as the reference category in the summary below.
library(caret)
set.seed(30) # for reproducibility
# Split the data into training and testing sets
trainIndex <- createDataPartition(df$hwy, p = 0.75, list = FALSE)
trainData <- df[trainIndex, ]
testData <- df[-trainIndex, ]
model_mv <- lm(hwy ~ displ + cyl + year + class, data = trainData)
summary(model_mv)
Call:
lm(formula = hwy ~ displ + cyl + year + class, data = trainData)
Residuals:
Min 1Q Median 3Q Max
-5.1700 -1.5322 -0.1473 1.0249 15.1030
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -183.14115 91.73394 -1.996 0.047511 *
displ -1.06300 0.52272 -2.034 0.043576 *
cyl -1.11663 0.36516 -3.058 0.002596 **
year 0.11135 0.04593 2.424 0.016399 *
classcompact -4.06888 1.94849 -2.088 0.038295 *
classmidsize -3.82081 1.89582 -2.015 0.045468 *
classminivan -6.82814 1.98819 -3.434 0.000749 ***
classpickup -10.55172 1.76709 -5.971 1.38e-08 ***
classsubcompact -3.29990 1.90675 -1.731 0.085364 .
classsuv -9.35511 1.70649 -5.482 1.53e-07 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 2.7 on 167 degrees of freedom
Multiple R-squared: 0.8091, Adjusted R-squared: 0.7989
F-statistic: 78.67 on 9 and 167 DF, p-value: < 2.2e-16
test_predictions_mv <- predict(model_mv, newdata = testData)
mae <- MAE(test_predictions_mv, testData$hwy)
rmse <- RMSE(test_predictions_mv, testData$hwy)
r2_score <- summary(model_mv)$r.squared  # R-squared of the multivariate fit on the training data
cat('MAE on test set (in MPG): ', mae, "\n",
'RMSE on test set (in MPG): ', rmse, "\n",
'R-Squared Score: ', r2_score)
MAE on test set (in MPG): 1.804035
RMSE on test set (in MPG): 2.325899
R-Squared Score: 0.8091
K-nearest neighbors works by directly measuring the (Euclidean) distance between observations and inferring the class of unlabeled data from the class of its nearest neighbors.
Typically in machine learning there are two clear steps: one first trains a model and then uses the model to predict new outputs (class labels in this case). With kNN, these two steps are combined into a single call to knn().
Let's draw a set of 50 random iris observations to train the model and predict the species of another set of 50 randomly chosen flowers. The knn() function takes the training data, the new data (to be classified), and the labels of the training data, and returns (by default) the predicted class.
set.seed(12L)
train <- sample(150, 50)
test <- sample(150, 50)  # drawn independently of train, so the two sets may overlap
library("class")
knnres <- knn(iris[train, -5], iris[test, -5], iris$Species[train])
head(knnres)
[1] versicolor setosa versicolor setosa setosa setosa
Levels: setosa versicolor virginica
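Since the true species of the test flowers are known here, the predictions can be scored directly. A minimal check, reusing the train and test indices from above (k defaults to 1 in knn()):
# Proportion of test flowers classified correctly
mean(knnres == iris$Species[test])
# A larger neighbourhood can be requested, e.g. k = 5
knnres5 <- knn(iris[train, -5], iris[test, -5], iris$Species[train], k = 5)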
Cross-validation is a technique used to assess how well a model generalises to new data. It involves dividing the dataset into multiple folds, then repeatedly training the model on all but one fold and validating it on the held-out fold.
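As a minimal sketch of this idea with caret (reusing the mpg data frame from above; KODAMA's implementation follows below):
# 10-fold cross-validation of the multivariate linear model
ctrl <- trainControl(method = "cv", number = 10)
cv_model <- train(hwy ~ displ + cyl + year + class, data = df, method = "lm", trControl = ctrl)
cv_model$results  # RMSE, R-squared, and MAE averaged across the 10 folds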
The pls.double.cv() function in KODAMA performs a 10-fold cross-validation on a given dataset using a Partial Least Squares (PLS) model. To assess the predictive ability of the model, splits are generated with a 1:9 ratio of the dataset, and permutation testing is used to estimate the classification/regression performance of the predictors.
library(KODAMA)
data(iris)
# Predictors: the four numeric measurements; labels: the Species factor
data <- iris[, -5]
labels <- iris[, 5]
pp <- pls.double.cv(data, labels)
..........
print(pp$Q2Y)
[1] 0.5671665 0.5591931 0.5659959 0.5611964 0.5616126 0.5761351 0.5695032
[8] 0.5516902 0.5649377 0.5630635
table(pp$Ypred,labels)
labels
setosa versicolor virginica
setosa 49 0 0
versicolor 1 34 10
virginica 0 16 40
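The diagonal of this table counts the correctly classified samples, so the overall cross-validated accuracy is (49 + 34 + 40)/150 = 82%:
# Overall cross-validated classification accuracy
sum(diag(table(pp$Ypred, labels))) / length(labels)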
Feature engineering is the process of modifying and converting the input features of a dataset by applying mathematical operations, to improve the learning and prediction performance of ML models.
Transformation techniques include scaling, normalisation, and log transformation, which deal with differences in scale and distribution between features, non-linearity, and outliers.
Input features (variables) may have different units, e.g. kilometres, days, years, etc., and therefore different scales and probably different distributions, which makes it harder for ML algorithms to learn from the data.
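To illustrate, consider two made-up features in very different units; base R's scale() (autoscaling) puts them on a comparable footing:
# Hypothetical features measured in kilometres and days
x <- data.frame(distance_km = c(1200, 3400, 800), age_days = c(3, 10, 7))
round(scale(x), 2)  # each column now has mean 0 and standard deviation 1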
A number of different normalization methods are provided in KODAMA:
“none”: no normalization method is applied.
“pqn”: the Probabilistic Quotient Normalization is computed as described in Dieterle, et al. (2006).
“sum”: samples are normalized to the sum of the absolute value of all variables for a given sample.
“median”: samples are normalized to the median value of all variables for a given sample.
“sqrt”: samples are normalized to the root of the sum of the squared value of all variables for a given sample.
library(KODAMA)
data(MetRef)
u <- MetRef$data
u <- u[, -which(colSums(u) == 0)]  # drop variables that are all zero
u <- normalization(u)$newXtrain    # apply KODAMA's default normalization
class <- as.numeric(as.factor(MetRef$gender))
cc <- pca(u)
plot(cc$x, pch = 21, bg = class)
[Figure: PCA score plot of the normalized MetRef data, points coloured by gender.]
A number of different scaling methods are provided in KODAMA:
“none”: no scaling method is applied.
“centering”: centers the mean to zero.
“autoscaling”: centers the mean to zero and scales data by dividing each variable by the variance.
“rangescaling”: centers the mean to zero and scales data by dividing each variable by the difference between the minimum and the maximum value.
“paretoscaling”: centers the mean to zero and scales data by dividing each variable by the square root of the standard deviation.
Unit scaling, by contrast, divides each variable by the standard deviation so that each variable has variance equal to 1.
library(KODAMA)
data(MetRef)
u <- MetRef$data
u <- u[, -which(colSums(u) == 0)]  # drop variables that are all zero
u <- scaling(u)$newXtrain          # apply KODAMA's default scaling
class <- as.numeric(as.factor(MetRef$gender))
cc <- pca(u)
plot(cc$x, pch = 21, bg = class, xlab = cc$txt[1], ylab = cc$txt[2])
[Figure: PCA score plot of the scaled MetRef data, points coloured by gender.]
We can combine both normalisation and scaling to see the difference in the output
library(KODAMA)
data(MetRef)
u <- MetRef$data
u <- u[, -which(colSums(u) == 0)]  # drop variables that are all zero
u <- normalization(u)$newXtrain    # normalize samples first
u <- scaling(u)$newXtrain          # then scale the variables
class <- as.numeric(as.factor(MetRef$gender))
cc <- pca(u)
plot(cc$x, pch = 21, bg = class, xlab = cc$txt[1], ylab = cc$txt[2])
[Figure: PCA score plot of the normalized and scaled MetRef data, points coloured by gender.]
sessionInfo()
R version 4.4.1 (2024-06-14)
Platform: aarch64-apple-darwin20
Running under: macOS Sonoma 14.6.1
Matrix products: default
BLAS: /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRblas.0.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRlapack.dylib; LAPACK version 3.12.0
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
time zone: Africa/Johannesburg
tzcode source: internal
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] KODAMA_3.1 umap_0.2.10.0 Rtsne_0.17 minerva_1.5.10 class_7.3-22
[6] caret_6.0-94 lattice_0.22-6 ggplot2_3.5.1
loaded via a namespace (and not attached):
[1] tidyselect_1.2.1 timeDate_4032.109 dplyr_1.1.4
[4] farver_2.1.2 fastmap_1.2.0 pROC_1.18.5
[7] promises_1.3.0 digest_0.6.37 rpart_4.1.23
[10] timechange_0.3.0 lifecycle_1.0.4 survival_3.7-0
[13] magrittr_2.0.3 compiler_4.4.1 rlang_1.1.4
[16] sass_0.4.9 tools_4.4.1 utf8_1.2.4
[19] yaml_2.3.10 data.table_1.16.0 knitr_1.48
[22] askpass_1.2.0 labeling_0.4.3 reticulate_1.38.0
[25] plyr_1.8.9 workflowr_1.7.1 withr_3.0.1
[28] purrr_1.0.2 nnet_7.3-19 grid_4.4.1
[31] stats4_4.4.1 fansi_1.0.6 git2r_0.33.0
[34] colorspace_2.1-1 future_1.34.0 globals_0.16.3
[37] scales_1.3.0 iterators_1.0.14 MASS_7.3-61
[40] cli_3.6.3 rmarkdown_2.28 generics_0.1.3
[43] RSpectra_0.16-2 rstudioapi_0.16.0 future.apply_1.11.2
[46] reshape2_1.4.4 cachem_1.1.0 stringr_1.5.1
[49] splines_4.4.1 parallel_4.4.1 vctrs_0.6.5
[52] hardhat_1.4.0 Matrix_1.7-0 jsonlite_1.8.8
[55] listenv_0.9.1 foreach_1.5.2 gower_1.0.1
[58] jquerylib_0.1.4 recipes_1.1.0 glue_1.7.0
[61] parallelly_1.38.0 codetools_0.2-20 lubridate_1.9.3
[64] stringi_1.8.4 gtable_0.3.5 later_1.3.2
[67] munsell_0.5.1 tibble_3.2.1 pillar_1.9.0
[70] htmltools_0.5.8.1 openssl_2.2.1 ipred_0.9-15
[73] lava_1.8.0 R6_2.5.1 rprojroot_2.0.4
[76] evaluate_0.24.0 highr_0.11 png_0.1-8
[79] httpuv_1.6.15 bslib_0.8.0 Rcpp_1.0.13
[82] nlme_3.1-166 prodlim_2024.06.25 mgcv_1.9-1
[85] xfun_0.47 fs_1.6.4 pkgconfig_2.0.3
[88] ModelMetrics_1.2.2.2