Last updated: 2024-09-06

Checks: 7 passed, 0 failed

Knit directory: myproject/

This reproducible R Markdown analysis was created with workflowr (version 1.7.1). The Checks tab describes the reproducibility checks that were applied when the results were created. The Past versions tab lists the development history.


Great! Since the R Markdown file has been committed to the Git repository, you know the exact version of the code that produced these results.

Great job! The global environment was empty. Objects defined in the global environment can affect the analysis in your R Markdown file in unknown ways. For reproducibility it’s best to always run the code in an empty environment.

The command set.seed(20240905) was run prior to running the code in the R Markdown file. Setting a seed ensures that any results that rely on randomness, e.g. subsampling or permutations, are reproducible.

Great job! Recording the operating system, R version, and package versions is critical for reproducibility.

Nice! There were no cached chunks for this analysis, so you can be confident that you successfully produced the results during this run.

Great job! Using relative paths to the files within your workflowr project makes it easier to run your code on other machines.

Great! You are using Git for version control. Tracking code development and connecting the code version to the results is critical for reproducibility.

The results in this page were generated with repository version 9cfbfa1. See the Past versions tab to see a history of the changes made to the R Markdown and HTML files.

Note that you need to be careful to ensure that all relevant files for the analysis have been committed to Git prior to generating the results (you can use wflow_publish or wflow_git_commit). workflowr only checks the R Markdown file, but you know if there are other scripts or data files that it depends on. Below is the status of the Git repository when the results were generated:


Ignored files:
    Ignored:    .BidstackAds-b51a10b9/
    Ignored:    .RData
    Ignored:    .Rhistory
    Ignored:    .Trash/
    Ignored:    .android/
    Ignored:    Templates/
    Ignored:    Untitled Folder/
    Ignored:    ilifu/
    Ignored:    sql/

Untracked files:
    Untracked:  .AliView/
    Untracked:  .CFUserTextEncoding
    Untracked:  .DS_Store
    Untracked:  .IdentityService/
    Untracked:  .R/
    Untracked:  .RDataTmp
    Untracked:  .Rapp.history
    Untracked:  .ServiceHub/
    Untracked:  .Xauthority
    Untracked:  .anaconda/
    Untracked:  .anyconnect
    Untracked:  .aspnet/
    Untracked:  .azcopy/
    Untracked:  .bash_history
    Untracked:  .bash_profile
    Untracked:  .bashrc
    Untracked:  .bidstack-device-id
    Untracked:  .cache/
    Untracked:  .cisco/
    Untracked:  .conda/
    Untracked:  .condarc
    Untracked:  .config/
    Untracked:  .continuum/
    Untracked:  .cups/
    Untracked:  .docker/
    Untracked:  .dotnet/
    Untracked:  .dropbox/
    Untracked:  .gitconfig
    Untracked:  .gitignore
    Untracked:  .globusonline/
    Untracked:  .gsutil/
    Untracked:  .idlerc/
    Untracked:  .ipynb_checkpoints/
    Untracked:  .ipython/
    Untracked:  .jupyter/
    Untracked:  .keras/
    Untracked:  .lesshst
    Untracked:  .local/
    Untracked:  .matplotlib/
    Untracked:  .mono/
    Untracked:  .nuget/
    Untracked:  .nuuid.ini
    Untracked:  .oracle_jre_usage/
    Untracked:  .pdfbox.cache
    Untracked:  .python_history
    Untracked:  .sqlite_history
    Untracked:  .ssh/
    Untracked:  .tcshrc
    Untracked:  .templateengine/
    Untracked:  .test.txt.swp
    Untracked:  .viminfo
    Untracked:  .vscode/
    Untracked:  .wget-hsts
    Untracked:  .wine/
    Untracked:  .wing101-9
    Untracked:  .xonshrc
    Untracked:  .zprofile
    Untracked:  .zprofile.pysave
    Untracked:  .zsh_history
    Untracked:  .zsh_sessions/
    Untracked:  .zshrc
    Untracked:  Applications (Parallels)/
    Untracked:  Applications/
    Untracked:  Chunk.R
    Untracked:  Desktop/
    Untracked:  Documents/
    Untracked:  Downloads/
    Untracked:  Dropbox/
    Untracked:  Library/
    Untracked:  Movies/
    Untracked:  Music/
    Untracked:  Parallels/
    Untracked:  Pedigree.R
    Untracked:  Pictures/
    Untracked:  PlayOnMac's virtual drives
    Untracked:  Projects/
    Untracked:  Public/
    Untracked:  Untitled.ipynb
    Untracked:  _TyranoGameData/
    Untracked:  anaconda3/
    Untracked:  annotation/
    Untracked:  bcftools-1.9.tar.bz2
    Untracked:  bcftools-1.9/
    Untracked:  clumping_EX.txt
    Untracked:  df.csv
    Untracked:  df_R_tut.csv
    Untracked:  eval "$(ssh-agent -s)"
    Untracked:  eval "$(ssh-agent -s)".pub
    Untracked:  gfortran-4.8.2-darwin13.tar.bz2
    Untracked:  gnomad.genomes.v3.1.1.sites.chr1.vcf.bgz
    Untracked:  myenv/
    Untracked:  text.txt
    Untracked:  venv/

Note that any generated files, e.g. HTML, png, CSS, etc., are not included in this status report because it is ok for generated content to have uncommitted changes.


There are no past versions. Publish this analysis with wflow_publish() to start tracking its development.


Supervised Learning

Supervised learning is learning in which we teach or train the machine using well-labeled data, meaning each example is already tagged with the correct answer.

The algorithm analyses a labeled training dataset to learn a mapping from inputs to outputs; its performance is then assessed on a separate test dataset it has not seen.

Supervised learning itself is composed of:

  • Regression, where the output is numerical

  • Classification, where the output is categorical

Regression

Regression analysis is a statistical technique used to model and analyze the relationship between a dependent variable and one (univariate regression) or more (multivariate regression) independent variables.

Simple Linear Regression

Simple linear regression involves a single independent variable and fits the equation

\(y = b_0 + b_1x\)

where:

  • \(y\) is the dependent variable

  • \(x\) is the independent variable

  • \(b_0\) is the intercept

  • \(b_1\) is the slope of the linear graph

Step 1: Loading libraries and importing the dataset

The caret library provides functions for data partitioning, model training, and evaluation.

library(caret)   

# Load the dataset
df <- read.csv('data/tumor_size_patient_survival.csv')

# Display the first rows
head(df)
  tumor_size patient_survival
1       26.3            279.9
2       12.2            347.4
3       41.5            227.3
4       21.5            330.1
5       28.0            292.1
6       22.2            310.6

Functions like head(), summary(), and str() can be used to get an overview of the data.

Step 2: Data Pre-Processing

This step involves the following (a short sketch follows the list):

  • handling missing values, either by removing them or by mean, median, or mode imputation

  • encoding categorical variables

  • normalising and standardising numerical features
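A minimal sketch of these operations using caret; the grade column in the encoding step is hypothetical (not part of this dataset), and whether imputation is needed depends on the data:

library(caret)

# Mean imputation for missing values in a numeric column
df$tumor_size[is.na(df$tumor_size)] <- mean(df$tumor_size, na.rm = TRUE)

# One-hot encoding of a hypothetical categorical column 'grade'
# dummies <- dummyVars(~ grade, data = df)
# df_encoded <- predict(dummies, newdata = df)

# Centre and scale the numeric features
pp <- preProcess(df, method = c("center", "scale"))
df_scaled <- predict(pp, df)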

Step 3: Splitting the dataset into training and test sets

A common choice is to split the dataset into 75% for training and 25% for testing. Evaluating on held-out data helps assess generalisation and guards against overfitting.

set.seed(45)  # for reproducibility
trainIndex <- createDataPartition(df$patient_survival, p = 0.75, list = FALSE)
trainData <- df[trainIndex, ]
testData <- df[-trainIndex, ]

Step 4: Training the linear regression model

This involves fitting the model to the training set using the lm() function.

model <- lm(patient_survival ~ tumor_size, data = trainData)

# Extract coefficients
coefficients <- coef(model)
coefficients
(Intercept)  tumor_size 
 395.077138   -3.853336 

The linear equation fitted to our training set is approximately

\(y = 395.08 - 3.85x\)
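With these coefficients, the predicted survival for, say, a hypothetical 30 mm tumour is roughly 395.08 - 3.85 × 30 ≈ 279 days. The same prediction can be obtained with predict():

# Predicted survival (in days) for a hypothetical 30 mm tumour
predict(model, newdata = data.frame(tumor_size = 30))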

Step 5: Evaluating the model

This involves assessing the performance of the model on the test set. There are various metrics for model evaluation, including:

  • Mean Absolute Error (MAE)

  • Mean Squared Error (MSE)

  • Root Mean Squared Error (RMSE)

  • R-Squared (R²) Score

test_predictions <- predict(model, newdata = testData)
mae <- MAE(test_predictions, testData$patient_survival)
rmse <- RMSE(test_predictions, testData$patient_survival)
r2_score <- summary(model)$r.squared  # R-squared of the fit (training data)

cat('MAE on test set (in days): ', mae, "\n",
  'RMSE on test set (in days): ', rmse, "\n",
  'R-Squared Score: ', r2_score)
MAE on test set (in days):  13.04724 
 RMSE on test set (in days):  15.93548 
 R-Squared Score:  0.820814
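Note that summary(model)$r.squared above is computed on the training data. A sketch of the corresponding test-set quantities, including the MSE from the list above:

# MSE on the test set, and R-squared computed from the test predictions
mse <- mean((test_predictions - testData$patient_survival)^2)
r2_test <- cor(test_predictions, testData$patient_survival)^2
# caret also provides R2(test_predictions, testData$patient_survival)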

Step 6: Visualising the model

library(ggplot2)

# Add a column to differentiate between training and test data
trainData$dataset <- "Training"
testData$dataset <- "Test"

# Combine train and test data into a single dataframe for plotting
combinedData <- rbind(trainData, testData)

# Create a scatter plot with regression line for both training and test sets
ggplot(combinedData, aes(x = tumor_size, y = patient_survival, color = dataset, shape = dataset)) +
  geom_point(alpha = 0.7) +
  geom_smooth(data = trainData, aes(x = tumor_size, y = patient_survival), method = "lm", se = FALSE, color = "#00008B") +
  labs(title = "Relationship between Tumor Size and Patient Survival",
       x = "Tumor Size (mm)",
       y = "Patient Survival (Days)") +
  theme_minimal() +
  scale_color_manual(values = c("Training" = "blue", "Test" = "red")) +
  scale_shape_manual(values = c("Training" = 16, "Test" = 16)) +
  guides(color = guide_legend(title = "Dataset"),
         shape = guide_legend(title = "Dataset"))
`geom_smooth()` using formula = 'y ~ x'

Multivariate Linear Regression

Most real-life scenarios are characterised by multivariate or high-dimensional features, where more than one independent variable influences the target (dependent) variable. Multivariate algorithms fit the model

\(y = b_0 + b_1x_1 + b_2x_2 + b_3x_3 + \dots + b_nx_n\)

The mpg dataset from the ggplot2 package can be used for multivariate regression. It includes information on car attributes. We will choose some relevant attributes to predict hwy, highway miles per gallon (MPG).

Step 1: Loading the dataset

# Load the dataset
library(ggplot2)
data(mpg)
df <- mpg
head(df)
# A tibble: 6 × 11
  manufacturer model displ  year   cyl trans      drv     cty   hwy fl    class 
  <chr>        <chr> <dbl> <int> <int> <chr>      <chr> <int> <int> <chr> <chr> 
1 audi         a4      1.8  1999     4 auto(l5)   f        18    29 p     compa…
2 audi         a4      1.8  1999     4 manual(m5) f        21    29 p     compa…
3 audi         a4      2    2008     4 manual(m6) f        20    31 p     compa…
4 audi         a4      2    2008     4 auto(av)   f        21    30 p     compa…
5 audi         a4      2.8  1999     6 auto(l5)   f        16    26 p     compa…
6 audi         a4      2.8  1999     6 manual(m5) f        18    26 p     compa…

We can choose the predictors displ (engine displacement), cyl (number of cylinders), year (model year), and class (type of car).
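Since class is categorical, lm() will expand it into dummy (indicator) variables automatically, one per level minus the reference level. The encoding can be inspected with model.matrix():

# Inspect the dummy encoding that lm() will generate for 'class'
head(model.matrix(~ class, data = df))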

Step 2: Splitting and preparing the dataset

library(caret)

set.seed(30) # for reproducibility

# Split the data into training and testing sets
trainIndex <- createDataPartition(df$hwy, p = 0.75, list = FALSE)
trainData <- df[trainIndex, ]
testData <- df[-trainIndex, ]

Step 3: Fitting the model

model_mv <- lm(hwy ~ displ + cyl + year + class, data = trainData)
summary(model_mv) 

Call:
lm(formula = hwy ~ displ + cyl + year + class, data = trainData)

Residuals:
    Min      1Q  Median      3Q     Max 
-5.1700 -1.5322 -0.1473  1.0249 15.1030 

Coefficients:
                  Estimate Std. Error t value Pr(>|t|)    
(Intercept)     -183.14115   91.73394  -1.996 0.047511 *  
displ             -1.06300    0.52272  -2.034 0.043576 *  
cyl               -1.11663    0.36516  -3.058 0.002596 ** 
year               0.11135    0.04593   2.424 0.016399 *  
classcompact      -4.06888    1.94849  -2.088 0.038295 *  
classmidsize      -3.82081    1.89582  -2.015 0.045468 *  
classminivan      -6.82814    1.98819  -3.434 0.000749 ***
classpickup      -10.55172    1.76709  -5.971 1.38e-08 ***
classsubcompact   -3.29990    1.90675  -1.731 0.085364 .  
classsuv          -9.35511    1.70649  -5.482 1.53e-07 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.7 on 167 degrees of freedom
Multiple R-squared:  0.8091,    Adjusted R-squared:  0.7989 
F-statistic: 78.67 on 9 and 167 DF,  p-value: < 2.2e-16

Step 4: Evaluating the model

test_predictions_mv <- predict(model_mv, newdata = testData)
mae <- MAE(test_predictions_mv, testData$hwy)
rmse <- RMSE(test_predictions_mv, testData$hwy)
r2_score <- summary(model_mv)$r.squared  # R-squared of the multivariate fit (training data)

cat('MAE on test set (in MPG): ', mae, "\n",
  'RMSE on test set (in MPG): ', rmse, "\n",
  'R-Squared Score: ', r2_score)
MAE on test set (in MPG):  1.804035 
 RMSE on test set (in MPG):  2.325899 
 R-Squared Score:  0.8091

Classification

k-Nearest Neighbors (kNN)

K-nearest neighbors works by directly measuring the (Euclidean) distance between observations and inferring the class of unlabeled data from the class of its nearest neighbors.

Typically in machine learning there are two clear steps: one first trains a model and then uses the model to predict new outputs (class labels in this case). In kNN, these two steps are combined into a single call to the knn() function.

Let's draw a set of 50 random iris observations to train the model and predict the species of another set of 50 randomly chosen flowers. The knn() function takes the training data, the new data (to be classified) and the labels of the training data, and returns (by default) the predicted class.

set.seed(12L)
train <- sample(150, 50)  # indices of the 50 training observations
test <- sample(150, 50)   # 50 observations to classify (drawn independently, so may overlap with train)
library("class")
knnres <- knn(iris[train, -5], iris[test, -5], iris$Species[train])
head(knnres)
[1] versicolor setosa     versicolor setosa     setosa     setosa    
Levels: setosa versicolor virginica
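A quick sketch of how well these predictions match the held-out labels:

# Confusion table and overall accuracy of the kNN predictions
table(knnres, iris$Species[test])
mean(knnres == iris$Species[test])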

Cross-Validation

Cross-validation is a technique used to assess how well a model generalises to new data. It involves dividing the dataset into multiple folds, then repeatedly training the model on all but one fold and validating it on the held-out fold.
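As a sketch, caret can run such a k-fold cross-validation via trainControl(); here a 10-fold CV of a kNN classifier on iris (the choice of k = 5 is illustrative):

library(caret)

# 10-fold cross-validation of a kNN classifier
ctrl <- trainControl(method = "cv", number = 10)
fit <- train(Species ~ ., data = iris, method = "knn",
             trControl = ctrl, tuneGrid = data.frame(k = 5))
fit$results  # accuracy averaged over the 10 folds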

Cross-Validation with PLS-DA

The pls.double.cv() function from KODAMA performs a 10-fold cross-validation on a given data set using a Partial Least Squares (PLS) model. To assess the prediction ability of the model, the cross-validation generates splits of the data set with a ratio of 1:9. Permutation testing can be used to estimate the classification/regression performance of the predictors.

library(KODAMA)
data(iris)
data <- iris[, -5]    # the four numeric measurements
labels <- iris[, 5]   # species labels
pp <- pls.double.cv(data, labels)
..........
print(pp$Q2Y)
 [1] 0.5671665 0.5591931 0.5659959 0.5611964 0.5616126 0.5761351 0.5695032
 [8] 0.5516902 0.5649377 0.5630635
table(pp$Ypred,labels)
            labels
             setosa versicolor virginica
  setosa         49          0         0
  versicolor      1         34        10
  virginica       0         16        40
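From the confusion table, the overall cross-validated accuracy is the proportion of samples on the diagonal, (49 + 34 + 40) / 150 ≈ 0.82:

# Proportion of correctly classified samples
sum(diag(table(pp$Ypred, labels))) / length(labels)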

Feature transformation

Feature transformation is the process of modifying and converting the input features of a data set by applying mathematical operations, in order to improve the learning and prediction performance of ML models.

Transformation techniques include scaling, normalisation and log transformation, which deal with differences in scale and distribution between features, non-linearity, and outliers.

Input features (variables) may have different units, e.g. kilometres, days, years, etc., so the variables have different scales and probably different distributions, which makes it harder for ML algorithms to learn from the data.
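As a minimal base-R illustration of putting such features on a common scale (KODAMA's dedicated functions are covered below), z-scoring with scale() gives each column mean 0 and standard deviation 1:

library(ggplot2)
data(mpg)

# z-score standardisation of three numeric columns with different scales
head(scale(mpg[, c("displ", "cty", "hwy")]))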

Normalisation

A number of different normalization methods are provided in KODAMA:

  • “none”: no normalization method is applied.

  • “pqn”: the Probabilistic Quotient Normalization is computed as described in Dieterle et al. (2006).

  • “sum”: samples are normalized to the sum of the absolute value of all variables for a given sample.

  • “median”: samples are normalized to the median value of all variables for a given sample.

  • “sqrt”: samples are normalized to the root of the sum of the squared value of all variables for a given sample.

library(KODAMA)

data(MetRef)
u <- MetRef$data
u <- u[, -which(colSums(u) == 0)]  # remove variables that are zero across all samples
u <- normalization(u)$newXtrain    # normalise samples (default method)
class <- as.numeric(as.factor(MetRef$gender))
cc <- pca(u)
plot(cc$x, pch = 21, bg = class)

Scaling and Standardisation

A number of different scaling methods are provided in KODAMA:

  • “none”: no scaling method is applied.

  • “centering”: centers the mean to zero.

  • “autoscaling”: centers the mean to zero and scales data by dividing each variable by the variance.

  • “rangescaling”: centers the mean to zero and scales data by dividing each variable by the difference between the minimum and the maximum value.

  • “paretoscaling”: centers the mean to zero and scales data by dividing each variable by the square root of the standard deviation. Unit scaling divides each variable by the standard deviation so that each variable has variance equal to 1.

library(KODAMA)

data(MetRef)
u <- MetRef$data
u <- u[, -which(colSums(u) == 0)]  # remove variables that are zero across all samples
u <- scaling(u)$newXtrain          # scale variables (default method)
class <- as.numeric(as.factor(MetRef$gender))
cc <- pca(u)
plot(cc$x, pch = 21, bg = class, xlab = cc$txt[1], ylab = cc$txt[2])

We can combine both normalisation and scaling to see the difference in the output.

library(KODAMA)

data(MetRef)
u <- MetRef$data
u <- u[, -which(colSums(u) == 0)]  # remove variables that are zero across all samples
u <- normalization(u)$newXtrain    # first normalise samples
u <- scaling(u)$newXtrain          # then scale variables
class <- as.numeric(as.factor(MetRef$gender))
cc <- pca(u)
plot(cc$x, pch = 21, bg = class, xlab = cc$txt[1], ylab = cc$txt[2])


sessionInfo()
R version 4.4.1 (2024-06-14)
Platform: aarch64-apple-darwin20
Running under: macOS Sonoma 14.6.1

Matrix products: default
BLAS:   /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRblas.0.dylib 
LAPACK: /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRlapack.dylib;  LAPACK version 3.12.0

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

time zone: Africa/Johannesburg
tzcode source: internal

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] KODAMA_3.1     umap_0.2.10.0  Rtsne_0.17     minerva_1.5.10 class_7.3-22  
[6] caret_6.0-94   lattice_0.22-6 ggplot2_3.5.1 

loaded via a namespace (and not attached):
 [1] tidyselect_1.2.1     timeDate_4032.109    dplyr_1.1.4         
 [4] farver_2.1.2         fastmap_1.2.0        pROC_1.18.5         
 [7] promises_1.3.0       digest_0.6.37        rpart_4.1.23        
[10] timechange_0.3.0     lifecycle_1.0.4      survival_3.7-0      
[13] magrittr_2.0.3       compiler_4.4.1       rlang_1.1.4         
[16] sass_0.4.9           tools_4.4.1          utf8_1.2.4          
[19] yaml_2.3.10          data.table_1.16.0    knitr_1.48          
[22] askpass_1.2.0        labeling_0.4.3       reticulate_1.38.0   
[25] plyr_1.8.9           workflowr_1.7.1      withr_3.0.1         
[28] purrr_1.0.2          nnet_7.3-19          grid_4.4.1          
[31] stats4_4.4.1         fansi_1.0.6          git2r_0.33.0        
[34] colorspace_2.1-1     future_1.34.0        globals_0.16.3      
[37] scales_1.3.0         iterators_1.0.14     MASS_7.3-61         
[40] cli_3.6.3            rmarkdown_2.28       generics_0.1.3      
[43] RSpectra_0.16-2      rstudioapi_0.16.0    future.apply_1.11.2 
[46] reshape2_1.4.4       cachem_1.1.0         stringr_1.5.1       
[49] splines_4.4.1        parallel_4.4.1       vctrs_0.6.5         
[52] hardhat_1.4.0        Matrix_1.7-0         jsonlite_1.8.8      
[55] listenv_0.9.1        foreach_1.5.2        gower_1.0.1         
[58] jquerylib_0.1.4      recipes_1.1.0        glue_1.7.0          
[61] parallelly_1.38.0    codetools_0.2-20     lubridate_1.9.3     
[64] stringi_1.8.4        gtable_0.3.5         later_1.3.2         
[67] munsell_0.5.1        tibble_3.2.1         pillar_1.9.0        
[70] htmltools_0.5.8.1    openssl_2.2.1        ipred_0.9-15        
[73] lava_1.8.0           R6_2.5.1             rprojroot_2.0.4     
[76] evaluate_0.24.0      highr_0.11           png_0.1-8           
[79] httpuv_1.6.15        bslib_0.8.0          Rcpp_1.0.13         
[82] nlme_3.1-166         prodlim_2024.06.25   mgcv_1.9-1          
[85] xfun_0.47            fs_1.6.4             pkgconfig_2.0.3     
[88] ModelMetrics_1.2.2.2