Last updated: 2025-03-09

Checks: 6 1

Knit directory: SAPPHIRE/

This reproducible R Markdown analysis was created with workflowr (version 1.7.1). The Checks tab describes the reproducibility checks that were applied when the results were created. The Past versions tab lists the development history.

R Markdown file: uncommitted changes

The R Markdown file has unstaged changes. To know which version of the R Markdown file created these results, you’ll want to first commit it to the Git repo. If you’re still working on the analysis, you can ignore this warning. When you’re finished, you can run wflow_publish to commit the R Markdown file and build the HTML.

Environment: empty

Great job! The global environment was empty. Objects defined in the global environment can affect the analysis in your R Markdown file in unknown ways. For reproduciblity it’s best to always run the code in an empty environment.

Seed: set.seed(20240923)

The command set.seed(20240923) was run prior to running the code in the R Markdown file. Setting a seed ensures that any results that rely on randomness, e.g. subsampling or permutations, are reproducible.

Session information: recorded

Great job! Recording the operating system, R version, and package versions is critical for reproducibility.

Cache: none

Nice! There were no cached chunks for this analysis, so you can be confident that you successfully produced the results during this run.

File paths: relative

Great job! Using relative paths to the files within your workflowr project makes it easier to run your code on other machines.

Repository version: 9217a86

Great! You are using Git for version control. Tracking code development and connecting the code version to the results is critical for reproducibility.

The results in this page were generated with repository version 9217a86. See the Past versions tab to see a history of the changes made to the R Markdown and HTML files.

Note that you need to be careful to ensure that all relevant files for the analysis have been committed to Git prior to generating the results (you can use wflow_publish or wflow_git_commit). workflowr only checks the R Markdown file, but you know if there are other scripts or data files that it depends on. Below is the status of the Git repository when the results were generated:


Ignored files:
    Ignored:    .DS_Store
    Ignored:    .Rhistory
    Ignored:    .Rproj.user/
    Ignored:    data/.DS_Store
    Ignored:    output/.DS_Store

Unstaged changes:
    Modified:   analysis/pigmentationdata.Rmd
    Modified:   analysis/summer-winter.Rmd
    Modified:   output/coeff_plots_summer-winter/coeff_Forehead_B.png
    Modified:   output/coeff_plots_summer-winter/coeff_Forehead_CIE_L.png
    Modified:   output/coeff_plots_summer-winter/coeff_Forehead_CIE_a.png
    Modified:   output/coeff_plots_summer-winter/coeff_Forehead_CIE_b.png
    Modified:   output/coeff_plots_summer-winter/coeff_Forehead_E.png
    Modified:   output/coeff_plots_summer-winter/coeff_Forehead_G.png
    Modified:   output/coeff_plots_summer-winter/coeff_Forehead_M.png
    Modified:   output/coeff_plots_summer-winter/coeff_Forehead_R.png
    Modified:   output/coeff_plots_summer-winter/coeff_LeftUpperInnerArm_B.png
    Modified:   output/coeff_plots_summer-winter/coeff_LeftUpperInnerArm_CIE_L.png
    Modified:   output/coeff_plots_summer-winter/coeff_LeftUpperInnerArm_CIE_a.png
    Modified:   output/coeff_plots_summer-winter/coeff_LeftUpperInnerArm_CIE_b.png
    Modified:   output/coeff_plots_summer-winter/coeff_LeftUpperInnerArm_E.png
    Modified:   output/coeff_plots_summer-winter/coeff_LeftUpperInnerArm_G.png
    Modified:   output/coeff_plots_summer-winter/coeff_LeftUpperInnerArm_M.png
    Modified:   output/coeff_plots_summer-winter/coeff_LeftUpperInnerArm_R.png
    Modified:   output/coeff_plots_summer-winter/coeff_RightUpperInnerArm_B.png
    Modified:   output/coeff_plots_summer-winter/coeff_RightUpperInnerArm_CIE_L.png
    Modified:   output/coeff_plots_summer-winter/coeff_RightUpperInnerArm_CIE_a.png
    Modified:   output/coeff_plots_summer-winter/coeff_RightUpperInnerArm_CIE_b.png
    Modified:   output/coeff_plots_summer-winter/coeff_RightUpperInnerArm_E.png
    Modified:   output/coeff_plots_summer-winter/coeff_RightUpperInnerArm_G.png
    Modified:   output/coeff_plots_summer-winter/coeff_RightUpperInnerArm_M.png
    Modified:   output/coeff_plots_summer-winter/coeff_RightUpperInnerArm_R.png
    Modified:   output/withinstats_filtered/Forehead_B.png
    Modified:   output/withinstats_filtered/Forehead_CIE_L.png
    Modified:   output/withinstats_filtered/Forehead_CIE_a.png
    Modified:   output/withinstats_filtered/Forehead_CIE_b.png
    Modified:   output/withinstats_filtered/Forehead_E.png
    Modified:   output/withinstats_filtered/Forehead_G.png
    Modified:   output/withinstats_filtered/Forehead_M.png
    Modified:   output/withinstats_filtered/Forehead_R.png
    Modified:   output/withinstats_filtered/LeftUpperInnerArm_B.png
    Modified:   output/withinstats_filtered/LeftUpperInnerArm_CIE_L.png
    Modified:   output/withinstats_filtered/LeftUpperInnerArm_CIE_a.png
    Modified:   output/withinstats_filtered/LeftUpperInnerArm_CIE_b.png
    Modified:   output/withinstats_filtered/LeftUpperInnerArm_E.png
    Modified:   output/withinstats_filtered/LeftUpperInnerArm_G.png
    Modified:   output/withinstats_filtered/LeftUpperInnerArm_M.png
    Modified:   output/withinstats_filtered/LeftUpperInnerArm_R.png
    Modified:   output/withinstats_filtered/RightUpperInnerArm_B.png
    Modified:   output/withinstats_filtered/RightUpperInnerArm_CIE_L.png
    Modified:   output/withinstats_filtered/RightUpperInnerArm_CIE_a.png
    Modified:   output/withinstats_filtered/RightUpperInnerArm_CIE_b.png
    Modified:   output/withinstats_filtered/RightUpperInnerArm_E.png
    Modified:   output/withinstats_filtered/RightUpperInnerArm_G.png
    Modified:   output/withinstats_filtered/RightUpperInnerArm_M.png
    Modified:   output/withinstats_filtered/RightUpperInnerArm_R.png

Note that any generated files, e.g. HTML, png, CSS, etc., are not included in this status report because it is ok for generated content to have uncommitted changes.

These are the previous versions of the repository in which changes were made to the R Markdown (analysis/pigmentationdata.Rmd) and HTML (docs/pigmentationdata.html) files. If you’ve configured a remote Git repository (see ?wflow_git_remote), click on the hyperlinks in the table below to view the files as they were in that past version.

File	Version	Author	Date	Message
Rmd	9217a86	Tina Lasisi	2025-03-07	Cleaned data and redid analyses

Introduction

# Install packages if needed:
# install.packages("readxl")
# 

library(readxl)
library(tidyverse)

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.4     ✔ tidyr     1.3.1
✔ purrr     1.0.4     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

# Path to your Excel file
workbook_path <- "data/serum_vit_D_study_with_lab_results.xlsx"

# Get the list of all sheet names in the workbook
sheets <- excel_sheets(workbook_path)

# Loop through each sheet, read it, and write it to a CSV
for (sheet_name in sheets) {
  df <- read_excel(workbook_path, sheet = sheet_name)
  
  # Replace problematic characters in sheet name
  safe_sheet_name <- gsub("[^A-Za-z0-9\\-_]+", "_", sheet_name)
  
  # Create file name and save
  output_path <- file.path("data", paste0(safe_sheet_name, ".csv"))
  write.csv(df, output_path, row.names = FALSE)
  
  message("Created CSV file for sheet '", sheet_name, "': ", output_path)
}

Created CSV file for sheet 'ScreeningDataCollectionSummer': data/ScreeningDataCollectionSummer.csv
Created CSV file for sheet 'ScreeningDataCollectionWinter': data/ScreeningDataCollectionWinter.csv
Created CSV file for sheet 'ScreeningDataCollection6Weeks': data/ScreeningDataCollection6Weeks.csv
Created CSV file for sheet 'FoodFrequencySummer': data/FoodFrequencySummer.csv
Created CSV file for sheet 'FoodFrequencyWinter': data/FoodFrequencyWinter.csv
Created CSV file for sheet 'FoodFrequency6Weeks': data/FoodFrequency6Weeks.csv
Created CSV file for sheet 'SunExposureSummer': data/SunExposureSummer.csv
Created CSV file for sheet 'SunExposureWinter': data/SunExposureWinter.csv
Created CSV file for sheet 'SunExposure6Weeks': data/SunExposure6Weeks.csv

# (Optional) Install packages if needed
# install.packages("dplyr")

# Read each CSV and add a "Season" column
summer_data  <- read.csv("data/ScreeningDataCollectionSummer.csv") %>%
  mutate(Season = "Summer")

winter_data  <- read.csv("data/ScreeningDataCollectionWinter.csv") %>%
  mutate(Season = "Winter")

sixweek_data <- read.csv("data/ScreeningDataCollection6Weeks.csv") %>%
  mutate(Season = "6Weeks")

# Combine into one data frame
combined_data <- bind_rows(summer_data, winter_data, sixweek_data)

# Write the combined data to a single CSV
write.csv(combined_data, "data/ScreeningDataCollection.csv", row.names = FALSE)

# Step 1: Read & Clean the Data
# Load required libraries
library(tidyverse)
library(readr)

# Read the original data
df_original <- read.csv("data/ScreeningDataCollection.csv")

# Clean and rename variables
df_cleaned <- df_original |>
  mutate(
    # Convert EthnicityColoured -> Ethnicity (adapt if it's character/string)
    Ethnicity = if_else(EthnicityColoured == TRUE, "CapeMixed", "Xhosa")
  ) |>
  select(
    ParticipantCentreID,
    TodayDate,
    Gender,
    Ethnicity,
    Season,
    starts_with("SkinReflectance"),
    VitDResult
  )

# Get all column names
all_columns <- colnames(df_cleaned)

# Find columns with L., a., and b. and rename them
renamed_columns <- all_columns

# Create a function to rename CIE variables
rename_cie_vars <- function(column_name) {
  # Replace L. with CIE_L
  column_name <- gsub("SkinReflectanceForehead[L]\\.([1-3])", "SkinReflectanceForeheadCIE_L\\1", column_name)
  column_name <- gsub("SkinReflectanceRightUpperInnerArm[L]\\.([1-3])", "SkinReflectanceRightUpperInnerArmCIE_L\\1", column_name)
  column_name <- gsub("SkinReflectanceLeftUpperInnerArm[L]\\.([1-3])", "SkinReflectanceLeftUpperInnerArmCIE_L\\1", column_name)
  
  # Replace a. with CIE_a
  column_name <- gsub("SkinReflectanceForehead[a]\\.([1-3])", "SkinReflectanceForeheadCIE_a\\1", column_name)
  column_name <- gsub("SkinReflectanceRightUpperInnerArm[a]\\.([1-3])", "SkinReflectanceRightUpperInnerArmCIE_a\\1", column_name)
  column_name <- gsub("SkinReflectanceLeftUpperInnerArm[a]\\.([1-3])", "SkinReflectanceLeftUpperInnerArmCIE_a\\1", column_name)
  
  # Replace b. with CIE_b
  column_name <- gsub("SkinReflectanceForehead[b]\\.([1-3])", "SkinReflectanceForeheadCIE_b\\1", column_name)
  column_name <- gsub("SkinReflectanceRightUpperInnerArm[b]\\.([1-3])", "SkinReflectanceRightUpperInnerArmCIE_b\\1", column_name)
  column_name <- gsub("SkinReflectanceLeftUpperInnerArm[b]\\.([1-3])", "SkinReflectanceLeftUpperInnerArmCIE_b\\1", column_name)
  
  return(column_name)
}

# Apply the renaming function to each column
for (i in 1:length(all_columns)) {
  renamed_columns[i] <- rename_cie_vars(all_columns[i])
}

# Create a rename map
rename_map <- setNames(all_columns, renamed_columns)

# Check if any columns were changed
changed_columns <- which(renamed_columns != all_columns)
if (length(changed_columns) > 0) {
  cat("Renamed columns:\n")
  for (i in changed_columns) {
    cat(paste("  ", all_columns[i], "->", renamed_columns[i], "\n"))
  }
} else {
  cat("No columns were renamed.\n")
}

Renamed columns:
   SkinReflectanceForeheadL.1 -> SkinReflectanceForeheadCIE_L1 
   SkinReflectanceForeheadL.2 -> SkinReflectanceForeheadCIE_L2 
   SkinReflectanceForeheadL.3 -> SkinReflectanceForeheadCIE_L3 
   SkinReflectanceForeheada.1 -> SkinReflectanceForeheadCIE_a1 
   SkinReflectanceForeheada.2 -> SkinReflectanceForeheadCIE_a2 
   SkinReflectanceForeheada.3 -> SkinReflectanceForeheadCIE_a3 
   SkinReflectanceForeheadb.1 -> SkinReflectanceForeheadCIE_b1 
   SkinReflectanceForeheadb.2 -> SkinReflectanceForeheadCIE_b2 
   SkinReflectanceForeheadb.3 -> SkinReflectanceForeheadCIE_b3 
   SkinReflectanceRightUpperInnerArmL.1 -> SkinReflectanceRightUpperInnerArmCIE_L1 
   SkinReflectanceRightUpperInnerArmL.2 -> SkinReflectanceRightUpperInnerArmCIE_L2 
   SkinReflectanceRightUpperInnerArmL.3 -> SkinReflectanceRightUpperInnerArmCIE_L3 
   SkinReflectanceRightUpperInnerArma.1 -> SkinReflectanceRightUpperInnerArmCIE_a1 
   SkinReflectanceRightUpperInnerArma.2 -> SkinReflectanceRightUpperInnerArmCIE_a2 
   SkinReflectanceRightUpperInnerArma.3 -> SkinReflectanceRightUpperInnerArmCIE_a3 
   SkinReflectanceRightUpperInnerArmb.1 -> SkinReflectanceRightUpperInnerArmCIE_b1 
   SkinReflectanceRightUpperInnerArmb.2 -> SkinReflectanceRightUpperInnerArmCIE_b2 
   SkinReflectanceRightUpperInnerArmb.3 -> SkinReflectanceRightUpperInnerArmCIE_b3 
   SkinReflectanceLeftUpperInnerArmL.1 -> SkinReflectanceLeftUpperInnerArmCIE_L1 
   SkinReflectanceLeftUpperInnerArmL.2 -> SkinReflectanceLeftUpperInnerArmCIE_L2 
   SkinReflectanceLeftUpperInnerArmL.3 -> SkinReflectanceLeftUpperInnerArmCIE_L3 
   SkinReflectanceLeftUpperInnerArma.1 -> SkinReflectanceLeftUpperInnerArmCIE_a1 
   SkinReflectanceLeftUpperInnerArma.2 -> SkinReflectanceLeftUpperInnerArmCIE_a2 
   SkinReflectanceLeftUpperInnerArma.3 -> SkinReflectanceLeftUpperInnerArmCIE_a3 
   SkinReflectanceLeftUpperInnerArmb.1 -> SkinReflectanceLeftUpperInnerArmCIE_b1 
   SkinReflectanceLeftUpperInnerArmb.2 -> SkinReflectanceLeftUpperInnerArmCIE_b2 
   SkinReflectanceLeftUpperInnerArmb.3 -> SkinReflectanceLeftUpperInnerArmCIE_b3

# Now let's try a different approach since the regex might not be capturing correctly
# Directly identify and rename columns with specific patterns

# Create a new renamed dataframe
df_renamed <- df_cleaned

# Iterate through each column and apply renaming
rename_columns <- function(df) {
  col_names <- colnames(df)
  new_names <- col_names
  
  for (i in seq_along(col_names)) {
    # Check for L. pattern
    if (grepl("L\\.", col_names[i])) {
      new_names[i] <- gsub("L\\.", "CIE_L", col_names[i])
    }
    # Check for a. pattern
    else if (grepl("a\\.", col_names[i])) {
      new_names[i] <- gsub("a\\.", "CIE_a", col_names[i])
    }
    # Check for b. pattern
    else if (grepl("b\\.", col_names[i])) {
      new_names[i] <- gsub("b\\.", "CIE_b", col_names[i])
    }
  }
  
  # Show which columns were renamed
  changed <- which(new_names != col_names)
  if (length(changed) > 0) {
    cat("Renamed columns:\n")
    for (i in changed) {
      cat(paste("  ", col_names[i], "->", new_names[i], "\n"))
    }
  }
  
  # Rename the columns
  colnames(df) <- new_names
  return(df)
}

# Apply the renaming
df_renamed <- rename_columns(df_cleaned)

Renamed columns:
   SkinReflectanceForeheadL.1 -> SkinReflectanceForeheadCIE_L1 
   SkinReflectanceForeheadL.2 -> SkinReflectanceForeheadCIE_L2 
   SkinReflectanceForeheadL.3 -> SkinReflectanceForeheadCIE_L3 
   SkinReflectanceForeheada.1 -> SkinReflectanceForeheadCIE_a1 
   SkinReflectanceForeheada.2 -> SkinReflectanceForeheadCIE_a2 
   SkinReflectanceForeheada.3 -> SkinReflectanceForeheadCIE_a3 
   SkinReflectanceForeheadb.1 -> SkinReflectanceForeheadCIE_b1 
   SkinReflectanceForeheadb.2 -> SkinReflectanceForeheadCIE_b2 
   SkinReflectanceForeheadb.3 -> SkinReflectanceForeheadCIE_b3 
   SkinReflectanceRightUpperInnerArmL.1 -> SkinReflectanceRightUpperInnerArmCIE_L1 
   SkinReflectanceRightUpperInnerArmL.2 -> SkinReflectanceRightUpperInnerArmCIE_L2 
   SkinReflectanceRightUpperInnerArmL.3 -> SkinReflectanceRightUpperInnerArmCIE_L3 
   SkinReflectanceRightUpperInnerArma.1 -> SkinReflectanceRightUpperInnerArmCIE_a1 
   SkinReflectanceRightUpperInnerArma.2 -> SkinReflectanceRightUpperInnerArmCIE_a2 
   SkinReflectanceRightUpperInnerArma.3 -> SkinReflectanceRightUpperInnerArmCIE_a3 
   SkinReflectanceRightUpperInnerArmb.1 -> SkinReflectanceRightUpperInnerArmCIE_b1 
   SkinReflectanceRightUpperInnerArmb.2 -> SkinReflectanceRightUpperInnerArmCIE_b2 
   SkinReflectanceRightUpperInnerArmb.3 -> SkinReflectanceRightUpperInnerArmCIE_b3 
   SkinReflectanceLeftUpperInnerArmL.1 -> SkinReflectanceLeftUpperInnerArmCIE_L1 
   SkinReflectanceLeftUpperInnerArmL.2 -> SkinReflectanceLeftUpperInnerArmCIE_L2 
   SkinReflectanceLeftUpperInnerArmL.3 -> SkinReflectanceLeftUpperInnerArmCIE_L3 
   SkinReflectanceLeftUpperInnerArma.1 -> SkinReflectanceLeftUpperInnerArmCIE_a1 
   SkinReflectanceLeftUpperInnerArma.2 -> SkinReflectanceLeftUpperInnerArmCIE_a2 
   SkinReflectanceLeftUpperInnerArma.3 -> SkinReflectanceLeftUpperInnerArmCIE_a3 
   SkinReflectanceLeftUpperInnerArmb.1 -> SkinReflectanceLeftUpperInnerArmCIE_b1 
   SkinReflectanceLeftUpperInnerArmb.2 -> SkinReflectanceLeftUpperInnerArmCIE_b2 
   SkinReflectanceLeftUpperInnerArmb.3 -> SkinReflectanceLeftUpperInnerArmCIE_b3

# Write to csv
write.csv(df_renamed, "data/ScreeningDataCollection_cleaned.csv", row.names = FALSE)

# Verify the renamed file
df_check <- read.csv("data/ScreeningDataCollection_cleaned.csv")
cat("\nVerification of renamed columns in saved file:\n")


Verification of renamed columns in saved file:

cie_columns <- grep("CIE_[Lab]", colnames(df_check), value = TRUE)
if (length(cie_columns) > 0) {
  cat("CIE columns found in the renamed file:\n")
  print(cie_columns)
} else {
  cat("No CIE columns found in the renamed file.\n")
}

CIE columns found in the renamed file:
 [1] "SkinReflectanceForeheadCIE_L1"          
 [2] "SkinReflectanceForeheadCIE_L2"          
 [3] "SkinReflectanceForeheadCIE_L3"          
 [4] "SkinReflectanceForeheadCIE_a1"          
 [5] "SkinReflectanceForeheadCIE_a2"          
 [6] "SkinReflectanceForeheadCIE_a3"          
 [7] "SkinReflectanceForeheadCIE_b1"          
 [8] "SkinReflectanceForeheadCIE_b2"          
 [9] "SkinReflectanceForeheadCIE_b3"          
[10] "SkinReflectanceRightUpperInnerArmCIE_L1"
[11] "SkinReflectanceRightUpperInnerArmCIE_L2"
[12] "SkinReflectanceRightUpperInnerArmCIE_L3"
[13] "SkinReflectanceRightUpperInnerArmCIE_a1"
[14] "SkinReflectanceRightUpperInnerArmCIE_a2"
[15] "SkinReflectanceRightUpperInnerArmCIE_a3"
[16] "SkinReflectanceRightUpperInnerArmCIE_b1"
[17] "SkinReflectanceRightUpperInnerArmCIE_b2"
[18] "SkinReflectanceRightUpperInnerArmCIE_b3"
[19] "SkinReflectanceLeftUpperInnerArmCIE_L1" 
[20] "SkinReflectanceLeftUpperInnerArmCIE_L2" 
[21] "SkinReflectanceLeftUpperInnerArmCIE_L3" 
[22] "SkinReflectanceLeftUpperInnerArmCIE_a1" 
[23] "SkinReflectanceLeftUpperInnerArmCIE_a2" 
[24] "SkinReflectanceLeftUpperInnerArmCIE_a3" 
[25] "SkinReflectanceLeftUpperInnerArmCIE_b1" 
[26] "SkinReflectanceLeftUpperInnerArmCIE_b2" 
[27] "SkinReflectanceLeftUpperInnerArmCIE_b3"

# Print a summary of all column names for verification
cat("\nAll column names in final dataset:\n")


All column names in final dataset:

all_final_columns <- colnames(df_check)
print(all_final_columns)

 [1] "ParticipantCentreID"                    
 [2] "TodayDate"                              
 [3] "Gender"                                 
 [4] "Ethnicity"                              
 [5] "Season"                                 
 [6] "SkinReflectanceForeheadE1"              
 [7] "SkinReflectanceForeheadE2"              
 [8] "SkinReflectanceForeheadE3"              
 [9] "SkinReflectanceForeheadM1"              
[10] "SkinReflectanceForeheadM2"              
[11] "SkinReflectanceForeheadM3"              
[12] "SkinReflectanceForeheadR1"              
[13] "SkinReflectanceForeheadR2"              
[14] "SkinReflectanceForeheadR3"              
[15] "SkinReflectanceForeheadG1"              
[16] "SkinReflectanceForeheadG2"              
[17] "SkinReflectanceForeheadG3"              
[18] "SkinReflectanceForeheadB1"              
[19] "SkinReflectanceForeheadB2"              
[20] "SkinReflectanceForeheadB3"              
[21] "SkinReflectanceForeheadCIE_L1"          
[22] "SkinReflectanceForeheadCIE_L2"          
[23] "SkinReflectanceForeheadCIE_L3"          
[24] "SkinReflectanceForeheadCIE_a1"          
[25] "SkinReflectanceForeheadCIE_a2"          
[26] "SkinReflectanceForeheadCIE_a3"          
[27] "SkinReflectanceForeheadCIE_b1"          
[28] "SkinReflectanceForeheadCIE_b2"          
[29] "SkinReflectanceForeheadCIE_b3"          
[30] "SkinReflectanceRightUpperInnerArmE1"    
[31] "SkinReflectanceRightUpperInnerArmE2"    
[32] "SkinReflectanceRightUpperInnerArmE3"    
[33] "SkinReflectanceRightUpperInnerArmM1"    
[34] "SkinReflectanceRightUpperInnerArmM2"    
[35] "SkinReflectanceRightUpperInnerArmM3"    
[36] "SkinReflectanceRightUpperInnerArmR1"    
[37] "SkinReflectanceRightUpperInnerArmR2"    
[38] "SkinReflectanceRightUpperInnerArmR3"    
[39] "SkinReflectanceRightUpperInnerArmG1"    
[40] "SkinReflectanceRightUpperInnerArmG2"    
[41] "SkinReflectanceRightUpperInnerArmG3"    
[42] "SkinReflectanceRightUpperInnerArmB1"    
[43] "SkinReflectanceRightUpperInnerArmB2"    
[44] "SkinReflectanceRightUpperInnerArmB3"    
[45] "SkinReflectanceRightUpperInnerArmCIE_L1"
[46] "SkinReflectanceRightUpperInnerArmCIE_L2"
[47] "SkinReflectanceRightUpperInnerArmCIE_L3"
[48] "SkinReflectanceRightUpperInnerArmCIE_a1"
[49] "SkinReflectanceRightUpperInnerArmCIE_a2"
[50] "SkinReflectanceRightUpperInnerArmCIE_a3"
[51] "SkinReflectanceRightUpperInnerArmCIE_b1"
[52] "SkinReflectanceRightUpperInnerArmCIE_b2"
[53] "SkinReflectanceRightUpperInnerArmCIE_b3"
[54] "SkinReflectanceLeftUpperInnerArmE1"     
[55] "SkinReflectanceLeftUpperInnerArmE2"     
[56] "SkinReflectanceLeftUpperInnerArmE3"     
[57] "SkinReflectanceLeftUpperInnerArmM1"     
[58] "SkinReflectanceLeftUpperInnerArmM2"     
[59] "SkinReflectanceLeftUpperInnerArmM3"     
[60] "SkinReflectanceLeftUpperInnerArmR1"     
[61] "SkinReflectanceLeftUpperInnerArmR2"     
[62] "SkinReflectanceLeftUpperInnerArmR3"     
[63] "SkinReflectanceLeftUpperInnerArmG1"     
[64] "SkinReflectanceLeftUpperInnerArmG2"     
[65] "SkinReflectanceLeftUpperInnerArmG3"     
[66] "SkinReflectanceLeftUpperInnerArmB1"     
[67] "SkinReflectanceLeftUpperInnerArmB2"     
[68] "SkinReflectanceLeftUpperInnerArmB3"     
[69] "SkinReflectanceLeftUpperInnerArmCIE_L1" 
[70] "SkinReflectanceLeftUpperInnerArmCIE_L2" 
[71] "SkinReflectanceLeftUpperInnerArmCIE_L3" 
[72] "SkinReflectanceLeftUpperInnerArmCIE_a1" 
[73] "SkinReflectanceLeftUpperInnerArmCIE_a2" 
[74] "SkinReflectanceLeftUpperInnerArmCIE_a3" 
[75] "SkinReflectanceLeftUpperInnerArmCIE_b1" 
[76] "SkinReflectanceLeftUpperInnerArmCIE_b2" 
[77] "SkinReflectanceLeftUpperInnerArmCIE_b3" 
[78] "VitDResult"

# Count of each type of column
cat("\nCount of each type of reflectance measurement:\n")


Count of each type of reflectance measurement:

count_pattern <- function(pattern) {
  sum(grepl(pattern, all_final_columns))
}

cat("E columns:", count_pattern("E[1-3]$"), "\n")

E columns: 9

cat("M columns:", count_pattern("M[1-3]$"), "\n")

M columns: 9

cat("R columns:", count_pattern("R[1-3]$"), "\n")

R columns: 9

cat("G columns:", count_pattern("G[1-3]$"), "\n")

G columns: 9

cat("B columns:", count_pattern("B[1-3]$"), "\n")

B columns: 9

cat("CIE_L columns:", count_pattern("CIE_L[1-3]$"), "\n")

CIE_L columns: 9

cat("CIE_a columns:", count_pattern("CIE_a[1-3]$"), "\n")

CIE_a columns: 9

cat("CIE_b columns:", count_pattern("CIE_b[1-3]$"), "\n")

CIE_b columns: 9

Long data pivot

library(tidyverse)
library(readr)

# 1) Read your dataset (wide format).
skin_data <- read_csv("data/ScreeningDataCollection_cleaned.csv")

Rows: 220 Columns: 78
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr   (3): ParticipantCentreID, Ethnicity, Season
dbl  (74): Gender, SkinReflectanceForeheadE1, SkinReflectanceForeheadE2, Ski...
date  (1): TodayDate

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

# 2) Convert to long + calculate CV.
#    This function is essentially your original pivot plus CV calculation.
prepare_triplicates_for_plotting_with_cv <- function(data) {
  body_sites    <- c("Forehead", "RightUpperInnerArm", "LeftUpperInnerArm")
  measure_types <- c("E", "M", "R", "G", "B", "CIE_L", "CIE_a", "CIE_b")
  
  all_triplicates <- tibble()
  
  for (body_site in body_sites) {
    for (measure_type in measure_types) {
      # Identify columns like SkinReflectanceForeheadE1..3
      col_pattern <- paste0("SkinReflectance", body_site, measure_type, "[1-3]$")
      triplicate_cols <- names(data)[grepl(col_pattern, names(data))]
      
      if (length(triplicate_cols) != 3) {
        cat("Warning: Did not find exactly 3 columns for", body_site, "-", measure_type,
            ". Found", length(triplicate_cols), "columns.\n")
        next
      }
      
      # Pivot these columns to long
      trip_long <- data %>%
        select(ParticipantCentreID, Gender, Ethnicity, Season, all_of(triplicate_cols)) %>%
        pivot_longer(
          cols      = all_of(triplicate_cols),
          names_to  = "measurement",
          values_to = "value"
        ) %>%
        mutate(
          replicate        = str_extract(measurement, "\\d$"),
          body_site        = body_site,
          measurement_type = measure_type
        )
      
      # Compute CV info for each participant-season-bodySite-metric group
      cv_info <- trip_long %>%
        group_by(ParticipantCentreID, Season, body_site, measurement_type) %>%
        summarize(
          mean_val = mean(value, na.rm = TRUE),
          sd_val   = sd(value, na.rm = TRUE),
          n_valid  = sum(!is.na(value)),
          cv = ifelse(n_valid >= 2, (sd_val / mean_val) * 100, NA_real_),
          .groups = "drop"
        ) %>%
        mutate(
          cv_category = case_when(
            is.na(cv)    ~ NA_character_,
            cv <= 5      ~ "Low",
            cv <= 15     ~ "Medium",
            TRUE         ~ "High"
          )
        )
      
      # Join the CV info back to each replicate row
      trip_cv <- trip_long %>%
        left_join(cv_info, 
                  by = c("ParticipantCentreID","Season","body_site","measurement_type"))
      
      all_triplicates <- bind_rows(all_triplicates, trip_cv)
    }
  }
  
  return(all_triplicates)
}

library(dplyr)

calculate_triplicate_cv <- function(long_data) {
  library(dplyr)
  
  cv_info <- long_data %>%
    group_by(ParticipantCentreID, Season, body_site, measurement_type) %>%
    summarize(
      mean_val = mean(value, na.rm = TRUE),
      sd_val   = sd(value, na.rm = TRUE),
      n_valid  = sum(!is.na(value)),
      cv       = ifelse(n_valid >= 2 & mean_val != 0 & !is.na(mean_val),
                        (sd_val / mean_val)*100, NA_real_),
      .groups  = "drop"
    ) %>%
    mutate(
      cv_category = case_when(
        is.na(cv)   ~ NA_character_,
        cv <= 5     ~ "Low",
        cv <= 15    ~ "Medium",
        TRUE        ~ "High"
      )
    )
  
  # Remove old CV columns if they exist, then join new ones
  long_data_stripped <- long_data %>%
    select(-any_of(c("mean_val","sd_val","n_valid","cv","cv_category")))
  
  long_data_cv <- long_data_stripped %>%
    left_join(cv_info, by=c("ParticipantCentreID","Season","body_site","measurement_type"))
  
  return(long_data_cv)
}

library(ggplot2)

plot_triplicate_cv_lines <- function(long_data_cv, output_dir = "output/cv_line_plots") {
  dir.create(output_dir, showWarnings = FALSE, recursive = TRUE)
  
  # Unique combos of body_site & measurement_type
  combos <- long_data_cv %>%
    select(body_site, measurement_type) %>%
    distinct()
  
  all_plots <- list()
  
  for (i in seq_len(nrow(combos))) {
    bs <- combos$body_site[i]
    mt <- combos$measurement_type[i]
    
    plot_data <- long_data_cv %>%
      filter(body_site == bs, measurement_type == mt)
    
    p <- ggplot(plot_data, aes(
      x = replicate,
      y = value,
      group = interaction(ParticipantCentreID, Season),
      color = cv_category
    )) +
      geom_line(alpha = 0.6, na.rm = TRUE) +
      geom_point(alpha = 0.6, na.rm = TRUE) +
      facet_wrap(~ Season) +
      scale_color_manual(
        values  = c("Low" = "green", "Medium" = "orange", "High" = "red"),
        na.value = "grey50",
        name    = "CV Category"
      ) +
      labs(
        title    = paste("Triplicate Measurements for", bs, "-", mt),
        subtitle = "Colored by CV: ≤5% (Green), 5–15% (Orange), >15% (Red)",
        x        = "Replicate",
        y        = "Value"
      ) +
      theme_minimal()
    
    filename <- file.path(output_dir, paste0("cv_", bs, "_", mt, ".png"))
    ggsave(filename, p, width = 12, height = 8)
    
    all_plots[[paste(bs, mt, sep="_")]] <- p
  }
  
  return(all_plots)
}

## ---- pivot-long-original ----
library(tidyverse)
library(readr)

# 1) Read the wide dataset
skin_data <- read_csv("data/ScreeningDataCollection_cleaned.csv")

Rows: 220 Columns: 78
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr   (3): ParticipantCentreID, Ethnicity, Season
dbl  (74): Gender, SkinReflectanceForeheadE1, SkinReflectanceForeheadE2, Ski...
date  (1): TodayDate

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

# 2) Convert to long + calculate initial CV
all_triplicates_cv <- prepare_triplicates_for_plotting_with_cv(skin_data)

# 3) Plot lines color-coded by the initial CV
cv_plots_original <- plot_triplicate_cv_lines(
  all_triplicates_cv,
  output_dir = "output/cv_line_plots_original"
)

detect_distribution_outliers_by_metric <- function(long_data, iqr_factor = 1.5) {
  # Summarize distribution stats by measurement_type ONLY
  distribution_stats <- long_data %>%
    group_by(measurement_type) %>%
    summarize(
      n           = n(),
      min_value   = min(value, na.rm = TRUE),
      q1          = quantile(value, 0.25, na.rm = TRUE),
      median_val  = median(value, na.rm = TRUE),
      q3          = quantile(value, 0.75, na.rm = TRUE),
      max_value   = max(value, na.rm = TRUE),
      iqr         = q3 - q1,
      lower_thresh= q1 - iqr_factor * iqr,
      upper_thresh= q3 + iqr_factor * iqr,
      .groups     = "drop"
    )
  
  # Join thresholds back by measurement_type only
  data_with_thresh <- long_data %>%
    left_join(distribution_stats, by = "measurement_type")
  
  # Mark outliers
  data_out <- data_with_thresh %>%
    mutate(is_outlier_allSites = value < lower_thresh | value > upper_thresh)
  
  list(
    stats = distribution_stats,
    data  = data_out
  )
}

plot_distribution_outliers_by_metric <- function(data_with_outliers, dist_stats,
                                                 output_dir = "output/dist_plots_by_metric") {
  dir.create(output_dir, showWarnings = FALSE, recursive = TRUE)
  
  # Unique measurement_types
  metrics <- dist_stats$measurement_type
  
  all_plots <- list()
  
  for (mt in metrics) {
    # Pull the row with thresholds for this metric
    thresh <- dist_stats %>%
      filter(measurement_type == mt)
    
    # If there's no data for that metric, skip
    if (nrow(thresh) == 0) next
    
    # Filter the main data to just this metric
    plot_data <- data_with_outliers %>%
      filter(measurement_type == mt)
    
    # Build a histogram ignoring site
    p <- ggplot(plot_data, aes(x = value)) +
      geom_histogram(
        binwidth = (thresh$max_value - thresh$min_value) / 30,
        fill = "skyblue", alpha = 0.7
      ) +
      geom_vline(xintercept = thresh$lower_thresh, color = "red", linetype = "dashed") +
      geom_vline(xintercept = thresh$upper_thresh, color = "red", linetype = "dashed") +
      labs(
        title = paste("Distribution of", mt, "(All Body Sites)"),
        subtitle = paste0("Red dashed lines = outlier cutoffs (",
                          round(thresh$lower_thresh, 2), ", ",
                          round(thresh$upper_thresh, 2), ")"),
        x = paste(mt, "Value"),
        y = "Count"
      ) +
      theme_minimal()
    
    # Save the plot
    filename <- file.path(output_dir, paste0("dist_", mt, "_allSites.png"))
    ggsave(filename, p, width = 10, height = 6)
    
    all_plots[[mt]] <- p
  }
  
  return(all_plots)
}

## ---- distribution-detection ----

# 1) Detect outliers ignoring body site
res <- detect_distribution_outliers_by_metric(all_triplicates_cv) 
dist_stats <- res$stats
dist_data  <- res$data  # 'dist_data' has is_outlier_allSites = TRUE/FALSE

# 2) Plot distribution ignoring site
dist_plots_allSites <- plot_distribution_outliers_by_metric(
  data_with_outliers = dist_data,
  dist_stats         = dist_stats,
  output_dir         = "output/dist_plots_by_metric"
)

Warning: Removed 76 rows containing non-finite outside the scale range
(`stat_bin()`).

Warning: Removed 18 rows containing non-finite outside the scale range (`stat_bin()`).
Removed 18 rows containing non-finite outside the scale range (`stat_bin()`).

Warning: Removed 19 rows containing non-finite outside the scale range
(`stat_bin()`).

Warning: Removed 18 rows containing non-finite outside the scale range
(`stat_bin()`).

Warning: Removed 76 rows containing non-finite outside the scale range
(`stat_bin()`).

Warning: Removed 18 rows containing non-finite outside the scale range
(`stat_bin()`).

Warning: Removed 76 rows containing non-finite outside the scale range
(`stat_bin()`).

## ---- distribution-removal-stage1 ----

# 1) Remove distribution outliers => set 'value' to NA if is_outlier_allSites == TRUE
dist_data_clean <- dist_data %>%
  mutate(value = ifelse(is_outlier_allSites, NA_real_, value))

# 2) Recompute CV on this newly "cleaned" data
dist_data_clean_cv <- calculate_triplicate_cv(dist_data_clean)

# 3) Plot lines with updated CV categories
cv_plots_stage1 <- plot_triplicate_cv_lines(
  dist_data_clean_cv,
  output_dir = "output/cv_line_plots_stage1"
)

Visualize CV after outlier removal

## ---- distribution-removal-stage1 ----

# 1) Remove distribution outliers => set 'value' to NA if is_outlierAllSites == TRUE
dist_data_clean <- dist_data %>%
  mutate(value = ifelse(is_outlier_allSites, NA_real_, value))


# 2) (Optional) Save or examine replicate-level data after Stage 1
write_csv(dist_data_clean, "data/ScreeningDataCollection_stage1.csv")

names(dist_data_clean)

 [1] "ParticipantCentreID" "Gender"              "Ethnicity"          
 [4] "Season"              "measurement"         "value"              
 [7] "replicate"           "body_site"           "measurement_type"   
[10] "mean_val"            "sd_val"              "n_valid"            
[13] "cv"                  "cv_category"         "n"                  
[16] "min_value"           "q1"                  "median_val"         
[19] "q3"                  "max_value"           "iqr"                
[22] "lower_thresh"        "upper_thresh"        "is_outlier_allSites"

head(dist_data_clean)

# A tibble: 6 × 24
  ParticipantCentreID Gender Ethnicity Season measurement        value replicate
  <chr>                <dbl> <chr>     <chr>  <chr>              <dbl> <chr>    
1 VDKH001                  1 Xhosa     Summer SkinReflectanceFo…  17.0 1        
2 VDKH001                  1 Xhosa     Summer SkinReflectanceFo…  18.7 2        
3 VDKH001                  1 Xhosa     Summer SkinReflectanceFo…  18.6 3        
4 VDKH002                  1 Xhosa     Summer SkinReflectanceFo…  17.2 1        
5 VDKH002                  1 Xhosa     Summer SkinReflectanceFo…  19.8 2        
6 VDKH002                  1 Xhosa     Summer SkinReflectanceFo…  17.4 3        
# ℹ 17 more variables: body_site <chr>, measurement_type <chr>, mean_val <dbl>,
#   sd_val <dbl>, n_valid <int>, cv <dbl>, cv_category <chr>, n <int>,
#   min_value <dbl>, q1 <dbl>, median_val <dbl>, q3 <dbl>, max_value <dbl>,
#   iqr <dbl>, lower_thresh <dbl>, upper_thresh <dbl>,
#   is_outlier_allSites <lgl>

# 2) Recompute CV on this newly "cleaned" data
dist_data_clean_cv <- calculate_triplicate_cv(dist_data_clean)

names(dist_data_clean_cv)

 [1] "ParticipantCentreID" "Gender"              "Ethnicity"          
 [4] "Season"              "measurement"         "value"              
 [7] "replicate"           "body_site"           "measurement_type"   
[10] "n"                   "min_value"           "q1"                 
[13] "median_val"          "q3"                  "max_value"          
[16] "iqr"                 "lower_thresh"        "upper_thresh"       
[19] "is_outlier_allSites" "mean_val"            "sd_val"             
[22] "n_valid"             "cv"                  "cv_category"

head(dist_data_clean_cv)

# A tibble: 6 × 24
  ParticipantCentreID Gender Ethnicity Season measurement        value replicate
  <chr>                <dbl> <chr>     <chr>  <chr>              <dbl> <chr>    
1 VDKH001                  1 Xhosa     Summer SkinReflectanceFo…  17.0 1        
2 VDKH001                  1 Xhosa     Summer SkinReflectanceFo…  18.7 2        
3 VDKH001                  1 Xhosa     Summer SkinReflectanceFo…  18.6 3        
4 VDKH002                  1 Xhosa     Summer SkinReflectanceFo…  17.2 1        
5 VDKH002                  1 Xhosa     Summer SkinReflectanceFo…  19.8 2        
6 VDKH002                  1 Xhosa     Summer SkinReflectanceFo…  17.4 3        
# ℹ 17 more variables: body_site <chr>, measurement_type <chr>, n <int>,
#   min_value <dbl>, q1 <dbl>, median_val <dbl>, q3 <dbl>, max_value <dbl>,
#   iqr <dbl>, lower_thresh <dbl>, upper_thresh <dbl>,
#   is_outlier_allSites <lgl>, mean_val <dbl>, sd_val <dbl>, n_valid <int>,
#   cv <dbl>, cv_category <chr>

Remove based on CV

# 2) Remove high CV replicate
remove_high_cv_replicate <- function(data, cv_threshold = 15) {
  library(dplyr)
  
  data_new <- data %>%
    group_by(ParticipantCentreID, Season, body_site, measurement_type) %>%
    group_modify(~ {
      .x <- .x
      
      # "Classic" CV with mean
      mean_val  <- mean(.x$value, na.rm=TRUE)
      sd_val    <- sd(.x$value, na.rm=TRUE)
      n_valid   <- sum(!is.na(.x$value))
      cv_val    <- if (n_valid >= 2 && !is.na(mean_val) && mean_val != 0) {
        (sd_val / mean_val) * 100
      } else {
        NA_real_
      }
      
      # Use median to pick the farthest replicate if CV>threshold
      median_val <- median(.x$value, na.rm=TRUE)
      
      if (!is.na(cv_val) && cv_val > cv_threshold) {
        idx_farthest <- which.max(abs(.x$value - median_val))
        .x$value[idx_farthest] <- NA
      }
      .x
    }) %>%
    ungroup()
  
  return(data_new)
}

## ---- stage2-highcv-removal ----
# 5) Remove high CV replicate from Stage 1–cleaned data
data_stage2 <- remove_high_cv_replicate(dist_data_clean, cv_threshold = 15)

# 6) Recompute CV again with final replicate set
data_stage2_cv <- calculate_triplicate_cv(data_stage2)

# 7) Plot final lines color-coded by updated CV for Stage 2
cv_plots_stage2 <- plot_triplicate_cv_lines(
  data_stage2_cv,
  output_dir = "output/cv_line_plots_stage2"
)

# 8) (Optional) Save replicate-level final dataset (Stage 2 cleaned)
write_csv(data_stage2, "data/ScreeningDataCollection_stage2.csv")

## ---- stage2-final-medians ----
# 9) Create final medians from data_stage2
df_final_medians <- data_stage2 %>%
  group_by(ParticipantCentreID, Ethnicity, Gender, Season, body_site, measurement_type) %>%
  summarize(
    final_median = median(value, na.rm = TRUE),
    n_valid = sum(!is.na(value)),
    .groups = "drop"
  )

# 10) Write final medians
write_csv(df_final_medians, "data/ScreeningDataCollection_medians.csv")

sessionInfo()

R version 4.4.3 (2025-02-28)
Platform: aarch64-apple-darwin20
Running under: macOS Sequoia 15.3.1

Matrix products: default
BLAS:   /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRblas.0.dylib 
LAPACK: /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRlapack.dylib;  LAPACK version 3.12.0

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

time zone: America/Detroit
tzcode source: internal

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] lubridate_1.9.4 forcats_1.0.0   stringr_1.5.1   dplyr_1.1.4    
 [5] purrr_1.0.4     readr_2.1.5     tidyr_1.3.1     tibble_3.2.1   
 [9] ggplot2_3.5.1   tidyverse_2.0.0 readxl_1.4.3    workflowr_1.7.1

loaded via a namespace (and not attached):
 [1] gtable_0.3.6      xfun_0.50         bslib_0.9.0       processx_3.8.5   
 [5] callr_3.7.6       tzdb_0.4.0        vctrs_0.6.5       tools_4.4.3      
 [9] ps_1.8.1          generics_0.1.3    parallel_4.4.3    pkgconfig_2.0.3  
[13] lifecycle_1.0.4   compiler_4.4.3    farver_2.1.2      git2r_0.35.0     
[17] textshaping_1.0.0 munsell_0.5.1     getPass_0.2-4     httpuv_1.6.15    
[21] htmltools_0.5.8.1 sass_0.4.9        yaml_2.3.10       later_1.4.1      
[25] pillar_1.10.1     crayon_1.5.3      jquerylib_0.1.4   whisker_0.4.1    
[29] cachem_1.1.0      tidyselect_1.2.1  digest_0.6.37     stringi_1.8.4    
[33] labeling_0.4.3    rprojroot_2.0.4   fastmap_1.2.0     grid_4.4.3       
[37] colorspace_2.1-1  cli_3.6.3         magrittr_2.0.3    utf8_1.2.4       
[41] withr_3.0.2       scales_1.3.0      promises_1.3.2    bit64_4.6.0-1    
[45] timechange_0.3.0  rmarkdown_2.29    httr_1.4.7        bit_4.5.0.1      
[49] cellranger_1.1.0  ragg_1.3.3        hms_1.1.3         evaluate_1.0.3   
[53] knitr_1.49        rlang_1.1.5       Rcpp_1.0.14       glue_1.8.0       
[57] rstudioapi_0.17.1 vroom_1.6.5       jsonlite_1.8.9    R6_2.5.1         
[61] systemfonts_1.2.1 fs_1.6.5

Pigmentation Data

Tina Lasisi

2025-03-09 16:00:13

Introduction

Long data pivot

Visualize CV after outlier removal

Remove based on CV