Last updated: 2024-10-07

Checks: 6 passed, 1 warning

Knit directory: SAPPHIRE/

This reproducible R Markdown analysis was created with workflowr (version 1.7.1). The Checks tab describes the reproducibility checks that were applied when the results were created. The Past versions tab lists the development history.


The R Markdown file has unstaged changes. To know which version of the R Markdown file created these results, you’ll want to first commit it to the Git repo. If you’re still working on the analysis, you can ignore this warning. When you’re finished, you can run wflow_publish to commit the R Markdown file and build the HTML.

Great job! The global environment was empty. Objects defined in the global environment can affect the analysis in your R Markdown file in unknown ways. For reproducibility it’s best to always run the code in an empty environment.

The command set.seed(20240923) was run prior to running the code in the R Markdown file. Setting a seed ensures that any results that rely on randomness, e.g. subsampling or permutations, are reproducible.

Great job! Recording the operating system, R version, and package versions is critical for reproducibility.

Nice! There were no cached chunks for this analysis, so you can be confident that you successfully produced the results during this run.

Great job! Using relative paths to the files within your workflowr project makes it easier to run your code on other machines.

Great! You are using Git for version control. Tracking code development and connecting the code version to the results is critical for reproducibility.

The results in this page were generated with repository version 5f88ad5. See the Past versions tab to see a history of the changes made to the R Markdown and HTML files.

Note that you need to be careful to ensure that all relevant files for the analysis have been committed to Git prior to generating the results (you can use wflow_publish or wflow_git_commit). workflowr only checks the R Markdown file, but you know if there are other scripts or data files that it depends on. Below is the status of the Git repository when the results were generated:


Ignored files:
    Ignored:    .DS_Store
    Ignored:    .Rapp.history
    Ignored:    .Rhistory
    Ignored:    .Rproj.user/
    Ignored:    analysis/.DS_Store
    Ignored:    data/.DS_Store

Unstaged changes:
    Deleted:    analysis/data_cleaning_1.Rmd
    Deleted:    analysis/data_cleaning_2.Rmd
    Deleted:    analysis/data_cleaning_3.Rmd
    Modified:   analysis/data_cleaning_composite.Rmd

Note that any generated files, e.g. HTML, png, CSS, etc., are not included in this status report because it is ok for generated content to have uncommitted changes.


These are the previous versions of the repository in which changes were made to the R Markdown (analysis/data_cleaning_composite.Rmd) and HTML (docs/data_cleaning_composite.html) files. If you’ve configured a remote Git repository (see ?wflow_git_remote), click on the hyperlinks in the table below to view the files as they were in that past version.

File  Version  Author      Date        Message
html  5f88ad5  calliquire  2024-10-07  Build site.
Rmd   0acb3db  calliquire  2024-10-07  wflow_publish("analysis/data_cleaning_composite.Rmd")

About This Analysis: This analysis is the composite data-cleaning process that includes all three data_cleaning_ pages in one.

Part 1: Initial Data-Cleaning

The goal of this analysis is to begin wrangling the original raw data, first through manual editing in Google Sheets and then through an R pipeline, to produce a long-form data set.

Set Up

  1. Load the relevant packages.
library(readxl)
library(dplyr)
library(janitor)
library(tidyverse)

Prepare Data

  1. In Google Sheets, make a copy of the original data, save that copy as “serum_vit_D_study_with_lab_results.xlsx” and manually edit the following:
  • delete floating note in ScreeningDataCollectionWinter sheet: “Note: VDKH001 had no second weight measurement so the initial measurement was used.”
  • delete floating note in ScreeningDataCollection6Weeks sheet: “Note: VDKH007 and VDKH012 had no second weight measurement so the initial measurement was used.”
  • change column DT name in ScreeningDataCollectionWinter sheet from “Result” to “VitDResult” to match the same measures in the Summer and 6 Weeks sheets
  • delete columns A and B (named ParticipantID and ParticipantCentre, respectively) so that column C becomes your unique identifier (named ParticipantCentreID) in all of the individual sheets within the workbook
  • delete floating notes in FoodFrequencySummer sheet: “Notes: VDKH016 - No information for Pilchards or Liver beef/lamb; VDTG011 - No Vit D for Marg soft.”
  • delete floating notes in FoodFrequencyWinter sheet: “Note: No amounts calculated for VDKH047 - Soft Marg - or VDKH050 - Snoek; Where two types of margerine were specified, the brand with the lower Vit D percentage was used to calculate Vit D ie VDTG023, 032,045”

Begin Wrangling in R

  1. Specify the path to the new Excel file. This .xlsx file is located in ~/GitHub/SAPPHIRE/data.
file_path <- "data/serum_vit_D_study_with_lab_results.xlsx"

Category 1: Screening Data Collection

  1. Load data from the Screening Data Collection sheets.
screening_summer <- read_excel(file_path, sheet = "ScreeningDataCollectionSummer")
screening_winter <- read_excel(file_path, sheet = "ScreeningDataCollectionWinter")
screening_6weeks <- read_excel(file_path, sheet = "ScreeningDataCollection6Weeks")
  1. Convert the column names to snake_case.
screening_summer <- screening_summer %>% clean_names()
screening_winter <- screening_winter %>% clean_names()
screening_6weeks <- screening_6weeks %>% clean_names()
  1. Standardize the data types for critical columns, for example, age_years.
screening_summer <- screening_summer %>% mutate(age_years = as.numeric(age_years))
screening_winter <- screening_winter %>% mutate(age_years = as.numeric(age_years))
screening_6weeks <- screening_6weeks %>% mutate(age_years = as.numeric(age_years))
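as.numeric() turns any text it cannot parse into NA and only emits a warning, so it can be worth checking which raw entries were lost in the conversion. The sketch below uses a made-up vector (age_raw is hypothetical, not taken from the study data) rather than the real age_years column:

```r
# Hypothetical values, as they might arrive from a spreadsheet column read as text
age_raw <- c("34", "41", "unknown", NA)

# Coerce to numeric; unparseable text becomes NA (warning suppressed for the demo)
age_num <- suppressWarnings(as.numeric(age_raw))

# Entries that were present in the raw data but could not be parsed as numbers
failed <- !is.na(age_raw) & is.na(age_num)
print(age_raw[failed])
```

Running the same check on each sheet's age_years column before the mutate() step would show whether any non-missing entries are being converted to NA.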
  1. Add a ‘collection_period’ column to each data frame to indicate when the data was collected.
screening_summer <- screening_summer %>% mutate(collection_period = "Summer")
screening_winter <- screening_winter %>% mutate(collection_period = "Winter")
screening_6weeks <- screening_6weeks %>% mutate(collection_period = "6Weeks")
  1. Combine the dataframes into a long-form dataset.
screening_long <- bind_rows(screening_summer, screening_winter, screening_6weeks)
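bind_rows() matches columns by name and fills any column missing from one input with NA, so a leftover naming difference between sheets (such as the “Result” versus “VitDResult” column renamed by hand earlier) would produce two half-empty columns rather than an error. A minimal sketch of a name check, using hypothetical one-row stand-ins (summer_demo, winter_demo) instead of the real sheets:

```r
# Hypothetical one-row stand-ins for two cleaned sheets
summer_demo <- data.frame(participant_centre_id = "P001", vit_d_result = 50)
winter_demo <- data.frame(participant_centre_id = "P001", result = 48)

# Column names present in one sheet but not in the other
only_in_winter <- setdiff(names(winter_demo), names(summer_demo))
only_in_summer <- setdiff(names(summer_demo), names(winter_demo))

print(only_in_winter)
print(only_in_summer)
```

If both vectors are empty for every pair of sheets, the combined data frame should have no columns that are populated for only one collection period because of a naming mismatch.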
  1. Save the long-form dataset as a .csv file in data/.
write.csv(screening_long, "data/screening_long.csv", row.names = FALSE)
  1. Manually copy this file into the project_SAPPHIRE Google Drive.

Part 2: Secondary Data-Cleaning

The goal of this analysis is to reshape the screening_long.csv long-form dataset so that it has columns for body_site and reflectance_metric (in addition to collection_period, which was added in Part 1).

Set Up

  1. Load the relevant packages.
library(tidyr)
library(dplyr)
  1. Load the data.
data <- read.csv("data/screening_long.csv")

Reshape and Clean Data

  1. Reshape the data so that each reflectance value sits in its own row, as a first step toward separate columns for body site and reflectance metric. This creates a long-format dataset with two new columns:
  • measurement: Contains the original column names (e.g., skin_reflectance_forehead_m1).
  • reflectance_value: Contains the reflectance values.
data_long <- pivot_longer(data, 
                          cols = starts_with("skin_reflectance"), 
                          names_to = "measurement", 
                          values_to = "reflectance_value")
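To make the wide-to-long step concrete, the sketch below performs the same kind of reshape in base R on a tiny hypothetical data frame (wide_demo and its values are invented for illustration). It is not the tidyr implementation, and the row order differs from pivot_longer(), but the resulting columns carry the same information:

```r
# Hypothetical wide data: one row per participant, one column per measurement
wide_demo <- data.frame(
  id = c("P1", "P2"),
  skin_reflectance_forehead_m1 = c(1.1, 2.2),
  skin_reflectance_forehead_e1 = c(3.3, 4.4)
)

# Columns to stack, chosen by prefix as in the pipeline above
value_cols <- grep("^skin_reflectance", names(wide_demo), value = TRUE)

# Stack the measurement columns: one row per participant per measurement
long_demo <- data.frame(
  id = rep(wide_demo$id, times = length(value_cols)),
  measurement = rep(value_cols, each = nrow(wide_demo)),
  reflectance_value = unlist(wide_demo[value_cols], use.names = FALSE)
)

print(long_demo)
```

A useful sanity check on the real data is the same arithmetic: the long table should have as many rows as the wide table multiplied by the number of skin_reflectance columns.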
  1. Extract body site and reflectance type data from the new measurement column.
data_long <- data_long %>%
  extract(col = "measurement",
          into = c("body_site", "reflectance_metric"),
          regex = "skin_reflectance_([a-zA-Z_]+?)(l[123]|l_[123]|a[123]|a_[123]|b[123]|b_[123]|g[123]|r[123]|m[123]|e[123])",
          remove = FALSE) %>%
  mutate(body_site = gsub("_$", "", body_site))  # Remove any trailing underscore from body_site
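The regular expression can be tried on a few names before it is applied to the whole table. The sketch below uses base R's regexec() with perl = TRUE instead of tidyr's extract(), and the two column names are hypothetical examples of the naming pattern, not confirmed names from the dataset:

```r
# Same pattern as in the extract() call above
pattern <- "skin_reflectance_([a-zA-Z_]+?)(l[123]|l_[123]|a[123]|a_[123]|b[123]|b_[123]|g[123]|r[123]|m[123]|e[123])"

# Hypothetical column names following the assumed naming scheme
cols <- c("skin_reflectance_forehead_m1",
          "skin_reflectance_upper_inner_arm_l_2")

# Each list element holds the full match followed by the two capture groups
matches <- regmatches(cols, regexec(pattern, cols, perl = TRUE))
parts <- do.call(rbind, matches)

# Group 1 is the body site (with a trailing underscore), group 2 the metric
body_site <- sub("_$", "", parts[, 2])
reflectance_metric <- parts[, 3]

print(data.frame(cols, body_site, reflectance_metric))
```

The lazy quantifier (+?) makes the first group stop at the earliest position where a metric code follows, which is why the trailing underscore ends up in the body site and has to be removed afterwards.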
  1. Delete the “measurement” column since it is no longer needed.
data_long <- data_long %>%
  select(-measurement)
  1. Save the cleaned data file.
write.csv(data_long, "data/cleaned_screening_long.csv", row.names = FALSE)

Part 3: Tertiary and Final Data-Cleaning

The goal of this analysis is to:
- Handle missing values
- Convert categorical values to factors
- Reshape and filter data again
- Remove irrelevant data columns

Set Up

  1. Load relevant libraries.
# Load necessary libraries
library(dplyr)
  1. Read in the data.
df <- read.csv("data/cleaned_screening_long.csv")

Wrangling

  1. Check for missing values.
# Check for missing values
missing_summary <- sapply(df, function(x) sum(is.na(x)))
print(missing_summary)
                                       participant_centre_id 
                                                           0 
                                            interviewer_name 
                                                           0 
                                                  today_date 
                                                          72 
                                                   age_years 
                                                          72 
                                               date_of_birth 
                                                           0 
                                                      gender 
                                                           0 
                                          ethnicity_coloured 
                                                           0 
                                             ethnicity_white 
                                                           0 
                                     ethnicity_african_black 
                                                           0 
                                      ethnicity_indian_asian 
                                                           0 
                                             ethnicity_other 
                                                           0 
                                     ethnicity_specify_other 
                                                       15840 
                                                   ethnicity 
                                                       15840 
                                            refuse_to_answer 
                                                           0 
                                             weight_measure1 
                                                           0 
                                             weight_measure2 
                                                           0 
                                                  avg_weight 
                                                           0 
                                             height_measure1 
                                                           0 
                                             height_measure2 
                                                          72 
                                                  avg_height 
                                                          72 
                                                         bmi 
                                                          72 
                                             sore_throat_yes 
                                                           0 
                                              sore_throat_no 
                                                           0 
                                              runny_nose_yes 
                                                           0 
                                               runny_nose_no 
                                                           0 
                                                   cough_yes 
                                                           0 
                                                    cough_no 
                                                           0 
                                                   fever_yes 
                                                           0 
                                                    fever_no 
                                                           0 
                                            night_sweats_yes 
                                                           0 
                                             night_sweats_no 
                                                           0 
                                 unexplained_weight_loss_yes 
                                                           0 
                                  unexplained_weight_loss_no 
                                                           0 
                            currently_taking_supplements_yes 
                                                           0 
                             currently_taking_supplements_no 
                                                           0 
                                                 supplements 
                                                       15120 
                            currently_taking_medications_yes 
                                                           0 
                             currently_taking_medications_no 
                                                           0 
                                                 medications 
                                                       15264 
                                    participant_included_yes 
                                                           0 
                                     participant_included_no 
                                                           0 
                                  if_no_reason_for_exclusion 
                                                       15768 
smoking_do_you_regularly_smoke_at_least1cigarette5daysa_week 
                                                           0 
                                            smoking_comments 
                                                       11304 
                                    pathcare_sticker_barcode 
                                                         288 
                                                     req_num 
                                                         288 
                                                   serv_date 
                                                         216 
                                                vit_d_result 
                                                         216 
                                               res_date_time 
                                                         288 
                                           collection_period 
                                                           0 
                                          continued_in_study 
                                                        9864 
                            if_not_continued_in_study_reason 
                                                       15624 
                                                   body_site 
                                                           0 
                                          reflectance_metric 
                                                           0 
                                           reflectance_value 
                                                         319 
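With this many columns, it can be easier to read the summary when only the columns that actually contain missing values are shown. A minimal sketch on a small hypothetical data frame (df_demo is invented for illustration), using colSums(), which gives the same counts as the sapply() call above:

```r
# Hypothetical data frame with missing values in two of three columns
df_demo <- data.frame(
  a = c(1, NA, 3),
  b = c("x", "y", "z"),
  c = c(NA, NA, 1)
)

# Count missing values per column
missing_demo <- colSums(is.na(df_demo))

# Keep only the columns that have at least one missing value
nonzero_missing <- missing_demo[missing_demo > 0]
print(nonzero_missing)
```

Applied to df, the same filter would reduce the printout above to the columns with non-zero counts.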
  1. Create a column for ethnicity (xhosa, cape_coloured, or Other). If participants answered TRUE to ethnicity_african_black and FALSE to all other ethnicity_ questions, they are designated as xhosa. If participants answered TRUE to ethnicity_coloured and FALSE to all other ethnicity_ questions, they are designated as cape_coloured. All remaining participants are designated as Other.
# Create the ethnicity column based on conditions
df <- df %>%
  mutate(ethnicity = case_when(
    ethnicity_african_black == TRUE & 
    ethnicity_coloured == FALSE & 
    ethnicity_white == FALSE & 
    ethnicity_indian_asian == FALSE & 
    ethnicity_other == FALSE ~ "xhosa",
    
    ethnicity_coloured == TRUE & 
    ethnicity_african_black == FALSE & 
    ethnicity_white == FALSE & 
    ethnicity_indian_asian == FALSE & 
    ethnicity_other == FALSE ~ "cape_coloured",
    
    TRUE ~ NA_character_  # Assign NA for all other cases
  ))

# Fill NA values with 'Other' if needed
df$ethnicity[is.na(df$ethnicity)] <- "Other"
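The classification rule can be checked on a handful of made-up rows before trusting it on the full table. The sketch below restates the rule in base R as “exactly one flag is TRUE, and it is the relevant one”. It assumes the ethnicity_ flags are logical with no missing values, and demo is a hypothetical data frame, not study data:

```r
# Hypothetical participants covering the main cases
demo <- data.frame(
  ethnicity_african_black = c(TRUE,  FALSE, TRUE,  FALSE),
  ethnicity_coloured      = c(FALSE, TRUE,  TRUE,  FALSE),
  ethnicity_white         = c(FALSE, FALSE, FALSE, TRUE),
  ethnicity_indian_asian  = FALSE,
  ethnicity_other         = FALSE
)

flag_cols <- c("ethnicity_african_black", "ethnicity_coloured",
               "ethnicity_white", "ethnicity_indian_asian", "ethnicity_other")

# TRUE where a participant selected exactly one ethnicity flag
only_one <- rowSums(demo[flag_cols]) == 1

# Default to "Other", then overwrite the two single-flag groups of interest
demo$ethnicity <- "Other"
demo$ethnicity[only_one & demo$ethnicity_african_black] <- "xhosa"
demo$ethnicity[only_one & demo$ethnicity_coloured] <- "cape_coloured"

print(table(demo$ethnicity))
```

Here the third row (two flags selected) and the fourth row (a different single flag) both fall into Other, which matches the behavior of the case_when() call followed by the NA fill.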
  1. Check the updated data frame.

  2. Convert categorical variables into factors.

df$body_site <- factor(df$body_site)
df$collection_period <- factor(df$collection_period)
df$reflectance_metric <- factor(df$reflectance_metric)
df$ethnicity <- factor(df$ethnicity)
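By default, factor() orders levels by sorting the unique values, which later affects plot axes and model reference levels. If a particular order is wanted, it can be stated explicitly. The order used below (Summer, Winter, 6Weeks) is an assumption for illustration; the analysis does not state which order is intended:

```r
# Hypothetical vector of collection periods
period <- c("Summer", "Winter", "6Weeks", "Summer")

# Default: levels are the sorted unique values
default_period <- factor(period)
print(levels(default_period))

# Explicit level order (assumed here, not taken from the analysis)
ordered_period <- factor(period, levels = c("Summer", "Winter", "6Weeks"))
print(levels(ordered_period))
```

The underlying values are unchanged in both cases; only the order in which the levels are stored differs.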
  1. Check structure of data frame.

  2. Remove columns that are irrelevant to the analysis by selecting only the relevant columns/variables.

df_filtered <- df %>%
  select(participant_centre_id, gender, ethnicity, 
         vit_d_result, collection_period, body_site, 
         reflectance_metric, reflectance_value)
  1. Save the filtered data frame.
write.csv(df_filtered, "data/filtered_screening_long.csv", row.names = FALSE)

sessionInfo()
R version 4.4.1 (2024-06-14)
Platform: aarch64-apple-darwin20
Running under: macOS Sonoma 14.6.1

Matrix products: default
BLAS:   /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRblas.0.dylib 
LAPACK: /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRlapack.dylib;  LAPACK version 3.12.0

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

time zone: America/Detroit
tzcode source: internal

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] lubridate_1.9.3 forcats_1.0.0   stringr_1.5.1   purrr_1.0.2    
 [5] readr_2.1.5     tidyr_1.3.1     tibble_3.2.1    ggplot2_3.5.1  
 [9] tidyverse_2.0.0 janitor_2.2.0   dplyr_1.1.4     readxl_1.4.3   

loaded via a namespace (and not attached):
 [1] sass_0.4.9        utf8_1.2.4        generics_0.1.3    stringi_1.8.4    
 [5] hms_1.1.3         digest_0.6.37     magrittr_2.0.3    evaluate_1.0.0   
 [9] grid_4.4.1        timechange_0.3.0  fastmap_1.2.0     cellranger_1.1.0 
[13] rprojroot_2.0.4   workflowr_1.7.1   jsonlite_1.8.9    whisker_0.4.1    
[17] promises_1.3.0    fansi_1.0.6       scales_1.3.0      jquerylib_0.1.4  
[21] cli_3.6.3         rlang_1.1.4       munsell_0.5.1     withr_3.0.1      
[25] cachem_1.1.0      yaml_2.3.10       tools_4.4.1       tzdb_0.4.0       
[29] colorspace_2.1-1  httpuv_1.6.15     vctrs_0.6.5       R6_2.5.1         
[33] lifecycle_1.0.4   git2r_0.33.0      snakecase_0.11.1  fs_1.6.4         
[37] pkgconfig_2.0.3   pillar_1.9.0      bslib_0.8.0       later_1.3.2      
[41] gtable_0.3.5      glue_1.7.0        Rcpp_1.0.13       xfun_0.47        
[45] tidyselect_1.2.1  rstudioapi_0.16.0 knitr_1.48        htmltools_0.5.8.1
[49] rmarkdown_2.28    compiler_4.4.1