data_cleaning

Last updated: 2024-10-07

Checks: 6 passed, 1 warning

Knit directory: SAPPHIRE/

This reproducible R Markdown analysis was created with workflowr (version 1.7.1). The Checks tab describes the reproducibility checks that were applied when the results were created. The Past versions tab lists the development history.

R Markdown file: uncommitted changes

The R Markdown file has unstaged changes. To know which version of the R Markdown file created these results, you’ll want to first commit it to the Git repo. If you’re still working on the analysis, you can ignore this warning. When you’re finished, you can run wflow_publish to commit the R Markdown file and build the HTML.

Environment: empty

Great job! The global environment was empty. Objects defined in the global environment can affect the analysis in your R Markdown file in unknown ways. For reproducibility it’s best to always run the code in an empty environment.

Seed: set.seed(20240923)

The command set.seed(20240923) was run prior to running the code in the R Markdown file. Setting a seed ensures that any results that rely on randomness, e.g. subsampling or permutations, are reproducible.

Session information: recorded

Great job! Recording the operating system, R version, and package versions is critical for reproducibility.

Cache: none

Nice! There were no cached chunks for this analysis, so you can be confident that you successfully produced the results during this run.

File paths: relative

Great job! Using relative paths to the files within your workflowr project makes it easier to run your code on other machines.

Repository version: dfd6c47

Great! You are using Git for version control. Tracking code development and connecting the code version to the results is critical for reproducibility.

The results in this page were generated with repository version dfd6c47. See the Past versions tab to see a history of the changes made to the R Markdown and HTML files.

Note that you need to be careful to ensure that all relevant files for the analysis have been committed to Git prior to generating the results (you can use wflow_publish or wflow_git_commit). workflowr only checks the R Markdown file, but you know if there are other scripts or data files that it depends on. Below is the status of the Git repository when the results were generated:


Ignored files:
    Ignored:    .DS_Store
    Ignored:    .Rapp.history
    Ignored:    .Rhistory
    Ignored:    .Rproj.user/
    Ignored:    analysis/.DS_Store
    Ignored:    data/.DS_Store

Untracked files:
    Untracked:  data/filtered_screening_long.csv

Unstaged changes:
    Modified:   analysis/data_cleaning_3.Rmd

Note that any generated files, e.g. HTML, png, CSS, etc., are not included in this status report because it is ok for generated content to have uncommitted changes.

These are the previous versions of the repository in which changes were made to the R Markdown (analysis/data_cleaning_3.Rmd) and HTML (docs/data_cleaning_3.html) files. If you’ve configured a remote Git repository (see ?wflow_git_remote), click on the hyperlinks in the table below to view the files as they were in that past version.

File	Version	Author	Date	Message
Rmd	dfd6c47	calliquire	2024-10-07	made data_cleaning_3 and started optimizing data for mixed effects mnodel
html	dfd6c47	calliquire	2024-10-07	made data_cleaning_3 and started optimizing data for mixed effects mnodel
Rmd	cfcc9bc	calliquire	2024-10-07	made links to data cleaning 1 and 2 on index

About This Analysis: This is the third data-cleaning pass. It must be run after data_cleaning_2.

Overview

Handling Missing Values.
Converting Categorical Variables to Factors.
Reshaping or Filtering the Data (if necessary).
Removing Unnecessary Columns.

Load relevant libraries.

# Load necessary libraries
library(dplyr)

Read in the data.

df <- read.csv("data/cleaned_screening_long.csv")

Check for missing values.

# Check for missing values
missing_summary <- sapply(df, function(x) sum(is.na(x)))
print(missing_summary)

                                       participant_centre_id 
                                                           0 
                                            interviewer_name 
                                                           0 
                                                  today_date 
                                                          72 
                                                   age_years 
                                                          72 
                                               date_of_birth 
                                                           0 
                                                      gender 
                                                           0 
                                          ethnicity_coloured 
                                                           0 
                                             ethnicity_white 
                                                           0 
                                     ethnicity_african_black 
                                                           0 
                                      ethnicity_indian_asian 
                                                           0 
                                             ethnicity_other 
                                                           0 
                                     ethnicity_specify_other 
                                                       15840 
                                                   ethnicity 
                                                       15840 
                                            refuse_to_answer 
                                                           0 
                                             weight_measure1 
                                                           0 
                                             weight_measure2 
                                                           0 
                                                  avg_weight 
                                                           0 
                                             height_measure1 
                                                           0 
                                             height_measure2 
                                                          72 
                                                  avg_height 
                                                          72 
                                                         bmi 
                                                          72 
                                             sore_throat_yes 
                                                           0 
                                              sore_throat_no 
                                                           0 
                                              runny_nose_yes 
                                                           0 
                                               runny_nose_no 
                                                           0 
                                                   cough_yes 
                                                           0 
                                                    cough_no 
                                                           0 
                                                   fever_yes 
                                                           0 
                                                    fever_no 
                                                           0 
                                            night_sweats_yes 
                                                           0 
                                             night_sweats_no 
                                                           0 
                                 unexplained_weight_loss_yes 
                                                           0 
                                  unexplained_weight_loss_no 
                                                           0 
                            currently_taking_supplements_yes 
                                                           0 
                             currently_taking_supplements_no 
                                                           0 
                                                 supplements 
                                                       15120 
                            currently_taking_medications_yes 
                                                           0 
                             currently_taking_medications_no 
                                                           0 
                                                 medications 
                                                       15264 
                                    participant_included_yes 
                                                           0 
                                     participant_included_no 
                                                           0 
                                  if_no_reason_for_exclusion 
                                                       15768 
smoking_do_you_regularly_smoke_at_least1cigarette5daysa_week 
                                                           0 
                                            smoking_comments 
                                                       11304 
                                    pathcare_sticker_barcode 
                                                         288 
                                                     req_num 
                                                         288 
                                                   serv_date 
                                                         216 
                                                vit_d_result 
                                                         216 
                                               res_date_time 
                                                         288 
                                           collection_period 
                                                           0 
                                          continued_in_study 
                                                        9864 
                            if_not_continued_in_study_reason 
                                                       15624 
                                                   body_site 
                                                           0 
                                          reflectance_metric 
                                                           0 
                                           reflectance_value 
                                                         319
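The raw counts above are easier to scan when restricted to the columns that actually contain missing values and expressed as a share of rows. The following is a minimal sketch on a toy data frame (the `toy` object and its values are hypothetical, not taken from the study data). Most of the non-zero counts above are multiples of 72, which would be consistent with 3 body sites × 24 reflectance metrics per screening record in this long format, though that is an inference from the output rather than something the analysis states.

```r
# Toy data frame standing in for df (hypothetical values)
toy <- data.frame(
  id    = c("A", "A", "B", "B"),
  bmi   = c(20.3, NA, 22.1, NA),
  notes = c(NA, NA, NA, "x"),
  stringsAsFactors = FALSE
)

# Count NAs per column, equivalent to the sapply() call above
n_missing <- colSums(is.na(toy))

# Build a small table with the count and the percentage of rows affected
missing_tbl <- data.frame(
  column      = names(n_missing),
  n_missing   = as.integer(n_missing),
  pct_missing = round(100 * as.integer(n_missing) / nrow(toy), 1)
)

# Keep only columns with at least one NA, most affected first
missing_tbl <- missing_tbl[missing_tbl$n_missing > 0, ]
missing_tbl <- missing_tbl[order(-missing_tbl$n_missing), ]
print(missing_tbl)
```

Applied to the real data, replacing `toy` with `df` should give the same counts as `missing_summary`, just filtered and sorted.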

Create an ethnicity column (xhosa, cape_coloured, or Other). Participants who answered TRUE to ethnicity_african_black and FALSE to all other ethnicity_ questions are designated xhosa. Participants who answered TRUE to ethnicity_coloured and FALSE to all other ethnicity_ questions are designated cape_coloured. All remaining participants are labeled Other.

# Create the ethnicity column based on conditions
df <- df %>%
  mutate(ethnicity = case_when(
    ethnicity_african_black == TRUE & 
    ethnicity_coloured == FALSE & 
    ethnicity_white == FALSE & 
    ethnicity_indian_asian == FALSE & 
    ethnicity_other == FALSE ~ "xhosa",
    
    ethnicity_coloured == TRUE & 
    ethnicity_african_black == FALSE & 
    ethnicity_white == FALSE & 
    ethnicity_indian_asian == FALSE & 
    ethnicity_other == FALSE ~ "cape_coloured",
    
    TRUE ~ NA_character_  # Assign NA for all other cases
  ))

# Label every row that matched neither condition as 'Other'
df$ethnicity[is.na(df$ethnicity)] <- "Other"

# Check the updated dataframe
head(df)

  participant_centre_id interviewer_name today_date age_years date_of_birth
1               VDKH001            Betty 2013-02-11        20    1993-08-07
2               VDKH001            Betty 2013-02-11        20    1993-08-07
3               VDKH001            Betty 2013-02-11        20    1993-08-07
4               VDKH001            Betty 2013-02-11        20    1993-08-07
5               VDKH001            Betty 2013-02-11        20    1993-08-07
6               VDKH001            Betty 2013-02-11        20    1993-08-07
  gender ethnicity_coloured ethnicity_white ethnicity_african_black
1      1              FALSE           FALSE                    TRUE
2      1              FALSE           FALSE                    TRUE
3      1              FALSE           FALSE                    TRUE
4      1              FALSE           FALSE                    TRUE
5      1              FALSE           FALSE                    TRUE
6      1              FALSE           FALSE                    TRUE
  ethnicity_indian_asian ethnicity_other ethnicity_specify_other ethnicity
1                  FALSE           FALSE                      NA     xhosa
2                  FALSE           FALSE                      NA     xhosa
3                  FALSE           FALSE                      NA     xhosa
4                  FALSE           FALSE                      NA     xhosa
5                  FALSE           FALSE                      NA     xhosa
6                  FALSE           FALSE                      NA     xhosa
  refuse_to_answer weight_measure1 weight_measure2 avg_weight height_measure1
1            FALSE            60.6            60.6       60.6           1.728
2            FALSE            60.6            60.6       60.6           1.728
3            FALSE            60.6            60.6       60.6           1.728
4            FALSE            60.6            60.6       60.6           1.728
5            FALSE            60.6            60.6       60.6           1.728
6            FALSE            60.6            60.6       60.6           1.728
  height_measure2 avg_height      bmi sore_throat_yes sore_throat_no
1           1.728      1.728 20.29482           FALSE           TRUE
2           1.728      1.728 20.29482           FALSE           TRUE
3           1.728      1.728 20.29482           FALSE           TRUE
4           1.728      1.728 20.29482           FALSE           TRUE
5           1.728      1.728 20.29482           FALSE           TRUE
6           1.728      1.728 20.29482           FALSE           TRUE
  runny_nose_yes runny_nose_no cough_yes cough_no fever_yes fever_no
1          FALSE          TRUE     FALSE     TRUE     FALSE     TRUE
2          FALSE          TRUE     FALSE     TRUE     FALSE     TRUE
3          FALSE          TRUE     FALSE     TRUE     FALSE     TRUE
4          FALSE          TRUE     FALSE     TRUE     FALSE     TRUE
5          FALSE          TRUE     FALSE     TRUE     FALSE     TRUE
6          FALSE          TRUE     FALSE     TRUE     FALSE     TRUE
  night_sweats_yes night_sweats_no unexplained_weight_loss_yes
1            FALSE            TRUE                       FALSE
2            FALSE            TRUE                       FALSE
3            FALSE            TRUE                       FALSE
4            FALSE            TRUE                       FALSE
5            FALSE            TRUE                       FALSE
6            FALSE            TRUE                       FALSE
  unexplained_weight_loss_no currently_taking_supplements_yes
1                       TRUE                            FALSE
2                       TRUE                            FALSE
3                       TRUE                            FALSE
4                       TRUE                            FALSE
5                       TRUE                            FALSE
6                       TRUE                            FALSE
  currently_taking_supplements_no supplements currently_taking_medications_yes
1                            TRUE        <NA>                            FALSE
2                            TRUE        <NA>                            FALSE
3                            TRUE        <NA>                            FALSE
4                            TRUE        <NA>                            FALSE
5                            TRUE        <NA>                            FALSE
6                            TRUE        <NA>                            FALSE
  currently_taking_medications_no medications participant_included_yes
1                            TRUE        <NA>                     TRUE
2                            TRUE        <NA>                     TRUE
3                            TRUE        <NA>                     TRUE
4                            TRUE        <NA>                     TRUE
5                            TRUE        <NA>                     TRUE
6                            TRUE        <NA>                     TRUE
  participant_included_no if_no_reason_for_exclusion
1                   FALSE                       <NA>
2                   FALSE                       <NA>
3                   FALSE                       <NA>
4                   FALSE                       <NA>
5                   FALSE                       <NA>
6                   FALSE                       <NA>
  smoking_do_you_regularly_smoke_at_least1cigarette5daysa_week smoking_comments
1                                                        FALSE             <NA>
2                                                        FALSE             <NA>
3                                                        FALSE             <NA>
4                                                        FALSE             <NA>
5                                                        FALSE             <NA>
6                                                        FALSE             <NA>
  pathcare_sticker_barcode   req_num           serv_date vit_d_result
1                785293139 785293139 2013-02-11 10:40:00         29.5
2                785293139 785293139 2013-02-11 10:40:00         29.5
3                785293139 785293139 2013-02-11 10:40:00         29.5
4                785293139 785293139 2013-02-11 10:40:00         29.5
5                785293139 785293139 2013-02-11 10:40:00         29.5
6                785293139 785293139 2013-02-11 10:40:00         29.5
        res_date_time collection_period continued_in_study
1 2013-02-13 12:13:00            Summer               <NA>
2 2013-02-13 12:13:00            Summer               <NA>
3 2013-02-13 12:13:00            Summer               <NA>
4 2013-02-13 12:13:00            Summer               <NA>
5 2013-02-13 12:13:00            Summer               <NA>
6 2013-02-13 12:13:00            Summer               <NA>
  if_not_continued_in_study_reason body_site reflectance_metric
1                             <NA>  forehead                 e1
2                             <NA>  forehead                 e2
3                             <NA>  forehead                 e3
4                             <NA>  forehead                 m1
5                             <NA>  forehead                 m2
6                             <NA>  forehead                 m3
  reflectance_value
1             17.01
2             18.73
3             18.61
4             55.99
5             70.57
6             76.28
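Because head() only shows the first participant, it may help to see how the classification rule behaves on edge cases. The sketch below applies the same logic to a hypothetical toy data frame with one row per case: a single African/Black response, a single Coloured response, two boxes ticked, a different single response, and a missing response. The toy values are assumptions for illustration only. The conditions use the logical columns directly instead of comparing to TRUE/FALSE, and the catch-all assigns "Other" in one step, which should be equivalent to the two-step version above.

```r
library(dplyr)

# One row per edge case (hypothetical values)
toy <- data.frame(
  ethnicity_african_black = c(TRUE,  FALSE, TRUE,  FALSE, NA),
  ethnicity_coloured      = c(FALSE, TRUE,  TRUE,  FALSE, FALSE),
  ethnicity_white         = c(FALSE, FALSE, FALSE, TRUE,  FALSE),
  ethnicity_indian_asian  = c(FALSE, FALSE, FALSE, FALSE, FALSE),
  ethnicity_other         = c(FALSE, FALSE, FALSE, FALSE, FALSE)
)

toy <- toy %>%
  mutate(ethnicity = case_when(
    # Only the African/Black box is ticked
    ethnicity_african_black & !ethnicity_coloured & !ethnicity_white &
      !ethnicity_indian_asian & !ethnicity_other ~ "xhosa",
    # Only the Coloured box is ticked
    ethnicity_coloured & !ethnicity_african_black & !ethnicity_white &
      !ethnicity_indian_asian & !ethnicity_other ~ "cape_coloured",
    # Everything else, including rows where a condition evaluates to NA
    TRUE ~ "Other"
  ))

# Tabulate the result, showing NA as its own category if any remain
print(table(toy$ethnicity, useNA = "ifany"))
```

Running the same table() call on df$ethnicity is a quick way to confirm how many rows fall into each group before the column is converted to a factor.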

Convert categorical variables into factors.

df$body_site <- factor(df$body_site)
df$collection_period <- factor(df$collection_period)
df$reflectance_metric <- factor(df$reflectance_metric)
df$ethnicity <- factor(df$ethnicity)
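The four conversions above can also be written as a single dplyr step, and the level order can be set explicitly. This matters for modelling because, under R's default treatment contrasts, the first factor level is used as the reference category, and factor() orders levels alphabetically unless told otherwise. The sketch below uses a hypothetical toy data frame; choosing "xhosa" as the reference is purely illustrative and not a decision taken from this analysis.

```r
library(dplyr)

# Toy data standing in for df (hypothetical values)
toy <- data.frame(
  body_site = c("left_upper_inner_arm", "forehead", "forehead"),
  ethnicity = c("xhosa", "cape_coloured", "xhosa"),
  stringsAsFactors = FALSE
)

# Convert several character columns to factors in one step
toy <- toy %>%
  mutate(across(c(body_site, ethnicity), factor))

# Default level order is alphabetical; the first level is the reference
print(levels(toy$ethnicity))

# Set the reference level explicitly (illustrative choice only)
toy$ethnicity <- relevel(toy$ethnicity, ref = "xhosa")
print(levels(toy$ethnicity))
```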

Check structure of data frame.

str(df)

'data.frame':   15840 obs. of  55 variables:
 $ participant_centre_id                                       : chr  "VDKH001" "VDKH001" "VDKH001" "VDKH001" ...
 $ interviewer_name                                            : chr  "Betty" "Betty" "Betty" "Betty" ...
 $ today_date                                                  : chr  "2013-02-11" "2013-02-11" "2013-02-11" "2013-02-11" ...
 $ age_years                                                   : int  20 20 20 20 20 20 20 20 20 20 ...
 $ date_of_birth                                               : chr  "1993-08-07" "1993-08-07" "1993-08-07" "1993-08-07" ...
 $ gender                                                      : int  1 1 1 1 1 1 1 1 1 1 ...
 $ ethnicity_coloured                                          : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
 $ ethnicity_white                                             : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
 $ ethnicity_african_black                                     : logi  TRUE TRUE TRUE TRUE TRUE TRUE ...
 $ ethnicity_indian_asian                                      : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
 $ ethnicity_other                                             : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
 $ ethnicity_specify_other                                     : logi  NA NA NA NA NA NA ...
 $ ethnicity                                                   : Factor w/ 2 levels "cape_coloured",..: 2 2 2 2 2 2 2 2 2 2 ...
 $ refuse_to_answer                                            : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
 $ weight_measure1                                             : num  60.6 60.6 60.6 60.6 60.6 ...
 $ weight_measure2                                             : num  60.6 60.6 60.6 60.6 60.6 ...
 $ avg_weight                                                  : num  60.6 60.6 60.6 60.6 60.6 ...
 $ height_measure1                                             : num  1.73 1.73 1.73 1.73 1.73 ...
 $ height_measure2                                             : num  1.73 1.73 1.73 1.73 1.73 ...
 $ avg_height                                                  : num  1.73 1.73 1.73 1.73 1.73 ...
 $ bmi                                                         : num  20.3 20.3 20.3 20.3 20.3 ...
 $ sore_throat_yes                                             : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
 $ sore_throat_no                                              : logi  TRUE TRUE TRUE TRUE TRUE TRUE ...
 $ runny_nose_yes                                              : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
 $ runny_nose_no                                               : logi  TRUE TRUE TRUE TRUE TRUE TRUE ...
 $ cough_yes                                                   : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
 $ cough_no                                                    : logi  TRUE TRUE TRUE TRUE TRUE TRUE ...
 $ fever_yes                                                   : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
 $ fever_no                                                    : logi  TRUE TRUE TRUE TRUE TRUE TRUE ...
 $ night_sweats_yes                                            : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
 $ night_sweats_no                                             : logi  TRUE TRUE TRUE TRUE TRUE TRUE ...
 $ unexplained_weight_loss_yes                                 : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
 $ unexplained_weight_loss_no                                  : logi  TRUE TRUE TRUE TRUE TRUE TRUE ...
 $ currently_taking_supplements_yes                            : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
 $ currently_taking_supplements_no                             : logi  TRUE TRUE TRUE TRUE TRUE TRUE ...
 $ supplements                                                 : chr  NA NA NA NA ...
 $ currently_taking_medications_yes                            : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
 $ currently_taking_medications_no                             : logi  TRUE TRUE TRUE TRUE TRUE TRUE ...
 $ medications                                                 : chr  NA NA NA NA ...
 $ participant_included_yes                                    : logi  TRUE TRUE TRUE TRUE TRUE TRUE ...
 $ participant_included_no                                     : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
 $ if_no_reason_for_exclusion                                  : chr  NA NA NA NA ...
 $ smoking_do_you_regularly_smoke_at_least1cigarette5daysa_week: logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
 $ smoking_comments                                            : chr  NA NA NA NA ...
 $ pathcare_sticker_barcode                                    : int  785293139 785293139 785293139 785293139 785293139 785293139 785293139 785293139 785293139 785293139 ...
 $ req_num                                                     : int  785293139 785293139 785293139 785293139 785293139 785293139 785293139 785293139 785293139 785293139 ...
 $ serv_date                                                   : chr  "2013-02-11 10:40:00" "2013-02-11 10:40:00" "2013-02-11 10:40:00" "2013-02-11 10:40:00" ...
 $ vit_d_result                                                : num  29.5 29.5 29.5 29.5 29.5 29.5 29.5 29.5 29.5 29.5 ...
 $ res_date_time                                               : chr  "2013-02-13 12:13:00" "2013-02-13 12:13:00" "2013-02-13 12:13:00" "2013-02-13 12:13:00" ...
 $ collection_period                                           : Factor w/ 3 levels "6Weeks","Summer",..: 2 2 2 2 2 2 2 2 2 2 ...
 $ continued_in_study                                          : chr  NA NA NA NA ...
 $ if_not_continued_in_study_reason                            : chr  NA NA NA NA ...
 $ body_site                                                   : Factor w/ 3 levels "forehead","left_upper_inner_arm",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ reflectance_metric                                          : Factor w/ 24 levels "a_1","a_2","a_3",..: 10 11 12 19 20 21 22 23 24 13 ...
 $ reflectance_value                                           : num  17 18.7 18.6 56 70.6 ...

Remove columns that are irrelevant to the analysis by selecting only the relevant ones.

df_filtered <- df %>%
  select(participant_centre_id, gender, ethnicity, 
         vit_d_result, collection_period, body_site, 
         reflectance_metric, reflectance_value)
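One possible variation, sketched here on a hypothetical toy data frame, keeps the column names in a character vector and selects them with all_of(). The vector can then be reused elsewhere, and all_of() raises an error if any listed column is absent, so a renamed or dropped column is caught at this step. The `keep_cols` name is an assumption introduced for illustration.

```r
library(dplyr)

# Toy data standing in for df (hypothetical subset of columns)
toy <- data.frame(
  participant_centre_id = c("VDKH001", "VDKH001"),
  interviewer_name      = c("Betty", "Betty"),
  vit_d_result          = c(29.5, 29.5),
  reflectance_value     = c(17.01, 18.73)
)

# Columns to keep, stored once so the list can be reused
keep_cols <- c("participant_centre_id", "vit_d_result", "reflectance_value")

# all_of() errors if any name in keep_cols is missing from the data
toy_filtered <- toy %>%
  select(all_of(keep_cols))

print(names(toy_filtered))
```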

Save the filtered data frame.

write.csv(df_filtered, "data/filtered_screening_long.csv", row.names = FALSE)
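A CSV file stores only text, so the factor types set earlier are not preserved when the file is read back in a later analysis. The sketch below illustrates this round trip on a hypothetical toy data frame written to a temporary file. It assumes R 4.0 or later, where read.csv() returns character columns by default.

```r
# Toy data standing in for df_filtered (hypothetical values)
toy <- data.frame(
  participant_centre_id = c("VDKH001", "VDKH001"),
  body_site             = factor(c("forehead", "forehead")),
  reflectance_value     = c(17.01, 18.73)
)

# Write to a temporary file so the sketch does not touch data/
tmp <- tempfile(fileext = ".csv")
write.csv(toy, tmp, row.names = FALSE)

# Read it back; body_site comes back as character, not factor
toy_back <- read.csv(tmp)
print(class(toy_back$body_site))

# Re-create the factor after loading
toy_back$body_site <- factor(toy_back$body_site)
print(class(toy_back$body_site))
```

If the factor structure needs to survive between steps unchanged, saveRDS() and readRDS() are a base R alternative that keeps column types intact.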

sessionInfo()

R version 4.4.1 (2024-06-14)
Platform: aarch64-apple-darwin20
Running under: macOS Sonoma 14.6.1

Matrix products: default
BLAS:   /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRblas.0.dylib 
LAPACK: /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRlapack.dylib;  LAPACK version 3.12.0

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

time zone: America/Detroit
tzcode source: internal

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] dplyr_1.1.4

loaded via a namespace (and not attached):
 [1] jsonlite_1.8.9    compiler_4.4.1    promises_1.3.0    tidyselect_1.2.1 
 [5] Rcpp_1.0.13       stringr_1.5.1     git2r_0.33.0      later_1.3.2      
 [9] jquerylib_0.1.4   yaml_2.3.10       fastmap_1.2.0     R6_2.5.1         
[13] generics_0.1.3    workflowr_1.7.1   knitr_1.48        tibble_3.2.1     
[17] rprojroot_2.0.4   bslib_0.8.0       pillar_1.9.0      rlang_1.1.4      
[21] utf8_1.2.4        cachem_1.1.0      stringi_1.8.4     httpuv_1.6.15    
[25] xfun_0.47         fs_1.6.4          sass_0.4.9        cli_3.6.3        
[29] withr_3.0.1       magrittr_2.0.3    digest_0.6.37     rstudioapi_0.16.0
[33] lifecycle_1.0.4   vctrs_0.6.5       evaluate_1.0.0    glue_1.7.0       
[37] whisker_0.4.1     fansi_1.0.6       rmarkdown_2.28    tools_4.4.1      
[41] pkgconfig_2.0.3   htmltools_0.5.8.1

data_cleaning_3

calliquire

2024-10-07
