Last updated: 2024-10-03

Checks: 6 passed, 1 warning

Knit directory: SAPPHIRE/

This reproducible R Markdown analysis was created with workflowr (version 1.7.1). The Checks tab describes the reproducibility checks that were applied when the results were created. The Past versions tab lists the development history.


The R Markdown file has unstaged changes. To know which version of the R Markdown file created these results, you’ll want to first commit it to the Git repo. If you’re still working on the analysis, you can ignore this warning. When you’re finished, you can run wflow_publish to commit the R Markdown file and build the HTML.

Great job! The global environment was empty. Objects defined in the global environment can affect the analysis in your R Markdown file in unknown ways. For reproducibility it’s best to always run the code in an empty environment.

The command set.seed(20240923) was run prior to running the code in the R Markdown file. Setting a seed ensures that any results that rely on randomness, e.g. subsampling or permutations, are reproducible.

Great job! Recording the operating system, R version, and package versions is critical for reproducibility.

Nice! There were no cached chunks for this analysis, so you can be confident that you successfully produced the results during this run.

Great job! Using relative paths to the files within your workflowr project makes it easier to run your code on other machines.

Great! You are using Git for version control. Tracking code development and connecting the code version to the results is critical for reproducibility.

The results in this page were generated with repository version b0ffd83. See the Past versions tab to see a history of the changes made to the R Markdown and HTML files.

Note that you need to be careful to ensure that all relevant files for the analysis have been committed to Git prior to generating the results (you can use wflow_publish or wflow_git_commit). workflowr only checks the R Markdown file, but you know if there are other scripts or data files that it depends on. Below is the status of the Git repository when the results were generated:


Ignored files:
    Ignored:    .DS_Store
    Ignored:    .Rapp.history
    Ignored:    .Rhistory
    Ignored:    .Rproj.user/
    Ignored:    analysis/.DS_Store
    Ignored:    data/.DS_Store

Unstaged changes:
    Modified:   analysis/data_cleaning_1.Rmd

Note that any generated files, e.g. HTML, png, CSS, etc., are not included in this status report because it is ok for generated content to have uncommitted changes.


These are the previous versions of the repository in which changes were made to the R Markdown (analysis/data_cleaning_1.Rmd) and HTML (docs/data_cleaning_1.html) files. If you’ve configured a remote Git repository (see ?wflow_git_remote), click on the hyperlinks in the table below to view the files as they were in that past version.

File Version Author Date Message
Rmd e8e0e54 calliquire 2024-10-03 data cleaning 2

About This Analysis: The goal of this analysis is to initiate the wrangling of the original, raw data through manual editing in Google Sheets and an R pipeline, leaving us with three long-form data sets (one for each category of data).

Set Up

  1. Load the relevant packages.
library(readxl)     # read sheets from the Excel workbook
library(dplyr)      # data-manipulation verbs (also attached by tidyverse below)
library(janitor)    # clean_names() for snake_case column names
library(tidyverse)  # purrr, readr, tidyr, stringr, etc.

Prepare Your Data

  2. In Google Sheets, make a copy of the original data, save that copy as “serum_vit_D_study_with_lab_results.xlsx” and manually edit the following (a quick check of these edits is sketched under step 3 below):
  • delete floating note in ScreeningDataCollectionWinter sheet: “Note: VDKH001 had no second weight measurement so the initial measurement was used.”
  • delete floating note in ScreeningDataCollection6Weeks sheet: “Note: VDKH007 and VDKH012 had no second weight measurement so the initial measurement was used.”
  • change column DT name in ScreeningDataCollectionWinter sheet from “Result” to “VitDResult” to match the same measures in the Summer and 6 Weeks sheets
  • delete columns A and B (named ParticipantID and ParticipantCentre, respectively) so that column C becomes your unique identifier (named ParticipantCentreID) in all of the individual sheets within the workbook
  • delete floating notes in FoodFrequencySummer sheet: “Notes: VDKH016 - No information for Pilchards or Liver beef/lamb; VDTG011 - No Vit D for Marg soft.”
  • delete floating notes in FoodFrequencyWinter sheet: “Note: No amounts calculated for VDKH047 - Soft Marg - or VDKH050 - Snoek; Where two types of margerine were specified, the brand with the lower Vit D percentage was used to calculate Vit D ie VDTG023, 032,045”

Begin Wrangling in R

  3. Specify the path to the new Excel file. This .xlsx file is located in ~/GitHub/SAPPHIRE/data.
file_path <- "data/serum_vit_D_study_with_lab_results.xlsx"
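Before reading any sheets, it can help to confirm that the workbook is at the expected location and that the manual edits from step 2 took effect. A minimal sketch of such a check (the sheet and column names are taken from the steps above; adjust them if your workbook differs):

# Stop early if the workbook is not at the expected path
stopifnot(file.exists(file_path))

# List the sheets available in the workbook
excel_sheets(file_path)

# Spot-check the winter screening sheet: after the manual edits, the first
# column should be ParticipantCentreID and "Result" should now be "VitDResult"
winter_cols <- names(read_excel(file_path,
                                sheet = "ScreeningDataCollectionWinter",
                                n_max = 0))
stopifnot(winter_cols[1] == "ParticipantCentreID",
          "VitDResult" %in% winter_cols)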

Category 1: Screening Data Collection

  4. Load data from the Screening Data Collection sheets.
screening_summer <- read_excel(file_path, sheet = "ScreeningDataCollectionSummer")
screening_winter <- read_excel(file_path, sheet = "ScreeningDataCollectionWinter")
screening_6weeks <- read_excel(file_path, sheet = "ScreeningDataCollection6Weeks")
  5. Convert the column names to snake_case.
screening_summer <- screening_summer %>% clean_names()
screening_winter <- screening_winter %>% clean_names()
screening_6weeks <- screening_6weeks %>% clean_names()
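For reference, clean_names() applies janitor’s make_clean_names() to each column header, converting it to snake_case, which is why ParticipantCentreID appears as participant_centre_id in the output further down. A standalone illustration:

# Illustration only: the conversion clean_names() applies to each header
make_clean_names("ParticipantCentreID")   # "participant_centre_id"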
  6. Standardize the data types for critical columns, for example, age_years.
screening_summer <- screening_summer %>% mutate(age_years = as.numeric(age_years))
screening_winter <- screening_winter %>% mutate(age_years = as.numeric(age_years))
screening_6weeks <- screening_6weeks %>% mutate(age_years = as.numeric(age_years))
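as.numeric() turns any value it cannot parse into NA (with a coercion warning), so counting missing ages after the conversion is a quick way to flag badly formatted entries. An optional check, not part of the original pipeline:

# How many ages are missing after coercion? Repeat for the winter and 6-week frames.
screening_summer %>% summarise(missing_age = sum(is.na(age_years)))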
  7. Add a ‘collection_period’ column to each data frame to indicate when the data was collected.
screening_summer <- screening_summer %>% mutate(collection_period = "Summer")
screening_winter <- screening_winter %>% mutate(collection_period = "Winter")
screening_6weeks <- screening_6weeks %>% mutate(collection_period = "6Weeks")
  8. Combine the data frames into a long-form dataset.
screening_long <- bind_rows(screening_summer, screening_winter, screening_6weeks)
  9. View the resulting long-form dataset.
print(head(screening_long))
# A tibble: 6 × 124
  participant_centre_id interviewer_name today_date          age_years
  <chr>                 <chr>            <dttm>                  <dbl>
1 VDKH001               Betty            2013-02-11 00:00:00        20
2 VDKH002               Betty            2013-02-11 00:00:00        23
3 VDKH003               Betty            2013-02-11 00:00:00        23
4 VDKH004               Betty            2013-02-12 00:00:00        20
5 VDKH005               Betty            2013-02-12 00:00:00        19
6 VDKH006               Betty            2013-02-12 00:00:00        21
# ℹ 120 more variables: date_of_birth <dttm>, gender <dbl>,
#   ethnicity_coloured <lgl>, ethnicity_white <lgl>,
#   ethnicity_african_black <lgl>, ethnicity_indian_asian <lgl>,
#   ethnicity_other <lgl>, ethnicity_specify_other <lgl>, ethnicity <lgl>,
#   refuse_to_answer <lgl>, weight_measure1 <dbl>, weight_measure2 <dbl>,
#   avg_weight <dbl>, height_measure1 <dbl>, height_measure2 <dbl>,
#   avg_height <dbl>, bmi <dbl>, sore_throat_yes <lgl>, sore_throat_no <lgl>, …
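bind_rows() matches columns by name, so any column that appears in only one season’s sheet is filled with NA for the other seasons. An optional tally of rows per collection period confirms that all three sheets made it into the combined data:

# Rows contributed by each collection period (illustrative check)
screening_long %>% count(collection_period)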

Repeat the above steps (steps 4–9) for the sheets in the Food Frequency category and in the Sun Exposure category.
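If you would rather not copy and paste these steps for each category, the same read/clean/tag/combine sequence can be wrapped in a small helper. This is only a sketch (load_category is a hypothetical name, not used elsewhere in this analysis) and it assumes the sheets follow the CategoryName + period naming used in this workbook:

# Hypothetical helper: read each period's sheet, clean its names, tag the
# collection period, and stack the results into one long data frame
load_category <- function(file_path, sheet_prefix,
                          periods = c("Summer", "Winter", "6Weeks")) {
  purrr::map(periods, function(period) {
    read_excel(file_path, sheet = paste0(sheet_prefix, period)) %>%
      clean_names() %>%
      mutate(collection_period = period)
  }) %>%
    bind_rows()
}

# For example: food_freq_long <- load_category(file_path, "FoodFrequency")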

Category 2: Food Frequency

  10. Load data from the Food Frequency sheets.
food_freq_summer <- read_excel(file_path, sheet = "FoodFrequencySummer")
food_freq_winter <- read_excel(file_path, sheet = "FoodFrequencyWinter")
food_freq_6weeks <- read_excel(file_path, sheet = "FoodFrequency6Weeks")
  11. Convert the column names to snake_case.
food_freq_summer <- food_freq_summer %>% clean_names()
food_freq_winter <- food_freq_winter %>% clean_names()
food_freq_6weeks <- food_freq_6weeks %>% clean_names()
  12. Step 6 (standardizing the data types) is not needed for these sheets.
  13. Add a ‘collection_period’ column to each data frame to indicate when the data was collected.
food_freq_summer <- food_freq_summer %>% mutate(collection_period = "Summer")
food_freq_winter <- food_freq_winter %>% mutate(collection_period = "Winter")
food_freq_6weeks <- food_freq_6weeks %>% mutate(collection_period = "6Weeks")
  14. Combine the data frames into a long-form dataset.
food_freq_long <- bind_rows(food_freq_summer, food_freq_winter, food_freq_6weeks)
  15. View the resulting long-form dataset.
print(head(food_freq_long))
# A tibble: 6 × 201
  participant_centre_id interviewer_name today_date          sardines_no
  <chr>                 <chr>            <dttm>              <lgl>      
1 VDKH001               Betty            2013-02-11 00:00:00 TRUE       
2 VDKH002               Betty            2013-02-11 00:00:00 TRUE       
3 VDKH003               Betty            2013-02-11 00:00:00 TRUE       
4 VDKH004               Betty            2013-02-12 00:00:00 TRUE       
5 VDKH005               Betty            2013-02-12 00:00:00 TRUE       
6 VDKH006               Betty            2013-02-12 00:00:00 TRUE       
# ℹ 197 more variables: sardines_yes <lgl>, sardines_times_per_week <dbl>,
#   sardines_times_per_day <dbl>, sardines_brand_description <chr>,
#   sardines_amount_usually_eaten_per_occasion <chr>,
#   sardines_amount_eaten_per_day_as_proportion_of100g <dbl>,
#   sardines_vit_d_micrograms_per_100g <dbl>,
#   sardines_total_daily_vit_d_intake_micro_grams <dbl>, sardines_source <lgl>,
#   sardines_notes <lgl>, pilchards_no <lgl>, pilchards_yes <lgl>, …
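Each participant should appear at most once per collection period in the combined data; janitor’s get_dupes() gives a quick way to confirm that. An optional check (an empty result means no duplicated participant/period combinations):

# Any rows sharing the same participant ID and collection period?
food_freq_long %>% get_dupes(participant_centre_id, collection_period)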

Category 3: Sun Exposure

  16. Load data from the Sun Exposure sheets.
sun_expos_summer <- read_excel(file_path, sheet = "SunExposureSummer")
sun_expos_winter <- read_excel(file_path, sheet = "SunExposureWinter")
sun_expos_6weeks <- read_excel(file_path, sheet = "SunExposure6Weeks")
  17. Convert the column names to snake_case.
sun_expos_summer <- sun_expos_summer %>% clean_names()
sun_expos_winter <- sun_expos_winter %>% clean_names()
sun_expos_6weeks <- sun_expos_6weeks %>% clean_names()
  18. Step 6 (standardizing the data types) is not needed for these sheets.
  19. Add a ‘collection_period’ column to each data frame to indicate when the data was collected.
sun_expos_summer <- sun_expos_summer %>% mutate(collection_period = "Summer")
sun_expos_winter <- sun_expos_winter %>% mutate(collection_period = "Winter")
sun_expos_6weeks <- sun_expos_6weeks %>% mutate(collection_period = "6Weeks")
  20. Combine the data frames into a long-form dataset.
sun_expos_long <- bind_rows(sun_expos_summer, sun_expos_winter, sun_expos_6weeks)
  21. View the resulting long-form dataset.
print(head(sun_expos_long))
# A tibble: 6 × 29
  participant_centre_id interviewer_name today_date         
  <chr>                 <chr>            <dttm>             
1 VDKH001               Rene             2013-02-11 00:00:00
2 VDKH002               Betty            2013-02-11 00:00:00
3 VDKH003               Rene             2013-02-11 00:00:00
4 VDKH004               Rene             2013-02-12 00:00:00
5 VDKH005               Rene             2013-02-12 00:00:00
6 VDKH006               Rene             2013-02-12 00:00:00
# ℹ 26 more variables: x1time_spent_outdoors_weekday <chr>,
#   x2time_of_day_begin_of_exposure_weekday <chr>,
#   x3time_spent_outdoors_weekend_day <chr>,
#   x4time_of_day_begin_of_exposure_weekend <chr>, x5forehead_exposed <chr>,
#   x5cheek_exposed <chr>, x5hand_exposed <chr>, x5shoulder_exposed <chr>,
#   x5arm_exposed <chr>, x5back_exposed <chr>, x5chest_exposed <chr>,
#   x5thigh_exposed <chr>, x5calf_exposed <chr>, …
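The x1, x2, … prefixes in these names come from clean_names(), which prepends an "x" to any header that starts with a digit (here, the numbered questionnaire items). If those prefixes are unwanted, one option (illustrative only, and not applied in this pipeline) is to strip them with rename_with():

# Preview the column names with the leading "x<digits>" removed (nothing is assigned back)
sun_expos_long %>%
  rename_with(~ sub("^x[0-9]+", "", .x), starts_with("x")) %>%
  names() %>%
  head(10)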

Save Your Work

  22. Save your long-form datasets as .csv files in the data/ directory.
write.csv(screening_long, "data/screening_long.csv", row.names = FALSE)
write.csv(food_freq_long, "data/food_freq_long.csv", row.names = FALSE)
write.csv(sun_expos_long, "data/sun_expos_long.csv", row.names = FALSE)
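An optional round-trip check (illustrative) is to read one of the files back and confirm that the row count matches the in-memory data frame:

# TRUE if the written file has the same number of rows as screening_long
nrow(read_csv("data/screening_long.csv", show_col_types = FALSE)) == nrow(screening_long)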
  23. Manually copy these files over into the project_SAPPHIRE Google Drive.

sessionInfo()
R version 4.4.1 (2024-06-14)
Platform: aarch64-apple-darwin20
Running under: macOS Sonoma 14.6.1

Matrix products: default
BLAS:   /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRblas.0.dylib 
LAPACK: /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRlapack.dylib;  LAPACK version 3.12.0

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

time zone: America/Detroit
tzcode source: internal

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] lubridate_1.9.3 forcats_1.0.0   stringr_1.5.1   purrr_1.0.2    
 [5] readr_2.1.5     tidyr_1.3.1     tibble_3.2.1    ggplot2_3.5.1  
 [9] tidyverse_2.0.0 janitor_2.2.0   dplyr_1.1.4     readxl_1.4.3   

loaded via a namespace (and not attached):
 [1] sass_0.4.9        utf8_1.2.4        generics_0.1.3    stringi_1.8.4    
 [5] hms_1.1.3         digest_0.6.37     magrittr_2.0.3    evaluate_1.0.0   
 [9] grid_4.4.1        timechange_0.3.0  fastmap_1.2.0     cellranger_1.1.0 
[13] rprojroot_2.0.4   workflowr_1.7.1   jsonlite_1.8.9    whisker_0.4.1    
[17] promises_1.3.0    fansi_1.0.6       scales_1.3.0      jquerylib_0.1.4  
[21] cli_3.6.3         rlang_1.1.4       munsell_0.5.1     withr_3.0.1      
[25] cachem_1.1.0      yaml_2.3.10       tools_4.4.1       tzdb_0.4.0       
[29] colorspace_2.1-1  httpuv_1.6.15     vctrs_0.6.5       R6_2.5.1         
[33] lifecycle_1.0.4   git2r_0.33.0      snakecase_0.11.1  fs_1.6.4         
[37] pkgconfig_2.0.3   pillar_1.9.0      bslib_0.8.0       later_1.3.2      
[41] gtable_0.3.5      glue_1.7.0        Rcpp_1.0.13       xfun_0.47        
[45] tidyselect_1.2.1  rstudioapi_0.16.0 knitr_1.48        htmltools_0.5.8.1
[49] rmarkdown_2.28    compiler_4.4.1