class: top, right, inverse ## ACCE Research Data and Project Management *** .bottom[ # Basic Data Hygiene #### 10-11 April 2019, University of Sheffield #### Dr Anna Krystalli @annakrystalli ] --- # Start at the beginning ### **Plan your Research Data Management** - **Start early**. Make an RDM plan before collecting data. - [**RDM checklist**](http://www.dcc.ac.uk/sites/default/files/documents/resource/DMP/DMP_Checklist_2013.pdf) - Anticipate **data products** as part of your thesis **outputs** - Think about what technologies to use --- # Own your data ### **Take initiative & responsibility. Think long term.** <blockquote class="twitter-tweet" data-conversation="none" data-lang="en"><p lang="en" dir="ltr">Act as though every short term study will become a long term one <a href="https://twitter.com/tomjwebb?ref_src=twsrc%5Etfw">@tomjwebb</a>. Needs to be reproducible in 3, 20, 100 yrs</p>— Oceans Initiative (@oceansresearch) <a href="https://twitter.com/oceansresearch/status/556107891610894337?ref_src=twsrc%5Etfw">January 16, 2015</a></blockquote> <script async src="https://platform.twitter.com/widgets.js" charset="utf-8"></script> <blockquote class="twitter-tweet" data-conversation="none" data-lang="en"><p lang="en" dir="ltr">Act as though every short term study will become a long term one <a href="https://twitter.com/tomjwebb">@tomjwebb</a>. Needs to be reproducible in 3, 20, 100 yrs</p>— oceans initiative (@oceansresearch) <a href="https://twitter.com/oceansresearch/status/556107891610894337">January 16, 2015</a></blockquote> <script async src="//platform.twitter.com/widgets.js" charset="utf-8"></script> --- class: top, right, inverse # Data management *** --- ## Spreadsheets ### extreme but in many ways defendable <blockquote class="twitter-tweet" data-conversation="none" data-lang="en"><p lang="en" dir="ltr"><a href="https://twitter.com/tomjwebb">@tomjwebb</a> stay away from excel at all costs?</p>— Timothée Poisot (@tpoi) <a href="https://twitter.com/tpoi/status/556107000950829056">January 16, 2015</a></blockquote> <script async src="//platform.twitter.com/widgets.js" charset="utf-8"></script> --- ## **excel: `read/entry only`** <blockquote class="twitter-tweet" data-conversation="none" data-cards="hidden" data-lang="en"><p lang="en" dir="ltr"><a href="https://twitter.com/tomjwebb">@tomjwebb</a> <a href="https://twitter.com/tpoi">@tpoi</a> excel is fine for data entry. Just save in plain text format like csv. Some additional tips: <a href="https://t.co/8fUv9PyVjC">pic.twitter.com/8fUv9PyVjC</a></p>— Jaime Ashander (@jaimedash) <a href="https://twitter.com/jaimedash/status/556113131932381185">January 16, 2015</a></blockquote> <script async src="//platform.twitter.com/widgets.js" charset="utf-8"></script> <blockquote class="twitter-tweet" data-conversation="none" data-lang="en"><p lang="en" dir="ltr"><a href="https://twitter.com/jaimedash">@jaimedash</a> just don’t let excel anywhere near dates or times. <a href="https://twitter.com/tomjwebb">@tomjwebb</a> <a href="https://twitter.com/tpoi">@tpoi</a> <a href="https://twitter.com/larysar">@larysar</a></p>— Dave Harris (@davidjayharris) <a href="https://twitter.com/davidjayharris/status/556126474550263809">January 16, 2015</a></blockquote> <script async src="//platform.twitter.com/widgets.js" charset="utf-8"></script> --- ## **Databases: more robust** Stronger quality control features. Advisable for multiple contributors <blockquote class="twitter-tweet" data-conversation="none" data-lang="en"><p lang="en" dir="ltr"><a href="https://twitter.com/tomjwebb">@tomjwebb</a> databases? <a href="https://twitter.com/swcarpentry">@swcarpentry</a> has a good course on SQLite</p>— Timothée Poisot (@tpoi) <a href="https://twitter.com/tpoi/status/556142573308608513">January 16, 2015</a></blockquote> <script async src="//platform.twitter.com/widgets.js" charset="utf-8"></script> <blockquote class="twitter-tweet" data-conversation="none" data-lang="en"><p lang="en" dir="ltr"><a href="https://twitter.com/tomjwebb">@tomjwebb</a> <a href="https://twitter.com/tpoi">@tpoi</a> if the data are moderately complex, or involve multiple people, best to set up a database with well designed entry form 1/2</p>— Luca Borger (@lucaborger) <a href="https://twitter.com/lucaborger/status/556226732496535552">January 16, 2015</a></blockquote> <script async src="//platform.twitter.com/widgets.js" charset="utf-8"></script> --- ## **Databases: benefits** <blockquote class="twitter-tweet" data-conversation="none" data-lang="en"><p lang="en" dir="ltr"><a href="https://twitter.com/tomjwebb">@tomjwebb</a> Entering via a database management system (e.g., Access, Filemaker) can make entry easier & help prevent data entry errors <a href="https://twitter.com/tpoi">@tpoi</a></p>— Ethan White (@ethanwhite) <a href="https://twitter.com/ethanwhite/status/556119480493813760">January 16, 2015</a></blockquote> <script async src="//platform.twitter.com/widgets.js" charset="utf-8"></script> <blockquote class="twitter-tweet" data-conversation="none" data-lang="en"><p lang="en" dir="ltr"><a href="https://twitter.com/ethanwhite">@ethanwhite</a> +1 Enforcing data types, options from selection etc, just some useful things a DB gives you, if you turn them on <a href="https://twitter.com/tomjwebb">@tomjwebb</a> <a href="https://twitter.com/tpoi">@tpoi</a></p>— Gavin Simpson (@ucfagls) <a href="https://twitter.com/ucfagls/status/556120176748290048">January 16, 2015</a></blockquote> <script async src="//platform.twitter.com/widgets.js" charset="utf-8"></script> --- <blockquote class="twitter-tweet" data-conversation="none" data-lang="en"><p lang="en" dir="ltr"><a href="https://twitter.com/tomjwebb">@tomjwebb</a> it also prevents a lot of different bad practices. It is possible to do some of this in Excel. <a href="https://twitter.com/tpoi">@tpoi</a></p>— Ethan White (@ethanwhite) <a href="https://twitter.com/ethanwhite/status/556119826582605824">January 16, 2015</a></blockquote> <script async src="//platform.twitter.com/widgets.js" charset="utf-8"></script> Have a look at the Data Carpentry [**SQL for Ecology** lesson](http://www.datacarpentry.org/sql-ecology-lesson/) --- class: top, right, inverse # Data formats *** --- ## **Data formats** - **`.csv`**: *comma* separated values. - **`.tsv`**: *tab* separated values. - **`.txt`**: no formatting specified. <blockquote class="twitter-tweet" data-conversation="none" data-lang="en"><p lang="en" dir="ltr"><a href="https://twitter.com/tomjwebb">@tomjwebb</a> It has to be interoperability/openness - can I read your data with whatever I use, without having to convert it?</p>— Paul Swaddle (@paul_swaddle) <a href="https://twitter.com/paul_swaddle/status/556148166270406656">January 16, 2015</a></blockquote> <script async src="//platform.twitter.com/widgets.js" charset="utf-8"></script> #### **more unusual formats will need instructions on use.** --- ## **Ensure data is machine readable** ### bad <img src="assets/img/bad_xl1.png" width=600px> --- ### bad <img src="assets/img/bad_xl2.png" width=600px> --- ### good <img src="assets/img/good_xl.png" width=600px> --- ### ok <img src="assets/img/ok_xl.png" width=600px> - could help data entry - `.csv` or `.tsv` copy would need to be saved. --- class: top, right, inverse # Basic quality control *** --- ## **Use good null values** ### Missing values are a fact of life - Usually, best solution is to **leave blank** - **`NA`** or **`NULL`** are also good options - **NEVER use `0`**. Avoid numbers like **`-999`** - Don’t make up your own code for missing values --- ## [**`read.csv()`**](http://stat.ethz.ch/R-manual/R-devel/library/utils/html/read.table.html) **utilities** - **`na.string`:** character vector of values to be coded missing and replaced with `NA` to argument eg - **`strip.white`:** Logical. if `TRUE` strips leading and trailing white space from unquoted character fields - **`blank.lines.skip`:** Logical: if `TRUE` blank lines in the input are ignored. - **`fileEncoding`:** if you're getting funny characters, you probably need to specify the correct encoding. ```r read.csv(file, na.strings = c("NA", "-999"), strip.white = TRUE, blank.lines.skip = TRUE, fileEncoding = "mac") ``` --- ## [**`readr::read_csv()`**](https://cran.r-project.org/web/packages/readr/readr.pdf) **utilities** - **`na`:** character vector of values to be coded missing and replaced with `NA` to argument eg - **`trim_ws`:** Logical. if `TRUE` strips leading and trailing white space from unquoted character fields - **`col_types`:** Allows for column data type specification. ([see more](https://cran.r-project.org/web/packages/readxl/vignettes/cell-and-column-types.html)) - **`locale`:** controls things like the default time zone, encoding, decimal mark, big mark, and day/month names - **`skip`:** Number of lines to skip before reading data. - **`n_max`:** Maximum number of records to read. ```r read_csv(file, col_names = TRUE, col_types = NULL, locale = default_locale(), na = c("", "NA", "-999"), trim_ws = TRUE, skip = 0, n_max = Inf) ``` --- ## Inspect #### Have a look at your data with `View(df)` ```r View(mtcars) ``` ![](assets/view_mtcars.png) - Check **empty cells** --- ### Print Check your **software interprets your data correctly** - eg see top few rows with `head(df)` ```r head(mtcars) ``` ``` ## mpg cyl disp hp drat wt qsec vs am gear carb ## Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4 ## Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4 ## Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1 ## Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1 ## Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2 ## Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1 ``` --- ### Structure see structure of any object with `str()`. ```r str(mtcars) ``` ``` ## 'data.frame': 32 obs. of 11 variables: ## $ mpg : num 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ... ## $ cyl : num 6 6 4 6 8 6 8 4 4 6 ... ## $ disp: num 160 160 108 258 360 ... ## $ hp : num 110 110 93 110 175 105 245 62 95 123 ... ## $ drat: num 3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ... ## $ wt : num 2.62 2.88 2.32 3.21 3.44 ... ## $ qsec: num 16.5 17 18.6 19.4 17 ... ## $ vs : num 0 0 1 1 0 1 0 1 1 1 ... ## $ am : num 1 1 1 0 0 0 0 0 0 0 ... ## $ gear: num 4 4 4 3 3 3 3 4 4 4 ... ## $ carb: num 4 4 1 1 2 1 4 2 2 4 ... ``` --- ### Summarise - Check the **range of values** (and value types) in each column matches expectation. - Check **units of measurement are what you expect** ```r summary(mtcars) ``` ``` ## mpg cyl disp hp ## Min. :10.40 Min. :4.000 Min. : 71.1 Min. : 52.0 ## 1st Qu.:15.43 1st Qu.:4.000 1st Qu.:120.8 1st Qu.: 96.5 ## Median :19.20 Median :6.000 Median :196.3 Median :123.0 ## Mean :20.09 Mean :6.188 Mean :230.7 Mean :146.7 ## 3rd Qu.:22.80 3rd Qu.:8.000 3rd Qu.:326.0 3rd Qu.:180.0 ## Max. :33.90 Max. :8.000 Max. :472.0 Max. :335.0 ## drat wt qsec vs ## Min. :2.760 Min. :1.513 Min. :14.50 Min. :0.0000 ## 1st Qu.:3.080 1st Qu.:2.581 1st Qu.:16.89 1st Qu.:0.0000 ## Median :3.695 Median :3.325 Median :17.71 Median :0.0000 ## Mean :3.597 Mean :3.217 Mean :17.85 Mean :0.4375 ## 3rd Qu.:3.920 3rd Qu.:3.610 3rd Qu.:18.90 3rd Qu.:1.0000 ## Max. :4.930 Max. :5.424 Max. :22.90 Max. :1.0000 ## am gear carb ## Min. :0.0000 Min. :3.000 Min. :1.000 ## 1st Qu.:0.0000 1st Qu.:3.000 1st Qu.:2.000 ## Median :0.0000 Median :4.000 Median :2.000 ## Mean :0.4062 Mean :3.688 Mean :2.812 ## 3rd Qu.:1.0000 3rd Qu.:4.000 3rd Qu.:4.000 ## Max. :1.0000 Max. :5.000 Max. :8.000 ``` --- ### pkg [`skimr`](https://github.com/ropenscilabs/skimr) `skimr` provides a frictionless approach to displaying summary statistics the user can skim quickly to understand their data ```r install.packages("skimr") ``` --- ```r library(skimr) skim(mtcars) ``` ``` ## Skim summary statistics ## n obs: 32 ## n variables: 11 ## ## ── Variable type:numeric ────────────────────────────────────────────── ## variable missing complete n mean sd p0 p25 p50 p75 ## am 0 32 32 0.41 0.5 0 0 0 1 ## carb 0 32 32 2.81 1.62 1 2 2 4 ## cyl 0 32 32 6.19 1.79 4 4 6 8 ## disp 0 32 32 230.72 123.94 71.1 120.83 196.3 326 ## drat 0 32 32 3.6 0.53 2.76 3.08 3.7 3.92 ## gear 0 32 32 3.69 0.74 3 3 4 4 ## hp 0 32 32 146.69 68.56 52 96.5 123 180 ## mpg 0 32 32 20.09 6.03 10.4 15.43 19.2 22.8 ## qsec 0 32 32 17.85 1.79 14.5 16.89 17.71 18.9 ## vs 0 32 32 0.44 0.5 0 0 0 1 ## wt 0 32 32 3.22 0.98 1.51 2.58 3.33 3.61 ## p100 hist ## 1 ▇▁▁▁▁▁▁▆ ## 8 ▆▇▂▇▁▁▁▁ ## 8 ▆▁▁▃▁▁▁▇ ## 472 ▇▆▁▂▅▃▁▂ ## 4.93 ▃▇▁▅▇▂▁▁ ## 5 ▇▁▁▆▁▁▁▂ ## 335 ▃▇▃▅▂▃▁▁ ## 33.9 ▃▇▇▇▃▂▂▂ ## 22.9 ▃▂▇▆▃▃▁▁ ## 1 ▇▁▁▁▁▁▁▆ ## 5.42 ▃▃▃▇▆▁▁▂ ``` --- ## Validate ### pkg [`assertr`](https://github.com/ropensci/assertr) The `assertr` package supplies a suite of functions designed to verify assumptions about data and can be used so detect data errors during analysis. ```r install.packages("assertr") ``` e.g confirm that `mtcars` - has the columns "mpg", "vs", and "am" - contains more than 10 observations - column for 'miles per gallon' (mpg) is a positive number before further analysis: ```r library(dplyr) library(assertr) mtcars %>% verify(has_all_names("mpg", "vs", "am", "wt")) %>% verify(nrow(.) > 10) %>% verify(mpg > 0) ``` --- class: top, right, inverse # Data security *** --- ## **Raw data are sacrosanct** <blockquote class="twitter-tweet" data-conversation="none" data-lang="en"><p lang="en" dir="ltr"><a href="https://twitter.com/tomjwebb">@tomjwebb</a> don't, not even with a barge pole, not for one second, touch or otherwise edit the raw data files. Do any manipulations in script</p>— Gavin Simpson (@ucfagls) <a href="https://twitter.com/ucfagls/status/556107371634634755">January 16, 2015</a></blockquote> <script async src="//platform.twitter.com/widgets.js" charset="utf-8"></script> <blockquote class="twitter-tweet" data-conversation="none" data-lang="en"><p lang="en" dir="ltr"><a href="https://twitter.com/tomjwebb">@tomjwebb</a> <a href="https://twitter.com/srsupp">@srsupp</a> Keep one or a few good master data files (per data collection of interest), and code your formatting with good annotation.</p>— Desiree Narango (@DLNarango) <a href="https://twitter.com/DLNarango/status/556128407445323778">January 16, 2015</a></blockquote> <script async src="//platform.twitter.com/widgets.js" charset="utf-8"></script> --- ### **Give yourself less rope** .pull-left[ - It's a good idea to **[revoke your own write permission](https://kb.iu.edu/d/abdb) to the raw data file**. - Then you **can't accidentally edit it**. - It also makes it **harder to do manual edits** in a moment of weakness, when you know you should **just add a line to your data cleaning script**. ] .pull-right[ ![](assets/jon-moore-399469-unsplash.jpg) _Photo by Jon Moore on Unsplash_ ] --- ## **Know your masters** .pull-left[ - identify the `master` copy of files - keep it safe and and accessible - consider version control - consider centralising ] .pull-right[ <img src="assets/abstract-art-background-270456.jpg" width=400px> _source: Pexels CC0_ ] --- ## **Avoid catastrophe** ### **Backup: on disk** - consider using backup software like [Time Machine](https://support.apple.com/en-gb/HT201250) (mac) or [File History](http://www.thundercloud.net/infoave/new/windows-10-has-a-time-machine/) (Windows 10) ### **Backup: in the cloud** - dropbox, googledrive etc. - if [installed](https://tools.google.com/dlpage/drive) on your system, can programmatically access them through `R` - some version control <blockquote class="twitter-tweet" data-conversation="none" data-lang="en"><p lang="en" dir="ltr"><a href="https://twitter.com/tomjwebb">@tomjwebb</a> Back it up</p>— Ben Bond-Lamberty (@BenBondLamberty) <a href="https://twitter.com/BenBondLamberty/status/556120946722222080">January 16, 2015</a></blockquote> <script async src="//platform.twitter.com/widgets.js" charset="utf-8"></script> --- ## **Backup: the Open Science Framework** [osf.io](https://osf.io/) - version controlled - easily shareable - works with other apps (eg googledrive, github) - work on an interface with R ([OSFr](https://github.com/chartgerink/osfr)) is in progress. See more [here](https://youtu.be/cnE3AcdeGVY) --- ## **Backup: Github** - most solid version control. - keep everything in one project folder. - Can be problematic with really large files. --- ## Get back [home](index.html)