Basic Data Hygiene

## ACCE Research Data and Project Management

***

.bottom[
# Basic Data Hygiene
#### 10-11 April 2019, University of Sheffield
#### Dr Anna Krystalli @annakrystalli
]

---

# Start at the beginning

### **Plan your Research Data Management**

- **Start early**. Make an RDM plan before collecting data.
    - [**RDM checklist**](http://www.dcc.ac.uk/sites/default/files/documents/resource/DMP/DMP_Checklist_2013.pdf)

- Anticipate **data products** as part of your thesis **outputs**
- Think about what technologies to use

---

# Own your data

### **Take initiative & responsibility. Think long term.**

<blockquote class="twitter-tweet" data-conversation="none" data-lang="en"><p lang="en" dir="ltr">Act as though every short term study will become a long term one <a href="https://twitter.com/tomjwebb?ref_src=twsrc%5Etfw">@tomjwebb</a>. Needs to be reproducible in 3, 20, 100 yrs</p>&mdash; Oceans Initiative (@oceansresearch) <a href="https://twitter.com/oceansresearch/status/556107891610894337?ref_src=twsrc%5Etfw">January 16, 2015</a></blockquote>
<script async src="https://platform.twitter.com/widgets.js" charset="utf-8"></script>

<blockquote class="twitter-tweet" data-conversation="none" data-lang="en"><p lang="en" dir="ltr">Act as though every short term study will become a long term one <a href="https://twitter.com/tomjwebb">@tomjwebb</a>. Needs to be reproducible in 3, 20, 100 yrs</p>&mdash; oceans initiative (@oceansresearch) <a href="https://twitter.com/oceansresearch/status/556107891610894337">January 16, 2015</a></blockquote>
<script async src="//platform.twitter.com/widgets.js" charset="utf-8"></script>

---

# Data management

***

---
## Spreadsheets

### extreme but in many ways defendable

<blockquote class="twitter-tweet" data-conversation="none" data-lang="en"><p lang="en" dir="ltr"><a href="https://twitter.com/tomjwebb">@tomjwebb</a> stay away from excel at all costs?</p>&mdash; Timothée Poisot (@tpoi) <a href="https://twitter.com/tpoi/status/556107000950829056">January 16, 2015</a></blockquote>
<script async src="//platform.twitter.com/widgets.js" charset="utf-8"></script>

---

## **excel: `read/entry only`**

<blockquote class="twitter-tweet" data-conversation="none" data-cards="hidden" data-lang="en"><p lang="en" dir="ltr"><a href="https://twitter.com/tomjwebb">@tomjwebb</a> <a href="https://twitter.com/tpoi">@tpoi</a> excel is fine for data entry. Just save in plain text format like csv. Some additional tips: <a href="https://t.co/8fUv9PyVjC">pic.twitter.com/8fUv9PyVjC</a></p>&mdash; Jaime Ashander (@jaimedash) <a href="https://twitter.com/jaimedash/status/556113131932381185">January 16, 2015</a></blockquote>
<script async src="//platform.twitter.com/widgets.js" charset="utf-8"></script>

<blockquote class="twitter-tweet" data-conversation="none" data-lang="en"><p lang="en" dir="ltr"><a href="https://twitter.com/jaimedash">@jaimedash</a> just don’t let excel anywhere near dates or times.  <a href="https://twitter.com/tomjwebb">@tomjwebb</a> <a href="https://twitter.com/tpoi">@tpoi</a> <a href="https://twitter.com/larysar">@larysar</a></p>&mdash; Dave Harris (@davidjayharris) <a href="https://twitter.com/davidjayharris/status/556126474550263809">January 16, 2015</a></blockquote>
<script async src="//platform.twitter.com/widgets.js" charset="utf-8"></script>

---

## **Databases: more robust**

Stronger quality control features. Advisable for multiple contributors

<blockquote class="twitter-tweet" data-conversation="none" data-lang="en"><p lang="en" dir="ltr"><a href="https://twitter.com/tomjwebb">@tomjwebb</a> databases? <a href="https://twitter.com/swcarpentry">@swcarpentry</a> has a good course on SQLite</p>&mdash; Timothée Poisot (@tpoi) <a href="https://twitter.com/tpoi/status/556142573308608513">January 16, 2015</a></blockquote>
<script async src="//platform.twitter.com/widgets.js" charset="utf-8"></script>

<blockquote class="twitter-tweet" data-conversation="none" data-lang="en"><p lang="en" dir="ltr"><a href="https://twitter.com/tomjwebb">@tomjwebb</a> <a href="https://twitter.com/tpoi">@tpoi</a> if the data are moderately complex, or involve multiple people, best to set up a database with well designed entry form 1/2</p>&mdash; Luca Borger (@lucaborger) <a href="https://twitter.com/lucaborger/status/556226732496535552">January 16, 2015</a></blockquote>
<script async src="//platform.twitter.com/widgets.js" charset="utf-8"></script>

---

## **Databases: benefits**

<blockquote class="twitter-tweet" data-conversation="none" data-lang="en"><p lang="en" dir="ltr"><a href="https://twitter.com/tomjwebb">@tomjwebb</a> Entering via a database management system (e.g., Access, Filemaker) can make entry easier &amp; help prevent data entry errors <a href="https://twitter.com/tpoi">@tpoi</a></p>&mdash; Ethan White (@ethanwhite) <a href="https://twitter.com/ethanwhite/status/556119480493813760">January 16, 2015</a></blockquote>
<script async src="//platform.twitter.com/widgets.js" charset="utf-8"></script>

<blockquote class="twitter-tweet" data-conversation="none" data-lang="en"><p lang="en" dir="ltr"><a href="https://twitter.com/ethanwhite">@ethanwhite</a> +1 Enforcing data types, options from selection etc, just some useful things a DB gives you, if you turn them on <a href="https://twitter.com/tomjwebb">@tomjwebb</a> <a href="https://twitter.com/tpoi">@tpoi</a></p>&mdash; Gavin Simpson (@ucfagls) <a href="https://twitter.com/ucfagls/status/556120176748290048">January 16, 2015</a></blockquote>
<script async src="//platform.twitter.com/widgets.js" charset="utf-8"></script>

---

<blockquote class="twitter-tweet" data-conversation="none" data-lang="en"><p lang="en" dir="ltr"><a href="https://twitter.com/tomjwebb">@tomjwebb</a> it also prevents a lot of different bad practices. It is possible to do some of this in Excel. <a href="https://twitter.com/tpoi">@tpoi</a></p>&mdash; Ethan White (@ethanwhite) <a href="https://twitter.com/ethanwhite/status/556119826582605824">January 16, 2015</a></blockquote>
<script async src="//platform.twitter.com/widgets.js" charset="utf-8"></script>

Have a look at the Data Carpentry [**SQL for Ecology** lesson](http://www.datacarpentry.org/sql-ecology-lesson/)

---

# Data formats

***

---

## **Data formats**

- **`.csv`**: *comma* separated values. 
- **`.tsv`**: *tab* separated values.
- **`.txt`**: no formatting specified.

<blockquote class="twitter-tweet" data-conversation="none" data-lang="en"><p lang="en" dir="ltr"><a href="https://twitter.com/tomjwebb">@tomjwebb</a> It has to be interoperability/openness - can I read your data with whatever I use, without having to convert it?</p>&mdash; Paul Swaddle (@paul_swaddle) <a href="https://twitter.com/paul_swaddle/status/556148166270406656">January 16, 2015</a></blockquote>
<script async src="//platform.twitter.com/widgets.js" charset="utf-8"></script>

#### **more unusual formats will need instructions on use.**

---

## **Ensure data is machine readable**

### bad

---

### bad
<img src="assets/img/bad_xl2.png" width=600px>

---

### good 
<img src="assets/img/good_xl.png" width=600px>

---

### ok

- could help data entry
- `.csv` or `.tsv` copy would need to be saved.

---

# Basic quality control

***

---

## **Use good null values**

### Missing values are a fact of life

- Usually, best solution is to **leave blank**
- **`NA`** or **`NULL`** are also good options
- **NEVER use `0`**. Avoid numbers like **`-999`**
- Don’t make up your own code for missing values

---

## [**`read.csv()`**](http://stat.ethz.ch/R-manual/R-devel/library/utils/html/read.table.html) **utilities**

- **`na.string`:** character vector of values to be coded missing and replaced with `NA` to argument eg
- **`strip.white`:** Logical. if `TRUE` strips leading and trailing white space from unquoted character fields 
- **`blank.lines.skip`:** Logical: if `TRUE` blank lines in the input are ignored.
- **`fileEncoding`:** if you're getting funny characters, you probably need to specify the correct encoding.

```r
read.csv(file, na.strings = c("NA", "-999"), strip.white = TRUE, 
         blank.lines.skip = TRUE, fileEncoding = "mac")
```

---

## [**`readr::read_csv()`**](https://cran.r-project.org/web/packages/readr/readr.pdf) **utilities**

- **`na`:** character vector of values to be coded missing and replaced with `NA` to argument eg
- **`trim_ws`:** Logical. if `TRUE` strips leading and trailing white space from unquoted character fields 
- **`col_types`:** Allows for column data type specification. ([see more](https://cran.r-project.org/web/packages/readxl/vignettes/cell-and-column-types.html))
- **`locale`:** controls things like the default time zone, encoding, decimal mark, big mark, and day/month names
- **`skip`:** Number of lines to skip before reading data.
- **`n_max`:** Maximum number of records to read.

```r
read_csv(file, col_names = TRUE, col_types = NULL, locale = default_locale(), 
         na = c("", "NA", "-999"), trim_ws = TRUE, skip = 0, n_max = Inf)
```

---

## Inspect

#### Have a look at your data with `View(df)`

```r
View(mtcars)
```

![](assets/view_mtcars.png)

- Check **empty cells**

---
### Print

Check your **software interprets your data correctly** 
    - eg see top few rows with `head(df)`

```r
head(mtcars)
```

```
##                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
## Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1
```

---
### Structure
see structure of any object with `str()`.

```r
str(mtcars)
```

```
## 'data.frame':	32 obs. of  11 variables:
##  $ mpg : num  21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
##  $ cyl : num  6 6 4 6 8 6 8 4 4 6 ...
##  $ disp: num  160 160 108 258 360 ...
##  $ hp  : num  110 110 93 110 175 105 245 62 95 123 ...
##  $ drat: num  3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
##  $ wt  : num  2.62 2.88 2.32 3.21 3.44 ...
##  $ qsec: num  16.5 17 18.6 19.4 17 ...
##  $ vs  : num  0 0 1 1 0 1 0 1 1 1 ...
##  $ am  : num  1 1 1 0 0 0 0 0 0 0 ...
##  $ gear: num  4 4 4 3 3 3 3 4 4 4 ...
##  $ carb: num  4 4 1 1 2 1 4 2 2 4 ...
```

---

### Summarise

- Check the **range of values** (and value types) in each column matches expectation. 
- Check **units of measurement are what you expect**

```r
summary(mtcars)
```

```
##       mpg             cyl             disp             hp       
##  Min.   :10.40   Min.   :4.000   Min.   : 71.1   Min.   : 52.0  
##  1st Qu.:15.43   1st Qu.:4.000   1st Qu.:120.8   1st Qu.: 96.5  
##  Median :19.20   Median :6.000   Median :196.3   Median :123.0  
##  Mean   :20.09   Mean   :6.188   Mean   :230.7   Mean   :146.7  
##  3rd Qu.:22.80   3rd Qu.:8.000   3rd Qu.:326.0   3rd Qu.:180.0  
##  Max.   :33.90   Max.   :8.000   Max.   :472.0   Max.   :335.0  
##       drat             wt             qsec             vs        
##  Min.   :2.760   Min.   :1.513   Min.   :14.50   Min.   :0.0000  
##  1st Qu.:3.080   1st Qu.:2.581   1st Qu.:16.89   1st Qu.:0.0000  
##  Median :3.695   Median :3.325   Median :17.71   Median :0.0000  
##  Mean   :3.597   Mean   :3.217   Mean   :17.85   Mean   :0.4375  
##  3rd Qu.:3.920   3rd Qu.:3.610   3rd Qu.:18.90   3rd Qu.:1.0000  
##  Max.   :4.930   Max.   :5.424   Max.   :22.90   Max.   :1.0000  
##        am              gear            carb      
##  Min.   :0.0000   Min.   :3.000   Min.   :1.000  
##  1st Qu.:0.0000   1st Qu.:3.000   1st Qu.:2.000  
##  Median :0.0000   Median :4.000   Median :2.000  
##  Mean   :0.4062   Mean   :3.688   Mean   :2.812  
##  3rd Qu.:1.0000   3rd Qu.:4.000   3rd Qu.:4.000  
##  Max.   :1.0000   Max.   :5.000   Max.   :8.000
```

---

### pkg [`skimr`](https://github.com/ropenscilabs/skimr)

`skimr` provides a frictionless approach to displaying summary statistics the user can skim quickly to understand their data

```r
install.packages("skimr")
```

---

```r
library(skimr)
skim(mtcars)
```

```
## Skim summary statistics
##  n obs: 32 
##  n variables: 11 
## 
## ── Variable type:numeric ──────────────────────────────────────────────
##  variable missing complete  n   mean     sd    p0    p25    p50    p75
##        am       0       32 32   0.41   0.5   0      0      0      1   
##      carb       0       32 32   2.81   1.62  1      2      2      4   
##       cyl       0       32 32   6.19   1.79  4      4      6      8   
##      disp       0       32 32 230.72 123.94 71.1  120.83 196.3  326   
##      drat       0       32 32   3.6    0.53  2.76   3.08   3.7    3.92
##      gear       0       32 32   3.69   0.74  3      3      4      4   
##        hp       0       32 32 146.69  68.56 52     96.5  123    180   
##       mpg       0       32 32  20.09   6.03 10.4   15.43  19.2   22.8 
##      qsec       0       32 32  17.85   1.79 14.5   16.89  17.71  18.9 
##        vs       0       32 32   0.44   0.5   0      0      0      1   
##        wt       0       32 32   3.22   0.98  1.51   2.58   3.33   3.61
##    p100     hist
##    1    ▇▁▁▁▁▁▁▆
##    8    ▆▇▂▇▁▁▁▁
##    8    ▆▁▁▃▁▁▁▇
##  472    ▇▆▁▂▅▃▁▂
##    4.93 ▃▇▁▅▇▂▁▁
##    5    ▇▁▁▆▁▁▁▂
##  335    ▃▇▃▅▂▃▁▁
##   33.9  ▃▇▇▇▃▂▂▂
##   22.9  ▃▂▇▆▃▃▁▁
##    1    ▇▁▁▁▁▁▁▆
##    5.42 ▃▃▃▇▆▁▁▂
```

---

## Validate

### pkg [`assertr`](https://github.com/ropensci/assertr)

The `assertr` package supplies a suite of functions designed to verify assumptions about data and can be used so detect data errors during analysis.

```r
install.packages("assertr")
```

e.g confirm that `mtcars`

- has the columns "mpg", "vs", and "am"
- contains more than 10 observations
- column for 'miles per gallon' (mpg) is a positive number
before further analysis:

```r
library(dplyr)
library(assertr)
mtcars %>%
    verify(has_all_names("mpg", "vs", "am", "wt")) %>%
    verify(nrow(.) > 10) %>%
    verify(mpg > 0)
```

---

# Data security

***

---

## **Raw data are sacrosanct**

<blockquote class="twitter-tweet" data-conversation="none" data-lang="en"><p lang="en" dir="ltr"><a href="https://twitter.com/tomjwebb">@tomjwebb</a> don&#39;t, not even with a barge pole, not for one second, touch or otherwise edit the raw data files. Do any manipulations in script</p>&mdash; Gavin Simpson (@ucfagls) <a href="https://twitter.com/ucfagls/status/556107371634634755">January 16, 2015</a></blockquote>
<script async src="//platform.twitter.com/widgets.js" charset="utf-8"></script>

<blockquote class="twitter-tweet" data-conversation="none" data-lang="en"><p lang="en" dir="ltr"><a href="https://twitter.com/tomjwebb">@tomjwebb</a> <a href="https://twitter.com/srsupp">@srsupp</a> Keep one or a few good master data files (per data collection of interest), and code your formatting with good annotation.</p>&mdash; Desiree Narango (@DLNarango) <a href="https://twitter.com/DLNarango/status/556128407445323778">January 16, 2015</a></blockquote>
<script async src="//platform.twitter.com/widgets.js" charset="utf-8"></script>

---

### **Give yourself less rope**

.pull-left[
- It's a good idea to **[revoke your own write permission](https://kb.iu.edu/d/abdb) to the raw data file**.

- Then you **can't accidentally edit it**.

- It also makes it **harder to do manual edits** in a moment of weakness, when you know you should **just add a line to your data cleaning script**.
]

---

## **Know your masters**

.pull-left[
- identify the `master` copy of files
- keep it safe and and accessible
- consider version control
- consider centralising
]

<img src="assets/abstract-art-background-270456.jpg" width=400px> 
_source: Pexels CC0_
]

---

## **Avoid catastrophe**

### **Backup: on disk**

- consider using backup software like [Time Machine](https://support.apple.com/en-gb/HT201250) (mac) or [File History](http://www.thundercloud.net/infoave/new/windows-10-has-a-time-machine/) (Windows 10)

### **Backup: in the cloud**

- dropbox, googledrive etc.
- if [installed](https://tools.google.com/dlpage/drive) on your system, can programmatically access them through `R`
- some version control

<blockquote class="twitter-tweet" data-conversation="none" data-lang="en"><p lang="en" dir="ltr"><a href="https://twitter.com/tomjwebb">@tomjwebb</a> Back it up</p>&mdash; Ben Bond-Lamberty (@BenBondLamberty) <a href="https://twitter.com/BenBondLamberty/status/556120946722222080">January 16, 2015</a></blockquote>
<script async src="//platform.twitter.com/widgets.js" charset="utf-8"></script>

---

## **Backup: the Open Science Framework** [osf.io](https://osf.io/)

- version controlled
- easily shareable
- works with other apps (eg googledrive, github)
- work on an interface with R ([OSFr](https://github.com/chartgerink/osfr)) is in progress. See more [here](https://youtu.be/cnE3AcdeGVY)

---

## **Backup: Github**

- most solid version control.

- keep everything in one project folder.

- Can be problematic with really large files.

---

## Get back [home](index.html)