ACCE Research Data and Project Management


Basic Data Hygiene

10-11 April 2019, University of Sheffield

Dr Anna Krystalli @annakrystalli


Start at the beginning

Plan your Research Data Management

  • Start early. Make an RDM plan before collecting data.

  • Anticipate data products as part of your thesis outputs

  • Think about what technologies to use

Own your data

Take initiative & responsibility. Think long term.


Data management



Spreadsheets

Extreme, but in many ways defensible


Excel: read/entry only


Databases: more robust

Stronger quality control features. Advisable for multiple contributors


Databases: benefits


Have a look at the Data Carpentry SQL for Ecology lesson


Data formats



Data formats

  • .csv: comma separated values.
  • .tsv: tab separated values.
  • .txt: no formatting specified.

More unusual formats will need instructions for use.
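
As a rough sketch, the same table can be saved in both plain-text formats from base R. The data frame `obs` and the temporary file paths are made up for illustration.

```r
# Hypothetical example data; replace with your own data frame.
obs <- data.frame(site = c("A", "B"), count = c(12L, 7L))

# Write to a temporary folder so the sketch does not touch real files.
csv_path <- file.path(tempdir(), "obs.csv")
tsv_path <- file.path(tempdir(), "obs.tsv")

# Comma separated values.
write.csv(obs, csv_path, row.names = FALSE)

# Tab separated values.
write.table(obs, tsv_path, sep = "\t", row.names = FALSE, quote = FALSE)

# Read both files back in to check they hold the same table.
from_csv <- read.csv(csv_path)
from_tsv <- read.delim(tsv_path)
```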


Ensure data is machine readable

bad


bad


good


ok

  • could help data entry
  • .csv or .tsv copy would need to be saved.

Basic quality control



Use good null values

Missing values are a fact of life

  • Usually, the best solution is to leave the cell blank
  • NA or NULL are also good options
  • NEVER use 0. Avoid numbers like -999
  • Don’t make up your own code for missing values
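
A small illustration of why numeric placeholders are risky, using made-up wing lengths: NA can be skipped explicitly, whereas -999 or 0 is silently treated as a real measurement.

```r
# Hypothetical wing lengths (mm) with one missing measurement,
# recorded three different ways.
wing_na  <- c(12.1, 11.8, NA, 12.4)
wing_999 <- c(12.1, 11.8, -999, 12.4)
wing_0   <- c(12.1, 11.8, 0, 12.4)

# NA is recognised as missing and can be dropped explicitly.
mean_na <- mean(wing_na, na.rm = TRUE)

# -999 and 0 are ordinary numbers to R, so they distort the mean
# without any warning.
mean_999 <- mean(wing_999)
mean_0   <- mean(wing_0)
```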

read.csv() utilities

  • na.strings: character vector of values to be treated as missing and replaced with NA, e.g. c("NA", "-999")
  • strip.white: Logical. If TRUE, strips leading and trailing white space from unquoted character fields
  • blank.lines.skip: Logical. If TRUE, blank lines in the input are ignored.
  • fileEncoding: if you're getting funny characters, you probably need to specify the correct encoding.
read.csv(file, na.strings = c("NA", "-999"), strip.white = TRUE,
         blank.lines.skip = TRUE, fileEncoding = "mac")
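
To see these arguments at work without needing a file, here is a minimal sketch that reads a small inline table through the text argument of read.csv(). The table contents are invented for illustration.

```r
# Hypothetical raw data: padded site code, a -999 placeholder,
# an explicit NA and an empty cell.
raw <- "site,count,mass
 A ,12,3.4
B,-999,NA
C,7,
"

df <- read.csv(text = raw,
               na.strings = c("NA", "-999", ""),  # all become NA
               strip.white = TRUE,                 # " A " becomes "A"
               blank.lines.skip = TRUE)

# Number of missing values per column after import.
missing_per_column <- colSums(is.na(df))
```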

readr::read_csv() utilities

  • na: character vector of values to be treated as missing and replaced with NA, e.g. c("", "NA", "-999")
  • trim_ws: Logical. If TRUE, strips leading and trailing white space from each field
  • col_types: Allows for column data type specification.
  • locale: controls things like the default time zone, encoding, decimal mark, big mark, and day/month names
  • skip: Number of lines to skip before reading data.
  • n_max: Maximum number of records to read.
read_csv(file, col_names = TRUE, col_types = NULL, locale = default_locale(),
         na = c("", "NA", "-999"), trim_ws = TRUE, skip = 0, n_max = Inf)
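
A minimal sketch of an explicit column specification, assuming readr is installed. The inline table and its column names are made up; I() marks the string as literal data rather than a file path.

```r
library(readr)

# Hypothetical raw data with a -999 placeholder and day/month/year dates.
raw <- "site,count,date
A,12,10/04/2019
B,-999,11/04/2019
"

df <- read_csv(
  I(raw),
  col_types = cols(
    site  = col_character(),
    count = col_integer(),
    date  = col_date(format = "%d/%m/%Y")
  ),
  na = c("", "NA", "-999"),  # -999 is converted to NA before type parsing
  trim_ws = TRUE
)
```

Stating the types up front means a column that does not parse as expected is flagged at import rather than discovered later.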

Inspect

Have a look at your data with View(df)

View(mtcars)

  • Check empty cells
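
Empty cells can also be counted rather than spotted by eye. A short sketch, using a copy of mtcars with two missing values introduced purely for illustration:

```r
# Copy of mtcars with two artificial missing values.
cars <- mtcars
cars$mpg[c(2, 5)] <- NA

# Number of missing values in each column.
missing_per_column <- colSums(is.na(cars))

# Rows that contain at least one missing value.
rows_with_missing <- cars[!complete.cases(cars), ]
```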

Print

Check your software interprets your data correctly

  • e.g. see the top few rows with head(df)
head(mtcars)
##                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
## Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1

Structure

See the structure of any object with str().

str(mtcars)
## 'data.frame': 32 obs. of 11 variables:
## $ mpg : num 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
## $ cyl : num 6 6 4 6 8 6 8 4 4 6 ...
## $ disp: num 160 160 108 258 360 ...
## $ hp : num 110 110 93 110 175 105 245 62 95 123 ...
## $ drat: num 3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
## $ wt : num 2.62 2.88 2.32 3.21 3.44 ...
## $ qsec: num 16.5 17 18.6 19.4 17 ...
## $ vs : num 0 0 1 1 0 1 0 1 1 1 ...
## $ am : num 1 1 1 0 0 0 0 0 0 0 ...
## $ gear: num 4 4 4 3 3 3 3 4 4 4 ...
## $ carb: num 4 4 1 1 2 1 4 2 2 4 ...

Summarise

  • Check the range of values (and value types) in each column matches expectation.
  • Check units of measurement are what you expect
summary(mtcars)
##       mpg             cyl             disp             hp
##  Min.   :10.40   Min.   :4.000   Min.   : 71.1   Min.   : 52.0
##  1st Qu.:15.43   1st Qu.:4.000   1st Qu.:120.8   1st Qu.: 96.5
##  Median :19.20   Median :6.000   Median :196.3   Median :123.0
##  Mean   :20.09   Mean   :6.188   Mean   :230.7   Mean   :146.7
##  3rd Qu.:22.80   3rd Qu.:8.000   3rd Qu.:326.0   3rd Qu.:180.0
##  Max.   :33.90   Max.   :8.000   Max.   :472.0   Max.   :335.0
##       drat             wt             qsec             vs
##  Min.   :2.760   Min.   :1.513   Min.   :14.50   Min.   :0.0000
##  1st Qu.:3.080   1st Qu.:2.581   1st Qu.:16.89   1st Qu.:0.0000
##  Median :3.695   Median :3.325   Median :17.71   Median :0.0000
##  Mean   :3.597   Mean   :3.217   Mean   :17.85   Mean   :0.4375
##  3rd Qu.:3.920   3rd Qu.:3.610   3rd Qu.:18.90   3rd Qu.:1.0000
##  Max.   :4.930   Max.   :5.424   Max.   :22.90   Max.   :1.0000
##        am              gear            carb
##  Min.   :0.0000   Min.   :3.000   Min.   :1.000
##  1st Qu.:0.0000   1st Qu.:3.000   1st Qu.:2.000
##  Median :0.0000   Median :4.000   Median :2.000
##  Mean   :0.4062   Mean   :3.688   Mean   :2.812
##  3rd Qu.:1.0000   3rd Qu.:4.000   3rd Qu.:4.000
##  Max.   :1.0000   Max.   :5.000   Max.   :8.000
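
The same checks can be made programmatically. A sketch in base R, where the plausibility bounds for mpg are hypothetical and would come from your own knowledge of the data:

```r
# Value type of every column.
column_types <- sapply(mtcars, class)

# Minimum (row 1) and maximum (row 2) of every column.
column_ranges <- sapply(mtcars, range)

# Hypothetical plausible range for miles per gallon.
mpg_bounds <- c(5, 50)

# Rows whose mpg falls outside the expected range; ideally none.
suspect <- mtcars[mtcars$mpg < mpg_bounds[1] | mtcars$mpg > mpg_bounds[2], ]
```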

pkg skimr

skimr provides a frictionless approach to displaying summary statistics the user can skim quickly to understand their data

install.packages("skimr")
library(skimr)
skim(mtcars)
## Skim summary statistics
## n obs: 32
## n variables: 11
##
## ── Variable type:numeric ──────────────────────────────────────────────
## variable missing complete  n   mean     sd    p0    p25    p50    p75   p100 hist
## am             0       32 32   0.41    0.5     0      0      0      1      1 ▇▁▁▁▁▁▁▆
## carb           0       32 32   2.81   1.62     1      2      2      4      8 ▆▇▂▇▁▁▁▁
## cyl            0       32 32   6.19   1.79     4      4      6      8      8 ▆▁▁▃▁▁▁▇
## disp           0       32 32 230.72 123.94  71.1 120.83  196.3    326    472 ▇▆▁▂▅▃▁▂
## drat           0       32 32    3.6   0.53  2.76   3.08    3.7   3.92   4.93 ▃▇▁▅▇▂▁▁
## gear           0       32 32   3.69   0.74     3      3      4      4      5 ▇▁▁▆▁▁▁▂
## hp             0       32 32 146.69  68.56    52   96.5    123    180    335 ▃▇▃▅▂▃▁▁
## mpg            0       32 32  20.09   6.03  10.4  15.43   19.2   22.8   33.9 ▃▇▇▇▃▂▂▂
## qsec           0       32 32  17.85   1.79  14.5  16.89  17.71   18.9   22.9 ▃▂▇▆▃▃▁▁
## vs             0       32 32   0.44    0.5     0      0      0      1      1 ▇▁▁▁▁▁▁▆
## wt             0       32 32   3.22   0.98  1.51   2.58   3.33   3.61   5.42 ▃▃▃▇▆▁▁▂

Validate

pkg assertr

The assertr package supplies a suite of functions designed to verify assumptions about data and can be used to detect data errors during analysis.

install.packages("assertr")

e.g. confirm, before further analysis, that mtcars:

  • has the columns "mpg", "vs", "am" and "wt"
  • contains more than 10 observations
  • has only positive values in the 'miles per gallon' (mpg) column
library(dplyr)
library(assertr)
mtcars %>%
  verify(has_all_names("mpg", "vs", "am", "wt")) %>%
  verify(nrow(.) > 10) %>%
  verify(mpg > 0)
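
Alongside verify(), assertr's assert() applies a predicate to selected columns. A minimal sketch, assuming assertr is installed; the particular checks are chosen only for illustration.

```r
library(assertr)

# am and vs should only ever contain 0 or 1.
checked <- assert(mtcars, in_set(0, 1), am, vs)

# mpg should be non-negative.
checked <- assert(checked, within_bounds(0, Inf), mpg)

# If a check fails, assert() stops with an error listing the offending
# rows; if all pass, the data are returned for further use.
```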

Data security



Raw data are sacrosanct


Give yourself less rope

  • It's a good idea to revoke your own write permission to the raw data file.

  • Then you can't accidentally edit it.

  • It also makes it harder to do manual edits in a moment of weakness, when you know you should just add a line to your data cleaning script.
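
One way to do this from R is Sys.chmod(). The sketch below assumes a Unix-like system (file permissions behave differently on Windows) and uses a throwaway file in a temporary folder to stand in for the raw data.

```r
# Hypothetical raw data file, created in a temporary folder.
raw_file <- file.path(tempdir(), "raw_data.csv")
write.csv(mtcars, raw_file)

# Make the file read-only for owner, group and others.
Sys.chmod(raw_file, mode = "0444")

# Inspect the resulting permissions.
file_mode <- file.mode(raw_file)

# To allow edits again, restore write permission for the owner:
# Sys.chmod(raw_file, mode = "0644")
```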

Photo by Jon Moore on Unsplash


Know your masters

  • identify the master copy of files
  • keep it safe and accessible
  • consider version control
  • consider centralising
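
If you are unsure whether a working copy still matches the master, comparing checksums is one quick test. A sketch using tools::md5sum() from base R, with made-up file names in a temporary folder:

```r
# Hypothetical master file and a working copy of it.
master  <- file.path(tempdir(), "master.csv")
working <- file.path(tempdir(), "working_copy.csv")
write.csv(mtcars, master)
file.copy(master, working, overwrite = TRUE)

# Identical files give identical checksums.
same_before <- unname(tools::md5sum(master) == tools::md5sum(working))

# Any edit to the working copy changes its checksum.
cat("accidental edit\n", file = working, append = TRUE)
same_after <- unname(tools::md5sum(master) == tools::md5sum(working))
```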

source: Pexels CC0


Avoid catastrophe

Backup: on disk

Backup: in the cloud

  • Dropbox, Google Drive etc.
  • if installed on your system, can programmatically access them through R
  • some version control

Backup: the Open Science Framework osf.io

  • version controlled
  • easily shareable
  • works with other apps (e.g. Google Drive, GitHub)
  • work on an interface with R (OSFr) is in progress

Backup: GitHub

  • most solid version control.

  • keep everything in one project folder.

  • can be problematic with really large files.


Get back home
