Start early. Make an RDM plan before collecting data.
Anticipate data products as part of your thesis outputs
Act as though every short term study will become a long term one @tomjwebb. Needs to be reproducible in 3, 20, 100 yrs
— Oceans Initiative (@oceansresearch) January 16, 2015
@tomjwebb stay away from excel at all costs?
— Timothée Poisot (@tpoi) January 16, 2015
read/entry only
@tomjwebb @tpoi excel is fine for data entry. Just save in plain text format like csv. Some additional tips: pic.twitter.com/8fUv9PyVjC
— Jaime Ashander (@jaimedash) January 16, 2015
@jaimedash just don’t let excel anywhere near dates or times. @tomjwebb @tpoi @larysar
— Dave Harris (@davidjayharris) January 16, 2015
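One way to keep dates out of a spreadsheet's hands is to store them as plain text and parse them explicitly in a script. The sketch below assumes ISO 8601 dates (YYYY-MM-DD); the file and the column names (`site`, `sample_date`, `count`) are hypothetical.

```r
# Hypothetical example data written to a temporary csv file
csv_path <- tempfile(fileext = ".csv")
writeLines(c("site,sample_date,count",
             "A,2015-01-16,12",
             "B,2015-02-03,7"), csv_path)

# Read the date column as plain text so nothing is converted silently
obs <- read.csv(csv_path, colClasses = c(sample_date = "character"))

# Parse the dates explicitly, stating the expected format
obs$sample_date <- as.Date(obs$sample_date, format = "%Y-%m-%d")

# Any value that did not match the format becomes NA, so check for that
stopifnot(!anyNA(obs$sample_date))
```

Stating the format in the script means a malformed date fails loudly instead of being reinterpreted.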
Stronger quality control features. Advisable for multiple contributors
@tomjwebb databases? @swcarpentry has a good course on SQLite
— Timothée Poisot (@tpoi) January 16, 2015
@tomjwebb @tpoi if the data are moderately complex, or involve multiple people, best to set up a database with well designed entry form 1/2
— Luca Borger (@lucaborger) January 16, 2015
@tomjwebb Entering via a database management system (e.g., Access, Filemaker) can make entry easier & help prevent data entry errors @tpoi
— Ethan White (@ethanwhite) January 16, 2015
@ethanwhite +1 Enforcing data types, options from selection etc, just some useful things a DB gives you, if you turn them on @tomjwebb @tpoi
— Gavin Simpson (@ucfagls) January 16, 2015
@tomjwebb it also prevents a lot of different bad practices. It is possible to do some of this in Excel. @tpoi
— Ethan White (@ethanwhite) January 16, 2015
Have a look at the Data Carpentry SQL for Ecology lesson
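As a rough illustration of the quality control a database can add, the sketch below uses the DBI and RSQLite packages (assumed to be installed) with an in-memory SQLite database. The table and column names are hypothetical.

```r
library(DBI)

# In-memory database, so nothing is written to disk
con <- dbConnect(RSQLite::SQLite(), ":memory:")

# Constraints are declared once and enforced on every insert
dbExecute(con, "
  CREATE TABLE observations (
    id      INTEGER PRIMARY KEY,
    species TEXT    NOT NULL,
    count   INTEGER NOT NULL CHECK (count >= 0)
  )")

# A valid record is accepted
dbExecute(con, "INSERT INTO observations (species, count)
                VALUES ('Parus major', 3)")

# A negative count violates the CHECK constraint and raises an error
bad_insert <- tryCatch({
  dbExecute(con, "INSERT INTO observations (species, count)
                  VALUES ('Parus major', -5)")
  "accepted"
}, error = function(e) "rejected")

n_rows <- dbGetQuery(con, "SELECT COUNT(*) AS n FROM observations")$n
dbDisconnect(con)
```

Only the valid record is stored; the invalid one is refused at entry time rather than discovered during analysis.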
.csv: comma separated values.
.tsv: tab separated values.
.txt: no formatting specified.
@tomjwebb It has to be interoperability/openness - can I read your data with whatever I use, without having to convert it?
— Paul Swaddle (@paul_swaddle) January 16, 2015
.csv or .tsv copy would need to be saved.
NA or NULL are also good options
0. Avoid numbers like -999
read.csv() utilities
na.strings: character vector of values to be coded missing and replaced with NA
strip.white: Logical. If TRUE, strips leading and trailing white space from unquoted character fields
blank.lines.skip: Logical. If TRUE, blank lines in the input are ignored.
fileEncoding: if you're getting funny characters, you probably need to specify the correct encoding.
read.csv(file, na.strings = c("NA", "-999"), strip.white = TRUE, blank.lines.skip = TRUE, fileEncoding = "mac")
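A small worked example of these arguments, using a hypothetical file written to a temporary location. It includes padded text, a blank line, and -999 used as a missing value code. `fileEncoding` is left out here because the right value depends on where the file came from.

```r
# Hypothetical messy input: padded site name, a blank line, -999 as missing
raw_path <- tempfile(fileext = ".csv")
writeLines(c("site,temp,count",
             " A ,12.5,3",
             "",
             "B,-999,NA"), raw_path)

df <- read.csv(raw_path,
               na.strings = c("NA", "-999"),  # both codes become NA
               strip.white = TRUE,            # " A " is read as "A"
               blank.lines.skip = TRUE)       # the empty line is ignored

# Count the missing values per column to confirm the codes were recognised
colSums(is.na(df))
```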
readr::read_csv() utilities
na: character vector of values to be coded missing and replaced with NA
trim_ws: Logical. If TRUE, strips leading and trailing white space from unquoted character fields
col_types: Allows for column data type specification.
locale: controls things like the default time zone, encoding, decimal mark, big mark, and day/month names
skip: Number of lines to skip before reading data.
n_max: Maximum number of records to read.
read_csv(file, col_names = TRUE, col_types = NULL, locale = default_locale(), na = c("", "NA", "-999"), trim_ws = TRUE, skip = 0, n_max = Inf)
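With `col_types = NULL` readr guesses the column types. A sketch of stating them explicitly instead, assuming readr is installed; the file and the column names (`site`, `count`, `depth_m`) are hypothetical.

```r
library(readr)

# Hypothetical survey file written to a temporary location
survey_path <- tempfile(fileext = ".csv")
writeLines(c("site,count,depth_m",
             "A,4,1.5",
             "B,-999,2.0"), survey_path)

surveys <- read_csv(
  survey_path,
  # Declare the expected type of each column instead of relying on guessing
  col_types = cols(
    site    = col_character(),
    count   = col_integer(),
    depth_m = col_double()
  ),
  na = c("", "NA", "-999"),
  trim_ws = TRUE
)
```

Declaring the types documents what the data should look like, and values that do not fit are reported as parsing problems rather than silently changing a column's type.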
View(df)
View(mtcars)
Check your software interprets your data correctly
- eg see top few rows with `head(df)`
head(mtcars)
##                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
## Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1
see structure of any object with str().
str(mtcars)
## 'data.frame': 32 obs. of 11 variables:
##  $ mpg : num 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
##  $ cyl : num 6 6 4 6 8 6 8 4 4 6 ...
##  $ disp: num 160 160 108 258 360 ...
##  $ hp  : num 110 110 93 110 175 105 245 62 95 123 ...
##  $ drat: num 3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
##  $ wt  : num 2.62 2.88 2.32 3.21 3.44 ...
##  $ qsec: num 16.5 17 18.6 19.4 17 ...
##  $ vs  : num 0 0 1 1 0 1 0 1 1 1 ...
##  $ am  : num 1 1 1 0 0 0 0 0 0 0 ...
##  $ gear: num 4 4 4 3 3 3 3 4 4 4 ...
##  $ carb: num 4 4 1 1 2 1 4 2 2 4 ...
summary(mtcars)
##       mpg             cyl             disp             hp
##  Min.   :10.40   Min.   :4.000   Min.   : 71.1   Min.   : 52.0
##  1st Qu.:15.43   1st Qu.:4.000   1st Qu.:120.8   1st Qu.: 96.5
##  Median :19.20   Median :6.000   Median :196.3   Median :123.0
##  Mean   :20.09   Mean   :6.188   Mean   :230.7   Mean   :146.7
##  3rd Qu.:22.80   3rd Qu.:8.000   3rd Qu.:326.0   3rd Qu.:180.0
##  Max.   :33.90   Max.   :8.000   Max.   :472.0   Max.   :335.0
##       drat             wt             qsec             vs
##  Min.   :2.760   Min.   :1.513   Min.   :14.50   Min.   :0.0000
##  1st Qu.:3.080   1st Qu.:2.581   1st Qu.:16.89   1st Qu.:0.0000
##  Median :3.695   Median :3.325   Median :17.71   Median :0.0000
##  Mean   :3.597   Mean   :3.217   Mean   :17.85   Mean   :0.4375
##  3rd Qu.:3.920   3rd Qu.:3.610   3rd Qu.:18.90   3rd Qu.:1.0000
##  Max.   :4.930   Max.   :5.424   Max.   :22.90   Max.   :1.0000
##        am              gear            carb
##  Min.   :0.0000   Min.   :3.000   Min.   :1.000
##  1st Qu.:0.0000   1st Qu.:3.000   1st Qu.:2.000
##  Median :0.0000   Median :4.000   Median :2.000
##  Mean   :0.4062   Mean   :3.688   Mean   :2.812
##  3rd Qu.:1.0000   3rd Qu.:4.000   3rd Qu.:4.000
##  Max.   :1.0000   Max.   :5.000   Max.   :8.000
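The same checks can be made programmatic, so that a script stops when the data are not interpreted as expected. A minimal base R sketch on mtcars; the expectations (all columns numeric, no missing values) are assumptions chosen for this dataset.

```r
# Class of every column, as a named character vector
col_classes <- vapply(mtcars, function(x) class(x)[1], character(1))

# Number of missing values in every column
missing_per_col <- colSums(is.na(mtcars))

# Stop with an error if the data were not read in as expected
stopifnot(
  all(col_classes == "numeric"),
  all(missing_per_col == 0)
)
```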
skimr
skimr provides a frictionless approach to displaying summary statistics the user can skim quickly to understand their data
install.packages("skimr")
library(skimr)
skim(mtcars)
## Skim summary statistics
##  n obs: 32
##  n variables: 11
##
## ── Variable type:numeric ──────────────────────────────────────────────
##  variable missing complete  n   mean     sd    p0    p25    p50    p75
##        am       0       32 32   0.41   0.5   0      0      0      1
##      carb       0       32 32   2.81   1.62  1      2      2      4
##       cyl       0       32 32   6.19   1.79  4      4      6      8
##      disp       0       32 32 230.72 123.94 71.1  120.83 196.3  326
##      drat       0       32 32   3.6    0.53  2.76   3.08   3.7    3.92
##      gear       0       32 32   3.69   0.74  3      3      4      4
##        hp       0       32 32 146.69  68.56 52     96.5  123    180
##       mpg       0       32 32  20.09   6.03 10.4   15.43  19.2   22.8
##      qsec       0       32 32  17.85   1.79 14.5   16.89  17.71  18.9
##        vs       0       32 32   0.44   0.5   0      0      0      1
##        wt       0       32 32   3.22   0.98  1.51   2.58   3.33   3.61
##    p100     hist
##    1    ▇▁▁▁▁▁▁▆
##    8    ▆▇▂▇▁▁▁▁
##    8    ▆▁▁▃▁▁▁▇
##  472    ▇▆▁▂▅▃▁▂
##    4.93 ▃▇▁▅▇▂▁▁
##    5    ▇▁▁▆▁▁▁▂
##  335    ▃▇▃▅▂▃▁▁
##   33.9  ▃▇▇▇▃▂▂▂
##   22.9  ▃▂▇▆▃▃▁▁
##    1    ▇▁▁▁▁▁▁▆
##    5.42 ▃▃▃▇▆▁▁▂
assertr
The assertr package supplies a suite of functions designed to verify assumptions about data and can be used to detect data errors during analysis.
install.packages("assertr")
e.g. confirm that mtcars has the columns mpg, vs, am and wt, contains more than 10 rows, and has only positive mpg values:
library(dplyr)
library(assertr)
mtcars %>%
  verify(has_all_names("mpg", "vs", "am", "wt")) %>%
  verify(nrow(.) > 10) %>%
  verify(mpg > 0)
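Where assertr is not available, roughly the same checks can be approximated with base R's `stopifnot()`. This is a sketch of an alternative, not how assertr works internally; the `bad` data frame is a deliberately corrupted copy made up for illustration.

```r
required_cols <- c("mpg", "vs", "am", "wt")

# Each condition must be TRUE, otherwise the script stops with an error
stopifnot(
  all(required_cols %in% names(mtcars)),
  nrow(mtcars) > 10,
  all(mtcars$mpg > 0)
)

# Hypothetical corrupted copy to show what a failed check looks like
bad <- mtcars
bad$mpg[1] <- -1

check_result <- tryCatch({
  stopifnot(all(bad$mpg > 0))
  "passed"
}, error = function(e) "failed")
```

Putting such checks at the top of an analysis script means a bad data file stops the run before any results are produced.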
@tomjwebb don't, not even with a barge pole, not for one second, touch or otherwise edit the raw data files. Do any manipulations in script
— Gavin Simpson (@ucfagls) January 16, 2015
@tomjwebb @srsupp Keep one or a few good master data files (per data collection of interest), and code your formatting with good annotation.
— Desiree Narango (@DLNarango) January 16, 2015
It's a good idea to revoke your own write permission to the raw data file.
Then you can't accidentally edit it.
It also makes it harder to do manual edits in a moment of weakness, when you know you should just add a line to your data cleaning script.
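On a Unix-like system this can be done from R with `Sys.chmod()`. The sketch below uses a hypothetical raw data file in a temporary location; on Windows file permissions behave differently, so treat this as an assumption about the platform.

```r
# Hypothetical raw data file
raw_path <- tempfile(fileext = ".csv")
writeLines(c("id,value", "1,10", "2,20"), raw_path)

# Make the file read-only for owner, group and others (mode 0444)
Sys.chmod(raw_path, mode = "0444", use_umask = FALSE)

# Inspect the resulting permission bits
file_mode <- file.info(raw_path)$mode
format(file_mode)

# Reading still works as before
raw <- read.csv(raw_path)
```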
Photo by Jon Moore on Unsplash
master copy of files
source: Pexels CC0
R
@tomjwebb Back it up
— Ben Bond-Lamberty (@BenBondLamberty) January 16, 2015
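A backup is only useful if the copy is intact. A minimal sketch of copying a file and verifying it with a checksum; the directories and file name are hypothetical and live in a temporary location here, whereas a real backup would go to a different disk or service.

```r
# Hypothetical project and backup locations
project_dir <- tempfile("project")
backup_dir  <- tempfile("backup")
dir.create(project_dir)
dir.create(backup_dir)

raw_path <- file.path(project_dir, "raw_data.csv")
writeLines(c("id,value", "1,10", "2,20"), raw_path)

# Copy without overwriting an existing backup of the same name
backup_path <- file.path(backup_dir, basename(raw_path))
copied <- file.copy(raw_path, backup_path, overwrite = FALSE)

# Compare MD5 checksums of the original and the copy
checksums <- tools::md5sum(c(raw_path, backup_path))
stopifnot(copied, checksums[[1]] == checksums[[2]])
```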
most solid version control.
keep everything in one project folder.
Can be problematic with really large files.