ACCE Research Data and Project Management


10-11 April 2019, University of Sheffield

Dr Anna Krystalli @annakrystalli

You got data. Is it enough?

#otherpeoplesdata dream match!

Thought experiment: Imagine a dream open data set

How would you locate it?

  • what details would you need to know to determine relevance?
  • what information would you need to know to use it?

metadata = data about data

"Information that describes, explains, locates, or in some way makes it easier to find, access, and use a resource (in this case, data)."

Backbone of digital curation

Without it, a digital resource may be irretrievable, unidentifiable or unusable

  • Descriptive: enables identification, location and retrieval of data, and often includes use of controlled vocabularies for classification and indexing.

  • Technical: describes the technical processes used to produce, or required to use, a digital data object.

  • Administrative: used to manage administrative aspects of the digital object, e.g. intellectual property rights and acquisition.

Elements of metadata

  • Structured data files:
    • readable by machines and humans, accessible through the web
  • Controlled vocabularies, e.g. NERC Vocabulary Server:
    • allow for connectivity of data

  • By structuring metadata and adhering to controlled vocabularies, data can be combined, accessed and searched!
  • Different communities develop different standards, which define both the structure and content of metadata
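
As a minimal sketch of what adhering to a controlled vocabulary means in practice, the Python snippet below checks the values recorded in one column against an allowed list. The vocabulary is a hypothetical three-term list made up for illustration (it is not taken from the NERC Vocabulary Server), and `find_uncontrolled_terms` is an assumed helper name.

```python
# Hypothetical controlled vocabulary for a habitat column (illustrative only).
ALLOWED_HABITATS = {"forest", "grassland", "wetland"}


def find_uncontrolled_terms(values, vocabulary):
    """Return the sorted, de-duplicated values that are not in the vocabulary."""
    return sorted(set(values) - set(vocabulary))


# Made-up recorded values, including one free-text variant.
recorded = ["forest", "wetland", "Wet land", "forest"]
unknown = find_uncontrolled_terms(recorded, ALLOWED_HABITATS)
print(unknown)
```

Any value the check returns is one that another dataset using the same vocabulary could not be matched against, which is why free-text variants are worth catching early.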

metadata in research

Identifying the right metadata standard

Seek help from support teams

Most university libraries have assistants dedicated to Research Data Management.

Key metadata:

the bare minimum

document data coverage information

  • taxonomic coverage: a table containing taxonomic information on species in data.
    • also record authority / source
  • temporal coverage: temporal range and resolution details
  • spatial coverage:
    • a human readable geographic description of the study area
    • spatial range and resolution details
    • include depth (marine/freshwater) or altitudinal (terrestrial) information

Make sure to record units!
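
The temporal and spatial ranges can often be derived from the data themselves. The following Python sketch shows one way to do that, assuming each record carries a date, coordinates in decimal degrees and an elevation in metres. The records, the field names and the `summarise_coverage` helper are all made up for illustration. Resolution details and taxonomic coverage would still need to be documented separately.

```python
from datetime import date

# Made-up sample records; the field names are assumptions for this sketch.
records = [
    {"date": date(2015, 6, 6), "lat": 44.06, "lon": -71.29, "elevation_m": 274.0},
    {"date": date(2015, 7, 22), "lat": 44.05, "lon": -71.31, "elevation_m": 310.5},
    {"date": date(2015, 11, 3), "lat": 42.54, "lon": -72.17, "elevation_m": 348.0},
]


def summarise_coverage(rows):
    """Summarise temporal, spatial and elevation ranges, recording units."""
    dates = [row["date"] for row in rows]
    lats = [row["lat"] for row in rows]
    lons = [row["lon"] for row in rows]
    elevations = [row["elevation_m"] for row in rows]
    return {
        "temporal": {
            "start": min(dates).isoformat(),
            "end": max(dates).isoformat(),
        },
        "spatial": {
            "south": min(lats),
            "north": max(lats),
            "west": min(lons),
            "east": max(lons),
            "units": "decimal degrees",
        },
        "elevation": {
            "min": min(elevations),
            "max": max(elevations),
            "units": "m",
        },
    }


coverage = summarise_coverage(records)
print(coverage)
```

Storing the units next to each range keeps them from being separated from the numbers they describe.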

document protocols in a methods document

Keep a dynamic document used to plan, record and write up methods.

Is there any additional information other users would need to combine your data with theirs? Record it.


Practical metadata

  • Defining Metadata & explaining importance: ✅
  • Advising on domain specific Controlled Vocabularies & structure

  • How can we practice creating metadata?

May 21 - 22, 2018. Seattle

rOpenSci Unconf mission

Bringing together scientists, developers, and open data enthusiasts from academia, industry, government, and non-profits for a few days to hack on various projects.

Ideas for projects submitted through GitHub issues in the runconf18 repo

issue #72 🙋

Metadata team!

Luckily, a whole bunch of other awesome folks were also thinking about these topics and interested in working on them! 🤩

(in alphabetical order):

rOpenSciLabs pkg dataspice

Package dataspice makes it easier for researchers to create basic, lightweight and concise metadata files for their datasets.

  • Metadata collected in csv files
  • Metadata fields are based on schema.org

    • underlies Google Datasets metadata specification
  • Helper functions and shinyapps to extract and edit metadata files.

  • Ability to produce:

    • a structured JSON-LD metadata file.
    • a helpful dataset README webpage.
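
To give a feel for what a structured JSON-LD metadata file based on schema.org looks like, here is a minimal Python sketch that assembles a `Dataset` record. It is not dataspice's own output, and the fields dataspice writes may differ. The `build_dataset_jsonld` helper and all the values passed to it are placeholders made up for illustration. The property names (`temporalCoverage`, `spatialCoverage`, `GeoShape`, `box`) come from schema.org.

```python
import json


def build_dataset_jsonld(name, description, start, end, south, west, north, east):
    """Assemble a minimal schema.org Dataset description as a dictionary."""
    return {
        "@context": "https://schema.org",
        "@type": "Dataset",
        "name": name,
        "description": description,
        # ISO 8601 interval: start/end
        "temporalCoverage": f"{start}/{end}",
        "spatialCoverage": {
            "@type": "Place",
            # Bounding box given as: south west north east
            "geo": {"@type": "GeoShape", "box": f"{south} {west} {north} {east}"},
        },
    }


# Placeholder values, not taken from any real dataset.
doc = build_dataset_jsonld(
    name="Example vegetation survey",
    description="Illustrative record only.",
    start="2015-06-01",
    end="2015-11-30",
    south=42.5,
    west=-72.2,
    north=44.1,
    east=-71.2,
)
print(json.dumps(doc, indent=2))
```

Because the result is plain JSON, it can be embedded in a web page or stored next to the data files.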

Google unveils search engine for open data

The tool, called Google Dataset Search, should help researchers to find the data they need more easily.

Nature NEWS - 05 SEPTEMBER 2018


dataspice tutorial

The goal of dataspice-tutorial is to provide a practical exercise in creating metadata for an example field-collected data product using the dataspice package.

  • Understand basic metadata and why it is important
  • Understand where and how to store them

  • Understand how they can feed into more complex metadata objects.

dataspice workflow
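
The idea of collecting metadata in csv files can be illustrated with a small Python sketch that builds an empty template with one row per variable, ready to be filled in by hand. This is a language-neutral illustration, not the dataspice API. The `attributes_template` helper, the file name `plots.csv` and the column names are assumptions (`unitText` is borrowed from schema.org), and the columns dataspice itself uses may differ.

```python
import csv
import io


def attributes_template(file_name, header):
    """Return CSV text with one row per variable and blank fields to fill in."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["fileName", "variableName", "description", "unitText"])
    for variable in header:
        # Description and unit are left empty for the researcher to complete.
        writer.writerow([file_name, variable, "", ""])
    return buffer.getvalue()


# Hypothetical file name; the variable names are a subset of a data file's header.
template = attributes_template("plots.csv", ["date", "siteID", "plotID"])
print(template)
```

Starting from the header of each data file means no variable is forgotten, and the blank cells make it obvious which descriptions and units are still missing.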

Example dataset

Data source :

National Ecological Observatory Network (NEON) data portal

Dataset selected:

Woody plant vegetation structure

This data product contains the quality-controlled, native sampling resolution data from in-situ measurements of live and standing dead woody individuals and shrub groups, from all terrestrial NEON sites with qualifying woody vegetation.

  • Structure and mapping data are reported per individual per plot

  • Sampling metadata, such as per growth form sampling area, are reported per plot.

dataspice workshop data

The data are a trimmed subset of data downloaded from the NEON data portal after filtering for:

  • time periods between 2015-06 and 2016-06

  • locations within NEON Domain area D01: Northeast

The filter returned data from 2 sites, from 2015-06 to 2015-11.
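
The same kind of filter can be sketched in a few lines of Python. The portal applies it for you, so this only illustrates the logic. The rows, the site codes and the `filter_rows` helper are made up, and `domainID` and `siteID` are assumed field names.

```python
from datetime import date

# Made-up rows; field names and site codes are assumptions for this sketch.
rows = [
    {"siteID": "SITE_A", "domainID": "D01", "date": date(2015, 7, 22)},
    {"siteID": "SITE_B", "domainID": "D01", "date": date(2014, 9, 1)},
    {"siteID": "SITE_C", "domainID": "D02", "date": date(2015, 8, 4)},
]


def filter_rows(rows, domain, start, end):
    """Keep rows from one domain whose date falls within [start, end]."""
    return [
        row for row in rows
        if row["domainID"] == domain and start <= row["date"] <= end
    ]


subset = filter_rows(rows, "D01", date(2015, 6, 1), date(2016, 6, 30))
print([row["siteID"] for row in subset])
```

Recording the filter criteria alongside the data, as done above, lets others reproduce the subset from the original source.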


National Ecological Observatory Network. 2018. Data Products: DP1.10098.001. Provisional data downloaded from http://data.neonscience.org on 2018-05-04. Battelle, Boulder, CO, USA

Plot level data

## # A tibble: 165 x 14
## date siteID plotID plotType nlcdClass decimalLatitude
## <date> <chr> <chr> <chr> <chr> <dbl>
## 1 2015-06-06 BART BART_… tower deciduou… 44.1
## 2 2015-07-16 BART BART_… tower deciduou… 44.1
## 3 2015-07-21 BART BART_… tower deciduou… 44.1
## 4 2015-07-22 BART BART_… tower mixedFor… 44.1
## 5 2015-07-22 BART BART_… tower deciduou… 44.1
## 6 2015-07-22 BART BART_… tower deciduou… 44.1
## 7 2015-07-22 BART BART_… tower deciduou… 44.1
## 8 2015-07-23 BART BART_… tower mixedFor… 44.1
## 9 2015-07-23 BART BART_… tower deciduou… 44.1
## 10 2015-07-28 BART BART_… tower mixedFor… 44.1
## # … with 155 more rows, and 8 more variables: decimalLongitude <dbl>,
## # treesPresent <lgl>, shrubsPresent <lgl>, lianasPresent <lgl>,
## # totalSampledAreaTrees <dbl>, totalSampledAreaShrubSapling <dbl>,
## # totalSampledAreaLiana <dbl>, recordedBy <chr>

Individual level data

## # A tibble: 1,799 x 7
## date siteID plotID individualID taxonID scientificName recordedBy
## <date> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 2015-08-04 BART BART_0… NEON.PLA.D0… TSCA Tsuga canaden… 6HzkzFDdL…
## 2 2015-08-04 BART BART_0… NEON.PLA.D0… TSCA Tsuga canaden… 6HzkzFDdL…
## 3 2015-07-22 BART BART_0… NEON.PLA.D0… FAGR Fagus grandif… zODC+zTh3…
## 4 2015-08-26 BART BART_0… NEON.PLA.D0… FAGR Fagus grandif… zODC+zTh3…
## 5 2015-08-04 BART BART_0… NEON.PLA.D0… PICEA Picea sp. 6HzkzFDdL…
## 6 2015-08-04 BART BART_0… NEON.PLA.D0… TSCA Tsuga canaden… 0uwWHUCkG…
## 7 2015-07-22 BART BART_0… NEON.PLA.D0… FAGR Fagus grandif… zODC+zTh3…
## 8 2015-08-12 BART BART_0… NEON.PLA.D0… FAGR Fagus grandif… jRr6tAEXv…
## 9 2015-07-22 BART BART_0… NEON.PLA.D0… FAGR Fagus grandif… jRr6tAEXv…
## 10 2015-08-25 BART BART_0… NEON.PLA.D0… FAGR Fagus grandif… 0uwWHUCkG…
## # … with 1,789 more rows

time for some live coding 😱

or head to the tutorial if working through this on your own

Additional metadata tips

  • The approach we went for is very general / minimal

  • You can make your datasets more discoverable by developing richer/more domain specific metadata files.

  • e.g. create Ecological Metadata Language (EML) metadata using the R package EML.

  • deposit your data at KNB


KNB data portal

Rich interactive metadata

Parting words

  • Start small and build up to more complex standards 💯

    • But make sure to cover bare minimum ⚠️

  • Reach out for help from your local librarians or try the rOpenSci discussion board 🙋


You got data. Is it enough?

