ACCE Research Data and Project Management


Metadata

10-11 April 2019, University of Sheffield

Dr Anna Krystalli @annakrystalli


You got data. Is it enough?


#otherpeoplesdata dream match!

Thought experiment: Imagine a dream open data set

How would you locate it?

  • what details would you need to know to determine relevance?
  • what information would you need to know to use it?


metadata = data about data



"Information that describes, explains, locates, or in some way makes it easier to find, access, and use a resource (in this case, data)."

Backbone of digital curation

Without it, a digital resource may be irretrievable, unidentifiable or unusable


Descriptive

  • enables identification, location and retrieval of data; often uses controlled vocabularies for classification and indexing.

Technical

  • describes the technical processes used to produce, or required to use, a digital data object.

Administrative

  • used to manage administrative aspects of the digital object e.g. intellectual property rights and acquisition.

Elements of metadata

  • Structured data files:

    • readable by machines and humans, accessible through the web
  • Controlled vocabularies, e.g. the NERC Vocabulary Server

    • allows for connectivity of data

KEY TO SEARCH FUNCTION

  • By structuring & adhering to controlled vocabularies, data can be combined, accessed and searched!
  • Different communities develop different standards which define both the structure and content of metadata
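
As a minimal sketch of why controlled vocabularies matter for search and combination, the Python snippet below maps free-text habitat labels from two datasets onto one shared term list. The vocabulary, synonyms and records are all invented for illustration and are not taken from the NERC Vocabulary Server or any other real standard.

```python
# Hypothetical controlled vocabulary: preferred term -> accepted spellings/synonyms.
VOCABULARY = {
    "deciduous forest": {"deciduous forest", "deciduousforest", "broadleaf woodland"},
    "mixed forest": {"mixed forest", "mixedforest", "mixed woodland"},
}


def to_controlled_term(label):
    """Return the preferred term for a free-text label, or None if it is unknown."""
    cleaned = label.strip().lower()
    for preferred, synonyms in VOCABULARY.items():
        if cleaned in synonyms:
            return preferred
    return None


def search(records, term):
    """Find records whose habitat label maps to the given preferred term."""
    return [r for r in records if to_controlled_term(r["habitat"]) == term]


# Two invented datasets that describe the same habitats with different labels.
dataset_a = [
    {"id": "a1", "habitat": "deciduousForest"},
    {"id": "a2", "habitat": "mixedForest"},
]
dataset_b = [
    {"id": "b1", "habitat": "Broadleaf woodland"},
    {"id": "b2", "habitat": "heath"},
]

# Once both datasets use the shared terms, they can be combined and searched as one.
matches = search(dataset_a + dataset_b, "deciduous forest")
print([r["id"] for r in matches])
```

Labels that do not map to any term (such as "heath" here) return None, which is a useful prompt to either extend the vocabulary or correct the record.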

metadata in research



Identifying the right metadata standard


Seek help from support teams

Most university libraries have assistants dedicated to Research Data Management.


Key metadata:

the bare minimum

document data coverage information

  • taxonomic coverage: a table containing taxonomic information on the species in the data.
    • also record authority / source
  • temporal coverage: temporal range and resolution details
  • spatial coverage:
    • a human readable geographic description of the study area
    • spatial range and resolution details
    • include depth (marine/freshwater) or altitudinal (terrestrial) information

Make sure to record units!
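
The coverage items above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the sample rows, the site names and the `elevation_m` column are invented, and `summarise_coverage` is a hypothetical helper, not part of any package mentioned in these slides. It derives the temporal and spatial range from a table and stores the units next to the values.

```python
import csv
import io

# Invented sample rows. Column names decimalLatitude / decimalLongitude mirror
# the example data; elevation_m and all values are made up for illustration.
SAMPLE_CSV = """date,siteID,decimalLatitude,decimalLongitude,elevation_m
2015-06-06,SITE_A,44.06,-71.29,274
2015-07-16,SITE_A,44.05,-71.31,310
2015-11-02,SITE_B,42.54,-72.17,348
"""


def summarise_coverage(rows):
    """Summarise temporal and spatial coverage, keeping units explicit."""
    dates = sorted(row["date"] for row in rows)  # ISO 8601 dates sort correctly as text
    lats = [float(row["decimalLatitude"]) for row in rows]
    lons = [float(row["decimalLongitude"]) for row in rows]
    elevations = [float(row["elevation_m"]) for row in rows]
    return {
        "temporal": {"start": dates[0], "end": dates[-1], "format": "ISO 8601 date"},
        "spatial": {
            "south": min(lats),
            "north": max(lats),
            "west": min(lons),
            "east": max(lons),
            "unit": "decimal degrees",
        },
        "elevation": {"min": min(elevations), "max": max(elevations), "unit": "metres"},
    }


rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
coverage = summarise_coverage(rows)
print(coverage)
```

A human-readable description of the study area and the taxonomic authority still have to be written by hand; only the ranges can be derived from the data.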


document protocols in a methods document

Keep a dynamic document used to plan, record and write up methods.

Is there any additional information other users would need to combine your data with theirs? Record it.


Practical metadata


ACCE DTP RDM course


Teaching this course has always felt challenging in terms of practical exercises

  • Defining Metadata & explaining importance: ✅

  • Advising on domain specific Controlled Vocabularies & structure

  • How can we practice creating metadata?


rOpenSci Unconf 18

May 21 - 22, 2018. Seattle


rOpenSci Unconf mission

bringing together scientists, developers, and open data enthusiasts from academia, industry, government, and non-profits to get together for a few days and hack on various projects.


Ideas for projects submitted through GitHub issues in the runconf18 repo


issue #72 🙋


Metadata team!


Luckily, a whole bunch of other awesome folks were also thinking about these topics and interested in working on them! 🤩

(in alphabetical order):

rOpenSciLabs pkg dataspice

Package dataspice makes it easier for researchers to create basic, lightweight and concise metadata files for their datasets.


  • Metadata collected in csv files

  • Metadata fields are based on schema.org

    • underlies Google Datasets metadata specification
  • Helper functions and Shiny apps to extract and edit metadata files.

  • Ability to produce:

    • a structured JSON-LD metadata file.
    • a helpful dataset README webpage.
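
To give a feel for what a structured JSON-LD metadata file based on schema.org looks like, here is a minimal Python sketch. It is not dataspice's actual output or API: `build_dataset_jsonld` is a hypothetical helper, and the variable descriptions and units are assumptions made for illustration. The dataset name, file names and date range are those of the workshop data introduced later in these slides; the property names (`name`, `description`, `temporalCoverage`, `variableMeasured`, `distribution`) come from the schema.org Dataset type.

```python
import json


def build_dataset_jsonld(name, description, temporal_coverage, files, variables):
    """Assemble a minimal schema.org Dataset description as a Python dict."""
    return {
        "@context": "https://schema.org",
        "@type": "Dataset",
        "name": name,
        "description": description,
        "temporalCoverage": temporal_coverage,  # ISO 8601 interval, start/end
        "distribution": [
            {"@type": "DataDownload", "name": f, "encodingFormat": "text/csv"}
            for f in files
        ],
        "variableMeasured": [
            {
                "@type": "PropertyValue",
                "name": v["name"],
                "description": v["description"],
                "unitText": v["unit"],
            }
            for v in variables
        ],
    }


metadata = build_dataset_jsonld(
    name="Woody plant vegetation structure",
    description="Trimmed subset of NEON woody plant vegetation structure data.",
    temporal_coverage="2015-06/2015-11",
    files=["vst_perplotperyear.csv", "vst_mappingandtagging.csv"],
    # Descriptions and units below are illustrative assumptions.
    variables=[
        {"name": "decimalLatitude", "description": "Latitude of the plot", "unit": "decimal degrees"},
        {"name": "decimalLongitude", "description": "Longitude of the plot", "unit": "decimal degrees"},
    ],
)

# Serialise to JSON-LD text, ready to save alongside the data files.
jsonld_text = json.dumps(metadata, indent=2)
print(jsonld_text)
```

Because the result is plain JSON with agreed property names, both people and search tools can read the same file.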



Google unveils search engine for open data

The tool, called Google Dataset Search, should help researchers to find the data they need more easily.

Nature NEWS - 05 SEPTEMBER 2018



https://toolbox.google.com/datasetsearch

dataspice tutorial


The goal of dataspice-tutorial is to provide a practical exercise in creating metadata for an example field-collected data product using the dataspice package.

  • Understand basic metadata and why it is important

  • Understand where and how to store them

  • Understand how they can feed into more complex metadata objects.


dataspice workflow

Example dataset

Data source:

National Ecological Observatory Network (NEON) data portal

Dataset selected:

Woody plant vegetation structure

This data product contains the quality-controlled, native sampling resolution data from in-situ measurements of live and standing dead woody individuals and shrub groups, from all terrestrial NEON sites with qualifying woody vegetation.

  • Structure and mapping data are reported per individual per plot

  • Sampling metadata, such as per growth form sampling area, are reported per plot.


dataspice workshop data

The data are a trimmed subset of data downloaded from the NEON data portal after filtering for:

  • time period: 2015-06 to 2016-06

  • locations within NEON Domain area D01: Northeast

The filter returned data from 2 sites, from 2015-06 to 2015-11.



Citation:

National Ecological Observatory Network. 2018. Data Products: DP1.10098.001. Provisional data downloaded from http://data.neonscience.org on 2018-05-04. Battelle, Boulder, CO, USA


vst_perplotperyear.csv

Plot level data

## # A tibble: 165 x 14
## date siteID plotID plotType nlcdClass decimalLatitude
## <date> <chr> <chr> <chr> <chr> <dbl>
## 1 2015-06-06 BART BART_… tower deciduou… 44.1
## 2 2015-07-16 BART BART_… tower deciduou… 44.1
## 3 2015-07-21 BART BART_… tower deciduou… 44.1
## 4 2015-07-22 BART BART_… tower mixedFor… 44.1
## 5 2015-07-22 BART BART_… tower deciduou… 44.1
## 6 2015-07-22 BART BART_… tower deciduou… 44.1
## 7 2015-07-22 BART BART_… tower deciduou… 44.1
## 8 2015-07-23 BART BART_… tower mixedFor… 44.1
## 9 2015-07-23 BART BART_… tower deciduou… 44.1
## 10 2015-07-28 BART BART_… tower mixedFor… 44.1
## # … with 155 more rows, and 8 more variables: decimalLongitude <dbl>,
## # treesPresent <lgl>, shrubsPresent <lgl>, lianasPresent <lgl>,
## # totalSampledAreaTrees <dbl>, totalSampledAreaShrubSapling <dbl>,
## # totalSampledAreaLiana <dbl>, recordedBy <chr>

vst_mappingandtagging.csv

Individual level data

## # A tibble: 1,799 x 7
## date siteID plotID individualID taxonID scientificName recordedBy
## <date> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 2015-08-04 BART BART_0… NEON.PLA.D0… TSCA Tsuga canaden… 6HzkzFDdL…
## 2 2015-08-04 BART BART_0… NEON.PLA.D0… TSCA Tsuga canaden… 6HzkzFDdL…
## 3 2015-07-22 BART BART_0… NEON.PLA.D0… FAGR Fagus grandif… zODC+zTh3…
## 4 2015-08-26 BART BART_0… NEON.PLA.D0… FAGR Fagus grandif… zODC+zTh3…
## 5 2015-08-04 BART BART_0… NEON.PLA.D0… PICEA Picea sp. 6HzkzFDdL…
## 6 2015-08-04 BART BART_0… NEON.PLA.D0… TSCA Tsuga canaden… 0uwWHUCkG…
## 7 2015-07-22 BART BART_0… NEON.PLA.D0… FAGR Fagus grandif… zODC+zTh3…
## 8 2015-08-12 BART BART_0… NEON.PLA.D0… FAGR Fagus grandif… jRr6tAEXv…
## 9 2015-07-22 BART BART_0… NEON.PLA.D0… FAGR Fagus grandif… jRr6tAEXv…
## 10 2015-08-25 BART BART_0… NEON.PLA.D0… FAGR Fagus grandif… 0uwWHUCkG…
## # … with 1,789 more rows

Practical


time for some live coding 😱




or head to the tutorial if working through this on your own


Outro


Additional metadata tips

  • The approach we went for is very general / minimal

  • You can make your datasets more discoverable by developing richer/more domain specific metadata files.

  • e.g. create Ecological Metadata Language (EML) metadata using the R package EML.

  • deposit your data at KNB

KNB data portal

Rich interactive metadata

Parting words

  • Any metadata documentation is better than none 👍

  • Start small and build up to more complex standards 💯

    • But make sure to cover the bare minimum ⚠️

  • Reach out for help from your local librarians or try the rOpenSci discussion board 🙋

🌯
