Metadata

This article introduces the collection of articles on metadata. The collection comprises contributions on all aspects and issues relevant to creating rich and robust metadata.

Metadata - Information about research data
(or: The data about the data)

Metadata is information added to (research) data to provide context for, and a description of, the data itself. Metadata is essential to understand, manage and utilise data effectively. Typically, metadata is a set of attributes or properties (e.g., author, date, measurement parameters) that offers insight into the content, structure and characteristics of a dataset, document, file or any other information resource. These attributes can be grouped into categories such as:

  • Summary: What is the dataset about?
  • Provenance: Who created it?
  • Context: How was the dataset collected or generated and why does it exist?
  • Content: What does the dataset include?
  • Usability: How can the dataset be used?
  • Temporal and Spatial Coverage: What is the dataset’s time frame? What is the spatial coverage of the dataset?
  • Privacy: Is there any restricted content, why is it private, and how can it be accessed?

Metadata may be created automatically or manually. The level of detail (i.e., granularity) of metadata varies with the type and purpose of the data. At a minimum, metadata should contain basic information about provenance and content (e.g., following the "Dublin Core" element set), but hundreds of metadata attributes may be used to describe a single dataset in detail.
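A minimal record of basic provenance and content information could be sketched as follows. The example uses the 15 element names and the namespace of the Dublin Core Metadata Element Set; the helper `build_dc_record`, the generic `metadata` container element, and all field values are assumptions made up for illustration and do not describe a real dataset or a prescribed record layout.

```python
import xml.etree.ElementTree as ET
from typing import Mapping

# Namespace and element names of the Dublin Core Metadata Element Set.
DC_NS = "http://purl.org/dc/elements/1.1/"
DC_ELEMENTS = {
    "title", "creator", "subject", "description", "publisher",
    "contributor", "date", "type", "format", "identifier",
    "source", "language", "relation", "coverage", "rights",
}


def build_dc_record(fields: Mapping[str, str]) -> ET.Element:
    """Wrap the given Dublin Core elements in a generic container element."""
    unknown = sorted(set(fields) - DC_ELEMENTS)
    if unknown:
        raise ValueError(f"not Dublin Core elements: {unknown}")
    ET.register_namespace("dc", DC_NS)
    record = ET.Element("metadata")  # container name is an assumption
    for name, value in fields.items():
        ET.SubElement(record, f"{{{DC_NS}}}{name}").text = value
    return record


# Hypothetical dataset description, invented for this example.
example_fields = {
    "title": "Hourly air temperature at a hypothetical station",
    "creator": "Jane Doe",
    "date": "2023-05-01",
    "description": "Air temperature measured 2 m above ground.",
    "rights": "CC-BY-4.0",
}

record = build_dc_record(example_fields)
xml_text = ET.tostring(record, encoding="unicode")
print(xml_text)
```

Rejecting unknown element names early is one possible design choice; a real workflow would more likely validate the record against the schema required by the target repository.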

Metadata and metadata standards differ among disciplines and domains, may evolve over time along with the progression of technology and research fields, and are usually specific to the type of (research) data that is described.

The importance of metadata

Annotating research data appropriately with metadata is mandatory when creating FAIR (findable, accessible, interoperable, reusable) research data. Metadata need to provide all information required to comprehend how the data were created and how they are to be understood, thereby ensuring their findability and reusability. Metadata that follow established standards for vocabularies, schemas, etc. enable interoperability between databases and systems.

Without metadata, research data are of value only to the person who generated them, and only for as long as she or he can remember how the data were measured, modelled, collected, etc. Nobody else will benefit, either because the research data cannot be found or because they are not understood well enough to be used in further research. Hence, metadata are a mandatory part of research data and just as important as the data themselves. Or, as Jason Scott put it: "Metadata is a love note to the future."

Types of metadata

Metadata may be divided into descriptive, structural, and administrative metadata. Descriptive metadata describe the contents of a dataset. Structural metadata provide information about the organisation of a dataset. Administrative metadata contain technical information about a dataset.
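The three types can be kept apart in a simple data structure. The following is a sketch only: the class `DatasetMetadata`, its method `missing_types`, and all keys and values are hypothetical and not taken from any standard.

```python
from dataclasses import asdict, dataclass, field
from typing import Dict, List


@dataclass
class DatasetMetadata:
    """Groups metadata attributes by the three types named above."""
    descriptive: Dict[str, str] = field(default_factory=dict)
    structural: Dict[str, str] = field(default_factory=dict)
    administrative: Dict[str, str] = field(default_factory=dict)

    def missing_types(self) -> List[str]:
        """Return the names of metadata types that have no attributes yet."""
        return [name for name, values in asdict(self).items() if not values]


# Hypothetical example values.
meta = DatasetMetadata(
    descriptive={"title": "Hourly air temperature", "keywords": "temperature"},
    structural={"layout": "one CSV file per station",
                "columns": "timestamp, temperature_c"},
    administrative={"file_format": "text/csv", "license": "CC-BY-4.0"},
)

print(meta.missing_types())
```

A check like `missing_types` can flag records in which one type of metadata has been forgotten entirely.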

Appropriate metadata

Metadata should be rich and robust. Rich means that metadata need to be complete, accurate, and detailed so that others can find, interoperate with and reuse the data. Robust refers to, e.g., continuity, consistency, curation and maintenance, practicability, uniqueness, and avoidance of misinterpretation. Specific requirements for richness and robustness depend on the type of data, the available standards and protocols, the respective discipline or community, and the use case. Based on the FAIR data principles, descriptions, criteria and indicators for rich and robust metadata have been proposed, e.g., by the GO FAIR Initiative and by the Research Data Alliance FAIR Data Maturity Model Working Group as part of its specification and guidelines for a FAIR Data Maturity Model.

Granularity of metadata

The more detailed metadata are, the more granular they are. Higher granularity allows more precise search and analysis of the data. Efficient and effective reuse of data requires finding and accessing data at various levels of granularity. The ideal level of granularity for metadata needs to be defined for the particular data type and usage context. The Research Data Alliance Data Granularity Workgroup, for example, explores key questions and collects and shares information on how to find an appropriate data granularity. It thereby provides guidance that helps data professionals determine the best level of granularity for user discovery, access, interoperability and citability.

Using vocabularies for metadata naming

To create consistent metadata that can be searched and retrieved accurately and quickly across various databases, controlled vocabularies should be used for

  • naming metadata items, and
  • defining the lists of accepted terms used to populate descriptive metadata items (e.g., [1]).

The use of vocabularies is a prerequisite for overcoming ontological and semantic discrepancies when data are synthesized across repositories (e.g., [2]).
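Both uses of a controlled vocabulary can be illustrated with a small check. This is a minimal sketch: the item names, the accepted terms and the function `check_record` are hypothetical and do not come from any published vocabulary.

```python
from typing import Dict, List

# Hypothetical controlled vocabulary for metadata item names.
ALLOWED_ITEM_NAMES = {"title", "creator", "measurement_method"}

# Hypothetical lists of accepted terms for selected items.
ALLOWED_TERMS = {
    "measurement_method": {"in situ", "remote sensing", "model simulation"},
}


def check_record(record: Dict[str, str]) -> List[str]:
    """Return a list of vocabulary violations found in a metadata record."""
    problems = []
    for name, value in record.items():
        if name not in ALLOWED_ITEM_NAMES:
            problems.append(f"unknown item name: {name}")
        elif name in ALLOWED_TERMS and value not in ALLOWED_TERMS[name]:
            problems.append(f"term not in vocabulary for {name}: {value}")
    return problems


good = {"title": "Hourly air temperature", "measurement_method": "in situ"}
bad = {"Titel": "Hourly air temperature", "measurement_method": "by hand"}

print(check_record(good))
print(check_record(bad))
```

Free-text items such as `title` are only checked by name here, whereas items with a term list are also checked by value.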

Metadata standards (in Earth System Sciences)

Metadata standards facilitate not only technological but also semantic interoperability between infrastructures such as research data repositories. Several general (i.e., cross-domain) as well as domain-specific standards exist for the Earth System Sciences. The broad term metadata standard includes standards that may or may not be technology-agnostic, i.e., some standards have a formal specification (e.g., in the form of an XSD or an RDFS schema), while others have none. However, even an approved metadata schema can be flawed or used by only a few. Conversely, an unapproved but widely used metadata schema might act as a quasi-standard in a given domain. Several metadata standards may even exist for the same type of data (e.g., geospatial metadata standards; see the information at FGDC.gov). The Research Data Alliance has published a catalog of metadata standards: the Metadata Standards Catalog.

Acknowledgement

The writing of this article was inspired by several webpages – many thanks to:

References


  1. Heather Hedden. Taxonomies and controlled vocabularies best practices for metadata. Journal of Digital Asset Management, 6(5):279–284, 2010. https://doi.org/10.1057/dam.2010.29

  2. Michael Piasecki and Bora Beran. A semantic annotation tool for hydrologic sciences. Earth Science Informatics, 2(3):157–168, 2009. https://doi.org/10.1007/s12145-009-0031-x