
Introduction

Metadata plays a crucial role in making data FAIR (findable, accessible, interoperable, and reusable). In Earth System Science (ESS), data are diverse. They range, for example, from sensor data from long-term observatories and climate change expeditions to topographic base map data. The diversity of data sources and types in ESS requires attention to metadata, as this is the only way to ensure that the data can be effectively understood and used by the intended target group. While custom metadata schemas might be developed to address specific requirements, there is a strong need for harmonisation to enhance the overall usability and interoperability of the data. Striking the proper balance between detailed domain-specific metadata and harmonised, more general-purpose metadata is a challenge that requires careful consideration. Often it is beneficial to have both: high-level (or abstract) standardised metadata (schemas) that facilitate findability and interoperability in general, and additional, more detailed and sophisticated metadata that is tailored to specific (sub-disciplinary) requirements. In addressing these challenges, we benefit from a number of well-established international metadata standards, e.g., from the geospatial domain. Collaboration among repository providers, data providers, researchers as data consumers, and standards organisations is key to the successful development and implementation of harmonised metadata schemas. By pooling expertise and aligning goals, the ESS community can strengthen the development, harmonisation, mapping, update and usage of metadata standards that strike a balance between findability, interoperability, and the preservation of (sub-)discipline-specific information. As the initial step of a community-driven process towards the harmonisation of metadata, this white paper is intended to provide high-level recommendations for the provision of ESS metadata with particular respect to findability and interoperability.
We will update it regularly and iteratively add recommendations for specific ESS use cases. This article primarily addresses data repository providers, but also data providers and developers of metadata-related tools. Harmonisation of metadata provision may happen at the syntactic level (i.e., use of a common data format, e.g., JSON), at the semantic level (i.e., use of similar vocabularies to describe the data), or at the level of community practices (e.g., which metadata elements are always documented). This document focuses on community practices. Furthermore, we provide links to relevant vocabularies, where applicable. This article is intended to 1) raise awareness of the importance of well-maintained metadata as a minimum viable product [1] for findability and 2) give guidance on providing metadata for interoperability with existing infrastructures/services, taking the Helmholtz Data Hubs and the NFDI4Earth Knowledge Hub as examples. The ultimate goal is to provide a set of practices that will facilitate the interaction between publishers and consumers of data in the Earth System Sciences.

[1]: A minimum viable product (MVP) is considered as a version of a product, here metadata, with just enough features to be usable, here for findability tasks. It can then be improved based on specific usage needs in future iterations.

Call for review

This article is intended to be updated regularly and published as a new version on Zenodo. Moreover, you are invited to actively participate and comment on this version by sending your feedback to Christin Henzen (christin.henzen@tu-dresden.de).

Citation

Bernard, L., Degbelo, A., Grieb, J., Henzen, C., Heß, R., Klammer, R., Koppe, R., Lorenz, C., Müller, C., & Weiland, C. (2024). Recommendations for Earth System Sciences Metadata Provision. Zenodo. https://doi.org/10.5281/zenodo.10604587

Recommendations

In the rapidly evolving ESS discipline, the proper organisation and description of data have become essential to foster collaboration, advance research, and enable evidence-based decision-making. Without clear and comprehensive metadata, there is a risk that valuable data sets will be lost. Well-structured metadata enhances data visibility and accessibility, facilitating efficient discovery for researchers, practitioners, and decision-makers. The purpose of this article is to give guidance to repository providers. In the following we describe a 4-step approach with our recommendations, which in summary involve:

  1. Implement a metadata profile or provide a mapping to a metadata profile for harvesting
  2. Structure the metadata content and assure the quality
  3. Define a license for the metadata
  4. Provide an interface using a standardised web-based protocol

Implement a metadata profile or provide a mapping to a metadata profile for harvesting

The Earth System Sciences community has developed several metadata schemas and standards to describe various types of data and promote interoperability. Some institutions or initiatives therefore provide human-readable or machine-actionable comprehensive lists of those existing schemas and standards, e.g., the International Organization for Standardization (ISO) Ontology representations of geographic technology standards, the Open Geospatial Consortium (OGC) standards overview and the schema repository, the FAIRsharing platform, or the Helmholtz Metadata Collaboration (HMC) Compilation of Metadata Standards. This section contains the best practices to implement a core metadata profile or provide a mapping. With the upcoming feedback loops and iterations of this document, we intend to provide more best practices and profiles for specific data domains. All descriptions of best practices follow a simplified version of the W3C Best Practice Template. The specific examples for each best practice are meant to support the second step of our suggested approach, that is, to structure metadata. As a running example, we use the dataset Hydrography90m: A new high-resolution global hydrographic dataset, available at [@amatulli_2022_08_09] and described in [@DOI:10.5194-essd-14-4525-2022], which is reused in the NFDI4Earth pilot GeoFRESH, in the follow-up NFDI4Earth pilot project, and in an NFDI4Biodiversity follow-up project, hydrographr.

Best practice 1: Provide unique and persistent identifiers

Why

The provision of unique and persistent identifiers is a fundamental requirement for the publication of data on the Internet, as researchers wish to permanently cite or reference data from publications or other data. Developers, for instance, use these identifiers in their code, and persistent identifiers spare them from manually updating the references.

We recommend providing unique and persistent identifiers for metadata and data that are implemented as a DOI or a handle. By providing the identifier as a URL, you allow users, e.g., researchers, to navigate quickly to the cited metadata and/or data.

Example

An example for a human-readable (interactive) description of a dataset DOI is given here:

https://doi.org/10.18728/igb-fred-762.1

An example for the machine-actionable description of a dataset DOI is given here:

dct:identifier "https://doi.org/10.18728/igb-fred-762.1" ;

Several existing recommendations and conventions provide more examples of how to implement unique and persistent identifiers, see for instance the GDI-DE Metadata Conventions, the W3C Data on the Web Best Practices and the GO FAIR recommendations on implementing the FAIR principles.
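To illustrate the recommendation to provide the identifier as a URL, the following minimal Python sketch normalises a DOI written in one of several common notations to its `https://doi.org/` form. The function name `doi_to_url` and the simplified pattern are assumptions made for this illustration; the pattern covers the usual `10.<registrant>/<suffix>` shape and is not a complete implementation of the DOI syntax.

```python
import re

# Simplified DOI shape "10.<registrant>/<suffix>" (an assumption for this
# sketch, not the complete DOI syntax).
DOI_PATTERN = re.compile(r"^10\.\d{4,9}/\S+$")

# Notations that are commonly found in front of a bare DOI.
KNOWN_PREFIXES = (
    "https://doi.org/",
    "http://doi.org/",
    "https://dx.doi.org/",
    "http://dx.doi.org/",
    "doi:",
)


def doi_to_url(identifier: str) -> str:
    """Return the https://doi.org/ URL for a DOI given in a common notation."""
    value = identifier.strip()
    # Strip a known prefix so that only the bare DOI remains.
    for prefix in KNOWN_PREFIXES:
        if value.lower().startswith(prefix):
            value = value[len(prefix):]
            break
    if not DOI_PATTERN.match(value):
        raise ValueError(f"not a recognisable DOI: {identifier!r}")
    return f"https://doi.org/{value}"


print(doi_to_url("doi:10.18728/igb-fred-762.1"))
# prints: https://doi.org/10.18728/igb-fred-762.1
```

Normalising at ingestion time means that the same dataset is not listed twice merely because two sources wrote its DOI differently.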

Best practice 2: Provide a concise descriptive data title

Why

Explicitly providing a concise title of the data facilitates discovery on the Web, in particular in search portals. In most search portals, data titles are indexed and used to implement the search. Moreover, the data title is typically displayed in the result list. A descriptive title therefore supports researchers in selecting data.

Provide a concise descriptive data title including spatial and temporal information, if applicable.

Example

An example for a human-readable title of a dataset without temporal information is given here:

Hydrography90m: A new high-resolution global hydrographic dataset

An example for the machine-actionable description of a dataset title is given here:

dct:title "Hydrography90m: A new high-resolution global hydrographic dataset"@en ;

GDI-DE provides recommendations for action, with examples, for the description of governmental datasets.

Best practice 3: Provide an illustrative free-text description

Why

When publishing data on the Web, data publishers and data consumers are often unknown to each other. It is thus essential to provide information that helps researchers and machines to better understand the data.

Provide a free text description of the dataset in English and use disciplinary conventions whenever applicable. The climate and forecast conventions provide one example for disciplinary conventions: http://cfconventions.org/

Example

An example for a human-readable dataset description is given here:

A global high-resolution (90m) hydrographical network that delineates headwater stream channels in great detail. Raster and vector data available at https://hydrography.org/

An example for the machine-actionable dataset description is given here:

dct:description "A global high-resolution (90m) hydrographical network that delineates headwater stream channels in great detail. Raster and vector data available at https://hydrography.org/"@en ;

Best practice 4: Provide the publication date

Why

Descriptive metadata include the publication date of the data, as this supports researchers in finding up-to-date data and in evaluating how current the data are. In some search portals, the publication date can be used to sort and filter search results.

Use a standardized format, like ISO 8601, to provide the publication date, as this allows for machine- and human-readable encoding.

Example

An example for a human-readable description of publication date is given here:

2022-08-09

An example for the machine-actionable description of a publication date is given here:

dct:created "2022-08-09"^^xsd:date ;

Best practice 5: Provide the creator and contact point by using a simple pattern

Why

Providing information on the creator and the contact point for data is essential. Creator information is used to correctly provide license information (see the best practice on the data license). Information about the contact point allows data consumers to get in contact with the data publishers, e.g., to ask for a newer version of the data or for specific information that cannot be covered in the metadata.

Use a common simple pattern, e.g., "firstname lastname", without titles or affixes like "Dr" or "Prof". Provide additional identifiers in separate fields per author, e.g., email address, ORCID as URL. By providing the email address, you reduce the effort needed to contact a person, e.g., extensive searches for contact information. However, the ORCID makes it possible to identify persons and to find contact information for researchers, when it is made publicly available, even if they have changed their affiliation and the associated institutional email address. We therefore recommend using the ORCID to refer to the creator.

Example

An example for a human-readable contact point description of a dataset is given here:

Sami Domisch, sami.domisch@igb-berlin.de

An example for the machine-actionable description of a contact point is given here:

dcat:contactPoint [ vcard:fn "Sami Domisch" ; vcard:hasEmail <mailto:sami.domisch@igb-berlin.de> ] ;

An example for a human-readable creator description of a dataset is given here:

Giuseppe Amatulli, https://orcid.org/0000-0002-9651-9602

An example for the machine-actionable description of a creator is given here:

dct:creator [ foaf:name "Giuseppe Amatulli" ; adms:identifier <https://orcid.org/0000-0002-9651-9602> ] ;

Moreover, we recommend identifying organizations by their Research Organization Registry (ROR) identifier. For the Leibniz Institute of Freshwater Ecology and Inland Fisheries, that is https://ror.org/01nftxb06.
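Because the ORCID is recommended above as the identifier for creators, a repository may want to catch typing errors before storing it. The following sketch checks the final character of an ORCID iD with the ISO 7064 MOD 11-2 checksum that ORCID iDs use. The function names are chosen for this illustration only, and the check confirms the well-formedness of the iD, not that it is registered or belongs to the named person.

```python
def orcid_check_character(base_digits: str) -> str:
    """Compute the ISO 7064 MOD 11-2 check character for the first 15 digits."""
    total = 0
    for char in base_digits:
        total = (total + int(char)) * 2
    result = (12 - total % 11) % 11
    return "X" if result == 10 else str(result)


def is_well_formed_orcid(orcid: str) -> bool:
    """Check length and checksum of an ORCID iD given as URL or bare iD."""
    # Accept "https://orcid.org/0000-..." as well as "0000-...".
    bare = orcid.strip().rsplit("/", 1)[-1].replace("-", "")
    if len(bare) != 16 or not bare[:15].isdigit():
        return False
    return orcid_check_character(bare[:15]) == bare[15].upper()


print(is_well_formed_orcid("https://orcid.org/0000-0002-9651-9602"))
# prints: True
```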

Best practice 6: Provide the data type

Why

Metadata should be offered for all types of data, e.g., for datasets, data services, articles, or software. Providing information about the data type supports researchers and services to filter for specific data types.

Use a controlled vocabulary to describe the type of data, e.g., resource types defined in schema.org or DCAT.

Example

An example for a human-readable data type description is given here:

Dataset (https://www.w3.org/TR/vocab-dcat-2/#Class:Dataset)

An example for the machine-actionable data type description is given here:

dct:type <https://www.w3.org/TR/vocab-dcat-2/#Class:Dataset> ;

Best practice 7: Provide a data license

Why

License information is essential for data consumers to assess data usage, as it provides legal information about access to and reuse of the data. Data consumers might exclude relevant data because license information is missing. Moreover, search portals might exclude data with missing license information.

Provide a link to the license agreement that controls use of the data. Choose an open license whenever possible and use tools that help to identify proper licenses when needed, e.g., the CC License Chooser.

Example

An example for a human-readable description of license information is given here:

http://creativecommons.org/licenses/by-nc-sa/4.0/

An example for the machine-actionable description of license information is given here:

dct:license <https://creativecommons.org/licenses/by/4.0/> ;

Best practice 8: Provide spatial coverage

Why

Data consumers often search for data covering a specific region. The provision of precise information about the spatial coverage supports data consumers and services in the spatial filtering of data.

Provide the geographic region that is covered by the data as point, bounding box, polygon or line, implemented as Well-Known Text (WKT) representation.

Example

An example for a human-readable bounding box description is given here:

POLYGON((-180 85, 191 85, 191 -60, -180 -60, -180 85))

An example for the machine-actionable bounding box description is given here:

dcat:bbox "POLYGON((-180 85, 191 85, 191 -60, -180 -60, -180 85))"^^geosparql:wktLiteral ;

Please find more information on the expanded extent for the running example in the related publication.

Alternative options to describe the spatial coverage include providing a URI/link using the Geonames service as proposed in DCAT. However, this leads to more complex processing for search and harvesting implementations and would not work for the running example, whose extent exceeds 180° longitude. Moreover, the Spatial Data on the Web Best Practices provide a comprehensive overview of open and text-based formats to describe the spatial coverage with respect to discoverability and granularity, and give examples for different encodings.
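A WKT bounding box as recommended in this best practice can be generated from the four edges of the extent. The sketch below is a minimal illustration under the assumption that coordinates are given as longitude/latitude in degrees; the function name `bbox_to_wkt` is hypothetical. It writes the ring in the corner order of the running example and repeats the first corner at the end, because a WKT polygon ring has to be closed.

```python
def bbox_to_wkt(west: float, south: float, east: float, north: float) -> str:
    """Serialise a bounding box as a WKT polygon with one closed ring."""
    if south > north:
        raise ValueError("south must not be greater than north")
    # Corner order: upper left, upper right, lower right, lower left,
    # then the first corner again to close the ring.
    corners = [
        (west, north),
        (east, north),
        (east, south),
        (west, south),
        (west, north),
    ]
    # ".10g" drops trailing zeros (-180.0 becomes "-180") but keeps decimals.
    ring = ", ".join(f"{x:.10g} {y:.10g}" for x, y in corners)
    return f"POLYGON(({ring}))"


print(bbox_to_wkt(west=-180, south=-60, east=191, north=85))
# prints: POLYGON((-180 85, 191 85, 191 -60, -180 -60, -180 85))
```

The sketch deliberately does not clamp longitudes to the range from -180 to 180, since the running example uses an expanded extent.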

Best practice 9: Provide the temporal coverage

Why

The temporal coverage of a dataset is a useful indicator to assess the fitness for use of the data. Providing precise information on the temporal coverage supports data consumers and services in temporally filtering data.

Provide the temporal coverage of the data as an individual date, several dates, or a time period with start date, end date and resolution, using a standardized format like ISO 8601.

Example

An example for a human-readable definition of a dataset’s time period is given here:

2005-08-09/2005-08-30

An example for the machine-actionable time period description is given here:

a dct:PeriodOfTime ; dcat:startDate "2005-08-09"^^xsd:date ; dcat:endDate "2005-08-30"^^xsd:date .
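When a time period is provided in the ISO 8601 `start/end` notation shown in the human-readable example, start and end date are easy to mix up. The following sketch, with the hypothetical function name `parse_period`, splits such an interval into two dates using only the Python standard library and rejects periods whose end lies before their start. It assumes calendar dates in the `YYYY-MM-DD` form and does not handle durations or times of day.

```python
from datetime import date


def parse_period(interval: str) -> tuple[date, date]:
    """Parse an ISO 8601 'start/end' date interval and check its order."""
    start_text, separator, end_text = interval.strip().partition("/")
    if not separator:
        raise ValueError(f"expected 'start/end', got {interval!r}")
    # date.fromisoformat raises ValueError for malformed dates.
    start = date.fromisoformat(start_text)
    end = date.fromisoformat(end_text)
    if start > end:
        raise ValueError(f"start date {start} lies after end date {end}")
    return start, end


start, end = parse_period("2005-08-09/2005-08-30")
print(start.isoformat(), end.isoformat())
# prints: 2005-08-09 2005-08-30
```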

Best practice 10: Provide documentation

Why

A page or document with detailed information on the data helps data consumers to assess the fitness for use of the data. It can be used to provide complex information that cannot be covered in the metadata description, such as formulas, tables or figures and related text. For research data, the documentation is typically a scientific publication.

Provide a link to the documentation. Use persistent identifiers, whenever possible.

Example

An example for a human-readable definition of a dataset’s documentation is given here:

https://doi.org/10.5194/essd-14-4525-2022

An example for a machine-actionable definition of a dataset’s documentation is given here:

foaf:page <https://doi.org/10.5194/essd-14-4525-2022>

Core profile for metadata

Based on the best practices above, we synthesized a core profile for metadata. The core profile includes the following metadata:

Table 1: Core profile for data description and related patterns for the properties

| Property name | Description | Best practice |
| -------- | -------- | -------- |
| Id | Global unique identifier | 1 |
| Title | Name given to the data, human-readable | 2 |
| Description | Free-text to describe and summarise the data | 3 |
| Publication date | The date of the publication of the data | 4 |
| Contact point | Contact information | 5 |
| Type | Resource type, like dataset, article, report, map, service | 6 |
| License | Legal information about access and reuse of the data | 7 |
| Spatial coverage | Geographic region that is covered by the data | 8 |
| Temporal coverage | Temporal period that the data covers | 9 |
| Documentation | Page or document with information about the data | 10 |
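The core profile in Table 1 can be turned into a simple completeness check for metadata records. The sketch below is an illustration only: the snake_case keys and the function name `missing_core_properties` are hypothetical and merely mirror the property names of Table 1, and the record is filled with values from the running example.

```python
# Keys mirror the property names of Table 1 (hypothetical spelling).
CORE_PROFILE = (
    "id",
    "title",
    "description",
    "publication_date",
    "contact_point",
    "type",
    "license",
    "spatial_coverage",
    "temporal_coverage",
    "documentation",
)


def missing_core_properties(record: dict) -> list[str]:
    """Return the core-profile properties that are absent or empty."""
    return [name for name in CORE_PROFILE if not record.get(name)]


record = {
    "id": "https://doi.org/10.18728/igb-fred-762.1",
    "title": "Hydrography90m: A new high-resolution global hydrographic dataset",
    "publication_date": "2022-08-09",
    "type": "https://www.w3.org/TR/vocab-dcat-2/#Class:Dataset",
    "documentation": "https://doi.org/10.5194/essd-14-4525-2022",
}

print(missing_core_properties(record))
# prints: ['description', 'contact_point', 'license', 'spatial_coverage', 'temporal_coverage']
```

Such a check only tests for presence; whether a value is well formed would need property-specific checks.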

Table 2 shows mappings from the recommended core profile to the most frequently used metadata schemas.

Table 2: Core metadata profile properties and corresponding properties in the GeoDCAT, ISO19115, DataCite and schema.org schemas

| Property name | GeoDCAT | ISO19115 | DataCite | schema.org |
| -------- | -------- | -------- | -------- | -------- |
| ID | dct:identifier | See https://semiceu.github.io/GeoDCAT-AP/drafts/latest/#unique-resource-identifier---not-in-iso19115-core | Identifier with identifierType (controlled list of permitted values), e.g., ="DOI" | identifier |
| Title | dct:title | MD_Metadata > MD_DataIdentification.citation > CI_Citation.title | Title | name |
| Description | dct:description | MD_Metadata > MD_DataIdentification.abstract | Description (with descriptionType="abstract") | description |
| Publication date | dct:created | MD_Metadata > MD_DataIdentification.citation > CI_Citation.date | PublicationYear | datePublished |
| Contact point | dcat:contactPoint | MD_Metadata.contact > CI_ResponsibleParty | Contributor with contributorType="ContactPerson" | contactPoint |
| Type | dct:type (with rdf:type dcat:Dataset) | See https://semiceu.github.io/GeoDCAT-AP/drafts/latest/#resource-type---not-in-iso19115-core | resourceTypeGeneral (controlled list of permitted values according to DataCite Metadata Schema) - if value not in controlled list: ResourceType (free text) | additionalType (with rdf:type schema:Dataset) |
| Spatial coverage | dct:spatial | MD_Metadata > MD_DataIdentification.extent > EX_Extent > EX_GeographicExtent > EX_GeographicBoundingBox or EX_GeographicDescription | Geolocation | spatialCoverage |
| Temporal coverage | dct:temporal | MD_Metadata > MD_DataIdentification.extent > EX_Extent > EX_TemporalExtent or EX_VerticalExtent | Date with dateType="Collected" | temporalCoverage |

Mapping metadata profiles is often accompanied by a loss of information. Thus, we strongly recommend keeping the repository's detailed internal metadata model, depending on the target group's needs, and additionally providing the above-mentioned profile for interoperability with other services, e.g., the NFDI4Earth Knowledge Hub.
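As an illustration of such a mapping, the following sketch renames the properties of an internal record to the schema.org names listed in Table 2 and serialises the result as JSON-LD. The internal keys and the function name `to_schema_org` are hypothetical. Values are copied as plain strings for brevity, whereas a complete schema.org record would typically use nested objects, e.g., for the contact point or the spatial coverage. Properties without a mapping are dropped, which shows the loss of information mentioned above.

```python
import json

# Internal key (hypothetical) -> schema.org property as listed in Table 2.
SCHEMA_ORG_MAPPING = {
    "id": "identifier",
    "title": "name",
    "description": "description",
    "publication_date": "datePublished",
    "contact_point": "contactPoint",
    "spatial_coverage": "spatialCoverage",
    "temporal_coverage": "temporalCoverage",
}


def to_schema_org(record: dict) -> dict:
    """Map an internal record to a flat schema.org Dataset description."""
    result = {"@context": "https://schema.org/", "@type": "Dataset"}
    for internal_key, target in SCHEMA_ORG_MAPPING.items():
        if internal_key in record:
            result[target] = record[internal_key]
    return result


internal_record = {
    "id": "https://doi.org/10.18728/igb-fred-762.1",
    "title": "Hydrography90m: A new high-resolution global hydrographic dataset",
    "publication_date": "2022-08-09",
    "internal_note": "kept in the repository, not exported",
}

print(json.dumps(to_schema_org(internal_record), indent=2))
```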

Use cases for specific metadata in the Earth System Sciences

With the update of this document, we intend to provide further profiles for specific data. In a first step, we collect data-driven use cases of Earth System Sciences that help us to identify metadata requirements from the data-oriented perspective.

Use case 1: Providing metadata for underway measurements of sea water temperature data harvested from PANGAEA into the Marine and Earth Data Portals

The Marine Data and Earth Data portals provide access and search functionalities to harvested research-specific metadata. The harvesting approach offers flexibility to add custom keywords and relations to metadata entries. These properties are also included in the PANGAEA data model and are very common in marine research. The Marine Data and Earth Data portals use these to offer data exploration by expedition and platform.

As an example, we refer to a dataset of temperature underway measurements taken during a Polarstern expedition, and embedded in the German Marine Research Alliance. The dataset is published in PANGAEA.


Hoppmann, Mario; Tippenhauer, Sandra; Hanfland, Claudia (2023): Continuous thermosalinograph oceanography along RV POLARSTERN cruise track PS130/2. Alfred Wegener Institute, Helmholtz Centre for Polar and Marine Research, Bremerhaven, PANGAEA, https://doi.org/10.1594/PANGAEA.955760


PANGAEA offers an XML representation of its original metadata as well as serializations to several metadata schemas (e.g., JSON-LD, DataCite XML, ISO19139). When the original metadata is harvested for the Marine Data and Earth Data portals, the metadata is annotated, e.g., with region names based on the geographical extent of the dataset.

The following properties get harvested from the PANGAEA OAI-PMH interface using the original metadata scheme. The portals use these properties for free text search or as a filter to enhance the possibilities of data exploration. The examples illustrate how the metadata is structured in the PANGAEA metadata XML.

Platform

A platform can represent, for instance, a research vessel, a vehicle, a large device, or a station. The platform name can be used in the portal, for example, to filter all datasets related to the research vessel Polarstern.

Example
<md:basis id="event3016850.basis1">
  <md:name>**Polarstern**</md:name>
  <md:URI>https://doi.org/10.17815/jlsrf-3-163</md:URI> 
  <md:callSign>DBLK</md:callSign> 
  <md:IMOnumber>8013132</md:IMOnumber>
</md:basis>

Device

Devices can be specific sensors or instruments used to measure a dataset. PANGAEA also allows linking to the NERC standard vocabulary, for example.

Example
<md:method id="event3016850.method85">
  <md:name>**Thermosalinograph**</md:name>
  <md:optionalName>TSG</md:optionalName> 
  <md:term id="event3016850.method85.term67626" 
      terminologyId="3" 
      terminologyLabel="PAN-M&amp;D">
    <md:name>CTD &amp; XBT</md:name>
  </md:term> 
  <md:term id="event3016850.method85.term2299591" 
      semanticURI="SDN:L05::133" 
      terminologyId="21" 
      terminologyLabel="NERC-L05">
    <md:name>thermosalinographs</md:name>
    <md:URI>**http://vocab.nerc.ac.uk/collection/L05/current/133/**</md:URI>
  </md:term>
</md:method>

Expedition

The research expedition or campaign on which the dataset was measured is also harvested and can be used as a filter. The portal's expedition list is linked to the data search and can be used to explore which datasets were collected on a particular expedition.

Example
<md:campaign id="event3016850.campaign48085">
  <md:name>**PS130/2**</md:name>
  <md:URI>https://doi.org/10.57738/BzPM_0765_2022</md:URI>
  <md:chiefScientist>Hanfland, Claudia</md:chiefScientist> 
  <md:start>2022-05-22</md:start> 
  <md:end>2022-05-29</md:end>
  <md:attribute attid="2053" name="Start location">
    Las Palmas
  </md:attribute>
  <md:attribute attid="2054" name="End location">
    Bremerhaven
  </md:attribute>
</md:campaign>

Event

Data published on PANGAEA typically refers to sampling events whose metadata include spatial and temporal information. Event names are unique and usually use the expedition name as a prefix. Links to additional metadata like sensor information systems are possible.

Example
<md:event id="event3016850">
  <md:label>**PS130/2_0_Underway-37**</md:label>
  <md:latitude>46.072842</md:latitude> 
  <md:longitude>-7.487635</md:longitude> 
  <md:elevation>-4791.6</md:elevation>
  <md:dateTime>2022-05-26T17:38:39</md:dateTime>
  <md:latitude2>51.202885</md:latitude2> 
  <md:longitude2>1.797622</md:longitude2>
  <md:elevation2>-29.5</md:elevation2> 
  <md:dateTime2>2022-05-28T06:35:17</md:dateTime2>
  <md:attribute attid="51418" name="Sensor URI">
      https://sensor.awi.de/?id=87</md:attribute>
  <md:campaign .../> 
  <md:basis .../>
  <md:method .../>
</md:event>

Parameter

The list of measured parameters in a dataset contains references to standard vocabularies and is also part of the PANGAEA metadata scheme. The portal can use these to filter for datasets where a particular parameter, e.g. water temperature, has been measured.

Example
<md:matrixColumn col="4" format="##0.000" id="col4.ds14920107"
    source="data" type="numeric"> 
  <md:parameter id="col4.ds14920107.param717">
    <md:name>**Temperature, water**</md:name>
    <md:shortName>Temp</md:shortName>
    <md:unit>°C</md:unit> 
    <md:term endOffset="11" fragment="1" 
        id="col4.ds14920107.param717.term43972" 
        semanticURI="http://qudt.org/1.1/vocab/quantity#ThermodynamicTemperature"
        startOffset="0" terminologyId="13" 
        terminologyLabel="PAN-Quantity">
      <md:name>Temperature</md:name>
      <md:optionalName>Θ</md:optionalName>
      <md:URI>http://dbpedia.org/resource/Temperature</md:URI>
    </md:term>
    <md:term endOffset="18" fragment="2" 
        id="col4.ds14920107.param717.term2606205" 
        semanticURI="http://purl.obolibrary.org/obo/CHEBI_15377" 
        startOffset="13" terminologyId="16" 
        terminologyLabel="ChEBI"> 
      <md:name>water</md:name>
      <md:URI>http://purl.obolibrary.org/obo/CHEBI_15377</md:URI>
    </md:term>
  </md:parameter>
  <md:method id="col4.ds14920107.method13178">
    <md:name>**Digital oceanographic thermometer, Sea-Bird, SBE 38**</md:name>
    <md:URI>https://www.seabird.com/...</md:URI>
    <md:term id="col4.ds14920107.method13178.term2300638" 
        semanticURI="SDN:L22::TOOL0191" terminologyId="21" 
        terminologyLabel="NERC-L05">
      <md:name>Sea-Bird SBE 38 thermometer</md:name>
      <md:URI>http://vocab.nerc.ac.uk/collection/L22/current/TOOL0191/</md:URI>
    </md:term>
  </md:method>
  <md:PI id="col4.ds14920107.pi38865">
    <md:lastName>Hoppmann</md:lastName> 
    <md:firstName>Mario</md:firstName>
    <md:eMail>mario.hoppmann@awi.de</md:eMail>
    <md:URI>http://www.awi.de/...</md:URI> 
    <md:orcid>0000-0003-1294-9531</md:orcid>
  </md:PI>
  <md:caption>Temp [°C]</md:caption>
</md:matrixColumn>

References

A metadata record can contain any number of URIs pointing to related records, publications, and download links or web services.

Example
<md:reference dataciteRelType="References" 
    group="210" id="ref115713" 
    relationType="Related to" 
    relationTypeId="12" 
    typeId="ref2" typeName="report">
  <md:author id="ref115713.author59361">
    <md:lastName>Dreutter</md:lastName> 
    <md:firstName>Simon</md:firstName>
    <md:eMail>simon.dreutter@awi.de</md:eMail> 
    <md:orcid>0000-0002-0878-0780</md:orcid>
  </md:author> 
  <md:author id="ref115713.author21851">
    <md:lastName>Hanfland</md:lastName> 
    <md:firstName>Claudia</md:firstName>
    <md:eMail>claudia.hanfland@awi.de</md:eMail>
  </md:author>
  <md:year>2022</md:year> 
  <md:title>The Expeditions PS130/1 and PS130/2 of the Research Vessel POLARSTERN to the 
      Atlantic Ocean in 2022</md:title>
  <md:source id="ref115713.journal16572" relatedTermIds="33964,33974" 
      type="journal">Berichte zur Polar- und Meeresforschung = Reports on Polar and 
      Marine Research</md:source>
  <md:volume>765</md:volume>
  <md:URI>https://doi.org/10.57738/BzPM_0765_2022</md:URI>
  <md:pages>49 pp</md:pages>
</md:reference>

The Marine Data and Earth Data portals retrieve only the values printed in bold in the metadata examples, but the names can be used to reference the internal expedition, event, or sensor database. The portal's harvesting infrastructure maps the values to an internal metadata schema for search functionalities.
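How a harvester might pick the highlighted values out of such a record can be sketched with the Python standard library. This is not the portals' actual implementation: the namespace URI bound to the `md` prefix is a placeholder, because the excerpts above do not show the real one, and the sample is a shortened version of the event example with the campaign, basis and method elements nested inside the event. The function name `extract_facets` is hypothetical as well.

```python
import xml.etree.ElementTree as ET

# Placeholder namespace URI: the real URI of the "md" prefix is not shown
# in the excerpts and has to be taken from the harvested documents.
NS = {"md": "http://example.org/pangaea-md"}

SAMPLE = """\
<md:event xmlns:md="http://example.org/pangaea-md" id="event3016850">
  <md:label>PS130/2_0_Underway-37</md:label>
  <md:campaign id="event3016850.campaign48085">
    <md:name>PS130/2</md:name>
  </md:campaign>
  <md:basis id="event3016850.basis1">
    <md:name>Polarstern</md:name>
  </md:basis>
  <md:method id="event3016850.method85">
    <md:name>Thermosalinograph</md:name>
  </md:method>
</md:event>
"""

# Facet name used by the (hypothetical) portal -> path below the event.
FACET_PATHS = {
    "event": "md:label",
    "expedition": "md:campaign/md:name",
    "platform": "md:basis/md:name",
    "device": "md:method/md:name",
}


def extract_facets(xml_text: str) -> dict:
    """Extract the values used for filtering from one event element."""
    event = ET.fromstring(xml_text)
    facets = {}
    for facet, path in FACET_PATHS.items():
        text = event.findtext(path, default=None, namespaces=NS)
        facets[facet] = text.strip() if text else None
    return facets


print(extract_facets(SAMPLE))
```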

Use case 2: Metadata ingestion from self-describing data formats

A fundamental source of information in modern Earth System Sciences comes from large modelling and remote sensing systems. Data from such sources is usually stored in tailored and widely established raster-data formats. In most cases, these tailored data formats are self-describing, which means that they contain, besides the data values, dedicated metadata elements for describing the data structure, type and meaning.

In the past, different communities have used different data formats with different granularities of metadata and community-specific features. As an example, the atmospheric modelling community often applied the so-called GRIB (GRIdded Binary) format that was also proposed by the World Meteorological Organization (WMO).

Remote sensing data was (and still is) often provided in the Hierarchical Data Format (HDF). All these formats store binary data, allow for providing metadata, and can be accessed and used with specific APIs in almost every major programming language and environment. An overview of the various formats can be found, e.g., in Di & Yu (2023).

In recent years, however, various communities have adopted NetCDF (Network Common Data Form) as the quasi-standard for storing and distributing raster data. A major part of this is due to the highly active community around the so-called Climate and Forecast (CF) conventions, which aim at harmonising NetCDF data by providing guidelines for the structure of the data as well as defining different mandatory and recommended attributes. While the CF-Conventions were designed specifically for NetCDF data, they can easily be applied to other self-describing data formats like HDF or even modern cloud-optimised formats like Zarr.

The reason is that most self-describing data formats share similar structures and concepts: they have a header that contains all metadata (i.e., dimensions, attributes, etc.) and a data part that contains the data itself. Metadata is often split into global attributes and variable attributes. Please note that we do not cover the group hierarchies of, e.g., HDF5 or NetCDF4 data here, as this would go beyond the scope of this paper.

While global attributes contain overarching information about the dataset, the variable attributes describe the individual variables, e.g., units or standard names that come from a controlled vocabulary. The ultimate goal of all these attributes and conventions is to describe all important aspects of a particular dataset in a comprehensive, transparent and standardised way to allow for straightforward application and re-use. It further enables harvesting metadata from the data itself, which substantially simplifies the integration into higher-level research data and data catalogue infrastructures and would be a huge step towards fulfilling the FAIR principles.

Here, we propose a slightly enhanced profile of the CF-Conventions as the fundamental and generic core of any self-describing dataset, ensuring a minimal set of metadata information that provides most elements of our core profile (see Table 1).

The minimal core of the CF-Conventions is rather straightforward. They simply propose a set of global attributes, some mandatory attributes for the data variables, standard names for dimension variables and some guidelines on how to structure the data arrays (i.e., the ordering of dimensions). Beyond this minimal standard, the CF-Conventions also provide guidelines for complex and derived data variables, map projections and flagging schemes. A detailed overview of all mandatory and optional attributes would be beyond the scope of this document so the reader is advised to visit the official releases from the CF-Conventions.

Example - CF-1.10 compliant dataset

netcdf example {
dimensions:
    lon = 140 ;
    lat = 200 ;
    time = 215 ;
variables:
    float lon(lon) ;
        lon:units = "degrees_east" ;
        lon:standard_name = "longitude" ; 
        lon:long_name = "longitude" ;
    float lat(lat) ;
        lat:units = "degrees_north" ;
        lat:standard_name = "latitude" ;
        lat:long_name = "latitude" ;
    float time(time) ;
        time:units = "days since 1980-01-01 00:00:00" ;
        time:standard_name = "time" ; 
        time:long_name = "time" ;
    short tas(time, lat, lon) ;
        tas:long_name = "daily_average_temperature_at_2m" ;
        tas:standard_name = "air_temperature" ;
        tas:units = "K" ; 
        tas:_FillValue = -9999s ;
// global attributes:
        :title = "Temperature forecasts" ;
        :Conventions = "CF-1.10" ;
        :references = "DOI, URL, etc." ;
        :institution = "Some Institution" ;
        :source = "Atmospheric model" ; 
        :comment = "Global data truncated to study domain" ;
        :history = "2020-05-19 08:37:03: File created." ;
}

In our example, we present the header of a dataset that is consistent with the current release of the CF-Conventions (here: 1.10). The Conventions in use must be referenced with the keyword Conventions. Other mandatory global attributes are title, references, institution, source, comment and history. While most of these are self-explanatory, history serves as an audit trail for modifications to the original data. Several tools already support this attribute; when working with these Conventions, it is hence strongly advised to use the history attribute to ensure traceability of all processing steps. Particularly for data and dimension variables, the CF-Conventions make heavy use of the attribute standard_name. Its value usually comes from CF's own controlled vocabulary, the CF Standard Name Table, and should be used whenever possible. While the standard_name sometimes provides a rather general variable name (e.g., air_temperature), more specific descriptions can be added to the long_name attribute (e.g., daily_average_temperature_at_2m). Further highly recommended attributes of any data and dimension variable are units (as a UDUNITS string) and _FillValue, which defines the value used for missing data in a dataset.
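The attribute requirements described above can be checked mechanically. The following is a minimal sketch in Python, assuming the header has already been read into plain dictionaries; the dictionary layout and the function names are assumptions made for this illustration and are not part of the CF-Conventions or of any NetCDF library.

```python
# Illustrative sketch: dictionary layout and function names are assumptions
# for this example, not part of the CF-Conventions or any NetCDF library.

# Global attributes listed in the text above.
GLOBAL_ATTRIBUTES = ("Conventions", "title", "references", "institution",
                     "source", "comment", "history")
# Attributes the text recommends for data and dimension variables.
VARIABLE_ATTRIBUTES = ("standard_name", "long_name", "units")


def missing_global_attributes(global_attrs):
    """Return the listed global attributes that are absent or empty."""
    return [name for name in GLOBAL_ATTRIBUTES
            if not str(global_attrs.get(name, "")).strip()]


def missing_variable_attributes(variables):
    """Map each variable name to the recommended attributes it lacks."""
    report = {}
    for var_name, attrs in variables.items():
        missing = [name for name in VARIABLE_ATTRIBUTES if name not in attrs]
        if missing:
            report[var_name] = missing
    return report


# Header of the example dataset, transcribed by hand.
example_globals = {
    "title": "Temperature forecasts",
    "Conventions": "CF-1.10",
    "references": "DOI, URL, etc.",
    "institution": "Some Institution",
    "source": "Atmospheric model",
    "comment": "Global data truncated to study domain",
    "history": "2020-05-19 08:37:03: File created.",
}
example_variables = {
    "lon": {"units": "degrees_east", "standard_name": "longitude",
            "long_name": "longitude"},
    "lat": {"units": "degrees_north", "standard_name": "latitude",
            "long_name": "latitude"},
    "time": {"units": "days since 1980-01-01 00:00:00",
             "standard_name": "time", "long_name": "time"},
    "tas": {"long_name": "daily_average_temperature_at_2m",
            "standard_name": "air_temperature", "units": "K",
            "_FillValue": -9999},
}

print(missing_global_attributes(example_globals))
print(missing_variable_attributes(example_variables))
```

For the example header, both checks come back empty. Removing an attribute such as history from the dictionary makes it show up in the first result.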

However, a look through the global attributes of the CF-Conventions shows that several crucial pieces of information are missing. For example, there is no dedicated field for a point of contact or the author of the dataset. Thus, with the current Conventions alone, we cannot fulfil our metadata core profile (Table 1). As several communities have recognised this, there have been multiple attempts to enhance the core profile with both generic and more domain-specific information. The interinstitutional Earth Science Information Partners (ESIP) have compiled the Attribute Convention for Data Discovery (ACDD), version 1.3, which contains a more generic list of mandatory, recommended and suggested attributes. Another enhanced profile, originating from the atmospheric modelling community, is the ATMODAT-Standard, which is tailored towards the organisation and description of model-based raster data.

In Table 3, we hence propose a simple and straightforward enhancement of the CF-Conventions with additional attributes from the ESIP- and ATMODAT-Standards in order to ensure consistency with our synthesised core profile (Table 1).

Table 3: Mapping between CF- and similar Conventions onto our Core Profile.

| Property Name | CF-1.10 | Suggested extension | Attribute Provenance |
|---|---|---|---|
| Id | n/a | id | ESIP |
| Title | title | | CF-1.10 |
| Description | n/a | summary | ESIP, ATMODAT |
| Contact point | n/a | contact | ATMODAT |
| License | n/a | license | ATMODAT, ESIP |
| Spatial coverage | n/a | geospatial_lon_min, geospatial_lat_min, geospatial_lon_max, geospatial_lat_max | ESIP |
| Temporal coverage | n/a | time_coverage_start, time_coverage_end | ESIP |
| Documentation | references | | CF-1.10 |

We have excluded the properties Publication date and Type from this list. The Publication date usually refers to the date when the data is published, e.g., through a repository, and is therefore not known when the file is written. As this use case focuses on self-describing raster data formats, Type can simply be set to dataset or to the respective link to a controlled vocabulary.

Thus, when taking the examples from our best practices, the global attributes from a self-describing dataset could be defined as follows:

Example - Global Attributes of a self-describing dataset consistent with our core profile

{
// global attributes:
    :id = "https://doi.org/10.18728/igb-fred-762.1";
    :title = "Hydrography90m: A new high-resolution global hydrographic dataset";
    :summary = "A global high-resolution (90m) hydrographical network that delineates 
        headwater stream channels in great detail. Raster and vector data available 
        at https://hydrography.org/";
    :contact = "Giuseppe Amatulli, https://orcid.org/0000-0002-9651-9602";
    :license = "CC BY-NC 4.0, https://creativecommons.org/licenses/by-nc/4.0/";
    :geospatial_lon_min = "-180";
    :geospatial_lat_min = "-60";
    :geospatial_lon_max = "191";
    :geospatial_lat_max = "85";
    :time_coverage_start = "2005-08-09 00:00:00";
    :time_coverage_end = "2005-08-30 00:00:00";
    :references = "https://doi.org/10.5194/essd-14-4525-2022";
    :type = "https://www.w3.org/TR/vocab-dcat-2/#Class:Dataset";
    :Conventions = "CF-1.10, ACDD-1.3, ATMODAT-2.4";
    :institution = "Hydrography.org, https://hydrography.org/";
    :source = "MERIT Hydro digital elevation model";
    :comment = "Data processed and compiled, etc.";
    :history = "2020-05-19 00:00:00: File created.";
}

It should be noted, however, that this is only the bare minimum of attributes that any dataset in a self-describing format needs in order to be consistent with our core profile. Information that is crucial for interoperability in particular (e.g., map projections, missing values, etc.) is not covered by this minimal set, so we strongly suggest integrating it into the dataset as well.

Nevertheless, providing this minimal set of global attributes has the great benefit that we can read all required information for our core metadata profile from the data itself. There is no need to provide and maintain additional metadata, e.g., via separate metadata catalogues. Furthermore, various tools are already able to provide this information directly in a machine-readable and standardised format. As an example, the ncISO-tools can generate ISO 19115-compliant metadata directly from NetCDF files. For a more generic approach, we can also use the so-called NetCDF Markup Language (NcML) to derive XML representations of NetCDF metadata.
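To illustrate how the core profile could be read from the data itself, the mapping in Table 3 can be written as a small lookup. The following Python sketch assumes the global attributes are already available as a dictionary; the function name and the structure of the returned values are assumptions for this illustration only.

```python
# Illustrative sketch: the mapping follows Table 3; the function name and
# return structure are assumptions made for this example.

CORE_PROFILE_MAPPING = {
    "Id": ("id",),
    "Title": ("title",),
    "Description": ("summary",),
    "Contact point": ("contact",),
    "License": ("license",),
    "Spatial coverage": ("geospatial_lon_min", "geospatial_lat_min",
                         "geospatial_lon_max", "geospatial_lat_max"),
    "Temporal coverage": ("time_coverage_start", "time_coverage_end"),
    "Documentation": ("references",),
}


def extract_core_profile(global_attrs):
    """Return (profile, missing): the core profile properties that could be
    filled from the global attributes, and the attribute names not found."""
    profile = {}
    missing = []
    for prop, attr_names in CORE_PROFILE_MAPPING.items():
        absent = [name for name in attr_names if name not in global_attrs]
        if absent:
            missing.extend(absent)
            continue
        if len(attr_names) == 1:
            # Single attribute: store its value directly.
            profile[prop] = global_attrs[attr_names[0]]
        else:
            # Several attributes: keep them together under one property.
            profile[prop] = {name: global_attrs[name] for name in attr_names}
    return profile, missing


# Global attributes of the example dataset, transcribed by hand.
example_attrs = {
    "id": "https://doi.org/10.18728/igb-fred-762.1",
    "title": "Hydrography90m: A new high-resolution global hydrographic dataset",
    "summary": "A global high-resolution (90m) hydrographical network.",
    "contact": "Giuseppe Amatulli, https://orcid.org/0000-0002-9651-9602",
    "license": "CC BY-NC 4.0, https://creativecommons.org/licenses/by-nc/4.0/",
    "geospatial_lon_min": "-180",
    "geospatial_lat_min": "-60",
    "geospatial_lon_max": "191",
    "geospatial_lat_max": "85",
    "time_coverage_start": "2005-08-09 00:00:00",
    "time_coverage_end": "2005-08-30 00:00:00",
    "references": "https://doi.org/10.5194/essd-14-4525-2022",
}

profile, missing = extract_core_profile(example_attrs)
for prop, value in profile.items():
    print(prop, "->", value)
print("missing attributes:", missing)
```

With a real file, such a dictionary could be filled from a NetCDF library, for example via `Dataset.ncattrs()` and `Dataset.getncattr()` of the netCDF4 Python package. This step is left out here so that the sketch runs without external dependencies.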

Overall, we strongly suggest following these simple guidelines to substantially improve the FAIRness of data provided in self-describing formats in the Earth System Sciences, raster data in particular.

Define A License For The Metadata

Repository providers should facilitate the reuse of metadata for different use cases, e.g., harvesting and enriching with additional information, by providing license information for the collected metadata. We recommend providing all metadata under Creative Commons Zero v1.0 Universal (CC0 1.0).

The following websites provide detailed information about the different license options and how to provide the license in a machine-readable format:

- Machine-readable licences: https://spdx.org/licenses/
- Overview of Creative Commons (CC) license versions: https://creativecommons.org/choose/?lang=de
- Wizard to choose a CC licence: https://chooser-beta.creativecommons.org/

Provide An Interface Using A Standardised Web-Based Protocol

Portals like the Earth Data Portal and NFDI4Earth Knowledge Graphs harvest metadata content to offer users central access to, and search capabilities for, research-relevant information. Repositories that follow standardized interfaces, metadata formats and best practices can be easily integrated into existing and upcoming infrastructures.

We recommend using the well-known and widely used non-disciplinary Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH) or ESS-specific technologies such as the Catalogue Service for the Web (OGC CSW). Both OAI-PMH and OGC CSW require an XML serialization of the metadata according to a specific schema (also called a profile in this context). The minimum requirement of the protocols is support for serialization to Dublin Core. However, we recommend implementing serialization to more expressive metadata schemas such as ISO 19115 or GeoDCAT. Alternatively, a SpatioTemporal Asset Catalog (STAC) can be used to provide metadata in JSON format.
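As a rough illustration of what harvesting via OAI-PMH involves, the following Python sketch builds a ListRecords request for the Dublin Core format (metadata prefix oai_dc) and extracts the titles from a response. The base URL, the function names and the shortened sample response are assumptions for this example; no request is actually sent.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

# XML namespace of the Dublin Core elements used inside oai_dc records.
DC_NS = "http://purl.org/dc/elements/1.1/"


def build_list_records_url(base_url, metadata_prefix="oai_dc", set_spec=None):
    """Build an OAI-PMH ListRecords request URL (the request is not sent)."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if set_spec:
        params["set"] = set_spec
    return base_url + "?" + urlencode(params)


def extract_titles(oai_xml):
    """Collect all Dublin Core titles from an OAI-PMH response document."""
    root = ET.fromstring(oai_xml)
    return [element.text for element in root.iter("{%s}title" % DC_NS)]


# Hypothetical repository endpoint, used only to show the URL structure.
url = build_list_records_url("https://repository.example.org/oai")
print(url)

# Shortened, hand-written sample of a ListRecords response.
sample_response = """
<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header><identifier>oai:repository.example.org:1</identifier></header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>Temperature forecasts</dc:title>
        </oai_dc:dc>
      </metadata>
    </record>
  </ListRecords>
</OAI-PMH>
"""
print(extract_titles(sample_response))
```

A repository that additionally offers a more expressive schema would be harvested in the same way, with a different value for metadataPrefix. The prefixes a given repository supports can be queried with the ListMetadataFormats verb.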

The most prominent open-source ESS software solutions for setting up metadata catalogues with standardized interfaces such as CSW include OSGeo GeoNode, GeoNetwork and pycsw.

Conclusion

Metadata serves as the cornerstone for the findability and interoperability of data. This document emphasises the vital role of good metadata in ESS and provides actionable metadata recommendations to enhance findability and interoperability. With this recommendation paper, we envision that repository providers will 1) provide the described core profile properties, 2) support data providers in filling these fields with meaningful information, 3) decide on a metadata license and 4) provide the metadata via a suitable interface.

As a first step, we described use cases for specific metadata. We envision adding more use cases in a community-driven and iterative process and encourage the ESS community to give feedback, in particular by providing additional use cases.

By following the initial recommendations and drawing insights from successful implementations like the Helmholtz Data Hubs and NFDI4Earth Knowledge Hub, the ESS community can work collaboratively towards advancing research, decision-making, and innovation.

References