Introduction
Metadata plays a crucial role in making data FAIR (findable, accessible, interoperable, and reusable). In Earth System Science (ESS), data are diverse. They range, for example, from sensor data from long-term observatories and climate change expeditions to topographic base map data. The diversity of data sources and types in ESS requires attention to metadata, as this is the only way to ensure that the data can be effectively understood and used by the intended target group. While custom metadata schemas might be developed to address specific requirements, there is a strong need for harmonisation to enhance the overall usability and interoperability of the data. Striking the proper balance between detailed domain-specific metadata and harmonised, more general-purpose metadata is a challenge that requires careful consideration. Often it is beneficial to have both: high-level (or abstract) standardised metadata (schemas) that in general facilitate findability and interoperability, as well as additional, more detailed and sophisticated metadata that is tailored to specific (sub-disciplinary) requirements. In addressing these challenges, we benefit from a number of well-established international metadata standards, e.g., from the geospatial domain. Collaboration among repository providers, data providers, researchers as data consumers, and standards organisations is key to the successful development and implementation of harmonised metadata schemas. By pooling expertise and aligning goals, the ESS community can strengthen the development, harmonisation, mapping, update and usage of metadata standards that strike a balance between findability, interoperability, and the preservation of (sub-)discipline-specific information. As the initial step of a community-driven process towards the harmonisation of metadata, this white paper is intended to provide high-level recommendations for the provision of ESS metadata with particular respect to findability and interoperability.
We will update it regularly and iteratively add recommendations for specific ESS use cases. This article primarily addresses data repository providers, but also data providers and developers of metadata-related tools. Harmonisation of metadata provision may happen at the syntactic level (i.e., use of a common data format, e.g., JSON), at the semantic level (i.e., use of similar vocabularies to describe the data), or at the level of community practices (e.g., which metadata elements are always documented). The scope of this document is on community practices. Furthermore, we provide links to relevant vocabularies, where applicable. This article is intended to 1) raise awareness of the importance of well-maintained metadata as a minimum viable product [1] for findability and 2) guide on providing metadata for the interoperability with existing infrastructures/services by taking the Helmholtz Data Hubs and the NFDI4Earth Knowledge Hub as examples. The ultimate goal is to provide a set of practices that will facilitate the interaction between publishers and consumers of data in the Earth System Sciences.
[1]: A minimum viable product (MVP) is considered as a version of a product, here metadata, with just enough features to be usable, here for findability tasks. It can then be improved based on specific usage needs in future iterations.
Call for review
This article is intended to be updated regularly and published as a new version on Zenodo. Moreover, you are invited to actively participate and comment on this version by sending your feedback to Christin Henzen (christin.henzen@tu-dresden.de).
Citation
Bernard, L., Degbelo, A., Grieb, J., Henzen, C., Heß, R., Klammer, R., Koppe, R., Lorenz, C., Müller, C., & Weiland, C. (2024). Recommendations for Earth System Sciences Metadata Provision. Zenodo. https://doi.org/10.5281/zenodo.10604587
Recommendations
In the rapidly evolving ESS discipline, the proper organisation and description of data have become essential to foster collaboration, advance research, and enable evidence-based decision-making. Without clear and comprehensive metadata, there is a risk that valuable data sets will be lost. Well-structured metadata enhances data visibility and accessibility, facilitating efficient discovery for researchers, practitioners, and decision-makers. The purpose of this article is to give guidance to repository providers. In the following we describe a 4-step approach with our recommendations, which in summary involve:
- Implement a metadata profile or provide a mapping to a metadata profile for harvesting
- Structure the metadata content and assure the quality
- Define a license for the metadata
- Provide an interface using a standardised web-based protocol
Implement a metadata profile or provide a mapping to a metadata profile for harvesting
The Earth System Sciences community has developed several metadata schemas and standards to describe various types of data and promote interoperability. Some institutions or initiatives therefore provide human-readable or machine-actionable comprehensive lists of those existing schemas and standards, e.g., the International Organization for Standardization (ISO) Ontology representations of geographic technology standards, the Open Geospatial Consortium (OGC) standards overview and the schema repository, the FAIRsharing platform, or the Helmholtz Metadata Collaboration (HMC) Compilation of Metadata Standards. This section contains the best practices to implement a core metadata profile or provide a mapping. With the upcoming feedback loops and iterations of this document, we intend to provide more best practices and profiles for specific data domains. All descriptions of best practices follow a simplified version of the W3C Best Practice Template. The specific examples for each best practice are meant to support the second step of our suggested approach, that is, to structure metadata. As a running example, we use the dataset Hydrography90m: A new high-resolution global hydrographic dataset available at [@amatulli_2022_08_09] and described in [@DOI:10.5194-essd-14-4525-2022], which is reused in the NFDI4Earth pilot GeoFRESH, in the follow-up NFDI4Earth pilot project, and an NFDI4Biodiversity follow-up project hydrographr.
Best practice 1: Provide unique and persistent identifiers
Why
The provision of metadata is a fundamental requirement for the publication of data on the Internet, as researchers wish to permanently cite or reference data from publications or other data. Developers, for instance, use these identifiers in their code, and persistent identifiers spare them from updating the references manually.
Recommended approach
We recommend providing unique and persistent identifiers for metadata and data, implemented as a DOI or a handle. By providing the identifier as a URL, you allow users, e.g., researchers, to navigate quickly to the cited metadata and/or data.
Example
An example for a human-readable (interactive) description of a dataset DOI is given here:
https://doi.org/10.18728/igb-fred-762.1
An example for the machine-actionable description of a dataset DOI is given here:
dct:identifier "https://doi.org/10.18728/igb-fred-762.1" ;
Several existing recommendations and conventions provide more examples of how to implement unique and persistent identifiers, see for instance the GDI-DE Metadata Conventions, W3C Data on the Web Best Practices and GO FAIR recommendations on implementing the FAIR principles.
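The recommendation to expose identifiers as URLs can be sketched in a few lines of Python. This is a minimal illustration, not part of the profile: the helper name `doi_to_url` and the loose pattern `DOI_PATTERN` are assumptions made for this example, and the pattern is a plausibility check rather than an official DOI grammar.

```python
import re

# Loose plausibility check for a DOI (assumption, not an official grammar):
# a "10." prefix, a registrant code, a slash, and a non-empty suffix.
DOI_PATTERN = re.compile(r"^10\.\d{4,9}/\S+$")

# Common ways a DOI is written before it is normalised to a URL.
KNOWN_PREFIXES = (
    "https://doi.org/",
    "http://doi.org/",
    "https://dx.doi.org/",
    "http://dx.doi.org/",
    "doi:",
)


def doi_to_url(value: str) -> str:
    """Return the https://doi.org/ URL form of a DOI (hypothetical helper)."""
    doi = value.strip()
    for prefix in KNOWN_PREFIXES:
        if doi.lower().startswith(prefix):
            doi = doi[len(prefix):]
            break
    if not DOI_PATTERN.match(doi):
        raise ValueError(f"not a DOI: {value!r}")
    return f"https://doi.org/{doi}"


if __name__ == "__main__":
    # The DOI of the running example, written in the "doi:" form.
    print(doi_to_url("doi:10.18728/igb-fred-762.1"))
```

Storing only the URL form in the metadata keeps the identifier directly resolvable for both users and harvesters.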
Best practice 2: Provide a concise descriptive data title
Why
Explicitly providing a concise title of the data facilitates discovery on the Web, in particular in search portals. In most search portals, data titles are indexed and used to implement the search. Moreover, the data title is typically displayed in the result list. A descriptive title therefore supports researchers in selecting data.
Recommended approach
Provide a concise descriptive data title including spatial and temporal information, if applicable.
Example
An example for a human-readable title of a dataset without temporal information is given here:
Hydrography90m: A new high-resolution global hydrographic dataset
An example for the machine-actionable description of a dataset title is given here:
dct:title "Hydrography90m: A new high-resolution global hydrographic dataset"@en ;
GDI-DE provides recommendations for actions with examples for the description of governmental datasets.
Best practice 3: Provide an illustrative free-text description
Why
When publishing data on the Web, data publishers and data consumers are often unknown to each other. It is thus essential to provide information that helps researchers and machines to better understand the data.
Recommended approach
Provide a free text description of the dataset in English and use disciplinary conventions whenever applicable. The climate and forecast conventions provide one example for disciplinary conventions: http://cfconventions.org/
Example
An example for a human-readable (interactive) dataset description is given here:
A global high-resolution (90m) hydrographical network that delineates headwater stream channels in great detail. Raster and vector data available at https://hydrography.org/
An example for the machine-actionable dataset description is given here:
dct:description "A global high-resolution (90m) hydrographical network that delineates headwater stream channels in great detail. Raster and vector data available at https://hydrography.org/"@en ;
Best practice 4: Provide the publication date
Why
Descriptive metadata include the publication date of the data as this supports researchers in finding up-to-date data and evaluating the up-to-dateness of data. In some search portals, the publication date can be used to sort and filter search results.
Recommended approach
Use a standardized format, like ISO 8601, to provide the publication date as this allows for machine- and human-readable encoding.
Example
An example for a human-readable description of publication date is given here:
2022-08-09
An example for the machine-actionable description of a publication date is given here:
dct:created "2022-08-09"^^xsd:date ;
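A short sketch of why a standardized date encoding helps machines: an ISO 8601 calendar date can be parsed with the Python standard library alone and then used for sorting and filtering. The function name `parse_publication_date` and the sample record list are assumptions made for this illustration.

```python
from datetime import date


def parse_publication_date(value: str) -> date:
    """Parse an ISO 8601 calendar date (YYYY-MM-DD); raises ValueError otherwise."""
    return date.fromisoformat(value)


if __name__ == "__main__":
    # Hypothetical records; only the first date comes from the running example.
    records = [
        {"title": "Hydrography90m", "published": "2022-08-09"},
        {"title": "Some older dataset", "published": "2019-01-15"},
    ]
    # Sort newest first, as a search portal might do for its result list.
    newest_first = sorted(
        records,
        key=lambda record: parse_publication_date(record["published"]),
        reverse=True,
    )
    for record in newest_first:
        print(record["published"], record["title"])
```

A date written in a local convention such as `09.08.2022` is rejected by the parser, which makes encoding problems visible at ingestion time rather than in the search results.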
Best practice 5: Provide the creator and contact point by using a simple pattern
Why
Providing information on the creator and the contact point for data is essential. Creator information is used to correctly provide license information (see section license). Information about the contact point allows data consumers to get in contact with the data publishers, e.g., ask for a newer version of the data or specific information that cannot be covered in the metadata.
Recommended approach
Use a common simple pattern, e.g., "firstname lastname" without titles or affixes, like "Dr" or "Prof". Provide additional identifiers in separate fields per author, e.g., email address, ORCID as URL. By providing the email address you reduce the effort needed to contact a person, e.g., extensive searches for contact information. However, the ORCID makes it possible to identify persons and to find contact information for researchers, when it is made publicly available, even if they have changed their affiliation and the associated institutional email address. We therefore recommend using the ORCID to refer to the creator.
Example
An example for a human-readable contact point description of a dataset is given here:
Sami Domisch, sami.domisch@igb-berlin.de
An example for the machine-actionable description of a contact point is given here:
dcat:contactPoint [ vcard:fn "Sami Domisch" ; vcard:hasEmail <mailto:sami.domisch@igb-berlin.de> ] ;
An example for a human-readable creator description of a dataset is given here:
Giuseppe Amatulli, https://orcid.org/0000-0002-9651-9602
An example for the machine-actionable description of a creator is given here:
dct:creator [ foaf:name "Giuseppe Amatulli" ; adms:identifier <https://orcid.org/0000-0002-9651-9602> ]
Moreover, we recommend providing information on organizations by means of the Research Organization Registry (ROR) identifier. For the Leibniz Institute of Freshwater Ecology and Inland Fisheries, that is https://ror.org/01nftxb06.
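Because ORCID iDs carry a check character (ISO 7064 MOD 11-2), a repository can catch typing errors before publishing creator metadata. The following is a minimal sketch; the function names are chosen for this example, and the check only verifies the syntax and checksum, not that the iD is actually registered.

```python
def orcid_check_character(base_digits: str) -> str:
    """Compute the ISO 7064 MOD 11-2 check character for the first 15 digits."""
    total = 0
    for character in base_digits:
        total = (total + int(character)) * 2
    result = (12 - total % 11) % 11
    return "X" if result == 10 else str(result)


def is_valid_orcid(orcid: str) -> bool:
    """Check syntax and checksum of an ORCID given as URL or bare identifier."""
    identifier = orcid.strip().rsplit("/", 1)[-1]
    compact = identifier.replace("-", "")
    if len(compact) != 16 or not compact[:15].isdigit():
        return False
    return orcid_check_character(compact[:15]) == compact[15].upper()


if __name__ == "__main__":
    # ORCID of the creator in the running example.
    print(is_valid_orcid("https://orcid.org/0000-0002-9651-9602"))
```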
Best practice 6: Provide the data type
Why
Metadata should be offered for all types of data, e.g., for datasets, data services, articles, or software. Providing information about the data type supports researchers and services to filter for specific data types.
Recommended approach
Use a controlled vocabulary to describe the type of data, e.g., resource types defined in schema.org or DCAT.
Example
An example for a human-readable data type description is given here:
Dataset (https://www.w3.org/TR/vocab-dcat-2/#Class:Dataset)
An example for the machine-actionable data type description is given here:
type "https://www.w3.org/TR/vocab-dcat-2/#Class:Dataset"
Best practice 7: Provide a data license
Why
License information is essential for data consumers to assess data usage, as it provides legal information about access and reuse of the data. Data consumers might exclude relevant data because license information is missing. Moreover, search portals might exclude data with missing license information.
Recommended approach
Provide a link to the license agreement that controls use of the data. Choose an open license whenever possible and use tools that help to identify proper licenses when needed, e.g., CC License Chooser.
Example
An example for a human-readable description of license information is given here:
http://creativecommons.org/licenses/by-nc-sa/4.0/
An example for the machine-actionable description of license information is given here:
dct:license <https://creativecommons.org/licenses/by/4.0/> ;
Best practice 8: Provide spatial coverage
Why
Data consumers often search data for a specific region. The provision of precise information about the spatial coverage supports data consumers and services in the spatial filtering of data.
Recommended approach
Provide the geographic region that is covered by the data as point, bounding box, polygon or line, implemented as Well-Known Text (WKT) representation.
Example
An example for a human-readable bounding box description is given here:
POLYGON ((-180 85, 191 85, 191 -60, -180 -60, -180 85))
An example for the machine-actionable bounding box description is given here:
dcat:bbox "POLYGON ((-180 85, 191 85, 191 -60, -180 -60, -180 85))"^^geo:wktLiteral ;
Please find more information on the expanded extent for the running example in the related publication.
Alternative options to describe the spatial coverage include providing a URI/link using the Geonames service as proposed in DCAT. However, this leads to more complex processing for search and harvesting implementations and would not work for the running example with an extent greater than 180. Moreover, the Spatial Data on the Web Best Practices provide a comprehensive overview of open and text-based formats to describe the spatial coverage with respect to discoverability and granularity and give examples for different encodings.
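As a sketch of how a repository might generate the WKT representation from the four bounding box values, the helper below builds a polygon with doubled parentheses and a closed ring (the first corner repeated at the end), as WKT requires. The function name `bbox_to_wkt` and its parameter order are assumptions made for this example.

```python
def _format_number(value: float) -> str:
    """Format a coordinate compactly, e.g. -180.0 becomes '-180'."""
    return format(value, ".10g")


def bbox_to_wkt(west: float, south: float, east: float, north: float) -> str:
    """Return a WKT POLYGON for a bounding box (hypothetical helper).

    Longitudes are not clamped to [-180, 180], so expanded extents such as
    the one in the running example are kept as given.
    """
    if south > north:
        raise ValueError("south must not exceed north")
    corners = [
        (west, north),
        (east, north),
        (east, south),
        (west, south),
        (west, north),  # repeat the first corner to close the ring
    ]
    ring = ", ".join(
        f"{_format_number(x)} {_format_number(y)}" for x, y in corners
    )
    return f"POLYGON (({ring}))"


if __name__ == "__main__":
    print(bbox_to_wkt(west=-180, south=-60, east=191, north=85))
```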
Best practice 9: Provide the temporal coverage
Why
The temporal coverage of a dataset is a useful indicator for assessing the fitness for use of the data. Providing precise information on the temporal coverage helps data consumers and services to filter data by time.
Recommended approach
Provide the period covered by the data as an individual date, several dates, or a time period defined by start date, end date and resolution, using a standardized format like ISO 8601.
Example
An example for a human-readable definition of a dataset’s time period is given here:
2005-08-09/2005-08-30
An example for the machine-actionable time period description is given here:
a dct:PeriodOfTime ; dcat:startDate "2005-08-09"^^xsd:date ; dcat:endDate "2005-08-30"^^xsd:date .
Best practice 10: Provide documentation
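Start and end dates are easily swapped when a time period is transcribed between encodings. A small consistency check, sketched here with the Python standard library, parses the human-readable interval and verifies the order of the two dates. The function name `parse_period` is an assumption for this example, and only the simple `start/end` form with calendar dates is handled.

```python
from datetime import date


def parse_period(interval: str) -> tuple[date, date]:
    """Parse an ISO 8601 'start/end' interval of calendar dates.

    Raises ValueError if the interval is malformed or the start date
    lies after the end date.
    """
    start_text, end_text = interval.split("/")
    start = date.fromisoformat(start_text)
    end = date.fromisoformat(end_text)
    if start > end:
        raise ValueError(f"start {start} is after end {end}")
    return start, end


if __name__ == "__main__":
    start, end = parse_period("2005-08-09/2005-08-30")
    print(start.isoformat(), end.isoformat(), (end - start).days)
```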
Why
A page or document with detailed information on the data helps data consumers to assess the fitness for use of the data. It can be used to provide complex information that cannot be covered in the metadata description, like formulas, tables or figures and related text. For research data, the documentation is typically a scientific publication.
Recommended approach
Provide a link to the documentation. Use persistent identifiers, whenever possible.
Example
An example for a human-readable definition of a dataset’s documentation is given here:
https://doi.org/10.5194/essd-14-4525-2022
An example for a machine-actionable definition of a dataset’s documentation is given here:
foaf:page <https://doi.org/10.5194/essd-14-4525-2022>
Core profile for metadata
Based on the best practices above, we synthesized a core profile for metadata. The core profile includes the following metadata:
Table 1: Core profile for data description and related patterns for the properties
| Property name | Description | Best practice |
| -------- | -------- | -------- |
| Id | Global unique identifier | 1 |
| Title | Name given to the data, human-readable | 2 |
| Description | Free-text to describe and summarise the data | 3 |
| Publication date | The date of the publication of the data | 4 |
| Contact point | Contact information | 5 |
| Type | Resource type, like dataset, article, report, map, service | 6 |
| License | Legal information about access and reuse of the data | 7 |
| Spatial coverage | Geographic region that is covered by the data | 8 |
| Temporal coverage | Temporal period that the data covers | 9 |
| Documentation | Page or document with information about the data | 10 |
Table 2 shows mappings from the recommended core profile to the most frequently used metadata schemas.
Table 2: Core metadata profile properties and corresponding properties in GeoDCAT, ISO19115, DataCite and schema.org schemas
| Property name | GeoDCAT | ISO19115 | DataCite | schema.org |
|---|---|---|---|---|
| ID | dct:identifier | See https://semiceu.github.io/GeoDCAT-AP/drafts/latest/#unique-resource-identifier---not-in-iso19115-core | Identifier with identifierType (controlled list of permitted values), e.g., identifierType="DOI" | identifier |
| Title | dct:title | MD_Metadata > MD_DataIdentification.citation > CI_Citation.title | Title | name |
| Description | dct:description | MD_Metadata > MD_DataIdentification.abstract | Description (with descriptionType="Abstract") | description |
| Publication date | dct:created | MD_Metadata > MD_DataIdentification.citation > CI_Citation.date | PublicationYear | datePublished |
| Contact point | dcat:contactPoint | MD_Metadata.contact > CI_ResponsibleParty | Contributor with contributorType="ContactPerson" | contactPoint |
| Type | dct:type (with rdf:type dcat:Dataset) | See https://semiceu.github.io/GeoDCAT-AP/drafts/latest/#resource-type---not-in-iso19115-core | resourceTypeGeneral (controlled list of permitted values according to DataCite Metadata Schema) - if value not in controlled list: ResourceType (free text) | additionalType (with rdf:type schema:Dataset) |
| Spatial coverage | dct:spatial | MD_Metadata > MD_DataIdentification.extent > EX_Extent > EX_GeographicExtent > EX_GeographicBoundingBox or EX_GeographicDescription | Geolocation | spatialCoverage |
| Temporal coverage | dct:temporal | MD_Metadata > MD_DataIdentification.extent > EX_Extent > EX_TemporalExtent or EX_VerticalExtent | Date with dateType="Collected" | temporalCoverage |
Mapping metadata profiles is often accompanied by a loss of information. Thus, we strongly recommend keeping the repository's detailed internal metadata model, depending on the target group's needs, and additionally providing the above-mentioned profile for interoperability with other services, e.g., the NFDI4Earth Knowledge Hub.
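To make the mapping idea and the associated information loss concrete, the sketch below converts a record into a schema.org-style JSON-LD document using the schema.org column of Table 2 and reports which internal fields have no counterpart in the mapping. The record keys, the function names, and the extra field `sensor_uri` are assumptions made for this example; values are passed through as plain strings, whereas a production mapping would likely use structured objects, e.g., for contact points and spatial coverage.

```python
import json

# Core-profile property (hypothetical key) -> schema.org property (Table 2).
CORE_TO_SCHEMA_ORG = {
    "id": "identifier",
    "title": "name",
    "description": "description",
    "publication_date": "datePublished",
    "contact_point": "contactPoint",
    "type": "additionalType",
    "spatial_coverage": "spatialCoverage",
    "temporal_coverage": "temporalCoverage",
}


def to_schema_org(record: dict) -> dict:
    """Map the core-profile fields of a record to a schema.org Dataset."""
    document = {"@context": "https://schema.org/", "@type": "Dataset"}
    for core_key, schema_key in CORE_TO_SCHEMA_ORG.items():
        if core_key in record:
            document[schema_key] = record[core_key]
    return document


def unmapped_fields(record: dict) -> list:
    """List record fields that the mapping drops (the information loss)."""
    return sorted(set(record) - set(CORE_TO_SCHEMA_ORG))


if __name__ == "__main__":
    internal_record = {
        "id": "https://doi.org/10.18728/igb-fred-762.1",
        "title": "Hydrography90m: A new high-resolution global hydrographic dataset",
        "publication_date": "2022-08-09",
        "sensor_uri": "https://sensor.awi.de/?id=87",  # illustrative internal field
    }
    print(json.dumps(to_schema_org(internal_record), indent=2))
    print("Not covered by the mapping:", unmapped_fields(internal_record))
```

Keeping the internal record as the source of truth and generating the harmonised view on demand is one way to follow the recommendation above.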
Use cases for specific metadata in the Earth System Sciences
With the update of this document, we intend to provide further profiles for specific data. In a first step, we collect data-driven use cases of Earth System Sciences that help us to identify metadata requirements from the data-oriented perspective.
Use case 1: Providing metadata for underway measurements of sea water temperature data harvested from PANGAEA into the Marine and Earth Data Portals
The Marine Data and Earth Data portals provide access and search functionalities to harvested research-specific metadata. The harvesting approach offers flexibility to add custom keywords and relations to metadata entries. These properties are also included in the PANGAEA data model and are very common in marine research. The Marine Data and Earth Data portals use these to offer data exploration by expedition and platform.
As an example, we refer to a dataset of temperature underway measurements taken during a Polarstern expedition, and embedded in the German Marine Research Alliance. The dataset is published in PANGAEA.
Hoppmann, Mario; Tippenhauer, Sandra; Hanfland, Claudia (2023): Continuous thermosalinograph oceanography along RV POLARSTERN cruise track PS130/2. Alfred Wegener Institute, Helmholtz Centre for Polar and Marine Research, Bremerhaven, PANGAEA, https://doi.org/10.1594/PANGAEA.955760
PANGAEA offers an XML representation of its original metadata as well as serializations to several metadata schemas (e.g. JSON-LD, DataCite XML, ISO19139). While harvesting the original metadata for the Marine Data and Earth Data portals, the metadata is annotated, e.g., with region names based on the geographical extent of the dataset.
The following properties get harvested from the PANGAEA OAI-PMH interface using the original metadata scheme. The portals use these properties for free text search or as a filter to enhance the possibilities of data exploration. The examples illustrate how the metadata is structured in the PANGAEA metadata XML.
Platform
A platform can represent, for instance, a research vessel, a vehicle, a large device, or a station. The platform name can be used in the portal, for example, to filter all datasets related to the research vessel Polarstern.
Example
<md:basis id="event3016850.basis1">
<md:name>**Polarstern**</md:name>
<md:URI>https://doi.org/10.17815/jlsrf-3-163</md:URI>
<md:callSign>DBLK</md:callSign>
<md:IMOnumber>8013132</md:IMOnumber>
</md:basis>
Device
Devices can be specific sensors or instruments used to measure a dataset. PANGAEA also allows linking to the NERC standard vocabulary, for example.
Example
<md:method id="event3016850.method85">
<md:name>**Thermosalinograph**</md:name>
<md:optionalName>TSG</md:optionalName>
<md:term id="event3016850.method85.term67626"
terminologyId="3"
terminologyLabel="PAN-M&amp;D">
<md:name>CTD &amp; XBT</md:name>
</md:term>
<md:term id="event3016850.method85.term2299591"
semanticURI="SDN:L05::133"
terminologyId="21"
terminologyLabel="NERC-L05">
<md:name>thermosalinographs</md:name>
<md:URI>**http://vocab.nerc.ac.uk/collection/L05/current/133/**</md:URI>
</md:term>
</md:method>
Expedition
The research expedition or campaign on which the dataset was measured is also harvested and can be used as a filter. The portal's expedition list is linked to the data search and can be used to explore which datasets were collected on a particular expedition.
Example
<md:campaign id="event3016850.campaign48085">
<md:name>**PS130/2**</md:name>
<md:URI>https://doi.org/10.57738/BzPM_0765_2022</md:URI>
<md:chiefScientist>Hanfland, Claudia</md:chiefScientist>
<md:start>2022-05-22</md:start>
<md:end>2022-05-29</md:end>
<md:attribute attid="2053" name="Start location">
Las Palmas
</md:attribute>
<md:attribute attid="2054" name="End location">
Bremerhaven
</md:attribute>
</md:campaign>
Event
Data published on PANGAEA typically refers to sampling events whose metadata include spatial and temporal information. Event names are unique and usually use the expedition name as a prefix. Links to additional metadata like sensor information systems are possible.
Example
<md:event id="event3016850">
<md:label>**PS130/2_0_Underway-37**</md:label>
<md:latitude>46.072842</md:latitude>
<md:longitude>-7.487635</md:longitude>
<md:elevation>-4791.6</md:elevation>
<md:dateTime>2022-05-26T17:38:39</md:dateTime>
<md:latitude2>51.202885</md:latitude2>
<md:longitude2>1.797622</md:longitude2>
<md:elevation2>-29.5</md:elevation2>
<md:dateTime2>2022-05-28T06:35:17</md:dateTime2>
<md:attribute attid="51418" name="Sensor URI">
https://sensor.awi.de/?id=87</md:attribute>
<md:campaign .../>
<md:basis .../>
<md:method .../>
</md:event>
Parameter
The list of measured parameters in a dataset contains references to standard vocabularies and is also part of the PANGAEA metadata scheme. The portal can use these to filter for datasets where a particular parameter, e.g. water temperature, has been measured.
Example
<md:matrixColumn col="4" format="##0.000" id="col4.ds14920107"
source="data" type="numeric">
<md:parameter id="col4.ds14920107.param717">
<md:name>**Temperature, water**</md:name>
<md:shortName>Temp</md:shortName>
<md:unit>°C</md:unit>
<md:term endOffset="11" fragment="1"
id="col4.ds14920107.param717.term43972"
semanticURI="http://qudt.org/1.1/vocab/quantity#ThermodynamicTemperature"
startOffset="0" terminologyId="13"
terminologyLabel="PAN-Quantity">
<md:name>Temperature</md:name>
<md:optionalName>Θ</md:optionalName>
<md:URI>http://dbpedia.org/resource/Temperature</md:URI>
</md:term>
<md:term endOffset="18" fragment="2"
id="col4.ds14920107.param717.term2606205"
semanticURI="http://purl.obolibrary.org/obo/CHEBI_15377"
startOffset="13" terminologyId="16"
terminologyLabel="ChEBI">
<md:name>water</md:name>
<md:URI>http://purl.obolibrary.org/obo/CHEBI_15377</md:URI>
</md:term>
</md:parameter>
<md:method id="col4.ds14920107.method13178">
<md:name>**Digital oceanographic thermometer, Sea-Bird, SBE 38**</md:name>
<md:URI>https://www.seabird.com/...</md:URI>
<md:term id="col4.ds14920107.method13178.term2300638"
semanticURI="SDN:L22::TOOL0191" terminologyId="21"
terminologyLabel="NERC-L05">
<md:name>Sea-Bird SBE 38 thermometer</md:name>
<md:URI>http://vocab.nerc.ac.uk/collection/L22/current/TOOL0191/</md:URI>
</md:term>
</md:method>
<md:PI id="col4.ds14920107.pi38865">
<md:lastName>Hoppmann</md:lastName>
<md:firstName>Mario</md:firstName>
<md:eMail>mario.hoppmann@awi.de</md:eMail>
<md:URI>http://www.awi.de/...</md:URI>
<md:orcid>0000-0003-1294-9531</md:orcid>
</md:PI>
<md:caption>Temp [°C]</md:caption>
</md:matrixColumn>
References
A metadata record can contain any number of URIs pointing to related records, publications, and download links or web services.
Example
<md:reference dataciteRelType="References"
group="210" id="ref115713"
relationType="Related to"
relationTypeId="12"
typeId="ref2" typeName="report">
<md:author id="ref115713.author59361">
<md:lastName>Dreutter</md:lastName>
<md:firstName>Simon</md:firstName>
<md:eMail>simon.dreutter@awi.de</md:eMail>
<md:orcid>0000-0002-0878-0780</md:orcid>
</md:author>
<md:author id="ref115713.author21851">
<md:lastName>Hanfland</md:lastName>
<md:firstName>Claudia</md:firstName>
<md:eMail>claudia.hanfland@awi.de</md:eMail>
</md:author>
<md:year>2022</md:year>
<md:title>The Expeditions PS130/1 and PS130/2 of the Research Vessel POLARSTERN to the
Atlantic Ocean in 2022</md:title>
<md:source id="ref115713.journal16572" relatedTermIds="33964,33974"
type="journal">Berichte zur Polar- und Meeresforschung = Reports on Polar and
Marine Research</md:source>
<md:volume>765</md:volume>
<md:URI>https://doi.org/10.57738/BzPM_0765_2022</md:URI>
<md:pages>49 pp</md:pages>
</md:reference>
The Marine Data and Earth Data portals will only retrieve the values printed in bold from the metadata examples, but the names can be used to reference the internal expedition, event, or sensor database. The portal's harvesting infrastructure maps the values to an internal metadata schema for search functionalities.
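How a harvester might pull the highlighted values out of such a record can be sketched with the Python standard library. This is an illustration only, not the portals' actual implementation: the namespace URI bound to the `md` prefix is an assumption, the sample is a shortened version of the event example above without the bold markers, and the output keys are chosen for this example.

```python
import xml.etree.ElementTree as ET

# Assumed namespace URI for the "md" prefix; check the harvested records.
MD_NAMESPACE = "http://www.pangaea.de/MetaData"
NS = {"md": MD_NAMESPACE}

# Shortened sample modelled on the event example, with nested campaign,
# basis (platform) and method (device) elements.
SAMPLE = f"""<md:event xmlns:md="{MD_NAMESPACE}" id="event3016850">
  <md:label>PS130/2_0_Underway-37</md:label>
  <md:campaign id="event3016850.campaign48085">
    <md:name>PS130/2</md:name>
  </md:campaign>
  <md:basis id="event3016850.basis1">
    <md:name>Polarstern</md:name>
  </md:basis>
  <md:method id="event3016850.method85">
    <md:name>Thermosalinograph</md:name>
  </md:method>
</md:event>"""


def extract_filters(xml_text: str) -> dict:
    """Extract the values a portal could use as search filters."""
    event = ET.fromstring(xml_text)
    return {
        "event": event.findtext("md:label", namespaces=NS),
        "expedition": event.findtext("md:campaign/md:name", namespaces=NS),
        "platform": event.findtext("md:basis/md:name", namespaces=NS),
        "device": event.findtext("md:method/md:name", namespaces=NS),
    }


if __name__ == "__main__":
    for key, value in extract_filters(SAMPLE).items():
        print(f"{key}: {value}")
```

Elements that are absent from a record simply yield `None`, so the harvester can decide per filter whether a missing value is acceptable.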
Use Case 2: Metadata Ingestion From Self-Describing Data Formats
A fundamental source of information in modern Earth System Sciences comes from large modelling and remote sensing systems. Data from such sources is usually stored in tailored and widely established raster-data formats. In most cases, these tailored data formats are self-describing, which means that they contain, besides the data values, dedicated metadata elements for describing the data structure, type and meaning.
In the past, different communities have used different data formats with different granularities of metadata and community-specific features. As an example, the atmospheric modelling community often applied the so-called GRIB (GRIdded Binary) format that was also proposed by the World Meteorological Organization (WMO).
Remote sensing data was (and still is) often provided in the Hierarchical Data Format (HDF). All these formats store binary data, allow for providing metadata, and can be accessed and used with specific APIs in almost every major programming language and environment. An overview of the various formats can be found, e.g., in Di & Yu (2023).
In recent years, however, various communities adopted NetCDF (Network Common Data Form) as the quasi-standard for storing and distributing raster data. A major part of this is due to the highly active community around the so-called Climate and Forecast (CF) conventions, which aim at harmonising NetCDF data by providing guidelines for the structure of the data as well as defining different mandatory and recommended attributes. While the CF-Conventions were designed particularly for NetCDF data, they can easily be applied to other self-describing data formats like HDF or even modern cloud-optimised formats like Zarr.
The reason is that most self-describing data formats share similar structures and concepts: they have a header that contains all metadata (i.e., dimensions, attributes, etc.) and a data part that contains the data itself. Metadata is often split into global attributes and variable attributes. Please note that we do not cover the group hierarchies of, e.g., HDF5 or NetCDF4 data here as this would go beyond the scope of this paper.
While global attributes contain overarching information about the dataset, the variable attributes are related to the variables like, e.g., units, standard names that come from a controlled vocabulary, etc. The ultimate goal of all these attributes and conventions is to describe all important aspects of a particular dataset in a comprehensive, transparent and standardised way to allow for a straightforward application and re-use. It further enables harvesting metadata from the data itself, which substantially simplifies the integration into higher level research data and data catalogue infrastructures and would be a huge step towards fulfilling the FAIR-principles.
Here, we want to propose a slightly enhanced profile of the CF-Conventions as the fundamental and generic core of any self-describing dataset for ensuring a minimal set of metadata information that provides most elements of our core profile (see Tab. 1).
The minimal core of the CF-Conventions is rather straightforward. They simply propose a set of global attributes, some mandatory attributes for the data variables, standard names for dimension variables and some guidelines on how to structure the data arrays (i.e., the ordering of dimensions). Beyond this minimal standard, the CF-Conventions also provide guidelines for complex and derived data variables, map projections and flagging schemes. A detailed overview of all mandatory and optional attributes would be beyond the scope of this document so the reader is advised to visit the official releases from the CF-Conventions.
Example - CF-1.10 compliant dataset
netcdf example {
dimensions:
lon = 140 ;
lat = 200 ;
time = 215 ;
variables:
float lon(lon) ;
lon:units = "degrees_east" ;
lon:standard_name = "longitude" ;
lon:long_name = "longitude" ;
float lat(lat) ;
lat:units = "degrees_north" ;
lat:standard_name = "latitude" ;
lat:long_name = "latitude" ;
float time(time) ;
time:units = "days since 1980-01-01 00:00:00" ;
time:standard_name = "time" ;
time:long_name = "time" ;
short tas(time, lat, lon) ;
tas:long_name = "daily_average_temperature_at_2m" ;
tas:standard_name = "air_temperature" ;
tas:units = "K" ;
tas:_FillValue = -9999s ;
// global attributes:
:title = "Temperature forecasts" ;
:Conventions = "CF-1.10" ;
:references = "DOI, URL, etc." ;
:institution = "Some Institution" ;
:source = "Atmospheric model" ;
:comment = "Global data truncated to study domain" ;
:history = "2020-05-19 08:37:03: File created." ;
}
In our example, we present the header of a dataset that is consistent with the current release of the CF-Conventions (here: 1.10). The Conventions in use must be referenced with the global attribute Conventions. Other global attributes defined by the CF-Conventions are title, references, institution, source, comment and history; although CF formally only recommends them, they should always be provided. While most of these are self-explanatory, history serves as an audit trail for modifications to the original data. Several tools already support this attribute; when working with these Conventions, it is hence strongly advised to use the history attribute to ensure traceability of all processing steps. Particularly for data and dimension variables, the CF-Conventions make heavy use of the attribute standard_name. Its value usually comes from CF's own controlled vocabulary, the CF Standard Name Table, and should be used whenever possible. While the standard_name sometimes provides a rather general variable name (e.g., air_temperature), more specific descriptions can be added to the long_name attribute (e.g., daily_average_temperature_at_2m). Further highly recommended attributes of any data and dimension variable are units (as a UDUNITS string) as well as a _FillValue that defines the value used for missing data in a dataset.
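The checks described above can be sketched in a few lines of Python. This is a minimal illustration only: it works on a plain dictionary of global attributes rather than on a NetCDF file, and the helper names `missing_global_attributes` and `append_history` are hypothetical, not part of any CF tool. Appending new history entries on a new line is an assumption; some tools prepend them instead.

```python
from datetime import datetime, timezone

# Global attributes used in the CF example above.
CF_GLOBAL_ATTRIBUTES = (
    "Conventions", "title", "institution", "source",
    "history", "references", "comment",
)


def missing_global_attributes(attrs):
    """Return the listed global attribute names that are absent or empty."""
    return [
        name for name in CF_GLOBAL_ATTRIBUTES
        if not str(attrs.get(name, "")).strip()
    ]


def append_history(attrs, message, when=None):
    """Return a copy of attrs with a timestamped entry added to 'history'.

    The timestamp layout follows the example header
    ("YYYY-MM-DD hh:mm:ss: message"); appending on a new line is an
    assumption of this sketch.
    """
    when = when or datetime.now(timezone.utc)
    entry = f"{when:%Y-%m-%d %H:%M:%S}: {message}"
    previous = attrs.get("history", "")
    updated = dict(attrs)  # leave the caller's dictionary untouched
    updated["history"] = f"{previous}\n{entry}" if previous else entry
    return updated
```

Calling `append_history` after each processing step keeps the audit trail inside the file's own metadata, so it travels with the data.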
However, a closer look at the global attributes of the CF-Conventions shows that several crucial pieces of information are missing. As an example, there is no dedicated field for a point of contact or the author of the dataset. Thus, with the current Conventions alone, we cannot fulfil our metadata core profile (Table 1). As this has been recognised by several communities, there have been multiple attempts to enhance the CF attributes with both generic and more domain-specific information. The interinstitutional Earth Science Information Partners (ESIP) have compiled the Attribute Convention for Data Discovery (ACDD) 1.3, which contains a more generic list of highly recommended, recommended and suggested attributes. Another enhanced profile, coming in particular from the atmospheric modelling community, is the ATMODAT-Standard, which is tailored towards the organisation and description of model-based raster data.
In Table 3, we hence propose a simple and straightforward enhancement of the CF-Conventions with additional attributes from the ESIP- and ATMODAT-Standards in order to ensure consistency with our synthesised core profile (Table 1).
Table 3: Mapping between CF- and similar Conventions onto our Core Profile.
| Property Name | CF-1.10 | Suggested extension | Attribute Provenance |
|---|---|---|---|
| Id | n/a | id | ESIP |
| Title | title | | CF-1.10 |
| Description | n/a | summary | ESIP, ATMODAT |
| Contact point | n/a | contact | ATMODAT |
| License | n/a | license | ATMODAT, ESIP |
| Spatial coverage | n/a | geospatial_lon_min, geospatial_lat_min, geospatial_lon_max, geospatial_lat_max | ESIP |
| Temporal coverage | n/a | time_coverage_start, time_coverage_end | ESIP |
| Documentation | references | | CF-1.10 |
We have excluded the properties Publication date and Type from this list. The Publication date usually refers to the date when the data is published, e.g., through a repository, and is therefore not known when the file is created. As we are focusing in this use case on self-describing raster data formats, Type can simply be set to dataset or to the respective link into a controlled vocabulary.
Thus, when taking the examples from our best practices, the global attributes from a self-describing dataset could be defined as follows:
Example - Global Attributes of a self-describing dataset consistent with our core profile
{
// global attributes:
:id = "https://doi.org/10.18728/igb-fred-762.1";
:title = "Hydrography90m: A new high-resolution global hydrographic dataset";
:summary = "A global high-resolution (90m) hydrographical network that delineates
headwater stream channels in great detail. Raster and vector data available
at https://hydrography.org/";
:contact = "Giuseppe Amatulli,https://orcid.org/0000-0002-9651-9602";
:license = "CC BY-NC 4.0, https://creativecommons.org/licenses/by-nc/4.0/";
:geospatial_lon_min = "-180";
:geospatial_lat_min = "-60";
:geospatial_lon_max = "191";
:geospatial_lat_max = "85";
:time_coverage_start = "2005-08-09 00:00:00";
:time_coverage_end = "2005-08-30 00:00:00";
:references = "https://doi.org/10.5194/essd-14-4525-2022";
:type = "https://www.w3.org/TR/vocab-dcat-2/#Class:Dataset";
:Conventions = "CF-1.10, ACDD-1.3, ATMODAT-2.4";
:institution = "Hydrography.org, https://hydrography.org/";
:source = "MERIT Hydro digital elevation model";
:comment = "Data processed and compiled, etc. ";
:history = "2020-05-19 00:00:00: File created.";
}
Providing this minimal set of global attributes has the great benefit that all information required for our core metadata profile can be read from the data itself. There is no need to provide and maintain additional metadata, e.g., via separate metadata catalogues. Furthermore, various tools can already expose this information in a machine-readable and standardised format. As an example, the ncISO tools generate ISO 19115-conformant metadata directly from NetCDF files. For a more generic approach, the NetCDF Markup Language (NcML) can be used to derive XML representations of NetCDF metadata.
Overall, we strongly suggest following these simple guidelines to substantially improve the FAIRness of, in particular, raster data provided in self-describing data formats from Earth System Sciences.
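As a rough illustration of reading the core profile from the data itself, the mapping of Table 3 can be written down as a lookup table and applied to a file's global attributes. The sketch below assumes the attributes have already been read into a plain dictionary (for example with a NetCDF library of your choice); the names `CORE_PROFILE_TO_ATTRIBUTES` and `extract_core_profile` are hypothetical.

```python
# Core profile property -> global attribute name(s), following Table 3.
CORE_PROFILE_TO_ATTRIBUTES = {
    "Id": ["id"],
    "Title": ["title"],
    "Description": ["summary"],
    "Contact point": ["contact"],
    "License": ["license"],
    "Spatial coverage": [
        "geospatial_lon_min", "geospatial_lat_min",
        "geospatial_lon_max", "geospatial_lat_max",
    ],
    "Temporal coverage": ["time_coverage_start", "time_coverage_end"],
    "Documentation": ["references"],
}


def extract_core_profile(global_attrs):
    """Map global attributes onto core profile properties.

    Returns (profile, missing): 'profile' holds every property whose
    attributes are all present, 'missing' lists the absent attribute names.
    """
    profile, missing = {}, []
    for prop, names in CORE_PROFILE_TO_ATTRIBUTES.items():
        absent = [name for name in names if name not in global_attrs]
        if absent:
            missing.extend(absent)
            continue
        if len(names) == 1:
            profile[prop] = global_attrs[names[0]]
        else:
            # Multi-attribute properties keep their attribute names as keys.
            profile[prop] = {name: global_attrs[name] for name in names}
    return profile, missing
```

A repository could run such a check at upload time and report the `missing` list back to the data provider before the dataset is published.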
Define A License For The Metadata
Repository providers should facilitate reusing metadata for different use cases, e.g., harvesting and enriching with additional information, by providing license information for the collected metadata. We recommend providing all metadata under Creative Commons Zero v1.0 Universal (CC0 1.0).
The following websites provide detailed information about the different license options and how to provide the license in a machine-readable format:
- Machine-readable licences: https://spdx.org/licenses/
- Overview of Creative Commons (CC) license versions: https://creativecommons.org/choose/?lang=de
- Wizard to choose a CC licence: https://chooser-beta.creativecommons.org/
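One possible way to make the metadata license machine-readable is to attach the SPDX identifier and the canonical license URL to every exported record. The following sketch assumes a record is a plain dictionary that is later serialised to JSON; the field name `metadata_license` and the helper `with_metadata_license` are hypothetical and would need to be aligned with the schema actually used by the repository.

```python
import json

# CC0 1.0 as listed by SPDX and Creative Commons.
CC0 = {
    "spdx_id": "CC0-1.0",
    "name": "Creative Commons Zero v1.0 Universal",
    "url": "https://creativecommons.org/publicdomain/zero/1.0/",
}


def with_metadata_license(record, license_info=CC0):
    """Return a copy of a metadata record that carries license information.

    'metadata_license' is a hypothetical field name used for illustration.
    """
    licensed = dict(record)
    licensed["metadata_license"] = dict(license_info)
    return licensed


if __name__ == "__main__":
    record = with_metadata_license({"title": "Temperature forecasts"})
    print(json.dumps(record, indent=2))
```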
Provide An Interface Using A Standardised Web-Based Protocol
Portals like the Earth Data Portal and the NFDI4Earth Knowledge Graph harvest metadata content to offer users central access to and search capabilities for research-relevant information. Repositories that follow standardised interfaces, metadata formats and best practices can be easily integrated into existing and upcoming infrastructures.
We recommend using the well-known and widely used, non-disciplinary Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH) or ESS-specific technologies like the OGC Catalogue Service for the Web (OGC CSW). Both OAI-PMH and OGC CSW require an XML serialisation of the metadata according to a specific schema (also called profile in this context). The minimum requirement of the protocols is to support serialisation to Dublin Core. However, we recommend implementing serialisation to more expressive metadata schemas, like ISO 19115 or GeoDCAT. Alternatively, a SpatioTemporal Asset Catalog (STAC) can be used to provide metadata in JSON format.
The most prominent open-source ESS software solutions for setting up metadata catalogues with standardised interfaces such as CSW include OSGeo GeoNode, GeoNetwork and pycsw.
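From a harvester's point of view, an OAI-PMH interface is queried with simple HTTP GET requests. The sketch below builds a ListRecords request for the mandatory Dublin Core format (`oai_dc`) and extracts the titles from a response using only the Python standard library. The base URL in the usage note is a placeholder, the function names are hypothetical, and paging via resumption tokens is deliberately left out.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

# Namespace of the Dublin Core elements inside an oai_dc record.
DC_NS = {"dc": "http://purl.org/dc/elements/1.1/"}


def list_records_url(base_url, metadata_prefix="oai_dc", set_spec=None):
    """Build an OAI-PMH ListRecords request URL for a repository endpoint."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if set_spec:
        params["set"] = set_spec  # optional selective harvesting
    return f"{base_url}?{urlencode(params)}"


def titles_from_response(xml_text):
    """Return all dc:title values found in an OAI-PMH response document."""
    root = ET.fromstring(xml_text)
    return [element.text for element in root.iterfind(".//dc:title", DC_NS)]
```

For example, `list_records_url("https://repo.example.org/oai")` yields a URL that a harvester can fetch and pass to `titles_from_response`; a production harvester would additionally follow the `resumptionToken` element to retrieve further pages.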
Conclusion
Metadata serves as the cornerstone for the findability and interoperability of data. This document emphasises the vital role of good metadata in ESS and provides actionable metadata recommendations to enhance findability and interoperability. With this recommendation paper, we envision that repository providers will 1) provide the described core profile properties, 2) support data providers in filling these fields with meaningful information, 3) decide on a metadata license and 4) provide the metadata via a suitable interface.
As a first step, we described use cases for specific metadata. We envision adding more use cases in a community-driven and iterative process and encourage the ESS community to give feedback, in particular by providing additional use cases.
By following the initial recommendations and drawing insights from successful implementations like the Helmholtz Data Hubs and NFDI4Earth Knowledge Hub, the ESS community can work collaboratively towards advancing research, decision-making, and innovation.