Hierarchical Data Format for Water-related Big Geodata (HDF4Water)

Media 1: Flowchart of the operations carried out to losslessly map data from OSM into an HDF5 container.

Abstract

Humans rely on clean water for their health, well-being, and various socio-economic activities. To maintain an accurate, up-to-date map of surface water bodies, the often heterogeneous big geodata (remote sensing, GIS, and climate data) must be jointly explored in an efficient and effective manner. In this context, a cross-platform and robust data representation system is key to supporting advanced water-related research with cutting-edge data science technologies, such as deep learning (DL) and high-performance computing (HPC). In this incubator project, we will develop a novel data representation system based on the Hierarchical Data Format (HDF), which supports the integration of heterogeneous water-related big geodata and the training of state-of-the-art DL methods. The project will deliver high-quality technical guidelines together with an example water-related data repository based on HDF5, with the support of the Big Geospatial Data Management group at the Technical University of Munich. NFDI4Earth will benefit from this incubator project because the solution can serve as a blueprint for many other research fields facing the same big data challenge.

Outcomes and Trends

The novelty and contribution of this incubator project can be summarized mainly as follows:

An efficient file format concept: preference was given to the HDF5 data container because we require:

Fast access (memory mapping),
Built-in support of compression,
Multilingual strings (UTF-8),
Platform-independence,
Command line tool support,
Hierarchical structuring into groups or folders,
Single file for downloads and DOI assignments.
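Several of the requirements listed above can be illustrated with a short Python sketch using the h5py and NumPy packages. This is only a minimal illustration, not the project's actual file layout: the group name osm/water, the dataset names, the coordinates, and the helper functions write_demo and read_demo are all hypothetical.

```python
import h5py
import numpy as np


def write_demo(path):
    """Write a tiny, made-up table of water bodies into one HDF5 file."""
    with h5py.File(path, "w") as f:
        # Hierarchical structuring: intermediate groups are created on demand.
        grp = f.create_group("osm/water")

        # Built-in compression: a chunked, gzip-compressed coordinate array.
        # The coordinates are illustrative values, not real measurements.
        lonlat = np.array([[11.58, 48.14], [11.35, 48.00]], dtype="f8")
        grp.create_dataset("lonlat", data=lonlat, chunks=True,
                           compression="gzip", compression_opts=4)

        # Multilingual strings: variable-length UTF-8 text.
        names = np.array(["Isar", "Würm"], dtype=object)
        grp.create_dataset("name", data=names,
                           dtype=h5py.string_dtype(encoding="utf-8"))


def read_demo(path):
    """Read the names, coordinates, and compression filter back."""
    with h5py.File(path, "r") as f:
        grp = f["osm/water"]
        raw = grp["name"][...]
        # Depending on the h5py version, strings come back as bytes or str.
        names = [s.decode("utf-8") if isinstance(s, bytes) else s for s in raw]
        return names, grp["lonlat"][...], grp["lonlat"].compression
```

Calling write_demo("demo.h5") produces a single file that can be inspected with the standard HDF5 command line tools, for example h5ls -r demo.h5.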

An analysis-ready representation of OSM: OpenStreetMap (OSM) data is a key ingredient in our machine learning scheme, and it is known for its unique and simple data representation. To make this data directly usable for analysis, we designed an analysis-ready mapping by triangulating multi-polygons and embedding an R*-tree into HDF5.
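To give a feel for what triangulating a polygon into array-friendly form can look like, the following sketch applies ear clipping to a single simple ring without holes. This is a swapped-in textbook technique under simplifying assumptions, not the project's implementation: real multi-polygons may contain holes and several outer rings, and the function names here are hypothetical. The result is a list of vertex index triples, that is, a flat integer table of the kind that fits naturally into an HDF5 dataset.

```python
def cross(o, a, b):
    """Z component of the cross product (a - o) x (b - o)."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])


def signed_area(ring):
    """Shoelace formula; positive for counter-clockwise rings."""
    closed = zip(ring, ring[1:] + ring[:1])
    return 0.5 * sum(x1 * y2 - x2 * y1 for (x1, y1), (x2, y2) in closed)


def point_in_triangle(p, a, b, c):
    """True if p lies inside or on the counter-clockwise triangle a, b, c."""
    return cross(a, b, p) >= 0 and cross(b, c, p) >= 0 and cross(c, a, p) >= 0


def triangulate(ring):
    """Ear-clip a simple ring (no repeated closing vertex, at least 3 points).

    Returns a list of (i, j, k) index triples into ring.
    """
    idx = list(range(len(ring)))
    if signed_area(ring) < 0:
        idx.reverse()  # work in counter-clockwise order
    triangles = []
    while len(idx) > 3:
        for k in range(len(idx)):
            i, j, l = idx[k - 1], idx[k], idx[(k + 1) % len(idx)]
            a, b, c = ring[i], ring[j], ring[l]
            if cross(a, b, c) <= 0:
                continue  # reflex or degenerate corner, not an ear
            others = (m for m in idx if m not in (i, j, l))
            if any(point_in_triangle(ring[m], a, b, c) for m in others):
                continue  # another vertex blocks this ear
            triangles.append((i, j, l))
            del idx[k]  # clip the ear tip
            break
        else:
            raise ValueError("ring is not a simple polygon")
    triangles.append(tuple(idx))
    return triangles


# An L-shaped ring with area 3 splits into 4 triangles.
l_shape = [(0, 0), (2, 0), (2, 1), (1, 1), (1, 2), (0, 2)]
tris = triangulate(l_shape)
```

A ring with n vertices always yields n - 2 triangles, so the index table has a predictable shape. The bounding boxes of such triangles are the kind of entries a spatial index like an R*-tree organizes, although that part is not sketched here.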

A dedicated C++ implementation and Python module: we followed the FAIR principles during the development of AtlasHDF. As a result, the source code in C++ and Python and the openly available data (e.g., OSM and Sentinel-2), together with a Jupyter Notebook tutorial provided as Open Educational Resources (OERs), were made available on GitHub with a DOI.