
FAIR publishing of a Multi-Terabyte Dataset

Contribution to the community: The work provides an example for publishing (and reusing) large gridded datasets in a FAIR way. We show data publishers a way to deal with the file size limits that are often imposed by repositories, and also show repository providers a path towards possible future developments for efficient repository use. In addition, we aim to enhance user interest by providing a solution for easily accessing subsets of large data files.

Introduction to the FLUXCOM-X Dataset

In this case, the team of scientists (here called the FLUXCOM-X team) produced a new data-driven global gridded dataset of ecosystem carbon fluxes (gross primary productivity and net ecosystem exchange) as well as water fluxes (evapotranspiration and transpiration) at a 0.05 degree spatial and hourly temporal resolution. For the year-round data from 2003 to 2021, this resulted in approximately 4 TB of compressed data per variable. The dataset is called FLUXCOM-X, and details about it can be found in [@nelson_etal_2024_egusphere].

Challenges in Handling Large Datasets

The FLUXCOM-X team approached NFDI4Earth asking for support in publishing their dataset in a FAIR way, and in particular contacted members of M2.5 involved in new technologies. They described a problem that more and more scientists working with large Earth science datasets face nowadays: they want to follow the guidelines that we publish as NFDI4Earth, or that they find in their data management plan, and try to upload their multi-terabyte dataset to one of the supported data portals, only to find that datasets of this size are not supported by the portals.

In addition, even if they negotiated with one of the portals to allow the upload of NetCDF files with a total volume of 15 terabytes, they wonder how many users would actually have the capacity to download these amounts of data to their local computers.

Addressing Diverse User Requirements

For the dataset in question, many users will not need the full-resolution dataset but rather aggregated versions of it. For example, scientists from the atmospheric inversion community will normally not need the full 0.05 degree resolution but would rather compare their results to a spatially aggregated 0.25 degree daily product. Other researchers interested in the sub-daily variability of the land carbon sink often look at monthly aggregated daily cycles of the carbon fluxes at 0.25 degree resolution. Yet another group of potential users would be data analysts with access to large computing facilities who want to analyze the dataset at its highest spatiotemporal resolution.

However, other scientists performing case studies on specific geographic areas will need access to the full-resolution dataset, but would ideally download only a small subset containing their area of interest rather than the full dataset.

Implementing a Hybrid Publication Strategy

In the end, after discussions with the FLUXCOM-X team as well as the maintainers of their desired target portal, the ICOS Data Portal, we suggested and implemented a hybrid data publication strategy. The first step was to produce aggregated NetCDF files according to best practices and metadata standards, using four different aggregation strategies, each targeting a different user group.
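One such aggregation step, hourly 0.05 degree data reduced to a daily 0.25 degree product, can be sketched with xarray. This is a minimal illustration on a tiny synthetic grid; the variable name, grid extent, and sizes are placeholders, not the real FLUXCOM-X layout.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Tiny stand-in for the hourly 0.05-degree FLUXCOM-X grid
# (names and sizes are illustrative only).
time = pd.date_range("2003-01-01", periods=48, freq="h")
lat = np.arange(0.025, 1.0, 0.05)   # 20 cells of a 0.05-degree grid
lon = np.arange(0.025, 1.0, 0.05)
rng = np.random.default_rng(0)
ds = xr.Dataset(
    {"GPP": (("time", "lat", "lon"), rng.random((48, 20, 20)))},
    coords={"time": time, "lat": lat, "lon": lon},
)

# Aggregation for, e.g., the atmospheric-inversion use case:
# hourly -> daily in time, 0.05 -> 0.25 degree in space (factor 5).
daily = ds.resample(time="1D").mean()
coarse = daily.coarsen(lat=5, lon=5).mean()
print(coarse["GPP"].shape)  # (2, 4, 4)
```

Because all aggregation windows have equal size, the coarse product preserves the overall mean of the original field, which is a useful sanity check after any such reduction.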

Each of the aggregated products was smaller than the full-resolution product by a factor of roughly 100 (i.e., in the range of gigabytes) and could be published in a findable, citable and persistent manner. In parallel, the full-resolution dataset was converted to the cloud-optimized Zarr format and uploaded to an object storage service kindly provided by the NFDI4Earth partner DKRZ. The data was chunked in a way that maximizes the efficiency of accessing spatial subsets of the data. Providing the data on the DKRZ object storage serves the following two purposes.

First, the data is publicly available, and users can access the full metadata and download subsets of the data with only a few lines of code. Second, users with a DKRZ user account and access to the HPC cluster can analyse the full-resolution dataset directly on the DKRZ cluster, and use the large computing resources available there to scale their analysis workflows. Access to, and computation on, the full dataset as well as the aggregated datasets are documented in a living repository that contains example code and demo Jupyter notebooks for different data access patterns.
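A typical subset-and-analyse access pattern can be sketched as follows. The dataset below is a small synthetic stand-in so the example is self-contained; against the real store on the DKRZ object storage one would instead open a URL (e.g. something like `xr.open_zarr("s3://<bucket>/...", storage_options=...)` — the endpoint shown is a placeholder, not the actual address).

```python
import numpy as np
import pandas as pd
import xarray as xr

# Synthetic stand-in; in practice this would come from the Zarr store
# on the DKRZ object storage, opened via its (hypothetical) URL.
time = pd.date_range("2003-01-01", periods=96, freq="h")
lat = np.arange(0.025, 1.0, 0.05)
lon = np.arange(0.025, 1.0, 0.05)
ds = xr.Dataset(
    {"GPP": (("time", "lat", "lon"),
             np.random.default_rng(2).random((96, 20, 20)))},
    coords={"time": time, "lat": lat, "lon": lon},
)

# Cut out an area of interest, then reduce it, e.g. to a mean
# daily (diurnal) cycle of the flux over that region.
aoi = ds["GPP"].sel(lat=slice(0.0, 0.25), lon=slice(0.0, 0.25))
diurnal = aoi.groupby("time.hour").mean()
print(diurnal.sizes)
```

With a spatially chunked store, the `sel` step is what keeps the transferred data volume small: only the chunks overlapping the area of interest need to be fetched.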

Future Directions and Improvements

We think that this hybrid approach is a good solution for this particular dataset: it fulfills the needs of many different user communities and can foster future re-use of the dataset. While our approach may also serve other groups seeking to publish their own large datasets, we think that the whole process of publishing very large datasets can still be improved.

We believe that instead of merely increasing the file size limits for data uploads, data service providers have great potential to explore new cloud-based storage techniques and formats for very large datasets. We also believe that for data volumes of this size, proximity between data storage and scalable computing infrastructure is becoming increasingly important, and we look forward to observing and contributing to future developments in this field.

References