HPC and Cloud Computing - What is it and when should we use it

Datasets in the Earth System Sciences are very heterogeneous and their size can range from a few kilobytes to multiple petabytes for large climate simulations or remote sensing archives. While small datasets can still be processed on local computers, this becomes impractical for larger datasets, where vast amounts of data would need to be moved to be processed locally. This means that many scientific workflows need to be executed on remote machines at computing centers, where the machines have direct access to large geo-datasets and sufficient computing resources allow parallel processing of the data. When scientists realize that local resources are not sufficient to perform a certain study, it can be difficult to choose a fitting service among the multitude of providers that offer services to scientists. Here we want to provide an overview of the available types of options and the type and level of expertise required to use them efficiently.

Option 1: HPC

One option, and certainly the traditional way to apply a workflow on a larger scale, is to move the computation to a High Performance Computing (HPC) center. Scientists in the climate modelling community in particular depend on the availability of HPC resources for their research. Typically, researchers in HPC environments run their code directly on shared hardware where resources are managed by a job scheduling system. On HPC systems the user can choose from a list of software modules that can be loaded on demand and encompass Fortran, C and C++ compilers, different MPI versions for distributed computing, and libraries for reading and writing HDF5, NetCDF and other file formats. Because the ability to install individual software on shared HPC resources is limited, it is advisable to consider the available software modules when deciding on a suitable HPC infrastructure. Some infrastructures like DKRZ also provide a list of important Earth System Science datasets to users, which can significantly reduce the need for data movement. The required skillset to use these systems is basic knowledge of the Linux operating system, ssh, bash and job schedulers, as well as some programming in a low-level or higher-level programming language, depending on the available software modules.

For an introduction to working with an HPC system see https://carpentries-incubator.github.io/hpc-intro/
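To give an impression of how a scheduler-managed HPC system can be driven from a high-level language, the sketch below uses the dask-jobqueue library to request Slurm worker jobs and process a large NetCDF file in parallel. This is only one possible approach and is not specific to any particular center; the queue name, account, resource sizes and file path are placeholders that would need to be adapted to the local setup.

```python
# Minimal sketch: parallel processing on a Slurm-managed HPC system via dask-jobqueue.
# Queue, account, resource sizes and the input path are hypothetical placeholders.
import xarray as xr
from dask.distributed import Client
from dask_jobqueue import SLURMCluster

# Describe what a single worker job should request from the scheduler
cluster = SLURMCluster(
    queue="compute",          # partition name at the HPC center (placeholder)
    account="my_project",     # accounting/project ID (placeholder)
    cores=16,
    memory="64GB",
    walltime="02:00:00",
)
cluster.scale(jobs=4)         # submit four worker jobs to the scheduler

client = Client(cluster)      # connect the Dask client to the running workers

# Open a large dataset lazily in chunks and compute a simple aggregate in parallel
ds = xr.open_dataset("/pool/data/example/temperature.nc", chunks={"time": 100})
annual_mean = ds["tas"].groupby("time.year").mean().compute()
print(annual_mean)
```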

Option 2: Going to the cloud

An alternative and currently more "hyped" way to scale geocomputations is to move them into the cloud. Compared to traditional HPC systems, cloud environments are characterized by a higher level of virtualization: the user interacts with virtual machines instead of bare-metal machines. Large datasets are typically kept in object stores that are accessed through network protocols rather than directly from a file system. These access patterns influence the latency and speed of data input and output operations.
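To make the object-store access pattern concrete, the sketch below opens a cloud-hosted dataset over the network with fsspec and xarray instead of reading from a local file system. The bucket name and dataset path are hypothetical placeholders.

```python
# Minimal sketch: reading analysis-ready data directly from object storage.
# The S3 bucket and Zarr store path below are hypothetical placeholders.
import fsspec
import xarray as xr

# Map the remote Zarr store into a key/value interface accessed over the network
store = fsspec.get_mapper("s3://example-bucket/climate/tas_daily.zarr", anon=True)

# Open the dataset lazily; only the chunks that are actually needed are transferred
ds = xr.open_zarr(store)
subset = ds["tas"].sel(time="2020-07").mean(dim="time")
print(subset.compute())
```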

However, it is important to note that there is no single "cloud", and service providers try to offer simple interfaces that hide most of the complexity described above from users. In the following subsections we present some typical models in which geo-related cloud-computing resources are offered.

2.1 Manual cloud setup

The most hands-on option for cloud computing is to set up the system manually. This starts with the choice of a cloud computing provider, which can be commercial (EC2, GCP) or publicly funded (CDSE, Denbi cloud, EODC). These providers offer a list of geo-datasets that are accessible through their storage systems, ideally searchable through a STAC-formatted metadata catalog. Virtual machines of different sizes can be acquired on demand to do the actual processing. Typically, data access and computing time are directly associated with a cost, and it can be difficult to estimate the amount of necessary resources before starting an analysis. However, for the publicly funded cloud services it is possible to apply for funded computing time through small proposals, similar to HPC centers. Although the initial setup is quite complex, in the end the user has full control over the operating system and the software running inside their virtual machines and containers, which gives the highest flexibility in setting up their workflow with all dependencies. Many analysis environments are distributed as pre-built Docker containers that can easily be deployed in a cloud environment. Through the use of containerised software, analyses performed in the cloud can achieve a high level of reproducibility. Ideally, providers offer OpenStack as a tool to set up the cloud computing resources, which is becoming more widespread and helps make cloud setups transferable to other providers. In general, this manual setup should only be recommended to advanced users interested in learning cloud techniques whose priority is to have full control over their whole software stack.
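On such a self-managed setup, a typical first step is to discover data through the provider's STAC catalog before pulling anything onto a virtual machine. The sketch below uses the pystac-client library against a hypothetical STAC endpoint; the catalog URL, collection name and search parameters are assumptions that depend on the chosen provider.

```python
# Minimal sketch: discovering provider-hosted geo-data through a STAC catalog.
# The catalog URL and collection name are hypothetical and provider-dependent.
from pystac_client import Client

catalog = Client.open("https://stac.example-provider.org/v1")

search = catalog.search(
    collections=["sentinel-2-l2a"],        # example collection name
    bbox=[11.0, 51.0, 12.0, 52.0],         # lon/lat bounding box
    datetime="2023-06-01/2023-06-30",
    max_items=10,
)

for item in search.items():
    # Each item carries asset links pointing into the provider's object storage
    print(item.id, list(item.assets.keys()))
```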

2.2 Using Big-Tech geodata processing APIs (Google Earth Engine, MS Planetary Computer)

A much more user-friendly way of accessing cloud computing resources is provided by US Big-Tech companies through Google Earth Engine and Microsoft Planetary Computer, where Google Earth Engine is the platform with the longer history and more online learning material. Both platforms provide access to a large variety of publicly accessible remote sensing datasets as well as a limited amount of free computing resources for members of universities and research institutions. Users can define their own workflows based on pre-defined APIs in JavaScript or Python that support many operations for data analysis, processing and visualisation. These APIs can be accessed either through Jupyter-based notebooks or from local machines, while computations on the data happen in the cloud close to the data. The researcher does not need to go through the complications of setting up their own cloud environment; the only prerequisite is familiarity with one of the supported high-level programming languages. The downside of giving away control over the environment in which the computations happen is that it can be difficult to reproduce scientific results in the future. While the processing scripts themselves can be published, the engine running in the background is closed-source, and software versions may change at any time at the will of the service provider. Even the future availability of the service is not guaranteed, or might become associated with costs, which should be added to your considerations when choosing an analysis platform.
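To give an impression of this programming model, the sketch below uses the Google Earth Engine Python API to build a median Sentinel-2 composite over a small region; the actual computation runs on Google's servers. The region, date range and project ID are illustrative choices, and an authenticated Earth Engine account is assumed.

```python
# Minimal sketch: server-side processing with the Google Earth Engine Python API.
# Region, date range and project ID are illustrative placeholders; an
# authenticated Earth Engine account is assumed.
import ee

ee.Authenticate()                          # interactive sign-in, needed once per machine
ee.Initialize(project="my-cloud-project")  # placeholder Cloud project ID; older API
                                           # versions also accept ee.Initialize()

region = ee.Geometry.Rectangle([11.0, 51.0, 11.5, 51.5])

# Build a processing graph: filter a Sentinel-2 collection and reduce it to a median composite
composite = (
    ee.ImageCollection("COPERNICUS/S2_SR_HARMONIZED")
    .filterBounds(region)
    .filterDate("2023-06-01", "2023-09-01")
    .median()
)

# Request a small aggregated result; only this call triggers computation in the cloud
mean_red = composite.select("B4").reduceRegion(
    reducer=ee.Reducer.mean(), geometry=region, scale=100
)
print(mean_red.getInfo())
```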

2.3 Using standardized query languages

One more way to interact with cloud-based geodata processing services is through open and standardized query languages. The main idea behind these query languages is that researchers can define their processing tasks locally, send them to the cloud service providers through standardized queries, and retrieve the aggregated results. One example that has received a lot of attention in the scientific community in recent years is OpenEO. Researchers can define their workflows using clients available for several programming languages commonly used in data science (Python, R, JavaScript) or through web-based graphical user interfaces. Once the workflow is defined, it can be processed by a number of backend providers that implement the OpenEO interface and provide the datasets of interest. Possible backends can be commercial or publicly funded, for example the Copernicus Dataspace, Google Earth Engine, SentinelHub or Rasdaman servers. Similar to the Big-Tech APIs described in the last section, using OpenEO clients can substantially reduce the technical burden for scientists getting started with cloud computing while maintaining a free choice of backend. Some of the publicly funded OpenEO backends are based on open-source software and access the officially distributed data sources, which is a plus for the reproducibility of analyses performed with this framework and makes workflows easily transferable between providers. However, users are still limited by the range of operators that a backend supports, so that some complex data analysis tasks may be impossible to express efficiently as OpenEO operations. In these cases it will still be necessary to either manually set up a cloud processing environment or to move the computation to traditional HPC centers.
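As a small illustration of this query-based model, the sketch below uses the OpenEO Python client to describe an NDVI workflow locally and send it to a backend for execution. The backend URL, collection name, extents and band names are examples and depend on the chosen provider.

```python
# Minimal sketch: defining a workflow locally with the OpenEO Python client
# and executing it on a remote backend. Backend URL, collection and band
# names are examples and depend on the chosen provider.
import openeo

connection = openeo.connect("https://openeo.dataspace.copernicus.eu")
connection.authenticate_oidc()   # opens a browser-based login

cube = connection.load_collection(
    "SENTINEL2_L2A",
    spatial_extent={"west": 11.0, "south": 51.0, "east": 11.5, "north": 51.5},
    temporal_extent=["2023-06-01", "2023-09-01"],
    bands=["B04", "B08"],
)

# The NDVI and the temporal reduction only extend the workflow description;
# nothing is computed locally until the result is requested.
ndvi = cube.ndvi(red="B04", nir="B08").reduce_dimension(dimension="t", reducer="mean")

ndvi.download("ndvi_mean.nc")    # triggers execution on the backend
```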

3. Summary

The concept of "moving the algorithm to the data" is often presented as the solution for scientific analyses involving large geo-data. Scientists can benefit from a variety of computing services in different flavors. While this article draws a distinction between traditional HPC providers and cloud-based services, in reality the boundary between these two camps is becoming more and more diffuse. HPC centers tend to add complementary technologies from the cloud world to their portfolio, like providing http-accessible object storage, offering web-based interfaces to their computing resources through Jupyter servers, or maintaining STAC-searchable data catalogs. This, together with emerging interfaces to cloud processing services like OpenEO, provides scientists with a good set of tools to move their computations to the exciting but large Earth system datasets available.