New Framework for the Analysis of Aquatic Ecosystems

Media 1: Visualisation of the feature extraction.

Abstract

State-of-the-art underwater imaging systems provide an exciting opportunity to observe billions of individual organisms in their natural habitats at unprecedented spatiotemporal resolution. To unlock the full potential of these advances, we require new analysis pipelines that go beyond classifying organisms by taxonomic groups, and quantify functional traits and biological phenomena from images. Critically, these tools must be made accessible to domain specialists without programming expertise and deployable at scale on modern supercomputing systems. We develop such an image analysis pipeline, manually annotate functional groups, traits and biological processes in images, and train CNNs to automate and scale analysis of massive zooplankton image datasets. Our pipeline, implemented on a high-performance computing (HPC) system and combining multiple existing open-source frameworks and libraries, provides an intuitive web interface for browsing, searching and annotating images, and allows multiple simultaneous users to work on a single copy of the data online. Images and annotations are then used for both supervised and unsupervised training of CNNs, with the results made available in the web interface (Media 1). We demonstrate this approach by classifying ~700000 images to identify functional groups (copepods, diatom chains, Noctiluca scintillans, marine snow, etc.). Organisms are further annotated for relevant functional traits. Using these trait annotations, future work will further train CNNs for object detection and feature extraction, thereby iteratively fine-tuning CNNs to perform increasingly complex trait extraction from images. We foresee that these tools will enable new avenues of investigation in aquatic research, ecosystem modelling and global biogeochemical flux estimations, revealing previously inaccessible relationships between species biodiversity, zooplankton traits and seasonal variations in environmental conditions.

Outcome and Trends

We provide detailed documentation of our image analysis pipeline. No such consolidated pipeline currently exists for trait extraction, and we expect it to be a valuable tool for marine researchers. The pipeline comprises the following steps:

Implement a labeling interface: While any web-based tool would fit our pipeline, we used an existing open-source platform (Label-Studio). We provide instructions for setting up an instance of Label-Studio with a PostgreSQL backend on an HPC system. We include instructions and code to generate file lists for local file-serving and to import them into Label-Studio. We also provide detailed interface templates (HTML tags, CSS styling) for plankton classification, CNN evaluation, and trait and biological process annotation.
Generate training and validation data: Randomly selected images are manually classified in the implemented interface. We provide documentation and Jupyter notebooks to support information exchange between Label-Studio and a CNN (in this case, our custom-built plankton classifier).
Train a basic CNN for classification: We used a custom-built CNN (referred to as 'Plankton-classifier') and implemented a semi-supervised machine learning paradigm. Code for the Plankton-classifier will be available upon publication.
Use CNN performance metrics to evaluate and choose between machine learning paradigms and hyperparameters. We provide our experiment results and documentation to help researchers choose a suitable machine learning approach.
Run the CNN model on the entire dataset to predict classes or functional groups with a high conditional accuracy.
Manually annotate visual signatures of functional traits to produce class-specific trait annotations.
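
The file-list step for the labeling interface can be illustrated with a minimal sketch. It is not the code from our repository. It assumes that Label-Studio's local file serving is enabled and that images are addressed through URLs of the form /data/local-files/?d=<path relative to the document root>, which is our understanding of how Label-Studio serves local files. The function names (build_tasks, write_tasks), the list of image suffixes and the "image" data key are hypothetical and would need to match the labeling template in use.

```python
import json
from pathlib import Path
from urllib.parse import quote

# Assumed set of image file types; adjust to the instrument's output format.
IMAGE_SUFFIXES = {".png", ".jpg", ".jpeg", ".tif", ".tiff"}


def build_tasks(image_root, document_root):
    """Build one Label-Studio import task per image found under image_root.

    document_root is the directory assumed to be configured as the local
    files document root; image paths are written relative to it.
    """
    image_root = Path(image_root)
    document_root = Path(document_root)
    tasks = []
    for path in sorted(image_root.rglob("*")):
        if path.is_file() and path.suffix.lower() in IMAGE_SUFFIXES:
            relative = path.relative_to(document_root).as_posix()
            # "image" is a hypothetical key; it must match the variable
            # name used in the labeling template.
            url = "/data/local-files/?d=" + quote(relative)
            tasks.append({"data": {"image": url}})
    return tasks


def write_tasks(tasks, out_file):
    """Write the task list as a JSON file that can be imported manually."""
    with open(out_file, "w", encoding="utf-8") as handle:
        json.dump(tasks, handle, indent=2)
```

The resulting JSON file can then be imported through the Label-Studio interface. For very large image collections it may be preferable to split the task list into several files.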

Annotated class and trait datasets
We applied our image analysis pipeline to images acquired during research expeditions in the North Sea to produce an annotated class-labels dataset (~6000 images). We trained our Plankton-classifier on these labels, evaluated CNN accuracy on withheld labels and used the CNN to infer predictions on the entire dataset (all ~700000 images). For each image, the Plankton-classifier assigns a probability to each class, and the class with the maximum probability (max_p) is the predicted label. We then selected all images with max_p < 0.4 (~7000 images) for manual labeling. We provide ~14000 class annotations (images will be made available upon publication) and their max_p values. The Plankton-classifier predictions were used to extract classes with high conditional accuracy, such as Noctiluca, diatom chains and marine snow. We are currently generating relevant trait annotations for these classes and will add them to the annotated dataset.
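
The selection logic described above can be sketched as follows. This is an illustrative example, not the Plankton-classifier code. The function names and the example class names are hypothetical, and "conditional accuracy" is interpreted here, as an assumption, as the fraction of withheld images predicted as a class that truly belong to that class.

```python
from collections import defaultdict


def predict(probabilities, class_names):
    """Return (predicted_label, max_p) for one image's class probabilities."""
    max_p = max(probabilities)
    return class_names[probabilities.index(max_p)], max_p


def select_low_confidence(predictions, threshold=0.4):
    """Return the IDs of images whose max_p falls below the threshold."""
    return sorted(
        image_id
        for image_id, (_, max_p) in predictions.items()
        if max_p < threshold
    )


def conditional_accuracy(predicted_labels, true_labels):
    """Per predicted class, the fraction of predictions that are correct.

    Assumed definition: accuracy conditional on the predicted label,
    computed on withheld, manually labeled images.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for predicted, true in zip(predicted_labels, true_labels):
        total[predicted] += 1
        correct[predicted] += int(predicted == true)
    return {label: correct[label] / total[label] for label in total}


# Hypothetical class names and probabilities, for illustration only.
classes = ["copepod", "diatom_chain", "marine_snow"]
probabilities = {
    "img1": [0.90, 0.05, 0.05],
    "img2": [0.35, 0.33, 0.32],
    "img3": [0.10, 0.20, 0.70],
}
predictions = {k: predict(v, classes) for k, v in probabilities.items()}
to_review = select_low_confidence(predictions)  # images sent to manual labeling
```

In this toy example only img2 falls below the 0.4 threshold and would be queued for manual labeling.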

All generated labels were used to train the Plankton-classifier in collaboration with Hereon’s Model-Driven Machine Learning group (MDML). The code repository will be made public upon publication. Documentation and code for our data analysis pipeline are provided in the current version of our code repository. Upon publication of the Plankton-classifier, we will provide a Jupyter notebook tutorial of our data analysis pipeline to guide users through annotation for classification, trait segmentation, CNN training and data visualization.

In collaboration with the Helmholtz AI Cooperation Unit (Helmholtz AI), we are currently working on incorporating a conversion to binary formats to scale our data analysis pipeline to larger datasets (~10^8 images). Additionally, we are generating trait annotations that capture information about relevant characteristics, such as morphological and behavioral properties, in images. These annotations will be used to train a CNN for automated object detection and feature extraction, in collaboration with the MDML group at Hereon. To improve accessibility for domain experts without programming expertise, we plan to develop a GUI for our data pipeline.

Automatic taxonomic classification and trait extraction (D1/D3) will be valuable for marine biologists, ecologists and image analysts. We hope that the tools developed here will enable domain experts in aquatic research, ecosystem modelling and global biogeochemical flux estimations to analyze previously inaccessible relationships between biodiversity, zooplankton biology, seasonal variations in environmental conditions and the impact of climate change.

Resources

Code repository (GitLab | Zenodo)
Final Report