Detecting and classifying animal calls from audio data using animal2vec

Post provided by Julian Schäfer-Zimmermann

An introduction for people lacking a machine-learning background

We provide a non-technical explanation of the animal2vec framework, including its capabilities and its potential uses in animal behavior, ecology, and conservation research. This summary is intended as a starting point for people without a technical background (e.g., field biologists) who want to understand how the system works, what makes it unique, and how they might apply it to their own research.

What does animal2vec do?

Imagine you are trying to learn a new language. You would start by listening to native speakers, picking up on recurring patterns, and gradually associating sounds with meanings; animal2vec does something similar. It first learns from a massive amount of unlabeled audio data, essentially “listening” to various animal sounds. This is the pretraining phase. Then, it refines its understanding using a smaller labeled data set, where specific vocalizations are identified and categorized. This is the finetuning phase. This two-step process allows animal2vec to detect and classify animal calls (or other acoustic events) from raw audio recordings.

The system is designed to label the onset and offset times of calls and to classify them into types. After training, it can be run on continuous audio files (e.g., WAV or other audio file types) and outputs a set of detections, which can then be used for downstream analyses.
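To make the idea of “a set of detections” concrete, the sketch below (a pure illustration in Python; the column names and call-type labels are hypothetical and not the actual animal2vec output format) shows how such a table of onset and offset times and call types could be summarized for downstream analysis:

```python
# Hypothetical example only: the column names and call-type labels are
# illustrative, not the actual animal2vec output format.
import pandas as pd

# Suppose the detector produced one row per detected call, with
# onset/offset times in seconds and a predicted call type.
detections = pd.DataFrame(
    {
        "onset_s": [12.4, 13.1, 87.9, 88.3, 240.0],
        "offset_s": [12.7, 13.4, 88.1, 88.6, 240.5],
        "call_type": ["cc", "sn", "sn", "cc", "alarm"],
    }
)

detections["duration_s"] = detections["offset_s"] - detections["onset_s"]

# Simple downstream summaries: call counts and durations per type,
# and the overall call rate per minute of recording.
print(detections.groupby("call_type")["duration_s"].describe())
recording_length_s = 300.0  # assumed length of the audio file
print("calls per minute:", len(detections) / (recording_length_s / 60))
```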

How does animal2vec work, and what are its unique features?

At a very basic level, animal2vec works by training a deep neural network to classify data from an audio stream into a set of different categories, e.g., call types. Compared to other deep learning approaches previously used in bioacoustics, animal2vec has two main distinguishing features: (1) the neural network architecture and (2) the training paradigm.

In terms of architecture, animal2vec is a transformer-based model. A transformer is a neural network architecture that “pays attention to” relevant contextual information in an audio stream when predicting whether any given audio snippet contains a call. For example, if calls occur in sequences, the network can use information from neighboring calls to predict whether a given moment in time contains a call of a given type. Transformers are a recent advance in machine learning that has resulted in massive improvements across various domains, including (most famously) large language models such as the Chat Generative Pre-trained Transformer (ChatGPT). A non-technical explanation of transformer models is available at [1].
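For readers curious about the mechanics, the sketch below shows the core self-attention computation in a few lines of NumPy: every time step of a sequence (e.g., an audio frame) is re-represented as a weighted mix of all time steps, with the weights expressing how relevant each one is. This is a generic, simplified illustration of attention, not animal2vec’s actual implementation.

```python
# Minimal sketch of scaled dot-product self-attention (generic illustration,
# not the animal2vec implementation). Each row of x is one time step of an
# audio embedding; every output row is a relevance-weighted mix of all rows.
import numpy as np

rng = np.random.default_rng(0)
seq_len, dim = 6, 8          # 6 time steps, 8-dimensional features
x = rng.normal(size=(seq_len, dim))

# Learned projections (random here) turn x into queries, keys, and values.
w_q, w_k, w_v = (rng.normal(size=(dim, dim)) for _ in range(3))
q, k, v = x @ w_q, x @ w_k, x @ w_v

# Attention weights: how much each time step "pays attention to" the others.
scores = q @ k.T / np.sqrt(dim)
weights = np.exp(scores) / np.exp(scores).sum(axis=1, keepdims=True)  # softmax

# New, context-aware representation of every time step.
output = weights @ v
print(weights.round(2))   # each row sums to 1: the attention pattern
print(output.shape)       # (6, 8)
```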

In terms of the training paradigm, animal2vec is a self-supervised learning approach. The approach consists of two main steps: (1) a pretraining phase, where a large amount of unlabeled audio data is used to generate a “good” way to mathematically represent the audio data (also known as an embedding), and (2) a finetuning phase, where labeled data is used to train the model to detect events of interest (e.g., different call types). The purpose of the pretraining step is to allow the system to learn features of the raw audio data that are later useful for the task of detecting and classifying calls (this is also known as feature extraction). For example, human-interpretable features such as peak frequency and entropy might be useful for determining whether an audio snippet contains a vocalization or not. During pretraining, however, the machine learning system learns a very large and arbitrarily complex set of features, many of which are not human-interpretable. Once the network has learned a good way to represent the audio data, these embeddings can be used to train another neural network to detect calls; the mathematical representation generated in the first step makes it much easier for the system to learn to detect and classify calls.
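The following PyTorch sketch illustrates this two-phase idea in miniature. The module names and shapes are made up for illustration (this is not the animal2vec code): an encoder that is assumed to have already been pretrained on unlabeled audio produces embeddings, and only a small classification head on top of it is trained on the labeled calls.

```python
# Conceptual sketch of the two-phase training idea in PyTorch.
# "PretrainedEncoder" stands in for whatever network was pretrained on
# unlabeled audio; it is NOT the actual animal2vec model.
import torch
import torch.nn as nn

class PretrainedEncoder(nn.Module):
    """Placeholder for an encoder whose weights come from pretraining."""
    def __init__(self, n_features=64, embed_dim=128):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(n_features, embed_dim), nn.GELU())

    def forward(self, x):                 # x: (batch, time, n_features)
        return self.net(x).mean(dim=1)    # one embedding per audio clip

encoder = PretrainedEncoder()
# encoder.load_state_dict(torch.load("pretrained.pt"))  # weights from phase 1

# Phase 2 (finetuning): a small classifier head is trained on labeled calls.
n_call_types = 5
head = nn.Linear(128, n_call_types)
optimizer = torch.optim.Adam(head.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

# Toy labeled batch: 8 clips, 100 time steps, 64 features, plus call-type labels.
clips = torch.randn(8, 100, 64)
labels = torch.randint(0, n_call_types, (8,))

with torch.no_grad():                     # keep the pretrained embeddings fixed here
    embeddings = encoder(clips)

logits = head(embeddings)
loss = loss_fn(logits, labels)
loss.backward()
optimizer.step()
print(float(loss))
```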

Importantly, during the pretraining step, the network is not learning to detect animal calls. Instead, it is performing a different learning task that, while not the task we ultimately want to solve, results in the network learning a good way to represent the audio data. In the case of animal2vec (and the scheme it is related to, data2vec 2.0 [2]), the model during pretraining learns to regress (reconstruct) sections of audio that have been masked out of the original input. Because this task does not require labeled data, typically a much larger amount of data can be used. Labeled data is then only required for the finetuning step. The upshot is that much less labeled training data is needed to obtain good classification results than if the pretrained embeddings were not used.
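As a toy illustration of this masked-prediction idea (deliberately much simpler than the real data2vec 2.0 objective, which predicts learned representations rather than raw values), the sketch below hides random time steps of a spectrogram-like input and trains a small network to fill them back in, using no labels at all:

```python
# Toy illustration of masked prediction (simpler than the real data2vec 2.0
# objective): hide random time steps of an input and train a network to
# reconstruct them, with no labels involved.
import torch
import torch.nn as nn

n_features, seq_len, batch = 32, 50, 16
model = nn.Sequential(nn.Linear(n_features, 64), nn.GELU(), nn.Linear(64, n_features))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

for step in range(100):
    x = torch.randn(batch, seq_len, n_features)   # stands in for unlabeled audio features
    mask = torch.rand(batch, seq_len, 1) < 0.3    # hide roughly 30% of the time steps
    masked_x = x.masked_fill(mask, 0.0)

    prediction = model(masked_x)
    # The loss is computed only on the hidden positions: the network must
    # use the surrounding context to guess what was masked out.
    loss = ((prediction - x) ** 2 * mask).sum() / (mask.sum() * n_features)
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```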

What are the features of the MeerKAT dataset (and other bioacoustic datasets) that make it particularly challenging?

Bioacoustic datasets can present different challenges for automated detection and classification of signals of interest depending on the species, environment, recording technology, and other factors. However, many bioacoustic tasks share some common challenges.

First, bioacoustic datasets are often noisy, with interesting signals buried in relatively large amounts of background noise. The relative volume, bandwidth, coverage, and type of noise can vary widely. In the MeerKAT dataset, a substantial challenge arises because most recordings come from audio data recorded on tracking collars, and these data were collected while the animals were foraging. Meerkats forage by digging for prey in the sand, and the sound of this digging behavior – punctuated, broadband “crashing” noises – occurs at high volume and very frequently in the dataset, covering up many of the vocalizations. On the other hand, the collar recordings also have a high signal-to-noise ratio, since the microphone is located very close to the animal producing the sounds of interest (see Figure 1).

Second, bioacoustic datasets are often sparse, meaning that signals of interest are rare relative to the amount of non-signal recording.

Figure 1: Meerkats standing upright, wearing their GPS, audio, and accelerometer data-recording collars. Taken at the Kalahari Research Centre, South Africa. ©Vlad Demartsev, Max Planck Institute of Animal Behavior.

What are the potential applications of animal2vec?

The development of animal2vec is an ongoing process and challenge, with countless possible applications. As more data from diverse species and environments are incorporated into larger datasets, the model’s capabilities after pretraining and finetuning will continue to expand. The ultimate vision is to create something called a foundational model. A foundational model is a very large model that has been pretrained in such a broad and extensive way that it can easily be adapted to a wide range of tasks. Imagine a model that has been pretrained on all human languages, having seen during pretraining every language for which data exist. Finetuning such a broadly pretrained model to any task related to any language can then be achieved using only very little annotated data.

animal2vec as a foundational model for bioacoustics would enable researchers to finetune a large and capable model to their needs and species of interest without expensive computing infrastructure. Further, animal2vec is not limited to classification but can be used for any task involving time-series data, of which bioacoustics is one example. We plan to add support for more data modalities, such as GPS or accelerometer data, which are now common in modern biologgers [3], enabling researchers to classify behavioral states using information from all available modalities. This, in turn, would enable animal2vec to help in scenarios in which multiple data streams have to be combined, as is often the case in animal ecology [4, 5], behavior [3, 6], and conservation [7] research.

Read the full article here.

References

[1] Alammar, Jay. The Illustrated Transformer. https://jalammar.github.io/illustrated-transformer/. Accessed: 2023-06-18.

[2] Baevski, A., Babu, A., Hsu, W.-N. & Auli, M. Efficient self-supervised learning with contextualized target representations for vision, speech and language. In International Conference on Machine Learning, 1416–1429 (PMLR, 2023).

[3] Demartsev, V. et al. Signalling in groups: New tools for the integration of animal communication and collective movement. Methods Ecol. Evol. (2022).

[4] Penar, W., Magiera, A. & Klocek, C. Applications of bioacoustics in animal ecology. Ecol. Complex. 43, 100847 (2020).

[5] Pichler, M. & Hartig, F. Machine learning and deep learning—a review for ecologists. Methods Ecol. Evol. 14, 994–1016 (2023).

[6] Fletcher, N. H. Animal bioacoustics. Springer Handbook of Acoustics, 821–841 (Springer New York, New York, NY, 2014).

[7] Laiolo, P. The emerging significance of bioacoustics in animal species conservation. Biol. Conserv. 143, 1635–1645 (2010).
