
Notes from a discussion with a speech PhD student:

## Processing Pipeline for Human Speech Recordings

Imagine you have 10-minute recordings of human speech (5 minutes of speech, 5 minutes of background noise). The typical processing pipeline involves:

  1. Voice Activity Detection (removing non-speech parts)
  2. Feature Extraction (from relevant parts)
  3. Classification
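As a rough illustration of these three steps, here is a minimal sketch using an energy-based VAD, MFCC features, and an MLP classifier. It assumes librosa and scikit-learn are available; a real pipeline would typically use a stronger VAD and pretrained encoders, so treat this only as a skeleton of the structure.

```python
import numpy as np
import librosa
from sklearn.neural_network import MLPClassifier

def energy_vad(y, sr, frame_len=0.025, hop=0.010, threshold_db=-35):
    """Crude energy-based VAD: mark frames whose RMS energy exceeds a dB threshold."""
    frame_length, hop_length = int(frame_len * sr), int(hop * sr)
    rms = librosa.feature.rms(y=y, frame_length=frame_length, hop_length=hop_length)[0]
    db = librosa.amplitude_to_db(rms, ref=np.max)
    return db > threshold_db, hop_length

def clip_vector(y, sr, n_mfcc=20):
    """Steps 1+2: drop non-speech frames, then mean-pool MFCCs into one vector per clip."""
    voiced, hop_length = energy_vad(y, sr)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc, hop_length=hop_length)
    n = min(mfcc.shape[1], voiced.size)           # align frame counts, keep voiced frames only
    return mfcc[:, :n][:, voiced[:n]].mean(axis=1)

# Step 3: classify the pooled vectors.
# waveforms, sr, labels = ...  # hypothetical dataset: one label per recording
# feats = np.stack([clip_vector(y, sr) for y in waveforms])
# clf = MLPClassifier(hidden_layer_sizes=(128,), max_iter=500).fit(feats, labels)
```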

You can either use separate models for each step, trained individually or pre-trained, or a single model that optimizes all three steps jointly with a combined loss function, for potentially better results. The main focus is on feature extraction: comparing spectral/handcrafted features such as MFCCs, vanilla spectrograms, Mel-spectrograms, and low-level descriptors like eGeMAPS and openSMILE against embeddings from state-of-the-art deep learning models. Each feature set is aggregated into fixed-size vectors, fed to a classifier (typically an MLP or transformer), and the options are compared on the chosen metrics. With a model like WavLM, you can probe every layer to find the most effective one for the task.
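For the embedding side, one plausible way to probe every WavLM layer is to request the hidden states from the Hugging Face transformers model and mean-pool each layer into a clip-level vector. The checkpoint name and the mean-pooling choice below are just one reasonable setup, not a prescription.

```python
import torch
from transformers import AutoFeatureExtractor, WavLMModel

extractor = AutoFeatureExtractor.from_pretrained("microsoft/wavlm-base-plus")
model = WavLMModel.from_pretrained("microsoft/wavlm-base-plus").eval()

def layerwise_embeddings(waveform_16k):
    """Return one mean-pooled clip embedding per hidden layer (CNN output + each transformer layer)."""
    inputs = extractor(waveform_16k, sampling_rate=16000, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs, output_hidden_states=True)
    # out.hidden_states is a tuple of (batch, frames, dim) tensors, one per layer
    return [h.mean(dim=1).squeeze(0) for h in out.hidden_states]

# Train one small probe (e.g. an MLP) on the vectors from each layer and keep
# whichever layer scores best on the validation metric for the task.
```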

## Processing Pipeline for Animal Sounds

Now, for animal sounds rather than human speech, things get a bit more complicated. MFCCs, and even pre-trained models like HuBERT, are all biased towards humans. As soon as the input features are on a Mel or log-Mel scale, they are biased towards human perception and human frequency ranges, because the production of vocalizations in any animal (including humans) is fundamentally linked to the perception of vocalizations.
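One quick way to see this bias, assuming librosa is available: print where a mel filterbank places its center frequencies over a wide-band recording; most of the resolution sits well below the range where many animal vocalizations live.

```python
import librosa

# Center frequencies of a 64-band mel filterbank spanning a 48 kHz bandwidth
# (e.g. a recording sampled at 96 kHz).
centers = librosa.mel_frequencies(n_mels=64, fmin=0.0, fmax=48000.0)
print(f"{(centers < 8000).sum()}/64 mel bands sit below 8 kHz")  # the clear majority
```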

In this case, you can:

  - Transform the frequency scale of your spectrogram to give more resolution to the vocalizations' range, for example a log-spectrogram or a Bark-spectrogram.
  - Use GFCCs (Greenwood Function Cepstral Coefficients) instead of MFCCs, which are less biased towards humans (see the sketch after this list).
  - Use time-series and signal-processing features and stack them together to cover as many spectral descriptors as possible, e.g., HCTSA, Catch-22, etc.
  - Give state-of-the-art models the animal inputs anyway and hope it works. The issue is that these models are usually pre-trained on audio/speech sampled at 16 kHz, i.e., an 8 kHz bandwidth (information above that frequency is eliminated), while many animal vocalizations only start at 8-10 kHz.
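A rough sketch of the GFCC idea, using only numpy/scipy: fit the Greenwood map F(x) = A(10^(ax) - k) to a species' approximate hearing range, space triangular filters uniformly along that warped cochlear-position axis, then take a DCT of the log filterbank energies. The constant k = 0.88, the triangular filter shape, and the filter/cepstrum counts below are illustrative defaults, not a reference implementation.

```python
import numpy as np
from scipy.fft import dct

def greenwood_params(f_low, f_high, k=0.88):
    """Solve F(x) = A*(10**(a*x) - k) so that F(0) = f_low and F(1) = f_high."""
    A = f_low / (1.0 - k)
    a = np.log10(f_high / A + k)
    return A, a, k

def gfcc(power_spec, sr, f_low, f_high, n_filters=26, n_ceps=13):
    """power_spec: (n_fft//2 + 1, n_frames) power spectrogram of one recording."""
    A, a, k = greenwood_params(f_low, f_high)
    n_bins = power_spec.shape[0]
    fft_freqs = np.linspace(0, sr / 2, n_bins)
    # Filter edges equally spaced on the Greenwood (cochlear position) axis x in [0, 1].
    x = np.linspace(0, 1, n_filters + 2)
    edges = A * (10 ** (a * x) - k)
    fbank = np.zeros((n_filters, n_bins))
    for i in range(n_filters):
        lo, c, hi = edges[i], edges[i + 1], edges[i + 2]
        rising = (fft_freqs - lo) / (c - lo)
        falling = (hi - fft_freqs) / (hi - c)
        fbank[i] = np.clip(np.minimum(rising, falling), 0, None)  # triangular filter
    energies = np.log(fbank @ power_spec + 1e-10)
    return dct(energies, axis=0, norm="ortho")[:n_ceps]
```

For humans (roughly 20 Hz to 20 kHz) this gives a warping close to the usual cochlear-style maps; for a high-frequency species you would plug in its approximate hearing range instead.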

The general problem is that it depends a lot on the animal and its spectral range. It's impossible to have a set of features for all animals because the diversity is too great. But if it's just bird songs, well, birds are the simplest animals to classify in bioacoustics in my opinion. You just need to visualize the spectrograms, and you can recognize the patterns yourself by eye. I don't know what is considered the best feature today, but I don't think it's very complex.