mbressieux committed on
Commit
9afdd0a
1 Parent(s): f0c12aa

added notes from a discussion with a speech phd student

Files changed (2)
  1. docs/speech_notes.md +25 -0
  2. python/notebooks/Untitled.py +110 -0
docs/speech_notes.md ADDED
@@ -0,0 +1,25 @@
+ # Notes from a discussion with a speech PhD student
+
+ # Processing Pipeline for Human Speech Recordings
+
+ Imagine you have 10-minute recordings of human speech (5 minutes of speech, 5 minutes of background noise). The typical processing pipeline involves:
+
+ 1. **Voice Activity Detection** (removing the non-speech parts; see the sketch below)
+ 2. **Feature Extraction** (from the relevant parts)
+ 3. **Classification**
+
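+ As a concrete illustration of step 1, here is a minimal sketch, not necessarily the approach discussed: it uses a simple energy threshold rather than a learned VAD, and the file path is a hypothetical placeholder.
+
+ ```python
+ import numpy as np
+ import librosa
+
+ # Load a recording at its native sample rate (path is a placeholder)
+ y, sr = librosa.load("recording_10min.wav", sr=None)
+
+ # Energy-based voice activity detection: keep segments within 30 dB of the
+ # loudest frame; everything quieter is treated as background noise.
+ intervals = librosa.effects.split(y, top_db=30)
+
+ # Keep only the detected speech segments for feature extraction
+ speech = np.concatenate([y[start:end] for start, end in intervals])
+ print(f"kept {speech.size / sr:.1f} s of {y.size / sr:.1f} s")
+ ```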
+ You can either use a specific model for each step, trained individually or pre-trained, or a single model that optimizes all three steps together with a combined loss function, which can give better results. The main focus is on feature extraction: comparing spectral/handcrafted features such as **MFCCs**, **vanilla spectrograms**, and **mel-spectrograms**, and low-level descriptors such as **eGeMAPS** (extracted with **openSMILE**), against embeddings from state-of-the-art deep learning models. These features are aggregated into vectors for classification, typically with an **MLP/transformer model**, to determine which gives the best performance under the chosen metrics. With a model like **WavLM**, you can probe every layer to find the most effective one for the task.
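+ A minimal sketch of the handcrafted-feature route, assuming librosa and scikit-learn; `wav_paths` and `labels` are hypothetical placeholders (one VAD-trimmed clip and one label per recording). For the WavLM comparison you would instead mean-pool each hidden layer's output over time and fit the same classifier once per layer.
+
+ ```python
+ import numpy as np
+ import librosa
+ from sklearn.neural_network import MLPClassifier
+ from sklearn.model_selection import cross_val_score
+
+ def handcrafted_features(path, n_mfcc=20, n_mels=64):
+     """Pool MFCCs and a log-mel spectrogram over time into one fixed-size vector."""
+     y, sr = librosa.load(path, sr=16000)
+     mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)           # (n_mfcc, T)
+     logmel = librosa.power_to_db(
+         librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_mels)    # (n_mels, T)
+     )
+     feats = np.vstack([mfcc, logmel])                                # same frame count T
+     return np.concatenate([feats.mean(axis=1), feats.std(axis=1)])   # time-pooled vector
+
+ # wav_paths / labels are placeholders for the VAD-trimmed clips and their classes
+ X = np.stack([handcrafted_features(p) for p in wav_paths])
+ clf = MLPClassifier(hidden_layer_sizes=(128,), max_iter=500)
+ print(cross_val_score(clf, X, labels, cv=5).mean())
+ ```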
+
+
+ # Processing Pipeline for Animal Sounds
+
+ Now, for animal sounds rather than human sounds, things get a bit more complicated. **MFCCs**, and even pre-trained models like **HuBERT**, are biased towards humans. As soon as the input features are on a **Mel** or **log-mel** scale, they are biased towards human perception and human frequency ranges, because the production of vocalizations in any animal (including humans) is fundamentally linked to the perception of vocalizations.
+
+ In this case, you can:
+
+ - Transform the frequency scale of your spectrogram to emphasize the vocalizations, for example using a **log-frequency spectrogram** or a **Bark-scale spectrogram** (see the sketch after this list).
+ - Use **GFCCs** (Greenwood Function Cepstral Coefficients) instead of MFCCs; they are less biased towards humans.
+ - Use **time-series** and **signal-processing features**, and stack them together to cover all possible spectral features, e.g. **HCTSA**, **catch22**, etc.
+ - Give the animal inputs to state-of-the-art models anyway and hope it works. The issue is that these models are often pre-trained on audio/speech down-sampled to an 8 kHz bandwidth (information above this frequency is discarded), while many animal vocalizations only start at 8-10 kHz.
+
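+ A minimal illustration of the first and last bullets, assuming librosa; `animal.wav` is a hypothetical file. A constant-Q transform gives a log-spaced frequency axis instead of a Mel warping, and loading at the native sample rate keeps whatever energy exists above 8 kHz. GFCCs and feature sets like HCTSA/catch22 need dedicated packages and are not shown here.
+
+ ```python
+ import numpy as np
+ import librosa
+
+ # Load at the native sample rate (sr=None) so high-frequency content survives
+ y, sr = librosa.load("animal.wav", sr=None)
+
+ # Constant-Q transform: log-spaced frequency bins, no human (Mel) warping.
+ # fmin is a placeholder; the octave count keeps the top bin safely below Nyquist.
+ fmin = 500.0
+ n_octaves = int(np.floor(np.log2(0.95 * (sr / 2) / fmin)))
+ C = np.abs(librosa.cqt(y, sr=sr, fmin=fmin, n_bins=12 * n_octaves, bins_per_octave=12))
+ C_db = librosa.amplitude_to_db(C, ref=np.max)
+ print(C_db.shape, "log-frequency bins x frames; Nyquist =", sr / 2, "Hz")
+ ```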
+ The general problem is that it depends a lot on the animal and its spectral range. It's impossible to have a set of features for all animals because the diversity is too great. But if it's just bird songs, well, birds are the simplest animals to classify in bioacoustics in my opinion. You just need to visualize the spectrograms, and you can recognize the patterns yourself by eye. I don't know what is considered the best feature today, but I don't think it's very complex.
python/notebooks/Untitled.py ADDED
@@ -0,0 +1,110 @@
+ # ---
+ # jupyter:
+ #   jupytext:
+ #     text_representation:
+ #       extension: .py
+ #       format_name: percent
+ #       format_version: '1.3'
+ #     jupytext_version: 1.16.1
+ #   kernelspec:
+ #     display_name: Python 3 (ipykernel)
+ #     language: python
+ #     name: python3
+ # ---
+
+ # %%
+ import os
+
+ import joblib
+ import librosa
+ import matplotlib.pyplot as plt
+ import numpy as np
+ from IPython.display import Audio, display
+ from sklearn.cluster import KMeans
+ from sklearn.decomposition import PCA
+ from sklearn.ensemble import RandomForestClassifier
+ from sklearn.model_selection import cross_val_score
+ from sklearn.preprocessing import StandardScaler
+
+ # %%
+ audio_dir = "../data/SoundMeters_Ingles_Primary"
+
+ # %%
+ features_dir = "../data/features"
+ os.makedirs(features_dir, exist_ok=True)
+
+ # %%
+ clusters_dir = "../data/clusters"
+ os.makedirs(clusters_dir, exist_ok=True)
+
+ # %%
+ # Parameters for windowing
+ window_size = 10  # window size in seconds
+ hop_size = 10  # hop size in seconds
+
+ # Define frequency bands (in Hz)
+ bands = {
+     "Sub-bass": (20, 60),
+     "Bass": (60, 250),
+     "Low Midrange": (250, 500),
+     "Midrange": (500, 2000),
+     "Upper Midrange": (2000, 4000),
+     "Presence": (4000, 6000),
+     "Brilliance": (6000, 20000),
+ }
+
+ # %%
+ # Iterate over each audio file in the directory
+ n_fft = 2048  # STFT size; frequency resolution is sr / n_fft per bin
+ for filename in os.listdir(audio_dir):
+     if filename.endswith(".wav"):
+         file_path = os.path.join(audio_dir, filename)
+         y, sr = librosa.load(file_path, sr=None)
+
+         # Convert window and hop size to samples
+         window_samples = int(window_size * sr)
+         hop_samples = int(hop_size * sr)
+
+         # Total number of windows in the current file
+         num_windows = (len(y) - window_samples) // hop_samples + 1
+
+         all_features = []
+
+         for i in range(num_windows):
+             start_sample = i * hop_samples
+             end_sample = start_sample + window_samples
+             y_window = y[start_sample:end_sample]
+
+             # Compute STFT (magnitude in dB)
+             S = librosa.stft(y_window, n_fft=n_fft)
+             S_db = librosa.amplitude_to_db(np.abs(S))
+
+             # Compute the mean dB energy per frame for each band
+             features = []
+             for band, (low_freq, high_freq) in bands.items():
+                 # FFT bin index = frequency * n_fft / sr, clamped to the available bins
+                 low_bin = int(np.floor(low_freq * n_fft / sr))
+                 high_bin = min(int(np.ceil(high_freq * n_fft / sr)), S.shape[0])
+                 band_energy = np.mean(S_db[low_bin:high_bin, :], axis=0)
+                 features.append(band_energy)
+
+             # Flatten the feature array and add to all_features
+             features_flat = np.concatenate(features)
+             all_features.append(features_flat)
+
+         # Convert to numpy array
+         all_features = np.array(all_features)
+
+         # Standardize features (note: the scaler is fit per file)
+         scaler = StandardScaler()
+         all_features = scaler.fit_transform(all_features)
+
+         # Save features and the fitted scaler to disk (a joblib pickle, despite the .npy suffix)
+         feature_file = os.path.join(
+             features_dir, f"{os.path.splitext(filename)[0]}_features.npy"
+         )
+         joblib.dump((all_features, scaler), feature_file)
+
+ # %%
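+ # Hypothetical follow-up sketch: one way the KMeans/PCA imports and clusters_dir
+ # above could be used with the saved features; the cluster count is an arbitrary
+ # placeholder, not a value from the original analysis.
+ for feature_name in os.listdir(features_dir):
+     if feature_name.endswith("_features.npy"):
+         all_features, scaler = joblib.load(os.path.join(features_dir, feature_name))
+
+         # Cluster the 10 s windows, then project to 2-D with PCA for plotting
+         labels = KMeans(n_clusters=3, n_init=10).fit_predict(all_features)
+         reduced = PCA(n_components=2).fit_transform(all_features)
+
+         plt.figure()
+         plt.scatter(reduced[:, 0], reduced[:, 1], c=labels)
+         plt.title(feature_name)
+         plt.savefig(
+             os.path.join(clusters_dir, feature_name.replace("_features.npy", "_clusters.png"))
+         )
+         plt.close()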