mbressieux committed on
Commit
9afdd0a
1 Parent(s): f0c12aa

added notes from a discussion with a speech phd student

Files changed (2)
  1. docs/speech_notes.md +25 -0
  2. python/notebooks/Untitled.py +110 -0
docs/speech_notes.md ADDED
@@ -0,0 +1,25 @@
+ # Notes from a discussion with a speech PhD student
+
+ # Processing Pipeline for Human Speech Recordings
+
+ Imagine you have 10-minute recordings of human speech (5 minutes of speech, 5 minutes of background noise). The typical processing pipeline involves:
+
+ 1. **Voice Activity Detection** (removing the non-speech parts; see the sketch below)
+ 2. **Feature Extraction** (from the relevant parts)
+ 3. **Classification**
+
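+ As a concrete illustration of step 1, here is a minimal sketch, not necessarily the approach discussed: it uses a simple energy threshold rather than a learned VAD, and the file path is a hypothetical placeholder.
+
+ ```python
+ import numpy as np
+ import librosa
+
+ # Load a recording at its native sample rate (path is a placeholder)
+ y, sr = librosa.load("recording_10min.wav", sr=None)
+
+ # Energy-based voice activity detection: keep segments within 30 dB of the
+ # loudest frame; everything quieter is treated as background noise.
+ intervals = librosa.effects.split(y, top_db=30)
+
+ # Keep only the detected speech segments for feature extraction
+ speech = np.concatenate([y[start:end] for start, end in intervals])
+ print(f"kept {speech.size / sr:.1f} s of {y.size / sr:.1f} s")
+ ```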
+ You can either use a specific model for each step, trained individually or pre-trained, or a single model that optimizes all three steps together with a combined loss function, which can give better results. The main focus is on feature extraction: comparing spectral/handcrafted features such as **MFCCs**, **vanilla spectrograms**, and **mel-spectrograms**, and low-level descriptors such as **eGeMAPS** (extracted with **openSMILE**), against embeddings from state-of-the-art deep learning models. These features are aggregated into vectors for classification, typically with an **MLP/transformer model**, to determine which gives the best performance under the chosen metrics. With a model like **WavLM**, you can probe every layer to find the most effective one for the task.
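+ A minimal sketch of the handcrafted-feature route, assuming librosa and scikit-learn; `wav_paths` and `labels` are hypothetical placeholders (one VAD-trimmed clip and one label per recording). For the WavLM comparison you would instead mean-pool each hidden layer's output over time and fit the same classifier once per layer.
+
+ ```python
+ import numpy as np
+ import librosa
+ from sklearn.neural_network import MLPClassifier
+ from sklearn.model_selection import cross_val_score
+
+ def handcrafted_features(path, n_mfcc=20, n_mels=64):
+     """Pool MFCCs and a log-mel spectrogram over time into one fixed-size vector."""
+     y, sr = librosa.load(path, sr=16000)
+     mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)           # (n_mfcc, T)
+     logmel = librosa.power_to_db(
+         librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_mels)    # (n_mels, T)
+     )
+     feats = np.vstack([mfcc, logmel])                                # same frame count T
+     return np.concatenate([feats.mean(axis=1), feats.std(axis=1)])   # time-pooled vector
+
+ # wav_paths / labels are placeholders for the VAD-trimmed clips and their classes
+ X = np.stack([handcrafted_features(p) for p in wav_paths])
+ clf = MLPClassifier(hidden_layer_sizes=(128,), max_iter=500)
+ print(cross_val_score(clf, X, labels, cv=5).mean())
+ ```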
+
+
+ # Processing Pipeline for Animal Sounds
+
+ Now, for animal sounds rather than human sounds, things get a bit more complicated. **MFCCs**, and even pre-trained models like **HuBERT**, are biased towards humans. As soon as the input features are on a **Mel** or **log-mel** scale, they are biased towards human perception and human frequency ranges, because the production of vocalizations in any animal (including humans) is fundamentally linked to the perception of vocalizations.
+
+ In this case, you can:
+
+ - Transform the frequency scale of your spectrogram to emphasize the vocalizations, for example using a **log-frequency spectrogram** or a **Bark-scale spectrogram** (see the sketch after this list).
+ - Use **GFCCs** (Greenwood Function Cepstral Coefficients) instead of MFCCs; they are less biased towards humans.
+ - Use **time-series** and **signal-processing features**, and stack them together to cover all possible spectral features, e.g. **HCTSA**, **catch22**, etc.
+ - Give the animal inputs to state-of-the-art models anyway and hope it works. The issue is that these models are often pre-trained on audio/speech down-sampled to an 8 kHz bandwidth (information above this frequency is discarded), while many animal vocalizations only start at 8-10 kHz.
+
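+ A minimal illustration of the first and last bullets, assuming librosa; `animal.wav` is a hypothetical file. A constant-Q transform gives a log-spaced frequency axis instead of a Mel warping, and loading at the native sample rate keeps whatever energy exists above 8 kHz. GFCCs and feature sets like HCTSA/catch22 need dedicated packages and are not shown here.
+
+ ```python
+ import numpy as np
+ import librosa
+
+ # Load at the native sample rate (sr=None) so high-frequency content survives
+ y, sr = librosa.load("animal.wav", sr=None)
+
+ # Constant-Q transform: log-spaced frequency bins, no human (Mel) warping.
+ # fmin is a placeholder; the octave count keeps the top bin safely below Nyquist.
+ fmin = 500.0
+ n_octaves = int(np.floor(np.log2(0.95 * (sr / 2) / fmin)))
+ C = np.abs(librosa.cqt(y, sr=sr, fmin=fmin, n_bins=12 * n_octaves, bins_per_octave=12))
+ C_db = librosa.amplitude_to_db(C, ref=np.max)
+ print(C_db.shape, "log-frequency bins x frames; Nyquist =", sr / 2, "Hz")
+ ```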
+ The general problem is that it depends a lot on the animal and its spectral range. It's impossible to have a set of features for all animals because the diversity is too great. But if it's just bird songs, well, birds are the simplest animals to classify in bioacoustics in my opinion. You just need to visualize the spectrograms, and you can recognize the patterns yourself by eye. I don't know what is considered the best feature today, but I don't think it's very complex.
python/notebooks/Untitled.py ADDED
@@ -0,0 +1,110 @@
+ # ---
+ # jupyter:
+ #   jupytext:
+ #     text_representation:
+ #       extension: .py
+ #       format_name: percent
+ #       format_version: '1.3'
+ #     jupytext_version: 1.16.1
+ #   kernelspec:
+ #     display_name: Python 3 (ipykernel)
+ #     language: python
+ #     name: python3
+ # ---
+
+ # %%
+ import os
+
+ import joblib
+ import librosa
+ import matplotlib.pyplot as plt
+ import numpy as np
+ from IPython.display import Audio, display
+ from sklearn.cluster import KMeans
+ from sklearn.decomposition import PCA
+ from sklearn.ensemble import RandomForestClassifier
+ from sklearn.model_selection import cross_val_score
+ from sklearn.preprocessing import StandardScaler
+
+ # %%
+ audio_dir = "../data/SoundMeters_Ingles_Primary"
+
+ # %%
+ features_dir = "../data/features"
+ os.makedirs(features_dir, exist_ok=True)
+
+ # %%
+ clusters_dir = "../data/clusters"
+ os.makedirs(clusters_dir, exist_ok=True)
+
+ # %%
+ # Parameters for windowing
+ window_size = 10  # window size in seconds
+ hop_size = 10  # hop size in seconds
+
+ # Define frequency bands (in Hz)
+ bands = {
+     "Sub-bass": (20, 60),
+     "Bass": (60, 250),
+     "Low Midrange": (250, 500),
+     "Midrange": (500, 2000),
+     "Upper Midrange": (2000, 4000),
+     "Presence": (4000, 6000),
+     "Brilliance": (6000, 20000),
+ }
+
+ # %%
+ # Iterate over each audio file in the directory
+ n_fft = 2048  # STFT size; frequency resolution is sr / n_fft per bin
+ for filename in os.listdir(audio_dir):
+     if filename.endswith(".wav"):
+         file_path = os.path.join(audio_dir, filename)
+         y, sr = librosa.load(file_path, sr=None)
+
+         # Convert window and hop size to samples
+         window_samples = int(window_size * sr)
+         hop_samples = int(hop_size * sr)
+
+         # Total number of windows in the current file
+         num_windows = (len(y) - window_samples) // hop_samples + 1
+
+         all_features = []
+
+         for i in range(num_windows):
+             start_sample = i * hop_samples
+             end_sample = start_sample + window_samples
+             y_window = y[start_sample:end_sample]
+
+             # Compute STFT (magnitude in dB)
+             S = librosa.stft(y_window, n_fft=n_fft)
+             S_db = librosa.amplitude_to_db(np.abs(S))
+
+             # Compute the mean dB energy per frame for each band
+             features = []
+             for band, (low_freq, high_freq) in bands.items():
+                 # FFT bin index = frequency * n_fft / sr, clamped to the available bins
+                 low_bin = int(np.floor(low_freq * n_fft / sr))
+                 high_bin = min(int(np.ceil(high_freq * n_fft / sr)), S.shape[0])
+                 band_energy = np.mean(S_db[low_bin:high_bin, :], axis=0)
+                 features.append(band_energy)
+
+             # Flatten the feature array and add to all_features
+             features_flat = np.concatenate(features)
+             all_features.append(features_flat)
+
+         # Convert to numpy array
+         all_features = np.array(all_features)
+
+         # Standardize features (note: the scaler is fit per file)
+         scaler = StandardScaler()
+         all_features = scaler.fit_transform(all_features)
+
+         # Save features and the fitted scaler to disk (a joblib pickle, despite the .npy suffix)
+         feature_file = os.path.join(
+             features_dir, f"{os.path.splitext(filename)[0]}_features.npy"
+         )
+         joblib.dump((all_features, scaler), feature_file)
+
+ # %%
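+ # Hypothetical follow-up sketch: one way the KMeans/PCA imports and clusters_dir
+ # above could be used with the saved features; the cluster count is an arbitrary
+ # placeholder, not a value from the original analysis.
+ for feature_name in os.listdir(features_dir):
+     if feature_name.endswith("_features.npy"):
+         all_features, scaler = joblib.load(os.path.join(features_dir, feature_name))
+
+         # Cluster the 10 s windows, then project to 2-D with PCA for plotting
+         labels = KMeans(n_clusters=3, n_init=10).fit_predict(all_features)
+         reduced = PCA(n_components=2).fit_transform(all_features)
+
+         plt.figure()
+         plt.scatter(reduced[:, 0], reduced[:, 1], c=labels)
+         plt.title(feature_name)
+         plt.savefig(
+             os.path.join(clusters_dir, feature_name.replace("_features.npy", "_clusters.png"))
+         )
+         plt.close()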