janwinkler1 committed • commit 7235148 • parent: 40fa82b

eda jan, second push, adapted to wav files (#2)

Files changed:
- README.md (+19, -11)
- docs/docker-setup.md (+8, -2)
- docs/huggingface-spaces.md (+7, -1)
- format (+7, -0)
- python/eda_jan.py (+148, -62)
- python/requirements.txt (+2, -0)
README.md
CHANGED
@@ -1,7 +1,8 @@
# Source of truth

## welcome and initial setup

hi all,
i think when we start with the EDA, it suffices if everyone just uses what they are used to (e.g. conda or whatever). However, afterwards, i think it could be helpful that everyone always has exactly the same environment, same package/python versions, which is why i propose working with docker to minimize headaches and "but it works on my machine" issues. I think with this minimal setup below, we can fully focus on hacking without getting stuck on painful stuff.

please feel free to add / change / challenge things!

@@ -11,35 +12,39 @@ please feel free to add / change / challenge things!
- read how to set up docker for mac [here](https://docs.docker.com/desktop/install/mac-install/)

### how docker compose works

essentially, you just have to build the container with the services you want. if you're interested, i can go into more detail; just let me know.

1. navigate to dc/dev and run:

```
docker compose up -d --build
```

only use the `--build` flag the first time around, or if you want to rebuild the container (e.g. when you have added a package you need in the container). **NOTE:** the `-d` flag stands for `detach`, which means that your docker container runs in the background and does not log everything to your console.

2. then, to check whether everything worked, hit:

```
docker ps
```

now you should see the running docker containers.

3. for this specific setup, you can head to `localhost:8888` where jupyterlab is running.

4. to create a new file (using jupytext, see below), just create a new .ipynb file; the .py file will be created automatically. all the changes you make in the notebook will be reflected in the .py files, which you can then use for your commits.

### what about huggingface spaces

- [here](./docs/huggingface-spaces.md), you can see what huggingface spaces is and how we can complement our github repo with it (credits to chat-gpt)

### jupytext - nice versioning of jupyter notebooks

since we will likely be working with jupyter notebooks a lot, let's use jupytext. It automatically maps .ipynb to .py files with some magic. The .ipynb files are in the gitignore, so we only have .py files nicely versioned in the repo. read more about it [here](https://jupytext.readthedocs.io/en/latest/)

### trunk based development

let's stick to trunk based. if you don't know what it is, read all about it [here](https://trunkbaseddevelopment.com/)

key takeaways:

@@ -49,7 +54,7 @@ key takeaways:
1. **Single Main Branch**: All developers commit to the trunk or main branch.
2. **Short-Lived Branches**: Branches, if used, are short-lived and quickly merged back.
3. **Frequent Integrations**: Code changes are integrated frequently, often multiple times a day.
4. **Feature Flags**: Incomplete features are managed with feature flags to maintain trunk stability.

#### Benefits

@@ -70,6 +75,9 @@ key takeaways:
- **Feature Flags**: Manage incomplete or experimental features.
- **Code Reviews**: Maintain quality and knowledge sharing.

### code format

- let's stick to black for python and prettier for .md and other formats
- using docker for the purpose of formatting is really easy
- just `chmod +x format` so that `format` is executable
- then simply run `./format` before adding your changes and all the files will be autoformatted
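The jupytext pairing described in the README maps each notebook cell to a `# %%` marker in the paired `.py` file. As a rough stdlib-only sketch of that mapping (illustrative only — the real jupytext also handles markdown cells, metadata, and round-tripping back to .ipynb), a percent-format script splits into cells like this:

```python
def split_percent_cells(script_text):
    """Split a jupytext percent-format script into a list of cell sources.

    Illustrative sketch, not the actual jupytext implementation.
    """
    cells, current = [], []
    for line in script_text.splitlines():
        if line.startswith("# %%"):  # a new cell begins at each marker
            if current:
                cells.append("\n".join(current).strip())
            current = []
        else:
            current.append(line)
    if current:
        cells.append("\n".join(current).strip())
    return [c for c in cells if c]


example = "# %%\nimport numpy as np\n\n# %%\nprint(np.zeros(3))\n"
print(split_percent_cells(example))
```

Committing only the `.py` side keeps diffs readable, since cell outputs never enter version control.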
docs/docker-setup.md
CHANGED
@@ -1,9 +1,11 @@
# DockerSetup

This is a short version of how to install Docker on Ubuntu 20.04. It is basically a summary of the following source:\

- https://docs.docker.com/install/linux/docker-ce/ubuntu/#os-requirements

## Install Docker

```bash
sudo apt update
sudo apt-get install \
@@ -24,15 +26,19 @@ sudo apt install docker-ce
```

## Add user to "docker" group & re-login

Add yourself to the docker group (to get access to the docker daemon socket)\
note: being a member of docker gives you root access via the docker daemon

```bash
sudo usermod -a -G docker `whoami`
```

Logout / Login to load the new group rights\
Using Ubuntu Gnome it may be required to restart

## Install Docker-Compose

```bash
sudo apt install python3-pip
sudo pip3 install docker-compose
docs/huggingface-spaces.md
CHANGED
@@ -3,6 +3,7 @@ Hugging Face Spaces and GitHub repositories serve different but complementary pu
### Comparison with GitHub Repositories

- **GitHub Repository**:

  - **Purpose**: Primarily used for version control, collaboration, and sharing of code and projects.
  - **Capabilities**: Stores code, tracks changes, manages issues, and supports CI/CD pipelines.
  - **Usage**: Developers collaborate on software development projects, manage codebases, and deploy applications.

@@ -17,12 +18,15 @@ Hugging Face Spaces and GitHub repositories serve different but complementary pu
You can import a GitHub repository into Hugging Face Spaces to deploy an application hosted on GitHub. Here’s how to do it:

1. **Create a Space on Hugging Face**:

   - Go to the Hugging Face Spaces website and create a new Space.

2. **Link to GitHub Repository**:

   - During the setup of the new Space, you can link it to a GitHub repository. This allows Hugging Face Spaces to pull the code from your GitHub repo.

3. **Configure Your Space**:

   - Ensure your repository contains the necessary files for the framework you are using (Streamlit, Gradio, or HTML/CSS/JS).
   - For example, if you are using Streamlit, ensure you have a `requirements.txt` file for dependencies and a main Python script that runs the Streamlit app.

@@ -34,15 +38,17 @@ You can import a GitHub repository into Hugging Face Spaces to deploy an applica
1. **Create a New Space**:
   - Navigate to Hugging Face Spaces and click on “New Space”.
2. **Set Up Space**:

   - Choose a name for your Space, select the appropriate SDK (e.g., Streamlit, Gradio, or HTML), and choose the visibility (public or private).

3. **Connect GitHub Repository**:

   - In the Space settings, you will find an option to link a GitHub repository. Provide the URL of your GitHub repository.
   - Hugging Face Spaces will clone your GitHub repository to use it as the source code for your Space.

4. **Configure and Deploy**:

   - Make sure your GitHub repository is set up correctly for the chosen framework. For example, a Streamlit app should have a `requirements.txt` and an entry-point script like `app.py`.
   - Once everything is set up, your Space will be deployed and can be accessed via a URL provided by Hugging Face.
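The checklist above (a `requirements.txt` plus an entry-point script for a Streamlit Space) can be verified locally before linking the repo. A minimal sketch under those assumptions — `space_ready` is a hypothetical helper, not a Hugging Face API, and Spaces may accept other layouts:

```python
import os


def space_ready(repo_dir, entry_point="app.py"):
    """Check that a repo contains the files a Streamlit Space expects.

    Hypothetical helper mirroring the checklist in the text; returns
    (ok, missing_files).
    """
    required = ["requirements.txt", entry_point]
    missing = [f for f in required if not os.path.isfile(os.path.join(repo_dir, f))]
    return len(missing) == 0, missing
```

Running it on the repo root before creating the Space avoids a failed first deploy.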
format
ADDED
@@ -0,0 +1,7 @@
#!/bin/sh

# Run as root, otherwise the container cannot modify the mounted files.
docker run --rm --user root --volume $(pwd):/work tmknom/prettier prettier --loglevel warn --write .

# Format Python files using Black
docker run --rm --user root --volume $(pwd):/data cytopia/black:latest .
python/eda_jan.py
CHANGED
@@ -20,7 +20,7 @@ import librosa.display
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from IPython.display import Audio, display

# %%
# Load the entire audio file
@@ -32,7 +32,7 @@ y, sr = librosa.load(file_path, sr=44100)
# %%
# split soundfile into 10s chunks
window_size = 10  # window size in seconds
hop_size = 10  # hop size in seconds

# Convert window and hop size to samples
window_samples = int(window_size * sr)
@@ -47,13 +47,13 @@ print(f"Total number of windows: {num_windows}")
# %%
# Define frequency bands (in Hz)
bands = {
    "Sub-bass": (20, 60),
    "Bass": (60, 250),
    "Low Midrange": (250, 500),
    "Midrange": (500, 2000),
    "Upper Midrange": (2000, 4000),
    "Presence": (4000, 6000),
    "Brilliance": (6000, 20000),
}

# Initialize a list to hold the features
@@ -67,7 +67,7 @@ for i in range(num_windows):
    # Compute STFT
    S = librosa.stft(y_window)
    S_db = librosa.amplitude_to_db(np.abs(S))

    # Compute features for each band
    features = []
    for band, (low_freq, high_freq) in bands.items():
@@ -75,7 +75,7 @@ for i in range(num_windows):
        high_bin = int(np.ceil(high_freq * (S.shape[0] / sr)))
        band_energy = np.mean(S_db[low_bin:high_bin, :], axis=0)
        features.append(band_energy)

    # Flatten the feature array and add to all_features
    features_flat = np.concatenate(features)
    all_features.append(features_flat)
@@ -95,11 +95,13 @@ clusters = kmeans.fit_predict(features_reduced)

# Plot the clusters
plt.figure(figsize=(10, 6))
scatter = plt.scatter(
    features_reduced[:, 0], features_reduced[:, 1], c=clusters, cmap="viridis"
)
plt.title("Clustered Frequency Band Features")
plt.xlabel("Principal Component 1")
plt.ylabel("Principal Component 2")
plt.colorbar(scatter, label="Cluster")
plt.show()

@@ -108,12 +110,12 @@ plt.show()
for cluster_label in np.unique(clusters):
    # Find the first data point in the cluster
    representative_index = np.where(clusters == cluster_label)[0][0]

    # Use the original audio window at the representative index
    start_sample = representative_index * hop_samples
    end_sample = start_sample + window_samples
    y_representative = y[start_sample:end_sample]

    print(f"Cluster {cluster_label} representative audio:")
    display(Audio(data=y_representative, rate=sr))

@@ -127,23 +129,36 @@ import numpy as np
import librosa
from sklearn.preprocessing import StandardScaler
import joblib
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt
from IPython.display import Audio, display
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestClassifier


# %%
# Directory containing the audio files
# audio_dir = "data/soundscape_data"
audio_dir = (
    "data/SoundMeters_Ingles_Primary-20240519T132658Z-002/SoundMeters_Ingles_Primary"
)

# Parameters for windowing
window_size = 10  # window size in seconds
hop_size = 10  # hop size in seconds

# Define frequency bands (in Hz)
bands = {
    "Sub-bass": (20, 60),
    "Bass": (60, 250),
    "Low Midrange": (250, 500),
    "Midrange": (500, 2000),
    "Upper Midrange": (2000, 4000),
    "Presence": (4000, 6000),
    "Brilliance": (6000, 20000),
}

# Directory to save features
@@ -152,10 +167,10 @@ os.makedirs(features_dir, exist_ok=True)

# Iterate over each audio file in the directory
for filename in os.listdir(audio_dir):
    if filename.endswith(".wav"):
        file_path = os.path.join(audio_dir, filename)
        y, sr = librosa.load(file_path, sr=44100)

        # Convert window and hop size to samples
        window_samples = int(window_size * sr)
        hop_samples = int(hop_size * sr)
@@ -194,19 +209,17 @@ for filename in os.listdir(audio_dir):
        all_features = scaler.fit_transform(all_features)

        # Save features to disk
        feature_file = os.path.join(
            features_dir, f"{os.path.splitext(filename)[0]}_features.npy"
        )
        joblib.dump((all_features, scaler), feature_file)


# %%
# Directory to load features
features_dir = "features"
n_clusters = 5

# Load all features
all_features = []
@@ -223,53 +236,50 @@ pca = PCA(n_components=2)
features_pca = pca.fit_transform(all_features)

# Perform k-means clustering
kmeans = KMeans(n_clusters=n_clusters)  # Example: 5 clusters
clusters = kmeans.fit_predict(all_features)

# Plot the PCA-reduced features with cluster labels
plt.figure(figsize=(10, 6))
scatter = plt.scatter(features_pca[:, 0], features_pca[:, 1], c=clusters, cmap="viridis")
plt.title("PCA of Clustered Frequency Band Features")
plt.xlabel("Principal Component 1")
plt.ylabel("Principal Component 2")
plt.colorbar(scatter, label="Cluster")
plt.show()

# Save clustering results
clustering_results = {"clusters": clusters, "kmeans": kmeans, "pca": pca}
joblib.dump(clustering_results, "clustering_results.pkl")

# Plot the clusters
plt.figure(figsize=(10, 6))
for i in range(n_clusters):
    plt.plot(all_features[clusters == i].mean(axis=0), label=f"Cluster {i}")
plt.legend()
plt.title("Clustered Frequency Band Features")
plt.xlabel("Feature Index (Frequency Bands)")
plt.ylabel("Mean Feature Value (Energy in dB)")
plt.show()

# %%
# Directory containing the audio files
# audio_dir = "data/soundscape_data"
audio_dir = (
    "data/SoundMeters_Ingles_Primary-20240519T132658Z-002/SoundMeters_Ingles_Primary"
)
# Directory to load features
features_dir = "features"

# Parameters for windowing
window_size = 10  # window size in seconds
hop_size = 10  # hop size in seconds

# Load clustering results
clustering_results = joblib.load("clustering_results.pkl")
clusters = clustering_results["clusters"]

# Load all features
all_features = []
@@ -278,7 +288,7 @@ audio_segments = []
for feature_file in os.listdir(features_dir):
    if feature_file.endswith("_features.npy"):
        features, scaler = joblib.load(os.path.join(features_dir, feature_file))
        filename = feature_file.replace("_features.npy", ".wav")
        file_path = os.path.join(audio_dir, filename)
        y, sr = librosa.load(file_path, sr=44100)
@@ -318,3 +328,79 @@ for cluster_label in np.unique(clusters):

# %%
# Fit PCA
pca = PCA().fit(all_features_scaled)

# Method 1: Variance Explained
explained_variance = pca.explained_variance_ratio_
cumulative_explained_variance = np.cumsum(explained_variance)

# Plot the cumulative explained variance
plt.figure(figsize=(10, 6))
plt.plot(cumulative_explained_variance, marker="o")
plt.xlabel("Number of Principal Components")
plt.ylabel("Cumulative Explained Variance")
plt.title("Explained Variance vs. Number of Principal Components")
plt.grid(True)
plt.show()

# %%
# Method 2: Scree Plot
plt.figure(figsize=(10, 6))
plt.plot(np.arange(1, len(explained_variance) + 1), explained_variance, marker="o")
plt.xlabel("Principal Component")
plt.ylabel("Explained Variance")
plt.title("Scree Plot")
plt.grid(True)
plt.show()

# %%
# Method 3: Kaiser Criterion
eigenvalues = pca.explained_variance_
kaiser_criterion = np.sum(eigenvalues > 1)

# IMO this doesn't make sense at the moment, we need to extract more features
print(f"Number of components selected by Kaiser Criterion: {kaiser_criterion}")

# %%
# Method 4: Cross-Validation
# Evaluate a classifier with different numbers of principal components

## do not run if you don't have time, this takes forever.
# scores = []
# for n_components in range(1, len(explained_variance) + 1):
#     pca = PCA(n_components=n_components)
#     features_pca = pca.fit_transform(all_features_scaled)
#     classifier = RandomForestClassifier()  # Use your preferred model here
#     score = np.mean(cross_val_score(classifier, features_pca, clusters, cv=n_clusters))  # Assuming `clusters` are your labels
#     scores.append(score)

# # Plot cross-validation scores
# plt.figure(figsize=(10, 6))
# plt.plot(range(1, len(explained_variance) + 1), scores, marker="o")
# plt.xlabel("Number of Principal Components")
# plt.ylabel("Cross-Validation Score")
# plt.title("Cross-Validation Score vs. Number of Principal Components")
# plt.grid(True)
# plt.show()

# # Choosing the number of components that explain at least 95% of the variance
# n_components_variance = np.argmax(cumulative_explained_variance >= 0.95) + 1
# print(f"Number of components to retain 95% variance: {n_components_variance}")

# # Choose the optimal number of components based on your analysis
# optimal_n_components = n_components_variance  # or based on the scree plot, cross-validation, etc.
# print(f"Optimal number of components: {optimal_n_components}")

# # Perform PCA with the selected number of components
# pca = PCA(n_components=optimal_n_components)
# features_pca = pca.fit_transform(all_features_scaled)

# %%
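The band-feature loop in `eda_jan.py` converts each band's Hz limits into STFT row indices using the factor `S.shape[0] / sr`. A plain-Python sketch of that mapping — `band_to_bins` is an illustrative helper, with `n_freq_bins` playing the role of `S.shape[0]` (i.e. `1 + n_fft // 2`):

```python
import math


def band_to_bins(low_hz, high_hz, n_freq_bins, sr):
    """Map a frequency band in Hz to STFT row indices.

    Mirrors the low_bin/high_bin computation in eda_jan.py. Note the
    script's scale factor n_freq_bins / sr is an approximation: STFT rows
    actually span 0 .. sr/2, so the exact scale would be
    (n_freq_bins - 1) / (sr / 2).
    """
    low_bin = int(math.floor(low_hz * (n_freq_bins / sr)))
    high_bin = int(math.ceil(high_hz * (n_freq_bins / sr)))
    return low_bin, high_bin


# With librosa's default n_fft=2048 at sr=44100, S.shape[0] == 1025
print(band_to_bins(60, 250, 1025, 44100))
```

With the default FFT size, the narrow low bands map to only a handful of rows, which is worth keeping in mind when averaging `S_db[low_bin:high_bin, :]`.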
python/requirements.txt
CHANGED
@@ -8,3 +8,5 @@ numpy
opencv-python
scipy
librosa
plotly
ipywidgets