wsntxxn committed
Commit 1a02037
Parent: a393d69

Update README.md

Files changed (1)
  1. README.md +64 -1
README.md CHANGED
@@ -4,4 +4,67 @@ language:
  - en
  ---
  [![Hugging Face Spaces](https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Spaces-blue)](https://huggingface.co/spaces/wsntxxn/efficient_audio_captioning)
- [![arXiv](https://img.shields.io/badge/arXiv-2407.14329-brightgreen.svg?style=flat-square)](https://arxiv.org/abs/2407.14329)
+ [![arXiv](https://img.shields.io/badge/arXiv-2407.14329-brightgreen.svg?style=flat-square)](https://arxiv.org/abs/2407.14329)
+
+ # Model Details
+ This is a lightweight audio captioning model, with an EfficientNet-B2 encoder and a two-layer Transformer decoder. The model is trained on Clotho and unlabeled AudioSet.
+
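+ As a quick check on the "lightweight" claim, you can count parameters once the checkpoint is loaded. This is a minimal sketch, assuming the same `AutoModel` loading call shown under Usage below:
+ ```python
+ from transformers import AutoModel
+
+ model = AutoModel.from_pretrained(
+     "wsntxxn/effb2-trm-clotho-captioning",
+     trust_remote_code=True,
+ )
+ # sum over all parameter tensors, reported in millions
+ n_params = sum(p.numel() for p in model.parameters())
+ print(f"{n_params / 1e6:.1f}M parameters")
+ ```
+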
+ # Dependencies
+ Install the dependencies required to run the model:
+ ```bash
+ pip install numpy torch torchaudio einops transformers efficientnet_pytorch
+ ```
+
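+ A quick way to confirm the installation succeeded (a minimal sketch; the package list simply mirrors the pip command above):
+ ```python
+ # these imports match the packages installed above
+ import numpy, torch, torchaudio, einops, transformers, efficientnet_pytorch
+ print(torch.__version__, torchaudio.__version__, transformers.__version__)
+ ```
+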
+ # Usage
+ ```python
+ import torch
+ import torchaudio
+ from transformers import AutoModel, PreTrainedTokenizerFast
+
+ device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
+
+ # load the model trained on Clotho, with its matching tokenizer
+ model = AutoModel.from_pretrained(
+     "wsntxxn/effb2-trm-clotho-captioning",
+     trust_remote_code=True
+ ).to(device)
+ tokenizer = PreTrainedTokenizerFast.from_pretrained(
+     "wsntxxn/clotho-simple-tokenizer"
+ )
+
+ # inference on a single audio clip
+ wav, sr = torchaudio.load("/path/to/file.wav")
+ # resample to the sample rate the model expects
+ wav = torchaudio.functional.resample(wav, sr, model.config.sample_rate)
+ # downmix multi-channel audio to mono, keeping a (1, n_samples) shape
+ if wav.size(0) > 1:
+     wav = wav.mean(0).unsqueeze(0)
+
+ with torch.no_grad():
+     word_idxs = model(
+         audio=wav.to(device),  # move the input onto the model's device
+         audio_length=[wav.size(1)],
+     )
+
+ caption = tokenizer.decode(word_idxs[0], skip_special_tokens=True)
+ print(caption)
+
+ # inference on a batch
+ wav1, sr1 = torchaudio.load("/path/to/file1.wav")
+ wav1 = torchaudio.functional.resample(wav1, sr1, model.config.sample_rate)
+ wav1 = wav1.mean(0) if wav1.size(0) > 1 else wav1[0]
+
+ wav2, sr2 = torchaudio.load("/path/to/file2.wav")
+ wav2 = torchaudio.functional.resample(wav2, sr2, model.config.sample_rate)
+ wav2 = wav2.mean(0) if wav2.size(0) > 1 else wav2[0]
+
+ # zero-pad the shorter clip so both stack into one (batch, n_samples) tensor
+ wav_batch = torch.nn.utils.rnn.pad_sequence([wav1, wav2], batch_first=True)
+
+ with torch.no_grad():
+     word_idxs = model(
+         audio=wav_batch.to(device),
+         # pass each clip's true length, measured before padding
+         audio_length=[wav1.size(0), wav2.size(0)],
+     )
+
+ captions = tokenizer.batch_decode(word_idxs, skip_special_tokens=True)
+ print(captions)
+ ```
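+
+ The batch recipe above generalizes to any number of files. Below is a small convenience wrapper that reuses the `model`, `tokenizer`, and `device` objects from the example; it is a sketch rather than part of the released code, and `caption_files` is a hypothetical helper name:
+ ```python
+ import torch
+ import torchaudio
+
+ def caption_files(paths, model, tokenizer, device):
+     """Caption a list of audio files in one padded batch (hypothetical helper)."""
+     wavs = []
+     for p in paths:
+         wav, sr = torchaudio.load(p)
+         wav = torchaudio.functional.resample(wav, sr, model.config.sample_rate)
+         # downmix to mono, shape (n_samples,)
+         wav = wav.mean(0) if wav.size(0) > 1 else wav[0]
+         wavs.append(wav)
+     # true lengths before zero-padding
+     lengths = [w.size(0) for w in wavs]
+     batch = torch.nn.utils.rnn.pad_sequence(wavs, batch_first=True).to(device)
+     with torch.no_grad():
+         word_idxs = model(audio=batch, audio_length=lengths)
+     return tokenizer.batch_decode(word_idxs, skip_special_tokens=True)
+
+ print(caption_files(["/path/to/file1.wav", "/path/to/file2.wav"], model, tokenizer, device))
+ ```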