wsntxxn committed
Commit 1a02037
Parent: a393d69

Update README.md

Files changed (1)
  1. README.md +64 -1
README.md CHANGED
@@ -4,4 +4,67 @@ language:
  - en
  ---
  [![Hugging Face Spaces](https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Spaces-blue)](https://huggingface.co/spaces/wsntxxn/efficient_audio_captioning)
- [![arXiv](https://img.shields.io/badge/arXiv-2407.14329-brightgreen.svg?style=flat-square)](https://arxiv.org/abs/2407.14329)
+ [![arXiv](https://img.shields.io/badge/arXiv-2407.14329-brightgreen.svg?style=flat-square)](https://arxiv.org/abs/2407.14329)
+
+ # Model Details
+ This is a lightweight audio captioning model, with an EfficientNet-B2 encoder and a two-layer Transformer decoder. The model is trained on Clotho and unlabeled AudioSet.
+
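+ As a quick check on the "lightweight" claim, you can count parameters once the checkpoint is loaded. This is a minimal sketch, assuming the same `AutoModel` loading call shown under Usage below:
+ ```python
+ from transformers import AutoModel
+
+ model = AutoModel.from_pretrained(
+     "wsntxxn/effb2-trm-clotho-captioning",
+     trust_remote_code=True,
+ )
+ # sum over all parameter tensors, reported in millions
+ n_params = sum(p.numel() for p in model.parameters())
+ print(f"{n_params / 1e6:.1f}M parameters")
+ ```
+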
+ # Dependencies
+ Install the dependencies required to run the model:
+ ```bash
+ pip install numpy torch torchaudio einops transformers efficientnet_pytorch
+ ```
+
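+ A quick way to confirm the installation succeeded (a minimal sketch; the package list simply mirrors the pip command above):
+ ```python
+ # these imports match the packages installed above
+ import numpy, torch, torchaudio, einops, transformers, efficientnet_pytorch
+ print(torch.__version__, torchaudio.__version__, transformers.__version__)
+ ```
+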
+ # Usage
+ ```python
+ import torch
+ import torchaudio
+ from transformers import AutoModel, PreTrainedTokenizerFast
+
+ device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
+
+ # load the model trained on Clotho, with its matching tokenizer
+ model = AutoModel.from_pretrained(
+     "wsntxxn/effb2-trm-clotho-captioning",
+     trust_remote_code=True
+ ).to(device)
+ tokenizer = PreTrainedTokenizerFast.from_pretrained(
+     "wsntxxn/clotho-simple-tokenizer"
+ )
+
+ # inference on a single audio clip
+ wav, sr = torchaudio.load("/path/to/file.wav")
+ # resample to the sample rate the model expects
+ wav = torchaudio.functional.resample(wav, sr, model.config.sample_rate)
+ # downmix multi-channel audio to mono, keeping a (1, n_samples) shape
+ if wav.size(0) > 1:
+     wav = wav.mean(0).unsqueeze(0)
+
+ with torch.no_grad():
+     word_idxs = model(
+         audio=wav.to(device),  # move the input onto the model's device
+         audio_length=[wav.size(1)],
+     )
+
+ caption = tokenizer.decode(word_idxs[0], skip_special_tokens=True)
+ print(caption)
+
+ # inference on a batch
+ wav1, sr1 = torchaudio.load("/path/to/file1.wav")
+ wav1 = torchaudio.functional.resample(wav1, sr1, model.config.sample_rate)
+ wav1 = wav1.mean(0) if wav1.size(0) > 1 else wav1[0]
+
+ wav2, sr2 = torchaudio.load("/path/to/file2.wav")
+ wav2 = torchaudio.functional.resample(wav2, sr2, model.config.sample_rate)
+ wav2 = wav2.mean(0) if wav2.size(0) > 1 else wav2[0]
+
+ # zero-pad the shorter clip so both stack into one (batch, n_samples) tensor
+ wav_batch = torch.nn.utils.rnn.pad_sequence([wav1, wav2], batch_first=True)
+
+ with torch.no_grad():
+     word_idxs = model(
+         audio=wav_batch.to(device),
+         # pass each clip's true length, measured before padding
+         audio_length=[wav1.size(0), wav2.size(0)],
+     )
+
+ captions = tokenizer.batch_decode(word_idxs, skip_special_tokens=True)
+ print(captions)
+ ```
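+
+ The batch recipe above generalizes to any number of files. Below is a small convenience wrapper that reuses the `model`, `tokenizer`, and `device` objects from the example; it is a sketch rather than part of the released code, and `caption_files` is a hypothetical helper name:
+ ```python
+ import torch
+ import torchaudio
+
+ def caption_files(paths, model, tokenizer, device):
+     """Caption a list of audio files in one padded batch (hypothetical helper)."""
+     wavs = []
+     for p in paths:
+         wav, sr = torchaudio.load(p)
+         wav = torchaudio.functional.resample(wav, sr, model.config.sample_rate)
+         # downmix to mono, shape (n_samples,)
+         wav = wav.mean(0) if wav.size(0) > 1 else wav[0]
+         wavs.append(wav)
+     # true lengths before zero-padding
+     lengths = [w.size(0) for w in wavs]
+     batch = torch.nn.utils.rnn.pad_sequence(wavs, batch_first=True).to(device)
+     with torch.no_grad():
+         word_idxs = model(audio=batch, audio_length=lengths)
+     return tokenizer.batch_decode(word_idxs, skip_special_tokens=True)
+
+ print(caption_files(["/path/to/file1.wav", "/path/to/file2.wav"], model, tokenizer, device))
+ ```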