wsntxxn committed on
Commit
90bb951
1 Parent(s): 8aa4a8b

Update README.md

Files changed (1)
  1. README.md +53 -1
README.md CHANGED
@@ -6,4 +6,56 @@ language:
 - en
 ---
 [![Hugging Face Spaces](https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Spaces-blue)](https://huggingface.co/spaces/wsntxxn/efficient_audio_captioning)
-[![arXiv](https://img.shields.io/badge/arXiv-2407.14329-brightgreen.svg?style=flat-square)](https://arxiv.org/abs/2407.14329)
+[![arXiv](https://img.shields.io/badge/arXiv-2407.14329-brightgreen.svg?style=flat-square)](https://arxiv.org/abs/2407.14329)
+
+# Model Details
+This is a lightweight audio captioning model with an EfficientNet-B2 encoder and a two-layer Transformer decoder. It is trained on AudioCaps and unlabeled AudioSet.
+
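+For orientation, the sketch below shows one conventional way to wire an EfficientNet-B2 encoder to a two-layer Transformer decoder for captioning. It is illustrative only: the class name `EffB2CaptionModel`, the layer sizes, and the omission of positional encodings are our assumptions, not the checkpoint's actual remote code.
+```python
+# Hypothetical sketch of an EfficientNet-B2 + two-layer Transformer decoder
+# captioner; it does NOT mirror the released implementation.
+import torch
+import torch.nn as nn
+from efficientnet_pytorch import EfficientNet
+
+class EffB2CaptionModel(nn.Module):
+    def __init__(self, vocab_size: int, d_model: int = 256):
+        super().__init__()
+        # EfficientNet-B2 backbone applied to 1-channel log-mel spectrograms
+        self.encoder = EfficientNet.from_name("efficientnet-b2", in_channels=1)
+        self.proj = nn.Linear(1408, d_model)  # 1408 = B2 final feature width
+        layer = nn.TransformerDecoderLayer(d_model, nhead=4, batch_first=True)
+        self.decoder = nn.TransformerDecoder(layer, num_layers=2)
+        self.embed = nn.Embedding(vocab_size, d_model)
+        self.out = nn.Linear(d_model, vocab_size)
+
+    def forward(self, mel: torch.Tensor, tokens: torch.Tensor) -> torch.Tensor:
+        # mel: (batch, 1, n_mels, frames); tokens: (batch, length)
+        feats = self.encoder.extract_features(mel)            # (B, 1408, h, w)
+        memory = self.proj(feats.flatten(2).transpose(1, 2))  # (B, h*w, d_model)
+        L = tokens.size(1)
+        causal = torch.triu(  # causal mask for autoregressive decoding
+            torch.full((L, L), float("-inf"), device=tokens.device), diagonal=1
+        )
+        hidden = self.decoder(self.embed(tokens), memory, tgt_mask=causal)
+        return self.out(hidden)                               # (B, L, vocab) logits
+```
+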
+# Dependencies
+Install the dependencies required to run the model:
+```bash
+pip install numpy torch torchaudio einops transformers efficientnet_pytorch
+```
+
+# Usage
+```python
+import torch
+from transformers import AutoModel, PreTrainedTokenizerFast
+import torchaudio
+
+device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
+
+# use the model trained on AudioCaps
+model = AutoModel.from_pretrained(
+    "wsntxxn/effb2-trm-audiocaps-captioning",
+    trust_remote_code=True
+).to(device)
+tokenizer = PreTrainedTokenizerFast.from_pretrained(
+    "wsntxxn/audiocaps-simple-tokenizer"
+)
+
+# inference on a single audio clip
+wav, sr = torchaudio.load("/path/to/file.wav")
+wav = torchaudio.functional.resample(wav, sr, model.config.sample_rate)
+if wav.size(0) > 1:  # mix multi-channel input down to mono
+    wav = wav.mean(0).unsqueeze(0)
+with torch.no_grad():
+    word_idxs = model(
+        audio=wav,
+        audio_length=[wav.size(1)],
+    )
+caption = tokenizer.decode(word_idxs[0], skip_special_tokens=True)
+print(caption)
+
+# inference on a batch
+wav1, sr1 = torchaudio.load("/path/to/file1.wav")
+wav1 = torchaudio.functional.resample(wav1, sr1, model.config.sample_rate)
+wav1 = wav1.mean(0) if wav1.size(0) > 1 else wav1[0]
+wav2, sr2 = torchaudio.load("/path/to/file2.wav")
+wav2 = torchaudio.functional.resample(wav2, sr2, model.config.sample_rate)
+wav2 = wav2.mean(0) if wav2.size(0) > 1 else wav2[0]
+# pad to the longest clip so the batch forms a rectangular tensor
+wav_batch = torch.nn.utils.rnn.pad_sequence([wav1, wav2], batch_first=True)
+with torch.no_grad():
+    word_idxs = model(
+        audio=wav_batch,
+        audio_length=[wav1.size(0), wav2.size(0)],
+    )
+captions = tokenizer.batch_decode(word_idxs, skip_special_tokens=True)
+print(captions)
+```
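+
+The per-clip steps above can be folded into a small helper for repeated use. This is a convenience sketch built from the same calls shown in the block above; the helper name `caption_file` is ours, not part of the released code:
+```python
+def caption_file(path: str) -> str:
+    """Load one audio file, resample it, mix to mono, and caption it."""
+    wav, sr = torchaudio.load(path)
+    wav = torchaudio.functional.resample(wav, sr, model.config.sample_rate)
+    if wav.size(0) > 1:  # average channels down to mono
+        wav = wav.mean(0, keepdim=True)
+    with torch.no_grad():
+        word_idxs = model(audio=wav, audio_length=[wav.size(1)])
+    return tokenizer.decode(word_idxs[0], skip_special_tokens=True)
+
+print(caption_file("/path/to/file.wav"))
+```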