wsntxxn's picture
Update README.md
f405418 verified
---
license: apache-2.0
language:
- en
---
[![arXiv](https://img.shields.io/badge/arXiv-2306.01533-brightgreen.svg?style=flat-square)](https://arxiv.org/abs/2306.01533)
# Usage
```python
import torch
from transformers import AutoModel, PreTrainedTokenizerFast
import torchaudio
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = AutoModel.from_pretrained(
"wsntxxn/cnn14rnn-tempgru-audiocaps-captioning",
trust_remote_code=True
).to(device)
tokenizer = PreTrainedTokenizerFast.from_pretrained(
"wsntxxn/audiocaps-simple-tokenizer"
)
wav, sr = torchaudio.load("/path/to/file.wav")
wav = torchaudio.functional.resample(wav, sr, model.config.sample_rate)
if wav.size(0) > 1:
wav = wav.mean(0).unsqueeze(0)
with torch.no_grad():
word_idxs = model(
audio=wav,
audio_length=[wav.size(1)],
)
caption = tokenizer.decode(word_idxs[0], skip_special_tokens=True)
print(caption)
```
This will make the description as specific as possible.
You can also manually assign a temporal tag to control the specificity of temporal relationship description:
```python
with torch.no_grad():
word_idxs = model(
audio=wav,
audio_length=[wav.size(1)],
temporal_tag=[2], # desribe "sequential" if there are sequential events, otherwise use the most complex relationship
)
```
The temporal tag is defined as:
|Temporal Tag|Definition|
|----:|-----:|
|0|Only 1 Event|
|1|Simultaneous Events|
|2|Sequential Events|
|3|More Complex Events|
# Citation
If you find the model useful, please cite this paper:
```BibTeX
@inproceedings{xie2023enhance,
author = {Zeyu Xie and Xuenan Xu and Mengyue Wu and Kai Yu},
title = {Enhance Temporal Relations in Audio Captioning with Sound Event Detection},
year = 2023,
booktitle = {Proc. INTERSPEECH},
pages = {4179--4183},
}
```