
	---
	license: apache-2.0
	language:
	- en
	---
	[![arXiv](https://img.shields.io/badge/arXiv-2306.01533-brightgreen.svg?style=flat-square)](https://arxiv.org/abs/2306.01533)

	# Usage
	```python
	import torch
	import torchaudio
	from transformers import AutoModel, PreTrainedTokenizerFast

	device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

	# The model's custom code lives in the model repo, hence trust_remote_code=True.
	model = AutoModel.from_pretrained(
	    "wsntxxn/cnn14rnn-tempgru-audiocaps-captioning",
	    trust_remote_code=True
	).to(device)
	tokenizer = PreTrainedTokenizerFast.from_pretrained(
	    "wsntxxn/audiocaps-simple-tokenizer"
	)

	# Load the audio, resample it to the model's sample rate, and downmix to mono.
	wav, sr = torchaudio.load("/path/to/file.wav")
	wav = torchaudio.functional.resample(wav, sr, model.config.sample_rate)
	if wav.size(0) > 1:
	    wav = wav.mean(0).unsqueeze(0)
	wav = wav.to(device)  # keep the input on the same device as the model

	with torch.no_grad():
	    word_idxs = model(
	        audio=wav,
	        audio_length=[wav.size(1)],
	    )

	caption = tokenizer.decode(word_idxs[0], skip_special_tokens=True)
	print(caption)
	```
	Without a temporal tag, the model makes the description of temporal relationships as specific as possible.

	You can also assign a temporal tag manually to control how specifically temporal relationships are described:
	```python
	with torch.no_grad():
	    word_idxs = model(
	        audio=wav,
	        audio_length=[wav.size(1)],
	        # describe "sequential" if there are sequential events,
	        # otherwise use the most complex relationship
	        temporal_tag=[2],
	    )
	```
	The temporal tags are defined as:

	|Temporal Tag|Definition|
	|----:|-----:|
	|0|Only 1 Event|
	|1|Simultaneous Events|
	|2|Sequential Events|
	|3|More Complex Events|
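
The tag values in the table can be given readable names so that calling code does not depend on bare integers. The following is a minimal sketch: `TemporalTag` and `make_temporal_tags` are hypothetical helpers, not part of the model's code, and the assumption that `temporal_tag` takes one integer per audio clip is inferred only from the list form used in the usage example.

```python
from enum import IntEnum


class TemporalTag(IntEnum):
    """Temporal tag values from the table (member names are illustrative)."""
    SINGLE_EVENT = 0   # Only 1 Event
    SIMULTANEOUS = 1   # Simultaneous Events
    SEQUENTIAL = 2     # Sequential Events
    COMPLEX = 3        # More Complex Events


def make_temporal_tags(tags):
    """Validate tags and return a plain list of ints (assumed: one per clip)."""
    result = []
    for tag in tags:
        # TemporalTag(...) raises ValueError for anything outside 0-3.
        result.append(int(TemporalTag(tag)))
    return result


print(make_temporal_tags([TemporalTag.SEQUENTIAL]))  # [2]
```

The returned list could then be passed as `temporal_tag=make_temporal_tags([TemporalTag.SEQUENTIAL])` in the model call, which is equivalent to `temporal_tag=[2]`.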


	# Citation
	If you find the model useful, please cite this paper:
	```BibTeX
	@inproceedings{xie2023enhance,
	    author = {Zeyu Xie and Xuenan Xu and Mengyue Wu and Kai Yu},
	    title = {Enhance Temporal Relations in Audio Captioning with Sound Event Detection},
	    year = 2023,
	    booktitle = {Proc. INTERSPEECH},
	    pages = {4179--4183},
	}
	```