---
license: mit
language:
- en
---

# Grounded-VideoLLM Model Card

Grounded-VideoLLM is a Video-LLM adept at fine-grained temporal grounding. It not only excels in grounding tasks such as temporal sentence grounding, dense video captioning, and grounded VideoQA, but also shows great potential as a versatile video assistant for general video understanding.

## Model details

**Model date:**

Grounded-VideoLLM-Phi3.5-Vision-Instruct was trained in Oct. 2024.

**Paper or resources for more information:**

[Paper](https://arxiv.org/abs/2410.03290), [Code](https://github.com/WHB139426/Grounded-Video-LLM)

## Citation

If you find our project useful, please star our repo and cite our paper as follows:

```
@article{wang2024grounded,
  title={Grounded-VideoLLM: Sharpening Fine-grained Temporal Grounding in Video Large Language Models},
  author={Wang, Haibo and Xu, Zhiyang and Cheng, Yu and Diao, Shizhe and Zhou, Yufan and Cao, Yixin and Wang, Qifan and Ge, Weifeng and Huang, Lifu},
  journal={arXiv preprint arXiv:2410.03290},
  year={2024}
}
```