---
license: apache-2.0
language:
  - en
base_model:
  - mistralai/Mistral-7B-Instruct-v0.2
tags:
  - video temporal grounding
  - dense video caption
  - video highlight detection
---

# TRACE: Temporal Grounding Video LLM via Causal Event Modeling

If our project helps you, please give us a star ⭐ on GitHub and cite our paper!

## 📰 News

- [2024.10.10] 🔥 Our code and paper are released!
- [2024.10.10] 🔥 Our checkpoints are available now!

## Overview

In this work:

- We model videos as a series of events and propose a causal event modeling framework to capture videos' inherent structure.
- We present TRACE, a novel task-interleaved video LLM tailored to implement the causal event modeling framework through the sequential encoding/decoding of timestamps, salient scores, and textual captions (see the sketch below).
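
To make the framework concrete, here is a minimal, illustrative sketch of causal event modeling: a video is represented as an ordered list of events, and each new event (timestamps, salient score, caption) is decoded conditioned on the visual input and on all previously decoded events. The `Event` dataclass and the `model.next_event` interface below are hypothetical and do not reflect the actual TRACE implementation.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Event:
    """One video event: timestamps, salient score, and caption."""
    start: float          # event start time (seconds)
    end: float            # event end time (seconds)
    salient_score: float  # saliency / highlight score of the event
    caption: str          # textual description of the event

def decode_events(model, video_features, max_events: int = 10) -> List[Event]:
    """Autoregressively decode events: each event is conditioned on the video
    and on all previously decoded events (causal event modeling)."""
    events: List[Event] = []
    for _ in range(max_events):
        # `next_event` is a hypothetical interface: it would decode the next
        # (timestamps, salient score, caption) triple given the event history.
        event: Optional[Event] = model.next_event(video_features, history=events)
        if event is None:  # the model signals the end of the event sequence
            break
        events.append(event)
    return events
```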

## Model Zoo

| Checkpoints | Description | URL |
|---|---|---|
| Initialization | Weights initialized from VideoLLaMA2 | trace-init |
| Stage-1 | Model checkpoints trained after stage-1 | trace-stage1 |
| Stage-2 | Model checkpoints trained after stage-2 | trace |
| FT-Charades | Fine-tuned on Charades-STA dataset | trace-ft-charades |
| FT-Youcook2 | Fine-tuned on Youcook2 dataset | trace-ft-youcook2 |
| FT-QVHighlights | Fine-tuned on QVHighlights dataset | trace-ft-qvhighlights |
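
As a hedged example, the checkpoints above can be fetched with `huggingface_hub`. The repository id `Yongxin-Guo/trace` below is an assumption based on this table; replace it with the repository for the checkpoint you need (e.g. trace-stage1, trace-ft-youcook2).

```python
from huggingface_hub import snapshot_download

# Download the stage-2 TRACE checkpoint to a local directory.
# NOTE: "Yongxin-Guo/trace" is an assumed repo id; substitute the repo id
# from the URL column of the table above.
local_dir = snapshot_download(repo_id="Yongxin-Guo/trace")
print(f"Checkpoint files downloaded to: {local_dir}")
```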

## Results

| Youcook2 (Zero-Shot) | CIDEr | METEOR | SODA_c | F1 |
|---|---|---|---|---|
| TRACE | 8.1 | 2.8 | 2.2 | 22.4 |

| Charades-STA (Zero-Shot) | R@0.3 | R@0.5 | R@0.7 | mIoU |
|---|---|---|---|---|
| TRACE | 58.6 | 40.3 | 19.4 | 38.7 |

| QVHighlights (Zero-Shot) | mAP | Hit@1 |
|---|---|---|
| TRACE | 26.8 | 42.7 |

| ActivityNet-DVC | CIDEr | METEOR | SODA_c | F1 |
|---|---|---|---|---|
| TRACE | 25.9 | 6.0 | 6.4 | 39.3 |

| ActivityNet-MR | R@0.3 | R@0.5 | R@0.7 | mIoU |
|---|---|---|---|---|
| TRACE | 53.0 | 37.7 | 24.0 | 39.0 |