ZhangYuanhan committed on
Commit
70af2bb
1 Parent(s): 3331e8c

Update README.md

  1. README.md +231 -3
README.md CHANGED
@@ -1,3 +1,231 @@
1
- ---
2
- license: apache-2.0
3
- ---
1
+ ---
2
+ datasets:
3
+ - lmms-lab/LLaVA-NeXT-Video-178K
4
+ language:
5
+ - en
6
+ library_name: transformers
7
+ license: apache-2.0
8
+ metrics:
9
+ - accuracy
10
+ tags:
11
+ - multimodal
12
+ model-index:
13
+ - name: LLaVA-NeXT-Video-7B-Qwen2
14
+   results:
15
+   - task:
16
+       type: multimodal
17
+     dataset:
18
+       name: ActNet-QA
19
+       type: actnet-qa
20
+     metrics:
21
+     - type: accuracy
22
+       value: 58.2
23
+       name: accuracy
24
+       verified: true
25
+   - task:
26
+       type: multimodal
27
+     dataset:
28
+       name: EgoSchema
29
+       type: egoschema
30
+     metrics:
31
+     - type: accuracy
32
+       value: 57.3
33
+       name: accuracy
34
+       verified: true
35
+   - task:
36
+       type: multimodal
37
+     dataset:
38
+       name: MLVU
39
+       type: mlvu
40
+     metrics:
41
+     - type: accuracy
42
+       value: 69.8
43
+       name: accuracy
44
+       verified: true
45
+   - task:
46
+       type: multimodal
47
+     dataset:
48
+       name: MVBench
49
+       type: mvbench
50
+     metrics:
51
+     - type: accuracy
52
+       value: 58.4
53
+       name: accuracy
54
+       verified: true
55
+   - task:
56
+       type: multimodal
57
+     dataset:
58
+       name: NextQA
59
+       type: nextqa
60
+     metrics:
61
+     - type: accuracy
62
+       value: 82.2
63
+       name: accuracy
64
+       verified: true
65
+   - task:
66
+       type: multimodal
67
+     dataset:
68
+       name: PercepTest
69
+       type: percepTest
70
+     metrics:
71
+     - type: accuracy
72
+       value: 71.7
73
+       name: accuracy
74
+       verified: true
75
+   - task:
76
+       type: multimodal
77
+     dataset:
78
+       name: VideoChatGPT
79
+       type: videochatgpt
80
+     metrics:
81
+     - type: score
82
+       value: 3.54
83
+       name: score
84
+       verified: true
85
+   - task:
86
+       type: multimodal
87
+     dataset:
88
+       name: VideoDC
89
+       type: videodc
90
+     metrics:
91
+     - type: score
92
+       value: 3.71
93
+       name: score
94
+       verified: true
95
+   - task:
96
+       type: multimodal
97
+     dataset:
98
+       name: LongVideoBench
99
+       type: longvideobench
100
+     metrics:
101
+     - type: accuracy
102
+       value: 57.3
103
+       name: accuracy
104
+       verified: true
105
+   - task:
106
+       type: multimodal
107
+     dataset:
108
+       name: VideoMME
109
+       type: videomme
110
+     metrics:
111
+     - type: accuracy
112
+       value: 63.2
113
+       name: accuracy
114
+       verified: true
115
+ base_model:
116
+ - lmms-lab/llava-onevision-qwen2-7b-si
117
+ ---
118
+
119
+ # LLaVA-NeXT-Video-7B-Qwen2-video-only
120
+
121
+ ## Table of Contents
122
+
123
+ 1. [Model Summary](#model-summary)
124
+ 2. [Use](#use)
125
+ 3. [Limitations](#limitations)
126
+ 4. [Training](#training)
127
+ 5. [License](#license)
128
+ 6. [Citation](#citation)
129
+
130
+ ## Model Summary
131
+
132
+ The LLaVA-NeXT-Video models are 7B/72B-parameter models trained on [LLaVA-NeXT-Video-178K](https://huggingface.co/datasets/lmms-lab/LLaVA-NeXT-Video-SFT-Data), based on the Qwen2 language model with a context window of 32K tokens.
133
+
134
+ This model supports at most 110 frames.
135
+
136
+ - **Repository:** [LLaVA-VL/LLaVA-NeXT](https://github.com/LLaVA-VL/LLaVA-NeXT?tab=readme-ov-file)
137
+ - **Point of Contact:** [Yuanhan Zhang](https://zhangyuanhan-ai.github.io/)
138
+ - **Languages:** English, Chinese
139
+
140
+
141
+ ## Use
142
+
143
+ ### Intended use
144
+
145
+ The model was trained on [LLaVA-NeXT-Video-178K](https://huggingface.co/datasets/lmms-lab/LLaVA-NeXT-Video-SFT-Data) and can interact with single images, multiple images, and videos, but it is specialized for video.
146
+
147
+ **Feel free to share your generations in the Community tab!**
148
+
149
+ ### Generation
150
+
151
+ We provide a simple generation example for using our model. For more details, refer to [GitHub](https://github.com/LLaVA-VL/LLaVA-NeXT).
152
+
153
+ ```python
154
+ # pip install git+https://github.com/LLaVA-VL/LLaVA-NeXT.git
155
+ from llava.model.builder import load_pretrained_model
156
+ from llava.mm_utils import get_model_name_from_path, process_images, tokenizer_image_token
157
+ from llava.constants import IMAGE_TOKEN_INDEX, DEFAULT_IMAGE_TOKEN, DEFAULT_IM_START_TOKEN, DEFAULT_IM_END_TOKEN, IGNORE_INDEX
158
+ from llava.conversation import conv_templates, SeparatorStyle
159
+ from PIL import Image
160
+ import requests
161
+ import copy
162
+ import torch
163
+ import sys
164
+ import warnings
165
+ from decord import VideoReader, cpu
166
+ import numpy as np
167
+ warnings.filterwarnings("ignore")
168
+ def load_video(video_path, max_frames_num, fps=1, force_sample=False):
169
+     if max_frames_num == 0:
170
+         return np.zeros((1, 336, 336, 3))
171
+     vr = VideoReader(video_path, ctx=cpu(0), num_threads=1)
172
+     total_frame_num = len(vr)
173
+     video_time = total_frame_num / vr.get_avg_fps()  # video duration in seconds
174
+     fps = round(vr.get_avg_fps() / fps)  # stride, in source frames, between sampled frames
175
+     frame_idx = [i for i in range(0, len(vr), fps)]
176
+     frame_time = [i / fps for i in frame_idx]
177
+     if len(frame_idx) > max_frames_num or force_sample:
178
+         sample_fps = max_frames_num
179
+         uniform_sampled_frames = np.linspace(0, total_frame_num - 1, sample_fps, dtype=int)
180
+         frame_idx = uniform_sampled_frames.tolist()
181
+         frame_time = [i / vr.get_avg_fps() for i in frame_idx]
182
+     frame_time = ",".join([f"{i:.2f}s" for i in frame_time])
183
+     spare_frames = vr.get_batch(frame_idx).asnumpy()
184
+     # spare_frames has shape (num_frames, height, width, 3)
185
+     return spare_frames, frame_time, video_time
186
+ pretrained = "lmms-lab/LLaVA-NeXT-Video-7B-Qwen2"
187
+ model_name = "llava_qwen"
188
+ device = "cuda"
189
+ device_map = "auto"
190
+ tokenizer, model, image_processor, max_length = load_pretrained_model(pretrained, None, model_name, device_map=device_map) # Add any other thing you want to pass in llava_model_args
191
+ model.eval()
192
+ video_path = "XXXX"
193
+ max_frames_num = 110  # must be an int; the model supports at most 110 frames
194
+ video, frame_time, video_time = load_video(video_path, max_frames_num, 1, force_sample=True)
195
+ video = image_processor.preprocess(video, return_tensors="pt")["pixel_values"].cuda().bfloat16()
196
+ conv_template = "qwen_1_5" # Make sure you use correct chat template for different models
197
+ question = DEFAULT_IMAGE_TOKEN + "\nPlease describe this video in detail."
198
+ conv = copy.deepcopy(conv_templates[conv_template])
199
+ conv.append_message(conv.roles[0], question)
200
+ conv.append_message(conv.roles[1], None)
201
+ prompt_question = conv.get_prompt()
202
+ input_ids = tokenizer_image_token(prompt_question, tokenizer, IMAGE_TOKEN_INDEX, return_tensors="pt").unsqueeze(0).to(device)
203
+ cont = model.generate(
204
+     input_ids,
205
+     images=video,
206
+     modalities="video",
207
+     do_sample=False,
208
+     temperature=0,
209
+     max_new_tokens=4096,
210
+ )
211
+ text_outputs = tokenizer.batch_decode(cont, skip_special_tokens=True)
212
+ print(text_outputs)
213
+ ```
214
+
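The frame-selection logic in `load_video` can be tried without a video file or GPU. The sketch below is a dependency-free approximation of that logic, not code from the LLaVA-NeXT repository: `sample_frame_indices` and its parameter names are hypothetical, and it uses plain integer truncation in place of `np.linspace(..., dtype=int)`, so individual indices may differ slightly from the NumPy version.

```python
def sample_frame_indices(total_frames, avg_fps, max_frames, target_fps=1.0, force_sample=False):
    """Return (frame indices, timestamps in seconds) for a video.

    Hypothetical helper for illustration; not part of the LLaVA-NeXT codebase.
    """
    if total_frames <= 0 or max_frames <= 0:
        return [], []

    # First pass: take one frame every `stride` source frames
    # (roughly `target_fps` frames per second of video).
    stride = max(1, round(avg_fps / target_fps))
    indices = list(range(0, total_frames, stride))

    # Second pass: if that yields too many frames (or uniform sampling is forced),
    # spread exactly `max_frames` indices evenly over the whole video.
    if len(indices) > max_frames or force_sample:
        if max_frames == 1:
            indices = [0]
        else:
            last = total_frames - 1
            indices = [int(i * last / (max_frames - 1)) for i in range(max_frames)]

    timestamps = [idx / avg_fps for idx in indices]
    return indices, timestamps


if __name__ == "__main__":
    # A 10-second clip at 30 fps, capped at 4 frames.
    idx, ts = sample_frame_indices(total_frames=300, avg_fps=30.0, max_frames=4)
    print(idx)
    print(",".join(f"{t:.2f}s" for t in ts))
```

For the 10-second, 30 fps clip above, the cap of 4 frames triggers the uniform pass and selects frames 0, 99, 199, and 299. With `max_frames=110` and `force_sample=False`, the same clip keeps the ten one-per-second frames instead.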
215
+
216
+ ## Training
217
+
218
+ ### Model
219
+
220
+ - **Architecture:** SO400M + Qwen2
221
+ - **Initialized Model:** lmms-lab/llava-onevision-qwen2-7b-si
222
+ - **Data:** A mixture of 1.6M single-image/multi-image/video data, 1 epoch, full model
223
+ - **Precision:** bfloat16
224
+
225
+ ### Hardware & Software
226
+
227
+ - **GPUs:** 256 × NVIDIA A100 (for training the whole model series)
228
+ - **Orchestration:** [Hugging Face Trainer](https://huggingface.co/docs/transformers/main_classes/trainer)
229
+ - **Neural networks:** [PyTorch](https://github.com/pytorch/pytorch)
230
+
231
+ ## Citation