luodian committed 4301fbf (1 parent: 57cb1b2)

Create README.md

Files changed (1): README.md (added, +329 lines)
---
license: apache-2.0
datasets:
- lmms-lab/LLaVA-OneVision-Data
language:
- en
- zh
metrics:
- accuracy
library_name: transformers
tags:
- multimodal

model-index:
- name: llava-onevision-qwen-7b-si
  results:
  - task:
      type: multimodal
    dataset:
      type: ai2d
      name: AI2D
    metrics:
    - name: accuracy
      type: accuracy
      value: 81.6
      verified: true
  - task:
      type: multimodal
    dataset:
      type: chartqa
      name: ChartQA
    metrics:
    - name: accuracy
      type: accuracy
      value: 78.8
      verified: true
  - task:
      type: multimodal
    dataset:
      type: docvqa
      name: DocVQA
    metrics:
    - name: accuracy
      type: accuracy
      value: 89.3
      verified: true
  - task:
      type: multimodal
    dataset:
      type: infovqa
      name: InfoVQA
    metrics:
    - name: accuracy
      type: accuracy
      value: 69.9
      verified: true
  - task:
      type: multimodal
    dataset:
      type: mathverse
      name: MathVerse
    metrics:
    - name: accuracy
      type: accuracy
      value: 26.9
      verified: true
  - task:
      type: multimodal
    dataset:
      type: mathvista
      name: MathVista
    metrics:
    - name: accuracy
      type: accuracy
      value: 56.1
      verified: true
  - task:
      type: multimodal
    dataset:
      type: mmbench
      name: MMBench
    metrics:
    - name: accuracy
      type: accuracy
      value: 81.7
      verified: true
  - task:
      type: multimodal
    dataset:
      type: mme-perception
      name: MME-Perception
    metrics:
    - name: score
      type: score
      value: 1626
      verified: true
  - task:
      type: multimodal
    dataset:
      type: mme-cognition
      name: MME-Cognition
    metrics:
    - name: score
      type: score
      value: 483
      verified: true
  - task:
      type: multimodal
    dataset:
      type: mmmu
      name: MMMU
    metrics:
    - name: accuracy
      type: accuracy
      value: 47.3
      verified: true
  - task:
      type: multimodal
    dataset:
      type: mmvet
      name: MMVet
    metrics:
    - name: accuracy
      type: accuracy
      value: 58.8
      verified: true
  - task:
      type: multimodal
    dataset:
      type: mmstar
      name: MMStar
    metrics:
    - name: accuracy
      type: accuracy
      value: 60.9
      verified: true
  - task:
      type: multimodal
    dataset:
      type: seed-bench
      name: Seed-Bench
    metrics:
    - name: accuracy
      type: accuracy
      value: 74.8
      verified: true
  - task:
      type: multimodal
    dataset:
      type: science-qa
      name: Science-QA
    metrics:
    - name: accuracy
      type: accuracy
      value: 96.6
      verified: true
  - task:
      type: multimodal
    dataset:
      type: imagedc
      name: ImageDC
    metrics:
    - name: accuracy
      type: accuracy
      value: 85.7
      verified: true
  - task:
      type: multimodal
    dataset:
      type: mmlbench
      name: MMLBench
    metrics:
    - name: accuracy
      type: accuracy
      value: 75.8
      verified: true
  - task:
      type: multimodal
    dataset:
      type: realworldqa
      name: RealWorldQA
    metrics:
    - name: accuracy
      type: accuracy
      value: 65.5
      verified: true
  - task:
      type: multimodal
    dataset:
      type: vibe-eval
      name: Vibe-Eval
    metrics:
    - name: accuracy
      type: accuracy
      value: 47.2
      verified: true
  - task:
      type: multimodal
    dataset:
      type: llava-w
      name: LLaVA-W
    metrics:
    - name: accuracy
      type: accuracy
      value: 86.9
      verified: true
  - task:
      type: multimodal
    dataset:
      type: l-wilder
      name: LLaVA-Wilder
    metrics:
    - name: accuracy
      type: accuracy
      value: 69.1
      verified: true
---

# LLaVA-OneVision

![banner](https://i.postimg.cc/pL17YtG4/WX20240508-220230-2x.png)

Play with the model on the [LLaVA OneVision Chat](https://llava-onevision.lmms-lab.com/).

## Table of Contents

1. [Model Summary](#model-summary)
2. [Use](#use)
3. [Limitations](#limitations)
4. [Training](#training)
5. [License](#license)
6. [Citation](#citation)

## Model Summary

The LLaVA-OneVision models are 0.5B/7B/72B-parameter multimodal models trained on [LLaVA-OneVision](https://huggingface.co/datasets/lmms-lab/LLaVA-OneVision-Data), built on the Qwen2 language model with a context window of 32K tokens.

- **Repository:** [LLaVA-VL/LLaVA-NeXT](https://github.com/LLaVA-VL/LLaVA-NeXT?tab=readme-ov-file)
- **Project Website:** [llava-onevision.lmms-lab.com](https://llava-onevision.lmms-lab.com)
- **Paper:** [LLaVA-OneVision]()
- **Point of Contact:** [Bo Li](mailto:[email protected])
- **Languages:** English, Chinese


## Use

### Intended use

The model was trained on the [LLaVA-OneVision Dataset](https://huggingface.co/datasets/lmms-lab/LLaVA-OneVision-Data) and can interact with single images, multi-image inputs, and videos.

**Feel free to share your generations in the Community tab!**

### Generation
```python
# pip install git+https://github.com/LLaVA-VL/LLaVA-NeXT.git
from llava.model.builder import load_pretrained_model
from llava.mm_utils import get_model_name_from_path, process_images, tokenizer_image_token
from llava.constants import IMAGE_TOKEN_INDEX, DEFAULT_IMAGE_TOKEN, DEFAULT_IM_START_TOKEN, DEFAULT_IM_END_TOKEN, IGNORE_INDEX
from llava.conversation import conv_templates, SeparatorStyle

from PIL import Image
import requests
import copy
import torch

import sys
import warnings

warnings.filterwarnings("ignore")
pretrained = "lmms-lab/llava-onevision-qwen2-7b-si"  # the checkpoint described by this card; other sizes in the series can be swapped in
model_name = "llava_qwen"
device = "cuda"
device_map = "auto"
tokenizer, model, image_processor, max_length = load_pretrained_model(pretrained, None, model_name, device_map=device_map)  # Pass any additional llava_model_args here

model.eval()

url = "https://github.com/haotian-liu/LLaVA/blob/1a91fc274d7c35a9b50b3cb29c4247ae5837ce39/images/llava_v1_5_radar.jpg?raw=true"
image = Image.open(requests.get(url, stream=True).raw)
image_tensor = process_images([image], image_processor, model.config)
image_tensor = [_image.to(dtype=torch.float16, device=device) for _image in image_tensor]

conv_template = "qwen_1_5"  # Make sure you use the correct chat template for the model you load
question = DEFAULT_IMAGE_TOKEN + "\nWhat is shown in this image?"
conv = copy.deepcopy(conv_templates[conv_template])
conv.append_message(conv.roles[0], question)
conv.append_message(conv.roles[1], None)
prompt_question = conv.get_prompt()

input_ids = tokenizer_image_token(prompt_question, tokenizer, IMAGE_TOKEN_INDEX, return_tensors="pt").unsqueeze(0).to(device)
image_sizes = [image.size]

cont = model.generate(
    input_ids,
    images=image_tensor,
    image_sizes=image_sizes,
    do_sample=False,
    temperature=0,
    max_new_tokens=4096,
)
text_outputs = tokenizer.batch_decode(cont, skip_special_tokens=True)
print(text_outputs)
```
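
The intended-use note above also mentions multi-image input. The snippet below is a minimal, untested sketch of how the same `llava` API extends to two images: one `DEFAULT_IMAGE_TOKEN` per image in the prompt, and a list of processed tensors and sizes passed to `generate`. The image URLs are placeholders (not official assets), and the checkpoint name follows this card; for interleaved multi-image and video use, the OneVision-stage checkpoints in this series (see the Training section below) and the official LLaVA-NeXT tutorials are the authoritative reference.

```python
# Minimal multi-image sketch (untested), adapted from the single-image example above.
# Assumes the same llava package; the two URLs below are placeholders, not official assets.
from llava.model.builder import load_pretrained_model
from llava.mm_utils import process_images, tokenizer_image_token
from llava.constants import IMAGE_TOKEN_INDEX, DEFAULT_IMAGE_TOKEN
from llava.conversation import conv_templates

from PIL import Image
import requests
import copy
import torch

pretrained = "lmms-lab/llava-onevision-qwen2-7b-si"
device = "cuda"
tokenizer, model, image_processor, max_length = load_pretrained_model(pretrained, None, "llava_qwen", device_map="auto")
model.eval()

urls = [
    "https://example.com/image_a.jpg",  # placeholder URL
    "https://example.com/image_b.jpg",  # placeholder URL
]
images = [Image.open(requests.get(u, stream=True).raw) for u in urls]

# Preprocess every image and collect one size entry per image.
image_tensors = process_images(images, image_processor, model.config)
image_tensors = [t.to(dtype=torch.float16, device=device) for t in image_tensors]
image_sizes = [img.size for img in images]

# One <image> placeholder per input image, followed by the text question.
question = f"{DEFAULT_IMAGE_TOKEN}\n{DEFAULT_IMAGE_TOKEN}\nWhat is different between these two images?"
conv = copy.deepcopy(conv_templates["qwen_1_5"])
conv.append_message(conv.roles[0], question)
conv.append_message(conv.roles[1], None)

input_ids = tokenizer_image_token(conv.get_prompt(), tokenizer, IMAGE_TOKEN_INDEX, return_tensors="pt").unsqueeze(0).to(device)

with torch.inference_mode():
    out = model.generate(
        input_ids,
        images=image_tensors,
        image_sizes=image_sizes,
        do_sample=False,
        max_new_tokens=512,
    )
print(tokenizer.batch_decode(out, skip_special_tokens=True))
```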

# Training

## Model

- **Architecture:** SO400M + Qwen2
- **Pretraining Stage:** LCS-558K, 1 epoch, projector
- **Mid Stage:** A mixture of 4.7M high-quality synthetic data, 1 epoch, full model
- **Final-Image Stage:** A mixture of 3.6M single-image data, 1 epoch, full model
- **OneVision Stage:** A mixture of 1.6M single-image/multi-image/video data, 1 epoch, full model
- **Precision:** bfloat16

## Hardware & Software

- **GPUs:** 256 × NVIDIA A100 (for training the whole model series)
- **Orchestration:** [Hugging Face Trainer](https://huggingface.co/docs/transformers/main_classes/trainer)
- **Neural networks:** [PyTorch](https://github.com/pytorch/pytorch)

# Citation
```
@article{li2024llavaonevision,
  title={LLaVA-OneVision},
}
```