---
license: apache-2.0
datasets:
- HuggingFaceM4/MMBench
language:
- en
base_model:
- openai/clip-vit-large-patch14-336
- Qwen/Qwen2.5-7B-Instruct
pipeline_tag: image-text-to-text
tags:
- vision-language
- multimodal
---
## POINTS-Qwen-2-5-7B-Chat

### Introduction

We are excited to announce the first version of POINTS, which integrates recent advancements in vision-language models with new techniques proposed by researchers from WeChat AI.

<p align="center">
🏠 <a href="https://github.com/WePOINTS/WePOINTS">GitHub</a>&nbsp;&nbsp;|&nbsp;&nbsp;📑 <a href="https://arxiv.org/abs/2409.04828">Paper</a>
</p>

### What's new in POINTS?

**Key Innovations**

1. **Strong Baseline**: We integrate the most recent advancements in vision-language models, i.e., CapFusion, Dual Vision Encoder, and Dynamic High Resolution, into POINTS.

2. **Pre-training Dataset Filtering**: We propose filtering the pre-training dataset using perplexity as a metric. This filtering strategy significantly reduces the size of the pre-training dataset while also improving model performance (a minimal sketch follows this list).

3. **Model Soup**: We propose applying model soup to models fine-tuned on different visual instruction tuning datasets, which further improves model performance (see the weight-averaging sketch after the figure below).

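For intuition, the filtering idea amounts to scoring each pre-training caption with a language model and keeping only the lowest-perplexity portion of the data. The sketch below is illustrative rather than the released pipeline: the scoring model, the `keep_ratio`, and the per-caption scoring loop are all assumptions.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Assumption: any causal LM can act as the scorer; the paper's choice may differ.
scorer_name = 'Qwen/Qwen2.5-7B-Instruct'
scorer_tokenizer = AutoTokenizer.from_pretrained(scorer_name)
scorer = AutoModelForCausalLM.from_pretrained(
    scorer_name, torch_dtype=torch.bfloat16, device_map='cuda')
scorer.eval()


@torch.no_grad()
def perplexity(text: str) -> float:
    """Perplexity of a caption under the scoring LM: exp of the mean token NLL."""
    inputs = scorer_tokenizer(text, return_tensors='pt').to(scorer.device)
    loss = scorer(**inputs, labels=inputs['input_ids']).loss
    return torch.exp(loss).item()


def filter_by_perplexity(captions, keep_ratio=0.2):
    """Keep the captions with the lowest perplexity (keep_ratio is illustrative)."""
    ranked = sorted(captions, key=perplexity)
    return ranked[:max(1, int(len(ranked) * keep_ratio))]
```
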
<p align="center">
<img src="https://github.com/user-attachments/assets/6af35008-f501-400a-a870-b66a9bf2baab" width="60%"/>
</p>

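Model soup, in its simplest form, averages the weights of several models fine-tuned from the same initialization. The sketch below shows a uniform soup over hypothetical checkpoints; the checkpoint paths are placeholders, and the actual selection and weighting scheme used for POINTS may differ.

```python
import torch
from transformers import AutoModelForCausalLM

# Hypothetical checkpoints: the same architecture fine-tuned on different
# visual instruction tuning datasets (paths are placeholders).
checkpoint_paths = [
    'finetuned-on-dataset-a',
    'finetuned-on-dataset-b',
    'finetuned-on-dataset-c',
]

soup_state = None
for path in checkpoint_paths:
    model = AutoModelForCausalLM.from_pretrained(path, torch_dtype=torch.float32)
    state = model.state_dict()
    if soup_state is None:
        # Start the soup from the first checkpoint's parameters
        soup_state = {k: v.clone() for k, v in state.items()}
    else:
        # Accumulate only floating-point tensors; integer buffers keep
        # the first checkpoint's values
        for k, v in state.items():
            if soup_state[k].is_floating_point():
                soup_state[k] += v
    del model

# Uniform averaging over all checkpoints
for k in soup_state:
    if soup_state[k].is_floating_point():
        soup_state[k] /= len(checkpoint_paths)

# Load the averaged weights into one model instance and save the soup
souped = AutoModelForCausalLM.from_pretrained(
    checkpoint_paths[0], torch_dtype=torch.float32)
souped.load_state_dict(soup_state)
souped.save_pretrained('points-model-soup')
```
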
### How to use POINTS?

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from transformers import CLIPImageProcessor
from PIL import Image
import torch
import requests
from io import BytesIO


# Download an example image and open it with PIL
image_url = 'https://github.com/user-attachments/assets/83258e94-5d61-48ef-a87f-80dd9d895524'
response = requests.get(image_url)
image_data = BytesIO(response.content)
pil_image = Image.open(image_data)

prompt = 'please describe the image in detail'
model_path = 'WePOINTS/POINTS-Qwen-2-5-7B-Chat'

# Load the tokenizer, the model (its custom code ships with the checkpoint,
# hence trust_remote_code=True), and the CLIP image processor
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(
    model_path, trust_remote_code=True, device_map='cuda').to(torch.bfloat16)
image_processor = CLIPImageProcessor.from_pretrained(model_path)

# temperature=0.0, top_p=0.0 and a single beam amount to greedy decoding
generation_config = {
    'max_new_tokens': 1024,
    'temperature': 0.0,
    'top_p': 0.0,
    'num_beams': 1,
}

# chat() is the custom inference entry point provided by the checkpoint
res = model.chat(
    pil_image,
    prompt,
    tokenizer,
    image_processor,
    True,
    generation_config
)
print(res)
```

### Evaluation

| Benchmark | InternVL2-8B | LLaVA-OneVision | POINTS |
| :-------: | :----------: | :-------------: | :----: |
| MMBench-dev-en | - | 80.8 | 83.2 |
| MathVista | 58.3 | 62.3 | 63.1 |
| HallucinationBench | 45.0 | 31.6 | 46.0 |
| OCRBench | 79.4 | 62.2 | 72.0 |
| AI2D | 83.6 | 82.4 | 80.9 |
| MMVet | 54.3 | 51.9 | 52.3 |
| MMStar | 61.5 | 61.9 | 61.0 |
| MMMU | 51.2 | 47.9 | 49.4 |
| ScienceQA | 97.1 | 95.4 | - |
| MME | 2215.1 | 1993.6 | 2195.2 |
| RealWorldQA | 64.2 | 69.9 | 67.3 |
| LLaVA-Wild | 73.3 | 81.0 | 71.1 |

### Citation

If you find our work helpful, feel free to cite us:

```bibtex
@article{liu2024points,
  title={POINTS: Improving Your Vision-language Model with Affordable Strategies},
  author={Liu, Yuan and Zhao, Zhongyin and Zhuang, Ziyuan and Tian, Le and Zhou, Xiao and Zhou, Jie},
  journal={arXiv preprint arXiv:2409.04828},
  year={2024}
}

@article{liu2024rethinking,
  title={Rethinking Overlooked Aspects in Vision-Language Models},
  author={Liu, Yuan and Tian, Le and Zhou, Xiao and Zhou, Jie},
  journal={arXiv preprint arXiv:2405.11850},
  year={2024}
}
```