Commit 841877a by czczup (1 parent: da1dbc3)

Update README.md

  1. README.md +194 -0
README.md CHANGED
@@ -1,3 +1,197 @@
  ---
  license: mit
+ datasets:
+ - laion/laion2B-en
+ - laion/laion-coco
+ - laion/laion2B-multi
+ - kakaobrain/coyo-700m
+ - conceptual_captions
+ - wanng/wukong100m
+ pipeline_tag: visual-question-answering
  ---
+
+ # Model Card for InternVL-Chat-V1.5
+
+ \[[Paper](https://arxiv.org/abs/2312.14238)\] \[[GitHub](https://github.com/OpenGVLab/InternVL)\] \[[Chat Demo](https://internvl.opengvlab.com/)\]
+
+ ## Model Details
+ - **Model Type:** vision large language model, multimodal chatbot
+ - **Model Stats:**
+   - Architecture: [InternViT-6B-448px-V1-5](https://huggingface.co/OpenGVLab/InternViT-6B-448px-V1-2) + MLP + [InternLM2-Chat-20B](https://huggingface.co/internlm/internlm2-chat-20b)
+   - Params: 25.5B
+   - Image size: dynamic resolution, up to 24 tiles of 448 x 448 during inference.
+   - Number of visual tokens: 256 * number of tiles
+
+ - **Training Strategy:**
+   - Pretraining Stage
+     - Learnable Component: ViT + MLP
+     - Data: TODO
+   - SFT Stage
+     - Learnable Component: ViT + MLP + LLM
+     - Data: TODO
+
+
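As a rough illustration of the Model Stats above, the sketch below turns the listed figures into two back-of-the-envelope numbers: the visual token count for a given number of tiles, and the approximate size of the weights when stored in a 2-byte dtype such as bfloat16 (the dtype used in the usage code). The helper names are hypothetical, and counting the optional thumbnail as one extra tile is an assumption drawn from the usage code, not something the card states.

```python
TOKENS_PER_TILE = 256   # from the card: 256 visual tokens per 448 x 448 tile
MAX_TILES = 24          # from the card: upper bound on tiles during inference
NUM_PARAMS = 25.5e9     # from the card: 25.5B parameters


def visual_tokens(num_tiles: int, with_thumbnail: bool = False) -> int:
    """Hypothetical helper: visual tokens for an image split into num_tiles tiles."""
    if not 1 <= num_tiles <= MAX_TILES:
        raise ValueError(f"num_tiles must be between 1 and {MAX_TILES}")
    # Assumption: a thumbnail, appended only when there is more than one tile,
    # costs the same as one regular tile.
    extra = 1 if with_thumbnail and num_tiles > 1 else 0
    return TOKENS_PER_TILE * (num_tiles + extra)


def weight_gigabytes(num_params: float = NUM_PARAMS, bytes_per_param: int = 2) -> float:
    """Approximate weight size in GB (10**9 bytes); excludes activations and KV cache."""
    return num_params * bytes_per_param / 1e9


print(visual_tokens(6))                        # 1536
print(visual_tokens(6, with_thumbnail=True))   # 1792
print(visual_tokens(24))                       # 6144
print(weight_gigabytes())                      # 51.0
```

Roughly 51 GB of bfloat16 weights is consistent with the note in the usage code that a single 80G A100 can hold the entire model, although actual memory use also depends on the number of tiles and the generation length.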
+ ## Model Usage
+
+ We provide a minimal code example to run InternVL-Chat using only the `transformers` library.
+
+ You can also use our [online demo](https://internvl.opengvlab.com/) for a quick experience of this model.
+
+ Note: If you encounter the error `ImportError: This modeling file requires the following packages that were not found in your environment: fastchat`, please run `pip install fschat`.
+
+
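One way to catch the missing dependency from the note above before loading the model is to check whether the `fastchat` module is importable. This is a minimal sketch using only the Python standard library; `is_installed` is a hypothetical helper name, not part of InternVL or `transformers`.

```python
import importlib.util


def is_installed(module_name: str) -> bool:
    """Return True if a top-level module can be found in the current environment."""
    return importlib.util.find_spec(module_name) is not None


# The pip package is named `fschat`, but the module it provides is `fastchat`.
if not is_installed("fastchat"):
    print("fastchat not found; run `pip install fschat` before loading the model")
```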
+ ```python
+ import torch
+ import torchvision.transforms as T
+ from PIL import Image
+ from torchvision.transforms.functional import InterpolationMode
+ from transformers import AutoModel, AutoTokenizer
+
+
+ IMAGENET_MEAN = (0.485, 0.456, 0.406)
+ IMAGENET_STD = (0.229, 0.224, 0.225)
+
+
+ def build_transform(input_size):
+     # convert to RGB, resize to a square tile, and normalize with ImageNet statistics
+     MEAN, STD = IMAGENET_MEAN, IMAGENET_STD
+     transform = T.Compose([
+         T.Lambda(lambda img: img.convert('RGB') if img.mode != 'RGB' else img),
+         T.Resize((input_size, input_size), interpolation=InterpolationMode.BICUBIC),
+         T.ToTensor(),
+         T.Normalize(mean=MEAN, std=STD)
+     ])
+     return transform
+
+
+ def find_closest_aspect_ratio(aspect_ratio, target_ratios, width, height, image_size):
+     # pick the (columns, rows) grid whose aspect ratio is closest to the image's
+     best_ratio_diff = float('inf')
+     best_ratio = (1, 1)
+     area = width * height
+     for ratio in target_ratios:
+         target_aspect_ratio = ratio[0] / ratio[1]
+         ratio_diff = abs(aspect_ratio - target_aspect_ratio)
+         if ratio_diff < best_ratio_diff:
+             best_ratio_diff = ratio_diff
+             best_ratio = ratio
+         elif ratio_diff == best_ratio_diff:
+             # on a tie, prefer the larger grid only if the image is large enough to fill it
+             if area > 0.5 * image_size * image_size * ratio[0] * ratio[1]:
+                 best_ratio = ratio
+     return best_ratio
+
+
+ def dynamic_preprocess(image, min_num=1, max_num=6, image_size=448, use_thumbnail=False):
+     orig_width, orig_height = image.size
+     aspect_ratio = orig_width / orig_height
+
+     # enumerate the candidate (columns, rows) grids with min_num to max_num tiles
+     target_ratios = set(
+         (i, j) for n in range(min_num, max_num + 1) for i in range(1, n + 1) for j in range(1, n + 1) if
+         i * j <= max_num and i * j >= min_num)
+     target_ratios = sorted(target_ratios, key=lambda x: x[0] * x[1])
+
+     # find the candidate grid closest to the image's aspect ratio
+     target_aspect_ratio = find_closest_aspect_ratio(
+         aspect_ratio, target_ratios, orig_width, orig_height, image_size)
+
+     # calculate the target width and height
+     target_width = image_size * target_aspect_ratio[0]
+     target_height = image_size * target_aspect_ratio[1]
+     blocks = target_aspect_ratio[0] * target_aspect_ratio[1]
+
+     # resize the image
+     resized_img = image.resize((target_width, target_height))
+     processed_images = []
+     for i in range(blocks):
+         # (left, upper, right, lower) of the i-th tile, in row-major order
+         box = (
+             (i % (target_width // image_size)) * image_size,
+             (i // (target_width // image_size)) * image_size,
+             ((i % (target_width // image_size)) + 1) * image_size,
+             ((i // (target_width // image_size)) + 1) * image_size
+         )
+         # split the image
+         split_img = resized_img.crop(box)
+         processed_images.append(split_img)
+     assert len(processed_images) == blocks
+     if use_thumbnail and len(processed_images) != 1:
+         # append a downscaled view of the whole image as an extra tile
+         thumbnail_img = image.resize((image_size, image_size))
+         processed_images.append(thumbnail_img)
+     return processed_images
+
+
+ def load_image(image_file, input_size=448, max_num=6):
+     image = Image.open(image_file).convert('RGB')
+     transform = build_transform(input_size=input_size)
+     images = dynamic_preprocess(image, image_size=input_size, use_thumbnail=True, max_num=max_num)
+     pixel_values = [transform(image) for image in images]
+     pixel_values = torch.stack(pixel_values)
+     return pixel_values
+
+
+ path = "OpenGVLab/InternVL-Chat-V1-5"
+ # If you have an 80G A100 GPU, you can put the entire model on a single GPU.
+ model = AutoModel.from_pretrained(
+     path,
+     torch_dtype=torch.bfloat16,
+     low_cpu_mem_usage=True,
+     trust_remote_code=True).eval().cuda()
+ # Otherwise, you need to set device_map='auto' to use multiple GPUs for inference.
+ # model = AutoModel.from_pretrained(
+ #     path,
+ #     torch_dtype=torch.bfloat16,
+ #     low_cpu_mem_usage=True,
+ #     trust_remote_code=True,
+ #     device_map='auto').eval()
+
+ tokenizer = AutoTokenizer.from_pretrained(path, trust_remote_code=True)
+ # set the max number of tiles in `max_num`
+ pixel_values = load_image('./examples/image1.jpg', max_num=6).to(torch.bfloat16).cuda()
+
+ generation_config = dict(
+     num_beams=1,
+     max_new_tokens=512,
+     do_sample=False,
+ )
+
+ # single-round conversation
+ question = "Please describe the picture in detail"
+ response = model.chat(tokenizer, pixel_values, question, generation_config)
+ print(question, response)
+
+ # multi-round conversation
+ question = "Please describe the picture in detail"
+ response, history = model.chat(tokenizer, pixel_values, question, generation_config, history=None, return_history=True)
+ print(question, response)
+
+ question = "Please write a poem based on the picture"
+ response, history = model.chat(tokenizer, pixel_values, question, generation_config, history=history, return_history=True)
+ print(question, response)
+ ```
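To make the tiling step in the preprocessing code above more concrete, here is a small pure-Python sketch that follows the same selection logic without loading any image or model. The function names `candidate_grids`, `pick_grid`, and `tile_boxes` are hypothetical, and the extra sort keys are an assumption added only to make the candidate order deterministic; the README code sorts by tile count alone.

```python
def candidate_grids(min_num=1, max_num=6):
    """All (columns, rows) grids whose tile count lies in [min_num, max_num]."""
    grids = {
        (cols, rows)
        for cols in range(1, max_num + 1)
        for rows in range(1, max_num + 1)
        if min_num <= cols * rows <= max_num
    }
    # Sort by tile count; the extra keys only fix the order among equal counts.
    return sorted(grids, key=lambda g: (g[0] * g[1], g[0], g[1]))


def pick_grid(width, height, grids, image_size=448):
    """Choose the grid whose aspect ratio is closest to width / height."""
    aspect_ratio = width / height
    best_diff, best = float("inf"), (1, 1)
    for cols, rows in grids:
        diff = abs(aspect_ratio - cols / rows)
        if diff < best_diff:
            best_diff, best = diff, (cols, rows)
        elif diff == best_diff and width * height > 0.5 * image_size * image_size * cols * rows:
            # Tie: take the larger grid only when the image is big enough to fill it.
            best = (cols, rows)
    return best


def tile_boxes(grid, image_size=448):
    """Crop boxes (left, upper, right, lower) in row-major order for the resized image."""
    cols, rows = grid
    return [
        ((i % cols) * image_size, (i // cols) * image_size,
         (i % cols + 1) * image_size, (i // cols + 1) * image_size)
        for i in range(cols * rows)
    ]


grids = candidate_grids(max_num=6)
grid = pick_grid(800, 600, grids)
print(grid)                            # (3, 2)
print(tile_boxes(grid)[-1])            # (896, 448, 1344, 896)
print(pick_grid(448, 448, grids))      # (1, 1)
print(pick_grid(1000, 1000, grids))    # (2, 2)
```

With `max_num=6`, an 800 x 600 image is resized to 1344 x 896 and cut into six 448 x 448 tiles. The two square inputs show the tie-break: both match the 1 x 1 and 2 x 2 grids exactly, and only the larger image is promoted to four tiles.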
+
+
+ ## Citation
+
+ If you find this project useful in your research, please consider citing:
+
+ ```BibTeX
+ @article{chen2023internvl,
+   title={InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks},
+   author={Chen, Zhe and Wu, Jiannan and Wang, Wenhai and Su, Weijie and Chen, Guo and Xing, Sen and Zhong, Muyan and Zhang, Qinglong and Zhu, Xizhou and Lu, Lewei and Li, Bin and Luo, Ping and Lu, Tong and Qiao, Yu and Dai, Jifeng},
+   journal={arXiv preprint arXiv:2312.14238},
+   year={2023}
+ }
+ ```
+
+ ## License
+
+ This project is released under the MIT license. Parts of this project contain code and models (e.g., LLaMA2) from other sources, which are subject to their respective licenses.
+
+ Llama 2 is licensed under the LLAMA 2 Community License, Copyright (c) Meta Platforms, Inc. All Rights Reserved.
+
+ ## Acknowledgement
+
+ InternVL is built with reference to the code of the following projects: [OpenAI CLIP](https://github.com/openai/CLIP), [Open CLIP](https://github.com/mlfoundations/open_clip), [CLIP Benchmark](https://github.com/LAION-AI/CLIP_benchmark), [EVA](https://github.com/baaivision/EVA/tree/master), [InternImage](https://github.com/OpenGVLab/InternImage), [ViT-Adapter](https://github.com/czczup/ViT-Adapter), [MMSegmentation](https://github.com/open-mmlab/mmsegmentation), [Transformers](https://github.com/huggingface/transformers), [DINOv2](https://github.com/facebookresearch/dinov2), [BLIP-2](https://github.com/salesforce/LAVIS/tree/main/projects/blip2), [Qwen-VL](https://github.com/QwenLM/Qwen-VL/tree/master/eval_mm), and [LLaVA-1.5](https://github.com/haotian-liu/LLaVA). Thanks for their awesome work!