A GPT-4V Level MLLM for Single Image, Multi Image and Video on Your Phone

[GitHub](https://github.com/Ucas-HaoranWei/GOT-OCR2.0/tree/main) ## Usage Inference using Huggingface transformers on NVIDIA GPUs. Requirements tested on python 3.10: ``` torch==2.0.1 torchvision==0.15.2 transformers==4.37.2 megfile==3.1.2 ``` ```python # test.py import torch from PIL import Image from transformers import AutoModel, AutoTokenizer model = AutoModel.from_pretrained('openbmb/MiniCPM-V-2_6', trust_remote_code=True, attn_implementation='sdpa', torch_dtype=torch.bfloat16) # sdpa or flash_attention_2, no eager model = model.eval().cuda() tokenizer = AutoTokenizer.from_pretrained('openbmb/MiniCPM-V-2_6', trust_remote_code=True) image = Image.open('xx.jpg').convert('RGB') question = 'What is in the image?' msgs = [{'role': 'user', 'content': [image, question]}] res = model.chat( image=None, msgs=msgs, tokenizer=tokenizer ) print(res) ## if you want to use streaming, please make sure sampling=True and stream=True ## the model.chat will return a generator res = model.chat( image=None, msgs=msgs, tokenizer=tokenizer, sampling=True, stream=True ) generated_text = "" for new_text in res: generated_text += new_text print(new_text, flush=True, end='') ```