metadata

library_name: transformers
tags:
  - llama-factory
  - yi-vl
  - llava
license: other
language:
  - zh
  - en
pipeline_tag: visual-question-answering

This is the Hugging Face version of the Yi-VL-6B model.

You can fine-tune this model on downstream tasks. We recommend our efficient fine-tuning toolkit, LLaMA-Factory: https://github.com/hiyouga/LLaMA-Factory

Developed by: 01-AI.
Language(s) (NLP): Chinese/English
License: Yi Series Model License

Usage:

import requests
from PIL import Image

import torch
from transformers import AutoProcessor, AutoModelForVision2Seq

model_id = "BUAADreamer/Yi-VL-6B-hf"

messages = [
  # LLaVA-style processors expect an <image> placeholder in the prompt (assumed here)
  { "role": "user", "content": "<image>What's in the picture?" }
]

model = AutoModelForVision2Seq.from_pretrained(
    model_id, 
    torch_dtype=torch.float16, 
    low_cpu_mem_usage=True, 
).to(0)
processor = AutoProcessor.from_pretrained(model_id)

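# Example image URL (an assumption, not part of the original card; substitute any image)
image_file = "http://images.cocodataset.org/val2017/000000039769.jpg"

# Render the chat template to a prompt string that ends with the assistant turn
text = [processor.tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)]
images = [Image.open(requests.get(image_file, stream=True).raw)]
inputs = processor(text=text, images=images, return_tensors='pt').to(0, torch.float16)

output = model.generate(**inputs, max_new_tokens=200)
# batch_decode returns a list of strings, one per generated sequence
output = processor.batch_decode(output, skip_special_tokens=True)
print(output[0].split("Assistant:")[-1].strip())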
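The final step of the snippet above keeps only the text after the last "Assistant:" marker. A minimal sketch of that post-processing as a reusable helper follows. The function names `extract_reply` and `extract_replies` and the sample strings are hypothetical and only illustrate the idea; they are not model output or part of any library.

```python
from typing import List


def extract_reply(decoded: str, marker: str = "Assistant:") -> str:
    """Return the text after the last occurrence of `marker`, stripped.

    If the marker is absent, the whole string is returned (stripped),
    which matches the behaviour of `decoded.split(marker)[-1].strip()`.
    """
    return decoded.rpartition(marker)[2].strip()


def extract_replies(decoded_batch: List[str], marker: str = "Assistant:") -> List[str]:
    """Apply `extract_reply` to every string returned by `batch_decode`."""
    return [extract_reply(text, marker) for text in decoded_batch]


# Made-up strings standing in for decoded sequences (hypothetical, for illustration)
sample_batch = [
    "What's in the picture? Assistant: A placeholder answer.",
    "  text without the marker  ",
]
print(extract_replies(sample_batch))
```

With the usage code above, this would be called as `extract_replies(output)` on the list returned by `processor.batch_decode`.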

Alternatively, you can launch a web demo with the LLaMA-Factory CLI:

llamafactory-cli webchat \
--model_name_or_path BUAADreamer/Yi-VL-6B-hf \
--template yivl \
--visual_inputs
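If you prefer to start the demo from a Python script, the same command can be assembled as an argument list and passed to `subprocess`. This is only a sketch: it assumes `llamafactory-cli` is installed and on your PATH, it uses exactly the flags shown above, and the helper names `build_webchat_command` and `launch_webchat` are hypothetical.

```python
import shlex
import subprocess
from typing import List


def build_webchat_command(
    model_path: str = "BUAADreamer/Yi-VL-6B-hf",
    template: str = "yivl",
) -> List[str]:
    """Build the argument list for the web demo command shown above."""
    return [
        "llamafactory-cli", "webchat",
        "--model_name_or_path", model_path,
        "--template", template,
        "--visual_inputs",
    ]


def launch_webchat(**kwargs) -> subprocess.CompletedProcess:
    """Run the web demo; blocks until the server process exits.

    Assumes llamafactory-cli is installed (not checked here).
    """
    return subprocess.run(build_webchat_command(**kwargs), check=True)


# Print the command for inspection without launching anything
print(shlex.join(build_webchat_command()))
```

Call `launch_webchat()` to actually start the server; passing a list rather than a single shell string avoids quoting issues if the model path contains spaces.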

lmms-eval Evaluation Results

Metric	Value
MMMU_val	36.8
CMMMU_val	32.2