Edit model card

UForm

Pocket-Sized Multimodal AI
For Content Understanding and Generation

Description

UForm-Gen is a small generative vision-language model primarily designed for Image Captioning and Visual Question Answering. The model consists of two parts:

  1. CLIP-like ViT-H/14
  2. Qwen1.5-0.5B-Chat

The model was pre-trained on the internal image captioning dataset and fine-tuned on public instructions datasets: SVIT, LVIS, VQAs datasets. The model took one day to train on a DGX-H100 with 8x H100 GPUs. Thanks to Nebius.ai for providing the compute 🤗

Usage

The generative model can be used to caption images, answer questions about them. Also it is suitable for a multimodal chat.

from transformers import AutoModel, AutoProcessor

model = AutoModel.from_pretrained("unum-cloud/uform-gen2-qwen-500m", trust_remote_code=True)
processor = AutoProcessor.from_pretrained("unum-cloud/uform-gen2-qwen-500m", trust_remote_code=True)

prompt = "Question or Instruction"
image = Image.open("image.jpg")

inputs = processor(text=[prompt], images=[image], return_tensors="pt")
with torch.inference_mode():
     output = model.generate(
        **inputs,
        do_sample=False,
        use_cache=True,
        max_new_tokens=256,
        eos_token_id=151645,
        pad_token_id=processor.tokenizer.pad_token_id
    )

prompt_len = inputs["input_ids"].shape[1]
decoded_text = processor.batch_decode(output[:, prompt_len:])[0]

You can check examples of different prompts in our demo space.

Evaluation

Model LLM Size SQA MME MMBench Average¹
UForm-Gen2-Qwen-500m 0.5B 45.5 880.1 42.0 29.31
MobileVLM v2 1.4B 52.1 1302.8 57.7 36.81
LLaVA-Phi 2.7B 68.4 1335.1 59.8 42.95

¹MME scores were divided by 2000 before averaging.

Downloads last month
36,011
Safetensors
Model size
1.27B params
Tensor type
F32
·
Inference Examples
Examples
This model does not have enough activity to be deployed to Inference API (serverless) yet. Increase its social visibility and check back later, or deploy to Inference Endpoints (dedicated) instead.

Datasets used to train unum-cloud/uform-gen2-qwen-500m

Spaces using unum-cloud/uform-gen2-qwen-500m 4

Collection including unum-cloud/uform-gen2-qwen-500m