metadata

language:
  - en
library_name: transformers
base_model: google/paligemma-3b-pt-224
pipeline_tag: visual-question-answering
inference: false
tags:
  - paligemma
  - coffe
  - caption
license: mit

Model Card for Fer14/paligemma_coffee_machine_caption

Google's PaliGemma VLM (Vision Language Model), fine-tuned to generate captions for coffee machine images.

Model Description

Developed by: Komorebi AI
Language(s) (NLP): English
License: MIT
Finetuned from model: google/paligemma-3b-pt-224
Demo: https://huggingface.co/spaces/Fer14/coffe_machine_caption

Usage


from transformers import PaliGemmaForConditionalGeneration, PaliGemmaProcessor
from PIL import Image


model_id = "Fer14/paligemma_coffee_machine_caption"

model = PaliGemmaForConditionalGeneration.from_pretrained(model_id)
processor = PaliGemmaProcessor.from_pretrained(model_id)


image = Image.open("path to your image").convert("RGB")

# Prompt describing the caption structure the model should follow.
prompt = (
    "Generate a caption for the following coffee maker image. The caption has to be of the following structure:\n"
    "\"A <color> <type>, <accessories>, <shape> shaped, with <screen> and <number> <b_color> butons\"\n\n"
    "in which:\n"
    "- color: red, black, blue...\n"
    "- type: coffee machine, coffee maker, espresso coffee machine...\n"
    "- accessories: a list of accessories like the ones described above\n"
    "- shape: cubed, round...\n"
    "- screen: screen, no screen.\n"
    "- number: amount of buttons to add\n"
    "- b_color: color of the buttons"
)

# Tokenize the prompt and preprocess the image into model inputs.
inputs = processor(
    text=prompt,
    images=image,
    return_tensors="pt",
    padding="longest",
)

output = model.generate(**inputs, max_length=1000)

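# generate() returns the input tokens followed by the new ones, so drop the
# input tokens and decode only the generated caption.
input_len = inputs["input_ids"].shape[-1]
decoded_output = processor.decode(output[0][input_len:], skip_special_tokens=True)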
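
The prompt above asks for a caption with a fixed structure, so the decoded text can in principle be split back into its fields. The following is a minimal sketch, assuming the model output follows the template closely; `CAPTION_RE` and `parse_caption` are hypothetical helpers that are not part of this repository, and the pattern assumes a single-word color, shape, and button color.

```python
import re

# Hypothetical pattern mirroring the caption template from the prompt:
# "A <color> <type>, <accessories>, <shape> shaped, with <screen> and <number> <b_color> buttons"
# Assumption: color, shape and button color are single words.
CAPTION_RE = re.compile(
    r"^An?\s+(?P<color>\S+)\s+(?P<type>[^,]+),\s*"
    r"(?P<accessories>.*),\s*"
    r"(?P<shape>\S+)\s+shaped,\s*with\s+"
    r"(?P<screen>no screen|(?:a\s+)?screen)\s+and\s+"
    r"(?P<number>\S+)\s+(?P<b_color>\S+)\s+but+ons?\.?$",
    re.IGNORECASE,
)


def parse_caption(caption):
    """Split a templated caption into a dict of fields, or return None if it does not match."""
    match = CAPTION_RE.match(caption.strip())
    if match is None:
        return None
    fields = match.groupdict()
    # Accessories are a comma-separated list inside the caption.
    fields["accessories"] = [
        item.strip() for item in fields["accessories"].split(",") if item.strip()
    ]
    # Turn the "screen" / "no screen" slot into a boolean.
    fields["has_screen"] = fields.pop("screen").lower() != "no screen"
    # Keep the button count as an int when it is written with digits.
    number = fields["number"]
    fields["number"] = int(number) if number.isdigit() else number
    return fields
```

Calling `parse_caption(decoded_output)` returns `None` whenever the caption deviates from the template, so callers should treat the parsed fields as best-effort rather than guaranteed.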

Framework versions

PEFT 0.11.1
Transformers 4.41.2
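
To compare a local environment against the versions listed above, a small check along these lines may help. It is only a sketch using the standard library; `EXPECTED`, `installed_version`, and `compare_versions` are illustrative names, and other versions may well work too.

```python
from importlib.metadata import PackageNotFoundError, version

# Versions listed in this model card.
EXPECTED = {"peft": "0.11.1", "transformers": "4.41.2"}


def installed_version(package):
    """Return the installed version string, or None if the package is missing."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


def compare_versions(expected=EXPECTED):
    """Map each package name to (expected, installed, exact_match)."""
    report = {}
    for package, wanted in expected.items():
        found = installed_version(package)
        report[package] = (wanted, found, found == wanted)
    return report


if __name__ == "__main__":
    for package, (wanted, found, ok) in compare_versions().items():
        status = "ok" if ok else "differs"
        print(f"{package}: card lists {wanted}, installed {found} ({status})")
```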