metadata
language:
  - en
library_name: transformers
base_model: google/paligemma-3b-pt-224
pipeline_tag: visual-question-answering
inference: false
tags:
  - paligemma
  - coffe
  - caption
license: mit

Model Card for paligemma_coffee_machine_caption

Google's PaliGemma VLM (vision-language model), fine-tuned to generate captions for coffee machine images.

Model Description

Usage


from transformers import PaliGemmaForConditionalGeneration, PaliGemmaProcessor
from PIL import Image


model_id = "Fer14/paligemma_coffee_machine_caption"

model = PaliGemmaForConditionalGeneration.from_pretrained(model_id)
processor = PaliGemmaProcessor.from_pretrained(model_id)


image = Image.open("path to your image").convert("RGB")

# The prompt is kept verbatim (spelling included): the model was presumably
# fine-tuned on this wording, so changing it may affect the output.
prompt = (
    "Generate a caption for the following coffee maker image. The caption has to be of the following structure:\n"
    "\"A <color> <type>, <accessories>, <shape> shaped, with <screen> and <number> <b_color> butons\"\n\n"
    "in which:\n"
    "- color: red, black, blue...\n"
    "- type: coffee machine, coffee maker, espresso coffee machine...\n"
    "- accessories: a list of accessories like the ones described above\n"
    "- shape: cubed, round...\n"
    "- screen: screen, no screen.\n"
    "- number: amount of buttons to add\n"
    "- b_color: color of the buttons"
)

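# Tokenize the prompt and preprocess the image into model inputs.
inputs = processor(
    text=prompt,
    images=image,
    return_tensors="pt",
    padding="longest",
)

# max_length bounds the total sequence length (prompt plus generated tokens).
output = model.generate(**inputs, max_length=1000)

# Decode, then strip the echoed prompt so only the generated caption remains.
decoded_output = processor.decode(output[0], skip_special_tokens=True)[len(prompt) :]
print(decoded_output)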
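
Because the prompt asks for a fixed caption structure, the generated caption can in principle be split back into its fields. The following is a minimal sketch, not part of the model or the `transformers` API: `CAPTION_RE` and `parse_caption` are hypothetical names, and the pattern assumes the model follows the template closely (single-word color and shape, and either "butons" or "buttons" at the end). Real outputs may deviate, in which case the function returns `None`.

```python
import re
from typing import Dict, Optional

# Hypothetical pattern mirroring the caption template from the prompt:
# "A <color> <type>, <accessories>, <shape> shaped, with <screen> and <number> <b_color> butons"
CAPTION_RE = re.compile(
    r"^An? (?P<color>\w+) (?P<type>[^,]+), "
    r"(?P<accessories>.+), "
    r"(?P<shape>\w+) shaped, "
    r"with (?P<screen>no screen|screen) "
    r"and (?P<number>\w+) (?P<b_color>\w+) but+ons?\.?$"
)


def parse_caption(caption: str) -> Optional[Dict[str, str]]:
    """Split a caption into its template fields, or return None if it does not match."""
    match = CAPTION_RE.match(caption.strip())
    return match.groupdict() if match else None


# Illustrative caption written by hand, not actual model output.
example = (
    "A black espresso coffee machine, with a steam wand and a drip tray, "
    "cubed shaped, with no screen and 3 silver butons"
)
fields = parse_caption(example)
```

In practice `parse_caption(decoded_output)` would be called on the decoded model output; checking for `None` is a cheap way to detect captions that drifted from the requested structure.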

Framework versions

  • PEFT 0.11.1
  • Transformers 4.41.2
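
To compare a local environment against the versions listed above, a short check along these lines may help. It is only a sketch: `CARD_VERSIONS`, `installed_version`, and `compare_versions` are hypothetical helper names, and the package names `peft` and `transformers` are assumed to be the PyPI distributions behind the list above. Other versions may work as well; this card does not state a compatibility range.

```python
from importlib.metadata import PackageNotFoundError, version

# Versions listed in this model card.
CARD_VERSIONS = {"peft": "0.11.1", "transformers": "4.41.2"}


def installed_version(package):
    """Return the installed version string of a package, or None if it is missing."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


def compare_versions(expected=CARD_VERSIONS):
    """Map each package name to a (listed version, installed version or None) pair."""
    return {name: (want, installed_version(name)) for name, want in expected.items()}


for name, (want, have) in compare_versions().items():
    print(f"{name}: card lists {want}, installed {have or 'not installed'}")
```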