---
language:
- en
library_name: transformers
base_model: google/paligemma-3b-pt-224
pipeline_tag: visual-question-answering
inference: false
tags:
- paligemma
- coffe
- caption
license: mit
---
# Model Card for paligemma_coffee_machine_caption
Google's PaliGemma VLM (Vision Language Model), fine-tuned to generate captions for coffee machine images.
### Model Description
- **Developed by:** Komorebi AI
- **Language(s) (NLP):** English
- **License:** MIT
- **Finetuned from model:** google/paligemma-3b-pt-224
- **Demo:** https://huggingface.co/spaces/Fer14/coffe_machine_caption
## Usage
```python
from transformers import PaliGemmaForConditionalGeneration, PaliGemmaProcessor
from PIL import Image

# Load the fine-tuned model and its processor from the Hugging Face Hub.
model_id = "Fer14/paligemma_coffee_machine_caption"
model = PaliGemmaForConditionalGeneration.from_pretrained(model_id)
processor = PaliGemmaProcessor.from_pretrained(model_id)

# Replace the placeholder with the path to your image.
image = Image.open("path to your image").convert("RGB")

# Prompt describing the caption structure the model should produce.
prompt = (
    "Generate a caption for the following coffee maker image. The caption has to be of the following structure:\n"
    "\"A <color> <type>, <accessories>, <shape> shaped, with <screen> and <number> <b_color> butons\"\n\n"
    "in which:\n"
    "- color: red, black, blue...\n"
    "- type: coffee machine, coffee maker, espresso coffee machine...\n"
    "- accessories: a list of accessories like the ones described above\n"
    "- shape: cubed, round...\n"
    "- screen: screen, no screen.\n"
    "- number: amount of buttons to add\n"
    "- b_color: color of the buttons"
)

# Tokenize the prompt and preprocess the image into model inputs.
inputs = processor(
    text=prompt,
    images=image,
    return_tensors="pt",
    padding="longest",
)

# Generate, then drop the echoed prompt from the start of the decoded text.
output = model.generate(**inputs, max_length=1000)
decoded_output = processor.decode(output[0], skip_special_tokens=True)[len(prompt) :]
```
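Because the prompt asks for a fixed caption structure, the decoded caption can in principle be split back into its fields. The following is a minimal sketch, not part of the released model: `parse_caption` and `CAPTION_RE` are hypothetical names, the example captions are made up for illustration, and the regular expression assumes the model follows the template from the prompt (it accepts both "butons" and "buttons"). Real outputs may deviate, in which case the function returns `None`.

```python
import re

# Hypothetical pattern mirroring the template in the prompt:
# "A <color> <type>, <accessories>, <shape> shaped, with <screen> and <number> <b_color> butons"
CAPTION_RE = re.compile(
    r"^An?\s+(?P<color>\S+)\s+(?P<type>[^,]+),\s*"
    r"(?P<accessories>.+),\s*"
    r"(?P<shape>\S+)\s+shaped,\s*"
    r"with\s+(?P<screen>no screen|screen)\s+and\s+"
    r"(?P<number>\S+)\s+(?P<b_color>.+?)\s+but+ons?\.?$",
    re.IGNORECASE,
)


def parse_caption(caption):
    """Return the caption fields as a dict, or None if the text does not fit the template."""
    match = CAPTION_RE.match(caption.strip())
    return match.groupdict() if match else None


# Made-up caption, used only to illustrate the expected shape of the result.
example = "A black espresso coffee machine, milk frother and cup tray, round shaped, with no screen and 3 silver buttons"
fields = parse_caption(example)
print(fields)
```

In the usage snippet above, this would be called as `parse_caption(decoded_output)`; checking for `None` before using the fields guards against captions that drift from the template.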
### Framework versions
- PEFT 0.11.1
- Transformers 4.41.2
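
To compare a local environment against the versions listed above, a small check along these lines may help. It is a sketch only: `check_versions` and `EXPECTED` are hypothetical names, and the assumption that the PyPI distribution names are `peft` and `transformers` is mine. Other versions may work as well; a mismatch is informational, not an error.

```python
from importlib import metadata

# Versions listed in this model card (distribution names are assumed).
EXPECTED = {"peft": "0.11.1", "transformers": "4.41.2"}


def check_versions(expected=EXPECTED, lookup=metadata.version):
    """Map each package to (installed_version_or_None, matches_expected)."""
    report = {}
    for package, wanted in expected.items():
        try:
            installed = lookup(package)
        except metadata.PackageNotFoundError:
            installed = None  # package is not installed in this environment
        report[package] = (installed, installed == wanted)
    return report


for package, (installed, ok) in check_versions().items():
    status = "matches" if ok else "differs from"
    print(f"{package}: installed={installed}, {status} {EXPECTED[package]}")
```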