---
language:
- en
library_name: transformers
base_model: google/paligemma-3b-pt-224
pipeline_tag: visual-question-answering
inference: false
tags:
- paligemma
- coffe
- caption
license: mit
---

# Model Card for paligemma_coffee_machine_caption

Google's PaliGemma VLM (vision-language model), fine-tuned to generate captions for coffee machine images.


### Model Description

<!-- Provide a longer summary of what this model is. -->



- **Developed by:** Komorebi AI
- **Language(s) (NLP):** English
- **License:** MIT
- **Finetuned from model:** google/paligemma-3b-pt-224
- **Demo:** https://huggingface.co/spaces/Fer14/coffe_machine_caption

## Usage

```python
from transformers import PaliGemmaForConditionalGeneration, PaliGemmaProcessor
from PIL import Image

model_id = "Fer14/paligemma_coffee_machine_caption"

# Load the fine-tuned model and its processor
model = PaliGemmaForConditionalGeneration.from_pretrained(model_id)
processor = PaliGemmaProcessor.from_pretrained(model_id)

# Replace the placeholder with the path to your own image
image = Image.open("path to your image").convert("RGB")

# Prompt describing the caption structure the model is expected to follow
prompt = (
    "Generate a caption for the following coffee maker image. The caption has to be of the following structure:\n"
    "\"A <color> <type>, <accessories>, <shape> shaped, with <screen> and <number> <b_color> butons\"\n\n"
    "in which:\n"
    "- color: red, black, blue...\n"
    "- type: coffee machine, coffee maker, espresso coffee machine...\n"
    "- accessories: a list of accessories like the ones described above\n"
    "- shape: cubed, round...\n"
    "- screen: screen, no screen.\n"
    "- number: amount of buttons to add\n"
    "- b_color: color of the buttons"
)

# Tokenize the prompt and preprocess the image into model inputs
inputs = processor(
    text=prompt,
    images=image,
    return_tensors="pt",
    padding="longest",
)

output = model.generate(**inputs, max_length=1000)

# The decoded text starts with the prompt, so slice it off to keep only the caption
decoded_output = processor.decode(output[0], skip_special_tokens=True)[len(prompt):]
```
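
Because the prompt asks for a fixed caption structure, the generated caption can be split back into its fields. The following is a minimal sketch, not part of the model or of `transformers`: `CAPTION_PATTERN` and `parse_caption` are hypothetical names, the sample caption is made up for illustration, and the parser assumes the model follows the template exactly. It accepts both "butons" (as spelled in the prompt) and "buttons", and returns `None` when the caption deviates from the template.

```python
import re
from typing import Dict, Optional

# Hypothetical pattern mirroring the template from the prompt:
# "A <color> <type>, <accessories>, <shape> shaped, with <screen> and <number> <b_color> butons"
CAPTION_PATTERN = re.compile(
    r"^A (?P<color>\w+) (?P<type>[^,]+), (?P<accessories>.+), "
    r"(?P<shape>\w+) shaped, with (?P<screen>no screen|screen) "
    r"and (?P<number>\w+) (?P<b_color>\w+) butt?ons\.?$",
    re.IGNORECASE,
)


def parse_caption(caption: str) -> Optional[Dict[str, str]]:
    """Split a templated caption into its fields, or return None if it does not fit."""
    match = CAPTION_PATTERN.match(caption.strip())
    return match.groupdict() if match else None


# Made-up caption used only to illustrate the expected structure
example = (
    "A red espresso coffee machine, with a milk frother and a cup tray, "
    "round shaped, with no screen and 3 black buttons"
)
fields = parse_caption(example)
print(fields)
```

In practice `parse_caption(decoded_output)` would be called on the caption produced above; checking for `None` is a cheap way to detect captions that drifted from the requested format.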


### Framework versions

- PEFT 0.11.1
- Transformers 4.41.2
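
To compare a local environment against the versions listed above, a small standard-library check can help. This is a sketch under assumptions: `EXPECTED_VERSIONS` and `check_versions` are hypothetical names, the package names `peft` and `transformers` are the usual PyPI distribution names, and an exact version match is not necessarily required for the model to work.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Dict, Optional, Tuple

# Versions listed in this model card
EXPECTED_VERSIONS = {"peft": "0.11.1", "transformers": "4.41.2"}


def check_versions(
    expected: Dict[str, str] = EXPECTED_VERSIONS,
) -> Dict[str, Tuple[Optional[str], bool]]:
    """Return {package: (installed_version_or_None, matches_expected)}."""
    report = {}
    for package, wanted in expected.items():
        try:
            installed = version(package)
        except PackageNotFoundError:
            installed = None  # package is not installed in this environment
        report[package] = (installed, installed == wanted)
    return report


for name, (installed, matches) in check_versions().items():
    print(f"{name}: installed={installed}, matches card={matches}")
```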