---
language: en
datasets:
- laion2b
---

# OpenFlamingo-4B (CLIP ViT-L/14, RedPajama-INCITE-Instruct-3B-v1)

[Paper](https://arxiv.org/abs/2308.01390) | [Blog post](https://laion.ai/blog/open-flamingo-v2/) | [Code](https://github.com/mlfoundations/open_flamingo) | [Demo](https://huggingface.co/spaces/openflamingo/OpenFlamingo)

OpenFlamingo is an open-source implementation of DeepMind's [Flamingo](https://www.deepmind.com/blog/tackling-multiple-tasks-with-a-single-visual-language-model) models.
This 4B-parameter model uses a [CLIP ViT-L/14](https://huggingface.co/openai/clip-vit-large-patch14) vision encoder and an instruction-tuned [RedPajama-3B](https://huggingface.co/togethercomputer/RedPajama-INCITE-Instruct-3B-v1) language model.

## Model Details
We follow the Flamingo modeling paradigm, outfitting the layers of a pretrained, frozen language model such that they cross-attend to visual features when decoding. Following Flamingo, we freeze the vision encoder and language model but train the connecting modules on web-scraped image-text sequences. Specifically, we trained this model on a mixture of [LAION-2B](https://arxiv.org/abs/2210.08402), [Multimodal C4](https://arxiv.org/abs/2304.06939), and custom ChatGPT-generated sequences using images from LAION (to be released soon).

This model has cross-attention modules inserted in *every other* decoder block. It was trained using FullyShardedDataParallel across 64 A100 40GB GPUs at FP32 precision.
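
For readers who want to verify the freezing scheme described above, the hedged sketch below tallies trainable versus frozen parameters for an arbitrary `torch.nn.Module`; it can be applied to the model built in the Initialization section. The grouping by top-level submodule name is purely illustrative and does not assume anything about OpenFlamingo's internal module names.

``` python
from collections import defaultdict

import torch.nn as nn


def summarize_trainable(module: nn.Module) -> None:
    """Print trainable vs. frozen parameter counts, grouped by top-level submodule."""
    counts = defaultdict(lambda: [0, 0])  # submodule name -> [trainable, frozen]
    for name, param in module.named_parameters():
        top = name.split(".")[0]
        counts[top][0 if param.requires_grad else 1] += param.numel()
    for top, (trainable, frozen) in counts.items():
        print(f"{top}: trainable={trainable:,} frozen={frozen:,}")


# Example, once a model has been created as in the Initialization section below:
# summarize_trainable(model)
```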

## Uses
OpenFlamingo models process arbitrarily interleaved sequences of images and text to output text. This allows the models to accept in-context examples and undertake tasks like captioning, visual question answering, and image classification. 
### Initialization

``` python
from open_flamingo import create_model_and_transforms

model, image_processor, tokenizer = create_model_and_transforms(
    clip_vision_encoder_path="ViT-L-14",
    clip_vision_encoder_pretrained="openai",
    lang_encoder_path="togethercomputer/RedPajama-INCITE-Instruct-3B-v1",
    tokenizer_path="togethercomputer/RedPajama-INCITE-Instruct-3B-v1",
    cross_attn_every_n_layers=2
)

# grab model checkpoint from huggingface hub
from huggingface_hub import hf_hub_download
import torch

checkpoint_path = hf_hub_download("openflamingo/OpenFlamingo-4B-vitl-rpj3b-langinstruct", "checkpoint.pt")
model.load_state_dict(torch.load(checkpoint_path), strict=False)
```
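
Because `strict=False` silently ignores mismatched keys, it can be worth capturing the value returned by `load_state_dict` to confirm that the checkpoint actually matched the model. A minimal sketch of that check (the `map_location="cpu"` argument is just a conservative choice, not a requirement):

``` python
state_dict = torch.load(checkpoint_path, map_location="cpu")
result = model.load_state_dict(state_dict, strict=False)

# Both lists should be empty, or contain only keys you intentionally ignore.
print("missing keys:", result.missing_keys)
print("unexpected keys:", result.unexpected_keys)
```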
### Generation example
Below is an example of generating text conditioned on interleaved images/text. In particular, let's try few-shot image captioning.

``` python
from PIL import Image
import requests

"""
Step 1: Load images
"""
demo_image_one = Image.open(
    requests.get(
        "http://images.cocodataset.org/val2017/000000039769.jpg", stream=True
    ).raw
)

demo_image_two = Image.open(
    requests.get(
        "http://images.cocodataset.org/test-stuff2017/000000028137.jpg",
        stream=True
    ).raw
)

query_image = Image.open(
    requests.get(
        "http://images.cocodataset.org/test-stuff2017/000000028352.jpg", 
        stream=True
    ).raw
)


"""
Step 2: Preprocessing images
Details: For OpenFlamingo, we expect the image to be a torch tensor of shape 
 batch_size x num_media x num_frames x channels x height x width. 
 In this case batch_size = 1, num_media = 3, num_frames = 1,
 channels = 3, height = 224, width = 224.
"""
vision_x = [
    image_processor(demo_image_one).unsqueeze(0),
    image_processor(demo_image_two).unsqueeze(0),
    image_processor(query_image).unsqueeze(0),
]
vision_x = torch.cat(vision_x, dim=0)
vision_x = vision_x.unsqueeze(1).unsqueeze(0)
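
# Hedged sanity check: with the three demo images stacked as above, vision_x should
# have shape (batch_size, num_media, num_frames, channels, height, width)
# = (1, 3, 1, 3, 224, 224) for the 224x224 CLIP ViT-L/14 preprocessing.
assert vision_x.shape == (1, 3, 1, 3, 224, 224)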

"""
Step 3: Preprocessing text
Details: In the text we expect an <image> special token to indicate where an image is.
 We also expect an <|endofchunk|> special token to indicate the end of the text 
 portion associated with an image.
"""
tokenizer.padding_side = "left" # For generation, padding tokens should be on the left
lang_x = tokenizer(
    ["<image>An image of two cats.<|endofchunk|><image>An image of a bathroom sink.<|endofchunk|><image>An image of"],
    return_tensors="pt",
)
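
# Hedged check: create_model_and_transforms is assumed to register <image> and
# <|endofchunk|> as additional special tokens, so they should show up in the
# tokenizer's special-token map rather than being split into sub-tokens.
print(tokenizer.special_tokens_map)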


"""
Step 4: Generate text
"""
generated_text = model.generate(
    vision_x=vision_x,
    lang_x=lang_x["input_ids"],
    attention_mask=lang_x["attention_mask"],
    max_new_tokens=20,
    num_beams=3,
)

print("Generated text: ", tokenizer.decode(generated_text[0]))
```
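
Note that `generated_text` includes the prompt tokens as well as the continuation, which is the usual behavior for decoder-only generation; the sketch below assumes that behavior holds here and slices off the prompt to print only the newly generated caption.

``` python
prompt_length = lang_x["input_ids"].shape[1]
new_tokens = generated_text[0][prompt_length:]
print("Caption: ", tokenizer.decode(new_tokens, skip_special_tokens=True))
```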

### Bias, Risks, and Limitations
OpenFlamingo models inherit the risks of their parent models, especially the language model. As an open-source research effort, we highly value open, accessible, reproducible multimodal model research; however, it is crucial to be aware that these models are trained on web data, have not been finetuned for safety, and thus may produce unintended, inappropriate, unreliable, and/or inaccurate outputs. Please use caution before deploying OpenFlamingo models in real applications. We also hope that OpenFlamingo enables further safety and reliability research to address these issues.

In an effort to mitigate current potential biases and harms, we have deployed a text content filter on model outputs in the OpenFlamingo demo. We continue to red-team the model to understand and improve its safety.

## Evaluation
<table>
  <tr>
    <th>Benchmark</th>
    <th>0-shot</th>
    <th>4-shot</th>
    <th>8-shot</th>
    <th>16-shot</th>
    <th>32-shot</th>
  </tr>
  <tr>
    <th>COCO (CIDEr)</th>
    <td>81.2 (0.3)</td>
    <td>85.8 (0.5)</td>
    <td>94.8 (0.2)</td>
    <td>98.0 (0.3)</td>
    <td>99.2 (0.3)</td>
  </tr>
  <tr>
    <th>VQAv2 (Accuracy)</th>
    <td>46.3 (0.6)</td>
    <td>49.1 (0.2)</td>
    <td>47.7 (0.5)</td>
    <td>45.9 (0.9)</td>
    <td>47.0 (0.2)</td>
  </tr>
  <tr>
    <th>Flickr-30K (CIDEr)</th>
    <td>55.6 (1.3)</td>
    <td>61.2 (0.5)</td>
    <td>59.0 (1.0)</td>
    <td>54.8 (1.0)</td>
    <td>53.0 (0.5)</td>
  </tr>
  <tr>
    <th>OK-VQA (Accuracy)</th>
    <td>29.7 (0.2)</td>
    <td>34.3 (0.2)</td>
    <td>32.4 (0.2)</td>
    <td>30.7 (0.3)</td>
    <td>32.5 (0.1)</td>
  </tr>
  <tr>
    <th>TextVQA (Accuracy)</th>
    <td>21.1 (0.4)</td>
    <td>27.2 (0.3)</td>
    <td>25.1 (0.2)</td>
    <td>23.2 (0.1)</td>
    <td>23.2 (0.2)</td>
  </tr>
  <tr>
    <th>Vizwiz (Accuracy)</th>
    <td>14.9 (0.1)</td>
    <td>21.0 (0.4)</td>
    <td>27.1 (1.4)</td>
    <td>33.6 (0.6)</td>
    <td>37.1 (0.3)</td>
  </tr>
  <tr>
    <th>Hateful Memes (ROC AUC)</th>
    <td>53.2 (2.6)</td>
    <td>53.2 (3.3)</td>
    <td>52.9 (3.0)</td>
    <td>55.9 (1.1)</td>
    <td>55.0 (1.5)</td>
  </tr>
</table>