Files changed (1)
  1. README.md +65 -26
README.md CHANGED
@@ -73,15 +73,15 @@ We also fine-tune these base models on a mixture of SFT datasets (TODO: find a m
  - **Resources for more information:**
  - [GitHub Repo](https://github.com/huggingface/m4/)
  - Description of [OBELISC](https://huggingface.co/datasets/HuggingFaceM4/OBELISC): [OBELISC: An Open Web-Scale Filtered Dataset of Interleaved Image-Text Documents
- ](https://arxiv.org/abs/2306.16527)
- - Original Paper: [Flamingo: a Visual Language Model for Few-Shot Learning](https://arxiv.org/abs/2204.14198)

- ATUM is a large multimodal model that takes sequences of interleaved images and texts as inputs and generates text outputs.
  The model shows strong in-context few-shot learning capabilities (on par with the closed-source model) and is a robust starting point for fine-tuning multimodal models on custom data.

  ATUM is built on top of two unimodal open-access pre-trained models to connect the two modalities. Newly initialized parameters in the form of Transformer blocks bridge the gap between the vision encoder and the language model. The model is trained on a mixture of image/text pairs and unstructured multimodal web documents.

-
  # Uses

  <!-- Address questions around how the model is intended to be used, including the foreseeable users of the model and those affected by the model. -->
@@ -109,34 +109,85 @@ More information needed

  # Training Details

- We closel follow the training procedure layed out in [Flamingo](https://arxiv.org/abs/2204.14198). We combine two open-source pre-trained models ([laion/CLIP-ViT-H-14-laion2B-s32B-b79K](https://huggingface.co/laion/CLIP-ViT-H-14-laion2B-s32B-b79K) and [huggyllama/llama-65b](https://huggingface.co/huggyllama/llama-65b)) by initializing new Transformer blocks.

- The model is trained on the following data mixture of openly accessible data:

  | Data Source | Type of Data | Number of Tokens in Source | Number of Images in Source | Epochs | Effective Proportion in Number of Tokens |
  |-------------|-----------------------------------------|---------------------------|---------------------------|--------|-----------------------------------------|
- | PMD | Image-Text Pairs | TODO | TODO | 3 | 73.85% |
- | LAION | Image-Text Pairs | TODO | TODO | 1 | 6.15% |
- | OBELISC | Unstructured Multimodal Web Documents | TODO | TODO | 3 | 2.82% |
- | Wikipedia | Unstructured Multimodal Web Documents | TODO | TODO | 1 | 17.18% |
-
- For multimodal web documents, we feed the model sequences corresponding to the succession of text paragraphs and images.
- For image-text pairs, we form the training sequences by packing images with their captions.
- The images are encoded with the vision encoder and vision hidden states are pooled with Transformer Perceiver blocks and then fused into the text sequence through the cross-attention blocks.

  The training objective is the standard next token prediction.

  # Evaluation

  <!-- This section describes the evaluation protocols and provides the results. -->
  We closely follow the evaluation protocol of Flamingo and evaluate ATUM on a suite of downstream image + text benchmarks ranging from visual question answering to image captioning.
  We compare our model to the original Flamingo along with [OpenFlamingo](https://huggingface.co/openflamingo/OpenFlamingo-9B-vitl-mpt7b), another open-source reproduction.

  TODO: beautiful plots of shots scaling laws.

  TODO: detail of the numbers in a table.

  # Bias, Risks, and Limitations
  <!-- This section is meant to convey both technical and sociotechnical limitations. -->
@@ -163,18 +214,6 @@ Carbon emissions can be estimated using the [Machine Learning Impact calculator]
  - **Cloud Provider:** AWS SageMaker
  - **Carbon Emitted:** unknown

- # Technical Specifications
-
- ## Hardware
-
- The training was performed on an AWS SageMaker cluster with 64 nodes of 8x80GB A100 GPUs (512 GPUs total). The cluster uses the current EFA network which provides about 340GBps throughput.
-
- As the network is quite slow for the needs of DeepSpeed ZeRO-3 we were only able to clock ~90 TFLOPs.
-
- ## Software
-
- The training software is built on top of HuggingFace Transformers + Accelerate, and DeepSpeed ZeRO-3 for training, and [WebDataset](https://github.com/webdataset/webdataset) for data loading.
-

  # Citation

 
  - **Resources for more information:**
  - [GitHub Repo](https://github.com/huggingface/m4/)
  - Description of [OBELISC](https://huggingface.co/datasets/HuggingFaceM4/OBELISC): [OBELISC: An Open Web-Scale Filtered Dataset of Interleaved Image-Text Documents
+ ](https://huggingface.co/papers/2306.16527)
+ - Original Paper: [Flamingo: a Visual Language Model for Few-Shot Learning](https://huggingface.co/papers/2204.14198)

+ ATUM is a large multimodal English model that takes sequences of interleaved images and texts as inputs and generates text outputs.
  The model shows strong in-context few-shot learning capabilities (on par with the closed-source model) and is a robust starting point for fine-tuning multimodal models on custom data.

  ATUM is built on top of two unimodal open-access pre-trained models to connect the two modalities. Newly initialized parameters in the form of Transformer blocks bridge the gap between the vision encoder and the language model. The model is trained on a mixture of image/text pairs and unstructured multimodal web documents.

+
  # Uses

  <!-- Address questions around how the model is intended to be used, including the foreseeable users of the model and those affected by the model. -->
 

  # Training Details

+ We closely follow the training procedure laid out in [Flamingo](https://huggingface.co/papers/2204.14198). We combine two open-source pre-trained models ([laion/CLIP-ViT-H-14-laion2B-s32B-b79K](https://huggingface.co/laion/CLIP-ViT-H-14-laion2B-s32B-b79K) and [huggyllama/llama-65b](https://huggingface.co/huggyllama/llama-65b)) by initializing new Transformer blocks. The pre-trained backbones are frozen while we train the newly initialized parameters.

+ The model is trained on the following data mixture of openly accessible English data:

  | Data Source | Type of Data | Number of Tokens in Source | Number of Images in Source | Epochs | Effective Proportion in Number of Tokens |
  |-------------|-----------------------------------------|---------------------------|---------------------------|--------|-----------------------------------------|
+ | [PMD](https://huggingface.co/datasets/facebook/pmd) | Image-Text Pairs | TODO | TODO | 3 | 73.85% |
+ | [LAION](https://huggingface.co/datasets/laion/laion2B-en) | Image-Text Pairs | TODO | TODO | 1 | 6.15% |
+ | [OBELISC](https://huggingface.co/datasets/HuggingFaceM4/OBELISC) | Unstructured Multimodal Web Documents | TODO | TODO | 3 | 2.82% |
+ | [Wikipedia](https://huggingface.co/datasets/wikipedia) | Unstructured Multimodal Web Documents | TODO | TODO | 1 | 17.18% |
+
+ **PMD** is a collection of publicly available image-text pair datasets. It contains pairs from Conceptual Captions, Conceptual Captions 12M, WIT, Localized Narratives, RedCaps, COCO, SBU Captions, Visual Genome, and a subset of the YFCC100M dataset. Due to a server failure at the time of pre-processing, we did not include SBU Captions.
+
+ **LAION** is a collection of image-text pairs collected from web pages in Common Crawl, where the texts are obtained from the alternative (alt) text of each image.
+
+ **Wikipedia** is the multimodal equivalent of the encyclopedia. We used the English dump of Wikipedia created on February 20th, 2023.
+
+ **OBELISC** is an open, massive, and curated collection of interleaved image-text web documents, containing 141M documents, 115B text tokens, and 353M images. An interactive visualization of the dataset content is available [here](TODO).
+
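The effective proportion of each source is its token count weighted by its number of epochs, normalized across all sources. A minimal sketch of that bookkeeping; the token counts below are placeholders, since the real counts are still TODO in the table, and are not the actual mixture:

```python
# Sketch only: effective mixture proportion = tokens * epochs, normalized.
# The token counts below are PLACEHOLDERS (the real counts are TODO above).
def effective_proportions(sources: dict) -> dict:
    """sources maps name -> (tokens_in_source, epochs)."""
    weighted = {name: tokens * epochs for name, (tokens, epochs) in sources.items()}
    total = sum(weighted.values())
    return {name: w / total for name, w in weighted.items()}

mixture = {  # placeholder token counts, in billions
    "PMD": (60.0, 3),
    "LAION": (15.0, 1),
    "OBELISC": (3.0, 3),
    "Wikipedia": (40.0, 1),
}
props = effective_proportions(mixture)
```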
+ For multimodal web documents, we feed the model sequences corresponding to the succession of text paragraphs and images. For image-text pairs, we form the training sequences by packing images with their captions. The images are encoded with the vision encoder, and the vision hidden states are pooled with Transformer Perceiver blocks and then fused into the text sequence through the cross-attention blocks.
+
+ Following [Dehghani et al., 2023](https://huggingface.co/papers/2302.05442), we apply a layer normalization to the projected queries and keys of both the Perceiver and cross-attention blocks, which improved training stability in our early experiments. We use the [RMSNorm](https://huggingface.co/papers/1910.07467) implementation for trainable Layer Norms.
+
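For illustration only, here is a minimal NumPy sketch of RMSNorm as it would be applied to projected queries or keys; the shapes and gain initialization are assumptions, not the training code:

```python
import numpy as np

def rms_norm(x: np.ndarray, gain: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    """RMSNorm (Zhang & Sennrich, 2019): rescale x by its root-mean-square
    over the last axis, then apply a learned per-dimension gain."""
    rms = np.sqrt(np.mean(x ** 2, axis=-1, keepdims=True) + eps)
    return x / rms * gain

# Toy example: normalize a batch of projected query vectors (hypothetical shapes).
q = np.array([[1.0, 2.0, 2.0]])  # shape (1, d)
g = np.ones(3)                   # trainable gain, initialized to 1
q_normed = rms_norm(q, g)
```

RMSNorm omits the mean-centering of standard LayerNorm, which makes it cheaper while, per the paper, remaining comparably stable.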
  The training objective is the standard next token prediction.
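
To make the objective concrete, here is a hedged NumPy sketch of cross-entropy for a single next-token prediction plus the auxiliary z-loss term listed in the hyper-parameter table; the real loss is computed over full sequences, and this is not the training implementation:

```python
import numpy as np

def next_token_loss(logits: np.ndarray, target: int, z_weight: float = 1e-3) -> float:
    """Cross-entropy for one next-token prediction plus an auxiliary
    z-loss term (z_weight * log(Z)**2, cf. PaLM), where Z is the softmax
    partition function. Sketch only, not the actual training code."""
    m = logits.max()
    log_z = m + np.log(np.sum(np.exp(logits - m)))  # numerically stable logsumexp
    ce = log_z - logits[target]                     # equals -log softmax(logits)[target]
    return float(ce + z_weight * log_z ** 2)

loss = next_token_loss(np.array([2.0, 0.5, -1.0]), target=0)
```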
+ We use the following hyper-parameters and training settings:
+
+ | Parameters | | ATUM | ATUM-9b |
+ | -- | -- | -- | -- |
+ | Perceiver Resampler | Number of Layers | 6 | 6 |
+ | | Number of Latents | 64 | 64 |
+ | | Number of Heads | 16 | 16 |
+ | | Resampler Head Dimension | 96 | 96 |
+ | Model | Language Model Backbone | [Llama-65b](https://huggingface.co/huggyllama/llama-65b) | [Llama-7b](https://huggingface.co/huggyllama/llama-7b) |
+ | | Vision Model Backbone | [laion/CLIP-ViT-H-14-laion2B-s32B-b79K](https://huggingface.co/laion/CLIP-ViT-H-14-laion2B-s32B-b79K) | [laion/CLIP-ViT-H-14-laion2B-s32B-b79K](https://huggingface.co/laion/CLIP-ViT-H-14-laion2B-s32B-b79K) |
+ | | Cross-Layer Interval | 4 | 4 |
+ | Training | Sequence Length | 1024 | 1024 |
+ | | Effective Batch Size (# of tokens) | 3.67M | 1.31M |
+ | | Max Training Steps | 200K | 200K |
+ | | Weight Decay | 0.1 | 0.1 |
+ | | Optimizer | Adam(0.9, 0.999) | Adam(0.9, 0.999) |
+ | | Gradient Clipping | 1.0 | 1.0 |
+ | | [Z-loss](https://huggingface.co/papers/2204.02311) weight | 1e-3 | 1e-3 |
+ | Learning Rate | Initial Max | 5e-5 | 1e-5 |
+ | | Initial Final | 3e-5 | 6e-6 |
+ | | Decay Schedule | Linear | Linear |
+ | | Linear Warmup Steps | 2K | 2K |
+ | Large-scale Optimization | Gradient Checkpointing | True | True |
+ | | Precision | Mixed-precision bf16 | Mixed-precision bf16 |
+ | | ZeRO Optimization | Stage 3 | Stage 3 |
+
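As a quick sanity check on the table, the effective batch size in sequences follows from dividing the token count per batch by the sequence length (approximate, since the token figures above are rounded):

```python
# Rough arithmetic from the table above: effective batch size in sequences
# = (effective batch size in tokens) / (sequence length). The token figures
# are rounded (3.67M, 1.31M), so the results are approximate.
SEQ_LEN = 1024
atum_seqs = 3.67e6 / SEQ_LEN     # roughly 3584 sequences per batch for ATUM
atum_9b_seqs = 1.31e6 / SEQ_LEN  # roughly 1279 sequences per batch for ATUM-9b
```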
  # Evaluation

  <!-- This section describes the evaluation protocols and provides the results. -->
  We closely follow the evaluation protocol of Flamingo and evaluate ATUM on a suite of downstream image + text benchmarks ranging from visual question answering to image captioning.
+
  We compare our model to the original Flamingo along with [OpenFlamingo](https://huggingface.co/openflamingo/OpenFlamingo-9B-vitl-mpt7b), another open-source reproduction.

+ We perform checkpoint selection based on validation sets of TODO, and select the checkpoint at step 65,000 for ATUM-9B and at step 37,500 for ATUM. The models are evaluated with in-context few-shot learning, where the priming instances are selected from a support set to be similar (i.e., close in a vector space) to the queried instance. We do not use any form of ensembling.
+
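The similarity-based selection of priming instances amounts to a nearest-neighbour lookup over embeddings. A toy sketch, with a hypothetical embedding space (the actual embedding model is not specified here):

```python
import numpy as np

def select_priming_instances(query_vec: np.ndarray, support_vecs: np.ndarray,
                             n_shots: int) -> np.ndarray:
    """Return indices of the n_shots support instances whose (hypothetical)
    embeddings have the highest cosine similarity to the query embedding."""
    q = query_vec / np.linalg.norm(query_vec)
    s = support_vecs / np.linalg.norm(support_vecs, axis=1, keepdims=True)
    sims = s @ q                      # cosine similarity to each support instance
    return np.argsort(-sims)[:n_shots]

# Toy support set of 4 embeddings; the query is closest to indices 0 and 3.
support = np.array([[1.0, 0.1], [0.0, 1.0], [-1.0, 0.0], [0.7, 0.7]])
shots = select_priming_instances(np.array([1.0, 0.0]), support, n_shots=2)
```

At evaluation time, the returned indices would pick the support examples used to build the few-shot prompt.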
  TODO: beautiful plots of shots scaling laws.

  TODO: detail of the numbers in a table.


+ # Technical Specifications
+
+ ## Hardware
+
+ The training was performed on an AWS SageMaker cluster with 64 nodes of 8x80GB A100 GPUs (512 GPUs total). The cluster uses the current EFA network, which provides about 340 GBps throughput.
+
+ As the network is quite slow for the needs of DeepSpeed ZeRO-3, we were only able to clock ~90 TFLOPs.
+
+ ## Software
+
+ The training software is built on top of HuggingFace Transformers + Accelerate, with DeepSpeed ZeRO-3 for training and [WebDataset](https://github.com/webdataset/webdataset) for data loading.
+
+
  # Bias, Risks, and Limitations
  <!-- This section is meant to convey both technical and sociotechnical limitations. -->
 
  - **Cloud Provider:** AWS SageMaker
  - **Carbon Emitted:** unknown


  # Citation