metadata

language: en
tags:
  - multimodal
  - text
  - image
license: other
datasets:
  - HuggingFaceM4/OBELISC
  - wikipedia
  - facebook/pmd
  - laion/laion2B-en

TODO: logo?

Model Card for m4-80b

ATUM (Adapted Transformers for Unstructured Multimodal data) is an open-access reproduction of Flamingo, a closed-source visual language model developed by DeepMind. The multimodal model accepts arbitrary sequences of image and text inputs and produces text outputs. It is built solely on publicly available data and models. ATUM (TODO) is on par with the original model on various image + text benchmarks, including visual question answering (open-ended and multiple choice), image captioning, and image classification, when evaluated with in-context few-shot learning.

The model comes in two variants: a large 80-billion-parameter version and a 9-billion-parameter version. We also fine-tune these base models on a mixture of SFT datasets (TODO: find a more understandable characterization), which boosts downstream performance while making the models more usable in conversational settings: (TODO: 80B-sfted) and (TODO: 9B sfted).

Table of Contents
Model Details
- Model Description
Uses
Bias, Risks, and Limitations
- Recommendations
Training Details
- Training Data
- Training Procedure
  - Preprocessing
  - Speeds, Sizes, Times
Evaluation
- Testing Data, Factors & Metrics
- Results
Model Examination
Environmental Impact
Technical Specifications [optional]
- Model Architecture and Objective
- Compute Infrastructure
  - Hardware
  - Software
Citation
Glossary [optional]
More Information [optional]
Model Card Authors [optional]
Model Card Contact
How to Get Started with the Model

Model Details

Developed by: Hugging Face
Model type: Multi-modal model (text+image)
Language(s) (NLP): en
License: other
Parent Model: laion/CLIP-ViT-H-14-laion2B-s32B-b79K and huggyllama/llama-65b
Resources for more information:
- GitHub Repo
- Description of OBELISC: OBELISC: An Open Web-Scale Filtered Dataset of Interleaved Image-Text Documents
- Original Paper: Flamingo: a Visual Language Model for Few-Shot Learning

ATUM is a large multimodal English model that takes sequences of interleaved images and texts as inputs and generates text outputs. The model shows strong in-context few-shot learning capabilities, on par with the closed-source model, and is a robust starting point for fine-tuning multimodal models on custom data.

ATUM is built on top of two unimodal open-access pre-trained models to connect the two modalities. Newly initialized parameters in the form of Transformer blocks bridge the gap between the vision encoder and the language model. The model is trained on a mixture of image/text pairs and unstructured multimodal web documents.

Uses

The model can be used to perform inference on multimodal (image + text) tasks in which the input is composed of a text query/instruction along with one or multiple images. This model does not support image generation.

It is possible to fine-tune the base model on custom data for a specific use case. We note that the instruction-fine-tuned models are significantly better at following instructions and thus should be preferred when using the models out of the box.

The following screenshot is an example of interaction with the model:

TODO: screenshot

How to Get Started with the Model

Use the code below to get started with the model.


More information needed

Training Details

We closely follow the training procedure laid out in Flamingo. We combine two open-source pre-trained models (laion/CLIP-ViT-H-14-laion2B-s32B-b79K and huggyllama/llama-65b) by initializing new Transformer blocks. The pre-trained backbones are frozen while we train the newly initialized parameters.

The model is trained on the following data mixture of openly accessible English data:

Data Source	Type of Data	Number of Tokens in Source	Number of Images in Source	Epochs	Effective Proportion in Number of Tokens
OBELISC	Unstructured Multimodal Web Documents	114.9B	353M	1	73.85%
Wikipedia	Unstructured Multimodal Web Documents	3.192B	TODO	3	6.15%
LAION	Image-Text Pairs	29.9B	1.120B	1	17.18%
PMD	Image-Text Pairs	1.6B	70M	3	2.82%

OBELISC is an open, massive and curated collection of interleaved image-text web documents, containing 141M documents, 115B text tokens and 353M images. An interactive visualization of the dataset content is available here.

Wikipedia is the multimodal equivalent of the encyclopedia. We used the English dump of Wikipedia created on February 20th, 2023.

LAION is a collection of image-text pairs collected from Common Crawl web pages, in which the text of each pair is the alternative text of the image. We deduplicated it (following this paper), slightly filtered it, and removed the opted-out images.

PMD is a collection of publicly available image-text pair datasets. The dataset contains pairs from Conceptual Captions, Conceptual Captions 12M, WIT, Localized Narratives, RedCaps, COCO, SBU Captions, Visual Genome and a subset of the YFCC100M dataset. Due to a server failure at the time of pre-processing, we did not include SBU Captions in our training data.

For multimodal web documents, we feed the model sequences corresponding to the succession of text paragraphs and images. For image-text pairs, we form the training sequences by packing images with their captions. The images are encoded with the vision encoder and vision hidden states are pooled with Transformer Perceiver blocks and then fused into the text sequence through the cross-attention blocks.
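The packing of image-text pairs described above can be illustrated with a minimal sketch. It is not the actual training pipeline: the greedy strategy, the `<image>` and `</s>` placeholder tokens, and the `pack_pairs` helper are all assumptions made for illustration.

```python
from typing import Dict, List, Tuple

IMAGE_TOKEN = "<image>"  # hypothetical placeholder marking where an image is attended to
EOS_TOKEN = "</s>"       # hypothetical separator between packed examples


def pack_pairs(pairs: List[Tuple[str, List[str]]], max_len: int) -> List[Dict[str, list]]:
    """Greedily pack (image_id, caption_tokens) pairs into sequences of at most max_len tokens."""
    sequences: List[Dict[str, list]] = []
    tokens: List[str] = []
    images: List[str] = []
    for image_id, caption in pairs:
        # One example = image placeholder + caption + separator, truncated if too long.
        example = ([IMAGE_TOKEN] + caption + [EOS_TOKEN])[:max_len]
        # Start a new sequence when the current one cannot hold this example.
        if tokens and len(tokens) + len(example) > max_len:
            sequences.append({"tokens": tokens, "images": images})
            tokens, images = [], []
        tokens = tokens + example
        images = images + [image_id]
    if tokens:
        sequences.append({"tokens": tokens, "images": images})
    return sequences
```

Each packed sequence keeps one image placeholder per image, so the vision features can later be aligned with the right position in the text.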

Following Dehghani et al., 2023 (https://huggingface.co/papers/2302.05442), we apply layer normalization to the projected queries and keys of both the Perceiver and cross-attention blocks, which improved training stability in our early experiments. We use the RMSNorm implementation for trainable Layer Norms.
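As a rough illustration of query/key normalization, the sketch below applies RMSNorm to one query and a set of keys before computing softmax attention weights. It uses plain Python lists rather than the training framework, and the function names, the `eps` value and the single-head setup are assumptions, not the model's actual code.

```python
import math
from typing import List

Vector = List[float]


def rms_norm(x: Vector, gain: Vector, eps: float = 1e-6) -> Vector:
    """RMSNorm: divide x by its root mean square, then apply a learned per-dimension gain."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [g * v / rms for v, g in zip(x, gain)]


def attention_weights(q: Vector, keys: List[Vector], q_gain: Vector, k_gain: Vector) -> List[float]:
    """Softmax attention weights of one query over several keys, with normalized q and k."""
    q = rms_norm(q, q_gain)
    keys = [rms_norm(k, k_gain) for k in keys]
    scale = 1.0 / math.sqrt(len(q))
    logits = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
    # Numerically stable softmax.
    top = max(logits)
    exps = [math.exp(l - top) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]
```

Because the queries and keys are rescaled before the dot product, the attention logits stay bounded even if the projections grow large, which is the usual motivation for this kind of normalization.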

The training objective is the standard next token prediction.
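Written out, one common form of this objective is the negative log-likelihood below. The notation is introduced here for illustration: $y_t$ is the $t$-th text token, $x_{\le t}$ the images that appear before it in the sequence, and $\theta$ the trainable parameters. Whether the sum is averaged over tokens, and exactly which positions contribute to the loss, are not specified in this card.

```latex
\mathcal{L}(\theta) = -\sum_{t=1}^{T} \log p_\theta\left(y_t \mid y_{<t},\, x_{\le t}\right)
```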

We use the following hyperparameters and training parameters:

Parameters		ATUM	ATUM-9b
Perceiver Resampler	Number of Layers	6	6
	Number of Latents	64	64
	Number of Heads	16	16
	Resampler Head Dimension	96	96
Model	Language Model Backbone	Llama-65b	Llama-7b
	Vision Model Backbone	laion/CLIP-ViT-H-14-laion2B-s32B-b79K	laion/CLIP-ViT-H-14-laion2B-s32B-b79K
	Cross-Layer Interval	4	4
Training	Sequence Length	1024	1024
	Effective Batch Size (# of tokens)	3.67M	1.31M
	Max Training Steps	200K	200K
	Weight Decay	0.1	0.1
	Optimizer	Adam(0.9, 0.999)	Adam(0.9, 0.999)
	Gradient Clipping	1.0	1.0
	Z-loss weight	1e-3	1e-3
Learning Rate	Initial Max	5e-5	1e-5
	Initial Final	3e-5	6e-6
	Decay Schedule	Linear	Linear
	Linear warmup Steps	2K	2K
Large-scale Optimization	Gradient Checkpointing	True	True
	Precision	Mixed-precision bf16	Mixed-precision bf16
	ZeRO Optimization	Stage 3	Stage 3
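
To make the learning-rate and batch-size rows concrete, here is a minimal sketch. It assumes that "Initial Max" is the peak learning rate reached after warmup, that "Initial Final" is the value reached at the last training step, that warmup starts from zero, and that the effective batch size divides evenly into sequences of 1024 tokens. The card states none of this explicitly, and both helper functions are hypothetical.

```python
def learning_rate(step: int, max_lr: float, final_lr: float,
                  warmup_steps: int = 2_000, max_steps: int = 200_000) -> float:
    """Linear warmup from 0 to max_lr, then linear decay to final_lr at max_steps."""
    if step < warmup_steps:
        return max_lr * step / warmup_steps
    progress = min(1.0, (step - warmup_steps) / (max_steps - warmup_steps))
    return max_lr + (final_lr - max_lr) * progress


def sequences_per_batch(tokens_per_batch: float, sequence_length: int = 1024) -> int:
    """Approximate number of sequences per optimizer step for a token budget."""
    return round(tokens_per_batch / sequence_length)


# Values from the table above: the large model, then the 9b model.
large_lr_at_checkpoint = learning_rate(37_500, max_lr=5e-5, final_lr=3e-5)
large_sequences = sequences_per_batch(3.67e6)
small_sequences = sequences_per_batch(1.31e6)
```

Under these assumptions the token budgets correspond to roughly 3,584 and 1,279 sequences per step. The table values are rounded, so these counts are approximate.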

Evaluation

We closely follow the evaluation protocol of Flamingo and evaluate ATUM on a suite of downstream image + text benchmarks ranging from visual question answering to image captioning.

We compare our model to the original Flamingo along with OpenFlamingo, another open-source reproduction.

We perform checkpoint selection based on validation sets of TODO, and select the checkpoint at step 65,000 for ATUM-9B and at step 37,500 for ATUM. The models are evaluated with in-context few-shot learning, where the priming instances are selected from a support set to be similar (i.e., close in a vector space) to the queried instance. We do not use any form of ensembling.
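
The similarity-based selection of priming instances can be sketched as a nearest-neighbor lookup. The card does not say which embedding model or distance is used, so the cosine similarity and the `select_shots` helper below are assumptions for illustration only.

```python
import math
from typing import List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def select_shots(query: Sequence[float],
                 support: List[Tuple[str, Sequence[float]]],
                 k: int) -> List[str]:
    """Return the ids of the k support examples whose embeddings are closest to the query."""
    ranked = sorted(support, key=lambda item: cosine(query, item[1]), reverse=True)
    return [example_id for example_id, _ in ranked[:k]]
```

The selected examples would then be placed before the query in the prompt as in-context shots.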

TODO: beautiful plots of shots scaling laws.

TODO: detail of the numbers in a table.

Technical Specifications

Hardware

The training was performed on an AWS SageMaker cluster with 64 nodes of 8x80GB A100 GPUs (512 GPUs total). The cluster uses the current EFA network which provides about 340GBps throughput.

As the network is quite slow for the needs of DeepSpeed ZeRO-3, we were only able to clock ~90 TFLOPs.
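
A back-of-the-envelope reading of these numbers is sketched below. It assumes the ~90 TFLOPs figure is per GPU, which the text does not state, and uses NVIDIA's published dense bf16 peak of 312 TFLOPs for the A100, which does not appear in this card.

```python
NODES = 64
GPUS_PER_NODE = 8
ACHIEVED_TFLOPS_PER_GPU = 90.0  # "~90 TFLOPs" from the text, assumed to be per GPU
A100_PEAK_BF16_TFLOPS = 312.0   # NVIDIA's published dense bf16 peak (not from this card)

total_gpus = NODES * GPUS_PER_NODE
# Aggregate achieved throughput across the cluster, in PFLOPs.
cluster_pflops = total_gpus * ACHIEVED_TFLOPS_PER_GPU / 1000
# Fraction of the theoretical per-GPU peak that was actually reached.
utilization = ACHIEVED_TFLOPS_PER_GPU / A100_PEAK_BF16_TFLOPS

print(f"{total_gpus} GPUs, {cluster_pflops:.2f} PFLOPs, {utilization:.1%} of peak")
```

Under these assumptions, the cluster sustains about 46 PFLOPs in aggregate, a bit under 30% of the hardware peak.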

Software

The training software is built on top of Hugging Face Transformers and Accelerate, with DeepSpeed ZeRO-3 for training and WebDataset for data loading.

Bias, Risks, and Limitations

Significant research has explored bias and fairness issues with language models (see, e.g., Sheng et al. (2021) and Bender et al. (2021)). As a derivative of such a language model, ATUM can produce texts that include disturbing and harmful stereotypes across protected classes; identity characteristics; and sensitive, social, and occupational groups. Moreover, ATUM can produce factually incorrect texts, and should not be relied on to produce factually accurate information.

Here are a few examples of outputs that could be categorized as factually incorrect, biased, or offensive: TODO: give 4/5 representative examples

To measure ATUM's ability to recognize sociological (TODO: find a better adjective) attributes, we evaluate the model on FairFace... TODO: include FairFace numbers

Environmental Impact

Carbon emissions can be estimated using the Machine Learning Impact calculator presented in Lacoste et al. (2019).

Hardware Type: 64 nodes of 8x 80GB A100 GPUs, EFA network
Hours used: ~672 node hours
Cloud Provider: AWS SageMaker
Carbon Emitted: unknown

Citation

BibTeX:

More information needed

APA:

More information needed

Model Card Authors [optional]

Victor, Stas, XXX

Model Card Contact

Please open a discussion on the Community tab!
