arxiv:2306.16527

OBELICS: An Open Web-Scale Filtered Dataset of Interleaved Image-Text Documents

Published on Jun 21, 2023
Submitted by akhaliq on Jun 30, 2023
#1 Paper of the day

Abstract

Large multimodal models trained on natural documents, which interleave images and text, outperform models trained on image-text pairs on various multimodal benchmarks. However, the datasets used to train these models have not been released, and the collection process has not been fully specified. We introduce the OBELICS dataset, an open web-scale filtered dataset of interleaved image-text documents comprising 141 million web pages extracted from Common Crawl, 353 million associated images, and 115 billion text tokens. We describe the dataset creation process, present comprehensive filtering rules, and provide an analysis of the dataset's content. To show the viability of OBELICS, we train vision and language models of 9 and 80 billion parameters named IDEFICS, and obtain competitive performance on different multimodal benchmarks. We release our dataset, models and code.
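The headline figures in the abstract imply some rough per-document averages. The sketch below is only back-of-the-envelope arithmetic on those three rounded numbers; the constant and function names are illustrative, and the results are derived here rather than statistics reported by the authors.

```python
# Rounded figures quoted in the abstract (not exact counts).
NUM_DOCUMENTS = 141_000_000        # web pages extracted from Common Crawl
NUM_IMAGES = 353_000_000           # associated images
NUM_TEXT_TOKENS = 115_000_000_000  # text tokens


def per_document_averages(num_docs: int, num_images: int, num_tokens: int) -> tuple[float, float]:
    """Return (mean images per document, mean text tokens per document)."""
    return num_images / num_docs, num_tokens / num_docs


images_per_doc, tokens_per_doc = per_document_averages(
    NUM_DOCUMENTS, NUM_IMAGES, NUM_TEXT_TOKENS
)

print(f"{images_per_doc:.1f} images per document")      # prints "2.5 images per document"
print(f"{tokens_per_doc:.0f} text tokens per document")  # prints "816 text tokens per document"
```

These are means over the whole corpus, so they say nothing about how images and tokens are distributed across individual documents.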

Community

congrats on the public release of the model and the dataset (both linked from this Paper page) today!!

Thanks @julien-c ! Is it possible to modify the title and the abstract of this paper, now that we have updated the name (from OBELISC to OBELICS) and the arXiv article?

Discover OBELICS: The Ultimate Open-Source Multimodal Dataset

Links 🔗:

👉 Subscribe: https://www.youtube.com/@Arxflix
👉 Twitter: https://x.com/arxflix
👉 LMNT (Partner): https://lmnt.com/

By Arxflix

Models citing this paper 14

Datasets citing this paper 1

Spaces citing this paper 142

Collections including this paper 4