arxiv:2406.08418

OmniCorpus: A Unified Multimodal Corpus of 10 Billion-Level Images Interleaved with Text

Published on Jun 12 · Submitted by Weiyun1025 on Jun 17

Authors: Qingyun Li, Weiyun Wang, Jiashuo Yu

Abstract

Image-text interleaved data, consisting of multiple images and texts arranged in a natural document format, aligns with the presentation paradigm of internet data and closely resembles human reading habits. Recent studies have shown that such data aids multimodal in-context learning and maintains the capabilities of large language models during multimodal fine-tuning. However, the limited scale and diversity of current image-text interleaved data restrict the development of multimodal large language models. In this paper, we introduce OmniCorpus, a 10 billion-scale image-text interleaved dataset. Using an efficient data engine, we filter and extract large-scale high-quality documents, which contain 8.6 billion images and 1,696 billion text tokens. Compared to counterparts (e.g., MMC4, OBELICS), our dataset 1) is 15 times larger in scale while maintaining good data quality; 2) features more diverse sources, including both English and non-English websites as well as video-centric websites; 3) is more flexible, easily degradable from an image-text interleaved format to a pure text corpus and image-text pairs. Through comprehensive analysis and experiments, we validate the quality, usability, and effectiveness of the proposed dataset. We hope this can provide a solid data foundation for future multimodal model research. Code and data are released at https://github.com/OpenGVLab/OmniCorpus.
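
To make the "degradable" property concrete, here is a minimal sketch of how an interleaved document might be reduced to a pure text corpus entry and to image-text pairs. The `Segment` schema, the function names, and the pairing heuristic (each image is paired with the next text segment) are assumptions for illustration only; they are not taken from the OmniCorpus release and may differ from its actual data format.

```python
from typing import Optional, TypedDict


class Segment(TypedDict):
    # Hypothetical schema: each segment carries either text or an image URL.
    text: Optional[str]
    image: Optional[str]


def to_pure_text(doc: list[Segment]) -> str:
    """Drop image segments and join the remaining text in document order."""
    return "\n".join(seg["text"] for seg in doc if seg["text"] is not None)


def to_image_text_pairs(doc: list[Segment]) -> list[tuple[str, str]]:
    """Pair each image with the next text segment (an assumed heuristic)."""
    pairs: list[tuple[str, str]] = []
    pending: list[str] = []  # images still waiting for a following text segment
    for seg in doc:
        if seg["image"] is not None:
            pending.append(seg["image"])
        elif seg["text"] is not None and pending:
            pairs.extend((img, seg["text"]) for img in pending)
            pending = []
    return pairs


# Toy interleaved document with placeholder content.
doc: list[Segment] = [
    {"text": "A short introduction.", "image": None},
    {"text": None, "image": "https://example.com/cat.jpg"},
    {"text": "A cat sitting on a sofa.", "image": None},
]

pure_text = to_pure_text(doc)
pairs = to_image_text_pairs(doc)
```

In this sketch, images with no following text are simply dropped from the pairs; a real pipeline would likely use a more careful alignment between images and surrounding text.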

Community

Weiyun1025

Paper author Paper submitter Jun 17

The OmniCorpus dataset is the largest multimodal dataset to date. It pushes the boundaries of scale and diversity by encompassing 8.6 billion images interleaved with 1,696 billion text tokens from diverse sources, significantly surpassing previous datasets.

emanuelevivoli

Jun 18

Thank you for the immense work! I'm curious whether you plan to release the pipeline code too (did you use datatrove?), like the FineWeb paper did :)

natarajg2712

5 days ago

Extract and identify the elements present in the academic reference cited in journals from the following text. If an element cannot be identified, place the corresponding text at the end and label it as 'unstyled'. Provide the extracted elements in the format: [Element] - [Description/Context]. Preserve the original content and formatting as much as possible.

Tarasevich, M. R.; Sadkowski; Yeager, E. in Comprehensive Treatise of Electrochemistry, Vol. 7, (Eds. B. E., Bockris, J. O., Yeager, E., Khan, S. U. M., White, R. E); Plenum Press: New York, 1983; pp 301-398.