arxiv:2305.07017

An Inverse Scaling Law for CLIP Training

Published on May 11, 2023
· Submitted by akhaliq on May 12, 2023

Abstract

CLIP, the first foundation model that connects images and text, has enabled many recent breakthroughs in computer vision. However, its associated training cost is prohibitively high, imposing a significant barrier to its widespread exploration. In this paper, we present a surprising finding that there exists an inverse scaling law for CLIP training, whereby the larger the image/text encoders used, the shorter the sequence length of image/text tokens that can be applied in training. Moreover, we showcase that the strategy for reducing image/text token length plays a crucial role in determining the quality of this scaling law. As a result of this finding, we are able to successfully train CLIP even by using academic resources. For example, on an A100 eight-GPU server, our CLIP models achieve zero-shot top-1 ImageNet accuracies of 63.2% in ~2 days, 67.8% in ~3 days, and 69.3% in ~4 days. By reducing the computation barrier associated with CLIP, we hope to inspire more research in this field, particularly from academics. Our code is available at https://github.com/UCSC-VLAA/CLIPA.

Community

This paper finds an inverse scaling law that makes it possible to train CLIP (Contrastive Language-Image Pre-training) with significantly fewer resources. The law: the larger the image/text encoders, the shorter the token sequences that can be used during training. Image tokens are reduced through different masking strategies or image resizing; text tokens are reduced through truncation, masking, or syntax-based selection (keeping the significant nouns and phrases in long sentences). The law is derived empirically using ViT-S/16, ViT-B/16, and ViT-L/16. The result is CLIP training on academic resources (8 A100 GPUs in one DGX machine). From UC Santa Cruz.
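To make the token-reduction idea concrete, here is a minimal sketch of three of the strategies named above: image resizing, random patch masking, and text truncation. It is not taken from the CLIPA code. The function names, resolutions, keep ratio, and text length are hypothetical values chosen for illustration, not the paper's settings.

```python
import random
from typing import List


def num_patch_tokens(image_size: int, patch_size: int = 16) -> int:
    """Patch tokens a ViT sees for a square image (class token excluded)."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be a multiple of patch_size")
    return (image_size // patch_size) ** 2


def random_mask_keep_indices(num_tokens: int, keep_ratio: float,
                             rng: random.Random) -> List[int]:
    """Choose a random subset of patch positions to keep, in sorted order."""
    num_keep = max(1, int(num_tokens * keep_ratio))
    return sorted(rng.sample(range(num_tokens), num_keep))


def truncate_text(token_ids: List[int], max_len: int) -> List[int]:
    """Keep only the first max_len text tokens."""
    return token_ids[:max_len]


# Resizing: a smaller input image directly yields fewer patch tokens.
full_tokens = num_patch_tokens(224)   # 14 x 14 patches
small_tokens = num_patch_tokens(112)  # 7 x 7 patches

# Random masking: keep a fraction of the full-resolution patches instead.
rng = random.Random(0)                # fixed seed, illustrative only
kept = random_mask_keep_indices(full_tokens, keep_ratio=0.25, rng=rng)

# Truncation: cut a (hypothetical) 77-token caption down to 16 tokens.
caption = list(range(77))
short_caption = truncate_text(caption, max_len=16)

print(full_tokens, small_tokens, len(kept), len(short_caption))
```

With a patch size of 16, halving the image side from 224 to 112 cuts the patch count from 196 to 49, the same budget as keeping 25% of the full-resolution patches by masking. According to the abstract, which strategy is used to reach a given token budget affects how well the scaling law holds.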

