Abstract
CLIP, the first foundation model that connects images and text, has enabled many recent breakthroughs in computer vision. However, its training cost is prohibitively high, which poses a significant barrier to its widespread exploration. In this paper, we present a surprising finding: there exists an inverse scaling law for CLIP training, whereby the larger the image/text encoders used, the shorter the sequence length of image/text tokens that can be applied in training. Moreover, we show that the strategy for reducing image/text token length plays a crucial role in determining the quality of this scaling law. As a result of this finding, we are able to train CLIP successfully using only academic resources. For example, on an eight-GPU A100 server, our CLIP models achieve zero-shot top-1 ImageNet accuracies of 63.2% in ~2 days, 67.8% in ~3 days, and 69.3% in ~4 days. By reducing the computation barrier associated with CLIP, we hope to inspire more research in this field, particularly from academics. Our code is available at https://github.com/UCSC-VLAA/CLIPA.
Community
The paper finds an inverse scaling law that helps train CLIP (Contrastive Language-Image Pre-training) with significantly fewer resources. Inverse scaling law: the larger the encoders, the shorter the sequences (fewer tokens) that can be used in training. Image tokens are reduced by different masking strategies or by image resizing. Text tokens are reduced by truncation, masking, or syntax-based selection (keeping the significant nouns and phrases in long sentences). The law is derived empirically using ViT-S, B, and L/16. CLIP is trained with academic resources (8 A100 GPUs in one DGX machine). From UC Santa Cruz.
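To make the token-reduction ideas above concrete, here is a minimal sketch in plain Python. It is not the authors' implementation: the function names (`num_patch_tokens`, `random_mask`, `truncate`) are hypothetical, the 224 and 112 pixel resolutions and the 25% keep ratio are illustrative assumptions, and syntax-based text selection is left out because it would need a part-of-speech tagger.

```python
import random


def num_patch_tokens(image_size: int, patch_size: int = 16) -> int:
    """Patch tokens a ViT sees for a square image (class token not counted)."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be divisible by patch_size")
    return (image_size // patch_size) ** 2


def random_mask(tokens: list, keep_ratio: float, rng: random.Random) -> list:
    """Keep a random subset of tokens, preserving their original order."""
    n_keep = max(1, int(len(tokens) * keep_ratio))
    kept = sorted(rng.sample(range(len(tokens)), n_keep))
    return [tokens[i] for i in kept]


def truncate(tokens: list, max_len: int) -> list:
    """Keep only the first max_len tokens."""
    return tokens[:max_len]


# Resizing: halving the side length cuts the patch-token count by 4x.
full = num_patch_tokens(224)     # 14 x 14 patches
resized = num_patch_tokens(112)  # 7 x 7 patches

# Masking: keep an (assumed) 25% of the patch indices of the full image.
rng = random.Random(0)
masked = random_mask(list(range(full)), keep_ratio=0.25, rng=rng)

# Truncation: keep the first 8 of 32 placeholder text token ids.
short_text = truncate(list(range(32)), max_len=8)

print(full, resized, len(masked), len(short_text))
```

In this toy setting, resizing and random masking both reach the same token budget, but they discard different information. That difference is the kind of thing the abstract refers to when it says the reduction strategy affects the quality of the scaling law.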