AutoCLIP: Auto-tuning Zero-Shot Classifiers for Vision-Language Models
Abstract
Classifiers built upon vision-language models such as CLIP have shown remarkable zero-shot performance across a broad range of image classification tasks. Prior work has studied different ways of automatically creating descriptor sets for every class based on prompt templates, ranging from manually engineered templates, through templates obtained from a large language model, to templates built from random words and characters. In contrast, the way zero-shot classifiers are derived from the encoded class descriptors has remained nearly unchanged: an image is assigned to the class that maximizes the cosine similarity between its averaged encoded class descriptors and the encoded image. However, weighting all class descriptors equally can be suboptimal when certain descriptors match the visual cues in a given image better than others. In this work, we propose AutoCLIP, a method for auto-tuning zero-shot classifiers. AutoCLIP assigns per-image weights to each prompt template, which are derived from statistics of class descriptor-image similarities at inference time. AutoCLIP is fully unsupervised, has very low overhead, and can be easily implemented in a few lines of code. We show that for a broad range of vision-language models, datasets, and prompt templates, AutoCLIP outperforms baselines consistently and by up to 3 percentage points in accuracy.
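To make the contrast between equal and per-image template weighting concrete, the following is a minimal sketch and not the paper's exact update rule. It assumes a precomputed table `sims[k][c]` holding the cosine similarity between one image and the descriptor built from template `k` for class `c`. The baseline here averages per-descriptor similarities, which only approximates averaging the encoded descriptors first. The weighting statistic (a softmax over a per-template log-sum-exp of scaled similarities) and the `temperature` parameter are illustrative assumptions; all function names are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two embedding vectors given as lists."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def logsumexp(values):
    """Numerically stable log(sum(exp(v))) over a list of floats."""
    peak = max(values)
    return peak + math.log(sum(math.exp(v - peak) for v in values))


def class_scores(sims, weights):
    """Weighted sum over templates of the similarities for each class."""
    num_classes = len(sims[0])
    return [
        sum(w * row[c] for w, row in zip(weights, sims))
        for c in range(num_classes)
    ]


def classify_uniform(sims):
    """Baseline: every template gets the same weight 1/K."""
    uniform = [1.0 / len(sims)] * len(sims)
    scores = class_scores(sims, uniform)
    return max(range(len(scores)), key=scores.__getitem__)


def template_weights(sims, temperature=0.1):
    """Per-image template weights from similarity statistics.

    Assumption (not taken from the paper): a template that produces a
    high similarity for some class is treated as a better match for this
    image, measured by a log-sum-exp over its scaled similarities.
    """
    stats = [logsumexp([s / temperature for s in row]) for row in sims]
    return softmax(stats)


def classify_weighted(sims, temperature=0.1):
    """Classify with per-image template weights instead of uniform ones."""
    weights = template_weights(sims, temperature)
    scores = class_scores(sims, weights)
    return max(range(len(scores)), key=scores.__getitem__)


# Toy similarities for one image: 3 templates (rows) x 2 classes (columns).
toy_sims = [
    [0.90, 0.10],  # template 0 matches class 0 strongly
    [0.20, 0.65],  # templates 1 and 2 weakly prefer class 1
    [0.20, 0.65],
]
```

In this toy table, equal weighting lets the two weakly matching templates outvote the single strongly matching one, while the per-image weights shift most of the mass to template 0. In practice, `sims` would be filled by calling `cosine` on the image embedding and each encoded descriptor produced by a vision-language model.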
Community
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- Cross-Modal Retrieval Meets Inference: Improving Zero-Shot Classification with Cross-Modal Retrieval (2023)
- AnoVL: Adapting Vision-Language Models for Unified Zero-shot Anomaly Localization (2023)
- Zero-Shot Visual Classification with Guided Cropping (2023)
- Distribution-Aware Prompt Tuning for Vision-Language Models (2023)
- Knowledge-Aware Prompt Tuning for Generalizable Vision-Language Models (2023)