Zero-Shot Image Classification
OpenCLIP
clip
Edit model card

Model Summary

NLLB-CLIP-SigLIP is a model that combines a text encoder from the NLLB model and an image encoder from the SigLIP model. This allows us to extend the model capabilities to 201 languages of the Flores-200. NLLB-CLIP sets state-of-the-art on the Crossmodal-3600 dataset by performing very well on low-resource languages. You can find more details about the model in the paper.

This version performs much better than the standard version. You can see the results here and here.

NB: There is even better version of this model available!

How to use

Open In Colab

This model is integrated into OpenCLIP so that you can use it as any other model:

!pip install -U open_clip_torch
from open_clip import create_model_from_pretrained, get_tokenizer
from PIL import Image
import requests
import torch

model, transform = create_model_from_pretrained("nllb-clip-base-siglip", "v1", device="cuda")

tokenizer = get_tokenizer("nllb-clip-base-siglip")

class_options = ["бабочка", "butterfly", "kat"]
class_langs = ["rus_Cyrl", "eng_Latn", "afr_Latn"]

text_inputs = []
for i in range(len(class_options)):
    tokenizer.set_language(class_langs[i])
    text_inputs.append(tokenizer(class_options[i]))
text_inputs = torch.stack(text_inputs).squeeze(1).to("cuda")

image_path = "https://huggingface.co/spaces/jjourney1125/swin2sr/resolve/main/samples/butterfly.jpg"
image = Image.open(requests.get(image_path, stream=True).raw)

image_inputs = transform(image).unsqueeze(0).to("cuda")

with torch.inference_mode():
    logits_per_image, logits_per_text = model.get_logits(image_inputs, text_inputs)

print(logits_per_image.softmax(dim=-1))

Acknowledgements

I thank ML Collective for providing Google Cloud compute resources to train the OpenCLIP-compatible version of NLLB-CLIP.

Downloads last month
943
Inference Examples
This model does not have enough activity to be deployed to Inference API (serverless) yet. Increase its social visibility and check back later, or deploy to Inference Endpoints (dedicated) instead.

Dataset used to train visheratin/nllb-clip-base-siglip