Image-Text to Text
Image-text-to-text models take in an image and a text prompt and output text. These models are also called vision-language models, or VLMs. Unlike image-to-text models, they accept an additional text input, so they are not limited to specific use cases such as image captioning, and they may also be trained to accept a conversation as input.
For more details about the image-text-to-text task, check out its dedicated page! You will find examples and related materials.
Recommended models
- meta-llama/Llama-3.2-11B-Vision-Instruct: Powerful vision language model with great visual understanding and reasoning capabilities.
- Qwen/Qwen2-VL-7B-Instruct: Strong image-text-to-text model.
Explore all available models and find the one that suits you best here.
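If you prefer to query the Hub programmatically, the sketch below lists a few popular models for this task with huggingface_hub (it assumes your installed version of list_models supports the task and sort parameters):

from huggingface_hub import list_models

# List a handful of popular models tagged for the image-text-to-text task.
for model in list_models(task="image-text-to-text", sort="downloads", limit=5):
    print(model.id)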
Using the API
Using huggingface_hub:
from huggingface_hub import InferenceClient

client = InferenceClient(api_key="hf_***")

# A single user turn combining an image (by URL) and a text prompt.
# The image URL and prompt below are placeholders; substitute your own.
messages = [
    {"role": "user", "content": [
        {"type": "image_url", "image_url": {"url": "https://example.com/image.jpg"}},
        {"type": "text", "text": "Describe this image in one sentence."},
    ]}
]

stream = client.chat.completions.create(
    model="meta-llama/Llama-3.2-11B-Vision-Instruct",
    messages=messages,
    max_tokens=500,
    stream=True
)

for chunk in stream:
    print(chunk.choices[0].delta.content, end="")
Using openai:
from openai import OpenAI

client = OpenAI(
    base_url="https://api-inference.huggingface.co/v1/",
    api_key="hf_***"
)

# Same conversational payload as above: one user turn with an image URL
# and a text prompt (both placeholders).
messages = [
    {"role": "user", "content": [
        {"type": "image_url", "image_url": {"url": "https://example.com/image.jpg"}},
        {"type": "text", "text": "Describe this image in one sentence."},
    ]}
]

stream = client.chat.completions.create(
    model="meta-llama/Llama-3.2-11B-Vision-Instruct",
    messages=messages,
    max_tokens=500,
    stream=True
)

for chunk in stream:
    print(chunk.choices[0].delta.content, end="")
To use the Python client, see huggingface_hub’s package reference.
API specification
For the API specification of conversational image-text-to-text models, please refer to the Chat Completion API documentation.
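If you want to call the endpoint without a client library, the hedged sketch below sends the same request over plain HTTP with requests. The URL is derived from the base_url used with the OpenAI client above, and the image URL and prompt are placeholders; treat the Chat Completion API documentation as the authoritative reference for the payload and response schema.

import requests

API_URL = "https://api-inference.huggingface.co/v1/chat/completions"  # assumed OpenAI-compatible route
headers = {"Authorization": "Bearer hf_***"}

payload = {
    "model": "meta-llama/Llama-3.2-11B-Vision-Instruct",
    "messages": [
        {"role": "user", "content": [
            {"type": "image_url", "image_url": {"url": "https://example.com/image.jpg"}},
            {"type": "text", "text": "Describe this image in one sentence."},
        ]}
    ],
    "max_tokens": 500,
}

# Non-streaming request: the full completion comes back in a single JSON response.
response = requests.post(API_URL, headers=headers, json=payload)
response.raise_for_status()
print(response.json()["choices"][0]["message"]["content"])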