Image-Text to Text
Image-text-to-text models take in an image and a text prompt and output text. These models are also called vision-language models, or VLMs. Unlike image-to-text models, they accept an additional text input, so they are not limited to specific use cases such as image captioning, and they may also be trained to accept a conversation as input.
For more details about the image-text-to-text task, check out its dedicated page! You will find examples and related materials.
Recommended models
- meta-llama/Llama-3.2-11B-Vision-Instruct: Powerful vision language model with great visual understanding and reasoning capabilities.
- Qwen/Qwen2-VL-7B-Instruct: Strong image-text-to-text model.
Explore all available models and find the one that suits you best here.
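If you prefer to query the Hub programmatically, the sketch below lists a few popular models for this task with huggingface_hub (it assumes your installed version of list_models supports the task and sort parameters):

from huggingface_hub import list_models

# List a handful of popular models tagged for the image-text-to-text task.
for model in list_models(task="image-text-to-text", sort="downloads", limit=5):
    print(model.id)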
Using the API
Using huggingface_hub:
from huggingface_hub import InferenceClient

client = InferenceClient(api_key="hf_***")

# A single user turn combining an image (by URL) and a text prompt.
# The image URL and prompt below are placeholders; substitute your own.
messages = [
    {"role": "user", "content": [
        {"type": "image_url", "image_url": {"url": "https://example.com/image.jpg"}},
        {"type": "text", "text": "Describe this image in one sentence."},
    ]}
]

stream = client.chat.completions.create(
    model="meta-llama/Llama-3.2-11B-Vision-Instruct",
    messages=messages,
    max_tokens=500,
    stream=True
)

for chunk in stream:
    print(chunk.choices[0].delta.content, end="")
Using openai:
from openai import OpenAI

client = OpenAI(
    base_url="https://api-inference.huggingface.co/v1/",
    api_key="hf_***"
)

# Same conversational payload as above: one user turn with an image URL
# and a text prompt (both placeholders).
messages = [
    {"role": "user", "content": [
        {"type": "image_url", "image_url": {"url": "https://example.com/image.jpg"}},
        {"type": "text", "text": "Describe this image in one sentence."},
    ]}
]

stream = client.chat.completions.create(
    model="meta-llama/Llama-3.2-11B-Vision-Instruct",
    messages=messages,
    max_tokens=500,
    stream=True
)

for chunk in stream:
    print(chunk.choices[0].delta.content, end="")
To use the Python client, see huggingface_hub’s package reference.
API specification
For the API specification of conversational image-text-to-text models, please refer to the Chat Completion API documentation.
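If you want to call the endpoint without a client library, the hedged sketch below sends the same request over plain HTTP with requests. The URL is derived from the base_url used with the OpenAI client above, and the image URL and prompt are placeholders; treat the Chat Completion API documentation as the authoritative reference for the payload and response schema.

import requests

API_URL = "https://api-inference.huggingface.co/v1/chat/completions"  # assumed OpenAI-compatible route
headers = {"Authorization": "Bearer hf_***"}

payload = {
    "model": "meta-llama/Llama-3.2-11B-Vision-Instruct",
    "messages": [
        {"role": "user", "content": [
            {"type": "image_url", "image_url": {"url": "https://example.com/image.jpg"}},
            {"type": "text", "text": "Describe this image in one sentence."},
        ]}
    ],
    "max_tokens": 500,
}

# Non-streaming request: the full completion comes back in a single JSON response.
response = requests.post(API_URL, headers=headers, json=payload)
response.raise_for_status()
print(response.json()["choices"][0]["message"]["content"])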