Parameters
Additional Options
Caching
There is a cache layer on the Inference API to speed up requests when the inputs are exactly the same. Many models, such as classifiers and embedding models, are deterministic, meaning they return the same results for the same inputs, so the cached results can be used as is. However, if you use a nondeterministic model, you can disable the cache mechanism so that every request triggers a real new query.
To do this, you can add x-use-cache: false to the request headers. For example:
import requests
API_URL = "https://api-inference.huggingface.co/models/MODEL_ID"
headers = {
    "Authorization": "Bearer hf_***",
    "Content-Type": "application/json",
    "x-use-cache": "false"  # disable the cache so a fresh result is computed
}
data = {
    "inputs": "Can you please let us know more details about your "
}
response = requests.post(API_URL, headers=headers, json=data)
print(response.json())
Wait for the model
When a model is warm, it is ready to be used and you will get a response relatively quickly. However, some models are cold and need to be loaded before they can be used; in that case, you will receive a 503 error. Rather than sending many requests until the model is loaded, you can wait for it by adding x-wait-for-model: true to the request headers. We suggest using this flag only when you are sure that the model is cold: first try the request without it, and only if you get a 503 error, retry with the flag set (a sketch of this pattern follows the example below).
import requests
API_URL = "https://api-inference.huggingface.co/models/MODEL_ID"
headers = {
    "Authorization": "Bearer hf_***",
    "Content-Type": "application/json",
    "x-wait-for-model": "true"  # wait for the model to load instead of returning 503
}
data = {
    "inputs": "Can you please let us know more details about your "
}
response = requests.post(API_URL, headers=headers, json=data)
print(response.json())
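If you are not sure whether the model is warm, a minimal sketch of the try-then-retry pattern described above could look like the following (it reuses the same placeholder MODEL_ID and token as the examples above):

import requests

API_URL = "https://api-inference.huggingface.co/models/MODEL_ID"
headers = {
    "Authorization": "Bearer hf_***",
    "Content-Type": "application/json",
}
data = {
    "inputs": "Can you please let us know more details about your "
}

# First attempt without the flag: returns quickly if the model is already warm.
response = requests.post(API_URL, headers=headers, json=data)

if response.status_code == 503:
    # The model is cold: retry once and wait for it to finish loading.
    headers["x-wait-for-model"] = "true"
    response = requests.post(API_URL, headers=headers, json=data)

print(response.json())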