Inference Endpoints
Inference Endpoints provides a secure production solution to easily deploy models on dedicated, autoscaling infrastructure managed by Hugging Face. An Inference Endpoint is built from a model from the Hub. This page is a reference for huggingface_hub's integration with Inference Endpoints. For more information about the Inference Endpoints product, check out its official documentation.
Check out the related guide to learn how to use huggingface_hub to manage your Inference Endpoints programmatically.
Inference Endpoints can be fully managed via API. The endpoints are documented with Swagger. The InferenceEndpoint class is a simple wrapper built on top of this API.
Methods
A subset of the Inference Endpoint features is implemented in HfApi:
- get_inference_endpoint() and list_inference_endpoints() to get information about your Inference Endpoints
- create_inference_endpoint(), update_inference_endpoint() and delete_inference_endpoint() to deploy and manage Inference Endpoints
- pause_inference_endpoint() and resume_inference_endpoint() to pause and resume an Inference Endpoint
- scale_to_zero_inference_endpoint() to manually scale an Endpoint to 0 replicas
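As a rough illustration of the read-only helpers in this list, the sketch below builds a name-to-status summary from list_inference_endpoints(). The function name summarize_endpoints is hypothetical, and the api argument is assumed to be an HfApi instance (or any object exposing the same method). Running it against the real API requires network access and a valid token.

```python
from typing import Any, Dict, Optional


def summarize_endpoints(api, namespace: Optional[str] = None) -> Dict[str, Any]:
    """Map each Inference Endpoint name to its current status.

    `api` is assumed to be a `huggingface_hub.HfApi` instance. This helper is
    illustrative only and is not part of the library.
    """
    endpoints = api.list_inference_endpoints(namespace=namespace)
    # Each item is an InferenceEndpoint exposing `name` and `status`.
    return {endpoint.name: endpoint.status for endpoint in endpoints}


# Possible usage (needs huggingface_hub installed and a saved token):
#   from huggingface_hub import HfApi
#   print(summarize_endpoints(HfApi()))
```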
InferenceEndpoint
The main dataclass is InferenceEndpoint. It contains information about a deployed InferenceEndpoint, including its configuration and current state. Once deployed, you can run inference on the Endpoint using the InferenceEndpoint.client and InferenceEndpoint.async_client properties, which return an InferenceClient and an AsyncInferenceClient object, respectively.
class huggingface_hub.InferenceEndpoint
( namespace: str, raw: typing.Dict, _token: typing.Union[str, bool, NoneType], _api: HfApi )
Parameters
- name (str): The unique name of the Inference Endpoint.
- namespace (str): The namespace where the Inference Endpoint is located.
- repository (str): The name of the model repository deployed on this Inference Endpoint.
- status (InferenceEndpointStatus): The current status of the Inference Endpoint.
- url (str, optional): The URL of the Inference Endpoint, if available. Only a deployed Inference Endpoint will have a URL.
- framework (str): The machine learning framework used for the model.
- revision (str): The specific model revision deployed on the Inference Endpoint.
- task (str): The task associated with the deployed model.
- created_at (datetime.datetime): The timestamp when the Inference Endpoint was created.
- updated_at (datetime.datetime): The timestamp of the last update of the Inference Endpoint.
- type (InferenceEndpointType): The type of the Inference Endpoint (public, protected, private).
- raw (Dict): The raw dictionary data returned from the API.
- token (str or bool, optional): Authentication token for the Inference Endpoint, if set when requesting the API. Will default to the locally saved token if not provided. Pass token=False if you don't want to send your token to the server.
Contains information about a deployed Inference Endpoint.
Example:
>>> from huggingface_hub import get_inference_endpoint
>>> endpoint = get_inference_endpoint("my-text-to-image")
>>> endpoint
InferenceEndpoint(name='my-text-to-image', ...)
# Get status
>>> endpoint.status
'running'
>>> endpoint.url
'https://my-text-to-image.region.vendor.endpoints.huggingface.cloud'
# Run inference
>>> endpoint.client.text_to_image(...)
# Pause endpoint to save $$$
>>> endpoint.pause()
# ...
# Resume and wait for deployment
>>> endpoint.resume()
>>> endpoint.wait()
>>> endpoint.client.text_to_image(...)
from_raw
( raw: typing.Dict, namespace: str, token: typing.Union[str, bool, NoneType] = None, api: typing.Optional[ForwardRef('HfApi')] = None )
Initialize object from raw dictionary.
client
( ) → InferenceClient
Returns
an inference client pointing to the deployed endpoint.
Raises
- InferenceEndpointError: If the Inference Endpoint is not yet deployed.
Returns a client to make predictions on this Inference Endpoint.
async_client
( ) → AsyncInferenceClient
Returns
an asyncio-compatible inference client pointing to the deployed endpoint.
Raises
- InferenceEndpointError: If the Inference Endpoint is not yet deployed.
Returns a client to make predictions on this Inference Endpoint.
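To make the async property concrete, here is a minimal sketch that sends several prompts concurrently through async_client. It assumes the endpoint serves a text-to-image model, as in the class example earlier on this page. generate_images is a hypothetical helper name, not a library function.

```python
import asyncio
from typing import Any, List


async def generate_images(endpoint, prompts: List[str]) -> List[Any]:
    """Send one text-to-image request per prompt and await them together.

    `endpoint` is assumed to be a deployed InferenceEndpoint whose model
    supports the text-to-image task.
    """
    # Accessing the property raises InferenceEndpointError if not deployed.
    client = endpoint.async_client
    requests = [client.text_to_image(prompt) for prompt in prompts]
    # gather() preserves the order of `prompts` in its result list.
    return await asyncio.gather(*requests)


# Possible usage:
#   images = asyncio.run(generate_images(endpoint, ["a cat", "a dog"]))
```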
delete
Delete the Inference Endpoint.
This operation is not reversible. If you don't want to be charged for an Inference Endpoint, it is preferable to pause it with InferenceEndpoint.pause() or scale it to zero with InferenceEndpoint.scale_to_zero().
This is an alias for HfApi.delete_inference_endpoint().
fetch
( ) → InferenceEndpoint
Fetch latest information about the Inference Endpoint.
pause
( ) → InferenceEndpoint
Pause the Inference Endpoint.
A paused Inference Endpoint will not be charged. It can be resumed at any time using InferenceEndpoint.resume(). This is different from scaling the Inference Endpoint to zero with InferenceEndpoint.scale_to_zero(), in which case the Endpoint restarts automatically when a request is made to it.
This is an alias for HfApi.pause_inference_endpoint(). The current object is mutated in place with the latest data from the server.
resume
( running_ok: bool = True ) → InferenceEndpoint
Resume the Inference Endpoint.
This is an alias for HfApi.resume_inference_endpoint(). The current object is mutated in place with the latest data from the server.
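A common pattern is to resume a paused endpoint and block until it is deployed before sending requests. The sketch below calls resume() and then wait(), using only the parameters shown in the signatures on this page. ensure_running is a hypothetical helper, and the 300-second default timeout is an arbitrary choice for the example.

```python
def ensure_running(endpoint, timeout: int = 300):
    """Resume `endpoint`, wait until it is deployed, and return it.

    `endpoint` is assumed to be an InferenceEndpoint. This helper is a sketch
    and is not part of huggingface_hub.
    """
    # running_ok=True is the default in the signature; spelled out for clarity.
    endpoint.resume(running_ok=True)
    # wait() polls the server and raises InferenceEndpointTimeoutError
    # if the endpoint is not deployed after `timeout` seconds.
    endpoint.wait(timeout=timeout)
    return endpoint
```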
scale_to_zero
( ) → InferenceEndpoint
Scale Inference Endpoint to zero.
An Inference Endpoint scaled to zero will not be charged. It will be resumed on the next request to it, with a cold start delay. This is different from pausing the Inference Endpoint with InferenceEndpoint.pause(), which would require a manual resume with InferenceEndpoint.resume().
This is an alias for HfApi.scale_to_zero_inference_endpoint(). The current object is mutated in place with the latest data from the server.
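The choice between the two cost-saving modes can be expressed as a small helper. This is a sketch only: stop_billing and its auto_restart flag are hypothetical names chosen for illustration.

```python
def stop_billing(endpoint, auto_restart: bool):
    """Stop charges for `endpoint` using one of the two documented modes.

    auto_restart=True  -> scale to zero; the next request wakes the endpoint
                          up again, after a cold start delay.
    auto_restart=False -> pause; the endpoint stays down until resume()
                          is called explicitly.
    """
    if auto_restart:
        return endpoint.scale_to_zero()
    return endpoint.pause()
```

Both methods return the endpoint itself, so the result can be inspected right away (for example via its status attribute).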
update
( accelerator: typing.Optional[str] = None, instance_size: typing.Optional[str] = None, instance_type: typing.Optional[str] = None, min_replica: typing.Optional[int] = None, max_replica: typing.Optional[int] = None, scale_to_zero_timeout: typing.Optional[int] = None, repository: typing.Optional[str] = None, framework: typing.Optional[str] = None, revision: typing.Optional[str] = None, task: typing.Optional[str] = None, custom_image: typing.Optional[typing.Dict] = None, secrets: typing.Optional[typing.Dict[str, str]] = None ) → InferenceEndpoint
Parameters
- accelerator (str, optional): The hardware accelerator to be used for inference (e.g. "cpu").
- instance_size (str, optional): The size or type of the instance to be used for hosting the model (e.g. "x4").
- instance_type (str, optional): The cloud instance type where the Inference Endpoint will be deployed (e.g. "intel-icl").
- min_replica (int, optional): The minimum number of replicas (instances) to keep running for the Inference Endpoint.
- max_replica (int, optional): The maximum number of replicas (instances) to scale to for the Inference Endpoint.
- scale_to_zero_timeout (int, optional): The duration in minutes before an inactive endpoint is scaled to zero.
- repository (str, optional): The name of the model repository associated with the Inference Endpoint (e.g. "gpt2").
- framework (str, optional): The machine learning framework used for the model (e.g. "custom").
- revision (str, optional): The specific model revision to deploy on the Inference Endpoint (e.g. "6c0e6080953db56375760c0471a8c5f2929baf11").
- task (str, optional): The task on which to deploy the model (e.g. "text-classification").
- custom_image (Dict, optional): A custom Docker image to use for the Inference Endpoint. This is useful if you want to deploy an Inference Endpoint running on the text-generation-inference (TGI) framework (see examples).
- secrets (Dict[str, str], optional): Secret values to inject in the container environment.
Returns
the same Inference Endpoint, mutated in place with the latest data.
Update the Inference Endpoint.
This method allows the update of either the compute configuration, the deployed model, or both. All arguments are optional but at least one must be provided.
This is an alias for HfApi.update_inference_endpoint(). The current object is mutated in place with the latest data from the server.
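For instance, the replica range can be changed without touching the deployed model. The sketch below validates the range locally before calling update(). set_replica_range is a hypothetical helper, and the validation rule (0 <= min_replica <= max_replica) is an assumption made for this example rather than something this page documents the API as enforcing.

```python
def set_replica_range(endpoint, min_replica: int, max_replica: int):
    """Update only the autoscaling bounds of `endpoint`.

    `endpoint` is assumed to be an InferenceEndpoint. The local check below
    is an illustrative guard, not documented server-side behaviour.
    """
    if min_replica < 0 or max_replica < min_replica:
        raise ValueError(
            f"invalid replica range: min={min_replica}, max={max_replica}"
        )
    # Arguments left out keep their current values on the server.
    return endpoint.update(min_replica=min_replica, max_replica=max_replica)
```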
wait
( timeout: typing.Optional[int] = None, refresh_every: int = 5 ) → InferenceEndpoint
Parameters
- timeout (int, optional): The maximum time to wait for the Inference Endpoint to be deployed, in seconds. If None, will wait indefinitely.
- refresh_every (int, optional): The time to wait between each fetch of the Inference Endpoint status, in seconds. Defaults to 5.
Returns
the same Inference Endpoint, mutated in place with the latest data.
Raises
- InferenceEndpointError: If the Inference Endpoint ended up in a failed state.
- InferenceEndpointTimeoutError: If the Inference Endpoint is not deployed after timeout seconds.
Wait for the Inference Endpoint to be deployed.
Information from the server will be fetched every refresh_every seconds (5 by default). If the Inference Endpoint is not deployed after timeout seconds, an InferenceEndpointTimeoutError will be raised. The InferenceEndpoint will be mutated in place with the latest data.
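The behaviour described here amounts to a polling loop. The following is an illustrative re-creation, not the library's implementation: it uses only fetch() and status, assumes the status values compare equal to the strings "running" and "failed", and raises built-in exceptions instead of the InferenceEndpoint-specific ones. The sleep and clock parameters are hypothetical hooks added so the loop can be exercised without real delays.

```python
import time
from typing import Callable, Optional


def poll_until_deployed(
    endpoint,
    timeout: Optional[float] = None,
    refresh_every: float = 5,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
):
    """Poll `endpoint` until it is running, has failed, or the timeout passes."""
    start = clock()
    while True:
        endpoint.fetch()  # refreshes the object in place
        if endpoint.status == "running":  # assumed status value
            return endpoint
        if endpoint.status == "failed":  # assumed status value
            raise RuntimeError("endpoint ended up in a failed state")
        if timeout is not None and clock() - start >= timeout:
            raise TimeoutError(f"endpoint not deployed after {timeout} seconds")
        sleep(refresh_every)
```

In real code, calling endpoint.wait(timeout=..., refresh_every=...) is the simpler option; the loop above is only meant to show what the two parameters control.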
InferenceEndpointStatus
class huggingface_hub.InferenceEndpointStatus
( value, names = None, module = None, qualname = None, type = None, start = 1 )
An enumeration.
InferenceEndpointType
class huggingface_hub.InferenceEndpointType
( value, names = None, module = None, qualname = None, type = None, start = 1 )
An enumeration.
InferenceEndpointError
Generic exception when dealing with Inference Endpoints.