📢 [v0.22]: Chat completion, inference types and hub mixins!
Note: pre-release 0.22.0.rc0 is available on PyPI. Official release will occur in the coming days.
EDIT: 0.22.0 is now officially available on PyPI!
✨ InferenceClient
Support for inference tools continues to improve in huggingface_hub. On the menu in this release: a new chat_completion API and fully typed inputs/outputs!
Chat-completion API!
A long-awaited API has just landed in huggingface_hub! InferenceClient.chat_completion follows most of OpenAI's API, making it much easier to integrate with existing tools.
Technically speaking, it uses the same backend as the text-generation task but requires a preprocessing step to format the list of messages into a single text prompt. The chat template is rendered server-side when models are powered by TGI, which is the case for most LLMs: Llama, Zephyr, Mistral, Gemma, etc. Otherwise, the templating happens client-side, which requires the minijinja package to be installed. We are actively working on bridging this gap, aiming to render all templates server-side in the future.
>>> from huggingface_hub import InferenceClient
>>> messages = [{"role": "user", "content": "What is the capital of France?"}]
>>> client = InferenceClient("HuggingFaceH4/zephyr-7b-beta")
# Chat completion (non-streaming)
>>> client.chat_completion(messages, max_tokens=100)
ChatCompletionOutput(
choices=[
ChatCompletionOutputChoice(
finish_reason='eos_token',
index=0,
message=ChatCompletionOutputChoiceMessage(
content='The capital of France is Paris. The official name of the city is "Ville de Paris" (City of Paris) and the name of the country\'s governing body, which is located in Paris, is "La République française" (The French Republic). \nI hope that helps! Let me know if you need any further information.'
)
)
],
created=1710498360
)
# Stream new tokens one by one
>>> for token in client.chat_completion(messages, max_tokens=10, stream=True):
... print(token)
ChatCompletionStreamOutput(choices=[ChatCompletionStreamOutputChoice(delta=ChatCompletionStreamOutputDelta(content='The', role='assistant'), index=0, finish_reason=None)], created=1710498504)
ChatCompletionStreamOutput(choices=[ChatCompletionStreamOutputChoice(delta=ChatCompletionStreamOutputDelta(content=' capital', role='assistant'), index=0, finish_reason=None)], created=1710498504)
(...)
ChatCompletionStreamOutput(choices=[ChatCompletionStreamOutputChoice(delta=ChatCompletionStreamOutputDelta(content=' may', role='assistant'), index=0, finish_reason=None)], created=1710498504)
ChatCompletionStreamOutput(choices=[ChatCompletionStreamOutputChoice(delta=ChatCompletionStreamOutputDelta(content=None, role=None), index=0, finish_reason='length')], created=1710498504)
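When streaming, each chunk carries only a small text delta. Here is a minimal sketch (not part of the release notes) of accumulating those deltas into the full reply:
>>> full_reply = ""
>>> for chunk in client.chat_completion(messages, max_tokens=100, stream=True):
...     delta = chunk.choices[0].delta.content
...     if delta is not None:  # the last chunk only carries the finish_reason
...         full_reply += delta
# full_reply now holds the complete generated answer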
- Implement InferenceClient.chat_completion + use new types for text-generation by @Wauplin in #2094
- Fix InferenceClient.text_generation for non-tgi models by @Wauplin in #2136
- https://github.com/huggingface/huggingface_hub/pull/2153 by @Wauplin in #2153
Inference types
We are currently working towards more consistency in task definitions across the Hugging Face ecosystem. This is no easy job, but a major milestone has recently been achieved! All inputs and outputs of the main ML tasks are now fully specified as JSON schema objects. This is the first brick needed to have consistent expectations when running inference across our stack: transformers (Python), transformers.js (TypeScript), Inference API (Python), Inference Endpoints (Python), Text Generation Inference (Rust), Text Embeddings Inference (Rust), InferenceClient (Python), Inference.js (TypeScript), etc.
Integrating those definitions will require more work, but huggingface_hub is one of the first tools to integrate them. As a start, all InferenceClient return values are now typed dataclasses. Furthermore, typed dataclasses have been generated for all tasks' inputs and outputs. This means you can now integrate them in your own library to ensure consistency with the Hugging Face ecosystem. Specifications are open-source (see here), meaning anyone can access and contribute to them. Python's generated classes are documented here.
Here is a short example showcasing the new output types:
>>> from huggingface_hub import InferenceClient
>>> client = InferenceClient()
>>> client.object_detection("people.jpg")
[
ObjectDetectionOutputElement(
score=0.9486683011054993,
label='person',
box=ObjectDetectionBoundingBox(xmin=59, ymin=39, xmax=420, ymax=510)
),
...
]
Note that those dataclasses are backward-compatible with the dict-based interface that was previously in use. In the example above, both ObjectDetectionBoundingBox(...).xmin and ObjectDetectionBoundingBox(...)["xmin"] are correct, even though the former should be the preferred solution from now on.
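Concretely, reusing the detection above, both access styles return the same value:
>>> box = client.object_detection("people.jpg")[0].box
>>> box.xmin     # attribute access (preferred)
59
>>> box["xmin"]  # dict-style access (still supported)
59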
- Generate inference types + start using output types by @Wauplin in #2036
- Add = None at optional parameters by @LysandreJik in #2095
- Fix inference types shared between tasks by @Wauplin in #2125
🧩 ModelHubMixin
ModelHubMixin is an object that can be used as a parent class for the objects in your library in order to provide built-in serialization methods to upload and download pretrained models from the Hub. This mixin is adapted into a PyTorchModelHubMixin that can serialize and deserialize any PyTorch model. The 0.22 release brings its share of improvements to these classes:
1. Better support of init values. If you instantiate a model with custom arguments, the values are automatically stored in a config.json file and restored when reloading the model from pretrained weights. This should unlock integrations with external libraries in a much smoother way.
2. Library authors integrating the hub mixin can now define custom metadata for their library: library name, tags, docs URL and repo URL. These are to be defined only once when integrating the library. Any model pushed to the Hub using the library will then be easily discoverable thanks to those tags.
3. A base model card is generated for each saved model. This model card includes default tags (e.g. model_hub_mixin) and custom tags from the library (see 2.). You can extend or modify this model card by overwriting the generate_model_card method (a short sketch follows the code example below).
>>> import torch
>>> import torch.nn as nn
>>> from huggingface_hub import PyTorchModelHubMixin
# Define your Pytorch model exactly the same way you are used to
>>> class MyModel(
... nn.Module,
... PyTorchModelHubMixin, # multiple inheritance
... library_name="keras-nlp",
... tags=["keras"],
... repo_url="https://github.com/keras-team/keras-nlp",
... docs_url="https://keras.io/keras_nlp/",
... # ^ optional metadata to generate model card
... ):
... def __init__(self, hidden_size: int = 512, vocab_size: int = 30000, output_size: int = 4):
... super().__init__()
... self.param = nn.Parameter(torch.rand(hidden_size, vocab_size))
...         self.linear = nn.Linear(vocab_size, output_size)  # (in_features, out_features) so the forward pass shapes line up
... def forward(self, x):
... return self.linear(x + self.param)
# 1. Create model
>>> model = MyModel(hidden_size=128)
# Config is automatically created based on input + default values
>>> model._hub_mixin_config
{"hidden_size": 128, "vocab_size": 30000, "output_size": 4}
# 2. (optional) Save model to local directory
>>> model.save_pretrained("path/to/my-awesome-model")
# 3. Push model weights to the Hub
>>> model.push_to_hub("my-awesome-model")
# 4. Initialize model from the Hub => config has been preserved
>>> model = MyModel.from_pretrained("username/my-awesome-model")
>>> model._hub_mixin_config
{"hidden_size": 128, "vocab_size": 30000, "output_size": 4}
# Model card has been correctly populated
>>> from huggingface_hub import ModelCard
>>> card = ModelCard.load("username/my-awesome-model")
>>> card.data.tags
["keras", "pytorch_model_hub_mixin", "model_hub_mixin"]
>>> card.data.library_name
"keras-nlp"
For more details on how to integrate these classes, check out the integration guide.
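As a quick illustration of point 3, here is a minimal sketch (not taken from the docs; the class name and extra tag are hypothetical) of overriding generate_model_card to extend the default model card:
>>> from huggingface_hub import ModelCard, PyTorchModelHubMixin
>>> import torch.nn as nn
>>> class MyCustomCardModel(nn.Module, PyTorchModelHubMixin):
...     def __init__(self, hidden_size: int = 16):
...         super().__init__()
...         self.linear = nn.Linear(hidden_size, hidden_size)
...     def generate_model_card(self, *args, **kwargs) -> ModelCard:
...         # Start from the card generated by the mixin, then extend it
...         card = super().generate_model_card(*args, **kwargs)
...         card.data.tags = (card.data.tags or []) + ["my-custom-tag"]  # hypothetical extra tag
...         return card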
- Fix ModelHubMixin: pass config when __init__ accepts **kwargs by @Wauplin in #2058
- [PyTorchModelHubMixin] Fix saving model with shared tensors by @NielsRogge in #2086
- Correctly inject config in PytorchModelHubMixin by @Wauplin in #2079
- Fix passing kwargs in PytorchHubMixin by @Wauplin in #2093
- Generate modelcard in ModelHubMixin by @Wauplin in #2080
- Fix ModelHubMixin: save config only if doesn't exist by @Wauplin in #2105
- Fix ModelHubMixin - kwargs should be passed correctly when reloading by @Wauplin in #2099
- Fix ModelHubMixin when kwargs and config are both passed by @Wauplin in #2138
- ModelHubMixin overwrite config if preexistant by @Wauplin in #2142
🛠️ Misc improvements
HfFileSystem download speed was limited by some internal logic in fsspec. We've now updated the get_file and read implementations to improve their download speed to a level similar to hf_hub_download.
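For instance, a small usage sketch (the repo and file names are just placeholders) that benefits from the faster code path:
>>> from huggingface_hub import HfFileSystem
>>> fs = HfFileSystem()
# Copy a file from the Hub to a local path (optimized get_file)
>>> fs.get_file("gpt2/config.json", "./config.json")
# Read a file directly into memory (optimized read)
>>> with fs.open("gpt2/config.json", "r") as f:
...     config = f.read()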
We are aiming to move all errors raised by huggingface_hub into a single module, huggingface_hub.errors, to ease the developer experience. This work has been started as a community contribution from @Y4suyuki.
The HfApi class now accepts a headers parameter that is then passed to every HTTP call made to the Hub.
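A minimal sketch (the header name below is purely illustrative):
>>> from huggingface_hub import HfApi
>>> api = HfApi(headers={"X-My-Header": "some-value"})
>>> api.model_info("gpt2")  # this request (and every other one) includes the custom header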
📚 More documentation in Korean!
💔 Breaking changes
- The new types returned by InferenceClient methods should be backward compatible, especially for accessing values either as attributes (.my_field) or as items (i.e. ["my_field"]). However, dataclasses and dicts do not always behave exactly the same, so you might notice some breaking changes. Those breaking changes should be very limited.
- ModelHubMixin internals changed quite a bit, breaking some use cases. We don't think those use cases were in use, and changing them should really benefit 99% of integrations. If you witness any inconsistency or error in your integration, please let us know and we will do our best to mitigate the problem. One of the biggest changes is that the config values are no longer attached to the mixin instance as instance.config but as instance._hub_mixin_config. The .config attribute was mistakenly introduced in 0.20.x, so we hope it has not been used much yet.
- huggingface_hub.file_download.http_user_agent has been removed in favor of the officially documented huggingface_hub.utils.build_hf_headers. It had been deprecated since 0.18.x.
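If you were relying on http_user_agent, a minimal sketch of the documented replacement:
>>> from huggingface_hub.utils import build_hf_headers
>>> headers = build_hf_headers()  # includes the `user-agent` header (and an auth token when one is available)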
Small fixes and maintenance
⚙️ CI optimization
The CI pipeline has been greatly improved, especially thanks to the efforts from @bmuskalla. Most tests now pass in under 3 minutes, against 8 to 10 minutes previously. Some long-running tests have been greatly simplified, and all tests now run in parallel with python-xdist, thanks to making them fully independent of each other.
We are now also using the great uv installer instead of pip in our CI, which saves around 30-40s per pipeline.
- More optimized tests by @Wauplin in #2054
- Enable python-xdist on all tests by @bmuskalla in #2059
- do not list all models by @Wauplin in #2061
- update ruff by @Wauplin in #2071
- Use uv in CI to speed-up requirements install by @Wauplin in #2072
⚙️ fixes
- Fix Space variable when updatedAt is missing by @Wauplin in #2050
- Fix tests involving temp directory on macOS by @bmuskalla in #2052
- fix glob no magic by @lhoestq in #2056
- Point out that the token must have write scope by @bmuskalla in #2053
- Fix commonpath in read-only filesystem by @stevelaskaridis in #2073
- rm unnecessary early makedirs by @poedator in #2092
- Fix unhandled filelock issue by @Wauplin in #2108
- Handle .DS_Store files in _scan_cache_repos by @sealad886 in #2112
- Fix REPO_API_REGEX by @Wauplin in #2119
- Fix uploading to HF proxy by @Wauplin in #2120
- Fix --delete in huggingface-cli upload command by @Wauplin in #2129
- Explicitly fail on Keras3 by @Wauplin in #2107
- Fix serverless naming by @Wauplin in #2137
⚙️ internal
- tag as 0.22.0.dev + remove deprecated code by @Wauplin in #2049
- Some cleaning by @Wauplin in #2070
- Fix test test_delete_branch_on_missing_branch_fails by @Wauplin in #2088
Significant community contributions
The following contributors have made significant changes to the library over the last release:
- @Y4suyuki
  - Start defining custom errors in one place (#2122)
- @bmuskalla
  - Enable python-xdist on all tests by @bmuskalla in #2059
Can I use the InferenceClient with any OpenAI-API-compatible service? That's amazing. Anthropic, Groq, Mistral APIs?
@radames In theory yes, since we are using the same interface for inputs and outputs. In practice, we only implement a subset of the full specification, meaning some parameters are not accepted. You can find more information on this PR. Regarding other providers, I did not test them as I was focused on shipping a client compatible with HF products first. Here is an example of how switching from the OpenAI package to InferenceClient made the workflow slightly more integrated.
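For illustration, a hedged sketch of what such a switch can look like (the OpenAI-side snippet is kept as comments; model names and parameters are just examples):
>>> from huggingface_hub import InferenceClient
# Before, with the openai package:
#   from openai import OpenAI
#   client = OpenAI(api_key="...")
#   client.chat.completions.create(model="gpt-3.5-turbo", messages=messages, max_tokens=100)
# After, with huggingface_hub (same message format):
>>> messages = [{"role": "user", "content": "Hello!"}]
>>> client = InferenceClient("HuggingFaceH4/zephyr-7b-beta")
>>> client.chat_completion(messages=messages, max_tokens=100)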