🚀 Releasing a new series of 8 zeroshot classifiers: better performance, fully commercially usable thanks to synthetic data, up to 8192 tokens of context, and they run on any hardware.
Summary:
🤖 The zeroshot-v2.0-c series replaces commercially restrictive training data with synthetic data generated with mistralai/Mixtral-8x7B-Instruct-v0.1 (Apache 2.0). All models are released under the MIT license.
🦾 The best model performs 17 percentage points better across 28 tasks vs. facebook/bart-large-mnli (the most downloaded commercially-friendly baseline).
🌍 The series includes a multilingual variant fine-tuned from BAAI/bge-m3 for zeroshot classification in 100+ languages, with a context window of 8192 tokens.
🪶 With only 0.2-0.6B parameters, the models run on any hardware. The base-size models are more than 2x faster than bart-large-mnli while performing significantly better.
🤖 The models are not generative LLMs; they are efficient encoder-only models specialized in zeroshot classification through the universal NLI task.
🤝 For users for whom commercially restrictive training data is not an issue, I've also trained variants with even more human data for improved performance.
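For readers unfamiliar with the universal NLI formulation mentioned above, here is a minimal sketch of how it works: each candidate label is rewritten as a hypothesis, the encoder scores entailment of that hypothesis against the input text (the premise), and the most-entailed label wins. The hypothesis template below mirrors the default used by the Hugging Face zero-shot pipeline; the model name in the comment is an assumption about the release naming, so check the collection for the exact identifiers.

```python
# Sketch of the universal-NLI formulation behind these zeroshot classifiers.
# Each candidate label becomes an NLI hypothesis; an entailment model then
# scores (premise=text, hypothesis=label sentence) pairs.

def build_nli_pairs(text, labels, template="This example is {}."):
    """Turn a text and candidate labels into (premise, hypothesis) NLI pairs."""
    return [(text, template.format(label)) for label in labels]

pairs = build_nli_pairs(
    "Angela Merkel is a politician in Germany and leader of the CDU",
    ["politics", "economy", "sports"],
)
for premise, hypothesis in pairs:
    print(hypothesis)

# In practice, you run these pairs through one of the released models via the
# transformers zero-shot pipeline (model name below is assumed):
#
#   from transformers import pipeline
#   clf = pipeline("zero-shot-classification",
#                  model="MoritzLaurer/deberta-v3-large-zeroshot-v2.0")
#   clf("Angela Merkel is a politician in Germany and leader of the CDU",
#       candidate_labels=["politics", "economy", "sports"])
```

Because the label set is injected at inference time through the template, the same model classifies into any set of classes without retraining, which is what makes the NLI task "universal".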
Next steps:
✍️ I'll publish a blog post with more details soon.
🔮 There are several improvements I'm planning for v2.1. The multilingual model in particular has room for improvement.
All models are available for download in this Hugging Face collection: MoritzLaurer/zeroshot-classifiers-6548b4ff407bb19ff5c3ad6f
These models are an extension of the approach explained in this paper, but with additional synthetic data: https://arxiv.org/abs/2312.17543