Tom Aarsen


AI & ML interests

NLP: text embeddings, information retrieval, named entity recognition, few-shot text classification



tomaarsen's activity

posted an update about 10 hours ago
🎉SetFit v1.1.0 is out! Training efficient classifiers on CPU or GPU now uses the Sentence Transformers Trainer, and we resolved a lot of issues caused by updates of third-party libraries (like Transformers). Details:

Training a SetFit classifier model consists of 2 phases:
1. Finetuning a Sentence Transformer embedding model
2. Training a Classifier to map embeddings -> classes
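To make phase 2 concrete, here is a minimal sketch of a classifier that maps embeddings to classes. It is not SetFit's code: it swaps in a nearest-centroid classifier for simplicity, the 2-D "embeddings" are made up, and `fit_centroids` and `predict` are hypothetical names.

```python
def fit_centroids(embeddings, labels):
    """Average the embeddings of each class into one centroid per label."""
    sums, counts = {}, {}
    for emb, label in zip(embeddings, labels):
        if label not in sums:
            sums[label] = [0.0] * len(emb)
            counts[label] = 0
        sums[label] = [s + x for s, x in zip(sums[label], emb)]
        counts[label] += 1
    return {label: [s / counts[label] for s in sums[label]] for label in sums}


def predict(centroids, embedding):
    """Return the label whose centroid is closest (squared Euclidean distance)."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(centroids, key=lambda label: sq_dist(centroids[label], embedding))


# Toy embeddings standing in for the output of the finetuned embedding model (assumption).
train_embeddings = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.2, 0.7]]
train_labels = ["positive", "positive", "negative", "negative"]

centroids = fit_centroids(train_embeddings, train_labels)
prediction = predict(centroids, [0.7, 0.3])
```

The better phase 1 separates the classes in embedding space, the simpler the phase 2 classifier can be.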

🔌The first phase now uses the SentenceTransformerTrainer that was introduced in the Sentence Transformers v3 update. This brings some immediate upsides like MultiGPU support, without any (intended) breaking changes.

➡️ Beyond that, we softly deprecated the "evaluation_strategy" argument in favor of "eval_strategy" (following a Transformers deprecation), and deprecated Python 3.7. In return, we add official support for Python 3.11 and 3.12.

✨ There's some more minor changes too, like max_steps and eval_max_steps now being a hard limit instead of an approximate one, training/validation losses now logging nicely in Notebooks, and the "device" parameter no longer being ignored in some situations.

Check out the full release notes here:
Or read the documentation:
Or check out the public SetFit models for inspiration:

P.s. the model in the code snippet trained in 1 minute and it can classify ~6000 sentences per second on my GPU.
replied to their post 8 days ago

Glad to hear it! Feel free to send over feedback if you have any, it's always quite valuable for new features/docs.

posted an update 8 days ago
🚀 Sentence Transformers v3.1 is out! Featuring a hard negatives mining utility to get better models out of your data, a new strong loss function, training with streaming datasets, custom modules, bug fixes, small additions and docs changes. Here's the details:

⛏ Hard Negatives Mining Utility: Hard negatives are texts that are rather similar to some anchor text (e.g. a question), but are not the correct match. They're difficult for a model to distinguish from the correct answer, often resulting in a stronger model after training.
📉 New loss function: This loss function works very well for symmetric tasks (e.g. clustering, classification, finding similar texts/paraphrases) and a bit less so for asymmetric tasks (e.g. question-answer retrieval).
💾 Streaming datasets: You can now train with the datasets.IterableDataset, which doesn't require downloading the full dataset to disk before training. As simple as "streaming=True" in your "datasets.load_dataset".
🧩 Custom Modules: Model authors can now customize a lot more of the components that make up Sentence Transformer models, allowing for a lot more flexibility (e.g. multi-modal, model-specific quirks, etc.)
✨ New arguments to several methods: encode_multi_process gets a progress bar, push_to_hub can now be done to different branches, and CrossEncoders can be downloaded to specific cache directories.
🐛 Bug fixes: Too many to name here, check out the release notes!
📝 Documentation: A particular focus on clarifying the batch samplers in the Package Reference this release.
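The hard negatives idea from the list above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not the library's mining utility: the embeddings are toy 2-D vectors, cosine similarity is assumed as the score, and `mine_hard_negatives` here is a hypothetical helper.

```python
import math


def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def mine_hard_negatives(anchor, positive_idx, corpus, num_negatives=1):
    """Return indices of the corpus entries most similar to the anchor,
    skipping the known positive. Those are the 'hard' negatives."""
    scored = [
        (cosine(anchor, emb), idx)
        for idx, emb in enumerate(corpus)
        if idx != positive_idx
    ]
    scored.sort(reverse=True)  # most similar first
    return [idx for _, idx in scored[:num_negatives]]


anchor = [1.0, 0.0]       # e.g. a question
corpus = [
    [0.9, 0.1],           # 0: the correct answer (positive)
    [0.8, 0.3],           # 1: similar but wrong -> hard negative
    [0.0, 1.0],           # 2: unrelated -> easy negative
    [-1.0, 0.0],          # 3: opposite -> easy negative
]

hard = mine_hard_negatives(anchor, positive_idx=0, corpus=corpus, num_negatives=1)
top_two = mine_hard_negatives(anchor, positive_idx=0, corpus=corpus, num_negatives=2)
```

In practice the most similar candidates may be unlabeled true matches, so a real pipeline usually needs some filtering on top of this ranking.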

Check out the full release notes here ⭐:

I'm very excited to hear your feedback, and I'm looking forward to the future changes that I have planned, such as ONNX inference! I'm also open to suggestions for new features: feel free to send me your ideas.
posted an update 3 months ago
@Omartificial-Intelligence-Space has trained and released 6 Arabic embedding models for semantic similarity. 4 of them outperform all previous models on the STS17 Arabic-Arabic task!

📚 Trained on a large dataset of 558k Arabic triplets translated from the AllNLI triplet dataset: Omartificial-Intelligence-Space/Arabic-NLi-Triplet
6️⃣ 6 different base models: AraBERT, MarBERT, LaBSE, MiniLM, paraphrase-multilingual-mpnet-base, mpnet-base, ranging from 109M to 471M parameters.
🪆 Trained with a Matryoshka loss, allowing you to truncate embeddings with minimal performance loss: smaller embeddings are faster to compare.
📈 Outperforms all commonly used multilingual models like intfloat/multilingual-e5-large, sentence-transformers/paraphrase-multilingual-mpnet-base-v2, and sentence-transformers/LaBSE.

Check them out here:
- Omartificial-Intelligence-Space/Arabic-mpnet-base-all-nli-triplet
- Omartificial-Intelligence-Space/Arabic-all-nli-triplet-Matryoshka
- Omartificial-Intelligence-Space/Arabert-all-nli-triplet-Matryoshka
- Omartificial-Intelligence-Space/Arabic-labse-Matryoshka
- Omartificial-Intelligence-Space/Marbert-all-nli-triplet-Matryoshka
- Omartificial-Intelligence-Space/Arabic-MiniLM-L12-v2-all-nli-triplet
Or the collection with all: Omartificial-Intelligence-Space/arabic-matryoshka-embedding-models-666f764d3b570f44d7f77d4e

My personal favourite is likely Omartificial-Intelligence-Space/Arabert-all-nli-triplet-Matryoshka: it's very efficient at 135M parameters & scores #1 on mteb/leaderboard.
posted an update 3 months ago
I just published Sentence Transformers v3.0.1: the first patch release since v3 from last week. It introduces gradient checkpointing, pushing model checkpoints to Hugging Face while training, model card improvements and fixes. Details:

1️⃣ Gradient checkpointing allows for much less memory usage at a cost of ~20% training speed. This seems to allow for higher batch sizes, which is quite important for loss functions with in-batch negatives.
2️⃣ You can specify args.push_to_hub=True and args.hub_model_id to upload your model checkpoints to Hugging Face while training. It also uploads your emissions (if codecarbon is installed) and your Tensorboard logs (if tensorboard is installed)
3️⃣ Model card improvements: improved automatic widget examples, better tags, and the default of "sentence_transformers_model_id" now gets replaced when possible.
4️⃣ Several evaluator fixes, see release notes for details.
5️⃣ Fixed a bug with MatryoshkaLoss throwing an error if the supplied Matryoshka dimensions are ascending instead of descending.
6️⃣ Full Safetensors support; even the uncommon modules can now save and load "model.safetensors" files: no more pickle risks.

Check out the full release notes here:

And let me know what kind of features you'd like to see next! I have some plans already (ONNX, Sparse models, ColBERT, PEFT), but I don't yet know how I should prioritize everything.
replied to victor's post 4 months ago

I just tried this out, and wow, it works very well!

posted an update 4 months ago
‼️Sentence Transformers v3.0 is out! You can now train and finetune embedding models with multi-GPU training, bf16 support, loss logging, callbacks & much more. I've also released 50+ datasets to train on.

1️⃣ Training Refactor
Embedding models can now be trained using an extensive trainer with a lot of powerful features:
- MultiGPU Training (Data Parallelism (DP) and Distributed Data Parallelism (DDP))
- bf16 training support; loss logging
- Evaluation datasets + evaluation loss
- Improved callback support + an excellent Weights & Biases integration
- Gradient checkpointing, gradient accumulation
- Improved model card generation
- Resuming from a training checkpoint without performance loss
- Hyperparameter Optimization
and much more!
Read my detailed blogpost to learn about the components that make up this new training approach:

2️⃣ Similarity Score
Not sure how to compare embeddings? Don't worry, you can now use model.similarity(embeddings1, embeddings2) and you'll get your similarity scores immediately. Model authors can specify their desired similarity score, so you don't have to worry about it anymore!
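As a rough picture of what such a call returns, here is a plain-Python sketch that computes a pairwise score matrix for two lists of embeddings. It assumes cosine similarity as the score function (the post notes that model authors can choose a different one), and `cosine_similarity_matrix` is a hypothetical helper, not the library's implementation.

```python
import math


def cosine_similarity_matrix(embeddings1, embeddings2):
    """Return a len(embeddings1) x len(embeddings2) matrix of cosine similarities."""
    def normalize(v):
        norm = math.sqrt(sum(x * x for x in v))
        return [x / norm for x in v]

    a = [normalize(v) for v in embeddings1]
    b = [normalize(v) for v in embeddings2]
    # After normalization, the dot product equals the cosine similarity.
    return [[sum(x * y for x, y in zip(u, v)) for v in b] for u in a]


# Toy embeddings (assumption): 2 texts on one side, 3 on the other.
embeddings1 = [[1.0, 0.0], [0.0, 2.0]]
embeddings2 = [[3.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

scores = cosine_similarity_matrix(embeddings1, embeddings2)
```

Row `i` of `scores` holds the similarity of the i-th text in the first list against every text in the second list.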

3️⃣ Additional Kwargs
Sentence Transformers relies on various Transformers instances (AutoModel, AutoTokenizer, AutoConfig), but it was hard to provide valuable keyword arguments to these (like 'torch_dtype=torch.bfloat16' to load a model at a lower precision for 2x inference speedup). This is now easy!

4️⃣ Hyperparameter Optimization
Sentence Transformers now ships with HPO, allowing you to effectively choose your hyperparameters for your data and task.

5️⃣ Dataset Release
To help you out with finetuning models, I've released 50+ ready-to-go datasets that can be used for training or finetuning embedding models: sentence-transformers/embedding-model-datasets-6644d7a3673a511914aa7552

Full release notes:
replied to fdaudens's post 4 months ago

Very impressive! It seems excellent at Dutch, too

replied to their post 4 months ago

Here's some evidence that these models work well across domains:

posted an update 4 months ago
NuMind has just released 3 new state-of-the-art GLiNER models for Named Entity Recognition/Information Extraction. These GLiNER models allow you to specify any label that you want, and it'll find spans in the text corresponding to your label. It's been shown to work quite well on unusual domains, e.g. celestial entities in my picture.

There are 3 models released:
- numind/NuNER_Zero:
The primary model, SOTA & can detect really long entities.
- numind/NuNER_Zero-span:
Slightly better performance than NuNER Zero, but can't detect entities longer than 12 tokens.
- numind/NuNER_Zero-4k:
Slightly worse than NuNER Zero, but has a context length of 4k tokens.

Some more details about these models in general:
- They are *really* small, orders of magnitude smaller than LLMs, which don't reach this level of performance.
- Because they're small - they're fast: <1s per sentence on free GPUs.
- They have an MIT license: free commercial usage.

Try out the demo here:
Or check out all of the models here: numind/nunerzero-zero-shot-ner-662b59803b9b438ff56e49e2

If there's ever a need for me to extract some information from any text: I'll be using these. Great work @Serega6678 !
posted an update 5 months ago
I've just stumbled upon some excellent work on (🇫🇷 French) retrieval models by @antoinelouis . Kudos to him!

- French Embedding Models:
- French Reranker Models: antoinelouis/cross-encoder-rerankers-651523f16efa656d1788a239
- French Multi-vector Models:
- Multilingual Models:

A lot of these models use the MS MARCO Hard Negatives dataset, which I'm currently reformatting to be more easily usable. Notably, they should work out of the box without any pre-processing for training embedding models in the upcoming Sentence Transformers v3.
replied to albertvillanova's post 5 months ago

Oooh, Dataset.take should be very convenient. No more .select(range(...)) 🚀

replied to Sentdex's post 5 months ago

I'm concerned about the low training speed (10x slower). Do we know anything about the inference latency as well? I think that's key to figure out whether this is viable or not.

replied to fdaudens's post 5 months ago

Thanks for writing out this list! I try my best to keep up, but even I missed some of these

replied to bwang0911's post 5 months ago

I quite enjoy the speed of these, well done.

replied to beomi's post 5 months ago

Nice job! What are your findings so far? Can you reasonably handle the lengths that they claim?

posted an update 5 months ago
🚀 Sentence Transformers v2.7.0 is out! Featuring a new loss function, easier Matryoshka model inference & evaluation, CrossEncoder improvements & Intel Gaudi2 Accelerator support. Details:

1️⃣ A new loss function: CachedGISTEmbedLoss
This loss function is a combination of CachedMultipleNegativesRankingLoss and the GISTEmbedLoss, both of which are already excellent. The caching mechanism allows for much higher batch sizes with constant memory usage, which boosts training performance. The GIST part introduces a guide model to guide the in-batch negative sample selection. This prevents false negatives, resulting in a stronger training signal.
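The guide-model idea can be sketched as a mask over the in-batch candidates. This is a simplified reading rather than the library's implementation: the threshold rule (drop any candidate that the guide scores higher than the true positive) is an assumption, the similarity values are made up, and `gist_negative_mask` is a hypothetical name.

```python
def gist_negative_mask(guide_sim):
    """guide_sim[i][j] is the guide model's similarity between anchor i and
    candidate j; candidate i is anchor i's true positive.

    Returns mask[i][j] == True when candidate j may serve as an in-batch
    negative for anchor i."""
    n = len(guide_sim)
    mask = []
    for i in range(n):
        threshold = guide_sim[i][i]  # guide score of the true (anchor, positive) pair
        row = []
        for j in range(n):
            if j == i:
                row.append(False)  # the positive itself is never a negative
            else:
                # A candidate the guide rates above the true positive is a
                # likely false negative, so it is excluded.
                row.append(guide_sim[i][j] <= threshold)
        mask.append(row)
    return mask


# Toy guide similarities for a batch of 3 (assumption).
guide_sim = [
    [0.90, 0.95, 0.10],  # candidate 1 looks like another valid answer to anchor 0
    [0.20, 0.85, 0.30],
    [0.10, 0.40, 0.80],
]

mask = gist_negative_mask(guide_sim)
```

Only the candidates left `True` would then contribute as negatives when the in-batch loss is computed.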

2️⃣ Automatic Matryoshka model truncation
Matryoshka models produce embeddings that are still useful after truncation. However, this truncation always had to be done manually, until now! We've added a truncate_dim option to the Sentence Transformer constructor. This also allows truncation when using HuggingFaceEmbeddings from LlamaIndex or LangChain.

3️⃣ Additionally, you can now specify truncate_dim in evaluators to get the performance after truncation. (Hint: it's surprisingly good, even for models not trained with MatryoshkaLoss, and it can speed up e.g. clustering, retrieval, etc.)
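For reference, the manual truncation that used to be necessary looks roughly like this. It is a plain-Python sketch on a toy vector; re-normalizing after truncation is an assumption that is handy for cosine or dot-product comparison, not a statement about what truncate_dim does internally, and `truncate_embedding` is a hypothetical helper.

```python
import math


def truncate_embedding(embedding, truncate_dim):
    """Keep the first `truncate_dim` values and rescale them to unit length."""
    head = embedding[:truncate_dim]
    norm = math.sqrt(sum(x * x for x in head))
    # Guard against an all-zero prefix to avoid dividing by zero.
    return [x / norm for x in head] if norm > 0 else head


full = [0.6, 0.0, 0.8, 0.0]           # toy 4-dimensional embedding
small = truncate_embedding(full, 2)   # keep only the first 2 dimensions
```

Matryoshka training pushes the most useful information into the leading dimensions, which is why dropping the tail costs relatively little.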

4️⃣ CrossEncoder improvements
The CrossEncoder now supports 'push_to_hub' to upload trained reranker models to Hugging Face. Additionally, CrossEncoders now support trust_remote_code to load models with custom modelling code.

5️⃣ Inference on Intel Gaudi2
If you have an Intel Gaudi2 Accelerator, Sentence Transformers now uses it automatically for even faster inference. No changes are necessary to your code, the device is automatically detected!

Check out the release notes for all of the details:

I'm very excited for the upcoming releases: I'm making great progress with a notable v3 refactor that should heavily improve the training process for embedding models!
replied to jamarks's post 5 months ago

Awesome! I reckon this'll make it a lot easier to quickly share, save & load some annotation work.

replied to louisbrulenaudet's post 5 months ago

Very glad to see more uses of embedding quantization, great job.

replied to trisfromgoogle's post 5 months ago

The Recurrent Gemma is very intriguing to me. I'm looking forward to reading more about the RNN-based models when I have some more spare time.

replied to urchade's post 5 months ago
replied to MoritzLaurer's post 6 months ago

Looking forward to your blogpost! It's always exciting to see solid non-generative models.

posted an update 6 months ago
🏅 Quantized Embeddings are here! Unlike model quantization, embedding quantization is a post-processing step for embeddings that converts e.g. float32 embeddings to binary or int8 embeddings. This saves 32x or 4x memory & disk space, and these embeddings are much easier to compare!

Our results show 25-45x speedups in retrieval compared to full-size embeddings, while keeping 96% of the performance!
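To see where the 32x and 4x come from: a float32 value takes 32 bits, a binary value takes 1 bit, and an int8 value takes 8 bits. Below is a minimal plain-Python sketch of both schemes. The zero threshold for binary, the fixed `[low, high]` range for int8, and all function names are assumptions for illustration, not the library's implementation.

```python
def quantize_binary(embedding):
    """One bit per dimension (1 if the value is > 0), packed into a Python int."""
    bits = 0
    for value in embedding:
        bits = (bits << 1) | (1 if value > 0 else 0)
    return bits


def hamming_distance(a, b):
    """Number of differing bits; a cheap way to compare binary embeddings."""
    return bin(a ^ b).count("1")


def quantize_int8(embedding, low, high):
    """Linearly map values from [low, high] onto the int8 range [-128, 127]."""
    scale = 255.0 / (high - low)
    out = []
    for value in embedding:
        clipped = min(max(value, low), high)
        out.append(int(round((clipped - low) * scale)) - 128)
    return out


# Toy float embeddings (assumption).
query = [0.3, -0.2, 0.8, -0.5]
doc_a = [0.1, -0.4, 0.9, -0.1]   # same sign pattern as the query
doc_b = [-0.7, 0.6, -0.2, 0.4]   # opposite sign pattern

dist_a = hamming_distance(quantize_binary(query), quantize_binary(doc_a))
dist_b = hamming_distance(quantize_binary(query), quantize_binary(doc_b))
int8_values = quantize_int8([-1.0, 0.25, 1.0], low=-1.0, high=1.0)
```

Comparing binary embeddings reduces to an XOR and a bit count, which is one reason retrieval over them can be much faster than over float vectors.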

Learn more about it in our blogpost in collaboration with
Or try out our demo where we use quantized embeddings to let you search all of Wikipedia (yes, 41,000,000 texts) in 1 second on a CPU Space: sentence-transformers/quantized-retrieval
posted an update 6 months ago
🎉Today, the 5000th Sentence Transformer model was uploaded to Hugging Face! Embedding models are extremely versatile, so it's no wonder that they're still being trained.

Here's a few resources to get you started with them:
- All Sentence Transformer models:
- Sentence Transformer documentation:
- Massive Text Embedding Benchmark (MTEB) Leaderboard: mteb/leaderboard

The embedding space is extremely active right now, so if you're using an embedding model for your retrieval, semantic similarity, reranking, classification, clustering, etc., then be sure to keep an eye on the trending Sentence Transformer models & new models on MTEB.

Also, I'm curious if you've ever used Sentence Transformers via a third party library, like a RAG framework or vector database. I'm quite interested in more integrations to bring everyone free, efficient & powerful embedding models!
replied to giux78's post 6 months ago
posted an update 7 months ago
I remember very well that about two years ago, 0-shot named entity recognition (i.e. where you can choose any labels on the fly) was completely infeasible. Fast forward a year, and Universal-NER/UniNER-7B-all surprised me by showing that 0-shot NER is possible! However, I had a bunch of concerns that prevented me from ever adopting it myself. For example, the model was 7B parameters, only worked with 1 custom label at a time, and it had a cc-by-nc-4.0 license.

Since then, a little-known research paper introduced GLiNER, which was a modified & finetuned variant of the microsoft/deberta-v3-base line of models. Notably, GLiNER outperforms UniNER-7B, despite being almost 2 orders of magnitude smaller! It also allows for multiple labels at once, supports nested NER, and the models are Apache 2.0.

Very recently, the models were uploaded to Hugging Face, and I was inspired to create a demo for the English model. The demo runs on CPU, and can still very efficiently compute labels with great performance. I'm very impressed by the models.

There are two models right now:
* base (english): urchade/gliner_base
* multi (multilingual): urchade/gliner_multi

And my demo to experiment with the base model can be found here:
replied to urchade's post 7 months ago
replied to their post 7 months ago

I've had the same idea before as well! I think this should work, but I haven't had time to do the research myself. Perhaps @SeanLee97 is interested in trying this out?

posted an update 7 months ago
🤗 Sentence Transformers v2.4.0 for embedding models is now out! It introduces a lot of powerful features, such as:

1. Matryoshka Loss function - you can now train & perform inference on 🪆 Matryoshka Embedding models. See also our blogpost:

2. CoSENTLoss & AnglELoss: State of the art loss functions. These are quite interesting, they outperform CosineSimilarityLoss on nearly all benchmarks as a drop-in replacement! See also the docs:

3. Prompt templates: Many popular models such as intfloat/multilingual-e5-large and BAAI/bge-large-en-v1.5 prefix their texts with prompts, so this adds configuration options to automatically include prompts using model.encode(..., prompt_name="query") which will include a prompt with the name "query". More info in the docs:

4. Instructor support: Support for the INSTRUCTOR line of models, such as hkunlp/instructor-large. Learn how to use them here:

5. Removed NLTK & sentencepiece dependencies: Should allow for a smaller installation & a slightly faster import!

6. Updated documentation: a new Loss Overview section: and more detailed loss functions:
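The prompt templates from item 3 boil down to prepending a configured string to each text before it is embedded. Here is a minimal sketch of that idea. The `prompts` mapping uses "query: " and "passage: " prefixes in the style of the E5 models, and `apply_prompt` is a hypothetical helper, not the library's code.

```python
def apply_prompt(texts, prompts, prompt_name):
    """Prefix every text with the prompt registered under `prompt_name`."""
    prompt = prompts[prompt_name]
    return [prompt + text for text in texts]


# Hypothetical prompt configuration: prompt name -> prompt text.
prompts = {"query": "query: ", "passage": "passage: "}

prepared = apply_prompt(["how do magnets work?"], prompts, prompt_name="query")
```

Storing the prompts with the model means users only need to remember the prompt name, not the exact prefix text a model was trained with.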

And much more! See the full release notes here:

Some more very exciting updates are still on the horizon!
replied to alielfilali01's post 7 months ago

I've been working hard to get my HF Inbox down, but now my emails have started overflowing 🙃

replied to stas's post 7 months ago
replied to lbourdois's post 8 months ago

I did not expect that many datasets to have such notable issues! Very interesting, thanks for sharing.
I would also be interested in the data quality bot that you describe at the end - I think that would be quite useful.

replied to their post 8 months ago
posted an update 8 months ago
Sentence Transformers v2.3.0 has been released! It includes several bug fixes, enhanced model loading including custom models & no more unnecessary file downloads, improved performance, a powerful loss function, and much more!

⬆ Uploading Models to the Hub with save_to_hub.
⬇ Downloading Models from the Hub now downloads only necessary files.
⚙ Custom Models (such as jinaai/jina-embeddings-v2-base-de) can now be loaded with trust_remote_code=True.
🔍 Models can now be loaded at specific revisions (e.g. commit hashes or git branches).
🖥️ Various device fixes; models will now always operate on the device that you specify.
📉 A new "Cached" variant of the powerful Multiple Negatives Ranking Loss allows common hardware to reach performance previously only accessible on multi-gpu clusters.
🐎 Computation time of Community Detection was decreased significantly (7x speedup at 500k sentences 🤯)
🪶 Removed the now unnecessary "torchvision" dependency for a smaller installation.
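To show why batch size matters for the Multiple Negatives Ranking Loss mentioned above, here is a plain-Python sketch of the underlying in-batch negatives objective: every other positive in the batch acts as a negative for a given anchor, so a larger batch means more negatives per anchor. It covers only the base loss and not the caching mechanism; the `scale` value of 20 and the toy similarity matrices are assumptions.

```python
import math


def multiple_negatives_ranking_loss(sim, scale=20.0):
    """sim[i][j] is the similarity between anchor i and positive j.
    The diagonal holds the true pairs; each row is treated as a
    classification problem whose correct class is its own index."""
    total = 0.0
    for i, row in enumerate(sim):
        logits = [scale * s for s in row]
        # Numerically stable log-sum-exp over the row.
        peak = max(logits)
        log_sum_exp = peak + math.log(sum(math.exp(l - peak) for l in logits))
        total += log_sum_exp - logits[i]  # cross-entropy for target i
    return total / len(sim)


# True pairs score highest in `good`, and lowest in `bad` (toy values).
good = [[0.9, 0.1], [0.2, 0.8]]
bad = [[0.1, 0.9], [0.8, 0.2]]

loss_good = multiple_negatives_ranking_loss(good)
loss_bad = multiple_negatives_ranking_loss(bad)
```

The loss is low when each anchor is most similar to its own positive, and it grows when an in-batch negative scores higher.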

Check out the full changelog here:

I'll be working on many more changes in the near future, so expect more exciting updates. If you encounter any issues, or have any questions or feature requests, don't hesitate to open an issue on the repository: