Paper (at HF): https://huggingface.co/papers/2406.09206
Paper (in the ACL Anthology): https://aclanthology.org/2024.emnlp-main.669/
Code: https://github.com/chschroeder/self-training-for-sample-efficient-active-learning
Christopher Schröder
In this work, we leverage self-training in an active learning loop in order to train small language models with even less data. Hope to see you there!
Hard negatives are texts that are rather similar to some anchor text (e.g. a query), but are not the correct match. They're difficult for a model to distinguish from the correct answer, often resulting in a stronger model after training.
mine_hard_negatives
docs: https://sbert.net/docs/package_reference/util.html#sentence_transformers.util.mine_hard_negatives
Beyond that, this release removes the numpy<2 restriction from v3.1.0. This was previously required for Windows because not all third-party libraries had been updated to support numpy v2. With Sentence Transformers, you can now choose v1 or v2 of numpy.
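Conceptually, the utility embeds the anchors and the corpus, then keeps, for each anchor, the corpus texts that score closest to it without being the known positive. A toy numpy sketch of that selection step (not the library's implementation, which adds FAISS support, score margins, and cross-encoder rescoring):

```python
import numpy as np

def mine_hard_negatives_sketch(anchor_embs, corpus_embs, positive_idx, num_negatives=2):
    """For each anchor, return indices of the most similar corpus texts
    that are NOT the known positive (a toy stand-in for the real utility)."""
    # Normalize so dot products are cosine similarities.
    a = anchor_embs / np.linalg.norm(anchor_embs, axis=1, keepdims=True)
    c = corpus_embs / np.linalg.norm(corpus_embs, axis=1, keepdims=True)
    sims = a @ c.T                                # (num_anchors, corpus_size)
    negatives = []
    for i, pos in enumerate(positive_idx):
        order = np.argsort(-sims[i])              # most similar first
        order = order[order != pos]               # drop the true positive
        negatives.append(order[:num_negatives].tolist())
    return negatives
```

The real call is `sentence_transformers.util.mine_hard_negatives(dataset, model, ...)`; see the docs link above.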
Check out the full release notes here: https://github.com/UKPLab/sentence-transformers/releases/tag/v3.1.1
I'm looking forward to releasing v3.2; I have some exciting things planned!
Did not know text-splitter yet, thanks!
Mine are:
- https://github.com/benbrandt/text-splitter (Rust/Python, battle-tested, Wasm version coming soon)
- https://github.com/umarbutler/semchunk (Python, really performant but some issues with huge docs)
I tried the huge Jina AI regex, but it failed on my (admittedly messy) documents, e.g. from EUR-Lex. Their free segmenter API is really cool but unfortunately times out on my huge docs (~100 pages): https://jina.ai/segmenter/
Also, I tried to write a Vanilla JS chunker with a simple, adjustable hierarchical logic (inspired by the above). I think it does a decent job for the few lines of code: https://do-me.github.io/js-text-chunker/
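The hierarchical idea fits in a few lines in Python too. A hypothetical sketch (the separator list and length limit are arbitrary choices, and this version drops the separators themselves): try the largest separator first, and re-split any piece that is still over the limit with the next smaller one, down to a hard character cut.

```python
def hierarchical_chunk(text, max_len=200, separators=("\n\n", "\n", ". ", " ")):
    """Split text into chunks of at most max_len characters, preferring
    the largest separator that keeps pieces under the limit."""
    if len(text) <= max_len:
        return [text] if text else []
    if not separators:
        # No separators left: hard cut every max_len characters.
        return [text[i:i + max_len] for i in range(0, len(text), max_len)]
    sep, rest = separators[0], separators[1:]
    chunks = []
    for part in text.split(sep):
        if len(part) <= max_len:
            if part:
                chunks.append(part)
        else:
            # Piece still too long: retry with the next smaller separator.
            chunks.extend(hierarchical_chunk(part, max_len, rest))
    return chunks
```

Real chunkers (like the libraries above) additionally preserve separators and handle overlap, but the core recursion looks like this.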
Happy to hear your thoughts!
Please try our model (Colab Quickstart available) and let us know what you think:
iiiorg/piiranha-v1-detect-personal-information
Hard Negatives Mining Utility: Hard negatives are texts that are rather similar to some anchor text (e.g. a question), but are not the correct match. They're difficult for a model to distinguish from the correct answer, often resulting in a stronger model after training.
New loss function: This loss function works very well for symmetric tasks (e.g. clustering, classification, finding similar texts/paraphrases) and a bit less so for asymmetric tasks (e.g. question-answer retrieval).
Streaming datasets: You can now train with a datasets.IterableDataset, which doesn't require downloading the full dataset to disk before training. As simple as "streaming=True" in your "datasets.load_dataset".
Custom Modules: Model authors can now customize a lot more of the components that make up Sentence Transformer models, allowing for a lot more flexibility (e.g. multi-modal, model-specific quirks, etc.)
New arguments to several methods: encode_multi_process gets a progress bar, push_to_hub can now be done to different branches, and CrossEncoders can be downloaded to specific cache directories.
Bug fixes: Too many to name here, check out the release notes!
Documentation: A particular focus on clarifying the batch samplers in the Package Reference this release.
Check out the full release notes here: https://github.com/UKPLab/sentence-transformers/releases/tag/v3.1.0
I'm very excited to hear your feedback, and I'm looking forward to the future changes that I have planned, such as ONNX inference! I'm also open to suggestions for new features: feel free to send me your ideas.
This bold claim is not my opinion; it was made in a recent "report" by a group whose stance is evident from its name, which roughly translates to "Authors' Rights Initiative". According to the LinkedIn post below, the report was also presented before the EU Parliament.
I am not really interested in politics, but as an EU citizen I am of course somewhat interested in a reasonable and practical version of the EU AI Act. I am not saying there should be no rules around data and AI, but this report is obviously very biased towards one side.
While I think the report itself does not deserve attention, I post it in the hope that you find more examples where they did not address the issue adequately. Feel free to add to my LinkedIn posts (where the original authors will see it) or here.
[en] Executive summary: https://urheber.info/media/pages/diskurs/ai-training-is-copyright-infringement/3b900058e6-1725460935/executive-summary_engl_final_29-08-2024.pdf
[de] Full report: https://papers.ssrn.com/sol3/papers.cfm?abstract_id=4946214
LinkedIn: https://www.linkedin.com/posts/activity-7238912869268959232-6cFx
What feature or improvement would make the biggest impact on Hugging Face?
Whether it's the Hub, better documentation, new integrations, or something completely different, we're all ears!
Your feedback shapes the future of Hugging Face. Drop your ideas in the comments below!
LIGER "is a [Hugging Face-compatible] collection of Triton kernels designed specifically for LLM training. It can effectively increase multi-GPU training throughput by 20% and reduces memory usage by 60%."
GitHub: https://github.com/linkedin/Liger-Kernel
- Demo: webml-community/phi-3.5-webgpu
- Source code: https://github.com/huggingface/transformers.js-examples/tree/main/phi-3.5-webgpu
Apparently, some papers from ACL 2024 are still not listed in the ACL Anthology. While this issue will hopefully be fixed soon, we should give those papers some additional spotlight.
Some of my favorites:
1. Dolma is an English corpus that encompasses 3 trillion tokens. Additionally, it is accompanied by an exceptional software package that considerably advances the state of the art in preparing data for LLM pretraining. (Source: I am currently using Dolma.)
Dolma: an Open Corpus of Three Trillion Tokens for Language Model Pretraining Research (2402.00159)
2. In the paper "Same Task, More Tokens: the Impact of Input Length on the Reasoning Performance of Large Language Models", the authors show how extending the context length impacts an LLM's reasoning performance. I asked myself a similar question a few months ago, which makes this paper highly interesting to me.
Same Task, More Tokens: the Impact of Input Length on the Reasoning Performance of Large Language Models (2402.14848)
This was brought to my attention through a LinkedIn post by @ShayeghB, who is also affected:
Ensemble-Based Unsupervised Discontinuous Constituency Parsing by Tree Averaging (2403.00143)
View all the missing papers here:
https://theshayegh.github.io/ACL2024MissingPapers/
I want to start by expressing my appreciation for the incredible work Hugging Face has done for the open-source community. Your contributions have been invaluable, and I'm grateful for the tools and resources you've provided.
Please take the following as constructive feedback. I wouldn't have mentioned these points if you hadn't asked, and I hope they can be seen as suggestions for further improvement.
Software quality: When I first started using transformers, I was thoroughly impressed. The basic "hello world" examples work wonderfully, making the initial experience smooth and enjoyable. However, nowadays I regularly dive deeper into the library, and I regularly face challenges such as long-standing bugs, undocumented issues, missing API documentation, and occasionally broken functionality. I am only guessing here, but I think the majority of these repos are written by research engineers or researchers, whose focus might be more on methodological correctness (which is of course crucial as well). That said, it might be helpful to include someone who is stronger in software development and less knowledgeable in ML. This person would be the first to complain about "clean code" issues, and also the first to notice problems with the software.
Posts: Great feature! However, it could be enhanced by adding basic text formatting options. This would make posts more visually appealing and easier to read.
Papers: Restricting this to arXiv is too limiting. While I understand the rationale in terms of implementation effort, if the goal is to be the "GitHub of ML/AI," it might be worth considering support for at least the high-ranking conferences (or a subset thereof). In many cases, the conference version of a paper supersedes the arXiv version, and this restriction may inadvertently encourage the use of preprints over the finalized versions.
Again, these are just my personal pain points, and I'm sharing them with the intention of helping Hugging Face continue to improve.
An exciting new example in the world of AI-assisted journalism! The Post has developed an internal tool called "Hayatacker" that's enhancing in-depth reporting. Here's why it matters:
What it does:
• Extracts stills from video files
• Processes on-screen text
• Labels objects in images
First big project:
Analyzed 745 Republican campaign ads on immigration (Jan-Jun 2024)
Human-AI collaboration:
• AI extracts and organizes data
• Reporters verify and analyze findings
Thorough approach:
• Manual review of all 745 ads
• Reverse image searches when context is lacking
• Cross-referencing with AdImpact transcripts
Key insight from WaPo's Senior Editor for AI strategy, Phoebe Connelly:
"The more exciting choice is putting AI in the hands of reporters early on in the process."
This tool showcases how AI can augment journalistic capabilities without replacing human insight and verification. It's a powerful example of technology enhancing, not replacing, traditional reporting skills.
Read the full article and the methodology: https://www.washingtonpost.com/elections/interactive/2024/republican-campaign-ads-immigration-border-security/
The new release contains some smaller bugfixes. Check it out!
GitHub: https://github.com/webis-de/small-text
Paper: Small-Text: Active Learning for Text Classification in Python (2107.10314)
Find out by playing Fake Insects, a game where you need to identify which insects are fake (AI-generated). Good luck & share your best score in the comments!
victor/fake-insects
This small revolution includes:
• You can now integrate with the Hugging Face Hub and get started in under five minutes.
• A single Dataset class is now designed to handle multiple tasks.
• It's 100 times simpler to configure your dataset now with the new SDK!
• The documentation has been revamped to be cleaner and more user-friendly.
• A new feature automates splitting annotation tasks among a team.
• The layout has been made more flexible to accommodate many use cases.
Check out the release highlights for more details: https://github.com/argilla-io/argilla/releases/tag/v2.0.0
The new version provides a small-text-compatible implementation of the recent AnchorAL strategy by @pietrolesci.
GitHub: https://github.com/webis-de/small-text
Paper: https://aclanthology.org/2023.eacl-demo.11/
AnchorAL: Computationally Efficient Active Learning for Large and Imbalanced Datasets (2404.05623)
1️⃣ Training Refactor
Embedding models can now be trained using an extensive trainer with a lot of powerful features:
- MultiGPU Training (Data Parallelism (DP) and Distributed Data Parallelism (DDP))
- bf16 training support; loss logging
- Evaluation datasets + evaluation loss
- Improved callback support + an excellent Weights & Biases integration
- Gradient checkpointing, gradient accumulation
- Improved model card generation
- Resuming from a training checkpoint without performance loss
- Hyperparameter Optimization
and much more!
Read my detailed blog post to learn about the components that make up this new training approach: https://huggingface.co/blog/train-sentence-transformers
2️⃣ Similarity Score
Not sure how to compare embeddings? Don't worry, you can now use model.similarity(embeddings1, embeddings2) and you'll get your similarity scores immediately. Model authors can specify their desired similarity score, so you don't have to worry about it anymore!

3️⃣ Additional Kwargs
Sentence Transformers relies on various Transformers instances (AutoModel, AutoTokenizer, AutoConfig), but it was hard to pass useful keyword arguments to these (like 'torch_dtype=torch.bfloat16' to load a model at lower precision for a 2x inference speedup). This is now easy!
4️⃣ Hyperparameter Optimization
Sentence Transformers now ships with HPO, allowing you to effectively choose your hyperparameters for your data and task.
5️⃣ Dataset Release
To help you out with finetuning models, I've released 50+ ready-to-go datasets that can be used with training or finetuning embedding models: sentence-transformers/embedding-model-datasets-6644d7a3673a511914aa7552
Full release notes: https://github.com/UKPLab/sentence-transformers/releases/tag/v3.0.0