Paper (at HF): https://huggingface.co/papers/2406.09206
Paper (in the ACL Anthology): https://aclanthology.org/2024.emnlp-main.669/
Code: https://github.com/chschroeder/self-training-for-sample-efficient-active-learning
Christopher Schröder
In this work, we leverage self-training in an active learning loop in order to train small language models with even less data. Hope to see you there!
Hard negatives are texts that are rather similar to some anchor text (e.g. a query), but are not the correct match. They're difficult for a model to distinguish from the correct answer, often resulting in a stronger model after training.
mine_hard_negatives
docs: https://sbert.net/docs/package_reference/util.html#sentence_transformers.util.mine_hard_negatives
Beyond that, this release removes the numpy<2 restriction from v3.1.0. This was previously required for Windows because not all third-party libraries had been updated to support numpy v2. With Sentence Transformers, you can now choose v1 or v2 of numpy.
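Conceptually, the utility embeds the anchors and the corpus, then keeps, for each anchor, the corpus texts that score closest to it without being the known positive. A toy numpy sketch of that selection step (not the library's implementation, which adds FAISS support, score margins, and cross-encoder rescoring):

```python
import numpy as np

def mine_hard_negatives_sketch(anchor_embs, corpus_embs, positive_idx, num_negatives=2):
    """For each anchor, return indices of the most similar corpus texts
    that are NOT the known positive (a toy stand-in for the real utility)."""
    # Normalize so dot products are cosine similarities.
    a = anchor_embs / np.linalg.norm(anchor_embs, axis=1, keepdims=True)
    c = corpus_embs / np.linalg.norm(corpus_embs, axis=1, keepdims=True)
    sims = a @ c.T                                # (num_anchors, corpus_size)
    negatives = []
    for i, pos in enumerate(positive_idx):
        order = np.argsort(-sims[i])              # most similar first
        order = order[order != pos]               # drop the true positive
        negatives.append(order[:num_negatives].tolist())
    return negatives
```

The real call is `sentence_transformers.util.mine_hard_negatives(dataset, model, ...)`; see the docs link above.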
Check out the full release notes here: https://github.com/UKPLab/sentence-transformers/releases/tag/v3.1.1
I'm looking forward to releasing v3.2; I have some exciting things planned!
Did not know text-splitter yet, thanks!
Mine are:
- https://github.com/benbrandt/text-splitter (Rust/Python, battle-tested, Wasm version coming soon)
- https://github.com/umarbutler/semchunk (Python, really performant but some issues with huge docs)
I tried the huge Jina AI regex, but it failed on my (admittedly messy) documents, e.g. from EUR-Lex. Their free segmenter API is really cool but unfortunately times out on my huge docs (~100 pages): https://jina.ai/segmenter/
Also, I tried to write a Vanilla JS chunker with a simple, adjustable hierarchical logic (inspired by the above). I think it does a decent job for the few lines of code: https://do-me.github.io/js-text-chunker/
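The hierarchical idea fits in a few lines in Python too. A hypothetical sketch (the separator list and length limit are arbitrary choices, and this version drops the separators themselves): try the largest separator first, and re-split any piece that is still over the limit with the next smaller one, down to a hard character cut.

```python
def hierarchical_chunk(text, max_len=200, separators=("\n\n", "\n", ". ", " ")):
    """Split text into chunks of at most max_len characters, preferring
    the largest separator that keeps pieces under the limit."""
    if len(text) <= max_len:
        return [text] if text else []
    if not separators:
        # No separators left: hard cut every max_len characters.
        return [text[i:i + max_len] for i in range(0, len(text), max_len)]
    sep, rest = separators[0], separators[1:]
    chunks = []
    for part in text.split(sep):
        if len(part) <= max_len:
            if part:
                chunks.append(part)
        else:
            # Piece still too long: retry with the next smaller separator.
            chunks.extend(hierarchical_chunk(part, max_len, rest))
    return chunks
```

Real chunkers (like the libraries above) additionally preserve separators and handle overlap, but the core recursion looks like this.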
Happy to hear your thoughts!
Please try our model (Colab Quickstart available) and let us know what you think:
iiiorg/piiranha-v1-detect-personal-information
Hard Negatives Mining Utility: Hard negatives are texts that are rather similar to some anchor text (e.g. a question), but are not the correct match. They're difficult for a model to distinguish from the correct answer, often resulting in a stronger model after training.
New loss function: This loss function works very well for symmetric tasks (e.g. clustering, classification, finding similar texts/paraphrases) and a bit less so for asymmetric tasks (e.g. question-answer retrieval).
Streaming datasets: You can now train with a datasets.IterableDataset, which doesn't require downloading the full dataset to disk before training. As simple as "streaming=True" in your "datasets.load_dataset".
Custom Modules: Model authors can now customize a lot more of the components that make up Sentence Transformer models, allowing for a lot more flexibility (e.g. multi-modal, model-specific quirks, etc.)
New arguments to several methods: encode_multi_process gets a progress bar, push_to_hub can now be done to different branches, and CrossEncoders can be downloaded to specific cache directories.
Bug fixes: Too many to name here, check out the release notes!
Documentation: A particular focus on clarifying the batch samplers in the Package Reference this release.
Check out the full release notes here: https://github.com/UKPLab/sentence-transformers/releases/tag/v3.1.0
I'm very excited to hear your feedback, and I'm looking forward to the future changes that I have planned, such as ONNX inference! I'm also open to suggestions for new features: feel free to send me your ideas.
This bold claim is not my opinion; it was made in a recent "report" by a group whose stance is evident from its name, which roughly translates to "Authors' Rights Initiative". According to the LinkedIn post below, the report was also presented before the EU Parliament.
I am not really interested in politics, but as an EU citizen I am of course somewhat interested in a reasonable and practical version of the EU AI Act. I am not saying there should be no rules around data and AI, but this report is obviously very biased towards one side.
While I think the report itself does not deserve attention, I post it in the hope that you find more examples where they did not address the issue adequately. Feel free to add to my LinkedIn posts (where the original authors will see it) or here.
[en] Executive summary: https://urheber.info/media/pages/diskurs/ai-training-is-copyright-infringement/3b900058e6-1725460935/executive-summary_engl_final_29-08-2024.pdf
[de] Full report: https://papers.ssrn.com/sol3/papers.cfm?abstract_id=4946214
LinkedIn: https://www.linkedin.com/posts/activity-7238912869268959232-6cFx
What feature or improvement would make the biggest impact on Hugging Face?
Whether it's the Hub, better documentation, new integrations, or something completely different, we're all ears!
Your feedback shapes the future of Hugging Face. Drop your ideas in the comments below!
LIGER "is a [Hugging Face-compatible] collection of Triton kernels designed specifically for LLM training. It can effectively increase multi-GPU training throughput by 20% and reduces memory usage by 60%."
GitHub: https://github.com/linkedin/Liger-Kernel
- Demo: webml-community/phi-3.5-webgpu
- Source code: https://github.com/huggingface/transformers.js-examples/tree/main/phi-3.5-webgpu
Apparently, some papers from ACL 2024 are still not listed in the ACL Anthology. While this issue will hopefully be fixed soon, we should give those papers some additional spotlight.
Some of my favorites:
1. Dolma is an English corpus that encompasses 3 trillion tokens. Additionally, it is accompanied by an exceptional software package that considerably advances the state of the art in preparing data for LLM pretraining. (Source: I am currently using Dolma.)
Dolma: an Open Corpus of Three Trillion Tokens for Language Model Pretraining Research (2402.00159)
2. In the paper "Same Task, More Tokens: the Impact of Input Length on the Reasoning Performance of Large Language Models", the authors show how extending the context length impacts an LLM's reasoning performance. I asked myself a similar question a few months ago, which makes this paper highly interesting to me.
Same Task, More Tokens: the Impact of Input Length on the Reasoning Performance of Large Language Models (2402.14848)
This was brought to my attention through a LinkedIn post by @ShayeghB, who is also affected:
Ensemble-Based Unsupervised Discontinuous Constituency Parsing by Tree Averaging (2403.00143)
View all the missing papers here:
https://theshayegh.github.io/ACL2024MissingPapers/
I want to start by expressing my appreciation for the incredible work Hugging Face has done for the open-source community. Your contributions have been invaluable, and I'm grateful for the tools and resources you've provided.
Please take the following as constructive feedback. I wouldn't have mentioned these points if you hadn't asked, and I hope they can be seen as suggestions for further improvement.
Software quality: When I first started using transformers, I was thoroughly impressed. The basic "hello world" examples work wonderfully, making the initial experience smooth and enjoyable. However, nowadays I regularly dive deeper into the library, and I regularly face challenges such as long-standing bugs, undocumented issues, missing API documentation, and occasionally broken functionality. I am only guessing here, but I think the majority of these repos are written by research engineers or researchers, whose focus might be more on methodological correctness (which is of course crucial as well). That said, it might be helpful to include someone who is stronger in software development and less knowledgeable in ML. This person would be the first to complain about "clean code" issues, and also the first to notice problems with the software.
Posts: Great feature! However, it could be enhanced by adding basic text formatting options. This would make posts more visually appealing and easier to read.
Papers: Restricting this to arXiv is too limiting. While I understand the rationale in terms of implementation effort, if the goal is to be the "GitHub of ML/AI," it might be worth considering support for at least the high-ranking conferences (or a subset thereof). In many cases, the conference version of a paper supersedes the arXiv version, and this restriction may inadvertently encourage the use of preprints over the finalized versions.
Again, these are just my personal pain points, and I'm sharing them with the intention of helping Hugging Face continue to improve.
An exciting new example in the world of AI-assisted journalism! The Post has developed an internal tool called "Hayatacker" that's enhancing in-depth reporting. Here's why it matters:
What it does:
• Extracts stills from video files
• Processes on-screen text
• Labels objects in images
First big project:
Analyzed 745 Republican campaign ads on immigration (Jan-Jun 2024)
Human-AI collaboration:
• AI extracts and organizes data
• Reporters verify and analyze findings
Thorough approach:
• Manual review of all 745 ads
• Reverse image searches when context is lacking
• Cross-referencing with AdImpact transcripts
Key insight from WaPo's Senior Editor for AI strategy, Phoebe Connelly:
"The more exciting choice is putting AI in the hands of reporters early on in the process."
This tool showcases how AI can augment journalistic capabilities without replacing human insight and verification. It's a powerful example of technology enhancing, not replacing, traditional reporting skills.
Read the full article and the methodology: https://www.washingtonpost.com/elections/interactive/2024/republican-campaign-ads-immigration-border-security/
The new release contains some smaller bugfixes. Check it out!
GitHub: https://github.com/webis-de/small-text
Paper: Small-Text: Active Learning for Text Classification in Python (2107.10314)
Find out by playing Fake Insects, a game where you need to identify which insects are fake (AI-generated). Good luck & share your best score in the comments!
victor/fake-insects
This small revolution includes:
• You can now integrate with the Hugging Face Hub and get started in under five minutes.
• A single Dataset class is now designed to handle multiple tasks.
• It's 100 times simpler to configure your dataset now with the new SDK!
• The documentation has been revamped to be cleaner and more user-friendly.
• A new feature automates splitting annotation tasks among a team.
• The layout has been made more flexible to accommodate many use cases.
Check out the release highlights for more details: https://github.com/argilla-io/argilla/releases/tag/v2.0.0
The new version provides a small-text-compatible implementation of the recent AnchorAL strategy by @pietrolesci.
GitHub: https://github.com/webis-de/small-text
Paper: https://aclanthology.org/2023.eacl-demo.11/
AnchorAL: Computationally Efficient Active Learning for Large and Imbalanced Datasets (2404.05623)
1️⃣ Training Refactor
Embedding models can now be trained using an extensive trainer with a lot of powerful features:
- MultiGPU Training (Data Parallelism (DP) and Distributed Data Parallelism (DDP))
- bf16 training support; loss logging
- Evaluation datasets + evaluation loss
- Improved callback support + an excellent Weights & Biases integration
- Gradient checkpointing, gradient accumulation
- Improved model card generation
- Resuming from a training checkpoint without performance loss
- Hyperparameter Optimization
and much more!
Read my detailed blog post to learn about the components that make up this new training approach: https://huggingface.co/blog/train-sentence-transformers
2️⃣ Similarity Score
Not sure how to compare embeddings? Don't worry, you can now use model.similarity(embeddings1, embeddings2) and you'll get your similarity scores immediately. Model authors can specify their desired similarity score, so you don't have to worry about it anymore!

3️⃣ Additional Kwargs
Sentence Transformers relies on various Transformers instances (AutoModel, AutoTokenizer, AutoConfig), but it was hard to pass useful keyword arguments to these (like 'torch_dtype=torch.bfloat16' to load a model at lower precision for a 2x inference speedup). This is now easy!
4️⃣ Hyperparameter Optimization
Sentence Transformers now ships with HPO, allowing you to effectively choose your hyperparameters for your data and task.
5️⃣ Dataset Release
To help you out with finetuning models, I've released 50+ ready-to-go datasets that can be used with training or finetuning embedding models: sentence-transformers/embedding-model-datasets-6644d7a3673a511914aa7552
Full release notes: https://github.com/UKPLab/sentence-transformers/releases/tag/v3.0.0