Christopher Schröder

cschroeder

AI & ML interests

NLP, Active Learning, Text Representations, PyTorch

Recent Activity

replied to their post 11 days ago
upvoted a paper 11 days ago
updated a collection (Active Learning Papers) 11 days ago

Organizations

cschroeder's activity

posted an update 11 days ago
#EMNLP2024 is happening soon! Unfortunately, I will not be on site, but I will present our poster virtually on Wednesday, Nov 13 (7:45 EST / 13:45 CET) in Virtual Poster Session 2.

In this work, we leverage self-training in an active learning loop in order to train small language models with even less data. Hope to see you there!
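For readers new to the setup, here is a rough, library-agnostic sketch of the general idea of combining self-training with active learning; the toy data, thresholds, and model below are purely illustrative and are not the method from our paper.

```python
# Rough sketch: self-training inside an active learning loop (illustrative only).
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 32))                # stand-in for text embeddings
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)  # hidden "oracle" labels

labeled = set(rng.choice(len(X), size=20, replace=False).tolist())  # small seed set
clf = LogisticRegression()

for _ in range(5):  # active learning rounds
    unlabeled = np.array(sorted(set(range(len(X))) - labeled))
    proba = clf.fit(X[list(labeled)], y[list(labeled)]).predict_proba(X[unlabeled])

    # Self-training: pseudo-label the most confident unlabeled examples and
    # retrain on oracle labels plus pseudo-labels for this round.
    mask = proba.max(axis=1) > 0.95
    clf.fit(np.vstack([X[list(labeled)], X[unlabeled[mask]]]),
            np.concatenate([y[list(labeled)], proba[mask].argmax(axis=1)]))

    # Active learning: request oracle labels for the most uncertain examples.
    queried = unlabeled[np.argsort(proba.max(axis=1))[:10]]
    labeled |= set(queried.tolist())
```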
reacted to tomaarsen's post with 🔥 2 months ago
I've just shipped the Sentence Transformers v3.1.1 patch release, fixing the hard negatives mining utility for some models. This utility is extremely useful to get more performance out of your embedding training data.

⛏ Hard negatives are texts that are rather similar to some anchor text (e.g. a query), but are not the correct match. They're difficult for a model to distinguish from the correct answer, often resulting in a stronger model after training.
mine_hard_negatives docs: https://sbert.net/docs/package_reference/util.html#sentence_transformers.util.mine_hard_negatives
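A rough usage sketch (the dataset and parameter values here are only illustrative assumptions, not a recommendation):

```python
# Hedged sketch: mine hard negatives for an existing (anchor, positive) pairs dataset.
from datasets import load_dataset
from sentence_transformers import SentenceTransformer
from sentence_transformers.util import mine_hard_negatives

model = SentenceTransformer("all-MiniLM-L6-v2")
# Example (query, answer) pairs; any two-column pairs dataset should work.
pairs = load_dataset("sentence-transformers/natural-questions", split="train[:1000]")

triplets = mine_hard_negatives(
    pairs,
    model,
    num_negatives=3,          # negatives to mine per pair
    margin=0.1,               # a negative must score at least 0.1 below the positive
    sampling_strategy="top",  # take the hardest candidates that pass the filters
)
print(triplets)
```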

🔓 Beyond that, this release removes the numpy<2 restriction from v3.1.0. This was previously required for Windows as not all third-party libraries were updated to support numpy v2. With Sentence Transformers, you can now choose v1 or v2 of numpy.

Check out the full release notes here: https://github.com/UKPLab/sentence-transformers/releases/tag/v3.1.1

I'm looking forward to releasing v3.2, I have some exciting things planned 🚀
replied to do-me's post 2 months ago

Did not know text-splitter yet, thanks!

reacted to do-me's post with 👀 2 months ago
What are your favorite text chunkers/splitters?
Mine are:
- https://github.com/benbrandt/text-splitter (Rust/Python, battle-tested, Wasm version coming soon)
- https://github.com/umarbutler/semchunk (Python, really performant but some issues with huge docs)

I tried the huge Jina AI regex, but it failed for my (admittedly messy) documents, e.g. from EUR-LEX. Their free segmenter API is really cool but unfortunately times out on my huge docs (~100 pages): https://jina.ai/segmenter/

Also, I tried to write a Vanilla JS chunker with a simple, adjustable hierarchical logic (inspired from the above). I think it does a decent job for the few lines of code: https://do-me.github.io/js-text-chunker/

Happy to hear your thoughts!
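For illustration, here is a rough, dependency-free Python sketch of the kind of simple, adjustable hierarchical splitting logic described above (the separators and sizes are arbitrary and not any particular library's logic):

```python
# Toy hierarchical chunker: split on the coarsest separator first and only
# recurse to finer separators when a piece is still too large.
SEPARATORS = ["\n\n", "\n", ". ", " "]

def chunk(text: str, max_chars: int = 500, level: int = 0) -> list[str]:
    # Base case: small enough, or no finer separator left (may stay oversized).
    if len(text) <= max_chars or level >= len(SEPARATORS):
        return [text]
    sep = SEPARATORS[level]
    chunks, current = [], ""
    for piece in text.split(sep):
        candidate = f"{current}{sep}{piece}" if current else piece
        if len(candidate) <= max_chars:
            current = candidate            # keep packing pieces into the current chunk
            continue
        if current:
            chunks.append(current)
        if len(piece) > max_chars:         # piece alone is too big: go one level finer
            chunks.extend(chunk(piece, max_chars, level + 1))
            current = ""
        else:
            current = piece
    if current:
        chunks.append(current)
    return chunks

sample = ("One sentence. " * 40) + "\n\n" + ("Another sentence. " * 40)
print([len(c) for c in chunk(sample, max_chars=200)])
```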
reacted to gaodrew's post with 🔥 2 months ago
We used the Hugging Face Trainer to fine-tune DeBERTa-v3-base for Personally Identifiable Information (PII) detection, achieving 99.44% overall accuracy (98.27% recall for PII detection).

Please try our model (Colab Quickstart available) and let us know what you think:
iiiorg/piiranha-v1-detect-personal-information
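A minimal way to try it, assuming the checkpoint is published as a standard token-classification model (the example text is made up):

```python
# Hedged quickstart sketch for the PII detector via the Transformers pipeline API.
from transformers import pipeline

pii_detector = pipeline(
    "token-classification",
    model="iiiorg/piiranha-v1-detect-personal-information",
    aggregation_strategy="simple",  # merge sub-word tokens into whole spans
)

text = "Hi, I'm Jane Doe. Reach me at jane.doe@example.com or +1 555 0100."
for span in pii_detector(text):
    print(span["entity_group"], span["word"], round(span["score"], 3))
```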
reacted to tomaarsen's post with 🔥 2 months ago
🚀 Sentence Transformers v3.1 is out! Featuring a hard negatives mining utility to get better models out of your data, a new strong loss function, training with streaming datasets, custom modules, bug fixes, small additions and docs changes. Here are the details:

⛏ Hard Negatives Mining Utility: Hard negatives are texts that are rather similar to some anchor text (e.g. a question), but are not the correct match. They're difficult for a model to distinguish from the correct answer, often resulting in a stronger model after training.
📉 New loss function: This loss function works very well for symmetric tasks (e.g. clustering, classification, finding similar texts/paraphrases) and a bit less so for asymmetric tasks (e.g. question-answer retrieval).
💾 Streaming datasets: You can now train with the datasets.IterableDataset, which doesn't require downloading the full dataset to disk before training. It's as simple as passing "streaming=True" to "datasets.load_dataset" (a short sketch follows at the end of this post).
🧩 Custom Modules: Model authors can now customize a lot more of the components that make up Sentence Transformer models, allowing for a lot more flexibility (e.g. multi-modal, model-specific quirks, etc.)
✨ New arguments to several methods: encode_multi_process gets a progress bar, push_to_hub can now be done to different branches, and CrossEncoders can be downloaded to specific cache directories.
πŸ› Bug fixes: Too many to name here, check out the release notes!
πŸ“ Documentation: A particular focus on clarifying the batch samplers in the Package Reference this release.

Check out the full release notes here ⭐: https://github.com/UKPLab/sentence-transformers/releases/tag/v3.1.0

I'm very excited to hear your feedback, and I'm looking forward to the future changes that I have planned, such as ONNX inference! I'm also open to suggestions for new features: feel free to send me your ideas.
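As referenced above, a quick sketch of the streaming feature (the dataset name is just an example from the Hub):

```python
# Minimal sketch: stream a training dataset instead of downloading it to disk.
from datasets import load_dataset

train_dataset = load_dataset(
    "sentence-transformers/all-nli", "triplet", split="train", streaming=True
)
print(next(iter(train_dataset)))  # one (anchor, positive, negative) example
# The resulting IterableDataset can then be passed to the trainer as usual.
```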
posted an update 2 months ago
⚖️ AI Training Is Copyright Infringement

This bold claim is not my opinion; it is made in a recent "report" by a group whose stance is evident from its name, which roughly translates to "Authors' Rights Initiative". According to the LinkedIn post below, the report was also presented before the EU Parliament.

I am not really interested in politics, but as an EU citizen I am of course somewhat interested in a reasonable and practical version of the EU AI Act. I am not saying there should be no rules around data and AI, but this report is clearly very one-sided.

While I think the report itself does not deserve attention, I am posting it in the hope that you will find more examples where the issue is not addressed adequately. Feel free to add them to my LinkedIn post (where the original authors will see it) or here.

[en] Executive summary: https://urheber.info/media/pages/diskurs/ai-training-is-copyright-infringement/3b900058e6-1725460935/executive-summary_engl_final_29-08-2024.pdf
[de] Full report: https://papers.ssrn.com/sol3/papers.cfm?abstract_id=4946214

LinkedIn: https://www.linkedin.com/posts/activity-7238912869268959232-6cFx

reacted to victor's post with 🔥 3 months ago
🙋 Calling all Hugging Face users! We want to hear from YOU!

What feature or improvement would make the biggest impact on Hugging Face?

Whether it's the Hub, better documentation, new integrations, or something completely different – we're all ears!

Your feedback shapes the future of Hugging Face. Drop your ideas in the comments below! 👇
posted an update 3 months ago
🌟 Liger Kernel: Efficient Triton Kernels for LLM Training

LIGER "is a [Hugging Face-compatible] collection of Triton kernels designed specifically for LLM training. It can effectively increase multi-GPU training throughput by 20% and reduce memory usage by 60%."

GitHub: https://github.com/linkedin/Liger-Kernel
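A hedged usage sketch based on the project README (it requires a CUDA GPU, and the exact patching helpers may differ between Liger Kernel versions):

```python
# Sketch: patch a supported architecture (here Llama) with Liger's Triton kernels,
# then load and train the model with the usual Hugging Face tooling.
import torch
from transformers import AutoModelForCausalLM
from liger_kernel.transformers import apply_liger_kernel_to_llama

apply_liger_kernel_to_llama()  # swaps in fused RMSNorm, RoPE, SwiGLU, cross-entropy, ...

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-8B", torch_dtype=torch.bfloat16, device_map="auto"
)
# Training then proceeds with the standard Trainer / TRL setup.
```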
reacted to Xenova's post with 🔥 3 months ago
I can't believe this... Phi-3.5-mini (3.8B) running in-browser at ~90 tokens/second on WebGPU w/ Transformers.js and ONNX Runtime Web! 🤯 Since everything runs 100% locally, no messages are sent to a server — a huge win for privacy!
- 🤗 Demo: webml-community/phi-3.5-webgpu
- 🧑‍💻 Source code: https://github.com/huggingface/transformers.js-examples/tree/main/phi-3.5-webgpu
posted an update 3 months ago
📄 ACL 2024: The Missing Papers

Apparently, some papers from ACL 2024 are still not listed in the ACL Anthology. While this issue will hopefully be fixed soon, we should give those papers some additional spotlight.

Some of my favorites:

1. Dolma is an English corpus that encompasses 3 trillion tokens. Additionally, it is accompanied by an exceptional software package that considerably advances the state of the art in preparing data for LLM pretraining. (Source: I am currently using Dolma.)
Dolma: an Open Corpus of Three Trillion Tokens for Language Model Pretraining Research (2402.00159)

2. In the paper "Same Task, More Tokens: the Impact of Input Length on the Reasoning Performance of Large Language Models", the authors show how extending the input length impacts an LLM's reasoning performance. I asked myself a similar question a few months ago, so this paper is highly interesting to me.
Same Task, More Tokens: the Impact of Input Length on the Reasoning Performance of Large Language Models (2402.14848)

This was brought to my attention through a LinkedIn post by @ShayeghB, who is also affected:
Ensemble-Based Unsupervised Discontinuous Constituency Parsing by Tree Averaging (2403.00143)

View all the missing papers here:
https://theshayegh.github.io/ACL2024MissingPapers/
replied to victor's post 3 months ago

I want to start by expressing my appreciation for the incredible work Hugging Face has done for the open-source community. Your contributions have been invaluable, and I’m grateful for the tools and resources you've provided.

Please take the following as constructive feedback. I wouldn’t have mentioned these points if you hadn’t asked, and I hope they can be seen as suggestions for further improvement.

  • Software quality: When I first started using transformers, I was thoroughly impressed. The basic "hello world" examples work wonderfully, making the initial experience smooth and enjoyable. Nowadays, however, I regularly dive deeper into the library, and I keep running into challenges such as long-standing bugs, undocumented issues, missing API documentation, and occasionally broken functionality. I am only guessing here, but I think the majority of these repos are written by research engineers or researchers, whose focus might be more on methodological correctness (which is of course crucial as well). That said, it might be helpful to include someone who is stronger in software development and less knowledgeable in ML. This person would be the first to complain about "clean code" issues and the first to notice problems with the software.

  • Posts: Great feature! However, it could be enhanced by adding basic text formatting options. This would make posts more visually appealing and easier to read.

  • Papers: Restricting this to arXiv is too limiting. While I understand the rationale in terms of implementation effort, if the goal is to be the "GitHub of ML/AI," it might be worth considering support for at least the high-ranking conferences (or a subset thereof). In many cases, the conference version of a paper supersedes the arXiv version, and this restriction may inadvertently encourage the use of preprints over the finalized versions.

Again, these are just my personal pain points, and I’m sharing them with the intention of helping Hugging Face continue to improve.

reacted to fdaudens's post with 🔥 3 months ago
🚀 How The Washington Post Uses AI to Empower Journalists 🔍📰

An exciting new example in the world of AI-assisted journalism! The Post has developed an internal tool called "Hayatacker" that's enhancing in-depth reporting. Here's why it matters:

🎥 What it does:
• Extracts stills from video files
• Processes on-screen text
• Labels objects in images

🗳️ First big project:
Analyzed 745 Republican campaign ads on immigration (Jan-Jun 2024)

🤝 Human-AI collaboration:
• AI extracts and organizes data
• Reporters verify and analyze findings

🔎 Thorough approach:
• Manual review of all 745 ads
• Reverse image searches when context is lacking
• Cross-referencing with AdImpact transcripts

💡 Key insight from WaPo's Senior Editor for AI strategy Phoebe Connelly:
"The more exciting choice is putting AI in the hands of reporters early on in the process."

This tool showcases how AI can augment journalistic capabilities without replacing human insight and verification. It's a powerful example of technology enhancing, not replacing, traditional reporting skills.

👉 Read the full article and the methodology: https://www.washingtonpost.com/elections/interactive/2024/republican-campaign-ads-immigration-border-security/
posted an update 3 months ago
reacted to victor's post with 👍 3 months ago
How good are you at spotting AI-generated images?

Find out by playing Fake Insects 🐞, a game where you need to identify which insects are fake (AI-generated). Good luck & share your best score in the comments!

victor/fake-insects
reacted to Ameeeee's post with 🔥 4 months ago
❤️‍🔥 Just released version 2.0 of Argilla!

This small revolution includes:

🔌 You can now integrate with the Hugging Face Hub and get started in under five minutes.
🪂 A single Dataset class is now designed to handle multiple tasks.
🔧 It's 100 times simpler to configure your dataset now with the new SDK (see the sketch after this post)!
📖 The documentation has been revamped to be cleaner and more user-friendly.
🍌 A new feature automates splitting annotation tasks among a team.
✍️ The layout has been made more flexible to accommodate many use cases.

Check out the release highlights for more details: https://github.com/argilla-io/argilla/releases/tag/v2.0.0
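As referenced above, a hedged sketch of what dataset configuration looks like with the new SDK (the server URL, API key, and field/question names are placeholders, and details may differ from the actual 2.0 API):

```python
# Hedged sketch of the Argilla 2.0 SDK: connect, define settings, create a dataset.
import argilla as rg

client = rg.Argilla(api_url="https://<your-argilla-server>", api_key="<your-api-key>")

settings = rg.Settings(
    fields=[rg.TextField(name="text")],
    questions=[rg.LabelQuestion(name="sentiment", labels=["positive", "negative"])],
)
dataset = rg.Dataset(name="demo-dataset", settings=settings, client=client)
dataset.create()
```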
posted an update 6 months ago
reacted to tomaarsen's post with 🔥 6 months ago
‼️ Sentence Transformers v3.0 is out! You can now train and finetune embedding models with multi-GPU training, bf16 support, loss logging, callbacks & much more. I'm also releasing 50+ datasets to train on.

1️⃣ Training Refactor
Embedding models can now be trained using an extensive trainer with a lot of powerful features:
- MultiGPU Training (Data Parallelism (DP) and Distributed Data Parallelism (DDP))
- bf16 training support; loss logging
- Evaluation datasets + evaluation loss
- Improved callback support + an excellent Weights & Biases integration
- Gradient checkpointing, gradient accumulation
- Improved model card generation
- Resuming from a training checkpoint without performance loss
- Hyperparameter Optimization
and much more!
Read my detailed blogpost to learn about the components that make up this new training approach: https://huggingface.co/blog/train-sentence-transformers
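A condensed, hedged sketch of the new training flow (the base model and dataset are just examples; see the blog post above for the full recipe):

```python
# Minimal sketch of the v3.0 training API.
from datasets import load_dataset
from sentence_transformers import SentenceTransformer, SentenceTransformerTrainer, losses

model = SentenceTransformer("microsoft/mpnet-base")
train_dataset = load_dataset("sentence-transformers/all-nli", "triplet", split="train[:10000]")
loss = losses.MultipleNegativesRankingLoss(model)  # in-batch negatives over triplets

trainer = SentenceTransformerTrainer(model=model, train_dataset=train_dataset, loss=loss)
trainer.train()
```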

2️⃣ Similarity Score
Not sure how to compare embeddings? Don't worry, you can now use model.similarity(embeddings1, embeddings2) and you'll get your similarity scores immediately. Model authors can specify their desired similarity score, so you don't have to worry about it anymore!
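For example (a minimal sketch; the model name is just an example):

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")
embeddings1 = model.encode(["The weather is lovely today.", "It is raining."])
embeddings2 = model.encode(["What a beautiful, sunny day!", "Heavy rain is expected."])

# Uses the similarity function chosen by the model author (cosine by default).
print(model.similarity(embeddings1, embeddings2))  # 2x2 matrix of scores
```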

3️⃣ Additional Kwargs
Sentence Transformers relies on various Transformers instances (AutoModel, AutoTokenizer, AutoConfig), but it was hard to provide valuable keyword arguments to these (like 'torch_dtype=torch.bfloat16' to load a model at lower precision for a 2x inference speedup). This is now easy!
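A minimal sketch of what this looks like (assuming the checkpoint's weights can be loaded in bfloat16):

```python
import torch
from sentence_transformers import SentenceTransformer

# model_kwargs is forwarded to the underlying AutoModel.from_pretrained call.
model = SentenceTransformer(
    "all-MiniLM-L6-v2",
    model_kwargs={"torch_dtype": torch.bfloat16},  # load weights in lower precision
)
```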

4️⃣ Hyperparameter Optimization
Sentence Transformers now ships with HPO, allowing you to effectively choose your hyperparameters for your data and task.

5️⃣ Dataset Release
To help you out with finetuning models, I've released 50+ ready-to-go datasets that can be used with training or finetuning embedding models: sentence-transformers/embedding-model-datasets-6644d7a3673a511914aa7552

Full release notes: https://github.com/UKPLab/sentence-transformers/releases/tag/v3.0.0