de-Rodrigo (de Rodrigo)

A few weeks ago, we uploaded the MERIT Dataset 🎒📃🏆 to Hugging Face 🤗!

Now, we are excited to share the MERIT Dataset paper via arXiv! 📃💫
The MERIT Dataset: Modelling and Efficiently Rendering Interpretable Transcripts (2409.00447)

The MERIT Dataset is a fully synthetic, labeled dataset created for training and benchmarking LLMs on Visually Rich Document Understanding tasks. It is also designed to help detect biases and improve interpretability in LLMs, two areas we are actively working on. 🔧🔨

MERIT contains synthetically rendered student transcripts of records from different schools, in English and Spanish. We plan to expand the dataset to other contexts (synthetic medical/insurance documents, synthetic IDs, etc.). Want to collaborate? Do you have any feedback? 🧐

Resources:

- Dataset: de-Rodrigo/merit
- Code and generation pipeline: https://github.com/nachoDRT/MERIT-Dataset
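
For readers who want to try the dataset, here is a minimal sketch of how it could be loaded with the Hugging Face `datasets` library. Only the repository id `de-Rodrigo/merit` comes from the list above. The helper names, the `config_name` argument, and the `"train"` split are assumptions made for illustration, so list the available configurations first and adjust accordingly.

```python
from typing import Optional

# Repository id taken from the Resources list above.
DATASET_REPO = "de-Rodrigo/merit"


def list_merit_configs():
    """Return the configuration (subset) names published for the dataset.

    Requires `pip install datasets` and network access to the Hub.
    """
    from datasets import get_dataset_config_names  # imported lazily

    return get_dataset_config_names(DATASET_REPO)


def load_merit(config_name: Optional[str] = None,
               split: str = "train",
               streaming: bool = True):
    """Load one split of MERIT (hypothetical helper, not part of the MERIT code).

    `config_name` and `split` are assumptions: pick real values from
    `list_merit_configs()` and from the dataset card. Streaming avoids
    downloading every rendered document up front.
    """
    from datasets import load_dataset  # imported lazily

    return load_dataset(DATASET_REPO, name=config_name,
                        split=split, streaming=streaming)
```

A typical session would call `list_merit_configs()`, pass one of the returned names to `load_merit(...)`, and then inspect the first sample with `next(iter(dataset))` to see which fields each document exposes.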

P.S.: We are grateful to Hugging Face 🤗 for the fantastic tools and resources available on the platform and, more specifically, to @nielsr for sharing the fine-tuning/inference scripts we used in our benchmark.

Papers 1

arxiv:2409.00447

Spaces 1

Saliencies

Models 3

de-Rodrigo/donut-merit

Image-Text-to-Text • Updated Sep 13

de-Rodrigo/donut-cord-v2

Updated Sep 13

de-Rodrigo/idefics2-merit

Datasets 1

de-Rodrigo/merit