Model Card: Donut Model for Ticket Parsing

Model Description

This is a fine-tuned version of the Donut architecture, specifically tailored for parsing retail receipts. Donut is a transformer-based model designed for document understanding, and it performs OCR-free parsing by directly processing images into structured JSON outputs. This implementation was fine-tuned using a custom dataset of artificial and real receipts.

Use Case

This model is intended to be used for parsing receipts into structured data, extracting information such as item names, quantities, prices, taxes, and total amounts directly from image inputs.

Dataset

The model was trained on a mixture of synthetic and real-world receipts:

Artificial Receipts: Generated using a custom tool inspired by SynthDoG and built with OpenCV. The tool simulates various real-world conditions (e.g., Gaussian noise, wrinkles, luminance variations) to enhance the robustness of the model.
Real Receipts: A manually annotated dataset of 704 receipts, of which 200 form the validation set.

Data Creation Process

The artificial receipts were generated using a combination of background images, fonts, and custom templates to mimic real-world conditions, ensuring the model can handle various types of distortions such as noise, wrinkles, and lighting changes. The real receipts were annotated manually using a custom tool based on the Marimo app, which allowed for structured annotation of receipt elements.
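
The noise and lighting distortions described above can be illustrated with a minimal sketch. It is not the author's generator: that tool is built with OpenCV, while this example uses NumPy only to stay dependency-light. The function names (`add_gaussian_noise`, `vary_luminance`, `degrade`) and all parameter ranges are hypothetical, and wrinkle simulation is not covered.

```python
import numpy as np

def add_gaussian_noise(image, sigma=10.0, rng=None):
    """Add zero-mean Gaussian noise to a uint8 image."""
    rng = rng or np.random.default_rng()
    noisy = image.astype(np.float32) + rng.normal(0.0, sigma, image.shape)
    return np.clip(noisy, 0, 255).astype(np.uint8)

def vary_luminance(image, gain=1.0, bias=0.0):
    """Scale and shift pixel intensities: out = gain * in + bias."""
    adjusted = image.astype(np.float32) * gain + bias
    return np.clip(adjusted, 0, 255).astype(np.uint8)

def degrade(image, rng=None):
    """Apply a random luminance change followed by Gaussian noise."""
    rng = rng or np.random.default_rng()
    gain = rng.uniform(0.7, 1.3)    # assumed range, not from the source
    bias = rng.uniform(-20.0, 20.0)  # assumed range, not from the source
    sigma = rng.uniform(2.0, 15.0)   # assumed range, not from the source
    return add_gaussian_noise(vary_luminance(image, gain, bias), sigma, rng)

# A blank white array stands in for a rendered receipt template.
receipt = np.full((64, 32, 3), 255, dtype=np.uint8)
augmented = degrade(receipt, np.random.default_rng(0))
```

Passing a seeded generator, as in the last line, keeps the augmentation reproducible, which helps when comparing training runs.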

Training Details

Hardware: The model was trained using Google Colab Pro.
Training Steps: The model was trained in three main steps of 10 epochs each, totaling 30 epochs.
Loss Function: The model was trained using a combination of Levenshtein edit distance for string similarity and nTED (normalized Tree Edit Distance) for the accuracy of tree-structured outputs.
Performance: The model showed significant improvements when trained with a mix of artificial and real receipts, achieving a validation accuracy of 0.98 and a test accuracy of 0.70.
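
The Levenshtein-based string similarity mentioned under Loss Function can be sketched in plain Python. This is an illustrative scoring function, not the author's training code: normalizing by the length of the longer string is an assumption, the function names are hypothetical, and nTED is not shown because it needs a full tree edit distance algorithm.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions, or substitutions."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]

def normalized_similarity(pred: str, target: str) -> float:
    """Return 1.0 for identical strings and 0.0 for completely different ones."""
    longest = max(len(pred), len(target))
    if longest == 0:
        return 1.0  # two empty strings are treated as identical
    return 1.0 - levenshtein(pred, target) / longest

print(levenshtein("kitten", "sitting"))  # 3
```

A score like this can be computed per field between the predicted and annotated values and then averaged over a receipt.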

Results

The model was tested on both validation and test datasets, achieving the following results:

Validation Accuracy: 98.37% (final fine-tuned model)
Test Accuracy: 69.63% (final fine-tuned model)

Limitations

Synthetic Data: Although artificial receipts helped improve performance, the model may still struggle with unseen or very complex receipt formats that weren't part of the training dataset.
Real-world Deployment: Further fine-tuning might be necessary to adapt the model to new types of receipts or different languages.

Ethical Considerations

Privacy: Care should be taken when using this model on personal or sensitive financial data. Ensure compliance with local privacy laws and regulations.
Bias: The model was trained on a limited set of receipts, which could result in biases toward certain types of stores or receipt formats.

How to Use

This model is available on Hugging Face and can be used as follows:

from transformers import DonutProcessor, VisionEncoderDecoderModel
from PIL import Image
import json
import torch
import re

# Load model and processor
print("Loading Donut model...") 

processor = DonutProcessor.from_pretrained("pandafm/donut-es")
model = VisionEncoderDecoderModel.from_pretrained("pandafm/donut-es")

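# Pick the device; on CPU the encoder runs in bfloat16.
# `device` must be defined in both branches because it is used again below.
if torch.cuda.is_available():
    device = torch.device("cuda")
    model.to(device)
else:
    device = torch.device("cpu")
    model.encoder.to(torch.bfloat16)
print("Donut model loaded.")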

# Open the receipt image and make sure it has three colour channels
image = Image.open("path_to_receipt_image.jpg").convert("RGB")

# Preprocess the image into pixel values for the encoder
pixel_values = processor(image, return_tensors="pt").pixel_values
if torch.cuda.is_available():
    pixel_values = pixel_values.to(device)
else:
    pixel_values = pixel_values.to(torch.bfloat16)

# Build the decoder prompt from the task start token
task_prompt = "<s_cord-v2>"
decoder_input_ids = processor.tokenizer(task_prompt, add_special_tokens=False, return_tensors="pt").input_ids
decoder_input_ids = decoder_input_ids.to(device)

# Autoregressively generate the output token sequence
result = model.generate(
    pixel_values,
    decoder_input_ids=decoder_input_ids,
    max_length=model.decoder.config.max_position_embeddings,
    pad_token_id=processor.tokenizer.pad_token_id,
    eos_token_id=processor.tokenizer.eos_token_id,
    bad_words_ids=[[processor.tokenizer.unk_token_id]],
    return_dict_in_generate=True,
)
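# Decode the generated ids and strip the end-of-sequence and padding tokens
seq = processor.batch_decode(result.sequences)[0]
seq = seq.replace(processor.tokenizer.eos_token, "").replace(processor.tokenizer.pad_token, "")
seq = re.sub(r"<.*?>", "", seq, count=1).strip()  # remove first task start token

# Convert the tag sequence into a Python dict and print it as JSON
parsed = processor.token2json(seq)
print(json.dumps(parsed, indent=2, ensure_ascii=False))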
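
To give an intuition for the final step, the sketch below converts a flat tag sequence into a dictionary. It is a deliberately simplified stand-in for `processor.token2json`, which also handles nested fields and repeated items. The helper name `tags_to_dict` and the field names `store` and `total` are hypothetical and do not describe this model's actual output schema.

```python
import re

def tags_to_dict(sequence: str) -> dict:
    """Convert flat <s_key>value</s_key> pairs into a dict (no nesting, no lists)."""
    pattern = re.compile(r"<s_(\w+)>(.*?)</s_\1>", re.DOTALL)
    return {key: value.strip() for key, value in pattern.findall(sequence)}

# Hypothetical decoder output for illustration only
example = "<s_store>ACME</s_store><s_total>12.50</s_total>"
print(tags_to_dict(example))  # {'store': 'ACME', 'total': '12.50'}
```

Values come back as strings, so numeric fields such as prices or totals need an explicit conversion before any arithmetic.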

Acknowledgements

This model was fine-tuned as part of a research project for a Bachelor's Degree, leveraging the Donut architecture and integrating tools like OpenCV for data generation. The final dataset included both synthetic and real-world receipts to improve robustness in parsing.

Citation

@thesis{pandafm2024DonutES,
  author   = {David Florez Mazuera},
  title    = {Ticket Parser},
  school   = {Universidad de Murcia},
  year     = {2024},
  address  = {Murcia, España},
  month    = {June},
  type     = {Bachelor's thesis},
  note     = {Gines García Mateos},
  url      = {},
  keywords = {donut, transformers, fine-tune},
}
