Open Food Facts - Ingredients spellcheck model

When a product is added to the database, all its details, such as allergens, additives, or nutritional values, are either written down by the contributor or automatically extracted from the product pictures using OCR.

However, the information extracted by OCR often contains typos and errors caused by poor-quality pictures: low definition, curved products, light reflections, etc.

To solve this problem, we developed the Ingredients Spellcheck 🍊, a model capable of correcting typos in a list of ingredients following a defined guideline. The model, based on Mistral-7B-v0.3, was fine-tuned on thousands of corrected lists of ingredients extracted from the database.

Model Details

Model Description

The Open Food Facts Ingredients Spellcheck is a version of Mistral-7B-v0.3 fine-tuned on thousands of corrected lists of ingredients extracted from the Open Food Facts (OFF) database.

The training dataset and the evaluation benchmark are available in the Open Food Facts Hugging Face (HF) repository:

Training dataset: https://huggingface.co/datasets/openfoodfacts/spellcheck-dataset
Evaluation benchmark: https://huggingface.co/datasets/openfoodfacts/spellcheck-benchmark

The project is currently in development. You can find it in the Open Food Facts GitHub repo.

A demo of this model is also available in HF Spaces.

Repository: https://github.com/openfoodfacts/openfoodfacts-ai/tree/develop/spellcheck
Demo: https://huggingface.co/spaces/jeremyarancio/ingredients-spellcheck

Uses

This model takes the list of ingredients of a product as input and returns the corrected list.

It follows a spellcheck guideline, which was used to build the training and evaluation datasets. You can find this guideline in the Spellcheck project README.

To respect the training process, the input list of ingredients needs to be embedded into the following prompt:

def prepare_instruction(text: str) -> str:
    """Prepare instruction prompt for fine-tuning and inference.
    Identical to instruction during training.

    Args:
        text (str): List of ingredients

    Returns:
        str: Instruction.
    """
    instruction = (
        "###Correct the list of ingredients:\n"
        + text
        + "\n\n###Correction:\n"
    )
    return instruction
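
As an illustration of how the prompt might be used at inference time, here is a minimal sketch based on the Hugging Face `transformers` library. It rests on several assumptions that the text above does not confirm: that the model is published under the ID `openfoodfacts/spellcheck-mistral-7b`, that the repository holds full weights loadable with `AutoModelForCausalLM` (rather than an adapter), and that greedy decoding with a 512-token budget is suitable. The function name `correct_ingredients` is hypothetical.

```python
def prepare_instruction(text: str) -> str:
    """Same prompt template as above, repeated so this sketch is self-contained."""
    return "###Correct the list of ingredients:\n" + text + "\n\n###Correction:\n"


def correct_ingredients(
    text: str,
    model_id: str = "openfoodfacts/spellcheck-mistral-7b",  # assumed model ID
    max_new_tokens: int = 512,  # assumed generation budget
) -> str:
    """Hypothetical helper: run the spellcheck model on one list of ingredients."""
    # Imported lazily so that prepare_instruction works without transformers installed.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(model_id)

    inputs = tokenizer(prepare_instruction(text), return_tensors="pt")
    # Greedy decoding: a spellchecker should be deterministic.
    output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens, do_sample=False)

    # Drop the prompt tokens and decode only what the model generated.
    prompt_length = inputs["input_ids"].shape[1]
    new_tokens = output_ids[0][prompt_length:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True).strip()
```

Calling `correct_ingredients("farine de blé, sucr, huile de plame")` downloads the weights on first use. A 7B model generally needs a GPU, or a quantized variant, to run at a reasonable speed.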

Training Details

Training information for the model is available in the CometML Experiment Tracker, along with the other experiments.

The model was trained on AWS SageMaker using an ml.g5.2xlarge instance for 3 epochs.

Evaluation

The model is evaluated on the benchmark using a custom evaluation algorithm.

In short, lists of ingredients come in three versions: original, reference, and prediction. Using a sequence alignment algorithm on the original-reference and original-prediction pairs, we can tell which tokens were supposed to be corrected and which ones were actually corrected. This yields a correction precision and recall.

The complete explanation of the algorithm is available in the Spellcheck README.
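
To make the idea concrete, here is a simplified sketch of alignment-based scoring. It is not the project's evaluation code. It assumes whitespace tokenization, uses Python's `difflib.SequenceMatcher` as the alignment algorithm, and treats each edited span of the original as one unit. The function names and the exact definitions of "localisation" (an edit at the same place as the reference) and "correction" (the same edit as the reference) are assumptions made for illustration.

```python
from difflib import SequenceMatcher


def edits(original: list[str], other: list[str]) -> dict[tuple[int, int], tuple[str, ...]]:
    """Map each edited span (start, end) of `original` to the tokens replacing it."""
    matcher = SequenceMatcher(a=original, b=other, autojunk=False)
    return {
        (i1, i2): tuple(other[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    }


def correction_scores(original: str, reference: str, prediction: str) -> dict[str, float]:
    """Compare the model's edits against the reference edits on one example."""
    orig, ref, pred = (s.split() for s in (original, reference, prediction))
    ref_edits = edits(orig, ref)    # what was supposed to be corrected
    pred_edits = edits(orig, pred)  # what the model actually changed

    # Spans edited in both alignments: the model found the right place.
    located = ref_edits.keys() & pred_edits.keys()
    # Spans where the model also produced the reference tokens.
    correct = {span for span in located if ref_edits[span] == pred_edits[span]}

    def ratio(numerator: int, denominator: int) -> float:
        return numerator / denominator if denominator else 0.0

    return {
        "localisation_precision": ratio(len(located), len(pred_edits)),
        "localisation_recall": ratio(len(located), len(ref_edits)),
        "correction_precision": ratio(len(correct), len(pred_edits)),
        "correction_recall": ratio(len(correct), len(ref_edits)),
    }


scores = correction_scores(
    original="farine de blé, sucr, huile de plame",
    reference="farine de blé, sucre, huile de palme",
    prediction="farine de blé, sucre, huile de plame",
)
```

In this example the prediction fixes one of the two typos, so both precision values are 1.0 and both recall values are 0.5. A benchmark-level score would aggregate these counts over all examples before dividing.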

Metrics:

Correction precision: 0.67
Correction recall: 0.62
Localisation precision: 0.75
Localisation recall: 0.69

Additional links:

Open Food Facts website: https://world.openfoodfacts.org/discover
Open Food Facts GitHub: https://github.com/openfoodfacts
