MANTa-LM (small)

Pretrained MANTa-LM model, as introduced in the paper "MANTa: Efficient Gradient-Based Tokenization for Robust End-to-End Language Modeling".

Model Details

Model Description

The MANTa tokenizer aims to mimic, in a differentiable way, the combination of a subword tokenizer and an embedding matrix found in a classical language model. This trainable tokenizer is added as the first layer of an encoder-decoder model and trained end to end with the language modeling objective.

Our results show that MANTa-LM performs only slightly worse than an equivalent T5 model on the GLUE benchmark while being much more robust to artificial and user-generated noise.

Model Sources

Paper: MANTa: Efficient Gradient-Based Tokenization for Robust End-to-End Language Modeling (EMNLP 2022 Findings)

Uses

Direct Use

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# trust_remote_code=True lets transformers run the custom MANTa code shipped with the checkpoint.
tokenizer = AutoTokenizer.from_pretrained("almanach/manta-lm-small", trust_remote_code=True)
manta_model = AutoModelForSeq2SeqLM.from_pretrained("almanach/manta-lm-small", trust_remote_code=True)

# <extra_id_0> marks the masked span that the model is asked to fill in.
tokens = tokenizer("The name of the capital of France is <extra_id_0> and it is a very big city.", return_tensors="pt")
# Sampling is enabled, so the generated text varies from run to run.
output = manta_model.generate(**tokens, decoder_start_token_id=0, repetition_penalty=1.5, do_sample=True)

print(tokenizer.batch_decode(output))

Recommendations

During fine-tuning, we recommend using a smaller learning rate for the tokenizer module (byte embeddings, frontier predictor, pooler) than for the rest of the model.
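
One common way to apply this recommendation is to give the optimizer two parameter groups with different learning rates. The sketch below is only an illustration: the helper `build_param_groups`, the `tokenizer_module.` prefix, the toy parameter names, and both learning-rate values are assumptions, not names or settings taken from the released checkpoint.

```python
def build_param_groups(named_parameters, tokenizer_prefixes, base_lr, tokenizer_lr):
    """Split (name, parameter) pairs into two optimizer groups by name prefix.

    Parameters whose name starts with one of `tokenizer_prefixes` get
    `tokenizer_lr`; every other parameter gets `base_lr`.
    """
    prefixes = tuple(tokenizer_prefixes)
    tokenizer_params, other_params = [], []
    for name, param in named_parameters:
        if name.startswith(prefixes):
            tokenizer_params.append(param)
        else:
            other_params.append(param)
    return [
        {"params": tokenizer_params, "lr": tokenizer_lr},
        {"params": other_params, "lr": base_lr},
    ]


# Toy stand-in for model.named_parameters(); names and values are hypothetical.
toy_named_parameters = [
    ("tokenizer_module.byte_embeddings.weight", "p0"),
    ("tokenizer_module.frontier_predictor.weight", "p1"),
    ("tokenizer_module.pooler.weight", "p2"),
    ("encoder.block.0.weight", "p3"),
    ("decoder.block.0.weight", "p4"),
]

# Hypothetical learning rates: the tokenizer group uses a 10x smaller value.
param_groups = build_param_groups(
    toy_named_parameters,
    tokenizer_prefixes=["tokenizer_module."],
    base_lr=1e-3,
    tokenizer_lr=1e-4,
)
```

With the real model, one would first inspect `manta_model.named_parameters()` to find the actual prefixes of the tokenizer submodules, then pass the resulting list to a PyTorch optimizer, for example `torch.optim.AdamW(param_groups)`, which accepts per-group `lr` values in this form.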

Training Details

Training Data

This model was trained on the C4 dataset.

Training Procedure

The training objective is the same as that of ByT5, but most hyperparameters are taken from T5.
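
For readers unfamiliar with it, ByT5 is trained with a T5-style span-corruption objective applied to bytes: spans of the input are replaced by sentinel tokens, and the target lists each sentinel followed by the text it hides. The sketch below illustrates only that input/target format. The helper `span_corrupt` is hypothetical, the span is hand-picked rather than sampled, and ASCII characters stand in for bytes; it does not reproduce the actual preprocessing used to train this model.

```python
def span_corrupt(tokens, spans):
    """Replace each (start, end) span with a sentinel and build the matching target.

    `spans` must be non-overlapping half-open index ranges into `tokens`.
    """
    inputs, targets = [], []
    cursor = 0
    for sentinel_id, (start, end) in enumerate(sorted(spans)):
        sentinel = f"<extra_id_{sentinel_id}>"
        inputs.extend(tokens[cursor:start])  # keep the text before the span
        inputs.append(sentinel)              # hide the span behind a sentinel
        targets.append(sentinel)             # the target repeats the sentinel...
        targets.extend(tokens[start:end])    # ...followed by the hidden text
        cursor = end
    inputs.extend(tokens[cursor:])           # keep the text after the last span
    targets.append(f"<extra_id_{len(spans)}>")  # closing sentinel, as in T5
    return inputs, targets


text = "The capital of France is Paris."
# Hand-picked span covering "Paris" (characters 25 to 29).
corrupted_inputs, corrupted_targets = span_corrupt(list(text), [(25, 30)])

model_input = "".join(corrupted_inputs)
model_target = "".join(corrupted_targets)
print(model_input)   # The capital of France is <extra_id_0>.
print(model_target)  # <extra_id_0>Paris<extra_id_1>
```

This is the same sentinel convention as the `<extra_id_0>` placeholder in the Direct Use prompt.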

Citation

BibTeX:

@inproceedings{godey-etal-2022-manta,
    title = "{MANT}a: Efficient Gradient-Based Tokenization for End-to-End Robust Language Modeling",
    author = "Godey, Nathan  and
      Castagn{\'e}, Roman  and
      de la Clergerie, {\'E}ric  and
      Sagot, Beno{\^\i}t",
    booktitle = "Findings of the Association for Computational Linguistics: EMNLP 2022",
    month = dec,
    year = "2022",
    address = "Abu Dhabi, United Arab Emirates",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2022.findings-emnlp.207",
    pages = "2859--2870",
}

Model Card Authors

Nathan Godey, Roman Castagné