FLANEC: Exploring FLAN-T5 for Post-ASR Error Correction

Model Overview

FLANEC is an encoder-decoder model based on FLAN-T5, specifically fine-tuned for post-Automatic Speech Recognition (ASR) error correction, also known as Generative Speech Error Correction (GenSEC). The model utilizes n-best hypotheses from ASR systems to enhance the accuracy and grammaticality of final transcriptions by generating a single corrected output. FLANEC models are trained on diverse subsets of the HyPoradise dataset, leveraging multiple ASR domains to provide robust, scalable error correction across different types of audio data.

FLANEC was developed for the GenSEC Task 1 challenge at SLT 2024; see the challenge website for details.

Model Checkpoints

Cumulative Dataset (CD) Models trained with full fine-tuning:

FLANEC Base CD: Base model with ~250 million parameters, fine-tuned for post-ASR correction on cumulative datasets.
FLANEC Large CD: Large model with ~800 million parameters, fine-tuned for post-ASR correction on cumulative datasets.
FLANEC XL CD: Extra-large model with ~3 billion parameters, fine-tuned for post-ASR correction on cumulative datasets.

Cumulative Dataset (CD) Models trained with Low-Rank Adaptation (LoRA):

FLANEC Base LoRA: Base model with ~250 million parameters, fine-tuned with LoRA on cumulative datasets.
FLANEC Large LoRA: Large model with ~800 million parameters, fine-tuned with LoRA on cumulative datasets.
FLANEC XL LoRA: Extra-large model with ~3 billion parameters, fine-tuned with LoRA on cumulative datasets.
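
To give a rough sense of why the LoRA checkpoints are cheaper to train than the fully fine-tuned ones, the sketch below counts trainable parameters for a single weight matrix. LoRA freezes a weight matrix of shape d_out x d_in and learns a low-rank update B·A, where B has shape d_out x r and A has shape r x d_in. The dimensions and rank used here are illustrative assumptions, not the settings used to train FLANEC, and the function names are hypothetical.

```python
def full_param_count(d_in: int, d_out: int) -> int:
    """Trainable parameters when the whole d_out x d_in matrix is updated."""
    return d_in * d_out


def lora_param_count(d_in: int, d_out: int, rank: int) -> int:
    """Trainable parameters for a LoRA update B @ A.

    A has shape (rank, d_in) and B has shape (d_out, rank);
    the original matrix stays frozen and contributes nothing here.
    """
    return rank * (d_in + d_out)


# Hypothetical example values (not taken from the FLANEC configuration).
d_in, d_out, rank = 1024, 1024, 8
full = full_param_count(d_in, d_out)
lora = lora_param_count(d_in, d_out, rank)
print(f"full: {full}, LoRA: {lora}, ratio: {lora / full:.4f}")
```

The count grows linearly with the rank, so the trainable fraction stays small as long as the rank is much smaller than the matrix dimensions.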

Intended Use

FLANEC is designed for the task of Generative Speech Error Correction (GenSEC). The model is suitable for post-processing ASR outputs to correct grammatical and linguistic errors. The model supports the English language.
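
A minimal inference sketch with the Hugging Face `transformers` library might look like the following. It assumes that the checkpoint is published under the repository name `morenolq/flanec-xl-cd`, that it loads as a standard sequence-to-sequence model, and that the n-best list is passed as a numbered list after a short instruction. The instruction text, the helper names `build_prompt` and `correct`, and the generation settings are illustrative assumptions, so check the model repository for the exact prompt format used in training.

```python
from typing import List

# Assumption: repository identifier as it appears on the model page.
MODEL_NAME = "morenolq/flanec-xl-cd"

# Assumption: illustrative instruction text; the exact training prompt may differ.
INSTRUCTION = (
    "Generate the correct transcription for the following "
    "n-best list of ASR hypotheses:"
)


def build_prompt(hypotheses: List[str]) -> str:
    """Format an n-best list as one prompt with numbered hypotheses."""
    numbered = [f"{i}. {hyp.strip()}" for i, hyp in enumerate(hypotheses, start=1)]
    return INSTRUCTION + "\n\n" + "\n".join(numbered)


def correct(hypotheses: List[str], model_name: str = MODEL_NAME) -> str:
    """Generate a single corrected transcription from an n-best list."""
    # Imported lazily so build_prompt works without transformers installed.
    from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
    inputs = tokenizer(build_prompt(hypotheses), return_tensors="pt", truncation=True)
    output_ids = model.generate(**inputs, max_new_tokens=128)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)


# Toy n-best list, invented for illustration.
n_best = [
    "i want to fly from boston to denver",
    "i want to fly from austin to denver",
    "i won to fly from boston to denver",
]
print(build_prompt(n_best))
# Calling correct(n_best) downloads the checkpoint, which is large for the XL model.
```

The smaller Base and Large checkpoints may be more practical for a first test, since the XL model has roughly 3 billion parameters.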

Training Details

Datasets

FLANEC is trained on the HyPoradise dataset, which contains data from eight ASR domains:

WSJ: Business and financial news.
ATIS: Airline travel queries.
CHiME-4: Noisy speech.
Tedlium-3: TED talks.
CV-accent: Accented speech.
SwitchBoard: Conversational speech.
LRS2: BBC program audio.
CORAAL: Speech from speakers of African American English.

For more details, see the HyPoradise paper.

Training Strategy

The model has been fine-tuned using both full fine-tuning and LoRA (Low-Rank Adaptation) methods. Fine-tuning was performed on multiple model scales, ranging from 250M to 3B parameters. Both single-dataset (SD) and cumulative dataset (CD) training approaches were employed to assess model performance across different ASR domains.
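
The difference between the two data settings can be sketched as follows. This assumes that "cumulative" means pooling the training examples of all domains into one set, while the single-dataset setting trains one model per domain. The data layout (a list of n-best hypotheses paired with a reference transcription), the function names, and the toy examples are hypothetical and are not taken from the FLANEC code.

```python
from typing import Dict, List, Tuple

# (n-best hypotheses, reference transcription)
Example = Tuple[List[str], str]


def single_dataset_splits(domains: Dict[str, List[Example]]) -> Dict[str, List[Example]]:
    """SD setting: one training set (and therefore one model) per domain."""
    return {name: list(examples) for name, examples in domains.items()}


def cumulative_split(domains: Dict[str, List[Example]]) -> List[Example]:
    """CD setting: a single training set that pools every domain."""
    pooled: List[Example] = []
    for name in sorted(domains):  # sorted for a deterministic order
        pooled.extend(domains[name])
    return pooled


# Toy data, invented for illustration only.
toy = {
    "atis": [(["show flights to denver", "show flight to denver"], "show flights to denver")],
    "wsj": [(["stocks rose today", "stock rose today"], "stocks rose today")],
}
sd = single_dataset_splits(toy)
cd = cumulative_split(toy)
print({name: len(examples) for name, examples in sd.items()}, len(cd))
```

In practice the pooled set would usually be shuffled before training so that batches mix domains.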

For more information on the training strategy, refer to the SLT 2024 paper.

Citation

Please use the following citation to reference this work in your research:

Note: the citation will be updated once the SLT 2024 proceedings are published.

@inproceedings{moreno2024flanec,
  title={FLANEC: Exploring FLAN-T5 for Post-ASR Error Correction},
  author={La Quatra, Moreno and Salerno, Valerio and Tsao, Yu and Siniscalchi, Sabato Marco},
  booktitle={Proceedings of the 2024 IEEE Workshop on Spoken Language Technology},
  year={2024}
}

morenolq/flanec-xl-cd