Extractive Question Answering with AutoTrain

Community Article · Published August 20, 2024

Extractive Question Answering is a task in which a model is trained to extract the answer to a question from a given context. The model is trained to predict the start and end positions of the answer span within the context. This task is commonly used in question-answering systems to extract relevant information from a large corpus of text.

Sometimes, generative is not all you need ;)

In this blog, we will discuss how to train an Extractive Question Answering model using AutoTrain. AutoTrain (aka AutoTrain Advanced) is an open-source, no-code solution that simplifies the process of training state-of-the-art models across various domains and modality types. It enables you to train models with just a few clicks, without the need for any coding or machine learning expertise.

The AutoTrain GitHub repository can be found at https://github.com/huggingface/autotrain-advanced.

Preparing your data

To train an Extractive Question Answering model, you need a dataset that contains the following columns:

  • context: The context or passage from which the answer is to be extracted.
  • question: The question for which the answer is to be extracted.
  • answer: The answer text and the start position of the answer span within the context.

The answer column should be a dictionary with the keys text and answer_start, where both values are lists (so that multiple valid answers can be provided).

For example:

{
    "context":"Architecturally, the school has a Catholic character. Atop the Main Building's gold dome is a golden statue of the Virgin Mary. Immediately in front of the Main Building and facing it, is a copper statue of Christ with arms upraised with the legend \"Venite Ad Me Omnes\". Next to the Main Building is the Basilica of the Sacred Heart. Immediately behind the basilica is the Grotto, a Marian place of prayer and reflection. It is a replica of the grotto at Lourdes, France where the Virgin Mary reputedly appeared to Saint Bernadette Soubirous in 1858. At the end of the main drive (and in a direct line that connects through 3 statues and the Gold Dome), is a simple, modern stone statue of Mary.",
    "question":"To whom did the Virgin Mary allegedly appear in 1858 in Lourdes France?",
    "answers":{"text":["Saint Bernadette Soubirous"],"answer_start":[515]}
}
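If you only have raw context/question/answer-text triples, you can compute answer_start by locating the answer inside the context. A minimal sketch in plain Python, using a hypothetical make_record helper (not part of AutoTrain):

def make_record(context: str, question: str, answer_text: str) -> dict:
    # Locate the answer span inside the context; str.find returns -1 if absent
    start = context.find(answer_text)
    if start == -1:
        raise ValueError("answer text not found in context")
    return {
        "context": context,
        "question": question,
        "answers": {"text": [answer_text], "answer_start": [start]},
    }

record = make_record(
    "AutoTrain is a no-code tool from Hugging Face.",
    "What is AutoTrain?",
    "a no-code tool",
)
# record["answers"] == {"text": ["a no-code tool"], "answer_start": [13]}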

AutoTrain supports CSV and JSONL formats for training data. If you want to use CSV, the answer column should be stringified JSON with the keys text and answer_start. JSONL is the preferred format for question answering tasks.
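For example, here is a sketch that writes a single-record toy dataset to both formats: JSONL is one JSON object per line, while CSV needs the answers column stringified.

import csv
import json

records = [
    {
        "context": "AutoTrain is a no-code tool from Hugging Face.",
        "question": "What is AutoTrain?",
        "answers": {"text": ["a no-code tool"], "answer_start": [13]},
    },
]

# JSONL: one JSON object per line (the preferred format)
with open("train.jsonl", "w") as f:
    for rec in records:
        f.write(json.dumps(rec) + "\n")

# CSV: the answers column must be stringified JSON
with open("train.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["context", "question", "answers"])
    writer.writeheader()
    for rec in records:
        writer.writerow({**rec, "answers": json.dumps(rec["answers"])})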

You can also use a dataset from the Hugging Face Hub, such as lhoestq/squad.

This is what the dataset looks like. You can preview an example yourself with the datasets library; a minimal sketch:
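
import json

from datasets import load_dataset

# Load the SQuAD-formatted dataset from the Hugging Face Hub
squad = load_dataset("lhoestq/squad", split="train")

# Each example has "context", "question", and "answers" fields
print(json.dumps(squad[0], indent=2))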

Column mapping:

Column mapping is crucial for AutoTrain: it tells AutoTrain how to interpret your data. For Extractive Question Answering, the column mapping should be as follows:

{"text": "context", "question": "question", "answer": "answers"}

where answer is a dictionary with keys text and answer_start.

As you can see, the AutoTrain columns are: text, question, and answer!

Training the model locally

To use AutoTrain locally, you need to install the pip package: autotrain-advanced.

$ pip install -U autotrain-advanced

After installing the package, you can train the model using the following command:

$ export HF_USERNAME=<your_hf_username>
$ export HF_TOKEN=<your_hf_write_token>
$ autotrain --config <path_to_config_file>

where the config file looks something like this:

task: extractive-qa
base_model: google-bert/bert-base-uncased
project_name: autotrain-bert-ex-qa1
log: tensorboard
backend: local

data:
  path: lhoestq/squad
  train_split: train
  valid_split: validation
  column_mapping:
    text_column: context
    question_column: question
    answer_column: answers

params:
  max_seq_length: 512
  max_doc_stride: 128
  epochs: 3
  batch_size: 4
  lr: 2e-5
  optimizer: adamw_torch
  scheduler: linear
  gradient_accumulation: 1
  mixed_precision: fp16

hub:
  username: ${HF_USERNAME}
  token: ${HF_TOKEN}
  push_to_hub: true

The above config will train a BERT model on the lhoestq/squad dataset for 3 epochs with a batch size of 4 and a learning rate of 2e-5. You can find all the parameters in the AutoTrain docs: https://huggingface.co/docs/autotrain.

If you are using local files, all you need to do is change the data part of the config file to:

data:
  path: data/ # this must be the path to the directory containing the train and valid files
  train_split: train # this must be either train.csv or train.json
  valid_split: valid # this must be either valid.csv or valid.json, can also be null
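
For example, assuming JSON files, the data directory would look like this (one possible layout, matching the comments above):

data/
├── train.json
└── valid.json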

Note: you don't need to export HF_USERNAME and HF_TOKEN if you are not pushing the model to the Hub and are not using a gated/private model or dataset.

Training the model on Hugging Face Hub

To train the model on the Hugging Face Hub, you need to create an AutoTrain Space with appropriate hardware. To create one, visit https://huggingface.co/autotrain and follow the instructions.

Once done, you will be presented with the AutoTrain UI.

Choose the Extractive Question Answering task, fill in the required details (dataset and column mapping), adjust the parameters if you wish, and click "Start Training".

You can also run the UI locally using the following command:

$ export HF_TOKEN=<your_hf_write_token>
$ autotrain app

That's it! You can now train your own Extractive Question Answering model using AutoTrain locally or on Hugging Face Hub.
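
Once training finishes, you can try the model with the transformers question-answering pipeline. A minimal sketch, assuming the model was pushed to the Hub under your username and the project_name from the config above:

from transformers import pipeline

# Load the fine-tuned model from the Hub; replace the repo id with your own
qa = pipeline("question-answering", model="<your_hf_username>/autotrain-bert-ex-qa1")

context = (
    "Immediately behind the basilica is the Grotto, a Marian place of prayer "
    "and reflection. It is a replica of the grotto at Lourdes, France where "
    "the Virgin Mary reputedly appeared to Saint Bernadette Soubirous in 1858."
)
result = qa(
    question="To whom did the Virgin Mary allegedly appear in 1858 in Lourdes France?",
    context=context,
)
print(result)  # e.g. {'score': ..., 'start': ..., 'end': ..., 'answer': 'Saint Bernadette Soubirous'}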

Happy training! 🚀

In case you have any questions or need help, feel free to reach out to us on GitHub.