Arabic News Article Summarization with mT5

This project fine-tunes the google/mt5-small model on the BBC Arabic news dataset to summarize Arabic news articles. mT5 is a multilingual, Transformer-based text-to-text model, and fine-tuning adapts it to the vocabulary and style of Arabic news text.

Introduction

This project uses the multilingual google/mt5-small model for Arabic text summarization. Fine-tuning on the BBC Arabic news dataset improves the model's ability to generate accurate, concise summaries of Arabic news articles. Training uses the Transformers library, and summary quality is evaluated with ROUGE scores. You can replicate this model by following the Training Repo.
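
To make the ROUGE metric concrete, here is a minimal sketch of ROUGE-N computed from scratch. It is not the project's evaluation code: the function names, the whitespace tokenization, and the example sentences are assumptions made for illustration. Whitespace tokenization is chosen because ASCII-oriented tokenizers may discard Arabic characters.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return a multiset (Counter) of the n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n(reference, candidate, n=1):
    """Compute ROUGE-N precision, recall, and F1 (hypothetical helper).

    Assumption: texts are tokenized by splitting on whitespace.
    """
    ref_counts = ngrams(reference.split(), n)
    cand_counts = ngrams(candidate.split(), n)
    # Clipped overlap: an n-gram counts only as often as it occurs in both texts.
    overlap = sum((ref_counts & cand_counts).values())
    ref_total = sum(ref_counts.values())
    cand_total = sum(cand_counts.values())
    precision = overlap / cand_total if cand_total else 0.0
    recall = overlap / ref_total if ref_total else 0.0
    f1 = 2 * precision * recall / (precision + recall) if overlap else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


# Illustrative sentences, not taken from the dataset
reference = "الحكومة تعلن خطة جديدة لدعم الاقتصاد"
candidate = "الحكومة تعلن خطة لدعم الاقتصاد"
scores = rouge_n(reference, candidate, n=1)
print(scores)
```

Here the candidate reuses five of the six reference words, so precision is 1.0 and recall is 5/6. A higher F1 means the generated summary shares more n-grams with the reference summary.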

Dataset

The dataset comprises news articles from BBC Arabic, split into 32,000 training rows, 4,000 test rows, and 4,000 validation rows.
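
The source does not say how the split was produced. As a rough sketch, a shuffled split with the same sizes could be built as follows; the `split_indices` helper and the seed are hypothetical and are not part of the project.

```python
import random


def split_indices(n_train, n_test, n_val, seed=0):
    """Shuffle row indices and cut them into train/test/validation lists.

    Hypothetical helper: the sizes are passed as integers so the split
    is exact, and the seed makes the shuffle reproducible.
    """
    total = n_train + n_test + n_val
    indices = list(range(total))
    random.Random(seed).shuffle(indices)
    train = indices[:n_train]
    test = indices[n_train:n_train + n_test]
    val = indices[n_train + n_test:]
    return train, test, val


# Sizes stated in the dataset description above
train_idx, test_idx, val_idx = split_indices(32_000, 4_000, 4_000)
print(len(train_idx), len(test_idx), len(val_idx))
```

The three lists are disjoint and together cover all 40,000 rows, so no article appears in more than one split.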

Dataset Source: BBC Arabic News Data

Model

The google/mt5-small model is the small variant of mT5, the multilingual extension of T5, which covers 101 languages including Arabic. This project fine-tunes it for Arabic news summarization.

Pretrained Model: google/mt5-small

Usage

To use this model for summarizing Arabic news articles, follow the steps below:

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM, AutoConfig
import torch

# Select the device once and reuse it for the model and the inputs
device = "cuda" if torch.cuda.is_available() else "cpu"
model_name = "yalsaffar/mt5-small-Arabic-Summarization"

# Load tokenizer and model
tokenizer = AutoTokenizer.from_pretrained(model_name)
config = AutoConfig.from_pretrained(
    model_name,
    max_length=128,
    length_penalty=0.6,
    no_repeat_ngram_size=2,
    num_beams=15,
)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name, config=config).to(device)

# Prepare input: tokenize the article and move the ids to the model's device
input_text = "الأخبار ...."
input_ids = tokenizer.encode(input_text, return_tensors="pt").to(device)

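# Generate summary
# Note: arguments passed to generate() take precedence over the config values,
# so no_repeat_ngram_size=1 applies here rather than the 2 set in the config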
with torch.no_grad():
    preds = model.generate(
        input_ids,
        num_beams=15,
        num_return_sequences=1,
        no_repeat_ngram_size=1,
        remove_invalid_values=True,
        max_length=128,
    )

# Convert ids to text
summary = tokenizer.batch_decode(preds, skip_special_tokens=True)

print("***** Original Text *****")
print(input_text)
print("***** Generated Summary *****")
print(summary[0])

License

This project is licensed under the MIT License - see the LICENSE file for details.