Arabic News Article Summarization with mT5
This project fine-tunes the google/mt5-small model on the BBC Arabic news dataset to summarize Arabic news articles. It builds on the multilingual pretraining of this Transformer-based model so that the generated summaries handle the linguistic characteristics of Arabic.
Introduction
This project uses the multilingual google/mt5-small model for Arabic text summarization. Fine-tuning it on the BBC Arabic news dataset improves its ability to generate accurate, concise summaries of Arabic news articles. Training uses the Transformers library, and ROUGE scores serve as the evaluation metric for summary quality. You can replicate this model by following the Training Repo.
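To make the evaluation metric concrete, here is a minimal sketch of ROUGE-1 F1, the unigram-overlap variant of ROUGE. It assumes plain whitespace tokenization and is not the project's evaluation code; the function name rouge1_f1 and the sample sentences are hypothetical.

```python
from collections import Counter


def rouge1_f1(reference: str, candidate: str) -> float:
    """Unigram-overlap F1 between a reference and a generated summary."""
    # Assumption: whitespace tokenization; real ROUGE tooling may tokenize differently.
    ref_tokens = reference.split()
    cand_tokens = candidate.split()
    if not ref_tokens or not cand_tokens:
        return 0.0

    # Clipped overlap: each token counts at most as often as it appears in both texts.
    overlap = sum((Counter(ref_tokens) & Counter(cand_tokens)).values())
    if overlap == 0:
        return 0.0

    precision = overlap / len(cand_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


# Hypothetical reference and generated summary
reference_summary = "فاز الفريق بالمباراة النهائية"
generated_summary = "فاز الفريق بالمباراة"
print(rouge1_f1(reference_summary, generated_summary))
```

A full evaluation would normally also report ROUGE-2 and ROUGE-L and average the scores over the whole test split.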
Dataset
The dataset comprises news articles from BBC Arabic, split into 32,000 training rows, 4,000 testing rows, and 4,000 validation rows.
- Dataset Source: BBC Arabic News Data
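The row counts above add up to 40,000 articles, which is an 80/10/10 split. How the split was actually produced is not stated here, so the following is only a sketch of one way to reproduce splits of the same sizes with a seeded shuffle; the function name split_indices and the seed are assumptions.

```python
import random


def split_indices(n_rows: int, n_train: int, n_test: int, seed: int = 42):
    """Shuffle row indices and cut them into train / test / validation lists."""
    indices = list(range(n_rows))
    # A seeded generator makes the shuffle reproducible across runs.
    random.Random(seed).shuffle(indices)

    train = indices[:n_train]
    test = indices[n_train:n_train + n_test]
    validation = indices[n_train + n_test:]  # whatever rows remain
    return train, test, validation


train, test, validation = split_indices(40_000, n_train=32_000, n_test=4_000)
print(len(train), len(test), len(validation))  # 32000 4000 4000
```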
Model
google/mt5-small is a checkpoint of mT5, the multilingual extension of T5 that covers 101 languages, including Arabic. This project fine-tunes it for Arabic news summarization.
- Pretrained Model: google/mt5-small
Usage
To use this model for summarizing Arabic news articles, follow the steps below:
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM, AutoConfig
import torch

# Use the GPU when one is available, otherwise fall back to the CPU
device = "cuda" if torch.cuda.is_available() else "cpu"
model_name = "yalsaffar/mt5-small-Arabic-Summarization"

# Load tokenizer and model
tokenizer = AutoTokenizer.from_pretrained(model_name)
config = AutoConfig.from_pretrained(
    model_name,
    max_length=128,
    length_penalty=0.6,
    no_repeat_ngram_size=2,
    num_beams=15,
)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name, config=config).to(device)

# Prepare input
input_text = "الأخبار ...."
input_ids = tokenizer.encode(input_text, return_tensors="pt").to(device)

# Generate summary; arguments passed to generate() take precedence over the config values above
with torch.no_grad():
    preds = model.generate(
        input_ids,
        num_beams=15,
        num_return_sequences=1,
        no_repeat_ngram_size=1,
        remove_invalid_values=True,
        max_length=128,
    )

# Convert ids to text
summary = tokenizer.batch_decode(preds, skip_special_tokens=True)
print("***** Original Text *****")
print(input_text)
print("***** Generated Summary *****")
print(summary[0])
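The length_penalty=0.6 setting in the configuration affects how beam search ranks candidate summaries. As a rough illustration, beam search commonly scores a hypothesis by its summed log-probability divided by its length raised to length_penalty. The sketch below assumes that formula; beam_score and the two hypotheses are hypothetical, and this is not the library's own code.

```python
def beam_score(sum_logprobs: float, length: int, length_penalty: float) -> float:
    """Length-normalized score of one beam hypothesis (higher is better)."""
    # Log-probabilities are negative, so a larger exponent shrinks the
    # magnitude of the score more for long sequences, which favors them.
    return sum_logprobs / (length ** length_penalty)


# Two hypothetical hypotheses: a short one and a longer, less probable one
short_hyp = {"sum_logprobs": -4.0, "length": 4}
long_hyp = {"sum_logprobs": -6.0, "length": 9}

for penalty in (0.0, 0.6, 1.0):
    short_score = beam_score(short_hyp["sum_logprobs"], short_hyp["length"], penalty)
    long_score = beam_score(long_hyp["sum_logprobs"], long_hyp["length"], penalty)
    winner = "short" if short_score > long_score else "long"
    print(f"length_penalty={penalty}: short={short_score:.3f} long={long_score:.3f} -> {winner}")
```

Under this formula, a penalty of 0.0 ranks by raw log-probability alone, while larger values increasingly reward longer outputs.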
License
This project is licensed under the MIT License - see the LICENSE file for details.