|
--- |
|
license: llama3.1 |
|
language: |
|
- en |
|
- pa |
|
metrics: |
|
- chrf |
|
base_model: |
|
- meta-llama/Meta-Llama-3.1-8B |
|
pipeline_tag: translation |
|
tags: |
|
- text-2-text translation |
|
- English2Punjabi |
|
--- |
|
# LLAMA-VaaniSetu-EN2PA: English to Punjabi Translation with Large Language Models
|
|
|
### Overview |
|
|
|
This model, **LLAMA-VaaniSetu-EN2PA**, is a fine-tuned version of **LLaMA 3.1 8B** built specifically for **English to Punjabi translation**. It was trained on the **Bharat Parallel Corpus Collection (BPCC)**, which contains around **10 million English<>Punjabi sentence pairs** and is made available by [AI4Bharat](https://github.com/AI4Bharat/IndicTrans2).
|
|
|
This model aims to fill the gap in **open-source English to Punjabi translation models**, with potential applications in translating judicial documents, government orders, court judgments, and other official material for Punjabi speakers.
|
|
|
### Model and Data Information |
|
|
|
- **Training Data**: 10 million English<>Punjabi parallel sentences from [AI4Bharat's Bharat Parallel Corpus Collection (BPCC)](https://github.com/AI4Bharat/IndicTrans2). |
|
- **Evaluation Data**: The model has been evaluated on **1503 samples** from the **IN22-Conv dataset**, which is also available via [IndicTrans2](https://github.com/AI4Bharat/IndicTrans2). |
|
- **Model Architecture**: Based on **LLaMA 3.1 8B** with BF16 precision. |
|
- **Score (chrF++)**: Achieves a **chrF++ score of 28.1** on the IN22-Conv dataset, a reasonable baseline for a first open-source release; for comparison, Google Translate reaches **61.1 chrF++** on the same benchmark (as reported in [this paper](https://arxiv.org/pdf/2305.16307)). A scoring sketch is given below.
|
|
|
This is the **first release** of the model, and future updates aim to improve the chrF++ score for enhanced translation quality. |
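For reproducibility, chrF++ can be computed with the `sacrebleu` package. A minimal sketch, assuming `sacrebleu` is installed separately (`pip install sacrebleu`; it is not among this model's listed requirements) and that you have already generated hypothesis/reference pairs for IN22-Conv:

```python
# Hedged sketch: corpus-level chrF++ scoring with sacrebleu.
# The hypothesis/reference strings below are placeholders for your
# own model outputs and the gold Punjabi references.
from sacrebleu.metrics import CHRF

hypotheses = ["<model translation 1>", "<model translation 2>"]
references = ["<gold Punjabi 1>", "<gold Punjabi 2>"]

chrf_pp = CHRF(word_order=2)  # word_order=2 turns chrF into chrF++
score = chrf_pp.corpus_score(hypotheses, [references])
print(score)
```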
|
|
|
### GPU Requirements for Inference |
|
|
|
To perform inference with this model, here are the **minimum GPU requirements**: |
|
- **Memory Requirements**: 16-18 GB of VRAM for inference in **BF16 (BFloat16)** precision. |
|
- **Recommended GPUs**: |
|
- **NVIDIA A100 (40 GB)**: comfortably handles BF16 inference for an 8B-parameter model.

- Other GPUs with **at least 16 GB VRAM** may also work, though performance will vary with available memory (a quick VRAM check is sketched below).
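Before loading the model, you can sanity-check your GPU: the BF16 weights alone take roughly 8B parameters × 2 bytes ≈ 16 GB, before activations and the KV cache. A minimal sketch using PyTorch:

```python
import torch

# Rough VRAM sanity check: BF16 weights for an 8B-parameter model
# occupy about 16 GB on their own, before activations and KV cache.
if torch.cuda.is_available():
    props = torch.cuda.get_device_properties(0)
    total_gb = props.total_memory / 1024**3
    print(f"{props.name}: {total_gb:.1f} GB VRAM")
    if total_gb < 16:
        print("Warning: likely not enough VRAM for BF16 inference.")
else:
    print("No CUDA GPU detected.")
```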
|
|
|
## Requirements |
|
|
|
- Python 3.8.10 or above |
|
- Required Python packages: |
|
- `transformers` |
|
- `torch` |
|
- `huggingface_hub` |
|
|
|
### Installation Instructions |
|
|
|
To use this model, ensure you have the following dependencies installed: |
|
|
|
```bash |
|
pip install torch transformers huggingface_hub |
|
``` |
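The `huggingface_hub` package is listed because, depending on how the repository is gated (the Llama 3.1 base weights are gated on Hugging Face), you may need to authenticate with an access token before the weights can be downloaded. A minimal sketch, with a placeholder token:

```python
# Authenticate with Hugging Face before downloading gated weights.
# Replace the placeholder with your own access token, or run
# `huggingface-cli login` in a shell instead.
from huggingface_hub import login

login(token="hf_...")  # placeholder; use your own token
```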
|
|
|
### Model Usage Example |
|
|
|
Here's an example of how to load and use the **LLAMA-VaaniSetu-EN2PA** model for **English to Punjabi translation**: |
|
|
|
```python |
|
from transformers import AutoModelForCausalLM, AutoTokenizer |
|
import torch |
|
|
|
|
|
# Load model and tokenizer |
|
def load_model(): |
|
tokenizer = AutoTokenizer.from_pretrained("partex-nv/Llama-3.1-8B-VaaniSetu-EN2PA") |
|
model = AutoModelForCausalLM.from_pretrained( |
|
"partex-nv/Llama-3.1-8B-VaaniSetu-EN2PA", |
|
torch_dtype=torch.bfloat16, |
|
device_map="auto", # Automatically moves model to GPU |
|
) |
|
return model, tokenizer |
|
|
|
model, tokenizer = load_model() |
|
|
|
# Define the translation function (English -> Punjabi)
|
def translate_to_punjabi(english_text): |
|
# Create the prompt |
|
translate_prompt = """Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request. |
|
|
|
### Instruction: |
|
{} |
|
|
|
### Input: |
|
{} |
|
|
|
### Response: |
|
{}""" |
|
|
|
# Format the prompt |
|
formatted_input = translate_prompt.format( |
|
"You are given the english text, read it and understand it. After reading translate the english text to Punjabi and provide the output strictly", # Instruction |
|
english_text, # Input text to be translated |
|
"" # Output - leave blank for generation |
|
) |
|
|
|
# Tokenize the input |
|
inputs = tokenizer([formatted_input], return_tensors="pt").to("cuda") |
|
|
|
# Generate the translation output |
|
output_ids = model.generate(**inputs, max_new_tokens=500) |
|
|
|
# Decode the output |
|
translated_text = tokenizer.decode(output_ids[0], skip_special_tokens=True) |
|
    # Keep only the generated text after the "Response:" marker
    fulloutput = translated_text.split("Response:")[-1].strip()
    return fulloutput
|
|
|
|
|
english_text = """ |
|
Delhi is a beautiful place |
|
""" |
|
|
|
punjabi_translation = translate_to_punjabi(english_text) |
|
|
|
print(punjabi_translation) |
|
``` |
|
|
|
### Notes |
|
|
|
- The translation function handles **English to Punjabi** translation and can be used for applications such as translating judicial documents, government orders, and other official material into Punjabi. For longer documents, see the chunking sketch below.
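The prompt above is built around a single passage, and `max_new_tokens=500` bounds the output, so for long documents it is usually safer to split the input and translate piece by piece. A minimal sketch reusing the `translate_to_punjabi` function defined earlier (splitting paragraphs on blank lines is an assumption; adjust to your documents):

```python
# Hedged sketch: translate a long document paragraph by paragraph,
# reusing the translate_to_punjabi function defined above.
def translate_document(english_document: str) -> str:
    paragraphs = [p.strip() for p in english_document.split("\n\n") if p.strip()]
    translated = [translate_to_punjabi(p) for p in paragraphs]
    return "\n\n".join(translated)
```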
|
|
|
### Performance and Future Work |
|
|
|
As this is the **first release** of the **LLAMA-VaaniSetu-EN2PA** model, there is room for improvement, particularly in increasing the chrF++ score. Future versions of the model will focus on optimizing performance, enhancing the translation quality, and expanding to additional domains. |
|
|
|
Stay tuned for updates, and feel free to contribute or raise issues on Hugging Face or the associated repositories! |
|
|
|
### Resources |
|
|
|
- **Training Data**: [Bharat Parallel Corpus Collection (BPCC)](https://github.com/AI4Bharat/IndicTrans2) by AI4Bharat. |
|
- **Evaluation Data**: [IN22-Conv dataset](https://github.com/AI4Bharat/IndicTrans2). |
|
- **Benchmarks**: [Translation Benchmarks Paper](https://arxiv.org/pdf/2305.16307). |
|
|
|
## Contributors |
|
|
|
- **Rohit Anurag** - Principal Software Engineer, PerpetualBlock - A Partex Company
|
|
|
|
|
|
|
## Acknowledgements |
|
|
|
- [AI4Bharat](https://github.com/AI4Bharat/IndicTrans2): for providing the training and evaluation data used in this work.
|
|
|
|
|
### License |
|
|
|
This model is released under the **Llama 3.1 Community License** inherited from its base model; use of the training data is additionally subject to the terms of the BPCC datasets released by AI4Bharat.