---
license: llama3.1
language:
- en
- pa
metrics:
- chrf
base_model:
- meta-llama/Meta-Llama-3.1-8B
pipeline_tag: translation
tags:
- text-2-text translation
- English2Punjabi
---
# 🦙📝 LLAMA-VaaniSetu-EN2PA: English to Punjabi Translation with Large Language Models

### Overview

**LLAMA-VaaniSetu-EN2PA** is a fine-tuned version of the LLaMA 3.1 8B model, designed specifically for **English to Punjabi translation**. It was trained on the **Bharat Parallel Corpus Collection (BPCC)**, which contains around **10 million English<>Punjabi pairs**. The BPCC has been made available by [AI4Bharat](https://github.com/AI4Bharat/IndicTrans2).

This model aims to fill the gap in **open-source English to Punjabi translation models**, with potential applications in translating judicial documents, government orders, court judgments, and other documents for Punjabi-speaking readers.

### Model and Data Information

- **Training Data**: Around 10 million English<>Punjabi parallel sentences from [AI4Bharat's Bharat Parallel Corpus Collection (BPCC)](https://github.com/AI4Bharat/IndicTrans2).
- **Evaluation Data**: The model was evaluated on **1503 samples** from the **IN22-Conv dataset**, which is also available via [IndicTrans2](https://github.com/AI4Bharat/IndicTrans2).
- **Model Architecture**: Based on **LLaMA 3.1 8B** with BF16 precision.
- **Score (chrF++)**: The model achieves a **chrF++ score of 28.1** on the IN22-Conv dataset. For comparison, the benchmark chrF++ score for Google Translate is 61.1 (as noted in [this paper](https://arxiv.org/pdf/2305.16307)), so there is still a sizable gap to close.

This is the **first release** of the model, and future updates aim to improve the chrF++ score for enhanced translation quality.
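
For readers unfamiliar with the metric, the following is a minimal sketch of the idea behind chrF: an F-score (with β = 2, which weights recall more heavily than precision) over character n-grams of order 1 to 6. It is a simplified, sentence-level illustration and not the evaluation script behind the 28.1 figure; it leaves out the word n-grams that chrF++ adds, and the function names are hypothetical. For reportable numbers, an established implementation such as sacreBLEU is the safer choice.

```python
from collections import Counter


def char_ngrams(text: str, n: int) -> Counter:
    """Count character n-grams of order n, ignoring whitespace."""
    chars = "".join(text.split())
    return Counter(chars[i:i + n] for i in range(len(chars) - n + 1))


def simple_chrf(hypothesis: str, reference: str, max_order: int = 6, beta: float = 2.0) -> float:
    """Simplified sentence-level character n-gram F-score on a 0-100 scale."""
    precisions, recalls = [], []
    for n in range(1, max_order + 1):
        hyp, ref = char_ngrams(hypothesis, n), char_ngrams(reference, n)
        if not hyp or not ref:
            continue  # skip orders longer than either string
        overlap = sum((hyp & ref).values())  # clipped n-gram matches
        precisions.append(overlap / sum(hyp.values()))
        recalls.append(overlap / sum(ref.values()))
    if not precisions:
        return 0.0
    p = sum(precisions) / len(precisions)
    r = sum(recalls) / len(recalls)
    if p + r == 0:
        return 0.0
    return 100 * (1 + beta**2) * p * r / (beta**2 * p + r)


# Placeholder strings; in practice these would be the model output
# and the reference Punjabi translation.
print(simple_chrf("the cat sat", "the cat sat on the mat"))
```

Because the score is built from character overlap rather than exact word matches, it gives partial credit for near-miss word forms, which is one reason it is commonly used for morphologically rich languages.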

### GPU Requirements for Inference

The **minimum GPU requirements** for inference with this model are:
- **Memory Requirements**: 16-18 GB of VRAM for inference in **BF16 (BFloat16)** precision.
- **Recommended GPUs**:
  - **NVIDIA A100 (20GB)**: Well suited to BF16 precision and handles large models like LLaMA 8B efficiently.
  - Other GPUs with **at least 16 GB VRAM** may also work, but performance may vary based on memory availability.
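
The 16-18 GB figure is consistent with simple arithmetic: BF16 stores each weight in 2 bytes, so the weights alone take roughly 2 bytes times the parameter count. The sketch below works through that estimate, assuming a round 8 billion parameters (the exact count is not stated here) and using a hypothetical helper name.

```python
def weight_memory_gb(num_params: float, bytes_per_param: int = 2) -> float:
    """Estimate the memory needed for model weights alone, in decimal gigabytes."""
    return num_params * bytes_per_param / 1e9


# Assumption: ~8 billion parameters, each stored as a 2-byte BF16 value.
weights_gb = weight_memory_gb(8e9)
print(f"Weights only: ~{weights_gb:.0f} GB")
```

The remainder of the quoted range is presumably headroom for activations and the KV cache, which grow with prompt length and the number of generated tokens.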

## Requirements

- Python 3.8.10 or above
- Required Python packages:
  - `transformers`
  - `torch`
  - `huggingface_hub`
  - `accelerate` (needed for `device_map="auto"` in the usage example)

### Installation Instructions

To use this model, ensure you have the following dependencies installed:

```bash
pip install torch transformers huggingface_hub accelerate
```

### Model Usage Example

Here's an example of how to load and use the **LLAMA-VaaniSetu-EN2PA** model for **English to Punjabi translation**:

```python
import os

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "rohitanurag/Llama-3.1-8B-VaaniSetu-EN2PA"
# Hugging Face access token, read from the HF_TOKEN environment variable (None if unset)
HF_TOKEN = os.environ.get("HF_TOKEN")


# Load model and tokenizer
def load_model():
    # `token` replaces the deprecated `use_auth_token` argument
    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, token=HF_TOKEN)
    model = AutoModelForCausalLM.from_pretrained(
        MODEL_ID,
        token=HF_TOKEN,
        torch_dtype=torch.bfloat16,
        device_map="auto",  # Automatically moves model to GPU
    )
    return model, tokenizer


model, tokenizer = load_model()

# Prompt template with slots for the instruction, the input, and the response
TRANSLATE_PROMPT = """Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
{}

### Input:
{}

### Response:
{}"""

INSTRUCTION = "You are given the english text, read it and understand it. After reading translate the english text to Punjabi and provide the output strictly"


# Translate English text to Punjabi
def translate_to_punjabi(english_text):
    # Format the prompt
    formatted_input = TRANSLATE_PROMPT.format(
        INSTRUCTION,   # Instruction
        english_text,  # Input text to be translated
        "",            # Output - leave blank for generation
    )

    # Tokenize the input and move it to the same device as the model
    inputs = tokenizer([formatted_input], return_tensors="pt").to(model.device)

    # Generate the translation output
    output_ids = model.generate(**inputs, max_new_tokens=500)

    # Decode the output and keep only the text after the final "Response:" marker
    translated_text = tokenizer.decode(output_ids[0], skip_special_tokens=True)
    return translated_text.split("Response:")[-1].strip()


english_text = """
Delhi is a beautiful place
"""

punjabi_translation = translate_to_punjabi(english_text)

print(punjabi_translation)
```

### Notes

- The translation function is designed to handle **English to Punjabi** translations. You can use this for various applications, such as translating judicial documents, government orders, and other documents into Punjabi.
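
Long documents such as judgments or orders will generally not fit in a single call, since generation in the usage example is capped at `max_new_tokens=500`. One possible approach, sketched below under the assumption that translating a few sentences at a time is acceptable, is to split the text into chunks and translate each one separately. `split_into_chunks` and the character budget are hypothetical choices for illustration, not part of the model's API, and the sentence splitter is a naive regular expression that will mis-handle abbreviations.

```python
import re
from typing import List


def split_into_chunks(text: str, max_chars: int = 400) -> List[str]:
    """Group sentences into chunks of at most max_chars characters where possible."""
    # Naive sentence split: break after ., ! or ? followed by whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    chunks, current = [], ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > max_chars:
            chunks.append(current)  # current chunk is full; start a new one
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


document = "The court heard the matter. The appeal was dismissed. Costs were awarded."
for chunk in split_into_chunks(document, max_chars=60):
    print(chunk)
```

Each chunk could then be passed to `translate_to_punjabi` and the results joined with spaces. A single sentence longer than the budget is kept whole rather than cut mid-sentence.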

### Performance and Future Work

As this is the **first release** of the **LLAMA-VaaniSetu-EN2PA** model, there is room for improvement, particularly in the chrF++ score. Future versions of the model will focus on optimizing performance, enhancing translation quality, and expanding to additional domains.

Stay tuned for updates, and feel free to contribute or raise issues on Hugging Face or the associated repositories!

### Resources

- **Training Data**: [Bharat Parallel Corpus Collection (BPCC)](https://github.com/AI4Bharat/IndicTrans2) by AI4Bharat.
- **Evaluation Data**: [IN22-Conv dataset](https://github.com/AI4Bharat/IndicTrans2).
- **Benchmarks**: [Translation Benchmarks Paper](https://arxiv.org/pdf/2305.16307).

## Acknowledgements

- [AI4Bharat](https://github.com/AI4Bharat/IndicTrans2): Source of the training and evaluation data.

### License

This model is released under the Llama 3.1 license (`llama3.1` in the model card metadata); the terms of any datasets used during fine-tuning also apply.