---
license: llama3.1
language:
- en
- pa
metrics:
- chrf
base_model:
- meta-llama/Meta-Llama-3.1-8B
pipeline_tag: translation
tags:
- text-2-text translation
- English2Punjabi
---
# 🦙📝 LLAMA-VaaniSetu-EN2PA: English to Punjabi Translation with Large Language Models

### Overview

This model, **LLAMA-VaaniSetu-EN2PA**, is a fine-tuned version of **LLaMA 3.1 8B**, built specifically for **English to Punjabi translation**. It was trained on the **Bharat Parallel Corpus Collection (BPCC)**, which contains around **10 million English<>Punjabi sentence pairs** and is made available by [AI4Bharat](https://github.com/AI4Bharat/IndicTrans2).

This model aims to bridge the gap in **open-source English to Punjabi translation models**, with potential applications in translating judicial documents, government orders, court judgments, and other material for Punjabi-speaking audiences.

### Model and Data Information

- **Training Data**: 10 million English<>Punjabi parallel sentences from [AI4Bharat's Bharat Parallel Corpus Collection (BPCC)](https://github.com/AI4Bharat/IndicTrans2).
- **Evaluation Data**: The model has been evaluated on **1503 samples** from the **IN22-Conv dataset**, which is also available via [IndicTrans2](https://github.com/AI4Bharat/IndicTrans2).
- **Model Architecture**: Based on **LLaMA 3.1 8B** with BF16 precision.
- **Score (chrF++)**: Achieves a **chrF++ score of 28.1** on the IN22-Conv dataset, a reasonable starting point for an open-source model; for comparison, Google Translate reaches a chrF++ of 61.1 on this benchmark (as reported in [this paper](https://arxiv.org/pdf/2305.16307)). A sketch of how this metric can be computed follows below.
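
For reproducibility, chrF++ can be computed with the `sacrebleu` package (chrF with word n-grams enabled, i.e. `word_order=2`). A minimal sketch with placeholder strings rather than the actual IN22-Conv data:

```python
from sacrebleu.metrics import CHRF

# chrF++ = chrF with word bigrams included (word_order=2)
chrf_pp = CHRF(word_order=2)

hypotheses = ["<model translation 1>", "<model translation 2>"]  # system outputs
references = [["<gold translation 1>", "<gold translation 2>"]]  # one reference stream

print(chrf_pp.corpus_score(hypotheses, references))  # e.g. "chrF2++ = 28.1"
```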

This is the **first release** of the model, and future updates aim to improve the chrF++ score for enhanced translation quality.

### GPU Requirements for Inference

To perform inference with this model, here are the **minimum GPU requirements**:
- **Memory Requirements**: 16-18 GB of VRAM for inference in **BF16 (BFloat16)** precision.
- **Recommended GPUs**:
  - **NVIDIA A100 (40 GB)**: comfortably handles BF16 inference for an 8B-parameter model like this one.
  - Other GPUs with **at least 16 GB of VRAM** may also work, though performance will vary with available memory; the snippet below can help verify what your GPU offers.
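
As a rough rule of thumb, an 8B-parameter model in BF16 needs about 16 GB for the weights alone (8 × 10⁹ parameters × 2 bytes), plus headroom for activations and the KV cache. A quick sanity check, assuming PyTorch with CUDA:

```python
import torch

# Weights alone: ~8e9 params * 2 bytes (BF16) ~= 16 GB, plus activation/KV-cache headroom
if torch.cuda.is_available():
    free_bytes, total_bytes = torch.cuda.mem_get_info()
    print(f"GPU: {torch.cuda.get_device_name(0)}")
    print(f"Free VRAM: {free_bytes / 1e9:.1f} GB of {total_bytes / 1e9:.1f} GB")
    if free_bytes < 16e9:
        print("Warning: less than 16 GB free; BF16 inference may not fit.")
else:
    print("No CUDA GPU detected.")
```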

## Requirements

- Python 3.8.10 or above
- Required Python packages:
  - `transformers`
  - `torch`
  - `huggingface_hub`

### Installation Instructions

To use this model, ensure you have the following dependencies installed:

```bash
pip install torch transformers huggingface_hub
```
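
Depending on the repository's access settings, you may also need to authenticate with Hugging Face before the weights will download (an assumption; skip this step if the files download without it):

```bash
huggingface-cli login
```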

### Model Usage Example

Here's an example of how to load and use the **LLAMA-VaaniSetu-EN2PA** model for **English to Punjabi translation**:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch


# Load model and tokenizer
def load_model():
    tokenizer = AutoTokenizer.from_pretrained("partex-nv/Llama-3.1-8B-VaaniSetu-EN2PA")
    model = AutoModelForCausalLM.from_pretrained(
        "partex-nv/Llama-3.1-8B-VaaniSetu-EN2PA",
        torch_dtype=torch.bfloat16,
        device_map="auto",  # Automatically moves model to GPU
    )
    return model, tokenizer

model, tokenizer = load_model()

# Translate English text to Punjabi
def translate_to_punjabi(english_text):
    # Instruction-style prompt (note: indentation inside the string becomes part of the prompt)
    translate_prompt = """Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.
    
    ### Instruction:
    {}
    
    ### Input:
    {}
    
    ### Response:
    {}"""
    
    # Format the prompt
    formatted_input = translate_prompt.format(
        "You are given the english text, read it and understand it. After reading translate the english text to Punjabi and provide the output strictly",  # Instruction
        english_text,  # Input text to be translated
        ""  # Output - leave blank for generation
    )
    
    # Tokenize the input
    inputs = tokenizer([formatted_input], return_tensors="pt").to(model.device)  # match the model's device

    # Generate the translation output
    output_ids = model.generate(**inputs, max_new_tokens=500)

    # Decode and keep only the text after the "Response:" marker
    translated_text = tokenizer.decode(output_ids[0], skip_special_tokens=True)
    return translated_text.split("Response:")[-1].strip()


english_text = """
Delhi is a beautiful place
"""

punjabi_translation = translate_to_punjabi(english_text)

print(punjabi_translation)
```
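
Building on `translate_to_punjabi` above, here is a small usage sketch (the example sentences are hypothetical) that translates several inputs one at a time; for large workloads you would likely want proper batched generation instead:

```python
# Hypothetical example inputs, translated one at a time for simplicity
sentences = [
    "The court hearing is scheduled for Monday.",
    "Please submit the application form by Friday.",
]

for sentence in sentences:
    print(sentence, "->", translate_to_punjabi(sentence))
```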

### Notes

- The translation function is designed to handle **English to Punjabi** translations. You can use this for various applications, such as translating judicial documents, government orders, and other documents into Punjabi.
  
### Performance and Future Work

As this is the **first release** of the **LLAMA-VaaniSetu-EN2PA** model, there is room for improvement, particularly in increasing the chrF++ score. Future versions of the model will focus on optimizing performance, enhancing the translation quality, and expanding to additional domains.

Stay tuned for updates, and feel free to contribute or raise issues on Hugging Face or the associated repositories!

### Resources

- **Training Data**: [Bharat Parallel Corpus Collection (BPCC)](https://github.com/AI4Bharat/IndicTrans2) by AI4Bharat.
- **Evaluation Data**: [IN22-Conv dataset](https://github.com/AI4Bharat/IndicTrans2).
- **Benchmarks**: [Translation Benchmarks Paper](https://arxiv.org/pdf/2305.16307).

## Contributors

- **Rohit Anurag** - Principal Software Engineer, PerpetualBlock - A Partex Company



## Acknowledgements

- [AI4Bharat](https://github.com/AI4Bharat/IndicTrans2): Source of the training and evaluation data.


### License

This model is released under the **Llama 3.1 Community License** (the `license: llama3.1` tag above); use of the model must also comply with the terms of the datasets used during fine-tuning.