---
license: llama3.1
language:
- en
- pa
metrics:
- chrf
base_model:
- meta-llama/Meta-Llama-3.1-8B
pipeline_tag: translation
tags:
- text-2-text translation
- English2Punjabi
---

# 🦙📝 LLAMA-VaaniSetu-EN2PA: English to Punjabi Translation with Large Language Models

### Overview

This model, **LLAMA-VaaniSetu-EN2PA**, is a fine-tuned version of Meta's **LLaMA 3.1 8B** model, built specifically for **English to Punjabi translation**. It was trained on the **Bharat Parallel Corpus Collection (BPCC)**, which contains roughly **10 million English<>Punjabi sentence pairs** and has been made available by [AI4Bharat](https://github.com/AI4Bharat/IndicTrans2).

This model aims to bridge the gap in **open-source English to Punjabi translation**, with potential applications in translating judicial documents, government orders, court judgments, and other materials for Punjabi-speaking communities.

### Model and Data Information

- **Training Data**: 10 million English<>Punjabi parallel sentences from [AI4Bharat's Bharat Parallel Corpus Collection (BPCC)](https://github.com/AI4Bharat/IndicTrans2).
- **Evaluation Data**: The model was evaluated on **1503 samples** from the **IN22-Conv dataset**, also available via [IndicTrans2](https://github.com/AI4Bharat/IndicTrans2).
- **Model Architecture**: Based on **LLaMA 3.1 8B** with BF16 precision.
- **Score (chrF++)**: Achieves a **chrF++ score of 28.1** on IN22-Conv, a reasonable baseline for an open-source model; for comparison, Google Translate scores 61.1 chrF++ on this benchmark (as reported in [this paper](https://arxiv.org/pdf/2305.16307)). A sketch of how such a score can be computed is shown below.

This is the **first release** of the model, and future updates aim to improve the chrF++ score for enhanced translation quality.
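
For reference, chrF++ can be computed with the `sacrebleu` library. The snippet below is a minimal illustration, not the authors' evaluation script; the example sentences are placeholders, and in practice you would load the 1503 IN22-Conv model outputs and reference translations.

```python
# pip install sacrebleu
from sacrebleu.metrics import CHRF

# Placeholder data: one model output and one reference translation per sentence.
hypotheses = ["ਦਿੱਲੀ ਇੱਕ ਸੁੰਦਰ ਜਗ੍ਹਾ ਹੈ।"]
references = [["ਦਿੱਲੀ ਇੱਕ ਖੂਬਸੂਰਤ ਥਾਂ ਹੈ।"]]  # one reference stream

# word_order=2 selects chrF++ (chrF extended with word bigrams)
chrf_pp = CHRF(word_order=2)
print(chrf_pp.corpus_score(hypotheses, references))
```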

### GPU Requirements for Inference

To perform inference with this model, the **minimum GPU requirements** are:
- **Memory**: 16-18 GB of VRAM for inference in **BF16 (BFloat16)** precision.
- **Recommended GPUs**:
  - **NVIDIA A100 (40 GB)**: Well suited to BF16 precision and efficiently handles models of LLaMA 8B's size.
  - Other GPUs with **at least 16 GB of VRAM** may also work, but performance will vary with available memory; for smaller GPUs, see the quantization sketch below.
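
If your GPU has less than 16 GB of VRAM, 4-bit quantization via `bitsandbytes` is one possible workaround. This is a sketch using standard `transformers` APIs, not a configuration validated by the model authors, and quantization may reduce translation quality.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "rohitanurag/Llama-3.1-8B-VaaniSetu-EN2PA"

# 4-bit NF4 quantization with BF16 compute; requires `pip install bitsandbytes`
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",  # places quantized weights on the available GPU
)
```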

## Requirements

- Python 3.8.10 or above
- Required Python packages:
  - `transformers`
  - `torch`
  - `huggingface_hub`

### Installation Instructions

To use this model, ensure you have the following dependencies installed:

```bash
pip install torch transformers huggingface_hub
```

### Model Usage Example

Here's an example of how to load and use the **LLAMA-VaaniSetu-EN2PA** model for **English to Punjabi translation**:

```python
import os

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Access token for gated repositories; set it via `export HF_TOKEN=...`
HF_TOKEN = os.environ.get("HF_TOKEN")


# Load model and tokenizer
def load_model():
    tokenizer = AutoTokenizer.from_pretrained(
        "rohitanurag/Llama-3.1-8B-VaaniSetu-EN2PA", token=HF_TOKEN
    )
    model = AutoModelForCausalLM.from_pretrained(
        "rohitanurag/Llama-3.1-8B-VaaniSetu-EN2PA",
        torch_dtype=torch.bfloat16,
        device_map="auto",  # Automatically places the model on the GPU
        token=HF_TOKEN,
    )
    return model, tokenizer


model, tokenizer = load_model()


# Translate English text to Punjabi
def translate_to_punjabi(english_text):
    # Alpaca-style prompt template used at inference time
    translate_prompt = """Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
{}

### Input:
{}

### Response:
{}"""

    # Format the prompt
    formatted_input = translate_prompt.format(
        "You are given the english text, read it and understand it. After reading translate the english text to Punjabi and provide the output strictly",  # Instruction
        english_text,  # Input text to be translated
        "",  # Response - left blank for generation
    )

    # Tokenize the input and move it to the GPU
    inputs = tokenizer([formatted_input], return_tensors="pt").to("cuda")

    # Generate the translation (greedy decoding by default)
    output_ids = model.generate(**inputs, max_new_tokens=500)

    # Decode the output and keep only the text after "Response:"
    translated_text = tokenizer.decode(output_ids[0], skip_special_tokens=True)
    return translated_text.split("Response:")[-1].strip()


english_text = """
Delhi is a beautiful place
"""

punjabi_translation = translate_to_punjabi(english_text)
print(punjabi_translation)
```
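
By default, `model.generate` uses greedy decoding, which keeps translations deterministic; if more varied phrasing is desired, sampling parameters such as `do_sample=True` and `temperature` can be passed to `generate`.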

### Notes

- The translation function handles **English to Punjabi** translation of a single passage. For longer material, such as judicial documents, government orders, and court judgments, you can split the text and translate it chunk by chunk, as sketched below.
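
A minimal sketch of that idea, reusing the `translate_to_punjabi` function defined above (paragraph-level splitting is an assumption for illustration; sentence-level splitting may suit some documents better):

```python
def translate_document(english_document: str) -> str:
    """Translate a long English document to Punjabi, paragraph by paragraph."""
    paragraphs = [p.strip() for p in english_document.split("\n\n") if p.strip()]
    return "\n\n".join(translate_to_punjabi(p) for p in paragraphs)
```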

### Performance and Future Work

As this is the **first release** of the **LLAMA-VaaniSetu-EN2PA** model, there is room for improvement, particularly in raising the chrF++ score. Future versions will focus on optimizing performance, enhancing translation quality, and expanding to additional domains.

Stay tuned for updates, and feel free to contribute or raise issues on Hugging Face or the associated repositories!

### Resources

- **Training Data**: [Bharat Parallel Corpus Collection (BPCC)](https://github.com/AI4Bharat/IndicTrans2) by AI4Bharat.
- **Evaluation Data**: [IN22-Conv dataset](https://github.com/AI4Bharat/IndicTrans2).
- **Benchmarks**: [Translation benchmarks paper](https://arxiv.org/pdf/2305.16307).

## Acknowledgements

- [AI4Bharat](https://github.com/AI4Bharat/IndicTrans2), for providing the training and evaluation data.

### License

This model is released under the **Llama 3.1 Community License**, and any datasets used during fine-tuning remain subject to their own terms.