pranjalchitale committed on
Commit
6827d3d
1 Parent(s): c65c553

Update README.md

  1. README.md +67 -2
README.md CHANGED
@@ -56,10 +56,75 @@ This is the model card of IndicTrans2 En-Indic Distilled 200M variant.
  Please refer to [section 7.6: Distilled Models](https://openreview.net/forum?id=vfT4YuzAYA) in the TMLR submission for further details on model training, data and metrics.
  
  ### Usage Instructions
-
  Please refer to the [GitHub repository](https://github.com/AI4Bharat/IndicTrans2/tree/main/huggingface_inference) for a detailed description of how to use HF-compatible IndicTrans2 models for inference.
  
- **Note: IndicTrans2 is not compatible with AutoTokenizer, therefore we provide [IndicTransTokenizer](https://github.com/VarunGumma/IndicTransTokenizer)**
+ ```python
+ import torch
+ from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
+ from IndicTransTokenizer import IndicProcessor
+
+
+ model_name = "ai4bharat/indictrans2-en-indic-dist-200M"
+ tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
+
+ DEVICE = "cuda" if torch.cuda.is_available() else "cpu"
+
+ model = AutoModelForSeq2SeqLM.from_pretrained(
+     model_name,
+     trust_remote_code=True
+ ).to(DEVICE)  # keep the model on the same device as the inputs
+
+ ip = IndicProcessor(inference=True)
+
+ input_sentences = [
+     "When I was young, I used to go to the park every day.",
+     "We watched a new movie last week, which was very inspiring.",
+     "If you had met me at that time, we would have gone out to eat.",
+     "My friend has invited me to his birthday party, and I will give him a gift.",
+ ]
+
+ src_lang, tgt_lang = "eng_Latn", "hin_Deva"
+
+ batch = ip.preprocess_batch(input_sentences, src_lang=src_lang, tgt_lang=tgt_lang)
+
+ # Tokenize the sentences and generate input encodings
+ inputs = tokenizer(
+     batch,
+     truncation=True,
+     padding="longest",
+     return_tensors="pt",
+     return_attention_mask=True,
+ ).to(DEVICE)
+
+ # Generate translations using the model
+ with torch.no_grad():
+     generated_tokens = model.generate(
+         **inputs,
+         use_cache=True,
+         min_length=0,
+         max_length=256,
+         num_beams=5,
+         num_return_sequences=1,
+     )
+
+ # Decode the generated tokens into text
+ with tokenizer.as_target_tokenizer():
+     generated_tokens = tokenizer.batch_decode(
+         generated_tokens.detach().cpu().tolist(),
+         skip_special_tokens=True,
+         clean_up_tokenization_spaces=True,
+     )
+
+ # Postprocess the translations, including entity replacement
+ translations = ip.postprocess_batch(generated_tokens, lang=tgt_lang)
+
+ for input_sentence, translation in zip(input_sentences, translations):
+     print(f"{src_lang}: {input_sentence}")
+     print(f"{tgt_lang}: {translation}")
+ ```
+
+ **Note: IndicTrans2 is now compatible with AutoTokenizer; however, you need to use IndicProcessor from [IndicTransTokenizer](https://github.com/VarunGumma/IndicTransTokenizer) to preprocess sentences before tokenization.**
+
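
The note above fixes an order of operations: IndicProcessor preprocessing first, then tokenization, generation, decoding, and postprocessing. When there are more sentences than fit in one batch, that sequence can be repeated per batch. The following is a minimal sketch and not part of the model card: `chunked`, `translate_in_batches`, and the `translate_batch` argument are hypothetical names, and `translate_batch` is assumed to be any callable that runs the steps from the snippet above on one list of sentences and returns one translation per input.

```python
from typing import Callable, Iterator, List, Sequence


def chunked(items: Sequence[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of `items` holding at most `size` elements."""
    if size < 1:
        raise ValueError("size must be a positive integer")
    for start in range(0, len(items), size):
        yield list(items[start:start + size])


def translate_in_batches(
    sentences: Sequence[str],
    translate_batch: Callable[[List[str]], List[str]],
    batch_size: int = 4,
) -> List[str]:
    """Translate `sentences` batch by batch, preserving input order.

    `translate_batch` is a hypothetical callable (an assumption of this
    sketch) that wraps preprocess -> tokenize -> generate -> decode ->
    postprocess for a single batch.
    """
    translations: List[str] = []
    for batch in chunked(sentences, batch_size):
        outputs = translate_batch(batch)
        # Guard against a wrapper that drops or duplicates sentences.
        if len(outputs) != len(batch):
            raise ValueError("translate_batch must return one output per input")
        translations.extend(outputs)
    return translations


if __name__ == "__main__":
    # Stand-in for the real model call so the sketch runs without weights.
    def fake_translate(batch: List[str]) -> List[str]:
        return [f"<hin_Deva> {sentence}" for sentence in batch]

    print(translate_in_batches(["a", "b", "c"], fake_translate, batch_size=2))
```

Passing the batch function in as an argument keeps the loop independent of the model, so it can be exercised with a stub before any weights are loaded.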
  
  
  ### Citation