Commit 7d6748a by Dang Phuong Nam (parent 5b619b1): Update README.md
---
license: apache-2.0
language:
- vi
library_name: transformers
pipeline_tag: text-classification
tags:
- transformers
- cross-encoder
- rerank
datasets:
- unicamp-dl/mmarco
widget:
- text: tỉnh nào có diện tích lớn nhất việt nam.
  output:
  - label: >-
      nghệ an có diện tích lớn nhất việt nam
  - label: >-
      bắc ninh có diện tích nhỏ nhất việt nam
    score: 0.05
---

# Reranker

* [Usage](#usage)
  * [Using FlagEmbedding](#using-flagembedding)
  * [Using Hugging Face Transformers](#using-hugging-face-transformers)
* [Fine-tune](#fine-tune)
  * [Data Format](#data-format)

Unlike an embedding model, a reranker takes a question and a document as input and directly outputs a similarity score instead of an embedding.
You can get a relevance score by feeding a query and a passage to the reranker.
The score can be mapped to a float value in [0, 1] with the sigmoid function.
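That sigmoid mapping can be sketched in plain Python (the raw scores below are illustrative, not from a real model run):

```python
import math

def sigmoid(x: float) -> float:
    """Map a raw reranker score to a relevance value in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

# Illustrative raw cross-encoder scores: higher means more relevant.
raw_scores = [-8.1875, 5.26171875]
print([round(sigmoid(s), 4) for s in raw_scores])  # → [0.0003, 0.9948]
```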

## Usage

### Using FlagEmbedding

```
pip install -U FlagEmbedding
```

Get relevance scores (higher scores indicate more relevance):

```python
from FlagEmbedding import FlagReranker

# Setting use_fp16 to True speeds up computation at the cost of a slight performance degradation.
reranker = FlagReranker('namdp/bge-reranker-vietnamese', use_fp16=True)

score = reranker.compute_score(['query', 'passage'])
print(score)  # -5.65234375

# Map the score into [0, 1] by setting normalize=True, which applies a sigmoid to the score.
score = reranker.compute_score(['query', 'passage'], normalize=True)
print(score)  # 0.003497010252573502

scores = reranker.compute_score([['what is panda?', 'hi'],
                                 ['what is panda?',
                                  'The giant panda (Ailuropoda melanoleuca), sometimes called a panda bear or simply panda, is a bear species endemic to China.']])
print(scores)  # [-8.1875, 5.26171875]

# Map the scores into [0, 1] by setting normalize=True, which applies a sigmoid to each score.
scores = reranker.compute_score([['what is panda?', 'hi'],
                                 ['what is panda?',
                                  'The giant panda (Ailuropoda melanoleuca), sometimes called a panda bear or simply panda, is a bear species endemic to China.']],
                                normalize=True)
print(scores)  # [0.00027803096387751553, 0.9948403768236574]
```

### Using Hugging Face Transformers

```
pip install -U transformers
```

Get relevance scores (higher scores indicate more relevance):

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('namdp/bge-reranker-vietnamese')
model = AutoModelForSequenceClassification.from_pretrained('namdp/bge-reranker-vietnamese')
model.eval()

pairs = [['what is panda?', 'hi'],
         ['what is panda?',
          'The giant panda (Ailuropoda melanoleuca), sometimes called a panda bear or simply panda, is a bear species endemic to China.']]
with torch.no_grad():
    inputs = tokenizer(pairs, padding=True, truncation=True, return_tensors='pt', max_length=512)
    scores = model(**inputs, return_dict=True).logits.view(-1).float()
    print(scores)
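Whichever path produces the scores, the final reranking step is just sorting candidates by score. A minimal sketch with made-up passages and scores (not real model outputs):

```python
# Illustrative candidate passages with raw reranker scores (higher = more relevant).
candidates = {
    "passage a": -3.2,
    "passage b": 4.1,
    "passage c": 0.7,
    "passage d": -8.0,
}

# Sort by score, descending, and keep the top-k candidates.
k = 2
reranked = sorted(candidates.items(), key=lambda kv: kv[1], reverse=True)[:k]
print(reranked)  # → [('passage b', 4.1), ('passage c', 0.7)]
```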

## Fine-tune

### Data Format

Training data should be a JSON Lines file, where each line is a dict like this:

```
{"query": str, "pos": List[str], "neg": List[str]}
```

`query` is the query, `pos` is a list of positive texts, and `neg` is a list of negative texts. If you have no negative texts for a query, you can randomly sample some from the entire corpus as negatives.