	---
	language:
	- vi
	license: apache-2.0
	library_name: transformers
	tags:
	- transformers
	- cross-encoder
	- rerank
	datasets:
	- unicamp-dl/mmarco
	pipeline_tag: text-classification
	widget:
	- text: tỉnh nào có diện tích lớn nhất việt nam
	  output:
	  - label: nghệ an có diện tích lớn nhất việt nam
	    score: 0.99999
	  - label: bắc ninh có diện tích nhỏ nhất việt nam
	    score: 0.0001
	---

	# Reranker

	* [Usage](#usage)
	    * [Using FlagEmbedding](#using-flagembedding)
	    * [Using Huggingface transformers](#using-huggingface-transformers)
	* [Fine-tune](#fine-tune)
	    * [Data Format](#data-format)
	* [Performance](#performance)
	* [Contact](#contact)
	* [Support The Project](#support-the-project)
	* [Citation](#citation)

	Unlike an embedding model, a reranker takes a query and a document as input and directly outputs a similarity score
	instead of an embedding.
	Passing a query and a passage to the reranker yields a relevance score.
	This score can be mapped to a float value in [0, 1] with the sigmoid function.
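
As a minimal sketch of that mapping (not the library's own implementation), the logistic sigmoid can be written in plain Python. The helper name `sigmoid` is an illustrative assumption; the two raw scores are the ones reported in the usage examples in this README.

```python
import math


def sigmoid(score: float) -> float:
    """Map a raw relevance score (logit) to a value in (0, 1)."""
    if score >= 0:
        return 1.0 / (1.0 + math.exp(-score))
    # For negative inputs, use the algebraically equivalent form
    # so that exp() never receives a large positive argument.
    z = math.exp(score)
    return z / (1.0 + z)


# Raw scores taken from the usage examples in this README.
for raw in (13.71875, -8.53125):
    print(raw, sigmoid(raw))
```

A large positive raw score maps close to 1 and a large negative one close to 0, so the ordering of passages is unchanged by the mapping.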

	## Usage

	### Using FlagEmbedding

	```
	pip install -U FlagEmbedding
	```

	Get relevance scores (higher scores indicate more relevance):

	```python
	from FlagEmbedding import FlagReranker

	reranker = FlagReranker(
	    'namdp-ptit/ViRanker',
	    use_fp16=True,  # Setting use_fp16 to True speeds up computation with a slight performance degradation
	)

	# Score a single (query, passage) pair
	score = reranker.compute_score(['ai là vị vua cuối cùng của việt nam', 'vua bảo đại là vị vua cuối cùng của nước ta'])
	print(score)  # 13.71875

	# You can map the scores into 0-1 by setting "normalize=True", which applies the sigmoid function to the score
	score = reranker.compute_score(
	    ['ai là vị vua cuối cùng của việt nam', 'vua bảo đại là vị vua cuối cùng của nước ta'],
	    normalize=True,
	)
	print(score)  # 0.99999889840464

	# Score several (query, passage) pairs in one call
	scores = reranker.compute_score(
	    [
	        ['ai là vị vua cuối cùng của việt nam', 'vua bảo đại là vị vua cuối cùng của nước ta'],
	        ['ai là vị vua cuối cùng của việt nam', 'lý nam đế là vị vua đầu tiên của nước ta'],
	    ]
	)
	print(scores)  # [13.7265625, -8.53125]

	# "normalize=True" works the same way for a batch of pairs
	scores = reranker.compute_score(
	    [
	        ['ai là vị vua cuối cùng của việt nam', 'vua bảo đại là vị vua cuối cùng của nước ta'],
	        ['ai là vị vua cuối cùng của việt nam', 'lý nam đế là vị vua đầu tiên của nước ta'],
	    ],
	    normalize=True,
	)
	print(scores)  # [0.99999889840464, 0.00019716942196222918]
	```

	### Using Huggingface transformers

	```
	pip install -U transformers
	```

	Get relevance scores (higher scores indicate more relevance):

	```python
	import torch
	from transformers import AutoModelForSequenceClassification, AutoTokenizer

	tokenizer = AutoTokenizer.from_pretrained('namdp-ptit/ViRanker')
	model = AutoModelForSequenceClassification.from_pretrained('namdp-ptit/ViRanker')
	model.eval()

	pairs = [
	    ['ai là vị vua cuối cùng của việt nam', 'vua bảo đại là vị vua cuối cùng của nước ta'],
	    ['ai là vị vua cuối cùng của việt nam', 'lý nam đế là vị vua đầu tiên của nước ta'],
	]
	with torch.no_grad():
	    # Tokenize each (query, passage) pair together so the cross-encoder sees both texts
	    inputs = tokenizer(pairs, padding=True, truncation=True, return_tensors='pt', max_length=512)
	    # One logit per pair; higher means more relevant
	    scores = model(**inputs, return_dict=True).logits.view(-1).float()
	    print(scores)
	```
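
In a retrieval pipeline, these logits are typically used to reorder candidate passages. The following is a minimal sketch, assuming the scores have already been converted to plain floats (for example with `scores.tolist()`). The `rerank` helper is hypothetical and not part of transformers, and the example scores are the ones reported in the FlagEmbedding example above, used here as stand-ins for model output.

```python
from typing import List, Tuple


def rerank(passages: List[str], scores: List[float], top_k: int = 10) -> List[Tuple[str, float]]:
    """Return up to top_k (passage, score) pairs sorted by descending score."""
    if len(passages) != len(scores):
        raise ValueError("passages and scores must have the same length")
    ranked = sorted(zip(passages, scores), key=lambda item: item[1], reverse=True)
    return ranked[:top_k]


passages = [
    'lý nam đế là vị vua đầu tiên của nước ta',
    'vua bảo đại là vị vua cuối cùng của nước ta',
]
# Stand-in scores; in practice these would come from scores.tolist().
example_scores = [-8.53125, 13.7265625]

for passage, score in rerank(passages, example_scores, top_k=2):
    print(f"{score:.4f}\t{passage}")
```

Sorting on the raw logits gives the same order as sorting on sigmoid-normalized scores, so normalization is only needed when the values themselves are consumed downstream.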

	## Fine-tune

	### Data Format

	Training data should be a JSON Lines file, where each line is a JSON object like this:

	```
	{"query": str, "pos": List[str], "neg": List[str]}
	```

	`query` is the query, `pos` is a list of positive texts, and `neg` is a list of negative texts. If you have no
	negative texts for a query, you can randomly sample some from the entire corpus to use as negatives.

	In addition, for each query in the training data, we used LLMs to generate hard negatives by asking them to write a
	document that states the opposite of one of the documents in `pos`.
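
The data format and the random-negative fallback described above can be sketched as follows. This is an illustration only, not the script used to build the training set: `build_record`, `num_negatives`, and `seed` are hypothetical names, and the tiny corpus reuses sentences from this README.

```python
import json
import random
from typing import Dict, List


def build_record(query: str, positives: List[str], corpus: List[str],
                 num_negatives: int = 2, seed: int = 0) -> Dict[str, object]:
    """Build one training record, sampling random negatives from the corpus."""
    rng = random.Random(seed)
    # Exclude the known positives so they are never sampled as negatives.
    candidates = [doc for doc in corpus if doc not in positives]
    negatives = rng.sample(candidates, k=min(num_negatives, len(candidates)))
    return {"query": query, "pos": positives, "neg": negatives}


corpus = [
    'vua bảo đại là vị vua cuối cùng của nước ta',
    'lý nam đế là vị vua đầu tiên của nước ta',
    'nghệ an có diện tích lớn nhất việt nam',
    'bắc ninh có diện tích nhỏ nhất việt nam',
]
record = build_record(
    query='ai là vị vua cuối cùng của việt nam',
    positives=['vua bảo đại là vị vua cuối cùng của nước ta'],
    corpus=corpus,
)

# One JSON object per line; ensure_ascii=False keeps Vietnamese text readable.
line = json.dumps(record, ensure_ascii=False)
print(line)
```

Writing one such line per query to a file yields the training file. Random negatives are usually easy for the model, which is why the hard negatives mentioned above are a useful complement.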

	## Performance

	The table below compares our results with those of other pre-trained cross-encoders on
	the [MS MMarco Passage Reranking - Vi - Dev](https://huggingface.co/datasets/unicamp-dl/mmarco) dataset.

	| Model-Name | NDCG@3 | MRR@3 | NDCG@5 | MRR@5 | NDCG@10 | MRR@10 | Docs / Sec |
	|------------|:-------|:------|:-------|:------|:--------|:-------|:-----------|
	| [namdp-ptit/ViRanker](https://huggingface.co/namdp-ptit/ViRanker) | 0.6815 | 0.6641 | 0.6983 | 0.6894 | 0.7302 | 0.7107 | 2.02 |
	| [itdainb/PhoRanker](https://huggingface.co/itdainb/PhoRanker) | 0.6625 | 0.6458 | 0.7147 | 0.6731 | 0.7422 | 0.6830 | 15 |
	| [kien-vu-uet/finetuned-phobert-passage-rerank-best-eval](https://huggingface.co/kien-vu-uet/finetuned-phobert-passage-rerank-best-eval) | 0.0963 | 0.0883 | 0.1396 | 0.1131 | 0.1681 | 0.1246 | 15 |
	| [BAAI/bge-reranker-v2-m3](https://huggingface.co/BAAI/bge-reranker-v2-m3) | 0.6087 | 0.5841 | 0.6513 | 0.6062 | 0.6872 | 0.62091 | 3.51 |
	| [BAAI/bge-reranker-v2-gemma](https://huggingface.co/BAAI/bge-reranker-v2-gemma) | 0.6088 | 0.5908 | 0.6446 | 0.6108 | 0.6785 | 0.6249 | 1.29 |
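
For readers unfamiliar with the metrics in the table, here is a minimal sketch of MRR@k and NDCG@k for a single query, using the common textbook definitions with linear gain. This README does not state which evaluation script produced the numbers, so treat the functions below as an assumption about the metric definitions rather than the exact procedure used.

```python
import math
from typing import List


def mrr_at_k(relevances: List[int], k: int) -> float:
    """Reciprocal rank of the first relevant item in the top k, or 0.0 if none."""
    for rank, rel in enumerate(relevances[:k], start=1):
        if rel > 0:
            return 1.0 / rank
    return 0.0


def ndcg_at_k(relevances: List[int], k: int) -> float:
    """NDCG@k: DCG of the given ranking divided by DCG of the ideal ordering."""
    def dcg(rels: List[int]) -> float:
        return sum(rel / math.log2(pos + 1) for pos, rel in enumerate(rels[:k], start=1))

    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0


# Relevance labels of the passages in ranked order for one query (1 = relevant).
ranked_relevances = [0, 1, 0, 0, 0]
print(mrr_at_k(ranked_relevances, k=3))
print(ndcg_at_k(ranked_relevances, k=3))
```

A dataset-level score is then the mean of the per-query values over all evaluated queries.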

	## Contact

	Email: [email protected]

	LinkedIn: [Dang Phuong Nam](https://www.linkedin.com/in/dang-phuong-nam-157912288/)

	Facebook: [Phương Nam](https://www.facebook.com/phuong.namdang.7146557)

	## Support The Project

	If you find this project helpful and wish to support its ongoing development, here are some ways you can contribute:

	1. Star the Repository: Show your appreciation by starring the repository. Your support motivates further
	   development and enhancements.
	2. Contribute: We welcome your contributions! You can help by reporting bugs, submitting pull requests, or
	   suggesting new features.
	3. Donate: If you’d like to support financially, consider making a donation. You can donate through:
	   - Vietcombank: 9912692172 - DANG PHUONG NAM

	Thank you for your support!

	## Citation

	Please cite as:

	```bibtex
	@misc{ViRanker,
	title={ViRanker: A Cross-encoder Model for Vietnamese Text Ranking},
	author={Nam Dang Phuong},
	year={2024},
	publisher={Huggingface},
	}
	```