nreimers committed on
Commit
294fb16
1 Parent(s): 10e09d2
1_Pooling/config.json ADDED
@@ -0,0 +1,7 @@
1
+ {
2
+ "word_embedding_dimension": 768,
3
+ "pooling_mode_cls_token": true,
4
+ "pooling_mode_mean_tokens": false,
5
+ "pooling_mode_max_tokens": false,
6
+ "pooling_mode_mean_sqrt_len_tokens": false
7
+ }
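The flags in this config select how per-token embeddings are reduced to a single vector; here only `pooling_mode_cls_token` is enabled. The following is a minimal pure-Python sketch of that selection, not the sentence-transformers implementation: the `pool` helper is hypothetical, it ignores padding and attention masks, and it assumes that multiple enabled modes would be concatenated.

```python
import json
import math

# The pooling config added in this commit (copied from 1_Pooling/config.json).
POOLING_CONFIG = json.loads("""
{
  "word_embedding_dimension": 768,
  "pooling_mode_cls_token": true,
  "pooling_mode_mean_tokens": false,
  "pooling_mode_max_tokens": false,
  "pooling_mode_mean_sqrt_len_tokens": false
}
""")


def pool(token_embeddings, config):
    """Hypothetical helper: pool per-token vectors according to the enabled flags.

    token_embeddings is a list of equal-length lists, one per token, with the
    [CLS] token first. Assumption: enabled modes are concatenated in this order.
    """
    n = len(token_embeddings)
    columns = list(zip(*token_embeddings))  # one tuple per embedding dimension
    pooled = []
    if config["pooling_mode_cls_token"]:
        pooled += list(token_embeddings[0])  # vector of the first ([CLS]) token
    if config["pooling_mode_max_tokens"]:
        pooled += [max(col) for col in columns]
    if config["pooling_mode_mean_tokens"]:
        pooled += [sum(col) / n for col in columns]
    if config["pooling_mode_mean_sqrt_len_tokens"]:
        pooled += [sum(col) / math.sqrt(n) for col in columns]
    return pooled


# Toy 3-token, 2-dimensional input (real inputs would be 768-dimensional).
tokens = [[1.0, 2.0], [3.0, 4.0], [5.0, 0.0]]
cls_vector = pool(tokens, POOLING_CONFIG)
print(cls_vector)
```

With this config, the result is simply the first token's vector, which matches the `cls_pooling` function in the README's Transformers example.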
README.md ADDED
@@ -0,0 +1,151 @@
1
+ ---
2
+ pipeline_tag: sentence-similarity
3
+ license: apache-2.0
4
+ tags:
5
+ - sentence-transformers
6
+ - feature-extraction
7
+ - sentence-similarity
8
+ - transformers
9
+ ---
10
+
11
+ # sentence-transformers/msmarco-bert-co-condensor
12
+
13
+ This is a port of the [Luyu/co-condenser-marco-retriever](https://huggingface.co/Luyu/co-condenser-marco-retriever) model to a [sentence-transformers](https://www.SBERT.net) model. It maps sentences and paragraphs to a 768-dimensional dense vector space and is optimized for semantic search.
14
+
15
+
16
+ It is based on the paper: [Unsupervised Corpus Aware Language Model Pre-training for Dense Passage Retrieval](https://arxiv.org/abs/2108.05540)
17
+
18
+
19
+ ## Evaluation
20
+
21
+ | Model | MS MARCO Dev (MRR@10) | TREC DL 2019 | TREC DL 2020 | FiQA (NDCG@10) | TREC COVID (NDCG@10) | TREC News (NDCG@10) | TREC Robust04 (NDCG@10) |
22
+ | ---- | :--: | :--: | :--: | :--: | :--: | :--: | :--: |
23
+ | [msmarco-bert-co-condensor](https://huggingface.co/sentence-transformers/msmarco-bert-co-condensor) | 35.51 | 68.16 | 69.13 | 26.04 | 66.89 | 28.54 | 30.71 |
24
+ | [msmarco-distilbert-base-tas-b](https://huggingface.co/sentence-transformers/msmarco-distilbert-base-tas-b) | 34.43 | 71.04 | 69.78 | 30.02 | 65.39 | 37.70 | 42.70 |
26
+ | [msmarco-distilbert-dot-v5](https://huggingface.co/sentence-transformers/msmarco-distilbert-dot-v5) | 37.25 | 70.14 | 71.08 | 28.61 | 71.96 | 37.88 | 38.29 | 44.19 |
27
+ | [msmarco-bert-base-dot-v5](https://huggingface.co/sentence-transformers/msmarco-bert-base-dot-v5) | 38.08 | 70.51 | 73.45 | 32.29 | 74.81 | 38.81 | 42.67 | 47.15 |
28
+ | [msmarco-roberta-base-ance-firstp](https://huggingface.co/sentence-transformers/msmarco-roberta-base-ance-firstp) | 33.01 | 67.84 | 66.04 | 29.5 | 67.12 | 38.2 | 39.2 |
29
+
30
+
31
+ For more details on the comparison, see: [SBERT.net - MSMARCO Models](https://www.sbert.net/docs/pretrained-models/msmarco-v5.html)
32
+
33
+ In the paper, Gao & Callan claim an MS MARCO Dev score of 38.2 (MRR@10). This is achieved by changing the benchmark: the original MS MARCO dataset provides only queries and text passages, from which you must retrieve the relevant passages for a given query.
34
+
35
+ In their [code](https://github.com/luyug/Dense/blob/454af38e06fe79aac8243b0fa31387c07ee874ab/examples/msmarco-passage-ranking/get_data.sh#L10), they combine the passages with the document titles from the MS MARCO document task, i.e. they train and evaluate their model with additional information from a different benchmark. In the above table, the score of 35.41 (MRR@10) is on the MS MARCO Passages benchmark as originally proposed, without the document titles.
36
+
37
+ They further trained their model with the document titles, which creates an information leakage: the document titles were reconstructed by the MS MARCO organizers at a later stage for the MS MARCO document benchmark, and it was not possible to reconstruct titles for all passages. However, the share of passages with a title differs between relevant and non-relevant passages: 71.9% of the relevant passages have a document title, while only 64.4% of the non-relevant passages have one. Hence, the model can learn that a passage with a document title is more likely to be annotated as relevant. It can then base its decision not on the passage content, but on the artifact of whether a title is present.
38
+
39
+ The information leakage and the change of the benchmark likely lead to the inflated scores reported in the paper.
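To make the size of this artifact concrete, the two title rates quoted above can be plugged into Bayes' rule. The sketch below is only an illustration: the prior of 1 relevant passage in 1000 is a hypothetical value chosen for the example, and `posterior_relevance` is a helper written for it, not part of any benchmark tooling.

```python
# Title-presence rates reported in the README text above.
P_TITLE_GIVEN_RELEVANT = 0.719
P_TITLE_GIVEN_NON_RELEVANT = 0.644


def posterior_relevance(prior: float, has_title: bool) -> float:
    """Bayes' rule: P(relevant | title present or absent), given a prior P(relevant)."""
    if has_title:
        like_rel, like_non = P_TITLE_GIVEN_RELEVANT, P_TITLE_GIVEN_NON_RELEVANT
    else:
        like_rel, like_non = 1 - P_TITLE_GIVEN_RELEVANT, 1 - P_TITLE_GIVEN_NON_RELEVANT
    evidence = like_rel * prior + like_non * (1 - prior)
    return like_rel * prior / evidence


# Hypothetical prior: assume 1 in 1000 candidate passages is relevant.
prior = 0.001
with_title = posterior_relevance(prior, has_title=True)
without_title = posterior_relevance(prior, has_title=False)
likelihood_ratio = P_TITLE_GIVEN_RELEVANT / P_TITLE_GIVEN_NON_RELEVANT

print(f"likelihood ratio (title present): {likelihood_ratio:.3f}")
print(f"P(relevant | title):    {with_title:.6f}")
print(f"P(relevant | no title): {without_title:.6f}")
```

Whatever prior is assumed, a present title raises the odds of relevance and a missing title lowers them, so title presence alone carries a signal that is unrelated to the passage content.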
40
+
41
+
42
+ ## Usage (Sentence-Transformers)
43
+
44
+ Using this model becomes easy when you have [sentence-transformers](https://www.SBERT.net) installed:
45
+
46
+ ```
47
+ pip install -U sentence-transformers
48
+ ```
49
+
50
+ Then you can use the model like this:
51
+
52
+ ```python
+ from sentence_transformers import SentenceTransformer, util
+
+ query = "How many people live in London?"
+ docs = ["Around 9 Million people live in London", "London is known for its financial district"]
+
+ # Load the model
+ model = SentenceTransformer('sentence-transformers/msmarco-bert-co-condensor')
+
+ # Encode query and documents
+ query_emb = model.encode(query)
+ doc_emb = model.encode(docs)
+
+ # Compute dot score between query and all document embeddings
+ scores = util.dot_score(query_emb, doc_emb)[0].cpu().tolist()
+
+ # Combine docs & scores
+ doc_score_pairs = list(zip(docs, scores))
+
+ # Sort by decreasing score
+ doc_score_pairs = sorted(doc_score_pairs, key=lambda x: x[1], reverse=True)
+
+ # Output passages & scores
+ for doc, score in doc_score_pairs:
+     print(score, doc)
+ ```
78
+
79
+
80
+
81
+ ## Usage (HuggingFace Transformers)
82
+ Without [sentence-transformers](https://www.SBERT.net), you can use the model like this: first, pass your input through the transformer model, then apply the right pooling operation on top of the contextualized word embeddings.
83
+
84
+ ```python
+ from transformers import AutoTokenizer, AutoModel
+ import torch
+
+ # CLS Pooling - Take output from first token
+ def cls_pooling(model_output):
+     return model_output.last_hidden_state[:, 0]
+
+ # Encode text
+ def encode(texts):
+     # Tokenize sentences
+     encoded_input = tokenizer(texts, padding=True, truncation=True, return_tensors='pt')
+
+     # Compute token embeddings
+     with torch.no_grad():
+         model_output = model(**encoded_input, return_dict=True)
+
+     # Perform pooling
+     embeddings = cls_pooling(model_output)
+
+     return embeddings
+
+
+ # Sentences we want sentence embeddings for
+ query = "How many people live in London?"
+ docs = ["Around 9 Million people live in London", "London is known for its financial district"]
+
+ # Load model from HuggingFace Hub
+ tokenizer = AutoTokenizer.from_pretrained("sentence-transformers/msmarco-bert-co-condensor")
+ model = AutoModel.from_pretrained("sentence-transformers/msmarco-bert-co-condensor")
+
+ # Encode query and docs
+ query_emb = encode(query)
+ doc_emb = encode(docs)
+
+ # Compute dot score between query and all document embeddings
+ scores = torch.mm(query_emb, doc_emb.transpose(0, 1))[0].cpu().tolist()
+
+ # Combine docs & scores
+ doc_score_pairs = list(zip(docs, scores))
+
+ # Sort by decreasing score
+ doc_score_pairs = sorted(doc_score_pairs, key=lambda x: x[1], reverse=True)
+
+ # Output passages & scores
+ for doc, score in doc_score_pairs:
+     print(score, doc)
+ ```
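The usage examples in this README rank documents by the raw dot product between query and document embeddings. As a rough illustration of why the scoring function matters, the sketch below ranks two toy vectors by dot product and by cosine similarity. The embeddings are invented 2-dimensional values, and `dot`, `cosine`, and `rank` are helpers written for this example only.

```python
import math


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def cosine(u, v):
    """Cosine similarity: dot product of the length-normalized vectors."""
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))


def rank(query_emb, doc_embs, docs, score_fn):
    """Score every document against the query and sort by decreasing score."""
    scored = [(score_fn(query_emb, emb), doc) for emb, doc in zip(doc_embs, docs)]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


# Toy 2-dimensional embeddings; the values are invented for illustration.
query_emb = [1.0, 0.0]
docs = ["doc A", "doc B"]
doc_embs = [[1.0, 0.0], [3.0, 3.0]]  # doc B points elsewhere but is longer

by_dot = rank(query_emb, doc_embs, docs, dot)
by_cosine = rank(query_emb, doc_embs, docs, cosine)
print([doc for _, doc in by_dot])
print([doc for _, doc in by_cosine])
```

The two rankings disagree because the dot product rewards longer vectors while cosine similarity ignores length, so swapping one scoring function for the other can change retrieval results.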
132
+
133
+
134
+
135
+ ## Evaluation Results
136
+
137
+
138
+ For an automated evaluation of this model, see the *Sentence Embeddings Benchmark*: [https://seb.sbert.net](https://seb.sbert.net?model_name=sentence-transformers/msmarco-bert-co-condensor)
139
+
140
+
141
+ ## Full Model Architecture
142
+ ```
+ SentenceTransformer(
+   (0): Transformer({'max_seq_length': 256, 'do_lower_case': False}) with Transformer model: BertModel
+   (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': True, 'pooling_mode_mean_tokens': False, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False})
+ )
+ ```
148
+
149
+ ## Citing & Authors
150
+
151
+ Have a look at: [Unsupervised Corpus Aware Language Model Pre-training for Dense Passage Retrieval](https://arxiv.org/abs/2108.05540)
config.json ADDED
@@ -0,0 +1,30 @@
1
+ {
2
+ "_name_or_path": "Luyu/co-condenser-marco-retriever",
3
+ "architectures": [
4
+ "BertModel"
5
+ ],
6
+ "attention_probs_dropout_prob": 0.1,
7
+ "gradient_checkpointing": false,
8
+ "hidden_act": "gelu",
9
+ "hidden_dropout_prob": 0.1,
10
+ "hidden_size": 768,
11
+ "id2label": {
12
+ "0": "LABEL_0"
13
+ },
14
+ "initializer_range": 0.02,
15
+ "intermediate_size": 3072,
16
+ "label2id": {
17
+ "LABEL_0": 0
18
+ },
19
+ "layer_norm_eps": 1e-12,
20
+ "max_position_embeddings": 512,
21
+ "model_type": "bert",
22
+ "num_attention_heads": 12,
23
+ "num_hidden_layers": 12,
24
+ "pad_token_id": 0,
25
+ "position_embedding_type": "absolute",
26
+ "transformers_version": "4.6.1",
27
+ "type_vocab_size": 2,
28
+ "use_cache": true,
29
+ "vocab_size": 30522
30
+ }
config_sentence_transformers.json ADDED
@@ -0,0 +1,7 @@
1
+ {
2
+ "__version__": {
3
+ "sentence_transformers": "2.0.0",
4
+ "transformers": "4.6.1",
5
+ "pytorch": "1.8.1"
6
+ }
7
+ }
modules.json ADDED
@@ -0,0 +1,14 @@
1
+ [
2
+ {
3
+ "idx": 0,
4
+ "name": "0",
5
+ "path": "",
6
+ "type": "sentence_transformers.models.Transformer"
7
+ },
8
+ {
9
+ "idx": 1,
10
+ "name": "1",
11
+ "path": "1_Pooling",
12
+ "type": "sentence_transformers.models.Pooling"
13
+ }
14
+ ]
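This file lists the pipeline stages in order: a Transformer stored at the repository root, followed by a Pooling module stored in `1_Pooling`. The sketch below shows how such a list could be read with the standard `json` module. It only parses the file content shown in this diff and is not the sentence-transformers loading code; the `<repository root>` label is a placeholder chosen for the printout.

```python
import json

# Content of modules.json as added in this commit.
MODULES_JSON = """
[
  {"idx": 0, "name": "0", "path": "", "type": "sentence_transformers.models.Transformer"},
  {"idx": 1, "name": "1", "path": "1_Pooling", "type": "sentence_transformers.models.Pooling"}
]
"""

# Order the stages by their index so they are applied in sequence.
modules = sorted(json.loads(MODULES_JSON), key=lambda m: m["idx"])

pipeline = []
for module in modules:
    class_name = module["type"].rsplit(".", 1)[-1]   # e.g. "Pooling"
    location = module["path"] or "<repository root>"  # empty path means the root folder
    pipeline.append(class_name)
    print(f"stage {module['idx']}: {class_name} loaded from {location}")
```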
pytorch_model.bin ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:ad753a56c5af3fbbf2bdfd5f5c1b89e5e9303435aba8322b4259e20b32f194e0
3
+ size 438012727
sentence_bert_config.json ADDED
@@ -0,0 +1,4 @@
1
+ {
2
+ "max_seq_length": 256,
3
+ "do_lower_case": false
4
+ }
special_tokens_map.json ADDED
@@ -0,0 +1 @@
1
+ {"unk_token": "[UNK]", "sep_token": "[SEP]", "pad_token": "[PAD]", "cls_token": "[CLS]", "mask_token": "[MASK]"}
tokenizer.json ADDED
The diff for this file is too large to render. See raw diff
 
tokenizer_config.json ADDED
@@ -0,0 +1 @@
1
+ {"do_lower_case": true, "unk_token": "[UNK]", "sep_token": "[SEP]", "pad_token": "[PAD]", "cls_token": "[CLS]", "mask_token": "[MASK]", "tokenize_chinese_chars": true, "strip_accents": null, "do_basic_tokenize": true, "never_split": null, "model_max_length": 512, "name_or_path": "Luyu/co-condenser-marco-retriever", "special_tokens_map_file": "/bos/tmp0/luyug/outputs/condenser/models/l2-s6-km-L128-e8-lr1e-4-b256/special_tokens_map.json"}
vocab.txt ADDED
The diff for this file is too large to render. See raw diff