Junseong committed on
Commit 0c1a0b0
1 Parent(s): 3a227db

Simplify usage; integrate Sentence Transformers (+ LlamaIndex/LangChain, etc.) (#1)


- Integrate Linq-Embed-Mistral with Sentence Transformers (e108f5efd5ecef01c5e7096883390705cfeec446)

1_Pooling/config.json ADDED
@@ -0,0 +1,10 @@
+{
+  "word_embedding_dimension": 4096,
+  "pooling_mode_cls_token": false,
+  "pooling_mode_mean_tokens": false,
+  "pooling_mode_max_tokens": false,
+  "pooling_mode_mean_sqrt_len_tokens": false,
+  "pooling_mode_weightedmean_tokens": false,
+  "pooling_mode_lasttoken": true,
+  "include_prompt": true
+}
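
This config selects last-token pooling over the model's 4096-dimensional hidden states (`pooling_mode_lasttoken: true`) and keeps the instruction prompt inside the pooled span (`include_prompt: true`). As a rough sketch of what the `Pooling` module computes, here is the standard last-token pooling recipe for Mistral-style embedding models; the `last_token_pool` helper name is ours, not part of the repository:

```python
import torch
from torch import Tensor

def last_token_pool(last_hidden_states: Tensor, attention_mask: Tensor) -> Tensor:
    """Pick the hidden state of each sequence's final non-padding token."""
    # With left padding, the last position is always a real token.
    left_padding = (attention_mask[:, -1].sum() == attention_mask.shape[0])
    if left_padding:
        return last_hidden_states[:, -1]
    # With right padding, index each row at its last non-padding position.
    sequence_lengths = attention_mask.sum(dim=1) - 1
    batch_size = last_hidden_states.shape[0]
    return last_hidden_states[torch.arange(batch_size, device=last_hidden_states.device), sequence_lengths]
```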
README.md CHANGED
@@ -2,6 +2,7 @@
 tags:
 - mteb
 - transformers
+- sentence-transformers
 model-index:
 - name: Linq-Embed-Mistral
   results:
@@ -1792,9 +1793,39 @@ For more details, refer to [this blog post](https://getlinq.com/blog/linq-embed-
 
 ## How to use
 
-### Transformers
-
-Here is an example of how to encode queries and passages from the Mr.TyDi training dataset.
+Here is an example of how to encode queries and passages from the Mr.TyDi training dataset, with either Sentence Transformers or Transformers directly.
+
+### Sentence Transformers
+
+```python
+from sentence_transformers import SentenceTransformer
+
+# Load the model
+model = SentenceTransformer("Linq-AI-Research/Linq-Embed-Mistral")
+
+# Each query must come with a one-sentence instruction that describes the task
+task = 'Given a question, retrieve Wikipedia passages that answer the question'
+prompt = f"Instruct: {task}\nQuery: "
+queries = [
+    "최초의 원자력 발전소는 무엇인가?",
+    "Who invented Hangul?"
+]
+passages = [
+    "현재 사용되는 핵분열 방식을 이용한 전력생산은 1948년 9월 미국 테네시주 오크리지에 설치된 X-10 흑연원자로에서 전구의 불을 밝히는 데 사용되면서 시작되었다. 그리고 1954년 6월에 구소련의 오브닌스크에 건설된 흑연감속 비등경수 압력관형 원자로를 사용한 오브닌스크 원자력 발전소가 시험적으로 전력생산을 시작하였고, 최초의 상업용 원자력 엉더이로를 사용한 영국 셀라필드 원자력 단지에 위치한 콜더 홀(Calder Hall) 원자력 발전소로, 1956년 10월 17일 상업 운전을 시작하였다.",
+    "Hangul was personally created and promulgated by the fourth king of the Joseon dynasty, Sejong the Great.[1][2] Sejong's scholarly institute, the Hall of Worthies, is often credited with the work, and at least one of its scholars was heavily involved in its creation, but it appears to have also been a personal project of Sejong."
+]
+
+# Encode the queries and passages. We only use the prompt for the queries
+query_embeddings = model.encode(queries, prompt=prompt)
+passage_embeddings = model.encode(passages)
+
+# Compute the (cosine) similarity scores
+scores = model.similarity(query_embeddings, passage_embeddings) * 100
+print(scores.tolist())
+# [[73.72908782958984, 30.122787475585938], [29.15508460998535, 79.25375366210938]]
+```
+
+### Transformers
 
 ```python
 import torch
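
The hunk's trailing context stops at the first line of the pre-existing Transformers example. For orientation only, here is a minimal sketch of that direct-Transformers path under the same recipe (instruction-prefixed queries, last-token pooling, cosine similarity); the `max_length` value and variable names are illustrative, not taken from the diff:

```python
import torch
import torch.nn.functional as F
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("Linq-AI-Research/Linq-Embed-Mistral")
model = AutoModel.from_pretrained("Linq-AI-Research/Linq-Embed-Mistral")

task = 'Given a question, retrieve Wikipedia passages that answer the question'
input_texts = [
    f"Instruct: {task}\nQuery: Who invented Hangul?",  # query gets the instruction
    "Hangul was personally created and promulgated by the fourth king of the Joseon dynasty, Sejong the Great.",  # passage does not
]

# Tokenize and run the encoder (assumes right padding; see the pooling sketch above)
batch = tokenizer(input_texts, max_length=4096, padding=True, truncation=True, return_tensors="pt")
with torch.no_grad():
    outputs = model(**batch)

# Last-token pooling, then L2 normalization and cosine similarity
sequence_lengths = batch["attention_mask"].sum(dim=1) - 1
embeddings = outputs.last_hidden_state[torch.arange(len(input_texts)), sequence_lengths]
embeddings = F.normalize(embeddings, p=2, dim=1)
scores = (embeddings[:1] @ embeddings[1:].T) * 100
print(scores.tolist())
```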
config_sentence_transformers.json ADDED
@@ -0,0 +1,10 @@
+{
+  "__version__": {
+    "sentence_transformers": "3.0.0",
+    "transformers": "4.41.2",
+    "pytorch": "2.3.0+cu121"
+  },
+  "prompts": {},
+  "default_prompt_name": null,
+  "similarity_fn_name": "cosine"
+}
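
Since `"prompts"` ships empty and `"default_prompt_name"` is null, the instruction prompt is not stored in this config; it must be passed per `encode()` call, or registered once at load time. A small sketch of the load-time option, assuming the standard Sentence Transformers `prompts` argument (the `"query"` prompt name and task wording are illustrative):

```python
from sentence_transformers import SentenceTransformer

task = 'Given a question, retrieve Wikipedia passages that answer the question'
model = SentenceTransformer(
    "Linq-AI-Research/Linq-Embed-Mistral",
    prompts={"query": f"Instruct: {task}\nQuery: "},  # registered once, reused by name
)

# Prompted queries, unprompted passages
query_embeddings = model.encode(["Who invented Hangul?"], prompt_name="query")
```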
modules.json ADDED
@@ -0,0 +1,14 @@
+[
+  {
+    "idx": 0,
+    "name": "0",
+    "path": "",
+    "type": "sentence_transformers.models.Transformer"
+  },
+  {
+    "idx": 1,
+    "name": "1",
+    "path": "1_Pooling",
+    "type": "sentence_transformers.models.Pooling"
+  }
+]
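
This file wires the two-stage pipeline: a `Transformer` module at the repository root feeding the `Pooling` module configured in `1_Pooling/`. A sketch of the equivalent manual construction, assuming the standard Sentence Transformers module API:

```python
from sentence_transformers import SentenceTransformer, models

# Module 0: the Transformer backbone (path "" = repository root)
word_embedding = models.Transformer("Linq-AI-Research/Linq-Embed-Mistral")

# Module 1: last-token pooling, matching 1_Pooling/config.json
pooling = models.Pooling(
    word_embedding.get_word_embedding_dimension(),  # 4096
    pooling_mode="lasttoken",
)

model = SentenceTransformer(modules=[word_embedding, pooling])
```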
sentence_bert_config.json ADDED
@@ -0,0 +1,4 @@
+{
+  "max_seq_length": 32768,
+  "do_lower_case": false
+}
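
`max_seq_length` caps tokenization at 32768 tokens; longer inputs are truncated. The limit is exposed as a mutable attribute on the loaded model, so it can be lowered to trade context length for memory, e.g.:

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("Linq-AI-Research/Linq-Embed-Mistral")
print(model.max_seq_length)  # 32768, read from sentence_bert_config.json

# Optionally lower the cap to reduce memory use on long inputs
model.max_seq_length = 4096
```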