Files changed (1)
  1. README.md +59 -0
README.md CHANGED
@@ -1,3 +1,62 @@
1
  ---
2
  license: cc-by-nc-4.0
3
  ---
4
+
5
+ # BLASER 2.0
6
+ [[Paper]]()
7
+
8
+ BLASER 2.0 is the new version of BLASER ([Chen et al., 2023](https://aclanthology.org/2023.acl-long.504/)),
9
+ a family of models for automatic evaluation of machine translation quality.
10
+
11
+ BLASER 2.0 is based on [SONAR](https://huggingface.co/facebook/SONAR) sentence embeddings
12
+ and works with both speech and text modalities.
13
+
14
+ This model predicts a similarity score for a translated sentence from the source sentence and the translation alone.
15
+ Thus, it can be applied in settings where reference translations are missing or of questionable quality.
16
+ In contrast, its sibling model, [BLASER 2.0-referenced](https://huggingface.co/facebook/blaser-2.0-ref), also requires a reference translation.
17
+
18
+ Supervised BLASER models are trained to predict cross-lingual semantic similarity scores,
19
+ XSTS ([Licht et al., 2022](https://aclanthology.org/2022.amta-research.24/)),
20
+ on a scale where 1 corresponds to completely unrelated sentences and
21
+ 5 corresponds to fully semantically equivalent sentences.
22
+ The models' predictions, though, are unbounded and can occasionally fall outside this range.
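
Because predictions are unbounded, downstream code that expects values on the nominal 1 to 5 XSTS scale may want to clip them. The sketch below is one possible way to do that. The helper name `clamp_xsts` and the sample values are illustrative assumptions, not part of SONAR or BLASER, and whether clipping is appropriate depends on the application.

```python
def clamp_xsts(score: float, low: float = 1.0, high: float = 5.0) -> float:
    """Clip a raw prediction to the nominal XSTS range [low, high].

    Hypothetical helper for illustration; not part of the SONAR package.
    """
    return max(low, min(high, score))


# Made-up raw scores used only to show the clipping behaviour.
raw_scores = [4.708, 5.31, 0.62]
print([clamp_xsts(s) for s in raw_scores])  # [4.708, 5.0, 1.0]
```

Clipping discards information about how far a prediction overshoots, so for ranking translations it may be preferable to keep the raw scores.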
23
+
24
+ ## Installation
25
+
26
+ See the SONAR GitHub [repo](https://github.com/facebookresearch/SONAR) for installation instructions.
27
+
28
+ ## Usage
29
+
30
+ BLASER 2.0 models accept 1024-dimensional SONAR sentence embeddings as inputs,
31
+ and produce a single score as an output.
32
+ The code below illustrates their usage with text embeddings:
33
+
34
+ ```Python
35
+ from sonar.inference_pipelines.text import TextToEmbeddingModelPipeline
36
+ from sonar.models.blaser.loader import load_blaser_model
37
+
38
+ blaser = load_blaser_model("blaser_2_0_qe").eval()
39
+ text_embedder = TextToEmbeddingModelPipeline(encoder="text_sonar_basic_encoder", tokenizer="text_sonar_basic_encoder")
40
+
41
+ src_embs = text_embedder.predict(["Le chat s'assit sur le tapis."], source_lang="fra_Latn")
42
+ mt_embs = text_embedder.predict(["The cat sat down on the carpet."], source_lang="eng_Latn")
43
+ print(blaser(src=src_embs, mt=mt_embs).item()) # 4.708
44
+ ```
45
+
46
+ With BLASER 2.0 models, SONAR text and speech embeddings can be used interchangeably.
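
As a rough sketch of that interchangeability, the function below scores text translations of spoken source segments. It assumes the `SpeechToEmbeddingModelPipeline` class and the `sonar_speech_encoder_eng` encoder name described in the SONAR repository, and that the source audio is English. The function name and its parameters are hypothetical, and the sketch has not been checked against a particular SONAR release.

```python
def score_speech_translation(src_audio_paths, mt_texts, mt_lang):
    """Score text translations of spoken source segments with BLASER 2.0 QE.

    Hypothetical helper: src_audio_paths is a list of audio file paths,
    mt_texts the corresponding translations, and mt_lang their language code
    (for example "fra_Latn").
    """
    # Imports live inside the function so that defining it does not require
    # SONAR to be installed.
    from sonar.inference_pipelines.speech import SpeechToEmbeddingModelPipeline
    from sonar.inference_pipelines.text import TextToEmbeddingModelPipeline
    from sonar.models.blaser.loader import load_blaser_model

    blaser = load_blaser_model("blaser_2_0_qe").eval()
    # Assumption: the source audio is English, matching this encoder name.
    speech_embedder = SpeechToEmbeddingModelPipeline(encoder="sonar_speech_encoder_eng")
    text_embedder = TextToEmbeddingModelPipeline(
        encoder="text_sonar_basic_encoder", tokenizer="text_sonar_basic_encoder"
    )

    # Both pipelines are expected to return 1024-dimensional SONAR embeddings.
    src_embs = speech_embedder.predict(src_audio_paths)
    mt_embs = text_embedder.predict(mt_texts, source_lang=mt_lang)
    return blaser(src=src_embs, mt=mt_embs).detach().squeeze(-1).tolist()
```

A call would look like `score_speech_translation(["clip.wav"], ["Le chat s'assit sur le tapis."], mt_lang="fra_Latn")`, where `clip.wav` stands in for a real audio file.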
47
+
48
+
49
+ ## Model details
50
+
51
+ - **Developed by:** Seamless Communication et al.
52
+ - **License:** CC-BY-NC 4.0
53
+ - **Citation:** If you use BLASER 2.0 in your work, please cite:
54
+
55
+ ```bibtex
56
+ @article{seamlessm4t2023,
57
+ title={SeamlessM4T—Massively Multilingual \& Multimodal Machine Translation},
58
+ author={{Seamless Communication} and Lo\"{i}c Barrault and Yu-An Chung and Mariano Cora Meglioli and David Dale and Ning Dong and Paul-Ambroise Duquenne and Hady Elsahar and Hongyu Gong and Kevin Heffernan and John Hoffman and Christopher Klaiber and Pengwei Li and Daniel Licht and Jean Maillard and Alice Rakotoarison and Kaushik Ram Sadagopan and Guillaume Wenzek and Ethan Ye and Bapi Akula and Peng-Jen Chen and Naji El Hachem and Brian Ellis and Gabriel Mejia Gonzalez and Justin Haaheim and Prangthip Hansanti and Russ Howes and Bernie Huang and Min-Jae Hwang and Hirofumi Inaguma and Somya Jain and Elahe Kalbassi and Amanda Kallet and Ilia Kulikov and Janice Lam and Daniel Li and Xutai Ma and Ruslan Mavlyutov and Benjamin Peloquin and Mohamed Ramadan and Abinesh Ramakrishnan and Anna Sun and Kevin Tran and Tuan Tran and Igor Tufanov and Vish Vogeti and Carleigh Wood and Yilin Yang and Bokai Yu and Pierre Andrews and Can Balioglu and Marta R. Costa-juss\`{a} and Onur \c{C}elebi and Maha Elbayad and Cynthia Gao and Francisco Guzm\'{a}n and Justine Kao and Ann Lee and Alexandre Mourachko and Juan Pino and Sravya Popuri and Christophe Ropers and Safiyyah Saleem and Holger Schwenk and Paden Tomasello and Changhan Wang and Jeff Wang and Skyler Wang},
59
+ journal={ArXiv},
60
+ year={2023}
61
+ }
62
+ ```