Cosine Similarity is very high with Japanese Sentences
I'm using the model as shown in the code below.
```python
from sentence_transformers import SentenceTransformer
from torch import nn
import torch
import numpy as np
st = SentenceTransformer("xlm-roberta-base")
a = "最高の商品でした。何回もリピートしてます。"  # "It was a great product. I've bought it again many times."
b = "使いづらい！二度と買いません。"  # "Hard to use! I'll never buy it again."
x = st.encode(a)
y = st.encode(b)
x = torch.from_numpy(x.astype(np.float32)).clone()
y = torch.from_numpy(y.astype(np.float32)).clone()
cos = nn.CosineSimilarity(dim=0, eps=1e-6)
print(cos(x, y))
```
These sentences (a and b) are not similar, but the cosine similarity score is 0.9980.
I'm very new to NLP, so I don't know how to resolve this problem.
I hope someone can help me.
Since you are using SentenceTransformer, try asking at https://github.com/UKPLab/sentence-transformers/issues
I used SentenceTransformer('paraphrase-multilingual-mpnet-base-v2') and got a cosine similarity of 0.43 for the two sentences in your example.
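For reference, here is a minimal sketch of that comparison. It reuses the two sentences from the question and `util.cos_sim` from sentence-transformers; the exact score may vary slightly with the library and model version.

```python
from sentence_transformers import SentenceTransformer, util

# Model fine-tuned for multilingual sentence similarity
st = SentenceTransformer("paraphrase-multilingual-mpnet-base-v2")

a = "最高の商品でした。何回もリピートしてます。"  # positive review
b = "使いづらい！二度と買いません。"              # negative review

emb = st.encode([a, b])              # one embedding per sentence
print(util.cos_sim(emb[0], emb[1]))  # ~0.43, instead of ~0.99 with xlm-roberta-base
```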
I am having a similar experience comparing Russian and English sentences. A pair of sentences that are completely different in meaning yields a very high cosine similarity with this particular model (xlm-roberta-base), but, as astyperand points out, a different model behaves as expected, i.e. high cosine similarity for sentences with similar meaning and low for those with dissimilar meaning. I would like to know if this is a reflection of the model or whether I am not using the model in the way intended.
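To see the difference in one place, here is a minimal sketch contrasting the two models mentioned in this thread on a Russian/English pair. The sentences below are made up for illustration, not the ones from my experiment.

```python
from sentence_transformers import SentenceTransformer, util

# Hypothetical pair with clearly different meanings (illustrative only)
ru = "Сегодня идёт сильный дождь."                 # "It is raining heavily today."
en = "The stock market closed higher on Friday."

for name in ("xlm-roberta-base", "paraphrase-multilingual-mpnet-base-v2"):
    model = SentenceTransformer(name)
    emb = model.encode([ru, en])
    print(name, util.cos_sim(emb[0], emb[1]).item())

# Expectation based on this thread: xlm-roberta-base reports a very high score
# even for unrelated sentences, while the paraphrase model reports a much lower one.
```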