Cosine Similarity is very high with Japanese Sentences
I'm using the model as shown in the code below.
```python
from sentence_transformers import SentenceTransformer
from torch import nn
import torch
import numpy as np
st = SentenceTransformer("xlm-roberta-base")
a = "最高の商品でした。何回もリピートしてます。"  # "It was a great product. I've bought it again many times."
b = "使いづらい！二度と買いません。"  # "Hard to use! I'll never buy it again."
x = st.encode(a)
y = st.encode(b)
x = torch.from_numpy(x.astype(np.float32)).clone()
y = torch.from_numpy(y.astype(np.float32)).clone()
cos = nn.CosineSimilarity(dim=0, eps=1e-6)
print(cos(x, y))
```
These sentences (a and b) are not similar, but the cosine similarity score is 0.9980.
I'm very new to NLP, so I don't know how to resolve this problem.
I hope someone can help me.
Since you are using SentenceTransformer, try asking at https://github.com/UKPLab/sentence-transformers/issues
I used SentenceTransformer('paraphrase-multilingual-mpnet-base-v2') and got a cosine similarity of 0.43 for the two sentences in your example.
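For reference, here is a minimal sketch of that comparison. It reuses the two sentences from the question and `util.cos_sim` from sentence-transformers; the exact score may vary slightly with the library and model version.

```python
from sentence_transformers import SentenceTransformer, util

# Model fine-tuned for multilingual sentence similarity
st = SentenceTransformer("paraphrase-multilingual-mpnet-base-v2")

a = "最高の商品でした。何回もリピートしてます。"  # positive review
b = "使いづらい！二度と買いません。"              # negative review

emb = st.encode([a, b])              # one embedding per sentence
print(util.cos_sim(emb[0], emb[1]))  # ~0.43, instead of ~0.99 with xlm-roberta-base
```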
I am having a similar experience comparing Russian and English sentences. A pair of sentences that are completely different in meaning yields a very high cosine similarity with this particular model (xlm-roberta-base), but, as astyperand points out, a different model behaves as expected, i.e. high cosine similarity for sentences with similar meaning and low for those with dissimilar meaning. I would like to know if this is a reflection of the model or whether I am not using the model in the way intended.
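To see the difference in one place, here is a minimal sketch contrasting the two models mentioned in this thread on a Russian/English pair. The sentences below are made up for illustration, not the ones from my experiment.

```python
from sentence_transformers import SentenceTransformer, util

# Hypothetical pair with clearly different meanings (illustrative only)
ru = "Сегодня идёт сильный дождь."                 # "It is raining heavily today."
en = "The stock market closed higher on Friday."

for name in ("xlm-roberta-base", "paraphrase-multilingual-mpnet-base-v2"):
    model = SentenceTransformer(name)
    emb = model.encode([ru, en])
    print(name, util.cos_sim(emb[0], emb[1]).item())

# Expectation based on this thread: xlm-roberta-base reports a very high score
# even for unrelated sentences, while the paraphrase model reports a much lower one.
```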