Is this model aligned/censored?

#17 opened by xiliny

Dear authors, we are using NV-Embed-v2 in a research project on online hate speech detection. Specifically, we compute embeddings for social media comments with NV-Embed-v2 and train a logistic regression classifier to label them as hate or non-hate speech. The resulting classifier achieves 93% test accuracy, which is comparable to the reported 92.74% on ToxicConversationsClassification. Perhaps surprisingly, when we manually inspect the misclassified instances, many of them contain obvious and relatively short racial slurs.
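For reference, here is a minimal sketch of the kind of pipeline we are describing (not our exact code): frozen NV-Embed-v2 embeddings with a scikit-learn logistic regression probe on top. The model loading follows the sentence-transformers usage shown on the NV-Embed-v2 model card; the toy texts, labels, sequence length, and hyperparameters below are placeholders, not our real data or settings.

```python
# Hypothetical sketch: NV-Embed-v2 embeddings + logistic regression classifier.
from sentence_transformers import SentenceTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Load the embedding model (custom architecture requires trust_remote_code).
model = SentenceTransformer("nvidia/NV-Embed-v2", trust_remote_code=True)
model.max_seq_length = 4096          # placeholder; shorter than the model's maximum
model.tokenizer.padding_side = "right"

# Toy stand-ins for the real train/test splits of social media comments.
train_texts = ["example comment 1", "example comment 2"]
train_labels = [0, 1]                # 0 = non-hate, 1 = hate
test_texts = ["example comment 3"]
test_labels = [1]

# Encode comments into dense vectors; normalized embeddings feed the linear probe.
X_train = model.encode(train_texts, batch_size=8, normalize_embeddings=True)
X_test = model.encode(test_texts, batch_size=8, normalize_embeddings=True)

# Fit a simple logistic regression on top of the frozen embeddings.
clf = LogisticRegression(max_iter=1000)
clf.fit(X_train, train_labels)

# Evaluate and collect misclassified comments for manual inspection.
preds = clf.predict(X_test)
print("test accuracy:", accuracy_score(test_labels, preds))
misclassified = [t for t, p, y in zip(test_texts, preds, test_labels) if p != y]
print("misclassified:", misclassified)
```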

To the best of our knowledge, the base model Mistral 7B is uncensored (in fact, that is one of its selling points), so I wonder whether you intentionally or accidentally applied some form of alignment during fine-tuning that makes the model less sensitive to these racial slurs. If not, could you please share your thoughts on why the model would succeed at detecting long, complicated hate-speech sentences but fail on the short and obvious ones?
