
FinLang/finance-embeddings-investopedia

This is the Investopedia embedding model for finance applications by the FinLang team. The model is trained on our open-sourced finance dataset from https://huggingface.co/datasets/FinLang/investopedia-embedding-dataset

This is an embedding model fine-tuned on top of BAAI/bge-base-en-v1.5. It maps sentences and paragraphs to a 768-dimensional dense vector space and can be used for tasks such as clustering or semantic search in RAG applications.

This project is for research purposes only. Third-party datasets may be subject to additional terms and conditions under their associated licenses.

Plans

  • The research paper will be published soon.
  • We are working on a v2 of the model, trained on a larger corpus of financial data with improved embedding-training techniques.

Usage (LlamaIndex)

Simply specify the FinLang embedding model during the indexing step of your financial RAG application.

from llama_index.embeddings import HuggingFaceEmbedding

# Load the FinLang model as the embedder for indexing and querying.
embed_model = HuggingFaceEmbedding(model_name="FinLang/finance-embeddings-investopedia")
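
As a minimal end-to-end sketch (assuming the same 0.9-era llama_index API as the import above; the data/ directory and the query are hypothetical placeholders):

from llama_index import VectorStoreIndex, ServiceContext, SimpleDirectoryReader

# Hypothetical corpus: any directory of financial documents.
documents = SimpleDirectoryReader("data").load_data()

# llm=None so the sketch exercises only embedding-based retrieval.
service_context = ServiceContext.from_defaults(embed_model=embed_model, llm=None)
index = VectorStoreIndex.from_documents(documents, service_context=service_context)

# Retrieve the top-3 most relevant chunks for a question.
retriever = index.as_retriever(similarity_top_k=3)
nodes = retriever.retrieve("Who controls the keys in a custodial wallet?")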

Usage (Sentence-Transformers)

Using this model is straightforward once you have sentence-transformers installed (see https://huggingface.co/sentence-transformers):

pip install -U sentence-transformers

Then you can use the model like this:

from sentence_transformers import SentenceTransformer

sentences = ["This is an example sentence", "Each sentence is converted"]

# Load the model and encode each sentence into a 768-dimensional vector.
model = SentenceTransformer("FinLang/finance-embeddings-investopedia")
embeddings = model.encode(sentences)
print(embeddings)

Example: scoring the similarity of a query/answer pair:

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("FinLang/finance-embeddings-investopedia")

query_1 = "What is a potential concern with allowing someone else to store your cryptocurrency keys, and is it possible to decrypt a private key?"
query_2 = "A potential concern is that the entity holding your keys has control over your cryptocurrency in a custodial relationship. While it is theoretically possible to decrypt a private key, with current technology, it would take centuries or millennia for the 115 quattuorvigintillion possibilities. Most hacks and thefts occur in wallets, where private keys are stored."

embedding_1 = model.encode(query_1)
embedding_2 = model.encode(query_2)

# The BGE base model L2-normalizes its outputs, so cosine similarity
# equals the dot product of the two embeddings.
score = util.cos_sim(embedding_1, embedding_2)
print(score) # 0.862
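
Since the model targets semantic search in RAG pipelines, here is a minimal retrieval sketch using the standard sentence-transformers util.semantic_search helper (the mini-corpus and query below are illustrative, not part of the training data):

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("FinLang/finance-embeddings-investopedia")

# Illustrative mini-corpus of finance snippets.
corpus = [
    "A custodial wallet is one where a third party holds your private keys.",
    "A bond is a fixed-income instrument representing a loan from an investor to a borrower.",
    "Diversification spreads investments across assets to reduce risk.",
]
corpus_embeddings = model.encode(corpus)

query_embedding = model.encode("Who controls the keys in a custodial wallet?")

# Retrieve the top-2 most similar corpus entries by cosine similarity.
hits = util.semantic_search(query_embedding, corpus_embeddings, top_k=2)[0]
for hit in hits:
    print(corpus[hit["corpus_id"]], round(hit["score"], 3))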

Evaluation Results

We evaluate our model on unseen pairs of sentences for similarity and on unseen shuffled pairs of sentences for dissimilarity. Our evaluation suite contains sentence pairs from Investopedia (to test proficiency on finance) and from GooAQ, MS MARCO, stackexchange_duplicate_questions_title_title, and yahoo_answers_title_answer (to check that the model does not forget general-domain ability after fine-tuning).
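
For illustration, a minimal sketch of this similar-vs-shuffled protocol (the pairs below are invented examples, not drawn from the actual evaluation suite):

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("FinLang/finance-embeddings-investopedia")

# Illustrative held-out pairs: matched (similar) and shuffled (dissimilar).
similar_pairs = [("What is a dividend?",
                  "A dividend is a distribution of a company's earnings to shareholders.")]
shuffled_pairs = [("What is a dividend?",
                   "A candlestick chart displays the high, low, open, and close prices.")]

for a, b in similar_pairs + shuffled_pairs:
    score = util.cos_sim(model.encode(a), model.encode(b)).item()
    print(f"{score:.3f}  {a!r} vs {b!r}")
# Similar pairs should score noticeably higher than shuffled pairs.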

License

Since non-commercial datasets are used for fine-tuning, we release this model under the cc-by-nc-4.0 license.

Citation [Coming Soon]
