Open-Source AI Cookbook documentation

Building A RAG System with Gemma, MongoDB and Open Source Models

Open-Source AI Cookbook

Join the Hugging Face community

and get access to the augmented documentation experience

Collaborate on models, datasets and Spaces

Faster examples with accelerated inference

Switch between documentation themes

to get started

Building A RAG System with Gemma, MongoDB and Open Source Models

Authored By: Richmond Alake

Step 1: Installing Libraries

The shell command sequence below installs libraries for leveraging open-source large language models (LLMs), embedding models, and database interaction functionalities. These libraries simplify the development of a RAG system, reducing the complexity to a small amount of code:

PyMongo: A Python library for interacting with MongoDB that enables functionalities to connect to a cluster and query data stored in collections and documents.
Pandas: Provides a data structure for efficient data processing and analysis using Python
Hugging Face datasets: Holds audio, vision, and text datasets
Hugging Face Accelerate: Abstracts the complexity of writing code that leverages hardware accelerators such as GPUs. Accelerate is leveraged in the implementation to utilise the Gemma model on GPU resources.
Hugging Face Transformers: Access to a vast collection of pre-trained models
Hugging Face Sentence Transformers: Provides access to sentence, text, and image embeddings.

!pip install datasets pandas pymongo sentence_transformers
!pip install -U transformers
# Install below if using GPU
!pip install accelerate

Step 2: Data sourcing and preparation

The data utilised in this tutorial is sourced from Hugging Face datasets, specifically the AIatMongoDB/embedded_movies dataset.

# Load Dataset
from datasets import load_dataset
import pandas as pd

# https://huggingface.co/datasets/AIatMongoDB/embedded_movies
dataset = load_dataset("AIatMongoDB/embedded_movies")

# Convert the dataset to a pandas dataframe
dataset_df = pd.DataFrame(dataset["train"])

dataset_df.head(5)

The operations within the following code snippet below focus on enforcing data integrity and quality.

The first process ensures that each data point’s fullplot attribute is not empty, as this is the primary data we utilise in the embedding process.
This step also ensures we remove the plot_embedding attribute from all data points as this will be replaced by new embeddings created with a different embedding model, the gte-large.

>>> # Data Preparation

>>> # Remove data point where plot coloumn is missing
>>> dataset_df = dataset_df.dropna(subset=["fullplot"])
>>> print("\nNumber of missing values in each column after removal:")
>>> print(dataset_df.isnull().sum())

>>> # Remove the plot_embedding from each data point in the dataset as we are going to create new embeddings with an open source embedding model from Hugging Face
>>> dataset_df = dataset_df.drop(columns=["plot_embedding"])
>>> dataset_df.head(5)

Number of missing values in each column after removal:
num_mflix_comments      0
genres                  0
countries               0
directors              12
fullplot                0
writers                13
awards                  0
runtime                14
type                    0
rated                 279
metacritic            893
poster                 78
languages               1
imdb                    0
plot                    0
cast                    1
plot_embedding          1
title                   0
dtype: int64

Step 3: Generating embeddings

The steps in the code snippets are as follows:

Import the SentenceTransformer class to access the embedding models.
Load the embedding model using the SentenceTransformer constructor to instantiate the gte-large embedding model.
Define the get_embedding function, which takes a text string as input and returns a list of floats representing the embedding. The function first checks if the input text is not empty (after stripping whitespace). If the text is empty, it returns an empty list. Otherwise, it generates an embedding using the loaded model.
Generate embeddings by applying the get_embedding function to the “fullplot” column of the dataset_df DataFrame, generating embeddings for each movie’s plot. The resulting list of embeddings is assigned to a new column named embedding.

Note: It’s not necessary to chunk the text in the full plot, as we can ensure that the text length remains within a manageable range.

from sentence_transformers import SentenceTransformer

# https://huggingface.co/thenlper/gte-large
embedding_model = SentenceTransformer("thenlper/gte-large")


def get_embedding(text: str) -> list[float]:
    if not text.strip():
        print("Attempted to get embedding for empty text.")
        return []

    embedding = embedding_model.encode(text)

    return embedding.tolist()


dataset_df["embedding"] = dataset_df["fullplot"].apply(get_embedding)

dataset_df.head()

Step 4: Database setup and connection

MongoDB acts as both an operational and a vector database. It offers a database solution that efficiently stores, queries and retrieves vector embeddings—the advantages of this lie in the simplicity of database maintenance, management and cost.

To create a new MongoDB database, set up a database cluster:

Head over to MongoDB official site and register for a free MongoDB Atlas account, or for existing users, sign into MongoDB Atlas.
Select the ‘Database’ option on the left-hand pane, which will navigate to the Database Deployment page, where there is a deployment specification of any existing cluster. Create a new database cluster by clicking on the “+Create” button.
Select all the applicable configurations for the database cluster. Once all the configuration options are selected, click the “Create Cluster” button to deploy the newly created cluster. MongoDB also enables the creation of free clusters on the “Shared Tab”.

Note: Don’t forget to whitelist the IP for the Python host or 0.0.0.0/0 for any IP when creating proof of concepts.
After successfully creating and deploying the cluster, the cluster becomes accessible on the ‘Database Deployment’ page.
Click on the “Connect” button of the cluster to view the option to set up a connection to the cluster via various language drivers.
This tutorial only requires the cluster’s URI(unique resource identifier). Grab the URI and copy it into the Google Colabs Secrets environment in a variable named MONGO_URI or place it in a .env file or equivalent.

4.1 Database and Collection Setup

Before moving forward, ensure the following prerequisites are met

Database cluster set up on MongoDB Atlas
Obtained the URI to your cluster

For assistance with database cluster setup and obtaining the URI, refer to our guide for setting up a MongoDB cluster and getting your connection string

Once you have created a cluster, create the database and collection within the MongoDB Atlas cluster by clicking + Create Database in the cluster overview page.

Here is a guide for creating a database and collection

The database will be named movies.

The collection will be named movie_collection_2.

Step 5: Create a Vector Search Index

At this point make sure that your vector index is created via MongoDB Atlas.

This next step is mandatory for conducting efficient and accurate vector-based searches based on the vector embeddings stored within the documents in the movie_collection_2 collection.

Creating a Vector Search Index enables the ability to traverse the documents efficiently to retrieve documents with embeddings that match the query embedding based on vector similarity.

Go here to read more about MongoDB Vector Search Index.

{
 "fields": [{
     "numDimensions": 1024,
     "path": "embedding",
     "similarity": "cosine",
     "type": "vector"
   }]
}

The 1024 value of the numDimension field corresponds to the dimension of the vector generated by the gte-large embedding model. If you use the gte-base or gte-small embedding models, the numDimension value in the vector search index must be set to 768 and 384, respectively.

Step 6: Establish Data Connection

The code snippet below also utilises PyMongo to create a MongoDB client object, representing the connection to the cluster and enabling access to its databases and collections.

>>> import pymongo
>>> from google.colab import userdata


>>> def get_mongo_client(mongo_uri):
...     """Establish connection to the MongoDB."""
...     try:
...         client = pymongo.MongoClient(mongo_uri)
...         print("Connection to MongoDB successful")
...         return client
...     except pymongo.errors.ConnectionFailure as e:
...         print(f"Connection failed: {e}")
...         return None


... mongo_uri = userdata.get("MONGO_URI")
... if not mongo_uri:
...     print("MONGO_URI not set in environment variables")

... mongo_client = get_mongo_client(mongo_uri)

... # Ingest data into MongoDB
... db = mongo_client["movies"]
... collection = db["movie_collection_2"]

Connection to MongoDB successful

# Delete any existing records in the collection
collection.delete_many({})

Ingesting data into a MongoDB collection from a pandas DataFrame is a straightforward process that can be efficiently accomplished by converting the DataFrame into dictionaries and then utilising the insert_many method on the collection to pass the converted dataset records.

>>> documents = dataset_df.to_dict("records")
>>> collection.insert_many(documents)

>>> print("Data ingestion into MongoDB completed")

Data ingestion into MongoDB completed

Step 7: Perform Vector Search on User Queries

The following step implements a function that returns a vector search result by generating a query embedding and defining a MongoDB aggregation pipeline.

The pipeline, consisting of the $vectorSearch and $project stages, executes queries using the generated vector and formats the results to include only the required information, such as plot, title, and genres while incorporating a search score for each result.

def vector_search(user_query, collection):
    """
    Perform a vector search in the MongoDB collection based on the user query.

    Args:
    user_query (str): The user's query string.
    collection (MongoCollection): The MongoDB collection to search.

    Returns:
    list: A list of matching documents.
    """

    # Generate embedding for the user query
    query_embedding = get_embedding(user_query)

    if query_embedding is None:
        return "Invalid query or embedding generation failed."

    # Define the vector search pipeline
    pipeline = [
        {
            "$vectorSearch": {
                "index": "vector_index",
                "queryVector": query_embedding,
                "path": "embedding",
                "numCandidates": 150,  # Number of candidate matches to consider
                "limit": 4,  # Return top 4 matches
            }
        },
        {
            "$project": {
                "_id": 0,  # Exclude the _id field
                "fullplot": 1,  # Include the plot field
                "title": 1,  # Include the title field
                "genres": 1,  # Include the genres field
                "score": {"$meta": "vectorSearchScore"},  # Include the search score
            }
        },
    ]

    # Execute the search
    results = collection.aggregate(pipeline)
    return list(results)

Step 8: Handling user queries and loading Gemma

def get_search_result(query, collection):

    get_knowledge = vector_search(query, collection)

    search_result = ""
    for result in get_knowledge:
        search_result += f"Title: {result.get('title', 'N/A')}, Plot: {result.get('fullplot', 'N/A')}\n"

    return search_result

>>> # Conduct query with retrival of sources
>>> query = "What is the best romantic movie to watch and why?"
>>> source_information = get_search_result(query, collection)
>>> combined_information = (
...     f"Query: {query}\nContinue to answer the query by using the Search Results:\n{source_information}."
... )

>>> print(combined_information)

Query: What is the best romantic movie to watch and why?
Continue to answer the query by using the Search Results:
Title: Shut Up and Kiss Me!, Plot: Ryan and Pete are 27-year old best friends in Miami, born on the same day and each searching for the perfect woman. Ryan is a rookie stockbroker living with his psychic Mom. Pete is a slick surfer dude yet to find commitment. Each meets the women of their dreams on the same day. Ryan knocks heads in an elevator with the gorgeous Jessica, passing out before getting her number. Pete falls for the insatiable Tiara, but Tiara's uncle is mob boss Vincent Bublione, charged with her protection. This high-energy romantic comedy asks to what extent will you go for true love?
Title: Pearl Harbor, Plot: Pearl Harbor is a classic tale of romance set during a war that complicates everything. It all starts when childhood friends Rafe and Danny become Army Air Corps pilots and meet Evelyn, a Navy nurse. Rafe falls head over heels and next thing you know Evelyn and Rafe are hooking up. Then Rafe volunteers to go fight in Britain and Evelyn and Danny get transferred to Pearl Harbor. While Rafe is off fighting everything gets completely whack and next thing you know everybody is in the middle of an air raid we now know as "Pearl Harbor."
Title: Titanic, Plot: The plot focuses on the romances of two couples upon the doomed ship's maiden voyage. Isabella Paradine (Catherine Zeta-Jones) is a wealthy woman mourning the loss of her aunt, who reignites a romance with former flame Wynn Park (Peter Gallagher). Meanwhile, a charming ne'er-do-well named Jamie Perse (Mike Doyle) steals a ticket for the ship, and falls for a sweet innocent Irish girl on board. But their romance is threatened by the villainous Simon Doonan (Tim Curry), who has discovered about the ticket and makes Jamie his unwilling accomplice, as well as having sinister plans for the girl.
Title: China Girl, Plot: A modern day Romeo & Juliet story is told in New York when an Italian boy and a Chinese girl become lovers, causing a tragic conflict between ethnic gangs.
.

from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("google/gemma-2b-it")
# CPU Enabled uncomment below 👇🏽
# model = AutoModelForCausalLM.from_pretrained("google/gemma-2b-it")
# GPU Enabled use below 👇🏽
model = AutoModelForCausalLM.from_pretrained("google/gemma-2b-it", device_map="auto")

>>> # Moving tensors to GPU
>>> input_ids = tokenizer(combined_information, return_tensors="pt").to("cuda")
>>> response = model.generate(**input_ids, max_new_tokens=500)
>>> print(tokenizer.decode(response[0]))

Based on the search results, the best romantic movie to watch is **Shut Up and Kiss Me!** because it is a romantic comedy that explores the complexities of love and relationships. The movie is funny, heartwarming, and thought-provoking.

< > Update on GitHub

←Building a RAG System with Gemma, Elasticsearch and Open Source Models Simple RAG using Hugging Face Zephyr and LangChain→