SafeGuardAI Deliverables

Task 1 - Dealing with the data

Deliverable 1 Describe the default chunking strategy that you will use.

We started off using SemanticChunker based on our understanding that it would provide high-quality retrieval of the knowledge and enable the LLM to generate a response for a given question.
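A minimal sketch of this default setup (the PDF path, the LangChain package layout, and the Snowflake/snowflake-arctic-embed-l model name are assumptions used for illustration):

```python
# Minimal sketch of the default chunking setup; paths and package names are illustrative.
from langchain_community.document_loaders import PyMuPDFLoader
from langchain_experimental.text_splitter import SemanticChunker
from langchain_huggingface import HuggingFaceEmbeddings

# Load the source PDF (e.g. the AI Bill of Rights) into LangChain Documents.
docs = PyMuPDFLoader("data/ai_bill_of_rights.pdf").load()

# SemanticChunker splits where embedding similarity between adjacent sentences
# drops, rather than cutting at a fixed character count.
embeddings = HuggingFaceEmbeddings(model_name="Snowflake/snowflake-arctic-embed-l")
chunks = SemanticChunker(embeddings).split_documents(docs)
```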

Deliverable 2 Articulate a chunking strategy that you would also like to test out.

I would like to explore RecursiveCharacterTextSplitter together with some advanced RAG techniques such as Contextual Compression or the Parent Document Retriever.
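A minimal sketch of that alternative (it assumes the `docs` loaded above and a `vectorstore` such as the Qdrant store described under Deliverable 3; the chunk sizes and compressor model are illustrative):

```python
# Minimal sketch: recursive character chunking plus contextual compression.
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain.retrievers import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import LLMChainExtractor
from langchain_openai import ChatOpenAI

# Fixed-size chunks with overlap; these sizes are illustrative starting points.
splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)
chunks = splitter.split_documents(docs)

# Contextual compression trims each retrieved chunk down to the passages that
# are actually relevant to the query before they reach the LLM.
compressor = LLMChainExtractor.from_llm(ChatOpenAI(model="gpt-4o-mini"))
compression_retriever = ContextualCompressionRetriever(
    base_compressor=compressor,
    base_retriever=vectorstore.as_retriever(search_kwargs={"k": 5}),  # assumed Qdrant store
)
```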

Deliverable 3 Describe how and why you made these decisions

From past experience, and from doing this midterm project, I understood one thing: "Start with simple changes and iterate over time. Any small change in the RAG setup drastically impacts the performance and quality of the output." So I wanted to keep things simple, smart, and easy to change.

Since loading huge PDFs and populating the vector store is time consuming, I ended up doing a pre-processing step so that the vector store is kept ready to be consumed; refer to the notebook in the Task1 folder. To decide on a chunking strategy I worked in a separate notebook to try the different options; refer to Task1, and see Task1-Detailed for a detailed workout of this task.
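A minimal sketch of that pre-processing step (the on-disk path, collection name, and vector size are assumptions; 1024 matches arctic-embed-l):

```python
# Minimal sketch: build the Qdrant collection once, then reuse it at app start-up.
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, VectorParams
from langchain_qdrant import QdrantVectorStore

client = QdrantClient(path="task1/qdrant_store")  # persisted on disk for reuse
client.create_collection(
    collection_name="ai_safety_docs",
    vectors_config=VectorParams(size=1024, distance=Distance.COSINE),  # arctic-embed-l dimension
)

vectorstore = QdrantVectorStore(
    client=client,
    collection_name="ai_safety_docs",
    embedding=embeddings,  # the embedding model from the sketch above
)
vectorstore.add_documents(chunks)  # run once; the app only reads the store afterwards
```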

Task 2 - Building a Quick End-to-End Prototype

Deliverable 1 Build a prototype and deploy it to a Hugging Face Space; include the public URL link to your Space and create a short (< 2 min) Loom video demonstrating some initial testing inputs and outputs.

HF URL:
Loom Video:

Deliverable 2 How did you choose your stack, and why did you select each tool the way you did?

1. Qdrant: Provides efficient, scalable, and fast vector search for embedding retrieval, which is crucial for the RAG framework.
2. PyMuPDFLoader: Lightweight and fast PDF parsing, ideal for loading structured and dense documents like the AI Bill of Rights.
3. RecursiveCharacterTextSplitter: Allows flexible chunking while preserving semantic context, improving retrieval precision.
4. SemanticChunker: Enables semantically rich text chunking, leading to better coherence and improved retrieval results.
5. Snowflake-Arctic-Embed-L embedding model: A smaller, efficient model providing a good balance between speed and accuracy for embedding text chunks in RAG systems, and one of the highly ranked models on the MTEB leaderboard.
6. Context Enrichment and Contextual Compression: Enhance the retrieval process by providing more targeted, concise, and context-rich answers.
7. Tracing (LangSmith): Works natively with the other frameworks in the stack and helps us understand issues in the RAG chain.
8. RAGAS: Used to evaluate our RAG system with different configurations, which helps improve the performance of the RAG application.

This stack was designed with a focus on balancing performance, scalability, and precision to build an effective Retrieval-Augmented Generation (RAG) application.

Each tool in this stack was chosen to ensure speed, scalability, and accuracy while dealing with structured and unstructured documents. By balancing performance with precision (e.g., fast document loading via PyMuPDFLoader, efficient chunking strategies, and a small but powerful embedding model), this stack provides a robust framework for building ethical and useful AI applications.

Refer to the additional notes in Task2.
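A minimal sketch of how these pieces come together in one chain (it assumes the `compression_retriever` from the Task 1 sketches; the prompt wording and chat model are illustrative):

```python
# Minimal sketch of the end-to-end RAG chain in LCEL style.
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnablePassthrough
from langchain_core.output_parsers import StrOutputParser
from langchain_openai import ChatOpenAI

prompt = ChatPromptTemplate.from_template(
    "Answer using only the context below.\n\nContext:\n{context}\n\nQuestion: {question}"
)

def format_docs(docs):
    # Concatenate retrieved chunks into a single context string.
    return "\n\n".join(d.page_content for d in docs)

rag_chain = (
    {"context": compression_retriever | format_docs, "question": RunnablePassthrough()}
    | prompt
    | ChatOpenAI(model="gpt-4o-mini")
    | StrOutputParser()
)

print(rag_chain.invoke("What protections does the AI Bill of Rights describe?"))
```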

Task 3 - Creating a Golden Test DataSet

Deliverable 1 Assess your pipeline using the RAGAS framework including key metrics faithfulness, answer relevancy, context precision, and context recall. Provide a table of your output results.

Result, Condensed Result
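A minimal sketch of how these metrics were produced with RAGAS (it assumes a golden `test_set` of (question, ground truth) pairs and the `rag_chain`/`compression_retriever` from the earlier sketches; column names follow recent RAGAS releases):

```python
# Minimal sketch of the RAGAS evaluation loop.
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import (
    faithfulness,
    answer_relevancy,
    context_precision,
    context_recall,
    answer_correctness,
)

rows = []
for question, ground_truth in test_set:  # assumed golden test set
    contexts = [d.page_content for d in compression_retriever.invoke(question)]
    rows.append({
        "question": question,
        "answer": rag_chain.invoke(question),
        "contexts": contexts,
        "ground_truth": ground_truth,
    })

results = evaluate(
    Dataset.from_list(rows),
    metrics=[faithfulness, answer_relevancy, context_precision, context_recall, answer_correctness],
)
print(results.to_pandas())
```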

Deliverable 2 What conclusions can you draw about performance and effectiveness of your pipeline with this information?

Some observations from the results

  • Faithfulness: Mostly high faithfulness scores, indicating that the generated answers are generally true to the source material. However, there are some low scores (e.g., 0.233333), which show that the model may occasionally provide unfaithful or incomplete answers.

  • Answer Relevancy: The model seems to perform well in answer relevancy, with most scores being near 1. This suggests that even when faithfulness is low, the answers provided are still on-topic and relevant to the user's question.

  • Context Recall & Precision: There are several instances where context recall is 0.0, indicating that the context was not helpful in answering the question. However, when context recall is high, context precision is often perfect (1.0), showing that when the context is relevant, it is precise and accurate.

  • Answer Correctness: This metric shows a range of results. Although many answers are correct, a few are only partially correct, suggesting room for improvement in the correctness of generated answers.

The pipeline performs well in generating relevant answers, but some improvements can be made to enhance the faithfulness and correctness of those answers.

The context recall metric has room for improvement. There are several cases where relevant context is missing or inadequate, which can impact the overall effectiveness of the pipeline.

Task 4 - Generate synthetic fine-tuning data and complete fine-tuning of the open-source embedding model

Deliverable 1 Swap out your existing embedding model for the new fine-tuned version. Provide a link to your fine-tuned embedding model on the Hugging Face Hub.

HF Model Link for finetuned model: https://huggingface.co/jeevanions/finetuned_arctic-embedd-l
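The swap itself is a one-line change (sketched below); because the embedding space changes, the vector store also has to be rebuilt with the new model:

```python
# Minimal sketch of swapping in the fine-tuned embedding model.
from langchain_huggingface import HuggingFaceEmbeddings

# Baseline:   Snowflake/snowflake-arctic-embed-l
# Fine-tuned: the model published on the Hub above
embeddings = HuggingFaceEmbeddings(model_name="jeevanions/finetuned_arctic-embedd-l")
# Then re-run the Task 1 pre-processing step to re-embed the chunks into Qdrant.
```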

Deliverable 2 How did you choose the embedding model for this application?

The embedding model snowflake-arctic-embed-l is ranked 27 on the MTEB leaderboard, with an embedding dimension of 1024 and 334 million parameters. Despite its small size, it seriously contends with the top players. It is also easier to download and train with fewer GPU resource constraints: low cost but efficient.
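For reference, a minimal sketch of the fine-tuning itself (the synthetic (question, chunk) pairs, batch size, and epoch count are assumptions for illustration):

```python
# Minimal sketch of fine-tuning arctic-embed-l with sentence-transformers.
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

model = SentenceTransformer("Snowflake/snowflake-arctic-embed-l")

# Each example pairs an LLM-generated question with the chunk that answers it.
train_examples = [InputExample(texts=[q, chunk]) for q, chunk in synthetic_pairs]  # assumed data
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=8)

# In-batch negatives: other chunks in the batch act as non-relevant documents.
train_loss = losses.MultipleNegativesRankingLoss(model)
model.fit(train_objectives=[(train_dataloader, train_loss)], epochs=1, warmup_steps=50)

model.push_to_hub("jeevanions/finetuned_arctic-embedd-l")
```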

Task 5 - Assessing Performance

1. Test the fine-tuned embedding model using the RAGAS framework to quantify any improvements. Provide results in a table.

Based on the comparison of the fine-tuned model against the baseline model using the RAGAS framework, here are the key metrics evaluated:

| Metric | Fine-Tuned Model | Baseline Model | Improvement |
| --- | --- | --- | --- |
| Faithfulness | 0.5826 | 0.5011 | +8.15% |
| Answer Relevancy | 0.9422 | 0.8765 | +6.57% |
| Context Recall | 0.2716 | 0.2283 | +4.33% |
| Context Precision | 0.4460 | 0.3907 | +5.53% |
| Answer Correctness | 0.6179 | 0.5541 | +6.38% |

Comparison, Improvement

2. Test the two chunking strategies using the RAGAS framework to quantify any improvements. Provide results in a table.

For chunking strategies (Recursive Character vs. Semantic Chunker), the following metrics were tested:

| Metric | Recursive Chunking | Semantic Chunking | Improvement |
| --- | --- | --- | --- |
| Faithfulness | 0.5901 | 0.5826 | -0.75% |
| Answer Relevancy | 0.9500 | 0.9422 | -0.78% |
| Context Recall | 0.3000 | 0.2716 | -2.84% |
| Context Precision | 0.4590 | 0.4460 | -1.30% |
| Answer Correctness | 0.6220 | 0.6179 | -0.41% |

Comparison

I expected the Semantic chunker to perform better, but noticed that the retrieved documents contained fewer words per sentence, and some duplicate documents were fetched. I would need to spend more time understanding why this happened. For now, with less complexity, the RecursiveCharacterTextSplitter works well, and implementing advanced RAG techniques on top of it does improve performance.

3. Which one is the best to test with internal stakeholders next week, and why?

The fine-tuned embedding model paired with Recursive Character Chunking is the optimal choice for testing with internal stakeholders. This combination has shown slight improvements in key metrics like context recall and context precision over semantic chunking. Moreover, the fine-tuned model demonstrates enhanced faithfulness and answer relevancy, making it more reliable for enterprise-level queries, especially in handling dense documents like the AI Bill of Rights and NIST RMF. While the difference is not dramatic, the Recursive Character Chunking ensures better handling of varied document structures, making it the best candidate for real-world testing.

For detailed reports and the notebooks used, refer to the Task5 folder.

Task 6 - Managing Your Boss and User Expectations

1. What is the story that you will give to the CEO to tell the whole company at the launch next month?

The story for the CEO should emphasize how the company's investment in AI technology is yielding tangible results and preparing the organization for the future of AI governance.

CEO Narrative:

"Over the past months, our dedicated team of engineers and AI specialists have worked on a groundbreaking Retrieval-Augmented Generation (RAG) application. We’ve tested the system rigorously across 50 internal stakeholders, collecting feedback and refining the performance metrics based on cutting-edge frameworks like RAGAS. This initiative places us at the forefront of ethical AI adoption, ensuring our systems align with upcoming regulatory frameworks such as the 2024 NIST AI Risk Management Framework and the AI Bill of Rights.

We’ve created a scalable and adaptable AI system that not only answers complex questions related to AI safety but also ensures that our internal teams have the tools to address customer concerns, especially in today's rapidly changing AI landscape. By next month, we’ll be ready to roll out the solution across the enterprise, showing our commitment to innovation, security, and accountability."

2. How might you incorporate relevant White House briefing information into future versions?

In future versions of the RAG application, it is critical to keep it aligned with national regulations like the executive order on Safe, Secure, and Trustworthy AI. Here's how:

  • Compliance & Transparency: Ensure that future updates integrate guidelines from the White House’s executive orders on AI governance. This can be achieved by:

    • Embedding legal updates into the knowledge base of the RAG system. This will allow the system to reflect the latest compliance requirements and help internal stakeholders understand how AI aligns with regulations.
    • Implementing an alert system within the RAG tool to notify users whenever regulatory changes impact AI practices.
  • Periodic Review: Set a quarterly update schedule where AI regulations, such as the 270-day update mentioned, are reviewed, and the RAG application is updated accordingly. This ensures that the company stays proactive regarding AI safety and governance.

At the moment these guidelines are loaded into the system offline; in the future we will roll out features that keep the system constantly updated on any new legislation.
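A minimal sketch of what such an update job could look like (it assumes new briefings land as PDFs in a watched folder and reuses the `splitter` and `vectorstore` from the earlier sketches; scheduling, e.g. a cron job, is external):

```python
# Minimal sketch of a periodic ingestion job for newly published briefings.
from pathlib import Path
from langchain_community.document_loaders import PyMuPDFLoader

def ingest_new_briefings(folder: str = "data/new_briefings") -> int:
    """Chunk and add newly published briefings to the existing collection."""
    added = 0
    for pdf in Path(folder).glob("*.pdf"):
        docs = PyMuPDFLoader(str(pdf)).load()
        vectorstore.add_documents(splitter.split_documents(docs))
        added += len(docs)
    return added
```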

This approach will position the company not only as an industry leader but also as a responsible organization that prioritizes safe and ethical AI use.