# Deliverable 1

Assess your pipeline using the RAGAS framework, including the key metrics faithfulness, answer relevancy, context precision, and context recall. Provide a table of your output results.

## Pipeline configuration

1. Qdrant - Cloud-hosted vector database
2. PyMuPDFLoader - PDF loader from LangChain
3. Snowflake/snowflake-arctic-embed-l - Open-source embedding model
4. SemanticChunker & RecursiveCharacterTextSplitter with contextual compression - Chunking strategy [Note that SemanticChunker seems to be unreliable and produces duplicate chunks]
5. gpt-4o-mini - Generator LLM
6. gpt-4o - Critic LLM
7. Distribution - simple 0.5, multi_context 0.3, and reasoning 0.2
8. RAGAS metrics - faithfulness, answer_relevancy, context_recall, context_precision, answer_correctness
9. Synthetic questions generated - 269

A code sketch of this generation and evaluation setup is included at the end of this write-up.

![task3-del1](/task3-del1.png)

![task3-del1](/task3-del11.png)

# Deliverable 2

What conclusions can you draw about the performance and effectiveness of your pipeline with this information?

## Observations

Some observations from the results:

- **Faithfulness**: Mostly high faithfulness scores, indicating that the generated answers are generally true to the source material. However, a few low scores (e.g., 0.233333) show that the model may occasionally produce unfaithful or incomplete answers.
- **Answer Relevancy**: The model performs well on answer relevancy, with most scores near 1. This suggests that even when faithfulness is low, the answers are still on-topic and relevant to the user's question.
- **Context Recall & Precision**: There are several instances where **context recall** is 0.0, indicating that the retrieved context was not helpful in answering the question. However, when context recall is high, **context precision** is often perfect (1.0), showing that when the context is relevant, it is also precise and accurate.
- **Answer Correctness**: This metric shows a range of results. Although many answers are correct, a few are only partially correct, suggesting room for improvement in the correctness of generated answers.

Overall, the pipeline performs well at generating relevant answers, but faithfulness and correctness can still be improved. The **context recall** metric has the most room for improvement: in several cases the relevant context is missing or inadequate, which impacts the overall effectiveness of the pipeline.
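For reference, below is a minimal sketch of how the configuration above can be wired together for synthetic test set generation and RAGAS scoring. It assumes the ragas 0.1.x APIs; `docs/source.pdf` and `rag_chain` (the Qdrant-backed retrieval + gpt-4o-mini generation chain) are placeholders, not the actual project code.

```python
# Hedged sketch of test set generation + RAGAS evaluation (ragas 0.1.x
# APIs assumed). `docs/source.pdf` and `rag_chain` are placeholders.
from datasets import Dataset
from langchain_community.document_loaders import PyMuPDFLoader
from langchain_huggingface import HuggingFaceEmbeddings
from langchain_openai import ChatOpenAI
from ragas import evaluate
from ragas.metrics import (
    answer_correctness,
    answer_relevancy,
    context_precision,
    context_recall,
    faithfulness,
)
from ragas.testset.evolutions import multi_context, reasoning, simple
from ragas.testset.generator import TestsetGenerator

# Load the source PDF with PyMuPDFLoader (placeholder path).
documents = PyMuPDFLoader("docs/source.pdf").load()

# Generator (gpt-4o-mini), critic (gpt-4o), and the open-source
# snowflake-arctic-embed-l embedding model used by the pipeline.
generator = TestsetGenerator.from_langchain(
    generator_llm=ChatOpenAI(model="gpt-4o-mini"),
    critic_llm=ChatOpenAI(model="gpt-4o"),
    embeddings=HuggingFaceEmbeddings(
        model_name="Snowflake/snowflake-arctic-embed-l"
    ),
)

# 269 synthetic questions with the simple/multi_context/reasoning split.
testset = generator.generate_with_langchain_docs(
    documents,
    test_size=269,
    distributions={simple: 0.5, multi_context: 0.3, reasoning: 0.2},
)

# Run the RAG chain over each synthetic question to collect answers and
# retrieved contexts. `rag_chain` stands in for the Qdrant-backed
# retrieval + generation chain and is assumed here to return a dict
# with "answer" and "context" keys.
rows = {"question": [], "answer": [], "contexts": [], "ground_truth": []}
for record in testset.to_pandas().to_dict("records"):
    response = rag_chain.invoke({"question": record["question"]})
    rows["question"].append(record["question"])
    rows["answer"].append(response["answer"])
    rows["contexts"].append([doc.page_content for doc in response["context"]])
    rows["ground_truth"].append(record["ground_truth"])

# Score the five RAGAS metrics and dump the per-question results table.
results = evaluate(
    Dataset.from_dict(rows),
    metrics=[
        faithfulness,
        answer_relevancy,
        context_recall,
        context_precision,
        answer_correctness,
    ],
)
print(results.to_pandas())
```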