
Kuldeep Singh Sidhu

singhsidhukuldeep

AI & ML interests

Seeking contributors for a completely open-source 🚀 Data Science platform! singhsidhukuldeep.github.io

singhsidhukuldeep's activity

posted an update 1 day ago
Exciting Research Alert: Revolutionizing Dense Passage Retrieval with Entailment Tuning!

The good folks at HKUST have developed a novel approach that significantly improves information retrieval by leveraging natural language inference.

The entailment tuning approach consists of several key steps to enhance dense passage retrieval performance.

Data Preparation
- Convert questions into existence claims using rule-based transformations.
- Combine retrieval data with NLI data from SNLI and MNLI datasets.
- Unify the format of both data types using a consistent prompting framework.

Entailment Tuning Process
- Initialize the model using pre-trained language models like BERT or RoBERTa.
- Apply aggressive masking (β=0.8) specifically to the hypothesis components while preserving premise information.
- Train the model to predict the masked hypothesis tokens from the premise content.
- Run the training for 10 epochs using 8 GPUs, taking approximately 1.5-3.5 hours.
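
To make the masking step concrete, here is a minimal sketch of masking hypothesis tokens at rate β=0.8 while leaving the premise visible, using an off-the-shelf BERT masked-LM head. This is my own illustration of the idea, not the authors' code.

```python
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")

premise = "The Eiffel Tower was completed in 1889 in Paris."
hypothesis = "There exists a year in which the Eiffel Tower was completed."

enc = tokenizer(premise, hypothesis, return_tensors="pt")
labels = enc["input_ids"].clone()

# token_type_ids == 1 marks the hypothesis segment in BERT-style pair encoding
hypothesis_positions = enc["token_type_ids"].bool()
special = torch.tensor(
    tokenizer.get_special_tokens_mask(enc["input_ids"][0].tolist(), already_has_special_tokens=True)
).bool().unsqueeze(0)

beta = 0.8
to_mask = hypothesis_positions & ~special & (torch.rand(enc["input_ids"].shape) < beta)

enc["input_ids"][to_mask] = tokenizer.mask_token_id
labels[~to_mask] = -100  # compute loss only on the masked hypothesis tokens

loss = model(**enc, labels=labels).loss  # train the model to reconstruct the hypothesis from the premise
loss.backward()
```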

Training Arguments for Entailment Tuning (Yes! They Shared Them)
- Use a learning rate of 2e-5 with 100 warmup steps.
- Set batch size to 128.
- Apply weight decay of 0.01.
- Utilize the Adam optimizer with beta values (0.9, 0.999).
- Maintain maximum gradient norm at 1.0.
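
Expressed as Hugging Face TrainingArguments (just a convenient way to write the shared hyperparameters down; the authors' own training script may be organized differently), this would look roughly like:

```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="entailment-tuning",
    num_train_epochs=10,
    learning_rate=2e-5,
    warmup_steps=100,
    per_device_train_batch_size=16,  # assuming batch size 128 is the global batch across 8 GPUs
    weight_decay=0.01,
    adam_beta1=0.9,
    adam_beta2=0.999,
    max_grad_norm=1.0,
)
```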

Deployment
- Index passages using FAISS for efficient retrieval.
- Shard vector store across multiple GPUs.
- Enable sub-millisecond retrieval of the top-100 passages per query.
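
A bare-bones FAISS sketch of the indexing and search step (embeddings here are random stand-ins, and multi-GPU sharding is omitted):

```python
import numpy as np
import faiss

dim = 768
passage_vecs = np.random.rand(10_000, dim).astype("float32")  # stand-in passage embeddings
query_vecs = np.random.rand(4, dim).astype("float32")         # stand-in query embeddings

index = faiss.IndexFlatIP(dim)  # inner-product index, typical for dense retrieval
index.add(passage_vecs)
scores, ids = index.search(query_vecs, 100)  # top-100 passages per query
```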

Integration with Existing Systems
- Insert entailment tuning between pre-training and fine-tuning stages.
- Maintain compatibility with current dense retrieval methods.
- Preserve existing contrastive learning approaches during fine-tuning.

Simple, intuitive, and effective!

This advancement significantly improves the quality of retrieved passages for question-answering systems and retrieval-augmented generation tasks.
posted an update 9 days ago
Good folks from @Microsoft have released an exciting breakthrough in GUI automation!

OmniParser – a game-changing approach for pure vision-based GUI agents that works across multiple platforms and applications.

Key technical innovations:
- Custom-trained interactable icon detection model using 67k screenshots from popular websites
- Specialized BLIP-v2 model fine-tuned on 7k icon-description pairs for extracting functional semantics
- Novel combination of icon detection, OCR, and semantic understanding to create structured UI representations
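
Roughly, the structured UI representation that gets handed to the LLM looks something like the sketch below (field names are illustrative, not the project's actual schema):

```python
from dataclasses import dataclass

@dataclass
class UIElement:
    box: tuple[float, float, float, float]  # normalized (x1, y1, x2, y2) from the icon detector
    text: str | None                        # OCR text overlapping the box, if any
    description: str                        # functional semantics from the fine-tuned captioner
    interactable: bool                      # whether the detector flags the region as clickable

elements = [
    UIElement((0.82, 0.05, 0.95, 0.10), None, "settings gear icon, opens preferences", True),
    UIElement((0.10, 0.40, 0.60, 0.45), "Sign in to continue", "static informational text", False),
]

# The serialized list replaces raw pixels/HTML as the input to the LLM planner.
prompt_block = "\n".join(
    f"[{i}] interactable={e.interactable} text={e.text!r} desc={e.description}"
    for i, e in enumerate(elements)
)
```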

The results are impressive:
- Outperforms GPT-4V baseline by significant margins on the ScreenSpot benchmark
- Achieves 73% accuracy on Mind2Web without requiring HTML data
- Demonstrates a 57.7% success rate on AITW mobile tasks

What makes OmniParser special is its ability to work across platforms (mobile, desktop, web) using only screenshot data – no HTML or view hierarchy needed. This opens up exciting possibilities for building truly universal GUI automation tools.

The team has open-sourced both the interactable region detection dataset and icon description dataset to accelerate research in this space.

Kudos to the Microsoft Research team for pushing the boundaries of what's possible with pure vision-based GUI understanding!

What are your thoughts on vision-based GUI automation?
posted an update 11 days ago
Good folks from @Microsoft Research have just released bitnet.cpp, a game-changing inference framework that achieves remarkable performance gains.

Key Technical Highlights:
- Achieves speedups of up to 6.17x on x86 CPUs and 5.07x on ARM CPUs
- Reduces energy consumption by 55.4–82.2%
- Enables running 100B parameter models at human reading speed (5–7 tokens/second) on a single CPU

Features Three Optimized Kernels:
1. I2_S: Uses 2-bit weight representation
2. TL1: Implements 4-bit index lookup tables for every two weights
3. TL2: Employs 5-bit compression for every three weights
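
As a toy illustration of the 2-bit (I2_S-style) storage, here is the packing arithmetic for ternary weights {-1, 0, +1}, four per byte. The real kernels are hand-optimized C++/SIMD code; this only shows the representation:

```python
import numpy as np

def pack_ternary(w: np.ndarray) -> np.ndarray:
    codes = (w + 1).astype(np.uint8)  # map -1, 0, +1 -> 0, 1, 2
    codes = codes.reshape(-1, 4)      # four 2-bit codes per output byte
    return (codes[:, 0]
            | (codes[:, 1] << 2)
            | (codes[:, 2] << 4)
            | (codes[:, 3] << 6)).astype(np.uint8)

def unpack_ternary(packed: np.ndarray) -> np.ndarray:
    shifts = np.array([0, 2, 4, 6], dtype=np.uint8)
    codes = (packed[:, None] >> shifts) & 0b11
    return codes.astype(np.int8).reshape(-1) - 1

w = np.random.choice([-1, 0, 1], size=16).astype(np.int8)
assert np.array_equal(unpack_ternary(pack_ternary(w)), w)  # round-trip is lossless
```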

Performance Metrics:
- Lossless inference with 100% accuracy compared to full-precision models
- Tested across model sizes from 125M to 100B parameters
- Evaluated on both Apple M2 Ultra and Intel i7-13700H processors

This breakthrough makes running large language models locally more accessible than ever, opening new possibilities for edge computing and resource-constrained environments.
posted an update 12 days ago
If you have roughly 300 GB of VRAM or more, you can run Mochi from @genmo

A SOTA model that dramatically closes the gap between closed and open video generation models.

Mochi 1 introduces a revolutionary architecture featuring joint reasoning over 44,520 video tokens with full 3D attention. The model implements extended learnable rotary positional embeddings (RoPE) in three dimensions, with network-learned mixing frequencies for the space and time axes.

The model incorporates cutting-edge improvements, including:
- SwiGLU feedforward layers
- Query-key normalization for enhanced stability
- Sandwich normalization for controlled internal activations
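
For reference, a SwiGLU feed-forward block in its generic form (not Mochi's exact implementation) is just:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLU(nn.Module):
    """SwiGLU(x) = W_out(SiLU(W_gate x) * (W_up x))"""
    def __init__(self, dim: int, hidden_dim: int):
        super().__init__()
        self.w_gate = nn.Linear(dim, hidden_dim, bias=False)
        self.w_up = nn.Linear(dim, hidden_dim, bias=False)
        self.w_out = nn.Linear(hidden_dim, dim, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.w_out(F.silu(self.w_gate(x)) * self.w_up(x))

y = SwiGLU(dim=512, hidden_dim=1376)(torch.randn(2, 16, 512))  # (batch, seq, dim) in and out
```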

What is currently available?
The base model delivers impressive 480p video generation with exceptional motion quality and prompt adherence. Released under the Apache 2.0 license, it's freely available for both personal and commercial applications.

What's Coming?
Genmo has announced Mochi 1 HD, scheduled for release later this year, which will feature:
- Enhanced 720p resolution
- Improved motion fidelity
- Better handling of complex scene warping
posted an update 16 days ago
Looks like @Meta thinks we forgot they created PyTorch, so now they've open-sourced Lingua, a powerful and flexible library for training and running inference on large language models.

Things that stand out:

- Architecture: Pure PyTorch nn.Module implementation for easy customization.

- Checkpointing: Uses the new PyTorch distributed saving method (.distcp format) for flexible model reloading across different GPU configurations.

- Configuration: Utilizes data classes and YAML files for intuitive setup and modification.

- Profiling: Integrates with xFormers' profiler for automatic MFU and HFU calculation, plus memory profiling.

- Slurm Integration: Includes stool.py for seamless job launching on Slurm clusters.
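
The configuration approach (plain data classes hydrated from YAML) looks roughly like the sketch below; the field names are illustrative rather than Lingua's actual schema:

```python
from dataclasses import dataclass, field
import yaml  # requires PyYAML

@dataclass
class ModelArgs:
    dim: int = 1024
    n_layers: int = 16
    n_heads: int = 8

@dataclass
class TrainArgs:
    steps: int = 60_000
    lr: float = 3e-4
    model: ModelArgs = field(default_factory=ModelArgs)

raw = yaml.safe_load("""
steps: 100000
lr: 0.0003
model:
  dim: 2048
  n_layers: 24
  n_heads: 16
""")

# Hydrate the nested dataclass from the YAML dict
cfg = TrainArgs(**{**raw, "model": ModelArgs(**raw["model"])})
```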

Some results from @Meta to show off:

- 1B parameter models trained on 60B tokens achieve strong performance across various NLP tasks.

- 7B parameter Mamba model (trained on 200B tokens) shows competitive results with Llama 7B on benchmarks like ARC, MMLU, and BBH.

If you're working on LLM research or looking to experiment with cutting-edge language model architectures, Lingua is definitely worth exploring.
posted an update 18 days ago
Good folks at @Apple have developed a novel method called KV Prediction that significantly reduces the "time to first token" (TTFT) for on-device LLM inference.

Some highlights of the paper:

• Uses a small auxiliary transformer model to efficiently predict the KV cache of a larger base model
• Reduces TTFT by up to 4x while retaining 60-80% accuracy on benchmarks
• Achieves Pareto-optimal efficiency-accuracy trade-off compared to baselines
• Demonstrates 15-50% relative accuracy improvements on TriviaQA at equal TTFT FLOP budgets
• Shows up to 30% accuracy gains on HumanEval code completion at fixed TTFT FLOP counts
• Validated on Apple M2 Pro CPU, proving FLOP gains translate to real-world speedups


So, how's it done?

Based on the KV Prediction method described in the paper, here are the key steps for how it's done:

1. Choose a base model and an auxiliary model:
- The base model is a larger, pretrained transformer model that will be used for final generation.
- The auxiliary model is a smaller transformer model used to efficiently process the input prompt.

2. Design the KV predictor:
- Create a set of learned linear projections to map from the auxiliary model's KV cache to the base model's KV cache.
- Define a mapping from auxiliary cache layers to base cache layers.

3. Training process:
- Pass input tokens through the auxiliary model to get its KV cache.
- Use the KV predictor to generate a predicted KV cache for the base model.
- Run the base model using the predicted KV cache and compute losses.
- Backpropagate errors through the frozen base model to update the auxiliary model and KV predictor.

4. Inference process:
- Process the input prompt with the auxiliary model to get its KV cache.
- Use the KV predictor to generate the predicted base model KV cache.
- Run a single token generation step with the base model using the predicted KV cache.
- Continue autoregressive generation with the base model as normal.
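
A stripped-down sketch of the KV predictor itself (shapes and the layer mapping are simplified, and this is not Apple's implementation):

```python
import torch
import torch.nn as nn

aux_layers, base_layers = 4, 8
aux_dim, base_dim = 256, 512
layer_map = {b: b // 2 for b in range(base_layers)}  # base layer -> auxiliary layer

class KVPredictor(nn.Module):
    def __init__(self):
        super().__init__()
        self.k_proj = nn.ModuleList(nn.Linear(aux_dim, base_dim) for _ in range(base_layers))
        self.v_proj = nn.ModuleList(nn.Linear(aux_dim, base_dim) for _ in range(base_layers))

    def forward(self, aux_kv):
        # aux_kv: list of (K, V) per auxiliary layer, each of shape (batch, seq, aux_dim)
        pred = []
        for b in range(base_layers):
            k_aux, v_aux = aux_kv[layer_map[b]]
            pred.append((self.k_proj[b](k_aux), self.v_proj[b](v_aux)))
        return pred  # predicted base-model KV cache, one (K, V) pair per base layer

aux_kv = [(torch.randn(1, 128, aux_dim), torch.randn(1, 128, aux_dim)) for _ in range(aux_layers)]
predicted_base_kv = KVPredictor()(aux_kv)
```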

Excited to hear your thoughts!
posted an update 19 days ago
All the way from Korea, a novel approach called Mentor-KD significantly improves the reasoning abilities of small language models.

Mentor-KD introduces an intermediate-sized "mentor" model to augment training data and provide soft labels during knowledge distillation from large language models (LLMs) to smaller models.

Broadly, it’s a two-stage process:
1) Fine-tune the mentor on filtered Chain-of-Thought (CoT) annotations from an LLM teacher.
2) Use the mentor to generate additional CoT rationales and soft probability distributions.

The student model is then trained using:
- CoT rationales from both the teacher and mentor (rationale distillation).
- Soft labels from the mentor (soft label distillation).
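
One way to write that combined objective down (my own formulation, not the paper's exact loss) is a cross-entropy term on the CoT rationales plus a temperature-scaled KL term toward the mentor's soft labels:

```python
import torch
import torch.nn.functional as F

def student_loss(student_logits, rationale_labels, mentor_label_probs, alpha=0.5, T=2.0):
    # student_logits: (batch, seq, vocab); rationale_labels: (batch, seq) padded with -100
    ce = F.cross_entropy(
        student_logits.transpose(1, 2), rationale_labels, ignore_index=-100
    )
    # soft-label distillation on the answer position (here: the last step), temperature-scaled
    student_log_probs = F.log_softmax(student_logits[:, -1, :] / T, dim=-1)
    kd = F.kl_div(student_log_probs, mentor_label_probs, reduction="batchmean") * (T * T)
    return alpha * ce + (1 - alpha) * kd
```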

Results show that Mentor-KD consistently outperforms baselines, with up to 5% accuracy gains on some tasks.

Mentor-KD is especially effective in low-resource scenarios, achieving comparable performance to baselines while using only 40% of the original training data.

This work opens up exciting possibilities for making smaller, more efficient language models better at complex reasoning tasks.

What are your thoughts on this approach?
posted an update 20 days ago
While Google may have introduced the Transformer with "Attention Is All You Need," Microsoft and Tsinghua University are here with the DIFF Transformer, effectively saying, "Sparse attention is all you need."

The DIFF Transformer outperforms traditional Transformers in scaling properties, requiring only about 65% of the model size or training tokens to achieve comparable performance.

The secret sauce? A differential attention mechanism that amplifies focus on relevant context while canceling out noise, leading to sparser and more effective attention patterns.

How?
- It uses two separate softmax attention maps and subtracts them.
- It employs a learnable scalar λ for balancing the attention maps.
- It implements GroupNorm for each attention head independently.
- It is compatible with FlashAttention for efficient computation.
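
Here is a single-head sketch of the differential attention idea (λ is a plain scalar here, whereas the paper re-parameterizes it with learnable vectors, and the real model fuses this with FlashAttention):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DiffAttentionHead(nn.Module):
    def __init__(self, dim: int, head_dim: int):
        super().__init__()
        self.q = nn.Linear(dim, 2 * head_dim, bias=False)  # two query groups
        self.k = nn.Linear(dim, 2 * head_dim, bias=False)  # two key groups
        self.v = nn.Linear(dim, head_dim, bias=False)
        self.lam = nn.Parameter(torch.tensor(0.5))          # learnable balancing scalar
        self.norm = nn.GroupNorm(1, head_dim)               # per-head normalization

    def forward(self, x):                                   # x: (batch, seq, dim)
        q1, q2 = self.q(x).chunk(2, dim=-1)
        k1, k2 = self.k(x).chunk(2, dim=-1)
        v = self.v(x)
        scale = q1.shape[-1] ** -0.5
        a1 = F.softmax(q1 @ k1.transpose(-2, -1) * scale, dim=-1)
        a2 = F.softmax(q2 @ k2.transpose(-2, -1) * scale, dim=-1)
        out = (a1 - self.lam * a2) @ v                      # subtract the two attention maps
        return self.norm(out.transpose(1, 2)).transpose(1, 2)

y = DiffAttentionHead(dim=256, head_dim=64)(torch.randn(2, 10, 256))
```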

What do you get?
- Superior long-context modeling (up to 64K tokens).
- Enhanced key information retrieval.
- Reduced hallucination in question-answering and summarization tasks.
- More robust in-context learning, less affected by prompt order.
- Mitigation of activation outliers, opening doors for efficient quantization.

Extensive experiments show DIFF Transformer's advantages across various tasks and model sizes, from 830M to 13.1B parameters.

This innovative architecture could be a game-changer for the next generation of LLMs. What are your thoughts on DIFF Transformer's potential impact?
posted an update 22 days ago
Good folks from Universitat Politècnica de Catalunya, University of Groningen, and Meta have released "A Primer on the Inner Workings of Transformer-based Language Models."

They don't make survey papers like they used to, but this is an exciting new survey on Transformer LM interpretability!

This comprehensive survey provides a technical deep dive into:

• Transformer architecture components (attention, FFN, residual stream)
• Methods for localizing model behavior:
- Input attribution (gradient & perturbation-based)
- Component importance (logit attribution, causal interventions)
• Information decoding techniques:
- Probing, linear feature analysis
- Sparse autoencoders for disentangling features
• Key insights on model internals:
- Attention mechanisms (induction heads, copy suppression)
- FFN neuron behaviors
- Residual stream properties
- Multi-component emergent behaviors
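
To make one of these concrete: direct logit attribution boils down to projecting a single component's residual-stream write through the unembedding matrix (toy tensors below, not tied to any specific model):

```python
import torch

d_model, vocab = 64, 100
W_U = torch.randn(d_model, vocab)       # unembedding matrix
component_out = torch.randn(d_model)    # residual-stream write of one component (e.g., one attention head)
target_token = 42

# How much does this component, on its own, push the target token's logit?
contribution = component_out @ W_U[:, target_token]
print(f"direct logit contribution to token {target_token}: {contribution:.3f}")
```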

The paper offers a unified notation and connects insights across different areas of interpretability research. It's a must-read for anyone working on understanding large language models!

Some fascinating technical highlights:
- Detailed breakdowns of attention head circuits (e.g., IOI task)
- Analysis of factual recall mechanisms
- Overview of polysemanticity and superposition
- Discussion of grokking as circuit emergence

What interpretability insights do you find most intriguing?
posted an update 23 days ago
Just started going through the latest "State of AI Report 2024", and I cannot get over the predictions!

The report predicts major developments in AI over the next 12 months, including a $10B+ investment from a sovereign state into a large US AI lab, triggering national security scrutiny, and a viral app created by someone without coding skills.

It forecasts changes in data collection practices due to frontier labs facing trials, softer-than-expected EU AI Act implementations, and the rise of an open-source alternative to OpenAI GPT-4 outperforming in benchmarks.

NVIDIA’s dominance will remain largely unchallenged, investment in humanoid robots will decline, Apple’s on-device AI research will gain momentum, and a research paper written entirely by an AI scientist (an AI system, not a human) will be accepted at a major conference.

Lastly, a GenAI-based video game is expected to achieve breakout success.

Yet to go through all 200+ pages... will post summarized thoughts later.
replied to their post about 1 month ago

Here's why you should be pumped:

🔥 Supercharge your models:
• Up to 97% speedup for LLaMA 3 8B inference
• 50% speedup for LLaMA 3 70B pretraining on H100
• 53% speedup for diffusion models on H100

💾 Slash memory usage:
• 73% peak VRAM reduction for LLaMA 3.1 8B at 128K context length
• 50% model VRAM reduction for CogVideoX

Whether you're working on LLMs, diffusion models, or other AI applications, torchao is a must-have tool in your arsenal. It's time to make your models faster, smaller, and more efficient!

So, what use cases do you expect out of this?

posted an update about 1 month ago
Good folks at @PyTorch have just released torchao, a game-changing library for PyTorch-native model optimization through quantization and sparsity.

-- How torchao Works (They threw the kitchen sink at it...)

torchao leverages several advanced techniques to optimize PyTorch models, making them faster and more memory-efficient. Here's an overview of its key mechanisms:

Quantization

torchao employs various quantization methods to reduce model size and accelerate inference:

• Weight-only quantization: Converts model weights to lower precision formats like int4 or int8, significantly reducing memory usage.
• Dynamic activation quantization: Quantizes activations on-the-fly during inference, balancing performance and accuracy.
• Automatic quantization: The autoquant function intelligently selects the best quantization strategy for each layer in a model.
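
In code, the weight-only and autoquant paths look roughly like this (API names as shown in torchao's release materials; the library moves quickly, so check the current README):

```python
import torch
import torchao
from torchao.quantization import quantize_, int8_weight_only

model = torch.nn.Sequential(torch.nn.Linear(1024, 1024), torch.nn.ReLU()).to(torch.bfloat16)

# Option 1: pick a strategy explicitly (int8 weight-only quantization, applied in place)
quantize_(model, int8_weight_only())

# Option 2: let autoquant choose a per-layer strategy, then compile
# model = torchao.autoquant(torch.compile(model, mode="max-autotune"))
```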

Low-bit Datatypes

The library utilizes low-precision datatypes to speed up computations:

• float8: Enables float8 training for linear layers, offering substantial speedups for large models like LLaMA 3 70B.
• int4 and int8: Provide options for extreme compression of weights and activations.

Sparsity Techniques

torchao implements sparsity methods to reduce model density:

• Semi-sparse weights: Combine quantization with sparsity for compute-bound models.

KV Cache Optimization

For transformer-based models, torchao offers KV cache quantization, leading to significant VRAM reductions for long context lengths.

Integration with PyTorch Ecosystem

torchao seamlessly integrates with existing PyTorch tools:

• Compatible with torch.compile() for additional performance gains.
• Works with FSDP2 for distributed training scenarios.
• Supports most PyTorch models available on Hugging Face out-of-the-box.

By combining these techniques, torchao enables developers to significantly improve the performance and efficiency of their PyTorch models with minimal code changes and accuracy impact.
replied to their post about 1 month ago
7. Monitor and interact:
• Use the OpenDevin User Interface (UI) to view the agent's actions and progress.
• Provide additional instructions or feedback if needed.

8. Evaluate results:
• Review the agent's output or completed task.
• Optionally, use the evaluation framework to assess the agent's performance on specific benchmarks.

9. Iterate and improve:
• Based on the results, refine the agent's prompts, skills, or implementation as needed.

Remember that OpenDevin is a flexible platform, so the exact steps may vary depending on your specific use case and the type of agent you're working with.

OpenDevin is a community-driven project with over 160 contributors and 1.3K+ contributions. It's poised to accelerate research and real-world applications in agentic AI systems.

posted an update about 1 month ago
Researchers have introduced OpenDevin, an open-source platform for building powerful AI agents that interact with the world through software interfaces.

Here is a speed-run of features:

- Flexible agent abstraction, allowing easy implementation of diverse AI agents
- Sandboxed Linux environment and web browser for safe code execution and web interaction
- Core actions including IPythonRunCellAction, CmdRunAction, and BrowserInteractiveAction
- AgentSkills library with reusable tools like file-editing utilities and multi-modal document parsing
- Multi-agent delegation for complex task solving
- Comprehensive evaluation framework with 15 benchmarks across software engineering and the web

Here is how you get OpenDevin working:

1. Set up the environment:
- Install OpenDevin by following the instructions in the GitHub repository (https://github.com/OpenDevin/OpenDevin).
- Ensure you have the necessary dependencies installed.

2. Choose an agent:
- Select an agent from the AgentHub, such as the CodeActAgent or BrowsingAgent.
- Alternatively, create your own agent by implementing the agent abstraction.

3. Configure the environment:
- Set up the sandboxed Linux environment and web browser.
- Mount any necessary files or directories into the workspace.

4. Define the task:
- Specify the task you want the agent to perform, such as writing code, debugging, or web browsing.

5. Initialize the agent:
- Create an instance of your chosen agent.
- Set any necessary parameters or prompts.

6. Start the interaction:
- Begin the agent's execution loop, which typically involves:
a. The agent perceiving the current state
b. Deciding on an action
c. Executing the action in the environment
d. Observing the results
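
Written as generic pseudostructure (not OpenDevin's actual classes), that loop is:

```python
def run_agent(agent, env, task, max_steps=50):
    state = env.reset(task)                        # a. perceive the current state
    for _ in range(max_steps):
        action = agent.decide(state)               # b. decide on an action (e.g., run a command)
        observation = env.execute(action)          # c. execute it in the sandboxed environment
        state = agent.update(state, observation)   # d. observe the results and update state
        if agent.is_done(state):
            break
    return state
```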

Continued in comments...
posted an update about 1 month ago
Researchers have developed a novel approach called Logic-of-Thought (LoT) that significantly enhances the logical reasoning capabilities of large language models (LLMs).

Here are the steps on how Logic-of-Thought (LoT) is implemented:

-- 1. Logic Extraction

1. Use Large Language Models (LLMs) to identify sentences containing conditional reasoning relationships from the input context.
2. Generate a collection of sentences with logical relationships.
3. Use LLMs to extract the set of propositional symbols and logical expressions from the collection.
4. Identify propositions with similar meanings and represent them using identical propositional symbols.
5. Analyze the logical relationships between propositions based on their natural language descriptions.
6. Add negation (¬) for propositions that express opposite meanings.
7. Use implication (→) to connect propositional symbols when a conditional relationship exists.

-- 2. Logic Extension

1. Apply logical reasoning laws to the collection of logical expressions from the Logic Extraction phase.
2. Use a Python program to implement logical deduction and expand the expressions.
3. Apply logical laws such as Double Negation, Contraposition, and Transitivity to derive new logical expressions.
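
A tiny version of that logic-extension step, expanding a set of implications with contraposition and transitivity (my own minimal re-implementation, not the paper's code):

```python
def neg(p: str) -> str:
    return p[1:] if p.startswith("¬") else "¬" + p

def extend(implications: set[tuple[str, str]]) -> set[tuple[str, str]]:
    expanded = set(implications)
    changed = True
    while changed:
        changed = False
        # Contraposition: (A -> B) gives (¬B -> ¬A)
        for a, b in list(expanded):
            contra = (neg(b), neg(a))
            if contra not in expanded:
                expanded.add(contra); changed = True
        # Transitivity: (A -> B) and (B -> C) give (A -> C)
        for a, b in list(expanded):
            for b2, c in list(expanded):
                if b == b2 and a != c and (a, c) not in expanded:
                    expanded.add((a, c)); changed = True
    return expanded

print(extend({("A", "B"), ("B", "C")}))  # adds (A, C), (¬B, ¬A), (¬C, ¬B), (¬C, ¬A)
```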

-- 3. Logic Translation

1. Use LLMs to translate the newly generated logical expressions into natural language descriptions.
2. Combine the natural language descriptions of propositional symbols according to the extended logical expressions.
3. Incorporate the translated logical information as a new part of the original input prompt.

-- 4. Integration with Existing Prompting Methods

1. Combine the LoT-generated logical information with the original prompt.
2. Use this enhanced prompt with existing prompting methods like Chain-of-Thought (CoT), Self-Consistency (SC), or Tree-of-Thoughts (ToT).
3. Feed the augmented prompt to the LLM to generate the final answer.

What do you think about LoT?
posted an update about 1 month ago
I'm thrilled to share that I’ve just released the Contextual Multi-Armed Bandits Library, a comprehensive Python toolkit that brings together a suite of both contextual and non-contextual bandit algorithms. Whether you're delving into reinforcement learning research or building practical applications, this library is designed to accelerate your work.

What's Inside:

- Contextual Algorithms:
- LinUCB
- Epsilon-Greedy
- KernelUCB
- NeuralLinearBandit
- DecisionTreeBandit

- Non-Contextual Algorithms:
- Upper Confidence Bound (UCB)
- Thompson Sampling
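
For a flavor of what the contextual algorithms do under the hood, here is a from-scratch LinUCB update (written independently of the library's actual interface):

```python
import numpy as np

class LinUCBArm:
    def __init__(self, d: int, alpha: float = 1.0):
        self.A = np.eye(d)        # ridge-regularized design matrix
        self.b = np.zeros(d)      # reward-weighted context sum
        self.alpha = alpha        # exploration strength

    def ucb(self, x: np.ndarray) -> float:
        A_inv = np.linalg.inv(self.A)
        theta = A_inv @ self.b
        return float(theta @ x + self.alpha * np.sqrt(x @ A_inv @ x))

    def update(self, x: np.ndarray, reward: float) -> None:
        self.A += np.outer(x, x)
        self.b += reward * x

arms = [LinUCBArm(d=5) for _ in range(3)]
x = np.random.rand(5)                                   # context vector
chosen = max(range(3), key=lambda i: arms[i].ucb(x))    # pick the arm with the highest UCB
arms[chosen].update(x, reward=1.0)
```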

Key Features:

- Modular Design: Easily integrate and customize algorithms for your specific needs.
- Comprehensive Documentation: Clear instructions and examples to get you started quickly.
- Educational Value: Ideal for learning and teaching concepts in reinforcement learning and decision-making under uncertainty.

GitHub Repository: https://github.com/singhsidhukuldeep/contextual-bandits
PyPi: https://pypi.org/project/contextual-bandits-algos/

I am eager to hear your feedback, contributions, and ideas. Feel free to open issues, submit pull requests, or fork the project to make it your own.
posted an update about 1 month ago
Good folks at Meta have just unveiled Llama 3.2, pushing the boundaries of language models and computer vision.

Even more interesting is how they trained this cutting-edge model:

1️⃣ Architecture:
Llama 3.2 uses an optimized transformer architecture with auto-regressive capabilities. The largest models (11B and 90B) now support multimodal inputs, integrating both text and images.

2️⃣ Training Pipeline:
• Started with pretrained Llama 3.1 text models
• Added image adapters and encoders
• Pretrained on large-scale noisy (image, text) pair data
• Fine-tuned on high-quality in-domain and knowledge-enhanced (image, text) pairs

3️⃣ Vision Integration:
• Trained adapter weights to integrate a pre-trained image encoder
• Used cross-attention layers to feed image representations into the language model
• Preserved text-only capabilities by not updating language model parameters during adapter training
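
A toy version of that adapter idea: a gated cross-attention block that lets frozen text hidden states attend to image-encoder features, with only the adapter parameters trained (dimensions and gating are illustrative, not Meta's implementation):

```python
import torch
import torch.nn as nn

class CrossAttentionAdapter(nn.Module):
    def __init__(self, text_dim: int, image_dim: int, n_heads: int = 8):
        super().__init__()
        self.img_proj = nn.Linear(image_dim, text_dim)
        self.cross_attn = nn.MultiheadAttention(text_dim, n_heads, batch_first=True)
        self.gate = nn.Parameter(torch.zeros(1))  # starts as a no-op, learns to let image info in

    def forward(self, text_hidden, image_feats):
        img = self.img_proj(image_feats)
        attended, _ = self.cross_attn(query=text_hidden, key=img, value=img)
        return text_hidden + torch.tanh(self.gate) * attended  # frozen LM stream + gated image info

text_hidden = torch.randn(1, 32, 4096)    # from a frozen language-model layer
image_feats = torch.randn(1, 256, 1280)   # from a pre-trained vision encoder
out = CrossAttentionAdapter(4096, 1280)(text_hidden, image_feats)
```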

4️⃣ Post-Training Alignment:
• Multiple rounds of supervised fine-tuning (SFT)
• Rejection sampling (RS)
• Direct preference optimization (DPO)
• Synthetic data generation using Llama 3.1 for Q&A augmentation
• Reward model ranking for high-quality fine-tuning data

5️⃣ Lightweight Models:
• Used pruning and distillation techniques for 1B and 3B models
• Structured pruning from Llama 3.1 8B model
• Knowledge distillation using Llama 3.1 8B and 70B as teachers

6️⃣ Context Length:
All models support an impressive 128K token context length.

7️⃣ Safety Measures:
Incorporated safety mitigation data to balance helpfulness and safety.

The result? A suite of models ranging from edge-friendly 1B parameters to powerful 90B parameter versions, capable of sophisticated reasoning across text and images. Llama 3.2 is set to revolutionize AI applications from mobile devices to enterprise-scale solutions.

What are your thoughts on these advancements? How do you see Llama 3.2 impacting your industry? Let's discuss in the comments!
replied to their post about 1 month ago
6. Generate task instances:
• Create multiple instances of each task with varying complexities.
• Ensure task instances can be extended to arbitrary context lengths.

7. Develop prompts and scoring methods:
• Create few-shot prompts for each task to guide model responses.
• Implement appropriate scoring methods for each task (e.g., approximate accuracy, string similarity).

8. Evaluate models:
• Test frontier models with long-context capabilities (e.g., Gemini, GPT-4, Claude).
• Evaluate models on contexts up to 128K tokens, and some up to 1M tokens.

9. Analyze results:
• Compare model performance across different tasks and context lengths.
• Identify trends in performance degradation and generalization capabilities.

10. Iterate and refine:
• Adjust task parameters and prompts as needed to ensure robust evaluation.
• Address any issues or limitations discovered during testing.
posted an update about 1 month ago
Researchers from @GoogleDeepMind have introduced "Michelangelo" — a novel framework for evaluating large language models on long-context reasoning tasks beyond simple retrieval.

They have proposed three minimal tasks to test different aspects of long-context reasoning:
- Latent List: Tracking a Python list's state over many operations.
- MRCR: Multi-round coreference resolution in conversations.
- IDK: Determining if an answer exists in a long context.

They found significant performance drop-offs even within the first 32K tokens on these tasks, indicating considerable room for improvement in long-context reasoning.

Here are the key steps for creating the Michelangelo long-context evaluations:

1. Develop the Latent Structure Queries (LSQ) framework:
- Create a framework for generating long-context evaluations that can be extended arbitrarily in length and complexity.
- Ensure the framework measures capabilities beyond simple retrieval.

2. Design minimal tasks using the LSQ framework:
- Create tasks that test different aspects of long-context reasoning.
- Ensure tasks are minimally complex while still challenging for current models.

3. Implement the Latent List task:
- Create a Python list-based task with operations that modify the list.
- Include relevant and irrelevant operations to test model understanding.
- Develop view operations to query the final state of the list.

4. Implement the Multi-Round Coreference Resolution (MRCR) task:
- Generate conversations with user requests and model responses on various topics.
- Place specific requests randomly in the context.
- Require models to reproduce outputs based on queries about the conversation.

5. Implement the IDK task:
- Create contexts with invented stories or information.
- Develop questions that may or may not have answers in the context.
- Include multiple-choice options, always including "I don't know" as an option.
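
To make the Latent List task from step 3 concrete, a scaled-down toy instance (not DeepMind's generator) just interleaves relevant and irrelevant operations and asks for the final state:

```python
ops = [
    "l = []",
    "l.append(7)",
    "x = 3 * 14            # irrelevant to l",
    "l.extend([1, 2])",
    "print('logging...')   # irrelevant to l",
    "l.pop(0)",
]

namespace = {}
for op in ops:
    exec(op, namespace)   # compute the ground-truth final state by actually executing the ops

question = "After running the operations above, what is the value of l?"
answer = namespace["l"]   # [1, 2] - the state the model must track purely in context
```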

More in comments...
posted an update about 1 month ago
Although this might sound like another way to make money on LLM API calls...

Good folks at @AnthropicAI just introduced Contextual Retrieval, and it's a significant yet logical step up from simple Retrieval-Augmented Generation (RAG)!

Here are the steps to implement Contextual Retrieval based on Anthropic's approach:

1. Preprocess the knowledge base:
- Break down documents into smaller chunks (typically a few hundred tokens each).
- Generate contextual information for each chunk using Claude 3 Haiku with a specific prompt.
- Prepend the generated context (usually 50-100 tokens) to each chunk.

2. Create embeddings and a BM25 index:
- Use an embedding model (Gemini or Voyage recommended) to convert contextualized chunks into vector embeddings.
- Create a BM25 index using the contextualized chunks.

3. Set up the retrieval process:
- Implement a system to search both the vector embeddings and the BM25 index.
- Use rank fusion techniques to combine and deduplicate results from both searches.

4. Implement reranking (optional but recommended):
- Retrieve the top 150 potentially relevant chunks initially.
- Use a reranking model (e.g., Cohere reranker) to score these chunks based on relevance to the query.
- Select the top 20 chunks after reranking.

5. Integrate with the generative model:
- Add the top 20 chunks (or top K, based on your specific needs) to the prompt sent to the generative model.

6. Optimize for your use case:
- Experiment with chunk sizes, boundary selection, and overlap.
- Consider creating custom contextualizer prompts for your specific domain.
- Test different numbers of retrieved chunks (5, 10, 20) to find the optimal balance.

7. Leverage prompt caching:
- Use Claude's prompt caching feature to reduce costs when generating contextualized chunks.
- Cache the reference document once and reference it for each chunk, rather than passing it repeatedly.

8. Evaluate and iterate