singhsidhukuldeep posted an update Sep 25
Researchers from @GoogleDeepMind have introduced "Michelangelo" — a novel framework for evaluating large language models on long-context reasoning tasks beyond simple retrieval.

They have proposed three minimal tasks to test different aspects of long-context reasoning:
- Latent List: Tracking a Python list's state over many operations.
- MRCR: Multi-round coreference resolution in conversations.
- IDK: Determining if an answer exists in a long context.

On these tasks, they found significant performance drop-offs before 32K tokens, indicating room for improvement in long-context reasoning.

Here are the key steps for creating the Michelangelo long-context evaluations (minimal code sketches for the framework and each task follow the list):

1. Develop the Latent Structure Queries (LSQ) framework:
- Create a framework for generating long-context evaluations that can be extended arbitrarily in length and complexity.
- Ensure the framework measures capabilities beyond simple retrieval.

2. Design minimal tasks using the LSQ framework:
- Create tasks that test different aspects of long-context reasoning.
- Ensure tasks are minimally complex while still challenging for current models.

3. Implement the Latent List task:
- Create a Python list-based task with operations that modify the list.
- Include relevant and irrelevant operations to test model understanding.
- Develop view operations to query the final state of the list.

4. Implement the Multi-Round Coreference Resolution (MRCR) task:
- Generate conversations with user requests and model responses on various topics.
- Place specific requests randomly in the context.
- Require models to reproduce outputs based on queries about the conversation.

5. Implement the IDK task:
- Create contexts with invented stories or information.
- Develop questions that may or may not have answers in the context.
- Include multiple-choice options, always including "I don't know" as an option.
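
To make the "extend arbitrarily in length" idea from step 1 concrete, here is a minimal, hypothetical interface sketch; the class and method names are my own, not the paper's.

```python
from abc import ABC, abstractmethod

# Hypothetical interface capturing the LSQ idea that every task can be
# generated at an arbitrary context length: length is scaled by adding
# irrelevant material, while the latent structure being queried stays fixed.

class LatentStructureTask(ABC):
    @abstractmethod
    def generate(self, target_tokens: int, seed: int) -> tuple[str, str, str]:
        """Return (context, query, gold_answer), with the context padded
        out to roughly target_tokens."""
```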
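
For step 3, a rough sketch of how a Latent List instance could be generated. The operation mix, probabilities, and function name are illustrative assumptions, not the paper's implementation.

```python
import random

# Hypothetical sketch of a Latent List instance: a stream of Python list
# operations, some relevant and some irrelevant, followed by a view query
# whose gold answer is the final state of the tracked list.

def make_latent_list_instance(n_ops: int, seed: int = 0):
    rng = random.Random(seed)
    tracked = []                      # the list the model must track
    lines = ["my_list = []", "other_list = []"]

    for _ in range(n_ops):
        if rng.random() < 0.3:
            # Irrelevant operation: touches a different list entirely.
            lines.append(f"other_list.append({rng.randint(0, 99)})")
            continue
        op = rng.choice(["append", "pop", "remove"])
        if op == "append":
            x = rng.randint(0, 99)
            tracked.append(x)
            lines.append(f"my_list.append({x})")
        elif op == "pop" and tracked:
            tracked.pop()
            lines.append("my_list.pop()")
        elif op == "remove" and tracked:
            x = rng.choice(tracked)
            tracked.remove(x)
            lines.append(f"my_list.remove({x})")

    # The view query: the model must report the final list state.
    context = "\n".join(lines)
    query = "print(my_list)  # What does this print?"
    return context, query, tracked
```

Context length grows with `n_ops`, so the same generator covers every context budget.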
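
For step 4, a simplified sketch of assembling an MRCR conversation with one key request placed at a random position. The topics, formats, and placeholder responses are invented for illustration; the actual task construction is more involved.

```python
import random

# Hypothetical sketch of an MRCR instance: a long multi-turn conversation
# with one key (format, topic) request placed at a random position; the
# query asks the model to reproduce its response to that earlier request.
# (A simplified stand-in; the actual task construction differs.)

def make_mrcr_instance(n_rounds: int, seed: int = 0):
    rng = random.Random(seed)
    topics = ["penguins", "volcanoes", "jazz", "sourdough", "chess"]
    formats = ["a poem", "a riddle", "a short story"]

    key_round = rng.randrange(n_rounds)
    key_topic, key_format = rng.choice(topics), rng.choice(formats)

    turns = []
    for i in range(n_rounds):
        topic, fmt = key_topic, key_format
        # Distractor rounds re-sample until they differ from the key request.
        while i != key_round and (topic, fmt) == (key_topic, key_format):
            topic, fmt = rng.choice(topics), rng.choice(formats)
        turns.append(f"User: Write {fmt} about {topic}.")
        turns.append(f"Assistant: <{fmt} about {topic}>")  # placeholder response

    query = (f"Reproduce exactly the {key_format} about {key_topic} that "
             f"you wrote earlier in this conversation.")
    gold = f"<{key_format} about {key_topic}>"
    return "\n".join(turns), query, gold
```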
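
For step 5, a small sketch of an IDK instance: an invented story, a question that may or may not be answerable from it, and multiple-choice options that always include "I don't know". The story and questions are made up purely for illustration.

```python
import random

# Hypothetical sketch of an IDK instance.

def make_idk_instance(answerable: bool, seed: int = 0):
    rng = random.Random(seed)
    story = ("Mira kept a brass key in a blue tin on the highest shelf "
             "of her bakery, behind the flour jars.")

    if answerable:
        question = "Where does Mira keep the brass key?"
        correct = "In a blue tin on the highest shelf"
    else:
        # The context never says this, so the only correct answer is "I don't know".
        question = "What is the name of Mira's bakery?"
        correct = "I don't know"

    options = ["In a blue tin on the highest shelf",
               "Under the counter",
               "In her coat pocket",
               "I don't know"]          # always offered as an option
    rng.shuffle(options)
    return story, question, options, correct
```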

More in comments...
6. Generate task instances:
- Create multiple instances of each task with varying complexities.
- Ensure task instances can be extended to arbitrary context lengths.

7. Develop prompts and scoring methods:
- Create few-shot prompts for each task to guide model responses.
- Implement appropriate scoring methods for each task, e.g., approximate accuracy and string similarity (a scoring sketch follows this list).

8. Evaluate models:
- Test frontier models with long-context capabilities (e.g., Gemini, GPT-4, Claude).
- Evaluate models on contexts up to 128K tokens, and some up to 1M tokens (see the evaluation sweep sketch below).

9. Analyze results:
- Compare model performance across different tasks and context lengths.
- Identify trends in performance degradation and generalization capabilities.

10. Iterate and refine:
- Adjust task parameters and prompts as needed to ensure robust evaluation.
- Address any issues or limitations discovered during testing.
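
For step 7, a sketch of the kind of scoring this could involve, assuming approximate accuracy for Latent List answers and string similarity (via Python's difflib here) for MRCR reproductions; the paper's exact metrics may differ.

```python
from difflib import SequenceMatcher

# Illustrative scoring helpers (the paper's exact metrics may differ).

def list_answer_score(predicted: list, gold: list) -> float:
    """Approximate accuracy for Latent List: full credit for an exact match,
    partial credit for positionally matching elements otherwise."""
    if predicted == gold:
        return 1.0
    if not gold:
        return 0.0
    hits = sum(1 for p, g in zip(predicted, gold) if p == g)
    return hits / max(len(predicted), len(gold))

def reproduction_score(predicted: str, gold: str) -> float:
    """String similarity for MRCR: how closely the model reproduced
    the requested earlier output."""
    return SequenceMatcher(None, predicted, gold).ratio()
```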
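
For step 8, a sketch of a sweep over context lengths, reusing the `generate(target_tokens, seed)` interface from the earlier framework sketch; `query_model` is a placeholder for whatever API client you actually use.

```python
# Hypothetical evaluation sweep over context lengths; query_model() stands in
# for whatever API client is being used (Gemini, GPT-4, Claude, ...).

CONTEXT_LENGTHS = [8_000, 16_000, 32_000, 64_000, 128_000]

def evaluate(task, model_name, query_model, score_fn, n_instances=20):
    results = {}
    for length in CONTEXT_LENGTHS:
        scores = []
        for seed in range(n_instances):
            context, query, gold = task.generate(target_tokens=length, seed=seed)
            prediction = query_model(model_name, context + "\n\n" + query)
            scores.append(score_fn(prediction, gold))
        results[length] = sum(scores) / len(scores)
    return results  # mean score per context length, to plot degradation curves
```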