Post
1465
Researchers from
@GoogleDeepMind
have introduced "Michelangelo" — a novel framework for evaluating large language models on long-context reasoning tasks beyond simple retrieval.
They have proposed three minimal tasks to test different aspects of long-context reasoning:
- Latent List: Tracking a Python list's state over many operations.
- MRCR: Multi-round coreference resolution in conversations.
- IDK: Determining if an answer exists in a long context.
They found significant performance drop-offs within the first 32K tokens on these tasks, indicating substantial room for improvement in long-context reasoning.
Here are the key steps for creating the Michelangelo long-context evaluations:
1. Develop the Latent Structure Queries (LSQ) framework:
- Create a framework for generating long-context evaluations that can be extended arbitrarily in length and complexity.
- Ensure the framework measures capabilities beyond simple retrieval.
2. Design minimal tasks using the LSQ framework:
- Create tasks that test different aspects of long-context reasoning.
- Ensure tasks are minimally complex while still challenging for current models.
3. Implement the Latent List task (first sketch below):
- Create a Python list-based task with operations that modify the list.
- Include relevant and irrelevant operations to test model understanding.
- Develop view operations to query the final state of the list.
4. Implement the Multi-Round Coreference Resolution (MRCR) task (second sketch below):
- Generate conversations with user requests and model responses on various topics.
- Place specific requests randomly in the context.
- Require models to reproduce outputs based on queries about the conversation.
5. Implement the IDK task (third sketch below):
- Create contexts with invented stories or information.
- Develop questions that may or may not have answers in the context.
- Include multiple-choice options, always including "I don't know" as an option.
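To make step 3 concrete, here is a minimal sketch of how a Latent List example could be generated: a long stream of list operations, only some of which touch the tracked list, followed by a view query asking for its final state. Function and variable names are illustrative assumptions, not the authors' code.

```python
import random

# Minimal sketch of a Latent List example generator (illustrative only;
# names are assumptions, not the authors' code).
RELEVANT_OPS = ["append", "insert", "pop", "remove"]

def generate_latent_list_example(num_ops=200, distractor_ratio=0.5, seed=0):
    rng = random.Random(seed)
    lst = []                                   # ground-truth latent state
    lines = ["my_list = []", "other_list = []"]
    for _ in range(num_ops):
        if rng.random() < distractor_ratio:
            # Irrelevant operation: touches an unrelated list, so it must
            # not change the tracked state.
            lines.append(f"other_list.append({rng.randint(0, 99)})")
            continue
        op = rng.choice(RELEVANT_OPS)
        if op == "append":
            v = rng.randint(0, 99)
            lst.append(v)
            lines.append(f"my_list.append({v})")
        elif op == "insert":
            v, i = rng.randint(0, 99), rng.randint(0, len(lst))
            lst.insert(i, v)
            lines.append(f"my_list.insert({i}, {v})")
        elif op == "pop" and lst:
            lst.pop()
            lines.append("my_list.pop()")
        elif op == "remove" and lst:
            v = rng.choice(lst)
            lst.remove(v)
            lines.append(f"my_list.remove({v})")
    # View operation: the model must report the final state of my_list.
    prompt = "\n".join(lines) + "\nprint(my_list)"
    return prompt, lst

prompt, expected_final_state = generate_latent_list_example()
```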
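For step 4, a rough sketch of an MRCR-style generator: synthetic user requests and model responses on mixed topics, with the key request placed at random positions so the query has to resolve which occurrence it refers to. The topics, formats, and the fake_response helper are invented for illustration.

```python
import random

# Rough sketch of an MRCR-style example generator (illustrative only).
TOPICS = ["penguins", "volcanoes", "jazz", "sailing"]
FORMATS = ["poem", "story", "riddle"]
ORDINALS = ["first", "second", "third"]

def fake_response(fmt, topic, idx):
    # Stand-in for a pre-generated model reply in the synthetic conversation.
    return f"[{fmt} #{idx} about {topic}] ..."

def generate_mrcr_example(num_turns=50, seed=0):
    rng = random.Random(seed)
    key_fmt, key_topic = rng.choice(FORMATS), rng.choice(TOPICS)
    turns = []
    for i in range(num_turns):
        fmt, topic = rng.choice(FORMATS), rng.choice(TOPICS)
        turns.append((fmt, topic, fake_response(fmt, topic, i)))
    # Place the key request at two random positions so the query genuinely
    # requires resolving which occurrence is meant.
    for j in range(2):
        pos = rng.randint(0, len(turns))
        turns.insert(pos, (key_fmt, key_topic,
                           fake_response(key_fmt, key_topic, f"key{j}")))
    matches = [r for f, t, r in turns if (f, t) == (key_fmt, key_topic)]
    ordinal = rng.randint(1, min(len(matches), len(ORDINALS)))
    query = f"Reproduce the {ORDINALS[ordinal - 1]} {key_fmt} about {key_topic}."
    conversation = [{"user": f"Write a {fmt} about {topic}.", "model": resp}
                    for fmt, topic, resp in turns]
    return conversation, query, matches[ordinal - 1]

conversation, query, expected_output = generate_mrcr_example()
```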
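And for step 5, a sketch of an IDK-style example: a context built from an invented story, a question that may or may not be answerable from it, and multiple-choice options that always include "I don't know". All names and story facts here are illustrative assumptions.

```python
import random

# Rough sketch of an IDK-style example generator (illustrative only; the
# story facts and unanswerable question are invented for this example).
def generate_idk_example(story_facts, unanswerable_question,
                         answerable_prob=0.5, seed=0):
    """story_facts: list of (statement, question, answer) triples drawn from
    an invented story; unanswerable_question asks about something the story
    never mentions."""
    rng = random.Random(seed)
    context = " ".join(stmt for stmt, _, _ in story_facts)
    if rng.random() < answerable_prob:
        _, question, answer = rng.choice(story_facts)
    else:
        question, answer = unanswerable_question, "I don't know"
    # Build multiple-choice options; "I don't know" is always offered.
    candidates = [a for _, _, a in story_facts if a != answer]
    choices = rng.sample(candidates, k=min(3, len(candidates)))
    if answer != "I don't know":
        choices.append(answer)
    rng.shuffle(choices)
    choices.append("I don't know")
    return context, question, choices, answer

story_facts = [
    ("Mira kept a brass compass in her coat.",
     "What did Mira keep in her coat?", "a brass compass"),
    ("The ferry left Port Hale at dawn.",
     "When did the ferry leave Port Hale?", "at dawn"),
    ("The innkeeper's dog was named Bastion.",
     "What was the innkeeper's dog called?", "Bastion"),
]
context, question, choices, gold = generate_idk_example(
    story_facts, "What colour was the ferry?")
```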
More in comments...