singhsidhukuldeep posted an update Sep 25
Researchers from @GoogleDeepMind have introduced "Michelangelo" — a novel framework for evaluating large language models on long-context reasoning tasks beyond simple retrieval.

They have proposed three minimal tasks to test different aspects of long-context reasoning:
- Latent List: Tracking a Python list's state over many operations.
- MRCR: Multi-round coreference resolution in conversations.
- IDK: Determining if an answer exists in a long context.

On these tasks, they found significant performance drop-offs before 32K tokens, indicating room for improvement in long-context reasoning.

Here are the key steps for creating the Michelangelo long-context evaluations (minimal code sketches for the framework and each task follow the list):

1. Develop the Latent Structure Queries (LSQ) framework:
- Create a framework for generating long-context evaluations that can be extended arbitrarily in length and complexity.
- Ensure the framework measures capabilities beyond simple retrieval.

2. Design minimal tasks using the LSQ framework:
- Create tasks that test different aspects of long-context reasoning.
- Ensure tasks are minimally complex while still challenging for current models.

3. Implement the Latent List task:
- Create a Python list-based task with operations that modify the list.
- Include relevant and irrelevant operations to test model understanding.
- Develop view operations to query the final state of the list.

4. Implement the Multi-Round Coreference Resolution (MRCR) task:
- Generate conversations with user requests and model responses on various topics.
- Place specific requests randomly in the context.
- Require models to reproduce outputs based on queries about the conversation.

5. Implement the IDK task:
- Create contexts with invented stories or information.
- Develop questions that may or may not have answers in the context.
- Include multiple-choice options, always including "I don't know" as an option.
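
To make the "extend arbitrarily in length" idea from step 1 concrete, here is a minimal, hypothetical interface sketch; the class and method names are my own, not the paper's.

```python
from abc import ABC, abstractmethod

# Hypothetical interface capturing the LSQ idea that every task can be
# generated at an arbitrary context length: length is scaled by adding
# irrelevant material, while the latent structure being queried stays fixed.

class LatentStructureTask(ABC):
    @abstractmethod
    def generate(self, target_tokens: int, seed: int) -> tuple[str, str, str]:
        """Return (context, query, gold_answer), with the context padded
        out to roughly target_tokens."""
```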
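
For step 3, a rough sketch of how a Latent List instance could be generated. The operation mix, probabilities, and function name are illustrative assumptions, not the paper's implementation.

```python
import random

# Hypothetical sketch of a Latent List instance: a stream of Python list
# operations, some relevant and some irrelevant, followed by a view query
# whose gold answer is the final state of the tracked list.

def make_latent_list_instance(n_ops: int, seed: int = 0):
    rng = random.Random(seed)
    tracked = []                      # the list the model must track
    lines = ["my_list = []", "other_list = []"]

    for _ in range(n_ops):
        if rng.random() < 0.3:
            # Irrelevant operation: touches a different list entirely.
            lines.append(f"other_list.append({rng.randint(0, 99)})")
            continue
        op = rng.choice(["append", "pop", "remove"])
        if op == "append":
            x = rng.randint(0, 99)
            tracked.append(x)
            lines.append(f"my_list.append({x})")
        elif op == "pop" and tracked:
            tracked.pop()
            lines.append("my_list.pop()")
        elif op == "remove" and tracked:
            x = rng.choice(tracked)
            tracked.remove(x)
            lines.append(f"my_list.remove({x})")

    # The view query: the model must report the final list state.
    context = "\n".join(lines)
    query = "print(my_list)  # What does this print?"
    return context, query, tracked
```

Context length grows with `n_ops`, so the same generator covers every context budget.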
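
For step 4, a simplified sketch of assembling an MRCR conversation with one key request placed at a random position. The topics, formats, and placeholder responses are invented for illustration; the actual task construction is more involved.

```python
import random

# Hypothetical sketch of an MRCR instance: a long multi-turn conversation
# with one key (format, topic) request placed at a random position; the
# query asks the model to reproduce its response to that earlier request.
# (A simplified stand-in; the actual task construction differs.)

def make_mrcr_instance(n_rounds: int, seed: int = 0):
    rng = random.Random(seed)
    topics = ["penguins", "volcanoes", "jazz", "sourdough", "chess"]
    formats = ["a poem", "a riddle", "a short story"]

    key_round = rng.randrange(n_rounds)
    key_topic, key_format = rng.choice(topics), rng.choice(formats)

    turns = []
    for i in range(n_rounds):
        topic, fmt = key_topic, key_format
        # Distractor rounds re-sample until they differ from the key request.
        while i != key_round and (topic, fmt) == (key_topic, key_format):
            topic, fmt = rng.choice(topics), rng.choice(formats)
        turns.append(f"User: Write {fmt} about {topic}.")
        turns.append(f"Assistant: <{fmt} about {topic}>")  # placeholder response

    query = (f"Reproduce exactly the {key_format} about {key_topic} that "
             f"you wrote earlier in this conversation.")
    gold = f"<{key_format} about {key_topic}>"
    return "\n".join(turns), query, gold
```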
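
For step 5, a small sketch of an IDK instance: an invented story, a question that may or may not be answerable from it, and multiple-choice options that always include "I don't know". The story and questions are made up purely for illustration.

```python
import random

# Hypothetical sketch of an IDK instance.

def make_idk_instance(answerable: bool, seed: int = 0):
    rng = random.Random(seed)
    story = ("Mira kept a brass key in a blue tin on the highest shelf "
             "of her bakery, behind the flour jars.")

    if answerable:
        question = "Where does Mira keep the brass key?"
        correct = "In a blue tin on the highest shelf"
    else:
        # The context never says this, so the only correct answer is "I don't know".
        question = "What is the name of Mira's bakery?"
        correct = "I don't know"

    options = ["In a blue tin on the highest shelf",
               "Under the counter",
               "In her coat pocket",
               "I don't know"]          # always offered as an option
    rng.shuffle(options)
    return story, question, options, correct
```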

More in comments...
6. Generate task instances:
- Create multiple instances of each task with varying complexities.
- Ensure task instances can be extended to arbitrary context lengths.

7. Develop prompts and scoring methods:
- Create few-shot prompts for each task to guide model responses.
- Implement appropriate scoring methods for each task, e.g., approximate accuracy and string similarity (a scoring sketch follows this list).

8. Evaluate models:
- Test frontier models with long-context capabilities (e.g., Gemini, GPT-4, Claude).
- Evaluate models on contexts up to 128K tokens, and some up to 1M tokens (see the evaluation sweep sketch below).

9. Analyze results:
- Compare model performance across different tasks and context lengths.
- Identify trends in performance degradation and generalization capabilities.

10. Iterate and refine:
- Adjust task parameters and prompts as needed to ensure robust evaluation.
- Address any issues or limitations discovered during testing.
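
For step 7, a sketch of the kind of scoring this could involve, assuming approximate accuracy for Latent List answers and string similarity (via Python's difflib here) for MRCR reproductions; the paper's exact metrics may differ.

```python
from difflib import SequenceMatcher

# Illustrative scoring helpers (the paper's exact metrics may differ).

def list_answer_score(predicted: list, gold: list) -> float:
    """Approximate accuracy for Latent List: full credit for an exact match,
    partial credit for positionally matching elements otherwise."""
    if predicted == gold:
        return 1.0
    if not gold:
        return 0.0
    hits = sum(1 for p, g in zip(predicted, gold) if p == g)
    return hits / max(len(predicted), len(gold))

def reproduction_score(predicted: str, gold: str) -> float:
    """String similarity for MRCR: how closely the model reproduced
    the requested earlier output."""
    return SequenceMatcher(None, predicted, gold).ratio()
```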
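
For step 8, a sketch of a sweep over context lengths, reusing the `generate(target_tokens, seed)` interface from the earlier framework sketch; `query_model` is a placeholder for whatever API client you actually use.

```python
# Hypothetical evaluation sweep over context lengths; query_model() stands in
# for whatever API client is being used (Gemini, GPT-4, Claude, ...).

CONTEXT_LENGTHS = [8_000, 16_000, 32_000, 64_000, 128_000]

def evaluate(task, model_name, query_model, score_fn, n_instances=20):
    results = {}
    for length in CONTEXT_LENGTHS:
        scores = []
        for seed in range(n_instances):
            context, query, gold = task.generate(target_tokens=length, seed=seed)
            prediction = query_model(model_name, context + "\n\n" + query)
            scores.append(score_fn(prediction, gold))
        results[length] = sum(scores) / len(scores)
    return results  # mean score per context length, to plot degradation curves
```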