Data Engineering for Scaling Language Models to 128K Context
Abstract
We study the continual pretraining recipe for scaling language models' context lengths to 128K, with a focus on data engineering. We hypothesize that long context modeling, in particular the ability to utilize information at arbitrary input locations, is a capability that is mostly already acquired through large-scale pretraining, and that this capability can be readily extended to contexts substantially longer than seen during training (e.g., 4K to 128K) through lightweight continual pretraining on an appropriate data mixture. We investigate the quantity and quality of the data for continual pretraining: (1) for quantity, we show that 500 million to 5 billion tokens are enough to enable the model to retrieve information anywhere within the 128K context; (2) for quality, our results equally emphasize domain balance and length upsampling. Concretely, we find that naively upsampling longer data on certain domains like books, a common practice of existing work, gives suboptimal performance, and that a balanced domain mixture is important. We demonstrate that continual pretraining of the full model on 1B-5B tokens of such data is an effective and affordable strategy for scaling the context length of language models to 128K. Our recipe outperforms strong open-source long-context models and closes the gap to frontier models like GPT-4 128K.
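The idea of combining length upsampling with domain balance can be illustrated with a small sketch. This is not the authors' pipeline: the function name `per_domain_length_upsample`, the 4096-token threshold, and the boost factor of 5 are all hypothetical choices made for illustration. The sketch boosts long documents inside each domain, then rescales so every domain keeps its original share of tokens.

```python
from collections import defaultdict


def per_domain_length_upsample(docs, long_threshold=4096, long_boost=5.0):
    """Return one sampling weight per document.

    docs: list of (domain, num_tokens) pairs.
    Long documents are boosted within their own domain, then weights are
    rescaled so each domain's expected token mass is unchanged.
    The threshold and boost values are illustrative assumptions.
    """
    original_tokens = defaultdict(float)
    boosted_tokens = defaultdict(float)
    boosts = []
    for domain, n in docs:
        b = long_boost if n >= long_threshold else 1.0
        boosts.append(b)
        original_tokens[domain] += n
        boosted_tokens[domain] += b * n
    # Rescale per domain: sum(weight * tokens) equals the original domain total.
    return [
        b * original_tokens[d] / boosted_tokens[d]
        for (d, _), b in zip(docs, boosts)
    ]


# Toy corpus: (domain, token count). Purely made-up numbers.
docs = [("books", 100_000), ("books", 1_000), ("web", 8_000), ("web", 2_000), ("code", 500)]
weights = per_domain_length_upsample(docs)
for (domain, n), w in zip(docs, weights):
    print(f"{domain:>5} {n:>7} tokens -> weight {w:.3f}")
```

Under this scheme, long documents are seen more often than short ones from the same domain, but no domain (for example, books) grows at the expense of the others, which is the contrast with naive upsampling that the abstract draws.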
Community
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- Extending LLMs' Context Window with 100 Samples (2024)
- LongAlign: A Recipe for Long Context Alignment of Large Language Models (2024)
- Structured Packing in LLM Training Improves Long Context Utilization (2023)
- LLM Maybe LongLM: Self-Extend LLM Context Window Without Tuning (2024)
- E^2-LLM: Efficient and Extreme Length Extension of Large Language Models (2024)
Is the idea here mainly:
- Data - (novel contribution) continual pretraining while preserving the pretraining data mixture (avoid biasing benchmark performance in other areas, in contrast to e.g. just training on long-form books)
- Architecture - minimal changes beyond Adjusted Base Frequency (changing the base from 10,000 to 500,000 à la Code LLaMA).
- Training - with recent sub-quadratic memory optimizations (Flash Attention), brute-force training with long sequences is no longer prohibitively expensive, and a large part of the latency bottleneck has shifted to linear IO cost (for < ~50K sequences). I believe FlashAttention 2 proposes a double-buffering technique that can also help "overlap" these IO and GEMM costs to avoid serializing on them.
I believe https://arxiv.org/abs/2309.16039 also proposes something very similar (continual pretraining with ABF at a base of 500,000 as the only minor architectural change), but using many more tokens for continual pretraining and without preserving the same pretraining data mixture.
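To make the Adjusted Base Frequency point concrete, here is a minimal sketch of how the rotary embedding (RoPE) base affects the wavelengths of the rotary frequencies. It uses the standard RoPE definition, theta_i = base^(-2i/d); the head dimension of 128 and the helper name `rope_wavelengths` are assumptions for illustration, not details taken from either paper.

```python
import math


def rope_wavelengths(head_dim, base):
    """Wavelength, in tokens, of each rotary frequency pair.

    Standard RoPE uses theta_i = base ** (-2 * i / head_dim), so the
    wavelength of pair i is 2 * pi / theta_i.
    """
    return [2 * math.pi * base ** (2 * i / head_dim) for i in range(head_dim // 2)]


HEAD_DIM = 128            # assumed head dimension, for illustration only
TARGET_CONTEXT = 128 * 1024

for base in (10_000, 500_000):
    longest = rope_wavelengths(HEAD_DIM, base)[-1]
    covers = longest > TARGET_CONTEXT
    print(f"base={base:>7}: longest wavelength ~ {longest:,.0f} tokens, "
          f"exceeds 128K: {covers}")
```

With the smaller base, even the slowest-rotating pair completes a full rotation well inside a 128K window; raising the base stretches the wavelengths so that the slowest pairs no longer wrap around within the target context.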
I tend to view the contribution as data and data alone: not only the data composition but also the data scale.
When comparing this work with https://arxiv.org/abs/2309.16039, note that a fundamental difference is that we hypothesize the long-context capability is already within the base model, and one only needs very lightweight continual pretraining to unlock it, i.e., using only 5B tokens. This is good news for research and open source.
But https://arxiv.org/abs/2309.16039 (implicitly) holds the opposite belief, that the long-context capability is NOT within the base model, and they continue pretraining on 400B tokens. This sends an inaccurate and costly message to the community, as it suggests that long context can be as expensive as pretraining.
Consequently, imagine a company trying to build a long-context model. Before our paper, if they followed https://arxiv.org/abs/2309.16039, they might need 128 A100s for two weeks. Knowing our result, they can reduce the cost to 8 A100s for 5 days. This is a million-dollar cost reduction.
And it already happened :)
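For reference, the compute comparison quoted above works out as follows in GPU-hours. The sketch only restates the figures given in the comment (128 A100s for two weeks versus 8 A100s for 5 days); it assumes full utilization around the clock and makes no claim about pricing.

```python
def gpu_hours(num_gpus, days):
    """Total GPU-hours, assuming the GPUs run 24 hours a day."""
    return num_gpus * days * 24


large_run = gpu_hours(128, 14)   # 128 A100s for two weeks
small_run = gpu_hours(8, 5)      # 8 A100s for 5 days
ratio = large_run / small_run

print(f"large run: {large_run} GPU-hours")
print(f"small run: {small_run} GPU-hours")
print(f"ratio: {ratio:.1f}x")
```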