Pretraining Data Mixtures Enable Narrow Model Selection Capabilities in Transformer Models
Abstract
Transformer models, notably large language models (LLMs), have the remarkable ability to perform in-context learning (ICL) -- to perform new tasks when prompted with unseen input-output examples without any explicit model training. In this work, we study how effectively transformers can bridge between their pretraining data mixture, composed of multiple distinct task families, to identify and learn new tasks in-context that are both inside and outside the pretraining distribution. Building on previous work, we investigate this question in a controlled setting, where we study transformer models trained on sequences of (x, f(x)) pairs rather than natural language. Our empirical results show that transformers demonstrate near-optimal unsupervised model selection capabilities: they can first in-context identify different task families and then in-context learn within them, provided the task families are well-represented in their pretraining data. However, when presented with tasks or functions that are out-of-domain of their pretraining data, we demonstrate various failure modes of transformers and degradation of their generalization, even for simple extrapolation tasks. Together, our results highlight that the impressive ICL abilities of high-capacity sequence models may be more closely tied to the coverage of their pretraining data mixtures than to inductive biases that create fundamental generalization capabilities.
Community
Investigates out-of-distribution (OOD) performance of transformer models in a controlled sequence-learning setup. In-distribution ICL works well, but OOD fails, suggesting transformer models may not be as generalizable OOD as thought; apparent OOD generalization (emergent behavior) could be due to large pretraining data mixtures rather than model-level generalization.

Few-shot ICL test setup: for each sequence, a label function is drawn from a function distribution; the model is given a sequence of (input, label) pairs and must predict the label for a new input (analogous to examples in few-shot CoT prompting). The model is pretrained on this data source with a teacher-forcing objective (predict the next label given the past sequence), using a decoder-only architecture. Base function classes: linear (dense and sparse), ReLU networks, sinusoids, and mixtures of these. Models trained on the mixture work well on the linear classes (dense and sparse), with MSE dropping faster as in-context data points accumulate.

Tests ICL generalization on unseen function classes (convex combinations of the pretraining classes) and on extreme versions of seen/pretrained functions (higher- or lower-frequency sinusoids). The interpolation phenomenon found for linear functions does not generalize to linear-sinusoid combinations; for mixtures, generalization holds toward the extremes of the seen data, but it starts strong and degrades dramatically as the task becomes more OOD (for extreme versions of seen functions).

Remaining questions: behavior in tokenized models (for NLP tasks); the authors could not train the model by binning outputs; the OOD generalization ability of LLMs still needs investigation. The appendix gives the architecture (implemented in Pax) and function-class definitions, plus more selection experiments (combinations of linear and ReLU across model sizes). From Google DeepMind.
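The pretraining setup described above can be sketched in a few lines: draw a task family from the mixture, draw one function f from that family, then stack (x, f(x)) pairs into a sequence. This is an illustrative NumPy sketch, not the paper's code; the dimensions, family parameters, and function names (`sample_function`, `make_sequence`) are assumptions for exposition.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_function(family, d=8):
    """Draw one task f from the named family (illustrative parameterizations)."""
    if family == "dense_linear":
        w = rng.normal(size=d)
        return lambda x: x @ w
    if family == "sparse_linear":
        w = rng.normal(size=d)
        w = w * (rng.permutation(d) < 2)       # keep 2 nonzero coordinates
        return lambda x: x @ w
    if family == "relu_net":
        W1, W2 = rng.normal(size=(d, 16)), rng.normal(size=16)
        return lambda x: np.maximum(x @ W1, 0.0) @ W2
    if family == "sinusoid":
        freq = rng.uniform(0.5, 2.0)
        return lambda x: np.sin(freq * x[:, 0])
    raise ValueError(family)

def make_sequence(mixture, n_points=32, d=8):
    """One pretraining sequence: pick a family from the mixture, then
    stack (x_i, f(x_i)) pairs for a single f, as in the paper's setup."""
    family = rng.choice(list(mixture), p=list(mixture.values()))
    f = sample_function(family, d)
    X = rng.normal(size=(n_points, d))
    return family, X, f(X)

# Uniform mixture over the four base families (weights are an assumption).
mixture = {"dense_linear": 0.25, "sparse_linear": 0.25,
           "relu_net": 0.25, "sinusoid": 0.25}
family, X, y = make_sequence(mixture)
print(family, X.shape, y.shape)
```

The key property of the setup is that every sequence uses a *single* f, so a model trained to predict y_i from the prefix must implicitly identify the task family and fit f in-context.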
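The "MSE against in-context data points" curve mentioned above measures squared error on the next point as a function of how many (x, f(x)) examples precede it. A minimal baseline version of that curve, assuming a dense linear task and a ridge predictor fit on the prefix (standing in for the transformer, which the paper compares against such optimal estimators):

```python
import numpy as np

rng = np.random.default_rng(1)
d = 8
w_true = rng.normal(size=d)            # one dense linear task
X = rng.normal(size=(32, d))
y = X @ w_true

# Squared error on the (k+1)-th point of a ridge fit to the first k
# in-context examples; error should shrink as k grows past d.
for k in (2, 4, 8, 16):
    Xk, yk = X[:k], y[:k]
    w_hat = np.linalg.solve(Xk.T @ Xk + 1e-3 * np.eye(d), Xk.T @ yk)
    err = float((X[k] @ w_hat - y[k]) ** 2)
    print(k, err)
```

A transformer pretrained on the mixture is evaluated the same way; the paper's finding is that its curve tracks the task-appropriate estimator when the family is in the mixture, and degrades when it is not.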