Scaling Laws for Downstream Task Performance of Large Language Models
Abstract
Scaling laws provide important insights that can guide the design of large language models (LLMs). Existing work has primarily focused on studying scaling laws for pretraining (upstream) loss. However, in transfer learning settings, in which LLMs are pretrained on an unsupervised dataset and then finetuned on a downstream task, we often also care about the downstream performance. In this work, we study the scaling behavior in a transfer learning setting, where LLMs are finetuned for machine translation tasks. Specifically, we investigate how the choice of the pretraining data and its size affect downstream performance (translation quality) as judged by two metrics: downstream cross-entropy and BLEU score. Our experiments indicate that the size of the finetuning dataset and the distribution alignment between the pretraining and downstream data significantly influence the scaling behavior. With sufficient alignment, both downstream cross-entropy and BLEU score improve monotonically with more pretraining data. In such cases, we show that it is possible to predict the downstream BLEU score with good accuracy using a log-law. However, there are also cases where moderate misalignment causes the BLEU score to fluctuate or get worse with more pretraining, whereas downstream cross-entropy monotonically improves. By analyzing these observations, we provide new practical insights for choosing appropriate pretraining data.
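The abstract mentions a log-law for predicting BLEU, but this page does not state its functional form. The sketch below assumes a three-parameter form, BLEU ≈ (log(A · D_p^α))^β, where D_p is the pretraining data size. The form, the function names, and the data are illustrative assumptions, not the authors' code or results. For a fixed β, BLEU^(1/β) is linear in log D_p, so the sketch grid-searches β and solves the rest by ordinary least squares.

```python
import math

def log_law(d_p, log_a, alpha, beta):
    """Assumed form: BLEU ~ (log(A * D_p**alpha))**beta, with log_a = log(A)."""
    return (log_a + alpha * math.log(d_p)) ** beta

def fit_log_law(sizes, scores, betas=None):
    """Grid-search beta; for each beta, fit log_a and alpha by least squares.

    Uses the fact that score**(1/beta) = log_a + alpha * log(D_p) is linear
    in log(D_p) once beta is fixed. Scores must be positive.
    """
    if betas is None:
        betas = [0.5 + 0.05 * i for i in range(71)]  # candidates 0.5 .. 4.0
    xs = [math.log(d) for d in sizes]
    n = len(xs)
    x_mean = sum(xs) / n
    sxx = sum((x - x_mean) ** 2 for x in xs)
    best = None
    for beta in betas:
        ys = [s ** (1.0 / beta) for s in scores]
        y_mean = sum(ys) / n
        alpha = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys)) / sxx
        log_a = y_mean - alpha * x_mean
        inner = [log_a + alpha * x for x in xs]
        if min(inner) <= 0:
            continue  # the log term must stay positive to raise it to beta
        err = sum((v ** beta - s) ** 2 for v, s in zip(inner, scores))
        if best is None or err < best[0]:
            best = (err, log_a, alpha, beta)
    if best is None:
        raise ValueError("no candidate beta gave a valid fit")
    return best[1:]

# Synthetic data generated from the assumed law (NOT results from the paper).
sizes = [1e8, 3e8, 1e9, 3e9, 1e10]          # pretraining tokens (hypothetical)
scores = [log_law(d, -12.0, 0.8, 2.0) for d in sizes]

# Fit on the four smallest runs, then extrapolate to the largest one.
params = fit_log_law(sizes[:4], scores[:4])
predicted = log_law(sizes[4], *params)
print(f"predicted BLEU at {sizes[4]:.0e} tokens: {predicted:.2f}")
```

Fitting on small runs and extrapolating to a held-out larger run mirrors the "predict downstream BLEU" use case described in the abstract; with real, noisy measurements a proper nonlinear optimizer would likely be a better choice than this grid search.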
Community
Super interesting paper. I respect that BLEU and cross-entropy are quite objective. I'd love to see this with other tasks too, like QA or maybe one of the multiple-choice tasks.
btw, I love the structure of having a main findings section!
• We observe that, when the distributions of the pretraining and downstream tasks are well-aligned, the BLEU score and downstream cross-entropy improve monotonically with more pretraining. For BLEU score, we propose a new log scaling law and show that it has good predictive accuracy.
• When the distributions are not sufficiently aligned and the finetuning data size is relatively small, the BLEU score fluctuates or even gets worse with more pretraining, losing the monotonic scaling behavior. In these same settings, we find that the downstream cross-entropy still scales monotonically according to a power-law.
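To make the second finding concrete, here is a rough sketch of how one might check it on a set of runs. It assumes a power-law of the form L(D_p) = E + A / D_p^α for downstream cross-entropy, which is a common scaling-law form and not something stated on this page. All names and numbers are hypothetical, and the data are synthetic.

```python
import math

def power_law(d_p, e, a, alpha):
    """Assumed form: cross-entropy ~ E + A / D_p**alpha."""
    return e + a / d_p ** alpha

def fit_power_law(sizes, losses, steps=200):
    """Grid-search the floor E; fit A and alpha by least squares in log space.

    For a fixed E, log(L - E) = log(A) - alpha * log(D_p) is linear in log(D_p).
    """
    xs = [math.log(d) for d in sizes]
    n = len(xs)
    x_mean = sum(xs) / n
    sxx = sum((x - x_mean) ** 2 for x in xs)
    l_min = min(losses)
    best = None
    for i in range(steps):
        e = l_min * i / steps  # candidate floor, always strictly below min loss
        ys = [math.log(l - e) for l in losses]
        y_mean = sum(ys) / n
        slope = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys)) / sxx
        log_a = y_mean - slope * x_mean
        preds = [e + math.exp(log_a + slope * x) for x in xs]
        err = sum((p - l) ** 2 for p, l in zip(preds, losses))
        if best is None or err < best[0]:
            best = (err, e, math.exp(log_a), -slope)
    return best[1:]

def improves_monotonically(values, higher_is_better):
    """True if every step to a larger pretraining size strictly improves the metric."""
    sign = 1 if higher_is_better else -1
    return all(sign * (b - a) > 0 for a, b in zip(values, values[1:]))

# Synthetic runs (NOT results from the paper), ordered by pretraining size.
sizes = [1e8, 1e9, 1e10, 1e11]
losses = [power_law(d, 1.0, 50.0, 0.2) for d in sizes]  # downstream cross-entropy
bleu = [10.2, 11.5, 11.1, 10.8]                         # made-up, non-monotone BLEU

e, a, alpha = fit_power_law(sizes, losses)
print("cross-entropy improves monotonically:", improves_monotonically(losses, False))
print("BLEU improves monotonically:", improves_monotonically(bleu, True))
```

If cross-entropy passes the monotonicity check while BLEU fails it, that matches the misaligned regime described in the second bullet, where cross-entropy alone would be a misleading guide to translation quality.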