Paper Review: Rho-1 - Do not use all tokens equally in your training!
A new paper topping the Daily Papers questions a hidden assumption in LLM training:
🤔 Should we really use all tokens equally in our LLMs' training?
Some tokens are more relevant than others, and some are mostly noise (just look up the history of the SolidGoldMagikarp token).
So this paper introduces Selective Language Modeling (SLM), which is actually really simple:
➡️ A reference model trained on high-quality data scores every token: a token's relevance is its excess loss, i.e. the training model's loss on that token minus the reference model's loss. During training, only the top k% of tokens by this relevance metric count in the loss calculation (see the sketch below).
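To make the recipe concrete, here is a minimal PyTorch sketch of such a selective loss. This is my own illustration under the assumptions above (excess-loss scoring, top-k% selection over the batch); the function name `slm_loss` and the `keep_ratio` value are hypothetical and not taken from the paper's code.

```python
# Minimal sketch of a Selective Language Modeling (SLM) style loss.
# Assumes `labels` are already aligned with the logits (shifted for
# next-token prediction) and that a frozen reference model produced
# `ref_logits` for the same batch.
import torch
import torch.nn.functional as F

def slm_loss(train_logits, ref_logits, labels, keep_ratio=0.6):
    """Cross-entropy averaged only over the top `keep_ratio` fraction of
    tokens, ranked by excess loss (training loss minus reference loss)."""
    vocab = train_logits.size(-1)
    # Per-token cross-entropy, one value per token (no reduction).
    ce_train = F.cross_entropy(
        train_logits.view(-1, vocab), labels.view(-1), reduction="none")
    with torch.no_grad():
        ce_ref = F.cross_entropy(
            ref_logits.view(-1, vocab), labels.view(-1), reduction="none")
    # Excess loss: tokens that the reference model handles well but the
    # current model still gets wrong are the most useful to train on.
    excess = ce_train.detach() - ce_ref
    k = max(1, int(keep_ratio * excess.numel()))
    topk_idx = torch.topk(excess, k).indices
    # Only the selected tokens contribute to the gradient.
    return ce_train[topk_idx].mean()

# Toy usage: batch of 2 sequences, length 8, vocabulary of 100 tokens.
if __name__ == "__main__":
    B, T, V = 2, 8, 100
    train_logits = torch.randn(B, T, V, requires_grad=True)
    ref_logits = torch.randn(B, T, V)
    labels = torch.randint(0, V, (B, T))
    loss = slm_loss(train_logits, ref_logits, labels)
    loss.backward()
    print(loss.item())
```

Since the reference model is frozen, its per-token losses could in principle be precomputed once over the corpus, so token selection adds little overhead on top of the usual forward pass.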
The authors test this method by training models on math data and evaluating on the difficult MATH benchmark (competition-level mathematics problems only).
➡️ Their technique seems like a new must-do in LLM training: training is much faster and reaches impressive performance!
Results:
✅ Training reaches equivalent performance 5x to 10x faster than standard language modeling.
✅ Their 1B model gets close to GPT-4 chain-of-thought performance on MATH!
✅ Their 7B model matches the performance of the state-of-the-art DeepSeek model of the same size, while training on only 3% of the tokens.
Additional insights 💡
✅ Datasets used for pre-training, even after pre-filtering, still contain a large proportion of noisy tokens.
✅ The authors show that reducing loss on noisy tokens actually reduces accuracy (Figure 7), so Selective Language Modeling seems fundamental!
Find great reads in @akhaliq's Daily Papers 👉 https://huggingface.co/papers
Paper added to my collection 👉 m-ric/spinning-up-in-llms-659e698f9dd5a71bd3f579a7