Collections including paper arxiv:2404.08634

- Instruction Pre-Training: Language Models are Supervised Multitask Learners
  Paper • 2406.14491 • Published • 85
- Pre-training Small Base LMs with Fewer Tokens
  Paper • 2404.08634 • Published • 34
- Stacking Your Transformers: A Closer Look at Model Growth for Efficient LLM Pre-Training
  Paper • 2405.15319 • Published • 25
- Can LLMs Learn by Teaching? A Preliminary Study
  Paper • 2406.14629 • Published • 17

- Rho-1: Not All Tokens Are What You Need
  Paper • 2404.07965 • Published • 84
- VASA-1: Lifelike Audio-Driven Talking Faces Generated in Real Time
  Paper • 2404.10667 • Published • 16
- Instruction-tuned Language Models are Better Knowledge Learners
  Paper • 2402.12847 • Published • 24
- DoRA: Weight-Decomposed Low-Rank Adaptation
  Paper • 2402.09353 • Published • 26

- Pre-training Small Base LMs with Fewer Tokens
  Paper • 2404.08634 • Published • 34
- Ziya2: Data-centric Learning is All LLMs Need
  Paper • 2311.03301 • Published • 16
- How to Train Data-Efficient LLMs
  Paper • 2402.09668 • Published • 38
- MiniCPM: Unveiling the Potential of Small Language Models with Scalable Training Strategies
  Paper • 2404.06395 • Published • 21

- RecurrentGemma: Moving Past Transformers for Efficient Open Language Models
  Paper • 2404.07839 • Published • 41
- Stack More Layers Differently: High-Rank Training Through Low-Rank Updates
  Paper • 2307.05695 • Published • 22
- Rho-1: Not All Tokens Are What You Need
  Paper • 2404.07965 • Published • 84
- Pre-training Small Base LMs with Fewer Tokens
  Paper • 2404.08634 • Published • 34

- LLM2Vec: Large Language Models Are Secretly Powerful Text Encoders
  Paper • 2404.05961 • Published • 64
- Leave No Context Behind: Efficient Infinite Context Transformers with Infini-attention
  Paper • 2404.07143 • Published • 103
- Scaling (Down) CLIP: A Comprehensive Analysis of Data, Architecture, and Training Strategies
  Paper • 2404.08197 • Published • 27
- Pre-training Small Base LMs with Fewer Tokens
  Paper • 2404.08634 • Published • 34

- Beyond Language Models: Byte Models are Digital World Simulators
  Paper • 2402.19155 • Published • 49
- Griffin: Mixing Gated Linear Recurrences with Local Attention for Efficient Language Models
  Paper • 2402.19427 • Published • 52
- VisionLLaMA: A Unified LLaMA Interface for Vision Tasks
  Paper • 2403.00522 • Published • 44
- Resonance RoPE: Improving Context Length Generalization of Large Language Models
  Paper • 2403.00071 • Published • 22

- Griffin: Mixing Gated Linear Recurrences with Local Attention for Efficient Language Models
  Paper • 2402.19427 • Published • 52
- Beyond Language Models: Byte Models are Digital World Simulators
  Paper • 2402.19155 • Published • 49
- StarCoder 2 and The Stack v2: The Next Generation
  Paper • 2402.19173 • Published • 134
- Simple linear attention language models balance the recall-throughput tradeoff
  Paper • 2402.18668 • Published • 18