RetrievalAttention: Accelerating Long-Context LLM Inference via Vector Retrieval Paper • 2409.10516 • Published Sep 16
MagicDec: Breaking the Latency-Throughput Tradeoff for Long Context Generation with Speculative Decoding Paper • 2408.11049 • Published Aug 20
A failed experiment: Infini-Attention, and why we should keep trying? Article • Published Aug 14
RegMix: Data Mixture as Regression for Language Model Pre-training Article • By SivilTaram • Published Jul 11
MInference 1.0: 10x Faster Million Context Inference with a Single GPU Article • By liyucheng • Published Jul 11
MInference 1.0: Accelerating Pre-filling for Long-Context LLMs via Dynamic Sparse Attention Paper • 2407.02490 • Published Jul 2
LLMLingua-2: Data Distillation for Efficient and Faithful Task-Agnostic Prompt Compression Paper • 2403.12968 • Published Mar 19
The Era of 1-bit LLMs: All Large Language Models are in 1.58 Bits Paper • 2402.17764 • Published Feb 27
Aya Dataset: An Open-Access Collection for Multilingual Instruction Tuning Paper • 2402.06619 • Published Feb 9
Deja Vu: Contextual Sparsity for Efficient LLMs at Inference Time Paper • 2310.17157 • Published Oct 26, 2023
LLMLingua: Compressing Prompts for Accelerated Inference of Large Language Models Paper • 2310.05736 • Published Oct 9, 2023
LongLLMLingua: Accelerating and Enhancing LLMs in Long Context Scenarios via Prompt Compression Paper • 2310.06839 • Published Oct 10, 2023
Unleashing Cognitive Synergy in Large Language Models: A Task-Solving Agent through Multi-Persona Self-Collaboration Paper • 2307.05300 • Published Jul 11, 2023