admarcosai's Collections: Efficient Inference
Prompt Cache: Modular Attention Reuse for Low-Latency Inference (arXiv:2311.04934)
Routing to the Expert: Efficient Reward-guided Ensemble of Large Language Models (arXiv:2311.08692)
Exponentially Faster Language Modelling (arXiv:2311.10770)
Memory Augmented Language Models through Mixture of Word Experts (arXiv:2311.10768)
Unlocking Anticipatory Text Generation: A Constrained Approach for Faithful Decoding with Large Language Models (arXiv:2312.06149)
SparQ Attention: Bandwidth-Efficient LLM Inference (arXiv:2312.04985)
Distributed Inference and Fine-tuning of Large Language Models Over The Internet (arXiv:2312.08361)
Steering Llama 2 via Contrastive Activation Addition (arXiv:2312.06681)
Context Tuning for Retrieval Augmented Generation (arXiv:2312.05708)
PowerInfer: Fast Large Language Model Serving with a Consumer-grade GPU (arXiv:2312.12456)
Soaring from 4K to 400K: Extending LLM's Context with Activation Beacon (arXiv:2401.03462)
Efficient LLM inference solution on Intel GPU (arXiv:2401.05391)
Supervised Knowledge Makes Large Language Models Better In-context Learners (arXiv:2312.15918)
BlackMamba: Mixture of Experts for State-Space Models (arXiv:2402.01771)
BitDelta: Your Fine-Tune May Only Be Worth One Bit (arXiv:2402.10193)
Speculative Streaming: Fast LLM Inference without Auxiliary Models (arXiv:2402.11131)