lgaalves's Collections
mixture-of-experts
Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer • arXiv:1701.06538
Sparse Networks from Scratch: Faster Training without Losing Performance • arXiv:1907.04840
ZeRO: Memory Optimizations Toward Training Trillion Parameter Models • arXiv:1910.02054
A Mixture of h-1 Heads is Better than h Heads • arXiv:2005.06537
GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding • arXiv:2006.16668
Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity • arXiv:2101.03961
FastMoE: A Fast Mixture-of-Expert Training System • arXiv:2103.13262
BASE Layers: Simplifying Training of Large, Sparse Models • arXiv:2103.16716
SpeechMoE: Scaling to Large Acoustic Models with Dynamic Routing Mixture of Experts • arXiv:2105.03036
DSelect-k: Differentiable Selection in the Mixture of Experts with Applications to Multi-Task Learning • arXiv:2106.03760
Scaling Vision with Sparse Mixture of Experts • arXiv:2106.05974
Hash Layers For Large Sparse Models • arXiv:2106.04426
DEMix Layers: Disentangling Domains for Modular Language Modeling • arXiv:2108.05036
A Machine Learning Perspective on Predictive Coding with PAQ • arXiv:1108.3298
Efficient Large Scale Language Modeling with Mixtures of Experts • arXiv:2112.10684
Unified Scaling Laws for Routed Language Models • arXiv:2202.01169
ST-MoE: Designing Stable and Transferable Sparse Expert Models • arXiv:2202.08906
Mixture-of-Experts with Expert Choice Routing • arXiv:2202.09368
Multimodal Contrastive Learning with LIMoE: the Language-Image Mixture of Experts • arXiv:2206.02770
Branch-Train-Merge: Embarrassingly Parallel Training of Expert Language Models • arXiv:2208.03306
A Review of Sparse Expert Models in Deep Learning • arXiv:2209.01667
Sparsity-Constrained Optimal Transport • arXiv:2209.15466
Mixture of Attention Heads: Selecting Attention Heads Per Token • arXiv:2210.05144
MegaBlocks: Efficient Sparse Training with Mixture-of-Experts • arXiv:2211.15841
Sparse Upcycling: Training Mixture-of-Experts from Dense Checkpoints • arXiv:2212.05055
Mixture-of-Experts Meets Instruction Tuning: A Winning Combination for Large Language Models • arXiv:2305.14705
From Sparse to Soft Mixtures of Experts • arXiv:2308.00951
Approximating Two-Layer Feedforward Networks for Efficient Transformers • arXiv:2310.10837
QMoE: Practical Sub-1-Bit Compression of Trillion-Parameter Models • arXiv:2310.16795
SwitchHead: Accelerating Transformers with Mixture-of-Experts Attention • arXiv:2312.07987
Mixture of Cluster-conditional LoRA Experts for Vision-language Instruction Tuning • arXiv:2312.12379
Fast Inference of Mixture-of-Experts Language Models with Offloading • arXiv:2312.17238
Mixtral of Experts • arXiv:2401.04088
MoE-Mamba: Efficient Selective State Space Models with Mixture of Experts • arXiv:2401.04081
DeepSeekMoE: Towards Ultimate Expert Specialization in Mixture-of-Experts Language Models • arXiv:2401.06066
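
Most papers in this collection build on the sparsely gated mixture-of-experts layer of arXiv:1701.06538: a router scores every token against a set of expert feed-forward networks, only the top-k experts run for that token, and their outputs are mixed by the gate weights. The sketch below is a minimal, illustrative top-k MoE block in PyTorch; the class name TopKMoE, the hyperparameters, and the softmax-after-top-k gating choice are assumptions for illustration, not the implementation of any specific paper listed above.

# Minimal sketch of a sparsely gated top-k MoE feed-forward layer (illustrative only).
import torch
import torch.nn as nn
import torch.nn.functional as F


class TopKMoE(nn.Module):
    """Route each token to its top-k experts and mix their outputs by gate weight."""

    def __init__(self, d_model: int, d_hidden: int, num_experts: int = 8, top_k: int = 2):
        super().__init__()
        self.top_k = top_k
        self.gate = nn.Linear(d_model, num_experts, bias=False)  # router producing expert logits
        self.experts = nn.ModuleList(
            [nn.Sequential(nn.Linear(d_model, d_hidden), nn.GELU(), nn.Linear(d_hidden, d_model))
             for _ in range(num_experts)]
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (tokens, d_model) -- flatten any batch/sequence dims before calling.
        logits = self.gate(x)                               # (tokens, num_experts)
        weights, indices = logits.topk(self.top_k, dim=-1)  # keep the k largest router logits
        weights = F.softmax(weights, dim=-1)                # renormalize over the chosen experts
        out = torch.zeros_like(x)
        for slot in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = indices[:, slot] == e                # tokens whose slot-th choice is expert e
                if mask.any():
                    out[mask] += weights[mask, slot, None] * expert(x[mask])
        return out


if __name__ == "__main__":
    layer = TopKMoE(d_model=16, d_hidden=32)
    tokens = torch.randn(10, 16)
    print(layer(tokens).shape)  # torch.Size([10, 16])

The per-expert loop keeps the sketch readable; production systems in this collection (e.g. FastMoE, MegaBlocks) instead batch tokens per expert and add load-balancing losses, capacity limits, and expert parallelism.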