LiyuanLucasLiu committed
Commit a718c05
1 Parent(s): 234a12a

link added

Files changed (1)
1. README.md +2 -2
README.md CHANGED
@@ -22,9 +22,9 @@ library_name: transformers
 
  - With **only 6.6B** activated parameters, GRIN MoE achieves **exceptionally good** performance across a diverse set of tasks, particularly in coding and mathematics tasks.
 
- - GRIN uses **SparseMixer-v2** to estimate the gradient related to expert routing, while the conventional MoE training treats expert gating as a proxy for the gradient estimation.
+ - GRIN uses [**SparseMixer-v2**](https://arxiv.org/html/2409.12136v1#Pt1) to estimate the gradient related to expert routing, while the conventional MoE training treats expert gating as a proxy for the gradient estimation.
 
- - GRIN scales MoE training with **neither expert parallelism nor token dropping**, while the conventional MoE training employs expert parallelism and deploys token dropping.
+ - GRIN scales MoE training with [**neither expert parallelism nor token dropping**](https://arxiv.org/pdf/2409.12136#page=5.42), while the conventional MoE training employs expert parallelism and deploys token dropping.
 
  ## Intended Uses
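
To make the distinction in the SparseMixer-v2 bullet concrete, below is a minimal PyTorch sketch contrasting the conventional gating-as-proxy gradient path with a straight-through-style estimator that backpropagates through the discrete top-1 routing decision. This is an illustrative toy under stated assumptions, not the paper's actual SparseMixer-v2 estimator (which uses a more careful mid-point gradient approximation; see the linked arXiv sections); `router`, `experts`, and all shapes are hypothetical.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def conventional_moe(x, router, experts):
    # Conventional top-1 MoE: the softmax gate value scales the chosen
    # expert's output, so routing gradients flow only through that scalar
    # gate -- the gate acts as a proxy for the true routing gradient.
    probs = F.softmax(router(x), dim=-1)                 # (batch, n_experts)
    top_p, top_i = probs.max(dim=-1)
    out = torch.stack([experts[int(i)](x[b]) for b, i in enumerate(top_i)])
    return top_p.unsqueeze(-1) * out

def straight_through_moe(x, router, experts):
    # Straight-through-style sketch (NOT the exact SparseMixer-v2 method):
    # the forward pass uses the hard one-hot routing decision, while the
    # backward pass substitutes the softmax gradient for the argmax.
    probs = F.softmax(router(x), dim=-1)
    top_i = probs.argmax(dim=-1)
    hard = F.one_hot(top_i, probs.shape[-1]).to(probs.dtype)
    gate = hard + probs - probs.detach()                 # value: hard; grad: soft
    out = torch.stack([experts[int(i)](x[b]) for b, i in enumerate(top_i)])
    return (gate * hard).sum(-1, keepdim=True) * out     # scale is exactly 1 forward

# Hypothetical usage with toy shapes:
d, n_experts = 8, 4
router = nn.Linear(d, n_experts)
experts = nn.ModuleList(nn.Linear(d, d) for _ in range(n_experts))
x = torch.randn(2, d)
y = straight_through_moe(x, router, experts)
y.sum().backward()   # router receives gradients through the soft path
```

In the conventional version the routing gradient is whatever the gate scalar happens to pass along; in the straight-through variant the hard decision is used at inference-equivalent fidelity while gradients still reach the router, which is the general problem SparseMixer-v2 addresses with a sounder estimator.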