iofu728 committed on
Commit
c3766a1
1 Parent(s): 5251936

Feature(MInference): update information

Files changed (1): app.py +3 -5
app.py CHANGED
@@ -22,15 +22,13 @@ _Huiqiang Jiang†, Yucheng Li†, Chengruidong Zhang†, Qianhui Wu, Xufang Luo
  <a href="https://aka.ms/MInference" target="blank"> [Project Page]</a>
  <a href="https://arxiv.org/abs/2407.02490" target="blank"> [Paper]</a></h3>
 
- <h3>Now, you can process <b>1M context 10x faster in a single A100</b> using Long-context LLMs like LLaMA-3-8B-1M, GLM-4-1M, with even <b>better accuracy</b>, try <b>MInference 1.0</b> right now!</h3>
-
- ## TL;DR
- **MInference 1.0** leverages the dynamic sparse nature of LLMs' attention, which exhibits some static patterns, to speed up the pre-filling for long-context LLMs. It first determines offline which sparse pattern each head belongs to, then approximates the sparse index online and dynamically computes attention with the optimal custom kernels. This approach achieves up to a **10x speedup** for pre-filling on an A100 while maintaining accuracy.
-
  ## News
  - 🪗 [24/07/07] Thanks @AK for sponsoring. You can now use MInference online in the [HF Demo](https://huggingface.co/spaces/microsoft/MInference) with ZeroGPU.
  - 🧩 [24/07/03] We will present **MInference 1.0** at the _**Microsoft Booth**_ and _**ES-FoMo**_ at ICML'24. See you in Vienna!
 
+ ## TL;DR
+ **MInference 1.0** leverages the dynamic sparse nature of LLMs' attention, which exhibits some static patterns, to speed up the pre-filling for long-context LLMs. It first determines offline which sparse pattern each head belongs to, then approximates the sparse index online and dynamically computes attention with the optimal custom kernels. This approach achieves up to a **10x speedup** for pre-filling on an A100 while maintaining accuracy.
+
  <font color="brown"><b>This is only a deployment demo. You can follow the code below to try MInference locally.</b></font>
 
  ```bash
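
The patched description still points readers at running MInference locally. For reference, a minimal local-usage sketch, assuming the `minference` package's `MInference(attn_type, model_name)` patching interface as shown in the project README; the checkpoint name and prompt below are placeholders:

```python
# Sketch: patch a long-context Hugging Face model with MInference.
# Assumption: MInference("minference", model_name) returns a patcher that
# swaps the model's attention forward pass for dynamic sparse kernels.
from transformers import AutoModelForCausalLM, AutoTokenizer
from minference import MInference

model_name = "gradientai/Llama-3-8B-Instruct-262k"  # placeholder long-context checkpoint

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype="auto", device_map="auto"
)

# Apply the MInference patch before running any long prompt.
minference_patch = MInference("minference", model_name)
model = minference_patch(model)

inputs = tokenizer("Summarize the following report: ...", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

Since the patch only changes the attention path used during pre-filling, generation is invoked exactly as with an unpatched model.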
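The relocated TL;DR compresses the method into two stages: an offline pass assigns each attention head one of a few sparse patterns, and an online pass estimates that head's sparse index before attention is computed with custom kernels. A toy illustration of the online estimation idea for a "vertical-slash" head, written from the paper's high-level description rather than the library's actual kernels (the function name, shapes, and top-k budgets here are hypothetical):

```python
import torch

def estimate_vertical_slash(q, k, last_q=64, top_v=4, top_s=4):
    """Toy estimate of a vertical-slash sparse index for one attention head.

    Uses only the last `last_q` queries as a cheap proxy for where attention
    mass falls: the heaviest key columns ("vertical lines") and the heaviest
    diagonals ("slash lines"). Illustrative only; not MInference's kernels.
    """
    seq_len, d = q.shape
    qs = q[-last_q:]                                   # (last_q, d)
    scores = qs @ k.T / d ** 0.5                       # (last_q, seq_len)
    qi = torch.arange(seq_len - last_q, seq_len).unsqueeze(1)  # global query index
    ki = torch.arange(seq_len).unsqueeze(0)                    # key index
    scores = scores.masked_fill(ki > qi, float("-inf"))        # causal mask
    probs = scores.softmax(dim=-1)                     # (last_q, seq_len)
    # Vertical lines: key positions attracting the most total attention mass.
    vertical_idx = probs.sum(dim=0).topk(top_v).indices
    # Slash lines: accumulate mass per diagonal offset (query_idx - key_idx);
    # masked future positions carry zero probability, so clamping is harmless.
    offsets = (qi - ki).clamp(min=0)
    diag_mass = torch.zeros(seq_len).scatter_add_(0, offsets.flatten(), probs.flatten())
    slash_idx = diag_mass.topk(top_s).indices
    return vertical_idx, slash_idx

# Example: random single-head projections for a 1,024-token prefill.
q = torch.randn(1024, 64)
k = torch.randn(1024, 64)
print(estimate_vertical_slash(q, k))
```

A real implementation would batch this over heads, run on GPU, and feed the selected indices into sparse attention kernels; the point here is only that a handful of trailing queries suffices to guess where the attention mass concentrates.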