Feature(MInference): update information
app.py CHANGED
@@ -22,15 +22,13 @@ _Huiqiang Jiang†, Yucheng Li†, Chengruidong Zhang†, Qianhui Wu, Xufang Luo
 <a href="https://aka.ms/MInference" target="blank"> [Project Page]</a>
 <a href="https://arxiv.org/abs/2407.02490" target="blank"> [Paper]</a></h3>
 
-<h3>Now, you can process <b>1M context 10x faster in a single A100</b> using Long-context LLMs like LLaMA-3-8B-1M, GLM-4-1M, with even <b>better accuracy</b>, try <b>MInference 1.0</b> right now!</h3>
-
-## TL;DR
-**MInference 1.0** leverages the dynamic sparse nature of LLMs' attention, which exhibits some static patterns, to speed up the pre-filling for long-context LLMs. It first determines offline which sparse pattern each head belongs to, then approximates the sparse index online and dynamically computes attention with the optimal custom kernels. This approach achieves up to a **10x speedup** for pre-filling on an A100 while maintaining accuracy.
-
 ## News
 - 🪗 [24/07/07] Thanks @AK for sponsoring. You can now use MInference online in the [HF Demo](https://huggingface.co/spaces/microsoft/MInference) with ZeroGPU.
 - 🧩 [24/07/03] We will present **MInference 1.0** at the _**Microsoft Booth**_ and _**ES-FoMo**_ at ICML'24. See you in Vienna!
 
+## TL;DR
+**MInference 1.0** leverages the dynamic sparse nature of LLMs' attention, which exhibits some static patterns, to speed up the pre-filling for long-context LLMs. It first determines offline which sparse pattern each head belongs to, then approximates the sparse index online and dynamically computes attention with the optimal custom kernels. This approach achieves up to a **10x speedup** for pre-filling on an A100 while maintaining accuracy.
+
 <font color="brown"><b>This is only a deployment demo. You can follow the code below to try MInference locally.</b></font>
 
 ```bash
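
The setup snippet in the hunk is cut off at the `bash` fence. For reference, a minimal sketch of the local-usage pattern the note refers to, assuming the `MInference` patcher class and the `"minference"` attention-type string from the project README (check https://aka.ms/MInference for the authoritative instructions; the model name is just an example):

```python
# Hedged sketch of trying MInference locally; API names are assumptions
# taken from the project README, not from this demo's code.
# Install first:  pip install minference
from transformers import pipeline
from minference import MInference  # assumed import path

model_name = "gradientai/Llama-3-8B-Instruct-262k"  # example long-context model

pipe = pipeline(
    "text-generation",
    model=model_name,
    torch_dtype="auto",
    device_map="auto",
)

# Patch the HF model so pre-filling runs through MInference's
# dynamic sparse attention kernels instead of dense attention.
minference_patch = MInference("minference", model_name)
pipe.model = minference_patch(pipe.model)

print(pipe("Summarize the following document: ...", max_new_tokens=64)[0]["generated_text"])
```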
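
And as a toy illustration of the TL;DR's idea — pick a sparse pattern per head offline, estimate the sparse index online, then compute attention only where it matters — here is a block-sparse sketch. It is purely illustrative (single head, no causal mask, mean-pooled block scoring), not MInference's actual patterns or kernels:

```python
# Toy block-sparse attention: estimate which key blocks each query block
# needs, then attend only to those blocks. Illustrative only.
import torch

def block_sparse_attention(q, k, v, block: int = 64, topk: int = 4):
    """q, k, v: [seq, dim] for one head; seq must be a multiple of `block`."""
    s, d = q.shape
    nb = s // block
    # Online approximation step: score key blocks against query blocks
    # with mean-pooled representations, keep the top-k key blocks each.
    q_pool = q.view(nb, block, d).mean(dim=1)              # [nb, d]
    k_pool = k.view(nb, block, d).mean(dim=1)              # [nb, d]
    keep = (q_pool @ k_pool.T).topk(topk, dim=-1).indices  # [nb, topk]

    out = torch.zeros_like(q)
    for i in range(nb):
        qi = q[i * block:(i + 1) * block]
        cols = torch.cat([torch.arange(j * block, (j + 1) * block)
                          for j in keep[i].tolist()])
        # Dense attention restricted to the selected key/value blocks.
        attn = torch.softmax(qi @ k[cols].T / d ** 0.5, dim=-1)
        out[i * block:(i + 1) * block] = attn @ v[cols]
    return out

q, k, v = (torch.randn(256, 32) for _ in range(3))
print(block_sparse_attention(q, k, v).shape)  # torch.Size([256, 32])
```

With `topk` fixed, the attention cost per query block stays constant as the sequence grows, which is the source of the pre-filling speedup the TL;DR describes.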