Raincleared committed
Commit 0485383 • 1 Parent(s): e71389a
Upload README.md with huggingface_hub
README.md CHANGED
@@ -79,7 +79,7 @@ The 7B model is trained on 8 A100 GPUs. The learning rate (LR) is controlled by

The evaluation results on the above benchmarks demonstrate the advantage of ProSparse, which is the only method achieving high sparsity and comparable performance to the original Swish-activated LLaMA2. Note that models under all settings are trained with the same number of tokens on the same mixed dataset. Refer to Section 4.2 of [paper](TODO) for more details.

- | Setting | Average<br>Sparsity | Code<br>Generation | Commonsense<br>Reasoning | Reading<br>Comprehension | GSM8K | MMLU | BBH | AGI
+ | Setting | Average<br>Sparsity | Code<br>Generation | Commonsense<br>Reasoning | Reading<br>Comprehension | GSM8K | MMLU | BBH | AGI Eval | Average |
| :-------------------: | :-----------------: | :----------------: | :----------------------: | :----------------------: | :---: | :---: | :---: | :---------: | :-----: |
| Original-7B | - | 16.37 | 69.59 | 61.87 | 12.96 | 44.45 | 32.96 | 27.53 | 37.96 |
| ReluLLaMA-7B | 66.98 | 15.85 | 69.64 | 70.54 | 5.84 | 38.64 | 35.07 | 27.73 | 37.62 |

@@ -101,14 +101,14 @@ First, we utilize [PowerInfer](https://arxiv.org/pdf/2312.12456.pdf), a state-of

Moreover, considering the potential inference inaccuracies caused by wrong predictions of activation predictors, we implement two sparse GPU [operators](https://github.com/Raincleared-Song/sparse_gpu_operator) for faster accurate inference utilizing activation sparsity. They are responsible for the speedup of two key steps in a gated FFN:

- Step (2)
- Step (3)
+ Step (2) `S2`: a fused operator of ReLU and \\(\mathbf{s} \odot (\mathbf{x} \mathbf{W}_1^T)\\);
+ Step (3) `S3`: a sparse matrix-vector multiplication operator \\(\mathbf{x}_1 \mathbf{W}_2^T\\).

where \\(\mathbf{s}\\), \\(\mathbf{x}\\), \\(\mathbf{x}_1\\), and \\(\odot\\) denote the gating scores, the FFN input hidden states, the intermediate outputs, and the element-wise multiplication respectively. \\(\mathbf{W}_1\\) and \\(\mathbf{W}_2\\) are FFN weight matrices.

The acceleration effects of LLMs with different sparsity are displayed as follows. ProSparse, which reaches a high sparsity without performance degradation, can gain the most benefits among all the settings concerned. Refer to Section 4.3 of [paper](TODO) for more details.

- | Setting | Average<br>Sparsity | Activation<br>Recall | Predicted<br>Sparsity | PowerInfer<br>Speed |
+ | Setting | Average<br>Sparsity | Activation<br>Recall | Predicted<br>Sparsity | PowerInfer<br>Speed | `S2`<br>Time | `S2`<br>Speedup | `S3`<br/>Time | `S3`<br/>Speedup |
| :-------------------: | :-----------------: | :------------------: | :-------------------: | :-----------------: | :--------------: | :-----------------: | :---------------: | :------------------: |
| ReluLLaMA-7B | 66.98 | 90.89 | 58.95 | 11.37 | 67.12 | 1.35 | 63.00 | 1.32 |
| Vanilla ReLU-7B | 66.04 | 87.72 | 72.57 | 12.04 | 67.85 | 1.33 | 63.28 | 1.31 |
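For a concrete picture of the two steps that `S2` and `S3` accelerate, the sketch below reproduces the ReLU-gated FFN computation for a single token in plain PyTorch. It is only an illustration of the computation pattern, not the actual CUDA kernels from the linked sparse_gpu_operator repository; the tensor and weight names (`x`, `Wg`, `W1`, `W2`) are placeholders chosen here for clarity.

```python
import torch

hidden, inter = 8, 32
x  = torch.randn(hidden)          # FFN input hidden state for one token
Wg = torch.randn(inter, hidden)   # gate projection
W1 = torch.randn(inter, hidden)   # up projection (W_1 in the text)
W2 = torch.randn(hidden, inter)   # down projection (W_2 in the text)

# Dense reference for a ReLU-gated FFN.
s  = x @ Wg.T                      # Step (1): gating scores
x1 = torch.relu(s) * (x @ W1.T)    # Step (2) "S2": ReLU fused with the element-wise product
out_dense = x1 @ W2.T              # Step (3) "S3": down projection; x1 is mostly zeros

# Sparsity-aware version: only touch the rows of W1 and columns of W2
# whose gating score survives the ReLU.
idx = (s > 0).nonzero(as_tuple=True)[0]   # active intermediate channels
x1_active  = s[idx] * (x @ W1[idx].T)     # S2 restricted to active channels
out_sparse = x1_active @ W2[:, idx].T     # S3 skips the zero entries of x1

assert torch.allclose(out_dense, out_sparse, atol=1e-5)
```

The higher the activation sparsity (i.e., the fewer indices in `idx`), the smaller the fraction of `W1` and `W2` that has to be read and multiplied, which is where the `S2`/`S3` speedups in the table above come from.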