Raincleared committed on
Commit 0485383
1 Parent(s): e71389a

Upload README.md with huggingface_hub

Files changed (1)
  1. README.md +4 -4
README.md CHANGED
@@ -79,7 +79,7 @@ The 7B model is trained on 8 A100 GPUs. The learning rate (LR) is controlled by
 
 The evaluation results on the above benchmarks demonstrate the advantage of ProSparse, which is the only method achieving high sparsity and comparable performance to the original Swish-activated LLaMA2. Note that models under all settings are trained with the same number of tokens on the same mixed dataset. Refer to Section 4.2 of [paper](TODO) for more details.
 
- | Setting | Average<br>Sparsity | Code<br>Generation | Commonsense<br>Reasoning | Reading<br>Comprehension | GSM8K | MMLU | BBH | AGI<br>Eval | Average |
+ | Setting | Average<br>Sparsity | Code<br>Generation | Commonsense<br>Reasoning | Reading<br>Comprehension | GSM8K | MMLU | BBH | AGI Eval | Average |
 | :-------------------: | :-----------------: | :----------------: | :----------------------: | :----------------------: | :---: | :---: | :---: | :---------: | :-----: |
 | Original-7B | - | 16.37 | 69.59 | 61.87 | 12.96 | 44.45 | 32.96 | 27.53 | 37.96 |
 | ReluLLaMA-7B | 66.98 | 15.85 | 69.64 | 70.54 | 5.84 | 38.64 | 35.07 | 27.73 | 37.62 |
@@ -101,14 +101,14 @@ First, we utilize [PowerInfer](https://arxiv.org/pdf/2312.12456.pdf), a state-of
 
 Moreover, considering the potential inference inaccuracies caused by wrong predictions of activation predictors, we implement two sparse GPU [operators](https://github.com/Raincleared-Song/sparse_gpu_operator) for faster accurate inference utilizing activation sparsity. They are responsible for the speedup of two key steps in a gated FFN:
 
- - Step (2): a fused operator of ReLU and \\(\mathbf{s} \odot (\mathbf{x} \mathbf{W}_1^T)\\);
- - Step (3): a sparse matrix-vector multiplication operator \\(\mathbf{x}_1 \mathbf{W}_2^T\\).
+ - Step (2) `S2`: a fused operator of ReLU and \\(\mathbf{s} \odot (\mathbf{x} \mathbf{W}_1^T)\\);
+ - Step (3) `S3`: a sparse matrix-vector multiplication operator \\(\mathbf{x}_1 \mathbf{W}_2^T\\).
 
 where \\(\mathbf{s}\\), \\(\mathbf{x}\\), \\(\mathbf{x}_1\\), and \\(\odot\\) denote the gating scores, the FFN input hidden states, the intermediate outputs, and the element-wise multiplication respectively. \\(\mathbf{W}_1\\) and \\(\mathbf{W}_2\\) are FFN weight matrices.
 
 The acceleration effects of LLMs with different sparsity are displayed as follows. ProSparse, which reaches a high sparsity without performance degradation, can gain the most benefits among all the settings concerned. Refer to Section 4.3 of [paper](TODO) for more details.
 
- | Setting | Average<br>Sparsity | Activation<br>Recall | Predicted<br>Sparsity | PowerInfer<br>Speed | Step (2)<br>Time | Step (2)<br>Speedup | Step (3)<br/>Time | Step (3)<br/>Speedup |
+ | Setting | Average<br>Sparsity | Activation<br>Recall | Predicted<br>Sparsity | PowerInfer<br>Speed | `S2`<br>Time | `S2`<br>Speedup | `S3`<br/>Time | `S3`<br/>Speedup |
 | :-------------------: | :-----------------: | :------------------: | :-------------------: | :-----------------: | :--------------: | :-----------------: | :---------------: | :------------------: |
 | ReluLLaMA-7B | 66.98 | 90.89 | 58.95 | 11.37 | 67.12 | 1.35 | 63.00 | 1.32 |
 | Vanilla ReLU-7B | 66.04 | 87.72 | 72.57 | 12.04 | 67.85 | 1.33 | 63.28 | 1.31 |
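
As a point of reference for the two steps labeled `S2` and `S3` in the diff above, below is a minimal dense PyTorch sketch of a ReLU-activated gated FFN, with comments marking where the two sparse GPU operators would replace the dense computations. The module and variable names (`ReluGatedFFN`, `gate_proj`, `up_proj`, `down_proj`) are illustrative assumptions, not the actual interfaces of the released operators.

```python
import torch
import torch.nn as nn


class ReluGatedFFN(nn.Module):
    """Dense reference for a ReLU-activated gated FFN (LLaMA-style).

    Notation follows the README: x is the FFN input, s the gating scores,
    x1 the intermediate output; W_1 (up) and W_2 (down) are the projection
    matrices named in the formulas, W_gate produces the gating scores.
    """

    def __init__(self, hidden_size: int, intermediate_size: int):
        super().__init__()
        self.gate_proj = nn.Linear(hidden_size, intermediate_size, bias=False)  # W_gate
        self.up_proj = nn.Linear(hidden_size, intermediate_size, bias=False)    # W_1
        self.down_proj = nn.Linear(intermediate_size, hidden_size, bias=False)  # W_2

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Step (1): gating scores s = x @ W_gate^T
        s = self.gate_proj(x)
        # Step (2) `S2`: x1 = ReLU(s) * (x @ W_1^T).
        # The first sparse operator fuses the ReLU and the element-wise
        # product into a single kernel.
        x1 = torch.relu(s) * self.up_proj(x)
        # Step (3) `S3`: x1 @ W_2^T.
        # Most entries of x1 are zero after ReLU, so the second sparse
        # operator performs this as a sparse matrix-vector product that
        # skips the zeroed entries.
        return self.down_proj(x1)
```

The benefit of `S3` scales with the sparsity of \\(\mathbf{x}_1\\): the larger the fraction of zero entries after ReLU, the more columns of \\(\mathbf{W}_2\\) the sparse kernel can skip.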