---
language:
  - en
tags:
  - pytorch
  - causal-lm
license: apache-2.0
---

Sparse GPT-J 6B

Model Description

Sparse GPT-J 6B is a pruned variant of the original GPT-J 6B model. The vast majority of its linear layers are held at 40% unstructured sparsity; the 'lm_head' layer is the exception.
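To make "40% unstructured sparsity" concrete, here is a minimal pure-Python sketch that zeroes 40% of the entries of a toy weight matrix and then measures the fraction of zeros. The model card does not say which pruning method was used; magnitude pruning is an assumption chosen only for illustration, and `magnitude_prune` and `zero_fraction` are hypothetical helper names.

```python
def magnitude_prune(weights, sparsity):
    """Return a copy of a 2-D weight matrix with the smallest-magnitude
    entries set to zero. Illustrative only: the actual pruning method
    used for this model is not stated in the card."""
    rows, cols = len(weights), len(weights[0])
    # Rank every (row, col) position by absolute weight value.
    order = sorted(
        ((r, c) for r in range(rows) for c in range(cols)),
        key=lambda rc: abs(weights[rc[0]][rc[1]]),
    )
    n_prune = round(rows * cols * sparsity)
    pruned = [row[:] for row in weights]
    # "Unstructured" means any individual entry may be zeroed,
    # with no constraint on rows, columns, or blocks.
    for r, c in order[:n_prune]:
        pruned[r][c] = 0.0
    return pruned


def zero_fraction(weights):
    """Fraction of entries that are exactly zero."""
    total = sum(len(row) for row in weights)
    zeros = sum(1 for row in weights for w in row if w == 0)
    return zeros / total


toy_weights = [
    [0.9, -0.1, 0.4, -0.05, 0.7],
    [0.02, -0.8, 0.3, 0.6, -0.2],
]
pruned = magnitude_prune(toy_weights, sparsity=0.4)
print(pruned)
print(zero_fraction(pruned))
```

The same `zero_fraction` idea can be applied to each linear layer's weight tensor to check that a loaded checkpoint has the expected sparsity, skipping 'lm_head'.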

| Hyperparameter | Value |
|---|---|
| $n_{parameters}$ | 6053381344 |
| $n_{layers}$ | 28* |
| $d_{model}$ | 4096 |
| $d_{ff}$ | 16384 |
| $n_{heads}$ | 16 |
| $d_{head}$ | 256 |
| $n_{ctx}$ | 2048 |
| $n_{vocab}$ | 50257/50400† (same tokenizer as GPT-2/3) |
| Positional Encoding | Rotary Position Embedding (RoPE) |
| RoPE Dimensions | 64 |

* Each layer consists of one feedforward block and one self attention block.

† Although the embedding matrix has a size of 50400, only 50257 entries are used by the GPT-2 tokenizer.

The model consists of 28 layers with a model dimension of 4096 and a feedforward dimension of 16384. The model dimension is split into 16 heads, each with a dimension of 256. Rotary Position Embedding (RoPE) is applied to 64 dimensions of each head. The model was trained with a tokenization vocabulary of 50257, using the same set of BPEs as GPT-2/GPT-3.
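The relationships between these hyperparameters can be checked with a few lines of arithmetic. The sketch below collects the values from the table in a dataclass and derives the consistency checks; `GPTJHyperparameters` and its field names are hypothetical and are not part of any library.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GPTJHyperparameters:
    """Values copied from the hyperparameter table (hypothetical container)."""
    n_layers: int = 28
    d_model: int = 4096
    d_ff: int = 16384
    n_heads: int = 16
    d_head: int = 256
    n_ctx: int = 2048
    n_vocab_used: int = 50257       # entries used by the GPT-2 tokenizer
    n_vocab_embedding: int = 50400  # rows in the embedding matrix
    rope_dims: int = 64


hp = GPTJHyperparameters()

# 16 heads of 256 dimensions each account for the full model dimension.
heads_cover_model = hp.n_heads * hp.d_head == hp.d_model
# The feedforward dimension is a whole multiple of the model dimension.
ff_ratio = hp.d_ff // hp.d_model
# RoPE is applied to this fraction of each head's dimensions.
rope_fraction = hp.rope_dims / hp.d_head
# Embedding rows that the tokenizer never indexes.
unused_embedding_rows = hp.n_vocab_embedding - hp.n_vocab_used

print(heads_cover_model, ff_ratio, rope_fraction, unused_embedding_rows)
```

In other words, the feedforward block is four times as wide as the model dimension, RoPE touches a quarter of each head, and 143 embedding rows go unused.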

Evaluation results

The sparse GPT-J 6B model was evaluated on the lambada_openai dataset with lm_eval. The table below reports the accuracy of the dense and sparse models, and the accuracy fluctuation between them, at two precisions: FP32 and BF16.

| Sparsity | Dataset | Precision | Dense Acc ↑ | Sparse Acc ↑ | Acc fluctuations |
|---|---|---|---|---|---|
| 40% | Lambada_openai | FP32 | 0.6831 | 0.6922 | +1.33% |
| 40% | Lambada_openai | BF16 | 0.6771 | 0.6874 | +0.63% |
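The card does not say how the "Acc fluctuations" column is computed. One reading that reproduces both reported figures is the relative change of the sparse accuracy against the FP32 dense accuracy (0.6831). This is an inference from the numbers, not something the source states, and `relative_change_pct` is a hypothetical helper.

```python
def relative_change_pct(baseline, value):
    """Relative change of `value` with respect to `baseline`, in percent."""
    return (value - baseline) / baseline * 100.0


# Accuracies copied from the evaluation table.
dense_fp32, sparse_fp32 = 0.6831, 0.6922
dense_bf16, sparse_bf16 = 0.6771, 0.6874

# Assumption: both rows use the FP32 dense accuracy as the baseline.
fp32_vs_fp32_dense = round(relative_change_pct(dense_fp32, sparse_fp32), 2)
bf16_vs_fp32_dense = round(relative_change_pct(dense_fp32, sparse_bf16), 2)

# For comparison: the BF16 row measured against its own BF16 dense baseline.
bf16_vs_bf16_dense = round(relative_change_pct(dense_bf16, sparse_bf16), 2)

print(fp32_vs_fp32_dense, bf16_vs_fp32_dense, bf16_vs_bf16_dense)
```

Under this reading the FP32 row gives 1.33 and the BF16 row gives 0.63, matching the table. Measured against the BF16 dense accuracy instead, the BF16 row would come out at 1.52.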