TransNormerLLM2 -- A Faster and Better LLM

Introduction
- Diff of TransNormerLLM2
Released Weights
Benchmark Results
Inference and Deployment
- Dependency Installation
- Inference
Fine-tuning the Model
Community and Ecosystem
Disclaimer, License and Citation

Introduction

This official repo introduces the TransNormerLLM model and releases its open-source weights. It also provides code for Supervised Fine-tuning (SFT) and inference.

TransNormerLLM evolved from TransNormer and is the first LLM built on the linear transformer architecture. It is also the first non-Transformer LLM to exceed both traditional Transformer models and other efficient Transformer models (such as RetNet and Mamba) in both speed and performance.

TransNormerLLM1 was released in November 2023 in three versions with 385M, 1B, and 7B parameters, each trained on 1.4 trillion tokens.
TransNormerLLM2 is the latest update, offering three versions with 1B, 3B, and 7B parameters, each trained on 0.3 trillion tokens.
All versions are available as open-source under the Apache-2.0 license.

Diff of TransNormerLLM2

TransNormerLLM1 incorporates Simple GLU in its channel mixer, GLA in the token mixer, and SRMSNorm for normalization. In this model, the channel and token mixers function sequentially in a pipeline arrangement.
TransNormerLLM2 also uses Simple GLU in the channel mixer, GLA in the token mixer, and SRMSNorm for normalization. However, in this version the channel and token mixers operate in parallel rather than one after the other.
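
The difference between the two arrangements can be sketched in a few lines of plain Python. This is an illustrative sketch only, not the released implementation: the residual wiring, the scale-free form of `srms_norm`, and the function names `sequential_block` and `parallel_block` are assumptions, and the two mixers are passed in as placeholder callables rather than real GLA or Simple GLU layers.

```python
import math
from typing import Callable, List

Vector = List[float]
Mixer = Callable[[Vector], Vector]


def srms_norm(x: Vector, eps: float = 1e-6) -> Vector:
    """Assumed form of SRMSNorm: divide by the root mean square, no learned gain."""
    rms = math.sqrt(sum(v * v for v in x) / len(x))
    return [v / (rms + eps) for v in x]


def add(a: Vector, b: Vector) -> Vector:
    """Element-wise sum, used for the residual connections."""
    return [p + q for p, q in zip(a, b)]


def sequential_block(x: Vector, token_mixer: Mixer, channel_mixer: Mixer) -> Vector:
    """v1-style wiring (assumed): the channel mixer sees the token mixer's output."""
    x = add(x, token_mixer(srms_norm(x)))
    x = add(x, channel_mixer(srms_norm(x)))
    return x


def parallel_block(x: Vector, token_mixer: Mixer, channel_mixer: Mixer) -> Vector:
    """v2-style wiring (assumed): both mixers read the same normalized input."""
    h = srms_norm(x)
    return add(add(x, token_mixer(h)), channel_mixer(h))


# Placeholder mixers, chosen only so the two wirings give different results.
def toy_token_mixer(h: Vector) -> Vector:
    return list(reversed(h))


def toy_channel_mixer(h: Vector) -> Vector:
    return [v + 1.0 for v in h]


x = [3.0, 4.0]
seq_out = sequential_block(x, toy_token_mixer, toy_channel_mixer)
par_out = parallel_block(x, toy_token_mixer, toy_channel_mixer)
```

In the parallel form the two mixer calls do not depend on each other, so they can in principle be computed at the same time. In the sequential form the channel mixer has to wait for the token mixer's output.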

Released Weights

The released versions and their download links are listed below:

param	token	Base Models
v1-385M	1400B	🤗 TransNormerLLM-385M
v1-1B	1400B	🤗 TransNormerLLM-1B
v1-7B	1400B	🤗 TransNormerLLM-7B
v2-1B	300B	🤗 TransNormerLLM2-1B-300B
v2-3B	300B	🤗 TransNormerLLM2-3B-300B
v2-7B	300B	🤗 TransNormerLLM2-7B-300B

Benchmark Results

TransNormerLLM is evaluated on Commonsense Reasoning tasks and Multiple-Choice questions. For comparison, a range of open-source models is included, covering both Transformer-based and non-Transformer-based architectures. All models are evaluated using the official settings and the lm-evaluation-harness framework.

Model	PS	T	BoolQ	PIQA	HS	WG	ARC-e	ARC-c	OBQA	MMLU	CMMLU	C-Eval
GPT-J	6.9	0.3	65.44	75.41	66.25	64.09	66.92	36.60	38.20	25.40	26.47	23.39
OPT	6.7	0.3	66.18	76.22	67.21	65.19	65.66	34.64	37.20	24.57	25.36	25.32
Pythia	6.9	0.3	63.46	75.14	63.92	60.77	67.34	35.41	37.00	24.64	25.56	26.40
BLOOM	7.1	0.35	62.91	72.69	62.33	64.01	65.11	33.45	35.80	26.25	24.97	24.25
RWKV	7.4	-	-	76.06	65.51	61.01	67.80	37.46	40.20	24.96	-	-
MPT	6.9	1.0	73.88	79.43	76.25	68.27	74.79	41.72	42.20	30.80	25.99	24.06
Falcon	7.2	1.5	73.73	79.38	76.3	67.17	74.62	43.60	43.80	27.79	25.73	22.92
Baichuan1	7.0	1.2	70.09	76.01	70.06	64.09	71.72	40.53	38.20	42.30	44.43	42.80
Baichuan2	7.0	2.6	72.72	76.50	72.17	68.35	75.17	42.32	39.60	54.16	57.07	54.00
ChatGLM1	6.7	1.0	74.74	68.88	45.57	52.25	48.78	31.66	36.80	40.63	37.48	40.23
ChatGLM2	7.1	1.4	77.65	69.37	50.51	57.62	59.13	34.30	37.00	45.46	48.80	52.55
OpenLLaMAv1	6.7	1.0	70.43	75.68	69.23	66.69	71.17	38.57	39.00	30.49	25.40	26.09
OpenLLaMAv2	6.7	1.0	72.20	78.84	74.51	65.67	72.39	41.30	41.00	41.29	29.58	30.01
LLaMA1	6.7	1.0	76.50	79.80	76.10	70.10	72.80	47.60	57.20	35.10	25.62	25.72
LLaMA2	6.7	2.0	77.68	78.07	76.02	68.98	76.30	46.33	44.20	45.30	32.96	33.20
TransNormerLLM-7B	6.8	1.4	75.11	85.47	78.61	66.93	73.11	52.99	61.60	44.90	49.32	45.01
TransNormerLLM2-7B	6.8	0.3	65.20	74.37	61.68	60.62	64.6	32.08	38.00	25.80	25.69	26.77

PS: parameter size (billion). T: tokens (trillion). BoolQ: acc. PIQA: acc. HS (HellaSwag): acc_norm. WG (WinoGrande): acc. ARC-e (ARC-easy): acc. ARC-c (ARC-challenge): acc_norm. OBQA (OpenBookQA): acc_norm. MMLU: 5-shot acc. CMMLU: 5-shot acc. C-Eval: 5-shot acc.

Inference and Deployment

Dependency Installation

📝Note Please configure the following environment before using the model:

pip install triton==2.0.0
pip install einops

Notice

If you run into Triton-related errors, we recommend disabling Triton:

export use_triton=False
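
When the model is loaded from a notebook or script rather than a shell session, the same switch can be set from Python. This is a minimal sketch, assuming the model code reads the `use_triton` environment variable when it is loaded, so the variable has to be set before that point. The helper name `disable_triton` is hypothetical.

```python
import os
from typing import MutableMapping


def disable_triton(env: MutableMapping[str, str] = os.environ) -> None:
    """Mirror `export use_triton=False` for the current Python process."""
    env["use_triton"] = "False"


# Call this before loading the model (assumption: the flag is read at load time).
disable_triton()
```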

Inference

>>> import torch
>>> from transformers import AutoModelForCausalLM, AutoTokenizer
>>> tokenizer = AutoTokenizer.from_pretrained("OpenNLPLab/TransNormerLLM2-7B-300B", trust_remote_code=True)
>>> model = AutoModelForCausalLM.from_pretrained("OpenNLPLab/TransNormerLLM2-7B-300B", torch_dtype=torch.bfloat16, device_map="auto", trust_remote_code=True)
>>> # Prompt: "Today is a beautiful day"; move the inputs to the model's device
>>> inputs = tokenizer('今天是美好的一天', return_tensors='pt').to(model.device)
>>> pred = model.generate(**inputs, max_new_tokens=8192, repetition_penalty=1.0)
>>> print(tokenizer.decode(pred.cpu()[0], skip_special_tokens=True))

Note: We recommend using bfloat16 with TransNormerLLM; float16 may produce NaN errors. Please check your device compatibility!
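
One way to check device compatibility before loading the model is sketched below. The helper names `choose_dtype_name` and `detect_dtype_name` are hypothetical. Falling back to float32 rather than float16 is an assumption made here to avoid the NaN issue mentioned above, at the cost of more memory. `torch.cuda.is_available()` and `torch.cuda.is_bf16_supported()` are standard PyTorch calls.

```python
def choose_dtype_name(cuda_available: bool, bf16_supported: bool) -> str:
    """Prefer bfloat16; otherwise fall back to float32 (assumed policy, avoids float16)."""
    if cuda_available and bf16_supported:
        return "bfloat16"
    return "float32"


def detect_dtype_name() -> str:
    """Query the current device through PyTorch and pick a dtype name."""
    import torch  # imported lazily so the pure helper above works without PyTorch

    cuda_available = torch.cuda.is_available()
    bf16_supported = cuda_available and torch.cuda.is_bf16_supported()
    return choose_dtype_name(cuda_available, bf16_supported)
```

The result can then be passed to the loader, for example `torch_dtype=getattr(torch, detect_dtype_name())`.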

Fine-tuning the Model

Dependency Installation

git clone https://github.com/OpenNLPLab/TransNormerLLM.git
cd TransNormerLLM/fine-tune
pip install -r requirements.txt

To use lightweight fine-tuning methods like LoRA, you must additionally install peft.

Training

Below, we provide an example of fine-tuning the TransNormerLLM-1B on a single machine with ZeRO-3.

Training Data: alpaca_data.json. This sample data is a reformatted selection of 52,002 entries drawn from alpaca_data.json. Its main purpose is to demonstrate how to run SFT on our model; effectiveness is not guaranteed.
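
Before launching a run, it can help to sanity-check the data file. The sketch below assumes the file keeps the usual Alpaca layout, a JSON list of objects with `instruction`, `input`, and `output` keys. Since the sample data has been reformatted, the actual keys may differ, so treat the key names and the helper names `load_records`, `validate_records`, and `to_prompt_and_response` as assumptions.

```python
import json
from typing import Dict, List, Tuple

# Assumed keys, following the common Alpaca layout; the repo's file may differ.
REQUIRED_KEYS = ("instruction", "input", "output")


def load_records(path: str) -> List[Dict[str, str]]:
    """Read the whole JSON file; it is expected to hold a list of records."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def validate_records(records: List[Dict[str, str]]) -> int:
    """Raise on the first record missing a required key; return the record count."""
    for i, rec in enumerate(records):
        missing = [k for k in REQUIRED_KEYS if k not in rec]
        if missing:
            raise ValueError(f"record {i} is missing keys: {missing}")
    return len(records)


def to_prompt_and_response(rec: Dict[str, str]) -> Tuple[str, str]:
    """Join instruction and optional input into a prompt; the output is the target."""
    prompt = rec["instruction"]
    if rec["input"]:
        prompt += "\n" + rec["input"]
    return prompt, rec["output"]


# Tiny in-memory example standing in for the real file.
sample = [
    {"instruction": "Translate to French.", "input": "Good morning", "output": "Bonjour"},
    {"instruction": "Name a prime number.", "input": "", "output": "7"},
]
count = validate_records(sample)
pairs = [to_prompt_and_response(rec) for rec in sample]
```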

torchrun \
    --nproc_per_node=8 \
    train.py \
    --model_name_or_path OpenNLPLab/TransNormerLLM-1B \
    --data_path ./alpaca_data.json \
    --output_dir output \
    --num_train_epochs 1 \
    --per_device_train_batch_size 2 \
    --per_device_eval_batch_size 1 \
    --gradient_accumulation_steps 1 \
    --bf16 true \
    --adam_beta1 0.9 \
    --adam_beta2 0.95 \
    --evaluation_strategy "no" \
    --save_strategy "steps" \
    --save_steps 5000 \
    --save_total_limit 30 \
    --learning_rate 1e-4 \
    --weight_decay 0.1 \
    --warmup_ratio 0.1 \
    --lr_scheduler_type "cosine" \
    --deepspeed 'configs/zero3.json' \
    --logging_steps 1 \
    --dataloader_num_workers 24 \
    --ddp_find_unused_parameters false \
    --tf32 true
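
The batch-related flags in this command determine how many optimizer steps the run takes. The arithmetic can be sketched as follows, assuming the 52,002-entry dataset described above, a single node, and a trainer that keeps the last partial batch; exact counts depend on how `train.py` and the trainer handle rounding. The function name `sft_schedule` is hypothetical.

```python
import math
from typing import Dict


def sft_schedule(
    num_examples: int,
    nproc_per_node: int,
    per_device_batch: int,
    grad_accum: int,
    epochs: int,
    warmup_ratio: float,
) -> Dict[str, int]:
    """Estimate global batch size, optimizer steps, and warmup steps for one node."""
    global_batch = nproc_per_node * per_device_batch * grad_accum
    steps_per_epoch = math.ceil(num_examples / global_batch)  # keeps the last partial batch
    total_steps = steps_per_epoch * epochs
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    return {
        "global_batch": global_batch,
        "steps_per_epoch": steps_per_epoch,
        "total_steps": total_steps,
        "warmup_steps": warmup_steps,
    }


# Values taken from the command above: 8 processes, batch size 2, no accumulation.
plan = sft_schedule(52_002, 8, 2, 1, 1, 0.1)
print(plan)
```

Under these assumptions the global batch is 16 sequences, so one epoch is about 3,251 steps. That is fewer than `--save_steps 5000`, so step-based checkpointing would likely not trigger during a one-epoch run. Lower `--save_steps` if intermediate checkpoints are needed.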

Community and Ecosystem

📢📢📢 We will continuously update the support for TransNormerLLM from the community and ecosystem here 😀😀😀

nanoTransnormer

Disclaimer, License and Citation

Disclaimer

Our team has not created any applications using TransNormerLLM models for any platform including iOS, Android, and web. We urge users not to use these models for illegal activities or anything that could harm national or social security. We also advise against using these models for online services that haven't passed security reviews and legal procedures. We hope everyone will follow these guidelines to ensure technology develops in a safe and lawful way.

We've tried hard to make sure the data in our model training is compliant, but because the model and data are complex, there might still be unexpected issues. If any problems occur from using TransNormerLLM open-source models, like data security issues, public opinion risks, or problems caused by misuse or improper use of the model, we will not be responsible.

License

Community use of the TransNormerLLM model requires adherence to Apache 2.0 and the Community License for TransNormerLLM Model. The TransNormerLLM model supports commercial use. If you plan to use the TransNormerLLM model or its derivatives for commercial purposes, please ensure that you have submitted the application materials required by the TransNormerLLM Model Community License Agreement via the following contact email: [email protected].

Acknowledgments

Our project is developed based on the following open source projects:

Baichuan for the tokenizer.
metaseq for training.
lm-evaluation-harness for evaluation.

Citation

If you wish to cite our work, please use the following reference:

@article{qin2023scaling,
  title={Scaling transnormer to 175 billion parameters},
  author={Qin, Zhen and Li, Dong and Sun, Weigao and Sun, Weixuan and Shen, Xuyang and Han, Xiaodong and Wei, Yunshen and Lv, Baohong and Yuan, Fei and Luo, Xiao and others},
  journal={arXiv preprint arXiv:2307.14995},
  year={2023}
}

- OpenNLPLab @2024 -
