	### Model Introduction

	With the rapid development of artificial intelligence technology, large language models (LLMs) have made significant progress in fields such as natural language processing, computer vision, and scientific tasks. However, as these models scale up, optimizing resource consumption while maintaining high performance has become a key challenge. To address it, we have explored Mixture of Experts (MoE) models. The newly released Hunyuan-Large (Hunyuan-MoE-A50B) is currently the largest open-source Transformer-based MoE model in the industry, with 389 billion total parameters and 50 billion active parameters.
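
To make the distinction between total and active parameters concrete, here is a minimal sketch of top-k expert routing in plain Python. The two parameter counts come from the paragraph above; everything else (the `route_top_k` and `moe_forward` helpers, the four toy experts, and the router scores) is hypothetical and chosen only for illustration. It is not taken from the Hunyuan-Large implementation.

```python
import math

TOTAL_PARAMS_B = 389   # total parameters, in billions (from the text above)
ACTIVE_PARAMS_B = 50   # parameters active per token, in billions (from the text above)


def active_fraction(active_b, total_b):
    """Share of the model's parameters that take part in one forward pass."""
    return active_b / total_b


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def route_top_k(router_logits, k):
    """Pick the k highest-scoring experts and renormalize their weights."""
    probs = softmax(router_logits)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in top)
    return [(i, probs[i] / norm) for i in top]


def moe_forward(x, experts, router_logits, k):
    """Combine the outputs of the selected experts; the rest are never run."""
    return sum(w * experts[i](x) for i, w in route_top_k(router_logits, k))


# Hypothetical toy setup: four scalar "experts" and made-up router scores.
experts = [lambda x: x + 1, lambda x: 2 * x, lambda x: x * x, lambda x: -x]
router_logits = [0.1, 2.0, -1.0, 0.5]

print(f"active share: {active_fraction(ACTIVE_PARAMS_B, TOTAL_PARAMS_B):.1%}")
print(route_top_k(router_logits, k=2))
print(moe_forward(3.0, experts, router_logits, k=1))
```

With `k=1`, only one of the four toy experts runs for a given input. The same idea is what lets an MoE model with 389 billion total parameters use roughly 13% of them (50 billion) for each token.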

	By open-sourcing the Hunyuan-Large model and revealing related technical details, we hope to inspire more researchers with innovative ideas and collectively advance the progress and application of AI technology. We welcome you to join our open-source community to explore and optimize future AI models together!

	### Introduction to Model Technical Advantages

	#### Model
	- High-Quality Synthetic Data: By enhancing training with synthetic data, Hunyuan-Large can learn richer representations, handle long-context inputs, and generalize better to unseen data.

	- KV Cache Compression: Utilizes Grouped Query Attention (GQA) and Cross-Layer Attention (CLA) strategies to significantly reduce memory usage and computational overhead of KV caches, improving inference throughput.

	- Expert-Specific Learning Rate Scaling: Sets different learning rates for different experts to ensure each sub-model effectively learns from the data and contributes to overall performance.

	- Long-Context Processing Capability: The pre-trained model supports text sequences of up to 256K tokens, and the Instruct model supports up to 128K tokens, significantly enhancing the ability to handle long-context tasks.

	- Extensive Benchmarking: Conducts extensive experiments across various languages and tasks to validate the practical effectiveness and safety of Hunyuan-Large.
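
The KV cache savings described in the list above can be estimated with simple arithmetic. The sketch below is a rough back-of-the-envelope model, not the project's code: GQA lets several query heads share one key/value head, and CLA lets adjacent layers share one KV cache. The `kv_cache_bytes` helper and every number in the configuration (64 layers, 64 query heads, 8 KV heads, head dimension 128, 32K-token sequence, 16-bit values, layer pairs sharing a cache) are assumptions for illustration and are not Hunyuan-Large's published configuration.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len,
                   batch_size=1, bytes_per_value=2, cla_share=1):
    """Estimate KV cache size in bytes.

    cla_share is the number of adjacent layers that reuse one KV cache
    (1 means every layer stores its own, i.e. no cross-layer sharing).
    """
    caching_layers = -(-num_layers // cla_share)  # ceiling division
    per_token = 2 * caching_layers * num_kv_heads * head_dim  # 2 = keys + values
    return per_token * seq_len * batch_size * bytes_per_value


# Hypothetical configuration with 64 query heads, for illustration only.
cfg = dict(num_layers=64, head_dim=128, seq_len=32_768)  # 16-bit values by default
GIB = 2 ** 30

mha = kv_cache_bytes(num_kv_heads=64, **cfg)                  # one KV head per query head
gqa = kv_cache_bytes(num_kv_heads=8, **cfg)                   # 8 query heads share each KV head
gqa_cla = kv_cache_bytes(num_kv_heads=8, cla_share=2, **cfg)  # plus one cache per pair of layers

for name, size in [("MHA", mha), ("GQA", gqa), ("GQA + CLA", gqa_cla)]:
    print(f"{name:>9}: {size / GIB:.0f} GiB")
```

Under these assumed numbers the cache shrinks from 64 GiB to 8 GiB with GQA and to 4 GiB once layer pairs also share a cache. The savings grow linearly with sequence length, which is why this kind of compression matters for long-context inference.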




	## Benchmark Evaluation
	Hunyuan-Large achieves the best overall performance compared with both dense and MoE-based competitors that have similar activated parameter sizes. On aggregated benchmarks, Hunyuan-Large leads on MMLU and CMMLU and is close behind Llama3.1-405B on MMLU-Pro (60.2 vs. 61.6), confirming its comprehensive abilities on aggregated tasks. Hunyuan-Large also shows superior performance in commonsense understanding and reasoning, and in classical NLP tasks such as QA and reading comprehension (e.g., CommonsenseQA, PIQA, SIQA, BoolQ, and TriviaQA). In mathematics, Hunyuan-Large outperforms all baselines on GSM8K and MATH, and also achieves the best result on CMATH in Chinese. We also observe that Hunyuan-Large achieves the best overall performance across all Chinese tasks (e.g., CMMLU, C-Eval).

	| Model            | Llama3.1-405B | Llama3.1-70B | Mixtral-8x22B | DeepSeek-V2 | Hunyuan-Large |
	| ---------------- | ------------- | ------------ | ------------- | ----------- | ------------- |
	| MMLU             | 85.2          | 79.5         | 77.6          | 78.5        | 88.4          |
	| MMLU-Pro         | 61.6          | 53.8         | 49.5          | -           | 60.2          |
	| BBH              | 85.9          | 81.6         | 78.9          | 78.9        | 86.3          |
	| HellaSwag        | -             | -            | 88.7          | 87.8        | 86.8          |
	| CommonsenseQA    | 85.8          | 84.1         | 78.5          | -           | 92.9          |
	| WinoGrande       | 86.7          | 85.3         | 83.7          | 84.9        | 88.7          |
	| PIQA             | -             | -            | 83.6          | 83.7        | 88.3          |
	| SIQA             | -             | -            | 64.6          | -           | 83.6          |
	| NaturalQuestions | -             | -            | 40.2          | 38.7        | 52.8          |
	| BoolQ            | 80            | 79.4         | 87.4          | 84          | 92.9          |
	| DROP             | 84.8          | 79.6         | 80.4          | 80.1        | 88.9          |
	| ARC-C            | 96.1          | 92.9         | 91.2          | 92.4        | 95            |
	| TriviaQA         | -             | -            | 82.1          | 79.9        | 89.2          |
	| CMMLU            | -             | -            | 60            | 84          | 90.2          |
	| C-Eval           | -             | -            | 59.6          | 81.7        | 91.9          |
	| C3               | -             | -            | 71.4          | 77.4        | 82.3          |
	| GSM8K            | 89            | 83.7         | 83.7          | 79.2        | 92.8          |
	| MATH             | 53.8          | 41.4         | 41.8          | 43.6        | 69.8          |
	| CMATH            | -             | -            | 72.3          | 78.7        | 91.3          |
	| HumanEval        | -             | -            | 53.1          | 48.8        | 71.4          |
	| MBPP             | -             | -            | 78.6          | 73.9        | 87.3          |




	### Citation
	If you find our work helpful, please cite it.

	```
	@article{Tencent-Hunyuan-Large,
	title={Hunyuan-Large Technical Report},
	author={Xingwu Sun and Yanfeng Chen and Yiqing Huang and Ruobing Xie and Jiaqi Zhu and Kai Zhang and Shuaipeng Li and Xuemeng Huang and Zhen Yang and Jonny Han and Xiaobo Shu and Jiahao Bu and Zhongzhi Chen and Fengzong Lian and Saiyong Yang and Jianfeng Yan and Yuyuan Zeng and Xiaoqin Ren and Chao Yu and Lulu Wu and Yue Mao and Tao Yang and Kan Wu and Dengpeng Wu and Guanghui Xu and Shaohua Chen and Fusheng Xiang and Shuang Chen and Xiao Feng and Yigeng Hong and Junqiang Zheng and Chengcheng Xu and Zongwei Li and Suncong Zheng and Xiong Kuang and Jianglu Hu and Dian Jiao and Yiqi Chen and Jinbao Xue and Yangyu Tao and Chengzhong Xu and Winsony Hu and Feng Zhang and Jianshen Zhu and Zhanhui Kang and Di Wang and Jie Jiang},
	journal={arXiv:},
	year={2024}
	}
	```
