Introduction
Basically an update to our earlier attempt, vicuna-chinese-replication-beta.
- We adopted a curriculum-learning-like approach, starting from simple QA pairs and progressing to reasoning-intensive coding and mathematical problems (see the data-ordering sketch after this list). Coincidentally, Ziya adopted the same idea during its SFT stage.
- The base model was changed from chinese-llama to chinese-llama-plus. However, as observed by BiLLa, continued training on a Chinese-only corpus significantly increases perplexity on English corpora, which in turn undermines abilities such as mathematical calculation in our preliminary experiments (see the perplexity sketch after this list). Continual pre-training remains under-studied, and using a bilingual corpus appears to be the better alternative so far.
- We switched to the Vicuna v1.1 conversation template and included more CoT training data.
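
A minimal sketch of the curriculum-style data ordering mentioned in the first point. The stage tags, file name, and three-stage split are illustrative assumptions, not the actual training pipeline.

```python
import json

# Hypothetical difficulty ranking for curriculum-style SFT ordering:
# simple QA first, general instructions next, reasoning-heavy
# coding/math problems last. Stage names and the data file are illustrative only.
STAGE_ORDER = {"simple_qa": 0, "general_instruction": 1, "reasoning_code_math": 2}

def curriculum_sort(examples):
    """Order SFT examples from easy to hard by their (assumed) stage tag."""
    return sorted(examples, key=lambda ex: STAGE_ORDER[ex["stage"]])

if __name__ == "__main__":
    with open("sft_data.jsonl") as f:  # hypothetical data file
        examples = [json.loads(line) for line in f]
    ordered = curriculum_sort(examples)
    # feed `ordered` to the SFT trainer sequentially instead of shuffling globally
```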
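
To illustrate the English-perplexity regression mentioned in the second point, here is a minimal perplexity-measurement sketch using transformers. The placeholder checkpoint name, sample text, and single-pass (no sliding window) setup are simplifications, not the evaluation we actually ran.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def perplexity(model, tokenizer, text):
    """Compute token-level perplexity of `text` under a causal LM."""
    enc = tokenizer(text, return_tensors="pt").to(model.device)
    with torch.no_grad():
        # labels == input_ids makes the model return the mean cross-entropy loss
        loss = model(**enc, labels=enc["input_ids"]).loss
    return torch.exp(loss).item()

# Example: compare an English snippet's perplexity before and after continual
# pre-training on Chinese-only data (checkpoint name below is a placeholder).
# tok = AutoTokenizer.from_pretrained("base-checkpoint", use_fast=False)
# lm = AutoModelForCausalLM.from_pretrained("base-checkpoint").cuda()
# print(perplexity(lm, tok, "The derivative of x^2 is 2x."))
```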
Again, this model is for research purposes only. There is no guarantee of its performance. All credit goes to the original authors of LLaMA and Chinese-LLaMA.
Compared with the previous release, the new model improves on coding and reasoning problems. However, it still suffers from hallucinations and performs poorly on Chinese domain-specific problems, e.g. Chinese literature and idioms.
Usage
We use exactly the Vicuna template for training and inference. Sample code is shown below.
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

checkpoint = "keyfan/vicuna-chinese-replication-v1.1"
tokenizer = AutoTokenizer.from_pretrained(checkpoint, use_fast=False)
model = AutoModelForCausalLM.from_pretrained(checkpoint).cuda()

template = ("A chat between a curious human and an artificial intelligence assistant. "
            "The assistant gives helpful, detailed, and polite answers to the human's questions. "
            "USER: {}\nASSISTANT:")
question = template.format("Who was the president of the United States in 1955?")
inputs = tokenizer.encode(question, return_tensors="pt").cuda()
outputs = model.generate(inputs, do_sample=True, temperature=0.2, max_new_tokens=512)
print(tokenizer.decode(outputs[0]))
```
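
The decoded string contains the prompt as well as the generation. A small helper like the following (hypothetical, not part of the released code) extracts only the assistant's reply:

```python
def extract_answer(decoded: str) -> str:
    """Return only the assistant's reply from the decoded generation."""
    answer = decoded.split("ASSISTANT:", 1)[-1]
    # strip the end-of-sequence token the model may emit
    return answer.replace("</s>", "").strip()

print(extract_answer(tokenizer.decode(outputs[0])))
```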
Evaluation
- Results on the Chinese-LLaMA-Alpaca dev set, compared with Alpaca-Plus-13B. For simplicity, we sampled only one answer per question without any cherry-picking. We used the evaluation template provided in their repo. Note that GPT-4 has a strong bias toward more detailed answers, so the scores may not be consistent with human evaluation.
Model | Macro-Average | QA | OQA | REASONING | LITERATURE | ENTERTAINMENT | GENERATION | TRANSLATION | CODE | ETHICS |
---|---|---|---|---|---|---|---|---|---|---|
Alpaca-Plus-13B | 77.3 | 70 | 74 | 70 | 80 | 77 | 82 | 89 | 64 | 90 |
ours | 82.4 | 81 | 87 | 88 | 73 | 78 | 85 | 83 | 83 | 84 |
- Results on the newly released C-Eval test set with 5-shot prompting. We slightly modified MOSS's code from the ceval codebase by moving the '答案：' (Answer:) suffix from the end of the question to the beginning of the chatbot's response (see the prompt sketch after the table below).
Average | Avg(Hard) | STEM | Social Science | Humanities | Others |
---|---|---|---|---|---|
37.0 | 29.5 | 34.6 | 44.5 | 35.7 | 35.9 |
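
A minimal sketch of the prompt change described above: '答案：' is placed at the start of the assistant turn instead of being appended to the question. The function name, example fields, omission of the system prompt, and the `</s>` turn separator are all assumptions for illustration, not the actual ceval/MOSS code.

```python
def build_ceval_prompt(few_shot_examples, question, choices):
    """Build a 5-shot C-Eval prompt where '答案：' begins the assistant turn
    rather than ending the question."""
    prompt = ""
    for ex in few_shot_examples:
        q = ex["question"] + "\n" + "\n".join(
            f"{label}. {text}" for label, text in zip("ABCD", ex["choices"]))
        # '答案：' leads the known answer of each in-context example
        prompt += f"USER: {q}\nASSISTANT: 答案：{ex['answer']}</s>"
    q = question + "\n" + "\n".join(
        f"{label}. {text}" for label, text in zip("ABCD", choices))
    # the model continues right after '答案：'
    prompt += f"USER: {q}\nASSISTANT: 答案："
    return prompt
```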