---
license: apache-2.0
base_model:
  - mistralai/Mistral-Nemo-Base-2407
language:
  - en
  - ko
  - ja
  - zh
datasets:
  - 4DR1455/finance_questions
  - Aratako/Synthetic-JP-Conversations-Magpie-Nemotron-4-10k
  - Aratako/Synthetic-JP-EN-Coding-Dataset-Magpie-69k
  - Aratako/Synthetic-Japanese-Roleplay-NSFW-Claude-3.5s-10.5k-formatted
  - BCCard/BCCard-Finance-Kor-QnA
  - CarrotAI/ko-code-alpaca-QA
  - ChuGyouk/AI_healthcare_QA_samples_Sonnet3.5
  - DavidLanz/medical_instruction
  - Dusker/lawyer-llama
  - Gryphe/Sonnet3.5-Charcard-Roleplay
  - HAERAE-HUB/qarv-instruct-ko
  - HachiML/alpaca_jp_math
  - Magpie-Align/Magpie-Llama-3.1-Pro-MT-300K-v0.1
  - Magpie-Align/Magpie-Qwen2-Pro-200K-Chinese
  - beomi/KoAlpaca-v1.1a
  - codefuse-ai/Evol-instruction-66k
  - frankminors123/belle-math-zh
  - gbharti/wealth-alpaca_lora
  - iam-ajaymeena/Self-Instruct-Japanese-Elzya-13B
  - jihye-moon/LawQA-Ko
  - jondurbin/gutenberg-dpo-v0.1
  - junyeong-nero/kin_med_100K_edited
  - kyujinpy/KOR-OpenOrca-Platypus-v3
  - lavita/medical-qa-datasets
  - microsoft/orca-math-word-problems-200k
  - neural-bridge/rag-dataset-12000
  - p1atdev/ichikara-instruction
  - qiaojin/PubMedQA
  - shibing624/roleplay-zh-sharegpt-gpt4-data
  - team-hatakeyama-phase2/AutoMultiTurnByCalm3-22B-Corrected-reformatted
  - ymoslem/Law-StackExchange
  - zzunyang/LawQA_LawSee
---

# Mistral-Nemo-NT-Ko-12B-sft

## Description

Mistral-Nemo-NT-Ko-12B-sft is an instruction-tuned version of mistralai/Mistral-Nemo-Base-2407, fine-tuned across four languages: English, Korean, Chinese, and Japanese.

The primary goals of this model are language alignment, cross-lingual knowledge transfer, and ChatML formatting. This is an intermediate version, since preference optimization has not yet been applied.

## Features

- The base model supports a context length of 128K, while I fine-tuned this model with an 8K context length.

- The model follows the language of the input unless the user explicitly specifies an output language (if the output language is set only in the system role, it may be ignored).

- Response length tends to vary by language: English responses are generally longer than average, while Korean responses tend to be shorter. Behavior for Japanese and Chinese is still under observation.

- Recommended temperature: 0.3 to 0.7 (see the inference sketch below).
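
A minimal inference sketch in Python, assuming the repository id matches this card's title (werty1248/Mistral-Nemo-NT-Ko-12B-sft) and that the tokenizer ships with the ChatML chat template shown further below; `max_new_tokens` and `top_p` are illustrative choices, not values from this card.

```python
# Minimal sketch: ChatML-formatted generation with a temperature inside the
# recommended 0.3-0.7 range. The repo id is assumed from the card title;
# max_new_tokens and top_p are illustrative.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "werty1248/Mistral-Nemo-NT-Ko-12B-sft"  # assumed repo id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

messages = [
    {"role": "system", "content": "You are a helpful AI assistant."},
    {"role": "user", "content": "Introduce yourself in one short paragraph."},
]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

output_ids = model.generate(
    input_ids,
    max_new_tokens=512,
    do_sample=True,
    temperature=0.5,  # within the recommended 0.3-0.7 range
    top_p=0.9,
)
print(tokenizer.decode(output_ids[0][input_ids.shape[-1]:], skip_special_tokens=True))
```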

## Evaluation

### LogicKor

| Model | Method | Reasoning | Math | Writing | Coding | Understanding | Grammar | Single-turn | Multi-turn | Total |
|---|---|---|---|---|---|---|---|---|---|---|
| Mistral-Nemo-NT-Ko-12B-sft | cot-1-shot | 6.57 | 7.36 | 8.57 | 8.71 | 9.57 | 6.43 | 7.81 | 7.93 | 7.87 |
| Mistral-Nemo-NT-Ko-12B-sft | 1-shot | 8.29 | 5.71 | 7.93 | 9.00 | 7.93 | 5.21 | 7.29 | 7.40 | 7.35 |
| Mistral Nemo | 1-shot | 5.00 | 6.50 | 6.86 | 8.07 | 7.64 | 8.43 | 7.60 | 6.57 | 7.08 |
| Mistral-Nemo-NT-Ko-12B-sft | default | 4.93 | 6.00 | 7.14 | 5.43 | 9.71 | 4.00 | 6.45 | 5.95 | 6.20 |
| Mistral Nemo | cot-1-shot | 5.43 | 6.86 | 6.07 | 7.57 | 5.86 | 7.57 | 7.50 | 5.62 | 6.56 |
| Mistral Nemo | default | 0.43 | 7.64 | 6.21 | 7.14 | 6.79 | 7.21 | 6.26 | 5.55 | 5.90 |

### MT-Bench

| Model | First Turn | Second Turn | Average |
|---|---|---|---|
| Mistral-Nemo-NT-Ko-12B-sft | 8.39 | 7.99 | 8.19 |

* Judge model: GPT-4

### Language Confusion (Korean only)

| Model | Monolingual-LPR | Monolingual-WPR | Crosslingual-LPR | Crosslingual-WPR |
|---|---|---|---|---|
| Mistral-Nemo-NT-Ko-12B-sft | 100.00% | 99.00% | 87.51% | 96.96% |
| Mistral-Nemo-Instruct-2407 | 90.72% | 93.18% | 46.75% | 92.84% |
| Meta-Llama-3.1-8B-Instruct | 99.00% | 96.97% | 91.45% | 93.01% |
| gemma-2-9b-it | 100.00% | 98.00% | 87.93% | 95.58% |

Prompt format example (ChatML):

```
<|im_start|>system
You are a helpful AI assistant.<|im_end|>
<|im_start|>user
{prompt}<|im_end|>
<|im_start|>assistant
```

I trained Mistral-Nemo-NT-Ko-12B with various system prompts from dozens of datasets. You can chat with or without a system prompt.
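
As a quick check that the tokenizer reproduces the ChatML layout above, the sketch below renders the template with and without a system message; the repository id is again assumed from the card title.

```python
# Sketch: render the ChatML prompt with and without a system message.
# Assumes the repo id matches this card and that its tokenizer carries the
# ChatML chat template used during fine-tuning.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("werty1248/Mistral-Nemo-NT-Ko-12B-sft")

question = {"role": "user", "content": "Summarize the plot of Hamlet in two sentences."}
with_system = [{"role": "system", "content": "You are a helpful AI assistant."}, question]
without_system = [question]

for messages in (with_system, without_system):
    prompt = tokenizer.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True
    )
    # Expected shape: <|im_start|>{role} ... <|im_end|> blocks, ending with an
    # opening <|im_start|>assistant tag ready for generation.
    print(prompt)
    print("-" * 40)
```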

## Dataset

werty1248/multilingual-instruct-balanced
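
A minimal sketch for pulling the training mix from the Hub; the "train" split name and the ShareGPT-style record layout are assumptions inferred from the `type: sharegpt` line in the axolotl config below.

```python
# Sketch: load the training mix from the Hugging Face Hub.
# The "train" split and the ShareGPT-style "conversations" column are
# assumptions based on `type: sharegpt` in the axolotl config below.
from datasets import load_dataset

ds = load_dataset("werty1248/multilingual-instruct-balanced", split="train")
print(ds)      # column names and row count
print(ds[0])   # one record; expected to be ShareGPT-style, if the assumption holds
```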

## Training Details

- GPU: 8 × A40
- epochs: 3
- total batch size: 8 (see the batch-size sanity check below)
- learning rate: 7e-6
- weight decay: 0.01
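
A small sanity check on the effective batch size, assuming one data-parallel rank per GPU under the DeepSpeed ZeRO-3 setup in the config below:

```python
# Sanity check: effective (total) batch size implied by the axolotl settings.
# Assumes one data-parallel rank per GPU under DeepSpeed ZeRO-3.
num_gpus = 8                     # 8 x A40
micro_batch_size = 1             # micro_batch_size in the config
gradient_accumulation_steps = 1  # gradient_accumulation_steps in the config

total_batch_size = num_gpus * micro_batch_size * gradient_accumulation_steps
print(total_batch_size)          # 8, matching "total batch size: 8"
```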

Built with Axolotl

See axolotl config

axolotl version: 0.4.1

```yaml
base_model: mistralai/Mistral-Nemo-Base-2407
model_type: MistralForCausalLM
tokenizer_config: nothingiisreal/MN-12B-Celeste-V1.9 ## axolotl-ai-co/Mistral-Nemo-Base-2407-chatml throws an error, why?
tokenizer_type: AutoTokenizer

load_in_8bit: false
load_in_4bit: false
strict: false

chat_template: chatml
datasets:
  - path: werty1248/multilingual-instruct-balanced
    type: sharegpt
    chat_template: chatml

dataset_prepared_path: ./data_preparation
output_dir: /workspace/data

hf_use_auth_token: true

sequence_len: 8192
sample_packing: true
pad_to_sequence_len: true

wandb_project:
#wandb_entity:
#wandb_watch:
wandb_name:
#wandb_log_model:

gradient_accumulation_steps: 1 ## total_batch = 8
micro_batch_size: 1
num_epochs: 3
optimizer: paged_adamw_32bit
lr_scheduler: cosine
learning_rate: 0.000007

train_on_inputs: false
group_by_length: false
bf16: auto
fp16: 
tf32: false

gradient_checkpointing: true
early_stopping_patience:
resume_from_checkpoint:
local_rank:
logging_steps: 1
xformers_attention:
flash_attention: true

warmup_steps: 1000
evals_per_epoch: 1
eval_table_size:
save_steps: 1000
debug:
deepspeed: deepspeed_configs/zero3_bf16.json
weight_decay: 0.01
special_tokens:
  pad_token: <pad>
```

- Training loss

  (training-loss curve image in the original model card)