Traditional Chinese LLM Corpus
Traditional Chinese corpus collection for LLM training (pre-training, instruction-tuning, and RLHF/alignment).
Viewer • Updated • 1.78M • 70 • 12
Note: Contains ~2B tokens from a high-quality corpus, cleaned and deduplicated.
liswei/wikipedia-zhtw-dedup
Viewer • Updated • 1.18M • 85 • 1
Note: Deduplicated version of erhwenkuo/wikipedia-zhtw, built with MinHash.
liswei/c4-zhtw
Viewer • Updated • 4.86M • 149 • 1
Note: Deduplicated zh-TW subset of C4 (the Colossal Clean Crawled Corpus, a cleaned version of Common Crawl).
liswei/common-crawl-zhtw
Viewer • Updated • 2.71M • 136 • 3
Note: Deduplicated zh-TW subset of Common Crawl (CC).
zetavg/CC-100-zh-Hant-merged
Viewer • Updated • 12.3M • 181 • 3
Note: zh-TW subset of the CC-100 dataset, which is derived from Common Crawl. Caveat: Common Crawl data harms performance, as shown in TaiwanLlama.
liswei/coct-en-zhtw-dedup
Viewer • Updated • 217k • 48 • 1
Note: Deduplicated version of zetavg/coct-en-zh-tw-translations-twp-300k. zh-TW <-> en paired articles provided by 台灣光華雜誌 (Taiwan Panorama).
liswei/PromptPair-TW
Viewer • Updated • 119k • 41 • 2
Note: Traditional Chinese instruction dataset. Contains en <-> zh-TW pairs with system prompts, to help adapt English pre-trained models.
yentinglin/TaiwanChat
Viewer • Updated • 485k • 180 • 53
Note: Instruction dataset used to train TaiwanLLM v1. Find more details in the paper.
erhwenkuo/alpaca-data-gpt4-chinese-zhtw
Viewer • Updated • 52k • 56 • 6
Note: The alpaca-gpt4 dataset, translated from English to zh-TW.
zetavg/mlqa_en_zh_tw
Viewer • Updated • 3.29k • 42 • 7
Note: zh-CN/en multilingual QA translated to zh-TW/en. Internal experiments show that when transferring from an English base model, training on Q:en->A:zh (or vice versa) improves SFT performance.
zetavg/ShareGPT-Processed
Viewer • Updated • 90.7k • 92 • 29
Note: The RyokoAI/ShareGPT52K dataset, converted to Markdown and labeled with the language used.
benchang1110/PTT_QA
Updated • 13 • 1
lchakkei/OpenOrca-Traditional-Chinese
Viewer • Updated • 4.23M • 1.29k • 8
Note: Instruction data machine-translated from English with Google Translate.
Heng666/Traditional_Chinese-aya_dataset
Viewer • Updated • 4.91k • 153 • 2
Heng666/Traditional_Chinese-aya_evaluation_suite
Viewer • Updated • 650 • 57 • 2
ChenWeiLi/Med_Breexe_zhtw
Viewer • Updated • 1.6k • 35 • 4
Note: Instruction dataset in the medical domain. Prompts are translated, then fed to the Breexe model.
Tarklanse/Traditional_Chinese_roleplay_chat_Dataset
Viewer • Updated • 9.51k • 72 • 36
DataAgent/Pretrain-Taiwan-DentistKnowledge-zhTW-290K
Viewer • Updated • 147 • 47 • 1
KSmart/chinese_traditional_chengyu
Viewer • Updated • 111 • 38 • 3
Note: This is in Simplified Chinese.
liswei/rm-static-zhTW
Viewer • Updated • 81.4k • 39 • 30
Note: Preference dataset with chosen/rejected pairs. Translated using m2m100.
ZoneTwelve/ChineseGrammaticalErrorEvaluation
Viewer • Updated • 132 • 57
ZoneTwelve/micro_sft_instruct
Viewer • Updated • 10 • 51
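
Several entries above are described as deduplicated, and one note names MinHash as the method. The sketch below illustrates the general idea only; it is not the pipeline used to build any of these datasets. The function names, the character 3-gram shingles, the 64 hash functions, and the 0.8 threshold are all assumptions chosen for illustration. Real pipelines usually add locality-sensitive hashing so they do not have to compare every pair of documents.

```python
import hashlib


def shingles(text, k=3):
    """Return the set of character k-grams (Chinese text has no word spaces)."""
    text = "".join(text.split())  # drop whitespace so layout differences are ignored
    if len(text) <= k:
        return {text}
    return {text[i:i + k] for i in range(len(text) - k + 1)}


def minhash_signature(text, num_perm=64, k=3):
    """For each of num_perm salted hash functions, keep the smallest shingle hash."""
    grams = shingles(text, k)
    signature = []
    for seed in range(num_perm):
        salt = seed.to_bytes(8, "little")  # a different salt acts as a different hash function
        lowest = min(
            int.from_bytes(
                hashlib.blake2b(g.encode("utf-8"), digest_size=8, salt=salt).digest(),
                "big",
            )
            for g in grams
        )
        signature.append(lowest)
    return signature


def estimated_jaccard(sig_a, sig_b):
    """The fraction of matching signature slots estimates the Jaccard similarity."""
    matches = sum(a == b for a, b in zip(sig_a, sig_b))
    return matches / len(sig_a)


def deduplicate(docs, threshold=0.8, num_perm=64):
    """Keep a document only if it is not too similar to any document kept so far."""
    kept, kept_sigs = [], []
    for doc in docs:
        sig = minhash_signature(doc, num_perm)
        if all(estimated_jaccard(sig, other) < threshold for other in kept_sigs):
            kept.append(doc)
            kept_sigs.append(sig)
    return kept


docs = [
    "台灣是一個位於東亞的島嶼。",
    "台灣是一個位於東亞的島嶼。",  # exact duplicate of the first document
    "今天天氣很好，適合出門散步。",
]
unique_docs = deduplicate(docs)
```

The pairwise loop in `deduplicate` is quadratic in the number of kept documents, which is fine for a demonstration but not for corpora with millions of rows like those listed above.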