Merge branch 'main' of hf.co:spaces/ThaiLLM-Leaderboard/leaderboard
src/about.py +29 -40
@@ -11,60 +11,49 @@ TITLE = """<h1>🇹🇭 Thai LLM Leaderboard</h1>"""
 # <a href="url"></a>

 INTRODUCTION_TEXT = """
-The Thai
+The Thai LLM Leaderboard 🇹🇭 aims to standardize evaluation methods for large language models (LLMs) in the Thai language, building on <a href="https://github.com/SEACrowd">SEACrowd</a>.
 As part of an open community project, we welcome you to submit new evaluation tasks or models.
 This leaderboard is developed in collaboration with <a href="https://www.scb10x.com">SCB 10X</a>, <a href="https://www.vistec.ac.th/">Vistec</a>, and <a href="https://github.com/SEACrowd">SEACrowd</a>. Read more on <a href="https://blog.opentyphoon.ai/introducing-the-thaillm-leaderboard-thaillm-evaluation-ecosystem-508e789d06bf">Introduction Blog</a>
 """

 LLM_BENCHMARKS_TEXT = f"""
-Evaluations
 The leaderboard currently consists of the following benchmarks:
-- Exam
+- <b>Exam</b>
 - <a href="https://huggingface.co/datasets/scb10x/thai_exam">ThaiExam</a>: ThaiExam is a Thai language benchmark based on examinations for high-school students and investment professionals in Thailand.
-- <a href="https://arxiv.org/abs/2306.05179">M3Exam</a>: M3Exam is a novel benchmark sourced from
-- LLM-as-a-Judge
-- <a href="https://huggingface.co/datasets/ThaiLLM-Leaderboard/mt-bench-thai">Thai MT-Bench</a>: <a href="https://arxiv.org/abs/2306.05685">MT-Bench</a>
-- NLU
-- <a href="https://huggingface.co/datasets/facebook/belebele">Belebele</a>: Belebele is a multiple-choice machine reading comprehension (MRC) dataset spanning 122 language variants
-- <a href="https://huggingface.co/datasets/facebook/xnli">XNLI</a>: XNLI is an evaluation corpus for language transfer and cross-lingual sentence classification in 15 languages.
-- <a href="https://huggingface.co/datasets/cambridgeltl/xcopa">XCOPA</a>: XCOPA is a
-- <a href="https://huggingface.co/datasets/pythainlp/wisesight_sentiment">Wisesight</a>: Wisesight sentiment analysis corpus
-- NLG
-- <a href="https://huggingface.co/datasets/csebuetnlp/xlsum">XLSum</a>: XLSum is a comprehensive and diverse dataset comprising 1.35 million professionally annotated article-summary pairs from BBC.
-- <a href="https://huggingface.co/datasets/SEACrowd/flores200">Flores200</a>: FLORES is a benchmark dataset
-- <a href="https://huggingface.co/datasets/iapp/iapp_wiki_qa_squad">iapp Wiki QA Squad</a>: iapp Wiki QA Squad is an extractive question
-
-
-
+- <a href="https://arxiv.org/abs/2306.05179">M3Exam</a>: M3Exam is a novel benchmark sourced from authentic and official human exam questions for evaluating LLMs in a multilingual, multimodal, and multilevel context. This leaderboard uses the Thai subset of M3Exam.
+- <b>LLM-as-a-Judge</b>
+- <a href="https://huggingface.co/datasets/ThaiLLM-Leaderboard/mt-bench-thai">Thai MT-Bench</a>: A Thai version of <a href="https://arxiv.org/abs/2306.05685">MT-Bench</a> developed specially by VISTEC for probing Thai generative skills using the LLM-as-a-judge method.
+- <b>NLU</b>
+- <a href="https://huggingface.co/datasets/facebook/belebele">Belebele</a>: Belebele is a multiple-choice machine reading comprehension (MRC) dataset spanning 122 language variants, where the Thai subset is used in this leaderboard.
+- <a href="https://huggingface.co/datasets/facebook/xnli">XNLI</a>: XNLI is an evaluation corpus for language transfer and cross-lingual sentence classification in 15 languages. This leaderboard uses the Thai subset of this corpus.
+- <a href="https://huggingface.co/datasets/cambridgeltl/xcopa">XCOPA</a>: XCOPA is a corpus of translated and re-annotated English COPA, covers 11 languages. This is designed to measure the commonsense reasoning ability in non-English languages. This leaderboard uses the Thai subset of this corpus.
+- <a href="https://huggingface.co/datasets/pythainlp/wisesight_sentiment">Wisesight</a>: Wisesight sentiment analysis corpus contains social media messages in the Thai language with sentiment labels.
+- <b>NLG</b>
+- <a href="https://huggingface.co/datasets/csebuetnlp/xlsum">XLSum</a>: XLSum is a comprehensive and diverse dataset comprising 1.35 million professionally annotated article-summary pairs from the BBC. This corpus evaluates the summarization performance in non-English languages, and this leaderboard uses the Thai subset.
+- <a href="https://huggingface.co/datasets/SEACrowd/flores200">Flores200</a>: FLORES is a machine translation benchmark dataset used to evaluate translation quality between English and low-resource languages. This leaderboard uses the Thai subset of Flores200.
+- <a href="https://huggingface.co/datasets/iapp/iapp_wiki_qa_squad">iapp Wiki QA Squad</a>: iapp Wiki QA Squad is an extractive question-answering dataset derived from Thai Wikipedia articles.
+
+
+<b>Metric Implementation Details</b>:
 - Multiple-choice accuracy is calculated using the <a href="https://github.com/SEACrowd/seacrowd-experiments/blob/048536fc0d4614734d479b298ea00a1f520da42b/evaluation/main_nlu_prompt_batch.py#L71">SEACrowd implementation</a> of logits comparison, similar to the method used by the <a href="https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard">Open LLM Leaderboard</a> (<a href="https://github.com/EleutherAI/lm-evaluation-harness">EleutherAI Harness</a>). <a href="https://huggingface.co/blog/open-llm-leaderboard-mmlu">explain</a>
-- BLEU is calculated using flores200 tokenizer using
-- ROUGEL is calculated using
-- LLM-as-a-
+- BLEU is calculated using flores200's tokenizer using HuggingFace `evaluate` <a href="https://huggingface.co/spaces/evaluate-metric/sacrebleu">implementation</a>.
+- ROUGEL is calculated using PyThaiNLP newmm tokenizer and HuggingFace `evaluate` <a href="https://huggingface.co/spaces/evaluate-metric/rouge">implementation</a>.
+- LLM-as-a-judge rating is based on OpenAI's gpt-4o-2024-05-13 using the prompt defined in <a href="https://github.com/lm-sys/FastChat/blob/main/fastchat/llm_judge/data/judge_prompts.jsonl">lmsys MT-Bench</a>.

-Reproducibility
+<b>Reproducibility</b>:

-
+- For the reproducibility of results, we have open-sourced the evaluation pipeline. Please check out the repository <a href="https://github.com/scb-10x/seacrowd-eval">seacrowd-experiments</a>.

-Acknowledgements
+<b>Acknowledgements</b>:

-We
+- We are grateful to previous open-source projects that released datasets, tools, and knowledge. We thank community members for tasks and model submissions. To contribute, please see the submit tab.
 """

 CITATION_BUTTON_LABEL = "Copy the following snippet to cite these results"
 CITATION_BUTTON_TEXT = r"""@misc{thaillm-leaderboard,
-author
-title
-year
-publisher
-
-}
-
-@misc{lovenia2024seacrowdmultilingualmultimodaldata,
-title={SEACrowd: A Multilingual Multimodal Data Hub and Benchmark Suite for Southeast Asian Languages},
-author={Holy Lovenia and Rahmad Mahendra and Salsabil Maulana Akbar and Lester James V. Miranda and Jennifer Santoso and Elyanah Aco and Akhdan Fadhilah and Jonibek Mansurov and Joseph Marvin Imperial and Onno P. Kampman and Joel Ruben Antony Moniz and Muhammad Ravi Shulthan Habibi and Frederikus Hudi and Railey Montalan and Ryan Ignatius and Joanito Agili Lopo and William Nixon and Börje F. Karlsson and James Jaya and Ryandito Diandaru and Yuze Gao and Patrick Amadeus and Bin Wang and Jan Christian Blaise Cruz and Chenxi Whitehouse and Ivan Halim Parmonangan and Maria Khelli and Wenyu Zhang and Lucky Susanto and Reynard Adha Ryanda and Sonny Lazuardi Hermawan and Dan John Velasco and Muhammad Dehan Al Kautsar and Willy Fitra Hendria and Yasmin Moslem and Noah Flynn and Muhammad Farid Adilazuarda and Haochen Li and Johanes Lee and R. Damanhuri and Shuo Sun and Muhammad Reza Qorib and Amirbek Djanibekov and Wei Qi Leong and Quyet V. Do and Niklas Muennighoff and Tanrada Pansuwan and Ilham Firdausi Putra and Yan Xu and Ngee Chia Tai and Ayu Purwarianti and Sebastian Ruder and William Tjhi and Peerat Limkonchotiwat and Alham Fikri Aji and Sedrick Keh and Genta Indra Winata and Ruochen Zhang and Fajri Koto and Zheng-Xin Yong and Samuel Cahyawijaya},
-year={2024},
-eprint={2406.10118},
-archivePrefix={arXiv},
-primaryClass={cs.CL},
-url={https://arxiv.org/abs/2406.10118},
+author={SCB 10X and VISTEC and SEACrowd},
+title={Thai LLM Leaderboard},
+year={2024},
+publisher={Hugging Face},
+url={https://huggingface.co/spaces/ThaiLLM-Leaderboard/leaderboard}
 }"""
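Note on the "logits comparison" wording in the metric details: the general idea is that each answer option receives a model score and the highest-scoring option counts as the prediction. The sketch below illustrates only that final selection and accuracy step. It is not the SEACrowd implementation; the function names and the example scores are hypothetical, and it assumes the per-option scores (for example log-likelihoods) have already been computed by the model.

```python
from typing import Sequence


def pick_choice(option_scores: Sequence[float]) -> int:
    """Return the index of the highest-scoring answer option (first one on ties)."""
    return max(range(len(option_scores)), key=lambda i: option_scores[i])


def multiple_choice_accuracy(
    all_scores: Sequence[Sequence[float]], gold: Sequence[int]
) -> float:
    """Fraction of questions whose top-scoring option matches the gold index."""
    if len(all_scores) != len(gold):
        raise ValueError("scores and gold labels must have the same length")
    if not gold:
        return 0.0
    correct = sum(pick_choice(s) == g for s, g in zip(all_scores, gold))
    return correct / len(gold)


# Hypothetical log-likelihood-style scores for three 4-option questions.
scores = [
    [-2.1, -0.3, -1.7, -3.0],  # highest score at index 1
    [-0.9, -1.2, -0.4, -2.5],  # highest score at index 2
    [-1.0, -0.8, -1.5, -0.9],  # highest score at index 1
]
gold = [1, 2, 3]  # hypothetical gold answer indices
accuracy = multiple_choice_accuracy(scores, gold)
```

Comparing scores this way avoids parsing free-form generations, which is presumably why the text likens it to the Open LLM Leaderboard method; how scores are normalized across options of different length is left out of this sketch.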