Space: ThaiLLM-Leaderboard/leaderboard

kunato committed on Sep 11

Commit af4d55c • 2 Parent(s): cc9a14a b5e73da

Merge branch 'main' of hf.co:spaces/ThaiLLM-Leaderboard/leaderboard

Files changed (1)

src/about.py +29 -40

src/about.py CHANGED

@@ -11,60 +11,49 @@ TITLE = """<h1>🇹🇭 Thai LLM Leaderboard</h1>"""
 # <a href="url"></a>
 INTRODUCTION_TEXT = """
-The Thai-LLM Leaderboard 🇹🇭 focused on standardizing evaluation methods for large language models (LLMs) in the Thai language based on <a href="https://github.com/SEACrowd">SEACrowd</a>,
 As part of an open community project, we welcome you to submit new evaluation tasks or models.
 This leaderboard is developed in collaboration with <a href="https://www.scb10x.com">SCB 10X</a>, <a href="https://www.vistec.ac.th/">Vistec</a>, and <a href="https://github.com/SEACrowd">SEACrowd</a>. Read more on <a href="https://blog.opentyphoon.ai/introducing-the-thaillm-leaderboard-thaillm-evaluation-ecosystem-508e789d06bf">Introduction Blog</a>
 """
 LLM_BENCHMARKS_TEXT = f"""
-Evaluations
 The leaderboard currently consists of the following benchmarks:
-- Exam
   - <a href="https://huggingface.co/datasets/scb10x/thai_exam">ThaiExam</a>: ThaiExam is a Thai language benchmark based on examinations for high-school students and investment professionals in Thailand.
-  - <a href="https://arxiv.org/abs/2306.05179">M3Exam</a>: M3Exam is a novel benchmark sourced from real and official human exam questions for evaluating LLMsin a multilingual, multimodal, and multilevel context. Here is Thai subset of M3Exam.
-- LLM-as-a-Judge
-  - <a href="https://huggingface.co/datasets/ThaiLLM-Leaderboard/mt-bench-thai">Thai MT-Bench</a>: <a href="https://arxiv.org/abs/2306.05685">MT-Bench</a> inspired varient of LLM-as-a-Judge specifically developed by Vistec for Thai language and cultural.
-- NLU
-  - <a href="https://huggingface.co/datasets/facebook/belebele">Belebele</a>: Belebele is a multiple-choice machine reading comprehension (MRC) dataset spanning 122 language variants. Here is Thai subset of Belebele.
-  - <a href="https://huggingface.co/datasets/facebook/xnli">XNLI</a>: XNLI is an evaluation corpus for language transfer and cross-lingual sentence classification in 15 languages. Here is Thai subset of XNLI.
-  - <a href="https://huggingface.co/datasets/cambridgeltl/xcopa">XCOPA</a>: XCOPA is a translation and reannotation of the English COPA to measuring commonsense across languages. Here is Thai subset of XCOPA.
-  - <a href="https://huggingface.co/datasets/pythainlp/wisesight_sentiment">Wisesight</a>: Wisesight sentiment analysis corpus is a social media messages in Thai language with sentiment label.
-- NLG
-  - <a href="https://huggingface.co/datasets/csebuetnlp/xlsum">XLSum</a>: XLSum is a comprehensive and diverse dataset comprising 1.35 million professionally annotated article-summary pairs from BBC. Here is Thai subset of XLSum.
-  - <a href="https://huggingface.co/datasets/SEACrowd/flores200">Flores200</a>: FLORES is a benchmark dataset for machine translation between English and low-resource languages. Here is Thai subset of Flores200.
-  - <a href="https://huggingface.co/datasets/iapp/iapp_wiki_qa_squad">iapp Wiki QA Squad</a>: iapp Wiki QA Squad is an extractive question answering dataset from Thai Wikipedia articles.
-Metrics Implementations
 - Multiple-choice accuracy is calculated using the <a href="https://github.com/SEACrowd/seacrowd-experiments/blob/048536fc0d4614734d479b298ea00a1f520da42b/evaluation/main_nlu_prompt_batch.py#L71">SEACrowd implementation</a> of logits comparison, similar to the method used by the <a href="https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard">Open LLM Leaderboard</a> (<a href="https://github.com/EleutherAI/lm-evaluation-harness">EleutherAI Harness</a>). <a href="https://huggingface.co/blog/open-llm-leaderboard-mmlu">explain</a>
-- BLEU is calculated using flores200 tokenizer using huggingface evaluate <a href="https://huggingface.co/spaces/evaluate-metric/sacrebleu">implementation</a>.
-- ROUGEL is calculated using pythainlp newmm tokenizer using huggingface evaluate <a href="https://huggingface.co/spaces/evaluate-metric/rouge">implementation</a>.
-- LLM-as-a-Judge rating is judged by OpenAI gpt-4o-2024-05-13 using prompt specific by <a href="https://github.com/lm-sys/FastChat/blob/main/fastchat/llm_judge/data/judge_prompts.jsonl">lmsys MT-Bench</a>.
-Reproducibility
-To learn more about the evaluation pipeline and reproduce our results, check out the repository <a href="https://github.com/scb-10x/seacrowd-eval">seacrowd-experiments</a>.
-Acknowledgements
-We're grateful to community members for task and model submitting. To contribute, see submit tab.
 """
 CITATION_BUTTON_LABEL = "Copy the following snippet to cite these results"
 CITATION_BUTTON_TEXT = r"""@misc{thaillm-leaderboard,
-  author = {SCB 10X and Vistec and SEACrowd},
-  title = {Thai LLM Leaderboard},
-  year = {2024},
-  publisher = {Hugging Face},
-  howpublished = "\url{https://huggingface.co/spaces/ThaiLLM-Leaderboard/leaderboard}",
-}
-@misc{lovenia2024seacrowdmultilingualmultimodaldata,
-      title={SEACrowd: A Multilingual Multimodal Data Hub and Benchmark Suite for Southeast Asian Languages},
-      author={Holy Lovenia and Rahmad Mahendra and Salsabil Maulana Akbar and Lester James V. Miranda and Jennifer Santoso and Elyanah Aco and Akhdan Fadhilah and Jonibek Mansurov and Joseph Marvin Imperial and Onno P. Kampman and Joel Ruben Antony Moniz and Muhammad Ravi Shulthan Habibi and Frederikus Hudi and Railey Montalan and Ryan Ignatius and Joanito Agili Lopo and William Nixon and Börje F. Karlsson and James Jaya and Ryandito Diandaru and Yuze Gao and Patrick Amadeus and Bin Wang and Jan Christian Blaise Cruz and Chenxi Whitehouse and Ivan Halim Parmonangan and Maria Khelli and Wenyu Zhang and Lucky Susanto and Reynard Adha Ryanda and Sonny Lazuardi Hermawan and Dan John Velasco and Muhammad Dehan Al Kautsar and Willy Fitra Hendria and Yasmin Moslem and Noah Flynn and Muhammad Farid Adilazuarda and Haochen Li and Johanes Lee and R. Damanhuri and Shuo Sun and Muhammad Reza Qorib and Amirbek Djanibekov and Wei Qi Leong and Quyet V. Do and Niklas Muennighoff and Tanrada Pansuwan and Ilham Firdausi Putra and Yan Xu and Ngee Chia Tai and Ayu Purwarianti and Sebastian Ruder and William Tjhi and Peerat Limkonchotiwat and Alham Fikri Aji and Sedrick Keh and Genta Indra Winata and Ruochen Zhang and Fajri Koto and Zheng-Xin Yong and Samuel Cahyawijaya},
-      year={2024},
-      eprint={2406.10118},
-      archivePrefix={arXiv},
-      primaryClass={cs.CL},
-      url={https://arxiv.org/abs/2406.10118},
 }"""

 # <a href="url"></a>
 INTRODUCTION_TEXT = """
+The Thai LLM Leaderboard 🇹🇭 aims to standardize evaluation methods for large language models (LLMs) in the Thai language, building on <a href="https://github.com/SEACrowd">SEACrowd</a>.
 As part of an open community project, we welcome you to submit new evaluation tasks or models.
 This leaderboard is developed in collaboration with <a href="https://www.scb10x.com">SCB 10X</a>, <a href="https://www.vistec.ac.th/">Vistec</a>, and <a href="https://github.com/SEACrowd">SEACrowd</a>. Read more on <a href="https://blog.opentyphoon.ai/introducing-the-thaillm-leaderboard-thaillm-evaluation-ecosystem-508e789d06bf">Introduction Blog</a>
 """
 LLM_BENCHMARKS_TEXT = f"""
 The leaderboard currently consists of the following benchmarks:
+- <b>Exam</b>
   - <a href="https://huggingface.co/datasets/scb10x/thai_exam">ThaiExam</a>: ThaiExam is a Thai language benchmark based on examinations for high-school students and investment professionals in Thailand.
+  - <a href="https://arxiv.org/abs/2306.05179">M3Exam</a>: M3Exam is a novel benchmark sourced from authentic and official human exam questions for evaluating LLMs in a multilingual, multimodal, and multilevel context. This leaderboard uses the Thai subset of M3Exam.
+- <b>LLM-as-a-Judge</b>
+  - <a href="https://huggingface.co/datasets/ThaiLLM-Leaderboard/mt-bench-thai">Thai MT-Bench</a>: A Thai version of <a href="https://arxiv.org/abs/2306.05685">MT-Bench</a> developed specially by VISTEC for probing Thai generative skills using the LLM-as-a-judge method.
+- <b>NLU</b>
+  - <a href="https://huggingface.co/datasets/facebook/belebele">Belebele</a>: Belebele is a multiple-choice machine reading comprehension (MRC) dataset spanning 122 language variants, where the Thai subset is used in this leaderboard.
+  - <a href="https://huggingface.co/datasets/facebook/xnli">XNLI</a>: XNLI is an evaluation corpus for language transfer and cross-lingual sentence classification in 15 languages. This leaderboard uses the Thai subset of this corpus.
+  - <a href="https://huggingface.co/datasets/cambridgeltl/xcopa">XCOPA</a>: XCOPA is a corpus of translated and re-annotated  English COPA,  covers 11 languages. This is designed to measure the commonsense reasoning ability in non-English languages. This leaderboard uses the Thai subset of this corpus.
+  - <a href="https://huggingface.co/datasets/pythainlp/wisesight_sentiment">Wisesight</a>: Wisesight sentiment analysis corpus contains social media messages in the Thai language with sentiment labels.
+- <b>NLG</b>
+  - <a href="https://huggingface.co/datasets/csebuetnlp/xlsum">XLSum</a>: XLSum is a comprehensive and diverse dataset comprising 1.35 million professionally annotated article-summary pairs from the BBC. This corpus evaluates the summarization performance in non-English languages, and this leaderboard uses the Thai subset.
+  - <a href="https://huggingface.co/datasets/SEACrowd/flores200">Flores200</a>: FLORES is a machine translation benchmark dataset used to evaluate translation quality between English and low-resource languages. This leaderboard uses the Thai subset of Flores200.
+  - <a href="https://huggingface.co/datasets/iapp/iapp_wiki_qa_squad">iapp Wiki QA Squad</a>: iapp Wiki QA Squad is an extractive question-answering dataset derived from Thai Wikipedia articles.
+<b>Metric Implementation Details</b>:
 - Multiple-choice accuracy is calculated using the <a href="https://github.com/SEACrowd/seacrowd-experiments/blob/048536fc0d4614734d479b298ea00a1f520da42b/evaluation/main_nlu_prompt_batch.py#L71">SEACrowd implementation</a> of logits comparison, similar to the method used by the <a href="https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard">Open LLM Leaderboard</a> (<a href="https://github.com/EleutherAI/lm-evaluation-harness">EleutherAI Harness</a>). <a href="https://huggingface.co/blog/open-llm-leaderboard-mmlu">explain</a>
+- BLEU is calculated using flores200's tokenizer using HuggingFace `evaluate` <a href="https://huggingface.co/spaces/evaluate-metric/sacrebleu">implementation</a>.
+- ROUGEL is calculated using PyThaiNLP newmm tokenizer and HuggingFace `evaluate` <a href="https://huggingface.co/spaces/evaluate-metric/rouge">implementation</a>.
+- LLM-as-a-judge rating is based on OpenAI's gpt-4o-2024-05-13 using the prompt defined in <a href="https://github.com/lm-sys/FastChat/blob/main/fastchat/llm_judge/data/judge_prompts.jsonl">lmsys MT-Bench</a>.
+<b>Reproducibility</b>:
+- For the reproducibility of results, we have open-sourced the evaluation pipeline. Please check out the repository <a href="https://github.com/scb-10x/seacrowd-eval">seacrowd-experiments</a>.
+<b>Acknowledgements</b>:
+- We are grateful to previous open-source projects that released datasets, tools, and knowledge. We thank community members for tasks and model submissions. To contribute, please see the submit tab.
 """
 CITATION_BUTTON_LABEL = "Copy the following snippet to cite these results"
 CITATION_BUTTON_TEXT = r"""@misc{thaillm-leaderboard,
+  author={SCB 10X and VISTEC and SEACrowd},
+  title={Thai LLM Leaderboard},
+  year={2024},
+  publisher={Hugging Face},
+  url={https://huggingface.co/spaces/ThaiLLM-Leaderboard/leaderboard}
 }"""
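
The metric notes in `LLM_BENCHMARKS_TEXT` say multiple-choice accuracy is computed by logits comparison. The snippet below is only a minimal sketch of that idea, not the linked SEACrowd implementation: it assumes each answer choice has already been scored by a model (for example, with a log-likelihood), and the function names `pick_choice` and `multiple_choice_accuracy` as well as the example scores are hypothetical.

```python
from typing import Sequence


def pick_choice(choice_scores: Sequence[float]) -> int:
    """Return the index of the highest-scoring answer choice.

    `choice_scores` is assumed to hold one model score per choice
    (e.g. a log-likelihood); higher means more likely.
    """
    if not choice_scores:
        raise ValueError("at least one choice score is required")
    return max(range(len(choice_scores)), key=lambda i: choice_scores[i])


def multiple_choice_accuracy(
    all_scores: Sequence[Sequence[float]],
    gold_indices: Sequence[int],
) -> float:
    """Fraction of questions whose top-scoring choice matches the gold index."""
    if len(all_scores) != len(gold_indices):
        raise ValueError("scores and gold labels must have the same length")
    if not gold_indices:
        return 0.0
    correct = sum(
        pick_choice(scores) == gold
        for scores, gold in zip(all_scores, gold_indices)
    )
    return correct / len(gold_indices)


# Hypothetical scores for three four-choice questions (not real model output).
example_scores = [
    [-1.2, -0.3, -2.5, -4.0],  # highest score at index 1
    [-0.9, -1.1, -0.2, -3.3],  # highest score at index 2
    [-2.0, -0.5, -0.7, -1.9],  # highest score at index 1
]
example_gold = [1, 2, 2]

accuracy = multiple_choice_accuracy(example_scores, example_gold)
print(f"accuracy = {accuracy:.3f}")
```

Comparing scores per choice avoids parsing free-form generated text, which is one reason harness-style evaluations tend to favor it; how the scores are normalized (for instance by answer length) is left out here and would be a separate design decision.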