Spaces: Running on CPU Upgrade
Sean Cho committed · f73765d · Parent: 495b288

Initial Korean version
Files changed:
- app.py (+6 −6)
- src/assets/text_content.py (+41 −43)
app.py
CHANGED
@@ -374,7 +374,7 @@ with demo:
             gr.Markdown(EVALUATION_QUEUE_TEXT, elem_classes="markdown-text")

             with gr.Column():
-                with gr.Accordion(f"✅
+                with gr.Accordion(f"✅ Evaluation complete ({len(finished_eval_queue_df)})", open=False):
                     with gr.Row():
                         finished_eval_table = gr.components.Dataframe(
                             value=finished_eval_queue_df,
@@ -382,7 +382,7 @@ with demo:
                             datatype=EVAL_TYPES,
                             max_rows=5,
                         )
-                with gr.Accordion(f"🔄
+                with gr.Accordion(f"🔄 Running evaluation queue ({len(running_eval_queue_df)})", open=False):
                     with gr.Row():
                         running_eval_table = gr.components.Dataframe(
                             value=running_eval_queue_df,
@@ -391,7 +391,7 @@ with demo:
                             max_rows=5,
                         )

-                with gr.Accordion(f"⏳
+                with gr.Accordion(f"⏳ Pending evaluation queue ({len(pending_eval_queue_df)})", open=False):
                     with gr.Row():
                         pending_eval_table = gr.components.Dataframe(
                             value=pending_eval_queue_df,
@@ -400,7 +400,7 @@ with demo:
                             max_rows=5,
                         )
         with gr.Row():
-            gr.Markdown("# ✉️✨
+            gr.Markdown("# ✉️✨ Submit your model here!", elem_classes="markdown-text")

         with gr.Row():
             with gr.Column():
@@ -443,7 +443,7 @@ with demo:
                         label="Base model (for delta or adapter weights)"
                     )

-            submit_button = gr.Button("
+            submit_button = gr.Button("Submit for evaluation")
             submission_result = gr.Markdown()
             submit_button.click(
                 add_new_eval,
@@ -460,7 +460,7 @@ with demo:
             )

         with gr.Row():
-            refresh_button = gr.Button("
+            refresh_button = gr.Button("Refresh")
            refresh_button.click(
                refresh,
                inputs=[],
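The app.py diff above wires three queue tables (finished, running, pending) into accordions whose titles embed the queue sizes. As a minimal, hypothetical sketch of that bookkeeping (the record layout and function name are assumptions for illustration, not the Space's actual code), the split by status could look like:

```python
# Hypothetical sketch: partition submission records into the three queues
# shown in the accordions. Record layout and names are assumptions.
def split_eval_queue(submissions):
    """Return (finished, running, pending) lists, keyed on each record's status."""
    finished = [s for s in submissions if s["status"] == "FINISHED"]
    running = [s for s in submissions if s["status"] == "RUNNING"]
    pending = [s for s in submissions if s["status"] == "PENDING"]
    return finished, running, pending

queue = [
    {"model": "org/model-a", "status": "FINISHED"},
    {"model": "org/model-b", "status": "RUNNING"},
    {"model": "org/model-c", "status": "PENDING"},
    {"model": "org/model-d", "status": "PENDING"},
]

finished, running, pending = split_eval_queue(queue)
# Accordion titles then embed the counts, e.g. f"({len(pending)})".
print(len(finished), len(running), len(pending))  # 1 1 2
```

Keeping the counts in the accordion titles lets users see queue depth without expanding the (collapsed, `open=False`) tables.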
src/assets/text_content.py
CHANGED
@@ -56,53 +56,54 @@ CHANGELOG_TEXT = f"""
 - Release the leaderboard to public
 """

-TITLE = """<h1 align="center" id="space-title"
+TITLE = """<h1 align="center" id="space-title">🚀 Open Ko-LLM Leaderboard</h1>"""

 INTRODUCTION_TEXT = f"""
-
-
-
-Other cool benchmarks for LLMs are developed at HuggingFace: 📈🤗 [human and GPT4 evals](https://huggingface.co/spaces/HuggingFaceH4/human_eval_llm_leaderboard), 🖥️ [performance benchmarks](https://huggingface.co/spaces/optimum/llm-perf-leaderboard)
-And also in other labs, check out the [AlpacaEval Leaderboard](https://tatsu-lab.github.io/alpaca_eval/) and [MT Bench](https://huggingface.co/spaces/lmsys/chatbot-arena-leaderboard) among other great resources.
+The 🚀 Open Ko-LLM Leaderboard objectively evaluates the performance of Korean large language models.
+
+Models submitted on the "Submit" page are evaluated automatically. The GPUs used for evaluation are operated with the support of KT.
+The evaluation data consists of datasets built to assess five aspects: expert knowledge, reasoning, hallucination, ethics, and common sense.
+More detailed information about the benchmark datasets is provided on the "About" page.
+
+Co-hosted by Upstage and NIA, and operated by Upstage.
 """

 LLM_BENCHMARKS_TEXT = f"""
 # Context
-
+Outstanding LLMs are being released one after another, but most of them are English-centric models attuned to English-speaking culture. By running the Korean leaderboard 🚀 Open Ko-LLM, we aim to evaluate models that reflect the characteristics of the Korean language and Korean culture. We hope that Korean users can conveniently use and take part in the leaderboard, contributing to the advancement of research in Korea.

 ## Icons
 {ModelType.PT.to_str(" : ")} model
 {ModelType.FT.to_str(" : ")} model
 {ModelType.IFT.to_str(" : ")} model
 {ModelType.RL.to_str(" : ")} model
-
+If there is no icon, information about the model is still insufficient.
+Please share information about the model via an issue! 🤩

-
+🏴‍☠️ : this icon means the model has been flagged by the community and should be used with caution. Clicking the icon takes you to the discussion about the model.
+(Models that, for example, trained on the evaluation set to climb the leaderboard are flagged.)

-
+## How it works

-
-
-
-
-
-
+🤗 The benchmark consists of six datasets in total, including Korean translations of the four tasks run on the HuggingFace Open LLM Leaderboard (HellaSwag, MMLU, ARC, TruthfulQA):
+- Ko-HellaSwag (provided by Upstage)
+- Ko-MMLU (provided by Upstage)
+- Ko-ARC (provided by Upstage)
+- Ko-TruthfulQA (provided by Upstage)
+- KoCommongen (provided by NIA, the National Information Society Agency)
+- Text ethics verification data (provided by NIA, the National Information Society Agency)

-
-We chose these benchmarks as they test a variety of reasoning and general knowledge across a wide variety of fields in 0-shot and few-shot settings.
+For an evaluation fit for the LLM era, we selected benchmark datasets suited to assessing five aspects: common sense, expert knowledge, reasoning, hallucination, and ethics. The final score is computed as the average over the six evaluation datasets.
+
+The GPUs used for evaluation are provided by KT.

 ## Details and logs
 You can find:
-
-
-
+- more detailed numerical results at: https://huggingface.co/datasets/open-llm-leaderboard/results
+- details on the models' inputs and outputs at: https://huggingface.co/datasets/open-llm-leaderboard/details
+- the evaluation status of submitted models at: https://huggingface.co/datasets/open-llm-leaderboard/requests

 ## Reproducibility
-
-`python main.py --model=hf-causal --model_args="pretrained=<your_model>,use_accelerate=True,revision=<your_model_revision>"`
-` --tasks=<task_list> --num_fewshot=<n_few_shot> --batch_size=2 --output_path=<output_path>`
+To reproduce the evaluation results, use [this version](https://github.com/EleutherAI/lm-evaluation-harness/tree/b281b0921b636bc36ad05c0b0b0763bd6dd43463) of the harness together with the datasets. (Skip for now, as this already reflects the code and evaluation environment.)

 The total batch size we get for models which fit on one A100 node is 16 (8 GPUs * 2). If you don't use parallelism, adapt your batch size to fit.
 *You can expect results to vary slightly for different batch sizes because of padding.*
@@ -121,37 +122,34 @@ To get more information about quantization, see:
 """

 EVALUATION_QUEUE_TEXT = f"""
-#
-
-Models added here will be automatically evaluated on the 🤗 cluster.
+# This is the evaluation queue of the 🚀 Open-Ko LLM Leaderboard.
+Models added here will soon be automatically evaluated on KT's GPUs!

-##
+## Things to check before submitting a model

-### 1
+### 1️⃣ Can your model and tokenizer be loaded with AutoClasses?
 ```
 from transformers import AutoConfig, AutoModel, AutoTokenizer
 config = AutoConfig.from_pretrained("your model name", revision=revision)
 model = AutoModel.from_pretrained("your model name", revision=revision)
 tokenizer = AutoTokenizer.from_pretrained("your model name", revision=revision)
 ```
-If this step fails, follow the error messages to debug your model before submitting it. It's likely your model has been improperly uploaded.
+If this step fails, follow the error messages to debug your model before submitting it.
+⚠️ Your model must be public!
+⚠️ If your model requires use_remote_code=True, please wait for now; we do not support this option yet, but we are working on it!

-### 2
+### 2️⃣ Have you converted your model weights to safetensors?
+safetensors is a new format for storing weights that is safer and faster to load. It also lets us add your model's parameter count to the Extended Viewer.

-### 3
+### 3️⃣ Does your model have an open license?
+🚀 Open-Ko LLM is a leaderboard for open LLMs, and we hope that as many people as possible can use your model.

-### 4
+### 4️⃣ Have you filled out a model card?
+The model card you wrote will be uploaded together with the extra information about your model on the leaderboard.

-##
-
-Make sure you have followed the above steps first.
-If everything is done, check you can launch the EleutherAIHarness on your model locally, using the above command without modifications (you can add `--limit` to limit the number of examples per task).
+## If your model's status is FAILED:
+This means your model stopped running. First, check that you followed all four steps above. If everything checks out and the run still fails, check that you can launch the EleutherAI Harness on your model locally, running the command above without modifications (you can add `--limit` to limit the number of examples per task).
 """

 CITATION_BUTTON_LABEL = "Copy the following snippet to cite these results"
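The benchmark text above states that the final score is the average over the six evaluation datasets. A minimal sketch of that aggregation (the dictionary key names and the example scores are invented for illustration; they are not the leaderboard's actual identifiers):

```python
# Minimal sketch: final leaderboard score as the mean of the six benchmark
# scores described above. Key names and example scores are invented.
BENCHMARKS = [
    "Ko-HellaSwag", "Ko-MMLU", "Ko-ARC",
    "Ko-TruthfulQA", "KoCommongen", "Ko-EthicsVerification",
]

def final_score(scores):
    """Average the six benchmark scores (each assumed to be on a 0-100 scale)."""
    missing = [name for name in BENCHMARKS if name not in scores]
    if missing:
        raise ValueError(f"missing benchmark scores: {missing}")
    return sum(scores[name] for name in BENCHMARKS) / len(BENCHMARKS)

example = {
    "Ko-HellaSwag": 60.0, "Ko-MMLU": 40.0, "Ko-ARC": 50.0,
    "Ko-TruthfulQA": 45.0, "KoCommongen": 55.0, "Ko-EthicsVerification": 50.0,
}
print(final_score(example))  # 50.0
```

An unweighted mean keeps every aspect (common sense, expert knowledge, reasoning, hallucination, ethics) equally influential on the ranking.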