jon-tow leaderboard-pr-bot committed on
Commit 015f44c
1 Parent(s): 8b471c7

Adding Evaluation Results (#14)


- Adding Evaluation Results (ba1eb10abfbc233fa99d72aeadb9f6dd81359ec2)


Co-authored-by: Open LLM Leaderboard PR Bot <[email protected]>

Files changed (1)
  1. README.md +122 -6
README.md CHANGED
@@ -1,21 +1,124 @@
 ---
+language:
+- en
+license: other
+tags:
+- causal-lm
 datasets:
 - HuggingFaceH4/ultrachat_200k
 - HuggingFaceH4/ultrafeedback_binarized
 - meta-math/MetaMathQA
 - WizardLM/WizardLM_evol_instruct_V2_196k
 - Intel/orca_dpo_pairs
-language:
-- en
-tags:
-- causal-lm
 extra_gated_fields:
   Name: text
   Email: text
   Country: text
   Organization or Affiliation: text
   I ALLOW Stability AI to email me about new model releases: checkbox
-license: other
+model-index:
+- name: stablelm-zephyr-3b
+  results:
+  - task:
+      type: text-generation
+      name: Text Generation
+    dataset:
+      name: AI2 Reasoning Challenge (25-Shot)
+      type: ai2_arc
+      config: ARC-Challenge
+      split: test
+      args:
+        num_few_shot: 25
+    metrics:
+    - type: acc_norm
+      value: 46.08
+      name: normalized accuracy
+    source:
+      url: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=stabilityai/stablelm-zephyr-3b
+      name: Open LLM Leaderboard
+  - task:
+      type: text-generation
+      name: Text Generation
+    dataset:
+      name: HellaSwag (10-Shot)
+      type: hellaswag
+      split: validation
+      args:
+        num_few_shot: 10
+    metrics:
+    - type: acc_norm
+      value: 74.16
+      name: normalized accuracy
+    source:
+      url: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=stabilityai/stablelm-zephyr-3b
+      name: Open LLM Leaderboard
+  - task:
+      type: text-generation
+      name: Text Generation
+    dataset:
+      name: MMLU (5-Shot)
+      type: cais/mmlu
+      config: all
+      split: test
+      args:
+        num_few_shot: 5
+    metrics:
+    - type: acc
+      value: 46.17
+      name: accuracy
+    source:
+      url: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=stabilityai/stablelm-zephyr-3b
+      name: Open LLM Leaderboard
+  - task:
+      type: text-generation
+      name: Text Generation
+    dataset:
+      name: TruthfulQA (0-shot)
+      type: truthful_qa
+      config: multiple_choice
+      split: validation
+      args:
+        num_few_shot: 0
+    metrics:
+    - type: mc2
+      value: 46.49
+    source:
+      url: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=stabilityai/stablelm-zephyr-3b
+      name: Open LLM Leaderboard
+  - task:
+      type: text-generation
+      name: Text Generation
+    dataset:
+      name: Winogrande (5-shot)
+      type: winogrande
+      config: winogrande_xl
+      split: validation
+      args:
+        num_few_shot: 5
+    metrics:
+    - type: acc
+      value: 65.51
+      name: accuracy
+    source:
+      url: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=stabilityai/stablelm-zephyr-3b
+      name: Open LLM Leaderboard
+  - task:
+      type: text-generation
+      name: Text Generation
+    dataset:
+      name: GSM8k (5-shot)
+      type: gsm8k
+      config: main
+      split: test
+      args:
+        num_few_shot: 5
+    metrics:
+    - type: acc
+      value: 42.15
+      name: accuracy
+    source:
+      url: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=stabilityai/stablelm-zephyr-3b
+      name: Open LLM Leaderboard
 ---
 # `StableLM Zephyr 3B`
 
@@ -150,4 +253,17 @@ The model is intended to be used as a foundational base model for application-sp
 
 This model is not trained against adversarial inputs. We strongly recommend pairing this model with an input and output classifier to prevent harmful responses.
 
-Through our internal red teaming, we discovered that while the model will not output harmful information if not prompted to do so, it is willing to output potentially harmful outputs or misinformation when the user requests it. Using this model will require guardrails around your inputs and outputs to ensure that any outputs returned are not misinformation or harmful. Additionally, as each use case is unique, we recommend running your own suite of tests to ensure proper performance of this model. Finally, do not use the models if they are unsuitable for your application, or for any applications that may cause deliberate or unintentional harm to others.
+Through our internal red teaming, we discovered that while the model will not output harmful information if not prompted to do so, it is willing to output potentially harmful outputs or misinformation when the user requests it. Using this model will require guardrails around your inputs and outputs to ensure that any outputs returned are not misinformation or harmful. Additionally, as each use case is unique, we recommend running your own suite of tests to ensure proper performance of this model. Finally, do not use the models if they are unsuitable for your application, or for any applications that may cause deliberate or unintentional harm to others.
+# [Open LLM Leaderboard Evaluation Results](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard)
+Detailed results can be found [here](https://huggingface.co/datasets/open-llm-leaderboard/details_stabilityai__stablelm-zephyr-3b)
+
+| Metric                          |Value|
+|---------------------------------|----:|
+|Avg.                             |53.43|
+|AI2 Reasoning Challenge (25-Shot)|46.08|
+|HellaSwag (10-Shot)              |74.16|
+|MMLU (5-Shot)                    |46.17|
+|TruthfulQA (0-shot)              |46.49|
+|Winogrande (5-shot)              |65.51|
+|GSM8k (5-shot)                   |42.15|
+
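The `model-index` block added in the first hunk follows the Hub's model card metadata format, so the scores travel with the repo and can be read programmatically rather than scraped from the table. A minimal sketch using `huggingface_hub` (the repo id `stabilityai/stablelm-zephyr-3b` is inferred from the leaderboard URLs in the metadata; adjust as needed):

```python
# Minimal sketch: read the model-index metrics straight from the card.
# Assumes huggingface_hub is installed and the repo id below is correct.
from huggingface_hub import ModelCard

card = ModelCard.load("stabilityai/stablelm-zephyr-3b")

# card.data.eval_results is parsed from the model-index metadata block.
for result in card.data.eval_results:
    print(f"{result.dataset_name}: {result.metric_type} = {result.metric_value}")
```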
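The usage note carried over in the second hunk recommends guardrails around inputs and outputs. One common shape for that is a classifier wrapped around generation; the sketch below is purely illustrative, with a hypothetical `safety-classifier` model, label, and threshold standing in for whatever moderation stack fits your deployment (nothing here ships with this card):

```python
# Illustrative guardrail pattern only. The moderation model name, its
# "unsafe" label, and the score threshold are hypothetical placeholders.
from transformers import pipeline

moderator = pipeline("text-classification", model="your-org/safety-classifier")  # hypothetical
chat = pipeline("text-generation", model="stabilityai/stablelm-zephyr-3b",
                trust_remote_code=True)  # may be needed on older transformers versions

def guarded_generate(prompt: str, threshold: float = 0.5) -> str:
    # Screen the input before it reaches the model.
    verdict = moderator(prompt)[0]
    if verdict["label"] == "unsafe" and verdict["score"] >= threshold:
        return "Request declined by input filter."
    reply = chat(prompt, max_new_tokens=256)[0]["generated_text"]
    # Screen the output before it reaches the user.
    verdict = moderator(reply)[0]
    if verdict["label"] == "unsafe" and verdict["score"] >= threshold:
        return "Response withheld by output filter."
    return reply
```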
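As a quick sanity check on the added table, the `Avg.` row is the arithmetic mean of the six benchmark scores:

```python
# Reproduce the Avg. row from the six per-benchmark scores in the table.
scores = [46.08, 74.16, 46.17, 46.49, 65.51, 42.15]
print(f"{sum(scores) / len(scores):.2f}")  # -> 53.43
```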