# Benchmark evaluations for TableGPT

<p align="center">
    <a href="#-About">🔥About</a> •

    <a href="#-Usage">💻Usage</a> •

</p>


## About


This repository is for evaluating TableGPT on a variety of table-related benchmarks.

Given the complexity of table QA tasks and the uncertainty of input instructions, we provide evaluation datasets and scripts covering seven capabilities:

- ✨Code correction based on tables
- ✨Refusal of ambiguous questions
- ✨Table & field recall in multi-table scenarios
- ✨Executability of generated table QA code
- ✨TableBench
- ✨Text2SQL
- ✨TableInstruct, a suite of table-related evaluation benchmarks

In addition, we have integrated general-ability benchmarks such as HumanEval, MBPP, and MMLU/CMMLU.
Inference runs from a local model path with vLLM as the backend, and we define a set of example prompt templates for the benchmarks above.
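
For reference, the core of a vLLM-backed run looks roughly like the sketch below. The model path, sampling values, and prompt string are placeholders, not this repo's actual configuration.

```python
# Minimal sketch of offline inference with vLLM from a local model path.
from vllm import LLM, SamplingParams

llm = LLM(model="/path/to/local/model", tensor_parallel_size=1)  # placeholder path
sampling = SamplingParams(temperature=0.01, max_tokens=1024)     # illustrative values

prompts = ["<prompt rendered from one of the benchmark templates>"]
outputs = llm.generate(prompts, sampling)
print(outputs[0].outputs[0].text)
```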

## Usage


⏬ To use this framework, first clone the repository from GitHub and install its dependencies:

```shell
git clone https://github.com/tablegpt/tablegpt-eval
cd tablegpt-eval
pip install -r requirements.txt
```


> [!TIP]
> 1. You can run all the benchmarks with the default parameters via `bash run_benchmarks.sh`.
> 2. For more control over the run parameters, refer to the corresponding Python script for each benchmark below.
> 3. Download the .db files before running the Text2SQL evaluation scripts. For download URLs, see `table_related_benchmarks/evalset/bird_data/dev_databases/README.md` (Bird dev), `table_related_benchmarks/evalset/spider_data/dev_database/README.md` (Spider dev), and `table_related_benchmarks/evalset/spider_data/test_database/README.md` (Spider test).


### Code correction eval

We provide an eval dataset of non-executable Python code snippets that the model must correct. Eval dataset path:

```python
table_related_benchmarks/evalset/code_correction_test/correction_set.json
```

We use the ***executable_pass_rate*** and ***absolute_match_rate*** of the corrected code at pass@1 to evaluate the model's code-correction ability. You can run the code-correction evaluation with the following command:

```bash
python table_related_benchmarks/run_code_correction_eval.py \
    --model_path <EVAL MODEL PATH> \
    --template <CHAT_TEMPLATE_NAME, one of [llama3, baichuan, chatglm, None], default None> \
    --eval_results_save_path <PATH TO SAVE THE EVAL RESULTS> \
    --gpus_num <NUMBER OF GPUS TO RUN INFERENCE> \
    --temperature <SAMPLING TEMPERATURE>
```
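
For intuition, the two metrics can be sketched as below. The record field names (`corrected_code`, `prediction`, `reference`) are assumptions for illustration, not the repo's actual schema, and "absolute match" is read here as an exact string match.

```python
# Sketch only: executable pass rate and absolute match rate over a result list.
import subprocess
import sys

def is_executable(code: str, timeout_s: int = 10) -> bool:
    """Run the corrected snippet in a fresh interpreter and check it exits cleanly."""
    try:
        proc = subprocess.run([sys.executable, "-c", code],
                              capture_output=True, timeout=timeout_s)
        return proc.returncode == 0
    except subprocess.TimeoutExpired:
        return False

def correction_scores(records: list[dict]) -> tuple[float, float]:
    """Return (executable_pass_rate, absolute_match_rate) across all records."""
    executable = sum(is_executable(r["corrected_code"]) for r in records)  # assumed field
    matched = sum(r["prediction"].strip() == r["reference"].strip() for r in records)
    return executable / len(records), matched / len(records)
```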

### Ambiguous reject eval

We provide 298 table-based queries, with a ratio of about 1:3 between queries marked as ambiguous (to be rejected) and queries that should be accepted and correctly answered. Dataset path:

```python
# test queries
evalset/reject_test/test_query.json

# queries with ground truth
evalset/reject_test/ground_truth.json
```

We use **accuracy**, **recall**, and **F1 score** as metrics to evaluate the LLM's ability in this task. You can run the rejection evaluation with the following command:

```bash
python table_related_benchmarks/run_reject_eval.py \
    --model_path <EVAL MODEL PATH> \
    --save_path <LLM OUTPUT CONTENT SAVE PATH> \
    --gpus_num <NUMBER OF GPUS TO RUN INFERENCE> \
    --temperature <SAMPLING TEMPERATURE>
```
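
The metrics amount to standard binary classification scores. A minimal sketch, treating "should be rejected" as the positive class (the label convention here is an assumption, not the repo's schema):

```python
# Sketch only: accuracy, recall, and F1 for the reject/accept decision.
def reject_metrics(preds: list[bool], golds: list[bool]) -> dict[str, float]:
    tp = sum(p and g for p, g in zip(preds, golds))          # correctly rejected
    fp = sum(p and not g for p, g in zip(preds, golds))      # rejected but answerable
    fn = sum((not p) and g for p, g in zip(preds, golds))    # missed rejection
    accuracy = sum(p == g for p, g in zip(preds, golds)) / len(golds)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "recall": recall, "f1": f1}
```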

### Table & field recall eval

The provided eval dataset path:

```python
table_related_benchmarks/evalset/retrieval_test/recall_set.json
```

We use a series of metrics, including **recall**, **precision**, **Jaccard similarity**, and **Hamming loss**, to assess the LLM's performance on table and field retrieval. You can run the recall evaluation with the following command:

```bash
python table_related_benchmarks/run_recall_eval.py \
    --model_path <EVAL MODEL PATH> \
    --temperature <SAMPLING TEMPERATURE> \
    --gpus_num <NUMBER OF GPUS TO RUN INFERENCE>
```
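
For reference, these are all set-based comparisons between the predicted and gold table/field names. A minimal per-sample sketch (function and argument names are illustrative; the candidate universe is needed for Hamming loss):

```python
# Sketch only: set-based retrieval metrics for one sample.
def retrieval_metrics(pred: set[str], gold: set[str], candidates: set[str]) -> dict[str, float]:
    inter = pred & gold
    union = pred | gold
    return {
        "recall": len(inter) / len(gold) if gold else 1.0,
        "precision": len(inter) / len(pred) if pred else 0.0,
        "jaccard": len(inter) / len(union) if union else 1.0,
        # Hamming loss: fraction of candidate tables/fields whose predicted
        # membership disagrees with the ground truth.
        "hamming_loss": len(pred ^ gold) / len(candidates) if candidates else 0.0,
    }

# Example: gold = {"orders", "customers"}, pred = {"orders", "products"}
```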

### Table QA executable eval

We provide 2,178 table-based queries. Eval dataset path:

```python
table_related_benchmarks/evalset/table_qa_execuate_test/tableqa_samples_with_paths.jsonl
```

We use the ***executable_pass_rate*** at pass@1 to evaluate the model's table QA code generation ability. You can run the table QA evaluation with the following command:

```bash
python table_related_benchmarks/run_tableqa_execution_eval.py \
    --model_path <EVAL MODEL PATH> \
    --temperature <SAMPLING TEMPERATURE> \
    --gpus_num <NUMBER OF GPUS TO RUN INFERENCE>
```
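
Conceptually, each generated snippet is executed against the sample's table and counted as a pass if it runs without error; with one sample per query, pass@1 is simply the error-free fraction. A rough sketch of that idea (the `df` convention and function name are assumptions, not the repo's actual harness):

```python
# Sketch only: run generated pandas code against its table and report success.
import pandas as pd

def runs_without_error(code: str, table_path: str) -> bool:
    # Assumes the generated code refers to the loaded table as `df` (illustrative).
    namespace = {"pd": pd, "df": pd.read_csv(table_path)}
    try:
        exec(code, namespace)
        return True
    except Exception:
        return False
```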

### TableBench evaluation

The provided eval dataset path:

```python
table_related_benchmarks/evalset/TableBench
```

For TableBench, ROUGE-L is used to assess general QA questions, while pass@1 is the metric for visualization-type samples. You can run the TableBench evaluation with the following command:

```bash
python table_related_benchmarks/run_table_bench_eval.py \
    --model_path <EVAL MODEL PATH> \
    --temperature <SAMPLING TEMPERATURE> \
    --gpus_num <NUMBER OF GPUS TO RUN INFERENCE>
```
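
ROUGE-L scores a prediction by the longest common subsequence it shares with the reference. A minimal token-level F-measure sketch (the repo may well use a library implementation, so treat this as illustrative only):

```python
# Sketch only: ROUGE-L F-measure via longest common subsequence over tokens.
def rouge_l(prediction: str, reference: str) -> float:
    p, r = prediction.split(), reference.split()
    # Dynamic-programming table for LCS length
    dp = [[0] * (len(r) + 1) for _ in range(len(p) + 1)]
    for i, pt in enumerate(p, 1):
        for j, rt in enumerate(r, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if pt == rt else max(dp[i - 1][j], dp[i][j - 1])
    lcs = dp[len(p)][len(r)]
    if lcs == 0:
        return 0.0
    prec, rec = lcs / len(p), lcs / len(r)
    return 2 * prec * rec / (prec + rec)
```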

### TableInstruct

The provided eval dataset path:

```python
table_related_benchmarks/evalset/TableInstruct
```

You can run the TableInstruct evaluation with the following command:

```bash
python table_related_benchmarks/run_table_instruct_eval.py \
    --model_path <EVAL MODEL PATH> \
    --temperature <SAMPLING TEMPERATURE> \
    --gpus_num <NUMBER OF GPUS TO RUN INFERENCE>
```

### Text2SQL

You can run the Text2SQL evaluation (Bird dev, Spider dev, and Spider test) with the following command:
```bash
python table_related_benchmarks/run_text2sql_eval.py --model_path <EVAL MODEL PATH>
```
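
Bird and Spider are commonly scored by execution accuracy, i.e. whether the predicted SQL returns the same rows as the gold SQL on the corresponding .db file. A minimal sketch of that idea (not necessarily this repo's exact implementation):

```python
# Sketch only: execution match between predicted and gold SQL on one SQLite database.
import sqlite3

def execution_match(db_path: str, pred_sql: str, gold_sql: str) -> bool:
    conn = sqlite3.connect(db_path)
    try:
        try:
            pred_rows = conn.execute(pred_sql).fetchall()
        except sqlite3.Error:
            return False  # predicted SQL is invalid
        gold_rows = conn.execute(gold_sql).fetchall()
    finally:
        conn.close()
    # Order-insensitive comparison of the returned rows
    return sorted(map(repr, pred_rows)) == sorted(map(repr, gold_rows))
```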

### HumanEval
Run the HumanEval evaluation with the following command:

```bash
python general_benchmarks/HumanEval/eval_instruct_vllm.py --model_path <EVAL MODEL PATH>
```

### MBPP
Run the MBPP evaluation with the following command:

```bash
python general_benchmarks/MBPP/eval_instruct_vllm.py --model_path <EVAL MODEL PATH>
```

### MMLU & CMMLU

Run the MMLU or CMMLU evaluation with the following command:

```bash
python general_benchmarks/MMLU/evaluator.py --task <mmlu or cmmlu> --lang <en or zh> --model_path <EVAL MODEL PATH>
```