sariola committed
Commit 3ca3f58
1 Parent(s): 0cb362c

Update README.md

Files changed (1)
  1. README.md +28 -23
README.md CHANGED
@@ -26,8 +26,6 @@ model_creator: Flow AI
 model_type: phi3.5
 quantized_by: Flow AI
 ---
- ![image/png](https://cdn-uploads.huggingface.co/production/uploads/63368577d184e6b53c50e6d0/6kSJKgPh2pDh4tA-Ky0xW.png)
-
 # Flow-Judge-v0.1-GGUF
 - Original model: [Flow-Judge-v0.1](https://huggingface.co/flowaicom/Flow-Judge-v0.1)
 - Model collection: [Flow-Judge-v0.1 models](https://huggingface.co/collections/flowaicom/flow-judge-v01-66e6af5fc3b3a128bde07dec)
@@ -40,6 +38,7 @@ quantized_by: Flow AI
 
 This repo contains GGUF quants for [Flow-Judge-v0.1](https://huggingface.co/flowaicom/Flow-Judge-v0.1).
 
+
 ## Quantization config
 
 Version used: github:ggerganov/llama.cpp/8e6e2fbe1458ac91387266241262294a964d6b95?narHash=sha256-Z3Rg43p8G9MdxiGvSl9m43KsJ1FvvhQwtzRy/grg9X0%3D
@@ -48,6 +47,7 @@ llama-convert-hf-to-gguf ./flowaicom/Flow-Judge-v0.1 --outfile flow-judge-v0.1-b
 llama-quantize flow-judge-v0.1-bf16.gguf flow-judge-v0.1-Q4_K_M.gguf Q4_K_M
 ```
 
+
 ## Running the GGUF file
 
 ```shell
@@ -55,27 +55,20 @@ llama-server -ngl 33 -t 16 -m Flow-Judge-v0.1-GGUF/flow-judge-v0.1-Q4_K_M.gguf -
 
 ```
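Once the server is running, recent llama.cpp builds of `llama-server` also expose an OpenAI-compatible HTTP API. Below is a minimal sketch of querying it, assuming the default `http://localhost:8080` address and the `/v1/chat/completions` route; the request content is a placeholder, and the actual evaluation prompt template for Flow-Judge-v0.1 is described in the original model card that follows.

```python
# Minimal sketch: query a running llama-server instance through its
# OpenAI-compatible chat completions endpoint. Host, port, and prompt
# are placeholders; adjust them to your own setup and rubric.
import json
import urllib.request

payload = {
    "model": "flow-judge-v0.1-Q4_K_M",  # often ignored; llama-server serves the GGUF it was started with
    "messages": [
        {"role": "user", "content": "Evaluate the following response against the rubric ..."}
    ],
    "temperature": 0.1,
    "max_tokens": 1000,
}

request = urllib.request.Request(
    "http://localhost:8080/v1/chat/completions",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)

with urllib.request.urlopen(request) as response:
    result = json.load(response)

# The evaluation text (including the <feedback> and <score> tags) is in the first choice.
print(result["choices"][0]["message"]["content"])
```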
 
- # Original model card: Flow-Judge-v0.1
 
- ![image/png](https://cdn-uploads.huggingface.co/production/uploads/63368577d184e6b53c50e6d0/NgFJqVmUgrhOnphd47VEm.png)
 
- <div class="center-content">
- <div class="links">
- <a href="https://github.com/flowaicom/flow-judge">flow-judge library</a>
- |
- <a href="https://www.flow-ai.com/blog/flow-judge">Technical report</a>
- </div>
- </div>
+
+ # Original model card: Flow-Judge-v0.1
+
+ <p align="center">
+ <img src="https://cdn-uploads.huggingface.co/production/uploads/63368577d184e6b53c50e6d0/6kSJKgPh2pDh4tA-Ky0xW.png" alt="Centered image">
+ </p>
+ <p align="center">🚀 <a href="https://www.flow-ai.com/judge">Flow Judge</a> | 📄 <a href="https://www.flow-ai.com/blog/flow-judge">Technical report</a> | 💻 <a href="https://github.com/flowaicom/flow-judge">flow-judge</a></p>
 
 ## Model Summary
 
 Flow-Judge-v0.1 is a compact yet powerful 3.8B model that offers customizable LLM system evaluations across various fields. The model inherits its architecture from the Phi-3.5-mini-instruct model, which enables Flow-Judge to deliver high-quality results while maintaining a small footprint. Despite its smaller size, it achieves performance comparable to larger models in both held-out and out-of-domain benchmarks. Flow-Judge-v0.1 supports multiple scoring scales, provides qualitative feedback, and generates structured evaluation outputs. Trained on a smaller synthetic dataset, it represents an efficient approach to AI development. Released under the Apache 2.0 license, Flow Judge is an open and accessible model suitable for developers and companies seeking cost-effective and rapid evaluations using custom rubrics.
 
- __More information__
- - [Flow Judge website](https://www.flow-ai.com/judge)
- - [Technical report](https://www.flow-ai.com/blog/flow-judge)
- - [Github repo](https://github.com/flowaicom/flow-judge)
-
 __Quantized weights__
 - [flowaicom/Flow-Judge-v0.1-AWQ](https://huggingface.co/flowaicom/Flow-Judge-v0.1-AWQ)
 - [flowaicom/Flow-Judge-v0.1-GGUF](https://huggingface.co/flowaicom/Flow-Judge-v0.1-GGUF)
@@ -94,7 +87,7 @@ Flow Judge is intended to be used on custom LLM system evaluation tasks.
 - 5-Likert: Provides an even more nuanced assessment, with scores ranging from strongly negative to strongly positive, enabling users to capture subtle differences in quality or sentiment.
 
 - Easy to interpret results:
-   - Flow Judge produces structured evaluations with <feedback> and <score> tags.
+   - Flow Judge produces structured evaluations with `<feedback>` and `<score>` tags.
   - Qualitative feedback: Flow Judge detects errors, grades outputs, and provides qualitative feedback that explains its reasoning for assigning a particular score from the rubric while highlighting problematic parts of the responses.
   - Score: Based on the grading rubric, Flow Judge returns a numerical score on a binary, 3-Likert, or 5-Likert scale.
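Downstream code can recover both fields from the generated text with a small amount of parsing. The sketch below is only an assumption about the tag format described in the bullets above (the official `flow-judge` library ships its own output handling), and the sample text is illustrative rather than real model output.

```python
# Minimal sketch: pull the <feedback> and <score> blocks out of a
# Flow Judge style evaluation string. The sample below is illustrative.
import re

sample_output = """<feedback>
The response answers the question but omits the requested citation.
</feedback>
<score>
2
</score>"""

def parse_evaluation(text: str) -> tuple[str, int]:
    feedback = re.search(r"<feedback>(.*?)</feedback>", text, re.DOTALL)
    score = re.search(r"<score>\s*(\d+)\s*</score>", text)
    if feedback is None or score is None:
        raise ValueError("missing <feedback> or <score> tags")
    return feedback.group(1).strip(), int(score.group(1))

feedback, score = parse_evaluation(sample_output)
print(score, feedback)
```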
 
@@ -116,12 +109,12 @@ Flow-Judge-v0.1 has been trained on synthetically generated datasets. The constr
 
 This process creates a comprehensive and diverse set of training instances that enable accurate, domain-specific evaluations of LLM systems in generative AI products while minimizing human intervention.
 
- Read more about the dataset construction from [here](https://www.flow-ai.com/blog/flow-judge)
+ Read more about the dataset construction [here](https://www.flow-ai.com/blog/flow-judge#dataset-construction).
 
 
 ### Fine-tuning
 
- For fine-tuning we used Axolotl's preprocessing to ensure input training data is consistent. We then conducted supervised fine-tuning based on microsoft/Phi-3.5-mini-instruct using RSLoRa. More detailed information about the fine-tuning process is provided in our [technical report](https://www.flow-ai.com/blog/flow-judge).
+ For fine-tuning we used Axolotl's preprocessing to ensure the input training data is consistent. We then conducted supervised fine-tuning of microsoft/Phi-3.5-mini-instruct using RSLoRA. More detailed information about the fine-tuning process is provided in our [technical report](https://www.flow-ai.com/blog/flow-judge#fine-tuning).
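RSLoRA refers to rank-stabilized LoRA. The exact Axolotl configuration used for Flow-Judge-v0.1 is not reproduced in this card; purely as an illustration of what a rank-stabilized LoRA adapter configuration looks like, Hugging Face PEFT exposes it through a single flag. The hyperparameter values below are placeholders, not the ones used for this model.

```python
# Illustrative only: a rank-stabilized LoRA (RSLoRA) adapter config in PEFT.
# All values are placeholders and do not reflect the actual Flow-Judge training run.
from peft import LoraConfig

lora_config = LoraConfig(
    r=16,                    # adapter rank (placeholder)
    lora_alpha=32,           # scaling factor (placeholder)
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    use_rslora=True,         # scale adapters by lora_alpha / sqrt(r) instead of lora_alpha / r
    task_type="CAUSAL_LM",
)
print(lora_config)
```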
 
 ## Usage
 
@@ -406,7 +399,7 @@ To run Flow Judge efficiently, ensure your hardware meets the following requirem
 </tbody>
 </table>
 
- \* _not suitable for 3 likert_
+ \* _Reported in model paper_
 
 
 ### RAGTruth
@@ -526,7 +519,7 @@ To run Flow Judge efficiently, ensure your hardware meets the following requirem
 </tr>
 </table>
 
- \* _reported in Galileo luna paper_
+ \* _Reported in model paper_
 
 
 ### HaluEval, Covid-QA, PubMedQA
@@ -707,7 +700,7 @@ To run Flow Judge efficiently, ensure your hardware meets the following requirem
 </tbody>
 </table>
 
- \* _reported in lynx paper_
+ \* _Reported in model paper_
 ### Feedback Bench
 
 <table border="1" cellpadding="10" cellspacing="0" style="border-collapse: collapse; width: auto;">
@@ -758,4 +751,16 @@ To run Flow Judge efficiently, ensure your hardware meets the following requirem
 </tr>
 </table>
 
- \* _reported in prometheus paper using reference answer. Note the rest of the models have been evaluated without reference answer_
+ \* _Reported in model paper using reference answers_
+
+ ## License
+ We opted for the Apache 2.0 license for Flow Judge to provide the community with an open, small yet powerful LM evaluator. Our goal is to support the wider adoption of rigorous evaluation techniques in LLM system development, making them more accessible to practitioners and researchers.
+
+ ## Limitations and future work
+ Multilingual evaluation: Flow Judge has been fine-tuned exclusively on English data. While the foundation model (Phi-3.5-mini-instruct [17]) may possess multilingual capabilities, we have not systematically evaluated Flow Judge's performance in non-English contexts. We plan to explore multilingual LM evaluators in the future.
+
+ Long context and structured inputs: Our training dataset encompasses a wide range of custom metrics relevant to evaluating LLM systems. However, it does not include examples with long context inputs or structured data formats such as JSON, since these are harder to generate synthetically. This limitation may affect Flow Judge's performance when evaluating responses that require processing extensive context or parsing structured input. Extending the model's capabilities to handle these input types is an important area for future research.
+
+ Math and coding: The current version has not been trained on specific task domains such as arithmetic problems or code evaluation. As a result, its performance in these specialized areas may be limited. Future iterations of the model should address these gaps.
+
+ Domain-specific knowledge and complex multi-step evaluations: Flow Judge may struggle with highly specialized domain knowledge or proprietary data outside the training scope of its foundation model. Additionally, evaluation tasks requiring multi-step reasoning or complex logical processes may challenge the model's capabilities. We strongly recommend conducting meta-evaluations of model performance before deploying it in specialized or highly complex evaluation scenarios.