Update README.md

language:
- en
pipeline_tag: text-generation
---

# cerbero-7b Italian LLM 🚀

> 🔥 Attention! The **new** and **more capable** version of **cerbero-7b** is now **available**!

> 📢 **cerbero-7b** is the first **100% Free** and Open Source **Italian Large Language Model** (LLM) ready to be used for **research** or **commercial applications**.

**Try an online demo [here](https://huggingface.co/spaces/galatolo/chat-with-cerbero-7b)** (a quantized demo running on CPU, noticeably less powerful than the original cerbero-7b).

<p align="center">
  <img width="300" height="300" src="./README.md.d/cerbero.png">
</p>

Built on top of [**mistral-7b**](https://mistral.ai/news/announcing-mistral-7b/), which outperforms Llama2 13B across all benchmarks and surpasses Llama1 34B in numerous metrics.

**cerbero-7b** is specifically crafted to fill the void in Italy's AI landscape.

**cerbero-7b** is released under the **permissive** Apache 2.0 **license**, allowing **unrestricted usage**, even **for commercial applications**.

## Model Evaluation Results 📈

The `cerbero-7b` model has been rigorously evaluated across several benchmarks to demonstrate its proficiency in understanding and generating Italian text. Below are the summarized results showcasing its performance:

### SQuAD-it Evaluation

The Stanford Question Answering Dataset in Italian (SQuAD-it) is used to evaluate the model's reading comprehension and question-answering capabilities. The following table presents the F1 score and Exact Match (EM) metrics:

| Model          | F1 Score   | Exact Match (EM) |
|----------------|------------|------------------|
| **cerbero-7b** | **72.55%** | **55.6%**        |
| Fauno          | 44.46%     | 0.00%            |
| Camoscio       | 37.42%     | 0.00%            |
| mistral-7b     | 15.55%     | 8.50%            |
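
For reference, Exact Match scores a prediction 1 only when it equals a gold answer after normalization, while F1 measures token-level overlap between prediction and gold answer. Below is a minimal sketch of these standard SQuAD-style metrics; it is a simplified illustration, not the exact evaluation script behind the table (full SQuAD scoring also strips punctuation and articles):

```python
from collections import Counter

def normalize(text: str) -> list[str]:
    # simplified normalization: lowercase and split on whitespace
    return text.lower().split()

def exact_match(prediction: str, gold: str) -> float:
    return float(normalize(prediction) == normalize(gold))

def f1_score(prediction: str, gold: str) -> float:
    pred, ref = normalize(prediction), normalize(gold)
    common = Counter(pred) & Counter(ref)  # per-token overlap counts
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)
```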

### EVALITA Benchmark Results

EVALITA benchmarks assess the model's performance in tasks like toxicity detection, irony detection, and sentiment analysis. The table below shows the F1 scores for these tasks:

| Model          | Toxicity Detection | Irony Detection | Sentiment Analysis |
|----------------|--------------------|-----------------|--------------------|
| **cerbero-7b** | **63.04%**         | **48.51%**      | **61.80%**         |
| Fauno          | 33.84%             | 39.17%          | 12.23%             |
| Camoscio       | 38.18%             | 39.65%          | 13.33%             |
| mistral-7b     | 34.16%             | 34.16%          | 12.14%             |

## Why Cerbero? 🤔

The name "Cerbero," inspired by the three-headed dog that guards the gates of the Underworld in Greek mythology, encapsulates the essence of our model, drawing strength from three foundational pillars:

- **Base Model: mistral-7b** 🏗️
  cerbero-7b builds upon the formidable **mistral-7b** as its base model. This choice ensures a robust foundation, leveraging the power and capabilities of a cutting-edge language model.

- **Datasets: Cerbero Dataset** 📚
  The Cerbero Dataset is a collection curated specifically to enhance the proficiency of cerbero-7b in understanding and generating Italian text. It was built by combining dynamic self-chat mechanisms with advanced Large Language Model (LLM) technology. Refer to the [paper](README.md) for more details.

- **Licensing: Apache 2.0** 🕊️
  Released under the **permissive Apache 2.0 license**, cerbero-7b promotes openness and collaboration. This licensing choice empowers developers with the freedom for unrestricted usage, fostering a community-driven approach to advancing AI in Italy and beyond.

## Training Details 🚀

**cerbero-7b** is a **fully fine-tuned** LLM, distinguishing itself from LoRA or QLoRA fine-tunes: all of the base model's weights are updated during training. The model is trained on expansive synthetic datasets generated through dynamic self-chat, using a large context window of **8192 tokens**.
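
Dynamic self-chat means having an LLM play both sides of a conversation to synthesize dialogue data. The pipeline details are not published here, so the following is a purely schematic sketch under that assumption (`generate_reply` is a hypothetical function wrapping an LLM call, not part of the model card):

```python
def self_chat(generate_reply, seed_topic: str, n_turns: int = 4) -> str:
    """Schematic illustration of dynamic self-chat dialogue synthesis."""
    transcript = "Questa è una conversazione tra un umano ed un assistente AI.\n"
    transcript += f"Argomento: {seed_topic}\n"
    for i in range(n_turns):
        # alternate roles, feeding the growing transcript back into the LLM
        role = "Umano" if i % 2 == 0 else "Assistente"
        reply = generate_reply(transcript, role)
        transcript += f"[|{role}|] {reply}\n"
    return transcript
```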

### Dataset Composition 📊

> 📢 Details on the **Cerbero Dataset** will be updated shortly!

### Training Setup ⚙️

**cerbero-7b** is trained on an NVIDIA DGX H100:

- **Hardware:** 8x H100 GPUs, each with 80 GB of VRAM. 🖥️
- **Parallelism:** DeepSpeed ZeRO stage 1 parallelism for optimal training efficiency (a configuration sketch follows below). ✨

The model has been trained for **1 epoch**, ensuring convergence of knowledge and proficiency in handling diverse linguistic tasks.
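
A minimal sketch of what a ZeRO stage 1 setup looks like with the 🤗 Trainer; the values are illustrative only, not the authors' actual configuration (ZeRO stage 1 shards only the optimizer states across GPUs):

```python
from transformers import TrainingArguments

# illustrative DeepSpeed config: ZeRO stage 1 shards optimizer states only
ds_config = {
    "train_micro_batch_size_per_gpu": "auto",
    "gradient_accumulation_steps": "auto",
    "bf16": {"enabled": "auto"},        # resolved from TrainingArguments(bf16=True)
    "zero_optimization": {"stage": 1},
}

training_args = TrainingArguments(
    output_dir="out",
    num_train_epochs=1,    # the model card reports training for 1 epoch
    bf16=True,             # H100 GPUs support bf16 natively
    deepspeed=ds_config,   # launch with the `deepspeed` CLI across 8 GPUs
)
```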

## Getting Started 🚀

You can load **cerbero-7b** using [🤗transformers](https://huggingface.co/docs/transformers/index):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("galatolo/cerbero-7b")
tokenizer = AutoTokenizer.from_pretrained("galatolo/cerbero-7b")

prompt = """Questa è una conversazione tra un umano ed un assistente AI.
[|Umano|] Come posso distinguere un AI da un umano?
[|Assistente|]"""

input_ids = tokenizer(prompt, return_tensors='pt').input_ids
with torch.no_grad():
    # max_new_tokens is an illustrative choice; adjust as needed
    output_ids = model.generate(input_ids, max_new_tokens=128)

generated_text = tokenizer.decode(output_ids[0], skip_special_tokens=True)
print(generated_text)
```
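
The prompt follows a simple turn-based format: a conversation preamble followed by `[|Umano|]` and `[|Assistente|]` turn markers. A small hypothetical helper (not part of the model card) for assembling multi-turn prompts in that format:

```python
def build_prompt(turns: list[tuple[str, str]]) -> str:
    # turns: (speaker, message) pairs, where speaker is "Umano" or "Assistente"
    header = "Questa è una conversazione tra un umano ed un assistente AI.\n"
    body = "".join(f"[|{speaker}|] {message}\n" for speaker, message in turns)
    return header + body + "[|Assistente|]"  # trailing marker cues the model to answer

prompt = build_prompt([("Umano", "Come posso distinguere un AI da un umano?")])
```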

### GGUF and llama.cpp

**cerbero-7b** is fully **compatible** with [llama.cpp](https://github.com/ggerganov/llama.cpp).

You can find the **original** and **quantized** versions of **cerbero-7b** in the `gguf` format [here](https://huggingface.co/galatolo/cerbero-7b-gguf/tree/main):

```python
from llama_cpp import Llama
from huggingface_hub import hf_hub_download

llm = Llama(
    model_path=hf_hub_download(
        repo_id="galatolo/cerbero-7b-gguf",
        filename="ggml-model-Q4_K.gguf",
    ),
    n_ctx=4086,
)

# Llama.generate() expects token ids, so use the high-level call for text prompts
output = llm("""Questa è una conversazione tra un umano ed un assistente AI.
[|Umano|] Come posso distinguere un AI da un umano?
[|Assistente|]""", max_tokens=128)  # max_tokens is an illustrative choice
print(output["choices"][0]["text"])
```