---
license: mit
language:
- it
---
|
|
|
# Model Card for Modello Italia 9B

This is our UNOFFICIAL conversion of the OFFICIAL model checkpoint of *"Modello Italia 9B"*, the first Large Language Model (LLM) developed by [iGenius](https://it.igenius.ai/) in collaboration with [CINECA](https://www.cineca.it/).

* More information about Modello Italia: [click here](https://it.igenius.ai/language-models).
|
|
|
## 🚨 Disclaimers

* This is an UNOFFICIAL conversion of the OFFICIAL model checkpoint released by iGenius.
* The original model was developed using LitGPT; therefore, the weights need to be converted before they can be used with Hugging Face transformers.
* **Note:** By using this model, you accept iGenius' [**terms and conditions**](https://secure.igenius.ai/legal/italia_terms_and_conditions.pdf).
|
|
|
## 🚨 Biases and Risks

From the terms and conditions of iGenius for Modello Italia (translated from Italian):
|
> Modello Italia is designed to be used by everyone and to suit a wide range of use cases. It was built with the goal of being accessible to people from diverse backgrounds, experiences, and perspectives. Modello Italia addresses users and their needs without imposing unnecessary judgment or norms, while recognizing that even content that is potentially problematic in certain contexts can serve valid purposes in others. Respect for the dignity and autonomy of all users, especially in terms of freedom of thought and expression, is a fundamental pillar of its design. However, as a new technology, Modello Italia carries risks related to its use. The testing conducted so far has been performed in Italian and could not cover every possible situation. Therefore, as with all LLMs, Modello Italia's outputs cannot be predicted in advance, and the model may in some cases generate inaccurate, biased, or otherwise objectionable responses. Before deploying Modello Italia in any context, developers are strongly encouraged to run safety and adaptation tests tailored to their applications.
|
|
|
We are aware of the biases and the potentially problematic/toxic content that current pretrained large language models exhibit: as probabilistic models of (Italian and English) language, they reflect and amplify the biases of their training data.
|
|
|
For more information about this issue, please refer to our survey paper:

* [Biases in Large Language Models: Origins, Inventory, and Discussion](https://dl.acm.org/doi/full/10.1145/3597307)
|
|
|
## Training dataset

The following information is based on what we could gather and is NOT official.
Please take it with a pinch of salt as we continue to study Modello Italia.

* **The training data of Modello Italia is unknown;**
* Modello Italia was probably trained on around 1T tokens of Italian text;
* We know that the training data is mostly Italian text and source code;
* We know that the training data includes text from Editoria Nazionale.
|
|
|
## Tokenizer

The following information is based on what we could gather and is NOT official.
Please take it with a pinch of salt as we continue to study Modello Italia.

* The tokenizer is **vanilla SentencePiece**, probably trained from scratch on Italian data;
* The vocabulary size is 50,000.
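
You can inspect these properties yourself from the converted checkpoint. Below is a minimal sketch, assuming the `sapienzanlp/modello-italia-9b` repo id used in the usage example at the end of this card; the printed values should match the figures above if our reconstruction is correct:

```python
import transformers as tr

tokenizer = tr.AutoTokenizer.from_pretrained("sapienzanlp/modello-italia-9b")

# The vocabulary size should be 50000 if the figure above is correct.
print(tokenizer.vocab_size)

# SentencePiece tokenizers mark word boundaries with the "▁" prefix.
print(tokenizer.tokenize("Ciao, chi sei?"))
```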
|
|
|
## Model architecture

The following information is based on what we could gather and is NOT official.
Please take it with a pinch of salt as we continue to study Modello Italia.

* The model architecture is **based on GPT-NeoX**.
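
One quick, non-authoritative way to check this is to load the configuration of the converted checkpoint and look at its `model_type` (a sketch assuming the same repo id as above):

```python
import transformers as tr

config = tr.AutoConfig.from_pretrained("sapienzanlp/modello-italia-9b")

# If the conversion targets GPT-NeoX, this should print "gpt_neox".
print(config.model_type)

# Full architectural hyperparameters (layers, heads, hidden size, ...).
print(config)
```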
|
|
|
## Results

For a detailed comparison of model performance, check out the [Leaderboard for Italian Language Models](https://huggingface.co/spaces/FinancialSupport/open_ita_llm_leaderboard).

Here's a breakdown of the performance metrics:
|
|
|
| Model                  | hellaswag_it (acc_norm) | arc_it (acc_norm) | m_mmlu_it (5-shot acc) | Average |
|:-----------------------|:------------------------|:------------------|:-----------------------|:--------|
| **Modello Italia 9B**  | 0.5679                  | 0.3849            | 0.3522                 | 0.4350  |
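
These task names follow the conventions of EleutherAI's lm-evaluation-harness, which the leaderboard above builds on. If you want to reproduce a score yourself, a sketch along the following lines should work; note that task availability and exact naming depend on your harness version, so treat this as an assumption rather than the leaderboard's exact setup:

```python
import lm_eval

# Zero-shot evaluation on hellaswag_it; m_mmlu_it would use num_fewshot=5.
results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=sapienzanlp/modello-italia-9b,dtype=bfloat16",
    tasks=["hellaswag_it"],
    num_fewshot=0,
)
print(results["results"])
```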
|
|
|
## How to use Modello Italia with Hugging Face transformers
|
|
|
```python
import torch
import transformers as tr

# Run on GPU if available, otherwise fall back to CPU.
device = "cuda" if torch.cuda.is_available() else "cpu"

# Load the tokenizer and the model weights in bfloat16.
tokenizer = tr.AutoTokenizer.from_pretrained("sapienzanlp/modello-italia-9b")
model = tr.AutoModelForCausalLM.from_pretrained(
    "sapienzanlp/modello-italia-9b",
    device_map=device,
    torch_dtype=torch.bfloat16,
)

MY_SYSTEM_PROMPT_SHORT = (
    "Tu sei Modello Italia, un modello di linguaggio naturale addestrato da iGenius."
)
prompt = "Ciao, chi sei?"
messages = [
    {"role": "system", "content": MY_SYSTEM_PROMPT_SHORT},
    {"role": "user", "content": prompt},
]

# Format the conversation with the model's chat template and tokenize it.
tokenized_chat = tokenizer.apply_chat_template(
    messages, tokenize=True, add_generation_prompt=True, return_tensors="pt"
).to(device)

# Greedy (deterministic) decoding, up to 200 new tokens.
out = model.generate(
    tokenized_chat,
    max_new_tokens=200,
    do_sample=False,
)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```
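
Note that `out` contains the prompt tokens followed by the generated ones. To print only the assistant's reply, you can slice off the prompt before decoding (a small follow-up using the variables from the snippet above):

```python
response = tokenizer.decode(
    out[0][tokenized_chat.shape[-1]:],
    skip_special_tokens=True,
)
print(response)
```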