mmoreirast
committed on
Commit
•
eb5e5e3
1
Parent(s):
d03fabf
Update README.md
README.md
CHANGED
@@ -1,3 +1,136 @@
|
|
1 |
-
---
|
2 |
-
license: apache-2.0
|
3 |
-
|
1 |
+
---
|
2 |
+
license: apache-2.0
|
3 |
+
datasets:
|
4 |
+
- mmoreirast/medicine-training-pt
|
5 |
+
- mmoreirast/medicine-evaluation-pt
|
6 |
+
language:
|
7 |
+
- pt
|
8 |
+
metrics:
|
9 |
+
- perplexity
|
10 |
+
library_name: transformers
|
11 |
+
tags:
|
12 |
+
- llama-2
|
13 |
+
- pt
|
14 |
+
- medicine
|
15 |
+
---
|
16 |
+
# Doctor Llama Chat
|
17 |
+
|
18 |
+
|
19 |
+
This repository contains a version of [TeenyTinyLlama-460m](https://huggingface.co/nicholasKluge/TeenyTinyLlama-460m) fine-tuned on the [aira-med-training-pt](https://huggingface.co/datasets/mmoreirast/aira-med-training-pt) dataset.
|
20 |
+
|
21 |
+
The main objective of the Doctor Llama model was to study the step-by-step process involved in fine-tuning models in Portuguese, taking into account the challenges encountered in the medical field.
|
22 |
+
|
23 |
+
This model was created as part of the course completion project for **Biomedical Informatics at the Federal University of Paraná**. For more information, access the full text at the following link.
|
24 |
+
|
25 |
+
## Author
|
26 |
+
Mariana Moreira dos Santos ([LinkedIn](https://www.linkedin.com/in/mmoreirast/))
|
27 |
+
|
28 |
+
## Code
|
29 |
+
The code used to fine-tune the model is available in this [Google Colab](https://colab.research.google.com/drive/1SvJvTcH3IRnsEv72UxkVmV0oClCZARtE?usp=sharing) notebook.
|
30 |
+
|
31 |
+
## Fine-tuning details
|
32 |
+
- **Base model:** [TeenyTinyLlama 460m](https://huggingface.co/nicholasKluge/TeenyTinyLlama-460m)
|
33 |
+
- **Context length:** 2048 tokens
|
34 |
+
- **Dataset for fine-tuning:** [medicine-training-pt](https://huggingface.co/datasets/mmoreirast/medicine-training-pt)
|
35 |
+
- **Dataset for evaluation:** [medicine-evaluation-pt](https://huggingface.co/datasets/mmoreirast/medicine-evaluation-pt)
|
36 |
+
- **Language:** Portuguese
|
37 |
+
- **GPU:** NVIDIA A100-SXM4-40GB
|
38 |
+
- **Training time:** ~5 hours
|
39 |
+
|
40 |
+
## Parameters
|
41 |
+
- **Number of Epochs:** 4
|
42 |
+
- **Batch size:** 8
|
43 |
+
- **Optimizer:** torch.optim.AdamW (warmup_steps = 1e3, learning_rate = 1e-5, epsilon = 1e-8)
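The warmup setting above can be illustrated with a minimal sketch. The values `1e-5` and `1000` come from the parameter list (the learning rate and epsilon correspond to the `lr` and `eps` arguments of `torch.optim.AdamW`), but the shape of the schedule is an assumption: the helper `lr_at_step` is hypothetical and models a linear warmup followed by a constant rate, which may differ from the schedule actually used in training.

```python
def lr_at_step(step: int, base_lr: float = 1e-5, warmup_steps: int = 1000) -> float:
    """Learning rate at a given optimizer step.

    Assumption: linear warmup from 0 to base_lr over warmup_steps,
    then a constant rate. The real run may have used a decay afterwards.
    """
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    return base_lr


# Print the learning rate at a few points of the (assumed) schedule.
for step in (0, 250, 500, 1000, 5000):
    print(f"step {step:>5}: lr = {lr_at_step(step):.2e}")
```

During warmup the rate grows proportionally to the step count, which limits large, unstable updates at the start of fine-tuning.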
|
44 |
+
|
45 |
+
## Evaluations
|
46 |
+
|
47 |
+
|
48 |
+
| Model                     | Perplexity      | Evaluation Loss   |
|
49 |
+
|---------------------------|-----------------|-------------------|
|
50 |
+
| TeenyTinyLlama 160m | 22.51 | 3.11 |
|
51 |
+
| **Doctor Llama 160m** | 15.68 | 2.75 |
|
52 |
+
| TeenyTinyLlama 460m | 13.09 | 2.57 |
|
53 |
+
| **Doctor Llama 460m** | 10.94 | 2.39 |
|
54 |
+
| TeenyTinyLlama 460m Chat | 21.22 | 3.05 |
|
55 |
+
| **Doctor Llama Chat** | 11.13 | 2.41 |
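The two metric columns are related. The sketch below assumes that perplexity is the exponential of the mean cross-entropy evaluation loss, which is the usual definition and matches the reported values up to rounding of the loss; the model card itself does not state how perplexity was computed.

```python
import math

# Evaluation losses and perplexities copied from the table above.
reported = {
    "TeenyTinyLlama 160m": (3.11, 22.51),
    "Doctor Llama 160m": (2.75, 15.68),
    "TeenyTinyLlama 460m": (2.57, 13.09),
    "Doctor Llama 460m": (2.39, 10.94),
    "TeenyTinyLlama 460m Chat": (3.05, 21.22),
    "Doctor Llama Chat": (2.41, 11.13),
}


def perplexity(loss: float) -> float:
    """Perplexity as exp(mean cross-entropy loss), assumed definition."""
    return math.exp(loss)


for name, (loss, ppl) in reported.items():
    print(f"{name}: exp({loss:.2f}) = {perplexity(loss):.2f} (reported {ppl:.2f})")
```

The small gaps between the recomputed and reported values come from the loss being rounded to two decimals before exponentiation.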
|
56 |
+
|
57 |
+
|
58 |
+
## Basic usage
|
59 |
+
Using the `pipeline`:
|
60 |
+
|
61 |
+
```python
|
62 |
+
from transformers import pipeline
|
63 |
+
|
64 |
+
generator = pipeline("text-generation", model="mmoreirast/Doctor-Llama-460m")
|
65 |
+
|
66 |
+
# Sampling is enabled explicitly so that two distinct sequences can be returned
completions = generator("Me fale sobre o sistema nervoso", do_sample=True, num_return_sequences=2, max_new_tokens=100)
|
67 |
+
|
68 |
+
for comp in completions:
|
69 |
+
    print(f"🤖 {comp['generated_text']}")
|
70 |
+
```
|
71 |
+
|
72 |
+
Using the `AutoTokenizer` and `AutoModelForCausalLM`:
|
73 |
+
|
74 |
+
```python
|
75 |
+
from transformers import AutoTokenizer, AutoModelForCausalLM
|
76 |
+
import torch
|
77 |
+
|
78 |
+
# Load model and the tokenizer
|
79 |
+
tokenizer = AutoTokenizer.from_pretrained("mmoreirast/Doctor-Llama-460m", revision='main')
|
80 |
+
model = AutoModelForCausalLM.from_pretrained("mmoreirast/Doctor-Llama-460m", revision='main')
|
81 |
+
|
82 |
+
# Pass the model to your device
|
83 |
+
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
|
84 |
+
model.eval()
|
85 |
+
model.to(device)
|
86 |
+
|
87 |
+
# Tokenize the inputs and pass them to the device
|
88 |
+
inputs = tokenizer("Me fale sobre o sistema nervoso", return_tensors="pt").to(device)
|
89 |
+
|
90 |
+
# Generate some text
|
91 |
+
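# do_sample=True is set explicitly: greedy decoding cannot return more than one sequence
completions = model.generate(**inputs, do_sample=True, num_return_sequences=2, max_new_tokens=100)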
|
92 |
+
|
93 |
+
# Print the generated text
|
94 |
+
for completion in completions:
|
95 |
+
    print(f'🤖 {tokenizer.decode(completion)}')
|
96 |
+
```
|
97 |
+
## Intended Uses
|
98 |
+
|
99 |
+
The main objective of the Doctor Llama model was to study the step-by-step process involved in fine-tuning models in Portuguese, taking into account the challenges encountered in the medical field. You may also further fine-tune and adapt Doctor Llama for deployment, as long as your use complies with the Apache 2.0 license. If you decide to use the pre-trained Doctor Llama as the basis for your own fine-tuned model, please conduct your own risk and bias assessment.
|
100 |
+
|
101 |
+
## Out-of-scope Use
|
102 |
+
|
103 |
+
Doctor Llama is not intended for deployment. It is not a product and should not be used for human-facing interactions.
|
104 |
+
|
105 |
+
Doctor Llama models support Brazilian Portuguese only and are not suitable for translation or for generating text in other languages.
|
106 |
+
|
107 |
+
## Limitations
|
108 |
+
|
109 |
+
Like the TeenyTinyLlama base model, Doctor Llama has the following limitations:
|
110 |
+
|
111 |
+
- **Hallucinations:** This model can produce content that can be mistaken for truth but is, in fact, misleading or entirely false, i.e., hallucination.
|
112 |
+
|
113 |
+
- **Biases and Toxicity:** This model inherits the social and historical stereotypes from the data used to train it. Given these biases, the model can produce toxic content, i.e., harmful, offensive, or detrimental to individuals, groups, or communities.
|
114 |
+
|
115 |
+
- **Unreliable Code:** The model may produce incorrect code snippets and statements. These code generations should not be treated as suggestions or accurate solutions.
|
116 |
+
|
117 |
+
- **Language Limitations:** The model is primarily designed to understand standard Brazilian Portuguese. Other languages might challenge its comprehension, leading to potential misinterpretations or errors in response.
|
118 |
+
|
119 |
+
- **Repetition and Verbosity:** The model may get stuck in repetition loops (especially if the repetition penalty during generation is set to a low value) or produce verbose responses unrelated to the prompt it was given.
|
120 |
+
|
121 |
+
Hence, even though our models are released under a permissive license, we urge users to perform their own risk analysis before using these models in real-world applications. In applications where the models interact with an audience, humans should moderate the outputs, and users should always be made aware that they are interacting with a language model.
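The repetition penalty mentioned in the limitations above can be sketched in plain Python. The rule shown is the common CTRL-style penalty, in which the scores of tokens that were already generated are lowered before the next token is chosen; the function name `apply_repetition_penalty` and the penalty value are illustrative assumptions, not settings taken from this model. In `transformers`, the corresponding knob is the `repetition_penalty` argument of `generate`.

```python
def apply_repetition_penalty(
    logits: list[float], generated_ids: list[int], penalty: float = 1.2
) -> list[float]:
    """Lower the scores of tokens that were already generated.

    With penalty > 1, a positive score is divided by the penalty and a
    negative score is multiplied by it, so both become less likely.
    A penalty of 1.0 leaves the scores unchanged.
    """
    penalized = list(logits)
    for token_id in set(generated_ids):
        score = penalized[token_id]
        penalized[token_id] = score / penalty if score > 0 else score * penalty
    return penalized


# Tokens 0 and 1 were already generated; tokens 2 and 3 were not.
logits = [2.0, -1.0, 0.5, 3.0]
print(apply_repetition_penalty(logits, generated_ids=[0, 1], penalty=2.0))
# → [1.0, -2.0, 0.5, 3.0]
```

A value close to 1.0 barely changes the scores, which is why a low penalty makes repetition loops more likely.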
|
122 |
+
|
123 |
+
## Cite as 🤗
|
124 |
+
```bibtex
|
125 |
+
@misc{moreira2024docllama,
|
126 |
+
title = {Um Estudo sobre LLMs em Português para a Área Médica},
|
127 |
+
author = {Mariana Moreira dos Santos and André Ricardo Abed Grégio},
|
128 |
+
url = {},
|
129 |
+
year={2024}
|
130 |
+
}
|
131 |
+
```
|
132 |
+
## Acknowledgements
|
133 |
+
The TeenyTinyLlama base models used here were created by Nicholas Kluge Corrêa and his team. For more information, visit [TeenyTinyLlama](https://huggingface.co/collections/nicholasKluge/teenytinyllama-6582ea8129e72d1ea4d384f1).
|
134 |
+
|
135 |
+
## License
|
136 |
+
Doctor Llama is licensed under the Apache License, Version 2.0. See the [LICENSE](LICENSE) file for more details.
|