license: apache-2.0
datasets:
- nicholasKluge/portuguese-corpus-v3
language:
- pt
metrics:
- perplexity
library_name: transformers
pipeline_tag: text-generation
tags:
- text-generation-inference
widget:
- text: >-
Mamonas Assassinas, também conhecida como Mamonas, foi uma banda
brasileira de rock formada em Guarulhos em 1995, originada da banda
Utopia. Notabilizaram-se por suas composições que conciliavam pop rock com
gêneros musicais populares, tais como
example_title: Exemplo
- text: >-
A adaptação cinematográfica de The Last of the Mohicans (1977) foi
dirigida pelo cineasta americano James L. Conway. A produção foi produzida
pela
example_title: Exemplo
- text: >-
Frente parlamentar pede a Pacheco que devolva ao governo MP que revisa
desoneração.
Criticada por parlamentares e setores, MP foi editada nesta sexta e
reonera, de forma gradual, a folha de pagamento de 17 segmentos. Eventual
devolução invalidaria efeitos da medida.
example_title: Exemplo
- text: >-
Salve salve Pythonista!
Nesse post você vai aprender a função mais utilizada por todo Pythonista:
a função print! Usamos a função print para mostrar dados na tela dos
usuários, através do terminal. A função para imprimir dados em Python é a
função `print()`. Ela é responsável por mostrar valores em seu terminal:
example_title: Exemplo
inference:
parameters:
repetition_penalty: 1.2
temperature: 0.2
top_k: 20
top_p: 0.2
max_new_tokens: 150
co2_eq_emissions:
emissions: 5.6
source: CodeCarbon
training_type: pre-training
geographical_location: Germany
hardware_used: NVIDIA A100-SXM4-40GB
TeenyTinyLlama-160m
Model Summary
Monolingual foundational models are scarce for languages other than English, and some of the most used and downloaded models in the community are those small enough for individual researchers and hobbyists to run in low-resource environments. To address both points, we developed TeenyTinyLlama: a series of small foundational models trained in Brazilian Portuguese.
TeenyTinyLlama is a series of compact language models based on the Llama 2 architecture. These models are designed to deliver efficient natural language processing capabilities while being resource-conscious.
Also, TeenyTinyLlama models were trained by leveraging scaling laws to determine the optimal number of tokens per parameter while incorporating preference pre-training.
Details
- Architecture: a Transformer-based model pre-trained via causal language modeling
- Size: 162,417,408 parameters
- Context length: 2048 tokens
- Dataset: Portuguese-Corpus-v3 (6.2B tokens)
- Language: Portuguese
- Number of steps: 457,969 (3.7B tokens)
- GPU: 1 NVIDIA A100-SXM4-40GB
- Training time: ~ 36 hours
- Emissions: 5.6 kg CO2eq (Germany)
- Total energy consumption: 15.5 kWh
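The step and token counts in the list above can be cross-checked against the training arguments listed further down (1,831,873 training samples, batch size 4, one epoch). The following is a rough sketch, assuming every training sequence fills the whole 2048-token context; it also derives the resulting tokens-per-parameter ratio mentioned in the summary.

```python
import math

# Values taken from this model card.
n_train_samples = 1_831_873   # training samples
train_batch_size = 4          # sequences per optimizer step
grad_accum_steps = 1
context_length = 2048         # tokens per sequence
n_parameters = 162_417_408

# One epoch: every sample is seen once; the last batch may be partial.
steps = math.ceil(n_train_samples / (train_batch_size * grad_accum_steps))

# Upper bound on tokens processed, assuming every sequence is full length.
tokens = steps * train_batch_size * grad_accum_steps * context_length

tokens_per_parameter = tokens / n_parameters

print(f"steps: {steps:,}")
print(f"tokens: {tokens / 1e9:.2f}B")
print(f"tokens per parameter: {tokens_per_parameter:.1f}")
```

Under these assumptions the step count matches the 457,969 reported above, and the token total is consistent with the reported 3.7B.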
This repository has the source code used to train this model. The main libraries used are:
Training Set-up
These are the main arguments used in the training of this model:
Arguments | Value |
---|---|
vocabulary size | 32000 |
hidden dimension size | 768 |
intermediate dimension size | 3072 |
context length | 2048 |
nº attention heads | 12 |
nº hidden layers | 12 |
nº key value heads | 12 |
nº training samples | 1831873 |
nº validation samples | 18000 |
nº epochs | 1 |
evaluation steps | 100000 |
train batch size | 4 |
eval batch size | 4 |
gradient accumulation steps | 1 |
optimizer | torch.optim.AdamW |
learning rate | 0.0006 |
adam epsilon | 0.00000001 |
weight decay | 0.01 |
scheduler type | "cosine" |
warmup steps | 5000 |
gradient checkpointing | false |
seed | 42 |
mixed precision | 'no' |
torch dtype | "float32" |
tf32 | true |
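The reported size of 162,417,408 parameters can be reproduced from the architecture arguments in the table. This is a back-of-the-envelope sketch that assumes a standard Llama-style layout (no bias terms, a gated MLP with three projections, RMSNorm weights, and untied input and output embeddings); these layout details are assumptions and are not stated in the table itself.

```python
# Hyperparameters from the table above.
vocab_size = 32_000
hidden = 768
intermediate = 3_072
n_layers = 12
n_heads = 12
n_kv_heads = 12

head_dim = hidden // n_heads

# Assumed Llama-style layout: no bias terms, untied input/output embeddings.
embedding = vocab_size * hidden
lm_head = vocab_size * hidden

# Attention: query and output projections are hidden x hidden;
# key and value projections shrink with the number of key/value heads.
attention = 2 * hidden * hidden + 2 * hidden * (n_kv_heads * head_dim)

# Gated MLP: gate, up and down projections.
mlp = 3 * hidden * intermediate

# Two RMSNorm weight vectors per layer, plus one final norm.
norms_per_layer = 2 * hidden
final_norm = hidden

total = embedding + lm_head + n_layers * (attention + mlp + norms_per_layer) + final_norm
print(f"{total:,}")  # 162,417,408
```

Roughly 30% of the parameters (49,152,000 of 162,417,408) sit in the two embedding matrices, which is typical for models of this size with a 32,000-token vocabulary.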
Intended Uses
The primary intended use of TeenyTinyLlama is to research the behavior, functionality, and limitations of large language models. Checkpoints saved during training are intended to provide a controlled setting for performing scientific experiments. You may also further fine-tune and adapt TeenyTinyLlama-160m for deployment, as long as your use is in accordance with the Apache 2.0 license. If you decide to use pre-trained TeenyTinyLlama-160m as a basis for your fine-tuned model, please conduct your own risk and bias assessment.
Basic usage
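Using the `pipeline`:
from transformers import pipeline
# Load the model into a text-generation pipeline
generator = pipeline("text-generation", model="nicholasKluge/TeenyTinyLlama-160m")
# Sampling (do_sample=True) is enabled so that the two returned sequences can differ
completions = generator("Astronomia é a ciência", num_return_sequences=2, max_new_tokens=100, do_sample=True)
for comp in completions:
    print(f"🤖 {comp['generated_text']}")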
Using the `AutoTokenizer` and `AutoModelForCausalLM`:
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch
# Load the model and the tokenizer
tokenizer = AutoTokenizer.from_pretrained("nicholasKluge/TeenyTinyLlama-160m", revision='main')
model = AutoModelForCausalLM.from_pretrained("nicholasKluge/TeenyTinyLlama-160m", revision='main')
# Switch to inference mode and move the model to the available device
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.eval()
model.to(device)
# Tokenize the inputs and pass them to the device
inputs = tokenizer("Astronomia é a ciência", return_tensors="pt").to(device)
# Generate two completions; sampling is enabled so that the sequences can differ
completions = model.generate(**inputs, do_sample=True, num_return_sequences=2, max_new_tokens=100)
# Print the generated text
for completion in completions:
    print(f'🤖 {tokenizer.decode(completion)}')
Limitations
Hallucinations: This model can produce content that can be mistaken for truth but is, in fact, misleading or entirely false, i.e., hallucination.
Biases and Toxicity: This model inherits the social and historical stereotypes from the data used to train it. Given these biases, the model can produce toxic content, i.e., harmful, offensive, or detrimental to individuals, groups, or communities.
Unreliable Code: The model may produce incorrect code snippets and statements. These code generations should not be treated as suggestions or accurate solutions.
Language Limitations: The model is primarily designed to understand standard Brazilian Portuguese. Text in other languages may be misinterpreted or lead to errors in the response.
Repetition and Verbosity: The model may get stuck in repetition loops (especially if the repetition penalty used during generation is set too low) or produce verbose responses unrelated to the prompt it was given.
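To make the repetition penalty mentioned above concrete, here is a minimal pure-Python sketch of the commonly used rule (introduced with the CTRL model): logits of tokens that were already generated are divided by the penalty when positive and multiplied by it when negative. The function name and the toy vocabulary are hypothetical, and the default of 1.2 simply mirrors the `repetition_penalty` in this card's inference settings; the actual generation backend may differ in detail.

```python
def apply_repetition_penalty(logits, generated_ids, penalty=1.2):
    """Return a copy of `logits` with already-generated tokens made less likely."""
    adjusted = list(logits)
    for token_id in set(generated_ids):
        score = adjusted[token_id]
        # Dividing a positive logit or multiplying a negative one both lower it.
        adjusted[token_id] = score / penalty if score > 0 else score * penalty
    return adjusted


# Hypothetical 4-token vocabulary; tokens 0 and 2 were already generated.
logits = [2.4, 1.0, -0.6, 0.5]
print(apply_repetition_penalty(logits, generated_ids=[0, 2, 0]))
```

A penalty of 1.0 leaves the logits unchanged, which is why values close to 1.0 do little to prevent loops.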
Evaluations
Steps | Evaluation Loss | Perplexity | Total Energy Consumption | Emissions |
---|---|---|---|---|
100,000 | 3.19 | 24.52 | 3.75 kWh | 1.28 kg CO2eq |
200,000 | 3.02 | 20.58 | 7.51 kWh | 2.56 kg CO2eq |
300,000 | 2.83 | 16.98 | 11.25 kWh | 3.84 kg CO2eq |
400,000 | 2.79 | 16.41 | 14.52 kWh | 5.11 kg CO2eq |
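The two quality columns in this table are related: under the standard definition, perplexity is the exponential of the mean cross-entropy loss. The sketch below assumes that definition applies here and recomputes perplexity from the rounded loss values; small gaps to the reported figures are expected because the losses are shown with only two decimals.

```python
import math

# (steps, evaluation loss, reported perplexity) from the table above.
rows = [
    (100_000, 3.19, 24.52),
    (200_000, 3.02, 20.58),
    (300_000, 2.83, 16.98),
    (400_000, 2.79, 16.41),
]

for steps, loss, reported in rows:
    recomputed = math.exp(loss)  # perplexity = exp(mean cross-entropy loss)
    print(f"{steps:>7,}  loss={loss:.2f}  exp(loss)={recomputed:.2f}  reported={reported:.2f}")
```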
Benchmarks
Models | Average | ARC | Hellaswag | MMLU | TruthfulQA |
---|---|---|---|---|---|
TeenyTinyLlama-160m | 31.16 | 26.15 | 29.29 | 28.11 | 41.12 |
Pythia-160m* | 31.16 | 24.06 | 31.39 | 24.86 | 44.34 |
OPT-125m* | 30.80 | 22.87 | 31.47 | 26.02 | 42.87 |
Gpt2-small-portuguese | 30.22 | 22.48 | 29.62 | 27.36 | 41.44 |
Gpt2-small* | 29.97 | 21.48 | 31.60 | 25.79 | 40.65 |
Xglm-564M* | 31.20 | 24.57 | 34.64 | 25.18 | 40.43 |
Bloom-560m* | 32.13 | 24.74 | 37.15 | 24.22 | 42.44 |
- Evaluations on benchmarks were performed using the Language Model Evaluation Harness (by EleutherAI). Thanks to Laiviet for translating some of the tasks in the LM-Evaluation-Harness. The results of models marked with an "*" were retrieved from the Open LLM Leaderboard.
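The Average column appears to be the unweighted mean of the four task scores. The short check below assumes that reading and recomputes the mean for four of the rows; for these rows the result agrees with the reported value to within 0.01.

```python
# Task scores (ARC, Hellaswag, MMLU, TruthfulQA) and reported averages.
scores = {
    "TeenyTinyLlama-160m": ([26.15, 29.29, 28.11, 41.12], 31.16),
    "Pythia-160m": ([24.06, 31.39, 24.86, 44.34], 31.16),
    "OPT-125m": ([22.87, 31.47, 26.02, 42.87], 30.80),
    "Bloom-560m": ([24.74, 37.15, 24.22, 42.44], 32.13),
}

for model, (tasks, reported) in scores.items():
    mean = sum(tasks) / len(tasks)
    print(f"{model:<22} mean={mean:.4f} reported={reported:.2f}")
```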
Fine-Tuning Comparisons
Models | IMDB | FaQuAD-NLI | HateBr | Assin2 | AgNews |
---|---|---|---|---|---|
TeenyTinyLlama-160m | 91.14 | 90.00 | 90.71 | 85.78 | 94.05 |
Bert-base-portuguese-cased | 92.22 | 93.07 | 91.28 | 87.45 | 94.19 |
Bert-large-portuguese-cased | 93.58 | 92.26 | 91.57 | 88.97 | 94.11 |
Gpt2-small-portuguese | 91.60 | 86.46 | 87.42 | 86.11 | 94.07 |
Cite as 🤗
@misc{nicholas22llama,
doi = {10.5281/zenodo.6989727},
url = {https://huggingface.co/nicholasKluge/TeenyTinyLlama-160m},
author = {Nicholas Kluge Corrêa},
title = {TeenyTinyLlama},
year = {2023},
publisher = {HuggingFace},
journal = {HuggingFace repository},
}
Funding
This repository was built as part of the RAIES (Rede de Inteligência Artificial Ética e Segura) initiative, a project supported by FAPERGS (Fundação de Amparo à Pesquisa do Estado do Rio Grande do Sul), Brazil.
License
TeenyTinyLlama-160m is licensed under the Apache License, Version 2.0. See the LICENSE file for more details.