Model Card for Periquito-3B

Model Details

Model Description

Periquito-3B is a large language model (LLM) trained by Wandgibaut. It is built upon the OpenLlama-3B architecture and specifically fine-tuned using Portuguese Wikipedia (pt-br) data. This specialization makes it particularly adept at understanding and generating text in Brazilian Portuguese.

Developed by: Wandemberg Gibaut
Model type: Llama
Language(s) (NLP): Portuguese
License: Apache License 2.0
Finetuned from model: openlm-research/open_llama_3b

Loading the Weights with Hugging Face Transformers

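import torch
from transformers import LlamaTokenizer, LlamaForCausalLM

model_path = 'wandgibaut/periquito-3B'

# Load the tokenizer and the fp16 weights; device_map='auto' (requires the
# accelerate package) places the model on the available device(s).
tokenizer = LlamaTokenizer.from_pretrained(model_path)
model = LlamaForCausalLM.from_pretrained(
    model_path, torch_dtype=torch.float16, device_map='auto',
)

prompt = 'Q: Qual o maior animal terrestre?\nA:'
# Move the prompt tokens to the same device as the model before generating.
input_ids = tokenizer(prompt, return_tensors='pt').input_ids.to(model.device)

# Generate up to 32 new tokens and print the decoded text.
generation_output = model.generate(
    input_ids=input_ids, max_new_tokens=32
)
print(tokenizer.decode(generation_output[0]))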

For more advanced usage, please follow the transformers LLaMA documentation.
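The example prompt above follows a "Q: ... A:" pattern. As a minimal sketch, a helper like the one below could build zero-shot or few-shot prompts in that pattern. The function name build_prompt and the solved example pair are illustrative assumptions, and this is not necessarily the prompt format used by the evaluation harness.

```python
def build_prompt(question, examples=()):
    """Build a question/answer prompt, optionally preceded by solved examples."""
    # Each solved example becomes a complete "Q: ... A: ..." pair.
    parts = [f"Q: {q}\nA: {a}" for q, a in examples]
    # The final question is left open so the model completes the answer.
    parts.append(f"Q: {question}\nA:")
    return "\n\n".join(parts)

zero_shot = build_prompt("Qual o maior animal terrestre?")
few_shot = build_prompt(
    "Qual o maior animal terrestre?",
    examples=[("Qual a capital do Brasil?", "Brasília")],  # hypothetical example pair
)
print(few_shot)
```

The resulting string can be passed to the tokenizer in place of the hard-coded prompt.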

Evaluating with LM-Eval-Harness

The model can be evaluated with lm-eval-harness. However, we used a custom version that adds some translated tasks and the ENEM suite. It can be found at wandgibaut/lm-evaluation-harness-PTBR.
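As a rough sketch, a run like the ones reported below might be launched with a command assembled as follows. This assumes the fork keeps the main.py entry point and the --model, --model_args, --tasks, and --num_fewshot flags of the upstream harness version it derives from; check the fork's README before relying on it. The helper name build_eval_command is hypothetical.

```python
import shlex

def build_eval_command(tasks, num_fewshot=0, model_id="wandgibaut/periquito-3B"):
    """Return the argv list for one harness run (assumed upstream-style CLI)."""
    return [
        "python", "main.py",
        "--model", "hf-causal",
        "--model_args", f"pretrained={model_id}",
        "--tasks", ",".join(tasks),
        "--num_fewshot", str(num_fewshot),
    ]

# Task names taken from the result tables in this card.
cmd = build_eval_command(["enem", "enem_2022"], num_fewshot=0)
print(shlex.join(cmd))
```

Building the argument list in Python makes it easy to loop over several few-shot settings and pass each list to subprocess.run from inside a checkout of the fork.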

Dataset and Training

We fine-tuned the model on the Wikipedia-pt dataset with LoRA, on Google TPU v3 hardware provided through Google's TPU Research Cloud program.
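To illustrate why LoRA keeps fine-tuning cheap, the sketch below counts the trainable parameters LoRA adds to a single weight matrix. The matrix width (3200) and the rank (8) are assumptions chosen for illustration; the rank and target modules actually used for this model are not stated here.

```python
def lora_param_count(d_in, d_out, rank):
    """Trainable parameters LoRA adds to one d_out x d_in weight matrix."""
    # LoRA learns two low-rank factors: A (rank x d_in) and B (d_out x rank).
    return rank * (d_in + d_out)

d = 3200   # assumed projection width, for illustration only
rank = 8   # hypothetical LoRA rank; the rank actually used is not stated

full = d * d                          # parameters in the frozen full matrix
lora = lora_param_count(d, d, rank)   # parameters in the trainable adapter
print(f"full: {full:,}  lora: {lora:,}  ratio: {lora / full:.2%}")
```

Under these assumed sizes, the adapter trains about half a percent of the parameters of the matrix it modifies.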

Evaluation

We evaluated Periquito-3B with lm-evaluation-harness in zero-shot and few-shot settings. Each table below is preceded by the harness configuration that produced it (model arguments, num_fewshot, and so on).

hf-causal (pretrained=wandgibaut/periquito-3B), limit: None, provide_description: False, num_fewshot: 0, batch_size: None

Task	Version	Metric	Value		Stderr
agnews_pt	0	acc	0.6184	±	0.0056
boolq_pt	1	acc	0.6333	±	0.0084
faquad	1	exact	7.9365
		f1	45.6971
		HasAns_exact	7.9365
		HasAns_f1	45.6971
		NoAns_exact	0.0000
		NoAns_f1	0.0000
		best_exact	7.9365
		best_f1	45.6971
imdb_pt	0	acc	0.6338	±	0.0068
sst2_pt	1	acc	0.6823	±	0.0158
toldbr	0	acc	0.4629	±	0.0109
		f1_macro	0.3164

hf-causal (pretrained=wandgibaut/periquito-3B,dtype=float), limit: None, provide_description: False, num_fewshot: 3, batch_size: None

Task	Version	Metric	Value		Stderr
agnews_pt	0	acc	0.6242	±	0.0056
boolq_pt	1	acc	0.6477	±	0.0084
faquad	1	exact	34.9206
		f1	70.3968
		HasAns_exact	34.9206
		HasAns_f1	70.3968
		NoAns_exact	0.0000
		NoAns_f1	0.0000
		best_exact	34.9206
		best_f1	70.3968
imdb_pt	0	acc	0.8408	±	0.0052
sst2_pt	1	acc	0.7775	±	0.0141
toldbr	0	acc	0.5143	±	0.0109
		f1_macro	0.5127

hf-causal (pretrained=wandgibaut/periquito-3B), limit: None, provide_description: False, num_fewshot: 0, batch_size: None

Task	Version	Metric	Value		Stderr
enem	0	acc	0.1976	±	0.0132
		2009	0.2022	±	0.0428
		2016	0.1809	±	0.0399
		2015	0.1348	±	0.0364
		2016_2_	0.2366	±	0.0443
		2017	0.2022	±	0.0428
		2013	0.1647	±	0.0405
		2012	0.2174	±	0.0432
		2011	0.2292	±	0.0431
		2010	0.2157	±	0.0409
		2014	0.1839	±	0.0418
enem_2022	0	acc	0.2373	±	0.0393
		2022	0.2373	±	0.0393
		human-sciences	0.2703	±	0.0740
		mathematics	0.1818	±	0.0842
		natural-sciences	0.1538	±	0.0722
		languages	0.3030	±	0.0812
enem_CoT	0	acc	0.1812	±	0.0127
		2009	0.1348	±	0.0364
		2016	0.1596	±	0.0380
		2015	0.1124	±	0.0337
		2016_2_	0.1290	±	0.0350
		2017	0.2247	±	0.0445
		2013	0.1765	±	0.0416
		2012	0.2391	±	0.0447
		2011	0.1979	±	0.0409
		2010	0.2451	±	0.0428
		2014	0.1839	±	0.0418
enem_CoT_2022	0	acc	0.2119	±	0.0378
		2022	0.2119	±	0.0378
		human-sciences	0.2703	±	0.0740
		mathematics	0.1818	±	0.0842
		natural-sciences	0.2308	±	0.0843
		languages	0.1515	±	0.0634

hf-causal (pretrained=wandgibaut/periquito-3B,dtype=float), limit: None, provide_description: False, num_fewshot: 1, batch_size: None

Task	Version	Metric	Value		Stderr
enem	0	acc	0.1790	±	0.0127
		2009	0.1573	±	0.0388
		2016	0.2021	±	0.0416
		2015	0.1573	±	0.0388
		2016_2_	0.1935	±	0.0412
		2017	0.2247	±	0.0445
		2013	0.1412	±	0.0380
		2012	0.1739	±	0.0397
		2011	0.1979	±	0.0409
		2010	0.1961	±	0.0395
		2014	0.1379	±	0.0372
enem_2022	0	acc	0.1864	±	0.0360
		2022	0.1864	±	0.0360
		human-sciences	0.2432	±	0.0715
		mathematics	0.1364	±	0.0749
		natural-sciences	0.1154	±	0.0639
		languages	0.2121	±	0.0723
enem_CoT	0	acc	0.2009	±	0.0132
		2009	0.2135	±	0.0437
		2016	0.2340	±	0.0439
		2015	0.1348	±	0.0364
		2016_2_	0.2258	±	0.0436
		2017	0.2360	±	0.0453
		2013	0.1529	±	0.0393
		2012	0.1957	±	0.0416
		2011	0.2500	±	0.0444
		2010	0.1667	±	0.0371
		2014	0.1954	±	0.0428
enem_CoT_2022	0	acc	0.2542	±	0.0403
		2022	0.2542	±	0.0403
		human-sciences	0.2703	±	0.0740
		mathematics	0.2273	±	0.0914
		natural-sciences	0.3846	±	0.0973
		languages	0.1515	±	0.0634
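One way to read the ENEM tables above is to compare the aggregate accuracies with random guessing. The sketch below assumes five answer options per question (so chance is 0.20) and a normal-approximation 95% interval of accuracy ± 1.96 × stderr. Both are assumptions made for this illustration rather than something the harness reports.

```python
CHANCE = 1 / 5  # assumes five answer options per ENEM question
Z_95 = 1.96     # normal-approximation multiplier for a 95% interval

# (accuracy, stderr) copied from the aggregate rows of the ENEM tables above
results = {
    "enem, 0-shot": (0.1976, 0.0132),
    "enem_CoT, 0-shot": (0.1812, 0.0127),
    "enem, 1-shot": (0.1790, 0.0127),
    "enem_CoT, 1-shot": (0.2009, 0.0132),
}

def interval(acc, stderr, z=Z_95):
    """Return the (low, high) bounds of the approximate confidence interval."""
    return acc - z * stderr, acc + z * stderr

for name, (acc, se) in results.items():
    low, high = interval(acc, se)
    verdict = "includes chance" if low <= CHANCE <= high else "excludes chance"
    print(f"{name}: [{low:.4f}, {high:.4f}] {verdict}")
```

Under these assumptions, all four aggregate intervals include 0.20, so the ENEM scores are statistically indistinguishable from guessing.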

Use Cases:

The model is suitable for text generation, language understanding, and various natural language processing tasks in Brazilian Portuguese.

Limitations:

Like many language models, Periquito-3B might exhibit biases present in its training data. Additionally, its performance is primarily optimized for Portuguese, potentially limiting its effectiveness with other languages.

Ethical Considerations:

Users are encouraged to use the model ethically, particularly by avoiding the generation of harmful or biased content.

Acknowledgment

We thank the Google TPU Research Cloud program for providing part of the computation resources.

Citation

If you find Periquito-3B useful in your research or applications, please cite it using the following BibTeX entry:

BibTeX:

@software{wandgibautperiquito3B,
  author = {Gibaut, Wandemberg},
  title = {Periquito-3B},
  month = Sep,
  year = 2023,
  url = {https://huggingface.co/wandgibaut/periquito-3B}
}

Open Portuguese LLM Leaderboard Evaluation Results

Detailed results can be found here

Metric	Value
Average	33.04
ENEM Challenge (No Images)	17.98
BLUEX (No Images)	21.14
OAB Exams	22.69
Assin2 RTE	43.01
Assin2 STS	8.92
FaQuAD NLI	43.97
HateBR Binary	50.46
PT Hate Speech Binary	41.19
tweetSentBR	47.96
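The Average row appears to be the unweighted mean of the nine task scores. The short check below reproduces it from the values in the table, on the assumption that the leaderboard averages tasks with equal weight.

```python
from statistics import mean

# Task scores copied from the table above.
scores = {
    "ENEM Challenge (No Images)": 17.98,
    "BLUEX (No Images)": 21.14,
    "OAB Exams": 22.69,
    "Assin2 RTE": 43.01,
    "Assin2 STS": 8.92,
    "FaQuAD NLI": 43.97,
    "HateBR Binary": 50.46,
    "PT Hate Speech Binary": 41.19,
    "tweetSentBR": 47.96,
}

average = round(mean(scores.values()), 2)
print(average)  # 33.04, matching the Average row
```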
