Algea-VE: A Tiny Multimodal Language Model with Only 0.8B Parameters
Algea-VE is pretrained on the LAION-CC-SBU dataset with algea-550M-base as the base language model and fine-tuned on llava_v1_5_mix665k. It uses CLIP ViT-L/14-336 as the visual encoder. The model is small: fine-tuning requires only 32 GB of VRAM, and inference requires about 3 GB.
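Below is a minimal sketch of the two pretrained components the architecture combines, using the standard Hugging Face transformers API. The CLIP checkpoint name (openai/clip-vit-large-patch14-336) is the usual hub id for that encoder, but the algea-550M-base repo id shown is a placeholder, and the fused Algea-VE weights themselves are loaded through the llavaphi codebase rather than this snippet.

```python
# Sketch only: illustrates the components described above, not the official loading path.
import torch
from PIL import Image
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    CLIPImageProcessor,
    CLIPVisionModel,
)

# Visual encoder: CLIP ViT-L/14 at 336px input resolution.
vision_tower = CLIPVisionModel.from_pretrained("openai/clip-vit-large-patch14-336")
image_processor = CLIPImageProcessor.from_pretrained("openai/clip-vit-large-patch14-336")

# Base language model (placeholder repo id; substitute the actual algea-550M-base path).
tokenizer = AutoTokenizer.from_pretrained("algea-550M-base")
language_model = AutoModelForCausalLM.from_pretrained("algea-550M-base")

# Encode an image into patch features; LLaVA-style models typically take a
# penultimate hidden layer and project it into the language model's embedding space.
image = Image.open("example.jpg")
pixel_values = image_processor(images=image, return_tensors="pt").pixel_values
with torch.no_grad():
    vision_out = vision_tower(pixel_values, output_hidden_states=True)
image_features = vision_out.hidden_states[-2]  # shape: (1, num_patches + 1, hidden_dim)
```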
Due to insufficient training of the base model, the current model has some issues with hallucinations and repetition. To address this, I am training a new model that will maintain the same size but offer better performance.
This model is built on the llavaphi project; please refer to that project for usage instructions.