Vintern-1B: An Efficient Multimodal Large Language Model for Vietnamese
Abstract
In this report, we introduce Vintern-1B, a reliable 1-billion-parameter multimodal large language model (MLLM) for Vietnamese language tasks. By integrating the Qwen2-0.5B-Instruct language model with the InternViT-300M-448px visual model, Vintern-1B is optimized for a range of applications, including optical character recognition (OCR), document extraction, and general question answering in Vietnamese contexts. The model is fine-tuned on an extensive dataset of over 3 million image-question-answer pairs, achieving robust, reliable performance across multiple Vietnamese benchmarks such as OpenViVQA and ViTextVQA. Vintern-1B is small enough to fit easily into on-device applications. Additionally, we have open-sourced several Vietnamese visual question answering (VQA) datasets for text and diagrams, created with Gemini 1.5 Flash. Our models are available at: https://huggingface.co/5CD-AI/Vintern-1B-v2.
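As a quick illustration (not taken from the paper), here is a minimal inference sketch assuming the released checkpoint exposes the InternVL-style remote-code `chat` interface via `transformers`, which is typical for InternViT-based MLLMs. The image path, prompt, and generation settings are placeholders; consult the model card for the exact preprocessing pipeline.

```python
import torch
import torchvision.transforms as T
from PIL import Image
from transformers import AutoModel, AutoTokenizer

MODEL_ID = "5CD-AI/Vintern-1B-v2"

# Load the model and tokenizer. `trust_remote_code=True` is required because
# InternVL-style architectures ship custom modeling code with the checkpoint.
model = AutoModel.from_pretrained(
    MODEL_ID, torch_dtype=torch.bfloat16, trust_remote_code=True
).eval()
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, trust_remote_code=True)

# Single-tile 448x448 preprocessing with ImageNet normalization, matching the
# InternViT-300M-448px vision encoder. "document.jpg" is a placeholder path;
# the official recipe may tile high-resolution images into multiple 448px crops.
transform = T.Compose([
    T.Resize((448, 448)),
    T.ToTensor(),
    T.Normalize(mean=(0.485, 0.456, 0.406), std=(0.229, 0.224, 0.225)),
])
pixel_values = transform(Image.open("document.jpg").convert("RGB"))
pixel_values = pixel_values.unsqueeze(0).to(torch.bfloat16)

# Ask a Vietnamese question; "<image>" marks where the image features are
# spliced into the prompt in the InternVL chat format.
question = "<image>\nMô tả nội dung của hình ảnh."  # "Describe the image content."
response = model.chat(
    tokenizer, pixel_values, question,
    generation_config=dict(max_new_tokens=512, do_sample=False),
)
print(response)
```

For document OCR or extraction tasks, the question string would simply be swapped for an extraction prompt; a single 448px tile is used here only to keep the sketch short.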
Community
This is an automated message from the Librarian Bot. The following similar papers were recommended by the Semantic Scholar API:
- ML-Mamba: Efficient Multi-Modal Large Language Model Utilizing Mamba-2 (2024)
- MammothModa: Multi-Modal Large Language Model (2024)
- OmChat: A Recipe to Train Multimodal Language Models with Strong Long Context and Video Understanding (2024)
- A Comprehensive Review of Multimodal Large Language Models: Performance and Challenges Across Different Tasks (2024)
- CROME: Cross-Modal Adapters for Efficient Multimodal LLM (2024)