CSWin Transformer: A General Vision Transformer Backbone with Cross-Shaped Windows Paper • 2107.00652 • Published Jul 1, 2021 • 2
Glyph-ByT5: A Customized Text Encoder for Accurate Visual Text Rendering Paper • 2403.09622 • Published Mar 14 • 16
mPLUG-Owl: Modularization Empowers Large Language Models with Multimodality Paper • 2304.14178 • Published Apr 27, 2023 • 2
TextCraftor: Your Text Encoder Can be Image Quality Controller Paper • 2403.18978 • Published Mar 27 • 13
LLaVA-Gemma: Accelerating Multimodal Foundation Models with a Compact Language Model Paper • 2404.01331 • Published Mar 29 • 25
TrOCR: Transformer-based Optical Character Recognition with Pre-trained Models Paper • 2109.10282 • Published Sep 21, 2021 • 6
Text Role Classification in Scientific Charts Using Multimodal Transformers Paper • 2402.14579 • Published Feb 8 • 1