OSV: One Step is Enough for High-Quality Image to Video Generation Paper • 2409.11367 • Published Sep 17 • 11
Playground v3: Improving Text-to-Image Alignment with Deep-Fusion Large Language Models Paper • 2409.10695 • Published Sep 16 • 1
PiTe: Pixel-Temporal Alignment for Large Video-Language Model Paper • 2409.07239 • Published Sep 11 • 11
General OCR Theory: Towards OCR-2.0 via a Unified End-to-end Model Paper • 2409.01704 • Published Sep 3 • 72
CogVLM2: Visual Language Models for Image and Video Understanding Paper • 2408.16500 • Published Aug 29 • 55
Building and better understanding vision-language models: insights and future directions Paper • 2408.12637 • Published Aug 22 • 109
xGen-VideoSyn-1: High-fidelity Text-to-Video Synthesis with Compressed Representations Paper • 2408.12590 • Published Aug 22 • 33
xLAM models Collection xLAM: A Family of Large Action Models to Empower AI Agent Systems • 9 items • Updated Sep 9 • 40
XGen-MM-1 models and datasets Collection A collection of all XGen-MM (Foundation LMM) models! • 13 items • Updated Sep 11 • 33
xGen-MM (BLIP-3): A Family of Open Large Multimodal Models Paper • 2408.08872 • Published Aug 16 • 96
JPEG-LM: LLMs as Image Generators with Canonical Codec Representations Paper • 2408.08459 • Published Aug 15 • 44
I-SHEEP: Self-Alignment of LLM from Scratch through an Iterative Self-Enhancement Paradigm Paper • 2408.08072 • Published Aug 15 • 31
A failed experiment: Infini-Attention, and why we should keep trying? Article • Published Aug 14 • 41
Minitron Collection A family of compressed models obtained via pruning and knowledge distillation • 7 items • Updated Sep 17 • 54
mPLUG-Owl3: Towards Long Image-Sequence Understanding in Multi-Modal Large Language Models Paper • 2408.04840 • Published Aug 9 • 31
Wolf: Captioning Everything with a World Summarization Framework Paper • 2407.18908 • Published Jul 26 • 30
SlowFast-LLaVA: A Strong Training-Free Baseline for Video Large Language Models Paper • 2407.15841 • Published Jul 22 • 38
EVLM: An Efficient Vision-Language Model for Visual Understanding Paper • 2407.14177 • Published Jul 19 • 42
DenseFusion-1M: Merging Vision Experts for Comprehensive Multimodal Perception Paper • 2407.08303 • Published Jul 11 • 17
Diffusion Forcing: Next-token Prediction Meets Full-Sequence Diffusion Paper • 2407.01392 • Published Jul 1 • 39
AUTOHALLUSION: Automatic Generation of Hallucination Benchmarks for Vision-Language Models Paper • 2406.10900 • Published Jun 16 • 11
OMG-LLaVA: Bridging Image-level, Object-level, Pixel-level Reasoning and Understanding Paper • 2406.19389 • Published Jun 27 • 52
Instruction Pre-Training: Language Models are Supervised Multitask Learners Paper • 2406.14491 • Published Jun 20 • 85
Cambrian-1: A Fully Open, Vision-Centric Exploration of Multimodal LLMs Paper • 2406.16860 • Published Jun 24 • 55
VoCo-LLaMA: Towards Vision Compression with Large Language Models Paper • 2406.12275 • Published Jun 18 • 29
An Image is Worth More Than 16x16 Patches: Exploring Transformers on Individual Pixels Paper • 2406.09415 • Published Jun 13 • 50
VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs Paper • 2406.07476 • Published Jun 11 • 32
ShareGPT4Video: Improving Video Understanding and Generation with Better Captions Paper • 2406.04325 • Published Jun 6 • 71
ConvLLaVA: Hierarchical Backbones as Visual Encoder for Large Multimodal Models Paper • 2405.15738 • Published May 24 • 43
Visual Fact Checker: Enabling High-Fidelity Detailed Caption Generation Paper • 2404.19752 • Published Apr 30 • 22
Better & Faster Large Language Models via Multi-token Prediction Paper • 2404.19737 • Published Apr 30 • 73
PLLaVA: Parameter-free LLaVA Extension from Images to Videos for Video Dense Captioning Paper • 2404.16994 • Published Apr 25 • 35
LayerSkip: Enabling Early Exit Inference and Self-Speculative Decoding Paper • 2404.16710 • Published Apr 25 • 57
OpenELM: An Efficient Language Model Family with Open-source Training and Inference Framework Paper • 2404.14619 • Published Apr 22 • 124
SEED-X: Multimodal Models with Unified Multi-granularity Comprehension and Generation Paper • 2404.14396 • Published Apr 22 • 18
Phi-3 Technical Report: A Highly Capable Language Model Locally on Your Phone Paper • 2404.14219 • Published Apr 22 • 250
Megalodon: Efficient LLM Pretraining and Inference with Unlimited Context Length Paper • 2404.08801 • Published Apr 12 • 62
Ferret-v2: An Improved Baseline for Referring and Grounding with Large Language Models Paper • 2404.07973 • Published Apr 11 • 30
Leave No Context Behind: Efficient Infinite Context Transformers with Infini-attention Paper • 2404.07143 • Published Apr 10 • 103
RULER: What's the Real Context Size of Your Long-Context Language Models? Paper • 2404.06654 • Published Apr 9 • 33
BRAVE: Broadening the visual encoding of vision-language models Paper • 2404.07204 • Published Apr 10 • 15
No "Zero-Shot" Without Exponential Data: Pretraining Concept Frequency Determines Multimodal Model Performance Paper • 2404.04125 • Published Apr 4 • 27
MiniGPT4-Video: Advancing Multimodal LLMs for Video Understanding with Interleaved Visual-Textual Tokens Paper • 2404.03413 • Published Apr 4 • 25
Visual Autoregressive Modeling: Scalable Image Generation via Next-Scale Prediction Paper • 2404.02905 • Published Apr 3 • 63
Mixture-of-Depths: Dynamically allocating compute in transformer-based language models Paper • 2404.02258 • Published Apr 2 • 103
Bigger is not Always Better: Scaling Properties of Latent Diffusion Models Paper • 2404.01367 • Published Apr 1 • 19