File size: 6,221 Bytes
80c6598 0297db8 3a858ea db7ffe4 80c6598 1d378fb 80c6598 d917af7 80c6598 6696a1b 80c6598 f983391 a85e0a4 f983391 63057ca f983391 36aa603 90cd162 80c6598 fc9ff17 80c6598 aed8933 817d582 703e627 11311ed e74da3d 3081d81 7923606 3081d81 e74da3d aed8933 817d582 3081d81 817d582 aed8933 817d582 aed8933 a1ac49a 817d582 e74da3d 587e267 dc6a88d 587e267 aed8933 dc6a88d 817d582 34ec5c6 817d582 34ec5c6 817d582 34ec5c6 817d582 34ec5c6 02714ae 817d582 34ec5c6 02714ae 817d582 34ec5c6 817d582 c37926b 14fb892 c26facf 587e267 c26facf 587e267 c26facf 587e267 c26facf 14fb892 1de78e8 a454b7c 587e267 a22168a db7ffe4 7c03122 a22168a fe21202 |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 |
---
language:
- en
tags:
- information retrieval
- embedding model
- visual information retrieval
metrics:
- recall
pipeline_tag: feature-extraction
license: apache-2.0
---
# Memex: OCR-free Visual Document Embedding Model as Your Personal Librarian
The model only takes images as document-side inputs and produce vectors representing document pages. `minicpm-visual-embedding-v0` is trained with over 200k query-visual document pairs, including textual document, visual document, arxiv figures, plots, charts, industry documents, textbooks, ebooks, and openly-available PDFs, etc. The performance of `minicpm-visual-embedding-v0` is on a par with our ablation text embedding model on text-oriented documents, and an advantages on visually-intensive documents.
Our model is capable of:
- Help you read a long visually-intensive or text-oriented PDF document and find the pages that answer your question.
- Help you build a personal library and retireve book pages from a large collection of books.
- It works like human: read and comprehend with **vision** and remember **multimodal** information in hippocampus.
![Memex Archtechture](images/memex.png)
# News
- 2024-07-14: π€ We released **online huggingface demo**! Try our [online demo](https://huggingface.co/spaces/bokesyo/minicpm-visual-embeeding-v0-demo)!
- 2024-07-14: π We released a **locally deployable Gradio demo** of `miniCPM-visual-embedding-v0`, take a look at [pipeline_gradio.py](https://huggingface.co/RhapsodyAI/minicpm-visual-embedding-v0/blob/main/pipeline_gradio.py). You can run `pipeline_gradio.py` to build a demo on your PC.
- 2024-07-13: π» We released a **locally deployable command-line based demo** of `miniCPM-visual-embedding-v0` for users to retireve most relavant pages from a given PDF file (could be very long), take a look at [pipeline.py](https://huggingface.co/RhapsodyAI/minicpm-visual-embedding-v0/blob/main/pipeline.py).
- 2024-06-27: π We released our first visual embedding model checkpoint minicpm-visual-embedding-v0 on [huggingface](https://huggingface.co/RhapsodyAI/minicpm-visual-embedding-v0).
- 2024-05-08: π We [open-sourced](https://github.com/RhapsodyAILab/minicpm-visual-embedding-v0) our training code (full-parameter tuning with GradCache and DeepSpeed, supports large batch size across multiple GPUs with zero-stage1) and eval code.
# Deploy on your PC
**Please make sure you have at least 32GB memory on your PC.**
- Apple M1/M2/M3 with 32GB memory.
- x86 CPU with 32GB memory.
- x86 CPU with 32GB memory + Nvidia GPU with 16GB memory.
### Install dependencies
Use pip to install all dependencies:
```
Pillow==10.1.0
timm==0.9.10
torch==2.1.2
torchvision==0.16.2
transformers==4.36.0
sentencepiece==0.1.99
numpy==1.26.0
```
### Download model weights and modeling file
Use one of the following methods:
- Download with git clone.
```bash
git lfs install
git clone https://huggingface.co/RhapsodyAI/minicpm-visual-embedding-v0
```
- Download with huggingface-hub.
```bash
pip install huggingface-hub
huggingface-cli download --resume-download RhapsodyAI/minicpm-visual-embedding-v0 --local-dir minicpm-visual-embedding-v0 --local-dir-use-symlinks False
```
### Launch demo
Install `gradio` first.
```bash
pip install gradio
```
Adapt the code in `pipeline_gradio.py` according to your device.
- For M1/M2/M3 users, please make sure `model = model.to(device='mps', dtype=torch.float16)` then run `PYTORCH_ENABLE_MPS_FALLBACK=1 python pipeline_gradio.py`.
- For x86 CPU users, please remove `model = model.to(device)` then run `python pipeline_gradio.py`.
- For x86 CPU + Nvidia GPU users, please make sure `model = model.to('cuda')` then run `python pipeline_gradio.py`.
- If you encountered an error, please open an issue [here](https://huggingface.co/RhapsodyAI/minicpm-visual-embedding-v0/discussions), we will respond soon.
# For research purpose
To run the model for research purpose, please refer the following code:
```python
from transformers import AutoModel
from transformers import AutoTokenizer
from PIL import Image
import torch
device = 'cuda:0'
# Load model, be sure to substitute `model_path` by your model path
model_path = '/local/path/to/model'
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
model = AutoModel.from_pretrained(model_path, trust_remote_code=True)
model.to(device)
# Load image to PIL.Image object
image_1 = Image.open('/local/path/to/images/memex.png').convert('RGB')
image_2 = Image.open('/local/path/to/images/us2020.png').convert('RGB')
image_3 = Image.open('/local/path/to/images/hard_negative.png').convert('RGB')
# User query
query_instruction = 'Represent this query for retrieving relavant document: '
query = 'Who was elected as president of United States in 2020?'
query_full = query_instruction + query
# Embed image documents
with torch.no_grad():
p_reps = model(text=['', '', ''], image=[image_1, image_2, image_3], tokenizer=tokenizer).reps
# Embed text queries
with torch.no_grad():
q_reps = model(text=[query_full], image=[None], tokenizer=tokenizer).reps # [B, s, d]
# Calculate similarities
scores = torch.matmul(q_reps, p_reps.T)
print(scores)
# tensor([[-0.0112, 0.3316, 0.2376]], device='cuda:0')
```
# Todos
- [x] Release huggingface space demo.
- [] Release the evaluation results.
- [] Release technical report.
# Limitations
- This checkpoint is an alpha version, and may not be strong in your tasks, for bad case, please create an issue to let us know, many thanks!
- The modeling script `modeling_minicpmv` on `huggingface` is not standard yet, the inference code could be further improved.
- The inference speed is low, because vision encoder uses `timm`, which does not yet support `flash-attn`.
- The model performs not well on Chinese and other non-English information retrieval tasks.
# Citation
If you find our work useful, please consider cite us:
```bibtex
@misc{RhapsodyEmbedding2024,
author = {RhapsodyAI},
title = {OCR-free Visual Document Embedding Model as Your Personal Librarian},
year = {2024},
howpublished = {\url{https://huggingface.co/RhapsodyAI/minicpm-visual-embedding-v0}},
note = {Accessed: 2024-06-28}
}
``` |