---
language:
- en
tags:
- information retrieval
- embedding model
- visual information retrieval
metrics:
- recall
pipeline_tag: feature-extraction
license: apache-2.0
---

# Memex: OCR-free Visual Document Embedding Model as Your Personal Librarian

The model takes only images as document-side inputs and produces vectors representing document pages. `minicpm-visual-embedding-v0` is trained on over 200k query-visual document pairs, including textual documents, visual documents, arXiv figures, plots, charts, industry documents, textbooks, ebooks, openly available PDFs, and more. On text-oriented documents, `minicpm-visual-embedding-v0` performs on a par with our ablation text embedding model, and it has an advantage on visually intensive documents.

Our model can:

- Help you read a long visually intensive or text-oriented PDF document and find the pages that answer your question.

- Help you build a personal library and retrieve book pages from a large collection of books.

- Work like a human: read and comprehend with **vision**, and remember **multimodal** information in its hippocampus.

![Memex Architecture](images/memex.png)

# News

- 2024-07-14: 🤗 We released an **online Hugging Face demo**! Try our [online demo](https://huggingface.co/spaces/bokesyo/minicpm-visual-embeeding-v0-demo)!

- 2024-07-14: 😋 We released a **locally deployable Gradio demo** of `miniCPM-visual-embedding-v0`; take a look at [pipeline_gradio.py](https://huggingface.co/RhapsodyAI/minicpm-visual-embedding-v0/blob/main/pipeline_gradio.py). You can run `pipeline_gradio.py` to build a demo on your PC.

- 2024-07-13: 💻 We released a **locally deployable command-line demo** of `miniCPM-visual-embedding-v0` that retrieves the most relevant pages from a given PDF file (which can be very long); take a look at [pipeline.py](https://huggingface.co/RhapsodyAI/minicpm-visual-embedding-v0/blob/main/pipeline.py).

- 2024-06-27: 🚀 We released our first visual embedding model checkpoint, `minicpm-visual-embedding-v0`, on [huggingface](https://huggingface.co/RhapsodyAI/minicpm-visual-embedding-v0).

- 2024-05-08: 🌍 We [open-sourced](https://github.com/RhapsodyAILab/minicpm-visual-embedding-v0) our training code (full-parameter tuning with GradCache and DeepSpeed, supporting large batch sizes across multiple GPUs with ZeRO stage 1) and evaluation code.

# Deploy on your PC

**Please make sure your machine has at least 32GB of memory.** Any of the following setups will work:

- Apple M1/M2/M3 with 32GB of memory.
- x86 CPU with 32GB of memory.
- x86 CPU with 32GB of memory + an Nvidia GPU with 16GB of memory.

### Install dependencies

Use pip to install all dependencies, for example by saving the list below as `requirements.txt` and running `pip install -r requirements.txt`:

```
Pillow==10.1.0
timm==0.9.10
torch==2.1.2
torchvision==0.16.2
transformers==4.36.0
sentencepiece==0.1.99
numpy==1.26.0
```


### Download model weights and modeling file

Use one of the following methods:

- Download with git clone.

```bash
git lfs install
git clone https://huggingface.co/RhapsodyAI/minicpm-visual-embedding-v0
```

- Download with huggingface-hub.

```bash
pip install huggingface-hub
huggingface-cli download --resume-download RhapsodyAI/minicpm-visual-embedding-v0 --local-dir minicpm-visual-embedding-v0 --local-dir-use-symlinks False
```
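
If you prefer to script the download from Python, the `huggingface_hub` package installed above also exposes `snapshot_download`, which is roughly equivalent to the CLI call (a minimal sketch; the local directory name is just an example):

```python
from huggingface_hub import snapshot_download

# Download the full repository (weights and modeling files) into a local folder.
snapshot_download(
    repo_id='RhapsodyAI/minicpm-visual-embedding-v0',
    local_dir='minicpm-visual-embedding-v0',  # example path; choose your own
)
```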

### Launch demo

Install `gradio` first.

```bash
pip install gradio
```

Adapt the code in `pipeline_gradio.py` according to your device; a device-selection sketch follows the list below.

- For M1/M2/M3 users, please make sure `model = model.to(device='mps', dtype=torch.float16)` then run `PYTORCH_ENABLE_MPS_FALLBACK=1 python pipeline_gradio.py`.
- For x86 CPU users, please remove `model = model.to(device)` then run `python pipeline_gradio.py`.
- For x86 CPU + Nvidia GPU users, please make sure `model = model.to('cuda')` then run `python pipeline_gradio.py`.
- If you encounter an error, please open an issue [here](https://huggingface.co/RhapsodyAI/minicpm-visual-embedding-v0/discussions); we will respond soon.
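
The items above amount to choosing a device for the loaded model. A minimal sketch, written as a hypothetical helper (the function name is illustrative and not part of `pipeline_gradio.py`):

```python
import torch

def place_model(model):
    """Move the loaded model to the best available device, per the notes above."""
    if torch.backends.mps.is_available():    # Apple M1/M2/M3
        return model.to(device='mps', dtype=torch.float16)
    if torch.cuda.is_available():            # x86 CPU + Nvidia GPU
        return model.to('cuda')
    return model                             # plain x86 CPU: keep the model on CPU
```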


# For research purposes

To run the model for research purposes, please refer to the following code:

```python
from transformers import AutoModel
from transformers import AutoTokenizer
from PIL import Image
import torch

device = 'cuda:0'

# Load the model; be sure to substitute `model_path` with your local model path
model_path = '/local/path/to/model'
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
model = AutoModel.from_pretrained(model_path, trust_remote_code=True)
model.to(device)

# Load image to PIL.Image object
image_1 = Image.open('/local/path/to/images/memex.png').convert('RGB')
image_2 = Image.open('/local/path/to/images/us2020.png').convert('RGB')
image_3 = Image.open('/local/path/to/images/hard_negative.png').convert('RGB')

# User query
query_instruction = 'Represent this query for retrieving relavant document: '
query = 'Who was elected as president of United States in 2020?'
query_full = query_instruction + query

# Embed image documents
with torch.no_grad():
    p_reps = model(text=['', '', ''], image=[image_1, image_2, image_3], tokenizer=tokenizer).reps

# Embed text queries
with torch.no_grad():
    q_reps = model(text=[query_full], image=[None], tokenizer=tokenizer).reps # [B, d]

# Calculate similarities
scores = torch.matmul(q_reps, p_reps.T)
print(scores)
# tensor([[-0.0112,  0.3316,  0.2376]], device='cuda:0')
```
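
To turn the score matrix into a retrieval result, you typically rank the candidate pages by similarity. A small helper along these lines (hypothetical, not part of the repository):

```python
import torch

def rank_pages(scores: torch.Tensor) -> list:
    """Return candidate page indices sorted by descending similarity.

    `scores` is the [num_queries, num_pages] matrix from torch.matmul(q_reps, p_reps.T);
    this helper assumes a single query (the first row).
    """
    return scores[0].argsort(descending=True).tolist()

# With the example scores above:
# rank_pages(torch.tensor([[-0.0112, 0.3316, 0.2376]]))  ->  [1, 2, 0]
# i.e. image_2 (us2020.png) is the best match for the 2020 election query.
```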

# Todos

- [x] Release huggingface space demo.

- [ ] Release the evaluation results.

- [ ] Release technical report.

# Limitations

- This checkpoint is an alpha version and may not be strong on your tasks. If you hit a bad case, please create an issue to let us know. Many thanks!

- The modeling script `modeling_minicpmv` on `huggingface` is not standard yet; the inference code could be further improved.

- Inference is slow because the vision encoder uses `timm`, which does not yet support `flash-attn`.

- The model does not perform well on Chinese and other non-English information retrieval tasks.

# Citation

If you find our work useful, please consider citing us:

```bibtex
@misc{RhapsodyEmbedding2024,
  author = {RhapsodyAI},
  title = {OCR-free Visual Document Embedding Model as Your Personal Librarian},
  year = {2024},
  howpublished = {\url{https://huggingface.co/RhapsodyAI/minicpm-visual-embedding-v0}},
  note = {Accessed: 2024-06-28}
}
```