---
language:
- en
tags:
- information retrieval
- embedding model
- visual information retrieval
metrics:
- recall
pipeline_tag: feature-extraction
license: apache-2.0
---

# Memex: OCR-free Visual Document Embedding Model as Your Personal Librarian

Memex takes only images as document-side input and produces vectors that represent document pages. It is trained on over 200k query-visual document pairs covering textual documents, visual documents, arXiv figures, plots, charts, industry documents, textbooks, ebooks, and openly available PDFs. Its performance is on par with our ablation text embedding model on text-oriented documents, and it has an advantage on visually intensive documents.

What Memex offers:

- It helps you read a long visually intensive or text-oriented PDF document and find the pages that answer your question.

- It helps you build a personal library and retrieve book pages from a large collection of books.

- It has only 2.8B parameters and has the potential to run on your PC.

- It works like a human: it reads and comprehends with **vision** and remembers **multimodal** information, as a human does in the hippocampus.

![Memex Architecture](images/memex.png)

# News

- 2024-08-18: 👀 We released a **new [end-to-end Visual RAG huggingface demo](https://huggingface.co/spaces/bokesyo/MiniCPMV-RAG-PDFQA)** that supports **both retrieval and generation**, so you can now use our system to **answer your questions within a long PDF**!

- 2024-08-17: 👊 We open-sourced a [cleaned version of the training codebase](https://github.com/RhapsodyAILab/MiniCPM-V-Embedding-v0-Train) for MiniCPM-Visual-Embedding. It supports **deepspeed zero stage 1,2** and **large batch sizes** such as `4096` for full-parameter training that turns VLMs into dense retrievers. We also developed methods for filtering training datasets and generating queries from unlabelled datasets, and we support high-efficiency **multi-node, multi-GPU** **evaluation** on large retrieval datasets. With these efforts, we support contrastive learning for VLMs of up to `20B` parameters with a batch size of `4096`. We have tested that a VLM dense retriever can be trained on only **1 GPU with a batch size of `4096`**.

- 2024-07-14: 🤗 We released an **online huggingface demo**! Try our [online demo](https://huggingface.co/spaces/bokesyo/MiniCPM_Visual_Document_Retriever_Demo)!

- 2024-07-14: 😋 We released a **locally deployable Gradio demo** of `Memex`; take a look at [pipeline_gradio.py](https://huggingface.co/RhapsodyAI/minicpm-visual-embedding-v0/blob/main/pipeline_gradio.py). You can build a demo on your PC now!

- 2024-07-13: 💻 We released a **locally deployable command-line demo** for retrieving the most relevant pages from a given PDF file (which can be very long); take a look at [pipeline.py](https://huggingface.co/RhapsodyAI/minicpm-visual-embedding-v0/blob/main/pipeline.py).

- 2024-06-27: 🚀 We released our first visual embedding model checkpoint on [huggingface](https://huggingface.co/RhapsodyAI/minicpm-visual-embedding-v0).

- 2024-05-08: 🌍 We [open-sourced](https://github.com/RhapsodyAILab/minicpm-visual-embedding-v0) our training code (full-parameter tuning with GradCache and DeepSpeed zero-stage2, supports large batch size across multiple GPUs with zero-stage1) and eval code. 

# Deploy on your PC

**Please make sure you have at least 32GB memory on your PC.**

- Apple M1/M2/M3 with 32GB memory.
- x86 CPU with 32GB memory.
- x86 CPU with 32GB memory + Nvidia GPU with 16GB memory.

### Install dependencies

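Use pip to install the following pinned dependencies, for example by saving the list as `requirements.txt` and running `pip install -r requirements.txt`: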

```
Pillow==10.1.0
timm==0.9.10
torch==2.1.2
torchvision==0.16.2
transformers==4.36.0
sentencepiece==0.1.99
numpy==1.26.0
```
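After installing, it can be useful to confirm that the environment actually matches the pins. The following is a minimal sketch, not part of the released code: the helper name `check_pins` is hypothetical, and it assumes that a local build suffix such as `+cu121` on `torch` should be ignored when comparing versions.

```python
from importlib import metadata

# Pins copied from the dependency list above.
PINNED = {
    "Pillow": "10.1.0",
    "timm": "0.9.10",
    "torch": "2.1.2",
    "torchvision": "0.16.2",
    "transformers": "4.36.0",
    "sentencepiece": "0.1.99",
    "numpy": "1.26.0",
}


def check_pins(pins):
    """Return {package: (expected, found)} for each missing or mismatched package.

    `check_pins` is a hypothetical helper written for this illustration.
    """
    problems = {}
    for name, expected in pins.items():
        try:
            # Drop a local build suffix such as "+cu121" before comparing (assumption).
            found = metadata.version(name).split("+")[0]
        except metadata.PackageNotFoundError:
            found = None  # the package is not installed
        if found != expected:
            problems[name] = (expected, found)
    return problems


if __name__ == "__main__":
    for name, (expected, found) in check_pins(PINNED).items():
        print(f"{name}: expected {expected}, found {found or 'not installed'}")
```

An empty result means every pinned package is installed at the expected version.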


### Download model weights and modeling file

Use one of the following methods:

- Download with git clone.

```bash
git lfs install
git clone https://huggingface.co/RhapsodyAI/minicpm-visual-embedding-v0
```

- Download with huggingface-hub.

```bash
pip install huggingface-hub
huggingface-cli download --resume-download RhapsodyAI/minicpm-visual-embedding-v0 --local-dir minicpm-visual-embedding-v0 --local-dir-use-symlinks False
```

### Launch demo

Install `gradio` first.

```bash
pip install gradio
```

Adapt the code in `pipeline_gradio.py` according to your device.

- For M1/M2/M3 users: make sure the code uses `model = model.to(device='mps', dtype=torch.float16)`, then run `PYTORCH_ENABLE_MPS_FALLBACK=1 python pipeline_gradio.py`.
- For x86 CPU users: remove `model = model.to(device)`, then run `python pipeline_gradio.py`.
- For x86 CPU + Nvidia GPU users: make sure the code uses `model = model.to('cuda')`, then run `python pipeline_gradio.py`.
- If you encounter an error, please open an issue [here](https://huggingface.co/RhapsodyAI/minicpm-visual-embedding-v0/discussions); we will respond soon.
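The three device setups above could also be chosen at runtime instead of by editing the script by hand. This is a rough sketch under that assumption; `pick_device` is a hypothetical helper that is not part of `pipeline_gradio.py`, and it simply mirrors the per-device settings listed above (CUDA first, then MPS with `float16`, otherwise CPU).

```python
def pick_device(cuda_available, mps_available):
    """Return (device, dtype_name) mirroring the three setups described above.

    `pick_device` is a hypothetical helper; dtype_name is None when the
    instructions above do not request a specific dtype.
    """
    if cuda_available:
        return "cuda", None
    if mps_available:
        return "mps", "float16"
    return "cpu", None


if __name__ == "__main__":
    try:
        import torch
    except ImportError:
        print("torch is not installed; install the dependencies first")
    else:
        device, dtype_name = pick_device(
            torch.cuda.is_available(),
            torch.backends.mps.is_available(),
        )
        print(f"device={device}, dtype={dtype_name}")
```

With the result, one would call `model.to(device=device, dtype=getattr(torch, dtype_name))` when `dtype_name` is set and `model.to(device)` otherwise. On Apple silicon the `PYTORCH_ENABLE_MPS_FALLBACK=1` environment variable is still needed when launching the script.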


# For research purposes

To run the model for research purposes, please refer to the following code:

```python
from transformers import AutoModel
from transformers import AutoTokenizer
from PIL import Image
import torch

device = 'cuda:0'

# Load model, be sure to substitute `model_path` by your model path 
model_path = '/local/path/to/model'
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
model = AutoModel.from_pretrained(model_path, trust_remote_code=True)
model.to(device)

# Load image to PIL.Image object
image_1 = Image.open('/local/path/to/images/memex.png').convert('RGB')
image_2 = Image.open('/local/path/to/images/us2020.png').convert('RGB')
image_3 = Image.open('/local/path/to/images/hard_negative.png').convert('RGB')

# User query
query_instruction = 'Represent this query for retrieving relavant document: '
query = 'Who was elected as president of United States in 2020?'
query_full = query_instruction + query

# Embed image documents
with torch.no_grad():
    p_reps = model(text=['', '', ''], image=[image_1, image_2, image_3], tokenizer=tokenizer).reps

# Embed text queries
with torch.no_grad():
    q_reps = model(text=[query_full], image=[None], tokenizer=tokenizer).reps  # shape: [B, d]

# Calculate similarities
scores = torch.matmul(q_reps, p_reps.T)
print(scores)
# tensor([[-0.0112,  0.3316,  0.2376]], device='cuda:0')
```
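To turn the similarity scores into a retrieval result, the pages can be sorted by score. The sketch below is only an illustration: `rank_pages` is a hypothetical helper, the scores are copied from the example output above, and the page labels are the file names used in that example. With the real model, `scores[0].tolist()` would supply the score list.

```python
def rank_pages(scores, page_ids, top_k=3):
    """Return the top_k (page_id, score) pairs, highest score first.

    `rank_pages` is a hypothetical helper written for this illustration.
    """
    if len(scores) != len(page_ids):
        raise ValueError("scores and page_ids must have the same length")
    ranked = sorted(zip(page_ids, scores), key=lambda pair: pair[1], reverse=True)
    return ranked[:top_k]


# Scores copied from the example output above.
example_scores = [-0.0112, 0.3316, 0.2376]
example_pages = ["memex.png", "us2020.png", "hard_negative.png"]

for page, score in rank_pages(example_scores, example_pages, top_k=2):
    print(f"{page}: {score:.4f}")
# us2020.png: 0.3316
# hard_negative.png: 0.2376
```

In this example the page about the 2020 election ranks first, the hard negative second, and the unrelated architecture figure last.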

# Todos

- [x] Release huggingface space demo.

- [ ] Release the evaluation results.

- [ ] Release technical report.

# Limitations

- This checkpoint is an alpha version and may not perform well on your tasks. If you find bad cases, please create an issue to let us know. Many thanks!

- The modeling script `modeling_minicpmv` on `huggingface` is not standard yet, and the inference code could be further improved.

- Inference is slow because the vision encoder uses `timm`, which does not yet support `flash-attn`.

- The model does not perform well on Chinese and other non-English information retrieval tasks.

# Citation

If you find our work useful, please consider citing us:

```bibtex
@misc{RhapsodyEmbedding2024,
  author = {Rhapsody Group, OpenBMB},
  title = {Memex: OCR-free Visual Document Embedding Model as Your Personal Librarian},
  year = {2024},
  howpublished = {\url{https://huggingface.co/RhapsodyAI/minicpm-visual-embedding-v0}},
  note = {Accessed: 2024-06-28}
}
```

Thanks to MiniCPM-V-2.0 (`arxiv.org/abs/2408.01800`), without which there would be no `minicpm-visual-embedding`.