microsoft/Phi-3.5-vision-instruct · `TypeError` When Processing Text and Image Batch with `processor`

Aug 21

Issue Summary

I'm encountering a TypeError when trying to process a batch of data using the processor in my code. The error occurs when I attempt to process both text and image data together in a batch.

Code Snippet

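# Build one chat-formatted prompt per example, each referencing a single image.
texts = []
images = []
for example in examples:
    prompt = "my prompt"
    placeholder = "<|image_1|>\n"  # image placeholder token, followed by the prompt
    messages = [
        {"role": "user", "content": placeholder + prompt},
    ]

    # tokenize=False returns the formatted prompt as a plain string
    text = processor.tokenizer.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=False
    )
    texts.append(text)
    images.append(example["image"])

# Passing a list of texts together with a list of images raises the TypeError below
batch = processor(texts, images, return_tensors="pt", padding=True)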

return _compile(pattern, flags).split(string, maxsplit)
TypeError: expected string or bytes-like object

Detailed Description

The error occurs when trying to create a batch using the processor by passing a list of texts and images. The apply_chat_template function seems to return a result that causes the processor to raise a TypeError related to expecting a string or bytes-like object. It appears the issue is triggered when handling the batch processing of both text and image data together.
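
The last traceback frame is the body of Python's `re.split`, which suggests the processor splits the prompt on the image placeholder and was handed the whole list of texts instead of one string. The sketch below reproduces that failure in isolation. The regular expression is an assumption made for illustration, not taken from the processor's source.

```python
import re

# Assumed placeholder pattern, for illustration only.
pattern = r"<\|image_\d+\|>"

single_text = "<|image_1|>\nmy prompt"

# A single string splits cleanly around the placeholder.
chunks = re.split(pattern, single_text)

# A list of strings triggers the same TypeError as in the traceback.
try:
    re.split(pattern, [single_text, single_text])
    raised = False
    message = ""
except TypeError as err:
    raised = True
    message = str(err)

print(chunks)
print(raised, message)
```

If this is indeed the cause, the processor expects exactly one prompt string per call.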

Expected Behavior

The processor should correctly handle the provided text and image inputs and return the appropriate tensors without raising an error.

Environment

Python Version: 3.9
OS: Ubuntu

tomasruiz

Aug 21

I have the same issue. My use case is a single question for a single image, but I want to run it in batched mode to increase throughput.

haipingwu

Microsoft org Aug 21

Hi, thanks for your interest. Batch mode is not supported in the processor.

tomasruiz

Aug 22

I got batched inference with text + images to work 🦜

One has to run each image+prompt pair through the processor individually to get its tensors, then pad these tensors to a common length and stack them into one large tensor to feed the model.

Here is the code to get going now 🚀: https://gist.github.com/tomasruizt/21cfd764f8d89a7802bf32537af55bbe

I tested that each image does not leak into the other prompts by permuting the prompts about each image, and evaluating the answers qualitatively. If you find any errors, please let me know.
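
The pad-and-stack step described above can be sketched roughly as follows. This is not the code from the gist: it is a minimal illustration in plain Python lists, the helper name `left_pad_batch` is hypothetical, and both the `pad_id` value and the choice of left-padding are assumptions (left-padding is common for decoder-only generation because it keeps the prompt ends aligned).

```python
from typing import List, Tuple


def left_pad_batch(
    sequences: List[List[int]], pad_id: int
) -> Tuple[List[List[int]], List[List[int]]]:
    """Left-pad token id sequences to a common length.

    Returns (input_ids, attention_mask); mask entries are 0 at padded positions.
    """
    if not sequences:
        raise ValueError("sequences must not be empty")
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        input_ids.append([pad_id] * n_pad + list(seq))
        attention_mask.append([0] * n_pad + [1] * len(seq))
    return input_ids, attention_mask


# Hypothetical token ids, as if each prompt had been processed on its own.
ids, mask = left_pad_batch([[5, 6, 7], [8, 9]], pad_id=0)
print(ids)
print(mask)
```

In practice the padded rows would be turned into tensors, and the per-example image outputs of the processor would have to be combined along the batch dimension as well, which is not shown here.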

nguyenbh changed discussion status to closed Sep 25