This model is a fine-tuned version of GPT-Vision-1. You can find out more about this model in this GitHub repository.
from transformers import AutoModelForCausalLM
from PIL import Image
model = AutoModelForCausalLM.from_pretrained("damerajee/GPTVision-1-ft", trust_remote_code=True)
image_path = "Your_image_path"
image = Image.open(image_path)
image = image.convert('RGB')
# The model generates better with these sampling parameters
gen_kwargs = {
    "do_sample": True,
    "temperature": 0.8,
    "top_p": 0.6,
    "repetition_penalty": 1.6,
}
question = "Your_question"
# generate() here is the model's custom method (loaded via trust_remote_code)
answer = model.generate(image=image, question=question, max_new_tokens=80, **gen_kwargs)
print(answer)
Image | Question | Response |
---|---|---|
(image not shown) | Why is this dessert appealing to someone with a sweet tooth? | The cake is a must-try and the person with it may be enjoying their dessert.It might also serve as an appetizer, offering options for those who prefer to try more than just one of these options. This could include food or drinks they are not having yet consumed before. A plate can provide them from each side without additional options: some people would rather eat this dessert |
(image not shown) | What factors contribute to the player's success in this particular shot? | 1. Positioning and positioning: The ball is placed in a position that allows for the player to hit it with both hands, making contact on either side of his body or face down at an angle that requires them not only to reach the tennis court but also needs their feet to touch him. This can lead into different directions. For example if they are playing close-up during practice |
Despite fine-tuning, the model struggles to generate coherent text for more complex or nuanced queries. It performs well with generic questions like 'Describe the scenery in this image' or 'What is this player doing in this image?', but it falls short when asked about deeper or more intricate aspects, which is a bummer, but I tried.
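To see the gap between generic and more involved questions on your own images, it can help to ask several questions about the same image in one go. The snippet below is a minimal sketch, not part of the model's code: `ask_many` and `GEN_KWARGS` are hypothetical names, and it assumes only that `model.generate` accepts the `image=`, `question=`, and `max_new_tokens=` keywords used in the usage snippet above.

```python
from typing import Any, Dict, Iterable

# Sampling settings copied from the usage snippet above.
GEN_KWARGS = {
    "do_sample": True,
    "temperature": 0.8,
    "top_p": 0.6,
    "repetition_penalty": 1.6,
}


def ask_many(model: Any, image: Any, questions: Iterable[str],
             max_new_tokens: int = 80, **gen_kwargs: Any) -> Dict[str, str]:
    """Ask several questions about one image and return {question: answer}.

    Hypothetical helper: assumes model.generate takes image= and question=
    keywords, as in the usage snippet for this model.
    """
    # Start from the recommended settings; caller-supplied values override them.
    kwargs = {**GEN_KWARGS, **gen_kwargs}
    answers: Dict[str, str] = {}
    for question in questions:
        answer = model.generate(image=image, question=question,
                                max_new_tokens=max_new_tokens, **kwargs)
        answers[question] = str(answer)
    return answers
```

With `model` and `image` loaded as in the usage snippet, calling `ask_many(model, image, ["Describe the scenery in this image", "What factors contribute to the player's success in this particular shot?"])` returns both answers keyed by question, so the generic and the more intricate prompt can be read side by side.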