A little intuition on how the underlying unclip_image_interpolation works?
I see that it relies on UnCLIPImageInterpolationPipeline (which is similar to UnCLIPImageVariationPipeline), but is there any explanation of the 'intuition' behind it that I can look into?
Thanks :)
@mikegarts sure! Happy to clarify.
- We generate the CLIP embeddings of the two input images using `CLIPVisionModelWithProjection`. Let them be `z_start` and `z_end`.
- We interpolate between `z_start` and `z_end` using spherical linear interpolation, a.k.a. slerp (https://en.wikipedia.org/wiki/Slerp). The number of interpolated embeddings equals the number of steps of the pipeline; call them `z_1, ..., z_N` (see the sketch after this list).
- We pass the embeddings `z_1, ..., z_N` to the decoder, which generates the images you see in the output.
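Here is a minimal sketch of the embed-and-slerp part; the CLIP checkpoint name, file names, `N`, and helper names are placeholders for illustration, not the pipeline's exact code:

```python
import torch
from PIL import Image
from transformers import CLIPImageProcessor, CLIPVisionModelWithProjection

# Placeholder CLIP checkpoint; the actual pipeline uses the image_encoder bundled
# with the image-variations checkpoint.
model_id = "openai/clip-vit-large-patch14"
processor = CLIPImageProcessor.from_pretrained(model_id)
image_encoder = CLIPVisionModelWithProjection.from_pretrained(model_id)

def clip_embed(img: Image.Image) -> torch.Tensor:
    # Projected CLIP image embedding (image_embeds), shape [projection_dim]
    inputs = processor(images=img, return_tensors="pt")
    return image_encoder(**inputs).image_embeds[0]

def slerp(t: float, v0: torch.Tensor, v1: torch.Tensor, eps: float = 1e-7) -> torch.Tensor:
    # Spherical linear interpolation between two embedding vectors
    v0n, v1n = v0 / v0.norm(), v1 / v1.norm()
    theta = torch.acos((v0n * v1n).sum().clamp(-1.0, 1.0))
    if torch.sin(theta).abs() < eps:  # nearly parallel -> fall back to plain lerp
        return (1.0 - t) * v0 + t * v1
    return (torch.sin((1.0 - t) * theta) * v0 + torch.sin(t * theta) * v1) / torch.sin(theta)

z_start = clip_embed(Image.open("image_a.png").convert("RGB"))
z_end = clip_embed(Image.open("image_b.png").convert("RGB"))

N = 6  # number of interpolation steps
interpolated = [slerp(i / (N - 1), z_start, z_end) for i in range(N)]  # z_1, ..., z_N
# Each interpolated embedding is then fed to the unCLIP decoder to produce one output image.
```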
References:
- DALL-E 2 paper, see Figure 3: https://cdn.openai.com/papers/dall-e-2.pdf
- Our discussion on https://github.com/huggingface/diffusers/issues/1869
- https://github.com/huggingface/diffusers/blob/main/examples/community/README.md#unclip-image-interpolation-pipeline
- Community Pipeline code: https://github.com/huggingface/diffusers/blob/main/examples/community/unclip_image_interpolation.py
It uses the image variations checkpoint because it has the image_encoder (CLIPVisionModelWithProjection) weights that I can use.
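In case it helps, usage looks roughly like this (see the community README linked above for the exact, current snippet; the file paths here are illustrative):

```python
import torch
from PIL import Image
from diffusers import DiffusionPipeline

device = "cuda" if torch.cuda.is_available() else "cpu"

pipe = DiffusionPipeline.from_pretrained(
    "kakaobrain/karlo-v1-alpha-image-variations",  # image-variations checkpoint with the image_encoder weights
    torch_dtype=torch.float16 if device == "cuda" else torch.float32,
    custom_pipeline="unclip_image_interpolation",
).to(device)

images = [Image.open("./image_a.png"), Image.open("./image_b.png")]
output = pipe(image=images, steps=6, decoder_guidance_scale=5)

for i, image in enumerate(output.images):
    image.save(f"interpolation_{i}.png")
```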
Do let me know if you have any doubts on this.
Put briefly, we encode the images into the CLIP latent space and walk N steps between them; at each step we decode the interpolated embedding into an image to see what lies there.
@NagaSaiAbhinay Thanks a lot, after reading the discussion thread it all makes much more sense :)
Btw, recently I pushed the controlnet_img2img pipeline and played with it a little, and it produces a somewhat more 'contextual' transition between two prompts/images (in a similar way: by interpolating in the embedding space and using img2img plus ControlNet to constrain the next image; a rough sketch of the loop is below the example).
For example:
source_prompt = "a beautiful tabby cat"
dest_prompt = "an astronaut on a horse"
can yield something like that:
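For anyone curious, here is a rough, untested sketch of that loop; the checkpoints, `strength`, step counts, and helper functions are assumptions for illustration, not the exact community pipeline code:

```python
import torch
import numpy as np
import cv2
from PIL import Image
from diffusers import ControlNetModel, StableDiffusionControlNetImg2ImgPipeline

# Assumed checkpoints; swap in whatever you actually use
controlnet = ControlNetModel.from_pretrained("lllyasviel/sd-controlnet-canny", torch_dtype=torch.float16)
pipe = StableDiffusionControlNetImg2ImgPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", controlnet=controlnet, torch_dtype=torch.float16
).to("cuda")

def embed(prompt: str) -> torch.Tensor:
    # Encode a prompt with the pipeline's CLIP text encoder -> [1, 77, 768]
    ids = pipe.tokenizer(
        prompt, padding="max_length", max_length=pipe.tokenizer.model_max_length,
        truncation=True, return_tensors="pt",
    ).input_ids.to(pipe.device)
    return pipe.text_encoder(ids)[0]

def canny(image: Image.Image) -> Image.Image:
    # Canny edge map of the previous frame, used as the ControlNet conditioning image
    edges = cv2.Canny(np.array(image), 100, 200)
    return Image.fromarray(np.stack([edges] * 3, axis=-1))

def slerp(t: float, v0: torch.Tensor, v1: torch.Tensor) -> torch.Tensor:
    # Spherical interpolation over the flattened text-embedding tensors
    v0n, v1n = v0 / v0.norm(), v1 / v1.norm()
    theta = torch.acos((v0n * v1n).sum().clamp(-1.0, 1.0))
    return (torch.sin((1.0 - t) * theta) * v0 + torch.sin(t * theta) * v1) / torch.sin(theta)

source_embeds = embed("a beautiful tabby cat")
dest_embeds = embed("an astronaut on a horse")

frame = Image.open("./cat.png").convert("RGB").resize((512, 512))  # starting image
frames = [frame]
N = 8
for i in range(1, N + 1):
    prompt_embeds = slerp(i / N, source_embeds, dest_embeds)
    frame = pipe(
        prompt_embeds=prompt_embeds,
        image=frame,                 # img2img: start from the previous frame
        control_image=canny(frame),  # ControlNet: keep the previous frame's structure
        strength=0.6,
        num_inference_steps=30,
    ).images[0]
    frames.append(frame)
```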