Proposed change to CLIP text encoding for tag-style prompts
tl;dr: build attention masks from the commas to isolate each tag, so concepts are blended only in the image by Stable Diffusion, not inside CLIP.
I strongly believe this would help disentangle features / tags.
The default attention mask for text in CLIP is causal: a token can only be influenced by previous tokens, never by following ones.
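For reference, a minimal sketch of that default causal mask (the names and shapes here are illustrative, not the actual CLIP source):

```python
import torch

def causal_mask(seq_len: int) -> torch.Tensor:
    # True where attention is allowed: each token attends to itself and
    # to earlier positions only.
    return torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))

print(causal_mask(4).int())
# tensor([[1, 0, 0, 0],
#         [1, 1, 0, 0],
#         [1, 1, 1, 0],
#         [1, 1, 1, 1]], dtype=torch.int32)
```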
However, the comma-separated tag style of prompting seems ill-suited to this, since each tag is meant to be an isolated concept of its own. With causal masking, earlier tags and the commas themselves leak into later tags, so the order of the tags changes the result.
Skipping a few layers at the end (CLIP skip) may reduce how much the tokens / words are mixed together, but it doesn't change the way they're mixed: each token is still affected by every previous token.
I propose replacing the default with a new masking method where each comma-separated group of tokens is masked so it can only attend within its own group. Stable Diffusion can then blend the concepts together in the image, rather than CLIP blending them in the encoded prompt.
The commas themselves would also be removed from the tokens that are actually encoded; they would remain in the prompt only to define the attention mask boundaries.
Within each comma-separated tag, the usual causal pattern (each token attending only to itself and earlier tokens in the group) should restart, rather than letting all tokens in a group modify each other regardless of order or arrangement, so that words split into multiple tokens keep their interpretation.
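A rough sketch of what I mean, with the mask built from the comma positions and the commas dropped from the ids that actually get encoded. The comma token id and the exact tokenizer layout are assumptions here, and the word ids in the example are placeholders:

```python
import torch

def per_tag_causal_mask(token_ids: list[int], comma_id: int):
    # Drop commas from the ids that will actually be encoded, remembering
    # which comma-delimited group each surviving token belongs to.
    kept_ids, groups, group = [], [], 0
    for tid in token_ids:
        if tid == comma_id:
            group += 1                      # a comma closes the current tag
        else:
            kept_ids.append(tid)
            groups.append(group)

    g = torch.tensor(groups)
    same_group = g.unsqueeze(0) == g.unsqueeze(1)    # block-diagonal by tag
    causal = torch.tril(torch.ones(len(kept_ids), len(kept_ids), dtype=torch.bool))
    return kept_ids, same_group & causal             # causal, but only within a tag

# Toy prompt "<start> red dress , blue eyes <end>":
# 49406 / 49407 are CLIP's start / end tokens, 267 is assumed to be ",",
# and the word ids are placeholders.
ids = [49406, 736, 2595, 267, 1746, 1892, 49407]
kept, mask = per_tag_causal_mask(ids, comma_id=267)
print(mask.int())
# tensor([[1, 0, 0, 0, 0, 0],
#         [1, 1, 0, 0, 0, 0],
#         [1, 1, 1, 0, 0, 0],
#         [0, 0, 0, 1, 0, 0],
#         [0, 0, 0, 1, 1, 0],
#         [0, 0, 0, 1, 1, 1]], dtype=torch.int32)
```

Note that with this grouping the start-of-text token only influences the first tag, which is what the variants below try to address.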
Apologies if this is rambling; the idea has been in my head for a while.
Or, instead of removing the commas, replace them with tokens (extra start-of-text tokens, say) during encoding and strip those out at the end, so it isn't only the first tag that gets modified by the start-of-text token. The original start-of-text token definitely needs to stay in the final output, though.
Or adjust the attention masks to ignore the start-of-text token so it doesn't modify anything, though I'm not sure how well that would work.
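For that second variant, a tiny tweak on the mask above might be enough, just blocking the start-of-text position as something other tokens can attend to (position 0 is assumed to be the start token):

```python
import torch

def drop_start_token_influence(mask: torch.Tensor) -> torch.Tensor:
    # No other token may attend to position 0; it still attends to itself.
    out = mask.clone()
    out[1:, 0] = False
    return out
```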
The padding tokens at the end also carry forward the values of the previous tokens, and they aren't necessary for Stable Diffusion to generate a sensible image. They are useful for the unconditional encoding, though.
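If the padding were ever worth isolating for the conditional prompt (one reading of the point above), a similar mask tweak could stop it from reading the tags; pad_start here is an assumed index of the first padding token:

```python
import torch

def isolate_padding(mask: torch.Tensor, pad_start: int) -> torch.Tensor:
    # Padding rows no longer read the prompt tokens, so they stop carrying
    # tag content forward; they still attend to earlier padding and themselves.
    out = mask.clone()
    out[pad_start:, :pad_start] = False
    return out
```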