alea31415's picture
license: creativeml-openrail-m


This is a low-quality bocchi-the-rock (ぼっち・ざ・ろっく!) character model. Like my yama-no-susume model, it can generate multi-character scenes as well as images of a single character. The results are still hit-or-miss, but with some luck you can get the entire Kessoku Band right in one shot; otherwise, you can always fall back on inpainting. Here are two examples:

With inpainting 4265343062-1047638199

Without inpainting 4265343086-2648280139

Characters

The model knows 12 characters from Bocchi the Rock. The resemblance to a character can be improved by describing their appearance more fully (for example, by adding long wavy hair to ShimizuEliza).

xy_grid-0028-24 xy_grid-0029-24 xy_grid-0030-24

Dataset description

The dataset contains around 27K images, with the following composition:

  • 7024 anime screenshots
  • 1630 fan arts
  • 18519 customized regularization images

The model is trained with a specific weighting scheme to balance the different concepts. For example, the three categories above have weights of 0.3, 0.25, and 0.45, respectively. Each category is itself split hierarchically into many sub-categories. For more details on the data preparation process, please refer to https://github.com/cyber-meow/anime_screenshot_pipeline
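To give a feel for what these weights imply, here is a minimal sketch. It assumes each category's weight is shared equally by all of its images, which is a simplification: the actual scheme is hierarchical and splits each category into sub-categories. The function name and the normalisation are hypothetical and are not taken from the pipeline.

```python
# Image counts and category weights as stated above.
counts = {"screenshots": 7024, "fanarts": 1630, "regularization": 18519}
weights = {"screenshots": 0.3, "fanarts": 0.25, "regularization": 0.45}


def per_image_multipliers(counts, weights):
    """Return a relative repeat factor per image for each category.

    Assumption: a category's weight is spread evenly over its images.
    Factors are normalised so the least-repeated category gets 1.0.
    """
    raw = {name: weights[name] / counts[name] for name in counts}
    smallest = min(raw.values())
    return {name: value / smallest for name, value in raw.items()}


multipliers = per_image_multipliers(counts, weights)
for name, factor in multipliers.items():
    print(f"{name}: {factor:.2f}")
```

Under this flat assumption, a fan art image would be sampled roughly six times as often as a regularization image, which is why the small fan art set still accounts for a quarter of what the model sees.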

Training Details

Trainer

The model is trained using EveryDream1, as EveryDream seems to be the only trainer out there that supports sample weighting (through the use of multiply.txt). For future training, it makes sense to migrate to EveryDream2.

Hardware and cost

The model is trained on RunPod using a 3090 and cost me around 15 dollars.

Hyperparameter specification

The model is trained for 48000 steps, at batch size 4, learning rate 1e-6, resolution 512, and a conditional dropping rate of 10%.
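The conditional dropping rate can be pictured with the sketch below. It assumes that "conditional dropping" means replacing a caption with an empty prompt for a random 10% of samples, which is the usual reading in classifier-free guidance training. The helper `maybe_drop_caption` is hypothetical and does not reproduce EveryDream's code.

```python
import random


def maybe_drop_caption(caption, drop_rate=0.10, rng=random):
    """Return an empty caption with probability drop_rate, else the caption."""
    return "" if rng.random() < drop_rate else caption


rng = random.Random(0)  # seeded only to make the demo repeatable
captions = [maybe_drop_caption("some caption", rng=rng) for _ in range(10_000)]
dropped_fraction = captions.count("") / len(captions)
print(f"dropped: {dropped_fraction:.3f}")
```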

Note that the weighting scheme translates into a different multiply value for each image, so repeat counts and epochs mean something quite different here. For example, with the weighting I used, an epoch contains around 300K images (some images are used multiple times), and therefore I did not even finish an entire epoch with the 48000 steps at batch size 4.
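The arithmetic behind that last remark, using only the figures quoted above (the 300K epoch size is approximate, so the resulting fraction is approximate too):

```python
steps = 48_000
batch_size = 4
images_per_epoch = 300_000  # approximate figure quoted above

samples_seen = steps * batch_size
epoch_fraction = samples_seen / images_per_epoch
print(samples_seen, f"{epoch_fraction:.2f}")  # prints: 192000 0.64
```

So training covered roughly two thirds of one weighted epoch.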

Failures

  • For the first 24000 steps I used the trigger words Bfan1 and Bfan2 for the two fans of Bocchi. However, these two words are too similar, and the model failed to learn different characters for them. I therefore changed Bfan2 to Bofa2 at step 24000, which seemed to solve the problem.
  • Character blending is always an issue.
  • When prompting the four characters of Kessoku Band, we often get side shots. I think this is due to overfitting to a particular image.

More Example Generations

With inpainting 4265343068-2420755431 4265343066-3979275255 4265343022-3534836762

Without inpainting 4265343092-803155289 4265343053-918713189 4265343054-2839948768 4265343096-399054050 4265343100-3858388158 4265343016-2842516738 4265343084-3548261345 4265343083-1372779456

Some failure cases 4265343089-2940163958 4265343091-129639375 4265343048-2869643584 4265343039-1470057774