Ketengan-Diffusion committed
Commit: a234fec
Parent(s): 54bab71
Update README.md

README.md CHANGED

This is an enhanced version of AnySomniumXL v3:

* Further improved concept and character accuracy

# Our Dataset Curation Process

<p align="center">
<img src="Curation.png" width=70% height=70%>
</p>

Image source: [Source1](https://danbooru.donmai.us/posts/3143351) [Source2](https://danbooru.donmai.us/posts/3272710) [Source3](https://danbooru.donmai.us/posts/3320417)

Our dataset is scored with the pretrained CLIP+MLP aesthetic scoring model from https://github.com/christophschuhmann/improved-aesthetic-predictor, and we adjusted our scoring script to detect any text or watermark using OCR via pytesseract.

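For illustration, the scoring step might be wired up roughly as follows. This is a minimal sketch, assuming the CLIP ViT-L/14 backbone and the MLP head published in the linked improved-aesthetic-predictor repository plus a pytesseract text check; the `score_image` helper and its rescaling onto the 0-100 band described in the next paragraph are assumptions, not the exact production script.

```python
import clip          # OpenAI CLIP (pip install git+https://github.com/openai/CLIP.git)
import pytesseract   # OCR used to flag text and watermarks
import torch
import torch.nn as nn
from PIL import Image

class AestheticMLP(nn.Module):
    """Mirror of the MLP head in the improved-aesthetic-predictor repo (768-d CLIP ViT-L/14 input)."""
    def __init__(self):
        super().__init__()
        self.layers = nn.Sequential(
            nn.Linear(768, 1024), nn.Dropout(0.2),
            nn.Linear(1024, 128), nn.Dropout(0.2),
            nn.Linear(128, 64), nn.Dropout(0.1),
            nn.Linear(64, 16),
            nn.Linear(16, 1),
        )

    def forward(self, x):
        return self.layers(x)

device = "cuda" if torch.cuda.is_available() else "cpu"
clip_model, preprocess = clip.load("ViT-L/14", device=device)

mlp = AestheticMLP().to(device)
# Checkpoint shipped with the improved-aesthetic-predictor repo.
mlp.load_state_dict(torch.load("sac+logos+ava1-l14-linearMSE.pth", map_location=device))
mlp.eval()

def score_image(path: str) -> float:
    """Return -1 if OCR finds any text, otherwise an aesthetic score on a 0-100 band."""
    image = Image.open(path).convert("RGB")

    # Per the curation rule in this section: any image with detectable text or a watermark scores -1.
    if pytesseract.image_to_string(image).strip():
        return -1.0

    with torch.no_grad():
        embedding = clip_model.encode_image(preprocess(image).unsqueeze(0).to(device))
        embedding = embedding / embedding.norm(dim=-1, keepdim=True)
        raw = mlp(embedding.float()).item()   # raw predictor output, roughly 1-10

    # Assumed rescaling onto the 0-100 band described below; not part of the public repo.
    return max(0.0, min(100.0, raw * 10.0))
```
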
This scoring method uses a scale from -1 to 100. We take a threshold of roughly 17-20 as the minimum score and 65-75 as the maximum to retain the 2D style of the dataset, and any image containing text returns a score of -1. Any image scoring below 17 or above 65 is therefore deleted.

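A minimal sketch of that deletion rule, reusing the hypothetical `score_image` helper from the previous sketch; the 17 and 65 cut-offs follow the thresholds quoted above.

```python
import os

MIN_SCORE = 17.0  # lower bound quoted above (17-20)
MAX_SCORE = 65.0  # upper bound quoted above (65-75)

def curate(paths):
    """Delete every image whose score falls outside the accepted band."""
    for path in paths:
        score = score_image(path)              # -1 if OCR found text or a watermark
        if score < MIN_SCORE or score > MAX_SCORE:
            os.remove(path)                    # out-of-band images are removed
```
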
The dataset curation process runs on an Nvidia T4 16GB machine and takes about 7 days to curate 1,000,000 images (roughly 1.7 images per second).

# Captioning process
We use a combination of a proprietary multimodal LLM and open-source multimodal LLMs such as LLaVA 1.5 for the captioning process, which produces richer captions than plain BLIP2. Details such as clothing, atmosphere, situation, scene, place, gender, skin, and more are generated by the LLM.

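As a rough illustration of the open-source half of this step, the sketch below captions one image with the Hugging Face `llava-hf/llava-1.5-7b-hf` checkpoint via Transformers; the checkpoint choice and the detail-oriented prompt are assumptions, and the proprietary multimodal LLM used alongside it is not covered.

```python
import torch
from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration

# Assumed open-source checkpoint; the proprietary model used alongside it is not public.
model_id = "llava-hf/llava-1.5-7b-hf"
processor = AutoProcessor.from_pretrained(model_id)
model = LlavaForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

# Assumed detail-oriented prompt covering the attributes listed above.
prompt = (
    "USER: <image>\nDescribe this image in detail, including clothing, "
    "atmosphere, situation, scene, place, gender, and skin. ASSISTANT:"
)
image = Image.open("sample.png").convert("RGB")

inputs = processor(images=image, text=prompt, return_tensors="pt").to(model.device, torch.float16)
output_ids = model.generate(**inputs, max_new_tokens=256)
caption = processor.decode(output_ids[0], skip_special_tokens=True)
print(caption.split("ASSISTANT:")[-1].strip())  # keep only the generated caption
```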