Update README.md
license: mit
---

# CycleGAN for unpaired image-to-image translation

## Model description

CycleGAN for unpaired image-to-image translation.
Given two image domains A and B, the following components are trained end to end to translate between the two domains (a schematic sketch of the generator objective follows the list):
- A generator A to B, named G_AB, conditioned on an image from A
- A generator B to A, named G_BA, conditioned on an image from B
- A domain classifier D_A, associated with G_AB
- A domain classifier D_B, associated with G_BA
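
The interaction between these components can be summarised by the standard CycleGAN generator objective (adversarial, cycle-consistency and identity terms). The following is a minimal, illustrative PyTorch sketch with stand-in modules, not the actual training code of this repository; the loss weights (10.0 for cycle, 5.0 for identity) are those reported under Hyperparameters below.

```python
import torch
import torch.nn as nn

# Stand-in modules for illustration only; the real generators are ResNet-based (GeneratorResNet).
G_AB = nn.Conv2d(3, 3, 3, padding=1)   # translates A -> B
G_BA = nn.Conv2d(3, 3, 3, padding=1)   # translates B -> A
D_A = nn.Conv2d(3, 1, 3, padding=1)    # scores whether an image looks like domain A
D_B = nn.Conv2d(3, 1, 3, padding=1)    # scores whether an image looks like domain B

adv_loss, l1_loss = nn.MSELoss(), nn.L1Loss()
real_A, real_B = torch.rand(1, 3, 64, 64), torch.rand(1, 3, 64, 64)

# Adversarial term: each translated image is scored by the discriminator of its target domain,
# and the generators are rewarded for being classified as real.
fake_B, fake_A = G_AB(real_A), G_BA(real_B)
pred_B, pred_A = D_B(fake_B), D_A(fake_A)
loss_gan = adv_loss(pred_B, torch.ones_like(pred_B)) + adv_loss(pred_A, torch.ones_like(pred_A))

# Cycle-consistency term: A -> B -> A and B -> A -> B should reconstruct the original images.
loss_cycle = l1_loss(G_BA(fake_B), real_A) + l1_loss(G_AB(fake_A), real_B)

# Identity term: a generator fed an image already in its target domain should change it little.
loss_identity = l1_loss(G_AB(real_B), real_B) + l1_loss(G_BA(real_A), real_A)

loss_G = loss_gan + 10.0 * loss_cycle + 5.0 * loss_identity
```

The discriminators are trained in turn to distinguish real samples of their own domain from the translated ones.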

At inference time, G_AB or G_BA is used to translate images, respectively from A to B or from B to A.
In the general setting, this technique provides style transfer between the selected image domains A and B.
It allows an image from domain A, translated by G_AB, to resemble the distribution of images from domain B, and vice versa for the generator G_BA.
Under this framework, the approach has been used to perform style transfer between synthetic data obtained from the simulated driving dataset GTA5 and real driving data from Cityscapes.
This is of paramount importance for developing deep learning models for autonomous driving perception, as it makes it possible to generate synthetic data with automatic annotations that resemble real-world images, without requiring the intervention of a human annotator.
This is fundamental because a manual annotator has been shown to require 1.5 to 3.3 hours to create semantic and instance segmentation masks for a single image, as reported in the original [Cityscapes paper (Cordts et al. 2016)](https://arxiv.org/abs/1604.01685) and in the [adverse conditions dataset paper (Sakaridis et al. 2021)](https://arxiv.org/abs/2104.13395).

Hence the CycleGAN provides forward and backward translation between synthetic and real-world data.
It has been shown to deliver high-quality translation even in the absence of paired sample/ground-truth data.
The idea behind such a model is that, as the synthetic data distribution gets closer to the real-world one, deep models stop suffering the degraded performance caused by the domain shift.
A broad literature on minimizing the domain shift exists under the research branches of domain adaptation and transfer learning, to which image translation models provide an alternative approach.

## Intended uses & limitations

#### How to use

```python
from torchvision import transforms as T
from torchvision.transforms import Compose, ToTensor, Normalize
from torchvision.utils import make_grid
from huggan.pytorch.cyclegan.modeling_cyclegan import GeneratorResNet
import torch


def pred_pipeline(img, transforms):
    # img: an RGB image as a numpy array of shape (H, W, 3);
    # relies on the globally defined `model` loaded below
    orig_shape = img.shape
    input = transforms(img)
    input = input.unsqueeze(0)
    output = model(input)

    # detach before converting back to a PIL image
    out_img = make_grid(output.detach().cpu(), nrow=1, normalize=True)
    out_transform = Compose([
        T.Resize(orig_shape[:2]),
        T.ToPILImage()
    ])
    return out_transform(out_img)


n_channels = 3
image_size = 512
input_shape = (image_size, image_size)

transform = Compose([
    T.ToPILImage(),
    T.Resize(input_shape),
    ToTensor(),
    Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5)),
])

model = GeneratorResNet.from_pretrained('Chris1/sim2real',
                                        input_shape=(n_channels, image_size, image_size),
                                        num_residual_blocks=9)

# synthetic_images: a batch of GTA5 frames of shape (N, 3, 512, 512),
# normalized to [-1, 1] (not defined in this snippet)
real_images = model(synthetic_images)
```
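
For a single frame, the helper defined above can be used end to end. A minimal sketch, assuming the previous block has been executed and that a GTA5-style RGB frame exists at the hypothetical path below:

```python
import numpy as np
from PIL import Image

# hypothetical input file; any GTA5-style RGB frame works
img = np.array(Image.open("example_gta5_frame.png").convert("RGB"))

translated = pred_pipeline(img, transform)   # PIL image rendered in the Cityscapes style
translated.save("example_gta5_frame_translated.png")
```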

#### Limitations and bias

Due to the absence of paired data, some background parts of the synthetic images are occasionally translated incorrectly, e.g. sky translated to vegetation.
Additional pretext tasks, run in parallel to the discriminative classification of fake and real samples, could improve the result.
One straightforward improvement is an additional parallel branch that performs semantic segmentation on the synthetic data, in order to learn features that separate sky from vegetation, thus disentangling their representations into distinct classes (a rough sketch follows).
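
A rough, hypothetical sketch of such a branch, purely to illustrate the idea (the module, feature shapes and class count below are invented; nothing of the kind ships with this model):

```python
import torch
import torch.nn as nn

class SegmentationHead(nn.Module):
    """Hypothetical auxiliary branch: per-pixel class logits from intermediate generator features."""
    def __init__(self, in_channels: int, num_classes: int):
        super().__init__()
        self.head = nn.Sequential(
            nn.Conv2d(in_channels, 128, 3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(128, num_classes, 1),
        )

    def forward(self, features: torch.Tensor) -> torch.Tensor:
        return self.head(features)

# features: an intermediate generator activation, e.g. (N, 256, H/4, W/4); labels: (N, H/4, W/4)
seg_head = SegmentationHead(in_channels=256, num_classes=19)   # 19 classes as in Cityscapes-style labels
features, labels = torch.rand(2, 256, 128, 128), torch.randint(0, 19, (2, 128, 128))
aux_loss = nn.CrossEntropyLoss()(seg_head(features), labels)   # would be added to the generator loss with some weight
```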

## Training data

The CycleGAN model is trained on an unpaired dataset of synthetic and real driving samples, taken respectively from the GTA5 and Cityscapes datasets.
To this end, the synthetic-to-real dataset can be loaded with the `load_dataset` function of the Hugging Face `datasets` library, as follows.
```python
from datasets import load_dataset

unpaired_dataset = load_dataset("Chris1/sim2real_gta5_to_cityscapes")
```
This dataset contains two columns, imageA and imageB, representing respectively the GTA5 and Cityscapes data.
Since the two columns have to be of the same length, GTA5 is subsampled to match the number of samples provided by the Cityscapes train split (2975).
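
Once loaded, the two columns can be inspected directly. A quick sketch, assuming a `train` split (consistent with the Cityscapes train split mentioned above) and that the columns are stored as `datasets` Image features:

```python
from datasets import load_dataset

unpaired_dataset = load_dataset("Chris1/sim2real_gta5_to_cityscapes")

train_split = unpaired_dataset["train"]      # assumed split name
print(train_split.num_rows)                  # expected to match the Cityscapes train size (2975)

example = train_split[0]
example["imageA"]                            # GTA5 (synthetic) sample, decoded as a PIL image
example["imageB"]                            # Cityscapes (real) sample, decoded as a PIL image
```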

## Training procedure

#### Preprocessing
The following transformations are applied to each input sample of synthetic and real data.
The input size is fixed to RGB images of height, width = 512, 512.
This choice has been made in order to limit the impact of upsampling the translated images to higher resolutions.
```python
n_channels = 3
image_size = 512
input_shape = (image_size, image_size)

transform = Compose([
    T.ToPILImage(),
    T.Resize(input_shape),
    ToTensor(),
    Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5)),
])
```

#### Hardware
The configuration has been tested on a single-GPU setup with an RTX 5000 or an A5000, as well as on multi-GPU, single-rank distributed setups composed of two of the aforementioned GPUs.

#### Hyperparameters
The following configuration has been kept fixed for all translation models (collected into a config sketch after the list):
- learning rate 0.0002
- number of epochs 200
- learning rate decay activation at epoch 100
- number of residual blocks of the CycleGAN: 9
- image size 512x512
- number of channels: 3
- cycle loss weight 10.0
- identity loss weight 5.0
- optimizer Adam with beta1 0.5 and beta2 0.999
- batch size 8
- no mixed precision training
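
For convenience, the same values can be gathered into a plain Python dictionary; the key names below are illustrative and do not necessarily match the training script's argument names:

```python
# illustrative names only; the training script may use different argument names
config = {
    "lr": 2e-4,
    "epochs": 200,
    "decay_epoch": 100,          # epoch at which the learning rate decay starts
    "num_residual_blocks": 9,
    "image_size": 512,
    "channels": 3,
    "lambda_cycle": 10.0,        # cycle-consistency loss weight
    "lambda_identity": 5.0,      # identity loss weight
    "beta1": 0.5,
    "beta2": 0.999,
    "batch_size": 8,
    "mixed_precision": False,
}
```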

## Eval results

#### Generated Images

In the provided images, row 0 and row 2 show the synthetic and real images from the respective datasets.
Row 1 is the translation of the images immediately above it in row 0 (synthetic), produced by the G_AB translation model in the real-world style.
Row 3 is the translation of the images immediately above it in row 2 (real), produced by the G_BA translation model in the synthetic-world style.

Visualization over the training iterations for [synthetic (GTA5) to real (Cityscapes) translation](https://wandb.ai/chris1nexus/experiments_cyclegan_s2r_hp_opt--10/reports/CycleGAN-sim2real-training-results--VmlldzoxODUyNTk4?accessToken=tow3v4vp02aurzodedrdht15ig1cx69v5mited4dm8bgnup0z192wri0xtftaeqj)

### References
```bibtex
@misc{https://doi.org/10.48550/arxiv.1703.10593,
  doi = {10.48550/ARXIV.1703.10593},
  url = {https://arxiv.org/abs/1703.10593},
  author = {Zhu, Jun-Yan and Park, Taesung and Isola, Phillip and Efros, Alexei A.},
  keywords = {Computer Vision and Pattern Recognition (cs.CV), FOS: Computer and information sciences},
  title = {Unpaired Image-to-Image Translation using Cycle-Consistent Adversarial Networks},
  publisher = {arXiv},
  year = {2017},
  copyright = {arXiv.org perpetual, non-exclusive license}
}
```