I created a simple DCGAN
with 6 layers and trained it on CelebA dataset (a portion of it containing 30K images).
I noticed my network generated images are dimmed looking and as the network trains more, the bright colors fade into dim ones!
here are some example:
This is how CelebA images look like (real images used for training) :
and these are the generated ones ,the number shows the epoch number(they were trained for 30 epochs ultimately) :
What is the cause for this phenomenon?
I tried to do all the general tricks concerning GAN
s, such as rescaling the input image between -1 and 1, or not using BatchNorm
in the first layer of the Discriminator
, and for the last layer of the Generator
or
using LeakyReLU(0.2)
in the Discriminator
, and ReLU
for the Generator
. yet I have no idea why the images are this dim/dark!
Is this caused by simply less training images?
or is it caused by the networks deficiencies ? if so what is the source of such deficiencies?
Here are how these networks implemented :
def conv_batch(in_dim, out_dim, kernel_size, stride, padding, batch_norm=True):
layers = nn.ModuleList()
conv = nn.Conv2d(in_dim, out_dim, kernel_size, stride, padding, bias=False)
layers.append(conv)
if batch_norm:
layers.append(nn.BatchNorm2d(out_dim))
return nn.Sequential(*layers)
class Discriminator(nn.Module):
def __init__(self, conv_dim=32, act = nn.ReLU()):
super().__init__()
self.conv_dim = conv_dim
self.act = act
self.conv1 = conv_batch(3, conv_dim, 4, 2, 1, False)
self.conv2 = conv_batch(conv_dim, conv_dim*2, 4, 2, 1)
self.conv3 = conv_batch(conv_dim*2, conv_dim*4, 4, 2, 1)
self.conv4 = conv_batch(conv_dim*4, conv_dim*8, 4, 1, 1)
self.conv5 = conv_batch(conv_dim*8, conv_dim*10, 4, 2, 1)
self.conv6 = conv_batch(conv_dim*10, conv_dim*10, 3, 1, 1)
self.drp = nn.Dropout(0.5)
self.fc = nn.Linear(conv_dim*10*3*3, 1)
def forward(self, input):
batch = input.size(0)
output = self.act(self.conv1(input))
output = self.act(self.conv2(output))
output = self.act(self.conv3(output))
output = self.act(self.conv4(output))
output = self.act(self.conv5(output))
output = self.act(self.conv6(output))
output = output.view(batch, self.fc.in_features)
output = self.fc(output)
output = self.drp(output)
return output
def deconv_convtranspose(in_dim, out_dim, kernel_size, stride, padding, batchnorm=True):
layers = []
deconv = nn.ConvTranspose2d(in_dim, out_dim, kernel_size = kernel_size, stride=stride, padding=padding)
layers.append(deconv)
if batchnorm:
layers.append(nn.BatchNorm2d(out_dim))
return nn.Sequential(*layers)
class Generator(nn.Module):
def __init__(self, z_size=100, conv_dim=32):
super().__init__()
self.conv_dim = conv_dim
# make the 1d input into a 3d output of shape (conv_dim*4, 4, 4 )
self.fc = nn.Linear(z_size, conv_dim*4*4*4)#4x4
# conv and deconv layer work on 3d volumes, so we now only need to pass the number of fmaps and the
# input volume size (its h,w which is 4x4!)
self.drp = nn.Dropout(0.5)
self.deconv1 = deconv_convtranspose(conv_dim*4, conv_dim*3, kernel_size =4, stride=2, padding=1)
self.deconv2 = deconv_convtranspose(conv_dim*3, conv_dim*2, kernel_size =4, stride=2, padding=1)
self.deconv3 = deconv_convtranspose(conv_dim*2, conv_dim, kernel_size =4, stride=2, padding=1)
self.deconv4 = deconv_convtranspose(conv_dim, conv_dim, kernel_size =3, stride=2, padding=1)
self.deconv5 = deconv_convtranspose(conv_dim, 3, kernel_size =4, stride=1, padding=1, batchnorm=False)
def forward(self, input):
output = self.fc(input)
output = self.drp(output)
output = output.view(-1, self.conv_dim*4, 4, 4)
output = F.relu(self.deconv1(output))
output = F.relu(self.deconv2(output))
output = F.relu(self.deconv3(output))
output = F.relu(self.deconv4(output))
# we create the image using tanh!
output = F.tanh(self.deconv5(output))
return output
# testing nets
dd = Discriminator()
zd = np.random.rand(2,3,64,64)
zd = torch.from_numpy(zd).float()
# print(dd)
print(dd(zd).shape)
gg = Generator()
z = np.random.uniform(-1,1,size=(2,100))
z = torch.from_numpy(z).float()
print(gg(z).shape)
I think that the problem lies rather in the architecture itself and I would first consider the overall quality of generated images rather than their brightness or darkness. The generations clearly get better as you train for more epochs. I agree that the images get darker but even in the early epochs, the generated images are significantly darker than the ones in the training samples. (At least compared to ones that you posted.)
And now coming back to your architecture, 30k samples are actually enough to obtain very convincing results as achieved by state-of-the-art models in face generations. The generations do get better but they are still far away from being "very good".
I think the generator is definitely not strong enough and is the problematic part. (The fact that your generator loss skyrockets can also be a hint for this.) In the generator, all you do is just upsampling and upsampling. You should note that the transposed convolution is more like a heuristic and it does not provide much learnability. This is related to the nature of the problem. When you are doing convolutions, you have all the information and you are trying to learn to encode but in the decoder, you are trying to recover information that was previously lost :). So, in a way, it is harder to learn because the information taken as input is limited and lacking.
In fact, deterministic bilinear interpolation methods do perform similar or even better than transposed convolutions and these are purely based on scaling/extending with zero learnability. (https://arxiv.org/pdf/1707.05847.pdf)
To observe the transposed convolutions' limits, I suggest that you replace all the Transposedconv2d
with UpSampling2D
(https://www.tensorflow.org/api_docs/python/tf/keras/layers/UpSampling2D) and I claim that the results will not be much different. UpSampling2D is one of those deterministic methods that I mentioned.
To improve your generator, you can try to insert convolutional layers between upsampling layers. These layers would refine the features/images and correct some of the mistakes that occurred during the up-sampling. In addition to corrections, the next upsampling layer would take a more informative input. What I mean is to try a UNet like decoding that you can find in this link (https://arxiv.org/pdf/1505.04597.pdf). Of course, that would be a primary step to explore. There are many more GAN architectures that you can try and probably perform better.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With