
Sunday, July 2, 2017

Generating Pokemon with a Generative Adversarial Network (GAN) in TensorFlow 1.1

For the past three weeks or so, I've had an obsession: generating Pokemon with a Generative Adversarial Network (GAN), specifically a DCGAN. I had seen the video at https://www.youtube.com/watch?v=rs3aI7bACGc&t=34, but the original spark that gave me the itch was a post by https://github.com/joshnewnham on Udacity's Deep Learning Slack channel. After replicating MNIST and going through Udacity's Deep Learning Foundations Nanodegree course, I thought, what the heck, this is going to be easy. I had been doing deep learning primarily on a 13-inch 2014 MacBook Pro, using AWS and more recently FloydHub for models that were too large. But I accidentally left an AWS instance running and ate through $70 of a $100 credit over a weekend, and while FloydHub was promising, uploading large datasets to it was frustrating. So I decided to finally buy a gaming computer with an NVIDIA GPU in order to work on more serious deep learning projects on my own machine. I ended up getting this: https://www.amazon.com/gp/product/B01J99IL9E/ref=oh_aui_detailpage_o05_s00?ie=UTF8&psc=1. I got it for just under $900 (it has already gone up by about $270 since I bought it), so I wouldn't have been able to build it for much less, given that it came with Windows 10. I bought it for the GTX 1060 6GB card. I debated getting a GTX 1070 8GB, but this is a good "starter machine". I wouldn't really recommend a card with less memory; maybe a 1050 Ti with 4GB if you are absolutely strapped for cash, but if you have the money, don't skimp. As for the "cloud versus your own machine" debate, there's something about being able to jump on a machine and instantly start coding that motivates me much more, and I don't have to worry about leaving an instance on and getting a surprise bill.

Now, on to the quest of making Pokemon. I don't know what made me so cocky, but I erroneously assumed I could basically reuse the model I had created to generate MNIST digits and pump out images that could pass for Pokemon. The dataset came from https://veekun.com/dex/downloads; I used the "Global Link Artwork" set, which contains roughly 850 images spanning, I believe, all generations of Pokemon. I can't really be certain: Pokemon became big in the States when I was a teenager, so I'm only vaguely familiar with them, and by the time I was about to download Pokemon Go, the fad had come and gone.

Example Real Image

These images were 300x300x4 (RGBA, with transparent backgrounds), but they were much too large for my GPU (maybe I should have gotten that 1070), so I initially resized them to 60x60x4 with scipy.misc.imresize(). The generator started with 225 random numbers (15 x 15). From there I used tf.layers.dense() with tf.contrib.layers.xavier_initializer() and tf.constant_initializer(0.1); the first two layers had 1024 and 15 x 15 x 512 units. Then came three tf.layers.conv2d_transpose() layers, each with the aforementioned initializers, a kernel of 3, and 512, 384, and 4 channels respectively, upsampling with stride 2 to take the random numbers to a "generated image" of 60x60x4 (note that 15x15 only needs two stride-2 doublings to reach 60x60, so not every layer can double the resolution). The discriminator was basically the mirror image, using tf.layers.conv2d() layers of 384 and then 512 channels, after which the 15x15x512 output is flattened to one dimension and converted to a sigmoid. I also applied my "moment stabilization" method (described in the posts below) to try to keep the real and generated distributions similar. I ran this network for 500 epochs. A sketch of the generator follows, and then my results; try not to laugh:
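
Here is a minimal sketch of that generator using the tf.layers API. The activations, the tanh output, and the exact strides are my assumptions (the description above implies stride 2 everywhere, but only two doublings fit 15x15 -> 60x60, so the last layer here uses stride 1):

import tensorflow as tf

def generator(z, reuse=False):
    # Sketch only: 15x15 -> 30x30 -> 60x60 via two stride-2 layers,
    # then a stride-1 layer down to 4 channels (RGBA).
    w_init = tf.contrib.layers.xavier_initializer()
    b_init = tf.constant_initializer(0.1)
    with tf.variable_scope('generator', reuse=reuse):
        h = tf.layers.dense(z, 1024, activation=tf.nn.relu,
                            kernel_initializer=w_init, bias_initializer=b_init)
        h = tf.layers.dense(h, 15 * 15 * 512, activation=tf.nn.relu,
                            kernel_initializer=w_init, bias_initializer=b_init)
        h = tf.reshape(h, (-1, 15, 15, 512))
        h = tf.layers.conv2d_transpose(h, 512, 3, strides=2, padding='same',
                                       activation=tf.nn.relu,
                                       kernel_initializer=w_init, bias_initializer=b_init)
        h = tf.layers.conv2d_transpose(h, 384, 3, strides=2, padding='same',
                                       activation=tf.nn.relu,
                                       kernel_initializer=w_init, bias_initializer=b_init)
        return tf.layers.conv2d_transpose(h, 4, 3, strides=1, padding='same',
                                          activation=tf.nn.tanh,
                                          kernel_initializer=w_init, bias_initializer=b_init)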

[Generated images: results after 500 epochs of training]

These are my Pokeblobs!!!!! Not entirely bad, but definitely nothing to be proud of, so how do we make it better? The saying goes that 80% of data science is data preparation. A benefit of deep learning is the ability to model without much (or even any) feature engineering, but that doesn't mean all data is just ready to run through a network. To get better results, we have to get our hands dirty. First of all, 850 images is not that many, and moreover, a few of the images are repeats. So the first step was manually looking through the dataset and deleting the duplicates. This probably wasn't absolutely necessary because there weren't that many, but I did it anyway. While looking through the images I noticed that the figures seemed to randomly face left or right, so I manually flipped them to all face the same way. I could have written a function to flip an image when more of its pixels sat on one side than the other (a rough sketch of that idea follows this paragraph), but it only took about 15 minutes to do it manually. Then came the important part: data augmentation to increase the number of images.
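
For the curious, here is a sketch of that automatic-flip idea, assuming RGBA images loaded as numpy arrays; this is hypothetical, not code from the project:

import numpy as np

def face_same_way(img):
    # img: (H, W, 4) RGBA array; the alpha channel marks the figure.
    alpha = img[..., 3].astype(np.float64)
    mid = alpha.shape[1] // 2
    # Flip horizontally when more opaque pixels sit on the left half.
    if alpha[:, :mid].sum() > alpha[:, mid:].sum():
        return np.fliplr(img)
    return img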

The first method of augmentation I tried was rotating images with scipy.ndimage.rotate(). Big mistake: the rotated images are no longer cleanly framed squares, and the rotation artifacts created a ton of issues (e.g., the DCGAN displaying lines from the rotations). A better method is as follows: if the images are transparent (RGBA), lift them onto a larger white background (in this case 100x100), converting them to RGB in the process, to leave room to rotate, then crop the image back to the size that will be fed into the model. An RGBA image can be composited onto a background like so:

https://stackoverflow.com/questions/2563822/how-do-you-composite-an-image-onto-another-image-with-pil-in-python (slightly altered, from RGBA to RGB)
from PIL import Image

img = Image.open('/pathto/file', 'r')
img_w, img_h = img.size
background = Image.new('RGB', (100, 100), (255, 255, 255))
bg_w, bg_h = background.size
# Integer division: paste() needs integer pixel offsets (plain / returns floats in Python 3).
offset = ((bg_w - img_w) // 2, (bg_h - img_h) // 2)
# Passing the image as its own mask composites the RGBA figure onto the white RGB background using its alpha channel.
background.paste(img, offset, img)
background.save('out.png')
From this point, I made altered copies with cv2.GaussianBlur() and cv2.medianBlur(), and then rotated all of these images as follows (this example shows a 3-degree rotation):

import cv2
import scipy.misc

rows, cols = pokemon[i].shape[:2]  # pokemon[i]: the i-th 100x100 composited image
M1 = cv2.getRotationMatrix2D((cols / 2, rows / 2), 3, 1)  # rotate 3 degrees about the center
dst1 = cv2.warpAffine(pokemon[i], M1, (cols, rows))
save_path1 = path + '3' + str(i) + '.png'
scipy.misc.imsave(save_path1, dst1[14:86, 4:76])  # crop the 100x100 canvas back to 72x72

I rotated the images 3, 5, and 7 degrees and kept the unrotated originals. After my deletions and data augmentation, I ended up with 17,676 images. I started with 100 random numbers and changed the image dimensions to 72x72x3; most models I see online use 64x64x3 so they can easily build a deeper network up from 4x4, but I was determined not to just plug and play someone else's model. To obtain better results, I decided to code the generator from scratch, which allowed me to use a single linear transformation as the first layer and build a generator network that was otherwise solely conv2d_transpose layers. It's fairly easy, but there's one caveat: with a stride of two or more, a "SAME" convolution is not invertible (different input sizes can shrink to the same output size), so you have to tell conv2d_transpose which output height and width you want; see https://github.com/tensorflow/tensorflow/issues/2118. When you use tf.layers, this is done for you. Just know that output = (input + stride - 1) // stride for a "SAME" convolution, so you have to adjust each conv2d_transpose operation to match these outputs or the network will keep throwing errors. Running that arithmetic forward, 72x72 goes to 36x36, to 18x18, to 9x9, to 5x5. Intuition would suggest the conv2d_transpose stack should start from 9x9 (reshaped from flat) and go to 18x18, 36x36, 72x72, but in fact the "flat-resize" has to match the 5x5 output, not 9x9. This was the most frustrating part of this project, so hopefully this saves at least one person hours of pulling their hair out. Also, I could not combine tf.layers.batch_normalization() with tf.nn.conv2d_transpose(), so I used tf.nn.moments() to get the mean and variance and fed them into tf.nn.batch_normalization(). After making these changes, I got markedly better images, but I realized I needed to beef up the discriminator as well; doing so produced the results shown after the sketch below:
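
Here is a sketch of the size arithmetic plus one tf.nn-level transpose-convolution block with the moments-based batch normalization; the helper names, kernel size, and bias-free layout are my assumptions, not the exact project code:

import tensorflow as tf

def same_out(size, stride=2):
    # Output height/width of a "SAME" convolution.
    return (size + stride - 1) // stride

# Forward sizes from 72x72 with stride 2: 72 -> 36 -> 18 -> 9 -> 5,
# so the reshaped "flat" feature map must be 5x5, not 9x9.

def deconv_bn(x, out_h, out_w, out_ch, name, kernel=3, stride=2):
    in_ch = x.get_shape().as_list()[-1]
    batch = tf.shape(x)[0]
    with tf.variable_scope(name):
        w = tf.get_variable('w', [kernel, kernel, out_ch, in_ch],
                            initializer=tf.contrib.layers.xavier_initializer())
        # Explicit output_shape: a stride-2 "SAME" transpose is ambiguous
        # (both 9 and 10 shrink to 5), so we must say which size we want.
        y = tf.nn.conv2d_transpose(x, w,
                                   output_shape=tf.stack([batch, out_h, out_w, out_ch]),
                                   strides=[1, stride, stride, 1], padding='SAME')
        # Batch norm by hand: tf.nn.moments feeds tf.nn.batch_normalization.
        mean, var = tf.nn.moments(y, axes=[0, 1, 2])
        y = tf.nn.batch_normalization(y, mean, var, offset=None, scale=None,
                                      variance_epsilon=1e-5)
        return tf.nn.relu(y)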

[Generated images: results after the architecture changes]

I'm pretty happy with these results. The model is able to pick up on the outline around the figures, and it seems to be creating arms, legs, and heads. There are a few more changes I want to try, such as a Wasserstein GAN (the discriminator becomes a "critic", and it is supposed to be more stable than a regular GAN; its loss is sketched at the end of this post). The model takes about 16-20 hours to train, so I've been running it, going to work, and checking or tweaking it when I get home. I will probably break down and do the 4x4, 8x8, 16x16, 32x32, 64x64 generator just once to see if that improves the images; the only thing I'm really disappointed about is the green hue on many of the generated images. I know there's a reason this setup is so popular, but it doesn't hurt to experiment, and I will update this post if I get improved results. The full code implementation can be found here: https://github.com/ogreen8084/pokemon_gan. Overall, this was a great project: I implemented (and shelved) average pooling, created a DCGAN in TensorFlow with batch normalization without using the higher-level tf.layers library, and figured out how to go straight to conv2d_transpose without fully-connected layers by using a single linear transformation. I honestly would not have worked on this (and I would have spent way too much money) if I were using "the Cloud", so I'm very happy with my investment in a workstation with an NVIDIA GPU. If you're on the fence about it, I can definitely tell you: it's worth it!
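
For anyone curious about the Wasserstein variant mentioned above, here is a generic sketch of the WGAN losses from the original paper, not code from this project; d_logits_real, d_logits_fake, and critic_vars are assumed names:

import tensorflow as tf

# The critic maximizes the score gap between real and fake samples
# (so we minimize the negation); the generator maximizes the critic's
# score on fakes. critic_vars is a hypothetical list of critic variables.
d_loss = tf.reduce_mean(d_logits_fake) - tf.reduce_mean(d_logits_real)
g_loss = -tf.reduce_mean(d_logits_fake)
# Weight clipping keeps the critic approximately Lipschitz.
clip_ops = [v.assign(tf.clip_by_value(v, -0.01, 0.01)) for v in critic_vars]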

Until next time, Oscar. 

Update:
I did end up doing the 4x4... 64x64 setup and these are the results...  not a noticeable difference in quality, but I think the Pokemon look cooler, so I uploaded it!

[Generated images: results from the 4x4-to-64x64 generator]

Monday, May 22, 2017

Using Moments to Improve & Stabilize Generative Adversarial Network (GAN) Learning (Pt 2)

This post continues where we left off previously. We had promising results from using the first two moments (mean and variance) of the real data to guide the "fake data" from a generator network toward mimicking the real data's distribution, and therefore toward a better chance of producing good results. We will now switch to a convolutional neural network structure and test on the same MNIST dataset.


In this second test, we will use the same convolutional network structure throughout and instead vary the learning rate. A known problem with deep learning in general is the need to tune hyperparameters in order to obtain good results. It would be reasonable to expect that minimizing the difference between the distributions of the real and generated data would provide a greater margin for error; in other words, by stabilizing the distribution of the generated data, the network should train successfully over a wider range of hyperparameters. We shall experiment and find out. The full implementation for this project can be found at: https://github.com/ogreen8084/moment_stabilization_dconv

Dependencies:
Python 3.5.1
Tensorflow 1.0.1
Numpy
Matplotlib
Pickle
Pandas

Tests:

In all tests, we optimize with AdamOptimizer. We create our “fake MNIST dataset” from an initial input of 100 dimensions drawn from a uniform distribution between -1 and 1. The generator DeConvNet has a fully connected layer of 1,024 units and then a fully connected layer of 7*7*256 = 12,544 units. This layer is reshaped and fed to a conv2d_transpose layer of 32 filters and finally to a conv2d_transpose layer of one filter. Batch normalization is used throughout, padding is "SAME", and the stride is two for each conv2d_transpose layer. We again use a batch size of 100, but we train for only 20 epochs, as opposed to the 100 epochs we used for the feed-forward network.
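
As a rough illustration, here is what that generator might look like in TensorFlow 1.x; the kernel size, leaky-ReLU slope, and tanh output are my assumptions rather than the exact repo code:

import tensorflow as tf

def generator(z, is_train=True, reuse=False, alpha=0.2):
    with tf.variable_scope('generator', reuse=reuse):
        # Fully connected: 100 -> 1024 -> 7*7*256.
        h = tf.layers.dense(z, 1024)
        h = tf.layers.batch_normalization(h, training=is_train)
        h = tf.maximum(alpha * h, h)  # leaky ReLU
        h = tf.layers.dense(h, 7 * 7 * 256)
        h = tf.layers.batch_normalization(h, training=is_train)
        h = tf.maximum(alpha * h, h)
        h = tf.reshape(h, (-1, 7, 7, 256))
        # Stride-2 "SAME" transposes: 7x7 -> 14x14 -> 28x28.
        h = tf.layers.conv2d_transpose(h, 32, 5, strides=2, padding='same')
        h = tf.layers.batch_normalization(h, training=is_train)
        h = tf.maximum(alpha * h, h)
        logits = tf.layers.conv2d_transpose(h, 1, 5, strides=2, padding='same')
        return tf.tanh(logits)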


Test #1: [results images]

Test #2: [results images]

Test #3: [results images]

Results: 

It's safe to say that moment stabilization speeds up training in each test. In test #3, with the smallest learning rate, 0.0001, the non-moment-stabilization model struggles to train at all, while the moment-stabilization model is able to create digit-like figures. In test #1, there is also a clear advantage from moment stabilization: the model generates digit-like figures more quickly, and those figures remain crisper than the non-stabilized model's throughout. In test #2, the performance of the two models is closest, which is consistent with the theory that if the model is already well-tuned, moment stabilization has less of an effect.

Conclusion:

We have further evidence that minimizing the difference between the mean and variance of the generated and real datasets can increase the performance of GANs and can potentially help models train successfully with less fine-tuning of hyperparameters.


Benefits and Concerns:
1. The method seems to work, but we need to understand mathematically why it works.
2. Can the method also allow us to obtain good results with a smaller, less complex network?

Next Steps:
1. Test on a smaller DeConvNet with MNIST
2. Test on the CelebA dataset

Monday, May 8, 2017

Using Moments to Stabilize Generative Adversarial Network (GAN) Learning


Generative Adversarial Networks (GANs), created in 2014 by Ian Goodfellow, are an extremely promising method of producing fake data indistinguishable from the real thing by pitting two neural networks against one another. According to OpenAI, GANs currently produce the sharpest generative images compared to the other popular methods, Variational Autoencoders and Autoregressive models. This benefit, however, comes at a cost: GANs are difficult to optimize due to unstable training dynamics (Karpathy, June 16, 2016). A GAN's two networks must also be well synchronized, or the generative model will collapse around a successful instance (a generated example that can fool the discriminator) rather than approximating the true distribution of the real dataset (Goodfellow, June 10, 2014). Finally, GANs can be very sensitive to the initial values of the weights and fail to train; batch normalization is recommended to help overcome this issue (Udacity, May 5, 2017). However, there may be another way. What if we adjusted the generator's loss function to penalize the model when it doesn't produce a distribution similar to the real data's?
This makes sense, as the goal of the generative model is to create a distribution that matches the real data's. We shall see that by incentivizing the generator's batches to match the first two moments (the mean and variance) of the real data's batches, we are able to successfully train deeper networks with or without batch normalization. We will use the popular MNIST dataset, available at http://yann.lecun.com/exdb/mnist/. The baseline GAN and the visualization code are from Udacity's Deep Learning Foundations Nanodegree program: https://www.udacity.com/course/deep-learning-nanodegree-foundation--nd101. The full implementation for this project can be found at: https://github.com/ogreen8084/moment_stabilization


Dependencies:
Python 3.5.1
Tensorflow 1.0.1
Numpy
Matplotlib
Pickle
Pandas

Tests:
In all tests, we optimize with AdamOptimizer and use the default learning rate. We create our “fake MNIST dataset” from an initial input of 100 dimensions drawn from a uniform distribution between -1 and 1. Throughout testing we use a batch size of 100 and train for 100 epochs. We generate the inputs with numpy as follows:
import numpy as np

z_size = 100
batch_size = 100
# One batch of generator inputs: uniform noise in [-1, 1).
batch_z = np.random.uniform(-1, 1, size=(batch_size, z_size))


Baseline Test: One Layer Generator without batch normalization.
We use a generator with one hidden layer of 128 units (n_units) and a leaky ReLU activation function, which is designed to fix the “dying ReLU” problem (more information can be found at http://cs231n.github.io/neural-networks-1/).
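
For reference, a leaky ReLU is a one-liner; the 0.01 slope here is a common default, not necessarily the value used in the repo:

import tensorflow as tf

def lrelu(x, alpha=0.01):
    # A plain ReLU zeroes negative inputs, so a unit can get stuck emitting
    # zero forever; the leaky version keeps a small slope there instead.
    return tf.maximum(alpha * x, x)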

Results from the baseline GAN (sampled every 10 epochs; all other visualizations follow this format as well):

Discriminator Loss vs Generator Loss (Baseline)

We can see that the generative model learns to produce numeric-like figures, but they don't look like “real numbers”. We could probably get better results from a deeper network; however, there's a problem. As mentioned earlier, if you attempt to train a deeper network without a technique such as batch normalization, it won't train well.

Test 2: Two Hidden Layer Generator without batch normalization.
We use a generator with two layers: the first has 128 hidden units and the second has 384. After 100 epochs, the model starts to produce figures that somewhat resemble numerals, but these results are clearly worse than the single-hidden-layer model's. Perhaps better results could be attained by training past 100 epochs, but we use the same number of epochs for every model to maintain consistency.

Results from Two Hidden Layer GAN (no batch normalization)

Discriminator Loss vs Generator Loss (Two Hidden Layers)

Test #3: Two Hidden Layer Generator with Batch Normalization

We now introduce batch normalization to attempt to get better results from the deeper network. The only difference from the prior test is the introduction of batch normalization on each of the hidden layers. Clearly, batch normalization has a positive effect on the generator's performance versus the base two-layer generator, and the model seems to learn how to create numeric-like figures, but they are nothing to write home about. We'll try one more test before utilizing the real data's moments, extending the network to four hidden layers.

Results from Two Hidden Layer GAN (with Batch Normalization)

Discriminator Loss vs Generator Loss (Two Hidden Layers/Batch Normalization)

Test #4: Four Hidden Layer Generator with Batch Normalization

The four hidden layer generator with batch normalization does not perform well. It seems to be getting better at producing numeric-like figures at 100 epochs, but again, we use the same number of epochs for each model to maintain consistency.

Result from Four Hidden Layer GAN (with Batch Normalization)

Discriminator Loss vs Generator Loss (Four Hidden Layers/Batch Normalization)

A Solution? Stabilizing with Moments from the Real Data’s Distribution

Indeed, GANs seem to be hard to train; GitHub user Soumith Chintala has compiled a collection of “hacks” from NIPS 2016 to help train them: https://github.com/soumith/ganhacks. However, there may be another way. If the goal of the generator is to reproduce the distribution of the real data, why not add a term to the loss function that penalizes the generator for collapsing or for not conforming to the real data's distribution? With TensorFlow's moments function (tf.nn.moments) we can simply measure the difference between the generator's mean and variance and the real data's mean and variance on a batch-by-batch basis for each feature. We can do so as follows:
# Per-feature mean and variance over the batch (axis 0) for the
# generated samples and the real inputs.
g_mean, g_var = tf.nn.moments(g_model, axes=[0])
d_mean, d_var = tf.nn.moments(input_real, axes=[0])

# L1 distance between the two sets of moments, scaled down by 0.1.
mean_diff = 0.1 * tf.reduce_sum(tf.abs(g_mean - d_mean))
std_diff = 0.1 * tf.reduce_sum(tf.abs(g_var - d_var))

We scale by 0.1 to keep mean_diff and std_diff comparable to the generator loss; we don't want these terms to be so much larger than the generator loss that the model “ignores” it.

The generator loss goes from: 
g_loss = tf.reduce_mean(tf.nn.sigmoid_cross_entropy_with_logits(logits=d_logits_fake, labels=tf.ones_like(d_logits_fake))) 

to:
g_loss = tf.reduce_mean(tf.nn.sigmoid_cross_entropy_with_logits(logits=d_logits_fake, labels=tf.ones_like(d_logits_fake))) +  (std_diff + mean_diff)

Test #5: Two Hidden Layer Generator with Moment Stabilization

As we can see, the generator now produces much better results. It is clearly producing numeric-like figures, much crisper and clearer than with batch normalization.

Results from Two Hidden Layer GAN (with Moment Stabilization)

Discriminator Loss vs Generator Loss (Two Hidden Layers/Moment Stabilization)

Test #6: Four Hidden Layer Generator with Moment Stabilization

By stabilizing with the moments of the real data, we are now able to successfully train deep generator networks.

Results from Four Hidden Layer GAN (with Moment Stabilization)

Discriminator Loss vs Generator Loss (Four Hidden Layers/Moment Stabilization)

Test #7: Four Hidden Layer Generator with Moment Stabilization, Dropout & Batch Normalization

As a final test, we see that we can also reap the rewards of moment stabilization alongside other regularization methods. This model is trained with dropout in the first layer and with batch normalization.

Results from Four Hidden Layer GAN (with Moment Stabilization, Dropout & Batch Normalization)

Discriminator Loss vs Generator Loss (Four Hidden Layers/Moment Stabilization/Dropout/Batch Normalization)

Benefits and Concerns (initial):
1. The model seems to learn faster and more effectively with "moment stabilization".
2. How will the method transfer to datasets that are clustered and don't lend themselves well to being described by a single mean and variance?
3. The method adds complexity, since the mean and variance are calculated for each dimension of the dataset over each batch. This doesn't pose much of an issue for datasets such as MNIST with only 784 features, but what about datasets with 10,000 or 1,000,000 or more features?



Next Steps:
1.  Try the method on another dataset
2. Try the method with convolution
