Now, on to the quest of making Pokemon. I don't know what made me so cocky, but I erroneously assumed I could basically convert the model I created to generate MNIST digits and pump out images that could pass for Pokemon. The dataset came from https://veekun.com/dex/downloads; I used the "Global Link Artwork" set. It contains about 850 images covering, I think, all generations of Pokemon. I can't really be certain: I was a teenager when Pokemon became big in the States, so I'm only vaguely familiar with them. I was about to download Pokemon Go, but the fad came and went so quickly that I never got around to it.
Example Real Image
These images were 300 x 300 x 4 (RGBA, transparent backgrounds), which was much too large for my GPU (maybe I should have gotten that 1070), so I initially decided to resize them to 60 x 60 x 4 with scipy.misc.imresize(). The generator started with 225 random numbers (15 x 15). From there I used tf.layers.dense() with tf.contrib.layers.xavier_initializer() and tf.constant_initializer(0.1). The first two layers were 1024 dimensions and 15 x 15 x 512 dimensions. After that came 3 layers of tf.layers.conv2d_transpose(). Each layer used the aforementioned initializers, a stride of 2 and a kernel of 3, and they had 512, 384 and 4 channels respectively, taking the random numbers to a "generated image" of 60 x 60 x 4 dimensions. The discriminator was basically the opposite setup, using tf.layers.conv2d() layers of 384 and 512 channels; the data is then reshaped to 15 x 15 x 512, flattened to one dimension and passed through a sigmoid. I also applied my "moment stabilization" method to try to keep the distributions similar. I ran this network for 500 epochs. These are my results, try not to laugh:
These are my Pokeblobs!!!!! They're not entirely bad, but definitely nothing to be proud of. So how do we make them better? Well, the saying goes that 80% of data science is data preparation. A benefit of Deep Learning is the ability to model without having to do much, or even any, feature engineering, but that doesn't mean all data is ready to run through a network. To get better results, we have to get our hands dirty. First of all, 850 images are not that many; moreover, a few of the images are repeats. The first step was manually looking through the dataset and deleting the duplicate images. This probably wasn't absolutely necessary because there weren't that many, but I did it anyway. While looking through the images, I noticed that the figures seemed to face left or right at random, so I manually flipped the images so they all face the same way. I could have written a function to flip an image if more values were on the left versus the right and lined them up that way, but it only took about 15 minutes to do it manually. Now came the important part: data augmentation to increase the number of images.
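For anyone who would rather not flip by hand, here is a minimal sketch of the automatic flip mentioned above. It is not what was actually run on the dataset. It assumes each image is an H x W x C uint8 NumPy array on a white background, and it treats "more values on one side" as "more non-background pixel mass on that side", which is only a rough stand-in for the direction a figure is facing. The function name face_same_way and the background parameter are hypothetical.

```python
import numpy as np


def face_same_way(image, background=255):
    """Mirror the image left-to-right if its right half is 'heavier'.

    Assumption: `image` is an H x W x C uint8 array whose empty areas
    are filled with the `background` value (white by default).
    """
    # Per-pixel distance from the background colour, summed over channels.
    ink = np.abs(image.astype(np.int32) - background).sum(axis=2)

    # Compare the two halves; the middle column is ignored for odd widths.
    width = ink.shape[1]
    half = width // 2
    left_mass = ink[:, :half].sum()
    right_mass = ink[:, width - half:].sum()

    # Flip horizontally so the heavier half always ends up on the left.
    return image[:, ::-1] if right_mass > left_mass else image


# Tiny demo: a white 4x4 RGB image with a dark blob on its right edge.
demo = np.full((4, 4, 3), 255, dtype=np.uint8)
demo[1:3, 3] = 0
flipped = face_same_way(demo)
assert (flipped[1:3, 0] == 0).all()  # the blob is now on the left edge
```

A heuristic like this will get some figures wrong (a long tail can outweigh a head), so a quick visual check of the output would still be worthwhile.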
The first method of augmentation I tried was rotating images with scipy.ndimage.rotate(). Big mistake: with this method the images are no longer square, but rhombuses, and it created a ton of issues (e.g. the DCGAN displaying lines from the rotations). A good method is as follows: if the images are transparent (RGBA), lift the images and put them on a larger white background (in this case 100x100), converting them to RGB in the process, to give room to rotate the image; then crop the image back to the size that will be fed into the model. An RGBA image can be given a background like so:
https://stackoverflow.com/questions/2563822/how-do-you-composite-an-image-onto-another-image-with-pil-in-python (slightly altered, from RGBA to RGB)
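from PIL import Image

# Open the source image and make sure it has an alpha channel
img = Image.open('/pathto/file', 'r').convert('RGBA')
img_w, img_h = img.size
# White RGB canvas, larger than the image so there is room to rotate
background = Image.new('RGB', (100, 100), (255, 255, 255))
bg_w, bg_h = background.size
# Integer offsets that center the image (paste() needs ints in Python 3)
offset = ((bg_w - img_w) // 2, (bg_h - img_h) // 2)
# Use the alpha channel as the mask so transparent pixels stay white
background.paste(img, offset, mask=img)
background.save('out.png')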
From this point, I made altered copies with cv2.GaussianBlur() and cv2.medianBlur(), and then I rotated all of these images as follows (this example shows a 3-degree rotation):
M1 = cv2.getRotationMatrix2D((cols/2,rows/2),3,1)
dst1 = cv2.warpAffine(pokemon[i],M1,(cols,rows))
save_path1 = path + '3' + str(i) + '.png'
scipy.misc.imsave(save_path1, dst1[14:86, 4:76])
I rotated the images 3, 5, and 7 degrees and kept the unrotated images. After my deletions and data augmentation, I ended up with 17676 images. I started with 100 random numbers and changed the image dimensions to 72x72x3. Most models I see online use 64x64x3 to easily build a deeper network up from 4x4, but I was determined not to just plug and play someone else's model.
To obtain better results, I decided to code the generator from scratch. Doing so allowed me to use a linear transformation as the first layer and build a generator network that was solely conv2d_transpose layers. It's fairly easy, but there's one caveat: when you use a stride of two or more, conv2d_transpose is not invertible, and as a result you have to compensate for the change in height and width by adjusting the channels (see: https://github.com/tensorflow/tensorflow/issues/2118). When you use tf.layers, this is done for you. Just know that output = (input + stride - 1) // stride for "SAME" convolution, so you have to adjust the conv2d_transpose operation to match these outputs or the network will keep throwing errors. The backward direction is basically a convolution, and it goes from 72x72 to 36x36 to 18x18 to 9x9 to 5x5 (using the math above). Intuition would suggest that the conv2d_transpose layers could go from 9x9 (resized from flat) to 9x9 to 18x18 to 36x36 to 72x72, but the "flat-resize" has to match the 5x5 output (not 9x9). This was the most frustrating part of this project, so hopefully this saves at least one person hours of pulling their hair out.
Also, I was not able to combine tf.layers.batch_normalization() with tf.nn.conv2d_transpose(), so I had to use tf.nn.moments() to get the mean and variance and feed them into tf.nn.batch_normalization(). After making these changes, I got markedly better images, but I realized that I needed to beef up the discriminator, and in doing so I got these results:
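As a side note on the size arithmetic above, the "SAME" formula can be checked in a few lines of plain Python. This is only a sketch of the bookkeeping, not code from the actual model. The helper names same_conv_output_size, downsampling_chain and valid_transpose_outputs are made up for illustration. The last helper lists every output size that would convolve back down to a given input size, which is why the output shape of a strided conv2d_transpose has to be stated explicitly.

```python
def same_conv_output_size(size, stride):
    """Height/width after a strided convolution with "SAME" padding."""
    return (size + stride - 1) // stride


def downsampling_chain(size, stride, n_layers):
    """Sizes seen while applying n_layers strided "SAME" convolutions."""
    sizes = [size]
    for _ in range(n_layers):
        sizes.append(same_conv_output_size(sizes[-1], stride))
    return sizes


def valid_transpose_outputs(in_size, stride):
    """All output sizes o with same_conv_output_size(o, stride) == in_size."""
    return list(range((in_size - 1) * stride + 1, in_size * stride + 1))


# Walking down from the 72x72 image with four stride-2 layers.
chain = downsampling_chain(72, 2, 4)
print(chain)  # [72, 36, 18, 9, 5]

# Going back up from 5x5, both 9 and 10 are consistent with stride 2,
# so the transpose layer must be told to produce 9.
print(valid_transpose_outputs(5, 2))  # [9, 10]
```

Read in reverse, the chain gives the sizes the generator has to hit on the way up: 5, 9, 18, 36, 72.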
I'm pretty happy with these results. The model is able to pick up on the outline around the figures. It also seems to be creating arms, legs, and heads. The only thing I'm really disappointed about is the green hue on many of the generated images. There are a few more changes I want to try, such as a Wasserstein GAN (the discriminator becomes a "critic" and it is supposed to be more stable than a regular GAN). The model takes about 16-20 hours to train, so I've been running it, going to work, and then checking or tweaking it when I get home. I will probably break down and do the 4x4, 8x8, 16x16, 32x32, 64x64 generator just once to see if that improves the images. I know there's a reason why this setup is so popular, but it doesn't hurt to experiment. I will update this post if I get improved results. The full code implementation can be found here: https://github.com/ogreen8084/pokemon_gan
Overall, this was a great project. I implemented and shelved average pooling, created a DCGAN in TensorFlow with batch normalization without using the higher-level library tf.layers, and figured out how to go straight to conv2d_transpose without needing fully-connected layers by using a linear transformation. I honestly would not have worked on this (and I would have spent way too much money) if I were using "the Cloud", so I'm very happy with my investment in a workstation with an NVIDIA GPU. If you're on the fence about it, I can definitely tell you, it's worth it!
Until next time, Oscar.
Update:
I did end up doing the 4x4... 64x64 setup and these are the results... not a noticeable difference in quality, but I think the Pokemon look cooler, so I uploaded it!