Wednesday, June 6, 2018

Super-Resolution SRCNN Tutorial in TensorFlow Part 1

SUPER-RESOLUTION SRCNN 

TensorFlow Tutorial: Part 1


This is the first entry in a four-part series on the different ways you can use deep convolutional neural networks to upscale images, i.e. Super-Resolution. We use TensorFlow version 1.4 throughout this series.


I hate small images!

I do a lot of graphic design stuff in my free time and I'm always pulling images off the net and using them in my work. There's no worse feeling than finding the perfect image only to have its resolution be too small to be of any use. Have you ever tried to use Photoshop, GIMP, or another image editor to resize an image and make it larger? If so, you know firsthand the disappointment that comes with trying to upscale an image. Whether it's Bicubic interpolation, Spline interpolation, or Lanczos resampling, no matter how fancy the upscaling method sounds, the image still comes out blurry and filled with artifacts, noise, and/or serrated edges. Fed up one day, I decided that I would scour the internet until I found a solution. No matter how long it took, I was determined to find a better way to upscale images. Well, it only took 0.36 seconds. The first Google search result was for a website called Let's Enhance, a free online image upscaling and enhancement service. The results were amazing, and they do it by using deep convolutional neural networks (ConvNets).


SR-CNNs

Well, why didn't I think of that? Just feed a bunch of downscaled images into a neural network, use the original full-size images as the targets and voilà, Super-Resolution. I had recently created a database of shoe images for another project I'm working on, so I figured I would just take those images, downscale them, run them through a Deconvolutional Network (DeconvNet), and have crisp images again; it should be a piece of cake. I got OK results, but clearly I was missing something. Although my images were less blurry, since I was attempting to upsample a smaller image using deconvolution, my newly upscaled images were covered with checkerboard artifacts. Rather than try to figure it out myself, I decided to do another search online for articles on established methods of Super-Resolution. In the end, I found three successful ways to upscale images with ConvNets and used one of those methods as inspiration to come up with a fourth. This first part of the series covers what is probably the most popular method of Super-Resolution, or at least the one that usually comes up first when you do a Google search.

The first good article I found on my Super-Resolution quest was Learning a Deep Convolutional Network for Image Super-Resolution by Chao Dong, et al. Anyone interested in this topic should read the paper end to end; it's a very easy read for anyone who has a bit of exposure to neural networks, so I won't get into it much here. They take a rather shallow ConvNet of only three layers and a training set of only 91 images and produce stellar results. I didn't use their framework exactly (most significantly, they use the YCrCb color model and I just use RGB), but I borrowed heavily from it. Since I use TensorFlow and they implemented their network in MATLAB, I found a TensorFlow reproduction of their work at https://github.com/tegg89/SRCNN-Tensorflow and used it as inspiration for this code.


IMAGE PREP


Before I could create the network, however, I needed a dataset of images. Luckily, I'm working on another project with shoe images, so I already had a personal database of a few thousand shoes at my disposal. It may sound like a lot, but in reality it only took a few hours to download. If you wanted to do something similar you could put a dataset together fairly quickly, or just download a pre-made image dataset from the net. The next step is to make a set of low and high-resolution images with the same dimensions: this method of super-resolution takes the downscaled image and upscales it back to the original size before running it through the network. Downscaling the images and making their dimensions uniform is fairly simple using PIL. Take an image of any dimensions, put it on a square white background whose length and width match the largest side of the image, and then downscale all the images to the same size. I have a GTX 1060, so I decided on a size of 128x128 for my network. Depending on your setup, you could use bigger or smaller images. The code is as follows:
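(What follows is a minimal sketch of that prep step using PIL rather than the exact code from my repo; the function name, paths, and default size are placeholders.)

from PIL import Image

def prep_image(path, out_size=128):
    # Paste the image onto a square white canvas as large as its longest side,
    # then resize the whole thing down to out_size x out_size.
    img = Image.open(path).convert('RGB')
    w, h = img.size
    side = max(w, h)
    canvas = Image.new('RGB', (side, side), (255, 255, 255))
    canvas.paste(img, ((side - w) // 2, (side - h) // 2))
    return canvas.resize((out_size, out_size), Image.BICUBIC)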



We use the same function to prepare both the high and low-resolution images. To accomplish the downsampling, we scale the low-resolution images down by a factor of 2 (making them 64x64) and then upscale them back to 128x128 using Bicubic interpolation. Upscaling with Bicubic interpolation, or with another algorithm such as KNN, before feeding the images through the DeconvNet eliminates the checkerboard artifact issue. You could also keep the image small and just perform the resize in TensorFlow with tf.image.resize_images. I utilized that method with a GAN I'm working on and got good results, but I won't use the image-resize method in this series; I will cover it when I discuss GANs again.
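To be concrete, producing the blurry network input from a prepared 128x128 image with PIL looks something like this (a sketch with illustrative names, not the repo code):

def make_lowres(hr_img, factor=2):
    # Downscale by the given factor (128 -> 64), then Bicubic-upscale back to
    # the original size so the input and target have the same dimensions.
    w, h = hr_img.size
    small = hr_img.resize((w // factor, h // factor), Image.BICUBIC)
    return small.resize((w, h), Image.BICUBIC)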


DECONVNET ARCHITECTURE

Our inputs to the network will be the high and low-resolution images, each 128x128x3. We set up a seven-layer DeconvNet with 32 filters in each layer. I was really hoping to use PReLUs, but since their slopes are learned along with the rest of the model, I didn't have the processing power. I ended up choosing Leaky ReLUs as the activation function for all layers except the last one. The Leaky ReLU helps prevent the "dying ReLU" problem; a really nice, succinct explanation of dying ReLUs can be found here, A Practical Guide to ReLU. When a ReLU's input is negative its output is zero, and the more units that get stuck in that regime, the less effective your network will be, because that part of the network has basically been turned off.
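In TensorFlow 1.x a Leaky ReLU can be written in one line; the slope of 0.2 below is my assumption, so use whatever negative slope works for you:

import tensorflow as tf

def lrelu(x, alpha=0.2):
    # Pass positive values through unchanged; give negative values a small slope.
    return tf.maximum(alpha * x, x)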

Since the image size stays the same, the model uses a stride of 1, "same" padding, and no pooling. With a stride of 1, technically a "transposed convolution" is not really happening; it's just a convolution, since we're mapping the image to an output of the same size. However, coding it this way lets us more easily experiment with feeding the original downsampled 64x64 image into the network and actually performing a transposed convolution or resize to upsample it, so it makes the code more modular. The final activation function is Tanh, which has a range of -1 to 1; to keep the targets in line with this output, we transform them to the same range before feeding them into the network. Our loss function is about as simple as possible: it's just the mean-squared error. We compare our upsampled image pixel by pixel against the ground-truth original image; some articles refer to this as Pixel Loss. Below is a rough sketch of the model and loss in code, followed by a sample of my results on some test images after about 10 hours of training...
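This sketch mirrors the description above and reuses the lrelu helper from earlier; the 3x3 kernel size and the Adam optimizer are assumptions on my part, not necessarily the settings from the repo:

def srcnn(x):
    # Six hidden conv layers, 32 filters each, stride 1, "same" padding, Leaky ReLU.
    for i in range(6):
        x = tf.layers.conv2d(x, filters=32, kernel_size=3, strides=1,
                             padding='same', activation=lrelu, name='conv%d' % i)
    # The final layer maps back to 3 channels with Tanh, so the output lives in [-1, 1].
    return tf.layers.conv2d(x, filters=3, kernel_size=3, strides=1,
                            padding='same', activation=tf.nn.tanh, name='conv_out')

lr_images = tf.placeholder(tf.float32, [None, 128, 128, 3])   # bicubic-upscaled inputs
hr_images = tf.placeholder(tf.float32, [None, 128, 128, 3])   # ground-truth originals

# Rescale pixels from [0, 255] to [-1, 1] to match the Tanh output.
outputs = srcnn(lr_images / 127.5 - 1.0)
targets = hr_images / 127.5 - 1.0

pixel_loss = tf.reduce_mean(tf.square(outputs - targets))     # mean-squared error
train_op = tf.train.AdamOptimizer(1e-4).minimize(pixel_loss)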

RESULTS

I was very excited seeing these results. Some details were lost during the downsampling and can never be recovered (we'll discuss why as this series progresses), but overall the images look pretty good, much crisper than the Bicubic interpolation. Full results and the full implementation of this project can be found at https://github.com/ogreen8084/srcnns.

SRCNN PROS AND CONS

Pros: No checkerboard artifacts, relatively small network, less computationally intensive, easy to implement.

Cons: Loss Function? Is MSE the best way to compare images? 


Sunday, July 2, 2017

Generating Pokemon with a Generative Adversarial Network GAN in Tensorflow 1.1

The past three weeks or so, I've had an obsession: generating Pokemon with a Generative Adversarial Network (GAN), specifically a DCGAN. I had seen the video at https://www.youtube.com/watch?v=rs3aI7bACGc&t=34, but the original spark that gave me the itch was a post by https://github.com/joshnewnham on Udacity's Deep Learning Slack channel. After being able to replicate MNIST and going through Udacity's Deep Learning Foundations Nanodegree course, I thought, what the heck, this is going to be easy. I had been doing deep learning primarily on a 13-inch MBP 2014, using AWS and more recently FloydHub for models that were too large. But I accidentally left my AWS instance on and ate through $70 of a $100 credit over a weekend, and while FloydHub was promising, uploading large datasets was rather frustrating, so I decided to finally buy a gaming computer with an NVIDIA GPU in order to work on more serious deep learning projects on my own machine. I ended up getting this: https://www.amazon.com/gp/product/B01J99IL9E/ref=oh_aui_detailpage_o05_s00?ie=UTF8&psc=1. I got it for right under $900 (it has already gone up by about $270 since I bought it), so I wouldn't have been able to build it for much less, given that it came with Windows 10. I bought it because of the GTX 1060 6GB card. I debated getting a GTX 1070 8GB, but this is a good "starter machine." I wouldn't really recommend getting a card with less memory; maybe a 1050 Ti with 4GB if you were absolutely strapped for cash, but if you have the money, don't skimp. As for the "Cloud" vs. buying-your-own-machine debate, there's something about being able to jump on a machine and instantly start coding that motivates me much more, and I don't have to worry about leaving an instance on and getting a surprise bill.

Now, on to the quest of making Pokemon. I don't know what made me so cocky, but I erroneously assumed I would be able to basically convert the same model I created to generate MNIST digits and pump out images that could pass for Pokemon. The dataset I used came from https://veekun.com/dex/downloads; I used the "Global Link Artwork" set. It contains about 850 images spanning all generations of Pokemon (I can't really be certain; when Pokemon became big in the States I was a teenager, so I'm only vaguely familiar with them, and I was about to download Pokemon Go, but the fad came and went so quickly that I never got around to it).

Example Real Image

These images were 300x300x4 (RGBA, transparent backgrounds), but they were much too large for my GPU (maybe I should have gotten that 1070), so I initially decided to resize them to 60x60x4 with scipy.misc.imresize(). The generator started with 225 random numbers (15x15). From here I used tf.layers.dense() with tf.contrib.layers.xavier_initializer() and tf.constant_initializer(0.1). The first two layers were 1024 dimensions and 15x15x512 dimensions. From there, there were three layers of tf.layers.conv2d_transpose(); each layer used the aforementioned initializers, a stride of 2, and a kernel of 3, and they had 512, 384, and 4 channels respectively, taking the random numbers to a "generated image" of 60x60x4. The discriminator was basically the opposite setup, with convolutional layers of 384 and 512 channels; the data is then reshaped to 15x15x512 and flattened to one dimension to be converted to a sigmoid. I also used my "moment stabilization" method to try to keep the distributions similar. I ran this network for 500 epochs. These are my results (shown after the quick sketch below); try not to laugh:
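For reference, here's a rough tf.layers reconstruction of this first generator. One caveat: going from 15x15 to 60x60 only requires two doublings, so this sketch gives the first transpose layer a stride of 1 just to make the shapes line up, and the hidden-layer activations are also my assumptions; treat it as an approximation rather than the exact model:

import tensorflow as tf

def generator(z):
    # z: [batch, 225] random noise
    w_init = tf.contrib.layers.xavier_initializer()
    b_init = tf.constant_initializer(0.1)

    x = tf.layers.dense(z, 1024, activation=tf.nn.relu,
                        kernel_initializer=w_init, bias_initializer=b_init)
    x = tf.layers.dense(x, 15 * 15 * 512, activation=tf.nn.relu,
                        kernel_initializer=w_init, bias_initializer=b_init)
    x = tf.reshape(x, [-1, 15, 15, 512])

    x = tf.layers.conv2d_transpose(x, 512, 3, strides=1, padding='same', activation=tf.nn.relu,
                                   kernel_initializer=w_init, bias_initializer=b_init)   # 15x15x512
    x = tf.layers.conv2d_transpose(x, 384, 3, strides=2, padding='same', activation=tf.nn.relu,
                                   kernel_initializer=w_init, bias_initializer=b_init)   # 30x30x384
    x = tf.layers.conv2d_transpose(x, 4, 3, strides=2, padding='same', activation=tf.nn.tanh,
                                   kernel_initializer=w_init, bias_initializer=b_init)   # 60x60x4
    return x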

[Generated samples after 500 epochs]
These are my Pokeblobs!!!!! It's not entirely bad, but definitely nothing to be proud of. Now, how do we make it better? Well, the saying goes that 80% of data science is data preparation. A benefit of deep learning is the ability to model without having to do much or even any feature engineering, but that doesn't mean all data is just ready to run through a network. To get better results, we have to get our hands dirty. First of all, 850 images are not that many; moreover, a few of the images are repeats. The first step was manually looking through the dataset and deleting the duplicate images. This probably wasn't absolutely necessary because there weren't that many, but I did it anyway. What I noticed while looking at the images manually was that the figures seemed to randomly face either left or right, so I manually flipped the images to all face the same way. I could have written a function to flip an image if more values were on the left versus the right and lined them up that way, but it only took about 15 minutes to do it manually. Now came the important part: data augmentation to increase the number of images.

The first method of augmentation I tried was rotating images with scipy.ndimage.rotate(). Big mistake: with this method the images are no longer square but rhombuses, and it just created a ton of issues (i.e. the DCGAN displaying lines from the rotations). A better method is as follows: if the images are transparent (RGBA), lift them onto a larger white background (in this case 100x100), converting them to RGB in the process, to give room to rotate the image, then crop the image back to the size that will be fed into the model. An RGBA image can be composited onto a background like so:

https://stackoverflow.com/questions/2563822/how-do-you-composite-an-image-onto-another-image-with-pil-in-python (slightly altered, from RGBA to RGB)
from PIL import Image

img = Image.open('/pathto/file', 'r')            # RGBA image with a transparent background
img_w, img_h = img.size
background = Image.new('RGB', (100, 100), (255, 255, 255))   # larger white canvas
bg_w, bg_h = background.size
offset = ((bg_w - img_w) // 2, (bg_h - img_h) // 2)          # integer offsets for paste()
background.paste(img, offset, mask=img)          # use the alpha channel so transparency becomes white
background.save('out.png')
From this point, I made altered copies with cv2.GaussianBlur() and cv2.medianBlur(), and then I rotated all of these images as follows (this example shows a 3-degree rotation):

import cv2
import scipy.misc

rows, cols = pokemon[i].shape[:2]                          # 100x100 images from the previous step
M1 = cv2.getRotationMatrix2D((cols / 2, rows / 2), 3, 1)   # rotate 3 degrees about the center
dst1 = cv2.warpAffine(pokemon[i], M1, (cols, rows))
save_path1 = path + '3' + str(i) + '.png'
scipy.misc.imsave(save_path1, dst1[14:86, 4:76])           # crop back down to 72x72
    

I rotated the images 3, 5, and 7 degrees and kept the unrotated images. After my deletions and data augmentation, I ended up with 17,676 images. I started with 100 random numbers and changed the image dimensions to 72x72x3; most models I see online use 64x64x3 to easily build a deeper network up from 4x4, but I was determined not to just plug and play someone else's model. To obtain better results, I decided to code the generator from scratch. Doing so allowed me to use just a linear transformation as the first layer and build a generator network that was solely conv2d_transpose layers. It's fairly easy, but there's one caveat: when you use a stride of two or more, conv2d_transpose is not invertible, so you have to compensate for the change in height and width when specifying the output shape (see https://github.com/tensorflow/tensorflow/issues/2118); when you use tf.layers, this is done for you. Just know that output = (input + stride - 1) // stride for a "SAME" convolution, so you have to adjust the conv2d_transpose operation to match these outputs or the network will keep throwing errors. The corresponding forward convolutions go from 72x72 to 36x36 to 18x18 to 9x9 to 5x5 (using the math above), but intuition would suggest that the conv2d_transpose path would go from 9x9 (resized from flat) to 9x9 to 18x18 to 36x36 to 72x72, so the "flat resize" actually has to match the 5x5 output (not 9x9). This was the most frustrating part of the project, so hopefully this saves at least one person hours of pulling their hair out. Also, I couldn't get tf.layers.batch_normalization() to work with tf.nn.conv2d_transpose(), so I used tf.nn.moments() to get the mean and variance and fed them into tf.nn.batch_normalization(). After making these changes, I got markedly better images, but I realized that I needed to beef up the discriminator, and in doing so I got the results shown after the sketches below.
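Two quick sketches of the gotchas above. The first is the SAME-padding shape rule, which you can use to sanity-check the output_shape you hand to tf.nn.conv2d_transpose; the second is batch normalization built from tf.nn.moments() and tf.nn.batch_normalization() in its simplest form (no learned offset or scale), so treat it as an approximation of what you would actually train:

def same_out(size, stride):
    # output = (input + stride - 1) // stride for "SAME" padding
    return (size + stride - 1) // stride

sizes = [72]
for stride in [2, 2, 2, 2]:
    sizes.append(same_out(sizes[-1], stride))
print(sizes)   # [72, 36, 18, 9, 5] -- so the "flat resize" must be 5x5, not 9x9

import tensorflow as tf

def batch_norm(x, epsilon=1e-5):
    # Per-channel mean and variance over the batch and spatial dimensions.
    mean, variance = tf.nn.moments(x, axes=[0, 1, 2])
    return tf.nn.batch_normalization(x, mean, variance,
                                     offset=None, scale=None,
                                     variance_epsilon=epsilon)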
[Second round of generated samples]
I'm pretty happy with these results. The model is able to pick up on the outline around the figures, and it also seems to be creating arms, legs, and heads. There are a few more changes I want to try, such as a Wasserstein GAN (the discriminator becomes a "critic," and it is supposed to be more stable than a regular GAN). The model takes about 16-20 hours to train, so I've been running it, going to work, and then checking or tweaking it when I get home. I will probably break down and do the 4x4, 8x8, 16x16, 32x32, 64x64 generator just once to see if that improves the images; the only thing I'm really disappointed about is the green hue on many of the generated images. I know there's a reason why that setup is so popular, but it doesn't hurt to experiment. I will update this post if I get improved results. The full code implementation can be found here: https://github.com/ogreen8084/pokemon_gan. Overall, this was a great project. I implemented (and shelved) average pooling, created a DCGAN in TensorFlow with batch normalization without using the higher-level tf.layers library, and figured out how to go straight to conv2d_transpose without needing fully-connected layers by using a linear transformation. I honestly would not have worked on this (and I would have spent way too much money) if I was using "the Cloud," so I'm very happy with my investment in a workstation with an NVIDIA GPU. If you're on the fence about it, I can definitely tell you: it's worth it!

Until next time, Oscar. 

Update:
I did end up doing the 4x4... 64x64 setup and these are the results...  not a noticeable difference in quality, but I think the Pokemon look cooler, so I uploaded it!




Sunday, June 4, 2017

Reinforcement Learning Q-Learning vs SARSA explanation, by example and code



I've been studying reinforcement learning over the past several weeks. David Silver has an excellent course on YouTube that introduces many of the major topics of the field; it can be found at https://www.youtube.com/channel/UCP7jMXSY2xbc3KCAE0MHQ-A/videos. There are several different algorithms for learning Markov Decision Processes (MDPs); however, many of them are extremely similar, and it can be very hard to get a grasp on the differences between them. Here we differentiate between two methods that are extremely similar: Q-Learning and SARSA. An easy way to sum up their differences is to think of Q-Learning as an optimist, always looking at the world through rose-colored glasses: it forms a policy based on the best possible actions, regardless of whether those actions are actually taken. SARSA is a more measured approach that forms a policy based on the actual actions taken. The policy is basically a set of rules that govern how an agent should behave in an environment. Q-Learning and SARSA are both methods to obtain the "optimal policy," the set of rules that maximizes the future value received from an environment.

Q-Learning
Q-Learning is considered an "off-policy" algorithm; it doesn't have to update from an action actually experienced by the agent. Q-Learning adjusts its action-value function by updating it with the value it would receive from taking the optimal action in the next state. This value is used to update the action-value function of the current state/action pair, and the update occurs regardless of whether the optimal action is actually taken in the next state.
Q[s][a] = Q[s][a] + alpha * (r + GAMMA * MAX(Q[s'][a']) - Q[s][a])


SARSA
SARSA, on the other hand, is an "on-policy" algorithm. With an on-policy algorithm, the action-value function of a state is updated with the value of the action-value function of the actual action taken in the next state.
Q[s][a] = Q[s][a] + alpha * (r + GAMMA * Q[s'][a'] - Q[s][a])


The Key Difference
Q-Learning updates with the value of the max (optimal) next action; SARSA updates with the value of the actual next action.


A simple example
The next state (s') is a fork in the road. We have two moves, left or right. The next state is not deterministic: 75% of the time you will end up on the right, and 25% of the time you will end up on the left. (See Figure 1.)

[Figure 1: the fork-in-the-road example]
p(right) = 0.75, r(right) = 1, therefore: Q[s'][right] = 1
p(left) = 0.25, r(left) = -1, therefore: Q[s'][left] = -1

Q-Learning will update Q[s][a] with MAX(Q[s'][a']) = Q[s'][right] = 1.
SARSA will update Q[s][a] with Q[s'][a'] = Q[s'][right] = 1 75% of the time and with Q[s'][a'] = Q[s'][left] = -1 the other 25% of the time. In expectation, SARSA's bootstrap value is 0.75(1) + 0.25(-1) = 0.5, while Q-Learning's is always 1.

Code Example (Python)
Dependencies: Python 3.5.1, Numpy 

GridWorld
We set up a 4x4 GridWorld environment, but only 14 states are reachable. 
self.invalid_states = {(2, 1): ' X  ', (1, 2): '  X '}

There are two reward states:
self.rewards = {(3,3): 1, (2,3): -1 }

Possible actions are Up, Down, Left, and Right. If the agent attempts a move that does not result in reaching another state, it will "bounce" against the wall and remain in the same state. There is a step cost of -0.05 for each step. To ensure exploration of all states, the game employs an exploration/exploitation scheme that decays over time: as the policy becomes more set, the need for exploration decreases.
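Here is a hedged sketch of the two helpers that appear in the update snippets below; the names match the snippets, but the bodies are illustrative rather than the exact code from the repo. The epsilon passed in is 0.5/t, which is how exploration decays over time:

import numpy as np

ACTIONS = ('U', 'D', 'L', 'R')

def best_action(action_values):
    # action_values is a dict of {action: value}; return the greedy action and its value.
    action = max(action_values, key=action_values.get)
    return action, action_values[action]

def random_action(greedy_action, eps):
    # Epsilon-greedy: explore with probability eps, otherwise exploit.
    if np.random.random() < eps:
        return np.random.choice(ACTIONS)
    return greedy_action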
To make the two algorithms directly comparable, we assume that the two methods take the same action at every step. It is possible that the "max action," and thus the actions, could diverge because the two methods can update differently at the beginning. However, since each action is chosen at random, we are free to make this assumption.

The difference in code?
For both algorithms, we return the action and max value from the next state. Q-Learning will use this max value to update its action-value function. It is considered “off-policy” because this update does not depend on the action that actually occurs in the next state. 
The Q-Learning update is as follows:
max_action, max_st_val = best_action(Q_learn[new_state])
Q_learn[state][action] = Q_learn[state][action] + alpha*(reward + GAMMA * max_st_val - Q_learn[state][action])

SARSA on the other hand, is updated with the action-value function of the action taken in the next state according to the policy, thus it is “on-policy”. This action is randomly chosen from possible actions of the next state. 
The SARSA update is as follows:
next_action = random_action(max_action, 0.5/t)
next_st_val = Q_sarsa[new_state][next_action]
Q_sarsa[state][action] = Q_sarsa[state][action] + alpha*(reward + GAMMA * next_st_val - Q_sarsa[state][action])

So that's the difference between the two algorithms in code: with Q-Learning the update uses the maximum-valued action at the next state, while with SARSA the update depends on the action that is actually taken at the next state. As a result, Q-Learning's updates are generally larger than SARSA's.

Conclusion
Does this mean that Q-Learning is superior to SARSA? Not so fast. According to Poole, SARSA is useful when you want to optimize the value of an agent that is exploring. In some situations exploration can be dangerous; the example Poole uses is a robot agent going near the top of a flight of stairs. SARSA may discover this danger and adopt a policy that keeps the robot away, whereas Q-Learning would not (Poole, 2010). So when choosing between the two, it may be better to utilize SARSA when you need to take risk into account. If you have an optimistic view, or risk is not a concern, it may be better to utilize Q-Learning.


Poole, D. (2010). Artificial Intelligence: Foundations of Computational Agents, On-Policy Learning. Retrieved from http://artint.info/html/ArtInt_268.html
