Sunday, June 4, 2017

Reinforcement Learning Q-Learning vs SARSA explanation, by example and code



I’ve been studying reinforcement learning over the past several weeks. David Silver has an excellent course on YouTube that introduces many of the major topics of the field. The course can be found at: https://www.youtube.com/channel/UCP7jMXSY2xbc3KCAE0MHQ-A/videos. There are several different algorithms for learning Markov Decision Processes (MDPs); however, many of them are extremely similar, and it can be very hard to get a grasp on the differences between these methods. We shall differentiate between two methods that are extremely similar: Q-Learning and SARSA. An easy way to sum up their differences is to think of Q-Learning as an optimist, always looking at the world through rose-colored glasses: it forms its policy based on the best possible actions, regardless of whether those actions actually take place. SARSA, however, is a more measured approach that forms its policy based on the actions actually taken. The policy is basically the set of rules that governs how an agent should behave in an environment. Q-Learning and SARSA are both methods for obtaining the “optimal policy”, the set of rules that maximizes the future value received from an environment.

Q-Learning
Q-Learning is considered an “off-policy” algorithm; it doesn’t have to update from an action actually experienced by the agent. Instead, Q-Learning adjusts its action-value function using the value it would receive from taking the optimal action in the next state. This value is used to update the action-value function of the current state/action pair, and the update occurs regardless of whether the optimal action is actually taken in the next state.
Q[s][a] = Q[s][a] + alpha * (r + GAMMA * MAX(Q[s'][a']) - Q[s][a])


SARSA
SARSA, on the other hand, is an “on-policy” algorithm. With an on-policy algorithm, the action-value function of a state is updated with the action-value of the action actually taken in the next state.
Q[s][a] = Q[s][a] + alpha * (r + GAMMA * Q[s'][a'] - Q[s][a])


The Key Difference
Q-Learning updates with the value of the max (optimal) next action; SARSA updates with the value of the actual next action.
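
To make this concrete, here is a minimal sketch of the two update rules in Python. It assumes the action-value table Q is a dict of dicts mapping state -> action -> value; the function names and arguments are illustrative, not the exact code shown later in this post.

def q_learning_update(Q, s, a, r, s_next, alpha, gamma):
    # Off-policy: bootstrap from the best possible action in the next state,
    # whether or not the agent actually takes it.
    best_next_value = max(Q[s_next].values())
    Q[s][a] = Q[s][a] + alpha * (r + gamma * best_next_value - Q[s][a])

def sarsa_update(Q, s, a, r, s_next, a_next, alpha, gamma):
    # On-policy: bootstrap from the action the agent actually takes next.
    Q[s][a] = Q[s][a] + alpha * (r + gamma * Q[s_next][a_next] - Q[s][a])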


A simple example
The next state (s') is a fork in the road. We have two moves, left or right. The next state is not deterministic: 75% of the time you will end up on the right, and 25% of the time you will end up on the left (see Figure 1).

[Figure 1: the fork in the road at state s', with a right branch (p = 0.75, r = 1) and a left branch (p = 0.25, r = -1)]

p(right) = 0.75, r(right) = 1, therefore: Q[s'][right] = 1
p(left) = 0.25, r(left) = -1, therefore: Q[s'][left] = -1

Q-Learning will update Q[s][a] with MAX(Q[s'][a']) = Q[s'][right] = 1.
SARSA will update Q[s][a] with Q[s'][a'] = Q[s'][right] = 1 75% of the time, and with Q[s'][a'] = Q[s'][left] = -1 the other 25% of the time.
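
Averaged over many visits to this fork, the two bootstrap targets therefore differ: Q-Learning always uses 1, while SARSA's target works out to 0.5 in expectation. A quick sketch of the arithmetic:

# Q-Learning target at the fork: always the value of the max next action.
q_learning_target = max(1, -1)                   # = 1

# SARSA target at the fork: the value of whichever branch is actually taken,
# which averages out over the 75/25 split.
expected_sarsa_target = 0.75 * 1 + 0.25 * (-1)   # = 0.5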

Code Example (Python)
Dependencies: Python 3.5.1, NumPy

GridWorld
We set up a 4x4 GridWorld environment in which only 14 of the 16 states are reachable; the other two are blocked:
self.invalid_states = {(2, 1): ' X  ', (1, 2): '  X '}

There are two reward states:
self.rewards = {(3,3): 1, (2,3): -1 }

Possible actions are Up, Down, Left, and Right. If the agent attempts a move that does not lead to another state, it will “bounce” against the wall and remain in the same state. There is a step cost of -0.05 for each step. To ensure exploration of all states, the game employs an exploration/exploitation model that decays over time: as the policy becomes more settled, the need for exploration decreases.
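
The code below calls a helper, random_action(max_action, 0.5/t), to implement this decaying exploration. A minimal sketch of what such an epsilon-greedy helper could look like (the action labels and exact signature here are assumptions for illustration):

import numpy as np

ACTIONS = ('U', 'D', 'L', 'R')  # assumed labels for Up, Down, Left, Right

def random_action(max_action, eps):
    # Epsilon-greedy selection: usually return the greedy (max) action, but
    # with probability eps return a uniformly random action instead.
    # Calling this with eps = 0.5 / t makes exploration decay over time.
    if np.random.random() < eps:
        return np.random.choice(ACTIONS)
    return max_action
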
To make the two algorithms directly comparable, we assume that the two methods take the same action at every step. It is possible that the “max action”, and thus the chosen actions, could diverge, because the two methods may update differently at the beginning. However, since each action has a random component, we are free to make this assumption.

The difference in code?
For both algorithms, we return the best action and the max value from the next state. Q-Learning uses this max value to update its action-value function. It is considered “off-policy” because this update does not depend on the action that actually occurs in the next state.
The Q-Learning update is as follows:
max_action, max_st_val = best_action(Q_learn[new_state])
Q_learn[state][action] = Q_learn[state][action] + alpha*(reward + GAMMA * max_st_val - Q_learn[state][action])
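
Here best_action is a small helper that presumably scans a state's action values and returns the argmax action together with its value. A minimal sketch, assuming each per-state entry is a dict mapping action to value:

def best_action(action_values):
    # Return the (action, value) pair with the highest value for one state.
    # Ties are broken arbitrarily by max().
    best_a = max(action_values, key=action_values.get)
    return best_a, action_values[best_a]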

SARSA, on the other hand, is updated with the action-value of the action taken in the next state according to the policy; thus it is “on-policy”. This action is selected from the next state's possible actions using the same decaying epsilon-greedy rule the agent uses to act.
The SARSA update is as follows:
next_action = random_action(max_action, 0.5/t)
next_st_val = Q_sarsa[new_state][next_action]
Q_sarsa[state][action] = Q_sarsa[state][action] + alpha*(reward + GAMMA * next_st_val - Q_sarsa[state][action])

So that’s the difference between the two algorithms in code: with Q-Learning the update uses the maximum-valued action at the next state, while with SARSA the update depends on the action that is actually taken at the next state. Because the max is at least as large as the value of any particular action, Q-Learning’s update targets are never smaller than SARSA’s for the same transition, and are typically larger.
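
Putting the pieces together, a comparison loop along these lines can drive both updates from the same experienced transitions. The environment interface (env.reset, env.step, env.is_terminal), the table initialization, and the hyper-parameter values below are illustrative assumptions rather than the exact code behind this post; best_action and random_action are the helpers sketched above.

from collections import defaultdict

GAMMA = 0.9    # discount factor (assumed value)
ALPHA = 0.1    # learning rate (assumed value)
ACTIONS = ('U', 'D', 'L', 'R')

# Both learners keep a table of action values initialised to 0.0, so
# bootstrapping from a terminal state contributes nothing.
Q_learn = defaultdict(lambda: {a: 0.0 for a in ACTIONS})
Q_sarsa = defaultdict(lambda: {a: 0.0 for a in ACTIONS})

for t in range(1, 5001):                              # t also drives the 0.5 / t decay
    state = env.reset()                               # assumed GridWorld interface
    while not env.is_terminal(state):
        # Per the shared-action assumption, one action is chosen and both
        # tables are updated from the same transition.
        greedy, _ = best_action(Q_learn[state])
        action = random_action(greedy, 0.5 / t)
        new_state, reward = env.step(state, action)   # reward includes the -0.05 step cost

        # Q-Learning: update with the max next action value (off-policy).
        max_action, max_st_val = best_action(Q_learn[new_state])
        Q_learn[state][action] += ALPHA * (reward + GAMMA * max_st_val - Q_learn[state][action])

        # SARSA: update with the value of an action chosen by the same
        # epsilon-greedy rule in the next state (on-policy).
        next_action = random_action(max_action, 0.5 / t)
        next_st_val = Q_sarsa[new_state][next_action]
        Q_sarsa[state][action] += ALPHA * (reward + GAMMA * next_st_val - Q_sarsa[state][action])

        state = new_state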

Conclusion
Does this mean that Q-Learning is superior to SARSA? Not so fast. According to Poole, SARSA is useful when you want to optimize the value of an agent that is exploring. In some situations exploration can be dangerous. The example Poole uses is a robot agent going near the top of a flight of stairs: SARSA may discover this danger and adopt a policy that keeps the robot away, whereas Q-Learning would not (Poole, 2010). So, when choosing between the two, it may be better to use SARSA when you need to take risk into account. If you have an optimistic view, or risk is not a concern, it may be better to use Q-Learning.


Poole, D. (2010). Artificial Intelligence: Foundations of Computational Agents, On-Policy Learning. Retrieved from: http://artint.info/html/ArtInt_268.html
