Difference between revisions of "Deep Q-Learning"

From Quantitative Analysis Software Courses
Jump to navigation Jump to search
 
(5 intermediate revisions by the same user not shown)
Line 30: Line 30:
 
<ol>
 
<ol>
 
<li>Be in state <code>s</code></li>
 
<li>Be in state <code>s</code></li>
<li>With probability <code>p</code>, choose a random action <code>a</code> from the action space (explore). Otherwise, choose action <code>a = argmax_a(Q(s,a))</code> (exploit).</li>
+
<li>With probability <code>p</code>, choose a random action <code>a</code> from the action space (explore). Otherwise, choose action <code>a = argmax<sub>a</sub>(Q(s,a))</code> (exploit).</li>
 
<li>Take action <code>a</code> and witness the reward <code>r</code> and the new state you are in <code>s'</code>.</li>
 
<li>Take action <code>a</code> and witness the reward <code>r</code> and the new state you are in <code>s'</code>.</li>
 
<li>This unit of training has yielded an experience tuple - <code><s, a, s', r></code>. Using this experience tuple, perform value iteration with Bellman's equation.  
 
<li>This unit of training has yielded an experience tuple - <code><s, a, s', r></code>. Using this experience tuple, perform value iteration with Bellman's equation.  
Line 58: Line 58:
 
<ol>
 
<ol>
 
<li>Be in state <code>s</code></li>
 
<li>Be in state <code>s</code></li>
<li>With probability <code>p</code>, choose a random action <code>a</code> from the action space (explore). Otherwise, choose action <code>a = argmax_a(Q(s,a))</code> (exploit).</li>
+
<li>With probability <code>p</code>, choose a random action <code>a</code> from the action space (explore). Otherwise, choose action <code>a = argmax<sub>a</sub>(Q(s,a))</code> (exploit).</li>
 
<li>Take action <code>a</code> and witness the reward <code>r</code> and the new state you are in <code>s'</code>.</li>
 
<li>Take action <code>a</code> and witness the reward <code>r</code> and the new state you are in <code>s'</code>.</li>
 
<li style="color:navy">This unit of training has yielded an experience tuple - <code><s, a, s', r></code>. Add this experience tuple to the bucket. If your bucket overflows its size, remove the oldest entry.</li>
 
<li style="color:navy">This unit of training has yielded an experience tuple - <code><s, a, s', r></code>. Add this experience tuple to the bucket. If your bucket overflows its size, remove the oldest entry.</li>
<li style="color:navy">Experience replay step. If you have more than <code>BATCH_SIZE</code> (usually 32) entries in the bucket, sample <code>BATCH_SIZE</code> experience tuples from the bucket without replacement. For each sampled tuple <code><s, a, s', r></code>, generate a new training instance with x-value <code>s</code>. The y-value of each instance is where things get a little complicated, so pay attention. Let the variable <code>P = Q(s)</code>, all current Q-value predictions for all actions. If the <code>a</code> from our experience tuple is the i<sup>th</sup> action in the neural net, then we updated <b>only the Q-value corresponding to <code>a</code></b> with <code>P[i] = r + gamma*max(Q(s'))</code>. The y-value of each sample is set to its respective <code>P</code>, and the model is updated with a training session on just these samples.</li>
+
<li style="color:navy">Experience replay step. If you have more than <code>BATCH_SIZE</code> (usually 32) entries in the bucket, sample <code>BATCH_SIZE</code> experience tuples from the bucket without replacement. For each sampled tuple <code><s, a, s', r></code>, generate a new training instance with x-value <code>s</code>. The y-value of each instance is where things get a little complicated, so pay attention. Let the variable <code>P = Q(s)</code>, all current Q-value predictions for all actions. If the <code>a</code> from our experience tuple is the i<sup>th</sup> action in the neural net, then we update <b>only the Q-value corresponding to <code>a</code></b> with <code>P[i] = r + gamma*max(Q(s'))</code>. The y-value of each sample is set to its respective <code>P</code>, and the model is updated with a training session on just these samples.</li>
 
</ol>
 
</ol>
 
Loop the above training steps until your epoch completes. Keep doing more epochs (reset to initial state) until the reward achieved by your agent converges and no longer improves. Note that the neural network weights persist between epochs.
 
Loop the above training steps until your epoch completes. Keep doing more epochs (reset to initial state) until the reward achieved by your agent converges and no longer improves. Note that the neural network weights persist between epochs.
 +
 +
==Advanced Deep Q-Learning==
 +
DQNs tend not to be stable. Because the loss is calculated from the same network that weights are applied to, you tend to see oscillations in policies. Think of this like a cat chasing its own tail. There are a few tricks you can use to mitigate these problems.
 +
<ol>
 +
<li><b>Target network.</b> Instead of calculating targets with <code>P[i] = r + gamma*max(Q(s'))</code>, we have a separate target network <code>Q~</code> that we will use in <code>P[i] = r + gamma*max(Q~(s'))</code>. <code>Q~</code> has a structure identical to <code>Q</code>. Every 1000 or so training iterations, we update <code>Q~</code>'s weight to be the same as <code>Q</code>'s weights. In this way, there will be reduced correlation between <code>Q~</code> and <code>Q</code>, mitigating the aforementioned cat-chasing-tail scenario.
 +
</li>
 +
<li><b>Double Q-learning.</b> One problem in the DQN algorithm is that the agent tends to overestimate the Q-function value, due to the max in the formula used to set targets: <code>P[i] = r + gamma*max(Q(s'))</code>. The solution is to use two Q-functions, Q<sub>1</sub> and Q<sub>2</sub>, which are independently learned. One function is then used to determine the maximizing action and second to estimate its value.  For our purposes, the target network <code>Q~</code> is relatively independent of <code>Q</code>, so we can use <code>Q</code> and <code>Q~</code> as our networks for double Q-learning. So our update now becomes <code>P[i] = r + gamma*Q~(s',argmax<sub>a</sub>(Q(s',a))</code>.
 +
</li>
 +
</ol>
  
 
==Notes==
 
==Notes==
There are many different weight initializations to experiment with - glorot, he ...
+
Hyperparameters:
There are many different network structure to experiment with.
+
<ul>
Research into bucket size
+
<li>Network architecture</li>
Prioritized bucket
+
<li>Weight initilizations (glorot, he...)</li>
 +
<li>Network architecture</li>
 +
<li>Optimizer</li>
 +
<li>Learning rate</li>
 +
<li>Bucket size</li>
 +
<li>Sample training batch size</li>
 +
<li>Gamma</li>
 +
<li>Explore starting probability</li>
 +
<li>Explore ending probability</li>
 +
<li>Explore probability decay rate</li>
 +
<li>Target network update interval</li>
 +
</ul>
  
 
==Resources==
 
==Resources==

Latest revision as of 18:56, 12 February 2018

Motivation

What is deep Q-learning, and why do we want to use it?

Brief History

Deep Q-learning was introduced in 2015 by Google's DeepMind in a Nature article called Human-level control through deep reinforcement learning. DeepMind used Q-learning backed by a deep neural network to train an agent to play Atari games from raw pixel data - often outperforming humans. Previous attempts to combine reinforcement learning with neural networks had largely failed due to unstable learning. To address these instabilities, DeepMind introduced a mechanism by which the algorithm stores all of the agent's experiences and then randomly samples and replays these experiences to provide diverse and decorrelated training data.

Advantages

Continuous state space. Instead of building a Q-table, we use a neural network as part of deep Q-learning to approximate a Q-function. There is no need to discretize our data into arbitrary, independent buckets. Unlike the rows of a table, a function is continuous - so we predict Q-values for states that we have never seen before.

Can handle high-dimensional data. Deep networks can use convolutional layers and other trickery to extract features from high-dimensional data. Table-based Q-learning would fail miserably at this task, as the curse of dimensionality would leave us with gigantic state space to explore.

Stock prediction

How might we use deep Q-learning to predict stocks? Our actions would be BUY, HOLD, or SELL, and our state space might just be one feature, RETURNS, a continuous value. Deep Q-learning is a much more natural fit to the trading problem than the Q-table implementation we did in class, where we had to discretize our technical indicator values. In theory, technical indicators derived from price are superfluous if we provide our network with raw price data - deep learning should extract these features, and perhaps better ones, on its own.

Intro

To understand deep Q-learning, it is imperative you first have an understanding of normal, table-based Q-learning. And even if you do, read over this recipe because it will be used as reference later.

Table-based Q-learning

A Q-value is the expected future reward of taking an action in a certain state. Q-learning is model free, and so we do not try to learn the transition or immediate reward functions; we are just trying to construct a model of future reward. In this case, that model is a Q-table, which is a simple matrix.

Initialization step (happens only once)

  1. Define a table Q with a row for each state and a column for each action. We will be able to index into Q with Q(s,a) to get the expected utility of being in state s and taking action a. Alternatively, you can get an entire row containing Q-values for each action with Q(s) = [q1, q2, ..., qn]. Initialize this table with small, random values.
  2. Define exploration probability 0 <= p <= 1

Training step

  1. Be in state s
  2. With probability p, choose a random action a from the action space (explore). Otherwise, choose action a = argmaxa(Q(s,a)) (exploit).
  3. Take action a and witness the reward r and the new state you are in s'.
  4. This unit of training has yielded an experience tuple - <s, a, s', r>. Using this experience tuple, perform value iteration with Bellman's equation.
    Legend: gamma = discount rate, alpha = learning rate.
    Q(s,a) = (1 - alpha)Q(s,a) + (alpha)(r + gamma*max(Q(s')))
  5. Multiply p by some discount factor (perhaps .999).

Loop the above training steps until your epoch completes. Keep doing more epochs (reset to initial state) until the reward achieved by your agent converges and no longer improves. Note that the Q-table persists between epochs.

Deep Q-Learning Algorithm

Now that we have established the vanilla, table-based way of doing things, here is the shiny new deep Q-learning algorithm. Differences from the table-based algorithm are highlighted in blue. The main differences are the presence of a neural network backing the Q-function, and the experience replay bucket. Experience replay works very much like dyna-q learning. We store experience tuples and later randomly sample from our bucket and train off those examples. This enables us to train off of I.I.D. experience tuples rather than just training off of sequential data, which is highly correlated.
Initialization step (happens only once)

  1. Define a neural network Q with an input node for each feature and an output node for each action. The hidden layers are defined however you like. Ultimately, providing this network with current state s = [feature1, feature2, ..., featuren] as input should generate the Q-values of each action [q1, q2, ..., qn] as output - in other words, Q(s) = [q1, q2, ..., qn]. Just as before, Q(s,a) is the Q-value of taking action a in state s. Initialize this table with small, random values. Picking the right activation functions is critical to success - the output should be a linear activation function (outputs need to be continuous and unbounded), and the hidden layer activations should perhaps be ReLu. The loss function is also an important hyperparameter, try mean-squared error to start.
  2. Define exploration probability 0 <= p <= 1
  3. Define a deque of some size. This will henceforth be referred to as the bucket. The bucket will hold our historical data for experience replay.

Training step

  1. Be in state s
  2. With probability p, choose a random action a from the action space (explore). Otherwise, choose action a = argmaxa(Q(s,a)) (exploit).
  3. Take action a and witness the reward r and the new state you are in s'.
  4. This unit of training has yielded an experience tuple - <s, a, s', r>. Add this experience tuple to the bucket. If your bucket overflows its size, remove the oldest entry.
  5. Experience replay step. If you have more than BATCH_SIZE (usually 32) entries in the bucket, sample BATCH_SIZE experience tuples from the bucket without replacement. For each sampled tuple <s, a, s', r>, generate a new training instance with x-value s. The y-value of each instance is where things get a little complicated, so pay attention. Let the variable P = Q(s), all current Q-value predictions for all actions. If the a from our experience tuple is the ith action in the neural net, then we update only the Q-value corresponding to a with P[i] = r + gamma*max(Q(s')). The y-value of each sample is set to its respective P, and the model is updated with a training session on just these samples.

Loop the above training steps until your epoch completes. Keep doing more epochs (reset to initial state) until the reward achieved by your agent converges and no longer improves. Note that the neural network weights persist between epochs.

Advanced Deep Q-Learning

DQNs tend not to be stable. Because the loss is calculated from the same network that weights are applied to, you tend to see oscillations in policies. Think of this like a cat chasing its own tail. There are a few tricks you can use to mitigate these problems.

  1. Target network. Instead of calculating targets with P[i] = r + gamma*max(Q(s')), we have a separate target network Q~ that we will use in P[i] = r + gamma*max(Q~(s')). Q~ has a structure identical to Q. Every 1000 or so training iterations, we update Q~'s weight to be the same as Q's weights. In this way, there will be reduced correlation between Q~ and Q, mitigating the aforementioned cat-chasing-tail scenario.
  2. Double Q-learning. One problem in the DQN algorithm is that the agent tends to overestimate the Q-function value, due to the max in the formula used to set targets: P[i] = r + gamma*max(Q(s')). The solution is to use two Q-functions, Q1 and Q2, which are independently learned. One function is then used to determine the maximizing action and second to estimate its value. For our purposes, the target network Q~ is relatively independent of Q, so we can use Q and Q~ as our networks for double Q-learning. So our update now becomes P[i] = r + gamma*Q~(s',argmaxa(Q(s',a)).

Notes

Hyperparameters:

  • Network architecture
  • Weight initilizations (glorot, he...)
  • Network architecture
  • Optimizer
  • Learning rate
  • Bucket size
  • Sample training batch size
  • Gamma
  • Explore starting probability
  • Explore ending probability
  • Explore probability decay rate
  • Target network update interval

Resources

Cartpole DQN implementation https://gist.github.com/tsu-nera/edd306ddeefebe4afb1efceefbc3f953
Another explanation + algorithm for deep Q-learning http://neuro.cs.ut.ee/demystifying-deep-reinforcement-learning/
A full tutorial on how to build a DQN from scratch with keras + gym: https://jaromiru.com/2016/09/27/lets-make-a-dqn-theory/