Deep Q-Learning - Revision history

Petosa at 23:56, 12 February 2018

2018-02-12T23:56:07Z

Petosa: /* Deep Q-Learning Algorithm */

2018-02-12T23:55:33Z

Deep Q-Learning Algorithm

Petosa: /* Notes */

2018-02-12T23:54:52Z

Notes

Petosa: /* Advanced Deep Q-Learning */

2018-02-12T23:49:08Z

Advanced Deep Q-Learning

Petosa: /* Advanced Deep Q-Learning */

2018-02-12T23:46:59Z

Advanced Deep Q-Learning

Petosa at 23:30, 12 February 2018

2018-02-12T23:30:25Z

Petosa: /* Motivation */

2018-02-12T23:23:32Z

Motivation

Petosa: /* Advanced Deep Q-Learning= */

2018-02-12T23:23:23Z

Advanced Deep Q-Learning=

Petosa: /* Motivation */

2018-02-12T23:23:02Z

Motivation

Petosa: /* Table-based Q-learning */

2018-02-12T16:31:16Z

Table-based Q-learning

@@ Line 30: / Line 30: @@
 <ol>
 <li>Be in state <code>s</code></li>
-<li>With probability <code>p</code>, choose a random action <code>a</code> from the action space (explore). Otherwise, choose action <code>a = argmax_a(Q(s,a))</code> (exploit).</li>
+<li>With probability <code>p</code>, choose a random action <code>a</code> from the action space (explore). Otherwise, choose action <code>a = argmax<sub>a</sub>(Q(s,a))</code> (exploit).</li>
 <li>Take action <code>a</code> and witness the reward <code>r</code> and the new state you are in <code>s'</code>.</li>
 <li>This unit of training has yielded an experience tuple - <code><s, a, s', r></code>. Using this experience tuple, perform value iteration with Bellman's equation.
@@ Line 58: / Line 58: @@
 <ol>
 <li>Be in state <code>s</code></li>
-<li>With probability <code>p</code>, choose a random action <code>a</code> from the action space (explore). Otherwise, choose action <code>a = argmax_a(Q(s,a))</code> (exploit).</li>
+<li>With probability <code>p</code>, choose a random action <code>a</code> from the action space (explore). Otherwise, choose action <code>a = argmax<sub>a</sub>(Q(s,a))</code> (exploit).</li>
 <li>Take action <code>a</code> and witness the reward <code>r</code> and the new state you are in <code>s'</code>.</li>
 <li style="color:navy">This unit of training has yielded an experience tuple - <code><s, a, s', r></code>. Add this experience tuple to the bucket. If your bucket overflows its size, remove the oldest entry.</li>

@@ Line 61: / Line 61: @@
 <li>Take action <code>a</code> and witness the reward <code>r</code> and the new state you are in <code>s'</code>.</li>
 <li style="color:navy">This unit of training has yielded an experience tuple - <code><s, a, s', r></code>. Add this experience tuple to the bucket. If your bucket overflows its size, remove the oldest entry.</li>
-<li style="color:navy">Experience replay step. If you have more than <code>BATCH_SIZE</code> (usually 32) entries in the bucket, sample <code>BATCH_SIZE</code> experience tuples from the bucket without replacement. For each sampled tuple <code><s, a, s', r></code>, generate a new training instance with x-value <code>s</code>. The y-value of each instance is where things get a little complicated, so pay attention. Let the variable <code>P = Q(s)</code>, all current Q-value predictions for all actions. If the <code>a</code> from our experience tuple is the i<sup>th</sup> action in the neural net, then we updated <b>only the Q-value corresponding to <code>a</code></b> with <code>P[i] = r + gamma*max(Q(s'))</code>. The y-value of each sample is set to its respective <code>P</code>, and the model is updated with a training session on just these samples.</li>
+<li style="color:navy">Experience replay step. If you have more than <code>BATCH_SIZE</code> (usually 32) entries in the bucket, sample <code>BATCH_SIZE</code> experience tuples from the bucket without replacement. For each sampled tuple <code><s, a, s', r></code>, generate a new training instance with x-value <code>s</code>. The y-value of each instance is where things get a little complicated, so pay attention. Let the variable <code>P = Q(s)</code>, all current Q-value predictions for all actions. If the <code>a</code> from our experience tuple is the i<sup>th</sup> action in the neural net, then we update <b>only the Q-value corresponding to <code>a</code></b> with <code>P[i] = r + gamma*max(Q(s'))</code>. The y-value of each sample is set to its respective <code>P</code>, and the model is updated with a training session on just these samples.</li>
 </ol>
 Loop the above training steps until your epoch completes. Keep doing more epochs (reset to initial state) until the reward achieved by your agent converges and no longer improves. Note that the neural network weights persist between epochs.
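
The explore/exploit choice, the bounded bucket, and the experience replay step described in this hunk can be sketched as follows. This is a minimal illustration rather than the article's implementation: the names `choose_action`, `build_training_batch`, and `q` are hypothetical, `q(state)` is assumed to be any callable that returns a list of Q-values (one per action), the discount `GAMMA` and the bucket size of 1000 are assumed values, and terminal states are ignored because the text does not cover them.

```python
import random
from collections import deque

GAMMA = 0.9       # discount factor (assumed value, not given in the text)
BATCH_SIZE = 32   # batch size mentioned in the text


def choose_action(q_values, p, rng=random):
    """Explore with probability p, otherwise exploit the highest Q-value."""
    if rng.random() < p:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda i: q_values[i])


def build_training_batch(bucket, q, batch_size=BATCH_SIZE, gamma=GAMMA, rng=random):
    """Sample experience tuples and build (x, y) training pairs.

    Returns None until the bucket holds more than batch_size entries.
    """
    if len(bucket) <= batch_size:
        return None
    batch = rng.sample(list(bucket), batch_size)  # sampling without replacement
    xs, ys = [], []
    for s, a, s_next, r in batch:
        target = list(q(s))                       # P = Q(s), all actions
        target[a] = r + gamma * max(q(s_next))    # overwrite only action a
        xs.append(s)
        ys.append(target)
    return xs, ys


# A deque with maxlen drops the oldest entry automatically on overflow.
bucket = deque(maxlen=1000)
```

The returned `xs` and `ys` would then be passed to one training session of whatever model backs `q`. Using `deque(maxlen=...)` is one convenient way to get the "remove the oldest entry" behavior without extra bookkeeping.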

@@ Line 75: / Line 75: @@
 ==Notes==
-There are many different weight initializations to experiment with - glorot, he ...
+Hyperparameters:
-There are many different network structure to experiment with.
+<ul>
-Research into bucket size
+<li>Network architecture</li>
-Prioritized bucket
+<li>Weight initializations (Glorot, He...)</li>
 ==Resources==

@@ Line 70: / Line 70: @@
 <li><b>Target network.</b> Instead of calculating targets with <code>P[i] = r + gamma*max(Q(s'))</code>, we have a separate target network <code>Q~</code> that we will use in <code>P[i] = r + gamma*max(Q~(s'))</code>. <code>Q~</code> has a structure identical to <code>Q</code>. Every 1000 or so training iterations, we update <code>Q~</code>'s weights to be the same as <code>Q</code>'s weights. In this way, there will be reduced correlation between <code>Q~</code> and <code>Q</code>, mitigating the aforementioned cat-chasing-tail scenario.
 </li>
-<li>One problem in the DQN algorithm is that the agent tends to overestimate the Q-function value, due to the max in the formula used to set targets: <code>P[i] = r + gamma*max(Q(s'))</code>. The solution is to use two Q-functions, Q<sub>1</sub> and Q<sub>2</sub>, which are independently learned. One function is then used to determine the maximizing action and second to estimate its value.  For our purposes, the target network <code>Q~</code> is relatively independent of <code>Q</code>, so we can use <code>Q</code> and <code>Q~</code> as our networks for double Q-learning. So our update now becomes <code>P[i] = r + gamma*Q~(s',argmax<sub>a</sub>(Q(s',a))</code>.
+<li><b>Double Q-learning.</b> One problem in the DQN algorithm is that the agent tends to overestimate the Q-function value, due to the max in the formula used to set targets: <code>P[i] = r + gamma*max(Q(s'))</code>. The solution is to use two Q-functions, Q<sub>1</sub> and Q<sub>2</sub>, which are independently learned. One function is then used to determine the maximizing action and the second to estimate its value. For our purposes, the target network <code>Q~</code> is relatively independent of <code>Q</code>, so we can use <code>Q</code> and <code>Q~</code> as our networks for double Q-learning. So our update now becomes <code>P[i] = r + gamma*Q~(s', argmax<sub>a</sub>(Q(s',a)))</code>.
 </li>
 </ol>
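
To make the difference between the two target formulas in this hunk concrete, here is a small sketch of both. The function names and the `q_online` / `q_target` callables are hypothetical stand-ins for `Q` and `Q~` (each is assumed to return a list of Q-values for a state), and the default `gamma` is an assumed value.

```python
def dqn_target(r, s_next, q_target, gamma=0.99):
    """Target-network update: r + gamma * max(Q~(s'))."""
    return r + gamma * max(q_target(s_next))


def double_dqn_target(r, s_next, q_online, q_target, gamma=0.99):
    """Double Q-learning update: r + gamma * Q~(s', argmax_a Q(s', a)).

    Q picks the action; Q~ supplies the value of that action.
    """
    online_values = q_online(s_next)
    best_action = max(range(len(online_values)), key=lambda a: online_values[a])
    return r + gamma * q_target(s_next)[best_action]
```

When `Q~` happens to overrate an action that `Q` does not prefer, the second function ignores that inflated value, which is the intuition behind the reduced overestimation.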

@@ Line 68: / Line 68: @@
 DQNs tend not to be stable. Because the loss is calculated from the same network that weights are applied to, you tend to see oscillations in policies. Think of this like a cat chasing its own tail. There are a few tricks you can use to mitigate these problems.
 <ol>
-<li><b>Target network.</b> Instead of calculating targets with <code>P[i] = r + gamma*max(Q(s'))</code>, we have a separate target network <code>Q~</code> that we will use in <code>P[i] = r + gamma*max(Q~(s'))</code>. <code>Q~</code> has a structure identical tot <code>Q</code>. Every 1000 or so training iterations, we update <code>Q~</code>'s weight to be the same as <code>Q</code>'s weights. In this way, there will be reduced qcorrelation between <code>Q~</code> and <code>Q</code>, mitigating the aforementioned cat-chasing-tail sceneario.
+<li><b>Target network.</b> Instead of calculating targets with <code>P[i] = r + gamma*max(Q(s'))</code>, we have a separate target network <code>Q~</code> that we will use in <code>P[i] = r + gamma*max(Q~(s'))</code>. <code>Q~</code> has a structure identical to <code>Q</code>. Every 1000 or so training iterations, we update <code>Q~</code>'s weights to be the same as <code>Q</code>'s weights. In this way, there will be reduced correlation between <code>Q~</code> and <code>Q</code>, mitigating the aforementioned cat-chasing-tail scenario.
 </ol>
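
The periodic weight copy behind the target network can be sketched as below. This is only an illustration under stated assumptions: `maybe_sync` is a hypothetical helper, weights are represented as plain Python containers rather than any particular framework's tensors, and `SYNC_EVERY` takes the "1000 or so" figure from the text.

```python
import copy

SYNC_EVERY = 1000  # "every 1000 or so training iterations" in the text


def maybe_sync(step, online_weights, target_weights, sync_every=SYNC_EVERY):
    """Return the weights Q~ should use after training iteration `step`.

    On a sync step, Q~ receives an independent copy of Q's weights;
    otherwise Q~ keeps its existing (frozen) weights.
    """
    if step % sync_every == 0:
        return copy.deepcopy(online_weights)
    return target_weights
```

The deep copy matters: if `Q~` merely referenced the same weight objects as `Q`, every later update to `Q` would leak into the targets and the two networks would never be decoupled.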

@@ Line 5: / Line 5: @@
 === Brief History ===
 Deep Q-learning was introduced in 2015 by Google's DeepMind in a Nature article called <i>Human-level control through deep reinforcement learning.</i> DeepMind used Q-learning backed by a deep neural network to train an agent to play Atari games from raw pixel data - often outperforming humans. Previous attempts to combine reinforcement learning with neural networks had largely failed due to unstable learning. To address these instabilities, DeepMind introduced a mechanism by which the algorithm stores all of the agent's experiences and then randomly samples and replays these experiences to provide diverse and decorrelated training data.
 === Advantages ===

Revision as of 23:23, 12 February 2018
@@ Line 6: / Line 6: @@
 Deep Q-learning was introduced in 2015 by Google's DeepMind in a Nature article called <i>Human-level control through deep reinforcement learning.</i> DeepMind used Q-learning backed by a deep neural network to train an agent to play Atari games from raw pixel data - often outperforming humans. Previous attempts to combine reinforcement learning with neural networks had largely failed due to unstable learning. To address these instabilities, DeepMind introduced a mechanism by which the algorithm stores all of the agent's experiences and then randomly samples and replays these experiences to provide diverse and decorrelated training data.
-==Advanced Deep Q-Learning===
-DQNs tend not to be stable. Because the loss is calculated from the same network that weights are applied to, you tend to see oscillations in policies. Think of this like a cat chasing its own tail. There are a few tricks you can use to mitigate these problems.
-<ol>
-<li><b>Target network.</b> Instea</li>
-</ol>
 === Advantages ===