Deep Q-Learning
Motivation
What is deep Q-learning, and why do we want to use it?
Brief History
Deep Q-learning was introduced in 2015 by Google's DeepMind in a Nature article called Human-level control through deep reinforcement learning. DeepMind used Q-learning backed by a deep neural network to train an agent to play Atari games from raw pixel data - often outperforming humans. Previous attempts to combine reinforcement learning with neural networks had largely failed due to unstable learning. To address these instabilities, DeepMind introduced a mechanism by which the algorithm stores all of the agent's experiences and then randomly samples and replays these experiences to provide diverse and decorrelated training data.
Advantages
Continuous state space. Instead of building a Q-table, deep Q-learning approximates a Q-function. Unlike the rows of a table, a function is continuous, so even if we query a state we have never seen before, the Q-function still provides an informed estimate, while a Q-table would have no entry to fall back on. There is no need to discretize our data into arbitrary, independent buckets.
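To make that concrete, here is a minimal sketch, assuming PyTorch and a toy problem with one continuous state feature and three actions (the sizes and names are illustrative choices, not from this article):

import torch
import torch.nn as nn

# A tiny Q-network: maps one continuous state feature to a Q-value per action.
# The 1-feature / 3-action setup and the layer width are illustrative assumptions.
q_net = nn.Sequential(
    nn.Linear(1, 64), nn.ReLU(),
    nn.Linear(64, 3),
)

# Unlike a Q-table, we can query a state value never seen during training
# and still get an informed estimate for every action.
unseen_state = torch.tensor([[0.0137]])
q_values = q_net(unseen_state)        # shape: (1, 3)
best_action = q_values.argmax(dim=1)  # index of the highest-valued action

With a table, the lookup for state 0.0137 would simply have no row unless we had first discretized it into a bucket we happened to have visited before.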
Can handle high-dimensional data. Deep networks can use convolutional layers and other trickery to extract features from high-dimensional data. Table-based Q-learning would fail miserably at this task, as the curse of dimensionality leaves us with a gigantic state space to explore.
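For the high-dimensional case, here is a sketch of a convolutional Q-network over raw pixel input, loosely in the spirit of the Atari setup; treat the exact sizes below (four stacked 84x84 grayscale frames, six actions) as assumptions for illustration:

import torch
import torch.nn as nn

# Illustrative convolutional Q-network for image input.
# Input: a stack of 4 grayscale 84x84 frames; output: one Q-value per action.
class ConvQNetwork(nn.Module):
    def __init__(self, n_actions=6):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(4, 32, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.ReLU(),
            nn.Flatten(),
        )
        self.head = nn.Sequential(
            nn.Linear(64 * 7 * 7, 512), nn.ReLU(),
            nn.Linear(512, n_actions),
        )

    def forward(self, frames):
        return self.head(self.features(frames))

q_net = ConvQNetwork()
frames = torch.zeros(1, 4, 84, 84)   # a dummy batch of stacked frames
print(q_net(frames).shape)           # torch.Size([1, 6])

A Q-table over raw 84x84x4 pixel states would need astronomically more rows than an agent could ever visit; the convolutional layers instead learn a compact feature representation of the screen.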
Stock prediction
How might we use deep Q-learning to predict stocks? Our actions would be BUY, HOLD, or SELL, and our state space might be just one feature, RETURNS, a continuous value. Deep Q-learning is a much more natural fit for this problem than the Q-table implementation we did in class, where we had to discretize our technical indicator values. In theory, technical indicators derived from price are superfluous if we provide our network with raw price data - deep learning should extract these features, and perhaps better ones, on its own.
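A hypothetical sketch of that setup, with the action encoding and layer sizes chosen purely for illustration (nothing here is prescribed by the article):

import torch
import torch.nn as nn

# Hypothetical trading setup matching the text: three actions and one
# continuous state feature (RETURNS). Layer sizes are illustrative assumptions.
ACTIONS = ["BUY", "HOLD", "SELL"]

q_net = nn.Sequential(
    nn.Linear(1, 32), nn.ReLU(),     # state: a single continuous return
    nn.Linear(32, len(ACTIONS)),     # output: one Q-value per action
)

todays_return = torch.tensor([[0.0042]])   # e.g., a +0.42% daily return
action = ACTIONS[q_net(todays_return).argmax(dim=1).item()]
print(action)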
Intro
To understand deep Q-learning, it is imperative that you first have an understanding of normal, table-based Q-learning. And even if you do, read over this recipe so that you will understand what changes when I reference it later:

Table-based Q-learning
Initialization step
1. Define a table Q with a row for each state and a column for each action. We can index into Q with Q(s,a); the value at each index is the expected utility of being in state s and taking action a. Initialize this table with small, random values.
2. Define an exploration probability 0 < p <= 1.

Training step
1. Be in state s.
2. With probability p, explore: take a random action from the action space. Otherwise, take the greedy action a = argmax_a(Q(s,a)). (A minimal code sketch of this recipe appears below.)
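Here is a minimal numpy sketch of the recipe above. The state/action counts and the environment interface (env.reset(), env.step()) are assumptions made for illustration, and the commented Q-update line is the standard Q-learning update, included only so the loop reads end to end; the recipe above stops at action selection.

import numpy as np

# Assumed problem sizes and hyperparameters (illustrative only).
n_states, n_actions = 100, 4
p = 0.1                   # exploration probability
alpha, gamma = 0.2, 0.9   # learning rate and discount for the standard Q-update

# Initialization step: a table with a row per state and a column per action,
# filled with small random values.
Q = np.random.uniform(-0.01, 0.01, size=(n_states, n_actions))

rng = np.random.default_rng()

def choose_action(s):
    # Training step 2: explore with probability p, otherwise act greedily.
    if rng.random() < p:
        return int(rng.integers(n_actions))
    return int(np.argmax(Q[s]))

# One training episode against a hypothetical environment `env` (assumed interface):
# s = env.reset()                      # training step 1: be in state s
# for _ in range(1000):
#     a = choose_action(s)
#     s_next, r, done = env.step(a)
#     # Standard Q-learning update (not part of the recipe above):
#     Q[s, a] += alpha * (r + gamma * np.max(Q[s_next]) - Q[s, a])
#     s = s_next
#     if done:
#         break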