Deep Q-Learning

Motivation

What is deep Q-learning, and why do we want to use it?

Brief History

Deep Q-learning was introduced in 2015 by Google's DeepMind in a Nature article called Human-level control through deep reinforcement learning. DeepMind used Q-learning backed by a deep neural network to train an agent to play Atari games from raw pixel data - often outperforming humans. Previous attempts to combine reinforcement learning with neural networks had largely failed due to unstable learning. To address these instabilities, DeepMind introduced a mechanism by which the algorithm stores all of the agent's experiences and then randomly samples and replays these experiences to provide diverse and decorrelated training data.

Advantages

Continuous state space. Instead of building a Q-table, deep Q-learning approximates a Q-function. Unlike the rows of a table, a function is continuous, so even if we query a state we have never encountered before, the Q-function will provide an informed estimate, whereas a Q-table would have no information. There is no need to discretize our data into arbitrary, independent buckets.

Can handle high-dimensional data. Deep networks can use convolutional layers and other trickery to extract features from high-dimensional data. Table-based Q-learning would fail miserably at this task, as the curse of dimensionality leaves us with a gigantic state space to explore.

Stock prediction

How might we use deep Q-learning to predict stocks? Our actions would be BUY, HOLD, or SELL, and our state space might just be one feature, RETURNS, a continuous value. Deep Q-learning is a much more natural fit to the trading problem than the Q-table implementation we did in class, where we had to discretize our technical indicator values. In theory, technical indicators derived from price are superfluous if we provide our network with raw price data - deep learning should extract these features, and perhaps better ones, on its own.
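For concreteness, here is a minimal Python sketch of how the state and action space described above might be encoded. The names ACTIONS, make_state, and prices, and the single-return state, are illustrative assumptions rather than part of any particular implementation:

    import numpy as np

    # Hypothetical encoding of the trading problem described above.
    # Actions are indexed so they map directly onto a Q-table column
    # or a Q-network output node.
    ACTIONS = {0: "BUY", 1: "HOLD", 2: "SELL"}

    def make_state(prices, t, lookback=1):
        """The state is just the most recent return(s) - a continuous value,
        so no discretization into buckets is required."""
        returns = prices[t - lookback:t] / prices[t - lookback - 1:t - 1] - 1.0
        return returns.astype(np.float32)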

Intro

To understand deep Q-learning, it is imperative that you first understand normal, table-based Q-learning. Even if you do, read over this recipe so that you will understand what changes when I reference it later:

Table-based Q-learning

Q-values are just the expected future reward of taking an action in a certain state. Q-learning is model-free, so we do not try to learn the transition or immediate reward functions; we just try to construct a model of expected future reward.
Initialization step

  1. Define a table Q with a row for each state and a column for each action. We can index into Q with Q(s,a). The value at each index will be the expected utility of being in state s and taking action a. Initialize this table with small, random values.
  2. Define exploration probability 0 < p <= 1
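A minimal sketch of this initialization step, assuming the state space has already been discretized into NUM_STATES states and there are NUM_ACTIONS actions (the names and sizes are illustrative):

    import numpy as np

    NUM_STATES = 100   # assumed size of the discretized state space
    NUM_ACTIONS = 3    # e.g. BUY, HOLD, SELL

    # Q-table of small, random values, indexed as Q[s, a]
    Q = np.random.uniform(-0.001, 0.001, size=(NUM_STATES, NUM_ACTIONS))

    p = 0.98           # exploration probability, 0 < p <= 1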

Training step

  1. Be in state s
  2. With probability p, take a random action from the action space (explore). Otherwise, take action a = argmax_a(Q(s,a)) (exploit).
  3. Take action a and witness the reward r and the new state you are in s'.
  4. This unit of training has yielded an experience tuple - <s, a, s', r>. Using this experience tuple, perform value iteration with Bellman's equation.
    Legend: gamma = discount rate, alpha = learning rate.
    Q(s,a) = (1 - alpha)*Q(s,a) + alpha*(r + gamma*max_a'(Q(s',a')))
  5. Multiply p by some decay factor (perhaps 0.999) so the agent explores less as it learns (a sketch of the full training step follows this recipe).

Repeat the above steps until the reward achieved by the agent converges and no longer improves.
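Putting the training step together, here is an illustrative sketch of one episode of tabular Q-learning, continuing from the initialization sketch above. The env object and its reset/step methods are assumptions mirroring a Gym-style interface, and the hyperparameter values are arbitrary:

    import numpy as np

    ALPHA, GAMMA = 0.2, 0.9   # learning rate and discount rate (illustrative values)

    def train_episode(env, Q, p):
        """One episode of the tabular training step. `env` is a hypothetical
        Gym-style environment that returns integer states."""
        s = env.reset()
        done = False
        while not done:
            # Step 2: explore with probability p, otherwise exploit the Q-table.
            if np.random.rand() < p:
                a = np.random.randint(Q.shape[1])
            else:
                a = int(np.argmax(Q[s]))
            # Step 3: take the action and observe r and s'.
            s_prime, r, done = env.step(a)
            # Step 4: Bellman update using the experience tuple <s, a, s', r>.
            Q[s, a] = (1 - ALPHA) * Q[s, a] + ALPHA * (r + GAMMA * np.max(Q[s_prime]))
            s = s_prime
        # Step 5: decay the exploration probability.
        return p * 0.999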

Deep Q-Learning Algorithm

Now that we have established the vanilla, table-based way of doing things, here is the shiny new deep Q-learning algorithm. Pay attention to where it differs from the table-based algorithm.
Initialization step

  1. Define a neural network Q with an input node for each feature and an output node for each action. The hidden layers are defined however you like. Ultimately, providing this network with the current state s = [feature1, feature2, ..., featuren] as input should generate the Q-values of each action [q1, q2, ..., qm] as output - in other words, Q(s) = [q1, q2, ..., qm]. Just as before, Q(s,a) is the Q-value of taking action a in state s. Initialize this network's weights with small, random values. The loss function is also an important hyperparameter; try mean-squared error to start (see the sketch after this list).
  2. Define exploration probability 0 < p <= 1
  3. Define a deque of some size. This will henceforth be referred to as the bucket. The bucket will hold our historical data for experience replay.
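One possible way to set this up, sketched with Keras; the layer sizes, optimizer, learning rate, and bucket size are assumptions, not prescriptions:

    from collections import deque
    from tensorflow import keras

    NUM_FEATURES = 1   # e.g. just RETURNS
    NUM_ACTIONS = 3    # BUY, HOLD, SELL

    def build_q_network():
        """A small, fully connected Q-network; hidden layer sizes are arbitrary."""
        model = keras.Sequential([
            keras.layers.Input(shape=(NUM_FEATURES,)),
            keras.layers.Dense(32, activation="relu"),
            keras.layers.Dense(32, activation="relu"),
            # One linear output per action: Q(s) = [q1, ..., qm]
            keras.layers.Dense(NUM_ACTIONS, activation="linear"),
        ])
        model.compile(optimizer=keras.optimizers.Adam(learning_rate=1e-3), loss="mse")
        return model

    Q = build_q_network()
    p = 0.98                       # exploration probability
    bucket = deque(maxlen=10_000)  # experience replay memory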

Training step

  1. Be in state s
  2. With probability p, take a random action from the action space (explore). Otherwise, take action a = argmax_a(Q(s,a)) (exploit).
  3. Take action a and witness the reward r and the new state you are in s'.
  4. This unit of training has yielded an experience tuple - <s, a, s', r>. Add this experience tuple to the bucket.
  5. Experience replay step. If you have more than BATCH_SIZE entries in the bucket, sample BATCH_SIZE tuples from the bucket without replacement. For each sampled tuple <s, a, s', r>, generate a new training instance with x-value s. The y-value is where things get a little complicated, so pay attention. Let P = Q(s). If a is the ith action in the network's output, update only the entry corresponding to a: P[i] = r + gamma*max_a'(Q(s',a')). The y-value of the sample is set to the array P, and the model is updated with an additional training epoch on just these samples (see the sketch after this recipe).

Repeat the above steps until the reward achieved by the agent converges and no longer improves.
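Here is a sketch of the experience replay step under the same assumed Keras setup as above, with each bucket entry stored as a (state_array, action_index, next_state_array, reward) tuple:

    import random
    import numpy as np

    GAMMA = 0.9
    BATCH_SIZE = 32

    def replay(Q, bucket):
        """Fit the network on a random, decorrelated batch of <s, a, s', r> tuples."""
        if len(bucket) < BATCH_SIZE:
            return
        batch = random.sample(bucket, BATCH_SIZE)  # sample without replacement
        states = np.array([s for s, a, s_prime, r in batch])
        next_states = np.array([s_prime for s, a, s_prime, r in batch])
        # Start from the network's current predictions, P = Q(s)...
        targets = Q.predict(states, verbose=0)
        next_q = Q.predict(next_states, verbose=0)
        for i, (s, a, s_prime, r) in enumerate(batch):
            # ...and overwrite only the entry for the action actually taken.
            targets[i, a] = r + GAMMA * np.max(next_q[i])
        # One additional training epoch on just these samples.
        Q.fit(states, targets, epochs=1, verbose=0)

In a full agent, replay would be called once per training step after the newest tuple is added to the bucket, and p would typically still be decayed each step just as in the table-based recipe.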

Notes

There are many different weight initializations to experiment with (Glorot, He, ...). There are also many different network structures to experiment with. Other directions worth researching include the bucket size and a prioritized bucket (prioritized experience replay).
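For example, swapping the weight initializer in the Keras sketch above is a one-argument change (Glorot uniform is the Keras default for Dense layers; He initialization is often preferred with ReLU activations):

    from tensorflow import keras

    # Illustrative only: same hidden layer as before, but with He initialization.
    layer = keras.layers.Dense(32, activation="relu", kernel_initializer="he_normal")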