Difference between revisions of "Summer 2016 Project 5"

Latest revision as of 20:56, 19 July 2016

Overview

In this project you will implement the Q-Learning and Dyna-Q solutions to the reinforcement learning problem. You will apply them to two problems: 1) Navigation, and 2) Trading. The reason for working with the navigation problem first is that, as you will see, navigation is an easy problem to work with and understand. In the last part of the assignment you will apply Q-Learning to stock trading.

Note that your Q-Learning code really shouldn't care which problem it is solving. The difference is that you need to wrap the learner in different code that frames the problem for the learner as necessary.

For the navigation problem we have created testqlearner.py that automates testing of your Q-Learner in the navigation problem. We also provide teststrategylearner.py to test your strategy learner. In order to apply Q-learning to trading you will have to implement an API that calls Q-learning internally.

Overall, your tasks for this project include:

Code a Q-Learner
Code the Dyna-Q feature of Q-Learning
Test/debug the Q-Learner in navigation problems
Build a strategy learner based on your Q-Learner
Test/debug the strategy learner on specific symbol/time period problems

Scoring for the project will be allocated as follows:

Navigation test cases: 80% (note that we will check those with dyna = 0)
Dyna implemented: 5% (we will check this with one navigation test case by comparing performance with and without dyna turned on)
Trading strategy test cases: 20%

For this assignment we will test only your code (there is no report component). Note that the scoring is structured so that you can earn a B (80%) if you implement only Q-Learning, but if you implement everything, the total possible score is 105%. That means you can earn up to 5% extra credit on this project (equivalent to 1% extra credit on the final course grade).

Template and Data

Download mc3_p3.zip, unzip inside ml4t/
Implement the QLearner class in mc3_p3/QLearner.py.
Implement the StrategyLearner class in mc3_p3/StrategyLearner.py
To test your Q-learner, run python testqlearner.py from the mc3_p3/ directory.
To test your strategy learner, run python teststrategylearner.py from the mc3_p3/ directory.
Note that example problems are provided in the mc3_p3/testworlds directory

Part 1: Implement QLearner

Your QLearner class should be implemented in the file QLearner.py. It should implement EXACTLY the API defined below. DO NOT import any modules besides those allowed below. Your class should implement the following methods:

QLearner(...): Constructor, see argument details below.
query(s_prime, r): Update Q-table with <s, a, s_prime, r> and return new action for state s_prime, update rar.
querysetstate(s): Set state to s, return action for state s, but don't update Q-table or rar.

Here's an example of the API in use:

import QLearner as ql

learner = ql.QLearner(num_states = 100,
    num_actions = 4,
    alpha = 0.2,
    gamma = 0.9,
    rar = 0.98,
    radr = 0.999,
    dyna = 0,
    verbose = False)

s = 99 # our initial state

a = learner.querysetstate(s) # action for state s

s_prime = 5 # the new state we end up in after taking action a in state s

r = 0 # reward for taking action a in state s

next_action = learner.query(s_prime, r)

The constructor QLearner() should reserve space for keeping track of Q[s, a] for the number of states and actions. It should initialize Q[] with uniform random values between -1.0 and 1.0. Details on the input arguments to the constructor:

num_states integer, the number of states to consider
num_actions integer, the number of actions available.
alpha float, the learning rate used in the update rule. Should range between 0.0 and 1.0 with 0.2 as a typical value.
gamma float, the discount rate used in the update rule. Should range between 0.0 and 1.0 with 0.9 as a typical value.
rar float, random action rate: the probability of selecting a random action at each step. Should range between 0.0 (no random actions) to 1.0 (always random action) with 0.5 as a typical value.
radr float, random action decay rate, after each update, rar = rar * radr. Ranges between 0.0 (immediate decay to 0) and 1.0 (no decay). Typically 0.99.
dyna integer, conduct this number of dyna updates for each regular update. When Dyna is used, 200 is a typical value.
verbose boolean, if True, your class is allowed to print debugging statements, if False, all printing is prohibited.
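As a minimal sketch of the table setup described above, assuming a plain list-of-lists layout (NumPy, which is allowed for this project, would work equally well). The helper name make_q_table is hypothetical and is not part of the required API:

```python
import random

def make_q_table(num_states, num_actions, rng=None):
    """Return a num_states x num_actions table of uniform values in [-1.0, 1.0]."""
    rng = rng or random.Random()
    return [[rng.uniform(-1.0, 1.0) for _ in range(num_actions)]
            for _ in range(num_states)]

# One row per state, one column per action.
Q = make_q_table(num_states=100, num_actions=4, rng=random.Random(0))
```

Passing a seeded generator, as here, is only a convenience for reproducible debugging.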

query(s_prime, r) is the core method of the Q-Learner. It should keep track of the last state s and the last action a, then use the new information s_prime and r to update the Q table. The learning instance, or experience tuple is <s, a, s_prime, r>. query() should return an integer, which is the next action to take. Note that it should choose a random action with probability rar, and that it should update rar according to the decay rate radr at each step. Details on the arguments:

s_prime integer, the new state.
r float, a real valued immediate reward.
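The update that query() applies to the Q table can be sketched with the standard Q-learning rule, Q[s, a] = (1 - alpha) * Q[s, a] + alpha * (r + gamma * max over a' of Q[s_prime, a']). The function name q_update and the tiny two-state table are assumptions made for illustration only:

```python
def q_update(Q, s, a, s_prime, r, alpha, gamma):
    """Apply one Q-learning update for the experience tuple <s, a, s_prime, r>."""
    best_next = max(Q[s_prime])  # value of the best action available in s_prime
    Q[s][a] = (1.0 - alpha) * Q[s][a] + alpha * (r + gamma * best_next)
    return Q[s][a]

# Two states, two actions; every estimate starts at 0.0 except Q[1][1].
Q = [[0.0, 0.0], [0.0, 1.0]]
new_value = q_update(Q, s=0, a=1, s_prime=1, r=-1.0, alpha=0.2, gamma=0.9)
```

Here the old estimate is 0.0 and the target is -1.0 + 0.9 * 1.0 = -0.1, so the updated entry is 0.2 * -0.1 = -0.02.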

querysetstate(s) A special version of the query method that sets the state to s, and returns an integer action according to the same rules as query() (including choosing a random action sometimes), but it does not execute an update to the Q-table. It also does not update rar. There are two main uses for this method: 1) To set the initial state, and 2) when using a learned policy, but not updating it.
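Both methods choose actions the same way; only query() decays rar afterwards. A minimal sketch of that selection rule follows, where the helper name choose_action and the example Q values are hypothetical:

```python
import random

def choose_action(q_row, rar, rng):
    """Pick a random action with probability rar, otherwise the greedy one."""
    if rng.random() < rar:
        return rng.randrange(len(q_row))
    return max(range(len(q_row)), key=lambda a: q_row[a])

rng = random.Random(0)
q_row = [0.1, 0.7, -0.3, 0.2]  # made-up Q values for one state

greedy = choose_action(q_row, rar=0.0, rng=rng)   # never random, so the argmax
explore = choose_action(q_row, rar=1.0, rng=rng)  # always random

# query() decays rar after each update; querysetstate() leaves it unchanged.
rar, radr = 0.98, 0.999
for _ in range(1000):
    rar *= radr
```

With rar = 0.98 and radr = 0.999, a thousand calls to query() bring the random action rate down to about 0.36, so exploration fades gradually rather than stopping abruptly.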

Part 2: Navigation Problem Test Cases

We will test your Q-Learner with a navigation problem as follows. Note that your Q-Learner does not need to be coded specially for this task. In fact the code doesn't need to know anything about it. The code necessary to test your learner with this navigation task is implemented in testqlearner.py for you. The navigation task takes place in a 10 x 10 grid world. The particular environment is expressed in a CSV file of integers, where the value in each position is interpreted as follows:

0: blank space.
1: an obstacle.
2: the starting location for the robot.
3: the goal location.

An example navigation problem (CSV file) is shown below. Following Python conventions, [0,0] is the upper left, or northwest, corner and [9,9] is the lower right, or southeast, corner. Rows run north/south, columns run east/west.

0,0,0,0,3,0,0,0,0,0
0,0,0,0,0,0,0,0,0,0
0,0,0,0,0,0,0,0,0,0
0,0,1,1,1,1,1,0,0,0
0,0,1,0,0,0,1,0,0,0
0,0,1,0,0,0,1,0,0,0
0,0,1,0,0,0,1,0,0,0
0,0,0,0,0,0,0,0,0,0
0,0,0,0,0,0,0,0,0,0
0,0,0,0,2,0,0,0,0,0

In this example the robot starts at the bottom center, and must navigate to the top center. Note that a wall of obstacles blocks its path. We map this problem to a reinforcement learning problem as follows:

State: The state is the location of the robot, computed (discretized) as: column location * 10 + row location.
Actions: There are 4 possible actions, 0: move north, 1: move east, 2: move south, 3: move west.
R: The reward is -1.0 unless the action leads to the goal, in which case the reward is +1.0.
T: The transition matrix can be inferred from the CSV map and the actions.
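The mapping above can be sketched from the test harness's point of view as follows. This is an illustration, not the code in testqlearner.py: the names MOVES, discretize and move are hypothetical, and the sketch assumes that a move into an obstacle or off the map leaves the robot where it is.

```python
# Row/column offsets for actions 0: north, 1: east, 2: south, 3: west.
MOVES = {0: (-1, 0), 1: (0, 1), 2: (1, 0), 3: (0, -1)}

def discretize(row, col):
    """Map a grid location to a single integer state, as defined above."""
    return col * 10 + row

def move(grid, row, col, action):
    """Return the new (row, col); blocked moves leave the robot in place (assumed)."""
    d_row, d_col = MOVES[action]
    new_row, new_col = row + d_row, col + d_col
    inside = 0 <= new_row < len(grid) and 0 <= new_col < len(grid[0])
    if not inside or grid[new_row][new_col] == 1:
        return row, col
    return new_row, new_col

grid = [[0] * 10 for _ in range(10)]
grid[3][4] = 1  # one obstacle cell from the example map's wall

start_state = discretize(9, 4)   # the robot's start at the bottom center
north = move(grid, 9, 4, 0)      # one free step north
blocked = move(grid, 4, 4, 0)    # the obstacle at [3,4] blocks this move
off_map = move(grid, 9, 4, 2)    # moving south off the map is ignored
```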

Note that R and T are not known by or available to the learner. The testing code testqlearner.py will test your code as follows (pseudo code):

Instantiate the learner with the constructor QLearner()
s = initial_location
a = querysetstate(s)
s_prime = new location according to action a
r = -1.0
while not converged:
    a = query(s_prime, r) 
    s_prime = new location according to action a
    if s_prime == goal:
        r = +1
        s_prime = start location
    else:
        r = -1

A few things to note about this code: The learner always receives a reward of -1.0 until it reaches the goal, when it receives a reward of +1.0. As soon as the robot reaches the goal, it is immediately returned to the starting location.
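To make the call order concrete, here is a runnable sketch of the same loop on a toy problem. Everything in it is an assumption for illustration: RandomLearner is a stand-in that only mimics the QLearner API, the world is a five-cell corridor rather than a grid, and a fixed step budget replaces the convergence check.

```python
import random

class RandomLearner(object):
    """Hypothetical stand-in exposing the QLearner API; it acts at random."""
    def __init__(self, num_actions, seed=0):
        self.num_actions = num_actions
        self.rng = random.Random(seed)
        self.rewards_seen = []

    def querysetstate(self, s):
        return self.rng.randrange(self.num_actions)

    def query(self, s_prime, r):
        self.rewards_seen.append(r)
        return self.rng.randrange(self.num_actions)

def step(s, a, num_cells):
    """Corridor dynamics: action 0 moves left, action 1 moves right."""
    return max(0, min(num_cells - 1, s + (1 if a == 1 else -1)))

NUM_CELLS, START, GOAL = 5, 0, 4
learner = RandomLearner(num_actions=2)

a = learner.querysetstate(START)
s_prime = step(START, a, NUM_CELLS)
r = -1.0
goals_reached = 0
for _ in range(2000):  # fixed budget instead of a convergence check
    a = learner.query(s_prime, r)
    s_prime = step(s_prime, a, NUM_CELLS)
    if s_prime == GOAL:
        r = 1.0
        s_prime = START  # the robot is returned to the start immediately
        goals_reached += 1
    else:
        r = -1.0
```

Note that the reward earned by a step is delivered on the next call to query(), together with the state the robot ended up in.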

Here are example solutions:

mc3_p3_examples

mc3_p3_dyna_examples

Part 3: Implement Dyna

Add additional components to your QLearner class so that multiple "hallucinated" experience tuples are used to update the Q-table for each "real" experience. The addition of this component should speed convergence in terms of the number of calls to query().

We will test your code on world03.csv with 50 iterations and with dyna = 200. Our expectation is that with Dyna, the solution should be much better than without.
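One simple way to generate hallucinated experience, sketched below, is to remember the last observed outcome of each (s, a) pair and replay randomly chosen entries after every real update. This assumes a deterministic world and is only one of several possible Dyna designs; the names dyna_updates and model are hypothetical.

```python
import random

def dyna_updates(Q, model, n, alpha, gamma, rng):
    """Replay n remembered transitions from model to refine Q in place."""
    seen = sorted(model)  # the (s, a) pairs observed so far
    for _ in range(n):
        s, a = rng.choice(seen)
        s_prime, r = model[(s, a)]
        target = r + gamma * max(Q[s_prime])
        Q[s][a] = (1.0 - alpha) * Q[s][a] + alpha * target

# Three-state chain with one action: 0 -> 1 costs -1, 1 -> 2 pays +1.
Q = [[0.0], [0.0], [0.0]]
model = {(0, 0): (1, -1.0), (1, 0): (2, 1.0)}  # (s, a) -> (s_prime, r)
dyna_updates(Q, model, n=200, alpha=0.2, gamma=0.9, rng=random.Random(0))
```

After the replayed updates, Q[1][0] approaches 1.0 and Q[0][0] approaches -1.0 + 0.9 * 1.0 = -0.1 without any further interaction with the world, which is the source of the faster convergence per call to query().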

Part 4: Implement Strategy Learner

For this part of the project you should develop a learner that can learn a trading policy using your Q-Learner. Utilize the template provided in StrategyLearner.py. Overall, the structure of your strategy learner should be arranged like this:

For the policy learning part:

Select several technical features, and compute their values for the training data
Discretize the values of the features
Instantiate a Q-learner
For each day in the training data:
- Compute the current state (including holding)
- Compute the reward for the last action
- Query the learner with the current state and reward to get an action
- Implement the action the learner returned (BUY, SELL, NOTHING), and update portfolio value
Repeat the above loop multiple times until cumulative return stops improving.
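The discretization and state computation steps above might look like the following sketch for a single feature. The helper names, the choice of equal-sized bins, and the feature values are all assumptions for illustration; with several features you would combine their bin indices in the same positional way.

```python
import bisect

def make_thresholds(values, steps):
    """Return steps - 1 cut points that split the values into equal-sized bins."""
    ordered = sorted(values)
    n = len(ordered)
    return [ordered[(i * n) // steps] for i in range(1, steps)]

def discretize(value, thresholds):
    """Return the bin index (0 .. len(thresholds)) that value falls into."""
    return bisect.bisect_right(thresholds, value)

def to_state(feature_bin, holding, steps):
    """Combine a feature bin with the holding (-100, 0 or +100) into one integer."""
    holding_index = {-100: 0, 0: 1, 100: 2}[holding]
    return holding_index * steps + feature_bin

training_values = [i / 10.0 for i in range(-10, 10)]  # 20 made-up feature values
thresholds = make_thresholds(training_values, steps=4)
state = to_state(discretize(0.35, thresholds), holding=100, steps=4)
```

With 4 bins and 3 possible holdings this gives 12 states, so the Q-learner would be instantiated with num_states = 12. Thresholds should be computed from the training data only and then reused unchanged when testing.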

A rule to keep in mind: As in past projects, you can only be long or short 100 shares, so if your learner returns two BUYs in a row, don't double down. The same applies to SELLs.
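A minimal sketch of enforcing this rule when turning the learner's actions into trades is shown below. The action encoding (0, 1, 2) and the helper name trade_for are hypothetical:

```python
BUY, SELL, NOTHING = 0, 1, 2  # hypothetical action encoding

def trade_for(action, holding):
    """Return the share change for an action without exceeding +/-100 shares."""
    if action == BUY and holding < 100:
        return 100
    if action == SELL and holding > -100:
        return -100
    return 0

holding = 0
trades = []
for action in [BUY, BUY, SELL, SELL, SELL, NOTHING]:
    trade = trade_for(action, holding)
    holding += trade
    trades.append(trade)
```

The second BUY and the third SELL are ignored because the position is already at its limit, so the recorded trades are [100, 0, -100, -100, 0, 0].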

For the policy testing part:

For each day in the testing data:
- Compute the current state
- Query the learner with the current state to get an action
- Implement the action the learner returned (BUY, SELL, NOTHING), and update portfolio value
Return the resulting trades in a data frame (details below).

Your StrategyLearner should implement the following API:

import datetime as dt
import StrategyLearner as sl
learner = sl.StrategyLearner(verbose = False) # constructor
learner.addEvidence(symbol = "IBM", sd=dt.datetime(2008,1,1), ed=dt.datetime(2009,1,1), sv = 10000) # training step
df_trades = learner.testPolicy(symbol = "IBM", sd=dt.datetime(2009,1,1), ed=dt.datetime(2010,1,1), sv = 10000) # testing step

The input parameters are:

verbose: if False do not generate any output
symbol: the stock symbol to train on
sd: A datetime object that represents the start date
ed: A datetime object that represents the end date
sv: Start value of the portfolio

The output result is:

df_trades: A data frame whose values represent trades for each day. Legal values are +100.0 indicating a BUY of 100 shares, -100.0 indicating a SELL of 100 shares, and 0.0 indicating NOTHING [ values of +200 and -200 for trades are also legal so long as net holdings are constrained to -100, 0, and 100].
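Before submitting, you may want to check that a sequence of trades respects these limits. The sketch below works on a plain list of trade values (for example, one column of df_trades converted to a list); the function name is hypothetical and the check assumes the portfolio starts with no shares.

```python
def holdings_are_legal(trades):
    """Check that each trade is legal and net holdings stay in {-100, 0, 100}."""
    holding = 0
    for trade in trades:
        if trade not in (-200, -100, 0, 100, 200):
            return False
        holding += trade
        if holding not in (-100, 0, 100):
            return False
    return True

ok = holdings_are_legal([100.0, 0.0, -200.0, 0.0, 100.0])
doubled = holdings_are_legal([100.0, 100.0])
```

The first sequence moves from long 100 to short 100 with a single -200 trade, which is allowed; the second doubles down to 200 shares, which is not.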

Contents of Report

There is no report component of this assignment. However, if you would like to impress us with your Machine Learning prowess, you are invited to submit a succinct report.

Hints & Resources

This paper by Kaelbling, Littman and Moore is a good resource for RL in general: http://www.jair.org/media/301/live-301-1562-jair.pdf See Section 4.2 for details on Q-Learning.

There is also a chapter in the Mitchell book on Q-Learning.

For implementing Dyna, you may find the following resources useful:

What to turn in

Turn your project in via t-square. All of your code must be contained within QLearner.py and StrategyLearner.py.

Your QLearner as QLearner.py
Your StrategyLearner as StrategyLearner.py
Your report (if any) as report.pdf
Do not submit any other files.

Rubric

Only your QLearner class will be tested.

For basic Q-Learning (dyna = 0) we will test your learner against 10 test worlds with 500 iterations. Each test should complete in less than 2 seconds. For the test to be successful, your learner should find a path to the goal <= 1.5 x the number of steps our reference solution finds. We will check this by taking the min of all the 500 runs. Each test case is worth 8 points. We will initialize your learner with the following parameter values:

    learner = ql.QLearner(num_states=100,\
        num_actions = 4, \
        alpha = 0.2, \
        gamma = 0.9, \
        rar = 0.98, \
        radr = 0.999, \
        dyna = 0, \
        verbose=False) #initialize the learner

For Dyna-Q, we will set dyna = 200. We will test your learner against world03.csv with 50 iterations. The test should complete in less than 10 seconds. For the test to be successful, your learner should find a path to the goal <= 1.5 x the number of steps our reference solution finds. We will check this by taking the min of all 50 runs. The test case is worth 5 points. We will initialize your learner with the following parameter values:

    learner = ql.QLearner(num_states=100,\
        num_actions = 4, \
        alpha = 0.2, \
        gamma = 0.9, \
        rar = 0.5, \
        radr = 0.99, \
        dyna = 200, \
        verbose=False) #initialize the learner

We will test StrategyLearner in the following situations:
- Training: Dec 31 2007 to Dec 31 2009
- Testing: Dec 31 2009 to Dec 31 2011
- Symbols: ML4T-220, IBM
- Starting value: $10,000
- Benchmark: Buy 100 shares on the first trading day, Sell 100 shares on the last day.
We expect the following outcomes in testing:
- For ML4T-220, the trained policy should significantly outperform the benchmark in sample (7 points)
- For ML4T-220, the trained policy should significantly outperform the benchmark out of sample (7 points)
- For IBM, the trained policy should significantly outperform the benchmark in sample (7 points)
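For comparison purposes, the benchmark's cumulative return can be computed as in the sketch below. The function name and the prices are made up for illustration, and the sketch assumes no commissions or other transaction costs.

```python
def benchmark_cumulative_return(prices, sv=10000.0, shares=100):
    """Cumulative return of buying on the first day and selling on the last."""
    cash_after_buy = sv - shares * prices[0]
    final_value = cash_after_buy + shares * prices[-1]
    return final_value / sv - 1.0

prices = [100.0, 104.0, 98.0, 110.0]  # made-up prices, not real data
bench = benchmark_cumulative_return(prices)
```

With these prices the position gains 100 * (110 - 100) = $1,000 on a $10,000 starting value, a cumulative return of 0.10. Your trained policy's cumulative return over the same period is the number to compare against it.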

Training and testing for each situation should run in less than 30 seconds. We reserve the right to use different time periods if necessary to reduce auto grading time.

Required, Allowed, & Prohibited

Required:

Your project must be coded in Python 2.7.x.
Your code must run on one of the university-provided computers (e.g. buffet02.cc.gatech.edu), or on one of the provided virtual images.

Allowed:

You can develop your code on your personal machine, but it must also run successfully on one of the university provided machines or virtual images.
Your code may use standard Python libraries.
You may use the NumPy, SciPy, matplotlib and Pandas libraries. Be sure you are using the correct versions.
Code provided by the instructor, or allowed by the instructor to be shared.
Use util.py (only) for reading data.

Prohibited:

Any libraries not listed in the "allowed" section above.
Any code you did not write yourself
Any Classes (other than Random) that create their own instance variables for later use (e.g., learners like kdtree).
Print statements outside "verbose" checks (they significantly slow down auto grading).
Any method for reading data besides util.py


@@ Line 1: / Line 1: @@
-==FAQs==
+==Overview==
+In this project you will implement the Q-Learning and Dyna-Q solutions to the reinforcement learning problem.  You will apply them to two problems: 1) Navigation, and 2) Trading.  The reason for working with the navigation problem first is that, as you will see, navigation is an easy problem to work with and understand.  In the last part of the assignment you will apply Q-Learning to stock trading.
+Note that your Q-Learning code really shouldn't care which problem it is solving.  The difference is that you need to wrap the learner in different code that frames the problem for the learner as necessary.
+For the navigation problem we have created testqlearner.py that automates testing of your Q-Learner in the navigation problem.  We also provide teststrategylearner.py to test your strategy learner.  In order to apply Q-learning to trading you will have to implement an API that calls Q-learning internally.
+Overall, your tasks for this project include:
+* Code a Q-Learner
+* Code the Dyna-Q feature of Q-Learning
+* Test/debug the Q-Learner in navigation problems
+* Build a strategy learner based on your Q-Learner
+* Test/debug the strategy learner on specific symbol/time period problems
+Scoring for the project will be allocated as follows:
+* Navigation test cases: 80% (note that we will check those with dyna = 0)
+* Dyna implemented: 5% (we will check this with one navigation test case by comparing performance with and without dyna turned on)
+* Trading strategy test cases: 20%
+For this assignment we will test only your code (there is no report component).  Note that the scoring is structured so that you can earn a B (80%) if you implement only Q-Learning, but if you implement everything, the total possible score is 105%.  That means you can earn up to 5% extra credit on this project ( == 1% extra credit on the final course grade).
+==Template and Data==
+* Download <tt>'''[[Media:mc3_p3.zip|mc3_p3.zip]]'''</tt>, unzip inside <tt>ml4t/</tt>
+* Implement the <tt>QLearner</tt> class in <tt>mc3_p3/QLearner.py</tt>.
+* Implement the <tt>StrategyLearner</tt> class in <tt>mc3_p3/StrategyLearner.py</tt>
+* To test your Q-learner, run <tt>'''python testqlearner.py'''</tt> from the <tt>mc3_p3/</tt> directory.
+* To test your strategy learner, run <tt>'''python teststrategylearner.py'''</tt> from the <tt>mc3_p3/</tt> directory.
+* Note that example problems are provided in the <tt>mc3_p3/testworlds</tt> directory
+==Part 1: Implement QLearner==
-* Q: In a previous project there was a constraint of holding a single position until exit. Does that apply to this project?  Yes, hold one position until exit.
+Your QLearner class should be implemented in the file <tt>QLearner.py</tt>.  It should implement EXACTLY the API defined below.  DO NOT import any modules besides those allowed below.  Your class should implement the following methods:
-* Q: Is that 5 calendar days, or 5 trading days (i.e., days when SPY was traded)? A: Always use trading days.
+* QLearner(...): Constructor, see argument details below.
+* query(s_prime, r): Update Q-table with <s, a, s_prime, r> and return new action for state s_prime, update rar.
+* querysetstate(s): Set state to s, return action for state s, but don't update Q-table or rar.
-* Q: Are there constraints for Python modules allowed for this project? Can we experiment with modules for optimization or technical analysis and cite or are we expected to write everything from scratch for this project as well?  A: You can use scikit modules as long as you cite them..  You've already written the learners you need though.
+Here's an example of the API in use:
-* Q: Can we change our policy to work better for IBM vs the sine data? A: No, you must use the same indicators, policy, etc. for both.  I suggest you optimize first for IBM, then go back to the sine data because almost anything should work with the sine data.
+<pre>
+import QLearner as ql
-* Q: I want to read some other values from the data besides just adjusted close, how can I do that? A: Please modify an old version of util.py to do that, include that new util.py with your submission.
+learner &#61; ql.QLearner(num_states &#61; 100, \
+    num_actions &#61; 4, \
+    alpha &#61; 0.2, \
+    gamma &#61; 0.9, \
+    rar &#61; 0.98, \
+    radr &#61; 0.999, \
+    dyna &#61; 0, \
+    verbose &#61; False)
-* Q: Are we required to trade in only 100 share blocks? (and have no more than 100 shares long or short at a time as in some of the previous assignments)  A: Yes.  This will enable comparison between results more easily.
+s &#61; 99 # our initial state
-* Q: Are we limited to leverage of 2.0 on the portfolio?  A: There is no limit on leverage.
-* Q: Are we only allowed one position at a time?  A: You can be in one of three states: -100 shares, +100 shares, 0 shares.
-* Q: Are we supposed to build one policy that we use on both SINE and IBM? A: Yes, all parameters for your learner and policy should be the same.  The difference is the DATA.
+a &#61; learner.querysetstate(s) # action for state s
-==Overview==
+s_prime &#61; 5 # the new state we end up in after taking action a in state s
-In this project you will transform your regression learner into a stock trading strategy.  You should train a learner to predict the change in price of a stock over the next five trading days (one week). You will use data from Dec 31 2007 to 2009 to train your prediction model, then you will test it from Dec 31 2009 to 2011.
+r &#61; 0 # reward for taking action a in state s
-Now, just predicting the change in price isn't enough, you need to also code a policy that uses the forecaster you built to buy or sell shares.  Your policy should buy when it thinks the price will go up, and short when it thinks the price will go down.  You can then feed those buy and sell orders into your market simulator to backtest the strategy.  For ease of comparison between strategies, please observe these rules:
+next_action &#61; learner.query(s_prime, r)
+</pre>
-* Starting cash is $10,000.
-* Allowable positions are: 100 shares long, 100 shares short, 0 shares.
-* There is no limit on leverage.
-Finding features, a learner, and a policy that all work together to provide a reliably winning strategy with live stock data is HARD!  It is possible, and people have done it, but we can't reasonably expect you to be successful at it in this short class.  Accordingly, we want you to work with some easy data first, namely we will provide you with sinusoidal historical price data.  Once you've got something that works with that, you can try your learner on real stock data.
+<b>The constructor QLearner()</b> should reserve space for keeping track of Q[s, a] for the number of states and actions.  It should initialize Q[] with uniform random values between -1.0 and 1.0.  Details on the input arguments to the constructor:
-==Detailed steps==
+* <tt>num_states</tt> integer, the number of states to consider
+* <tt>num_actions</tt>  integer, the number of actions available.
+* <tt>alpha</tt> float, the learning rate used in the update rule. Should range between 0.0 and 1.0 with 0.2 as a typical value.
+* <tt>gamma</tt> float, the discount rate used in the update rule.  Should range between 0.0 and 1.0 with 0.9 as a typical value.
+* <tt>rar</tt> float, random action rate: the probability of selecting a random action at each step. Should range between 0.0 (no random actions) to 1.0 (always random action) with 0.5 as a typical value.
+* <tt>radr</tt> float, random action decay rate, after each update, rar &#61; rar * radr. Ranges between 0.0 (immediate decay to 0) and 1.0 (no decay).  Typically 0.99.
+* <tt>dyna</tt> integer, conduct this number of dyna updates for each regular update.  When Dyna is used, 200 is a typical value.
+* <tt>verbose</tt> boolean, if True, your class is allowed to print debugging statements, if False, all printing is prohibited.
-Overall, you should follow these steps:
-* Train a regression learner (KNN or LinReg, or other of your choice with or without bagging) on data from Dec 31 2007 to Dec 31 2009.  This is your in sample training data.
+<b>query(s_prime, r)</b> is the core method of the Q-Learner.  It should keep track of the last state s and the last action a, then use the new information s_prime and r to update the Q table.  The learning instance, or experience tuple is <s, a, s_prime, r>.  query() should return an integer, which is the next action to take.  Note that it should choose a random action with probability rar, and that it should update rar according to the decay rate radr at each step.  Details on the arguments:
-** For your X values: Identify and implement at least 3 technical features that you believe may be predictive of future return. You should implement them so they output values typically ranging from -1.0 to 1.0.  This will help avoid the situation where one feature overwhelms the results. See a few formulae below.
-** For your Y values: Use future 5 day return (not future price).  You're trying to predict a relative change that you can use to invest with.
-* Create a plot that illustrates your training Y values in one color, current price in another color and your model's PREDICTED Y in a third color. To help with the visualization, you should adjust your training Y and predicted Y so that they are at the same scale as the current price. With this chart we should be able to see how well your learner performs and that your Y values are shifted back 5 days.  You may find it convenient to zoom in on a particular time period so this is evident.
-* Create a trading policy based on what your learner predicts for future return.  As an example you might choose to buy when the forecaster predicts the price will go up more than 1%, then hold for 5 days.
-* Create a plot that illustrates entry and exits as vertical lines on a price chart for the in sample period Dec 31 2007 to Dec 31 2009. Show long entries as green lines, short entries as red lines and exits as black lines. You may find it convenient to zoom in on a particular time period so this is evident.
-* Now use your code to generate orders and run those orders through your market simulator.  Create a chart of this backtest.  It should do VERY well for the in sample period Dec 31 2007 to Dec 31 2009.
-* Freeze your model based on the Dec 31 2007 to Dec 31 2009 training data.  Now test it out of sample over the period Dec 31 2009 to Dec 31 2011.  Create a plot that illustrates entry & exits, generate trades, run through your simulator, chart the backtest.
-Perform the above steps first using the data ML4T-240.csv.  Once you've validated success (it should work well), repeat using IBM data over the same dates.  Remember Dec 31 2007 to Dec 31 2009 is training, Dec 31 2009 to Dec 31 2011 is testing.  You should have one set of charts for each symbol.
+* <tt>s_prime</tt> integer, the the new state.
+* <tt>r</tt> float, a real valued immediate reward.
-==Summary of Plots To Create==
+<b>querysetstate(s)</b> A special version of the query method that sets the state to s, and returns an integer action according to the same rules as query() (including choosing a random action sometimes), but it does not execute an update to the Q-table.  It also does not update rar. There are two main uses for this method: 1) To set the initial state, and 2) when using a learned policy, but not updating it.
-# Sine data in-sample Training Y/Price/Predicted Y: Create a plot that illustrates your training Y values in one color, current price in another color and your model's PREDICTED Y in a third color. To help with the visualization, you should adjust your training Y and predicted Y so that it is at the same scale as the current price.
+==Part 2: Navigation Problem Test Cases==
-# Sine data in-sample Entries/Exits: Create a plot that illustrates entry and exits as vertical lines on a price chart for the in sample period. Show long entries as green lines, short entries as red lines and exits as black lines. You may find it convenient to zoom in on a particular time period so this is evident.
-# Sine data in-sample backtest
-# Sine data out-of-sample Entries/Exits: Freeze your model based on the in-sample data. Now test it for the the out-of-sample period. Plot the entry & exits, generate trades,
-# Sine data out-of-sample backtest.
-# IBM data in-sample Entries/Exits: Create a plot that illustrates entry and exits as vertical lines on a price chart for the in sample period 2008-2009. Show long entries as green lines, short entries as red lines and exits as black lines. You may find it convenient to zoom in on a particular time period so this is evident.
-# IBM data in-sample backtest
-# IBM data out-of-sample Entries/Exits
-# IBM data out-of-sample backtest
-==Template and Data==
-You should create a directory for your code in ml4t/p4.  You will have access to the data in the ML4T/Data directory but you should use ONLY the code in util.py to read it.  In particular files named ML4T-240.csv, and IBM.csv.
+We will test your Q-Learner with a navigation problem as follows.  Note that your Q-Learner does not need to be coded specially for this task.  In fact the code doesn't need to know anything about it.  The code necessary to test your learner with this navigation task is implemented in testqlearner.py for you.  The navigation task takes place in a 10 x 10 grid world.  The particular environment is expressed in a CSV file of integers, where the value in each position is interpreted as follows:
-==Choosing Technical Features -- Your X Values==
+* 0: blank space.
+* 1: an obstacle.
+* 2: the starting location for the robot.
+* 3: the goal location.
-Here's a suggestion of how to normalize Bollinger Bands so that feature so that it will typically provide values between -1.0 and 1.0:
+An example navigation problem (CSV file) is shown below.  Following python conventions, [0,0] is upper left, or northwest corner, [9,9] lower right or southeast corner.  Rows are north/south, columns are east/west.
 <PRE>
-bb_value[t] = (price[t] - SMA[t])/(2 * stdev[t])
+0,0,0,0,3,0,0,0,0,0
+0,0,0,0,0,0,0,0,0,0
+0,0,0,0,0,0,0,0,0,0
+0,0,1,1,1,1,1,0,0,0
+0,0,1,0,0,0,1,0,0,0
+0,0,1,0,0,0,1,0,0,0
+0,0,1,0,0,0,1,0,0,0
+0,0,0,0,0,0,0,0,0,0
+0,0,0,0,0,0,0,0,0,0
+0,0,0,0,2,0,0,0,0,0
 </PRE>
-Two other good features worth considering are momentum and volatility.
+In this example the robot starts at the bottom center, and must navigate to the top center.  Note that a wall of obstacles blocks its path.  We map this problem to a reinforcement learning problem as follows:
+* State: The state is the location of the robot, it is computed (discretized) as: column location * 10 + row location.
+* Actions: There are 4 possible actions, 0: move north, 1: move east, 2: move south, 3: move west.
+* R: The reward is -1.0 unless the action leads to the goal, in which case the reward is +1.0.
+* T: The transition matrix can be inferred from the CSV map and the actions.
+Note that R and T are not known by or available to the learner.  The testing code <tt>testqlearner.py</tt> will test your code as follows (pseudo code):
+<pre>
+Instantiate the learner with the constructor QLearner()
+s = initial_location
+a = querysetstate(s)
+s_prime = new location according to action a
+r = -1.0
+while not converged:
+    a = query(s_prime, r)
+    s_prime = new location according to action a
+    if s_prime == goal:
+        r = +1
+        s_prime = start location
+    else
+        r = -1
+</pre>
+A few things to note about this code: The learner always receives a reward of -1.0 until it reaches the goal, when it receives a reward of +1.0. As soon as the robot reaches the goal, it is immediately returned to the starting location.
+Here are example solutions:
+[[mc3_p3_examples]]
+[[mc3_p3_dyna_examples]]
+==Part 3: Implement Dyna==
+Add additional components to your QLearner class so that multiple "hallucinated" experience tuples are used to update the Q-table for each "real" experience.  The addition of this component should speed convergence in terms of the number of calls to query().
+We will test your code on <tt>world03.csv</tt> with 50 iterations and with dyna = 200.  Our expectation is that with Dyna, the solution should be much better than without.
+==Part 4: Implement Strategy Learner==
+For this part of the project you should develop a learner that can learn a trading policy using your Q-Learner.  Utilize the template provided in <tt>StrategyLearner.py</tt> Overall the structure of your strategy learner should be arranged like this:
+For the policy learning part:
+* Select several technical features, and compute their values for the training data
+* Discretize the values of the features
+* Instantiate a Q-learner
+* For each day in the training data:
+** Compute the current state (including holding)
+** Compute the reward for the last action
+** Query the learner with the current state and reward to get an action
+** Implement the action the learner returned (BUY, SELL, NOTHING), and update portfolio value
+* Repeat the above loop multiple times until cumulative return stops improving.
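The discretization and state-encoding steps in the list above could be sketched as follows. This is only one possible scheme: the function names, the choice of 10 equal-count bins, and the holding index convention (0 = short, 1 = flat, 2 = long) are assumptions for illustration. It also assumes there are at least as many training values as bins.

```python
import bisect


def make_thresholds(values, steps=10):
    """Return steps-1 cut points splitting the training values into equal-count bins."""
    ordered = sorted(values)
    size = len(ordered) // steps
    return [ordered[i * size] for i in range(1, steps)]


def discretize(value, thresholds):
    """Map a feature value to a bin index in 0..len(thresholds)."""
    return bisect.bisect_right(thresholds, value)


def encode_state(bins, holding_index, steps=10, num_holdings=3):
    """Pack feature bins and a holding index into a single integer state."""
    state = 0
    for b in bins:
        state = state * steps + b
    return state * num_holdings + holding_index
```

With two features and 10 bins each, this encoding yields 10 * 10 * 3 = 300 distinct states, which would be the num_states value passed to the Q-learner. Thresholds should be computed from the training data only and then reused unchanged when testing.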
+A rule to keep in mind: As in past projects, you can only be long or short 100 shares.  If your learner returns two BUYs in a row, don't double down; the same applies to consecutive SELLs.
+For the policy testing part:
+* For each day in the testing data:
+** Compute the current state
+** Query the learner with the current state to get an action
+** Implement the action the learner returned (BUY, SELL, NOTHING), and update portfolio value
+* Return the resulting trades in a data frame (details below).
+Your StrategyLearner should implement the following API:
 <PRE>
-momentum[t] = (price[t]/price[t-N]) - 1
+import StrategyLearner as sl
+learner = sl.StrategyLearner(verbose = False) # constructor
+learner.addEvidence(symbol = "IBM", sd=dt.datetime(2008,1,1), ed=dt.datetime(2009,1,1), sv = 10000) # training step
+df_trades = learner.testPolicy(symbol = "IBM", sd=dt.datetime(2009,1,1), ed=dt.datetime(2010,1,1), sv = 10000) # testing step
 </PRE>
-Volatility is just the stdev of daily returns.
+The input parameters are:
+* verbose: if False do not generate any output
+* symbol: the stock symbol to train on
+* sd: A datetime object that represents the start date
+* ed: A datetime object that represents the end date
+* sv: Start value of the portfolio
+The output result is:
+* df_trades: A data frame whose values represent trades for each day.  Legal values are +100.0 indicating a BUY of 100 shares, -100.0 indicating a SELL of 100 shares, and 0.0 indicating NOTHING.  Values of +200 and -200 for trades are also legal so long as net holdings are constrained to -100, 0, and 100.
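A minimal sketch of turning a sequence of learner actions into legal trade values is shown below. The action codes BUY, SELL and NOTHING and the function name actions_to_trades are hypothetical, and the sketch returns a plain list to stay dependency-free; the actual API expects the values in a data frame. Here BUY is interpreted as "move to a long position of 100 shares" and SELL as "move to a short position of 100 shares", which is one interpretation that keeps net holdings within -100, 0 and 100.

```python
BUY, SELL, NOTHING = 0, 1, 2  # hypothetical action codes


def actions_to_trades(actions, shares=100):
    """Convert actions into per-day trades while capping holdings at +/- shares."""
    holding = 0
    trades = []
    for a in actions:
        if a == BUY:
            target = shares        # go (or stay) long
        elif a == SELL:
            target = -shares       # go (or stay) short
        else:
            target = holding       # keep the current position
        trades.append(float(target - holding))
        holding = target
    return trades
```

Under this interpretation a repeated BUY produces a trade of 0.0 rather than doubling the position, and a BUY issued while short produces a trade of +200.0.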
+==Contents of Report==
-==Choosing Y==
-Your code should predict 5 day change in price.  You need to build a new Y that reflects the 5 day change and aligns with the current date.  Here's pseudo code for the calculation of Y
+There is no report component of this assignment.  However, if you would like to impress us with your Machine Learning prowess, you are invited to submit a succinct report.
- Y[t] = (price[t+5]/price[t]) - 1.0
+==Hints & Resources==
-If you select Y in this manner and use it for training, your learner will predict 5 day returns.
-==Contents of Report==
+This paper by Kaelbling, Littman and Moore is a good resource for RL in general: http://www.jair.org/media/301/live-301-1562-jair.pdf  See Section 4.2 for details on Q-Learning.
-* Your report should be no more than 2500 words.  Your report should contain no more than 12 charts.  Penalties will apply if you violate these constraints.
+There is also a chapter in the Mitchell book on Q-Learning.
-* Include the charts listed in the overview section above.
-* Describe each of the indicators you have selected in enough detail that someone else could reproduce them in code.
-* Describe your trading policy clearly.
-* If you used any external code or ideas be sure to cite them in your code and in the report.
-* Discussion of results.  Did it work well?  Why?  What would you do differently?
-==Expectations==
+For implementing Dyna, you may find the following resources useful:
-* In-sample sine and in-sample IBM backtests should both perform very well -- better than the manual policy you created for the last assignment.
+* https://webdocs.cs.ualberta.ca/~sutton/book/ebook/node96.html
-* Out-of-sample sine backtest should perform nearly identically as the in-sample test.
+* http://www-anw.cs.umass.edu/~barto/courses/cs687/Chapter%209.pdf
-* Out-of-sample IBM backtest should... (you should be able to complete this sentence).
 ==What to turn in==
-Turn your project in via t-square.
-* Your report as <tt>report.pdf</tt>
+Turn your project in via t-square.   All of your code must be contained within QLearner.py and StrategyLearner.py.
-* All of your code, as necessary to run as <tt>.py</tt> files.
-* Document how to run your code in <tt>readme.txt</tt>.
+* Your QLearner as <tt>QLearner.py</tt>
-* No zip files please.
+* Your StrategyLearner as <tt>StrategyLearner.py</tt>
+* Your report (if any) as <tt>report.pdf</tt>
+* Do not submit any other files.
+==Rubric==
+Only your QLearner class will be tested.
+* For basic Q-Learning (dyna = 0) we will test your learner against 10 test worlds with 500 iterations each.  Each test should complete in less than 2 seconds.  For the test to be successful, your learner should find a path to the goal in no more than 1.5 times the number of steps our reference solution finds.  We will check this by taking the minimum number of steps over all 500 runs.  Each test case is worth 8 points.  We will initialize your learner with the following parameter values:
-==Extra credit up to 3%==
+<Pre>
+    learner = ql.QLearner(num_states=100,\
+        num_actions = 4, \
+        alpha = 0.2, \
+        gamma = 0.9, \
+        rar = 0.98, \
+        radr = 0.999, \
+        dyna = 0, \
+        verbose=False) #initialize the learner
+</PRE>
-Choose one or more of the following:
+* For Dyna-Q, we will set dyna = 200.  We will test your learner against <tt>world03.csv</tt> with 50 iterations.  The test should complete in less than 10 seconds.  For the test to be successful, your learner should find a path to the goal in no more than 1.5 times the number of steps our reference solution finds.  We will check this by taking the minimum number of steps over all 50 runs.  The test case is worth 5 points.  We will initialize your learner with the following parameter values:
-* Compare the performance of KNN and LinReg in this task.  The instructor anticipates that LinReg might work well.  If that turns out to be the case, how can that be?  This is a non-linear task isn't it?
+<Pre>
-* Extend your code to create a "rolling" model that updates each day rolling forward.
+    learner = ql.QLearner(num_states=100,\
-* Extend your code to simultaneously forecast all the members of the S&P 500.  Generate trades accordingly, and backtest the result.
+        num_actions = 4, \
+        alpha = 0.2, \
+        gamma = 0.9, \
+        rar = 0.5, \
+        radr = 0.99, \
+        dyna = 200, \
+        verbose=False) #initialize the learner
+</PRE>
-Submit to the extra credit assignment on t-square.  One single PDF file only, max 1000 words.
+* We will test StrategyLearner in the following situations:
+** Training: Dec 31 2007 to Dec 31 2009
+** Testing: Dec 31 2009 to Dec 31 2011
+** Symbols: ML4T-220, IBM
+** Starting value: $10,000
+** Benchmark: Buy 100 shares on the first trading day, Sell 100 shares on the last day.
+* We expect the following outcomes in testing:
+** For ML4T-220, the trained policy should significantly outperform the benchmark in sample (7 points)
+** For ML4T-220, the trained policy should significantly outperform the benchmark out of sample (7 points)
+** For IBM, the trained policy should significantly outperform the benchmark in sample (7 points)
-==Rubric==
+Training and testing for each situation should run in less than 30 seconds.  We reserve the right to use different time periods if necessary to reduce auto grading time.
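For reference, the benchmark described in the rubric above can be computed with a few lines. This sketch assumes a plain list of daily prices and ignores commissions and other transaction costs; the function name and its parameters are illustrative only.

```python
def benchmark_cumulative_return(prices, start_value=10000.0, shares=100):
    """Cumulative return of buying `shares` on the first day and selling on the last."""
    cash_after_buy = start_value - shares * prices[0]
    final_value = cash_after_buy + shares * prices[-1]
    return final_value / start_value - 1.0
```

A trained policy's cumulative return over the same period can then be compared against this number.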
-* Are all 9 plots present and correct? -5 points for each missing plot.
+==Required, Allowed, & Prohibited==
-** Note: Correct in the sense that they properly display the information requested.  The result may not be the desired one.
-* Are comparative backtest results correct? (ML4T-240 in sample & out of sample, IBM in sample & out of sample) -10 points for each incorrect result.
-* Indicators used: Are descriptions of factors used sufficiently clear that others could reproduce them? Up to -10 points for lack of clarity.
-* Trading strategy: Is description sufficiently clear that others could reproduce it? Up to -10 points for lack of clarity.
-* Is discussion of results concise, complete, correct? Up to -5 points for each of concise, complete, correct.
-==Required, Allowed & Prohibited==
 Required:
 * Your project must be coded in Python 2.7.x.
 * Your code must run on one of the university-provided computers (e.g. buffet02.cc.gatech.edu), or on one of the provided virtual images.
-* Use only util.py to read data.  If you want to read items other than adjusted close, modify util.py to do it, and submit your new version with your code.
 Allowed:
@@ Line 140: / Line 278: @@
 * Your code may use standard Python libraries.
 * You may use the NumPy, SciPy, matplotlib and Pandas libraries.  Be sure you are using the correct versions.
-* You may use scikit learn libraries (note that you don't need them because you just wrote your own!).
-* You may reuse sections of code (up to 5 lines) that you collected from other students or the internet.
 * Code provided by the instructor, or allowed by the instructor to be shared.
-* A herring.
+* Use util.py (only) for reading data.
 Prohibited:
-* Any other method of reading data besides util.py
 * Any libraries not listed in the "allowed" section above.
-* Any code you did not write yourself (except for the 5 line rule in the "allowed" section).
+* Any code you did not write yourself.
+* Any Classes (other than Random) that create their own instance variables for later use (e.g., learners like kdtree).
+* Print statements outside "verbose" checks (they significantly slow down auto grading).
+* Any method for reading data besides util.py.