Difference between revisions of "MC3-Project-2"

Latest revision as of 15:37, 3 July 2017

Updates / FAQs

2017-03-29 Changed rubric to evaluate the median of all runs rather than the "best" of all runs.
2016-11-8 Changed rubric to evaluate the median of all runs rather than the "best" of all runs.

Overview

In this project you will implement the Q-Learning and Dyna-Q solutions to the reinforcement learning problem. You will apply them to a navigation problem in this project. In a later project you will apply them to trading. The reason for working with the navigation problem first is that, as you will see, navigation is an easy problem to work with and understand. Note that your Q-Learning code really shouldn't care which problem it is solving. The difference is that you need to wrap the learner in different code that frames the problem for the learner as necessary.

For the navigation problem we have created testqlearner.py that automates testing of your Q-Learner in the navigation problem.

Overall, your tasks for this project include:

Code a Q-Learner
Code the Dyna-Q feature of Q-Learning
Test/debug the Q-Learner in navigation problems

For this assignment we will test only your code (there is no report component).

Template and Data

Update your local mc3p2_qlearning_robot directory using github.
Implement the QLearner class in mc3p2_qlearning_robot/QLearner.py.
To debug your Q-learner, run python testqlearner.py from the mc3p2_qlearning_robot/ directory. The grading script for this project is grade_robot_qlearning.py.
Note that example navigation problems are provided in the mc3p2_qlearning_robot/testworlds directory.

Part 1: Implement Q-Learner (95%)

Your QLearner class should be implemented in the file QLearner.py. It should implement EXACTLY the API defined below. DO NOT import any modules besides those allowed below. Your class should implement the following methods:

The constructor QLearner() should reserve space for keeping track of Q[s, a] for the number of states and actions. It should initialize Q[] with all zeros. Details on the input arguments to the constructor:

num_states integer, the number of states to consider
num_actions integer, the number of actions available.
alpha float, the learning rate used in the update rule. Should range between 0.0 and 1.0 with 0.2 as a typical value.
gamma float, the discount rate used in the update rule. Should range between 0.0 and 1.0 with 0.9 as a typical value.
rar float, random action rate: the probability of selecting a random action at each step. Should range between 0.0 (no random actions) to 1.0 (always random action) with 0.5 as a typical value.
radr float, random action decay rate, after each update, rar = rar * radr. Ranges between 0.0 (immediate decay to 0) and 1.0 (no decay). Typically 0.99.
dyna integer, conduct this number of dyna updates for each regular update. When Dyna is used, 200 is a typical value.
verbose boolean, if True, your class is allowed to print debugging statements, if False, all printing is prohibited.

query(s_prime, r) is the core method of the Q-Learner. It should keep track of the last state s and the last action a, then use the new information s_prime and r to update the Q table. The learning instance, or experience tuple is <s, a, s_prime, r>. query() should return an integer, which is the next action to take. Note that it should choose a random action with probability rar, and that it should update rar according to the decay rate radr at each step. Details on the arguments:

s_prime integer, the the new state.
r float, a real valued immediate reward.

querysetstate(s) A special version of the query method that sets the state to s, and returns an integer action according to the same rules as query() (including choosing a random action sometimes), but it does not execute an update to the Q-table. It also does not update rar. There are two main uses for this method: 1) To set the initial state, and 2) when using a learned policy, but not updating it.

Here's an example of the API in use:

import QLearner as ql

learner = ql.QLearner(num_states = 100, \ 
    num_actions = 4, \
    alpha = 0.2, \
    gamma = 0.9, \
    rar = 0.98, \
    radr = 0.999, \
    dyna = 0, \
    verbose = False)

s = 99 # our initial state

a = learner.querysetstate(s) # action for state s

s_prime = 5 # the new state we end up in after taking action a in state s

r = 0 # reward for taking action a in state s

next_action = learner.query(s_prime, r)

Part 2: Navigation Problem Test Cases

We will test your Q-Learner with a navigation problem as follows. Note that your Q-Learner does not need to be coded specially for this task. In fact the code doesn't need to know anything about it. The code necessary to test your learner with this navigation task is implemented in testqlearner.py for you. The navigation task takes place in a 10 x 10 grid world. The particular environment is expressed in a CSV file of integers, where the value in each position is interpreted as follows:

0: blank space.
1: an obstacle.
2: the starting location for the robot.
3: the goal location.
5: quicksand.

An example navigation problem (world01.csv) is shown below. Following python conventions, [0,0] is upper left, or northwest corner, [9,9] lower right or southeast corner. Rows are north/south, columns are east/west.

3,0,0,0,0,0,0,0,0,0
0,0,0,0,0,0,0,0,0,0
0,0,0,0,0,0,0,0,0,0
0,0,1,1,1,1,1,0,0,0
0,5,1,0,0,0,1,0,0,0
0,5,1,0,0,0,1,0,0,0
0,0,1,0,0,0,1,0,0,0
0,0,0,0,0,0,0,0,0,0
0,0,0,0,0,0,0,0,0,0
0,0,0,0,2,0,0,0,0,0

In this example the robot starts at the bottom center, and must navigate to the top left. Note that a wall of obstacles blocks its path, and there is some quicksand along the left side. The objective is for the robot to learn how to navigate from the starting location to the goal with the highest total reward. We define the reward for each step as:

-1 if the robot moves to an empty or blank space, or attempts to move into a wall
-100 if the robot moves to a quicksand space
1 if the robot moves to the goal space

Overall, we will assess the performance of a policy as the median reward it incurs to travel from the start to the goal (higher reward is better). We assess a learner in terms of the reward it converges to over a given number of training epochs (trips from start to goal). Important note: the problem includes random actions. So, for example, if your learner responds with a "move north" action, there is some probability that the robot will actually move in a different direction. For this reason, the "wise" learner develops policies that keep the robot well away from quicksand. We map this problem to a reinforcement learning problem as follows:

State: The state is the location of the robot, it is computed (discretized) as: column location * 10 + row location.
Actions: There are 4 possible actions, 0: move north, 1: move east, 2: move south, 3: move west.
R: The reward is as described above.
T: The transition matrix can be inferred from the CSV map and the actions.

Note that R and T are not known by or available to the learner. The code in testqlearner.py will test your code as follows (pseudo code):

Instantiate the learner with the constructor QLearner()
s = initial_location
a = querysetstate(s)
s_prime = new location according to action a
r = -1.0
while not converged:
    a = query(s_prime, r) 
    s_prime = new location according to action a
    if s_prime == goal:
        r = +1
        s_prime = start location
    else if s_prime == quicksand:
        r = -100
    else:
        r = -1

A few things to note about this code: The learner always receives a reward of -1.0 (or -100.0) until it reaches the goal, when it receives a reward of +1.0. As soon as the robot reaches the goal, it is immediately returned to the starting location.

Here are example solutions. Note that these examples were created before we added "quicksand" to the project. We will be updating the examples to reflect this change. In the mean time, you may find these useful:

mc3_p2_examples

mc3_p2_dyna_examples

Part 3: Implement Dyna (5%)

Add additional components to your QLearner class so that multiple "hallucinated" experience tuples are used to update the Q-table for each "real" experience. The addition of this component should speed convergence in terms of the number of calls to query(), not necessarily running time.

Note that it is not important that you implement Dyna exactly as described in the lecture. The key requirement is that your code should somehow hallucinate additional experiences. The precise method you use for discovering those experiences is flexible. We will test your code on several test worlds with 50 epochs and with dyna = 200. Our expectation is that with Dyna, the solution should be much better after 50 epochs than without.

Part 4: Implement author() Method (0%)

You should implement a method called author() that returns your Georgia Tech user ID as a string. This is the ID you use to log into t-square. It is not your 9 digit student number. Here is an example of how you might implement author() within a learner object:

class QLearner(object):
    def author(self):
        return 'tb34' # replace tb34 with your Georgia Tech username.

And here's an example of how it could be called from a testing program:

    # create a learner and train it
    learner = ql.QLearner() # create a QLearner
    print learner.author()

Check the template code for examples. We are adding those to the repo now, but it might not be there if you check right away. Implementing this method correctly does not provide any points, but there will be a penalty for not implementing it.

Contents of Report

There is no report component of this assignment. However, if you would like to impress us with your Machine Learning prowess, you are invited to submit a succinct report.

Hints & resources

This paper by Kaelbling, Littman and Moore, is a good resource for RL in general: http://www.jair.org/media/301/live-301-1562-jair.pdf See Section 4.2 for details on Q-Learning.

There is also a chapter in the Mitchell book on Q-Learning.

For implementing Dyna, you may find the following resources useful:

What to turn in

Turn your project in via t-square. All of your code must be contained within QLearner.py .

Your QLearner as QLearner.py
Do not submit any other files.

Rubric

Only your QLearner class will be tested.

For basic Q-Learning (dyna = 0) we will test your learner against 10 test worlds with 500 epochs in each world. One "epoch" means your robot reaches the goal one time, or after 100000 steps, whichever comes first. Your QLearner retains its state (Q-table), and then we allow it to navigate to the goal again, over and over, 500 times. Each test (500 epochs) should complete in less than 2 seconds. NOTE: an epoch where the robot fails to reach the goal will likely take much longer (in running time), than one that does reach the goal, and is a common reason for failing to complete test cases within the time limit.
Benchmark: As a benchmark to compare your solution to, we will run our reference solution in the same world, with 500 epochs. We will take the median reward of our reference across all of those 500 epochs.
Your score: For each world we will take the median cost your solution finds across all 500 epochs.
For a test to be successful, your learner should find a total reward >= 1.5 x the benchmark. Note that since reward for a single epoch is negative, your solution can be up to 50% worse than the reference solution and still pass.
There are 10 test cases, each test case is worth 9.5 points.
Here is how we will initialize your QLearner for these test cases:

    learner = ql.QLearner(num_states=100,\
        num_actions = 4, \
        alpha = 0.2, \
        gamma = 0.9, \
        rar = 0.98, \
        radr = 0.999, \
        dyna = 0, \
        verbose=False) #initialize the learner

For Dyna-Q, we will set dyna = 200. We will test your learner against world01.csv and world02.csv with 50 epochs. Scoring is similar to the non-dyna case: Each test should complete in less than 10 seconds. For the test to be successful, your learner should find solution with total reward to the goal >= 1.5 x the median reward our reference solution across all 50 epochs. Note that since reward for a single epoch is negative, your solution can be up to 50% worse than the reference solution and still pass. We will check this by taking the median of all 50 runs. Each test case is worth 2.5 points. We will initialize your learner with the following parameter values for these test cases:

    learner = ql.QLearner(num_states=100,\
        num_actions = 4, \
        alpha = 0.2, \
        gamma = 0.9, \
        rar = 0.5, \
        radr = 0.99, \
        dyna = 200, \
        verbose=False) #initialize the learner

Is the author() method correctly implemented (-20% if not)

Required, Allowed & Prohibited

Required:

Your project must be coded in Python 2.7.x.
Your code must run on one of the university-provided computers (e.g. buffet02.cc.gatech.edu).

Allowed:

You can develop your code on your personal machine, but it must also run successfully on one of the university provided machines or virtual images.
Your code may use standard Python libraries.
You may use the NumPy, SciPy, matplotlib and Pandas libraries. Be sure you are using the correct versions.
You may reuse sections of code (up to 5 lines) that you collected from other students or the internet.
Code provided by the instructor, or allowed by the instructor to be shared.

Prohibited:

Any libraries not listed in the "allowed" section above.
Any code you did not write yourself (except for the 5 line rule in the "allowed" section).
Any Classes (other than Random) that create their own instance variables for later use (e.g., learners like kdtree).
Print statements outside "verbose" checks (they significantly slow down auto grading).
Any method for reading data besides util.py

Difference between revisions of "MC3-Project-2"

Latest revision as of 15:37, 3 July 2017

Contents

Updates / FAQs

Overview

Template and Data

Part 1: Implement Q-Learner (95%)

Part 2: Navigation Problem Test Cases

Part 3: Implement Dyna (5%)

Part 4: Implement author() Method (0%)

Contents of Report

Hints & resources

What to turn in

Rubric

Required, Allowed & Prohibited

Legacy

Navigation menu

Personal tools

Namespaces

Variants

Views

More

Search

Navigation

QuantSoftware Research Group

Spring 2020

Site

Tools

@@ Line 1: / Line 1: @@
 ==Updates / FAQs==
-* Q: In a previous project there was a constraint of holding a single position until exit. Does that apply to this project?  Yes, hold one position til exit.
+* '''2017-03-29''' Changed rubric to evaluate the '''median''' of all runs rather than the "best" of all runs.
+* '''2016-11-8''' Changed rubric to evaluate the '''median''' of all runs rather than the "best" of all runs.
-* Q: Is that 5 calendar days, or 5 trading days (i.e., days when SPY was traded)? A: Always use trading days.
-* Q: Are there constraints for Python modules allowed for this project? Can we experiment with modules for optimization or technical analysis and cite or are we expected to write everything from scratch for this project as well?  A: You can use scikit modules as long as you cite them..  You've already written the learners you need though.
-* Q: Can we change our policy to work better for IBM vs the sine data? A: No, you must use the same indicators, policy, etc. for both.  I suggest you optimize first for IBM, then go back to the sine data because almost anything should work with the sine data.
-'''2016-4-12'''
-* Q: I want to read some other values from the data besides just adjusted close, how can I do that? A: Please modify an old version of util.py to do that, include that new util.py with your submission.
-* Q: Are we required to trade in only 100 share blocks? (and have no more than 100 shares long or short at a time as in some of the previous assignments)  A: Yes.  This will enable comparison between results more easily.
-* Q: Are we limited to leverage of 2.0 on the portfolio?  A: There is no limit on leverage.
-* Q: Are we only allowed one position at a time?  A: You can be in one of three states: -100 shares, +100 shares, 0 shares.
-* Q: Are we supposed to build one policy that we use on both SINE and IBM? A: Yes, all parameters for your learner and policy should be the same.  The difference is the DATA.
 ==Overview==
-In this project you will develop trading strategies using Technical Analysis, and test them using your market simulator. You will then utilize your Random Tree learner to train and test a learning trading algorithm.
+In this project you will implement the Q-Learning and Dyna-Q solutions to the reinforcement learning problem.  You will apply them to a navigation problem in this project.  In a later project you will apply them to trading.  The reason for working with the navigation problem first is that, as you will see, navigation is an easy problem to work with and understand.  Note that your Q-Learning code really shouldn't care which problem it is solving.  The difference is that you need to wrap the learner in different code that frames the problem for the learner as necessary.
-* Part 1: Develop and describe a set of at least 3 technical indicators.  At least one of these indicators must be substantially different from the indicators whose code was presented in class.
+For the navigation problem we have created testqlearner.py that automates testing of your Q-Learner in the navigation problem.
-* Part 2: Devise and test a rule-based trading strategy using your indicators from Part 1.  Test its performance in sample using your market simulator.
-* Part 3: Use you decision tree learner to create a classifier that decides when to trade.  Test its performance in sample using your market simulator.
-* Part 4: Comparative analysis.
-In this project we shift from an auto graded format to a report format. For this project your grade will be based on the PDF report you submit, not your code. However, you will also submit your code that will be checked visually to ensure it appropriately matches the report you submit.
+Overall, your tasks for this project include:
-==Data Details, Dates and Rules==
+* Code a Q-Learner
+* Code the Dyna-Q feature of Q-Learning
+* Test/debug the Q-Learner in navigation problems
-Use the following parameters for Part 2, 3 and 4:
+For this assignment we will test only your code (there is no report component).
-* Use only the data provided for this course.  You are not allowed to import external data.
+==Template and Data==
-* Trade only the symbol IBM (however, you may, if you like, use data from other symbols to inform your strategy).
-* The in sample/training period is January 1, 2006 to December 31 2009.
-* The out of sample/testing period is January 1, 2010 to December 31 2010.
-* Starting cash is $100,000.
-* Allowable positions are: 500 shares long, 500 shares short, 0 shares.
-* There is no limit on leverage.
-==Part 1: Technical Indicators (20%)==
+* Update your local mc3p2_qlearning_robot directory using github.
+* Implement the <tt>QLearner</tt> class in <tt>mc3p2_qlearning_robot/QLearner.py</tt>.
+* To debug your Q-learner, run <tt>'''python testqlearner.py'''</tt> from the <tt>mc3p2_qlearning_robot/</tt> directory. The grading script for this project is <tt>grade_robot_qlearning.py</tt>.
+* Note that example navigation problems are provided in the <tt>mc3p2_qlearning_robot/testworlds</tt> directory.
-Develop and describe at least 3 and at most 5 technical indicators.  You may find our lecture on time series processing to be helpful.  For each indicator you should create a single chart that shows the price history of the stock during the in-sample period and the value of the indicator. Note that your chart should help to convey an understanding of how the indicator works, it is not strictly necessary that the chart show the literal value of the indicator.
+==Part 1: Implement Q-Learner (95%)==
-Your report description of each indicator should enable someone to reproduce it just by reading the description. We want a written description here, not code, however, it is OK to augment your written description with a pseudocode figure.
+Your QLearner class should be implemented in the file <tt>QLearner.py</tt>.  It should implement EXACTLY the API defined below.  DO NOT import any modules besides those allowed below.  Your class should implement the following methods:
-At least one of the indicators you use should be completely different from the ones presented in our lectures.
+<b>The constructor QLearner()</b> should reserve space for keeping track of Q[s, a] for the number of states and actions.  It should initialize Q[] with all zeros.  Details on the input arguments to the constructor:
-Deliverables:
+* <tt>num_states</tt> integer, the number of states to consider
-* Descriptive text (1 to 3 pages with figures).
+* <tt>num_actions</tt>  integer, the number of actions available.
-* 3 to 5 charts (one for each indicator)
+* <tt>alpha</tt> float, the learning rate used in the update rule. Should range between 0.0 and 1.0 with 0.2 as a typical value.
-* Code: indicators.py
+* <tt>gamma</tt> float, the discount rate used in the update rule.  Should range between 0.0 and 1.0 with 0.9 as a typical value.
+* <tt>rar</tt> float, random action rate: the probability of selecting a random action at each step. Should range between 0.0 (no random actions) to 1.0 (always random action) with 0.5 as a typical value.
+* <tt>radr</tt> float, random action decay rate, after each update, rar = rar * radr. Ranges between 0.0 (immediate decay to 0) and 1.0 (no decay).  Typically 0.99.
+* <tt>dyna</tt> integer, conduct this number of dyna updates for each regular update.  When Dyna is used, 200 is a typical value.
+* <tt>verbose</tt> boolean, if True, your class is allowed to print debugging statements, if False, all printing is prohibited.
-==Part 2: Manual Rule-Based Trader (30%)==
+<b>query(s_prime, r)</b> is the core method of the Q-Learner.  It should keep track of the last state s and the last action a, then use the new information s_prime and r to update the Q table.  The learning instance, or experience tuple is <s, a, s_prime, r>.  query() should return an integer, which is the next action to take.  Note that it should choose a random action with probability rar, and that it should update rar according to the decay rate radr at each step.  Details on the arguments:
-Devise a set of rules using the indicators you created in Part 1 above.  Your rules should be designed to trigger a "long" or "short" entry for a 10 trading day hold.  In other words, once an entry is initiated, you must remain in the position for 10 trading days.  In your report you must describe your trading rules so that another person could implement them based only on your description. We want a written description here, not code, however, it is OK to augment your written description with a pseudocode figure.
+* <tt>s_prime</tt> integer, the the new state.
+* <tt>r</tt> float, a real valued immediate reward.
-You should tweak your rules as best you can to get the best performance possible from during the in sample period (do not peak at out of sample performance).
+<b>querysetstate(s)</b> A special version of the query method that sets the state to s, and returns an integer action according to the same rules as query() (including choosing a random action sometimes), but it does not execute an update to the Q-table.  It also does not update rar. There are two main uses for this method: 1) To set the initial state, and 2) when using a learned policy, but not updating it.
-Use your rule-based strategy to generate an orders file over the in sample period, then run that file through your market simulator to create a chart that includes the following components over the in sample period:
+Here's an example of the API in use:
-* Price of IBM (normalized to 1.0 at the start): Black line
+<PRE>
-* Value of the rule-based portfolio (normalized to 1.0 at the start): Blue line
+import QLearner as ql
-* Vertical green lines indicating LONG entry points.
-* Vertical red lines indicating SHORT entry points.
-* Vertical black lines indicating exits (long or short).
-Note that each red or green vertical line should be followed by a black line before another entry occurs.  We will check for that.  We expect that your rule-based strategy should outperform the stock IBM over the in sample period.
+learner = ql.QLearner(num_states = 100, \
+    num_actions = 4, \
+    alpha = 0.2, \
+    gamma = 0.9, \
+    rar = 0.98, \
+    radr = 0.999, \
+    dyna = 0, \
+    verbose = False)
-Deliverables:
+s = 99 # our initial state
-* Descriptive text (1 or 2 pages with chart) that provides a compelling justification for rule-based system developed.
-* Text must describe rule based system in sufficient detail that another person could implement it.
-* 1 chart.
-* Code: rule_based.py (generates an orders file)
-==Part 3: ML Trader (30%)==
+a = learner.querysetstate(s) # action for state s
-Convert your decision tree '''regression''' learner into a '''classification''' learner.  The classifications should be:
+s_prime = 5 # the new state we end up in after taking action a in state s
-* +1: BUY
+r = 0 # reward for taking action a in state s
-* 0: DO NOTHING
-* -1: SELL
-The X data for each sample (day) are simply the values of your indicators for the stock -- you should have 3 to 5 of them.  The Y data (or classifications) will be based on 10 day return.  You should classify the example as a +1 or "BUY" if the 10 day return exceeds a certain value, let's call it YBUY for the moment.  You should classify the example as a -1 or "SELL" if the 10 day return is below a certain value we'll call YSELL.   In all other cases the sample should be classified as a 0 or "DO NOTHING."
+next_action = learner.query(s_prime, r)
+</PRE>
-Note that your X values are calculated each day from the current day's (and earlier) data, but the Y value is calculated using data from the future.  You may tweak various parameters of your learner to maximize return (more on that below).  Train and test your learning strategy over the in sample period.  Whenever a BUY or SELL is encountered, you must enter the corresponding position and hold it for 10 days.  That means, for instance, that if you encounter a BUY on day 1, then a SELL on day 2, you must keep the stock still until the 10 days expire, even though you received this conflicting information.  The reason for this is that we're trying to provide a way to directly compare the manual strategy versus the ML strategy.
+==Part 2: Navigation Problem Test Cases==
-Use your ML-based strategy to generate an orders file over the in sample period, then run that file through your market simulator to create a chart that includes the following components over the in sample period:
+We will test your Q-Learner with a navigation problem as follows.  Note that your Q-Learner does not need to be coded specially for this task.  In fact the code doesn't need to know anything about it.  The code necessary to test your learner with this navigation task is implemented in testqlearner.py for you.  The navigation task takes place in a 10 x 10 grid world.  The particular environment is expressed in a CSV file of integers, where the value in each position is interpreted as follows:
-* Price of IBM (normalized to 1.0 at the start): Black line.
+* 0: blank space.
-* Value of the rule-based portfolio (normalized to 1.0 at the start): Blue line.
+* 1: an obstacle.
-* Value of the ML-based portfolio (normalized to 1.0 at the start): Green line.
+* 2: the starting location for the robot.
-* Vertical green lines indicating LONG entry points.
+* 3: the goal location.
-* Vertical red lines indicating SHORT entry points.
+* 5: quicksand.
-* Vertical black lines indicating exits (long or short).
-Note that each red or green vertical line should be followed by a black line before another entry occurs.  We will check for that.  We expect that the ML-based strategy will outperform the manual strategy, however it is possible that it does not.  If it is the case that your manual strategy does better, you should try to explain why in your report.
+An example navigation problem (world01.csv) is shown below.  Following python conventions, [0,0] is upper left, or northwest corner, [9,9] lower right or southeast corner.  Rows are north/south, columns are east/west.
-You should tweak the parameters of your learner to maximize performance during the in sample period.  Here is a partial list of things you can tweak:
+<PRE>
-* Adjust YSELL and YBUY.
+,0,0,0,0,0,0,0,0,0
-* Adjust leaf_size.
+,0,0,0,0,0,0,0,0,0
-* Utilize bagging and adjust the number of bags.
+,0,0,0,0,0,0,0,0,0
+,0,1,1,1,1,1,0,0,0
-Deliverables:
+,5,1,0,0,0,1,0,0,0
-* Descriptive text (1 or 2 pages with chart) that describes your ML approach.
+,5,1,0,0,0,1,0,0,0
-* Text must describe ML based system in sufficient detail that another person could implement it.
+,0,1,0,0,0,1,0,0,0
-* 1 chart
+,0,0,0,0,0,0,0,0,0
-* Code: ML_based.py (generates an orders file)
+,0,0,0,0,0,0,0,0,0
-* Additional code files as necessary to support ML_based.py (e.g. RTLearner.py and so on).
+,0,0,0,2,0,0,0,0,0
+</PRE>
-==Part 4: Comparative Analysis (20%)==
+In this example the robot starts at the bottom center, and must navigate to the top left.  Note that a wall of obstacles blocks its path, and there is some quicksand along the left side.  The objective is for the robot to learn how to navigate from the starting location to the goal with the highest total reward.  We define the reward for each step as:
+* -1 if the robot moves to an empty or blank space, or attempts to move into a wall
+* -100 if the robot moves to a quicksand space
+* 1 if the robot moves to the goal space
-Evaluate the performance of both of your strategies in the out of sample period.   Note that you '''should not''' train your learner on this data.  You should use the classification learned using the training data only.  Create a chart that shows, out of sample:
+Overall, we will assess the performance of a policy as the median reward it incurs to travel from the start to the goal (higher reward is better).  We assess a learner in terms of the reward it converges to over a given number of training epochs (trips from start to goal).  '''Important note:''' the problem includes random actions.  So, for example, if your learner responds with a "move north" action, there is some probability that the robot will actually move in a different direction.  For this reason, the "wise" learner develops policies that keep the robot well away from quicksand.  We map this problem to a reinforcement learning problem as follows:
-* Performance of the stock: Black line
+* State: The state is the location of the robot, it is computed (discretized) as: column location * 10 + row location.
-* Performance of manual strategy: Blue line
+* Actions: There are 4 possible actions, 0: move north, 1: move east, 2: move south, 3: move west.
-* Performance of the ML strategy: Green line
+* R: The reward is as described above.
-* All three should be normalized to 1.0 at the start.
+* T: The transition matrix can be inferred from the CSV map and the actions.
-Create a table that summarizes the performance of the stock, the manual strategy and the ML strategy for both in sample and out of sample periods.  Utilize your experience in this class to determine which factors are best to use for comparing these strategies.  If performance out of sample is worse than in sample, do your best to explain why.  Also if the manual and ML strategies perform substantially differently, explain why.  Is one method or the other more or less susceptible to the same underlying flaw?  Why or why not?
+Note that R and T are not known by or available to the learner.  The code in <tt>testqlearner.py</tt> will test your code as follows (pseudo code):
-Deliverables:
+<pre>
-* Descriptive text (1 or 2 pages including figures)
+Instantiate the learner with the constructor QLearner()
-* 1 chart
+s = initial_location
+a = querysetstate(s)
+s_prime = new location according to action a
+r = -1.0
+while not converged:
+    a = query(s_prime, r)
+    s_prime = new location according to action a
+    if s_prime == goal:
+        r = +1
+        s_prime = start location
+    else if s_prime == quicksand:
+        r = -100
+    else:
+        r = -1
+</pre>
-==Hints==
+A few things to note about this code: The learner always receives a reward of -1.0 (or -100.0) until it reaches the goal, when it receives a reward of +1.0. As soon as the robot reaches the goal, it is immediately returned to the starting location.
-Overall, I recommend the following steps in the creation of your strategies:
+Here are example solutions.  Note that these examples were created before we added "quicksand" to the project.  We will be updating the examples to reflect this change.  In the mean time, you may find these useful:
-* Indicator design hints:
+[[mc3_p2_examples]]
-** For your X values: Identify and implement at least 3 technical features that you believe may be predictive of future return. You should implement them so they output values typically ranging from -1.0 to 1.0 (a common way to do this is to apply the zscore https://en.wikipedia.org/wiki/Standard_score to the raw results) .  This will help avoid the situation where one feature overwhelms the results. See a few formulae below.
-* Rule based design:
-** Use a cascade of if statements conditioned on the indicators to identify whether a BUY condition is met.
-** Use a cascade of if statements conditioned on the indicators to identify whether a SELL condition is met.
-** The conditions for BUY and SELL should be mutually exclusive.
-** If neither BUY or SELL is triggered, the result should be DO NOTHING.
-** For debugging purposes, you may find it helpful to plot the value of the rule-based output (-1, 0, 1) versus the stock price.
-* Train a classification learner on in sample training data:
-** For your Y values: Use future 10 day return (not future price).  Then classify that return as BUY, SELL or DO NOTHING.  You're trying to predict a relative change that you can use to invest with.
-** For debugging purposes, you may find it helpful to plot the value of the training classification data (-1, 0, 1) versus the stock price in one color.
-** For debugging purposes, you may find it helpful to plot the value of the training classification output (-1, 0, 1) versus the stock price in another color.  Ideally, these two lines should be very similar.
-==Summary of Plots To Create==
+[[mc3_p2_dyna_examples]]
-# Sine data in-sample Training Y/Price/Predicted Y: Create a plot that illustrates your training Y values in one color, current price in another color and your model's PREDICTED Y in a third color. To help with the visualization, you should adjust your training Y and predicted Y so that it is at the same scale as the current price.
+==Part 3: Implement Dyna (5%)==
-# Sine data in-sample Entries/Exits: Create a plot that illustrates entry and exits as vertical lines on a price chart for the in sample period. Show long entries as green lines, short entries as red lines and exits as black lines. You may find it convenient to zoom in on a particular time period so this is evident.
-# Sine data in-sample backtest
-# Sine data out-of-sample Entries/Exits: Freeze your model based on the in-sample data. Now test it for the the out-of-sample period. Plot the entry & exits, generate trades,
-# Sine data out-of-sample backtest.
-# IBM data in-sample Entries/Exits: Create a plot that illustrates entry and exits as vertical lines on a price chart for the in sample period 2008-2009. Show long entries as green lines, short entries as red lines and exits as black lines. You may find it convenient to zoom in on a particular time period so this is evident.
-# IBM data in-sample backtest
-# IBM data out-of-sample Entries/Exits
-# IBM data out-of-sample backtest
-==Template and Data==
+Add additional components to your QLearner class so that multiple "hallucinated" experience tuples are used to update the Q-table for each "real" experience.  The addition of this component should speed convergence in terms of the number of calls to query(), <b>not necessarily running time</b>.
-You should create a directory for your code in ml4t/mc3-p2.  You will have access to the data in the ML4T/Data directory but you should use ONLY the code in util.py to read it.  In particular files named ML4T-220.csv, and IBM.csv.
+Note that it is not important that you implement Dyna exactly as described in the lecture.  The key requirement is that your code should somehow hallucinate additional experiences.  The precise method you use for discovering those experiences is flexible.  We will test your code on several test worlds with 50 epochs and with dyna = 200.  Our expectation is that with Dyna, the solution should be much better after 50 epochs than without.
-==Choosing Technical Features -- Your X Values==
+==Part 4: Implement author() Method (0%)==
-You should have already successfully coded the Bollinger Band feature.  Here's a suggestion of how to normalize that feature so that it will typically provide values between -1.0 and 1.0:
+You  should implement a method called <tt>author()</tt> that returns your Georgia Tech user ID as a string. This is the ID you use to log into t-square.  It is not your 9 digit student number.  Here is an example of how you might implement author() within a learner object:
-<PRE>
+<pre>
-bb_value[t] = (price[t] - SMA[t])/(2 * stdev[t])
+class QLearner(object):
-</PRE>
+    def author(self):
+        return 'tb34' # replace tb34 with your Georgia Tech username.
+</pre>
-Two other good features worth considering are momentum and volatility.
+And here's an example of how it could be called from a testing program:
-<PRE>
+<pre>
-momentum[t] = (price[t]/price[t-N]) - 1
+    # create a learner and train it
-</PRE>
+    learner = ql.QLearner() # create a QLearner
+    print learner.author()
+</pre>
-Volatility is just the stdev of daily returns.
+Check the template code for examples. We are adding those to the repo now, but it might not be there if you check right away.  Implementing this method correctly does not provide any points, but there will be a penalty for not implementing it.
-==Choosing Y==
+==Contents of Report==
-Your code should predict 5 day change in price.  You need to build a new Y that reflects the 5 day change and aligns with the current date.  Here's pseudo code for the calculation of Y
+There is no report component of this assignment.  However, if you would like to impress us with your Machine Learning prowess, you are invited to submit a succinct report.
- Y[t] = (price[t+5]/price[t]) - 1.0
+==Hints & resources==
-If you select Y in this manner and use it for training, your learner will predict 5 day returns.
+This paper by Kaelbling, Littman and Moore, is a good resource for RL in general: http://www.jair.org/media/301/live-301-1562-jair.pdf  See Section 4.2 for details on Q-Learning.
-==Contents of Report==
+There is also a chapter in the Mitchell book on Q-Learning.
-* Your report should be no more than 2500 words.  Your report should contain no more than 12 charts.  Penalties will apply if you violate these constraints.
+For implementing Dyna, you may find the following resources useful:
-* Include the charts listed in the overview section above.
-* Describe each of the indicators you have selected in enough detail that someone else could reproduce them in code.
-* Describe your trading policy clearly.
-* If you used any external code or ideas be sure to cite them in your code and in the report.
-* Discussion of results.  Did it work well?  Why?  What would you do differently?
-==Expectations==
+* https://webdocs.cs.ualberta.ca/~sutton/book/ebook/node96.html
+* http://www-anw.cs.umass.edu/~barto/courses/cs687/Chapter%209.pdf
-* In-sample sine and in-sample IBM backtests should both perform very well -- better than the manual policy you created for the last assignment.
+==What to turn in==
-* Out-of-sample sine backtest should perform nearly identically as the in-sample test.
-* Out-of-sample IBM backtest should... (you should be able to complete this sentence).
-==What to turn in==
+Turn your project in via t-square.   All of your code must be contained within QLearner.py .
-Turn your project in via t-square.
+* Your QLearner as <tt>QLearner.py</tt>
+* Do not submit any other files.
-* Your report as <tt>report.pdf</tt>
+==Rubric==
-* All of your code, as necessary to run as <tt>.py</tt> files.
-* Document how to run your code in <tt>readme.txt</tt>.
-* No zip files please.
-==Extra credit up to 3%==
+Only your QLearner class will be tested.
-Choose one or more of the following:
+* For basic Q-Learning (dyna = 0) we will test your learner against 10 test worlds with 500 epochs in each world.  One "epoch" means your robot reaches the goal one time, or after 100000 steps, whichever comes first.  Your QLearner retains its state (Q-table), and then we allow it to navigate to the goal again, over and over, 500 times.  Each test (500 epochs) should complete in less than 2 seconds.  <b>NOTE</b>: an epoch where the robot fails to reach the goal will likely take <b>much</b> longer (in running time), than one that does reach the goal, and is a common reason for failing to complete test cases within the time limit.
+* Benchmark: As a benchmark to compare your solution to, we will run our reference solution in the same world, with 500 epochs.  We will take the median reward of our reference across all of those 500 epochs.
+* Your score: For each world we will take the median cost your solution finds across all 500 epochs.
+* For a test to be successful, your learner should find a total reward >= 1.5 x the benchmark. Note that since reward for a single epoch is negative, your solution can be up to 50% worse than the reference solution and still pass.
+* There are 10 test cases, each test case is worth 9.5 points.
+* Here is how we will initialize your QLearner for these test cases:
-* Compare the performance of KNN and LinReg in this task.  The instructor anticipates that LinReg might work well.  If that turns out to be the case, how can that be?  This is a non-linear task isn't it?
+<Pre>
-* Extend your code to create a "rolling" model that updates each day rolling forward.
+    learner = ql.QLearner(num_states=100,\
-* Extend your code to simultaneously forecast all the members of the S&P 500.  Generate trades accordingly, and backtest the result.
+        num_actions = 4, \
+        alpha = 0.2, \
+        gamma = 0.9, \
+        rar = 0.98, \
+        radr = 0.999, \
+        dyna = 0, \
+        verbose=False) #initialize the learner
+</PRE>
-Submit to the extra credit assignment on t-square.  One single PDF file only, max 1000 words.
+* For Dyna-Q, we will set dyna = 200.  We will test your learner against <tt>world01.csv</tt> and <tt>world02.csv</tt> with 50 epochs.  Scoring is similar to the non-dyna case: Each test should complete in less than 10 seconds. For the test to be successful, your learner should find solution with total reward to the goal >= 1.5 x the  median reward our reference solution across all 50 epochs.  Note that since reward for a single epoch is negative, your solution can be up to 50% worse than the reference solution and still pass.  We will check this by taking the median of all 50 runs. Each test case is worth 2.5 points.  We will initialize your learner with the following parameter values for these test cases:
-==Rubric==
+<Pre>
+    learner = ql.QLearner(num_states=100,\
+        num_actions = 4, \
+        alpha = 0.2, \
+        gamma = 0.9, \
+        rar = 0.5, \
+        radr = 0.99, \
+        dyna = 200, \
+        verbose=False) #initialize the learner
+</PRE>
-* Are all 9 plots present and correct? -5 points for each missing plot.
+* Is the author() method correctly implemented (-20% if not)
-** Note: Correct in the sense that they properly display the information requested.  The result may not be the desired one.
-* Are comparative backtest results correct? (ML4T-220 in sample & out of sample, IBM in sample & out of sample) -10 points for each incorrect result.
-* Indicators used: Are descriptions of factors used sufficiently clear that others could reproduce them? Up to -10 points for lack of clarity.
-* Trading strategy: Is description sufficiently clear that others could reproduce it? Up to -10 points for lack of clarity.
-* Is discussion of results concise, complete, correct? Up to -5 points for each of concise, complete, correct.
 ==Required, Allowed & Prohibited==
@@ Line 233: / Line 226: @@
 Required:
 * Your project must be coded in Python 2.7.x.
-* Your code must run on one of the university-provided computers (e.g. buffet02.cc.gatech.edu), or on one of the provided virtual images.
+* Your code must run on one of the university-provided computers (e.g. buffet02.cc.gatech.edu).
-* Use only util.py to read data.  If you want to read items other than adjusted close, modify util.py to do it, and submit your new version with your code.
 Allowed:
@@ Line 240: / Line 232: @@
 * Your code may use standard Python libraries.
 * You may use the NumPy, SciPy, matplotlib and Pandas libraries.  Be sure you are using the correct versions.
-* You may use scikit learn libraries (note that you don't need them because you just wrote your own!).
 * You may reuse sections of code (up to 5 lines) that you collected from other students or the internet.
 * Code provided by the instructor, or allowed by the instructor to be shared.
-* A herring.
 Prohibited:
-* Any other method of reading data besides util.py
 * Any libraries not listed in the "allowed" section above.
 * Any code you did not write yourself (except for the 5 line rule in the "allowed" section).
+* Any Classes (other than Random) that create their own instance variables for later use (e.g., learners like kdtree).
+* Print statements outside "verbose" checks (they significantly slow down auto grading).
+* Any method for reading data besides util.py
 ==Legacy==
-[[MC3-Project-2-Legacy]]
+*[[MC3-Project-2-Legacy-trader]]
+*[[MC3-Project-2-Legacy]]
+*[[MC3-Project-4]]