Difference between revisions of "MC3-Project-4"

From Quantitative Analysis Software Courses
Jump to navigation Jump to search
Line 1: Line 1:
 
==DRAFT==
 
==DRAFT==
 +
 +
This is a draft version of the project.  This section will be removed once the project is finalized.
  
 
==Updates / FAQs==
 
==Updates / FAQs==

Revision as of 22:50, 28 November 2016

DRAFT

This is a draft version of the project. This section will be removed once the project is finalized.

Updates / FAQs

Overview

In this project you will apply the Q-Learner you developed earlier to the trading problem. It is not required, but we recommend that you reuse the indicators that you developed in the previous project for this one. The indicators define most of the "state" for your learner, while the actions are BUY, DO NOTHING, SELL.

Overall, your tasks for this project include:

  • Build a strategy learner based on your Q-Learner and previously developed indicators.
  • Test/debug the strategy learner on specific symbol/time period problems

Scoring for the project will be based on trading strategy test cases. For this assignment we will test only your code (there is no report component).

Template and Data

Wait for an email from Brian to confirm that the repo is current before updating your local copy.

  • Update your local mc3_p4 directory using github.
  • Place your existing Q-Learner in the file mc3_p4/QLearner.py.
  • Implement the StrategyLearner class in mc3_p3/StrategyLearner.py
  • To test your strategy learner, run python teststrategylearner.py from the mc3_p4/ directory.

Part 2: Navigation Problem Test Cases

We will test your Q-Learner with a navigation problem as follows. Note that your Q-Learner does not need to be coded specially for this task. In fact the code doesn't need to know anything about it. The code necessary to test your learner with this navigation task is implemented in testqlearner.py for you. The navigation task takes place in a 10 x 10 grid world. The particular environment is expressed in a CSV file of integers, where the value in each position is interpreted as follows:

  • 0: blank space.
  • 1: an obstacle.
  • 2: the starting location for the robot.
  • 3: the goal location.

An example navigation problem (CSV file) is shown below. Following python conventions, [0,0] is upper left, or northwest corner, [9,9] lower right or southeast corner. Rows are north/south, columns are east/west.

0,0,0,0,3,0,0,0,0,0
0,0,0,0,0,0,0,0,0,0
0,0,0,0,0,0,0,0,0,0
0,0,1,1,1,1,1,0,0,0
0,0,1,0,0,0,1,0,0,0
0,0,1,0,0,0,1,0,0,0
0,0,1,0,0,0,1,0,0,0
0,0,0,0,0,0,0,0,0,0
0,0,0,0,0,0,0,0,0,0
0,0,0,0,2,0,0,0,0,0

In this example the robot starts at the bottom center, and must navigate to the top center. Note that a wall of obstacles blocks its path. We map this problem to a reinforcement learning problem as follows:

  • State: The state is the location of the robot, it is computed (discretized) as: column location * 10 + row location.
  • Actions: There are 4 possible actions, 0: move north, 1: move east, 2: move south, 3: move west.
  • R: The reward is -1.0 unless the action leads to the goal, in which case the reward is +1.0.
  • T: The transition matrix can be inferred from the CSV map and the actions.

Note that R and T are not known by or available to the learner. The testing code testqlearner.py will test your code as follows (pseudo code):

Instantiate the learner with the constructor QLearner()
s = initial_location
a = querysetstate(s)
s_prime = new location according to action a
r = -1.0
while not converged:
    a = query(s_prime, r) 
    s_prime = new location according to action a
    if s_prime == goal:
        r = +1
        s_prime = start location
    else
        r = -1

A few things to note about this code: The learner always receives a reward of -1.0 until it reaches the goal, when it receives a reward of +1.0. As soon as the robot reaches the goal, it is immediately returned to the starting location.

Here are example solutions:

mc3_p3_examples

mc3_p3_dyna_examples

Part 3: Implement Dyna

Add additional components to your QLearner class so that multiple "hallucinated" experience tuples are used to update the Q-table for each "real" experience. The addition of this component should speed convergence in terms of the number of calls to query().

We will test your code on world03.csv with 50 iterations and with dyna = 200. Our expectation is that with Dyna, the solution should be much better than without.

Part 4: Implement Strategy Learner

For this part of the project you should develop a learner that can learn a trading policy using your Q-Learner. Utilize the template provided in StrategyLearner.py Overall the structure of your strategy learner should be arranged like this:

For the policy learning part:

  • Select several technical features, and compute their values for the training data
  • Discretize the values of the features
  • Instantiate a Q-learner
  • For each day in the training data:
    • Compute the current state (including holding)
    • Compute the reward for the last action
    • Query the learner with the current state and reward to get an action
    • Implement the action the learner returned (BUY, SELL, NOTHING), and update portfolio value
  • Repeat the above loop multiple times until cumulative return stops improving.

A rule to keep in mind: As in past projects, you can only be long or short 100 shares, so if your learner returns two BUYs in a row, don't double down, same thing with SELLs.

For the policy testing part:

  • For each day in the testing data:
    • Compute the current state
    • Query the learner with the current state to get an action
    • Implement the action the learner returned (BUY, SELL, NOTHING), and update portfolio value
  • Return the resulting trades in a data frame (details below).

Your StrategyLearner should implement the following API:

import StrategyLearner as sl
learner = sl.StrategyLearner(verbose = False) # constructor
learner.addEvidence(symbol = "IBM", sd=dt.datetime(2008,1,1), ed=dt.datetime(2009,1,1), sv = 10000) # training step
df_trades = learner.testPolicy(symbol = "IBM", sd=dt.datetime(2009,1,1), ed=dt.datetime(2010,1,1), sv = 10000) # testing step

The input parameters are:

  • verbose: if False do not generate any output
  • symbol: the stock symbol to train on
  • sd: A datetime object that represents the start date
  • ed: A datetime object that represents the end date
  • sv: Start value of the portfolio

The output result is:

  • df_trades: A data frame whose values represent trades for each day. Legal values are +100.0 indicating a BUY of 100 shares, -100.0 indicating a SELL of 100 shares, and 0.0 indicating NOTHING [update, values of +200 and -200 for trades are also legal so long as net holdings are constrained to -100, 0, and 100].

Contents of Report

There is no report component of this assignment. However, if you would like to impress us with your Machine Learning prowess, you are invited to submit a succinct report.

Hints & resources

This paper by Kaelbling, Littman and Moore, is a good resource for RL in general: http://www.jair.org/media/301/live-301-1562-jair.pdf See Section 4.2 for details on Q-Learning.

There is also a chapter in the Mitchell book on Q-Learning.

For implementing Dyna, you may find the following resources useful:

What to turn in

Turn your project in via t-square. All of your code must be contained within QLearner.py and StrategyLearner.py.

  • Your QLearner as QLearner.py
  • Your StrategyLearner as StrategyLearner.py
  • Your report (if any) as report.pdf
  • Do not submit any other files.

Rubric

Only your QLearner class will be tested.

  • For basic Q-Learning (dyna = 0) we will test your learner against 10 test worlds with 500 iterations. Each test should complete in less than 2 seconds. For the test to be successful, your learner should find a path to the goal <= 1.5 x the number of steps our reference solution finds. We will check this by taking the min of all the 500 runs. Each test case is worth 8 points. We will initialize your learner with the following parameter values:
    learner = ql.QLearner(num_states=100,\
        num_actions = 4, \
        alpha = 0.2, \
        gamma = 0.9, \
        rar = 0.98, \
        radr = 0.999, \
        dyna = 0, \
        verbose=False) #initialize the learner
  • For Dyna-Q, we will set dyna = 200. We will test your learner against world03.csv with 50 iterations. The test should complete in less than 10 seconds. For the test to be successful, your learner should find a path to the goal <= 1.5 x the number of steps our reference solution finds. We will check this by taking the min of all 50 runs. The test case is worth 5 points. We will initialize your learner with the following parameter values:
    learner = ql.QLearner(num_states=100,\
        num_actions = 4, \
        alpha = 0.2, \
        gamma = 0.9, \
        rar = 0.5, \
        radr = 0.99, \
        dyna = 200, \
        verbose=False) #initialize the learner
  • We will test StrategyLearner in the following situations:
    • Training: Dec 31 2007 to Dec 31 2009
    • Testing: Dec 31 2009 to Dec 31 2011
    • Symbols: ML4T-220, IBM
    • Starting value: $10,000
    • Benchmark: Buy 100 shares on the first trading day, Sell 100 shares on the last day.
  • We expect the following outcomes in testing:
    • For ML4T-220, the trained policy should significantly outperform the benchmark in sample (7 points)
    • For ML4T-220, the trained policy should significantly outperform the benchmark out of sample (7 points)
    • For IBM, the trained policy should significantly outperform the benchmark in sample (7 points)

Training and testing for each situation should run in less than 30 seconds. We reserve the right to use different time periods if necessary to reduce auto grading time.

Required, Allowed & Prohibited

Required:

  • Your project must be coded in Python 2.7.x.
  • Your code must run on one of the university-provided computers (e.g. buffet02.cc.gatech.edu), or on one of the provided virtual images.

Allowed:

  • You can develop your code on your personal machine, but it must also run successfully on one of the university provided machines or virtual images.
  • Your code may use standard Python libraries.
  • You may use the NumPy, SciPy, matplotlib and Pandas libraries. Be sure you are using the correct versions.
  • You may reuse sections of code (up to 5 lines) that you collected from other students or the internet.
  • Code provided by the instructor, or allowed by the instructor to be shared.
  • Use util.py (only) for reading data.

Prohibited:

  • Any libraries not listed in the "allowed" section above.
  • Any code you did not write yourself (except for the 5 line rule in the "allowed" section).
  • Any Classes (other than Random) that create their own instance variables for later use (e.g., learners like kdtree).
  • Print statements outside "verbose" checks (they significantly slow down auto grading).
  • Any method for reading data besides util.py

Legacy