Summer 2016 Project 5
Overview
In this project you will implement the Q-Learning and Dyna-Q solutions to the reinforcement learning problem. You will apply them to two problems: 1) Navigation, and 2) Trading. The reason for working with the navigation problem first is that, as you will see, navigation is an easy problem to work with and understand. In the last part of the assignment you will apply Q-Learning to stock trading.
Note that your Q-Learning code really shouldn't care which problem it is solving. The difference is that you need to wrap the learner in different code that frames the problem for the learner as necessary.
For the navigation problem we have created testqlearner.py that automates testing of your Q-Learner in the navigation problem. We also provide teststrategylearner.py to test your strategy learner. In order to apply Q-learning to trading you will have to implement an API that calls Q-learning internally.
Overall, your tasks for this project include:
- Code a Q-Learner
- Code the Dyna-Q feature of Q-Learning
- Test/debug the Q-Learner in navigation problems
- Build a strategy learner based on your Q-Learner
- Test/debug the strategy learner on specific symbol/time period problems
Scoring for the project will be allocated as follows:
- Navigation test cases: 80% (note that we will check those with dyna = 0)
- Dyna implemented: 5% (we will check this with one navigation test case by comparing performance with and without dyna turned on)
- Trading strategy test cases: 20%
For this assignment we will test only your code (there is no report component). Note that the scoring is structured so that you can earn a B (80%) if you implement only Q-Learning, but if you implement everything, the total possible score is 105%. That means you can earn up to 5% extra credit on this project (equivalent to 1% extra credit on the final course grade).
Template and Data
- Download mc3_p3.zip, unzip inside ml4t/
- Implement the QLearner class in mc3_p3/QLearner.py.
- Implement the StrategyLearner class in mc3_p3/StrategyLearner.py
- To test your Q-learner, run python testqlearner.py from the mc3_p3/ directory.
- To test your strategy learner, run python teststrategylearner.py from the mc3_p3/ directory.
- Note that example problems are provided in the mc3_p3/testworlds directory
Part 1: Implement QLearner
We will test your Q-Learner with a navigation problem as follows. Note that your Q-Learner does not need to be coded specially for this task; in fact, the learner does not need to know anything about navigation at all. The code necessary to test your learner with this navigation task is implemented in testqlearner.py for you. The navigation task takes place in a 10 x 10 grid world. The particular environment is expressed in a CSV file of integers, where the value in each position is interpreted as follows:
- 0: blank space.
- 1: an obstacle.
- 2: the starting location for the robot.
- 3: the goal location.
An example navigation problem (CSV file) is shown below. Following Python conventions, [0,0] is the upper left (northwest) corner and [9,9] is the lower right (southeast) corner. Rows run north/south, columns run east/west.
0,0,0,0,3,0,0,0,0,0
0,0,0,0,0,0,0,0,0,0
0,0,0,0,0,0,0,0,0,0
0,0,1,1,1,1,1,0,0,0
0,0,1,0,0,0,1,0,0,0
0,0,1,0,0,0,1,0,0,0
0,0,1,0,0,0,1,0,0,0
0,0,0,0,0,0,0,0,0,0
0,0,0,0,0,0,0,0,0,0
0,0,0,0,2,0,0,0,0,0
In this example the robot starts at the bottom center, and must navigate to the top center. Note that a wall of obstacles blocks its path. We map this problem to a reinforcement learning problem as follows:
- State: The state is the location of the robot; it is computed (discretized) as: column location * 10 + row location. For example, the robot's starting cell at row 9, column 4 corresponds to state 4 * 10 + 9 = 49. (See the sketch after this list.)
- Actions: There are 4 possible actions, 0: move north, 1: move east, 2: move south, 3: move west.
- R: The reward is -1.0 unless the action leads to the goal, in which case the reward is +1.0.
- T: The transition matrix can be inferred from the CSV map and the actions.
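To make this mapping concrete, here is a minimal sketch of how a harness could discretize the state and compute rewards. The helper names (discretize, step), the example file name, and the movement rules shown are illustrative assumptions, not the actual contents of testqlearner.py.

import numpy as np

# Illustrative helpers only; not the actual testqlearner.py code.
# A world file could be loaded with, for example:
#   world = np.genfromtxt('testworlds/world01.csv', delimiter=',', dtype=int)
# (the file name here is just an example).

def discretize(row, col):
    # Convert a (row, col) grid position into a single integer state.
    return col * 10 + row

def step(world, row, col, action):
    # Apply an action (0=north, 1=east, 2=south, 3=west) and return
    # (new_row, new_col, reward).  A move off the grid or into an
    # obstacle (1) leaves the robot where it is.
    moves = {0: (-1, 0), 1: (0, 1), 2: (1, 0), 3: (0, -1)}
    dr, dc = moves[action]
    new_row, new_col = row + dr, col + dc
    if not (0 <= new_row < 10 and 0 <= new_col < 10) or world[new_row, new_col] == 1:
        new_row, new_col = row, col
    # Reward is +1.0 only when the new cell is the goal (3), otherwise -1.0.
    reward = 1.0 if world[new_row, new_col] == 3 else -1.0
    return new_row, new_col, reward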
Note that R and T are not known by or available to the learner. The testing code testqlearner.py will test your code as follows (pseudo code):
Instantiate the learner with the constructor QLearner()
s = initial_location
a = querysetstate(s)
s_prime = new location according to action a
r = -1.0
while not converged:
    a = query(s_prime, r)
    s_prime = new location according to action a
    if s_prime == goal:
        r = +1
        s_prime = start location
    else:
        r = -1
A few things to note about this code: The learner always receives a reward of -1.0 until it reaches the goal, when it receives a reward of +1.0. As soon as the robot reaches the goal, it is immediately returned to the starting location.
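To make the interface exercised by this loop concrete, here is a minimal sketch of a QLearner class. The constructor parameters shown (num_states, num_actions, alpha, gamma, rar, radr, dyna) and their defaults are assumptions for illustration, not the required signature; only QLearner(), querysetstate(), and query() are named by the testing code above.

import random
import numpy as np

class QLearner(object):
    # Minimal sketch only; parameter names and defaults are assumptions.
    def __init__(self, num_states=100, num_actions=4, alpha=0.2, gamma=0.9,
                 rar=0.5, radr=0.99, dyna=0, verbose=False):
        self.num_actions = num_actions
        self.alpha = alpha      # learning rate
        self.gamma = gamma      # discount factor
        self.rar = rar          # random action rate (exploration)
        self.radr = radr        # random action decay rate
        self.Q = np.zeros((num_states, num_actions))
        self.s = 0
        self.a = 0

    def querysetstate(self, s):
        # Set the current state and return an action without updating Q.
        self.s = s
        self.a = int(np.argmax(self.Q[s]))
        return self.a

    def query(self, s_prime, r):
        # Update Q for the last (s, a) using the observed (s_prime, r),
        # then choose (and remember) the next action.
        best_next = np.max(self.Q[s_prime])
        self.Q[self.s, self.a] = (1 - self.alpha) * self.Q[self.s, self.a] \
            + self.alpha * (r + self.gamma * best_next)
        if random.random() < self.rar:
            action = random.randint(0, self.num_actions - 1)
        else:
            action = int(np.argmax(self.Q[s_prime]))
        self.rar *= self.radr
        self.s, self.a = s_prime, action
        return action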
Part 3: Implement Dyna
Add additional components to your QLearner class so that multiple "hallucinated" experience tuples are used to update the Q-table for each "real" experience. The addition of this component should speed convergence in terms of the number of calls to query().
We will test your code on world03.csv with 50 iterations and with dyna = 200. Our expectation is that with Dyna, the solution should be much better than without.
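One way to realize the "hallucinated" updates is to record each real (s, a, s_prime, r) tuple and, after every real update in query(), replay a number of randomly chosen recorded tuples. The method below is a minimal sketch of that approach; the name hallucinate and the self.experiences list are illustrative design choices, not required ones, and a full Dyna-Q implementation could instead learn explicit T and R models and sample from those.

    # Sketch of a Dyna component added to the QLearner class.
    # query() is assumed to append each real (s, a, s_prime, r) tuple to
    # self.experiences and then call self.hallucinate(self.dyna).
    def hallucinate(self, dyna):
        for _ in range(dyna):
            s, a, s_prime, r = random.choice(self.experiences)
            best_next = self.Q[s_prime].max()
            self.Q[s, a] = (1 - self.alpha) * self.Q[s, a] \
                + self.alpha * (r + self.gamma * best_next)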