Difference between revisions of "MC3-Project-2"

From Quantitative Analysis Software Courses
Jump to navigation Jump to search
(Created page with "==Updates / FAQs== Q: Can I use an ML library or do I have to write the code myself? A: You must write the KNN and bagging code yourself. For the LinRegLearner you are allo...")
 
 
(171 intermediate revisions by 2 users not shown)
Line 1: Line 1:
 
==Updates / FAQs==
 
==Updates / FAQs==
  
Q: Can I use an ML library or do I have to write the code myself?  A: You must write the KNN and bagging code yourself. For the LinRegLearner you are allowed to make use of NumPy or SciPy libraries but you must "wrap" the library code to implement the APIs defined below.  Do not uses other libraries or your code will fail the auto grading test cases.
+
* '''2017-03-29''' Changed rubric to evaluate the '''median''' of all runs rather than the "best" of all runs.
 +
* '''2016-11-8''' Changed rubric to evaluate the '''median''' of all runs rather than the "best" of all runs.
  
2015-11-07
+
==Overview==
 +
 
 +
In this project you will implement the Q-Learning and Dyna-Q solutions to the reinforcement learning problem.  You will apply them to a navigation problem in this project.  In a later project you will apply them to trading.  The reason for working with the navigation problem first is that, as you will see, navigation is an easy problem to work with and understand.  Note that your Q-Learning code really shouldn't care which problem it is solving.  The difference is that you need to wrap the learner in different code that frames the problem for the learner as necessary.
 +
 
 +
For the navigation problem we have created testqlearner.py that automates testing of your Q-Learner in the navigation problem.
 +
 
 +
Overall, your tasks for this project include:
 +
 
 +
* Code a Q-Learner
 +
* Code the Dyna-Q feature of Q-Learning
 +
* Test/debug the Q-Learner in navigation problems
 +
 
 +
For this assignment we will test only your code (there is no report component).
 +
 
 +
==Template and Data==
 +
 
 +
* Update your local mc3p2_qlearning_robot directory using github.
 +
* Implement the <tt>QLearner</tt> class in <tt>mc3p2_qlearning_robot/QLearner.py</tt>.
 +
* To debug your Q-learner, run <tt>'''python testqlearner.py'''</tt> from the <tt>mc3p2_qlearning_robot/</tt> directory. The grading script for this project is <tt>grade_robot_qlearning.py</tt>.
 +
* Note that example navigation problems are provided in the <tt>mc3p2_qlearning_robot/testworlds</tt> directory.
  
Draft version posted.
+
==Part 1: Implement Q-Learner (95%)==
  
2015-11-10
+
Your QLearner class should be implemented in the file <tt>QLearner.py</tt>.  It should implement EXACTLY the API defined below.  DO NOT import any modules besides those allowed below.  Your class should implement the following methods:
  
* Q: Which libraries am I allowed to use?  Which library calls are prohibited?
+
<b>The constructor QLearner()</b> should reserve space for keeping track of Q[s, a] for the number of states and actions.  It should initialize Q[] with all zeros.  Details on the input arguments to the constructor:
  
* A: The general idea is that the use of classes that create and maintain their own data structures are prohibited. So for instance, use of <tt>scipy.spatial.KDTree</tt> is not allowed because it builds a tree and keeps that data structure around for reference laterThe intent for this project is that YOU should be building and maintaining the data structures necessary. You can, however, use most methods that return immediate results and do not retain data structures
+
* <tt>num_states</tt> integer, the number of states to consider
** Examples of things that are allowed: sqrt(), sort(), argsort() -- note that these methods return an immediate value and do not retain data structures for later use.
+
* <tt>num_actions</tt> integer, the number of actions available.
** Examples of things that are prohibited: any scikit add on library, scipy.spatial.KDTree, importing things from libraries other than numpy or scipy.
+
* <tt>alpha</tt> float, the learning rate used in the update rule. Should range between 0.0 and 1.0 with 0.2 as a typical value.
 +
* <tt>gamma</tt> float, the discount rate used in the update ruleShould range between 0.0 and 1.0 with 0.9 as a typical value.
 +
* <tt>rar</tt> float, random action rate: the probability of selecting a random action at each step. Should range between 0.0 (no random actions) to 1.0 (always random action) with 0.5 as a typical value.
 +
* <tt>radr</tt> float, random action decay rate, after each update, rar = rar * radr. Ranges between 0.0 (immediate decay to 0) and 1.0 (no decay).  Typically 0.99.
 +
* <tt>dyna</tt> integer, conduct this number of dyna updates for each regular update.  When Dyna is used, 200 is a typical value.
 +
* <tt>verbose</tt> boolean, if True, your class is allowed to print debugging statements, if False, all printing is prohibited.
  
2015-11-12
+
<b>query(s_prime, r)</b> is the core method of the Q-Learner.  It should keep track of the last state s and the last action a, then use the new information s_prime and r to update the Q table.  The learning instance, or experience tuple is <s, a, s_prime, r>.  query() should return an integer, which is the next action to take.  Note that it should choose a random action with probability rar, and that it should update rar according to the decay rate radr at each step.  Details on the arguments:
  
Clarification regarding dataset generation: Your strategy for defeating KNNLearner and LinRegLearner should not depend on they way you select training data versus testing data. The relationship of one learner performing better than another should persist regardless of which 60% of the data is selected for training and which 40% is selected for testing.
+
* <tt>s_prime</tt> integer, the the new state.
 +
* <tt>r</tt> float, a real valued immediate reward.
  
==Overview==
+
<b>querysetstate(s)</b> A special version of the query method that sets the state to s, and returns an integer action according to the same rules as query() (including choosing a random action sometimes), but it does not execute an update to the Q-tableIt also does not update rar. There are two main uses for this method: 1) To set the initial state, and 2) when using a learned policy, but not updating it.
You are to implement and evaluate three learning algorithms as Python classes: A KNN learner, a Linear Regression learner (provided) and a Bootstrap Aggregating learnerThe classes should be named KNNLearner, LinRegLearner, and BagLearner respectively. We are considering this a <b>regression</b> problem (not classification).  So the goal is to return a continuous numerical result (not a discrete result).
 
  
In this project we are training & testing with static spatial data.  In the next project we will make the transition to time series data.
+
Here's an example of the API in use:
  
You must write your own code for KNN and bagging. You are NOT allowed to use other peoples' code to implement KNN or bagging.
+
<PRE>
 +
import QLearner as ql
  
The project has two main components: The code for your learners, which will be auto graded, and your report, <tt>report.pdf</tt> that should include the components listed below.
+
learner = ql.QLearner(num_states = 100, \
 +
    num_actions = 4, \
 +
    alpha = 0.2, \
 +
    gamma = 0.9, \
 +
    rar = 0.98, \
 +
    radr = 0.999, \
 +
    dyna = 0, \
 +
    verbose = False)
  
==Template and Data==
+
s = 99 # our initial state
  
Instructions:
+
a = learner.querysetstate(s) # action for state s
* Download <tt>'''[[Media:mc3_p1.zip|mc3_p1.zip]]'''</tt>, unzip inside <tt>ml4t/</tt>
 
  
You will find these files in the mc3_p1 directory
+
s_prime = 5 # the new state we end up in after taking action a in state s
  
* <tt>Data/</tt>: Contains data for you to test your learning code on.
+
r = 0 # reward for taking action a in state s
* <tt>LinRegLearner.py</tt>: An implementation of the LinRegLearner class.  You can use it as a template for implementing your learner classes.
 
* <tt>__init__.py</tt>: Tells Python that you can import classes while in this directory.
 
* <tt>testlearner.py</tt>: Helper code to test a learner class.
 
  
In the Data/ directory there are three files:
+
next_action = learner.query(s_prime, r)
* 3_groups.csv
+
</PRE>
* ripple_.csv
 
* simple.csv
 
  
We will mainly be working with ripple and 3_groups. Each data file contains 3 columns: X1, X2, and Y.  In most cases you should use the <b>first 60% of the data for training</b>, and the <b>remaining 40% for testing</b>.
+
==Part 2: Navigation Problem Test Cases==
  
==Part 1: Implement KNNLearner (30%)==
+
We will test your Q-Learner with a navigation problem as follows.  Note that your Q-Learner does not need to be coded specially for this task.  In fact the code doesn't need to know anything about it.  The code necessary to test your learner with this navigation task is implemented in testqlearner.py for you.  The navigation task takes place in a 10 x 10 grid world.  The particular environment is expressed in a CSV file of integers, where the value in each position is interpreted as follows:
  
Your KNNLearner class should be implemented in the file <tt>KNNLearner.py</tt>. It should implement EXACTLY the API defined below. DO NOT import any modules besides those from numpy, scipy, or the basic Python libraries. You should implement the following functions/methods:
+
* 0: blank space.
 +
* 1: an obstacle.
 +
* 2: the starting location for the robot.
 +
* 3: the goal location.
 +
* 5: quicksand.
  
import KNNLearner as knn
+
An example navigation problem (world01.csv) is shown below. Following python conventions, [0,0] is upper left, or northwest corner, [9,9] lower right or southeast corner. Rows are north/south, columns are east/west.
learner = knn.KNNLearner(k = 3) # constructor
 
  learner.addEvidence(Xtrain, Ytrain) # training step
 
Y = learner.query(Xtest) # query
 
  
Where "k" is the number of nearest neighbors to find. Xtrain and Xtest should be ndarrays (numpy objects) where each row represents an X1, X2, X3... XN set of feature values.  The columns are the features and the rows are the individual example instances.  Y and Ytrain are single dimension ndarrays that indicate the value we are attempting to predict with X.
+
<PRE>
 +
3,0,0,0,0,0,0,0,0,0
 +
0,0,0,0,0,0,0,0,0,0
 +
0,0,0,0,0,0,0,0,0,0
 +
0,0,1,1,1,1,1,0,0,0
 +
0,5,1,0,0,0,1,0,0,0
 +
0,5,1,0,0,0,1,0,0,0
 +
0,0,1,0,0,0,1,0,0,0
 +
0,0,0,0,0,0,0,0,0,0
 +
0,0,0,0,0,0,0,0,0,0
 +
0,0,0,0,2,0,0,0,0,0
 +
</PRE>
  
Use Euclidean distanceTake the mean of the closest k points' Y values to make your predictionIf there are multiple equidistant points on the boundary of being selected or not selected, you may use whatever method you like to choose among them.
+
In this example the robot starts at the bottom center, and must navigate to the top left.  Note that a wall of obstacles blocks its path, and there is some quicksand along the left sideThe objective is for the robot to learn how to navigate from the starting location to the goal with the highest total rewardWe define the reward for each step as:
 +
* -1 if the robot moves to an empty or blank space, or attempts to move into a wall
 +
* -100 if the robot moves to a quicksand space
 +
* 1 if the robot moves to the goal space
  
==Part 2: Implement BagLearner (20%)==
+
Overall, we will assess the performance of a policy as the median reward it incurs to travel from the start to the goal (higher reward is better).  We assess a learner in terms of the reward it converges to over a given number of training epochs (trips from start to goal).  '''Important note:''' the problem includes random actions.  So, for example, if your learner responds with a "move north" action, there is some probability that the robot will actually move in a different direction.  For this reason, the "wise" learner develops policies that keep the robot well away from quicksand.  We map this problem to a reinforcement learning problem as follows:
  
Implement Bootstrap Aggregating as a Python class named BagLearner.  Your BagLearner class should be implemented in the file <tt>BagLearner.py</tt>.  It should implement EXACTLY the API defined below. DO NOT import any modules besides those from numpy, scipy, or the basic Python libraries.  You should implement the following functions/methods:
+
* State: The state is the location of the robot, it is computed (discretized) as: column location * 10 + row location.
+
* Actions: There are 4 possible actions, 0: move north, 1: move east, 2: move south, 3: move west.
import BagLearner as bl
+
* R: The reward is as described above.
learner = bl.BagLearner(learner = knn.KNNLearner, kwargs = {"k":3}, bags = 20, boost = False)
+
* T: The transition matrix can be inferred from the CSV map and the actions.
learner.addEvidence(Xtrain, Ytrain)
 
Y = learner.query(Xtest)
 
  
Where learner is the learning class to use with bagging.  kwargs are keyword arguments to be passed on to the learner's constructor and they vary according to the learner (see hints below)"bags" is the number of learners you should train using Bootstrap Aggregation. If boost is true, then you should implement boosting. 
+
Note that R and T are not known by or available to the learner.  The code in <tt>testqlearner.py</tt> will test your code as follows (pseudo code):
  
Notes:  See hints section below for example code you might use to instantiate your learners. Boosting is an extra credit topic and not required.  There's a citation below in the Resources section that outlines a method of implementing bagging. If the training set contains n data items, each bag should contain n items as well.  Note that because you should sample with replacement, some of the data items will be repeated.
+
<pre>
 +
Instantiate the learner with the constructor QLearner()
 +
s = initial_location
 +
a = querysetstate(s)
 +
s_prime = new location according to action a
 +
r = -1.0
 +
while not converged:
 +
    a = query(s_prime, r)
 +
    s_prime = new location according to action a
 +
    if s_prime == goal:
 +
        r = +1
 +
        s_prime = start location
 +
    else if s_prime == quicksand:
 +
        r = -100
 +
    else:
 +
        r = -1
 +
</pre>
  
==Part 3: Experiments and report (50%)==
+
A few things to note about this code: The learner always receives a reward of -1.0 (or -100.0) until it reaches the goal, when it receives a reward of +1.0. As soon as the robot reaches the goal, it is immediately returned to the starting location.
  
Create a report that addresses the following issues/questionsThe report should be submitted as <tt>report.pdf</tt> in PDF format.  Do not submit word docs or latex files.  Include data as tables or charts to support each your answersI expect that this report will be 4 to 10 pages.
+
Here are example solutionsNote that these examples were created before we added "quicksand" to the projectWe will be updating the examples to reflect this change. In the mean time, you may find these useful:
  
* Create your own dataset generating code (call it <tt>best4linreg.py</tt>) that creates data that performs significantly better with LinRegLearner than KNNLearner.  Explain your data generating algorithm, and explain why LinRegLearner performs better.  Your data should include at least 2 dimensions in X, and at least 1000 points. (Don't use bagging for this section).
+
[[mc3_p2_examples]]
* Create your own dataset generating code (call it <tt>best4KNN.py</tt>) that creates data that performs significantly better with KNNLearner than LinRegLearner.  Explain your data generating algorithm, and explain why KNNLearner performs better.  Your data should include at least 2 dimensions in X, and at least 1000 points. (Don't use bagging for this section).
 
* Consider the dataset <tt>ripple</tt> with KNN.  For which values of K does overfitting occur? (Don't use bagging).
 
* Now use bagging in conjunction with KNN with the <tt>ripple</tt> dataset.  How does performance vary as you increase the number of bags?  Does overfitting occur with respect to the number of bags?
 
* Can bagging reduce or eliminate overfitting with respect to K for the <tt>ripple</tt> dataset?
 
  
==Hints & resources==
+
[[mc3_p2_dyna_examples]]
 +
 
 +
==Part 3: Implement Dyna (5%)==
 +
 
 +
Add additional components to your QLearner class so that multiple "hallucinated" experience tuples are used to update the Q-table for each "real" experience.  The addition of this component should speed convergence in terms of the number of calls to query(), <b>not necessarily running time</b>.
  
Some external resources that might be useful for this project:
+
Note that it is not important that you implement Dyna exactly as described in the lecture.  The key requirement is that your code should somehow hallucinate additional experiences.  The precise method you use for discovering those experiences is flexible.  We will test your code on several test worlds with 50 epochs and with dyna = 200.  Our expectation is that with Dyna, the solution should be much better after 50 epochs than without.
  
* You may be interested to take a look at Andew Moore's slides on [http://www.autonlab.org/tutorials/mbl.html instance based learning].
+
==Part 4: Implement author() Method (0%)==
* A definition of [http://mathworld.wolfram.com/StatisticalCorrelation.html correlation] which we'll use to assess the quality of the learning.
 
* [https://en.wikipedia.org/wiki/Bootstrap_aggregating Bootstrap Aggregating]
 
* [https://en.wikipedia.org/wiki/AdaBoost AdaBoost]
 
* [http://docs.scipy.org/doc/numpy/reference/generated/numpy.corrcoef.html numpy corrcoef]
 
* [http://docs.scipy.org/doc/numpy/reference/generated/numpy.argsort.html numpy argsort]
 
* [http://en.wikipedia.org/wiki/Root_mean_square RMS error]
 
  
You can use code like the below to instantiate several learners with the parameters listed in kwargs:
+
You should implement a method called <tt>author()</tt> that returns your Georgia Tech user ID as a string. This is the ID you use to log into t-square.  It is not your 9 digit student number.  Here is an example of how you might implement author() within a learner object:
  
 
<pre>
 
<pre>
learners = []
+
class QLearner(object):
kwargs = {"k":10}
+
    def author(self):
for i in range(0,bags):
+
        return 'tb34' # replace tb34 with your Georgia Tech username.
    learners.append(learner(**kwargs))
 
 
</pre>
 
</pre>
 +
 +
And here's an example of how it could be called from a testing program:
 +
 +
<pre>
 +
    # create a learner and train it
 +
    learner = ql.QLearner() # create a QLearner
 +
    print learner.author()
 +
</pre>
 +
 +
Check the template code for examples. We are adding those to the repo now, but it might not be there if you check right away.  Implementing this method correctly does not provide any points, but there will be a penalty for not implementing it.
 +
 +
==Contents of Report==
 +
 +
There is no report component of this assignment.  However, if you would like to impress us with your Machine Learning prowess, you are invited to submit a succinct report.
 +
 +
==Hints & resources==
 +
 +
This paper by Kaelbling, Littman and Moore, is a good resource for RL in general: http://www.jair.org/media/301/live-301-1562-jair.pdf  See Section 4.2 for details on Q-Learning.
 +
 +
There is also a chapter in the Mitchell book on Q-Learning.
 +
 +
For implementing Dyna, you may find the following resources useful:
 +
 +
* https://webdocs.cs.ualberta.ca/~sutton/book/ebook/node96.html
 +
* http://www-anw.cs.umass.edu/~barto/courses/cs687/Chapter%209.pdf
  
 
==What to turn in==
 
==What to turn in==
Be sure to follow these instructions diligently!
 
  
Via T-Square, submit as attachment (no zip files; refer to schedule for deadline):
+
Turn your project in via t-square.  All of your code must be contained within QLearner.py .
 +
 
 +
* Your QLearner as <tt>QLearner.py</tt>
 +
* Do not submit any other files.
 +
 
 +
==Rubric==
 +
 
 +
Only your QLearner class will be tested. 
 +
 
 +
* For basic Q-Learning (dyna = 0) we will test your learner against 10 test worlds with 500 epochs in each world.  One "epoch" means your robot reaches the goal one time, or after 100000 steps, whichever comes first.  Your QLearner retains its state (Q-table), and then we allow it to navigate to the goal again, over and over, 500 times.  Each test (500 epochs) should complete in less than 2 seconds.  <b>NOTE</b>: an epoch where the robot fails to reach the goal will likely take <b>much</b> longer (in running time), than one that does reach the goal, and is a common reason for failing to complete test cases within the time limit.
 +
* Benchmark: As a benchmark to compare your solution to, we will run our reference solution in the same world, with 500 epochs.  We will take the median reward of our reference across all of those 500 epochs.
 +
* Your score: For each world we will take the median cost your solution finds across all 500 epochs.
 +
* For a test to be successful, your learner should find a total reward >= 1.5 x the benchmark. Note that since reward for a single epoch is negative, your solution can be up to 50% worse than the reference solution and still pass.
 +
* There are 10 test cases, each test case is worth 9.5 points.
 +
* Here is how we will initialize your QLearner for these test cases:
  
* Your code as <tt>KNNLearner.py, BagLearner.py</tt>, <tt>best4linreg.py</tt>, <tt>best4KNN.py</tt>
+
<Pre>
* Your report as <tt>report.pdf</tt>
+
    learner = ql.QLearner(num_states=100,\
 +
        num_actions = 4, \
 +
        alpha = 0.2, \
 +
        gamma = 0.9, \
 +
        rar = 0.98, \
 +
        radr = 0.999, \
 +
        dyna = 0, \
 +
        verbose=False) #initialize the learner
 +
</PRE>
  
DO NOT submit extra credit work as part of this submissionSubmit it separately to the "Extra credit" assignment on t-square.
+
* For Dyna-Q, we will set dyna = 200.  We will test your learner against <tt>world01.csv</tt> and <tt>world02.csv</tt> with 50 epochsScoring is similar to the non-dyna case: Each test should complete in less than 10 seconds. For the test to be successful, your learner should find solution with total reward to the goal >= 1.5 x the  median reward our reference solution across all 50 epochs.  Note that since reward for a single epoch is negative, your solution can be up to 50% worse than the reference solution and still pass.  We will check this by taking the median of all 50 runs. Each test case is worth 2.5 points.  We will initialize your learner with the following parameter values for these test cases:
  
Unlimited resubmissions are allowed up to the deadline for the project.
+
<Pre>
 +
    learner = ql.QLearner(num_states=100,\
 +
        num_actions = 4, \
 +
        alpha = 0.2, \
 +
        gamma = 0.9, \
 +
        rar = 0.5, \
 +
        radr = 0.99, \
 +
        dyna = 200, \
 +
        verbose=False) #initialize the learner
 +
</PRE>
  
==Extra credit up to 3%==
+
* Is the author() method correctly implemented (-20% if not)
  
Implement boosting as part of BagLearner.  How does boosting affect performance for <tt>ripple</tt> and <tt>3_groups</tt> data?
+
==Required, Allowed & Prohibited==
  
Does overfitting occur for either of these datasets as the number of bags with boosting increases?
+
Required:
 +
* Your project must be coded in Python 2.7.x.
 +
* Your code must run on one of the university-provided computers (e.g. buffet02.cc.gatech.edu).
  
Create your own dataset for which overfitting occurs as the number of bags with boosting increases.
+
Allowed:
 +
* You can develop your code on your personal machine, but it must also run successfully on one of the university provided machines or virtual images.
 +
* Your code may use standard Python libraries.
 +
* You may use the NumPy, SciPy, matplotlib and Pandas libraries.  Be sure you are using the correct versions.
 +
* You may reuse sections of code (up to 5 lines) that you collected from other students or the internet.
 +
* Code provided by the instructor, or allowed by the instructor to be shared.
  
Submit your report <tt>report.pdf</tt> that focuses just on your extra credit work to the "extra credit" assignment on t-square.
+
Prohibited:
 +
* Any libraries not listed in the "allowed" section above.
 +
* Any code you did not write yourself (except for the 5 line rule in the "allowed" section).
 +
* Any Classes (other than Random) that create their own instance variables for later use (e.g., learners like kdtree).
 +
* Print statements outside "verbose" checks (they significantly slow down auto grading).
 +
* Any method for reading data besides util.py
  
==Rubric==
+
==Legacy==
  
* KNNLearner, auto grade 10 test cases (including ripple.csv and 3_groups.csv), 3 points each: 30 points
+
*[[MC3-Project-2-Legacy-trader]]
* BagLearner, auto grade 10 test cases (including ripple.csv and 3_groups.csv), 2 points each: 20 points
+
*[[MC3-Project-2-Legacy]]
* best4linreg.py (15 points)
+
*[[MC3-Project-4]]
** Code submitted (OK if not Python): -5 if absent
 
** Description complete -- Sufficient that someone else could implement it: -5 if not
 
** Description compelling -- The reasoning that linreg should do better is understandable and makes sense. Graph of the data helps but is not required if the description is otherwise compelling: -5 if not
 
** Train and test data drawn from same distribution: -5 if not
 
** Performance demonstrates that linreg does better: -10 if not
 
* best4KNN.py (15 points)
 
** Code submitted (OK if not Python): -5 if absent
 
** Description complete -- Sufficient that someone else could implement it: -5 if not
 
** Description compelling -- The reasoning that linreg should do better is understandable and makes sense. Graph of the data helps but is not required if the description is otherwise compelling: -5 if not
 
** Train and test data drawn from same distribution: -5 if not
 
** Performance demonstrates that linreg does better: -10 if not
 
* Overfitting (10 points)
 
** Is the region of overfitting correctly identified?: 5 points
 
** Is conclusion supported with data (table or chart): 5 points
 
* Bagging (10 points)
 
** Correct conclusion regarding overfitting as bags increase, supported with tables or charts: 5 points
 
** Correct conclusion regarding overfitting as K increases, supported with tables or charts: 5 points
 

Latest revision as of 14:37, 3 July 2017

Updates / FAQs

  • 2017-03-29 Changed rubric to evaluate the median of all runs rather than the "best" of all runs.
  • 2016-11-8 Changed rubric to evaluate the median of all runs rather than the "best" of all runs.

Overview

In this project you will implement the Q-Learning and Dyna-Q solutions to the reinforcement learning problem. You will apply them to a navigation problem in this project. In a later project you will apply them to trading. The reason for working with the navigation problem first is that, as you will see, navigation is an easy problem to work with and understand. Note that your Q-Learning code really shouldn't care which problem it is solving. The difference is that you need to wrap the learner in different code that frames the problem for the learner as necessary.

For the navigation problem we have created testqlearner.py that automates testing of your Q-Learner in the navigation problem.

Overall, your tasks for this project include:

  • Code a Q-Learner
  • Code the Dyna-Q feature of Q-Learning
  • Test/debug the Q-Learner in navigation problems

For this assignment we will test only your code (there is no report component).

Template and Data

  • Update your local mc3p2_qlearning_robot directory using github.
  • Implement the QLearner class in mc3p2_qlearning_robot/QLearner.py.
  • To debug your Q-learner, run python testqlearner.py from the mc3p2_qlearning_robot/ directory. The grading script for this project is grade_robot_qlearning.py.
  • Note that example navigation problems are provided in the mc3p2_qlearning_robot/testworlds directory.

Part 1: Implement Q-Learner (95%)

Your QLearner class should be implemented in the file QLearner.py. It should implement EXACTLY the API defined below. DO NOT import any modules besides those allowed below. Your class should implement the following methods:

The constructor QLearner() should reserve space for keeping track of Q[s, a] for the number of states and actions. It should initialize Q[] with all zeros. Details on the input arguments to the constructor:

  • num_states integer, the number of states to consider
  • num_actions integer, the number of actions available.
  • alpha float, the learning rate used in the update rule. Should range between 0.0 and 1.0 with 0.2 as a typical value.
  • gamma float, the discount rate used in the update rule. Should range between 0.0 and 1.0 with 0.9 as a typical value.
  • rar float, random action rate: the probability of selecting a random action at each step. Should range between 0.0 (no random actions) to 1.0 (always random action) with 0.5 as a typical value.
  • radr float, random action decay rate, after each update, rar = rar * radr. Ranges between 0.0 (immediate decay to 0) and 1.0 (no decay). Typically 0.99.
  • dyna integer, conduct this number of dyna updates for each regular update. When Dyna is used, 200 is a typical value.
  • verbose boolean, if True, your class is allowed to print debugging statements, if False, all printing is prohibited.

query(s_prime, r) is the core method of the Q-Learner. It should keep track of the last state s and the last action a, then use the new information s_prime and r to update the Q table. The learning instance, or experience tuple is <s, a, s_prime, r>. query() should return an integer, which is the next action to take. Note that it should choose a random action with probability rar, and that it should update rar according to the decay rate radr at each step. Details on the arguments:

  • s_prime integer, the the new state.
  • r float, a real valued immediate reward.

querysetstate(s) A special version of the query method that sets the state to s, and returns an integer action according to the same rules as query() (including choosing a random action sometimes), but it does not execute an update to the Q-table. It also does not update rar. There are two main uses for this method: 1) To set the initial state, and 2) when using a learned policy, but not updating it.

Here's an example of the API in use:

import QLearner as ql

learner = ql.QLearner(num_states = 100, \ 
    num_actions = 4, \
    alpha = 0.2, \
    gamma = 0.9, \
    rar = 0.98, \
    radr = 0.999, \
    dyna = 0, \
    verbose = False)

s = 99 # our initial state

a = learner.querysetstate(s) # action for state s

s_prime = 5 # the new state we end up in after taking action a in state s

r = 0 # reward for taking action a in state s

next_action = learner.query(s_prime, r)

Part 2: Navigation Problem Test Cases

We will test your Q-Learner with a navigation problem as follows. Note that your Q-Learner does not need to be coded specially for this task. In fact the code doesn't need to know anything about it. The code necessary to test your learner with this navigation task is implemented in testqlearner.py for you. The navigation task takes place in a 10 x 10 grid world. The particular environment is expressed in a CSV file of integers, where the value in each position is interpreted as follows:

  • 0: blank space.
  • 1: an obstacle.
  • 2: the starting location for the robot.
  • 3: the goal location.
  • 5: quicksand.

An example navigation problem (world01.csv) is shown below. Following python conventions, [0,0] is upper left, or northwest corner, [9,9] lower right or southeast corner. Rows are north/south, columns are east/west.

3,0,0,0,0,0,0,0,0,0
0,0,0,0,0,0,0,0,0,0
0,0,0,0,0,0,0,0,0,0
0,0,1,1,1,1,1,0,0,0
0,5,1,0,0,0,1,0,0,0
0,5,1,0,0,0,1,0,0,0
0,0,1,0,0,0,1,0,0,0
0,0,0,0,0,0,0,0,0,0
0,0,0,0,0,0,0,0,0,0
0,0,0,0,2,0,0,0,0,0

In this example the robot starts at the bottom center, and must navigate to the top left. Note that a wall of obstacles blocks its path, and there is some quicksand along the left side. The objective is for the robot to learn how to navigate from the starting location to the goal with the highest total reward. We define the reward for each step as:

  • -1 if the robot moves to an empty or blank space, or attempts to move into a wall
  • -100 if the robot moves to a quicksand space
  • 1 if the robot moves to the goal space

Overall, we will assess the performance of a policy as the median reward it incurs to travel from the start to the goal (higher reward is better). We assess a learner in terms of the reward it converges to over a given number of training epochs (trips from start to goal). Important note: the problem includes random actions. So, for example, if your learner responds with a "move north" action, there is some probability that the robot will actually move in a different direction. For this reason, the "wise" learner develops policies that keep the robot well away from quicksand. We map this problem to a reinforcement learning problem as follows:

  • State: The state is the location of the robot, it is computed (discretized) as: column location * 10 + row location.
  • Actions: There are 4 possible actions, 0: move north, 1: move east, 2: move south, 3: move west.
  • R: The reward is as described above.
  • T: The transition matrix can be inferred from the CSV map and the actions.

Note that R and T are not known by or available to the learner. The code in testqlearner.py will test your code as follows (pseudo code):

Instantiate the learner with the constructor QLearner()
s = initial_location
a = querysetstate(s)
s_prime = new location according to action a
r = -1.0
while not converged:
    a = query(s_prime, r) 
    s_prime = new location according to action a
    if s_prime == goal:
        r = +1
        s_prime = start location
    else if s_prime == quicksand:
        r = -100
    else:
        r = -1

A few things to note about this code: The learner always receives a reward of -1.0 (or -100.0) until it reaches the goal, when it receives a reward of +1.0. As soon as the robot reaches the goal, it is immediately returned to the starting location.

Here are example solutions. Note that these examples were created before we added "quicksand" to the project. We will be updating the examples to reflect this change. In the mean time, you may find these useful:

mc3_p2_examples

mc3_p2_dyna_examples

Part 3: Implement Dyna (5%)

Add additional components to your QLearner class so that multiple "hallucinated" experience tuples are used to update the Q-table for each "real" experience. The addition of this component should speed convergence in terms of the number of calls to query(), not necessarily running time.

Note that it is not important that you implement Dyna exactly as described in the lecture. The key requirement is that your code should somehow hallucinate additional experiences. The precise method you use for discovering those experiences is flexible. We will test your code on several test worlds with 50 epochs and with dyna = 200. Our expectation is that with Dyna, the solution should be much better after 50 epochs than without.

Part 4: Implement author() Method (0%)

You should implement a method called author() that returns your Georgia Tech user ID as a string. This is the ID you use to log into t-square. It is not your 9 digit student number. Here is an example of how you might implement author() within a learner object:

class QLearner(object):
    def author(self):
        return 'tb34' # replace tb34 with your Georgia Tech username.

And here's an example of how it could be called from a testing program:

    # create a learner and train it
    learner = ql.QLearner() # create a QLearner
    print learner.author()

Check the template code for examples. We are adding those to the repo now, but it might not be there if you check right away. Implementing this method correctly does not provide any points, but there will be a penalty for not implementing it.

Contents of Report

There is no report component of this assignment. However, if you would like to impress us with your Machine Learning prowess, you are invited to submit a succinct report.

Hints & resources

This paper by Kaelbling, Littman and Moore, is a good resource for RL in general: http://www.jair.org/media/301/live-301-1562-jair.pdf See Section 4.2 for details on Q-Learning.

There is also a chapter in the Mitchell book on Q-Learning.

For implementing Dyna, you may find the following resources useful:

What to turn in

Turn your project in via t-square. All of your code must be contained within QLearner.py .

  • Your QLearner as QLearner.py
  • Do not submit any other files.

Rubric

Only your QLearner class will be tested.

  • For basic Q-Learning (dyna = 0) we will test your learner against 10 test worlds with 500 epochs in each world. One "epoch" means your robot reaches the goal one time, or after 100000 steps, whichever comes first. Your QLearner retains its state (Q-table), and then we allow it to navigate to the goal again, over and over, 500 times. Each test (500 epochs) should complete in less than 2 seconds. NOTE: an epoch where the robot fails to reach the goal will likely take much longer (in running time), than one that does reach the goal, and is a common reason for failing to complete test cases within the time limit.
  • Benchmark: As a benchmark to compare your solution to, we will run our reference solution in the same world, with 500 epochs. We will take the median reward of our reference across all of those 500 epochs.
  • Your score: For each world we will take the median cost your solution finds across all 500 epochs.
  • For a test to be successful, your learner should find a total reward >= 1.5 x the benchmark. Note that since reward for a single epoch is negative, your solution can be up to 50% worse than the reference solution and still pass.
  • There are 10 test cases, each test case is worth 9.5 points.
  • Here is how we will initialize your QLearner for these test cases:
    learner = ql.QLearner(num_states=100,\
        num_actions = 4, \
        alpha = 0.2, \
        gamma = 0.9, \
        rar = 0.98, \
        radr = 0.999, \
        dyna = 0, \
        verbose=False) #initialize the learner
  • For Dyna-Q, we will set dyna = 200. We will test your learner against world01.csv and world02.csv with 50 epochs. Scoring is similar to the non-dyna case: Each test should complete in less than 10 seconds. For the test to be successful, your learner should find solution with total reward to the goal >= 1.5 x the median reward our reference solution across all 50 epochs. Note that since reward for a single epoch is negative, your solution can be up to 50% worse than the reference solution and still pass. We will check this by taking the median of all 50 runs. Each test case is worth 2.5 points. We will initialize your learner with the following parameter values for these test cases:
    learner = ql.QLearner(num_states=100,\
        num_actions = 4, \
        alpha = 0.2, \
        gamma = 0.9, \
        rar = 0.5, \
        radr = 0.99, \
        dyna = 200, \
        verbose=False) #initialize the learner
  • Is the author() method correctly implemented (-20% if not)

Required, Allowed & Prohibited

Required:

  • Your project must be coded in Python 2.7.x.
  • Your code must run on one of the university-provided computers (e.g. buffet02.cc.gatech.edu).

Allowed:

  • You can develop your code on your personal machine, but it must also run successfully on one of the university provided machines or virtual images.
  • Your code may use standard Python libraries.
  • You may use the NumPy, SciPy, matplotlib and Pandas libraries. Be sure you are using the correct versions.
  • You may reuse sections of code (up to 5 lines) that you collected from other students or the internet.
  • Code provided by the instructor, or allowed by the instructor to be shared.

Prohibited:

  • Any libraries not listed in the "allowed" section above.
  • Any code you did not write yourself (except for the 5 line rule in the "allowed" section).
  • Any Classes (other than Random) that create their own instance variables for later use (e.g., learners like kdtree).
  • Print statements outside "verbose" checks (they significantly slow down auto grading).
  • Any method for reading data besides util.py

Legacy