Difference between revisions of "Holy Hand Grenade of Antioch"

From Quantitative Analysis Software Courses

Revision as of 16:55, 16 August 2019

Updates / FAQs

  • 2019-08-08 First draft

Overview

In this optional project you will implement an agent that trades in a simulated environment that includes dozens of other trading agents. The success of your contributed code and your score on the project will depend on how profitable your agent's trading is. The following rules apply:

  • Your agent starts each morning with $100,000 in cash.
  • You will trade only one asset, JPM.
  • Trading begins at 9:30AM, the market closes at 4:00PM.
  • Your score depends on the value of your portfolio as of market close, including cash and stock positions.
  • Your agent should never initiate a trade that will cause your portfolio to exceed a leverage of 1.0.
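The leverage rule can be checked before each order is placed. The sketch below assumes leverage means the absolute market value of the stock position divided by total portfolio value (cash plus stock). The project text does not define leverage, so that formula and the function names leverage_after_trade and trade_allowed are assumptions for illustration, not the official grading rule.

```python
from __future__ import division


def leverage_after_trade(cash, shares, price, trade_shares):
    """Leverage after buying (positive) or selling (negative) trade_shares.

    Assumed definition: abs(stock value) / (cash + stock value).
    """
    new_cash = cash - trade_shares * price
    new_shares = shares + trade_shares
    stock_value = new_shares * price
    portfolio_value = new_cash + stock_value
    if portfolio_value <= 0:
        # No equity left: treat leverage as unbounded.
        return float("inf")
    return abs(stock_value) / portfolio_value


def trade_allowed(cash, shares, price, trade_shares, max_leverage=1.0):
    """True if the trade keeps leverage at or below max_leverage."""
    return leverage_after_trade(cash, shares, price, trade_shares) <= max_leverage
```

Under this assumed definition, an agent holding $100,000 in cash could buy 1,000 shares at $100 but not 1,001.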

About the ABIDES simulator and getting started

You will implement your trading agent to run within the Agent-Based Interactive Discrete Event Simulation (ABIDES). ABIDES was designed by Prof. Tucker Balch and David Byrd at Georgia Tech. David is the lead architect and developer of ABIDES. The development of ABIDES has been supported by the NSF, and it is now available as open source at GitHub.

ABIDES is used in research at J.P. Morgan to develop and evaluate trading algorithms and models of market structure.

Get the ABIDES simulation distribution at GitHub here: https://github.com/abides-sim/abides

Note that unless you really know what you're doing, you should never issue a pull request to this repo. It will expose your code to others and you will be perceived as uncool. To run an example simulation in the default configuration of background agents use the following Unix command lines:

[please replace the below with correct command lines]

cd *blah*
python *blah*

You will find a subdirectory

ABIDES/TradingEcosystem/*blah*

The "TradingEcosystem" directory is where we collect agents that will contribute to our ecosystem of traders. We hope, perhaps to include yours there in the future. There is a subdirectory for each participating agent. Note that the subdirectory *blah* contains a basic example agent after which you can pattern the trading agent you design.

[I made up the above as a way to handle (eventually) many contributed traders. Let me know if you end up thinking it is a decent approach or if you have a better way to handle it.]

What you should do

Create your own directory:

ABIDES/TradingEcosystem/yourID_agentname

Here, "yourID" is a set of characters that uniquely identifies you. If you are a Georgia Tech student, for instance, this should be your login ID (e.g., mine is tb34). "agentname" is your specific name for this agent. We have separate names for agents submitted by the same person because in the future you might improve the one you wrote, or you might want to contribute a new one with a different name. Copy the template code into that subdirectory, rename it "yourID_agentname.py", and be sure that your agent's class name is also "yourID_agentname".

You can assume that your agent will have read and write access to "your" subdirectory. So you can store a learned policy there and perhaps update it between runs. It is required that you follow the path name conventions used by the example agent provided, namely that the subdirectory location is relative. If you do not follow that convention, your code will break and we will not grade it.
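As an illustration of storing a learned policy between runs, the following sketch saves and reloads an arbitrary Python object with pickle under a relative path. The directory layout, the file name policy.pkl, and the helper names are hypothetical; the example agent's own path convention takes precedence over anything shown here.

```python
import os
import pickle

# Hypothetical relative location; follow the example agent's convention instead.
AGENT_DIR = os.path.join("TradingEcosystem", "yourID_agentname")
POLICY_FILE = os.path.join(AGENT_DIR, "policy.pkl")


def save_policy(policy, path=POLICY_FILE):
    """Write the policy object to disk, creating the directory if needed."""
    directory = os.path.dirname(path)
    if directory and not os.path.isdir(directory):
        os.makedirs(directory)
    with open(path, "wb") as f:
        pickle.dump(policy, f)


def load_policy(path=POLICY_FILE, default=None):
    """Return the stored policy, or default if nothing has been saved yet."""
    if not os.path.isfile(path):
        return default
    with open(path, "rb") as f:
        return pickle.load(f)
```

An agent might call load_policy() when it starts and save_policy(...) when the simulated day ends, so that each run builds on the previous one.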

Lifecycle of your agent

Your

Part 2: Navigation Problem Test Cases

We will test your CRLearner with a navigation problem as follows. Note that your CRLearner does not need to be coded specially for this task. In fact the code doesn't need to know anything about it. The code necessary to test your learner with this navigation task is implemented in testcrlearner.py for you.

The navigation task takes place in a square grid world that measures 1.0 units by 1.0 units. The location of the robot is the "state" and it will be provided to you as a 1 by 2 ndarray where the first element represents the X location and the second element represents the Y location. The particular environment is expressed in a CSV file of integers, where the value in each position is interpreted as follows:

  • 0: blank space.
  • 1: an obstacle.
  • 2: the starting location for the robot.
  • 3: the goal location.
  • 5: quicksand.

An example navigation problem (world01.csv) is shown below. Following python conventions, [0.0, 0.0] is upper left, or northwest corner, and [1.0, 1.0] is the lower right or southeast corner. Rows are north/south, columns are east/west.

3,0,0,0,0,0,0,0,0,0
0,0,0,0,0,0,0,0,0,0
0,0,0,0,0,0,0,0,0,0
0,0,1,1,1,1,1,0,0,0
0,5,1,0,0,0,1,0,0,0
0,5,1,0,0,0,1,0,0,0
0,0,1,0,0,0,1,0,0,0
0,0,0,0,0,0,0,0,0,0
0,0,0,0,0,0,0,0,0,0
0,0,0,0,2,0,0,0,0,0
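A map like the one above could be loaded and queried roughly as follows. This is a sketch only: the internals of testcrlearner.py are not shown here, so the helper names and the mapping from a continuous [x, y] location to a grid cell (x selects the column, y selects the row) are assumptions based on the orientation described above.

```python
import io

import numpy as np

WORLD01 = u"""3,0,0,0,0,0,0,0,0,0
0,0,0,0,0,0,0,0,0,0
0,0,0,0,0,0,0,0,0,0
0,0,1,1,1,1,1,0,0,0
0,5,1,0,0,0,1,0,0,0
0,5,1,0,0,0,1,0,0,0
0,0,1,0,0,0,1,0,0,0
0,0,0,0,0,0,0,0,0,0
0,0,0,0,0,0,0,0,0,0
0,0,0,0,2,0,0,0,0,0
"""


def load_world(csv_text):
    """Parse CSV text into a 2-D integer array (row 0 is the north edge)."""
    return np.loadtxt(io.StringIO(csv_text), delimiter=",", dtype=int)


def find_cell(world, code):
    """Return (row, col) of the first cell holding the given code."""
    rows, cols = np.where(world == code)
    return int(rows[0]), int(cols[0])


def location_to_cell(world, loc):
    """Map a continuous [x, y] location in the unit square to (row, col)."""
    n_rows, n_cols = world.shape
    col = min(int(loc[0] * n_cols), n_cols - 1)
    row = min(int(loc[1] * n_rows), n_rows - 1)
    return row, col


world = load_world(WORLD01)
start = find_cell(world, 2)  # bottom center
goal = find_cell(world, 3)   # top left
```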

In this example the robot will be started at the bottom center, and must navigate to the top left. Note that a wall of obstacles blocks its path, and there is some quicksand along the left side. The objective is for the robot to learn how to navigate from the starting location to the goal with the highest total reward. We define the reward for each step as:

  • -1 if the robot moves to an empty or blank space, or attempts to move into a wall
  • -100 if the robot moves to a quicksand space
  • 1 if the robot moves to the goal space
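The reward rules above amount to a small lookup on the map code of the cell the robot ends up in. The sketch below assumes the start cell is rewarded like a blank space, which the rules do not state explicitly; the constant and function names are illustrative.

```python
# Map codes from the CSV format described earlier.
BLANK, OBSTACLE, START, GOAL, QUICKSAND = 0, 1, 2, 3, 5


def step_reward(cell_code):
    """Reward for one step, given the code of the cell the robot moved to
    (or tried to move to, in the case of an obstacle)."""
    if cell_code == GOAL:
        return 1.0
    if cell_code == QUICKSAND:
        return -100.0
    # Blank space, attempted move into a wall, and (assumed) the start cell.
    return -1.0
```

For example, a trip of nine blank-space steps followed by reaching the goal would total 9 * (-1.0) + 1.0 = -8.0.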

Overall, we will assess the performance of a policy as the average reward it incurs to travel from the start to the goal (higher reward is better). We assess a learner in terms of the reward it converges to over a given number of training iterations (trips from start to goal).

Important note: the problem includes random actions and sensor noise. So, for example, if your learner responds with a "move north" action, there is some probability that the robot will actually move in a different direction. For this reason, the "wise" learner develops policies that keep the robot well away from quicksand. We map this problem to a reinforcement learning problem as follows:

  • State: The state is the location of the robot, expressed as a 2 element vector.
  • Actions: There are 4 possible actions, 0: move north, 1: move east, 2: move south, 3: move west.
  • R: The reward is as described above.
  • T: The transition matrix can be inferred from the CSV map and the actions.
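To make the action coding and the random-action behaviour concrete, here is a simplified sketch that works on grid cells rather than continuous locations. The random_rate of 0.2 is a made-up value, and the rule that a blocked or off-map move leaves the robot in place is an assumption; the actual test environment may differ on both points.

```python
import random

# Action codes from the list above as (d_row, d_col); row 0 is the north edge.
MOVES = {0: (-1, 0), 1: (0, 1), 2: (1, 0), 3: (0, -1)}


def noisy_move(row, col, action, grid, random_rate=0.2, rng=random):
    """Apply an action, occasionally replacing it with a random one."""
    if rng.random() < random_rate:
        action = rng.randint(0, 3)  # randint is inclusive on both ends
    d_row, d_col = MOVES[action]
    new_row, new_col = row + d_row, col + d_col
    n_rows, n_cols = len(grid), len(grid[0])
    off_map = not (0 <= new_row < n_rows and 0 <= new_col < n_cols)
    if off_map or grid[new_row][new_col] == 1:
        return row, col  # assumed: blocked moves leave the robot in place
    return new_row, new_col
```

Because any step may be redirected, a path that hugs the quicksand carries a real chance of a -100 step, which is why a learner that keeps its distance tends to score better.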

Note that R and T are not known by or available to the learner. The testing code testcrlearner.py will test your code as follows (pseudo code):

Instantiate the learner with the constructor CRLearner()
s = initial_location
a = querysetstate(s)
s_prime = new location according to action a
r = -1.0
while not converged:
    a = query(s_prime, r)
    s_prime = new location according to action a
    if s_prime == goal:
        r = +1
        s_prime = start location
    elif s_prime == quicksand:
        r = -100
    else:
        r = -1

A few things to note about this code: The learner always receives a reward of -1.0 (or -100.0) until it reaches the goal, when it receives a reward of +1.0. As soon as the robot reaches the goal, it is immediately returned to the starting location.
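The pseudo code implies that the learner exposes querysetstate(s) and query(s_prime, r). A placeholder with that interface might look like the sketch below. The class name RandomStubLearner, the constructor arguments, and the random action choice are all hypothetical; it learns nothing and only shows the shape of the calls.

```python
import numpy as np


class RandomStubLearner(object):
    """Placeholder with the interface used by the pseudo code."""

    def __init__(self, num_actions=4, seed=0):
        self.num_actions = num_actions
        self.rng = np.random.RandomState(seed)
        self.s = None
        self.a = None
        self.total_reward = 0.0

    def author(self):
        return 'tb34'  # replace tb34 with your own user ID

    def querysetstate(self, s):
        """Set the current state without learning and return an action."""
        self.s = np.asarray(s, dtype=float)
        self.a = int(self.rng.randint(self.num_actions))
        return self.a

    def query(self, s_prime, r):
        """Record the reward for the last action, then pick the next action."""
        self.total_reward += r  # a real learner would update its model here
        return self.querysetstate(s_prime)


learner = RandomStubLearner()
a0 = learner.querysetstate(np.array([[0.45, 0.95]]))  # state is a 1 by 2 ndarray
a1 = learner.query(np.array([[0.45, 0.85]]), -1.0)
```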

Part 3: Implement author() Method (0%)

You should implement a method called author() that returns your Georgia Tech user ID as a string. This is the ID you use to log into t-square. It is not your 9 digit student number. Here is an example of how you might implement author() within a learner object:

class CRLearner(object):
    def author(self):
        return 'tb34' # replace tb34 with your Georgia Tech username.

And here's an example of how it could be called from a testing program:

    import CRLearner as cr # assumes your learner is in CRLearner.py

    # create a learner and print its author
    learner = cr.CRLearner() # create a CRLearner
    print learner.author()

Check the template code for examples. We are adding those to the repo now, but they might not be there if you check right away. Implementing this method correctly does not provide any points, but there will be a penalty for not implementing it.

Contents of Report

There is no report component of this assignment. However, if you would like to impress us with your Machine Learning prowess, you are invited to submit a succinct report.

Hints & resources

The main difference between this problem and the earlier one is that you must deal with continuous state. Deep Q-Learning is one approach to this problem. You are welcome also to consider other solutions if you like. Here are some links to Deep Q-Learning approaches:
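As one of the "other solutions" for continuous state, and only as an illustration, the sketch below uses Q-learning with linear function approximation over radial basis features. It is a simpler technique than Deep Q-Learning, not a substitute reference solution, and the class name RBFQ and every hyperparameter in it are assumptions chosen for the example.

```python
import numpy as np


class RBFQ(object):
    """Q-values as a linear function of radial basis features of [x, y]."""

    def __init__(self, num_actions=4, centers_per_axis=5, width=0.2,
                 alpha=0.1, gamma=0.9):
        ticks = np.linspace(0.0, 1.0, centers_per_axis)
        gx, gy = np.meshgrid(ticks, ticks)
        self.centers = np.column_stack([gx.ravel(), gy.ravel()])
        self.width = width
        self.alpha = alpha  # learning rate
        self.gamma = gamma  # discount factor
        self.weights = np.zeros((num_actions, self.centers.shape[0]))

    def features(self, s):
        """Gaussian bump around each center, evaluated at state s."""
        s = np.asarray(s, dtype=float).reshape(2)
        sq_dist = np.sum((self.centers - s) ** 2, axis=1)
        return np.exp(-sq_dist / (2.0 * self.width ** 2))

    def q_values(self, s):
        """One estimated value per action."""
        return self.weights.dot(self.features(s))

    def update(self, s, a, r, s_prime):
        """One temporal-difference update; returns the TD error."""
        phi = self.features(s)
        target = r + self.gamma * np.max(self.q_values(s_prime))
        td_error = target - self.weights[a].dot(phi)
        self.weights[a] += self.alpha * td_error * phi
        return td_error
```

Because the features vary smoothly with location, the learner never needs to know the resolution of the underlying grid.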

What to turn in

Turn your project in via t-square. All the code necessary to run your learner must be submitted. We will call only your methods in CRLearner following the specification described above. You are allowed to access/use library code, but it must be submitted and run as .py files. If you do use code that was not written by you, you must include comments providing proper credit and citations.

  • Your CRLearner as CRLearner.py
  • Other python files as necessary to support your learner.

Rubric

Only your CRLearner class will be tested.

  • The code for the learner must reflect an effort to create a continuous state learner (not a repackaged discrete state learner like Q-Learning).
  • We will create a number of groups of test cases, where each group reflects essentially the same navigation problem but with progressively higher resolution. e.g., multiple square worlds of different sizes 5x5 world, 10x10 world, 100x100, 1000x1000, etc. Your learner will not know the dimensions of the world it is in.
  • We will test your learner against N (value of N to be determined later) test worlds with 500 iterations in each world. One "iteration" means your robot reaches the goal one time, or the simulation times out. Your CRLearner retains its state, and then we allow it to navigate to the goal again, over and over, 500 times.
  • Benchmark: We do not have a reference solution for this problem. We will instead use the best student's submission as the benchmark. We will select a number of test cases that the benchmark can solve, then use those as the cases we test other submissions against. We will take the median reward of the benchmark across all of those 500 iterations.
  • Your score: For each world we will take the median reward your solution obtains across all 500 iterations.
  • For a test to be successful, your learner should find a total reward >= 1.5 x the benchmark.
  • There will be 10 test cases; each test case is worth 9.0 points.
  • Is the author() method correctly implemented? (-100% if not)
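Read literally, the scoring arithmetic in this rubric could be sketched as follows. The function names are hypothetical and this is not the actual grading code. Note that because rewards in this task are typically negative, 1.5 x a negative benchmark is a lower bar than the benchmark itself.

```python
import numpy as np

POINTS_PER_TEST = 9.0


def world_passes(rewards, benchmark_median, factor=1.5):
    """True if the median reward over all iterations meets the threshold."""
    return float(np.median(rewards)) >= factor * benchmark_median


def total_score(results):
    """Points earned given one pass/fail flag per test case."""
    return POINTS_PER_TEST * sum(1 for passed in results if passed)
```

With a benchmark median of -20.0, the threshold is -30.0, so a learner whose median reward is -28.0 would pass and one whose median is -35.0 would not.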

Required, Allowed & Prohibited

Required:

  • Your project must be coded in Python 2.7.x.
  • Your code must run on one of the university-provided computers (e.g. buffet02.cc.gatech.edu).
  • All code required to run the learner must be submitted. We will not debug your code.
  • All code in CRLearner.py must be written by you.

Allowed:

  • You can develop your code on your personal machine, but it must also run successfully on one of the university-provided machines or virtual images.
  • Your code may use standard Python libraries.
  • You may use the NumPy, SciPy, matplotlib and Pandas libraries. Be sure you are using the correct versions.
  • You may reuse sections of code (up to 5 lines) that you collected from other students or the internet.
  • You may use code provided by the instructor, or code that the instructor allows to be shared.
  • You may reuse code from the internet that you include as support files (it must be credited and cited).

Prohibited:

  • Any libraries not listed in the "allowed" section above.

Legacy