Holy Hand Grenade of Antioch

1 Updates / FAQs
2 Overview
3 About the ABIDES simulator
4 Getting started and template
5 Part 1: Implement CRLearner
6 Part 2: Navigation Problem Test Cases
7 Part 3: Implement author() Method (0%)
8 Contents of Report
9 Hints & resources
10 What to turn in
11 Rubric
12 Required, Allowed & Prohibited
13 Legacy

Updates / FAQs

2019-08-08 First draft

Overview

In this optional project you will implement an agent that trades in a simulated environment that includes dozens of other trading agents. The success of your contributed code and your score on the project will depend on how profitable your agent's trading is. The following rules apply:

Your agent starts each morning with $100,000 in cash.
You will trade only one asset, JPM.
Trading begins at 9:30 AM and the market closes at 4:00 PM.
Your score depends on the value of your portfolio as of market close, including cash and stock positions.
Your agent should never initiate a trade that will cause your portfolio to exceed a leverage of 1.0.
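The portfolio-value and leverage rules can be checked with a few lines of arithmetic. This is a minimal sketch that assumes leverage is defined as |value of stock positions| / (value of stock positions + cash). Check the course materials for the exact definition used in grading. The function names and the JPM price are hypothetical, and fees and market impact are ignored.

```python
def portfolio_value(cash, shares, price):
    """Portfolio value: cash plus the stock position marked at `price`."""
    return cash + shares * price


def leverage(cash, shares, price):
    """Assumed definition: |position value| / (position value + cash)."""
    position = shares * price
    equity = position + cash
    if equity <= 0:
        return float('inf')  # no positive equity left: treat as unbounded
    return abs(position) / float(equity)


def trade_allowed(cash, shares, price, order_shares, max_leverage=1.0):
    """True if buying (order_shares > 0) or selling/shorting (order_shares < 0)
    at `price` keeps leverage <= max_leverage. Ignores fees and market impact."""
    new_cash = cash - order_shares * price
    new_shares = shares + order_shares
    return leverage(new_cash, new_shares, price) <= max_leverage


price = 100.0  # hypothetical JPM price
print(portfolio_value(100000.0, 0, price))        # 100000.0
print(trade_allowed(100000.0, 0, price, 1000))    # True: leverage exactly 1.0
print(trade_allowed(100000.0, 0, price, 1001))    # False: leverage 1.001
```

Under this definition a full-sized short position is limited in the same way: shorting 1000 shares at $100 gives a leverage of exactly 1.0.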

About the ABIDES simulator

You will implement your trading agent to run within the Agent-Based Interactive Discrete Event Simulation (ABIDES). ABIDES was inspired by an earlier project named Stockyard, led by Prof. Tucker Balch at Georgia Tech. In 2016 David Byrd and Prof. Balch decided to start fresh and build a new simulator from the ground up. David is the lead architect and developer of ABIDES. The development of ABIDES has been supported by the NSF, and it is now available as open source on GitHub.

ABIDES is also used in research at J.P. Morgan to develop and evaluate trading algorithms and models of market structure.

Please see our arXiv paper for preliminary documentation:

https://arxiv.org/abs/1904.12066

Please see the wiki for tutorials and example configurations:

https://github.com/abides-sim/abides/wiki

Getting started and template

Get the ABIDES simulator distribution from GitHub: https://github.com/abides-sim/abides

Update your local mc3_p5 directory from GitHub.
Implement the CRLearner class in mc3_p5/CRLearner.py.
To test your CRLearner, run python testcrlearner.py from the mc3_p5/ directory.
Note that example navigation problems are provided in the mc3_p5/testworlds directory.

The worlds beginning with "vr" as in "vr_01_005.csv" are intended for use as test cases for this project.

Part 1: Implement CRLearner

Your CRLearner class should be implemented in the file CRLearner.py. It should implement EXACTLY the API defined below. DO NOT import any modules besides those allowed below. Your class should implement the following methods: the constructor CRLearner(num_dimensions, num_actions, verbose), query(s_prime, r), and querysetstate(s).

Details on the input arguments to the constructor:

num_dimensions: integer, the number of continuous dimensions in the state
num_actions: integer, the number of actions available.
verbose: boolean, True if printing debug output is allowed.

query(s_prime, r) is the core method of the CRLearner. It should keep track of the last state s and the last action a, then use the new information s_prime and r to update its internal model or policy. The learning instance, or experience tuple, is <s, a, s_prime, r>. query() should return an integer, which is the next action to take. Details on the arguments:

s_prime: a one-dimensional ndarray containing num_dimensions elements. Each element corresponds to one dimension of the state.
r: float, a real-valued immediate reward.

querysetstate(s): a special version of the query method that sets the state to s and returns an integer action according to the same rules as query().
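The state and action bookkeeping described above can be sketched as a class skeleton. This is not a working learner. The helpers `_choose_action()` and `update()` and the `num_updates` counter are hypothetical. The policy is a uniformly random placeholder, and the sketch's querysetstate() skips learning because no reward accompanies it. Only the standard library is used, so a state can be any indexable sequence, including a NumPy ndarray.

```python
import random


class CRLearner(object):
    """Skeleton of the required API; the policy and update are placeholders."""

    def __init__(self, num_dimensions, num_actions, verbose=False):
        self.num_dimensions = num_dimensions
        self.num_actions = num_actions
        self.verbose = verbose
        self.s = None          # last state s
        self.a = None          # last action a
        self.num_updates = 0   # experience tuples seen so far

    def author(self):
        return 'tb34'  # replace tb34 with your Georgia Tech username

    def _choose_action(self, s):
        # Hypothetical placeholder policy: uniform random action.
        # A real learner would consult its continuous-state model of s here.
        return random.randint(0, self.num_actions - 1)

    def update(self, s, a, s_prime, r):
        # Hypothetical hook for learning from the experience tuple <s, a, s_prime, r>.
        self.num_updates += 1

    def querysetstate(self, s):
        """Set the state to s and return an action, without learning (no reward yet)."""
        self.s = list(s)
        self.a = self._choose_action(self.s)
        return self.a

    def query(self, s_prime, r):
        """Learn from <s, a, s_prime, r>, then pick and remember the next action."""
        s_prime = list(s_prime)
        self.update(self.s, self.a, s_prime, r)
        self.s = s_prime
        self.a = self._choose_action(s_prime)
        if self.verbose:
            print("s_prime=%s r=%s a=%d" % (s_prime, r, self.a))
        return self.a
```

With this skeleton, callers can exercise the full API, but the robot only wanders at random until `_choose_action()` and `update()` are replaced with a real continuous-state learner.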

Here's an example of the API in use:

import CRLearner as cr
import numpy as np

learner = cr.CRLearner(num_dimensions = 2,
    num_actions = 4, verbose = False)

s = np.asarray((0.4, 0.45)) # our initial state

a = learner.querysetstate(s) # action for state s

s_prime = np.asarray((0.42, 0.45)) # the new state we end up in after taking action a in state s

r = 0.0 # reward for taking action a in state s

next_action = learner.query(s_prime, r)

Part 2: Navigation Problem Test Cases

We will test your CRLearner with a navigation problem as follows. Note that your CRLearner does not need to be coded specially for this task. In fact, the code doesn't need to know anything about it. The code necessary to test your learner with this navigation task is implemented for you in testcrlearner.py.

The navigation task takes place in a square grid world that measures 1.0 units by 1.0 units. The location of the robot is the "state", and it will be provided to you as a one-dimensional ndarray of 2 elements, where the first element represents the X location and the second element represents the Y location. The particular environment is expressed in a CSV file of integers, where the value in each position is interpreted as follows:

0: blank space.
1: an obstacle.
2: the starting location for the robot.
3: the goal location.
5: quicksand.

An example navigation problem (world01.csv) is shown below. Following Python conventions, [0.0, 0.0] is the upper left (northwest) corner and [1.0, 1.0] is the lower right (southeast) corner. Row indices increase from north to south, and column indices increase from west to east.

3,0,0,0,0,0,0,0,0,0
0,0,0,0,0,0,0,0,0,0
0,0,0,0,0,0,0,0,0,0
0,0,1,1,1,1,1,0,0,0
0,5,1,0,0,0,1,0,0,0
0,5,1,0,0,0,1,0,0,0
0,0,1,0,0,0,1,0,0,0
0,0,0,0,0,0,0,0,0,0
0,0,0,0,0,0,0,0,0,0
0,0,0,0,2,0,0,0,0,0
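The helpers below show one way to relate a world file to the continuous state. They are a minimal sketch, and the helper names are hypothetical. The sketch assumes that s[0] indexes rows and s[1] indexes columns, following the Python [row, col] convention mentioned above. Because the text calls the first element X, swap the indices if testcrlearner.py uses [column, row]. world01.csv is inlined so the example runs on its own.

```python
import csv

# world01.csv from above, inlined so the example is self-contained.
WORLD01 = """3,0,0,0,0,0,0,0,0,0
0,0,0,0,0,0,0,0,0,0
0,0,0,0,0,0,0,0,0,0
0,0,1,1,1,1,1,0,0,0
0,5,1,0,0,0,1,0,0,0
0,5,1,0,0,0,1,0,0,0
0,0,1,0,0,0,1,0,0,0
0,0,0,0,0,0,0,0,0,0
0,0,0,0,0,0,0,0,0,0
0,0,0,0,2,0,0,0,0,0"""

BLANK, OBSTACLE, START, GOAL, QUICKSAND = 0, 1, 2, 3, 5


def load_world(lines):
    """Parse CSV lines (e.g. an open file) into a list of rows of ints."""
    return [[int(v) for v in row] for row in csv.reader(lines) if row]


def find_cells(world, code):
    """All (row, col) cells holding `code`, in row-major order."""
    return [(i, j) for i, row in enumerate(world)
            for j, v in enumerate(row) if v == code]


def state_to_cell(s, n_rows, n_cols):
    """Map a continuous state in [0, 1]^2 to the (row, col) cell containing it.
    Assumes s[0] indexes rows and s[1] columns; 1.0 falls in the last cell."""
    row = min(int(s[0] * n_rows), n_rows - 1)
    col = min(int(s[1] * n_cols), n_cols - 1)
    return row, col


def cell_center(cell, n_rows, n_cols):
    """Continuous coordinates of the center of a (row, col) cell."""
    return ((cell[0] + 0.5) / n_rows, (cell[1] + 0.5) / n_cols)


world = load_world(WORLD01.splitlines())
print(find_cells(world, START))      # [(9, 4)]
print(find_cells(world, GOAL))       # [(0, 0)]
print(find_cells(world, QUICKSAND))  # [(4, 1), (5, 1)]
```

To read the file directly, pass an open file handle, e.g. `load_world(open('testworlds/world01.csv'))`. This assumes the file is in the testworlds directory.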

In this example the robot starts at the bottom center and must navigate to the top left. Note that a wall of obstacles blocks its path, and there is some quicksand along the left side. The objective is for the robot to learn how to navigate from the starting location to the goal with the highest total reward. We define the reward for each step as:

-1 if the robot moves to an empty or blank space, or attempts to move into a wall
-100 if the robot moves to a quicksand space
1 if the robot moves to the goal space

Overall, we will assess the performance of a policy as the average reward it incurs to travel from the start to the goal (higher reward is better). We assess a learner in terms of the reward it converges to over a given number of training iterations (trips from start to goal).
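The reward rules and the per-trip assessment can be written out directly. In this sketch, each step is scored by the cell the robot occupies after the step, as in the test-harness pseudocode in this section. A blocked move into a wall leaves the robot where it was. The function names are illustrative only.

```python
BLANK, OBSTACLE, START, GOAL, QUICKSAND = 0, 1, 2, 3, 5


def step_reward(cell_code):
    """Reward for one step, given the code of the cell the robot is in afterwards.
    A blocked move into a wall leaves the robot in place, so it is scored by
    the cell it stays in (normally blank, hence -1)."""
    if cell_code == GOAL:
        return 1.0
    if cell_code == QUICKSAND:
        return -100.0
    return -1.0  # blank space (the start cell is treated as blank here)


def trip_reward(codes):
    """Total reward of one start-to-goal trip, given the cell code after each step."""
    return sum(step_reward(c) for c in codes)


def average_trip_reward(trip_totals):
    """Average of per-trip totals (higher is better)."""
    return sum(trip_totals) / float(len(trip_totals))


print(trip_reward([BLANK] * 5 + [GOAL]))               # -4.0
print(trip_reward([BLANK] * 4 + [QUICKSAND, GOAL]))    # -103.0
```

A single step into quicksand outweighs about a hundred extra blank moves, so a longer path that avoids quicksand scores better than a shorter one that risks it.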

Important note: the problem includes random actions and sensor noise. For example, if your learner responds with a "move north" action, there is some probability that the robot will actually move in a different direction. For this reason, the "wise" learner develops policies that keep the robot well away from quicksand. We map this problem to a reinforcement learning problem as follows:

State: The state is the location of the robot, expressed as a 2-element vector.
Actions: There are 4 possible actions, 0: move north, 1: move east, 2: move south, 3: move west.
R: The reward is as described above.
T: The transition matrix can be inferred from the CSV map and the actions.
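One way to picture T is as a noisy step function on the grid. The sketch below makes several assumptions: each move covers one cell, and moves off the map or into an obstacle leave the robot in place. It also uses a hypothetical `random_rate` of 0.2 for the chance that a uniformly random action is executed instead of the chosen one. The actual noise model and its parameters are whatever testcrlearner.py implements, and sensor noise on the reported state is not modeled here.

```python
import random

# Action codes from the text: 0 north, 1 east, 2 south, 3 west.
# (row, col) deltas: row grows southward, col grows eastward.
MOVES = {0: (-1, 0), 1: (0, 1), 2: (1, 0), 3: (0, -1)}
OBSTACLE = 1


def step(world, cell, action, random_rate=0.2, rng=random):
    """Return the (row, col) reached by trying `action` from `cell`.
    With probability random_rate a uniformly random action (possibly the same
    one) is executed instead; this noise model is an assumption.
    Blocked moves leave the robot where it was."""
    if rng.random() < random_rate:
        action = rng.randint(0, len(MOVES) - 1)
    dr, dc = MOVES[action]
    row, col = cell[0] + dr, cell[1] + dc
    n_rows, n_cols = len(world), len(world[0])
    if not (0 <= row < n_rows and 0 <= col < n_cols):
        return cell  # off the map: stay put
    if world[row][col] == OBSTACLE:
        return cell  # into a wall: stay put
    return (row, col)
```

For example, with `random_rate=0.0`, `step([[0, 0], [0, 0]], (0, 0), 1, random_rate=0.0)` moves east to `(0, 1)`. A learner never sees this function. It only observes the resulting states and rewards, which is why R and T must be learned from experience.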

Note that R and T are not known by or available to the learner. The testing code testcrlearner.py will test your code as follows (pseudocode):

learner = CRLearner(num_dimensions = 2, num_actions = 4, verbose = False)
s = start_location
a = learner.querysetstate(s)
s_prime = new location according to action a
r = -1.0
while not converged:
    a = learner.query(s_prime, r)
    s_prime = new location according to action a
    if s_prime == goal:
        r = +1.0
        s_prime = start_location
    elif s_prime == quicksand:
        r = -100.0
    else:
        r = -1.0

A few things to note about this code: The learner always receives a reward of -1.0 (or -100.0) until it reaches the goal, when it receives a reward of +1.0. As soon as the robot reaches the goal, it is immediately returned to the starting location.
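The test loop described above can be turned into a small runnable simulation for checking that a learner follows the query()/querysetstate() protocol. It is not the actual testcrlearner.py, and it rests on several assumptions: a hypothetical 3x3 world, deterministic one-cell moves, a state reported as the center of the robot's cell (with s[0] as the row), a hypothetical per-trip `max_steps` timeout, and a random-action stand-in for CRLearner.

```python
import random

# Hypothetical 3x3 world: 3 = goal, 5 = quicksand, 1 = obstacle, 2 = start.
WORLD = [[3, 0, 0],
         [5, 1, 0],
         [0, 0, 2]]
MOVES = {0: (-1, 0), 1: (0, 1), 2: (1, 0), 3: (0, -1)}  # north, east, south, west


class RandomLearner(object):
    """Stand-in with the CRLearner API that ignores its inputs."""

    def __init__(self, num_dimensions, num_actions, verbose=False):
        self.num_actions = num_actions

    def querysetstate(self, s):
        return random.randint(0, self.num_actions - 1)

    def query(self, s_prime, r):
        return random.randint(0, self.num_actions - 1)


def find(code):
    """First (row, col) holding `code`."""
    for i, row in enumerate(WORLD):
        for j, v in enumerate(row):
            if v == code:
                return (i, j)
    return None


def move(cell, a):
    """Deterministic one-cell move; blocked moves leave the robot in place."""
    r, c = cell[0] + MOVES[a][0], cell[1] + MOVES[a][1]
    if 0 <= r < len(WORLD) and 0 <= c < len(WORLD[0]) and WORLD[r][c] != 1:
        return (r, c)
    return cell


def to_state(cell):
    """Report the state as the cell center in [0, 1]^2 (assumes s[0] is the row)."""
    n = float(len(WORLD))
    return [(cell[0] + 0.5) / n, (cell[1] + 0.5) / n]


def run(learner, trips=5, max_steps=1000):
    """Return the total reward of each trip; a trip ends at the goal or after max_steps."""
    start, goal = find(2), find(3)
    cell = start
    a = learner.querysetstate(to_state(cell))
    totals = []
    for _ in range(trips):
        total = 0.0
        for _ in range(max_steps):
            cell = move(cell, a)
            reached = cell == goal
            if reached:
                r, cell = 1.0, start  # reward, then straight back to the start
            elif WORLD[cell[0]][cell[1]] == 5:
                r = -100.0
            else:
                r = -1.0
            total += r
            a = learner.query(to_state(cell), r)
            if reached:
                break
        totals.append(total)  # after a timeout the robot stays where it is
    return totals
```

Calling `run(RandomLearner(2, 4), trips=3)` returns three trip totals. Their values vary from run to run because the stand-in learner acts randomly.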

Part 3: Implement author() Method (0%)

You should implement a method called author() that returns your Georgia Tech user ID as a string. This is the ID you use to log into t-square. It is not your 9-digit student number. Here is an example of how you might implement author() within a learner object:

class CRLearner(object):
    def author(self):
        return 'tb34' # replace tb34 with your Georgia Tech username.

And here's an example of how it could be called from a testing program:

    # create a learner and train it
    learner = cr.CRLearner(num_dimensions = 2, num_actions = 4, verbose = False) # create a CRLearner
    print(learner.author())

Check the template code for examples. We are adding those to the repo now, but they might not be there yet if you check right away. Implementing this method correctly does not provide any points, but there will be a penalty for not implementing it.

Contents of Report

There is no report component of this assignment. However, if you would like to impress us with your Machine Learning prowess, you are invited to submit a succinct report.

Hints & resources

The main difference between this problem and the earlier one is that you must deal with continuous state. Deep Q-Learning is one approach to this problem, and you are also welcome to consider other solutions. Here are some links to Deep Q-Learning approaches:

This blog is a good starting point: http://karpathy.github.io/2016/05/31/rl/
An overview of Deep RL: https://arxiv.org/abs/1701.07274
An article in Nature: https://www.nature.com/nature/journal/v518/n7540/full/nature14236.html
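As an alternative to a deep network, Q-values can be approximated as a linear function of Gaussian radial basis features of the state. This technique is swapped in for illustration and is not one prescribed here. The weights are trained with the semi-gradient Q-learning update w_a += alpha * (r + gamma * max_b Q(s', b) - Q(s, a)) * phi(s). The class name `RBFQFunction` and the values of `centers_per_dim`, `width`, `alpha`, `gamma`, and `epsilon` are illustrative assumptions, not tuned settings. Only the standard library is used.

```python
import math
import random


class RBFQFunction(object):
    """Linear Q-function over Gaussian RBF features on [0, 1]^d (sketch)."""

    def __init__(self, num_dimensions, num_actions, centers_per_dim=5,
                 width=0.15, alpha=0.1, gamma=0.95):
        # Regular grid of RBF centers covering the unit hypercube.
        axis = [(k + 0.5) / centers_per_dim for k in range(centers_per_dim)]
        self.centers = [[]]
        for _ in range(num_dimensions):
            self.centers = [c + [x] for c in self.centers for x in axis]
        self.width = width
        self.alpha = alpha
        self.gamma = gamma
        # One weight vector per action.
        self.w = [[0.0] * len(self.centers) for _ in range(num_actions)]

    def features(self, s):
        """phi(s): Gaussian bumps around each center, normalized to sum to 1."""
        phi = []
        for c in self.centers:
            d2 = sum((si - ci) ** 2 for si, ci in zip(s, c))
            phi.append(math.exp(-d2 / (2.0 * self.width ** 2)))
        total = sum(phi)
        return [p / total for p in phi]

    def q_values(self, s):
        """Q(s, a) = w_a . phi(s) for every action a."""
        phi = self.features(s)
        return [sum(wi * pi for wi, pi in zip(w_a, phi)) for w_a in self.w]

    def update(self, s, a, s_prime, r):
        """One semi-gradient Q-learning step on the experience <s, a, s_prime, r>."""
        phi = self.features(s)
        q_sa = sum(wi * pi for wi, pi in zip(self.w[a], phi))
        target = r + self.gamma * max(self.q_values(s_prime))
        delta = target - q_sa
        self.w[a] = [wi + self.alpha * delta * pi for wi, pi in zip(self.w[a], phi)]
        return delta

    def act(self, s, epsilon=0.1):
        """Epsilon-greedy action selection."""
        if random.random() < epsilon:
            return random.randrange(len(self.w))
        q = self.q_values(s)
        return q.index(max(q))
```

Inside a CRLearner, query() could call `update(self.s, self.a, s_prime, r)` and then `act(s_prime)`. The number of centers grows as centers_per_dim ** num_dimensions, which is manageable in 2-D but not in many dimensions. The feature width also has to suit the world's resolution, which the learner does not know.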

What to turn in

Turn your project in via t-square. All the code necessary to run your learner must be submitted. We will call only your methods in CRLearner following the specification described above. You are allowed to access/use library code, but it must be submitted and run as .py files. If you do use code that was not written by you, you must include comments providing proper credit and citations.

Your CRLearner as CRLearner.py
Other python files as necessary to support your learner.

Rubric

Only your CRLearner class will be tested.

The code for the learner must reflect an effort to create a continuous state learner (not a repackaged discrete state learner like Q-Learning).
We will create a number of groups of test cases, where each group reflects essentially the same navigation problem at progressively higher resolution, e.g., square worlds of 5x5, 10x10, 100x100, and 1000x1000 cells. Your learner will not know the dimensions of the world it is in.
We will test your learner against N (value of N to be determined later) test worlds with 500 iterations in each world. One "iteration" means your robot reaches the goal one time, or the simulation times out. Your CRLearner retains its state, and then we allow it to navigate to the goal again, over and over, 500 times.
Benchmark: We do not have a reference solution for this problem. We will instead use the best student's submission as the benchmark. We will select a number of test cases that the benchmark can solve, then use those as the cases we test other submissions against. We will take the median reward of the benchmark across all of those 500 iterations.
Your score: For each world we will take the median reward your solution achieves across all 500 iterations.
For a test to be successful, your learner's median reward should be >= 1.5 x the benchmark's median reward. Because the rewards are negative, this means your median cost may be at most 1.5 times the benchmark's.
There will be 10 test cases, each worth 9.0 points.
Is the author() method correctly implemented? (-100% if not)
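The pass/fail rule and the point total can be expressed as a small calculation. This sketch assumes the per-iteration rewards for a world are available as lists, that the benchmark's median reward is negative, and that "cost" means negative reward. The function names and sample numbers are made up for illustration.

```python
def median(values):
    """Median of a non-empty list (average of the two middle values if even)."""
    v = sorted(values)
    n = len(v)
    mid = n // 2
    return v[mid] if n % 2 else (v[mid - 1] + v[mid]) / 2.0


def world_passes(student_rewards, benchmark_rewards, factor=1.5):
    """True if the student's median reward >= factor x the benchmark's median.
    With negative rewards this allows a median cost up to 1.5x the benchmark's."""
    return median(student_rewards) >= factor * median(benchmark_rewards)


def total_score(pass_flags, points_per_test=9.0):
    """Points for the passed test cases."""
    return points_per_test * sum(1 for p in pass_flags if p)


# Made-up numbers: benchmark median -20, so the threshold is 1.5 x -20 = -30.
print(world_passes([-25.0, -28.0, -40.0], [-20.0, -19.0, -21.0]))  # True (median -28)
print(world_passes([-35.0, -31.0, -40.0], [-20.0, -19.0, -21.0]))  # False (median -35)
print(total_score([True] * 10))                                     # 90.0
```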

Required, Allowed & Prohibited

Required:

Your project must be coded in Python 2.7.x.
Your code must run on one of the university-provided computers (e.g. buffet02.cc.gatech.edu).
All code required to run the learner must be submitted. We will not debug your code.
All code in CRLearner.py must be written by you.

Allowed:

You can develop your code on your personal machine, but it must also run successfully on one of the university provided machines or virtual images.
Your code may use standard Python libraries.
You may use the NumPy, SciPy, matplotlib and Pandas libraries. Be sure you are using the correct versions.
You may reuse sections of code (up to 5 lines) that you collected from other students or the internet.
You may use code provided by the instructor, or code the instructor allows to be shared.
You may reuse code from the internet that you include as support files (it must be credited and cited).

Prohibited:

Any libraries not listed in the "allowed" section above.
