Difference between revisions of "MC3-Project-3"

From Quantitative Analysis Software Courses
 
(134 intermediate revisions by 2 users not shown)
Line 1: Line 1:
==Draft==
+
==DRAFT==
  
This is a draft version of the assignment.  This "draft" statement will be removed once the assignment definition is final.
+
This assignment is under revision.  This notice will be removed once it is final.
  
 
==Updates / FAQs==
 
  
==Overview==
+
*'''2017-04-02'''
 +
** Clarified instructions regarding "best possible" to use your own market simulator with adjusted closing prices.
  
In this project you will implement and assess Q-Learning.  Because of the limited time available for this project, we're going to have you first test your Q-Learning implementation to solve a navigation problem.  Applying Q-Learning to stock trading is offered as an extra credit assignment.  Note that your Q-Learning code really shouldn't care which problem it is solving, but in order to apply it to trading, you will have to re-work the testqlearner.py code.
+
*'''2017-03-16'''
 +
** Switch from IBM to AAPL.  Position sizes changed.  In sample and out of sample dates changed.
 +
** Added requirement for "best possible strategy".   
 +
** Added requirement that indicators be standardized.
 +
** Changed from 10 day to 21 day holding.  Chart requirements relaxed to just require a vertical line upon entry (no black vertical line on exit).
 +
** Added requirement for data visualization.
  
==Template and Data==
+
* Q: In a previous project there was a constraint of holding a single position until exit. Does that apply to this project?  A: Yes, hold one position until exit.
  
* Download <tt>'''[[Media:mc3_p3.zip|mc3_p3.zip]]'''</tt>, unzip inside <tt>ml4t/</tt>
+
* Q: Is that 21 calendar days, or 21 trading days (i.e., days when SPY was traded)? A: Always use trading days.
* Implement the <tt>QLearner</tt> class in <tt>mc3_p3/QLearner.py</tt>.
 
* To test your learner, run <tt>'''python testqlearner.py'''</tt> from the <tt>mc3_p3/</tt> directory.
 
* Note that example problems are provided in the <tt>mc3_p3/testworlds</tt> directory
 
  
==Part 1: Implement QLearner==
+
* Q: Are there constraints for Python modules allowed for this project? Can we experiment with modules for optimization or technical analysis and cite or are we expected to write everything from scratch for this project as well?  A: The constraints are the same as for the first learning project. You've already written the learners you need.
  
Your QLearner class should be implemented in the file <tt>QLearner.py</tt>. It should implement EXACTLY the API defined below. DO NOT import any modules besides those allowed below.  Your class should implement the following methods:
+
* Q: I want to read some other values from the data besides just adjusted close, how can I do that? A: Please modify an old version of util.py to do that, include that new util.py with your submission.
  
* QLearner(...): Constructor, see argument details below.
+
* Q: Are we required to trade in only 200 share blocks? (and have no more than 200 shares long or short at a time as in some of the previous assignments) A: (update).  You can trade up to 400 shares at a time as long as you maintain the requirement of 200, 0 or -200 shares.  This will enable comparison between results more easily.
* query(s_prime, r): Update Q-table with <s, a, s_prime, r> and return new action for state s.
+
* querysetstate(s): Set state to s, return action for state s, but don't update Q-table.
+
* Q: Are we limited to leverage of 2.0 on the portfolio?  A: There is no limit on leverage.
 +
 +
* Q: Are we only allowed one position at a time?  A: You can be in one of three states: -200 shares, +200 shares, 0 shares.
  
Here's an example of the API in use:
+
==Overview==
  
<PRE>
+
In this project you will develop trading strategies using Technical Analysis, and test them using your market simulator. You will then utilize your Random Tree learner to train and test a learning trading algorithm.
import QLearner as ql
 
  
learner = ql.QLearner(num_states = 100, \
+
In this project we shift from an auto graded format to a report format. For this project your grade will be based on the PDF report you submit, not your code. However, you will also submit your code that will be checked visually to ensure it appropriately matches the report you submit.
    num_actions = 4, \
 
    alpha = 0.2, \
 
    gamma = 0.9, \
 
    rar = 0.5, \
 
    radr = 0.99, \
 
    dyna = 0)
 
  
s = 99 # our initial state
+
==Data Details, Dates and Rules==
  
a = learner.querysetstate(s) # action for state s
+
Use the following parameters for Part 2, 3 and 4:
  
s_prime = 5 # the new state we end up in after taking action a in state s
+
* Use only the data provided for this course.  You are not allowed to import external data.
 +
* Trade only the symbol AAPL (however, you may, if you like, use data from other symbols to inform your strategy).
 +
* The in sample/training period is January 1, 2008 to December 31, 2009.
 +
* The out of sample/testing period is January 1, 2010 to December 31, 2011.
 +
* Starting cash is $100,000.
 +
* Allowable positions are: 200 shares long, 200 shares short, 0 shares.
 +
* Benchmark: The performance of a portfolio starting with $100,000 cash, investing in 200 shares of AAPL and holding that position.
 +
* There is no limit on leverage.
  
r = 0 # reward for taking action a in state s
+
==Part 1: Technical Indicators (20%)==
  
next_action = learner.query(s_prime, r)
+
Develop and describe at least 3 and at most 5 technical indicators. You may find our lecture on time series processing to be helpful.  For each indicator you should create a single chart that shows the price history of the stock during the in-sample period, "helper data" and the value of the indicator itself.  As an example, if you were using price/SMA as an indicator you would want to create a chart with 3 lines: Price, SMA, Price/SMA.  In order to facilitate visualization of the indicator you can normalize the data to 1.0 at the start of the date range (i.e. divide price[t] by price[0]).
</PRE>
 
  
<b>The constructor QLearner()</b> should reserve space for keeping track of Q[s, a] for the number of states and actions.  It should initialize Q[] with uniform random values between -1.0 and 1.0. Details on the input arguments to the constructor:
+
You should "standardize" or "normalize" your indicators so that they have zero mean and standard deviation 1.0.  One way to do this is the standard score transformation as described here: https://en.wikipedia.org/wiki/Standard_score .  This transformation will help ensure that all of your indicators are considered with equal importance by your learner.
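As a minimal sketch of the standard score transformation mentioned above (the function name <tt>standardize</tt> and the sample values are made up for illustration, and NumPy is assumed to be available):

```python
import numpy as np

def standardize(values):
    """Return the standard score (z-score) of a 1-D indicator series."""
    values = np.asarray(values, dtype=float)
    # nan-aware statistics, since rolling indicators often start with NaN entries
    mean = np.nanmean(values)
    std = np.nanstd(values)
    if std == 0:
        raise ValueError("indicator is constant; cannot standardize")
    return (values - mean) / std

# Made-up price/SMA values, for illustration only
price_over_sma = np.array([0.97, 1.01, 1.03, 0.99, 1.05])
z = standardize(price_over_sma)
```

One judgment call this sketch leaves open: if the mean and standard deviation are computed on the in sample period only and reused out of sample, the transformation does not look at test data.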
  
* <tt>num_states</tt> integer, the number of states to consider
+
Your report description of each indicator should enable someone to reproduce it just by reading the description. We want a written description here, not code; however, it is OK to augment your written description with a pseudocode figure.
* <tt>num_actions</tt>  integer, the number of actions available.  
 
* <tt>alpha</tt> float, the learning rate used in the update rule. Should range between 0.0 and 1.0 with 0.2 as a typical value.
 
* <tt>gamma</tt> float, the discount rate used in the update rule.  Should range between 0.0 and 1.0 with 0.9 as a typical value.
 
* <tt>rar</tt> float, random action rate: the probability of selecting a random action at each step. Should range between 0.0 (no random actions) to 1.0 (always random action) with 0.5 as a typical value.
 
* <tt>radr</tt> float, random action decay rate, after each update, rar = rar * radr. Ranges between 0.0 (immediate decay to 0) and 1.0 (no decay).  Typically 0.99.
 
* <tt>dyna</tt> integer, conduct this number of dyna updates for each regular update.  When Dyna is used, 200 is a typical value.
 
  
<b>query(s_prime, r)</b> is the core method of the Q-Learner.  It should keep track of the last state s and the last action a, then use the new information s_prime and r to update the Q table. The learning instance, or experience tuple is <s, a, s_prime, r>.  query() should return an integer, which is the next action to take.  Note that it should choose a random action with probability rar, and that it should update rar according to the decay rate radr at each step.
+
At least one of the indicators you use should be completely different from the ones presented in our lectures. (i.e. something other than SMA, Bollinger Bands, RSI)
  
<b>querysetstate(s)</b> A special version of the query method that finds an action according to the same rules as query(), but it does not execute an update to the Q-table.  This method is typically only used once, to set the initial state.
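The page does not spell out the update rule itself. As a hedged sketch, the standard tabular Q-Learning update and the random-action choice could look like the following. The helper names <tt>q_update</tt> and <tt>choose_action</tt> are hypothetical and are not part of the required API; a real <tt>QLearner</tt> would keep Q, s, a and rar as instance state.

```python
import random
import numpy as np

def q_update(Q, s, a, s_prime, r, alpha=0.2, gamma=0.9):
    """Apply one tabular Q-learning update for the experience <s, a, s_prime, r>."""
    best_next = np.max(Q[s_prime])  # value of the greedy action in s_prime
    Q[s, a] = (1 - alpha) * Q[s, a] + alpha * (r + gamma * best_next)

def choose_action(Q, s, rar, rng=random):
    """Pick a random action with probability rar, otherwise the greedy action."""
    if rng.random() < rar:
        return rng.randrange(Q.shape[1])
    return int(np.argmax(Q[s]))

# Zeros keep this demo reproducible; the spec above asks for
# uniform random initial values between -1.0 and 1.0.
Q = np.zeros((100, 4))
q_update(Q, s=99, a=0, s_prime=5, r=-1.0)
next_action = choose_action(Q, 5, rar=0.0)
```

After each call to query(), rar would then be multiplied by radr, as described in the constructor arguments.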
+
Deliverables:
 +
* Descriptive text (2 to 3 pages with figures).
 +
* 3 to 5 charts (one for each indicator)
 +
* Code: indicators.py
  
==The Navigation Problem==
+
==Part 2: Best Possible Strategy (5%)==
  
We will test your Q-Learner with a navigation problem as follows.  Note that your Q-Learner does not need to be coded specially for this task.  In fact the code doesn't need to know anything about it.  The code necessary to test your learner with this navigation task is implemented in testqlearner.py for you.  The navigation task takes place in a 10 x 10 grid world.  The particular environment is expressed in a CSV file of integers, where the value in each position is interpreted as follows:
+
Assume that you can see the future, but that you are constrained by the portfolio size and order limits as specified above.  Create a set of trades that represents the best a strategy could possibly do during the in sample period. The holding time requirements described in the next sections do not apply to this exercise.  The reason we're having you do this is so that you will have an idea of an upper bound on performance.
  
* 0: blank space.
+
The intent is for you to use adjusted close prices with the market simulator that you wrote earlier in the course.
* 1: an obstacle.
 
* 2: the starting location for the robot.
 
* 3: the goal location.
 
  
An example navigation problem (CSV file) is shown below:
+
Provide a chart that reports:
  
<PRE>
+
* Benchmark (see definition above) normalized to 1.0 at the start: Black line
0,0,0,0,3,0,0,0,0,0
+
* Value of the best possible portfolio (normalized to 1.0 at the start): Blue line
0,0,0,0,0,0,0,0,0,0
 
0,0,0,0,0,0,0,0,0,0
 
0,0,1,1,1,1,1,0,0,0
 
0,0,1,0,0,0,1,0,0,0
 
0,0,1,0,0,0,1,0,0,0
 
0,0,1,0,0,0,1,0,0,0
 
0,0,0,0,0,0,0,0,0,0
 
0,0,0,0,0,0,0,0,0,0
 
0,0,0,0,2,0,0,0,0,0
 
</PRE>
 
  
The robot starts at the bottom center, and must navigate to the top center.  Note that a wall of obstacles blocks its path.  We map this problem to a reinforcement learning problem as follows:
+
You should also report in text:
  
* State: The state is the location of the robot, it is computed (discretized) as: horizontal location * 10 + vertical location.
+
* Cumulative return of the benchmark and portfolio
* Actions: There are 4 possible actions, 0: move north, 1: move east, 2: move south, 3: move west.
+
* Stdev of daily returns of benchmark and portfolio
* R: The reward is -1.0 unless the action leads to the goal, in which case the reward is +1.0.
+
* Mean of daily returns of benchmark and portfolio
* T: The transition matrix can be inferred from the CSV map and the actions.
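For the Part 2 text report (cumulative return, standard deviation and mean of daily returns for the benchmark and the portfolio), a rough sketch of the arithmetic is shown here. The function names and prices are made up, the benchmark is assumed to buy on the first day at that day's price with no commission or market impact, and the sample standard deviation (ddof=1) is one possible choice, not something this page specifies.

```python
import numpy as np

def benchmark_values(prices, start_cash=100000.0, shares=200):
    """Daily value of a portfolio that buys `shares` on day 0 and holds."""
    prices = np.asarray(prices, dtype=float)
    cash = start_cash - shares * prices[0]  # cash left after the purchase
    return cash + shares * prices

def portfolio_stats(port_val):
    """Return (cumulative return, stdev of daily returns, mean of daily returns)."""
    port_val = np.asarray(port_val, dtype=float)
    daily_ret = port_val[1:] / port_val[:-1] - 1.0
    cum_ret = port_val[-1] / port_val[0] - 1.0
    return cum_ret, daily_ret.std(ddof=1), daily_ret.mean()

prices = [100.0, 102.0, 101.0, 105.0]  # made-up adjusted closes
bench = benchmark_values(prices)
cum_ret, std_daily, mean_daily = portfolio_stats(bench)
normalized = bench / bench[0]          # normalized to 1.0 at the start, for charting
```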
 
  
Note that R and T are not known by or available to the learner.  The testing code <tt>testqlearner.py</tt> will test your code as follows (pseudo code):
+
==Part 3: Manual Rule-Based Trader (20%)==
  
<pre>
+
Devise a set of rules using the indicators you created in Part 1 above.  Your rules should be designed to trigger a "long" or "short" entry for a 21 trading day hold.  In other words, once an entry is initiated, you must remain in the position for 21 trading days.  In your report you must describe your trading rules so that another person could implement them based only on your description. We want a written description here, not code; however, it is OK to augment your written description with a pseudocode figure.
Instantiate the learner with the constructor QLearner()
 
s = initial_location
 
a = querysetstate(s)
 
s_prime = new location according to action a
 
r = -1.0
 
while not converged:
 
    a = query(s_prime, r)
 
    s_prime = new location according to action a
 
    if s_prime == goal:
 
        r = +1
 
        s_prime = start location
 
    else:
 
        r = -1
 
</pre>
 
  
A few things to note about this code: The learner always receives a reward of -1.0 until it reaches the goal, when it receives a reward of +1.0. As soon as the robot reaches the goal, it is immediately returned to the starting location.
+
You should tweak your rules as best you can to get the best performance possible during the in sample period (do not peek at out of sample performance).  Use your rule-based strategy to generate an orders file over the in sample period, then run that file through your market simulator to create a chart that includes the following components over the in sample period:
  
==Contents of Report==
+
* Benchmark (see definition above) normalized to 1.0 at the start: Black line
 +
* Value of the rule-based portfolio (normalized to 1.0 at the start): Blue line
 +
* Vertical green lines indicating LONG entry points.
 +
* Vertical red lines indicating SHORT entry points.
  
==Hints & resources==
+
Note that each red or green vertical line should be at least 21 days from the preceding line.  We will check for that.  We expect that your rule-based strategy should outperform the benchmark over the in sample period. 
  
==What to turn in==
+
Deliverables:
 +
* Descriptive text (1 or 2 pages with chart) that provides a compelling justification for the rule-based system developed.
 +
* Text must describe rule based system in sufficient detail that another person could implement it.
 +
* 1 chart.
 +
* Code: rule_based.py (generates an orders file)
  
Turn your project in via t-square. 
+
==Part 4: ML Trader (30%)==
  
* Your report as <tt>report.pdf</tt>
+
Convert your decision tree '''regression''' learner into a '''classification''' learner. The classifications should be:
* Your code as <tt>code.py</tt>
 
  
==Extra credit up to 3%==
+
* +1: LONG
 +
* 0: DO NOTHING
 +
* -1: SHORT
  
==Rubric==
+
The X data for each sample (day) are simply the values of your indicators for the stock -- you should have 3 to 5 of them.  The Y data (or classifications) will be based on 21 day return.  You should classify the example as a +1 or "LONG" if the 21 day return exceeds a certain value, let's call it YBUY for the moment.  You should classify the example as a -1 or "SHORT" if the 21 day return is below a certain value we'll call YSELL.  In all other cases the sample should be classified as a 0 or "DO NOTHING."  Note that it is very important that you train your learner with these classification values (not the 21 day returns).  We will check for this.
  
==Required, Allowed & Prohibited==
+
Note that your X values are calculated each day from the current day's (and earlier) data, but the Y value (classification) is calculated using data from the future.  You may tweak various parameters of your learner to maximize return (more on that below).  Train and test your learning strategy over the in sample period.  Whenever a LONG or SHORT is encountered, you must enter the corresponding position and hold it for 21 days.  That means, for instance, that if you encounter a LONG on day 1, then a SHORT on day 2, you must keep the stock still until the 21 days expire, even though you received this conflicting information.  The reason for this is that we're trying to provide a way to directly compare the manual strategy versus the ML strategy.
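The 21 day holding rule can be sketched as a filter over the daily signals. This is only an illustration: the function name <tt>entries_with_hold</tt> is hypothetical, days are counted as indices into the list of trading days, and turning the entries into orders (including the exit after 21 days) is left out.

```python
def entries_with_hold(signals, hold_days=21):
    """Turn daily signals (+1 LONG, -1 SHORT, 0 DO NOTHING) into entry points.

    Once a position is entered, every signal during the next
    hold_days trading days is ignored.
    """
    entries = []
    next_free = 0  # first trading-day index on which a new entry is allowed
    for day, signal in enumerate(signals):
        if day >= next_free and signal != 0:
            entries.append((day, signal))
            next_free = day + hold_days
    return entries

# LONG on day 0, a conflicting SHORT on day 1 (ignored),
# then SHORT on day 21 and a LONG on day 22 (ignored).
signals = [1, -1, 0, 0] + [0] * 17 + [-1, 1]
entries = entries_with_hold(signals)
```

The same filter can be applied to the rule-based output and to the learner's output, so that both strategies are compared under the same holding constraint.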
  
Required:
+
'''Important note:''' You must set the leaf_size parameter of your decision tree learner to 5 or larger. This requirement is intended to avoid a degenerate overfit solution to this problem.
* Your project must be coded in Python 2.7.x.
 
* Your code must run on one of the university-provided computers (e.g. buffet02.cc.gatech.edu), or on one of the provided virtual images.
 
* Your code must run in less than 30 seconds on one of the university-provided computers.
 
  
Allowed:
+
Use your ML-based strategy to generate an orders file over the in sample period, then run that file through your market simulator to create a chart that includes the following components over the in sample period:
* You can develop your code on your personal machine, but it must also run successfully on one of the university provided machines or virtual images.
 
* Your code may use standard Python libraries.
 
* You may use the NumPy, SciPy and Pandas libraries.  Be sure you are using the correct versions.
 
* You may reuse sections of code (up to 5 lines) that you collected from other students or the internet.
 
* Code provided by the instructor, or allowed by the instructor to be shared.
 
  
Prohibited:
+
* Benchmark (see definition above) normalized to 1.0 at the start: Black line
* Any libraries not listed in the "allowed" section above.
+
* Value of the rule-based portfolio (normalized to 1.0 at the start): Blue line.
* Any code you did not write yourself (except for the 5 line rule in the "allowed" section).
+
* Value of the ML-based portfolio (normalized to 1.0 at the start): Green line.
* Any Classes (other than Random) that create their own instance variables for later use (e.g., learners like kdtree).
+
* Vertical green lines indicating LONG entry points.
* Print statements (they significantly slow down auto grading).
+
* Vertical red lines indicating SHORT entry points.
  
 +
We expect that the ML-based strategy will outperform the manual strategy, however it is possible that it does not.  If it is the case that your manual strategy does better, you should try to explain why in your report.
  
 +
You should tweak the parameters of your learner to maximize performance during the in sample period.  Here is a partial list of things you can tweak:
 +
* Adjust YSELL and YBUY.
 +
* Adjust leaf_size.
 +
* Utilize bagging and adjust the number of bags.
  
==Updates / FAQs==
+
Deliverables:
 +
* Descriptive text (1 or 2 pages with chart) that describes your ML approach.
 +
* Text must describe ML based system in sufficient detail that another person could implement it.
 +
* 1 chart
 +
* Code: ML_based.py (generates an orders file)
 +
* Additional code files as necessary to support ML_based.py (e.g. RTLearner.py and so on).
  
Q: Can I use an ML library or do I have to write the code myself?  A: You must write the KNN and bagging code yourself.  For the LinRegLearner you are allowed to make use of NumPy or SciPy libraries but you must "wrap" the library code to implement the APIs defined below.  Do not use other libraries or your code will fail the auto grading test cases.
+
==Part 5: Visualization of data (15%)==
  
2015-11-07
+
Choose two of your indicators, call them X1 and X2.  Create 3 scatter plots where each point in each plot is located according to the indicator values on that day at X1, X2.  Color each dot according to the following scheme:
  
Draft version posted.
+
* Green if the factors on that day satisfy "LONG" conditions.
 +
* Red if the factors satisfy "SHORT" conditions.
 +
* Black if neither "LONG" nor "SHORT" is satisfied.
  
2015-11-10
+
The scale for the scatter plot should be set to +-1.5 in both dimensions.  This will help us check that you have standardized your indicators.
  
* Q: Which libraries am I allowed to use?  Which library calls are prohibited?
+
The 3 plots should be based on the in sample period (about 500 points):
  
* A: The general idea is that the use of classes that create and maintain their own data structures is prohibited. So for instance, use of <tt>scipy.spatial.KDTree</tt> is not allowed because it builds a tree and keeps that data structure around for reference later.  The intent for this project is that YOU should be building and maintaining the data structures necessary.  You can, however, use most methods that return immediate results and do not retain data structures.
+
# Your rule-based strategy.
** Examples of things that are allowed: sqrt(), sort(), argsort() -- note that these methods return an immediate value and do not retain data structures for later use.
+
# The training data for your ML strategy.
** Examples of things that are prohibited: any scikit add on library, scipy.spatial.KDTree, importing things from libraries other than pandas, numpy or scipy.
+
# Response of your learner when queried with the same data (after training).
  
2015-11-12
+
==Part 6: Comparative Analysis (10%)==
  
Clarification regarding dataset generation: Your strategy for defeating KNNLearner and LinRegLearner should not depend on the way you select training data versus testing data.  The relationship of one learner performing better than another should persist regardless of which 60% of the data is selected for training and which 40% is selected for testing.
+
Evaluate the performance of both of your strategies in the out of sample period.  Note that you '''should not''' train or tweak your learner on this data.  You should use the classification learned using the training data only. Create a chart that shows, out of sample:
  
==Overview==
+
* Benchmark (see definition above) normalized to 1.0 at the start: Black line
You are to implement and evaluate three learning algorithms as Python classes: A KNN learner, a Linear Regression learner (provided) and a Bootstrap Aggregating learner. The classes should be named KNNLearner, LinRegLearner, and BagLearner respectively.  We are considering this a <b>regression</b> problem (not classification). So the goal is to return a continuous numerical result (not a discrete result).
+
* Performance of manual strategy: Blue line
 +
* Performance of the ML strategy: Green line
 +
* All three should be normalized to 1.0 at the start.
  
In this project we are training & testing with static spatial data.  In the next project we will make the transition to time series data.
+
Create a table that summarizes the performance of the stock, the manual strategy and the ML strategy for both in sample and out of sample periods.  Utilize your experience in this class to determine which factors are best to use for comparing these strategies.  If performance out of sample is worse than in sample, do your best to explain why.  Also, if the manual and ML strategies perform substantially differently, explain why.  Is one method or the other more or less susceptible to the same underlying flaw?  Why or why not?
  
You must write your own code for KNN and bagging. You are NOT allowed to use other peoples' code to implement KNN or bagging.
+
Deliverables:
 +
* Descriptive text (1 or 2 pages including figures)
 +
* 1 chart
  
The project has two main components: The code for your learners, which will be auto graded, and your report, <tt>report.pdf</tt> that should include the components listed below.
+
==Hints==
  
==Template and Data==
+
'''Overall, I recommend the following steps in the creation of your strategies:'''
  
Instructions:
+
* Indicator design hints:
* Download <tt>'''[[Media:mc3_p1.zip|mc3_p1.zip]]'''</tt>, unzip inside <tt>ml4t/</tt>
+
** For your X values: Identify and implement at least 3 technical features that you believe may be predictive of future return.
 +
* Rule based design:
 +
** Use a cascade of if statements conditioned on the indicators to identify whether a LONG condition is met.
 +
** Use a cascade of if statements conditioned on the indicators to identify whether a SHORT condition is met.
 +
** The conditions for LONG and SHORT should be mutually exclusive.
 +
** If neither LONG nor SHORT is triggered, the result should be DO NOTHING.
 +
** For debugging purposes, you may find it helpful to plot the value of the rule-based output (-1, 0, 1) versus the stock price.
 +
* Train a classification learner on in sample training data:
 +
** For your Y values: Use future 21 day return (not future price). Then classify that return as LONG, SHORT or DO NOTHING. You're trying to predict a relative change that you can use to invest with.
 +
** For debugging purposes, you may find it helpful to plot the value of the training classification data (-1, 0, 1) versus the stock price in one color.
 +
** For debugging purposes, you may find it helpful to plot the value of the training classification output (-1, 0, 1) versus the stock price in another color.  Ideally, these two lines should be very similar.
  
You will find these files in the mc3_p1 directory
+
'''Choosing Technical Features -- Your X Values'''
  
* <tt>Data/</tt>: Contains data for you to test your learning code on.
+
You should have already successfully coded the Bollinger Band feature:
* <tt>LinRegLearner.py</tt>: An implementation of the LinRegLearner class.  You can use it as a template for implementing your learner classes.
 
* <tt>__init__.py</tt>: Tells Python that you can import classes while in this directory.
 
* <tt>testlearner.py</tt>: Helper code to test a learner class.
 
  
In the Data/ directory there are three files:
+
<PRE>
* 3_groups.csv
+
bb_value[t] = (price[t] - SMA[t])/(stdev[t])
* ripple_.csv
+
</PRE>
* simple.csv
 
  
We will mainly be working with ripple and 3_groups. Each data file contains 3 columns: X1, X2, and Y.  In most cases you should use the <b>first 60% of the data for training</b>, and the <b>remaining 40% for testing</b>.
+
Two other good features worth considering are momentum and volatility.
  
==Part 1: Implement KNNLearner (30%)==
+
<PRE>
 +
momentum[t] = (price[t]/price[t-N]) - 1
 +
</PRE>
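The bb_value and momentum formulas above, together with volatility taken as the standard deviation of daily returns, might be computed along these lines. This is a sketch only: the window lengths, the helper <tt>rolling</tt>, and the sample prices are assumptions, and the results would still need to be standardized.

```python
import numpy as np

def rolling(values, window, func):
    """Apply func to each trailing window; the first window-1 entries are NaN."""
    values = np.asarray(values, dtype=float)
    out = np.full(len(values), np.nan)
    for t in range(window - 1, len(values)):
        out[t] = func(values[t - window + 1 : t + 1])
    return out

def bb_value(price, window=20):
    """bb_value[t] = (price[t] - SMA[t]) / stdev[t]"""
    price = np.asarray(price, dtype=float)
    sma = rolling(price, window, np.mean)
    stdev = rolling(price, window, np.std)
    return (price - sma) / stdev

def momentum(price, n=10):
    """momentum[t] = price[t] / price[t-n] - 1, for n >= 1."""
    price = np.asarray(price, dtype=float)
    out = np.full(len(price), np.nan)
    out[n:] = price[n:] / price[:-n] - 1.0
    return out

def volatility(price, window=20):
    """Rolling standard deviation of daily returns."""
    price = np.asarray(price, dtype=float)
    daily_ret = np.full(len(price), np.nan)
    daily_ret[1:] = price[1:] / price[:-1] - 1.0
    return rolling(daily_ret, window, np.std)

price = np.array([10.0, 11.0, 12.0, 11.0, 13.0])  # made-up prices
m = momentum(price, n=2)
bb = bb_value(price, window=3)
vol = volatility(price, window=3)
```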
  
Your KNNLearner class should be implemented in the file <tt>KNNLearner.py</tt>.  It should implement EXACTLY the API defined below.  DO NOT import any modules besides those from numpy, scipy, or the basic Python libraries.  You should implement the following functions/methods:
+
Volatility is just the stdev of daily returns.
  
import KNNLearner as knn
+
You still need to standardize the resulting values.
learner = knn.KNNLearner(k = 3) # constructor
 
learner.addEvidence(Xtrain, Ytrain) # training step
 
Y = learner.query(Xtest) # query
 
  
Where "k" is the number of nearest neighbors to find. Xtrain and Xtest should be ndarrays (numpy objects) where each row represents an X1, X2, X3... XN set of feature values.  The columns are the features and the rows are the individual example instances.  Y and Ytrain are single dimension ndarrays that indicate the value we are attempting to predict with X.
+
'''Choosing Y'''
  
Use Euclidean distance.  Take the mean of the closest k points' Y values to make your prediction.  If there are multiple equidistant points on the boundary of being selected or not selected, you may use whatever method you like to choose among them.
+
Your code should classify based on 21 day change in price.  You need to build a new Y that reflects the 21 day change and aligns with the current date.  Here's pseudo code for the calculation of Y:
  
==Part 2: Implement BagLearner (20%)==
+
ret = (price[t+21]/price[t]) - 1.0
 +
if ret > YBUY:
 +
    Y[t] = +1 # LONG
 +
elif ret < YSELL:
 +
    Y[t] = -1 # SHORT
 +
else:
 +
    Y[t] = 0
  
Implement Bootstrap Aggregating as a Python class named BagLearner.  Your BagLearner class should be implemented in the file <tt>BagLearner.py</tt>.  It should implement EXACTLY the API defined below.  DO NOT import any modules besides those from numpy, scipy, or the basic Python libraries.  You should implement the following functions/methods:
+
If you select Y in this manner and use it for training, your learner will classify 21 day returns.
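A vectorized version of this Y calculation could look like the sketch below. The function name <tt>make_labels</tt>, the thresholds and the prices are made up for illustration. Note that the last 21 days of the period have no future price, so they get no label here and would have to be dropped from X as well.

```python
import numpy as np

def make_labels(price, ybuy, ysell, horizon=21):
    """Classify each day by its future `horizon`-day return.

    Returns len(price) - horizon labels: +1 LONG, -1 SHORT, 0 DO NOTHING.
    """
    price = np.asarray(price, dtype=float)
    ret = price[horizon:] / price[:-horizon] - 1.0  # ret[t] = price[t+horizon]/price[t] - 1
    labels = np.zeros(len(ret), dtype=int)
    labels[ret > ybuy] = 1
    labels[ret < ysell] = -1
    return labels

# Tiny demo with a 3 day horizon so the arithmetic is easy to follow
price = [100.0, 100.0, 100.0, 110.0, 95.0, 100.0]
labels = make_labels(price, ybuy=0.05, ysell=-0.03, horizon=3)
```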
 
import BagLearner as bl
 
learner = bl.BagLearner(learner = knn.KNNLearner, kwargs = {"k":3}, bags = 20, boost = False)
 
learner.addEvidence(Xtrain, Ytrain)
 
Y = learner.query(Xtest)
 
  
Where learner is the learning class to use with bagging.  kwargs are keyword arguments to be passed on to the learner's constructor and they vary according to the learner (see hints below).  "bags" is the number of learners you should train using Bootstrap Aggregation.  If boost is true, then you should implement boosting. 
+
==Template and Data==
  
Notes:  See hints section below for example code you might use to instantiate your learners.  Boosting is an extra credit topic and not required.  There's a citation below in the Resources section that outlines a method of implementing bagging. If the training set contains n data items, each bag should contain n items as well.  Note that because you should sample with replacement, some of the data items will be repeated.
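Drawing one bag of n items with replacement can be sketched as follows. The function name <tt>bootstrap_sample</tt> and the toy data are hypothetical; each learner in the ensemble would be trained on its own sample.

```python
import numpy as np

def bootstrap_sample(X, Y, rng):
    """Draw n rows with replacement from an n-row training set."""
    X = np.asarray(X)
    Y = np.asarray(Y)
    n = X.shape[0]
    idx = rng.randint(0, n, size=n)  # indices in [0, n), repeats allowed
    return X[idx], Y[idx]

rng = np.random.RandomState(0)        # fixed seed so the demo is repeatable
X = np.arange(20).reshape(10, 2)      # 10 made-up samples, 2 features
Y = np.arange(10)
Xbag, Ybag = bootstrap_sample(X, Y, rng)
```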
+
There is no github template for this project.  You should create a directory for your code in ml4t/mc3-p3 and make a copy of util.py there.  You should also copy into that directory your learner code and your market simulator code. You will have access to the data in the ML4T/Data directory but you should use ONLY the code in util.py to read it.
  
==Part 3: Experiments and report (50%)==
+
==Contents of Report==
 
 
Create a report that addresses the following issues/questions.  The report should be submitted as <tt>report.pdf</tt> in PDF format.  Do not submit word docs or latex files.  Include data as tables or charts to support each of your answers.  I expect that this report will be 4 to 10 pages.
 
 
 
* Create your own dataset generating code (call it <tt>best4linreg.py</tt>) that creates data that performs significantly better with LinRegLearner than KNNLearner.  Explain your data generating algorithm, and explain why LinRegLearner performs better.  Your data should include at least 2 dimensions in X, and at least 1000 points. (Don't use bagging for this section).
 
* Create your own dataset generating code (call it <tt>best4KNN.py</tt>) that creates data that performs significantly better with KNNLearner than LinRegLearner.  Explain your data generating algorithm, and explain why KNNLearner performs better.  Your data should include at least 2 dimensions in X, and at least 1000 points. (Don't use bagging for this section).
 
* Consider the dataset <tt>ripple</tt> with KNN.  For which values of K does overfitting occur? (Don't use bagging).
 
* Now use bagging in conjunction with KNN with the <tt>ripple</tt> dataset.  How does performance vary as you increase the number of bags?  Does overfitting occur with respect to the number of bags?
 
* Can bagging reduce or eliminate overfitting with respect to K for the <tt>ripple</tt> dataset?
 
 
 
==Hints & resources==
 
  
Some external resources that might be useful for this project:
+
* Your report should be no more than 3000 words.  Your report should contain no more than 14 charts.  Penalties will apply if you violate these constraints.
 +
* Include charts and text as identified in the sections above.
  
* You may be interested to take a look at Andrew Moore's slides on [http://www.autonlab.org/tutorials/mbl.html instance based learning].
+
==Expectations==
* A definition of [http://mathworld.wolfram.com/StatisticalCorrelation.html correlation] which we'll use to assess the quality of the learning.
 
* [https://en.wikipedia.org/wiki/Bootstrap_aggregating Bootstrap Aggregating]
 
* [https://en.wikipedia.org/wiki/AdaBoost AdaBoost]
 
* [http://docs.scipy.org/doc/numpy/reference/generated/numpy.corrcoef.html numpy corrcoef]
 
* [http://docs.scipy.org/doc/numpy/reference/generated/numpy.argsort.html numpy argsort]
 
* [http://en.wikipedia.org/wiki/Root_mean_square RMS error]
 
  
You can use code like the below to instantiate several learners with the parameters listed in kwargs:
+
* In-sample AAPL backtests should perform very well -- The ML version should do better than the manual version.
 
+
* Out-of-sample AAPL backtests should... (you should be able to complete this sentence).
<pre>
 
learners = []
 
kwargs = {"k":10}
 
for i in range(0,bags):
 
    learners.append(learner(**kwargs))
 
</pre>
 
  
 
==What to turn in==
 
==What to turn in==
Be sure to follow these instructions diligently!
 
  
Via T-Square, submit as attachment (no zip files; refer to schedule for deadline):
+
Turn your project in via t-square. 
  
* Your code as <tt>KNNLearner.py, BagLearner.py</tt>, <tt>best4linreg.py</tt>, <tt>best4KNN.py</tt>
 
 
* Your report as <tt>report.pdf</tt>
 
* Your report as <tt>report.pdf</tt>
 +
* All of your code, as necessary to run as <tt>.py</tt> files.
 +
* Document how to run your code in <tt>readme.txt</tt>.
 +
* No zip files please.
  
DO NOT submit extra credit work as part of this submission.  Submit it separately to the "Extra credit" assignment on t-square.
+
==Rubric==
  
Unlimited resubmissions are allowed up to the deadline for the project.
+
Start with 100%, deductions as follows:
  
==Extra credit up to 3%==
+
Indicators (up to 20% potential deductions):
 +
* Is each indicator described in sufficient detail that someone else could reproduce it? (-5% for each if not)
 +
* Is there a chart for each indicator that properly illustrates its operation? (-5% for each if not)
 +
* Is at least one indicator different from those provided by the instructor's code (i.e., another indicator that is not SMA, Bollinger Bands or RSI) (-10% if not)
 +
* Does the submitted code <tt>indicators.py</tt> properly reflect the indicators provided in the report (-20% if not)
  
Implement boosting as part of BagLearner.  How does boosting affect performance for <tt>ripple</tt> and <tt>3_groups</tt> data?
+
Best possible (up to 5% potential deductions):
 +
* Is the chart correct (dates and equity curve) (-5% if not)
 +
* Is the reported performance correct within 5% (-1% for each item if not)
  
Does overfitting occur for either of these datasets as the number of bags with boosting increases?
+
Manual rule-based trader (up to 20% deductions):
 +
* Is the trading strategy described with clarity and in sufficient detail that someone else could reproduce it? (-10%)
 +
* Does the provided chart include:
 +
** Historic value of benchmark normalized to 1.0 with black line (-5% if not)
 +
** Historic value of portfolio normalized to 1.0 with blue line (-10% if not)
 +
** Are the appropriate date ranges covered? (-5% if not)
 +
** Are vertical lines included to indicate entries (-10% if not)
 +
* Does the submitted code <tt>rule_based.py</tt> properly reflect the strategy provided in the report? (-20% if not)
 +
* Does the manual trading system provide higher cumulative return than the benchmark over the in-sample time period? (-5% if not)
  
Create your own dataset for which overfitting occurs as the number of bags with boosting increases.
+
ML-based trader (up to 30% deductions):
 +
* Is the ML strategy described with clarity and in sufficient detail that someone else could reproduce it? (-10%)
 +
* Are modifications/tweaks to the basic decision tree learner fully described (-10%)
 +
* Does the methodology utilize a classification-based learner? (-30%)
 +
* Does the provided chart include:
 +
** Historic value of benchmark normalized to 1.0 with black line (-5% if not)
 +
** Historic value of rule-based portfolio normalized to 1.0 with blue line (-5% if not)
 +
** Historic value of ML-based portfolio normalized to 1.0 with green line (-10% if not)
 +
** Are the appropriate date ranges covered? (-5% if not)
 +
** Are vertical lines included to indicate entry (-10% if not)
 +
* Does the submitted code <tt>ML_based.py</tt> properly reflect the strategy provided in the report? (-30% if not)
 +
* Does the ML trading system provide a cumulative return at least 1.5x that of the benchmark over the in-sample time period? (-5% if not)
  
Submit your report <tt>report.pdf</tt> that focuses just on your extra credit work to the "extra credit" assignment on t-square.
+
Data visualization (up to 15% deductions):
 +
* Is the X data reported in all three charts the same? (-5% if not)
 +
* Is the X data standardized? (-5% if not)
 +
* Is the Y data in the train and query plots similar (-5% if not)
  
==Rubric==
+
Comparative analysis (up to 10% deductions):
 
+
* Is the appropriate chart provided (-5% for each missing element, up to a maximum of -10%)
* KNNLearner, auto grade 10 test cases (including ripple.csv and 3_groups.csv), 3 points each: 30 points
+
* Is there a table that reports in-sample and out-of-sample data for the baseline (just the stock), rule-based, and ML-based strategies? (-5% for each missing element)
* BagLearner, auto grade 10 test cases (including ripple.csv and 3_groups.csv), 2 points each: 20 points
+
* Are differences between the in-sample and out-of-sample performances appropriately explained (-5%)
* best4linreg.py (15 points)
 
** Code submitted (OK if not Python): -5 if absent
 
** Description complete -- Sufficient that someone else could implement it: -5 if not
 
** Description compelling -- The reasoning that linreg should do better is understandable and makes sense. Graph of the data helps but is not required if the description is otherwise compelling: -5 if not
 
** Train and test data drawn from same distribution: -5 if not
 
** Performance demonstrates that linreg does better: -10 if not
 
* best4KNN.py (15 points)
 
** Code submitted (OK if not Python): -5 if absent
 
** Description complete -- Sufficient that someone else could implement it: -5 if not
 
** Description compelling -- The reasoning that KNN should do better is understandable and makes sense. Graph of the data helps but is not required if the description is otherwise compelling: -5 if not
 
** Train and test data drawn from same distribution: -5 if not
 
** Performance demonstrates that KNN does better: -10 if not
 
* Overfitting (10 points)
 
** Is the region of overfitting correctly identified?: 5 points
 
** Is conclusion supported with data (table or chart): 5 points
 
* Bagging (10 points)
 
** Correct conclusion regarding overfitting as bags increase, supported with tables or charts: 5 points
 
** Correct conclusion regarding overfitting as K increases, supported with tables or charts: 5 points
 
  
 
==Required, Allowed & Prohibited==
 
==Required, Allowed & Prohibited==
Line 304: Line 289:
 
Required:
 
Required:
 
* Your project must be coded in Python 2.7.x.
 
* Your project must be coded in Python 2.7.x.
* Your code must run on one of the university-provided computers (e.g. buffet02.cc.gatech.edu), or on one of the provided virtual images.
+
* Your code must run on one of the university-provided computers (e.g. buffet02.cc.gatech.edu).
* Your code must run in less than 5 seconds on one of the university-provided computers.
+
* Use only util.py to read data.  If you want to read items other than adjusted close, modify util.py to do it, and submit your new version with your code.
  
 
Allowed:
 
Allowed:
 
* You can develop your code on your personal machine, but it must also run successfully on one of the university provided machines or virtual images.
 
* You can develop your code on your personal machine, but it must also run successfully on one of the university provided machines or virtual images.
 
* Your code may use standard Python libraries.
 
* Your code may use standard Python libraries.
* You may use the NumPy, SciPy and Pandas libraries.  Be sure you are using the correct versions.
+
* You may use the NumPy, SciPy, matplotlib and Pandas libraries.  Be sure you are using the correct versions.
 
* You may reuse sections of code (up to 5 lines) that you collected from other students or the internet.
 
* You may reuse sections of code (up to 5 lines) that you collected from other students or the internet.
 
* Code provided by the instructor, or allowed by the instructor to be shared.
 
* Code provided by the instructor, or allowed by the instructor to be shared.
* Cheese.
+
* A herring.
  
 
Prohibited:
 
Prohibited:
 +
* Any other method of reading data besides util.py
 
* Any libraries not listed in the "allowed" section above.
 
* Any libraries not listed in the "allowed" section above.
 
* Any code you did not write yourself (except for the 5 line rule in the "allowed" section).
 
* Any code you did not write yourself (except for the 5 line rule in the "allowed" section).
* Any Classes (other than Random) that create their own instance variables for later use (e.g., learners like kdtree).
+
 
 +
==Legacy==
 +
 
 +
*[[MC3-Project-2-Legacy-trader]]
 +
*[[MC3-Project-2-Legacy]]
 +
*[[MC3-Project-3-Legacy-Q]]
 +
*[[MC3-Project-3-Legacy]]

Latest revision as of 18:04, 19 May 2017

DRAFT

This assignment is under revision. This notice will be removed once it is final.

Updates / FAQs

  • 2017-04-02
    • Clarified instructions regarding "best possible" to use your own market simulator with adjusted closing prices.
  • 2017-03-16
    • Switch from IBM to AAPL. Position sizes changed. In sample and out of sample dates changed.
    • Added requirement for "best possible strategy".
    • Added requirement that indicators be standardized.
    • Changed from 10 day to 21 day holding. Chart requirements relaxed to just require a vertical line upon entry (no black vertical line on exit).
    • Added requirement for data visualization.
  • Q: In a previous project there was a constraint of holding a single position until exit. Does that apply to this project? A: Yes, hold one position until exit.
  • Q: Is that 21 calendar days, or 21 trading days (i.e., days when SPY was traded)? A: Always use trading days.
  • Q: Are there constraints for Python modules allowed for this project? Can we experiment with modules for optimization or technical analysis and cite or are we expected to write everything from scratch for this project as well? A: The constraints are the same as for the first learning project. You've already written the learners you need.
  • Q: I want to read some other values from the data besides just adjusted close, how can I do that? A: Please modify an old version of util.py to do that, include that new util.py with your submission.
  • Q: Are we required to trade in only 200 share blocks? (and have no more than 200 shares long or short at a time as in some of the previous assignments) A: (update). You can trade up to 400 shares at a time as long as you maintain the requirement of 200, 0 or -200 shares. This will enable comparison between results more easily.
  • Q: Are we limited to leverage of 2.0 on the portfolio? A: There is no limit on leverage.
  • Q: Are we only allowed one position at a time? A: You can be in one of three states: -200 shares, +200 shares, 0 shares.

Overview

In this project you will develop trading strategies using Technical Analysis, and test them using your market simulator. You will then utilize your Random Tree learner to train and test a learning trading algorithm.

In this project we shift from an auto graded format to a report format. For this project your grade will be based on the PDF report you submit, not your code. However, you will also submit your code that will be checked visually to ensure it appropriately matches the report you submit.

Data Details, Dates and Rules

Use the following parameters for Parts 2, 3 and 4:

  • Use only the data provided for this course. You are not allowed to import external data.
  • Trade only the symbol AAPL (however, you may, if you like, use data from other symbols to inform your strategy).
  • The in sample/training period is January 1, 2008 to December 31, 2009.
  • The out of sample/testing period is January 1, 2010 to December 31, 2011.
  • Starting cash is $100,000.
  • Allowable positions are: 200 shares long, 200 shares short, 0 shares.
  • Benchmark: The performance of a portfolio starting with $100,000 cash, investing in 200 shares of AAPL and holding that position.
  • There is no limit on leverage.
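
The benchmark defined above can be sketched as follows. This is a minimal illustration, not required code: the function names are hypothetical, prices are passed as a plain list of adjusted closes, and it assumes the 200 shares are bought at the first day's price with no commission or market impact (none are specified here).

```python
def benchmark_values(prices, start_cash=100000.0, shares=200):
    """Daily portfolio value for buying `shares` on day 0 and holding."""
    # Cash left over after the initial purchase (assumes no trading costs).
    cash = start_cash - shares * prices[0]
    return [cash + shares * p for p in prices]


def normalize(values):
    """Scale a value series so that it starts at 1.0."""
    first = float(values[0])
    return [v / first for v in values]


# Made-up prices, for illustration only.
example_prices = [100.0, 110.0, 90.0]
example_values = benchmark_values(example_prices)
example_normalized = normalize(example_values)
```

The normalized series is what the charts in the later parts call for ("normalized to 1.0 at the start").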

Part 1: Technical Indicators (20%)

Develop and describe at least 3 and at most 5 technical indicators. You may find our lecture on time series processing to be helpful. For each indicator you should create a single chart that shows the price history of the stock during the in-sample period, "helper data" and the value of the indicator itself. As an example, if you were using price/SMA as an indicator you would want to create a chart with 3 lines: Price, SMA, Price/SMA. In order to facilitate visualization of the indicator you can normalize the data to 1.0 at the start of the date range (i.e. divide price[t] by price[0]).
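
As a rough sketch of the price/SMA example, the three lines of such a chart could be computed like this. The helper name `sma` and the window length are assumptions for illustration; your own indicator code may look quite different.

```python
import numpy as np


def sma(prices, window):
    """Simple moving average; the first window-1 entries are NaN."""
    prices = np.asarray(prices, dtype=float)
    out = np.full(prices.shape, np.nan)
    # Cumulative sums give each window total as a difference of two entries.
    csum = np.cumsum(np.insert(prices, 0, 0.0))
    out[window - 1:] = (csum[window:] - csum[:-window]) / window
    return out


# Made-up prices and a short window, for illustration only.
prices = np.array([10.0, 11.0, 12.0, 11.0, 13.0])
price_line = prices / prices[0]        # price normalized to 1.0 at the start
sma_line = sma(prices, 2) / prices[0]  # "helper data" on the same scale
ratio_line = prices / sma(prices, 2)   # the indicator itself: price/SMA
```

Dividing both the price and the SMA by `prices[0]` keeps them on a common scale, so the chart shows where price sits relative to its average.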

You should "standardize" or "normalize" your indicators so that they have zero mean and standard deviation 1.0. One way to do this is the standard score transformation described here: https://en.wikipedia.org/wiki/Standard_score. This transformation will help ensure that all of your indicators are considered with equal importance by your learner.
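
A minimal sketch of the standard score transformation follows. The function name is hypothetical, and the option to pass in a previously computed mean and standard deviation is a design suggestion rather than a requirement: it lets you reuse the in-sample statistics later without looking at out-of-sample data.

```python
import numpy as np


def standardize(values, mean=None, std=None):
    """Return (z_scores, mean, std), ignoring NaNs when estimating mean/std."""
    values = np.asarray(values, dtype=float)
    if mean is None:
        mean = np.nanmean(values)
    if std is None:
        std = np.nanstd(values)
    return (values - mean) / std, mean, std


# Made-up indicator values, for illustration only.
z, mu, sigma = standardize([1.0, 2.0, 3.0, 4.0, 5.0])
```

Leading NaN values (for example from a rolling window) stay NaN in the result, so they still need to be dropped or filled before training.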

Your report description of each indicator should enable someone to reproduce it just by reading the description. We want a written description here, not code; however, it is OK to augment your written description with a pseudocode figure.

At least one of the indicators you use should be completely different from the ones presented in our lectures (i.e., something other than SMA, Bollinger Bands, or RSI).

Deliverables:

  • Descriptive text (2 to 3 pages with figures).
  • 3 to 5 charts (one for each indicator)
  • Code: indicators.py

Part 2: Best Possible Strategy (5%)

Assume that you can see the future, but that you are constrained by the portfolio size and order limits as specified above. Create a set of trades that represents the best a strategy could possibly do during the in sample period. The holding time requirements described in the next sections do not apply to this exercise. The reason we're having you do this is so that you will have an idea of an upper bound on performance.

The intent is for you to use adjusted close prices with the market simulator that you wrote earlier in the course.
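
One way to think about the "see the future" strategy: on each day, hold +200 shares if the next day's price is higher and -200 if it is lower. The sketch below illustrates that idea under stated assumptions: the function names are hypothetical, prices are a plain list of adjusted closes, trades fill at the same day's price, and there are no transaction costs. You would still convert the trades into an orders file for your own market simulator.

```python
def best_possible_positions(prices, max_shares=200):
    """Target position for each day, chosen with knowledge of the next day."""
    positions = []
    for today, tomorrow in zip(prices[:-1], prices[1:]):
        if tomorrow > today:
            positions.append(max_shares)    # price rises: be long
        elif tomorrow < today:
            positions.append(-max_shares)   # price falls: be short
        else:
            positions.append(0)             # no change: stay flat
    positions.append(0)                     # close out on the last day
    return positions


def positions_to_trades(positions):
    """Shares to buy (+) or sell (-) each day to reach the target position."""
    trades = []
    held = 0
    for target in positions:
        trades.append(target - held)
        held = target
    return trades


# Made-up prices, for illustration only.
example_positions = best_possible_positions([10.0, 12.0, 11.0, 11.0, 13.0])
example_trades = positions_to_trades(example_positions)
```

Note that flipping from +200 to -200 produces a single 400 share trade, which matches the FAQ answer above about trading up to 400 shares at a time.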

Provide a chart that reports:

  • Benchmark (see definition above) normalized to 1.0 at the start: Black line
  • Value of the best possible portfolio (normalized to 1.0 at the start): Blue line

You should also report in text:

  • Cumulative return of the benchmark and portfolio
  • Stdev of daily returns of benchmark and portfolio
  • Mean of daily returns of benchmark and portfolio
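
The three statistics listed above can be computed from a daily portfolio value series along these lines. This is a sketch with hypothetical names; it uses the sample standard deviation (dividing by n-1), which is an assumption, so state in your report which convention you used.

```python
import math


def daily_returns(values):
    """Day-over-day fractional change of a value series."""
    return [float(values[i]) / values[i - 1] - 1.0
            for i in range(1, len(values))]


def portfolio_stats(values):
    """Cumulative return, mean and stdev of daily returns."""
    rets = daily_returns(values)
    mean = sum(rets) / len(rets)
    # Sample variance (n-1 in the denominator); an assumed convention.
    var = sum((r - mean) ** 2 for r in rets) / (len(rets) - 1)
    return {
        "cum_return": float(values[-1]) / values[0] - 1.0,
        "mean_daily_return": mean,
        "std_daily_return": math.sqrt(var),
    }


# Made-up portfolio values, for illustration only.
example_stats = portfolio_stats([100.0, 110.0, 99.0])
```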

Part 3: Manual Rule-Based Trader (20%)

Devise a set of rules using the indicators you created in Part 1 above. Your rules should be designed to trigger a "long" or "short" entry for a 21 trading day hold. In other words, once an entry is initiated, you must remain in the position for 21 trading days. In your report you must describe your trading rules so that another person could implement them based only on your description. We want a written description here, not code; however, it is OK to augment your written description with a pseudocode figure.

You should tweak your rules as best you can to get the best performance possible during the in sample period (do not peek at out of sample performance). Use your rule-based strategy to generate an orders file over the in sample period, then run that file through your market simulator to create a chart that includes the following components over the in sample period:

  • Benchmark (see definition above) normalized to 1.0 at the start: Black line
  • Value of the rule-based portfolio (normalized to 1.0 at the start): Blue line
  • Vertical green lines indicating LONG entry points.
  • Vertical red lines indicating SHORT entry points.

Note that each red or green vertical line should be at least 21 days from the preceding line. We will check for that. We expect that your rule-based strategy should outperform the benchmark over the in sample period.
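
The 21 trading day hold can be enforced with a small filter over the daily signals. The sketch below is illustrative only: the function name is hypothetical, signals are -1, 0 or +1 per trading day, and it assumes a new entry is allowed on the same day the previous position exits.

```python
def apply_holding_period(signals, hold_days=21):
    """Keep only the entries that respect the holding period.

    Returns a list of (day_index, signal) pairs. Signals that arrive
    while a position is still being held are ignored.
    """
    entries = []
    next_free_day = 0
    for day, signal in enumerate(signals):
        if signal != 0 and day >= next_free_day:
            entries.append((day, signal))
            next_free_day = day + hold_days
    return entries


# Made-up signals, for illustration only.
example_signals = [0] * 50
example_signals[1] = 1     # LONG entry
example_signals[2] = -1    # ignored: still holding the LONG
example_signals[22] = -1   # SHORT entry, exactly 21 days after day 1
example_signals[30] = 1    # ignored: still holding the SHORT
example_entries = apply_holding_period(example_signals)
```

Each entry index is where a vertical line would go on the chart, and consecutive entries are always at least 21 trading days apart.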

Deliverables:

  • Descriptive text (1 or 2 pages with chart) that provides a compelling justification for the rule-based system developed.
  • Text must describe rule based system in sufficient detail that another person could implement it.
  • 1 chart.
  • Code: rule_based.py (generates an orders file)

Part 4: ML Trader (30%)

Convert your decision tree regression learner into a classification learner. The classifications should be:

  • +1: LONG
  • 0: DO NOTHING
  • -1: SHORT
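
One common way to make this conversion is to change how a leaf summarizes its training samples: take the most common class label rather than an average. The sketch below shows only that step and assumes your regression tree currently stores a mean at each leaf; `leaf_value` is a hypothetical helper, not part of any provided code.

```python
from collections import Counter


def leaf_value(labels):
    """Most common class label among the samples in a leaf.

    Ties are broken arbitrarily in this sketch.
    """
    return Counter(labels).most_common(1)[0][0]


# Made-up leaf contents, for illustration only.
example_leaf = leaf_value([1, 1, 0, -1, 1])
```

The same voting idea applies if you combine several trees with bagging: take the most common prediction across bags instead of the mean.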

The X data for each sample (day) are simply the values of your indicators for the stock -- you should have 3 to 5 of them. The Y data (or classifications) will be based on 21 day return. You should classify the example as a +1 or "LONG" if the 21 day return exceeds a certain value, let's call it YBUY for the moment. You should classify the example as a -1 or "SHORT" if the 21 day return is below a certain value we'll call YSELL. In all other cases the sample should be classified as a 0 or "DO NOTHING." Note that it is very important that you train your learner with these classification values (not the 21 day returns). We will check for this.

Note that your X values are calculated each day from the current day's (and earlier) data, but the Y value (classification) is calculated using data from the future. You may tweak various parameters of your learner to maximize return (more on that below). Train and test your learning strategy over the in sample period. Whenever a LONG or SHORT is encountered, you must enter the corresponding position and hold it for 21 trading days. That means, for instance, that if you encounter a LONG on day 1, then a SHORT on day 2, you must keep holding the LONG position until the 21 days expire, even though you received this conflicting information. The reason for this is that we're trying to provide a way to directly compare the manual strategy versus the ML strategy.

Important note: You must set the leaf_size parameter of your decision tree learner to 5 or larger. This requirement is intended to avoid a degenerate overfit solution to this problem.

Use your ML-based strategy to generate an orders file over the in sample period, then run that file through your market simulator to create a chart that includes the following components over the in sample period:

  • Benchmark (see definition above) normalized to 1.0 at the start: Black line
  • Value of the rule-based portfolio (normalized to 1.0 at the start): Blue line.
  • Value of the ML-based portfolio (normalized to 1.0 at the start): Green line.
  • Vertical green lines indicating LONG entry points.
  • Vertical red lines indicating SHORT entry points.

We expect that the ML-based strategy will outperform the manual strategy; however, it is possible that it does not. If your manual strategy does better, you should try to explain why in your report.

You should tweak the parameters of your learner to maximize performance during the in sample period. Here is a partial list of things you can tweak:

  • Adjust YSELL and YBUY.
  • Adjust leaf_size.
  • Utilize bagging and adjust the number of bags.

Deliverables:

  • Descriptive text (1 or 2 pages with chart) that describes your ML approach.
  • Text must describe ML based system in sufficient detail that another person could implement it.
  • 1 chart
  • Code: ML_based.py (generates an orders file)
  • Additional code files as necessary to support ML_based.py (e.g. RTLearner.py and so on).

Part 5: Visualization of data (15%)

Choose two of your indicators, call them X1 and X2. Create 3 scatter plots where each point in each plot is located according to the indicator values on that day at X1, X2. Color each dot according to the following scheme:

  • Green if the factors on that day satisfy "LONG" conditions.
  • Red if the factors satisfy "SHORT" conditions.
  • Black if neither "LONG" nor "SHORT" is satisfied.

The axis limits for each scatter plot should be set to -1.5 to +1.5 in both dimensions. This will help us check that you have standardized your indicators.

The 3 plots should be based on the in sample period (about 500 points):

  1. Your rule-based strategy.
  2. The training data for your ML strategy.
  3. Response of your learner when queried with the same data (after training).
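
A possible shape for the plotting code is sketched below. The function names, marker size, axis labels and output file name are all assumptions; matplotlib is imported inside the plotting function with the non-interactive "Agg" backend so the sketch can also run on a machine without a display.

```python
def label_colors(labels):
    """Map classifications (+1, 0, -1) to the required dot colors."""
    palette = {1: "green", -1: "red", 0: "black"}
    return [palette[int(label)] for label in labels]


def plot_scatter(x1, x2, labels, title, filename):
    """Save one X1-vs-X2 scatter plot with axes fixed to [-1.5, 1.5]."""
    import matplotlib
    matplotlib.use("Agg")  # render to a file, no display needed
    import matplotlib.pyplot as plt

    fig, ax = plt.subplots()
    ax.scatter(x1, x2, c=label_colors(labels), s=10)
    ax.set_xlim(-1.5, 1.5)
    ax.set_ylim(-1.5, 1.5)
    ax.set_xlabel("X1 (standardized)")
    ax.set_ylabel("X2 (standardized)")
    ax.set_title(title)
    fig.savefig(filename)
    plt.close(fig)


# Made-up labels, for illustration only.
example_colors = label_colors([1, 0, -1])
```

Calling `plot_scatter` three times with the same `x1` and `x2` but different labels (rule-based output, training Y, learner response) keeps the X data identical across the three charts.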

Part 6: Comparative Analysis (10%)

Evaluate the performance of both of your strategies in the out of sample period. Note that you should not train or tweak your learner on this data. You should use the classification learned using the training data only. Create a chart that shows, out of sample:

  • Benchmark (see definition above) normalized to 1.0 at the start: Black line
  • Performance of manual strategy: Blue line
  • Performance of the ML strategy: Green line
  • All three should be normalized to 1.0 at the start.

Create a table that summarizes the performance of the stock, the manual strategy and the ML strategy for both in sample and out of sample periods. Utilize your experience in this class to determine which factors are best to use for comparing these strategies. If performance out of sample is worse than in sample, do your best to explain why. Also if the manual and ML strategies perform substantially differently, explain why. Is one method or the other more or less susceptible to the same underlying flaw? Why or why not?

Deliverables:

  • Descriptive text (1 or 2 pages including figures)
  • 1 chart

Hints

Overall, I recommend the following steps in the creation of your strategies:

  • Indicator design hints:
    • For your X values: Identify and implement at least 3 technical features that you believe may be predictive of future return.
  • Rule based design:
    • Use a cascade of if statements conditioned on the indicators to identify whether a LONG condition is met.
    • Use a cascade of if statements conditioned on the indicators to identify whether a SHORT condition is met.
    • The conditions for LONG and SHORT should be mutually exclusive.
    • If neither LONG nor SHORT is triggered, the result should be DO NOTHING.
    • For debugging purposes, you may find it helpful to plot the value of the rule-based output (-1, 0, 1) versus the stock price.
  • Train a classification learner on in sample training data:
    • For your Y values: Use future 21 day return (not future price). Then classify that return as LONG, SHORT or DO NOTHING. You're trying to predict a relative change that you can use to invest with.
    • For debugging purposes, you may find it helpful to plot the value of the training classification data (-1, 0, 1) versus the stock price in one color.
    • For debugging purposes, you may find it helpful to plot the value of the training classification output (-1, 0, 1) versus the stock price in another color. Ideally, these two lines should be very similar.
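
To make the rule-based hints above concrete, here is a deliberately simple sketch of a cascade of if statements. The choice of indicators and every threshold are made up for illustration and are not a recommended strategy; the inputs are assumed to be standardized indicator values for a single day.

```python
def rule_signal(bb_value, momentum, long_bb=-1.0, short_bb=1.0):
    """Return +1 (LONG), -1 (SHORT) or 0 (DO NOTHING) for one day.

    The thresholds are hypothetical defaults. Because long_bb < short_bb,
    the LONG and SHORT conditions can never both be true.
    """
    if bb_value < long_bb and momentum > 0.0:
        return 1    # price is low relative to its band and turning up
    if bb_value > short_bb and momentum < 0.0:
        return -1   # price is high relative to its band and turning down
    return 0


# Made-up indicator values, for illustration only.
example_outputs = [rule_signal(-1.5, 0.2), rule_signal(1.5, -0.2), rule_signal(0.0, 0.0)]
```

Plotting the daily output of such a function against the stock price is the debugging aid suggested in the hints.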

Choosing Technical Features -- Your X Values

You should have already successfully coded the Bollinger Band feature:

bb_value[t] = (price[t] - SMA[t])/(stdev[t])

Two other good features worth considering are momentum and volatility.

momentum[t] = (price[t]/price[t-N]) - 1

Volatility is just the stdev of daily returns.

You still need to standardize the resulting values.
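
The three features above might be computed together as in the following sketch. The function name and the window length are assumptions, the same window is used as N for momentum purely for simplicity, and the outputs are raw (not yet standardized) values.

```python
import numpy as np


def rolling_features(prices, window=20):
    """Return (bb_value, momentum, volatility) arrays, NaN where undefined."""
    prices = np.asarray(prices, dtype=float)
    n = len(prices)
    bb = np.full(n, np.nan)
    momentum = np.full(n, np.nan)
    volatility = np.full(n, np.nan)
    daily_ret = np.full(n, np.nan)
    daily_ret[1:] = prices[1:] / prices[:-1] - 1.0
    for t in range(window, n):
        recent = prices[t - window + 1:t + 1]       # last `window` prices
        # A perfectly flat window has zero stdev and yields NaN or inf here.
        bb[t] = (prices[t] - recent.mean()) / recent.std()
        momentum[t] = prices[t] / prices[t - window] - 1.0
        volatility[t] = daily_ret[t - window + 1:t + 1].std()
    return bb, momentum, volatility


# Made-up prices and a short window, for illustration only.
bb, mom, vol = rolling_features([1.0, 2.0, 3.0, 4.0, 5.0], window=2)
```

Only data up to and including day t is used for the value at day t, which keeps the X values free of future information.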

Choosing Y

Your code should classify based on the 21 day change in price. You need to build a new Y that reflects the 21 day change and aligns with the current date. Here's pseudocode for the calculation of Y:

ret = (price[t+21]/price[t]) - 1.0
if ret > YBUY:
    Y[t] = +1 # LONG
elif ret < YSELL:
    Y[t] = -1 # SHORT
else:
    Y[t] = 0

If you select Y in this manner and use it for training, your learner will classify 21 day returns.
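
A vectorized version of the same calculation is sketched below. The function name is hypothetical, and labelling the final days (which have no price 21 days ahead) as 0 is an assumption; you may prefer to drop those days from the training set instead.

```python
import numpy as np


def make_labels(prices, ybuy, ysell, horizon=21):
    """Classify each day as +1, 0 or -1 from its future `horizon`-day return."""
    prices = np.asarray(prices, dtype=float)
    labels = np.zeros(len(prices), dtype=int)
    # ret[t] = price[t + horizon] / price[t] - 1 for every day with a future price.
    future_ret = prices[horizon:] / prices[:-horizon] - 1.0
    labels[:-horizon][future_ret > ybuy] = 1     # LONG
    labels[:-horizon][future_ret < ysell] = -1   # SHORT
    return labels


# Made-up prices and a 2-day horizon, for illustration only.
example_labels = make_labels([100.0, 100.0, 110.0, 90.0, 100.0],
                             ybuy=0.05, ysell=-0.05, horizon=2)
```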

Template and Data

There is no github template for this project. You should create a directory for your code in ml4t/mc3-p3 and make a copy of util.py there. You should also copy into that directory your learner code and your market simulator code. You will have access to the data in the ML4T/Data directory but you should use ONLY the code in util.py to read it.

Contents of Report

  • Your report should be no more than 3000 words. Your report should contain no more than 14 charts. Penalties will apply if you violate these constraints.
  • Include charts and text as identified in the sections above.

Expectations

  • In-sample AAPL backtests should perform very well -- The ML version should do better than the manual version.
  • Out-of-sample AAPL backtests should... (you should be able to complete this sentence).

What to turn in

Turn your project in via t-square.

  • Your report as report.pdf
  • All of your code, as necessary to run as .py files.
  • Document how to run your code in readme.txt.
  • No zip files please.

Rubric

Start with 100%, deductions as follows:

Indicators (up to 20% potential deductions):

  • Is each indicator described in sufficient detail that someone else could reproduce it? (-5% for each if not)
  • Is there a chart for each indicator that properly illustrates its operation? (-5% for each if not)
  • Is at least one indicator different from those provided by the instructor's code (i.e., another indicator that is not SMA, Bollinger Bands or RSI) (-10% if not)
  • Does the submitted code indicators.py properly reflect the indicators provided in the report (-20% if not)

Best possible (up to 5% potential deductions):

  • Is the chart correct (dates and equity curve) (-5% if not)
  • Is the reported performance correct within 5% (-1% for each item if not)

Manual rule-based trader (up to 20% deductions):

  • Is the trading strategy described with clarity and in sufficient detail that someone else could reproduce it? (-10%)
  • Does the provided chart include:
    • Historic value of benchmark normalized to 1.0 with black line (-5% if not)
    • Historic value of portfolio normalized to 1.0 with blue line (-10% if not)
    • Are the appropriate date ranges covered? (-5% if not)
    • Are vertical lines included to indicate entries (-10% if not)
  • Does the submitted code rule_based.py properly reflect the strategy provided in the report? (-20% if not)
  • Does the manual trading system provide higher cumulative return than the benchmark over the in-sample time period? (-5% if not)

ML-based trader (up to 30% deductions):

  • Is the ML strategy described with clarity and in sufficient detail that someone else could reproduce it? (-10%)
  • Are modifications/tweaks to the basic decision tree learner fully described (-10%)
  • Does the methodology utilize a classification-based learner? (-30%)
  • Does the provided chart include:
    • Historic value of benchmark normalized to 1.0 with black line (-5% if not)
    • Historic value of rule-based portfolio normalized to 1.0 with blue line (-5% if not)
    • Historic value of ML-based portfolio normalized to 1.0 with green line (-10% if not)
    • Are the appropriate date ranges covered? (-5% if not)
    • Are vertical lines included to indicate entry (-10% if not)
  • Does the submitted code ML_based.py properly reflect the strategy provided in the report? (-30% if not)
  • Does the ML trading system provide a cumulative return at least 1.5x that of the benchmark over the in-sample time period? (-5% if not)

Data visualization (up to 15% deductions):

  • Is the X data reported in all three charts the same? (-5% if not)
  • Is the X data standardized? (-5% if not)
  • Is the Y data in the train and query plots similar (-5% if not)

Comparative analysis (up to 10% deductions):

  • Is the appropriate chart provided (-5% for each missing element, up to a maximum of -10%)
  • Is there a table that reports in-sample and out-of-sample data for the baseline (just the stock), rule-based, and ML-based strategies? (-5% for each missing element)
  • Are differences between the in-sample and out-of-sample performances appropriately explained (-5%)

Required, Allowed & Prohibited

Required:

  • Your project must be coded in Python 2.7.x.
  • Your code must run on one of the university-provided computers (e.g. buffet02.cc.gatech.edu).
  • Use only util.py to read data. If you want to read items other than adjusted close, modify util.py to do it, and submit your new version with your code.

Allowed:

  • You can develop your code on your personal machine, but it must also run successfully on one of the university provided machines or virtual images.
  • Your code may use standard Python libraries.
  • You may use the NumPy, SciPy, matplotlib and Pandas libraries. Be sure you are using the correct versions.
  • You may reuse sections of code (up to 5 lines) that you collected from other students or the internet.
  • Code provided by the instructor, or allowed by the instructor to be shared.
  • A herring.

Prohibited:

  • Any other method of reading data besides util.py
  • Any libraries not listed in the "allowed" section above.
  • Any code you did not write yourself (except for the 5 line rule in the "allowed" section).

Legacy