MC3-Project-2
Contents
- 1 Updates / FAQs
- 2 Overview
- 3 Data Details, Dates and Rules
- 4 Part 1: Technical Indicators (20%)
- 5 Part 2: Manual Rule-Based Trader (30%)
- 6 Part 3: ML Trader (30%)
- 7 Part 4: Comparative Analysis (20%)
- 8 Legacy
- 9 Detailed steps
- 10 Summary of Plots To Create
- 11 Template and Data
- 12 Choosing Technical Features -- Your X Values
- 13 Choosing Y
- 14 Contents of Report
- 15 Expectations
- 16 What to turn in
- 17 Extra credit up to 3%
- 18 Rubric
- 19 Required, Allowed & Prohibited
- 20 Legacy
Updates / FAQs
- Q: In a previous project there was a constraint of holding a single position until exit. Does that apply to this project? A: Yes, hold one position until exit.
- Q: Is that 5 calendar days, or 5 trading days (i.e., days when SPY was traded)? A: Always use trading days.
- Q: Are there constraints on the Python modules allowed for this project? Can we experiment with modules for optimization or technical analysis and cite them, or are we expected to write everything from scratch for this project as well? A: You can use scikit modules as long as you cite them. You've already written the learners you need, though.
- Q: Can we change our policy to work better for IBM vs the sine data? A: No, you must use the same indicators, policy, etc. for both. I suggest you optimize first for IBM, then go back to the sine data because almost anything should work with the sine data.
2016-4-12
- Q: I want to read some other values from the data besides just adjusted close, how can I do that? A: Please modify an old version of util.py to do that, include that new util.py with your submission.
- Q: Are we required to trade in only 100 share blocks? (and have no more than 100 shares long or short at a time as in some of the previous assignments) A: Yes. This will enable comparison between results more easily.
- Q: Are we limited to leverage of 2.0 on the portfolio? A: There is no limit on leverage.
- Q: Are we only allowed one position at a time? A: You can be in one of three states: -100 shares, +100 shares, 0 shares.
- Q: Are we supposed to build one policy that we use on both SINE and IBM? A: Yes, all parameters for your learner and policy should be the same. The difference is the DATA.
Overview
In this project you will develop trading strategies using Technical Analysis, and test them using your market simulator. You will then utilize your Random Tree learner to train and test a learning trading algorithm.
- Part 1: Develop and describe a set of at least 3 technical indicators. At least one of these indicators must be substantially different from the indicators whose code was presented in class.
- Part 2: Devise and test a rule-based trading strategy using your indicators from Part 1. Test its performance in sample using your market simulator.
- Part 3: Use your decision tree learner to create a classifier that decides when to trade. Test its performance in sample using your market simulator.
- Part 4: Comparative analysis.
In this project we shift from an auto-graded format to a report format. For this project your grade will be based on the PDF report you submit, not your code. However, you will also submit your code, which will be checked visually to ensure it appropriately matches the report you submit.
Data Details, Dates and Rules
Use the following parameters for Part 2, 3 and 4:
- Use only the data provided for this course. You are not allowed to import external data.
- Trade only the symbol IBM (however, you may, if you like, use data from other symbols to inform your strategy).
- The in sample/training period is January 1, 2006 to December 31, 2009.
- The out of sample/testing period is January 1, 2010 to December 31, 2010.
- Starting cash is $100,000.
- Allowable positions are: 500 shares long, 500 shares short, 0 shares.
- There is no limit on leverage.
Part 1: Technical Indicators (20%)
Develop and describe at least 3 and at most 5 technical indicators. You may find our lecture on time series processing to be helpful. For each indicator you should create a single chart that shows the price history of the stock during the in-sample period and the value of the indicator. Note that your chart should help to convey an understanding of how the indicator works; it is not strictly necessary that the chart show the literal value of the indicator.
Your report description of each indicator should enable someone to reproduce it just by reading the description. We want a written description here, not code; however, it is OK to augment your written description with a pseudocode figure.
At least one of the indicators you use should be completely different from the ones presented in our lectures.
Deliverables:
- Descriptive text (1 to 3 pages with figures).
- 3 to 5 charts (one for each indicator)
- Code: indicators.py
Part 2: Manual Rule-Based Trader (30%)
Devise a set of rules using the indicators you created in Part 1 above. Your rules should be designed to trigger a "long" or "short" entry for a 10 trading day hold. In other words, once an entry is initiated, you must remain in the position for 10 trading days. In your report you must describe your trading rules so that another person could implement them based only on your description. We want a written description here, not code; however, it is OK to augment your written description with a pseudocode figure.
You should tweak your rules as best you can to get the best performance possible during the in sample period (do not peek at out of sample performance).
Use your rule-based strategy to generate an orders file over the in sample period, then run that file through your market simulator to create a chart that includes the following components over the in sample period:
- Price of IBM (normalized to 1.0 at the start): Black line
- Value of the rule-based portfolio (normalized to 1.0 at the start): Blue line
- Vertical green lines indicating LONG entry points.
- Vertical red lines indicating SHORT entry points.
- Vertical black lines indicating exits (long or short).
Note that each red or green vertical line should be followed by a black line before another entry occurs. We will check for that. We expect that your rule-based strategy should outperform the stock IBM over the in sample period.
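The 10 trading day hold can be enforced mechanically once your rules have produced a daily signal. The following is a minimal sketch, not a required implementation: the function name entries_with_hold, the signal encoding (+1 long, -1 short, 0 none), and the choices to skip an entry that cannot complete its hold and to allow the next entry only on the day after an exit are all assumptions made for illustration.

```python
def entries_with_hold(signals, hold_days=10):
    """Convert daily signals into (entry_index, exit_index, direction) trades.

    signals: list with +1 (long), -1 (short) or 0 (no signal) per trading day.
    Signals that arrive while a position is open are ignored.
    """
    trades = []
    t = 0
    n = len(signals)
    while t < n:
        if signals[t] != 0:
            exit_t = t + hold_days
            if exit_t >= n:
                break  # assumption: skip entries that cannot complete the hold
            trades.append((t, exit_t, signals[t]))
            t = exit_t + 1  # assumption: next entry no earlier than the day after exit
        else:
            t += 1
    return trades


# Hypothetical signal series: the short signal on day 2 is ignored
# because the long position opened on day 1 is still being held.
signals = [0] * 30
signals[1] = 1
signals[2] = -1
signals[12] = -1
print(entries_with_hold(signals))
```

Because indices count rows of the price data, the hold is measured in trading days rather than calendar days. Each trade tuple maps directly to one green or red entry line and one black exit line on the chart.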
Deliverables:
- Descriptive text (1 or 2 pages) that provides a compelling justification for the rule-based system you developed.
- Text must describe rule based system in sufficient detail that another person could implement it.
- 1 chart.
- Code: rule_based.py (generates an orders file)
Part 3: ML Trader (30%)
Convert your decision tree regression learner into a classification learner. The classifications should be:
- +1: BUY
- 0: DO NOTHING
- -1: SELL
The X data for each sample (day) are simply the values of your indicators for the stock -- you should have 3 to 5 of them. The Y data (or classifications) will be based on 10 day return. You should classify the example as a +1 or "BUY" if the 10 day return exceeds a certain value, let's call it YBUY for the moment. You should classify the example as a -1 or "SELL" if the 10 day return is below a certain value we'll call YSELL. In all other cases the sample should be classified as a 0 or "DO NOTHING."
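The labelling rule above can be sketched as follows. This is an illustrative sketch only: the function name label_returns and the threshold values used in the example are hypothetical, and YBUY and YSELL are parameters you are expected to tune yourself.

```python
def label_returns(prices, ybuy, ysell, horizon=10):
    """Classify each day by its future `horizon`-day return.

    Returns +1 (BUY) if the return exceeds ybuy, -1 (SELL) if it is
    below ysell, and 0 (DO NOTHING) otherwise. The last `horizon` days
    have no future price, so they get no label.
    """
    labels = []
    for t in range(len(prices) - horizon):
        ret = prices[t + horizon] / float(prices[t]) - 1.0
        if ret > ybuy:
            labels.append(1)
        elif ret < ysell:
            labels.append(-1)
        else:
            labels.append(0)
    return labels


# Hypothetical prices and thresholds, with a short horizon for readability.
prices = [100.0, 100.0, 100.0, 110.0, 95.0, 100.5]
print(label_returns(prices, ybuy=0.02, ysell=-0.02, horizon=3))
```

Since the final `horizon` days produce no label, the matching rows of X have to be dropped before training so that X and Y stay aligned.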
Note that your X values are calculated each day from the current day's (and earlier) data, but the Y value is calculated using data from the future. You may tweak various parameters of your learner to maximize return (more on that below). Train and test your learning strategy over the in sample period. Whenever a BUY or SELL is encountered, you must enter the corresponding position and hold it for 10 days. That means, for instance, that if you encounter a BUY on day 1, then a SELL on day 2, you must keep holding the stock until the 10 days expire, even though you received this conflicting information. The reason for this is that we're trying to provide a way to directly compare the manual strategy versus the ML strategy.
Use your ML-based strategy to generate an orders file over the in sample period, then run that file through your market simulator to create a chart that includes the following components over the in sample period:
- Price of IBM (normalized to 1.0 at the start): Black line.
- Value of the rule-based portfolio (normalized to 1.0 at the start): Blue line.
- Value of the ML-based portfolio (normalized to 1.0 at the start): Green line.
- Vertical green lines indicating LONG entry points.
- Vertical red lines indicating SHORT entry points.
- Vertical black lines indicating exits (long or short).
Note that each red or green vertical line should be followed by a black line before another entry occurs. We will check for that. We expect that the ML-based strategy will outperform the manual strategy; however, it is possible that it does not. If your manual strategy does better, you should try to explain why in your report.
You should tweak the parameters of your learner to maximize performance during the in sample period. Here is a partial list of things you can tweak:
- Adjust YSELL and YBUY.
- Adjust leaf_size.
- Utilize bagging and adjust the number of bags.
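When bagging is used for classification, averaging the bags' outputs (as a regression bag learner would) can yield values that are not valid class labels, so one common alternative is a majority vote. The sketch below assumes each bag has already produced a list of labels in {-1, 0, +1}; the function name majority_vote and the tie-breaking choice (fall back to 0, DO NOTHING) are assumptions, not part of the assignment.

```python
from collections import Counter


def majority_vote(bag_predictions):
    """Combine per-bag label lists into one label per sample.

    bag_predictions: list of lists, one inner list of labels per bag,
    all of the same length. Ties resolve to 0 (assumption).
    """
    voted = []
    for sample_votes in zip(*bag_predictions):
        counts = Counter(sample_votes)
        top = max(counts.values())
        winners = [label for label, c in counts.items() if c == top]
        voted.append(winners[0] if len(winners) == 1 else 0)
    return voted


# Three hypothetical bags voting on four samples.
bags = [
    [1, -1, 0, 1],
    [1, 0, 0, -1],
    [-1, -1, 1, 0],
]
print(majority_vote(bags))
```

The same idea applies inside a single tree: a classification leaf can store the most common label of its training samples rather than their mean.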
Deliverables:
- Descriptive text (1 or 2 pages) that describes your ML approach.
- Text must describe ML based system in sufficient detail that another person could implement it.
- 1 chart
- Code: ML_based.py (generates an orders file)
- Additional code files as necessary to support ML_based.py (e.g. RTLearner.py and so on).
Part 4: Comparative Analysis (20%)
Evaluate the performance of both of your strategies in the out of sample period. Create a chart that shows, out of sample:
- Performance of the stock: Black line
- Performance of manual strategy: Blue line
- Performance of the ML strategy: Green line
- All three should be normalized to 1.0 at the start.
Create a table that summarizes the performance of the stock, the manual strategy and the ML strategy for both in sample and out of sample periods. Utilize your experience in this class to determine which factors are best to use for comparing these strategies. If performance out of sample is worse than in sample, do your best to explain why. Also if the manual and ML strategies perform substantially differently, explain why. Is one method or the other more or less susceptible to the same underlying flaw? Why or why not?
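One possible set of table columns is cumulative return, mean daily return, standard deviation of daily returns, and Sharpe ratio. The sketch below computes these from a series of daily portfolio values. It is a suggestion rather than the required set of factors, and it assumes a zero risk-free rate, 252 trading days per year, and the sample (n-1) standard deviation. The function name portfolio_stats is hypothetical.

```python
import math


def portfolio_stats(values, trading_days=252):
    """Summarize a list of daily portfolio values (needs at least 3 values)."""
    daily = [values[i] / float(values[i - 1]) - 1.0 for i in range(1, len(values))]
    mean = sum(daily) / len(daily)
    # Sample standard deviation (divide by n - 1): an assumption.
    var = sum((r - mean) ** 2 for r in daily) / (len(daily) - 1)
    std = math.sqrt(var)
    # Annualized Sharpe ratio with a zero risk-free rate: an assumption.
    sharpe = math.sqrt(trading_days) * mean / std if std > 0 else float("nan")
    return {
        "cum_return": values[-1] / float(values[0]) - 1.0,
        "mean_daily_return": mean,
        "std_daily_return": std,
        "sharpe": sharpe,
    }


# Hypothetical portfolio values, normalized to 1.0 at the start.
print(portfolio_stats([1.0, 1.02, 1.01, 1.05]))
```

Running the same function on the stock, the manual strategy and the ML strategy, for both periods, gives the six rows of the table on a consistent basis.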
Legacy
You should train a learner to predict the change in price of a stock over the next five trading days (one week). You will use data from Dec 31 2007 to 2009 to train your prediction model, then you will test it from Dec 31 2009 to 2011.
Now, just predicting the change in price isn't enough, you need to also code a policy that uses the forecaster you built to buy or sell shares. Your policy should buy when it thinks the price will go up, and short when it thinks the price will go down. You can then feed those buy and sell orders into your market simulator to backtest the strategy. For ease of comparison between strategies, please observe these rules:
- Starting cash is $10,000.
- Allowable positions are: 100 shares long, 100 shares short, 0 shares.
- There is no limit on leverage.
Finding features, a learner, and a policy that all work together to provide a reliably winning strategy with live stock data is HARD! It is possible, and people have done it, but we can't reasonably expect you to be successful at it in this short class. Accordingly, we want you to work with some easy data first: we will provide you with sinusoidal historical price data. Once you've got something that works with that, you can try your learner on real stock data.
Detailed steps
Overall, you should follow these steps:
- Train a regression learner (KNN or LinReg, or other of your choice with or without bagging) on data from Dec 31 2007 to Dec 31 2009. This is your in sample training data.
- For your X values: Identify and implement at least 3 technical features that you believe may be predictive of future return. You should implement them so they output values typically ranging from -1.0 to 1.0. This will help avoid the situation where one feature overwhelms the results. See a few formulae below.
- For your Y values: Use future 5 day return (not future price). You're trying to predict a relative change that you can use to invest with.
- Create a plot that illustrates your training Y values in one color, current price in another color and your model's PREDICTED Y in a third color. To help with the visualization, you should adjust your training Y and predicted Y so that they are at the same scale as the current price. With this chart we should be able to see how well your learner performs and that your Y values are shifted back 5 days. You may find it convenient to zoom in on a particular time period so this is evident.
- Create a trading policy based on what your learner predicts for future return. As an example you might choose to buy when the forecaster predicts the price will go up more than 1%, then hold for 5 days.
- Create a plot that illustrates entry and exits as vertical lines on a price chart for the in sample period Dec 31 2007 to Dec 31 2009. Show long entries as green lines, short entries as red lines and exits as black lines. You may find it convenient to zoom in on a particular time period so this is evident.
- Now use your code to generate orders and run those orders through your market simulator. Create a chart of this backtest. It should do VERY well for the in sample period Dec 31 2007 to Dec 31 2009.
- Freeze your model based on the Dec 31 2007 to Dec 31 2009 training data. Now test it out of sample over the period Dec 31 2009 to Dec 31 2011. Create a plot that illustrates entry & exits, generate trades, run through your simulator, chart the backtest.
Perform the above steps first using the data ML4T-220.csv. Once you've validated success (it should work well), repeat using IBM data over the same dates. Remember Dec 31 2007 to Dec 31 2009 is training, Dec 31 2009 to Dec 31 2011 is testing. You should have one set of charts for each symbol.
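For the Training Y/Price/Predicted Y plot, one way to put a return on the same scale as the price is to convert it into the price it implies, price[t] * (1 + Y[t]). This is only one reasonable reading of "same scale", and the function name implied_future_price is hypothetical.

```python
def implied_future_price(prices, returns):
    """Map each return Y[t] onto the price scale as price[t] * (1 + Y[t]).

    `returns` may be shorter than `prices` (the last days have no
    training Y); only the overlapping days are converted.
    """
    return [p * (1.0 + y) for p, y in zip(prices, returns)]


# Hypothetical values: a +50% and a -25% return on prices 10 and 20.
print(implied_future_price([10.0, 20.0], [0.5, -0.25]))
```

With the training Y, this curve is simply the price curve shifted back by the forecast horizon, which makes the 5-day shift easy to see. The predicted Y curve can then be compared against it directly.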
Summary of Plots To Create
- Sine data in-sample Training Y/Price/Predicted Y: Create a plot that illustrates your training Y values in one color, current price in another color and your model's PREDICTED Y in a third color. To help with the visualization, you should adjust your training Y and predicted Y so that they are at the same scale as the current price.
- Sine data in-sample Entries/Exits: Create a plot that illustrates entry and exits as vertical lines on a price chart for the in sample period. Show long entries as green lines, short entries as red lines and exits as black lines. You may find it convenient to zoom in on a particular time period so this is evident.
- Sine data in-sample backtest
- Sine data out-of-sample Entries/Exits: Freeze your model based on the in-sample data. Now test it for the out-of-sample period. Plot the entries and exits, and generate trades.
- Sine data out-of-sample backtest.
- IBM data in-sample Entries/Exits: Create a plot that illustrates entry and exits as vertical lines on a price chart for the in sample period 2008-2009. Show long entries as green lines, short entries as red lines and exits as black lines. You may find it convenient to zoom in on a particular time period so this is evident.
- IBM data in-sample backtest
- IBM data out-of-sample Entries/Exits
- IBM data out-of-sample backtest
Template and Data
You should create a directory for your code in ml4t/mc3-p2. You will have access to the data in the ML4T/Data directory but you should use ONLY the code in util.py to read it. In particular, you will use the files named ML4T-220.csv and IBM.csv.
Choosing Technical Features -- Your X Values
You should have already successfully coded the Bollinger Band feature. Here's a suggestion of how to normalize that feature so that it will typically provide values between -1.0 and 1.0:
bb_value[t] = (price[t] - SMA[t])/(2 * stdev[t])
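The formula above can be sketched in plain Python as follows. The 20-day window and the use of the sample (n-1) standard deviation are assumptions, since the formula does not fix either, and the function name bb_values is hypothetical.

```python
import math


def bb_values(prices, window=20):
    """bb_value[t] = (price[t] - SMA[t]) / (2 * stdev[t]) over a rolling window.

    Days without a full window of history are returned as None.
    Requires window >= 2.
    """
    out = [None] * len(prices)
    for t in range(window - 1, len(prices)):
        chunk = prices[t - window + 1:t + 1]
        sma = sum(chunk) / float(window)
        var = sum((p - sma) ** 2 for p in chunk) / (window - 1)
        std = math.sqrt(var)
        # A flat window has zero stdev; report 0.0 rather than divide by zero.
        out[t] = (prices[t] - sma) / (2.0 * std) if std > 0 else 0.0
    return out


# Tiny hypothetical series with a 3-day window.
print(bb_values([1.0, 2.0, 3.0], window=3))
```

A value near +1.0 means the price sits at the upper band (two standard deviations above the moving average), and a value near -1.0 means it sits at the lower band.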
Two other good features worth considering are momentum and volatility.
momentum[t] = (price[t]/price[t-N]) - 1
Volatility is just the stdev of daily returns.
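Both features can be sketched in a few lines. The momentum function follows the formula above directly. For volatility, computing the standard deviation over a rolling window of recent daily returns is an assumption, since the text does not say which period to use. The function names and window lengths are hypothetical.

```python
import math


def momentum(prices, n):
    """momentum[t] = price[t] / price[t - n] - 1; None for the first n days."""
    tail = [prices[t] / float(prices[t - n]) - 1.0 for t in range(n, len(prices))]
    return [None] * n + tail


def volatility(prices, window):
    """Sample stdev of the `window` most recent daily returns ending at day t.

    Requires window >= 2. Days without enough history are None.
    """
    daily = [prices[t] / float(prices[t - 1]) - 1.0 for t in range(1, len(prices))]
    out = [None] * len(prices)
    for t in range(window, len(prices)):
        chunk = daily[t - window:t]  # returns for days t-window+1 .. t
        mean = sum(chunk) / window
        var = sum((r - mean) ** 2 for r in chunk) / (window - 1)
        out[t] = math.sqrt(var)
    return out


# Hypothetical prices: two 10% gains followed by a flat day.
prices = [100.0, 110.0, 121.0, 121.0]
print(momentum(prices, 2))
print(volatility(prices, 2))
```

Neither value is bounded by -1.0 and 1.0 by construction, so you may still want to rescale them (for example by subtracting the mean and dividing by the standard deviation over the training period) to keep one feature from dominating.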
Choosing Y
Your code should predict the 5 day change in price. You need to build a new Y that reflects the 5 day change and aligns with the current date. Here's pseudocode for the calculation of Y:
Y[t] = (price[t+5]/price[t]) - 1.0
If you select Y in this manner and use it for training, your learner will predict 5 day returns.
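The pseudocode for Y translates almost directly into Python. This is a minimal sketch with a hypothetical function name, future_returns. The one detail the pseudocode leaves implicit is that the last 5 days have no future price and so produce no Y.

```python
def future_returns(prices, horizon=5):
    """Y[t] = price[t + horizon] / price[t] - 1.0 for every day that has a future price."""
    return [
        prices[t + horizon] / float(prices[t]) - 1.0
        for t in range(len(prices) - horizon)
    ]


# Hypothetical prices with a 2-day horizon for readability.
print(future_returns([10.0, 20.0, 15.0, 30.0], horizon=2))
```

The result is shorter than the price series by `horizon` entries, so the corresponding trailing rows of X should be removed before training.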
Contents of Report
- Your report should be no more than 2500 words. Your report should contain no more than 12 charts. Penalties will apply if you violate these constraints.
- Include the charts listed in the overview section above.
- Describe each of the indicators you have selected in enough detail that someone else could reproduce them in code.
- Describe your trading policy clearly.
- If you used any external code or ideas be sure to cite them in your code and in the report.
- Discussion of results. Did it work well? Why? What would you do differently?
Expectations
- In-sample sine and in-sample IBM backtests should both perform very well -- better than the manual policy you created for the last assignment.
- Out-of-sample sine backtest should perform nearly identically to the in-sample test.
- Out-of-sample IBM backtest should... (you should be able to complete this sentence).
What to turn in
Turn your project in via t-square.
- Your report as report.pdf
- All of your code, as necessary to run as .py files.
- Document how to run your code in readme.txt.
- No zip files please.
Extra credit up to 3%
Choose one or more of the following:
- Compare the performance of KNN and LinReg in this task. The instructor anticipates that LinReg might work well. If that turns out to be the case, how can that be? This is a non-linear task, isn't it?
- Extend your code to create a "rolling" model that updates each day rolling forward.
- Extend your code to simultaneously forecast all the members of the S&P 500. Generate trades accordingly, and backtest the result.
Submit to the extra credit assignment on t-square. One single PDF file only, max 1000 words.
Rubric
- Are all 9 plots present and correct? -5 points for each missing plot.
- Note: Correct in the sense that they properly display the information requested. The result may not be the desired one.
- Are comparative backtest results correct? (ML4T-220 in sample & out of sample, IBM in sample & out of sample) -10 points for each incorrect result.
- Indicators used: Are descriptions of factors used sufficiently clear that others could reproduce them? Up to -10 points for lack of clarity.
- Trading strategy: Is description sufficiently clear that others could reproduce it? Up to -10 points for lack of clarity.
- Is discussion of results concise, complete, correct? Up to -5 points for each of concise, complete, correct.
Required, Allowed & Prohibited
Required:
- Your project must be coded in Python 2.7.x.
- Your code must run on one of the university-provided computers (e.g. buffet02.cc.gatech.edu), or on one of the provided virtual images.
- Use only util.py to read data. If you want to read items other than adjusted close, modify util.py to do it, and submit your new version with your code.
Allowed:
- You can develop your code on your personal machine, but it must also run successfully on one of the university provided machines or virtual images.
- Your code may use standard Python libraries.
- You may use the NumPy, SciPy, matplotlib and Pandas libraries. Be sure you are using the correct versions.
- You may use scikit learn libraries (note that you don't need them because you just wrote your own!).
- You may reuse sections of code (up to 5 lines) that you collected from other students or the internet.
- Code provided by the instructor, or allowed by the instructor to be shared.
- A herring.
Prohibited:
- Any other method of reading data besides util.py
- Any libraries not listed in the "allowed" section above.
- Any code you did not write yourself (except for the 5 line rule in the "allowed" section).