Difference between revisions of "CS4646 assess learners"

From Quantitative Analysis Software Courses

Latest revision as of 19:40, 30 January 2018

Overview

You are to implement and evaluate three learning algorithms as Python classes: A Decision Tree learner, a Random Tree learner, and a Bootstrap Aggregating learner. The classes should be named DTLearner, RTLearner, and BagLearner respectively. You are given a Linear Regression learner named LinRegLearner.

You must write your own code for Decision Tree learning, Random Tree learning and bagging. You are NOT allowed to use other people's code to implement these learners. You may not import libraries that solve these problems.

For this project, we are ignoring the time series aspect of the data and treating it as a regression problem.

The project has two main components: the code for your learners, which will be auto graded, and your report, report.pdf, which should include the components listed below.

Your learners should be able to handle any number of dimensions in X from 2 to N.

Template and Data

Instructions:

  • You should see these files:
    • assess_learners/: The assignment directory.
    • assess_learners/Data/: Contains data for you to test your learning code on.
    • assess_learners/LinRegLearner.py: An implementation of the LinRegLearner class. You can use it as a template for implementing your learner classes.
    • assess_learners/testlearner.py: Simple testing scaffold that you can use to test your learners. Useful for debugging.
    • assess_learners/grade_learners.py: The grading script.

In the assess_learners/Data/ directory you will find the istanbul.csv data set. There are other data sets you may use for testing, as well. Each data file contains N+1 columns: X1, X2, ... XN, and Y.

We will mainly be working with the istanbul data. This data includes the returns of multiple worldwide indexes for a number of days in history. The overall objective is to predict what the return for the MSCI Emerging Markets (EM) index will be on the basis of the other index returns. Y in this case is the last column to the right, and the X values are the remaining columns to the left (except the first column). Note again that we are not predicting the future here. We are predicting one index (that we presumably can't look up) based on other indices on the same day. Thus you can ignore the date column.

When the auto grader tests your code we will randomly select 60% of the data to train on and use the other 40% for testing. Make sure your code doesn't depend on particular random seeds to work properly.
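A random 60/40 split of this kind can be sketched as follows. This is only an illustration for your own testing, not the grader's actual code: the function name split_train_test and its parameters are hypothetical, and it assumes the date column has already been dropped so that the last column of data is Y.

```python
import numpy as np

def split_train_test(data, train_fraction=0.6, rng=None):
    """Shuffle the rows, then return (Xtrain, Ytrain, Xtest, Ytest)."""
    rng = np.random if rng is None else rng
    n_rows = data.shape[0]
    order = rng.permutation(n_rows)                 # random row order
    n_train = int(round(train_fraction * n_rows))   # size of the training part
    train = data[order[:n_train]]
    test = data[order[n_train:]]
    # Last column is Y; all remaining columns are the X features.
    return train[:, :-1], train[:, -1], test[:, :-1], test[:, -1]

# Toy data: 10 rows, 4 feature columns plus Y (hypothetical values).
data = np.arange(50, dtype=float).reshape(10, 5)
Xtrain, Ytrain, Xtest, Ytest = split_train_test(data)
```

Because the rows are drawn differently on every run, results will vary from run to run, which is why a learner should not rely on any particular seed.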


Implement DTLearner (15 points)

Implement a Decision Tree learner class named DTLearner in the file DTLearner.py using the method discussed in class. (It is one variant of CART, also very similar to the ID3 algorithm by JR Quinlan, except using correlation to do regression learning.)

We select as the "best feature" the one with the highest absolute value of correlation with Y. We use the median value of that feature as the split point.
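The selection rule above might look roughly like this for a single node. It is a minimal sketch, not the reference implementation: the helper name pick_split is hypothetical, and treating a constant feature column (whose correlation is undefined) as having correlation 0 is an assumption made here.

```python
import numpy as np

def pick_split(X, Y):
    """Return (feature index, split value) for one node."""
    # Absolute correlation of each feature column with Y.
    corrs = np.array([abs(np.corrcoef(X[:, i], Y)[0, 1])
                      for i in range(X.shape[1])])
    # Assumption: a constant column yields NaN; treat it as correlation 0.
    corrs = np.nan_to_num(corrs)
    feature = int(np.argmax(corrs))        # "best feature"
    split_val = np.median(X[:, feature])   # median of that feature
    return feature, split_val

# Toy data (hypothetical): column 0 is perfectly correlated with Y.
X = np.array([[1.0, 5.0], [2.0, 3.0], [3.0, 8.0], [4.0, 1.0]])
Y = np.array([2.0, 4.0, 6.0, 8.0])
feature, split_val = pick_split(X, Y)
```

A full learner also has to decide what to do when every row in a node falls on the same side of the median. The text above does not cover that case.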

For this part of the project, your code should build a single tree only (not a forest). DO NOT import any modules besides those listed in the allowed section below. You should implement the following functions/methods. This also shows the common use case:

import DTLearner as dt
learner = dt.DTLearner(leaf_size = 1, verbose = False) # constructor
learner.addEvidence(Xtrain, Ytrain) # training step
Y = learner.query(Xtest) # query

When there are <= leaf_size samples in a node, it should become a leaf instead of splitting again. Xtrain and Xtest should be ndarrays (numpy objects) where each row represents an X1, X2, X3... XN set of feature values. The columns are the features and the rows are the individual example instances. Y and Ytrain are single dimension ndarrays that indicate the value we are attempting to predict with X.

If "verbose" is True, your code can print out information for debugging. If verbose = False your code should not generate ANY output. When we test your code, verbose will be False.

This code should not generate statistics or charts.

Implement RTLearner (15 points)

Implement a Random Tree learner class named RTLearner in the file RTLearner.py. This learner should behave exactly like your DTLearner, except that the choice of feature to split on should be made randomly. You should be able to accomplish this by removing a few lines from DTLearner (the ones that compute the correlation) and replacing the line that selects the feature with a call to a random number generator.
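For comparison with DTLearner, the random choice could be sketched as below. The helper name pick_random_split and the rng parameter are hypothetical. The sketch assumes the split value is still the median of the chosen feature, since the only stated difference from DTLearner is how the feature is picked.

```python
import numpy as np

def pick_random_split(X, rng=np.random):
    """Return (feature index, split value) with the feature chosen at random."""
    feature = int(rng.randint(0, X.shape[1]))   # upper bound is exclusive
    split_val = np.median(X[:, feature])        # same split rule as DTLearner
    return feature, split_val

# Toy data (hypothetical) and a seeded generator, used only for illustration.
X = np.array([[1.0, 5.0, 0.5], [2.0, 3.0, 0.1], [3.0, 8.0, 0.9]])
feature, split_val = pick_random_split(X, np.random.RandomState(0))
```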

You should implement the following functions/methods:

import RTLearner as rt
learner = rt.RTLearner(leaf_size = 1, verbose = False) # constructor
learner.addEvidence(Xtrain, Ytrain) # training step
Y = learner.query(Xtest) # query

Implement BagLearner (20 points)

Implement Bootstrap Aggregation as a Python class named BagLearner. Your BagLearner class should be implemented in the file BagLearner.py. It should support EXACTLY the API defined below. This API is designed so that BagLearner can accept any learner (e.g., RTLearner, LinRegLearner, even another BagLearner) as input and use it to generate a learner ensemble. Your BagLearner should support the following function/method prototypes:

import BagLearner as bl
learner = bl.BagLearner(learner = al.ArbitraryLearner, kwargs = {"argument1":1, "argument2":2}, bags = 20, boost = False, verbose = False)
learner.addEvidence(Xtrain, Ytrain)
Y = learner.query(Xtest)

Where learner is the learning class to use with bagging. You should be able to support any learning class that obeys the API defined above for DTLearner and RTLearner. kwargs are keyword arguments to be passed on to the learner's constructor and they vary according to the learner (see example below). The "bags" argument is the number of learners you should train using Bootstrap Aggregation. If verbose is True, your code can generate output. Otherwise the code should be silent.

We will not use the boosting parameter. It will always be false.

As an example, if we wanted to make a random forest of 20 Decision Trees with leaf_size 1 we might call BagLearner as follows:

import BagLearner as bl
import DTLearner as dt
learner = bl.BagLearner(learner = dt.DTLearner, kwargs = {"leaf_size":1}, bags = 20, boost = False, verbose = False)
learner.addEvidence(Xtrain, Ytrain)
Y = learner.query(Xtest)

As another example, if we wanted to build a bagged learner composed of 10 LinRegLearners we might call BagLearner as follows:

import BagLearner as bl
import LinRegLearner as lrl
learner = bl.BagLearner(learner = lrl.LinRegLearner, kwargs = {}, bags = 10, boost = False, verbose = False)
learner.addEvidence(Xtrain, Ytrain)
Y = learner.query(Xtest)

If the training set contains n data items, each bag should contain n items as well. Note that because you should sample with replacement, some of the data items will be repeated.
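Drawing one bag of n items with replacement can be sketched with NumPy index sampling. The function name make_bag and the rng parameter are hypothetical; this is one possible way to do it, not a required approach.

```python
import numpy as np

def make_bag(X, Y, rng=np.random):
    """Return one bootstrap sample (Xbag, Ybag) with as many rows as X."""
    n = X.shape[0]
    idx = rng.randint(0, n, size=n)   # n row indices drawn with replacement
    return X[idx], Y[idx]             # the same indices keep X and Y aligned

# Toy data (hypothetical): X[i, 0] == 2 * Y[i], so alignment is easy to check.
X = np.arange(12.0).reshape(6, 2)
Y = np.arange(6.0)
Xbag, Ybag = make_bag(X, Y, np.random.RandomState(1))
```

Indexing X and Y with the same index array is what keeps each sampled row paired with its own Y value.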

This code should not generate statistics or charts. If you want to create charts and statistics, you can modify testlearner.py, which you will not submit.

You can use code like the below to instantiate several learners with the parameters listed in kwargs:

learners = []
kwargs = {"k":10}
for i in range(0,bags):
    learners.append(learner(**kwargs))

Note that we will test your code with an arbitrarily named class (different every time), so you must not hardcode your BagLearner to work specifically/only with DTLearner and RTLearner!

Experiments and report (50 points)

Create a report that addresses the following questions. Use 11pt font and single-spaced lines. We expect that a complete report addressing all the criteria would be at least 3 pages. It should be no longer than 3000 words. To encourage conciseness we will deduct 10 points if the report is too long. The report should be submitted as report.pdf in PDF format. Do not submit Word docs or LaTeX files. Include data as tables or charts to support each of your answers.

  • Does overfitting occur with respect to leaf_size? Consider the dataset istanbul.csv with DTLearner. For which values of leaf_size does overfitting occur? Use RMSE as your metric for assessing overfitting. Support your assertion with graphs/charts. (Don't use bagging).
  • Can bagging reduce or eliminate overfitting with respect to leaf_size? Again consider the dataset istanbul.csv with DTLearner. To investigate this choose a fixed number of bags to use and vary leaf_size to evaluate. Provide charts and/or tables to validate your conclusions.
  • Quantitatively compare "classic" decision trees (DTLearner) versus random trees (RTLearner). In which ways is one method better than the other?
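The two metrics referred to in this assignment, RMSE and the correlation between predicted and actual values, can be computed as in the sketch below. The function names rmse and correlation are hypothetical helpers for experiment code such as testlearner.py. They are not part of the required learner API.

```python
import numpy as np

def rmse(Y, Ypred):
    """Root mean squared error between actual and predicted values."""
    return float(np.sqrt(np.mean((Y - Ypred) ** 2)))

def correlation(Y, Ypred):
    """Pearson correlation between predicted and actual values."""
    return float(np.corrcoef(Ypred, Y)[0, 1])

# Toy values (hypothetical): only the last prediction is off, by 2.
Y = np.array([1.0, 2.0, 3.0, 4.0])
Ypred = np.array([1.0, 2.0, 3.0, 6.0])
error = rmse(Y, Ypred)
corr = correlation(Y, Ypred)
```

Computing both metrics in sample and out of sample for each leaf_size gives the numbers needed for the charts and tables.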

What to turn in

Be sure to follow these instructions diligently!

Via T-Square, submit as attachment (no zip files; refer to schedule for deadline).

  • Your code as RTLearner.py, DTLearner.py, and BagLearner.py.
  • Your report as report.pdf

Unlimited resubmissions are allowed up to the deadline for the project.

Rubric

Code (50 points):

  • DTLearner in sample/out of sample test, auto grade 5 test cases (4 using istanbul.csv, 1 using another data set), 3 points each: 15 points.
    • For each test 60% of the data will be selected at random for training and 40% will be selected for testing.
    • Success criteria for each of the 5 tests:
      • 1) Does the correlation between predicted and actual results for in sample data exceed 0.95 with leaf_size = 1?
      • 2) Does the correlation between predicted and actual results for out of sample data exceed 0.15 with leaf_size=1?
      • 3) Is the correlation between predicted and actual results for in sample data below 0.95 with leaf_size = 50?
      • 4) Does the test complete in less than 10 seconds (i.e. 50 seconds for all 5 tests)?
  • RTLearner in sample/out of sample test, auto grade 5 test cases (4 using istanbul.csv, 1 using another data set), 3 points each: 15 points.
    • For each test 60% of the data will be selected at random for training and 40% will be selected for testing.
    • Success criteria for each of the 5 tests:
      • 1) Does the correlation between predicted and actual results for in sample data exceed 0.95 with leaf_size = 1?
      • 2) Does the correlation between predicted and actual results for out of sample data exceed 0.15 with leaf_size=1?
      • 3) Is the correlation between predicted and actual results for in sample data below 0.95 with leaf_size = 50?
      • 4) Does the test complete in less than 3 seconds (i.e. 15 seconds for all 5 tests)?
  • BagLearner, auto grade 10 test cases (8 using istanbul.csv, 2 using another data set), 2 points each: 20 points.
    • For each test 60% of the data will be selected at random for training and 40% will be selected for testing.
    • leaf_size = 20
    • Success criteria for each run of the 10 tests:
      • 1) For out of sample data is correlation with 1 bag lower than correlation for 20 bags?
      • 2) Does the test complete in less than 10 seconds (i.e. 100 seconds for all 10 tests)?
  • Does BagLearner work correctly with an arbitrarily named class? (-10 points if not)

Report (50 points):

  • Is the report neat and well organized? (-5 points if not)
  • Is the experimental methodology well described? (up to -5 points if not)
  • Overfitting / leaf_size question:
    • Is data (either a chart or table) provided to support the argument? (up to -5 points if not)
    • Does the student state where the region of overfitting occurs (or state that there is no overfitting)? (-5 points if not)
    • Are the starting point and direction of overfitting identified supported by the data (or if the student states that there is no overfitting, is that supported by the data)? (-5 points if not)
  • Does bagging reduce or eliminate overfitting?:
    • Is data (either a chart or table) provided to support the argument? (-5 points if not)
    • Does the student state where the region of overfitting occurs (or state that there is no overfitting)? (-5 points if not)
    • Are the starting point and direction of overfitting identified supported by the data (or if the student states that there is no overfitting, is that supported by the data)? (-5 points if not)
  • Comparison of DT and RT learning
    • Is each quantitative experiment explained well enough that someone else could reproduce it? (up to -5 points if not)
    • Are there at least two quantitative properties that are compared? (-5 points if only one, -10 if none)
    • Is each conclusion regarding each comparison supported well with data (either tabular or graphic)? (up to -10 points if not)
  • Was the report exceptionally well done? (up to +2 points)
  • Does the student's response indicate a lack of understanding of overfitting? (up to -10 points)

Required, Allowed & Prohibited

Required:

  • Your code must implement a Random Tree learner.
  • Your project must be coded in Python 2.7.x.
  • Your code must run on one of the university-provided computers (e.g. buffet02.cc.gatech.edu).
  • Your code must run within the given time limits on one of the university-provided computers.
  • The code you submit should NOT include any data reading routines. The provided testlearner.py code reads data for you.
  • The code you submit should NOT generate any output: No prints, no charts, etc.

Allowed:

  • You can develop your code on your personal machine, but it must also run successfully on one of the university-provided machines.
  • Your code may use standard Python libraries.
  • You may use the NumPy, SciPy, matplotlib and Pandas libraries. Be sure you are using the correct versions.
  • Code provided by the instructor, or allowed by the instructor to be shared.

Prohibited:

  • Any libraries not listed in the "allowed" section above.
  • Any code you did not write yourself
  • Any Classes (other than Random) that create their own instance variables for later use (e.g., learners like kdtree).
  • Code that includes any data reading routines. The provided testlearner.py code reads data for you.
  • Code that generates any output when verbose = False: No prints, no charts, etc.

Acknowledgements and Citations

The data used in this assignment was provided by UCI's ML Datasets.