Spring 2020 Project 3: Assess Learners

From Quantitative Analysis Software Courses

Revision as of 21:13, 12 January 2020


Implement DTLearner (15 points)

Implement a Decision Tree learner class named DTLearner in the file DTLearner.py. You should follow the algorithm outlined in the decision tree slides presentation.

  • We define "best feature to split on" as the feature (Xi) that has the highest absolute value correlation with Y.
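For illustration, the correlation-based split choice might be computed as follows (a sketch using np.corrcoef, not required code; `best_feature` is a hypothetical helper name):

```python
import numpy as np

def best_feature(data_x, data_y):
    """Return the index of the feature with the highest absolute correlation
    with Y. nan_to_num guards against constant columns, for which corrcoef
    returns nan."""
    corr = [np.corrcoef(data_x[:, j], data_y)[0, 1] for j in range(data_x.shape[1])]
    return int(np.argmax(np.abs(np.nan_to_num(corr))))
```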

The algorithm outlined in those slides is based on the paper by JR Quinlan which you may also want to review as a reference. Note that Quinlan's paper is focused on creating classification trees, while we're creating regression trees here, so you'll need to consider the differences.

For this part of the project, your code should build a single tree only (not a forest). We'll get to forests later in the project. Your code should support exactly the API defined below. DO NOT import any modules besides those listed in the allowed section below. You should implement the following functions/methods:

import DTLearner as dt
learner = dt.DTLearner(leaf_size = 1, verbose = False) # constructor
learner.addEvidence(Xtrain, Ytrain) # training step
Y = learner.query(Xtest) # query

Where "leaf_size" is the maximum number of samples to be aggregated at a leaf. While the tree is being constructed recursively, if there are leaf_size or fewer elements at the time of the recursive call, the data should be aggregated into a leaf. Xtrain and Xtest should be ndarrays (numpy objects) where each row represents an X1, X2, X3... XN set of feature values. The columns are the features and the rows are the individual example instances. Y and Ytrain are single dimension ndarrays that indicate the value we are attempting to predict with X.
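One way the leaf_size stopping rule might fit into the recursive build is sketched below. It assumes the ndarray tree representation from the slides, with each row holding [feature, split value, relative left-child offset, relative right-child offset] and feature -1 marking a leaf that stores mean(Y); this is a sketch under those assumptions, not a reference solution:

```python
import numpy as np

def build_tree(data_x, data_y, leaf_size=1):
    """Recursively build a regression tree as an ndarray of rows
    [feature, split_val, left_offset, right_offset]; feature == -1 is a leaf."""
    # Leaf case: leaf_size or fewer samples, or all Y values identical
    if data_x.shape[0] <= leaf_size or np.all(data_y == data_y[0]):
        return np.array([[-1.0, data_y.mean(), np.nan, np.nan]])
    # Best feature: highest |correlation| with Y (nan-safe for constant columns)
    corr = np.nan_to_num([np.corrcoef(data_x[:, j], data_y)[0, 1]
                          for j in range(data_x.shape[1])])
    i = int(np.argmax(np.abs(corr)))
    split_val = np.median(data_x[:, i])
    left = data_x[:, i] <= split_val
    if left.all() or not left.any():          # degenerate split -> make a leaf
        return np.array([[-1.0, data_y.mean(), np.nan, np.nan]])
    left_tree = build_tree(data_x[left], data_y[left], leaf_size)
    right_tree = build_tree(data_x[~left], data_y[~left], leaf_size)
    root = np.array([[i, split_val, 1.0, left_tree.shape[0] + 1]])
    return np.vstack((root, left_tree, right_tree))
```

With a large leaf_size the whole training set collapses into a single leaf, which is exactly the aggregation behavior described above.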

If "verbose" is True, your code can print out information for debugging. If verbose = False your code should not generate ANY output. When we test your code, verbose will be False.

This code should not generate statistics or charts.

Implement RTLearner (15 points)

Implement a Random Tree learner class named RTLearner in the file RTLearner.py. This learner should behave exactly like your DTLearner, except that the choice of feature to split on should be made randomly. You should be able to accomplish this by removing a few lines from DTLearner (the ones that compute the correlation) and replacing the line that selects the feature with a call to a random number generator.
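The replacement line might look something like this (a sketch; `choose_feature` is a hypothetical helper name, not part of the required API):

```python
import numpy as np

def choose_feature(data_x):
    # RTLearner: instead of computing correlations, pick the split
    # feature uniformly at random from the available columns
    return np.random.randint(data_x.shape[1])
```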

You should implement the following functions/methods:

import RTLearner as rt
learner = rt.RTLearner(leaf_size = 1, verbose = False) # constructor
learner.addEvidence(Xtrain, Ytrain) # training step
Y = learner.query(Xtest) # query

Implement BagLearner (20 points)

Implement Bootstrap Aggregating as a Python class named BagLearner. Your BagLearner class should be implemented in the file BagLearner.py. It should support EXACTLY the API defined below. This API is designed so that BagLearner can accept any learner (e.g., RTLearner, LinRegLearner, even another BagLearner) as input and use it to generate a learner ensemble. Your BagLearner should support the following function/method prototypes:

import BagLearner as bl
learner = bl.BagLearner(learner = al.ArbitraryLearner, kwargs = {"argument1":1, "argument2":2}, bags = 20, boost = False, verbose = False)
learner.addEvidence(Xtrain, Ytrain)
Y = learner.query(Xtest)

Where learner is the learning class to use with bagging. You should be able to support any learning class that obeys the API defined above for DTLearner and RTLearner. kwargs are keyword arguments to be passed on to the learner's constructor, and they vary according to the learner (see example below). The "bags" argument is the number of learners you should train using Bootstrap Aggregation. If boost is True, you should implement boosting (optional). If verbose is True, your code can generate output; otherwise the code should be silent.
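A minimal sketch of how these constructor arguments could wire together, assuming each wrapped learner exposes the addEvidence/query API above (MeanLearner is a hypothetical stand-in used only for this demo, and boosting is left unimplemented):

```python
import numpy as np

class BagLearner:
    """Sketch of a bagging ensemble over any learner with addEvidence/query."""
    def __init__(self, learner, kwargs=None, bags=20, boost=False, verbose=False):
        # Instantiate one learner per bag, forwarding kwargs to its constructor
        self.learners = [learner(**(kwargs or {})) for _ in range(bags)]
        self.boost = boost        # boosting is optional and not implemented here
        self.verbose = verbose

    def addEvidence(self, data_x, data_y):
        n = data_x.shape[0]
        for lrn in self.learners:
            # Each bag trains on its own bootstrap sample of n rows
            idx = np.random.choice(n, size=n, replace=True)
            lrn.addEvidence(data_x[idx], data_y[idx])

    def query(self, points):
        # Ensemble prediction = mean of the individual learners' predictions
        return np.mean([lrn.query(points) for lrn in self.learners], axis=0)

class MeanLearner:
    """Hypothetical demo learner: predicts the mean of its training Y."""
    def __init__(self, verbose=False):
        self.verbose = verbose
    def addEvidence(self, data_x, data_y):
        self.mean = data_y.mean()
    def query(self, points):
        return np.full(points.shape[0], self.mean)
```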

As an example, if we wanted to make a random forest of 20 Decision Trees with leaf_size 1 we might call BagLearner as follows

import BagLearner as bl
learner = bl.BagLearner(learner = dt.DTLearner, kwargs = {"leaf_size":1}, bags = 20, boost = False, verbose = False)
learner.addEvidence(Xtrain, Ytrain)
Y = learner.query(Xtest)

As another example, if we wanted to build a bagged learner composed of 10 LinRegLearners we might call BagLearner as follows

import BagLearner as bl
learner = bl.BagLearner(learner = lrl.LinRegLearner, kwargs = {}, bags = 10, boost = False, verbose = False)
learner.addEvidence(Xtrain, Ytrain)
Y = learner.query(Xtest)

Note that each bag should be trained on a different bootstrap sample of the data. You will be penalized if this is not the case.

Boosting is an optional topic and not required. There's a citation in the Resources section that outlines a method of implementing boosting.

If the training set contains n data items, each bag should contain n items as well. Note that because you should sample with replacement, some of the data items will be repeated.
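The with-replacement behavior is easy to verify (a quick sketch, not part of the required API):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000
bag = rng.choice(n, size=n, replace=True)  # one bag: n row indices, drawn with replacement
# Roughly 63% (about 1 - 1/e) of the original rows appear in the bag;
# the remaining slots are duplicates, which is expected and correct.
unique_rows = np.unique(bag).size
```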

This code should not generate statistics or charts. If you want to create charts and statistics, you can modify testlearner.py.

You can use code like the following to instantiate several learners with the parameters listed in kwargs:

learners = []
kwargs = {"k":10}
for i in range(0,bags):
    learners.append(learner(**kwargs))

Implement InsaneLearner (Up to 10 point penalty)

Implement author() Method (Up to 10 point penalty)

Extra Credit (0 points)

Experiments and Report (50 points)

What to turn in

Rubric

Report

Code

Required, Allowed & Prohibited