Difference between revisions of "MC3-Project-1"

From Quantitative Analysis Software Courses

Revision as of 17:09, 9 November 2015

Draft

This is an unofficial draft of the project assignment. This notice will be removed when the assignment is official.

Updates / FAQs

Q: Can I use an ML library or do I have to write the code myself? A: You must write the KNN and bagging code yourself. For the LinRegLearner you are allowed to use the NumPy or SciPy libraries, but you must "wrap" the library code to implement the APIs defined below. Do not use other libraries, or your code will fail the auto-grading test cases.

2015-10-07

Draft version posted.

Overview

You are to implement and evaluate three learning algorithms as Python classes: a KNN learner, a Linear Regression learner (provided), and a Bootstrap Aggregating learner. The classes should be named KNNLearner, LinRegLearner, and BagLearner respectively. We are treating this as a regression problem (not classification), so the goal is to return a continuous numerical result (not a discrete one).

In this project we are training & testing with static spatial data. In the next project we will make the transition to time series data.

You must write your own code for KNN and bagging. You are NOT allowed to use other people's code to implement KNN or bagging.

The project has two main components: the code for your learners, which will be auto graded, and your report, report.pdf, which should include the components listed below.

Template and Data

Instructions:

You will find these files in the mc3_p1 directory:

  • Data/: Contains data for you to test your learning code on.
  • LinRegLearner.py: An implementation of the LinRegLearner class. You can use it as a template for implementing your learner classes.
  • __init__.py: Tells Python that you can import classes while in this directory.
  • testlearner.py: Helper code to test a learner class.

In the Data/ directory there are three files:

  • 3_groups.csv
  • ripple_.csv
  • simple.csv

We will mainly be working with ripple and 3_groups. Each data file contains 3 columns: X1, X2, and Y. In most cases you should use the first 60% of the data for training, and the remaining 40% for testing.
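
The 60/40 split described above can be sketched as follows. This is a minimal illustration, assuming the last column holds Y and the rows are used in file order; the helper name split_data is hypothetical and not part of the provided template, and a small in-memory array stands in for a file that would otherwise be loaded with np.genfromtxt.

```python
import numpy as np

def split_data(data, train_frac=0.6):
    """Split rows into a leading training portion and a trailing test portion.

    Assumes the last column is Y and all other columns are features.
    """
    train_rows = int(train_frac * data.shape[0])
    Xtrain, Ytrain = data[:train_rows, :-1], data[:train_rows, -1]
    Xtest, Ytest = data[train_rows:, :-1], data[train_rows:, -1]
    return Xtrain, Ytrain, Xtest, Ytest

# Stand-in for data read from a CSV file: 10 rows of X1, X2, Y
data = np.arange(30, dtype=float).reshape(10, 3)
Xtrain, Ytrain, Xtest, Ytest = split_data(data)
```

With 10 rows, the first 6 go to training and the remaining 4 to testing.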

Part 1: Implement KNNLearner (30%)

Your KNNLearner class should be implemented in the file KNNLearner.py. It should implement EXACTLY the API defined below. DO NOT import any modules besides those from numpy, scipy, or the basic Python libraries. You should implement the following functions/methods:

learner = KNNLearner(k = 3) # constructor
learner.addEvidence(Xtrain, Ytrain) # training step
Y = learner.query(Xtest) # query

Where "k" is the number of nearest neighbors to find. Xtrain and Xtest should be ndarrays (numpy objects) where each row represents an X1, X2, X3... XN set of feature values. The columns are the features and the rows are the individual example instances. Y and Ytrain are single dimension ndarrays that indicate the value we are attempting to predict with X.

Use Euclidean distance.

Take the mean of the closest k points' Y values to make your prediction.
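
To illustrate the two rules above (Euclidean distance, then the mean of the k closest Y values), here is a small sketch for a single query point. The function name knn_predict is hypothetical; it is not the required class API and is meant only to show the arithmetic.

```python
import numpy as np

def knn_predict(Xtrain, Ytrain, x, k=3):
    """Predict Y for one query point x (hypothetical helper, not the graded API)."""
    # Euclidean distance from x to every training row
    dists = np.sqrt(((Xtrain - x) ** 2).sum(axis=1))
    # Indices of the k training rows with the smallest distances
    nearest = np.argsort(dists)[:k]
    # Regression prediction: mean of the neighbors' Y values
    return Ytrain[nearest].mean()

Xtrain = np.array([[0.0, 0.0], [1.0, 0.0], [10.0, 10.0]])
Ytrain = np.array([1.0, 3.0, 100.0])
y = knn_predict(Xtrain, Ytrain, np.array([0.4, 0.0]), k=2)
```

Here the two nearest training points are the first two rows, so the prediction is the mean of 1.0 and 3.0.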

Part 2: Implement BagLearner (20%)

Implement Bootstrap Aggregating as a Python class named BagLearner. Your BagLearner class should be implemented in the file BagLearner.py. It should implement EXACTLY the API defined below. DO NOT import any modules besides those from numpy, scipy, or the basic Python libraries. You should implement the following functions/methods:

learner = BagLearner(learner = KNNLearner, kwargs = {"k":3}, bags = 20, boost = False)
learner.addEvidence(Xtrain, Ytrain)
Y = learner.query(Xtest)

Where learner is the learning class to use with bagging. kwargs are keyword arguments to be passed on to the learner's constructor (see note below). "bags" is the number of learners you should train using Bootstrap Aggregation. If boost is True, then you should implement boosting. Note that boosting is an extra credit topic and is not required.

You can use code like the following to instantiate several learners with the parameters listed in kwargs:

learners = []  # one learner instance per bag
for i in range(0, bags):
    learners.append(learner(**kwargs))

Other notes: There's a citation below in the Resources section that outlines a method of implementing bagging. If the training set contains n data items, each bag should contain n items as well. Note that because you should sample with replacement, some of the data items will be repeated.
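
The sampling rule in the note above (n items per bag, drawn with replacement) might look like the sketch below. The helper name bootstrap_sample and the use of a seeded RandomState are assumptions made for illustration; the source does not prescribe how the sampling is done.

```python
import numpy as np

def bootstrap_sample(Xtrain, Ytrain, rng):
    """Draw one bag of n rows with replacement (hypothetical helper)."""
    n = Xtrain.shape[0]
    # n integer draws from 0..n-1; repeats are expected
    idx = rng.randint(0, n, size=n)
    # Index X and Y with the same indices so rows stay paired
    return Xtrain[idx], Ytrain[idx]

rng = np.random.RandomState(0)
Xtrain = np.arange(20, dtype=float).reshape(10, 2)
Ytrain = np.arange(10, dtype=float)
Xbag, Ybag = bootstrap_sample(Xtrain, Ytrain, rng)
```

Each learner would then be trained on its own bag. For regression, bagging typically combines the learners at query time by averaging their predictions.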

Part 3: Experiments and report (50%)

Create a report that addresses the following issues/questions. The report should be submitted as report.pdf in PDF format. Do not submit Word documents or LaTeX files. Include data as tables or charts to support each of your answers. I expect that this report will be 4 to 10 pages.

  • Create your own dataset generating code (call it best4linreg.py) that creates data that performs significantly better with LinRegLearner than KNNLearner. Explain your data generating algorithm, and explain why LinRegLearner performs better. Your data should include at least 2 dimensions in X, and at least 1000 points. (Don't use bagging for this section).
  • Create your own dataset generating code (call it best4KNN.py) that creates data that performs significantly better with KNNLearner than LinRegLearner. Explain your data generating algorithm, and explain why KNNLearner performs better. Your data should include at least 2 dimensions in X, and at least 1000 points. (Don't use bagging for this section).
  • Consider the dataset ripple with KNN. For which values of K does overfitting occur? (Don't use bagging).
  • Now use bagging in conjunction with KNN with the ripple dataset. How does performance vary as you increase the number of bags? Does overfitting occur with respect to the number of bags?
  • Can bagging reduce or eliminate overfitting with respect to K for the ripple dataset?
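
As a starting point for the experiments above, the sketch below generates a hypothetical dataset of the required size (2 dimensions in X, 1000 points) and computes an out-of-sample RMSE for a least-squares fit. The function names, coefficients, noise level, and the choice of RMSE as the error measure are all assumptions; the assignment does not prescribe them, and whether such data favors LinRegLearner "significantly" has to be checked by running both learners.

```python
import numpy as np

def make_linear_data(n=1000, seed=0):
    """Hypothetical generator: Y is a noisy linear function of X1 and X2."""
    rng = np.random.RandomState(seed)
    X = rng.uniform(-10.0, 10.0, size=(n, 2))
    Y = 2.0 * X[:, 0] - 3.0 * X[:, 1] + rng.normal(0.0, 0.1, size=n)
    return X, Y

def rmse(Y, Ypred):
    """Root mean squared error between actual and predicted values."""
    return np.sqrt(((Y - Ypred) ** 2).mean())

X, Y = make_linear_data()
# Append a column of ones so the fit includes an intercept term
A = np.hstack([X, np.ones((X.shape[0], 1))])
# Fit on the first 60% of rows, evaluate on the remaining 40%
coeffs = np.linalg.lstsq(A[:600], Y[:600], rcond=None)[0]
test_error = rmse(Y[600:], A[600:].dot(coeffs))
```

Computing the same rmse on both the training rows and the test rows, for each value of K or each number of bags, gives the in-sample and out-of-sample curves needed to discuss overfitting.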

Hints & resources

Some external resources that might be useful for this project:

What to turn in

Be sure to follow these instructions diligently!

Via T-Square, submit as attachment (no zip files; refer to schedule for deadline):

  • Your code as KNNLearner.py, BagLearner.py, best4linreg.py, best4KNN.py
  • Your report as report.pdf

Unlimited resubmissions are allowed up to the deadline for the project.

Extra credit up to 3%

Implement boosting as part of BagLearner. How does boosting affect performance for ripple and 3_groups data?

Does overfitting occur for either of these datasets as the number of bags with boosting increases?

Create your own dataset for which overfitting occurs as the number of bags with boosting increases.

Rubric

TBD