Difference between revisions of "MC3-Project-1"

Revision as of 18:12, 8 November 2015

Draft

This is an unofficial draft of the project assignment. This notice will be removed when the assignment is official.

Updates / FAQs

Q: Can I use an ML library or do I have to write the code myself? A: You must write the KNN and bagging code yourself. For the LinRegLearner you are allowed to make use of NumPy or SciPy libraries, but you must "wrap" the library code to implement the APIs defined below. Do not use other libraries or your code will fail the auto-grading test cases.

2015-10-07

Draft version posted.

Overview

You are to implement and evaluate three learning algorithms as Python classes: a KNN learner, a Linear Regression learner, and a Bootstrap Aggregating learner. The classes should be named KNNLearner, LinRegLearner, and BagLearner respectively. We treat this as a regression problem, not classification, so the goal is to return a continuous numerical result rather than a discrete one.

In this project we are training & testing with static spatial data. In the next project we will make the transition to time series data.

You must write your own code for KNN and bagging. You are NOT allowed to use other people's code to implement KNN or bagging.

The project has two main components: the code for your learners, which will be auto graded, and your report, report.pdf, which should include the components listed below.

Template and Data

Instructions:

Download mc3_p1.zip, unzip inside ml4t/

You will find these files in the mc3_p1 directory

Data/: Contains data for you to test your learning code on.
LinRegLearner.py: An implementation of the LinRegLearner class. You can use it as a template for implementing your learner classes.
__init__.py: Tells Python that you can import classes while in this directory.
testlearner.py: Helper code to test a learner class.

In the Data/ directory there are three files:

3_groups.csv
ripple_.csv
simple.csv

We will mainly be working with ripple and 3_groups. Each data file contains 3 columns: X1, X2, and Y. In most cases you should use the first 60% of the rows for training and the remaining 40% for testing.
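The 60/40 split described above can be sketched as follows. This is only an illustration: the helper name split_train_test is hypothetical, the array demo is synthetic stand-in data rather than one of the provided files, and the split is assumed to follow file order with Y in the last column.

```python
import numpy as np

def split_train_test(data, train_fraction=0.6):
    """Split rows in file order: the first part trains, the rest tests.

    Hypothetical helper; assumes the last column is Y and every earlier
    column is a feature (X1, X2, ...).
    """
    n_train = int(round(train_fraction * data.shape[0]))
    train, test = data[:n_train], data[n_train:]
    return train[:, :-1], train[:, -1], test[:, :-1], test[:, -1]

# Synthetic stand-in for a data file with columns X1, X2, Y.
demo = np.arange(30, dtype=float).reshape(10, 3)
Xtrain, Ytrain, Xtest, Ytest = split_train_test(demo)
print(Xtrain.shape, Ytrain.shape, Xtest.shape, Ytest.shape)
```

A real data file would first need to be read into an ndarray, for example with numpy.loadtxt and delimiter=",", assuming the file has no header row.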

Part 1: Implement KNNLearner (30%)

Your KNNLearner class should implement the following functions/methods:

learner = KNNLearner(k = 3) # constructor
learner.addEvidence(Xtrain, Ytrain) # training step
Y = learner.query(Xtest) # query

Here "k" is the number of nearest neighbors to find. Xtrain and Xtest should be ndarrays (numpy objects) in which each row holds one set of feature values X1, X2, X3... XN. The columns are the features and the rows are the individual example instances. Y and Ytrain are one-dimensional ndarrays that hold the value we are attempting to predict from X.

Use Euclidean distance.

Take the mean of the closest k points' Y values to make your prediction.
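As a rough illustration of the API and the two rules above (Euclidean distance, mean of the k nearest Y values), a minimal sketch might look like the following. It is one possible reading of the specification, not a reference solution; the toy data is invented, and the way ties in distance are broken is simply whatever numpy.argsort returns.

```python
import numpy as np

class KNNLearner(object):
    """Sketch of a k-nearest-neighbors regressor with the API above."""

    def __init__(self, k=3):
        self.k = k
        self.X = None
        self.Y = None

    def addEvidence(self, Xtrain, Ytrain):
        # KNN is a lazy learner: "training" only stores the data.
        self.X = np.asarray(Xtrain, dtype=float)
        self.Y = np.asarray(Ytrain, dtype=float)

    def query(self, Xtest):
        Xtest = np.asarray(Xtest, dtype=float)
        preds = np.empty(Xtest.shape[0])
        for i, point in enumerate(Xtest):
            # Euclidean distance from this query point to every training row.
            dists = np.sqrt(((self.X - point) ** 2).sum(axis=1))
            nearest = np.argsort(dists)[:self.k]
            # Prediction is the mean Y of the k closest training points.
            preds[i] = self.Y[nearest].mean()
        return preds

# Invented toy data: three points near the origin and one far away.
X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [5.0, 5.0]])
Y = np.array([1.0, 2.0, 3.0, 10.0])
learner = KNNLearner(k=3)
learner.addEvidence(X, Y)
pred = learner.query(np.array([[0.1, 0.1], [4.0, 4.0]]))
print(pred)
```

The loop over query points keeps the logic easy to follow; a vectorized distance computation would be faster on large queries but is not needed to satisfy the API.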

Part 2: Implement BagLearner (20%)

For the Bootstrap Aggregating learner:

learner = BagLearner(learner = KNNLearner, bags = 20, boost = False)
learner.addEvidence(Xtrain, Ytrain)
Y = learner.query(Xtest)

Here "learner" is the learning class to use with bagging, and "bags" is the number of learners you should train using Bootstrap Aggregation. If boost is True, then you should implement boosting. Note that boosting is an extra credit topic and not required.

There's a citation below in the Resources section that outlines a method of implementing bagging.

If the training set contains n data items, each bag should contain n items as well. Note that because you should sample with replacement, some of the data items will be repeated.
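The sampling rule above (n items per bag, drawn with replacement) can be sketched as follows. Several details are assumptions rather than part of the specification: the base learner is constructed with no arguments, the bag predictions are combined by taking their mean, and MeanLearner is a hypothetical stand-in used only to keep the example self-contained. Boosting is not sketched.

```python
import numpy as np

class MeanLearner(object):
    """Hypothetical stand-in base learner: predicts the training mean."""

    def addEvidence(self, Xtrain, Ytrain):
        self.mean = float(np.mean(Ytrain))

    def query(self, Xtest):
        return np.full(len(Xtest), self.mean)

class BagLearner(object):
    """Sketch of bootstrap aggregating over one learner class."""

    def __init__(self, learner, bags=20, boost=False):
        if boost:
            raise NotImplementedError("boosting is not sketched here")
        # Assumption: the learner class can be built with no arguments.
        self.learners = [learner() for _ in range(bags)]

    def addEvidence(self, Xtrain, Ytrain):
        n = Xtrain.shape[0]
        for model in self.learners:
            # Each bag holds n row indices sampled with replacement,
            # so some rows repeat and others are left out.
            idx = np.random.randint(0, n, size=n)
            model.addEvidence(Xtrain[idx], Ytrain[idx])

    def query(self, Xtest):
        # Assumption: aggregate by averaging the per-bag predictions.
        return np.mean([m.query(Xtest) for m in self.learners], axis=0)

np.random.seed(0)  # only to make the demo repeatable
X = np.arange(20, dtype=float).reshape(10, 2)
Y = np.linspace(0.0, 9.0, 10)
bag = BagLearner(learner=MeanLearner, bags=5)
bag.addEvidence(X, Y)
pred = bag.query(X[:3])
print(pred)
```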

Part 3: Experiments and report (50%)

For the KNN learner:

Vary K from 1 to 50
For each dataset create a chart with two lines that plot RMS error against K (on the horizontal axis): one line for in-sample error and one for out-of-sample error on the same chart (two charts, each with two lines).
Scatter plots for each experiment that show predicted Y versus actual Y for the "best" K using the out-of-sample data (2 charts).

For the LinReg learner:

For each dataset compute the RMS error. Be sure to list these numbers in your report.
Scatter plots for each experiment that show predicted Y versus actual Y using the out-of-sample data (2 charts).

Note that you should create a total of 6 charts.
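The RMS error used in the experiments above, and the correlation between predicted and actual Y, can be computed along these lines. The function names and the sample arrays are illustrative only.

```python
import numpy as np

def rms_error(Y_actual, Y_pred):
    """Root-mean-square error between actual and predicted values."""
    return float(np.sqrt(np.mean((Y_actual - Y_pred) ** 2)))

def correlation(Y_actual, Y_pred):
    """Pearson correlation between predicted and actual values."""
    return float(np.corrcoef(Y_pred, Y_actual)[0, 1])

# Invented example: three exact predictions and one that is off by 2.
actual = np.array([1.0, 2.0, 3.0, 4.0])
predicted = np.array([1.0, 2.0, 3.0, 6.0])
print(rms_error(actual, predicted))
print(correlation(actual, predicted))
```

For the K-versus-error charts, the same rms_error call would be made twice per K: once with the training data (in-sample) and once with the test data (out-of-sample).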

Deliverables

Submit files (attachments) via t-square

Your code in KNNLearner.py, LinRegLearner.py and testlearner.py
A SINGLE Report (in a pdf file, report.pdf):
- Include the 6 charts, and the data for LinReg required above.
- Answer the following questions:
  - What is the "best" K for each dataset? Explain your reasoning. Note that there is not necessarily a single correct answer. I want to see your reasoning.
  - As K decreases, does overfitting occur for the datasets? At approximately which K does it start? Explain why you think this is occurring (or that it is not occurring).
Important: Disclose and cite any code or ideas you drew from others.

Hints & resources

For the linear regression component, you can use NumPy or SciPy library routines, as noted in the FAQ above. We suggest numpy.linalg.lstsq (see http://docs.scipy.org/doc/numpy/reference/generated/numpy.linalg.lstsq.html).

In order to get a correct answer that includes the constant term (alpha) you need to append a column of 1s to your X matrix before you send it to lstsq.
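A minimal sketch of that hint follows. The function names fit_linear and predict_linear are hypothetical, the data is invented, and this is not the code from the provided LinRegLearner.py template; passing rcond=None assumes NumPy 1.14 or newer.

```python
import numpy as np

def fit_linear(Xtrain, Ytrain):
    """Least-squares fit that includes a constant term.

    A column of 1s is appended to X so that lstsq also solves for the
    intercept; the last returned coefficient is that constant.
    """
    ones = np.ones((Xtrain.shape[0], 1))
    A = np.hstack([Xtrain, ones])
    coefs, _, _, _ = np.linalg.lstsq(A, Ytrain, rcond=None)
    return coefs

def predict_linear(coefs, Xtest):
    """Apply the fitted slopes and add the constant term."""
    return Xtest.dot(coefs[:-1]) + coefs[-1]

# Invented noiseless data generated from Y = 2*X1 - 3*X2 + 5.
X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 1.0]])
Y = 2.0 * X[:, 0] - 3.0 * X[:, 1] + 5.0
coefs = fit_linear(X, Y)
print(np.round(coefs, 6))
print(predict_linear(coefs, np.array([[3.0, 2.0]])))
```

Without the column of 1s the fit would be forced through the origin, so the constant 5 in this example could not be recovered.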

Some external resources that might be useful for this project:

You may be interested in Andrew Moore's slides on instance-based learning.
A definition of correlation which we'll use to assess the quality of the learning.
Bootstrap Aggregating
numpy corrcoef
numpy argsort
RMS error

Extra Credit

Write additional code, and add plots to your report that do the following:

Write code to query the learner from -1 to 1 in steps of .001 in each dimension (2001 values per dimension, so about 4 million queries for the two features) and plot the learned model for each dataset.
Write code to view the original data and the learned model in 3D.
Is it better to approach one of these datasets as a classification problem, rather than regression? If you think so, create the code to do that and provide results (charts) that illustrate the improved approach.
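For the grid query in the first extra credit item, one way to build the query points is sketched below. The helper make_query_grid is hypothetical and assumes two features; the demo uses a coarse step of 0.5 so that it runs instantly, whereas the 0.001 step gives 2001 points per axis.

```python
import numpy as np

def make_query_grid(low=-1.0, high=1.0, step=0.001):
    """Build every (X1, X2) pair on a regular grid, one pair per row.

    Hypothetical helper for two features. Returns the query matrix and
    the 2-D grid shape needed to reshape predictions for plotting.
    """
    n = int(round((high - low) / step)) + 1
    axis = np.linspace(low, high, n)
    g1, g2 = np.meshgrid(axis, axis)
    return np.column_stack([g1.ravel(), g2.ravel()]), g1.shape

# Coarse demo grid: 5 points per axis instead of 2001.
grid, shape = make_query_grid(step=0.5)
print(grid.shape, shape)
```

The predictions for such a grid could then be reshaped with learner.query(grid).reshape(shape) and passed to a surface or image plot, for example with matplotlib.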

Rubric

Start with 100. Points off as follows:

KNNLearner.py missing -50
LinRegLearner.py missing -10
testlearner.py missing -10
report.pdf missing -50
are all charts/data series present? (-10 for each missing data series)
are charts approximately correct? (-5 for each error)
Answer to "best K" question: Up to 10 points off if completely wrong
Answer to "overfitting" question: Up to 10 points off if completely wrong

If the report indicates significant problems, check the KNN implementation, and:
- KNN algorithm marginally incorrect -10
- KNN algorithm significantly incorrect -30

Extra credit:

Part 1: Up to +2.5 points
Part 2: Up to +2.5 points

To get full extra credit, execution must be stellar.
