Difference between revisions of "MC3-Homework-1"

From Quantitative Analysis Software Courses
 
(42 intermediate revisions by 2 users not shown)
Line 1: Line 1:
==Draft==
 
 
The description for this assignment has not been created yet.  Once it is finalized, this notice will be removed.
 
  
 
==Updates / FAQs==
 
==Updates / FAQs==
  
* Q: Can I use an ML library or do I have to write the code myself?  A: You must write the decision tree and bagging code yourself.  The LinRegLearner is provided to you. Do not use other libraries or your code will fail the auto grading test cases.
+
* '''March 15, 2017''' API revised to include random number seed.
 
+
* '''March 20, 2017''' Homework description finalized and released.
* Q: Which libraries am I allowed to use?  Which library calls are prohibited?  A: The use of classes that create and maintain their own data structures is prohibited.  So for instance, use of <tt>scipy.spatial.KDTree</tt> is not allowed because it builds a tree and keeps that data structure around for reference later.  The intent for this project is that YOU should be building and maintaining the data structures necessary. You can, however, use methods that return immediate results and do not retain data structures.
 
** Examples of things that are allowed: sqrt(), sort(), argsort() -- note that these methods return an immediate value and do not retain data structures for later use.
 
** Examples of things that are prohibited: any scikit add on library, scipy.spatial.KDTree, importing things from libraries other than pandas, numpy or scipy.
 
 
 
* Your strategy for defeating RTLearner and LinRegLearner should not depend on the way you select training data versus testing data.  The relationship of one learner performing better than another should persist regardless of which 60% of the data is selected for training and which 40% is selected for testing.
 
 
 
* Q: How should I read in the data?  A: Your code does not need to read in data; that is handled for you in the testlearner.py code.  You can modify testlearner.py to read in different datasets.  Your solution should NOT depend on any special code in testlearner.py.
 
 
 
* Q: How many data items should be in each bag? A: If the training set is of size N, each bag should contain N items. Note that since sampling is with replacement some of the data items will be repeated.
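The bag-size answer above can be sketched as follows. This is a minimal illustration, not the required implementation: the helper name <tt>bootstrap_sample</tt> and the use of <tt>numpy.random.RandomState</tt> are assumptions made for the example.

```python
import numpy as np

def bootstrap_sample(X, Y, rng):
    """Draw one bag of N rows, with replacement, from an N-row training set.

    bootstrap_sample is a hypothetical helper name; rng is assumed to be a
    numpy.random.RandomState instance.
    """
    n = X.shape[0]
    # n independent draws from 0..n-1; repeats are expected
    idx = rng.randint(0, n, size=n)
    # index X and Y with the same rows so each example stays paired with its label
    return X[idx], Y[idx]
```

Because the indices are drawn independently, some rows appear more than once in a bag and others not at all, which is the behavior the answer describes.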
 
  
 
==Overview==
 
==Overview==
You are to implement and evaluate three learning algorithms as Python classes: A Random Tree learner, a Linear Regression learner (provided) and a Bootstrap Aggregating learner.  The classes should be named RTLearner, LinRegLearner, and BagLearner respectively.  You can use the provided testlearner.py code as a framework for testing your code; we will use similar code in our autograder to test your learners.  Be sure that your solution does not depend on any code in testlearner.py.
 
 
We are considering this a <b>regression</b> problem (not classification).  So the goal is to return a continuous numerical result (not a discrete result).  In this project we are training & testing with static spatial data.  In a later project we will make the transition to time series data.
 
  
In addition to using some data sets that we will provide, you will also write some code to generate your own datasets.  That part of the project will test your understanding of the strengths and weaknesses of various learners.
+
For this homework you will generate data that you believe will work better for one learner than another.  This will test your understanding of the strengths and weaknesses of various learners.  The two learners you should aim your datasets at are:
 +
* A random tree learner with leaf_size = 1.
 +
* The LinRegLearner provided as part of the repo.
  
You must write your own code for Random Tree learning and bagging. You are NOT allowed to use other people's code to implement Random Trees or bagging.
+
Your data generation should use a random number generator as part of its data generation process. We will pass your generators a random number seed. Whenever the seed is the same you should return exactly the same data set. Different seeds should result in different data sets.
 
 
The project has two main components: The code for your learners and data generators, which will be auto graded, and your report, <tt>report.pdf</tt> that should include the components listed below.
 
 
 
Your learner should be able to handle any dimension in X from 1 to N.
 
 
 
==Reference Material==
 
 
 
'''Note: As of Sept 14, 2016, we are still adding to these materials.'''
 
 
 
The following materials are provided to give you information on what we want you to build:
 
 
 
* [https://www.youtube.com/watch?v=OBWL4oLT7Uc How to use a decision tree if you have one (Balch Youtube video)]
 
* [https://www.youtube.com/watch?v=OBWL4oLT7Uc How to build a decision tree & Random Trees (Balch Youtube video)] ('''not yet available''')
 
* [http://www.interfacesymposia.org/I01/I2001Proceedings/ACutler/ACutler.pdf paper on Random Trees by Adele Cutler]
 
* Balch slides on decision trees ('''not yet available''')
 
* You may be interested to take a look at Andrew Moore's slides on [http://www.autonlab.org/tutorials/mbl.html instance based learning].
 
* A definition of [http://mathworld.wolfram.com/StatisticalCorrelation.html correlation] which we'll use to assess the quality of the learning.
 
* [https://en.wikipedia.org/wiki/Bootstrap_aggregating Bootstrap Aggregating]
 
* [https://en.wikipedia.org/wiki/AdaBoost AdaBoost]
 
* [http://docs.scipy.org/doc/numpy/reference/generated/numpy.corrcoef.html numpy corrcoef]
 
* [http://docs.scipy.org/doc/numpy/reference/generated/numpy.argsort.html numpy argsort]
 
* [http://en.wikipedia.org/wiki/Root_mean_square RMS error]
 
 
 
You can use code like the below to instantiate several learners with the parameters listed in kwargs:
 
 
 
<pre>
 
learners = []
 
kwargs = {"k":10}
 
for i in range(0,bags):
 
    learners.append(learner(**kwargs))
 
</pre>
 
  
 
==Template and Data==
 
==Template and Data==
  
'''Note: As of Sept 14, 2016, the repo has not been updated to include the wine data. We will notify you when that is complete.  In the mean time you can use the provided data sets.'''
+
If necessary, update your version of the repo following the instructions here: [[http://quantsoftware.gatech.edu/ML4T_Software_Setup#Updating_the_repository]].  You will see the following files in your directory mc3_h1:
 
 
Instructions:
 
* Update your copy of the class' github repo.  We will send separate instructions by email on how to do that.
 
 
 
You will find these files in the mc3_p1 directory
 
 
 
* <tt>Data/</tt>: Contains data for you to test your learning code on.
 
* <tt>LinRegLearner.py</tt>: An implementation of the LinRegLearner class.  You can use it as a template for implementing your learner classes.
 
* <tt>__init__.py</tt>: Tells Python that you can import classes while in this directory.
 
* <tt>testlearner.py</tt>: Helper code to test a learner class.
 
 
 
In the Data/ directory there are five files:
 
* 3_groups.csv
 
* ripple_.csv
 
* simple.csv
 
* red_wine.csv
 
* white_wine.csv
 
 
 
We will mainly be working with the wine-based data.  The other files are there as alternative sets for you to test your code on. Each data file contains N+1 columns: X1, X2, ... XN, and Y.  When we test your code we will randomly select 60% of the data to train on and use the other 40% for testing.  However, as of this writing, the testlearner.py code uses the <b>first 60% of the data for training</b>, and the <b>remaining 40% for testing</b>.  That may be helpful because it will enable you to compare results with your friends on piazza.
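The two ways of splitting the data described above can be sketched like this. The function names are hypothetical, and the way the 60% cut point is computed is an assumption; testlearner.py may compute it differently.

```python
import numpy as np

def split_first_60(X, Y):
    """Use the first 60% of the rows for training and the remaining 40% for testing."""
    cut = int(0.6 * X.shape[0])
    return X[:cut], Y[:cut], X[cut:], Y[cut:]

def split_random_60(X, Y, rng):
    """Use a random 60% of the rows for training and the other 40% for testing.

    rng is assumed to be a numpy.random.RandomState instance.
    """
    perm = rng.permutation(X.shape[0])   # shuffled row indices
    cut = int(0.6 * X.shape[0])
    train, test = perm[:cut], perm[cut:]
    return X[train], Y[train], X[test], Y[test]
```

The first variant is deterministic, so results are comparable between students; the second is closer to what the grading description says will be used.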
 
 
 
==Part 1: Implement RTLearner (30%)==
 
 
 
You should implement a Random Tree learner class in the file <tt>RTLearner.py</tt>.  You should consult the [http://www.interfacesymposia.org/I01/I2001Proceedings/ACutler/ACutler.pdf paper by Adele Cutler] as a reference.  Note that for this part of the project, your code should only build a single tree (not a forest).  We'll get to forests later in the project.  The primary differences between Cutler's Random Tree and the methodology originally proposed by [https://wwwold.cs.umd.edu/class/fall2009/cmsc828r/PAPERS/fulltext_Quilan_Ashwin_Kumar.pdf JR Quinlan] are:
 
 
 
# The feature i to split on at each level is determined randomly.  It is not determined using information gain or correlation, etc.
 
# The split value for each node is determined by: Randomly selecting two samples of data and taking the mean of their Xi values.
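The two rules above can be sketched for a single node as follows. This is only an illustration of the split selection, not a full tree builder; the name <tt>choose_random_split</tt> is hypothetical, and whether the two sampled rows must be distinct is not specified here, so this sketch allows them to coincide.

```python
import numpy as np

def choose_random_split(X, rng):
    """Pick a split for one node of a random tree.

    X   : 2-D ndarray holding the rows that reached this node
    rng : numpy.random.RandomState (an assumption made for this sketch)
    Returns (feature index i, split value).
    """
    n_rows, n_features = X.shape
    # Rule 1: the feature is chosen at random, not by information gain or correlation
    feature = rng.randint(0, n_features)
    # Rule 2: the split value is the mean of the Xi values of two randomly chosen rows
    r1, r2 = rng.randint(0, n_rows, size=2)
    split_val = (X[r1, feature] + X[r2, feature]) / 2.0
    return feature, split_val
```

A practical tree builder also has to handle the case where the chosen split sends every row to the same side; that is left out of this sketch.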
 
 
 
Your code should support exactly the API defined below.  DO NOT import any modules besides those listed in the prohibited/allowed section below.  You should implement the following functions/methods:
 
 
 
import RTLearner as rt
 
learner = rt.RTLearner(leaf_size = 1, verbose = False) # constructor
 
learner.addEvidence(Xtrain, Ytrain) # training step
 
Y = learner.query(Xtest) # query
 
 
 
Where "leaf_size" is the maximum number of samples to be aggregated at a leaf. Xtrain and Xtest should be ndarrays (numpy objects) where each row represents an X1, X2, X3... XN set of feature values.  The columns are the features and the rows are the individual example instances.  Y and Ytrain are single dimension ndarrays that indicate the value we are attempting to predict with X.
 
  
If "verbose" is True, your code can print out information for debugging.  If verbose = False your code should not generate ANY output.  When we test your code, verbose will be False.
+
* <tt>gen_data.py</tt> An implementation of the code you are supposed to provide: It includes two functions that each return a data set.  Note that the data sets those functions return DO NOT satisfy the requirements for the homework.  But they do show you how you can generate a data set.
 +
* <tt>LinRegLearner.py</tt> Our friendly, working, correct, linear regression learner.  It is used by the testing code.
 +
* <tt>RTLearner.py</tt> A working, but INCORRECT, Random Tree learner.  Replace it with your working, correct RTLearner.
 +
* <tt>testbest4.py</tt> Code that calls the two data set generating functions and tests them against the two learners.
  
This code should not generate statistics or charts. You may modify testlearner.py to generate statistics and charts.
+
==Generate your own datasets==
  
==Part 2: Implement BagLearner (20%)==
+
Create a Python program called gen_data.py that implements two functions.  The two functions should be named as follows, and support the following API:
  
Implement Bootstrap Aggregating as a Python class named BagLearner. Your BagLearner class should be implemented in the file <tt>BagLearner.py</tt>.  It should support EXACTLY the API defined below.  You should implement the following functions/methods:
+
  X1, Y1 = best4LinReg(seed = 5)
+
  X2, Y2 = best4RT(seed = 5)
import BagLearner as bl
 
learner = bl.BagLearner(learner = rt.RTLearner, kwargs = {"leaf_size":1}, bags = 20, boost = False, verbose = False)
 
  learner.addEvidence(Xtrain, Ytrain)
 
Y = learner.query(Xtest)
 
  
Where learner is the learning class to use with bagging.  kwargs are keyword arguments to be passed on to the learner's constructor and they vary according to the learner (see hints below).  "bags" is the number of learners you should train using Bootstrap Aggregation.  If boost is true, then you should implement boosting.
+
* '''seed''' Your data generation should use a random number generator as part of its data generation process.  We will pass your generators a random number seed.  Whenever the seed is the same you should return exactly the same data set.  Different seeds should result in different data sets.
  
If verbose is True, your code can generate output.  Otherwise it should be silent.
+
best4LinReg() should return data that performs significantly better (see rubric) with LinRegLearner than RTLearner.  best4RT() should return data that performs significantly better with RTLearner than LinRegLearner.
  
Notes:  See hints section below for example code you might use to instantiate your learners.  Boosting is an optional topic and not required.  There's a citation below in the Resources section that outlines a method of implementing bagging. If the training set contains n data items, each bag should contain n items as well.  Note that because you should sample with replacement, some of the data items will be repeated.
+
Each data set should include from 2 to 1000 columns in X, and one column in Y.  The data should contain from 10 (minimum) to 1000 (maximum) rows.
 
 
This code should not generate statistics or charts. If you want to create charts and statistics, modify testlearner.py for that purpose.
 
 
 
==Part 3: Generate your own datasets (20%)==
 
 
 
* Create your own dataset generating code (call it <tt>best4linreg.py</tt>) that creates data that performs significantly better with LinRegLearner than KNNLearner.  Explain your data generating algorithm, and explain why LinRegLearner performs better.  Your data should include at least 2 dimensions in X, and at least 1000 points. (Don't use bagging for this section).
 
* Create your own dataset generating code (call it <tt>best4KNN.py</tt>) that creates data that performs significantly better with KNNLearner than LinRegLearner.  Explain your data generating algorithm, and explain why KNNLearner performs better.  Your data should include at least 2 dimensions in X, and at least 1000 points. (Don't use bagging for this section).
 
 
 
==Part 4: Experiments and report (30%)==
 
 
 
Create a report that addresses the following issues/questions.  Use 11pt font and single spaced lines. We expect that a complete report addressing all the criteria would be at least 4 pages. It should be no longer than 10 pages including charts, tables and text. To encourage conciseness we will deduct 2% for each page over 10 pages. The report should be submitted as <tt>report.pdf</tt> in PDF format.  Do not submit Word docs or LaTeX files.  Include data as tables or charts to support each of your answers.
 
 
 
* Include charts or tables of data to support your results.  However your submitted code should not generate statistics or charts. Modify testlearner.py to generate statistics and charts.
 
* Consider the dataset <tt>ripple</tt> with KNN.  For which values of K does overfitting occur? (Don't use bagging).
 
* Now use bagging in conjunction with KNN with the <tt>ripple</tt> dataset.  Choose some K and keep it fixed. How does performance vary as you increase the number of bags?  Does overfitting occur with respect to the number of bags?
 
* Can bagging reduce or eliminate overfitting with respect to K for the <tt>ripple</tt> dataset?  Fix the number of bags and vary K.
 
  
 
==What to turn in==
 
==What to turn in==
Line 136: Line 40:
 
Via T-Square, submit as attachment (no zip files; refer to schedule for deadline):
 
Via T-Square, submit as attachment (no zip files; refer to schedule for deadline):
  
* Your code as <tt>KNNLearner.py, BagLearner.py</tt>, <tt>best4linreg.py</tt>, <tt>best4KNN.py</tt>
+
* Your code as <tt>gen_data.py</tt>
* Your report as <tt>report.pdf</tt>
 
  
 
Unlimited resubmissions are allowed up to the deadline for the project.
 
Unlimited resubmissions are allowed up to the deadline for the project.
 
==Extra Credit (3%)==
 
 
Implement boosting as part of BagLearner.  How does boosting affect performance for <tt>ripple</tt> and <tt>3_groups</tt> data?
 
 
Does overfitting occur for either of these datasets as the number of bags with boosting increases?
 
 
Create your own dataset for which overfitting occurs as the number of bags with boosting increases.
 
 
Describe and assess your boosting code in a separate <tt>report.pdf</tt>.  Your report should focus only on boosting.  It should be submitted separately to the "extra credit" assignment on t-square.
 
  
 
==Rubric==
 
==Rubric==
  
* KNNLearner, auto grade 10 test cases (including ripple.csv and 3_groups.csv), 3 points each: 30 points
+
Deductions:
* BagLearner, auto grade 10 test cases (including ripple.csv and 3_groups.csv), 2 points each: 20 points
+
* Does either dataset returned contain fewer or more than the allowed number of samples? -20% each.
* best4linreg.py
+
* Does either dataset returned contain fewer or more than the allowed number of dimensions in X? -20% each.
** Code submitted: -5 if absent
+
* When the seed is the same does the best4LinReg dataset generator return the same data? -20% otherwise.
** Description complete -- Sufficient that someone else could implement it: -5 if not
+
* When the seed is the same does the best4RT dataset generator return the same data? -20% otherwise.
** Description compelling -- The reasoning that linreg should do better is understandable and makes sense. Graph of the data helps but is not required if the description is otherwise compelling: -5 if not
+
* When the seed is different does the best4LinReg dataset generator return different data? -20% otherwise.
** Train and test data drawn from same distribution: -5 if not
+
* When the seed is different does the best4RT dataset generator return different data? -20% otherwise.
** Performance demonstrates that linreg does better: -10 if not
 
* best4KNN.py
 
** Code submitted: -5 if absent
 
** Description complete -- Sufficient that someone else could implement it: -5 if not
 
** Description compelling -- The reasoning that KNN should do better is understandable and makes sense. Graph of the data helps but is not required if the description is otherwise compelling: -5 if not
 
** Train and test data drawn from same distribution: -5 if not
 
** Performance demonstrates that KNN does better: -10 if not
 
* Overfitting
 
** Student conveys a correct understanding of overfitting in the report?: -5 points if not.
 
** Is the region of overfitting correctly identified? -5 points if not.
 
** Is the conclusion supported with data (table or chart): -5 points if not.
 
* Bagging
 
** Correct conclusion regarding overfitting as bags increase, supported with tables or charts: -10 points if not.
 
** Correct conclusion regarding overfitting as K increases, supported with tables or charts: -10 points if not.
 
 
 
==Test Cases==
 
 
 
Here are the test cases we used while grading. These are updated each semester, and released after grading.
 
  
* [[MC3-Project-1-Test-Cases-spr2016]]
+
For best4LinReg:
 +
* We will run 15 test cases and select the best 10.  For each successful test +5 points (total of 50%)
 +
* For each test case we will randomly select 60% of the data for training and 40% for testing.
 +
* Success for each case is defined as: RMSE LinReg < RMSE RT * 0.9
  
 +
For best4RT:
 +
* We will run 15 test cases and select the best 10.  For each successful test +5 points (total of 50%)
 +
* For each test case we will randomly select 60% of the data for training and 40% for testing.
 +
* Success for each case is defined as: RMSE RT < RMSE LinReg * 0.9
  
 
==Required, Allowed & Prohibited==
 
==Required, Allowed & Prohibited==
  
 
Required:
 
Required:
 +
* No reading of data from files.
 
* Your project must be coded in Python 2.7.x.
 
* Your project must be coded in Python 2.7.x.
 
* Your code must run on one of the university-provided computers (e.g. buffet02.cc.gatech.edu), or on one of the provided virtual images.
 
* Your code must run on one of the university-provided computers (e.g. buffet02.cc.gatech.edu), or on one of the provided virtual images.
 
* Your code must run in less than 5 seconds on one of the university-provided computers.
 
* Your code must run in less than 5 seconds on one of the university-provided computers.
* The code you submit should NOT include any data reading routines.  The provided testlearner.py code reads data for you.
+
* The code you submit should NOT include any data reading routines.  You should generate all of your data within your functions.
 
* The code you submit should NOT generate any output: No prints, no charts, etc.
 
* The code you submit should NOT generate any output: No prints, no charts, etc.
  
Line 200: Line 83:
  
 
Prohibited:
 
Prohibited:
* Any other method of reading data besides testlearner.py
+
* Any reading of data files.
 
* Any libraries not listed in the "allowed" section above.
 
* Any libraries not listed in the "allowed" section above.
 
* Any code you did not write yourself (except for the 5 line rule in the "allowed" section).
 
* Any code you did not write yourself (except for the 5 line rule in the "allowed" section).
Line 206: Line 89:
 
* Code that includes any data reading routines.  The provided testlearner.py code reads data for you.
 
* Code that includes any data reading routines.  The provided testlearner.py code reads data for you.
 
* Code that generates any output when verbose = False: No prints, no charts, etc.
 
* Code that generates any output when verbose = False: No prints, no charts, etc.
 +
* Ducks and wood.
  
 
==Legacy==
 
==Legacy==
  
 
[[MC3-Homework-1-legacy]]
 
[[MC3-Homework-1-legacy]]

Latest revision as of 09:34, 13 June 2017

Updates / FAQs

  • March 15, 2017 API revised to include random number seed.
  • March 20, 2017 Homework description finalized and released.

Overview

For this homework you will generate data that you believe will work better for one learner than another. This will test your understanding of the strengths and weaknesses of various learners. The two learners you should aim your datasets at are:

  • A random tree learner with leaf_size = 1.
  • The LinRegLearner provided as part of the repo.

Your data generation should use a random number generator as part of its data generation process. We will pass your generators a random number seed. Whenever the seed is the same you should return exactly the same data set. Different seeds should result in different data sets.

Template and Data

If necessary, update your version of the repo following the instructions here: http://quantsoftware.gatech.edu/ML4T_Software_Setup#Updating_the_repository. You will see the following files in your directory mc3_h1:

  • gen_data.py An implementation of the code you are supposed to provide: It includes two functions that each return a data set. Note that the data sets those functions return DO NOT satisfy the requirements for the homework. But they do show you how you can generate a data set.
  • LinRegLearner.py Our friendly, working, correct, linear regression learner. It is used by the testing code.
  • RTLearner.py A working, but INCORRECT, Random Tree learner. Replace it with your working, correct RTLearner.
  • testbest4.py Code that calls the two data set generating functions and tests them against the two learners.

Generate your own datasets

Create a Python program called gen_data.py that implements two functions. The two functions should be named as follows, and support the following API:

X1, Y1 = best4LinReg(seed = 5)
X2, Y2 = best4RT(seed = 5)
  • seed Your data generation should use a random number generator as part of its data generation process. We will pass your generators a random number seed. Whenever the seed is the same you should return exactly the same data set. Different seeds should result in different data sets.

best4LinReg() should return data that performs significantly better (see rubric) with LinRegLearner than RTLearner. best4RT() should return data that performs significantly better with RTLearner than LinRegLearner.

Each data set should include from 2 to 1000 columns in X, and one column in Y. The data should contain from 10 (minimum) to 1000 (maximum) rows.
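
As a rough illustration of the API and size limits above, a gen_data.py could be laid out like the sketch below. The specific target functions (an exact linear function, and a sign pattern that no single line can fit) are illustrative assumptions, as are the row and column counts; nothing here has been checked against the rubric's 0.9 RMSE threshold, so treat it as a starting shape rather than a solution.

```python
import numpy as np

def best4LinReg(seed=5):
    # A dedicated RandomState makes the output depend only on the seed:
    # same seed -> identical arrays, different seed -> different arrays.
    rng = np.random.RandomState(seed)
    # 200 rows and 2 columns, inside the allowed 10..1000 rows and 2..1000 columns
    X = rng.uniform(-10.0, 10.0, size=(200, 2))
    # Assumption for illustration: Y is an exact linear function of X
    Y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + 1.0
    return X, Y

def best4RT(seed=5):
    rng = np.random.RandomState(seed)
    X = rng.uniform(-10.0, 10.0, size=(200, 2))
    # Assumption for illustration: Y depends on the sign of X1 * X2,
    # a checkerboard-like pattern that a single linear fit cannot represent
    Y = np.where(X[:, 0] * X[:, 1] > 0, 1.0, -1.0)
    return X, Y
```

Both functions return X as a 2-D ndarray and Y as a 1-D ndarray, read no files, and print nothing.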

What to turn in

Be sure to follow these instructions diligently!

Via T-Square, submit as attachment (no zip files; refer to schedule for deadline):

  • Your code as gen_data.py

Unlimited resubmissions are allowed up to the deadline for the project.

Rubric

Deductions:

  • Does either dataset returned contain fewer or more than the allowed number of samples? -20% each.
  • Does either dataset returned contain fewer or more than the allowed number of dimensions in X? -20% each.
  • When the seed is the same does the best4LinReg dataset generator return the same data? -20% otherwise.
  • When the seed is the same does the best4RT dataset generator return the same data? -20% otherwise.
  • When the seed is different does the best4LinReg dataset generator return different data? -20% otherwise.
  • When the seed is different does the best4RT dataset generator return different data? -20% otherwise.
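
Before submitting, the deductions above can be checked locally with something like the following sketch. The helper name check_generator and the toy generator are hypothetical, and "different data" is interpreted here as "the returned arrays are not identical", which is an assumption about how the grader compares them.

```python
import numpy as np

def check_generator(gen, seed_a=5, seed_b=6):
    """Return a list of rubric problems found for one generator (empty if none)."""
    problems = []
    X, Y = gen(seed=seed_a)
    X_same, Y_same = gen(seed=seed_a)
    X_diff, Y_diff = gen(seed=seed_b)
    # allowed number of samples: 10 to 1000 rows
    if not (10 <= X.shape[0] <= 1000):
        problems.append("rows")
    # allowed number of dimensions in X: 2 to 1000 columns
    if X.ndim != 2 or not (2 <= X.shape[1] <= 1000):
        problems.append("columns")
    # the same seed must reproduce exactly the same data
    if not (np.array_equal(X, X_same) and np.array_equal(Y, Y_same)):
        problems.append("same seed, different data")
    # a different seed must produce different data
    if np.array_equal(X, X_diff) and np.array_equal(Y, Y_diff):
        problems.append("different seed, same data")
    return problems

def toy_generator(seed=5):
    # Hypothetical stand-in for best4LinReg or best4RT
    rng = np.random.RandomState(seed)
    X = rng.rand(20, 3)
    return X, X.sum(axis=1)
```

Calling check_generator(toy_generator) returns an empty list; passing best4LinReg or best4RT from your own gen_data.py works the same way.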

For best4LinReg:

  • We will run 15 test cases and select the best 10. For each successful test +5 points (total of 50%)
  • For each test case we will randomly select 60% of the data for training and 40% for testing.
  • Success for each case is defined as: RMSE LinReg < RMSE RT * 0.9

For best4RT:

  • We will run 15 test cases and select the best 10. For each successful test +5 points (total of 50%)
  • For each test case we will randomly select 60% of the data for training and 40% for testing.
  • Success for each case is defined as: RMSE RT < RMSE LinReg * 0.9
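
The success criterion and point totals for both generators can be written out as a short sketch. The function names are hypothetical, and reading "run 15 test cases and select the best 10" as "at most 10 successes count, at 5 points each" is an interpretation rather than something the rubric states explicitly.

```python
import math

def rmse(predicted, actual):
    """Root mean squared error between two equal-length sequences."""
    n = len(actual)
    return math.sqrt(sum((p - a) ** 2 for p, a in zip(predicted, actual)) / float(n))

def is_success(rmse_target, rmse_other, margin=0.9):
    # The targeted learner must beat the other learner by more than 10%,
    # e.g. for best4LinReg: RMSE LinReg < RMSE RT * 0.9
    return rmse_target < rmse_other * margin

def score(successes, best_of=10, points_each=5):
    # successes: one boolean per test case (15 per generator in the rubric)
    return points_each * min(best_of, sum(1 for s in successes if s))
```

For example, a generator that succeeds on 12 of 15 cases would score 50 under this reading, and one that succeeds on 7 would score 35.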

Required, Allowed & Prohibited

Required:

  • No reading of data from files.
  • Your project must be coded in Python 2.7.x.
  • Your code must run on one of the university-provided computers (e.g. buffet02.cc.gatech.edu), or on one of the provided virtual images.
  • Your code must run in less than 5 seconds on one of the university-provided computers.
  • The code you submit should NOT include any data reading routines. You should generate all of your data within your functions.
  • The code you submit should NOT generate any output: No prints, no charts, etc.

Allowed:

  • You can develop your code on your personal machine, but it must also run successfully on one of the university provided machines or virtual images.
  • Your code may use standard Python libraries.
  • You may use the NumPy, SciPy, matplotlib and Pandas libraries. Be sure you are using the correct versions.
  • You may reuse sections of code (up to 5 lines) that you collected from other students or the internet.
  • Code provided by the instructor, or allowed by the instructor to be shared.
  • Cheese.

Prohibited:

  • Any reading of data files.
  • Any libraries not listed in the "allowed" section above.
  • Any code you did not write yourself (except for the 5 line rule in the "allowed" section).
  • Any Classes (other than Random) that create their own instance variables for later use (e.g., learners like kdtree).
  • Code that includes any data reading routines. The provided testlearner.py code reads data for you.
  • Code that generates any output when verbose = False: No prints, no charts, etc.
  • Ducks and wood.

Legacy

MC3-Homework-1-legacy