MC3-Homework-1


Updates / FAQs

  • March 15, 2017 API revised to include random number seed.
  • March 20, 2017 Homework description finalized and released.

Overview

For this homework you will generate data that you believe will work better for one learner than another. This will test your understanding of the strengths and weaknesses of various learners. The two learners you should aim your datasets at are:

  • A random tree learner with leaf_size = 1.
  • The LinRegLearner provided as part of the repo.

Your data generation should use a random number generator as part of its data generation process. We will pass your generators a random number seed. Whenever the seed is the same you should return exactly the same data set. Different seeds should result in different data sets.

Template and Data

If necessary, update your version of the repo following the instructions here: http://quantsoftware.gatech.edu/ML4T_Software_Setup#Updating_the_repository. You will see the following files in your directory mc3_h1:

  • gen_data.py An implementation of the code you are supposed to provide: it includes two functions that each return a data set. Note that the data sets those functions return DO NOT satisfy the requirements for the homework, but they do show how you can generate a data set.
  • LinRegLearner.py Our friendly, working, correct, linear regression learner. It is used by the testing code.
  • RTLearner.py A working, but INCORRECT, Random Tree learner. Replace it with your working, correct RTLearner.
  • testbest4.py Code that calls the two data set generating functions and tests them against the two learners.

Generate your own datasets

Create a Python program called gen_data.py that implements two functions. The two functions should be named as follows, and support the following API:

X1, Y1 = best4LinReg(seed = 5)
X2, Y2 = best4RT(seed = 5)
  • seed Your data generation should use a random number generator as part of its data generation process. We will pass your generators a random number seed. Whenever the seed is the same you should return exactly the same data set. Different seeds should result in different data sets.

best4LinReg() should return data that performs significantly better (see rubric) with LinRegLearner than RTLearner. best4RT() should return data that performs significantly better with RTLearner than LinRegLearner.

Each data set should include from 2 to 1000 columns in X, and one column in Y. The data should contain from 10 (minimum) to 1000 (maximum) rows.
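
To make the API concrete, here is a minimal sketch of what gen_data.py could look like. The particular functional forms (an exact linear combination of the X columns for best4LinReg, a piecewise sign target for best4RT) and the array sizes are illustrative choices, not the required solution; any generator that satisfies the constraints above is acceptable.

import numpy as np

def best4LinReg(seed=5):
    np.random.seed(seed)                              # same seed -> same data set
    X = np.random.uniform(-100, 100, size=(500, 3))   # 500 rows, 3 columns in X
    # Y is an exact linear combination of X, which LinRegLearner fits globally
    # but a leaf_size=1 random tree can only approximate piecewise.
    Y = 2.0 * X[:, 0] - 3.0 * X[:, 1] + 0.5 * X[:, 2]
    return X, Y

def best4RT(seed=5):
    np.random.seed(seed)
    X = np.random.uniform(-1, 1, size=(500, 2))
    # Y depends on X in a strongly non-linear way (the sign of a product),
    # which a random tree can carve into regions but a single line cannot fit.
    Y = np.where(X[:, 0] * X[:, 1] > 0, 1.0, -1.0)
    return X, Y

Seeding the random number generator at the top of each function is what guarantees that the same seed returns exactly the same data set, while different seeds produce different data.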

What to turn in

Be sure to follow these instructions diligently!

Via T-Square, submit as attachment (no zip files; refer to schedule for deadline):

  • Your code as gen_data.py

Unlimited resubmissions are allowed up to the deadline for the project.

Rubric

Deductions:

  • Does either returned dataset contain fewer or more than the allowed number of samples? -20% each.
  • Does either returned dataset contain fewer or more than the allowed number of dimensions in X? -20% each.
  • When the seed is the same, does the best4LinReg dataset generator return the same data? -20% otherwise.
  • When the seed is the same, does the best4RT dataset generator return the same data? -20% otherwise.
  • When the seed is different, does the best4LinReg dataset generator return different data? -20% otherwise.
  • When the seed is different, does the best4RT dataset generator return different data? -20% otherwise.
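
As a quick local sanity check against the deductions above, you might run something like the following (assuming the gen_data.py sketch described earlier). This is for your own testing only; remember that submitted code must not print or read files.

import numpy as np
from gen_data import best4LinReg, best4RT

# Local sanity checks mirroring the deduction list above (do not submit this).
for gen in (best4LinReg, best4RT):
    X, Y = gen(seed=1)
    assert 10 <= X.shape[0] <= 1000            # allowed number of rows
    assert 2 <= X.shape[1] <= 1000             # allowed number of columns in X
    assert Y.shape[0] == X.shape[0]            # one Y value per row

    X_same, Y_same = gen(seed=1)               # same seed -> identical data
    assert np.array_equal(X, X_same) and np.array_equal(Y, Y_same)

    X_diff, _ = gen(seed=2)                    # different seed -> different data
    assert not np.array_equal(X, X_diff)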

For best4LinReg:

  • We will run 15 test cases and select the best 10. For each successful test, +5 points (total of 50%).
  • For each test case we will randomly select 60% of the data for training and 40% for testing.
  • Success for each case is defined as: RMSE LinReg < RMSE RT * 0.9

For best4RT:

  • We will run 15 test cases and select the best 10. For each successful test, +5 points (total of 50%).
  • For each test case we will randomly select 60% of the data for training and 40% for testing.
  • Success for each case is defined as: RMSE RT < RMSE LinReg * 0.9
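
For reference, a single trial of this grading procedure can be approximated locally along the following lines. The sketch assumes the course's usual addEvidence()/query() learner interface and an RTLearner constructor that accepts leaf_size; the actual testbest4.py may differ in its details.

import math
import numpy as np

import LinRegLearner
import RTLearner
from gen_data import best4LinReg

def rmse(pred, actual):
    return math.sqrt(((actual - pred) ** 2).mean())

X, Y = best4LinReg(seed=5)
perm = np.random.permutation(X.shape[0])       # random 60% train / 40% test split
cut = int(0.6 * X.shape[0])
train, test = perm[:cut], perm[cut:]

lin = LinRegLearner.LinRegLearner()
rt = RTLearner.RTLearner(leaf_size=1)
lin.addEvidence(X[train], Y[train])
rt.addEvidence(X[train], Y[train])

# Success criterion for best4LinReg: RMSE LinReg < RMSE RT * 0.9
success = rmse(lin.query(X[test]), Y[test]) < 0.9 * rmse(rt.query(X[test]), Y[test])

The analogous check for best4RT simply reverses the inequality.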

Required, Allowed & Prohibited

Required:

  • No reading of data from files.
  • Your project must be coded in Python 2.7.x.
  • Your code must run on one of the university-provided computers (e.g. buffet02.cc.gatech.edu), or on one of the provided virtual images.
  • Your code must run in less than 5 seconds on one of the university-provided computers.
  • The code you submit should NOT include any data reading routines. You should generate all of your data within your functions.
  • The code you submit should NOT generate any output: No prints, no charts, etc.

Allowed:

  • You can develop your code on your personal machine, but it must also run successfully on one of the university-provided machines or virtual images.
  • Your code may use standard Python libraries.
  • You may use the NumPy, SciPy, matplotlib and Pandas libraries. Be sure you are using the correct versions.
  • You may reuse sections of code (up to 5 lines) that you collected from other students or the internet.
  • Code provided by the instructor, or allowed by the instructor to be shared.
  • Cheese.

Prohibited:

  • Any reading of data files.
  • Any libraries not listed in the "allowed" section above.
  • Any code you did not write yourself (except for the 5 line rule in the "allowed" section).
  • Any Classes (other than Random) that create their own instance variables for later use (e.g., learners like kdtree).
  • Code that includes any data reading routines. The provided testlearner.py code reads data for you.
  • Code that generates any output when verbose = False: No prints, no charts, etc.
  • Ducks and wood.

Legacy

MC3-Homework-1-legacy