Defeat learners

Latest revision as of 22:17, 24 September 2018

Updates

2018-09-24

  • Do not submit your learner or LinRegLearner, or any learner for that matter.
  • Do not import any learner into your code either.

Overview

For this homework you will generate data that you believe will work better for one learner than another. This will test your understanding of the strengths and weaknesses of various learners. The two learners you should aim your datasets at are:

  • A decision tree learner with leaf_size = 1 (DTLearner). Note that for testing purposes we will use our implementation of DTLearner.
  • The LinRegLearner provided as part of the repo.

Your data generation should use a random number generator as part of its data generation process. We will pass your generators a random number seed. Whenever the seed is the same you should return exactly the same data set. Different seeds should result in different data sets.

Template and Data

Instructions:

  • Download the appropriate zip file: 18fall_defeat_learners.zip
  • You should see the following files and directory:
    • defeat_learners/ the assignment directory
    • defeat_learners/gen_data.py An implementation of the code you are supposed to provide: It includes two functions that return a data set, and a third function that returns a user ID. Note that the data sets those functions return DO NOT satisfy the requirements for the homework. But they do show you how you can generate a data set.
    • defeat_learners/LinRegLearner.py Our friendly, working, correct, linear regression learner. It is used by the grading script. Do not rely on local changes you make to this file, as you may only submit gen_data.py.
    • defeat_learners/DTLearner.py A working, but INCORRECT, Decision Tree learner. Replace it with your working, correct DTLearner.
    • defeat_learners/testbest4.py Code that calls the two data set generating functions and tests them against the two learners. Useful for debugging.
    • defeat_learners/grade_best4.py The grading script; for more details see here: ML4T_Software_Setup#Running_the_grading_scripts

Generate your own datasets

Create a Python program called gen_data.py that implements two functions. The two functions should be named as follows, and support the following API:

X1, Y1 = best4LinReg(seed = 5)
X2, Y2 = best4DT(seed = 5)
  • seed: Your data generation should use a random number generator as part of its data generation process. We will pass your generators a random number seed. Whenever the seed is the same, you should return exactly the same data set. Different seeds should result in different data sets.

best4LinReg() should return data that performs significantly better (see rubric) with LinRegLearner than DTLearner. best4DT() should return data that performs significantly better with DTLearner than LinRegLearner.

Each data set should include from 2 to 10 columns in X, and one column in Y. The data should contain from 10 (minimum) to 1000 (maximum) rows.
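The API and size limits above can be illustrated with a minimal sketch. The function names and the seed argument come from the assignment; the row counts, column counts, weights and the two target functions are assumptions chosen for illustration, and the sketch has not been run against the grading script, so it should not be read as a known-passing solution.

```python
import numpy as np


def best4LinReg(seed=5):
    # Seeding first makes the output a pure function of the seed.
    np.random.seed(seed)
    # 100 rows and 4 columns fall inside the allowed 10-1000 rows and 2-10 columns.
    X = np.random.uniform(-10.0, 10.0, size=(100, 4))
    # Assumed design: Y is an exact linear combination of the columns of X.
    weights = np.array([1.0, -2.0, 3.0, 0.5])
    Y = X.dot(weights)
    return X, Y


def best4DT(seed=5):
    np.random.seed(seed)
    # 500 rows and 2 columns, again inside the allowed ranges.
    X = np.random.uniform(-10.0, 10.0, size=(500, 2))
    # Assumed design: Y is the squared distance from the origin,
    # a relationship that is not linear in the columns of X.
    Y = (X ** 2).sum(axis=1)
    return X, Y
```

Seeding the generator at the top of each function is what gives identical data for identical seeds and different data for different seeds.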

Implement the author() function

Update the author() function to use your own user ID.
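As a small illustration, the function might look like the sketch below. The user ID shown is a placeholder, not a real account, and the exact form expected by the template is an assumption.

```python
def author():
    # 'gburdell3' is a placeholder; replace it with your own user ID.
    return 'gburdell3'
```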

What to turn in

Be sure to follow these instructions diligently!

Via Canvas, submit as attachment (no zip files; refer to schedule for deadline):

  • Your code as gen_data.py

We WILL NOT use your DTLearner or LinRegLearner, so do not submit them.

Unlimited resubmissions are allowed up to the deadline for the project.

Rubric

Deductions:

  • Does either dataset returned contain fewer or more than the allowed number of samples? -20 points each.
  • Does either dataset returned contain fewer or more than the allowed number of dimensions in X? -20 points each.
  • When the seed is the same does the best4LinReg dataset generator return the same data? -20 points otherwise.
  • When the seed is the same does the best4DT dataset generator return the same data? -20 points otherwise.
  • When the seed is different does the best4LinReg dataset generator return different data? -20 points otherwise.
  • When the seed is different does the best4DT dataset generator return different data? -20 points otherwise.
  • Is the author() function implemented? -10 points if not.
  • Does the code attempt to import a learner? -10 points if so.
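Most of the deductions above can be checked locally before submitting. The sketch below is one possible self-check, not part of the provided grading script; check_generator and demo_gen are hypothetical names, and the two seeds used are arbitrary.

```python
import numpy as np


def check_generator(gen, seed_a=5, seed_b=6):
    """Return a list of rubric problems found for one generator function."""
    problems = []
    X, Y = gen(seed=seed_a)
    X_same, Y_same = gen(seed=seed_a)
    X_diff, Y_diff = gen(seed=seed_b)

    # Size limits from the assignment text: 10-1000 rows, 2-10 columns in X.
    if not 10 <= X.shape[0] <= 1000:
        problems.append("row count outside 10-1000")
    if X.ndim != 2 or not 2 <= X.shape[1] <= 10:
        problems.append("X column count outside 2-10")
    if Y.shape[0] != X.shape[0]:
        problems.append("Y does not have one value per row of X")

    # Same seed must reproduce the data; a different seed must change it.
    if not (np.array_equal(X, X_same) and np.array_equal(Y, Y_same)):
        problems.append("same seed returned different data")
    if np.array_equal(X, X_diff) and np.array_equal(Y, Y_diff):
        problems.append("different seeds returned identical data")
    return problems


def demo_gen(seed=5):
    # Hypothetical stand-in for best4LinReg or best4DT.
    np.random.seed(seed)
    X = np.random.random(size=(50, 3))
    Y = X.sum(axis=1)
    return X, Y
```

Calling check_generator(best4LinReg) and check_generator(best4DT) from a separate script keeps the check out of gen_data.py, which should produce no output of its own.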

For best4LinReg (1 test case):

  • We will call best4LinReg 15 times, and select the 10 best datasets. For each successful test +5 points (total of 50 points).
  • For each test case we will randomly select 60% of the data for training and 40% for testing.
  • Success for each case is defined as: RMSE LinReg < RMSE DT * 0.9
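The split and the success test described above can be sketched as follows. This is an illustration of the stated rule, not the grading script itself; the helper names and the way rows are shuffled are assumptions.

```python
import numpy as np


def split_60_40(X, Y, seed=0):
    """Shuffle the rows and return (train_X, train_Y, test_X, test_Y)."""
    n_rows = X.shape[0]
    order = np.random.RandomState(seed).permutation(n_rows)
    cut = (n_rows * 6) // 10  # first 60% of the shuffled rows are for training
    train, test = order[:cut], order[cut:]
    return X[train], Y[train], X[test], Y[test]


def rmse(predicted, actual):
    """Root mean squared error between two equal-length sequences."""
    diff = np.asarray(predicted, dtype=float) - np.asarray(actual, dtype=float)
    return np.sqrt(np.mean(diff ** 2))


def linreg_wins(rmse_linreg, rmse_dt):
    # Success rule for best4LinReg: LinReg error is below 90% of the DT error.
    return rmse_linreg < rmse_dt * 0.9
```

In this reading, each learner is trained on the 60% portion, rmse is computed on the 40% portion, and the two values are passed to linreg_wins.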

For best4DT (1 test case):

  • We will call best4DT 15 times, and select the 10 best datasets. For each successful test +5 points (total of 50 points).
  • For each test case we will randomly select 60% of the data for training and 40% for testing.
  • Success for each case is defined as: RMSE DT < RMSE LinReg * 0.9
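The point totals for both generators can be worked through with a short sketch. It assumes that "select the 10 best" means at most 10 successful calls out of 15 earn points; that reading, and the function names, are assumptions rather than a description of the grading script.

```python
def linreg_success(rmse_linreg, rmse_dt):
    # best4LinReg rule: RMSE LinReg < RMSE DT * 0.9
    return rmse_linreg < rmse_dt * 0.9


def dt_success(rmse_linreg, rmse_dt):
    # best4DT rule: RMSE DT < RMSE LinReg * 0.9
    return rmse_dt < rmse_linreg * 0.9


def tally_points(rmse_pairs, is_success, keep=10, points_each=5):
    """rmse_pairs holds one (rmse_linreg, rmse_dt) tuple per generator call."""
    successes = sum(1 for pair in rmse_pairs if is_success(*pair))
    # Only the best `keep` calls can score, so successes are capped at `keep`.
    return points_each * min(keep, successes)
```

Under this reading, a generator can fail on up to 5 of its 15 calls and still earn the full 50 points.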

Required, Allowed & Prohibited

Required:

  • No reading of data from files.
  • Your project must be coded in Python 2.7.x.
  • Your code must run on one of the university-provided computers (e.g. buffet02.cc.gatech.edu), or on one of the provided virtual images.
  • Your code must run in less than 5 seconds on one of the university-provided computers.
  • The code you submit should NOT include any data reading routines. You should generate all of your data within your functions.
  • The code you submit should NOT generate any output: No prints, no charts, etc.
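The run-time and no-output requirements above can be checked locally with a sketch like the one below. run_quietly is a hypothetical helper; it captures only text written to standard output (not charts), and whether the 5-second limit applies per call or to the whole run is not stated here, so treat the timing as a rough guide.

```python
import sys
import time

try:
    from StringIO import StringIO  # Python 2.7
except ImportError:
    from io import StringIO  # Python 3


def run_quietly(func, *args, **kwargs):
    """Call func and return (result, seconds_elapsed, captured_stdout)."""
    captured = StringIO()
    original_stdout = sys.stdout
    sys.stdout = captured  # anything printed by func lands in `captured`
    start = time.time()
    try:
        result = func(*args, **kwargs)
    finally:
        elapsed = time.time() - start
        sys.stdout = original_stdout  # always restore the real stdout
    return result, elapsed, captured.getvalue()
```

For example, run_quietly(best4DT, seed=5) should return an empty captured string and an elapsed time well under the limit.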

Allowed:

  • You can develop your code on your personal machine, but it must also run successfully on one of the university provided machines or virtual images.
  • Your code may use standard Python libraries.
  • You may use the NumPy, SciPy, matplotlib and Pandas libraries. Be sure you are using the correct versions.
  • Code provided by the instructor, or allowed by the instructor to be shared.
  • Cheese.

Prohibited:

  • Any reading of data files.
  • Any libraries not listed in the "allowed" section above.
  • Any code you did not write yourself.
  • Any Classes (other than Random) that create their own instance variables for later use (e.g., learners like kdtree).
  • Code that includes any data reading routines. The provided testlearner.py code reads data for you.
  • Code that generates any output when verbose = False: No prints, no charts, etc.
  • Ducks and wood.

Legacy

MC3-Homework-1-legacy