Spring 2020 Project 4: Defeat Learners

From Quantitative Analysis Software Courses
Revision as of 22:03, 12 January 2020

Revisions

This assignment is subject to change up until 3 weeks prior to the due date. We do not anticipate changes; any changes will be logged in this section.

Overview

For this homework you will generate data that you believe will work better for one learner than another. This will test your understanding of the strengths and weaknesses of various learners. The two learners you should aim your datasets at are:

  • A decision tree learner with leaf_size = 1 (DTLearner). Note that for testing purposes we will use our implementation of DTLearner.
  • The LinRegLearner provided as part of the repo.

Your data generation should use a random number generator as part of its data generation process. We will pass your generators a random number seed. Whenever the seed is the same you should return exactly the same data set. Different seeds should result in different data sets.

Template

Instructions:

  • Download the appropriate zip file: File:20Spring defeat learners.zip
  • You should see the following directory and files:
    • defeat_learners/ the assignment directory
    • defeat_learners/gen_data.py A sample implementation of the code you are supposed to provide. It includes two functions that each return a data set, and a third function that returns a user ID. Note that the data sets those functions return DO NOT satisfy the requirements for the homework, but they do show how you can generate a data set.
    • defeat_learners/LinRegLearner.py Our friendly, working, correct, linear regression learner. It is used by the grading script. Do not rely on local changes you make to this file, as you may only submit gen_data.py.
    • defeat_learners/DTLearner.py A working, but INCORRECT, Decision Tree learner. Replace it with your working, correct DTLearner.
    • defeat_learners/testbest4.py Code that calls the two data set generating functions and tests them against the two learners. Useful for debugging.
    • defeat_learners/grade_best4.py The grading script. For more details, see ML4T_Software_Setup#Running_the_grading_scripts

Tasks

Implement Functions

Create a Python program called gen_data.py that implements two functions. The two functions should be named as follows, and support the following API:

X1, Y1 = best4LinReg(seed = 5)
X2, Y2 = best4DT(seed = 5)
  • seed: Your data generation should use a random number generator as part of its data generation process. We will pass your generators a random number seed. Whenever the seed is the same you should return exactly the same data set. Different seeds should result in different data sets.
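
The seed requirement can be illustrated with a minimal sketch. It assumes NumPy is available; the helper name make_dataset and the uniform random values are hypothetical and are not part of the provided template. The point is only that a generator created from the seed inside the function gives repeatable output for the same seed and different output for different seeds.

```python
import numpy as np


def make_dataset(seed=5):
    """Hypothetical helper: return a random (X, Y) pair fully determined by seed."""
    # A local generator keeps the result independent of any global random state.
    rng = np.random.RandomState(seed)
    n_rows = rng.randint(10, 1001)  # 10 to 1000 rows (upper bound is exclusive)
    n_cols = rng.randint(2, 11)     # 2 to 10 columns in X
    X = rng.random_sample((n_rows, n_cols))
    Y = rng.random_sample(n_rows)
    return X, Y
```

Calling make_dataset(5) twice returns identical arrays, while make_dataset(6) returns different ones. The same pattern (create the generator from seed, then draw every random value from it) carries over to best4LinReg and best4DT.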

Linear Regression Dataset

best4LinReg() should return data on which LinRegLearner performs significantly better (see rubric) than DTLearner.

Each data set should include from 2 to 10 columns in X, and one column in Y. The data should contain from 10 (minimum) to 1000 (maximum) rows.
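
One possible starting point, offered as a sketch rather than a solution known to satisfy the rubric: make Y an exactly linear function of X plus a small amount of noise, so that a linear model can recover the relationship while a tree can only approximate it with piecewise-constant leaves. The row and column counts, the weight range, and the noise level below are illustrative assumptions.

```python
import numpy as np


def best4LinReg(seed=5):
    """Sketch: Y is a noisy linear combination of the columns of X."""
    rng = np.random.RandomState(seed)
    n_rows, n_cols = 500, 4  # assumed sizes, within the 10-1000 and 2-10 limits
    X = rng.uniform(-10.0, 10.0, size=(n_rows, n_cols))
    weights = rng.uniform(1.0, 5.0, size=n_cols)    # assumed weight range
    noise = rng.normal(0.0, 0.5, size=n_rows)       # assumed noise level
    Y = np.dot(X, weights) + noise
    return X, Y
```

Whether the gap between the two learners is large enough should be checked with testbest4.py and the grading script.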

Decision Tree Dataset

best4DT() should return data on which DTLearner performs significantly better than LinRegLearner.

Each data set should include from 2 to 10 columns in X, and one column in Y. The data should contain from 10 (minimum) to 1000 (maximum) rows.
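
A sketch in the opposite direction, again under assumptions and not verified against the rubric: make Y depend on X in a way that no single hyperplane can capture, such as an XOR-style pattern based on the signs of two features. A linear fit then explains almost none of the variance, while a tree with leaf_size = 1 can keep splitting until it isolates the regions. The sizes and the choice of pattern are illustrative.

```python
import numpy as np


def best4DT(seed=5):
    """Sketch: Y follows an XOR-style pattern that a linear model cannot fit."""
    rng = np.random.RandomState(seed)
    n_rows, n_cols = 500, 2  # assumed sizes, within the 10-1000 and 2-10 limits
    X = rng.uniform(-1.0, 1.0, size=(n_rows, n_cols))
    # Y is +1 where the two features share a sign, and -1 otherwise.
    Y = np.where(X[:, 0] * X[:, 1] > 0, 1.0, -1.0)
    return X, Y
```

As with the linear case, the actual margin between DTLearner and LinRegLearner should be confirmed with testbest4.py.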

What to turn in

Rubric

Report

Code

Required, Allowed & Prohibited