MC3-Homework-1
Revision as of 13:20, 27 February 2017
Draft
Updates / FAQs
- October 5, 2016 The description is finalized. We're still working on creating the template for you and the autograder. It is fine to get started on this project now.
- October 10, 2016 Template and repo are updated.
Overview
For this homework you will generate data that you believe will work better for one learner than another. This will test your understanding of the strengths and weaknesses of various learners. The two learners you should aim your datasets at are:
- A random tree learner with leaf_size = 1.
- The LinRegLearner provided as part of the repo.
Template and Data
If necessary, update your version of the repo following the instructions here: [1]. You will see the following files in your directory mc3_h1:
- gen_data.py An example implementation of the code you are supposed to provide. It includes two functions that each return a data set. Note that the data sets those functions return DO NOT satisfy the requirements for the homework, but they do show you how you can generate a data set.
- LinRegLearner.py Our friendly, working, correct, linear regression learner. It is used by the testing code.
- RTLearner.py A working, but INCORRECT, Random Tree learner. Replace it with your working, correct RTLearner.
- testbest4.py Code that calls the two data set generating functions and tests them against the two learners.
Generate your own datasets
Create a Python program called gen_data.py that implements two functions. The two functions should be named as follows, and support the following API:
X1, Y1 = best4LinReg()
X2, Y2 = best4RT()
best4LinReg() should return data that performs significantly better (see rubric) with LinRegLearner than RTLearner. best4RT() should return data that performs significantly better with RTLearner than LinRegLearner.
Each data set should include from 2 to 1000 columns in X, and one column in Y. The data should contain from 10 (minimum) to 1000 (maximum) rows.
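As a rough illustration of the API above, here is a minimal sketch of what gen_data.py could look like. It assumes NumPy and picks two arbitrary shapes: a nearly noise-free linear target for best4LinReg, and a symmetric bowl-shaped target for best4RT that has no linear trend. The seed keyword, the row and column counts, and the weights are all illustrative assumptions, not part of the assignment. Whether either data set clears the rubric's threshold against your own RTLearner is not verified here and would need to be checked with testbest4.py.

```python
import numpy as np


def best4LinReg(seed=42):
    """Return (X, Y) where Y is a nearly noise-free linear function of X.

    The seed argument is a hypothetical convenience; the call best4LinReg()
    with no arguments still works as the API requires.
    """
    rng = np.random.RandomState(seed)
    X = rng.uniform(-1.0, 1.0, size=(500, 4))   # 500 rows, 4 columns (arbitrary)
    weights = np.array([3.0, -2.0, 1.5, 0.5])   # arbitrary illustrative weights
    Y = X.dot(weights) + rng.normal(0.0, 0.01, size=500)
    return X, Y


def best4RT(seed=42):
    """Return (X, Y) where Y is a symmetric, non-linear function of X."""
    rng = np.random.RandomState(seed)
    X = rng.uniform(-1.0, 1.0, size=(500, 2))   # 500 rows, 2 columns (arbitrary)
    Y = (X ** 2).sum(axis=1)                    # bowl shape: no linear trend to fit
    return X, Y
```

Both functions generate their data in memory, read no files, and produce no output.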
What to turn in
Be sure to follow these instructions diligently!
Via T-Square, submit as attachment (no zip files; refer to schedule for deadline):
- Your code as gen_data.py
Unlimited resubmissions are allowed up to the deadline for the project.
Rubric
Deductions:
- Does either dataset returned contain fewer or more than the allowed number of samples? -20% each.
- Does either dataset returned contain fewer or more than the allowed number of dimensions in X? -20% each.
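The size limits behind these deductions (10 to 1000 rows, 2 to 1000 columns in X, one column in Y) can be checked locally before submitting. The helper below is a hypothetical sketch, not part of the provided repo; in particular, accepting Y shaped either (rows,) or (rows, 1) is an assumption about what the grader tolerates.

```python
import numpy as np


def check_dataset_shape(X, Y, min_rows=10, max_rows=1000,
                        min_cols=2, max_cols=1000):
    """Return a list of problems; an empty list means the shapes are within limits."""
    X = np.asarray(X)
    Y = np.asarray(Y)
    problems = []
    if X.ndim != 2:
        problems.append("X should be 2-D, got %d dimension(s)" % X.ndim)
        return problems
    rows, cols = X.shape
    if not min_rows <= rows <= max_rows:
        problems.append("X has %d rows, allowed %d to %d" % (rows, min_rows, max_rows))
    if not min_cols <= cols <= max_cols:
        problems.append("X has %d columns, allowed %d to %d" % (cols, min_cols, max_cols))
    # Assumption: Y may be shaped (rows,) or (rows, 1).
    y_is_single_column = Y.ndim == 1 or (Y.ndim == 2 and Y.shape[1] == 1)
    if not y_is_single_column:
        problems.append("Y should be a single column")
    elif Y.shape[0] != rows:
        problems.append("Y has %d rows but X has %d" % (Y.shape[0], rows))
    return problems
```

A call such as check_dataset_shape(X1, Y1) returning an empty list suggests the data set is inside the allowed range.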
For best4LinReg:
- We will run 15 test cases and select the best 10. Each successful test is worth +5 points (total of 50%).
- For each test case we will randomly select 60% of the data for training and 40% for testing.
- Success for each case is defined as: RMSE LinReg < RMSE RT * 0.9
For best4RT:
- We will run 15 test cases and select the best 10. Each successful test is worth +5 points (total of 50%).
- For each test case we will randomly select 60% of the data for training and 40% for testing.
- Success for each case is defined as: RMSE RT < RMSE LinReg * 0.9
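The per-case procedure in the rubric (random 60/40 split, RMSE for each learner, comparison with the 0.9 factor) can be sketched as follows. This is an illustration under assumptions, not the autograder's code: the function names are hypothetical, the learners themselves are left out because their interface is not described here, and the exact way the grader rounds the 60% split is unknown.

```python
import numpy as np


def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length sequences."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


def split_60_40(X, Y, rng):
    """Randomly assign 60% of the rows to training and the rest to testing."""
    n = X.shape[0]
    order = rng.permutation(n)
    n_train = (6 * n) // 10          # integer arithmetic; rounding is an assumption
    train, test = order[:n_train], order[n_train:]
    return X[train], Y[train], X[test], Y[test]


def is_success(rmse_favored, rmse_other, factor=0.9):
    """True when the favored learner's RMSE beats the other's by the rubric factor."""
    return rmse_favored < rmse_other * factor
```

For best4LinReg the favored learner is LinRegLearner, so the check would be is_success(rmse_linreg, rmse_rt); for best4RT the arguments swap.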
Required, Allowed & Prohibited
Required:
- No reading of data from files.
- Your project must be coded in Python 2.7.x.
- Your code must run on one of the university-provided computers (e.g. buffet02.cc.gatech.edu), or on one of the provided virtual images.
- Your code must run in less than 5 seconds on one of the university-provided computers.
- The code you submit should NOT include any data reading routines. You should generate all of your data within your functions.
- The code you submit should NOT generate any output: No prints, no charts, etc.
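Two of the requirements above, the 5 second limit and the ban on output, can be spot-checked locally. The sketch below uses a hypothetical helper that times a call and captures anything written to standard output through Python. It only catches text written via sys.stdout (for example print), not charts or output from other channels, and timing on your own machine is only a rough proxy for the university-provided computers.

```python
import sys
import time

try:
    from StringIO import StringIO   # Python 2.7
except ImportError:
    from io import StringIO         # Python 3


def run_quietly_timed(func):
    """Call func() and return (result, seconds_elapsed, captured_stdout_text)."""
    buffer = StringIO()
    original_stdout = sys.stdout
    sys.stdout = buffer              # temporarily redirect standard output
    start = time.time()
    try:
        result = func()
    finally:
        elapsed = time.time() - start
        sys.stdout = original_stdout  # always restore, even if func() raises
    return result, elapsed, buffer.getvalue()
```

For example, calling run_quietly_timed(best4LinReg) and confirming that the elapsed time is well under 5 and the captured text is empty gives some reassurance before submitting.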
Allowed:
- You can develop your code on your personal machine, but it must also run successfully on one of the university-provided machines or virtual images.
- Your code may use standard Python libraries.
- You may use the NumPy, SciPy, matplotlib and Pandas libraries. Be sure you are using the correct versions.
- You may reuse sections of code (up to 5 lines) that you collected from other students or the internet.
- Code provided by the instructor, or allowed by the instructor to be shared.
- Cheese.
Prohibited:
- Any reading of data files.
- Any libraries not listed in the "allowed" section above.
- Any code you did not write yourself (except for the 5 line rule in the "allowed" section).
- Any classes (other than Random) that create their own instance variables for later use (e.g., learners like kdtree).
- Code that includes any data reading routines. The provided testlearner.py code reads data for you.
- Code that generates any output when verbose = False: No prints, no charts, etc.