Difference between revisions of "MC3-Homework-1"

From Quantitative Analysis Software Courses

Revision as of 23:55, 16 September 2016

Draft

The description for this assignment has not been created yet. Once it is finalized, this notice will be removed.

Updates / FAQs

Overview

You will also write some code to generate your own datasets. That part of the project will test your understanding of the strengths and weaknesses of various learners.

Template and Data

Generate your own datasets

  • Create your own dataset generating code (call it best4linreg.py) that creates data that performs significantly better with LinRegLearner than KNNLearner. Explain your data generating algorithm, and explain why LinRegLearner performs better. Your data should include at least 2 dimensions in X, and at least 1000 points. (Don't use bagging for this section).
  • Create your own dataset generating code (call it best4KNN.py) that creates data that performs significantly better with KNNLearner than LinRegLearner. Explain your data generating algorithm, and explain why KNNLearner performs better. Your data should include at least 2 dimensions in X, and at least 1000 points. (Don't use bagging for this section).
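As one possible starting point, the two generators could be sketched as below. This is only an illustration under stated assumptions, not the expected solution: the function names, the value ranges, the noise level, the sine-of-radius target and the 60/40 split are all hypothetical choices. The idea is that a noisy linear target matches the model LinRegLearner fits, while a target that oscillates with distance from the origin has no useful global linear trend but is locally smooth, which favors KNNLearner. Drawing train and test rows from one shuffled pool keeps both sets in the same distribution.

```python
import numpy as np


def make_linear_data(n_points=1000, n_dims=2, noise=0.1, seed=0):
    """Hypothetical generator: Y is a noisy linear function of X."""
    rng = np.random.RandomState(seed)
    X = rng.uniform(-10.0, 10.0, size=(n_points, n_dims))
    weights = rng.uniform(-5.0, 5.0, size=n_dims)  # arbitrary slopes (assumption)
    Y = X.dot(weights) + rng.normal(0.0, noise, size=n_points)
    return X, Y


def make_nonlinear_data(n_points=1000, n_dims=2, noise=0.1, seed=0):
    """Hypothetical generator: Y oscillates with distance from the origin."""
    rng = np.random.RandomState(seed)
    X = rng.uniform(-10.0, 10.0, size=(n_points, n_dims))
    radius = np.sqrt((X ** 2).sum(axis=1))
    Y = np.sin(radius) + rng.normal(0.0, noise, size=n_points)
    return X, Y


def split_train_test(X, Y, train_fraction=0.6, seed=0):
    """Shuffle rows, then split, so train and test share one distribution."""
    rng = np.random.RandomState(seed)
    order = rng.permutation(X.shape[0])
    cut = int(train_fraction * X.shape[0])
    train, test = order[:cut], order[cut:]
    return X[train], Y[train], X[test], Y[test]
```

Whether either dataset actually separates the two learners "significantly" still has to be shown by running both learners on it and comparing their out-of-sample errors.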

Part 4: Experiments and report (30%)

Create a report that addresses the following issues/questions. Use 11pt font and single-spaced lines. We expect that a complete report addressing all the criteria would be at least 4 pages. It should be no longer than 10 pages including charts, tables and text. To encourage conciseness we will deduct 2% for each page over 10 pages. The report should be submitted as report.pdf in PDF format. Do not submit Word documents or LaTeX files. Include data as tables or charts to support each of your answers.

  • Include charts or tables of data to support your results. However, your submitted code should not generate statistics or charts. Instead, modify testlearner.py (which is not submitted) to generate statistics and charts.
  • Consider the dataset ripple with KNN. For which values of K does overfitting occur? (Don't use bagging).
  • Now use bagging in conjunction with KNN with the ripple dataset. Choose some K and keep it fixed. How does performance vary as you increase the number of bags? Does overfitting occur with respect to the number of bags?
  • Can bagging reduce or eliminate overfitting with respect to K for the ripple dataset? Fix the number of bags and vary K.
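The experiments above all come down to comparing in-sample and out-of-sample error as one parameter varies. A minimal sketch of that comparison follows. It assumes RMSE as the error metric and uses a small stand-in nearest-neighbor function rather than your KNNLearner or BagLearner; every name here (rmse, knn_predict, bagged_knn_predict, sweep_k) is hypothetical, and the bagging step assumes each bag draws as many rows as the training set, with replacement. Overfitting would typically show up as the range of K where in-sample error keeps falling while out-of-sample error rises.

```python
import numpy as np


def rmse(predicted, actual):
    """Root mean squared error between two 1-D arrays."""
    diff = np.asarray(predicted) - np.asarray(actual)
    return float(np.sqrt(np.mean(diff ** 2)))


def knn_predict(train_x, train_y, query_x, k):
    """Stand-in KNN regressor: mean Y of the k closest training points."""
    # Pairwise squared Euclidean distances, shape (n_query, n_train).
    diffs = query_x[:, np.newaxis, :] - train_x[np.newaxis, :, :]
    dist2 = (diffs ** 2).sum(axis=2)
    nearest = np.argsort(dist2, axis=1)[:, :k]
    return train_y[nearest].mean(axis=1)


def bagged_knn_predict(train_x, train_y, query_x, k, bags, seed=0):
    """Average KNN predictions over bootstrap samples of the training set."""
    rng = np.random.RandomState(seed)
    n = train_x.shape[0]
    predictions = []
    for _ in range(bags):
        sample = rng.randint(0, n, size=n)  # n row indices, with replacement
        predictions.append(
            knn_predict(train_x[sample], train_y[sample], query_x, k))
    return np.mean(predictions, axis=0)


def sweep_k(train_x, train_y, test_x, test_y, k_values):
    """Return (k, in-sample RMSE, out-of-sample RMSE) for each k."""
    rows = []
    for k in k_values:
        in_err = rmse(knn_predict(train_x, train_y, train_x, k), train_y)
        out_err = rmse(knn_predict(train_x, train_y, test_x, k), test_y)
        rows.append((k, in_err, out_err))
    return rows
```

The rows returned by sweep_k can be written into a table or plotted from testlearner.py. The same loop shape works for the bagging questions by fixing k and iterating over the number of bags instead.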

What to turn in

Be sure to follow these instructions diligently!

Via T-Square, submit as attachment (no zip files; refer to schedule for deadline):

  • Your code as KNNLearner.py, BagLearner.py, best4linreg.py, best4KNN.py
  • Your report as report.pdf

Unlimited resubmissions are allowed up to the deadline for the project.

Extra Credit (3%)

Implement boosting as part of BagLearner. How does boosting affect performance for ripple and 3_groups data?

Does overfitting occur for either of these datasets as the number of bags with boosting increases?

Create your own dataset for which overfitting occurs as the number of bags with boosting increases.

Describe and assess your boosting code in a separate report.pdf. Your report should focus only on boosting. It should be submitted separately to the "extra credit" assignment on T-Square.

Rubric

  • KNNLearner, auto grade 10 test cases (including ripple.csv and 3_groups.csv), 3 points each: 30 points
  • BagLearner, auto grade 10 test cases (including ripple.csv and 3_groups.csv), 2 points each: 20 points
  • best4linreg.py
    • Code submitted: -5 if absent
    • Description complete -- Sufficient that someone else could implement it: -5 if not
    • Description compelling -- The reasoning that linreg should do better is understandable and makes sense. Graph of the data helps but is not required if the description is otherwise compelling: -5 if not
    • Train and test data drawn from same distribution: -5 if not
    • Performance demonstrates that linreg does better: -10 if not
  • best4KNN.py
    • Code submitted: -5 if absent
    • Description complete -- Sufficient that someone else could implement it: -5 if not
    • Description compelling -- The reasoning that KNN should do better is understandable and makes sense. Graph of the data helps but is not required if the description is otherwise compelling: -5 if not
    • Train and test data drawn from same distribution: -5 if not
    • Performance demonstrates that KNN does better: -10 if not
  • Overfitting
    • Student conveys a correct understanding of overfitting in the report?: -5 points if not.
    • Is the region of overfitting correctly identified? -5 points if not.
    • Is the conclusion supported with data (table or chart): -5 points if not.
  • Bagging
    • Correct conclusion regarding overfitting as bags increase, supported with tables or charts: -10 points if not.
    • Correct conclusion regarding overfitting as K increases, supported with tables or charts: -10 points if not.

Test Cases

Here are the test cases we used while grading. These are updated each semester, and released after grading.


Required, Allowed & Prohibited

Required:

  • Your project must be coded in Python 2.7.x.
  • Your code must run on one of the university-provided computers (e.g. buffet02.cc.gatech.edu), or on one of the provided virtual images.
  • Your code must run in less than 5 seconds on one of the university-provided computers.
  • The code you submit should NOT include any data reading routines. The provided testlearner.py code reads data for you.
  • The code you submit should NOT generate any output: No prints, no charts, etc.

Allowed:

  • You can develop your code on your personal machine, but it must also run successfully on one of the university-provided machines or virtual images.
  • Your code may use standard Python libraries.
  • You may use the NumPy, SciPy, matplotlib and Pandas libraries. Be sure you are using the correct versions.
  • You may reuse sections of code (up to 5 lines) that you collected from other students or the internet.
  • You may use code provided by the instructor, or allowed by the instructor to be shared.
  • Cheese.

Prohibited:

  • Any other method of reading data besides testlearner.py
  • Any libraries not listed in the "allowed" section above.
  • Any code you did not write yourself (except for the 5 line rule in the "allowed" section).
  • Any classes (other than Random) that create their own instance variables for later use (e.g., learners like kdtree).
  • Code that includes any data reading routines. The provided testlearner.py code reads data for you.
  • Code that generates any output when verbose = False: No prints, no charts, etc.
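One way to satisfy the output rule is to route every diagnostic message through a single check of the verbose flag. The skeleton below is a sketch only: the class name SketchLearner and the log method are hypothetical, and the source does not specify how the verbose setting is passed in, so the constructor argument is an assumption.

```python
class SketchLearner(object):
    """Hypothetical learner skeleton that stays silent unless verbose is set."""

    def __init__(self, verbose=False):
        # Assumption: the caller passes verbose at construction time.
        self.verbose = verbose

    def log(self, message):
        # Print diagnostics only when the caller explicitly asked for them.
        if self.verbose:
            print(message)
```

With verbose left at its default of False, calls to log produce no output, so debugging messages can stay in the code without violating the rule.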

Legacy

MC3-Homework-1-legacy