Difference between revisions of "ML4T Software Setup"

From Quantitative Analysis Software Courses
(19 intermediate revisions by 3 users not shown)
Line 1: Line 1:
== Draft ==
 
This page is in the process of being updated for the Spring 2018 semester. This notice will be removed once all changes are finalized.
 
 
 
== Notice ==
 
== Notice ==
The repository has been made private for the Fall 2017 semester, and so the links to the repository below will no longer be visible for you. A zip file containing the grading script and any template code or data will be linked off of each assignment's individual wiki page. A zip file containing the <tt>grading</tt> and <tt>util</tt> modules, as well as the data, is available here: [[Media:ML4T_2017Fall.zip]]. The instructions on running the test scripts provided below still applies.
+
The repository has been made private for the Fall 2017 semester, and so the links to the repository below will no longer be visible for you. A zip file containing the grading script and any template code or data will be linked off of each assignment's individual wiki page. A zip file containing the <tt>grading</tt> and <tt>util</tt> modules, as well as the data, is available here: [[Media:ML4T_2019Spring.zip]]. The instructions on running the test scripts provided below still applies.
  
 
== Overview ==
 
== Overview ==
Line 11: Line 8:
 
===Important Notes===
 
===Important Notes===
  
* Your code '''MUST''' run properly on the Georgia Tech provided servers, and your code must be submitted to T-square. If you do not test your code on the provided machines it may not run correctly when we test it.  If your code fails to run on the provided servers, you will not get credit for the assignment.  So it is very important that you ensure that you have access to, and that your code runs correctly on, these machines. If you would like to develop on your personal machine and are comfortable installing libraries by hand, you can follow the instructions here: [[ML4T_Software_Installation]]. Note that these instructions are from an earlier version of the class, but should work reasonably well.
+
* Your code '''MUST''' run properly on the Georgia Tech provided servers, and your code must be submitted to Canvas. If you do not test your code on the provided machines it may not run correctly when we test it.  If your code fails to run on the provided servers, you will not get credit for the assignment.  So it is very important that you ensure that you have access to, and that your code runs correctly on, these machines. If you would like to develop on your personal machine and are comfortable installing libraries by hand, you can follow the instructions here: [[ML4T_Software_Installation]]. Note that these instructions are from an earlier version of the class, but should work reasonably well.
 
* We use a specific, static dataset for this course, which is provided as part of the repository detailed below. If you download your own data from Yahoo (or elsewhere), you will get wrong answers on assignments.
 
* We use a specific, static dataset for this course, which is provided as part of the repository detailed below. If you download your own data from Yahoo (or elsewhere), you will get wrong answers on assignments.
 
* We reserve the right to modify the grading script while maintaining API compatibility with what is described on the project pages. This includes modifying or withholding test cases, changing point values to match the given rubric, and changing timeout limits to accommodate grading deadlines. The scripts are provided as a convenience to help students avoid common pitfalls or mistakes, and are intended to be used as a sanity check. '''Passing all tests does not guarantee full credit on the assignment, and should be considered a necessary but not sufficient condition for completing an assignment.'''
 
* We reserve the right to modify the grading script while maintaining API compatibility with what is described on the project pages. This includes modifying or withholding test cases, changing point values to match the given rubric, and changing timeout limits to accommodate grading deadlines. The scripts are provided as a convenience to help students avoid common pitfalls or mistakes, and are intended to be used as a sanity check. '''Passing all tests does not guarantee full credit on the assignment, and should be considered a necessary but not sufficient condition for completing an assignment.'''
 
* Using github.gatech.edu to back up your work is a very good idea which we encourage, however make sure that you '''do not''' make your solutions to the assignments public. It's easy to accidentally do this, so please be careful:
 
* Using github.gatech.edu to back up your work is a very good idea which we encourage, however make sure that you '''do not''' make your solutions to the assignments public. It's easy to accidentally do this, so please be careful:
 
** '''Do not''' put your solutions in a '''public''' repository. Repositories on github.com are public by default. The Georgia Tech github, github.gatech.edu, provides the same interface and allows for free private repos for students.
 
** '''Do not''' put your solutions in a '''public''' repository. Repositories on github.com are public by default. The Georgia Tech github, github.gatech.edu, provides the same interface and allows for free private repos for students.
 +
* '''Do not''' make use of or generate any files (whether through automated tools, like IDEs, or manually), related to the assignments, in the /tmp partition under any circumstances.  Failure to comply with this may result in lost work or a possible violation of student integrity policies. 
 +
** Many IDEs/tools offer a remote synchronization option that creates artifacts in /tmp by default.  Unless you are certain you know what you're doing, do not make use of these remote synchronization options.
  
 
==Access to machines at Georgia Tech==
 
==Access to machines at Georgia Tech==
Line 24: Line 23:
 
  ssh -X gtname@buffet0X.cc.gatech.edu
 
  ssh -X gtname@buffet0X.cc.gatech.edu
  
replacing the <tt>X</tt> in <tt>buffet0X</tt> with 1-4, as detailed below. You will then be asked for your password and be logged in. Windows users may have to install an ssh client such as [http://www.putty.org/ putty]. In order to distribute workload across the machines, please use the specific machines as follows:
+
replacing the <tt>X</tt> in <tt>buffet0X</tt> with 1, 3 or 4, as detailed below. You will then be asked for your password and be logged in. Windows users may have to install an ssh client such as [http://www.putty.org/ putty]. In order to distribute workload across the machines, please use the specific machines as follows:
  
* buffet01.cc.gatech.edu if your last name begins with A-G
+
* <s>buffet01.cc.gatech.edu if your last name begins with A-G</s> buffet01.cc.gatech.edu if your last name begins with A-I
* buffet02.cc.gatech.edu if your last name begins with H-N
+
* <s>buffet02.cc.gatech.edu if your last name begins with H-N</s>
* buffet03.cc.gatech.edu if your last name begins with O-U
+
* <s>buffet03.cc.gatech.edu if your last name begins with O-U</s> buffet03.cc.gatech.edu if your last name begins with J-R
* buffet04.cc.gatech.edu if your last name begins with V-Z
+
* <s>buffet04.cc.gatech.edu if your last name begins with V-Z</s> buffet04.cc.gatech.edu if your last name begins with S-Z
  
 
These machines use your GT login credentials.  
 
These machines use your GT login credentials.  
 +
 +
The xhost command and the -X argument to ssh are only necessary if you want to interactively draw plots directly to your screen while running code remotely on buffet. If you have any problems doing this, just forgo xhost and the -X argument and instead plot to a file using the Agg backend of matplotlib and the savefig() function. These require no "screen" access.
  
 
'''NOTE:''' We reserve the right to limit login access or terminate processes to avoid resource contention during grading, although we will endeavor to limit such interruptions.
 
'''NOTE:''' We reserve the right to limit login access or terminate processes to avoid resource contention during grading, although we will endeavor to limit such interruptions.
Line 37: Line 38:
 
==Getting code templates==
 
==Getting code templates==
  
After you've successfully logged in, you will need to clone the following git repository containing all of the template code and data into your home directory: https://github.gatech.edu/ML4T/ML4T_2017Fall. You can do this with the following command:
+
As of Spring 2018, code for each of the individual assignments is provided in zip files, linked to on the individual project page. The data, grading module, and util.py, which are common across all assignments, are available here [[Media:ML4T_2019Spring.zip]] (<span style="color:red">same file as above</span>).
 
 
git clone https://github.gatech.edu/ML4T/ML4T_2017Fall.git
 
 
 
again providing your GT login credentials when asked for. For the remainder of these instructions, we'll assume you checked out the repository into your home directory, and that you did not change the name of the folder.
 
  
 
== Running the grading scripts ==
 
== Running the grading scripts ==
  
The repository you've just cloned contains the grading scripts, data, and template code for all assignments. To complete the assignments you'll need to modify the templates according to the assignment description. You can do this on the <tt>buffet0X</tt> machines directly using a text editor such as <tt>gedit</tt>, <tt>nano</tt>, or <tt>vim</tt>. Or you can copy the file to your local machine, edit them in your favorite text editor or IDE, and upload them back to the server. Make sure to test run your code on the server after making changes to catch any typos or other bugs.
+
The above zip files contain the grading scripts, data, and util.py for all assignments. Some project pages will also link to a zip file containing a directory with template code, which you should extract into the directory that contains the <tt>data/</tt> and <tt>grading/</tt> directories and <tt>util.py</tt> (<tt>ML4T_2019Spring/</tt>). To complete the assignments you'll need to modify the templates according to the assignment description. You can do this on the <tt>buffet0X</tt> machines directly using a text editor such as <tt>gedit</tt>, <tt>nano</tt>, or <tt>vim</tt>. Alternatively, you can copy the files to your local machine, edit them in your favorite text editor or IDE, and upload them back to the server. Make sure to test-run your code on the server after making changes to catch any typos or other bugs.
  
To test your code, you'll need to set up your PYTHONPATH to include the <tt>grading</tt> module and the utility module <tt>util.py</tt>, which are both one directory up from the project directories. Here's an example of how to run the grading script for the first assignment:
+
To test your code, you'll need to set up your PYTHONPATH to include the <tt>grading</tt> module and the utility module <tt>util.py</tt>, which are both one directory up from the project directories. Here's an '''example''' of how to run the grading script for the optional (deprecated) assignment Assess Portfolio (note, grade_analysis.py is included in the template zip file for Assess Portfolio):
  
 
  PYTHONPATH=../:. python grade_analysis.py
 
  PYTHONPATH=../:. python grade_analysis.py
  
which assumes you're typing from the folder '''ML4T_2017Fall/assess_portfolio/'''. This will print out a lot of information, and will also produce two text files: <tt>points.txt</tt> and <tt>comments.txt</tt>, which summarize the output, including any errors or failed test cases.
+
which assumes you're typing from the folder '''ML4T_2019Spring/assess_portfolio/'''. This will print out a lot of information, and will also produce two text files: <tt>points.txt</tt> and <tt>comments.txt</tt>. It will probably be helpful to scan through all of the output printed out in order to trace errors to your code, while <tt>comments.txt</tt> will contain a succinct summary of which test cases failed and the specific errors (without the backtrace). Here's an example of the contents of <tt>comments.txt</tt> for the first assignment using the unchanged template:
 
 
== Updating the repository ==
 
We will periodically update the repository throughout the semester. When this happens, we will make a note of it on the [[Repository Update Page]], which you should check regularly. Below are instructions for updating your copy of the repository.
 
  
Note: these instructions are for students who have not committed to their repository, added a different origin, or any other advanced git techniques. If you have done this, some quick googling should resolve any questions you have.
+
&lt;pre&gt;--- Summary ---
 +
Tests passed: 0 out of 3
 
   
 
   
From here on, we'll assume you've checked out the repository, and may have made some modifications you'd like to keep. First things first, figure out what you have changed since you originally pulled the repo. From the your code root directory (e.g., <tt>ML4T_2017Fall/</tt>), run the following command:
+
--- Details ---
 
+
Test #0: failed
git status
+
Test case description: Wiki example 1
 
+
IncorrectOutput: One or more stats were incorrect.
Look at the list of files that have changed and make sure it makes sense. For example, if you've only modified the python file for the first assignment, the output may look something like this:
+
  Inputs:
 
+
    start_date: 2010-01-01 00:00:00
bhrolenok3@buffet02:~/ML4T_2017Fall$ git status
+
    end_date: 2010-12-31 00:00:00
On branch master
+
    symbols: ['GOOG', 'AAPL', 'GLD', 'XOM']
Your branch is up-to-date with 'origin/master'.
+
    allocs: [0.2, 0.3, 0.4, 0.1]
 +
    start_val: 1000000
 +
  Wrong values:
 +
    cum_ret: 0.25 (expected: 0.255646784534)
 +
    avg_daily_ret: 0.001 (expected: 0.000957366234238)
 +
    sharpe_ratio: 2.1 (expected: 1.51819243641)
 
   
 
   
  Changes not staged for commit:
+
  Test #1: failed
  (use "git add <file>..." to update what will be committed)
+
  Test case description: Wiki example 2
  (use "git checkout -- <file>..." to discard changes in working directory)
+
  ...
 
modified:  assess_portfolio/analysis.py
 
 
no changes added to commit (use "git add" and/or "git commit -a")
 
You may see a few lines after this under the heading "Untracked files", these are safe to ignore. They are just files that aren't part of the repository (temporary backups, <tt>.pyc</tt> files, notes, etc). If you see any modified files that you don't remember editing, you can look at the exact differences by using the following git command:
 
 
 
git diff <filename>
 
 
 
replacing <tt>&lt;filename&gt;</tt> with the name of the file that's been marked as modified. Following the example earlier, here's what running that command looks like for the <tt>analysis.py</tt> changes I made:
 
 
 
bhrolenok3@buffet02:~/ML4T_2017Fall$ git diff assess_portfolio/analysis.py
 
diff --git a/assess_portfolio/analysis.py b/assess_portfolio/analysis.py
 
index 9a9c1c6..10d422e 100644
 
--- a/assess_portfolio/analysis.py
 
+++ b/assess_portfolio/analysis.py
 
@@ -24,7 +24,7 @@ def assess_portfolio(sd = dt.datetime(2008,1,1), ed = dt.datetime(2009,1,1), \
 
      port_val = prices_SPY # add code here to compute daily portfolio values
 
 
 
      # Get portfolio statistics (note: std_daily_ret = volatility)
 
-    cr, adr, sddr, sr = [0.25, 0.001, 0.0005, 2.1] # add code here to compute stats
 
  +    cr, adr, sddr, sr = [0.50, 0.002, 0.0010, 4.2] # twice as good!
 
 
 
      # Compare daily portfolio value with SPY using a normalized plot
 
      if gen_plot:
 
 
 
lines with <b>-</b> have been removed, lines with <b>+</b> have been added, so this output means I changed one line in the file, changing the <code>cr, adr, sddr, sr</code> variables. You'll be able to scroll up and down through the changes using your arrow keys, and you'll need to hit the <b>q</b> key to get back to the command line. Once you've identified all the changed files, use scp (or WinSCP or the ssh client of your choice) to copy the files you'd like to keep to your local computer. Now, you can stash all the changes you've made on your copy of the repo on <tt>buffet0x</tt> using <code>git stash</code>, which, following our example, will look something like this:
 
 
 
bhrolenok3@buffet02:~/ML4T_2017Fall$ git stash
 
Saved working directory and index state WIP on master: a97a488 Grading script for mc1p1, initial commit
 
HEAD is now at a97a488 Grading script for mc1p1, initial commit
 
 
 
Now you can safely pull down all the changes that have been made to the repo since the last time. Do that using <code>git pull</code>:
 
 
 
bhrolenok3@buffet02:~/ML4T_2017Fall$ git pull
 
Enter passphrase for key '/home/bhrolenok3/.ssh/id_rsa':
 
DISPLAY "(null)" invalid; disabling X11 forwarding
 
remote: Counting objects: 7, done.
 
remote: Compressing objects: 100% (7/7), done.
 
remote: Total 7 (delta 0), reused 0 (delta 0), pack-reused 0
 
Unpacking objects: 100% (7/7), done.
 
From https://github.gatech.edu/ML4T/ML4T_2017Fall
 
    228f9ec..803f0be  master    -> origin/master
 
Updating 228f9ec..803f0be
 
Fast-forward
 
  assess_learners/Data/winequality-red.csv  | 1599 ++++++++++++
 
  assess_learners/Data/winequality-white.csv | 4898 +++++++++++++++++++++++++++++++++++++
 
  assess_learners/Data/winequality.names.txt |  72 +
 
  3 files changed, 6569 insertions(+)
 
  create mode 100644 assess_learners/Data/winequality-red.csv
 
  create mode 100644 assess_learners/Data/winequality-white.csv
 
  create mode 100644 assess_learners/Data/winequality.names.txt
 
 
 
This should be similar for everyone, since the only time the remote repository is updated is when we (TAs/Professor Balch) make changes. At this point, you'll have all the new changes to the repository. From here you can 1) start working from scratch on the current assignment (safest option), 2) copy back the modified files using scp (verify by hand), or 3) use <code>git stash</code> to apply the changes to the new repository.  Option 2 should be safe and quick in most instances, and if you're not comfortable with <code>git</code> and the command line it may be the easiest. You'll have to check the differences between any files you overwrite when you copy them back to <tt>buffet0x</tt>, which you can do easily with the <code>git diff</code> command described earlier. Option 3 handles all of these things using <code>git</code>'s own tools. To apply your stashed changes from earlier, you can simply call <code>git stash pop</code>:
 
 
 
bhrolenok3@buffet02:~/ML4T_2017Fall$ git stash pop
 
On branch master
 
  Your branch is up-to-date with 'origin/master'.
 
 
 
Changes not staged for commit:
 
  (use "git add <file>..." to update what will be committed)
 
  (use "git checkout -- <file>..." to discard changes in working directory)
 
 
 
modified:  assess_portfolio/analysis.py
 
  
which tells you the status of the repo after applying all your changes, which you should double check makes sense using <code>git diff</code> as before. If you see any "conflicts" or error messages when applying your stashed changes, you'll need to go back over them by hand. Since you have a backup of your files, you can always wipe out the repo and start from a clean slate.
+
The <tt>comments.txt</tt> file will contain a summary of which tests were passed or failed, and any error messages. The <tt>points.txt</tt> file reports the score from the autograder, used by the teaching staff to automate grading submitted code in a batch run, and can be safely ignored by students.

Revision as of 15:55, 18 January 2019

Notice

The repository has been made private for the Fall 2017 semester, and so the links to the repository below will no longer be visible to you. A zip file containing the grading script and any template code or data will be linked off of each assignment's individual wiki page. A zip file containing the grading and util modules, as well as the data, is available here: Media:ML4T_2019Spring.zip. The instructions on running the test scripts provided below still apply.

Overview

Most of the projects in this class will be graded automatically. As of the summer 2017 semester, we are providing the grading scripts with the template code for each of the projects, so that students can test their code to make sure they are API compatible. Georgia Tech also provides access to four servers that have been configured to be identical to the grading environment, specifically in terms of operating system and library versions. Since these servers have already been configured with all necessary libraries, setup has been greatly simplified.

Important Notes

  • Your code MUST run properly on the Georgia Tech provided servers, and your code must be submitted to Canvas. If you do not test your code on the provided machines it may not run correctly when we test it. If your code fails to run on the provided servers, you will not get credit for the assignment. So it is very important that you ensure that you have access to, and that your code runs correctly on, these machines. If you would like to develop on your personal machine and are comfortable installing libraries by hand, you can follow the instructions here: ML4T_Software_Installation. Note that these instructions are from an earlier version of the class, but should work reasonably well.
  • We use a specific, static dataset for this course, which is provided as part of the repository detailed below. If you download your own data from Yahoo (or elsewhere), you will get wrong answers on assignments.
  • We reserve the right to modify the grading script while maintaining API compatibility with what is described on the project pages. This includes modifying or withholding test cases, changing point values to match the given rubric, and changing timeout limits to accommodate grading deadlines. The scripts are provided as a convenience to help students avoid common pitfalls or mistakes, and are intended to be used as a sanity check. Passing all tests does not guarantee full credit on the assignment, and should be considered a necessary but not sufficient condition for completing an assignment.
  • Using github.gatech.edu to back up your work is a very good idea, which we encourage; however, make sure that you do not make your solutions to the assignments public. It's easy to do this accidentally, so please be careful:
    • Do not put your solutions in a public repository. Repositories on github.com are public by default. The Georgia Tech github, github.gatech.edu, provides the same interface and allows for free private repos for students.
  • Do not make use of or generate any files (whether through automated tools, like IDEs, or manually), related to the assignments, in the /tmp partition under any circumstances. Failure to comply with this may result in lost work or a possible violation of student integrity policies.
    • Many IDEs/tools offer a remote synchronization option that creates artifacts in /tmp by default. Unless you are certain you know what you're doing, do not make use of these remote synchronization options.

Access to machines at Georgia Tech

There are 4 machines that will be accessible to students enrolled in the ML4T class via ssh. These machines may not be available until the second week of class; we will make an announcement once they are ready, and if at that time you are still unable to log in, please contact us. If you are using a Unix based operating system, such as Ubuntu or Mac OS X, you already have an ssh client, and you can connect to one of the servers by opening up a terminal and typing:

xhost +
ssh -X gtname@buffet0X.cc.gatech.edu

replacing the X in buffet0X with 1, 3 or 4, as detailed below. You will then be asked for your password and be logged in. Windows users may have to install an ssh client such as putty. In order to distribute workload across the machines, please use the specific machines as follows:

  • buffet01.cc.gatech.edu if your last name begins with A-I
  • buffet03.cc.gatech.edu if your last name begins with J-R
  • buffet04.cc.gatech.edu if your last name begins with S-Z
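
As a quick illustration of the current name-based assignment (A-I, J-R, S-Z on buffet01, buffet03 and buffet04), here is a minimal Python sketch. The function name buffet_host is hypothetical and is not part of any course code; it only restates the ranges listed above.

```python
def buffet_host(last_name):
    """Return the server for a last name, using the ranges listed on this page.

    Hypothetical helper for illustration only; it is not course-provided code.
    """
    initial = last_name.strip()[:1].upper()
    if "A" <= initial <= "I":
        return "buffet01.cc.gatech.edu"
    if "J" <= initial <= "R":
        return "buffet03.cc.gatech.edu"
    if "S" <= initial <= "Z":
        return "buffet04.cc.gatech.edu"
    # Names that do not start with A-Z are not covered by the list above.
    raise ValueError("last name must start with a letter A-Z")


if __name__ == "__main__":
    # Prints the host name to use in the ssh command.
    print(buffet_host("Smith"))
```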

These machines use your GT login credentials.

The xhost command and the -X argument to ssh are only necessary if you want to interactively draw plots directly to your screen while running code remotely on buffet. If you have any problems doing this, just forgo xhost and the -X argument and instead plot to a file using the Agg backend of matplotlib and the savefig() function. These require no "screen" access.
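
The file-based plotting described above might look like the following minimal sketch. The function name save_line_plot, the axis labels, and the default file name plot.png are assumptions chosen for illustration; only the Agg backend and savefig() come from the text.

```python
import matplotlib

# Select the non-interactive Agg backend before pyplot is imported,
# so no X11 display (xhost / ssh -X) is needed.
matplotlib.use("Agg")
import matplotlib.pyplot as plt


def save_line_plot(values, filename="plot.png"):
    """Plot a sequence of numbers and write the figure to a file.

    Hypothetical helper for illustration; the names here are not course code.
    """
    fig, ax = plt.subplots()
    ax.plot(values)
    ax.set_xlabel("Index")
    ax.set_ylabel("Value")
    fig.savefig(filename)  # writes the image instead of opening a window
    plt.close(fig)         # release the figure's memory
    return filename


if __name__ == "__main__":
    save_line_plot([1.0, 1.02, 0.99, 1.05])
```

The resulting image can then be copied to your own machine (for example with scp) and viewed there.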

NOTE: We reserve the right to limit login access or terminate processes to avoid resource contention during grading, although we will endeavor to limit such interruptions.

Getting code templates

As of Spring 2018, code for each of the individual assignments is provided in zip files, linked to on the individual project page. The data, grading module, and util.py, which are common across all assignments, are available here: Media:ML4T_2019Spring.zip (same file as above).

Running the grading scripts

The above zip files contain the grading scripts, data, and util.py for all assignments. Some project pages will also link to a zip file containing a directory with template code, which you should extract into the directory that contains the data/ and grading/ directories and util.py (ML4T_2019Spring/). To complete the assignments you'll need to modify the templates according to the assignment description. You can do this on the buffet0X machines directly using a text editor such as gedit, nano, or vim. Alternatively, you can copy the files to your local machine, edit them in your favorite text editor or IDE, and upload them back to the server. Make sure to test-run your code on the server after making changes to catch any typos or other bugs.
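
To check that the layout described above is in place before running anything, a small sketch like the following may help. The function name missing_course_files is hypothetical; the expected entries (data/, grading/, util.py) are taken from the paragraph above, and the root directory is assumed to be your extracted ML4T_2019Spring/ folder.

```python
import os


def missing_course_files(root):
    """Return the expected top-level entries that are absent under root.

    Hypothetical helper for illustration only; it is not course-provided code.
    """
    expected = ["data", "grading", "util.py"]
    return [name for name in expected
            if not os.path.exists(os.path.join(root, name))]


if __name__ == "__main__":
    # Run from a project directory such as ML4T_2019Spring/assess_portfolio/,
    # so that ".." refers to the directory holding data/, grading/ and util.py.
    missing = missing_course_files("..")
    if missing:
        print("Missing from parent directory: " + ", ".join(missing))
    else:
        print("Layout looks complete.")
```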

To test your code, you'll need to set up your PYTHONPATH to include the grading module and the utility module util.py, which are both one directory up from the project directories. Here's an example of how to run the grading script for the optional (deprecated) assignment Assess Portfolio (note, grade_analysis.py is included in the template zip file for Assess Portfolio):

PYTHONPATH=../:. python grade_analysis.py

which assumes you're typing from the folder ML4T_2019Spring/assess_portfolio/. This will print out a lot of information, and will also produce two text files: points.txt and comments.txt. It will probably be helpful to scan through all of the output printed out in order to trace errors to your code, while comments.txt will contain a succinct summary of which test cases failed and the specific errors (without the backtrace). Here's an example of the contents of comments.txt for the first assignment using the unchanged template:

--- Summary ---
Tests passed: 0 out of 3

--- Details ---
Test #0: failed
Test case description: Wiki example 1
IncorrectOutput: One or more stats were incorrect.
  Inputs:
    start_date: 2010-01-01 00:00:00
    end_date: 2010-12-31 00:00:00
    symbols: ['GOOG', 'AAPL', 'GLD', 'XOM']
    allocs: [0.2, 0.3, 0.4, 0.1]
    start_val: 1000000
  Wrong values:
    cum_ret: 0.25 (expected: 0.255646784534)
    avg_daily_ret: 0.001 (expected: 0.000957366234238)
    sharpe_ratio: 2.1 (expected: 1.51819243641)

Test #1: failed
Test case description: Wiki example 2
...

The comments.txt file will contain a summary of which tests were passed or failed, and any error messages. The points.txt file reports the score from the autograder, used by the teaching staff to automate grading submitted code in a batch run, and can be safely ignored by students.
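
If you want a quick pass/fail count without reading the whole file, the summary can be pulled out with a few lines of Python. This is a minimal sketch that assumes comments.txt follows the layout of the sample shown above (a "Tests passed: X out of Y" line and "Test #N: failed" lines); the function names are hypothetical, and the grading script's exact output may differ between assignments.

```python
import re


def parse_summary(comments_text):
    """Return (passed, total) from the 'Tests passed' line, or None if absent."""
    match = re.search(r"Tests passed:\s*(\d+)\s+out of\s+(\d+)", comments_text)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def failed_tests(comments_text):
    """Return the test numbers reported as failed, in order of appearance."""
    found = re.findall(r"^\s*Test #(\d+): failed", comments_text,
                       flags=re.MULTILINE)
    return [int(number) for number in found]


if __name__ == "__main__":
    # Assumes comments.txt is in the current directory, as produced by the
    # grading script run shown earlier on this page.
    with open("comments.txt") as handle:
        text = handle.read()
    print(parse_summary(text))
    print(failed_tests(text))
```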