CS4811: Homework 6 --- Learning Decision Trees


Due: Friday, April 25, 2014, 11:59pm.
(Assigned: Friday, April 11, 2014.)

Reminder: This is an individual assignment. All the work should be your own and in accordance with the university's academic integrity policies. You are allowed to use any written source in preparing your answers, but if you use any source other than the textbook and the class notes, you should cite it in your assignment.

Problem:

In this assignment, you will implement the basic algorithm for learning decision trees.

You should implement your own code from scratch. You may consult existing implementations but you may not build on them. The textbook's web site has Java, Python and Lisp implementations, if you'd like to take a look.

Task:

Implement a decision tree learner that uses the "probability of error" heuristic that we discussed in class. Run your program on the Mushroom dataset (http://archive.ics.uci.edu/ml/datasets/Mushroom). It is fine for your program to work on the mushroom dataset only. In other words, you are not required to write a general purpose decision tree program. You may hardcode the number of attributes (22) and the possible values of each attribute. However, we will test your program on a subset of the dataset as well as on the original.
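As a rough illustration of the heuristic, here is a minimal Python sketch. It assumes that the probability of error of an attribute is the fraction of examples that would be misclassified if we split on that attribute and labeled each branch with its majority class; check this against the definition given in class. The function name, the (label, attributes) representation of an example, and the toy data are all made up for this illustration.

```python
from collections import Counter, defaultdict


def probability_of_error(examples, attr_index):
    """Fraction of examples misclassified after splitting on attr_index.

    Assumed definition: each branch predicts its majority class, so every
    example outside the majority class of its branch counts as an error.
    `examples` is a non-empty list of (label, attributes) pairs.
    """
    # Count the class labels seen for each value of the chosen attribute.
    labels_by_value = defaultdict(Counter)
    for label, attributes in examples:
        labels_by_value[attributes[attr_index]][label] += 1

    # Errors in a branch = branch size minus the size of its majority class.
    errors = sum(sum(counts.values()) - max(counts.values())
                 for counts in labels_by_value.values())
    return errors / len(examples)


# Toy data, invented for illustration (not taken from the mushroom file).
toy = [
    ("e", ("x", "s")),
    ("e", ("x", "y")),
    ("p", ("b", "s")),
    ("p", ("b", "y")),
]

for index in range(2):
    print(f"attribute {index}: {probability_of_error(toy, index)}")
```

On the toy data, attribute 0 separates the two classes perfectly and attribute 1 does not, so a learner that minimizes this value would choose attribute 0.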

The program should print a trace of how the decision tree is generated. For each node of the decision tree, the trace should list each attribute tested with its probability of error value, and the attribute that was chosen, in a format similar to the slides. When the final decision tree is generated, print the tree in a format that is readable and convenient for you. For example, you can traverse the tree depth first and print it as text, indenting each level by a few characters. Printing must be part of the program because we will test your code with different datasets.

According to the description file, the dataset has missing values for attribute 11, the stalk root. Simply replace each "?" with an "m" for missing, and use it as the value.
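The missing-value substitution and the indented depth-first printout could look roughly like the Python sketch below. The tree representation (a leaf is a class label string, an internal node is a pair of an attribute name and a dictionary from values to subtrees), the function names, and the attribute names attr_a and attr_b are assumptions made for this illustration. The tree itself is built by hand and is not a result from the mushroom data.

```python
def replace_missing(attributes):
    """Return the attribute values with every "?" replaced by "m"."""
    return tuple("m" if value == "?" else value for value in attributes)


def format_tree(node, indent=0):
    """Return the lines of a depth-first text rendering of the tree.

    Assumed representation: a leaf is a class label (str); an internal
    node is a pair (attribute_name, {value: subtree}).
    """
    pad = " " * indent
    if isinstance(node, str):
        return [f"{pad}-> {node}"]
    name, branches = node
    lines = []
    # Visit the branches in sorted order so the output is reproducible.
    for value, subtree in sorted(branches.items()):
        lines.append(f"{pad}{name} = {value}")
        lines.extend(format_tree(subtree, indent + 4))
    return lines


# Hand-built tree with placeholder attribute names, for illustration only.
tree = ("attr_a", {"n": ("attr_b", {"r": "p", "w": "e"}), "f": "p"})

print(replace_missing(("x", "?", "n")))
print("\n".join(format_tree(tree)))
```

Returning a list of lines instead of printing directly keeps the formatting easy to check; the final print call is the only place that writes output.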

Write a short report that describes your implementation, and presents a discussion of the results. In your submission include a file named README that contains full instructions on how to execute your code. We should be able to test it with another set of examples from the mushroom domain. In your documentation explain how we can feed another data file to your program, and what kind of output to expect. You are free to use any language that runs on the CS department Linux platforms.
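One simple way to accept another data file is to take its path as a command-line argument. The Python sketch below is only an illustration of that idea. It assumes comma-separated records with the class label in the first field, and the function names and the usage message are made up. The missing-value substitution is left out to keep the example short.

```python
import csv
import sys


def read_examples(path):
    """Read comma-separated records as (label, attributes) pairs.

    Assumes the class label is the first field of every row.
    """
    examples = []
    with open(path, newline="") as handle:
        for row in csv.reader(handle):
            if row:  # skip blank lines
                examples.append((row[0], tuple(row[1:])))
    return examples


def main(argv):
    """Return an exit status: 0 on success, 2 on a usage error."""
    if len(argv) != 2:
        print(f"usage: {argv[0]} DATA_FILE", file=sys.stderr)
        return 2
    examples = read_examples(argv[1])
    print(f"read {len(examples)} examples from {argv[1]}")
    return 0
```

A script built this way would end with sys.exit(main(sys.argv)), and the README would then tell the grader to pass the path of the data file as the only argument.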



Submit the following on Canvas. Hardcopies are not needed.