|
|
|
1. INTRODUCTION |
Classification using decision trees was one of the earliest forms of data mining. Ross Quinlan's C4.5 is arguably the most referenced decision tree algorithm (Quinlan, 1993). The code available from this WWW page is the LUCS-KDD (Liverpool University Department of Computer Science - Knowledge Discovery in Data) group's own implementation of a decision tree classifier for binary valued data sets, such as those typically used in Association Rule Mining (ARM).
One of the most significant issues in decision tree generation is deciding on which attribute to split. Various algorithms have been proposed in the literature. Two are used here:
The first selects the first attribute in a list of attributes order according to frequency within the entire data set. Information gain (Mitchel 1997) is one of the standard measures used in decision tree construction.
The code given here divides the input data into a training set and a test set using a 50:50 split. The training set is used to build the decision tree, which is then evaluated using the test set. Once the decision tree has been generated a set of classification rules is extracted. The rules are ordered according to size of the antecedent with rules with the largest antecedent being listed first (i.e. the most specific riles). The classifier is applied to the test set by selecting the first rule that satisfies the antecedent of each test set example. The last rule in the classifier is the default rule.
2. DOWNLOADING THE SOFTWARE |
The decision tree software comprises seven source files. These are provided from this WWW page together with three application classes. The source files are as follows:
The decisionTreeNode and RuleNode classes are separate to the remaining classes which are arranged in a class hierarchy of the form illustrated below.
|
The decision tree application classes included here are as follows:
There is also a "tar ball" decisionTrees.tgz that can be downloaded that includes all of the above source and application class files. It can be unpacked using tar -zxf decisionTrees.tgz.
The decision tree software has been implemented in Java using the Java2 SDK (Software Development Kit) Version 1.5.0, which should therefore make it highly portable. The code does not require any special packages and thus can be compiled using the standard Java compiler:
javac *.java
The code can be documented using Java Doc. First create a directory Documentation in which to place the resulting HTML pages and then type:
javadoc -d Documentation/ *.java
This will produce a hierarchy of WWW pages contained in the Document directory.
3. RUNNING THE SOFTWARE |
When compiled the software can be invoked in the normal manner using the Java interpreter:
java APPLICATION_CLASS_FILE_NAME
If you are planning to process a very large data set it is a good idea to grab some extra memory. For example:
java -Xms600m -Xmx600m APPLICATION_CLASS_FILE_NAME
The input to the software, in all cases is a (space separated) binary valued data set R. The set R comprises a set of N records such that each record (r), in turn, comprises a set of attributes. Thus:
R = {r | r = subset A}
Where A is the set of available attributes. The value D is then defined as:
D = |A|
We then say that a particular data set has D columns and N rows. A small example data sets might be as follows:
1 2 3 6 1 4 5 7 1 3 4 6 1 2 6 1 2 3 4 5 7
where, in this case, A = {1, 2, 3, 4, 5, 6, 7}. Note that attribute numbers are ordered sequentially commencing with the number 1 (the value 0 has a special meaning). The input file is included using a -F flag.
The program assumes that the last attribute in each row is the class for that record (exactly one class per record). The software does not calculate how many classes there are, the user has to inform the software using the -N flag.
If required (the DecTreeRandom_2file_App and DecTreeInfoGain_2file_App applications) the name of a seperate test file can be included using a -T flag.
Some example invocations, using a discretized/ normalised Pima Indians data set (also available from this site) and the two application classes provided by this WWW site, are given below:
java DecTreeRandomApp -FpimaIndians.D42.N768.C2.num -N2 java DecTreeInfoGainApp -N2 -FpimaIndians.D42.N768.C2.num java DecTreeRandom_2file_App -FpimaIndians.D42.N568.C2.num -N2 -TpimaIndians.D42.N200.C2.num java DecTreeRandomApp10 -FpimaIndians.D42.N768.C2.num -N2
(note that the ordering of flags is not significant). The output from each application is a set of CRs plus some diagnostic information (run time, number of rules generated etc.).
Tables 1 and 2 give some sample output. Note that in these examples the "most frequent attribute" (random) splitting criteria gives a better result than information gain.
SETTINGS -------- Training File name = pimaIndians.D42.N768.C2.num Number of classes = 2 Reading input file: pimaIndians.D42.N768.C2.num Number of records = 768 Number of columns = 42 Min support = 153.6 (records) Accuracy = 82.03125 Generation time = 0.03 seconds (0.0 mins) |
Table 1: java DecTreeRabdomApp -FpimaIndians.D42.N768.C2.num -N2
SETTINGS -------- Training File name = pimaIndians.D42.N768.C2.num Number of classes = 2 Reading input file: pimaIndians.D42.N768.C2.num Number of records = 768 Number of columns = 42 Min support = 153.6 (records) Accuracy = 92.96875 Generation time = 0.02 seconds (0.0 mins) |
Table 2: java DecTreeRabdomApp -N2 -FpimaIndians.D42.N768.C2.num
4. OUTPUT |
Some output is always generated such as: (i) the input parameters and start settings and (ii) "mile stones" during processing. Additional output statements can be included in application classes. The available additional output options are as follows:
double time1 = (double) System.currentTimeMillis(); // Program statements < INSTANCE_NAME > .outputDuration(time1, (double) System.currentTimeMillis());
5. SOME EXAMPLE RESULTS |
Below are some example test runs using number ten data sets taken from the UCI machine learning repository (Blake and Merz, 1998) and discretised using the LUCS-KDD DN software (Coenen 2003). The datasets may be obtained from:
http://www.csc.liv.ac.uk/~frans/KDD/Software/LUCS-KDD-DN/DataSets/dataSets.html
The naming convention used is as follows: D = total number of attributes (including class attributes), N = number of records and C = number of classes. The results were obtained using a 50:50 training/test set split.
| Data set | Information Gain | Random | ||
|---|---|---|---|---|
| Accuracy (%) | Run time (s) | Accuracy (%) | Run time (s) | |
| mushroom.D90.N8124.C2.num | 96.28 | 0.71 | 99.43 | 0.09 |
| nursery.D32.N12960.C5.num | 22.89 | 1.35 | 22.53 | 1.11 |
| pageBlocks.D46.N5473.C5.num | 92.54 | 0.14 | 92.51 | 0.07 |
| penDigits.D89.N10992.C10.num | 71.43 | 1.19 | 75.93 | 0.42 |
| pimaIndians.D42.N768.C2.num | 72.14 | 0.04 | 71.61 | 0.03 |
| soybean-large.D118.N683.C19.num | 47.21 | 0.09 | 46.63 | 0.02 |
| ticTacToe.D29.N958.C2.num | 58.87 | 0.04 | 75.99 | 0.03 |
| waveform.D101.N5000.C3.num | 73.04 | 0.72 | 96.80 | 0.21 |
| wine.D68.N178.C3.num | 57.30 | 0.04 | 66.29 | 0.03 |
| zoo.D42.N101.C7.num | 90.00 | 0.04 | 88.00 | 0.01 |
6. CONCLUSIONS |
The decision tree algorithms described here have been use successfully used by the LUCS-KDD research group to contrast and compare a variety of classification algorithms and techniques. The software is available for free, however the author would appreciate appropriate acknowledgement. The following reference format for referring to the decision tree implementation available from this www site is suggested:
Should you discover any "bugs" or other problems within the software (or this documentation), do not hesitate to contact the author.
REFERENCES |
Created and maintained by Frans Coenen. Last updated 9 January 2012