|
1. INTRODUCTION |
TFPC, Total From Partial Classification (Coenen and leng 2005, Coenen et al. 2005), is a Classification Association Rule Mining (CARM) algorithm founded on the TFP (Total From Partial) Association Rule Mining (ARM) algorithm; which, in turn, is an extension of the Apriori-T (Apriori Total) ARM algorithm. All three algorithms --- Apriori-T, TFP and TFPC --- were developed by the LUCS-KDD research team. To fully understand the TFPC algorithm (as described here) it is recommended that interested readers first study the Apriori-T and TFP algorithms which are both also available from these WWW pages.
Briefly, Apriori-T is an "apriori" style algorithm (such first proposed by Agrawal and Srikant 1994) designed to process a binary valued input data set so as to identify frequent (large) itemsets and store the resulting frequent itemset information in a "reverse" set enumeration tree called a T-tree (Total support tree). This T-tree can then be processed to identify ARs.
TFP (Coenen et al. 2004a, 2004b) proceeds in a similar manner to Apriori-T except that, instead of operating with the raw input data directly, the input data is first preprocessed and placed in a P-tree (Partial support tree). As such TFP has all the functionality of Apriori-T with the additional advantage that it operates in a much more efficient manner. Advantages which are particularly significant when operating with data that features many duplicate records and/or records with duplicate leading sub-strings (in this respect attribute ordering is advantageous).
TFPC is an extension of TFP designed to produce Classification Association Rules (CARs) whereas Apriori-T and TFP are designed to generate ARs. In its simplest form TFPC determines a classifier according to given support and confidence thresholds. The nature of the selected thresholds are therefore the most significant influencing factors on classification accuracy. A more sophisticated version of TFPC uses a hill climbing technique to find a best accuracy given start support and confidence thresholds.
TFPC is different to many other CARM algorithms in that it does not use a "standard" two stage approach: (1) generate all CARs, (2) prune CARs to produce a classifier. Instead TFPC comprises only a single stage where CRs are generated as part of the frequent set identification process. On completion of the generation process the only pruning undertaken is to remove "unreachable rules" and the identification of a default rule.
This is an updated version of the software (Version 1 is still available). The principal distinction between versions 1 and 2 is that the GUI in version 2 allows: (i) seperate training and tests set files to be loaded (instead of using Ten Cross Validation (TCV) or some other training/test set split) and (ii) produces Area Under the operating Curve (AUC) values as well as accuracy values.
2. DOWNLOADING THE SOFTWARE |
The TFPC software comprises fourteen source files. These are provided from this WWW page together with seven application classes (see below). The source files are as follows:
The PtreeNode, PtreeNodeTop and TtreeNode classes are separate to the remaining classes which are arranged in a class hierarchy of the form illustrated in Figure 1.
|
The TFPC application classes included here are as follows:
There is also a "tar ball", aprioriTFP.tgz, that can be downloaded that includes all of the above source and application class files. It can be unpacked using tar -zxf aprioriTFPC.tgz.
The TFPC software has been implemented in Java using the Java2 SDK (Software Development Kit) Version 1.4.0, which should therefore make it highly portable. The code does not require any special packages and thus can be compiled using the standard Java compiler:
javac *.java
The code can be documented using Java Doc. First create a directory Documentation in which to place the resulting HTML pages and then type:
javadoc -d Documentation/ *.java
This will produce a hierarchy of WWW pages contained in the Document directory.
3. RUNNING THE SOFTWARE |
The software, as noted above, includes five distinct application classes these are discussed in some further detail below. Whatever the case, when compiled the software can be invoked in the normal manner using the Java interpreter:
java APPLICATION_CLASS_FILE_NAME -F FILE_NAME -N NUMBER_OF_CLASSES
The -F flag is used to input the name of the input file and the -N flag the number of classes represented in the input data.
If you are planning to process a very large data set it is a good idea to grab some extra memory:
java -Xms600m -Xmx600m APPLICATION_CLASS_FILE_NAME -N NUMBER_OF_CLASSES
The input to the software, in all cases is a (space separated) binary valued data set R. The set R comprises a set of N rows or records such that each record (r), in turn, comprises a set of attributes. Thus:
R = {r | r = subset A}
Where A is the set of available attributes. The value D is then defined as:
D = |A|
We then say that a particular data set has D columns and N rows. A small example data sets might be as follows:
1 2 3 6 1 4 5 7 1 3 4 6 1 2 6 1 2 3 4 5 7
where, in this case, A = {1, 2, 3, 4, 5, 6, 7} of which 6 and 7 represent the possible classes. Note that attribute numbers are ordered sequentially commencing with the number 1 (the value 0 has a special meaning). The input file is included using the -F flag.
The program assumes support and confidence default threshold values of 20% and 80% respectively. However the user may enter their own thresholds using the -S and -C flags. Thresholds are expressed as percentages.
Some example invocations, using a discretized/ normalised (using the LUCS-KDD DN software) Pima Indians data set (also available from this site, the raw data can be obtained from the UCI Machine learning Repository (Blake and Merz 1998)) and each of the six command line application classes provided by this WWW site, are given below:
java AprioriTFP_CR_App -FpimaIndians.D42.N768.C2.num -N2 java AprioriTFP_CR_App10 -FpimaIndians.D42.N768.C2.num -N2 java AprioriTFP_CR_AppHC8 -FpimaIndians.D42.N768.C2.num -N2 -S1 -C70 java AprioriTFP_CR_AppHC8_10 -S1 -FpimaIndians.D42.N768.C2.num -C70 -N2 java AprioriTFP_CR_AppHC_10 -FpimaIndians.D42.N768.C2.num -N2 -S1.5 -C70 java AprioriTFPclassOnlyApp -S5 -C70 -FpimaIndians.D42.N768.C2.num -N2
(note that the ordering of flags is not significant). The output from each application is a set of CRs (ordered according to confidence), i.e. a classifier, plus some diagnostic information (run time, number of rules generated etc.).
To enhance the efficiency of the CBA algorithm the attributers are ordered according to frequency. This means that the original set of input attributes is transformed (for example the original attribute 1 may no longer be attribute 1, while some other attribute N has been redesignated as attrinute 1). To avoid ordering users can comment out the lines:
newClassification.idInputDataOrdering(); newClassification.recastInputData();
in the appropriate application source code files.
The AprioriTFP_CR_App, AprioriTFP_CR_AppHC8 and AprioriTFPclassOnlyApp application (i.e. not the GUI or the applications that provide TCV --- Ten Cross Validation) include a number of output options (listed below) that may be included in the application file (in most cases these are already included and are either active or "commented out"). Note that it in many case it would not make sense to include them in the TCV applications because different classifiers will be generated for each 10th!
| Method Summary | |
| public static void | outputDataArray() Outputs the input data. |
| public static void | outputDataArraySize() Outputs size (number of records and number of elements) of stored input data set read from input data file. |
| public static void | outputConversionArrays() Outputs conversion array (used to renumber columns for input data in terms of frequency of single attributes --- reordering will enhance performance for some ARM algorithms). |
| public static void | outputSettings() Outputs command line values provided by user. |
| public static void | outputRulesWithDefault() Outputs contents of rule linked list (if any), with reconversion, such that last rule is the default rule. |
| public static void | outputRules() Outputs contents of rule linked list (if any), without identifying a default rule. |
| public static void | outputNumRules() Outputs number of generated rules. |
| public static void | outputFrequentSets() Commences the process of outputting the frequent sets contained in the T-tree. |
| public static void | outputNumFreqSets() Commences the process of counting and outputing number of supported nodes in the T-tree. A supported set is assumed to be a non null node in the T-tree. |
| public static void | outputNumUpdatess() Outputs the the number of support count increments (esentially a measure of the amount of work done). |
| public static void | outputStorage() Outputs the storage used by the complete T-tree (measure of storage efficiency). |
| public static void | outputTtree() Outputs the T-tree (in an encoded form, the GUI application includes an option to output the T-tree in graphical form). |
All the applications also include the options the elapsed time, in secondsusing the outputDuration(double time1, double time2) method. This is typically used to measure processing time as follows:
double time1 = (double) System.currentTimeMillis(); // Program statements < INSTANCE_NAME > .outputDuration(time1, (double) System.currentTimeMillis()); |
4. THE TFPC ALGORITHM IN MORE DETAIL |
The TFPC algorithm is founded on a tree data structure, the T-tree (Total support tree), designed to hold information regarding the frequent sets contained in a binary valued input data sets. A frequent set is any sub-set of the available set of attributes that appears in a given percentage (the support threshold) of the available records. Each level in the T-tree is represented as an array of references to T-tree nodes. T-tree nodes have two fields: (1) a support value, (2) a reference to the following level in the T-tree which will be set to null if there is no following sun-branch. A representation of the T-tree, for the data set presented in section 3 and assuming a support threshold of 25%, is given in Figure 1.
Figure 1: Example T-tree
Note, from Figure 1, that itemsets are stored in "reverse", for example the set {w,x,y,z} is stored commencing with attribute Z, this offers two principal advantages:
Note also, from Figure 1, that the tree increases in size as the tree progresses from left to right. To reduce the number of nodes it helps if the attributes (not the classes) are ordered according the frequency with which they occur in the data set.
A T-tree can be constructed in an apriori manner direct from an input file. However, in common with other apriori style approaches, this has the disadvantage that it requires a number of file accesses. To avoid this the TFPC algorithm first loads the input data in to another tree structure, the P-tree (Partial support tree), which contains all the required information contained in the input data while at the same time performing a partial support count. The P-tree is essentially a set enumeration tree or prefix tree. P-tree nodes have four fields:
and are organised so that siblings are ordered in geographical order and parent nodes represent leading subsets of the child nodes. The first node in the P-tree is pointed at by a variable startRef. When generating the P-tree, records are added by comparing the current current record (r) with existing nodes (N) in the P-tree commencing with the start nodes (startRef). When comparing r with an existing node N there are six possibilities (the < and > operators should be interpreted as "lexicographically" before/after).
|
To build a T-tree using a P-tree, and during the generation process produce CRs, the following "apriori" style procedure is applied:
|
In the following sub-sections the various application classes available from this WWW page are discussed in some further detail.
The AprioriTFP_CR_App application class uses a user supplied support and confidence threshold to produce a classifier. To enhance the efficiency of the TFPC it helps if the attributes are ordered according to frequency. This is achieved by the methods:
newClassification.idInputDataOrdering(); newClassification.recastInputData();
The first identifies the ordering and the second recasts the input data (apart from the class) according to the identified ordering.
There are many output options:
Note that the six of the above output methods are instance methods contained in either the PartialSupportTree class or its parent classes. The remainder of the above are instance methods of the RuleList class and thus must be called using an instance of that class. An instance of the RuleList class is created during processing and may be accessed using the getCurrentRuleListObject() public method found in the TotalSupportTree class. Thus, for example, the outputRules() method would be invoked as follows:
< INSTANCE_NAME > .getCurrentRuleListObject().outputRules();
The AprioriTFP_CR_App10 application class is similar to the AprioriTFP_CR_App application class (see above) except that classification accuracy is determined using TCV. The output options are much more limited here. Two options (with and without output):
newClassification.commenceTCVwithOutput(); newClassification.commenceTCV();
The first includes a summary of the TCV process.
The accuracy of the classifier produced by the previous two applications is dependent on the nature of the user specified support and confidence thresholds. The AprioriTFP_CR_AppHC8 application class uses a hill climbing technique to maximise accuracy by repeatedly adjusting the support and confidence threshold values.
Output options include:
Note that the first three of the above output methods are instance methods contained in either the PartialSupportTree class or its parent classes. The remainder of the above are instance methods of the RuleList class and thus must be called using an instance of that class as illustrated above
The hill climbing technique makes use of a 3-D playing area measuring 100x100x100. The axises of the paying area represent: (1) support thresholds, (2) confidence thresholds and (3) accuracies (all defined in terms of percentages). The technique commences with a start support and confidence threshold, describing a currentLoctaion (cl) in the playing area, from which a start accuracy is calculated using TFP. The technique and then attempts to move round the playing area with the aim of maximising the accuracy value. In its "quest" to do this it continuously generates data for a set of eight test locations --- hence the term 8 point hill climbing (the research team also tried a 4 point approach but with disappointing results). The test locations are defined by applying two values, sSupp (change in support threshold) and dConf (change in confidence threshold) to the support and confidence threshold values associated with cl. Consequently the current and test lactations form a 3x3 location grid with cl at the center (Figure 2). The test locations are labeled: n, ne, e, se, s, sw, w, nw and n, with the obvious interpretations.
Figure 1: 8 point location grid within hill climbing playing area on plane of support and confidence axises.
Information regarding the current and test locations in the location grid is contained in a set of structures with the following fields: support, confidence, accuracy, inArea and notCalculated. The last two take boolean values. The inArea field is set to true if the associated location is within the playing area and false false otherwise (it is possible, when working close to the edge of the paying area, to generate a test location which is outside the area). The notCalculated field is set to false if an accuracy has previously been calculated and true otherwise --- this is to prevent the application of TFPC to support and confidence locations for which an accuracy value is already known. The hill climbing process operates as shown in table 3.
|
Note that the procedure makes use of three constants: MIN_dSUPP the minimum adjustment in the support value usually set to 0.1, MIN_dCONF the minimum adjustment in the confidence value usually set to 1.0, BTAR (Best To Available Ratio) the ration of the number of best locations to the number of available locations usually set to 2/3.
The above procedure loops continuously until either: (1) no new location grid can be generated because the dSuppand dConf values have been reduced to a minimum (line 7) in which case the current location is taken as producing the best accuracy, or (2) the majority of locations (as defined using the BTAR constant) display an equal best accuracy (lines 14 and 16) in which case one of these locations is selected as producing the best accuracy. In determining whether a majority of locations display a best accuracy the number of "available" locations is taken into consideration. It may be that (say) if the current location is situated in a corner of the playing area five of the test location will be outside the playing area (i.e. the number of available locations will be equivalent to 4).
Where the above end conditions do not apply there are two options, either: (1) if the best accuracy is featured by the current location (regardless of whether some other locations may also feature this best accuracy) zero in on this location by reducing the "grid size", or (2) move to a location with the best accuracy. In the case of option 2 the discussion to move to a location with the best accuracy is straight forward if there is only one such location. The hill climbing process proceeds by assigning weightings to "potential directions to move in" and then selecting a direction according to these weightings.
The AprioriTFP_CR_AppHC8_10 application class is similar to the AprioriTFP_CR_AppHC8 class except that it incorporates TCV with hill climbing instead of using a 50:50 training/test set split.
The AprioriTFP_CR_AppHC8_10 application class (see above) is computationally expensive because of the amount of "hill-climbing" that it includes. The AprioriTFP_CR_AppHC_10 application class is more computationally effective because it only uses hill climbing (with a 50:50 training/test set split) to obtain optimum support and confidence thresholds. And then make use a straight forward TCV approach to establish accuracy using the identified support and confidence. Output is fixed. Note that AprioriTFP_CR_AppHC_10 is not as accurate as AprioriTFP_CR_AppHC8_10 because the latter generates optimum support and confidence thresholds for each 9/10ths.
TFPC algorithm for generating a classifier using the entire data set as the training set (i.e. without using part of the data as a test set).
5. CONCLUDING REMARKS |
The TFPC algorithm described here have been use successfully by the LUCS-KDD research group to contrast and compare a variety of classification algorithms and techniques. The software is available for free, however the author would appreciate appropriate acknowledgment. The following reference format for referring to the TFPC implementation available here is suggested:
Should you discover any "bugs" or other problems within the software (or this documentation), do not hesitate to contact the author.
REFERENCES |
Created and maintained by Frans Coenen. Last updated 20 January 2011