THE LUCS-KDD TFP ASSOCIATION RULE MINING ALGORITHM GUI
Frans Coenen
Department of Computer Science
The University of Liverpool
13 February 2008
|
|
CONTENTS
TFP (Total From Partial), also sometimes referred to as
Apriori-TFP, is an Association Rule Mining (ARM) algorithm,
developed by the LUCS-KDD
research team. The TFP algorithm in its current form together with further
details concerning P-trees and T-trees that are not available from this WWW
page can be found in Coenen et al. (2004a, 2004b). The code obtainable from
this page is a GUI version that includes (for comparison purposes)
the LUCS-KDD implementation of the FPgrowth algorithm
developed by Han
et al. (2000).
2. DOWNLOADING THE SOFTWARE |
The TFP ARM software comprises thirteen source files. These are provided
from this WWW page together with an application class. The
class files are as follows:
- AprioriTFP_GUI_App: Application class.
- AprioriTFPcontrol:
Class containing the
GUI control methods.
- AssocRuleMining:
Set of general ARM utility methods to allow: (i) data input and input
error checking, (ii) data preprocessing, (iii) manipulation of records (e.g.
operations such as subset, member, union etc.) and (iv) data and parameter
output.
- FPtree: An implementation of Han's FP-growth
ARM algorithm.
- PartialSupportTree:
Methods to implement the TFP (Total From Partial) ARM algorithm
using both the T-tree (Total support tree) and P-tree (Partial support tree
data structures.
- PtreeCanvas: Paint canvas
in which to display the P-tree.
- PtreeNode: Tree structure to store
Ptree nodes. Same as top level structure (see below) but with the addition of
a sibling branch.
- PtreeNodeTop:
Top level Ptree node structure. An array of such structures is created
in which to store the top level of the Ptree.
- PtreeWindow: Window
in which to place the P-tree paint canvas.
- RuleNode: Set of methods that
allow the creation and manipulation (e.g. ordering, etc.) of a list of ARs.
- TotalSupportTree:
Methods to implement the "Apriori-T" algorithm using the "Total support" tree
data structure (T-tree).
- TtreeCanvas: Methods to create a canvas onto
which the T-tree may be painted if rquired.
- TtreeNode: Methods concerned
with the structure of Ttree nodes. Arrays of
these structures are used to store nodes at the same level in any sub-branch of
the T-tree. Note this is a separate class to the other T-tree classes which are
arranged in a class hierarchy as illustrated below.
- TtreeWindow: Methods to create a GUI window to
contain a T-tree canvas (if requested).
The relationship between some of the principal classes is
as shown in the hierarchy in Table 1 below.
AssocRuleMining
|
TotalSupportTree
|
+----------------------+----------------------+
| |
FPtree ParrtialSupportTree
|
|
Table 1: Class hierarchy
There is also a "tar ball"
aprioriTFP_GUI.tgz
that can be downloaded that includes all of the above source and application
class files. It can be unpacked using tar -zxf aprioriT.tgz.
2.1. Compiling
The software has been implemented in Java which should therefore make it highly
portable. The code does not require any special packages and thus can be
compiled using the standard Java compiler:
javac *.java
2.2. Documentation
The code can be documented using Java Doc. First create a directory
Documentation in which to place the resulting HTML pages and then type:
javadoc -d Documentation/ *.java
This will produce a hierarchy of WWW pages contained in the Document
directory.
When compiled the software can be invoked in the normal manner using the
Java interpreter:
java AprioriTFP_GUI_App
The resulting GUI window is shown in Figure 1. If you are planning to process a very
large data set it is a good idea to
grab some extra memory. For example:
java -Xms600m -Xmx600m AprioriTFP_GUI_App
Figure 1: TFP GUI Window
The input to the software, in all cases is a (space separated)
binary valued data set D. The set R
comprises a set of N records such that each record (r),
in turn, comprises a set of attributes (A).
We say that a particular data set has M columns and
N rows. A small example data sets might be as follows:
1 2 3 6
1 4 5 7
1 3 4 6
1 2 6
1 2 3 4 5 7
where, in this case, A = {1, 2, 3, 4, 5, 6, 7}. Note that
attribute numbers are ordered sequentially commencing with the
number 1 (the value 0 has a special meaning).
EXAMPLE
- Select File -> Load Data and load the
pima.D38.N768.C2.num preprocessed data
set also available from this WWW page.
- Select File -> Load Schema and load the
pima.D38.N768.C2.outSchema
output schema file available from this WWW page. This is optional but it makes
the output more informative. For further details see
Section 5 below.
- Select Param. Input -> Support and input a support value of
70. This is very high in terms of ARM but for the example we only want
to generate a few rules, a much lower value is usually used.
- Select Param. Input -> Confidence and input a confidence value of
98. A high confidence value is not unusual in ARM, although a value of
about 80% is more common.
- Select Gen. Frequent Sets -> TFP (with X check), this will
generate a set of frerquent item sets with a support value greater or equal to
70% (the input value).
- Selecdt Generator -> Generate ARs (Min Conf), this will
generate a set of ARs with a confidence value greater or equal to
98% (the input value).
- Select Output -> Association Rules -> ARs Att. Numbers this will
produce output of the form given in Table 2, i.e. using attribute numbers.
- Select Output -> Association Rules -> ARs Output Schema this will
produce output of the form given in Table 3, i.e. more informative output using
the schema file loaded earlier.
Some notes on the suppoprt and confidence
framework are available at:
http://www.csc.liv.ac.uk/~frans/KDD/Software/KDDnotes/noteOnSuppConfFwork.html.
| OUTPUT ASSOCIATION RULES USING ATTRIBUTE NUMBERS: |
| (N) ANTECEDENT -> CONSEQUENT CONFIDENCE (%) |
| (1) {23 5} -> {19} 99.28 |
| (2) {9 23 5} -> {19} 99.26 |
| (3) {5} -> {19} 99.13 |
| (4) {9 5} -> {19} 99.1 |
| (5) {19 9 27 14} -> {23} 98.12 |
| (6) {19 9 1 14} -> {23} 98.06 |
| (7) {19 9 14 32} -> {23} 98.0 |
|
Table 2: ARs listed usinmg attribute numbers
| |
| OUTPUT ASSOCIATION RULES USING OUTPUT SCHEMA: |
| (N) ANTECEDENT -> CONSEQUENT CONFIDENCE (%) |
| (1) {BodyMassIndex <= 44.733 PlasmaGluConcent <= 142} ->
{2-HourSerumIns <= 341} 99.28 |
| (2) {DiastolicBldPress <= 94 BodyMassIndex <= 44.733 PlasmaGluConcent <= 142} ->
{2-HourSerumIns <= 341} 99.26 |
| (3) {PlasmaGluConcent <= 142} -> {2-HourSerumIns <= 341} 99.13 |
| (4) {DiastolicBldPress <= 94 PlasmaGluConcent <= 142} -> {2-HourSerumIns
<= 341} 99.1 |
| (5) {2-HourSerumIns <= 341 DiastolicBldPress <= 94 DiabPedFunc
<= 0.93 TricepsSkinFold <= 41} -> {BodyMassIndex <= 44.733} 98.12 |
| (6) {2-HourSerumIns <= 341 DiastolicBldPress <= 94 NumberPregnacies
<= 10 TricepsSkinFold <= 41} -> {BodyMassIndex <= 44.733} 98.06 |
| (7) {2-HourSerumIns <= 341 DiastolicBldPress <= 94 TricepsSkinFold
<= 41 Age <= 48} -> {BodyMassIndex <= 44.733} 98.0 |
|
Table 3: ARs listed using the output schema
|
The interface comprises six menus:
- File: Four items.
- About: Displays some instructions on a screen.
- Load Data: Allows input of a data set.
- Load Schema: Allows input of a data schema
describing the data set (optional). Further details about
schemas are given below.
- Exit: Shuts down the GUI.
- Parameter Input: Two items.
- Support: Allows input of a minimum support
threshold.
- Confidence: Allows input of a minimum confidence
threshold.
- Data Pre-Processing: Two items.
- Sort: Orders input data according to frequency of
1-itemsets (this make tree based ARM algorithms more efficient).
- Sort and Prune: Orders input data according to frequency
of 1-itemsets and then prunes all occurrences of unsupported
1-itemset (again this make tree based ARM algorithms more efficient)
- Generate Frequent Sets: Two items.
- TFP (With X-check): Invokes TFP algorithm (version with
cross checking).
- FP-Growth: Invokes FP Growth algorithm of Han et al.
- Generator: Options to generate ARs, two items.
- Generate ARs (Min Conf): Causes the generation of
ARs using the confidence measure.
- Generate ARs (Lift): Causes the generation of ARs
using the lift measure.
- Output: Many output options.
A feature of the TFP GUI is that output can be done
using an output schema. This is simply a text
file of attribute labels, one label per line (of course
we need as many labels as attributes). Table 4 gives
the first ten lines of an output schema for the UCL Pima
indians data set after it has been processed uisng the
LUCS-KDD Data Normalisation software. Note that in
this case the output schema was also automatically genertaed
by the LUCS-KDD Data Normalisation software, but it
could equally well have been "hand crafted". Both the
processed Pima indians
set and the associated
output schema are available from this WWW page.
| NumberPregnacies <= 10 |
| 10 < NumberPregnacies <= 11 |
| 11 < NumberPregnacies <= 12 |
| 12 < NumberPregnacies |
| PlasmaGluConcent <= 142 |
| 142 < PlasmaGluConcent <= 150 |
| 150 < PlasmaGluConcent <= 152 |
| 152 < PlasmaGluConcent |
| DiastolicBldPress <= 94 |
| 94 < DiastolicBldPress <= 103 |
| .... |
|
Table 4: Class hierarchy
- Coenen, F., Leng, P. and Ahmed, S. (2004a).
- Data Structures for association Rule
Mining: T-trees and P-trees.
IEEE Transactions on Data and Knowledge Engineering, Vol 16, No 6, pp774-778.
- Coenen, F.P. Leng, P. and Goulbourne, G. (2004b).
- Tree Structures for Mining Association
Rules.
Journal of Data Mining and Knowledge Discovery, Vol 8, No 1, pp25-51.
- Han, J., Pei, J. and Yiwen, Y. (2000).
- Mining Frequent Patterns Without Candidate Generation.
Proceedings ACM-SIGMOD International Conference on
Management of Data, ACM Press, pp1-12.
Created and maintained by
Frans Coenen.
Last updated 25 August 2008