|
|
1. INTRODUCTION |
Apriori-T (Apriori Total) is an Association Rule Mining (ARM) algorithm, developed by the LUCS-KDD research team which makes use of a "reverse" set enumeration tree where each level of the tree is defined in terms of an array (i.e. the T-tree data structure is a form of Trie). The Apriori-T algorithm was actually developed as part of a more sophisticated ARM algorithm --- Apriori-TFP (Apriori Total From Partial). Both algorithms are described in Coenen and Leng (2004).
2. DOWNLOADING THE SOFTWARE |
The Apriori-T ARM software comprises four source files. These are provided from this WWW page together with three application classes. The source files are as follows:
|
The application classes are as follows:
There is also a "tar ball" aprioriT.tgz that can be downloaded that includes all of the above source and application class files. It can be unpacked using tar -zxf aprioriT.tgz.
The software has been implemented in Java using the Java2 SDK (Software Development Kit) Version 1.4.0, which should therefore make it highly portable. The code does not require any special packages and thus can be compiled using the standard Java compiler:
javac *.java
The code can be documented using Java Doc. First create a directory Documentation in which to place the resulting HTML pages and then type:
javadoc -d Documentation/ *.java
This will produce a hierarchy of WWW pages contained in the Document directory.
3. RUNNING THE SOFTWARE |
When compiled the software can be invoked in the normal manner using the Java interpreter:
java APPLICATION_CLASS_FILE_NAME
If you are planning to process a very large data set it is a good idea to grab some extra memory. For example:
java -Xms600m -Xmx600m APPLICATION_CLASS_FILE_NAME
The input to the software, in all cases is a (space separated) binary valued data set R. The set R comprises a set of N records such that each record (r), in turn, comprises a set of attributes. Thus:
R = {r | r = subset A}
Where A is the set of available attributes. The value D is then defined as:
D = |A|
We then say that a particular data set has D columns and N rows. A small example data sets might be as follows:
1 2 3 6 1 4 5 7 1 3 4 6 1 2 6 1 2 3 4 5 7
where, in this case, A = {1, 2, 3, 4, 5, 6, 7}. Note that attribute numbers are ordered sequentially commencing with the number 1 (the value 0 has a special meaning). The input file is included using a -F flag.
The program assumes support and confidence default threshold values of 20% and 80% respectively. However the user may enter their own thresholds using the -S and -C flags. Thresholds are expressed as percentages.
Some example invocations, using a discretized/ normalised Pima Indians data set (also available from this site) and each of the three application classes provided by this WWW site, are given below:
java AprioriTapp -FpimaIndians.D42.N768.C2.num java AprioriTsortedApp -FpimaIndians.D42.N768.C2.num -S2.5 java AprioriTsortedPrunedApp -FpimaIndians.D42.N768.C2.num -S1 -C50
(note that the ordering of flags is not significant). The output from each application is a set of ARs (ordered according to confidence) plus some diagnostic information (run time, number of rules generated etc.).
4. MORE DETAIL ON APPLICATION CLASSES |
Apriori-T Applications classes all have the following basic form:
|
Some output is always generated such as: (1) the input parameters and start settings and (2) "mile stones" during processing. Additional output statements can be included in application classes. The available additional output options are as follows:
double time1 = (double) System.currentTimeMillis(); // Program statements < INSTANCE_NAME > .outputDuration(time1, (double) System.currentTimeMillis());
Note that the first nine of the above output methods are instance methods contained in either the TotalSupportTree class or the AssocRuleMining class the latter are therefore inherited by the TotalSupportTree class). The last three of the above are instance methods of the RuleList class and thus must be called using an instance of that class. An instance of the RuleList class is created during processing and may be accessed using the getCurrentRuleListObject() public method found in the TotalSupportTree class. Thus, for example, the outputRules() method would be invoked as follows:
< INSTANCE_NAME > .getCurrentRuleListObject().outputRules();
5. THE T-TREE DATA STRUCTURE |
A T-tree is a set enumeration tree structure in which to store frequent item set information. What distinguishes the T-tree from other set enumeration tree structures is:
Thus given a data set of the form:
{1 3 4}
{2 4 5}
{2 4 6}
and assuming a support count of less than 1, we can identify the following frequent sets (support counts in parenthesise):
1 (1) 2 (2) 3 (1) 4 (3) 5 (1) 6 (1) |
1 3 (1) 1 4 (1) 2 4 (2) 2 5 (1) 2 6 (1) 3 4 (1) |
4 5 (1) 4 6 (1) 1 3 4 (1) 2 4 5 (1) 2 4 6 (1) |
These can be presented in a T-tree of the form given in Figure 1 (note the reverse nature of the tree). The internal representation of this "reverse" T-tree founded on arrays of T-tree nodes that can be conceptualised as shown in Figure 2. The storage required for each node (representing a frequent set) in the T-tree is then 12 Bytes:
Thus house keeping requirements are still 8 Bytes, however storage gains are obtained because it is not necessary to explicitly store individual attribute labels (i.e. column numbers representing instantiated elements) as these are implied by the indexing. Of course this approach must also require storage for "stubs" (4 Bytes) where nodes are missing (unsupported). Overall the storage advantages for this technique is thus, in part, dependent on the number of missing combinations contained in the data set.
Figure 1: Conceptual example of the T-tree data structure
Figure 2: Internal representation of T-tree presented in Figure 1
6. THE APRIOR-T ALGORITHM |
The T-tree described above is built in an apriori manner, as first proposed by Agrawal and Srikant (1994), starting with the one item sets and continuing until there are no more candidate N-itemsets. Thus, at a high level, a standard apriori algorithm is used.
K <-- 1
nextlevelFlag=true;
generate candidate K-itemsets
Loop
count support values for candidate K-itemsets
prune unsupported K-itemsets
K <-- 2
generate candidate K2 itemsets from previous level
if no K2 itemsets break
end Loop
In more detail the Apriori-T algorithm commences with a method createTotalSupportTree which is presented in Table 1. The method commences by generating the top level of the T-tree (createTtreeTopLevel) and then generating the next level (generateLevel2) from the supported sets in level 1 --- remember that if a 1-itemset is not supported none of its super sets will be supported. Once we have generated level 2 further levels can be generated (createTtreeLevelN).
Method: createTotalSupportTree Arguments: none Return: none Fields: NA ------------------------------ createTtreeTopLevel() generateLevel2() createTtreeLevelN() ------------------------------ |
Table 1: The createTotalSupportTree method
The method to generate the top level of a T-tree is as presented in Table 2. Note that the method includes a call to a general T-tree utility method pruneLevelN described later.
Method: createTtreeTopLevel
Arguments: none
Return: none
Fields: D number of attributes
startTtreeRef start of T-tree
dataArray 2D array holding input sets
--------------------------------------
Dimension and initialise top level of T-tree (length = D)
Loop from i = 0 to i = number of records in dataArray
Loop from j = 0 to j = number of attributes in dataArray[i]
startTtreeRef[i][j]++
End loop
End Loop
pruneLevelN(startTtreeRef,1)
--------------------------------------
|
Table 2: The createTtreeTopLevel method
The generateLevel2 methods loops through the top level of the T-tree creating new T-tree arrays where appropriate (i.e. where the immediate parent nodes is supported). The method is outlined in Table 3. Note that the method includes a call to a general T-tree utility method generateNextLevel (also described later).
Method: generateLevel2
Arguments: none
Return: none
Fields: startTtreeRef start of T-tree
nextlevelFlag set to true if next level exists
--------------------------------------
nextlevelFlag <-- false
loop from i = 0 to i = number of nodes in T-tree top level
if startTtreeRef[i] != null
generateNextLevel(startTtreeRef,i,{i})
End Loop
--------------------------------------
|
Table 3: The createTtreeTopLevel method
Once we have a top level T-tree and a set of candidate second levels (arrays) we can proceed with generating the rest of the T-tree using an iterative process --- the createTtreeLevelN method presented in Table 4. The createTtreeLevelN method calls a number of other methods addSupportToTtreeLevelN, pruneLevelN (also called by the createTtreeTopLevel method) and generateLevelN which are presented in Tables 5, 6 and 7 respectively.
Method: createTtreeLevelN
Arguments: none
Return: none
Fields: startTtreeRef start of T-tree
nextlevelFlag set to true if next level exists
--------------------------------------
K <-- 2
while (nextlevelFlag)
addSupportToTtreeLevelN(K)
pruneLevelN(startTtreeRef,K)
nextlevelFlag <-- false
generateLevelN(startTtreeRef,K,{})
K <-- K+1
End loop
--------------------------------------
|
Table 4: The createTtreeLevelN method
Method: addSupportToTtreeLevelN
Arguments: K the current level
Return: none
Fields: startTtreeRef start of T-tree
dataArray 2D array holding input sets
--------------------------------------
Loop from i = 0 to i = number of records in dataArray
length = number of attributes in dataArray[i]
addSupportToTtreeFindLevel(startTtreeRef,K,length,
dataArray[i]
End loop
--------------------------------------
Method: addSupportToTtreeFindLevel
Arguments: linkref reference to current array in T-tree
K level marker
length length of array at current branch in t-tree
record input data record under consideration
Return: none
Fields: None
--------------------------------------
if (K=1)
Loop from i = 0 to i = length
if (linkref[record[i]] != null)
increment linkref[record[i]].support by 1
End if
End Loop
else
Loop from i = K-1 to i = length
if (linkref[record[i]] != null &&
linkref[record[i]].childRef != null)
addSupportToTtreeFindLevel(linkref[record[i]].childRe,
K-1,i,record)
End if
End loop
end if else
--------------------------------------
|
Table 5: The addSupportToTtreeLevelN method and its related addSupportToTtreeFindLevel method
Method: pruneLevelN
Arguments: linkref reference to current array in T-tree
K level marker
Return: true if entire array pruned
Fields: minSupport the minimum support threshold
--------------------------------------
if (K=1)
allUnsupported <-- true
Loop from i = 1 to i = length of array
if (linkref[i] != null)
if (linkref[i].support < minSupport)
linkref[i] <-- null
else allUnsupported <-- false
End if else
End if
return allUnsupported
End Loop
else
Loop from i = K to i = length of array
if (linkref[i] != null)
if (pruneLevelN(linkref[i].childRef,K-1)
linkref[i].childRef <-- null
End if
End if
End loop
End if else
--------------------------------------
|
Table 6: The pruneLevelN
Method: generateLevelN
Arguments: linkref reference to current array in T-tree
K level marker
I the item set represented by the parent node
Return: None
Fields: None
--------------------------------------
if (K=1)
Loop from i = 2 to i = length of array
if (linkref[i] != null)
generateNextLevel(linkref,i,I union i)
End if
End loop
else
Loop from i = K to i = length of array
if (linkref[i] != null && linkref[i].childRef != null)
generateLevelN(linkref[i].childRef,K-1,I union i
--------------------------------------
Method: generateNextLevel
Arguments: linkref reference to current array in T-tree
i index to parent node in vurrent array
I the item set represented by the parent node
Return: None
Fields: nextLevelExists flafg set to true or false
--------------------------------------
linkref[i].childRef <-- array of empty t-tree nodes of length i
Loop from j = 1 to j = i
if (linkRef[j] != null)
newI <-- I union j
if (all newI true subsets are all supported in the
T-tree sofar)
linkRef[i].childRef[j] <-- new T-tree Node
nextLevelExists <-- true
else linkRef[i].childRef[j] <-- null
End if else
End if
End loop
--------------------------------------
|
Table 7: The generateLevelN method and its related generateNextLevel method
7. CONCLUSSIONS |
The Apriori-T ARM algorithm described here have been use successfully by the LUCS-KDD reseach group to contrast and compare a variety of ARM algorithms and techniques. The software is available for free, however the author would appreciate appropriate acknowledgement. The following reference format for referring to the Apriori-T implementation available here is suggested:
Should you discover any "bugs" or other problems within the software (or this documentation), do not hesitate to contact the author.
REFERENCES |
Created and maintained by Frans Coenen. Last updated 17 March 2004