|
|
|
1. INTRODUCTION |
Association Rule Mining (ARM) is concerned with finding "interesting" patterns in binary valued data sets.
In classical ARM the assumption is that all attributes in the dataset are of equal significance; in practice there are many applications where this is not the case. Consequently Weighted ARM (WARM) is a variation of classical ARM where individual attributes are weighted. ARM algorithms typically obtain an efficiency gain by using what is commonly termed the "Downward Closure Property" (the DC property) of items sets, i. e. that for an item set to be frequent all its subsets must be frequent; or, vice-versa, if an itemset is not frwquent that non of its supersets can be frequent. There are many schemes proposed in the literature for implementing WARM, some which maintain the DC property and some which do not. The view here is that for WARM to be effective the application of weightings must be such that the DC property is maintained.
From the above ARM data sets are expected to be binary valued. Of course, in practice, not all fields in the data sets we want to apply ARM to are binary valued. Given a field that can take one of a sequence of values, for example an attribute colour with a value set of {Red, Green, Blue}, the practice is to turn this into a sequence of binary valued attributes (three in the example). Given a numeric attribute that can take a range of values, for example the attribute age with a value range of {0..120} the practice is to "range" this set of values into a sub ranges. In the case of the above age attribute these might be {0..30, 31..60, 61..90, 91..120}, i.e. four binary valued attributes. We can also also ascribe labels to these ranges, for example {young, middle_aged, old, very_old}. The LUCS-KDD research team have produced software to perform such LUCS-KDD Data Normalisation"> discretisation/normalisation.
The problem with the above approach to dividing ranged values into sub-ranges is that the boundaries between the sub-ranges are crisp boundaries. Thus, in the above example, an age of 30 will be allocated to the range young when it could be argued that this value should also be partly associated with the range middle_aged. It is suggested that this crisp boundary problem effects the outcome of ARM when applied to ranged data.
Fuzzy Association Rule Mining (FARM) is intended to address the crisp boundary problem encountered in traditional ARM. The principal idea is that ranged values can belong to more than one sub-range, we say that the value has a membership degree that associates it with one or more sub-ranges. The total membership degree for a value typically totals to one. There are many different membership functions reported in the literature that are used to calculate the membership degree foe a given numeric value.
The software that may be downloaded from this page allows: WARM, FARM and FWARM; the last being a combination of Fuzzy and Weighted ARM. All are variations of the established Apriori-T algorithm. Apriori-T, like many ARM algorithms uses the support-confidence framework, a feature of DC property discussed above.
2. DOWNLOADING THE SOFTWARE |
The Fuzzy and/or Weighted Apriori-T software comprises fourteen source files. These are provided from this WWW page together with an application class. The source files are as follows:
The decisionTreeNode and RuleNode classes are separate to the remaining classes which are arranged in a class hierarchy of the form illustrated below.
|
One fuzzy weighted Apriori-T application class is included here is as follows:
There is also a "tar ball" weightedFuzzyAprioriT.tgz that can be downloaded that includes all of the above source and application class files. It can be unpacked using tar -zxf weightedFuzzyAprioriT.tgz.
The decision tree software has been implemented in Java using the Java2 SDK (Software Development Kit) Version 1.5.0, which should therefore make it highly portable. The code does not require any special packages and thus can be compiled using the standard Java compiler:
javac *.java
The code can be documented using Java Doc. First create a directory Documentation in which to place the resulting HTML pages and then type:
javadoc -d Documentation/ *.java
This will produce a hierarchy of WWW pages contained in the Document directory.
3. RUNNING THE SOFTWARE |
When compiled the software can be invoked in the normal manner using the Java interpreter:
java APPLICATION_CLASS_FILE_NAME
If you are planning to process a very large data set it is a good idea to grab some extra memory. For example:
java -Xms600m -Xmx600m APPLICATION_CLASS_FILE_NAME
The input to the software always comprises at least two files:
In addition an optional input schema file may also be loaded. Each style of input file is discussed further in the following sub- sections.
There are two forms of input data: (i) binary valued data for Weighted ARM only and (ii) fuzzy valued data for Fuzzy or Weighted Fuzzy ARM.
Binary Valued Data
Binary valued input comprises a (space separated) binary valued data set R. The set R comprises a set of N records such that each record (r), in turn, comprises a set of attributes. Thus:
R = {r | r = subset A}
Where A is the set of available attributes. The value D is then defined as:
D = |A|
We then say that a particular data set has D columns and N rows. A small example data sets might be as follows:
1 2 3 6 1 4 5 7 1 3 4 6 1 2 6 1 2 3 4 5 7
where, in this case, A = {1, 2, 3, 4, 5, 6, 7}. Note that attribute numbers are ordered sequentially commencing with the number 1 (the value 0 has a special meaning).
Fuzzy Valued Data
In this case the input is a (space separated) numeric data set R with values ranging between 0..1. Two example data sets test1.num and test2.num are available from this WWW page to act as an examples for users. The test1.num data set (the smaller of the two) is as follows:
<c,1.0> <a,0.25> <b,0.5> <a,0.5> <c,0.75> <a,0.5> <b,0.25>
Weightings are loaded in a separate text file. The assumption is that the weighting for each attribute is constant. Weightings are expressed in the form of real numbers between 1.0 and 0.0. The format is one weighting per line. Thus there should be as many lines in the weighting file as attributes. Example (the testWeightings1.txt file available from this WWW page):
0.2 1.0 0.3
A feature of the GUI is that output can be done using an output schema. This is simply a text file of attribute labels, one label per line (of course we need as many labels as attributes). Table 2 gives the output schema for test1 data file test1.schema.
|
Table 2: Schema file example
4. SUPPORT CALCULATION |
There are three different mechanisms for calculating support according to the nature of the algorithm chosen:
For single item sets the support is simply the sum of the weightings divided by the total number of records.
For two item sets and greater the support is the sum of the products of the weightings.
Thus given the data set (where the attribute set is {a, b, c}):
c b c a c a b
and a weighting file of the form:
0.25 0.5 0.75
The support calculations will be:
{a} = (0.25+0.25)/4 = 0.125
{b} = (0.5+0.5)/4 = 0.25
{c} = (0.75+0.75+0.75)/4 = 0.563
{a b} = ((0.25*0.5))/4 = 0.031
{a c} = ((0.25*0.75))/4 = 0.047
{b c} = ((0.5*0.75))/4 = 0.094
For single item sets the support is simply the sum of the membership functions divided by the total number of records.
For two item sets and greater the support is the sum of the products of the membership functions.
Thus given the data set (again where the attribute set is {a, b, c}):
<c,1.0> <b,0.5><c,0.5> <a,0.5><c,0.5> <a,0.5><b,0.5>
The support calculations will be:
{a} = (0.5+0.5)/4 = 0.25
{b} = (0.5+0.5)/4 = 0.25
{a} = (1.0+0.5+0.5)/4 = 0.5
{a b} = ((0.5*0.5))/4 = 0.06
{a c} = ((0.5*0.5))/4 = 0.06
{b c} = ((0.5*0.5))/4 = 0.06
For single items sets the support is the sum of the product calculation for each weighting/fuzzy member ship pair (w*f). For 2-itemsets and larger the support is the sum of the products of all the weightings and fuzzy membership calculations. Thus given the data set below (for ease of reading the attribute numbers have been replaced by letters):
<c,1.0> <a,0.25> <b,0.5> <a,0.5> <c,0.75> <a,0.5> <b,0.25>
and the associated weighting file:
0.2 1.0 0.3
The support calculations will be as follows:
{a} = ((0.25*0.2)+(0.5*0.2)+(0.5*0.2))/4 = *(0.05 + 0.1 + 0.1)/4 = 0.0625
{b} = ((0.5*1.0)+(0.25*1.0))/4 = (0.5+0.25)/4 = 0.1875
{c} = ((1.0*0.3)+(0.75*0.3))/4 = (0.3+0.225)/5 = 0.13125
{a b} = ((0.25*0.2*0.5*1.0)+(0.5*0.2*0.25*1.0))/4 = (0.025+0.025)/4 = 0.0125
{a c} = ((0.5*0.2*0.75*0.3))/4 = 0.0225/4 = 0.005625
5. NOTES ON GUI INTERFACE |
The interface has six menus. Each is discussed below.
6. CONCLUSIONS |
The Weighted Fuzzy Apriori-T algorithms described here has been use successfully used by the LUCS-KDD research group in its work on FWARM. The software is available for non-commercial usage free charge, however the author would appreciate appropriate acknowledgement. The following reference format for referring to the fuzzy Apeiori-T implementation available from this www site is suggested:
Should you discover any "bugs" or other problems within the software (or this documentation), do not hesitate to contact the author.
Created and maintained by Frans Coenen. Last updated 26 September 2008