1. INTRODUCTION
The LUCS-KDD (Liverpool University Computer Science - Knowledge Discovery in Data) ARM (Association Rule Mining) DN (Discretization/ Normalisation) software has been developed to convert data files available in the UCI data repository ([1]) into a binary format suitable for use with Association Rule Mining (ARM) applications. The software can, of course, equally well be used to convert data files obtained from other sources. We define discretisation and normalisation as follows:
ARM requires input data sets where each record represents an itemset which in turn is a subset of the available set of attributes. Thus:
N = The number of available attributes
A = A set of attributes = {a | 0 < a <= N}
D = A data set comprising M records (R)
R = {r | r subset A}
For example, if A = {1 2 3 4} then we might have a data set of the form:
1 2 3 4
1 2 3
1 2 4
1 3 4
2 3 4
1 2
1 3
1 4
2 3
2 4
3 4
1
2
3
4
which may be used as the input to an ARM software system. Generally speaking, real data sets do not comprise only binary fields. Typically such data sets comprise a mixture of nominal, continuous and integer fields. Thus real data will require discretisation/normalisation, i.e. conversion into a binary valued format, before it can be used for ARM. How this is done using the LUCS-KDD DN software will depend on the nature of each field in the data:
Thus given a (space separated) data set with the schema {colour (nominal), average (double), age (int), class (int)} such as:
red 25.6 56 1
green 33.3 1 1
green 2.5 23 0
blue 67.2 111 1
red 29.0 34 0
yellow 99.5 78 1
yellow 10.2 23 1
yellow 9.9 30 0
blue 67.0 47 0
red 41.8 99 1
This would be discretised/normalised as follows:
3 5 9 13
2 5 8 13
2 5 8 12
1 7 11 13
3 5 9 12
4 7 10 13
4 5 8 13
4 5 8 12
1 7 9 12
3 6 11 13
The entire data set is now represented by 13 binary valued attributes and can be mined using appropriate ARM software.
1.1 Missing values
It is not unusual for data sets in the UCI repository to include missing values. The convention is to indicate these using a '?' character. This convention has also been adopted for the purpose of the LUCS-KDD DN software described here. WARNING: Where a missing value has occurred in a numeric field this is recorded internally as -100000001.0. In the unfortunate case where input data includes very large negative numbers (integers or doubles) these will be identified as missing values!
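The missing value convention can be illustrated with a short Java sketch. This is not taken from LUCS_KDD_DN_ARM.java: the class and method names are hypothetical, and the equality test used to detect a missing value is an assumption (the software may test differently). Only the '?' marker and the sentinel -100000001.0 come from the text above.

```java
/** Sketch only: names are hypothetical, not taken from LUCS_KDD_DN_ARM.java. */
final class MissingValueSketch {
    // Sentinel quoted in the documentation for a missing numeric value.
    static final double MISSING = -100000001.0;

    /** Parses one numeric field, mapping the '?' marker to the sentinel. */
    static double parseNumericField(String token) {
        String t = token.trim();
        return t.equals("?") ? MISSING : Double.parseDouble(t);
    }

    /** Assumed test: a stored value equal to the sentinel is treated as missing. */
    static boolean isMissing(double value) {
        return value == MISSING;
    }

    public static void main(String[] args) {
        // The last field is genuine data that happens to equal the sentinel.
        String[] fields = {"25.6", "?", "-100000001"};
        for (String f : fields) {
            double v = parseNumericField(f);
            System.out.println(f + " -> missing? " + isMissing(v));
        }
    }
}
```

Under this assumption the third field is reported as missing even though it was present in the input, which is the collision the WARNING describes.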
The LUCS-KDD DN software has been implemented in Java using the Java2 SDK (Software Development Kit) Version 1.4.0, which should make it highly portable. In the interest of "user friendliness" the user interface has been implemented as a GUI. The source code is available at:
http://csc.liv.ac.uk/~frans/KDD/Software/LUCS_KDD_DN_ARM/LUCS_KDD_DN_ARM.java
The code does not require any special packages and thus can be compiled using the standard Java compiler:
javac LUCS_KDD_DN_ARM.java
An example data file and schema are also available from this WWW page (they are used to illustrate the software in Section 4):
The Pima Indians data set is a well known data set often used for "bench marking" purposes within the Knowledge Discovery in Data (KDD) community. The data set is one of many available from the UCI Machine Learning Repository (Blake and Merz, 1998).
Before the LUCS-KDD DN software can convert a data file into the desired binary valued format it needs to know the schema for the data to be converted. The schemas for the data files in the UCI repository are described (usually) in free text format within a .names file. Users of the LUCS-KDD DN software will thus have to create their own schema files before any conversion can take place. LUCS-KDD DN Schema files comprise three lines of text each containing a sequence of N literals separated by white space (not carriage returns) where N is the number of attributes in the data set to be converted:
The schema file for the above example data set would be:
nominal double int nominal
colour average age class
blue/green/red/yellow null null 0/1
The schema file for the Pima Indians UCI data set, available from this WWW page, is as follows:
int int int int int double double int nominal
NumberPregnacies PlasmaGluConcent DiastolicBldPress TricepsSkinFold 2-HourSerumIns BodyMassIndex DiabPedFunc Age Class
null null null null null null null null 0/1
(Remember that data fields of "type" unused are ignored by the LUCS-KDD DN software and do not appear in the output.)
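The three-line schema layout described above can be read with a few lines of Java. The following is a minimal sketch, not the parser used by the LUCS-KDD DN software: the class, method and field names are hypothetical, and it assumes that "null" marks a field with no listed values and that "/" separates nominal values, as the two example schema files suggest. It uses generics, so it needs a more recent Java than the 1.4.0 SDK mentioned earlier.

```java
import java.util.ArrayList;
import java.util.List;

/** Sketch only: names are hypothetical, not taken from LUCS_KDD_DN_ARM.java. */
final class SchemaSketch {
    /** One attribute: its type, its name and (for nominal fields) its values. */
    static final class Field {
        final String type;
        final String name;
        final String[] values; // empty when the third schema line says "null"

        Field(String type, String name, String[] values) {
            this.type = type;
            this.name = name;
            this.values = values;
        }
    }

    /** Builds the field list from the three schema lines (types, names, values). */
    static List<Field> parse(String typeLine, String nameLine, String valueLine) {
        String[] types = typeLine.trim().split("\\s+");
        String[] names = nameLine.trim().split("\\s+");
        String[] values = valueLine.trim().split("\\s+");
        if (types.length != names.length || types.length != values.length) {
            throw new IllegalArgumentException(
                "Each schema line must contain the same number of literals");
        }
        List<Field> fields = new ArrayList<>();
        for (int i = 0; i < types.length; i++) {
            String[] nominal =
                values[i].equals("null") ? new String[0] : values[i].split("/");
            fields.add(new Field(types[i], names[i], nominal));
        }
        return fields;
    }

    public static void main(String[] args) {
        List<Field> schema = parse(
            "nominal double int nominal",
            "colour average age class",
            "blue/green/red/yellow null null 0/1");
        for (Field f : schema) {
            System.out.println(f.type + ": " + f.name + " " + String.join(" ", f.values));
        }
    }
}
```

Rejecting schema lines of unequal length is a design choice of this sketch; the source does not say how the software reacts to a malformed schema file.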
When compiled the software can be invoked in the normal manner using the Java interpreter:
java LUCS_KDD_DN_ARM
If you are planning to process a very large data set, such as covertype, it might be an idea to grab some extra memory. For example:
java -Xms600m -Xmx600m LUCS_KDD_DN_ARM
Once invoked you will see an interface of the form presented in Figure 1. Note that in Figure 1 most of the interface buttons are "greyed out"; this is because, at this stage, we have no data to work with. The two functional buttons are:
Note also, in Figure 1, that some instructions are presented in the main window of the GUI.
Figure 1: LUCS-KDD DN interface on start up.
A schema file is loaded by selecting the "Input Schema" button. Once the schema has been "read" three more buttons become active:
(1) int: NumberPregnacies
(2) int: PlasmaGluConcent
(3) int: DiastolicBldPress
(4) int: TricepsSkinFold
(5) int: 2-HourSerumIns
(6) double: BodyMassIndex
(7) double: DiabPedFunc
(8) int: Age
(9) nominal: Class { 0 1 }
Figure 2 shows the LUCS-KDD DN interface once a schema file (the example for the Pima Indians data set given above) has been loaded and listed using the "Input Schema" and "List Schema" buttons respectively.
Figure 2: LUCS-KDD DN interface with schema file loaded.
A data file can be loaded using either the "Input Data (' ' separated)" or "Input Data (',' separated)" buttons. Some example raw data (the first few lines from the Pima Indians comma separated data set) is given below:
6,148,72,35,0,33.6,0.627,50,1
1,85,66,29,0,26.6,0.351,31,0
8,183,64,0,0,23.3,0.672,32,1
1,89,66,23,94,28.1,0.167,21,0
0,137,40,35,168,43.1,2.288,33,1
5,116,74,0,0,25.6,0.201,30,0
3,78,50,32,88,31.0,0.248,26,1
10,115,0,0,0,35.3,0.134,29,0
2,197,70,45,543,30.5,0.158,53,1
8,125,96,0,0,0.0,0.232,54,1
4,110,92,0,0,37.6,0.191,30,0
10,168,74,0,0,38.0,0.537,34,1
10,139,80,0,0,27.1,1.441,57,0
1,189,60,23,846,30.1,0.398,59,1
5,166,72,19,175,25.8,0.587,51,1
......
Figure 3 shows the LUCS-KDD DN interface once a data set has been loaded (the Pima Indians data set in this case). Note that six more buttons have now become active:
Figure 3: LUCS-KDD DN interface with schema and data files loaded.
The desired number of sub-ranges is entered using the "Input num. ranges" button; we will assume a value of 5. Remember that the higher the number of sub-ranges, the more output attributes will be generated during the discretisation (normalisation) process, which in turn will reduce the computational efficiency of any ARM algorithm that may be applied to the data.
Where an integer data field has a range smaller than the user specified number of sub-ranges, the LUCS-KDD DN software will assign a number of columns to the integer field equal to the size of the range. For example, if we have an integer field which can take values in the range 99 to 102 inclusive (i.e. a range size of 4) and the number of sub-ranges value is 5, then 4 columns would be assigned to the integer field. If, however, the number of sub-ranges value was set to 3, the 99 to 102 range would be divided into three sub-ranges (99 <= n <= 100, 100 < n <= 101 and 101 < n <= 102), each with its own unique column number assigned to it.
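The rule for integer fields amounts to taking the smaller of the range size and the requested number of sub-ranges. A minimal Java sketch of that arithmetic follows; the class and method names are hypothetical and are not taken from the LUCS-KDD DN source code.

```java
/** Sketch only: names are hypothetical, not taken from LUCS_KDD_DN_ARM.java. */
final class IntegerColumnsSketch {
    /**
     * Number of output columns for an integer field whose values lie in
     * [min, max] inclusive, given the user specified number of sub-ranges.
     */
    static int columnsFor(int min, int max, int numSubRanges) {
        int rangeSize = max - min + 1; // count of distinct integer values
        return Math.min(rangeSize, numSubRanges);
    }

    public static void main(String[] args) {
        // The 99 to 102 example from the text, with 5 and then 3 sub-ranges.
        System.out.println(columnsFor(99, 102, 5));
        System.out.println(columnsFor(99, 102, 3));
    }
}
```

For the 99 to 102 example this gives 4 columns when 5 sub-ranges are requested and 3 columns when 3 are requested, in line with the description above.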
Once a value for the number of sub-ranges has been successfully entered the "Normalisation" button becomes available:
To discretise/normalise the data set the user simply selects the "Normalisation" button. Note that for some data sets this process may take a few seconds. The result in the case of the Pima Indians data set will be of the following form:
2 9 13 17 21 28 32 38 42
1 8 13 17 21 27 31 36 41
3 10 13 16 21 27 32 36 42
1 8 13 17 21 28 31 36 41
1 9 12 17 21 29 35 37 42
2 8 14 16 21 27 31 36 41
1 7 13 17 21 28 31 36 42
3 8 11 16 21 28 31 36 41
1 10 13 18 24 28 31 38 42
3 9 14 16 21 26 31 38 42
2 8 14 16 21 28 31 36 41
3 10 14 16 21 28 31 37 42
3 9 14 16 21 28 33 39 41
1 10 13 17 25 28 31 39 42
2 10 13 16 22 27 32 38 42
............
Once normalisation is complete a further five (output) buttons become active (Figure 4) as follows:
(24) 2-HourSerumIns<676.8
(25) 2-HourSerumIns<846.0
(26) BodyMassIndex<13.419999999999998
(27) BodyMassIndex<26.839999999999996
(28) BodyMassIndex<40.26
(29) BodyMassIndex<53.67999999999999
(30) BodyMassIndex<67.1
(31) DiabPedFunc<0.5464
(32) DiabPedFunc<1.0148000000000001
(33) DiabPedFunc<1.4832
(34) DiabPedFunc<1.9516000000000002
(35) DiabPedFunc<2.42
(36) Age<33.0
(37) Age<45.0
(38) Age<57.0
(39) Age<69.0
(40) Age<81.0
(41) Class=0
(42) Class=1
Figure 4: LUCS-KDD DN interface after discretisation/normalisation.
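The boundaries in the listing above are evenly spaced (for example Age<33.0, Age<45.0, Age<57.0, Age<69.0, Age<81.0), which is consistent with dividing each numeric field into equal-width sub-ranges. The Java sketch below illustrates that reading only. It is an assumption inferred from the listing, not the documented algorithm of the LUCS-KDD DN software. The class and method names are hypothetical, and the Age minimum of 21 and the first Age column number of 36 are read off the listing rather than stated by the source.

```java
/** Sketch only: assumes equal-width sub-ranges; names are hypothetical. */
final class EqualWidthSketch {
    /** Zero-based index of the sub-range that holds value, under the assumption above. */
    static int subRangeIndex(double value, double min, double max, int numSubRanges) {
        if (max <= min) {
            return 0; // degenerate range: everything falls in one sub-range
        }
        double width = (max - min) / numSubRanges;
        int index = (int) Math.floor((value - min) / width);
        // Clamp so that the minimum and maximum values land in the end sub-ranges.
        return Math.max(0, Math.min(index, numSubRanges - 1));
    }

    /** Upper boundary of a sub-range, e.g. the 33.0 in "Age<33.0". */
    static double upperBound(double min, double max, int numSubRanges, int index) {
        return min + (index + 1) * (max - min) / numSubRanges;
    }

    public static void main(String[] args) {
        int firstAgeColumn = 36;        // assumed from the listing
        double min = 21.0, max = 81.0;  // assumed from the listed boundaries
        double[] ages = {21, 33, 50, 81};
        for (double age : ages) {
            int column = firstAgeColumn + subRangeIndex(age, min, max, 5);
            System.out.println("Age " + age + " -> column " + column);
        }
    }
}
```

Under these assumptions an age of 50 maps to column 38, which matches the Age column of the first record in the example output, and an age of 33 maps to column 37, which matches the fifth record.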
6. CONCLUSION
The LUCS-KDD DN software has been in use successfully within the LUCS-KDD group for some time. The software is available free of charge for non-commercial use; however, the author would appreciate appropriate acknowledgement. The following reference format for referring to the software is suggested: Coenen, F. (2003), LUCS-KDD ARM DN Software, http://www.csc.liv.ac.uk/~frans/KDD/Software/LUCS_KDD_DN_ARM/, Department of Computer Science, The University of Liverpool, UK.
Should you discover any "bugs" or other problems within the software (or this documentation), or have suggestions for additional features, please do not hesitate to contact the author.
Created and maintained by Frans Coenen. Last updated 18 January 2005