Most recent version available at http://www.csc.liv.ac.uk/~frans/KDD/Software/LUCS-KDD-DN/lucs-kdd_DN.html.
1. INTRODUCTION
The LUCS-KDD (Liverpool University Computer Science - Knowledge Discovery in Data) DN (Discretization/Normalisation) software has been developed to convert data files available in the UCI data repository ([1]) into a binary format suitable for use with Association Rule Mining (ARM) and Classification Association Rule Mining (CARM) applications. The software can, of course, equally well be used to convert data files obtained from other sources. We define discretisation and normalisation as follows:
Both ARM and CARM require input data sets where each record represents an itemset which in turn is a subset of the available set of attributes. Thus:
N = The number of available attributes
A = A set of attributes = {a | 0 < a <= N}
D = A data set comprising M records (R)
R = {r | r subset A}
For example, if A = {1 2 3 4} then we might have a data set of the form:

1 2 3 4
1 2 3
1 2 4
1 3 4
2 3 4
1 2
1 3
1 4
2 3
2 4
3 4
1
2
3
4

which may be used as the input to an ARM software system. The distinction between the data sets used for ARM and those used for CARM is that in the latter case one of the attributes in each record represents a class to which the record is said to "belong". Usually, to facilitate identification, either the last or first attribute in each record in the input data represents the class. Thus we might have a CARM data set of the following form, where the last item in each record represents the class (either 3 or 4):

1 2 4
1 2 3
1 4
2 4
1 3
2 3

(Note that records in a CARM input set must have at least one attribute as well as a class.) Generally speaking, real data sets do not comprise only binary fields. Typically such data sets comprise a mixture of nominal, continuous and integer fields. Thus real data will require discretisation/normalisation, i.e. conversion into a binary valued format, before it can be used for ARM or CARM. How this is done using the LUCS-KDD DN software will depend on the nature of each field in the data:
Thus, given a (space separated) data set with the schema {colour (nominal), average (double), age (int), class (int)} such as:

red 25.6 56 1
green 33.3 1 1
green 2.5 23 0
blue 67.2 111 1
red 29.0 34 0
yellow 99.5 78 1
yellow 10.2 23 1
yellow 9.9 30 0
blue 67.0 47 0
red 41.8 99 1

This would be discretised/normalised as follows:

3 5 9 13
2 5 8 13
2 5 8 12
1 7 11 13
3 5 9 12
4 7 10 13
4 5 8 13
4 5 8 12
1 7 9 12
3 6 11 13

The entire data set is now represented by 13 binary valued attributes and can be mined using appropriate ARM or CARM software.

1.1 Missing values

It is not unusual for data sets in the UCI repository to include missing values. The convention is to indicate these using a `?' character. This convention has also been adopted for the purpose of the LUCS-KDD DN software described here.
The LUCS-KDD DN software has been implemented in Java using the Java2 SDK (Software Development Kit) Version 1.4.0, which should make it highly portable. In the interest of "user friendliness" the user interface has been implemented as a GUI. The source code is available at:
http://csc.liv.ac.uk/~frans/KDD/Software/LUCS_KDD_DN/LUCS_KDD_DN.java
The code does not require any special packages and thus can be compiled using the standard Java compiler:
javac LUCS_KDD_DN.java
Before the LUCS-KDD DN software can convert a data file into the desired binary valued format it needs to know the schema for the data to be converted. The schemas for the data files in the UCI repository are usually described in free text format within a .names file. Users of the LUCS-KDD DN software will thus have to create their own schema files before any conversion can take place. LUCS-KDD DN schema files comprise three lines of text, each containing a sequence of N literals separated by white space (not carriage returns), where N is the number of attributes in the data set to be converted:
The schema file for the above example data set would be:

nominal double int nominal
colour average age class
blue/green/red/yellow null null null

The schema file for the Pima Indians UCI data set might be:

int int int int int double double int nominal
NumberPregnacies PlasmaGluConcent DiastolicBldPress TricepsSkinFold 2-HourSerumIns BodyMassIndex DiabPedFunc Age Class
null null null null null null null null 0/1

(Remember that data fields of "type" unused are ignored by the LUCS-KDD DN software and do not appear in the output.)
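To make the three-line layout concrete, the following is a minimal sketch of how such a schema might be read. It is not taken from the LUCS-KDD DN source: the names SchemaSketch, Field, parse and describe are hypothetical, and the reading of the lines (types, then names, then '/'-separated nominal values or null) is inferred from the two example schema files above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Illustrative sketch only: SchemaSketch and Field are hypothetical names,
// not classes taken from the LUCS-KDD DN source code.
class SchemaSketch {

    /** One attribute of the input data: its type, name and nominal values. */
    static final class Field {
        final String type;
        final String name;
        final List<String> values; // empty unless the third line lists values

        Field(String type, String name, List<String> values) {
            this.type = type;
            this.name = name;
            this.values = values;
        }
    }

    /** Builds the field list from the three lines of a schema file. */
    static List<Field> parse(String typeLine, String nameLine, String valueLine) {
        String[] types = typeLine.trim().split("\\s+");
        String[] names = nameLine.trim().split("\\s+");
        String[] values = valueLine.trim().split("\\s+");
        if (types.length != names.length || types.length != values.length) {
            throw new IllegalArgumentException("each schema line must hold N literals");
        }
        List<Field> fields = new ArrayList<Field>();
        for (int i = 0; i < types.length; i++) {
            // Assumption: "null" means no value list; otherwise values are '/'-separated.
            List<String> nominal = values[i].equals("null")
                    ? Collections.<String>emptyList()
                    : Arrays.asList(values[i].split("/"));
            fields.add(new Field(types[i], names[i], nominal));
        }
        return fields;
    }

    /** Formats a field as "(number) type: name", plus any nominal values. */
    static String describe(int number, Field field) {
        String text = "(" + number + ") " + field.type + ": " + field.name;
        if (!field.values.isEmpty()) {
            text += " { " + String.join(" ", field.values) + " }";
        }
        return text;
    }

    public static void main(String[] args) {
        List<Field> fields = parse(
                "nominal double int nominal",
                "colour average age class",
                "blue/green/red/yellow null null null");
        for (int i = 0; i < fields.size(); i++) {
            System.out.println(describe(i + 1, fields.get(i)));
        }
    }
}
```

Rejecting lines of unequal length up front is a design choice of the sketch; how the actual software reports a malformed schema file is not described here.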
Once compiled, the software can be invoked in the normal manner using the Java interpreter:
java LUCS_KDD_DN
If you are planning to process a very large data set, such as covertype, it may be worth allocating extra memory. For example:
java -Xms600m -Xmx600m LUCS_KDD_DN
Once invoked you will see an interface of the form presented in Figure 1. Note that in Figure 1 most of the interface buttons are "greyed out"; this is because, at this stage, we have no data to work with. The two functional buttons are:
Note also, in Figure 1, that some instructions are presented in the main window of the GUI.
Figure 1: LUCS-KDD DN interface on start up.
A schema file is loaded by selecting the "Input Schema" button. Once the schema has been "read" three more buttons become active:
(1) int: NumberPregnacies
(2) int: PlasmaGluConcent
(3) int: DiastolicBldPress
(4) int: TricepsSkinFold
(5) int: 2-HourSerumIns
(6) double: BodyMassIndex
(7) double: DiabPedFunc
(8) int: Age
(9) nominal: Class { 0 1 }
Figure 2 shows the LUCS-KDD DN interface once a schema file (the example for the Pima Indians data set given above) has been loaded and listed using the "Input Schema" and "List Schema" buttons respectively.
Figure 2: LUCS-KDD DN interface with schema file loaded.
A data file can be loaded using either the "Input Data (' ' separated)" or the "Input Data (',' separated)" button. Some example raw data (the first few lines from the Pima Indians comma separated data set) is given below:

6,148,72,35,0,33.6,0.627,50,1
1,85,66,29,0,26.6,0.351,31,0
8,183,64,0,0,23.3,0.672,32,1
1,89,66,23,94,28.1,0.167,21,0
0,137,40,35,168,43.1,2.288,33,1
5,116,74,0,0,25.6,0.201,30,0
3,78,50,32,88,31.0,0.248,26,1
10,115,0,0,0,35.3,0.134,29,0
2,197,70,45,543,30.5,0.158,53,1
8,125,96,0,0,0.0,0.232,54,1
4,110,92,0,0,37.6,0.191,30,0
10,168,74,0,0,38.0,0.537,34,1
10,139,80,0,0,27.1,1.441,57,0
1,189,60,23,846,30.1,0.398,59,1
5,166,72,19,175,25.8,0.587,51,1
......

Figure 3 shows the LUCS-KDD DN interface once a data set has been loaded (the Pima Indians data set in this case). Note that six more buttons have now become active:
Figure 3: LUCS-KDD DN interface with schema and data files loaded.
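As a rough illustration of what reading one raw record involves, the sketch below splits a line on the chosen separator and counts fields holding the `?' missing value marker. The class and method names (DataLineSketch, splitRecord, isMissing, countMissing) are hypothetical and are not taken from the LUCS-KDD DN source; what the software actually does with a missing value is not covered by this sketch.

```java
import java.util.regex.Pattern;

// Illustrative sketch only: names are hypothetical, not from LUCS-KDD DN.
class DataLineSketch {

    /** Marker used for missing values, following the UCI convention. */
    static final String MISSING = "?";

    /** Splits one record on a space (any run of white space) or another separator. */
    static String[] splitRecord(String line, char separator) {
        String regex = (separator == ' ')
                ? "\\s+"
                : Pattern.quote(String.valueOf(separator));
        return line.trim().split(regex);
    }

    /** True if the field holds the missing value marker. */
    static boolean isMissing(String field) {
        return MISSING.equals(field.trim());
    }

    /** Counts the missing values in a record. */
    static int countMissing(String[] fields) {
        int count = 0;
        for (String field : fields) {
            if (isMissing(field)) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        String[] commaFields = splitRecord("6,148,72,35,0,33.6,0.627,50,1", ',');
        String[] spaceFields = splitRecord("red 25.6 56 1", ' ');
        System.out.println(commaFields.length + " comma separated fields");
        System.out.println(spaceFields.length + " space separated fields");
        System.out.println(countMissing(splitRecord("red ? 56 1", ' ')) + " missing value(s)");
    }
}
```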
The desired number of sub-ranges is entered using the "Input num. ranges" button. With respect to the Pima Indians data set, the minimum setting for the number of ranges should be 2, as there are two classes, "NoSignsOfDiabetes" and "SignsOfDiabetes". However, a better CARM result will be produced using a number greater than 2; we will assume a value of 5. Remember that the higher the number of sub-ranges, the more output attributes will be generated during the discretisation (normalisation) process, which in turn will affect the computational efficiency of any ARM or CARM algorithm that may be applied to the data.
Where an integer data field has a range smaller than the user specified number of sub-ranges, the LUCS-KDD DN software will assign a number of columns to the integer field equivalent to the size of the range. For example, if we have an integer field which can take values in the range of 99 to 102 inclusive (i.e. a range size of 4) and the number of sub-ranges value is 5, then 4 columns would be assigned to the integer field. If, however, the number of sub-ranges value was set to 3, the 99 to 102 range would be divided into three sub-ranges (99 <= n <= 100, 100 < n <= 101 and 101 < n <= 102), each with its own unique column number assigned to it.
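The column arithmetic described above can be sketched as follows. This is an illustration under stated assumptions, not the LUCS-KDD DN implementation: the names SubRangeSketch, columnsForIntField, subRangeIndex and columnFor are hypothetical, the sub-ranges are assumed to be of equal width, and each sub-range is treated as half-open (lower bound included, upper bound excluded, with the field maximum placed in the last sub-range). Values that sit exactly on a boundary may therefore be placed differently from the 99 to 102 example in the text.

```java
// Illustrative sketch only: the class and method names are hypothetical and
// the boundary handling is an assumption, not taken from the LUCS-KDD DN source.
class SubRangeSketch {

    /** Columns given to an integer field: never more than its range size. */
    static int columnsForIntField(int min, int max, int requestedRanges) {
        int rangeSize = max - min + 1; // e.g. 99..102 has a range size of 4
        return Math.min(rangeSize, requestedRanges);
    }

    /**
     * Zero-based index of the equal-width sub-range that holds value.
     * Assumption: sub-ranges are half-open, [lower, upper), except that the
     * field maximum is placed in the last sub-range.
     */
    static int subRangeIndex(double value, double min, double max, int numRanges) {
        if (numRanges < 1 || value < min || value > max) {
            throw new IllegalArgumentException("value outside [min, max] or no ranges");
        }
        if (max == min) {
            return 0; // a constant field needs a single sub-range
        }
        double width = (max - min) / numRanges;
        int index = (int) ((value - min) / width);
        return Math.min(index, numRanges - 1);
    }

    /** Output column number, given the first column assigned to the field. */
    static int columnFor(double value, double min, double max,
                         int numRanges, int firstColumn) {
        return firstColumn + subRangeIndex(value, min, max, numRanges);
    }

    public static void main(String[] args) {
        System.out.println(columnsForIntField(99, 102, 5)); // 4
        System.out.println(columnsForIntField(99, 102, 3)); // 3
        // Hypothetical field spanning 21..81, 5 sub-ranges, first column 36.
        System.out.println(columnFor(50, 21, 81, 5, 36));   // 38
    }
}
```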
Once a value for the number of sub-ranges has been successfully entered the "Normalisation" button becomes available:
To discretise/normalise the data set the user simply selects the "Normalisation" button. Note that for some data sets this process may take a few seconds. The result in the case of the Pima Indians data set will be of the following form:

2 9 13 17 21 28 32 38 42
1 8 13 17 21 27 31 36 41
3 10 13 16 21 27 32 36 42
1 8 13 17 21 28 31 36 41
1 9 12 17 21 29 35 37 42
2 8 14 16 21 27 31 36 41
1 7 13 17 21 28 31 36 42
3 8 11 16 21 28 31 36 41
1 10 13 18 24 28 31 38 42
3 9 14 16 21 26 31 38 42
2 8 14 16 21 28 31 36 41
3 10 14 16 21 28 31 37 42
3 9 14 16 21 28 33 39 41
1 10 13 17 25 28 31 39 42
2 10 13 16 22 27 32 38 42
............
Once normalisation is complete a further five (output) buttons become active (Figure 4) as follows:
(24) 2-HourSerumIns<676.8
(25) 2-HourSerumIns<846.0
(26) BodyMassIndex<13.419999999999998
(27) BodyMassIndex<26.839999999999996
(28) BodyMassIndex<40.26
(29) BodyMassIndex<53.67999999999999
(30) BodyMassIndex<67.1
(31) DiabPedFunc<0.5464
(32) DiabPedFunc<1.0148000000000001
(33) DiabPedFunc<1.4832
(34) DiabPedFunc<1.9516000000000002
(35) DiabPedFunc<2.42
(36) Age<33.0
(37) Age<45.0
(38) Age<57.0
(39) Age<69.0
(40) Age<81.0
(41) Class=0
(42) Class=1
Figure 4: LUCS-KDD DN interface after discretisation/normalisation.
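Labels of the kind listed above, such as "(36) Age<33.0" and "(41) Class=0", can be generated along the following lines. This is a sketch under assumptions rather than the code used by LUCS-KDD DN: the names OutputSchemaSketch, rangeLabels and nominalLabels are hypothetical, equal-width sub-ranges are assumed, and the Age minimum of 21 is inferred from the listed boundaries (five equal steps of 12 ending at 81) rather than stated in the text. For boundaries that are not exactly representable in floating point, the printed digits depend on the order of the arithmetic and may not match the listing.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch only: names and the equal-width assumption are
// hypothetical and are not taken from the LUCS-KDD DN source code.
class OutputSchemaSketch {

    /** Labels of the form "(column) name<upperBound" for a numeric field. */
    static List<String> rangeLabels(String name, double min, double max,
                                    int numRanges, int firstColumn) {
        List<String> labels = new ArrayList<String>();
        double width = (max - min) / numRanges;
        for (int k = 1; k <= numRanges; k++) {
            double upper = min + k * width; // upper bound of the k-th sub-range
            labels.add("(" + (firstColumn + k - 1) + ") " + name + "<" + upper);
        }
        return labels;
    }

    /** Labels of the form "(column) name=value" for a nominal field. */
    static List<String> nominalLabels(String name, String[] values, int firstColumn) {
        List<String> labels = new ArrayList<String>();
        for (int i = 0; i < values.length; i++) {
            labels.add("(" + (firstColumn + i) + ") " + name + "=" + values[i]);
        }
        return labels;
    }

    public static void main(String[] args) {
        // Age: assumed minimum 21, maximum 81, 5 sub-ranges, first column 36.
        for (String label : rangeLabels("Age", 21, 81, 5, 36)) {
            System.out.println(label);
        }
        for (String label : nominalLabels("Class", new String[] {"0", "1"}, 41)) {
            System.out.println(label);
        }
    }
}
```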
6. CONCLUSION
The LUCS-KDD DN software has been in use successfully within the LUCS-KDD group for some time. The software is available for free; however, the author would appreciate appropriate acknowledgement. The following reference format for referring to the software is suggested: Coenen, F. (2003), LUCS-KDD DN Software, http://www.csc.liv.ac.uk/~frans/KDD/Software/LUCS_KDD_DN/, Department of Computer Science, The University of Liverpool, UK.
Should you discover any "bugs" or other problems within the software (or this documentation), or have suggestions for additional features, please do not hesitate to contact the author. Some additional notes on processing a number of example data sets from the UCI repository, together with example schema files, are available at:
http://csc.liv.ac.uk/~frans/KDD/Software/LUCS-KDD-DN/exmpleDNnotes.html
A number of example discretised/normalised data sets, taken from the UCI library, are available at:
http://csc.liv.ac.uk/~frans/KDD/Software/LUCS-KDD-DN/DataSets/dataSets.html
Created and maintained by Frans Coenen. Last updated 18 January 2005