SOME NOTES AND EXAMPLE SCHEMA FOR THE LUCS-KDD DN
(DISCRETISATION/NORMALISATION) SOFTWARE VERSION 2
Frans Coenen
Department of Computer Science
The University of Liverpool
Friday 7 January 2005
This page contains some further notes on using the
LUCS-KDD (Liverpool University Computer Science - Knowledge Discovery in
Data) DN (discretisation/normalisation)
software Version 2. More specifically this page includes:
- Notes on processing a number of data sets (available within the UCI data repository
[1]) as used by the LUCS-KDD research team for a variety of experiments,
- Suggested schema files for these data sets, and
- Statistical information on the processed data sets. Where measurements
differ from the data produced using version 1 of the
normalisation/discretisation software the version 1 data is given in
parentheses. (In most cases version 2 produces fewer attribute columns
than version 1.)
A number of example discretised/normalised data sets, taken from the UCI
library, are available at:
http://csc.liv.ac.uk/~frans/KDD/Software/LUCS-KDD-DN/DataSets/dataSets.html
- Data set comprised of a concatenation (adult.num) of two data sets,
adult.data and adult.test.
- The files (adult.data and adult.test) are ", " (comma space)
separated, but the normalisation software will only accept comma or space
separated input files. The files must therefore be edited so that items are
either comma or space separated.
- The records in the adult.test file are terminated with a "."
(full stop); these must also be removed.
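The editing steps above can be sketched as a small script. This is only an illustration and not part of the LUCS-KDD DN software; the function names `clean_adult_line` and `concatenate` are hypothetical, and skipping lines that contain no comma is an assumption made here to drop blank or comment lines.

```python
def clean_adult_line(line: str) -> str:
    """Normalise one record: drop a trailing full stop and use ',' as the only separator."""
    line = line.strip()
    if line.endswith("."):
        line = line[:-1]  # records in adult.test end with "."
    fields = [field.strip() for field in line.split(",")]
    return ",".join(fields)


def concatenate(paths, out_path):
    """Write the cleaned records of every input file, in order, to out_path."""
    with open(out_path, "w") as out:
        for path in paths:
            with open(path) as src:
                for line in src:
                    # Assumption: a line without any comma is not a record.
                    if "," in line:
                        out.write(clean_adult_line(line) + "\n")
```

For example, `concatenate(["adult.data", "adult.test"], "adult.num")` would produce a single comma separated file, assuming both input files are in the working directory.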
| SCHEMA FILE |
|---|
|
int nominal int nominal int nominal nominal nominal nominal nominal int int
int nominal nominal
|
|
age workclass fnlwgt education education-num marital-status occupation
relationship race sex capital-gain capital-loss hours-per-week native-country
class
|
|
none Private/Self-emp-not-inc/Self-emp-inc/Federal-gov/Local-gov/State-gov/
Without-pay/Never-worked
none Bachelors/Some-college/11th/HS-grad/Prof-school/Assoc-acdm/Assoc-voc/
9th/7th-8th/12th/Masters/1st-4th/10th/Doctorate/5th-6th/Preschool
none Married-civ-spouse/Divorced/Never-married/Separated/Widowed/
Married-spouse-absent/Married-AF-spouse
Tech-support/Craft-repair/Other-service/Sales/Exec-managerial/Prof-specialty/
Handlers-cleaners/Machine-op-inspct/Adm-clerical/Farming-fishing/
Transport-moving/Priv-house-serv/Protective-serv/Armed-Forces
Wife/Own-child/Husband/Not-in-family/Other-relative/Unmarried
White/Asian-Pac-Islander/Amer-Indian-Eskimo/Other/Black
Female/Male none none none
United-States/Cambodia/England/Puerto-Rico/Canada/Germany/
Outlying-US(Guam-USVI-etc)/India/Japan/Greece/South/China/Cuba/Iran/
Honduras/Philippines/Italy/Poland/Jamaica/Vietnam/Mexico/Portugal/Ireland/
France/Dominican-Republic/Laos/Ecuador/Taiwan/Haiti/Columbia/Hungary/Guatemala/
Nicaragua/Scotland/Thailand/Yugoslavia/El-Salvador/Trinadad&Tobago/Peru/Hong/
Holand-Netherlands >50K/<=50K
|
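The schema shown above appears to consist of three parts: one type per column, one name per column, and one entry per column giving either the "/" separated nominal values or a placeholder ("none" or "null", both of which are used on this page). A minimal sketch of a consistency check follows, assuming each part occupies exactly one line of the schema file (on this page the parts are wrapped over several lines). The function `parse_schema` is hypothetical and is not part of the DN software.

```python
def parse_schema(text: str):
    """Split a three-line schema into (type, name, nominal_values) triples."""
    lines = [line.split() for line in text.strip().splitlines() if line.strip()]
    if len(lines) != 3:
        raise ValueError(f"expected 3 non-empty lines, found {len(lines)}")
    types, names, values = lines
    if not (len(types) == len(names) == len(values)):
        raise ValueError(
            f"column counts differ: {len(types)} types, "
            f"{len(names)} names, {len(values)} value entries"
        )
    columns = []
    for col_type, name, value in zip(types, names, values):
        # "none" and "null" are treated as "no nominal values for this column".
        nominal = [] if value in ("none", "null") else value.split("/")
        columns.append((col_type, name, nominal))
    return columns
```

A check of this kind catches a schema whose three parts have drifted out of step, for example after a column has been removed from the data but not from the schema.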
| DN STATISTICS |
|---|
| Num. divs. setting | 5 |
|---|
| Distributed/Randomised | Yes |
|---|
| Missing values | 6465 |
|---|
| Number of records | 48842 |
|---|
| Num. input columns | 15 |
|---|
| Num. output columns (Ver 1) | 97 (131) |
|---|
| Density % (Ver 1) | 15.46 (11.45) |
|---|
| Number of classes | 2 |
|---|
| Num. records per class: |
| Class | Num. Rec. | % |
|---|
| 96 | 11687 | 23.93 |
| 97 | 37155 | 76.07 |
|
|---|
| File name | adult.D97.N48842.C2.num |
|---|
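The density figures reported on this page appear to be consistent with the number of attributes per record (the input columns, less any that are removed) divided by the number of output columns. This is an observation drawn from the tables, not a documented definition, and the helper `density_percent` below is hypothetical.

```python
def density_percent(attributes_per_record: int, output_columns: int) -> float:
    """Percentage of output columns that are set in each record."""
    return round(100.0 * attributes_per_record / output_columns, 2)


# Adult: 15 input columns, 97 output columns (131 with version 1).
print(density_percent(15, 97))   # 15.46
print(density_percent(15, 131))  # 11.45
```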
- Data set comprised of a concatenation (anneal.num) of two comma
separated data sets anneal.data and anneal.test. (Version 1
only used the anneal.data data to produce a binary valued data set.)
- Contains many missing values represented by a '?' character: 19692 in
anneal.data, 2483 in anneal.test, and consequently 22175 in the
anneal.num file.
- Many (20) of the available nominal values do not appear in the
data and consequently some attributes are unrepresented.
- Nominal values that do not feature in the data set:
- GB, GK, GS, ZA, ZF, ZH
or ZM for attribute family (Column number 1).
- H or G for attribute product-type (Column number 2).
- U for attribute steel (column number 3).
- X for attribute condition (column number 7).
- M for attribute surface-finish (column number 11).
- Y for attribute m (column number 19).
- Y for attribute marvi (column number 23).
- Y for attribute corr (column number 26).
- R for attribute blue/bright/varn/clean (column number 27).
- Y for attribute jurofm (column number 29).
- Y for attribute s (column number 30).
- Y for attribute p (column number 31).
- 0760 for attribute bore (column number 37).
- Attributes that do not feature in the data set: m, marvi,
corr, jurofm, s and p.
- Classes that do not feature in the data set: 4 (column number 71).
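Lists of unused nominal values such as the one above can be produced by comparing the values declared in the schema with those that actually occur in the data. The sketch below is an illustration only; the function name is hypothetical, and column indexes are 0-based here whereas the notes use 1-based column numbers.

```python
def unused_nominal_values(records, declared):
    """Return {column_index: [declared values that never occur]}.

    records  -- list of records, each a list of strings
    declared -- {column_index: [nominal values listed in the schema]}
    """
    seen = {index: set() for index in declared}
    for record in records:
        for index in declared:
            seen[index].add(record[index])
    result = {}
    for index, values in declared.items():
        missing = [value for value in values if value not in seen[index]]
        if missing:
            result[index] = missing
    return result
```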
| SCHEMA FILE |
|---|
|
nominal nominal nominal double double nominal nominal
int double nominal nominal nominal int
nominal nominal nominal nominal nominal nominal nominal nominal nominal
nominal nominal nominal nominal nominal nominal
nominal nominal nominal nominal double double double nominal nominal
int nominal
|
|
family product-type steel carbon hardness temper_rolling condition
formability strength non-ageing surface-finish surface-quality enamelability
bc bf bt bw/me bl m chrom phos cbond
marvi exptl ferro corr blue/bright/varn/clean lustre
jurofm s p shape thick width len oil bore
packing classes
|
|
GB/GK/GS/TN/ZA/ZF/ZH/ZM/ZS C/H/G R/A/U/K/M/S/W/V null null T S/A/X null null
N P/M D/E/F/G null Y Y Y B/M Y Y C P Y Y Y Y Y B/R/V/C Y Y Y Y COIL/SHEET
null null null Y/N 0000/0500/0600/0760 null 1/2/3/4/5/U
|
| DN STATISTICS |
|---|
| Num. divs. setting | 5 |
|---|
| Distributed/Randomised | Yes |
|---|
| Missing values | 22175 |
|---|
| Number of records (Ver 1) | 898 (798) |
|---|
| Num. input columns | 39 |
|---|
| Num. output columns (Ver 1) | 73 (106) |
|---|
| Density % (Ver 1) | 53.42 (41.05) |
|---|
| Number of classes | 6 |
|---|
| Num. records per class: |
| Class | Num. Rec. | % |
|---|
| 71 | 0 (0) | 0.00 (0.00) |
| 68 | 8 (8) | 0.89 (1.00) |
| 73 | 40 (34) | 4.45 (4.26) |
| 72 | 67 (60) | 7.46 (7.52) |
| 69 | 99 (88) | 11.02 (11.03) |
| 70 | 684 (608) | 76.17 (76.19) |
|
|---|
| File name | anneal.D73.N898.C6.num |
|---|
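The output file names on this page appear to follow the pattern name.D(output columns).N(records).C(classes).num. Assuming that convention holds, the statistics can be recovered from a file name as sketched below; `parse_output_name` is a hypothetical helper and not part of the DN software.

```python
import re

NAME_PATTERN = re.compile(
    r"^(?P<name>.+)\.D(?P<columns>\d+)\.N(?P<records>\d+)\.C(?P<classes>\d+)\.num$"
)


def parse_output_name(file_name: str) -> dict:
    """Split a name such as 'anneal.D73.N898.C6.num' into its parts."""
    match = NAME_PATTERN.match(file_name)
    if match is None:
        raise ValueError(f"unexpected file name: {file_name}")
    return {
        "name": match["name"],
        "columns": int(match["columns"]),
        "records": int(match["records"]),
        "classes": int(match["classes"]),
    }
```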
- Contains missing values (59) in the .data file, represented by a '?' character.
- Several of the attributes in the database could be used as a "class"
attribute; however, the first, "symboling", has been selected here. This column must
therefore be moved to the end.
- Records of class -1 are concentrated towards the end of the data, so the
records should be randomised.
- Classes that do not feature in the data set: -3 (column number 131).
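Moving the class column to the end and randomising the record order can be done before the data is passed to the DN software. A minimal sketch follows, assuming the records have already been split into lists of strings. The function names and the fixed seed are choices made for this example only.

```python
import random


def move_first_to_end(record):
    """Return the record with its first field (the class) moved to the end."""
    return record[1:] + record[:1]


def prepare(records, seed=0):
    """Move the class column to the end of every record, then shuffle the records."""
    moved = [move_first_to_end(record) for record in records]
    random.Random(seed).shuffle(moved)  # a fixed seed keeps the order reproducible
    return moved
```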
| SCHEMA FILE |
|---|
|
nominal double nominal nominal nominal nominal nominal nominal nominal
double double double double double nominal nominal double nominal double
double double double double double double double
|
|
symboling normalized-losses make fuel-type aspiration num-of-doors body-style
drive-wheels engine-location wheel-base length width height curb-weight
engine-type num-of-cylinders engine-size fuel-system bore str
|
|
-3/-2/-1/0/1/2/3 null alfa-romero/audi/bmw/chevrolet/dodge/honda/isuzu/
jaguar/mazda/mercedes-benz/mercury/mitsubishi/nissan/peugot/plymouth/
porsche/renault/saab/subaru/toyota/volkswagen/volvo diesel/gas std/turbo
four/two hardtop/wagon/sedan/hatchback/convertible 4wd/fwd/rwd front/rear
null null null null null dohc/dohcv/l/ohc/ohcf/ohcv/rotor
eight/five/four/six/three/twelve/two null
1bbl/2bbl/4bbl/idi/mfi/mpfi/spdi/spfi null null null null null null null null
|
| DN STATISTICS |
|---|
| Num. divs. setting | 5 |
|---|
| Distributed/Randomised | Yes |
|---|
| Missing values | 59 |
|---|
| Number of records | 205 |
|---|
| Num. input columns | 26 |
|---|
| Num. output columns (Ver 1) | 137 (142) |
|---|
| Density % (Ver 1) | 18.98 (18.31) |
|---|
| Number of classes | 7 |
|---|
| Num. records per class: |
| Class | Num. Rec. | % |
|---|
| 131 | 0 | 0.00 |
| 132 | 3 | 1.46 |
| 133 | 22 | 10.73 |
| 137 | 27 | 13.17 |
| 136 | 32 | 15.61 |
| 135 | 54 | 26.34 |
| 134 | 67 | 32.68 |
|
|---|
| File name | auto.D137.N205.C7.num |
|---|
- Data file includes a few missing attribute values (16), represented by a '?' character.
- The first column is a counter, so remove it from the data set.
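Removing the counter column and counting the missing values can be combined in a single pass over the file. The sketch below assumes comma separated input and is an illustration only; the function name is hypothetical.

```python
def drop_counter_and_count_missing(lines, missing="?"):
    """Drop the first field of every record and count the missing-value markers."""
    records = []
    missing_total = 0
    for line in lines:
        fields = line.strip().split(",")
        if len(fields) < 2:
            continue  # skip blank lines
        fields = fields[1:]  # the first column is only a counter
        missing_total += fields.count(missing)
        records.append(fields)
    return records, missing_total
```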
| SCHEMA FILE |
|---|
|
int int int int int int int int int int nominal
|
|
number ClumpThickness UniformityOfCellSize UniformityOfCellShape
MarginalAdhesion SingleEpithelialCellSize BareNuclei BlandChromatin
NormalNucleoli Mitoses Class
|
|
null null null null null null null null null null 2/4
|
| DN STATISTICS |
|---|
| Num. divs. setting | 5 |
|---|
| Distributed/Randomised | Yes |
|---|
| Missing values | 16 |
|---|
| Number of records | 699 |
|---|
| Num. input columns | 11 (Remove 1) |
|---|
| Num. output columns (Ver 1) | 20 (47) |
|---|
| Density % (Ver 1) | 50 (21.28) |
|---|
| Number of classes | 2 |
|---|
| Num. records per class: |
| Class | Num. Rec. | % |
|---|
| 20 | 241 | 34.48 |
| 19 | 458 | 65.52 |
|
|---|
| File name | breast.D20.N699.C2.num |
|---|
5. CHESS (KING AND ROOK v. KING)
- Data is ordered and therefore must be randomized.
- Nominal values e, f, g and h for
attribute White_King_file (column 1) do not appear in the
data set.
- Nominal values 5, 6, 7 and 8 for
attribute White_King_rank (column 2) do not appear in the
data set.
| SCHEMA FILE |
|---|
|
nominal nominal nominal nominal nominal nominal nominal
|
|
White_King_file White_King_rank White_Rook_file White_Rook_rank
Black_King_file Black_King_rank depth-of-win
|
|
a/b/c/d/e/f/g/h 1/2/3/4/5/6/7/8 a/b/c/d/e/f/g/h 1/2/3/4/5/6/7/8
a/b/c/d/e/f/g/h 1/2/3/4/5/6/7/8 draw/zero/one/two/three/four/five/six/seven/
eight/nine/ten/eleven/twelve/thirteen/fourteen/fifteen/sixteen
|
| DN STATISTICS |
|---|
| Num. divs. setting | 5 |
|---|
| Distributed/Randomised | Yes |
|---|
| Missing values | 0 |
|---|
| Number of records | 28056 |
|---|
| Num. input columns | 7 |
|---|
| Num. output columns (Ver 1) | 58 (66) |
|---|
| Density % (Ver 1) | 12.07 (10.61) |
|---|
| Number of classes | 18 |
|---|
| Num. records per class: |
| Class | Num. Rec. | % |
|---|
| 42 | 27 | 0.1 |
| 43 | 78 | 0.28 |
| 45 | 81 | 0.29 |
| 46 | 198 | 0.71 |
| 44 | 246 | 0.88 |
| 58 | 390 | 1.39 |
| 47 | 471 | 1.68 |
| 48 | 592 | 2.11 |
| 49 | 683 | 2.43 |
| 50 | 1433 | 5.11 |
| 51 | 1712 | 6.1 |
| 52 | 1985 | 7.08 |
| 57 | 2166 | 7.72 |
| 41 | 2796 | 9.97 |
| 53 | 2854 | 10.17 |
| 54 | 3597 | 12.82 |
| 55 | 4194 | 14.95 |
| 56 | 4553 | 16.23 |
|
|---|
| File name | chessKRvK.D58.N28056.C18.num |
|---|
| SCHEMA FILE |
|---|
|
nominal nominal nominal nominal nominal nominal nominal nominal nominal nominal
nominal nominal nominal nominal nominal nominal nominal nominal nominal nominal
nominal nominal nominal nominal nominal nominal nominal nominal nominal nominal
nominal nominal nominal nominal nominal nominal nominal nominal nominal nominal
nominal nominal nominal
|
|
a1 a2 a3 a4 a5 a6 b1 b2 b3 b4 b5 b6 c1 c2 c3 c4 c5 c6 d1 d2 d3 d4 d5 d6 e1 e2
e3 e4 e5 e6 f1 f2 f3 f4 f5 f6 g1 g2 g3 g4 g5 g6 Class
|
|
x/o/b x/o/b x/o/b x/o/b x/o/b x/o/b x/o/b x/o/b x/o/b x/o/b x/o/b x/o/b x/o/b
x/o/b x/o/b x/o/b x/o/b x/o/b x/o/b x/o/b x/o/b x/o/b x/o/b x/o/b x/o/b x/o/b
x/o/b x/o/b x/o/b x/o/b x/o/b x/o/b x/o/b x/o/b x/o/b x/o/b x/o/b x/o/b x/o/b
x/o/b x/o/b x/o/b win/loss/draw
|
| DN STATISTICS |
|---|
| Num. divs. setting | 5 |
|---|
| Distributed/Randomised | Yes |
|---|
| Missing values | 0 |
|---|
| Number of records | 67557 |
|---|
| Num. input columns | 43 |
|---|
| Num. output columns (Ver 1) | 129 (129) |
|---|
| Density % (Ver 1) | 33.33 (33.33) |
|---|
| Number of classes | 3 |
|---|
| Num. records per class: |
| Class | Num. Rec. | % |
|---|
| 129 | 6449 | 9.55 |
| 128 | 16635 | 24.62 |
| 127 | 44473 | 65.83 |
|
|---|
| File name | connect4.D129.N67557.C3.num |
|---|
- Data file bands.data contains nominal values presented using both
upper and lower case letters; for example, the nominal values "YES" and "yes" are
intended to be identical. The LUCS-KDD-DN software treats nominal values as
case sensitive, so some editing is required.
- First four attributes are all identifiers and thus not required.
- Nominal values that do not feature in the data set:
- type for attribute color (column number 2).
- daetwyler for attribute bladeMFG (column number 4).
- warsaw or mattoon for attribute cylinderDivision
(column number 5).
- lactol or other for attribute solventType (column
number 9).
- 3, 4 or 8 for attribute unitNumber (column
number 13).
- other for attribute platingTank (column number 16).
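Because several schema values are themselves mixed case (for example WoodHoe70), simply lower-casing the whole file would not match the schema. One possible approach, sketched here as an assumption and not as the procedure actually used, is to map each value case-insensitively onto the spelling given in the schema. The helper `canonicalise` is hypothetical.

```python
def canonicalise(value: str, allowed):
    """Return the schema spelling of value, ignoring case; unknown values are unchanged."""
    lookup = {candidate.lower(): candidate for candidate in allowed}
    return lookup.get(value.lower(), value)
```

Values that are not in the schema, such as the '?' missing-value marker, pass through untouched.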
| SCHEMA FILE |
|---|
|
unused unused unused unused nominal nominal nominal nominal nominal
nominal nominal nominal nominal nominal nominal nominal nominal nominal
nominal nominal int int double int int double int double int double
double int int double int int int int int nominal
|
|
timestamp cylinderNumber customer jobNumber grainScreened color
proofOnCtdInk bladeMFG cylinderDivision paperType inkType directSteam
solventType typeOnCylinder pressType press unitNumber cylinderSize
paperMillLocation platingTank proofCut viscosity caliper inkTemperature
humidity roughness bladePressure varnishPCT pressSpeed inkPCT solventPCT
ESAvoltage ESAamperage wax hardener rollerDurometer currentDensity
anodeSpaceEatio chromeContent bandType
|
|
none none none none yes/no key/type yes/no benton/daetwyler/uddeholm
gallatin/warsaw/mattoon uncoated/coated/super uncoated/coated/cover
yes/no xylol/lactol/naptha/line/other yes/no
WoodHoe70/Motter70/Albert70/Motter94 821/802/813/824/815/816/827/828
1/2/3/4/5/6/7/8/9/10 catalog/spiegel/tabloid
NorthUS/SouthUS/Canadian/Scandanavian/MidEuropean
1910/1911/other none none none none none none none none none none
none none none none none none none none none band/noband
|
| DN STATISTICS |
|---|
| Num. divs. setting | 5 |
|---|
| Distributed/Randomised | Yes |
|---|
| Missing values | 999 |
|---|
| Number of records | 540 |
|---|
| Num. input columns | 40 (remove 4) |
|---|
| Num. output columns | 124 |
|---|
| Density % | 29.03 |
|---|
| Number of classes | 2 |
|---|
| Num. records per class: |
| Class | Num. Rec. | % |
|---|
| 123 | 228 | 42.22 |
| 124 | 312 | 57.78 |
|
|---|
| File name | cylBands.D124.N540.C2.num |
|---|
- Input data set (flare.num) comprises a concatenation of
flares.data1 and flares.data2.
- Space separated.
- The last three columns, 11, 12 and 13, may all potentially be used as the
class attribute; column 11 is used here because it has 9 class values associated with
it, whereas the other two columns have fewer. Columns 12 and 13 should thus be
deleted.
- There are no records that contain the nominal value A for attribute
modifiedZurichClass (column number 1).
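Choosing one of several candidate class columns and deleting the others can be sketched as follows. Indexes are 0-based in this example (column 11 in the notes is index 10), and the function name is hypothetical.

```python
def keep_class_column(record, class_index, candidate_indexes):
    """Delete every candidate class column except class_index and put that one last."""
    dropped = set(candidate_indexes) - {class_index}
    kept = [
        value
        for index, value in enumerate(record)
        if index not in dropped and index != class_index
    ]
    return kept + [record[class_index]]
```

For a space separated line, `keep_class_column(line.split(), 10, [10, 11, 12])` keeps the ten attributes followed by column 11 as the class.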
| SCHEMA FILE |
|---|
|
nominal nominal nominal nominal nominal nominal nominal nominal nominal
nominal nominal nominal nominal
|
|
modifiedZurichClass largestSpotSize spotDistribution activity evolution
prev24hourFlareActivity historically-complex regionBecameHistComplex area
areaLargestSpot C-class M-class X-class
|
|
A/B/C/D/E/F/H X/R/S/A/H/K X/O/I/C 1/2 1/2/3 1/2/3 1/2 1/2 1/2 1/2
0/1/2/3/4/5/6/7/8 0/1/2/3/4/5 0/1/2
|
| DN STATISTICS |
|---|
| Num. divs. setting | 5 |
|---|
| Distributed/Randomised | Yes |
|---|
| Missing values | 0 |
|---|
| Number of records | 1389 |
|---|
| Num. input columns | 13 (remove 2) |
|---|
| Num. output columns | 39 |
|---|
| Density % | 28.21 |
|---|
| Number of classes | 9 |
|---|
| Num. records per class: |
| Class | Num. Rec. | % |
|---|
| 38 | 0 | 0.00 |
| 39 | 1 | 0.07 |
| 37 | 3 | 0.22 |
| 36 | 4 | 0.29 |
| 35 | 9 | 0.65 |
| 34 | 20 | 1.44 |
| 33 | 40 | 2.88 |
| 32 | 141 | 10.15 |
| 31 | 1171 | 84.31 |
|
|---|
| File name | flare.D39.N1389.C9.num |
|---|
If using column 13 as the class attribute the class distribution is:
| Class | Num. Rec. | % |
|---|
| 22 | 1 | 0.07 |
| 21 | 11 | 0.79 |
| 20 | 1377 | 99.14 |
And if using column 12 as the class attribute the distribution is:
| Class | Num. Rec. | % |
|---|
| 30 | 1 | 0.07 |
| 28 | 2 | 0.14 |
| 29 | 3 | 0.22 |
| 27 | 9 | 0.65 |
| 26 | 53 | 3.82 |
| 25 | 1321 | 95.10 |
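Class distribution tables of the kind shown on this page can be reproduced from the class column of a processed file. The sketch below orders the rows by ascending record count, as the tables here do. The function name and the optional `all_classes` argument (used to report classes that have no records) are assumptions of this example.

```python
from collections import Counter


def class_distribution(labels, all_classes=None):
    """Return (class, count, percent) rows sorted by ascending count."""
    counts = Counter(labels)
    for label in all_classes or []:
        counts.setdefault(label, 0)  # report unrepresented classes as 0
    total = len(labels)
    rows = [
        (label, count, round(100.0 * count / total, 2))
        for label, count in counts.items()
    ]
    return sorted(rows, key=lambda row: (row[1], str(row[0])))
```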
- First Column is a counter so remove from dataset.
- Data is ordered according to class so must be distributed.
- Data includes no examples where the class (typeOfGlass)
is 4.
| SCHEMA FILE |
|---|
|
int double double double double double double double double double nominal
|
|
number RI:refractiveIndex Na:Sodium Mg:Magnesium Al:Aluminum Si:Silicon
K:Potassium Ca:Calcium Ba:Barium Fe:Iron TypeOfGlass
|
|
none none none none none none none none none none 1/2/3/4/5/6/7
|
| DN STATISTICS |
|---|
| Num. divs. setting | 5 |
|---|
| Distributed/Randomised | Yes |
|---|
| Missing values | 0 |
|---|
| Number of records | 214 |
|---|
| Num. input columns | 11 (remove 1) |
|---|
| Num. output columns (Ver 1) | 48 (52) |
|---|
| Density % (Ver 1) | 20.83 (19.23) |
|---|
| Number of classes | 7 |
|---|
| Num. records per class: |
| Class | Num. Rec. | % |
|---|
| 45 | 0 | 0.00 |
| 47 | 9 | 4.21 |
| 46 | 13 | 6.07 |
| 44 | 17 | 7.94 |
| 48 | 29 | 13.55 |
| 42 | 70 | 32.71 |
| 43 | 76 | 35.51 |
|
|---|
| File name | glass.D48.N214.C7.num |
|---|
- The Cleveland data set is used here. It is unclear which data set was used for
evaluating CMAR (2) and CPAR (3), where the authors use a "heart" data set with
a reported 270 records and 3 classes as opposed to the
303 records and 5 classes used here.
| SCHEMA FILE |
|---|
|
double nominal nominal double double nominal nominal double nominal double
nominal nominal nominal nominal
|
|
age sex cp trestbps chol fbs restecg thalach exang oldpeak slope ca thal num
|
|
null 0.0/1.0 1.0/2.0/3.0/4.0 null null 0.0/1.0 0.0/1.0/2.0 null 0.0/1.0
null 1.0/2.0/3.0 0.0/1.0/2.0/3.0 3.0/6.0/7.0 0/1/2/3/4
|
| DN STATISTICS |
|---|
| Num. divs. setting | 5 |
|---|
| Distributed/Randomised | Yes |
|---|
| Missing values | 6 |
|---|
| Number of records | 303 |
|---|
| Num. input columns | 14 |
|---|
| Num. output columns (Ver 1) | 52 (53) |
|---|
| Density % (Ver 1) | 26.92 (26.42) |
|---|
| Number of classes | 5 |
|---|
| Num. records per class: |
| Class | Num. Rec. | % |
|---|
| 52 | 13 | 4.29 |
| 51 | 35 | 11.55 |
| 50 | 36 | 11.88 |
| 49 | 55 | 18.15 |
| 48 | 164 | 54.13 |
|
|---|
| File name | heart.D52.N303.C5.num |
|---|
- First column is the class so needs to be moved to the end.
| SCHEMA FILE |
|---|
|
nominal int nominal nominal nominal nominal nominal nominal nominal
nominal nominal nominal nominal nominal double int int double int nominal
|
|
Class AGE SEX STEROID ANTIVIRALS FATIGUE MALAISE ANOREXIA LIVER_BIG
LIVER_FIRM SPLEEN_PALPABLE SPIDERS ASCITES VARICES BILIRUBIN
ALK_PHOSPHATE SGOT ALBUMIN PROTIME HISTOLOGY
|
|
1/2 null 1/2 1/2 1/2 1/2 1/2 1/2 1/2 1/2 1/2 1/2 1/2 1/2 null null null
null null 1/2
|
| DN STATISTICS |
|---|
| Num. divs. setting | 5 |
|---|
| Distributed/Randomised | Yes |
|---|
| Missing values | 167 |
|---|
| Number of records | 155 |
|---|
| Num. input columns | 20 |
|---|
| Num. output columns (Ver 1) | 56 (58) |
|---|
| Density % (Ver 1) | 35.71 (34.48) |
|---|
| Number of classes | 2 |
|---|
| Num. records per class: |
| Class | Num. Rec. | % |
|---|
| 55 | 32 | 20.65 |
| 56 | 123 | 79.35 |
|
|---|
| File name | hepatitis.D56.N155.C2.num |
|---|
- Data is space separated (more usual to have comma separated).
- Input data comprises the 300 record training set (horse-colic.data)
and the 68 record test set (horse-colic.test) which have been put
together to form a single space-separated data file (horse-colic.num).
- Column 24 is the class (whether lesion is surgical): 1 = Yes, 2 = No.
- Columns 23 (outcome), 25 (typeOfLesion1), 26
(typeOfLesion2), 27 (typeOfLesion3), 28 (cp_data)
are irrelevant and can therefore be removed.
- Errors in names file: (i) nominal values for column 2 are actually 1 or 9
and not 1 or 2 as reported; (ii) nominal values for column 10 are actually 1, 2 or
3 and not 1 or 2 as reported.
- Many missing values (1927).
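Discrepancies between a names file and the data, such as those noted above for columns 2 and 10, can be found by listing the distinct values that actually occur in a column. A minimal sketch, using 1-based column numbers as in the notes; the function name is hypothetical.

```python
def distinct_values(records, column, missing="?"):
    """Return the sorted distinct values in a 1-based column, ignoring missing values."""
    index = column - 1
    return sorted({record[index] for record in records if record[index] != missing})
```

For space separated data, each record can be obtained with `line.split()` before calling the function.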
| SCHEMA FILE |
|---|
|
nominal nominal int double int int nominal nominal nominal nominal nominal
nominal nominal nominal nominal int nominal nominal double double nominal
int nominal nominal int int int nominal
|
|
surgery? Age HospitalNumber rectalTemperature pulse respiratoryRate
temperatureOfExtremities peripheralPulse mucousMembranes capillaryRefillTime
pain peristalsis abdominalDistension nasogastricTube nasogastricReflux
nasogastricRefluxPH ectalExamination abdomen packedCellVolume totalProtein
abdominocentesisAppearanc abdomcentesisTotalProtein outcome surgicalLesion?
typeOfLesion1 typeOfLesion2 typeOfLesion3 cp_data
|
|
1/2 1/9 null null null null 1/2/3/4 1/2/3/4 1/2/3/4/5/6 1/2/3 1/2/3/4/5
1/2/3/4 1/2/3/4 1/2/3 1/2/3 null 1/2/3/4 1/2/3/4/5 null null 1/2/3 null
1/2/3 1/2 null null null 1/2
|
| DN STATISTICS |
|---|
| Num. divs. setting | 5 |
|---|
| Distributed/Randomised | Yes |
|---|
| Missing values | 1927 |
|---|
| Number of records | 368 |
|---|
| Num. input columns | 28 |
|---|
| Num. output columns (Ver 1) | 85 (94) |
|---|
| Density % (Ver 1) | 27.06 (24.47) |
|---|
| Number of classes | 2 |
|---|
| Num. records per class: |
| Class | Num. Rec. | % |
|---|
| 85 | 136 | 36.96 |
| 84 | 232 | 63.04 |
|
|---|
| File name | horseColic.D85.N368.C2.num |
|---|
| SCHEMA FILE |
|---|
|
double double double double double double double double double
double double double double double double double double double
double double double double double double double double double
double double double double double double double nominal
|
|
att1 att2 att3 att4 att5 att6 att7 att8 att9 att10 att11 att12
att13 att14 att15 att16 att17 att18 att19 att20 att21 att22 att23
att24 att25 att26 att27 att28 att29 att30 att31 att32 att33 att34 class
|
|
null null null null null null null null null null null null null
null null null null null null null null null null null null null
null null null null null null null null g/b
|
| DN STATISTICS |
|---|
| Num. divs. setting | 5 |
|---|
| Distributed/Randomised | Yes |
|---|
| Missing values | 0 |
|---|
| Number of records | 351 |
|---|
| Num. input columns | 35 |
|---|
| Num. output columns (Ver 1) | 157 (172) |
|---|
| Density % (Ver 1) | 22.29 (?) |
|---|
| Number of classes | 2 |
|---|
| Num. records per class: |
| Class | Num. Rec. | % |
|---|
| 157 | 126 | 35.90 |
| 156 | 225 | 64.10 |
|
|---|
| File name | ionosphere.D157.N351.C2.num |
|---|
- Data is ordered according to class so must be distributed.
| SCHEMA FILE |
|---|
|
double double double double nominal
|
|
sepalLength sepalWidth petalLength petalWidth class
|
|
null null null null Iris-setosa/Iris-versicolour/Iris-virginica
|
| DN STATISTICS |
|---|
| Num. divs. setting | 5 |
|---|
| Distributed/Randomised | Yes |
|---|
| Missing values | 0 |
|---|
| Number of records | 150 |
|---|
| Num. input columns | 5 |
|---|
| Num. output columns (Ver 1) | 19 (23) |
|---|
| Density % (Ver 1) | 26.32 (21.74) |
|---|
| Number of classes | 3 |
|---|
| Num. records per class: |
| Class | Num. Rec. | % |
|---|
| 17 | 50 | 33.33 |
| 18 | 50 | 33.33 |
| 19 | 50 | 33.33 |
|
|---|
| File name | iris.D19.N150.C3.num |
|---|
- Used to evaluate both CMAR (2) and CPAR (3), but it is unclear what was used
as the class.
- Space separated.
- Not processed.
- Data file is space separated.
- Attributes have settings of 1 or 0.
| SCHEMA FILE |
|---|
|
nominal nominal nominal nominal nominal nominal nominal nominal
|
|
light1 light2 light3 light4 light5 light6 light7 class
|
|
0/1 0/1 0/1 0/1 0/1 0/1 0/1 0/1/2/3/4/5/6/7/8/9
|
| DN STATISTICS |
|---|
| Num. divs. setting | 5 |
|---|
| Distributed/Randomised | Yes |
|---|
| Missing values | 0 |
|---|
| Number of records | 3200 |
|---|
| Num. input columns | 8 |
|---|
| Num. output columns (Ver 1) | 24 (24) |
|---|
| Density % (Ver 1) | 33.33 (33.33) |
|---|
| Number of classes | 10 |
|---|
| Num. records per class: |
| Class | Num. Rec. | % |
|---|
| 21 | 301 | 9.41 |
| 18 | 307 | 9.59 |
| 19 | 312 | 9.75 |
| 17 | 313 | 9.78 |
| 20 | 313 | 9.78 |
| 22 | 314 | 9.81 |
| 23 | 327 | 10.22 |
| 15 | 329 | 10.28 |
| 24 | 334 | 10.44 |
| 16 | 350 | 10.94 |
|
|---|
| File name | led7.D24.N3200.C10.num |
|---|
- Class (lettr) is in first column so must be moved to end.
| SCHEMA FILE |
|---|
|
nominal int int int int int int int int int int int int int int int int
|
|
lettr x-box y-box width high onpix x-bar y-bar x2bar y2bar xybar x2ybr xy2br
x-ege xegvy y-ege yegvx
|
|
A/B/C/D/E/F/G/H/I/J/K/L/M/N/O/P/Q/R/S/T/U/V/W/X/Y/Z none none none none none
none none none none none none none none none none none
|
| DN STATISTICS |
|---|
| Num. divs. setting | 5 |
|---|
| Distributed/Randomised | Yes |
|---|
| Missing values | 0 |
|---|
| Number of records | 20000 |
|---|
| Num. input columns | 17 |
|---|
| Num. output columns (Ver 1) | 106 (106) |
|---|
| Density % (Ver 1) | 16.04 (16.04) |
|---|
| Number of classes | 26 |
|---|
| Num. records per class: |
| Class | Num. Rec. | % |
|---|
| 88 | 734 | 3.67 |
| 106 | 734 | 3.67 |
| 83 | 736 | 3.68 |
| 91 | 739 | 3.70 |
| 90 | 747 | 3.74 |
| 99 | 748 | 3.74 |
| 103 | 752 | 3.76 |
| 95 | 753 | 3.77 |
| 89 | 755 | 3.78 |
| 98 | 758 | 3.79 |
| 92 | 761 | 3.81 |
| 102 | 764 | 3.82 |
| 82 | 766 | 3.83 |
| 85 | 768 | 3.84 |
| 87 | 773 | 3.87 |
| 86 | 775 | 3.88 |
| 94 | 783 | 3.92 |
| 97 | 783 | 3.92 |
| 105 | 786 | 3.93 |
| 104 | 787 | 3.94 |
| 81 | 789 | 3.95 |
| 93 | 792 | 3.96 |
| 100 | 796 | 3.98 |
| 96 | 803 | 4.01 |
| 84 | 805 | 4.03 |
| 101 | 813 | 4.07 |
|
|---|
| File name | letRecog.D106.N20000.C26.num |
|---|
- Class values are in first column so should be moved to end.
- Many missing values (2480).
- Nominal values that do not feature in the data set:
- d and n for attribute gill-attachment (column number 6).
- d for attribute gill-spacing (column number 7).
- u and z for attribute stalk-root (column number 11).
- u for attribute veil-type (column number 16).
- c, s and z for attribute ring-type (column number 19).
| SCHEMA FILE |
|---|
|
nominal nominal nominal nominal nominal nominal nominal nominal nominal
nominal nominal nominal nominal nominal nominal nominal nominal nominal
nominal nominal nominal nominal nominal
|
|
class cap-shape cap-surface cap-color bruises? odor gill-attachment
gill-spacing gill-size gill-color stalk-shape stalk-root
stalk-surface-above-ring stalk-surface-below-ring stalk-color-above-ring
stalk-color-below-ring veil-type veil-color ring-number ring-type
spore-print-color population habitat
|
|
e/p b/c/x/f/k/s f/g/y/s n/b/c/g/r/p/u/e/w/y t/f a/l/c/y/f/m/n/p/s a/d/f/n
c/w/d b/n k/n/b/h/g/r/o/p/u/e/w/y e/t b/c/u/e/z/r f/y/k/s f/y/k/s
n/b/c/g/o/p/e/w/y n/b/c/g/o/p/e/w/y p/u n/o/w/y n/o/t c/e/f/l/n/p/s/z
k/n/b/h/r/o/u/w/y a/c/n/s/v/y g/l/m/p/u/w/d
|
| DN STATISTICS |
|---|
| Num. divs. setting | 5 |
|---|
| Distributed/Randomised | Yes |
|---|
| Missing values | 2480 |
|---|
| Number of records | 8124 |
|---|
| Num. input columns | 23 |
|---|
| Num. output columns (Ver 1) | 90 (127) |
|---|
| Density % (Ver 1) | 25.56 (18.11) |
|---|
| Number of classes | 2 |
|---|
| Num. records per class: |
| Class | Num. Rec. | % |
|---|
| 90 | 3916 | 48.20 |
| 89 | 4208 | 51.80 |
|
|---|
| File name | mushroom.D90.N8124.C2.num |
|---|
- The "non_prob" possible value for column 7 given in names file should be "nonprob".
- Data ordered according to attribute so should be randomised.
| SCHEMA FILE |
|---|
|
nominal nominal nominal nominal nominal nominal nominal nominal nominal
|
|
parents has_nurs form children housing finance social health class
|
|
usual/pretentious/great_pret proper/less_proper/improper/critical/very_crit
complete/completed/incomplete/foster 1/2/3/more convenient/less_conv/critical
convenient/inconv nonprob/slightly_prob/problematic
recommended/priority/not_recom
not_recom/recommend/very_recom/priority/spec_prior
|
| DN STATISTICS |
|---|
| Num. divs. setting | 5 |
|---|
| Distributed/Randomised | Yes |
|---|
| Missing values | 0 |
|---|
| Number of records | 12960 |
|---|
| Num. input columns | 9 |
|---|
| Num. output columns (Ver 1) | 32 (32) |
|---|
| Density % (Ver 1) | 28.13 (28.13) |
|---|
| Number of classes | 5 |
|---|
| Num. records per class: |
| Class | Num. Rec. | % |
|---|
| 29 | 2 | 0.02 |
| 30 | 328 | 2.53 |
| 32 | 4044 | 31.20 |
| 31 | 4266 | 32.92 |
| 28 | 4320 | 33.33 |
|
|---|
| File name | nursery.D32.N12960.C5.num |
|---|
- Space separated.
- Mostly (90%) class 1.
| SCHEMA FILE |
|---|
|
int int int double double double double int int int nominal
|
|
height lenght area eccen p_black p_and mean_tr blackpix blackand wb_trans
class
|
|
null null null null null null null null null null 1/2/3/4/5
|
| DN STATISTICS |
|---|
| Num. divs. setting | 5 |
|---|
| Distributed/Randomised | Yes |
|---|
| Missing values | 0 |
|---|
| Number of records | 5473 |
|---|
| Num. input columns | 11 |
|---|
| Num. output columns (Ver 1) | 46 (55) |
|---|
| Density % (Ver 1) | 23.91 (20.00) |
|---|
| Number of classes | 5 |
|---|
| Num. records per class: |
| Class | Num. Rec. | % |
|---|
| 44 | 28 | 0.51 |
| 45 | 88 | 1.61 |
| 46 | 115 | 2.10 |
| 43 | 329 | 6.01 |
| 42 | 4913 | 89.77 |
|
|---|
| File name | pageBlocks.D46.N5473.C5.num |
|---|
- Made up of two sets of records: pendigits.tes (test) and
pendigits.tra (training), which
have been put together to form a single data file pendigits.num.
- The files pendigits.tes and
pendigits.tra are mostly ', ' (comma-space) separated,
but not entirely so, therefore pendigits.num has been pre-processed
so that it is entirely ',' (comma) separated.
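Where a file mixes ',' and ', ' (and possibly plain spaces) as separators, a regular expression can normalise every record to comma separated form. The sketch below is an assumption about how such pre-processing might be done and is not the script actually used; the names are hypothetical.

```python
import re

# A comma with optional surrounding white space, or a run of white space on its own.
SEPARATOR = re.compile(r"\s*,\s*|\s+")


def to_comma_separated(line: str) -> str:
    """Rewrite one record so that fields are separated by ',' only."""
    return ",".join(SEPARATOR.split(line.strip()))
```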
| SCHEMA FILE |
|---|
|
int int int int int int int int int int int int int int int int nominal
|
|
att1 att2 att3 att4 att5 att6 att7 att8 att9 att10 att11 att12 att13 att14
att15 att16 class
|
|
null null null null null null null null null null null null null null null
null 0/1/2/3/4/5/6/7/8/9
|
| DN STATISTICS |
|---|
| Num. divs. setting | 5 |
|---|
| Distributed/Randomised | Yes |
|---|
| Missing values | 0 |
|---|
| Number of records | 10992 |
|---|
| Num. input columns | 17 |
|---|
| Num. output columns (Ver 1) | 89 (90) |
|---|
| Density % (Ver 1) | 19.10 (18.89) |
|---|
| Number of classes | 10 |
|---|
| Num. records per class: |
| Class | Num. Rec. | % |
|---|
| 83 | 1055 | 9.60 |
| 85 | 1055 | 9.60 |
| 88 | 1055 | 9.60 |
| 89 | 1055 | 9.60 |
| 86 | 1056 | 9.61 |
| 87 | 1142 | 10.39 |
| 80 | 1143 | 10.40 |
| 81 | 1143 | 10.40 |
| 82 | 1144 | 10.41 |
| 84 | 1144 | 10.41 |
|
|---|
| File name | penDigits.D89.N10992.C10.num |
|---|
| SCHEMA FILE |
|---|
|
int int int int int double double int nominal
|
|
NumberPregnacies PlasmaGluConcent DiastolicBldPress TricepsSkinFold
2-HourSerumIns BodyMassIndex
DiabPedFunc Age Class
|
|
none none none none none none none none 0/1
|
| DN STATISTICS |
|---|
| Num. divs. setting | 5 |
|---|
| Distributed/Randomised | Yes |
|---|
| Missing values | 0 |
|---|
| Number of records | 768 |
|---|
| Num. input columns | 9 |
|---|
| Num. output columns (Ver 1) | 38 (42) |
|---|
| Density % (Ver 1) | 23.68 (21.43) |
|---|
| Number of classes | 2 |
|---|
| Num. records per class: |
| Class | Num. Rec. | % |
|---|
| 38 | 268 | 34.90 |
| 37 | 500 | 65.10 |
|
|---|
| File name | pima.D38.N768.C2.num |
|---|
- Made up of two sets of records: soybean-large.data and
soybean-large.test, which
have been put together to form a single data file soybean-large.num.
- Classes in soybean-large.data and
soybean-large.test are grouped so records need to be randomised.
- Column 1 is the class attribute (so move to end).
- Nominal values that do not feature in the data set:
- 2 for attribute stem (column number 19).
- 3 for attribute fruit (column number 29).
| SCHEMA FILE |
|---|
|
nominal nominal nominal nominal nominal nominal nominal nominal nominal
nominal nominal nominal nominal nominal nominal nominal nominal nominal
nominal nominal nominal nominal nominal nominal nominal nominal nominal
nominal nominal nominal nominal nominal nominal nominal nominal nominal
|
|
class date plant-stand precip temp hail crop-hist area-damaged severity
seed-tmt germination plant-growth leaves leafspots-halo leafspots-marg
leafspot-size leaf-shread leaf-malf leaf-mild stem lodging stem-cankers
canker-lesion fruiting-bodies external-decay mycelium int-discolor
sclerotia fruit-pods fruit seed mold-growth seed-discolor seed-size
shriveling roots
|
|
diaporthe-stem-canker/charcoal-rot/rhizoctonia-root-rot/phytophthora-rot/
brown-stem-rot/powdery-mildew/downy-mildew/brown-spot/bacterial-blight/
bacterial-pustule/purple-seed-stain/anthracnose/phyllosticta-leaf-spot/
alternarialeaf-spot/frog-eye-leaf-spot/diaporthe-pod-&-stem-blight/
cyst-nematode/2-4-d-injury/herbicide-injury 0/1/2/3/4/5/6 0/1 0/1/2 0/1/2
0/1 0/1/2/3 0/1/2/3 0/1/2 0/1/2 0/1/2 0/1 0/1 0/1/2 0/1/2 0/1/2 0/1 0/1
0/1/2 0/1/2 0/1 0/1/2/3 0/1/2/3 0/1 0/1/2 0/1 0/1/2 0/1 0/1/2/3 0/1/2/3/4
0/1 0/1 0/1 0/1 0/1 0/1/2
|
| DN STATISTICS |
|---|
| Num. divs. setting | 5 |
|---|
| Distributed/Randomised | Yes |
|---|
| Missing values | 2337 |
|---|
| Number of records | 683 |
|---|
| Num. input columns | 36 |
|---|
| Num. output columns | 118 |
|---|
| Density % | 30.51 |
|---|
| Number of classes | 19 |
|---|
| Num. records per class: |
| Class | Num. Rec. | % |
|---|
| 118 | 8 | 1.17 |
| 116 | 14 | 2.05 |
| 115 | 15 | 2.20 |
| 117 | 16 | 2.34 |
| 100 | 20 | 2.93 |
| 101 | 20 | 2.93 |
| 102 | 20 | 2.93 |
| 105 | 20 | 2.93 |
| 106 | 20 | 2.93 |
| 108 | 20 | 2.93 |
| 109 | 20 | 2.93 |
| 110 | 20 | 2.93 |
| 112 | 20 | 2.93 |
| 104 | 44 | 6.44 |
| 111 | 44 | 6.44 |
| 103 | 88 | 12.88 |
| 113 | 91 | 13.32 |
| 114 | 91 | 13.32 |
| 107 | 92 | 13.47 |
|
|---|
| File name | soybean-large.D118.N683.C19.num |
|---|
- Data set is ordered according to class so must be distributed.
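The note does not spell out what "distributed" involves. One possible reading is that the records of each class are spread evenly through the file rather than left in blocks. The Python sketch below illustrates that reading only; it is not the DN software's own procedure, and `distribute_by_class` and its `class_of` argument are hypothetical names.

```python
from collections import defaultdict


def distribute_by_class(records, class_of):
    """Spread the records of each class evenly through the output.

    Each record gets a fractional position within its own class, and
    all records are then ordered by that position.
    """
    groups = defaultdict(list)
    for record in records:
        groups[class_of(record)].append(record)

    keyed = []
    for label, members in groups.items():
        for i, record in enumerate(members):
            keyed.append(((i + 0.5) / len(members), str(label), record))

    # Ties between classes are broken by the class label.
    keyed.sort(key=lambda item: (item[0], item[1]))
    return [record for _, _, record in keyed]


# Made-up records ordered by class, with the class in the last column.
ordered = ["x,x,x,positive"] * 4 + ["o,o,o,negative"] * 2
spread = distribute_by_class(ordered, lambda line: line.rsplit(",", 1)[1])
labels = [line.rsplit(",", 1)[1] for line in spread]
```

A plain random shuffle would also break up the class ordering; the approach here is deterministic, so repeated runs give the same file.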
| SCHEMA FILE |
|---|
|
nominal nominal nominal nominal nominal nominal nominal nominal nominal
nominal
|
|
top-left-square top-middle-square top-right-square middle-left-square
middle-middle-square middle-right-square bottom-left-square
bottom-middle-square bottom-right-square Class
|
|
x/o/b x/o/b x/o/b x/o/b x/o/b x/o/b x/o/b x/o/b x/o/b positive/negative
|
| DN STATISTICS |
|---|
| Num. divs. setting | 5 |
|---|
| Distributed/Randomised | Yes |
|---|
| Missing values | 0 |
|---|
| Number of records | 958 |
|---|
| Num. input columns | 10 |
|---|
| Num. output columns (Ver 1) | 29 (29) |
|---|
| Density % (Ver 1) | 34.48 (34.48) |
|---|
| Number of classes | 2 |
|---|
| Num. records per class: |
| Class | Num. Rec. | % |
|---|
| 29 | 332 | 34.66 |
| 28 | 626 | 65.34 |
|
|---|
| File name | ticTacToe.D29.N958.C2.num |
|---|
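The output file names on this page appear to follow the pattern name.D(output columns).N(records).C(classes).num. For example, ticTacToe.D29.N958.C2.num matches the 29 output columns, 958 records and 2 classes reported in its statistics table. The Python sketch below reads the three numbers back from a file name, assuming that inferred pattern holds; `parse_output_name` is a hypothetical helper, not part of the DN software.

```python
import re

# Pattern inferred from the file names on this page (an assumption).
PATTERN = re.compile(
    r"^(?P<name>.+)\.D(?P<columns>\d+)\.N(?P<records>\d+)\.C(?P<classes>\d+)\.num$"
)


def parse_output_name(file_name):
    """Split a file name into data set name, output columns, records
    and classes."""
    match = PATTERN.match(file_name)
    if match is None:
        raise ValueError("unexpected file name: " + file_name)
    return {
        "name": match.group("name"),
        "columns": int(match.group("columns")),
        "records": int(match.group("records")),
        "classes": int(match.group("classes")),
    }


info = parse_output_name("ticTacToe.D29.N958.C2.num")
```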
| SCHEMA FILE |
|---|
|
double double double double double double double double double double double
double double double double double double double double double double nominal
|
|
att1 att2 att3 att4 att5 att6 att7 att8 att9 att10 att11 att12 att13 att14
att15 att16 att17 att18 att19 att20 att21 class
|
|
none none none none none none none none none none none none none none none
none none none none none none 0/1/2
|
| DN STATISTICS |
|---|
| Num. divs. setting | 5 |
|---|
| Distributed/Randomised | Yes |
|---|
| Missing values | 0 |
|---|
| Number of records | 5000 |
|---|
| Num. input columns | 22 |
|---|
| Num. output columns (Ver 1) | 101 (108) |
|---|
| Density % (Ver 1) | 21.78 (20.37) |
|---|
| Number of classes | 3 |
|---|
| Num. records per class: |
| Class | Num. Rec. | % |
|---|
| 100 | 1647 | 32.94 |
| 99 | 1657 | 33.14 |
| 101 | 1696 | 33.92 |
|
|---|
| File name | waveform.D101.N5000.C3.num |
|---|
- Class is at column 1 so must be moved to end.
- Data set is ordered according to class so must be distributed.
| SCHEMA FILE |
|---|
|
nominal double double double double int double double double double
double double double int
|
|
Class Alcohol MalicAcid Ash AlcalinityOfAsh Magnesium TotalPhenols
Flavanoids NonflavanoidPhenols Proanthocyanins ColorIntensity Hue
OD280/OD315ofDilutedWines Proline
|
|
1/2/3 null null null null null null null null null null null null null
|
| DN STATISTICS |
|---|
| Num. divs. setting | 5 |
|---|
| Distributed/Randomised | Yes |
|---|
| Missing values | 0 |
|---|
| Number of records | 178 |
|---|
| Num. input columns | 14 |
|---|
| Num. output columns (Ver 1) | 68 (68) |
|---|
| Density % (Ver 1) | 20.59 (20.59) |
|---|
| Number of classes | 3 |
|---|
| Num. records per class: |
| Class | Num. Rec. | % |
|---|
| 68 | 48 | 26.97 |
| 66 | 59 | 33.15 |
| 67 | 71 | 39.89 |
|
|---|
| File name | wine.D68.N178.C3.num |
|---|
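The class labels in the "records per class" tables appear to be output column numbers, with the class values occupying the last columns of the output. For wine, 68 output columns and 3 classes give labels 66, 67 and 68. This is inferred from the tables on this page rather than from a stated rule, and the function name in the Python sketch below is hypothetical.

```python
def class_column_numbers(output_columns, number_of_classes):
    """Return the output column numbers assumed to hold the class
    values, taking them to be the last columns of the output."""
    first = output_columns - number_of_classes + 1
    return list(range(first, output_columns + 1))


# wine figures from the DN statistics table: 68 output columns, 3 classes.
wine_classes = class_column_numbers(68, 3)
print(wine_classes)
```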
- Seven classes according to species of animal.
- First column is animal name so remove.
- Records are listed alphabetically by animal name, so must be randomised.
| SCHEMA FILE |
|---|
|
unused nominal nominal nominal nominal nominal nominal nominal nominal
nominal nominal nominal nominal nominal nominal nominal nominal nominal
|
|
name hair feathers eggs milk airborne aquatic predator toothed backbone
breathes venomous fins legs tail domestic catsize type
|
|
null 0/1 0/1 0/1 0/1 0/1 0/1 0/1 0/1 0/1 0/1 0/1 0/1 0/2/4/5/6/8 0/1 0/1
0/1 1/2/3/4/5/6/7
|
| DN STATISTICS |
|---|
| Num. divs. setting | 5 |
|---|
| Distributed/Randomised | Yes |
|---|
| Missing values | 0 |
|---|
| Number of records | 101 |
|---|
| Num. input columns | 18 |
|---|
| Num. output columns (Ver 1) | 42 (43) |
|---|
| Density % (Ver 1) | 40.48 (39.53) |
|---|
| Number of classes | 7 |
|---|
| Num. records per class: |
| Class | Num. Rec. | % |
|---|
| 40 | 4 | 3.96 |
| 38 | 5 | 4.95 |
| 41 | 8 | 7.92 |
| 42 | 10 | 9.90 |
| 39 | 13 | 12.87 |
| 37 | 20 | 19.80 |
| 36 | 41 | 40.59 |
|
|---|
| File name | zoo.D42.N101.C7.num |
|---|
Created and maintained by
Frans Coenen.
Last updated 18 January 2005