|
|
1. INTRODUCTION |
The LUCS-KDD text mining system is a sophisticated piece of Java code, fronted by a GUI. The software is designed to address two different objectives:
Because of these dual objectives the GUI interface has many different options (most of which would not be required if all you want to do is produce a text classifier).
The classification algorithm used is TFPC, and tghis is incorporated into the GUI. For more information on the text preprocessing approach used see Coenen et al. (2006) and Wang et al. 2006
2. DOWNLOADING THE SOFTWARE |
The text mining software comprises 33 source files. These are provided from this WWW page together with an application class. The class files are as follows:
The relationship between some of the principal classes is as shown in the hierarchies in Tables 1 and 2 below.
|
Table 1: Main class hierarchy
|
Table 2: Set document base class hierarchy
There is also a "tar ball" textMining.tgz that can be downloaded that includes all of the above source and application class files. It can be unpacked using tar -zxf aprioriT.tgz.
The software has been implemented in Java which should therefore make it highly portable. The code does not require any special packages and thus can be compiled using the standard Java compiler:
javac *.java
The code can be documented using Java Doc. First create a directory Documentation in which to place the resulting HTML pages and then type:
javadoc -d Documentation/ *.java
This will produce a hierarchy of WWW pages contained in the Document directory.
3. RUNNING THE SOFTWARE |
When compiled the software can be invoked in the normal manner using the Java interpreter:
java TextMiningGUI_App
The resulting GUI window is shown in Figure 1. If you are planning to process a very large data set it is a good idea to grab some extra memory. For example:
java -Xms600m -Xmx600m TextMiningGUI_App
Figure 1: Text mining GUI Window
The input to the software is a collection of training text files, pre-labeled with an appropriate class name. Table 3 gives an example of such a file taken from the 20 Newsgroup data set (Lang 1995). Further examples taken from the 20 Newsgroup set can be found at:
http://www.csc.liv.ac.uk/~frans/KDD/Software/DataSets/newsGroupsTestData.html .
The thing to note about the example file is the @Class rec.motorcycles meta tag, this needs to be on the first line and tells the text mining system what class the text file belongs to.
@Class rec.motorcycles paint jobs in the uk can anyone recommend a good place for reasonably priced bike paint jobs, preferably but not essentially in the london area. thanks lisa rowlands -- alex technologies ltd cp house 97-107 uxbridge road |
Table 3: Example text file with @class meta tag
Once processed the system will have identified a set (A) of key words or phrases. Consequently the document base will now be represented in terms of a set of records, one record per document, such that each record R is some subset of A. We say that the data set has M columns (attributes) and N rows (records). A small example data sets might be as follows:
1 2 3 6 1 4 5 7 1 3 4 6 1 2 6 1 2 3 4 5 7
where, in this case, A = {1, 2, 3, 4, 5, 6, 7}. Note that attribute numbers are ordered sequentially commencing with the number 1 (the value 0 has a special meaning).
A naming convention is used for the data sets. The name of each text file comprises: (i) a file stem, (ii) a sequential number (N) and (iii) a file end. The naming convention we used for the newsgroups is as follows:
newsgroupsN.txt
where newsgroups is the file stem, and .txt is the file end. The system assumes that all text files making up a document base are stored in a single directory.
To load the text files the system needs to know: (i) the file stem, (ii) the file end. (iii) the number of files (documents) in the document base, and (iv) the directory in which the document base text files are stored. This can be supplied in the form of a document base specification (DB spec.) file. An example (for part of the 20 newsgroups set) is given in Table 4. Note that the list of classes must be all on one line.
ClassList: comp.windows.x rec.motorcycles talk.religion.misc FileStem: newsgroups FileEnd: .txt NumRecords: 3000 |
Table 4: Example DB specification file for 20 Newsgroups sample.
4. GUI OPTIONS |
The interface comprises eight menus:
Details for each menu are given on the following tables.
| File Menu | |
|---|---|
| About | Displays some instructions in the GUI text area. |
| Text Base Dir. | Allows user to input the directory where # the document base is stored. |
| Text Base Spec. | Allows user to input the name (pathway) of the desired DB specification file. |
| First N recs | Option to specify first N documents to be processed (use if not desired to process all documents in a given set). |
| Start | Start the classification (will only work of appropriate input have been made). |
| Start Batch Mode | Run in batch mode; thid is a specialised mode of operation that should only be used where it is desired (for experimentation) to run a sequence of classification applications on the same data set but with different parameters. |
| Start Set of 8 Mode | Specialised mode of operation that undertakes run of eight classification applications using all eight different phrase identification techniques incorporated into the text miner. |
| Exit | Shuts down the GUI. |
| Thresholds Menu | |
|---|---|
| T-hold Input | Brings up a separate window to allow input of thresholds: (i) support, (ii) confidence, (iii) significance, (iv) upper noise and (v) lower noise. |
| T-hold Output | Outputs current threshold values. |
| Sig. Word Strats. Menu | |
|---|---|
| Set Max # Sig Wds. | Allows user to specify number N of significant words to be identafied. The value is application dependent, no harm is done to select too many other than slowing down the computational efficiency. The value is usually in the hundreds. |
| Sig. Contrtib. Gen. | Two ways of calculating the contribution of a given word to determine its significance: (i) "Word Frequency", the number of occurrences of the word in the document set; and (ii) "Word Support", the number of documents in which the word appears (count of 1 per document). |
| Pot. SW list Gen. | Two ways of selecting significant words from the list of potential significant words identified according to their contribution count: (i) "Unique Sig. Words", those that are significant in only one class; and (ii) "All Significant Words". |
| Sig. Word ID Strat. | Two ways of pruning the list of significant words according to the input N: (i) "First N (distributed)", first N distributed equally across classes; and (ii) "First N", top N words in the list (not distributed). |
| Algorithm Menu | |
|---|---|
| DelSN_contGO | Generate key Phrases comprising at least one significant word and ordinary words, using stop and noise words as the delimiters. |
| DelSN_contGW | Generate key Phrases comprising at least one significant word and "wildcard" words, using stop and noise words as the delimiters. |
| DelSO_contGN | Generate key Phrases comprising at least one significant word and noise words, using stop and ordinary words as the delimiters. |
| DelSO_contGW | Generate key Phrases comprising at least one significant word and "wildcard" words, using stop and ordinary words as the delimiters. |
| Keywords | Do not generate phrases but use keywords instead, i.e. the identified significant words as the keywords |
| Evaluation | |
|---|---|
| 90:10 | Evaluate the generated classifier using a 90:10 training/test set split. |
| 50:50 | Evaluate the generated classifier using a 50:50 training/test set split. |
| TCV | Evaluate the generated classifier using Ten Cross Validation (TCV). |
| Class. Strat. | |
|---|---|
| CSA | Order classification rules according to highest confidence value, if two rules have the same confidence then according to highest support value, if two rules have the same confidence and support then according to largest number of antecedent items (attributes). Then apply first rule in the ordered list that that fits a new example. |
| CSA best K | Use CSA ordering but obtain results of first k rules and select the majority. (k=5 in this case.) |
| Output | |||
|---|---|---|---|
| Training set | Eleven options: (i) "Training set doc IDs", list of training set document numbers; (ii) "Training set doc sizes", list of number of words in each training set document, (iii) "Training set raw", the training set documents in full; (iv) "Training set raw 1st 10", the first ten training set documents in full; (v) "Training set marked up", the training set documents marked up according to whether each word is a significant, ordinary, noise or stop word; (vi) "Training set marked up 1st 10", the first ten training set documents marked up according to whether each word is a significant, ordinary, noise or stop word; (vii) "Train. set Phrases/K'words", the training set documents in terms of identified key phrases/words; (viii) "Train. set Phrases/K'words 1st 10", the first ten training set documents in terms of identified key phrases/words; (ix) "Training set attribute #", the list of training set attributes (identified by unique numbers); (x) "Training set Docs. per class", the number of training set documents per class; (xi) "Training Set Stats", set statistics describing the training set. | ||
| Test set | Six options: (i) "Test set doc IDs", list of test set document numbers; (ii) "Test set doc sizes", list of number of words in each test set document, (iii) "Test set raw", the test set documents in full; (iv) "Test set Phrases/Keywords", the test set documents in terms of identified key phrases/words; (v) "Test set attribute #", the list of test set attributes (identified by unique numbers); (vi) "Test Set Stats", set statistics describing the test set. | ||
| Word bin tree | Seven options including a sub-sub-menu.
First six options are: (i) "Word Bin Tree", the entire list of words represented
in the training set (this is actually stored in a binary tree); (ii) "Lower
Noise Words", the lower noise words (words with a count below the lower noise
threshold) that are represented in the training set; (iii) "Upper Noise Words",
the upper noise words (words with a count above the upper noise threshold) that
are represented in the training set; (iv) "Ordinary Words", list of ordinary
words represented in the training set (non-noise words that are not considered
to be significant); (v) "Word Bin Tree Stats", statistics for the generated word
binary tree; (vi) "Words with Count 1", words that only appear once in the
entire document set. The sub-sub-menu has five options all pertaining to significant words: (i) "Significant Words", the list of significant words represented in the training set; (ii) "# pot sig words per class"; the number of potential significant words per class (not the same as the actual number of significant words); (iii) "# sig words per class", the number of actual significant words per class (iv) "Potential Sig. Words", the list of potential significant words represented in the training set; (v) "Top 10 Sig. Wds. per Class", the first ten significant words per class. |
||
| Phrase bin tree | Four options: (i) "Phrase Bin Tree", the entire list of phrases represented in the training set (this is actually stored in a binary tree); (ii) "Phrase Bin Tree Stats", statistics for the generated word binary tree; (iii) "Phrase list", list of phrase; (iv) "Phrase list 1st 100", first 100 phrases in the phrase list | ||
| TFPC | Five options pertaining to the TFPC classifier software incorporated into the text miner: (i) "T-tree Stats", statistics for the T-tree (size, number of updates etc) --- T-tree is a storage structure used during the classifier generation process; (ii) "P & T tree Storage", storage requirement for P- and T-tree (P-tree is a data structure to hold a partially process version of the input data); (iii) "Large Itemsets", list of frequent item sets (some authors call these {large item sets), (iv) "Ttree", the T-tree itself (in serialised form); (v) "Ptree", the P-tree itself (in serialised form). | Classifier | Four options: (i) "Classifier Stats.", some statistics pertaining to the classifier (number of rules, etc.), (iii) "Classifier", the classifier 9as set of classification rules), (iii) "Rules fired", list of the classification rules actually used; (iv) "Classification", the results of the classifier being applied to the test set. |
| Batch | |
|---|---|
| Output Batch | Brings up a separate window to allow input of batch parameters: (i) support, (ii) confidence, (iii) significance, (iv) upper noise and (v) lower noise. In each case user can specify the start value, the end value and the increment when stepping up from the start value to the end value (negative value for the step when going backwards). |
| Output Batch | Outputs current batch mode threshold parameters. |
5. THE PHRASE IDENTIFICATION ALGORITHM |
Each document is processed to identify the words in it. Words are continuous sequences of alphabetic characters delimited by non-alphabetic characters, e.g. punctuation marks, white space and numbers. Some non-alphabetic characters ('!', ',', '.', ':', ';' and '?'), referred to as stop marks, play a role in the identification of phrases and are indicated in the processed document by a "null". All other non-alphabetic characters are ignored. As a result some single words in the training set may be identified as two words, for example Tom's will be the word Tom and the "word" s.
Words are stored in a bin tree.
Four types of word are identified in the document base.
A number of different schemes for identifying phrases can be identified:
To determine weather a word is significant or not the process is as follows:
The algorithm uses two bin trees, one for words and one for phrases, and two arrays of arrays for the training and test sets respectively. The algorithm works as follows:
PHASE 1: READING THE TRAINING SET
PHASE 2: READING THE TEST SET
Starts with method findPhrasesInDB in class PhraseMining.
PHASE 3: CLASSIFICATION
6. CONCLUSIONS |
The TFPC text mining software described here has been used successfully by the LUCS-KDD research group for the purpose of conduction research into text mining issues. The software is available for free, however the authors would appreciate appropriate acknowledgement. The following reference format for refering to the TFP implementation available here is suggested:
Should you discover any "bugs" or other problems within the software (or this documentation), do not hesitate to contact the authors.
REFERENCES |
Created and maintained by Frans Coenen. Last updated 4 March 2008