
BCS SGAI WORKSHOP REPORT

STATE-OF-THE-ART NATURAL LANGUAGE PROCESSING (NLP) SYSTEMS

(Friday 12th September 2003)

Frans Coenen

Dept. of Computer Science

The University of Liverpool

The "state of the art" NLP workshop held at Nottingham Trent University on Friday 12 September was part of an occasional series of workshops held under the auspices of the British Computer Society's Specialist Group on Artificial Intelligence (SGAI). This one was organised and chaired by Tony Allen, the group's workshop coordinator.

The workshop started at 10:00 with Tony Allen giving the opening remarks. As for numbers, I counted some 23 people in the room during the opening address; a few more arrived halfway through the morning and some left immediately before and after lunch --- in other words, a reasonable turn-out.

The first speaker of the day was Somayajulu Sripada from the Department of Computing Science at the University of Aberdeen. Somayajulu described a system (SUMTIME-MOUSAM) for the automatic generation of textual weather forecasts from standard Numerical Weather Prediction (NWP) models. The work is supported by Weathernews UK, which is using the system to generate marine weather forecasts for the offshore oil industry. The system operates using what is essentially an Expert System approach to information extraction from NWP models. The paper was well received. (Just as an aside, Chris Mellish --- of Clocksin and Mellish fame --- has joined the Computing Science department at Aberdeen, making it a "budding" centre for NLP.)

The next speaker was Diana Maynard of the Department of Computer Science at the University of Sheffield who provided an overview of the MUSE system. MUSE is an Information Extraction (IE) engine developed using GATE (General Architecture for Text Engineering). Diana commenced by emphasising that IE is not the same as Information Retrieval (IR) and likened IE systems to advanced WWW search engines. She continued her presentation by considering the basic and complex challenges of IE and how these are addressed by MUSE. She completed her presentation with a discussion of the ease with which MUSE can be adapted to other languages, citing several "real life" examples, including the system's adaptation to Cebuano (which, as everyone knows, is spoken in the southern Philippines). Overall this was an excellent presentation.

TONY ALLEN

Tony Allen of the Department of Computer Science at Nottingham Trent University, the NLP Systems Workshop Chair and SGAI Workshop Coordinator

DIANA MAYNARD

Diana Maynard of the Department of Computer Science at The University of Sheffield, "Multi-source and multi-lingual information extraction"

The third presentation was by Alexiei Dingli, a colleague of Diana Maynard at the University of Sheffield, who described a mechanism that allows IE systems to learn to extract domain-specific information from WWW resources (work carried out with, amongst others, Yorick Wilks, who gave the technical keynote presentation at ES2001). The learning is more or less unsupervised in that the process commences with what Alexiei described as "highly reliable or easy-to-mine sources" from which information is extracted using an IE engine. This information is then used to "bootstrap" more complex IE, the result of which is then used for even more complex IE, and so on. The initial IE engine (for bootstrapping) exploits the presence, or absence, of redundancy (i.e. multiple occurrences of the same information) in WWW pages. Alexiei described the operation of the methodology in some detail with reference to a trial application concerned with the mining of CS department WWW pages.
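To make the bootstrapping idea a little more concrete, the following is a minimal, self-contained sketch of the general approach in Python. It is emphatically not Alexiei's actual system; all of the names, patterns and example pages in it are invented. A simple, high-precision pattern plus a redundancy filter harvests facts from "easy" pages, and the trusted facts are then used to anchor extraction from messier pages.

# A minimal illustration of the bootstrapping idea (NOT the Sheffield system;
# all names, patterns and pages here are invented).
import re
from collections import Counter

# Stage 1: a simple, high-precision pattern for "easy" pages, e.g. staff lists.
EASY_PATTERN = re.compile(r"(?P<name>[A-Z][a-z]+ [A-Z][a-z]+)\s*<(?P<email>\S+@\S+)>")

def seed_extract(easy_pages):
    """Extract (name, email) facts, trusting only those that occur on
    more than one page, i.e. exploiting redundancy."""
    counts = Counter()
    for page in easy_pages:
        for m in EASY_PATTERN.finditer(page):
            counts[(m.group("name"), m.group("email"))] += 1
    return {fact for fact, n in counts.items() if n > 1}

# Stage 2: use the already-trusted names to anchor extraction from harder text.
def bootstrap_phones(known_people, hard_pages):
    phones = {}
    for name, _email in known_people:
        for page in hard_pages:
            m = re.search(re.escape(name) + r"\D{0,40}(\d[\d -]{6,})", page)
            if m:
                phones[name] = m.group(1).strip()
    return phones

easy = ["Staff: Ann Smith <a.smith@dept.example.ac.uk>",
        "Contact Ann Smith <a.smith@dept.example.ac.uk> for admissions."]
hard = ["Ann Smith can be reached on 0151 794 0000 (room 2.12)."]
people = seed_extract(easy)
print(people)                          # {('Ann Smith', 'a.smith@dept.example.ac.uk')}
print(bootstrap_phones(people, hard))  # {'Ann Smith': '0151 794 0000'}

In the real work the "rules" learnt at each stage are of course far richer than a single regular expression, but the shape of the loop --- easy sources first, redundancy as a proxy for reliability, then progressively harder extraction --- is the same.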

The final presentation before lunch was by Bayan Abu Shawar from the School of Computing at the University of Leeds, who spoke about chatbots. Chatbots are an established part of agent technology in which a software agent takes the place of a human in (say) a call centre scenario. Bayan described an approach to automatically adapting the capabilities of a chatbot, typically designed to operate in English, to alternative languages, especially "minority languages" (in a sense continuing the theme of Diana Maynard's presentation). Bayan commenced by outlining some of the history of "chatbot" research and then went on to describe the software development process behind her current work. Her current system has been tested using Afrikaans, and Bayan completed her presentation with a practical demonstration of this system.

ALEXIEI DINGLI

Alexiei Dingli of the Department of Computer Science at The University of Sheffield, "Integrating information to bootstrap information extraction from WWW sites"

BAYAN ABU SHAWAR

Bayan Abu Shawar of the School of Computing at The University of Leeds, "Machine learning from dialogue corpora to generate chatbots"

The first paper after lunch was by Heather Powell of the Computing Department at Nottingham Trent University, who gave the first of three Neural Network (NN) related papers. Heather presented an update on ongoing research by her and two colleagues (Lindsay Evett and Shaomin Zhang) concerned with automated knowledge acquisition from textual sources, using Neural Networks, for the development of Knowledge Based Systems (some of us still remember the "knowledge acquisition bottleneck" of the late 1980s!). At the heart of this process is a natural language interface that allows the identification of keywords. The identification of these keywords is founded on what Heather referred to as "seed words", which in turn are provided by the end user.

The second presentation after lunch was by Sheila Garfield of the Centre for Hybrid Intelligent Systems at The University of Sunderland. Sheila described ongoing work, supported by BT, concerned with the automated routing of calls to helpdesks (call centres are currently a significant application area for NLP technologies). Sheila commenced by describing the application domain and the difficulties encountered in the automated understanding of "human utterances". The application was founded on a corpus of 8441 recorded operator assistance telephone calls divided into 19 classes (with the preponderance of calls falling into a small number of classes). To carry out the desired classification Sheila used "word association" coupled with two alternative matching mechanisms, the first founded on a Recurrent (Neural) Network and the second on a Support Vector Machine (SVM). Overall Sheila reported that the Recurrent Network performed better than the SVM; however, the SVM approach did offer some advantages, suggesting that a hybrid approach might be a fruitful avenue for further research.
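Purely by way of illustration, the short sketch below shows what the SVM side of such a comparison can look like with present-day tools (Python and scikit-learn): short utterances are turned into bag-of-words features and routed to call classes by a linear SVM. This is not the system Sheila described; the tiny corpus and the three class labels are invented stand-ins for the real 8441-call, 19-class corpus.

# Illustrative only: bag-of-words features plus a linear SVM for routing
# short utterances to call classes. Utterances and labels are invented.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

train_utterances = [
    "i have been overcharged on my last bill",
    "there is a mistake on my bill",
    "my phone line is dead",
    "i cannot get a dial tone",
    "i would like to order a new line",
    "can i add broadband to my account",
]
train_classes = ["billing", "billing", "faults", "faults", "sales", "sales"]

# unigram and bigram counts stand in for the "word association" features
model = make_pipeline(CountVectorizer(ngram_range=(1, 2)), LinearSVC())
model.fit(train_utterances, train_classes)

print(model.predict(["the charges on this bill look wrong",
                     "no dial tone since yesterday"]))
# likely output: ['billing' 'faults']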

HEATHER POWELL

Heather Powell of the Computing Department at Nottingham Trent University, "Neural Networks for thematic concept extraction"

SHEILA GARFIELD

Sheila Garfield of the Centre for Hybrid Intelligent Systems at The University of Sunderland, "Recurrent Neural Learning for Classifying Spoken Utterances"

The final presentation of the day was by Jonathan Tepper, also of the Computing Department at Nottingham Trent University. Jonathan's presentation was concerned with natural language parsing (as opposed to processing), which is of course the first step towards both semantic interpretation and information retrieval. Jonathan commenced by giving a thorough overview of parsing, covering symbolic and statistical approaches, and then focussed on connectionist networks for parsing, particularly corpus-based connectionist parsing. Jonathan then went on to describe a particular application of the latter. The approach made use of various sentence delimiters and word tags. The application corpus comprised 654 sentences of between 2 and 27 words in length, with an average length of 7 words. Encouraging results were produced, but with some limitations which Jonathan and his co-workers are currently addressing.

Overall I felt this was a highly successful event and look forward to the next SGAI workshop. The most appealing aspect of the whole enterprise, at least for me, was its low cost --- allowing me an opportunity to learn something about NLP in a manner that was both accessible and economical.

The papers from the workshop will appear in a special issue of Expert Update scheduled for February 2004.


Postscript

For anybody interested in organising an SGAI workshop the requirements are:

  1. Subject: The workshop should address an AI-related topic.
  2. Cost: The emphasis should be on low cost; registration for the NLP workshop was set at £35 for SGAI members, and £45 for non-members (which included copies of papers and lunch).
  3. Venue: The cost limitation also means that organisers should have access to an appropriate venue for little or no cost.
  4. Speakers' Budget: The cost of the workshop is also intended to allow for a contribution to speakers' travel costs (typically £40 to £60).
  5. Publication: Papers will be published in the group's magazine, Expert Update. Note that space is limited (typically to about six pages of single-spaced, 11pt, A4) and that authors retain the copyright in their work.
  6. Underwriting: SGAI's mission is to both foster and promote interest in all aspects of AI. SGAI is not particularly interested in making a profit (although at the same time it cannot afford significant losses) and, as such, would be willing to provide support for workshops to be run under its auspices.

Anybody interested in organising an SGAI workshop should, in the first instance, contact Tony Allen (email: tony.allen@ntu.ac.uk).



