
Introduction

Information agents are intelligent pieces of software that can automatically search for information on the WWW [2] [10]. They usually deal with multiple Web sites in a single domain or in multiple domains. One key step in building an information agent is to extract information from multiple Web sites, that is, to transform the important information into structured data so that searches can be carried out as accurately as queries against a structured database.

The main challenge in building an information agent is making the agent scalable and adaptable. More and more online documents are becoming available, and each has a different data format. The number of Web sites and domains is huge and growing very fast. Existing Web pages are updated continuously, and their data formats may be modified at any time without warning. While it might be easy to handcraft an information agent for one particular Web site in one specific domain at one particular time, keeping the agent up to date as that site changes, and adapting and scaling it to new Web sites and new domains, is a big challenge. There is an urgent need for methods and tools that ease agent generation and adaptation.

Recent research has used machine learning technology to build scalable agents [1] and to automatically learn information extraction patterns [6] [7] [9]. However, these systems work only on relatively structured Web pages. The majority of Web pages have flexible data formats, for example, data presented in free text and spread across sentences and paragraphs, and are out of reach of current automatic systems.

Our research introduces a knowledge-based approach to support the generation and adaptation of information agents. We view an information agent as a knowledge-based system. The knowledge for guiding information extraction, such as information extraction patterns, is saved in the knowledge base of the agent. The information extraction process is coded as an inference engine. We assume the knowledge can be separated from the information extraction process. Instead of building an agent from scratch, an agent can be generated by adding knowledge bases to a reusable shell. An agent can be adapted to new domains and new Web sites by changing the knowledge bases. In slogan form,

Information Agent = Knowledge Bases + Agent Shell
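
To make the slogan concrete, the following is a minimal sketch of the separation it describes, not the implementation used in this paper. It assumes, purely for illustration, that a knowledge base is a set of regular expressions, one per field; the names `KnowledgeBase` and `AgentShell`, the patterns, and the sample advertisements are all hypothetical.

```python
import re
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class KnowledgeBase:
    """Domain-specific knowledge (hypothetical format: one regex per field)."""
    name: str
    patterns: Dict[str, str] = field(default_factory=dict)


class AgentShell:
    """Reusable extraction engine that holds no domain-specific knowledge."""

    def __init__(self, knowledge_base: KnowledgeBase) -> None:
        self.kb = knowledge_base

    def extract(self, text: str) -> Dict[str, Optional[str]]:
        """Apply every pattern in the knowledge base to the text.

        Each pattern must contain one capturing group; a field whose
        pattern does not match is reported as None.
        """
        record: Dict[str, Optional[str]] = {}
        for field_name, pattern in self.kb.patterns.items():
            match = re.search(pattern, text, flags=re.IGNORECASE)
            record[field_name] = match.group(1) if match else None
        return record


# Two knowledge bases for two domains (illustrative patterns only).
car_ads = KnowledgeBase("car_ads", {
    "price": r"\$\s?([\d,]+)",
    "year": r"\b((?:19|20)\d{2})\b",
})
flat_ads = KnowledgeBase("flat_ads", {
    "rent": r"\$\s?([\d,]+)\s*(?:per week|pw)",
    "bedrooms": r"(\d+)\s*(?:bedrooms?|brm)",
})

# The same shell becomes a different agent when given a different knowledge base.
car_agent = AgentShell(car_ads)
flat_agent = AgentShell(flat_ads)

print(car_agent.extract("1996 Toyota Corolla, tidy, $5,500 ono"))
print(flat_agent.extract("Sunny 2 bedroom flat near city, $180 per week"))
```

In this sketch, adapting the agent to a new domain means writing a new `KnowledgeBase`; the `AgentShell` code is left unchanged, which is the property the slogan is meant to capture.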

We focus on building agents for information extraction from semi-structured data, that is, data in a format intermediate between free text and the structured data held in databases. Typical examples are Web pages provided by online services such as classified advertisements, product categories, and telephone books. We believe that knowledge plays an important role in information extraction from semi-structured data, and that such extraction can be achieved with a limited amount of knowledge and only simple natural language processing. Semi-structured data provides the right level of diversity and difficulty for testing our methods.

The rest of this paper is organized as follows. Section 2 discusses the knowledge that is useful for building agents and describes our classification of knowledge into three categories. Section 3 introduces the agent architecture. The two main parts of an agent, the knowledge bases and the information extraction engine, are discussed in Sections 4 and 5 respectively. Section 6 gives some experimental results, and the final section concludes the paper.


Xiaoying Gao
Tue Dec 11 16:30:56 NZDT 2001