January 2000
Flexible Information Extraction Learning Algorithm
Contract No. F30602-99-C-0102
Wei Li, Principal Investigator
Rohini K. Srihari, Ph.D., Co-Principal Investigator
The proliferation of electronic documents has created an information overload. It has become necessary to develop automated tools for quickly extracting key information from the mass of documents. The popularity of news clipping services, advanced WWW search engines and software agents illustrates the need for such tools. Currently, only shallow Information Extraction (IE), mainly the identification of named entities, is available for commercial applications. There is an acute demand for high-level extraction of relationships and events in situations where massive amounts of natural language text are involved.
Cymfony Inc. has assessed the technical feasibility of domain-independent, high-level information extraction by effectively employing machine learning techniques. A hierarchical, modular system named Textract has been proposed for high-level as well as low-level IE tasks. In the Textract architecture, high-level IE consists of two main modules/tasks: Correlated Entity (CE) extraction and General Event (GE) extraction. CE extracts multiple pre-defined relationships between entities, such as the relationships of “affiliation”, “position”, “address”, and “email” for a person entity. GE is designed to extract open-ended key events to provide information on who did what (to whom), when and where. These relationships and events may be contained within sentence boundaries or span a discourse of running text. The application of Textract/IE to the task of natural language Question Answering (QA) has also been explored.
A unique hybrid approach has been employed that combines the best of two paradigms: machine learning and rule-based systems using finite state transducers (FST). The latter has the advantage of being intuitive as well as efficient. However, its knowledge acquisition is laborious and incomplete, especially when domain portability is involved. Machine learning techniques address this deficiency by learning automatically from an annotated corpus. Statistical techniques such as Hidden Markov Models, maximum entropy and rule induction have been examined for possible use in the different tasks and modules of this effort.
The work implemented by Cymfony under this SBIR Phase I grant includes the IE system architecture, task definitions, machine learning toolkit development, FST grammar modeling for relationship/event extraction, implementation of the Textract/CE prototype, implementation of the Textract/QA prototype based on IE results and a detailed simulation involving all the modules up to general event extraction. These accomplishments make the feasibility study reliable and provide a solid foundation for future system development.
The goal of this research was to assess the technical feasibility of developing tools, with emphasis on machine learning techniques, for high-level information extraction from electronic documents. The specific objectives for the Phase I effort were:
· to implement a suite of machine learning tools for IE;
· to design and implement the Level-2 IE prototype Textract 2.0, i.e. a system for extracting pre-defined multiple relationships between entities based on Textract 1.0, the existing named entity tagger developed by Cymfony;
· to conduct the feasibility study of the higher level IE task for event extraction.
These objectives have been achieved successfully within the time frame of this Phase I.
Much of the electronic text that is generated daily concerns multiple relationships and events of various types. These relationships and events usually involve entities such as:
· person
· organization
· location
· time
· numerical expressions (money, percent, weight, length, etc.)
· contact information (e.g., telephone number, email, address)
Cymfony has developed technology to identify multiple relationships between these entities. A prototype Textract 2.0/CE has been implemented for relationship extraction. The use of IE results in natural language Question Answering (QA) has also been explored. This has led to the implementation of a working QA prototype, namely, Textract 1.0/QA. More significantly, the research under SBIR Phase I demonstrated that open-ended General Events (GE) involving these entities could also be extracted fairly reliably. The work on CE, GE and QA is regarded as a significant further step beyond the currently available systems for shallow IE because more meaningful information is made available to a user when individual, isolated named entities are inter-related.
The success of this effort has created a solid foundation for implementing a deployable IE system, Textract 3.0, for both high-level and low-level information extraction, defined as the final goal for the entire project. More importantly, in achieving this Phase I objective, the feasibility of the proposed hybrid approach of machine learning and FST rules to advanced IE tasks has been verified.
2.1. Conceptual Design: Background
This section presents issues involving conceptual design to serve as the background for this research. The major work on the conceptual design for the Textract multi-level IE system was completed in the previous SBIR effort “A Domain Independent Event Extraction Toolkit” (Contract No: F30602-97-C-0179) [Srihari 1998][1]. In particular, Cymfony established the overall architecture for this hierarchical system. Although further refinement may be necessary, the conceptual design along with the task definitions and system architecture constitute a blueprint for the Phase I work reported in this document.
The last decade has seen great advances and interest in the area of IE. In the US, the DARPA-sponsored Tipster Text Program [Grishman 1997] and the Message Understanding Conferences (MUC) [MUC-7 1998] have been the driving force for developing this technology. In fact, the MUC specifications for various IE tasks have become de facto standards in the IE research community. It is therefore necessary to present this report in the context of the MUC program.
MUC divides IE into distinct tasks, namely, NE (Named Entity), TE (Template Element), TR (Template Relation), CO (Co-reference), and ST (Scenario Templates) [Chinchor & Marsh 1998]. The proposal presented in this report for three levels of IE is modeled after the MUC standards using MUC-style representation. However, the MUC IE task definitions were modified in order to make them more useful and more practical.
The rationale is the following. In order to ensure portability across domains, we need to push the idea of domain independence to the limit while retaining the hierarchical nature of MUC-style task definitions for IE. MUC TR is an effort in this direction, but TR does not provide much information: only a few relations are defined for TR (i.e. LOCATION_OF, EMPLOYEE_OF, PRODUCT_OF). ST, on the other hand, does provide detailed information for events of interest in pre-defined complicated templates, but it is generally felt that the task is too ambitious for commercial application at present. Besides, ST is totally domain dependent. Each defined ST by nature only addresses the needs of a particular group to gain information on a very restrictive topic in a domain. This poses real challenges when a system needs to be ported to another domain.
2.1.2 Enhanced Named Entity Tagging
Cymfony had developed a state-of-the-art NE tagger, as one deliverable in the SBIR project "A Domain Independent Event Extraction Toolkit" (Contract No: F30602-97-C-0179) [Srihari 1998].
The Textract definition of NE has significantly expanded the type of information to be extracted. In addition to all the MUC defined NE types (person, organization, location, time, date, money and percent), the following entities are also identified by the existing NE tagger, Textract 1.0:
· duration, frequency, age
· number, fraction, decimal, ordinal, math equation, weight, length, temperature, angle, area, capacity, speed, rate
· product, trademark, software
· address, email, phone, fax, telex, www
· named event, conference
These new types of named entities enable an NE system to better satisfy a user’s information needs. More importantly, they provide a better foundation for defining multiple relationships between the identified entities and for supporting question answering functionality. In question answering, it was found that the key to a question processor is to identify the asking point (who, what, when, where, etc.). In many cases, the asking point corresponds to an NE beyond the MUC definition, e.g. the how-type questions: how long (duration or length depending on the question context), how far (length), how often (frequency), how old (age), etc. Therefore, an extended NE tagset is required for sophisticated IE.
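The link between asking points and the extended NE tagset can be illustrated with a small lookup. This is only a minimal sketch, not the Textract question processor: the names `ASKING_POINT_TYPES` and `asking_point` are hypothetical, and matching on the opening words of the question is a simplifying assumption. Only the how-type entries come from the examples above.

```python
from typing import Dict, List

# Hypothetical lookup from how-type question cues to candidate NE types.
# The pairs mirror the examples given in the text; the table itself is an
# illustrative assumption, not part of Textract.
ASKING_POINT_TYPES: Dict[str, List[str]] = {
    "how long": ["duration", "length"],  # disambiguated later by question context
    "how far": ["length"],
    "how often": ["frequency"],
    "how old": ["age"],
}


def asking_point(question: str) -> List[str]:
    """Return the candidate NE types for the asking point of a question."""
    normalized = question.lower().strip()
    for cue, ne_types in ASKING_POINT_TYPES.items():
        if normalized.startswith(cue):
            return ne_types
    return []  # no how-type cue recognized


print(asking_point("How old was Julian Hill when he died?"))
```

A question such as "how long" returns two candidates, which reflects the remark above that the choice between duration and length depends on the question context.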
2.1.3 Extraction of Correlated Entities
A more significant step in intelligent information extraction is to correlate identified NEs rather than simply identifying individual, isolated NEs. With a solid foundation of the implemented NE tagger, Textract 1.0, Cymfony was in a position to define the task for the level-2 IE for extracting pre-defined multiple relationships, named CE (for Correlated Entity) extraction.
The CE Template was defined as a feature-value list to represent information on entities such as person, organization, product or named event. Each feature gives some information about the entity in one aspect. Each defined relation is represented by a feature slot in the CE Template. The goal for level-2 processing was to fill the slots for CE templates if such information exists in the processed text. Assume the following text (from MUC-7 data) to be processed for the CE extraction.
Julian Hill, a research chemist whose accidental discovery of a tough, taffylike compound revolutionized everyday life after it proved its worth in warfare and courtship, died on Sunday in Hockessin, Del. He was 91.
Hill died at the Cokesbury Village retirement community, where he had lived in recent years with his wife of 62 years, Polly.
............
Julian Werner Hill was born in St. Louis, graduated from Washington University there in 1924 and earned a doctorate in organic chemistry from the Massachusetts Institute of Technology in 1928. His wife recalled on Wednesday that his doctoral studies were delayed a year because he was stricken with scarlet fever.
Hill played the violin and was an accomplished squash player and figure-skater until his early 40s, when an attack of polio weakened one leg, his wife said.
Before his retirement from Du Pont in 1964, Hill supervised the company's program of aid to universities for research in physics and chemistry.
............
The CE prototype developed in Phase I has extracted information about the entity Julian Werner Hill, and filled the CE Template below, among other CE templates.
name: <Julian Werner Hill>; <Julian Werner>; <Julian Hill>; <Hill>
type: "PERSON"
position: research chemist
age: 91
gender: "MALE"
birth_place: St. Louis
affiliation: <Du Pont Co.>
trained_in: <Washington University>;
<Massachusetts Institute of Technology>
spouse: <Polly>
descriptors: an accomplished squash player and figure-skater
The information in the CE represents a mini-CV of the person. In general, the CE template in Textract 2.0 integrates and enriches the information contained in MUC TE and TR. This CE technology is expected to add considerable value to conventional searching or browsing.
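As an illustration of the feature-value list described in this section, a CE template can be modeled as a record whose slots each hold one or more values. This is a minimal sketch under stated assumptions: the class `CETemplate` and its `fill` method are hypothetical names chosen for this example and do not reflect the actual Textract data structures.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class CETemplate:
    """Hypothetical feature-value list for one correlated entity."""
    entity_type: str
    names: List[str]
    # One slot per pre-defined relation; a slot may hold several values.
    slots: Dict[str, List[str]] = field(default_factory=dict)

    def fill(self, relation: str, value: str) -> None:
        """Add a value to a relation slot, skipping exact duplicates."""
        values = self.slots.setdefault(relation, [])
        if value not in values:
            values.append(value)


# Slot values taken from the Julian Hill template shown in the text.
hill = CETemplate(
    entity_type="PERSON",
    names=["Julian Werner Hill", "Julian Werner", "Julian Hill", "Hill"],
)
hill.fill("position", "research chemist")
hill.fill("age", "91")
hill.fill("birth_place", "St. Louis")
hill.fill("trained_in", "Washington University")
hill.fill("trained_in", "Massachusetts Institute of Technology")
hill.fill("spouse", "Polly")

print(hill.slots["trained_in"])
```

Keeping each slot as a list lets a relation such as trained_in accumulate values found in different sentences of the same document.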
2.1.4 Extraction of General Events
The concept of general event (GE) as an IE task comes from the notion of shallow events (or simple events) proposed in the IE project for Intelligence Analyst Associate (IAA) [Pine 1996] and in the initial feasibility study of Textract [Srihari 1998]. A shallow event for IAA is basically a syntactic subject-predicate or S-V-O (subject-verb-object) relationship; the shallow event for Textract is defined to be semantic, with logical S-V-O relationships at the core. With the progress of research and development of Textract, this task has been more precisely defined as a tangible and very useful high-level IE objective, namely GE.
Unlike CE, which covers pre-defined relationships, the GE Template is designed to capture an open-ended range of relations expressed in natural language texts by filling the PREDICATE slot with the actual verb used in the text. More precisely, in this proposed system, a GE Template is defined as an expanded argument structure, i.e. a feature-value list with the following slots: logical PREDICATE, logical subject ARGUMENT1 (who), and/or logical object ARGUMENT2 (what/whom), and/or other types of complement ARGUMENT3 (for information like to whom), plus the associated information of LOCATION (where) and TIME (when) or FREQUENCY (how often).
For example, given the same text about the news of Julian Hill's death shown previously, the GE system will extract a series of general events as simulated below:
<GE-001>=
PREDICATE: die
ARGUMENT1: <Julian Werner Hill>
TIME: Sunday
LOCATION: Cokesbury Village
Hockessin, Del.
.........
<GE-008>=
PREDICATE: graduate
ARGUMENT1: <Julian Werner Hill>
ARGUMENT2: from <Washington University>
TIME: 1924
LOCATION: St. Louis
.........
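The expanded argument structure can be sketched as a record with one field per slot. The following is an illustrative sketch only: `GETemplate` and `render` are hypothetical names, and treating every slot except PREDICATE as optional is an assumption drawn from the "and/or" wording of the slot definition.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class GETemplate:
    """Hypothetical expanded argument structure for one general event."""
    predicate: str                    # the actual verb used in the text
    argument1: Optional[str] = None   # logical subject (who)
    argument2: Optional[str] = None   # logical object (what/whom)
    argument3: Optional[str] = None   # other complement (e.g. to whom)
    time: Optional[str] = None        # when
    frequency: Optional[str] = None   # how often
    location: List[str] = field(default_factory=list)  # where


def render(ge_id: str, ge: GETemplate) -> str:
    """Print only the filled slots, in the slot order used in the text."""
    lines = [f"<{ge_id}>=", f"PREDICATE: {ge.predicate}"]
    optional_slots = (
        ("ARGUMENT1", ge.argument1),
        ("ARGUMENT2", ge.argument2),
        ("ARGUMENT3", ge.argument3),
        ("TIME", ge.time),
        ("FREQUENCY", ge.frequency),
    )
    for label, value in optional_slots:
        if value is not None:
            lines.append(f"{label}: {value}")
    if ge.location:
        lines.append("LOCATION: " + "; ".join(ge.location))
    return "\n".join(lines)


# The two simulated events from the Julian Hill text.
ge_001 = GETemplate("die", argument1="Julian Werner Hill", time="Sunday",
                    location=["Cokesbury Village", "Hockessin, Del."])
ge_008 = GETemplate("graduate", argument1="Julian Werner Hill",
                    argument2="from Washington University", time="1924",
                    location=["St. Louis"])

print(render("GE-001", ge_001))
print(render("GE-008", ge_008))
```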
GE is designed to capture only key information about events; unimportant information is filtered out, based on some pragmatic considerations, including:
· ignoring modifiers other than time and location; the system only extracts information on who did what (to whom) when and where, disregarding other details of an event, like modifiers of purpose, effect, accompanying circumstance, etc.
· extracting only the events where at least one argument slot is filled by a named person or organization (or its anaphor); the assumption is that events involving individual persons or organizations are usually of particular interest to potential users
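The two pragmatic considerations above can be sketched as a single filtering step. This is a minimal sketch under stated assumptions, not the Textract pragmatic filter: events are represented here as plain slot dictionaries, `entity_types` is a hypothetical lookup from an argument string to its NE type, and anaphors are assumed to have been resolved to named entities beforehand.

```python
from typing import Dict, Optional

ARGUMENT_SLOTS = ("ARGUMENT1", "ARGUMENT2", "ARGUMENT3")
KEPT_MODIFIERS = ("TIME", "LOCATION")  # all other modifiers are dropped
NAMED_TYPES = {"PERSON", "ORGANIZATION"}


def filter_event(event: Dict[str, str],
                 entity_types: Dict[str, str]) -> Optional[Dict[str, str]]:
    """Apply the two considerations; return None if the event is discarded."""
    # Keep the event only if some argument is a named person or organization.
    has_named_argument = any(
        entity_types.get(event.get(slot, "")) in NAMED_TYPES
        for slot in ARGUMENT_SLOTS
    )
    if not has_named_argument:
        return None
    # Keep the predicate, the arguments, and the time/location modifiers only.
    kept = ("PREDICATE",) + ARGUMENT_SLOTS + KEPT_MODIFIERS
    return {slot: value for slot, value in event.items() if slot in kept}


entity_types = {"Julian Werner Hill": "PERSON"}  # hypothetical NE lookup
raw_event = {
    "PREDICATE": "supervise",
    "ARGUMENT1": "Julian Werner Hill",
    "ARGUMENT2": "the company's program of aid to universities",
    "PURPOSE": "for research in physics and chemistry",  # dropped by the filter
}
print(filter_event(raw_event, entity_types))
```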
For this entire project, a hierarchical, 3-level architecture was proposed for developing a kernel IE system that is domain-independent throughout.
Figure 1 shows a blueprint for the proposed overall system architecture involving all the major modules.
Figure 1: Textract IE System Architecture
As can be seen in Figure 1, the core of the system consists of three kernel IE modules[2] and five linguistic modules. These modules remain domain independent. The linguistic modules serve as an underlying support system for different levels of IE. A striking feature of the Textract system architecture is that the IE technology is being built on multi-level NLP (Natural Language Processing) support. The IE results are stored in a database which is the basis for several IE-related applications, including QA (Question Answering or Intelligent Querying), IB (Intelligent Browsing/Threading) and AS (Automatic Summarization). All the IE and NLP modules adopt a common data structure for information exchange and updating.
This proposed system is hierarchical in the sense that each lower-level IE module can be developed independently and runs independently of the higher-level modules. With this hierarchical design, the complicated tasks of IE and linguistic processing are decomposed into well-defined sub-tasks.
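The hierarchical property can be illustrated with a toy pipeline in which every module reads and updates one shared record, and processing may stop after any level. This is only a sketch under stated assumptions: the stub modules, the `Document` dictionary and `run_pipeline` are hypothetical, and the stubs record placeholder strings rather than performing any extraction.

```python
from typing import Callable, Dict, List

# Hypothetical common data structure shared by all modules.
Document = Dict[str, object]
Module = Callable[[Document], Document]

LEVELS: List[str] = ["NE", "CE", "GE"]  # level 1, level 2, level 3


def make_stub(level: str) -> Module:
    """Build a placeholder module that only records that its level ran."""
    def module(doc: Document) -> Document:
        results = doc["results"]
        assert isinstance(results, dict)
        results[level] = f"{level} output placeholder"
        return doc
    return module


MODULES: Dict[str, Module] = {level: make_stub(level) for level in LEVELS}


def run_pipeline(text: str, up_to: str) -> Document:
    """Run the levels in order and stop after the requested one."""
    if up_to not in LEVELS:
        raise ValueError(f"unknown level: {up_to}")
    doc: Document = {"text": text, "results": {}}
    for level in LEVELS:
        doc = MODULES[level](doc)
        if level == up_to:
            break
    return doc


# The NE level runs on its own, without the CE or GE modules.
print(run_pipeline("Julian Hill died on Sunday.", up_to="NE")["results"])
```

In this arrangement a higher level depends on the results written by the lower levels, but never the reverse, which is the sense of "hierarchical" used above.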
Table 1 gives a concise comparison of the IE/NLP task definitions in the Textract architecture and those defined in MUC.
Textract modules | Corresponding MUC definitions | Remark |
NE | NE | more NE types defined |
CE | TE, TR | more TR types defined in CE |
GE | ST | substituting domain dependent ST by Textract domain independent GE |
CO | CO | Textract CO allows for non-deterministic output |
POS | | not used in MUC |
Shallow Parsing | | undefined in MUC, but assumed (for supporting TE/TR/CO/ST) |
Full Parsing | | undefined in MUC, but assumed (for supporting ST and/or TE/TR/CO) |
Pragmatic Filtering | | undefined in MUC |
Table 1: Comparison of IE Task Definitions
[1] Some ideas have been further developed in this research, e.g. from the concept of simple/shallow events to the concept of general events. But they did not affect the general conceptual design.
[2] It is expected that for each level of IE, some optional domain dependent module can be developed and ‘plugged-in’ to complement the domain independent extraction. This in principle will not affect existing modules. For example, in addition to the existing definition of NE types and subtypes in Textract 1.0, a customized version for the medicine domain will include new NE types/subtypes like names of physician, nurse, patient, disease, drug, procedure, symptom, etc.
(to be continued at Pre-Knowledge-Graph Profile Extraction Research via SBIR (2) 2015-10-24)
Related Publications
Srihari, R, W. Li and X. Li, 2006.
Question Answering Supported by Multiple Levels of Information Extraction, a book chapter in T. Strzalkowski & S. Harabagiu (eds.), Advances in Open-Domain Question Answering. Springer, 2006, ISBN: 1-4020-4744-4.
Srihari, R., W. Li, C. Niu and T. Cornell. 2006.
InfoXtract: A Customizable Intermediate Level Information Extraction Engine. Journal of Natural Language Engineering, 12(4), 1-37, 2006.
Gleanings from history: Early arguments for a hybrid model for NLP and IE