Design of Concept-based, General Event
Project finished September 2003
1.1. Identification and Significance of the Problem or Opportunity
Currently, only keyword based, shallow IE, mainly the identification of named entities and simple events, is available for deployment. There is an acute demand for concept based, intermediate-level extraction of events.
In this effort, we seek to develop a deployable Concept-based, General Event extraction system, namely CGE. Essentially, this effort aims at ‘translating’ the language-specific, keyword-based representation of IE results into a type of ‘interlingua’ based mainly on concepts. More precisely, the key verb for a shallow event will be mapped into a concept cluster (e.g. kill → {kill, cause to die, put to death}); the time and location of the event will be normalized (e.g. last Saturday → 1999-01-30).
1.2. Research Objectives
The goal of this effort is the development of a deployable intermediate-level IE system (i) which is capable of extracting and merging concept based general events from free English text and (ii) which can support event visualization applications.
2.1. Conceptual Design
The last decade has seen great advances and interest in the area of Information Extraction (IE). In the US, the DARPA-sponsored Tipster Text Program [Grishman 1997], the Message Understanding Conferences (MUC) [Chinchor 1998], DARPA’s Evidence Extraction and Link Discovery (EELD, http://www.darpa.mil/ipto/Solicitations/CBD_01-27.html) and DARPA’s Translingual Information Detection, Extraction, and Summarization (TIDES, http://www.darpa.mil/iao/BAA03-23-PIP.pdf) have been driving forces for developing this technology.
The most successful IE task thus far has been Named Entity (NE) tagging. The state of the art, exemplified by systems such as NetOwl [Krupka and Hausman 1998], IdentiFinder [Miller et al. 1998] and InfoXtract [Srihari et al. 2000], has reached near-human performance, with 90 percent or above F-measure. On the other hand, the deep-level MUC IE task Scenario Template (ST) is designed to extract detailed information for predefined event scenarios of interest. It involves filling slots of complex event templates, including causality and aftermath slots. It is generally felt that this task is too ambitious for deployable systems at present, except for very narrowly focused domains. This report focuses on new intermediate-level information extraction tasks that are defined and implemented based on our existing IE engine, named InfoXtract. Specifically, it defines new concept-based IE tasks such as Entity Profile (EP) extraction, which is designed to accumulate interesting information about an entity across documents as well as within a discourse. Furthermore, Concept-based General Event (CGE) is defined as a domain-independent representation of event information that is useful and more tractable than MUC ST.
A variety of IE engines, reflecting various goals in terms of extraction as well as architectures, are now available. Among these, the most widely used are the GATE system from the University of Sheffield [Cunningham et al. 2003], the IE components from Clearforest (www.clearforest.com), SIFT from BBN [Miller et al. 1998], REES from SRA [Aone and Ramon-Santacruz 1998] and various tools provided by Inxight (www.inxight.com). Of these, the GATE system most closely resembles InfoXtract in terms of its goals as well as the architecture and customization tools. Cymfony differentiates itself by using a hybrid model that efficiently combines statistical and grammar-based approaches, as well as by using an internal data structure known as a token-list that can represent hierarchical linguistic structures and IE results for multiple modules to work on.
2.2.2. Concept-based IE
Traditional IE aims at extracting textual strings and classifying them as mentions of entities, relationships or events (i.e. keyword-based information objects). Concept-based IE requires the conversion of keyword-based information objects into concept level representation. This involves an information fusion process through which mention-level objects are merged and consolidated into concept-based objects. The entity-centric information will be fused into Entity Profiles as a concept-based object for representing real world entities. The action-centric information will be fused and linked into event scenarios to model what happens in the real world.
Entity Profile (EP) Extraction
The inception of EP forms an important step towards concept-based IE representation. EP extraction was proposed as a significant intermediate-level IE task: collecting information about a given entity, say, Julian Hill, and generating his profile. The extracted profile in effect represents a miniature résumé of the person, as shown in the EP popup in our IE-supported intelligent text browser (Figure 1).
Figure 1. Screenshot for Intelligent Browsing Prototype Based on InfoXtract EP
The progress from NE to EP is a significant development in IE representation for an entity. EP enriches the information contained in MUC TE and TR [Li & Srihari 2000]. The design is to integrate the two types of information and represent them in an entity-centric format. As building blocks of EP, Correlated Entity (CE) relationships consist of three types of information: (i) specific relationships such as affiliation/staff (corresponding to EMPLOYEE_OF in MUC TR), (ii) general relationships such as descriptors, modifiers and associated-entities, and (iii) links to involved-events. The introduction of ‘modifiers’ and ‘associated-entities’ in addition to the MUC ‘descriptors’ reflects the desire to form a relationship back-off in order not to miss potentially important information about an entity. For example, even if the ‘head-of’ relationship is not specifically defined, the ‘associated-entities’ relationship should still link ‘bin Laden’ with ‘Al-Qaeda’.
In Cymfony’s design, EPs are fused information objects used to model the individual entities in the outside world. These objects embody a fused collection of information extracted and merged from various places in the text. This is unlike the mention-level information objects such as NE, CE, or simple events in the form of Subject-Verb-Object (SVO) triples, which correspond to a token string (e.g. Julian Hill, Du Pont, research chemist) or linked token strings (e.g. POSITION<Julian Hill → research chemist>) in the text.[1]
Concept-based General Event (CGE)
GE was proposed to extract open-ended key events pertaining to who did what to whom when and where [Li & Srihari 2000]. CGE intends to further conceptualize the IE representation for events. The key to CGE is to transform keyword-based representation into concept-based representation as appropriate attribute values. More precisely, a CGE AVM is defined as consisting of the following types of attributes: (i) a predefined event type for the given domain (which can be detected using light-weight lexicon grammar rules); (ii) attributes about involved entities such as people-involved and organizations-involved whose appropriate values are EPs; (iii) event location whose appropriate value is a location EP that has undergone location normalization; (iv) event time whose appropriate value is a normalized time.
A CGE AVM is a superset of GE in that it is linked to the GE or GEs from which it is derived. There are good reasons for retaining the sentence-level real objects (GEs) in the CGE. First, this reflects a phased, modular approach to IE in which a deeper-level object is built on top of lower-level objects. Second, the GEs (as well as the underlying snippets from which they are extracted) serve as evidence for the derived CGE.
Assume the incoming sentence is ‘Yesterday IBM announced the appointment of John Smith as CEO due to the sudden resignation of Peter Lee’. The corresponding domain independent CGE Attribute Value Matrix (AVM) and the domain specific PE AVM are listed below:
<Domain Independent CGE-I>
Keyverb: {appoint,charge}
Logical-subject: <IBM>
Logical-object: <John Smith>
Complement: ‘as CEO’
Time: ‘Yesterday’(2002/05/30)
Reason: ‘due to the sudden resignation of Peter Lee’
<Domain Specific PE>
Event-type: ExecutiveChange
Involved-company: <IBM>
Person-in: <John Smith>
Person-out: <Peter Lee>
Position: ‘CEO’
Time: ‘Yesterday’(2002/05/30)
Reason-in: ‘due to the sudden resignation of Peter Lee’
The proposed AVM of PE+CGE events is shown below:
<Domain Specific CGE-II>
Event-type: ExecutiveChange
Involved-entities: <IBM>;<John Smith>; <Peter Lee>
Complement-nouns: ‘CEO’
Time: ‘Yesterday’(2002/05/30)
Reason: ‘due to the sudden resignation of Peter Lee’
Snippet: ‘Yesterday IBM announced the appointment of John Smith as CEO due to the
sudden resignation of Peter Lee.’
As seen, the entity profiles in the grammatical slots (subject, object, etc.) are now listed under the slot “Involved-entities”. This is because for domain specific CGE, the key verb is further conceptualized to an important event concept/type in the given domain. This makes the original grammatical relationships such as logical subject no longer appropriate. For example, an event type like ‘Job Change’ (a supertype of Executive Change event) may involve the keyverb ‘hire’ or its antonym ‘fire’: note that the object of ‘hire’ is dramatically different than the object of ‘fire’. Therefore, when the keyverb maps to a predefined event type such as ‘Job Change’, the original grammatical roles no longer suit. However, as involved entities, they are always appropriate. The user can easily determine the exact roles of these entities from the attached snippet; no need for automatic full fledged PE extraction. In many applications, it is sufficient as long as the involved entities are determined to be associated with a given event. These entities provide an important angle for indexing extracted events (normalized time stamping and location association are the other two important angles of indexing events for applications such as visualization; we can also index events according to event types or their hierarchy). From the “Involved-events” slot of these entity profiles, a user can access relevant events easily.
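The hire/fire point can be made concrete with a minimal sketch. It assumes a hypothetical keyverb-to-event-type table (`EVENT_TYPE_BY_KEYVERB`) and plain dictionaries in place of real AVMs; none of these names come from the InfoXtract implementation. Once the keyverb maps to an event type, the grammatical role slots are collapsed into a single involved-entities list and the snippet is kept as evidence.

```python
from typing import Optional

# Hypothetical mapping from keyverbs to predefined event types (illustrative only).
EVENT_TYPE_BY_KEYVERB = {
    "hire": "JobChange",
    "fire": "JobChange",
    "appoint": "ExecutiveChange",
}

# Grammatical role slots that stop being meaningful once the verb is conceptualized.
ROLE_SLOTS = ("Logical-subject", "Logical-object")


def to_cge2(cge1: dict, snippet: str) -> Optional[dict]:
    """Collapse grammatical roles into 'Involved-entities' for a typed event."""
    event_type = EVENT_TYPE_BY_KEYVERB.get(cge1["Keyverb"])
    if event_type is None:
        return None  # keyverb does not map to any predefined event type
    involved = [cge1[slot] for slot in ROLE_SLOTS if slot in cge1]
    return {
        "Event-type": event_type,
        "Involved-entities": involved,
        "Snippet": snippet,  # kept so a user can read off the exact roles
    }


hired = to_cge2(
    {"Keyverb": "hire", "Logical-subject": "IBM", "Logical-object": "John Smith"},
    "IBM hired John Smith.",
)
fired = to_cge2(
    {"Keyverb": "fire", "Logical-subject": "IBM", "Logical-object": "John Smith"},
    "IBM fired John Smith.",
)
```

Both calls yield the same event type and the same involved entities; only the snippet tells the reader whether John Smith joined or left, which is the division of labor the paragraph above argues for.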
The following sample AVMs illustrate the transition from keyword-based GE to concept-based CGE.
<CGE 200> =
keyverb: {die, pass_away: DIE}
who: <Julian Werner Hill: PersonProfile 001>
when: {Sunday: 1996-01-07}
where: <Hockessin, Del.: LocationProfile 301>
<CGE 202> =
keyverb: {graduate_from: GRADUATE_FROM}
who: <Julian Werner Hill: PersonProfile 001>
whom-what: <Washington University: OrgProfile 101>
when: {1924: 1924}
where: <St. Louis: LocationProfile 300>
As shown in the notation, there are two types of conceptualization involved in the transition: (i) values placed in {…} involve some type of sense disambiguation or normalization; (ii) values placed in <…> are actually embedded virtual objects: technically they can be regarded as hyperlinks from this virtual object to the related virtual objects. In both types of values, the first part of the value represented before the colon inside the brackets retains keyword-based flavor which provides a link to the real tokens in the text snippets (from which this information was extracted) and which serves as evidence for the extracted information. The second part of the value, following the colon inside the brackets represents the concept.
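As an illustration of this two-part notation, the following sketch parses a single slot value into its keyword-based part and its concept part. The `SlotValue` class and `parse_slot_value` function are hypothetical names introduced here, and the parser assumes the keyword part contains no colon of its own.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class SlotValue:
    kind: str     # "normalized" for {...} values, "link" for <...> values
    surface: str  # keyword-based part before the colon (evidence in the text)
    concept: str  # concept or profile identifier after the colon


# Opening bracket, text up to the first colon, text after it, closing bracket.
_PATTERN = re.compile(r"^([{<])\s*(.*?)\s*:\s*(.*?)\s*([}>])$")


def parse_slot_value(text: str) -> SlotValue:
    """Split '{surface: CONCEPT}' or '<surface: Profile id>' into its parts."""
    match = _PATTERN.match(text.strip())
    if match is None:
        raise ValueError(f"not a bracketed slot value: {text!r}")
    opener, surface, concept, closer = match.groups()
    if (opener, closer) == ("{", "}"):
        kind = "normalized"
    elif (opener, closer) == ("<", ">"):
        kind = "link"
    else:
        raise ValueError(f"mismatched brackets: {text!r}")
    return SlotValue(kind, surface, concept)


keyverb = parse_slot_value("{die, pass_away: DIE}")
where = parse_slot_value("<Hockessin, Del.: LocationProfile 301>")
```

Here `keyverb.concept` is `DIE` while `keyverb.surface` keeps the original verb forms, and `where.concept` names the linked profile object.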
The replacement of a verb literal as the predicate of an event by a concept cluster is an important step towards concept-based representation. A concept cluster refers to a synonym set, similar to the notion of synset from WordNet [Beckwith et al. 1991]. The use of a concept cluster to represent the core concept of an event enables the use of a concept hierarchy in event merging. For example, since the cluster {get, acquire: GET} is a hypernym of the synset {buy, purchase: BUY}, a CGE of {buy, purchase: BUY} can be merged with a CGE of {get, acquire: GET} as long as other parts of the information are unifiable.
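A minimal sketch of hierarchy-aware merging follows. The toy `HYPERNYM` table, the dictionary representation of a CGE, and the choice to keep the more specific concept in the merged result are all assumptions made for illustration; the source only states that such events can be merged when the remaining information is unifiable.

```python
from typing import Optional

# Hypothetical toy hierarchy: each concept maps to its direct hypernym.
HYPERNYM = {"BUY": "GET"}


def is_a(concept: str, ancestor: str) -> bool:
    """True if `concept` equals `ancestor` or lies below it in the hierarchy."""
    current: Optional[str] = concept
    while current is not None:
        if current == ancestor:
            return True
        current = HYPERNYM.get(current)
    return False


def merge_events(e1: dict, e2: dict) -> Optional[dict]:
    """Merge two CGEs if their concepts are compatible and all slots unify."""
    c1, c2 = e1["concept"], e2["concept"]
    if is_a(c1, c2):
        concept = c1  # assumption: keep the more specific concept
    elif is_a(c2, c1):
        concept = c2
    else:
        return None
    merged = {"concept": concept}
    for slot in sorted((set(e1) | set(e2)) - {"concept"}):
        a, b = e1.get(slot), e2.get(slot)
        if a is not None and b is not None and a != b:
            return None  # conflicting values: the events are not unifiable
        merged[slot] = a if a is not None else b
    return merged


buy = {"concept": "BUY", "who": "AOL", "whom-what": "Netscape"}
get = {"concept": "GET", "who": "AOL", "when": "1999"}
merged = merge_events(buy, get)
```

The merged event keeps `BUY` as its concept and carries the union of the two slot sets; a `GET` event with a different `who` would be rejected.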
The idea of using Entity Profiles instead of NEs as values of argument slots in CGE follows the practice of the MUC ST standards where the fillers of event participants are usually TE Templates. But Entity Profile collects far more information about an entity than TE, as shown in the previous sample profile <PersonProfile 001> on Julian Hill.
The last major issue is location/time normalization in CGE. This is crucial in supporting event visualization on a map or along a timeline. Normalization transforms variations of expressions into a canonical form corresponding to an absolute instance of the time/location concept. Cymfony has proposed to use international standards as the internal representation, e.g., ISO 8601 standards for time/date and geographic representation standards from Geographic Information Systems (GIS). Variations of time/location NEs will be disambiguated and mapped into the chosen ISO standards using various techniques proposed under this effort.
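To make the time side of this concrete, here is a small sketch that resolves a few relative expressions against a document date and emits ISO 8601 calendar dates. The function name, the handful of supported expressions, the reading of "last Saturday" as the most recent Saturday strictly before the reference date, and the reference dates used in the example are all assumptions; they are not the techniques proposed under this effort.

```python
from datetime import date, timedelta
from typing import Optional

WEEKDAYS = ["monday", "tuesday", "wednesday", "thursday",
            "friday", "saturday", "sunday"]


def normalize_time(expression: str, reference: date) -> Optional[str]:
    """Resolve a relative time expression to an ISO 8601 date string."""
    text = expression.strip().lower()
    if text == "today":
        return reference.isoformat()
    if text == "yesterday":
        return (reference - timedelta(days=1)).isoformat()
    if text.startswith("last "):
        day = text[len("last "):]
        if day in WEEKDAYS:
            # Days back to the most recent such weekday strictly before `reference`.
            back = (reference.weekday() - WEEKDAYS.index(day) - 1) % 7 + 1
            return (reference - timedelta(days=back)).isoformat()
    return None  # expression not covered by this sketch


# Reference (document) dates below are assumed for the sake of the example.
saturday = normalize_time("last Saturday", date(1999, 2, 3))
yesterday = normalize_time("Yesterday", date(2002, 5, 31))
```

With these assumed document dates, the two calls return `1999-01-30` and `2002-05-30`, matching the normalized values used in the examples of this report.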
It is worth pointing out that the location normalization result is actually embodied in the location profile which contains not only the normalized form of the location name but also records the relationships, as shown in the sample AVM below.
<LocationProfile 301> ::
name: Hockessin
normalized-form: (a standard to be chosen)
type: CITY
state: DELAWARE
country: USA
The significance of the proposal of CGE for concept-based event extraction lies mainly in the capability of directly supporting information visualization. With normalized time and location, the monitoring of events on timelines and maps can be enabled based on continuous processing of incoming text sources. The other benefit of CGE is its capability in facilitating information consolidation. As information is normalized and disambiguated in the representation of concept-based information objects, it creates a condition for information merging and consolidation. The CGE task defined here can also be adopted as the common basis, or IE interlingua, for multi-lingual IE development. Language-specific IE/NLP modules can be developed and ultimately mapped to the concept-based CGE AVMs in the storage. This would potentially allow events extracted from sources in different languages to be merged and enriched. More importantly, as the key information in these templates is conceptualized and language independent, an information analyst who does not know a particular language can still retrieve and understand the essential information of the events.
Another benefit of directing research along this line is the possible separation of event extraction into two sub-tasks: (i) event (type) detection; and (ii) event entity association (involved-entities slot filling). The lexicon grammar rules are only designed to handle task (i). This makes the task of constructing a lexicon grammar for event detection a lot simpler. The key point is not to waste too much time trying to pull out the event participants, attributes and so on, and just focus on highlighting sentences that might mention an event of interest. If this reduces a lexicon grammar to little more than a list of keywords plus some necessary contextual constraints to search for, that in itself is an interesting result.
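A lexicon grammar reduced to "keywords plus contextual constraints" might look like the sketch below. The rule format, the trigger and context patterns, and the `detect_events` function are hypothetical and written as plain regular expressions; they do not reproduce the actual Lexicon Grammar tool.

```python
import re

# Hypothetical rules: a trigger keyword pattern plus one contextual constraint.
RULES = [
    {
        "event_type": "ExecutiveChange",
        "trigger": re.compile(r"\b(appoint(s|ed|ment)?|resign(s|ed|ation)?)\b", re.I),
        "context": re.compile(r"\b(CEO|CFO|president|chairman)\b", re.I),
    },
    {
        "event_type": "CompanyAcquisition",
        "trigger": re.compile(r"\b(buy|buys|bought|acquire[sd]?|acquisition)\b", re.I),
        "context": re.compile(r"\$\s?\d|\bshareholder", re.I),
    },
]


def detect_events(sentence: str) -> list:
    """Return the event types whose trigger and context both match the sentence."""
    return [
        rule["event_type"]
        for rule in RULES
        if rule["trigger"].search(sentence) and rule["context"].search(sentence)
    ]


s1 = ("Yesterday IBM announced the appointment of John Smith as CEO "
      "due to the sudden resignation of Peter Lee.")
s2 = ("AOL will buy Netscape for $4.2 billion in the spring of 1999, "
      "subject to regulatory and shareholder approval")
detected = [detect_events(s1), detect_events(s2), detect_events("I will buy milk.")]
```

The detector only flags sentences that might mention an event of interest; it makes no attempt to identify participants, which is left to the separate association step.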
This separation between event tagging (more precisely, tagging event type to the keyverb of a local event mention) and event entity association is in line with the modularization of a number of similar IE tasks. It is similar to the relationship between NE tagging (tagging proper name mentions) and CE association between NEs. Identified messages such as modifiers and descriptors also need to be associated with NEs. We found that there are some common practices and techniques that can be used to conduct these association tasks. Grammar-based structural association, supported by co-reference, and structure-less co-occurrence-based association can be combined to reach the best balance between precision and recall. In general, if the association module is not separated from the tagging module, we will not have as much flexibility in combining the two techniques in one task. This is because the required balance of the two techniques for each of the two tasks is usually not the same. The separation strategy provides an additional strong case for the fine-grained modularization effort.
For the required entity-event association in CGE-II, the starting point is SVO parsing. Any NEs directly or indirectly linked with the tagged keyverb are taken to be “involved entities” of this event mention. We want to go beyond the sentence level to determine the association of entities with their involved events. Therefore, co-occurrence constraints plus co-reference will play a role here. In fact, even within sentence boundaries, clause-level and sentence-level co-occurrence constraints can on many occasions also help raise the recall of the entity-event association with minimal or no cost to precision. This compensates for parsing errors.
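The combination of structural links with a co-occurrence back-off can be sketched as follows. The input shapes (a list of `(head, role, entity)` triples standing in for SVO parser output, plus lists of entities found in the same clause and sentence) are assumptions, and only direct links to the keyverb are handled; indirect links and co-reference are left out of this sketch.

```python
def associate_entities(keyverb, svo_links, clause_entities, sentence_entities=()):
    """Map each involved entity to the strongest evidence that links it to the event.

    svo_links: iterable of (head, role, entity) triples from a parser (assumed shape).
    clause_entities / sentence_entities: NEs co-occurring with the keyverb.
    """
    involved = {}
    # 1. Structural evidence: entities directly attached to the tagged keyverb.
    for head, _role, entity in svo_links:
        if head == keyverb:
            involved[entity] = "structural"
    # 2. Back-off: co-occurrence in the same clause, then in the same sentence.
    for entity in clause_entities:
        involved.setdefault(entity, "clause co-occurrence")
    for entity in sentence_entities:
        involved.setdefault(entity, "sentence co-occurrence")
    return involved


links = [("buy", "subject", "AOL"), ("buy", "object", "Netscape")]
result = associate_entities("buy", links, ["AOL", "$4.2 billion"])
```

Recording the evidence type alongside each entity lets a downstream consumer trade recall for precision, for example by ignoring sentence-level co-occurrence in noisy text.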
Part of the reason for the lightweight event detection lexicon grammar is to have very quick development time for a set of event detectors (event taggers) so that we can deploy this technology in real-life applications. The second benefit is that it is easier for third parties or analysts to pick up and write their own rules quickly. This is because writing rules for event detection/tagging is much easier than full-fledged PE extraction, which involves checking and mapping GE slots to PE slots. Finally, based on our IE-supported product development experience, we found that in the majority of application cases there is no need for the fine-grained semantics reflected in PE AVMs. A user with his own domain knowledge and common sense can easily figure out almost as much information from a CGE-II AVM as from a full-fledged PE. For example, as long as the event type is determined to be ‘Executive Change’, it makes little difference to an information analyst or user whether ‘CEO’ is placed into an abstract association slot ‘Complement-nouns’ for CGE-II or into a very specific slot ‘Position’ for PE. However, the work of extracting the right values and placing them into very specific PE slots has proven to be extremely labor-intensive, skill-demanding and error-prone. In contrast, the concept of CGE-II sidesteps this effort by recognizing that the users of information systems have their own intelligence to easily figure things out when the information overload has been handled by the system and the relevant information pieces are presented to them.
In summary, there is little point in struggling to tag the participants in an event. The feeling is that our goal is really to draw the attention of a human to potentially interesting text passages, rather than to try to fill up a database automatically, with no human oversight.
Sample CGE-IIs are shown below:
<CGE-II>
Event-type: Executive Change
Involved-entities: <IBM>;<John Smith>; <Peter Lee>
Complement-nouns: ‘CEO’
Time: ‘Yesterday’(2002/05/30)
Reason: ‘due to the sudden resignation of Peter Lee’
Snippet: ‘Yesterday IBM announced the appointment of John Smith as CEO due to the
sudden resignation of Peter Lee.’
<CGE-II>
Event-type: CompanyAcquisition
Involved-entities: <AOL>;<Netscape>
Time: ‘in the spring of 1999’ (1999/02-1999/04)
Condition: ‘subject to regulatory and shareholder approval’
Snippet: ‘AOL will buy Netscape for $4.2 billion in the spring of 1999,
subject to regulatory and shareholder approval’
Since named entities (NEs) are usually important information objects, their participation in an event presents important aspects of the event details which should not be missed. In the second example for CGE-II above, the money amount is one key piece of information for this event that should be made explicit in the representation. In fact, all types of involved NEs (or their corresponding entity profiles), including person, organization, time, location, money, etc., should also serve as an indexing dimension of the related events to facilitate retrieval of events. For example, the following types of queries are of great interest to a user:
Find all Executive Change events involving IBM from 1999 to 2001.
Find all Company Acquisition events involving the money amount exceeding $1 billion.
To satisfy such queries, indexing CGE-II using the involved NEs is critical. In addition, the linkage between NEs and their involved events also facilitates automatic hyperlinks between an NE index page and the events in the intelligent browsing/navigation application, as well as other forms of information presentation, including visualization applications.
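The two sample queries can be served by a simple inverted index from involved NEs to events, sketched below. The event records, field names such as `year` and `money_usd`, and the `build_index` and `find_events` helpers are hypothetical; a deployed system would index full entity profiles and normalized time ranges rather than these simplified fields.

```python
from collections import defaultdict

# Toy event store built from the two sample events (field names are assumptions).
EVENTS = [
    {"id": 1, "event_type": "ExecutiveChange", "year": 2002,
     "entities": [("Company", "IBM"), ("Person", "John Smith"), ("Person", "Peter Lee")]},
    {"id": 2, "event_type": "CompanyAcquisition", "year": 1999, "money_usd": 4.2e9,
     "entities": [("Company", "AOL"), ("Company", "Netscape")]},
]


def build_index(events):
    """Inverted index: (NE type, name) -> set of ids of events involving that NE."""
    index = defaultdict(set)
    for event in events:
        for key in event["entities"]:
            index[key].add(event["id"])
    return index


def find_events(events, index, event_type, entity=None, years=None, min_money=None):
    """Return ids of events of `event_type` that pass the optional filters."""
    candidates = index.get(entity, set()) if entity else {e["id"] for e in events}
    hits = []
    for event in events:
        if event["id"] not in candidates or event["event_type"] != event_type:
            continue
        if years and not (years[0] <= event["year"] <= years[1]):
            continue
        if min_money is not None and event.get("money_usd", 0) <= min_money:
            continue
        hits.append(event["id"])
    return hits


index = build_index(EVENTS)
ibm_changes = find_events(EVENTS, index, "ExecutiveChange",
                          entity=("Company", "IBM"), years=(1999, 2001))
big_deals = find_events(EVENTS, index, "CompanyAcquisition", min_money=1e9)
```

With this toy data the first query returns no events, because the sample IBM event is dated 2002, and the second returns the AOL and Netscape acquisition.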
<CGE-II>
Event-type: ExecutiveChange
Company-Involved: <IBM>
Person-Involved: <John Smith>; <Peter Lee>
Noun-Involved: ‘CEO’
Time-Involved: ‘Yesterday’(2002/05/30)
Reason-Involved: ‘due to the sudden resignation of Peter Lee’
GE-involved: <GE1
Keyverb: appointment
Object: John Smith
Complement: as CEO
>
Snippet-Involved: ‘Yesterday IBM announced the appointment of John Smith as CEO
due to the sudden resignation of Peter Lee.’
<CGE-II>
Event-type: CompanyAcquisition
Company-Involved: <AOL>;<Netscape>
Time-Involved: ‘spring of 1999’ (1999/02-1999/04)
Money-Involved: ‘$4.2 billion’
Condition-Involved: ‘subject to regulatory and shareholder approval’
GE-involved: <GE2
Keyverb: will buy
Subject: AOL
Object: Netscape
Time: in the spring of 1999
Other-Modifier: for $4.2 billion
Other-Modifier: subject to regulatory and shareholder approval
>
Snippet-Involved: ‘AOL will buy Netscape for $4.2 billion in the spring of 1999,
subject to regulatory and shareholder approval’
The above AVMs contain many details of event information; however, uncovering these details does not require specific domain-dependent rules. This effect has been achieved by maximizing the leverage of domain-independent linguistic processing. In particular, by using the rich types and subtypes of NE information, the hidden domain-specific attribute relationships are made fairly explicit to a human user. For example, in the case of ExecutiveChange events, the Company-Involved attribute is derived from the NeCompany type instead of being decoded by special domain-dependent event grammar rules. In the case of Company Acquisition events, Money-Involved, again derived from the NeMoney type, apparently implies the Price information, achieving the same effect as a special attribute such as PriceOfAcquisition, which would otherwise have to be defined and decoded by special event grammar rules. In the case of Company Acquisition, such treatment will not directly decode which is the Acquiring Company and which is the Acquired Company; both are put into the same attribute slot, Company-Involved. However, a human user can easily determine this in one of the following ways: (i) based on his domain knowledge: AOL is a lot larger than Netscape, so in Company-Involved, AOL should be the buyer and Netscape should be the ‘buy-ee’; (ii) by looking at the keyverb and the logical grammar relationships in the GE AVM; or (iii) by reading the snippet which describes this event.
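The derivation of typed slots from NE types can be sketched in a few lines. `NeCompany` and `NeMoney` are the type names mentioned above; `NePerson`, `NeTime`, the mapping table and the `build_cge2` helper are assumptions added for illustration.

```python
# NE type -> generic CGE-II slot. NeCompany and NeMoney appear in the text;
# NePerson and NeTime are assumed names for the corresponding NE types.
NE_TYPE_TO_SLOT = {
    "NeCompany": "Company-Involved",
    "NePerson": "Person-Involved",
    "NeMoney": "Money-Involved",
    "NeTime": "Time-Involved",
}


def build_cge2(event_type, tagged_entities, snippet):
    """Fill generic '<Type>-Involved' slots purely from the NE types of the entities."""
    avm = {"Event-type": event_type}
    for text, ne_type in tagged_entities:
        slot = NE_TYPE_TO_SLOT.get(ne_type)
        if slot is None:
            continue  # NE type without a generic slot in this sketch
        values = avm.setdefault(slot, [])
        if text not in values:
            values.append(text)
    avm["Snippet-Involved"] = snippet
    return avm


acquisition = build_cge2(
    "CompanyAcquisition",
    [("AOL", "NeCompany"), ("Netscape", "NeCompany"),
     ("$4.2 billion", "NeMoney"), ("spring of 1999", "NeTime")],
    "AOL will buy Netscape for $4.2 billion in the spring of 1999, "
    "subject to regulatory and shareholder approval",
)
```

No rule here knows anything about acquisitions: the same function fills an ExecutiveChange AVM from person and company NEs, which is what keeps the domain-dependent work limited to event tagging.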
In general, event AVMs do not have to be defined as elaborate templates similar to the MUC Scenario Template (ST). In other words, a lightweight event tagging module and the techniques described above will approximate the value of a very elaborate PE AVM or MUC ST, without the necessity of developing specific event attribute mapping rules to fill the slots. More importantly, in most applications of event extraction that we have examined, the lightweight event detection plus generic structures of event details as presented above in the CGE-II AVMs are sufficient. This makes rapid domain porting possible for event extraction, which is recognized as the most challenging IE task. When domain-dependent work is minimized to event tagging (like a context-based verb categorizing job) supported by the structural context provider Semantic Parser and the Lexicon Grammar tool, the time and resources required for this task are reduced significantly. The knowledge bottleneck of requiring significant amounts of skilled labor for writing full-fledged Pre-defined Event grammars has been overcome.
[1] ‘Real’ in the sense that the NE object is always uniquely represented by a pair of character offsets: <begin-offset, end-offset> and the CE object uniquely represented by a pair of NEs, or an NE and another token string represented by <begin-offset, end-offset>.
Intermediate-Level Event Extraction for Temporal and Spatial Analysis and Visualization
(SBIR Phase 2)
Wei Li, Ph.D., Principal Investigator
Rohini K. Srihari, Ph.D., Co-Principal
Contract No. F30602-01-C-0035
September 2003