
Pre-Knowledge-Graph Archives: Defining Information Extraction Tasks from Shallow to Deep

2015-10-30 22:54 | Personal category: Liwei's Popular Science (立委科普) | System category: Paper Exchange | Tags: information extraction, knowledge graph

IE Space: Spectrum of Tasks

Based on years of IE study at Cymfony, the overall picture of the IE space is fairly clear now; a relatively complete spectrum of IE tasks and their relationships are illustrated in Figures 1 and 2, placing them in context.

Figure 2. Cymfony IE Hierarchy Based on Extraction Scope

The starting point (the left-most box) is the extraction of NE, which is the foundation for more advanced IE. The ultimate goal of deep-level IE is to extract related key events into scenarios (the right-most box) that represent the major content of the processed documents.

Figure 3: Cymfony IE Hierarchy Based on Representation Type

The proposed IE hierarchy is presented as a three-dimensional division: (i) the types of targeted information objects for extraction: entity-centric information is captured by NEs and their correlated relationships in individual CE (individual, local CE relationship) and Entity Profile (a set of relationships about an entity), and event-centric information is captured by either GE or PE; (ii) the scope of extraction: local extraction within a sentence, discourse extraction within a document and global extraction within an archive; (iii) the types of IE representation: keyword-based representation or concept-based representation. Two figures, Figure 1 and Figure 2, are used in order to present a clear and accurate picture of the three-dimensional IE hierarchy in two-dimensional form.

Figure 2 divides the IE tasks into two parts: the left side, which includes NE, CE, GE and PE (and part of Profile), represents keyword-based IE, while the right side, which includes C-Profile, C-GE, C-PE and C-ES (and part of Profile), belongs to concept-based IE. The key distinction between the two parts is in whether the appropriate values to fill the attribute slots for AVM objects are actual strings (real objects) extracted from the processed documents or virtual objects. Virtual objects include normalized forms (e.g. normalized time), disambiguated forms (e.g. disambiguated verb), and conceptualized references to entity profiles (e.g. involved participants of events) and/or events (e.g. event linkage in a scenario).

One interesting information object is the Entity Profile, which seems to show a dual identity. As shown in Figure 2, Entity Profile sits conceptually between keyword-based representation and concept-based representation. In fact, it should be regarded as still in transition from the keyword-based representation of an entity (NE/CE) to a fully conceptualized representation of an entity (C-Profile). On one hand, in the Entity Profile AVM, the appropriate values for its correlated entities are no longer NEs, but references to other Entity Profiles. For example, the attribute slot affiliation for <PersonProfile 001> as shown before is no longer the NE string Du Pont Co., but <OrgProfile 020>. In this sense, the representation is already concept-based. On the other hand, some other attribute slots, such as age and birth-place as shown in <PersonProfile 001>, still use extracted strings as appropriate values, so the profile retains an element of keyword-based representation. It is assumed that the remaining keyword-based values will be fully normalized or disambiguated when the transition from Profile to C-Profile is completed. This process requires new capabilities such as location normalization and word sense disambiguation. Research and development of these capabilities is one focus of this project.
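The mixed status of the Entity Profile can be pictured with a small data structure. The following is only an illustrative sketch, not the InfoXtract implementation: the class names `ProfileRef` and `EntityProfile`, the helper methods, and the placeholder slot strings are all assumptions made for this example.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Union


@dataclass(frozen=True)
class ProfileRef:
    """A virtual object: a reference to another Entity Profile, e.g. <OrgProfile 020>."""
    profile_type: str
    profile_id: int


# A slot value is either an extracted string (keyword-based)
# or a reference to a profile (concept-based).
SlotValue = Union[str, ProfileRef]


@dataclass
class EntityProfile:
    """Hypothetical AVM for an entity; slot names follow the text above."""
    profile_type: str
    profile_id: int
    slots: Dict[str, SlotValue] = field(default_factory=dict)

    def keyword_slots(self) -> List[str]:
        """Names of slots that still hold raw extracted strings, in sorted order."""
        return sorted(k for k, v in self.slots.items() if isinstance(v, str))

    def is_fully_conceptualized(self) -> bool:
        """True only when no slot holds a raw string any more (the C-Profile state)."""
        return not self.keyword_slots()


person = EntityProfile(
    "PersonProfile",
    1,
    {
        # No longer the NE string "Du Pont Co.", but a profile reference.
        "affiliation": ProfileRef("OrgProfile", 20),
        # Placeholder strings: the actual extracted values are not given here.
        "age": "(extracted age string)",
        "birth-place": "(extracted location string)",
    },
)

print(person.keyword_slots())
print(person.is_fully_conceptualized())
```

In this sketch a profile becomes a C-Profile once `keyword_slots()` is empty, which mirrors the assumption that the remaining strings are eventually normalized or disambiguated.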

The text below gives descriptions of the tasks shown in the above figures. NE is an isolated information object. In contrast, relationships and events are captured by three major types of information objects, namely entity-centric objects (CE and Entity Profile) for relationships, and GE and PE for events. Depending on the level of processing depth, CE, GE and PE can be divided into five major stages or sub-tasks.

Concept-based Information Extraction

Keyword-based IE relies on strings of words extracted from the text to fill template slots. Concept-based IE aims at transforming information objects from keyword-based representations into concept-based representations, hence C-CE, C-GE and C-PE. This is essentially a disambiguation and normalization process: entities (people, organizations, etc.) are represented by entity profiles, not just keyword strings; times and locations are normalized so they appear in a standard form; and the event relation is represented by a verb concept, not just a verb keyword. This provides a better basis for information consolidation in discourse, across documents or across languages. The results appear as follows:

<C-Profile>

name: {He: PersonProfile}

affiliation: {Cymfony: OrgProfile}

position: {a research scientist}

<C-PE>

predicate: {appoint: ExecutiveChange}

person-in: {John Smith: PersonProfile}

position: {CEO}

company: {this start-up: OrgProfile}

<C-GE>

keyverb: {discover, find: DISCOVER}

who: {Julian Hill: PersonProfile}

whom-what: {nylon}

when: {1930s: 1930-1939}

<C-GE>

keyverb: {die, pass_away: DIE}

who: {Julian Hill: PersonProfile}

when: {Sunday: 1996-01-07}

where: {Hockessin, Del.: LocationProfile}
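Two of the mappings in the examples above, verb keyword to verb concept and decade expression to year range, are simple enough to sketch in code. This is a minimal illustration under stated assumptions: the `VERB_CONCEPTS` lexicon contains only the entries visible in the examples, and the function names are hypothetical. Resolving an expression like "Sunday" would additionally need the document date, so it is deliberately left out.

```python
import re
from typing import Optional, Tuple

# Hypothetical verb-concept lexicon, limited to the keywords shown in the examples.
VERB_CONCEPTS = {
    "discover": "DISCOVER",
    "find": "DISCOVER",
    "die": "DIE",
    "pass_away": "DIE",
}


def verb_concept(keyword: str) -> Optional[str]:
    """Return the concept for a verb keyword, or None if the lexicon has no entry."""
    return VERB_CONCEPTS.get(keyword.lower())


def normalize_decade(expr: str) -> Optional[Tuple[int, int]]:
    """Map a decade expression such as '1930s' to an inclusive (start, end) year range.

    Returns None for anything that is not a four-digit decade expression.
    """
    match = re.fullmatch(r"(\d{3}0)s", expr.strip())
    if match is None:
        return None
    start = int(match.group(1))
    return (start, start + 9)


print(verb_concept("find"))
print(normalize_decade("1930s"))
```

Returning `None` rather than guessing keeps the keyword-based value usable when no concept-level knowledge is available.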

As reviewed previously, the area of IE has been fundamentally influenced by the MUC program. On the positive side, the MUC program helped to define this area as one important direction of research for applied NLP, which has great potential in various applications. However, there seems also to be a negative effect due to insufficient reflection of semantic representation in MUC-defined IE tasks. Even in the definition of Scenario Template which involves a very elaborate AVM, it is only semi-semantic: the structures embodied in attributes represent logical relations of the modeled event scenario, but the values used to fill these attribute slots are mainly keyword-based and not yet fully conceptualized.[1]

In the initial stage of IE, the keyword-based representation actually helped to make the IE tasks more tangible.[2] As the research moves gradually from shallow IE to intermediate and deep IE, and from local processing towards discourse and global processing, the limitation of the keyword-based representation becomes obvious: it makes information consolidation such as merging and linking difficult because there is no common logical basis for consolidation. This is the rationale behind the Cymfony proposal of a necessary process of conceptualization for IE representation.

Between traditional MUC keyword-based representations and the proposed concept-based representations, the picture is fairly clear: the best approach to the representation issue seems to be to take two steps. Step one: parse the natural language text into information objects in keyword-based representation. Step two: treat the keyword-based representation as an interim representation and gradually map it to concept-based representation when the knowledge is available for such a mapping. This is exactly what has been happening in the Cymfony IE effort. This two-step approach not only makes the NLP/IE task more tangible but also has the flexibility of delivering information objects in mixed representation, so that some key attributes are conceptualized first while other attribute slots still retain keywords (real tokens) as legitimate values. It has the added benefit of delivering both keyword-based local objects and concept-based virtual objects for storage and various IE applications (see Footnote 23 for the rationale on the need for this capability).
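The second step of this approach, mapping slots to concepts only where knowledge is available, can be sketched as follows. This is an assumption-laden illustration rather than Cymfony's code: the function `conceptualize`, the `Mapper` type, and the one-entry verb lexicon are hypothetical, and the slot values are taken from the C-GE example earlier in the text.

```python
from typing import Callable, Dict, Optional

# A mapper returns a concept-based value, or None when it lacks the knowledge to decide.
Mapper = Callable[[str], Optional[str]]


def conceptualize(
    keyword_object: Dict[str, str],
    mappers: Dict[str, Mapper],
) -> Dict[str, Dict[str, Optional[str]]]:
    """Step two: attach a concept where a mapper succeeds; always keep the keyword."""
    result: Dict[str, Dict[str, Optional[str]]] = {}
    for slot, keyword in keyword_object.items():
        mapper = mappers.get(slot)
        concept = mapper(keyword) if mapper is not None else None
        result[slot] = {"keyword": keyword, "concept": concept}
    return result


# Step one output (keyword-based), using the values of the GE example in the text.
ge = {"keyverb": "discover", "who": "Julian Hill", "whom-what": "nylon", "when": "1930s"}

# Hypothetical knowledge available so far: only a tiny verb lexicon.
verb_lexicon = {"discover": "DISCOVER", "find": "DISCOVER"}
mappers = {"keyverb": verb_lexicon.get}

mixed = conceptualize(ge, mappers)
for slot, value in mixed.items():
    print(slot, value)
```

Keeping the keyword next to the concept in every slot is one way to deliver both the keyword-based local object and the concept-based virtual object from the same structure; slots without a mapper simply stay keyword-only until more knowledge is added.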

[1] To be fair to the MUC designers, the use of TE (Template Element) as values for event participant slots for ST marked the MUC effort towards conceptualization in IE representation. All other attributes require an extracted sub-string of real tokens from the processed documents as appropriate values. In addition, the TE template, unlike the Cymfony profile object, is itself keyword-based in its representation.

[2] Traditional research in computational linguistics and NLP seems to have over-emphasized fully semantic and fully logical representation as the target for all text processing systems. The negative effects are obvious: (i) most NLP systems pursuing this target too literally end up unable to scale up, since the target is too ambitious; (ii) by pursuing the ambitious target, many research programs have overlooked the fact that, once properly placed in pre-defined structures such as AVMs, natural language itself, or more precisely, selected parts of natural language, can also be used in representation (so-called keyword-based representation). MUC has verified the possibility of keyword-based representation and helped to correct these problems.

The Cymfony proposed IE hierarchy (Figure 1 and Figure 2) is a result of a series of SBIR efforts sponsored by Air Force Research Laboratory's Information Directorate (AFRL/IF). It originates from the tradition established by AFRL/IF which has been defining, promoting and refining an IE hierarchy from shallow to deep extraction as its long-term strategy in this space ever since 1993, through its IAA program and its SBIR contractors. The rationale behind this tradition was to define a set of more realistic IE tasks/templates than the MUC standards. The Cymfony hierarchy is a natural development following this tradition.

It is interesting to compare the Cymfony IE hierarchy with the traditional MUC hierarchy. Figure 4 uses the Cymfony IE Hierarchy Based on Extraction Scope as an example to demonstrate the comparison with the MUC IE hierarchy.

Figure 5. IE Hierarchy Comparison

The final goals of the Cymfony-defined IE hierarchy and MUC-defined IE hierarchy are actually the same, namely organizing the extracted information in some type of scenario of events. However, Cymfony’s research has led to an understanding of a vast range of tasks between the shallow level IE and the deep level IE: many are missing links in the MUC program and they include those boxes with no corresponding MUC counterparts as well as the entire list of Concept-based IE components. Correspondingly, the Cymfony approach to IE is to move step by step from shallow IE to intermediate IE and ultimately to deep IE. In other words, Cymfony has not abandoned the final goal of the MUC program for scenario extraction; it believes in progressing steadily towards that goal. The disappointment in the IE community arising from the poor performance for the MUC ST task is no accident: there are too many missing links to fill before such a sophisticated task can be handled in a feasible way.

Table 2 gives a concise comparison of the major IE tasks with different levels of difficulty. The tasks defined in the Cymfony InfoXtract design are contrasted with those defined in MUC.

Table 2. Comparison of Major IE Tasks

InfoXtract IE tasks	MUC IE tasks	level	rank	Remark
NE: Named Entity	MUC NE: Named Entity	shallow	1	domain independent and predefined; more NE types/sub-types defined in InfoXtract than in MUC
	MUC TE: Template Element	shallow-intermediate	2	domain independent and predefined (MUC TE integrated into CE in InfoXtract)
CE/Profile: Correlated Entity	MUC TR: Template Relation	intermediate	3	domain independent and predefined; more relationships defined in InfoXtract CE than in MUC TR
GE: General Event		intermediate	4	domain independent and open-ended; GE is more semantic than IAA SE: the GE logical form embodies semantic relations instead of SE syntactic relations
PE: Predefined Event		intermediate	5	domain dependent and predefined; PE is simpler than MUC ST in structure and granularity
C-GE: Concept-based General Event		intermediate-deep	6	domain independent and open-ended; C-GE is more semantic than keyword-based GE
C-PE: Concept-based Predefined Event		intermediate-deep	7	domain dependent and predefined; C-PE is more semantic than keyword-based PE
S-GE: Scenario for General Event		deep	8	domain independent and open-ended; S-GE places related, individual GEs into a scenario
S-PE: Scenario for Predefined Event	MUC ST: Scenario Template	deep	9	domain dependent and predefined; S-PE is very close to the goal of MUC ST

The multi-level organization of IE tasks allows for a modular, monotonic approach to IE where shallow level extraction (e.g. NE), intermediate level extraction (e.g. CE, GE and PE) and deep level extraction (S-GE, S-PE) can be structured into a multi-module hierarchical IE system.
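One way to read "modular, monotonic" is that each module consumes the layers produced before it and only adds a new layer, never rewriting an earlier one. The sketch below illustrates that reading only; `run_pipeline` and the two toy modules are hypothetical placeholders and bear no relation to the actual InfoXtract extraction logic.

```python
from typing import Callable, Dict, List, Tuple

Store = Dict[str, List[str]]
Module = Callable[[str, Store], List[str]]


def run_pipeline(text: str, modules: List[Tuple[str, Module]]) -> Store:
    """Run modules in order; each sees earlier layers and contributes exactly one new layer."""
    store: Store = {}
    for name, module in modules:
        if name in store:
            raise ValueError(f"layer {name!r} was already produced")
        # Pass a shallow copy so a module cannot add or drop layers in the shared store.
        store[name] = module(text, dict(store))
    return store


def ne_module(text: str, store: Store) -> List[str]:
    # Placeholder: treat capitalized tokens as named entities.
    return [tok for tok in text.split() if tok[:1].isupper()]


def ce_module(text: str, store: Store) -> List[str]:
    # Placeholder: pair adjacent NEs found by the shallow NE layer.
    nes = store["NE"]
    return [f"{a}~{b}" for a, b in zip(nes, nes[1:])]


result = run_pipeline("Smith joined Cymfony", [("NE", ne_module), ("CE", ce_module)])
print(result)
```

Deeper layers such as GE, PE or scenario modules would be appended to the module list in the same way, each depending only on what shallower layers have already produced.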

Intermediate-Level Event Extraction for Temporal and Spatial Analysis and Visualization

(SBIR Phase 2)

Wei Li, Ph.D., Principal Investigator

Rohini K. Srihari, Ph.D., Co-Principal

Contract No. F30602-01-C-0035

September 2003

[Related]

Pre-Knowledge-Graph Archives: On Event Extraction 2015-10-30

SVO as General Events 2015-10-25

Pre-Knowledge-Graph Profile Extraction Research via SBIR 2015-10-24

Forerunners of the Knowledge Graph: Starting from Julian Hill 2015-10-24