January 2000
Flexible Information Extraction Learning Algorithm
Contract No. F30602-99-C-0102
Wei Li, Principal Investigator
Rohini K. Srihari, Ph.D., Co-Principal Investigator
Cymfony has developed a prototype, Textract 2.0, for CE extraction on a Windows 95/98/NT platform. Considerable progress has been made during the Phase I effort in implementing this hybrid CE prototype.
A hybrid CE prototype is the first goal of this Phase I project. It is designed to extract multiple pre-defined relationships between entities, building on the existing NE system Textract 1.0. The results will be used to support intelligent browsing/threading, in the sense that a user has access to more information about each identified entity and can jump freely between related entities. This represents a giant step forward from existing deployed IE systems such as NetOwl and IdentiFinder [MUC-7 1998], which mainly output isolated named entities.
The CE prototype developed in Phase I, namely Textract 2.0/CE, is able to detect/extract the following relationships:
for PERSON:
name: including aliases
title: e.g. Mr.; Prof.; etc.
subtype: e.g. MILITARY; RELIGIOUS; etc.
age:
gender: e.g. MALE; FEMALE
affiliation: reverse relationship of staff
position:
where_from:
address:
phone:
descriptors:
for ORGANIZATION:
name: including aliases
subtype: e.g. COMPANY; SCHOOL; etc.
location:
staff: reverse relationship of affiliation
address:
phone:
www:
descriptors:
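The slot inventory above can be pictured as two record types. The following is only an illustrative sketch: the class names, field types, and the link_affiliation helper are assumptions made for this example and are not taken from Textract. It mainly shows how affiliation and staff can be kept as reverse relationships of each other.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class PersonTemplate:
    """Hypothetical container for the PERSON slots listed above."""
    name: List[str] = field(default_factory=list)         # name plus aliases
    title: Optional[str] = None                           # e.g. "Mr."
    subtype: Optional[str] = None                         # e.g. "MILITARY"
    age: Optional[str] = None
    gender: Optional[str] = None                          # "MALE" or "FEMALE"
    affiliation: List[str] = field(default_factory=list)  # reverse of staff
    position: Optional[str] = None
    where_from: Optional[str] = None
    address: Optional[str] = None
    phone: Optional[str] = None
    descriptors: List[str] = field(default_factory=list)


@dataclass
class OrganizationTemplate:
    """Hypothetical container for the ORGANIZATION slots listed above."""
    name: List[str] = field(default_factory=list)         # name plus aliases
    subtype: Optional[str] = None                         # e.g. "COMPANY"
    location: Optional[str] = None
    staff: List[str] = field(default_factory=list)        # reverse of affiliation
    address: Optional[str] = None
    phone: Optional[str] = None
    www: Optional[str] = None
    descriptors: List[str] = field(default_factory=list)


def link_affiliation(person: PersonTemplate, org: OrganizationTemplate) -> None:
    """Fill affiliation and its reverse relationship staff in one step."""
    if org.name[0] not in person.affiliation:
        person.affiliation.append(org.name[0])
    if person.name[0] not in org.staff:
        org.staff.append(person.name[0])


person = PersonTemplate(name=["Owen Bieber"], position="spokesman")
org = OrganizationTemplate(name=["UAW"])
link_affiliation(person, org)
```

Filling both directions together is one way to support the jump between related entities described earlier; other designs (for example, deriving staff on demand) would serve equally well.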
Efforts are being made to increase the coverage both in manual CE rules and CE training. CE relationships about product and named entity will also be targeted.
2.4.1 CE Prototype Architecture
Figure 2 illustrates the architecture of the Textract 2.0/CE prototype.
Figure 2: Textract 2.0/CE Prototype Architecture
As shown in this CE architecture, the fundamental linguistic support comes from shallow parsing, which detects basic linguistic chunks such as basic noun phrases, verb groups, etc. Although sentence-level full parsing is found to be necessary for extracting general events (with argument structure at the core), it has been suggested, and verified in the CE prototyping practice, that shallow parsing creates reliable conditions for local CE extraction as well as for co-referencing nominal strings. As Yangarber & Grishman [1998] comment, "This style of text analysis, - as contrasted with full syntactic parsing,- has gained the wider popularity due to limitations on the accuracy of full syntactic parsers, and the adequacy of partial, semantically-constrained, parsing for this task."
In the CE component, note that the CE task actually consists of two modules: Local CE and Merging. The former only extracts CE information within sentence boundaries, based on the results of shallow parsing. It is the task of Merging that unifies the locally extracted CE information into coherent information objects in the wider context of discourse. This is achieved with the support of the co-reference module (CO). For example, given the sentence He was 91, Local CE will extract 91 as a filler of the slot age for the entity denoted by He. But the person to whom He refers in this context is left to CO to resolve.
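The division of labor between Local CE, CO, and Merging can be sketched as follows. This is a minimal illustration under stated assumptions, not the Textract implementation: the mention and entity identifiers, the tuple layout, and the function name merge_local_ce are all hypothetical, and the antecedent name is borrowed from an example name used elsewhere in this report.

```python
from typing import Dict, List, Tuple

# One locally extracted fact: (mention id, slot name, filler).
LocalFact = Tuple[str, str, str]


def merge_local_ce(local_facts: List[LocalFact],
                   coref: Dict[str, str]) -> Dict[str, Dict[str, List[str]]]:
    """Unify sentence-level CE facts into one information object per entity.

    coref maps a mention id to the entity id chosen by the CO module
    (hypothetical interface). Mentions that CO did not resolve are kept
    as entities of their own.
    """
    merged: Dict[str, Dict[str, List[str]]] = {}
    for mention, slot, filler in local_facts:
        entity = coref.get(mention, mention)
        fillers = merged.setdefault(entity, {}).setdefault(slot, [])
        if filler not in fillers:          # avoid duplicate fillers
            fillers.append(filler)
    return merged


# Local CE output for two sentences; "m2" is the pronoun He in "He was 91".
facts = [("m1", "name", "John Smith"), ("m2", "age", "91")]
# CO output (assumed): both mentions refer to the same entity "e1".
links = {"m1": "e1", "m2": "e1"}
merged = merge_local_ce(facts, links)
# merged == {"e1": {"name": ["John Smith"], "age": ["91"]}}
```

Keeping the co-reference decision outside the local extraction step mirrors the modular design described above: Local CE never needs to look beyond the sentence boundary.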
The implementation of this prototype has involved several phases which are discussed below.
2.4.2 Semantic Word Clustering
The first issue to be discussed is why semantic lexical support is needed for local CE extraction. A given CE relationship is often expressed by specific key words (e.g. the verbs work, hire, and/or the prepositions for, by, from) between two entities. However, the conditions on the entity units themselves are usually based on semantic classification rather than on specific words. Otherwise, there will be little generality in the CE rules, whether hand-coded or machine-learned, and they are bound to suffer from the sparse data problem. For proper names, this semantic classification may come from the preceding NE module. But similar semantic classification is also needed for the non-NE, common-noun entities that a rule may involve. For example, Bill Clinton, John Smith, etc. are tagged as NEs of type person by the NE module. However, common nouns like gentleman, guy, girl, lady, etc. should also be classified as being of the person type. Otherwise, local CE rules for the affiliation (staff) relationship like <PERSON works for ORGANIZATION> would only cover a fraction of the intended phenomena. Such a rule cannot cover cases like This gentleman works for Cymfony unless the system has access to the information that gentleman is PERSON.
There are two ways to obtain this information: from lexical resources that classify such words as person, or from the co-reference link between This gentleman and its NE antecedent across sentence boundaries. But the co-reference link cannot be established reliably without the lexical information that gentleman is PERSON. Besides, for the sake of modularity, Local CE and CO should be two distinct modules, and it is not wise to bring discourse analysis into the stage of local CE extraction. Therefore, the technique proposed here is to rely on support from on-line lexical resources. In fact, reliable word classification information not only supports CE; it can support NE and GE as well.
Such word class information is usually assumed to come from some semantic lexicon. Originally, it was expected that WordNet [Beckwith 1991] would provide this lexical support, as it is widely acknowledged to be a reliable (and free!) lexical resource with a sophisticated semantic classification scheme. But it was soon found that it contains too much noise to be usable for IE purposes. WordNet tries to cover as many senses (synsets, in its terminology) as possible for each word, no matter how rare a particular sense actually is. For example, the words dog, cat, bird are all tagged as PERSON (in addition to other synsets). It was realized that a word classification or clustering algorithm needs to be designed and implemented in order to obtain statistically meaningful semantic word class information. This work will take place in the next phase.
For the time being, in order to enable the CE prototype implementation (as well as to support question parsing) and to examine the feasibility of CE grammar development and CE rule learning, a hand-crafted FST grammar has been developed which classifies some of the more frequently used words into classes.
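The role of such word classes can be shown with a small lookup sketch. This is an assumption-laden illustration only: a plain dictionary stands in for the hand-crafted FST grammar, and the names WORD_CLASSES and semantic_class are hypothetical. The word list is taken from the examples in this section.

```python
from typing import Dict, Optional

# Hypothetical stand-in for the hand-crafted word classification grammar:
# frequently used common nouns mapped to a semantic class.
WORD_CLASSES: Dict[str, str] = {
    "gentleman": "PERSON",
    "guy": "PERSON",
    "girl": "PERSON",
    "lady": "PERSON",
}


def semantic_class(head_word: str, ne_tag: Optional[str] = None) -> Optional[str]:
    """Return the semantic class that a CE rule condition can test.

    Proper names get their class from the NE module (ne_tag); common
    nouns fall back to the word class lexicon. Unknown words yield None,
    so a rule such as <PERSON works for ORGANIZATION> will not fire.
    """
    if ne_tag is not None:
        return ne_tag
    return WORD_CLASSES.get(head_word.lower())


semantic_class("Bill Clinton", ne_tag="PERSON")   # class supplied by NE
semantic_class("gentleman")                        # class supplied by the lexicon
```

With this lookup in place, This gentleman works for Cymfony satisfies the same PERSON condition as a sentence with a proper name, without involving the CO module.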
2.4.3 Development of Local CE Grammars
The Local CE module is itself a hybrid component. It is comprised of an FST Model (pattern matcher using hand-coded CE grammars) as a preprocessor and a CE Learned Model (established via symbolic rule induction).
In the preceding SBIR efforts, Cymfony had made great progress in the development of local CE grammars, using the Textract FST toolkit. It was found that in order to capture CE relationships, multi-level grammars need to be developed based on different levels of structures. Some examples and the corresponding sample rules are cited below to demonstrate this point.
Three levels of structure have been identified as required, corresponding to three levels of CE FST models, namely CE1, CE2 and CE3. They are all organized in a pipeline architecture together with the other text processing components of the system, as shown in Figure 3.
Figure 3: Multi-level CE Grammar
Each level of the CE FST Model is supported by the corresponding CE grammar (compiled into FST for run-time application). The following is a pattern rule in the Textract CE1 grammar for the CE relationships affiliation and position.
Sample CE1 Rules:
0|NE(ORGANIZATION) 1|N(position_w) 2|NE(PERSON)
==> 2: affiliation=0 position=1
0|NE(LOCATION) 1|-based 2|NE(ORGANIZATION)
==> 2: location=0
The first rule links a person NE with an organization NE through the affiliation relationship; it also extracts a position word (position_w) like spokesman, chairman, secretary, researcher, salesman, etc. as the filler of the CE slot (feature) position for the person NE. This rule covers cases like UAW spokesman Owen Bieber. The second rule works for cases like Buffalo-based Cymfony. The output of the corresponding local CE templates is shown below:
name: Owen Bieber
position: spokesman
affiliation: UAW
name: Cymfony
location: Buffalo
For this type of very local phenomenon, parsing is not helpful. In fact, a shallow parser would group both UAW spokesman Owen Bieber and Seattle-based Microsoft as basic NPs (noun phrases). As a result, the units inside the phrases would no longer be accessible for extracting the relationships. Therefore, the proper structural basis for CE1 is the NE results alone (i.e. the linear token string remains linear, except for multi-token NEs, which have been combined into a single structure).
The CE2 model is based on the results of the first stage of shallow parsing (i.e. Shallow Parser 1). Shallow Parser 1 aims at grouping together some of the very basic linguistic units, namely BaseNP (Basic Noun Phrase) and VG (Verb Group). This creates a basic structural base for capturing some further CE relationships. The following is a pattern rule in a CE2 grammar for the CE relationships affiliation and position.
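The two sample CE1 rules can be approximated by a simple window matcher over the NE-tagged token string. This is a minimal sketch under assumptions: the real rules are compiled into FSTs by the Textract toolkit, whereas here the Token type, the predicate helpers, and apply_ce1 are hypothetical names, and matching is done by a plain sliding window.

```python
from typing import Callable, Dict, List, NamedTuple, Optional, Tuple


class Token(NamedTuple):
    text: str
    kind: str                     # "NE", "N", or "TOK" (plain token)
    label: Optional[str] = None   # NE type or word class, if any


Test = Callable[[Token], bool]


def ne(label: str) -> Test:
    return lambda t: t.kind == "NE" and t.label == label


def noun(label: str) -> Test:
    return lambda t: t.kind == "N" and t.label == label


def literal(text: str) -> Test:
    return lambda t: t.text == text


# Each rule: (pattern, index of the target entity, {slot: index of filler}).
Rule = Tuple[List[Test], int, Dict[str, int]]
CE1_RULES: List[Rule] = [
    ([ne("ORGANIZATION"), noun("position_w"), ne("PERSON")],
     2, {"affiliation": 0, "position": 1}),
    ([ne("LOCATION"), literal("-based"), ne("ORGANIZATION")],
     2, {"location": 0}),
]


def apply_ce1(tokens: List[Token]) -> List[Dict[str, str]]:
    """Slide every rule pattern over the token string and emit templates."""
    templates = []
    for pattern, target, slots in CE1_RULES:
        width = len(pattern)
        for start in range(len(tokens) - width + 1):
            window = tokens[start:start + width]
            if all(test(tok) for test, tok in zip(pattern, window)):
                template = {"name": window[target].text}
                for slot, index in slots.items():
                    template[slot] = window[index].text
                templates.append(template)
    return templates


tokens = [Token("UAW", "NE", "ORGANIZATION"),
          Token("spokesman", "N", "position_w"),
          Token("Owen Bieber", "NE", "PERSON")]
result = apply_ce1(tokens)
# result == [{"name": "Owen Bieber", "affiliation": "UAW", "position": "spokesman"}]
```

Because the input is still a flat token string, with only multi-token NEs combined, every unit the rule needs to inspect remains directly visible to the matcher.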
Sample CE2 Rule:
0|NP(PERSON) 1|COMMA 2|NP(position_w) 3|P(of/for/with/in/at) 4|NP(ORGANIZATION)
==> 0: affiliation=4 position=2
This rule will extract the relationships for cases like Robert Callahan, spokesman of Seattle-based Microsoft. Note that the CE relationship affiliation between Robert Callahan and Microsoft in the preceding example cannot be captured without the structural basis provided by Shallow Parser 1. This is because the organization NE serves as the head of an NP with a preceding modifier, Seattle-based; pattern matching has to jump over (ignore) such modifiers in order to find the related entity. This jump-over operation is difficult to realize when parsing is not available, because pre-modifiers can take various forms of various lengths (in theory, infinite length).
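The benefit of matching on chunks rather than tokens can be seen in a small sketch of the CE2 rule. It is illustrative only and rests on assumptions: the Chunk type, its fields, and apply_ce2 are hypothetical, and the chunk sequence is written by hand in place of real Shallow Parser 1 output.

```python
from typing import Dict, List, NamedTuple, Optional


class Chunk(NamedTuple):
    kind: str                   # "NP", "COMMA", "P", ...
    text: str                   # full surface string of the chunk
    head: str                   # head word (or NE) of the chunk
    sem: Optional[str] = None   # semantic class of the head, if any


PREPOSITIONS = {"of", "for", "with", "in", "at"}


def apply_ce2(chunks: List[Chunk]) -> List[Dict[str, str]]:
    """Match NP(PERSON) COMMA NP(position_w) P(...) NP(ORGANIZATION)."""
    results = []
    for i in range(len(chunks) - 4):
        person, comma, position, prep, org = chunks[i:i + 5]
        if (person.kind == "NP" and person.sem == "PERSON"
                and comma.kind == "COMMA"
                and position.kind == "NP" and position.sem == "position_w"
                and prep.kind == "P" and prep.head in PREPOSITIONS
                and org.kind == "NP" and org.sem == "ORGANIZATION"):
            # Only chunk heads are read, so pre-modifiers are skipped for free.
            results.append({"name": person.head,
                            "affiliation": org.head,
                            "position": position.head})
    return results


chunks = [Chunk("NP", "Robert Callahan", "Robert Callahan", "PERSON"),
          Chunk("COMMA", ",", ","),
          Chunk("NP", "spokesman", "spokesman", "position_w"),
          Chunk("P", "of", "of"),
          Chunk("NP", "Seattle-based Microsoft", "Microsoft", "ORGANIZATION")]
result = apply_ce2(chunks)
# result == [{"name": "Robert Callahan", "affiliation": "Microsoft", "position": "spokesman"}]
```

The modifier Seattle-based never has to be described in the pattern: the chunker has already folded it into the NP, and the rule tests only the head and its semantic class.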
The CE3 model is designed to handle CE relationships at the clause level. It requires support from the second stage of shallow parsing (i.e. Shallow Parser 2). Shallow Parser 2 aims at grouping together some further linguistic units: NP (e.g. NP's NP will be grouped into one bigger NP) and BasePP (Basic Prepositional Phrase, i.e. a preposition followed by a BaseNP). This creates a more sophisticated structural base for capturing CE relationships which span a longer distance in the sentence. It is especially suited for targeting CE relationships expressed by S-V-O (Subject-Verb-Object) structures. The following is a sample pattern rule in the Textract CE3 grammar for the CE relationship affiliation.
Sample CE3 Rule:
0|NP(PERSON) [Adv|PP]* 1|VG(work) 2|PP(for,ORGANIZATION)
==> 0: affiliation=2
This rule has been tested and shown to cover cases like Robert Jackson originally from Washington, D.C. has been working for Seattle-based Microsoft for almost a decade. Note that the rule specifies that any number of adverbs or PPs (prepositional phrases) between the subject NP and its predicate VG should be skipped. In the above example sentence, the adverb originally and the PP from Washington, D.C. are safely jumped over by the pattern, and the relationship between the subject NP Robert Jackson and the head of the PP object, Microsoft, can be successfully captured. This is only possible with the support of shallow parsing, which groups both NPs and PPs. The resulting CE template for the sample sentence is shown below.
name: Robert Jackson
affiliation: Microsoft
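The [Adv|PP]* part of the CE3 rule amounts to a skip loop between the subject and the verb group. The sketch below is a hedged illustration, not the compiled FST: the Chunk type, its fields (including the assumed lemmatized VG head), and apply_ce3 are hypothetical, and the chunk sequence stands in for Shallow Parser 2 output.

```python
from typing import Dict, List, NamedTuple, Optional


class Chunk(NamedTuple):
    kind: str                    # "NP", "VG", "PP", or "Adv"
    text: str                    # surface string of the chunk
    head: str                    # head word; for a VG, the verb lemma (assumed)
    sem: Optional[str] = None    # semantic class of the head, if any
    prep: Optional[str] = None   # preposition, for PP chunks only


def apply_ce3(chunks: List[Chunk]) -> List[Dict[str, str]]:
    """Match NP(PERSON) [Adv|PP]* VG(work) PP(for, ORGANIZATION)."""
    results = []
    for i, subject in enumerate(chunks):
        if not (subject.kind == "NP" and subject.sem == "PERSON"):
            continue
        j = i + 1
        # [Adv|PP]*: jump over any number of adverbs and PPs.
        while j < len(chunks) and chunks[j].kind in ("Adv", "PP"):
            j += 1
        if j + 1 >= len(chunks):
            continue
        verb, obj = chunks[j], chunks[j + 1]
        if (verb.kind == "VG" and verb.head == "work"
                and obj.kind == "PP" and obj.prep == "for"
                and obj.sem == "ORGANIZATION"):
            results.append({"name": subject.head, "affiliation": obj.head})
    return results


sentence = [
    Chunk("NP", "Robert Jackson", "Robert Jackson", "PERSON"),
    Chunk("Adv", "originally", "originally"),
    Chunk("PP", "from Washington, D.C.", "Washington, D.C.", "LOCATION", "from"),
    Chunk("VG", "has been working", "work"),
    Chunk("PP", "for Seattle-based Microsoft", "Microsoft", "ORGANIZATION", "for"),
    Chunk("PP", "for almost a decade", "decade", None, "for"),
]
result = apply_ce3(sentence)
# result == [{"name": "Robert Jackson", "affiliation": "Microsoft"}]
```

The trailing PP for almost a decade is not taken as an affiliation because its head carries no ORGANIZATION class, which again shows how the rule depends on semantic classification rather than on specific words.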
References
Abney, S.P. 1991. Parsing by Chunks. Principle-Based Parsing: Computation and Psycholinguistics, Robert C. Berwick, Steven P. Abney, Carol Tenny, eds. Kluwer Academic Publishers, Boston, MA, pp. 257-278.
Appelt, D.E. et al. 1995. SRI International FASTUS System MUC-6 Test Results and Analysis. Proceedings of MUC-6, Morgan Kaufmann Publishers, San Mateo, CA.
Beckwith, R. et al. 1991. WordNet: A Lexical Database Organized on Psycholinguistic Principles. Lexicons: Using On-line Resources to Build a Lexicon, Uri Zernik, editor, Lawrence Erlbaum, Hillsdale, NJ.
Bikel, D.M. et al. 1997. Nymble: a High-Performance Learning Name-finder. Proceedings of the Fifth Conference on Applied Natural Language Processing, Morgan Kaufmann Publishers, pp. 194-201.
Brill, E. 1995. Transformation-based Error-Driven Learning and Natural Language Processing: A Case Study in Part-of-Speech Tagging. Computational Linguistics, Vol. 21, No. 4, pp. 227-253.
Briscoe, T. & Waegner, N. 1992. Robust Stochastic Parsing Using the Inside-Outside Algorithm. Workshop Notes, Statistically-Based NLP Techniques, AAAI, pp. 30-53.
Charniak, E. 1994. Statistical Language Learning. MIT Press, Cambridge, MA.
Chiang, T-H., Lin, Y-C. & Su, K-Y. 1995. Robust Learning, Smoothing, and Parameter Tying on Syntactic Ambiguity Resolution. Computational Linguistics, Vol. 21, No. 3, pp. 321-344.
Chinchor, N. & Marsh, E. 1998. MUC-7 Information Extraction Task Definition (version 5.1). Proceedings of MUC-7.
Darroch, J.N. & Ratcliff, D. 1972. Generalized iterative scaling for log-linear models. The Annals of Mathematical Statistics, pp. 1470-1480.
Grishman, R. 1997. TIPSTER Architecture Design Document Version 2.3. Technical report, DARPA.
Hobbs, J.R. 1993. FASTUS: A System for Extracting Information from Text. Proceedings of the DARPA Workshop on Human Language Technology, Princeton, NJ, pp. 133-137.
Krupka, G.R. & Hausman, K. 1998. IsoQuest Inc.: Description of the NetOwl (TM) Extractor System as Used for MUC-7. Proceedings of MUC-7.
Lin, D. 1998. Automatic Retrieval and Clustering of Similar Words. Proceedings of COLING-ACL '98, Montreal, pp. 768-773.
Miller, S. et al. 1998. BBN: Description of the SIFT System as Used for MUC-7. Proceedings of MUC-7.
Mohri, M. 1997. Finite-State Transducers in Language and Speech Processing. Computational Linguistics, Vol. 23, No. 2, pp. 269-311.
Mooney, R.J. 1999. Symbolic Machine Learning for Natural Language Processing. Tutorial Notes, ACL '99.
MUC-7, 1998. Proceedings of the Seventh Message Understanding Conference (MUC-7), published on the website http://www.muc.saic.com/
Pine, C. 1996. Statement-of-Work (SOW) for The Intelligence Analyst Associate (IAA) Build 2. Contract for IAA Build 2, USAF, AFMC, Rome Laboratory.
Riloff, E. & Jones, R. 1999. Learning Dictionaries for Information Extraction by Multi-Level Bootstrapping. Proceedings of the Sixteenth National Conference on Artificial Intelligence (AAAI-99).
Rosenfeld, R. 1994. Adaptive Statistical Language Modeling. PhD thesis, Carnegie Mellon University.
Senellart, J. 1998. Locating Noun Phrases with Finite State Transducers. Proceedings of COLING-ACL '98, Montreal, pp. 1212-1219.
Silberztein, M. 1998. Tutorial Notes: Finite State Processing with INTEX. COLING-ACL '98, Montreal (also available at http://www.ladl.jussieu.fr).
Srihari, R. 1998. A Domain Independent Event Extraction Toolkit. AFRL-IF-RS-TR-1998-152 Final Technical Report, Air Force Research Laboratory, Information Directorate, Rome Research Site, New York.
Yangarber, R. & Grishman, R. 1998. NYU: Description of the Proteus/PET System as Used for MUC-7 ST. Proceedings of MUC-7.