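Sharing "Mirror Compendium" and "Morning Blossoms Gathered at Noon" http://blog.sciencenet.cn/u/liwei999 Once a Little Red Guard, later sent down to the countryside to "repair the earth"; left home for abroad in 1991, destination unknown.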


Pre-Knowledge-Graph Profile Extraction Research via SBIR (2)

Posted 2015-10-24 23:57 | Personal category: 立委科普 (Liwei's popular science) | System category: Paper exchange | Tags: information extraction, SBIR, retrospective

January 2000

Flexible Information Extraction Learning Algorithm 

Contract No. F30602-99-C-0102


Wei Li, Principal Investigator

Rohini K. Srihari, Ph.D., Co-Principal Investigator

2.4. CE Prototyping 

Cymfony has developed a prototype, Textract 2.0, for CE extraction on the Windows 95/98/NT platform.  Considerable progress has been made during the Phase I effort in implementing this hybrid CE prototype.


A hybrid CE prototype is the first goal of this Phase I project.  It is designed to extract multiple pre-defined relationships between entities, building on the existing NE system Textract 1.0.  The results will be used to support intelligent browsing/threading, in the sense that a user has access to more information about each identified entity and can jump freely between related entities.  This represents a significant step forward from existing deployed IE systems such as NetOwl and IdentiFinder [MUC-7 1998], which mainly output isolated named entities.


The CE prototype developed in Phase I, namely Textract 2.0/CE, is able to detect/extract the following relationships.

For PERSON entities:

  • name:        including aliases
  • title:       e.g. Mr.; Prof.; etc.
  • subtype:     e.g. MILITARY; RELIGIOUS; etc.
  • age:
  • gender:      e.g. MALE; FEMALE
  • affiliation: reverse relationship of staff
  • position:
  • where_from:
  • address:
  • phone:
  • descriptors:

For ORGANIZATION entities:

  • name:        including aliases
  • subtype:     e.g. COMPANY; SCHOOL; etc.
  • location:
  • staff:       reverse relationship of affiliation
  • address:
  • phone:
  • www:
  • descriptors:

Efforts are being made to increase the coverage of both the manual CE rules and the CE training.  CE relationships for products and other named entity types will also be targeted.

2.4.1  CE Prototype Architecture

Figure 2 illustrates the architecture of the Textract 2.0/CE prototype.


Figure 2: Textract 2.0/CE Prototype Architecture


As shown in this CE architecture, the fundamental linguistic support comes from shallow parsing, which detects basic linguistic chunks such as basic noun phrases, verb groups, etc.  Although sentence-level full parsing is found to be necessary for extracting general events (with argument structure at the core), the CE prototyping practice has suggested and verified that shallow parsing creates reliable conditions for local CE extraction as well as for co-referencing nominal strings.  As Yangarber & Grishman [1998] comment, "This style of text analysis, as contrasted with full syntactic parsing, has gained the wider popularity due to limitations on the accuracy of full syntactic parsers, and the adequacy of partial, semantically-constrained, parsing for this task."


Note that the CE task actually consists of two modules: Local CE and Merging.  Local CE only extracts CE information within sentence boundaries, based on the results of shallow parsing.  Merging then unifies the locally extracted CE information into coherent information objects in the wider context of the discourse.  This is achieved with the support of the co-reference module (CO).  For example, given the sentence "He was 91", Local CE will extract "91" as a filler of the slot age for the entity denoted by "He".  But the person to whom "He" refers in this context is left to CO to resolve.
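The division of labor between Local CE, CO and Merging can be illustrated with a minimal sketch.  The data layout below (mention ids, fact triples, co-reference chains) and the function name `merge` are assumptions made for illustration, not the Textract data structures; the antecedent "John Smith" is a hypothetical mention from an earlier sentence.

```python
from collections import defaultdict

# Mentions found in the text: id -> (surface string, is it a named entity?)
# "John Smith" is a hypothetical antecedent from a preceding sentence.
mentions = {0: ("John Smith", True), 1: ("He", False)}

# Output of Local CE for "He was 91": (mention id, slot, filler)
local_facts = [(1, "age", "91")]

# Output of CO: each chain lists the mention ids referring to one entity.
# Assumption: every mention belongs to exactly one chain.
coref_chains = [[0, 1]]


def merge(mentions, local_facts, coref_chains):
    """Unify sentence-level CE facts into one template per entity."""
    chain_of = {m: i for i, chain in enumerate(coref_chains) for m in chain}
    templates = defaultdict(lambda: defaultdict(set))
    # Named mentions in a chain supply the name slot (including aliases).
    for i, chain in enumerate(coref_chains):
        for m in chain:
            text, is_ne = mentions[m]
            if is_ne:
                templates[i]["name"].add(text)
    # Each local fact is attached to the entity its mention refers to.
    for mention_id, slot, filler in local_facts:
        templates[chain_of[mention_id]][slot].add(filler)
    return {i: {s: sorted(v) for s, v in t.items()}
            for i, t in templates.items()}


profiles = merge(mentions, local_facts, coref_chains)
print(profiles)
```

In this sketch Local CE never needs to know who "He" is; the age filler reaches the right profile only through the chain supplied by CO.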


The implementation of this prototype has involved several phases which are discussed below.  

2.4.2  Semantic Word Clustering


The first issue to be discussed is why semantic lexical support is needed for local CE extraction.  A given CE relationship is often expressed by specific key words (e.g. the verbs work, hire, and/or the prepositions for, by, from) between two entities.  However, the conditions on the entity units themselves are usually based on semantic classification rather than on specific words.  Otherwise, there would be little generality in the CE rules, whether hand-coded or machine-learned, and they would be bound to suffer from the sparse data problem.  For proper names, this type of semantic classification information may come from the preceding NE module.  But similar semantic classification is also needed for the non-NE, common noun entities that a rule may involve.  For example, Bill Clinton, John Smith, etc. are tagged as NE of type person by the NE module.  However, common nouns like gentleman, guy, girl, lady, etc. should also be classified as being of the person type.  Otherwise, local CE rules for the affiliation (staff) relationship like <PERSON works for ORGANIZATION> would only cover a fraction of the intended phenomena.  Such a rule cannot cover cases like This gentleman works for Cymfony unless the system has access to the information that gentleman is PERSON.


There are two ways to obtain this information: from lexical resources that classify such words as person, or from the co-reference link between This gentleman and its NE antecedent across sentence boundaries.  But the co-reference link cannot be established reliably without the lexical information that gentleman is PERSON.  Besides, for the sake of modularity, Local CE and CO should be two distinct modules, and it is not wise to bring discourse analysis into the stage of local CE extraction.  Therefore, the technique proposed here is to rely on support from on-line lexical resources.  In fact, reliable word classification information not only supports CE; it can support NE and GE as well.


Such word class information is usually assumed to come from a semantic lexicon.  Originally, it was expected that WordNet [Beckwith 1991] would provide this lexical support, as it is widely acknowledged to be a reliable (and free!) lexical resource with a sophisticated semantic classification scheme.  But it was soon found to contain too much noise to be usable for IE purposes.  WordNet tries to cover as many senses (synsets, in WordNet's terms) as possible for each word, no matter how rare a particular sense actually is.  For example, the words dog, cat, bird are all tagged as PERSON (in addition to other synsets).  It was realized that a word classification or clustering algorithm needs to be designed and implemented in order to obtain statistically meaningful semantic word class information.  This work will take place in the next phase.


For the time being, in order to enable the CE prototype implementation (as well as to support question parsing) and to examine the feasibility of CE grammar development and CE rule learning, a hand-crafted FST grammar has been developed which classifies some of the more frequently used words into classes.
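The role of such word class information can be sketched as follows.  This is only an illustration: a plain dictionary lookup stands in for the hand-crafted FST grammar, the word lists are limited to the examples mentioned in this report, and the names `WORD_CLASSES`, `semantic_class` and `matches_works_for` are hypothetical.

```python
# Hand-crafted word classes (illustrative stand-in for the FST grammar).
WORD_CLASSES = {
    "PERSON": {"gentleman", "guy", "girl", "lady"},
    "position_w": {"spokesman", "chairman", "secretary",
                   "researcher", "salesman"},
}


def semantic_class(word, ne_tag=None):
    """Prefer the NE tag for proper names; else use the word class lexicon."""
    if ne_tag is not None:
        return ne_tag
    for cls, words in WORD_CLASSES.items():
        if word.lower() in words:
            return cls
    return None


def matches_works_for(subject_np, subject_ne, object_ne):
    """Check the entity conditions of <PERSON works for ORGANIZATION>.

    Assumption: the head of a basic NP is its last token.
    """
    head = subject_np.split()[-1]
    return (semantic_class(head, subject_ne) == "PERSON"
            and object_ne == "ORGANIZATION")


# "This gentleman works for Cymfony": covered only thanks to the lexicon.
print(matches_works_for("This gentleman", None, "ORGANIZATION"))
# "Bill Clinton works for ...": covered by the NE tag alone.
print(matches_works_for("Bill Clinton", "PERSON", "ORGANIZATION"))
```

Because only the more frequent, unambiguous words are listed, a word such as dog is not classified as PERSON, which avoids the kind of noise observed with WordNet.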

2.4.3  Development of Local CE Grammars 

The Local CE module is itself a hybrid component.  It is comprised of an FST Model (pattern matcher using hand-coded CE grammars) as a preprocessor and a CE Learned Model (established via symbolic rule induction).


In the preceding SBIR efforts, Cymfony had made great progress in the development of local CE grammars, using the Textract FST toolkit.   It was found that in order to capture CE relationships, multi-level grammars need to be developed based on different levels of structures.  Some examples and the corresponding sample rules are cited below to demonstrate this point.  


Three levels of structure have been identified as required for three levels of CE FST models, namely CE1, CE2 and CE3.  They are all organized in a pipeline architecture together with the other text processing components of the system, as shown in Figure 3.


Figure 3: Multi-level CE Grammar


Each level of the CE FST Model is supported by the corresponding CE grammar (compiled into an FST for run-time application).  The following are two pattern rules in the Textract CE1 grammar, for the CE relationships affiliation and position and for the CE relationship location.


Sample CE1 Rules:    


0|NE(ORGANIZATION)      1|N(position_w)      2|NE(PERSON)

==> 2: affiliation=0  position=1


0|NE(LOCATION)          1|-based             2|NE(ORGANIZATION)

==> 2: location=0


The first rule links a person NE with an organization NE through the affiliation relationship; it also extracts a position word (position_w) such as spokesman, chairman, secretary, researcher, salesman, etc. as the filler of the CE slot (feature) position for the person NE.  This rule covers cases like UAW spokesman Owen Bieber.  The second rule works for cases like Buffalo-based Cymfony.  The corresponding local CE templates are shown below:


name:              Owen Bieber

position:           spokesman

affiliation:        UAW


name:              Cymfony

location:          Buffalo


For this type of very local phenomena, parsing is not helpful.  In fact, a shallow parser would group both UAW spokesman Owen Bieber and Seattle-based Microsoft as basic NPs (noun phrases).  As a result, the units inside the phrases would no longer be accessible for extracting the relationships.  Therefore, the proper structural basis for CE1 is identified to be the NE results only (i.e. the linear token string remains linear, except for multi-token NEs, which have been combined into single units).
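The two CE1 rules above can be approximated by a simple pattern matcher over the NE-tagged token string.  This is a minimal sketch in plain Python rather than a compiled FST; the token representation, the rule table `CE1_RULES` and the function `apply_ce1` are assumptions made for illustration.

```python
POSITION_WORDS = {"spokesman", "chairman", "secretary",
                  "researcher", "salesman"}


def ne(kind):
    """Predicate for NE(kind): the token carries the given NE tag."""
    return lambda tok: tok[1] == kind


def position_w(tok):
    return tok[1] is None and tok[0].lower() in POSITION_WORDS


def based(tok):
    return tok[0] == "-based"


# Each rule: (pattern, index of the entity filled, {slot: index of filler})
CE1_RULES = [
    ([ne("ORGANIZATION"), position_w, ne("PERSON")],
     2, {"affiliation": 0, "position": 1}),
    ([ne("LOCATION"), based, ne("ORGANIZATION")],
     2, {"location": 0}),
]


def apply_ce1(tokens):
    """Slide every rule over the token string and emit local CE templates."""
    templates = []
    for start in range(len(tokens)):
        for pattern, target, slots in CE1_RULES:
            window = tokens[start:start + len(pattern)]
            if len(window) == len(pattern) and all(
                    pred(tok) for pred, tok in zip(pattern, window)):
                template = {"name": window[target][0]}
                for slot, i in slots.items():
                    template[slot] = window[i][0]
                templates.append(template)
    return templates


# Tokens are (text, NE tag or None); multi-token NEs are already combined.
print(apply_ce1([("UAW", "ORGANIZATION"), ("spokesman", None),
                 ("Owen Bieber", "PERSON")]))
print(apply_ce1([("Buffalo", "LOCATION"), ("-based", None),
                 ("Cymfony", "ORGANIZATION")]))
```

The matcher inspects individual tokens, which is exactly what would become impossible if the whole string had already been grouped into one basic NP.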


The CE2 model is based on the results of the first stage of shallow parsing (i.e. Shallow Parser1).  Shallow Parser1 aims at grouping together some of the very basic linguistic units, BaseNP (Basic Noun Phrase) and VG (Verb Group).  This creates a basic structural base for capturing some further CE relationships.  The following is a pattern rule in a CE2 grammar for the CE relationships affiliation and position.


Sample CE2 Rule:      


0|NP(PERSON)      1|COMMA      2|NP(position_w)      3|P(of/for/with/in/at)      4|NP(ORGANIZATION)

==> 0: affiliation=4  position=2


This rule will extract the relationships for cases like Robert Callahan, spokesman of Seattle-based Microsoft.  Note that the CE relationship affiliation between Robert Callahan and Microsoft in this example cannot be captured without the structural basis provided by Shallow Parser1.  This is because the organization NE serves as the head of an NP with a preceding modifier, Seattle-based; pattern matching has to jump over (ignore) such modifiers in order to find the related entity.  This jump-over operation is difficult to realize when parsing is not available, because pre-modifiers can take various forms of various lengths (in theory, unbounded length).
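The benefit of matching on chunk heads can be seen in a small sketch of the CE2 rule.  The `Chunk` record and the function `apply_ce2` are hypothetical; the chunk list is written by hand here, standing in for the output of Shallow Parser1.

```python
from typing import NamedTuple, Optional


class Chunk(NamedTuple):
    cat: str                    # "NP", "COMMA", "P", ...
    text: str                   # full surface string of the chunk
    head: str                   # head word(s) of the chunk
    cls: Optional[str] = None   # semantic class of the head, if any


PREPS = {"of", "for", "with", "in", "at"}


def apply_ce2(chunks):
    """NP(PERSON) COMMA NP(position_w) P(of/for/with/in/at) NP(ORGANIZATION)"""
    templates = []
    for i in range(len(chunks) - 4):
        person, comma, pos, prep, org = chunks[i:i + 5]
        if (person.cat == "NP" and person.cls == "PERSON"
                and comma.cat == "COMMA"
                and pos.cat == "NP" and pos.cls == "position_w"
                and prep.cat == "P" and prep.head in PREPS
                and org.cat == "NP" and org.cls == "ORGANIZATION"):
            # The rule reads the NP head, so modifiers are skipped for free.
            templates.append({"name": person.head,
                              "affiliation": org.head,
                              "position": pos.head})
    return templates


chunks = [
    Chunk("NP", "Robert Callahan", "Robert Callahan", "PERSON"),
    Chunk("COMMA", ",", ","),
    Chunk("NP", "spokesman", "spokesman", "position_w"),
    Chunk("P", "of", "of"),
    Chunk("NP", "Seattle-based Microsoft", "Microsoft", "ORGANIZATION"),
]
result = apply_ce2(chunks)
print(result)
```

The pre-modifier Seattle-based never appears in the pattern: however long the modifier sequence is, the chunker has already folded it into the NP whose head is Microsoft.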


The CE3 model is designed to handle CE relationships at the clause level.  It requires support from the second stage of shallow parsing (i.e. Shallow Parser 2).  Shallow Parser 2 aims at grouping together some further linguistic units: NP (e.g. NP's NP will be grouped into one bigger NP) and BasePP (Basic Prepositional Phrase, i.e. a preposition followed by a BaseNP).  This creates a more sophisticated structural base for capturing CE relationships which span a longer distance in the sentence.  It is especially suited for targeting CE relationships expressed by S-V-O (Subject-Verb-Object) structures.  The following is a sample pattern rule in the Textract CE3 grammar for the CE relationship affiliation.


Sample CE3 Rule:      


0|NP(PERSON)          [Adv|PP]*       1|VG(work)     2|PP(for,ORGANIZATION)

==> 0: affiliation=2


This rule has been tested and covers cases like Robert Jackson originally from Washington, D.C. has been working for Seattle-based Microsoft for almost a decade.  Note that the rule specifies that any number of adverbs or PPs (prepositional phrases) between the subject NP and its predicate VG are to be ignored.  In the above example sentence, the adverb originally and the PP from Washington, D.C. are safely jumped over by the pattern, and the relationship between the subject NP Robert Jackson and the head of the PP object, Microsoft, is successfully captured.  This is only possible with the support of shallow parsing, which groups both NPs and PPs.  The resulting CE template for the sample sentence is shown below.


name:              Robert Jackson

affiliation:        Microsoft
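The jump-over behavior of the CE3 rule, i.e. the [Adv|PP]* part of the pattern, can be sketched as a loop that skips adverb and PP chunks after the subject.  The `Chunk` record and `apply_ce3` are hypothetical names, and the chunk list is hand-written in place of real Shallow Parser 2 output.

```python
from typing import NamedTuple, Optional


class Chunk(NamedTuple):
    cat: str                    # "NP", "VG", "PP", "Adv"
    head: str                   # head noun, verb lemma, or PP object head
    cls: Optional[str] = None   # semantic class of the head, if any
    prep: Optional[str] = None  # preposition, for PP chunks only


def apply_ce3(chunks):
    """NP(PERSON) [Adv|PP]* VG(work) PP(for,ORGANIZATION)"""
    templates = []
    for i, subj in enumerate(chunks):
        if not (subj.cat == "NP" and subj.cls == "PERSON"):
            continue
        j = i + 1
        # [Adv|PP]*: jump over any number of adverbs and PPs.
        while j < len(chunks) and chunks[j].cat in ("Adv", "PP"):
            j += 1
        if j + 1 < len(chunks):
            verb, obj = chunks[j], chunks[j + 1]
            if (verb.cat == "VG" and verb.head == "work"
                    and obj.cat == "PP" and obj.prep == "for"
                    and obj.cls == "ORGANIZATION"):
                templates.append({"name": subj.head,
                                  "affiliation": obj.head})
    return templates


sentence = [
    Chunk("NP", "Robert Jackson", "PERSON"),
    Chunk("Adv", "originally"),
    Chunk("PP", "Washington, D.C.", "LOCATION", "from"),
    Chunk("VG", "work"),                      # "has been working"
    Chunk("PP", "Microsoft", "ORGANIZATION", "for"),
    Chunk("PP", "decade", None, "for"),       # "for almost a decade"
]
result = apply_ce3(sentence)
print(result)
```

Note that the trailing PP for almost a decade does not match the object condition, since its head is not an ORGANIZATION, so only the intended affiliation is produced.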



Abney, S.P. 1991.  Parsing by Chunks.  Principle-Based Parsing: Computation and Psycholinguistics, Robert C. Berwick, Steven P. Abney, Carol Tenny, eds.  Kluwer Academic Publishers, Boston, MA, pp. 257-278.

Appelt, D.E. et al. 1995.  SRI International FASTUS System MUC-6 Test Results and Analysis.  Proceedings of MUC-6, Morgan Kaufmann Publishers, San Mateo, CA.

Beckwith, R. et al. 1991.  WordNet: A Lexical Database Organized on Psycholinguistic Principles.  Lexicons: Using On-line Resources to Build a Lexicon, Uri Zernik, editor, Lawrence Erlbaum, Hillsdale, NJ.

Bikel, D.M. et al. 1997.  Nymble: a High-Performance Learning Name-finder.  Proceedings of the Fifth Conference on Applied Natural Language Processing, Morgan Kaufmann Publishers, pp. 194-201.

Brill, E. 1995.  Transformation-based Error-Driven Learning and Natural Language Processing: A Case Study in Part-of-Speech Tagging.  Computational Linguistics, Vol. 21, No. 4, pp. 227-253.

Briscoe, T. & Waegner, N. 1992.  Robust Stochastic Parsing Using the Inside-Outside Algorithm.  Workshop Notes, Statistically-Based NLP Techniques, AAAI, pp. 30-53.

Charniak, E. 1994.  Statistical Language Learning, MIT Press, Cambridge, MA.

Chiang, T-H., Lin, Y-C. & Su, K-Y. 1995.  Robust Learning, Smoothing, and Parameter Tying on Syntactic Ambiguity Resolution.  Computational Linguistics, Vol. 21, No. 3, pp. 321-344.

Chinchor, N. & Marsh, E. 1998.  MUC-7 Information Extraction Task Definition (version 5.1).  Proceedings of MUC-7.

Darroch, J.N. & Ratcliff, D. 1972.  Generalized iterative scaling for log-linear models.  The Annals of Mathematical Statistics, pp. 1470-1480.

Grishman, R. 1997.  TIPSTER Architecture Design Document Version 2.3.  Technical report, DARPA.

Hobbs, J.R. 1993.  FASTUS: A System for Extracting Information from Text.  Proceedings of the DARPA Workshop on Human Language Technology, Princeton, NJ, pp. 133-137.

Krupka, G.R. & Hausman, K. 1998.  IsoQuest Inc.: Description of the NetOwl (TM) Extractor System as Used for MUC-7.  Proceedings of MUC-7.

Lin, D. 1998.  Automatic Retrieval and Clustering of Similar Words.  Proceedings of COLING-ACL '98, Montreal, pp. 768-773.

Miller, S. et al. 1998.  BBN: Description of the SIFT System as Used for MUC-7.  Proceedings of MUC-7.

Mohri, M. 1997.  Finite-State Transducers in Language and Speech Processing.  Computational Linguistics, Vol. 23, No. 2, pp. 269-311.

Mooney, R.J. 1999.  Symbolic Machine Learning for Natural Language Processing.  Tutorial Notes, ACL '99.

MUC-7, 1998.  Proceedings of the Seventh Message Understanding Conference (MUC-7), published on the website http://www.muc.saic.com/

Pine, C. 1996.  Statement-of-Work (SOW) for The Intelligence Analyst Associate (IAA) Build 2.  Contract for IAA Build 2, USAF, AFMC, Rome Laboratory.

Riloff, E. & Jones, R. 1999.  Learning Dictionaries for Information Extraction by Multi-Level Bootstrapping.  Proceedings of the Sixteenth National Conference on Artificial Intelligence (AAAI-99).

Rosenfeld, R. 1994.  Adaptive Statistical Language Modeling.  PhD thesis, Carnegie Mellon University.

Senellart, J. 1998.  Locating Noun Phrases with Finite State Transducers.  Proceedings of COLING-ACL '98, Montreal, pp. 1212-1219.

Silberztein, M. 1998.  Tutorial Notes: Finite State Processing with INTEX.  COLING-ACL '98, Montreal (also available at http://www.ladl.jussieu.fr).

Srihari, R. 1998.  A Domain Independent Event Extraction Toolkit.  AFRL-IF-RS-TR-1998-152 Final Technical Report, published by Air Force Research Laboratory, Information Directorate, Rome Research Site, New York.

Yangarber, R. & Grishman, R. 1998.  NYU: Description of the Proteus/PET System as Used for MUC-7 ST.  Proceedings of MUC-7.



Pre-Knowledge-Graph Profile Extraction Research via SBIR (1) 2015-10-24
Retrospective: Early arguments for a hybrid model for NLP and IE

Morning Blossoms Gathered at Noon: The ups and downs of writing grant proposals in the U.S. - ScienceNet


