

From the archives: Early arguments for a hybrid model for NLP and IE

Posted 2015-10-25 | Keywords: machine learning, FST, hybrid

January 2000

On Hybrid Model 

Pre-Knowledge-Graph Profile Extraction Research via SBIR (3)


This section presents the feasibility study conducted in Phase I of the proposed hybrid model for Level-2 and Level-3 IE.  The study is based on a literature review and supported by extensive experiments and a prototype implementation.  The model complements corpus-based machine learning with hand-coded FST rules.  The essential argument for this strategy is that by combining machine learning methods with an FST rule-based system, the system can exploit the best of both paradigms while overcoming their respective weaknesses.  This approach was intended to meet the demands of the designed system for processing unrestricted real-life text.

2.2.1  Hybrid Approach

It was proposed that hand-crafted FST rules be combined with corpus-based learning in all major modules of Textract.  More precisely, each module M consists of two sub-modules M1 and M2, i.e. an FST model and a trained model.  The former serves as a preprocessor, as shown below.

    M1: FST Sub-module
            |
            v
    M2: Trained Sub-module

The trained model M2 has two features:  (i) adaptive training [Rosenfeld 1994];  (ii) structure-based training.

 

In a pipeline architecture, the output of one module is the input of the succeeding module.  If the succeeding module is a trained model, there are two types of training:  adaptive and non-adaptive.  In adaptive training, the input in the training phase is exactly the same as the input in the application phase.  That is, the possibly imperfect output of the previous module is used as training input, even though the previous module may make certain mistakes.  This type of training “adapts” the model to imperfect input, making the trained model more robust.  In contrast, naive non-adaptive training is conducted on perfect, often artificial input, on the assumption that the previous module is continuously improving and will eventually provide near-perfect output for the next module.
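The difference can be sketched in a few lines of Python.  This is a toy illustration, not the Textract implementation: `upstream_module`, `gold_preprocess` and the lookup-table "model" are all hypothetical stand-ins, chosen only to show where the training input comes from in each regime.

```python
# Sketch of adaptive vs. non-adaptive training (hypothetical modules).
# The downstream "model" is a trivial lookup table over (word, tag)
# pairs; the point is only where its training input comes from.

def upstream_module(words):
    """Imperfect preprocessor: systematically mis-tags 'May' as a month."""
    return [(w, "MONTH" if w == "May" else "WORD") for w in words]

def gold_preprocess(words):
    """Perfect (artificial) input used in non-adaptive training."""
    return [(w, "NAME" if w == "May" else "WORD") for w in words]

def train(tagged_pairs, labels):
    """Learn a mapping from (word, upstream_tag) to the final label."""
    return {pair: label for pair, label in zip(tagged_pairs, labels)}

words = ["May", "works"]
labels = ["PERSON", "O"]  # desired final labels

# Non-adaptive: trained on perfect input; adaptive: trained on the
# upstream module's actual (imperfect) output.
naive = train(gold_preprocess(words), labels)
adaptive = train(upstream_module(words), labels)

runtime_input = upstream_module(words)  # what the module really receives
naive_out = [naive.get(p, "UNK") for p in runtime_input]
adaptive_out = [adaptive.get(p, "UNK") for p in runtime_input]
```

The adaptively trained table recovers the intended labels on the real pipeline input, while the naively trained one misses every token the upstream module mis-tags.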

 

There are pros and cons to both methods.  Non-adaptive training is suitable when training time is significantly long, or when the previous module is simple and reaches high precision.  An adaptively trained model, in contrast, has to be re-trained each time the previous module(s) undergo major changes; otherwise, performance will be seriously affected.  This imposes stringent requirements on training time and algorithm efficiency.  Since the machine learning tools which Cymfony has developed in-house are very efficient, Textract can afford to adopt the more flexible adaptive training method.

 

Adaptive training provides the rationale for placing the FST model before the trained model.  The development of the FST sub-module M1 and the trained sub-module M2 can proceed independently.  When the time comes to integrate M1 and M2 for better performance, it suffices to re-train M2 on the output of M1.  The flexible adaptive training capability makes this design viable, as verified in the prototype implementation of Textract 2.0/CE.  In contrast, if M1 were placed after M2, the development of hand-crafted rules for M1 would have to wait until M2 is implemented; otherwise, many rules might have to be re-written and re-debugged, which is not desirable.

 

The second issue is structure-based training.  Natural language is structural by nature; sophisticated high-level IE can hardly succeed on linear strings of tokens.  In order to capture CE/GE phenomena, traditional n-gram training with a window of n linear tokens is not sufficient.  Sentences can be long, with the related entities far apart, not to mention long-distance phenomena in linguistics.  Without structure-based training, no matter how large a window size one chooses, generalized rules cannot be learned effectively.  Once the training is based on linguistic structures, however, the distance between the entities becomes tractable.
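The contrast between linear and structural distance can be made concrete with a toy sentence.  The chunk inventory below is hand-built for illustration; it is an assumed shallow-parse output, not what any particular parser produces.

```python
# Toy illustration: two related entities far apart as tokens become
# near neighbors once the sentence is reduced to phrase chunks.

sentence = ("John Smith , the well-known chief executive officer of the "
            "newly merged company , joined IBM").split()

# Assumed shallow-parse chunks for the same sentence: (type, tokens).
chunks = [
    ("NP", ["John", "Smith"]),
    ("O",  [","]),
    ("NP", ["the", "well-known", "chief", "executive", "officer"]),
    ("PP", ["of", "the", "newly", "merged", "company"]),
    ("O",  [","]),
    ("VG", ["joined"]),
    ("NP", ["IBM"]),
]

def chunk_of(word):
    """Index of the chunk containing the word."""
    return next(i for i, (_, toks) in enumerate(chunks) if word in toks)

token_distance = sentence.index("IBM") - sentence.index("Smith")  # linear
chunk_distance = chunk_of("IBM") - chunk_of("Smith")              # structural
```

An n-gram window would need n = 15 to relate "Smith" and "IBM" here, whereas at the chunk level a window of seven units suffices, and the verb group linking the two entities is immediately adjacent to both.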

 

In fact, as linguistic structures are hierarchical, we need to perform multi-level training in order to capture CE/GE.  For CE, it has been found during the Phase I research that three levels of training are necessary.  Each level of training should be supported by the corresponding natural language parser.

 

The remainder of this section presents the feasibility study and arguments for the choice of an FST rule based system to complement the corpus-based machine learning models.


2.2.2  FST Grammars

The most attractive feature of the FST formalism is its superior time and space efficiency.  Applying an FST takes time essentially linear in the size of the input text.  This is in contrast to the more pervasive formalism used in NLP, namely Context Free Grammars.
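The linear-time behavior follows from the machine itself: application is a single left-to-right pass with a constant-time transition lookup per token.  A minimal sketch (a toy one-state transducer, not Cymfony's FST Toolkit):

```python
# Minimal deterministic finite-state transducer, applied in one
# left-to-right pass: O(n) in the input length. The toy machine
# rewrites the token "Inc" to "Inc." and copies everything else.

def apply_fst(transitions, start, tokens):
    """One pass over the input; each step is a constant-time lookup."""
    state, output = start, []
    for tok in tokens:
        state, out = transitions.get((state, tok), (start, tok))
        output.append(out)
    return output

# (state, input) -> (next_state, output); default: reset and copy token.
transitions = {(0, "Inc"): (0, "Inc.")}

result = apply_fst(transitions, 0, ["Cymfony", "Inc", "builds", "parsers"])
```

A general CFG parser, by comparison, carries per-sentence costs that are polynomial (typically cubic) in sentence length, which is the gap the rest of this subsection exploits.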

 

This theoretical time/space efficiency has been verified through the extensive use of Cymfony’s proprietary FST Toolkit in the following applications of the Textract implementation:  (i) the tokenizer; (ii) FST-based rules for capturing NE; (iii) FST representation of lexicons (lexical transducers); (iv) experiments in FST local grammars for shallow parsing; and (v) local CE/GE grammars in FST.  For example, the Cymfony shallow parser has been benchmarked at 460 MB of text per hour on a 450 MHz Pentium II PC running Windows NT.

 

FST-based grammars combine naturally with lexical approaches to natural language phenomena [Silberztein 1998].  In order for IE grammars/rules to perform well, the lexical approach must be employed.  In fact, the NE/CE/GE grammars developed in Phase I have demonstrated the need for the lexical approach.  Take CE as an example.  In order to capture a certain CE relationship, say affiliation, the corresponding rules need to check patterns involving specific verbs and/or prepositions, say work for/hired by, which denote this relationship in English.  The GE grammar, which aims at decoding the key semantic relationships in the argument structure instead of surface syntactic relationships, has also demonstrated the need for a considerable level of lexical constraints.
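As a rough illustration of such a lexically anchored rule, an affiliation pattern can be sketched as a regular expression over NE-tagged text.  This is an assumed sketch, not Cymfony's actual CE grammar: the `<NE_PERSON>`/`<NE_ORG>` tag format and the verb list are hypothetical.

```python
import re

# Hypothetical lexically anchored CE rule for the affiliation relation:
# the pattern fires only on the specific verbs/prepositions that denote
# the relation in English (work for, hired by, ...).

AFFILIATION = re.compile(
    r"<NE_PERSON>(?P<person>[^<]+)</NE_PERSON>\s+"
    r"(?:works\s+for|is\s+hired\s+by|is\s+employed\s+by)\s+"
    r"<NE_ORG>(?P<org>[^<]+)</NE_ORG>")

def extract_affiliation(ne_tagged_text):
    """Return (person, org) pairs matched by the lexical pattern."""
    return [(m.group("person"), m.group("org"))
            for m in AFFILIATION.finditer(ne_tagged_text)]

text = "<NE_PERSON>John Smith</NE_PERSON> works for <NE_ORG>IBM</NE_ORG>"
pairs = extract_affiliation(text)
```

The rule's coverage is exactly its lexical anchor list, which is why such grammars grow large and why the FST formalism's efficiency matters for them.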

 

Efficiency is always an important consideration in developing a large-scale deployable software system.  It is particularly required for lexical grammars, which are usually too large for efficient processing under conventional, more powerful grammar formalisms (e.g. the Context Free Grammar formalism).  Cymfony is convinced through extensive experiments that FST technology is an outstanding tool for tackling this efficiency problem.

 

It has been suggested that a set of cascaded FST grammars [Abney 1991] [Hobbs 1993] can simulate sophisticated natural language parsing.  This cascaded FST approach has already been applied successfully to Textract shallow parsing and local CE/GE extraction.

 

There are a number of success stories of FST-based rule systems in the field of IE.  For example, the commercial NE system NetOwl relies heavily on FST pattern matching rules [Krupka & Hausman 1998].  SRI also applied a very efficient FST local grammar to the shallow parsing of basic noun phrases and verb groups in order to support IE tasks [Hobbs 1993] [Appelt et al 1995].  More recently, Universite Paris VII/LADL [Senellart 1998] successfully applied FST technology to a specific information extraction/retrieval task; that system extracts information on-the-fly about a person’s occupation from huge amounts of free text.  It is able to answer questions which conventional retrieval systems cannot handle, e.g. Who is the minister of culture in France?

 

Finally, it has also been proven by many research programs, such as [Krupka & Hausman 1998], [Hobbs 1993] and INTEX [Silberztein 1998], as well as by Cymfony [Srihari 1998], that an FST-based rule system is extremely efficient.  In addition, FST is a convenient tool for capturing linguistic phenomena, especially the idioms and semi-productive expressions that are abundant in natural languages.  As Hobbs [1993] says, “languages in general are very productive in the construction of short, multiword fixed phrases and proper names employing specialized microgrammars”.

 

However, a purely FST-based rule system suffers from the same disadvantage in knowledge acquisition as all handcrafted rule systems.  After all, the FST rules or local grammars have to be encoded by human experts, which reintroduces the traditional labor-intensive problem in developing large-scale systems.  The conclusion is that while FST overcomes a number of shortcomings of traditional rule-based systems (in particular the efficiency problem), it does not relieve the dependence on highly skilled human labor.  Therefore, automatic machine learning techniques are called for.


2.2.3  Machine Learning

The appeal of corpus-based machine learning in language modeling lies mainly in its automatic training/learning capabilities, which significantly reduce the cost of hand-coding rules.  Compared with rule-based systems, corpus-based learning has definite advantages:

·        automatic knowledge acquisition:  results in fast development time since the system discovers regularity automatically when given sufficient correctly annotated data

·        robustness:  since knowledge/rules are learned directly from corpus

·        acceptable speed:  in general, there is little run-time processing;  the knowledge/rules obtained in the training phase can be stored in efficient data structures for run-time lookup

·        portability:  a domain shift only requires the truthing of new data; new knowledge/rules will be automatically learned with no need to change any part of the program or control

 

BBN has recently implemented an integrated, fully trainable model, SIFT, applied to IE [Miller et al 1998].  This system performs the tasks of linguistic processing (POS tagging, syntactic parsing and semantic relationship identification), TE and TR as well as NE, all at once.  They have reported F-measures of 83.49% for TE and 71.23% for TR, results close to those of the best systems in MUC-7.  In addition, their successful use of the Penn Treebank for training the initial syntactic parser significantly reduces the cost of human annotation.  There is no doubt that their effort represents significant progress in this field; it demonstrates the state of the art in applying grammar induction to Level-2 IE.  However, there are two potentially serious problems with their approach.

 

The first is the lack of efficiency in applying the model.  As they acknowledge, the speed of the system is rather slow.  In terms of efficiency, the CKY-based parsing algorithm they use is not comparable with algorithms for formalisms based on the finite state scheme (e.g. FST, or Viterbi for HMM).  This limiting factor is inherent in a learned grammar based on the CFG formalism.  To overcome this problem, rule induction has been explored in the direction of learning FST-style grammars for local CE/GE extraction instead of CFG.

 

The second problem is with their integrated approach.  Because everything is integrated in one process, it is extremely difficult to trace where a problem lies, making debugging difficult.  

 

It is believed that a much more secure way is to follow the conventional practice of modularizing the NLP/IE process into tasks and sub-tasks, as Cymfony has proposed in the Textract architecture design:  POS tagging, shallow parsing, co-referencing, full parsing, pragmatic filtering, NE, CE, GE.  Along this line, it is easy to determine directly whether a particular degradation in performance is due, for example, to poor support from co-referencing or to mistakes in shallow parsing.  Performance can be benchmarked for each module, and efforts to improve each individual module will contribute to the improvement of overall system performance.


2.2.4  Drawbacks of Corpus-based Learning

The following drawbacks motivate the proposed idea of building a hybrid system/module, complementing the automatic corpus-based learning by handcrafted grammars in FST.

 

·        ‘Sparse data’ problem:  this is recognized as a bottleneck for all corpus-based models [Charniak 1994].  Unfortunately, practical solutions to this problem (e.g. smoothing or back-off techniques) often result in a model much less sophisticated than traditional rule-based systems.

 

·        ‘Local maxima’ problem:  even if the training corpus is large and sufficiently representative, the training program can still produce a poor model because training gets stuck in a local maximum and fails to find the global peak [Charniak 1994].  This is an inherent problem with the standard training algorithms for both HMMs (the forward-backward algorithm) and CFG grammar induction (the inside-outside algorithm).  It can be very serious when no extra information is applied to guide the training process.

 

·        Computational complexity problem:  there is often a trade-off between the expressive power/prior knowledge/constraints in the templates and feasibility.  Usually, the more sophisticated a model or rule template is, the larger the minimum corpus required, often up to an unrealistic level of training complexity.  Extending the length of the string to be examined (e.g. from bigram to trigram), or adding more features (or categories/classes) for a template to reference, usually means an enormous jump in this requirement; otherwise, the system suffers from a more serious sparse data effect.  In many cases, the limit imposed on training complexity makes some research ideas unattainable, which in turn limits performance.

 

·        Potentially very high cost of manual corpus annotation:  that is why Cymfony has proposed, as one important direction for future research, exploring the combination of supervised and unsupervised training.

 

Among the above four problems, the sparse data problem is believed to be the most serious.  To a considerable extent, the success of a system depends on how this problem is addressed.  In general, there are three ways to minimize the negative effect of sparse data, discussed below.

 

The first is to condition the probabilities/rules on fewer elements, e.g. to back off from an N-gram model to an (N-1)-gram model.  This remedy clearly sacrifices power and is therefore not a viable option for sophisticated NLP/IE tasks.
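Backing off can be sketched in its most naive form, a maximum-likelihood bigram model that falls back to the unigram estimate for unseen pairs (no discounting, so the distribution is not renormalized; a real back-off model would add that):

```python
from collections import Counter

# Naive bigram-to-unigram backoff on a toy corpus: condition on fewer
# elements when the bigram count is zero.

corpus = "the cat sat on the mat".split()
unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))
total = len(corpus)

def prob(word, prev):
    """P(word | prev) with naive backoff to the unigram estimate."""
    if bigrams[(prev, word)] > 0:
        return bigrams[(prev, word)] / unigrams[prev]
    return unigrams[word] / total  # back off: drop the conditioning context

p_seen = prob("cat", "the")    # bigram "the cat" was observed
p_unseen = prob("mat", "cat")  # unseen bigram -> unigram estimate
```

The backed-off estimate ignores the context entirely, which is exactly the loss of power the paragraph above objects to.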

 

The second approach is to condition the probabilities/rules on appropriate levels of linguistic structure (e.g. the basic phrase level) instead of surface linear tokens.  The research in the CE prototyping showed this to be one of the most promising ways of handling the sparse data problem.  This approach calls for a reliable natural language parser to establish the structural foundation for structure-based adaptive learning.  The shallow parser which Cymfony has built, using the FST engine and an extensively tested manual grammar, performs with 90.5% accuracy.

 

The third method is to condition the probabilities/rules on more general features, e.g. syntactic categories (e.g. POS) or semantic classes (e.g. from a semantic lexicon, or from word clustering) instead of the literal token.  This too is a proven means of overcoming the bottleneck.  However, there is considerable difficulty in applying this approach, due to the high degree of lexical ambiguity widespread in natural languages.
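A minimal sketch of class-based generalization, with an invented two-entry lexicon: literal counts miss an unseen pair that class-level counts cover.

```python
from collections import Counter

# Conditioning on semantic classes instead of token literals.
# The lexicon below is a hypothetical stand-in for a semantic lexicon
# or word-clustering output.

lexicon = {"IBM": "ORG", "Acme": "ORG", "hired": "HIRE_VERB"}
observed = [("hired", "IBM")]  # observed (verb, object) pairs in training

literal = Counter(observed)
classed = Counter((lexicon.get(v, v), lexicon.get(o, o))
                  for v, o in observed)

unseen = ("hired", "Acme")  # never seen literally
literal_count = literal[unseen]
class_count = classed[(lexicon[unseen[0]], lexicon[unseen[1]])]
```

The catch the paragraph notes is the lexicon lookup itself: when a token belongs to several classes (lexical ambiguity), choosing the right class is a disambiguation problem of its own.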

 

As for the ‘local maxima’ problem, the proposed hybrid approach of integrating handcrafted FST rules with the automatic grammar learner promises a solution.  The learned model can be re-trained using the FST component as a ‘seed’ to guide the learning.  In general, the more constraints and heuristics given to the initial statistical model, the better the chance that the training algorithm reaches the global maximum.  A handcrafted grammar is believed to be the most effective such constraint, since it embodies human linguistic knowledge.


2.2.5  Feasibility and Advantages of Hybrid Approach

In fact, the feasibility of such collaboration between a handcrafted rule system (FST in this case) and a corpus-based system has already been verified for all the major types of models:

 

·        For transformation-based systems, Brill’s training algorithm allows the input to be either a randomly tagged text (a naive initial state) or a text tagged by another module with the same function (a sophisticated initial state) [Brill 1995].  Using POS tagging as an example, the input to the transformation-based tagger can be either randomly tagged text or text tagged by another POS tagger.  A shift in input source only requires re-training the system; neither the algorithm nor the annotated corpus needs to change.

 

·        In the case of rule induction, the FST-based grammar can serve as a ‘seed’ to effectively constrain/guide the learning process and overcome the ‘local maxima’ problem.  In general, a better initial estimate of the parameters gives the learning procedure a chance to obtain better results when many local maxima exist [Chiang et al 1995].  Experiments conducted by Briscoe & Waegner [1992] show that even starting from a very crude handcrafted grammar of only seven binary-branching rules (e.g. PP --> P NP), a much better grammar is learned automatically than with the same approach and no grammar ‘seed’.  Another, more interesting experiment of theirs gives the following encouraging result:  given the seed of an artificial grammar that could parse only 25% of a 50,000-word corpus, the training program produced a grammar capable of parsing 75% of the corpus.  This demonstrates the feasibility of combining a handcrafted grammar with automatic grammar induction, in line with the general approach proposed above:  FST rules before the statistical model.

 

·        When the trained sub-module is an HMM, Cymfony has verified feasibility through extensive experiments in implementing the hybrid NE tagger, Textract 1.0.  Cymfony first implemented an NE system purely on HMM bigram learning and found weaknesses:  due to the sparse data problem, there was a considerable amount of mistagging, even though time and numerical NEs are expressed in very predictable patterns.  This problem was later addressed by FST rules, which are good at capturing such patterns; the FST pattern rules for NE serve as a preprocessor.  As a result, Textract 1.0 achieved a significant performance enhancement (F-measure raised from 85% to 93%).
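The division of labor in that hybrid NE design can be sketched as follows.  Everything here is illustrative rather than Textract's actual rules or model: a single regex stands in for the FST time-pattern rules, and a one-line heuristic stands in for the HMM.

```python
import re

# Sketch of the hybrid NE pipeline: a rule layer pre-tags the highly
# patterned time expressions; the "learned" tagger (a stand-in
# heuristic here) only decides the tokens the rules left open.

TIME_RULE = re.compile(r"^\d{1,2}:\d{2}$")  # e.g. "10:30"

def fst_pretag(tokens):
    """Rule layer: tag what the patterns cover, leave the rest open."""
    return [("TIME" if TIME_RULE.match(t) else None, t) for t in tokens]

def learned_tagger(tag, token):
    """Stand-in for the statistical model; fills only untagged slots."""
    if tag is not None:
        return tag  # the rule layer already decided
    return "PERSON" if token[0].isupper() else "O"

tokens = ["Smith", "arrived", "at", "10:30"]
tags = [learned_tagger(tag, tok) for tag, tok in fst_pretag(tokens)]
```

Because the rule layer's decisions are final, the statistical model never sees the predictable patterns it handled badly, which is the source of the reported gain.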

 

The advantages of this proposed hybrid approach are summarized below:

 

·        strict modularity:  combining FST rules and statistical models makes the system more modular, as each major module is now divided into two sub-modules.  Of course, adaptive re-training is necessary in the later stage of integrating the two sub-modules [1], but it is not a burden:  the process is automatic and, in principle, requires no modification of the algorithm or the training corpus.

 

·        enhanced performance:  due to the complementary nature of handcrafted and machine-learning systems.

 

·        flexible ratio of sub-modules:  one module may have a large trained model and a small FST component, or the other way around, depending on how well the FST approach or the learning approach applies to a given task.  One is free to decide how to allocate effort and resources between the two components.  If automatic learning is judged most effective for Task One, more effort and resources can go to the trained module M2 for this task (and less to the FST module M1).  In other words, the relative size or contribution of M1 versus M2 is flexible, e.g. M1=20% and M2=80%.[2]

 

Technology developed for the proposed information extraction system and its application has focused on six specific areas:  (i) machine learning toolkit, (ii) CE, (iii) CO, (iv) GE, (v) QA and (vi) truthing and evaluation.  The major accomplishments in these areas from the Phase I research are presented in the following sections.


[1] In fact, the same holds in the development of a pure statistical system:  repeated training and testing is the normal practice for adjusting the model in the effort to improve performance and debug.

[2] It is possible for one module to be based exclusively on FST rules, i.e. M1=100% and M2=0%, or entirely on a learned model, i.e. M1=0% and M2=100%, so long as its performance is deemed good enough, or the overhead of combining the FST grammar and the learned model outweighs the slight gain in performance.  In fact, some minor modules such as the Tokenizer and POS Tagger can produce very reliable results using only one approach.


REFERENCES

 

Abney, S.P. 1991.  Parsing by Chunks.  In Principle-Based Parsing: Computation and Psycholinguistics, Robert C. Berwick, Steven P. Abney, Carol Tenny, eds.  Kluwer Academic Publishers, Boston, MA, pp. 257-278.

Appelt, D.E. et al. 1995.  SRI International FASTUS System MUC-6 Test Results and Analysis.  Proceedings of MUC-6, Morgan Kaufmann Publishers, San Mateo, CA.

Beckwith, R. et al. 1991.  WordNet: A Lexical Database Organized on Psycholinguistic Principles.  In Lexicons: Using On-line Resources to Build a Lexicon, Uri Zernik, ed., Lawrence Erlbaum, Hillsdale, NJ.

Bikel, D.M. et al. 1997.  Nymble: a High-Performance Learning Name-finder.  Proceedings of the Fifth Conference on Applied Natural Language Processing, Morgan Kaufmann Publishers, pp. 194-201.

Brill, E. 1995.  Transformation-based Error-Driven Learning and Natural Language Processing: A Case Study in Part-of-Speech Tagging.  Computational Linguistics, Vol. 21, No. 4, pp. 227-253.

Briscoe, T. & Waegner, N. 1992.  Robust Stochastic Parsing Using the Inside-Outside Algorithm.  Workshop Notes, Statistically-Based NLP Techniques, AAAI, pp. 30-53.

Charniak, E. 1994.  Statistical Language Learning.  MIT Press, Cambridge, MA.

Chiang, T-H., Lin, Y-C. & Su, K-Y. 1995.  Robust Learning, Smoothing, and Parameter Tying on Syntactic Ambiguity Resolution.  Computational Linguistics, Vol. 21, No. 3, pp. 321-344.

Chinchor, N. & Marsh, E. 1998.  MUC-7 Information Extraction Task Definition (version 5.1).  Proceedings of MUC-7.

Darroch, J.N. & Ratcliff, D. 1972.  Generalized Iterative Scaling for Log-linear Models.  The Annals of Mathematical Statistics, pp. 1470-1480.

Grishman, R. 1997.  TIPSTER Architecture Design Document Version 2.3.  Technical report, DARPA.

Hobbs, J.R. 1993.  FASTUS: A System for Extracting Information from Text.  Proceedings of the DARPA Workshop on Human Language Technology, Princeton, NJ, pp. 133-137.

Krupka, G.R. & Hausman, K. 1998.  IsoQuest Inc.: Description of the NetOwl (TM) Extractor System as Used for MUC-7.  Proceedings of MUC-7.

Lin, D. 1998.  Automatic Retrieval and Clustering of Similar Words.  Proceedings of COLING-ACL '98, Montreal, pp. 768-773.

Miller, S. et al. 1998.  BBN: Description of the SIFT System as Used for MUC-7.  Proceedings of MUC-7.

Mohri, M. 1997.  Finite-State Transducers in Language and Speech Processing.  Computational Linguistics, Vol. 23, No. 2, pp. 269-311.

Mooney, R.J. 1999.  Symbolic Machine Learning for Natural Language Processing.  Tutorial Notes, ACL '99.

MUC-7. 1998.  Proceedings of the Seventh Message Understanding Conference (MUC-7), published on the website http://www.muc.saic.com/

Pine, C. 1996.  Statement-of-Work (SOW) for The Intelligence Analyst Associate (IAA) Build 2.  Contract for IAA Build 2, USAF, AFMC, Rome Laboratory.

Riloff, E. & Jones, R. 1999.  Learning Dictionaries for Information Extraction by Multi-Level Bootstrapping.  Proceedings of the Sixteenth National Conference on Artificial Intelligence (AAAI-99).

Rosenfeld, R. 1994.  Adaptive Statistical Language Modeling.  PhD thesis, Carnegie Mellon University.

Senellart, J. 1998.  Locating Noun Phrases with Finite State Transducers.  Proceedings of COLING-ACL '98, Montreal, pp. 1212-1219.

Silberztein, M. 1998.  Tutorial Notes: Finite State Processing with INTEX.  COLING-ACL '98, Montreal (also available at http://www.ladl.jussieu.fr).

Srihari, R. 1998.  A Domain Independent Event Extraction Toolkit.  AFRL-IF-RS-TR-1998-152 Final Technical Report, Air Force Research Laboratory, Information Directorate, Rome Research Site, New York.

Yangarber, R. & Grishman, R. 1998.  NYU: Description of the Proteus/PET System as Used for MUC-7 ST.  Proceedings of MUC-7.

 

[Related]

Pre-Knowledge-Graph Profile Extraction Research via SBIR (1) 2015-10-24
Pre-Knowledge-Graph Profile Extraction Research via SBIR (2) 2015-10-24

朝华午拾: The Bittersweet Experience of Writing Grant Proposals in America - ScienceNet







http://blog.sciencenet.cn/blog-362400-930754.html
