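Sharing 《镜子大全》 and 《朝华午拾》 http://blog.sciencenet.cn/u/liwei999 Once a Little Red Guard, later sent down to the countryside to "repair the earth"; left the homeland in 1991, with no fixed course since.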

Before the Knowledge Graph, a Retrospective: The Theory of Information Objects

2015-10-31 00:45 | Personal category: Liwei's popular science | System category: Paper exchange | Keywords: Information Extraction, knowledge graph

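[Liwei's note] These writings from more than a decade ago are just sitting idle, so with knowledge graphs now in vogue they may still offer some inspiration. Back then, besides thinking a great deal about conceptual design, I also made sure every idea was put into practice, or at least validated by prototyping. Being more abstract-minded than an engineer and more hands-on than an armchair theorist, I was able to explore this IE graph from end to end. Having explored it to the end, I moved on to public opinion mining over big data, which followed quite naturally.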


Hierarchy of Information Objects

In parallel to the research on IE task definition, Cymfony has established a representation theory of hierarchical information objects.  This information object hierarchy serves as the basis for the IE system to model the real world objects which the extracted information stands for.  The hierarchy of information objects is shown in Figure 4.

Figure 4.  Hierarchy of Information Objects

 

Information objects (InfoObject) are data structures which an IE system uses to represent various types of extracted information.  In Cymfony’s definition, all information objects are divided into two major categories: real information objects (RealObject) and virtual information objects (VirtualObject).   

Real Information Objects

Real information objects are characterized by a direct one-to-one correspondence between the object and its physical location.  For example, a token or token sequence (RealToken) such as a name (NE), or an ordinary word or word sequence (BaseXP[1]) such as a BaseNP (e.g. a beautiful girl), can always be uniquely identified by its physical location in the document, namely a pair of character offsets (begin offset and end offset).  In contrast, virtual information objects are data structures which cannot be directly linked to a unique physical location.  A virtual object stands for some type of consolidated information produced by information filtering, merging, linking, and/or inferring (ArtificialToken, for instance, can represent inferred information).  It often involves assembled information which is logically related but whose sources may be scattered across various places in a document or document archive.
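The one-to-one correspondence between a real object and its physical location can be sketched as follows. This is only an illustration: the class name follows the text, but the field names, the half-open offset convention and the helper `locate` are assumptions, not part of the original design.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RealToken:
    """A real information object, identified by its character span."""
    begin: int  # begin offset into the document (inclusive)
    end: int    # end offset (exclusive; this convention is an assumption)
    kind: str   # e.g. "NE" or "BaseNP"

    def text(self, document: str) -> str:
        """Recover the surface string from the physical location."""
        return document[self.begin:self.end]


def locate(document: str, phrase: str, kind: str) -> RealToken:
    """Hypothetical helper: build a RealToken for the first occurrence of `phrase`."""
    begin = document.index(phrase)
    return RealToken(begin, begin + len(phrase), kind)


doc = "He met a beautiful girl at Cymfony."
np_token = locate(doc, "a beautiful girl", "BaseNP")
ne_token = locate(doc, "Cymfony", "NE")
print(np_token, "->", np_token.text(doc))
print(ne_token, "->", ne_token.text(doc))
```

Because the offsets alone identify the token, two RealToken values with the same span and type compare equal, which is what makes them usable as stable anchors for link objects.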

Real information objects are the basis for the formation of virtual information objects.  More precisely, real information objects are the ‘first-level’ objects representing the keyword-based information directly extracted from the text, while virtual information objects are the ‘second-level’ derived objects representing information consolidated from the real information objects.  Examples will be given shortly to illustrate this point.

In terms of granularity, real information objects fall into four levels:  (i) the finest-level object is RealToken (NE, BaseXP, etc.), which sits at the word or phrase level;  (ii) the next level, at the sentence/clause level, consists of local CE and local GE objects, which link RealTokens within sentence boundaries;  (iii) SnippetObject sits at the paragraph level;  (iv) URL_Object, which represents a document, is the most coarse-grained.  This granularity distinction provides a foundation for the back-off model proposed by Cymfony for multi-level question answering based on multi-level information extraction [Srihari & Li 1999b, 2000b] [Li & Srihari 2000b].

It is noteworthy that the results of discourse processing modules Alias Association and Co-reference are represented by the discourse link object (DiscourseLink, i.e. AliasLink and CorefLink).  The discourse link object links two real tokens within a discourse to show that these inter-related tokens stand for the same entity.  For example, in a test on the MUC-7 dryrun data, the Alias Association Module in InfoXtract has produced the following linkage for AliasLink:

…………

(100) AliasLink: [(21346) Du Pont Co.] → [(21347) Du Pont]

(101) AliasLink: [(21347) Du Pont] → [(21349) Du Pont]

(102) AliasLink: [(21349) (1161) Du (1162) Pont] → [(21358) (1663) Du (1664) Pont]

(103) AliasLink: [(21357) Massachusetts Institute of Technology] → [(21691) MIT]

………

(367) AliasLink: [(21344) Julian Hill] → [(21354) Julian Werner Hill]

(368) AliasLink: [(21344) Julian Hill] → [(1626) Hill]

………

Here, the alias relation 100 links the real token no. 21346, Du Pont Co., with its variant, the real token no. 21347, Du Pont.  The alias relations 101 and 102 further link three instances of Du Pont.  Such linkage, together with CorefLink, provides the necessary support for merging locally extracted CE, GE and PE information into discourse-level virtual objects.  DiscourseLink belongs to the type LinkObject.  In general, the data structure for LinkObject can be defined as a triplet: (Relation: RealObject1 → RealObject2).
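To see how chained link triplets can support merging, the sketch below groups the token IDs from the sample AliasLink output into clusters of tokens that stand for the same entity. Union-find is used here purely for illustration; the text does not say how InfoXtract implements this step, and the function name `cluster_links` is hypothetical.

```python
def cluster_links(links):
    """Group token IDs connected by (relation, source, target) triplets."""
    parent = {}

    def find(x):
        # Follow parent pointers to the cluster representative.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for _relation, source, target in links:
        root_s, root_t = find(source), find(target)
        if root_s != root_t:
            parent[root_t] = root_s

    clusters = {}
    for token_id in list(parent):
        clusters.setdefault(find(token_id), set()).add(token_id)
    return sorted(sorted(members) for members in clusters.values())


# Token IDs taken from the sample AliasLink listing in the text.
alias_links = [
    ("AliasLink", 21346, 21347),  # Du Pont Co. -> Du Pont
    ("AliasLink", 21347, 21349),
    ("AliasLink", 21349, 21358),
    ("AliasLink", 21357, 21691),  # Massachusetts Institute of Technology -> MIT
    ("AliasLink", 21344, 21354),  # Julian Hill -> Julian Werner Hill
    ("AliasLink", 21344, 1626),   # Julian Hill -> Hill
]

for cluster in cluster_links(alias_links):
    print(cluster)
```

Each resulting cluster is a candidate for one discourse-level entity, so that locally extracted information attached to any member token can be merged into the same virtual object.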

The other type of LinkObject is LocalLink: unlike DiscourseLink, a LocalLink cannot link tokens across sentence boundaries.  For example, given the incoming sentence "He works for Cymfony as a software engineer", a Local CE object CeAffiliation: He → Cymfony should be extracted to link the real token He and the NE Cymfony.

Local PE and Local GE objects are designed to capture event instances expressed locally within a sentence.  A sample set of Local PE and Local GE objects is illustrated below.

           Input:   IBM appointed John Smith as new CEO.  Peter Lee stepped down from this position due to his health problem.

 

           Local PE output:

           EC_company:appointed → IBM

           EC_person-in:appointed → John Smith

           EC_position:appointed → CEO

 

           EC_person-out:stepped_down → Peter Lee

           EC_position:stepped_down → this position

           EC_person-out-reason:stepped_down → his health problem

 

           Local GE output:

           GE_who:appointed → IBM

           GE_whom-what:appointed → John Smith

           GE_complement:appointed → CEO

 

           GE_who:stepped_down → Peter Lee

           GE_complement:stepped_down → from this position

           GE_why:stepped_down → his health problem

In both cases, the keyverb appointed/stepped_down is used as an anchoring point for the predefined event EC (Executive Change) and for the general event on who did what.[2]  In fact, these locally extracted objects form a local dependency tree as the semantic representation of the processed sentence, as illustrated in Figure 5 for the first Local PE.  

 

Figure 5.  Sample Dependency Tree for Local PE
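The local dependency tree described above can be approximated in code by grouping the local links under their anchoring keyverb. This is a minimal sketch: the (role, keyverb, filler) tuple layout and the function name `build_local_trees` are assumptions, and the fillers are plain strings rather than real tokens with offsets.

```python
from collections import defaultdict


def build_local_trees(links):
    """Group (role, keyverb, filler) links into one shallow tree per keyverb."""
    trees = defaultdict(dict)
    for role, keyverb, filler in links:
        trees[keyverb][role] = filler
    return dict(trees)


# Local PE links for the sample input sentences in the text.
local_pe = [
    ("EC_company", "appointed", "IBM"),
    ("EC_person-in", "appointed", "John Smith"),
    ("EC_position", "appointed", "CEO"),
    ("EC_person-out", "stepped_down", "Peter Lee"),
    ("EC_position", "stepped_down", "this position"),
    ("EC_person-out-reason", "stepped_down", "his health problem"),
]

trees = build_local_trees(local_pe)
for keyverb, children in trees.items():
    print(keyverb)
    for role, filler in children.items():
        print(f"  {role}: {filler}")
```

Each keyverb becomes the root of one tree and each role label becomes an edge to a filler, which keeps the two sentences as two separate local structures until discourse-level merging.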

 

Virtual Information Objects

There is a logical link and correspondence between real information objects and virtual information objects (VirtualObject).  The three major categories of virtual objects, namely artificial tokens (ArtificialToken), scenario objects (ScenarioObject) and AVM (Attribute Value Matrix) objects (AVM_Object), correspond to the four major categories of real objects.  Roughly speaking, the artificial token is an extension of the real token, transforming a keyword-based representation of information into a concept-based one.[3]  At a higher level, the scenario object (ScenarioObject) captures the key content of the events expressed in a snippet (SnippetObject) or document (URL_Object).  Finally, the AVM object (AVM_Object) represents the result of information fusion using link objects (LinkObject).[4]

The AVM objects represent the two most important kinds of objects for an IE system to capture, namely profiles and events.  This marks significant progress over conventional IE systems, which can only extract real objects, e.g. NE, local relationships such as TE and TR, or simple events (SE).

The profile object is represented by an AVM data structure used to model an entity in the real world.  It provides information on a variety of pre-defined key aspects (attributes) of the entity being modeled.  Depending on the major type of entity it models, it is sub-typed into person profile (PersonProfile), organization profile (OrgProfile), product profile (ProductProfile), location profile (LocationProfile) and named-event profile (EventProfile).[5]  More sub-types of profile objects can be defined when entity modeling in a specific domain requires extension.

The event object is represented by an AVM data structure used to model an event happening in the real world.  The two major types are pre-defined events (PE_Object or simply PE) and general events (GE_Object or simply GE).  The former is domain-dependent and may play an important role in a particular IE application.  The latter is largely domain-independent.  As GE captures open-ended events, it can also be regarded as a back-off from PE.

All AVM objects result from a fusion process where locally extracted objects are merged and condensed.  These objects form a discourse or global dependency tree as the semantic representation of the content for the processed text or archive.  For example, when the two Local PE objects as presented before are merged successfully, the corresponding dependency tree for this event is information-enriched as shown below in Figure 6.  


Figure 6.  Sample Dependency Tree for Merged Discourse PE
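As a rough illustration of how merging enriches the event representation, the sketch below combines the two Local PE objects from the earlier Executive Change example into one discourse-level event. It is a deliberately simplified, deterministic merge: the report's own fusion is described as probability-based fuzzy merging, which is not reproduced here, and the `coref` mapping from "this position" to "CEO" is assumed to come from a co-reference module.

```python
def merge_events(first, second, coref):
    """Merge two local event dicts, resolving anaphoric fillers via `coref`."""
    merged = dict(first)
    for role, filler in second.items():
        resolved = coref.get(filler, filler)
        if role in merged and merged[role] != resolved:
            # A real system would score the conflict; this sketch just refuses.
            raise ValueError(f"conflicting values for {role!r}")
        merged[role] = resolved
    return merged


appointed = {"company": "IBM", "person-in": "John Smith", "position": "CEO"}
stepped_down = {
    "person-out": "Peter Lee",
    "position": "this position",
    "person-out-reason": "his health problem",
}
coref = {"this position": "CEO"}  # assumed output of co-reference resolution

executive_change = merge_events(appointed, stepped_down, coref)
for role, filler in executive_change.items():
    print(f"{role}: {filler}")
```

The merged object carries both the incoming and the outgoing person for the same position, information that neither local object holds on its own.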

In Cymfony’s design, the data structures for all AVM objects take the following form:

                       <AVM_type ID> =

                       A1: V1

                       A2: V2

                       ……

                       An: Vn

Accordingly, defining a given AVM object involves specifying:  (i) the type of the object; (ii) the finite set of attributes; (iii) the appropriate values for each attribute.  For example, the definition of the PersonProfile AVM is given below.

                       <PersonProfile ID>=

                       name:               real token (of the type NePerson)

                       aliases:             real token (of the type NePerson)

                       gender:             artificial token: MALE | FEMALE

                       age:                  artificial token (from number normalization)

                       birth-date:        artificial token (from time normalization)

                       birth-place:      location profile (from location normalization)

                       affiliation:        organization profile

                       position:           real token

                       descriptors:      real token

Objects defined as appropriate values to fill an attribute slot are of three major types: (i) real tokens; (ii) AVM object IDs;[6] (iii) artificial tokens.[7]  In a completely concept-based system, there is little room for real tokens.  In Cymfony’s proposed effort to gradually transform keyword-based IE into concept-based IE, all three types of value are employed, with greatly reduced dependence on real tokens.  For example, for an event object, the attributes accommodating the participants and location of an event take profile objects as appropriate values; the attribute for time requires an artificial token which results from the time normalization process.  Meanwhile, the value for attributes such as ‘descriptors’ and ‘position’ is still a real token, typically an NP (noun phrase); there is no plan to disambiguate or normalize this type of open-ended expression.[8]
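The three kinds of slot values can be made concrete with a small typed sketch. The class names RealToken and ArtificialToken follow the text; `ObjectRef`, `AVM`, the `fill` method and the character offsets are hypothetical choices made for this illustration only.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Union


@dataclass(frozen=True)
class RealToken:
    text: str
    begin: int  # offsets are placeholders in this sketch
    end: int


@dataclass(frozen=True)
class ArtificialToken:
    symbol: str  # system-internal symbol, e.g. "MALE"


@dataclass(frozen=True)
class ObjectRef:
    avm_type: str   # e.g. "OrgProfile"
    object_id: int  # system-internal unique ID


Value = Union[RealToken, ArtificialToken, ObjectRef]


@dataclass
class AVM:
    avm_type: str
    object_id: int
    slots: Dict[str, List[Value]] = field(default_factory=dict)

    def fill(self, attribute: str, value: Value) -> None:
        """Append a value to an attribute slot (slots may be multi-valued)."""
        self.slots.setdefault(attribute, []).append(value)


profile = AVM("PersonProfile", 1)
profile.fill("position", RealToken("research chemist", 120, 136))
profile.fill("gender", ArtificialToken("MALE"))
profile.fill("affiliation", ObjectRef("OrgProfile", 100))

for attribute, values in profile.slots.items():
    print(attribute, values)
```

Keeping the three value kinds as distinct types makes it easy to measure how much of a representation still depends on real tokens, which is the dependence the text aims to reduce.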

The use of AVM object IDs as values provides a way of linking related virtual objects.[9]  A profile can be linked to another profile to represent a CE relationship between two entities.  For example, an organization profile ID can fill the ‘affiliation’ slot of a person profile, whose ID in turn serves as a value for the ‘staff’ slot of the organization.  As mentioned before, related events can be linked to each other by filling the ‘preceding-event’ or ‘subsequent-event’ slot with an event ID.  Finally, an event is often linked to profiles by filling event participant slots with profile IDs, such as ‘person-in’ and ‘person-out’ for the predefined event Executive Change, and ‘who’ and ‘whom-what’ for the general event.  At the same time, the reverse linkage from the profiles to the event is realized by filling the event ID into a special attribute ‘involved-events’, as shown in the sample profile below.

<PersonProfile001> ::

name:                           Julian Werner Hill

aliases:                        Julian Werner; Julian Hill; Hill  

position:                       research chemist

age:                             91

gender:                        MALE

birth-place:                 <St. Louis: LocationProfile 300>

affiliation:                    <Du Pont Co.: OrgProfile 100>

education:                    <Washington University: OrgProfile 101>;

                                   <Massachusetts Institute of Technology: OrgProfile 102>

spouse:                        <Polly: PersonProfile 002>

descriptors:                  an accomplished squash player and figure-skater

involved-events:           <die: GE_Object 200>;  

                                   <discover:  GE_Object 201>;

                                   <graduate:  GE_Object 202>;

The above profile AVM illustrates a number of points worth reviewing.  First, the attribute involved-events links this profile with the events it is involved in, e.g. he died on Sunday in Hockessin, Del.; he discovered nylon in the 1930s; he graduated from Washington University in 1924 in St. Louis; etc.  Second, each virtual object is assigned a unique ID and type to facilitate the linkage between inter-related objects:  in this case, PersonProfile is the type and 001 is the unique ID for the entity named Julian Werner Hill.  Third, the related entities in angle brackets as values of certain attributes reflect the design principle that the correlation between entities is not simply a linkage between names, but a linkage between profiles.[10]  Fourth, some attributes are designed to allow only a pre-defined set of artificial tokens as values:  for example, the attribute gender has two pre-defined values, MALE and FEMALE, one of which should be filled in by the system.[11]
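The two-way linkage between events and profiles can be sketched with plain dictionaries keyed by object ID. The ID strings mirror the sample profile; the `store` dictionary and the helper `link_event_to_profile` are hypothetical names introduced only for this example.

```python
def link_event_to_profile(event, role, profile):
    """Fill the profile ID into an event slot and the event ID into the profile."""
    event.setdefault(role, []).append(profile["id"])
    profile.setdefault("involved-events", []).append(event["id"])


hill = {"id": "PersonProfile001", "name": "Julian Werner Hill"}
discover = {"id": "GE_Object201", "keyverb": "discover"}
graduate = {"id": "GE_Object202", "keyverb": "graduate"}

link_event_to_profile(discover, "who", hill)
link_event_to_profile(graduate, "who", hill)

# A simple ID-indexed store lets either side of a link be followed.
store = {obj["id"]: obj for obj in (hill, discover, graduate)}

for event_id in hill["involved-events"]:
    event = store[event_id]
    actor = store[event["who"][0]]
    print(actor["name"], "->", event["keyverb"])
```

Storing only IDs in the slots keeps each object small and lets a browser resolve a link in either direction with a single lookup.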

 

The artificial token, as a concept carrier, is an important means of supporting concept-based IE.  It is in fact a distinctive feature of concept-based IE, since there is no role for an artificial token to play in the conventional keyword-based IE representation.  In the present intermediate-level IE project, normalized forms for time NEs are artificial tokens designed to fill the ‘when’ slot of a C-GE object.  When sense disambiguation for verbs is developed, the disambiguated verbs can also take the form of artificial tokens to fill the ‘keyverb’ slot for C-GE.

 

Artificial tokens are system-internally defined symbols whose sole purpose is to serve as values in concept-based IE objects such as C-GE.  They represent unambiguous information resulting from the processing of real tokens, e.g. time normalization, sense tagging, etc.  Therefore, they can all be associated with some real tokens.[12]  Some artificial tokens correspond directly to real tokens themselves, e.g. the artificial token for a normalized time vs. the original time NE.[13]  Other artificial tokens correspond to specific features of real tokens, e.g. the artificial tokens MALE and FEMALE vs. the subtype feature of a person NE.[14]
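Both kinds of correspondence can be sketched as simple lookups, using the examples given in footnotes [11] and [13]: real tokens such as increased and up map to the artificial token UP, and the NE subtype tags NeMan and NeWoman map to MALE and FEMALE. The table and function names are hypothetical, and a real system would derive these values through normalization and tagging modules rather than fixed tables.

```python
from typing import Optional

# Real token -> artificial token for the 'direction' slot (footnote [13]).
DIRECTION_BY_TOKEN = {"increased": "UP", "up": "UP"}

# NE subtype tag -> artificial token for the 'gender' slot (footnote [11]).
GENDER_BY_SUBTYPE = {"NeMan": "MALE", "NeWoman": "FEMALE"}


def normalize_direction(real_token: str) -> Optional[str]:
    """Map a surface word to its artificial token, or None if unknown."""
    return DIRECTION_BY_TOKEN.get(real_token.lower())


def gender_from_subtype(ne_subtype: str) -> Optional[str]:
    """Infer the gender artificial token from a person NE subtype tag."""
    return GENDER_BY_SUBTYPE.get(ne_subtype)


print(normalize_direction("Increased"))
print(normalize_direction("up"))
print(gender_from_subtype("NeWoman"))
print(gender_from_subtype("NePerson"))
```

Returning None for unmapped input keeps the slot empty rather than guessing, so the representation stays unambiguous.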

 

Finally, the virtual object ‘scenario’ (ScenarioObject), as an object for deep IE, requires in-depth study, which is left to future research.  The goal of this object is to organize related event objects into some discourse structure, providing a way of modeling the key content of a ‘story’ presented in a document.[15]  The structure of a scenario object may take the form of an AVM,[16] with event objects filling certain attribute slots.  It is also conceivable that the scenario object be defined as a list or set of event objects.

 

To summarize, the proposal of virtual information objects is a significant development in IE research.  It provides a theoretical basis for advancing IE research to the next level, i.e. concept-based IE.  Virtual information objects derive from real information objects, but are more consolidated and less ambiguous than real information objects.  In fact, the capability of extracting virtual information objects can be regarded as a key feature which distinguishes the next-generation IE system from the conventional shallow-level IE system.


[1]  In InfoXtract, BaseXP stands for basic X-phrase as a result of shallow parsing.  It includes BaseNP (basic noun phrase),  BaseAP (basic adjective phrase), BasePP (basic prepositional phrase) and VG (verb group).

[2] Of course, in a real-life IE system, a GE extracted from a sentence which also yields a PE will be filtered out before merging and output to storage.  This avoids unnecessary information redundancy, as the PE gives a more specific semantic representation of an event than its corresponding GE.

[3]  In fact, ‘keyword’ in so-called keyword-based representation stands for real tokens, which are by nature subject to ambiguity (a word may have numerous senses).  ‘Concept’ in concept-based representation refers to virtual objects, including artificial tokens, used in the IE representation.  The use of system-internally defined symbols, or ‘artificial tokens’, to replace real tokens and thereby eliminate ambiguity in the representation is one approach Cymfony advocates towards the goal of C-GE for intermediate-level IE.  Concrete proposals and examples will be discussed shortly.

[4] There has been significant research on AVM-style information representation under the topics of typed feature structures and unification formalisms [Pollard & Sag 1987] [Shieber 1986] [Carpenter 1992].  The AVM representation proposed in this report differs from the above in that a probability-based fuzzy merging operation has been designed to replace (and to simulate the effect of) the unification operation.  Despite the theoretical significance of unification-based formalisms, it is felt that fuzzy merging handles information fusion from natural language data more efficiently and effectively.

[5] The named event entity, which is designed in InfoXtract to capture proper names of historical events (e.g. World War II) or regularly organized events such as conferences and exhibitions, is a good candidate for profile modeling.  So the AVM sub-type EventProfile is defined in contrast to the other type of virtual object, EventObject: the former anchors on a proper name of the event while the latter centers around a verb concept.

[6] ID is a system-internal unique identification number for an information object.  It is used to distinguish one object from the other objects.

[7] In fact, MUC employs AVM in the definition of TR and ST:  some slots require TE AVM (corresponding to Cymfony profile object) as appropriate values. But the keyword-based nature of the representation is not changed as only real tokens are legitimate values to fill non-AVM slots.   No artificial tokens, not even morphological canonical forms, are defined for representation.

[8] It seems a tangible task to normalize the position tokens in time, as they form a closed set.  But the values for ‘descriptors’ are open-ended and are not easy to disambiguate and normalize.  In future, open-ended real token values will go through a sense tagging process to map them to concepts or concept clusters.  For the current proposal of C-GE, the first sense tagging task to accomplish concerns the real tokens for the attribute ‘keyverb’ in GE.

[9] Such linkage can be visually understood as hyperlinks to the related objects:  in browsing and navigation applications, the linkage between profiles/events should actually be implemented as automatic hyperlinks between information objects to provide the functionality of ‘threaded-browsing’.  

[10]  Linkage between names (more precisely, between real tokens) is a characteristic of the local link object.  For virtual objects, conceptually, it is sufficient to link them via their IDs.  In practice, however, it is convenient to link them through both the ID (placed after the colon in the notation) and the real tokens (placed before the colon); the latter serve as an evidence link to the source from which the information was extracted.

[11] In InfoXtract, this artificial token value is inferred from the NE subtyping tags NeMan and NeWoman which both belong to the type NePerson.

[12] It is conceivable that once inferencing is introduced into the system, artificial tokens can be created from other artificial tokens, and the association with real tokens becomes indirect.  For example, from the artificial token for the normalized time in the ‘birth-date’ slot, another artificial token may be generated to fill the ‘age’ slot.

[13] Another example is to normalize real tokens such as increased, up, etc. into the artificial token UP to fill the slot ‘direction’ for pre-defined events like Stock Change.

[14] Other examples involve the design of a list of attributes to capture various important aspects of verb-centric information for event objects.  For instance, we can define a set of artificial tokens like FACT, REQUEST, QUESTION as appropriate values for the attribute ‘mode’ and a set of artificial tokens PAST, PRESENT, FUTURE as appropriate values for the attribute ‘aspect’.  Binary values PLUS and MINUS can be defined as artificial tokens to fill attribute slots such as ‘negation’ so that a negative fact can also be captured if needed.  In order to extract such information, VG (Verb Group) analysis via shallow parsing provides features which can be mapped to this type of artificial token.

[15] It is expected that the research on scenario extraction will lead to a technology breakthrough in one special area of IE application, i.e. automatic summarization.  In theory, a summary of key content can be generated from the semantic representation embodied in a scenario.

[16]  In fact, the MUC Scenario Template is defined as an elaborate AVM structure.  If defined as AVM, scenario object perhaps should be regarded as a sub-type of AVM_Object in the hierarchy of information objects.


REFERENCES

Aone, A. & M. Ramos-Santacruz 2000.  REES: A Large-Scale Relation and Event Extraction System.  Proceedings of ANLP-NAACL 2000, Seattle.

BBN Technologies, Cambridge, MA & AFRL/IFED, Rome, NY. June 2001. Information Extraction (IE) Technology for Counterdrug Applications. http://www.dodcounterdrug.com/Documents.html.

Chinchor, N. & E. Marsh 1998. MUC-7 Information Extraction Task Definition (version 5.1), Proceedings of MUC-7.

Glasgow, B. & A. Mandel et al. 1997.  MITA: An Information Extraction Approach to Analysis of Free-form Text in Life Insurance Policies.  Proceedings of the Ninth Annual Conference on Innovative Applications of Artificial Intelligence, Providence, RI.

Grishman, R. 1997. TIPSTER Architecture Design Document Version 2.3. Technical report, DARPA.

Gross, M. 1994. Constructing Lexicon-grammars. Computational Approaches to the Lexicon, Atkins and Zampolli (eds.), Oxford Univ. Press: 213-263.

Gross, M. 1997. The Construction of Local Grammars. Finite-State Language Processing, E. Roche & Y. Schabes (eds.), Language, Speech, and Communication, Cambridge, MA: MIT Press: 329-354.

Hobbs, J.R. & D. Israel 1994. Principles of Template Design. Proceedings of Human Language Technology Workshop: 177-181, NJ.

Li, W. & R. Srihari 2000a.  A Domain Independent Event Extraction Toolkit, Phase 2 Final Technical Report, USAF, AFMC/Rome.

Li, W. & R. Srihari 2000b.  Flexible Information Extraction Learning Algorithm, Phase 1 Final Technical Report, USAF, AFMC/Rome.

MUC-6, 1995. Proceedings of the Sixth Message Understanding Conference (MUC-6): 363. Morgan Kaufmann Publishers, San Francisco, CA.

MUC-7, 1998. Proceedings of the Seventh Message Understanding Conference (MUC-7), published on the website http://www.muc.saic.com/

Pine, C. 1996.  Statement-of-Work (SOW) for The Intelligence Analyst Associate (IAA) Build 2, USAF, AFMC/Rome.

Riloff, E. 1996. Automatically Generating Extraction Patterns from Untagged Text, Proceedings of the 13th National Conference on Artificial Intelligence (AAAI-96): 1044-1049.

Srihari, R. 1998. A Domain Independent Event Extraction Toolkit, Phase 1 Final Technical Report, USAF, AFMC/Rome.

Srihari, R. & W. Li 2000.  A Question Answering System Supported by Information Extraction.  Proceedings of ANLP 2000, Seattle.

Yakushiji, A., Y. Tateisi, Y. Miyao & J. Tsujii 2001. Event Extraction from Biomedical Papers Using a Full Parser.  Pacific Symposium on Biocomputing 6: 408-419.

 

Abney, S., M. Collins and A. Singhal 2000. Answer Extraction.  Proceedings of ANLP-2000, Seattle.

Agichtein, E. & Gravano, L.  2000. Snowball:  Extracting Relations from Large Plain-Text Collections.  Proceedings of the 5th ACM International Conference on Digital Libraries, San Antonio, TX.

Beckwith, R. et al. 1991. WordNet: A Lexical Database Organized on Psycholinguistic Principles.  Lexicons: Using On-line Resources to build a Lexicon, Uri Zernik, editor, Lawrence Erlbaum, Hillsdale, NJ.

Bikel, D.M. et al. 1997.  Nymble: a High-Performance Learning Name-finder.  Proceedings of the Fifth Conference on ANLP, Morgan Kaufmann Publishers. 194-201.

Brennan, S.E., M.W. Friedman and C.J. Pollard 1987. A centering approach to pronouns. Proceedings of the 25th Annual Meeting of the ACL, 155-162.

Brill, E., 1995. Transformation-based Error-Driven Learning and Natural language Processing: A Case Study in Part-of-Speech Tagging, Computational Linguistics, Vol.21, No.4,  227-253.

Charniak, E. et al. 1993. Equations for Part-of-Speech Tagging. Proceedings of the Eleventh National Conference on Artificial Intelligence. AAAI Press/MIT Press, Menlo Park.

Charniak, E. 1994. Statistical Language Learning, MIT Press, Cambridge, MA.

Chen, S. and J. Goodman. 1998.  An Empirical Study of Smoothing Techniques for Language Modeling. TR-10-98, Harvard Univ.

Chinchor, N. and E. Marsh 1998. MUC-7 Information Extraction Task Definition (version 5.1), Proceedings of MUC-7.

Choueka, Y. 1988. Looking for Needles in a Haystack or Locating Interesting Collocational Expressions in Large Textual Databases. Proceedings of the RIAO Conference on User-Oriented Content-Based Text and Image Handling, Cambridge, MA, 21-24.

Church, K.W., and P. Hanks 1990. Word Association Norms, Mutual Information and Lexicography. Computational Linguistics, Vol. 16, No. 1, 22-29.

Clarke, C.L.A., G.V. Cormack and T.R. Lynam 2001. Exploiting Redundancy in Question Answering. Proceedings of SIGIR’01, New Orleans, LA.

Collins, M. and Y. Singer. 1999. Unsupervised Models for Named Entity Classification.  Proceedings of the 1999 Joint SIGDAT Conference on Empirical Methods in Natural Language Processing and Very Large Corpora.  Association for Computational Linguistics.

Cruse, D.A. 1986. Lexical Semantics. Cambridge University Press.

Cucchiarelli, A. and P. Velardi. 2001. Unsupervised Named Entity Recognition Using Syntactic and Semantic Contextual Evidence. Computational Linguistics, 27(1), 123-131.

Cucerzan, S. and D. Yarowsky. 1999. Language Independent Named Entity Recognition Combining Morphological and Contextual Evidence. Proceedings of the Joint SIGDAT Conference on Empirical Methods in Natural Language Processing and Very Large Corpora, 90-99.

Darroch, J.N. and D. Ratcliff 1972.  Generalized iterative scaling for log-linear models.  The Annals of Mathematical Statistics, 1470-1480.

Fano, R. 1961. Transmission of Information, Cambridge, Mass: MIT Press.

Gale, W., K. Church, and D. Yarowsky. 1992. One Sense Per Discourse. Proceedings of the 4th DARPA Speech and Natural Language Workshop, 233-237.

Haegeman, L. 1991. Introduction to Government and Binding Theory. Cambridge: Blackwell.

Harris, Z.S. 1968. Mathematical Structures of Language. New York: Wiley.

Hobbs, J.R. 1977. Resolving pronoun references. Lingua, 44:311-338

Hovy, E.H., U. Hermjakob, and C.-Y. Lin. 2001. The Use of External Knowledge in Factoid QA.  Proceedings of TREC-10, Gaithersburg, MD, U.S.A.

Jaynes, E.T. 1957. Information Theory and Statistical Mechanics.  Physical Review, 106.

Kehler, A. 1997. Probabilistic Coreference in Information Extraction. Proceedings of the Second Conference on Empirical Methods in Natural Language Processing (EMNLP), 163-173.

Kim, J., I. Kang, and K. Choi. 2002. Unsupervised Named Entity Classification Models and their Ensembles.  Proceedings of the Main Conference, COLING 2002.

Kupiec, J. 1993. MURAX: A Robust Linguistic Approach For Question Answering Using An On-Line Encyclopaedia.  Proceedings of SIGIR-93, Pittsburgh, PA.

Kwok, K. L., L. Grunfeld, N. Dinstl and M. Chan 2001. TREC 2001 Question-Answer, Web and Cross Language Experiments using PIRCS.  Proceedings of TREC-10, Gaithersburg, MD.

Lafferty, J., F. Pereira, and A. McCallum. 2001. Conditional random fields: Probabilistic models for segmenting and labeling sequence data.  International Conference on Machine Learning (ICML'01)

Lappin, S. and H.J. Leass 1994. An Algorithm for Pronominal Anaphora Resolution.  Computational Linguistics, 20(4): 535-561.

Li, H., R. Srihari, C. Niu and W. Li 2002.  Location Normalization for Information Extraction.  COLING 2002, 549-555, Taipei, Taiwan.

Li, H., R. Srihari, C. Niu and W. Li 2003.  InfoXtract Location Normalization: A Hybrid Approach to Geographic References in Information Extraction. HLT-NAACL03 Workshop on the Analysis of Geographic References, Edmonton, Canada.

Li, W. and R. Srihari 2000a. A Domain Independent Event Extraction Toolkit, Phase 2 Final Technical Report, Air Force Research Laboratory, Information Directorate, Rome Research Site, New York.

Li, W. and R. Srihari 2000b. Flexible Information Extraction Learning Algorithm, Phase 1 Final Technical Report, Air Force Research Laboratory, Information Directorate, Rome Research Site, New York.

Li, W., R. Srihari, X. Li, M. Srikanth, X. Zhang and C. Niu 2002. Extracting Exact Answers to Questions Based on Structural Links. Proceedings of Multilingual Summarization and Question Answering (COLING-2002 Workshop), Taipei, Taiwan.

Li, W. and R. Srihari 2003. Intermediate-Level Event Extraction for Temporal and Spatial Analysis and Visualization, Final Technical Report, Air Force Research Laboratory, Information Directorate, Rome Research Site, New York.

Lin, D. 1998a.  Automatic Retrieval and Clustering of Similar Words, Proceedings of COLING-ACL '98, Montreal, 768-773.

Lin, D. 1998b. Extracting Collocations from Text Corpora.  First Workshop on Computational Terminology, Montreal, Canada.

Litkowski, K.C. 1999. Question-Answering Using Semantic Relation Triples. Proceedings of TREC-8, Gaithersburg, MD.

Miller, S. et al., 1998. BBN: Description of the SIFT System as Used for MUC-7.  Proceedings of MUC-7

Ng, V. and C. Cardie.  2002. Improving Machine Learning Approaches to Coreference Resolution.  Proceedings of the 40th Annual Meeting of the ACL, Philadelphia, PA. 104-111.

Niu, C., W. Li, J. Ding, and R.K. Srihari 2003. Orthographic Case Restoration Using Supervised Learning Without Manual Annotation.  Proceedings of the 16th FLAIRS, St. Augustine, FL.

Pasca, M. and S.M. Harabagiu 2001. High Performance Question/Answering. Proceedings of SIGIR 2001. 366-374

Pietra, S.D., V.D. Pietra and J. Lafferty 1997. Inducing Features of Random Fields.  IEEE Transactions on Pattern Analysis and Machine Intelligence, 19(4),  380-393.

Prager, J., D. Radev, E. Brown, A. Coden and V. Samn 1999. The use of predictive annotation for question answering in TREC8. Proceedings of TREC-8, Gaithersburg, MD.

Ratnaparkhi, A. 1998. Maximum Entropy Models for Natural Language Ambiguity Resolution. PhD thesis, Univ. of Pennsylvania.

Resnik, P. 1999. Semantic similarity in a taxonomy: An information based measure and its application to problems of ambiguity in natural language. Journal of Artificial Intelligence Research 11, 95-130.

Riloff, E. and R. Jones 1999.  Learning Dictionaries for Information Extraction by Multi-Level Bootstrapping, Proceedings of the Sixteenth National Conference on Artificial Intelligence (AAAI-99).

Rosenfeld, R. 1994. Adaptive Statistical Language Modeling. PhD thesis, Carnegie Mellon University.

Segal, R. and O. Etzioni. 1994. Learning decision lists using homogeneous rules. Proceedings of the 12th National Conference on Artificial Intelligence, July 1994.

Segond, F. et al. 1997. An Experiment in Semantic Tagging Using Hidden Markov Model Tagging, ACL-EACL Workshop about Lexical Semantic, Madrid  

Shannon, C.E. 1948. A Mathematical Theory of Communication. Bell System Technical Journal, 27.

Smadja, F. 1993. Retrieving Collocations from Text: Xtract, Computational Linguistics, Vol. 19, No.1, 143-177.

Soon, W.M., H.T. Ng and D.C.Y. Lim 2001. A Machine Learning Approach to Coreference Resolution of Noun Phrases. Computational Linguistics, Vol. 27, No. 4, 521-543.

Srihari, R. 1998. A Domain Independent Event Extraction Toolkit, Phase 1 Final Technical Report, Air Force Research Laboratory, Information Directorate, Rome Research Site, New York.

Srihari, R, Niu, C and W. Li. 2000.  A Hybrid Approach for Named Entity and Sub-Type Tagging, Proceedings of ANLP 2000, Seattle.

Srihari, R and W. Li. 2000. A Question Answering System Supported by Information Extraction, Proceedings of ANLP 2000, 166-172. Seattle, WA.

Srihari, R., W. Li, C. Niu and T. Cornell 2003. InfoXtract: A Customizable Intermediate Level Information Extraction Engine. HLT-NAACL03 Workshop on The Software Engineering and Architecture of Language Technology Systems (SEALTS), Edmonton, Canada.

Strube, M. 1998. Never look back: An alternative to centering. Proceedings of COLING-ACL '98, Montreal, 1251-1257.

Tokunaga, T., M. Iwayama and H. Tanaka 1995. Automatic thesaurus construction based on grammatical relations. Proceedings of the International Joint Conference on Artificial Intelligence.

Voorhees, E. 1999. The TREC-8 Question Answering Track Report. Proceedings of TREC-8, Gaithersburg, MD.

Voorhees, E. 2000. Overview of the TREC-9 Question Answering Track.  Proceedings of TREC-9, Gaithersburg, MD.

Yarowsky, D. 1995. Unsupervised Word Sense Disambiguation Rivaling Supervised Methods. Proceedings of the 33rd Annual Meeting of the Association for Computational Linguistics, Cambridge, Massachusetts.


GLOSSARY

CE                               Correlated Entity (an IE task)

CGE                             Concept-based General/Generic Event (an IE task)

CO                               Coreference

CRF                             Conditional Random Field

DCOM                         Distributed Component Object Model

EP                                Entity Profile (an IE task)

GE                               General Event (an IE task)

H-M                             Head-Modifier

HMM                           Hidden Markov Model

IDF                              Inverse Document Frequency

IIS                                Improved Iterative Scaling

IE                                 Information Extraction

InfoXtract                    Cymfony Information Extraction Engine

IR                                Information Retrieval

IsA                               Equivalence Relation between two NPs  

LOC                             LOCATION

MaxEnt                        Maximum Entropy

MI                                Mutual Information

MLE                            Maximum Likelihood Estimator

MRR                            Mean Reciprocal Rank

MUC                            Message Understanding Conference

NE                               Named Entity (an IE task)

NLP                             Natural Language Processing

NP                               Noun Phrase

ORG                            ORGANIZATION

PE                                Pre-defined Event (an IE task)

PER                             PERSON

POS                             Part Of Speech

PP                                Prepositional Phrase

QA                               Question Answering

SEC                             Securities and Exchange Commission

SBIR                            Small Business Innovation Research

SVO                             Subject-Verb-Object

TREC                           Text Retrieval Conference

V-C                              Verb-Complement

V-O                             Verb-Object

V-S                              Verb-Subject


Intermediate-Level Event Extraction for Temporal and Spatial Analysis and Visualization

(SBIR Phase 2) 

Wei Li, Ph.D., Principal Investigator

Rohini K. Srihari, Ph.D., Co-Principal Investigator

Contract No. F30602-01-C-0035

September 2003


[Related]

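Pre-Knowledge-Graph Retrospective: Defining Information Extraction Tasks from Shallow to Deep 2015-10-30

Pre-Knowledge-Graph Retrospective: On Event Extraction 2015-10-30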

SVO as General Events 2015-10-25

Pre-Knowledge-Graph Profile Extraction Research via SBIR 2015-10-24

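Precursors of the Knowledge Graph: Starting from Julian Hill 2015-10-24

朝华午拾: The Ups and Downs of Writing Grant Proposals in the US - ScienceNet

[Pinned: Index of Liwei's NLP Posts on the ScienceNet Blog (regularly updated)]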





http://blog.sciencenet.cn/blog-362400-932270.html
