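Sharing "Jingzi Daquan" (《镜子大全》) and "Zhaohua Wushi" (《朝华午拾》) http://blog.sciencenet.cn/u/liwei999 Once a Little Red Guard, later sent down to the countryside to "repair the earth"; left my home country in 1991 and have been wandering ever since.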


Pre-Knowledge-Graph Retrospective: The Architecture of an Information Extraction Engine

2015-11-1 09:43 | Personal category: Liwei's popular science | System category: Research notes | Keywords: information extraction, IE, knowledge graph, architecture, language technology

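[Liwei's note] I have told the story of this million-dollar slide somewhere before. In the years leading up to 2000, during the Clinton administration, the United States went through an Internet "great leap forward", known to history as the .com bubble. Hot money poured in, and Internet startups sprang up like bamboo shoots after rain. Against this backdrop, my boss decided to strike while the iron was hot and go looking for venture capital, and he asked me to prepare an introduction to the language system prototype we had built. So I drew the three-layer NLP architecture diagram below. The bottom layer is the parser, going from shallow to deep. The middle layer is information extraction built on top of parsing. The top layer consists of several major classes of applications, including question answering. A database connects the applications to the two language processing layers beneath them: it stores the results of information extraction, which can then supply intelligence to the applications at any time. Since I proposed this architecture 15 years ago, it has never changed in any major way, although the details and the drawing itself have been revised no fewer than 100 times. The diagram in this post is roughly one of the first 20 versions; it covers only the core engine (the back end) and leaves out the applications (the front end). As the story goes, my boss sent the diagram to an angel investor on Wall Street early one morning, and by noon the investor had replied that he was very interested. In less than two weeks we received our first angel investment check, for one million US dollars. The investor said the diagram was brilliant: "this is a million dollar slide". It showed both the barrier to entry of the technology and its enormous potential.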



2.2.2. System Background: InfoXtract

InfoXtract (Li and Srihari 2003, Srihari et al. 2000) is a domain-independent and domain-portable, intermediate-level IE engine. Figure 4 illustrates the overall architecture of the engine, which is explained in detail below. The outputs of InfoXtract have been designed with information discovery in mind. Specifically, the engine attempts to:

  • Merge information about the same entity into a single profile. While NE provides very local information, an entity profile that consolidates all mentions of an entity in a document is much more useful.

  • Normalize information wherever possible; this includes time and location normalization. Recent work has also focused on mapping key verbs into verb synonym sets that reflect the general meaning of the action word.

  • Extract generic events in a bottom-up fashion, and map them to specific event types in a top-down manner.
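
The normalization goal in the second bullet can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the synonym-set table `VERB_SYNSETS`, its labels, and the handful of relative time expressions are hypothetical and are not taken from InfoXtract itself.

```python
from datetime import date, timedelta

# Hypothetical verb synonym sets; the real InfoXtract resources are not shown in the source.
VERB_SYNSETS = {
    "buy": "ACQUIRE", "purchase": "ACQUIRE", "acquire": "ACQUIRE",
    "launch": "RELEASE", "release": "RELEASE", "unveil": "RELEASE",
}


def normalize_verb(verb: str) -> str:
    """Map a verb lemma to its synonym-set label, or uppercase it if unknown."""
    return VERB_SYNSETS.get(verb.lower(), verb.upper())


def normalize_time(expr: str, anchor: date) -> str:
    """Resolve a few relative time expressions against the document date."""
    offsets = {"today": 0, "yesterday": -1, "tomorrow": 1}
    key = expr.strip().lower()
    if key in offsets:
        return (anchor + timedelta(days=offsets[key])).isoformat()
    return expr  # leave anything unrecognised untouched
```

With a normalized action and an absolute date, two differently worded reports of the same event become directly comparable, which is the point of normalizing before merging.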


Figure 4.  InfoXtract Engine Architecture

 

A description of the increasingly sophisticated IE outputs from the InfoXtract engine is given below:

 

·        NE:  Named Entity objects represent key items such as proper names (person, organization, product, location, target), contact information (address, email, phone number, URL), time expressions (date, year) and numerical expressions, including various measurements such as weight, money and percentage.

·        CE:  Correlated Entity objects capture relationship mentions between entities such as the affiliation relationship between a person and his employer. The results will be consolidated into the information object Entity Profile (EP) based on co-reference and alias support.

·        EP:  Entity Profiles are complex rich information objects that collect entity-centric information, in particular, all the CE relationships that a given entity is involved in and all the events this entity is involved in. This is achieved through document-internal fusion and cross-document fusion of related information based on support from co-reference, including alias association. Work is in progress to enhance the fusion by correlating the extracted information with information in a user-provided existing database.

 

·        GE:  General Events are verb-centric information objects representing ‘who did what to whom when and where’ at the logical level. Concept-based GE (CGE) further requires that participants of events be filled by EPs instead of NEs and that other values of the GE slots (the action, time and location) be disambiguated and normalized.

·        PE:  Predefined Events are domain-specific or user-defined events of a specific event type, such as Product Launch and Company Acquisition in the business domain. They represent a simplified version of the MUC Scenario Template (ST) task. InfoXtract provides a toolkit that allows users to define and write their own PEs based on automatically generated PE rule templates.
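
To make the relationship between these output types concrete, here is a rough sketch of how NE, CE and GE objects could be fused into Entity Profiles. The class and field names, and the `alias_of` mapping standing in for co-reference and alias support, are all hypothetical; the source does not describe InfoXtract's internal data model.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class NamedEntity:
    text: str
    ne_type: str  # e.g. "PERSON", "ORGANIZATION"


@dataclass
class CorrelatedEntity:
    relation: str  # e.g. "affiliation"
    subject: str   # mention of the entity the relation is about
    value: str


@dataclass
class GeneralEvent:
    action: str
    who: Optional[str] = None
    whom: Optional[str] = None
    when: Optional[str] = None
    where: Optional[str] = None


@dataclass
class EntityProfile:
    name: str
    aliases: set = field(default_factory=set)
    relations: list = field(default_factory=list)
    events: list = field(default_factory=list)


def build_profiles(mentions, alias_of, relations, events):
    """Fuse mentions, CE relations and GEs into one profile per entity.

    alias_of maps a surface mention to its canonical name; here it is
    simply assumed to be the output of co-reference and alias association.
    """
    profiles = {}

    def profile_for(mention):
        canonical = alias_of.get(mention, mention)
        profile = profiles.setdefault(canonical, EntityProfile(canonical))
        if mention != canonical:
            profile.aliases.add(mention)
        return profile

    for ne in mentions:
        profile_for(ne.text)
    for ce in relations:
        profile_for(ce.subject).relations.append(ce)
    for ge in events:
        for participant in (ge.who, ge.whom):
            if participant is not None:
                profile_for(participant).events.append(ge)
    return profiles
```

In this sketch a CGE would differ from a GE only in that its `who` and `whom` slots point at `EntityProfile` objects rather than raw strings.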

The linguistic modules serve as the underlying support system for the different levels of IE.  This support system involves almost all major linguistic areas:  orthography, morphology, syntax, semantics, discourse and pragmatics.  A brief description of the linguistic modules is given below.

·        Preprocessing:  This component handles file format conversion, text zoning and tokenization.  The task of text zoning is to identify and distinguish metadata such as title, author, etc., from normal running text.  The task of tokenization is to convert the incoming linear string of characters from the running text into a token list; this forms the basis for subsequent linguistic processing.

·        Word Analysis:  This component includes word-level orthographical analysis (capitalization, symbol combination, etc.) and morphological analysis such as stemming.  It also includes part-of-speech (POS) tagging, which distinguishes, e.g., a noun from a verb based on contextual clues.  An optional HMM-based Case Restoration module is called when performing case-insensitive QA (Li et al. 2003a).

·        Phrase Analysis:  This component, also called shallow parsing, undertakes basic syntactic analysis and establishes simple, un-embedded linguistic structures such as basic noun phrases (NP), verb groups (VG), and basic prepositional phrases (PP). This is a key linguistic module, providing the building blocks for subsequent dependency linkages between phrases.

·        Sentence Analysis:  This component, also called deep parsing, decodes underlying dependency trees that embody logical relationships such as V-S (verb-subject), V-O (verb-object), H-M (head-modifier).  The InfoXtract deep parser transforms various patterns, such as active patterns and passive patterns, into the same logical form, with the argument structure at its core.  This involves a considerable amount of semantic analysis.  The decoded structures are crucial for supporting structure-based grammar development and/or structure-based machine learning for relationship and event extraction.

·        Discourse Analysis:  This component studies the structure across sentence boundaries. One key task for discourse analysis is to decode the co-reference (CO) links of pronouns (he, she, it, etc.) and other anaphors (this company, that lady) with the antecedent named entities.  A special type of CO task is ‘Alias Association’, which links International Business Machines with IBM and Bill Clinton with William Clinton.  The results support information merging and consolidation for profiles and events.

·        Pragmatic Analysis:  This component distinguishes important, relevant information from unimportant, irrelevant information based on lexical resources, structural patterns and contextual clues.
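
The staged design described above can be sketched as a chain of functions that each enrich a shared document record. This is only a toy under stated assumptions: the stage names, the `Title:`/`Author:` metadata convention, the one-sentence-per-line input and the two regular expressions are invented for illustration. The real deep parser works on dependency structures, not on surface patterns like these.

```python
import re

META = re.compile(r"^(Title|Author):\s*(.+)$")
PASSIVE = re.compile(r"^(\w+) was (\w+) by (\w+)\.?$")
ACTIVE = re.compile(r"^(\w+) (\w+) (\w+)\.?$")


def preprocess(doc):
    """Text zoning (metadata vs. running text) followed by tokenization."""
    meta, body = {}, []
    for line in doc["raw"].splitlines():
        line = line.strip()
        m = META.match(line)
        if m:
            meta[m.group(1).lower()] = m.group(2)
        elif line:
            body.append(line)
    doc["meta"] = meta
    doc["sentences"] = body  # toy assumption: one sentence per line
    doc["tokens"] = [re.findall(r"\w+|[^\w\s]", s) for s in body]
    return doc


def logical_form(sentence):
    """Map a simple active or passive clause to the same V-S / V-O form."""
    m = PASSIVE.match(sentence)
    if m:
        obj, verb, subj = m.groups()
    else:
        m = ACTIVE.match(sentence)
        if m is None:
            return None
        subj, verb, obj = m.groups()
    return {"V": verb, "S": subj, "O": obj}


def sentence_analysis(doc):
    """Attach a logical form for every sentence the toy patterns cover."""
    forms = [logical_form(s) for s in doc["sentences"]]
    doc["logical_forms"] = [f for f in forms if f is not None]
    return doc


# Stages run in order; later stages rely on what earlier ones stored.
PIPELINE = [preprocess, sentence_analysis]


def run(raw):
    doc = {"raw": raw}
    for stage in PIPELINE:
        doc = stage(doc)
    return doc
```

Running `run("Title: Deal news\nIBM acquired Lotus.\nLotus was acquired by IBM.\n")` yields the same logical form for both sentences, which is what lets a single extraction rule downstream cover both the active and the passive wording.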

 

Lexical Resources

The InfoXtract engine uses various lexical resources, including the following:

  • General English dictionaries available in electronic form, providing the basis for syntactic information.  The Oxford Advanced Learner’s Dictionary (OALD) is used extensively.

  • Specialized glossaries for people names, location names, organization names, products, etc.

  • Specialized semantic dictionaries reflecting words that denote person, organization, etc.  For example, doctor corresponds to person, church corresponds to organization.  This is especially useful in QA.  Both WordNet and custom thesauri are used in InfoXtract.

  • Statistical language models for Named Entity tagging (retrainable for new domains).

InfoXtract exploits a large number of lexical resources. Separating lexicon modules from grammars has three advantages: (i) high speed due to indexing-based lookup; (ii) sharing of lexical resources by multiple grammar modules; (iii) convenience in managing grammars and lexicons.  InfoXtract uses two approaches to disambiguate lexical entries. The first is a traditional feature-based grammatical/machine learning approach, where semantic features are assigned to lexical entries that are subsequently used by the grammatical modules. The second approach involves expert lexicons, which are discussed in the next section.
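
The separation of lexicon from grammar can be pictured with a small sketch: one indexed lexicon object holding semantic features, consulted by two independent grammar-like functions. The `Lexicon` class and both functions are hypothetical; only the doctor/person and church/organization pairings come from the text above.

```python
class Lexicon:
    """Hash-indexed lexicon mapping each word to a set of semantic features."""

    def __init__(self, entries):
        self._index = {w.lower(): frozenset(f) for w, f in entries.items()}

    def features(self, word):
        # Dictionary lookup: cost does not grow with the size of any grammar.
        return self._index.get(word.lower(), frozenset())


# The two entries mirror the examples given in the text.
LEXICON = Lexicon({"doctor": {"person"}, "church": {"organization"}})


def tag_semantic_class(tokens, lexicon):
    """Grammar module 1: annotate every token with its semantic features."""
    return [(tok, sorted(lexicon.features(tok))) for tok in tokens]


def expected_answer_type(question_tokens, lexicon):
    """Grammar module 2 (QA): class of the noun following 'which', if known."""
    for prev, tok in zip(question_tokens, question_tokens[1:]):
        if prev.lower() == "which":
            feats = lexicon.features(tok)
            if feats:
                return sorted(feats)[0]
    return None
```

Both functions read the same `LEXICON` instance, so adding an entry updates tagging and question analysis at once without touching either rule set.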




Intermediate-Level Event Extraction for Temporal and Spatial Analysis and Visualization

(SBIR Phase 2) 

Wei Li, Ph.D., Principal Investigator

Rohini K. Srihari, Ph.D., Co-Principal

Contract No. F30602-01-C-0035

September 2003


[Related]

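Zhaohua Wushi (《朝华午拾》): The Road to a Startup

Pre-Knowledge-Graph Retrospective: The Theory of Information Objects 2015-10-31

Pre-Knowledge-Graph Retrospective: Defining Information Extraction Tasks from Shallow to Deep 2015-10-30

Pre-Knowledge-Graph Retrospective: On Event Extraction 2015-10-30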

SVO as General Events 2015-10-25

Pre-Knowledge-Graph Profile Extraction Research via SBIR 2015-10-24

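Forerunners of the Knowledge Graph: Starting from Julian Hill 2015-10-24

Zhaohua Wushi (《朝华午拾》): The Ups and Downs of Writing Grant Proposals in the United States - ScienceNet

[Pinned: Overview of NLP Posts on Liwei's ScienceNet Blog (periodically updated)]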






http://blog.sciencenet.cn/blog-362400-932462.html
