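Sharing 《镜子大全》 and 《朝华午拾》 http://blog.sciencenet.cn/u/liwei999 Once a Little Red Soldier, later sent down to the countryside to "repair the earth"; left home and country in 1991, destination unknown.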


From the Archives: SVO as General Events

Read 4470 times 2015-10-25 04:50 | Personal category: Liwei's Popular Science (立委科普) | System category: Research Notes | Tags: information extraction, General, extraction, event

Traditional Information Extraction (IE) following the MUC standards mainly targeted domain-specific Scenario Templates (ST) for text mining of event scenarios.  Such templates have several limitations: they are too narrow in scope and too challenging to be practically applicable.  Therefore, in the course of our early IE research via SBIRs, in addition to proposing and implementing entity-centric relationship extraction in Entity Profiles, which are roughly equivalent to the core of the current knowledge graph concept, I proposed, as Principal Investigator, a new event type called General Events (GE) based on SVO (Subject-Verb-Object) parsing.  This idea has also proven to be significant progress in connecting event-centric profiles and entity-centric profiles into a better knowledge graph.  Some of the early ideas and designs reported 15 years ago are re-presented below as a historical review of my career path in this fascinating area.


January 2000

On General Events

GE aims at extracting open-ended general events (level 3 IE) to support intelligent querying for information like who did what (to whom) when and where.  The feasibility study of this task has been conducted, and is reported below.  

 

GE is a more ambitious goal than CE, but it would have an even higher payoff.  As the extracted general events are open-ended, they are expected to support various applications ranging from question answering and data visualization to automatic summarization, hence maximally satisfying the information needs of the human agent.  

 

GE extraction is a sophisticated IE task which requires the most NLP support. In addition to all the existing modules required for CE (tokenization, POS tagging, shallow parsing, co-referencing), two more NLP modules have been identified as crucial for the successful development of GE: semantic full parsing and pragmatic filtering.  

 

A semantic full parser aims at mapping surface structures into logical forms (semantic structures) via sentence-level analysis. This is different from shallow parsing, which only identifies isolated linguistic structures like NP, VG, etc. As the GE template in the Textract design is essentially a type of semantic representation, the need for a semantic full parser is fairly obvious.  Traditional full parsers are based on powerful grammar formalisms like CFG, but they are often too inefficient for industrial application.  The Textract full parser is based on the less powerful but more efficient FST formalism.  Besides the proven advantage of efficiency, an FST-based full parser can be built naturally on top of the shallow parsing results, which are also FST-based. This makes grammar development much less complicated.  

 

As part of the feasibility study, a small grammar has been tested for semantic parsing based on the shallow parsing results.  The following are two of the sentences used in this experiment.

 

           John Smith from California founded Xyz Company in 1970.

           This company was later acquired by Microsoft in 1985.

 

These sentences go through the processing of all the existing modules, from tokenization through POS tagging, NE, shallow parsing, CO and CE.  The processing results are represented in the Textract internal data structure.  The relevant information is shown below in text format.

 

0|NP[John Smith]:NePerson where_from=1 affiliation=3

1|PP[from California]:NeLocation

2|VG[founded]:ACTIVE/PAST

3|NP[Xyz Company]:organization_w head=0

4|PP[in 1970]:NeTime

5|PERIOD[.]

6|NP[This company]:organization_w mother_org=8 coreference=<3,0>

7|VG[was later acquired]:PASSIVE/PAST

8|PP[by Microsoft]:NeOrganization

9|PP[in 1985]:NeTime

10|PERIOD[.]

 

As seen, each unit has a unique identifier assigned by the system for supporting the CE and CO links.  NE has provided NE features like NePerson, NeOrganization, NeTime, etc. Shallow parsing provides basic linguistic structures like NP, PP and VG (plus features like ACTIVE voice, PASSIVE voice, PAST tense).  CE has identified relationships like affiliation, where_from, head and mother_org.  The existing non-deterministic CO has linked the anaphor This company with its potential antecedents Xyz Company (correct) and John Smith (wrong).  The above results are the assumed input to the semantic parser.
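To make the listing above more concrete, here is a minimal Python sketch of how such units and their links might be held in memory. The `Unit` class, its field names and the `follow` helper are assumptions made for illustration; they are not the actual Textract internal data structure.

```python
from dataclasses import dataclass, field

@dataclass
class Unit:
    """One shallow-parsing unit. All field names are illustrative assumptions."""
    uid: int                                       # system-assigned unique identifier
    chunk: str                                     # NP, PP, VG or PERIOD
    text: str                                      # surface string
    features: set = field(default_factory=set)     # NE and linguistic features
    links: dict = field(default_factory=dict)      # CE/CO relation -> list of unit ids

units = [
    Unit(0, "NP", "John Smith", {"NePerson"}, {"where_from": [1], "affiliation": [3]}),
    Unit(1, "PP", "from California", {"NeLocation"}),
    Unit(2, "VG", "founded", {"ACTIVE", "PAST"}),
    Unit(3, "NP", "Xyz Company", {"organization_w"}, {"head": [0]}),
    Unit(4, "PP", "in 1970", {"NeTime"}),
    Unit(5, "PERIOD", "."),
    Unit(6, "NP", "This company", {"organization_w"},
         {"mother_org": [8], "coreference": [3, 0]}),
    Unit(7, "VG", "was later acquired", {"PASSIVE", "PAST"}),
    Unit(8, "PP", "by Microsoft", {"NeOrganization"}),
    Unit(9, "PP", "in 1985", {"NeTime"}),
    Unit(10, "PERIOD", "."),
]

def follow(units, uid, relation):
    """Return the surface strings that `relation` points to from unit `uid`."""
    by_id = {u.uid: u for u in units}
    return [by_id[target].text for target in by_id[uid].links.get(relation, [])]

print(follow(units, 0, "affiliation"))   # ['Xyz Company']
print(follow(units, 6, "coreference"))   # ['Xyz Company', 'John Smith']
```

Keeping links as lists of unit identifiers mirrors the non-deterministic CO output, where one anaphor may point to several candidate antecedents.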

 

For this experiment, the following two rules, one for active sentences and one for passive sentences, were formulated to support the conceived semantic parsing.[1]  

 

0|NP    [Adv|PP]*    1|VG(ACTIVE)    2|NP    3|PP(TIME)

==> 1: argument1=0 argument2=2 time=3

 

0|NP    [Adv|PP]*    1|VG(PASSIVE)    2|PP(by)    3|PP(TIME)

==> 1: argument1=2 argument2=0 time=3

 

After compiling the above rules into a transducer, the FST runner serves as a parser, applying this transducer to the sample text and outputting the semantic structures shown below:

 

                       PREDICATE:             found

                       ARGUMENT1:          <John Smith>

                       ARGUMENT2:           <Xyz Company>

                       TIME:                          in 1970

 

                       PREDICATE:             acquire

                       ARGUMENT1:          <Microsoft>

                       ARGUMENT2:           ‘This company’

                       TIME:                          in 1985
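The two rules and the two templates above can be mimicked with a short Python sketch. Note that it uses plain sequence matching in place of a compiled finite-state transducer, so it only illustrates the mapping, not the Textract FST machinery. The dictionary layout, the `parse` function and the `lemma` entries (added so that the predicate can be printed in base form) are assumptions.

```python
# Shallow-parsing output for the two sample sentences (PERIOD units dropped).
# The "lemma" key is an assumption; it is not part of the listing in the text.
S1 = [
    {"id": 0, "chunk": "NP", "text": "John Smith", "feats": {"NePerson"}},
    {"id": 1, "chunk": "PP", "text": "from California", "feats": {"NeLocation"}},
    {"id": 2, "chunk": "VG", "text": "founded", "feats": {"ACTIVE", "PAST"},
     "lemma": "found"},
    {"id": 3, "chunk": "NP", "text": "Xyz Company", "feats": {"organization_w"}},
    {"id": 4, "chunk": "PP", "text": "in 1970", "feats": {"NeTime"}},
]
S2 = [
    {"id": 6, "chunk": "NP", "text": "This company", "feats": {"organization_w"}},
    {"id": 7, "chunk": "VG", "text": "was later acquired", "feats": {"PASSIVE", "PAST"},
     "lemma": "acquire"},
    {"id": 8, "chunk": "PP", "text": "by Microsoft", "feats": {"NeOrganization"}},
    {"id": 9, "chunk": "PP", "text": "in 1985", "feats": {"NeTime"}},
]

def parse(sent):
    """Apply the active rule, then the passive rule; return a template or None."""
    if not sent or sent[0]["chunk"] != "NP":           # 0|NP
        return None
    first = sent[0]
    i = 1
    while i < len(sent) and sent[i]["chunk"] in ("Adv", "PP"):   # [Adv|PP]*
        i += 1
    if i + 2 >= len(sent):                             # need VG, X and PP(TIME)
        return None
    verb, obj, time = sent[i], sent[i + 1], sent[i + 2]
    if verb["chunk"] != "VG":
        return None
    if time["chunk"] != "PP" or "NeTime" not in time["feats"]:
        return None
    if "ACTIVE" in verb["feats"] and obj["chunk"] == "NP":
        # active rule: argument1 = subject NP, argument2 = object NP
        arg1, arg2 = first["text"], obj["text"]
    elif ("PASSIVE" in verb["feats"] and obj["chunk"] == "PP"
          and obj["text"].startswith("by ")):
        # passive rule: argument1 = NP inside the by-PP, argument2 = subject NP
        arg1, arg2 = obj["text"][len("by "):], first["text"]
    else:
        return None
    return {"PREDICATE": verb["lemma"], "ARGUMENT1": arg1,
            "ARGUMENT2": arg2, "TIME": time["text"]}

for sentence in (S1, S2):
    print(parse(sentence))
```

Both rules share the same skeleton and differ only in the voice feature and in which unit fills which argument slot, which is why a single function with one branch is enough here.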

 

After merging based on the co-reference links, the second template is updated to:

 

                       PREDICATE:             acquire

                       ARGUMENT1:           <Microsoft>

                       ARGUMENT2:           {Xyz Company, John Smith}

                       TIME:                          in 1985

 

This style of semantic representation has the same structure as the defined GE template.
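The merging step described above can be sketched as follows. This is only an illustration under simplifying assumptions: the `merge_coreference` function is hypothetical, and antecedent candidates are keyed by the anaphor's surface string, whereas a real system would presumably key them by unit identifier.

```python
def merge_coreference(template, antecedents):
    """Replace each argument that is an anaphor with its candidate antecedents.

    `antecedents` maps an anaphor's surface string to the list of candidates
    proposed by the (non-deterministic) co-reference module.
    """
    merged = dict(template)                    # leave the input template untouched
    for slot in ("ARGUMENT1", "ARGUMENT2"):
        value = merged[slot]
        if value in antecedents:
            merged[slot] = list(antecedents[value])
    return merged

template = {
    "PREDICATE": "acquire",
    "ARGUMENT1": "Microsoft",
    "ARGUMENT2": "This company",
    "TIME": "in 1985",
}
# CO link for "This company": Xyz Company (correct) and John Smith (wrong).
antecedents = {"This company": ["Xyz Company", "John Smith"]}

merged = merge_coreference(template, antecedents)
print(merged["ARGUMENT2"])   # ['Xyz Company', 'John Smith']
```

Both candidates are kept because the CO links are non-deterministic; discarding the wrong candidate would be the job of a later step that is not shown here.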

 

This experiment demonstrates one important point: there is no fundamental difference between CE grammars (CE3 in particular) and the grammar required for semantic parsing.  The rules are strikingly similar; they share the same goal of mapping surface structures into some form of semantic representation (the CE template and the GE template, respectively).  They both rely on the same infrastructure (tools/mechanisms, basic NLP/IE support, etc.) which Cymfony has built over the years.  The difference lies in the content of the rules, not in their form.  Because CE relationships are pre-defined, CE rules are often keyword-based. For example, the keywords checked in CE rules for the relationship affiliation include work for, join, hired by, etc. On the other hand, the rules for semantic parsing to support GE are more abstract.  Due to the open-endedness of the GE design, the grammar only needs to check the category and sub-category (information like intransitive, transitive, di-transitive, etc.) of a verb, instead of the word literal, in order to build the semantic structures.  Popular on-line lexical resources like the Oxford Advanced Learner's Dictionary and the Longman Dictionary of Contemporary English provide very reliable sub-categorization information for English verbs.    
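The contrast between keyword-based CE rules and sub-category-based GE rules can be illustrated with a small Python sketch. The affiliation keywords come from the text above; the `SUBCAT` lexicon is a toy stand-in for dictionary-derived sub-categorization data, and all names here are hypothetical.

```python
# CE: the rule checks word literals tied to one pre-defined relationship.
AFFILIATION_KEYWORDS = {"work for", "join", "hired by"}

# GE: the rule checks only the verb's sub-category.
# Toy lexicon (assumption); a real one would be derived from a dictionary.
SUBCAT = {
    "sleep": "intransitive",
    "found": "transitive",
    "acquire": "transitive",
    "give": "ditransitive",
}
SLOT_COUNT = {"intransitive": 1, "transitive": 2, "ditransitive": 3}

def ce_affiliation_rule_fires(verb_phrase):
    """Keyword-based check: fires only for the listed literals."""
    return verb_phrase in AFFILIATION_KEYWORDS

def ge_argument_slots(verb_lemma):
    """Sub-category-based check: any verb with a known frame yields a template."""
    subcat = SUBCAT.get(verb_lemma)
    if subcat is None:
        return []
    return [f"ARGUMENT{n}" for n in range(1, SLOT_COUNT[subcat] + 1)]

print(ce_affiliation_rule_fires("join"))      # True
print(ce_affiliation_rule_fires("acquire"))   # False
print(ge_argument_slots("acquire"))           # ['ARGUMENT1', 'ARGUMENT2']
print(ge_argument_slots("give"))              # ['ARGUMENT1', 'ARGUMENT2', 'ARGUMENT3']
```

The two functions have the same shape; only the content being checked differs, which is the point made above about the form versus the content of the rules.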

 

The similarity between grammars for CE extraction and grammars for GE extraction is an important factor in the feasibility of the proposed GE task.  Since developing semantic parsing rules for GE (whether hand-coded, machine-learned or hybrid) does not require anything beyond what is needed for developing CE rules, there is a considerable degree of transferability from CE feasibility to GE feasibility.  The same techniques proven to be effective for hand-coded CE rules and for automatic CE rule learning are expected to be equally applicable to GE prototype development.  In particular, rule induction via structure-based adaptive training promises a solution to sentence-level semantic analysis for both CE3 learning and GE learning.



[1] Note the similarity of these rules to the sample rule in the CE3 grammar shown previously.







https://blog.sciencenet.cn/blog-362400-930758.html
