Final Reports for Small Business Innovation Research (SBIR) Projects

Submitted (to appear)

Li, W. and R. Srihari. 2005. Intermediate-Level Event Extraction for Temporal and Spatial Analysis and Visualization, Phase 2 Final Technical Report, Air Force Research Laboratory, Information Directorate, Rome Research Site, New York. (forthcoming)

Li, W. and R. Srihari. 2005. Automated Verb Sense Identification, Phase 1 Final Technical Report, Navy SBIR. (forthcoming)

Li, W., R. Srihari and C. Niu. 2006. Automated Verb Sense Identification, Phase 2 Final Technical Report, Navy SBIR. (forthcoming)

Published

(1) Srihari, R., W. Li and C. Niu. 2005. An Intelligence Discovery Portal Based on Cross-document Extraction and Text Mining, Phase 1 Final Technical Report, Air Force Research Laboratory, Information Directorate, Rome Research Site, New York.

KEYWORDS: Information Extraction

Abstract: This effort addresses two major enhancements to current information extraction (IE) technology. The first concerns the development of higher levels of IE, first at the corpus level and ultimately across corpora, including structured data. The second concerns text mining from a rich IE repository assimilated from multiple corpora. IE is only a means to an end, which is the discovery of hidden trends and patterns that are implicit in large volumes of text. This effort was based on Cymfony's document-level IE system, InfoXtract. A fusion component was developed to assimilate information extracted across multiple documents. Text mining experiments were conducted on the resulting rich knowledge repository. Finally, the design of an intelligence discovery portal (IDP) prototype led to the consolidation of the developed technology into an intuitive web-based application.

(2) Srihari, R. and W. Li. 2005. Fusion of Information from Diverse, Textual Media: A Case Restoration Approach, Phase 1 Final Technical Report, Air Force Research Laboratory, Information Directorate, Rome Research Site, New York.

KEYWORDS: Information Fusion

Abstract: Fusion of information in diverse text media containing case-insensitive information was explored, based on a core Information Extraction (IE) system capable of processing case-sensitive text. The core engine was adapted to handle diverse, case-insensitive sources, e.g. e-mail, chat, newsgroups, broadcast transcripts, and HUMINT documents. The fusion system assimilates information extracted from text with that in structured knowledge bases. Traditional IE for case-insensitive text is limited to the named entity (NE) stage, e.g. retraining an NE tagger on case-insensitive text. We explored case restoration, whereby statistical models and rules are used to recover the case-sensitive form, so the core IE system did not need to be modified. IE systems are fully exploited only if their output is consolidated with knowledge in relational databases. This calls for natural language processing and reasoning, including entity co-reference and event co-reference. Consolidation permits database change detection and alerts. Feedback to the core IE system exploits information in knowledge bases, thereby fusing information. Information analysts and decision makers will benefit, since this effort extends the utility of IE. A viable solution has many applications, including business intelligence systems that use large knowledge bases of companies, products, people and projects; updating these knowledge bases from chat, newsgroups and multimedia broadcast transcripts would be enabled. A specific commercial application focused on brand perception and monitoring will benefit. Knowledge management systems would benefit from the ability to assimilate information in web documents and newsgroups with structured information. Military applications stem from the fact that analysts need to consolidate an abundance of information.
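The case-restoration idea in report (2) lends itself to a small illustration. The sketch below is a minimal, assumed implementation, not Cymfony's actual models: it learns the most frequent case-sensitive surface form of each token from normally cased text and uses that, plus a trivial sentence-initial rule, to restore case in lowercased input. The function names and toy training data are hypothetical.

```python
# Minimal case-restoration sketch (illustrative only, not the report's model):
# learn the most frequent surface form of each token from case-sensitive text,
# then map case-insensitive input back to those forms, with a simple
# sentence-initial capitalization rule as a fallback.
from collections import Counter, defaultdict

def train_case_model(case_sensitive_sentences):
    """Count surface forms per lowercased token; keep the most frequent form."""
    forms = defaultdict(Counter)
    for sent in case_sensitive_sentences:
        for tok in sent.split():
            forms[tok.lower()][tok] += 1
    return {low: counts.most_common(1)[0][0] for low, counts in forms.items()}

def restore_case(lowercased_sentence, model):
    """Restore case token by token; capitalize an unknown sentence-initial token."""
    out = []
    for i, tok in enumerate(lowercased_sentence.split()):
        best = model.get(tok, tok)
        if i == 0 and best.islower():
            best = best.capitalize()
        out.append(best)
    return " ".join(out)

if __name__ == "__main__":
    training = ["John Smith joined IBM in New York .",
                "The merger with IBM was announced in New York ."]
    model = train_case_model(training)
    print(restore_case("john smith left ibm in new york .", model))
    # -> "John Smith left IBM in New York ."
```

A real system would combine contextual statistics, NE evidence, and back-off rules rather than unigram lookup, but the pipeline shape is the same: restore case first, then run the unmodified case-sensitive IE engine.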
(3) Srihari, R. and W. Li. 2004. An Automated Domain Porting Toolkit for Information Extraction, Phase 1 Final Technical Report, Air Force Research Laboratory, Information Directorate, Rome Research Site, New York.

KEYWORDS: Domain Porting

Abstract: Information extraction (IE) systems provide critical assistance to both intelligence analysts and business analysts in the process of assimilating information from a multitude of electronic documents. This task seeks to investigate the feasibility of developing an automated domain porting toolkit that could be used to customize generic information extraction for a specific domain of interest. Customization is required at various levels of IE in order to fully exploit domain characteristics. These levels include (i) lexicon customization, (ii) acquiring specialized glossaries of names of people, organizations, locations, etc., which assist in the process of tagging key named entities (NE), (iii) detecting important relationships between key entities, e.g. the headquarters of a given organization, and (iv) detecting significant events, e.g. transportation of chemicals. Due to the superior performance derived through customization, many have chosen to develop handcrafted IE systems that can be applied only to a single domain, such as the insurance and medical industries. The approach taken here is based on the existence of a robust, domain-independent IE engine that can continue to be enhanced, independent of any specific domain. This effort describes an attempt to develop a complete platform for automated customization of such a core engine to a specific domain or corpus. Such an approach facilitates both rapid domain porting and cost savings, since linguists are not required. Developing such a domain porting toolkit calls for basic research in unsupervised machine learning techniques. Our structure-based training approach, which leverages output from the core IE engine, is already comparable in performance to the best unsupervised learning methods and is expected to significantly exceed them with further research. A bootstrap approach using initial seeds is described. It is necessary to learn both lists of words (lexicons) and rule templates so that all levels of IE are customized. The final deliverables include: (i) new algorithms for structure-based bootstrap learning, (ii) a prototype model for domain porting of both lexicons and rule templates, demonstrated on an intelligence domain, and (iii) the design of a complete automated domain porting toolkit, including user-friendly graphical interfaces.
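Report (3)'s seed-based bootstrapping can be sketched in a few lines. The code below is an illustrative toy, not the report's structure-based learner (which operates on parsed IE structures rather than raw token windows): starting from seed terms, it collects the contexts in which they occur and harvests new lexicon candidates that share those contexts. The toy corpus and function names are my own.

```python
# Minimal bootstrap lexicon-learning sketch (illustrative only):
# seeds -> learn surrounding-token patterns -> harvest new terms -> repeat.

def contexts(corpus, term, window=2):
    """Yield (left, right) token windows around each occurrence of term."""
    toks = corpus.split()
    for i, t in enumerate(toks):
        if t == term:
            yield (tuple(toks[max(0, i - window):i]),
                   tuple(toks[i + 1:i + 1 + window]))

def bootstrap(corpus, seeds, rounds=2, window=2):
    lexicon = set(seeds)
    toks = corpus.split()
    for _ in range(rounds):
        # 1. learn context patterns from the current lexicon
        patterns = {ctx for term in lexicon for ctx in contexts(corpus, term, window)}
        # 2. harvest any token that appears in a learned context
        for i, t in enumerate(toks):
            left = tuple(toks[max(0, i - window):i])
            right = tuple(toks[i + 1:i + 1 + window])
            if (left, right) in patterns:
                lexicon.add(t)
    return lexicon

corpus = ("troops moved sarin by truck . troops moved mustard_gas by truck . "
          "rebels moved supplies by convoy .")
print(bootstrap(corpus, seeds={"sarin"}))
# the seed's context ("troops", "moved") ... ("by", "truck") also matches mustard_gas
```

In practice, bootstrapping of this kind needs pattern scoring and filtering to avoid semantic drift; the sketch only shows the seed-and-harvest loop.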
(4) Li, W. and R. Srihari. 2003. Flexible Information Extraction Learning Algorithm, Phase 2 Final Technical Report, Air Force Research Laboratory, Information Directorate, Rome Research Site, New York.

Abstract: This research seeks to develop a working prototype for both shallow-level and intermediate-level information extraction (IE) by effectively employing machine learning techniques. Machine learning algorithms represent a 'cutting edge' approach to tasks involving natural language processing (NLP) and information extraction. Currently, IE systems using machine learning have been restricted to low-level, shallow extraction tasks such as named entity tagging and simple event extraction. In terms of methodology, the majority of systems rely mainly on supervised learning, which requires a sizable manually annotated corpus. To address these problems, a hybrid IE prototype, 'InfoXtract', that combines machine learning and rule-based approaches has been developed. This prototype is capable of extracting named entities, correlated entity relationships and general events. To showcase the use of IE in applications, an IE-based Question Answering prototype has been implemented. In addition to the use of the proven techniques of supervised learning, unsupervised learning has been explored for lexical knowledge acquisition in support of IE. A machine learning toolkit/platform that supports both supervised and unsupervised learning has also been developed. These achievements have laid a solid foundation for enhancing IE capabilities and for deploying the developed technology in IE applications.
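Report (4) describes a hybrid of handcrafted rules and statistical learning. The following is a hedged, minimal sketch of that general idea, not InfoXtract itself: a small rule pass fires first, and anything it leaves undecided falls back to tag frequencies learned from a hypothetical annotated corpus.

```python
# Minimal hybrid named-entity sketch (illustrative only): handcrafted rules
# first, a simple statistical fallback (unigram tag frequencies) second.
from collections import Counter, defaultdict
import re

RULES = [
    (re.compile(r".+\b(Inc|Corp|Ltd)\.?$"), "ORG"),      # "Cymfony Inc." -> ORG
    (re.compile(r"^(Mr|Ms|Dr)\.\s+\w+"), "PERSON"),      # "Dr. Li" -> PERSON
]

def train_stats(annotated):
    """annotated: (phrase, tag) pairs from a hand-labeled corpus."""
    counts = defaultdict(Counter)
    for phrase, tag in annotated:
        counts[phrase.lower()][tag] += 1
    return {p: c.most_common(1)[0][0] for p, c in counts.items()}

def tag(phrase, stats):
    for pattern, label in RULES:           # rule pass
        if pattern.search(phrase):
            return label
    return stats.get(phrase.lower(), "O")  # statistical fallback

stats = train_stats([("Rome", "LOCATION"), ("Rome", "LOCATION"), ("IBM", "ORG")])
for p in ["Cymfony Inc.", "Dr. Li", "Rome", "IBM", "widget"]:
    print(p, "->", tag(p, stats))
```

The design point is the division of labor: rules capture intuitive, high-precision patterns, while the learned component supplies coverage where rule writing would be laborious.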
(5) Li, W. and R. Srihari. 2001. Intermediate-Level Event Extraction for Temporal and Spatial Analysis and Visualization, Phase 1 Final Technical Report, Air Force Research Laboratory, Information Directorate, Rome Research Site, New York.

Abstract: This task seeks to develop a system for intermediate-level event extraction with emphasis on time/location normalization. Currently, only keyword-based, shallow Information Extraction (IE), mainly the identification of named entities and simple events, is available for deployment. There is an acute demand for concept-based, intermediate-level extraction of events and their associated time and location information. The results of this effort can be leveraged in applications such as information visualization, fusion, and data mining. Cymfony Inc. has assessed the technical feasibility of concept-based, intermediate-level general event extraction (C-GE) by effectively employing a flexible approach integrating statistical models, handcrafted grammars and procedures encapsulating specialized logic. This intermediate-level IE system, C-GE, aims at 'translating' the language-specific, keyword-based representation of IE results into a type of 'interlingua' based mainly on concepts. More precisely, the key verb of a shallow event is mapped into a concept cluster (e.g. kill/murder/shoot to death → {kill, put to death}), and the time and location of the event are normalized (e.g. last Saturday → 1999-01-30). Extracting concept-based, general events from free text requires the application of 'cutting edge' Natural Language Processing (NLP) technology. The approach Cymfony proposes consists of a blend of machine learning techniques, cascaded application of handcrafted Finite State Transducer (FST) rules, and procedural modules. This flexible combination of techniques and methods exploits the best of different paradigms depending on the specific task being handled. The work implemented by Cymfony under this Small Business Innovation Research (SBIR) Phase I grant includes the C-GE system architecture, the detailed task definition of C-GE, the implementation of a prototype time normalization module, the implementation of an alias association procedure inside NE (Named Entity tagging), an enhanced machine learning tool for tasks like co-reference (CO), the development of semantic parsing grammars for shallow events, and research on lexical clustering and sense tagging. These accomplishments make the feasibility study reliable and provide a solid foundation for future system development.
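Report (5)'s time-normalization example ("last Saturday" → 1999-01-30) can be made concrete with a short sketch. The reference (document) date below is an assumption chosen to reproduce that example, and the function is illustrative, not the report's prototype module.

```python
# Minimal time-normalization sketch (illustrative only): resolve a relative
# expression like "last Saturday" against an assumed reference (document) date
# and emit an ISO date.
from datetime import date, timedelta

WEEKDAYS = ["monday", "tuesday", "wednesday", "thursday",
            "friday", "saturday", "sunday"]

def normalize_last_weekday(expr, reference):
    """Map 'last <weekday>' to the most recent such weekday before reference."""
    word = expr.lower().replace("last", "").strip()
    if word not in WEEKDAYS:
        return None
    target = WEEKDAYS.index(word)
    delta = (reference.weekday() - target) % 7 or 7   # go back at least one day
    return (reference - timedelta(days=delta)).isoformat()

# Assumed document date of 1999-02-03 reproduces the abstract's example.
print(normalize_last_weekday("last Saturday", date(1999, 2, 3)))  # 1999-01-30
```

A full normalizer must also handle calendar dates, durations, frequencies and underspecified expressions, but the core operation is the same: anchor the expression to a reference time and emit a canonical value.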
(6) Li, W. and R. Srihari. 2000. A Domain Independent Event Extraction Toolkit, Phase 2 Final Technical Report, Air Force Research Laboratory, Information Directorate, Rome Research Site, New York.

Abstract: The proliferation of electronic documents has created an information overload. It has become necessary to develop automated tools for sorting through the mass of unrestricted text and for extracting relevant information. Currently, advanced Natural Language Processing (NLP) tools are not widely available for such commercial applications across domains. A commercially viable solution to this problem could have a tremendous impact on automating information access for human agents. Cymfony has assessed the technical feasibility of domain-independent information extraction. Regardless of a document's domain, it has proven possible to extract key information objects accurately (with over 90% precision and recall). These objects include data items such as dates, locations, addresses, and individual or organization names. More significantly, multiple relationships and general events involving the identified items can also be identified fairly reliably (over 80% precision and recall for pre-defined relationships and over 70% for general events). Multiple relationships between entities are reflected in the task definition of the Correlated Entity (CE) template. A CE template presents profile information about an entity; for example, the CE template for a person entity represents a miniature resume of that person. A general event (GE) template is an argument structure centering around a verb notion with its arguments (logical subject, logical object, etc.) plus the associated time (or frequency) and location information. The implementation of such a domain-independent information extractor requires the application of robust natural language processing tools. The approach used for this effort consists of a unique blend of statistical processing and finite state transducer (FST) based grammar analysis. Statistical approaches were used for their demonstrated robustness and domain portability. For text processing, Cymfony has also developed FST technology to model natural language grammar at multiple levels. As the basis for the grammar modeling, Cymfony has implemented an FST Toolkit. Cymfony has achieved the two proposed design objectives: (i) domain portability: the information extraction system Cymfony has developed can be applied to different domains with minimal changes; (ii) user-friendliness: with an intuitive user interface, non-expert users can access the extracted information easily. The work implemented by Cymfony under this SBIR Phase II grant on domain-independent information extraction includes the conceptual design of the system architecture, named entity tagging, FST grammar modeling, and an integrated end-to-end prototype system involving all the modules. These accomplishments provide a solid foundation for further commercial development and exploitation.

(7) Li, W. and R. Srihari. 2000. Flexible Information Extraction Learning Algorithm, Phase 1 Final Technical Report, Air Force Research Laboratory, Information Directorate, Rome Research Site, New York.

Abstract: The proliferation of electronic documents has created an information overload. It has become necessary to develop automated tools for quickly extracting key information from the mass of documents. The popularity of news clipping services, advanced WWW search engines and software agents illustrates the need for such tools. Currently, only shallow Information Extraction (IE), mainly the identification of named entities, is available for commercial applications. There is an acute demand for high-level extraction of relationships and events in situations where massive amounts of natural language text are involved. Cymfony Inc. has assessed the technical feasibility of domain-independent, high-level information extraction by effectively employing machine learning techniques. A hierarchical, modular system named Textract has been proposed for high-level as well as low-level IE tasks. In the Textract architecture, high-level IE consists of two main modules/tasks: Correlated Entity (CE) extraction and General Event (GE) extraction. CE extracts pre-defined multiple relationships between entities, such as "affiliation", "position", "address", and "email" for a person entity. GE is designed to extract open-ended key events to provide information on who did what, (to whom), when and where. These relationships and events may be contained within sentence boundaries or span a discourse of running text. The application of Textract/IE to the task of natural language Question Answering (QA) has also been explored. A unique, hybrid approach has been employed, combining the best of both paradigms, namely machine learning and rule-based systems using finite state transducers (FST). The latter has the advantage of being intuitive as well as efficient; however, knowledge acquisition is laborious and incomplete, especially when domain portability is involved. Machine learning techniques address this deficiency by automated learning from an annotated corpus. Statistical techniques such as Hidden Markov Models, maximum entropy and rule induction have been examined for possible use in the different tasks and module development of this effort. The work implemented by Cymfony under this SBIR Phase I grant includes the IE system architecture, task definitions, machine learning toolkit development, FST grammar modeling for relationship/event extraction, implementation of the Textract/CE prototype, implementation of the Textract/QA prototype based on IE results, and a detailed simulation involving all the modules up to general event extraction. These accomplishments make the feasibility study reliable and provide a solid foundation for future system development.

The subject technical reports have been added to the Technical Reports database at DTIC and are now available to others. The citations above provide the information that requesters would need to access these reports at http://www.dtic.mil/.
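For readers who want a concrete picture of the Correlated Entity (CE) and General Event (GE) templates described in reports (6) and (7), here is a hedged data-structure sketch. The field names and example values are my own shorthand, not the reports' exact schema.

```python
# Illustrative sketch of CE and GE templates: a CE template holds profile-style
# relationships for an entity (a "miniature resume"); a GE template holds a verb
# concept with its arguments plus normalized time and location.
from dataclasses import dataclass
from typing import Optional

@dataclass
class CETemplate:
    entity: str                       # e.g. a person name
    affiliation: Optional[str] = None
    position: Optional[str] = None
    address: Optional[str] = None
    email: Optional[str] = None

@dataclass
class GETemplate:
    predicate: str                    # concept cluster, e.g. "{kill, put to death}"
    subject: Optional[str] = None     # who did it
    object: Optional[str] = None      # to whom / to what
    when: Optional[str] = None        # normalized time, e.g. "1999-01-30"
    where: Optional[str] = None       # normalized location

# Hypothetical filled templates for illustration only.
person = CETemplate(entity="John Smith", affiliation="Cymfony Inc.",
                    position="analyst")
event = GETemplate(predicate="{kill, put to death}", subject="the gunman",
                   object="two guards", when="1999-01-30", where="Rome, NY")
print(person)
print(event)
```

The point of the structure is that downstream components (visualization, fusion, question answering) can consume normalized slots rather than raw keyword spans.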
Principal Investigator (PI) or Co-Principal Investigator (Co-PI): Dr. Wei Li
Keywords across the above reports: Corpus-Level IE, Information Fusion, Text Mining, Knowledge Discovery, Case Restoration, Information Extraction, Multimedia Information Extraction, Named Entity Tagging, Event Detection, Relationship Detection, Customization, Natural Language Processing, Unsupervised Machine Learning, Bootstrapping, Example-based Rule Writing.
DTIC record for report (4):

Citation: Li, W. & R. Srihari. 2003. Flexible Information Extraction Learning Algorithm, Phase 2 Final Technical Report, Air Force Research Laboratory, Information Directorate, Rome Research Site, New York.
Full Text: /UL/b292996.pdf (793.8 KB)
Accession Number: ADB292996
Citation Status: Active
Citation Classification: Unclassified
Field(s) & Group(s): 050800 - PSYCHOLOGY
Corporate Author: CYMFONY NET INC WILLIAMSVILLE NY
Unclassified Title: Flexible Information Extraction Learning Algorithm
Title Classification: Unclassified
Descriptive Note: Final technical rept., May 2000-Apr 2003
Personal Author(s): Li, Wei; Srihari, Rohini K.
Report Date: Jul 2003
Media Count: 65 Page(s)
Cost: $9.60
Contract Number: F30602-00-C-0037
Report Number(s): AFRL-IF-RS-TR-2003-157; XC-AFRL-IF-RS
Project Number: 3005
Task Number: 91
Monitor Acronym: AFRL-IF-RS; XC
Monitor Series: TR-2003-157; AFRL-IF-RS
Report Classification: Unclassified
Supplementary Note: The original document contains color images.
Distribution Statement: Distribution authorized to U.S. Gov't. agencies only; Specific Authority; Jul 2003. Other requests shall be referred to Air Force Research Lab., Attn: IFEA, Rome, NY 13441-4114. Availability: This document is not available from DTIC in microfiche.
Descriptors: *ALGORITHMS, *LEARNING MACHINES, *INFORMATION RETRIEVAL, METHODOLOGY, PROTOTYPES, PLATFORMS, LOW LEVEL, EXTRACTION, KNOWLEDGE BASED SYSTEMS, FOUNDATIONS(STRUCTURES), NATURAL LANGUAGE, INFORMATION PROCESSING, TOOL KITS, LEXICOGRAPHY, SHALLOW DEPTH
Identifiers: SBIR(SMALL BUSINESS INNOVATION RESEARCH), SBIR REPORTS, PE65502F, WUAFRL30059109
Abstract Classification: Unclassified
Distribution Limitation(s): 03 - U.S. GOVT. ONLY; DOD CONTROLLED; 26 - NOT AVAILABLE IN MICROFICHE
Source Serial: F
Source Code: 432812
Document Location: DTIC
Creation Date: 19 Nov 2003