Towards robust large-scale Chinese parsing
Wei Li
March 29, 2013
Institute of Information Science, Academia Sinica
Chinese Parsing Background:
Four Layer System Architecture
I: Design Philosophy
Indexing system (backend engine for offline processing)
vs. retrieval system (frontend engine for on-the-fly runs)
Parser-IE architecture:
deeper parsing, shallow IE
domain-independent, app specific
linguistic, domain
bridge, end results
Two engines, four layers
I: System Architecture for Core Engine
II: Parsing-based Information Extraction
III: Text Mining
Development Environment for Parsing
Language engineering is not fundamentally different from other software engineering
Follow software development best practice:
1. unit test: environment for debugging, data search etc.
2. regression test: baselines, millions of checking points
3. QA (quality assurance) test
4. several layers of regression protection: nightly build, release build
5. NLP-specific language
6. NLP platform & environment
7. Platform extension & support
8. code review: readability and maintainability are no. 1
9. help from statistics and learning
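The regression-test item above (checking current output against stored baselines) can be sketched roughly as follows. This is only a minimal illustration, not the tooling described in the talk: the stand-in parser `toy_parse`, the helper `regression_diff`, and the baseline format (a sentence mapped to a list of head-relation-dependent triples) are all hypothetical.

```python
from typing import Callable, Dict, List, Tuple

# A dependency triple: (head, relation, dependent). Hypothetical format.
Triple = Tuple[str, str, str]
Parser = Callable[[str], List[Triple]]


def regression_diff(
    parser: Parser, baseline: Dict[str, List[Triple]]
) -> Dict[str, Dict[str, List[Triple]]]:
    """Compare current parser output with the stored baseline output.

    For each sentence whose output changed, report the triples that were
    lost and the triples that were gained relative to the baseline.
    """
    changes: Dict[str, Dict[str, List[Triple]]] = {}
    for sentence, expected in baseline.items():
        actual = parser(sentence)
        lost = [t for t in expected if t not in actual]
        gained = [t for t in actual if t not in expected]
        if lost or gained:
            changes[sentence] = {"lost": lost, "gained": gained}
    return changes


def toy_parse(sentence: str) -> List[Triple]:
    """Stand-in parser (hypothetical): treats the second token as the verb."""
    tokens = sentence.split()
    if len(tokens) < 3:
        return []
    return [(tokens[1], "subj", tokens[0]), (tokens[1], "obj", tokens[2])]


# Baseline: previously approved output for pre-segmented sentences.
baseline = {
    "我 爱 语言学": [("爱", "subj", "我"), ("爱", "obj", "语言学")],
    "他 读 书": [("读", "subj", "他"), ("读", "obj", "书")],
}

report = regression_diff(toy_parse, baseline)
print("changed sentences:", len(report))
```

In a nightly build, a non-empty report would be reviewed by a developer: each difference is either a regression to fix or an improvement to accept as the new baseline.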
Development vs. testing:
1. roughly 1:1 in terms of developer’s time
2. 1:0.5 in terms of developer and QA resources
Avoid unnecessary work and overdoing it:
1. linguists need to be kept in check: they get lost in the trees without seeing the forest
2. data-driven development
3. better goal oriented
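Dependency Trees as Representation
If Einstein saw the beauty of the Creator in space, time, and all things, and if Mendeleev saw the simplicity of the periodic table behind the endless variety of matter, then linguists see the beauty of logical structure in the ever-changing phenomena of language. This experience of beauty accompanies our sweat and encourages us to move mountains, like the Foolish Old Man of the fable, to level language barriers for the benefit of humanity.
Taipei talk slides, part two: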
http://blog.sciencenet.cn/blog-362400-677358.html