Chinese morphology & syntax
Characters form words, and words form sentences (or phrases):
1. The boundary between the two is not clear-cut
2. The rules are similar
3. compounding: small syntax, a BIG part of Chinese structures
4. pipeline steps with adaptive development and patches can handle this
modular development is key for a complex system
easy to debug and maintain
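System internal coordination:
1. Many problems can be solved through system-internal coordination: there is no absolute right or wrong, only what is more suitable
and easier to maintain, e.g. two-subject phenomena
2. Layered overall, with local patches:
2 major counter-arguments:
inter-dependency
error propagation
Word segmentation, POS tagging, etc. therefore need no rigid, one-size-fits-all cut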
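The "layered overall, with local patches" idea can be sketched as a list of stages in which a later stage repairs a known error made by an earlier one. This is only an illustrative sketch: the `PATCHES` table, the function names, and the example tokens (a hypothetical greedy segmenter mis-splitting 研究生命起源) are assumptions, not part of the system described in the slides.

```python
from typing import Callable, Dict, List, Tuple

Stage = Callable[[List[str]], List[str]]

# Hypothetical patch table: a known bad token pair and its repair.
PATCHES: Dict[Tuple[str, ...], List[str]] = {
    ("研究生", "命"): ["研究", "生命"],
}


def patch_segmentation(tokens: List[str]) -> List[str]:
    """Locally repair known upstream segmentation errors."""
    out: List[str] = []
    i = 0
    while i < len(tokens):
        pair = tuple(tokens[i:i + 2])
        if pair in PATCHES:
            out.extend(PATCHES[pair])  # replace the bad pair
            i += 2
        else:
            out.append(tokens[i])      # keep the token unchanged
            i += 1
    return out


def run_pipeline(tokens: List[str], stages: List[Stage]) -> List[str]:
    """Apply each stage in order; every stage stays separately testable."""
    for stage in stages:
        tokens = stage(tokens)
    return tokens


# Assumed output of an upstream segmenter that split the text wrongly.
upstream = ["研究生", "命", "起源"]
result = run_pipeline(upstream, [patch_segmentation])
```

Because the patch is its own stage, the upstream error does not have to propagate, and the repair can be switched off or debugged in isolation.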
2-subject structures
我身体好 (lit. "I body good": I am in good health)
三星手机屏幕清晰,价格合理 (lit. "Samsung phone screen clear, price reasonable": the Samsung phone has a clear screen and a reasonable price)
Linguists' different analyses: each has its points/perspectives and is valid
(1) S1+S2+Pred or Topic+Subj+Pred
(2) “NP1-modifier NP2 + Pred” = NP1+de+NP2 + Pred
(3) NP1 + Pred (NP2+AP): pred compounding analysis
No need to argue, whichever analysis is convenient
No absolute right or wrong, different perspectives
Largely system internal:
parsing representation is not the goal, IE is
as long as the tree is consistent and supports IE
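To illustrate why the choice of analysis is largely system-internal, the sketch below encodes 我身体好 under each of the three analyses and maps every one to the same extraction result. The dictionary layouts, the function names, and the (entity, aspect, opinion) triple format are assumptions made for this example only.

```python
from typing import Dict, Tuple

Triple = Tuple[str, str, str]  # (entity, aspect, opinion)


def from_topic_subj(tree: Dict) -> Triple:
    # Analysis (1): Topic + Subj + Pred
    return (tree["topic"], tree["subj"], tree["pred"])


def from_modifier(tree: Dict) -> Triple:
    # Analysis (2): [NP1 (de) NP2] + Pred, where NP1 modifies NP2
    np = tree["subj"]
    return (np["modifier"], np["head"], tree["pred"])


def from_pred_compound(tree: Dict) -> Triple:
    # Analysis (3): NP1 + Pred(NP2 + AP)
    pred = tree["pred"]
    return (tree["subj"], pred["np"], pred["ap"])


# Hypothetical encodings of the same sentence, one per analysis.
tree1 = {"topic": "我", "subj": "身体", "pred": "好"}
tree2 = {"subj": {"modifier": "我", "head": "身体"}, "pred": "好"}
tree3 = {"subj": "我", "pred": {"np": "身体", "ap": "好"}}

triples = [
    from_topic_subj(tree1),
    from_modifier(tree2),
    from_pred_compound(tree3),
]
```

Whichever tree shape the parser commits to, a consistent mapping yields the same triple, which is all the downstream IE step needs.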
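Word segmentation vs. word formation
Word segmentation is an organic part of the system:
1. Accuracy is not the only criterion: a real story
2. Configurability and ease of debugging matter most
3. Do not put the cart before the horse: two negatives can also make a positive, adaptive development
vs. pipeline error propagation
A large lexicon is the fundamental solution:
1. Boundary lexicon: the bigger the better (although a linguistic lexicon is finite)
2. One purpose of segmentation is semantic tagging: HowNet
Combining segmentation with word formation:
1. listable
2. open-ended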
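One way to picture the listable/open-ended split is a segmenter that looks words up in a lexicon first and falls back on a word-formation rule for items that cannot be listed. The sketch below uses forward maximum matching and a numeral rule purely as stand-ins; the slides do not name a matching strategy, and the tiny `LEXICON`, the `NUMERAL` pattern, and the example sentence are assumptions.

```python
import re
from typing import List, Tuple

# Hypothetical toy lexicon for the listable part.
LEXICON = {"三星", "手机", "屏幕", "清晰", "价格", "合理"}
MAX_LEN = max(len(w) for w in LEXICON)

# Open-ended part: numerals are formed by rule, not listed.
NUMERAL = re.compile(r"[零一二三四五六七八九十百千万亿]+")


def segment(text: str) -> List[Tuple[str, str]]:
    """Return (token, source) pairs; source says how the token was found."""
    tokens: List[Tuple[str, str]] = []
    i = 0
    while i < len(text):
        # Listable: try the longest lexicon entry starting at position i.
        for n in range(min(MAX_LEN, len(text) - i), 0, -1):
            if text[i:i + n] in LEXICON:
                tokens.append((text[i:i + n], "lexicon"))
                i += n
                break
        else:
            # Open-ended: compose a numeral by rule, else emit one character.
            m = NUMERAL.match(text, i)
            if m:
                tokens.append((m.group(), "numeral-rule"))
                i = m.end()
            else:
                tokens.append((text[i], "char"))
                i += 1
    return tokens


tokens = segment("三星手机价格三千五百元")
```

Here the lexicon claims 三星 before the numeral rule can see 三, while 三千五百 is assembled by rule because no lexicon could list every number.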
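Awaiting a theoretical breakthrough in Chinese grammar?
Methods and tools from the analysis of Western languages:
1. Usable
2. collocations: phrasal verbs at the morpho-syntactic interface
3. Need extension: e.g. reduplication & unification
聊聊天 ("have a chat"); 说说话 ("have a talk")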
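The reduplication cases above can be handled with a unification-style check: the first two characters must bind to the same value, and the remainder must combine with it into a known word. A minimal sketch, assuming a hypothetical `VO_COMPOUNDS` lexicon and an illustrative output format:

```python
from typing import Dict, Optional

# Hypothetical lexicon of verb-object words that allow AAB reduplication.
VO_COMPOUNDS = {"聊天", "说话", "散步"}


def analyze_reduplication(word: str) -> Optional[Dict[str, str]]:
    """Match the AAB pattern: A must unify with itself and A+B must be listed."""
    if len(word) != 3:
        return None
    a1, a2, b = word
    if a1 == a2 and a1 + b in VO_COMPOUNDS:
        return {"base": a1 + b, "pattern": "AAB", "feature": "reduplicated"}
    return None


chat = analyze_reduplication("聊聊天")
talk = analyze_reduplication("说说话")
```

The reduplicated form is thus reduced to its lexicon entry plus a feature, so later rules written for 聊天 or 说话 still apply.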
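The so-called “parataxis” (意合) of Chinese:
1. The grammar is relatively flexible
2. Frequent ellipsis:
(1) 对于这件事,依我的看法,我们应该听其自然。 ("As for this matter, in my view, we should let it take its course.")
(2) 这件事我的看法应该听其自然。 (lit. "This matter my view should let it take its course.")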
The difficulty of parsing:
1. Chinese is a steep climb:
more fine-grained rules: POS, sub-POS, lexical feature, word-driven
more lexical features needed: HowNet
a lazy man’s approach won’t work
2. The slope for English is gentler
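Chinese NLP myth no. 3: major progress in Chinese processing awaits a theoretical breakthrough in Chinese grammar
Is word sense disambiguation (WSD) the bottleneck of NLP applications??
No
Structural ambiguity is more serious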
Beyond Precision & Recall
Big data redundancy helps not only recall but also precision
Instance-based recall at extraction level vs concept-based recall at mining level
(the latter matters to users)
Our language system reads and analyzes 50 million posts every day, a processing volume of 1.5 billion words
Community benchmarks vs industry benchmark
Users’ experiences
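The difference between instance-based and concept-based recall can be made concrete with a toy calculation. The sketch below assumes each mention is a (post id, concept) pair; the data, the concept labels, and the function names are invented for illustration and are not measurements from the system described above.

```python
from typing import Iterable, Set, Tuple

Mention = Tuple[str, str]  # (post_id, concept)


def instance_recall(gold: Iterable[Mention], extracted: Iterable[Mention]) -> float:
    """Share of gold mentions that were extracted (extraction level)."""
    gold_set: Set[Mention] = set(gold)
    return len(gold_set & set(extracted)) / len(gold_set)


def concept_recall(gold: Iterable[Mention], extracted: Iterable[Mention]) -> float:
    """Share of gold concepts found at least once (mining level)."""
    gold_set: Set[Mention] = set(gold)
    gold_concepts = {concept for _, concept in gold_set}
    found = {concept for mention in extracted if mention in gold_set
             for concept in [mention[1]]}
    return len(found) / len(gold_concepts)


# Invented data: the same opinions are repeated across many posts.
gold = [
    ("p1", "screen:clear"), ("p2", "screen:clear"), ("p3", "screen:clear"),
    ("p4", "price:fair"), ("p5", "price:fair"),
    ("p6", "battery:weak"),
]
extracted = [("p1", "screen:clear"), ("p4", "price:fair"), ("p6", "battery:weak")]

inst = instance_recall(gold, extracted)
conc = concept_recall(gold, extracted)
```

In this toy case half of the individual mentions are missed, yet every concept is still recovered, which is the sense in which redundancy in big data compensates for imperfect extraction.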
Sentiment Mining based on Chinese Parsing
Thank You & QA
Part 1 of the Taipei lecture slides:
http://blog.sciencenet.cn/blog-362400-677352.html