Sharing 《镜子大全》 and 《朝华午拾》: http://blog.sciencenet.cn/u/liwei999. Former Little Red Guard, sent down to the countryside to "mend the earth", left the country in 1991, path since unsettled.

Blog post

A veteran colleague harshly criticizes Liwei's "Myths" series for upending the NLP order; Liwei stands his ground

4101 views · 2011-12-29 23:29 | Personal category: Liwei's science popularization | System category: Blog news | Keywords: NLP, Chinese, 中文, processing, natural language

G is a senior colleague in the field and a professional friend of many years, who often trades views with Liwei both inside and outside the profession. Both are old hands, so heated clashes and flying sparks are nothing unusual.

Yesterday I mailed him the three installments of the "Myths" series, and he called right away: "Good grief, you really want to stir up chaos. Reading the 'Myths', my blood boiled. This is a wholesale negation of Chinese NLP: alarmist, wildly subversive talk. Extreme, seriously extreme, and misleading. I know what you are saying and what you are trying to say, but for newcomers just entering the field, your 'Myths' are misleading."

Hearing that his blood was boiling, I got especially excited: "Go ahead, criticize away, throw your bricks. I stand behind every word; each point is the natural distillation of years of reflection and experience, and it absolutely holds up. As for the young people, they have already been misled by all sorts of 'myths'; at most I am overcorrecting, pushing back against the myths. That is absolutely not misleading."

Below I excerpt and splice together the criticism and the responses, leaving a record for history. Insiders watch the craft, outsiders watch the show; spectators welcome.

2011/12/28 G
G: The third one is more to the point. Strictly speaking, this is not a myth at all; it is more like universally applicable "superfluous words" (多余的话).

Frankly, the first two read as clickbait (标题党) to me. Most of the "supporting evidence" is wrong.

Well, I think I know what you were trying to say. But to most readers, I believe, you come across as misleading.

 
Liwei: No, I was not misleading; this is deliberate overcorrection (矫枉过正).

 
G: At least I think you should explain a bit more, and pick your examples carefully.

Take one example. Tokenizing "People's Republic of China" is routinely done with regular expressions (rule-based), keyed on capitalization, the apostrophe, and the preposition (symbolic evidence), NOT with a dictionary.
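G's point, tokenization driven purely by symbolic evidence (capitalization, apostrophe, linking preposition) with no dictionary, can be sketched with a regular expression. This is my own toy illustration, not code from either side of the debate:

```python
import re

# Hypothetical sketch: a run of capitalized words, each optionally
# possessive ('s), optionally linked by "of" or "the". No dictionary
# is consulted; only surface (symbolic) evidence is used.
NAME = re.compile(r"[A-Z][a-z]+(?:'s)?(?:\s+(?:of\s+|the\s+)?[A-Z][a-z]+(?:'s)?)*")

text = "He visited the People's Republic of China last year."
print(NAME.findall(text))
# → ['He', "People's Republic of China"]
```

Note the overgeneration: the pattern correctly groups "People's Republic of China" into one token, but the sentence-initial "He" is swept up too, which is exactly the kind of imprecision a purely symbolic rule accepts in exchange for dictionary independence.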

Liwei: That is not the point. Yes, maybe I should have chosen a non-name example ("interest rate", 利率, is a better example for both Chinese and English), but the point is that closed compounds can, and should, be looked up in lexicons rather than assembled by rules.
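Liwei's alternative, treating closed compounds like 利率 as single lexicon entries, amounts to segmentation by lexicon lookup. A minimal greedy longest-match sketch, with a toy lexicon of my own invention:

```python
# Toy lexicon: the closed compound 利率 ("interest rate") is an entry
# in its own right, alongside its component characters.
LEXICON = {"利", "率", "利率", "上升", "了"}

def longest_match_segment(text, lexicon):
    """Greedy left-to-right longest-match segmentation by lexicon lookup."""
    tokens, i = [], 0
    while i < len(text):
        for j in range(len(text), i, -1):      # try the longest candidate first
            if text[i:j] in lexicon:
                tokens.append(text[i:j])
                i = j
                break
        else:                                  # unknown character: emit it as-is
            tokens.append(text[i])
            i += 1
    return tokens

print(longest_match_segment("利率上升了", LEXICON))
# → ['利率', '上升', '了']
```

Because 利率 is listed, the compound is found whole by lookup; no rule ever needs to combine 利 and 率.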
 

G: What you are referring to, I guess, is named entity recognition. Even there, Chinese and English can differ significantly.


Liwei: No, I was not talking about NE; that is a special topic in its own right. I consider it a low-level, solved problem and do not plan to reinvent the wheel. I will just pick an off-the-shelf API for NE and tolerate its imperfections.
 
G: I wouldn't be surprised if you skip tokenization, since you can fold it into overall parsing. But for applications like Baidu search, tokenization is the end of text processing and a must-have.


Liwei: Chunking words into phrases (syntax) is by nature no different from chunking morphemes (characters) into words (morphology). Parsing without a separate "word segmentation" step is therefore possible.

In existing applications such as search engines, no big player is using parsing and deep NLP yet (they will; it is only a matter of time), so lexical features from large lexicons may not be necessary, and a lightweight, lexicon-free tokenization may be preferred. That is a different case from the one I am addressing. The NLP discussed in my post series assumes that developing a parser is the core task.
 
G: Your attack on tagging is also misleading. You basically say that if a word has two categories, just tag it with both and do no further processing. That is tagging already.

Liwei: That is not (POS) tagging in the traditional sense: traditional tagging is deterministic and relies on context. Assigning lexical features by lexicon lookup is not tagging in that sense. If you want to change the definition, then that is off topic.
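The distinction Liwei draws can be made concrete: lexicon lookup hands the parser every category a word carries and defers the choice, whereas a traditional POS tagger commits to one category per context. A toy contrast (the words and tags are my own examples, not from either system):

```python
# Toy lexicon mapping words to ALL their lexical categories.
LEXICON = {"study": {"N", "V"}, "the": {"DET"}}

def lexical_lookup(word):
    """Non-deterministic: return every listed category, making no choice.

    A traditional POS tagger would instead inspect the context and
    commit to exactly one of these tags here and now.
    """
    return sorted(LEXICON.get(word, set()))

print(lexical_lookup("study"))  # → ['N', 'V']; the parser disambiguates later
```

Under this view the ambiguity survives lookup intact, which is precisely why Liwei declines to call it tagging.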

 
G: What others do is merely one step further: say tag-a is correct 90% of the time and tag-b 10%. I built a rule-based parser before, and I found this really helpful (at least for speed). I try the high-probability tag first; if it makes sense, I take it; if not, I come back and try the other. Let me know if you don't do something like that.
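G's strategy, try the most probable tag first and backtrack to the alternative if the parse rejects it, can be sketched as follows. The probabilities and the acceptance check are hypothetical, purely for illustration:

```python
# Invented tag distribution for an ambiguous word.
TAG_PROBS = {"record": [("NOUN", 0.9), ("VERB", 0.1)]}

def choose_tag(word, context_accepts):
    """Try candidate tags in descending probability; backtrack on rejection."""
    for tag, _p in sorted(TAG_PROBS[word], key=lambda t: -t[1]):
        if context_accepts(word, tag):   # does this reading make sense here?
            return tag                   # take the first reading that fits
    return None                          # no reading survived

# A context that only licenses the verb reading (e.g. after "to"):
print(choose_tag("record", lambda w, t: t == "VERB"))  # → VERB
```

The speed benefit G mentions comes from the ordering: the 90% reading usually succeeds on the first try, so the backtrack path is rarely taken.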


Liwei: Parsing can go a long way without context-based POS tagging. But note that at the end I proposed a "one-and-a-half-step" (一步半) approach: I can do limited, simple context-based tagging for convenience's sake. The later development is adaptive and in principle does not rely on tagging.
 
G: Note that here I am not talking about 兼语词, which is essentially another distinct tag with its own properties. I know this is not 100% accurate, but I see it in Chinese as something like the gerund (动名词) in English.


Liwei: In fact, I do not see it as 兼语词, but I used that term for the sake of explaining the phenomenon (logically equivalent, though elaborating on that would take too much space). In my actual system, 学习 is a verb, only a verb (or a logical verb).
 
 
G: Then this touches on grammar theory. While we may not really need a new theory, we do need a working theory that is consistent. You may have a good one in mind, but most people do not. For example, I can see you are deeply influenced by head words (中心词) and dependency grammar. Yet not everyone is even aware of these, let alone in agreement with them. So far there is no serious competition, because there is no large-scale success story yet. We will have to wait and see which school (学派) ends up casting the longer shadow.



Liwei: It is good to be criticized. But I had a point to make there.

[Related posts]


The phase problem is a classical problem in structure analysis. In theory, until it is solved, the business of structure determination remains "unfinished".
Author: mirror
Date: 12/29/2011 10:46:20

But people doing structure analysis do not stop working just because the "phase problem" has not been fully solved. They have the spirit of "if it must be guessed, then guess it out". It used to be hard to bluff your way through; now that computers are powerful, there is nothing to fear. More than that: computation has even taken away the "rice bowl" of those who solved phases through experimental techniques. That is why Mirror does not think much of the claim that "substantial progress in Chinese processing awaits a theoretical breakthrough in Chinese grammar".

Computer language recognition is probably much the same. The problem has two aspects: response time and accuracy. Perhaps there is also the emotional coloring of tone and intonation; only at that level can we really speak of "natural language". Perhaps emotion markers will be needed, like notes in music, to express the feeling of a dialogue. After all, some things can be written but not spoken. Take forms of address: in the West this is no problem, you simply use the name. In the East it is not so simple: at home, no son calls his father by name. There are also taboo topics that are hard to raise face to face but fine to write down, "erotic literature" for instance. Who knows, the computational study of "erotic literature" may well become popular in the future. In other words, at that level one would have to start thinking about the machine's "emotions".

----------
Judge "what is" by the matter at hand; judge the matter by "what is"; judge the matter as the matter.

Mirror is a true marvel; the first paragraph hits the mark exactly. That one word "guess" (蒙) is precision itself.
Author: 立委
Date: 12/29/2011 12:45:07

Quote:
"It used to be hard to bluff your way through; now that computers are powerful, there is nothing to fear."
That is heavenly knowledge not to be leaked.

As for the second paragraph above, Brother Mirror lets his inspiration run free, and this "old craftsman" Liwei can no longer keep up.

Appendix: the origin of "old craftsman" (老匠):
kingsten_88 says:
2011-12-29 16:59
Teacher Li really is an old craftsman, recounting the fine details of Chinese and English grammatical analysis so vividly that it brought back scenes of my own past struggles. Teacher Li has spoken the truth that Chinese is not special: the phenomena of all languages are alike, differing only in degree. This shows precisely that it is the theory that falls short, not the application.

liwei999 replies:
December 30th, 2011 at 00:20

An old craftsman indeed, an old craftsman indeed.
The phrase "old craftsman" is wonderfully apt.

from 52nlp


[Pinned: An index of Liwei's NLP posts on the ScienceNet blog (updated periodically)]



http://blog.sciencenet.cn/blog-362400-523458.html

Previous post: Chinese NLP Myth No. 3: substantial progress in Chinese processing awaits a theoretical breakthrough in Chinese grammar
Next post: In the era without stocks, people played the stamp market

