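Conceptually, there is nothing wrong with deep learning. Like the weakly supervised learning and bootstrapping trends that have been all the rage for more than a decade, its direction is sound and its prospects are enticing. But how many of these trends have actually led to a revolution in practical technology? Novelty attracts attention and enthusiasm, but the real benefits remain to be seen. A while ago an old friend who works in search asked me about this topic, and this is how I answered: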
>>What do you think about the current hot topics: deep learning and knowledge graph?
I am not a learning expert, and cannot judge how practical and convenient the new deep learning trend is for solving practical problems in industry. But conceptually, it is fair to say that deep learning is in the right direction for research. For a long time, the learning community has been struggling with the dilemma between supervised and unsupervised learning, the former being tractable but facing a knowledge bottleneck (i.e. the requirement of a big labeled training corpus) and the latter only proven to work for (label-less) clustering, which usually cannot directly solve a practical problem. Now, in addition to the many different semi-supervised or weakly supervised approaches, deep learning provides yet another natural way to combine unsupervised and supervised learning. It makes a lot of sense to let unsupervised learning scratch the surface of a problem area and use the results as input for some supervised learning to take to deeper levels.
Personally, I believe that to solve a real-life problem at scale, it is best to combine manual rules with machine learning. That makes tasks much more tractable for engineering implementation.
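In overcoming the challenges above, statistics can be put to great use. Whether statistics is applied to the data, used for semi-automatic rule writing, or organically integrated into the rule system, there are many opportunities for the two to work closely together. For example, having the machine learn statistically meaningful candidate patterns and then handing them to linguists for refinement (instantiation) is an effective way to make sure the blind spots of the human mind are overcome. By the same logic as deep learning, can't machine learning, which sees the forest but not the trees, and expert hand-coding, which sees the trees but not the forest, each play to its own strengths?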
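To make the "machine proposes, linguist refines" idea concrete, here is a minimal sketch of my own, not taken from any actual system. It assumes whitespace-tokenized text and uses a raw frequency threshold as a crude stand-in for a real statistical significance test. The function name `candidate_patterns` and the toy corpus are hypothetical.

```python
from collections import Counter


def candidate_patterns(sentences, n=2, min_count=2):
    """Return word n-grams seen at least min_count times.

    Results are sorted by descending count, then alphabetically.
    Raw frequency is only a rough proxy for statistical significance.
    """
    counts = Counter()
    for sentence in sentences:
        tokens = sentence.lower().split()  # assumption: whitespace tokenization
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    frequent = [(gram, c) for gram, c in counts.items() if c >= min_count]
    return sorted(frequent, key=lambda item: (-item[1], item[0]))


# Hypothetical toy corpus, for illustration only.
corpus = [
    "the battery life is great",
    "the battery life is short",
    "the screen is great",
]

# Print the candidates a linguist would then review and turn into rules.
for gram, count in candidate_patterns(corpus):
    print(" ".join(gram), count)
```

The output is just a ranked list of candidates. Which of them deserve to become rules, and how each should be instantiated, is still left to the human expert.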
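[Postscript] The trouble with feature noise in using HowNet, mentioned above, refers to its current Chinese system. I have just verified that the inventor of HowNet evidently recognized this problem long ago; accordingly, the English HowNet has already solved it, and the problem in the Chinese version will eventually be solved as well. They have classified lexical features as follows, so that users can filter features according to their usage scenario: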