

Church - Gaps in Courses on Computational Linguistics (translated excerpts)


Translated excerpts from:

K. Church. 2011. A Pendulum Swung Too Far. Linguistic Issues in Language Technology, Volume 6, Issue 5.

3.5 Those Who Ignore History Are Doomed to Repeat It


For the most part, the empirical revivals in Machine Learning, Information Retrieval, and Speech Recognition have simply ignored the arguments of PCM (Pierce, Chomsky, and Minsky), though in the case of neural nets, the addition of hidden layers to perceptrons can be viewed as a concession to Minsky and Papert. Despite such concessions, Minsky and Papert (1988) expressed disappointment with how little progress the field had made in the two decades since the publication of Perceptrons (Minsky and Papert 1969).

"In preparing this edition we were tempted to 'bring those theories up to date.' But when we found that little of significance had changed since 1969, when the book was first published, we concluded that it would be more useful to keep the original text ... and add an epilogue. ... One reason why progress has been so slow in this field is that researchers unfamiliar with its history have continued to make many of the same mistakes that others have made before them. Some readers may be shocked to hear it said that little of significance has happened in the field. Have not perceptron-like networks - under the new name connectionism - become a major subject of discussion? ... Certainly yes, in that there is a great deal of interest and discussion. Possibly yes, in the sense that discoveries have been made that may, in time, turn out to be of fundamental importance. But certainly no, in that there has been little clear-cut change in the conceptual basis of the field. The issues that give rise to excitement today seem much the same as those that were responsible for previous rounds of excitement. ... Our position remains what it was when we wrote the book: We believe this realm of work to be immensely important and rich, but we expect its growth to require a degree of critical analysis that its more romantic advocates have always been reluctant to pursue - perhaps because the spirit of connectionism seems itself to go somewhat against the grain of analytic rigor." (Minsky and Papert 1988, Prologue, p. vii)

 

Gaps in Courses on Computational Linguistics

 

Part of the reason why we keep making the same mistakes, as Minsky and Papert note above, has to do with teaching. One side of the debate is written out of the textbooks and forgotten, only to be revived and reinvented by the next generation. Contemporary textbooks in computational linguistics have remarkably little to say about the three PCM pioneers. Pierce is not mentioned at all in Jurafsky and Martin (2000), Manning and Schütze (1999), or Manning et al. (2008). Minsky's criticism of perceptrons is briefly mentioned in just one of the three textbooks (Manning and Schütze 1999, p. 603). A student new to the field might not appreciate that the reference to "related learning algorithms" (emphasized in the excerpt below) covers a number of methods that are very popular today, such as linear and logistic regression.

"There are similar convergence theorems for some other gradient descent algorithms, but in most cases convergence will only be to a local optimum. ... Perceptrons converge to a global optimum because they select a classifier from a class of simpler models, the linear separators. There are many important problems that are not linearly separable, the most famous being the XOR problem. ... A decision tree can learn such a problem whereas a perceptron cannot. After some initial enthusiasm about perceptrons (Rosenblatt 1962), researchers realized these limitations. As a consequence, interest in perceptrons and *related learning algorithms* [emphasis added] faded quickly and remained low for decades. The publication of Minsky and Papert (1969) is often seen as the point at which interest in this genre of learning algorithms started to wane."
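The XOR limitation in the quotation above is easy to verify directly. Below is a minimal NumPy sketch of my own (illustrative code, not from any of the textbooks cited): the perceptron learning rule converges on the linearly separable AND problem but never on XOR, while a single hidden layer with hand-set weights, the "concession" to Minsky and Papert mentioned earlier, computes XOR exactly.

```python
import numpy as np

def train_perceptron(X, y, epochs=100):
    """Rosenblatt's perceptron learning rule. Returns (weights, converged);
    convergence is guaranteed only for linearly separable data."""
    Xb = np.hstack([X, np.ones((len(X), 1))])  # append a bias input
    w = np.zeros(Xb.shape[1])
    for _ in range(epochs):
        mistakes = 0
        for xi, yi in zip(Xb, y):
            pred = 1 if xi @ w > 0 else 0
            if pred != yi:
                w += (yi - pred) * xi  # update only on errors
                mistakes += 1
        if mistakes == 0:
            return w, True
    return w, False

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
print(train_perceptron(X, np.array([0, 0, 0, 1]))[1])  # AND: True (separable)
print(train_perceptron(X, np.array([0, 1, 1, 0]))[1])  # XOR: False (not separable)

# One hidden layer removes the limitation: XOR(x) = OR(x) AND NOT AND(x),
# computable with fixed weights and step units.
step = lambda z: (z > 0).astype(int)
hidden = step(X @ np.array([[1, 1], [1, 1]]) + np.array([-0.5, -1.5]))  # [OR, AND]
print(step(hidden @ np.array([1, -1]) - 0.5))  # [0 1 1 0] = XOR
```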

In their description of neural network algorithms, Manning et al. (2008) include a brief pointer to Minsky and Papert (1988) as a reference for the perceptron algorithm, with no mention of that work's sharp criticism:

"Readers interested in algorithms mentioned, but not described in this chapter, may wish to consult Bishop (2006) for neural networks, Hastie et al. (2001) for linear and logistic regression, and Minsky and Papert (1988) for the perceptron algorithm." (Manning et al. 2008, p. 292)

Based on such a pointer, a student might come away with the mistaken impression that Minsky and Papert are fans of these neural network algorithms (and of currently popular related methods such as linear and logistic regression).

Bishop (2006, p. 193) makes it clear that Minsky and Papert were no fans of perceptrons and neural networks, and he dismisses their position as an "incorrect conjecture." Bishop points to the widespread use of neural networks in practical applications as counter-evidence to Minsky and Papert's claims that "not much has changed" and that "multilayer networks will be no more able to recognize connectedness than are perceptrons."

Contemporary textbooks ought to teach students both the strengths and the weaknesses of useful approximations such as neural networks. Both sides of the debate have much to offer. We do the next generation a disservice when we write either side out of the argument, especially when the criticism is as sharp as "incorrect conjecture" and "not much has changed."

Chomsky receives more coverage than Pierce and Minsky in contemporary textbooks. Manning and Schütze (1999) cite Chomsky 10 times, and the index of Jurafsky and Martin (2000) contains 27 references to him. The first book has fewer references because it focuses on a relatively narrow topic, statistical natural language processing, whereas the second covers a much broader range of topics, including phonology and speech. Thus the second book, unlike the first, also cites Chomsky's work in phonology (Chomsky and Halle 1968).

Both textbooks mention Chomsky's criticism of finite-state methods and the devastating effect it had on empirical methods at the time, but they quickly move on to describe the revival of those methods, with relatively little discussion of the original arguments, the motivations for the empiricist comeback, or its implications for current practice and for the future.

Jurafsky and Martin (2000, pp. 230-231) write:

"In a series of extremely influential papers starting with Chomsky (1956) and including Chomsky (1957) and Miller and Chomsky (1963), Noam Chomsky argued that 'finite-state Markov processes,' while a possibly useful engineering heuristic, were incapable of being a complete cognitive model of human grammatical knowledge. These arguments led many linguists and computational linguists away from statistical models altogether.

"The resurgence of N-gram models came from the work of Jelinek, Mercer, and Bahl. ..."

Both textbooks open their discussion of N-gram models with a few quotes on their merits and shortcomings, pro and con (Jurafsky and Martin 2000, p. 191):

"But it must be recognized that the notion 'probability of a sentence' is an entirely useless one, under any known interpretation of this term." (Chomsky 1965, p. 57)

"Anytime a linguist leaves the group the recognition rate goes up." (Fred Jelinek, then of the IBM speech group, 1988)

Manning and Schütze (1999, p. 2) start the discussion with these quotes:

"Statistical considerations are essential to an understanding of the operation and development of languages." (Lyons 1968, p. 98)

"One's ability to produce and recognize grammatical utterances is not based on notions of statistical approximations and the like." (Chomsky 1957, p. 16)

Quotes like these introduce the student to the existence of a controversy, but they do not really help the student appreciate what the controversy means for them. We should remind students that Chomsky objected to a number of finite-state methods that are extremely popular today, including N-grams and Hidden Markov Models, because he believed such methods cannot capture long-distance dependencies (e.g., agreement constraints and wh-movement).
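To make the long-distance objection concrete, here is a toy sketch of my own (the two-sentence corpus is invented for illustration, not an example from Church): a maximum-likelihood bigram model conditions each word only on its immediate predecessor, so subject-verb agreement across an intervening prepositional phrase is invisible to it.

```python
from collections import Counter, defaultdict

# Tiny corpus in which the subject and its verb are four words apart.
corpus = [
    "the dog near the trees barks".split(),
    "the dogs near the tree bark".split(),
]

# Maximum-likelihood bigram counts for P(w_i | w_{i-1}).
bigrams = defaultdict(Counter)
for sent in corpus:
    for prev, cur in zip(sent, sent[1:]):
        bigrams[prev][cur] += 1

def bigram_prob(prev, cur):
    total = sum(bigrams[prev].values())
    return bigrams[prev][cur] / total if total else 0.0

# The verb is conditioned only on the adjacent distractor noun: the model
# learns P(barks | trees) = 1.0 and P(bark | tree) = 1.0, while the true
# controller of agreement ("dog"/"dogs") lies outside the bigram window.
print(bigram_prob("trees", "barks"))  # 1.0
print(bigram_prob("tree", "bark"))    # 1.0
```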

Chomsky's position remains controversial to this day, as evidenced by an objection from one of the reviewers of this paper. I do not wish to take sides in that debate here. I am merely asking that we teach both sides of it to the next generation, so that they will not have to rediscover whichever side we fail to teach.

 

Computational Linguistics Students Should Be Trained in General Linguistics and Phonetics

 

To prepare students for what might come after the low-hanging fruit has been picked over, today's education should aim for breadth: students should study the major branches of linguistics, such as syntax, morphology, phonology, phonetics, historical linguistics, and language universals. The computational linguistics students we graduate these days are too narrowly specialized; they have very deep knowledge of one particular sub-area (such as machine learning or statistical machine translation) but may never have heard of well-known linguistic phenomena such as Greenberg's Universals, Raising, Equi, quantifier scope, gapping, and island constraints. We should make sure that students working on co-reference know about c-command and disjoint reference. When students present a paper at a computational linguistics conference, they should be expected to know the standard treatment of their topic in Formal Linguistics.

Students working on speech recognition need to know about lexical stress (e.g., Chomsky and Halle 1968). Phonological stress has all sorts of consequences for downstream phonetic and acoustic processes.

Figure 3: The spectrograms of "politics" and "political" show three allophones of /l/; different variants appear before and after the stressed syllable.

 

It is a real missed opportunity that speech recognizers currently make little use of lexical stress, since stress is one of the more salient properties of the speech signal. Figure 3 shows waveforms and spectrograms for the minimal pair "politics" and "political." There are many differences between these two words; current technology focuses on the differences at the segmental level:

1. "Politics" ends with -s, whereas "political" ends with -al.

2. Unlike the first vowel of "politics," the first vowel of "political" is a reduced schwa.

The differences in stress are even more salient. Among the many stress-related differences, Figure 3 calls out the contrast between the pre-stress and post-stress allophones of /l/. There are also consequences for the /t/s: the /t/ in "politics" is aspirated, whereas in "political" it is flapped.
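As a toy illustration of how stress conditions allophones, the sketch below (my own, not Church's) labels the /l/ and /t/ of the two words from stress-marked transcriptions. The transcriptions are ARPAbet with CMUdict-style stress digits (1 = primary, 2 = secondary, 0 = unstressed) and should be checked against a real pronunciation dictionary; the allophone rules are the usual textbook rules of thumb, not a claim about how any recognizer works.

```python
# ARPAbet transcriptions with stress digits on the vowels (assumed, CMUdict-style).
LEXICON = {
    "politics":  ["P", "AA1", "L", "AH0", "T", "IH2", "K", "S"],
    "political": ["P", "AH0", "L", "IH1", "T", "IH0", "K", "AH0", "L"],
}

def is_vowel(p):
    return p[-1].isdigit()  # ARPAbet vowels end in a stress digit

def l_context(phones):
    """Is the first /L/ before or after the primary-stressed vowel?"""
    stress_i = next(i for i, p in enumerate(phones) if p.endswith("1"))
    return "pre-stress /l/" if phones.index("L") < stress_i else "post-stress /l/"

def t_allophone(phones, i):
    """Rule of thumb for American English /T/: aspirated when it onsets a
    stressed syllable; flapped between a stressed and an unstressed vowel."""
    nxt = next((p for p in phones[i + 1:] if is_vowel(p)), None)
    prv = next((p for p in reversed(phones[:i]) if is_vowel(p)), None)
    if nxt and nxt[-1] in "12":
        return "aspirated /t/"
    if prv and prv[-1] in "12" and nxt and nxt[-1] == "0":
        return "flapped /t/"
    return "plain /t/"

for word, phones in LEXICON.items():
    print(word, l_context(phones), t_allophone(phones, phones.index("T")))
# politics:  post-stress /l/, aspirated /t/ (onsets the secondary-stressed syllable)
# political: pre-stress /l/,  flapped /t/  (between stressed IH1 and unstressed IH0)
```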

Currently there is still plenty of low-hanging fruit to work on at the segmental level, but that work will eventually run out. We ought to teach students in speech recognition about the phonology and acoustic-phonetics of lexical stress, so that they will still be in their element when the state of the art moves past the bottlenecks at the segmental level. Since stress involves long-distance dependencies that span more than tri-phones, progress on stress will require a solid understanding of the strengths and weaknesses of currently popular approximations. Fundamental advances in speech recognition, such as effective use of stress, will likely require fundamental advances in the underlying technology.

