Translator's note:
This article, written in Chinese by Prof. S. Bai, is a wonderful piece that I recommend to all natural language scholars. Prof. Bai's critical study of Chomsky's formal language theory as applied to natural language reaches a depth I have not seen since Chomsky's revolution in the 1950s. Although many scholars have published many papers on Chomsky over the decades, this novel "caterpillar" theory still stands out. Based on my decades of natural language parsing study and practice, it strikes me as an insight that explains much more clearly and deeply how natural language should be modeled in a formalism (in our practice, I call the caterpillar FSA++, an extension of the regular grammar formalism that is adequate for multi-level deep parsing of natural language). For example, many people have been trapped by Chomsky's recursion theory and have made endless futile efforts to find a linear or near-linear algorithm for the so-called recursive nature of natural language, which is practically non-existent (see Chomsky's Negative Impact). There used to be heated debates in computational linguistics on whether natural language is context-free, context-sensitive, or mildly context-sensitive, as some scholars call it. Such debates mechanically apply Chomsky's formal language hierarchy to natural languages and are trapped in metaphysical academic controversies, far from language facts and data. In contrast, Prof. Bai's original "Caterpillar" theory presents a novel picture that offers insight into the true nature of natural languages.
S. Bai: Natural Language Caterpillar Breaks through Chomsky's Castle
Tags: Chomsky Hierarchy, computational linguistics, Natural Language Processing, linear speed
This is a technical article; do not be fooled by the title, which may sound like a story about a bug in some VIP's castle. If you are neither an NLP professional nor an NLP fan, you can stop here; there is no need to continue the journey with me on this topic.
Chomsky's Castle refers to the famous Chomsky Hierarchy in formal language theory, built more than half a century ago by Noam Chomsky, the father of contemporary linguistics. According to this theory, the language castle has four concentric walls. The outermost wall is Type-0, also called Phrase Structure Grammar, whose parsing device is equivalent to a Turing machine. The second wall is Type-1, or Context-sensitive Grammar (CSG), corresponding to a parsing device called a linear bounded automaton; its recognition problem is NP-hard (PSPACE-complete, to be precise). The third wall is Type-2, or Context-free Grammar (CFG), corresponding to a pushdown automaton, with polynomial time complexity: the best known worst-case asymptotic order lies somewhere between quadratic and cubic in the length of the input sentence. The innermost wall is Type-3, or Regular Grammar, corresponding to a deterministic finite-state automaton, with linear time complexity. The sketch of the 4-wall Chomsky Castle is illustrated below.
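To make the gap between the two innermost walls concrete, here is a minimal Python sketch. It is not taken from Prof. Bai's article: the toy languages (an even number of `a`s for Type-3, and aⁿbⁿ for Type-2) and all function names are assumptions chosen for illustration. The finite-state recognizer makes one pass over the input, while the textbook CYK recognizer for a grammar in Chomsky normal form uses three nested loops over positions.

```python
def dfa_accepts(transitions, start, accepting, text):
    """Type-3 recognition: one table lookup per symbol, so time is linear."""
    state = start
    for symbol in text:
        state = transitions.get((state, symbol))
        if state is None:  # no transition defined: reject
            return False
    return state in accepting


# Hypothetical regular language: strings over {a, b} with an even number of 'a'.
EVEN_A = {
    ("even", "a"): "odd", ("even", "b"): "even",
    ("odd", "a"): "even", ("odd", "b"): "odd",
}


def cyk_accepts(unary, binary, start, tokens):
    """Type-2 recognition with CYK for a grammar in Chomsky normal form.

    unary maps a terminal to the nonterminals that rewrite to it;
    binary maps a pair (B, C) to the nonterminals A with a rule A -> B C.
    The three nested loops over positions give cubic time in len(tokens).
    """
    n = len(tokens)
    if n == 0:
        return False
    # table[i][length] holds the nonterminals deriving tokens[i:i + length]
    table = [[set() for _ in range(n + 1)] for _ in range(n)]
    for i, token in enumerate(tokens):
        table[i][1] = set(unary.get(token, ()))
    for length in range(2, n + 1):
        for i in range(n - length + 1):
            for split in range(1, length):
                for left in table[i][split]:
                    for right in table[i + split][length - split]:
                        table[i][length] |= binary.get((left, right), set())
    return start in table[0][n]


# Hypothetical context-free language a^n b^n (n >= 1):
# S -> A B | A T,  T -> S B,  A -> a,  B -> b
UNARY = {"a": {"A"}, "b": {"B"}}
BINARY = {("A", "B"): {"S"}, ("A", "T"): {"S"}, ("S", "B"): {"T"}}

print(dfa_accepts(EVEN_A, "even", {"even"}, "abab"))
print(cyk_accepts(UNARY, BINARY, "S", list("aabb")))
```

The contrast is only meant to show where the cost comes from: the automaton never looks back, whereas CYK must consider every way of splitting every substring.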
This castle of Chomsky has influenced generations of scholars, mainly along two lines. The first can be called "the outward fear syndrome". Because recognition at the second wall (context-sensitive) is NP-hard, everything at or beyond that wall becomes a Forbidden City as long as P = NP remains unproved. Thus the whole burden of describing natural languages has been confined within the third wall (context-free). Everyone knows that natural language involves some context sensitivity, but computation is no longer tractable once we step beyond the third wall of CFG, so context sensitivity has had to be left out.
The second line of impact can be called "the inward perfection syndrome". The initial success of Type-2 grammar (CFG) was followed by a severe abuse of recursion. When the number of recursive layers increases even slightly, the acceptability of a sentence quickly drops to almost zero. For example, "The person that hit Peter is John" looks fine, but "The person that hit Peter that met Tom is John" starts to sound weird, and "The person that hit Peter that met Tom that married Mary is John" is close to gibberish. In fact, most of the resources that affect parsing efficiency are spent on coping with such abuse of recursion in gibberish-like sentences that are rarely seen in real-life language. For natural language processing to be practical, the pursuit of "linear speed", that is, processing time proportional to the length of the input, cannot be overemphasized. If we reflect on the human language understanding process, it certainly also runs at "linear speed" with respect to the length of the speech input. The abuse of recursion is most likely triggered by the "inward perfection syndrome", which drives us to cover every inch of the land within the third wall of CFG, even if that land is piled up with gibberish or garbage.
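One way to see why giving up deep recursion buys speed: once the nesting depth is capped, a recognizer only needs a counter with finitely many values, which is a finite-state device. The sketch below is an illustration under assumptions, not a method from the article; brackets stand in for nested layers, and `max_depth` is a hypothetical cap.

```python
def accepts_with_depth_limit(text, max_depth):
    """Recognize balanced brackets whose nesting never exceeds max_depth.

    With max_depth fixed, depth takes only max_depth + 1 values, so this is
    a finite-state recognizer and runs in time linear in len(text).
    """
    depth = 0
    for ch in text:
        if ch == "(":
            depth += 1
            if depth > max_depth:  # deeper than the cap: treat as unacceptable
                return False
        elif ch == ")":
            depth -= 1
            if depth < 0:  # closing bracket with nothing open
                return False
        # any other symbol is ignored in this toy model
    return depth == 0


print(accepts_with_depth_limit("(()())", 2))
print(accepts_with_depth_limit("((()))", 2))
```

Unbounded bracket nesting needs a pushdown automaton; the capped version gives up the deeply nested strings, which in the article's terms are the ones that no longer sound like human language.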
In a sense, one reason the statistical approach dominated the rule-based approach for so long in academic natural language processing is the combined effect of these two syndromes. Many researchers have made all kinds of efforts to overcome them, and these efforts are reviewed below one by one.
On the side of the outward fear syndrome, evidence against context-freeness has been found in certain constructions in Swiss German. Chinese has similar examples when it expresses a "respective" correspondence between conjoined items and their descriptions. For example, “张三、李四、王五的年龄分别是25岁、32岁、27岁,出生地分别是武汉、成都、苏州” ("The ages of Zhang San, Li Si, and Wang Wu are respectively 25, 32, and 27, and their birthplaces are respectively Wuhan, Chengdu, and Suzhou"). Here the three named entities form a list of nouns. In general, the length of such a conjoined list cannot be fixed in advance. The two lists of descriptors that follow are also of variable length, but each must match the antecedent list item by item, and therefore have the same length. This equal-length condition is beyond the expressive power of the context-free formalism. To capture it, we have to get out of the third wall.
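The equal-length condition has the same shape as the textbook language aⁿbⁿcⁿ, which is known not to be context-free. Stepping over the third wall for this pattern need not cost linear speed, as the following sketch suggests. The mapping of `a`, `b`, `c` to names, ages, and birthplaces is an assumption made for illustration, as is the function name.

```python
from itertools import groupby


def accepts_anbncn(text):
    """Recognize a^n b^n c^n (n >= 1) in one left-to-right pass.

    Think of each 'a' as a name, each 'b' as an age, and each 'c' as a
    birthplace in the "respectively" sentence: three runs of equal length.
    """
    # Collapse the input into (symbol, run length) pairs.
    runs = [(symbol, sum(1 for _ in group)) for symbol, group in groupby(text)]
    symbols = [symbol for symbol, _ in runs]
    counts = [count for _, count in runs]
    return symbols == ["a", "b", "c"] and counts[0] == counts[1] == counts[2]


print(accepts_anbncn("aaabbbccc"))  # three names, three ages, three birthplaces
print(accepts_anbncn("aaabbcc"))    # lists of unequal length
```

No context-free grammar generates exactly this language, yet the check is a single pass with a few counters. This is the kind of limited outward breakthrough the article argues for: a narrow step beyond CFG that stays cheap.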
As for overcoming "the inward perfection syndrome", the pursuit of "linear speed" in the field of NLP has never stopped. It ranges from LR(k) grammars, which allow a look-ahead mechanism, to cascaded finite-state automata, to probabilistic CFG parsers that are trained on a large treebank and eventually converted into an N-gram model (N = 5 or even larger). We may even count RNN/LSTM models, the statistical school's lone venture into deep parsing. All these efforts strive to define a subclass of Type-2 CFG that reaches linear speed while retaining adequate linguistic power. In fact, every parser that has survived the fight against statistical methods is, to some degree, a result of overcoming "the inward perfection syndrome", with relative success in pursuing linear speed while respecting linguistic principles. The resulting restricted subclass, compared with the area within the original third wall of CFG, is a greatly "squashed" land.
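The cascaded finite-state idea mentioned above can be sketched in a few lines. This is a toy illustration under assumptions: the tag set, the three levels, and the patterns are hypothetical and far simpler than any real cascade (including the FSA++ mentioned in the translator's note, which is not shown here). Each level is a regular pattern applied to the output of the previous level.

```python
import re

# Each level rewrites a flat sequence of labels with one regular pattern.
# The levels and patterns are hypothetical, chosen only for illustration.
CASCADE = [
    ("NP", re.compile(r"\b(?:DT )?(?:JJ )*(?:NN )+")),  # noun phrase chunks
    ("VP", re.compile(r"\bVB (?:NP )?")),               # verb plus optional object
    ("S", re.compile(r"\bNP VP ")),                     # subject plus predicate
]


def cascade_parse(tags):
    """Apply each finite-state level in turn and record every layer."""
    layer = " ".join(tags) + " "  # every label is followed by one space
    history = [layer.strip()]
    for label, pattern in CASCADE:
        layer = pattern.sub(label + " ", layer)
        history.append(layer.strip())
    return history


# Part-of-speech tags for a sentence such as "the old man saw a dog".
for layer in cascade_parse(["DT", "JJ", "NN", "VB", "DT", "NN"]):
    print(layer)
```

With a fixed number of levels and simple patterns like these, each level is one pass over a sequence that only gets shorter, so the whole cascade stays close to linear in sentence length. Structure is built level by level, with no unbounded recursion.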
If we agree that everything in parsing should take real-life natural language as both its starting point and its ultimate landing point, it is easy to see that a limited outward breakthrough and a massive inward compression are two sides of the same coin. We want a formalism that balances both sides. In other words, our ideal natural language parsing formalism should look like a linguistic "caterpillar" breaking through the walls of Chomsky's castle, as illustrated below:
It seems to me that such a "caterpillar" may have already been found by someone. It will not take too long before we can confirm it.
Original article in Chinese from 《穿越乔家大院寻找“毛毛虫”》
【Related】
K. Church: A Pendulum Swung Too Far, Linguistic Issues in Language Technology, 2011; 6(5)
Overview of Natural Language Processing
Dr. Wei Li’s English Blog on NLP
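[Liwei's note]
Prof. Bai Shuo's article deserves careful study and reflection by every natural language scholar. On first reading I could only applaud in admiration. Prof. Bai gives the deepest and sharpest analysis I have seen so far of how applying Chomsky's formal language theory to natural language has misled the field, and he does so in plain, vivid, and witty language. With so many scholars over so many years, why has no one reached this depth? Countless people have fallen into Chomsky's recursion trap, wasting effort on phenomena that are "not human language" and taking endless detours. The field has seen many lengthy papers that mechanically apply Chomsky's hierarchy and argue endlessly within the frame of whether natural language is context-free or context-sensitive, along with compromise positions such as "mildly sensitive". These metaphysical, pedantic debates mostly miss the point and stay far from language facts. Prof. Bai's original "caterpillar" theory vividly breaks through these rigid frames.
Prof. Bai's own summary reads: 'If we accept the principle that "everything takes real natural language as its starting point and final landing point", then we should admit that a limited outward breakthrough and a massive inward compression are two sides of the same coin.' These are golden words, and they ring true.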
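The poet Hong offers a verse in testimony:
The Qiao (Chomsky) compound has four tiers, its walls sealed tight without a seam.
Only when the long worm sprawls across the walls does natural language spring to life.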
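Bai Shuo
Crossing the Qiao Family Compound in Search of the "Caterpillar"
Judging from the title, you probably thought this article was about the Qiao Family Compound in Shanxi. It is not. This is a brain-taxing technical post. If you are neither an NLP professional nor an NLP enthusiast, there is no need to read on.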
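The "Qiao Family Compound" here is the classification of formal language types drawn up by old Mr. Chomsky (乔姆斯基), the founding father of contemporary linguistics. The outermost wall is Type-0 grammar, also called phrase structure grammar, whose parsing mechanism is equivalent to a Turing machine, that is, Turing-computable. The second wall is Type-1 grammar, also called context-sensitive grammar, whose recognition problem is NP-hard (PSPACE-complete, to be precise). The third wall is Type-2 grammar, also called context-free grammar, whose parsing mechanism has polynomial time complexity; the best known worst-case asymptotic order lies between the square and the cube of the input sentence length. The innermost wall is Type-3 grammar, also called regular grammar, whose parsing mechanism is equivalent to a deterministic finite-state automaton with linear time complexity. These nested rings are summarized in the figure below: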
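The compound built by old Mr. Chomsky has influenced several generations, in the following two ways.
The first we may call the "outward fear complex". Because recognition at the second ring is NP-hard, everything outside the second ring seems off limits as long as P = NP remains unproved, sentenced to death before anyone has even touched it. As a result, the whole burden of describing natural language has been concentrated inside the third wall, that is, on context-free grammar. Everyone knows perfectly well that natural language is context-sensitive and would like to climb over the wall, but since computation cannot cope once outside it, the idea has to be dropped. Light the lanterns in courtyard 0... light the lanterns in courtyard 1... the red lanterns are raised high; red means stop, and no one may step outside.
The second we may call the "inward perfection complex". Type-2 grammar became all the rage and achieved local success, but it also brought a bad habit: the abuse of recursion. When the number of recursive layers grows even slightly, human acceptability of certain sentence patterns decays rapidly to almost zero. For example, “我是县长派来的” ("I was sent by the county head") is fine, “我是县长派来的派来的” ("I was sent by the one sent by the county head") is a bit awkward, and “我是县长派来的派来的派来的” hardly sounds like human language. Yet most of the resources that affect parsing efficiency are spent on coping with this kind of recursion abuse that "does not sound like human language". For natural language processing to be practical, "linear speed" is the hard requirement. On reflection, our own human language understanding certainly also stays within "linear speed". The abuse of recursion originates in the "inward perfection complex": the determination to cover even the most remote corners inside the third wall, even where they are nothing but a garbage heap of examples that "do not sound like human language".
It can be said that, in natural language processing, the long dominance of statistical methods over rule-based methods is to some extent the combined result of the outward fear complex and the inward perfection complex. Many dedicated people in the NLP field have also made all kinds of efforts to break these two complexes.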
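First, the outward fear complex. It was pointed out long ago that Swiss German contains linguistic phenomena that cannot be described by a context-free grammar. In fact, Chinese behaves the same way in expressions involving “分别” ("respectively"). For example: “张三、李四、王五的年龄分别是25岁、32岁、27岁,出生地分别是武汉、成都、苏州。” ("The ages of Zhang San, Li Si, and Wang Wu are respectively 25, 32, and 27, and their birthplaces are respectively Wuhan, Chengdu, and Suzhou.") Here “张三、李四、王五” forms a list of nouns. A general syntactic description of such lists must allow variable length, but the lists carried by the two later occurrences of “分别”, though also of variable length, must be equal in length to the first list. This equality condition cannot be expressed by a context-free grammar; one has to step outside the third wall.
Next, the inward perfection complex. Efforts to achieve "linear speed" have never stopped in the NLP field. From LR(k) grammars, which allow a look-ahead mechanism, to stacked finite automata, to probabilistic context-free grammar parsers trained on large treebanks and eventually converted into N-gram models (N = 5 or even larger), and even RNN/LSTM models, the statistical camp's lone venture into deep natural language processing, all have tried to carve out of Type-2 grammar a subclass that is both linguistically meaningful enough and capable of linear-speed processing. It can be said that every parser that has survived the fight against statistical methods has, to some degree, shaken off the inward perfection complex and achieved relative success in pursuing linear speed while basically respecting linguistic regularities. This restricted subclass is greatly "flattened" compared with the third wall.
If we accept the principle that "everything takes real natural language as its starting point and final landing point", then we should admit that a limited outward breakthrough and a massive inward compression are two sides of the same coin. We hope for a formal mechanism that takes care of both sides at once. In other words, our ideal formal description of natural language syntax should look like a "caterpillar" crossing the Qiao Family Compound, as shown in the figure below:
My own bold guess is that someone may already have found such a "caterpillar", and that time will tell before long.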