LSA is a cool machine learning technique that uses lexical co-occurrence evidence to decode the underlying semantic categories (clustering or classification) of a given text (Deerwester et al. 1990). Typically, the first step of LSA is to construct a word-vs.-document co-occurrence matrix. Then singular value decomposition (SVD) is performed on this matrix. The key idea of LSA is to reduce noise, i.e. insignificant association patterns, by filtering out the weak components uncovered by SVD.
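To make that pipeline concrete, here is a minimal sketch in plain NumPy. The count matrix and the choice of k components are made up purely for illustration; a real system would build the matrix from a corpus and pick k empirically.

```python
import numpy as np

# Toy word-vs-document count matrix (rows = terms, columns = documents).
# The words and counts here are hypothetical, just to show the mechanics.
A = np.array([
    [2, 0, 1, 0],   # "grammar"
    [1, 0, 2, 0],   # "parse"
    [0, 3, 0, 1],   # "matrix"
    [0, 1, 0, 2],   # "vector"
], dtype=float)

# Full SVD: A = U @ diag(s) @ Vt, with singular values s in descending order.
U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Keep only the k strongest components; the discarded tail is treated
# as noise, i.e. the insignificant association patterns LSA filters out.
k = 2
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

# Documents can now be compared in the reduced k-dimensional "latent
# semantic" space instead of the raw word space (one column per document).
doc_vectors = np.diag(s[:k]) @ Vt[:k, :]
```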
Given that there is no parsing, no structure, and hence no "understanding" involved in LSA, it is amazingly successful in some areas that are supposed to require Natural Language Understanding (NLU) or Artificial Intelligence (AI). For example, it is a dominant approach to the automatic grading of high school reading comprehension tests; at least it was dominant eight years ago, when I was collaborating with education researchers on a new parsing-based approach to this task to compete with the popular LSA approach. The reason for its (partial) success in uncovering some natural language semantics lies in the fact that sentences have two sides: structures (trees) and words (nodes). Structures aside, the words used in a natural language document (discourse) are not a random collection; an inherent lexical coherence holds them together to make sense. In addition, the lexical coherence evidence and the structural evidence often overlap in reflecting the underlying semantics, at least to a certain extent. Therefore, for some coarse-grained semantic tasks, it is possible to maximize the use of the lexical side of the evidence and ignore the structural part of language. But LSA has a fundamental defect, precisely this lack of structure, that limits how far it can go in decoding semantics.
The weakness of LSA is the same as that of most other so-called "bag-of-words" (BOW) learning approaches based on keyword density or co-occurrence. Since LSA involves no structures or understanding, it is at best an approximation of the effect of parsing-based (or understanding-based) approaches for almost all tasks involving natural language text. In other words, in theory (and in practice as well, as long as the parser is not built by inexperienced linguists) its quality can hardly beat a parsing-based rule system.
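A tiny illustration of the defect, with a made-up sentence pair: two sentences with opposite meanings but identical word content are indistinguishable to any BOW representation, LSA included, because the structural evidence is exactly what has been thrown away.

```python
from collections import Counter

# Hypothetical example: same words, opposite meanings.
s1 = "the dog bites the man"
s2 = "the man bites the dog"

# A bag-of-words model sees only word counts, not word order or structure.
bow1 = Counter(s1.split())
bow2 = Counter(s2.split())

print(bow1 == bow2)  # True: the two sentences collapse into one BOW vector
```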
Another weakness of LSA is that it is much more difficult to debug a learned system for a given error or error type in the results. Either you tolerate the error, or you re-train LSA with new or expanded data, in which case there is no guarantee that the error gets corrected in the bulk results. In a rule-based system of multiple levels, it is much easier to localize the source of an error and fix it. My own experience with using LSA for synonym clustering is that when I examine the results, I sort of feel that they make sense, but there are numerous cases that are beyond comprehension: it was difficult to determine whether the incomprehensible part of the results was due to noise in imperfect data and/or bugs in the algorithm, and hence difficult to come up with effective corrective methods.
LSA is often cited as an alternative to the grammar approach, partly, I guess, because it got a good, eye-catching name. It suddenly shortens the distance between sentence meaning and the building-block words, without the trouble of having to use structures as a bridge. (But language is structured! As surely as the earth revolves.)
[Related posts]
泥沙龙笔记: On the question of children's language having no grammar (2015-07-01)