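Sharing 《镜子大全》 and 《朝华午拾》 http://blog.sciencenet.cn/u/liwei999 Once a Little Red Guard, later sent down to the countryside to "repair the earth"; left home and country in 1991, destination unknown.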


【R&D Notes: Can Semantics Be Decoded Without Linguistic Structure? A Brief Discussion of LSA】

2013-3-24 15:10 | Category: Liwei's Popular Science | System category: Research Notes | Keywords: structure, language, LSA, semantics

>>what are your views on Latent Semantic Analysis (LSA)?

LSA is a cool machine learning technique that uses lexical co-occurrence evidence to decode the underlying semantic categories of a given text, through clustering or classification (Deerwester et al. 1990). Typically, the first step of LSA is to construct a word-by-document co-occurrence matrix. Singular value decomposition (SVD) is then performed on this co-occurrence matrix. The key idea of LSA is to reduce noise or insignificant association patterns by filtering out the insignificant components uncovered by SVD.
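To make the two steps concrete, here is a minimal sketch of that pipeline, assuming NumPy is available. The four-document toy corpus, the helper names (`term_document_matrix`, `lsa`, `cosine`) and the choice of k = 2 are all illustrative assumptions, not anything taken from a real LSA system.

```python
import numpy as np

# Toy corpus (assumption): stopwords already removed, two topics.
docs = [
    ["doctor", "treated", "patient", "hospital"],
    ["nurse", "helped", "doctor", "hospital"],
    ["team", "won", "football", "match"],
    ["coach", "praised", "football", "team"],
]

def term_document_matrix(docs):
    """Step 1: word-by-document co-occurrence counts."""
    vocab = sorted({w for d in docs for w in d})
    row = {w: i for i, w in enumerate(vocab)}
    X = np.zeros((len(vocab), len(docs)))
    for j, d in enumerate(docs):
        for w in d:
            X[row[w], j] += 1.0
    return vocab, X

def lsa(X, k):
    """Step 2: SVD, keeping only the k largest components."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return U[:, :k], s[:k], Vt[:k, :]

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

vocab, X = term_document_matrix(docs)
U_k, s_k, Vt_k = lsa(X, k=2)

# Document vectors in the reduced latent space, one row per document.
doc_vecs = (s_k[:, None] * Vt_k).T

raw_same = cosine(X[:, 0], X[:, 1])                   # raw word counts
latent_same = cosine(doc_vecs[0], doc_vecs[1])        # same topic
latent_cross = cosine(doc_vecs[0], doc_vecs[2])       # different topics
```

On this toy data the two medical documents share only half of their words, so their raw cosine is 0.5, while in the truncated space they become nearly identical and remain unrelated to the football documents. Real corpora are far less clean, and k is normally tuned empirically.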

Given that there is no parsing, no structures, and hence no "understanding" involved in LSA, it is amazingly successful in some areas that are supposed to require Natural Language Understanding (NLU) or Artificial Intelligence (AI). For example, it is a dominant approach in the automatic grading of high school reading comprehension tests; at least it was dominant 8 years ago, when I was collaborating with education researchers in proposing a new parsing-based approach to this task to compete with the popular LSA approach. The reason for its (partial) success in uncovering some natural language semantics lies in the fact that sentences have two sides: structures (trees) and words (nodes). Putting structures aside, the words used in a natural language document (discourse) are not a random collection; they have an inherent lexical coherence that holds them together to make sense. In addition, the lexical coherence evidence and the structural evidence often overlap to a certain extent in reflecting the underlying semantics. Therefore, for some coarse-grained semantic tasks, it is possible to do the job by maximizing the use of the lexical side of the evidence while ignoring the structural side of language. But there is a fundamental defect in LSA that limits how far it can go in decoding semantics: the lack of structures.

In my past research, we used LSA in our Word Sense Disambiguation (WSD) project as an auxiliary method for synonym expansion, in order to generalize our parsing evidence from literal nodes to cluster nodes. It seemed to be effective to a certain extent, but it cannot be claimed to be better than a synonym lexicon encoded by linguists, had we had the human resources for one. It does have the benefit of clustering synonyms automatically from the data, and hence adapting automatically to the domain we are interested in.
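As a rough illustration of how word vectors from LSA can drive this kind of expansion, the sketch below looks up the latent-space neighbors of a word. It is not the method used in the WSD project; the toy corpus, the `expand` helper and the 0.9 cosine threshold are assumptions made for the example, and NumPy is assumed to be installed.

```python
import numpy as np

# Toy corpus (assumption): stopwords removed, two topical clusters.
docs = [
    ["doctor", "treated", "patient", "hospital"],
    ["nurse", "helped", "doctor", "hospital"],
    ["team", "won", "football", "match"],
    ["coach", "praised", "football", "team"],
]
vocab = sorted({w for d in docs for w in d})
row = {w: i for i, w in enumerate(vocab)}

# Word-by-document count matrix.
X = np.zeros((len(vocab), len(docs)))
for j, d in enumerate(docs):
    for w in d:
        X[row[w], j] += 1.0

# Truncated SVD: keep the k largest components.
U, s, _ = np.linalg.svd(X, full_matrices=False)
k = 2
word_vecs = U[:, :k] * s[:k]  # one latent vector per word

def expand(word, threshold=0.9):
    """Return the words whose latent vectors lie close to `word`."""
    v = word_vecs[row[word]]
    neighbors = []
    for other in vocab:
        if other == word:
            continue
        u = word_vecs[row[other]]
        cos = float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))
        if cos >= threshold:
            neighbors.append(other)
    return neighbors

patient_cluster = expand("patient")
```

Here "patient" and "nurse" never occur in the same document, yet they land in the same cluster because both co-occur with "doctor" and "hospital". The cluster also pulls in "treated" and "helped", which are topically related rather than synonymous, so a data-driven cluster of this kind is looser than a hand-built synonym lexicon.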

LSA shares its main weakness with most other so-called "bag of words" (BOW) learning approaches based on keyword density or co-occurrence. Since LSA does not involve structures or understanding, it is at best an approximation to the effect of parsing-based (or understanding-based) approaches for almost all tasks involving natural language text. In other words, its quality in theory (and in practice as well, as long as the parser is not built by inexperienced linguists) can hardly beat that of a parsing-based rule system.
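A small sketch shows what is lost when structure is thrown away. The two sentences and the `naive_svo` helper are toy assumptions; the helper only handles the fixed pattern "the X VERB the Y" and stands in for what a real parser would provide.

```python
from collections import Counter

def bag_of_words(sentence):
    """BOW view: word counts only, word order discarded."""
    return Counter(sentence.lower().split())

def naive_svo(sentence):
    """Toy stand-in for a parser, valid only for 'the X VERB the Y'."""
    tokens = sentence.lower().split()
    return {"agent": tokens[1], "verb": tokens[2], "patient": tokens[4]}

s1 = "the dog bit the man"
s2 = "the man bit the dog"

same_bag = bag_of_words(s1) == bag_of_words(s2)
same_structure = naive_svo(s1) == naive_svo(s2)
```

The two sentences have identical bags of words, so any model built purely on the word-by-document matrix must treat them alike, while even this crude structural analysis separates who bit whom.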

Another weakness of LSA is that it is much more difficult to debug a learned system for a given error or error type in the results. Either you tolerate it all, or you re-train LSA with new or expanded data, in which case there is no guarantee that the new results will have that error corrected. In a rule-based system of multiple levels, it is much easier to localize the source of an error and fix it. My own experience with using LSA for synonym clustering is that when I examined the results, much of it seemed to make sense, but there were numerous cases that were beyond comprehension. It was difficult to determine whether the incomprehensible part of the results was due to noise in the imperfect data, bugs in the algorithm, or both, and hence difficult to come up with effective corrective methods.

When we talk about a rule-based semantic approach, we do not mean that the approach relies only on parsing structures in decoding semantics. When we do semantics, whether extracting sentiments or factual events, we always bring lexical evidence and structural evidence together to accomplish the task. For example, in order to extract the emotional sentiment of an agent expressed towards an object or brand, our sentiment rule will involve trigger words like love/like/favor/prefer and then check that the trigger's logical/grammatical subject and object are of a certain lexical type (e.g. human type versus non-human type), to ensure that we decode the semantics of the underlying text precisely. As you can see, the rule approach used this way has the advantage of being supported by two types of evidence, whereas LSA has only one. This is a fundamental difference when we compare rules with the BOW class of techniques, no matter what new approaches or techniques are hot in the community.
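A sentiment rule of this kind can be sketched as follows. This is only an illustration of combining the two kinds of evidence, not our production rule: the `Clause` type is a hypothetical stand-in for parser output, and the `HUMAN` word list is a toy replacement for a real lexical type system.

```python
from dataclasses import dataclass
from typing import Optional

# Lexical evidence: trigger words named in the text above.
TRIGGERS = {"love", "like", "favor", "prefer"}

# Toy lexical typing (assumption): words treated as human type.
HUMAN = {"i", "we", "he", "she", "customers", "users"}

@dataclass
class Clause:
    """Hypothetical parser output: predicate lemma plus its arguments."""
    predicate: str
    subject: str
    object: str

def extract_sentiment(clause: Clause) -> Optional[dict]:
    """Fire only if lexical and structural evidence both match."""
    if clause.predicate not in TRIGGERS:
        return None  # no trigger word
    if clause.subject.lower() not in HUMAN:
        return None  # logical subject must be human type
    if clause.object.lower() in HUMAN:
        return None  # object must be non-human (e.g. a brand)
    return {
        "agent": clause.subject,
        "object": clause.object,
        "sentiment": "positive",
    }

hit = extract_sentiment(Clause("love", "Customers", "iPhone"))
miss = extract_sentiment(Clause("like", "clouds", "rain"))
```

The first clause yields an agent, an object and a positive sentiment. The second contains the trigger word "like" but fails the subject type check, which is exactly the kind of case a keyword-only method cannot filter out.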

Admittedly, BOW learning in general and LSA in particular do have the benefit of being robust in handling noisy data, and such systems can be built up quickly once data are available. The automatic adaptation to a domain based on the training data is also a strength, as it narrows down the semantic space to start with. The approximation of treating language as a black box, rather than analyzing it as a decomposable hierarchy of structures, is sometimes good enough for certain semantic use cases.

LSA is often cited as an alternative to the grammar approach partly because it got a good, eye-catching name, I guess. It suddenly shortens the distance between sentence meaning and its building blocks, words, without the trouble of having to use structures as a bridge. (But language is structured! As surely as the earth revolves.)


