So it is time to compare and summarize the pros and cons of the two basic NLP (Natural Language Processing) approaches, machine learning and grammar engineering, and show where they complement each other.
Some notes:
1. In text processing, the majority of basic, robust machine learning is based on keywords, the so-called BOW (bag-of-words) model, although there is research on machine learning that goes beyond keywords: such work utilizes n-gram (mostly bigram or trigram) linear word sequences to simulate language structure (see the BOW vs. bigram sketch after this list).
2. Grammar engineering is mostly a hand-crafted rule system based on linguistic structures (often represented internally as a grammar tree), designed to simulate the linguistic parsing in the human mind.
3. Machine learning is good at viewing the forest (tasks such as document classification or word clustering over a corpus), though it tends to fail on short messages, while rules are good at examining each tree (sentence-level tasks such as parsing and extraction) and handle short messages well. This is understandable. A document or corpus contains a fairly big bag of keywords, making it easy for a machine to learn statistical clues from the words for a given task. Short messages do not contain enough data points for a machine learning system to use as evidence. Grammar rules, on the other hand, decode the linguistic relationships between words to "understand" the sentence, and are therefore good at handling short messages.
4. In general, a machine learning system based on keyword statistics is recall-oriented while a rule system is precision-oriented; they are complementary in these two core metrics of data quality (see the precision/recall sketch after this list). Each rule may only cover a tiny portion of the language phenomena, but what it does capture, it usually captures precisely. It is easy to develop a highly precise rule system, but recall typically picks up only incrementally with the number of rules developed. Because keyword-based machine learning has no knowledge of sentence structure (at best its n-gram evidence indirectly simulates language structure), it usually cannot reach high precision; but as long as the training corpus is sizable, good recall can be expected, by the nature of the underlying keyword statistics and the disregard for structural constraints.
5. Machine learning is known for its robustness and scalability, as its algorithms are based on science (e.g., MaxEnt is based on information theory) that can be repeated and rigorously tested (of course, as in any application area, there are tricks and know-how that make things work or fail in practice). Development is also fast once a labeled corpus is available (which is often not easy in practice), because there are off-the-shelf open-source tools and tons of documentation and literature in the community for proven ML algorithms.
6. Grammar engineering, on the other hand, tends to depend more on the expertise of the designer and developers for robustness and scalability. It requires deep skills and a "secret sauce" that can only be accumulated through years of successes and lessons learned. It is not a purely scientific undertaking but more of a balancing art in architecture, design and development. To a degree, this is like chefs in Chinese cooking: with the same ingredients and presumably the same recipe, one chef's dish can taste much better than, or quite different from, another's. The recipe only gives a framework; the secret of great taste lies in the details of know-how. It is not easily repeatable across developers, but the same master can repeatedly produce dishes/systems of the best quality.
7. The knowledge bottleneck shows up in both machine learning systems and grammar systems. A decent machine learning system requires a large hand-labeled corpus (research-oriented unsupervised learning systems do not need manual annotation, but they are often not practical either). There is consensus in the community that the quality of machine learning usually depends more on the data than on the algorithms. The bottleneck of grammar engineering, on the other hand, lies in skilled designers (data scientists) and well-trained domain developers (computational linguists), who are often in short supply today.
8. Machine learning is good at coarse-grained, specific tasks (classification is the typical example) while grammar engineering is good at fine-grained analysis and detailed insight extraction. Their respective strengths make them highly complementary in certain application scenarios, because as information consumers, users often demand both a coarse-grained overview and the details of actionable intelligence.
9. One big problem with a machine learning system is the difficulty of fixing a reported quality bug. The learned model is usually a black box, and no direct human intervention is allowed, or even possible, to address a specific problem unless the model is re-trained with a new corpus and/or new features. Even then, there is no guarantee that the specific problem will be addressed well by re-training, because the learning process needs to balance all features in a unified model. This issue is believed to be the major reason why the Google search ranking algorithm favors hand-crafted functions over machine learning: their objective of a better user experience can hardly be achieved by a black-box model.
10. A grammar system is much more transparent in the language understanding process. Modern grammar systems are all designed with careful modularization, so each specific quality bug can be traced to the corresponding module of the system for fine-tuning (see the traceable-rule sketch after this list). The effect is direct and immediate, and the fixes accumulate incrementally into overall performance enhancement.
11. From the perspective of NLP depth, at least at the current state of the art, machine learning seems to do shallow NLP work fairly well, while grammar engineering can go much deeper in linguistic parsing to achieve deep analytics and insights. (The ongoing deep learning research program might take machine learning somewhat deeper than before, but it remains to be seen how effectively it can do real deep NLP and how deep it can go, especially in the area of text processing and understanding.)
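To make note 1 concrete, here is a minimal sketch of BOW versus bigram features. It is plain Python with hypothetical whitespace tokenization, not the pipeline of any particular system; note how BOW discards word order entirely while bigrams weakly simulate linear structure:

```python
from collections import Counter

def bow_features(text):
    # Bag-of-words: keyword counts only; word order is discarded.
    return Counter(text.lower().split())

def bigram_features(text):
    # Bigrams: adjacent word pairs that weakly simulate linear structure.
    tokens = text.lower().split()
    return Counter(zip(tokens, tokens[1:]))

a, b = "dog bites man", "man bites dog"
print(bow_features(a) == bow_features(b))        # True: BOW cannot tell them apart
print(bigram_features(a) == bigram_features(b))  # False: bigrams preserve some order
```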
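The precision/recall trade-off in note 4 can be shown with a small worked example. The extraction outputs and gold answers below are hypothetical, chosen only to illustrate the typical profiles of the two approaches:

```python
def precision_recall(predicted, gold):
    # Precision = correct hits / all hits; recall = correct hits / all gold answers.
    tp = len(predicted & gold)
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(gold) if gold else 0.0
    return precision, recall

gold      = {"e1", "e2", "e3", "e4"}          # hypothetical gold-standard answers
rules_out = {"e1", "e2"}                      # narrow rule system: few hits, all correct
ml_out    = {"e1", "e2", "e3", "e5"}          # broad statistical system: more hits, one wrong
print(precision_recall(rules_out, gold))  # (1.0, 0.5): precision-oriented
print(precision_recall(ml_out, gold))     # (0.75, 0.75): recall picks up, precision drops
```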
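And as a sketch of the transparency claimed in note 10, here is a toy modularized rule system (the two regex rules and their IDs are hypothetical): every extraction records which rule produced it, so a reported quality bug can be traced straight back to the responsible rule and fixed there, without retraining anything:

```python
import re

# Each rule is a named module; outputs carry the rule ID for traceability.
RULES = [
    ("DATE_ISO",  re.compile(r"\b\d{4}-\d{2}-\d{2}\b")),
    ("MONEY_USD", re.compile(r"\$\d+(?:,\d{3})*(?:\.\d{2})?")),
]

def extract(text):
    hits = []
    for rule_id, pattern in RULES:
        for m in pattern.finditer(text):
            hits.append({"rule": rule_id, "match": m.group()})
    return hits

print(extract("Paid $1,200.50 on 2024-11-21."))
# [{'rule': 'DATE_ISO', 'match': '2024-11-21'}, {'rule': 'MONEY_USD', 'match': '$1,200.50'}]
```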
Related blogs:
why hybrid? on machine learning vs. hand-coded rules in NLP