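Sharing 《镜子大全》 and 《朝华午拾》 http://blog.sciencenet.cn/u/liwei999 Former Little Red Guard, sent down to the countryside to "repair the earth"; left home and country in 1991, destination unknown.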


On Big Data NLP

Viewed 4217 times | 2013-7-27 20:43 | Personal category: Liwei's Popular Science (立委科普) | System category: Research notes | Keywords: NLP, Big, Data

Admittedly, it is not easy to develop an NLP (Natural Language Processing) system with both high precision and high recall (i.e. a high F-score), given the ambiguity and complexity of natural language. Social media is even more challenging: it is full of misspellings, irregularities, and ungrammatical fragments. However, the decisive factor for a system to work in practice is neither precision nor recall. There is a more important indicator, namely scalability. A system with moderate precision and recall can develop into a functioning product if it can scale up to big data. With rapid progress in both hardware and software, and with cloud computing available, the bottleneck of big data processing often lies more in economic cost than in technical difficulty. Once a system does scale, the result will be revolutionary.

Once the big data challenge has been handled by scaling the system up, precision and recall become relatively less important. In other words, even a system with moderate precision and recall (for example, 70% precision and 30% recall) can lead to excellent functioning products. The fundamental reason lies in two factors: the redundancy of information in big data and the limited human capacity to digest information. Modest recall can be compensated for by increasing the amount of data processed. This is easy to understand: statistically significant information is never mentioned only once; it is repeated in various expressions by many people, so it is bound to be caught in some of them. From the consumer's perspective, whether a message is caught 1,000 times or 500 times makes no semantic difference, as long as it is caught precisely.

The doubt is about precision: how can big data make modest precision trustworthy? A system with 70% precision gets 30 cases wrong out of every 100 it extracts, so how can we trust such a system and its value? Along this line of thought, never mind 70%: even a 90% system still makes too many errors, and it would follow that no system is good enough to meet human expectations and standards. This argument overlooks the effect of the app-level (i.e. mining-level, as distinct from extraction-level) operations that leverage the size of the data, namely information sampling and filtering, fusion (merging), and ranking. As a result, it exaggerates the negative impact of individual extraction-level errors on the final results presented to users at the app level.

In fact, in the typical scenario with social media big data, a user query yields a huge number of hits, far too many to present to users or for users to digest (i.e. the information overload problem). A practical system therefore has to integrate and rank the information so that only the statistically most significant integrated results are presented. This integration process greatly enhances the quality of the end results, which significantly improves the user experience. In short, big data matters: it changes the conditions and scenarios of technology application. We have seen this data effect before in Google's ranking of search results, where only the first page or first few pages shape the user's perception of data quality, despite the thousands or even millions of hits in the long tail. Likewise, this data effect helps NLP of social media big data.
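
The redundancy argument can be made concrete with a little arithmetic. The sketch below is only an illustration under strong assumptions that are not part of the original discussion: every mention of a message is extracted independently, each extraction is a binary judgment (for example positive versus negative), and "fusion" is reduced to a simple majority vote. The function names are hypothetical, and a real system would use richer merging and ranking.

```python
from math import comb


def prob_caught_at_least_once(recall: float, mentions: int) -> float:
    """Chance that at least one of `mentions` restatements is extracted.

    Assumption: each restatement is caught independently with
    probability `recall`.
    """
    return 1.0 - (1.0 - recall) ** mentions


def prob_majority_correct(precision: float, mentions: int) -> float:
    """Chance that a strict majority of `mentions` extractions is correct.

    Assumption: each extraction is a binary judgment that is right
    independently with probability `precision`; majority voting stands
    in here for app-level fusion.
    """
    needed = mentions // 2 + 1  # smallest count that forms a strict majority
    return sum(
        comb(mentions, k) * precision**k * (1.0 - precision) ** (mentions - k)
        for k in range(needed, mentions + 1)
    )


if __name__ == "__main__":
    # The 30% recall / 70% precision figures come from the example in the text.
    for n in (1, 5, 10, 20):
        print(f"recall 0.30, {n:>2} mentions: "
              f"caught at least once = {prob_caught_at_least_once(0.30, n):.3f}")
    for n in (1, 3, 11, 101):
        print(f"precision 0.70, {n:>3} mentions: "
              f"majority correct = {prob_majority_correct(0.70, n):.4f}")
```

Under these assumptions, a message repeated ten times is caught at least once about 97% of the time even at 30% recall, and the aggregated judgment grows more reliable as the number of mentions increases. Correlated errors, which real systems do exhibit, would weaken both effects.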

For big data, partial loss of data is not a real issue in most use scenarios, as long as the loss is independent of queries (i.e. it does not discriminate by brand or by any other query). Missing data happens all the time in the real world, and for many reasons: servers going down, database failures, deliberate cuts for cost reasons (e.g. sampling only a certain percentage of the data), overzealous spam filtering, imperfect recall of the engine, or other bugs in the system. To an extent, missing part of the data is the norm rather than the exception in real-life big data mining. It is unrealistic, and unnecessary, to expect a system to achieve perfect recall of every piece of big data. In the majority of information mining scenarios, big data is used to pursue influential (statistically significant) information and public opinion trends, and partial loss of data does not in principle affect such pursuits, because big data is redundant by nature. Despite the marketing buzz everywhere claiming that NLP systems can help find 'the needle in a haystack', that is often neither the true picture nor the true value of applying NLP to big data. In fact, redundancy itself, or the degree of redundancy, is a natural part of the semantics of big data intelligence. The same logic lies behind the time-honored practice in traditional petitions, where the organizer always collects as many signatures as possible before presenting the petition, although the exact number, 100,000 or 90,000 signatures, does not affect the nature of the petition; only the magnitude adds to the semantics. With the above discussion in mind, it can be said that big data is accelerating the application of its own technology, because it naturally tolerates an imperfect system.
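
The condition that data loss be independent of queries can be illustrated with a small simulation. This is a sketch on synthetic data, not a measurement: the corpus, the 60% positive share, the 50% keep rate, and the helper names are all made up for illustration. It compares a uniform loss, which leaves the trend roughly intact, with a loss that hits one kind of post harder, which shifts it.

```python
import random


def positive_share(posts):
    """Fraction of posts labeled positive (True)."""
    return sum(posts) / len(posts)


def drop_uniformly(posts, keep_rate, rng):
    """Keep each post with the same probability, whatever its content."""
    return [p for p in posts if rng.random() < keep_rate]


def drop_biased(posts, keep_rate_positive, rng):
    """Hypothetical biased loss: only positive posts are ever dropped."""
    return [p for p in posts if (not p) or rng.random() < keep_rate_positive]


rng = random.Random(42)  # fixed seed so the run is repeatable

# Synthetic corpus: 100,000 posts about one brand, roughly 60% positive.
corpus = [rng.random() < 0.60 for _ in range(100_000)]

uniform_sample = drop_uniformly(corpus, 0.5, rng)
biased_sample = drop_biased(corpus, 0.5, rng)

print(f"full corpus:   {len(corpus):>7} posts, positive share {positive_share(corpus):.3f}")
print(f"uniform loss:  {len(uniform_sample):>7} posts, positive share {positive_share(uniform_sample):.3f}")
print(f"biased loss:   {len(biased_sample):>7} posts, positive share {positive_share(biased_sample):.3f}")
```

Losing half of the data uniformly changes the absolute counts but barely moves the positive share, whereas the biased loss pulls it well below the true value. That is why only the second kind of loss threatens trend mining.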








http://blog.sciencenet.cn/blog-362400-711776.html
