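Sharing 《镜子大全》 and 《朝华午拾》 http://blog.sciencenet.cn/u/liwei999 Once a Little Red Guard, later sent down to the countryside to "repair the earth"; left the homeland in 1991, destination unknown.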


Any System That Claims to Use Machine Learning for Social Media Sentiment Mining Deserves Suspicion

6590 views 2015-11-21 03:51 | Personal category: 立委科普 | System category: 科普集锦 | machine, Social, Learning, media, sentiment

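Any system that claims to use mainstream machine learning methods for social media sentiment mining deserves suspicion. The basic reality is that such systems are stretched too thin to be usable. The reason is plain: machine learning breaks down on social media, which is dominated by short messages. A short message simply does not contain enough data points (the so-called keyword density) for machine learning to work with. Even the cleverest cook cannot make a meal without rice. This is determined by the bag-of-words methodology, and no training set, however large, can overcome the limitation. Without linguistic structural analysis, the challenge is insurmountable.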

I have made this point in various earlier posts, but the world is so dominated by the mainstream that the message does not seem to carry far. So let me put it simply:

The sentiment classification approach based on the bag-of-words (BOW) model, so far the dominant mainstream approach to sentiment analysis, simply breaks down on social media. The main reason is simple: social media is full of short messages that lack the "keyword density" a classifier needs to make a proper sentiment decision. The precision ceiling for this line of work on real-life social media is found to be 60%, far below the widely acknowledged 80% minimum precision for a usable extraction system. Trusting a machine learning classifier to judge social media sentiment is not much better than flipping a coin.
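To make the "keyword density" point concrete, here is a minimal sketch in Python. The toy lexicon `SENTIMENT_WORDS` and both sample texts are hypothetical, made up purely for illustration; a real BOW classifier learns its weights from training data rather than using a fixed word list, but it sees the same kind of evidence.

```python
from collections import Counter
import re

# Hypothetical toy lexicon, for illustration only (not from any real system).
SENTIMENT_WORDS = {"love", "great", "good", "hate", "bad", "terrible", "awful"}

def bag_of_words(text):
    """Lowercase the text and count its word tokens; word order is discarded."""
    return Counter(re.findall(r"[a-z']+", text.lower()))

def sentiment_evidence(text):
    """Count the tokens that carry sentiment according to the toy lexicon."""
    bag = bag_of_words(text)
    return sum(count for word, count in bag.items() if word in SENTIMENT_WORDS)

# Hypothetical sample texts: a short social media post and a longer review.
short_post = "Not what I expected from this phone"
long_review = (
    "I love the screen and the camera is great, but the battery is bad "
    "and the speaker is terrible. Overall a good phone, though I hate the charger."
)

print(sentiment_evidence(short_post))   # 0
print(sentiment_evidence(long_review))  # 6
```

The short post clearly expresses disappointment, yet it contains no word that a bag-of-words model could latch onto, while the longer review offers six such data points.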

So let us get this straight. From now on, any claim to use machine learning for mining public opinion and sentiment from social media is likely to be a trap (unless the system is verified to involve parsing of linguistic structures or patterns, which so far has never been heard of in practical systems based on machine learning). Fancy visualizations may make the mining results look real and attractive, but they are simply not trustworthy.
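Why structure matters can be shown with another small Python sketch. The two posts below are hypothetical examples, and the whitespace tokenizer is a simplifying assumption: they express opposite opinions about X and Y, yet their bags of words are identical, so no classifier that sees only the bag can tell them apart.

```python
from collections import Counter

def bag_of_words(text):
    """Split on whitespace and count tokens; all word-order information is lost."""
    return Counter(text.lower().split())

# Hypothetical short posts with opposite meanings.
post_a = "X is better than Y"
post_b = "Y is better than X"

# Both posts reduce to the same multiset of words.
print(bag_of_words(post_a) == bag_of_words(post_b))  # True
```

Telling the two apart requires knowing which name is the subject of the comparison, which is exactly the kind of information a structural parse provides.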

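【Postscript】
A friend sent me a screenshot from WeChat Moments saying that this post amounts to tarring everyone with the same brush. But there is really no way around it. In Chinese and in Western languages alike, the overwhelming dominance of short messages is the reality of social media in the mobile era, and someone has to expose the truth behind social media big data mining. That BOW is helpless in front of short messages is an indisputable fact. It will not suddenly start working where it does not fit just because it is the most readily available mainstream method and most people use it. What does not work does not work: this approach cannot break through the 60% precision bottleneck and remains far from the accepted 80% usability threshold. This is determined by the methodology.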

Related Posts:
Pros and Cons of Two Approaches: Machine Learning and Grammar Engineering
Coarse-grained vs. fine-grained sentiment analysis 
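The Significance of Independent Verification of Sentiment Mining Systems 2015-11-22
【立委科普: What Is a Bag of Words in NLP】 2015-11-27
【Pinned: Overview of NLP Posts on 立委's ScienceNet Blog (periodically updated)】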






https://blog.sciencenet.cn/blog-362400-937035.html
