Beyond Dreams (超越梦想) http://blog.sciencenet.cn/u/pcabaqus Structural vibration mitigation and isolation control; nonlinear seismic analysis; simple Python programming


Comparison of data analysis packages: R, Matlab, SciPy, Excel…

4951 views | 2010-1-4 12:10 | Category: PYTHON | System category: Research notes

| Name | Advantages | Disadvantages | Open source? | Typical users |
|---|---|---|---|---|
| R | Library support; visualization | Steep learning curve | Yes | Finance; Statistics |
| Matlab | Elegant matrix support; visualization | Expensive; incomplete statistics support | No | Engineering |
| SciPy/NumPy/Matplotlib | Python (general-purpose programming language) | Immature | Yes | Engineering |
| Excel | Easy; visual; flexible | Large datasets | No | Business |
| SAS | Large datasets | Expensive; outdated programming language | No | Business; Government |
| SPSS, Stata | Easy statistical analysis | Weak programming language | No | Science |

There’s a bunch more to be said for every cell.  Among other things:

  • Python “immature”: matplotlib, numpy, and scipy are all separate libraries that don’t always get along.  Why does matplotlib come with “pylab” which is supposed to be a unified namespace for everything?  Isn’t scipy supposed to do that?  Why is there duplication between numpy and scipy (e.g. numpy.linalg vs. scipy.linalg)?  And then there’s package compatibility version hell.  You can use SAGE or Enthought but neither is standard (yet).  In terms of functionality and approach, SciPy is closest to Matlab, but it feels much less mature.
  • Matlab’s language is certainly weak.  It sometimes doesn’t seem to be much more than a scripting language wrapping the matrix libraries.  Python is clearly better on most counts.  R’s is surprisingly good (Scheme-derived, smart use of named args, etc.) if you can get past the bizarre language constructs and weird functions in the standard library.  Everyone says SAS is very bad.
  • Matlab is the best for developing new mathematical algorithms.  Very popular in machine learning.
  • I’ve never used the Matlab Statistical Toolbox.  I’m wondering, how good is it compared to R?
  • Here’s an interesting reddit thread on SAS/Stata vs R.
  • SPSS and Stata in the same category: they seem to have a similar role so we threw them together.  Stata is a lot cheaper than SPSS, people usually seem to like it, and it seems popular for introductory courses.  I personally haven’t used either…
  • SPSS and Stata for “Science”: we’ve seen biologists and social scientists use lots of Stata and SPSS.  My impression is they get used by people who want the easiest way possible to do the sort of standard statistical analyses that are very orthodox in many academic disciplines.  (ANOVA, multiple regressions, t- and chi-squared significance tests, etc.)  Certain types of scientists, like physicists, computer scientists, and statisticians, often do weirder stuff that doesn’t fit into these traditional methods.
  • Another important thing about SAS, from my perspective at least, is that it’s used mostly by an older crowd.  I know dozens of people under 30 doing statistical stuff and only one knows SAS.  At that R meetup last week, Jim Porzak asked the audience if there were any recent grad students who had learned R in school.  Many hands went up.  Then he asked if SAS was even offered as an option.  All hands went down.  There were boatloads of SAS representatives at that conference and they sure didn’t seem to be on the leading edge.
  • But: is there ANY package besides SAS that can do analysis for datasets that don’t fit into memory?  That is, ones that mostly have to stay on disk?  And exactly how good are SAS’s capabilities here anyway?
  • If your dataset can’t fit on a single hard drive and you need a cluster, none of the above will work.  There are a few multi-machine data processing frameworks that are somewhat standard (e.g. Hadoop, MPI), but it’s an open question what the standard distributed data analysis framework will be.  (Hive? Pig?  Or quite possibly something else.)
  • (This was an interesting point at the R meetup.  Porzak was talking about how going to MySQL gets around R’s in-memory limitations.  But Itamar Rosenn and Bo Cowgill (Facebook and Google respectively) were talking about multi-machine datasets that require cluster computation that R doesn’t come close to touching, at least right now.  It’s just a whole different ballgame with that large a dataset.)
  • SAS people complain about poor graphing capabilities.
  • R vs. Matlab visualization support is controversial.  One view I’ve heard is, R’s visualizations are great for exploratory analysis, but you want something else for very high-quality graphs.  Matlab’s interactive plots are super nice though.  Matplotlib follows the Matlab model, which is fine, but is uglier than either IMO.
  • Excel has a far, far larger user base than any of these other options.  That’s important to know.  I think it’s underrated by computer scientist sort of people.  But it does massively break down at >10k or certainly >100k rows.
  • Another option: Fortran and C/C++.  They are super fast and memory efficient, but they are tricky and error-prone to code, you have to spend lots of time mucking around with I/O, and they have zero visualization and data management support.  Most of the packages listed above run Fortran numeric libraries for the heavy lifting.
  • Another option: Mathematica.  I get the impression it’s more for theoretical math, not data analysis.  Can anyone prove me wrong?
  • Another option: the pre-baked data mining packages.  The open-source ones I know of are Weka and Orange.  I hear there are zillions of commercial ones too.  Jerome Friedman, a big statistical learning guy, has an interesting complaint that they should focus more on traditional things like significance tests and experimental design.  (Here; the article that inspired this rant.)
  • I think knowing where the typical users come from is very informative for what you can expect to see in the software’s capabilities and user community.  I’d love more information on this for all these options.
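
On the question above about datasets that mostly have to stay on disk: one general approach, independent of any particular package, is to read the data one row at a time and keep only running summaries in memory.  Here is a minimal sketch in plain Python using Welford's online algorithm for the mean and standard deviation.  The function name `streaming_mean_std`, the column name `value`, and the file name in the comment are all made up for illustration; this is not how SAS or any of the packages above does it internally.

```python
import csv
import io
import math

def streaming_mean_std(rows, column):
    """Return (count, mean, sample standard deviation) for one numeric column,
    consuming one row at a time so memory use stays constant."""
    count = 0
    mean = 0.0
    m2 = 0.0  # running sum of squared deviations from the current mean
    for row in rows:
        x = float(row[column])
        count += 1
        delta = x - mean
        mean += delta / count
        m2 += delta * (x - mean)
    std = math.sqrt(m2 / (count - 1)) if count > 1 else 0.0
    return count, mean, std

# Tiny in-memory stand-in for a large on-disk file.  With real data you would
# pass csv.DictReader(open("big.csv", newline="")) instead (hypothetical file).
sample = io.StringIO("value\n2\n4\n4\n4\n5\n5\n7\n9\n")
count, mean, std = streaming_mean_std(csv.DictReader(sample), "value")
print(count, mean, std)
```

This only covers statistics that can be updated incrementally.  Anything that needs the whole dataset at once (sorting, many model fits) takes more work.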
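
The point above about going to a database to get around in-memory limits can be sketched as follows.  The example swaps in Python's built-in `sqlite3` module for MySQL so that it runs without a server; the table name `measurements` and its columns are hypothetical.  The idea is the same either way: let the database do the aggregation and pull only the small summary into the analysis environment.

```python
import sqlite3

# ":memory:" keeps the example self-contained; a file path such as
# "measurements.db" (hypothetical) would keep the table on disk instead.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE measurements (grp TEXT, value REAL)")
conn.executemany(
    "INSERT INTO measurements VALUES (?, ?)",
    [("a", 1.0), ("a", 3.0), ("b", 10.0), ("b", 20.0), ("b", 30.0)],
)

# The database computes the per-group count and mean; only one summary
# row per group comes back into Python.
summary = conn.execute(
    "SELECT grp, COUNT(*), AVG(value) FROM measurements "
    "GROUP BY grp ORDER BY grp"
).fetchall()
conn.close()

print(summary)  # [('a', 2, 2.0), ('b', 3, 20.0)]
```

This helps with data that is too big for memory but still fits on one machine.  It does nothing for the multi-machine case discussed above.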


https://blog.sciencenet.cn/blog-339218-284122.html
