kikyo91's personal blog: http://blog.sciencenet.cn/u/kikyo91

[Repost] [R] Fast / dependable way to "stack together" data frames

Posted 2014-12-01 00:08 | Category: R | System category: Research notes | Keywords: R, matrix, bind | Source: repost

[R] Fast / dependable way to "stack together" data frames from a list

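I was looking for a good way to join data frames or matrices together, and saw many people say that plyr works well.

I found this post and read through it; it turns out the author is from the same university as I am.


Reposted from https://stat.ethz.ch/pipermail/r-help/2010-September/252046.html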


 

Hi, everybody:

I asked about this in r-help last week and promised
a summary of answers. Special thanks to the folks
that helped me understand do.call and pointed me
toward plyr.

We face this problem all the time. A procedure
generates a list of data frames. How to stack them
together?

The short answer is that the plyr package's rbind.fill
function is probably the fastest approach that is not
prone to trouble and does not require much user caution.

result <- rbind.fill(mylist)

A slower alternative that also works is

result <-  do.call("rbind", mylist)

That is always available in R and it works well enough, even
though it is not quite as fast. Both of these are much faster than
a loop that repeatedly applies "rbind".

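Truly blazing speed can be found if we convert the list elements
into matrices, but that is only an option when every column has the
same type (all numeric, say). It is not possible if the list
contains data frames with mixed column types.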
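
To make that caveat concrete, here is a small sketch of what as.matrix does
to an all-numeric data frame versus a mixed-type one. The objects num_df and
mixed_df are made up for illustration and are not part of the test case below.

```r
## Hypothetical example data, not from the original post.
num_df   <- data.frame(x = c(1.5, 2.5), y = c(3, 4))
mixed_df <- data.frame(x = c(1.5, 2.5), g = c("a", "b"),
                       stringsAsFactors = FALSE)

num_mat   <- as.matrix(num_df)    # stays a numeric matrix
mixed_mat <- as.matrix(mixed_df)  # every cell is coerced to character

is.numeric(num_mat)
is.character(mixed_mat)
```

Once the numbers have been turned into strings, the matrix shortcut no
longer gives back the data you started with.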

I've run this quite a few times, and the relative speed of the
different approaches has never differed much.

If you run this, I hope you will feel smarter, as I do!
:)


## stackListItems.R
## Paul Johnson <pauljohn at ku.edu>
## 2010-09-07

## Here is a test case

df1 <- data.frame(x=rnorm(100),y=rnorm(100))
df2 <- data.frame(x=rnorm(100),y=rnorm(100))
df3 <- data.frame(x=rnorm(100),y=rnorm(100))
df4 <- data.frame(x=rnorm(100),y=rnorm(100))

mylist <-  list(df1, df2, df3, df4)

## Here's the way we have done it. We understand this,
## we believe the result, it is easy to remember. It is
## also horribly slow for a long list.

resultDF <- mylist[[1]]
for (i in 2:4) resultDF <- rbind(resultDF, mylist[[i]])


## It works better to just call rbind once, as in:

resultDF2 <- rbind( mylist[[1]],mylist[[2]],mylist[[3]],mylist[[4]])


## That is faster because it calls rbind only once.

## But who wants to do all of that typing? How tiresome.
## Thanks to Erik Iverson in r-help, I understand that

resultDF3 <- do.call("rbind", mylist)

## is doing the EXACT same thing.
## Erik explained that "do.call( "rbind", mylist)"
## is *constructing* a function call from the list of arguments.
## It is shorthand for "rbind(mylist[[1]], mylist[[2]], mylist[[3]])"
## assuming mylist has 3 elements.
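
## A small sketch of that equivalence, using a made-up three-element
## list called "small" (an assumption for illustration, not the test
## case above). It also shows that named list elements are passed as
## named arguments.

```r
## Hypothetical three-element list, for illustration only.
small <- list(data.frame(a = 1:2), data.frame(a = 3:4), data.frame(a = 5:6))

byHand   <- rbind(small[[1]], small[[2]], small[[3]])
byDoCall <- do.call("rbind", small)
identical(byHand, byDoCall)

## Named list elements become named arguments of the constructed call,
## so this is the same as paste("a", "b", sep = "-"):
pasted <- do.call("paste", list("a", "b", sep = "-"))
```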

## Check the result:
all.equal( resultDF2, resultDF3)

## You often see people claim it is fast to allocate all
## of the required space in one shot and then fill it in.
## I got this algorithm from code in the
## "complete" function in the "mice" package.
## It allocates a big matrix of 0's and
## then it places the individual data frames into that matrix.

m <- 4
nr <- nrow(df1)
nc <- ncol(df1)
resultDF4 <- as.data.frame(matrix(0, nrow = nr*m, ncol = nc))
for (j in  1:m) resultDF4[(((j-1)*nr) + 1):(j*nr), ] <- mylist[[j]]

## This is a bit error prone for my taste. If the data frames have
## different numbers of rows, some major code surgery will be needed.
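
## One possible shape for that surgery, as a sketch only: compute the
## first and last row of each block with cumsum. The list "uneven" is
## made up for illustration, and the code assumes every element has the
## same all-numeric columns in the same order.

```r
## Hypothetical list whose elements have 2, 3 and 1 rows.
uneven <- list(data.frame(x = 1:2, y = c(0.1, 0.2)),
               data.frame(x = 3:5, y = c(0.3, 0.4, 0.5)),
               data.frame(x = 6L,  y = 0.6))

rowsEach <- vapply(uneven, nrow, integer(1))
last     <- cumsum(rowsEach)        # last row index of each block
first    <- last - rowsEach + 1L    # first row index of each block

## Allocate once, then copy each block into its own row range.
filled <- matrix(0, nrow = sum(rowsEach), ncol = ncol(uneven[[1]]))
colnames(filled) <- colnames(uneven[[1]])
for (j in seq_along(uneven)) {
  filled[first[j]:last[j], ] <- as.matrix(uneven[[j]])
}
```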

##
## Dennis Murphy pointed out the plyr package, by Hadley Wickham.
## Dennis said " ldply() in the plyr package. The following is the same
## idea as do.call(rbind, l), only faster."

library("plyr")
resultDF5  <- ldply(mylist, rbind)
all.equal(resultDF, resultDF5)



## Plyr author Hadley Wickham followed up with "I think all you want
## here is rbind.fill:"

resultDF6 <- rbind.fill(mylist)
all.equal(resultDF, resultDF6)
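
## rbind.fill also copes with data frames whose columns differ: it
## fills the missing columns with NA, where plain rbind stops with an
## error. The base-R sketch below imitates that idea. padColumns is a
## made-up helper, not a plyr function, and the data frames a and b
## are hypothetical.

```r
## Hypothetical data frames with different columns.
a <- data.frame(x = 1:2, y = c("p", "q"), stringsAsFactors = FALSE)
b <- data.frame(x = 3:4, z = c(TRUE, FALSE))

## Plain rbind refuses to combine them because the names do not match.
plainResult <- tryCatch(rbind(a, b), error = function(e) "error")

## padColumns() is a made-up helper: add each missing column as NA,
## then put the columns in a common order.
padColumns <- function(df, allNames) {
  for (nm in setdiff(allNames, names(df))) df[[nm]] <- NA
  df[allNames]
}
allNames <- unique(unlist(lapply(list(a, b), names)))
padded   <- do.call("rbind", lapply(list(a, b), padColumns, allNames = allNames))
```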


## Gabor Grothendieck noted that if the elements in mylist were
## matrices, this would all work faster.

mylist2 <- lapply(mylist, as.matrix)

matrixDoCall <-  do.call("rbind", mylist2)

all.equal(as.data.frame(matrixDoCall), resultDF)


## Gabor also showed a better way than 'system.time' to find out how
## long this takes on average using the rbenchmark package. Awesome!

#> library(rbenchmark)
#> benchmark(
#+ df = do.call("rbind", mylist),
#+ mat = do.call("rbind", mylist2),
#+ order = "relative", replications = 250
#+ )



## To see the potentially HUGE impact of these changes, we need to
## make a bigger test case. I just used system.time to evaluate, but
## if this involved a close call, I'd use rbenchmark.
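
## If rbenchmark is not installed, a rough average can be had from
## system.time alone by repeating each call in a loop. This is only a
## sketch: the list "tiny" and the repetition count "reps" are made up
## and kept small so that it runs quickly.

```r
## Hypothetical small test list, kept tiny so this runs quickly.
tiny  <- lapply(1:20, function(i) data.frame(x = rnorm(50), y = rnorm(50)))
tinyM <- lapply(tiny, as.matrix)

reps <- 25
tDF  <- system.time(for (r in seq_len(reps)) do.call("rbind", tiny))[["elapsed"]]
tMat <- system.time(for (r in seq_len(reps)) do.call("rbind", tinyM))[["elapsed"]]

## Average elapsed seconds per call for each approach.
avgTimes <- c(df = tDF, mat = tMat) / reps
avgTimes
```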

phony <- function(i){
 data.frame(w=rnorm(1000), x=rnorm(1000),y=rnorm(1000),z=rnorm(1000))
}
mylist <- lapply(1:1000, phony)




### First, try my usual way
resultDF <- mylist[[1]]
system.time(
for (i in 2:1000) resultDF <- rbind(resultDF, mylist[[i]])
          )
## wow, that's slow:
## user  system elapsed
## 168.040   4.770 173.028


### Now do.call method:
system.time( resultDF3 <- do.call("rbind", mylist) )
all.equal(resultDF, resultDF3)

## Faster! Takes roughly one-eleventh as long
##   user  system elapsed
##  14.64    0.85   15.49


### Third, my adaptation of the complete function in the mice
### package:
m <- length(mylist)
nr <- nrow(mylist[[1]])
nc <- ncol(mylist[[1]])

system.time(
  resultDF4 <- as.data.frame(matrix(0, nrow = nr*m, ncol = nc))
)

colnames(resultDF4) <- colnames(mylist[[1]])
system.time(
  for (j in  1:m) resultDF4[(((j-1)*nr) + 1):(j*nr), ] <- mylist[[j]]
)

all.equal(resultDF, resultDF4)
##Disappointingly slow on the big case:
#   user  system elapsed
# 80.400   3.970  84.573


### That took much longer than I expected. Gabor's
### hint about the difference between matrix and data.frame
### turns out to be important. Do it again, but don't
### make the intermediate storage a data.frame:
mylist2 <- lapply(mylist, as.matrix)

m <- length(mylist2)
nr <- nrow(mylist2[[1]])
nc <- ncol(mylist2[[1]])

system.time(
  resultDF4B <- matrix(0, nrow = nr*m, ncol = nc)
)

colnames(resultDF4B) <- colnames(mylist[[1]])
system.time(
  for (j in  1:m) resultDF4B[(((j-1)*nr) + 1):(j*nr), ] <- mylist2[[j]]
)

### That's FAST!
###    user  system elapsed
###   0.07    0.00    0.07

all.equal(resultDF, as.data.frame(resultDF4B))



### Now the two methods from plyr.


system.time( resultDF5  <- ldply(mylist, rbind))

## Not as fast as the matrix fill, but far faster than do.call on
## data frames, and much less error prone
##  user  system elapsed
##  1.290   0.000   1.306

all.equal(resultDF, resultDF5)


system.time(resultDF6 <- rbind.fill(mylist))
##   user  system elapsed
##  0.450   0.000   0.459

all.equal(resultDF, resultDF6)



## Gabor was right. If we have matrices, do.call is
## just about as good as anything.

system.time(matrixDoCall <-  do.call("rbind", mylist2) )
##   user  system elapsed
##  0.030   0.000   0.032


all.equal(as.data.frame(matrixDoCall), resultDF)





--
Paul E. Johnson
Professor, Political Science
1541 Lilac Lane, Room 504
University of Kansas

