Friday, May 13, 2011

plyr's idata.frame VS. data.frame

I had seen the function idata.frame in plyr before, but had never really tested it. From the plyr documentation:

"An immutable data frame works like an ordinary data frame, except that when you subset it, it
returns a reference to the original data frame, not a copy. This makes subsetting substantially
faster and has a big impact when you are working with large datasets with many groups."

For example, although baseball is a data.frame, its immutable counterpart is a reference to it:

> idata.frame(baseball)
<environment: 0x1022c74e8>
attr(,"class")
[1] "idf"         "environment"
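
Because an immutable data frame is just an environment carrying an "idf" class, you interact with it through the same subsetting syntax as a regular data frame, and plyr indexes into the original data instead of copying it. A minimal sketch (note that idata.frame supports only a subset of ordinary data frame operations, so not everything will work):

```r
library(plyr)

baseball_i <- idata.frame(baseball)

# Looks like ordinary subsetting, but no copy of baseball is created
head(baseball_i$year)
baseball_i[baseball_i$year == 1871, c("id", "rbi")]
```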


Here are a few comparisons of operations on normal data frames and immutable data frames. Immutable data frames don't work with the doBy package, but they do work with base R's aggregate. Overall, the speed gains using idata.frame are quite impressive - I will use it more often for sure.


Here are the comparisons of idata.frames and data.frames:

> # load packages
require(plyr); require(reshape2)
 
> # Make immutable data frame
baseball_i <- idata.frame(baseball)
 
> # Example 1 - idata.frame more than twice as fast
system.time( replicate(50, ddply( baseball, "year", summarise, mean(rbi))) )
   user  system elapsed 
 14.812   0.252  15.065 
> system.time( replicate(50, ddply( baseball_i, "year", summarise, mean(rbi))) )
   user  system elapsed 
  6.895   0.020   6.915 
> # Example 2 - Bummer, this does not work with idata.frames
> colwise(max, is.numeric) ( baseball ) # works
  year stint   g  ab   r   h X2b X3b hr rbi sb cs  bb so ibb hbp sh sf gidp
1 2007     4 165 705 177 257  64  28 73  NA NA NA 232 NA  NA  NA NA NA   NA
> colwise(max, is.numeric) ( baseball_i ) # doesn't work
Error: is.data.frame(df) is not TRUE
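Since as.data.frame(baseball_i) converts an immutable data frame back to a regular one (as shown later in this post), one workaround is to materialise the idf just for the call, at the cost of making a copy (a sketch; not benchmarked here):

```r
# colwise() checks is.data.frame(), so convert the idf back first
colwise(max, is.numeric)( as.data.frame(baseball_i) )
```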
> # Example 3 - idata.frame twice as fast
system.time( replicate(100, baseball[baseball$year == "1884", ] ) )
   user  system elapsed 
  1.155   0.048   1.203 
> system.time( replicate(100, baseball_i[baseball_i$year == "1884", ] ) )
   user  system elapsed 
  0.598   0.011   0.609 
> # Example 4 - idata.frame much faster (roughly 16x here)
system.time( replicate(50, melt(baseball[, 1:4], id = 1) ) )
   user  system elapsed 
 16.587   1.169  17.755 
> system.time( replicate(50, melt(baseball_i[, 1:4], id = 1) ) )
   user  system elapsed 
  0.869   0.196   1.065 
> # And you can convert back to a regular data frame with as.data.frame
d <- as.data.frame(baseball_i)
str(d)
 
'data.frame': 21699 obs. of  23 variables:
 $ id   : chr  "ansonca01" "forceda01" "mathebo01" "startjo01" ...
 $ year : int  1871 1871 1871 1871 1871 1871 1871 1872 1872 1872 ...
 $ stint: int  1 1 1 1 1 1 1 1 1 1 ...
 $ team : chr  "RC1" "WS3" "FW1" "NY2" ...
 $ lg   : chr  "" "" "" "" ...
 $ g    : int  25 32 19 33 29 29 29 46 37 25 ...
 $ ab   : int  120 162 89 161 128 146 145 217 174 130 ...
 $ r    : int  29 45 15 35 35 40 36 60 26 40 ...
 $ h    : int  39 45 24 58 45 47 37 90 46 53 ...
 $ X2b  : int  11 9 3 5 3 6 5 10 3 11 ...
 $ X3b  : int  3 4 1 1 7 5 7 7 0 0 ...
 $ hr   : int  0 0 0 1 3 1 2 0 0 0 ...
 $ rbi  : int  16 29 10 34 23 21 23 50 15 16 ...
 $ sb   : int  6 8 2 4 3 2 2 6 0 2 ...
 $ cs   : int  2 0 1 2 1 2 2 6 1 2 ...
 $ bb   : int  2 4 2 3 1 4 9 16 1 1 ...
 $ so   : int  1 0 0 0 0 1 1 3 1 0 ...
 $ ibb  : int  NA NA NA NA NA NA NA NA NA NA ...
 $ hbp  : int  NA NA NA NA NA NA NA NA NA NA ...
 $ sh   : int  NA NA NA NA NA NA NA NA NA NA ...
 $ sf   : int  NA NA NA NA NA NA NA NA NA NA ...
 $ gidp : int  NA NA NA NA NA NA NA NA NA NA ...
 $ teamf: Factor w/ 132 levels "ALT","ANA","ARI",..: 99 127 51 79 35 35 122 86 16 122 ...
> # idata.frame doesn't work with the doBy package
require(doBy)
summaryBy(rbi ~ year, baseball_i, FUN=c(mean), na.rm=T)
Error in as.vector(x, mode) : 
  cannot coerce type 'environment' to vector of type 'any'
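The same conversion trick should let doBy run, again at the cost of copying the data (a sketch, using the as.data.frame conversion shown above; timings not measured):

```r
require(doBy)
# Materialise the immutable data frame before handing it to summaryBy
summaryBy(rbi ~ year, data = as.data.frame(baseball_i), FUN = mean, na.rm = TRUE)
```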
> # idata.frame works with base aggregate, though with minimal speed gains
# and aggregate is faster than ddply, of course
system.time( replicate(100, aggregate(rbi ~ year, baseball, mean) ) )
   user  system elapsed 
  4.117   0.423   4.541 
> system.time( replicate(100, aggregate(rbi ~ year, baseball_i, mean) ) )
   user  system elapsed 
  3.908   0.383   4.291 
> system.time( replicate(100, ddply( baseball_i, "year", summarise, mean(rbi)) ) )
   user  system elapsed 
 14.015   0.048  14.082 

2 comments:

  1. I loved plyr when I started with R.
    idata.frame made it even faster.
    Then I found data.table.
    Even with doMC/parallel=T/idata.frame, data.table kicked ass.
    data.table with %dopar% kicked more ass. Not even close. 100 million rows? piece of cake.

    data.table reminds me of K(Qdb), which was a super fast pain in the ass (yes, that's what I said and that is what I mean).

    Support for memory mapped files/partitions in data.table would be the panacea. So close...

    OK. I'll fess up. I found data.table at the same time as plyr and used plyr because everyone else was. But now I'm annoyed data.table isn't the standard. Soooo much faster. Why isn't this the standard?

    Respect to Hadley for ggplot/plyr. They are a huge improvement on base R, and he is an innovator.
    data.table should be part of core R and does not get the respect it deserves. Probably because most people don't need to analyze enough data that it matters.

  2. You are right to point out data.table. I simply wanted to talk about idata.frame vs. data.frame, then I just threw in doBy and aggregate at the last minute.

    Overall, others (http://zvfak.blogspot.com/2011/03/applying-functions-on-groups-sqldf-plyr.html, http://www.psychwire.co.uk/2011/04/data-aggregation-in-r-plyr-sqldf-and-data-table/) have pointed out how fast data.table is (as well as sqldf, doBy, and aggregate) relative to plyr. It seems Hadley is working on the problem though (http://twitter.com/#!/hadleywickham/status/41150539232190464).

    Right, I agree that data.table is much faster right now, but I guess I am just partial to plyr, and my datasets really aren't that large. I have to confess to using data.table when I do have large datasets though.
