Data table aggregations with vector functions, take 2

I'm struggling to use data.table to summarize results of vector functions, something that's easy in ddply.

Issue 1: aggregate with an (expensive) function with vector output

dt <- data.table(x=1:20,y=rep(c("a","b"),each=10))

This ddply command produces what I want:

ddply(dt,~y,function(dtbit) quantile(dtbit$x))

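This data.table command does not do what I want, because it stacks the quantiles in a single column (five rows per group) instead of giving one column per quantile: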

dt[,quantile(x),by=list(y)]

I can hack at data.table like so:

dt[,list("0%"=quantile(x,0),"25%"=quantile(x,0.25),
    "50%"=quantile(x,0.5)),by=list(y)]

But that is verbose, and it would also be slow if the vector function "quantile" were slow, since it is called once per output column.

A similar example is:

dt$z <- rep(sqrt(1:10),2)

ddply(dt,~y,function(dtbit) coef(lm(z~x,dtbit)))

Issue 2: Using a function with both vector input and output

xzsummary <- function(dtbit) t(summary(dtbit[,"x"]-dtbit[,"z"]))

ddply(dt,~y,xzsummary )

Can I do that kind of thing easily in data.table?

Apologies if these questions are already prominently answered.

This is a similar, not identical, issue to: data.table aggregations that return vectors, such as scale()

asked Jul 09 '14 20:07 by ewallace



1 Answer

> dt[ , as.list(quantile(x)),by=y]
   y 0%   25%  50%   75% 100%
1: a  1  3.25  5.5  7.75   10
2: b 11 13.25 15.5 17.75   20

I tried using rbind, but that failed to generate the by-y arrangement I was thinking you wanted. The trick with as.list (vs. list) is that it constructs a multi-element list when given a vector, whereas list only puts the vector into a single-element list.
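To make the difference concrete, here is a minimal base-R sketch (no data.table needed). The variable name q is just an illustration and is not part of the original answer:

```r
# A named numeric vector of five quantiles
q <- quantile(1:10)

# list() wraps the whole vector in one element
n_wrapped <- length(list(q))

# as.list() makes one element per value and keeps the names,
# which is what data.table turns into one column per quantile
n_split <- length(as.list(q))
split_names <- names(as.list(q))

print(n_wrapped)
print(n_split)
print(split_names)
```

Because each list element becomes a column in j, the as.list form yields one row per group with five named columns.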

as.list acts like sapply(x, list):

> dt[ , sapply(quantile(x), list), by=y]
   y 0%   25%  50%   75% 100%
1: a  1  3.25  5.5  7.75   10
2: b 11 13.25 15.5 17.75   20

Your target solution:

> ddply(dt,~y,function(dtbit) quantile(dtbit$x))
  y 0%   25%  50%   75% 100%
1 a  1  3.25  5.5  7.75   10
2 b 11 13.25 15.5 17.75   20
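The same as.list strategy should carry over to the other two examples in the question. The following is a sketch under that assumption, not something tested against the asker's real (expensive) functions. The result names coefs and xz are made up for illustration:

```r
library(data.table)

# Rebuild the example data from the question
dt <- data.table(x = 1:20, y = rep(c("a", "b"), each = 10))
dt[, z := rep(sqrt(1:10), 2)]

# Issue 1, second example: lm() is fitted once per group and each
# coefficient becomes its own column
coefs <- dt[, as.list(coef(lm(z ~ x))), by = y]

# Issue 2: columns can be combined directly inside j, so no wrapper
# function taking the data subset is needed
xz <- dt[, as.list(summary(x - z)), by = y]

print(coefs)
print(xz)
```

In both cases the vector function runs only once per group, which avoids the repeated calls in the verbose list(...) workaround.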

I was kind of proud of that solution, but mindful of fortunes::fortune("Liaw-Baron principle"):

Lastly, by what we could call the 'Liaw-Baron principle', every question that can be asked has in fact already been asked. -- Dirk Eddelbuettel (citing Andy Liaw's and Jonathan Baron's opinion on unique questions on R-help) R-help (January 2006)

I did a search on: [r] data.table as.list, and found that I am by no means the first to post this strategy on SO:

Tabulate a data frame in R

Using ave() with function which returns a vector

create a formula in a data.table environment in R

I don't really know if this question would be considered a duplicate, but I am particularly grateful to @G.Grothendieck for the last one. It may be where I picked up the strategy. There were about 125 hits to that search and I've only gone through the first 20 to gather those examples, so there may be some more pearls that I haven't uncovered.

answered Oct 01 '22 13:10 by IRTFM