
Wrapper functions for data.table

Tags: r, data.table

I have a project that was written entirely in terms of data.frame. To improve calculation times, I'm trying to leverage the speed of data.table instead. My approach has been to write wrapper functions that take in data frames, convert them to data.tables, do the calculations, and then convert back to data frames. Here's one of the simpler examples:

FastAgg<-function(x, FUN, aggFields, byFields = NULL, ...){
  require('data.table')
  y<-setDT(x)
  y<-y[,lapply(X=.SD,FUN=FUN,...),.SDcols = aggFields,by=byFields]
  y<-data.frame(y)
  y
}

The problem I'm having is that after running this function, x has been converted to a data.table, and code that I wrote using data.frame notation then fails. How do I make sure that the data.frame I pass in is left unchanged by the function?

Adam Hoelscher asked Oct 21 '14 01:10



1 Answer

For your case, I'd recommend (of course) using data.table throughout and not just inside a function :-).

But if that's not likely to happen, then I'd recommend the setDT + setDF setup. I'd recommend calling setDT outside the function, which converts your data.frame to a data.table by reference, and passing that data.table as input. After finishing the operations you'd like, use setDF to convert the result back to a data.frame and return that from the function. Note, however, that setDT(x) changes x itself to a data.table, because it operates by reference.
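To make the by-reference behaviour concrete, here is a minimal sketch. The data frame df and its contents are made up for illustration; setDT and setDF are the data.table functions discussed above.

```r
library(data.table)

# Hypothetical toy data, not from the question.
df <- data.frame(a = 1:3, b = c("x", "y", "z"))

setDT(df)                      # converts df in place; no assignment needed
class_after_setDT <- class(df) # "data.table" "data.frame"

setDF(df)                      # converts the same object back, again in place
class_after_setDF <- class(df) # "data.frame"
```

Because no copy is made, any other code holding df sees the class change, which is what breaks the data.frame-style code in the question.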

If that is not ideal, then use as.data.table(.) inside your function, as it operates on a copy. Then, you can still use setDF() to convert the resulting data.table to data.frame and return that data.frame from your function.
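As a sketch of that second option, the wrapper from the question could look roughly like this. The name FastAggCopy and the sample data are hypothetical; the only changes from the original function are as.data.table() in place of setDT() and setDF() in place of data.frame().

```r
library(data.table)

# Hypothetical variant of the question's FastAgg that leaves its input untouched.
FastAggCopy <- function(x, FUN, aggFields, byFields = NULL, ...) {
  y <- as.data.table(x)  # works on a copy, so the caller's x stays a data.frame
  y <- y[, lapply(.SD, FUN, ...), by = byFields, .SDcols = aggFields]
  setDF(y)               # convert the result to a data.frame without copying
  y
}

# Made-up example data.
df  <- data.frame(g = c("a", "a", "b"), v = c(1, 2, 4))
out <- FastAggCopy(df, sum, aggFields = "v", byFields = "g")
```

The trade-off is one copy of the input per call, in exchange for df remaining a plain data.frame afterwards.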

These functions were introduced only recently (mostly due to user requests). The plan for avoiding this confusion is to export a shallow() function, keep track of objects that require columns to be copied, and do it all internally (and automatically). It's all in very early stages right now. When we've managed it, I'll update this post.


Also have a look at ?copy, ?setDT and ?setDF. The first paragraph of these functions' help pages is:

In data.table parlance, all set* functions change their input by reference. That is, no copy is made at all, other than temporary working memory, which is as large as one column. The only other data.table operator that modifies input by reference is :=. Check out the See Also section below for other set* functions data.table provides.
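The same point applies to :=, and copy() is the way around it. A small sketch, with made-up table and column names:

```r
library(data.table)

DT  <- data.table(a = 1:3)

DT2 <- DT                # not a copy: both names refer to the same table
DT2[, b := a * 2L]       # := adds column b by reference, so DT gains it too

DT3 <- copy(DT)          # explicit deep copy
DT3[, c := 1L]           # only DT3 gets column c; DT is unaffected
```

After running this, DT has columns a and b but not c.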

And the example for setDT:

set.seed(45L)
X = data.frame(A=sample(3, 10, TRUE), 
         B=sample(letters[1:3], 10, TRUE), 
         C=sample(10), stringsAsFactors=FALSE)

# get the frequency of each "A,B" combination
setDT(X)[, .N, by="A,B"][]

does no assignment, because setDT(X) converts X in place (although I admit it could be explained slightly better here).

In setDF:

X = data.table(x=1:5, y=6:10)
## convert 'X' to data.frame, without any copy.
setDF(X)

I think this is pretty clear, but I'll try to make it clearer. I'll also try to add guidance on how best to use these functions to the documentation.

Arun answered Sep 19 '22 00:09
