I have a project that was written around data.frame. To improve calculation times, I'm trying to leverage the speed of data.table instead. My approach has been to construct wrapper functions that read in data.frames, convert them to data.tables, do the calculations, and then convert back to data.frames. Here's one simple example:
FastAgg <- function(x, FUN, aggFields, byFields = NULL, ...) {
  require('data.table')
  y <- setDT(x)
  y <- y[, lapply(X = .SD, FUN = FUN, ...), .SDcols = aggFields, by = byFields]
  y <- data.frame(y)
  y
}
The problem I'm having is that after running this function, x has been converted to a data.table, and lines of code that I have written using data.frame notation then fail. How do I make sure that the data.frame I feed in is left unchanged by the function?
For your case, I'd recommend (of course) using data.table throughout, and not just inside a function :-).
But if that's not likely to happen, then I'd recommend the setDT + setDF setup. I'd recommend using setDT outside the function (and providing the data.table as input) to convert your data.frame to a data.table by reference. Then, after finishing the operations you'd like, you can use setDF to convert the result back to a data.frame and return that from the function. However, doing setDT(x) changes x to a data.table, as it operates by reference.
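As a rough illustration of the by-reference behaviour described above, here is a minimal sketch. The data and the names df and res are made up for the example; it only assumes that the data.table package is installed.

```r
library(data.table)

# Hypothetical example data
df <- data.frame(id = c(1, 1, 2), value = c(10, 20, 30))

setDT(df)        # converts df in place; no copy is made
class(df)        # now includes "data.table"

# data.table syntax works on df from here on
res <- df[, .(total = sum(value)), by = id]

setDF(df)        # convert the input back to a data.frame, again by reference
setDF(res)       # convert the result as well
class(df)
```

Between the setDT and setDF calls, df is a data.table everywhere it is visible, which is exactly why data.frame-style code that runs in between can break.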
If that is not ideal, then use as.data.table(.) inside your function, as it operates on a copy. Then you can still use setDF() to convert the resulting data.table to a data.frame and return that data.frame from your function.
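Applied to the function from the question, that suggestion might look like the sketch below. It is one possible rewrite rather than the only way to do it, and the example data (df, g, v) is invented for illustration. Note that as.data.table copies the input, so this trades some memory and time for leaving x untouched.

```r
library(data.table)

FastAgg <- function(x, FUN, aggFields, byFields = NULL, ...) {
  y <- as.data.table(x)   # works on a copy, so the caller's x is not modified
  y <- y[, lapply(X = .SD, FUN = FUN, ...), .SDcols = aggFields, by = byFields]
  setDF(y)                # convert the (local) result in place
  y
}

# Hypothetical usage
df <- data.frame(g = c("a", "a", "b"), v = c(1, 2, 3))
out <- FastAgg(df, sum, aggFields = "v", byFields = "g")
class(df)    # the input is still a plain data.frame
class(out)   # the result is a plain data.frame too
```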
These functions were introduced recently (mostly due to user requests). The idea for avoiding this confusion is to export a shallow() function, keep track of objects that require columns to be copied, and do it all internally (and automatically). It's all in very early stages right now. When we've managed it, I'll update this post.
Also have a look at ?copy, ?setDT and ?setDF. The first paragraph of these functions' help pages is:
In data.table parlance, all set* functions change their input by reference. That is, no copy is made at all, other than temporary working memory, which is as large as one column. The only other data.table operator that modifies input by reference is :=. Check out the See Also section below for other set* function data.table provides.
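To make the point about := and copy() concrete, here is a small sketch with made-up objects DT and DT2. It shows copy() being used to protect the original from a by-reference update.

```r
library(data.table)

DT <- data.table(a = 1:3)

DT2 <- copy(DT)        # explicit deep copy of DT
DT2[, b := a * 2L]     # := adds column b to DT2 by reference

names(DT)              # DT is unaffected
names(DT2)             # DT2 has the new column
```

Without the copy() call (for example, with a plain DT2 <- DT), both names would refer to the same object and the new column would show up in DT as well.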
And the example for setDT:
set.seed(45L)
X = data.frame(A=sample(3, 10, TRUE),
B=sample(letters[1:3], 10, TRUE),
C=sample(10), stringsAsFactors=FALSE)
# get the frequency of each "A,B" combination
setDT(X)[, .N, by="A,B"][]
does no assignment (although I admit it could be explained slightly better here).
In setDF:
X = data.table(x=1:5, y=6:10)
## convert 'X' to data.frame, without any copy.
setDF(X)
I think this is pretty clear. But I'll try to provide more clarity. Also, I'll try and add how best to use these functions in the documentation as well.