
Can you "tie", or provide an alternative implementation, of a data.frame in R?

Tags: sqlite, r

In Perl (and probably other languages), you can "tie" a variable to replace its built-in behavior with user-defined behavior. For example, a hash table can be tied with custom "fetch" and "store" subroutines which query BerkeleyDB, so that the data is persistent and not limited by RAM, but still looks and acts like a regular hash to Perl.

Is something similar possible with R? In particular, I was thinking, since a data.frame looks much like a table in a relational db, that if a data.frame were tied to something like SQLite, it would enable R to handle very large data frames (I've stuffed 100GB+ into SQLite) without any code changes.

user3243135 asked Aug 29 '14 08:08


1 Answer

As the comments point out, a handful of packages have already been built on this idea (or a similar one).

data.table and dplyr are exceptionally good at handling and querying very large data.frames. If the data.frame is actually >100GB, I would rather recommend data.table, which seems to outperform dplyr in the limit nrow->Inf (bear in mind that data.table keeps the data in memory). Both have excellent support on Stack Overflow should you need it.

However, to actually answer your question (and to be useful to future readers of this question): yes, it is possible to overload a function in R to provide an alternative behavior. It is actually very easy with the S3 dispatch system. I recommend this resource to learn more.

I'll give you the condensed version: If you have an object of class "myclass", you can write a function f.myclass to do what you want.

Then you define the generic function f:

f <- function(obj, ...) UseMethod("f")

When you call f(obj), the function that UseMethod will call depends on the class of obj.

If obj is of class "myclass", then f.myclass will be called on obj.
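
As a minimal sketch of the dispatch just described: the names describe, describe.myclass and describe.default below are made up for illustration and do not come from any package.

```r
# Hypothetical generic: dispatches on the class of its first argument.
describe <- function(obj, ...) UseMethod("describe")

# Method used for objects whose class vector contains "myclass".
describe.myclass <- function(obj, ...) "an object of class myclass"

# Fallback used when no class-specific method is found.
describe.default <- function(obj, ...) "some other object"

obj <- structure(list(), class = "myclass")
describe(obj)   # dispatches to describe.myclass
describe(1:3)   # falls back to describe.default
```

Defining a .default method is optional, but without one the generic fails with an error for any object that has no matching method.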

If the function you want to redefine already exists, say plot, then you can simply define plot.myclass which will be used when you call plot on a "myclass" object. The generic function already exists, no need to redefine it.
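
For instance, summary is already a generic in base R, so a method is all that is needed. This is only a sketch: the body of summary.myclass is an arbitrary illustration, and plot.myclass would be written the same way.

```r
# summary() is already generic, so only the method has to be defined.
# The class name "myclass" follows the text; the body is illustrative.
summary.myclass <- function(object, ...) {
  cat("myclass object with", length(unclass(object)), "element(s)\n")
  invisible(object)
}

x <- structure(list(a = 1, b = 2), class = "myclass")
summary(x)   # dispatches to summary.myclass
```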

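To change the class of an object, use class<-. Adding the new class to the existing ones is more common than replacing them, since it keeps the behavior you don't want to change. Keep in mind that dispatch tries the classes in order, so a class appended at the end only takes effect for generics that the earlier classes have no method for.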

Here's a silly example.

> print.myclass <- function(x, ...) {
    print("Hello!")}

> df <- data.frame(a=1:3)
> class(df)
[1] "data.frame"
> df #equivalent to print(df)
  a
1 1
2 2
3 3

> class(df) <- append(class(df), "myclass") # appended last: print.data.frame still takes precedence
> class(df)
[1] "data.frame" "myclass"   

> class(df) <- "myclass"
> class(df)
[1] "myclass"
> df
[1] "Hello!"
> str(df) # checking the structure of df: the data is still there of course
List of 1
 $ a: int [1:3] 1 2 3
 - attr(*, "row.names")= int [1:3] 1 2 3
 - attr(*, "class")= chr "myclass"

There are some subtleties, like which function is called if there are several classes, in what order, etc. I refer you to a thorough explanation of the S3 system.
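
To make the ordering concrete, here is a small sketch comparing an appended and a prepended class. The variable names appended and prepended are made up for illustration; NextMethod() is the base R mechanism for handing over to the next class in the vector.

```r
# Illustrative method; the class name "myclass" follows the text.
print.myclass <- function(x, ...) {
  cat("<myclass wrapper>\n")
  NextMethod()   # continue with the next class in the vector (data.frame)
}

df <- data.frame(a = 1:3)

# Appended: "data.frame" comes first, so print.data.frame is used.
appended <- df
class(appended) <- c(class(appended), "myclass")

# Prepended: "myclass" comes first, so print.myclass is used.
prepended <- df
class(prepended) <- c("myclass", class(prepended))

print(appended)    # plain data.frame output
print(prepended)   # wrapper line, then the data.frame output
```

Prepending the class and calling NextMethod() is a common way to add behavior on top of an existing class without losing what it already does.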

That's how you would redefine the behavior of functions. Re-write them as f.myclass and then create objects of class "myclass".
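
Applied to the original question, the same mechanism can give a rough "tie". The sketch below is only a stand-in under stated assumptions: it spills columns to .rds files in a temporary directory rather than to SQLite, and the names tie_df and tied_df are hypothetical. A usable version would also need methods such as `[`, dim and print, and a real database backend.

```r
# Hypothetical constructor: write each column of `df` to its own .rds file
# and keep only the directory and the column names in memory.
tie_df <- function(df, dir = tempfile("tied_df_")) {
  dir.create(dir)
  for (col in names(df)) {
    saveRDS(df[[col]], file.path(dir, paste0(col, ".rds")))
  }
  structure(list(), dir = dir, cols = names(df), class = "tied_df")
}

# Column access reads from disk on demand instead of from memory.
`[[.tied_df` <- function(x, i, ...) {
  stopifnot(is.character(i), i %in% attr(x, "cols"))
  readRDS(file.path(attr(x, "dir"), paste0(i, ".rds")))
}
`$.tied_df` <- function(x, name) x[[name]]

# names() reports the stored columns rather than the (empty) list.
names.tied_df <- function(x) attr(x, "cols")

tied <- tie_df(data.frame(a = 1:3, b = c("x", "y", "z")))
names(tied)    # column names, taken from the metadata
tied$a         # read back from disk
mean(tied$a)
```

The metadata lives in attributes and is read with attr(), which does not dispatch, so the `$` method cannot recurse into itself.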

Alternatively, you could redefine f.targetclass directly, which changes the behavior for every object of that class in your session. For example, again with print and data.frame:

> print.data.frame <- function(x, ...) {
         print(paste("data.frame with columns:", paste(names(x), collapse = ", ")))} # less silly example!
> df <- data.frame(a=1:3, b=4:6)
> df
[1] "data.frame with columns: a, b"

asachet answered Nov 14 '22 08:11