
R: When using data.table how do I get columns of y when I do x[y]?

Tags: r, data.table

UPDATE: Old question ... it was resolved by data.table v1.5.3 in Feb 2011.

I am trying to use the data.table package, and really like the speedups I am getting, but I am stumped by this error when I do x[y, <expr>] where x and y are "data-tables" with the same key, and <expr> contains column names of both x and y:

require(data.table)
x <- data.table( foo = 1:5, a = 5:1 )
y <- data.table( foo = 1:5, boo = 10:14)
setkey(x, foo)
setkey(y, foo)
> x[y, foo*boo]
Error in eval(expr, envir, enclos) : object 'boo' not found

UPDATE... To clarify the functionality I am looking for in the above example: I need to do the equivalent of the following:

with(merge(x,y), foo*boo)

However according to the below extract from the data.table FAQ, this should have worked:

Finally, although it appears as though x[y] does not return the columns in y, you can actually use the columns from y in the j expression. This is what we mean by join inherited scope. Why not just return the union of all the columns from x and y and then run expressions on that? It boils down to efficiency of code and what is quicker to program. When you write x[y,foo*boo], data.table automatically inspects the j expression to see which columns it uses. It will only subset, or group, those columns only. Memory is only created for the columns the j uses. Let's say foo is in x, and boo is in y (along with 20 other columns in y). Isn't x[y,foo*boo] quicker to program and quicker to run than a merge step followed by another subset step?

I am aware of this question that addressed a similar issue, but it did not seem to have been resolved satisfactorily. Anyone know what I am missing or misunderstanding? Thanks.

UPDATE: I asked on the data-table help mailing list and the package author (Matthew Dowle) replied that indeed the FAQ quoted above is wrong, so the syntax I am using will not work currently, i.e. I cannot refer to the y columns in the j (i.e. second) argument when I do x[y,...].
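For readers on a later release: per the update at the top of the question, v1.5.3 resolved this. The following is a minimal sketch, assuming a data.table version at or after that fix is installed, of a keyed join whose j expression refers to a column of y, checked against the merge-based equivalent. The names res_join and res_merge are illustrative only.

```r
library(data.table)

x <- data.table(foo = 1:5, a = 5:1)
y <- data.table(foo = 1:5, boo = 10:14)
setkey(x, foo)
setkey(y, foo)

# Keyed join of x with y; j uses foo (the join column) and boo (only in y).
res_join <- x[y, foo * boo]

# The merge-based equivalent from the question.
res_merge <- with(merge(x, y), foo * boo)

print(res_join)
print(all.equal(res_join, res_merge))
```

If the two results agree, the join form can stand in for the merge-then-with step without materialising the merged table first.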

Prasad Chalasani asked Jan 21 '11 22:01



1 Answer

I am not sure I understand the problem well, and I have only just started reading the data.table documentation, but if you would like to get the columns of y and also combine them with column a of x, you might try something like:

> x[y,a*y]
     foo boo
[1,]   5  50
[2,]   8  44
[3,]   9  36
[4,]   8  26
[5,]   5  14

Here, you get back the columns of y multiplied by the a column of x. If you want to get x's foo multiplied by y's boo, try:

> y[,x*boo]
     foo  a
[1,]  10 50
[2,]  22 44
[3,]  36 36
[4,]  52 26
[5,]  70 14

After editing: thank you @Prasad Chalasani for making the question clearer for me.

If simple merging is preferred, then the following should work. I made up a slightly more complex data set to look at the behaviour more closely:

x <- data.table( foo = 1:5, a=20:24, zoo = 5:1 )
y <- data.table( foo = 1:5, b=30:34, boo = 10:14)
setkey(x, foo)
setkey(y, foo)

So only one extra column was added to each data.table. Let us compare merge with doing the same thing directly on the data.tables:

> system.time(merge(x,y))
   user  system elapsed 
  0.027   0.000   0.023 
> system.time(x[,list(x,y)])
   user  system elapsed 
  0.003   0.000   0.006 

The latter looks a lot faster in this run. The results are not identical, but they can be used in the same way (the latter has an extra column, foo.1):

> merge(x,y)
     foo  a zoo  b boo
[1,]   1 20   5 30  10
[2,]   2 21   4 31  11
[3,]   3 22   3 32  12
[4,]   4 23   2 33  13
[5,]   5 24   1 34  14
> x[,list(x,y)]
     foo  a zoo foo.1  b boo
[1,]   1 20   5     1 30  10
[2,]   2 21   4     2 31  11
[3,]   3 22   3     3 32  12
[4,]   4 23   2     4 33  13
[5,]   5 24   1     5 34  14

So to get xy we might use: xy <- x[,list(x,y)]. To compute a one-column data.table from xy$foo * xy$boo, the following might work:

> xy[,foo*boo]
[1] 10 22 36 52 70

Well, the result is not a data.table but a vector instead.
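If a data.table is wanted rather than a vector, wrapping the j expression in list() is the usual data.table idiom. The following is a minimal sketch, assuming a release at or after v1.5.3 so that the keyed join x[y] carries y's columns. Here xy is built with x[y] rather than x[,list(x,y)], and the column name foo_boo is made up for illustration.

```r
library(data.table)

x <- data.table(foo = 1:5, a = 20:24, zoo = 5:1)
y <- data.table(foo = 1:5, b = 30:34, boo = 10:14)
setkey(x, foo)
setkey(y, foo)

# Keyed join: the result holds x's columns followed by y's non-key columns.
xy <- x[y]

# A bare expression in j returns a plain vector ...
vec <- xy[, foo * boo]

# ... while wrapping it in list() returns a one-column data.table.
dt <- xy[, list(foo_boo = foo * boo)]

print(class(vec))
print(dt)
```

Naming the element inside list() sets the column name of the result; without a name, data.table assigns a default one.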


Update (29/03/2012): thanks to @David for drawing my attention to the fact that merge.data.table was used in the above examples.

daroczig answered Oct 14 '22 23:10