
R: Should keys behave this way in data.table?

Tags:

r

key

data.table

I have encountered a somewhat unintuitive behavior of keys in the data.table package. Here is an example:

library(data.table)
foo <- data.table(a = c(1:4), b = c(2:5), c = c(3:6), d = c(4:7))
setkey(foo, b)

Calling key() on grouped results then gives one alarming result:

key(foo[, .(mean(c + d)), by = .(b)]) # result is "b".
key(foo[, .(mean(c + d)), by = .(a)]) # result is "a". (!!)

Here is another example, which produces different, more reasonable results:

foo <- data.table(a = c(4:1), b = c(2:5), c = c(3:6), d = c(4:7))
setkey(foo, b)
key(foo[, .(mean(c + d)), by = .(b)]) # result is "b".
key(foo[, .(mean(c + d)), by = .(a)]) # result is NULL

I admit I'm confused. My guess is that data.table somehow checks whether the resulting table is already sorted by the columns in by and, if so, assumes it is keyed. Is it a feature? Is it a bug?
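One way to probe that guess is to check directly whether the grouping column comes back sorted in each case. The following is only a sketch: the names foo1, foo2, res1, res2 and the output column name V1 are introduced here for illustration, and the check uses base R's is.unsorted() rather than anything data.table does internally.

```r
library(data.table)

# First example: a is increasing, so grouping by a returns groups in sorted order.
foo1 <- data.table(a = 1:4, b = 2:5, c = 3:6, d = 4:7)
setkey(foo1, b)
res1 <- foo1[, .(V1 = mean(c + d)), by = .(a)]
is.unsorted(res1$a)  # FALSE: column a of the result is in non-decreasing order

# Second example: a is decreasing, and by = keeps groups in order of first appearance.
foo2 <- data.table(a = 4:1, b = 2:5, c = 3:6, d = 4:7)
setkey(foo2, b)
res2 <- foo2[, .(V1 = mean(c + d)), by = .(a)]
is.unsorted(res2$a)  # TRUE: column a of the result is 4, 3, 2, 1
```

The two cases differ only in whether the grouping column happens to be ordered in the result, which matches the pattern seen in the key() output.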

Karol asked Jul 14 '17 09:07



1 Answer

Is it a feature? Is it a bug?

In the first example you get key = "a" because the result of that query happened to come back with column a in non-decreasing order. In that sense the behaviour could be called a feature.
The problem is that silently creating a key is not always desired, so the behaviour has been changed since you asked the question.
Now (as of 1.12.0), running the code from the first chunk drops the key and ignores the fact that the result is ordered by a.

library(data.table)
foo <- data.table(a = c(1:4), b = c(2:5), c = c(3:6), d = c(4:7))
setkey(foo, b)
key(foo[, .(mean(c + d)), by = .(b)])
#[1] "b"
key(foo[, .(mean(c + d)), by = .(a)])
#NULL
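If a key on the grouping column is actually wanted, it can be requested explicitly instead of relying on how the result happens to be ordered. A minimal sketch using the second example's data; the names res_keyby, res and V1 are introduced here for illustration:

```r
library(data.table)

foo <- data.table(a = 4:1, b = 2:5, c = 3:6, d = 4:7)
setkey(foo, b)

# keyby = groups like by =, then sorts the result by the grouping
# columns and sets them as the key of the result.
res_keyby <- foo[, .(V1 = mean(c + d)), keyby = .(a)]
key(res_keyby)

# Alternatively, group with by = and set the key afterwards.
res <- foo[, .(V1 = mean(c + d)), by = .(a)]
setkey(res, a)
key(res)
```

Either way the key is stated in the code, so it does not depend on the input order or on the installed data.table version.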
jangorecki answered Oct 06 '22 16:10