Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Subset row and column at the same time

Tags:

r

data.table

I'm a bit surprised by how data.table works:

> library(data.table)
data.table 1.8.2  For help type: help("data.table")
> dt <- data.table(a=11:20, b=21:30, c=31:40, key="a")
> dt[list(12)]
    a  b  c
1: 12 22 32
> dt[list(12), b]
    a  b
1: 12 22
> dt[list(12)][,b]
[1] 22

What I'm trying to do is obtain the value of a single column (or expression) in rows matched by a selection. I see that I've got to pass the key as a list as a raw number would indicate a row number and not a key value. So the first of the above is clear to me. But why the second and the thirs subset expression yield different results seems rather confusing to me. I'd like to get the third result, but would exect being able to write it the second way.

Is there any good reason why subsetting a data.table for rows and columns at the same time will always include the key value as well as the computed result? Is there a syntactically shorter way to obtain a single result except by double subsetting as above?

I'm using data.table 1.8.2 on R 2.15.1. If you cannot reproduce my example, you might as well consider a factor as key:

dt <- data.table(a=paste("a", 11:20, sep=""), b=21:30, c=31:40, key="a")
dt["a11", b]
like image 381
MvG Avatar asked Aug 03 '12 12:08

MvG


2 Answers

Regarding this question:

Is there any good reason why subsetting a data.table for rows and columns at the same time will always include the key value as well as the computed result?

I believe that the (good enough for me) reason is simply that Matthew Dowle hasn't yet gotten around to adding that option (likely because he has prioritized work on much more useful features such as ":= with by").

In comments following my answer here, Matthew seemed to indicate that it is on his TODO list, noting that "[this] is what drop=TRUE will do (with a speed advantage) when drop is added".

Until then, any of the following will get the job done:

dt[list(12)][,b]
# [1] 22
dt[list(12)][[2]]
# [1] 22
dt[dt[list(12), which=TRUE], b]
# [1] 22
like image 137
Josh O'Brien Avatar answered Oct 21 '22 19:10

Josh O'Brien


One possibility is to use:

dt[a == 12]

and

dt[a == 12, b]

This will work as expected, but it prevents binary search and requires sequential search instead (is there a plan to change this behavior ??), making it potentially slower.


UPDATE Sep 2014: now in v1.9.3

From NEWS :

DT[column==values] is now optimized to use DT's key when key(DT)[1]=="column", otherwise a secondary key (a.k.a. index) is automatically added so the next DT[column==values] is much faster. DT[column %in% values] is equivalent; i.e., both == and %in% accept vector values. No code changes are needed; existing code should automatically benefit. Secondary keys can be added manually using set2key() and existence checked using key2(). These optimizations and function names/arguments are experimental and may be turned off with options(datatable.auto.index=FALSE).

like image 25
nassimhddd Avatar answered Oct 21 '22 21:10

nassimhddd