I recently discovered binary search in data.table. If the table is sorted on multiple keys, is it possible to search on the 2nd key only?
DT = data.table(x=sample(letters,1e7,T),y=sample(1:25,1e7,T),rnorm(1e7))
setkey(DT,x,y)
#R> DT[J('x')]
# x y V3
# 1: x 1 0.89109
# 2: x 1 -2.01457
# ---
#384922: x 25 0.09676
#384923: x 25 0.25168
#R> DT[J('x',3)]
# x y V3
# 1: x 3 -0.88165
# 2: x 3 1.51028
# ---
#15383: x 3 -1.62218
#15384: x 3 -0.63601
EDIT: thanks to @Arun
R> system.time(DT[J(unique(x), 25)])
user system elapsed
0.220 0.068 0.288
R> system.time(DT[y==25])
user system elapsed
0.268 0.092 0.359
Description. setkey sorts a data.table and marks it as sorted with an attribute sorted. The sorted columns are the key. The key can be any number of columns.
Yes, you can pass all unique values of the first key column and subset with the specific value for the second key.
DT[J(unique(x), 25), nomatch=0]
If you need to subset by more than one value in the second key (e.g. the equivalent of DT[y %in% 25:24]), a more general solution is to use CJ
DT[CJ(unique(x), 25:24), nomatch=0]
Note that CJ by default sorts the columns and sets the key to all the columns, which means the result would be sorted as well. If that's not desirable, you should use sorted=FALSE
DT[CJ(unique(x), 25:24, sorted=FALSE), nomatch=0]
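To check that these keyed lookups return the same rows as a plain vector scan, here is a minimal sketch on a small table. The table size, the seed, and the column name v are my own choices for illustration, not from the question.

```r
library(data.table)

set.seed(1)
# Small stand-in for the table in the question (sizes are illustrative)
DT <- data.table(x = sample(letters[1:5], 1000, TRUE),
                 y = sample(1:25, 1000, TRUE),
                 v = rnorm(1000))
setkey(DT, x, y)

# Binary search on the 2nd key: supply every value of the 1st key
by_key  <- DT[J(unique(DT$x), 25L), nomatch = 0L]
# Equivalent vector scan
by_scan <- DT[y == 25L]

# Several values of the 2nd key via a cross join
by_cj <- DT[CJ(unique(DT$x), 24:25), nomatch = 0L]
by_in <- DT[y %in% 24:25]
```

Both pairs should contain the same rows; the keyed versions simply reach them through binary search on (x, y) instead of scanning every value of y.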
There's also a feature request to add secondary keys to data.table in the future. I believe the plan is to add a new function set2key.
FR#1007 Build in secondary keys
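For readers on a later data.table release: as far as I know, secondary indices ended up being exposed through setindex() together with the on= argument, rather than under the set2key name. The sketch below assumes a version that has both (I believe 1.9.6 or later); the small table is illustrative only.

```r
library(data.table)

set.seed(1)
DT <- data.table(x = sample(letters[1:5], 1000, TRUE),
                 y = sample(1:25, 1000, TRUE),
                 v = rnorm(1000))
setkey(DT, x, y)

# Build a secondary index on y; the physical row order (key x, y) is unchanged
setindex(DT, y)
indices(DT)   # lists the existing secondary indices

# Join on y alone, without touching the primary key
res <- DT[.(25L), on = "y", nomatch = 0L]
```

The primary key stays (x, y), so lookups such as DT[J('x', 3)] keep working as before.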
There is also merge, which has a method for data.table. It builds the secondary key inside it for you, so it should be faster than base merge. See ?merge.data.table.
Based on this email thread I wrote the following functions:
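create_index = function(dt, ..., verbose = getOption("datatable.verbose")) {
  # Capture the unquoted column names passed in ... as a character vector
  # (replaces the internal data.table:::getdots() helper)
  cols = as.character(substitute(list(...))[-1L])
  res = dt[, cols, with = FALSE]
  # Record each row's position in dt before sorting
  res[, i := seq_len(nrow(dt))]
  # Key the index by the chosen columns; setkeyv returns res invisibly
  setkeyv(res, cols, verbose = verbose)
}
JI = function(index, ...) {
  # Binary search on the index, returning the matching row numbers of dt
  index[J(...)]$i
}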
Here are the results on my system with a larger DT (1e8 rows):
> system.time(DT[J("c")])
user system elapsed
0.168 0.136 0.306
> system.time(DT[J(unique(x), 25)])
user system elapsed
2.472 1.508 3.980
> system.time(DT[y==25])
user system elapsed
4.532 2.149 6.674
> system.time(IDX_y <- create_index(DT, y))
user system elapsed
3.076 2.428 5.503
> system.time(DT[JI(IDX_y, 25)])
user system elapsed
0.512 0.320 0.831
Building the index takes roughly as long as a single vector scan, so it is worth it if you use the index multiple times.