Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Fastest way of finding matching rows

Tags:

r

data.table

xts

I am wondering what is the fastest way of finding all rows in xts object that are the same as one particular row

library(xts)

nRows <- 3

coreData <- data.frame(a=rnorm(nRows), b=rnorm(nRows), c=rnorm(nRows))

testXts1 <- xts(coreData, order.by=as.Date(1:nRows))
testXts2 <- xts(coreData, order.by=as.Date((nRows + 1):(2*nRows)))
testXts3 <- xts(coreData, order.by=as.Date((2*nRows + 1):(3*nRows)))

testXts <- rbind(testXts1, testXts2, testXts3)

> testXts
                    a         b         c
1970-01-02 -0.3288756  1.441799  1.321608
1970-01-03 -0.7105016  1.639239 -2.056861
1970-01-04  0.1138675 -1.782825 -1.081799
1970-01-05 -0.3288756  1.441799  1.321608
1970-01-06 -0.7105016  1.639239 -2.056861
1970-01-07  0.1138675 -1.782825 -1.081799
1970-01-08 -0.3288756  1.441799  1.321608
1970-01-09 -0.7105016  1.639239 -2.056861
1970-01-10  0.1138675 -1.782825 -1.081799

rowToSearch <- first(testXts)

> rowToSearch
                    a        b        c
1970-01-02 -0.3288756 1.441799 1.321608

indicesOfMatchingRows <- unlist(apply(testXts, 1, function(row)  lapply(1:NCOL(row), function(i) row[i] == coredata(rowToSearch[, i]))))

testXts[indicesOfMatchingRows, ]

                    a         b         c
1970-01-02 -0.3288756  1.441799  1.321608
1970-01-05 -0.3288756  1.441799  1.321608
1970-01-08 -0.3288756  1.441799  1.321608

I am sure this can be done in more elegant and fast way.

A more general question is how you say in R "I have this row matrix[5, ] how can I find (indexes of) other rows in matrix that are the same as matrix[5, ]".

How to do this in data.table?

like image 927
Samo Avatar asked Jun 20 '15 15:06

Samo


2 Answers

Since you said that speed is your main concern, you can get speedups even over a data.table solution with Rcpp:

library(Rcpp)
cppFunction(
"LogicalVector compareToRow(NumericMatrix x, NumericVector y) {
  const int nr = x.nrow();
  const int nc = x.ncol();
  LogicalVector ret(nr, true);
  for (int j=0; j < nr; ++j) {
    for (int k=0; k < nc; ++k) {
      if (x(j, k) != y[k]) {
        ret[j] = false;
        break;
      }
    }
  }
  return ret;
}")
testXts[compareToRow(testXts, rowToSearch),]
#                   a         b         c
# 1970-01-02 1.324457 0.8485654 -1.464764
# 1970-01-05 1.324457 0.8485654 -1.464764
# 1970-01-08 1.324457 0.8485654 -1.464764

Here's a comparison on a fairly large instance (with 1 million rows):

set.seed(144)
bigXts <- testXts[sample(nrow(testXts), 1000000, replace=TRUE),]
testDT <- as.data.frame(bigXts)

josilber <- function(x, y) x[compareToRow(x, y),]
roland.base <- function(x, y) x[colSums(t(x) != as.vector(y)) == 0L,]
library(data.table)
roland.dt <- function(testDT, y) {
  setDT(testDT, keep.rownames=TRUE)
  setkey(testDT, a, b, c)
  testDT[setDT(as.data.frame(y))]
}
library(microbenchmark)
microbenchmark(josilber(bigXts, rowToSearch), roland.base(bigXts, rowToSearch), roland.dt(testDT, rowToSearch), times=10)
# Unit: milliseconds
#                              expr         min         lq       mean     median         uq       max
#     josilber(bigXts, rowToSearch)    7.830986   10.24748   45.64805   14.41775   17.37049  258.4404
#  roland.base(bigXts, rowToSearch) 3530.042324 3964.72314 4288.05758 4179.64233 4534.21407 5400.5619
#    roland.dt(testDT, rowToSearch)   32.826285   34.95014  102.52362   57.30213  130.51053  267.2249

This benchmark assumes the object has been converted to a data frame (~4 seconds overhead) before calling the roland.dt and that compareToRows has been compiled (~3 seconds overhead) before calling josilber. The Rcpp solution is about 300x faster than the base R solution and about 4x faster than the data.table solution in median runtime. The approach based on digest was not competitive, taking more than 60 seconds to execute each time.

like image 64
josliber Avatar answered Nov 14 '22 20:11

josliber


Here is a faster base R solution:

ind <- colSums(t(testXts) != as.vector(rowToSearch)) == 0L
testXts[ind,]

Here is a solution using a data.table join:

library(data.table)
testDT <- as.data.frame(testXts)
setDT(testDT, keep.rownames=TRUE)
setkey(testDT, a, b, c)
testDT[setDT(as.data.frame(rowToSearch))]

However, I would be wary when comparing floating point numbers.

like image 6
Roland Avatar answered Nov 14 '22 19:11

Roland