Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Effective search in two-dimensional window

Tags:

r

Consider a matrix specifying one two-dimensional region per line, and another matrix specifying point in a plane:

xmin <- c(3, 14, 25, 61)
xmax <- c(5, 18, 27, 65)
ymin <- c(33, 12, 83, 2)
ymax <- c(35, 16, 90, 6)
regions <- cbind(xmin, xmax, ymin, ymax)

x <- c(7, 26, 4, 16)
y <- c(4, 85, 30, 13)
points <- cbind(x, y)

What is the fastest way of obtaining the indexes in regions that contains each of the points in points?

An example of what I want to achieve is:

apply(points, 1, function(x){
    which(regions[,'xmin'] < x[1] & regions[,'xmax'] > x[1] & regions[,'ymin'] < x[2] & regions[,'ymax'] > x[2])
})

But as the number of rows in both regions and points approach 1E5 this gets rather slow and I'm looking for a proper vectorized approach...

Thanks in advance...

best Thomas

EDIT:

For anyone interested I made a function in C++ using Rcpp that provides roughly a 50x performance improvement. I'm not fluent in C++ so it could possibly be done better...

cppFunction('
    IntegerVector findInRegion(NumericVector x, NumericVector y, NumericVector xmin, NumericVector xmax, NumericVector ymin, NumericVector ymax){
        int pointSize = x.size();
        int regionSize = xmin.size();
        IntegerVector ans(pointSize);
        for(int i = 0; i < pointSize; i++){
            ans[i] = NA_INTEGER;
        }

        for(int i = 0; i < pointSize; i++){
            for(int j = 0; j < regionSize; j++){
                if(x[i] > xmin[j]){
                    if(x[i] < xmax[j]){
                        if(y[i] > ymin[j]){
                            if(y[i] < ymax[j]){
                                ans[i] = j+1;
                            };
                        };
                    };
                };
            };
        };
        return ans;
    }
')

findRegion <- function(points, regions){
    if(!all(c('x', 'y') %in% colnames(points))){
        stop('points must contain columns named \'x\' and \'y\'')
    }
    if(!all(c('xmin', 'xmax', 'ymin', 'ymax') %in% colnames(regions))){
        stop('regions must contain columns named \'xmin\', \'xmax\', \'ymin\' and \'ymax\'')
    }
    findInRegion(points[, 'x'], points[,'y'], regions[, 'xmin'], regions[, 'xmax'], regions[, 'ymin'], regions[, 'ymax'])
}

One drawback of this function is that it assumes a point can only belong to one region...

like image 872
ThomasP85 Avatar asked Jun 10 '13 08:06

ThomasP85


2 Answers

Here is another solution, using an R-tree index (a type of database index designed to store bounding boxes) with SQLite. It turns out to be slightly slower than Simon's (7 seconds), probably because the data is copied to disk.

# Sample data: data.frames, rather than matrices
regions <- data.frame(id=1:length(xmin), xmin, xmax, ymin, ymax)
points  <- data.frame(x, y)

library(RSQLite)
con <- dbConnect("SQLite", dbname = "/tmp/a.sqlite") 
dbGetQuery( con, "CREATE VIRTUAL TABLE regions USING rtree (id, xmin, xmax, ymin, ymax)" )
dbWriteTable( con, "regions", regions, row.names = FALSE, append = TRUE )
dbWriteTable( con, "points",  points, row.names = TRUE )
res <- dbGetQuery( con, "
  SELECT points.row_names, regions.id
  FROM   points, regions
  WHERE  xmin <= x AND x <= xmax
  AND    ymin <= y AND y <= ymax
" )
like image 160
Vincent Zoonekynd Avatar answered Nov 04 '22 09:11

Vincent Zoonekynd


This is a really interesting problem. I did some initial testing and this seems like it might be faster, but I really don't know how well it scales. I'd be interested if you could test on your real data and report some timings:

#  Are X coords greater than xmin
lx <- outer( points[,1] , regions[,1] , ">" )

#  Are X coords less than xmax
hx <- outer( points[,1] , regions[,2] , "<" )

#  Ditto for Y coords
ly <- outer( points[,2] , regions[,3] , ">" )
hy <- outer( points[,2] , regions[,4] , "<" )

#  These matrices for X and Y points have 1 if coords is in range, 0 otherwise
inx <- lx * hx
iny <- ly * hy

#  The final result matrix has 1 if both X and Y coords are in range and 0 if not
#  Rows are points, columns are regions
res <- inx * iny

On data of 100000 points and 100000 regions this approach won't work unless you have alot of RAM. However I think it's pretty useable if you divide the number of regions up into chunks of around 1000 each. On my desktop, 100,000 points and 1,000 regions took 5 seconds:

Unit: seconds
        expr      min      lq  median       uq      max neval
 eval(simon) 4.528942 4.55258 4.59848 4.607572 4.671511     5

As a rough guide to the magnitude of difference in timings I saw between your apply method and this one, with 10,000 points and 1,000 regions (based on 5 runs):

Unit: milliseconds
        expr       min        lq    median        uq       max neval
 eval(simon)  394.7165  402.0919  403.0491  404.6943  428.7077     5
    eval(OP) 1359.5889 1364.6308 1372.4980 1383.1327 1491.4628     5

And with 100,000 points and 1,000 regions (based on one run):

Unit: seconds
        expr       min        lq    median        uq       max neval
 eval(simon)  4.352857  4.352857  4.352857  4.352857  4.352857     1
    eval(OP) 14.027390 14.027390 14.027390 14.027390 14.027390     1

This is the code I used to generate sample data and to run the benchmark:

set.seed(4862)
xmin <- sample(25,1000,repl=T)
xmax <- xmin + sample(15,100,repl=T)
ymin <- sample(25,1000,repl=T)
ymax <- ymin + sample(15,1000,repl=T)
regions <- cbind(xmin, xmax, ymin, ymax)

x <- sample(25,100000,repl=T)
y <- sample(25,100000,repl=T)
points <- cbind(x, y)


OP <- quote({ res <- apply(points, 1, function(x){
    which(regions[,'xmin'] < x[1] & regions[,'xmax'] > x[1] & regions[,'ymin'] < x[2] & regions[,'ymax'] > x[2])
}) })


simon <- quote({
lx <- outer( points[,1] , regions[,1] , ">" )
hx <- outer( points[,1] , regions[,2] , "<" )
ly <- outer( points[,2] , regions[,3] , ">" )
hy <- outer( points[,2] , regions[,4] , "<" )
inx <- lx * hx
iny <- ly * hy
res <- inx * iny })

require(microbenchmark)
microbenchmark( eval(simon) , eval(OP) , times = 1L )

I'd recommend doing this in chunks. HTH.

like image 21
Simon O'Hanlon Avatar answered Nov 04 '22 08:11

Simon O'Hanlon