Consider a matrix specifying one two-dimensional region per row, and another matrix specifying points in a plane:
xmin <- c(3, 14, 25, 61)
xmax <- c(5, 18, 27, 65)
ymin <- c(33, 12, 83, 2)
ymax <- c(35, 16, 90, 6)
regions <- cbind(xmin, xmax, ymin, ymax)
x <- c(7, 26, 4, 16)
y <- c(4, 85, 30, 13)
points <- cbind(x, y)
What is the fastest way of obtaining, for each point in points, the index of the row in regions that contains it?
An example of what I want to achieve is:
apply(points, 1, function(x){
    which(regions[,'xmin'] < x[1] & regions[,'xmax'] > x[1] &
          regions[,'ymin'] < x[2] & regions[,'ymax'] > x[2])
})
But as the number of rows in both regions and points approaches 1e5 this gets rather slow, and I'm looking for a proper vectorized approach...
Thanks in advance...
best Thomas
EDIT:
For anyone interested I made a function in C++ using Rcpp that provides roughly a 50x performance improvement. I'm not fluent in C++ so it could possibly be done better...
library(Rcpp)

cppFunction('
IntegerVector findInRegion(NumericVector x, NumericVector y,
                           NumericVector xmin, NumericVector xmax,
                           NumericVector ymin, NumericVector ymax){
    int pointSize = x.size();
    int regionSize = xmin.size();

    // default to NA for points that fall inside no region
    IntegerVector ans(pointSize, NA_INTEGER);

    // brute-force search: for each point, scan all regions and record the
    // (1-based) index of the last region that contains it
    for(int i = 0; i < pointSize; i++){
        for(int j = 0; j < regionSize; j++){
            if(x[i] > xmin[j] && x[i] < xmax[j] &&
               y[i] > ymin[j] && y[i] < ymax[j]){
                ans[i] = j + 1;
            }
        }
    }
    return ans;
}
')
findRegion <- function(points, regions){
    if(!all(c('x', 'y') %in% colnames(points))){
        stop('points must contain columns named \'x\' and \'y\'')
    }
    if(!all(c('xmin', 'xmax', 'ymin', 'ymax') %in% colnames(regions))){
        stop('regions must contain columns named \'xmin\', \'xmax\', \'ymin\' and \'ymax\'')
    }
    findInRegion(points[, 'x'], points[, 'y'],
                 regions[, 'xmin'], regions[, 'xmax'],
                 regions[, 'ymin'], regions[, 'ymax'])
}
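With the sample data from the question this can be called directly; given the strict inequalities it should return something like:
findRegion(points, regions)
# [1] NA  3 NA  2   (points 1 and 3 fall inside no region)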
One drawback of this function is that it assumes a point can only belong to one region...
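If a point may fall in several regions, one possible variant (just a sketch along the same lines, not benchmarked; findAllRegions is my own name) is to return a List with all matching indices per point:
cppFunction('
List findAllRegions(NumericVector x, NumericVector y,
                    NumericVector xmin, NumericVector xmax,
                    NumericVector ymin, NumericVector ymax){
    int pointSize = x.size();
    int regionSize = xmin.size();
    List ans(pointSize);
    for(int i = 0; i < pointSize; i++){
        // collect every (1-based) region index that contains point i
        std::vector<int> hits;
        for(int j = 0; j < regionSize; j++){
            if(x[i] > xmin[j] && x[i] < xmax[j] &&
               y[i] > ymin[j] && y[i] < ymax[j]){
                hits.push_back(j + 1);
            }
        }
        ans[i] = hits;
    }
    return ans;
}
')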
Here is another solution, using an R-tree index (a type of database index designed to store bounding boxes) with SQLite. It turns out to be slightly slower than Simon's (7 seconds), probably because the data is copied to disk.
# Sample data: data.frames, rather than matrices
regions <- data.frame(id=1:length(xmin), xmin, xmax, ymin, ymax)
points <- data.frame(x, y)
library(RSQLite)
con <- dbConnect(SQLite(), dbname = "/tmp/a.sqlite")
dbGetQuery( con, "CREATE VIRTUAL TABLE regions USING rtree (id, xmin, xmax, ymin, ymax)" )
dbWriteTable( con, "regions", regions, row.names = FALSE, append = TRUE )
dbWriteTable( con, "points", points, row.names = TRUE )
res <- dbGetQuery( con, "
SELECT points.row_names, regions.id
FROM points, regions
WHERE xmin <= x AND x <= xmax
AND ymin <= y AND y <= ymax
" )
This is a really interesting problem. I did some initial testing and this seems like it might be faster, but I really don't know how well it scales. I'd be interested if you could test on your real data and report some timings:
# Are X coords greater than xmin
lx <- outer( points[,1] , regions[,1] , ">" )
# Are X coords less than xmax
hx <- outer( points[,1] , regions[,2] , "<" )
# Ditto for Y coords
ly <- outer( points[,2] , regions[,3] , ">" )
hy <- outer( points[,2] , regions[,4] , "<" )
# These matrices for the X and Y coords have 1 if the coord is in range, 0 otherwise
inx <- lx * hx
iny <- ly * hy
# The final result matrix has 1 if both X and Y coords are in range and 0 if not
# Rows are points, columns are regions
res <- inx * iny
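If you want the same shape of answer as the apply() version in the question, one way (a small sketch) is to turn the 0/1 matrix into per-point region indices:
# For each point, the indices of the regions it falls in
# (returns a list when the number of matches varies between points)
matches <- apply(res, 1, function(r) which(r == 1))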
On data of 100,000 points and 100,000 regions this approach won't work unless you have a lot of RAM. However, I think it's pretty usable if you divide the regions up into chunks of around 1,000 each. On my desktop, 100,000 points and 1,000 regions took around 4.5 seconds:
Unit: seconds
expr min lq median uq max neval
eval(simon) 4.528942 4.55258 4.59848 4.607572 4.671511 5
As a rough guide to the magnitude of the difference in timings between your apply method and this one, with 10,000 points and 1,000 regions (based on 5 runs):
Unit: milliseconds
expr min lq median uq max neval
eval(simon) 394.7165 402.0919 403.0491 404.6943 428.7077 5
eval(OP) 1359.5889 1364.6308 1372.4980 1383.1327 1491.4628 5
And with 100,000 points and 1,000 regions (based on one run):
Unit: seconds
expr min lq median uq max neval
eval(simon) 4.352857 4.352857 4.352857 4.352857 4.352857 1
eval(OP) 14.027390 14.027390 14.027390 14.027390 14.027390 1
This is the code I used to generate sample data and to run the benchmark:
set.seed(4862)
xmin <- sample(25,1000,repl=T)
xmax <- xmin + sample(15,1000,repl=T)
ymin <- sample(25,1000,repl=T)
ymax <- ymin + sample(15,1000,repl=T)
regions <- cbind(xmin, xmax, ymin, ymax)
x <- sample(25,100000,repl=T)
y <- sample(25,100000,repl=T)
points <- cbind(x, y)
OP <- quote({ res <- apply(points, 1, function(x){
which(regions[,'xmin'] < x[1] & regions[,'xmax'] > x[1] & regions[,'ymin'] < x[2] & regions[,'ymax'] > x[2])
}) })
simon <- quote({
lx <- outer( points[,1] , regions[,1] , ">" )
hx <- outer( points[,1] , regions[,2] , "<" )
ly <- outer( points[,2] , regions[,3] , ">" )
hy <- outer( points[,2] , regions[,4] , "<" )
inx <- lx * hx
iny <- ly * hy
res <- inx * iny })
require(microbenchmark)
microbenchmark( eval(simon) , eval(OP) , times = 1L )
I'd recommend doing this in chunks. HTH.
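A minimal sketch of that chunking idea, assuming the regions are processed in blocks of 1,000 (chunk_size and the bookkeeping names are mine, not from the code above):
chunk_size <- 1000
matches <- vector("list", nrow(points))
for (start in seq(1, nrow(regions), by = chunk_size)) {
    idx <- start:min(start + chunk_size - 1, nrow(regions))
    sub <- regions[idx, , drop = FALSE]
    inx <- outer(points[, 1], sub[, 1], ">") * outer(points[, 1], sub[, 2], "<")
    iny <- outer(points[, 2], sub[, 3], ">") * outer(points[, 2], sub[, 4], "<")
    hit <- which(inx * iny == 1, arr.ind = TRUE)
    # map chunk-local column numbers back to global region indices
    for (k in seq_len(nrow(hit))) {
        matches[[hit[k, 1]]] <- c(matches[[hit[k, 1]]], idx[hit[k, 2]])
    }
}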