I have a large spatial dataset (12M rows). The geometries are points on a map. For each row in the dataset, I'd like to find all the points that are within 500 meters of that point.
In R, using sf, I've been trying to do this by looping over each row in parallel, running st_buffer and st_intersects, and then saving the result as a list in a key-value format (the key being the origin point, the values being the neighbors).
The issue is that the dataset is too large. Even when parallelizing to upwards of 60 cores, the operation takes too long (>1 week and usually crashes).
What are the alternatives to this brute-force approach? Is it possible to build indexes using sf? Perhaps push the operation to an external database?
Reprex:
library(sf)
library(tidyverse)
library(parallel)
library(foreach)
# example data, transformed to a projected CRS (UTM 18N) so distances are in meters:
nc <- st_read(system.file("shape/nc.shp", package="sf")) %>% st_transform(32618)
# expand the data a bit to make the example more interesting:
nc <- rbind(nc,nc,nc)
nc <- nc %>% mutate(Id = row_number())
## can run in parallel if desired (also requires the doSNOW package):
# library(doSNOW)
# num_cores <- parallel::detectCores() - 2
# cl <- makeSOCKcluster(num_cores)
# registerDoSNOW(cl)
# or just run in sequence:
registerDoSEQ()
neighbors <- foreach(ii = 1:nrow(nc),
                     .verbose = FALSE,
                     .errorhandling = "pass") %dopar% {
  l <- 500 # 500 meters
  # isolate the row as the origin point:
  row_interest <- filter(nc, row_number() == ii)
  # create the buffer:
  buffer <- row_interest %>% st_buffer(dist = l)
  # extract the row numbers of the neighbors:
  comps_idx <- suppressMessages(st_intersects(buffer, nc))[[1]]
  # get all the neighbors:
  comps <- nc %>% filter(row_number() %in% comps_idx)
  # remove the geometry:
  comps <- comps %>% st_set_geometry(NULL)
  # flow control in case there are no neighbors:
  if (nrow(comps) > 0) {
    comps$Origin_Key <- row_interest$Id
  } else {
    comps <- tibble("lat" = NA_real_, "lon" = NA_real_, "Id" = row_interest$Id)
    comps$Origin_Key <- row_interest$Id
  }
  return(comps)
}
closeAllConnections()
length(neighbors)==nrow(nc)
[1] TRUE
When working with sf objects, explicitly looping over features to perform binary operations such as intersects is usually counterproductive (see also How can I speed up spatial operations in `dplyr::mutate()`?). An approach similar to yours (i.e., buffering and intersecting), but without the explicit for loop, works better.
Let's see how it performs on a reasonably big dataset of 50000 points:
library(sf)
library(spdep)
pts <- data.frame(x = runif(50000, 0, 100000),
                  y = runif(50000, 0, 100000))
pts <- sf::st_as_sf(pts, coords = c("x", "y"), remove = F)
pts_buf <- sf::st_buffer(pts, 5000)
coords <- sf::st_coordinates(pts)
microbenchmark::microbenchmark(
  sf_int = {int <- sf::st_intersects(pts_buf, pts)},
  spdep  = {x <- spdep::dnearneigh(coords, 0, 5000)},
  times = 1)
#> Unit: seconds
#> expr min lq mean median uq max neval
#> sf_int 21.56186 21.56186 21.56186 21.56186 21.56186 21.56186 1
#> spdep 108.89683 108.89683 108.89683 108.89683 108.89683 108.89683 1
You can see here that the st_intersects approach is about 5 times faster than the dnearneigh one.
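(As a side note, sf also provides st_is_within_distance(), which returns the same sparse list-of-indices format without materializing the buffers. I have not benchmarked it here and its performance varies across sf versions, so treat the snippet below as something to test on a subset rather than a guaranteed improvement.)
# possible alternative to buffering + intersecting; same sparse index-list output,
# but benchmark it on your sf version before relying on it
int_wd <- sf::st_is_within_distance(pts, pts, dist = 5000)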
Unfortunately, this is unlikely to solve your problem. Looking at execution times for datasets of different sizes we get:
subs <- c(1000, 3000, 5000, 10000, 15000, 30000, 50000)
times <- NULL
for (sub in subs) {
  pts_sub <- pts[1:sub, ]
  buf_sub <- pts_buf[1:sub, ]
  t0 <- Sys.time()
  int <- sf::st_intersects(buf_sub, pts_sub)
  times <- cbind(times, as.numeric(difftime(Sys.time(), t0, units = "secs")))
}
plot(subs, times)
times <- as.numeric(times)
reg <- lm(times~subs+I(subs^2))
summary(reg)
#>
#> Call:
#> lm(formula = times ~ subs + I(subs^2))
#>
#> Residuals:
#> 1 2 3 4 5 6 7
#> -0.16680 -0.02686 0.03808 0.21431 0.10824 -0.23193 0.06496
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 2.429e-01 1.371e-01 1.772 0.151
#> subs -2.388e-05 1.717e-05 -1.391 0.237
#> I(subs^2) 8.986e-09 3.317e-10 27.087 1.1e-05 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 0.1908 on 4 degrees of freedom
#> Multiple R-squared: 0.9996, Adjusted R-squared: 0.9994
#> F-statistic: 5110 on 2 and 4 DF, p-value: 1.531e-07
Here, we see an almost perfect quadratic relationship between time and number of points (as would be expected). On a 10M-point subset, assuming that the behaviour does not change, you would get:
predict(reg, newdata = data.frame(subs = 10E6))
#> 1
#> 898355.4
That corresponds to about 10 days, assuming that the trend stays the same when further increasing the number of points (but the same would happen for dnearneigh...).
My suggestion would be to "split" your points in chunks and then work on a per-split basis.
You could, for example, order your points along the x-axis at the beginning and then quickly extract the subsets of buffers and of points to compare them against using data.table.
Clearly, the subset of "points" needs to cover a slightly larger extent than the subset of "buffers", expanded by the comparison distance. So, for example, if you take the subset of pts_buf with centroids in [50000, 55000] along x, the corresponding subset of pts should include all points with x in the range [49500, 55500].
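A minimal sketch of that windowing idea, using the pts object built above (which kept its x coordinate as a column thanks to remove = F) and the 500 m distance from the question. Note that the indices returned by st_intersects() refer to positions within pts_sub, so in practice you would carry an id column along to map them back to the original rows:
maxdist <- 500
xlo <- 50000; xhi <- 55000   # one slab along the x axis
# buffers to process: points whose x falls inside the slab
buf_sub <- sf::st_buffer(pts[pts$x >= xlo & pts$x < xhi, ], maxdist)
# points to compare against: slab expanded by the comparison distance
pts_sub <- pts[pts$x >= xlo - maxdist & pts$x < xhi + maxdist, ]
int_sub <- sf::st_intersects(buf_sub, pts_sub)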
This approach is easily parallelizable by assigning the different subsets to different cores in a foreach or similar construct.
I do not even know if using spatial objects/operations is beneficial here: once we have the coordinates, all that is needed is computing and subsetting euclidean distances. I suspect that a carefully coded brute-force data.table-based approach could also be a feasible solution.
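For what it's worth, here is a rough sketch of that data.table idea (untested at the 12M scale; the names dt, win, origin, neighbour, x0, y0 are just illustrative): a non-equi join on per-point bounding boxes yields candidate pairs, and a plain euclidean-distance filter then keeps the true neighbours.
library(data.table)
maxdist <- 500
dt  <- data.table(id = seq_len(nrow(pts)), x = pts$x, y = pts$y)
# per-point search window (bounding box of the 500 m circle)
win <- dt[, .(id, x0 = x, y0 = y,
              xmin = x - maxdist, xmax = x + maxdist,
              ymin = y - maxdist, ymax = y + maxdist)]
# candidate pairs: all points falling inside each point's window
cand <- dt[win,
           on = .(x >= xmin, x <= xmax, y >= ymin, y <= ymax),
           .(origin = i.id, neighbour = x.id,
             d = sqrt((x.x - i.x0)^2 + (x.y - i.y0)^2)),
           allow.cartesian = TRUE]
# keep only the pairs within the circular distance
nbrs <- cand[d <= maxdist]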
HTH!
UPDATE
In the end, I decided to give it a go and see how much speed we could gain from this kind of approach. Here is a possible implementation:
points_in_distance_parallel <- function(in_pts,
                                        maxdist,
                                        ncuts = 10) {
  require(doParallel)
  require(foreach)
  require(data.table)
  require(sf)
  # convert points to data.table and create a unique identifier
  pts <- data.table(in_pts)
  pts <- pts[, or_id := 1:dim(in_pts)[1]]
  # divide the extent into ncuts*ncuts quadrants and assign each point to a
  # quadrant, then create the index over "xcut"
  range_x  <- range(pts$x)
  limits_x <- range_x[1] + (0:ncuts)*(range_x[2] - range_x[1])/ncuts
  range_y  <- range(pts$y)
  limits_y <- range_y[1] + (0:ncuts)*(range_y[2] - range_y[1])/ncuts
  pts[, `:=`(xcut = as.integer(cut(x, ncuts, labels = 1:ncuts)),
             ycut = as.integer(cut(y, ncuts, labels = 1:ncuts)))] %>%
    setkey(xcut, ycut)
  results <- list()
  cl <- parallel::makeCluster(parallel::detectCores() - 2,
                              type = ifelse(.Platform$OS.type != "windows",
                                            "FORK", "PSOCK"))
  doParallel::registerDoParallel(cl)
  # start cycling over quadrants
  out <- foreach(cutx = seq_len(ncuts),
                 .packages = c("sf", "data.table")) %dopar% {
    count <- 0
    # get the points included in an x-slice extended by `maxdist`, and build
    # an index over y
    min_x_comp <- ifelse(cutx == 1, limits_x[cutx], (limits_x[cutx] - maxdist))
    max_x_comp <- ifelse(cutx == ncuts,
                         limits_x[cutx + 1],
                         (limits_x[cutx + 1] + maxdist))
    subpts_x <- pts[x >= min_x_comp & x < max_x_comp] %>%
      setkey(y)
    for (cuty in seq_len(ncuts)) {
      count <- count + 1
      # subset over subpts_x to find the final set of points needed for the
      # comparisons
      min_y_comp <- ifelse(cuty == 1,
                           limits_y[cuty],
                           (limits_y[cuty] - maxdist))
      max_y_comp <- ifelse(cuty == ncuts,
                           limits_y[cuty + 1],
                           (limits_y[cuty + 1] + maxdist))
      subpts_comp <- subpts_x[y >= min_y_comp & y < max_y_comp]
      # subset over subpts_comp to get the points included in an x/y chunk,
      # whose "neighbours" we want to find. Then buffer them.
      subpts_buf <- subpts_comp[ycut == cuty & xcut == cutx] %>%
        sf::st_as_sf() %>%
        st_buffer(maxdist)
      # retransform to sf since data.tables lost the geometric attributes
      subpts_comp <- sf::st_as_sf(subpts_comp)
      # compute the intersection and save results in an element of "results".
      # For each point, save its "or_id" and the "or_ids" of the points within "maxdist"
      inters <- sf::st_intersects(subpts_buf, subpts_comp)
      # save results
      results[[count]] <- data.table(
        id = subpts_buf$or_id,
        int_ids = lapply(inters, FUN = function(x) subpts_comp$or_id[x]))
    }
    return(data.table::rbindlist(results))
  }
  parallel::stopCluster(cl)
  data.table::rbindlist(out)
}
The function takes as input a points sf object, a target distance and the number of "cuts" to use to divide the extent into quadrants, and returns a data frame in which, for each original point, the "ids" of the points within maxdist are reported in the int_ids list column.
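As a usage example (assuming the 50000-point pts object built earlier, which kept its x/y columns thanks to remove = F):
nbr_tbl <- points_in_distance_parallel(pts, maxdist = 5000, ncuts = 10)
head(nbr_tbl)
# nbr_tbl$int_ids is a list column: nbr_tbl$int_ids[[k]] contains the or_id
# values of the points lying within maxdist of the point with id nbr_tbl$id[k]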
On a test dataset with a varying number of uniformly distributed points and two values of maxdist, I got these kinds of results (the "parallel" run is done using 6 cores):
So, here we already get a 5-6X speed improvement with the "serial" implementation, and another 5X thanks to parallelization over 6 cores. Although the timings shown here are merely indicative and tied to the particular test dataset we built (on a less uniformly distributed dataset I would expect a lower speed improvement), I think this is quite good.
HTH!
PS: a more thorough analysis can be found here:
https://lbusettspatialr.blogspot.it/2018/02/speeding-up-spatial-analyses-by.html
I have two alternatives, one that seems faster, and one that is not. The faster method may not be amenable for parallelization, unfortunately, and so it may not help.
library(sf)
nc <- st_transform(st_read(system.file("shape/nc.shp", package="sf")), 32618)
# create points
pts <- st_centroid(nc)
dis <- 50000
result <- list()
Your approach
system.time(
  for (i in 1:nrow(pts)) {
    b <- st_buffer(pts[i,], dist = dis)
    result[[i]] <- st_intersects(b, nc)[[1]]
  }
)
Slower alternative
system.time(
  for (i in 1:nrow(pts)) {
    b <- as.vector(st_distance(pts[i,], pts))
    result[[i]] <- which(b <= dis)
  }
)
For smaller datasets, without looping:
x <- st_distance(pts)
res <- apply(x, 1, function(i) which(i < dis))
Faster alternative (not obvious how to do in parallel), and perhaps an unfair comparison as we do not do the looping ourselves
library(spdep)
pts2 <- st_coordinates(pts)
system.time(x <- dnearneigh(pts2, 0, dis))
I would first get a list with the indices that indicate the neighbors, and extract the attributes after that (that should be fast).
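One possible sketch for that last step (assuming res holds the list of neighbour indices, e.g. the result list from the loops above or the nb object returned by dnearneigh, where a lone 0 means "no neighbours"):
att <- sf::st_drop_geometry(pts)
neighbour_attrs <- do.call(rbind, lapply(seq_along(res), function(i) {
  idx <- setdiff(res[[i]], 0)   # drop spdep's "no neighbour" marker
  if (length(idx) == 0) return(NULL)
  data.frame(origin = i, att[idx, , drop = FALSE])
}))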