In my application there is a use case to find the closest point to some other geo point. I decided to use in-memory spatial index and found several candidates: jeospatial and Lucene spatial.
I made some benchmarks and was surprised to find out that Lucene index turned out to be very slow. Here is the code from benchmark which is done with JMH. The full source code can be found in my GitHub repository.
@State(Scope.Thread)
public class MyBenchmark {
// Lucene
private static final String COORDINATES_FIELD = "coordinates";
private static final int GEO_PRECISION_LEVEL = 5;
private static final double NEARBY_RADIUS_DEGREE = DistanceUtils.dist2Degrees(
50, DistanceUtils.EARTH_MEAN_RADIUS_KM);
private final Directory directory = new RAMDirectory();
private final IndexWriterConfig iwConfig = new IndexWriterConfig();
private IndexWriter indexWriter = null;
private IndexSearcher indexSearcher = null;
private final SpatialContext spatialCxt = SpatialContext.GEO;
private final ShapeFactory shapeFactory = spatialCxt.getShapeFactory();
private final SpatialStrategy coordinatesStrategy = new RecursivePrefixTreeStrategy(
new GeohashPrefixTree(spatialCxt, GEO_PRECISION_LEVEL),
COORDINATES_FIELD);
// Jeospatial
private VPTree<SimpleGeospatialPoint> jeospatialPoints = new VPTree<>();
public MyBenchmark() {
try {
indexWriter = new IndexWriter(directory, iwConfig);
} catch (IOException e) {
e.printStackTrace();
}
}
@Setup
public void init() throws IOException {
var r = new Random();
for (int i = 0; i < 3000; i++) {
double latitude = ThreadLocalRandom.current().nextDouble(50.4D, 51.4D);
double longitude = ThreadLocalRandom.current().nextDouble(8.2D, 11.2D);
Document doc = new Document();
doc.add(new StoredField("id", r.nextInt()));
var point = shapeFactory.pointXY(longitude, latitude);
for (var field : coordinatesStrategy.createIndexableFields(point)) {
doc.add(field);
}
doc.add(new StoredField(coordinatesStrategy.getFieldName(), latitude + ":" + longitude));
indexWriter.addDocument(doc);
jeospatialPoints.add(new MyGeospatialPoint(latitude, longitude));
}
indexWriter.forceMerge(1);
indexWriter.close();
final IndexReader indexReader = DirectoryReader.open(directory);
indexSearcher = new IndexSearcher(indexReader);
}
private SimpleGeospatialPoint createRandomPoint() {
final double latitude = ThreadLocalRandom.current().nextDouble(50.4D, 51.4D);
final double longitude = ThreadLocalRandom.current().nextDouble(8.2D, 11.2D);
return new MyGeospatialPoint(latitude, longitude);
}
@Benchmark
@BenchmarkMode(Mode.Throughput)
@OutputTimeUnit(TimeUnit.SECONDS)
@Fork(value = 1)
@Warmup(iterations = 0)
@Measurement(iterations = 3)
public void benchLucene() {
double latitude = ThreadLocalRandom.current().nextDouble(50.4D, 51.4D);
double longitude = ThreadLocalRandom.current().nextDouble(8.2D, 11.2D);
final var spatialArgs = new SpatialArgs(SpatialOperation.IsWithin,
shapeFactory.circle(longitude, latitude, NEARBY_RADIUS_DEGREE));
final Query q = coordinatesStrategy.makeQuery(spatialArgs);
try {
final TopDocs topDocs = indexSearcher.search(q, 1);
if (topDocs.totalHits == 0) {
return;
}
var doc = indexSearcher.doc(topDocs.scoreDocs[0].doc);
var coordinates = doc.getField(COORDINATES_FIELD).stringValue();
} catch (IOException e) {
e.printStackTrace();
}
}
@Benchmark
@BenchmarkMode(Mode.Throughput)
@OutputTimeUnit(TimeUnit.SECONDS)
@Fork(value = 1)
@Warmup(iterations = 0)
@Measurement(iterations = 3)
public void benchJeospatial() {
var neighbor = jeospatialPoints.getNearestNeighbor(createRandomPoint(), 50 * 1000);
var n = neighbor.getLatitude();
}
}
In Lucene I'm using RAMDirectory
, but also tried MMapDirectory
. Almost no difference.
The results of the benchmark:
# JMH version: 1.21
# VM version: JDK 10, Java HotSpot(TM) 64-Bit Server VM, 10+46
# VM invoker: /Library/Java/JavaVirtualMachines/jdk-10.jdk/Contents/Home/bin/java
# VM options: <none>
# Warmup: <none>
# Measurement: 3 iterations, 10 s each
# Timeout: 10 min per iteration
# Threads: 1 thread, will synchronize iterations
# Benchmark mode: Throughput, ops/time
# Benchmark: org.sample.MyBenchmark.benchJeospatial
# Run progress: 0,00% complete, ETA 00:01:00
# Fork: 1 of 1
Iteration 1: 77528,657 ops/s
Iteration 2: 81921,096 ops/s
Iteration 3: 83470,405 ops/s
Result "org.sample.MyBenchmark.benchJeospatial":
80973,386 ±(99.9%) 56230,060 ops/s [Average]
(min, avg, max) = (77528,657, 80973,386, 83470,405), stdev = 3082,159
CI (99.9%): [24743,326, 137203,446] (assumes normal distribution)
# JMH version: 1.21
# VM version: JDK 10, Java HotSpot(TM) 64-Bit Server VM, 10+46
# VM invoker: /Library/Java/JavaVirtualMachines/jdk-10.jdk/Contents/Home/bin/java
# VM options: <none>
# Warmup: <none>
# Measurement: 3 iterations, 10 s each
# Timeout: 10 min per iteration
# Threads: 1 thread, will synchronize iterations
# Benchmark mode: Throughput, ops/time
# Benchmark: org.sample.MyBenchmark.benchLucene
# Run progress: 50,00% complete, ETA 00:00:31
# Fork: 1 of 1
Iteration 1: 997,103 ops/s
Iteration 2: 1087,487 ops/s
Iteration 3: 1077,964 ops/s
Result "org.sample.MyBenchmark.benchLucene":
1054,184 ±(99.9%) 906,037 ops/s [Average]
(min, avg, max) = (997,103, 1054,184, 1087,487), stdev = 49,663
CI (99.9%): [148,147, 1960,221] (assumes normal distribution)
# Run complete. Total time: 00:01:03
REMEMBER: The numbers below are just data. To gain reusable insights, you need to follow up on
why the numbers are the way they are. Use profilers (see -prof, -lprof), design factorial
experiments, perform baseline and negative tests that provide experimental control, make sure
the benchmarking environment is safe on JVM/OS/HW level, ask for reviews from the domain experts.
Do not assume the numbers tell you what you want them to tell.
Benchmark Mode Cnt Score Error Units
MyBenchmark.benchJeospatial thrpt 3 80973,386 ± 56230,060 ops/s
MyBenchmark.benchLucene thrpt 3 1054,184 ± 906,037 ops/s
As you can see Jeospatial is ~75x faster. So I'm curious, if it's actually true or I just misconfigured Lucene somehow.
Noticed this was posted almost a year ago. The following is as relevant now as it was then but with far better performance.
Don't use spatial-extras
use LatLonPoint
it's far more efficient and more straight-forward of an API.
Here's all you need:
// add your points to the document
doc.add(new LatLonPoint(fieldName, lat, lon));
// create your distance query
Query q = LatLonPoint.newDistanceQuery(fieldName, centerLat, centerLon, radiusMeters);
There are several reasons you're running into performance issues w/ spatial-extras (which uses Prefix Trees in the inverted index):
GEO_PRECISION_LEVEL
).makeQuery
with a shapeFactory.circle
isn't really a distance search. The circle is processed through the same quadtree decomposition to create a collection of terms (quad cells) that approximate the circle. Terms in the inverted index are then checked against the circle's raster using JTS.relate
which is an extremely expensive operation.LatLonPoint
on the other hand creates a block KD tree structure instead of using the inverted index. It's a data structure designed for spatial and multi dimension numerics at scale. It's far more space and time efficient and performs better on very large data sets.
Hope this helps!
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With