Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Poor Lucene in-memory spatial index performance

In my application there is a use case to find the closest point to some other geo point. I decided to use in-memory spatial index and found several candidates: jeospatial and Lucene spatial.

I made some benchmarks and was surprised to find out that Lucene index turned out to be very slow. Here is the code from benchmark which is done with JMH. The full source code can be found in my GitHub repository.

@State(Scope.Thread)
public class MyBenchmark {

    // Lucene
    private static final String COORDINATES_FIELD = "coordinates";
    private static final int GEO_PRECISION_LEVEL = 5;
    private static final double NEARBY_RADIUS_DEGREE = DistanceUtils.dist2Degrees(
            50, DistanceUtils.EARTH_MEAN_RADIUS_KM);

    private final Directory directory = new RAMDirectory();
    private final IndexWriterConfig iwConfig = new IndexWriterConfig();
    private IndexWriter indexWriter = null;
    private IndexSearcher indexSearcher = null;
    private final SpatialContext spatialCxt = SpatialContext.GEO;
    private final ShapeFactory shapeFactory = spatialCxt.getShapeFactory();
    private final SpatialStrategy coordinatesStrategy = new RecursivePrefixTreeStrategy(
            new GeohashPrefixTree(spatialCxt, GEO_PRECISION_LEVEL),
            COORDINATES_FIELD);

    // Jeospatial
    private VPTree<SimpleGeospatialPoint> jeospatialPoints = new VPTree<>();

    public MyBenchmark() {
        try {
            indexWriter = new IndexWriter(directory, iwConfig);
        } catch (IOException e) {
            e.printStackTrace();
        }
    }

    @Setup
    public void init() throws IOException {
        var r = new Random();
        for (int i = 0; i < 3000; i++) {
            double latitude = ThreadLocalRandom.current().nextDouble(50.4D, 51.4D);
            double longitude = ThreadLocalRandom.current().nextDouble(8.2D, 11.2D);

            Document doc = new Document();
            doc.add(new StoredField("id", r.nextInt()));
            var point = shapeFactory.pointXY(longitude, latitude);
            for (var field : coordinatesStrategy.createIndexableFields(point)) {
                doc.add(field);
            }
            doc.add(new StoredField(coordinatesStrategy.getFieldName(), latitude + ":" + longitude));
            indexWriter.addDocument(doc);

            jeospatialPoints.add(new MyGeospatialPoint(latitude, longitude));
        }
        indexWriter.forceMerge(1);
        indexWriter.close();
        final IndexReader indexReader = DirectoryReader.open(directory);
        indexSearcher = new IndexSearcher(indexReader);
    }

    private SimpleGeospatialPoint createRandomPoint() {
        final double latitude = ThreadLocalRandom.current().nextDouble(50.4D, 51.4D);
        final double longitude = ThreadLocalRandom.current().nextDouble(8.2D, 11.2D);
        return new MyGeospatialPoint(latitude, longitude);
    }

    @Benchmark
    @BenchmarkMode(Mode.Throughput)
    @OutputTimeUnit(TimeUnit.SECONDS)
    @Fork(value = 1)
    @Warmup(iterations = 0)
    @Measurement(iterations = 3)
    public void benchLucene() {
        double latitude = ThreadLocalRandom.current().nextDouble(50.4D, 51.4D);
        double longitude = ThreadLocalRandom.current().nextDouble(8.2D, 11.2D);
        final var spatialArgs = new SpatialArgs(SpatialOperation.IsWithin,
                                                shapeFactory.circle(longitude, latitude, NEARBY_RADIUS_DEGREE));
        final Query q = coordinatesStrategy.makeQuery(spatialArgs);
        try {
            final TopDocs topDocs = indexSearcher.search(q, 1);
            if (topDocs.totalHits == 0) {
                return;
            }
            var doc = indexSearcher.doc(topDocs.scoreDocs[0].doc);
            var coordinates = doc.getField(COORDINATES_FIELD).stringValue();
        } catch (IOException e) {
            e.printStackTrace();
        }
    }

    @Benchmark
    @BenchmarkMode(Mode.Throughput)
    @OutputTimeUnit(TimeUnit.SECONDS)
    @Fork(value = 1)
    @Warmup(iterations = 0)
    @Measurement(iterations = 3)
    public void benchJeospatial() {
        var neighbor = jeospatialPoints.getNearestNeighbor(createRandomPoint(), 50 * 1000);
        var n = neighbor.getLatitude();
    }
}

In Lucene I'm using RAMDirectory, but also tried MMapDirectory. Almost no difference.

The results of the benchmark:

# JMH version: 1.21
# VM version: JDK 10, Java HotSpot(TM) 64-Bit Server VM, 10+46
# VM invoker: /Library/Java/JavaVirtualMachines/jdk-10.jdk/Contents/Home/bin/java
# VM options: <none>
# Warmup: <none>
# Measurement: 3 iterations, 10 s each
# Timeout: 10 min per iteration
# Threads: 1 thread, will synchronize iterations
# Benchmark mode: Throughput, ops/time
# Benchmark: org.sample.MyBenchmark.benchJeospatial

# Run progress: 0,00% complete, ETA 00:01:00
# Fork: 1 of 1
Iteration   1: 77528,657 ops/s
Iteration   2: 81921,096 ops/s
Iteration   3: 83470,405 ops/s


Result "org.sample.MyBenchmark.benchJeospatial":
  80973,386 ±(99.9%) 56230,060 ops/s [Average]
  (min, avg, max) = (77528,657, 80973,386, 83470,405), stdev = 3082,159
  CI (99.9%): [24743,326, 137203,446] (assumes normal distribution)


# JMH version: 1.21
# VM version: JDK 10, Java HotSpot(TM) 64-Bit Server VM, 10+46
# VM invoker: /Library/Java/JavaVirtualMachines/jdk-10.jdk/Contents/Home/bin/java
# VM options: <none>
# Warmup: <none>
# Measurement: 3 iterations, 10 s each
# Timeout: 10 min per iteration
# Threads: 1 thread, will synchronize iterations
# Benchmark mode: Throughput, ops/time
# Benchmark: org.sample.MyBenchmark.benchLucene

# Run progress: 50,00% complete, ETA 00:00:31
# Fork: 1 of 1
Iteration   1: 997,103 ops/s
Iteration   2: 1087,487 ops/s
Iteration   3: 1077,964 ops/s


Result "org.sample.MyBenchmark.benchLucene":
  1054,184 ±(99.9%) 906,037 ops/s [Average]
  (min, avg, max) = (997,103, 1054,184, 1087,487), stdev = 49,663
  CI (99.9%): [148,147, 1960,221] (assumes normal distribution)


# Run complete. Total time: 00:01:03

REMEMBER: The numbers below are just data. To gain reusable insights, you need to follow up on
why the numbers are the way they are. Use profilers (see -prof, -lprof), design factorial
experiments, perform baseline and negative tests that provide experimental control, make sure
the benchmarking environment is safe on JVM/OS/HW level, ask for reviews from the domain experts.
Do not assume the numbers tell you what you want them to tell.

Benchmark                     Mode  Cnt      Score       Error  Units
MyBenchmark.benchJeospatial  thrpt    3  80973,386 ± 56230,060  ops/s
MyBenchmark.benchLucene      thrpt    3   1054,184 ±   906,037  ops/s

As you can see Jeospatial is ~75x faster. So I'm curious, if it's actually true or I just misconfigured Lucene somehow.

like image 776
Alexander Guz Avatar asked Sep 12 '18 19:09

Alexander Guz


1 Answers

Noticed this was posted almost a year ago. The following is as relevant now as it was then but with far better performance.

Don't use spatial-extras use LatLonPoint it's far more efficient and more straight-forward of an API.

Here's all you need:

// add your points to the document
doc.add(new LatLonPoint(fieldName, lat, lon));

// create your distance query
Query q = LatLonPoint.newDistanceQuery(fieldName, centerLat, centerLon, radiusMeters);

There are several reasons you're running into performance issues w/ spatial-extras (which uses Prefix Trees in the inverted index):

  1. For each point you are recursing a quad tree and adding a term to the inverted index for each level of the tree. This also means your spatial resolution (and performance) is at the mercy of the depth of the quadtree (e.g., GEO_PRECISION_LEVEL).
  2. makeQuery with a shapeFactory.circle isn't really a distance search. The circle is processed through the same quadtree decomposition to create a collection of terms (quad cells) that approximate the circle. Terms in the inverted index are then checked against the circle's raster using JTS.relate which is an extremely expensive operation.

LatLonPoint on the other hand creates a block KD tree structure instead of using the inverted index. It's a data structure designed for spatial and multi dimension numerics at scale. It's far more space and time efficient and performs better on very large data sets.

Hope this helps!

like image 167
nknize Avatar answered Nov 15 '22 19:11

nknize