Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Lucene spatial, accuracy

Tags:

java

lucene

I'm following the example in "Lucene in Action", pages 308-315, which describes Lucene Spatial. I'm using lucene 2.9.4. I've used http://geocoder.us/service/distance endpoint to calculate the distance between some locations, and then written unit tests to verify that the index can find locations within a given radius.

I'm wondering how accurate I can expect lucene to be. For example, if I give radius 10.0 and the distance between my lat/lon points is 9.99 miles, will it be able to find this location in all cases?

The thing that is prompting this question is that I've found the search to be very accurate for small radius values (e.g. 10.0 or less) and inaccurate for larger values (r=25.0 for example).

Is there anything I might be doing wrong? Is it possible that the searcher will select a tier that does not have all the lat/longs for a given radius? My understanding was that it selects the smallest tier that is guaranteed to have all the points within the radius, i.e. the tier algorigthm is just an optimization.

EDIT: Also I found this: https://issues.apache.org/jira/browse/LUCENE-2519 and the apparently fixed code here: http://code.google.com/p/spatial-search-lucene/source/browse/trunk/src/main/java/org/apache/lucene/spatial/tier/projection/SinusoidalProjector.java?r=38, but when I patched my code to use the fixed SinusoidalProjector my index returns zero ads in all cases.

And this does not give me much confidence:

http://www.lucidimagination.com/blog/2010/07/20/update-spatial-search-in-apache-lucene-and-solr/

http://www.lucidimagination.com/search/document/c32e81783642df47/spatial_rethinking_cartesian_tiers_implementation#c32e81783642df47

It seems to indicate that hacks exist throughout the code and simply patching SinusoidalProjector is not enough.

like image 220
Kevin Avatar asked Jul 12 '11 16:07

Kevin


1 Answers

I've spent some time looking at the source code, and I think I understand what's going wrong. Firstly, I made a bad assumption that the distances computed by geocoder.us would be the same as what lucene internally calcuates as distances between points. The values are close, but not exact. So I switched to compute distances between lat/lon pairs by calling lucene's

double distance = DistanceUtils.getInstance().getDistanceMi(lat1,lon1,lat2,lon2);

Next I dug into DistanceQueryBuilder class http://grepcode.com/file/repo1.maven.org/maven2/org.apache.lucene/lucene-spatial/2.9.4/org/apache/lucene/spatial/tier/DistanceQueryBuilder.java?av=f, which I think has a bug in it.

It calculates the bounding box for the purpose of fetching cartesian tiers like this:

CartesianPolyFilterBuilder cpf = new CartesianPolyFilterBuilder(tierFieldPrefix);
Filter cartesianFilter = cpf.getBoundingArea(lat, lng, miles);

And it's pretty clear by looking at LLRect.createBox http://grepcode.com/file/repo1.maven.org/maven2/org.apache.lucene/lucene-spatial/2.9.4/org/apache/lucene/spatial/geometry/shape/LLRect.java#LLRect.createBox%28org.apache.lucene.spatial.geometry.LatLng%2Cdouble%2Cdouble%29 that the third parameter to getBoudningArea is going to be treated as the full width/height of a bounding box. So passing the radius value results in a bounding box that is too small.

The fix was to provide an alternate version of DistanceQueryBuilder that does this:

Filter cartesianFilter = cpf.getBoundingArea(lat,lng,miles*2);

Which seems to work. I'm still convinced that DistanceApproximation http://grepcode.com/file/repo1.maven.org/maven2/org.apache.lucene/lucene-spatial/2.9.4/org/apache/lucene/spatial/geometry/shape/DistanceApproximation.java#DistanceApproximation.getMilesPerLngDeg%28double%29 is broken though, because it seems that the following operations should be reversible, and they're not:

// similar to implementation of DistanceUtils.getBoundary():
double milesPerLng = DistanceApproximation.getMilesPerLngDeg(lat);
double milesPerLat = DistanceApproximation.getMilesperLatDeg();


double lngDelta = radius / milesPerLng;
double latDelta = radius / milesPerLat;

// Now it seems like this should be roughly true:
assertEquals(radius, DistanceUtils.getInstance().getDistanceMi(lat,lng,lat,lng+lngDelta));
assertEquals(radius, DistanceUtils.getInstance().getDistanceMi(lat,lng,lat+latDelta,lng));

But it's not. For example when the above code is given lat=34, lng=-118, and radius=25, (and instead of asserting I just print the results), I get:

Lng delta: 0.36142327178505024, dist: 20.725929003138496
Lat delta: 0.4359569489852007, dist: 30.155567734407825

I'm guessing that the code works only because the cartesian tiers that are chosen after picking a bounding box will result in an area somewhat larger than the bounding box. But I don't think that's going to be guaranteed.

I'm hoping that someone with more knowledge of this can comment, because these are just observations after digging into the code for an afternoon. I did notice that what looks like the most recent code for lucene spatial is on googlecode at: http://code.google.com/p/spatial-search-lucene/, and it seems that the implementation has changed significantly, but I didn't dig too deeply into the details.

like image 102
Kevin Avatar answered Oct 17 '22 14:10

Kevin