I have approximately 400,000 documents in a GAE Search index. All documents have a location
GeoPoint
property and are spread over the entire globe. Some documents might be over 4000km away from any other document, others might be bunched within meters of each other.
I would like to find the closest document to a specific set of coordinates but find the following code gives incorrect results:
from google.appengine.api import search
# coords are in the form of a tuple e.g. (50.123, 1.123)
search.Document(
doc_id='meaningful-unique-id',
fields=[search.GeoField(name='location'
value=search.GeoPoint(coords[0], coords[1]))])
# find document function radius is in metres
def find_document(coords, radius=1000000):
sort_expr = search.SortExpression(
expression='distance(location, geopoint(%.3f, %.3f))' % coords,
direction=search.SortExpression.ASCENDING,
default_value=0)
search_query = search.Query(
query_string='distance(location, geopoint(%.3f, %.3f)) < %d' \
% (coords[0], coords[1], radius),
options=search.QueryOptions(
limit=1,
ids_only=True,
sort_options=search.SortOptions(expressions=[sort_expr])))
index = search.Index(name='document-index')
return index.search(search_query)
With this code I will get results that are consistent but incorrect. For example, a search for the nearest document to London indicated the closest one was in Scotland. I have verified that there are thousands of closer documents.
I narrowed the problem down to the radius
parameter being too large. I get correct results if the radius is down to around 12km (radius=12000
). There are generally no more than 1000 documents in a 12 km radius. (Probably associated with search.SortOptions(limit=1000)
.)
The problem is that if I am in a sparse area of the globe where there aren't any documents for thousands of miles, my search function will not return anything with radius=12000
(12km). I want it to return the closest document to me wherever I am. How can I accomplish this consistently with one call to the Search API?
Starting with GAEDownload and install the Cloud SDK. Authorize the SDK tools to access the Google Cloud Platform using your google account credentials. Use the gcloud init command.
Public APIs and services are globally visible across all Google Cloud projects. Examples of public Google Cloud APIs and services include: BigQuery API. Cloud Storage.
Google App Engine (GAE) is a platform-as-a-service product that provides web app developers and enterprises with access to Google's scalable hosting and tier 1 internet service. GAE requires that applications be written in Java or Python, store data in Google Bigtable and use the Google query language.
I believe the issue is the following. Your query will select up to 10K documents, then those are sorted according to your distance sort expression and returned. (That is, the sort is in fact not over all 400k documents.) So I suspect that some of the geographically closer points are not included in this 10k selection. That's why things work better when you narrow your search radius, as you have fewer total points in that radius.
Essentially, you want to get your query 'hits' down to 10k, in a manner that makes sense for what you are querying on. You can address this in at least a couple of ways, which you can combine:
(I don't believe this 10k threshold is currently in the Search API documentation; I've filed a ticket to get it added).
I have the exact same problem, and I don't think its possible. The problem happens as you yourself has figured out when there are more possible results than returned results. The Google algorithm just quits when it has loaded the limits and then it sorts the results.
I have seen the same clusters as you and its part of the search API.
One Hack would be to subdivide your search into sub-sectors, do multiple simultaneous calls and then merge and order the results.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With