Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Using full text search with geospatial index on Mongodb

Let's say I want to develop an android app that allows a user to search a hotel that is closest to where you are located. This is very common on apps nowadays, like AirBnb for example.

This is the dataset I'm using:

{
    "name" : "The Most Amazing Hotel",
    "city" : "India",
    "type": "Point"
    "coord": [
        -56.16082,
        61.15392
      ]
}

{
    "name" : "The Most Incredible Hotel",
    "city" : "India",
    "type": "Point"
    "coord": [
        -56.56285,
        61.34590
      ]
}

{
    "name" : "The Fantastic GuestHouse",
    "city" : "India",
    "type": "Point"
    "coord": [
        -56.47085,
        61.11357
      ]
}

Now, I want to create a text index on the name field so that it searches by name and then sort by a geospatial index based on the coordinates.

So if I search for the words "The Most", it will search by the name for the words "The Most" and return the closest hotels with the words "The Most in them.

Does mongodb even support this type of search?

I'm reading the guidance for mongodb here: https://docs.mongodb.org/manual/core/index-text/

A compound text index cannot include any other special index types, such as multi-key or geospatial index fields.

As far as I understand, I'm not creating a compound text index. This is a simple text index which means I'm only indexing the text for the name field and not for the city AND name fields.

like image 932
Simon Avatar asked Nov 17 '15 19:11

Simon


People also ask

Can MongoDB do full-text search?

MongoDB offers a full-text search solution, MongoDB Atlas Search, for data hosted on MongoDB Atlas.

Does MongoDB support geospatial index?

MongoDB provides the following geospatial index types to support the geospatial queries.

How do I search text in MongoDB?

Use the $text query operator to perform text searches on a collection with a text index. $text will tokenize the search string using whitespace and most punctuation as delimiters, and perform a logical OR of all such tokens in the search string.

What are geospatial indexes in MongoDB?

MongoDB's geospatial indexing allows you to efficiently execute spatial queries on a collection that contains geospatial shapes and points.


1 Answers

There is a fair case that you really do not need this at all, as it is very hard to justify a use case for such an operation, and I would argue that "Searching for a Hotel" is not something where a combination of "text" and "geoSpatial" search really apply.

In reality "most people" would be looking for something close to a location, or even more likely close to various locations they want to visit, as part of their primary criteria, and then other "winners" would likely be greater weighted to "cost", "rating", "brand", "facilities", and likely even proximity to eateries etc.

Adding "Text search" to that list is a very different thing and likely not of much real use in this particular application.

Still, this probably deserves some explanation, and there are a few concepts to understand here as to why the two concepts don't really "mesh" for this use case at least.

Fixing Schema

Firstly, I'd like to make a suggestion to "tweak" your data schema a little:

{
    "name" : "The Most Amazing Hotel",
    "city" : "India",
    "location": {
        "type": "Point",
        "coordinates": [
               72.867804,
               19.076033
        ]
    }
}

That at least provies "location" as a valid GeoJSON Object for indexing, and you generally want GeoJSON rather than legacy co-ordinate pairs, as it does open up more options for query and storage in general, plus distances are standardized to meters rather than the equated "radians" around the globe.

Why they don't work together

So your reading is basically correct in that you cannot use more than one special index at once. First look at the compound index definition:

db.hotels.createIndex({ "name": "text", "location": "2dsphere" })

{ "ok" : 0, "errmsg" : "bad index key pattern { name: \"text\", location: \"2dsphere\" }: Can't use more than one index plugin for a single index.", "code" : 67 }

So that cannot be done. Even considering seperately:

db.hotels.createIndex({ "name": "text" })
db.hotels.createIndex({ "location": "2dsphere" })

Then try doing a query:

db.hotels.find({
    "location": {
        "$nearSphere": {
            "$geometry": {
                "type": "Point",
                "coordinates": [
                   72.867804,
                   19.076033
                ]
            }
        }
    },
    "$text": { "$search": "Amazing" }
})

Error: command failed: { "waitedMS" : NumberLong(0), "ok" : 0, "errmsg" : "text and geoNear not allowed in same query", "code" : 2 } : undefined

Which actually backs up the reasons why this could not be defined in a compound index in three ways:

  1. As the initial error indicates, the way these "special" indexes are handled in MongoDB requires essentially "branching off" to the "special" handler for the selected index type, and the two handlers do not live in the same place.

  2. Even with separate indexes, since the logic is basically an "and" condition, MongoDB cannot acutally select more than one index anyway, and since both query clauses require "special" handling it would in fact be required to do so. And it cannot.

  3. Even if this were logically an $or condition, you basically end back at point 1, where even applying "index intersection" there is another property of such "special" indexes that they must be applied at the "top level" of the query operations in order to allow index selection. Wrapping these in an $or means MongoDB cannot do that and therefore it is not allowed.

But you can "Cheat"

So each basically has to be exclusive, and you cannot use them together. But of course you can always "cheat", depending on which order of search is more important to you.

Either by "location" first:

db.hotels.aggregate([
    { "$geoNear": {
        "near": {
            "type": "Point",
            "coordinates": [
               72.867804,
               19.076033
            ]
        },
        "spherical": true,
        "maxDistance": 5000,
        "distanceField": "distance",
        "query": {
           "name": /Amazing/
        }
    }}
])

Or even:

db.hotels.find({
    "location": {
        "$nearSphere": {
            "$geometry": {
                "type": "Point",
                "coordinates": [
                   72.867804,
                   19.076033
                ]
            },
            "$maxDistance": 5000
        }
    },
    "name": /Amazing/
})

Or by text search first:

db.hotels.find({
    "$text": { "$search": "Amazing" },
    "location": {
        "$geoWithin": {
            "$centerSphere": [[
               72.867804,
               19.076033
            ], 5000 ]
        }
    }
})

Now you can take a close look at the selection options in each approach with .explain() to see what is happening, but the basic case is that each selects only one of the special indexes to use respectively.

In the first case it will be the geoSpatial index on the collection that is used for the primary and will find results based on their proximity to the location given first and then filter by the Regular Expression argument given for the name field.

In the second case it will use the "text" index to do the primary selection ( therefore find things "Amazing" first ) and from those results apply a geoSpatial filter ( not using an index ) with $geoWithin, which in this case is performing what is basically the equivalent of of what a $near is doing, by searching within a circle around a point within the supplied distance to filter results there.

Not "all" Queries are Equal

The key thing to consider though is that it is very possible for each approach to return different results. By narrowing down on location first, the only data that can be inspected are those locations within the specified distance, so anything that is "Amazing" outside of the distance would never be considered by the additional filter.

In the second case, since the text term is the primary search, then all results of "Amazing" are put into consideration, and the only items that can be returned by the secondary filter are those that were allowed to be returned from the initial text filter.

This is very important in the overall consideration as the two query operations ( both "text" and "geoSpatial" ) strive to achieve very different things. In the "text" case it is looking for "top results" to the term given, and will by nature only return a limitted number of results matching the term in ranked order. This means that when applying any other filter condition, there is a strong possibility that many of the items that met that first condition do not meet the additional criteria.

In short, 'Not all things "Amazing" are necessarily anywhere near the queried point', which means with a realistic limit like 100 results, and by best match, those 100 do likely not contain all of the "near" items as well.

Also, the $text operator does not actually really "sort" the results in any way by itself. It's primary purpose is in fact not only to "match" on a phrase but to "score" the result in order to float the "best" match to the top. This is typically done "after" the query itself with the projected value being "sorted" and most likely "limited" as mentioned above. Possible in aggregation pipelines to do that an then apply the second filter(s), but as stated this likely excludes things that are otherwise "near" in the other purpose.

The reverse is also likely true ( 'There are many "Amazing" things further away from the point' ), but with realistic distance limits this becomes less likely. But the other consideration given is this is not a true text search, but just using a regular expression to match the given term.

As a final note, I'm always using "Amazing" as the example phrase here and not "Most" as suggested in the question. This is because of how "stemming" works in text indexes here ( as well as in most dedicated text search products ) in that the particular term would be ignored, much like "and", "or", "the", even "in" would be as well, as they are not really considered valuable to a Phrase, which is what text search does.

So it in fact remains that a Regular expression, would actually be better at matching such terms, if indeed that were required at all.

Concluding

Which really brings us back full circle to the original point, in that a "text" query really does not belong here anyway. The other useful filters usally work in tandem with the true "geoSpatial" search criteria the better, and true "text search" is really low on the list of what would be important.

More likely is that people want a location that lies within a *"Set Intersection" of distances from desired destinations they wish to visit, or at least near enough to some, or most. Then of course other factors ( *"price", "service" etc ) as mentioned earlier are things people want in general consideration.

It's not really a "good fit" to look for the results this way. If you think you really must, then apply one of the "cheat" approaches, or in fact use different queries and then some other logic to merge each set of results. But it really does not make sense for the server to do this alone, which is why it does not try.

So I would focus on getting your geoSpatial matches right first, then apply other critieria that should be important to results. But I don't really believe that "text search" is really valid to be one of them anyway. "Cheat" instead, but only if you really must.

like image 103
Blakes Seven Avatar answered Oct 22 '22 02:10

Blakes Seven