I have a MongoDB
collection of documents of the form
{ "id": 42, "title": "candy can", "description": "canada candy canteen", "brand": "cannister candid", "manufacturer": "candle canvas" }
I need to implement auto-complete feature based on the input search term by matching in the fields except id
. For example, if the input term is can
, then I should return all matching words in the document as
{ hints: ["candy", "can", "canada", "canteen", ...]
I looked at this question but it didn't help. I also tried searching how to do regex
search in multiple fields and extract matching tokens, or extracting matching tokens in a MongoDB text search
but couldn't find any help.
There is no easy solution for what you want, since normal queries can't modify the fields they return. There is a solution (using the below mapReduce inline instead of doing an output to a collection), but except for very small databases, it is not possible to do this in realtime.
As written, a normal query can't really modify the fields it returns. But there are other problems. If you want to do a regex search in halfway decent time, you would have to index all fields, which would need a disproportional amount of RAM for that feature. If you wouldn't index all fields, a regex search would cause a collection scan, which means that every document would have to be loaded from disk, which would take too much time for autocompletion to be convenient. Furthermore, multiple simultaneous users requesting autocompletion would create considerable load on the backend.
The problem is quite similar to one I have already answered: We need to extract every word out of multiple fields, remove the stop words and save the remaining words together with a link to the respective document(s) the word was found in a collection. Now, for getting an autocompletion list, we simply query the indexed word list.
db.yourCollection.mapReduce( // Map function function() { // We need to save this in a local var as per scoping problems var document = this; // You need to expand this according to your needs var stopwords = ["the","this","and","or"]; for(var prop in document) { // We are only interested in strings and explicitly not in _id if(prop === "_id" || typeof document[prop] !== 'string') { continue } (document[prop]).split(" ").forEach( function(word){ // You might want to adjust this to your needs var cleaned = word.replace(/[;,.]/g,"") if( // We neither want stopwords... stopwords.indexOf(cleaned) > -1 || // ...nor string which would evaluate to numbers !(isNaN(parseInt(cleaned))) || !(isNaN(parseFloat(cleaned))) ) { return } emit(cleaned,document._id) } ) } }, // Reduce function function(k,v){ // Kind of ugly, but works. // Improvements more than welcome! var values = { 'documents': []}; v.forEach( function(vs){ if(values.documents.indexOf(vs)>-1){ return } values.documents.push(vs) } ) return values }, { // We need this for two reasons... finalize: function(key,reducedValue){ // First, we ensure that each resulting document // has the documents field in order to unify access var finalValue = {documents:[]} // Second, we ensure that each document is unique in said field if(reducedValue.documents) { // We filter the existing documents array finalValue.documents = reducedValue.documents.filter( function(item,pos,self){ // The default return value var loc = -1; for(var i=0;i<self.length;i++){ // We have to do it this way since indexOf only works with primitives if(self[i].valueOf() === item.valueOf()){ // We have found the value of the current item... loc = i; //... so we are done for now break } } // If the location we found equals the position of item, they are equal // If it isn't equal, we have a duplicate return loc === pos; } ); } else { finalValue.documents.push(reducedValue) } // We have sanitized our data, now we can return it return finalValue }, // Our result are written to a collection called "words" out: "words" } )
Running this mapReduce against your example would result in db.words
look like this:
{ "_id" : "can", "value" : { "documents" : [ ObjectId("553e435f20e6afc4b8aa0efb") ] } } { "_id" : "canada", "value" : { "documents" : [ ObjectId("553e435f20e6afc4b8aa0efb") ] } } { "_id" : "candid", "value" : { "documents" : [ ObjectId("553e435f20e6afc4b8aa0efb") ] } } { "_id" : "candle", "value" : { "documents" : [ ObjectId("553e435f20e6afc4b8aa0efb") ] } } { "_id" : "candy", "value" : { "documents" : [ ObjectId("553e435f20e6afc4b8aa0efb") ] } } { "_id" : "cannister", "value" : { "documents" : [ ObjectId("553e435f20e6afc4b8aa0efb") ] } } { "_id" : "canteen", "value" : { "documents" : [ ObjectId("553e435f20e6afc4b8aa0efb") ] } } { "_id" : "canvas", "value" : { "documents" : [ ObjectId("553e435f20e6afc4b8aa0efb") ] } }
Note that the individual words are the _id
of the documents. The _id
field is indexed automatically by MongoDB. Since indices are tried to be kept in RAM, we can do a few tricks to both speed up autocompletion and reduce the load put to the server.
For autocompletion, we only need the words, without the links to the documents. Since the words are indexed, we use a covered query – a query answered only from the index, which usually resides in RAM.
To stick with your example, we would use the following query to get the candidates for autocompletion:
db.words.find({_id:/^can/},{_id:1})
which gives us the result
{ "_id" : "can" } { "_id" : "canada" } { "_id" : "candid" } { "_id" : "candle" } { "_id" : "candy" } { "_id" : "cannister" } { "_id" : "canteen" } { "_id" : "canvas" }
Using the .explain()
method, we can verify that this query uses only the index.
{ "cursor" : "BtreeCursor _id_", "isMultiKey" : false, "n" : 8, "nscannedObjects" : 0, "nscanned" : 8, "nscannedObjectsAllPlans" : 0, "nscannedAllPlans" : 8, "scanAndOrder" : false, "indexOnly" : true, "nYields" : 0, "nChunkSkips" : 0, "millis" : 0, "indexBounds" : { "_id" : [ [ "can", "cao" ], [ /^can/, /^can/ ] ] }, "server" : "32a63f87666f:27017", "filterSet" : false }
Note the indexOnly:true
field.
Albeit we will have to do two queries to get the actual document, since we speed up the overall process, the user experience should be well enough.
words
collectionWhen the user selects a choice of the autocompletion, we have to query the complete document of words in order to find the documents where the word chosen for autocompletion originated from.
db.words.find({_id:"canteen"})
which would result in a document like this:
{ "_id" : "canteen", "value" : { "documents" : [ ObjectId("553e435f20e6afc4b8aa0efb") ] } }
With that document, we can now either show a page with search results or, like in this case, redirect to the actual document which you can get by:
db.yourCollection.find({_id:ObjectId("553e435f20e6afc4b8aa0efb")})
While this approach may seem complicated at first (well, the mapReduce is a bit), it is actual pretty easy conceptually. Basically, you are trading real time results (which you won't have anyway unless you spend a lot of RAM) for speed. Imho, that's a good deal. In order to make the rather costly mapReduce phase more efficient, implementing Incremental mapReduce could be an approach – improving my admittedly hacked mapReduce might well be another.
Last but not least, this way is a rather ugly hack altogether. You might want to dig into elasticsearch or lucene. Those products imho are much, much more suited for what you want.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With