Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Get documents with tags in list, ordered by total number of matches

Given the following MongoDB collection of documents :

{
 title : 'shirt one'
 tags : [
  'shirt',
  'cotton',
  't-shirt',
  'black'
 ]
},
{
 title : 'shirt two'
 tags : [
  'shirt',
  'white',
  'button down collar'
 ]
},
{
 title : 'shirt three'
 tags : [
  'shirt',
  'cotton',
  'red'
 ]
},
...

How do you retrieve a list of items matching a list of tags, ordered by the total number of matched tags? For example, given this list of tags as input:

['shirt', 'cotton', 'black']

I'd want to retrieve the items ranked in desc order by total number of matching tags:

item          total matches
--------      --------------
Shirt One     3 (matched shirt + cotton + black)
Shirt Three   2 (matched shirt + cotton)
Shirt Two     1 (matched shirt)

In a relational schema, tags would be a separate table, and you could join against that table, count the matches, and order by the count.

But, in Mongo... ?

Seems this approach could work,

  • break the input tags into multiple "IN" statements
  • query for items by "OR"'ing together the tag inputs
    • i.e. where ( 'shirt' IN items.tags ) OR ( 'cotton' IN items.tags )
    • this would return, for example, three instances of "Shirt One", 2 instances of "Shirt Three", etc
  • map/reduce that output
    • map: emit(this._id, {...});
    • reduce: count total occurrences of _id
    • finalize: sort by counted total

But I'm not clear on how to implement this as a Mongo query, or if this is even the most efficient approach.

like image 887
Matt Avatar asked Dec 23 '11 14:12

Matt


3 Answers

As i answered in In MongoDB search in an array and sort by number of matches

It's possible using Aggregation Framework.

Assumptions

  • tags attribute is a set (no repeated elements)

Query

This approach forces you to unwind the results and reevaluate the match predicate with unwinded results, so its really inefficient.

db.test_col.aggregate(
    {$match: {tags: {$in: ["shirt","cotton","black"]}}}, 
    {$unwind: "$tags"}, 
    {$match: {tags: {$in: ["shirt","cotton","black"]}}},
    {$group: {
        _id:{"_id":1}, 
        matches:{$sum:1}
    }}, 
    {$sort:{matches:-1}}
);

Expected Results

{
    "result" : [
        {
            "_id" : {
                "_id" : ObjectId("5051f1786a64bd2c54918b26")
            },
            "matches" : 3
        },
        {
            "_id" : {
                "_id" : ObjectId("5051f1726a64bd2c54918b24")
            },
            "matches" : 2
        },
        {
            "_id" : {
                "_id" : ObjectId("5051f1756a64bd2c54918b25")
            },
            "matches" : 1
        }
    ],
    "ok" : 1
}
like image 94
Samuel García Avatar answered Oct 20 '22 04:10

Samuel García


Right now, it isnt possible to do unless you use MapReduce. The only problem with MapReduce is that it is slow (compared to a normal query).

The aggregation framework is slated for 2.2 (so should be available in 2.1 dev release) and should make this sort of thing much easier to do without MapReduce.

Personally, I do not think using M/R is an efficient way to do it. I would rather query for all the documents and do those calculations on the application side. It is easier and cheaper to scale your app servers than it is to scale your database servers so let the app servers do the number crunching. Of those, this approach may not work for you given your data access patterns and requirements.

An even simpler approach may be to just include a count property in each of your tag objects and whenever you $push a new tag to the array, you also $inc the count property. This is a common pattern in the MongoDB world, at least until the aggregation framework.

like image 5
Bryan Migliorisi Avatar answered Oct 20 '22 04:10

Bryan Migliorisi


I'll second @Bryan in saying that MapReduce is the only possible way at the moment (and it's far from perfect). But, in case you desperately need it, here you go :-)

    var m = function() {
        var searchTerms = ['shirt', 'cotton', 'black'];
        var me = this;
        this.tags.forEach(function(t) {
            searchTerms.forEach(function(st) {
                if(t == st) {
                    emit(me._id, {matches : 1});
                }
            })
        })
    };

    var r = function(k, vals) {
        var result = {matches : 0};
        vals.forEach(function(v) {
            result.matches += v.matches;
        })
        return result;
    };

    db.shirts.mapReduce(m, r, {out: 'found01'});

    db.found01.find();
like image 1
Sergio Tulentsev Avatar answered Oct 20 '22 06:10

Sergio Tulentsev