Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

MongoDB sort by relevance

I am using trying to get documents from MongoDB on node. Let's say documents have the following structure:

{ "_id": ObjectId, "title" : String, "tags" : Array<String> }

I'd like to sort them by relevance - so when I'm looking for documents that have either "blue" or "yellow" tag I'd like to get ones with both tags first. So far I managed by google, trial and error:

var tags = [ "yellow", "blue" ];
db.collection('files').aggregate([
    { $project : { tags: 1 } },
    { $unwind : "$tags" },
    { $match : { "tags": { "$in": tags } } },
    { $group : { _id: "$_id", relevance: { $sum:1 } } },
    { $sort : { relevance : -1 } },
], function(err, success) {
    console.log(success);
});

It works just fine, I get sorted collection of ids:

[{"_id":"5371355045002fc820a09566","relevance":2},{"_id":"53712fc6c8fcd124216de6cd","relevance":2},{"_id":"5371302ebd4725dc1b908316","relevance":1}]

Now I would make another query and ask for documents with those ids - but here's my question: can it be done in one query?

like image 634
m1gu3l Avatar asked Sep 30 '22 19:09

m1gu3l


1 Answers

Yes you can as is always the case when you are actually grouping on _id then that value is essentially equivalent to the whole document. So it is just a matter of storing the whole document under the _id field.

You have a couple of approaches to this depending on your MongoDB version, and in versions prior to MongoDB 2.6 you must specify the whole document structure in an initial $project stage ( which may optionally come after a $match which is generally a good idea ) in your pipeline before you actually manipulate the document:

var tags = ["yellow","blue"];
db.collection.aggregate([
    { "$project" : { 
        "_id": {
            "_id": "$_id",
            "title": "$title",
            "tags": "$tags"
        },
        "tags": 1 
    }},
    { "$unwind": "$tags" },
    { "$match": { "tags": { "$in": tags } } },
    { "$group": { "_id": "$_id", "relevance": { "$sum":1 } } },
    { "$sort": { "relevance" : -1 } },
    { "$project": {
        _id: "$_id._id",
        "title": "$_id.title",
        "tags": "$_id.tags"
    }}
])

And of course, at the end of the pipeline you extract the information from the _id field in order to get back your original structure. That is optional, but you usually want that.

For MongoDB 2.6 and above there is a variable available to the pipeline stages that holds the structure of the document at that stage of the pipeline known as $$ROOT, and you can access this as a kind of shortcut to the above form like so:

var tags = ["yellow","blue"];
db.collection.aggregate([
    { "$project" : { 
        "_id": "$$ROOT",
        "tags": 1 
    }},
    { "$unwind": "$tags" },
    { "$match": { "tags": { "$in": tags } } },
    { "$group": { "_id": "$_id", "relevance": { "$sum":1 } } },
    { "$sort": { "relevance" : -1 } },
    { "$project": {
        "_id": "$_id._id",
        "title": "$_id.title",
        "tags": "$_id.tags"
    }}
])

Keeping in mind that in order to restore the document you still need to specify all the required fields.

I would note that as you are "filtering" documents with your match condition in this case and as was mentioned earlier, you should actually be filtering with a $match statement at the "head" of your pipeline. This is the only place where the aggregation framework can select an index in order to optimize the query, and it also reduces the number of documents that do not meet your conditions ( presuming that not everything has the tags "yellow" or "blue" ) that go through the remaining pipeline stages:

db.collection.aggregate([
    { "$match": { "tags": { "$in": tags } } },
    { "$project" : { 
        "_id": {
            "_id": "$_id",
            "title": "$title",
            "tags": "$tags"
        },
        "tags": 1 
    }},
    { "$unwind": "$tags" },
    { "$match": { "tags": { "$in": tags } } },
    { "$group": { "_id": "$_id", "relevance": { "$sum":1 } } },
    { "$sort": { "relevance" : -1 } },
    { "$project": {
        _id: "$_id._id",
        "title": "$_id.title",
        "tags": "$_id.tags"
    }}
])

At any rate that should be generally more effective than trying to do another query which of course would not maintain your sort order in the way that you have done.

like image 65
Neil Lunn Avatar answered Oct 18 '22 14:10

Neil Lunn