MongoDB aggregate sort and limit within group [duplicate]

I have a collection of items for sale with the following schema:

var itemSchema = new Schema({
    "category" : { type : Schema.Types.ObjectId, ref : 'Category' },
    "merchant" : { type : Schema.Types.ObjectId, ref : 'Merchant' },
    "rating" : Number
})

I have inherited an aggregate query which returns items matching a category, grouped by merchant, with the groups sorted by the maximum rating in the group:

Item.aggregate([
      { "$match" : { category : categoryId, merchant : { $ne : null }}},
      { "$group" : { _id : "$merchant", 
                    rating : { $max : "$rating" }, 
                    items : { $push : "$$ROOT" }}},
      { "$sort" : { rating : -1 }}
    ], { allowDiskUse : true })
    .skip(skip)
    .limit(limit)

After this the code goes on to sort the items within each group by rating and remove all but the two highest rated items of each group.
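That post-processing amounts to something like this ( a sketch, since that code isn't shown here ):

// "groups" is the array returned by the aggregate above
groups.forEach(function(group) {
  group.items.sort(function(a, b) { return b.rating - a.rating; });
  group.items = group.items.slice(0, 2); // keep the two highest rated
});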

Is it possible to perform this sort and limit within the groups as a part of the aggregate function, so that the aggregate will only return the two highest rated items per group?

asked Dec 11 '22 by Carasel

2 Answers

The Basic Problem

It's not the wisest idea to try and do this in the aggregation framework, either currently or in the foreseeable near future. The main problem, of course, comes from this line in the code you already have:

"items" : { "$push": "$$ROOT" }

And that means exactly that: in order to get to the "top N" results in any later code, every object within the grouping key needs to be pushed into an array.

This clearly does not scale, as the size of that array alone can quite conceivably exceed the BSON limit of 16MB, regardless of the rest of the data in the grouped document. The main catch here is that it is not possible to "limit the push" to just a certain number of items. There is a long standing JIRA issue on just such a thing.
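To illustrate ( the values here are made up ), every document coming out of the $group stage carries the complete set of matching items for that merchant:

{
  "_id" : ObjectId("..."),  // the merchant
  "rating" : 5,             // $max rating within the group
  "items" : [ /* every matching item, pushed in full via $$ROOT */ ]
}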

For that reason alone, the most practical approach is to run individual queries for the "top N" items for each grouping key. These do not even need to be .aggregate() statements ( depending on the data ) and can really be anything that simply limits the "top N" values you want.

Best Approach

Your architecture appears to be node.js with mongoose, but anything that supports async IO and parallel execution of queries is going to be the best option. Ideally something with its own API library that supports combining the results of those queries into a single response.

For example, here is a simplified listing using your architecture and available libraries ( notably async ) that runs the queries in parallel and combines the results:

var async = require('async'),
    mongoose = require('mongoose'),
    Schema = mongoose.Schema;

mongoose.connect('mongodb://localhost/test');

var data = [
  { "merchant": 1, "rating": 1 },
  { "merchant": 1, "rating": 2 },
  { "merchant": 1, "rating": 3 },
  { "merchant": 2, "rating": 1 },
  { "merchant": 2, "rating": 2 },
  { "merchant": 2, "rating": 3 }
];

var testSchema = new Schema({
  merchant: Number,
  rating: Number
});

var Test = mongoose.model( 'Test', testSchema, 'test' );

async.series(
  [
    function(callback) {
      Test.remove({},callback);
    },
    function(callback) {
      async.each(data,function(item,callback) {
        Test.create(item,callback);
      },callback);
    },
    function(callback) {
      async.waterfall(
        [
          function(callback) {
            // get the list of grouping keys first
            Test.distinct("merchant",callback);
          },
          function(merchants,callback) {
            async.concat(
              merchants,
              function(merchant,callback) {
                // one "top 2" query per grouping key, run in parallel
                Test.find({ "merchant": merchant })
                  .sort({ "rating": -1 })
                  .limit(2)
                  .exec(callback);
              },
              function(err,results) {
                console.log(JSON.stringify(results,undefined,2));
                callback(err);
              }
            );
          }
        ],
        callback
      );
    }
  ],
  function(err) {
    if (err) throw err;
    mongoose.disconnect();
  }
);

This results in just the top 2 results for each merchant in the output:

[
  {
    "_id": "560d153669fab495071553ce",
    "merchant": 1,
    "rating": 3,
    "__v": 0
  },
  {
    "_id": "560d153669fab495071553cd",
    "merchant": 1,
    "rating": 2,
    "__v": 0
  },
  {
    "_id": "560d153669fab495071553d1",
    "merchant": 2,
    "rating": 3,
    "__v": 0
  },
  {
    "_id": "560d153669fab495071553d0",
    "merchant": 2,
    "rating": 2,
    "__v": 0
  }
]

This really is the most efficient way to process this, though it does take resources since it is still multiple queries. But it is nowhere near the resources eaten up by an aggregation pipeline that attempts to store all documents in an array and process them.

The Aggregate Problem, Now and in the Near Future

On that note, it is possible, provided the number of documents does not breach the BSON limit. Methods in the current release of MongoDB are not great for this, but the upcoming release ( as of writing, the 3.1.8 dev branch does this ) at least introduces a $slice operator to the aggregation pipeline. So if you are smarter about the aggregation operation and use a $sort first, then the already sorted items in the array can be picked out easily:

var async = require('async'),
    mongoose = require('mongoose'),
    Schema = mongoose.Schema;

mongoose.connect('mongodb://localhost/test');

var data = [
  { "merchant": 1, "rating": 1 },
  { "merchant": 1, "rating": 2 },
  { "merchant": 1, "rating": 3 },
  { "merchant": 2, "rating": 1 },
  { "merchant": 2, "rating": 2 },
  { "merchant": 2, "rating": 3 }
];

var testSchema = new Schema({
  merchant: Number,
  rating: Number
});

var Test = mongoose.model( 'Test', testSchema, 'test' );

async.series(
  [
    function(callback) {
      Test.remove({},callback);
    },
    function(callback) {
      async.each(data,function(item,callback) {
        Test.create(item,callback);
      },callback);
    },
    function(callback) {
      Test.aggregate(
        [
          { "$sort": { "merchant": 1, "rating": -1 } },
          { "$group": {
            "_id": "$merchant",
            "items": { "$push": "$$ROOT" }
          }},
          { "$project": {
            "items": { "$slice": [ "$items", 2 ] }
          }}
        ],
        function(err,results) {
          console.log(JSON.stringify(results,undefined,2));
          callback(err);
        }
      );
    }
  ],
  function(err) {
    if (err) throw err;
    mongoose.disconnect();
  }
);

This yields the same basic result, as the top 2 items are "sliced" from the array after it has been sorted first.

It is also actually "possible" in current releases, but with the same basic constraint: this still involves pushing all content into an array after sorting it first. It just takes an "iterative" approach. You can code this out to produce the aggregation pipeline for larger numbers of entries, but just showing "two" should demonstrate it's not a really great idea to try:

var async = require('async'),
    mongoose = require('mongoose'),
    Schema = mongoose.Schema;

mongoose.connect('mongodb://localhost/test');

var data = [
  { "merchant": 1, "rating": 1 },
  { "merchant": 1, "rating": 2 },
  { "merchant": 1, "rating": 3 },
  { "merchant": 2, "rating": 1 },
  { "merchant": 2, "rating": 2 },
  { "merchant": 2, "rating": 3 }
];

var testSchema = new Schema({
  merchant: Number,
  rating: Number
});

var Test = mongoose.model( 'Test', testSchema, 'test' );

async.series(
  [
    function(callback) {
      Test.remove({},callback);
    },
    function(callback) {
      async.each(data,function(item,callback) {
        Test.create(item,callback);
      },callback);
    },
    function(callback) {
      Test.aggregate(
        [
          { "$sort": { "merchant": 1, "rating": -1 } },
          { "$group": {
            "_id": "$merchant",
            "items": { "$push": "$$ROOT" }
          }},
          { "$unwind": "$items" },
          { "$group": {
            "_id": "$_id",
            "first": { "$first": "$items" },
            "items": { "$push": "$items" }
          }},
          { "$unwind": "$items" },
          { "$redact": {
            "$cond": [
              { "$eq": [ "$items", "$first" ] },
              "$$PRUNE",
              "$$KEEP"
            ]
          }},
          { "$group": {
            "_id": "$_id",
            "first": { "$first": "$first" },
            "second": { "$first": "$items" }
          }},
          { "$project": {
            "items": {
              "$map": {
                "input": ["A","B"],
                "as": "el",
                "in": {
                  "$cond": [
                    { "$eq": [ "$$el", "A" ] },
                    "$first",
                    "$second"
                  ]
                }
              }
            }
          }}
        ],
        function(err,results) {
          console.log(JSON.stringify(results,undefined,2));
          callback(err);
        }
      );
    }
  ],
  function(err) {
    if (err) throw err;
    mongoose.disconnect();
  }
);

And again, while "possible" in earlier versions ( this uses features introduced in 2.6 to shorten things, since you already tag $$ROOT ), the basic steps are: store the array, then get each item "off the stack" using $first, compare it ( and potentially others ) against items within the array to remove them, and then get the "next first" item off that stack until your "top N" is eventually done.
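If you really did want to generalize that to a "top N", the pipeline could be generated in code along these lines ( an untested sketch: the topNPipeline name and the first0, first1 ... field names are made up for illustration, and the final $project that maps those fields back into an array is omitted for brevity ):

function topNPipeline(n) {
  var pipeline = [
    { "$sort": { "merchant": 1, "rating": -1 } },
    { "$group": { "_id": "$merchant", "items": { "$push": "$$ROOT" } } },
    { "$unwind": "$items" }
  ];

  for (var k = 0; k < n - 1; k++) {
    var group = { "_id": "$_id" };
    // keep the items peeled off on earlier passes
    for (var j = 0; j < k; j++)
      group["first" + j] = { "$first": "$first" + j };
    group["first" + k] = { "$first": "$items" }; // peel the next highest
    group["items"] = { "$push": "$items" };      // re-stack the rest

    pipeline.push(
      { "$group": group },
      { "$unwind": "$items" },
      // prune the item just peeled off from the remaining stack
      { "$redact": {
        "$cond": [
          { "$eq": [ "$items", "$first" + k ] },
          "$$PRUNE",
          "$$KEEP"
        ]
      }}
    );
  }

  // collect the peeled fields; the last one is $first of what remains
  var finalGroup = { "_id": "$_id" };
  for (var j = 0; j < n - 1; j++)
    finalGroup["first" + j] = { "$first": "$first" + j };
  finalGroup["first" + (n - 1)] = { "$first": "$items" };
  pipeline.push({ "$group": finalGroup });

  return pipeline;
}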


Conclusion

Until the day comes when there is an operation that allows the items in a $push aggregation accumulator to be limited to a certain count, this is not really a practical operation for aggregate.

You can do it if the data in these results is small enough, and it might even be more efficient than client side processing if the database servers are of sufficient spec to provide a real advantage. But chances are that neither will be the case in most real applications of reasonable usage.

The best bet is the "parallel query" option demonstrated first. It will always scale well, and there is no need to "code around" cases such as a particular grouping not returning at least the total "top N" items required and working out how to retain them ( a much longer example of that is omitted ), since it simply performs each query and combines the results.

Use parallel queries. They will be better than the coded approach you have, and they will outperform the aggregation approach demonstrated by a long way. At least until there is a better option.

answered Dec 28 '22 by Blakes Seven

What I did was to $push only the _ids and add a $slice afterwards:

$group: {
    _id: "$merchant",
    items : { $push : "$_id" }
},
// ...

$project: {
  items: {$slice: ["$items", offset, limit]}
}

And after the query I populate the result:

return Item.populate(result, {path: 'items'})
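Put together, the whole flow looks something like this ( a sketch assuming the schema from the question, with offset and limit supplied by the caller; note that $slice in $project requires MongoDB 3.2 or later ):

Item.aggregate([
  { $match : { category : categoryId, merchant : { $ne : null } } },
  // sort before grouping so the _ids are pushed in rating order
  { $sort : { rating : -1 } },
  { $group : {
      _id : "$merchant",
      rating : { $max : "$rating" },
      items : { $push : "$_id" }   // _ids only, so the array stays small
  }},
  { $sort : { rating : -1 } },
  { $project : {
      items : { $slice : ["$items", offset, limit] }
  }}
], function(err, result) {
  if (err) throw err;
  // swap the stored _ids back for the full item documents
  Item.populate(result, { path : 'items' }, function(err, populated) {
    // populated[n].items now holds the top rated items per merchant
  });
});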
answered Dec 28 '22 by LandoR