
Is it possible to sort, group and limit efficiently in Mongo with a pipeline?

Given Users with an index on age:

{ name: 'Bob',
  age:   21  }

{ name: 'Cathy',
  age:   21  }

{ name: 'Joe',
  age:   33  }

To get the output:

[ 
  { _id: 21,
    names: ['Bob', 'Cathy'] },
  { _id: 33,
    names: ['Joe'] }
]

Is it possible to sort, group and limit by age?

db.users.aggregate(
   [
      {
        $sort: {
           age: 1
        }
      },
      {
        $group: {
           _id: '$age',
           names: { $push: '$name' }
        }
      },
      {
        $limit: 10
      }
   ]
)

I did some research, but it's not clear whether it's possible to sort first and then group. In my testing, the group loses the sort order, but I don't see why.

If the group preserves the sort, then the sort and limit can greatly reduce the required processing. It only needs to do enough work to "fill" the limit of 10 groups.

So,

  1. Does group preserve the sort order? Or do you have to group and then sort?
  2. Is it possible to sort, group and limit doing only enough processing to return the limit? Or does it need to process the entire collection and then limit?
asked Jan 29 '23 by B Seven

1 Answer

To answer your first question: $group does not preserve the order. There are open change requests which also shed some light on the background, but it doesn't look like the product will be changed to preserve the order of the input documents:

  • https://jira.mongodb.org/browse/SERVER-24799
  • https://jira.mongodb.org/browse/SERVER-4507
  • https://jira.mongodb.org/browse/SERVER-21022

Two things can be said in general. First, you generally want to group first and then sort: sorting the (usually far fewer) elements that the grouping produces is faster than sorting all input documents.

Secondly, MongoDB is going to sort as efficiently and as little as possible. The documentation states:

When a $sort immediately precedes a $limit in the pipeline, the $sort operation only maintains the top n results as it progresses, where n is the specified limit, and MongoDB only needs to store n items in memory. This optimization still applies when allowDiskUse is true and the n items exceed the aggregation memory limit.
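To see why that optimization matters, here is a sketch in plain Node (an illustration of the idea only, not MongoDB's actual internals): while streaming documents, you only ever keep the current top n, so memory stays bounded by the limit rather than by the collection size.

```javascript
// Keep only the n smallest documents by sort key while streaming.
// `key` extracts the sort key from a document.
function topN(docs, n, key) {
  const kept = []; // at most n items, maintained in ascending key order
  for (const doc of docs) {
    // find the first kept item with a larger key and insert before it
    let i = kept.findIndex(d => key(d) > key(doc));
    if (i === -1) i = kept.length;
    kept.splice(i, 0, doc);
    if (kept.length > n) kept.pop(); // discard anything beyond the limit
  }
  return kept;
}

const users = [
  { name: 'Joe', age: 33 },
  { name: 'Bob', age: 21 },
  { name: 'Cathy', age: 21 },
];
console.log(topN(users, 2, d => d.age).map(d => d.name)); // [ 'Bob', 'Cathy' ]
```

Joe is seen first but gets pushed out as soon as two smaller ages arrive; at no point does the function hold more than n + 1 documents.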

So this code gets the job done in your case:

collection.aggregate([
    {
        $group: {
            _id: '$age',
            names: { $push: '$name' }
        }
    },
    {
        $sort: {
            _id: 1
        }
    },
    {
        $limit: 10
    }
])
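For reference, here is what that group → sort → limit pipeline computes, sketched in plain Node (same document shape as your sample, no MongoDB required), so you can check it against the desired output:

```javascript
function groupSortLimit(docs, limit) {
  // $group: one bucket per age, pushing each name into the bucket
  const groups = new Map();
  for (const { name, age } of docs) {
    if (!groups.has(age)) groups.set(age, { _id: age, names: [] });
    groups.get(age).names.push(name);
  }
  return [...groups.values()]
    .sort((a, b) => a._id - b._id) // $sort on _id (the age)
    .slice(0, limit);              // $limit
}

const users = [
  { name: 'Bob', age: 21 },
  { name: 'Cathy', age: 21 },
  { name: 'Joe', age: 33 },
];
console.log(groupSortLimit(users, 10));
// [ { _id: 21, names: [ 'Bob', 'Cathy' ] }, { _id: 33, names: [ 'Joe' ] } ]
```

Note that the sort happens after grouping, on the (smaller) set of groups, which is exactly why grouping first is the cheaper order of stages.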

EDIT following your comments:

I agree with what you say. Taking your logic a little further, I would go as far as saying: if $group were smart enough to use an index, it shouldn't even require a $sort stage at the start. Unfortunately, it's not (not yet, probably). As things stand today, $group will never use an index and it won't take shortcuts based on the following stages ($limit in this case). Also see this link where someone ran some basic tests.

The aggregation framework is still pretty young, so I guess there is a lot of work being done to make the aggregation pipeline smarter and faster.

There are answers here on StackOverflow (e.g. here) where people suggest using an upfront $sort stage in order to "force" MongoDB to use an index somehow. In my tests (1 million records of your sample shape using different random distributions), however, this slowed things down significantly.

When it comes to the performance of an aggregation pipeline, $match stages at the start are what helps the most. If you can limit the total number of records that need to go through the pipeline from the beginning, that's your best bet - obviously... ;)

answered Feb 01 '23 by dnickless