Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How does sorting with an index work in MongoDB?

Tags:

I'm wondering how sorting with an index actually works in MongoDB. There are a couple articles in the MongoDB documentation, but they don't actually describe how the sort proceeds or the time complexity. Searches of SO and the interweb in general so far haven't turned up anything relevant.

Let's assume there are a documents in a collection, the find() clause matches b documents, there's a limit of c documents returned, a >> b >> c, and c is some suitably large number such that the returned set cannot fit in memory - let's say 1M documents, for example.

At the start of the operation, there exist b documents that needs to be sorted and a sorted tree index of size a for the feature the documents will be sorted by.

I can imagine:

A) traverse the index in order, and for each ObjectID traverse the list of b documents. Return matches until c is reached. This would be O(ab).

B) as A), but build a hashset of the ObjectIDs in the b documents first. This is O(a), but takes O(b) memory.

I've tried to consider sorts based on traversing the set of b documents, but can't seem to come up with anything faster than O(b log b), which is no better than sorting without an index.

I assume (but maybe I'm wrong) that every sort doesn't require an index scan, so how does the sort actually work?

Update:

Kevin's answer and provided link narrow down the question a lot, but I'd like to confirm/clarify a few points:

  1. As I understand it, you cannot use different indexes for the query and the sort if you want to avoid an in-memory sort. When I read this page it appeared as though you could (or at least, it didn't specify one way or the other), but that seems to be incorrect. Essentially, the documents are sorted because they're looked up in the order of the index during the query and therefore returned in the order of the index. Right?
  2. When querying on a compound index, the the sorting index must be the first index in the compound index, except for indexes where the query is an equality. If not, the sort is performed in memory. Right?
  3. How does sorting work with $in or $or queries? For example, assume the query is

    {a: {$in: [4, 6, 2, 1, 3, 10]}, b: {$gt: 1, $lt: 6}}

... and there's a compound index on a and b in that order. How would the sort work in the cases the sort is on a or b? $or is even more complicated since, as I understand it, $or queries are essentially split into multiple separate queries. Are $or queries always an in-memory sort, at least for merging the results of the separate queries?

like image 407
elhefe Avatar asked Mar 21 '16 21:03

elhefe


People also ask

How indexes affect sorting in MongoDB?

Since indexes contain ordered records, MongoDB can obtain the results of a sort from an index that includes the sort fields. MongoDB may use multiple indexes to support a sort operation if the sort uses the same indexes as the query predicate.

How does MongoDB sorting work?

To sort documents in MongoDB, you need to use sort() method. The method accepts a document containing a list of fields along with their sorting order. To specify sorting order 1 and -1 are used. 1 is used for ascending order while -1 is used for descending order.

What is sorting and index process?

Sorting and indexing are two different methods for sequentially ordering data in tables. Some Analytics commands require that the input is first sorted or indexed. Ordering data can also be a useful analytical operation in itself, bringing patterns and anomalies into focus.

How does index work in MongoDB?

The index stores the value of a specific field or set of fields, ordered by the value of the field. The ordering of the index entries supports efficient equality matches and range-based query operations. In addition, MongoDB can return sorted results by using the ordering in the index.


1 Answers

Indexes in MongoDB are stored in a B-tree structure, where each index entry points to a specific location on-disk. Using a B-tree structure also means that a MongoDB index is stored in a sorted order, always traversed in-order, and is cheap for MongoDB to fetch a series of documents in a sorted order via indexes.

Update: The B-tree structure is true for the MMAPv1 storage engine, but is implemented slightly differently by the WiredTiger storage engine (default since MongoDB 3.2). The basic idea remains the same, where it's cheap to traverse the index in a sorted order.

A SORT stage (i.e. in-memory sort) in a query is limited to 32MB of memory use. A query will fail if the SORT stage exceeds this limit. This limit can be sidestepped by utilizing the sorted nature of indexes, so that MongoDB can return a query with a sort() parameter without performing an in-memory sort.

Let us assume that the query is of the shape:

    db.a.find({b:{$gt:100}, c:{$gt:200}}).sort(...) 

with collection a having an index of:

    db.a.createIndex({b:1,c:1}) 

There are two possible scenarios when a sort() stage is specified in the query:

1. MongoDB cannot use the sorted nature of the index and must perform an in-memory SORT stage.

This is the outcome if the query cannot use the "index prefix". For example:

    db.a.find({b:{$gt:100}, c:{$gt:200}}).sort({c:1}) 

In the query above, the index {b:1,c:1} can be used to:

  • Match documents having b greater than 100 for the {b:{$gt:100}} portion of the query.
  • However, there is no guarantee that the returned documents are sorted in terms of c.

Therefore, MongoDB has no choice but to perform an in-memory sort. The explain() output of this query will have a SORT stage. This SORT stage would be limited to 32MB of memory use.

2. MongoDB can use the sorted nature of the index.

This is the outcome if the query uses:

  • Sort keys that matches the order of the index, and
  • Specifies the same ordering as the index (i.e. the index {b:1,c:1} can be used for sort({b:1,c:1}) or sort({b:-1,c:-1}) but not sort({b:1,c:-1}))

For example:

    db.a.find({b:{$gt:100}, c:{$gt:200}}).sort({b:1}) 

In the query above, the index {b:1,c:1} can be used to:

  • Match documents having b greater than 100 for the {b:{$gt:100}} portion of the query.
  • In this case, MongoDB can guarantee that the returned documents are sorted in terms of b.

The explain() output of the query above will not have a SORT stage. Also, the explain() output of the query with and without sort() are identical. In essence, we are getting the sort() for free.

A worthwhile resource to understand this subject is Optimizing MongoDB Compound Indexes. Please note that this blog post was written way back in 2012. Although some of the terminology may be outdated, the technicality of the post is still relevant.

Update on follow-up questions

  1. MongoDB uses only one index for most queries. So for example, to avoid an in-memory SORT stage in the query

    db.a.find({a:1}).sort({b:1}) 

    the index must cover both a and b fields at the same time; e.g. a compound index such as {a:1,b:1} is required. You cannot have two separate indexes {a:1} and {b:1}, and expect the {a:1} index to be used for the equality part, and the {b:1} index to be used for the sort part. In this case, MongoDB will choose one of the two indexes.

    Therefore, it is correct that the results are sorted because they are looked up and returned in the order of the index.

  2. To avoid having an in-memory sort using a compound index, the first part of the index must cater to the equality part of the query, and the second part must cater to the sort part of the query (as shown in the explanation for (1) above).

    If you have a query like this:

    db.a.find({}).sort({a:1}) 

    the index {a:1,b:1} can be used for the sort part (since you're basically returning the whole collection). And if your query looks like this:

    db.a.find({a:1}).sort({b:1}) 

    the same index {a:1,b:1} can also be used for both parts of the query. Also:

    db.a.find({a:1,b:1}) 

    can also use the same index {a:1,b:1}

    Notice the pattern here: the find() followed by sort() parameters follow the index order {a:1,b:1}. Therefore a compound index must be ordered by equality -> sort.

Update regarding sorting of different types

If a field has different types between documents (e.g. if a is string in one document, number in others, boolean in yet another), how do the sort proceed?

The answer is MongoDB BSON type comparison order. To paraphrase the manual page, the order is:

  1. MinKey (internal type)
  2. Null
  3. Numbers (ints, longs, doubles, decimals)
  4. Symbol, String
  5. Object
  6. Array
  7. BinData
  8. ObjectId
  9. Boolean
  10. Date
  11. Timestamp
  12. Regular Expression
  13. MaxKey (internal type)

So from the example above using ascending order, documents containing numbers will appear first, then strings, then boolean.

like image 143
kevinadi Avatar answered Sep 29 '22 09:09

kevinadi