Running MongoDB aggregations in parallel

Tags:

mongodb

Currently I'm running aggregations against a collection containing user and event information. For example:

[
  {
    $match: {
      client: ObjectId('507f1f77bcf86cd799439011'),
      location: 'UK'
    }
  },
  {
    $group: {
      _id: null,
      count: {
        $sum: 1
      }
    }
  }
]

The above is a big simplification; suffice to say there are around 20 different variables like location that can go into that $match stage. There are also sometimes additional stages between these two, which is why I'm using $group to do the count (rather than $count).
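
As a made-up illustration of one of those extra stages, something might sit between the two like so (the $unwind stage and the embedded events array are purely hypothetical):

[
  {
    $match: {
      client: ObjectId('507f1f77bcf86cd799439011'),
      location: 'UK'
    }
  },
  {
    // hypothetical intermediate stage, e.g. unwinding an embedded array
    $unwind: '$events'
  },
  {
    $group: {
      _id: null,
      count: {
        $sum: 1
      }
    }
  }
]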

Currently I have an index on the client field, but haven't created indexes (compound or otherwise) on the other fields. As there are so many other fields, I can't just create indexes on everything - it'll be too expensive.
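
For reference, the existing single-field index looks something like this (events is a placeholder collection name):

db.events.createIndex({ client: 1 })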

The problem: this works great when a client has a small number of documents, but as that number grows, the aggregation has to scan over an ever-increasing number of documents. The index narrows the range down, but it's not enough.


Idea

Create an additional field called p (for partition), and create a compound index: { client: 1, p: 1 }. p can take values 1 to n.
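
A minimal sketch of what that could look like in the shell, assuming an events collection and n = 4 (both placeholders):

// compound index covering client + partition
db.events.createIndex({ client: 1, p: 1 })

// assign a partition when inserting a document, spread evenly across 1..n
var n = 4;
db.events.insertOne({
  client: ObjectId('507f1f77bcf86cd799439011'),
  location: 'UK',
  p: Math.floor(Math.random() * n) + 1  // or a stable hash of the user id
});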

Instead of running the pipeline above once, run a similar pipeline n times, once for each possible value of p:

[
  {
    $match: {
      client: ObjectId('507f1f77bcf86cd799439011'),
      p: 1, // or 2, 3, etc
      location: 'UK'
    }
  },
  {
    $group: {
      _id: null,
      count: {
        $sum: 1
      }
    }
  }
]

The results from all pipelines could then be merged on the application level.
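
A rough sketch of that fan-out and merge using the Node.js driver (the collection handle, partition count, and extra filters are placeholders):

// run one count pipeline per partition value and sum the results
async function countForClient(collection, clientId, filters, n = 4) {
  const runCount = async (p) => {
    const pipeline = [
      { $match: { client: clientId, p: p, ...filters } },
      { $group: { _id: null, count: { $sum: 1 } } }
    ];
    const [result] = await collection.aggregate(pipeline).toArray();
    return result ? result.count : 0; // no matching docs => empty result
  };

  // fire all n pipelines at once, then combine at the application level
  const counts = await Promise.all(
    Array.from({ length: n }, (_, i) => runCount(i + 1))
  );
  return counts.reduce((total, c) => total + c, 0);
}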

Using this method, I could limit the number of documents each query has to scan, theoretically decreasing query time.

Taking this a step further, this p value could be used as the shard key, so in theory, the analytics queries could be run in parallel across multiple shards.
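
If it went that far, the setup would be along these lines (database and collection names are placeholders, and whether { client: 1, p: 1 } is a good shard key depends on the write distribution):

// shard on the compound key so each partition's documents can live on,
// and be scanned by, a different shard
sh.enableSharding("analytics")
sh.shardCollection("analytics.events", { client: 1, p: 1 })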

Has anyone done anything like this before? I've found very little on the subject.

Chris Houghton asked Nov 10 '22

1 Answer

Early tests of this approach show that it works really, really well. Running multiple count queries in parallel means that the "total query time" calculation is now:

total time = max(single query time) + combination time

I haven't tested this at major scale yet, but at mid-scale it's an absolute treat.

Brief stats about this test:

  • Collection has 2.5m documents
  • 200k of these documents have the client parameter I care about
  • I'm running 4 queries in parallel, each looking at a different subset (~50k) of documents

For small numbers of scans, there is almost no benefit to this approach. However, for the above example, we're seeing total query time reduced by 2-4x.

It looks like there's a sweet spot for this approach at a subset size of around 50-100k documents.

And of course, running tons of queries in parallel makes you potentially subject to other MongoDB limitations.

Chris Houghton answered Nov 15 '22