Currently I'm running aggregations against a collection containing user and event information. For example:
[
  {
    $match: {
      client: ObjectId('507f1f77bcf86cd799439011'),
      location: 'UK'
    }
  },
  {
    $group: {
      _id: null,
      count: { $sum: 1 }
    }
  }
]
The above is a big simplification; suffice to say there are around 20 different variables like location that can go into that $match statement. There are also sometimes additional steps between these two stages, which is why I'm using $group to do a count (rather than count).
Currently I have an index on the client field, but haven't created indexes (compound or otherwise) on the other fields. As there are so many other fields, I can't just create indexes on everything - it'll be too expensive.
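For reference, a minimal sketch of what that existing index looks like (the collection name events is an assumption for illustration):

// Assumption: the collection is called 'events'; only 'client' is indexed today.
db.events.createIndex({ client: 1 })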
The problem: This works great when a client has a small number of documents, but as the number grows, the aggregation has to scan over an ever-increasing number of docs. The index focuses the range down, but it's not enough.
Create an additional field called p (for partition), and create a compound index: { client: 1, p: 1 }. p can take any value from 1 to n.
Instead of running the pipeline above, run a similar pipeline n times, once for each possible value of p:
[
  {
    $match: {
      client: ObjectId('507f1f77bcf86cd799439011'),
      p: 1, // or 2, 3, etc
      location: 'UK'
    }
  },
  {
    $group: {
      _id: null,
      count: { $sum: 1 }
    }
  }
]
The results from all pipelines could then be merged on the application level.
Using this method, I could limit the number of scans that each query has to do, theoretically decreasing query time.
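A minimal sketch of that fan-out and merge, assuming the Node.js driver, a collection called events, and n = 8 partitions (none of which are specified above):

// Assumption: 'filters' holds the other $match variables (e.g. { location: 'UK' }).
async function countForClient(collection, clientId, filters, nPartitions = 8) {
  const partitionCounts = [];
  for (let p = 1; p <= nPartitions; p++) {
    partitionCounts.push(
      collection.aggregate([
        { $match: { client: clientId, p: p, ...filters } },
        { $group: { _id: null, count: { $sum: 1 } } }
      ]).toArray()
    );
  }
  // Run all partitioned pipelines in parallel, then sum the partial counts.
  const results = await Promise.all(partitionCounts);
  return results.reduce((total, docs) => total + (docs[0] ? docs[0].count : 0), 0);
}

Each partition's result is either an empty array (nothing matched) or a single { _id: null, count: <n> } document, so the merge step is just a sum.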
Taking this a step further, this p value could be used as the shard key, so in theory the analytics queries could be run in parallel across multiple shards.
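If that route were taken, one plausible setup (database and collection names assumed, and a compound key chosen to keep the existing query shape index-bound) would be:

// Assumption: database 'analytics', collection 'events'.
sh.enableSharding("analytics");
sh.shardCollection("analytics.events", { client: 1, p: 1 });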
Has anyone done anything like this before? I've found very little on the subject.
Early tests of this approach show that it works really, really well. Running multiple count queries in parallel means that the "total query time" calculation is now:
total time = max(single query time) + combination time
I haven't tested this at major scale yet, but at mid-scale it's an absolute treat.
Brief stats about this test: it ran against the documents matching the client parameter I care about. For small numbers of scans, there is almost no benefit to this approach. However, for the example above, we're getting total time decreases of between 2x and 4x.
It looks like there's a sweet spot for this approach at subset sizes between 50k and 100k documents.
And of course, running tons of queries in parallel makes you potentially subject to other MongoDB limitations.