MongoDB - slow '$group' performance

I have a MongoDB collection of over 1,000,000 records. Each record is around 20 KB, so the total collection size is around 20 GB.

I have a 'type' field in the collection, which can take around 10 different values, and I would like to get the per-type counters for the collection. There is an index on the 'type' field.
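For completeness, the index would have been created along these lines (a sketch, assuming pymongo; not part of the original post):

# Sketch: a single-field index on 'type' (assumed setup, pymongo syntax).
my_db.my_colc.create_index('type')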

I've tested two different approaches (assuming Python syntax):

A naive method - using 'count' calls for each of the values:

# Count each distinct type value with a separate, index-assisted count query.
counters = {}
for type_val in my_db.my_colc.distinct('type'):
    counters[type_val] = my_db.my_colc.find({'type': type_val}).count()

Using the aggregation framework with a '$group' stage:

counters = my_db.my_colc.aggregate([{'$group': {'_id': '$type', 'agg_val': {'$sum': 1}}}])

The first approach is dramatically faster than the second: about 1 minute vs. 45 minutes, roughly a 45x difference. This seems to be because count runs against the index alone, without accessing the documents, while $group has to go over the documents one by one.
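One way to see the index-only effect (a sketch, not from the original post; 'type_a' is a placeholder value, and the output fields shown are the MongoDB 2.6 explain() format): a covered query on 'type', projecting only the indexed field, reports indexOnly: True, which is the same access pattern a count on 'type' benefits from.

# Sketch: check that a query on 'type' can be served from the index alone.
plan = my_db.my_colc.find({'type': 'type_a'}, {'type': 1, '_id': 0}).explain()
print(plan.get('cursor'))     # e.g. 'BtreeCursor type_1'
print(plan.get('indexOnly'))  # True for a covered query on MongoDB 2.6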

Is there any way to run an efficient grouping query against the 'type' index, one that uses only the index and thus achieves the performance of approach #1, but through the aggregation framework?

I am using MongoDB 2.6.1

Update: there is an open ticket for this issue in the MongoDB Jira: https://jira.mongodb.org/browse/SERVER-11447

asked Jul 27 '14 by Baruch Oxman

1 Answer

In the aggregation pipeline, the $group stage doesn't use indexes. It is meant to come after a $match stage, which can indeed use indexes to speed it up.

http://docs.mongodb.org/manual/core/aggregation-pipeline/#aggregation-pipeline-operators-and-performance
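As an illustration (a sketch, not from the original answer; 'type_a' and 'type_b' are placeholder values, and it assumes a pymongo version where aggregate() returns a cursor of result documents): putting a $match first lets the index on 'type' cut down the documents that $group has to scan.

# Sketch: $match can use the index on 'type'; $group then only sees the
# filtered documents rather than the whole collection.
pipeline = [
    {'$match': {'type': {'$in': ['type_a', 'type_b']}}},
    {'$group': {'_id': '$type', 'agg_val': {'$sum': 1}}},
]
for doc in my_db.my_colc.aggregate(pipeline):
    print(doc['_id'], doc['agg_val'])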

cheers,

answered Sep 19 '22 by dantespot