Analyzing MongoDb aggregation performance

Q: Is MongoDB aggregation fast?

On large collections of millions of documents, MongoDB's aggregation was shown to be much worse than Elasticsearch. Performance worsens with collection size when MongoDB starts using the disk due to limited system RAM. The $lookup stage used without indexes can be very slow.

Tags:

performance

mongodb

aggregation-framework

I am trying to get millisecond (< second) response time for aggregation from a collection. However, it currently takes 3-5 seconds even for a small data size (~200MB). My expected data in production for this collection is around 100GB per shard. I checked the following already
- The response time is same when I tried the query individually on each shard. - Checked the profiling output and all I could see was high timeLockedMicros and numYield. - Couldn't find anything unusual in MMS charts too. I feel like overloooking something silly here. Any help to analyze this further is highly appreciated. My cluster and collection details are provided below

Cluster - 6 nodes of 34 GB, 4 core machines (AWS m2.2xlarge) Data Size 1,285 MB (213 MB per shard) No of records = 5.5 million (~1 M per shard)

Records sample

Click to copy

{
    "_id" : {
        "ItemID" : 105182,
        "DeviceType" : 16,
        "ItemVersionID" : 117971,
        "Timestamp" : ISODate("2014-11-14T00:00:00Z"),
        "RecordType" : 1
    },
    "Dim1ID" : 102260,
    "Dim2ID" : 313,
    "Dim3ID" : 1,
    "actionType" : {
        "1" : 66,
        "47" : 66,
        "42" : 72,
        "46" : 130
    }
}

Query

Click to copy

db.AggregateCollection.aggregate({ "$group" : { "_id" : { } , "type1" : { "$sum" : "$actionType.1"} , "type2" : { "$sum" : "$actionType.2"}}})

Profile stats (from one shard)

Click to copy

"keyUpdates" : 0,
    "numYield" : 79,
    "lockStats" : {
        "timeLockedMicros" : {
            "r" : NumberLong(2981456),
            "w" : NumberLong(0)
        },
        "timeAcquiringMicros" : {
            "r" : NumberLong(3756),
            "w" : NumberLong(4)
        }
    },
    "responseLength" : 157,
    "millis" : 3268,
    "execStats" : {

    },

UPDATE Thank you Will for the prompt response. Apreciate it. I like your new data model and the indexing. However, I am afraid that is not suitable for my current data because, - 99% of records will have soem value for actionType.1 and - 99% of our queries will select actionType.1 So index on actiionType.K wouldn't help much I guess.

As you suggested in #2 & #3, We are already doing pre-aggregation using Spark cluster which upadtes MongoDb.

A little bit more about my query The query I shared earlier is just a sample one and is used just to benchmark the performance. My actual query will have $match on Timestamp and $group on one or more fileds. A typical production query will be for 30 days data. Currently my collection has only 15 days data. My objective is to get sub-second response time for 30 days data

BTW, I did some more analysis today I toook a dump of the shard and restored in a local mongo installed on my MacBook. The same query took only 2 seconds (took 4s in AWS isntance) that didn't make sense because AWS instance is at lest 4 times powerful than the MacBook (both CPU & Memory) MacBook Air - http://www.cpubenchmark.net/cpu.php?cpu=Intel+Core+i5-4250U+%40+1.30GHz AWS m2.2xlarge Instance - http://www.cpubenchmark.net/cpu.php?cpu=Intel+Xeon+E5-2665+%40+2.40GHz

I suspected fragmentation becuase the data in AWS mongo instance was populated over the last 15 days through an application. So I re-imported the dump on AWS mongo as a separate collection. Query on this new collection took 2s which is comparable to MAcBook speed. So fragmentation is one reason for sure. I am planning do more research on fragmentation later. Though defragmenting improved perfromace, the fact that it took same time as my MacBook didn't make sense as the AWS isntance is 4 times powerful. Then we looked at the cpu utilization and found that mongod instance uses only one CPU (out of 4) for query execution. We are now planning to install 4 shards on each machine to get around this. Please let me know if you see a better approach.

One more thing, I know that my query has to scan the entire collection but 2 seconds to scan ~200MB data seems very high to me. Is it expected or I am missing something?

480

asked Dec 02 '14 05:12

Bijith Kumar

1 Answers

Things I'd try:

1) You've organized your data in a way that makes grouping very difficult. You may get better results if you organize your document like this:

Click to copy

{
    ...
    "actionType" : [{k:1, v:66}, {k:47, v:66}, {k:42, v:72}, {k:46, v:130}]
}

This would allow you to create an index on 'actionType.k'. Then you could do a match on that index to reduce your overall dataset to the exact actionTypes you want for this aggregation, where your query is:

Click to copy

db.action.aggregate([{$unwind: '$actionType'}, 
        {$group:{_id:'$actionType.k', t:{$sum:'$actionType.v'} } }]);
//output
{ "_id" : 46, "t" : 130 }
{ "_id" : 42, "t" : 72 }
{ "_id" : 47, "t" : 66 }
{ "_id" : 1, "t" : 66 }

Then ensureIndex on 'actionType.k'. If you're not planning to filter for ALL the different key values, the index will help quite a bit depending on the density of the keys in your documents. If you're planning to sum every key, the index won't help here.

2) Map-reduce and/or add these on a cron-job/setTimeout schedule. Again, depending on your update cycle and how accurate you need the data at any one time, set up something like:

Every hour process all 'dirty' results
Add current values to the running total
Mark as 'clean'

That works if you're doing insert-only to this db.

3) If the key values are changing regularly (updates instead of inserts), you might have better luck doing a changelog insert that happens at the same time as your updates to the main collection.

Click to copy

db.changes.insert({key:44, change:2});
db.changes.insert({key:34, change:-2});

Then routinely empty out the 'changes' collection summing the values to a different collection.

119

answered Nov 15 '22 03:11

Will Shaver

Related questions
                            
                                mongoengine query projection
                            
                                Configure a Mongo replica set to only replicate certain collections
                            
                                How much additional space would repair database require
                            
                                How do you use BsonClassMap to map a POCO domain object property as a manual or DBRef reference?
                            
                                Slow range query on a multikey index
                            
                                Mongodb recursive search
                            
                                Using find() with geospatial coordinates in Mongoose (NodeJS+MongoDB)
                            
                                Meteor does not connect to MongoDB
                            
                                Dot notation vs. $elemMatch
                            
                                Node.js and mongoDB serving concurrent requests
                            
                                Nodejs User login System
                            
                                How to query with '$in' over '_id' in reactive mongo and play
                            
                                Mongo Aggregation - Using variables created in $project
                            
                                MongoDB Multikey Compound Index - Need Help Understanding Bounds
                            
                                ClassNotFoundException: Attempted to load class "Mongo" from... (with persist) symfony2
                            
                                Encoding user input to be stored in MongoDB
                            
                                How to perform text search together with geoNear?
                            
                                Can I remove an embedded document in Mongoid without persisting?
                            
                                Data synchronization between MySQL and MongoDB
                            
                                Automatically update property of mongoose documents (node.js) object after certain time

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With

Analyzing MongoDb aggregation performance

Tags:

performance

mongodb

aggregation-framework

Bijith Kumar

People also ask

1 Answers

Will Shaver

Recent Activity

Donate For Us