I am trying to get millisecond (sub-second) response times for an aggregation on a collection. However, it currently takes 3-5 seconds even for a small data size (~200MB). My expected data size in production for this collection is around 100GB per shard.
I have already checked the following:
- The response time is the same when I run the query individually on each shard.
- Checked the profiling output and all I could see was high timeLockedMicros and numYield.
- Couldn't find anything unusual in the MMS charts either.
I feel like I'm overlooking something silly here. Any help analyzing this further is highly appreciated. My cluster and collection details are provided below.
- Cluster: 6 nodes of 34 GB, 4-core machines (AWS m2.2xlarge)
- Data size: 1,285 MB (213 MB per shard)
- No. of records: 5.5 million (~1M per shard)
Records sample
{
    "_id" : {
        "ItemID" : 105182,
        "DeviceType" : 16,
        "ItemVersionID" : 117971,
        "Timestamp" : ISODate("2014-11-14T00:00:00Z"),
        "RecordType" : 1
    },
    "Dim1ID" : 102260,
    "Dim2ID" : 313,
    "Dim3ID" : 1,
    "actionType" : {
        "1" : 66,
        "47" : 66,
        "42" : 72,
        "46" : 130
    }
}
Query
db.AggregateCollection.aggregate({ "$group" : { "_id" : { } , "type1" : { "$sum" : "$actionType.1"} , "type2" : { "$sum" : "$actionType.2"}}})
Profile stats (from one shard)
"keyUpdates" : 0,
"numYield" : 79,
"lockStats" : {
"timeLockedMicros" : {
"r" : NumberLong(2981456),
"w" : NumberLong(0)
},
"timeAcquiringMicros" : {
"r" : NumberLong(3756),
"w" : NumberLong(4)
}
},
"responseLength" : 157,
"millis" : 3268,
"execStats" : {
},
UPDATE Thank you, Will, for the prompt response. Appreciate it. I like your new data model and the indexing. However, I am afraid it is not suitable for my current data, because:
- 99% of records will have some value for actionType.1, and
- 99% of our queries will select actionType.1.
So an index on actionType.K wouldn't help much, I guess.
As you suggested in #2 & #3, we are already doing pre-aggregation using a Spark cluster, which updates MongoDB.
A little bit more about my query: the query I shared earlier is just a sample one, used only to benchmark performance. My actual query will have a $match on Timestamp and a $group on one or more fields. A typical production query will cover 30 days of data. Currently my collection has only 15 days of data. My objective is to get sub-second response time for 30 days of data.
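For reference, a typical production query would look roughly like the sketch below (the 30-day window and the grouping on Dim1ID are only illustrative):
db.AggregateCollection.aggregate([
    // restrict to a 30-day window; Timestamp lives inside the compound _id
    { "$match" : { "_id.Timestamp" : { "$gte" : ISODate("2014-11-01T00:00:00Z"),
                                       "$lt"  : ISODate("2014-12-01T00:00:00Z") } } },
    // group on one or more dimension fields and sum the relevant action counters
    { "$group" : { "_id" : "$Dim1ID", "type1" : { "$sum" : "$actionType.1" } } }
])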
BTW, I did some more analysis today. I took a dump of the shard and restored it into a local mongod installed on my MacBook. The same query took only 2 seconds there (it took 4s on the AWS instance), which didn't make sense, because the AWS instance is at least 4 times more powerful than the MacBook (both CPU and memory).
MacBook Air - http://www.cpubenchmark.net/cpu.php?cpu=Intel+Core+i5-4250U+%40+1.30GHz
AWS m2.2xlarge Instance - http://www.cpubenchmark.net/cpu.php?cpu=Intel+Xeon+E5-2665+%40+2.40GHz
I suspected fragmentation, because the data on the AWS mongo instance was populated over the last 15 days through an application. So I re-imported the dump into the AWS mongod as a separate collection. A query on this new collection took 2s, which is comparable to the MacBook speed, so fragmentation is one reason for sure. I am planning to do more research on fragmentation later. Although defragmenting improved performance, the fact that the query took the same time as on my MacBook still didn't make sense, given that the AWS instance is 4 times more powerful. We then looked at CPU utilization and found that the mongod instance uses only one CPU (out of 4) for query execution. We are now planning to run 4 shards on each machine to get around this. Please let me know if you see a better approach.
One more thing: I know that my query has to scan the entire collection, but 2 seconds to scan ~200MB of data seems very high to me. Is that expected, or am I missing something?
On large collections of millions of documents, MongoDB's aggregation has been shown to perform much worse than Elasticsearch. Performance degrades with collection size once MongoDB starts hitting the disk because the working set no longer fits in system RAM. A $lookup stage used without supporting indexes can also be very slow.
Things I'd try:
1) You've organized your data in a way that makes grouping very difficult. You may get better results if you organize your documents like this:
{
...
"actionType" : [{k:1, v:66}, {k:47, v:66}, {k:42, v:72}, {k:46, v:130}]
}
This would allow you to create an index on 'actionType.k'. Then you could do a match on that index to reduce your overall dataset to the exact actionTypes you want for this aggregation, where your query is:
db.action.aggregate([{$unwind: '$actionType'},
{$group:{_id:'$actionType.k', t:{$sum:'$actionType.v'} } }]);
//output
{ "_id" : 46, "t" : 130 }
{ "_id" : 42, "t" : 72 }
{ "_id" : 47, "t" : 66 }
{ "_id" : 1, "t" : 66 }
Then ensureIndex on 'actionType.k'. If you're not planning to filter for ALL the different key values, the index will help quite a bit depending on the density of the keys in your documents. If you're planning to sum every key, the index won't help here.
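For example (a minimal sketch, assuming you only need a subset of the action types, here 1 and 2):
db.action.ensureIndex({ "actionType.k" : 1 });
db.action.aggregate([
    // the multikey index on actionType.k lets this $match avoid a full collection scan
    { $match : { "actionType.k" : { $in : [1, 2] } } },
    { $unwind : "$actionType" },
    // re-filter after $unwind: the first $match selects documents, not array elements
    { $match : { "actionType.k" : { $in : [1, 2] } } },
    { $group : { _id : "$actionType.k", t : { $sum : "$actionType.v" } } }
]);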
2) Map-reduce, and/or run these aggregations on a cron-job/setTimeout schedule. Again, depending on your update cycle and how accurate you need the data to be at any one time, set up something like the following.
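For instance, a periodic shell script (run from cron via the mongo shell) could fold recent data into a summary collection; the collection names and the 5-minute window below are just placeholders:
// fold data newer than the last run into a running-totals collection
var since = new Date(Date.now() - 5 * 60 * 1000);   // e.g. the last 5 minutes
db.action.aggregate([
    { $match  : { "_id.Timestamp" : { $gte : since } } },
    { $unwind : "$actionType" },
    { $group  : { _id : "$actionType.k", t : { $sum : "$actionType.v" } } }
]).forEach(function(doc) {
    // add the partial sums onto the stored totals
    db.actionTotals.update({ _id : doc._id },
                           { $inc : { total : doc.t } },
                           { upsert : true });
});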
That works if you're doing inserts only into this collection.
3) If the key values are changing regularly (updates instead of inserts), you might have better luck doing a changelog insert that happens at the same time as your updates to the main collection.
db.changes.insert({key:44, change:2});
db.changes.insert({key:34, change:-2});
Then routinely empty out the 'changes' collection, summing the values into a different collection.
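A sketch of that rollup (the 'totals' collection and field names are assumptions); the ObjectId cutoff avoids deleting changes inserted while the rollup runs, since ObjectIds are roughly time-ordered:
// snapshot a cutoff so we only delete what we actually summed
var cutoff = new ObjectId();
db.changes.aggregate([
    { $match : { _id : { $lt : cutoff } } },
    { $group : { _id : "$key", delta : { $sum : "$change" } } }
]).forEach(function(doc) {
    // apply the net change for each key to the summary collection
    db.totals.update({ _id : doc._id }, { $inc : { total : doc.delta } }, { upsert : true });
});
// remove only the processed changelog entries
db.changes.remove({ _id : { $lt : cutoff } });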