 

MongoDB Aggregation Performance

We have a problem with aggregation queries running for a long time (a couple of minutes).

Collection:

We have a collection of 250 million documents with about 20 fields per document. The total size of the collection is 110 GB.

We have indexes on the "our_id" and "dtKey" fields.

Hardware:

Memory:

24 GB RAM (6 × 4 GB DIMMs, 1333 MHz)

Disk:

An 11 TB LVM volume built from 4 × 3 TB disks:

  • 600 MB/s maximum instantaneous data transfer rate.

  • 7200 RPM spindles, average latency = 4.16 ms.

  • RAID 0

CPU:

2 × E5-2420 @ 1.90 GHz, 12 cores / 24 threads in total. Dell R420.

Problem: We are trying to run the following aggregation query:

db.our_collection.aggregate(
    [
        {
            "$match":
            {
                "$and":
                    [
                        {"dtKey": {"$gte": 20140916}},
                        {"dtKey": {"$lt": 20141217}},
                        {"our_id": "111111111"}
                    ]
            }
        },
        {
            "$project":
            {
                "field1": 1,
                "date": 1
            }
        },
        {
            "$group":
            {
                "_id":
                {
                    "day": {"$dayOfYear": "$date"},
                    "year": {"$year": "$date"}
                },
                "field1": {"$sum": "$field1"}
            }
        }
    ]
);

This query takes a couple of minutes to run. While it is running, we see the following:

  • The current Mongo operation yields more than 300K times
  • iostat shows ~100% disk utilization

After this query is done, the data seems to be in cache and the same query completes again in a split second.

After running it for 3-4 different users, however, the first user's data seems to have been evicted from the cache and the query takes a long time again.

We have tested a count on the matching part and seen that some users have 50K documents while others have 500K documents.

We tried to get only the matching part:

db.pub_stats.aggregate(
    [
        {
            "$match":
            {
                "$and":
                    [
                        {"dtKey": {"$gte": 20140916}},
                        {"dtKey": {"$lt": 20141217}},
                        {"our_id": "112162107"}
                    ]
            }
        }
    ]
);

This query seems to take approximately 300-500 MB of memory,

but after running the full query, memory usage seems to grow to about 3.5 GB.

Questions:

  1. Why does the aggregation pipeline take so much memory?
  2. How can we get it to run in a reasonable time for an HTTP request?
Yarin Podoler, asked Dec 18 '14


1 Answer

  1. Why does the aggregation pipeline take so much memory?

Just performing a $match doesn't have to read the actual documents; it can be satisfied from the indexes. As soon as the projection accesses field1, the actual documents have to be read, and they will probably be cached as well.

Also, grouping can be expensive. Normally, it should report an error if your grouping stage requires more than 100 MB of memory - what version are you using? It has to scan the entire result set before yielding any output, and MongoDB has to store at least a pointer or an index to each element in the groups. I guess the key reason for the memory increase is the former (the documents being read and cached).
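
As a side note on that 100 MB limit: since MongoDB 2.6 you can let the pipeline spill to temporary files on disk instead of failing. A minimal sketch of the same pipeline with the allowDiskUse option (slower, but it avoids the hard in-memory cap):

// Same stages as in the question, with the equality match listed first.
db.our_collection.aggregate(
    [
        {"$match": {"our_id": "111111111", "dtKey": {"$gte": 20140916, "$lt": 20141217}}},
        {"$project": {"field1": 1, "date": 1}},
        {"$group": {
            "_id": {"day": {"$dayOfYear": "$date"}, "year": {"$year": "$date"}},
            "field1": {"$sum": "$field1"}
        }}
    ],
    {"allowDiskUse": true}  // let $group spill to disk instead of erroring at ~100 MB
);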

  2. How can we get it to run in a reasonable time for an HTTP request?

Your dtKey appears to encode time, and the grouping is also done based on time. I'd try to exploit that fact, for instance by precomputing aggregates for each day and our_id combination. That makes a lot of sense if there are no further criteria and the data doesn't change much anymore; a sketch of this idea follows below.
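
For illustration, here is one way such a pre-aggregated collection could look (the collection name daily_totals, the summary field name and the values are hypothetical, not part of the original setup):

// Hypothetical pre-aggregation: one summary document per (our_id, day).
// Update it whenever a source document arrives, or in a nightly batch.
db.daily_totals.update(
    {"our_id": "111111111", "dtKey": 20140916},   // one bucket per user per day
    {"$inc": {"field1_sum": 42}},                 // add this document's field1 value
    {"upsert": true}
);

// An HTTP request then only has to read ~90 small summary documents:
db.daily_totals.find(
    {"our_id": "111111111", "dtKey": {"$gte": 20140916, "$lt": 20141217}}
);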

Otherwise, I'd try to move the {"our_id":"111111111"} criterion to the first position, because equality should always precede range queries. I guess the query optimizer of the aggregation framework is smart enough to do this anyway, but it's worth a try. Also, you might want to turn your two indexes into a single compound index { our_id, dtKey }. Index intersection is supported now, but I'm not sure how efficient it really is. Use the built-in profiler and .explain() to analyze your query - see the example below.
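
For example, a sketch of both suggestions (on 2.6 the aggregation explain output is requested via an option rather than on a cursor; check the syntax for your server version):

// Compound index with the equality field first, then the range field.
db.our_collection.ensureIndex({"our_id": 1, "dtKey": 1});

// Equality before the range in the $match, and ask the server for the plan.
db.our_collection.aggregate(
    [
        {"$match": {"our_id": "111111111", "dtKey": {"$gte": 20140916, "$lt": 20141217}}}
    ],
    {"explain": true}
);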

Lastly, MongoDB is designed for write-heavy use, and scanning data sets of hundreds of GB from disk in a matter of milliseconds isn't computationally feasible at all. If your dataset is larger than your RAM, you'll face massive IO delays on the scale of tens of milliseconds and upwards, tens or hundreds of thousands of times over, because of all the required disk operations. Remember that with random access you'll never get even close to the theoretical sequential disk transfer rates. If you can't precompute, I guess you'll need a lot more RAM. Maybe SSDs would help, but that is all just guesswork.

mnemosyn, answered Oct 04 '22