If I do a count query, I get the results in under 2 seconds:
db.coll.find({"A":1,"createDate":{"$gt":new Date("2011-05-21"),"$lt":new Date("2013-08-21")}}).count()
This uses the following index:
db.coll.ensureIndex({"A":1,"createDate":1})
Similarly, there are four fields A, B, C, D (values are always 0 or 1) for which I run four count queries and get all the results in under 10 seconds.
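For reference, the four per-field counts look roughly like this (a sketch; the field names and date range are assumed to mirror the query above, each field backed by its own compound index):

```javascript
// One count query per field; ensureIndex is the 2.4-era API
// (later versions call it createIndex).
["A", "B", "C", "D"].forEach(function (field) {
    var query = {};
    query[field] = 1;
    query.createDate = {"$gt": new Date("2011-05-21"), "$lt": new Date("2013-08-21")};
    print(field + ": " + db.coll.find(query).count());
});
```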
I looked at the aggregation framework documentation and created an aggregated query to do all 4 sums together.
db.coll.aggregate(
    { $match : {"createDate":{$gt:new Date("2013-05-21"),$lt:new Date("2013-08-21")}} },
    { $group : {
        _id: null,
        totalA: {$sum: "$A"},
        totalB: {$sum: "$B"},
        totalC: {$sum: "$C"},
        totalD: {$sum: "$D"}
    }}
)
I also created an index:
db.coll.ensureIndex({"createDate":1,"A":1,"B":1,"C":1,"D":1})
According to the documentation, this index covers my aggregation. Yet the aggregate returns in ~18 seconds.
I'm confused here. Is there something basic I missed, or is there a fundamental reason why aggregation is slower than count()? I am also concerned about the overhead of firing multiple count queries from the application code just to fetch counts.
On large collections of millions of documents, MongoDB's aggregation has been shown to perform much worse than Elasticsearch. Performance degrades with collection size once MongoDB starts hitting the disk because system RAM is limited. A $lookup stage used without supporting indexes can be very slow. A plain find({...}).count() is faster than an equivalent aggregation.
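To illustrate the $lookup point, here is a sketch with hypothetical orders/customers collections (not from the question; note that $lookup was only added in MongoDB 3.2, after the 2.4.8 version discussed here):

```javascript
// Without an index on the foreign field, each $lookup below scans
// `customers` once per input document; index it first.
db.customers.createIndex({ customerId: 1 })
db.orders.aggregate([
    { $lookup: {
        from: "customers",
        localField: "customerId",
        foreignField: "customerId",
        as: "customer"
    }}
])
```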
The aggregation pipeline provides efficient data aggregation using native operations within MongoDB and is the preferred method for aggregating data. It can operate on a sharded collection, and it can use indexes to improve performance during some of its stages.
MongoDB Aggregation goes further though and can also perform relational-like joins, reshape documents, create new and update existing collections, and so on. While there are other methods of obtaining aggregate data in MongoDB, the aggregation framework is the recommended approach for most work.
Firstly, though not documented for 2.4.8, you can run an explain using the db.runCommand invocation:
db.runCommand({
    aggregate: "coll",
    pipeline: [
        { $match:
            {"createDate": {$gt: new Date("2013-05-21"), $lt: new Date("2013-08-21")}}
        },
        { $group: {
            _id: null,
            totalA: {$sum: "$A"},
            totalB: {$sum: "$B"},
            totalC: {$sum: "$C"},
            totalD: {$sum: "$D"}
        }}
    ],
    explain: true
})
Which will give you some insight into what is happening.
Also, and primarily, you are comparing apples to oranges.
When you issue a count() on a query, it uses the cursor result properties to get the number of documents that matched.
Under aggregation, you are selecting an extended match and then compacting all of those results into per-field sums. If your initial $match has lots of results, then all of them need to be crunched together by $sum.
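Conceptually, the $group stage has to visit every document that survived $match and fold it into the accumulators, roughly like this plain-JavaScript sketch (the sample documents are made up):

```javascript
// Simulate { $group: { _id: null, totalA: { $sum: "$A" }, ... } }
// over the documents that survived $match.
const matched = [
    { A: 1, B: 0, C: 1, D: 0 },
    { A: 0, B: 1, C: 1, D: 1 },
    { A: 1, B: 1, C: 0, D: 0 },
];

// Every matched document is touched once; the cost grows with the
// size of the $match result, unlike count(), which only needs the
// number of matches.
const totals = matched.reduce(
    (acc, doc) => ({
        totalA: acc.totalA + doc.A,
        totalB: acc.totalB + doc.B,
        totalC: acc.totalC + doc.C,
        totalD: acc.totalD + doc.D,
    }),
    { totalA: 0, totalB: 0, totalC: 0, totalD: 0 }
);

console.log(totals);
```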
Have a look at explain, and try to conceptually understand the differences. Aggregation is great for what you generally want it to do. But maybe this isn't the best use case.