Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Fastest way to get histogram of array sizes using MongoDB aggregation framework

I'm trying to get a list of the number of records that have arrays of varying size. I want to get the distribution of array sizes for all records so I can build a histogram like this:

          | *
          | *
documents | *         *
          | *  *      *
          |_*__*__*___*__*___
            2  5  6  23  47

               Array Size

So the raw documents look something like this:

{hubs : [{stuff:0, id:6}, {stuff:1"}, .... ]}
{hubs : [{stuff:0, id:6}]}`

So far using the aggregation framework and some of the help here I've come up with

db.sitedata.aggregate([{ $unwind:'$hubs'}, 
                       { $group : {_id:'$_id', count:{$sum:1}}}, 
                       { $group : {_id:'$count', count:{$sum:1}}},
                       { $sort  : {_id: 1}}])

This seems to give me the results I want, but it's not very fast. I'm wondering if there is something I can do like this that may not need two group calls. The syntax is wrong here, but what I'm trying to do is put the count value in the first _id field:

db.sitedata.aggregate([{ $unwind:'$hubs'}, 
                       { $group : {_id:{$count:$hubs}, count:1}},
                       { $sort  : { _id: 1 }}])
like image 768
Scott Avatar asked Apr 18 '13 17:04

Scott


People also ask

Is MongoDB aggregate fast?

Comparing MySQL 8.0 and MongoDB 4.0. 3, it was seen that MongoDB is typically faster on more complex queries. It's faster from disk when there are no indexes, whereas MySQL is faster from RAM. BI Connector is slower for simple queries and not as fast as hand-crafted aggregation.

Is aggregate faster than find?

Without seeing your data and your query it is difficult to answer why aggregate+sort is faster than find+sort. A well indexed(Indexing that suits your query) data will always yield faster results on your find query.

Which aggregation method is preferred for use by MongoDB?

The pipeline provides efficient data aggregation using native operations within MongoDB, and is the preferred method for data aggregation in MongoDB. The aggregation pipeline can operate on a sharded collection. The aggregation pipeline can use indexes to improve its performance during some of its stages.

Is MongoDB good for aggregation?

This is similar to the basic aggregation available in SQL with the GROUP BY clause and COUNT, SUM and AVG functions. MongoDB Aggregation goes further though and can also perform relational-like joins, reshape documents, create new and update existing collections, and so on.


1 Answers

Now that 2.6 is out, aggregation framework supports a new array operator $size which will allow you to $project the array size without having to unwind and re-group.

db.sitedata.aggregate([{ $project:{ 'count': { '$size':'$hubs'} } }, 
                       { $group : {_id:'$count', count:{$sum:1} } },
                       { $sort  : { _id: 1 } } ] )
like image 150
Asya Kamsky Avatar answered Nov 15 '22 23:11

Asya Kamsky