Fastest way to get histogram of array sizes using MongoDB aggregation framework

Tags: mongodb, aggregation-framework

I'm trying to count how many records have arrays of each size. I want the distribution of array sizes across all records so I can build a histogram like this:

          | *
          | *
documents | *         *
          | *  *      *
          |_*__*__*___*__*___
            2  5  6  23  47

               Array Size

So the raw documents look something like this:

{hubs : [{stuff:0, id:6}, {stuff:1}, .... ]}
{hubs : [{stuff:0, id:6}]}

So far, using the aggregation framework and some of the help here, I've come up with:

db.sitedata.aggregate([{ $unwind:'$hubs'}, 
                       { $group : {_id:'$_id', count:{$sum:1}}}, 
                       { $group : {_id:'$count', count:{$sum:1}}},
                       { $sort  : {_id: 1}}])

This seems to give me the results I want, but it's not very fast. Is there a way to do this that doesn't need two group calls? The syntax below is wrong, but what I'm trying to do is put the count value in the first _id field:

db.sitedata.aggregate([{ $unwind:'$hubs'}, 
                       { $group : {_id:{$count:$hubs}, count:1}},
                       { $sort  : { _id: 1 }}])
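
To make the intent concrete, the working $unwind plus double $group pipeline can be mimicked on in-memory data. This is only a sketch in plain JavaScript, not something that runs against the server: `histogramViaUnwind` and `sampleDocs` are hypothetical names and data made up for illustration.

```javascript
// In-memory sketch of: $unwind -> $group by _id -> $group by count -> $sort.
// The function name and sample data are hypothetical, for illustration only.
function histogramViaUnwind(docs) {
  // $unwind: emit one row per array element. A document whose array is
  // empty or missing emits no rows at all.
  const rows = [];
  for (const doc of docs) {
    const hubs = Array.isArray(doc.hubs) ? doc.hubs : [];
    for (const hub of hubs) {
      rows.push({ _id: doc._id, hubs: hub });
    }
  }

  // First $group: count the unwound rows per original document _id.
  const perDoc = new Map();
  for (const row of rows) {
    perDoc.set(row._id, (perDoc.get(row._id) || 0) + 1);
  }

  // Second $group: count how many documents share each array size.
  const perSize = new Map();
  for (const size of perDoc.values()) {
    perSize.set(size, (perSize.get(size) || 0) + 1);
  }

  // $sort: order the buckets by array size, ascending.
  return [...perSize.entries()]
    .map(([size, count]) => ({ _id: size, count: count }))
    .sort((a, b) => a._id - b._id);
}

const sampleDocs = [
  { _id: 1, hubs: [{ stuff: 0, id: 6 }, { stuff: 1 }] },
  { _id: 2, hubs: [{ stuff: 0, id: 6 }] },
  { _id: 3, hubs: [{ stuff: 2 }, { stuff: 3 }] },
  { _id: 4, hubs: [] },
];

console.log(histogramViaUnwind(sampleDocs));
```

One thing the sketch makes visible: the document with an empty hubs array produces no rows after unwinding, so a size of 0 never appears in the histogram with this approach.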

asked Apr 18 '13 17:04

Scott

1 Answer

Now that MongoDB 2.6 is out, the aggregation framework supports a new array operator, $size, which lets you $project the array size without having to unwind and re-group:

db.sitedata.aggregate([{ $project:{ 'count': { '$size':'$hubs'} } }, 
                       { $group : {_id:'$count', count:{$sum:1} } },
                       { $sort  : { _id: 1 } } ] )
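
One caveat worth sketching: $size raises an error when its argument is not an array, which can happen if some documents lack the hubs field. A possible guard, assuming such documents should be counted as size 0, is to wrap the field in $ifNull. The helper name `buildSizeHistogramPipeline` below is hypothetical; it only builds the pipeline so it can be inspected without a server.

```javascript
// Builds a $size-based histogram pipeline for a given array field.
// `buildSizeHistogramPipeline` is a hypothetical helper, not a MongoDB API.
function buildSizeHistogramPipeline(arrayField) {
  const path = '$' + arrayField;
  return [
    // $ifNull substitutes an empty array when the field is missing or null,
    // so $size always receives an array in those cases.
    // Assumption: such documents should land in the size-0 bucket.
    { $project: { count: { $size: { $ifNull: [path, []] } } } },
    // Count how many documents share each array size.
    { $group: { _id: '$count', count: { $sum: 1 } } },
    // Order the buckets by array size, ascending.
    { $sort: { _id: 1 } },
  ];
}

const pipeline = buildSizeHistogramPipeline('hubs');
console.log(JSON.stringify(pipeline, null, 2));
// In the mongo shell this would be passed as: db.sitedata.aggregate(pipeline)
```

This guard does not cover a field that exists but holds a non-array value such as a string; that case would still need separate handling.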

answered Nov 15 '22 23:11

Asya Kamsky
