Is it possible to find the largest document size in MongoDB?
db.collection.stats()
shows average size, which is not really representative because in my case sizes can differ considerably.
Document Size Limit
The maximum BSON document size is 16 megabytes. This limit helps ensure that a single document cannot use an excessive amount of RAM or, during transmission, an excessive amount of bandwidth. To store documents larger than the maximum size, MongoDB provides the GridFS API.
GridFS is the MongoDB specification for storing and retrieving large files such as images, audio files, video files, etc. It is kind of a file system for storing files, but its data is stored within MongoDB collections. GridFS can store files even larger than the 16 MB document size limit.
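For illustration, here is a minimal sketch of storing a large file through GridFS with the Node.js driver's GridFSBucket (the connection string, database, and file names below are placeholders):
const { MongoClient, GridFSBucket } = require('mongodb');
const fs = require('fs');

async function uploadLargeFile() {
  const client = await MongoClient.connect('mongodb://localhost:27017'); // placeholder URI
  const bucket = new GridFSBucket(client.db('test'), { bucketName: 'files' });
  // Stream the file into GridFS; it is split into chunks stored in the files.chunks collection
  fs.createReadStream('./large-video.mp4')
    .pipe(bucket.openUploadStream('large-video.mp4'))
    .on('finish', () => client.close());
}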
You can use a small shell script to get this value.
Note: this performs a full collection scan, which will be slow on large collections.
let max = 0, id = null;
db.test.find().forEach(doc => {
  const size = Object.bsonsize(doc);
  if (size > max) {
    max = size;
    id = doc._id;
  }
});
print(id, max);
Note: this will attempt to store the whole result set in memory (because of .toArray()). Be careful on big data sets. Do not use in production! Abishek's answer has the advantage of working over a cursor instead of an in-memory array.
If you also want the _id, try this. Given a collection called "requests":
// Creates a sorted list, then takes the max
db.requests.find().toArray()
  .map(function(request) { return { size: Object.bsonsize(request), _id: request._id }; })
  .sort(function(a, b) { return a.size - b.size; })
  .pop();
// { "size" : 3333, "_id" : "someUniqueIdHere" }
Finding the largest documents in a MongoDB collection can be ~100x faster than the other answers using the aggregation framework and a tiny bit of knowledge about the documents in the collection. Also, you'll get the results in seconds, vs. minutes with the other approaches (forEach, or worse, getting all documents to the client).
You need to know which field(s) in your document might be the largest ones - which you almost always will know. There are only two practical1 MongoDB types that can have variable sizes: arrays and strings.
The aggregation framework can calculate the length of each. Note that you won't get the size in bytes for arrays, but the length in elements. However, what typically matters more is which documents are the outliers, not exactly how many bytes they take.
Here's how it's done for arrays. As an example, let's say we have a collection of users in a social network and we suspect the array friends.ids might be very large (in practice you should probably keep a separate field like friendsCount in sync with the array, but for the sake of example, we'll assume that's not available):
db.users.aggregate([
  { $match: {
    'friends.ids': { $exists: true }
  }},
  { $project: {
    sizeLargestField: { $size: '$friends.ids' }
  }},
  { $sort: {
    sizeLargestField: -1
  }},
])
The key is to use the $size aggregation pipeline operator. It only works on arrays though, so what about text fields? We can use the $strLenBytes operator. Let's say we suspect the bio field might also be very large:
db.users.aggregate([
  { $match: {
    bio: { $exists: true }
  }},
  { $project: {
    sizeLargestField: { $strLenBytes: '$bio' }
  }},
  { $sort: {
    sizeLargestField: -1
  }},
])
You can also combine $size and $strLenBytes using $sum to calculate the size of multiple fields. In the vast majority of cases, 20% of the fields will take up 80% of the size (if not 10/90 or even 1/99), and large fields must be either strings or arrays.
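A minimal sketch of that combination, assuming the same hypothetical users collection with a friends.ids array and a bio string:
db.users.aggregate([
  { $match: {
    'friends.ids': { $exists: true },
    bio: { $exists: true }
  }},
  { $project: {
    // combined "length" of both fields: array elements + string bytes
    sizeLargestFields: { $sum: [ { $size: '$friends.ids' }, { $strLenBytes: '$bio' } ] }
  }},
  { $sort: {
    sizeLargestFields: -1
  }},
])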
1 Technically, the rarely used binData type can also have variable size.
Starting in Mongo 4.4, the new aggregation operator $bsonSize returns the size in bytes of a given document when encoded as BSON.
Thus, in order to find the BSON size of the document whose size is the biggest:
// { "_id" : ObjectId("5e6abb2893c609b43d95a985"), "a" : 1, "b" : "hello" }
// { "_id" : ObjectId("5e6abb2893c609b43d95a986"), "c" : 1000, "a" : "world" }
// { "_id" : ObjectId("5e6abb2893c609b43d95a987"), "d" : 2 }
db.collection.aggregate([
  { $group: {
    _id: null,
    max: { $max: { $bsonSize: "$$ROOT" } }
  }}
])
// { "_id" : null, "max" : 46 }
This:
$groups all items together
takes the $max of the documents' $bsonSize
$$ROOT represents the current document for which we get the bsonsize
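If you also need to know which document is the biggest, not just its size, here is a sketch of a slight variation: project each document's $bsonSize, sort descending, and keep the top one:
db.collection.aggregate([
  { $project: {
    size: { $bsonSize: "$$ROOT" } // BSON size of each whole document
  }},
  { $sort: { size: -1 } },        // biggest first
  { $limit: 1 }                   // keep only the largest
])
// returns the largest document's _id together with its size in bytes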
Well, this is an old question, but I thought to share my two cents about it.
My approach: use Mongo's mapReduce function.
First - let's get the size for each document
db.myCollection.mapReduce(
  function() { emit(this._id, Object.bsonsize(this)) }, // map to an id / size pair for each document
  function(key, val) { return val }, // val = document size value (single value for each document)
  {
    query: {}, // query all documents
    out: { inline: 1 } // just return the result (don't create a new collection for it)
  }
)
This will return all document sizes, although it is worth mentioning that saving the output as a collection is a better approach (the inline result is an array of results inside the results field).
Second - let's get the max document size by tweaking this query:
db.myCollection.mapReduce(
  function() { emit(0, Object.bsonsize(this)) }, // map a fake key (0) and use the document size as the value
  function(key, vals) { return Math.max.apply(Math, vals) }, // use Math.max to get the max value from vals (each val = a document size)
  { query: {}, out: { inline: 1 } } // same as the first example
)
This will give you a single result whose value equals the max document size.
In short:
you may want to use the first example and save its output as a collection (change the out option to the name of the collection you want), then apply further aggregations on it (max size, min size, etc.)
-OR-
you may want to use a single query (the second option) to get a single stat (min, max, avg, etc.)
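A quick sketch of the first route (the docSizes collection name is just an example): persist the per-document sizes, then aggregate over them:
db.myCollection.mapReduce(
  function() { emit(this._id, Object.bsonsize(this)) },
  function(key, val) { return val },
  { query: {}, out: 'docSizes' } // write results to the docSizes collection instead of returning them inline
)
// each output document looks like { _id: <original _id>, value: <size in bytes> }
db.docSizes.aggregate([
  { $group: { _id: null, max: { $max: '$value' }, min: { $min: '$value' }, avg: { $avg: '$value' } } }
])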
If you're working with a huge collection, loading it all at once into memory will not work, since you'd need more RAM than the size of the entire collection.
Instead, you can process the entire collection in batches using the following package I created: https://www.npmjs.com/package/mongodb-largest-documents
All you have to do is provide the MongoDB connection string and collection name. The script will output the top X largest documents when it finishes traversing the entire collection in batches, streaming documents through a cursor rather than keeping the whole collection in RAM.
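The same idea can be sketched directly with the Node.js driver (this is not the package's code; the connection details and top-N size are placeholders): stream the collection through a cursor and keep only the N largest documents seen so far, so memory stays bounded:
const { MongoClient } = require('mongodb');
const { calculateObjectSize } = require('bson'); // bson package (a dependency of the driver); computes the BSON size of a JS object

async function findLargestDocuments(uri, dbName, collName, topN = 10) {
  const client = await MongoClient.connect(uri);
  try {
    const top = []; // at most topN entries of { _id, size }
    for await (const doc of client.db(dbName).collection(collName).find()) {
      top.push({ _id: doc._id, size: calculateObjectSize(doc) });
      top.sort((a, b) => b.size - a.size);
      if (top.length > topN) top.pop(); // drop the current smallest
    }
    return top;
  } finally {
    await client.close();
  }
}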