
pymongo method of getting statistics for collection byte usage?

The MongoDB Application FAQ mentions that short field names are a technique that can be used for small documents. This led me to thinking, "what's a small document anyway?"

I'm using pymongo. Is there any way I can write some Python to scan a collection and get a feel for the ratio of bytes used for field descriptors vs. bytes used for actual field data?

I'm also tangentially curious about what the basic byte overhead is per doc.

Travis Griggs asked Sep 16 '13 19:09



1 Answer

There is no built-in way to get the ratio of space used for keys in BSON documents versus space used for actual field values. However, the collStats and dbStats commands can give you useful information on collection and database size. Here's how to use them in pymongo:

from pymongo import MongoClient

# connect to a mongod on localhost:27017 (the default)
client = MongoClient()
db = client.test

# print collection statistics for the "events" collection
print(db.command("collStats", "events"))

# print database statistics
print(db.command("dbStats"))
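
The collStats result is a plain dict, so the fields that matter for the "what's a small document?" question can be pulled out directly. Here is a minimal sketch; the helper name summarize_collstats and the sample numbers are made up for illustration, while "count", "size" and "avgObjSize" are fields collStats reports. It assumes avgObjSize may be absent (for example on an empty collection) and falls back to computing it.

```python
def summarize_collstats(stats):
    """Pick the size-related fields out of a collStats result dict."""
    count = stats.get("count", 0)
    size = stats.get("size", 0)
    # Assumption: avgObjSize may be missing (e.g. empty collection),
    # so fall back to size / count, or 0 when there are no documents.
    avg = stats.get("avgObjSize", size / count if count else 0)
    return {"count": count, "size": size, "avgObjSize": avg}


# Hypothetical collStats output, trimmed to the fields used here.
fake_stats = {"ns": "test.events", "count": 1000, "size": 50000, "avgObjSize": 50}
summary = summarize_collstats(fake_stats)
print(summary)
```

With a live server you would pass in db.command("collStats", "events") instead of the fake dict.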

You could always hack something up to get a pretty good estimate though. If all of your documents in a collection have the same schema, then something like this isn't half bad:

  1. Count up the total number of characters in the field names of a document, and call this number a.
  2. Add one to a for each field in order to account for the terminating null byte that BSON stores after each field name. Let the result be b.
  3. Multiply b by the number of documents in the collection, and let the result be denoted by c.
  4. Divide c by the "size" field returned by collStats (assuming collStats is scaled to return size in bytes). Let this value be d.

Now d is the proportion of the total data size of the collection which is used to store field names.
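
The four steps above can be sketched in Python roughly as follows. This is only an estimate under the same assumption (every document shares the sample's schema), it counts top-level field names only, and the function names and numbers are hypothetical. Names are measured as UTF-8 bytes, which equals the character count for ASCII field names.

```python
def field_name_bytes(doc):
    """Bytes spent on the top-level field names of one document.

    Each name is counted as its UTF-8 length plus one terminating
    null byte (steps 1 and 2). Nested documents are not descended into.
    """
    return sum(len(name.encode("utf-8")) + 1 for name in doc)


def field_name_proportion(sample_doc, doc_count, data_size_bytes):
    """Estimate d, the share of the data size used by field names."""
    if data_size_bytes <= 0:
        raise ValueError("data_size_bytes must be positive")
    total_name_bytes = field_name_bytes(sample_doc) * doc_count  # step 3
    return total_name_bytes / data_size_bytes                    # step 4


# Hypothetical sample document and collection numbers.
sample = {"_id": 1, "user": "travis", "ts": 1379358540}
d = field_name_proportion(sample, doc_count=1000, data_size_bytes=50000)
print(f"{d:.1%} of the data size is field names")
```

Against a real collection, sample would come from something like db.events.find_one(), and doc_count and data_size_bytes from the "count" and "size" fields of the collStats result.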

david.storch answered Sep 30 '22 19:09