I'm curious about this quote from Kyle Banker's MongoDB In Action:
It’s important to consider the length of the key names you choose, since key names are stored in the documents themselves. This contrasts with an RDBMS, where column names are always kept separate from the rows they refer to. So when using BSON, if you can live with dob in place of date_of_birth as a key name, you’ll save 10 bytes per document. That may not sound like much, but once you have a billion such documents, you’ll have saved nearly 10 GB of storage space just by using a shorter key name. This doesn’t mean you should go to unreasonable lengths to ensure small key names; be sensible. But if you expect massive amounts of data, economizing on key names will save space.
I am interested in the reason why this is not optimized on the database server side. Would an in-memory lookup table with all key names in the collection incur too much of a performance penalty to be worth the potential space savings?
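For reference, the arithmetic behind the quoted claim is easy to verify (plain Python, nothing MongoDB-specific; the byte counts are just the key-name string lengths, ignoring BSON framing overhead):

```python
# Back-of-the-envelope check of the quoted claim:
# "date_of_birth" -> "dob" saves 10 bytes of key-name overhead per document.
saved_per_doc = len("date_of_birth") - len("dob")  # 13 - 3 = 10 bytes
documents = 1_000_000_000                          # one billion documents

total_saved = saved_per_doc * documents            # 10,000,000,000 bytes
print(saved_per_doc)                               # 10
print(total_saved / 1024 ** 3)                     # ~9.31 GiB, i.e. "nearly 10 GB"
```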
What you are referring to is often called "key compression"*. There are several reasons why it hasn't been implemented:
Like all features, there's a cost benefit analysis for implementing it, and (at least so far) other features have offered more "bang for the buck".
Full document compression, which was [being considered][1] for a future MongoDB version, is available as of version 3.0 (see below).
* An in-memory lookup table for key names is basically a special case of dictionary (LZW-style) compression; that's more or less what most compression algorithms do.
** Compression provides both a space advantage and a performance advantage: smaller documents mean that more documents can be read per IO, so in a system with fixed IO capacity, more documents can be read per second.
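To make the first footnote concrete, here is a minimal sketch of what such a per-collection key-name lookup table might look like (hypothetical, plain Python; MongoDB does not actually do this server-side):

```python
# Hypothetical sketch: intern repeated key names in a per-collection table,
# storing a small integer id in each document instead of the full string.
class KeyTable:
    def __init__(self):
        self.name_to_id = {}
        self.id_to_name = []

    def intern(self, name):
        """Return a small integer id for a key name, assigning one if new."""
        if name not in self.name_to_id:
            self.name_to_id[name] = len(self.id_to_name)
            self.id_to_name.append(name)
        return self.name_to_id[name]

    def compress(self, doc):
        """Replace each key name with its table id."""
        return {self.intern(k): v for k, v in doc.items()}

    def expand(self, doc):
        """Restore the original key names from the table."""
        return {self.id_to_name[k]: v for k, v in doc.items()}

table = KeyTable()
small = table.compress({"date_of_birth": "1970-01-01", "name": "Ada"})
print(small)                 # {0: '1970-01-01', 1: 'Ada'}
print(table.expand(small))   # round-trips to the original document
```

The space savings come from every document after the first paying only for a small id rather than the full key string, which is exactly the dictionary-coding idea the footnote describes.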
MongoDB versions 3.0 and up now have full document compression capability with the WiredTiger storage engine.
Two compression algorithms are available: snappy and zlib. The intent is for snappy to be the best choice for all-around performance, and for zlib to be the best choice for maximum storage density.
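Enabling it is a storage-engine setting; for example, in the standard `mongod.conf` YAML (this selects the block compressor used for newly created collections):

```yaml
# mongod.conf: choose the WiredTiger block compressor for new collections
storage:
  engine: wiredTiger
  wiredTiger:
    collectionConfig:
      blockCompressor: snappy   # or: zlib (higher density), none
```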
In my own experimentation (non-scientific, but on a commercial project), snappy compression (we did not evaluate zlib) offered significantly improved storage density at no noticeable net performance cost. In fact, performance was slightly better in some cases, roughly in line with my earlier comments/predictions.