
Why are key names stored in the document in MongoDB?

Tags:

mongodb

I'm curious about this quote from Kyle Banker's MongoDB In Action:

It’s important to consider the length of the key names you choose, since key names are stored in the documents themselves. This contrasts with an RDBMS, where column names are always kept separate from the rows they refer to. So when using BSON, if you can live with dob in place of date_of_birth as a key name, you’ll save 10 bytes per document. That may not sound like much, but once you have a billion such documents, you’ll have saved nearly 10 GB of storage space just by using a shorter key name. This doesn’t mean you should go to unreasonable lengths to ensure small key names; be sensible. But if you expect massive amounts of data, economizing on key names will save space.
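
As a quick check of the arithmetic in the quote, here is a minimal sketch. It assumes only what the BSON format specifies, namely that each element stores its key as a NUL-terminated UTF-8 string; the helper name `key_overhead_bytes` is made up for illustration.

```python
def key_overhead_bytes(key: str) -> int:
    """Bytes a BSON element spends on its key name (UTF-8 bytes plus NUL terminator)."""
    return len(key.encode("utf-8")) + 1


# Difference between the long and the short key name, per document.
saved_per_doc = key_overhead_bytes("date_of_birth") - key_overhead_bytes("dob")

# One billion documents, as in the quote.
documents = 1_000_000_000
saved_total = saved_per_doc * documents

print(f"saved per document: {saved_per_doc} bytes")
print(f"saved in total: {saved_total / 10**9:.2f} GB ({saved_total / 2**30:.2f} GiB)")
```

The total is 10 GB in decimal units, or a little over 9.3 GiB in binary units, which fits the book's "nearly 10 GB".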

I am interested in why this is not optimized on the database server side. Would an in-memory lookup table of all key names in a collection carry a performance penalty that outweighs the potential space savings?

c089 asked Jul 11 '12 09:07




1 Answer

What you are referring to is often called "key compression"*. There are several reasons why it hasn't been implemented:

  1. If you want it done, you can currently do it at the Application/ORM/ODM level quite easily.
  2. It's not necessarily a performance** advantage in all cases — think collections with lots of key names, and/or key names that vary wildly between documents.
  3. It might not provide a measurable performance** advantage at all until you have millions of documents.
  4. If the server does it, the full key names still have to be transmitted over the network.
  5. If compressed key names are transmitted over the network, readability in the JavaScript console really suffers.
  6. Compressing the entire JSON document might offer an even better performance advantage.
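
To illustrate point 1, here is a minimal sketch of key shortening at the application level. The mapping and the function names (`KEY_MAP`, `shorten`, `expand`) are hypothetical and not part of any MongoDB driver or ODM; a real ODM would typically let you declare the stored field name per attribute instead.

```python
# Hypothetical mapping from readable names to the short names stored in MongoDB.
KEY_MAP = {"date_of_birth": "dob", "first_name": "fn", "last_name": "ln"}
REVERSE_MAP = {short: long for long, short in KEY_MAP.items()}


def shorten(doc: dict) -> dict:
    """Rename top-level keys to their short form before writing a document."""
    return {KEY_MAP.get(key, key): value for key, value in doc.items()}


def expand(doc: dict) -> dict:
    """Restore the readable key names after reading a document."""
    return {REVERSE_MAP.get(key, key): value for key, value in doc.items()}


person = {"_id": 1, "first_name": "Ada", "date_of_birth": "1815-12-10"}
stored = shorten(person)   # what would be handed to the driver
restored = expand(stored)  # what the application works with
print(stored)
print(restored)
```

This only handles top-level keys; nested documents and query filters would need the same translation, which is part of the cost of doing it yourself.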

Like all features, there's a cost-benefit analysis for implementing it, and (at least so far) other features have offered more "bang for the buck".

Full document compression is available as of version 3.0 (see below).

* An in-memory lookup table for key names is basically a special case of LZW style compression — that's more or less what most compression algorithms do.

** Compression provides both a space advantage and a performance advantage. Smaller documents mean that more documents can be read per I/O operation, so a system with fixed I/O capacity can read more documents per second.
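
A rough way to see why general-purpose compression absorbs most of the key-name overhead is sketched below. It is only an approximation under stated assumptions: JSON stands in for BSON, Python's standard `zlib` module stands in for whatever the server would use, and the key names and the `block` helper are made up for the example.

```python
import json
import zlib


def block(key_names, n=1000):
    """Serialize n small documents that all share the same key names."""
    docs = [{key: i for key in key_names} for i in range(n)]
    return json.dumps(docs).encode("utf-8")


long_block = block(["date_of_birth", "first_name", "last_name"])
short_block = block(["dob", "fn", "ln"])

long_z = zlib.compress(long_block)
short_z = zlib.compress(short_block)

# Gap between long and short key names, before and after compression.
raw_gap = len(long_block) - len(short_block)
compressed_gap = len(long_z) - len(short_z)

print(f"raw:        long={len(long_block)}  short={len(short_block)}  gap={raw_gap}")
print(f"compressed: long={len(long_z)}  short={len(short_z)}  gap={compressed_gap}")
```

Because the same key names repeat in every document, the compressor replaces them with short back-references, so the gap between long and short names shrinks after compression. Exact numbers depend on the data and the compressor.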

Update

MongoDB versions 3.0 and up now have full document compression capability with the WiredTiger storage engine.

Two compression algorithms are available: snappy and zlib. The intent is for snappy to be the best choice for all-around performance, and for zlib to be the best choice for maximum storage capacity.

In my personal (non-scientific, but related to a commercial project) experimentation, snappy compression (we didn't evaluate zlib) offered significantly improved storage density at no noticeable net performance cost. In fact, there was slightly better performance in some cases, roughly in line with my previous comments/predictions.

Sean Reilly answered Sep 21 '22 00:09