
email as _id in a MongoDB user collection

I have a user collection in a MongoDB. The _id is currently the standard MongoDB generated ObjectId. I also have a unique key constraint against a required 'email' field. This seems like a waste.

Is there any reason why I should not ditch the 'email' field and make that data the _id field?

Guy asked May 01 '14 00:05



2 Answers

I have read Neil's answer and partially agree with it (though I am really skeptical about the 'significant performance gains'). One thing I did not find in your question is what you are going to do with this email: are you going to search by it, or is it just stored there? And one of the most important things, which the previous answer did not address: is it going to change?

It is not uncommon for users of a system to change their email address (the old one is lost or no longer used). If you use the email as the _id, you will not be able to change it easily, because you cannot modify _id in MongoDB. You would need to copy the document, insert the copy under the new _id, and remove the old document, and that sequence of separate operations is not atomic.

So I would count this as one big reason not to do it. But you need to decide whether you will allow people to change their email addresses.
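To make the difference concrete, here is a minimal Python sketch of both cases. It assumes `users` is a pymongo-style collection object (anything exposing `find_one`, `insert_one`, `delete_one` and `update_one`); the function names are hypothetical and only for illustration.

```python
def change_email_as_id(users, old_email, new_email):
    """Re-key a user whose _id is the email address.

    _id is immutable, so the document is copied under the new _id and the
    old one is deleted. These are two separate writes: a crash or a
    concurrent reader in between can observe both documents.
    """
    if old_email == new_email:
        return True  # nothing to do
    doc = users.find_one({"_id": old_email})
    if doc is None:
        return False  # no such user
    new_doc = dict(doc)
    new_doc["_id"] = new_email
    # With pymongo this raises DuplicateKeyError if new_email is already taken.
    users.insert_one(new_doc)
    users.delete_one({"_id": old_email})
    return True


def change_email_as_field(users, user_id, new_email):
    """Change the email when it is an ordinary (uniquely indexed) field.

    A write to a single document is atomic in MongoDB, so this is one step.
    """
    users.update_one({"_id": user_id}, {"$set": {"email": new_email}})
```

With the email as a regular field the change is a single update; with the email as _id it is a read followed by two writes, and any other collection that references the user by _id has to be updated as well.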

Salvador Dali answered Oct 04 '22 12:10



Generally speaking, no, there is no real reason not to, and in fact there are significant performance gains to be realized if you actually do use your "email" as the primary key:

  1. Most of your lookups are probably on that primary key. Even if you create a unique index on a different field, MongoDB is optimized so that "finding" the _id field index is a no-brainer. It's always there.

  2. No additional space is used for an index. When you look up by primary key there is no need to pull in anything other than the default index, which saves disk space as well as the I/O cost that a second index would otherwise incur.
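As a rough illustration of the two layouts being compared, the sketch below again assumes `users` is a pymongo-style collection; the function names are made up for this example. Design A keeps the generated ObjectId and needs a second, unique index on "email"; design B stores the email as _id and relies only on the index that every collection already has.

```python
def ensure_unique_email_index(users):
    # Design A: ObjectId as _id, plus an extra unique index on "email".
    # This index takes additional space on top of the built-in _id index.
    return users.create_index("email", unique=True)


def find_user_design_a(users, email):
    # Design A lookup: served by the secondary "email" index.
    return users.find_one({"email": email})


def find_user_design_b(users, email):
    # Design B lookup: the email is the _id, so the default _id index
    # serves the query and no second index is needed.
    return users.find_one({"_id": email})
```

How much the extra index actually costs depends on the collection size and workload, so it is worth measuring before relying on it.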

Perhaps the only really relevant consideration would be sharding, and then only if your use case was better suited to some different form of "bucketed" distribution, for example of "high/low" volume users. In that case some other form of primary key would be required in order to facilitate that.

The default ObjectId type that generally occupies the _id field is great, as it maintains a natural insertion order and even makes it possible to do things such as general range-based queries or time-based queries (within reason). So where there is a need for a natural insertion order it is generally the best choice, and it is highly collision safe.
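The time-based queries mentioned above work because the first 4 bytes of an ObjectId are a big-endian Unix timestamp in seconds. A small standard-library sketch (the helper names are hypothetical; with pymongo you would normally use the `ObjectId` class from `bson` instead):

```python
from datetime import datetime, timezone


def objectid_creation_time(oid_hex: str) -> datetime:
    """Return the UTC creation time embedded in a 24-char ObjectId hex string."""
    if len(oid_hex) != 24:
        raise ValueError("an ObjectId is 12 bytes, i.e. 24 hex characters")
    seconds = int(oid_hex[:8], 16)  # first 4 bytes: seconds since the epoch
    return datetime.fromtimestamp(seconds, tz=timezone.utc)


def objectid_lower_bound_hex(moment: datetime) -> str:
    """Smallest ObjectId hex value for a given timezone-aware datetime.

    Useful as the boundary of a range query on _id. Pass an aware datetime;
    a naive one would be interpreted in local time by timestamp().
    """
    return format(int(moment.timestamp()), "08x") + "0" * 16
```

A range query for everything created at or after some moment would then compare _id against the ObjectId built from `objectid_lower_bound_hex(moment)`. The resolution is one second, which is the "within reason" caveat.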

But if you are mainly looking for efficient lookup by primary key value, then anything that serves as a natural primary key is ideally put in the _id field of the collection, as long as it is reasonably guaranteed to be unique.

Neil Lunn answered Oct 04 '22 11:10
