Too much data duplication in mongodb?

Tags:

I'm new to this whole NOSQL stuff and have recently been intrigued with mongoDB. I'm creating a new website from scratch and decided to go with MONGODB/NORM (for C#) as my only database. I've been reading up a lot about how to properly design your document model database and I think for the most part I have my design worked out pretty well. I'm about 6 months into my new site and I'm starting to see issues with data duplication/sync that I need to deal with over and over again. From what I read, this is expected in the document model, and for performance it makes sense. I.E. you stick embedded objects into your document so it's fast to read - no joins; but of course you can't always embed, so mongodb has this concept of a DbReference which is basically analogous to a foreign key in relational DBs.

So here's an example: I have Users and Events; both get their own document, Users attend events, Events have users attendees. I decided to embed a list of Events with limited data into the User objects. I embedded a list of Users also into the Event objects as their "attendees". The problem here is now I have to keep the Users in sync with the list of Users that is also embedded in the Event object. As I read it, this seems to be the preferred approach, and the NOSQL way to do things. Retrieval is fast, but the fall-back is when I update the main User document, I need to also go into the Event objects, possibly find all references to that user and update that as well.

So the question I have is, is this a pretty common problem people need to deal with? How much does this problem have to happen before you start saying "maybe the NOSQL strategy doesn't fit what I'm trying to do here"? When does the performance advantage of not having to do joins turn into a disadvantage because you're having a hard time keeping data in sync in embedded objects and doing multiple reads to the DB to do so?

242

asked Oct 24 '10 19:10

mike

1 Answers

Well that is the trade off with document stores. You can store in a normalized fashion like any standard RDMS, and you should strive for normalization as much as possible. It's only where its a performance hit that you should break normalization and flatten your data structures. The trade off is read efficiency vs update cost.

Mongo has really efficient indexes which can make normalizing easier like a traditional RDMS (most document stores do not give you this for free which is why Mongo is more of a hybrid instead of a pure document store). Using this, you can make a relation collection between users and events. It's analogous to a surrogate table in a tabular data store. Index the event and user fields and it should be pretty quick and will help you normalize your data better.

I like to plot the efficiency of flatting a structure vs keeping it normalized when it comes to the time it takes me to update a records data vs reading out what I need in a query. You can do it in terms of big O notation but you don't have to be that fancy. Just put some numbers down on paper based on a few use cases with different models for the data and get a good gut feeling about how much works is required.

Basically what I do is first try to predict the probability of how many updates a record will have vs. how often it's read. Then I try to predict what the cost of an update is vs. a read when it's both normalized or flattened (or maybe partially combination of the two I can conceive... lots of optimization options). I can then judge the savings of keeping it flat vs. the cost of building up the data from normalized sources. Once I plotted all the variables, if the savings of keeping it flat saves me a bunch, then I will keep it flat.

A few tips:

If you require fast lookups to be quick and atomic (perfectly up to date) you may want a favor a solution where you favor flattening over normalization and taking the hit on the update.
If you require update to be quick, and access immediately then favor normalization.
If you require fast lookups but don't require perfectly up to date data, consider building out your normalized data in batch jobs (using map/reduce possibly).
If your queries need to be fast, and updates are rare, and do not necessarily require your update to be accessible immediately or require transaction level locking that it went through 100% of the time (to guarantee your update was written to disk), you can consider writing your updates to a queue processing them in the background. (In this model, you will probably have to deal with conflict resolution and reconciliation later).
Profile different models. Build out a data query abstraction layer (like an ORM in a way) in your code so you can refactor your data store structure later.

There are lot of other ideas that you can employ. There a lot of great blogs on line that go into it like highscalabilty.org and make sure you understand CAP theorem.

Also consider a caching layer, like Redis or memcache. I will put one of those products in front my data layer. When I query mongo (which is storing everything normalized), I use the data to construct a flattened representation and store it in the cache. When I update the data, I will invalidate any data in the cache that references what I'm updating. (Although you have to take the time it takes to invalidate data and tracking data in the cache that is getting updated into consideration of your scaling factors). Someone once said "The two hardest things in Computer Science are naming things and cache invalidation."

Hope that helps!

167

answered Sep 30 '22 12:09

Zac Bowling

Related questions
                            
                                Searching for value of any field in MongoDB without explicitly naming it
                            
                                How to close a mongodb python connection?
                            
                                MongoDB won't start after server crash
                            
                                is there a mongoose connect error callback
                            
                                Difference between "id" and "_id" fields in MongoDB
                            
                                MongoKit vs MongoEngine vs Flask-MongoAlchemy for Flask [closed]
                            
                                MongoDB Stored Procedure Equivalent
                            
                                MongoDB Data directory /data/db not found [duplicate]
                            
                                Mongo, find through list of ids
                            
                                Unable to start/launch local mongo db
                            
                                Is there any option to limit mongodb memory usage?
                            
                                What is the maximum number of parameters passed to $in query in MongoDB?
                            
                                Does the MongoDB stats() function return bits or bytes?
                            
                                How do you connect to a replicaset from a MongoDB shell?
                            
                                Importing json from file into mongodb using mongoimport
                            
                                MongoDB: Updating documents using data from the same document [duplicate]
                            
                                Can you specify a key for $addToSet in Mongo?
                            
                                MongoDB GUI client (cross-platform or Linux) [closed]
                            
                                pymongo.errors.CursorNotFound: cursor id '...' not valid at server
                            
                                How to connect mongodb clients to local Meteor MongoDB

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With

Too much data duplication in mongodb?

Tags:

duplicates

mongodb

norm

mike

People also ask

1 Answers

Zac Bowling

Recent Activity

Donate For Us