
Document Databases: Redundant data, references, etc. (MongoDB specifically)

I keep running into situations where the appropriate way to model my data is to split it into two documents. Say it's for a chain of stores and you're saving which stores each customer has visited. Stores and Customers need to be independent pieces of data because they interact with plenty of other things, but we do need to relate them.

So the easy answer is to store the customer's ID in the store document, or the store's ID in the customer's document. Oftentimes, though, you want to access one or two other pieces of data for display purposes, because IDs alone aren't useful: the customer name, say, or the store name.

  1. Do you typically store a duplicate of the entire document? Or just store the pieces of data you need? Maybe depends on the size of the doc vs how much of it you need.
  2. How do you handle the fact that you have duplicate data? Do you go hunt down data when it changes? Update the data at some interval when it's loaded? Only duplicate when you can afford stale data?

Would appreciate your input and/or links to any kind of 'best practices' or at least well-reasoned discussion of these topics.

Jim asked Oct 18 '10 05:10



1 Answer

There are basically two scenarios: fresh and stale.

Fresh data

Storing duplicate data is easy. Maintaining the duplicate data is the hard part. So the easiest thing to do is to avoid maintenance, by simply not storing any duplicate data to begin with. This is mainly useful if you need fresh data. Only store the references, and query the collections when you need to retrieve information.
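As a rough illustration of the reference-only approach, here is a minimal Python sketch. Plain dictionaries stand in for the two MongoDB collections, and the field name `visited_store_ids` and the helper `visited_store_names` are made up for this example. With a real driver, the second lookup would typically be one extra query against the stores collection.

```python
# In-memory stand-ins for two MongoDB collections, keyed by _id.
# All names and fields here are hypothetical, chosen for illustration.
stores = {
    "s1": {"_id": "s1", "name": "Downtown"},
    "s2": {"_id": "s2", "name": "Airport"},
}
customers = {
    "c1": {"_id": "c1", "name": "Alice", "visited_store_ids": ["s1", "s2"]},
}


def visited_store_names(customer_id):
    """Resolve store references at read time, so the result is always fresh."""
    customer = customers[customer_id]  # first lookup: the customer document
    # Second lookup: fetch every referenced store document by its ID.
    return [stores[store_id]["name"] for store_id in customer["visited_store_ids"]]


# A rename touches exactly one document; no copies exist anywhere else.
stores["s1"]["name"] = "Downtown Flagship"
names = visited_store_names("c1")
print(names)
```

The cost is the extra lookup on every read; the benefit is that a change to a store never has to be propagated.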

In this scenario, you'll have some overhead due to the extra queries. The alternative is to track all locations of duplicate data, and update all instances on each update. This also involves overhead, especially in N-to-M relations like the one you mentioned. So either way, you will have some overhead, if you require fresh data. You can't have the best of both worlds.
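To make the cost of that alternative concrete, here is a small Python sketch of actively maintained duplicates. Dictionaries again stand in for collections, and the embedded `visits` entries and the `rename_store` helper are assumptions made for this example, not anything from the question.

```python
# In-memory stand-ins for two MongoDB collections (hypothetical schema).
# Each customer embeds a copy of the store name next to the store reference.
stores = {"s1": {"_id": "s1", "name": "Downtown"}}
customers = {
    "c1": {"_id": "c1", "name": "Alice",
           "visits": [{"store_id": "s1", "store_name": "Downtown"}]},
    "c2": {"_id": "c2", "name": "Bob",
           "visits": [{"store_id": "s1", "store_name": "Downtown"}]},
}


def rename_store(store_id, new_name):
    """Update the store and every embedded copy of its name.

    Returns the number of customer documents that had to be rewritten.
    """
    stores[store_id]["name"] = new_name
    touched = 0
    for customer in customers.values():
        changed = False
        for visit in customer["visits"]:
            if visit["store_id"] == store_id:
                visit["store_name"] = new_name
                changed = True
        if changed:
            touched += 1
    return touched


touched = rename_store("s1", "Downtown Flagship")
print(touched)
```

Reads need no second lookup here, but one rename fans out to every customer that ever visited the store, and any code path that forgets to do so leaves inconsistent copies behind.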

Stale data

If you can afford to have stale data, things get a lot easier. To avoid query overhead, you want duplicate data. To avoid having to maintain that duplicate data, you don't write the duplicates yourself, at least not actively.

In this scenario you'll also want to store only the references between documents. Then use a periodic map-reduce job to generate the duplicate data. You can then query the single map-reduce result, rather than separate collections. This way you avoid the query overhead, but you also don't have to hunt down data changes.
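The idea can be sketched in Python as follows. This is not a MongoDB map-reduce job: a plain loop stands in for it, dictionaries stand in for collections, and `rebuild_view` and the view's field names are hypothetical. The point is only that the duplicated data is regenerated wholesale on a schedule rather than patched on every write.

```python
# In-memory stand-ins for the reference-only collections (hypothetical schema).
stores = {"s1": {"_id": "s1", "name": "Downtown"}}
customers = {
    "c1": {"_id": "c1", "name": "Alice", "visited_store_ids": ["s1"]},
}


def rebuild_view():
    """Regenerate the denormalized read view from the source collections.

    Stands in for the periodic job; in practice it would be run on a timer.
    """
    view = {}
    for customer in customers.values():
        view[customer["_id"]] = {
            "customer_name": customer["name"],
            "store_names": [
                stores[store_id]["name"]
                for store_id in customer["visited_store_ids"]
            ],
        }
    return view


customer_visits_view = rebuild_view()

# A write only touches the source document; the view is now stale.
stores["s1"]["name"] = "Downtown Flagship"
stale_names = customer_visits_view["c1"]["store_names"]

# The next scheduled run picks up the change without any change tracking.
customer_visits_view = rebuild_view()
fresh_names = customer_visits_view["c1"]["store_names"]
print(stale_names, fresh_names)
```

Reads hit only the generated view, and writes never need to know where copies live; the price is that the view lags behind by up to one rebuild interval.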

Summary

Only store references to other documents. If you can afford stale data, use periodic map-reduce jobs to generate duplicate data. Avoid maintaining duplicate data; it's complex and error-prone.

Niels van der Rest answered Oct 02 '22 12:10