
Principles for Modeling CouchDB Documents



There have been some great answers to this already, but I wanted to add some more recent CouchDB features to the mix of options for working with the original situation described by viatropos.

The key point at which to split up documents is where there might be conflicts (as mentioned earlier). You should never keep massively "tangled" data together in a single document, as you'll get a single revision path for completely unrelated updates (adding a comment would add a revision to the entire site document, for instance). Managing the relationships or connections between various smaller documents can be confusing at first, but CouchDB provides several options for combining disparate pieces into single responses.

The first big one is view collation. When you emit key/value pairs into the results of a map/reduce query, the keys are sorted according to CouchDB's view collation rules (strings are compared with Unicode collation, so "a" comes before "b"). You can also emit complex keys from your map function as JSON arrays: ["a", "b", "c"]. That lets you build a "tree" of sorts out of array keys. Using your example above, we can emit the post_id, then the type of thing we're referencing, then its ID (if needed). If we also put the _id of the referenced document into an object in the emitted value, we can use the 'include_docs' query param to include those documents in the view output:

{"rows":[
  {"key":["123412804910820", "post"], "value":null},
  {"key":["123412804910820", "author", "Lance1231"], "value":{"_id":"Lance1231"}},
  {"key":["123412804910820", "comment", "comment1"], "value":{"_id":"comment1"}},
  {"key":["123412804910820", "comment", "comment2"], "value":{"_id":"comment2"}}
]}
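
A minimal sketch of a map function that could produce rows like the ones above. The field names type, author and post_id are assumptions for illustration (the original documents are not shown), and the emit() defined here is only a stand-in so the sketch runs outside CouchDB, which supplies its own emit() and sorts the rows by key.

```javascript
// Stand-in for the emit() that CouchDB provides to map functions.
var rows = [];
function emit(key, value) {
  rows.push({ key: key, value: value });
}

// Hypothetical map function; type, author and post_id are assumed fields.
function map(doc) {
  if (doc.type === "post") {
    // The post itself: a null value means include_docs returns this document.
    emit([doc._id, "post"], null);
    // The author: value._id points include_docs at the author document.
    emit([doc._id, "author", doc.author], { _id: doc.author });
  } else if (doc.type === "comment") {
    // Each comment is keyed under its parent post.
    emit([doc.post_id, "comment", doc._id], { _id: doc._id });
  }
}

// Sample documents, invented for this sketch.
var docs = [
  { _id: "123412804910820", type: "post", author: "Lance1231", title: "Hello" },
  { _id: "comment1", type: "comment", post_id: "123412804910820", text: "First" },
  { _id: "comment2", type: "comment", post_id: "123412804910820", text: "Second" }
];
docs.forEach(map);

console.log(JSON.stringify(rows, null, 2));
```

Because the post and its comments are separate documents, adding a comment only creates a new document and never touches the post's revision.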

Requesting that same view with '?include_docs=true' will add a 'doc' field to each row. It holds the document whose '_id' is given in the 'value' object or, if 'value' has no '_id', the document from which the row was emitted (in this case the 'post' document). Please note that these results would also include an 'id' field referencing the source document from which the emit was made. I left it out for space and readability.
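
The lookup rule described above can be sketched in a few lines. This only imitates the behaviour for illustration; the in-memory db object, the attachDoc name and the sample documents are hypothetical, and returning null for a missing document is a choice made in this sketch.

```javascript
// Hypothetical in-memory "database" keyed by _id.
var db = {
  "123412804910820": { _id: "123412804910820", type: "post", title: "Hello" },
  "Lance1231": { _id: "Lance1231", type: "author", name: "Lance" }
};

// Prefer the _id named in the row's value; otherwise use the row's own id.
function attachDoc(row) {
  var targetId = row.value && row.value._id ? row.value._id : row.id;
  var doc = Object.prototype.hasOwnProperty.call(db, targetId) ? db[targetId] : null;
  return { id: row.id, key: row.key, value: row.value, doc: doc };
}

// A row with a null value resolves to the emitting (post) document.
var postRow = attachDoc({
  id: "123412804910820",
  key: ["123412804910820", "post"],
  value: null
});

// A row whose value carries an _id resolves to that linked document.
var authorRow = attachDoc({
  id: "123412804910820",
  key: ["123412804910820", "author", "Lance1231"],
  value: { _id: "Lance1231" }
});

console.log(postRow.doc.type, authorRow.doc.type);
```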

We can then use the 'start_key' and 'end_key' parameters to filter the results down to a single post's data:

?start_key=["123412804910820"]&end_key=["123412804910820", {}, {}]

Or even specifically extract the list for a certain type:

?start_key=["123412804910820", "comment"]&end_key=["123412804910820", "comment", {}]

These query param combinations work because of the collation order: objects sort after every other JSON type, so an empty object ("{}") acts as an upper bound for the string keys used here, while null sorts before everything else and "" sorts before any other string.
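
To see why those ranges select the right rows, here is a simplified sketch of the type ordering behind view collation (null, then false, true, numbers, strings, arrays, objects). It is an approximation only: CouchDB compares strings with ICU collation, whereas this sketch uses plain JavaScript string comparison, and the function names are invented for illustration.

```javascript
// Rank of each JSON type in the collation order.
function typeRank(v) {
  if (v === null) return 0;
  if (v === false) return 1;
  if (v === true) return 2;
  if (typeof v === "number") return 3;
  if (typeof v === "string") return 4;
  if (Array.isArray(v)) return 5;
  return 6; // plain object
}

// Simplified comparator: negative if a sorts first, positive if b does.
function collate(a, b) {
  var ra = typeRank(a), rb = typeRank(b);
  if (ra !== rb) return ra - rb;
  if (ra === 3) return a - b;
  if (ra === 4) return a < b ? -1 : a > b ? 1 : 0; // not ICU, see lead-in
  if (ra === 5) {
    // Arrays compare element by element; a shorter prefix sorts first.
    for (var i = 0; i < Math.min(a.length, b.length); i++) {
      var c = collate(a[i], b[i]);
      if (c !== 0) return c;
    }
    return a.length - b.length;
  }
  if (ra === 6) {
    // Objects compare entry by entry (key, then value), fewer entries first.
    var ea = Object.entries(a), eb = Object.entries(b);
    for (var j = 0; j < Math.min(ea.length, eb.length); j++) {
      var d = collate(ea[j][0], eb[j][0]) || collate(ea[j][1], eb[j][1]);
      if (d !== 0) return d;
    }
    return ea.length - eb.length;
  }
  return 0; // null, false and true each equal themselves
}

function inRange(key, startKey, endKey) {
  return collate(key, startKey) >= 0 && collate(key, endKey) <= 0;
}

var POST = "123412804910820";
var keys = [
  [POST, "post"],
  [POST, "author", "Lance1231"],
  [POST, "comment", "comment1"],
  [POST, "comment", "comment2"],
  ["999", "post"] // a different post, invented for contrast
];

var forPost = keys.filter(function (k) {
  return inRange(k, [POST], [POST, {}, {}]);
});
var commentsOnly = keys.filter(function (k) {
  return inRange(k, [POST, "comment"], [POST, "comment", {}]);
});

console.log(forPost.length, commentsOnly.length);
```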

The second helpful addition from CouchDB in these situations is the _list function. This would allow you to run the above results through a templating system of some kind (if you want HTML, XML, CSV or whatever back), or output a unified JSON structure if you want to be able to request an entire post's content (including author and comment data) with a single request and returned as a single JSON document that matches what your client-side/UI code needs. Doing that would allow you to request the post's unified output document this way:

/db/_design/app/_list/unified/posts?start_key=["123412804910820"]&end_key=["123412804910820", {}, {}]&include_docs=true

Your _list function (in this case named "unified") would take the results of the view (in this case named "posts") and run them through a JavaScript function that sends back the HTTP response in the content type you need (JSON, HTML, etc). Note that the URL names the list function first and the view second.
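
A rough sketch of what such a "unified" list function might look like. Inside CouchDB, start(), getRow() and send() are provided by the query server; the versions below are simple stand-ins so the sketch runs on its own, and the shape of the unified output (post, author, comments) is an assumption, not something CouchDB prescribes.

```javascript
// Stand-ins for the helpers CouchDB provides to _list functions.
var queue = [
  { key: ["123412804910820", "post"], doc: { _id: "123412804910820", title: "Hello" } },
  { key: ["123412804910820", "author", "Lance1231"], doc: { _id: "Lance1231", name: "Lance" } },
  { key: ["123412804910820", "comment", "comment1"], doc: { _id: "comment1", text: "First" } },
  { key: ["123412804910820", "comment", "comment2"], doc: { _id: "comment2", text: "Second" } }
];
var output = "";
function start(response) { /* CouchDB would set response headers here */ }
function getRow() { return queue.shift() || null; }
function send(chunk) { output += chunk; }

// Hypothetical list function: folds the view rows into one JSON document.
function unified(head, req) {
  start({ headers: { "Content-Type": "application/json" } });
  var result = { post: null, author: null, comments: [] };
  var row;
  while ((row = getRow())) {
    var kind = row.key[1]; // second key element names the type of row
    if (kind === "post") {
      result.post = row.doc;
    } else if (kind === "author") {
      result.author = row.doc;
    } else if (kind === "comment") {
      result.comments.push(row.doc);
    }
  }
  send(JSON.stringify(result));
}

unified({}, {});
var unifiedDoc = JSON.parse(output);
console.log(unifiedDoc.post.title, unifiedDoc.comments.length);
```

The row.doc fields are only present when the view is requested with include_docs=true, as in the URL above.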

Combining these things, you can split up your documents at whatever level you find useful and "safe" for updates, conflicts, and replication, and then put them back together as needed when they're requested.

Hope that helps.


I know this is an old question, but I came across it trying to figure out the best approach to this exact same problem. Christopher Lenz wrote a nice blog post about methods of modeling "joins" in CouchDB. One of my take-aways was: "The only way to allow non-conflicting addition of related data is by putting that related data into separate documents." So, for simplicity's sake, you'd want to lean towards "denormalization". But you'll hit a natural barrier due to conflicting writes in certain circumstances.

In your example of Posts and Comments, if a single post and all of its comments lived in one document, then two people trying to post a comment at the same time (i.e. against the same revision of the document) would cause a conflict. This would get even worse in your "whole site in a single document" scenario.
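
The conflict scenario can be imitated with a tiny in-memory store that checks revisions on update. This is only a sketch of the idea: TinyStore is a hypothetical class, and real CouchDB revisions look like "2-<hash>" rather than the bare counters used here.

```javascript
// Hypothetical store that rejects writes made against a stale revision.
function TinyStore() {
  this.docs = {};
}

TinyStore.prototype.get = function (id) {
  return JSON.parse(JSON.stringify(this.docs[id])); // hand out a copy
};

TinyStore.prototype.put = function (doc) {
  var current = this.docs[doc._id];
  var currentRev = current ? current._rev : undefined;
  if (doc._rev !== currentRev) {
    // The writer did not see the latest revision: refuse the update.
    return { ok: false, status: 409, error: "conflict" };
  }
  var stored = JSON.parse(JSON.stringify(doc));
  stored._rev = String(currentRev ? parseInt(currentRev, 10) + 1 : 1);
  this.docs[doc._id] = stored;
  return { ok: true, rev: stored._rev };
};

var store = new TinyStore();
store.put({ _id: "post1", title: "Hello", comments: [] });

// Two users read the same revision, then both try to append a comment.
var aliceCopy = store.get("post1");
var bobCopy = store.get("post1");
aliceCopy.comments.push("Nice post");
bobCopy.comments.push("I disagree");

var aliceResult = store.put(aliceCopy); // accepted, revision moves on
var bobResult = store.put(bobCopy);     // rejected, Bob's _rev is now stale

console.log(aliceResult, bobResult);
```

Bob has to re-read the post, reapply his comment and try again, which is exactly the friction that separate comment documents avoid.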

So I think the rule of thumb would be "denormalize until it hurts", but the point where it will "hurt" is where you have a high likelihood of multiple edits being posted against the same revision of a document.


The book says, if I recall correctly, to denormalize until "it hurts", while keeping in mind the frequency with which your documents might be updated.

  1. What rules/principles do you use to divide up your documents (relationships, etc)?

As a rule of thumb, I include all data that is needed to display a page regarding the item in question. In other words, everything you would print on a real-world piece of paper that you would hand to somebody. E.g. a stock quote document would include the name of the company, the exchange, the currency, in addition to the numbers; a contract document would include the names and addresses of the counterparties, all information on dates and signatories. But stock quotes from distinct dates would form separate documents, separate contracts would form separate documents.

  2. Is it okay to put the entire site into one document?

No, that would be silly, because:

  • you would have to read and write the whole site (the document) on each update, and that is very inefficient;
  • you would not benefit from any view caching.

I think Jake's response nails one of the most important aspects of working with CouchDB that may help you make the scoping decision: conflicts.

In the case where you have comments as an array property of the post itself, and you just have a 'post' DB with a bunch of huge 'post' documents in it, as Jake and others correctly pointed out, you could imagine a really popular blog post where two users submit edits to the post document simultaneously, resulting in a collision and a version conflict for that document.

ASIDE: As this article points out, also consider that each time you request or update that doc you have to get or set the document in its entirety, so passing around massive documents that represent either the entire site or a post with a lot of comments on it can become a problem you would want to avoid.

In the case where posts are modeled separately from comments and two people submit a comment on a story, those simply become two "comment" documents in that DB, with no issue of conflict; just two PUT operations to add two new comments to the "comment" db.

Then, to write the view that gives you back the comments for a post, you would emit each comment keyed by the ID of its parent post and query the view with that postID, with the results sorted in some logical order. Maybe you even emit something like [postID, byUsername] as the key in the 'comments' view to indicate the parent post and how you want the results sorted, or something along those lines.
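
A minimal sketch of such a 'comments' view, assuming each comment document carries type, post_id and username fields (all hypothetical names). The emit() stand-in, the manual sort and the filter only imitate what CouchDB does when the view is queried with something like ?startkey=["post1"]&endkey=["post1",{}].

```javascript
// Stand-in for CouchDB's emit(); also records which document emitted the row.
var rows = [];
var currentId = null;
function emit(key, value) {
  rows.push({ id: currentId, key: key, value: value });
}

// Hypothetical map function: key is [parent post ID, commenter's username].
function map(doc) {
  if (doc.type === "comment") {
    emit([doc.post_id, doc.username], null);
  }
}

// Sample documents, invented for this sketch.
var docs = [
  { _id: "c1", type: "comment", post_id: "post1", username: "zoe", text: "Late reply" },
  { _id: "c2", type: "comment", post_id: "post2", username: "bob", text: "Other post" },
  { _id: "c3", type: "comment", post_id: "post1", username: "amy", text: "First!" },
  { _id: "post1", type: "post", title: "Hello" }
];
docs.forEach(function (doc) {
  currentId = doc._id;
  map(doc);
});

// CouchDB keeps view rows sorted by key; imitate that for two-string keys.
rows.sort(function (a, b) {
  for (var i = 0; i < 2; i++) {
    if (a.key[i] < b.key[i]) return -1;
    if (a.key[i] > b.key[i]) return 1;
  }
  return 0;
});

// All comments for one post, already ordered by username.
var post1Comments = rows.filter(function (row) {
  return row.key[0] === "post1";
});

console.log(post1Comments.map(function (row) { return row.id; }));
```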

MongoDB handles documents a bit differently, allowing indexes to be built on sub-elements of a document, so you might see the same question on the MongoDB mailing list and someone saying "just make the comments a property of the parent post".

Because of the write locking and single-master nature of Mongo, the conflicting-revision issue of two people adding comments wouldn't spring up there, and the queryability of the content, as mentioned, isn't affected too badly because of sub-indexes.

That being said, if your sub-elements in either DB are going to be huge (say, tens of thousands of comments), I believe both camps recommend making those separate elements; I have certainly seen that to be the case with Mongo, as there are upper limits on how big a document and its sub-elements can be.