Performance of querying relational data with document-based NoSQL (MongoDB, CouchDB, Riak, etc.)

To follow up on my question on modeling relational data with NoSQL, I have read several articles on the subject:

NoSQL doesn't mean non-relational

NoSQL Ecommerce Example

They seem to suggest that NoSQL can handle normalized, relational data.

So let's continue with the example I had before: a CMS that has two types of data, articles and authors, where an article holds a reference (by ID) to its author.
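
For concreteness, the two document types might look roughly like this (field names such as type, author_id, and created_at are assumptions, just for illustration):

{ "_id": "author_1", "type": "author", "name": "Jane Doe" }

{ "_id": "article_1", "type": "article", "title": "Hello", "author_id": "author_1", "created_at": 1317427200000 }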

Below are the operations the system needs to support:

  1. Fetch an article by ID along with its author
  2. Fetch all articles by a particular author
  3. Find the first 10 articles, with their authors, sorted by creation date

I would like to understand the performance of these operations compared to the same operations on an RDBMS holding the same data. In particular, please specify whether each operation uses MapReduce, requires multiple trips to the NoSQL store (links), or relies on pre-joined (embedded) data.

I would like to limit the discussion to document-based NoSQL solutions like MongoDB, CouchDB, and Riak.

Edit 1:

The Spring Data project is available for both Riak and MongoDB.

asked Oct 01 '11 by ltfishie


2 Answers

Just wanted to toss in a CouchDB answer for anyone who might be curious. :)

As mentioned in the first answer above, embedding the author document into the article document is unwise, so the examples below assume two document types: articles and authors.

CouchDB uses MapReduce queries typically written in JavaScript (Python, Ruby, Erlang and others are also available). The results of a MapReduce query are stored in an index the first time they are requested, and that stored index is used for all future look-ups. Changes to the database are folded into the index on subsequent requests.

CouchDB's API is completely HTTP-based, so all requests to the database are HTTP verbs (GET, POST, PUT, DELETE) against various URLs. I'll list each MapReduce query (written in JavaScript) along with the URL used to request the related results from the index.

1. Fetch an article by ID along with its author

The simplest method for doing this is two direct document lookups:

GET /db/{article_id}
GET /db/{author_id}

...where {author_id} is the value obtained from the article's author_id field.
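
In application code those two round trips might look something like this (a minimal sketch, assuming Node 18+ with the global fetch API, a local CouchDB at http://localhost:5984, and a database named db):

// Minimal sketch: two HTTP round trips to CouchDB.
async function getArticleWithAuthor(articleId) {
  const base = 'http://localhost:5984/db';

  // Trip 1: fetch the article document by its ID.
  const article = await (await fetch(`${base}/${articleId}`)).json();

  // Trip 2: fetch the author document referenced by the article's author_id.
  const author = await (await fetch(`${base}/${article.author_id}`)).json();

  return { article, author };
}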

2. Fetch all articles by a particular author

MapReduce

function (doc) {
  if (doc.type === 'article') {
    emit(doc.author_id, doc);
  }
}
GET /db/_design/cms/_view/articles_by_author?key="{author_id}"

...where {author_id} is the actual ID of the author.
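
For reference, the map function above has to live in a design document for that URL to work; it might be registered roughly like this (a sketch, using the design document name cms and view name articles_by_author implied by the URL):

{
  "_id": "_design/cms",
  "views": {
    "articles_by_author": {
      "map": "function (doc) { if (doc.type === 'article') { emit(doc.author_id, doc); } }"
    }
  }
}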

3. Find the first 10 articles, with their authors, sorted by creation date

MapReduce

function (doc) {
  function arrayDateFromTimeStamp(ts) {
    var d = new Date(ts);
    return [d.getFullYear(), d.getMonth(), d.getDate(), d.getHours(), d.getMinutes(), d.getSeconds()];
  }

  if (doc.type === 'article') {
    var newdoc = doc;
    newdoc._id = doc.author_id;
    newdoc.created_at = arrayDateFromTimeStamp(doc.created_at);
    emit(newdoc.created_at, newdoc);
  }
}

It's possible to do include-style "joins" in CouchDB using ?include_docs=true in a view request. If you include an "_id" key in the value side of the emit (the second argument), then adding include_docs=true to your query parameters will include the doc referenced by that "_id". In the case above, we're replacing the document's own "_id" (which we don't need anymore) with the referenced author's "_id" (the value of "author_id" in the article document). Requesting the top 10 articles with their related author info looks like this:

GET /db/_design/cms/_view/articles_by_date?descending=true&limit=10&include_docs=true

Requesting that URL will return a list of the most recent 10 articles in a format similar to:

{"rows":[
  { "id":"article_id",
    "key":[2011, 9, 3, 12, 5, 41],
    "value":{"_id":"author_id", "title":"..."},
    "doc":{"_id":"author_id", "name":"Author Name"}
  }
]}

Using this same index you can get a list of all the documents at any year, month, day, hour, etc. granularity, with or without author data.
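
For example, all articles from a single month could be requested with a key range on that same view (a sketch; note that getMonth() above is zero-based, so 9 means October, and in a real request the brackets would need URL-encoding):

GET /db/_design/cms/_view/articles_by_date?startkey=[2011,9]&endkey=[2011,9,{}]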

There are also methods of using view collation to assemble a single logical document (like a page in a CMS referencing disparate content) out of several stored documents. There's some info on how to do that in these slides I did for CouchConf in July: http://www.slideshare.net/Couchbase/couchconfsfdesigningcouchbasedocuments
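
A very rough sketch of that collation pattern (the type values and the page_id field are assumptions): the page and its content pieces are emitted under the same leading key, so one range request returns them together.

function (doc) {
  if (doc.type === 'page') {
    emit([doc._id, 0], doc);        // the page itself collates first...
  } else if (doc.type === 'content') {
    emit([doc.page_id, 1], doc);    // ...followed by the content that references it
  }
}
GET /db/_design/cms/_view/page_with_content?startkey=["{page_id}"]&endkey=["{page_id}", 2]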

If you have any other questions, please let me know.

answered by BigBlueHat


Fetch an article by ID along with its author

SQL:

  • 1 query
  • 2 index lookups
  • 2 data lookups
  • data returned = article + author

MongoDB:

  • 2 queries
  • 2 index lookups
  • 2 data lookups
  • data returned = article + author
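
In mongo-shell terms, those two MongoDB queries might look roughly like this (collection and field names are assumptions, chosen to match the discussion above):

// Query 1: fetch the article by its _id.
var article = db.articles.findOne({_id: articleId});

// Query 2: fetch the referenced author.
var author = db.authors.findOne({_id: article.author_id});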

Fetch all articles by a particular author

SQL:

  • 1 query
  • 1 index lookup
  • N data lookups
  • data returned = N articles

MongoDB:

  • 1 query
  • 1 index lookup
  • N data lookups
  • data returned = N articles
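
A rough mongo-shell sketch of that single query (it assumes an index on author_id, which is what makes it one index lookup):

db.articles.createIndex({author_id: 1});   // one-time setup
var articles = db.articles.find({author_id: authorId}).toArray();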

Find the first 10 articles, with their authors, sorted by creation date

SQL:

  • 1 query
  • 2 index lookups
  • 11 to 20 data lookups (articles then unique authors)
  • data returned = 10 articles + 10 authors

MongoDB:

  • 2 queries: articles.find().sort().limit(10), then authors.find({_id: {$in: article_author_ids}}) (sketched after this list)
  • 2 index lookups
  • 11 to 20 data lookups (articles then unique authors)
  • data returned = 10 articles + 1 to 10 authors
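
Those two queries might look roughly like this in the mongo shell (a sketch; collection and field names are assumptions):

// Query 1: the 10 most recent articles.
var articles = db.articles.find().sort({created_at: -1}).limit(10).toArray();

// Collect the author IDs referenced by those articles...
var authorIds = articles.map(function (a) { return a.author_id; });

// Query 2: ...and fetch the 1 to 10 unique authors in a single $in lookup.
var authors = db.authors.find({_id: {$in: authorIds}}).toArray();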

Summary

In two cases MongoDB requires an extra query, but does most of the same total work underneath. In some cases MongoDB returns less data over the network (no repeated entries). Join queries tend to be limited by the requirement that all of the data being joined lives on the same box. If authors and articles live in different places, then you end up doing two queries anyway.

MongoDB tends to get better "raw" performance because it doesn't flush to disk with every write (so it's actually a "durability" tradeoff). It also has a much smaller query parser, so there's less activity per query.

From a basic performance standpoint these things are very similar. They just make different assumptions about your data and the trade-offs you want to make.

answered by Gates VP