
CouchDB Views: How much processing is acceptable in map reduce?

I've been toying around with Map Reduce with CouchDB. Some of the examples show some possibly heavy logic within the map reduce functions. In one particular case, they were performing for loops within map.

Is map reduce run on every single possible document before it emits your selected documents?

If so, I would think that means that running any kind of iterative processing within the map reduce functions would increase processing burden by an order of magnitude, at least.

Basically, it boils down to the following question: how much logic can be performed within map reduce before it becomes an unreasonably expensive query?

Kristian asked Apr 06 '12 16:04



2 Answers

Lots of expensive processing is acceptable in CouchDB map-reduce.

CouchDB views (map-reduce) are more like CREATE INDEX than they are SELECT FROM.

Specifically, CouchDB guarantees that a map function runs only once per document, ever. (Well, actually once per document change: once when the document is first indexed, and again only if the document is later modified.) That is what "incremental map-reduce" means.
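The "once per document change" idea can be sketched with a toy model. This is only an illustration in Python, not CouchDB's implementation: ToyView, map_fn, and the document layout are hypothetical names, and the toy checks every document's revision on each update, whereas CouchDB tracks changes so it does not need to rescan unchanged documents. Deletions are left out for brevity.

```python
# Toy model of incremental view indexing; NOT CouchDB's actual implementation.
# ToyView, map_fn, and the document layout are hypothetical, for illustration only.

def map_fn(doc):
    """Stand-in for a view's map function: yields (key, value) rows."""
    yield (doc["type"], 1)


class ToyView:
    def __init__(self, map_fn):
        self.map_fn = map_fn
        self.rows_by_doc = {}   # doc id -> rows emitted for its current revision
        self.indexed_rev = {}   # doc id -> revision that was last mapped
        self.map_calls = 0      # how many times map_fn has actually run

    def update(self, db):
        """Run map_fn only for documents whose revision changed since last time."""
        for doc_id, doc in db.items():
            if self.indexed_rev.get(doc_id) == doc["_rev"]:
                continue  # unchanged: keep the rows stored earlier
            self.rows_by_doc[doc_id] = list(self.map_fn(doc))
            self.indexed_rev[doc_id] = doc["_rev"]
            self.map_calls += 1


db = {
    "a": {"_rev": 1, "type": "post"},
    "b": {"_rev": 1, "type": "comment"},
}
view = ToyView(map_fn)
view.update(db)                        # first build: maps both documents
view.update(db)                        # nothing changed: maps nothing
db["a"] = {"_rev": 2, "type": "page"}
view.update(db)                        # only "a" changed: maps one document
print(view.map_calls)                  # 3
```

However expensive map_fn is, its cost is paid per document change, not per query.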

Therefore, suppose you had 10,000 documents and they take 1 second each to process (which is way higher than I have ever seen). That is 10,000 seconds, or about 2.8 hours, to completely build the view. However, once the view is complete, querying any row (?key=...) or row slice (?startkey=...&endkey=...) takes the same time as querying for documents directly. Lookup time is O(log n) in the document count.

In other words, even if it takes 1 second per document to execute the map, it will take a few milliseconds to fetch the result. (Of course, the view must build first, since it is actually an index.)
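To make the numbers concrete, here is a small sketch of the arithmetic, plus a lookup over a sorted list of keys. The sorted list and binary search are only an analogy for the built index (an assumption for illustration; CouchDB stores views in a B-tree on disk), and the key names are made up.

```python
import bisect
import math

# Build cost from the example above: paid once, while the view is built.
docs = 10_000
seconds_per_doc = 1.0
build_hours = docs * seconds_per_doc / 3600
print(round(build_hours, 2))  # 2.78

# Stand-in for the finished index: a sorted list of (hypothetical) keys.
keys = [f"key{i:05d}" for i in range(docs)]  # zero-padded, so already sorted


def lookup(key):
    """Binary search: O(log n) comparisons, regardless of how costly map was."""
    i = bisect.bisect_left(keys, key)
    return i if i < len(keys) and keys[i] == key else None


# Upper bound on comparisons for a binary search over 10,000 keys.
max_steps = math.ceil(math.log2(docs))
print(max_steps)            # 14
print(lookup("key04200"))   # 4200
```

Roughly 14 comparisons per lookup versus 10,000 seconds of build time is the gap between querying the view and building it.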

JasonSmith answered Dec 24 '22 22:12



Querying the database is a separate activity from running map/reduce over a document. Therefore the query cost is not affected by the complexity of the map/reduce functions.

In CouchDB you are querying an index. This means it is a copy of your data in a format optimized for query speed. A query is not like a table scan in SQL. It does not loop through records.

So how do you make this index? It is done through the map function. The map function emits a key and a value, and each emitted pair becomes a row in the index, sorted by key. Complicated map functions like the ones you mention may loop and emit many keys and values for a single document. CouchDB is smart and only runs the map function on a document when it needs to, usually after creates, updates, and deletes. This is why it is called incremental map/reduce.
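A map function that loops and emits several rows per document might look like the sketch below. It is written in Python purely as an illustration (real CouchDB map functions are typically JavaScript and call emit); map_tags, the sample documents, and the query helper are all hypothetical.

```python
import bisect

# Illustrative stand-in for a map function with a loop: one row per tag.
# map_tags and the documents below are hypothetical examples.
def map_tags(doc):
    for tag in doc.get("tags", []):
        yield (tag, doc["_id"])   # (key, value), like emit(key, value)


docs = [
    {"_id": "post1", "tags": ["couchdb", "views"]},
    {"_id": "post2", "tags": ["couchdb"]},
    {"_id": "post3", "tags": []},           # emits nothing
]

# "Building the index": run the map over every document once, sort rows by key.
rows = sorted(row for doc in docs for row in map_tags(doc))
keys = [key for key, _ in rows]


def query(key):
    """Find all rows for a key by binary search; no document is re-mapped."""
    lo = bisect.bisect_left(keys, key)
    hi = bisect.bisect_right(keys, key)
    return [value for _, value in rows[lo:hi]]


print(query("couchdb"))   # ['post1', 'post2']
print(query("views"))     # ['post1']
```

The loop inside map_tags runs while the index is built or updated; query only reads the sorted rows.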

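So, as you can see, a complicated map function might slow down the indexing work that follows creates, updates, and deletes. But again, CouchDB is smart: when you query the index you can specify how stale the data is allowed to be (for example, ?stale=ok), so a query does not have to wait for that work to finish.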

Ryan Ramage answered Dec 24 '22 23:12
