Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

"merge" view collation into useful output in CouchDB

When doing a "join" in CouchDB, you can use view collation to group the records together. For example, having two document types customers and orders. So that you can return customer, then all the orders for that customer, then the next customer, and orders.

The question is, how do you do a merging of rows, so that if you have 10 customers, and 40 orders, your output is still 10 rows instead of 50. You essentially add more information into your customer row.

I believe using a _list or a reduce will solve this. The question is how exactly to do this?

like image 779
Matt Avatar asked May 21 '11 22:05

Matt


2 Answers

I second jhs answer, but I think his "Option 2" is too dangerous. I learned it the hard way. You can use reduce function for many nice things like getting the last post of each user of a blog, but you cannot use it for anything which does not reduce the amount of data returned.

To support it with facts, I made this little script to generate 200 customers with 20 orders each.

#!/bin/bash
echo '{"docs":['
for i in $(seq 1 200); do
  id=customer/$i
  echo '{"_id":"'$id'","type":"customer","name":"Customer '$i'"},'
  for o in $(seq 1 20); do
    echo '{"type":"order","_id":"order/'$i'/'$o'", "for":"'$id'", "desc":"Desc '$i$o'"},'
  done
done
echo ']}'

It is a very likely scenario and it is enough to throw a Error: reduce_overflow_error.

IMHO the two option you have are:

Option 1: Optimized list function

With a little bit of work, you can build the JSON response by hand, so that you do not need to accumulate orders in an array.

I have edited the list function of jhs to avoid any use of arrays, so you can have customers with any number of orders.

function(head, req) {
  start({'headers':{'Content-Type':'application/json'}});

  var first_customer = true
    , first_order = true
    , row
    ;

  send('{"rows":[');

  while(row = getRow()) {
    if(row.key[1] === 2) {
      // Order for customer
      if (first_order) {
        first_order = false;
      } else {
        send(',');
      }
      send(JSON.stringify(row.value));
    }
    else if (row.key[1] === 1) {
      // New customer
      if (first_customer) {
        first_customer = false;
      } else {
        send(']},');
      }
      send('{"customer":');
      send(JSON.stringify(row.key[0]));
      send(',"orders":[');
      first_order = true;
    }
  }
  if (!first_customer)
    send(']}');

  send('\n]}');
}

Option 2: Optimize the documents for your use case

If you really need to have the orders in the same document, then ask yourself if you can store it this way and avoid any processing while querying.

In other words: try to fully exploit the possibilities offered by a document database. Design the documents to best fit your use case, and reduce the post-processing needed to use them.

like image 184
Marcello Nuccio Avatar answered Oct 15 '22 14:10

Marcello Nuccio


One of CouchDB'm main "opinions" is that it only does things that are also possible in a distributed, clustered, setting. In practice, this means some inconvenience at the beginning, with a pay-off later of big scalability without changing the code.

In other words, there is no perfect answer to the "join" question. But I think there are two pretty good options.

I'm working with this data set:

$ curl localhost:5984/so/_bulk_docs -XPOST -Hcontent-type:application/json -d @-
{"docs":[
{"type":"customer","name":"Jason"},
{"type":"customer","name":"Hunter"},
{"type":"customer","name":"Smith"},
{"type":"order", "for":"Jason", "desc":"Hat"},
{"type":"order", "for":"Jason", "desc":"Shoes"},
{"type":"order", "for":"Smith", "desc":"Pan"}
]}
^D

[{"id":"4cb766ebafda06d8a3a7382f74000b46","rev":"1-8769ac2fffb869e795c347e7b8c653bf"},
{"id":"4cb766ebafda06d8a3a7382f74000b7d","rev":"1-094eff3e3a5967d974fcd7b3cfd7e454"},
{"id":"4cb766ebafda06d8a3a7382f740019cb","rev":"1-5cda0b61da4c045ff503b57f614454d5"},
{"id":"4cb766ebafda06d8a3a7382f7400239d","rev":"1-50642a9809f15283a9d938c8fe28ef27"},
{"id":"4cb766ebafda06d8a3a7382f74002778","rev":"1-d03d883fb14a424e3db022350b38c510"},
{"id":"4cb766ebafda06d8a3a7382f74002c5c","rev":"1-e9612f5d267a8442d3fc2ae09e8c800d"}]

And my map function is

function(doc) {
  if(doc.type == 'customer')
    emit([doc.name, 1], "");
  if(doc.type == 'order')
    emit([doc.for, 2], doc.desc);
}

Querying the full view shows:

{"total_rows":6,"offset":0,"rows":[
{"id":"4cb766ebafda06d8a3a7382f74000b7d","key":["Hunter",1],"value":""},
{"id":"4cb766ebafda06d8a3a7382f74000b46","key":["Jason",1],"value":""},
{"id":"4cb766ebafda06d8a3a7382f7400239d","key":["Jason",2],"value":"Hat"},
{"id":"4cb766ebafda06d8a3a7382f74002778","key":["Jason",2],"value":"Shoes"},
{"id":"4cb766ebafda06d8a3a7382f740019cb","key":["Smith",1],"value":""},
{"id":"4cb766ebafda06d8a3a7382f74002c5c","key":["Smith",2],"value":"Pan"}
]}

Option 1: List function to gather results

The benefit is, if you ask for 10 rows, you will definitely get 10 (unless there isn't enough data of course).

But the cost is you have to do server-side processing for every query. The data you want was just sitting there on disk, ready to stream out to you, but now you've put it through this bottleneck.

However, I personally feel that unless you have a demonstrated performance issue, the _list is nice.

function(head, req) {
  start({'headers':{'Content-Type':'application/json'}});

  send('{"rows":');

  var customer = null, orders = [], count = 0;

  var prefix = '\n[ ';
  function show_orders() {
    if(customer && orders.length > 0) {
      count += 1;

      send(prefix);
      prefix = '\n, ';

      send(JSON.stringify({'customer':customer, 'orders':orders}));
    }
  }

  function done() {
    send('\n]}');
  }

  var row;
  while(row = getRow()) {
    if(row.key[1] == 2) {
      // Order for customer
      orders.push(row.value);
    }

    if(row.key[1] == 1) {
      // New customer
      show_orders();

      if(req.query.lim && count >= parseInt(req.query.lim)) {
        // Reached the limit
        done();
        return;
      } else {
        // Prepare for this customer.
        customer = row.key[0];
        orders = [];
      }
    }
  }

  // Show the last order set seen and finish.
  show_orders();
  done();
}

This function simply loops through the map rows and only outputs the full customer+orders row once all the information is collected. Obviously you can change the JSON format that you output. Also, there is a ?lim=X parameter, because using the parameter limit would interfere with the map query.

The danger is that this function builds an unlimited response in memory. What if a customer made 10,000 orders? Or 100,000? Eventually building the orders array will fail. That is why CouchDB keeps them in a "tall" list. If you won't ever get 10,000 orders per customer, then this isn't a problem.

$ curl 'http://localhost:5984/so/_design/ex/_list/ex/so?reduce=false&lim=2'
{"rows":
[ {"customer":"Jason","orders":["Hat","Shoes"]}
, {"customer":"Smith","orders":["Pan"]}
]}

Option 2: A sneaky reduce

You can do something similar with a reduce function. Right here, I'll warn you this is technically not scalable because you accumulate a response on disk, however I personally prefer it instead of _list because the code is simpler and I know I am directly reading data off disk, without post-processing.

function(keys, vals, re) {
  // If all keys are the same, then these are all
  // orders for the same customer, so accumulate
  // them. Otherwise, return something meaningless.
  var a;

  var first_customer = keys[0][0][0];
  for(a = 0; a < keys.length; a++)
    if(keys[a][0][0] !== first_customer)
      return null;

  var result = [];
  for(a = 0; a < vals.length; a++)
    if(vals[a]) {
      // Accumulate an order.
      result.push(vals[a]);
    }
  return result;
}

Always query this view with ?group_level=1 which will segment the results by customer (because the customer name was the first item in the map key).

This is against the law because you are not supposed to accumulate data during a reduce phase. That's why they call it reduce.

However, CouchDB is relaxed, and as long as you don't build giant lists, it should work and it's much more elegant.

$ curl 'localhost:5984/so/_design/ex/_view/so?group_level=1&limit=3'
{"rows":[
{"key":["Hunter"],"value":[]},
{"key":["Jason"],"value":["Shoes","Hat"]},
{"key":["Smith"],"value":["Pan"]}
]}

Good luck!

like image 35
JasonSmith Avatar answered Oct 15 '22 14:10

JasonSmith