Very slow filters with CouchDB, even with Erlang

I have a CouchDB database with about 90k documents in it. The documents are very simple, like this:

{
   "_id": "1894496e-1c9e-4b40-9ba6-65ffeaca2ccf",
   "_rev": "1-2d978d19-3651-4af9-a8d5-b70759655e6a",
   "productName": "Cola"
}

At some point I want to sync this database with a mobile device. Obviously, 90k docs shouldn't go to the phone all at once, which is why I wrote filter functions. They are supposed to filter by "productName". I wrote them first in JavaScript and later in Erlang to gain performance. The filter functions look like this in JavaScript:

{
   "_id": "_design/local_filters",
   "_rev": "11-57abe842a82c9835d63597be2b05117d",
   "filters": {
       "by_fanta": "function(doc, req){ if(doc.productName == 'Fanta'){ return doc;}}",
       "by_wasser": "function(doc, req){if(doc.productName == 'Wasser'){ return doc;}}",
       "by_sprite": "function(doc, req){if(doc.productName == 'Sprite'){ return doc;}}"
   }
}

and like this in Erlang:

{
   "_id": "_design/erlang_filter",
   "_rev": "74-f537ec4b6508cee1995baacfddffa6d4",
   "language": "erlang",
   "filters": {
       "by_fanta": "fun({Doc}, {Req}) ->  case proplists:get_value(<<\"productName\">>, Doc) of <<\"Fanta\">> -> true; _ -> false end end.",
       "by_wasser": "fun({Doc}, {Req}) ->  case proplists:get_value(<<\"productName\">>, Doc) of <<\"Wasser\">> -> true; _ -> false end end.",
       "by_sprite": "fun({Doc}, {Req}) ->  case proplists:get_value(<<\"productName\">>, Doc) of <<\"Sprite\">> -> true; _ -> false end end."       
   }
}

To keep it simple, there is no query parameter yet, just a "hardcoded" string. The filters all work. The problem is that they are way too slow. I wrote a test program, first in Java and later in Perl, to measure the time it takes to filter the documents. Here is one of my Perl scripts:

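use LWP::Simple;               # provides get()
use DBIx::Class::TimeStamp;    # get_timestamp() returns a DateTime object

# Take a timestamp, fetch the filtered changes feed, then compute the elapsed time.
$dt = DBIx::Class::TimeStamp->get_timestamp();

$content = get("http://127.0.0.1:5984/mobile_product_test/_changes?filter=local_filters/by_sprite");

$dy = DBIx::Class::TimeStamp->get_timestamp() - $dt;
$dm = $dy->minutes();
$dz = $dy->seconds();

# One change per line; subtract the 3 lines that are not change rows.
@contArr = split("\n", $content);

$arraysz = @contArr;
$arraysz = $arraysz - 3;

$\="\n";
print($dm.':'.$dz.' with '.$arraysz.' Elements (JavaScript)');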

And now the sad part. These are the times I get:

2:35 with 2 Elements (Erlang)
2:40 with 10000 Elements (Erlang)
2:38 with 30000 Elements (Erlang)
2:31 with 2 Elements (JavaScript)
2:40 with 10000 Elements (JavaScript)
2:51 with 30000 Elements (JavaScript)

By the way, these times are in minutes:seconds. The number of elements is the number returned by the filter; the database had 90k documents in it. The big surprise was that the Erlang filter was not faster at all.

Requesting all documents takes only 9 seconds, and creating views takes about 15. But transferring all documents to a phone is not possible for my use case (for volume and security reasons).

Is there a way to filter on a view to get a performance increase? Or is something wrong with my Erlang filter functions? (I'm not surprised by the times for the JavaScript filters.)

EDIT: As pointed out by pgras, the reason why this is slow is posted in the answer to this question. To have the Erlang filters run faster, I need to go a "layer" below and program the Erlang directly into the database rather than as a _design document. But I don't really know where to start or how to do this. Any tips would be helpful.

Arne Fischer asked Mar 13 '13 14:03


3 Answers

I may be wrong, but filter functions should return boolean values, so try changing one to:

function(doc, req){ return doc.productName === 'Fanta';}

It may solve your performance problem...
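Since the three filters in the question differ only in the hardcoded product name, a single filter that reads the name from the request could replace them. This is a minimal sketch, assuming the name is passed as a query parameter called `productName` (a name chosen for this example); CouchDB passes the query string to the filter as `req.query`. It only tidies up the design document and is not expected to change how long the filter takes to run.

```javascript
// Sketch of a single parameterized filter. The query parameter name
// "productName" is an assumption made for this example.
const byProductName = function (doc, req) {
  // Always return a boolean: true keeps the change, false drops it.
  return doc.productName === req.query.productName;
};

// Local check with hand-written stand-ins for a document and a request.
const sampleDoc = { _id: 'a', productName: 'Fanta' };
console.log(byProductName(sampleDoc, { query: { productName: 'Fanta' } }));  // true
console.log(byProductName(sampleDoc, { query: { productName: 'Sprite' } })); // false
```

In the design document the function would be stored as a string under "filters" and requested with something like `_changes?filter=local_filters/by_product_name&productName=Fanta`, where `by_product_name` is a hypothetical filter name.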

Edit:

Here is an explanation about why it is slow (at least with JavaScript)...

One solution would be to use a view to select the ids of the documents to sync and then to start the sync by specifying the doc_ids to sync.

For example the view would be:

function(doc){
  emit(doc.productName, doc._id)
}

You could call the view with _design/docs/_view/by_productName?key="Fanta"

And then start the replication with the found doc ids...
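The two steps above could be wired together roughly as follows. This is only a sketch: it works on a hand-written sample of a view response instead of making HTTP calls, and the target URL is a placeholder. The resulting object is the kind of body one would POST to `/_replicate`, which accepts a `doc_ids` array.

```javascript
// Sketch: turn a view response into a body for POST /_replicate.
// The target passed in below is a placeholder, not a real endpoint.

// Collect the unique document ids from the rows of a view response.
function docIdsFromView(viewResponse) {
  return [...new Set(viewResponse.rows.map((row) => row.id))];
}

// Build the JSON body for a replication limited to the given ids.
function replicationBody(source, target, docIds) {
  return { source, target, doc_ids: docIds };
}

// Hand-written sample shaped like the response to ?key="Fanta".
const sampleResponse = {
  total_rows: 3,
  offset: 0,
  rows: [
    { id: 'doc-1', key: 'Fanta', value: 'doc-1' },
    { id: 'doc-2', key: 'Fanta', value: 'doc-2' },
  ],
};

const body = replicationBody(
  'mobile_product_test',           // database name from the question
  'http://phone.example/products', // hypothetical target
  docIdsFromView(sampleResponse)
);
console.log(JSON.stringify(body));
```

Each view row already carries the document id in its `id` field, so the sketch reads that field and does not depend on the emitted value.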

pgras answered Oct 18 '22 20:10


In general, CouchDB filters are slow. Others have already explained why. What I found was that the only reasonable way to use filters is together with the "since" parameter. Otherwise, in a reasonably large database (mine has 47k docs, and they are complex docs), filters don't work. We learnt this the hard way when migrating from dev to prod (a few hundred docs to ~47k docs). We also changed our design to query a view, and because we required behaviour like a continuous feed, we used Spring's @Scheduled to run the query periodically.
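To illustrate the "since" approach: the idea is to remember the `last_seq` value from each response and pass it as `since` on the next request, so the filter only runs over changes made after that point. A rough sketch with no HTTP calls; the helper names are invented for this example, and the base URL and database name are taken from the question.

```javascript
// Sketch: request the filtered changes feed incrementally using `since`.

// Build the _changes URL for one filter, starting after a given sequence.
function changesUrl(baseUrl, db, filter, since) {
  const sinceParam = encodeURIComponent(String(since));
  return `${baseUrl}/${db}/_changes?filter=${filter}&since=${sinceParam}`;
}

// From a parsed _changes response, return the ids to process and the
// sequence to resume from on the next poll.
function nextCheckpoint(changesResponse) {
  return {
    ids: changesResponse.results.map((change) => change.id),
    since: changesResponse.last_seq,
  };
}

// Hand-written sample shaped like a _changes response.
const sampleChanges = {
  results: [{ seq: 90001, id: 'doc-x', changes: [{ rev: '1-abc' }] }],
  last_seq: 90001,
};
const checkpoint = nextCheckpoint(sampleChanges);
console.log(changesUrl('http://127.0.0.1:5984', 'mobile_product_test',
  'local_filters/by_sprite', checkpoint.since));
```

The very first request (with no stored sequence) still walks the whole feed; the saving only shows up on later polls.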

Arindam Das answered Oct 18 '22 19:10


It has been a while since I asked this question, but I thought I would come back to it and share what we ended up doing to solve this.

So the short answer is that filter speed can't really be improved.

The reason lies in the way filters work. You can look at your database's changes here:

http://<ip>:<port>/<databaseName>/_changes

This resource lists all changes made to your database. Whenever you do anything in the database, new lines are simply appended. When you use a filter, it is parsed from JSON into the specified language and applied to every line of this feed. As far as I am aware, the parsing is also repeated for each line. This is not very efficient and can't be changed.
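A toy model makes the consequence visible. This is an in-memory imitation, not CouchDB's implementation: the filter is invoked once per change, so the work is the same whether it lets through many rows or none, which matches the nearly constant timings in the question.

```javascript
// Toy model of a filtered changes feed: the filter is invoked once per
// change, however few changes it lets through.
function filteredChanges(changes, filter) {
  let calls = 0;
  const results = changes.filter((doc) => {
    calls += 1;
    return filter(doc, {});
  });
  return { results, calls };
}

// 90,000 fake documents; every third one is a Sprite.
const docs = Array.from({ length: 90000 }, (_, i) => ({
  _id: `doc-${i}`,
  productName: i % 3 === 0 ? 'Sprite' : 'Cola',
}));

const sprite = filteredChanges(docs, (doc) => doc.productName === 'Sprite');
const fanta = filteredChanges(docs, (doc) => doc.productName === 'Fanta');
console.log(sprite.results.length, sprite.calls); // 30000 90000
console.log(fanta.results.length, fanta.calls);   // 0 90000
```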

So I personally think that for most use cases filters are too slow and can't be used. This means we have to find a way around them. I do not claim to have a general solution; I can only say that for us it was possible to use views instead of filters. Views build B-tree indexes internally and are lightning fast compared to filters. A simple view is also stored in a design document and could look like this:

{
"_id": "_design/all",
"language": "javascript",
"views": {
    "fantaView": {
        "map": "function(doc) { \n   if (doc.productName == 'Fanta')  \n    emit(doc.locale, doc)\n} "
    }
}
}

Here fantaView is the name of the view. I guess the function is self-explanatory. This is what we did; I hope it helps anyone who runs into a similar issue.
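To preview which rows such a view would produce, the stored map function can be run locally against a few sample documents. This is only a sketch that imitates the map step with a stand-in `emit`; it is not how CouchDB executes views, and the sample documents and their "locale" values are invented for the example.

```javascript
// Sketch: run a view's map function locally against sample documents
// to preview the rows it would emit. This imitates the map step only.
const designDoc = {
  _id: '_design/all',
  language: 'javascript',
  views: {
    fantaView: {
      map: "function(doc) { \n   if (doc.productName == 'Fanta')  \n    emit(doc.locale, doc)\n} ",
    },
  },
};

function previewRows(mapSource, sampleDocs) {
  const rows = [];
  let currentId;
  // Stand-in for CouchDB's emit(): record the key/value with the doc id.
  const emit = (key, value) => rows.push({ id: currentId, key, value });
  // Compile the stored source with our own emit() in scope.
  const map = new Function('emit', `return (${mapSource});`)(emit);
  for (const doc of sampleDocs) {
    currentId = doc._id;
    map(doc);
  }
  return rows;
}

// Hypothetical sample documents.
const samples = [
  { _id: '1', productName: 'Fanta', locale: 'de' },
  { _id: '2', productName: 'Cola', locale: 'de' },
  { _id: '3', productName: 'Fanta', locale: 'en' },
];
const rows = previewRows(designDoc.views.fantaView.map, samples);
console.log(rows.map((row) => `${row.id}:${row.key}`)); // [ '1:de', '3:en' ]
```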

Arne Fischer answered Oct 18 '22 21:10