Can CouchDB handle thousands of separate databases?

Tags:

couchdb

Can CouchDB handle thousands of separate databases on the same machine?

Imagine you have a collection of BankTransactions. There are many thousands of records. (EDIT: not actually storing transactions--just think of a very large number of very small, frequently updating records. It's basically a join table from SQL-land.)

Each day you want a summary view of transactions that occurred only at your local bank branch. If all the records are in a single database, regenerating the view will process all of the transactions from all of the branches. This is a much bigger chunk of work, and unnecessary for the user who cares only about his particular subset of documents.

This makes it seem like each bank branch should be partitioned into its own database, in order for the views to be generated in smaller chunks, and independently of each other. But I've never heard of anyone doing this, and it seems like an anti-pattern (e.g. duplicating the same design document across thousands of different databases).

Is there a different way I should be modeling this problem? (Should the partitioning happen between separate machines, not separate databases on the same machine?) If not, can CouchDB handle the thousands of databases it will take to keep the partitions small?

(Thanks!)


asked Mar 27 '12 10:03

elliot42

3 Answers

I know this question is old, but wanted to note that now with more recent versions of CouchDB (3.0+), partitioned databases are supported, which addresses this situation. So you can have a single database for transactions, and partition them by bank branch. You can then query all transactions as you would before, or query just for those from a specific branch, and only the shards where that branch's data is stored will be accessed.
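To make that concrete, here is a minimal sketch of the URLs and document ids involved, assuming a local CouchDB 3.x node. The host, the database name `transactions`, the branch name, and the design document and view names are all made up for illustration. It only builds the strings; you would send them with whatever HTTP client you prefer.

```python
from urllib.parse import quote

COUCH_URL = "http://localhost:5984"  # assumption: local CouchDB 3.x node
DB = "transactions"                  # hypothetical database name


def create_db_url(base=COUCH_URL, db=DB):
    """URL to PUT once so the database is created as partitioned."""
    return f"{base}/{db}?partitioned=true"


def partitioned_doc_id(branch, txn_id):
    """Documents in a partitioned database use '<partition>:<key>' ids."""
    return f"{branch}:{txn_id}"


def branch_view_url(branch, ddoc, view, base=COUCH_URL, db=DB):
    """URL to GET for a view query restricted to one branch's partition."""
    partition = quote(branch, safe="")
    return f"{base}/{db}/_partition/{partition}/_design/{ddoc}/_view/{view}"


# Hypothetical branch, design document, and view names.
print(create_db_url())
print(partitioned_doc_id("branch-042", "txn-0001"))
print(branch_view_url("branch-042", "summary", "daily_totals"))
```

The key design point is the document id: the part before the colon is the partition, so choosing the branch as the prefix is what lets a per-branch query stay within that branch's data.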


answered Jan 01 '23 21:01

ruhnet

[Warning, I'm assuming you're running this in some sort of production environment. Just go with the short answer if this is for a school or pet project.]

The short answer is "yes".

The longer answer is that there are some things you need to watch out for...

You're going to be playing whack-a-mole with a lot of system settings, like the maximum number of file descriptors.
You'll also be playing whack-a-mole with Erlang VM settings.
CouchDB has a "max open databases" option (max_dbs_open). Increase this or you're going to have pending requests piling up.
It's going to be a PITA to aggregate multiple databases to generate reports. You can do it by polling each database's _changes feed, modifying the data, and then throwing it back into a central/aggregating database. The tooling to make this easier is just not there yet in CouchDB's API. Almost, but not quite.
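For the aggregation point, a rough sketch of the poll/modify/write-back loop might look like the following. The host, the `source_db` field, and the "source database name as id prefix" scheme are my assumptions, not anything CouchDB prescribes. It also only handles first-time copies: updating a document that already exists in the aggregate database would need that document's current `_rev` there, which this sketch does not look up.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

COUCH_URL = "http://localhost:5984"  # assumption: local CouchDB node


def fetch_changes(db, since="0"):
    """GET /{db}/_changes with the full documents included."""
    query = urlencode({"include_docs": "true", "since": since})
    with urlopen(f"{COUCH_URL}/{db}/_changes?{query}") as resp:
        return json.load(resp)


def to_aggregate_docs(source_db, changes):
    """Turn a _changes response into docs for the aggregating database.

    Skips deletions and design documents, drops the source _rev, and
    prefixes each id with the source database name (hypothetical scheme).
    """
    docs = []
    for row in changes["results"]:
        doc = row.get("doc")
        if row.get("deleted") or doc is None or row["id"].startswith("_design/"):
            continue
        copy = {k: v for k, v in doc.items() if k != "_rev"}
        copy["_id"] = f"{source_db}:{row['id']}"
        copy["source_db"] = source_db
        docs.append(copy)
    return docs


def push_docs(target_db, docs):
    """POST the transformed docs to /{target_db}/_bulk_docs."""
    body = json.dumps({"docs": docs}).encode("utf-8")
    req = Request(
        f"{COUCH_URL}/{target_db}/_bulk_docs",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urlopen(req) as resp:
        return json.load(resp)
```

In use, you would loop over the branch databases, call `fetch_changes(db, since=last_seq_you_stored)`, pass the result through `to_aggregate_docs`, and hand that to `push_docs`, remembering each database's `last_seq` for the next round.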

However, the biggest problem that you're going to run into if you try to do this is that CouchDB does not horizontally scale [well] by itself. If you add more CouchDB servers, they're all going to have duplicates of the data. Sure, your max open dbs count will scale linearly with each node added, but other things like view build time won't (e.g., every node will need to do its own view builds).

By contrast, I've seen thousands of open databases on a BigCouch cluster. Anecdotally, that's because of Dynamo-style clustering: more nodes doing different things in parallel, versus walled-off CouchDB servers replicating to one another.

Cheers.

answered Jan 01 '23 20:01

Sam Bisbee

Multiple databases are possible, but for most cases I think the aggregate database will actually give better performance to your branches. Keep in mind that the only thing you're optimizing is when a document gets indexed into the view; each document is processed only once per view.

For end-of-day polling in an aggregate database, the first branch will cause 100% of the new docs to be processed, and pay 100% of the delay. All other branches will pay 0%. So most branches benefit. For end-of-day polling in separate databases, all branches pay a portion of the penalty proportional to their volume, so most come out slightly behind.
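Here is that end-of-day argument as a toy calculation. It assumes the indexing delay is simply proportional to the number of new documents, and the branch volumes are made-up numbers, so treat it as an illustration of the reasoning rather than a benchmark.

```python
def end_of_day_delay(volumes, first_poller=0):
    """Return (aggregate, separate) delays per branch, in 'documents to index'.

    Aggregate database: whoever polls first pays for every new document,
    everyone after that pays nothing.
    Separate databases: each branch pays for its own documents only.
    """
    total = sum(volumes)
    aggregate = [total if i == first_poller else 0 for i in range(len(volumes))]
    separate = list(volumes)
    return aggregate, separate


# Hypothetical day: ten branches with 100 new documents each.
volumes = [100] * 10
aggregate, separate = end_of_day_delay(volumes)

# Branches that wait less with the aggregate database than with their own.
better_off = sum(a < s for a, s in zip(aggregate, separate))
print(f"{better_off} of {len(volumes)} branches wait less with the aggregate database")
```

Only the first poller is worse off in the aggregate case, and it pays the whole day's indexing cost on its own.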

For frequent view updates throughout the day, active branches prefer the aggregate and low-volume branches prefer separate. If one branch in 10 adds 99% of the documents, most of the update work will be done on other branches' polls, so 9 out of 10 prefer separate dbs.

If this latency matters, and assuming Couch has some clock cycles going unused, you could write a 3-line loop/view/sleep shell script that queries the view periodically, so new documents are indexed before any user is waiting.
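The same loop/view/sleep idea in Python rather than shell, as a sketch. The view URL (host, database, design document, and view name) is hypothetical, and `limit=0` is there only so the response stays tiny. The `iterations`, `fetch`, and `sleep` parameters exist so the loop can be tried out without a running server. There is no error handling, so a failed request will stop the loop.

```python
import time
from urllib.request import urlopen

# Hypothetical view URL; limit=0 keeps the response body small.
VIEW_URL = (
    "http://localhost:5984/transactions"
    "/_design/summary/_view/daily_totals?limit=0"
)


def query_view(url):
    """Query the view and discard the result; the point is to trigger indexing."""
    with urlopen(url) as resp:
        resp.read()


def keep_view_warm(url=VIEW_URL, interval_seconds=60, iterations=None,
                   fetch=query_view, sleep=time.sleep):
    """Loop: query the view, sleep, repeat. Runs forever if iterations is None."""
    done = 0
    while iterations is None or done < iterations:
        fetch(url)
        sleep(interval_seconds)
        done += 1
    return done
```

Calling `keep_view_warm()` with the defaults polls once a minute until you kill it; pick an interval based on how stale a view you can tolerate versus how much background work you want Couch doing.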

answered Jan 01 '23 20:01

Jim
