Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

CouchDB database structure - best practice

Being new to CouchDB, just wanted to discuss the best practice for structuring a database and documents. My background is from MySQL, so still trying to get a handle on document-driven databases.

To outline the system, we have several clients who each access a separate website with separate data. Each clients data will be split into its own database. Each database will have data constantly inserted (every 5 minutes, for at least a year) for logging events. A new document is created every 5 minutes with a timestamp and value. We also need to store some information about the client, which is a single document that doesn't ever get updated (if so, very rarely).

Below is an example of how one client database looks...

{
    "_id": "client_info",
    "name": "Client Name",
    "role": "admin",
    ....
},
{
    "_id": "1199145600",
    "alert_1_value": 0.150
    "alert_2_value": 1.030
    "alert_3_value": 12.500
    ...
    ...
},
{
    "_id": "1199145900",
    "alert_1_value": 0.150
    "alert_2_value": 1.030
    "alert_3_value": 12.500
    ...
    ...
},
{
    "_id": "1199146200",
    "alert_1_value": 0.150
    "alert_2_value": 1.030
    "alert_3_value": 12.500
    ...
    ...
},
etc...literally millions more of these every 5 minutes...

My question is, is this sort of structure correct? I understand CouchDB is a flat-file database, but there will be literally millions of the timestamp/value documents in the database. I may just be being picky, but it just seems a little disorganized to me.

Thanks!

like image 821
crawf Avatar asked Oct 25 '22 07:10

crawf


1 Answers

Use the timestamp as your id if it's guaranteed to be unique. This dramatically improves the ability of couch to maintain its b-tree for things like building and maintaining views as well as docs, and also it will save you len([_id]) space too.

Each doc you add (for such small data) adds some overhead in b-tree space. In your view (the logical equivalent of your SQL query) you can always parse the doc fields and emit them separately, or multiple times, if needed.

This type of unchanging data is great fit for CouchDB. As the data is added to couch, you can trigger a view update periodically, and the view will build the query data in advance. This means that, unlike SQL, where you'd calculate that aggregate date on the fly each time, couch will simply read that data, cached in the view b-tree's intermediate nodes. Much faster.

So the typical CouchDB approach is: - model your transactions to minimise the # of docs (i.e. denormalise) - use different views if needed to filter or sort results differently.

I guess you'll want to produce aggregate stats across that time period. Likely this will be much more efficient (CPU wise) in erlang; so take a look at https://github.com/apache/couchdb/blob/trunk/src/couchdb/couch_query_servers.erl#L172-205 to see how they're done.

like image 66
dch Avatar answered Oct 27 '22 10:10

dch