
CouchDB attachment manipulation before document update

Tags:

couchdb

I need to transform the images attached to every document (specifically, the images must be shrunk to 400px width). What is the best way to achieve that? I was thinking of having Node.js code listen on _changes and perform the necessary manipulations when a document is saved. However, this has a bunch of drawbacks: a) a document change does not always mean that a new attachment was added; b) we would constantly have to process images that are already shrunk (or at least check the image width).

Archer asked Apr 07 '12 09:04



1 Answer

I think you basically have some data in a database and most of your problem is simply application logic and implementation. I could imagine a very similar requirements list for an application using Drizzle. Anyway, how can your application "cut with the grain" and use CouchDB's strengths?

A Node.js _changes listener sounds like a very good starting point. Node.js has plenty of hype and silly debates. But for receiving a "to-do list" from CouchDB and executing that list concurrently, Node.js is ideal.

Memoizing

I immediately think that image metadata in the document will help you. Fetching an image and checking whether it is 400px wide could get expensive. If you could indicate "shrunk":true or "width":400 or something like that in the document, you would immediately know to skip the document. (This is an optimization, so you could skip it during the early phase of your project.)

But how do you keep the metadata in sync with the images? Maybe somebody will attach a large image later, and the metadata still says "shrunk":true. One answer is the validation function. validate_doc_update() has the privilege of examining both the old and the new (candidate) document version. If it is not satisfied, it can throw() an exception to prevent the change. So it could enforce your policy in a few ways:

  • Any time new images are attached, the "shrunk" key must also be deleted
  • Or, your external Node.js tool has a dedicated username to access CouchDB. Documents must never set "shrunk":true unless the user is your tool
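
A minimal sketch of how those two policies might be combined in one validation function. The username "shrinker" is hypothetical, standing in for whatever account your Node.js tool uses. The sketch compares attachment digests between the old and new revisions and does not rely on a freshly uploaded attachment already carrying a digest at validation time.

```javascript
// Sketch of a validate_doc_update function for a design document.
// "shrinker" is a hypothetical username for the external Node.js tool.
function validate_doc_update(newDoc, oldDoc, userCtx, secObj) {
  if(newDoc._deleted || userCtx.name === "shrinker")
    return // Deletions and the tool itself are always allowed

  if(!("shrunk" in newDoc))
    return // No claim is made, so there is nothing to check

  // Ordinary users may only carry an existing flag forward unchanged.
  // oldDoc is null when the document is being created.
  var old_flag = oldDoc ? oldDoc.shrunk : undefined
  if(newDoc.shrunk !== old_flag)
    throw({forbidden: 'Only the shrinker may set "shrunk"'})

  // And only if no attachment was added or replaced in this update
  var new_atts = newDoc._attachments || {}
  var old_atts = (oldDoc && oldDoc._attachments) || {}
  for (var name in new_atts)
    if(!old_atts[name] || old_atts[name].digest !== new_atts[name].digest)
      throw({forbidden: 'Delete "shrunk" when attaching new images'})
}
```

In a real design document the function is stored as a string under the validate_doc_update key; it is written as a named function here only so that it can be exercised outside CouchDB.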

Another idea worth investigating: instead of setting "shrunk":true, set it to the MD5 checksum of the image that was shrunk. (The checksum is already in the document, in the ._attachments object.) If the stored checksum does not match the attachment's current digest, as in the document below, your Node.js tool knows that it has work to do.

{ "_id": "a_doc"
, "shrunk": "md5-D2yx50i1wwF37YAtZYhy4Q=="
, "_attachments":
  { "an_image.png":
    { "content_type":"image/png"
    , "revpos": 1
    , "digest": "md5-55LMUZwLfzmiKDySOGNiBg=="
    }
  }
}

In other words:

if(doc.shrunk == doc._attachments["an_image.png"].digest)
  console.log("This doc is fine")
else
  console.log("Uh oh, I need to check %s and maybe shrink the image", doc._id)
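
The comparison above handles one known attachment name. If a document can carry several images, one possible variation (an assumption, not the layout shown above) is to make "shrunk" an object that maps each attachment name to the digest that was processed. The function name images_needing_work and the sample document are made up for illustration.

```javascript
// Return the names of image attachments whose current digest is not recorded
// in doc.shrunk. Assumes doc.shrunk looks like {"an_image.png": "md5-..."}.
function images_needing_work(doc) {
  var attachments = doc._attachments || {}
  var shrunk = doc.shrunk || {}

  return Object.keys(attachments).filter(function(name) {
    var att = attachments[name]
    var is_image = /^image\//.test(att.content_type || "")
    return is_image && shrunk[name] !== att.digest
  })
}

// Made-up example document: one processed image, one new image, one text file
var doc =
  { "_id": "a_doc"
  , "shrunk": { "done.png": "md5-55LMUZwLfzmiKDySOGNiBg==" }
  , "_attachments":
    { "done.png" : { "content_type":"image/png", "digest":"md5-55LMUZwLfzmiKDySOGNiBg==" }
    , "new.png"  : { "content_type":"image/png", "digest":"md5-D2yx50i1wwF37YAtZYhy4Q==" }
    , "notes.txt": { "content_type":"text/plain", "digest":"md5-AAAAAAAAAAAAAAAAAAAAAA==" }
    }
  }

console.log(images_needing_work(doc)) // [ 'new.png' ]
```

An empty result means the document can be skipped without fetching any image data.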

Execution

I am biased because I wrote the following tools. However, I have had success, and others have reported success, using the Node.js package Follow to watch the _changes events: https://github.com/iriscouch/follow

And then use Txn for ACID transactions in the CouchDB documents: https://github.com/iriscouch/txn

The pattern is:

  • Run follow() on the _changes URL, perhaps with "include_docs":true in the options.
  • For each change, decide if it needs work. If it does, execute a function to make the necessary changes, and let txn() take care of fetching and updating, and possible retries if there is a temporary error.

For example, Txn helps you atomically resize the image and also update the metadata, pretty easily.

Finally, if your program crashes, you might fetch a lot of documents that you already processed. That might be okay (if you have your metadata working); however you might want to record a checkpoint occasionally. Remember which changes you saw.

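var follow = require("follow") // npm install follow

var db = "http://localhost:5984/my_db"
var checkpoint = get_the_checkpoint_somehow() // Synchronous, for simplicity

follow({"db":db, "since":checkpoint}, function(er, change) {
  if(er)
    throw er // Or log and recover, whatever suits your application

  // Assumes integer sequence numbers, as in CouchDB 1.x
  if(change.seq % 100 == 0)
    store_the_checkpoint_somehow(change.seq) // Another synchronous call
})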
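
The two "somehow" helpers are left open above. One hedged possibility, assuming a single worker process and a local file, is sketched here; the file location is hypothetical, and a real deployment would choose a durable path rather than the temporary directory.

```javascript
var fs = require("fs")
var os = require("os")
var path = require("path")

// Hypothetical location; any path that survives a restart would do.
var CHECKPOINT_FILE = path.join(os.tmpdir(), "shrinker_checkpoint.json")

// Return the last stored sequence, or 0 to start from the beginning.
function get_the_checkpoint_somehow() {
  try {
    var stored = JSON.parse(fs.readFileSync(CHECKPOINT_FILE, "utf8"))
    return stored.seq === undefined ? 0 : stored.seq
  } catch (er) {
    return 0 // Missing or unreadable file: replay the whole feed
  }
}

// Write to a temporary file, then rename it into place, so that a crash
// in the middle of a write cannot leave a truncated checkpoint behind.
function store_the_checkpoint_somehow(seq) {
  var tmp = CHECKPOINT_FILE + ".tmp"
  fs.writeFileSync(tmp, JSON.stringify({"seq": seq}))
  fs.renameSync(tmp, CHECKPOINT_FILE)
}
```

Falling back to 0 is safe only because of the metadata check: documents that were already shrunk are seen again but skipped.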

Work queue

Again, I am embarrassed to point to all my own tools. But image processing is a classic example of a work queue situation. Every document that needs work is placed in the queue. An unlimited, elastic army of workers takes jobs from it: each worker receives a job, fixes the document, and marks the job done (deleted).

I use this a lot myself, and that is why I made CQS, the CouchDB Queue System: https://github.com/iriscouch/cqs

It is for Node.js, and it is identical to Amazon SQS, except it uses your own CouchDB server. If you are already using CouchDB, then CQS might simplify your project.

JasonSmith answered Sep 24 '22 22:09
