
DocumentDB Change Feed - How to see all changes to a document

This new Change Feed feature provided by DocumentDB is pretty cool. However, the documentation states:

Each change to a document appears only once in the change feed. Only the most recent change for a given document is included in the change log. Intermediate changes may not be available.

Basically, if a document goes through revisions A->B->C, then when the change feed is polled we only get "C". I have a situation where I want to see "A" and "B" as well.
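
To make the behavior concrete, here is a minimal sketch of a feed that keeps only the latest version of each document. It is a toy in-memory model, not the DocumentDB SDK: `LatestOnlyFeed`, `write`, and `read_changes` are hypothetical names used only for illustration.

```python
class LatestOnlyFeed:
    """Toy model of a feed that keeps only the newest version of each document."""

    def __init__(self):
        self._seq = 0       # logical sequence number, incremented on every write
        self._latest = {}   # document id -> (sequence number, body)

    def write(self, doc_id, body):
        self._seq += 1
        # Overwrites whatever earlier version was recorded for this id.
        self._latest[doc_id] = (self._seq, body)

    def read_changes(self, after=0):
        """Return (changes, continuation) for documents modified after `after`."""
        changed = sorted(
            (seq, doc_id, body)
            for doc_id, (seq, body) in self._latest.items()
            if seq > after
        )
        return [(doc_id, body) for _, doc_id, body in changed], self._seq


feed = LatestOnlyFeed()
for revision in ("A", "B", "C"):
    feed.write("doc1", revision)

# One poll after three writes: the intermediate revisions are gone.
changes, token = feed.read_changes()
print(changes)  # [('doc1', 'C')]
```

In this model, a poller only sees "A" and "B" if it happens to read between the writes.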

I know of a few existing patterns to solve this, but I was really hoping to leverage this new Change Feed feature. I hoped it would return A, B, and C.

Is the intent of this feature to have "workers" polling the service very frequently? Obviously, the more frequently workers poll, the less likely they are to skip a revision to a document. However, I wouldn't want to adversely affect performance of the collection as a result.

Jmoney38 asked Dec 16 '16 01:12



1 Answer

DocumentDB team member here. I'll start off by saying: please propose or vote for support for all versions/generations of a document here: http://feedback.azure.com/forums/263030-documentdb

The intent of Change Feed supporting the latest version was for two reasons:

  1. Many problems, such as data synchronization and stream processing, rely only on the latest version and do not need the intermediate versions.
  2. This approach has the advantage of not requiring additional storage for all versions, and of not limiting change feed availability to a time period.

You mentioned you're already aware of workarounds, but I'll state this for the benefit of others: this problem can be solved by inverting what's stored in DocumentDB. That is, you can store every version in DocumentDB by creating a new document for each one, then consolidate them via change feed by upserting the latest version.
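
A minimal sketch of that inversion, using a toy in-memory collection rather than the DocumentDB SDK. Everything here is an assumption made for illustration: the `Collection` class, the `save_version` and `consolidate` helpers, the `entityId:version` id scheme, and the choice to consolidate into a second collection.

```python
class Collection:
    """Toy in-memory collection with a latest-version-only change feed."""

    def __init__(self):
        self._seq = 0      # logical sequence number, incremented on every write
        self._docs = {}    # document id -> (sequence number, document)

    def upsert(self, doc):
        self._seq += 1
        self._docs[doc["id"]] = (self._seq, dict(doc))

    def read_changes(self, after=0):
        """Return (documents changed after `after`, continuation token)."""
        rows = sorted(
            (seq, doc["id"]) for seq, doc in self._docs.values() if seq > after
        )
        return [dict(self._docs[doc_id][1]) for _, doc_id in rows], self._seq

    def get(self, doc_id):
        return self._docs[doc_id][1]


def save_version(history, entity_id, version, body):
    # Each revision gets its own id, so no write overwrites an earlier one.
    history.upsert({
        "id": f"{entity_id}:{version}",
        "entityId": entity_id,
        "version": version,
        "body": body,
    })


def consolidate(history, current, token=0):
    """Read new versions from the feed and upsert each into the 'current' view."""
    changes, token = history.read_changes(after=token)
    for doc in changes:
        current.upsert({
            "id": doc["entityId"],
            "version": doc["version"],
            "body": doc["body"],
        })
    return changes, token


history, current = Collection(), Collection()
for version, body in enumerate(("A", "B", "C"), start=1):
    save_version(history, "doc1", version, body)

changes, token = consolidate(history, current)
print([d["body"] for d in changes])   # ['A', 'B', 'C']
print(current.get("doc1")["body"])    # C
```

Because every revision is a distinct document, the feed returns all of them even when it is polled only once, and the consolidated view still ends up holding the latest version.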

To answer the question in the comments: you should absolutely use Change Feed rather than querying by timestamp, for the following reasons:

  1. Change Feed is much more efficient. Querying "order by timestamp" across a distributed dataset performs a global sort, whereas Change Feed only sorts by timestamp locally within each partition (a partial order). Additionally, there's no query parsing overhead.
  2. Clock time is less meaningful in distributed systems due to clock skew, and differentiating between multiple updates within a second/millisecond can be important. Instead, you need the "logical time" representing the exact commit order within the database. With change feed, updates within a partition key are in exact order of commit, and you get all documents updated within a transaction stamped with the same logical timestamp.
  3. Change Feed can be consumed in a distributed manner across multiple workers, unlike a query. This is great when working with a downstream scalable compute framework like Apache Storm or Azure Functions.
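
The second and third points can be sketched with a toy model in which each partition keeps its own commit-ordered log and each worker tracks its own continuation token. This is not the DocumentDB SDK: `PartitionedFeed`, `run_worker`, the byte-sum hash, and the integer tokens are all assumptions made for illustration, and each write is treated as a distinct document.

```python
class PartitionedFeed:
    """Toy feed: one commit-ordered log per partition, no global order."""

    def __init__(self, partition_count):
        self._partitions = [[] for _ in range(partition_count)]

    def _partition_for(self, partition_key):
        # Deterministic toy hash; a real service uses its own partitioning scheme.
        return sum(partition_key.encode()) % len(self._partitions)

    def write(self, partition_key, body):
        log = self._partitions[self._partition_for(partition_key)]
        # The logical sequence number is just the commit position in this partition.
        log.append((len(log) + 1, partition_key, body))

    def read_changes(self, partition, after=0):
        """Return (entries committed after `after`, continuation token)."""
        log = self._partitions[partition]
        batch = [entry for entry in log if entry[0] > after]
        token = log[-1][0] if log else after
        return batch, token


def run_worker(feed, partition, token=0):
    """One worker owns one partition and resumes from its own token."""
    batch, token = feed.read_changes(partition, after=token)
    return [(key, body) for _, key, body in batch], token


feed = PartitionedFeed(partition_count=2)
feed.write("a", "a1")
feed.write("b", "b1")
feed.write("a", "a2")

# Workers are independent; order is only guaranteed within a partition.
results = {p: run_worker(feed, p)[0] for p in range(2)}
print(results)  # {0: [('b', 'b1')], 1: [('a', 'a1'), ('a', 'a2')]}
```

No wall-clock timestamps are involved: the commit position within a partition stands in for the "logical time", and nothing has to be sorted across partitions.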
Aravind Krishna R. answered Nov 15 '22 08:11
