Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to push diffs of data (possibly JSON) to a server?

I am going to be periodically pushing a set of text-based data from a web-page to a server, probably as JSON.

For every push, none, some or all of the data may have changed. To reduce the amount of data I have to send over the wire I would want to only send a diff of the changes in each push.

Do you know of any pre-made solutions / tools / libraries that:

  • Dynamically build a diff of JSON as changes are made to it (to avoid storing oldJson and newJson and doing a full diff every push) written in JavaScript (i.e. for the client-side)
  • Patch an existing chunk of JSON with a JSON diff on the server side, written on any platform that isn't Java or .NET^ (needs to run on linux, Java's not an option for the env I'm in, neither is Mono).

Moreover, is this even the best way of going about this particular problem? Is there a better way to push chunks of text data around?

Edit: Some clarifications:

  • The probable data stucture would be basically a fairly flat (in the sense that it's hightly connected so any links will be ID-based references not actual nested data) collection of nodes. Nodes contain a collection of trees, the leaves of these trees contain actual 'primative' data, such as numbers, strings and Ids. Most data change will be in the leaves.
    • Most of the leaf data will be very small (primatives or less than a paragraph of text) but some will be very long (pages of "rich" text).
  • For the moment we can consider this strictly one-to-one, i.e. there is only one client connected (in read / write) to any particular data structure.
  • It would be nice to keep the server as minimal as possible in terms of complexity-- the idea is to move away from having a server as much as possible. While HTML5 is still mostly unsupported I still need one to store data with though...

^ That that you would expect on random shared hosting. I'm talking your good friends PHP, Python, PERL, Ruby, those fullas. Or, something that could be easily installed on random shared hosting.

like image 468
SCdF Avatar asked Feb 25 '09 00:02

SCdF


3 Answers

This has been something I've been struggling with as well. I'll be keenly interested if anyone else offers a better answer than mine, but for the time being...

first off, there is http://www.xn--schler-dya.net/blog/2008/01/15/diffing_json_objects/

I personally have not been able to get this library to work, but your milage may vary.

The other alternative is to not try to solve the problem using the DIFF algorithm. It's quite innefficient, and depending on the problem, you may get better performance metrics just sending the whole blob of data, even if you do end up repeating yourself. This is true mainly of very small chunks of data. Obviously there's going to be a turning point as the data you need to transmit gets larger, but it's not going to be obvious where the turning point is, without some kind of measurement. The trick here, is that the bigger your data gets, the longer your diff calculation is going to take too. The turning point is only determined by the intersection of the two lines formed by each method's rate of growth, both of which are going to be linear or worse, depending on how your diff is implemented. In a worst case scenario, you may see an island in the middle where diff gets better performance, but then crosses back again for even larger data sets, and just plain sending it over the network is better again.

Next stop before trying diff, is by wrapping your data access in "get", "set" and "delete" methods that track the changes being made. The data you send over the wire would essentially be a sequential log of these method's usage, which you flush from the client side on each successful transmission. On the serverside you apply this log to your serverside data with serverside analogues to your data access methods. This is a somewhat lighter solution than a diff that doesn't require quite as much processing power.

Finally, if you're going to do diff, the most efficient way I can think of is if you can break your dataset down into discrete "chunks", each with a unique ID. Then when you run the diff, the courseness of the diff is exactly at the "chunk" level. that is, the only comparisons you'd make is ID to ID. If you change a chunk, give it a new id. The courser you can afford to make the diff algorithm, the less time it will take to run.

Alternatively, rather than assigning a new ID on change, you could simply run the diff to check whether a specific object has "changed", stop short as soon as you detect a change, and simply mark that chunk to be re-sent in its entirity, to update the chunk on the server side with the same ID. This could be made even more efficient if you have some kind of quick hashing algorithm for your chunks that you can use to quickly establish equality.

If the sequence of your chunks doesn't matter, or if you can store the sequence as a property of the chunks themselves, rather than modeled by the physical sequence of the chunks, then you can even key your chunks by ID. Then discovering the differences is simply a matter of listing the keys of object A, and looking them up on object B, and then Vice Versa. This is much simpler to implement than a "real" diff algorithm, it has O(a+b) performance which( I think ) is better than the worst case scenario for a real diff algorithm, which you're likely to get if you're trying to implement it yourself, or get a bad implementation.

like image 54
Breton Avatar answered Nov 18 '22 04:11

Breton


EtherPad solved something like this by turning each source change into a mathematical operation that could be commutative, associative, etc. I would consider this a non-trivial problem, though, and EtherPad was able to form a business around the solution to this problem.

edit: interestingly enough, this is exactly the problem that a good DVCS like Git tackles, just not on the browser.

like image 5
m104 Avatar answered Nov 18 '22 04:11

m104


I publish today a small jQuery plugin doing a diff between two JS objects. http://plugins.jquery.com/project/jquery-diff

I use it into an application to send to the server backend only the modified data.

like image 2
Marc Rutkowski Avatar answered Nov 18 '22 03:11

Marc Rutkowski