I am going to be periodically pushing a set of text-based data from a web-page to a server, probably as JSON.
For every push, none, some or all of the data may have changed. To reduce the amount of data I have to send over the wire I would want to only send a diff of the changes in each push.
Do you know of any pre-made solutions / tools / libraries that:
Moreover, is this even the best way of going about this particular problem? Is there a better way to push chunks of text data around?
Edit: Some clarifications:
^ That that you would expect on random shared hosting. I'm talking your good friends PHP, Python, PERL, Ruby, those fullas. Or, something that could be easily installed on random shared hosting.
This has been something I've been struggling with as well. I'll be keenly interested if anyone else offers a better answer than mine, but for the time being...
first off, there is http://www.xn--schler-dya.net/blog/2008/01/15/diffing_json_objects/
I personally have not been able to get this library to work, but your milage may vary.
The other alternative is to not try to solve the problem using the DIFF algorithm. It's quite innefficient, and depending on the problem, you may get better performance metrics just sending the whole blob of data, even if you do end up repeating yourself. This is true mainly of very small chunks of data. Obviously there's going to be a turning point as the data you need to transmit gets larger, but it's not going to be obvious where the turning point is, without some kind of measurement. The trick here, is that the bigger your data gets, the longer your diff calculation is going to take too. The turning point is only determined by the intersection of the two lines formed by each method's rate of growth, both of which are going to be linear or worse, depending on how your diff is implemented. In a worst case scenario, you may see an island in the middle where diff gets better performance, but then crosses back again for even larger data sets, and just plain sending it over the network is better again.
Next stop before trying diff, is by wrapping your data access in "get", "set" and "delete" methods that track the changes being made. The data you send over the wire would essentially be a sequential log of these method's usage, which you flush from the client side on each successful transmission. On the serverside you apply this log to your serverside data with serverside analogues to your data access methods. This is a somewhat lighter solution than a diff that doesn't require quite as much processing power.
Finally, if you're going to do diff, the most efficient way I can think of is if you can break your dataset down into discrete "chunks", each with a unique ID. Then when you run the diff, the courseness of the diff is exactly at the "chunk" level. that is, the only comparisons you'd make is ID to ID. If you change a chunk, give it a new id. The courser you can afford to make the diff algorithm, the less time it will take to run.
Alternatively, rather than assigning a new ID on change, you could simply run the diff to check whether a specific object has "changed", stop short as soon as you detect a change, and simply mark that chunk to be re-sent in its entirity, to update the chunk on the server side with the same ID. This could be made even more efficient if you have some kind of quick hashing algorithm for your chunks that you can use to quickly establish equality.
If the sequence of your chunks doesn't matter, or if you can store the sequence as a property of the chunks themselves, rather than modeled by the physical sequence of the chunks, then you can even key your chunks by ID. Then discovering the differences is simply a matter of listing the keys of object A, and looking them up on object B, and then Vice Versa. This is much simpler to implement than a "real" diff algorithm, it has O(a+b) performance which( I think ) is better than the worst case scenario for a real diff algorithm, which you're likely to get if you're trying to implement it yourself, or get a bad implementation.
EtherPad solved something like this by turning each source change into a mathematical operation that could be commutative, associative, etc. I would consider this a non-trivial problem, though, and EtherPad was able to form a business around the solution to this problem.
edit: interestingly enough, this is exactly the problem that a good DVCS like Git tackles, just not on the browser.
I publish today a small jQuery plugin doing a diff between two JS objects. http://plugins.jquery.com/project/jquery-diff
I use it into an application to send to the server backend only the modified data.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With