
Find what has been changed and upload only changes

I'm just looking for ideas/suggestions here; I'm not asking for a full-on solution (although if you have one, I'd be happy to look at it).

I'm trying to find a way to only upload changes to text. It's most likely going to be used as a cloud-based application running on jQuery and HTML, with a PHP server running the back-end.

For example, if I have text like

asdfghjklasdfghjkl

And I change it to

asdfghjklXasdfghjkl

I don't want to have to upload the whole thing (the text can get pretty big).

For example, something like 8,X sent to the server could signify: add an X to the 8th position

Or D8,3 could signify: go to position 8 and delete the previous 3 characters
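
A minimal sketch of how a server might interpret those two example commands. The function name `applyOp` is hypothetical, and the sketch assumes positions are 0-based character offsets, which the examples above do not specify:

```javascript
// Hypothetical op format taken from the examples above:
//   "<pos>,<text>"   inserts <text> at offset <pos>
//   "D<pos>,<count>" deletes the <count> characters before offset <pos>
// Assumption: <pos> is a 0-based offset into the string.
function applyOp(text, op) {
  const comma = op.indexOf(",");
  if (comma === -1) throw new Error("malformed op: " + op);
  const isDelete = op[0] === "D";
  const posText = op.slice(isDelete ? 1 : 0, comma);
  const pos = Number(posText);
  if (posText === "" || !Number.isInteger(pos) || pos < 0 || pos > text.length) {
    throw new Error("position out of range: " + op);
  }
  const arg = op.slice(comma + 1);
  if (isDelete) {
    const count = Number(arg);
    if (arg === "" || !Number.isInteger(count) || count < 0 || count > pos) {
      throw new Error("bad delete count: " + op);
    }
    return text.slice(0, pos - count) + text.slice(pos);
  }
  return text.slice(0, pos) + arg + text.slice(pos);
}
```

Under that 0-based assumption, the edit shown earlier would be sent as `9,X`, since the X lands after the first nine characters.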

However, if a single request is corrupted en route to the server, the whole document could be corrupted, since every later position would be off. A simple hash could detect the corruption, but how would one recover from it? The client has all of the data, but the data may be very large, so re-uploading the whole thing is unlikely to be practical.
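
One way the hash idea could look, as a sketch only: the client sends each small change together with a hash of the text it expects the server to end up with, and the server keeps its old copy whenever the hashes disagree, so only that one small change needs to be resent. The names `fnv1a` and `acceptIfHashMatches` are hypothetical, and FNV-1a is just a convenient non-cryptographic example (here computed over UTF-16 code units rather than bytes):

```javascript
// 32-bit FNV-1a over the string's UTF-16 code units.
// Assumption: a cheap non-cryptographic checksum is enough to spot accidental damage.
function fnv1a(text) {
  let hash = 0x811c9dc5;
  for (let i = 0; i < text.length; i++) {
    hash ^= text.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193);
  }
  return hash >>> 0; // force an unsigned 32-bit result
}

// Server side: candidateText is the result of applying the received change.
// Accept it only if it hashes to what the client said the result should be.
function acceptIfHashMatches(serverText, candidateText, expectedHash) {
  if (fnv1a(candidateText) === expectedHash) {
    return { ok: true, text: candidateText };
  }
  // Mismatch: keep the old copy and let the client resend the small change.
  return { ok: false, text: serverText };
}
```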

So thanks for reading through this. Here is a short summary of what I'd like suggestions on:

  • Change/Modification Detection
  • Method to communicate the changes
  • Recovery from corruption
  • Anything else that needs improvement
Kranu asked Nov 05 '10 06:11


3 Answers

There is already an accepted form for transmitting this kind of "differences" information. It's called Unified Diff.

The google-diff-match-patch library provides implementations in Java, JavaScript, C++, C#, Lua and Python.

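You should be able to just keep the "original text" and the "modified text" in variables on the client, generate the diff in JavaScript (via diff-match-patch), send it to the server along with a hash, and reconstruct the text on the server using diff-match-patch. The unix "patch" program is another option if you produce a true line-based Unified Diff; as far as I know, diff-match-patch's own patch text only resembles that format (it is character-based), so "patch" may not accept it as-is.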
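
To illustrate the "keep both texts and diff them on the client" idea without depending on any library, here is a deliberately simplified stand-in. It is not diff-match-patch and not Unified Diff: it only trims the common prefix and suffix and reports the single changed region in between. The names `simpleDiff` and `applySimpleDiff` and the `{pos, del, ins}` shape are made up for this sketch:

```javascript
// Find the one region that differs between oldText and newText.
// Returns { pos, del, ins }: at offset pos, delete del characters, then insert ins.
function simpleDiff(oldText, newText) {
  let start = 0;
  const maxStart = Math.min(oldText.length, newText.length);
  while (start < maxStart && oldText[start] === newText[start]) start++;

  let oldEnd = oldText.length;
  let newEnd = newText.length;
  while (oldEnd > start && newEnd > start &&
         oldText[oldEnd - 1] === newText[newEnd - 1]) {
    oldEnd--;
    newEnd--;
  }
  return { pos: start, del: oldEnd - start, ins: newText.slice(start, newEnd) };
}

// Server side: rebuild the new text from the old text plus the small diff.
function applySimpleDiff(text, d) {
  return text.slice(0, d.pos) + d.ins + text.slice(d.pos + d.del);
}
```

For a single local edit this keeps the payload tiny; for several scattered edits it degrades to sending everything between the first and last change, which is where a real diff library earns its keep.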

You might also want to consider including a "version" (or a modified date) when you send the original text to the client in the first place. Then include the same version (or date) in the "diff request" that the client sends up to the server. Verify the version on the server before applying the diff, so you can be sure that the server's copy of the text has not diverged from the client's copy while the modification was being made. Of course, for this to work, you'll need to update the version number on the server every time the master copy is updated.
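
The version check could look roughly like this. It is a sketch with hypothetical names (`applyVersionedEdit`, `baseVersion`), written in JavaScript for consistency even though the server in the question is PHP; `applyEdit` stands for whatever function actually applies the diff:

```javascript
// doc:     { text, version }       the server's master copy
// request: { baseVersion, edit }   what the client sends up
// applyEdit(text, edit) -> new text (supplied by the caller)
function applyVersionedEdit(doc, request, applyEdit) {
  if (request.baseVersion !== doc.version) {
    // The server copy changed since the client fetched it: refuse the diff.
    return { ok: false, reason: "stale", doc: doc };
  }
  return {
    ok: true,
    doc: {
      text: applyEdit(doc.text, request.edit),
      version: doc.version + 1, // bump on every accepted change
    },
  };
}
```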

Lee answered Nov 17 '22 08:11


You have a really interesting approach. But if the text files are really so large that uploading them every time would take too long, why do you have to send the whole thing to the client? Does the client really have to receive the whole 5 MB text file? Wouldn't it be possible to send it only what it needs?

Anyway, to your question: The first thing that comes to my mind when hearing "large text files" and modification detection is diff. For the algorithm, read here. This could be an approach to committing the changes, and it specifies a format for them. You'd just have to rebuild diff (or a part of it) in JavaScript. This will not be easy, but I guess it is possible. If the algorithm doesn't help you, perhaps the definition of the diff file format will.

To the corruption issue: You don't have to fear that your data gets corrupted on the way, because TCP, on which HTTP is based, makes sure that everything arrives uncorrupted. What you should fear is a connection reset. Maybe you can do something like a handshake? When the client sends an update to the server, the server applies the modifications and keeps one old version of the file. To ensure that the client has received the server's confirmation that the modification went fine (that's where the connection reset could happen), the client sends another Ajax request back to the server. If this one doesn't reach the server within some defined time, the file gets reset on the server side.
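
The handshake described above might be sketched like this. The class name `PendingDocument` and its methods are hypothetical, the clock is passed in as a plain number of milliseconds so the logic stays easy to follow, and the sketch assumes only one unconfirmed update at a time:

```javascript
// Server-side state for one document: the current text plus one old version
// that is restored if the client's confirmation does not arrive in time.
class PendingDocument {
  constructor(text, confirmWindowMs) {
    this.text = text;
    this.previous = null;   // old version kept while an update is unconfirmed
    this.deadline = null;   // time by which the confirmation must arrive
    this.confirmWindowMs = confirmWindowMs;
  }

  // Roll back if the confirmation window has passed.
  expire(now) {
    if (this.deadline !== null && now > this.deadline) {
      this.text = this.previous;
      this.previous = null;
      this.deadline = null;
    }
  }

  // First request: apply the update but remember the old version.
  applyUpdate(newText, now) {
    this.expire(now);
    if (this.deadline !== null) return false; // an update is still unconfirmed
    this.previous = this.text;
    this.text = newText;
    this.deadline = now + this.confirmWindowMs;
    return true;
  }

  // Second request: the client confirms it saw the server's reply.
  confirm(now) {
    this.expire(now);
    if (this.deadline === null) return false; // nothing pending, or too late
    this.previous = null;
    this.deadline = null;
    return true;
  }
}
```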

Another thing: I don't know how well JavaScript handles such gigantic files/data...

joni answered Nov 17 '22 07:11


This sounds like a problem that versioning systems (CVS, SVN, Git, Bazaar) already solve very well.

They're all reasonably easy to set up on a server, and you can communicate with them through PHP.

After the setup, you'd get for free: versioning, log, rollback, handling of concurrent changes, proper diff syntax, tagging, branches...

You wouldn't get the 'send just the updates' functionality that you asked for. I'm not sure how important that is to you. Plain text is really very cheap to send as far as bandwidth is concerned.

Personally, I would probably make a compromise similar to what Wikis do. Break down the whole text into smaller semantically coherent chunks (chapters, or even paragraphs), determine on the client side just which chunks have been edited (without going down to the character level), and send those.
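
As a rough sketch of that chunk-level compromise, assuming paragraphs are separated by blank lines and are compared by position (the names `splitIntoChunks` and `changedChunks` are made up here):

```javascript
// Assumption: a "chunk" is a paragraph, and paragraphs are separated by blank lines.
function splitIntoChunks(text) {
  return text.split(/\n{2,}/);
}

// Compare chunk by chunk and return only the ones that differ.
// A null text means the chunk no longer exists in the new version.
function changedChunks(oldText, newText) {
  const oldChunks = splitIntoChunks(oldText);
  const newChunks = splitIntoChunks(newText);
  const changes = [];
  const count = Math.max(oldChunks.length, newChunks.length);
  for (let i = 0; i < count; i++) {
    if (oldChunks[i] !== newChunks[i]) {
      changes.push({ index: i, text: i < newChunks.length ? newChunks[i] : null });
    }
  }
  return changes;
}
```

Comparing by position is the weak spot: inserting a paragraph near the top shifts every later index, so all following chunks are reported as changed. Giving each chunk a stable ID would avoid that, at the cost of more bookkeeping.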

The server could then answer with a diff, generated by your versioning system, which is something they do very efficiently. If you want to allow concurrent changes, you might run into cases where editors have to do manual merges, anyway.

Another general hint might be to look at what Google did with Wave. I have to remain general here, because I haven't really studied it in detail myself, but I seem to remember that there have been a few articles about how they've solved the real-time concurrent editing problem, which seems to be exactly what you'd like to do.

In summary, I believe the problem you're planning to tackle is far from trivial, there are tools that address many of the associated problems already, and I personally would compromise and reformulate the approach in favor of much less workload.

Thomas answered Nov 17 '22 07:11