Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Can I use a plaintext diff algorithm for tracking XML changes?

I'm working in Flex/AS3 on (for simplicity) an XML editor. I need to provide undo/redo functionality.

Of course, one solution is to store the entire source text with each edit. However, to conserve memory, I'd like to store the diffs instead (these diffs will also be used to transmit updates to the server for auto-saving).


My question is - can I use a plaintext diff algorithm for tracking these XML changes?

My research on the internet indicates that I cannot do so. However, I'm obviously missing something. Plaintext diff provides functionality that is purportedly:

diff(text, text') -> diffs
patch(text, diffs) -> text'

XML is simply text, so why can't I just use diff() and patch() to transform the text reliably?

For example: Let's say that I'm a poet. When I write poetry, I use lots of funky punctuation... You know, like <, /, and >. (You might see where I'm going with this...) If I'm writing my poetry in an application that uses diffs to provide undo/redo functionality, does my poetry become garbled when I undo/redo my edits? It's just text! Why does it make a difference to the algorithm?

I obviously don't get something here...Thanks for explaining! :)

UPDATE:

Some discussion I've encountered regarding diffing XML with a plaintext algorithm:

  • http://code.google.com/p/google-diff-match-patch/wiki/Plaintext
  • Is there a JS diff library against htmlstring just like google-diff-match-patch on plain text?

Also, I understand that a Command pattern is likely a better way to implement Undo/Redo. I've simplified my use case for the sake of simplicity, and I do still think that XML diffing is the best approach.

like image 517
rinogo Avatar asked Mar 12 '10 02:03

rinogo


2 Answers

I'm the author of the plain-text diff/match/patch library from Google.

The key question is whether your patches are exact. In an ideal world:

  diff(old_text, new_text) -> edits
  patch(edits, old_text) -> new_text

Notice that the base text (old_text) is the same in both operations. In this ideal case, then a simple plain-text diff and patch will work perfectly, regardless of the type of the content. If this case applies to you, then you're done.

The issue lies with fuzzy patching. Here's the corresponding example:

  diff(old_text, new_text) -> edits
  patch(edits, old_forked_text) -> new_forked_text

Notice that the base text is not the same in both operations. They should be similar, but the patch operation now has to use "judgement" about what it should do. Some patches may fit perfectly as specified in the edit, others may need to be tweaked for position, others may need to be tweaked for altered context, others may not fit at all and should be dropped. If your patching algorithm is not aware of the structure of XML when making its decisions, you may very well end up with malfromed XML. Here's a sample:

  old_text = Jabberwock<SPAN>Hello<SPAN>World</SPAN></SPAN>
  new_text = Jabberwock<DIV>Hello<SPAN>World</SPAN></DIV>
  diff(old_text, new_text) -> edits
  edits = ["SPAN" -> "DIV" @ character 11,
           "SPAN" -> "DIV" @ character 41]
  old_forked_text = <SPAN>Hello<SPAN>World</SPAN></SPAN>
  patch(edits, old_forked_text) -> new_forked_text
  new_forked_text = <SPAN>Hello<DIV>World</SPAN></DIV>

Let's look at this one carefully. The original diff returned two edits, change the outermost SPAN to a DIV. Simple change. Unfortunately the text this edit is being applied to has changed from the original. The word "Jabberwock" has been removed. Now the first SPAN->DIV change matches up with the second SPAN tag, not the first one. Since the patch algorithm is not aware of the rules of XML, it results in illegally nested tags.

There are some hacks which allow you to guarantee valid XML when using a plain-text patch, but they result in some loss of flexibility (the original question already has a link to the wiki page I wrote about this). The ultimate solution for patching XML is of course to use an XML-aware diff and patch algorithm. These are significantly more complicated and expensive, but they exist. Google the names Tancred Lindholm and Sebastian Rönnau for the great work that they have done in the XML field (particularly with regards to DocEng).

Let me know if there is anything else I can add.

-- Neil Fraser

like image 97
Neil Fraser Avatar answered Sep 30 '22 03:09

Neil Fraser


I use Beyond Compare all the time to compare XML documents. It understands XML, to a certain degree.

You may need to pre-process the two documents in order for textual comparison to do the best job possible. For instance, in some XML documents, the order of some elements may not matter. It will certainly matter to your diff tool! You may need to pre-process the XML using an XML Transform that sorts these elements into a common order in both files, before comparing the two sorted files.

You're also going to want to use the same indentation for both documents. I find it useful to start each element on a new line, and to use the same amount of indentation, with spaces, for each level. If your document gets very deep, you would want to use only one or two spaces per level, so that the compare fits on the screen. You may even want to use one attribute per line (and to sort the attributes into a common order).

like image 37
John Saunders Avatar answered Sep 30 '22 04:09

John Saunders