
Using a diff on a document containing HTML

I'm in the brainstorming process for a system that will be used to write and edit documents (think Google Docs, but without realtime collaboration). I will be using a WYSIWYG editor, so the documents will contain HTML.

Editors will make changes to the documents, and the original author should be able to see those revisions and apply the changes manually.

How can I use a diff system when HTML tags are involved? I'd like the revisions to display any changes to formatting, so for instance if something was bolded, then it should pick up on that change. It should also make content changes visible like a normal diff.

If I were to use a regular diff system, it would treat the HTML as plain text, which is obviously not what I want. Thanks for any pointers!

James Skidmore asked Dec 22 '10 21:12



1 Answer

The best open-source library I've worked with for HTML diffing is DaisyDiff (PHP version).

That said, diffing HTML (or XML, for that matter) is a troublesome beast. If you can simplify the data by pre-processing it before passing it to a diffing algorithm, you may get better results. For instance, if you really only use tags like <p>, <b>, and <i>, tokenize them, run a differencing algorithm such as diff-match-patch or GNU Diff over the result, and then replace the tokens afterward. See this diff-match-patch wiki page for some thoughts on this.
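
To make the tokenizing idea concrete, here is a minimal sketch in Python. It is only an illustration under some assumptions: it uses the standard library's difflib in place of diff-match-patch or GNU Diff, it treats each tag as one indivisible token rather than swapping in placeholder characters, and the names tokenize and diff_html are made up for this example. It also assumes simple, well-formed markup and does not try to handle comments, scripts, or attribute values containing ">".

```python
import re
from difflib import SequenceMatcher

# One token is: a whole tag, a run of whitespace, a run of text,
# or a stray "<" (kept so that tokenizing never loses characters).
TOKEN_RE = re.compile(r"<[^>]+>|\s+|[^<\s]+|<")


def tokenize(html):
    """Split HTML into tokens so that a tag is never cut in half."""
    return TOKEN_RE.findall(html)


def diff_html(old, new):
    """Return a list of (op, text) pairs; op is 'equal', 'delete' or 'insert'."""
    a, b = tokenize(old), tokenize(new)
    # autojunk=False keeps difflib from discarding frequent tokens
    # such as spaces on long documents.
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    result = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            result.append(("equal", "".join(a[i1:i2])))
            continue
        # 'replace' is reported as a delete followed by an insert.
        if i1 < i2:
            result.append(("delete", "".join(a[i1:i2])))
        if j1 < j2:
            result.append(("insert", "".join(b[j1:j2])))
    return result


old = "<p>Hello world</p>"
new = "<p>Hello <b>world</b></p>"
changes = diff_html(old, new)
for op, text in changes:
    print(op, repr(text))
```

Because tags are compared as whole tokens, wrapping a word in <b> shows up as two inserted tags around unchanged text, which is the kind of formatting change the question asks about. How to render those inserted and deleted tags for the reader (for example with <ins> and <del>) is a separate problem that this sketch leaves open.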

orangepips answered Oct 29 '22 20:10
