I want to write a simple diff application in Python using Google's Diff Match Patch APIs.
I'm quite new to Python, so I want an example of how to use the Diff Match Patch API for semantically comparing two paragraphs of text. I'm not too sure how to go about using the diff_match_patch.py file and what to import from it. Help will be much appreciated!
Additionally, I've tried using difflib, but I found it ineffective for comparing sentences that vary widely. I'm using Ubuntu 12.04 x64.
Google's diff-match-patch API is the same in every language it is implemented in (Java, JavaScript, Dart, C++, C#, Objective C, Lua and Python 2.x/3.x). One can therefore typically use sample snippets in languages other than one's target language to figure out which particular API calls are needed for various diff/match/patch tasks.
In the case of a simple "semantic" comparison, this is what you need:
import diff_match_patch
textA = "the cat in the red hat"
textB = "the feline in the blue hat"
#create a diff_match_patch object
dmp = diff_match_patch.diff_match_patch()
# Depending on the kind of text you work with, in terms of overall length
# and complexity, you may want to extend (or, as here, suppress) the
# time-out feature
dmp.Diff_Timeout = 0   # or some other value; the default is 1.0 seconds
# All 'diff' jobs start with invoking diff_main()
diffs = dmp.diff_main(textA, textB)
# diff_cleanupSemantic() is used to make the diffs array more human-readable
dmp.diff_cleanupSemantic(diffs)
# and if you want the result as a ready-to-display HTML snippet
htmlSnippet = dmp.diff_prettyHtml(diffs)
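From there it is plain Python: diffs is just a list of (op, text) tuples, where -1 marks a deletion, 1 an insertion and 0 unchanged text, and htmlSnippet is an ordinary string. Continuing the snippet above, a minimal sketch for looking at the results (the diff.html file name is just an example):
# -1 = deletion, 1 = insertion, 0 = unchanged text
for op, text in diffs:
    print(op, repr(text))
# save the HTML rendering so it can be opened in a browser
with open("diff.html", "w") as f:
    f.write(htmlSnippet)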
A word on "semantic" processing by diff-match-patch
Beware that such processing is mainly useful for presenting the differences to a human viewer: it tends to produce a shorter list of differences by avoiding irrelevant resynchronizations of the texts (when, for example, two distinct words happen to have common letters in their middle). The results produced are however far from perfect, as this processing relies on simple heuristics based on the length of the differences, surface patterns and so on, rather than on actual NLP processing based on lexicons and other semantic-level devices.
For example, the textA and textB values used above produce the following before-and-after-diff_cleanupSemantic() values for the diffs array:
[(0, 'the '), (-1, 'cat'), (1, 'feline'), (0, ' in the '), (-1, 'r'), (1, 'blu'), (0, 'e'), (-1, 'd'), (0, ' hat')]
[(0, 'the '), (-1, 'cat'), (1, 'feline'), (0, ' in the '), (-1, 'red'), (1, 'blue'), (0, ' hat')]
Nice! The letter 'e' that is common to 'red' and 'blue' causes diff_main() to see this area of the text as four edits, but diff_cleanupSemantic() reduces it to just two edits, nicely singling out the differing words 'red' and 'blue'.
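These two arrays can be reproduced with a small sketch, reusing the dmp, textA and textB values from above. Note that diff_cleanupSemantic() modifies the list in place rather than returning a new one:
diffs = dmp.diff_main(textA, textB)
print(diffs)                     # the raw, character-level diff
dmp.diff_cleanupSemantic(diffs)  # mutates the list in place
print(diffs)                     # the "semantic" version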
However, if we have, for example:
textA = "stackoverflow is cool"
textB = "so is very cool"
The before/after arrays produced are:
[(0, 's'), (-1, 'tack'), (0, 'o'), (-1, 'verflow'), (0, ' is'), (1, ' very'), (0, ' cool')]
[(0, 's'), (-1, 'tackoverflow is'), (1, 'o is very'), (0, ' cool')]
This shows that the allegedly semantically improved "after" can be rather unduly "tortured" compared to the "before". Note, for example, how the leading 's' is kept as a match and how the added 'very' word is mixed with parts of the 'is cool' expression. Ideally, we'd probably expect something like:
[(-1, 'stackoverflow'), (1, 'so'), (0, ' is '), (1, 'very '), (0, 'cool')]
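If word-level granularity is what you are really after, you can get very close to that by bending the line-mode trick from the diff-match-patch wiki to words, treating every space as a line break. diff_linesToChars() and diff_charsToLines() are real API calls; the diff_words() wrapper below is just an illustrative helper of mine, not part of the library:
import diff_match_patch

def diff_words(textA, textB):
    dmp = diff_match_patch.diff_match_patch()
    # Turn every space into a newline so each "line" is a single word,
    # then reuse the line-mode machinery on the word placeholders.
    charsA, charsB, wordArray = dmp.diff_linesToChars(
        textA.replace(" ", "\n"), textB.replace(" ", "\n"))
    diffs = dmp.diff_main(charsA, charsB, False)  # False: no char-level refinement
    dmp.diff_charsToLines(diffs, wordArray)       # expand placeholders back to words
    return [(op, text.replace("\n", " ")) for op, text in diffs]

print(diff_words("stackoverflow is cool", "so is very cool"))
# [(-1, 'stackoverflow '), (1, 'so '), (0, 'is '), (1, 'very '), (0, 'cool')]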