I'm using google-diff-match-patch to compare plain text in natural languages.
How can I make google-diff-match-patch ignore certain characters? (Some tiny differences that I don't care about.)
For example, given text1:
give me a cup of bean-milk. Thanks.
and text2:
please give mom a cup of bean milk!  Thank you.
(Note that there are two space characters before 'Thank you'.)
google-diff-match-patch outputs something like this:
[please] give m(e)[om] a cup of bean(-)[ ]milk(.)[!] Thank(s)[ you].
It seems that google-diff-match-patch only ignores differences in the amount of whitespace.
How can I tell google-diff-match-patch to also ignore characters like [-.!]?
The expected result would be:
[please] give m(e)[om] a cup of bean-milk. Thank(s)[ you].
Thanks.
google-diff-match-patch can output a list of tuples.
The first element of each tuple specifies whether it is an insertion (1), a deletion (-1), or an equality (0). The second element is the affected text.
For example:
diff_main("Good dog", "Bad dog") => [(-1, "Goo"), (1, "Ba"), (0, "d dog")]
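To make the tuple format concrete, here is a minimal sketch that renders such a list in the notation used in the question: deletions in parentheses, insertions in square brackets. The helper name render_diffs is made up for illustration and is not part of google-diff-match-patch; the input list is typed in by hand rather than produced by the library.

```python
def render_diffs(diffs):
    """Render (op, text) tuples as (deleted)[inserted]unchanged text."""
    parts = []
    for op, text in diffs:
        if op == -1:        # deletion
            parts.append("(" + text + ")")
        elif op == 1:       # insertion
            parts.append("[" + text + "]")
        else:               # equality
            parts.append(text)
    return "".join(parts)


# Hand-written copy of the diff_main example above
example = [(-1, "Goo"), (1, "Ba"), (0, "d dog")]
print(render_diffs(example))  # prints (Goo)[Ba]d dog
```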
Thus we just need to filter this list.
Example code in Python:
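import re

# Matches a chunk made up only of spaces and minor punctuation marks
Ignored_marks = re.compile(r'[ ,.;:!\'"?-]+$')

def unmark_minor_diffs(diffs):  # diffs is the list of tuples produced by google-diff-match-patch
    cooked_diffs = []
    for (op, data) in diffs:
        if not Ignored_marks.match(data):
            # A real difference (or ordinary unchanged text): keep it as is
            cooked_diffs.append((op, data))
        else:
            # Minor chunk: keep text1's version as an equality and
            # drop insertions, which exist only in text2
            if op in (0, -1):
                cooked_diffs.append((0, data))
    return cooked_diffs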
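As a rough check, the sketch below runs the same filter on the example from the question. The diff list is typed in by hand to match the output shown there (it is not produced by calling the library), and merge_equalities is a hypothetical extra step, not part of google-diff-match-patch, that joins the neighbouring equalities the filter leaves behind.

```python
import re

# Same pattern as the answer: a chunk made up only of spaces and minor punctuation
IGNORED_MARKS = re.compile(r"[ ,.;:!'\"?-]+$")


def unmark_minor_diffs(diffs):
    """Turn minor deletions into equalities and drop minor insertions."""
    cooked = []
    for op, data in diffs:
        if not IGNORED_MARKS.match(data):
            cooked.append((op, data))
        elif op in (0, -1):
            cooked.append((0, data))
    return cooked


def merge_equalities(diffs):
    """Join adjacent equalities (hypothetical helper for tidier output)."""
    merged = []
    for op, data in diffs:
        if merged and op == 0 and merged[-1][0] == 0:
            merged[-1] = (0, merged[-1][1] + data)
        else:
            merged.append((op, data))
    return merged


# Hand-written diff list matching the output shown in the question
diffs = [
    (1, "please "), (0, "give m"), (-1, "e"), (1, "om"),
    (0, " a cup of bean"), (-1, "-"), (1, " "), (0, "milk"),
    (-1, "."), (1, "!"), (0, " Thank"), (-1, "s"), (1, " you"), (0, "."),
]

result = merge_equalities(unmark_minor_diffs(diffs))
for op, data in result:
    print(op, repr(data))
```

The remaining non-zero entries are the insertion of "please ", the e/om change, and the s/" you" change, which corresponds to the expected result in the question.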