How to ignore certain characters while doing diff in google-diff-match-patch?

I'm using google-diff-match-patch to compare plain text in natural languages.

How can I make google-diff-match-patch ignore certain characters? (That is, some tiny differences that I don't care about.)

For example, given text1:

give me a cup of bean-milk. Thanks.

and text2:

please give mom a cup of bean milk!  Thank you.

(Note that there are two space characters before 'Thank you'.)

google-diff-match-patch outputs something like this:

[please] give m(e)[om] a cup of bean(-)[ ]milk(.)[!] Thank(s)[ you].

It seems that google-diff-match-patch only ignores differing amounts of whitespace.

How can I tell google-diff-match-patch to also ignore characters like [-.!]?

The expected result would be:

[please] give m(e)[om] a cup of bean-milk. Thank(s)[ you].

Thanks.

asked Nov 03 '22 15:11 by weakish


1 Answer

google-diff-match-patch can output the diff as a list of tuples.

The first element of each tuple specifies whether it is an insertion (1), a deletion (-1) or an equality (0). The second element specifies the affected text.

For example:

diff_main("Good dog", "Bad dog") => [(-1, "Goo"), (1, "Ba"), (0, "d dog")]
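
To make the tuple format concrete, here is a small sketch that renders such a list in the notation used in the question: deleted text in parentheses, inserted text in square brackets. Note that render_diffs is a hypothetical helper written for this illustration, not part of google-diff-match-patch, and the diff list is copied by hand from the example above rather than produced by calling the library.

```python
def render_diffs(diffs):
    """Render (op, text) tuples as: (deleted), [inserted], equal text as-is."""
    parts = []
    for op, data in diffs:
        if op == -1:      # deletion: present only in the first text
            parts.append("(" + data + ")")
        elif op == 1:     # insertion: present only in the second text
            parts.append("[" + data + "]")
        else:             # equality: present in both texts
            parts.append(data)
    return "".join(parts)

# Hand-copied from the diff_main example above.
diffs = [(-1, "Goo"), (1, "Ba"), (0, "d dog")]
print(render_diffs(diffs))  # (Goo)[Ba]d dog
```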

Thus we just need to filter this list.

Example code in Python:

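import re

# Matches text made up entirely of characters we want to ignore.
IGNORED_MARKS = re.compile(r'[ ,.;:!\'"?-]+$')

def unmark_minor_diffs(diffs):
    """Filter a list of (op, text) tuples produced by google-diff-match-patch."""
    cooked_diffs = []
    for op, data in diffs:
        if not IGNORED_MARKS.match(data):
            # A real difference (or ordinary text): keep it unchanged.
            cooked_diffs.append((op, data))
        elif op in (0, -1):
            # Ignored marks from text1 are kept as equalities;
            # ignored marks inserted by text2 (op == 1) are dropped.
            cooked_diffs.append((0, data))
    return cooked_diffs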
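
As a rough check, the filter can be tried on the two sentences from the question. The sketch below repeats the filter from this answer so that it runs on its own. The diff list is written by hand as an assumption about what the library might return (the real output may split the text differently), so no google-diff-match-patch call is involved.

```python
import re

# The filter from this answer, repeated so the snippet is self-contained.
IGNORED_MARKS = re.compile(r'[ ,.;:!\'"?-]+$')

def unmark_minor_diffs(diffs):
    cooked_diffs = []
    for op, data in diffs:
        if not IGNORED_MARKS.match(data):
            cooked_diffs.append((op, data))
        elif op in (0, -1):
            cooked_diffs.append((0, data))
    return cooked_diffs

# Hand-written diff list for text1 and text2 from the question
# (an assumption, not actual library output).
diffs = [
    (1, "please "), (0, "give m"), (-1, "e"), (1, "om"),
    (0, " a cup of bean"), (-1, "-"), (1, " "), (0, "milk"),
    (-1, "."), (1, "!"), (0, " "), (1, " "),
    (0, "Thank"), (-1, "s"), (1, " you"), (0, "."),
]

cooked = unmark_minor_diffs(diffs)

# Only the differences that survive the filter.
changes = [(op, data) for op, data in cooked if op != 0]
print(changes)
# [(1, 'please '), (-1, 'e'), (1, 'om'), (-1, 's'), (1, ' you')]
```

The hyphen, the full stop and the extra space no longer show up as differences, which is in line with the expected result in the question. The filtered list can contain several equalities in a row (for example "bean", "-", "milk"); they are left unmerged here.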

answered Nov 08 '22 05:11 (2 revs)