Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

What is an Algorithm to Diff the Two Strings in the Same Way that SO Does on the Version Page?

Tags:

c#

algorithm

ruby

I'm trying to diff two strings by phrase, similar to the way that StackOverflow diffs the two strings on the version edits page. What would be an algorithm to do this? Are there gems, or other standard libraries that accomplish this?

EDIT: I've seen other diffing algorithms (Differ with Ruby) and they seem to result in the following:

>> o = 'now is the time when all good men.'
>> p = 'now some time the men time when all good men.'
>> Differ.diff_by_word(o,p).format_as(:html)
=> "now <del class=\"differ\">some</del><ins class=\"differ\">is</ins> 
   <del class=\"differ\">time </del>the <del class=\"differ\">men </del>time
   when all good men."

Note how the words are diffed on a per word basis? I'd like some way of diffing more by phrase, so the above code output:

=> "now <del class=\"differ\">some time the men</del><ins class=\"differ\">is
   the</ins> time when all good men."

Am I hoping for too much?

like image 607
aronchick Avatar asked Sep 03 '09 04:09

aronchick


People also ask

Which algorithm is used to match two strings?

Levenshtein distance is the most frequently used algorithm. It was founded by the Russian scientist, Vladimir Levenshtein to calculate the similarities between two strings. This is also known as the Edit distance-based algorithm as it computes the number of edits required to transform one string to another.

What algorithm does diff use?

Myers Algorithm – human readable diffs This is used by tools such as Git Diff and GNU Diff.

How does file diff work?

The diff command is invoked from the command line, passing it the names of two files: diff original new . The output of the command represents the changes required to transform the original file into the new file. If original and new are directories, then diff will be run on each file that exists in both directories.


1 Answers

The algorithm you are looking for is Longest Common Subsequence it does most of the work for you.

The outline is something along these lines.

  1. Split by word (input, output)
  2. Calculate LCS on input / output array.
  3. Walk through the array and join up areas intelligently.

So for example say you have:

"hello world this is a test"

compared with:

"mister hello world"

The result from the LCS is

  • "mister" +
  • "hello" =
  • "world" =
  • "this" -
  • "is" -
  • "a" -
  • "test" -

Now you sprinkle the special sauce when building up. You join the string together while staying mindful of the previous action. The naive algorithm is just join sections that are the same action.

  • "mister" +
  • "hello world" =
  • "this is a test" -

Finally you transform it to html:

<ins>mister</ins> hello world <del>this is a test</del>  

Of course the devil is in the detail:

  • You need to consider how you handle tags
  • Do you compare markdown or html
  • Are there any edge cases where the UI stops making sense.
  • Do you need special handling for punctuations.
like image 91
Sam Saffron Avatar answered Sep 19 '22 05:09

Sam Saffron