
Word by word diff comparison of two strings in .NET

I need to do a word-by-word comparison of two strings: something like diff, but for words rather than lines.

This is how Wikipedia shows revision differences, for example: http://en.wikipedia.org/w/index.php?title=Horapollo&action=historysubmit&diff=21895647&oldid=21893459

As a result, I want two arrays of indexes of the words that differ between the two strings.

Are there any libraries, frameworks, or standalone methods for .NET that can do this?

P.S. I want to compare several kilobytes of text.

Alex Blokha asked Dec 22 '22 07:12

2 Answers

Actually, you probably want to implement a variation of the local/global alignment algorithms used in DNA sequence alignment. This is because a naive position-by-position comparison of the words breaks down as soon as a word is inserted or deleted. For example:

The quick brown fox jumps over the lazy dog
The quick fox jumps over the lazy dog

Here "brown" has been deleted, so every word after it shifts by one position, and a positional comparison would report all of them as different. In other words, if you cannot identify insertions and deletions of whole words, your comparison algorithm can become very sc(r)ewed. Take a look at the Smith-Waterman algorithm (local alignment) and the Needleman-Wunsch algorithm (global alignment) and find a way to adapt them to your needs. Since the search space can become very large if the strings are long, you could also check out BLAST, a very common heuristic algorithm that is pretty much the standard in genetic sequence searches.
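To make the alignment idea concrete, here is a minimal sketch that aligns two word sequences with a longest-common-subsequence table, which can be seen as a simplified global alignment that only scores exact matches. It is written in Python for brevity, and the table-filling logic should carry over to C# directly. The function name `word_diff` and the whitespace-only tokenization are assumptions made for illustration, not part of any library mentioned here.

```python
from typing import List, Tuple


def word_diff(old: str, new: str) -> Tuple[List[int], List[int]]:
    """Return (indexes of words only in old, indexes of words only in new)."""
    a, b = old.split(), new.split()  # assumption: words are whitespace-separated
    n, m = len(a), len(b)

    # lcs[i][j] = length of the longest common subsequence of a[i:] and b[j:]
    lcs = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n - 1, -1, -1):
        for j in range(m - 1, -1, -1):
            if a[i] == b[j]:
                lcs[i][j] = lcs[i + 1][j + 1] + 1
            else:
                lcs[i][j] = max(lcs[i + 1][j], lcs[i][j + 1])

    # Walk the table from the start; unmatched words are deletions or insertions.
    removed, added = [], []
    i = j = 0
    while i < n and j < m:
        if a[i] == b[j]:
            i += 1
            j += 1
        elif lcs[i + 1][j] >= lcs[i][j + 1]:
            removed.append(i)  # a[i] has no partner in new
            i += 1
        else:
            added.append(j)  # b[j] has no partner in old
            j += 1
    removed.extend(range(i, n))
    added.extend(range(j, m))
    return removed, added


old_text = "The quick brown fox jumps over the lazy dog"
new_text = "The quick fox jumps over the lazy dog"
print(word_diff(old_text, new_text))  # ([2], [])
```

The table has one cell per pair of words, so time and memory grow with the product of the two word counts. For a few kilobytes of text (on the order of a thousand words per side) that is roughly a million cells, which is usually acceptable; much longer inputs would call for a more economical diff algorithm or a heuristic.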

Pedery answered Jan 09 '23 05:01

It seems I've found the solution I needed:

DiffPlex is a combination of a .NET Diffing Library with both a Silverlight and HTML diff viewer. http://diffplex.codeplex.com/

But it has one bug. When comparing the lines "Hello-Kitty" and "Hello - Kitty", the word "Hello" is marked as a difference, although the only difference is the space characters.
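One plausible cause of this kind of result, assumed here rather than confirmed for DiffPlex, is how the text is split into words before comparison. The sketch below (Python, with a hypothetical `tokenize` helper) shows that splitting on whitespace alone makes the two strings look completely different, while treating punctuation as separate tokens and discarding whitespace makes them identical.

```python
import re
from typing import List


def tokenize(text: str) -> List[str]:
    """Split into runs of word characters and single punctuation marks.

    Whitespace is discarded, so spacing changes never show up as differences.
    """
    return re.findall(r"\w+|[^\w\s]", text)


left, right = "Hello-Kitty", "Hello - Kitty"

# Whitespace-only splitting: one token versus three, so nothing lines up.
print(left.split())   # ['Hello-Kitty']
print(right.split())  # ['Hello', '-', 'Kitty']

# Punctuation-aware splitting: both strings yield the same tokens.
print(tokenize(left))   # ['Hello', '-', 'Kitty']
print(tokenize(right))  # ['Hello', '-', 'Kitty']
```

Running the tokens from `tokenize` through a word-level diff would then report no difference between the two lines, at the cost of ignoring all whitespace changes.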

Alex Blokha answered Jan 09 '23 07:01
