I need an algorithm that can compare two text files and highlight their differences and, even better, quantify the difference in a meaningful way: two similar files should get a higher similarity score than two dissimilar files, with "similar" understood in the ordinary sense. It sounds easy to implement, but it's not.
The implementation can be in C# or Python.
Thanks.
When comparing texts, consider both what they have in common and what is different about them. If they have the same purpose, do they use similar techniques? For example, two newspaper articles could use exaggeration to present completely different viewpoints on the same topic.
At their core, diff algorithms compare two sequences and work out how the first can be transformed into the second using two primitive operations: delete-subsequence and insert-subsequence. If a delete and an insert coincide on the same range, the pair can be labeled as a change-subsequence.
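To make those primitives concrete, here is a minimal sketch using `difflib.SequenceMatcher` from Python's standard library. The helper name `describe_edits` and the sample lines are made up for illustration. Note that `difflib` calls the combined delete-plus-insert case `replace`, and it uses its own matching heuristic, so the result is not guaranteed to be the shortest possible edit script.

```python
import difflib

def describe_edits(a, b):
    """Return the (tag, old_items, new_items) operations that turn a into b.

    Tags come from difflib: 'delete', 'insert', or 'replace'
    (a delete and an insert covering the same range).
    """
    # autojunk=False disables difflib's "popular element" heuristic,
    # which can otherwise skew results on long inputs.
    matcher = difflib.SequenceMatcher(None, a, b, autojunk=False)
    ops = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":  # skip the unchanged runs
            ops.append((tag, a[i1:i2], b[j1:j2]))
    return ops

# Hypothetical sample data: each list element stands for one line of a file.
old = ["alpha", "beta", "gamma", "delta"]
new = ["alpha", "BETA", "gamma", "delta", "epsilon"]

for op in describe_edits(old, new):
    print(op)
```

For real files, passing `open(path).read().splitlines()` for each side gives a line-level comparison; passing the strings themselves gives a character-level one.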
What is Text Comparison? Text Comparison is the process of inspecting two files to ensure that no unintended changes have occurred. Typically, one of the files is the original, master document while the other is a revision.
I recommend taking a look at Neil Fraser's code and articles:
google-diff-match-patch
Currently available in Java, JavaScript, C++ and Python. Regardless of language, each library features the same API and the same functionality. All versions also have comprehensive test harnesses.
Neil Fraser: Diff Strategies - for theory and implementation notes
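For the similarity-score part of the question, one possible starting point (not taken from the library above) is the `ratio()` method of `difflib.SequenceMatcher` in Python's standard library. The sketch below is only an assumption about what "similar" should mean: it splits on whitespace and compares word sequences, and the function name `similarity` is made up for illustration.

```python
import difflib

def similarity(text_a, text_b):
    """Return a score in [0, 1]; 1.0 means the word sequences are identical.

    The score is difflib's ratio: 2 * M / T, where M is the number of
    matched words and T is the total number of words in both texts.
    """
    a_words = text_a.split()
    b_words = text_b.split()
    matcher = difflib.SequenceMatcher(None, a_words, b_words, autojunk=False)
    return matcher.ratio()

print(similarity("the quick brown fox", "the quick red fox"))
print(similarity("the quick brown fox", "something else entirely"))
```

Comparing words rather than characters makes the score less sensitive to small spelling changes inside a word and more sensitive to reordered or replaced words; whether that matches the "normal" sense of similar depends on the texts involved.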