Our project requires us to compare two texts (update1, update2) and come up with an algorithm that determines how many words and how many sentences have changed.
Are there any algorithms that I can use?
I am not even looking for code. If I know the algorithm, I can code it in Java.
At their core, diff algorithms compare two sequences and discover how the first can be transformed into the second by a sequence of operations built from two primitives: delete-subsequence and insert-subsequence. If a delete and an insert coincide on the same range, the pair can be labeled as a change-subsequence.
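To make the delete/insert idea concrete, here is a minimal sketch in Java. It assumes the texts are already split into word tokens, and it uses a plain dynamic-programming LCS table to derive the operations; real diff tools use faster methods. The class name EditScriptSketch and the "=", "-", "+" prefixes are just names I made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;

class EditScriptSketch {
    // Returns the operations that turn a into b:
    // "= w" keeps a word, "- w" deletes it from a, "+ w" inserts it from b.
    static List<String> diff(String[] a, String[] b) {
        int n = a.length, m = b.length;
        // lcs[i][j] = length of the longest common subsequence of a[i..] and b[j..]
        int[][] lcs = new int[n + 1][m + 1];
        for (int i = n - 1; i >= 0; i--) {
            for (int j = m - 1; j >= 0; j--) {
                lcs[i][j] = a[i].equals(b[j])
                        ? lcs[i + 1][j + 1] + 1
                        : Math.max(lcs[i + 1][j], lcs[i][j + 1]);
            }
        }
        // Walk the table from the start, emitting one operation per step.
        List<String> ops = new ArrayList<>();
        int i = 0, j = 0;
        while (i < n && j < m) {
            if (a[i].equals(b[j])) {
                ops.add("= " + a[i]);
                i++;
                j++;
            } else if (lcs[i + 1][j] >= lcs[i][j + 1]) {
                ops.add("- " + a[i]);
                i++;
            } else {
                ops.add("+ " + b[j]);
                j++;
            }
        }
        while (i < n) ops.add("- " + a[i++]); // leftover words only in a
        while (j < m) ops.add("+ " + b[j++]); // leftover words only in b
        return ops;
    }
}
```

A "-" run immediately followed by a "+" run at the same position is what you would report as a change rather than as a separate delete and insert.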
A good way to compare two paragraphs purely by text and word similarity is an algorithm called Levenshtein distance. It measures the distance between two texts as the minimum number of insertions, deletions, and substitutions needed to turn one into the other, and you can pick the threshold that best suits your needs. For example, any two texts with more than 90% similarity could be considered the same.
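As a rough sketch of how that could look in Java, the version below computes the Levenshtein distance over word tokens instead of characters, so the result is a count of changed words. The class name WordLevenshtein and the similarity formula (one minus distance divided by the longer length) are my own assumptions; other normalizations are possible.

```java
class WordLevenshtein {
    // Minimum number of word insertions, deletions, and substitutions
    // needed to turn a into b (two-row dynamic programming).
    static int distance(String[] a, String[] b) {
        int[] prev = new int[b.length + 1];
        int[] curr = new int[b.length + 1];
        for (int j = 0; j <= b.length; j++) {
            prev[j] = j; // turning an empty prefix into b[0..j) takes j inserts
        }
        for (int i = 1; i <= a.length; i++) {
            curr[0] = i; // turning a[0..i) into an empty prefix takes i deletes
            for (int j = 1; j <= b.length; j++) {
                int cost = a[i - 1].equals(b[j - 1]) ? 0 : 1;
                curr[j] = Math.min(
                        Math.min(curr[j - 1] + 1, prev[j] + 1), // insert, delete
                        prev[j - 1] + cost);                    // keep or substitute
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length];
    }

    // Assumed normalization: 1.0 means identical, 0.0 means nothing in common.
    static double similarity(String[] a, String[] b) {
        int longest = Math.max(a.length, b.length);
        return longest == 0 ? 1.0 : 1.0 - (double) distance(a, b) / longest;
    }
}
```

With this, a check such as similarity(words1, words2) > 0.9 would implement the 90% rule of thumb, where words1 and words2 are whatever token arrays you produce from your two texts.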
Typically this is accomplished by finding the Longest Common Subsequence (commonly called the LCS problem). This is how tools like diff work. Of course, diff is a line-oriented tool, and it sounds like your needs are somewhat different. However, I'm assuming that you've already constructed some way to compare words and sentences.
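Here is a minimal sketch of the LCS approach applied to words and sentences rather than lines. The tokenization is a naive assumption on my part (words split on whitespace, sentences split after '.', '!' or '?'), and ChangeCounter is a made-up name. Anything not in the LCS counts as removed (only in the first text) or added (only in the second).

```java
class ChangeCounter {
    // Length of the longest common subsequence of two token arrays.
    static int lcsLength(String[] a, String[] b) {
        int[][] t = new int[a.length + 1][b.length + 1];
        for (int i = 1; i <= a.length; i++) {
            for (int j = 1; j <= b.length; j++) {
                t[i][j] = a[i - 1].equals(b[j - 1])
                        ? t[i - 1][j - 1] + 1
                        : Math.max(t[i - 1][j], t[i][j - 1]);
            }
        }
        return t[a.length][b.length];
    }

    // Returns {removed, added}: tokens only in a, and tokens only in b.
    static int[] countChanges(String[] a, String[] b) {
        int common = lcsLength(a, b);
        return new int[] {a.length - common, b.length - common};
    }

    // Naive word tokenizer (assumption): split on runs of whitespace.
    static String[] words(String text) {
        String t = text.trim();
        return t.isEmpty() ? new String[0] : t.split("\\s+");
    }

    // Naive sentence tokenizer (assumption): split on whitespace that
    // follows '.', '!' or '?'. Abbreviations like "e.g." will confuse it.
    static String[] sentences(String text) {
        String t = text.trim();
        return t.isEmpty() ? new String[0] : t.split("(?<=[.!?])\\s+");
    }

    public static void main(String[] args) {
        String update1 = "The cat sat. It was happy.";
        String update2 = "The cat sat. It was very happy. The end.";
        int[] w = countChanges(words(update1), words(update2));
        int[] s = countChanges(sentences(update1), sentences(update2));
        System.out.println("words removed=" + w[0] + " added=" + w[1]);
        System.out.println("sentences removed=" + s[0] + " added=" + s[1]);
    }
}
```

For the sample texts this reports 0 words removed and 3 added, and 1 sentence removed and 2 added, because an edited sentence shows up as one removal plus one addition. If you want it counted as a single changed sentence, you would need to pair up removals and additions that sit at the same position.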
The algorithm described in "An O(NP) Sequence Comparison Algorithm" is used by Subversion's diff engine.
For your information, I have written implementations of it in various programming languages; they are available on the following GitHub page.
https://github.com/cubicdaiya/onp