Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Text comparison algorithm

We have a requirement in the project that we have to compare two texts (update1, update2) and come up with an algorithm to define how many words and how many sentences have changed.

Are there any algorithms that I can use?

I am not even looking for code. If I know the algorithm, I can code it in Java.

like image 920
java_mouse Avatar asked Jan 30 '12 14:01

java_mouse


People also ask

How does diff algorithm work?

The core of diff algorithms seeks to compare two sequences and to discover how the first can be transformed into the second by a sequence of operations using the primitives delete-subsequence, and insert-subseqence. If a delete and an insert coincide on the same range then it can be labeled as a change-subsequence.

How do you know if two paragraphs are similar?

A good way to compare two paragraphs only by comparing the text and word similarity is using an algorithm called Levenshtein Distance. It compare distance between two texts, and you can use the threshold that better suits your need. For example, all text above 90% similarity should be considered the same.


2 Answers

Typically this is accomplished by finding the Longest Common Subsequence (commonly called the LCS problem). This is how tools like diff work. Of course, diff is a line-oriented tool, and it sounds like your needs are somewhat different. However, I'm assuming that you've already constructed some way to compare words and sentences.

like image 138
FatalError Avatar answered Sep 20 '22 13:09

FatalError


An O(NP) Sequence Comparison Algorithm is used by subversion's diff engine.

For your information, there are implementations with various programming languages by myself in following page of github.

https://github.com/cubicdaiya/onp

like image 27
cubicdaiya Avatar answered Sep 18 '22 13:09

cubicdaiya