Percentage Similarity Analysis (Java)

Question

I have following situation:

String a = "A Web crawler is a computer program that browses the World Wide Web internet automatically"; String b = "Web Crawler computer program browses the World Wide Web";

Is there any idea or standard algorithm to calculate the percentage of similarity?

For instance, above case, the similarity estimated by manual looking should be 90%++.

My idea is to tokenize both Strings and compare the number of tokens matched. Something like (7 tokens /1 0 tokens) * 100. But, of course, it is not effective at all for this method. Compare number of characters matched also seem to be not effective....

Can anyone give some guidelines???

Above is part of my project, Plagiarism Analyzer.

Hence, the words matched will be exactly same without any synonyms.

The only matters in this case is that how to calculate a quite accurate percentage of similarity.

Thanks a lot for any helps.

Tomislav Nakic-Alfirevic · Accepted Answer

As Konrad pointed out, your question depends heavily on what you mean by "similar". In general, I would say the following guidelines should be of use:

normalize the input by reducing a word to it's base form and lowercase it
use a word frequency list (obtainable easily on the web) and make the word's "similarity relevance" inversly proportional to it's position on the frequency list
calculate the total sentence similarity as an aggregated similarity of the words appearing in both sentences divided by the total similarity relevance of the sentences

You can refine the technique to include differences between word forms, sentence word order, synonim lists etc. Although you'll never get perfect results, you have a lot of tweaking possibilities and I believe that in general you might get quite valuable measures of similarity.

Percentage Similarity Analysis (Java)

Tags:

java

similarity

Mr CooL

1 Answers

Tomislav Nakic-Alfirevic

Recent Activity

Donate For Us

Percentage Similarity Analysis (Java)

Tags:

java

similarity

Mr CooL

1 Answers

Tomislav Nakic-Alfirevic

Related questions

Recent Activity

Donate For Us