
Percentage Similarity Analysis (Java)

I have the following situation:

String a = "A Web crawler is a computer program that browses the World Wide Web internet automatically";
String b = "Web Crawler computer program browses the World Wide Web";

Is there a standard algorithm for calculating the percentage of similarity?

For instance, in the case above, I would estimate the similarity by eye at 90% or more.

My idea is to tokenize both strings and count the matching tokens, something like (7 tokens / 10 tokens) * 100. But this method does not seem effective at all. Comparing the number of matching characters does not seem effective either.
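For reference, the token-matching idea could be sketched roughly as below. This is only an illustration: the class and method names (TokenOverlap, jaccardPercent, containmentPercent) are made up, tokens are lowercased and split on non-alphanumeric characters, and duplicates are ignored. It shows two possible denominators: the union of both token sets (Jaccard index) and the smaller of the two sets (containment).

```java
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

class TokenOverlap {

    // Lowercase the text and split on anything that is not a letter or digit.
    static Set<String> tokens(String text) {
        Set<String> result = new HashSet<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{Nd}]+")) {
            if (!t.isEmpty()) {
                result.add(t);
            }
        }
        return result;
    }

    // Distinct tokens that occur in both texts.
    static Set<String> shared(Set<String> a, Set<String> b) {
        Set<String> inter = new HashSet<>(a);
        inter.retainAll(b);
        return inter;
    }

    // Jaccard index as a percentage: |A and B| / |A or B|.
    static double jaccardPercent(String a, String b) {
        Set<String> ta = tokens(a);
        Set<String> tb = tokens(b);
        Set<String> union = new HashSet<>(ta);
        union.addAll(tb);
        if (union.isEmpty()) {
            return 100.0; // two empty texts are treated as identical
        }
        return 100.0 * shared(ta, tb).size() / union.size();
    }

    // Containment as a percentage: share of the smaller token set found in the other text.
    static double containmentPercent(String a, String b) {
        Set<String> ta = tokens(a);
        Set<String> tb = tokens(b);
        int smaller = Math.min(ta.size(), tb.size());
        if (smaller == 0) {
            return ta.size() == tb.size() ? 100.0 : 0.0;
        }
        return 100.0 * shared(ta, tb).size() / smaller;
    }

    public static void main(String[] args) {
        String a = "A Web crawler is a computer program that browses the World Wide Web internet automatically";
        String b = "Web Crawler computer program browses the World Wide Web";
        System.out.println("Jaccard:     " + jaccardPercent(a, b));
        System.out.println("Containment: " + containmentPercent(a, b));
    }
}
```

The choice of denominator changes the result a lot: containment rewards a short text that is fully covered by a longer one, while Jaccard penalizes every extra word in either text.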

Can anyone give me some guidelines?

The above is part of my project, Plagiarism Analyzer.

Hence, matched words will be exactly the same; synonyms do not need to be considered.

The only thing that matters here is how to calculate a reasonably accurate percentage of similarity.

Thanks a lot for any help.

Mr CooL asked Dec 12 '22 23:12


1 Answer

As Konrad pointed out, your question depends heavily on what you mean by "similar". In general, I would say the following guidelines should be of use:

  • normalize the input by reducing each word to its base form and lowercasing it
  • use a word frequency list (easily obtainable on the web) and make each word's "similarity relevance" inversely proportional to its position on the frequency list, so that very common words count for less than rare ones
  • calculate the total sentence similarity as the aggregated relevance of the words appearing in both sentences, divided by the total similarity relevance of the sentences
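The three guidelines above could be sketched roughly as follows. This is a minimal illustration under several assumptions, not a definitive implementation: the frequency list is a tiny made-up sample (most common word first), normalization is lowercasing only (reducing words to a base form would need a stemmer or lemmatizer, which is not shown), the relevance formula log(1 + rank) is one arbitrary way to make common words count less, and the "total relevance of the sentences" is taken over the distinct words of both sentences. All names (WeightedSimilarity, FREQUENCY_LIST, relevance, similarityPercent) are hypothetical.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

class WeightedSimilarity {

    // Hypothetical sample frequency list, most common word first.
    // A real list would come from a corpus and hold thousands of entries.
    static final List<String> FREQUENCY_LIST =
            Arrays.asList("the", "a", "is", "that", "web", "program", "computer", "world");

    // Word -> 1-based rank on the frequency list.
    static final Map<String, Integer> RANK = new HashMap<>();
    static {
        for (int i = 0; i < FREQUENCY_LIST.size(); i++) {
            RANK.put(FREQUENCY_LIST.get(i), i + 1);
        }
    }

    // Normalization: lowercase and split on non-alphanumeric characters (no stemming here).
    static Set<String> words(String text) {
        Set<String> result = new HashSet<>();
        for (String w : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{Nd}]+")) {
            if (!w.isEmpty()) {
                result.add(w);
            }
        }
        return result;
    }

    // Assumed weighting: the more common the word, the lower its relevance.
    // Words missing from the list are treated as rarer than every listed word.
    static double relevance(String word) {
        int rank = RANK.getOrDefault(word, FREQUENCY_LIST.size() + 1);
        return Math.log(1 + rank);
    }

    // Relevance of the shared words divided by the relevance of all distinct words, as a percentage.
    static double similarityPercent(String a, String b) {
        Set<String> wa = words(a);
        Set<String> wb = words(b);
        Set<String> all = new HashSet<>(wa);
        all.addAll(wb);
        double shared = 0.0;
        double total = 0.0;
        for (String w : all) {
            double r = relevance(w);
            total += r;
            if (wa.contains(w) && wb.contains(w)) {
                shared += r;
            }
        }
        return total == 0.0 ? 100.0 : 100.0 * shared / total;
    }

    public static void main(String[] args) {
        String a = "A Web crawler is a computer program that browses the World Wide Web internet automatically";
        String b = "Web Crawler computer program browses the World Wide Web";
        System.out.println(similarityPercent(a, b));
    }
}
```

With this weighting, dropping filler words such as "a" or "is" costs less than dropping a rare word such as "crawler", which is the effect the frequency list is meant to produce.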

You can refine the technique to include differences between word forms, sentence word order, synonym lists, etc. Although you'll never get perfect results, you have plenty of room for tweaking, and I believe that in general you can get quite valuable measures of similarity.

Tomislav Nakic-Alfirevic answered Dec 23 '22 17:12