 

How to compare two paragraphs of text?

I need to remove duplicate paragraphs from a text that contains many paragraphs.

I use the class java.security.MessageDigest to calculate each paragraph's MD5 hash value, and then add these hash values to a Set.

If add() returns false, the hash is already in the Set, which means the latest paragraph is a duplicate.
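A minimal sketch of the approach described above, assuming the paragraphs are already split into a List<String> and encoded as UTF-8. The class and method names (ParagraphDedup, md5Hex, removeDuplicates) are made up for illustration. The digest is stored as a hex string because a byte[] does not compare by content inside a HashSet.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper class, not part of any library.
final class ParagraphDedup {

    // Returns the MD5 digest of a paragraph as a lowercase hex string.
    public static String md5Hex(String paragraph) {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            byte[] digest = md.digest(paragraph.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder();
            for (byte b : digest) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // MD5 is required on every Java platform, so this is unexpected.
            throw new IllegalStateException(e);
        }
    }

    // Keeps the first occurrence of each paragraph, in original order.
    public static List<String> removeDuplicates(List<String> paragraphs) {
        Set<String> seen = new HashSet<>();
        List<String> unique = new ArrayList<>();
        for (String p : paragraphs) {
            // add() returns true only if the hash was not already in the set.
            if (seen.add(md5Hex(p))) {
                unique.add(p);
            }
        }
        return unique;
    }
}
```

Calling ParagraphDedup.removeDuplicates on a list such as ["a", "b", "a"] keeps "a" and "b". Note that this treats two paragraphs as equal whenever their hashes match, so an MD5 collision would drop a non-duplicate paragraph; comparing the texts with String.equals() on a hash match would rule that out.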

Is there any risk in this approach?

Apart from String.equals(), is there any other way to do it?

mojiayi asked Mar 13 '13 10:03


1 Answer

Before hashing, you could normalize the paragraphs, e.g. by removing punctuation, converting to lower case, and collapsing extra whitespace. After normalization, paragraphs that differ only in those respects get the same hash.
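One way such a normalization step could look, as a rough sketch: the class and method names (ParagraphNormalizer, normalize) are hypothetical, and the choice of what counts as punctuation (here only ASCII punctuation via \p{Punct}) is an assumption you may need to adjust for your text.

```java
import java.util.Locale;

// Hypothetical helper class, not part of any library.
final class ParagraphNormalizer {

    // Returns a canonical form intended only for comparison or hashing.
    public static String normalize(String paragraph) {
        return paragraph
                .toLowerCase(Locale.ROOT)       // locale-independent lower case
                .replaceAll("\\p{Punct}", "")   // strip ASCII punctuation
                .replaceAll("\\s+", " ")        // collapse whitespace runs
                .trim();                        // drop leading/trailing space
    }
}
```

With this, "Hello,   World!" and "  hello world " both normalize to "hello world". The normalized string is what you would hash, while the original paragraph is what you keep in the output.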

Matt answered Sep 24 '22 21:09