I need to remove duplicated paragraphs from a text with many paragraphs.
I use java.security.MessageDigest to calculate each paragraph's MD5 hash value, and then add these hash values to a Set. If add() returns false, the hash is already in the set, which means the latest paragraph is a duplicate.
Is there any risk in doing it this way? Apart from String.equals(), is there any other way to do it?
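For reference, the approach described in the question might look roughly like the sketch below. The class and method names (ParagraphDeduper, md5Hex, removeDuplicates) are made up for illustration. It assumes the digest is stored as a hex String rather than as a raw byte[], because arrays in a HashSet are compared by identity and two equal digests would not be detected as duplicates.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper class, not part of any library.
public class ParagraphDeduper {

    // Returns the MD5 digest of the text as a lowercase hex string.
    static String md5Hex(String text) throws NoSuchAlgorithmException {
        MessageDigest md = MessageDigest.getInstance("MD5");
        byte[] digest = md.digest(text.getBytes(StandardCharsets.UTF_8));
        StringBuilder sb = new StringBuilder(digest.length * 2);
        for (byte b : digest) {
            sb.append(String.format("%02x", b));
        }
        return sb.toString();
    }

    // Keeps the first occurrence of each paragraph, in the original order.
    static List<String> removeDuplicates(List<String> paragraphs)
            throws NoSuchAlgorithmException {
        Set<String> seenHashes = new HashSet<>();
        List<String> unique = new ArrayList<>();
        for (String paragraph : paragraphs) {
            // add() returns true only if the hash was not already in the set,
            // so a false return marks the paragraph as a duplicate.
            if (seenHashes.add(md5Hex(paragraph))) {
                unique.add(paragraph);
            }
        }
        return unique;
    }

    public static void main(String[] args) throws NoSuchAlgorithmException {
        List<String> input = Arrays.asList(
                "First paragraph.", "Second paragraph.", "First paragraph.");
        System.out.println(removeDuplicates(input));
    }
}
```

Two different paragraphs can in principle produce the same MD5 value, so a cautious variant would confirm a suspected duplicate with String.equals() before dropping it.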
Before hashing, you could normalize the paragraphs, e.g. by removing punctuation, converting to lower case, and collapsing extra whitespace. After normalizing, paragraphs that differ only in those respects would get the same hash.
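A minimal sketch of such a normalization step is shown below. The class and method names are hypothetical, and it assumes that stripping ASCII punctuation is enough; text with non-ASCII punctuation would need a broader pattern.

```java
import java.util.Locale;

// Hypothetical helper class, not part of any library.
public class ParagraphNormalizer {

    // Lower-cases the text, strips ASCII punctuation, and collapses whitespace.
    static String normalize(String paragraph) {
        return paragraph
                .toLowerCase(Locale.ROOT)
                .replaceAll("\\p{Punct}", "") // remove ASCII punctuation
                .replaceAll("\\s+", " ")      // collapse runs of whitespace
                .trim();                      // drop leading/trailing spaces
    }

    public static void main(String[] args) {
        String a = normalize("Hello,   World!");
        String b = normalize("  hello world ");
        System.out.println(a.equals(b)); // both normalize to "hello world"
    }
}
```

The normalized string, rather than the original paragraph, would then be passed to the hash function, while the original text is what gets kept in the output.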