To my understanding, the scientific consensus in NLP is that the most effective method for near-duplicate detection in large-scale scientific document collections (more than 1 billion documents) is the one found here:
http://infolab.stanford.edu/~ullman/mmds/ch3.pdf
which can be briefly described as:

a) shingling of the documents;
b) minhashing the shingle sets to obtain MinHash signatures;
c) locality-sensitive hashing, so that pairwise similarity is computed only for pairs that fall into the same bucket instead of for all signatures.
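To make sure I understand the three steps, here is the kind of minimal single-machine sketch I have in mind (plain Python, not the Map-Reduce/Spark version; the shingle length, number of hash functions, and banding parameters are just illustrative choices, not values from the book):

```python
# Minimal sketch of the shingling -> MinHash -> LSH pipeline (illustrative parameters).
import random
from collections import defaultdict

K = 5             # shingle length in characters
NUM_HASHES = 100  # signature length
BANDS = 20        # LSH bands; ROWS rows per band
ROWS = NUM_HASHES // BANDS
PRIME = (1 << 61) - 1  # large prime for the universal hash family

random.seed(42)
HASH_PARAMS = [(random.randrange(1, PRIME), random.randrange(0, PRIME))
               for _ in range(NUM_HASHES)]

def shingles(text, k=K):
    """a) Character k-shingles of a document, hashed to 32-bit integers."""
    text = " ".join(text.lower().split())
    return {hash(text[i:i + k]) & 0xFFFFFFFF for i in range(len(text) - k + 1)}

def minhash_signature(shingle_set):
    """b) MinHash signature: for each hash function, the minimum hash over the shingle set."""
    return [min((a * s + b) % PRIME for s in shingle_set)
            for a, b in HASH_PARAMS]

def lsh_buckets(signatures):
    """c) LSH banding: documents whose signatures agree on all rows of some band
    share a bucket, so only documents sharing a bucket become candidate pairs."""
    buckets = defaultdict(set)
    for doc_id, sig in signatures.items():
        for band in range(BANDS):
            band_slice = tuple(sig[band * ROWS:(band + 1) * ROWS])
            buckets[(band, band_slice)].add(doc_id)
    return buckets

# Usage: candidate near-duplicate pairs come only from shared buckets.
docs = {
    "d1": "the quick brown fox jumps over the lazy dog",
    "d2": "the quick brown fox jumped over the lazy dog",
    "d3": "an entirely different sentence about something else",
}
sigs = {doc_id: minhash_signature(shingles(t)) for doc_id, t in docs.items()}
candidates = {frozenset((a, b))
              for bucket in lsh_buckets(sigs).values() if len(bucket) > 1
              for a in bucket for b in bucket if a < b}
print(candidates)  # expected to contain the pair d1, d2
```

Candidate pairs would then still be verified by comparing their full signatures (or shingle sets), since banding only prunes the comparisons.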
I am ready to implement this algorithm in Map-Reduce or Spark, but since I am new to the field (I have been reading up on large-scale near-duplicate detection for about two weeks) and the above was published quite a few years ago, I am wondering whether there are known limitations of this algorithm, and whether there are different approaches that are more efficient (offering a more appealing performance/complexity trade-off).
Thanks in advance!
Regarding the second step (b), there are recent developments that significantly speed up the calculation of the signatures: