 

State-of-the-art method for large-scale near-duplicate detection of documents?

To my understanding, the scientific consensus in NLP is that the most effective method for near-duplicate detection in large-scale scientific document collections (more than 1 billion documents) is the one found here:

http://infolab.stanford.edu/~ullman/mmds/ch3.pdf

which can be briefly described by:

a) shingling of the documents,
b) minhashing to obtain the MinHash signatures of the shingles, and
c) locality-sensitive hashing, so that instead of computing pairwise similarities for all signatures, only pairs that fall into the same bucket are compared.
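
For concreteness, here is a minimal, self-contained Python sketch of these three steps. The shingle length, the number of hash functions, and the band/row split are illustrative choices rather than values prescribed by the chapter, and the salted-MD5 "hash functions" are just a stand-in for the random permutations it describes.

```python
import hashlib
from collections import defaultdict
from itertools import combinations

def shingles(text, k=5):
    """a) k-shingling: the set of all length-k character substrings."""
    return {text[i:i + k] for i in range(len(text) - k + 1)}

def minhash_signature(shingle_set, num_hashes=100):
    """b) MinHash: for each of num_hashes hash functions, keep the minimum
    hash value over the shingle set. The i-th 'hash function' is MD5 of the
    shingle salted with i, which is enough for a sketch."""
    return tuple(
        min(int(hashlib.md5(f"{i}:{s}".encode()).hexdigest(), 16)
            for s in shingle_set)
        for i in range(num_hashes)
    )

def lsh_candidate_pairs(signatures, bands=20, rows=5):
    """c) LSH banding: split each signature into `bands` bands of `rows` values;
    documents whose band hashes collide in any band become candidate pairs,
    and only those candidates need an exact similarity check."""
    buckets = defaultdict(set)
    for doc_id, sig in signatures.items():
        for b in range(bands):
            band = sig[b * rows:(b + 1) * rows]
            buckets[(b, hash(band))].add(doc_id)
    pairs = set()
    for ids in buckets.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs

if __name__ == "__main__":
    docs = {
        "d1": "the quick brown fox jumps over the lazy dog",
        "d2": "the quick brown fox jumped over the lazy dog",
        "d3": "completely unrelated text about something else",
    }
    sigs = {d: minhash_signature(shingles(t)) for d, t in docs.items()}
    print(lsh_candidate_pairs(sigs))  # ('d1', 'd2') should appear as a candidate
```

With b bands of r rows, a pair with Jaccard similarity s becomes a candidate with probability 1 - (1 - s^r)^b, which is how the chapter tunes the similarity threshold.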

I am ready to implement this algorithm in Map-Reduce or Spark, but because I am new to the field (I have been reading up on large-scale near-duplicate detection for about two weeks) and the above was published quite a few years ago, I am wondering whether there are known limitations of the algorithm and whether there are different approaches that are more efficient (offering a more appealing performance/complexity trade-off).
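
As a rough illustration of how step c) distributes, the banding step can be expressed as a map phase that emits one record per (band index, band hash) key and a reduce phase that groups document ids sharing a bucket. The sketch below assumes PySpark; the RDD contents, constants, and names are placeholders.

```python
from pyspark import SparkContext

sc = SparkContext(appName="lsh-banding-sketch")

BANDS, ROWS = 20, 5

# (doc_id, signature) pairs produced by steps a) and b); dummy data here
signatures = sc.parallelize([
    ("d1", tuple(range(BANDS * ROWS))),
    ("d2", tuple(range(BANDS * ROWS))),
])

def to_buckets(record):
    """Map: emit one ((band index, band hash), doc_id) record per band.
    In production a stable hash (e.g. MD5 of the band bytes) is preferable
    to Python's built-in hash()."""
    doc_id, sig = record
    for b in range(BANDS):
        band = sig[b * ROWS:(b + 1) * ROWS]
        yield ((b, hash(band)), doc_id)

candidate_groups = (signatures
                    .flatMap(to_buckets)               # map: key by (band, band hash)
                    .groupByKey()                      # shuffle: collect docs per bucket
                    .map(lambda kv: sorted(kv[1]))     # reduce: bucket -> doc id group
                    .filter(lambda ids: len(ids) > 1)) # keep buckets with >= 2 docs

print(candidate_groups.collect())
```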

Thanks in advance!

asked Jun 04 '17 by Alex


1 Answer

Regarding the second step b), there are recent developments that significantly speed up the calculation of the MinHash signatures:

  • Optimal Densification for Fast and Accurate Minwise Hashing, 2017, https://arxiv.org/abs/1703.04664
  • Fast Similarity Sketching, 2017, https://arxiv.org/abs/1704.04370
  • SuperMinHash - A New Minwise Hashing Algorithm for Jaccard Similarity Estimation, 2017, https://arxiv.org/abs/1706.05698
  • ProbMinHash - A Class of Locality-Sensitive Hash Algorithms for the (Probability) Jaccard Similarity, 2019, https://arxiv.org/abs/1911.00675
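
A common thread in these papers is avoiding k independent hash evaluations per set element. For intuition only, here is a small Python sketch of one-permutation hashing with a simple densification step (empty bins copy the value of the next non-empty bin); this is a simplified variant, not the optimal densification scheme of the first paper nor the exact algorithms of the others.

```python
import hashlib

def one_permutation_minhash(shingle_set, k=64):
    """Hash every element ONCE, split the hash range into k bins, and keep the
    minimum value seen per bin; classical MinHash would instead evaluate k hash
    functions per element. Empty bins are then filled ('densified') from the
    next non-empty bin, wrapping around."""
    if not shingle_set:
        raise ValueError("need a non-empty shingle set")
    bins = [None] * k
    for s in shingle_set:
        h = int(hashlib.md5(s.encode()).hexdigest(), 16)
        b, v = h % k, h // k          # bin index and value within the bin
        if bins[b] is None or v < bins[b]:
            bins[b] = v
    for i in range(k):                # densification pass for empty bins
        if bins[i] is None:
            j = (i + 1) % k
            while bins[j] is None:
                j = (j + 1) % k
            bins[i] = bins[j]
    return bins

def estimate_jaccard(sig_a, sig_b):
    """Fraction of matching positions approximates the Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)
```

The point is the cost model: classical MinHash needs on the order of k hash evaluations per element, while one-permutation schemes hash each element once plus a cheap densification pass, which is where the reported speed-ups come from.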
answered Jan 04 '23 by otmar