To my understanding, the scientific consensus in NLP is that the most effective method for near-duplicate detection in large-scale scientific document collections (more than 1 billion documents) is the one found here:
http://infolab.stanford.edu/~ullman/mmds/ch3.pdf
which can be briefly described as:

a) shingling of the documents;
b) minhashing the shingle sets to obtain MinHash signatures;
c) locality-sensitive hashing, so that pairwise similarity is computed only for pairs that fall into the same bucket instead of for all signatures.
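To make sure I understand the three steps, here is the kind of minimal single-machine sketch I have in mind (plain Python, not the Map-Reduce/Spark version; the shingle length, number of hash functions, and banding parameters are just illustrative choices, not values from the book):

```python
# Minimal sketch of the shingling -> MinHash -> LSH pipeline (illustrative parameters).
import random
from collections import defaultdict

K = 5             # shingle length in characters
NUM_HASHES = 100  # signature length
BANDS = 20        # LSH bands; ROWS rows per band
ROWS = NUM_HASHES // BANDS
PRIME = (1 << 61) - 1  # large prime for the universal hash family

random.seed(42)
HASH_PARAMS = [(random.randrange(1, PRIME), random.randrange(0, PRIME))
               for _ in range(NUM_HASHES)]

def shingles(text, k=K):
    """a) Character k-shingles of a document, hashed to 32-bit integers."""
    text = " ".join(text.lower().split())
    return {hash(text[i:i + k]) & 0xFFFFFFFF for i in range(len(text) - k + 1)}

def minhash_signature(shingle_set):
    """b) MinHash signature: for each hash function, the minimum hash over the shingle set."""
    return [min((a * s + b) % PRIME for s in shingle_set)
            for a, b in HASH_PARAMS]

def lsh_buckets(signatures):
    """c) LSH banding: documents whose signatures agree on all rows of some band
    share a bucket, so only documents sharing a bucket become candidate pairs."""
    buckets = defaultdict(set)
    for doc_id, sig in signatures.items():
        for band in range(BANDS):
            band_slice = tuple(sig[band * ROWS:(band + 1) * ROWS])
            buckets[(band, band_slice)].add(doc_id)
    return buckets

# Usage: candidate near-duplicate pairs come only from shared buckets.
docs = {
    "d1": "the quick brown fox jumps over the lazy dog",
    "d2": "the quick brown fox jumped over the lazy dog",
    "d3": "an entirely different sentence about something else",
}
sigs = {doc_id: minhash_signature(shingles(t)) for doc_id, t in docs.items()}
candidates = {frozenset((a, b))
              for bucket in lsh_buckets(sigs).values() if len(bucket) > 1
              for a in bucket for b in bucket if a < b}
print(candidates)  # expected to contain the pair d1, d2
```

Candidate pairs would then still be verified by comparing their full signatures (or shingle sets), since banding only prunes the comparisons.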
I am ready to implement this algorithm in Map-Reduce or Spark, but since I am new to the field (I have been reading up on large-scale near-duplicate detection for about two weeks) and the above was published quite a few years ago, I am wondering whether there are known limitations of this algorithm, and whether there are different approaches that are more efficient (offering a more appealing performance/complexity trade-off).
Thanks in advance!
Regarding the second step (b), there are recent developments that significantly speed up the calculation of the signatures: