Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Near Duplicate Detection in Data Streams

I am currently working on a streaming API that generates a lot of textual content. As expected, the API gives out a lot of duplicates and we also have a business requirement to filter near duplicate data.

I did a bit of research on duplicate detection in data streams and read about Stable Bloom Filters. Stable bloom filters are data structures for duplicate detection in data streams with an upper bound on the false positive rate.

But, I want to identify near duplicates and I also looked at Hashing Algorithms like LSH and MinHash that are used in Nearest Neighbour problems and Near Duplicate Detection.

I am kind of stuck and looking for pointers as to how to proceed and papers/implementations that I could look at?

like image 476
thickblood Avatar asked Apr 27 '12 10:04

thickblood


2 Answers

  1. First, normalize the text to all lowercase (or uppercase) characters, replace all non-letters with a white space, compress all multiple white spaces to one, remove leading and trailing white space; for speed I would perform all these operations in one pass of the text. Next take the MD5 hash (or something faster) of the resulting string. Do a database lookup of the MD5 hash (as two 64 bit integers) in a table, if it exists, it is an exact duplicate, if not, add it to the table and proceed to the next step. You will want to age off old hashes based either on time or memory usage.

  2. To find near duplicates the normalized string needs to be converted into potential signatures (hashes of substrings), see the SpotSigs paper and blog post by Greg Linden. Suppose the routine Sigs() does that for a given string, that is, given the normalized string x, Sigs(x) returns a small (1-5) set of 64 bit integers. You could use something like the SpotSigs algorithm to select the substrings in the text for the signatures, but making your own selection method could perform better if you know something about your data. You may also want to look at the simhash algorithm (the code is here).

  3. Given the Sigs() the problem of efficiently finding the near duplicates is commonly called the set similarity joins problem. The SpotSigs paper outlines some heuristics to trim the number of sets a new set needs to be compared to as does the simhash method.

like image 197
Jeff Kubina Avatar answered Nov 12 '22 15:11

Jeff Kubina


http://micvog.com/2013/09/08/storm-first-story-detection/ has some nice implementation notes

like image 1
Ashwin Jayaprakash Avatar answered Nov 12 '22 14:11

Ashwin Jayaprakash