Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Removing "almost duplicate" strings in subquadratic time

I'm trying to do machine learning on a real-life dataset (hotel reviews). Unfortunately, it's plagued by spam, which comes in the form of almost identical reviews, complicating matters for me greatly.

I would like to remove "almost duplicates" from the dataset based on the edit distance or something similar, and since the dataset size is >100K, the algorithm has to be subquadratic in the size of the dataset. Right now I can only think of flagging individual sentences or phrases that are repeated too often and then removing all reviews that have them, but it's easy to see how such strategy could backfire. Is there a common algorithm that does better?

like image 459
Alexei Averchenko Avatar asked Jan 10 '14 15:01

Alexei Averchenko


1 Answers

Obviously solving this problem in whole might involve writing a decent research paper. Here is my suggestion.

In bioinformatics we face this problem all the time. The most used algorithm is BLAST (http://en.wikipedia.org/wiki/BLAST). Please go through the algorithm and you might get an idea of what is involved.

like image 65
ElKamina Avatar answered Sep 29 '22 14:09

ElKamina