Removing "almost duplicate" strings in subquadratic time

Question

I'm trying to do machine learning on a real-life dataset (hotel reviews). Unfortunately, it's plagued by spam, which comes in the form of almost identical reviews, complicating matters for me greatly.

I would like to remove "almost duplicates" from the dataset based on the edit distance or something similar, and since the dataset size is >100K, the algorithm has to be subquadratic in the size of the dataset. Right now I can only think of flagging individual sentences or phrases that are repeated too often and then removing all reviews that have them, but it's easy to see how such strategy could backfire. Is there a common algorithm that does better?

ElKamina · Accepted Answer

Obviously solving this problem in whole might involve writing a decent research paper. Here is my suggestion.

In bioinformatics we face this problem all the time. The most used algorithm is BLAST (http://en.wikipedia.org/wiki/BLAST). Please go through the algorithm and you might get an idea of what is involved.

Removing "almost duplicate" strings in subquadratic time

Tags:

algorithm

data-mining

Alexei Averchenko

1 Answers

ElKamina

Recent Activity

Donate For Us

Removing "almost duplicate" strings in subquadratic time

Tags:

algorithm

data-mining

Alexei Averchenko

1 Answers

ElKamina

Related questions

Recent Activity

Donate For Us