I'm trying to do machine learning on a real-life dataset (hotel reviews). Unfortunately, it's plagued by spam, which comes in the form of almost identical reviews, complicating matters for me greatly.
I would like to remove "almost duplicates" from the dataset based on the edit distance or something similar, and since the dataset size is >100K, the algorithm has to be subquadratic in the size of the dataset. Right now I can only think of flagging individual sentences or phrases that are repeated too often and then removing all reviews that have them, but it's easy to see how such strategy could backfire. Is there a common algorithm that does better?
Obviously solving this problem in whole might involve writing a decent research paper. Here is my suggestion.
In bioinformatics we face this problem all the time. The most used algorithm is BLAST (http://en.wikipedia.org/wiki/BLAST). Please go through the algorithm and you might get an idea of what is involved.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With