Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

grouping strings by similarity

I have an array of strings, not many (maybe a few hundreds) but often long (a few hundred chars).

Those string are, generally, nonsense and different one from the other.. but in a group of those string, maybe 5 out of 300, there's a great similarity. In fact they are the same string, what differs is formatting, punctuation and a few words..

How can I work out that group of string?

By the way, I'm writing in ruby, but if nothing else an algorithm in pseudocode would be fine.

thanks

like image 513
luca Avatar asked Jan 28 '10 23:01

luca


1 Answers

Assuming that you are not worried about misspellings or other errors in each word, you could do the following:

Build an inverted index, which is basically a hash keyed by word, pointing to a list of pointers to the strings which contain that word (how you handle duplicate occurrences is up to you). To determine strings that are similar to a given query string, lookup each query word in the index, and for each source string in the resulting lists, count how many times the source string appears in each list. The strings with the highest counts are your best candidates for similarity, because they contain the most words in common.

Then you can compute the edit distance between the two strings, or whatever other metric you want. This way you avoid the O(n^2) complexity of comparing each string with every other string.

like image 122
RyanHennig Avatar answered Nov 15 '22 04:11

RyanHennig