grouping strings by similarity

Question

I have an array of strings, not many (maybe a few hundreds) but often long (a few hundred chars).

Those string are, generally, nonsense and different one from the other.. but in a group of those string, maybe 5 out of 300, there's a great similarity. In fact they are the same string, what differs is formatting, punctuation and a few words..

How can I work out that group of string?

By the way, I'm writing in ruby, but if nothing else an algorithm in pseudocode would be fine.

thanks

RyanHennig · Accepted Answer

Assuming that you are not worried about misspellings or other errors in each word, you could do the following:

Build an inverted index, which is basically a hash keyed by word, pointing to a list of pointers to the strings which contain that word (how you handle duplicate occurrences is up to you). To determine strings that are similar to a given query string, lookup each query word in the index, and for each source string in the resulting lists, count how many times the source string appears in each list. The strings with the highest counts are your best candidates for similarity, because they contain the most words in common.

Then you can compute the edit distance between the two strings, or whatever other metric you want. This way you avoid the O(n^2) complexity of comparing each string with every other string.

grouping strings by similarity

Tags:

string

algorithm

ruby

similarity

grouping

luca

1 Answers

RyanHennig

Recent Activity

Donate For Us

grouping strings by similarity

Tags:

string

algorithm

ruby

similarity

grouping

luca

1 Answers

RyanHennig

Related questions

Recent Activity

Donate For Us