Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

ALGORITHM - String similarity score/hash

Is there a method to calculate something like general "similarity score" of a string? In a way that I am not comparing two strings together but rather I get some number/scores (hash) for each string that can later tell me that two strings are or are not similar. Two similar strings should have similar (close) scores/hashes.

Let's consider these strings and scores as an example:

Hello world 1000

Hello world! 1010

Hello earth 1125

Foo bar 3250

FooBarbar 3750

Foo Bar! 3300

Foo world! 2350

You can see that Hello world! and Hello world are similar and their scores are close to each other.

This way, finding the most similar strings to a given string would be done by subtracting given strings score from other scores and then sorting their absolute value.

My end aim is : there would be streaming log messages(only pure messages) and i wanna find the pattern of those messages(some sort of regular expression type).But that gets started only when i can bucket similar strings. I again focus that I should get some number/scores (hash) for each string AND THAT CAN LATER tell me that two strings are or are not similar

like image 808
Ajay Avatar asked Jul 12 '11 14:07

Ajay


3 Answers

Have a look at locality-sensitive hashing.

The basic idea is to hash the input items so that similar items are mapped to the same buckets with high probability (the number of buckets being much smaller than the universe of possible input items).

There's a very good explanation available here together with some sample code.

like image 129
Jeff Foster Avatar answered Sep 28 '22 07:09

Jeff Foster


TL;DR: Python BK-tree

Interesting question. I have limited experience within this field, but since the Levenshtein distance fulfills the triangle inequality, I figured that there must be a way of computing some sort of absolute distance to an origin in order to find strings in the vicinity of each other without performing direct comparisons against all entries in the entire database.

While googling on some terms related to this, I found one particularly interesting thesis: Aspects of Metric Spaces in Computation by Matthew Adam Skala.

At page 26 he discusses similarity measures based on kd-trees and others, but concludes:

However, general metric spaces do not provide the geometry required by those techniques. For a general metric space with no other assumptions, it is necessary distance-based to use a distance-based approach that indexes points solely on the basis of their distance from each other. Burkhard and Keller [35] offered one of the first such index structures, now known as a BK-tree for their initials, in 1973. In a BK-tree, the metric is assumed to have a few discrete return values, each internal node contains a vantage point, and the subtrees correspond to the different values of the metric.

A blog entry about how BK-trees work can be found here.

In the thesis, Skala goes on describing other solutions to this problem, including VP-trees and GH-trees. Chapter 6 analyses distances based on the Levenshtein edit distance. He also presents some other interesting distance metrics for strings.

I also found Foundations of Multidimensional and Metric Data Structures, which seems relevant to your question.

like image 22
Alexander Torstling Avatar answered Sep 28 '22 07:09

Alexander Torstling


There are several such "scores", but they all depend on how you define similarity.

  • I think the python library already has a soundex implementation.
  • you can also compute the Levenshtein distance of two strings
  • NYSIIS?
like image 44
Daren Thomas Avatar answered Sep 28 '22 07:09

Daren Thomas