algorithm to find closest string using same characters

Given a list L of n character strings and an input character string S, what is an efficient way to find the character string in L that contains the most characters that exist in S? We want to find the string in L that is most closely made up of the letters contained in S.

The obvious answer is to loop through all n strings and check how many characters in the current string exist in S. However, this algorithm will be run frequently, and the list L of n strings will be stored in a database... looping manually through all n strings would require something like O(n * m^2), where n is the number of strings in L and m is the maximum length of any string in L, as well as the maximum length of S... in this case m is actually a constant of 150.

Is there a better way than just a simple loop? Is there a data structure I can load the n strings into that would give me fast search ability? Is there an algorithm that uses the pre-calculated meta-data about each of the n strings that would perform better than a loop?

I know there are a lot of geeks out there who are into algorithms. So please help!

Thanks!

Asked Mar 01 '23 by Rafe
1 Answer

If you are after substrings, a trie or Patricia trie might be a good starting point.
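As a minimal sketch of the trie idea (plain Python, names are illustrative; a Patricia trie would additionally collapse single-child chains into one edge):

```python
class Trie:
    """A minimal trie supporting insertion and prefix lookup."""

    def __init__(self):
        self.children = {}   # char -> Trie
        self.is_word = False

    def insert(self, word):
        node = self
        for ch in word:
            node = node.children.setdefault(ch, Trie())
        node.is_word = True

    def has_prefix(self, prefix):
        node = self
        for ch in prefix:
            node = node.children.get(ch)
            if node is None:
                return False
        return True

t = Trie()
for w in ["hello", "help", "world"]:
    t.insert(w)
```

Lookups like `t.has_prefix("hel")` then run in time proportional to the prefix length, independent of n.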

If you don't care about the order, just about the number of each symbol or letter, I would calculate the histogram of all strings and then compare them with the histogram of the input.

               ABCDEFGHIJKLMNOPQRSTUVWXYZ
Hello World => ...11..1...3..2..1....1...

This lowers the cost to O(m + 26 * n) - building the histogram of S once in O(m) and comparing it against each of the n precomputed histograms in O(26) each - plus a one-time preprocessing pass, if you consider only case-insensitive Latin letters.
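A sketch of this precompute-and-compare approach (plain Python with `collections.Counter`; the function names are illustrative, and "most shared characters" is scored as the sum of per-letter minimum counts):

```python
from collections import Counter

def histogram(s):
    """26-bin histogram over case-insensitive Latin letters."""
    return Counter(c for c in s.lower() if 'a' <= c <= 'z')

def overlap(h1, h2):
    """Number of characters the two histograms share, with multiplicity."""
    return sum(min(h1[c], h2[c]) for c in h1)

L = ["UUU", "UVW", "WUU"]
# Preprocessing: compute each histogram once, e.g. when loading from the database.
precomputed = [(s, histogram(s)) for s in L]

hs = histogram("VVW")
best = max(precomputed, key=lambda t: overlap(t[1], hs))   # -> ("UVW", ...)
```

The precomputed histograms are small (26 integers per string) and could be stored alongside each row in the database.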

If m is constant, you could interpret the histogram as a 26-dimensional vector and normalize it so it lies on the 26-dimensional unit sphere. Then you could just calculate the dot product of two vectors, yielding the cosine of the angle between them, and this value should be proportional to the similarity of the strings.

Assuming m = 3, an alphabet A = { 'U', 'V', 'W' } of size three only, and the following list of strings.

L = { "UUU", "UVW", "WUU" }

The histograms are the following.

H = { (3, 0, 0), (1, 1, 1), (2, 0, 1) }

A histogram h = (x, y, z) is normalized to h' = (x/r, y/r, z/r) with r the Euclidean norm of the histogram h - that is, r = sqrt(x² + y² + z²).

H' = { (1.000, 0.000, 0.000), (0.577, 0.577, 0.577), (0.894, 0.000, 0.447) }

The input S = "VVW" has the histogram hs = (0, 2, 1) and the normalized histogram hs' = (0.000, 0.894, 0.447).

Now we can compare two histograms h1 = (a, b, c) and h2 = (x, y, z) using the Euclidean distance between them.

d(h1, h2) = sqrt((a - x)² + (b - y)² + (c - z)²)

For the example we obtain.

d((3, 0, 0), (0, 2, 1)) = 3.742
d((1, 1, 1), (0, 2, 1)) = 1.414
d((2, 0, 1), (0, 2, 1)) = 2.828

Hence "UVW" is closest to "VVW" (smaller numbers indicate higher similarity).
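The distances above can be reproduced with a short sketch (plain Python, illustrative names):

```python
import math

def euclidean(h1, h2):
    """Euclidean distance between two equal-length histograms."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(h1, h2)))

H = [(3, 0, 0), (1, 1, 1), (2, 0, 1)]   # histograms of "UUU", "UVW", "WUU"
hs = (0, 2, 1)                          # histogram of the input "VVW"

dists = [euclidean(h, hs) for h in H]   # [3.742, 1.414, 2.828] (rounded)
```

The index of the smallest value, `min(range(len(dists)), key=dists.__getitem__)`, picks out "UVW".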

Using the normalized histograms h1' = (a', b', c') and h2' = (x', y', z'), we can instead calculate the similarity as the dot product of both histograms.

d'(h1', h2') = a'x' + b'y' + c'z'

For the example we obtain.

d'((1.000, 0.000, 0.000), (0.000, 0.894, 0.447)) = 0.000
d'((0.577, 0.577, 0.577), (0.000, 0.894, 0.447)) = 0.774
d'((0.894, 0.000, 0.447), (0.000, 0.894, 0.447)) = 0.200

Again "UVW" is determined to be closest to "VVW" (larger numbers indicate higher similarity).
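The normalized dot-product variant can be sketched the same way (plain Python, illustrative names; minor differences from the rounded figures above come from rounding the displayed vectors):

```python
import math

def normalize(h):
    """Scale a histogram to unit Euclidean length."""
    r = math.sqrt(sum(x * x for x in h))
    return tuple(x / r for x in h)

def dot(h1, h2):
    """Dot product of two equal-length vectors (the cosine, for unit vectors)."""
    return sum(a * b for a, b in zip(h1, h2))

H = [(3, 0, 0), (1, 1, 1), (2, 0, 1)]   # histograms of "UUU", "UVW", "WUU"
hs = normalize((0, 2, 1))               # normalized histogram of "VVW"

sims = [dot(normalize(h), hs) for h in H]
best = max(range(len(sims)), key=sims.__getitem__)   # index 1 -> "UVW"
```

The normalized vectors can also be precomputed once per string, so each query is again O(m) plus one 26-term dot product per candidate.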

Both versions yield different numbers, but the resulting ranking is the same. One could also use other norms - the Manhattan distance (L1 norm), for example - but this only changes the numbers, because all norms on finite-dimensional vector spaces are equivalent.

Answered Mar 03 '23 by Daniel Brückner