
"Absolute" string metric

I have a huge (but finite) set of natural language strings.

I need a way to convert each string to a numeric value. For any given string the value must be the same every time.

The more "different" two given strings are, the further apart their corresponding values should be. The more "similar" they are, the closer the values should be.

I do not yet know exactly what definition of difference between strings I need. No natural language parsing, in any case. It should probably be something Levenshtein-like (but Levenshtein distance is relative, defined only between a pair of strings, and I need an absolute metric). Let's start with something simple.
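To make "relative" concrete: Levenshtein distance is a function of two strings, so it yields one number per pair rather than one value per string. Below is a minimal sketch of the standard dynamic-programming formulation; the function name `levenshtein` is only illustrative, not something from the question.

```python
def levenshtein(s, t):
    """Edit distance: minimum number of single-character insertions,
    deletions and substitutions needed to turn s into t."""
    if len(s) < len(t):
        s, t = t, s  # keep the inner row as short as possible
    previous = list(range(len(t) + 1))  # distances from "" to prefixes of t
    for i, cs in enumerate(s, start=1):
        current = [i]  # distance from s[:i] to ""
        for j, ct in enumerate(t, start=1):
            cost = 0 if cs == ct else 1
            current.append(min(
                previous[j] + 1,         # delete cs
                current[j - 1] + 1,      # insert ct
                previous[j - 1] + cost,  # substitute (or match)
            ))
        previous = current
    return previous[-1]


print(levenshtein("kitten", "sitting"))  # 3
print(levenshtein("cat", "bat"))         # 1
```

The result only has meaning for a pair of inputs; nothing here assigns a standalone value to "cat".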

Update on dimensions

I'll be happy to settle for a multidimensional (3D is best) vector instead of a single numeric value.

Update on expected result correctness

As was correctly noted here and here, the distance from one string to another is a vector with MAX(firstStringLength, secondStringLength) dimensions. In general it is not possible to reduce the number of dimensions without some loss of information.

However, I do not need an absolute solution. I would settle for any "good enough" conversion from the N-dimensional string space to my 3D space.

Note also that I have a finite number of strings of finite length. (The number of strings is rather large though, about 80 million (10 GB), so I'd better pick some single-pass stateless algorithm.)

From scanning references, I'm under the impression that a Hilbert space-filling curve may help me here. The article "Analysis of the Clustering Properties of the Hilbert Space-Filling Curve" seems to discuss something close to my problem...

Update on Hilbert curve approach

  1. We map each string to a point in an N-dimensional space, where N is the maximum length of a string in the set. BTW, can the i-th character code of a string be used as the i-th coordinate value here?
  2. We plot a Hilbert curve through that N-dimensional space.
  3. For each string we take the point on the curve closest to the coordinates of the string. The Hilbert value of that point (the length along the curve from its beginning) is the single-dimensional value I seek.
  4. If we need a 3D value, we plot a Hilbert curve in 3D and pick the points matching the Hilbert values calculated above.
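As an illustration of what a "Hilbert value" is, here is a sketch for the smallest case only: a 2D grid whose side `n` is a power of two, using the widely published bit-manipulation form of the mapping. The function name `hilbert_index_2d` and the 4-by-4 grid are assumptions made for the example; the N-dimensional curve in the steps above needs a more general algorithm that is not shown here.

```python
def hilbert_index_2d(n, x, y):
    """Distance along the Hilbert curve of cell (x, y) in an n-by-n grid.

    Assumes n is a power of two and 0 <= x, y < n.
    """
    d = 0
    s = n // 2
    while s > 0:
        rx = 1 if x & s else 0
        ry = 1 if y & s else 0
        d += s * s * ((3 * rx) ^ ry)
        # Rotate/flip the quadrant so the sub-curve has the standard orientation.
        if ry == 0:
            if rx == 1:
                x, y = n - 1 - x, n - 1 - y
            x, y = y, x
        s //= 2
    return d


n = 4
cells = [(x, y) for x in range(n) for y in range(n)]
order = sorted(cells, key=lambda p: hilbert_index_2d(n, *p))
print(order[:4])  # [(0, 0), (1, 0), (1, 1), (0, 1)]

# Neighbouring cells can still end up far apart along the curve:
print(hilbert_index_2d(n, 1, 0), hilbert_index_2d(n, 2, 0))  # 1 14
```

Cells that are consecutive on the curve are always adjacent in the grid, but the reverse does not hold: (1, 0) and (2, 0) touch, yet their Hilbert values are 1 and 14. The loop runs once per bit of coordinate, so the cost per point in this 2D case grows with log2(n).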

Does this look right? What would the computational cost be?

Alexander Gladysh asked Jan 30 '09 22:01


2 Answers

I don't think this is possible to do. Start with a simple string, and assign it zero (it doesn't really matter what the number is).

  • "Hello World" = 0

The following strings are at distance 2 from it:

  • "XXllo World" = a
  • "HeXXo World" = b
  • "Hello XXrld" = c
  • "Hello WorXX" = d

Yet each of these strings is at distance 4 from every other. There is no way to order the numbers so that this works. Take, for instance:

a = 1, b = -1, c = 2, d = -2

The distance from c to 0 is 2 and from c to a is only 1, so the numbers put c closer to a than to 0, yet as strings 0 is closer to c than a is.

And this is just a simple case.
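The distances quoted above can be checked mechanically. The sketch below uses Hamming distance (substitutions only) as a stand-in, which is an assumption that is safe here only because all five strings have the same length. It then searches a small range of integers, chosen arbitrarily, for any assignment in which every "distance 2" pair ends up strictly closer on the number line than every "distance 4" pair.

```python
from itertools import combinations, product


def hamming(s, t):
    """Number of positions at which two equal-length strings differ."""
    if len(s) != len(t):
        raise ValueError("strings must have equal length")
    return sum(a != b for a, b in zip(s, t))


strings = ["Hello World", "XXllo World", "HeXXo World",
           "Hello XXrld", "Hello WorXX"]
pairs = list(combinations(range(len(strings)), 2))
dist = {(i, j): hamming(strings[i], strings[j]) for i, j in pairs}

# Every (closer pair, farther pair) combination the numbers must respect.
constraints = [(p, q) for p in pairs for q in pairs if dist[p] < dist[q]]


def preserves_order(values):
    """True if closer strings always get strictly closer numbers."""
    return all(
        abs(values[p[0]] - values[p[1]]) < abs(values[q[0]] - values[q[1]])
        for p, q in constraints
    )


# "Hello World" is pinned to 0; the other four values are searched.
search_range = range(-6, 7)  # arbitrary illustrative bound
solutions = [
    (0,) + rest
    for rest in product(search_range, repeat=4)
    if preserves_order((0,) + rest)
]

print(sorted(set(dist.values())))  # [2, 4]
print(len(solutions))              # 0
```

The empty result is not specific to the search bound: at least two of the four values must lie on the same side of 0, and the one further out is then at least as far from 0 as it is from the other.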

FryGuy answered Oct 03 '22 09:10


I think you are going to have to specify your problem more clearly. What exactly are you trying to achieve with this metric?

I say this because Levenshtein works by mapping pairs of strings to a metric, which can preserve the dimensionality of the string space. If you instead map individual strings to numbers, a great deal of dimensional information is lost. For example, given the string "cat", I'd want "bat", "hat", "rat", "can", "cot" etc. to all be reasonably close to it. With a large number of words, you end up with dissimilar words being close relatively often: "bat" and "cot" may be close because they both happen to lie at similar distances from "cat" on the positive side. This is similar to what happens when you try to map the plane to a line: it is difficult to meet the restriction that points far apart in the plane stay far apart on the line. The upshot is that the requirement 'The more "different" two given strings are, the more different two corresponding values should be' is difficult to satisfy.

So my first suggestion is to ask whether you really need something that does this. Will a simple hash code suffice to give you unique values? Or perhaps you can use Levenshtein after all and ignore the values for individual strings? If neither suffices, perhaps you can use a multidimensional function value, that is, map strings to pairs, triples or another small tuple of numbers. The extra dimensionality granted that way will give you far better results.

An example might be encoding the string as a triple: length, sum of the values of the letters in the string, and alternating sum of the values of the letters, e.g. f("cat") = (3, 3 + 1 + 20, 3 - 1 + 20) = (3, 24, 22). This would have some of the properties you desire, but is probably not optimal. Try looking for orthogonal features of the string to do this encoding. Even better, if you have a large test set of strings, there are existing libraries for mapping this sort of data into low dimensions while preserving metrics (e.g. the Levenshtein metric), and you can train your function on that. I remember the S language had support for this sort of thing.
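For the hash-code option, a minimal sketch of a value that is the same on every run might look like the following. The name `stable_hash` and the choice of SHA-256 truncated to 64 bits are assumptions made for illustration. Python's built-in `hash()` is avoided because for strings it is salted per process unless `PYTHONHASHSEED` is fixed.

```python
import hashlib


def stable_hash(text):
    """Deterministic 64-bit integer for a string (illustrative helper)."""
    digest = hashlib.sha256(text.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")  # keep the first 8 bytes


for word in ("cat", "bat", "cot"):
    print(word, stable_hash(word))
```

Such a value is repeatable and needs only a single pass over each string, but it says nothing about similarity: "cat" and "bat" receive unrelated numbers.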
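The triple described above can be sketched as follows. The names `letter_value` and `feature_triple` are hypothetical, and the handling of case and of non-letter characters (lowercased, and skipped, respectively) is an assumption, since the answer only covers a = 1 through z = 26.

```python
import math


def letter_value(ch):
    """a -> 1, b -> 2, ..., z -> 26 (lowercase ASCII letters only)."""
    return ord(ch) - ord("a") + 1


def feature_triple(text):
    """(length, sum of letter values, alternating sum of letter values).

    Assumption: characters other than a-z are ignored after lowercasing,
    so the length counts letters only.
    """
    values = [letter_value(ch) for ch in text.lower() if "a" <= ch <= "z"]
    total = sum(values)
    alternating = sum(v if i % 2 == 0 else -v for i, v in enumerate(values))
    return (len(values), total, alternating)


print(feature_triple("cat"))  # (3, 24, 22)
print(feature_triple("bat"))  # (3, 23, 21)
print(feature_triple("act"))  # (3, 24, 18)

# Euclidean distance between two triples as a rough dissimilarity score.
print(math.dist(feature_triple("cat"), feature_triple("bat")))
```

Each string is processed independently in one pass, which fits the single-pass, stateless constraint from the question. The three features are far from orthogonal, though, so many unrelated strings will still land close together.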

Daniel Nadasi answered Oct 03 '22 08:10