
How to compute a double precision float score from the first 8 bytes of a string in Python?

Trying to get a double-precision floating point score from a UTF-8 encoded string object in Python. The idea is to grab the first 8 bytes of the string and create a float, so that the strings, ordered by their score, would be ordered lexicographically according to their first 8 bytes (or possibly their first 63 bits, after forcing them all to be positive to avoid sign errors).

For example:

get_score(u'aaaaaaa') < get_score(u'aaaaaaab') < get_score(u'zzzzzzzz')

I have tried to compute the score as an integer using left bit shifts and XOR, but I am not sure how to translate that into a float value. I am also not sure whether there is a better way to do this.

How should the score for a string be computed so the condition I specified before is met?

Edit: The string object is UTF-8 encoded (as per @Bakuriu's comment).

Juan Carlos Coto asked Oct 23 '13 18:10


2 Answers

A float won't give you 64 bits of precision: a double-precision float has only a 53-bit significand. Use integers instead.

import struct
def get_score(s):
  # Right-pad with NULs so that a string sorts before any longer string it prefixes.
  return struct.unpack('>Q', (s[:8] + u'\0\0\0\0\0\0\0\0')[:8])[0]

In Python 3:

import struct
def get_score(s):
  # Right-pad with NULs; 'strict' raises UnicodeEncodeError on non-ASCII input.
  return struct.unpack('>Q', (s[:8] + '\0\0\0\0\0\0\0\0')[:8].encode('ascii', 'strict'))[0]
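
Since the question mentions UTF-8, here is a minimal sketch of the same integer idea that slices bytes rather than characters. It uses `int.from_bytes` instead of `struct`, and the name `get_score_int` and the sample words are only illustrative assumptions, not part of the original answer.

```python
def get_score_int(s: str) -> int:
    """Score a string by its first 8 UTF-8 bytes, read as a big-endian unsigned integer."""
    # Slice after encoding: the question asks for the first 8 bytes, not 8 characters.
    prefix = s.encode("utf-8")[:8]
    # Right-pad with NUL bytes so a string sorts before any longer string it prefixes.
    return int.from_bytes(prefix.ljust(8, b"\0"), "big")


# Illustrative sample data (hypothetical).
words = ["zzzzzzzz", "b", "aaaaaaab", "aa", "aaaaaaa", "é"]
by_score = sorted(words, key=get_score_int)
by_bytes = sorted(words, key=lambda w: w.encode("utf-8")[:8])
```

Sorting by this score should agree with sorting by the first 8 encoded bytes. Strings that share the same first 8 bytes, or that differ only by trailing NUL bytes, get the same score.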

EDIT:

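For floats, with 6 characters (48 bits, which fit in a double's 52-bit mantissa):

import struct
def get_score(s):
  # The \0\1 prefix keeps the sign and exponent bits at zero; the 6 bytes land in the mantissa.
  return struct.unpack('>d', (u'\0\1' + (s[:6] + u'\0\0\0\0\0\0')[:6]).encode('ascii', 'strict'))[0]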
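
As an alternative sketch for the float case, the 6-byte prefix can be converted to an integer and then to a float, rather than packed into a double's bit pattern. This is a different technique from the `struct` version above; it relies on every integer below 2**53 being exactly representable as a double. The name `get_score_float` and the sample words are hypothetical.

```python
def get_score_float(s: str) -> float:
    """Score a string by its first 6 UTF-8 bytes as an exactly representable float."""
    # 6 bytes = 48 bits, which is below the 53 bits a double can hold exactly.
    prefix = s.encode("utf-8")[:6].ljust(6, b"\0")
    return float(int.from_bytes(prefix, "big"))


# Above 2**53, neighbouring integers collapse onto the same double,
# which is why a full 8-byte prefix cannot be scored exactly as a float.
precision_lost = float(2**53) == float(2**53 + 1)

# Illustrative sample data (hypothetical).
words = ["zzzzzz", "b", "aaaaab", "aa", "aaaaa"]
by_score = sorted(words, key=get_score_float)
```

The trade-off is resolution: strings that share their first 6 bytes, such as 'aaaaaaa' and 'aaaaaaab', receive the same score.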

Ignacio Vazquez-Abrams answered Oct 23 '22 05:10


You will need to set up the entire alphabet and do the conversion by hand, since conversions to bases above 36 are not built in. To do that, you only need to define the complete alphabet to use. For an ASCII string, for instance, you would convert the input string to a long in base 256, using all 256 byte values (the extended ASCII table) as the alphabet.

There is an example of the full functions to do it here: string to base 62 number

You also don't need to worry about negative versus positive numbers when doing this: the digits are accumulated into a Python integer, which has arbitrary precision and never overflows into a sign bit. A string made up of the first character of the alphabet yields the minimum possible value, 0, so you can use < and > on the results directly. A sign only comes into play if you reinterpret the 64 bits as a signed integer, where the minimum is -2**63.
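
The by-hand conversion described in this answer could be sketched roughly as follows. The helper names `to_number` and `score`, the `BYTE_ALPHABET` constant, and the choice to right-pad short strings with the first character of the alphabet are assumptions made for illustration; they are not taken from the linked example.

```python
def to_number(s: str, alphabet: str) -> int:
    """Interpret s as a number written in base len(alphabet)."""
    base = len(alphabet)
    # Map each character to its digit value; unknown characters raise KeyError.
    digit = {ch: i for i, ch in enumerate(alphabet)}
    n = 0
    for ch in s:
        n = n * base + digit[ch]
    return n


# Hypothetical 256-character alphabet: code points 0..255.
BYTE_ALPHABET = "".join(map(chr, range(256)))


def score(s: str, alphabet: str = BYTE_ALPHABET, width: int = 8) -> int:
    """Score the first `width` characters of s as a base-len(alphabet) number."""
    # Pad on the right with the smallest digit so shorter strings sort first.
    padded = s[:width].ljust(width, alphabet[0])
    return to_number(padded, alphabet)
```

With the 256-character alphabet and a width of 8, the result stays within 0 and 2**64 - 1, so ordinary integer comparison gives the ordering of the first 8 characters.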

Hope it helps!

Sergio Ayestarán answered Oct 23 '22 04:10