
Get the string that is the midpoint between two other strings

Is there a library or code snippet available that can take two strings and return the exact or approximate mid-point string between the two strings?

Preferably the code would be in Python.

Background:

This seems like a simple problem on the surface, but I'm kind of struggling with it:

  • Clearly, the midpoint string between "A" and "C" would be "B".
  • With base64 encoding, the midpoint string between "A" and "B" would probably be "Ag".
  • With UTF-8 encoding, I'm not sure what the valid midpoint would be, because the middle character seems to be a control character: U+0088 (bytes c2 88, <control>).
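One way to make the first two bullets concrete is to read each string as a fraction written in the digits of its alphabet and average the two fractions. The sketch below is only an illustration under that assumption; alphabet_midpoint and B64 are hypothetical names, not part of any library.

```python
import string

# Standard base64 alphabet in digit order ('A' is 0, '/' is 63).
B64 = string.ascii_uppercase + string.ascii_lowercase + string.digits + "+/"

def alphabet_midpoint(lo, hi, alphabet):
    """Midpoint of two strings read as fractions 0.d1d2... in base len(alphabet)."""
    base = len(alphabet)
    # One extra digit leaves room for the remainder produced by halving.
    width = max(len(lo), len(hi)) + 1

    def to_int(s):
        n = 0
        # Right-pad with the zero digit so both strings have the same width.
        for ch in s.ljust(width, alphabet[0]):
            n = n * base + alphabet.index(ch)
        return n

    mid = (to_int(lo) + to_int(hi)) // 2
    digits = []
    for _ in range(width):
        mid, d = divmod(mid, base)
        digits.append(alphabet[d])
    # Trailing zero digits do not change the fraction, so drop them.
    return "".join(reversed(digits)).rstrip(alphabet[0])
```

Under these assumptions, alphabet_midpoint("A", "B", B64) gives "Ag" and alphabet_midpoint("A", "C", string.ascii_uppercase) gives "B". For an alphabet with an odd number of digits the result is only approximate.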

Practical Application:

The reason I am asking is that I was hoping to write a map-reduce-type algorithm to read all of the entries out of our database and process them. The primary keys in the database are UTF-8 encoded strings with random distributions of characters. The database we are using is Cassandra.

I was hoping to get the lowest key and the highest key out of the database, break that range into two by finding the midpoint, and then keep splitting each range at its midpoint until I had a few thousand sections. I could then read each section asynchronously.

Example, if the strings were base-16 encoded (some of the midpoints are approximate):

Starting highest and lowest keys:  '000'                'FFF'
                                   /   \              /       \
                              '000'     '8'         '8'       'FFF'
                              /   \     /  \       /  \       /   \
Result:                  '000'    '4' '4'  '8'   '8'  'B8'  'B8'  'FFF'
(After 3 levels of recursion)
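For the base-16 case, the splitting in the diagram can be sketched on plain integers and formatted back to hex afterwards. This is a minimal sketch, assuming fixed-width hex keys; split_range is a hypothetical helper name.

```python
def split_range(lo, hi, depth):
    """Recursively bisect the integer range [lo, hi]; return the sorted boundaries."""
    if depth == 0:
        return [lo, hi]
    mid = lo + (hi - lo) // 2
    left = split_range(lo, mid, depth - 1)
    right = split_range(mid, hi, depth - 1)
    # The midpoint ends the left half and starts the right half; keep it once.
    return left + right[1:]

bounds = split_range(0x000, 0xFFF, 2)
hex_bounds = [format(b, "03X") for b in bounds]
print(hex_bounds)  # ['000', '3FF', '7FF', 'BFF', 'FFF']
```

Neighbouring sections share a boundary value, so a reader would need to treat each section as half-open (for example, start inclusive and end exclusive) to avoid reading a key twice.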
Chris Dutrow asked May 25 '13 17:05


1 Answer

Unfortunately, not all sequences of bytes are valid UTF-8, so you can't simply take the midpoint of the UTF-8 byte values, as the following naive code does.

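def midpoint(s, e):
    '''Midpoint of start and end strings (naive: the result may not be valid UTF-8)'''
    sb, eb = (x.encode('utf-8') for x in (s, e))
    # Right-pad to a common length so integer order matches byte-wise string order
    width = max(len(sb), len(eb))
    si, ei = (int.from_bytes(b.ljust(width, b'\x00'), byteorder='big') for b in (sb, eb))
    # Floor division keeps exact integers; float division loses precision on long keys
    mid = si + (ei - si) // 2

    midpoint_bytes = mid.to_bytes(width, byteorder='big')
    # Raises UnicodeDecodeError when the midpoint bytes are not valid UTF-8
    return midpoint_bytes.decode('utf-8')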

This code interprets each string's UTF-8 bytes as a big-endian integer, finds the midpoint of those two integers, and attempts to decode the "midpoint" bytes as UTF-8 again. The decode step raises UnicodeDecodeError whenever the midpoint bytes are not valid UTF-8.
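To see the failure concretely, here is a small self-contained example with two valid two-byte characters whose integer midpoint is not valid UTF-8. The characters are picked only for illustration, and to_int is a hypothetical helper.

```python
def to_int(s):
    """Interpret a string's UTF-8 bytes as a big-endian integer."""
    return int.from_bytes(s.encode("utf-8"), byteorder="big")

lo, hi = to_int("¡"), to_int("ÿ")    # bytes C2 A1 and C3 BF
mid = lo + (hi - lo) // 2            # 0xC330
mid_bytes = mid.to_bytes(2, byteorder="big")

try:
    mid_bytes.decode("utf-8")
    valid = True
except UnicodeDecodeError:
    # 0x30 is not a continuation byte, so C3 30 is not a valid sequence.
    valid = False

print(mid_bytes, valid)
```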

Depending on exactly what behavior you would like, the next step could be to replace the invalid bytes in midpoint_bytes with some kind of replacement character to form a valid UTF-8 string. For your problem it might not matter much exactly which character you use for the replacement so long as you're consistent.
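A minimal sketch of the replacement idea, assuming Python's built-in errors='replace' handler is acceptable; it substitutes U+FFFD for each invalid sequence. decode_lossy is a hypothetical name.

```python
def decode_lossy(midpoint_bytes):
    """Decode bytes as UTF-8, substituting U+FFFD for any invalid sequence."""
    return midpoint_bytes.decode("utf-8", errors="replace")

# 0x88 on its own is a stray continuation byte, so it is replaced.
print(repr(decode_lossy(b"A\x88")))
```

One caveat with this choice: U+FFFD encodes to the bytes EF BF BD, so the repaired string may no longer sort between the two original keys. Checking that it still does before using it as a range boundary would be prudent.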

However, since you're trying to partition the data and don't seem to care too much about the string representation of the midpoint, another option is to just leave the midpoint representation as an integer and convert the keys to integers while doing the partition. Depending on the scale of your problem this option may or may not be feasible.
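The integer-only option could look something like the sketch below. It assumes that comparing the first few UTF-8 bytes of each key is precise enough for partitioning; key_to_int, partition_bounds, bucket_of and the width parameter are all hypothetical, and how each range is then read from Cassandra is outside the sketch.

```python
import bisect

def key_to_int(key, width=8):
    """Map a key to an integer that preserves UTF-8 byte order on its first `width` bytes."""
    raw = key.encode("utf-8")[:width].ljust(width, b"\x00")
    return int.from_bytes(raw, byteorder="big")

def partition_bounds(lowest, highest, parts, width=8):
    """Return parts + 1 integer boundaries evenly spaced between the two keys."""
    lo, hi = key_to_int(lowest, width), key_to_int(highest, width)
    return [lo + (hi - lo) * i // parts for i in range(parts + 1)]

def bucket_of(key, bounds, width=8):
    """Index of the partition a key falls into, clamped to the valid range."""
    idx = bisect.bisect_right(bounds, key_to_int(key, width)) - 1
    return min(max(idx, 0), len(bounds) - 2)

bounds = partition_bounds("A", "C", 2, width=1)
print(bounds, bucket_of("B", bounds, width=1))
```

Keys that share their first width bytes map to the same integer and therefore land in the same partition, so width would need to be large enough for the real key distribution.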

kgraney answered Sep 27 '22 19:09