Is there a library or code snippet available that can take two strings and return the exact or approximate mid-point string between the two strings?
Preferably the code would be in Python.
This seems like a simple problem on the surface, but I'm kind of struggling with it:
U+0088 c2 88 <control>
The reason I am asking is because I was hoping write map-reduce type algorithm to read all of the entries out of our database and process them. The primary keys in the database are UTF-8 encoded strings with random distributions of characters. The database we are using is Cassandra.
Was hoping to get the lowest key and the highest key out of the database, then break that up into two ranges by finding the midpoint, then breaking those two ranges up into two smaller sections by finding each of their midpoints until I had a few thousand sections, then I could read each section asynchronously.
Example if the strings were base-16 encoded: (Some of the midpoints are approximate):
Starting highest and lowest keys: '000' 'FFF' / \ / \ '000' '8' '8' 'FFF' / \ / \ / \ / \ Result: '000' '4' '4' '8' '8' 'B8' 'B8' 'FFF' (After 3 levels of recursion)
Unfortunately not all sequences of bytes are valid UTF-8, so it's not trivial to just take the midpoint of the UTF-8 values, like the following.
def midpoint(s, e):
'''Midpoint of start and end strings'''
(sb, eb) = (int.from_bytes(bytes(x, 'utf-8'), byteorder='big') for x in (s, e))
midpoint = int((eb - sb) / 2 + sb)
midpoint_bytes = midpoint.to_bytes((midpoint.bit_length() // 8) + 1, byteorder='big')
return midpoint_bytes.decode('utf-8')
Basically this code converts each string into an integer represented by the sequence of bytes in memory, finds the midpoint of those two integers, and attempts to interpret the "midpoint" bytes as UTF-8 again.
Depending on exactly what behavior you would like, the next step could be to replace the invalid bytes in midpoint_bytes
with some kind of replacement character to form a valid UTF-8 string. For your problem it might not matter much exactly which character you use for the replacement so long as you're consistent.
However, since you're trying to partition the data and don't seem to care too much about the string representation of the midpoint, another option is to just leave the midpoint representation as an integer and convert the keys to integers while doing the partition. Depending on the scale of your problem this option may or may not be feasible.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With