I want to truncate a Unicode string so that its UTF-8 encoding is at most 255 bytes long, and get the result back as Unicode:
# s = arbitrary-length-unicode-string
s.encode('utf-8')[:255].decode('utf-8')
The problem with this snippet is that if the 255th byte is the first byte of a multi-byte character, I get an error:
UnicodeDecodeError: 'utf8' codec can't decode byte 0xd0 in position 254: unexpected end of data
Even if I handle the error, I'll get unwanted garbage at the end of the string.
How can I solve this more elegantly?
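One very nice property of UTF-8 is that trailing (continuation) bytes can easily be distinguished from starting bytes: every continuation byte has the bit pattern 10xxxxxx. Take one byte more than you need, then work backwards from the end until you've deleted a starting byte.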
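# Python 2: indexing a byte string yields a 1-character str, hence ord().
trunc_s = s.encode('utf-8')[:256]
if len(trunc_s) > 255:
    final = -1
    # Skip back over continuation bytes (10xxxxxx) to the starting byte.
    while ord(trunc_s[final]) & 0xc0 == 0x80:
        final -= 1
    # Drop the starting byte and everything after it.
    trunc_s = trunc_s[:final]
trunc_s = trunc_s.decode('utf-8')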
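For Python 3, where indexing a bytes object already yields an int, the same idea might be sketched as a small function. The name truncate_utf8 and the max_bytes parameter are made up for illustration; this is one possible adaptation of the approach above, not a tested drop-in replacement.

```python
def truncate_utf8(text, max_bytes=255):
    """Return the longest prefix of text whose UTF-8 encoding fits in max_bytes.

    Hypothetical helper: the name and the max_bytes parameter are illustrative.
    """
    encoded = text.encode("utf-8")
    if len(encoded) <= max_bytes:
        return text

    cut = max_bytes
    # encoded[cut] is the first byte that would be dropped. If it is a
    # continuation byte (0b10xxxxxx), cutting here would split a character,
    # so back up until the cut lands on that character's starting byte.
    while cut > 0 and (encoded[cut] & 0xC0) == 0x80:
        cut -= 1
    return encoded[:cut].decode("utf-8")


if __name__ == "__main__":
    # 254 ASCII bytes followed by a 2-byte character: the character is dropped whole.
    sample = "a" * 254 + "\u0434"
    print(len(truncate_utf8(sample).encode("utf-8")))
```

For well-formed input, s.encode('utf-8')[:255].decode('utf-8', 'ignore') should give the same result, since the only invalid data is the incomplete sequence at the very end; the explicit loop just makes the reasoning visible.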
Edit: Check out the answers in the question identified as a duplicate, too.