How can I use the textwrap module to break lines before they reach a certain number of bytes (without splitting a multibyte character)?
I would like something like this:
>>> textwrap.wrap('☺ ☺☺ ☺☺ ☺ ☺ ☺☺ ☺☺', bytewidth=10)
☺ ☺☺
☺☺ ☺
☺ ☺☺
☺☺
The result depends on the encoding used, because the number of bytes per
character is a function of the encoding, and in many encodings, of the
character as well. I'll assume we're using UTF-8, in which '☺' is
encoded as e298ba and is three bytes long; the given example is
consistent with that assumption.
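As a quick check of that assumption, here is a minimal sketch that prints the encoded bytes of '☺' under UTF-8 and, for comparison, UTF-16 (little-endian). The variable names are only illustrative.

```python
# Byte length of the same character under different encodings.
smiley = '☺'  # U+263A

utf8 = smiley.encode('utf-8')
print(utf8.hex(), len(utf8))      # e298ba 3

utf16 = smiley.encode('utf-16-le')
print(utf16.hex(), len(utf16))    # 3a26 2

# The first desired output line, '☺ ☺☺', is 3 + 1 + 6 bytes in UTF-8.
first_line_bytes = len('☺ ☺☺'.encode('utf-8'))
print(first_line_bytes)           # 10
```

So with UTF-8 the first line of the desired output exactly fills the 10-byte budget, which matches the example in the question.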
Everything in textwrap works on characters; it doesn't know anything
about encodings. One way around this is to convert the input string to
another format, with each character becoming a string of characters
whose length is proportional to its byte length. I will use three
characters per byte: two for the byte in hex, plus one to control line
breaking. Thus:
    'a' -> '61x'         non-breaking
    ' ' -> '20 '         breaking
    '☺' -> 'e2x98xbax'   non-breaking
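The mapping above can be sketched as a small helper. The name `encode_for_wrap` is hypothetical, and the sketch assumes an encoding such as UTF-8, in which the byte 0x20 never appears inside a multibyte character.

```python
def encode_for_wrap(s, encoding='utf-8'):
    """Map each byte to two hex digits plus one marker character.

    The marker is ' ' after an encoded space (a legal break point)
    and 'x' after every other byte (no break allowed there).
    """
    return ''.join(
        '{:02x}{}'.format(b, ' ' if b == 0x20 else 'x')
        for b in s.encode(encoding)
    )

print(repr(encode_for_wrap('a')))   # '61x'
print(repr(encode_for_wrap(' ')))   # '20 '
print(repr(encode_for_wrap('☺')))   # 'e2x98xbax'
```

Every byte becomes exactly three characters, so a character-based width of 3 * n corresponds to n bytes of the original text.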
For simplicity I'll assume we only break on spaces, not tabs or any other character.
import textwrap

def wrapbytes(s, bytewidth, encoding='utf-8', show_work=False):
    byts = s.encode(encoding)
    # Each byte becomes two hex digits plus a marker: ' ' after a space
    # byte (a break point), 'x' after any other byte (no break).
    encoded = ''.join('{:02x}{}'.format(b, ' ' if b in b' ' else 'x')
                      for b in byts)
    if show_work:
        print('encoded = {}\n'.format(encoded))
    # Three characters per byte, plus two for the '20' of a trailing
    # encoded space, which is removed again below.
    ewidth = bytewidth * 3 + 2
    elist = textwrap.wrap(encoded, width=ewidth)
    if show_work:
        print('elist = {}\n'.format(elist))
    # Remove trailing encoded spaces.
    elist = [s[:-2] if s[-2:] == '20' else s for s in elist]
    if show_work:
        print('elist = {}\n'.format(elist))
    # Decode. Method 1: inefficient and lengthy, but readable.
    bl1 = []
    for s in elist:
        bstr = "b'"
        for i in range(0, len(s), 3):
            hexchars = s[i:i+2]
            b = r'\x' + hexchars
            bstr += b
        bstr += "'"
        bl1.append(eval(bstr))
    # Method 2: equivalent, efficient, terse, hard to read.
    bl2 = [eval("b'{}'".format(''.join(r'\x{}'.format(s[i:i+2])
                                       for i in range(0, len(s), 3))))
           for s in elist]
    assert bl1 == bl2
    if show_work:
        print('bl1 = {}\n'.format(bl1))
    dlist = [b.decode(encoding) for b in bl1]
    if show_work:
        print('dlist = {}\n'.format(dlist))
    return dlist

result = wrapbytes('☺ ☺☺ ☺☺ ☺ ☺ ☺☺ ☺☺', bytewidth=10, show_work=True)
print('\n'.join(result))
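As a variation on the same idea, the decoding step can be done with `bytes.fromhex` instead of `eval`. The sketch below is not the code above but a compact alternative under the same assumptions (UTF-8, breaking only on spaces); the name `wrap_bytes_fromhex` is hypothetical. It also passes `break_long_words=False` to `textwrap.wrap`, a deliberate deviation: a word longer than the byte budget is kept whole on its own line rather than being cut at an arbitrary position inside the hex encoding.

```python
import textwrap

def wrap_bytes_fromhex(s, bytewidth, encoding='utf-8'):
    data = s.encode(encoding)
    # Two hex digits per byte, then ' ' (break point) or 'x' (no break).
    encoded = ''.join('{:02x}{}'.format(b, ' ' if b == 0x20 else 'x')
                      for b in data)
    lines = textwrap.wrap(encoded, width=bytewidth * 3 + 2,
                          break_long_words=False)
    out = []
    for line in lines:
        # Drop the encoded space left at the end of a wrapped line.
        if line.endswith('20'):
            line = line[:-2]
        # Keep the hex digit pairs, skipping every third (marker) character.
        hexdigits = ''.join(line[i:i+2] for i in range(0, len(line), 3))
        out.append(bytes.fromhex(hexdigits).decode(encoding))
    return out

wrapped = wrap_bytes_fromhex('☺ ☺☺ ☺☺ ☺ ☺ ☺☺ ☺☺', bytewidth=10)
print('\n'.join(wrapped))
# Byte length of each output line.
print([len(line.encode('utf-8')) for line in wrapped])
```

For the example from the question this gives the four lines asked for, each at most 10 bytes long in UTF-8.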