
Using textwrap.wrap with a byte count

How can I use the textwrap module to break lines before they reach a certain number of bytes (without splitting a multi-byte character)?

I would like something like this:

>>> textwrap.wrap('☺ ☺☺ ☺☺ ☺ ☺ ☺☺ ☺☺', bytewidth=10)
☺ ☺☺
☺☺ ☺
☺ ☺☺
☺☺
Valentin Lorentz asked Oct 31 '22 06:10


1 Answer

The result depends on the encoding used, because the number of bytes per character is a function of the encoding, and in many encodings, of the character as well. I'll assume we're using UTF-8, in which '☺' is encoded as e298ba and is three bytes long; the given example is consistent with that assumption.
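As a quick check of that assumption, the following sketch prints the byte length of '☺' under UTF-8 and, for contrast, under UTF-16 (little-endian, no BOM). The variable names are only illustrative.

```python
smiley = '☺'

# UTF-8 uses three bytes for this character.
utf8 = smiley.encode('utf-8')
print(utf8.hex())   # e298ba
print(len(utf8))    # 3

# The same character takes two bytes in UTF-16 (little-endian, no BOM),
# so a byte-based width only makes sense relative to one encoding.
utf16 = smiley.encode('utf-16-le')
print(len(utf16))   # 2

# ASCII characters take a single byte in UTF-8.
print(len('a'.encode('utf-8')))  # 1
```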

Everything in textwrap works on characters; it doesn't know anything about encodings. One way around this is to convert the input string to another format in which each character becomes a run of characters whose length is proportional to its byte length. I will use three characters per byte: two for the byte in hex, plus one to control line breaking. Thus:

'a' -> '61x'         non-breaking
' ' -> '20 '         breaking
'☺' -> 'e2x98xbax'   non-breaking

For simplicity I'll assume we only break on spaces, not tabs or any other character.
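The mapping can be tried in isolation before wiring it into textwrap. This is a minimal sketch; `encode_for_wrap` is a hypothetical helper name chosen for illustration, and it assumes an ASCII-compatible encoding such as UTF-8, where a space is the single byte 0x20.

```python
def encode_for_wrap(s, encoding='utf-8'):
    """Turn each byte into two hex digits plus a one-character marker.

    The marker is ' ' after an encoded space (a place where textwrap may
    break) and 'x' after any other byte (non-breaking).
    """
    return ''.join('{:02x}{}'.format(b, ' ' if b == 0x20 else 'x')
                   for b in s.encode(encoding))

print(repr(encode_for_wrap('a')))   # '61x'
print(repr(encode_for_wrap(' ')))   # '20 '
print(repr(encode_for_wrap('☺')))   # 'e2x98xbax'
```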

import textwrap

def wrapbytes(s, bytewidth, encoding='utf-8', show_work=False):
    byts = s.encode(encoding)
    # Each byte becomes two hex digits plus a marker: ' ' after an encoded
    # space (a legal break point), 'x' after any other byte (non-breaking).
    encoded = ''.join('{:02x}{}'.format(b, ' ' if b == 0x20 else 'x')
                      for b in byts)
    if show_work:
        print('encoded = {}\n'.format(encoded))
    # A line of n bytes needs at most 3*n + 2 encoded characters: 3 per byte,
    # plus the '20' of the trailing space that stays attached to the line.
    ewidth = bytewidth * 3 + 2
    # break_long_words=False: cutting an over-long word at an arbitrary
    # position could split a hex pair or a multi-byte character. Such a
    # word is left whole on its own line, so it may exceed bytewidth.
    elist = textwrap.wrap(encoded, width=ewidth, break_long_words=False)
    if show_work:
        print('elist = {}\n'.format(elist))
    # Remove trailing encoded spaces.
    elist = [line[:-2] if line.endswith('20') else line for line in elist]
    if show_work:
        print('elist = {}\n'.format(elist))
    # Decode: keep the two hex digits of every three-character group and
    # convert them back to bytes (no eval needed).
    blist = [bytes.fromhex(''.join(line[i:i+2]
                                   for i in range(0, len(line), 3)))
             for line in elist]
    if show_work:
        print('blist = {}\n'.format(blist))
    dlist = [b.decode(encoding) for b in blist]
    if show_work:
        print('dlist = {}\n'.format(dlist))
    return dlist

result = wrapbytes('☺ ☺☺ ☺☺ ☺ ☺ ☺☺ ☺☺', bytewidth=10, show_work=True)
print('\n'.join(result))
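For cross-checking the output, a plain greedy wrapper that measures encoded length directly can be handy. This is a different technique from the textwrap-based one above, not a replacement for it: `wrap_by_bytes` is a hypothetical helper, it breaks only on spaces, collapses runs of spaces, and leaves a word longer than the limit whole on its own line.

```python
def wrap_by_bytes(text, bytewidth, encoding='utf-8'):
    """Greedily pack space-separated words into lines of at most
    bytewidth encoded bytes. Over-long single words are not split."""
    lines = []
    current = ''
    for word in text.split(' '):
        if not word:
            continue  # skip empty pieces produced by repeated spaces
        candidate = word if not current else current + ' ' + word
        if current and len(candidate.encode(encoding)) > bytewidth:
            # The word does not fit: close the current line, start a new one.
            lines.append(current)
            current = word
        else:
            current = candidate
    if current:
        lines.append(current)
    return lines

lines = wrap_by_bytes('☺ ☺☺ ☺☺ ☺ ☺ ☺☺ ☺☺', bytewidth=10)
print('\n'.join(lines))
# Every line should fit within the byte budget.
print(all(len(line.encode('utf-8')) <= 10 for line in lines))
```

On the question's input this should yield the same four lines as the expected output shown there.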
Tom Zych answered Nov 10 '22 15:11