Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Using textwrap.dedent() with bytes in Python 3

When I use a triple-quoted multiline string in Python, I tend to use textwrap.dedent to keep the code readable, with good indentation:

some_string = textwrap.dedent("""
    First line
    Second line
    ...
    """).strip()

However, in Python 3.x, textwrap.dedent doesn't seem to work with byte strings. I encountered this while writing a unit test for a method that returned a long multiline byte string, for example:

# The function to be tested

def some_function():
    return b'Lorem ipsum dolor sit amet\n  consectetuer adipiscing elit'

# Unit test

import unittest
import textwrap

class SomeTest(unittest.TestCase):
    def test_some_function(self):
        self.assertEqual(some_function(), textwrap.dedent(b"""
            Lorem ipsum dolor sit amet
              consectetuer adipiscing elit
            """).strip())

if __name__ == '__main__':
    unittest.main()

In Python 2.7.10 the above code works fine, but in Python 3.4.3 it fails:

E
======================================================================
ERROR: test_some_function (__main__.SomeTest)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "test.py", line 16, in test_some_function
    """).strip())
  File "/usr/lib64/python3.4/textwrap.py", line 416, in dedent
    text = _whitespace_only_re.sub('', text)
TypeError: can't use a string pattern on a bytes-like object

----------------------------------------------------------------------
Ran 1 test in 0.001s

FAILED (errors=1)

So: Is there an alternative to textwrap.dedent that works with byte strings?

  • I could write such a function myself, but if there is an existing function, I'd prefer to use it.
  • I could convert to unicode, use textwrap.dedent, and convert back to bytes. But this is only viable if the byte string conforms to some Unicode encoding.
like image 741
nomadictype Avatar asked Oct 02 '16 22:10

nomadictype


People also ask

How do you use the Textwrap function in Python?

wrap(text, width=70, **kwargs): This function wraps the input paragraph such that each line in the paragraph is at most width characters long. The wrap method returns a list of output lines. The returned list is empty if the wrapped output has no content. Default width is taken as 70.

What is Textwrap Dedent?

textwrap. dedent(text) Remove any common leading whitespace from every line in text. This can be used to make triple-quoted strings line up with the left edge of the display, while still presenting them in the source code in indented form.

How do you use Textwrap?

Wrap text around a picture or drawing objectSelect the picture or object. Select Format and then under Arrange, select Wrap Text. Choose the wrapping option that you want to apply.


2 Answers

Answer 2: textwrap is primarily about the Textwrap class and functions. dedent is listed under

# -- Loosely related functionality --------------------

As near as I can tell, the only things that makes it text (unicode str) specific are the re literals. I prefixed all 6 with b and voila! (I did not edit anything else, but the function docstring should be adjusted.)

import re

_whitespace_only_re = re.compile(b'^[ \t]+$', re.MULTILINE)
_leading_whitespace_re = re.compile(b'(^[ \t]*)(?:[^ \t\n])', re.MULTILINE)

def dedent_bytes(text):
    """Remove any common leading whitespace from every line in `text`.

    This can be used to make triple-quoted strings line up with the left
    edge of the display, while still presenting them in the source code
    in indented form.

    Note that tabs and spaces are both treated as whitespace, but they
    are not equal: the lines "  hello" and "\\thello" are
    considered to have no common leading whitespace.  (This behaviour is
    new in Python 2.5; older versions of this module incorrectly
    expanded tabs before searching for common leading whitespace.)
    """
    # Look for the longest leading string of spaces and tabs common to
    # all lines.
    margin = None
    text = _whitespace_only_re.sub(b'', text)
    indents = _leading_whitespace_re.findall(text)
    for indent in indents:
        if margin is None:
            margin = indent

        # Current line more deeply indented than previous winner:
        # no change (previous winner is still on top).
        elif indent.startswith(margin):
            pass

        # Current line consistent with and no deeper than previous winner:
        # it's the new winner.
        elif margin.startswith(indent):
            margin = indent

        # Find the largest common whitespace between current line
        # and previous winner.
        else:
            for i, (x, y) in enumerate(zip(margin, indent)):
                if x != y:
                    margin = margin[:i]
                    break
            else:
                margin = margin[:len(indent)]

    # sanity check (testing/debugging only)
    if 0 and margin:
        for line in text.split(b"\n"):
            assert not line or line.startswith(margin), \
                   "line = %r, margin = %r" % (line, margin)

    if margin:
        text = re.sub(rb'(?m)^' + margin, b'', text)
    return text

print(dedent_bytes(b"""
            Lorem ipsum dolor sit amet
              consectetuer adipiscing elit
            """)
      )

# prints
b'\nLorem ipsum dolor sit amet\n  consectetuer adipiscing elit\n'
like image 135
Terry Jan Reedy Avatar answered Oct 14 '22 07:10

Terry Jan Reedy


It seems like dedent does not support bytestrings, sadly. However, if you want cross-compatible code, I recommend you take advantage of the six library:

import sys, unittest
from textwrap import dedent

import six


def some_function():
    return b'Lorem ipsum dolor sit amet\n  consectetuer adipiscing elit'


class SomeTest(unittest.TestCase):
    def test_some_function(self):
        actual = some_function()

        expected = six.b(dedent("""
            Lorem ipsum dolor sit amet
              consectetuer adipiscing elit
            """)).strip()

        self.assertEqual(actual, expected)

if __name__ == '__main__':
    unittest.main()

This is similar to your bullet point suggestion in the question

I could convert to unicode, use textwrap.dedent, and convert back to bytes. But this is only viable if the byte string conforms to some Unicode encoding.

But you're misunderstanding something about encodings here - if you can write the string literal in your test like that in the first place, and have the file successfully parsed by python (i.e. the correct coding declaration is on the module), then there is no "convert to unicode" step here. The file gets parsed in the encoding specified (or sys.defaultencoding, if you didn't specify) and then when the string is a python variable it is already decoded.

like image 3
wim Avatar answered Oct 14 '22 09:10

wim