When I use a triple-quoted multiline string in Python, I tend to use textwrap.dedent to keep the code readable, with good indentation: <pre class="prettyprint"><code>some_string = textwrap.dedent(""" First line Second line ... """).strip() </code></pre> However, in Python 3.x, textwrap.dedent doesn't seem to work with byte strings. I encountered this while writing a unit test for a method that returned a long multiline byte string, for example: <pre class="prettyprint"><code># The function to be tested def some_function(): return b'Lorem ipsum dolor sit amet\n consectetuer adipiscing elit' # Unit test import unittest import textwrap class SomeTest(unittest.TestCase): def test_some_function(self): self.assertEqual(some_function(), textwrap.dedent(b""" Lorem ipsum dolor sit amet consectetuer adipiscing elit """).strip()) if __name__ == '__main__': unittest.main() </code></pre> In Python 2.7.10 the above code works fine, but in Python 3.4.3 it fails: <pre class="prettyprint lang-none prettyprint-override"><code>E ====================================================================== ERROR: test_some_function (__main__.SomeTest) ---------------------------------------------------------------------- Traceback (most recent call last): File "test.py", line 16, in test_some_function """).strip()) File "/usr/lib64/python3.4/textwrap.py", line 416, in dedent text = _whitespace_only_re.sub('', text) TypeError: can't use a string pattern on a bytes-like object ---------------------------------------------------------------------- Ran 1 test in 0.001s FAILED (errors=1) </code></pre> So: Is there an alternative to textwrap.dedent that works with byte strings? <ul> <li>I could write such a function myself, but if there is an existing function, I'd prefer to use it.</li> <li>I could convert to unicode, use textwrap.dedent, and convert back to bytes. But this is only viable if the byte string conforms to some Unicode encoding.</li> </ul>

It seems like <code>dedent</code> does not support bytestrings, sadly. However, if you want cross-compatible code, I recommend you take advantage of the <code>six</code> library: <pre class="prettyprint"><code>import sys, unittest from textwrap import dedent import six def some_function(): return b'Lorem ipsum dolor sit amet\n consectetuer adipiscing elit' class SomeTest(unittest.TestCase): def test_some_function(self): actual = some_function() expected = six.b(dedent(""" Lorem ipsum dolor sit amet consectetuer adipiscing elit """)).strip() self.assertEqual(actual, expected) if __name__ == '__main__': unittest.main() </code></pre> This is similar to your bullet point suggestion in the question <blockquote> I could convert to unicode, use textwrap.dedent, and convert back to bytes. But this is only viable if the byte string conforms to some Unicode encoding. </blockquote> But you're misunderstanding something about encodings here - if you can write the string literal in your test like that in the first place, and have the file successfully parsed by python (i.e. the correct coding declaration is on the module), then there is no "convert to unicode" step here. The file gets parsed in the encoding specified (or <code>sys.defaultencoding</code>, if you didn't specify) and then when the string is a python variable it is already decoded.

Using textwrap.dedent() with bytes in Python 3

Tags:

python

python-3.x

indentation

python-unicode

literals

When I use a triple-quoted multiline string in Python, I tend to use textwrap.dedent to keep the code readable, with good indentation:

some_string = textwrap.dedent("""
    First line
    Second line
    ...
    """).strip()

However, in Python 3.x, textwrap.dedent doesn't seem to work with byte strings. I encountered this while writing a unit test for a method that returned a long multiline byte string, for example:

# The function to be tested

def some_function():
    return b'Lorem ipsum dolor sit amet\n  consectetuer adipiscing elit'

# Unit test

import unittest
import textwrap

class SomeTest(unittest.TestCase):
    def test_some_function(self):
        self.assertEqual(some_function(), textwrap.dedent(b"""
            Lorem ipsum dolor sit amet
              consectetuer adipiscing elit
            """).strip())

if __name__ == '__main__':
    unittest.main()

In Python 2.7.10 the above code works fine, but in Python 3.4.3 it fails:

E
======================================================================
ERROR: test_some_function (__main__.SomeTest)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "test.py", line 16, in test_some_function
    """).strip())
  File "/usr/lib64/python3.4/textwrap.py", line 416, in dedent
    text = _whitespace_only_re.sub('', text)
TypeError: can't use a string pattern on a bytes-like object

----------------------------------------------------------------------
Ran 1 test in 0.001s

FAILED (errors=1)

So: Is there an alternative to textwrap.dedent that works with byte strings?

I could write such a function myself, but if there is an existing function, I'd prefer to use it.
I could convert to unicode, use textwrap.dedent, and convert back to bytes. But this is only viable if the byte string conforms to some Unicode encoding.

741

asked Oct 02 '16 22:10

nomadictype

2 Answers

Answer 2: textwrap is primarily about the Textwrap class and functions. dedent is listed under

# -- Loosely related functionality --------------------

As near as I can tell, the only things that makes it text (unicode str) specific are the re literals. I prefixed all 6 with b and voila! (I did not edit anything else, but the function docstring should be adjusted.)

import re

_whitespace_only_re = re.compile(b'^[ \t]+$', re.MULTILINE)
_leading_whitespace_re = re.compile(b'(^[ \t]*)(?:[^ \t\n])', re.MULTILINE)

def dedent_bytes(text):
    """Remove any common leading whitespace from every line in `text`.

    This can be used to make triple-quoted strings line up with the left
    edge of the display, while still presenting them in the source code
    in indented form.

    Note that tabs and spaces are both treated as whitespace, but they
    are not equal: the lines "  hello" and "\\thello" are
    considered to have no common leading whitespace.  (This behaviour is
    new in Python 2.5; older versions of this module incorrectly
    expanded tabs before searching for common leading whitespace.)
    """
    # Look for the longest leading string of spaces and tabs common to
    # all lines.
    margin = None
    text = _whitespace_only_re.sub(b'', text)
    indents = _leading_whitespace_re.findall(text)
    for indent in indents:
        if margin is None:
            margin = indent

        # Current line more deeply indented than previous winner:
        # no change (previous winner is still on top).
        elif indent.startswith(margin):
            pass

        # Current line consistent with and no deeper than previous winner:
        # it's the new winner.
        elif margin.startswith(indent):
            margin = indent

        # Find the largest common whitespace between current line
        # and previous winner.
        else:
            for i, (x, y) in enumerate(zip(margin, indent)):
                if x != y:
                    margin = margin[:i]
                    break
            else:
                margin = margin[:len(indent)]

    # sanity check (testing/debugging only)
    if 0 and margin:
        for line in text.split(b"\n"):
            assert not line or line.startswith(margin), \
                   "line = %r, margin = %r" % (line, margin)

    if margin:
        text = re.sub(rb'(?m)^' + margin, b'', text)
    return text

print(dedent_bytes(b"""
            Lorem ipsum dolor sit amet
              consectetuer adipiscing elit
            """)
      )

# prints
b'\nLorem ipsum dolor sit amet\n  consectetuer adipiscing elit\n'

135

answered Oct 14 '22 07:10

Terry Jan Reedy

It seems like dedent does not support bytestrings, sadly. However, if you want cross-compatible code, I recommend you take advantage of the six library:

import sys, unittest
from textwrap import dedent

import six


def some_function():
    return b'Lorem ipsum dolor sit amet\n  consectetuer adipiscing elit'


class SomeTest(unittest.TestCase):
    def test_some_function(self):
        actual = some_function()

        expected = six.b(dedent("""
            Lorem ipsum dolor sit amet
              consectetuer adipiscing elit
            """)).strip()

        self.assertEqual(actual, expected)

if __name__ == '__main__':
    unittest.main()

This is similar to your bullet point suggestion in the question

I could convert to unicode, use textwrap.dedent, and convert back to bytes. But this is only viable if the byte string conforms to some Unicode encoding.

But you're misunderstanding something about encodings here - if you can write the string literal in your test like that in the first place, and have the file successfully parsed by python (i.e. the correct coding declaration is on the module), then there is no "convert to unicode" step here. The file gets parsed in the encoding specified (or sys.defaultencoding, if you didn't specify) and then when the string is a python variable it is already decoded.

answered Oct 14 '22 09:10

wim

Related questions
                            
                                How do I trim a .fits image and keep world coordinates for plotting in astropy Python?
                            
                                Scikit Learn - Extract word tokens from a string delimiter using CountVectorizer
                            
                                Python multiprocessing/threading takes longer than single processing on a virtual machine
                            
                                tf.contrib.layers.embedding_column from tensor flow
                            
                                python pandas get index boundaries from a series of Booleans
                            
                                Clarification about Python imports
                            
                                Patching a parent class
                            
                                Download files from an FTP server containing given string using Python
                            
                                How can I set the colors per value when coloring plots by a DataFrame column?
                            
                                What Should the Structure of virtualenv Environment Look Like
                            
                                NLTK - nltk.tokenize.RegexpTokenizer - regex not working as expected
                            
                                django attribute error : object has no attribute 'get_bound_field'
                            
                                Django - Why are variables declared in Model Classes Static
                            
                                Best way to do file transfer via SCP using python and a .pem file [duplicate]
                            
                                Cumulative Set in PANDAS
                            
                                scipy.sparse matrix: subtract row mean to nonzero elements
                            
                                How to split up a string on multiple delimiters but only capture some?
                            
                                Python - Stacking two histograms with a scatter plot
                            
                                Calculating the Manhattan distance in the eight puzzle
                            
                                Getting substring based on another column in a pandas dataframe

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With