Why is the output of print in python2 and python3 different with the same string?

Tags: python, unicode, utf-8

In Python 2:

$ python2 -c 'print "\x08\x04\x87\x18"' | hexdump -C
00000000  08 04 87 18 0a                                    |.....|
00000005

In Python 3:

$ python3 -c 'print("\x08\x04\x87\x18")' | hexdump -C
00000000  08 04 c2 87 18 0a                                 |......|
00000006

Why does it have the byte "\xc2" here?

Edit:

I think that when the string has a non-ASCII character, Python 3 adds the byte "\xc2" to the string (as @Ashraful Islam said).

So how can I avoid this in python3?

asked Mar 19 '17 07:03

lzutao

2 Answers

Consider the following snippet of code:

import sys
for i in range(128, 256):
    sys.stdout.write(chr(i))

Run this with Python 2 and look at the result with hexdump -C:

00000000  80 81 82 83 84 85 86 87  88 89 8a 8b 8c 8d 8e 8f  |................|

Et cetera. No surprises; 128 bytes from 0x80 to 0xff.

Do the same with Python 3:

00000000  c2 80 c2 81 c2 82 c2 83  c2 84 c2 85 c2 86 c2 87  |................|
...
00000070  c2 b8 c2 b9 c2 ba c2 bb  c2 bc c2 bd c2 be c2 bf  |................|
00000080  c3 80 c3 81 c3 82 c3 83  c3 84 c3 85 c3 86 c3 87  |................|
...
000000f0  c3 b8 c3 b9 c3 ba c3 bb  c3 bc c3 bd c3 be c3 bf  |................|

To summarize:

Everything from 0x80 to 0xbf has 0xc2 prepended.
Everything from 0xc0 to 0xff has bit 6 set to zero and has 0xc3 prepended.

So, what’s going on here?

In Python 2, strings are byte strings and no conversion is done. Tell it to write something outside the 0-127 ASCII range, it says “okey-doke!” and just writes those bytes. Simple.

In Python 3, strings are Unicode. When non-ASCII characters are written, they must be encoded in some way. The encoding used for stdout depends on the environment; here, as on most systems, it is UTF-8.
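To see the encoding step in isolation, here is a minimal Python 3 sketch. It assumes UTF-8 and encodes the string from the question explicitly instead of relying on whatever stdout is configured to use; the variable names are only illustrative.

```python
# The string from the question: four code points, one of them (0x87) non-ASCII.
text = "\x08\x04\x87\x18"

# Encode explicitly as UTF-8 (assumed here; stdout may use another encoding).
encoded = text.encode("utf-8")

print(len(text), len(encoded))                      # 4 5
print(" ".join(format(b, "02x") for b in encoded))  # 08 04 c2 87 18
```

The extra byte in the encoded form is the c2 that shows up in the hexdump.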

So, how are these values encoded in UTF-8?

Code points from 0x80 to 0x7ff are encoded as follows:

110vvvvv 10vvvvvv

Where the 11 v characters are the bits of the code point.

Thus:

0x80                 hex
1000 0000            8-bit binary
000 1000 0000        11-bit binary
00010 000000         divide into vvvvv vvvvvv
11000010 10000000    resulting UTF-8 octets in binary
0xc2 0x80            resulting UTF-8 octets in hex

0xc0                 hex
1100 0000            8-bit binary
000 1100 0000        11-bit binary
00011 000000         divide into vvvvv vvvvvv
11000011 10000000    resulting UTF-8 octets in binary
0xc3 0x80            resulting UTF-8 octets in hex

So that’s why you’re getting a c2 before 87.
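The bit layout above can be checked with a short Python 3 sketch. The helper name encode_two_byte_utf8 is made up for this illustration; it only handles the two-byte range and is compared against the built-in str.encode.

```python
def encode_two_byte_utf8(code_point):
    """Encode a code point in the range 0x80..0x7ff as two UTF-8 octets."""
    if not 0x80 <= code_point <= 0x7FF:
        raise ValueError("only code points 0x80..0x7ff use the two-byte form")
    first = 0b11000000 | (code_point >> 6)         # 110vvvvv: top 5 bits
    second = 0b10000000 | (code_point & 0b111111)  # 10vvvvvv: low 6 bits
    return bytes([first, second])


for cp in (0x80, 0x87, 0xC0):
    manual = encode_two_byte_utf8(cp)
    builtin = chr(cp).encode("utf-8")
    print(hex(cp), manual.hex(), builtin.hex())
# 0x80 c280 c280
# 0x87 c287 c287
# 0xc0 c380 c380
```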

How to avoid all this in Python 3? Use the bytes type.

answered Oct 16 '22 04:10

Tom Zych

Python 2's default string type is byte strings. Byte strings are written "abc" while Unicode strings are written u"abc".

Python 3's default string type is Unicode strings. Byte strings are written as b"abc" while Unicode strings are written "abc" (u"abc" still works, too). Since there are over a million possible Unicode code points, printing them as bytes requires an encoding (UTF-8 in your case), which uses multiple bytes for every code point outside the ASCII range.

First use a byte string in Python 3 to get the same Python 2 type. Then, because Python 3's print expects Unicode strings, use sys.stdout.buffer.write to write to the raw stdout interface, which expects byte strings.

python3 -c 'import sys; sys.stdout.buffer.write(b"\x08\x04\x87\x18")'

Note that if writing to a file, there are similar issues. For no encoding translation, open files in binary mode 'wb' and write byte strings.
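A rough Python 3 sketch of the file case follows. The file names binary.out and text.out are hypothetical, and UTF-8 is passed explicitly for the text-mode file so the result does not depend on the system default.

```python
import os
import tempfile

data = b"\x08\x04\x87\x18"

with tempfile.TemporaryDirectory() as tmp:
    binary_path = os.path.join(tmp, "binary.out")  # hypothetical file name
    text_path = os.path.join(tmp, "text.out")      # hypothetical file name

    # Binary mode: the byte string is written exactly as given.
    with open(binary_path, "wb") as f:
        f.write(data)

    # Text mode: the Unicode string is encoded on the way out.
    with open(text_path, "w", encoding="utf-8") as f:
        f.write("\x08\x04\x87\x18")

    # Read both files back as raw bytes to compare them.
    with open(binary_path, "rb") as f:
        binary_contents = f.read()
    with open(text_path, "rb") as f:
        text_contents = f.read()

print(binary_contents)  # b'\x08\x04\x87\x18'
print(text_contents)    # b'\x08\x04\xc2\x87\x18'
```

Only the binary-mode file matches the Python 2 output byte for byte.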

answered Oct 16 '22 06:10

Mark Tolonen
