
Python: data vs. text?

Tags:

python

python-3.x

unicode

Guido van Rossum's presentation about Python 3000 mentions several things that should eventually make the transition from Python 2 to Python 3 easier. He is specifically talking about text handling, since the move to Unicode as the only representation of strings in Python 3 is one of the major changes.

As far as text handling goes, one slide (#14) says:

- In 2.6:
  - Use bytes and b'…' for all data (knowing these are just aliases for str and '…')
  - Use unicode and u'...' for all text
- In 2.5:
  - '...' for data, u'...' for text

I am using Python 2.6.4. What exactly does this mean for me?

In Python's world, what is the difference between data and text?


asked Nov 15 '09 01:11

cschol

1 Answer

In a nutshell, the way text and data are handled in Py3k is arguably the most "breaking" change in the language. By knowing and avoiding, when possible, the situations where Python 2.6 logic will work differently than in 3.x, we can ease the migration when it happens. Still, we should expect that some parts of the 2.6 logic will require special attention and modification, for example to deal with distinct encodings.

The idea behind the BDFL's suggestion on slide 14 is probably to start "using" the same types that Py3k supports (and only these), namely unicode strings for text (str type) and 8-bit byte sequences for "data" (bytes type).

The term "using" in the previous sentence is used rather loosely, since the semantics and the associated storage/encoding of these types differ between the 2.6 and 3.x versions. In Python 2.6, the bytes type and the associated literal syntax (b'xyz') simply map to the str type. Therefore:

# in Py2.6
>>> 'mykey' == b'mykey'
True
>>> b'mykey'.__class__
<type 'str'>

# in Py3k
>>> 'mykey' == b'mykey'
False
>>> b'mykey'.__class__
<class 'bytes'>

To answer your question [in the remarks below]: in 2.6, whether you use b'xyz' or 'xyz', Python treats them as one and the same thing: a str. What is important is that you think of these as two [potentially/in-the-future] distinct types, each with a distinct purpose:

- str for text-like info, and
- bytes for sequences of octets storing whatever data is at hand.

For example, staying close to your example/question: in Py3k you can have a dictionary with two entries whose keys look alike, one keyed by b'mykey' and the other by 'mykey'. Under 2.6 this is not possible, since the two keys are really the same. What matters is that you are aware of this kind of thing and avoid (or explicitly mark in the code) the situations where 2.6 code will not work in 3.x.
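The dictionary case is easy to check for yourself. Here is a minimal sketch, meant to be run under Python 3; the key name comes from the discussion above, while the variable names and stored values are made up for illustration:

```python
# Python 3: a text key and a bytes key are different objects that never
# compare equal, so both entries are kept.
# Under Python 2.6, b'mykey' and 'mykey' are the same str value, and the
# second assignment would simply overwrite the first.
settings = {}
settings['mykey'] = 'stored under a text key'
settings[b'mykey'] = 'stored under a bytes key'

number_of_entries = len(settings)
keys_compare_equal = ('mykey' == b'mykey')

print(number_of_entries)
print(keys_compare_equal)
```

Code that relies on the two spellings being interchangeable as keys is exactly the kind of spot worth marking before a migration.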

In Py3k, str is an abstract unicode string, a sequence of unicode code points (characters), and Python takes care of converting it to/from its encoded form, whatever the encoding might be (as a programmer you do have a say about the encoding, but while you are dealing with string operations and the like you do not need to worry about these details). In contrast, bytes is a sequence of 8-bit "things" whose semantics and encoding are left entirely to the programmer.
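To make the str/bytes distinction concrete, here is a small sketch for Python 3. The sample word and the choice of UTF-8 and Latin-1 are my own assumptions for illustration, not something from the slides:

```python
# Text: four code points, with no encoding attached.
text = 'caf\u00e9'

# Data: the same word encoded as UTF-8 takes five bytes,
# because the accented letter needs two of them.
data = text.encode('utf-8')

text_length = len(text)
data_length = len(data)

# Decoding with the right codec gives the text back...
same_text = data.decode('utf-8')

# ...while decoding the same bytes with another codec "works"
# but yields different text. The bytes alone do not say which is right.
other_text = data.decode('latin-1')

print(text_length, data_length)
print(same_text == text, other_text == text)
```

The bytes object carries no memory of how it was produced, which is why the meaning of "data" is left to the programmer.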

So, even though Python 2.6 doesn't see a difference between bytes and str, by explicitly using bytes() / b'...' for data and unicode() / u'...' for text, you...

- ... prepare yourself and your program for the upcoming types and semantics of Py3k
- ... make automatic conversion of the source code (with the 2to3 tool or another) easier, whereby the b in b'...' will remain and the u in u'...' will be removed (since the only string type will be unicode).
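One way to put this discipline into practice is to decode at the edges of the program and keep text and data apart everywhere else. The sketch below is only an illustration: `to_text` and `to_bytes` are hypothetical helper names, and the UTF-8 default is an assumption you would replace with whatever encoding your data really uses. It should behave the same way under 2.6 and 3.x, because it only relies on the bytes name and the b'...' literal:

```python
def to_text(value, encoding='utf-8'):
    """Return value as text, decoding it first if it is a byte string."""
    # Hypothetical helper. On 2.6 bytes is an alias for str, so plain
    # '...' literals are decoded as well.
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value


def to_bytes(value, encoding='utf-8'):
    """Return value as bytes, encoding it first if it is text."""
    # Hypothetical helper, the inverse of to_text.
    if isinstance(value, bytes):
        return value
    return value.encode(encoding)


raw = b'caf\xc3\xa9'          # data, e.g. as read from a file or socket
name = to_text(raw)           # text, safe for string operations
round_trip = to_bytes(name)   # data again, ready to be written out

print(len(raw), len(name))
print(round_trip == raw)
```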

For more info:
- What's New in Python 2.6 (see PEP 3112, Bytes Literals)
- What's New in Python 3.0 (see "Text Vs. Data Instead Of Unicode Vs. 8-bit" near the top)

answered Oct 22 '22 03:10

mjv
