
Python: data vs. text?

Guido van Rossum's presentation about Python 3000 mentions several things that should eventually make the transition from Python 2 to Python 3 easier. He talks specifically about text handling, since the move to Unicode as the only representation of strings in Python 3 is one of the major changes.

As far as text handling goes, one slide (#14) says:

  • In 2.6:
    • Use bytes and b'…' for all data (Knowing these are just aliases for str and '…')
    • Use unicode and u'...' for all text
  • In 2.5:
    • '...' for data, u'...' for text

I am using Python 2.6.4. What exactly does this mean for me?

In Python's world, what is the difference between data and text?

cschol asked Nov 15 '09 01:11



1 Answer

In a nutshell, the way text and data are handled in Py3k is arguably the most "breaking" change in the language. By knowing and, where possible, avoiding the situations where Python 2.6 logic will behave differently in 3.x, we can ease the migration when it happens. Still, we should expect that some parts of the 2.6 logic will need special attention and modification, for example to deal with distinct encodings.

The idea behind the BDFL's suggestion on slide 14 is probably to start "using" the same types that Py3k supports (and only these): Unicode strings for text (the str type in Py3k, the unicode type in 2.6) and sequences of 8-bit bytes for "data" (the bytes type).

The term "using" in the previous sentence is used rather loosely, since the semantics and the associated storage/encoding of these types differ between 2.6 and 3.x. In Python 2.6, the bytes type and the associated literal syntax (b'xyz') simply map to the str type. Therefore:

# in Py2.6
>>> 'mykey' == b'mykey'
True
>>> b'mykey'.__class__
<type 'str'>

# in Py3k
>>> 'mykey' == b'mykey'
False
>>> b'mykey'.__class__
<class 'bytes'>

To answer your question [in the remarks below]: in 2.6, whether you use b'xyz' or 'xyz', Python treats it as one and the same thing: a str. What is important is that you understand these as [potentially/in the future] two distinct types with distinct purposes:

  • str for text-like info, and
  • bytes for sequences of octets storing whatever data is at hand.

For example, again staying close to your example/question: in Py3k you can have a dictionary with two entries whose keys look alike, one b'mykey' and the other 'mykey'. Under 2.6 this is not possible, since the two keys are really the same. What matters is that you are aware of this kind of thing and avoid (or explicitly mark in the code) the situations where 2.6 code will not work in 3.x.
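The dictionary case described above can be sketched as follows. This is a minimal illustration meant to be run under Python 3; the variable names (settings, key_count, keys_equal) are made up for the example and do not come from the question.

```python
# Python 3: str and bytes are distinct types, so keys that look
# alike can coexist in one dictionary.
settings = {}
settings['mykey'] = 'stored under the text key'
settings[b'mykey'] = 'stored under the bytes key'

key_count = len(settings)            # two separate entries
keys_equal = ('mykey' == b'mykey')   # a str never equals a bytes object

print(key_count, keys_equal)
```

Under Python 2.6 the second assignment would overwrite the first, because b'mykey' and 'mykey' are the same str object there.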

In Py3k, str is an abstract Unicode string, a sequence of Unicode code points (characters), and Python deals with converting it to and from its encoded form, whatever the encoding might be (as a programmer you do have a say about the encoding, but while you are doing string operations and the like you do not need to worry about these details). In contrast, bytes is a sequence of 8-bit "things" whose semantics and encoding are left entirely to the programmer.
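To make the distinction between code points and encoded bytes concrete, here is a small sketch for Python 3. The sample word and the variable names are assumptions chosen for illustration; only str.encode and bytes.decode from the standard library are used.

```python
# Python 3: text is a sequence of code points; bytes only appear
# once an encoding is chosen.
text = 'caf\u00e9'                     # four code points, the last is e-acute

utf8_data = text.encode('utf-8')       # e-acute becomes two bytes in UTF-8
latin1_data = text.encode('latin-1')   # e-acute becomes one byte in Latin-1

round_trip = utf8_data.decode('utf-8')  # back to the same text

# Text and data do not mix implicitly in Python 3.
try:
    text + utf8_data
    mixing_allowed = True
except TypeError:
    mixing_allowed = False

print(len(text), len(utf8_data), len(latin1_data), mixing_allowed)
```

The same text yields different byte sequences depending on the encoding, which is why the meaning of a bytes object is left to the programmer.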

So, even though Python 2.6 sees no difference between bytes and str, by explicitly using bytes() / b'...' for data and unicode() / u'...' for text, you...

  • ... prepare yourself and your program for the upcoming types and semantics of Py3k
  • ... make automatic conversion of the source code (with the 2to3 tool or another) easier: the b in b'...' will remain and the u in u'...' will be removed (since the only string type will be Unicode).
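As a rough sketch of what this habit looks like in practice, the snippet below marks data with b'...' and text with u'...' and converts only at the boundaries. The byte string and variable names are hypothetical. It should behave the same on Python 2.6+ and on Python 3.3+, which accepts the u'...' prefix again (PEP 414); Python 3.0 through 3.2 would need the u removed, which is what 2to3 does.

```python
# Data: bytes as they might arrive from a file or a socket (hypothetical input).
raw = b'caf\xc3\xa9'

# Decode at the boundary: from here on we work with text.
text = raw.decode('utf-8')

# Text is combined only with text, never with raw bytes.
greeting = u'hello, ' + text

# Encode at the boundary: back to data for output.
encoded = greeting.encode('utf-8')

print(repr(encoded))
```

Keeping the decode and encode steps explicit means the code does not rely on Python 2's implicit ASCII conversions, which are the part that stops working in 3.x.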

For more info:
Python 2.6 What's new (see PEP 3112 Bytes Literals)
Python 3.0 What's New (see Text Vs. Data Instead Of Unicode Vs. 8-bit near the top)

mjv answered Oct 22 '22 03:10