 

What encoding do normal Python strings use?

I know that Django uses unicode strings throughout the framework instead of normal Python strings. What encoding do normal Python strings use? And why don't they use unicode?

Bunny Rabbit asked Aug 23 '10



2 Answers

In Python 2, normal strings (str) don't have an encoding: they are raw data.

In Python 3, the equivalent type is called bytes, which is an accurate description: they are simply sequences of bytes, which can be text encoded in any encoding (several are common!) or non-textual data altogether.

For representing text, you want unicode strings, not byte strings. By "unicode strings", I mean unicode instances in Python 2 and str instances in Python 3. Unicode strings are sequences of unicode codepoints represented abstractly without an encoding; this is well-suited for representing text.
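A minimal sketch of the difference, in Python 3 spelling (in Python 2 the text type is spelled unicode and the literal would be written u"café"):

    # Text: an abstract sequence of Unicode code points, no encoding attached.
    text = "café"                  # str in Python 3 (unicode in Python 2)
    print(len(text))               # 4 code points

    # Bytes: concrete raw data; you pick an encoding to get there.
    data = text.encode("utf-8")    # b'caf\xc3\xa9'
    print(len(data))               # 5 bytes -- 'é' takes two bytes in UTF-8

    # Decoding reverses it, but only with the right encoding.
    print(data.decode("utf-8"))    # café
    print(data.decode("latin-1"))  # cafÃ© -- wrong encoding gives mojibake, not an error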

Bytestrings are important because to represent data for transmission over a network or writing to a file or whatever, you cannot have an abstract representation of unicode, you need a concrete representation of bytes. Though they are often used to store and represent text, this is at least a little naughty.
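For instance, a minimal sketch (Python 3; the file names are just placeholders): whether you open the file in binary or text mode, the bytes that reach the disk come from some encoding.

    # Binary mode: you supply the bytes, so you pick the encoding yourself.
    with open("out.bin", "wb") as f:
        f.write("café".encode("utf-8"))

    # Text mode: the file object encodes for you with the encoding you name.
    with open("out.txt", "w", encoding="utf-8") as f:
        f.write("café")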

This whole situation is complicated by the fact that while you should turn unicode into bytes by calling encode and turn bytes into unicode by calling decode, Python 2 will try to do this automagically for you using a global default encoding (sys.getdefaultencoding()), which is ASCII by default; ASCII is the safest choice. Never depend on this in your code, and never, ever change it to a more flexible encoding: explicitly decode when you get a bytestring and encode when you need to send a string somewhere external.
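A sketch of that Python 2 pitfall (the raw bytes and names here are just an illustrative example; the non-ASCII literal is spelled with an escape so no coding declaration is needed):

    # Python 2 only: mixing unicode and str triggers an implicit decode with
    # the global default encoding (ASCII), which blows up on non-ASCII bytes.
    import sys
    print(sys.getdefaultencoding())        # 'ascii'

    title = u"na\xefve"                    # u"naïve", a unicode string
    raw = "caf\xc3\xa9"                    # UTF-8 bytes read from a file, say

    # title + raw                          # UnicodeDecodeError: 'ascii' codec ...

    # The explicit version always works and says what it means:
    combined = title + raw.decode("utf-8")
    payload = combined.encode("utf-8")     # encode again when sending it somewhere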

Mike Graham answered Oct 05 '22


Hey! I'd like to add some stuff to the other answers; unfortunately I don't have enough rep yet to do that properly :-(

FWIW, Mike Graham's post is pretty good and that's probably what you should be reading first.

Here's a few comments:

  1. The need to prefix unicode literals with "u" in 2.x is pretty easily removed in recent (2.6+) 2.x Pythons: from __future__ import unicode_literals (see the sketch after this list).
  2. Similarly, ASCII is only the default source encoding. Python understands a variety of coding hints, including the emacs-style # -*- coding: utf-8 -*-. For more information see PEP 0263. Changing the source encoding affects how Unicode literals (regardless of their prefix or lack of prefix, as affected by point 1) are interpreted. In Py3k, the default source encoding is UTF-8.
  3. Python of course does use an encoding internally for Unicode strings (str in py3k, unicode in 2.x), because at some point stuff has to be written to memory. Ideally, this would never be evident to the end user. Unfortunately nothing's perfect, and you can occasionally run into problems with this: specifically if you use funky squiggles outside the Unicode Basic Multilingual Plane. Since Python 2.2, we've had what are called wide builds and narrow builds; these names refer to the type used internally to store Unicode code points. Wide builds use UCS-4, which uses 4 bytes to store a code point (so its code unit size is 4 bytes, or 32 bits, and every code point fits in one code unit). Narrow builds use 16-bit code units (UCS-2), which cannot hold every Unicode code point in a single unit: code points outside the BMP end up stored as surrogate pairs and count as two characters. To check, test the value of sys.maxunicode: if it's 1114111, you've got a wide build (which can correctly represent all of Unicode); if it's 65535, well, don't fret too much. The BMP (code points 0x0000 to 0xFFFF) covers most people's needs. For more information, see PEP 0261.
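Here's a small sketch tying those three points together (a Python 2 module; the example string is arbitrary):

    # -*- coding: utf-8 -*-
    # Point 2: the coding declaration above tells Python how to decode this file.
    # Point 1: with this import, plain literals below are unicode even without u"".
    from __future__ import unicode_literals

    import sys

    greeting = "griaß di"            # a unicode string, thanks to the import
    print(type(greeting))            # <type 'unicode'> on Python 2

    # Point 3: check whether this interpreter is a wide or narrow build.
    if sys.maxunicode == 0x10FFFF:   # 1114111: wide build, full Unicode range
        print("wide build (UCS-4)")
    else:                            # 65535: narrow build, BMP + surrogate pairs
        print("narrow build (UCS-2)")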
lvh answered Oct 05 '22