Unicode vs UTF-8 confusion in Python / Django?

Tags: python, unicode, django

I stumbled over this passage in the Django tutorial:

Django models have a default __str__() method that calls __unicode__() and converts the result to a UTF-8 bytestring. This means that unicode(p) will return a Unicode string, and str(p) will return a normal string, with characters encoded as UTF-8.

Now, I'm confused because afaik Unicode is not any particular representation, so what is a "Unicode string" in Python? Does that mean UCS-2? Googling turned up this "Python Unicode Tutorial" which boldly states

Unicode is a two-byte encoding which covers all of the world's common writing systems.

which is plain wrong, or is it? I have been confused many times by character set and encoding issues, but here I'm quite sure that the documentation I'm reading is confused. Does anybody know what's going on in Python when it gives me a "Unicode string"?

312

asked Aug 22 '08 12:08

Hanno Fietz

1 Answer

what is a "Unicode string" in Python? Does that mean UCS-2?

In Python 2, and in Python 3 before version 3.3, Unicode strings are stored internally either as UCS-2 (fixed-length 16-bit representation, almost the same as UTF-16) or UCS-4/UTF-32 (fixed-length 32-bit representation). It's a compile-time option; on Windows it's always UTF-16 whilst many Linux distributions set UTF-32 (‘wide mode’) for their versions of Python. From Python 3.3 onwards the build option is gone and the interpreter picks the storage width per string.

You are generally not supposed to care: you will see Unicode code points as single elements in your strings and you won't know whether they're stored as two or four bytes. The exception is a UTF-16 build, where a character outside the Basic Multilingual Plane is stored as a surrogate pair and shows up as two elements, so length and indexing go wrong. If you need to handle such characters in a UTF-16 build you'll be Doing It Wrong, but that's still very rare, and users who really need the extra characters should be compiling wide builds.
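To make the difference between code points and their storage concrete, here is a minimal sketch. It assumes Python 3.3 or later, where the narrow/wide build distinction no longer exists and a str is always indexed by code point. The helper names encoded_sizes and utf16_code_units, and the two sample characters, are made up for illustration.

```python
# Assumes Python 3.3+; helper names and sample characters are illustrative only.
bmp_char = "\u00e9"         # LATIN SMALL LETTER E WITH ACUTE, inside the BMP
astral_char = "\U0001F600"  # a character outside the Basic Multilingual Plane


def encoded_sizes(text):
    """Return the number of bytes text occupies in three encodings."""
    return {
        "utf-8": len(text.encode("utf-8")),
        "utf-16": len(text.encode("utf-16-le")),
        "utf-32": len(text.encode("utf-32-le")),
    }


def utf16_code_units(text):
    """Return the 16-bit code units a UTF-16 (narrow) representation would hold."""
    data = text.encode("utf-16-le")
    return [int.from_bytes(data[i:i + 2], "little") for i in range(0, len(data), 2)]


# Both strings contain exactly one code point, whatever the storage width.
print(len(bmp_char), len(astral_char))

# The byte counts depend on the encoding, not on the string itself.
print(encoded_sizes(bmp_char))
print(encoded_sizes(astral_char))

# Outside the BMP, UTF-16 needs a surrogate pair: two code units for one character.
print([hex(unit) for unit in utf16_code_units(astral_char)])
```

The last line is what a narrow build exposed directly: two elements for a single character. That is why non-BMP text was awkward to handle there.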

plain wrong, or is it?

Yes, it's quite wrong. To be fair I think that tutorial is rather old; it probably pre-dates wide Unicode strings, if not Unicode 3.1 (the version that introduced characters outside the Basic Multilingual Plane).

There is an additional source of confusion stemming from Windows's habit of using the term “Unicode” to mean, specifically, the UTF-16LE encoding that NT uses internally. People from Microsoftland may often copy this somewhat misleading habit.
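As a rough illustration of why the Windows usage is misleading, the sketch below encodes the same Unicode string two ways. It assumes Python 3, and the sample text is arbitrary. "Unicode" in the Windows sense corresponds to the utf-16-le bytes only, while the string itself is independent of either encoding.

```python
# Assumes Python 3; the sample text is arbitrary.
text = "h\u00e9"  # two code points: 'h' and 'é'

as_utf16le = text.encode("utf-16-le")  # what Windows APIs tend to call "Unicode"
as_utf8 = text.encode("utf-8")         # what Django's str(p) produced in the quoted tutorial

print(as_utf16le)  # two bytes per code point for this BMP-only text
print(as_utf8)     # one byte for 'h', two bytes for 'é'

# Different byte sequences, same Unicode string once decoded with the right codec.
print(as_utf16le.decode("utf-16-le") == as_utf8.decode("utf-8") == text)
```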

151

answered Oct 04 '22 12:10

bobince
