Python and Unicode: How everything should be Unicode

Forgive me if this is a long question:

I have been programming in Python for around six months. I am self-taught, starting with the Python tutorial, then SO, and then just using Google for stuff.

Here is the sad part: no one told me that all strings should be Unicode. No, I am not lying or making this up, but where does the tutorial mention it? Most examples I see also just use byte strings instead of Unicode strings. I was just browsing and came across this question on SO, which says that every string in Python should be a Unicode string. This pretty much made me cry!

I read that every string in Python 3.0 is Unicode by default, so my questions are for 2.x:

  1. Should I do a:

    print u'Some text' or just print 'Text' ?

  2. Everything should be Unicode. Does this mean that if, say, I have a tuple:

    t = ('First', 'Second'), it should be t = (u'First', u'Second')?

    I read that I can do a from __future__ import unicode_literals and then every string will be a Unicode string, but should I do this inside a container also?

  3. When reading/writing a file, I should use the codecs module, right? Or should I just use the standard way of reading/writing and encode or decode where required?

  4. If I get a string from, say, raw_input(), should I convert that to Unicode also?

What is the common approach to handling all of the above issues in 2.x? The from __future__ import unicode_literals statement?

Sorry for being such a noob, but this changes what I have been doing for a long time, so clearly I am confused.

user225312 asked Dec 27 '10 18:12



2 Answers

The "always use Unicode" suggestion is primarily to make the transition to Python 3 easier. If you have a lot of non-Unicode string access in your code, it'll take more work to port it.

Also, you shouldn't have to decide on a case-by-case basis whether a string should be stored as Unicode or not. You shouldn't have to change the types of your strings and their very syntax just because you changed their contents, either.

It's also easy to use the wrong string type, leading to code that mostly works, or code which works in Linux but not in Windows, or in one locale but not another. For example, for c in "漢字" in a UTF-8 locale will iterate over each UTF-8 byte (all six of them), not over each character; whether that breaks things depends on what you do with them.
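A small sketch of that difference, written so that it runs on both Python 2 and Python 3 (the variable names are only illustrative). It compares the Unicode text with the UTF-8 bytes that a plain Python 2 byte-string literal would hold in a UTF-8 source file:

```python
# -*- coding: utf-8 -*-

text = u"漢字"              # Unicode text: two characters
raw = text.encode("utf-8")  # the bytes a Python 2 byte string would hold in a UTF-8 file

print(len(text))  # number of characters: 2
print(len(raw))   # number of UTF-8 bytes: 6

# Looping over the Unicode string yields whole characters, never partial bytes.
chars = [c for c in text]
```

Looping over raw instead would visit six items, none of which is a complete character on its own.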

In principle, nothing should break if you use Unicode strings, but things may break if you use regular strings when you shouldn't.

In practice, however, it's a pain to use Unicode strings everywhere in Python 2. codecs.open doesn't pick up the locale's encoding automatically; this fails with a UnicodeEncodeError:

codecs.open("blar.txt", "w").write(u"漢字") 

The real answer is:

import locale, codecs
lang, encoding = locale.getdefaultlocale()
codecs.open("blar.txt", "w", encoding).write(u"漢字")

... which is cumbersome, forcing people to make helper functions just to open files. codecs.open should be using the encoding from locale automatically when one isn't specified; Python's failure to make such a simple operation convenient is one of the reasons people generally don't use Unicode everywhere.
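The kind of helper function meant here might look like the sketch below. It is only an illustration: open_text is a hypothetical name, and it uses io.open together with locale.getpreferredencoding rather than the codecs.open call shown above.

```python
import io
import locale


def open_text(path, mode="r", encoding=None):
    """Open a file in text mode, defaulting to the locale's preferred encoding.

    open_text is a hypothetical helper, not part of the standard library.
    Only text modes ("r", "w", "a") make sense here.
    """
    if encoding is None:
        # Ask the locale for its encoding without changing the locale settings.
        encoding = locale.getpreferredencoding(False)
    # io.open returns a file object that reads and writes Unicode strings.
    return io.open(path, mode, encoding=encoding)
```

With this, open_text("blar.txt", "w").write(u"漢字") works whenever the locale's encoding can represent the text; passing encoding="utf-8" explicitly makes the result independent of the locale.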

Finally, note that Unicode strings are even more critical on Windows in some cases. For example, if you're in a Western locale and you have a file named "漢字", you must use a Unicode string to access it, e.g. os.stat(u"漢字"). It's impossible to access it with a non-Unicode string; it just won't see the file.

So, in principle I'd say the Unicode string recommendation is reasonable, but with the caveat that I don't generally even follow it myself.

Glenn Maynard answered Oct 02 '22 18:10


No, not every string "should be Unicode". Within your Python code, you know whether a string literal needs to be Unicode or not, so it doesn't make any sense to make every string literal into a Unicode literal.

But there are cases where you should use Unicode. For example, if you have arbitrary input that is text, use Unicode for it. Sooner or later you will find a non-American using it, and he will want to wrîte têxt ås hé is üsed tö. You'll get problems in that case unless your input and output happen to use the same encoding, which you can't be sure of.

So in short, no, strings shouldn't be Unicode. Text should be. But YMMV.

Specifically:

  1. No need to use Unicode here. You know if that string is ASCII or not.

  2. It depends on whether you need to combine those strings with Unicode strings or not.

  3. Both ways work. But do not encode or decode "when required". Decode ASAP, encode as late as possible. Using codecs works well (or io, from Python 2.7).

  4. Yeah.
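As a rough sketch of "decode ASAP, encode as late as possible" (points 3 and 4), the following runs on both Python 2 and Python 3. The function name normalize and the choice of Latin-1 as the incoming encoding are assumptions made for the example; in real code the input encoding has to come from the file format, the protocol or the terminal.

```python
# -*- coding: utf-8 -*-


def normalize(text):
    """Work on Unicode text only; this function never sees bytes."""
    return text.strip().upper()


# Bytes arriving from outside (a file, a socket, raw_input() on Python 2).
# Latin-1 is an assumed input encoding for this example.
incoming = u"  wrîte têxt ".encode("latin-1")

text = incoming.decode("latin-1")   # decode as soon as the data comes in
result = normalize(text)            # all processing happens on Unicode
outgoing = result.encode("utf-8")   # encode only when the data leaves

print(repr(outgoing))
```

Because the middle of the program only ever handles Unicode, the input and output encodings can differ without anything in between needing to know.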

Lennart Regebro answered Oct 02 '22 17:10