Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Encoding for Multilingual .py Files

I am writing a .py file that contains strings from multiple charactersets, including English, Spanish, and Russian. For example, I have something like:

string_en = "The quick brown fox jumped over the lazy dog."  
string_es = "El veloz murciélago hindú comía feliz cardillo y kiwi."
string_ru = "В чащах юга жил бы цитрус? Да, но фальшивый экземпляр!"

I am having trouble figuring out how to encode my file to avoid generating syntax errors like the one below when my file is run:

SyntaxError: Non-ASCII character '\xc3' in file example.py on line 128, but no encoding
declared; see http://www.python.org/peps/pep-0263.html for details

I've tried adding # -*- coding: utf-8 -*- to the beginning of my file, but without any luck. I've also tried marking my strings as unicode (i.e. string_en = u'The quick brown fox jumped over the lazy dog."), again unsuccessfully.

Is it possible to include characters from different Python codecs in one file, or am I attempting to do something that is not allowed?

like image 885
Katrina Avatar asked Feb 14 '11 17:02

Katrina


People also ask

What encoding to use for all languages?

You cannot encode different parts of a document in different encodings. A Unicode-based encoding such as UTF-8 can support many languages and can accommodate pages and forms in any mixture of those languages.

What encoding does Python use?

By default, Python uses utf-8 encoding.

How do I fix encoding in Python?

The best way to attack the problem, as with many things in Python, is to be explicit. That means that every string that your code handles needs to be clearly treated as either Unicode or a byte sequence. The most systematic way to accomplish this is to make your code into a Unicode-only clean room.

What is encoding in Python file handling?

The encoding information is then used by the Python parser to interpret the file using the given encoding. Most notably this enhances the interpretation of Unicode literals in the source code and makes it possible to write Unicode literals using e.g. UTF-8 directly in an Unicode aware editor.


1 Answers

There are two aspects to proper encoding of strings in your use case:

  1. For Python to understand that you are using UTF-8 encoding, you must include in the first or second line of your code, a line that looks like # coding=utf-8. See PEP 0263 for details.

  2. Your editor also must use UTF-8. This requires to configure it, and depends on the editor you are using. Configuration of Emacs and Vim are addressed in the same PEP, Eclipse can default to the filesystem encoding, which itself can be derived from your locale settings, etc.

like image 197
Eric Redon Avatar answered Oct 14 '22 05:10

Eric Redon