Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Chinese and Japanese character support in python

How to read correctly japanese and chinese characters. I'm using python 2.5. Output is displayed as "E:\Test\?????????"

path = r"E:\Test\は最高のプログラマ"
t = path.encode()
print t
u = path.decode()
print u
t = path.encode("utf-8")
print t
t = path.decode("utf-8")
print t
like image 679
user2030113 Avatar asked Feb 04 '13 08:02

user2030113


People also ask

Does Python support Japanese?

The default encoding for Python 2 files is ASCII, so by declaring an encoding you make it possible to use Japanese directly.

Is Japanese supported in UTF-8?

The Unicode Standard supports all of the CJK characters from JIS X 0208, JIS X 0212, JIS X 0221, or JIS X 0213, for example, and many more. This is true no matter which encoding form of Unicode is used: UTF-8, UTF-16, or UTF-32.

Which coding scheme is used for Chinese and Japanese characters?

HZ is a chinese coding scheme in which two ascii chars are used to represent one chinese character.

How do I fix encoding in Python?

The best way to attack the problem, as with many things in Python, is to be explicit. That means that every string that your code handles needs to be clearly treated as either Unicode or a byte sequence. The most systematic way to accomplish this is to make your code into a Unicode-only clean room.


1 Answers

Please do read the Python Unicode HOWTO; it explains how to process and include non-ASCII text in your Python code.

If you want to include Japanese text literals in your code, you have several options:

  • Use unicode literals (create unicode objects instead of byte strings), but any non-ascii codepoint is represented by a unicode escape character. They take the form of \uabcd, so a backslash, a u and 4 hexadecimal digits:

    ru = u'\u30EB'
    

    would be one character, the katakana 'ru' codepoint ('ル').

  • Use unicode literals, but include the characters in some form of encoding. Your text editor will save files in a given encoding (say, UTF-16); you need to declare that encoding at the top of the source file:

    # encoding: utf-16
    
    ru = u'ル'
    

    where 'ル' is included without using an escape. The default encoding for Python 2 files is ASCII, so by declaring an encoding you make it possible to use Japanese directly.

  • Use byte string literals, ready encoded. Encode the codepoints by some other means and include them in your byte string literals. If all you are going to do with them is use them in encoded form anyway, this should be fine:

    ru = '\xeb\x30'  # ru encoded to UTF16 little-endian
    

    I encoded 'ル' to UTF-16 little-endian because that's the default Windows NTFS filename encoding.

Next problem will be your terminal, the Windows console is notorious for not supporting many character sets out of the box. You probably want to configure it to handle UTF-8 instead. See this question for some details, but you need to run the following command in the console:

chcp 65001

to switch to UTF-8, and you may need to switch to a console font that can handle your codepoints (Lucida perhaps?).

like image 135
Martijn Pieters Avatar answered Oct 15 '22 18:10

Martijn Pieters