
Python Unicode Encoding

I am using argparse to read in arguments for my Python code. One of those inputs is the title of a file [title], which can contain Unicode characters. I have been using 22少女時代22 as a test string.

I need to write the value of the input title to a file, but when I try to convert the string to UTF-8 it always throws an error:

UnicodeDecodeError: 'ascii' codec can't decode byte 0x8f in position 2: ordinal not in range(128)

I have been looking around and see I need my string to be in the form u"foo" to call .encode() on it.

When I run type() on my input from argparse I see:

<type 'str'>

I am looking to get a response of:

<type 'unicode'>

How can I get it in the right form?


I tried modifying argparse to take in a str but store it as a unicode string u"foo":

parser.add_argument(u'title', metavar='T', type=unicode, help='this will be unicode encoded.')

This approach is not working at all. Thoughts?

Edit 1:

Some sample code where title is 22少女時代22:

inputs = vars(parser.parse_args())
title = inputs["title"]
print type(title)
print type(u'foo')
title = title.encode('utf8') # This line throws the error
print title
Morrowind789 asked Oct 06 '12 22:10


1 Answer

It looks like your input data is in SJIS encoding (a legacy encoding for Japanese), which produces the byte 0x8f at position 2 in the bytestring:

>>> '22少女時代22'.encode('sjis')
b'22\x8f\xad\x8f\x97\x8e\x9e\x91\xe322'

(At a Python 3 prompt.)

Now, I'm guessing that to "convert the string to UTF-8", you used something like

title.encode('utf8')

The problem is that title is actually a bytestring containing the SJIS-encoded string. Due to a design flaw in Python 2, a bytestring can be encoded directly: Python first decodes it implicitly, assuming it is ASCII-encoded. So what you have is conceptually equivalent to

title.decode('ascii').encode('utf8')

and of course the decode call fails.
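The failure can be reproduced at a Python 3 prompt, where the implicit step has to be spelled out. This is only a sketch: raw_title is a stand-in for the bytes your script received, built here by encoding the test string as SJIS.

```python
# Python 3 sketch: bytes have no .encode(), so the hidden Python 2 step
# is written out as an explicit .decode('ascii').

# Stand-in for the raw bytes the script received from an SJIS console.
raw_title = "22少女時代22".encode("shift_jis")

try:
    raw_title.decode("ascii").encode("utf-8")
    error = None
except UnicodeDecodeError as exc:
    error = exc  # keep the exception so it can be inspected afterwards

print(error)
```

The printed message names the same byte (0x8f) and position (2) as the error in the question.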

You should instead explicitly decode from SJIS to a Unicode string, before encoding to UTF-8:

title.decode('sjis').encode('utf8')
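Here is a minimal Python 3 sketch of the same decode-then-encode step, with each stage kept in its own variable. The names raw_title, text_title and utf8_title are illustrative, and the SJIS bytes are simulated rather than read from a real console.

```python
# Stand-in for the SJIS bytes that arrived from the command line.
raw_title = "22少女時代22".encode("shift_jis")

# Step 1: bytes -> text, naming the encoding the bytes are really in.
text_title = raw_title.decode("shift_jis")

# Step 2: text -> UTF-8 bytes, ready for a file opened in binary mode.
utf8_title = text_title.encode("utf-8")

# The byte lengths differ because the two encodings use different
# numbers of bytes for each kanji.
print(len(raw_title), len(utf8_title))
```

Keeping the text form around as its own value makes it harder to mix up "bytes in some encoding" with "decoded characters", which is the confusion behind the original error.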


As Mark Tolonen pointed out, you're probably typing the characters into your console, and your console encoding is a non-Unicode encoding.

So it turns out your sys.stdin.encoding is cp932, which is Microsoft's variant of SJIS. For this, use

title.decode('cp932').encode('utf8')

You really should set your console encoding to the standard UTF-8, but I'm not sure if that's possible on Windows. If you do, you can skip the decoding/encoding step and just write your input bytestring to the file.
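If you would rather not hard-code the encoding, one option is to ask the console for it. The sketch below (Python 3 syntax) uses a hypothetical helper, decode_console_bytes, which is not part of argparse or the standard library. It assumes that the bytes you received are in the encoding reported by sys.stdin.encoding, which may not hold on every platform, and it falls back to UTF-8 when that attribute is unavailable. The output path is a throwaway temporary file.

```python
import os
import sys
import tempfile


def decode_console_bytes(raw, encoding=None):
    """Decode console bytes to text (hypothetical helper, for illustration)."""
    if encoding is None:
        # sys.stdin.encoding may be missing or None in some environments,
        # so fall back to UTF-8 (an assumption, not a guarantee).
        encoding = getattr(sys.stdin, "encoding", None) or "utf-8"
    return raw.decode(encoding)


# Simulate a cp932 console by encoding the test string ourselves.
raw_title = "22少女時代22".encode("cp932")
title = decode_console_bytes(raw_title, encoding="cp932")

# Write the title as UTF-8 bytes, then read the bytes back to check them.
out_path = os.path.join(tempfile.mkdtemp(), "title.txt")
with open(out_path, "wb") as handle:
    handle.write(title.encode("utf-8"))

with open(out_path, "rb") as handle:
    written = handle.read()
```

With a UTF-8 console the same helper decodes and re-encodes to identical bytes, which is why the step can be skipped in that case.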

Mechanical snail answered Sep 25 '22 01:09