I am using argparse to read in arguments for my Python code. One of those inputs is the title of a file (title), which can contain Unicode characters. I have been using 22少女時代22 as a test string.
I need to write the value of the input title to a file, but when I try to convert the string to UTF-8 it always throws an error:
UnicodeDecodeError: 'ascii' codec can't decode byte 0x8f in position 2: ordinal not in range(128)
From what I have read, my string needs to be in the form u"foo" before I can call .encode() on it.
When I run type() on my input from argparse, I see:
<type 'str'>
I am looking to get a response of:
<type 'unicode'>
How can I get it in the right form?
Idea:
Modify argparse to take in a str but store it as a unicode string (u"foo"):
parser.add_argument(u'title', metavar='T', type=unicode, help='this will be unicode encoded.')
This approach is not working at all. Thoughts?
Edit 1:
Some sample code where title is 22少女時代22:
inputs = vars(parser.parse_args())
title = inputs["title"]
print type(title)
print type(u'foo')
title = title.encode('utf8') # This line throws the error
print title
UTF-8 is a byte oriented encoding. The encoding specifies that each character is represented by a specific sequence of one or more bytes.
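As a rough illustration of "one or more bytes per character", the sketch below encodes each character of the test string separately and records how many bytes UTF-8 uses for it. It is written for Python 3, and the variable names are only illustrative.

```python
# -*- coding: utf-8 -*-
# Sketch (Python 3): how many bytes does UTF-8 use per character?
text = "22少女時代22"

for ch in text:
    encoded = ch.encode("utf-8")
    # ASCII digits take one byte; each of these CJK characters takes three.
    print(ch, len(encoded), encoded)

# Per-character byte counts, and the size of the whole encoded string.
utf8_lengths = [len(ch.encode("utf-8")) for ch in text]
total_bytes = len(text.encode("utf-8"))
```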
In Python 3, the built-in functions chr() and ord() convert between Unicode code points and characters (in Python 2, chr() only covers 0-255 and unichr() is used for Unicode code points). A character can also be written in a string literal as a hexadecimal code point with \x, \u, or \U.
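A small sketch of those conversions, assuming Python 3 (where chr() accepts any code point). The character 少 is taken from the test string; the variable names are made up for the example.

```python
# -*- coding: utf-8 -*-
# Sketch (Python 3): code points <-> characters.
code_point = ord("少")          # character -> integer code point
as_hex = hex(code_point)        # the same number in hexadecimal
round_trip = chr(code_point)    # integer code point -> character

# The same characters written with escapes in string literals.
from_u = "\u5c11"               # \u takes exactly 4 hex digits
from_big_u = "\U00005c11"       # \U takes exactly 8 hex digits
letter_a = "\x41"               # \x takes exactly 2 hex digits
```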
The difference between Unicode and UTF-8: Unicode is a character set, while UTF-8 is an encoding. Unicode assigns each character a unique number (its code point).
Python 2 uses the str type to store bytes and the unicode type to store Unicode code points. String literals are str (bytes) by default, and the default encoding is ASCII.
It looks like your input data is in SJIS encoding (a legacy encoding for Japanese), which produces the byte 0x8f at position 2 in the bytestring:
>>> '22少女時代22'.encode('sjis')
b'22\x8f\xad\x8f\x97\x8e\x9e\x91\xe322'
(At Python 3 prompt)
Now, I'm guessing that to "convert the string to UTF-8", you used something like
title.encode('utf8')
The problem is that title is actually a bytestring containing the SJIS-encoded string. Due to a design flaw in Python 2, you can call encode directly on a bytestring; Python first decodes it implicitly, assuming it is ASCII-encoded. So what you have is conceptually equivalent to
title.decode('ascii').encode('utf8')
and of course the decode call fails.
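The failing step can be reproduced in isolation. The sketch below runs on Python 3, so it performs the ASCII decode explicitly rather than relying on Python 2's implicit one; the bytes are built by encoding the test string as Shift JIS, which is an assumption about what the console delivered.

```python
# -*- coding: utf-8 -*-
# Sketch (Python 3): the explicit form of the decode that Python 2 does implicitly.
raw = "22少女時代22".encode("shift_jis")  # stand-in for the bytes argparse received

error = None
try:
    raw.decode("ascii")
except UnicodeDecodeError as exc:
    # The first non-ASCII byte sits right after the two leading digits.
    error = exc
    print(exc)
```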
You should instead explicitly decode from SJIS to a Unicode string, before encoding to UTF-8:
title.decode('sjis').encode('utf8')
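As a quick check of that decode-then-encode step, here is a sketch on Python 3 bytes objects. The input bytes are simulated by encoding the test string as Shift JIS, since real console input is not available here.

```python
# -*- coding: utf-8 -*-
# Sketch (Python 3): legacy-encoded bytes -> text -> UTF-8 bytes.
text = "22少女時代22"
sjis_bytes = text.encode("shift_jis")    # simulated input bytestring

decoded = sjis_bytes.decode("shift_jis")  # bytes -> Unicode text
utf8_bytes = decoded.encode("utf-8")      # Unicode text -> UTF-8 bytes
```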
As Mark Tolonen pointed out, you're probably typing the characters into your console, and your console uses a non-Unicode encoding.
So it turns out your sys.stdin.encoding is cp932, which is Microsoft's variant of SJIS. For this, use
title.decode('cp932').encode('utf8')
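One way to fold the decoding into argparse is a custom type callable, sketched below. The helper to_text and the output file name are hypothetical, and cp932 is assumed as the console encoding based on the discussion above. The helper only decodes when it receives a bytestring, so on Python 3 (where command-line arguments already arrive as text) it passes the value through unchanged.

```python
# -*- coding: utf-8 -*-
# Sketch: decode the argument once, then write it out as UTF-8.
import argparse
import io
import os
import tempfile


def to_text(value, encoding="cp932"):
    """Hypothetical helper: decode bytestrings, pass text through unchanged."""
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value


parser = argparse.ArgumentParser()
parser.add_argument("title", metavar="T", type=to_text, help="title, stored as text")

# An explicit argument list stands in for real command-line input.
args = parser.parse_args(["22少女時代22"])
title = args.title

# io.open with an explicit encoding handles the UTF-8 conversion on write.
out_path = os.path.join(tempfile.mkdtemp(), "title.txt")
with io.open(out_path, "w", encoding="utf-8") as handle:
    handle.write(title)

with io.open(out_path, "rb") as handle:
    written = handle.read()
```

Letting io.open do the encoding means the rest of the program only ever handles text, which avoids repeating the implicit-ASCII mistake elsewhere.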
You really should set your console encoding to the standard UTF-8, but I'm not sure if that's possible on Windows. If you do, you can skip the decoding/encoding step and just write your input bytestring to the file.