Python Unicode Encoding

Tags: python, encode, unicode, argparse

I am using argparse to read in arguments for my Python code. One of those inputs is the title of a file [title], which can contain Unicode characters. I have been using 22少女時代22 as a test string.

I need to write the value of the input title to a file, but when I try to convert the string to UTF-8 it always throws an error:

UnicodeDecodeError: 'ascii' codec can't decode byte 0x8f in position 2: ordinal not in range(128)

I have been looking around and see I need my string to be in the form u"foo" to call .encode() on it.

When I run type() on my input from argparse I see:

<type 'str'>

I am looking to get a response of:

<type 'unicode'>

How can I get it in the right form?

Idea:

Modify argparse to take in a str but store it as a unicode string u"foo":

parser.add_argument(u'title', metavar='T', type=unicode, help='this will be unicode encoded.')

This approach is not working at all. Thoughts?

Edit 1:

Some sample code where title is 22少女時代22:

inputs = vars(parser.parse_args())
title = inputs["title"]
print type(title)
print type(u'foo')
title = title.encode('utf8') # This line throws the error
print title

asked Oct 06 '12 22:10

Morrowind789

1 Answer

It looks like your input data is in SJIS encoding (a legacy encoding for Japanese), which produces the byte 0x8f at position 2 in the bytestring:

>>> '22少女時代22'.encode('sjis')
b'22\x8f\xad\x8f\x97\x8e\x9e\x91\xe322'

(At Python 3 prompt)

Now, to "convert the string to UTF-8", you used something like

title.encode('utf8')

The problem is that title is actually a bytestring containing the SJIS-encoded text. Due to a design flaw in Python 2, you can call encode directly on a bytestring; Python first implicitly decodes it to Unicode, assuming the bytes are ASCII. So what you have is conceptually equivalent to

title.decode('ascii').encode('utf8')

and of course the decode call fails.
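
To see the failure in isolation, here is a small sketch that performs the ASCII decode explicitly on the SJIS bytes from the session above. It is written for Python 3, where the decode step has to be spelled out; the variable names raw_title and error_message are only for illustration.

```python
# SJIS bytes for the test title, copied from the Python 3 session above.
raw_title = b'22\x8f\xad\x8f\x97\x8e\x9e\x91\xe322'

try:
    # Python 2 runs this decode implicitly when .encode() is called on a str.
    raw_title.decode('ascii')
    error_message = None
except UnicodeDecodeError as exc:
    # The first non-ASCII byte (0x8f at index 2) triggers the error.
    error_message = str(exc)

print(error_message)
```

The message names the same byte and position as the error in the question, which is what points to SJIS as the input encoding.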

You should instead explicitly decode from SJIS to a Unicode string, before encoding to UTF-8:

title.decode('sjis').encode('utf8')
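
As a quick check of the two steps, this sketch starts from the same SJIS bytes, decodes them to a Unicode string, and then encodes that string as UTF-8. It assumes Python 3 and uses illustrative variable names; in Python 2 the same two calls apply to the str that argparse returns.

```python
# SJIS bytes for the test title, copied from the Python 3 session above.
raw_title = b'22\x8f\xad\x8f\x97\x8e\x9e\x91\xe322'

# Step 1: bytes in a known encoding -> Unicode text.
text_title = raw_title.decode('sjis')

# Step 2: Unicode text -> UTF-8 bytes, ready to be written to a file.
utf8_title = text_title.encode('utf8')

print(text_title)
print(utf8_title)
```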

As Mark Tolonen pointed out, you're probably typing the characters into your console, and your console is using a non-Unicode encoding.

So it turns out your sys.stdin.encoding is cp932, which is Microsoft's variant of SJIS. For this, use

title.decode('cp932').encode('utf8')
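
If you would rather have argparse hand you a Unicode string directly, one option is to pass a small converter as type=. The sketch below is only an illustration: decode_arg is a hypothetical helper, and falling back to sys.stdin.encoding (then UTF-8) is an assumption about where the bytes came from. It is written to run on Python 3, where arguments already arrive as text; the bytes branch is the one that would do the work under Python 2.

```python
import argparse
import sys


def decode_arg(value, encoding=None):
    """Hypothetical helper: return a command-line argument as Unicode text."""
    if isinstance(value, bytes):
        # Python 2 passes raw bytes in the console encoding (e.g. cp932).
        encoding = encoding or getattr(sys.stdin, 'encoding', None) or 'utf-8'
        return value.decode(encoding)
    # Python 3 has already decoded the argument to text.
    return value


parser = argparse.ArgumentParser()
parser.add_argument('title', metavar='T', type=decode_arg,
                    help='title, decoded to a Unicode string')

# An explicit argument list stands in for sys.argv in this example.
args = parser.parse_args(['22少女時代22'])
print(args.title)
```

Passing type=unicode, as in the question, fails for the same reason as the encode call: unicode(some_bytes) also decodes as ASCII unless an encoding is given.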

You really should set your console encoding to the standard UTF-8, but I'm not sure if that's possible on Windows. If you do, you can skip the decoding/encoding step and just write your input bytestring to the file.
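
Whichever console encoding you end up with, once the title is a Unicode string you can also let the file object do the UTF-8 encoding instead of calling encode yourself. This is a minimal sketch using io.open, which accepts an encoding argument on both Python 2 and Python 3; the temporary output path is made up for the example.

```python
import io
import os
import tempfile

title = u'22少女時代22'

# Hypothetical output path in a temporary directory, for illustration only.
out_path = os.path.join(tempfile.mkdtemp(), 'title.txt')

# io.open encodes the text as UTF-8 while writing.
with io.open(out_path, 'w', encoding='utf-8') as handle:
    handle.write(title)

# Read the raw bytes back to confirm what landed on disk.
with io.open(out_path, 'rb') as handle:
    written_bytes = handle.read()

print(written_bytes)
```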

answered Sep 25 '22 01:09

Mechanical snail
