Is there an easy way to make unicode work in python?

I'm trying to deal with unicode in Python 2.7.2. I know there is the .encode('utf-8') thing, but half the time when I add it I get errors, and half the time when I don't add it I get errors.

Is there any way to tell Python (which I thought was an up-to-date, modern language) to just use unicode for strings and not make me mess around with .encode('utf-8') everywhere?

I know... python 3.0 is supposed to do this, but I can't use 3.0 and 2.7 isn't all that old anyways...

For example:

url = "http://en.wikipedia.org//w/api.php?action=query&list=search&format=json&srlimit=" + str(items) + "&srsearch=" + urllib2.quote(title.encode('utf-8'))
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 19: ordinal not in range(128)

Update: If I remove all my .encode statements from all my code and add # -*- coding: utf-8 -*- to the top of my file, right under the #!/usr/bin/python line, I get the following, which is the same as what I get without the # -*- coding: utf-8 -*- line at all.

/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib.py:1250: UnicodeWarning: Unicode equal comparison failed to convert both arguments to Unicode - interpreting them as being unequal
  return ''.join(map(quoter, s))
Traceback (most recent call last):
  File "classes.py", line 583, in <module>
    wiki.getPage(title)
  File "classes.py", line 146, in getPage
    url = "http://en.wikipedia.org/w/api.php?action=query&prop=revisions&format=json&rvprop=content&rvlimit=1&titles=" + urllib2.quote(title)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib.py", line 1250, in quote
    return ''.join(map(quoter, s))
KeyError: u'\xf1'

I'm not manually typing in any strings; I'm parsing HTML and JSON from websites. So the strings/bytestreams/whatever they are, are all created by Python.

Update 2: I can move the error along, but it just keeps coming up in new places. I was hoping Python would be a useful scripting tool, but after 3 days of no luck it looks like I'll just try a different language. It's a shame, since Python is preinstalled on OS X. I've marked as correct the answer that fixed the one instance of the error I posted.

asked Sep 23 '12 22:09

Justin808

2 Answers

This is a very old question, but I wanted to add one partial suggestion. I sympathise with the OP's pain, having gone through it a lot myself, and this at least makes things "easier". Put this at the top of any Python 2.7 script:

from __future__ import unicode_literals

This will at least ensure that your own literal strings default to unicode rather than str.
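
A minimal sketch of what that import changes, written so it runs on Python 2.7 and on Python 3 (where the import is accepted but has no effect). The names `text_type` and `title` are illustrative and not from the question:

```python
# Must appear before any other statement in the module.
from __future__ import unicode_literals

# unicode on Python 2, str on Python 3.
text_type = type(u"")

# With the import in place, a bare literal is already a text string,
# so no u"" prefix is needed. "\xf1" is the code point for "ñ".
title = "Espa\xf1a"

is_text = isinstance(title, text_type)

# Encoding stays explicit and happens only where bytes are required.
encoded = title.encode("utf-8")
```

Note that this only affects literals written in that file. Strings that arrive from files, sockets, or libraries keep whatever type they already had.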

answered Oct 20 '22 12:10

ShankarG

There is no way to make unicode "just work" apart from using unicode strings everywhere and immediately decoding any encoded string you receive. The problem is that you MUST ALWAYS keep straight whether you're dealing with encoded or unencoded data, or use tools that keep track of it for you, or you're going to have a bad time.
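
As a rough illustration of "decode immediately, use unicode everywhere", here is a sketch that decodes at the input boundary, works only with text in the middle, and encodes once at the output boundary. The payload in `raw` is made up for the example, and UTF-8 is an assumption about what the server sends:

```python
import json

# Hypothetical payload: what a socket or HTTP read would hand you (bytes).
raw = b'{"title": "Espa\xc3\xb1a"}'

# 1. Decode immediately at the input boundary (assumes the data is UTF-8).
text = raw.decode("utf-8")

# 2. Work only with text inside the program.
title = json.loads(text)["title"]
shouted = title.upper()

# 3. Encode once, at the output boundary.
out = shouted.encode("utf-8")
```

Because every value between steps 1 and 3 is text, there is no point at which Python has to guess an encoding.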

Python 2 does some things that are problematic for this: it makes str the "default" rather than unicode for things like string literals, it silently coerces str to unicode when you add the two, and it lets you call .encode() on an already-encoded str, which quietly decodes it as ASCII first and so fails on the first non-ASCII byte. As a result, a lot of Python code and Python libraries never state which encoding they expect, yet quietly assume one, because the str type leaves managing the encoding to the programmer. You have to think about the encoding every time you use such libraries, since they don't support the unicode type themselves.

In your particular case, the first error tells you you're dealing with encoded UTF-8 data and trying to encode it again (the UnicodeDecodeError comes from the implicit ASCII decode that Python 2 runs before the encode), while the second tells you you're dealing with unencoded data, that is, a unicode object passed to urllib2.quote. It looks like you may have both. You should really find and fix the source of the problem (I suspect it has to do with the silent coercion I mentioned above), but here's a hack that should fix it in the short term:
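
To make the .encode() trap concrete: on Python 2, calling .encode('utf-8') on a byte string behaves roughly like decoding it with the default ASCII codec and then encoding the result. The sketch below performs that hidden decode by hand, so it reproduces the same UnicodeDecodeError on Python 2 or 3. The variable names are illustrative:

```python
# UTF-8 bytes for "España", as they would arrive from the network.
encoded = u"Espa\xf1a".encode("utf-8")

# Python 2 evaluates encoded.encode("utf-8") roughly as
# encoded.decode("ascii").encode("utf-8"). Doing the decode explicitly
# exposes the step that actually fails.
try:
    encoded.decode("ascii")
    error = None
except UnicodeDecodeError as exc:
    error = exc

# Naming the real encoding is what makes the decode succeed.
text = encoded.decode("utf-8")
```

The captured error reports byte 0xc3, the first byte of the UTF-8 sequence for "ñ", which is the same byte named in the question's first error message.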

# Encode only if title is still a unicode object; leave byte strings alone.
encoded_title = title
if isinstance(encoded_title, unicode):
    encoded_title = title.encode('utf-8')
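
The same hack can be wrapped in a small helper and applied right before quoting. This is only a sketch: `to_utf8_bytes` and `quoted_title` are hypothetical names, the import fallback is there so the snippet also runs on Python 3, and any byte string passed in is assumed to be UTF-8 already:

```python
try:
    from urllib import quote          # Python 2
except ImportError:
    from urllib.parse import quote    # Python 3

# unicode on Python 2, str on Python 3.
text_type = type(u"")


def to_utf8_bytes(value):
    """Encode text to UTF-8; pass byte strings through unchanged.

    Assumption: byte strings given to this helper are already UTF-8.
    """
    if isinstance(value, text_type):
        return value.encode("utf-8")
    return value


def quoted_title(title):
    """Percent-encode a title for use in a URL query string."""
    return quote(to_utf8_bytes(title))


from_text = quoted_title(u"Espa\xf1a")
from_bytes = quoted_title(u"Espa\xf1a".encode("utf-8"))
```

Both calls produce the same quoted string, so the caller no longer has to remember which form `title` is in at that point. It is still a stopgap: the real fix is to decode at the input boundary so that `title` is always text.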

If this is in fact a case of silent coercion biting you, you should be able to easily track down the problem using the excellent unicode-nazi tool:

python -Werror -municodenazi myprog.py

This will give you a traceback right at the point where unicode leaks into your non-unicode strings, instead of leaving you to troubleshoot the exception way down the road from the actual problem. See my answer on this related question for details.

answered Oct 20 '22 14:10

Mu Mind



Tags: python, unicode, utf-8, python-2.7
