Python: Sanitize a string for unicode? [duplicate]

Possible Duplicate:
Python UnicodeDecodeError - Am I misunderstanding encode?

I have a string that I'm trying to make safe for the unicode() function:

>>> s = " foo “bar bar ” weasel"
>>> s.encode('utf-8', 'ignore')

Traceback (most recent call last):
  File "<pyshell#8>", line 1, in <module>
    s.encode('utf-8', 'ignore')
UnicodeDecodeError: 'ascii' codec can't decode byte 0x93 in position 5: ordinal not in range(128)
>>> unicode(s)

Traceback (most recent call last):
  File "<pyshell#9>", line 1, in <module>
    unicode(s)
UnicodeDecodeError: 'ascii' codec can't decode byte 0x93 in position 5: ordinal not in range(128)

I'm mostly flailing around here. What do I need to do to remove the unsafe characters from the string?

Somewhat related to this question, although I was unable to solve my problem from it.

This also fails:

>>> s
' foo \x93bar bar \x94 weasel'
>>> s.decode('utf-8')

Traceback (most recent call last):
  File "<pyshell#13>", line 1, in <module>
    s.decode('utf-8')
  File "C:\Python25\254\lib\encodings\utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode byte 0x93 in position 5: unexpected code byte

asked Jul 11 '10 19:07

Nick Heiner

1 Answer

Good question. Encoding issues are tricky. Let's start with "I have a string." Strings in Python 2 aren't really "strings": a str is a sequence of bytes, and only unicode objects hold text. So where did your string come from, and what encoding is it in? Your example shows curly quotes in the literal, and I'm not even sure how you did that. When I paste it into a Python interpreter, or type it on OS X with Option-[, it doesn't come through.

Looking at your second example though, you have a byte of hex 93. That can't be UTF-8: in UTF-8 every byte above 127 is part of a multibyte sequence, and 0x93 can only be a continuation byte, so it can't appear right after a space. So I'm guessing it's supposed to be Latin-1. The problem is, 0x93 isn't a printable character in Latin-1. The range from 0x80 to 0x9f holds no printable characters there; it is set aside for control codes. Microsoft saw that unused range and decided to put curly quotes and other punctuation in it. In doing so they created a similar encoding called "windows-1252", which is like Latin-1 with printable characters in that range.
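To make that reasoning concrete, here is a minimal sketch that runs the bytes from the question through all three codecs. The helper name try_decode is made up for illustration, and the b'' literal assumes Python 2.6 or later (it also runs on Python 3); on Python 2.5 a plain '...' literal plays the same role.

```python
import unicodedata

# The bytes from the question; the byte at index 5 is 0x93.
raw = b' foo \x93bar bar \x94 weasel'

def try_decode(data, encoding):
    """Return the decoded text, or None if the bytes are invalid in that encoding."""
    try:
        return data.decode(encoding)
    except UnicodeDecodeError:
        return None

as_utf8 = try_decode(raw, 'utf-8')            # fails: 0x93 cannot start a UTF-8 sequence
as_latin1 = try_decode(raw, 'latin-1')        # never fails, but 0x93 becomes a control code
as_cp1252 = try_decode(raw, 'windows-1252')   # 0x93 becomes a real curly quote

print(repr(as_utf8))
print(unicodedata.category(as_latin1[5]))     # 'Cc' means "control character"
print(unicodedata.name(as_cp1252[5]))
```

Latin-1 decoding succeeding is not evidence that the data is Latin-1, since that codec accepts every byte value; the control-character category is the hint that the guess is wrong.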

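So, let's assume it is windows-1252. What now? str.decode converts bytes into Unicode, so that's the one you want. (Your first attempt, s.encode, goes the other way: to encode a byte string, Python 2 first has to turn it into Unicode, does so implicitly with the ASCII codec, and that is why an encode call raised a UnicodeDecodeError.) Your second example was on the right track, but it failed because the string wasn't UTF-8. Try: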

>>> uni = 'foo \x93bar bar \x94 weasel'.decode("windows-1252")
>>> uni
u'foo \u201cbar bar \u201d weasel'
>>> print uni
foo “bar bar ” weasel
>>> type(uni)
<type 'unicode'>

That's correct, because the opening curly quote is Unicode U+201C and the closing one is U+201D. Now that you have Unicode, you can serialize it to bytes in any encoding you choose (if you need to pass it across the wire) or just keep it as Unicode if it's staying within Python. If you want to convert to UTF-8, use the opposite function, unicode.encode:

>>> uni.encode("utf-8")
'foo \xe2\x80\x9cbar bar \xe2\x80\x9d weasel'

Curly quotes take 3 bytes each to encode in UTF-8. You could use UTF-16 and they'd only be two bytes each. You can't encode as ASCII or Latin-1 though, because those don't have curly quotes.
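The byte counts, and what happens with ASCII, can be checked with a short sketch like the one below. It assumes the input really is windows-1252 and uses Python 2.6+/3 syntax; the variable names are illustrative only. The last two lines show the lossy route, in case the goal really is to strip or replace the characters rather than keep them.

```python
# Decode once, assuming the source bytes are windows-1252.
text = b'foo \x93bar bar \x94 weasel'.decode('windows-1252')

utf8_bytes = text.encode('utf-8')

quote = u'\u201c'
utf8_len = len(quote.encode('utf-8'))        # bytes needed for one curly quote in UTF-8
utf16_len = len(quote.encode('utf-16-le'))   # explicit byte order, so no BOM is added
print(utf8_len)
print(utf16_len)

# A strict ASCII encode fails because ASCII has no curly quotes.
try:
    text.encode('ascii')
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False

# Lossy alternatives: drop the characters, or substitute '?'.
dropped = text.encode('ascii', 'ignore')
replaced = text.encode('ascii', 'replace')
print(repr(dropped))
print(repr(replaced))
```

Note that the error handler is applied at the encode step, on a unicode object. Passing 'ignore' to encode on the original byte string does not help, because the implicit ASCII decode fails before the handler is ever consulted.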

answered Sep 23 '22 11:09

jpsimons

Tags: python, character-encoding, unicode
