Removing non-ASCII characters from any given string type in Python

>>> teststring = 'aõ'
>>> type(teststring)
<type 'str'>
>>> teststring
'a\xf5'
>>> print teststring
aõ
>>> teststring.decode("ascii", "ignore")
u'a'
>>> teststring.decode("ascii", "ignore").encode("ascii")
'a'

This last result is what I really wanted to store internally once the non-ASCII characters are removed. Why did decode("ascii", "ignore") return a unicode string?

>>> teststringUni = u'aõ'
>>> type(teststringUni)
<type 'unicode'>
>>> print teststringUni
aõ
>>> teststringUni.decode("ascii" , "ignore")

Traceback (most recent call last):
  File "<pyshell#79>", line 1, in <module>
    teststringUni.decode("ascii" , "ignore")
UnicodeEncodeError: 'ascii' codec can't encode character u'\xf5' in position 1: ordinal not in range(128)
>>> teststringUni.decode("utf-8" , "ignore")

Traceback (most recent call last):
  File "<pyshell#81>", line 1, in <module>
    teststringUni.decode("utf-8" , "ignore")
  File "C:\Python27\lib\encodings\utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeEncodeError: 'ascii' codec can't encode character u'\xf5' in position 1: ordinal not in range(128)
>>> teststringUni.encode("ascii" , "ignore")
'a'

This is again what I wanted, but I don't understand this behavior. Can someone explain what is happening here?

Edit: I thought this would help me understand things so I could solve my real problem, which I describe here: Converting Unicode objects with non-ASCII symbols in them into strings objects (in Python)

fullmooninu asked Feb 27 '23 07:02


1 Answer

It's simple: .encode converts Unicode objects into strings, and .decode converts strings into Unicode.
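
This also accounts for the tracebacks in the question: in Python 2, calling .decode on a unicode object first implicitly encodes it with the default ASCII codec, and that hidden step is what raises UnicodeEncodeError. As a rough illustration, a helper that accepts either string type could look like the sketch below. The name strip_non_ascii is hypothetical (it is not from the original post), and the sketch assumes that silently dropping every non-ASCII byte or character is acceptable. It is written so that it should run on both Python 2 and Python 3.

```python
# -*- coding: utf-8 -*-

def strip_non_ascii(value):
    """Return `value` as a byte string with all non-ASCII content dropped.

    Hypothetical helper for illustration. Accepts either a byte string
    (`str` on Python 2, `bytes` on Python 3) or a Unicode string
    (`unicode` on Python 2, `str` on Python 3).
    """
    if isinstance(value, bytes):
        # Byte string: decode first, ignoring bytes outside the ASCII range.
        text = value.decode("ascii", "ignore")
    else:
        # Already Unicode: no decode step is needed (or safe on Python 2).
        text = value
    # Unicode -> byte string, ignoring characters that ASCII cannot represent.
    return text.encode("ascii", "ignore")


# Both inputs mirror the question: the letter 'a' followed by U+00F5 / byte 0xF5.
from_unicode = strip_non_ascii(u"a\xf5")
from_bytes = strip_non_ascii(b"a\xf5")
print(from_unicode == from_bytes == b"a")
```

The type check is the design point: decode is only ever applied to byte strings and encode only to Unicode strings, so the implicit ASCII conversion that tripped up the session in the question never happens.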

Ned Batchelder answered Mar 01 '23 10:03