
How to strip unicode in a list

Tags:

python

unicode

I want to strip the unicode strings in a list, for example airports = [u'KATL', u'KCID'].

Expected output:

[KATL,KCID]

I followed the link below:

Strip all the elements of a string list

and tried one of the solutions:

>>> my_list = ['this\n', 'is\n', 'a\n', 'list\n', 'of\n', 'words\n']
>>> map(str.strip, my_list)
['this', 'is', 'a', 'list', 'of', 'words']

I got the following error:

TypeError: descriptor 'strip' requires a 'str' object but received a 'unicode'

Hariom Singh asked Jul 27 '17 14:07


2 Answers

First, I strongly suggest you switch to Python 3, which treats Unicode strings as first-class citizens (all strings are Unicode strings, but they are called str).
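
As a minimal sketch of the Python 3 case (the airport codes come from the question, the trailing newlines and the names `airports` and `stripped` are added here for illustration), every string is already a Unicode `str`, so the plain `strip` method applies directly:

```python
# Python 3: every string literal is a Unicode str, so no u'' prefix is needed.
airports = ['KATL\n', 'KCID\n']

# str.strip removes leading and trailing whitespace, including the newline.
stripped = [code.strip() for code in airports]

print(stripped)
```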

But if you have to make it work in Python 2, you can strip unicode strings with unicode.strip (if your strings are true Unicode strings):

>>> lst = [u'KATL\n', u'KCID\n']
>>> map(unicode.strip, lst)
[u'KATL', u'KCID']

If your unicode strings are limited to the ASCII subset, you can convert them to str with:

>>> lst = [u'KATL', u'KCID']
>>> map(str, lst)
['KATL', 'KCID']

Note that this conversion will fail for non-ASCII strings (in Python 2, str() raises a UnicodeEncodeError). To encode Unicode code points as a str (a string of bytes), you have to choose an encoding (usually UTF-8) and call the .encode() method on your strings:

>>> lst = [u'KATL', u'KCID']
>>> map(lambda x: x.encode('utf-8'), lst)
['KATL', 'KCID']
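
To illustrate why the encoding has to be chosen explicitly, here is a small sketch written for Python 3, where .encode() returns a bytes object rather than a Python 2 str. The value 'Zürich' is not from the question; it is an assumed example of non-ASCII input, and the variable names are hypothetical.

```python
# Python 3 sketch; 'Z\u00fcrich' ('Zürich') is a made-up non-ASCII sample value.
names = ['KATL', 'Z\u00fcrich']

# UTF-8 can encode every Unicode code point, so this always succeeds.
utf8_encoded = [name.encode('utf-8') for name in names]

# ASCII cannot represent 'ü', so encoding raises UnicodeEncodeError.
try:
    ascii_encoded = [name.encode('ascii') for name in names]
except UnicodeEncodeError:
    ascii_encoded = None  # the conversion failed on the non-ASCII string

print(utf8_encoded)
print(ascii_encoded)
```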
randomir answered Oct 15 '22 12:10


The only reliable way to convert a unicode string to a byte string is to encode it with a suitable encoding (ASCII, Latin-1 and UTF-8 are the most common ones). By definition, UTF-8 can encode any Unicode character, but the resulting byte string will contain non-ASCII bytes, and its size in bytes will no longer equal the number of (Unicode) characters. Latin-1 can represent the characters of most Western European languages with one byte per character, and ASCII is the subset of characters that all of these encodings represent identically.
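
The size difference can be checked with a short sketch. It is written for Python 3, and the sample word is an assumption chosen only because it contains one non-ASCII character:

```python
# Python 3 sketch; the sample word is an assumption, not taken from the question.
word = 'caf\u00e9'  # 'café': four characters, the last one non-ASCII

utf8_bytes = word.encode('utf-8')      # 'é' becomes two bytes in UTF-8
latin1_bytes = word.encode('latin-1')  # 'é' is a single byte in Latin-1

print(len(word), len(utf8_bytes), len(latin1_bytes))
```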

If you want to be able to process strings containing characters that are not representable in the chosen charset, you can pass errors='ignore' to simply drop them, or errors='replace' to substitute a replacement character, usually ?.

So if I have understood your requirement correctly, you could translate the list of unicode strings into a list of byte strings with:

[ x.encode('ascii', errors='replace') for x in my_list ]
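
For comparison, here is a rough sketch of how the two error handlers behave. It assumes Python 3 (so the results are bytes objects) and uses made-up sample values, one of which contains a non-ASCII character:

```python
# Python 3 sketch; my_list holds assumed sample values, one with a non-ASCII 'é'.
my_list = ['KATL', 'caf\u00e9']

# errors='ignore' silently drops characters the codec cannot represent.
ignored = [x.encode('ascii', errors='ignore') for x in my_list]

# errors='replace' substitutes '?' for each unrepresentable character.
replaced = [x.encode('ascii', errors='replace') for x in my_list]

print(ignored)
print(replaced)
```

Pure ASCII values such as the airport codes pass through unchanged under either handler.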
Serge Ballesta answered Oct 15 '22 14:10