How to use split with UTF-8 encoding?

Question

I'm using the built-in split function and I have a problem:

>>> data = "test, ąśżźć, test2"
>>> splitted_data = data.split(",")
>>> print splitted_data
['test', ' \xc4\x85\xc5\x9b\xc5\xbc\xc5\xba\xc4\x87', ' test2']

Why does this happen? What should I do to prevent it?

Python 2.7.1

Chris Morgan · Accepted Answer

That's purely the output of str.__repr__ (what you get by calling repr() on a string). The \xc4 and similar escapes show the UTF-8 bytes the string actually contains. When you print the string, the text is unchanged:

>>> data = "test, ąśżźć, test2"
>>> data
'test, \xc4\x85\xc5\x9b\xc5\xbc\xc5\xba\xc4\x87, test2'
>>> print data
test, ąśżźć, test2

list.__str__ and list.__repr__ both use the repr of each item, which is why printing the list shows the escapes. If you access an item inside the list, it's still correct:

>>> splitted_data = data.split(",")
>>> splitted_data
['test', ' \xc4\x85\xc5\x9b\xc5\xbc\xc5\xba\xc4\x87', ' test2']
>>> print splitted_data[1]
 ąśżźć
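
The same point as a small script, offered as a sketch rather than part of the original answer. It is written to run on both Python 2.7 and Python 3 (an assumption, since the question targets 2.7.1), and the variable names are illustrative. The UTF-8 bytes are built explicitly so the result does not depend on the terminal or source-file encoding.

```python
# -*- coding: utf-8 -*-
# Build the byte string explicitly; in Python 2 a plain "..." literal typed
# in a UTF-8 terminal holds these same bytes.
data = u"test, ąśżźć, test2".encode("utf-8")

parts = data.split(b",")

# repr() of a list uses the repr of each item, hence the \x escapes
list_repr = repr(parts)

# The item itself still holds the original UTF-8 bytes, untouched by split
second = parts[1]
decoded = second.decode("utf-8")

print(list_repr)
print(decoded == u" ąśżźć")
```

The escapes appear only in the list's repr; the bytes in `second` decode back to the original text.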

Cat Plus Plus · Answer

While your snippet works (escapes are just how repr works), you shouldn't treat bytestrings as text. Decode first, operate later.

data       = u"test, ąśżźć, test2" # or "test, ąśżźć, test2".decode('utf-8')
split_data = data.split(u",")
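
To illustrate why decoding first matters, here is a hedged sketch that runs on Python 2.7 and Python 3. The names `raw`, `fields`, and the others are illustrative, and the input is assumed to be UTF-8. On the byte string, len() counts bytes; after decoding, it counts characters.

```python
# -*- coding: utf-8 -*-
# Assumption: the incoming bytes are UTF-8 encoded
raw = u"test, ąśżźć, test2".encode("utf-8")

# Byte string: each of the five Polish letters occupies two bytes
byte_field = raw.split(b",")[1].strip()
byte_len = len(byte_field)      # counts bytes

# Decode first, then operate on text
text = raw.decode("utf-8")
fields = [field.strip() for field in text.split(u",")]
char_len = len(fields[1])       # counts characters

# Encode again only at the boundary (file, socket, ...) if bytes are needed
encoded_again = fields[1].encode("utf-8")

print(byte_len, char_len)
```

Splitting on a comma happens to work on the raw bytes because "," is a single ASCII byte in UTF-8, but length, slicing, and case operations only behave as expected on the decoded text.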

Tags:

python

Nips
