UnicodeDecodeError on join

Question

I have a list with some strings (most of which I fetched from a sqlite3 database):

stats_list = ['Statistik \xc3\xb6ver s\xc3\xa5nger\n', 'Antal\tS\xc3\xa5ng', '1\tCarola - Betlehems Stj\xc3\xa4rna', '\n\nStatistik \xc3\xb6ver datak\xc3\xa4llor\n', 'K\xc3\xa4lla\tAntal', 'MANUAL\t1', '\n\nStatistik \xc3\xb6ver \xc3\xb6nskare\n', 'Antal\tId', u'1\tNiclas']

When I try to join it with:

return '\n'.join(stats_list)

I get this error:

UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 10: ordinal not in range(128)

Is it possible to get any clue why this happens just by looking at the list? If I loop over the list and print it to screen, I get this:

Statistik över sånger

Antal   Sång 
1   Carola - Betlehems Stjärna


Statistik över datakällor

Källa   Antal 
MANUAL  1


Statistik över önskare

Antal   Id
1   Niclas

which is exactly what I expected, and no error is shown. (The special characters are Swedish.)

EDIT:

I tried this:

return '\n'.join(i.decode('utf8') for i in stats_list)

But it returned:

Traceback (most recent call last):
  File "./CyberJukebox.py", line 489, in on_stats_to_clipboard
    stats = self.jbox.get_stats()
  File "/home/nine/dev/python/CyberJukebox/jukebox.py", line 235, in get_stats
    return self._stats.get_string()
  File "/home/nine/dev/python/CyberJukebox/jukebox.py", line 59, in get_string
    return '\n'.join(i.decode('utf8') for i in stats_list)
  File "/home/nine/dev/python/CyberJukebox/jukebox.py", line 59, in <genexpr>
    return '\n'.join(i.decode('utf8') for i in stats_list)
  File "/usr/lib/python2.7/encodings/utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeEncodeError: 'ascii' codec can't encode character u'\xf6' in position 10: ordinal not in range(128)

EDIT 2:

The suggested solution works for me in the interpreter. But when I execute the code it won't work. I can't wrap my head around this. Maybe it's something obvious I'm missing so I'm pasting the whole method here:

 def get_string(self):
     stats_list = [u'Statistik över sånger\n', u'Antal\tSång']
     stats = sorted([(v, k) for k, v in self.song_stats.iteritems()], reverse=True)
     for row in stats:
         line = '%s\t%s' % row
         stats_list.append(line)

     stats_list.append(u'\n\nStatistik över datakällor\n')
     stats_list.append(u'Källa\tAntal')
     stats = sorted([(k, v) for k, v in self.exts_stats.iteritems()])
     for row in stats:
         line = '%s\t%s' % row
         stats_list.append(line)

     stats_list.append(u'\n\nStatistik över önskare\n')
     stats_list.append(u'Antal\tId')
     stats = sorted([(v, k) for k, v in self.wisher_stats.iteritems() if k != ''], reverse=True)
     for row in stats:
         line = '%s\t%s' % row
         stats_list.append(line)

     return '\n'.join(i.decode('utf8') for i in stats_list)

song_stats, exts_stats and wisher_stats are dictionaries in the class.

ekhumoro · Accepted Answer

Your problem is probably that you are mixing unicode strings with byte strings.
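One way to spot such a mix is to print the type of every element. The following is only a minimal sketch: describe_types is a hypothetical helper name, and the list is shortened from the one in the question. The b'' prefix is used so the snippet reads the same on Python 2 and 3; on Python 2 it is equivalent to a plain '' literal.

```python
# Sketch: report which list elements are byte strings and which are text.
# `describe_types` is a hypothetical helper, not part of the original code.

def describe_types(items):
    """Return a list of (index, type name) pairs, one per element."""
    return [(index, type(item).__name__) for index, item in enumerate(items)]

# Shortened version of the list from the question.
sample = [b'Statistik \xc3\xb6ver s\xc3\xa5nger\n', b'Antal\tId', u'1\tNiclas']

for index, name in describe_types(sample):
    print('%d %s' % (index, name))
```

On Python 2 the names are str and unicode; on Python 3 they are bytes and str. Either way, the last element stands out as the only text string among byte strings.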

The code in "Edit 2" has several unicode strings being added to stats_list:

stats_list = [u'Statistik över sånger\n', u'Antal\tSång']

If you try to decode these unicode strings, you will get a UnicodeEncodeError. This is because Python 2 will first implicitly encode the unicode string with the default encoding (usually "ascii") to obtain a byte string, and only then try to decode it. It only ever makes sense to decode byte strings.
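The hidden encode step can be written out explicitly. This is a sketch of the behaviour described above, not the interpreter's actual code path; the two steps are spelled out so the snippet behaves the same on Python 2 and 3.

```python
# Sketch of what Python 2 does implicitly for u'...'.decode('utf8'):
# encode with the default codec (ASCII) first, then decode the bytes.

text = u'Statistik \xf6ver s\xe5nger'   # already a unicode string

result = None
error = None
try:
    as_bytes = text.encode('ascii')      # the hidden first step
    result = as_bytes.decode('utf-8')    # the step that was actually requested
except UnicodeEncodeError as exc:
    error = exc
    print('encode step failed: %s' % exc)
```

The failure is reported at position 10, the index of the first non-ASCII character, which matches the traceback in the question.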

So to start with, change the final line in the function to:

return '\n'.join(stats_list)

Now you need to check whether any of the other strings that get added to stats_list are byte strings, and ensure they get decoded to unicode strings properly first.

So put print type(line) after the three lines like this:

line = '%s\t%s' % row

and then wherever it prints <type 'str'>, change the following line to:

stats_list.append(line.decode('utf-8'))

Of course, if it prints <type 'unicode'>, there's no need to change the following line.
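If checking each spot by hand is inconvenient, the same rule (decode byte strings, leave unicode strings alone) can be applied at join time. A minimal sketch, assuming the byte strings are UTF-8 as in the question; to_unicode is a hypothetical helper name.

```python
# Sketch: decode only the byte strings, pass unicode strings through.
# `to_unicode` is a hypothetical helper; UTF-8 is an assumption.

def to_unicode(value, encoding='utf-8'):
    """Return value as a unicode string, decoding it only if it is bytes."""
    if isinstance(value, bytes):      # `bytes` is an alias of `str` on Python 2
        return value.decode(encoding)
    return value

# Mixed list: unicode header, byte-string row, unicode row.
stats_list = [
    u'Statistik \xf6ver s\xe5nger\n',
    b'1\tCarola - Betlehems Stj\xc3\xa4rna',
    u'1\tNiclas',
]

joined = u'\n'.join(to_unicode(item) for item in stats_list)
```

Because nothing is decoded twice, this avoids both the UnicodeDecodeError from the plain join and the UnicodeEncodeError from calling decode on every element.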

An even better solution here would be to check how the dictionaries song_stats, exts_stats and wisher_stats are created, and make sure they always contain unicode strings (or byte strings that only contain ascii characters).
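As a rough illustration of normalising at the source, the keys can be decoded once, before any formatting happens. The dictionary contents below are assumptions (a song title mapped to a count), and normalize_keys is a hypothetical helper name.

```python
# Sketch: decode byte-string keys once so later formatting yields unicode.
# `normalize_keys` and the sample data are illustrative assumptions.

def normalize_keys(stats, encoding='utf-8'):
    """Return a copy of stats whose byte-string keys are decoded to unicode."""
    normalized = {}
    for key, count in stats.items():
        if isinstance(key, bytes):
            key = key.decode(encoding)
        normalized[key] = count
    return normalized

song_stats = {b'Carola - Betlehems Stj\xc3\xa4rna': 1}
clean = normalize_keys(song_stats)

# With unicode keys and a unicode format string, each line is unicode too.
lines = [u'%s\t%s' % (count, title) for title, count in clean.items()]
```

With the data normalised up front, the final join needs no decode call at all.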

kennytm · Answer

The strings are encoded in UTF-8. You need to .decode them to unicode:

>>> 'Statistik \xc3\xb6ver s\xc3\xa5nger\n'.decode('utf-8')
u'Statistik \xf6ver s\xe5nger\n'
>>> print _
Statistik över sånger

Use a generator expression to apply this to all elements:

return '\n'.join(x.decode('utf-8') for x in stats_list)

Jack Edmonds · Answer

Python is complaining that it can't decode the byte string 'Statistik \xc3\xb6ver s\xc3\xa5nger\n' as ASCII. Try prefixing all your strings with u so that they are unicode strings. Note that the escapes in a unicode literal are code points (\xf6 for ö), not UTF-8 bytes (\xc3\xb6):

stats_list = [u'Statistik \xf6ver s\xe5nger\n', u'Antal\tS\xe5ng', u'1\tCarola - Betlehems Stj\xe4rna', u'\n\nStatistik \xf6ver datak\xe4llor\n', u'K\xe4lla\tAntal', u'MANUAL\t1', u'\n\nStatistik \xf6ver \xf6nskare\n', u'Antal\tId', u'1\tNiclas']
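When adding the u prefix by hand, the escape style has to match the kind of literal. The short check below is not from the original answer; it compares the UTF-8 bytes for ö with the same escapes placed in a unicode literal.

```python
# Sketch: the same escapes mean different things in byte and unicode literals.

raw = b'\xc3\xb6'                # two bytes: the UTF-8 encoding of o-umlaut
decoded = raw.decode('utf-8')    # one character, code point U+00F6
mojibake = u'\xc3\xb6'           # two characters: U+00C3 followed by U+00B6

print('%d %d %d' % (len(raw), len(decoded), len(mojibake)))
```

Only the decoded value is the intended character. Copying \xc3\xb6 into a u'' literal unchanged produces two unrelated characters instead.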

Tags: python, character-encoding, unicode

Niclas Nilsson
