
UnicodeDecodeError on join

I have a list with some strings (most of which I fetched from a sqlite3 database):

stats_list = ['Statistik \xc3\xb6ver s\xc3\xa5nger\n', 'Antal\tS\xc3\xa5ng', '1\tCarola - Betlehems Stj\xc3\xa4rna', '\n\nStatistik \xc3\xb6ver datak\xc3\xa4llor\n', 'K\xc3\xa4lla\tAntal', 'MANUAL\t1', '\n\nStatistik \xc3\xb6ver \xc3\xb6nskare\n', 'Antal\tId', u'1\tNiclas']

When I try to join it with:

return '\n'.join(stats_list)

I get this error:

UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 10: ordinal not in range(128)

Is it possible to get any clue why this happens just by looking at the list? If I loop over the list and print it to screen, I get this:

Statistik över sånger

Antal   Sång 
1   Carola - Betlehems Stjärna


Statistik över datakällor

Källa   Antal 
MANUAL  1


Statistik över önskare

Antal   Id
1   Niclas

which is exactly what I was expecting, and no error is shown. (The special characters are Swedish.)

EDIT:

I tried this:

   return '\n'.join(i.decode('utf8') for i in stats_list)

But it returned:

Traceback (most recent call last):
  File "./CyberJukebox.py", line 489, in on_stats_to_clipboard
    stats = self.jbox.get_stats()
  File "/home/nine/dev/python/CyberJukebox/jukebox.py", line 235, in get_stats
    return self._stats.get_string()
  File "/home/nine/dev/python/CyberJukebox/jukebox.py", line 59, in get_string
    return '\n'.join(i.decode('utf8') for i in stats_list)
  File "/home/nine/dev/python/CyberJukebox/jukebox.py", line 59, in <genexpr>
    return '\n'.join(i.decode('utf8') for i in stats_list)
  File "/usr/lib/python2.7/encodings/utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeEncodeError: 'ascii' codec can't encode character u'\xf6' in position 10: ordinal not in range(128)

EDIT 2:

The suggested solution works for me in the interpreter. But when I execute the code it won't work. I can't wrap my head around this. Maybe it's something obvious I'm missing so I'm pasting the whole method here:

 def get_string(self):
     stats_list = [u'Statistik över sånger\n', u'Antal\tSång']
     stats = sorted([(v, k) for k, v in self.song_stats.iteritems()], reverse=True)
     for row in stats:
         line = '%s\t%s' % row
         stats_list.append(line)

     stats_list.append(u'\n\nStatistik över datakällor\n')
     stats_list.append(u'Källa\tAntal')
     stats = sorted([(k, v) for k, v in self.exts_stats.iteritems()])
     for row in stats:
         line = '%s\t%s' % row
         stats_list.append(line)

     stats_list.append(u'\n\nStatistik över önskare\n')
     stats_list.append(u'Antal\tId')
     stats = sorted([(v, k) for k, v in self.wisher_stats.iteritems() if k != ''], reverse=True)
     for row in stats:
         line = '%s\t%s' % row
         stats_list.append(line)

     return '\n'.join(i.decode('utf8') for i in stats_list)

song_stats, exts_stats and wisher_stats are dictionaries in the class.

Niclas Nilsson asked Dec 12 '11 21:12


3 Answers

Your problem is probably that you are mixing unicode strings with byte strings.
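
To see why mixing the two types fails: in Python 2, joining a list that contains even one unicode string makes Python promote every byte string to unicode with the default ASCII codec. The sketch below spells out that implicit step as an explicit call, so the failure can be seen on its own. The variable names are illustrative, and the `b` prefix is used only so the snippet behaves the same on Python 2.7 and 3.

```python
# First element of stats_list, written as an explicit byte string.
raw = b'Statistik \xc3\xb6ver s\xc3\xa5nger\n'

failed_at = None
try:
    # Roughly what Python 2 does implicitly when this byte string
    # is joined with a unicode string.
    raw.decode('ascii')
except UnicodeDecodeError as exc:
    # Index of the first byte that is not valid ASCII.
    failed_at = exc.start
```

Here `failed_at` ends up as 10, the same position reported in the error message from the question: the first byte of the UTF-8 encoded "ö".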

The code in "Edit 2" has several unicode strings being added to stats_list:

stats_list = [u'Statistik över sånger\n', u'Antal\tSång']

If you try to decode these unicode strings, you will get a UnicodeEncodeError. This is because Python will first try to encode the strings with the default encoding (usually "ascii") before trying to decode them. It only ever makes sense to decode byte strings.
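
The hidden encoding step can be made visible with an explicit call. This is only a sketch of the behaviour described above, not the interpreter's actual code path; the variable names are illustrative.

```python
text = u'Statistik \xf6ver s\xe5nger\n'

bad_pos = None
bad_char = None
try:
    # Python 2 performs an encode like this behind the scenes
    # when .decode() is called on a unicode string.
    text.encode('ascii')
except UnicodeEncodeError as exc:
    bad_pos = exc.start          # index of the first non-ASCII character
    bad_char = text[exc.start]   # the character that could not be encoded
```

The failing character is u'\xf6' at position 10, which matches the UnicodeEncodeError in the question's traceback.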

So to start with, change the final line in the function to:

return '\n'.join(stats_list)

Now you need to check whether any of the other strings that get added to stats_list are byte strings, and ensure they get decoded to unicode strings properly first.

So put print type(line) after each of the three lines that look like this:

line = '%s\t%s' % row

and then wherever it prints <type 'str'>, change the following line to:

stats_list.append(line.decode('utf-8'))

Of course, if it prints <type 'unicode'>, there's no need to change the following line.
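
Rather than checking each line by hand, one option is a small helper that decodes byte strings and passes everything else through. This is a sketch rather than the original code: `to_unicode` and `format_row` are hypothetical names, the sample rows are made up from the data in the question, and UTF-8 is assumed for any byte strings.

```python
def to_unicode(value, encoding='utf-8'):
    """Return value as a unicode string, decoding only byte strings."""
    if isinstance(value, bytes):        # bytes is an alias of str on Python 2
        return value.decode(encoding)
    if isinstance(value, type(u'')):    # already unicode, leave untouched
        return value
    # Anything else (such as the integer counts) is formatted as text.
    return u'%s' % (value,)


def format_row(row):
    """Build one tab-separated unicode line from a tuple of fields."""
    return u'\t'.join(to_unicode(field) for field in row)


# Assumed sample data: one byte-string title and one unicode name.
rows = [(1, b'Carola - Betlehems Stj\xc3\xa4rna'), (1, u'Niclas')]
lines = [format_row(row) for row in rows]
result = u'\n'.join(lines)
```

Because every field is converted before it is formatted, the join only ever sees unicode strings, and no implicit ASCII conversion takes place.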

An even better solution here would be to check how the dictionaries song_stats, exts_stats and wisher_stats are created, and make sure they always contain unicode strings (or byte strings that only contain ascii characters).

ekhumoro answered Nov 12 '22 06:11


The strings are encoded in UTF-8. You need to .decode them to unicode strings:

>>> 'Statistik \xc3\xb6ver s\xc3\xa5nger\n'.decode('utf-8')
u'Statistik \xf6ver s\xe5nger\n'
>>> print _
Statistik över sånger

Use a generator expression to apply this to all elements:

return '\n'.join(x.decode('utf-8') for x in stats_list)
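
The list in the question also contains an element that is already unicode (u'1\tNiclas'). Decoding every element unconditionally only works on Python 2 while such elements happen to be pure ASCII; a unicode element containing "ö" would raise a UnicodeEncodeError. A hedged variant that decodes only the byte strings, assuming they are all UTF-8, might look like this (the list is shortened for illustration):

```python
stats_list = [
    b'Statistik \xc3\xb6ver s\xc3\xa5nger\n',
    b'Antal\tS\xc3\xa5ng',
    u'1\tNiclas',  # already unicode, so it is passed through unchanged
]

# Decode byte strings only; bytes is an alias of str on Python 2.
joined = u'\n'.join(
    x.decode('utf-8') if isinstance(x, bytes) else x
    for x in stats_list
)
```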
kennytm answered Nov 12 '22 07:11


Python is complaining that it can't decode the byte string 'Statistik \xc3\xb6ver s\xc3\xa5nger\n' as ASCII, which it attempts because the list also contains a unicode string. Try making all your strings unicode by prefixing them with u, writing the non-ASCII characters as code points rather than as UTF-8 bytes:

stats_list = [u'Statistik \xf6ver s\xe5nger\n', u'Antal\tS\xe5ng', u'1\tCarola - Betlehems Stj\xe4rna', u'\n\nStatistik \xf6ver datak\xe4llor\n', u'K\xe4lla\tAntal', u'MANUAL\t1', u'\n\nStatistik \xf6ver \xf6nskare\n', u'Antal\tId', u'1\tNiclas']
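
One caveat with the u prefix: it does not decode anything. Inside a unicode literal, \xc3\xb6 names two separate code points ("Ã" and "¶") rather than the UTF-8 bytes for "ö", so escapes copied from a byte string must either be decoded or rewritten as code points. A short sketch of the difference, with illustrative variable names:

```python
# u prefix on UTF-8 byte escapes: two code points, then 'ver' (mojibake).
prefixed = u'\xc3\xb6ver'

# The same bytes decoded as UTF-8: a single code point for "ö", then 'ver'.
decoded = b'\xc3\xb6ver'.decode('utf-8')

same = (prefixed == decoded)
```

Here `same` is False: `prefixed` has five characters while `decoded` has four and equals u'\xf6ver'.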
Jack Edmonds answered Nov 12 '22 07:11