
How can I write special characters to a CSV in Python?

Tags:

python

When trying to write data to a CSV in Python, I receive the following error.

File "/System/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/csv.py", line 150, in writerows
UnicodeEncodeError: 'ascii' codec can't encode character u'\xd3' in position 0: ordinal not in range(128)

Here is an example of a dictionary I'm trying to write to the CSV:

{'Field1': 'Blah \xc3\x93 D\xc3\xa1blah', 'Field2': u'\xd3', 'Field3': u'Blah', 'Field4': u'D\xe1blah'}

I know that you can't write unicode to a CSV with Python, but I'm having trouble figuring out what to convert to and how to convert it.

Edit: This is what I've tried. dictList is a list of dictionaries taken from another CSV.

WANTED_HEADERS = ['First Name',
                  'Last Name',
                  'Date',
                  'ID']

def utf8ify(d):
  return dict((str(k).encode('utf-8'), str(v).encode('utf-8')) for k, v in d.iteritems())

import csv

def ListToCSVWithHeaders(data_list, output_file_name, headers):
  output_file = open(output_file_name, 'w')
  header_row = {}
  to_append = []
  for entry in data_list:
    to_append.append(utf8ify(entry))
    for key in entry.keys():
      if key not in headers:
        headers.append(key)
        print 'KEY APPENDED: ' + key
  for header in headers:
    header_row[header] = header
  data = [header_row]
  data.extend(to_append)
  data_writer = csv.DictWriter(output_file, headers)
  data_writer.writerows(data)
  print str(len(data)) + ' rows written'

ListToCSVWithHeaders(dictList, 'output.csv', WANTED_HEADERS)

This is the error I receive when running.

UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 7: ordinal not in range(128)
JStew asked Sep 16 '25

1 Answer

You can't write Unicode to a CSV… but you can write bytes that happen to be a UTF-8 (or Latin-1, or almost any other encoding*) encoding of Unicode text. The docs say this explicitly, and suggest how to deal with it:

Note: This version of the csv module doesn’t support Unicode input. Also, there are currently some issues regarding ASCII NUL characters. Accordingly, all input should be UTF-8 or printable ASCII to be safe; see the examples in section Examples. These restrictions will be removed in the future.

The Examples section shows how to deal with this, providing wrappers that let you read and write unicode objects, encoding/decoding UTF-8 automatically for you. If you're using a different charset (e.g., because you're planning to pass this to an Excel VBScript that requires a cp1252-encoded CSV), just replace 'utf-8' as appropriate.
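Stripped of the file-handling machinery, what those wrapper classes boil down to is encoding every text cell to the target charset before it reaches the underlying writer. A minimal sketch of just that step (the `encode_row` name is mine, not from the csv docs; written so it also runs under Python 3):

```python
# Sketch of the core of the docs' UnicodeWriter idea: encode each text
# cell to bytes in the target charset before handing the row to csv.
# encode_row is an illustrative name, not part of the csv module.
def encode_row(row, encoding='utf-8'):
    return [cell.encode(encoding) for cell in row]

row = [u'\xd3', u'D\xe1blah']
encoded = encode_row(row)  # [b'\xc3\x93', b'D\xc3\xa1blah']
```

Swapping `'utf-8'` for `'cp1252'` or `'latin-1'` in the call is all the "different charset" case requires at this level.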


The example code does some fancy footwork to make sure that the csv module itself only has to deal with UTF-8, while the file can be in a different codec. That's a great way to deal with codecs that may confuse the csv module. But it looks like you're just looking for Latin-1 (or a Latin-1-extending charset like cp1252), or maybe even UTF-8 itself, in which case you can use a quick&dirty solution. Instead of:

w.writerows(mydata)

… you can just do something hacky like this:

def utf8ify(d):
    return dict((k.encode('utf-8'), v.encode('utf-8')) for k, v in d.iteritems())

w.writerows(utf8ify(d) for d in mydata)

Depending on the values you're trying to write, you may need to change the above. For example, if you have Latin-1 strings in the original dict, you will want something like:

k.decode('latin-1').encode('utf-8'), …
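That decode/encode pair is just a byte-level transcode. For example (bytes literals written with the `b` prefix, which Python 2.6+ also accepts):

```python
# 0xD3 is 'Ó' in Latin-1; transcoding it to UTF-8 yields the
# two-byte sequence 0xC3 0x93 for the same character.
latin1_bytes = b'\xd3'
utf8_bytes = latin1_bytes.decode('latin-1').encode('utf-8')  # b'\xc3\x93'
```

The round trip is lossless: decoding `utf8_bytes` as UTF-8 and re-encoding as Latin-1 gives back the original byte.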

If you don't know the kind of thing you're trying to write… well, you can't do the quick&dirty solution.


In your edited version, you're using the quick&dirty solution this way:

def utf8ify(d):
  return dict((str(k).encode('utf-8'), str(v).encode('utf-8')) for k, v in d.iteritems())

… and the values you're passing appear to be a mix of unicode strings like u'\xd3' and what I think are UTF-8 encoded str byte strings like 'Blah \xc3\x93 D\xc3\xa1blah'. There may also be some numbers or something in there, or maybe you're just being careful.

Anyway, that isn't going to work. The UTF-8 encoded strings will pass through str unchanged, then .encode('utf-8') will implicitly decode them with sys.getdefaultencoding() before re-encoding, while the Unicode strings will encode with the default encoding inside str(), then decode and re-encode with UTF-8. Since the default encoding is ASCII, both paths blow up on the first non-ASCII character — the str path with exactly the UnicodeDecodeError you're seeing.
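You can see the failing half of that round trip directly: asking ASCII to decode a UTF-8 byte raises the same error class as in the question (`b''` literal syntax used so the bytes are unambiguous):

```python
# 'Blah \xc3\x93' contains the UTF-8 bytes for 'Ó'; decoding it as
# ASCII (the Python 2 default encoding) raises UnicodeDecodeError,
# because 0xC3 is outside range(128).
try:
    b'Blah \xc3\x93'.decode('ascii')
    decode_failed = False
except UnicodeDecodeError:
    decode_failed = True
```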

If this is your actual data, the code will be something like this:

def utf8ify_s(s):
    if isinstance(s, unicode):
        return s.encode('utf-8')
    else:
        return str(s)

That will encode unicode strings, assume str strings are already in UTF-8 and pass them through str (which leaves them unchanged), and turn numbers etc. into strings by calling str (which is fine for any built-in type, and fine for custom types too as long as their str output is pure ASCII or UTF-8). Then, instead of calling str(…).encode('utf-8') on each k and v, call this function:

def utf8ify(d):
    return dict((utf8ify_s(k), utf8ify_s(v)) for k, v in d.iteritems())
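Here is a self-contained restatement of those two helpers with two hypothetical compatibility shims (`text_type` and `.items()`) so the sketch also runs under Python 3 for checking; the Python 2 original uses `unicode` and `iteritems()` directly:

```python
# text_type is an illustrative shim, not part of either helper as
# written above; on Python 2 it is unicode, on Python 3 it is str.
try:
    text_type = unicode          # Python 2
except NameError:
    text_type = str              # Python 3

def utf8ify_s(s):
    # Encode real text; pass everything else through str().
    if isinstance(s, text_type):
        return s.encode('utf-8')
    return str(s)

def utf8ify(d):
    return dict((utf8ify_s(k), utf8ify_s(v)) for k, v in d.items())

row = utf8ify({u'Field2': u'\xd3', u'Count': 3})
```

Running `utf8ify` over each dict in your list before handing it to DictWriter is the whole fix at this level.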

Meanwhile, I would strongly encourage you to read through the Unicode HOWTO, and anything else you need, to understand what's actually going on here, instead of just trying to hack on your code until it seems to work.


* The actual rules are something like this: no embedded NUL bytes (so UTF-16 is out), no persistent state that crosses multiple lines (so some East Asian encodings are out), and no partial-character bytes that can collide with your delimiter or quote characters' bytes. If you're not sure… use the fancy converters and go through UTF-8.
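The NUL-byte problem is easy to demonstrate: even pure-ASCII text picks up 0x00 bytes when encoded as UTF-16, since every code point occupies at least two bytes there.

```python
# Each ASCII character in UTF-16-LE carries a trailing 0x00 byte,
# which is exactly the embedded NUL the csv module can't handle.
utf16 = u'Blah'.encode('utf-16-le')  # b'B\x00l\x00a\x00h\x00'
```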

abarnert answered Sep 19 '25