I am writing a Python utility that needs to parse a large, regularly updated CSV file that I don't control. The utility must run on a server with only Python 2.4 available. The CSV file does not quote field values at all, but the Python 2.4 version of the csv library does not seem to give me any way to turn off quoting; it only lets me set the quote character (dialect.quotechar = '"' or whatever). If I try setting the quote character to None or the empty string, I get an error.
I can sort of work around this by setting dialect.quotechar to some "rare" character, but this is brittle, as there is no ASCII character I can absolutely guarantee will not show up in field values (except the delimiter, but if I set dialect.quotechar = dialect.delimiter, things go predictably haywire).
In Python 2.5 and later, if I set dialect.quoting to csv.QUOTE_NONE, the CSV reader respects that and does not interpret any character as a quote character. Is there any way to duplicate this behavior in Python 2.4?
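For reference, the Python 2.5+ behavior described above can be sketched as follows. This is a minimal illustration rather than a fix for 2.4: the sample data is made up, and the import fallback is only there so the snippet also runs on Python 3.

```python
import csv

try:
    from StringIO import StringIO  # Python 2
except ImportError:
    from io import StringIO  # Python 3

# The first row ends with a lone double quote, which default quoting rejects.
data = '1,2,3,4,"5\n1,2,3,4,5\n'

# With QUOTE_NONE the reader treats '"' as ordinary field data.
reader = csv.reader(StringIO(data), quoting=csv.QUOTE_NONE)
rows = list(reader)
# rows == [['1', '2', '3', '4', '"5'], ['1', '2', '3', '4', '5']]
```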
UPDATE: Thanks Triptych and Mark Roddy for helping to narrow the problem down. Here's a simplest-case demonstration:
>>> import csv
>>> import StringIO
>>> data = """
... 1,2,3,4,"5
... 1,2,3,4,5
... """
>>> reader = csv.reader(StringIO.StringIO(data))
>>> for i in reader: print i
...
[]
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
_csv.Error: newline inside string
The problem only occurs when there's a single double-quote character in the final column of a row. Unfortunately, this situation exists in my dataset. I've accepted Tanj's solution: manually assign a nonprinting character ("\x07" or BEL) as the quotechar. This is hacky, but it works, and I haven't yet seen another solution that does. Here's a demo of the solution in action:
>>> import csv
>>> import StringIO
>>> class MyDialect(csv.Dialect):
...     quotechar = '\x07'
...     delimiter = ','
...     lineterminator = '\n'
...     doublequote = False
...     skipinitialspace = False
...     quoting = csv.QUOTE_NONE
...     escapechar = '\\'
...
>>> dialect = MyDialect()
>>> data = """
... 1,2,3,4,"5
... 1,2,3,4,5
... """
>>> reader = csv.reader(StringIO.StringIO(data), dialect=dialect)
>>> for i in reader: print i
...
[]
['1', '2', '3', '4', '"5']
['1', '2', '3', '4', '5']
In Python 2.5+ setting quoting to csv.QUOTE_NONE would be sufficient, and the value of quotechar would then be irrelevant. (I'm actually getting my initial dialect via a csv.Sniffer and then overriding the quotechar value, not by subclassing csv.Dialect, but I don't want that to be a distraction from the real issue; the above two sessions demonstrate that Sniffer isn't the problem.)
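Because the BEL workaround silently depends on BEL never appearing in the data, one way to make it less brittle is to check each line before the reader sees it. The sketch below is an assumption-laden illustration, not part of the accepted answer: guard_lines and read_unquoted are hypothetical names, the dialect values are copied from the demo above, and it has not been tested on Python 2.4 itself.

```python
import csv

try:
    from StringIO import StringIO  # Python 2
except ImportError:
    from io import StringIO  # Python 3

SENTINEL = '\x07'  # BEL, the placeholder quotechar from the demo


def guard_lines(lines):
    """Yield lines unchanged, but fail loudly if one contains SENTINEL."""
    for index, line in enumerate(lines):
        if SENTINEL in line:
            raise ValueError(
                'line %d contains the placeholder quotechar' % (index + 1))
        yield line


def read_unquoted(lines):
    """Hypothetical helper: a csv reader using the demo's dialect settings."""
    return csv.reader(
        guard_lines(lines),
        delimiter=',',
        quotechar=SENTINEL,
        escapechar='\\',
        doublequote=False,
        skipinitialspace=False,
        lineterminator='\n',
        quoting=csv.QUOTE_NONE,
    )


rows = list(read_unquoted(StringIO('1,2,3,4,"5\n1,2,3,4,5\n')))
# rows == [['1', '2', '3', '4', '"5'], ['1', '2', '3', '4', '5']]
```

If a BEL ever does show up in the file, the ValueError names the offending line instead of letting the row be parsed incorrectly.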
I don't know whether Python would like or allow it, but could you use a non-printable ASCII character such as BEL or BS (backspace)? I would expect these to be extremely rare.
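If "extremely rare" is not a strong enough guarantee, a possible refinement is to scan the data first and pick a control character that does not occur in it. This is only a sketch: pick_unused_control_char and its candidate list are made up for illustration, and it assumes the file can be read in full before parsing.

```python
def pick_unused_control_char(text, candidates='\x07\x08\x01\x02\x03\x04'):
    """Return the first candidate character absent from text, or None."""
    for char in candidates:
        if char not in text:
            return char
    return None


# Sample data that happens to contain BEL, so BS is chosen instead.
sample = '1,2,3,4,"5\n1,2,\x07,4,5\n'
quotechar = pick_unused_control_char(sample)
# quotechar == '\x08'
```

The chosen character can then be assigned to the dialect's quotechar; a None result means every candidate appears in the data and another approach is needed.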