
How can I disable quoting in the Python 2.4 CSV reader?

Tags: python, csv

I am writing a Python utility that needs to parse a large, regularly updated CSV file I don't control. The utility must run on a server with only Python 2.4 available. The CSV file does not quote field values at all, but the Python 2.4 version of the csv library does not seem to give me any way to turn off quoting; it only lets me set the quote character (dialect.quotechar = '"' or whatever). If I try setting the quote character to None or the empty string, I get an error.

I can sort of work around this by setting dialect.quotechar to some "rare" character, but this is brittle, as there is no ASCII character I can absolutely guarantee will not show up in field values (except the delimiter, but if I set dialect.quotechar = dialect.delimiter, things go predictably haywire).

In Python 2.5 and later, if I set dialect.quoting to csv.QUOTE_NONE, the CSV reader respects that and does not interpret any character as a quote character. Is there any way to duplicate this behavior in Python 2.4?
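For reference, the behavior I want to reproduce can be sketched as follows. This is a minimal illustration written for a current Python 3 interpreter (so it uses io.StringIO rather than StringIO.StringIO); the sample data is made up for the example.

```python
import csv
import io

# Made-up sample: the first row has a stray, unbalanced double quote.
data = '1,2,3,4,"5\n1,2,3,4,5\n'

# QUOTE_NONE asks the reader to treat quote characters as ordinary data,
# so the lone '"' stays part of the field value.
reader = csv.reader(io.StringIO(data), quoting=csv.QUOTE_NONE)
rows = list(reader)

for row in rows:
    print(row)
```

With quoting disabled this way, the value of quotechar should not matter to the reader.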

UPDATE: Thanks Triptych and Mark Roddy for helping to narrow the problem down. Here's a simplest-case demonstration:

>>> import csv
>>> import StringIO
>>> data = """
... 1,2,3,4,"5
... 1,2,3,4,5
... """
>>> reader = csv.reader(StringIO.StringIO(data))
>>> for i in reader: print i
... 
[]
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
_csv.Error: newline inside string

The problem only occurs when there's a single double-quote character in the final column of a row: the reader appears to treat it as the start of a quoted field that never closes, so the line ending lands inside the "string". Unfortunately, this situation exists in my dataset. I've accepted Tanj's solution: manually assign a nonprinting character ('\x07', i.e. BEL) as the quotechar. This is hacky, but it works, and I haven't yet seen another solution that does. Here's a demo of the solution in action:

>>> import csv
>>> import StringIO
>>> class MyDialect(csv.Dialect):
...     quotechar = '\x07'
...     delimiter = ','
...     lineterminator = '\n'
...     doublequote = False
...     skipinitialspace = False
...     quoting = csv.QUOTE_NONE
...     escapechar = '\\'
... 
>>> dialect = MyDialect()
>>> data = """
... 1,2,3,4,"5
... 1,2,3,4,5
... """
>>> reader = csv.reader(StringIO.StringIO(data), dialect=dialect)
>>> for i in reader: print i
... 
[]
['1', '2', '3', '4', '"5']
['1', '2', '3', '4', '5']

In Python 2.5+ setting quoting to csv.QUOTE_NONE would be sufficient, and the value of quotechar would then be irrelevant. (I'm actually getting my initial dialect via a csv.Sniffer and then overriding the quotechar value, not by subclassing csv.Dialect, but I don't want that to be a distraction from the real issue; the above two sessions demonstrate that Sniffer isn't the problem.)
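The Sniffer-based variant mentioned in the parenthetical might look roughly like the sketch below. It is written for a current Python 3 interpreter, not 2.4, and it makes two assumptions that are not part of my real code: the sample data is made up, and delimiters=',' is passed to restrict what the Sniffer may guess.

```python
import csv
import io

# Made-up sample with a stray double quote in the last field of row one.
data = '1,2,3,4,"5\n1,2,3,4,5\n'

# Let the Sniffer work out the dialect; restricting the candidate
# delimiters to ',' is an assumption made for this example.
dialect = csv.Sniffer().sniff(data, delimiters=',')

# Override the sniffed quoting settings. BEL stands in as a quotechar
# that should never occur in the data; QUOTE_NONE disables quoting
# where the reader honours it.
dialect.quotechar = '\x07'
dialect.quoting = csv.QUOTE_NONE

rows = list(csv.reader(io.StringIO(data), dialect))
for row in rows:
    print(row)
```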

Carl Meyer asked Jan 30 '09 00:01



1 Answer

I don't know whether Python would allow it, but could you use a non-printable ASCII character such as BEL or BS (backspace)? I would expect these to be extremely rare in field values.
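Since "extremely rare" is not the same as "never", one way to make this less brittle is to fail loudly if the chosen character ever does show up. The sketch below is only an illustration, written for a current Python 3 interpreter; checked_lines and read_unquoted are hypothetical helper names, not part of the csv module.

```python
import csv
import io

SENTINEL = '\x07'  # BEL, used as a dummy quote character


def checked_lines(lines, sentinel=SENTINEL):
    """Hypothetical helper: yield lines unchanged, but raise ValueError
    if the sentinel character ever appears in the input."""
    for line in lines:
        if sentinel in line:
            raise ValueError('sentinel %r found in input' % sentinel)
        yield line


def read_unquoted(lines, sentinel=SENTINEL):
    """Hypothetical helper: parse rows with the sentinel as quotechar,
    so that ordinary '"' characters are kept as plain data."""
    return csv.reader(checked_lines(lines, sentinel), quotechar=sentinel)


sample = io.StringIO('1,2,3,4,"5\n1,2,3,4,5\n')
rows = list(read_unquoted(sample))
for row in rows:
    print(row)
```

Because the check runs line by line inside a generator, a large file does not need to be loaded into memory first.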

Tanj answered Nov 10 '22 09:11
