Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to handle double quotes inside field values with csv module?

Tags:

python

csv

I'm trying to parse CSV files from an external system which I have no control of.

  • comma is used as a separator
  • when cell contains comma then it's wrapped in quotes and all other quotes are escaped with another quote character.
  • (my problem) when cell was not wrapped in quotes then all quote characters are escaped with another quote nonetheless.

Example CSV:

qw""erty,"a""b""c""d,ef""""g"

Should be parsed as:

[['qw"erty', 'a"b"c"d,ef""g']]

However, I think that Python's csv module does not expect quote characters to be escaped when cell was not wrapped in quote chars in the first place. csv.reader(my_file) (with default doublequote=True) returns:

['qw""erty', 'a"b"c"d,ef""g']

Is there any way to parse this with python csv module ?

like image 386
matf Avatar asked Feb 25 '15 17:02

matf


1 Answers

Following on @JackManey comment where he suggested to replace all instances of '""' inside of double quotes with '\\"'.

Recognizing if we are currently inside of double quoted cells turned out to be unnecessary and we can replace all instances of '""' with '\\"'. Python documentation says:

On reading, the escapechar removes any special meaning from the following character

However this would still break in the case where original cell already contains escape characters, example: 'qw\\\\""erty' producing [['qw\\"erty']]. So we have to escape the escape characters before parsing too.

Final solution:

with open(file_path, 'rb') as f:
  content = f.read().replace('\\', '\\\\').replace('""', '\\"')
  reader = csv.reader(StringIO(content), doublequote=False, escapechar='\\')
  return [row for row in reader]
like image 125
matf Avatar answered Sep 24 '22 09:09

matf