Match unescaped quotes in quoted CSV

I've looked at several of the Stack Overflow posts with similar titles, and none of the accepted answers have done the trick for me.

I have a CSV file where each "cell" of data is delimited by a comma and is quoted (including numbers). Each line ends with a newline character.

Some text "cells" have quotation marks in them, and I want to use regex to find these, so that I can escape them properly.

Example line:

"0","0.23432","234.232342","data here dsfsd hfsdf","3/1/2016",,"etc","E 60"","AD"8"\n

I want to match just the " in E 60" and in AD"8, but not any of the other ".

What is a (preferably Python-friendly) regular expression that I can use to do this?

sundance asked Oct 18 '22 13:10


2 Answers

EDIT: Updated with regex from @sundance to avoid beginning of line and newline.

You could try substituting only quotes that aren't next to a comma, start of line, or newline:

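import re

line = '"0","0.23432","234.232342","data here dsfsd hfsdf","3/1/2016",,"etc","E 60"","AD"8"\n'

# Remove every quote that is not at the start of the line, not preceded by a
# comma, and not followed by a comma or the end of the line.
new_line = re.sub(r'(?<!^)(?<!,)"(?!,|$)', '', line)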
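
The substitution above replaces the matched quotes with an empty string, which removes them. If the goal is to escape them instead, a minimal sketch could reuse the same pattern with a different replacement. The names INNER_QUOTE and escape_inner_quotes and the escaped parameter are made up for illustration, and doubling the quote is only an assumption about what "properly escaped" means here (it is what Python's csv module expects by default):

```python
import re

# Same pattern as above: a quote that is not at the start of the line, not
# preceded by a comma, and not followed by a comma or the end of the line.
INNER_QUOTE = re.compile(r'(?<!^)(?<!,)"(?!,|$)')

def escape_inner_quotes(line, escaped='""'):
    # `escaped` is an assumption: '""' doubles the quote (CSV style);
    # pass '\\"' to get a backslash escape instead.
    # A function is used as the replacement so that backslashes in
    # `escaped` are inserted literally.
    return INNER_QUOTE.sub(lambda match: escaped, line)

line = '"0","0.23432","234.232342","data here dsfsd hfsdf","3/1/2016",,"etc","E 60"","AD"8"\n'
fixed = escape_inner_quotes(line)
print(repr(fixed))
```

One caveat: a stray quote that sits directly next to a comma inside a cell (for example "say "hi", ok") looks like a cell boundary to this pattern, so it would not be matched.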
dogoncouch answered Oct 21 '22 03:10


Rather than using regex, here's an approach that uses Python's string methods to find and escape only the quotes between the leftmost and rightmost quotes of each item.

It uses the .find() and .rfind() methods of strings to find the surrounding " characters. It then does a replacement on any additional " characters that appear inside the outer quotes. Doing it this way makes no assumptions about where the surrounding quotes are between the , separators, so it will leave any surrounding whitespace unaltered (for example, it leaves the '\n' at the end of each line as-is).

def escape_internal_quotes(item):
    left = item.find('"') + 1
    right = item.rfind('"')
    if left < right:
        # only do the substitution if two surrounding quotes are found
        item = item[:left] + item[left:right].replace('"', '\\"') + item[right:]
    return item

line = '"0","0.23432","234.232342","data here dsfsd hfsdf","3/1/2016",,"etc","E 60"","AD"8"\n'
escaped = [escape_internal_quotes(item) for item in line.split(',')]
print(repr(','.join(escaped)))

Resulting in:

'"0","0.23432","234.232342","data here dsfsd hfsdf","3/1/2016",,"etc","E 60\\"","AD\\"8"\n'
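
To check that a line escaped this way can be read back, here is a small sketch using Python's csv module. It assumes the backslash is meant as the escape character, so the reader is told that explicitly; the variable names are only for illustration:

```python
import csv

# The escaped line from the result above, written as a Python literal.
escaped_line = '"0","0.23432","234.232342","data here dsfsd hfsdf","3/1/2016",,"etc","E 60\\"","AD\\"8"\n'

# Assumption: quotes are escaped with a backslash, not by doubling them,
# so doublequote is switched off and escapechar is set.
reader = csv.reader([escaped_line], escapechar='\\', doublequote=False)
rows = list(reader)
print(rows[0])
```

The last two cells should come back as E 60" and AD"8, with the inner quotes intact. Note that splitting the line on ',' in the function-based approach assumes no cell contains a comma.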
Craig answered Oct 21 '22 03:10