Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Python regex for reading CSV-like rows

Tags:

python

regex

csv

I want to parse incoming CSV-like rows of data. Values are separated with commas (and there could be leading and trailing whitespaces around commas), and can be quoted either with ' or with ". For example - this is a valid row:

    data1, data2  ,"data3'''",  'data4""',,,data5,

but this one is malformed:

    data1, data2, da"ta3", 'data4',

-- quotation marks can only be prepended or trailed by spaces.

Such malformed rows should be recognized - best would be to somehow mark malformed value within row, but if regex doesn't match the whole row then it's also acceptable.

I'm trying to write regex able to parse this, using either match() of findall(), but every single regex I'm coming with has some problems with edge cases.

So, maybe someone with experience in parsing something similar could help me on this? (Or maybe this is too complex for regex and I should just write a function)

EDIT1:

csv module is not much of use here:

    >>> list(csv.reader(StringIO('''2, "dat,a1", 'dat,a2',''')))
    [['2', ' "dat', 'a1"', " 'dat", "a2'", '']]

    >>> list(csv.reader(StringIO('''2,"dat,a1",'dat,a2',''')))
    [['2', 'dat,a1', "'dat", "a2'", '']]

-- unless this can be tuned?

EDIT2: A few language edits - I hope it's more valid English now

EDIT3: Thank you for all answers, I'm now pretty sure that regular expression is not that good idea here as (1) covering all edge cases can be tricky (2) writer output is not regular. Writing that, I've decided to check mentioned pyparsing and either use it, or write custom FSM-like parser.

like image 603
Tomasz Zieliński Avatar asked Feb 06 '10 11:02

Tomasz Zieliński


3 Answers

While the csv module is the right answer here, a regex that could do this is quite doable:

import re

r = re.compile(r'''
    \s*                # Any whitespace.
    (                  # Start capturing here.
      [^,"']+?         # Either a series of non-comma non-quote characters.
      |                # OR
      "(?:             # A double-quote followed by a string of characters...
          [^"\\]|\\.   # That are either non-quotes or escaped...
       )*              # ...repeated any number of times.
      "                # Followed by a closing double-quote.
      |                # OR
      '(?:[^'\\]|\\.)*'# Same as above, for single quotes.
    )                  # Done capturing.
    \s*                # Allow arbitrary space before the comma.
    (?:,|$)            # Followed by a comma or the end of a string.
    ''', re.VERBOSE)

line = r"""data1, data2  ,"data3'''",  'data4""',,,data5,"""

print r.findall(line)

# That prints: ['data1', 'data2', '"data3\'\'\'"', '\'data4""\'', 'data5']

EDIT: To validate lines, you can reuse the regex above with small additions:

import re

r_validation = re.compile(r'''
    ^(?:    # Capture from the start.
      # Below is the same regex as above, but condensed.
      # One tiny modification is that it allows empty values
      # The first plus is replaced by an asterisk.
      \s*([^,"']*?|"(?:[^"\\]|\\.)*"|'(?:[^'\\]|\\.)*')\s*(?:,|$)
    )*$    # And don't stop until the end.
    ''', re.VERBOSE)

line1 = r"""data1, data2  ,"data3'''",  'data4""',,,data5,"""
line2 = r"""data1, data2, da"ta3", 'data4',"""

if r_validation.match(line1):
    print 'Line 1 is valid.'
else:
    print 'Line 1 is INvalid.'

if r_validation.match(line2):
    print 'Line 2 is valid.'
else:
    print 'Line 2 is INvalid.'

# Prints:
#    Line 1 is valid.
#    Line 2 is INvalid.
like image 63
Max Shawabkeh Avatar answered Oct 10 '22 18:10

Max Shawabkeh


Although it would likely be possible with some combination of pre-processing, use of csv module, post-processing, and use of regular expressions, your stated requirements do not fit well with the design of the csv module, nor possibly with regular expressions (depending on the complexity of nested quotation marks that you might have to handle).

In complex parsing cases, pyparsing is always a good package to fall back on. If this isn't a one-off situation, it will likely produce the most straightforward and maintainable result, at the cost of possibly a little extra effort up front. Consider that investment to be paid back quickly, however, as you save yourself the extra effort of debugging the regex solutions to handle corner cases...

You can likely find examples of pyparsing-based CSV parsing easily, with this question maybe enough to get you started.

like image 24
Peter Hansen Avatar answered Oct 10 '22 16:10

Peter Hansen


Python has a standard library module to read csv files:

import csv

reader = csv.reader(open('file.csv'))

for line in reader:
    print line

For your example input this prints

['data1', ' data2 ', "data3'''", ' \'data4""\'', '', '', 'data5', '']

EDIT:

you need to add skipinitalspace=True to allow spaces before double quotation marks for the extra examples you provided. Not sure about the single quotes yet.

>>> list(csv.reader(StringIO('''2, "dat,a1", 'dat,a2','''), skipinitialspace=True))
[['2', 'dat,a1', "'dat", "a2'", '']]

>>> list(csv.reader(StringIO('''2,"dat,a1",'dat,a2','''), skipinitialspace=True))
[['2', 'dat,a1', "'dat", "a2'", '']]
like image 21
pwdyson Avatar answered Oct 10 '22 17:10

pwdyson