Python

Question

I am trying to read data from a CSV file and split each row into its respective columns.

But my regex fails when a particular column contains commas within itself.

For example: a,b,c,"d,e, g,",f

I want a result like:

a    b    c    "d,e, g,"    f

which is 5 columns.

Here is the regex I am using to split the string on commas:

,(?=(?:"[^"]?(?:[^"])*))|,(?=[^"]+(?:,)|,+|$)

but it fails for some strings while it works for others.

All I am looking for is this: when I read data from a CSV file using pyspark into a dataframe/rdd, I want to load and preserve all the columns without any mistakes.

Thank You

Jan · Accepted Answer

Much easier with the help of the newer regex module:

import regex as re

string = 'a,b,c,"d,e, g,",f'
rx = re.compile(r'"[^"]*"(*SKIP)(*FAIL)|,')

parts = rx.split(string)
print(parts)
# ['a', 'b', 'c', '"d,e, g,"', 'f']

It supports the (*SKIP)(*FAIL) mechanism, which in this example ignores everything between double quotes.
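If the third-party regex module is not available, the same idea can be approximated with the standard library re module. The sketch below is not part of the original answer: the helper name split_outside_quotes is hypothetical, and it assumes quoted fields contain no escaped quotes. It matches either a whole quoted run (and ignores it) or a captured comma (and splits there).

```python
import re

# Alternative 1 consumes a whole quoted run; alternative 2 captures a comma.
# Only matches where the capture group took part are used as split points.
_SPLIT_RX = re.compile(r'"[^"]*"|(,)')

def split_outside_quotes(line):
    """Split `line` on commas that are not inside double quotes."""
    parts = []
    start = 0
    for match in _SPLIT_RX.finditer(line):
        if match.group(1) is not None:  # a comma outside any quoted run
            parts.append(line[start:match.start()])
            start = match.end()
    parts.append(line[start:])  # trailing field after the last comma
    return parts

print(split_outside_quotes('a,b,c,"d,e, g,",f'))
# ['a', 'b', 'c', '"d,e, g,"', 'f']
```

Like the (*SKIP)(*FAIL) version, this keeps the surrounding quotes in the field and preserves empty fields.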

If you have escaped double quotes, you could use:

import regex as re

string = '''a,b,c,"d,e, g,",f, this, one, with "escaped \"double",quotes:""'''
rx = re.compile(r'".*?(?<!\\)"(*SKIP)(*FAIL)|,')
parts = rx.split(string)
print(parts)
# ['a', 'b', 'c', '"d,e, g,"', 'f', ' this', ' one', ' with "escaped "double",quotes:""']

See a demo for the latter on regex101.com.

With a bounty of nearly 50 points, I feel I should provide the csv approach as well:

import csv
string = '''a,b,c,"d,e, g,",f, this, one, with "escaped \"double",quotes:""'''

# just make up an iterable, normally a file would go here
for row in csv.reader([string]):
    print(row)
    # ['a', 'b', 'c', 'd,e, g,', 'f', ' this', ' one', ' with "escaped "double"', 'quotes:""']
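Since csv.reader accepts any iterable of lines, the same call works on a whole file or on a stream of lines. The following is a minimal sketch, not taken from the original answer: parse_lines is a hypothetical helper, and io.StringIO stands in for a real file. A generator of this shape could presumably be passed to rdd.mapPartitions in PySpark, assuming no quoted field spans a line break.

```python
import csv
import io

def parse_lines(lines):
    """Yield one list of fields per CSV line; `lines` is any iterable of str."""
    # csv.reader handles quoted fields, so embedded commas stay in one column.
    yield from csv.reader(lines)

# io.StringIO is a stand-in for an open file (assumption for illustration).
sample = io.StringIO('a,b,c,"d,e, g,",f\n1,2,3,"4,5",6\n')
rows = list(parse_lines(sample))

for row in rows:
    print(len(row), row)
# 5 ['a', 'b', 'c', 'd,e, g,', 'f']
# 5 ['1', '2', '3', '4,5', '6']
```

Note that, unlike the regex split, csv.reader strips the enclosing quotes from each field.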

Erwin Rooijakkers · Answer

Try \,(?=([^"\\]*(\\.|"([^"\\]*\\.)*[^"\\]*"))*[^"]*$).

I used this answer, which explains how to match everything that is not inside quotes while ignoring escaped quotes, and http://regexr.com/ to test it.

Note that, as other answers to your question state, there are better ways to parse CSV than using a regex.
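For anyone who wants to try the lookahead pattern with the standard library, here is a hedged sketch. It is an adaptation, not the answer's exact text: the groups are rewritten as non-capturing (?:...) because re.split would otherwise insert the captured group contents into the result, and it assumes backslash-escaped quotes. The name split_row is hypothetical.

```python
import re

# Lookahead version with non-capturing groups (adaptation for re.split).
# A comma is a separator only if the rest of the line consists of
# complete quoted strings and escape pairs, i.e. quotes are balanced.
COMMA_OUTSIDE_QUOTES = re.compile(
    r',(?=(?:[^"\\]*(?:\\.|"(?:[^"\\]*\\.)*[^"\\]*"))*[^"]*$)'
)

def split_row(line):
    """Split `line` on commas outside double-quoted fields."""
    return COMMA_OUTSIDE_QUOTES.split(line)

print(split_row('a,b,c,"d,e, g,",f'))
# ['a', 'b', 'c', '"d,e, g,"', 'f']

# A raw string keeps the backslashes, so the quotes really are escaped here.
print(split_row(r'a,"x \"y, z\" w",b'))
```

The lookahead rescans the remainder of the line for every comma, so this is likely to be slow on long rows compared with the csv module.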

Python - Split a row into columns - csv data

Tags:

regex

csv

rdd

pyspark

Alekhya Vemavarapu
