I am trying to read data from a CSV file and split each row into its respective columns.
But my regex fails when a particular column contains commas within itself.
e.g.: a,b,c,"d,e, g,",f
I want a result like:
a b c "d,e, g," f
which is 5 columns.
Here is the regex I am using to split the string by commas:
,(?=(?:"[^"]?(?:[^"])*))|,(?=[^"]+(?:,)|,+|$)
but it fails for a few strings while it works for others.
All I am looking for is this: when I read data from a CSV file using PySpark into a DataFrame/RDD, I want to load and preserve all the columns without any mistakes.
Thank You
Much easier with the help of the newer regex module:
import regex as re
string = 'a,b,c,"d,e, g,",f'
rx = re.compile(r'"[^"]*"(*SKIP)(*FAIL)|,')
parts = rx.split(string)
print(parts)
# ['a', 'b', 'c', '"d,e, g,"', 'f']
The regex module supports the (*SKIP)(*FAIL) mechanism, which in this example ignores everything between double quotes.
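To also skip quotes that are escaped with a backslash, require that the closing quote is not preceded by one:
import regex as re
# note: this literal is not a raw string, so Python stores \" as a plain " (no backslash is kept)
string = '''a,b,c,"d,e, g,",f, this, one, with "escaped \"double",quotes:""'''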
rx = re.compile(r'".*?(?<!\\)"(*SKIP)(*FAIL)|,')
parts = rx.split(string)
print(parts)
# ['a', 'b', 'c', '"d,e, g,"', 'f', ' this', ' one', ' with "escaped "double",quotes:""']
See a demo for the latter on regex101.com.
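If installing the third-party regex module is not an option, the same idea (consume a quoted run first, then act only on commas found outside it) can be approximated with the standard re module. This is only a minimal sketch, not the approach from the answer above: the helper name split_outside_quotes is made up for illustration, and it assumes quotes are balanced and never escaped.

```python
import re

# Alternation order matters: a quoted run is tried first, so any comma
# inside it is swallowed by that branch and never seen as a separator.
TOKEN = re.compile(r'"[^"]*"|,')

def split_outside_quotes(line):
    """Split on commas that are not inside double quotes (hypothetical helper)."""
    parts, start = [], 0
    for match in TOKEN.finditer(line):
        if match.group() == ',':          # a real separator
            parts.append(line[start:match.start()])
            start = match.end()
        # otherwise the match is a quoted run: skip it, keep scanning
    parts.append(line[start:])            # trailing field
    return parts

print(split_outside_quotes('a,b,c,"d,e, g,",f'))
```

As with the (*SKIP)(*FAIL) version, the quotes stay part of the field; strip them afterwards if they are not wanted.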
The csv module handles this as well:
import csv
string = '''a,b,c,"d,e, g,",f, this, one, with "escaped \"double",quotes:""'''
# just make up an iterable, normally a file would go here
for row in csv.reader([string]):
    print(row)
# ['a', 'b', 'c', 'd,e, g,', 'f', ' this', ' one', ' with "escaped "double"', 'quotes:""']
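The csv approach can be wrapped in a small per-line helper, which may be convenient when lines arrive one at a time. The function name parse_csv_line is an assumption made for this sketch, and it only works when no quoted field spans more than one line.

```python
import csv

def parse_csv_line(line):
    """Parse a single CSV-formatted line into a list of fields (hypothetical helper)."""
    # csv.reader accepts any iterable of lines, so wrap the one line in a list
    return next(csv.reader([line]))

# Quoted field containing commas: the surrounding quotes are removed
print(parse_csv_line('a,b,c,"d,e, g,",f'))

# A doubled quote inside a quoted field is read as one literal quote
print(parse_csv_line('x,"say ""hi"", ok",y'))
```

A helper like this could, in principle, be applied to each line of a text source (for example through an RDD's map), as long as the one-line assumption holds for the data.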
Try \,(?=([^"\\]*(\\.|"([^"\\]*\\.)*[^"\\]*"))*[^"]*$).
I used this answer, which explains how to match everything that is not in quotes while ignoring escaped quotes, and http://regexr.com/ to test.
Note that, as other answers to your question state, there are better ways to parse CSV than using a regex.
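For completeness, here is a sketch of how the lookahead pattern above might be used with Python's standard re module. One caveat: re.split also returns the contents of capturing groups, so in this sketch the groups are rewritten as non-capturing (?:...); that rewrite and the names COMMA_OUTSIDE_QUOTES and split_fields are assumptions of this example, not part of the answer.

```python
import re

# Same lookahead as above, with every group made non-capturing so that
# re.split returns only the fields and not the captured substrings.
COMMA_OUTSIDE_QUOTES = re.compile(
    r',(?=(?:[^"\\]*(?:\\.|"(?:[^"\\]*\\.)*[^"\\]*"))*[^"]*$)'
)

def split_fields(line):
    """Split on commas followed only by balanced quoted runs (hypothetical helper)."""
    return COMMA_OUTSIDE_QUOTES.split(line)

print(split_fields('a,b,c,"d,e, g,",f'))

# Raw string, so the backslashes really are in the data
print(split_fields(r'x,"a \"b\", c",y'))
```

The lookahead rescans the rest of the line for every comma, so this is likely to be slower than the csv module on long lines.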