I'm struggling to split text rows on a variable delimiter while preserving empty fields and quoted data.
Examples:
1,"2",three,'four, 4',,"6\tsix"
or as a tab-delimited version:
1\t"2"\tthree\t'four, 4'\t\t"6\tsix"
Should both result in:
['1', '"2"', 'three', 'four, 4', '', "6\tsix"]
So far, I've tried:
Using split, but delimiters inside quotes are not handled as desired.
Solutions using the csv library, but its options tend to quote everything or nothing, without preserving the original quotes.
Regex, particularly the pattern from the following answer, but it drops the empty fields: How to split but ignore separators in quoted strings, in python?
Using the pyparsing library. The best I've managed is as follows, but this also drops the empty fields (using the comma delimiter example):
from pyparsing import OneOrMore, Word, delimitedList, printables, quotedString
s = '1,"2",three,\'four, 4\',,"6\tsix"'
wordchars = (printables + ' \t\r\n').replace(',', '', 1)
delimitedList(OneOrMore(quotedString | Word(wordchars)), ',').parseWithTabs().parseString(s)
Thanks for any ideas!
This works for me:
import pyparsing as pyp
pyp.delimitedList(pyp.quotedString | pyp.SkipTo(',' | pyp.LineEnd()), ',') \
.parseWithTabs().parseString(s)
Gives
['1', '"2"', 'three', "'four, 4'", '', '"6\tsix"']
Avoid creating Words with whitespace characters, or all printable characters. Pyparsing does not do any lookahead, and these expressions are likely to include much more than you had planned.
Use this pattern (the leading comma is part of it) to match the commas outside double quotes: ,(?=(?:(?:[^"]*\"){2})*[^"]*$)
Edit:
To split on commas outside double or single quotes, use this pattern: ,(?=(?:(?:[^'\"]*(?:\"|')){2})*[^'\"]*$)
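As a rough sketch of how the lookahead above could be applied to a variable delimiter, the pattern can be passed to re.split with the delimiter escaped and placed in front. The function name split_outside_quotes and the variable names are made up for illustration. Note that quotes are kept on every quoted field (including 'four, 4'), and that the approach only counts quote characters, so an apostrophe inside a double-quoted field would likely throw it off.

```python
import re

def split_outside_quotes(line, delimiter=","):
    """Split on delimiter only where the rest of the line has paired quotes."""
    # Lookahead taken from the pattern above: from this point to the end of
    # the line, quote characters (single or double) must come in pairs.
    balanced_tail = r"""(?=(?:(?:[^'"]*(?:"|')){2})*[^'"]*$)"""
    # re.escape lets any literal delimiter (comma, tab, pipe, ...) be used.
    return re.split(re.escape(delimiter) + balanced_tail, line)

comma_row = '1,"2",three,\'four, 4\',,"6\tsix"'
tab_row = '1\t"2"\tthree\t\'four, 4\'\t\t"6\tsix"'

print(split_outside_quotes(comma_row, ","))
print(split_outside_quotes(tab_row, "\t"))
# Both print: ['1', '"2"', 'three', "'four, 4'", '', '"6\tsix"']
```

The empty field survives because re.split returns an empty string between two adjacent matches. The lookahead rescans the rest of the line at every delimiter, so this is probably best kept to short rows.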