
Python delimited line split problems

I'm struggling to split text rows on a variable delimiter while preserving empty fields and quoted data.

Examples:

1,"2",three,'four, 4',,"6\tsix"

or as a tab-delimited version

1\t"2"\tthree\t'four, 4'\t\t"6\tsix"

Should both result in:

['1', '"2"', 'three', 'four, 4', '', "6\tsix"]

So far, I've tried:

  1. Using split, but clearly the quoted delimiters are not handled as desired.

  2. Solutions using the csv library, but its options tend to quote either everything or nothing, without preserving the original quotes.

  3. Regex, particularly following the pattern from the following answer, but it drops the empty fields: How to split but ignore separators in quoted strings, in python?

  4. Using the pyparsing library. The best I've managed is as follows, but this also drops the empty fields (using the comma delimiter example):

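    from pyparsing import OneOrMore, Word, delimitedList, printables, quotedString

    s = '1,"2",three,\'four, 4\',,"6\tsix"'
    # every printable character plus whitespace, minus the comma delimiter
    wordchars = (printables + ' \t\r\n').replace(',', '', 1)
    delimitedList(OneOrMore(quotedString | Word(wordchars)), ',').parseWithTabs().parseString(s)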
    

Thanks for any ideas!

user2123203 asked Jun 17 '14 15:06



2 Answers

This works for me:

    import pyparsing as pyp

    # s is the sample string from the question
    pyp.delimitedList(pyp.quotedString | pyp.SkipTo(',' | pyp.LineEnd()), ',') \
        .parseWithTabs().parseString(s)

This gives:

    ['1', '"2"', 'three', "'four, 4'", '', '"6\tsix"']

Avoid creating Words with whitespace characters, or all printable characters. Pyparsing does not do any lookahead, and these expressions are likely to include much more than you had planned.

PaulMcG answered Oct 09 '22 13:10


Use this pattern to match the commas outside double quotes:

    ,(?=(?:(?:[^"]*\"){2})*[^"]*$)

Edit: to split on commas outside either double or single quotes, use this pattern:

    ,(?=(?:(?:[^'\"]*(?:\"|')){2})*[^'\"]*$)
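For what it's worth, here is a minimal sketch of how these patterns might be used from Python with re.split. The helper name split_outside_quotes and the constant names are hypothetical and not part of the answer. Swapping the leading comma for another escaped delimiter is an assumption about how the "variable delimiter" from the question could be handled.

```python
import re

# Pattern from the answer: commas outside double quotes only.
DOUBLE_ONLY = r',(?=(?:(?:[^"]*\"){2})*[^"]*$)'

# Lookahead part of the second pattern (the leading comma is left off so that
# any delimiter can be put in front of it). It succeeds only when an even
# number of quote characters (' or ") remains up to the end of the line.
QUOTE_LOOKAHEAD = r"""(?=(?:(?:[^'\"]*(?:\"|')){2})*[^'\"]*$)"""


def split_outside_quotes(line, delimiter=","):
    """Hypothetical helper: split on `delimiter` wherever it is not inside
    single- or double-quoted text. Quote characters are kept in the fields."""
    return re.split(re.escape(delimiter) + QUOTE_LOOKAHEAD, line)


s = '1,"2",three,\'four, 4\',,"6\tsix"'
tabbed = '1\t"2"\tthree\t\'four, 4\'\t\t"6\tsix"'

print(re.split(DOUBLE_ONLY, s))            # first pattern: splits inside 'four, 4'
print(split_outside_quotes(s))             # second pattern, comma-delimited
print(split_outside_quotes(tabbed, "\t"))  # second pattern, tab-delimited
```

With the sample strings from the question, the last two calls should return the same six fields as the pyparsing answer, including the empty field and with the quote characters left in place. The first pattern ignores single quotes, so it also splits at the comma inside 'four, 4'. The lookahead only counts quote characters, so an apostrophe inside a double-quoted field or an unbalanced quote will throw the split off.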

alpha bravo answered Oct 09 '22 13:10