How to split but ignore separators in quoted strings, in Python?

Tags: python, regex

I need to split a string like the one below on semicolons, but I don't want to split on semicolons that are inside a quoted string (' or "). I'm not parsing a file; it's just a simple string with no line breaks.

part 1;"this is ; part 2;";'this is ; part 3';part 4;this "is ; part" 5

Result should be:

  • part 1
  • "this is ; part 2;"
  • 'this is ; part 3'
  • part 4
  • this "is ; part" 5

I suppose this can be done with a regex, but if not, I'm open to another approach.

Sylvain asked May 07 '10 02:05


1 Answer

Most of the answers seem massively overcomplicated. You don't need back references. You don't need to depend on whether or not re.findall gives overlapping matches. Given that the input cannot be parsed with the csv module, a regular expression is pretty much the only way to go, and all you need is to call re.split with a pattern that matches a field.

Note that it is much easier here to match a field than it is to match a separator:

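import re

data = """part 1;"this is ; part 2;";'this is ; part 3';part 4;this "is ; part" 5"""

# A field is a run of non-separator characters and complete quoted strings.
PATTERN = re.compile(r'''((?:[^;"']|"[^"]*"|'[^']*')+)''')

# The capturing group makes re.split return the fields at the odd indices.
print(PATTERN.split(data)[1::2])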

and the output is:

['part 1', '"this is ; part 2;"', "'this is ; part 3'", 'part 4', 'this "is ; part" 5'] 
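For anyone wondering where the [1::2] comes from: because the pattern has a capturing group, re.split returns the matched fields interleaved with the text found between them, so the fields sit at the odd indices. A small sketch on a shorter, made-up input (the names parts, fields and separators are only for illustration):

```python
import re

PATTERN = re.compile(r'''((?:[^;"']|"[^"]*"|'[^']*')+)''')

# With a capturing group, re.split keeps the matched text in the result.
parts = PATTERN.split("a;'b;c';d")
print(parts)       # ['', 'a', ';', "'b;c'", ';', 'd', '']

# Odd indices hold the fields, even indices hold whatever lay between them.
fields = parts[1::2]
separators = parts[0::2]
print(fields)      # ['a', "'b;c'", 'd']
print(separators)  # ['', ';', ';', '']
```

The empty strings at both ends are the text before the first field and after the last one.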

As Jean-Luc Nacif Coelho correctly points out, this won't handle empty groups (empty fields, as in 'a;;b') correctly. Depending on the situation, that may or may not matter. If it does matter, it may be possible to handle it by, for example, replacing ';;' with ';<marker>;' before the split, where <marker> is some string (without semicolons) that you know does not appear in the data. You also need to remove the marker again after the split:

>>> marker = ";!$%^&;"
>>> [r.replace(marker[1:-1], '') for r in PATTERN.split("aaa;;aaa;'b;;b'".replace(';;', marker))[1::2]]
['aaa', '', 'aaa', "'b;;b'"]

However this is a kludge. Any better suggestions?
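One possible alternative to the marker trick, offered as a sketch rather than a definitive answer, is to drop the regex and scan the string once while tracking whether we are inside a quote. The function name split_outside_quotes and its sep parameter are hypothetical, and the sketch assumes quotes are never escaped; an unterminated quote simply swallows the rest of the string.

```python
def split_outside_quotes(text, sep=";"):
    """Split text on sep, ignoring separators inside '...' or "..." (hypothetical helper)."""
    fields = []
    current = []   # characters of the field being built
    quote = None   # the quote character we are inside, or None
    for ch in text:
        if quote:
            # Inside a quoted run: keep everything, close on the matching quote.
            current.append(ch)
            if ch == quote:
                quote = None
        elif ch in "\"'":
            # Opening quote: remember which kind so the other kind is plain text.
            quote = ch
            current.append(ch)
        elif ch == sep:
            # Separator outside quotes ends the field, even if the field is empty.
            fields.append("".join(current))
            current = []
        else:
            current.append(ch)
    fields.append("".join(current))
    return fields


data = """part 1;"this is ; part 2;";'this is ; part 3';part 4;this "is ; part" 5"""
print(split_outside_quotes(data))
print(split_outside_quotes("aaa;;aaa;'b;;b'"))  # ['aaa', '', 'aaa', "'b;;b'"]
```

Unlike the ';;' replacement, this also keeps several consecutive empty fields and empty fields at the start or end of the string.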

Duncan answered Sep 23 '22 06:09