Split string by delimiter only if not wrapped in certain pattern

Tags: python, regex
I am trying to split a string into a list by a delimiter (let's say ,), but the character should count as a delimiter only if it is not wrapped in a certain pattern, in my particular case <>. In other words, a comma nested inside <> is not a delimiter; it is just a regular character and should stay in the token.

So if I have the following string:

"first token, <second token part 1, second token part 2>, third token"

it should split into

list[0] = "first token"
list[1] = "second token part 1, second token part 2"
list[2] = "third token"

Needless to say, I cannot just do a simple split by , because that will split the second token into two tokens, second token part 1 and second token part 2, as they have a comma in between them.

How should I define the pattern to do it using Python RegEx?

amphibient asked Nov 21 '13 18:11


2 Answers

Update: Since you mentioned that the brackets may be nested, I regret to inform you that a regex solution is not possible with Python's built-in re module. The following works only if the angle brackets are always balanced and never nested or escaped:

>>> import re
>>> s = "first token, <second token part 1, second token part 2>, third token"
>>> regex = re.compile(",(?![^<>]*>)")
>>> regex.split(s)
['first token', ' <second token part 1, second token part 2>', ' third token']
>>> [item.strip(" <>") for item in _]
['first token', 'second token part 1, second token part 2', 'third token']

The regex ,(?![^<>]*>) splits on commas only if the next angle bracket that follows isn't a closing angle bracket.

Nested brackets preclude this or any other re-based solution from working, because Python's re module cannot match arbitrarily nested constructs. You need either a regex engine that can (such as Perl with recursive patterns or .NET with balancing groups) or a parser.
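
For the "use a parser" route, a minimal sketch could be a single pass that tracks bracket depth and splits only at depth zero. The function name split_outside_brackets and its parameters are made up for illustration; they are not from any library. It assumes single-character delimiters and brackets, ignores stray closing brackets, and leaves the brackets in the tokens (strip them afterwards if needed):

```python
def split_outside_brackets(text, delimiter=",", opener="<", closer=">"):
    """Split text on delimiter, ignoring delimiters inside opener/closer pairs.

    Hypothetical helper: handles nesting by counting depth, no escaping support.
    """
    parts = []
    current = []
    depth = 0  # number of unclosed openers we are currently inside
    for char in text:
        if char == opener:
            depth += 1
        elif char == closer and depth > 0:
            depth -= 1  # stray closers at depth 0 are left alone
        if char == delimiter and depth == 0:
            # Top-level delimiter: finish the current token
            parts.append("".join(current).strip())
            current = []
        else:
            current.append(char)
    parts.append("".join(current).strip())  # last token has no trailing delimiter
    return parts


s = "first token, <second token part 1, <nested a, nested b>>, third token"
print(split_outside_brackets(s))
# ['first token', '<second token part 1, <nested a, nested b>>', 'third token']
```

Because it only counts depth, it copes with any level of nesting, which the lookahead regex above cannot.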

Tim Pietzcker answered Oct 18 '22 21:10


One kludgy way that works for your example is to translate the angle brackets into double quotes and then treat the string as a line of CSV:

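import csv

s = "first token, <second token part 1, second token part 2>, third token"
# Map both angle brackets to double quotes (Python 3: str.maketrans
# replaces the Python 2 string.maketrans)
a = s.translate(str.maketrans('<>', '""'))
# first token, "second token part 1, second token part 2", third token
# skipinitialspace=True drops the space after each comma so the quotes are recognised
print(next(csv.reader([a], skipinitialspace=True)))
# ['first token', 'second token part 1, second token part 2', 'third token']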
Jon Clements answered Oct 18 '22 22:10