I am trying to split a string into a list by a delimiter (let's say ,) but the delimiter character should be treated as a delimiter only if it is not wrapped in a certain pattern, in my particular case <>. In other words, when a comma is nested in <>, it is ignored as a delimiter and becomes just a regular character.
So if I have the following string:
"first token, <second token part 1, second token part 2>, third token"
it should split into
list[0] = "first token"
list[1] = "second token part 1, second token part 2"
list[2] = "third token"
Needless to say, I cannot just do a simple split by , because that will split the second token into two tokens, second token part 1 and second token part 2, as they have a comma in between them.
How should I define the pattern to do it using Python RegEx?
Update: Since you mentioned that the brackets may be nested, I regret to inform you that a regex solution is not possible with Python's built-in re module. The following works only if the angle brackets are always balanced and never nested or escaped:
>>> import re
>>> s = "first token, <second token part 1, second token part 2>, third token"
>>> regex = re.compile(",(?![^<>]*>)")
>>> regex.split(s)
['first token', ' <second token part 1, second token part 2>', ' third token']
>>> [item.strip(" <>") for item in _]
['first token', 'second token part 1, second token part 2', 'third token']
The regex ,(?![^<>]*>) splits on commas only if the next angle bracket that follows isn't a closing angle bracket.
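To make the lookahead's behaviour concrete, here is a small sketch that runs the same pattern on the flat example and on a nested string. The nested input is a made-up example, not taken from the question.

```python
import re

# Split on a comma only when the next angle bracket ahead is not ">".
pattern = re.compile(r",(?![^<>]*>)")

flat = "first token, <second token part 1, second token part 2>, third token"
print(pattern.split(flat))
# ['first token', ' <second token part 1, second token part 2>', ' third token']

# Hypothetical nested input: the comma after "b" is followed by "<",
# so the lookahead lets it through and the split lands inside the brackets.
nested = "a, <b, <c, d>, e>, f"
print(pattern.split(nested))
# ['a', ' <b', ' <c, d>, e>', ' f']
```

The second result shows why the pattern is only safe when brackets are never nested.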
Nested brackets preclude this or any other regex solution from working with Python's re module. You either need a regex engine that can match balanced constructs (such as Perl's recursive patterns or .NET's balancing groups), or a parser.
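For the parser route, a minimal sketch is a single pass that tracks bracket depth and splits only at depth zero. The function name split_top_level and its parameters are illustrative, not from any library, and it assumes brackets are not escaped.

```python
def split_top_level(text, sep=",", open_ch="<", close_ch=">"):
    """Split text on sep, ignoring separators inside (possibly nested) brackets."""
    parts, current, depth = [], [], 0
    for ch in text:
        if ch == open_ch:
            depth += 1
        elif ch == close_ch:
            depth = max(depth - 1, 0)  # tolerate a stray closing bracket
        if ch == sep and depth == 0:
            # Top-level separator: close off the current token.
            parts.append("".join(current).strip())
            current = []
        else:
            current.append(ch)
    parts.append("".join(current).strip())
    return parts


print(split_top_level("first token, <second token part 1, second token part 2>, third token"))
# ['first token', '<second token part 1, second token part 2>', 'third token']
print(split_top_level("a, <b, <c, d>, e>, f"))
# ['a', '<b, <c, d>, e>', 'f']
```

The brackets are kept in the tokens here; if the outer pair should be dropped, as in the question's expected output, each token can be post-processed separately.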
One kludgy way that works for your example is to translate the <>'s into "'s and then treat it as a CSV file:
import csv
s = "first token, <second token part 1, second token part 2>, third token"
# Python 3: str.maketrans replaces Python 2's string.maketrans
a = s.translate(str.maketrans('<>', '""'))
# first token, "second token part 1, second token part 2", third token
print(next(csv.reader([a], skipinitialspace=True)))
['first token', 'second token part 1, second token part 2', 'third token']