This question has been asked and answered many times before. Some examples: [1], [2]. But there doesn't seem to be a more general solution. What I'm looking for is a way to split strings at commas that are not within quotes or pairs of delimiters. For instance:
s1 = 'obj<1, 2, 3>, x(4, 5), "msg, with comma"'
should be split into a list of three elements
['obj<1, 2, 3>', 'x(4, 5)', '"msg, with comma"']
The problem is that this can get more complicated, since pairs of <> and () can be nested:
s2 = 'obj<1, sub<6, 7>, 3>, x(4, y(8, 9), 5), "msg, with comma"'
which should be split into:
['obj<1, sub<6, 7>, 3>', 'x(4, y(8, 9), 5)', '"msg, with comma"']
The naive solution without regex is to scan the string for the characters ,<( and so on. If either < or ( is found, we start counting the parity. We can only split at a comma if the parity is zero. For instance, to split s2 we start with parity = 0, and when we reach s2[3] we encounter <, which increases the parity by 1. The parity increases whenever we encounter < or ( and decreases only when we encounter > or ). While the parity is not 0 we simply ignore the commas and do no splitting.
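The counting described above can be sketched roughly as follows. This is only an illustration: the helper name `top_level_commas` and its `opens`/`closes` parameters are made up here, and quotes are deliberately not handled, since the description only covers < and (.

```python
def top_level_commas(text, opens="<(", closes=">)"):
    """Return the indices of commas seen while the parity (nesting depth) is 0."""
    parity = 0
    positions = []
    for index, char in enumerate(text):
        if char in opens:
            parity += 1          # entering a <...> or (...) group
        elif char in closes:
            parity -= 1          # leaving a group
        elif char == "," and parity == 0:
            positions.append(index)  # only these commas are split points
    return positions


print(top_level_commas('obj<1, sub<6, 7>, 3>, x(4, y(8, 9), 5)'))
```

For this input the only comma outside every pair is the one between the two top-level items, so a single index is reported.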
The question here is, is there a way to do this quickly with regex? I looked into this solution, but it doesn't seem to cover the examples I have given.
A more general function would be something like this:
def split_at(text, delimiter, exceptions):
    """Split text at the specified delimiter if the delimiter is not
    within the exceptions"""
Some uses would be like this:
split_at('obj<1, 2, 3>, x(4, 5), "msg, with comma"', ',', [('<', '>'), ('(', ')'), ('"', '"')])
Would regex be able to handle this or is it necessary to create a specialized parser?
Regular expressions as provided by Python's built-in re module cannot match arbitrarily nested delimiters, but the following simple code will achieve the desired result:
def split_at(text, delimiter, opens='<([', closes='>)]', quotes='"\''):
    result = []
    buff = ""
    level = 0          # current bracket nesting depth
    is_quoted = False  # True while inside a quoted string

    for char in text:
        if char in delimiter and level == 0 and not is_quoted:
            # Top-level delimiter: close the current piece.
            result.append(buff)
            buff = ""
        else:
            buff += char

            if char in quotes:
                is_quoted = not is_quoted
            elif not is_quoted:
                # Brackets only count outside quoted text.
                if char in opens:
                    level += 1
                elif char in closes:
                    level -= 1

    if buff:
        result.append(buff)

    return result
Running this in the interpreter:
>>> split_at('obj<1, 2, 3>, x(4, 5), "msg, with comma"', ',')
['obj<1, 2, 3>', ' x(4, 5)', ' "msg, with comma"']
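For the signature proposed in the question, where the exceptions are passed as a list of (open, close) pairs, a stack-based variant might look like the sketch below. The name `split_at_pairs` is hypothetical, and several choices are assumptions rather than anything the question specifies: a pair whose two characters are identical is treated as a quote, brackets inside quotes are ignored, each piece is stripped of surrounding whitespace, and escape sequences are not handled.

```python
def split_at_pairs(text, delimiter, exceptions):
    """Split text at delimiter, except inside any (open, close) pair."""
    closers = {o: c for o, c in exceptions if o != c}  # e.g. '<' -> '>'
    quotes = {o for o, c in exceptions if o == c}      # e.g. '"'
    stack = []      # closing characters still expected, innermost last
    quote = None    # the quote character currently open, if any
    parts, buff = [], []

    for ch in text:
        if quote is not None:
            # Inside quotes everything is literal until the quote closes.
            if ch == quote:
                quote = None
            buff.append(ch)
        elif ch in quotes:
            quote = ch
            buff.append(ch)
        elif ch == delimiter and not stack:
            # Delimiter outside every pair: finish the current piece.
            parts.append(''.join(buff).strip())
            buff = []
        else:
            if ch in closers:
                stack.append(closers[ch])
            elif stack and ch == stack[-1]:
                stack.pop()
            buff.append(ch)

    parts.append(''.join(buff).strip())
    return parts


pairs = [('<', '>'), ('(', ')'), ('"', '"')]
print(split_at_pairs('obj<1, sub<6, 7>, 3>, x(4, y(8, 9), 5), "msg, with comma"', ',', pairs))
```

Using a stack instead of a single counter means a closing character only reduces the depth when it matches the innermost open pair, so a stray > or ) at the top level does not push the depth below zero.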
Using iterators and generators:
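from collections import defaultdict

def tokenize(txt, delim=',', pairs={'"': '"', '<': '>', '(': ')'}):
    fst, snd = set(pairs.keys()), set(pairs.values())
    it = iter(txt)

    def loop():
        # Yield characters up to the next top-level delimiter.
        cnt = defaultdict(int)  # open count, keyed by closing character
        # A for loop (rather than it.__next__()) ends cleanly at end of input;
        # a StopIteration escaping a generator is a RuntimeError since Python 3.7.
        for ch in it:
            if ch == delim and not any(cnt[x] for x in snd):
                return
            elif ch in fst and pairs[ch] == ch:
                cnt[ch] = 1 - cnt[ch]  # same open/close character (quote): toggle
            elif ch in fst:
                cnt[pairs[ch]] += 1
            elif ch in snd:
                cnt[ch] -= 1
            yield ch

    # Relies on the string iterator's length hint to detect remaining input.
    while it.__length_hint__():
        yield ''.join(loop())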
And a sample run:
>>> txt = 'obj<1, sub<6, 7>, 3>,x(4, y(8, 9), 5),"msg, with comma"'
>>> [x for x in tokenize(txt)]
['obj<1, sub<6, 7>, 3>', 'x(4, y(8, 9), 5)', '"msg, with comma"']