Splitting comma delimited strings in python

This question has been asked and answered many times before (some examples: [1], [2]), but none of the answers seem general enough. I'm looking for a way to split a string at commas that are not inside quotes or pairs of delimiters. For instance:

s1 = 'obj<1, 2, 3>, x(4, 5), "msg, with comma"'

should be split into a list of three elements:

['obj<1, 2, 3>', 'x(4, 5)', '"msg, with comma"']

The problem gets more complicated because pairs of <> and () can be nested:

s2 = 'obj<1, sub<6, 7>, 3>, x(4, y(8, 9), 5), "msg, with comma"'

which should be split into:

['obj<1, sub<6, 7>, 3>', 'x(4, y(8, 9), 5)', '"msg, with comma"']

The naive solution without regex is to scan the string for the characters , < ( > ) while tracking a nesting depth. Start with depth = 0; each < or ( increases the depth by 1, and each > or ) decreases it by 1. For instance, when splitting s2, the < at s2[3] raises the depth to 1. A comma is a split point only when the depth is 0; while the depth is nonzero, commas are simply ignored.
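The counting scheme described above can be traced on s2 with a minimal sketch. The helper name comma_depths is hypothetical, and quotes are deliberately ignored here because the paragraph only counts < and (, so the comma inside "msg, with comma" is reported at depth 0 as well.

```python
def comma_depths(text, opens='<(', closes='>)'):
    """Return (index, depth) for every comma in text (hypothetical helper)."""
    depth = 0
    found = []
    for i, ch in enumerate(text):
        if ch in opens:
            depth += 1      # entering a <...> or (...) group
        elif ch in closes:
            depth -= 1      # leaving a group
        elif ch == ',':
            found.append((i, depth))
    return found

s2 = 'obj<1, sub<6, 7>, 3>, x(4, y(8, 9), 5), "msg, with comma"'
depths = [d for _, d in comma_depths(s2)]
# Only commas at depth 0 are candidate split points.
top_level = [i for i, d in comma_depths(s2) if d == 0]
```

Handling the quoted comma needs one more piece of state (whether the scan is currently inside quotes), which is why a plain depth counter is not quite enough for the examples.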

The question is: is there a way to do this quickly with regex? I looked into this solution, but it doesn't seem to cover the examples I have given.
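To illustrate the limitation, here is a sketch of a flat pattern. It is an assumption made for illustration, not the linked solution: each token is a run of plain characters or complete, non-nested <...>, (...) and "..." groups. It handles s1 but breaks on the nested s2, because Python's standard re module has no recursive patterns.

```python
import re

# One token = plain characters or complete, NON-nested groups (illustrative pattern).
FLAT_TOKEN = re.compile(r'(?:[^,<("]|<[^>]*>|\([^)]*\)|"[^"]*")+')

s1 = 'obj<1, 2, 3>, x(4, 5), "msg, with comma"'
s2 = 'obj<1, sub<6, 7>, 3>, x(4, y(8, 9), 5), "msg, with comma"'

flat = [tok.strip() for tok in FLAT_TOKEN.findall(s1)]
nested = [tok.strip() for tok in FLAT_TOKEN.findall(s2)]

# flat is the desired three-element list for s1.
# nested is wrong: the first '>' closes 'obj<1, sub<6, 7>' too early,
# so s2 is cut into more than three pieces.
```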

A more general function would be something like this:

def split_at(text, delimiter, exceptions):
    """Split text at the specified delimiter if the delimiter is not
    within the exceptions"""

Some uses would be like this:

split_at('obj<1, 2, 3>, x(4, 5), "msg, with comma"', ',', [('<', '>'), ('(', ')'), ('"', '"')])

Would regex be able to handle this or is it necessary to create a specialized parser?

jmlopez asked Dec 15 '13 20:12

2 Answers

Python's standard re module has no support for recursive patterns, so a single regular expression cannot handle arbitrarily deep nesting. The following simple code achieves the desired result:

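def split_at(text, delimiter, opens='<([', closes='>)]', quotes='"\''):
    result = []
    buff = ""
    level = 0          # current nesting depth of opens/closes
    quote_char = None  # the quote character we are inside, if any

    for char in text:
        if char in delimiter and level == 0 and quote_char is None:
            # Top-level delimiter: finish the current piece.
            result.append(buff)
            buff = ""
            continue

        buff += char
        if quote_char is not None:
            # Inside quotes, only the matching quote character is special.
            if char == quote_char:
                quote_char = None
        elif char in quotes:
            quote_char = char
        elif char in opens:
            level += 1
        elif char in closes:
            level -= 1

    if buff:
        result.append(buff)

    return result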

Running this in the interpreter:

>>> split_at('obj<1, 2, 3>, x(4, 5), "msg, with comma"', ',')
['obj<1, 2, 3>', ' x(4, 5)', ' "msg, with comma"']
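The same idea can be adapted to the split_at(text, delimiter, exceptions) signature proposed in the question. The following is only a sketch under stated assumptions: the name split_at_pairs is hypothetical, the delimiter is a single character, pieces are stripped of surrounding whitespace to match the expected output in the question, and a stack is used instead of a counter so that each closer has to match the innermost opener.

```python
def split_at_pairs(text, delimiter, exceptions):
    """Split text at delimiter unless it lies inside one of the pairs.

    exceptions is a list of (opener, closer) tuples, as in the question.
    """
    closer_for = dict(exceptions)  # opening char -> closing char
    stack = []                     # closers still awaited, innermost last
    parts, buff = [], []

    for ch in text:
        if ch == delimiter and not stack:
            parts.append(''.join(buff))   # top-level delimiter: cut here
            buff = []
            continue
        buff.append(ch)
        if stack and ch == stack[-1]:
            stack.pop()                   # innermost group closed
        elif stack and closer_for.get(stack[-1]) == stack[-1]:
            pass                          # inside quotes: nothing else is special
        elif ch in closer_for:
            stack.append(closer_for[ch])  # new group opened

    parts.append(''.join(buff))
    # Assumption: surrounding whitespace is not wanted in the result.
    return [p.strip() for p in parts]


pairs = [('<', '>'), ('(', ')'), ('"', '"')]
s1 = 'obj<1, 2, 3>, x(4, 5), "msg, with comma"'
s2 = 'obj<1, sub<6, 7>, 3>, x(4, y(8, 9), 5), "msg, with comma"'
print(split_at_pairs(s1, ',', pairs))
print(split_at_pairs(s2, ',', pairs))
```

Unlike str.split, this sketch does not validate the input: an unmatched closer is silently kept as an ordinary character, and an unclosed opener swallows the rest of the string into one piece.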
Aaron Cronin answered Oct 10 '22 07:10


Using iterators and generators:

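from collections import defaultdict
from operator import length_hint

def tokenize(txt, delim=',', pairs={'"':'"', '<':'>', '(':')'}):
    # pairs maps each opening character to its closing character.
    fst, snd = set(pairs.keys()), set(pairs.values())
    it = iter(txt)

    def loop():
        # Yield the characters of one token, consuming the shared iterator.
        cnt = defaultdict(int)  # open groups, keyed by closing character

        # A for loop ends cleanly when the input runs out; calling
        # it.__next__() here raises RuntimeError on Python 3.7+ (PEP 479).
        for ch in it:
            if ch == delim and not any(cnt[x] for x in snd):
                return  # top-level delimiter: the token is complete
            elif ch in fst and pairs[ch] == ch:
                cnt[ch] = 1 - cnt[ch]  # quote-like pair: toggle open/closed
            elif ch in fst:
                cnt[pairs[ch]] += 1
            elif ch in snd:
                cnt[ch] -= 1
            yield ch

    # In CPython the hint for a str iterator is the number of characters left.
    while length_hint(it):
        yield ''.join(loop())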

For example:

>>> txt = 'obj<1, sub<6, 7>, 3>,x(4, y(8, 9), 5),"msg, with comma"'
>>> [x for x in tokenize(txt)]
['obj<1, sub<6, 7>, 3>', 'x(4, y(8, 9), 5)', '"msg, with comma"']
behzad.nouri answered Oct 10 '22 07:10