Regex pattern recursively - in python

Question

I've been sitting on this for 2 days now, and still didn't find an effective way to do so. Lets say we have the string:

<5629476219<421fsaas42f>14222<2412f2<2421savsar21>12vsaf21>412<<<142avsa1>1a24>421>421>

The output I want:

<562947621914222412421>

Well, recursively there might be data inside <> brackets, it could consist of numbers and letters - but the first level consists only of numbers. I bolded out the data I want to extract.

I want to do this in Python. The naive way is to implement a bracket stack of course (that way I'll know if I'm inside an inner bracket or I'm in the 1st level) - but it's very inefficient going char by char. I believe there is a good regex pattern I could use, but I haven't come up with something that works.

Can some with enough experience with regex give a little help?

Of course, other ideas except running char by char iteratively are welcome as well, run-time is important to me.

kaya3 · Accepted Answer

Of course, other ideas except running char by char iteratively are welcome as well, run-time is important to me.

Of course, any regex also has to run through the string character by character, too. Don't rule out the "naive" solution so easily: it turns out the simple way is more efficient than all three of the posted answers so far.

Here's a solution like your "naive" one: but it doesn't require a stack, because there is only one kind of open-bracket. Even with multiple kinds of brackets, you only need a stack if you also want to detect when the brackets are mismatched.

def chars_at_level(s):
    out = ['<']
    nesting_level = 0

    for c in s:
        if c == '<':
            nesting_level += 1
        elif c == '>':
            nesting_level -= 1
        elif nesting_level == 1:
            out.append(c)

    out.append('>')
    return ''.join(out)

Example:

>>> s = '<5629476219<421fsaas42f>14222<2412f2<2421savsar21>12vsaf21>412<<<142avsa1>1a24>421>421>'
>>> chars_at_level(s)
'<562947621914222412421>'

Now for the performance comparison. It beats the other three solutions, though Seb's solution is close.

>>> timeit(lambda: chars_at_level(s))
7.594452977000401
>>> timeit(lambda: parse(s)) # Seb's solution using re.sub
7.817124693000096
>>> timeit(lambda: regex_sub(s)) # bobble bubble's recursive regex
9.322779934999744
>>> timeit(lambda: nested_list(s)) # Ajax1234's nested list solution
17.795835303999866

However, Seb's solution is much less efficient in the worst case, on strings like <<<<<<1>>>>>>, because it does O(n) replacements on strings of length O(n), giving a running time of O(n²). The other two posted solutions still seem to be about O(n) on this kind of string, though I had to increase the system recursion limit for Ajax1234's solution to work. The "naive" solution is still fastest.

>>> t = (1000 * '<') + '1' + (1000 * '>')
>>> timeit(lambda: chars_at_level(t), number=1000)
0.1329130509998322
>>> timeit(lambda: parse(t), number=1000) # Seb's solution using re.sub
31.281542531000014
>>> timeit(lambda: regex_sub(t), number=1000) # bobble bubble's recursive regex
0.705901896999876
>>> timeit(lambda: nested_list(t), number=1000) # Ajax1234's nested list solution
1.1296931150000091

By the way, even if you do want to augment the "naive" solution with a stack, it still only takes O(n) time. It's also fairly trivial to change this algorithm to get the characters at any other nesting level, too.

Seb · Answer

This is one way; it recursively removes inner complete tags <[^<>]*> until only the outer level elements remain:

def parse(string):
    while True:
        output = re.sub(r'(?<!^)<([^<>]*)>(?!$)', '', string)
        if output == string:
            break
        string = output
    return output

>>> string = '<5629476219<421fsaas42f>14222<2412f2<2421savsar21>12vsaf21>412<<<142avsa1>1a24>421>421>'
>>> parse(string)
'<562947621914222412421>'

>>> %timeit parse(string)
6.57 µs ± 99.3 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

There should also be a way to do the recursion in regex itself, but I couldn't quite make it work. The built-in module re does not support this, but regex does. Here are some ideas on that.

Regex pattern recursively - in python

Tags:

python

regex

algorithm

pattern-matching

regexer178

2 Answers

kaya3

Seb

Recent Activity

Donate For Us

Regex pattern recursively - in python

Tags:

python

regex

algorithm

pattern-matching

regexer178

2 Answers

kaya3

Seb

Related questions

Recent Activity

Donate For Us