I've been sitting on this for 2 days now, and still didn't find an effective way to do so. Lets say we have the string:
<5629476219<421fsaas42f>14222<2412f2<2421savsar21>12vsaf21>412<<<142avsa1>1a24>421>421>
The output I want:
<562947621914222412421>
Well, recursively there might be data inside <> brackets, it could consist of numbers and letters - but the first level consists only of numbers. I bolded out the data I want to extract.
I want to do this in Python. The naive way is to implement a bracket stack of course (that way I'll know if I'm inside an inner bracket or I'm in the 1st level) - but it's very inefficient going char by char. I believe there is a good regex pattern I could use, but I haven't come up with something that works.
Can some with enough experience with regex give a little help?
Of course, other ideas except running char by char iteratively are welcome as well, run-time is important to me.
Of course, other ideas except running char by char iteratively are welcome as well, run-time is important to me.
Of course, any regex also has to run through the string character by character, too. Don't rule out the "naive" solution so easily: it turns out the simple way is more efficient than all three of the posted answers so far.
Here's a solution like your "naive" one: but it doesn't require a stack, because there is only one kind of open-bracket. Even with multiple kinds of brackets, you only need a stack if you also want to detect when the brackets are mismatched.
def chars_at_level(s):
out = ['<']
nesting_level = 0
for c in s:
if c == '<':
nesting_level += 1
elif c == '>':
nesting_level -= 1
elif nesting_level == 1:
out.append(c)
out.append('>')
return ''.join(out)
Example:
>>> s = '<5629476219<421fsaas42f>14222<2412f2<2421savsar21>12vsaf21>412<<<142avsa1>1a24>421>421>'
>>> chars_at_level(s)
'<562947621914222412421>'
Now for the performance comparison. It beats the other three solutions, though Seb's solution is close.
>>> timeit(lambda: chars_at_level(s))
7.594452977000401
>>> timeit(lambda: parse(s)) # Seb's solution using re.sub
7.817124693000096
>>> timeit(lambda: regex_sub(s)) # bobble bubble's recursive regex
9.322779934999744
>>> timeit(lambda: nested_list(s)) # Ajax1234's nested list solution
17.795835303999866
However, Seb's solution is much less efficient in the worst case, on strings like <<<<<<1>>>>>>
, because it does O(n) replacements on strings of length O(n), giving a running time of O(n²). The other two posted solutions still seem to be about O(n) on this kind of string, though I had to increase the system recursion limit for Ajax1234's solution to work. The "naive" solution is still fastest.
>>> t = (1000 * '<') + '1' + (1000 * '>')
>>> timeit(lambda: chars_at_level(t), number=1000)
0.1329130509998322
>>> timeit(lambda: parse(t), number=1000) # Seb's solution using re.sub
31.281542531000014
>>> timeit(lambda: regex_sub(t), number=1000) # bobble bubble's recursive regex
0.705901896999876
>>> timeit(lambda: nested_list(t), number=1000) # Ajax1234's nested list solution
1.1296931150000091
By the way, even if you do want to augment the "naive" solution with a stack, it still only takes O(n) time. It's also fairly trivial to change this algorithm to get the characters at any other nesting level, too.
This is one way; it recursively removes inner complete tags <[^<>]*>
until only the outer level elements remain:
def parse(string):
while True:
output = re.sub(r'(?<!^)<([^<>]*)>(?!$)', '', string)
if output == string:
break
string = output
return output
>>> string = '<5629476219<421fsaas42f>14222<2412f2<2421savsar21>12vsaf21>412<<<142avsa1>1a24>421>421>'
>>> parse(string)
'<562947621914222412421>'
>>> %timeit parse(string)
6.57 µs ± 99.3 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
There should also be a way to do the recursion in regex itself, but I couldn't quite make it work. The built-in module re
does not support this, but regex
does. Here are some ideas on that.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With