Python library to parse regex into AST?

Tags:

parsing

To emphasize, I do not want to "parse using a regex" - I want to "parse a regex into a symbolic tree." (Searching has only brought up the former...)

My use case: to speed up a regex search over a database, I'd like to parse a regex like (foo|bar)baz+(bat)* and pull out all substrings that MUST appear in a match. (In this case, it's just baz because foo/bar are alternations and bat can appear 0 times.)

To do this, I need some understanding of regex operators/semantics. re.DEBUG comes closest:

In [7]: re.compile('(foo|bar)baz+(bat)', re.DEBUG)
subpattern 1
  branch
    literal 102
    literal 111
    literal 111
  or
    literal 98
    literal 97
    literal 114
literal 98
literal 97
max_repeat 1 4294967295
  literal 122
subpattern 2
  literal 98
  literal 97
  literal 116

However, it just prints the tree, and the C implementation doesn't preserve the structure afterwards as far as I can tell. Any ideas on how I can parse this out without writing my own parser?


asked Dec 30 '15 05:12

munchybunch

2 Answers

You can maybe just use this:

import sre_parse
sre_parse.parse(r'(\d+)foo(.*)')
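For what it's worth, here is a rough sketch of how the result could be turned into plain nested lists you can walk yourself. Note that `sre_parse` is an undocumented internal module and was deprecated in Python 3.11 (the parser now lives in the private `re._parser`), so the node shapes assumed below may differ between versions. The `to_tree` helper is my own name, not part of the library.

```python
try:
    import re._parser as sre_parse  # Python 3.11+: private location of the parser
except ImportError:
    import sre_parse                # older Pythons


def to_tree(subpattern):
    """Convert a parsed pattern into nested lists of (op_name, value) tuples."""
    nodes = []
    for op, av in subpattern:
        name = str(op)  # same naming that re.DEBUG prints, upper-cased
        if name == "LITERAL":
            nodes.append((name, chr(av)))
        elif name == "BRANCH":
            # av is (None, [alternative, alternative, ...])
            nodes.append((name, [to_tree(alt) for alt in av[1]]))
        elif name == "SUBPATTERN":
            # the nested pattern is the last element of av
            nodes.append((name, to_tree(av[-1])))
        elif name in ("MAX_REPEAT", "MIN_REPEAT"):
            lo, hi, body = av
            nodes.append((name, (lo, int(hi), to_tree(body))))
        else:
            # character classes, anchors, etc. are passed through untouched
            nodes.append((name, av))
    return nodes


print(to_tree(sre_parse.parse(r"(\d+)foo(.*)")))
```

Unlike `re.DEBUG`, this gives you an ordinary Python structure that stays around after parsing.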

answered Oct 22 '22 09:10

boxed

You can only specify a (classic) regex using a context free grammar:

 regex = { alternatives };
 alternatives =  primitive { '|' alternatives } ;
 primitive = '(' regex ')' | '[' character_set ']' | ...

This means you can't parse a regex using a regex (Perl is an exception, but then its "regexes" are extended way beyond "classic").

So, to parse a regex, you'll need to build your own parser that constructs some kind of tree (re.DEBUG comes pretty close), or find that magic library you are hoping for.

I suspect this is the easy part; it isn't terribly hard to do yourself. See "Is there an alternative for flex/bison that is usable on 8-bit embedded systems?" for a straightforward scheme for building such parsers.
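To give a feel for the size of the job, here is a minimal recursive-descent sketch for a tiny regex subset (literals, grouping, `|`, and the postfix operators `*`, `+`, `?`). It is only an illustration: the function name and the tuple-based node format are made up, and character sets, escapes, and counted repeats are deliberately left out.

```python
def parse_regex(text):
    """Parse a tiny regex subset into nested tuples (hypothetical node format)."""
    pos = 0

    def peek():
        return text[pos] if pos < len(text) else None

    def alternation():
        # alternation = sequence { '|' sequence }
        nonlocal pos
        branches = [sequence()]
        while peek() == "|":
            pos += 1
            branches.append(sequence())
        return branches[0] if len(branches) == 1 else ("alt", branches)

    def sequence():
        # sequence = { repetition }, ends at '|', ')' or end of input
        items = []
        while peek() is not None and peek() not in "|)":
            items.append(repetition())
        return ("seq", items)

    def repetition():
        # repetition = atom { '*' | '+' | '?' }
        nonlocal pos
        node = atom()
        while peek() is not None and peek() in "*+?":
            node = ("repeat", text[pos], node)
            pos += 1
        return node

    def atom():
        # atom = '(' alternation ')' | literal character
        nonlocal pos
        ch = peek()
        if ch == "(":
            pos += 1
            node = alternation()
            if peek() != ")":
                raise ValueError("expected ')' at position %d" % pos)
            pos += 1
            return ("group", node)
        if ch in "*+?":
            raise ValueError("nothing to repeat at position %d" % pos)
        pos += 1
        return ("lit", ch)

    tree = alternation()
    if pos != len(text):
        raise ValueError("unexpected %r at position %d" % (text[pos], pos))
    return tree


print(parse_regex("(foo|bar)baz+(bat)*"))
```

Each grammar rule becomes one small function, which is why extending this to character sets or escapes is mostly a matter of adding cases to `atom`.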

To understand the semantics of the regex (e.g., to figure out "necessary substrings"), you might be able to get away with building an analyzer that walks over the parse tree and, for each subtree (bottom up), computes the common string. Failing that, you may have to implement the classic NDFA construction and then walk over it, or implement the NDFA-to-DFA construction and walk over the DFA. Real regexes tend to contain lots of messy complications such as built-in character sets, capture groups, etc.
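As a rough sketch of the tree-walking idea, the following uses Python's internal parser (`sre_parse`, or `re._parser` on 3.11+) instead of a hand-written one. It is deliberately conservative: anything that is not a plain literal, a group, or a repeat with a minimum of at least one is treated as a gap. It also assumes a case-sensitive pattern and ignores flags such as `re.IGNORECASE`. The names `pieces` and `required_substrings` are mine, and the node shapes are an assumption about an undocumented module.

```python
try:
    import re._parser as sre_parse  # Python 3.11+
except ImportError:
    import sre_parse                # older Pythons

GAP = None  # marker: anything (or nothing) may appear at this point


def pieces(subpattern):
    """Return single-character literals and GAP markers, in pattern order."""
    out = []
    for op, av in subpattern:
        if op == sre_parse.LITERAL:
            out.append(chr(av))
        elif op == sre_parse.SUBPATTERN:
            # a group consumes exactly what its body consumes
            out.extend(pieces(av[-1]))
        elif op in (sre_parse.MAX_REPEAT, sre_parse.MIN_REPEAT) and av[0] >= 1:
            # the first repetition is mandatory; later ones are not
            out.extend(pieces(av[2]))
            if av[1] > 1:
                out.append(GAP)
        else:
            # alternations, classes, optional repeats, anchors, lookarounds...
            out.append(GAP)
    return out


def required_substrings(pattern):
    """List literal substrings that every match of `pattern` must contain."""
    result, run = [], []
    for piece in pieces(sre_parse.parse(pattern)) + [GAP]:
        if piece is GAP:
            if run:
                result.append("".join(run))
            run = []
        else:
            run.append(piece)
    return result


print(required_substrings("(foo|bar)baz+(bat)*"))
```

Treating every alternation as a gap throws away information (for example a substring shared by all branches), so a more careful analyzer would compute per-branch results and intersect them.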

The "common string" might not be just a contiguous sequence of characters although you could define it narrowly as such. It might include several constant substrings separated by fixed or variable length gaps of characters, e.g., your necessary substring might always itself be expressible as a "simple regex" of the form:

   (<character>+ ?+) <character>+
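If the necessary substrings come out as an ordered list of constant chunks like this, a cheap pre-check before running the full regex could look roughly as follows. This is a sketch under the assumption that the chunks were extracted from a concatenation, so they must appear in order and without overlapping; `contains_in_order` and the sample data are hypothetical.

```python
import re


def contains_in_order(text, chunks):
    """True if every chunk occurs in `text`, in order and without overlap."""
    pos = 0
    for chunk in chunks:
        found = text.find(chunk, pos)
        if found < 0:
            return False
        pos = found + len(chunk)
    return True


# Hypothetical usage: filter rows cheaply, then confirm with the real regex.
pattern = re.compile(r"foo\d+bar")
chunks = ["foo", "bar"]  # assumed to be extracted from the pattern beforehand
rows = ["xx foo123bar yy", "bar then foo", "foo only"]
hits = [r for r in rows if contains_in_order(r, chunks) and pattern.search(r)]
print(hits)
```

The pre-check can only rule rows out; anything it lets through still has to be matched against the actual regex.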

answered Oct 22 '22 10:10

Ira Baxter

