Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Python library to parse regex into AST?

To emphasize, I do not want to "parse using a regex" - I want to "parse a regex into a symbolic tree." (Searching has only brought up the former...)

My use case: to speed up a regex search over a database, I'd like to parse a regex like (foo|bar)baz+(bat)* and pull out all substrings that MUST appear in a match. (In this case, it's just baz because foo/bar are alternations and bat can appear 0 times.)

To do this, I need some understanding of regex operators/semantics. re.DEBUG comes closest:

In [7]: re.compile('(foo|bar)baz+(bat)', re.DEBUG)
subpattern 1
  branch
    literal 102
    literal 111
    literal 111
  or
    literal 98
    literal 97
    literal 114
literal 98
literal 97
max_repeat 1 4294967295
  literal 122
subpattern 2
  literal 98
  literal 97
  literal 116

However, it's just printing out, and the c-implementation doesn't preserve the structure afterwards as far as I can tell. Any ideas on how I can parse this out without writing my owner parser?

like image 216
munchybunch Avatar asked Dec 30 '15 05:12

munchybunch


People also ask

Which Python library is used for regex search?

Python has a built-in package called re , which can be used to work with Regular Expressions.

Can you parse regex with regex?

You can't parse [X]HTML with regex. Because HTML can't be parsed by regex. Regex is not a tool that can be used to correctly parse HTML.

What is regex parser?

The Parse Regex operator (also called the extract operator) enables users comfortable with regular expression syntax to extract more complex data from log lines. Parse regex can be used, for example, to extract nested fields.

What is regex and parsing in Python?

URL or Uniform Resource Locator consists of many information parts, such as the domain name, path, port number etc. Any URL can be processed and parsed using Regular Expression. So for using Regular Expression we have to use re library in Python.


2 Answers

You can maybe just use this:

import sre_parse
sre_parse.parse(r'(\d+)foo(.*)')
like image 58
boxed Avatar answered Oct 22 '22 09:10

boxed


You can only specify a (classic) regex using a context free grammar:

 regex = { alternatives };
 alternatives =  primitive { '|' alternatives } ;
 primitive = '(' regex ')' | '[' character_set ']' | ...

This means you can't parse a regex using a regex (Perl is an exception, but then its "regexes" are extended way beyond "classic").

So, to parse a regex, you'll need to build your own parser and constructs some kind of tree (re.Debug comes pretty close) or that magic library you are hoping for.

I suspect this is the easy part. This isn't terribly hard to do yourself; see Is there an alternative for flex/bison that is usable on 8-bit embedded systems? for a straightforward scheme for building such parsers.

To understand the semantics of the regex (e.g., to figure out "necessary substrings"), you might be able to get away with building an analyzer the walks over the parse tree, and for each subtree (bottom up), computes the common-string. Failing that you may have to implement the classic NDFA construction and then walk over it, or implement the NDFA to DFA construction and walk over the DFA. Real regexes tend to contain lots of messy complications such as built-in character sets, capture groups, etc.

The "common string" might not be just a contiguous sequence of characters although you could define it narrowly as such. It might include several constant substrings separated by fixed or variable length gaps of characters, e.g., your necessary substring might always itself be expressible as a "simple regex" of the form:

   (<character>+ ?+) <character>+
like image 29
Ira Baxter Avatar answered Oct 22 '22 10:10

Ira Baxter