To emphasize, I do not want to "parse using a regex" - I want to "parse a regex into a symbolic tree." (Searching has only brought up the former...)
My use case: to speed up a regex search over a database, I'd like to parse a regex like (foo|bar)baz+(bat)*
and pull out all substrings that MUST appear in a match. (In this case, it's just baz
because foo/bar are alternations and bat can appear 0 times.)
To do this, I need some understanding of regex operators/semantics. re.DEBUG
comes closest:
In [7]: re.compile('(foo|bar)baz+(bat)', re.DEBUG)
subpattern 1
branch
literal 102
literal 111
literal 111
or
literal 98
literal 97
literal 114
literal 98
literal 97
max_repeat 1 4294967295
literal 122
subpattern 2
literal 98
literal 97
literal 116
However, it's just printing out, and the c-implementation doesn't preserve the structure afterwards as far as I can tell. Any ideas on how I can parse this out without writing my owner parser?
Python has a built-in package called re , which can be used to work with Regular Expressions.
You can't parse [X]HTML with regex. Because HTML can't be parsed by regex. Regex is not a tool that can be used to correctly parse HTML.
The Parse Regex operator (also called the extract operator) enables users comfortable with regular expression syntax to extract more complex data from log lines. Parse regex can be used, for example, to extract nested fields.
URL or Uniform Resource Locator consists of many information parts, such as the domain name, path, port number etc. Any URL can be processed and parsed using Regular Expression. So for using Regular Expression we have to use re library in Python.
You can maybe just use this:
import sre_parse
sre_parse.parse(r'(\d+)foo(.*)')
You can only specify a (classic) regex using a context free grammar:
regex = { alternatives };
alternatives = primitive { '|' alternatives } ;
primitive = '(' regex ')' | '[' character_set ']' | ...
This means you can't parse a regex using a regex (Perl is an exception, but then its "regexes" are extended way beyond "classic").
So, to parse a regex, you'll need to build your own parser and constructs some kind of tree (re.Debug comes pretty close) or that magic library you are hoping for.
I suspect this is the easy part. This isn't terribly hard to do yourself; see Is there an alternative for flex/bison that is usable on 8-bit embedded systems? for a straightforward scheme for building such parsers.
To understand the semantics of the regex (e.g., to figure out "necessary substrings"), you might be able to get away with building an analyzer the walks over the parse tree, and for each subtree (bottom up), computes the common-string. Failing that you may have to implement the classic NDFA construction and then walk over it, or implement the NDFA to DFA construction and walk over the DFA. Real regexes tend to contain lots of messy complications such as built-in character sets, capture groups, etc.
The "common string" might not be just a contiguous sequence of characters although you could define it narrowly as such. It might include several constant substrings separated by fixed or variable length gaps of characters, e.g., your necessary substring might always itself be expressible as a "simple regex" of the form:
(<character>+ ?+) <character>+
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With