How can I get the abstract syntax tree (AST) of a regular expression (in C++)?
For example,
(XYZ)|(123)
should yield a tree of:
|
/ \
. .
/ \ / \
. Z . 3
/ \ / \
X Y 1 2
Is there a boost::spirit grammar to parse regular expression patterns? The boost::regex library should have it, but I didn't find it. Are there any other open-source tools available that would give me the abstract representation of a regex?
To match a character having special meaning in regex, you need to use a escape sequence prefix with a backslash ( \ ). E.g., \. matches "." ; regex \+ matches "+" ; and regex \( matches "(" . You also need to use regex \\ to match "\" (back-slash).
$ means "Match the end of the string" (the position after the last character in the string).
For example, the replacement pattern $1 indicates that the matched substring is to be replaced by the first captured group.
I stumbled into this question again. And I decided to take a look at how hard it would actually be to write a parser for a significant subset of regular expression syntax with Boost Spirit.
So, as usual, I started out with pen and paper, and after a while had some draft rules in mind. Time to draw the analogous AST up:
namespace ast
{
struct multiplicity
{
unsigned minoccurs;
boost::optional<unsigned> maxoccurs;
bool greedy;
multiplicity(unsigned minoccurs = 1, boost::optional<unsigned> maxoccurs = 1)
: minoccurs(minoccurs), maxoccurs(maxoccurs), greedy(true)
{ }
bool unbounded() const { return !maxoccurs; }
bool repeating() const { return !maxoccurs || *maxoccurs > 1; }
};
struct charset
{
bool negated;
using range = boost::tuple<char, char>; // from, till
using element = boost::variant<char, range>;
std::set<element> elements;
// TODO: single set for loose elements, simplify() method
};
struct start_of_match {};
struct end_of_match {};
struct any_char {};
struct group;
typedef boost::variant< // unquantified expression
start_of_match,
end_of_match,
any_char,
charset,
std::string, // literal
boost::recursive_wrapper<group> // sub expression
> simple;
struct atom // quantified simple expression
{
simple expr;
multiplicity mult;
};
using sequence = std::vector<atom>;
using alternative = std::vector<sequence>;
using regex = boost::variant<atom, sequence, alternative>;
struct group {
alternative root;
group() = default;
group(alternative root) : root(std::move(root)) { }
};
}
This is your typical AST (58 LoC) that works well with Spirit (due to integrating with boost via variant and optional, as well as having strategically chosen constructors).
The grammar ended up being only slightly longer:
template <typename It>
struct parser : qi::grammar<It, ast::alternative()>
{
parser() : parser::base_type(alternative)
{
using namespace qi;
using phx::construct;
using ast::multiplicity;
alternative = sequence % '|';
sequence = *atom;
simple =
(group)
| (charset)
| ('.' >> qi::attr(ast::any_char()))
| ('^' >> qi::attr(ast::start_of_match()))
| ('$' >> qi::attr(ast::end_of_match()))
// optimize literal tree nodes by grouping unquantified literal chars
| (as_string [ +(literal >> !char_("{?+*")) ])
| (as_string [ literal ]) // lone char/escape + explicit_quantifier
;
atom = (simple >> quantifier); // quantifier may be implicit
explicit_quantifier =
// bounded ranges:
lit('?') [ _val = construct<multiplicity>( 0, 1) ]
| ('{' >> uint_ >> '}' ) [ _val = construct<multiplicity>(_1, _1) ]
// repeating ranges can be marked non-greedy:
| (
lit('+') [ _val = construct<multiplicity>( 1, boost::none) ]
| lit('*') [ _val = construct<multiplicity>( 0, boost::none) ]
| ('{' >> uint_ >> ",}") [ _val = construct<multiplicity>(_1, boost::none) ]
| ('{' >> uint_ >> "," >> uint_ >> '}') [ _val = construct<multiplicity>(_1, _2) ]
| ("{," >> uint_ >> '}' ) [ _val = construct<multiplicity>( 0, _1) ]
) >> -lit('?') [ phx::bind(&multiplicity::greedy, _val) = false ]
;
quantifier = explicit_quantifier | attr(ast::multiplicity());
charset = '['
>> (lit('^') >> attr(true) | attr(false)) // negated
>> *(range | charset_el)
> ']'
;
range = charset_el >> '-' >> charset_el;
group = '(' >> alternative >> ')';
literal = unescape | ~char_("\\+*?.^$|{()") ;
unescape = ('\\' > char_);
// helper to optionally unescape waiting for raw ']'
charset_el = !lit(']') >> (unescape|char_);
}
private:
qi::rule<It, ast::alternative()> alternative;
qi::rule<It, ast::sequence()> sequence;
qi::rule<It, ast::atom()> atom;
qi::rule<It, ast::simple()> simple;
qi::rule<It, ast::multiplicity()> explicit_quantifier, quantifier;
qi::rule<It, ast::charset()> charset;
qi::rule<It, ast::charset::range()> range;
qi::rule<It, ast::group()> group;
qi::rule<It, char()> literal, unescape, charset_el;
};
Now, the real fun is to do something with the AST. Since you want to visualize the tree, I thought of generating DOT graph from the AST. So I did:
int main()
{
std::cout << "digraph common {\n";
for (std::string pattern: {
"abc?",
"ab+c",
"(ab)+c",
"[^-a\\-f-z\"\\]aaaa-]?",
"abc|d",
"a?",
".*?(a|b){,9}?",
"(XYZ)|(123)",
})
{
std::cout << "// ================= " << pattern << " ========\n";
ast::regex tree;
if (doParse(pattern, tree))
{
check_roundtrip(tree, pattern);
regex_todigraph printer(std::cout, pattern);
boost::apply_visitor(printer, tree);
}
}
std::cout << "}\n";
}
This program results in the following graphs:

The self-edges depict repeats and the colour indicates whether the match is greedy (red) or non-greedy (blue). As you can see I've optimized the AST a bit for clarity, but (un)commenting the relevant lines will make the difference:

I think it wouldn't be too hard to tune. Hopefully it will serve as inspiration to someone.
Full code at this gist: https://gist.github.com/sehe/8678988
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With