How to get the AST of a regular expression string?

Tags:

How can I get the abstract syntax tree (AST) of a regular expression (in C++)?

For example,

 (XYZ)|(123)

should yield a tree of:

        |
      /   \
    .       .
   / \     / \
  .   Z   .   3    
 / \     / \   
X  Y     1 2

Is there a boost::spirit grammar to parse regular expression patterns? The boost::regex library should have it, but I didn't find it. Are there any other open-source tools available that would give me the abstract representation of a regex?

552

asked Oct 05 '11 05:10

Frank

1 Answers

I stumbled into this question again. And I decided to take a look at how hard it would actually be to write a parser for a significant subset of regular expression syntax with Boost Spirit.

So, as usual, I started out with pen and paper, and after a while had some draft rules in mind. Time to draw the analogous AST up:

namespace ast
{
    struct multiplicity 
    {
        unsigned minoccurs;
        boost::optional<unsigned> maxoccurs;
        bool greedy;

        multiplicity(unsigned minoccurs = 1, boost::optional<unsigned> maxoccurs = 1) 
            : minoccurs(minoccurs), maxoccurs(maxoccurs), greedy(true)
        { }

        bool unbounded() const { return !maxoccurs; }
        bool repeating() const { return !maxoccurs || *maxoccurs > 1; }
    };

    struct charset
    {
        bool negated;

        using range   = boost::tuple<char, char>; // from, till
        using element = boost::variant<char, range>;

        std::set<element> elements; 
        // TODO: single set for loose elements, simplify() method
    };

    struct start_of_match {};
    struct end_of_match {};
    struct any_char {};
    struct group;

    typedef boost::variant<   // unquantified expression
        start_of_match,
        end_of_match,
        any_char,
        charset,
        std::string,          // literal
        boost::recursive_wrapper<group> // sub expression
    > simple;

    struct atom               // quantified simple expression
    {
        simple       expr;
        multiplicity mult;
    };

    using sequence    = std::vector<atom>;
    using alternative = std::vector<sequence>;
    using regex       = boost::variant<atom, sequence, alternative>;

    struct group {
        alternative root;

        group() = default;
        group(alternative root) : root(std::move(root)) { }
    };
}

This is your typical AST (58 LoC) that works well with Spirit (due to integrating with boost via variant and optional, as well as having strategically chosen constructors).

The grammar ended up being only slightly longer:

template <typename It>
    struct parser : qi::grammar<It, ast::alternative()>
{
    parser() : parser::base_type(alternative)
    {
        using namespace qi;
        using phx::construct;
        using ast::multiplicity;

        alternative = sequence % '|';
        sequence    = *atom;

        simple      = 
                      (group)
                    | (charset)
                    | ('.' >> qi::attr(ast::any_char()))
                    | ('^' >> qi::attr(ast::start_of_match()))
                    | ('$' >> qi::attr(ast::end_of_match()))
                    // optimize literal tree nodes by grouping unquantified literal chars
                    | (as_string [ +(literal >> !char_("{?+*")) ])
                    | (as_string [ literal ]) // lone char/escape + explicit_quantifier
                    ;

        atom        = (simple >> quantifier); // quantifier may be implicit

        explicit_quantifier  =
                    // bounded ranges:
                      lit('?')                                   [ _val = construct<multiplicity>( 0, 1)   ]
                    | ('{'  >> uint_ >> '}' )                    [ _val = construct<multiplicity>(_1, _1)  ]
                    // repeating ranges can be marked non-greedy:
                    | (                                        
                          lit('+')                               [ _val = construct<multiplicity>( 1, boost::none) ]
                        | lit('*')                               [ _val = construct<multiplicity>( 0, boost::none) ]
                        | ('{'  >> uint_ >> ",}")                [ _val = construct<multiplicity>(_1, boost::none) ]
                        | ('{'  >> uint_ >> "," >> uint_ >> '}') [ _val = construct<multiplicity>(_1, _2)  ]
                        | ("{," >> uint_ >> '}' )                [ _val = construct<multiplicity>( 0, _1)  ]
                      ) >> -lit('?')       [ phx::bind(&multiplicity::greedy, _val) = false ]
                    ;

        quantifier = explicit_quantifier | attr(ast::multiplicity());

        charset     = '[' 
                   >> (lit('^') >> attr(true) | attr(false)) // negated
                   >> *(range | charset_el)
                    > ']'
                    ;

        range       = charset_el >> '-' >> charset_el;

        group       = '(' >> alternative >> ')';

        literal     = unescape | ~char_("\\+*?.^$|{()") ;

        unescape    = ('\\' > char_);

        // helper to optionally unescape waiting for raw ']'
        charset_el  = !lit(']') >> (unescape|char_);
    }

  private:
    qi::rule<It, ast::alternative()>    alternative;
    qi::rule<It, ast::sequence()>       sequence;
    qi::rule<It, ast::atom()>           atom;
    qi::rule<It, ast::simple()>         simple;
    qi::rule<It, ast::multiplicity()>   explicit_quantifier, quantifier;
    qi::rule<It, ast::charset()>        charset;
    qi::rule<It, ast::charset::range()> range;
    qi::rule<It, ast::group()>          group;
    qi::rule<It, char()>                literal, unescape, charset_el;
};

Now, the real fun is to do something with the AST. Since you want to visualize the tree, I thought of generating DOT graph from the AST. So I did:

int main()
{
    std::cout << "digraph common {\n";

    for (std::string pattern: { 
            "abc?",
            "ab+c",
            "(ab)+c",
            "[^-a\\-f-z\"\\]aaaa-]?",
            "abc|d",
            "a?",
            ".*?(a|b){,9}?",
            "(XYZ)|(123)",
        })
    {
        std::cout << "// ================= " << pattern << " ========\n";
        ast::regex tree;
        if (doParse(pattern, tree))
        {
            check_roundtrip(tree, pattern);

            regex_todigraph printer(std::cout, pattern);
            boost::apply_visitor(printer, tree);
        }
    }

    std::cout << "}\n";
}

This program results in the following graphs:

enter image description here

The self-edges depict repeats and the colour indicates whether the match is greedy (red) or non-greedy (blue). As you can see I've optimized the AST a bit for clarity, but (un)commenting the relevant lines will make the difference:

enter image description here

I think it wouldn't be too hard to tune. Hopefully it will serve as inspiration to someone.

Full code at this gist: https://gist.github.com/sehe/8678988

149

answered Oct 22 '22 04:10

sehe

Related questions
                            
                                C++ code dependency / call-graph "viewer"? [closed]
                            
                                Bitshifting to multiply an integer by 10
                            
                                overloaded 'operator+' must be a unary or binary operator error
                            
                                error C2061: syntax error : identifier [closed]
                            
                                Why does this C++ static singleton never stop?
                            
                                Using O_RDWR vs O_RDONLY | O_WRONLY
                            
                                Can I specify an integer constant by its bytes?
                            
                                appending to a file with ofstream [duplicate]
                            
                                Should I include every header?
                            
                                When to use `asio_handler_invoke`?
                            
                                Deleting copy constructors and copy assignment operators. Which of them are essential?
                            
                                The body of constexpr function not a return-statement
                            
                                Qt and no moc_*.cpp file
                            
                                Force compiler to not optimize side-effect-less statements
                            
                                Why callback functions needs to be static when declared in class
                            
                                Difference between static const char* and const char*
                            
                                Shared Memory or mmap - Linux C/C++ IPC
                            
                                Incompatibility between C and C++ code
                            
                                Understanding how to correctly treat c++ class constants
                            
                                Always create classes in C++? [closed]

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With

How to get the AST of a regular expression string?

Tags:

c++

regex

boost

Frank

People also ask

1 Answers

sehe

Recent Activity

Donate For Us