Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Parse quoted strings with boost::spirit

I would like to parse a sentence where some strings may be unquoted, 'quoted' or "quoted". The code below almost works - but it fails to match closing quotes. I'm guessing this is because of the qq reference. A modification is commented in the code, the modification reults in "quoted' or 'quoted" also parsing and helps show the original problem is with the closing quote. The code also describes the exact grammar.

To be completely clear: unquoted strings parse. A quoted string like 'hello' will parse the open quote ', all the characters hello, but then fail to parse the final quote '.

I made another attempt, similar the begin/end tag matching in the boost tutorials, but without success.

template <typename Iterator>
struct test_parser : qi::grammar<Iterator, dectest::Test(), ascii::space_type>
{
    test_parser()
        :
    test_parser::base_type(test, "test")
    {
        using qi::fail;
        using qi::on_error;
        using qi::lit;
        using qi::lexeme;
        using ascii::char_;
        using qi::repeat;
        using namespace qi::labels;
        using boost::phoenix::construct;
        using boost::phoenix::at_c;
        using boost::phoenix::push_back;
        using boost::phoenix::val;
        using boost::phoenix::ref;
        using qi::space;

        char qq;          

        arrow = lit("->");

        open_quote = (char_('\'') | char_('"')) [ref(qq) = _1];  // Remember what the opening quote was
        close_quote = lit(val(qq));  // Close must match the open
        // close_quote = (char_('\'') | char_('"')); // Enable this line to get code 'almost' working

        quoted_string = 
            open_quote
            >> +ascii::alnum        
            >> close_quote; 

        unquoted_string %= +ascii::alnum;
        any_string %= (quoted_string | unquoted_string);

        test = 
            unquoted_string             [at_c<0>(_val) = _1] 
            > unquoted_string           [at_c<1>(_val) = _1]   
            > repeat(1,3)[any_string]   [at_c<2>(_val) = _1]
            > arrow
            > any_string                [at_c<3>(_val) = _1] 
            ;

        // .. <snip>set rule names
        on_error<fail>(/* <snip> */);
        // debug rules
    }

    qi::rule<Iterator> arrow;
    qi::rule<Iterator> open_quote;
    qi::rule<Iterator> close_quote;

    qi::rule<Iterator, std::string()> quoted_string;
    qi::rule<Iterator, std::string()> unquoted_string;
    qi::rule<Iterator, std::string()> any_string;     // A quoted or unquoted string

    qi::rule<Iterator, dectest::Test(), ascii::space_type> test;

};


// main()
// This example should fail at the very end 
// (ie not parse "str3' because of the mismatched quote
// However, it fails to parse the closing quote of str1
typedef boost::tuple<string, string, vector<string>, string> DataT;
DataT data;
std::string str("addx001 add 'str1'   \"str2\"       ->  \"str3'");
std::string::const_iterator iter = str.begin();
const std::string::const_iterator end = str.end();
bool r = phrase_parse(iter, end, grammar, boost::spirit::ascii::space, data);

For bonus credit: A solution that avoid a local data member (such as char qq in above example) would be preferred, but from a practical point of view I'll use anything that works!

like image 820
Zero Avatar asked Apr 24 '12 00:04

Zero


1 Answers

The reference to qq becomes dangling after leaving the constructor, so that is indeed a problem.

qi::locals is the canonical way to keep local state inside parser expressions. Your other option would be to extend the lifetime of qq (by making it a member of the grammar class, e.g.). Lastly, you might be interested in inherited attributes as well. This mechanism gives you a way to call a rule/grammar with 'parameters' (passing local state around).

NOTE There are caveats with the use of the kleene operator +: it is greedy, and parsing fails if the string is not terminated with the expected quote.

See another answer I wrote for more complete examples of treating arbitrary contents in (optionally/partially) quoted strings, that allow escaping of quotes inside quoted strings and more things like that:

  • How to make my split work only on one real line and be capable to skip quoted parts of string?

I've reduced the grammar to the relevant bit, and included a few test cases:

#include <boost/spirit/include/qi.hpp>
#include <boost/spirit/include/phoenix.hpp>
#include <boost/fusion/adapted.hpp>

namespace qi = boost::spirit::qi;

template <typename Iterator>
struct test_parser : qi::grammar<Iterator, std::string(), qi::space_type, qi::locals<char> >
{
    test_parser() : test_parser::base_type(any_string, "test")
    {
        using namespace qi;

        quoted_string = 
               omit    [ char_("'\"") [_a =_1] ]             
            >> no_skip [ *(char_ - char_(_a))  ]
            >> lit(_a)
        ; 

        any_string = quoted_string | +qi::alnum;
    }

    qi::rule<Iterator, std::string(), qi::space_type, qi::locals<char> > quoted_string, any_string;
};

int main()
{
    test_parser<std::string::const_iterator> grammar;
    const char* strs[] = { "\"str1\"", 
                           "'str2'",
                           "'str3' trailing ok",
                           "'st\"r4' embedded also ok",
                           "str5",
                           "str6'",
                           NULL };

    for (const char** it = strs; *it; ++it)
    {
        const std::string str(*it);
        std::string::const_iterator iter = str.begin();
        std::string::const_iterator end  = str.end();

        std::string data;
        bool r = phrase_parse(iter, end, grammar, qi::space, data);

        if (r)
            std::cout << "Parsed:    " << str << " --> " << data << "\n";
        if (iter!=end)
            std::cout << "Remaining: " << std::string(iter,end) << "\n";
    }
}

Output:

Parsed:    "str1" --> str1
Parsed:    'str2' --> str2
Parsed:    'str3' trailing ok --> str3
Remaining: trailing ok
Parsed:    'st"r4' embedded also ok --> st"r4
Remaining: embedded also ok
Parsed:    str5 --> str5
Parsed:    str6' --> str6
Remaining: '
like image 165
sehe Avatar answered Oct 06 '22 02:10

sehe