I am toying with Boost.Spirit. As part of a larger work I am trying to construct a grammar for parsing C/C++ style string literals. I encountered a problem:
How do I create a sub-grammar that appends a std::string() result to the calling grammar's std::string() attribute (instead of just a char)?
Here is my code, which is working so far. (Actually I already got much more than that, including stuff like '\n' etc., but I cut it down to the essentials.)
#define BOOST_SPIRIT_UNICODE
#include <iostream>
#include <string>
#include <boost/spirit/include/qi.hpp>
#include <boost/spirit/include/phoenix_operator.hpp>

using namespace boost;
using namespace boost::spirit;
using namespace boost::spirit::qi;

template < typename Iterator >
struct EscapedUnicode : grammar< Iterator, char() > // <-- should be std::string
{
    EscapedUnicode() : EscapedUnicode::base_type( escaped_unicode )
    {
        escaped_unicode %= "\\" > ( ( "u" >> uint_parser< char, 16, 4, 4 >() )
                                  | ( "U" >> uint_parser< char, 16, 8, 8 >() ) );
    }

    rule< Iterator, char() > escaped_unicode; // <-- should be std::string
};

template < typename Iterator >
struct QuotedString : grammar< Iterator, std::string() >
{
    QuotedString() : QuotedString::base_type( quoted_string )
    {
        quoted_string %= '"' >> *( escaped_unicode | ( char_ - ( '"' | eol ) ) ) >> '"';
    }

    EscapedUnicode< Iterator > escaped_unicode;
    rule< Iterator, std::string() > quoted_string;
};

int main()
{
    std::string input = "\"foo\u0041\"";
    typedef std::string::const_iterator iterator_type;
    QuotedString< iterator_type > qs;
    std::string result;
    bool r = parse( input.cbegin(), input.cend(), qs, result );
    std::cout << result << std::endl;
}
This prints fooA: the QuotedString grammar calls the EscapedUnicode grammar, which results in a char being added to the std::string attribute of QuotedString (the A, 0x41).
But of course I would need to generate a sequence of chars (bytes) for anything beyond 0x7f. EscapedUnicode would need to produce a std::string, which would have to be appended to the string generated by QuotedString.
And that is where I've met a roadblock. I do not understand the things Boost.Spirit does in concert with Boost.Phoenix, and any attempts I have made resulted in lengthy and pretty much undecipherable template-related compiler errors.
So, how can I do this? The answer need not actually do the proper Unicode conversion; it's the std::string issue I need a solution for.
A few points applied:
- Avoid using namespace in relation to highly generic code. ADL will ruin your day unless you control it.
- %= is auto-rule assignment, meaning that automatic attribute propagation will be forced even in the presence of semantic actions. You don't want that here, because the attribute exposed by uint_parser will not be propagated correctly on its own if you want to encode into a multi-byte string representation.
- The input string
std::string input = "\"foo\u0041\"";
needed to be
std::string input = "\"foo\\u0041\"";
otherwise the compiler does the escape handling before the parser even runs :)
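The difference between the two literals can be checked without Spirit at all. The following is a small sketch, assuming a C++11 or later compiler; the helper names translated_by_compiler and left_for_parser are made up for this illustration and do not appear in the original code.

```cpp
#include <cassert>
#include <string>

// Hypothetical helpers, for illustration only.

// "\u0041" is a universal-character-name: the compiler replaces it with 'A',
// so no backslash ever reaches the parser.
inline std::string translated_by_compiler() { return "\"foo\u0041\""; }

// "\\u0041" is an escaped backslash followed by the characters u0041, so the
// six characters \u0041 survive into the string for the grammar to parse.
inline std::string left_for_parser() { return "\"foo\\u0041\""; }
```

The first string holds the six characters "fooA" (quotes included), while the second holds eleven characters, with the backslash at index 4.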
Here come the specific tricks for the meat of the task:
You will want to change the rule's declared attribute to something that Spirit will automatically "flatten" in simple sequences. E.g.
quoted_string = '"' >> *(escaped_unicode | (qi::char_ - ('"' | qi::eol))) >> '"';
will not append, because the first branch of the alternative results in a sequence of char and the second in a single char. The following spelling of the equivalent:
quoted_string = '"' >> *(escaped_unicode | +(qi::char_ - ('"' | qi::eol | "\\u" | "\\U"))) >> '"';
subtly triggers the appending heuristic in Spirit, because both branches now expose a sequence of char, so we can achieve what we want without involving semantic actions. The extra "\\u" and "\\U" exclusions stop the greedy + from swallowing an escape sequence before escaped_unicode gets a chance to match it.
The rest is straightforward:
- Implement the actual encoding with a Phoenix function object:
struct encode_f {
    template <typename...> struct result { using type = void; };

    template <typename V, typename CP> void operator()(V& a, CP codepoint) const {
        // TODO implement desired encoding (e.g. UTF8)
        bio::stream<bio::back_insert_device<V> > os(a);
        os << "[0x" << std::hex << std::setw(std::numeric_limits<CP>::digits/4) << std::setfill('0') << codepoint << "]";
    }
};
boost::phoenix::function<encode_f> encode;
This you can then use like:
escaped_unicode = '\\' > ( ("u" >> uint_parser<uint16_t, 16, 4, 4>() [ encode(_val, _1) ])
                         | ("U" >> uint_parser<uint32_t, 16, 8, 8>() [ encode(_val, _1) ]) );
Because you mentioned you don't care about the specific encoding, I elected to encode the raw codepoint in 16bit or 32bit hex representation like [0x0041]. I pragmatically used Boost Iostreams, which is capable of writing directly into the attribute's container type.
- Use the BOOST_SPIRIT_DEBUG* macros.
Live On Coliru
//#define BOOST_SPIRIT_UNICODE
//#define BOOST_SPIRIT_DEBUG
#include <cstdint>
#include <iostream>
#include <limits>
#include <string>
#include <vector>
#include <boost/spirit/include/qi.hpp>
#include <boost/spirit/include/phoenix.hpp>

// for demo re-encoding
#include <boost/iostreams/device/back_inserter.hpp>
#include <boost/iostreams/stream.hpp>
#include <iomanip>

namespace qi  = boost::spirit::qi;
namespace bio = boost::iostreams;
namespace phx = boost::phoenix;

template <typename Iterator, typename Attr = std::vector<char> > // or std::string for that matter
struct EscapedUnicode : qi::grammar<Iterator, Attr()>
{
    EscapedUnicode() : EscapedUnicode::base_type(escaped_unicode)
    {
        using namespace qi;

        // plain '=' (not '%='): the semantic actions fill _val themselves
        escaped_unicode = '\\' > ( ("u" >> uint_parser<uint16_t, 16, 4, 4>() [ encode(_val, _1) ])
                                 | ("U" >> uint_parser<uint32_t, 16, 8, 8>() [ encode(_val, _1) ]) );

        BOOST_SPIRIT_DEBUG_NODES((escaped_unicode))
    }

    struct encode_f {
        template <typename...> struct result { using type = void; };

        template <typename V, typename CP> void operator()(V& a, CP codepoint) const {
            // TODO implement desired encoding (e.g. UTF8)
            bio::stream<bio::back_insert_device<V> > os(a);
            os << "[0x" << std::hex << std::setw(std::numeric_limits<CP>::digits/4) << std::setfill('0') << codepoint << "]";
        }
    };
    boost::phoenix::function<encode_f> encode;

    qi::rule<Iterator, Attr()> escaped_unicode;
};

template <typename Iterator>
struct QuotedString : qi::grammar<Iterator, std::string()>
{
    QuotedString() : QuotedString::base_type(start)
    {
        start = quoted_string;
        // both alternatives expose a sequence of char, so Spirit appends them
        quoted_string = '"' >> *(escaped_unicode | +(qi::char_ - ('"' | qi::eol | "\\u" | "\\U"))) >> '"';

        BOOST_SPIRIT_DEBUG_NODES((start)(quoted_string))
    }

    EscapedUnicode<Iterator> escaped_unicode;
    qi::rule<Iterator, std::string()> start;
    qi::rule<Iterator, std::vector<char>()> quoted_string;
};

int main() {
    std::string input = "\"foo\\u0041\\U00000041\"";

    typedef std::string::const_iterator iterator_type;
    QuotedString<iterator_type> qs;
    std::string result;
    bool r = qi::parse( input.cbegin(), input.cend(), qs, result );

    std::cout << std::boolalpha << r << ": '" << result << "'\n";
}
Prints:
true: 'foo[0x0041][0x00000041]'
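If real UTF-8 output is wanted instead of the bracketed hex placeholder, the TODO in encode_f could be filled in along the following lines. This is only a sketch using the standard library; the name append_utf8 and its bool return value are assumptions made for this illustration, not part of the answer above or of Boost.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Hypothetical helper (not part of Boost or the original answer): appends the
// UTF-8 encoding of `cp` to `out`, which may be any container with
// push_back(char), e.g. std::string or std::vector<char>.
// Returns false and leaves `out` untouched for surrogates and values above
// U+10FFFF, which have no UTF-8 encoding.
template <typename Container>
bool append_utf8(Container& out, std::uint32_t cp) {
    if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
        return false;

    if (cp < 0x80) {                       // 1 byte:  0xxxxxxx
        out.push_back(static_cast<char>(cp));
    } else if (cp < 0x800) {               // 2 bytes: 110xxxxx 10xxxxxx
        out.push_back(static_cast<char>(0xC0 | (cp >> 6)));
        out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
    } else if (cp < 0x10000) {             // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        out.push_back(static_cast<char>(0xE0 | (cp >> 12)));
        out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
        out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
    } else {                               // 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        out.push_back(static_cast<char>(0xF0 | (cp >> 18)));
        out.push_back(static_cast<char>(0x80 | ((cp >> 12) & 0x3F)));
        out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
        out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
    }
    return true;
}
```

Inside encode_f::operator() the stream-based body would then presumably shrink to a single call such as append_utf8(a, codepoint). How to react to a false return (ignore it, or fail the parse) is a design choice the original answer leaves open.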