
Boost Spirit (X3) symbol tables resulting in UTF8 strings

I'm trying to parse LaTeX escape codes (e.g. \alpha) to the Unicode (Mathematical) characters (i.e. U+1D6FC).

Right now this means I am using this symbols parser (rule):

struct greek_lower_case_letters_ : x3::symbols<char32_t>
{
    greek_lower_case_letters_() { add("alpha", U'\u03B1'); }
} greek_lower_case_letter;

This works fine, but it means I get a std::u32string as a result. I'd like an elegant way to keep the Unicode code points in the code, both for maintenance and for possible future automation. Is there a way to get this kind of parser to parse into a UTF-8 std::string?

I thought of making the symbols struct parse to a std::string, but that would be highly inefficient (I know, premature optimization bla bla).

I was hoping there was some elegant way instead of going through a bunch of hoops to get this working (symbols appending strings to the result).

I do fear, though, that using the code point values and wanting UTF-8 will incur the runtime cost of a conversion (or is a constexpr UTF-32 -> UTF-8 conversion possible?).

rubenvb asked Dec 18 '15 20:12


1 Answer

The JSON parser example at cierelabs shows an approach that uses semantic actions to append code points in UTF-8 encoding:

  auto push_utf8 = [](auto& ctx)
  {
     typedef std::back_insert_iterator<std::string> insert_iter;
     insert_iter out_iter(_val(ctx));
     boost::utf8_output_iterator<insert_iter> utf8_iter(out_iter);
     *utf8_iter++ = _attr(ctx);
  };

  // ...

  auto const escape =
         ('u' > hex4)           [push_utf8]
     |   char_("\"\\/bfnrt")    [push_esc]
     ;

This is used in their rule type:

typedef x3::rule<unicode_string_class, std::string> unicode_string_type;

As you can see, this builds the UTF-8 sequence into a std::string attribute.
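To make the per-code-point work concrete, and to touch on the constexpr question, here is a minimal Boost-free sketch of the same encoding step. The names utf8_bytes, to_utf8 and append_utf8 are hypothetical helpers invented for this illustration; they are not part of Boost or the cierelabs parser. The sketch assumes C++14 or later and a valid input code point.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper type: holds the 1 to 4 UTF-8 bytes of one code point.
struct utf8_bytes
{
    char data[4];
    std::size_t size;
};

// Encodes one code point as UTF-8. Because it is constexpr, the compiler can
// evaluate it at compile time when the argument is a constant.
// Assumes cp is a valid Unicode scalar value (<= 0x10FFFF, not a surrogate).
constexpr utf8_bytes to_utf8(char32_t cp)
{
    utf8_bytes r{};
    if (cp < 0x80) {
        r.data[r.size++] = static_cast<char>(cp);
    } else if (cp < 0x800) {
        r.data[r.size++] = static_cast<char>(0xC0 | (cp >> 6));
        r.data[r.size++] = static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        r.data[r.size++] = static_cast<char>(0xE0 | (cp >> 12));
        r.data[r.size++] = static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        r.data[r.size++] = static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        r.data[r.size++] = static_cast<char>(0xF0 | (cp >> 18));
        r.data[r.size++] = static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        r.data[r.size++] = static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        r.data[r.size++] = static_cast<char>(0x80 | (cp & 0x3F));
    }
    return r;
}

// Appends the encoded bytes to a std::string, roughly what a semantic
// action like push_utf8 does for each parsed code point.
inline void append_utf8(std::string& out, char32_t cp)
{
    assert(cp <= 0x10FFFF);  // values beyond the Unicode range are a caller error
    const utf8_bytes b = to_utf8(cp);
    out.append(b.data, b.size);
}

// Evaluated at compile time: U+03B1 needs two bytes in UTF-8.
static_assert(to_utf8(U'\u03B1').size == 2, "alpha encodes to two bytes");
```

With something like this, a symbols table could keep its char32_t code points and a semantic action could call append_utf8(_val(ctx), _attr(ctx)); whether that is preferable to boost::utf8_output_iterator is a judgment call.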

For the full code, see: https://github.com/cierelabs/json_spirit/blob/x3_devel/ciere/json/parser/x3_grammar_def.hpp

sehe answered Nov 01 '22 04:11
