I'm trying to parse LaTeX escape codes (e.g. \alpha) to the corresponding Unicode (Mathematical) characters (i.e. U+1D6FC).
Right now this means I am using this symbols parser (rule):
struct greek_lower_case_letters_ : x3::symbols<char32_t>
{
    greek_lower_case_letters_()
    {
        add("alpha", U'\u03B1');
    }
} greek_lower_case_letter;
This works fine but means I'm getting a std::u32string as a result.
I'd like an elegant way to keep the Unicode code points in the code, for maintenance reasons and possible future automation. Is there a way to get this kind of parser to parse into a UTF-8 std::string?
I thought of making the symbols struct parse into a std::string, but that would be highly inefficient (I know, premature optimization, bla bla).
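For concreteness, that idea would look something like this (a sketch only; the struct name is made up, and the entry stores the UTF-8 bytes for U+03B1 directly):

// Hypothetical alternative: store ready-made UTF-8 bytes in the symbol table,
// so the parser's attribute is already a std::string.
struct greek_lower_utf8_ : x3::symbols<std::string>
{
    greek_lower_utf8_()
    {
        add("alpha", "\xCE\xB1");   // U+03B1 encoded as UTF-8
    }
} greek_lower_utf8;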
I was hoping there was some elegant way instead of going through a bunch of hoops to get this working (symbols appending strings to the result).
I do fear, though, that using the code point values and wanting UTF-8 will incur the runtime cost of the conversion (or is there a constexpr UTF-32 to UTF-8 conversion possible?).
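For reference, the conversion itself is only a few branches per code point, so it can be hand-rolled as a small helper; something along these lines (a sketch with a hypothetical name, no validation of surrogates or out-of-range values, and constexpr only if you encode into a fixed buffer rather than a std::string before C++20):

#include <string>

// Append one UTF-32 code point to a UTF-8 encoded std::string.
inline void append_utf8(char32_t cp, std::string& out)
{
    if (cp < 0x80)
    {
        out += static_cast<char>(cp);
    }
    else if (cp < 0x800)
    {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    else if (cp < 0x10000)
    {
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    else
    {
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
}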
The JSON parser example at cierelabs shows an approach that uses semantic actions to append code points in UTF-8 encoding:
auto push_utf8 = [](auto& ctx)
{
    typedef std::back_insert_iterator<std::string> insert_iter;
    insert_iter out_iter(_val(ctx));
    boost::utf8_output_iterator<insert_iter> utf8_iter(out_iter);
    *utf8_iter++ = _attr(ctx);
};
// ...
auto const escape =
      ('u' > hex4)            [push_utf8]
    | char_("\"\\/bfnrt")     [push_esc]
    ;
This is used in their

typedef x3::rule<unicode_string_class, std::string> unicode_string_type;

which, as you can see, builds the UTF-8 sequence into a std::string attribute.
See the full code at https://github.com/cierelabs/json_spirit/blob/x3_devel/ciere/json/parser/x3_grammar_def.hpp
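The building block doing the actual work there is boost::utf8_output_iterator (which, if I recall correctly, lives in <boost/regex/pending/unicode_iterator.hpp>). It can be tried in isolation with the char32_t values from the symbols table; a rough sketch:

#include <boost/regex/pending/unicode_iterator.hpp>
#include <iterator>
#include <string>

int main()
{
    std::string utf8;
    auto out = std::back_inserter(utf8);
    boost::utf8_output_iterator<decltype(out)> utf8_iter(out);

    *utf8_iter++ = U'\u03B1';   // appends the two UTF-8 bytes 0xCE 0xB1
}

The same action-based approach as in the JSON grammar should then carry over to the greek_lower_case_letter table from the question, with the enclosing rule's attribute declared as std::string.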