
C++ Parsing Library with UTF-8 support

Tags: c++, parsing, utf-8

Let's say I want to make a parser for a programming language (EBNF already known), and I want it done with as little fuss as possible. I also want to support identifiers made of any Unicode letters, encoded as UTF-8. And I want it in C++.

flex/bison have no UTF-8 support, as far as I can tell. ANTLR does not seem to have a working C++ output.

I've considered boost::spirit, but they state on their site that it's not actually meant for a full parser.

What else is left? Rolling it entirely by hand?

Lanbo asked Dec 09 '11 15:12


2 Answers

If you don't find something that has the support you want, don't forget that flex is mostly independent of the encoding. It lexes an octet stream, and I've used it to lex pure binary data. Something encoded in UTF-8 is an octet stream and can be handled by flex if you accept doing some of the work manually. I.e., instead of having

idletter [a-zA-Z]

if you want to accept as a letter everything in the printable part of the Latin-1 Supplement block except the NBSP (in other words, the range U+00A1-U+00FF), you have to do something like this (I may have messed up the encoding, but you get the idea)

idletter [a-zA-Z]|\xC2[\xA1-\xBF]|\xC3[\x80-\xBF]

You could even write a preprocessor that does most of the work for you, i.e. one that replaces \u00A1 with \xC2\xA1 and [\u00A1-\u00FF] with \xC2[\xA1-\xBF]|\xC3[\x80-\xBF]. How much work the preprocessor is depends on how generic you want your input to be. At some point you'd probably be better off integrating the work into flex and contributing it upstream.
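To give an idea of what the core of such a preprocessor might look like, here is a minimal C++ sketch. It only covers code points that encode to two UTF-8 bytes (U+0080 to U+07FF), and the names hex_escape and flex_range_2byte are made up for this illustration; they are not part of flex or of any existing tool.

```cpp
#include <cstdint>
#include <cstdio>
#include <stdexcept>
#include <string>

// Hypothetical helper: format one byte as a flex-style \xHH escape.
std::string hex_escape(std::uint32_t byte) {
    char buf[8];
    std::snprintf(buf, sizeof buf, "\\x%02X",
                  static_cast<unsigned>(byte & 0xFF));
    return buf;
}

// Hypothetical helper: expand the code point range [lo, hi] into a flex
// alternation of byte patterns, one alternative per UTF-8 lead byte.
// Assumption: only two-byte sequences (U+0080..U+07FF) are handled.
std::string flex_range_2byte(std::uint32_t lo, std::uint32_t hi) {
    if (lo < 0x80 || hi > 0x7FF || lo > hi)
        throw std::invalid_argument("range must lie within U+0080..U+07FF");

    const std::uint32_t first_lead = lo >> 6;  // upper 5 bits of lo
    const std::uint32_t last_lead = hi >> 6;   // upper 5 bits of hi
    std::string out;
    for (std::uint32_t lead = first_lead; lead <= last_lead; ++lead) {
        // Continuation bytes carry the low 6 bits; clamp at the range ends.
        const std::uint32_t first = (lead == first_lead) ? (lo & 0x3F) : 0x00;
        const std::uint32_t last = (lead == last_lead) ? (hi & 0x3F) : 0x3F;
        if (!out.empty()) out += '|';
        out += hex_escape(0xC0 | lead);
        if (first == last) {
            out += hex_escape(0x80 | first);
        } else {
            out += '[' + hex_escape(0x80 | first) + '-' +
                   hex_escape(0x80 | last) + ']';
        }
    }
    return out;
}
```

Calling flex_range_2byte(0xA1, 0xFF) produces one alternative for lead byte C2 and one for lead byte C3. A real preprocessor would also need to parse the \uXXXX notation out of the flex source and handle three- and four-byte sequences, which this sketch leaves out.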

AProgrammer answered Oct 08 '22 19:10


A parser works with tokens; it's not its duty to know the encoding. It will usually just compare the ids of the tokens, and if you code special rules of your own, you can compare the underlying UTF-8 strings the way you would anywhere else.

So you need a UTF-8 lexer? Well, it highly depends on how you define your problem. If you define your identifiers to consist of ASCII alphanumerics plus anything non-ASCII, then flex will suit your needs just fine. If you want to actually feed Unicode ranges to the lexer, you'll need something more complicated. You can look at Quex. I've never used it myself, but it claims to support Unicode. (Although I would kill somebody for "free tell/seek based on character indices".)
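To illustrate the simple definition (ASCII alphanumerics plus any non-ASCII byte), here is a rough hand-written C++ sketch of the identifier rule. The function names are hypothetical, and the rule that an identifier may not start with a digit is my own assumption, not something stated above. It does not validate that the bytes form well-formed UTF-8.

```cpp
#include <cstddef>
#include <string>

// Bytes allowed at the start of an identifier: ASCII letters or any
// non-ASCII byte (0x80 and above, i.e. part of a UTF-8 multi-byte sequence).
bool is_ident_start(unsigned char b) {
    return (b >= 'a' && b <= 'z') || (b >= 'A' && b <= 'Z') || b >= 0x80;
}

// Bytes allowed after the first one: the above plus ASCII digits.
bool is_ident_part(unsigned char b) {
    return is_ident_start(b) || (b >= '0' && b <= '9');
}

// Returns the length in bytes of the identifier starting at s[pos],
// or 0 if no identifier starts there.
std::size_t scan_identifier(const std::string& s, std::size_t pos = 0) {
    if (pos >= s.size() ||
        !is_ident_start(static_cast<unsigned char>(s[pos]))) {
        return 0;
    }
    std::size_t end = pos + 1;
    while (end < s.size() &&
           is_ident_part(static_cast<unsigned char>(s[end]))) {
        ++end;
    }
    return end - pos;
}
```

The equivalent flex rule would be a character class over the same byte ranges. The trade-off is that non-letter code points such as punctuation outside ASCII are accepted as identifier characters too.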

EDIT: Here is a similar question. It claims that flex won't work because of a bug: flex ignores the fact that some implementations have a signed char. It may be outdated, though.
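This is not a reproduction of the flex issue, only a small C++ sketch of the general pitfall: where plain char is signed, UTF-8 bytes of 0x80 and above are negative values, so byte comparisons should go through unsigned char. The function name count_non_ascii is made up for the example.

```cpp
#include <cstddef>
#include <string>

// Counts the bytes in s that are outside the ASCII range.
// The cast matters: with a signed char, `c >= 0x80` would never be true.
std::size_t count_non_ascii(const std::string& s) {
    std::size_t n = 0;
    for (char c : s) {
        if (static_cast<unsigned char>(c) >= 0x80) {
            ++n;
        }
    }
    return n;
}
```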

Yakov Galka answered Oct 08 '22 19:10