Can Bison parse UTF-8 characters?

Tags: c++, utf-8, bison

I'm trying to make a Bison parser to handle UTF-8 characters. I don't want the parser to actually interpret the Unicode character values, but I want it to parse the UTF-8 string as a sequence of bytes.

Right now, Bison generates the following code, which is problematic:

  if (yychar <= YYEOF)
    {
      yychar = yytoken = YYEOF;
      YYDPRINTF ((stderr, "Now at end of input.\n"));
    }

The problem is that many bytes of a UTF-8 string have a negative value when they are read as a signed char, and Bison treats any negative value as EOF and stops.
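To make the point above concrete: every byte of a multi-byte UTF-8 sequence has its high bit set, so on platforms where plain char is signed those bytes widen to negative ints. The sketch below only counts such bytes; count_high_bytes is a hypothetical helper name, not something Bison provides.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper (not part of Bison): counts the bytes that have the
// high bit set. Where plain char is signed, exactly these bytes become
// negative when widened to int.
inline std::size_t count_high_bytes(const std::string& s) {
    std::size_t n = 0;
    for (char c : s) {
        // Compare through unsigned char so the check does not depend on
        // whether plain char is signed on this platform.
        if (static_cast<unsigned char>(c) >= 0x80) ++n;
    }
    return n;
}
```

For example, the two-byte encoding of "é" (0xC3 0xA9) contains two such bytes, while plain ASCII text contains none.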

Is there a way around this?

asked Jun 01 '09 14:06

Martin Cote

2 Answers

Bison yes, flex no. The one time I needed a Bison parser to work with UTF-8 encoded files, I ended up writing my own yylex function.

Edit: To help, I used a lot of the Unicode operations available in GLib (there's a gunichar type and some file/string manipulation functions that I found useful).
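A minimal sketch of what such a hand-written scanner could look like when the goal is to pass raw bytes through, as the question asks. It is not the code from this answer and does not use GLib; the names ByteLexer and next are hypothetical. The one thing it relies on is casting each byte through unsigned char, so the value handed to the parser is never negative.

```cpp
#include <cstddef>
#include <string>
#include <utility>

// Hypothetical byte-oriented scanner standing in for a hand-written yylex.
// ByteLexer and next() are illustrative names, not part of Bison.
class ByteLexer {
public:
    explicit ByteLexer(std::string input) : input_(std::move(input)) {}

    // Returns the next byte as a non-negative int, or 0 at end of input.
    // 0 is the conventional end-of-input code (YYEOF) in Bison parsers.
    // Note that an embedded NUL byte would also read as end of input.
    int next() {
        if (pos_ >= input_.size()) return 0;
        // The cast keeps bytes 0x80..0xFF in the range 128..255 instead of
        // letting a signed char widen to a negative int.
        return static_cast<unsigned char>(input_[pos_++]);
    }

private:
    std::string input_;
    std::size_t pos_ = 0;
};
```

A real yylex would wrap a call like next() and return its result, presumably after mapping bytes to whatever token kinds the grammar declares; that mapping depends on the grammar and is left out here.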

answered Sep 18 '22 07:09

eduffy

Since flex is the issue here, you might want to take a look at zlex.

answered Sep 20 '22 07:09

chaos
