Can Bison parse UTF-8 characters?

Tags: c++, utf-8, bison

I'm trying to make a Bison parser to handle UTF-8 characters. I don't want the parser to actually interpret the Unicode character values, but I want it to parse the UTF-8 string as a sequence of bytes.

Right now, Bison generates the following code which is problematic:

  if (yychar <= YYEOF)
    {
      yychar = yytoken = YYEOF;
      YYDPRINTF ((stderr, "Now at end of input.\n"));
    }

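The problem is that every byte of a multi-byte UTF-8 sequence has its high bit set, so it comes out negative when read through a signed char. Bison treats any token value less than or equal to YYEOF as end of input, so it stops at the first such byte.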

Is there a way around this?

asked Jun 01 '09 14:06 by Martin Cote


2 Answers

Bison yes, flex no. The one time I needed a Bison parser to work with UTF-8 encoded files, I ended up writing my own yylex function.

edit: To help, I used a lot of the Unicode operations available in GLib (there's a gunichar type and some file/string manipulation functions that I found useful).
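For illustration only, here is a minimal sketch of what a hand-written yylex along those lines might look like when the grammar treats the input as plain bytes. It is not the code from this answer. The token codes, the `SemanticValue` and `LexerInput` types, and the two-argument `yylex` signature are all assumptions made for the example; a real parser would take its token codes and `YYSTYPE` from the header Bison generates. The point it shows is the conversion through `unsigned char`, which keeps bytes at or above 0x80 from turning into negative values that look like end of input.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical token codes. Bison uses 0 for end of input and numbers
// named tokens above 255; real values come from the generated header.
enum Token { END_OF_INPUT = 0, BYTE = 258 };

// Hypothetical semantic value: the raw byte that was just read.
struct SemanticValue {
    unsigned char byte;
};

// Hypothetical input buffer handed to the lexer.
struct LexerInput {
    std::string text;
    std::size_t pos = 0;
};

// Hand-written lexer sketch: one BYTE token per input byte,
// and never a negative return value.
int yylex(SemanticValue* yylval, LexerInput* in) {
    assert(yylval != nullptr && in != nullptr);
    if (in->pos >= in->text.size())
        return END_OF_INPUT;
    // Convert through unsigned char so bytes >= 0x80 stay in 128..255
    // instead of sign-extending to a negative int.
    yylval->byte = static_cast<unsigned char>(in->text[in->pos++]);
    return BYTE;
}
```

Returning a single named token and carrying the byte in the semantic value is one design choice; it keeps the token code in a range the generated parser knows about regardless of the byte's value.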

answered Sep 18 '22 07:09 by eduffy


Since flex is the issue here, you might want to take a look at zlex.

answered Sep 20 '22 07:09 by chaos