
Is there a better (more modern) tool than lex/flex for generating a tokenizer for C++?

I recently added source file parsing to an existing tool that generates output files from complex command line arguments.

The command line arguments became so complex that we started allowing them to be supplied in a file that was parsed as if it were one very large command line, but the syntax was still awkward. So I added the ability to parse a source file using a more reasonable syntax.

I used flex 2.5.4 for Windows to generate the tokenizer for this custom source file format, and it worked. But I hated the code: global variables, a weird naming convention, and the C++ code it generated was awful. The existing code generation backend is glued to the output of flex - I don't use yacc or bison.

I'm about to dive back into that code, and I'd like to use a better/more modern tool. Does anyone know of something that:

  • Runs in the Windows command prompt (Visual Studio integration is OK, but I use makefiles to build)
  • Generates a properly encapsulated C++ tokenizer (no global variables)
  • Uses regular expressions for describing the tokenizing rules (compatibility with lex syntax a plus)
  • Does not force me to use the C runtime (or fake it) for file reading (parses from memory)
  • Warns me when my rules force the tokenizer to backtrack (or fixes it automatically)
  • Gives me full control over variable and method names (so I can conform to my existing naming convention)
  • Allows me to link multiple parsers into a single .exe without name collisions
  • Can generate a UNICODE (16-bit UCS-2) parser if I want it to
  • Is NOT an integrated tokenizer + parser-generator (I want a lex replacement, not a lex+yacc replacement)

I could probably live with a tool that just generated the tokenizing tables if that was the only thing available.
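To make the "encapsulated, parses from memory" requirements concrete, here is a rough hand-written sketch of the kind of interface I'm after (class, method, and token names are all hypothetical, just for illustration):

```cpp
#include <cassert>
#include <cctype>
#include <string>

// Hypothetical illustration of the desired shape: no globals, no
// C-runtime file I/O -- the tokenizer scans a caller-supplied memory
// buffer, and all state lives inside the object.
class Tokenizer {
public:
    enum Token { End, Ident, Number, Symbol };

    Tokenizer(const char* begin, const char* end) : cur_(begin), end_(end) {}

    // Scan and classify the next token; the lexeme is available via Text().
    Token Next() {
        while (cur_ != end_ && std::isspace((unsigned char)*cur_)) ++cur_;
        text_.clear();
        if (cur_ == end_) return End;
        if (std::isalpha((unsigned char)*cur_)) {
            while (cur_ != end_ && std::isalnum((unsigned char)*cur_))
                text_ += *cur_++;
            return Ident;
        }
        if (std::isdigit((unsigned char)*cur_)) {
            while (cur_ != end_ && std::isdigit((unsigned char)*cur_))
                text_ += *cur_++;
            return Number;
        }
        text_ += *cur_++;  // single-character punctuation
        return Symbol;
    }

    const std::string& Text() const { return text_; }

private:
    const char* cur_;   // current read position in the buffer
    const char* end_;   // one past the end of the buffer
    std::string text_;  // lexeme of the most recent token
};
```

Because nothing here is global, two such tokenizers (for two different file formats) can be linked into the same .exe without colliding.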

asked Jan 30 '10 by John Knoeller


2 Answers

Ragel: http://www.complang.org/ragel/ It fits most of your requirements.

  • It runs on Windows
  • It doesn't declare the variables, so you can put them inside a class or inside a function as you like.
  • It has nice tools for analyzing regular expressions to see when they would backtrack. (I don't know about this very much, since I never use syntax in Ragel that would create a backtracking parser.)
  • Variable names can't be changed.
  • Table names are prefixed with the machine name, and they're declared "const static", so you can put more than one in the same file and have more than one with the same name in a single program (as long as they're in different files).
  • You can declare the variables as any integer type, including UChar (or whatever UTF-16 type you prefer). It doesn't automatically handle surrogate pairs, though. It doesn't have special character classes for Unicode either (I think).
  • It only does regular expressions... has no bison/yacc features.

The code it generates interferes very little with a program. The code is also incredibly fast, and the Ragel syntax is more flexible and readable than anything I've ever seen. It's a rock-solid piece of software. It can generate a table-driven parser or a goto-driven parser.
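To show what "it doesn't declare the variables" means in practice, here is a hand-written sketch of the hosting pattern (this is not actual Ragel output - in real use the loop body below is replaced by the generated exec block, driven by the conventional `p`/`pe` pointers, and the machine's tables are emitted as `static const` data named after the machine):

```cpp
#include <cassert>
#include <string>

// Sketch: because Ragel leaves variable declarations to you, the
// scanner state can live entirely inside a class method -- no globals.
class Scanner {
public:
    explicit Scanner(const std::string& input) : data_(input) {}

    // Count digits in the buffer. In real Ragel-generated code this
    // hand-written loop would be the generated state-machine loop.
    int Run() {
        const char* p  = data_.data();                 // current position (Ragel's "p")
        const char* pe = data_.data() + data_.size();  // end of input (Ragel's "pe")
        int matches = 0;
        for (; p != pe; ++p)
            if (*p >= '0' && *p <= '9') ++matches;
        return matches;
    }

private:
    std::string data_;  // parse-from-memory: the caller owns the I/O
};
```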

answered Nov 08 '22 by Dietrich Epp


Boost.Spirit.Qi (parser-tokenizer) or Boost.Spirit.Lex (tokenizer only). I absolutely love Qi, and Lex is not bad either, but I just tend to use Qi for my parsing needs...

The only real drawback with Qi tends to be the increase in compile time, and it also runs slightly slower than hand-written parsing code. It is generally much faster than parsing with regexes, though.

http://www.boost.org/doc/libs/1_41_0/libs/spirit/doc/html/index.html

answered Nov 08 '22 by Tronic