
Parsing: library functions, FSM, explode() or lex/yacc?

Tags:

c

parsing

explode

scanf

lex

When I have to parse text (e.g. config files or other rather simple/descriptive languages), there are several solutions that come to my mind:

using library functions, e.g. strtok(), sscanf()
a finite state machine which processes one char at a time, tokenizing and parsing
using the explode() function I once wrote out of pure boredom
using lex/yacc (read: flex/bison) to generate an appropriate parser

I don't like the "library functions" approach. It feels clumsy and awkward. explode(), while it doesn't take much new code, feels even more blown up. And flex/bison often seems like sheer overkill.
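For concreteness, here is a minimal sketch of what the "library functions" approach tends to look like. It assumes a hypothetical `key = value` line format with `#` comments; the helper name `parse_kv_line` and the 64-byte buffers are illustrative choices, not anything from the question.

```c
#include <stdio.h>

/* Hypothetical helper: parse one "key = value" line with sscanf().
 * Returns 1 on success, 0 if the line is blank, a comment, or malformed.
 * key and value must each point to at least 64 bytes. */
static int parse_kv_line(const char *line, char *key, char *value)
{
    /* " "            skips leading whitespace
     * %63[^= \t\n#]  reads the key, stopping at '=', whitespace or '#'
     * " = "          requires '=' with optional whitespace on either side
     * %63[^\n]       takes the rest of the line as the value */
    if (sscanf(line, " %63[^= \t\n#] = %63[^\n]", key, value) == 2)
        return 1;
    return 0;
}
```

Called on `"  name = moon base\n"` this should fill `key` with `name` and `value` with `moon base`. Trailing whitespace in the value is not trimmed, and every extra rule (quoting, escapes, line continuation) makes the format string harder to read, which is roughly where the clumsiness comes from.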

I usually implement a FSM, but at the same time I already feel sorry for the poor guy that may have to maintain my code at a later point.
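To make the FSM option concrete, a hand-written state machine for the same kind of input could be sketched as follows. The `key = value` format, the state names, and the function `fsm_parse_line` are all assumptions made for illustration; a real config format would need more states.

```c
#include <stddef.h>

/* States of a hypothetical "key = value" line scanner. */
enum state { S_START, S_KEY, S_EQ, S_VALUE_START, S_VALUE };

/* Processes one character at a time.
 * Returns 1 for a key/value pair, 0 for a blank or comment line,
 * -1 on a syntax error or overflow. ksz and vsz must be at least 1;
 * key and val are only meaningful when 1 is returned. */
static int fsm_parse_line(const char *line, char *key, size_t ksz,
                          char *val, size_t vsz)
{
    enum state st = S_START;
    size_t k = 0, v = 0;

    for (const char *p = line; *p != '\0' && *p != '\n'; p++) {
        char c = *p;
        int blank = (c == ' ' || c == '\t');

        switch (st) {
        case S_START:                    /* skipping leading whitespace */
            if (blank) break;
            if (c == '#') return 0;      /* comment line */
            if (c == '=') return -1;     /* empty key */
            st = S_KEY;
            /* fall through */
        case S_KEY:                      /* collecting key characters */
            if (c == '=') st = S_VALUE_START;
            else if (blank) st = S_EQ;
            else if (k + 1 < ksz) key[k++] = c;
            else return -1;              /* key too long */
            break;
        case S_EQ:                       /* key finished, expecting '=' */
            if (blank) break;
            if (c != '=') return -1;
            st = S_VALUE_START;
            break;
        case S_VALUE_START:              /* skipping whitespace before value */
            if (blank) break;
            st = S_VALUE;
            /* fall through */
        case S_VALUE:                    /* collecting value characters */
            if (v + 1 < vsz) val[v++] = c;
            else return -1;              /* value too long */
            break;
        }
    }

    key[k] = '\0';
    val[v] = '\0';
    if (st == S_START) return 0;         /* nothing but whitespace */
    return st == S_VALUE ? 1 : -1;       /* an empty value counts as an error here */
}
```

Even for a format this small there are five states and a handful of implicit rules (an empty value is rejected, interior spaces in the value are kept), and each new feature in the format multiplies the transitions. That is the maintenance burden I worry about.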

Hence my question:

What is the best way to parse relatively simple text files?
Does it matter at all?
Is there a commonly agreed-upon approach?

816

asked Apr 17 '11 22:04

Philip

1 Answer

I'm going to break the rules a bit and answer your questions out of order.

Is there a commonly agreed-upon approach?

Absolutely not. IMHO the solution you choose should depend on (to name a few) your text, your timeframe, your experience, even your personality. If the text is simple enough to make flex and bison overkill, maybe C is itself overkill. Is it more important to be fast, or robust? Does it need to be maintained, or can it start quick and dirty? Are you a passionate C user, or can you be enticed away with the right language features? &c., &c.

Does it matter at all?

Again, this is something only you can answer. If you're working closely with a team of people, with particular skills and abilities, and the parser is important and needs to be maintained, it sure does matter! If you're writing something "out of pure boredom," I would suggest that it doesn't matter at all, no. :-)

What is the best way to parse relatively simple text files?

Well, I don't know that you're going to like my answer. Maybe first read some of the other fine answers here.

No, really, go ahead. I'll wait.

Ah, you're back and relaxed. Let's ease into things, shall we?

Never write it in 'C' if you can do it in 'awk';
Never do it in 'awk' if 'sed' can handle it;
Never use 'sed' when 'tr' can do the job;
Never invoke 'tr' when 'cat' is sufficient;
Avoid using 'cat' whenever possible.
-- Taylor's Laws of Programming

If you're writing it in C, but C feels like the wrong tool...it really might be the wrong tool. awk or perl will likely do what you're trying to do without all the aggravation. You may even be able to do it with cut or something similar.

On the other hand, if you're writing it in C, you probably have a good reason to write it in C. Maybe your parser is a tiny part of a much larger system, which, for the sake of argument, is embedded, in a refrigerator, on the moon. Or maybe you loooove C. You may even hate awk and perl, heaven forfend.

If you don't hate awk and perl, you may want to embed them into your C program. This is doable in principle, though I've never done it myself. For awk, try libmawk. For perl, there are probably a few ways (TMTOWTDI). You can run perl as a separate process, using popen to start it, or you can actually embed a Perl interpreter into your C program--see man perlembed.
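As a rough sketch of the popen route (untested on your system, and assuming a POSIX environment with /bin/sh), something like the following hands the text processing to an external tool and reads back its first line of output. The helper name `first_line_of` is made up for this example.

```c
#define _POSIX_C_SOURCE 200809L  /* popen()/pclose() under strict -std=c99 */
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: run a shell command and copy the first line of its
 * output into out, with the trailing newline stripped.
 * Returns 0 on success, -1 if the command could not be run, produced no
 * output, or exited with a non-zero status. */
static int first_line_of(const char *cmd, char *out, size_t outsz)
{
    FILE *fp = popen(cmd, "r");
    if (fp == NULL)
        return -1;

    int got_line = (fgets(out, (int)outsz, fp) != NULL);

    /* Drain the rest so the child is not killed by SIGPIPE on close. */
    char scratch[256];
    while (fgets(scratch, sizeof scratch, fp) != NULL)
        ;

    int status = pclose(fp);
    if (!got_line || status != 0)
        return -1;

    out[strcspn(out, "\n")] = '\0';
    return 0;
}
```

With a hypothetical settings.conf, a call such as `first_line_of("awk -F= '$1 == \"name\" { print $2 }' settings.conf", buf, sizeof buf)` would let awk do the parsing. The command string goes through the shell, so never build it from untrusted input.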

Anyhow, as I've said, "the best way to parse" entirely depends on you and your team, the problem space, and your approach to the issue. What I can offer is my opinion.

I'm going to assume that in your C-only solutions (library functions and the FSM; I'm counting your explode as essentially a library function) you've already done your best at isolating the relevant code, designing the code and files well, and so forth.

Even so, I'm going to recommend lex and yacc.

Library functions feel "clumsy and awkward." A state machine seems unmaintainable. But you say that lex and yacc feel like overkill.

I think you should approach your complaints differently. With lex and yacc, what you're really doing is still specifying a FSM. The difference is that you're also hiring someone to write and maintain it for you, thereby solving most of the maintainability problem. Overkill? Did I mention they'll work for free?

I suspect, but do not know, that the reason lex and yacc originally felt like overkill was that your config / simple files just felt too, well, simple. If I'm right (a big if), you may be able to do most of your work in the lexer. (It's even conceivable that you can do all of your work in the lexer, but I know nothing about your input.) If your input is not only simple but widespread, you may be able to find a lexer/parser combination freely available for what you need.

In short: if you don't have to do this in C, try something else. If you want C, use lex and yacc--they have a little overhead, but they're a very good solution.

191

answered Sep 18 '22 00:09

JXG

