
Are regular expressions used to build parsers?

Tags: regex, parsing

This is just a question out of curiosity, since I've been needing to get more and more into parsing and regex lately. For questions I come across regarding parsing of some sort, someone always ends up saying, when asked something relating to regex, "regex isn't good for that, use such-and-such parser instead." As I've come to understand regex better, I think most things are possible with it; it's just rather complex and time-consuming, since you have to account for many different possibilities, and of course it has to be combined with conditional statements and loops to build any sort of parser. So I'm wondering: is regex what is used to build most parsers, or is there some other method? I ask because I may soon need to build some fairly complex custom parsers where there isn't necessarily an existing one to use.

Thanks for any info, as I can't seem to find a direct answer to this.

Rick asked Aug 15 '10 11:08



2 Answers

Typically, you'll use at least two types of tools in building your parser.

The first part is lexical analysis -- separating characters into tokens and filtering out comments and whitespace. That part is typically done with regular expressions. Well, it's even more typically done using a scanner generator that converts a collection of pairs of regular expressions and code into a program that executes the corresponding code when it recognizes the regular expressions. This turns out to be more efficient than testing each regular expression each time, and it also works better for various other reasons. FLEX is a common tool for this in C.
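To make the lexical-analysis step concrete, here is a minimal hand-rolled sketch of regex-based tokenization in Python (not FLEX-generated; the token names and patterns are illustrative, not from any particular language):

```python
import re

# One regex per token type, tried in order -- a hand-written version of the
# (pattern, action) pairs you would feed to a scanner generator like FLEX.
TOKEN_SPEC = [
    ("NUMBER",  r"\d+"),
    ("IDENT",   r"[A-Za-z_]\w*"),
    ("OP",      r"[+\-*/=()]"),
    ("SKIP",    r"\s+"),          # whitespace is filtered out
    ("COMMENT", r"#[^\n]*"),      # comments are filtered out too
]
MASTER = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in TOKEN_SPEC))

def tokenize(text):
    """Split input into (kind, text) tokens, dropping whitespace and comments."""
    tokens = []
    for m in MASTER.finditer(text):
        kind = m.lastgroup
        if kind not in ("SKIP", "COMMENT"):
            tokens.append((kind, m.group()))
    return tokens
```

For example, `tokenize("x = 42 # set x")` yields `[("IDENT", "x"), ("OP", "="), ("NUMBER", "42")]` -- the comment and whitespace are recognized but discarded, just as a scanner would do before handing tokens to the grammar stage.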

The second part of your parser is the grammar. The most typical tool for this is another program-generator that accepts a context-free grammar (CFG) annotated with rules for interpreting the component "parts of speech", as it were. A CFG is able to express things like balanced parentheses, which a regular expression cannot (unless it's been extended with CF features, making it not strictly "regular" in the mathematical sense). But a CFG with rules is very nice because you can attach a semantic meaning to the phrase structure of your language. BISON is a common tool for this part in C.

But I actually told you a little lie. You see, every real programming language has parts that cannot be expressed within a context-free framework. For example, you need to connect the definition of a variable with the use of it so that you know what instructions to generate, and also if an operation on it is legal. That's typically considered outside the scope of parsing, but there are such things as "attribute grammars" which are like CFGs extended with features that can make even these context-dependencies much easier to code up and work with.

Now, there's no rule that says you HAVE to use such tools. Many simple grammars are easy enough to process with hand-written tools. For example, LISP's S-expressions can be simply scanned as:

If it starts with a digit, read a number. If it starts with a letter, read a symbol. If it's a space, skip it. If it's an open-paren, then skip it, recurse this routine for a value, and expect a close paren.

Well, there are a few more complications for strings and what-have-you, but that's the basic idea. Parsing FORTH is even simpler, because it doesn't build a recursive data structure.
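The S-expression steps above can be sketched as a hand-written recursive-descent reader; this is a sketch in Python under the same simplifications (numbers, symbols, and lists only -- no strings or other complications):

```python
def parse_sexpr(text, pos=0):
    """Read one value starting at pos; return (value, next_pos)."""
    # If it's a space, skip it.
    while pos < len(text) and text[pos].isspace():
        pos += 1
    if pos >= len(text):
        raise SyntaxError("unexpected end of input")
    if text[pos] == "(":
        # Open paren: skip it, recurse for values, expect a close paren.
        pos += 1
        items = []
        while True:
            while pos < len(text) and text[pos].isspace():
                pos += 1
            if pos >= len(text):
                raise SyntaxError("expected ')'")
            if text[pos] == ")":
                return items, pos + 1
            value, pos = parse_sexpr(text, pos)
            items.append(value)
    # Otherwise it's an atom: read up to whitespace or a paren, then
    # classify -- digits make a number, anything else a symbol.
    end = pos
    while end < len(text) and not text[end].isspace() and text[end] not in "()":
        end += 1
    atom = text[pos:end]
    return (int(atom) if atom.lstrip("-").isdigit() else atom), end
```

Calling `parse_sexpr("(+ 1 (* 2 3))")` returns the nested list `["+", 1, ["*", 2, 3]]` (plus the final position) -- the recursion on open-paren is what builds the tree structure that a flat regex pass cannot.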

Anyway, that should get you going on whatever your project is.

Ian answered Oct 18 '22 09:10



Regexes can be used to parse a certain class of languages (the regular, or finite-state, languages), but their power is limited compared to other formalisms, and as you mention, they quickly become unwieldy and hard to maintain.

For example, it is not possible to write a regex that ensures every open parenthesis has a matching close parenthesis - something that most languages have in their expression syntax.
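To illustrate why, here is a quick sketch in Python (the function name is made up for this example): matching parentheses needs an unbounded counter, and a finite automaton -- hence a pure regex -- has only finitely many states, so it cannot track arbitrarily deep nesting.

```python
def parens_balanced(text):
    """Check that every '(' has a matching ')' using a depth counter."""
    depth = 0
    for ch in text:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:        # a close paren before its matching open
                return False
    return depth == 0            # any unclosed open parens left over?
```

The single integer `depth` is all the state required, but it can grow without bound with the input, which is exactly what puts this check beyond a strictly regular expression.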

Regexes are usually used to do tokenization, and the tokens are then combined to build the desired syntax tree.

mdma answered Oct 18 '22 10:10
