Is it a Lexer's Job to Parse Numbers and Strings?

Is it a lexer's job to parse numbers and strings?

This may or may not sound dumb, given the fact that I'm asking whether a lexer should parse input. However, I'm not sure whether that's in fact the lexer's job or the parser's job, because in order to lex properly, the lexer needs to parse the string/number in the first place, so it would seem like code would be duplicated if the parser does this.

Is it indeed the lexer's job? Or should the lexer simply break up a string like 123.456 into the strings 123, ., 456 and let the parser figure out the rest? Doing this wouldn't be so straightforward with strings...

440

asked Jun 12 '11 04:06

user541686

1 Answer

The simple answer is "Yes".

In the abstract, you don't need lexers at all. You could simply write a grammar that used individual characters as tokens (and in fact that's exactly what SGLR parsers do, but that's a story for another day).

You need lexers because parsers built using characters as primitive elements aren't as efficient as parsers that break the input stream into "tokens", where tokens are the primitive elements of the language you are parsing (whitespace, keywords, identifiers, numbers, operators, strings, comments, ...). [If you don't care about efficiency you can skip the rest of this answer and go read about SGLR parsers].

Good lexers typically take sets of regular expressions representing the language elements, and compile them into an efficient finite state machine that can segment the input stream into such language elements quickly. (If you don't want to use a lexer generator, for simple languages you can code the FSA yourself.) Such compiled FSAs execute only a few tens of machine instructions per input character (get character from input buffer, switch on character to new state, decide if token is complete, if not do it again), and can thus be extremely fast.
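To make the per-character loop concrete, here is a minimal hand-coded FSA sketch in Python. The state names, the character classes, and the three token kinds are assumptions chosen for illustration, not something a particular lexer generator would emit. Note that it answers the original question directly: 123.456 comes out as a single NUMBER token.

```python
# Hand-coded finite state automaton for a toy token set (illustrative only):
# integers, decimal numbers such as 123.456, and identifiers.
START, INT, DOT, FRAC, IDENT = range(5)

# States in which a complete token has been seen, and the kind it has.
ACCEPTING = {INT: "NUMBER", FRAC: "NUMBER", IDENT: "IDENT"}


def char_class(ch):
    """Collapse a character into the small alphabet the automaton switches on."""
    if ch.isdigit():
        return "digit"
    if ch.isalpha() or ch == "_":
        return "letter"
    if ch == ".":
        return "dot"
    return "other"


# TRANSITIONS[state][character class] -> next state; a missing entry means stop.
TRANSITIONS = {
    START: {"digit": INT, "letter": IDENT},
    INT: {"digit": INT, "dot": DOT},
    DOT: {"digit": FRAC},
    FRAC: {"digit": FRAC},
    IDENT: {"letter": IDENT, "digit": IDENT},
}


def next_token(text, pos):
    """Return (kind, lexeme, new_pos) for the longest token starting at pos."""
    state, i = START, pos
    last_accept = None  # (kind, end offset) of the longest accepted prefix
    while i < len(text):
        nxt = TRANSITIONS[state].get(char_class(text[i]))
        if nxt is None:
            break  # no transition: the token (if any) is complete
        state, i = nxt, i + 1
        if state in ACCEPTING:
            last_accept = (ACCEPTING[state], i)
    if last_accept is None:
        raise ValueError(f"no token starts at offset {pos}")
    kind, end = last_accept
    return kind, text[pos:end], end
```

Remembering the last accepting position is what lets the automaton back off cleanly: for the input "123." it returns the NUMBER "123" and leaves the dot for the next call.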

The output of such lexers is typically a code representing the language element (or nothing for whitespace if the parser would ignore it anyway) and some position information (starts in file foo, line 17 column 3) to enable error reporting.
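As a rough illustration of that output, the sketch below yields a token kind plus line and column for each element and emits nothing for whitespace. The Token fields and the toy token set are assumptions made for this example, and the file name is left out for brevity. It uses Python's re module, which is a backtracking matcher rather than a compiled FSA, so it shows the interface of such a lexer and not its speed.

```python
import re
from typing import Iterator, NamedTuple


class Token(NamedTuple):
    kind: str    # code for the language element, e.g. "NUMBER"
    text: str    # the characters that made up the token
    line: int    # 1-based line number
    column: int  # 1-based column number


# Hypothetical token set for a toy language; earlier entries win on overlap.
TOKEN_SPEC = [
    ("NUMBER", r"\d+(?:\.\d+)?"),
    ("STRING", r'"(?:[^"\\\n]|\\.)*"'),
    ("IDENT", r"[A-Za-z_]\w*"),
    ("OP", r"[-+*/=()]"),
    ("NEWLINE", r"\n"),
    ("SKIP", r"[ \t]+"),
    ("MISMATCH", r"."),
]
MASTER = re.compile("|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_SPEC))


def tokenize(source: str) -> Iterator[Token]:
    line, line_start = 1, 0
    for m in MASTER.finditer(source):
        kind = m.lastgroup
        column = m.start() - line_start + 1
        if kind == "NEWLINE":
            line, line_start = line + 1, m.end()
        elif kind == "SKIP":
            continue  # whitespace produces no token at all
        elif kind == "MISMATCH":
            raise SyntaxError(f"unexpected {m.group()!r} at line {line}, column {column}")
        else:
            yield Token(kind, m.group(), line, column)
```

For example, tokenizing the two-line input 'x = 12.5' followed by '  y = "hi"' gives a NUMBER token at line 1, column 5 and an IDENT token for y at line 2, column 3.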

One can stop there and have useful lexers. It is often useful to add a conversion step that converts the character string into the equivalent native machine value for that token, either as the characters are collected or when the token is complete, because at that point one still has knowledge of the specific characters involved in the token. This is used to convert numbers (of varying radixes) in the target language to their native binary equivalent, to convert literal strings containing escape sequences into the actual characters making up the string, and even to take identifier names and look them up in a hash table so that identical identifiers are easily determined. The parser typically isn't interested in these converted values, but steps beyond parsing (semantic analysis, checking for optimizations, code generation) need the converted values anyway, so you might as well convert them as you discover them. (You could delay this conversion until the binary value was needed, but in practice you almost always need the value, so delaying the conversion doesn't buy very much.)
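The three conversions described above might look like the following sketch. The literal forms it accepts (plain decimals, 0x/0o/0b prefixed integers, a small C-like set of escapes) and the SymbolTable class are assumptions for illustration; a real language defines its own literal syntax.

```python
# Escape sequences supported by this sketch (an assumed, C-like subset).
ESCAPES = {"n": "\n", "t": "\t", "\\": "\\", '"': '"'}


def convert_number(lexeme: str):
    """Convert a numeric lexeme to a native int or float."""
    if "." in lexeme:
        return float(lexeme)
    # Base 0 makes int() read the radix from a 0x / 0o / 0b prefix.
    return int(lexeme, 0)


def convert_string(lexeme: str) -> str:
    """Strip the surrounding quotes and decode escape sequences."""
    body, out, i = lexeme[1:-1], [], 0
    while i < len(body):
        if body[i] == "\\":
            i += 1
            if i >= len(body) or body[i] not in ESCAPES:
                raise ValueError(f"bad escape sequence in {lexeme!r}")
            out.append(ESCAPES[body[i]])
        else:
            out.append(body[i])
        i += 1
    return "".join(out)


class SymbolTable:
    """Hash table mapping each distinct identifier name to one integer id."""

    def __init__(self):
        self._ids = {}

    def intern(self, name: str) -> int:
        # An existing name returns its id; a new name gets the next free id.
        return self._ids.setdefault(name, len(self._ids))
```

With this in place, later phases can compare identifiers by id instead of by string, and they receive 31 rather than the characters 0x1F.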

108

answered Sep 19 '22 21:09

Ira Baxter

Tags: parsing, tokenize, lexer