Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

C# Dragon Book (Lexical analysis) How to handle literals

This project is for educational use and I am very well aware that excellent compilers already exist.

I am currently fighting my way through the famous Dragon Book and just started to implement my own Lexer. It works suprisingly well except for literals. I do not understand how to handle literals using symbol (lookup) tables and the book doesn't seem to cover that very well:

In the following code 60 is a numeric literal:

int myIdentifier = 60;

The Dragon Book says:

Technically speaking, for the lexeme 60 we should make up a token like (number,4), where 4 points to the symbol table for the internal representation of integer 60 [...]

Understood - I created the following Token:

<enum TokenType, int lookupIndex> 
//TokenType could be 'number' and lookupIndex could be any int

And stored the literal in a dictionary like this:

Dictionary<int literal, int index> 
//literal could be '60' and index could be anything

Since the literal itself is the key in the Dictionary, that allows me to quickly check if future literals have already been added to the symbol table (or not).

The Parser then recieves the Tokens from the Lexer and should be able to identify the literals in the symbol table.

Questions:

  1. Why should my Token contain a lookup-index instead of containing the literal itself?
    Wouldn' t that be quicker...
  2. How should the Parser be able to quickly find the literal values inside the symbol-table when the lookup-index is the value of the dictionary?
    (I cannot make the lookup-index the dictionary-key because the Lexer would then have to check against the value of the dictionary wich is not very performant as well)
    Could a multi-indexed-dictionary be a solution? I guess not...
  3. Must I create a symbol-table for every type of literal then?
    F.e.:Dictionary<int literal, int index>
    and Dictionary<double literal, int index>
    and Dictionary<char literal, int index> etc.
  4. Maybe I am completly on the wrong track with literals. Feel free to post any better solutions.
like image 709
Noel Widmer Avatar asked Oct 31 '22 04:10

Noel Widmer


1 Answers

  1. Why should my Token contain a lookup-index instead of containing the literal itself? Wouldn't that be quicker?

    Sure, it would probably be quicker. But then every literal would be a different value. Now, most programmers have the expectation that if they use, for example, "this longish string" twice in the same program, the compiler will be clever enough to only emit a single copy of that string in the final executable. And it would also be, shall we say, surprising if when you decompiled the code, you found 273 different storage locations for the constant 1, because every time the compiler saw a += 1, it created a new constant.

    The easiest way to ensure that constant literals are only emitted once is to keep them in an associative container indexed by the value of the literal.

    As @sepp2k points out in a commment, most hardware allows the use of small integer constants as direct operands, and sometimes even not-so-small constants. So the statement about the constant 1 above is a bit of an exagerration. You might well be able to handle integers differently, but it might not be worth the trouble.

  2. How should the Parser be able to quickly find the literal values inside the symbol-table when the lookup-index is the value of the dictionary?

    That depends a lot on the precise datastructure you use for literal tables (which I don't like to call symbol tables, but admittedly the concepts are related.) In many languages, you will find that your standard library containers are not a perfect match for the problem, so you will either need to adapt them to the purpose or write replacements.

    Still, it's not terribly complicated. One possibility is to use the combination of a map<literalType, int> and a vector<literalType>. Here the map associates literal values with indices into the vector. When you find a new literal value, you enter it into the map associated with the current size of the vector, and then push the value onto the vector (which will make its index correspond to the index you just inserted into the map.)

    That's not entirely ideal for large constants like strings because between the key in the map and the value in the vector, the constant is stored twice. When you're starting, I'd recommend just suppressing your annoyance about this duplication; later, if it proves to be a problem, you can find a solution.

    If you were using C++, you could use an (unordered) set instead of a map, and use a reference (pointer) to the newly-added element instead of an index. But I don't think that feature is available in many languages, and also pointers are sometimes awkward in comparison to indices. In some languages you could put all the values into the vector and then keep a set whose keys were indices into the vector. This requires that a lookup of the set can be done with something other than the key type; for some reason, this feature is available in very few datastructure libraries.

    And, yes, a doubly-indexed datastructure could be used, if you have one of those handy. (In effect, the map+vector solution is a doubly-indexed datastructure.)

  3. Must I create a symbol-table for every type of literal then?

    Maybe. How many kinds of literals do you have?

    You'll probably end up using type-tagged enumerations ("discriminated unions"), both for variables and for constants. (Again, not all languages have discriminated unions in their standard library, which is truly sad; if your implementation language lacks this basic feature, you'll need to implement it.) It should certainly be possible for a discriminated union instance to be used as a key in an associative data structure, so there is nothing stopping you, in principle, from keeping all your literals in a single data structure. If you have appropriate types, that's definitely what I'd recommend, at least when starting.

    Note that when you are ultimately emitting the literals as object code, you're really more interested in their bit representation and alignment than their semantics. If two constants of completely different types happen to have the same bit representation, then you could use the same storage location for both of them. If you have multiple widths of integer datatypes, then you'd probably want to keep all of them in a single literal table, precisely to take advantage of this optimization. No need to store a 1 of every width :). Occasionally you will find other cases where two literals of different types have the same representation, but it's probably not common enough to go out of your way to deal with it. (However, on IEEE hardware, floating point and integer zeros have the same representation, and that is usually the same representation as a NULL pointer, so you might want to special case zeros.)

    All in all, it's a judgement call. How complicated is it to use a discriminated union as a key? How much storage could you save by having associative containers with specific key types, and does it matter? Will you want to iterate over all literal constants of the same type (answer: probably) and how easy is that to do with your datastructures?

    If you use a well-designed internal API, you will be able to change your mind about the internal representation of your literal tables. So I'd use this experiment as an opportunity to try good API design.

  4. Anything else?

    Good luck with your project. Learn and enjoy!

like image 72
rici Avatar answered Nov 09 '22 04:11

rici