My first Stack Overflow question.
I have always been curious about this.
Say you are parsing the following line of code:
List<Nullable<int>> list = new List<Nullable<int>>();
When parsing, a naive tokenizer will assume that the two right angle brackets form a single "shift right" token. I haven't run across this problem with any other language construct in a C-style language.
How do modern parsers handle this? Is there a workaround when using such "greedy parsing?"
I have thought about using a stack structure in the parser that handles those tokens specially when parsing generic types. I'm not sure how well that would work when writing a code editor though.
Thanks a ton! :)
When parsing a language, there are typically two main components: the scanner and the parser. The scanner produces a stream of tokens, and the parser interprets that stream based on a grammar, which is a formal definition of the production rules in the language - you can find the grammar for C# 4.0 here.
Disclaimer: I am not implying the following is necessarily how the C# language is parsed, I am merely using a C# snippet to illustrate the general concepts.
Scanning
So the first step is to produce the tokens for the parser. The tokens will generally be made up of some kind of symbolic type (indicating what kind of token it is), the lexeme (the actual text of the token), and perhaps other information such as the line number (useful for error handling).
So if we use List<Nullable<int>> list; from your question as an example, the scanner would produce the following tokens:
available_identifier, List
<
available_identifier, Nullable
<
integral_type, int
>
>
available_identifier, list
;
Note that the token types are inferred from the C# grammar linked to above.
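As a sketch of how such a scanner might look (a minimal Python illustration of my own, not the real C# lexer; token names loosely follow the grammar), note that it matches > one character at a time rather than greedily forming a >> token:

```python
import re

# Keywords from the integral_type rule; anything else alphanumeric is
# treated as an available_identifier in this simplified scanner.
INTEGRAL_TYPES = {"sbyte", "byte", "short", "ushort", "int",
                  "uint", "long", "ulong", "char"}

# '>' is matched as a single character, so this scanner never
# produces a combined '>>' shift token.
TOKEN_RE = re.compile(r"[A-Za-z_]\w*|[<>;,]")

def scan(source):
    tokens = []
    for lexeme in TOKEN_RE.findall(source):
        if lexeme in INTEGRAL_TYPES:
            tokens.append(("integral_type", lexeme))
        elif lexeme[0].isalpha() or lexeme[0] == "_":
            tokens.append(("available_identifier", lexeme))
        else:
            tokens.append((lexeme, lexeme))  # punctuation stands for itself
    return tokens

# Each '>' comes out as its own token, matching the list above.
print(scan("List<Nullable<int>> list;"))
```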
Parsing
Most parsers are what's known as shift-reduce parsers. This means that tokens are shifted onto a stack incrementally, and are reduced (removed) when they match a rule. To help match, a parser will have a certain number of lookahead tokens it may observe (I believe one is most common). In general, a successful parse will conclude when all tokens have been reduced.
The type of parser that is implemented by compiler-construction programs such as YACC and GPPG is known as an LALR(1) parser. These work by building up a parse table based on every legal combination of parser state and lookahead symbol; given the current state and next symbol, the table tells us how to compute the next state.
Now that we have our tokens, we fire them into the parser, the outcome of which will usually be an abstract syntax tree, which can be used for later tasks such as code generation, semantic type checking, etc. To parse those tokens, we need rules to group them into meaningful syntactic units - this is what prevents confusion when encountering >>.
From the C# grammar:
declaration_statement:
| local_variable_declaration ";"
| local_constant_declaration ";"
local_variable_declaration:
| local_variable_type local_variable_declarators
local_variable_type:
| type
| "var"
local_variable_declarators:
| local_variable_declarator
| local_variable_declarators "," local_variable_declarator
local_variable_declarator:
| identifier
| identifier "=" local_variable_initializer
type:
| value_type
| reference_type
| type_parameter
| type_unsafe
value_type:
| struct_type
| enum_type
struct_type:
| type_name
| simple_type
| nullable_type
simple_type:
| numeric_type
| bool
numeric_type:
| integral_type
| floating_point_type
| decimal
integral_type:
| "sbyte"
| "byte"
| "short"
| "ushort"
| "int"
| "uint"
| "long"
| "ulong"
| "char"
reference_type:
| class_type
| interface_type
| array_type
| delegate_type
class_type:
| type_name
| "object"
| "dynamic"
| "string"
type_name:
| namespace_or_type_name
namespace_or_type_name:
| identifier type_argument_list?
| namespace_or_type_name "." identifier type_argument_list?
| qualified_alias_member
identifier:
| available_identifier
| "@" identifier_or_keyword
type_argument_list:
| "<" type_arguments ">"
type_arguments:
| type_argument
| type_arguments "," type_argument
type_argument:
| type
Looks complicated, but stay with me. Each rule is of the form
rule_name:
| production_1
| production_2
| production_3
Each production can be another rule (a non-terminal) or a terminal. Take the integral_type rule, for example: all of its productions are terminals. Rules can also refer to themselves, which is how things like the type arguments in Tuple<int, int, double> are dealt with.
For the purposes of this example, I'll assume that List<Nullable<int>> list; is a local variable declaration. You can find a simpler example on the Shift-Reduce Parsing Wikipedia page, and another on the LR Parsing page.
To begin with, our Parse Stack is empty, our single lookahead token is the very first one and our first action will be to shift that token. That is, our parser-state will look like this:
Step 0
Parse Stack: empty
Look Ahead: available_identifier
Unscanned: List<Nullable<int>> list;
Parser Action: Shift
In our next step, we can reduce the current token based on the production identifier <- available_identifier.
Step 1
Parse Stack: available_identifier
Look Ahead: "<"
Unscanned: <Nullable<int>> list;
Parser Action: Reduce by identifier <- available_identifier
Skipping ahead a few steps, at step 10 we will have the following parser-state:
Step 10
Parse Stack: identifier "<" identifier "<" type_arguments ">"
Look Ahead: ">"
Unscanned: > list;
Parser Action: Reduce by type_argument_list <- "<" type_arguments ">"
At this point we will be able to reduce the last three tokens, as their sequence makes up a type_argument_list (you can check this in the rules above). Fast forward a little more to step 13 and we have the following:
Step 13
Parse Stack: identifier "<" type_arguments ">"
Look Ahead: ">"
Unscanned: list;
Parser Action: Reduce by type_argument_list <- "<" type_arguments ">"
Just like in step 10, we are reducing by type_argument_list <- "<" type_arguments ">". In doing so, we have in fact avoided any ambiguity with >>. These steps continue until we reduce by declaration_statement <- local_variable_declaration ";" - the first rule above.
Summary
By creating an unambiguous grammar, parsers are able to easily disambiguate seemingly tricky situations like List<Nullable<int>>. What I've covered here is essentially a bottom-up, LALR(1) parser. I haven't gone into the actual creation of the abstract syntax tree, but you probably have enough on your plate with the above.
Keep in mind that the rules did not include a start-state - this was mainly for the sake of brevity. If it's helpful, I can throw the rest of the parse steps in there.
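The walkthrough above can be simulated in miniature. The Python sketch below uses a drastically simplified toy grammar of my own (T -> NAME | NAME '<' T '>', D -> T NAME ';' - not the real C# rules) and one token of lookahead to decide when to reduce a NAME to a type. Because each '>' is a separate token consumed by its own reduction, the >> sequence never causes trouble:

```python
# Toy shift-reduce parser over a drastically simplified grammar:
#   T -> NAME | NAME '<' T '>'
#   D -> T NAME ';'
# One token of lookahead decides when to reduce NAME to T.
def parse(tokens):
    tokens = tokens + [("$", "$")]  # end-of-input marker
    stack, i, steps = [], 0, []
    while True:
        la = tokens[i][0]
        top = [sym for sym, _ in stack]
        if top[-4:] == ["NAME", "<", "T", ">"]:
            name, _, inner, _ = stack[-4:]
            del stack[-4:]
            stack.append(("T", f"{name[1]}<{inner[1]}>"))
            steps.append("reduce T <- NAME '<' T '>'")
        elif top[-3:] == ["T", "NAME", ";"]:
            typ, var, _ = stack[-3:]
            del stack[-3:]
            stack.append(("D", f"{typ[1]} {var[1]};"))
            steps.append("reduce D <- T NAME ';'")
        elif top[-1:] == ["NAME"] and la == ">":
            # Lookahead '>' tells us this NAME is a complete type.
            _, text = stack.pop()
            stack.append(("T", text))
            steps.append("reduce T <- NAME")
        elif la == "$":
            break
        else:
            stack.append(tokens[i])
            steps.append(f"shift {tokens[i][1]}")
            i += 1
    return stack, steps

# The token stream from the scanning section: two separate '>' tokens.
toks = [("NAME", "List"), ("<", "<"), ("NAME", "Nullable"), ("<", "<"),
        ("NAME", "int"), (">", ">"), (">", ">"), ("NAME", "list"), (";", ";")]
stack, steps = parse(toks)
print(stack)  # [('D', 'List<Nullable<int>> list;')]
```

Each '>' participates in its own T -> NAME '<' T '>' reduction, mirroring steps 10 and 13 in the trace above.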
Edit: f(g<a, b>(c))
What this boils down to in the grammar is two invocation_expression rules, which are of the form invocation_expression -> primary_expression ( argument_list? )

The first one matches g<a, b>(c). It does so by first establishing that g<a, b> is an identifier followed by a type_argument_list. Our lookahead is now "(", and because the parser will know from previous context that this code is in a method body, it can reduce identifier type_argument_list by

primary_expression <- primary_no_array_creation_expression
<- simple_name <- identifier type_argument_list?

After shifting "(" and c, we can reduce c by

argument_list <- argument <- argument_value <- expression
<- <a really long list of rules> <- simple_name
<- identifier <- available_identifier

And shifting that final parenthesis gives us

primary_expression ( argument_list? )

which can then be reduced by the invocation_expression rule, thus matching g<a, b>(c).
By this point we would have already matched f as an identifier and applied the reduction

primary_expression <- primary_no_array_creation_expression
<- simple_name <- identifier type_argument_list?

So the parse stack will therefore contain the following

primary_expression "(" invocation_expression
^ ^ ^
f ( g<a, b>(c)

The lookahead symbol will be the very last ")", so the parser will reduce invocation_expression by

argument_list <- argument <- argument_value <- expression
<- <the same really long list of rules> <- primary_expression
<- primary_no_array_creation_expression <- invocation_expression

Shifting that last ")" will then give us

primary_expression "(" argument_list ")"
^ ^ ^ ^
f ( g<a, b>(c) )

Like before, this can be reduced by the invocation_expression rule, thus matching f(g<a, b>(c)).
You can postpone the decision until you are done with parsing and/or semantic analysis (AFAIK, C/C++ compilers have to take the latter approach).
I wrote an article about how you can do this with a parser (NLT in this case) that parses your input under different interpretations in parallel -- Ambiguity? Let NLT parse it for you. In short, you don't decide whether this is a shift or a closing angle bracket of a generic argument list; you accept both versions, wait until one of them becomes invalid, then kill it and keep the correct one.
I am not pasting the full text here, because it is way too long for an SO answer.
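To illustrate the idea (a minimal Python sketch of my own, not NLT itself), the code below runs both interpretations of a statement side by side and reports whichever ones survive. The scanner deliberately emits each '>' separately so that either interpretation can regroup them:

```python
import re

def tokenize(src):
    # Non-greedy on '>': each '>' is its own token, so both
    # interpretations below can regroup them as they see fit.
    return re.findall(r"[A-Za-z_]\w*|[<>;=,]", src)

def parse_type(toks, i):
    """Parse NAME ('<' type (',' type)* '>')?; return next index or None."""
    if i >= len(toks) or not toks[i][0].isalpha():
        return None
    i += 1
    if i < len(toks) and toks[i] == "<":
        i = parse_type(toks, i + 1)
        if i is None:
            return None
        while i is not None and i < len(toks) and toks[i] == ",":
            i = parse_type(toks, i + 1)
        if i is None or i >= len(toks) or toks[i] != ">":
            return None
        i += 1
    return i

def as_declaration(toks):
    i = parse_type(toks, 0)  # e.g. A<B<C>>
    return (i is not None and i + 1 < len(toks)
            and toks[i][0].isalpha() and toks[i + 1] == ";")  # d ;

def as_shift(toks):
    # NAME '>' '>' NAME ';'  -- the two '>' read as a right-shift operator
    return (len(toks) == 5 and toks[0][0].isalpha() and toks[1] == ">"
            and toks[2] == ">" and toks[3][0].isalpha() and toks[4] == ";")

def surviving_interpretations(src):
    toks = tokenize(src)
    labels = [("declaration", as_declaration), ("shift-expression", as_shift)]
    return [name for name, accept in labels if accept(toks)]

print(surviving_interpretations("A<B<C>> d;"))  # ['declaration']
print(surviving_interpretations("E >> f;"))     # ['shift-expression']
```

For each input, exactly one interpretation remains valid; the other dies during parsing, which is the essence of the parallel approach.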
I asked the same question about Java.
Basically, the problem is that, depending on the context, a different tokenization is required for correct parsing -- compare:
A<B<C >> d = new A<B<C >> ();
E >> f;
I'd like to emphasize that the problem isn't caused by the language itself, but by certain approaches to parsing it. Depending on how/what you're using to parse it, you may not experience this problem at all.
However, if you are running into this problem, here are some general solutions:
give the tokenizer enough information about the context to correctly tokenize. For instance, allow tokenization to be context-sensitive instead of regular, perhaps by integrating tokenizing with hierarchical parsing -- after all, a separate tokenization pass is merely an implementation detail for the sake of efficiency, not a necessary part of parsing. This approach is quite easy to implement with recursive-descent parsers.
ambiguously tokenize, and resolve the ambiguities when more contextual information is available. (This may sound inefficient in theory, but there are some very fast implementations in practice.)
do naive tokenization, but reinterpret tokens if needed. This is the solution apparently used by some parsers of the Java language, as explained more fully in my similar question.
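The third approach can be sketched as follows (a hypothetical Python illustration of mine, not an actual Java parser): the lexer tokenizes greedily, producing a single '>>' token, but splits it into two '>' tokens when the parser asks for a closing angle bracket:

```python
import re

class Lexer:
    """Greedy scanner plus the 'split on demand' fix: '>>' is lexed as a
    single shift token, but when the parser asks for a single '>', a
    pending '>>' is reinterpreted as two '>' tokens."""
    def __init__(self, src):
        self.tokens = re.findall(r"[A-Za-z_]\w*|>>|[<>;]", src)
        self.pos = 0

    def peek(self):
        return self.tokens[self.pos] if self.pos < len(self.tokens) else None

    def next(self):
        tok = self.peek()
        self.pos += 1
        return tok

    def expect_close_angle(self):
        tok = self.peek()
        if tok == ">>":
            self.tokens[self.pos] = ">"  # consume one '>', leave one pending
            return ">"
        if tok == ">":
            self.pos += 1
            return ">"
        raise SyntaxError(f"expected '>', got {tok!r}")

def parse_type(lex):
    """type := NAME ('<' type '>')?  -- single type argument for brevity."""
    name = lex.next()
    if lex.peek() == "<":
        lex.next()
        inner = parse_type(lex)
        lex.expect_close_angle()
        return f"{name}<{inner}>"
    return name

lex = Lexer("List<Nullable<int>> list;")
print(parse_type(lex))  # List<Nullable<int>>
print(lex.next())       # list
```

The lexer stays simple and greedy, and only the one problematic token is ever reinterpreted, which is why this is a popular practical workaround.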