Differentiating between ">>" and ">" when parsing generic types

My first Stack Overflow question.

I have always been curious about this.

Say you are parsing the following line of code:

List<Nullable<int>> list = new List<Nullable<int>>();

When parsing, a naive tokenizer will assume that the two right angle brackets are a single "shift right" token. I haven't run across this problem with any other language construct in a C-style language.
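
For illustration, a greedy ("maximal munch") lexer clause might look something like this (the type names here are made up, not from any real compiler):

enum TokenType { GreaterThan, ShiftRight }
record Token(TokenType Type, string Lexeme);

static Token LexAngle(string src, ref int pos)
{
    // Maximal munch: on '>', peek at the next character; if it is also
    // '>', greedily emit a single ">>" (shift-right) token -- even when
    // the two '>'s actually close two generic argument lists.
    if (pos + 1 < src.Length && src[pos + 1] == '>')
    {
        pos += 2;
        return new Token(TokenType.ShiftRight, ">>");
    }
    pos += 1;
    return new Token(TokenType.GreaterThan, ">");
}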

How do modern parsers handle this? Is there a workaround when using such "greedy" parsing?

I have thought about using a stack structure in the parser that handles those tokens specially when parsing generic types. I'm not sure how well that would work when writing a code editor though.

Thanks a ton! :)

asked Jan 16 '14 by Josh Wyant

3 Answers

When parsing a language, there are typically two main components: the scanner and the parser. The scanner produces a stream of tokens, and the parser interprets that stream based on a grammar, which is a formal definition of the production rules in the language - you can find the grammar for C# 4.0 here.

Disclaimer: I am not implying the following is necessarily how the C# language is parsed, I am merely using a C# snippet to illustrate the general concepts.

Scanning

So the first step is to produce the tokens for the parser. The tokens will generally be made up of some kind of symbolic type (indicating the type of token it is), the lexeme (the actual text of the token), and perhaps other information such as the line number (useful for error handling).
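
A token might be modelled as something like this (a sketch for this example only, not the actual C# compiler's representation):

// Symbolic type, lexeme, and position info for error handling.
enum TokenType { AvailableIdentifier, IntegralType, LeftAngle, RightAngle, Semicolon }

record Token(TokenType Type, string Lexeme, int Line);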

So if we use List<Nullable<int>> list; from your question as an example, the scanner would produce the following tokens:

available_identifier, List
<
available_identifier, Nullable
<
integral_type, int
>
>
available_identifier, list
;

Note that the token types are inferred from the C# grammar linked to above.
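
To make that concrete, a toy scanner for just this line might look like the following (illustrative only; real scanners are usually table- or regex-driven). Note that it deliberately emits each ">" as its own token, matching the stream above:

using System;
using System.Collections.Generic;

static IEnumerable<Token> Scan(string src)
{
    int line = 1, i = 0;
    while (i < src.Length)
    {
        char c = src[i];
        if (char.IsWhiteSpace(c))
        {
            if (c == '\n') line++;
            i++;
        }
        else if (char.IsLetter(c))
        {
            int start = i;
            while (i < src.Length && char.IsLetterOrDigit(src[i])) i++;
            string word = src[start..i];
            // "int" is the only integral_type keyword this example needs.
            var type = word == "int" ? TokenType.IntegralType
                                     : TokenType.AvailableIdentifier;
            yield return new Token(type, word, line);
        }
        else
        {
            var type = c switch
            {
                '<' => TokenType.LeftAngle,
                '>' => TokenType.RightAngle,  // one ">" at a time -- no maximal munch
                ';' => TokenType.Semicolon,
                _   => throw new Exception($"unexpected '{c}' on line {line}")
            };
            yield return new Token(type, c.ToString(), line);
            i++;
        }
    }
}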

Parsing

Most parsers are what's known as shift-reduce parsers. This means that tokens are shifted onto a stack incrementally, and are reduced (removed) when they match a rule. To help match, a parser will have a certain number of lookahead tokens it may observe (I believe one is most common). In general, a successful parse will conclude when all tokens have been reduced.

The type of parser that is implemented by compiler construction programs such as YACC and GPPG is known as an LALR(1) parser. These work by building up a parse table based on every legal combination of parser state and lookahead symbol, and given the current state and next symbol, can then tell us how to compute the next state.
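
The driver loop of such a table-driven parser looks roughly like this (a sketch with invented names; the contents of ActionTable and GotoTable are what the generator emits, and real drivers also carry error recovery and semantic actions):

enum ActionKind { Shift, Reduce, Accept, Error }
record ParserAction(ActionKind Kind, int Target);  // Target: next state or rule index
record Rule(int LhsSymbol, int RhsLength);

class LalrDriver
{
    ParserAction[,] ActionTable;  // [state, terminal]     -- generator output
    int[,] GotoTable;             // [state, nonterminal]  -- generator output
    Rule[] Rules;

    public void Parse(IEnumerator<Token> input)
    {
        var states = new Stack<int>();
        states.Push(0);                        // start state
        input.MoveNext();                      // load the first lookahead
        while (true)
        {
            var act = ActionTable[states.Peek(), (int)input.Current.Type];
            switch (act.Kind)
            {
                case ActionKind.Shift:         // consume the lookahead
                    states.Push(act.Target);
                    input.MoveNext();
                    break;
                case ActionKind.Reduce:        // pop one state per symbol on the
                    var rule = Rules[act.Target];  // rule's right-hand side...
                    for (int n = 0; n < rule.RhsLength; n++) states.Pop();
                    // ...then transition on the rule's left-hand side
                    states.Push(GotoTable[states.Peek(), rule.LhsSymbol]);
                    break;
                case ActionKind.Accept:        // everything reduced to the start symbol
                    return;
                default:
                    throw new Exception($"syntax error near '{input.Current.Lexeme}'");
            }
        }
    }
}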

Now that we have our tokens, we fire them into the parser; the outcome will usually be an abstract syntax tree, which can be used for later tasks such as code generation, semantic type checking, etc. To parse those tokens, we need rules to group them into meaningful syntactic units - this is what prevents confusion when encountering >>.

From the C# grammar:

declaration_statement:
    | local_variable_declaration ";" 
    | local_constant_declaration ";" 

local_variable_declaration:
    | local_variable_type local_variable_declarators 

local_variable_type:
    | type 
    | "var"

local_variable_declarators:
    | local_variable_declarator 
    | local_variable_declarators "," local_variable_declarator 

local_variable_declarator:
    | identifier 
    | identifier "=" local_variable_initializer 

type:
    | value_type 
    | reference_type 
    | type_parameter 
    | type_unsafe 

value_type:
    | struct_type 
    | enum_type 

struct_type:
    | type_name 
    | simple_type 
    | nullable_type 

simple_type:
    | numeric_type 
    | bool 

numeric_type:
    | integral_type 
    | floating_point_type 
    | decimal 

integral_type:
    | "sbyte" 
    | "byte" 
    | "short" 
    | "ushort" 
    | "int"
    | "uint" 
    | "long"
    | "ulong" 
    | "char"

reference_type:
    | class_type 
    | interface_type 
    | array_type 
    | delegate_type 

class_type:
    | type_name 
    | "object"
    | "dynamic" 
    | "string"

type_name:
    | namespace_or_type_name 

namespace_or_type_name:
    | identifier type_argument_list? 
    | namespace_or_type_name "." identifier type_argument_list? 
    | qualified_alias_member 

identifier:
    | available_identifier 
    | "@" identifier_or_keyword 

type_argument_list:
    | "<" type_arguments ">" 

type_arguments:
    | type_argument 
    | type_arguments "," type_argument 

type_argument:
    | type 

Looks complicated, but stay with me. Each rule is of the form

rule_name:
    | production_1
    | production_2
    | ...

Each production can be another rule (a non-terminal) or a terminal. Take the integral_type rule for example: all of its productions are terminals. Rules can also refer to themselves, which is how things like the type arguments in Tuple<int, int, double> are dealt with.
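
For instance, the list of type arguments in Tuple<int, int, double> is built up bottom-up by repeatedly applying the recursive production (using the same reduce notation as the parse steps below):

type_arguments <- type_argument                        // int
type_arguments <- type_arguments "," type_argument     // int, int
type_arguments <- type_arguments "," type_argument     // int, int, double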

For the purposes of this example, I'll assume that List<Nullable<int>> list; is a local variable declaration. You can find a simpler example on the Shift-Reduce Parsing Wikipedia page, and another on the LR Parsing page.

To begin with, our Parse Stack is empty, our single lookahead token is the very first one, and our first action will be to shift that token. That is, our parser-state will look like this:

Step 0
Parse Stack:    empty
Look Ahead:     available_identifier
Unscanned:      List<Nullable<int>> list;
Parser Action:  Shift

In our next step, we can reduce the current token based on the production identifier <- available_identifier.

Step 1
Parse Stack:    available_identifier
Look Ahead:     "<"
Unscanned:      <Nullable<int>> list;
Parser Action:  Reduce by identifier <- available_identifier

Skipping ahead a few steps, at step 10 we will have the following parser-state:

Step 10
Parse Stack:    identifier "<" identifier "<" type_arguments ">"
Look Ahead:     ">"
Unscanned:      > list;
Parser Action:  Reduce by type_argument_list <- "<" type_arguments ">"

At this point we will be able to reduce the last three tokens as their sequence makes up a type_argument_list (you can check this in the rules above). Fast forward a little more to step 13 and we have the following:

Step 13
Parse Stack:    identifier "<" type_arguments ">"
Look Ahead:     ">"
Unscanned:      list;
Parser Action:  Reduce by type_argument_list <- "<" type_arguments ">"

Just like in step 10, we are reducing by type_argument_list <- "<" type_arguments ">". In doing so, we have in fact avoided any ambiguity with >>. These steps continue until we reduce by declaration_statement <- local_variable_declaration ";" - the first rule above.

Summary

By creating an unambiguous grammar, parsers are able to easily disambiguate seemingly tricky situations like List<Nullable<int>>. What I've covered here is essentially a bottom-up, LALR(1) parser. I haven't gone into the actual creation of the abstract syntax tree, but you probably have enough on your plate with the above.

Keep in mind that the rules did not include a start-state - this was mainly for the sake of brevity. If it's helpful, I can throw the rest of the parse steps in there.

Edit: f(g<a, b>(c))

What this boils down to in the grammar is two invocation_expression rules, which are of the form invocation_expression -> primary_expression ( argument_list? ).

The first one matches g<a, b>(c). It does so by first establishing that g<a,b> is an identifier followed by a type_argument_list. Our lookahead is now "(", and because the parser will know from previous context that this code is in a method body, it can reduce identifier type_argument_list by

primary_expression <- primary_no_array_creation_expression
    <- simple_name <- identifier type_argument_list?

After shifting "(" and c, we can reduce c by

argument_list <- argument <- argument_value <- expression
    <- <a really long list of rules> <- simple_name
    <- identifier <- available_identifier

And shifting that final parenthesis gives us

primary_expression ( argument_list? )

Which can then be reduced by the invocation_expression rule, thus matching g<a, b>(c).

By this point we would have already matched f as an identifier and applied the reduction

primary_expression <- primary_no_array_creation_expression
    <- simple_name <- identifier type_argument_list?

The parse stack will therefore contain the following

primary_expression "(" invocation_expression
        ^           ^            ^
        f           (        g<a, b>(c)

The lookahead symbol will be the very last ")", so the parser will reduce invocation_expression by

argument_list <- argument <- argument_value <- expression
    <- <the same really long list of rules> <- primary_expression
    <- primary_no_array_creation_expression <- invocation_expression

Shifting that last ")" will then give us

    primary_expression "(" argument_list ")"
            ^           ^        ^        ^
            f           (    g<a, b>(c)   )

Like before, this can be reduced by the invocation_expression rule, thus matching f(g<a, b>(c)).

answered Oct 05 '22 by nick_w


You can postpone the decision until you are done with parsing and/or semantic analysis (AFAIK, C/C++ compilers have to take the latter approach).

I wrote an article about how you can do this with a parser (NLT in this case) that parses your input with different interpretations in parallel -- Ambiguity? Let NLT parse it for you. In short, you don't decide whether this is a shift or a closing angle bracket for a generic argument; you accept both versions, wait until one of them becomes invalid, then kill it and keep the correct one.
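
Roughly, the idea looks like this (all names here are invented for illustration -- this is not NLT's actual API):

// Pseudocode-style sketch: keep a set of live parse branches, fork on
// ambiguous tokens, and drop any branch that later hits a syntax error.
var branches = new List<ParserState> { ParserState.Initial };

foreach (var tok in tokens)
{
    var survivors = new List<ParserState>();
    foreach (var branch in branches)
    {
        // ">>" yields two readings: a single shift-right token, or two
        // consecutive ">" tokens closing generic argument lists.
        foreach (var reading in Interpretations(tok))
        {
            var next = branch.Feed(reading);
            if (!next.IsError)
                survivors.Add(next);    // invalid readings die here
        }
    }
    branches = survivors;
}
// In an unambiguous language, exactly one branch survives to the end.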

I am not pasting the full text here, because it is way too long for an SO answer.

answered Oct 05 '22 by greenoldman


I asked the same question about Java.

Basically, the problem is that:

  • the language is unambiguous (at least in this example) -- meaning there's only one correct way to parse it
  • some implementations will ambiguously tokenize the example -- meaning that there's more than one way to split the input into valid tokens
  • depending on the context, a different tokenization is required for correct parsing -- compare:

    A<B<C >> d = new A<B<C >> ();
    
    E >> f;
    

I'd like to emphasize that the problem isn't caused by the language itself, but by certain approaches to parsing it. Depending on how/what you're using to parse it, you may not experience this problem at all.

However, if you are running into this problem, here are some general solutions:

  1. give the tokenizer enough information about the context to correctly tokenize. For instance, allow tokenization to be context-sensitive instead of regular, perhaps by integrating tokenizing with hierarchical parsing -- after all, a separate tokenization pass is merely an implementation detail for the sake of efficiency, not a necessary part of parsing. This approach is quite easy to implement with recursive-descent parsers.

  2. ambiguously tokenize, and resolve the ambiguities when more contextual information is available. (This may sound inefficient in theory, but there are some very fast implementations in practice.)

  3. do naive tokenization, but reinterpret tokens if needed. This is the solution apparently used by some parsers of the Java language, as explained more fully in my similar question. (A sketch of this token-splitting idea follows below.)
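
For example, a recursive-descent parser might split a ">>" token on demand when closing a type-argument list. This is only a sketch: Current, Advance, ReplaceCurrent, and the token types are invented helpers, not any particular parser's API.

// Sketch of option 3: the tokenizer naively produced ">>", and the parser
// splits it when it needs a single ">" to close a type-argument list.
void ExpectCloseAngle()
{
    if (Current.Type == TokenType.GreaterThan)
    {
        Advance();                      // a plain ">": just consume it
    }
    else if (Current.Type == TokenType.ShiftRight)
    {
        // Reinterpret ">>" as two ">" tokens: consume one half now and
        // leave a synthetic ">" behind for the enclosing argument list.
        ReplaceCurrent(new Token(TokenType.GreaterThan, ">"));
    }
    else
    {
        throw new Exception($"expected '>' but found '{Current.Lexeme}'");
    }
}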

answered Oct 05 '22 by Matt Fenwick