Differentiating between ">>" and ">" when parsing generic types

My first Stack Overflow question.

I have always been curious about this.

Say you are parsing the following line of code:

List<Nullable<int>> list = new List<Nullable<int>>();

When parsing, a naive tokenizer will assume that the two right angle brackets are a single "shift right" token. I haven't run across this problem with any other language construct in a C-style language.
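
For illustration, a greedy ("maximal munch") lexer clause might look something like this (the type names here are made up, not from any real compiler):

enum TokenType { GreaterThan, ShiftRight }
record Token(TokenType Type, string Lexeme);

static Token LexAngle(string src, ref int pos)
{
    // Maximal munch: on '>', peek at the next character; if it is also
    // '>', greedily emit a single ">>" (shift-right) token -- even when
    // the two '>'s actually close two generic argument lists.
    if (pos + 1 < src.Length && src[pos + 1] == '>')
    {
        pos += 2;
        return new Token(TokenType.ShiftRight, ">>");
    }
    pos += 1;
    return new Token(TokenType.GreaterThan, ">");
}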

How do modern parsers handle this? Is there a workaround when using such "greedy" parsing?

I have thought about using a stack structure in the parser that handles those tokens specially when parsing generic types. I'm not sure how well that would work when writing a code editor though.

Thanks a ton! :)

asked Jan 16 '14 by Josh Wyant

3 Answers

When parsing a language, there are typically two main components: the scanner and the parser. The scanner produces a stream of tokens, and the parser interprets that stream based on a grammar, which is a formal definition of the production rules in the language - you can find the grammar for C# 4.0 here.

Disclaimer: I am not implying the following is necessarily how the C# language is parsed, I am merely using a C# snippet to illustrate the general concepts.

Scanning

So the first step is to produce the tokens for the parser. The tokens will generally be made up of some kind of symbolic type (indicating the type of token it is), the lexeme (the actual text of the token), and perhaps other information such as the line number (useful for error handling).
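
A token might be modelled as something like this (a sketch for this example only, not the actual C# compiler's representation):

// Symbolic type, lexeme, and position info for error handling.
enum TokenType { AvailableIdentifier, IntegralType, LeftAngle, RightAngle, Semicolon }

record Token(TokenType Type, string Lexeme, int Line);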

So if we use List<Nullable<int>> list; from your question as an example, the scanner would produce the following tokens:

available_identifier, List
<
available_identifier, Nullable
<
integral_type, int
>
>
available_identifier, list
;

Note that the token types are inferred from the C# grammar linked to above.
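
To make that concrete, a toy scanner for just this line might look like the following (illustrative only; real scanners are usually table- or regex-driven). Note that it deliberately emits each ">" as its own token, matching the stream above:

using System;
using System.Collections.Generic;

static IEnumerable<Token> Scan(string src)
{
    int line = 1, i = 0;
    while (i < src.Length)
    {
        char c = src[i];
        if (char.IsWhiteSpace(c))
        {
            if (c == '\n') line++;
            i++;
        }
        else if (char.IsLetter(c))
        {
            int start = i;
            while (i < src.Length && char.IsLetterOrDigit(src[i])) i++;
            string word = src[start..i];
            // "int" is the only integral_type keyword this example needs.
            var type = word == "int" ? TokenType.IntegralType
                                     : TokenType.AvailableIdentifier;
            yield return new Token(type, word, line);
        }
        else
        {
            var type = c switch
            {
                '<' => TokenType.LeftAngle,
                '>' => TokenType.RightAngle,  // one ">" at a time -- no maximal munch
                ';' => TokenType.Semicolon,
                _   => throw new Exception($"unexpected '{c}' on line {line}")
            };
            yield return new Token(type, c.ToString(), line);
            i++;
        }
    }
}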

Parsing

Most parsers are what's known as shift-reduce parsers. This means that tokens are shifted onto a stack incrementally, and are reduced (removed) when they match a rule. To help match, a parser will have a certain number of lookahead tokens it may observe (I believe one is most common). In general, a successful parse will conclude when all tokens have been reduced.

The type of parser that is implemented by compiler construction programs such as YACC and GPPG is known as an LALR(1) parser. These work by building up a parse table based on every legal combination of parser state and lookahead symbol, and given the current state and next symbol, can then tell us how to compute the next state.
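
The driver loop of such a table-driven parser looks roughly like this (a sketch with invented names; the contents of ActionTable and GotoTable are what the generator emits, and real drivers also carry error recovery and semantic actions):

enum ActionKind { Shift, Reduce, Accept, Error }
record ParserAction(ActionKind Kind, int Target);  // Target: next state or rule index
record Rule(int LhsSymbol, int RhsLength);

class LalrDriver
{
    ParserAction[,] ActionTable;  // [state, terminal]     -- generator output
    int[,] GotoTable;             // [state, nonterminal]  -- generator output
    Rule[] Rules;

    public void Parse(IEnumerator<Token> input)
    {
        var states = new Stack<int>();
        states.Push(0);                        // start state
        input.MoveNext();                      // load the first lookahead
        while (true)
        {
            var act = ActionTable[states.Peek(), (int)input.Current.Type];
            switch (act.Kind)
            {
                case ActionKind.Shift:         // consume the lookahead
                    states.Push(act.Target);
                    input.MoveNext();
                    break;
                case ActionKind.Reduce:        // pop one state per symbol on the
                    var rule = Rules[act.Target];  // rule's right-hand side...
                    for (int n = 0; n < rule.RhsLength; n++) states.Pop();
                    // ...then transition on the rule's left-hand side
                    states.Push(GotoTable[states.Peek(), rule.LhsSymbol]);
                    break;
                case ActionKind.Accept:        // everything reduced to the start symbol
                    return;
                default:
                    throw new Exception($"syntax error near '{input.Current.Lexeme}'");
            }
        }
    }
}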

Now that we have our tokens, we fire them into the parser; the outcome will usually be an abstract syntax tree, which can be used for later tasks such as code generation, semantic type checking, etc. To parse those tokens, we need rules to group them into meaningful syntactic units - this is what prevents confusion when encountering >>.

From the C# grammar:

declaration_statement:
    | local_variable_declaration ";" 
    | local_constant_declaration ";" 

local_variable_declaration:
    | local_variable_type local_variable_declarators 

local_variable_type:
    | type 
    | "var"

local_variable_declarators:
    | local_variable_declarator 
    | local_variable_declarators "," local_variable_declarator 

local_variable_declarator:
    | identifier 
    | identifier "=" local_variable_initializer 

type:
    | value_type 
    | reference_type 
    | type_parameter 
    | type_unsafe 

value_type:
    | struct_type 
    | enum_type 

struct_type:
    | type_name 
    | simple_type 
    | nullable_type 

simple_type:
    | numeric_type 
    | bool 

numeric_type:
    | integral_type 
    | floating_point_type 
    | decimal 

integral_type:
    | "sbyte" 
    | "byte" 
    | "short" 
    | "ushort" 
    | "int"
    | "uint" 
    | "long"
    | "ulong" 
    | "char"

reference_type:
    | class_type 
    | interface_type 
    | array_type 
    | delegate_type 

class_type:
    | type_name 
    | "object"
    | "dynamic" 
    | "string"

type_name:
    | namespace_or_type_name 

namespace_or_type_name:
    | identifier type_argument_list? 
    | namespace_or_type_name "." identifier type_argument_list? 
    | qualified_alias_member 

identifier:
    | available_identifier 
    | "@" identifier_or_keyword 

type_argument_list:
    | "<" type_arguments ">" 

type_arguments:
    | type_argument 
    | type_arguments "," type_argument 

type_argument:
    | type 

Looks complicated, but stay with me. Each rule is of the form

rule_name:
    | production_1
    | production_2
    | ...

Each production can be another rule (a non-terminal) or a terminal. Take the integral_type rule for example: all of its productions are terminals. Rules can also refer to themselves, which is how things like the type arguments in Tuple<int, int, double> are dealt with.
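
For instance, the list of type arguments in Tuple<int, int, double> is built up bottom-up by repeatedly applying the recursive production (using the same reduce notation as the parse steps below):

type_arguments <- type_argument                        // int
type_arguments <- type_arguments "," type_argument     // int, int
type_arguments <- type_arguments "," type_argument     // int, int, double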

For the purposes of this example, I'll assume that List<Nullable<int>> list; is a local variable declaration. You can find a simpler example on the Shift-Reduce Parsing Wikipedia page, and another on the LR Parsing page.

To begin with, our Parse Stack is empty, our single lookahead token is the very first one, and our first action will be to shift that token. That is, our parser-state will look like this:

Step 0
Parse Stack:    empty
Look Ahead:     available_identifier
Unscanned:      List<Nullable<int>> list;
Parser Action:  Shift

In our next step, we can reduce the current token based on the production identifier <- available_identifier.

Step 1
Parse Stack:    available_identifier
Look Ahead:     "<"
Unscanned:      <Nullable<int>> list;
Parser Action:  Reduce by identifier <- available_identifier

Skipping ahead a few steps, at step 10 we will have the following parser-state:

Step 10
Parse Stack:    identifier "<" identifier "<" type_arguments ">"
Look Ahead:     ">"
Unscanned:      > list;
Parser Action:  Reduce by type_argument_list <- "<" type_arguments ">"

At this point we will be able to reduce the last three tokens as their sequence makes up a type_argument_list (you can check this in the rules above). Fast forward a little more to step 13 and we have the following:

Step 13
Parse Stack:    identifier "<" type_arguments ">"
Look Ahead:     ">"
Unscanned:      list;
Parser Action:  Reduce by type_argument_list <- "<" type_arguments ">"

Just like in step 10, we are reducing by type_argument_list <- "<" type_arguments ">". In doing so, we have in fact avoided any ambiguity with >>. These steps continue until we reduce by declaration_statement <- local_variable_declaration ";" - the first rule above.

Summary

By creating an unambiguous grammar, parsers are able to easily disambiguate seemingly tricky situations like List<Nullable<int>>. What I've covered here is essentially a bottom-up, LALR(1) parser. I haven't gone into the actual creation of the abstract syntax tree, but you probably have enough on your plate with the above.

Keep in mind that the rules did not include a start-state - this was mainly for the sake of brevity. If it's helpful, I can throw the rest of the parse steps in there.

Edit: f(g<a, b>(c))

What this boils down to in the grammar is two invocation_expression rules, which are of the form invocation_expression -> primary_expression ( argument_list? ).

The first one matches g<a, b>(c). It does so by first establishing that g<a,b> is an identifier followed by a type_argument_list. Our lookahead is now "(", and because the parser will know from previous context that this code is in a method body, it can reduce identifier type_argument_list by

primary_expression <- primary_no_array_creation_expression
    <- simple_name <- identifier type_argument_list?

After shifting "(" and c, we can reduce c by

argument_list <- argument <- argument_value <- expression
    <- <a really long list of rules> <- simple_name
    <- identifier <- available_identifier

And shifting that final parenthesis gives us

primary_expression ( argument_list? )

Which can then be reduced by the invocation_expression rule, thus matching g<a, b>(c).

By this point we would have already matched f as an identifier and applied the reduction

primary_expression <- primary_no_array_creation_expression
    <- simple_name <- identifier type_argument_list?

The parse stack will therefore contain the following

primary_expression "(" invocation_expression
        ^           ^            ^
        f           (        g<a, b>(c)

The lookahead symbol will be the very last ")", so the parser will reduce invocation_expression by

argument_list <- argument <- argument_value <- expression
    <- <the same really long list of rules> <- primary_expression
    <- primary_no_array_creation_expression <- invocation_expression

Shifting that last ")" will then give us

    primary_expression "(" argument_list ")"
            ^           ^        ^        ^
            f           (    g<a, b>(c)   )

Like before, this can be reduced by the invocation_expression rule, thus matching f(g<a, b>(c)).

answered Oct 05 '22 by nick_w


You can postpone the decision until you are done with parsing and/or semantic analysis (AFAIK, C/C++ compilers have to take the latter approach).

I wrote an article about how you can do this with a parser (NLT in this case) that parses your input with different interpretations in parallel -- Ambiguity? Let NLT parse it for you. In short, you don't decide whether this is a shift or a closing angle bracket for a generic argument; you accept both versions, wait until one of them becomes invalid, then kill it and keep the correct one.
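
Roughly, the idea looks like this (all names here are invented for illustration -- this is not NLT's actual API):

// Pseudocode-style sketch: keep a set of live parse branches, fork on
// ambiguous tokens, and drop any branch that later hits a syntax error.
var branches = new List<ParserState> { ParserState.Initial };

foreach (var tok in tokens)
{
    var survivors = new List<ParserState>();
    foreach (var branch in branches)
    {
        // ">>" yields two readings: a single shift-right token, or two
        // consecutive ">" tokens closing generic argument lists.
        foreach (var reading in Interpretations(tok))
        {
            var next = branch.Feed(reading);
            if (!next.IsError)
                survivors.Add(next);    // invalid readings die here
        }
    }
    branches = survivors;
}
// In an unambiguous language, exactly one branch survives to the end.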

I am not pasting the full text here, because it is way too long for an SO answer.

answered Oct 05 '22 by greenoldman


I asked the same question about Java.

Basically, the problem is that:

  • the language is unambiguous (at least in this example) -- meaning there's only one correct way to parse it
  • some implementations will ambiguously tokenize the example -- meaning that there's more than one way to split the input into valid tokens
  • depending on the context, a different tokenization is required for correct parsing -- compare:

    A<B<C >> d = new A<B<C >> ();
    
    E >> f;
    

I'd like to emphasize that the problem isn't caused by the language itself, but by certain approaches to parsing it. Depending on how/what you're using to parse it, you may not experience this problem at all.

However, if you are running into this problem, here are some general solutions:

  1. give the tokenizer enough information about the context to correctly tokenize. For instance, allow tokenization to be context-sensitive instead of regular, perhaps by integrating tokenizing with hierarchical parsing -- after all, a separate tokenization pass is merely an implementation detail for the sake of efficiency, not a necessary part of parsing. This approach is quite easy to implement with recursive-descent parsers.

  2. ambiguously tokenize, and resolve the ambiguities when more contextual information is available. (This may sound inefficient in theory, but there are some very fast implementations in practice.)

  3. do naive tokenization, but reinterpret tokens if needed. This is the solution apparently used by some parsers of the Java language, as explained more fully in my similar question. (A sketch of this token-splitting idea follows below.)
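
For example, a recursive-descent parser might split a ">>" token on demand when closing a type-argument list. This is only a sketch: Current, Advance, ReplaceCurrent, and the token types are invented helpers, not any particular parser's API.

// Sketch of option 3: the tokenizer naively produced ">>", and the parser
// splits it when it needs a single ">" to close a type-argument list.
void ExpectCloseAngle()
{
    if (Current.Type == TokenType.GreaterThan)
    {
        Advance();                      // a plain ">": just consume it
    }
    else if (Current.Type == TokenType.ShiftRight)
    {
        // Reinterpret ">>" as two ">" tokens: consume one half now and
        // leave a synthetic ">" behind for the enclosing argument list.
        ReplaceCurrent(new Token(TokenType.GreaterThan, ">"));
    }
    else
    {
        throw new Exception($"expected '>' but found '{Current.Lexeme}'");
    }
}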

answered Oct 05 '22 by Matt Fenwick