
Inconsistency parsing numeric literals according to C++ Standard's grammar [duplicate]

Reading through the C++17 standard, there seems to be an inconsistency between pp-number, as handled by the preprocessor, and numeric literals such as user-defined-integer-literal, as they are defined for the "upper" language.

For example, the following is correctly parsed as a pp-number according to the preprocessor grammar:

123_e+1

But placed in the context of a C++11-compliant code fragment,

int  operator"" _e(unsigned long long)
    { return 0; }

int test()
    {
    return 123_e+1;
    }

current Clang and GCC compilers (I haven't tested others) report an error similar to this:

unable to find numeric literal operator 'operator""_e+1'

where operator"" _e(...) is not found and trying to define operator"" _e+1(...) would be invalid.

It seems that this comes about because the compiler lexes the token as a pp-number first, but then fails to roll back and apply the grammar rules for a user-defined-integer-literal when parsing the final expression.

In comparison, the following code compiles fine:

int  operator"" _d(unsigned long long)
    { return 0; }

int test()
    {
    return 0x123_d+1;  // the '+' doesn't extend the pp-number: 'sign' can only follow [eEpP]
    }

Is this a correct reading of the standard? And if so, is it reasonable that the compiler should handle this, arguably rare, corner case?

Asked Dec 11 '18 by Andy G
1 Answer

You have fallen victim to the maximal munch rule, which has the lexical analyzer take as many characters as possible to form a valid token.

This is covered in section [lex.pptoken]p3, which says:

Otherwise, the next preprocessing token is the longest sequence of characters that could constitute a preprocessing token, even if that would cause further lexical analysis to fail, except that a header-name ([lex.header]) is only formed within a #include directive.

and includes several examples:

[ Example:

#define R "x"
const char* s = R"y";           // ill-formed raw string, not "x" "y"

— end example ]

[ Example: The program fragment 0xe+foo is parsed as a preprocessing number token (one that is not a valid floating or integer literal token), even though a parse as three preprocessing tokens 0xe, +, and foo might produce a valid expression (for example, if foo were a macro defined as 1). Similarly, the program fragment 1E1 is parsed as a preprocessing number (one that is a valid floating literal token), whether or not E is a macro name. — end example ]

[ Example: The program fragment x+++++y is parsed as x ++ ++ + y, which, if x and y have integral types, violates a constraint on increment operators, even though the parse x ++ + ++ y might yield a correct expression. — end example ]
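To see these two examples as actual code, here is a minimal sketch; the macro name foo and the function names are illustrative, not taken from the standard:

#define foo 1

int hex_demo()
    {
    // return 0xe+foo;    // error: 0xe+foo lexes as a single pp-number,
                          //        which is not a valid integer or floating literal
    return 0xe + foo;     // fine: three tokens 0xe, +, foo (14 + 1)
    }

int munch_demo(int x, int y)
    {
    // return x+++++y;    // error: lexed as x ++ ++ + y
    return x++ + ++y;     // fine with explicit spacing
    }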

This rule shows up in several other well-known cases, such as a+++++b, and in the >> token closing nested template argument lists, which required a special rule in C++11 to allow.

For reference, the pp-number grammar is as follows:

pp-number:  
  digit  
  . digit  
  pp-number digit  
  pp-number identifier-nondigit 
  pp-number ' digit  
  pp-number ' nondigit    
  pp-number e sign  
  pp-number E sign  
  pp-number p sign  
  pp-number P sign  
  pp-number .  

Note the e sign production, which is what is snagging this case. If, on the other hand, you use d as in your second example, you do not hit this (see it live on godbolt).
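To make this concrete, here is my own worked derivation (not from the standard) of how maximal munch builds the single token 123_e+1 from the productions above:

1         digit
12        pp-number digit
123       pp-number digit
123_      pp-number identifier-nondigit
123_e     pp-number identifier-nondigit
123_e+    pp-number e sign      <- the '+' is absorbed here
123_e+1   pp-number digit

At every step the next character can extend the current pp-number, so the lexer never stops to emit 123_e on its own.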

Adding spacing would also fix your issue, since you would no longer be subject to maximal munch (see it live on godbolt):

123_e + 1
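
Put back into the original fragment, that fix looks roughly like this (my sketch, reusing the operator"" _e from the question; the parenthesized alternative is equally valid):

int operator"" _e(unsigned long long)
    { return 0; }

int test()
    {
    return 123_e + 1;      // ok: three tokens 123_e, +, 1
    // return (123_e)+1;   // also ok: ')' cannot extend the pp-number
    }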
Answered Oct 22 '22 by Shafik Yaghmour