
Creating identifiers containing universal character names via token concatenation

I wrote this code that creates identifiers containing universal character names via token concatenation.

//#include <stdio.h>
int printf(const char*, ...);

#define CAT(a, b) a ## b

int main(void) {
    //int \u306d\u3053 = 10;
    int CAT(\u306d, \u3053) = 10;

    printf("%d\n", \u306d\u3053);
    //printf("%d\n", CAT(\u306d, \u3053));

    return 0;
}

This code compiled and ran correctly with gcc 4.8.2 (with the -fextended-identifiers option) and with gcc 5.3.1, but it failed to compile with clang 3.3, which gave this error message:

prog.c:10:17: error: use of undeclared identifier 'ねこ'
        printf("%d\n", \u306d\u3053);
                       ^
1 error generated.

It also failed with my local clang (Apple LLVM version 7.0.2 (clang-700.1.81)), which gave this error message:

$ clang -std=c11 -Wall -Wextra -o uctest1 uctest1.c
warning: format specifies type 'int' but the argument has type
      '<dependent type>' [-Wformat]
uctest1.c:10:17: error: use of undeclared identifier 'ねこ'
        printf("%d\n", \u306d\u3053);
                       ^
1 warning and 1 error generated.

When I used the -E option to have the compilers output the code after macro expansion, gcc 5.3.1 emitted this:

# 1 "main.c"
# 1 "<built-in>"
# 1 "<command-line>"
# 1 "/usr/include/stdc-predef.h" 1 3 4
# 1 "<command-line>" 2
# 1 "main.c"

int printf(const char*, ...);



int main(void) {

 int \U0000306d\U00003053 = 10;

 printf("%d\n", \U0000306d\U00003053);


 return 0;
}

My local clang emitted this:

# 1 "uctest1.c"
# 1 "<built-in>" 1
# 1 "<built-in>" 3
# 326 "<built-in>" 3
# 1 "<command line>" 1
# 1 "<built-in>" 2
# 1 "uctest1.c" 2

int printf(const char*, ...);



int main(void) {

 int \u306d\u3053 = 10;

 printf("%d\n", ねこ);


 return 0;
}

As you can see, the identifier that is declared and the identifier that is used in printf() match in gcc's output, but they don't match in clang's output.

I know that creating universal character names via token concatenation invokes undefined behavior.

Quote from N1570 5.1.1.2 Translation phases:

If a character sequence that matches the syntax of a universal character name is produced by token concatenation (6.10.3.3), the behavior is undefined.

I thought that the character sequence \u306d\u3053 might "match the syntax of a universal character name" because it contains universal character names as substrings. On the other hand, "match" might mean that the entire token produced by the concatenation is a single universal character name, in which case this undefined behavior isn't invoked by my code.

Reading PRE30-C. Do not create a universal character name through concatenation, I found a comment saying this kind of concatenation is allowed:

What is forbidden, to create a new UCN via concatenation. Like doing

assign(\u0001,0401,a,b,4)

just concatenating stuff that happens to contain UCNs anywhere is okay.

I also found a revision log showing that a code example like mine (but with 4 characters) was replaced with another code example.

Does my code example invoke any undefined behavior (not limited to the one invoked by producing universal character names via token concatenation)? Or is this a bug in clang?

MikeCAT asked Apr 22 '16 07:04


1 Answer

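Your code does not trigger the undefined behavior you mention, because no universal character name (6.4.3) is produced by the token concatenation: each operand already contains a complete universal character name, and the concatenation only joins them.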

Also, according to 6.10.3.3, since both the left and the right operand of the ## operator are identifiers, and the resulting token is a valid preprocessing token (also an identifier), the ## operator itself does not trigger undefined behavior.

After reading the descriptions of identifiers (6.4.2, D.1, D.2) and universal character names (6.4.3), I'm pretty sure that this is a bug in clang's preprocessor, which treats an identifier produced by token concatenation differently from an identifier written directly.

user1887915 answered Sep 28 '22 21:09