 

Why `strchr` seems to work with multibyte characters, despite man page disclaimer?

From:

man strchr

char *strchr(const char *s, int c);

The strchr() function returns a pointer to the first occurrence of the character c in the string s.

Here "character" means "byte"; these functions do not work with wide or multibyte characters.

Still, if I try to search for a multi-byte character like é (the bytes 0xC3 0xA9 in UTF-8):

const char str[] = "This string contains é which is a multi-byte character";
char * pos = strchr(str, (int)'é');
printf("%s\n", pos);
printf("0x%X 0x%X\n", pos[-1], pos[0]); 

I get the following output:

� which is a multi-byte character

0xFFFFFFC3 0xFFFFFFA9

Despite the warning:

warning: multi-character character constant [-Wmultichar]

So here are my questions:

  • What does it mean that strchr doesn't work with multi-byte characters? (It seems to work, provided the int type is big enough to contain the multi-byte character, which can be at most 4 bytes.)
  • How do I get rid of the warning, i.e. how do I safely recover the multi-byte value and store it in an int?
  • Why the 0xFFFFFF prefixes?
Asked by n0p, Dec 26 '22 05:12


2 Answers

strchr() only seems to work for your multi-byte character.

The actual string in memory is

... c, o, n, t, a, i, n, s, ' ', 0xC3, 0xA9, ' ', w ...

When you call strchr(), the int argument is converted to a char, so you are really only searching for 0xA9, the low 8 bits of 0xC3A9. That's why pos[-1] holds the first byte of your multi-byte character: it was ignored during the search.
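The truncation can be made visible with a small sketch. The helper names `searched_byte` and `match_offset` are made up for illustration, and the sketch assumes 8-bit bytes and a compiler that evaluates 'é' to 0xC3A9, as in the question:

```c
#include <string.h>

/* Hypothetical helper: the byte value strchr() effectively searches for.
   The int argument is converted to char, so only the low 8 bits
   survive (assuming CHAR_BIT == 8). */
static unsigned char searched_byte(int c)
{
    return (unsigned char)c;
}

/* Hypothetical helper: offset of the first match of c in s,
   or -1 if strchr() finds nothing. */
static long match_offset(const char *s, int c)
{
    const char *p = strchr(s, c);
    return p ? (long)(p - s) : -1;
}
```

With these, `match_offset(s, 0xC3A9)` and `match_offset(s, 0xA9)` give the same result for any string `s`, which is another way of saying that the 0xC3 part of the argument never takes part in the search.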

A char is signed on your system, so pos[-1] and pos[0] are sign-extended when they are promoted to int for printf. That is where the 0xFFFFFF prefix comes from.
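One way to print the bytes without the sign extension is to convert each one to unsigned char before it is promoted. A minimal sketch, where `format_byte` is a hypothetical helper and not something from the question:

```c
#include <stdio.h>

/* Hypothetical helper: formats one byte of a string as hex.
   Casting to unsigned char first keeps the value in 0..255, so the
   promotion that happens for the variadic call cannot sign-extend it. */
static int format_byte(char *out, size_t n, char byte)
{
    return snprintf(out, n, "0x%02X", (unsigned)(unsigned char)byte);
}
```

The same cast applied directly in the question's code, `printf("0x%X 0x%X\n", (unsigned char)pos[-1], (unsigned char)pos[0]);`, should drop the prefixes as well.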

As for the warning, it seems that the compiler is trying to tell you that you are doing something odd, which you are. Don't ignore it.
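A common way to avoid the multi-character constant altogether is to treat é as a short string and search for it with strstr(). The sketch below is one possible approach, not the only one. `E_ACUTE` and `find_sequence` are names invented here, and the bytes are written as escapes so the example does not depend on the encoding of the source file:

```c
#include <stddef.h>
#include <string.h>

/* "é" spelled out as its two UTF-8 bytes. */
static const char E_ACUTE[] = "\xC3\xA9";

/* Hypothetical helper: returns a pointer to the first byte of the first
   occurrence of the byte sequence seq in s, or NULL if there is none. */
static const char *find_sequence(const char *s, const char *seq)
{
    return strstr(s, seq);
}
```

Calling `find_sequence(str, E_ACUTE)` returns a pointer to the 0xC3 byte, i.e. the start of the character rather than its middle. Because a UTF-8 lead byte can never be mistaken for a continuation byte, a match of one valid UTF-8 string inside another always begins on a character boundary.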

Answered by Richard Pennington, Feb 13 '23 21:02


That's the problem: it only seems to work. Firstly, it's entirely up to the compiler what it puts in the string when the source contains multibyte characters, if indeed it compiles it at all. Clearly you are lucky (for some appropriate interpretation of lucky) in that it has filled your string with

.... c3, a9, ' ', 'w', etc

and that the value you are searching for is 0xC3A9, which makes a match fairly easy to find. The man page on strchr says:

The strchr() function returns a pointer to the first occurrence of c (converted to a char) in string s

So you pass 0xC3A9 to this, which is converted to a char with value 0xA9. It finds the 0xA9 byte, and you get a pointer to it.

The 0xFFFFFF prefix appears because you are printing a signed char as a 32-bit hex number, so it is sign-extended for you. This is as expected.

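The problem is that compiler-dependent behaviour is just that. It might work almost correctly. And it might not, depending on circumstances. (Strictly speaking, the value of a multi-character constant such as 'é' is implementation-defined rather than undefined, but the practical consequence is the same: you cannot rely on it.)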

And again, it only almost works. You are not getting a pointer to the multibyte character; you are getting a pointer to the middle of it (and I'm surprised you're interpreting that as working). If the multibyte character had evaluated to 0xFF20, strchr would have searched for 0x20, a space, and you'd get pointed to the first space, somewhere much earlier in the string.
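The false match is easy to reproduce with any other character whose encoding ends in the same byte. As a sketch, take © (0xC2 0xA9 in UTF-8), which shares its last byte with é. The helper `preceded_by_lead` is hypothetical and only makes explicit the check that strchr() never performs:

```c
#include <string.h>

/* Hypothetical helper: checks whether a strchr() hit for the low byte
   of a two-byte value (e.g. 0xC3A9) is preceded by the matching high
   byte. Returns 1 if it is, 0 otherwise. */
static int preceded_by_lead(const char *s, const char *hit, int value)
{
    unsigned char lead = (unsigned char)(value >> 8);
    return hit != NULL && hit > s && (unsigned char)hit[-1] == lead;
}
```

For the string "\xC2\xA9 caf\xC3\xA9" (that is, "© café"), `strchr(s, 0xC3A9)` stops at the second byte of ©, and `preceded_by_lead` reports 0 for that hit because the byte before it is 0xC2, not 0xC3.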

Answered by Tom Tanner, Feb 13 '23 21:02