From man strchr:
char *strchr(const char *s, int c);
The strchr() function returns a pointer to the first occurrence of the character c in the string s.
Here "character" means "byte"; these functions do not work with wide or multibyte characters.
Still, if I try to search for a multi-byte character like é (0xC3A9 in UTF-8):
const char str[] = "This string contains é which is a multi-byte character";
char * pos = strchr(str, (int)'é');
printf("%s\n", pos);
printf("0x%X 0x%X\n", pos[-1], pos[0]);
I get the following output:
� which is a multi-byte character
0xFFFFFFC3 0xFFFFFFA9
Despite the warning:
warning: multi-character character constant [-Wmultichar]
So here are my questions:
strchr doesn't work with multi-byte characters? (It seems to work, provided the int type is big enough to contain the multi-byte character, which can be at most 4 bytes.)
0xFFFFFF?
strchr() only seems to work for your multi-byte character.
The actual string in memory is
... c, o, n, t, a, i, n, s, ' ', 0xC3, 0xA9, ' ', w ...
When you call strchr(), you are really only searching for 0xA9, which is the lower 8 bits of the value you passed. That's why pos[-1] has the first byte of your multi-byte character: it was ignored during the search.
A char is signed on your system, which is why your characters are sign extended (the 0xFFFFFF) when you print them out.
As for the warning, it seems that the compiler is trying to tell you that you are doing something odd, which you are. Don't ignore it.
That's the problem. It seems to work. Firstly, it's entirely up to the compiler what it puts in the string if you put multibyte characters in it, if indeed it compiles it at all. Clearly you are lucky (for some appropriate interpretation of lucky) in that it has filled your string with
.... c3, a9, ' ', 'w', etc
and that you are looking for c3a9, as it can find that fairly easily. The man page on strchr says:
The strchr() function returns a pointer to the first occurrence of c (converted to a char) in string s
So you pass c3a9 to this, which is converted to a char with value 0xa9. It finds the a9 byte and returns a pointer to it.
The ffffff prefix is because you are outputting a signed char as a 32-bit hex number, so it is sign-extended for you. This is as expected.
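The problem is that this relies on implementation-defined behaviour (the value of a multi-character constant is up to the compiler), and that is just that. It might work almost correctly. And it might not, depending on circumstances.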
And again, it only almost works. You are not getting a pointer to the multibyte character; you are getting a pointer to the middle of it (and I'm surprised you're interpreting that as working). If the multibyte character had evaluated to 0xff20, you'd get pointed to somewhere much earlier in the string.