Following my previous question, Why `strchr` seems to work with multibyte characters, despite man page disclaimer?, I figured out that strchr was a bad choice.
Instead I am thinking about using strstr to look for a single character (multi-byte, not char):
const char str[] = "This string contains é which is a multi-byte character";
char * pos = strstr(str, "é"); // 'é' = 0xC3A9: 2 bytes
printf("%s\n", pos);
Output:
é which is a multi-byte character
Which is what I expect: the position of the 1st byte of my multi-byte character.
A priori, this is not the canonical use of strstr, but it seems to work well.
Is this workaround safe? Can you think of any side effect or special case that would cause a bug?
[EDIT]: I should clarify that I do not want to use the wchar_t type and that the strings I handle are UTF-8 encoded (I am aware this choice can be discussed, but that is an irrelevant debate here).
Edit
Based on the updated question from the OP: "can such a false positive exist in a UTF-8 context?"
The answer is that UTF-8 is designed in such a way that it is immune to the partial character match described in the original answer below, so it cannot cause such a false positive. It is therefore completely safe to use strstr with UTF-8 encoded multi-byte characters.
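The reason is that UTF-8 lead bytes and continuation bytes come from disjoint ranges (continuation bytes always have the form 10xxxxxx), so a valid needle cannot match starting in the middle of a character. A minimal sketch of that property follows; the helper names utf8_is_boundary and utf8_find are made up for illustration, and both arguments are assumed to be valid UTF-8.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* True if p points at the first byte of a UTF-8 character (or at the
   terminating 0). Continuation bytes always look like 10xxxxxx. */
int utf8_is_boundary(const char *p)
{
    return ((unsigned char)*p & 0xC0) != 0x80;
}

/* Hypothetical helper: byte offset of the first occurrence of needle in
   haystack, or -1 if it does not occur. Assumes valid UTF-8 input. */
long utf8_find(const char *haystack, const char *needle)
{
    const char *pos = strstr(haystack, needle);
    if (pos == NULL)
        return -1;
    /* With valid UTF-8 on both sides, a hit always starts a character. */
    assert(utf8_is_boundary(pos));
    return (long)(pos - haystack);
}
```

For instance, searching for "\xC3\xA9" (é) in "\xC3\x83\xC2\xA9" (the two characters Ã©) finds nothing, even though both strings contain the bytes 0xC3 and 0xA9.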
Original Answer
No, strstr is not suitable for strings containing multi-byte characters.
If you are searching for a string that doesn't contain a multi-byte character inside a string that does, it may give a false positive. (With Shift-JIS encoding in a Japanese locale, strstr("掘something", "@some") may give a false positive, because the second byte of 掘 is 0x40, the ASCII code for @.)
+---------+----+----+----+
|   c1    | c2 | c3 | c4 | <--- string
+---------+----+----+----+
     +----+----+----+
     | c5 | c2 | c3 |      <--- string to search
     +----+----+----+
If the trailing part of c1 (accidentally) matches c5, you may get an incorrect result. I would suggest using Unicode with a Unicode substring search function, or a multibyte-aware substring search function (_mbsstr, for example).
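To make the Shift-JIS case concrete, here is a rough sketch of a character-aware search that only attempts a match at positions where a character starts. The names sjis_is_lead and sjis_find are hypothetical, and the lead-byte ranges (0x81 to 0x9F and 0xE0 to 0xFC) are the usual Shift-JIS ones; treat it as an illustration, not a replacement for a library routine such as _mbsstr.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Lead bytes of two-byte Shift-JIS characters: 0x81..0x9F and 0xE0..0xFC. */
int sjis_is_lead(unsigned char b)
{
    return (b >= 0x81 && b <= 0x9F) || (b >= 0xE0 && b <= 0xFC);
}

/* Hypothetical character-aware search: returns the byte offset of needle
   in haystack, or -1. A match is only tried where a character begins. */
long sjis_find(const char *haystack, const char *needle)
{
    assert(haystack != NULL && needle != NULL);
    size_t n = strlen(needle);
    const char *p = haystack;
    for (;;) {
        if (strncmp(p, needle, n) == 0)
            return (long)(p - haystack);
        if (*p == '\0')
            return -1;
        /* Step over one whole character: two bytes after a lead byte
           (unless the string is truncated), one byte otherwise. */
        p += (sjis_is_lead((unsigned char)*p) && p[1] != '\0') ? 2 : 1;
    }
}
```

With the haystack bytes "\x8C\x40" "some" (掘some in Shift-JIS), plain strstr reports a match for "@some" at offset 1, in the middle of 掘, while sjis_find returns -1.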
Modern systems use UTF-8 (or ASCII) as their multibyte encoding, where the use of this function is safe.
To be strictly conforming and make your code work even on old/exotic platforms, you need to take additional problems into account.
First, the good news: in every multibyte encoding, a 0 byte indicates the end of a string, regardless of shift state. This means your strstr call won't crash, but the result may be wrong.
As an example, consider UTF-7, a 7-bit clean way to encode Unicode. UTF-7 is a multibyte encoding with a shift state, which means that how a byte is interpreted may depend on the context in which it appears. E.g. (cf. Wikipedia) “£AKM” is encoded as +AKM-AKM in UTF-7, where the + sign changes the state and the interpretation of letters like A. Doing strstr(str, "AKM") would match the first AKM portion (after the +), although this is part of the encoding of £; it actually should match the AKM portion after the - (which sets the shift state back to the initial state).
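The +AKM-AKM case can be checked with a few lines of C. The sketch below uses a hypothetical helper, utf7_find_direct, that skips shifted runs (a + followed by base64 characters and an optional -) and only matches in directly encoded text. It assumes the needle is plain ASCII without a + and it ignores the finer points of UTF-7, so it only illustrates the shift-state problem.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Characters that may appear inside a shifted (base64) run. */
int utf7_is_base64(unsigned char c)
{
    return (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z') ||
           (c >= '0' && c <= '9') || c == '+' || c == '/';
}

/* Hypothetical helper: find an ASCII needle only in the unshifted parts
   of a UTF-7 string. Returns the byte offset, or -1 if not found. */
long utf7_find_direct(const char *haystack, const char *needle)
{
    assert(haystack != NULL && needle != NULL);
    size_t n = strlen(needle);
    const char *p = haystack;
    while (*p != '\0') {
        if (*p == '+') {
            /* Shifted run: skip the base64 payload and an optional '-'. */
            p++;
            while (utf7_is_base64((unsigned char)*p))
                p++;
            if (*p == '-')
                p++;
            continue;
        }
        if (strncmp(p, needle, n) == 0)
            return (long)(p - haystack);
        p++;
    }
    return -1;
}
```

On "+AKM-AKM", strstr finds "AKM" at offset 1 (inside the encoding of £), whereas utf7_find_direct returns 5, the literal AKM after the -.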