I am trying to parse text and find certain characters in it, using the code below. It works with ordinary characters such as abcdef, but it does not work with öçşğüı, and GCC gives compilation warnings. What should I do to work with öçşğüı?
Code :
#include <stdio.h>
#include <ctype.h>
#include <string.h>
int main()
{
char * text = "öçşğü";
int i=0;
text = strdup(text);
while (text[i])
{
if(text[i] == 'ö')
{
printf("ö \n");
}
i++;
}
return 0;
}
Warning :
warning: multi-character character constant [-Wmultichar]
warning: comparison is always false due to limited range of data type [-Wtype-limits]
When I print each char in the while loop, I get 10 values:
printf("%d : %p \n", i, text[i]);
output :
0 : 0xffffffc3
1 : 0xffffffb6
2 : 0xffffffc3
3 : 0xffffffa7
4 : 0xffffffc5
5 : 0xffffff9f
6 : 0xffffffc4
7 : 0xffffff9f
8 : 0xffffffc3
9 : 0xffffffbc
and strlen is 10. But if I use abcde:
0 : 0x61
1 : 0x62
2 : 0x63
3 : 0x64
4 : 0x65
and strlen is 5. If I use wchar_t for text, the output is:
0 : 0xa7c3b6c3
1 : 0x9fc49fc5
2 : 0xbcc3
and strlen is 10, wcslen is 3.
To go through each of the characters in the string, you can use mblen. You also need to set the correct locale (the encoding used by the multibyte string) so that mblen can parse the multibyte string correctly.
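#include <stdio.h>
#include <string.h>
#include <stdlib.h>
#include <locale.h>
int main(void)
{
    const char *text = "öçşğü";
    const char *target = "ö";
    int i = 0, char_len;
    /* mblen() decodes according to LC_CTYPE, so select a UTF-8 locale first */
    if (setlocale(LC_CTYPE, "en_US.utf8") == NULL)
    {
        fprintf(stderr, "UTF-8 locale not available\n");
        return 1;
    }
    while ((char_len = mblen(&text[i], MB_CUR_MAX)) > 0)
    {
        /* &text[i] points to a multibyte character of length char_len */
        if ((size_t)char_len == strlen(target) && memcmp(&text[i], target, char_len) == 0)
        {
            printf("ö \n");
        }
        i += char_len;
    }
    return 0;
}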
There are two types of string representation: multibyte (8-bit bytes) and wide character (whose size depends on the platform). The multibyte representation has the advantage that it can be stored in a char * (the usual C string, as in your code), but the disadvantage that a single character may take several bytes. A wide string is represented using wchar_t *. wchar_t has the advantage that one wchar_t is one character (however, as @anatolyg pointed out, this assumption can still go wrong on platforms where wchar_t is not able to represent all possible characters).
Have you looked at your source code using a hex editor? The string "öçşğü" is actually represented in memory by the multibyte sequence c3 b6 c3 a7 c5 9f c4 9f c3 bc (UTF-8 encoding), followed by the zero terminator. You see 5 characters only because the string is rendered correctly by your UTF-8 aware viewer/browser. That is why strlen(text) returns 10 for this string, whereas the above code loops only 5 times.
If you use a wide-character string, it can be done as explained by @WillBriggs.