
How to compare multibyte characters in C

Tags:

c

compare

I am trying to parse text and find certain characters in it, using the code below. It works with plain ASCII characters like abcdef, but not with öçşğüı, and GCC emits warnings at compile time. What should I do to make it work with öçşğüı?

Code :

#include <stdio.h>
#include <ctype.h>
#include <string.h>

int main()
{
    char * text = "öçşğü";
    int i=0;

    text = strdup(text);

    while (text[i])
    {       
        if(text[i] == 'ö')
        {
            printf("ö \n");
        }

        i++;
    }

    return 0;
}

Warning :

warning: multi-character character constant [-Wmultichar]
warning: comparison is always false due to limited range of data type [-Wtype-limits]

When I print each char inside the while loop (passing the char value to %p, not an address), I get 10 values:

printf("%d : %p \n", i, text[i]);

output :

0 : 0xffffffc3 
1 : 0xffffffb6 
2 : 0xffffffc3 
3 : 0xffffffa7 
4 : 0xffffffc5 
5 : 0xffffff9f 
6 : 0xffffffc4 
7 : 0xffffff9f 
8 : 0xffffffc3 
9 : 0xffffffbc 

and strlen is 10.

But if I use abcde:

0 : 0x61 
1 : 0x62 
2 : 0x63 
3 : 0x64 
4 : 0x65 

and strlen is 5.


If I use wchar_t for text, the output is:

0 : 0xa7c3b6c3 
1 : 0x9fc49fc5 
2 : 0xbcc3 

and strlen is 10, wcslen is 3.

asked Nov 16 '15 14:11 by utarid



1 Answer

To step through the string one character at a time, you can use mblen. You also need to set a locale whose encoding matches that of the multibyte string, so that mblen can parse it correctly.

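#include <stdio.h>
#include <string.h>
#include <stdlib.h>
#include <locale.h>

int main(void)
{
    const char *text = "öçşğü";
    const char *target = "ö";
    size_t target_len = strlen(target);
    size_t i = 0;
    int char_len;

    /* The locale must use the same encoding as the string (UTF-8 here).
       Locale names are system-dependent, so check the result. */
    if (setlocale(LC_CTYPE, "en_US.utf8") == NULL)
    {
        fprintf(stderr, "locale en_US.utf8 is not available\n");
        return 1;
    }

    /* mblen returns 0 at the terminating zero and -1 on an invalid sequence */
    while ((char_len = mblen(&text[i], MB_CUR_MAX)) > 0)
    {
        /* &text[i] points to a multibyte character of length char_len;
           compare lengths first so memcmp never reads past target */
        if ((size_t)char_len == target_len &&
            memcmp(&text[i], target, target_len) == 0)
        {
            printf("ö \n");
        }

        i += (size_t)char_len;
    }

    return 0;
}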
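If the text is known to be UTF-8, a locale-free variant is also possible. The following is only a sketch under that assumption, and utf8_count_char is a hypothetical helper name, not a standard function. It relies on strstr comparing raw bytes: in valid UTF-8, lead bytes and continuation bytes use disjoint ranges, so a match for a complete encoded character can only start on a character boundary.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: count how often one UTF-8 encoded character
   (passed as a string such as "ö") occurs in UTF-8 encoded text.
   Assumes both arguments are valid UTF-8. */
size_t utf8_count_char(const char *text, const char *ch)
{
    assert(text != NULL && ch != NULL);

    size_t ch_len = strlen(ch);
    size_t count = 0;
    const char *p;

    if (ch_len == 0)
    {
        return 0;   /* nothing to search for */
    }

    /* strstr works on bytes; continue after each match */
    for (p = strstr(text, ch); p != NULL; p = strstr(p + ch_len, ch))
    {
        count++;
    }

    return count;
}
```

A call such as utf8_count_char(text, "ö") would then replace the mblen loop, provided the source file itself is saved as UTF-8 so that the literal "ö" has the same bytes as the text.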

There are two ways to represent such strings: multibyte (sequences of 8-bit bytes) or wide character (element size depends on the platform). A multibyte string has the advantage that it fits in a char * (an ordinary C string, as in your code), but the disadvantage that one character may occupy several bytes. A wide string is represented using wchar_t *, with the advantage that one wchar_t is one character (however, as @anatolyg pointed out, this assumption can still fail on platforms where wchar_t cannot represent all possible characters).

Have you looked at your source code in a hex editor? The string "öçşğü" is stored in memory as the multibyte sequence c3 b6 c3 a7 c5 9f c4 9f c3 bc (UTF-8 encoding), followed by a terminating zero byte. You see 5 characters only because your UTF-8 aware viewer/browser renders those bytes as characters. This is why strlen(text) returns 10, whereas the loop above runs only 5 times. It also explains the warnings: 'ö' in the source is two bytes, so GCC treats it as a multi-character constant of type int, which can never equal a single char.
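The difference between bytes and characters can be checked without any locale. The sketch below assumes the input is valid UTF-8, and utf8_char_count is a hypothetical helper name. It counts every byte that is not a continuation byte (continuation bytes have the bit pattern 10xxxxxx), which gives the number of encoded characters.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: number of characters (code points) in a
   UTF-8 string, assuming the string is valid UTF-8. */
size_t utf8_char_count(const char *s)
{
    assert(s != NULL);

    size_t count = 0;

    for (; *s != '\0'; s++)
    {
        /* skip continuation bytes (10xxxxxx); count everything else */
        if (((unsigned char)*s & 0xC0) != 0x80)
        {
            count++;
        }
    }

    return count;
}
```

For the bytes c3 b6 c3 a7 c5 9f c4 9f c3 bc, strlen reports 10 while this helper reports 5, matching the numbers above.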

If you use a wide-character string, it can be done as explained by @WillBriggs.
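As a rough illustration of the wide-character route (not necessarily the approach from that answer), the multibyte string can be converted with mbstowcs and then compared one wchar_t at a time. This is a sketch with several assumptions: use_utf8_locale and count_wide_char are hypothetical helper names, the locale names tried are common on Linux but system-dependent, and comparing against a value such as L'ö' only works where wchar_t can hold that character.

```c
#include <assert.h>
#include <locale.h>
#include <stdlib.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical helper: try a few common UTF-8 locale names.
   Returns 1 if one of them could be set, 0 otherwise. */
int use_utf8_locale(void)
{
    static const char *names[] = { "C.UTF-8", "en_US.UTF-8", "en_US.utf8" };
    size_t i;

    for (i = 0; i < sizeof names / sizeof names[0]; i++)
    {
        if (setlocale(LC_CTYPE, names[i]) != NULL)
        {
            return 1;
        }
    }

    return 0;
}

/* Hypothetical helper: convert text using the current locale and count
   the wide characters equal to target. Returns -1 on failure. */
long count_wide_char(const char *text, wchar_t target)
{
    assert(text != NULL);

    /* a string never has more characters than bytes */
    size_t max = strlen(text) + 1;
    wchar_t *wide = malloc(max * sizeof *wide);
    size_t n, i;
    long count = 0;

    if (wide == NULL)
    {
        return -1;
    }

    n = mbstowcs(wide, text, max);
    if (n == (size_t)-1)
    {
        free(wide);   /* invalid sequence for the current locale */
        return -1;
    }

    for (i = 0; i < n; i++)
    {
        if (wide[i] == target)
        {
            count++;
        }
    }

    free(wide);
    return count;
}
```

After use_utf8_locale() succeeds, a call like count_wide_char(text, L'ö') counts the matches. The value of L'ö' depends on how the compiler reads the source file, so the file encoding has to match what the compiler expects.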

answered Oct 19 '22 06:10 by user1969104
