
Handling multibyte (non-ASCII) characters in C

I am trying to write my own version of wc (the Unix filter), but I have a problem with non-ASCII characters. I did a hex dump of a text file and found out that these characters occupy more than one byte, so they won't fit into a char. Is there any way I can read these characters from a file and handle them like single characters (in order to count characters in a file) in C? I've been googling a little bit and found the wchar_t type, but there were no simple examples of how to use it with files.

asked Jan 03 '11 by user561838

2 Answers

I've been googling a little bit and found the wchar_t type, but there were no simple examples of how to use it with files.

Well met. There weren't any simple examples because, unfortunately, proper character set support isn't simple.

Aside: In an ideal world, everybody would use UTF-8 (a Unicode encoding that is memory-efficient, robust, and backward-compatible with ASCII), the standard C library would include UTF-8 encoding-decoding support, and the answer to this question (and dealing with text in general) would be simple and straightforward.

The answer to the question "What is the best Unicode library for C?" is to use the ICU library. You may want to look at ustdio.h: it has a u_fgetc function, and adding Unicode support to your program will probably take little more than typing u_ a few times.

Also, if you can spare a few minutes for some light reading, you may want to read The Absolute Minimum Every Software Developer Absolutely, Positively Must Know about Unicode and Character Sets (No Excuses!) from Joel On Software.

I, personally, have never used ICU, but I probably will from now on :-)

answered Oct 08 '22 by Joey Adams


If you want to write a standard C version of the wc utility that respects the current language setting when it is run, then you can indeed use the wchar_t versions of the stdio functions. At program startup, you should call setlocale():

setlocale(LC_CTYPE, "");

This will cause the wide character functions to use the appropriate character set defined by the environment, e.g. on Unix-like systems, the LANG environment variable. For example, this means that if your LANG variable is set to a UTF-8 locale, the wide character functions will handle input and output in UTF-8. (This is how the POSIX wc utility is specified to work.)

You can then use the wide-character versions of all the standard functions. For example, if you have code like this:

long words = 0;
int in_word = 0;
int c;

while ((c = getchar()) != EOF)
{
    if (isspace(c))
    {
        if (in_word)
        {
            in_word = 0;
            words++;
        }
    }
    else
    {
        in_word = 1;
    }
}

...you would convert it to the wide character version by changing c to a wint_t, getchar() to getwchar(), EOF to WEOF and isspace() to iswspace():

long words = 0;
int in_word = 0;
wint_t c;

while ((c = getwchar()) != WEOF)
{
    if (iswspace(c))
    {
        if (in_word)
        {
            in_word = 0;
            words++;
        }
    }
    else
    {
        in_word = 1;
    }
}

answered Oct 08 '22 by caf