
C: reading non-ASCII characters

I am parsing a file that involves characters such as æ ø å. If we assume I have stored a line of the text file as follows

#define MAXLINESIZE 1024
char* buffer = malloc(MAXLINESIZE);
...
fgets(buffer,MAXLINESIZE,handle);
...

Suppose I want to count the number of characters on a line. If I try the following:

char* p = buffer;
int count = 0;
while (*p != '\n') {
    if (isgraph(*p)) {
        count++;
    }
    p++;
}

this ignores every occurrence of æ, ø or å,

i.e. counting "aåeæioøu" returns 5, not 8.

Do I need to read the file in a different way? Should I be using an int* instead of a char*?

beoliver asked Sep 11 '15 12:09


3 Answers

Let's say you use UTF-8.

You need to understand how UTF-8 works.

Here's a small function that should do what you want:

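int nbChars(const char *str) {
    int len = 0;
    int i = 0;
    int charSize = 0; // Bytes still expected for the current char

    if (!str)
        return -1;
    while (str[i])
    {
        unsigned char c = (unsigned char)str[i]; // avoid sign extension when shifting

        if (charSize == 0)
        {
            ++len;
            if (!(c >> 7 & 1)) // 0xxxxxxx: ascii char
                charSize = 1;
            else if (!(c >> 6 & 1)) // 10xxxxxx: continuation byte with no lead byte
                return -1;
            else if (!(c >> 5 & 1)) // 110xxxxx
                charSize = 2;
            else if (!(c >> 4 & 1)) // 1110xxxx
                charSize = 3;
            else if (!(c >> 3 & 1)) // 11110xxx
                charSize = 4;
            else
                return -1; // not supposed to happen
        }
        else if ((c >> 6 & 3) != 2) // parentheses needed: != binds tighter than &
            return -1;
        --charSize;
        ++i;
    }
    if (charSize != 0) // string ended in the middle of a char
        return -1;
    return len;
}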

It returns the number of chars, and -1 if it's not a valid UTF-8 string.

(By non-valid UTF-8 string, I mean the format is not valid. I don't check if the character actually exists)

EDIT: As stated in the comment section, this code doesn't handle decomposed Unicode: a letter stored as a base character followed by a combining mark is counted as two characters.
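For input that is already known to be valid UTF-8, a shorter alternative is to count every byte that is not a continuation byte (10xxxxxx). This is a different technique from the function above and does no validation at all. A minimal sketch; the name utf8_count is made up for illustration:

```c
#include <stddef.h>

/* Hypothetical helper: counts code points in a NUL-terminated string that is
   assumed to be valid UTF-8. Every byte except a continuation byte
   (bit pattern 10xxxxxx) starts a new code point. No validation is done. */
size_t utf8_count(const char *s)
{
    size_t n = 0;

    for (; *s != '\0'; ++s) {
        /* Cast first: plain char may be signed. */
        if (((unsigned char)*s & 0xC0) != 0x80)
            ++n;
    }
    return n;
}
```

With the question's string "aåeæioøu" encoded as UTF-8 this gives 8. It has the same limitation as mentioned in the EDIT: "a" followed by the combining ring U+030A (bytes 0xCC 0x8A) counts as 2.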

4rzael answered Nov 02 '22 12:11


You need to know which encoding is used for your characters. It is very probably UTF-8 (and you should use UTF-8 everywhere); read Joel's blog on Unicode. If your encoding is not UTF-8, you should convert it to UTF-8, e.g. using libiconv.
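As a rough sketch of such a conversion, here is the POSIX iconv interface (provided by glibc and by GNU libiconv) turning Latin-1 into UTF-8. The helper name latin1_to_utf8 is made up, and the source encoding "ISO-8859-1" is only an assumption about what the file might contain:

```c
#include <iconv.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: converts a NUL-terminated ISO-8859-1 string to UTF-8.
   Returns the number of bytes written (not counting the terminating NUL),
   or -1 if the conversion fails or the output buffer is too small. */
long latin1_to_utf8(const char *in, char *out, size_t out_size)
{
    if (out_size == 0)
        return -1;

    /* Argument order is (to, from). The source encoding is an assumption. */
    iconv_t cd = iconv_open("UTF-8", "ISO-8859-1");
    if (cd == (iconv_t)-1)
        return -1;

    char *src = (char *)in;          /* iconv takes char **, input is not modified */
    size_t src_left = strlen(in);
    char *dst = out;
    size_t dst_left = out_size - 1;  /* keep one byte for the NUL */

    size_t rc = iconv(cd, &src, &src_left, &dst, &dst_left);
    iconv_close(cd);
    if (rc == (size_t)-1)
        return -1;

    *dst = '\0';
    return (long)(dst - out);
}
```

On systems where iconv is not part of the C library you may have to link with -liconv.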

Then you need a C library for UTF-8. There are many of them (but none is standardized in the C11 language yet). I recommend libunistring or glib (from GTK), but see also this.

Your code will change, since a UTF-8 character takes one to four 8-bit bytes (the Wikipedia UTF-8 page mentions up to 6 bytes, which refers to the original design; RFC 3629 limits UTF-8 to four. See the Unicode standard for details). You won't test whether a single byte (i.e. a plain C char) is a letter, but whether a byte and the few bytes after it (given by a pointer, i.e. a char* or better a uint8_t*) encode a letter (including Cyrillic letters, etc.).

Not every sequence of bytes is a valid UTF-8 representation, and you might want to validate a line (or a null-terminated C string) before analyzing it.
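If you would rather not pull in a library, decoding "a byte and the few bytes after it" and rejecting invalid sequences can be sketched by hand as below. This is only an illustration, not what libunistring or glib do internally; the name utf8_decode is made up, and the input is assumed to be NUL-terminated:

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: decodes the UTF-8 sequence starting at s (which must
   be NUL-terminated). Stores the code point in *cp and returns the number of
   bytes consumed (1 to 4), or 0 if the sequence is not valid UTF-8. */
size_t utf8_decode(const uint8_t *s, uint32_t *cp)
{
    size_t len;
    uint32_t min; /* smallest code point allowed for this length */

    if (s[0] < 0x80) {                  /* 0xxxxxxx */
        *cp = s[0];
        return 1;
    } else if ((s[0] & 0xE0) == 0xC0) { /* 110xxxxx */
        len = 2; *cp = s[0] & 0x1F; min = 0x80;
    } else if ((s[0] & 0xF0) == 0xE0) { /* 1110xxxx */
        len = 3; *cp = s[0] & 0x0F; min = 0x800;
    } else if ((s[0] & 0xF8) == 0xF0) { /* 11110xxx */
        len = 4; *cp = s[0] & 0x07; min = 0x10000;
    } else {
        return 0; /* stray continuation byte or invalid lead byte */
    }

    for (size_t i = 1; i < len; i++) {
        /* A NUL fails this test too, so we never read past the terminator. */
        if ((s[i] & 0xC0) != 0x80)
            return 0;
        *cp = (*cp << 6) | (s[i] & 0x3F);
    }

    if (*cp < min)                      /* overlong encoding */
        return 0;
    if (*cp > 0x10FFFF)                 /* beyond the Unicode range */
        return 0;
    if (*cp >= 0xD800 && *cp <= 0xDFFF) /* UTF-16 surrogate */
        return 0;
    return len;
}
```

For example, the bytes 0xC3 0xA5 decode to U+00E5 (å) with a return value of 2. Deciding whether that code point is a letter still needs Unicode character data, which is where the libraries mentioned above come in.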

Basile Starynkevitch answered Nov 02 '22 12:11


The C standard IO library can only read bytes. Your file probably contains multibyte characters, encoded with UTF-8 or some other encoding. You'll need a library for interpreting such files.

It is possible that your file contains Latin1 text, in which case characters are bytes. In this case, you cannot use isgraph unless you have the proper locale set.
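For the Latin-1 case, a sketch of the counting loop could look like this. Note that passing a plain char with a negative value to isgraph is undefined behaviour, so the bytes are read as unsigned char. The helper names are made up, and the locale name "en_US.ISO-8859-1" is an assumption: locale names differ between systems and that locale may not be installed on yours.

```c
#include <ctype.h>
#include <locale.h>
#include <stddef.h>

/* Hypothetical helper: counts graphic characters on one line of single-byte
   text (e.g. Latin-1), as classified by the current LC_CTYPE locale.
   Stops at the newline or at the end of the string. */
size_t count_graph(const char *line)
{
    size_t count = 0;
    const unsigned char *p = (const unsigned char *)line;

    for (; *p != '\0' && *p != '\n'; ++p) {
        if (isgraph(*p)) /* safe: value is in the unsigned char range */
            ++count;
    }
    return count;
}

/* Hypothetical helper: tries to select a Latin-1 locale for character
   classification. The locale name is an assumption and is system-specific.
   Returns 1 on success, 0 if the locale is not available. */
int use_latin1_locale(void)
{
    return setlocale(LC_CTYPE, "en_US.ISO-8859-1") != NULL;
}
```

Call use_latin1_locale() once at startup and check its result; if it returns 0, bytes such as 0xE5 (å in Latin-1) will most likely still be skipped.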

Bottom line: find the encoding used in your file. Then read it accordingly. In any case, plain C does not know about encodings.

lhf answered Nov 02 '22 14:11