
Process UTF-8 characters in C from a text file

Tags:

c

file

input

utf-8

I need to read UTF-8 characters from a text file and process them, for instance to calculate how often a certain character occurs. Ordinary characters are fine; the problem occurs with characters like ü or ğ. The following is my code, which checks whether a certain character occurs by comparing the ASCII code of the incoming character:

FILE * fin;
FILE * fout;
wchar_t c;
fin=fopen ("input.txt","r");
fout=fopen("out.txt","w");
int frequency = 0;
while((c=fgetwc(fin))!=WEOF)
{
   if(c == SOME_NUMBER){ frequency++; }
}

SOME_NUMBER is what I can't figure out for those characters. In fact, those characters print out 5 different numbers when I try to print them as decimals, whereas for the character 'a', for example, I would write if(c == 97){ frequency++; } since the ASCII code of 'a' is 97. Is there any way that I could identify those special characters in C?

P.S. Working with ordinary char (not wchar_t) creates the same problem, but this time printing the decimal equivalent of the incoming character prints 5 different NEGATIVE numbers for those special characters. The problem stands.

Ams asked Nov 14 '14 12:11


2 Answers

You can write your own UTF-8 decoding read function.

See the format description at https://en.wikipedia.org/wiki/UTF-8

This code is not very polished or robust, but it is a sketch of what I meant...

#include <stdio.h>
#include <stdlib.h>

#define INVALID (-2)

/* Read one UTF-8 encoded character from f and return its code point.
   Returns EOF at end of input and INVALID for a malformed sequence. */
int fgetutf8c(FILE* f)
{
    int result = 0;
    int input[6] = {0};

    input[0] = fgetc(f);
    if (input[0] == EOF)
    {
        // The EOF was hit by the first character.
        result = EOF;
    }
    else if (input[0] < 0x80)
    {
        // The first byte is a complete 7 bit (ASCII) character.
        result = input[0];
    }
    else if ((input[0] & 0xC0) == 0x80)
    {
        // A continuation byte (10xxxxxx) cannot begin a multibyte sequence.
        return INVALID;
    }
    else if ((input[0] & 0xfe) == 0xfe)
    {
        // 0xFE and 0xFF never occur in a valid UTF-8 stream.
        return INVALID;
    }
    else
    {
        // The number of leading 1 bits in the first byte is the sequence length.
        int sequence_length;
        for(sequence_length = 1; input[0] & (0x80 >> sequence_length); ++sequence_length);
        // Keep only the payload bits below the length prefix (5 bits for a 2 byte sequence).
        result = input[0] & ((1 << (7 - sequence_length)) - 1);
        int index;
        for(index = 1; index < sequence_length; ++index)
        {
            input[index] = fgetc(f);
            if (input[index] == EOF)
            {
                // The stream ended in the middle of a sequence.
                return EOF;
            }
            if ((input[index] & 0xC0) != 0x80)
            {
                // Every following byte must have the form 10xxxxxx.
                return INVALID;
            }
            // Each continuation byte contributes its low 6 bits.
            result = (result << 6) | (input[index] & 0x3F);
        }
    }
    return result;
}

int main(int argc, char **argv)
{
   if (argc < 2)
   {
       fprintf(stderr, "usage: %s FILE\n", argv[0]);
       return EXIT_FAILURE;
   }
   printf("open(%s) ", argv[1]);
   FILE *f = fopen(argv[1], "r");
   if (f == NULL)
   {
       perror(argv[1]);
       return EXIT_FAILURE;
   }
   int c;
   while ((c = fgetutf8c(f)) != EOF)
   {
       printf("* %d\n", c);
   }
   fclose(f);
   return EXIT_SUCCESS;
}
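
To see why ü and ğ show up as more than one number, it can help to decode them by hand: ü is stored as the two bytes 0xC3 0xBC, which combine to the code point U+00FC (252), and ğ is stored as 0xC4 0x9F, which combine to U+011F (287). The following is a minimal sketch of the same decoding idea applied to a byte buffer instead of a FILE, with a frequency counter on top. The names utf8_decode and utf8_count are made up for this illustration, and checks for overlong forms and surrogates are left out.

```c
#include <stddef.h>
#include <stdint.h>

/* Decode one UTF-8 sequence starting at s, with at most n bytes available.
 * Stores the code point in *cp and returns the number of bytes consumed,
 * or 0 if the bytes do not form a well-formed sequence.
 * Hypothetical helper; overlong forms and surrogates are not rejected. */
size_t utf8_decode(const unsigned char *s, size_t n, uint32_t *cp)
{
    size_t len, i;
    uint32_t value;

    if (n == 0) return 0;
    if (s[0] < 0x80)                { len = 1; value = s[0]; }
    else if ((s[0] & 0xE0) == 0xC0) { len = 2; value = s[0] & 0x1F; }
    else if ((s[0] & 0xF0) == 0xE0) { len = 3; value = s[0] & 0x0F; }
    else if ((s[0] & 0xF8) == 0xF0) { len = 4; value = s[0] & 0x07; }
    else return 0;                  /* continuation byte or invalid lead byte */

    if (len > n) return 0;          /* sequence is cut off by the end of the buffer */
    for (i = 1; i < len; ++i) {
        if ((s[i] & 0xC0) != 0x80) return 0;   /* must look like 10xxxxxx */
        value = (value << 6) | (s[i] & 0x3F);  /* append the low 6 bits */
    }
    *cp = value;
    return len;
}

/* Count how often the code point `target` occurs in a UTF-8 buffer.
 * Bytes that cannot be decoded are skipped one at a time. */
size_t utf8_count(const unsigned char *s, size_t n, uint32_t target)
{
    size_t count = 0, pos = 0;

    while (pos < n) {
        uint32_t cp;
        size_t used = utf8_decode(s + pos, n - pos, &cp);
        if (used == 0) { ++pos; continue; }
        if (cp == target) ++count;
        pos += used;
    }
    return count;
}
```

With this, the SOME_NUMBER from the question becomes the code point: calling utf8_count(buffer, length, 0xFC) counts ü, and 0x11F counts ğ.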
V-X answered Sep 22 '22 14:09


A modern C platform should provide everything you need for such a task.

The first thing you have to make sure of is that your program runs under a locale that can handle UTF-8. Your environment should already be set to that; the only thing you have to do in your code is

setlocale(LC_ALL, "");

to switch from the "C" locale to your native environment (setlocale is declared in <locale.h>).

Then you can read strings as usual, e.g. with fgets. To do comparisons for accented characters and the like, you'd have to convert such a string to a wide character string (mbsrtowcs), as you already mention. The encoding of such wide characters is implementation-defined, but you don't need to know that encoding to do checks.

Usually something like L'ä' will work perfectly as long as the platforms on which you compile and execute are not completely screwed up. If you need codes that you can't even enter on the keyboard, you can use the L'\uXXXX' universal character name notation (standardized in C99 and kept in C11), as didierc mentions in his answer. (L'\uXXXX' is for the "basic" characters; if you have something really weird you'd use L'\UXXXXXXXX', a capital U with 8 hex digits.)

As said, the encoding for wide characters is implementation-defined, but chances are good that it is either UTF-16 or UTF-32, which you can check with sizeof(wchar_t) and the predefined macro __STDC_ISO_10646__. Even if your platform only supports UTF-16 (which may have 2-word "characters"), the use case you describe shouldn't cause any trouble, since all your characters can be coded with the L'\uXXXX' form.
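
The check described above can be sketched as follows. Both function names are invented for this example, and the second one is only a rough test based on WCHAR_MAX, not a full statement about the platform's encoding.

```c
#include <limits.h>
#include <wchar.h>

/* Return the width of wchar_t in bits if the implementation advertises,
 * through __STDC_ISO_10646__, that wchar_t values are ISO 10646 (Unicode)
 * code points. Return 0 if the macro is not defined, in which case the
 * encoding is implementation-defined and not announced this way.
 * Hypothetical helper for illustration. */
int wchar_unicode_bits(void)
{
#ifdef __STDC_ISO_10646__
    return (int)(sizeof(wchar_t) * CHAR_BIT);
#else
    return 0;
#endif
}

/* Rough check whether a code point can be stored in a single wchar_t.
 * With a 16 bit wchar_t this holds only for code points up to 0xFFFF,
 * which still covers ü (0xFC) and ğ (0x11F). */
int fits_in_one_wchar(unsigned long code_point)
{
    return code_point <= (unsigned long)WCHAR_MAX;
}
```

A result of 32 from wchar_unicode_bits() suggests a UTF-32 style wchar_t; a result of 0 only means the macro is absent, so nothing can be concluded from it alone.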

Jens Gustedt answered Sep 24 '22 14:09