
Handling special characters in C (UTF-8 encoding)

I'm writing a small application in C that reads a simple text file and then outputs the lines one by one. The problem is that the text file contains special characters like Æ, Ø and Å among others. When I run the program in terminal the output for those characters are represented with a "?".

Is there an easy fix?

asked Sep 03 '09 by o01


2 Answers

First things first:

  1. Read the file into a buffer
  2. Use libiconv or similar to convert the UTF-8 bytes to wchar_t
  3. Use the wide-character functions in C, such as wprintf(); most file/output handling functions have a wide-character variant

Ensure that your terminal can handle UTF-8 output. Setting up the correct locale can automate a lot of the file opening and conversion for you, depending on what you are doing.
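As a minimal sketch of that locale setup (assuming the terminal itself uses a UTF-8 locale; setlocale() and fputws() are standard C):

#include <locale.h>
#include <stdio.h>
#include <wchar.h>

int main(void)
{
    /* Adopt the locale from the environment (e.g. LANG=en_US.UTF-8);
       without this the program stays in the "C" locale and wide
       output of non-ASCII characters can fail. */
    if (setlocale(LC_ALL, "") == NULL)
        return 1;

    /* Wide string literal, written out through the locale's encoding;
       on a UTF-8 terminal this shows the three letters correctly. */
    fputws(L"Æ Ø Å\n", stdout);
    return 0;
}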

Remember that the width of a code point in UTF-8 is variable: one to four bytes per character. This means you can't just seek to an arbitrary byte offset and begin reading, as you can with ASCII, because you might land in the middle of a code point. Good libraries can handle this for you in some cases.
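The variable width is visible in the lead byte alone. A small sketch of detecting it (utf8_len is a hypothetical helper written for this example, not a library function):

#include <assert.h>
#include <stddef.h>

/* Length in bytes of a UTF-8 sequence, determined from its lead byte;
   returns 0 for a continuation byte (0b10xxxxxx), i.e. a position in
   the middle of a code point. */
static size_t utf8_len(unsigned char lead)
{
    if (lead < 0x80)         return 1; /* ASCII, 0xxxxxxx */
    if ((lead & 0xE0) == 0xC0) return 2; /* 110xxxxx */
    if ((lead & 0xF0) == 0xE0) return 3; /* 1110xxxx */
    if ((lead & 0xF8) == 0xF0) return 4; /* 11110xxx */
    return 0;                            /* continuation byte */
}

int main(void)
{
    const unsigned char ae[] = { 0xC3, 0x86 }; /* "Æ" in UTF-8 */
    assert(utf8_len('A') == 1);
    assert(utf8_len(ae[0]) == 2);
    assert(utf8_len(ae[1]) == 0); /* landed mid-code-point */
    return 0;
}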

Here is some code (not mine) that demonstrates some usage of UTF-8 file reading and wide character handling in C.

#include <stdio.h>
#include <wchar.h>

int main(void)
{
    /* ",ccs=UTF-8" is a glibc/MSVC extension: it opens the stream in
       wide-character mode with UTF-8 as the external encoding. */
    FILE *f = fopen("data.txt", "r,ccs=UTF-8");
    if (!f)
        return 1;

    /* fgetwc() decodes one code point per call. */
    for (wint_t c; (c = fgetwc(f)) != WEOF;)
        printf("%04X\n", (unsigned)c);

    fclose(f);
    return 0;
}

Links

  1. libiconv
  2. Locale data in C/GNU libc
  3. Some handy info
  4. Another good Unicode/UTF-8 in C resource
answered Sep 23 '22 by Aiden Bell

Make sure you're not accidentally dropping any bytes; some UTF-8 characters are more than one byte in length (that's sort of the point), and you need to keep them all.
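To see the mismatch between bytes and characters, here is a self-contained check (the byte values are the standard UTF-8 encodings of Æ, Ø and Å, written out explicitly so the example does not depend on the source file's own encoding):

#include <assert.h>
#include <string.h>

int main(void)
{
    /* "ÆØÅ" as explicit UTF-8 bytes: Æ = C3 86, Ø = C3 98, Å = C3 85. */
    const char *s = "\xC3\x86\xC3\x98\xC3\x85";

    /* Three characters but six bytes: strlen() counts bytes, so any
       code that truncates by byte count can split a character in two. */
    assert(strlen(s) == 6);
    return 0;
}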

It can be useful to print the contents of the buffer as hex, so you can inspect which bytes are actually read:

static void print_buffer(const char *buffer, size_t length)
{
  size_t i;

  /* Cast through unsigned char first: plain char may be signed, and a
     direct cast to unsigned int would sign-extend bytes >= 0x80,
     printing e.g. "ffffffc3" instead of "c3". */
  for (i = 0; i < length; i++)
    printf("%02x ", (unsigned int) (unsigned char) buffer[i]);
  putchar('\n');
}

You can do this after loading a very short file, containing just a few characters.

Also make sure the terminal is set to the proper encoding, so it interprets your characters as UTF-8.
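From the shell you can check which locale the terminal session is using; if LC_CTYPE does not name a UTF-8 locale, the terminal will usually not decode the bytes correctly (the exact locale name varies by system):

# Print the character-classification locale; look for a .UTF-8 suffix,
# e.g. "en_US.UTF-8".
locale | grep LC_CTYPE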

answered Sep 26 '22 by unwind