
Handling special characters in C (UTF-8 encoding)

I'm writing a small application in C that reads a simple text file and then outputs the lines one by one. The problem is that the text file contains special characters like Æ, Ø and Å among others. When I run the program in terminal the output for those characters are represented with a "?".

Is there an easy fix?

asked Sep 03 '09 by o01


2 Answers

First things first:

  1. Read the file into a buffer
  2. Use libiconv or similar to convert the UTF-8 bytes to wchar_t
  3. Use the wide-character functions in C, such as wprintf(); most file/output handling functions have a wide-character variant

Ensure that your terminal can handle UTF-8 output. Setting up the correct locale can automate a lot of the file opening and conversion for you, depending on what you are doing.
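As a minimal sketch of that locale setup (assuming the terminal itself uses a UTF-8 locale; setlocale() and fputws() are standard C):

#include <locale.h>
#include <stdio.h>
#include <wchar.h>

int main(void)
{
    /* Adopt the locale from the environment (e.g. LANG=en_US.UTF-8);
       without this the program stays in the "C" locale and wide
       output of non-ASCII characters can fail. */
    if (setlocale(LC_ALL, "") == NULL)
        return 1;

    /* Wide string literal, written out through the locale's encoding;
       on a UTF-8 terminal this shows the three letters correctly. */
    fputws(L"Æ Ø Å\n", stdout);
    return 0;
}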

Remember that the width of a code point in UTF-8 is variable: one to four bytes per character. This means you can't just seek to an arbitrary byte offset and begin reading, as you can with ASCII, because you might land in the middle of a code point. Good libraries can handle this for you in some cases.
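The variable width is visible in the lead byte alone. A small sketch of detecting it (utf8_len is a hypothetical helper written for this example, not a library function):

#include <assert.h>
#include <stddef.h>

/* Length in bytes of a UTF-8 sequence, determined from its lead byte;
   returns 0 for a continuation byte (0b10xxxxxx), i.e. a position in
   the middle of a code point. */
static size_t utf8_len(unsigned char lead)
{
    if (lead < 0x80)         return 1; /* ASCII, 0xxxxxxx */
    if ((lead & 0xE0) == 0xC0) return 2; /* 110xxxxx */
    if ((lead & 0xF0) == 0xE0) return 3; /* 1110xxxx */
    if ((lead & 0xF8) == 0xF0) return 4; /* 11110xxx */
    return 0;                            /* continuation byte */
}

int main(void)
{
    const unsigned char ae[] = { 0xC3, 0x86 }; /* "Æ" in UTF-8 */
    assert(utf8_len('A') == 1);
    assert(utf8_len(ae[0]) == 2);
    assert(utf8_len(ae[1]) == 0); /* landed mid-code-point */
    return 0;
}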

Here is some code (not mine) that demonstrates some usage of UTF-8 file reading and wide character handling in C.

#include <stdio.h>
#include <wchar.h>

int main(void)
{
    /* ",ccs=UTF-8" is a glibc/MSVC extension: it opens the stream in
       wide-character mode with UTF-8 as the external encoding. */
    FILE *f = fopen("data.txt", "r,ccs=UTF-8");
    if (!f)
        return 1;

    /* fgetwc() decodes one code point per call. */
    for (wint_t c; (c = fgetwc(f)) != WEOF;)
        printf("%04X\n", (unsigned)c);

    fclose(f);
    return 0;
}

Links

  1. libiconv
  2. Locale data in C/GNU libc
  3. Some handy info
  4. Another good Unicode/UTF-8 in C resource
answered Sep 23 '22 by Aiden Bell

Make sure you're not accidentally dropping any bytes; some UTF-8 characters are more than one byte in length (that's sort of the point), and you need to keep them all.
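To see the mismatch between bytes and characters, here is a self-contained check (the byte values are the standard UTF-8 encodings of Æ, Ø and Å, written out explicitly so the example does not depend on the source file's own encoding):

#include <assert.h>
#include <string.h>

int main(void)
{
    /* "ÆØÅ" as explicit UTF-8 bytes: Æ = C3 86, Ø = C3 98, Å = C3 85. */
    const char *s = "\xC3\x86\xC3\x98\xC3\x85";

    /* Three characters but six bytes: strlen() counts bytes, so any
       code that truncates by byte count can split a character in two. */
    assert(strlen(s) == 6);
    return 0;
}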

It can be useful to print the contents of the buffer as hex, so you can inspect which bytes are actually read:

static void print_buffer(const char *buffer, size_t length)
{
  size_t i;

  /* Cast through unsigned char first: plain char may be signed, and a
     direct cast to unsigned int would sign-extend bytes >= 0x80,
     printing e.g. "ffffffc3" instead of "c3". */
  for (i = 0; i < length; i++)
    printf("%02x ", (unsigned int) (unsigned char) buffer[i]);
  putchar('\n');
}

You can do this after loading a very short file, containing just a few characters.

Also make sure the terminal is set to the proper encoding, so it interprets your characters as UTF-8.
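From the shell you can check which locale the terminal session is using; if LC_CTYPE does not name a UTF-8 locale, the terminal will usually not decode the bytes correctly (the exact locale name varies by system):

# Print the character-classification locale; look for a .UTF-8 suffix,
# e.g. "en_US.UTF-8".
locale | grep LC_CTYPE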

answered Sep 26 '22 by unwind