ANSI C UTF-8 problem

Tags: c, string, utf-8

I'm developing a platform-independent library in ANSI C (not C++, and without non-standard libraries such as the MS CRT or glibc).

After some searching, I found that one of the best ways to handle internationalization in ANSI C is to use the UTF-8 encoding.

In UTF-8:

  • strlen(s) always counts bytes, not characters.
  • mbstowcs(NULL, s, 0) returns the number of characters (codepoints).
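
For example, this small test shows the difference (a minimal sketch; the locale name below is a guess and varies by system):

#include <locale.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(void)
{
  /* "héllo": 5 characters, but the 'é' takes 2 bytes in UTF-8. */
  const char* s = "h\xC3\xA9llo";

  /* mbstowcs() needs a UTF-8 locale; this name is system-dependent. */
  setlocale(LC_CTYPE, "en_US.UTF-8");

  printf("bytes:      %lu\n", (unsigned long)strlen(s));            /* 6 */
  printf("characters: %lu\n", (unsigned long)mbstowcs(NULL, s, 0)); /* 5 */
  return 0;
}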

But I run into problems when I want random access to the elements (characters) of a UTF-8 string.

In ASCII encoding:

char get_char(char* ascii_str, int n)
{
  // It is very FAST: every character is exactly one byte.
  return ascii_str[n];
}

In UTF-16/32 encoding:

wchar_t get_char(wchar_t* wstr, int n)
{
  // It is very FAST.
  return wstr[n];
}

And here is my problem with the UTF-8 encoding:

// What should the return type be?
// A single UTF-8 character can be 8, 16, 24, or 32 bits long.
/*?*/ get_char(char* utf8str, int n)
{
  // I can find the Nth character by scanning the string in a loop,
  // but that is too slow. What is the best way?
}

Thanks.

asked Jun 29 '11 by Amir Saniyan




2 Answers

Perhaps you're thinking about this a bit wrongly. UTF-8 is an encoding, which is useful for serializing data, e.g. writing it to a file or the network. It is a very non-trivial, variable-width encoding, though, and a string of raw Unicode codepoints can end up as any number of encoded bytes.

What you should probably do, if you want to handle text (given your description), is to store raw, fixed-width strings internally. If you're going for Unicode (which you should), then you need 21 bits per codepoint, so the nearest integral type is uint32_t. In short, store all your strings internally as arrays of integers. Then you can random-access each codepoint.
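
To make that concrete, here is a minimal decoding sketch (the name utf8_decode is mine, not a library function; error handling is reduced to skipping invalid lead bytes and substituting U+FFFD for truncated sequences):

#include <stdint.h>
#include <stdlib.h>

uint32_t* utf8_decode(const char* s, size_t* out_len)
{
  const unsigned char* p = (const unsigned char*)s;
  size_t cap = 16, len = 0;
  uint32_t* out = malloc(cap * sizeof *out);
  if (!out) return NULL;

  while (*p) {
    uint32_t cp;
    int extra;

    if (*p < 0x80)                { cp = *p;        extra = 0; } /* ASCII   */
    else if ((*p & 0xE0) == 0xC0) { cp = *p & 0x1F; extra = 1; } /* 2 bytes */
    else if ((*p & 0xF0) == 0xE0) { cp = *p & 0x0F; extra = 2; } /* 3 bytes */
    else if ((*p & 0xF8) == 0xF0) { cp = *p & 0x07; extra = 3; } /* 4 bytes */
    else { p++; continue; }                   /* invalid lead byte: skip it */

    p++;
    while (extra-- > 0) {
      if ((*p & 0xC0) != 0x80) { cp = 0xFFFD; break; } /* truncated sequence */
      cp = (cp << 6) | (*p & 0x3F);
      p++;
    }

    if (len + 2 > cap) {          /* grow: room for this codepoint + the 0 */
      uint32_t* tmp = realloc(out, (cap *= 2) * sizeof *out);
      if (!tmp) { free(out); return NULL; }
      out = tmp;
    }
    out[len++] = cp;
  }
  out[len] = 0;                   /* zero-terminated, like a C string */
  if (out_len) *out_len = len;
  return out;
}

With strings stored this way, get_char is just codepoints[n] -- the same O(1) access as the ASCII version in the question.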

Only encode to UTF-8 when you are writing to a file or console, and decode from UTF-8 when reading.
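
The encoding direction is a simple switch on the codepoint's range. A sketch (again the function name is illustrative, and it does not reject surrogate codepoints):

#include <stddef.h>
#include <stdint.h>

/* Writes cp into buf (which must have room for 4 bytes) and returns
   the number of bytes used, or 0 if cp is outside the Unicode range. */
size_t utf8_encode(uint32_t cp, char* buf)
{
  if (cp < 0x80) {                         /* 0xxxxxxx */
    buf[0] = (char)cp;
    return 1;
  } else if (cp < 0x800) {                 /* 110xxxxx 10xxxxxx */
    buf[0] = (char)(0xC0 | (cp >> 6));
    buf[1] = (char)(0x80 | (cp & 0x3F));
    return 2;
  } else if (cp < 0x10000) {               /* 1110xxxx 10xxxxxx 10xxxxxx */
    buf[0] = (char)(0xE0 | (cp >> 12));
    buf[1] = (char)(0x80 | ((cp >> 6) & 0x3F));
    buf[2] = (char)(0x80 | (cp & 0x3F));
    return 3;
  } else if (cp < 0x110000) {              /* 11110xxx then three 10xxxxxx */
    buf[0] = (char)(0xF0 | (cp >> 18));
    buf[1] = (char)(0x80 | ((cp >> 12) & 0x3F));
    buf[2] = (char)(0x80 | ((cp >> 6) & 0x3F));
    buf[3] = (char)(0x80 | (cp & 0x3F));
    return 4;
  }
  return 0;
}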

By the way, a Unicode codepoint is still a long way from a character. The concept of a character is just far too high-level to have a simple general mechanism. (E.g. "a" + "accent grave" -- two codepoints, how many characters?)
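
For instance, in the uint32_t representation suggested above, this array holds two codepoints that display as the single character "à":

#include <stdint.h>

/* U+0061 LATIN SMALL LETTER A + U+0300 COMBINING GRAVE ACCENT */
static const uint32_t a_grave[] = { 0x0061, 0x0300, 0 };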

answered Oct 25 '22 by Kerrek SB


You simply can't random-access a UTF-8 string in constant time. If you do need a lot of such queries, you can build an index for the UTF-8 string, or convert it to UTF-32 up front. UTF-32 is a better in-memory representation, while UTF-8 is good on disk.
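
To sketch the indexing idea (the names and the checkpoint spacing are arbitrary choices here): record the byte offset of every STEP-th codepoint once, then a lookup only walks forward from the nearest checkpoint instead of from the start of the string.

#include <stddef.h>
#include <stdlib.h>

#define STEP 64  /* checkpoint spacing; purely a tuning choice */

/* A UTF-8 continuation byte looks like 10xxxxxx. */
static int is_cont(unsigned char c) { return (c & 0xC0) == 0x80; }

/* Builds the checkpoint table for a well-formed UTF-8 string. Returns
   the number of checkpoints, or 0 with *offsets_out = NULL on failure. */
size_t build_index(const char* s, size_t** offsets_out)
{
  size_t n_cp = 0, n_idx = 0, cap = 16, i;
  size_t* offsets = malloc(cap * sizeof *offsets);

  *offsets_out = NULL;
  if (!offsets) return 0;

  for (i = 0; s[i]; i++) {
    if (is_cont((unsigned char)s[i])) continue;  /* not a codepoint start */
    if (n_cp % STEP == 0) {                      /* checkpoint this one   */
      if (n_idx == cap) {
        size_t* tmp = realloc(offsets, (cap *= 2) * sizeof *offsets);
        if (!tmp) { free(offsets); return 0; }
        offsets = tmp;
      }
      offsets[n_idx++] = i;   /* byte offset of codepoint number n_cp */
    }
    n_cp++;
  }
  *offsets_out = offsets;
  return n_idx;
}

/* Byte offset of the n-th codepoint: jump to the nearest checkpoint,
   then walk forward at most STEP-1 codepoints. Assumes n is in range. */
size_t seek_codepoint(const char* s, const size_t* offsets, size_t n)
{
  size_t i = offsets[n / STEP];
  size_t remaining = n % STEP;
  while (remaining > 0) {
    i++;
    if (!is_cont((unsigned char)s[i])) remaining--;
  }
  return i;
}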

By the way, the code you listed for UTF-16 is not correct either: you need to take care of surrogate pairs.
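
That is, units in the range 0xD800-0xDBFF are only half of a codepoint and must be combined with the low surrogate that follows. A minimal sketch, assuming 16-bit code units (as wchar_t is on Windows):

#include <stdint.h>

/* Reads one codepoint from a UTF-16 stream and advances the cursor.
   An unpaired high surrogate decodes to U+FFFD (replacement character). */
uint32_t utf16_next(const uint16_t** units)
{
  uint16_t hi = *(*units)++;
  if (hi >= 0xD800 && hi <= 0xDBFF) {      /* high (leading) surrogate */
    uint16_t lo = **units;
    if (lo >= 0xDC00 && lo <= 0xDFFF) {    /* low (trailing) surrogate */
      (*units)++;
      return 0x10000 + (((uint32_t)(hi - 0xD800) << 10) | (lo - 0xDC00));
    }
    return 0xFFFD;
  }
  return hi;                               /* BMP codepoint: one unit  */
}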

answered Oct 25 '22 by Todd Li