ANSI C UTF-8 problem

Tags: c, string, utf-8

I'm developing a platform-independent library in ANSI C (not C++, and without non-standard libraries such as the MS CRT or glibc).

After some searching, I found that one of the best ways to handle internationalization in ANSI C is to use the UTF-8 encoding.

In UTF-8:

  • strlen(s) always counts bytes, not characters.
  • mbstowcs(NULL, s, 0) returns the number of characters (codepoints).
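
For example, this small test shows the difference (a minimal sketch; the locale name below is a guess and varies by system):

#include <locale.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(void)
{
  /* "héllo": 5 characters, but the 'é' takes 2 bytes in UTF-8. */
  const char* s = "h\xC3\xA9llo";

  /* mbstowcs() needs a UTF-8 locale; this name is system-dependent. */
  setlocale(LC_CTYPE, "en_US.UTF-8");

  printf("bytes:      %lu\n", (unsigned long)strlen(s));            /* 6 */
  printf("characters: %lu\n", (unsigned long)mbstowcs(NULL, s, 0)); /* 5 */
  return 0;
}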

But I run into problems when I want random access to the elements (characters) of a UTF-8 string.

In ASCII encoding:

char get_char(char* ascii_str, int n)
{
  // It is very FAST: every character is exactly one byte.
  return ascii_str[n];
}

In UTF-16/32 encoding:

wchar_t get_char(wchar_t* wstr, int n)
{
  // It is very FAST.
  return wstr[n];
}

And here is my problem with the UTF-8 encoding:

// What should the return type be?
// A single UTF-8 character can be 8, 16, 24, or 32 bits long.
/*?*/ get_char(char* utf8str, int n)
{
  // I can find the Nth character by scanning the string in a loop,
  // but that is too slow. What is the best way?
}

Thanks.

asked Jun 29 '11 by Amir Saniyan




2 Answers

Perhaps you're thinking about this a bit wrongly. UTF-8 is an encoding, which is useful for serializing data, e.g. writing it to a file or the network. It is a very non-trivial, variable-width encoding, though, and a string of raw Unicode codepoints can end up as any number of encoded bytes.

What you should probably do, if you want to handle text (given your description), is to store raw, fixed-width strings internally. If you're going for Unicode (which you should), then you need 21 bits per codepoint, so the nearest integral type is uint32_t. In short, store all your strings internally as arrays of integers. Then you can random-access each codepoint.
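
To make that concrete, here is a minimal decoding sketch (the name utf8_decode is mine, not a library function; error handling is reduced to skipping invalid lead bytes and substituting U+FFFD for truncated sequences):

#include <stdint.h>
#include <stdlib.h>

uint32_t* utf8_decode(const char* s, size_t* out_len)
{
  const unsigned char* p = (const unsigned char*)s;
  size_t cap = 16, len = 0;
  uint32_t* out = malloc(cap * sizeof *out);
  if (!out) return NULL;

  while (*p) {
    uint32_t cp;
    int extra;

    if (*p < 0x80)                { cp = *p;        extra = 0; } /* ASCII   */
    else if ((*p & 0xE0) == 0xC0) { cp = *p & 0x1F; extra = 1; } /* 2 bytes */
    else if ((*p & 0xF0) == 0xE0) { cp = *p & 0x0F; extra = 2; } /* 3 bytes */
    else if ((*p & 0xF8) == 0xF0) { cp = *p & 0x07; extra = 3; } /* 4 bytes */
    else { p++; continue; }                   /* invalid lead byte: skip it */

    p++;
    while (extra-- > 0) {
      if ((*p & 0xC0) != 0x80) { cp = 0xFFFD; break; } /* truncated sequence */
      cp = (cp << 6) | (*p & 0x3F);
      p++;
    }

    if (len + 2 > cap) {          /* grow: room for this codepoint + the 0 */
      uint32_t* tmp = realloc(out, (cap *= 2) * sizeof *out);
      if (!tmp) { free(out); return NULL; }
      out = tmp;
    }
    out[len++] = cp;
  }
  out[len] = 0;                   /* zero-terminated, like a C string */
  if (out_len) *out_len = len;
  return out;
}

With strings stored this way, get_char is just codepoints[n] -- the same O(1) access as the ASCII version in the question.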

Only encode to UTF-8 when you are writing to a file or console, and decode from UTF-8 when reading.
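
The encoding direction is a simple switch on the codepoint's range. A sketch (again the function name is illustrative, and it does not reject surrogate codepoints):

#include <stddef.h>
#include <stdint.h>

/* Writes cp into buf (which must have room for 4 bytes) and returns
   the number of bytes used, or 0 if cp is outside the Unicode range. */
size_t utf8_encode(uint32_t cp, char* buf)
{
  if (cp < 0x80) {                         /* 0xxxxxxx */
    buf[0] = (char)cp;
    return 1;
  } else if (cp < 0x800) {                 /* 110xxxxx 10xxxxxx */
    buf[0] = (char)(0xC0 | (cp >> 6));
    buf[1] = (char)(0x80 | (cp & 0x3F));
    return 2;
  } else if (cp < 0x10000) {               /* 1110xxxx 10xxxxxx 10xxxxxx */
    buf[0] = (char)(0xE0 | (cp >> 12));
    buf[1] = (char)(0x80 | ((cp >> 6) & 0x3F));
    buf[2] = (char)(0x80 | (cp & 0x3F));
    return 3;
  } else if (cp < 0x110000) {              /* 11110xxx then three 10xxxxxx */
    buf[0] = (char)(0xF0 | (cp >> 18));
    buf[1] = (char)(0x80 | ((cp >> 12) & 0x3F));
    buf[2] = (char)(0x80 | ((cp >> 6) & 0x3F));
    buf[3] = (char)(0x80 | (cp & 0x3F));
    return 4;
  }
  return 0;
}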

By the way, a Unicode codepoint is still a long way from a character. The concept of a character is just far too high-level to have a simple general mechanism. (E.g. "a" + "accent grave" -- two codepoints, how many characters?)
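
For instance, in the uint32_t representation suggested above, this array holds two codepoints that display as the single character "à":

#include <stdint.h>

/* U+0061 LATIN SMALL LETTER A + U+0300 COMBINING GRAVE ACCENT */
static const uint32_t a_grave[] = { 0x0061, 0x0300, 0 };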

answered Oct 25 '22 by Kerrek SB


You simply can't random-access a UTF-8 string in constant time. If you do need a lot of such queries, you can build an index for the UTF-8 string, or convert it to UTF-32 up front. UTF-32 is a better in-memory representation, while UTF-8 is good on disk.
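
To sketch the indexing idea (the names and the checkpoint spacing are arbitrary choices here): record the byte offset of every STEP-th codepoint once, then a lookup only walks forward from the nearest checkpoint instead of from the start of the string.

#include <stddef.h>
#include <stdlib.h>

#define STEP 64  /* checkpoint spacing; purely a tuning choice */

/* A UTF-8 continuation byte looks like 10xxxxxx. */
static int is_cont(unsigned char c) { return (c & 0xC0) == 0x80; }

/* Builds the checkpoint table for a well-formed UTF-8 string. Returns
   the number of checkpoints, or 0 with *offsets_out = NULL on failure. */
size_t build_index(const char* s, size_t** offsets_out)
{
  size_t n_cp = 0, n_idx = 0, cap = 16, i;
  size_t* offsets = malloc(cap * sizeof *offsets);

  *offsets_out = NULL;
  if (!offsets) return 0;

  for (i = 0; s[i]; i++) {
    if (is_cont((unsigned char)s[i])) continue;  /* not a codepoint start */
    if (n_cp % STEP == 0) {                      /* checkpoint this one   */
      if (n_idx == cap) {
        size_t* tmp = realloc(offsets, (cap *= 2) * sizeof *offsets);
        if (!tmp) { free(offsets); return 0; }
        offsets = tmp;
      }
      offsets[n_idx++] = i;   /* byte offset of codepoint number n_cp */
    }
    n_cp++;
  }
  *offsets_out = offsets;
  return n_idx;
}

/* Byte offset of the n-th codepoint: jump to the nearest checkpoint,
   then walk forward at most STEP-1 codepoints. Assumes n is in range. */
size_t seek_codepoint(const char* s, const size_t* offsets, size_t n)
{
  size_t i = offsets[n / STEP];
  size_t remaining = n % STEP;
  while (remaining > 0) {
    i++;
    if (!is_cont((unsigned char)s[i])) remaining--;
  }
  return i;
}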

By the way, the code you listed for UTF-16 is not correct either: you need to take care of surrogate pairs.
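
That is, units in the range 0xD800-0xDBFF are only half of a codepoint and must be combined with the low surrogate that follows. A minimal sketch, assuming 16-bit code units (as wchar_t is on Windows):

#include <stdint.h>

/* Reads one codepoint from a UTF-16 stream and advances the cursor.
   An unpaired high surrogate decodes to U+FFFD (replacement character). */
uint32_t utf16_next(const uint16_t** units)
{
  uint16_t hi = *(*units)++;
  if (hi >= 0xD800 && hi <= 0xDBFF) {      /* high (leading) surrogate */
    uint16_t lo = **units;
    if (lo >= 0xDC00 && lo <= 0xDFFF) {    /* low (trailing) surrogate */
      (*units)++;
      return 0x10000 + (((uint32_t)(hi - 0xD800) << 10) | (lo - 0xDC00));
    }
    return 0xFFFD;
  }
  return hi;                               /* BMP codepoint: one unit  */
}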

answered Oct 25 '22 by Todd Li