
What is the best way to store UTF-8 strings in memory in C/C++?

Tags: c++, unicode

The Unicode standard recommends using plain chars for storing UTF-8 encoded strings. Does this work as expected with C++ and the basic std::string, or are there cases in which the UTF-8 encoding can cause problems?

For example, the length of a string may not be identical to its number of bytes; how is this supposed to be handled? From reading the standard, I'm probably fine using a char array for storage, but I'll still need to write my own versions of functions like strlen that work on encoded text. As far as I understand the problem, the standard routines are either ASCII-only or expect wide literals (16 bits or more), which the Unicode standard does not recommend. So far, the best source I have found about encodings is a post on Joel on Software, but it does not explain what we poor C++ developers should use :)

Anteru asked Jan 12 '09 11:01


3 Answers

There's a library called "UTF8-CPP", which lets you store your UTF-8 strings in standard std::string objects and provides additional functions to enumerate and manipulate UTF-8 characters.

I haven't tested it yet, so I don't know what it's worth, but I am considering using it myself.

Carl Seleborg answered Nov 13 '22 00:11


strlen counts the number of non-null chars before the first \0. In UTF-8, that count is a sane number (the number of bytes used), but it is not the number of characters (one UTF-8 character takes 1 to 4 chars). basic_string doesn't rely on a \0 terminator, but it too keeps a byte count.
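To make the byte count versus character count difference concrete, here is a minimal sketch of a code point counter. The name utf8_length is made up for illustration, and the function assumes the input is already valid UTF-8 (it does no validation). It also counts code points, which is not always the same thing as what a user perceives as one character.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper, not part of any standard library.
// Counts Unicode code points in a UTF-8 string by counting every byte
// that is NOT a continuation byte (continuation bytes look like 10xxxxxx).
// Assumption: the input is valid UTF-8; nothing is validated here.
std::size_t utf8_length(const std::string& s) {
    std::size_t count = 0;
    for (unsigned char byte : s) {
        if ((byte & 0xC0) != 0x80) {
            ++count;  // lead byte or plain ASCII byte: starts a new code point
        }
    }
    return count;
}
```

For a string such as "Привет" stored as UTF-8, s.size() reports 12 bytes while utf8_length(s) reports 6 code points.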

strcpy or the basic_string copy ctor copy all bytes without looking too closely.

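Finding a substring works fine because of the way UTF-8 is encoded. The allowed values for the first byte of a character are distinct from those for the second to fourth bytes (the former never start with the bits 10xxxxxx, the latter always do), so a byte-wise match on a valid UTF-8 needle can only begin at a character boundary.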
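As a rough illustration of the point about searching, the sketch below runs a plain byte-wise std::string::find and checks that the hit lands on a character boundary. Both helper names are made up, and both strings are assumed to be valid UTF-8.

```cpp
#include <cstddef>
#include <string>

// True if `b` is a UTF-8 continuation byte (bit pattern 10xxxxxx).
bool is_continuation_byte(unsigned char b) {
    return (b & 0xC0) == 0x80;
}

// Hypothetical helper: a plain byte-wise search.
// Returns the BYTE offset of the first match, or std::string::npos.
// Assumption: haystack and needle are both valid UTF-8.
std::size_t find_utf8(const std::string& haystack, const std::string& needle) {
    return haystack.find(needle);
}
```

Note that the returned position is a byte offset, not a character index. Searching UTF-8 encoded "Привет" for "в" gives 6, because each of the three preceding Cyrillic letters takes two bytes.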

Taking a substring is tricky: how do you specify the position? If the begin and end were found by searching for ASCII text markers (e.g. [ and ]), then there's no problem. You'd just get the bytes in the middle, which are a valid UTF-8 string too. You can't hardcode positions, or even relative offsets, though. Even a relative offset of +1 character can be hard: how many bytes is that? You will end up writing a function like SkipOneChar.
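A function along the lines of SkipOneChar could look roughly like the sketch below. The name skip_one_char and its signature are my own choice, and it assumes valid UTF-8 with pos already sitting on a character boundary.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper in the spirit of "SkipOneChar".
// Returns the byte offset of the character that follows the one starting
// at byte offset `pos`, or s.size() if there is no further character.
// Assumptions: `s` is valid UTF-8 and `pos` is on a character boundary.
std::size_t skip_one_char(const std::string& s, std::size_t pos) {
    if (pos >= s.size()) {
        return s.size();
    }
    ++pos;  // step over the lead byte (or the single ASCII byte)
    // Step over any continuation bytes (10xxxxxx) of the same character.
    while (pos < s.size() &&
           (static_cast<unsigned char>(s[pos]) & 0xC0) == 0x80) {
        ++pos;
    }
    return pos;
}
```

Taking the substring that covers exactly one character starting at pos then becomes s.substr(pos, skip_one_char(s, pos) - pos).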

MSalters answered Nov 13 '22 00:11


An example using the ICU library (available for C, C++, and Java):

#include <iostream>
#include <unicode/unistr.h> // using ICU library

int main(int argc, char *argv[]) {
    // constructing a Unicode string (icu:: qualification is needed with newer ICU releases)
    icu::UnicodeString ustr1("Привет"); // using platform's default codepage
    // length() returns UTF-16 code units; for this text that equals the character count, 6
    int ulen1=ustr1.length();
    // extracting encoded characters from a string
    int const bufsize=25;
    char encoded[bufsize];
    ustr1.extract(0,ulen1,encoded,bufsize,"UTF-8"); // forced UTF-8 encoding
    // printing the result
    std::cout << "Length of " << encoded << " is " << ulen1 << "\n";
    return 0;
}

Build it like this (libraries go after the source file so the linker can resolve them):

$ g++ -o icu-example icu-example.cc -licuuc

Running it:

$ ./icu-example
Length of Привет is 6

Works for me on Linux with GCC 4.3.2 and libicu 3.8.1. Please note that it prints in UTF-8 no matter what the system locale is. You won't see it correctly if yours is not UTF-8.

sastanin answered Nov 12 '22 23:11