 

std::string and UTF-8 encoded unicode

If I understand correctly, it is possible to use both string and wstring to store UTF-8 text.

  • With char, ASCII characters take a single byte, some Chinese characters take 3 or 4, etc., which means that str[3] doesn't necessarily point to the 4th character.

  • With wchar_t, same thing, but the minimum number of bytes used per character is always 2 (instead of 1 for char), and a 3- or 4-byte-wide character will take 2 wchar_t.

Right?

So, what if I want to use string::find_first_of() or string::compare(), etc. with such a weirdly encoded string? Will it work? Does the string class handle the fact that characters have a variable size? Or should I only use them as dumb, feature-less byte arrays, in which case I'd rather go for a wchar_t[] buffer?

If std::string doesn't handle that, second question: are there libraries providing string classes that can handle the UTF-8 encoding, so that str[3] actually points to the 4th character (which would be a byte array of length 1 to 4)?

Virus721 asked Sep 07 '13 09:09





2 Answers

You are talking about Unicode. Unicode assigns every character a code point in the range U+0000 to U+10FFFF, so a fixed-width representation (UTF-32) needs 32 bits per character. Since that wastes memory, there are more compact encodings. UTF-8 is one such encoding: it uses bytes as units and maps each Unicode character to 1, 2, 3 or 4 bytes. UTF-16 is another: it uses 16-bit words as units and maps each Unicode character to 1 or 2 words (2 or 4 bytes). You can use either encoding with either string (char) or wstring (wchar_t). UTF-8 tends to be more compact for English text and numbers.
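To make the 1-to-4-byte mapping concrete, here is a minimal sketch of a UTF-8 encoder. The function name encode_utf8 is made up for this illustration (it is not part of the Standard Library), and the code assumes it is given a valid code point.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper (not a standard function): encode one Unicode
// code point as a UTF-8 byte sequence stored in a std::string.
// Assumes cp is a valid scalar value: at most U+10FFFF and not a surrogate.
std::string encode_utf8(char32_t cp) {
    assert(cp <= 0x10FFFF && !(cp >= 0xD800 && cp <= 0xDFFF));
    std::string out;
    if (cp < 0x80) {                       // 1 byte:  0xxxxxxx
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {               // 2 bytes: 110xxxxx 10xxxxxx
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {             // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {                               // 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}
```

With this sketch, encode_utf8(0x41) ('A') yields 1 byte, encode_utf8(0xE9) ('é') yields 2, encode_utf8(0x4E2D) ('中') yields 3, and encode_utf8(0x1F600) (an emoji) yields 4.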

Some things will work regardless of the encoding and type used (compare, for example). However, every function that needs to understand individual characters will be broken: the 5th character is not always the 5th entry in the underlying array. It might look like it's working with certain examples, but it will eventually break. string::compare will work, but do not expect to get alphabetical ordering; that is language dependent. string::find_first_of will work for some inputs but not all: it treats its argument as a set of individual code units, so a character that spans several units is matched piece by piece. Long strings will likely appear to work just because they are long, while shorter ones can get confused by character alignment and generate bugs that are very hard to find.
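As an illustration of these pitfalls, here is a small sketch using a UTF-8 string stored in std::string. The helper names count_code_points and utf8_string_pitfalls are invented for this example, and the input is assumed to be well-formed UTF-8.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: count code points by counting every byte that is
// not a continuation byte (10xxxxxx). Assumes s is well-formed UTF-8.
std::size_t count_code_points(const std::string& s) {
    std::size_t n = 0;
    for (unsigned char c : s) {
        if ((c & 0xC0) != 0x80) ++n;
    }
    return n;
}

// Each assert documents one behavior of std::string on UTF-8 data.
void utf8_string_pitfalls() {
    // "naïve": the 'ï' (U+00EF) is encoded as the two bytes C3 AF.
    const std::string s = "na" "\xC3\xAF" "ve";
    // "é" (U+00E9) is encoded as C3 A9; it does not occur in s.
    const std::string e_acute = "\xC3\xA9";

    assert(s.size() == 6u);                         // size() counts bytes
    assert(count_code_points(s) == 5u);             // but there are 5 code points
    assert(s.find("ve") == 4u);                     // result is a byte offset
    assert(s.find(e_acute) == std::string::npos);   // substring search is correct
    assert(s.find_first_of(e_acute) == 2u);         // false hit on the shared byte C3
    assert(std::string("Zebra") < std::string("apple"));  // byte order, not alphabetical
}
```

The find_first_of line is the kind of bug described above: 'é' is not in the string, yet the call reports a match because 'é' and 'ï' share their first byte.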

The best approach is to find a library that handles this for you and ignore the underlying type (unless you have strong reasons to pick one or the other).
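To give a rough idea of the bookkeeping such a library takes off your hands, here is a hand-rolled sketch of character-based indexing over a UTF-8 std::string. The names sequence_length, code_point_at and indexing_example are hypothetical, the input is assumed to be well-formed UTF-8, and a real library would also validate the data.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: length in bytes of the UTF-8 sequence that starts
// with the given lead byte. Invalid lead bytes are treated as length 1.
std::size_t sequence_length(unsigned char lead) {
    if (lead < 0x80) return 1;            // 0xxxxxxx
    if ((lead >> 5) == 0x06) return 2;    // 110xxxxx
    if ((lead >> 4) == 0x0E) return 3;    // 1110xxxx
    if ((lead >> 3) == 0x1E) return 4;    // 11110xxx
    return 1;
}

// Hypothetical helper: return the bytes of the n-th code point (0-based),
// or an empty string if s has fewer than n + 1 code points.
// This walks the string from the start, so each lookup is O(length).
std::string code_point_at(const std::string& s, std::size_t n) {
    std::size_t i = 0;
    while (i < s.size()) {
        const std::size_t len = sequence_length(static_cast<unsigned char>(s[i]));
        if (n == 0) return s.substr(i, len);
        --n;
        i += len;
    }
    return std::string();
}

void indexing_example() {
    // 'a', 'é' (2 bytes), '中' (3 bytes), 'z'
    const std::string s = "a" "\xC3\xA9" "\xE4\xB8\xAD" "z";
    assert(s.size() == 7u);                          // 7 bytes in total
    assert(code_point_at(s, 2) == "\xE4\xB8\xAD");   // 3rd character, 3 bytes
    assert(code_point_at(s, 3) == "z");              // 4th character, not s[3]
}
```

Note that indexing by character is linear in the string length here, which is one reason libraries tend to expose iterators over code points instead of an operator[].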

Sorin answered Oct 01 '22 09:10



You can't handle Unicode with std::string or any other tool from the Standard Library. Use an external library such as http://utfcpp.sourceforge.net/

jimvonmoon answered Oct 01 '22 09:10
