std::string and UTF-8 encoded unicode

Tags: c++, string, unicode, utf-8

If I understand correctly, it is possible to use both string and wstring to store UTF-8 text.

With char, ASCII characters take a single byte, some Chinese characters take 3 or 4, etc. This means that str[3] doesn't necessarily point to the 4th character.
With wchar_t it is the same thing, but the minimum number of bytes used per character is always 2 (instead of 1 for char), and a 3 or 4 byte wide character will take 2 wchar_t.

Right?

So, what if I want to use string::find_first_of() or string::compare(), etc. with such a variably encoded string? Will it work? Does the string class handle the fact that characters have a variable size? Or should I only use them as dumb, feature-less byte arrays, in which case I'd rather go for a wchar_t[] buffer.

If std::string doesn't handle that, second question: are there libraries providing string classes that could handle UTF-8 encoding, so that str[3] actually points to the 4th character (which would be a byte array of length 1 to 4)?


asked Sep 07 '13 09:09

Virus721

2 Answers

You are talking about Unicode. Unicode assigns every character a code point, and a code point fits in 32 bits (that fixed-width form is UTF-32). However, since that wastes memory, there are more compact encodings. UTF-8 is one such encoding: it works in byte units and maps each Unicode code point to 1, 2, 3 or 4 bytes. UTF-16 is another: it works in 16-bit words and maps each code point to 1 or 2 words (2 or 4 bytes). You can store either encoding in either string or wstring. UTF-8 tends to be more compact for English text and numbers.
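To make the byte-versus-character distinction concrete, here is a minimal sketch that counts code points in a std::string, assuming the string holds valid UTF-8. The names count_code_points and sample_text are made up for this illustration and are not part of any library.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Counts Unicode code points in a string that is ASSUMED to hold valid UTF-8.
// Continuation bytes have the bit pattern 10xxxxxx, so every byte that does
// not match that pattern starts a new code point.
std::size_t count_code_points(const std::string& utf8) {
    std::size_t count = 0;
    for (unsigned char byte : utf8) {
        if ((byte & 0xC0) != 0x80) {
            ++count;
        }
    }
    return count;
}

// Hypothetical sample: "a", U+00E9, U+20AC and U+1F600, which take 1, 2, 3
// and 4 bytes in UTF-8. Written as explicit bytes so that the encoding of
// the source file does not matter.
std::string sample_text() {
    return "a" "\xC3\xA9" "\xE2\x82\xAC" "\xF0\x9F\x98\x80";
}
```

With this sample, sample_text().size() reports 10 (bytes) while count_code_points(sample_text()) reports 4, which is why an index into the string is a byte offset rather than a character position.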

Some things will work regardless of the encoding and type used (compare, for example). However, every function that needs to understand where one character ends and the next begins will be broken, because the 5th character is not always the 5th entry in the underlying array. It might look like it's working with certain examples, but it will eventually break. string::compare will work, but do not expect alphabetical ordering from it: it orders by the values of the code units, whereas alphabetical order is language dependent. string::find_first_of will work for some inputs but not all: it is fine when every character in the search set is ASCII, but a multi-byte character in the set is treated as separate bytes, so the match can land in the middle of a different character. Bugs like this tend to stay hidden with typical test data and are very hard to find.
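The find_first_of pitfall can be sketched as follows, assuming UTF-8 stored in std::string. The helper names and the two constants are invented for this example; only find_first_of and find are real std::string members.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// U+00E9 (e with acute) is the bytes C3 A9; U+00E8 (e with grave) is C3 A8.
// Both constants are illustrative names, written as explicit bytes.
const std::string kEAcute = "\xC3\xA9";
const std::string kEGrave = "\xC3\xA8";

// Safe: ASCII bytes never occur inside a multi-byte UTF-8 sequence,
// so a hit on ',' is always a real comma.
std::size_t find_ascii_comma(const std::string& utf8) {
    return utf8.find_first_of(",");
}

// Unsafe: find_first_of sees the set as the two independent bytes C3 and A9,
// so it also stops at the first byte of ANY character that starts with C3.
std::size_t find_e_acute_bytewise(const std::string& utf8) {
    return utf8.find_first_of(kEAcute);
}

// Better for a single character: find() looks for the whole byte sequence.
std::size_t find_e_acute_substring(const std::string& utf8) {
    return utf8.find(kEAcute);
}
```

For a text made of "tr" + kEGrave + "s, caf" + kEAcute, the byte-wise search returns offset 2, which is the first byte of the grave-accented letter and therefore a false match, while the substring search returns offset 10, where the acute-accented letter really starts.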

The best thing is to find a library that handles it for you and to ignore the underlying type (unless you have strong reasons to pick one or the other).
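As a rough idea of what such a library does for "give me the n-th character", here is a hand-written sketch, not taken from any particular library. It assumes valid UTF-8 input, and the names code_point_offset and code_point_at are hypothetical. It also works on code points only; a character as the user sees it can span several code points (combining marks, for instance), which is one more reason to prefer a real library.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// True for UTF-8 continuation bytes (bit pattern 10xxxxxx).
bool is_continuation(char c) {
    return (static_cast<unsigned char>(c) & 0xC0) == 0x80;
}

// Byte offset of the code point with zero-based index n, or npos if the
// string has fewer than n + 1 code points. Linear time: every lookup has
// to walk the string from the start.
std::size_t code_point_offset(const std::string& utf8, std::size_t n) {
    std::size_t seen = 0;
    for (std::size_t i = 0; i < utf8.size(); ++i) {
        if (!is_continuation(utf8[i])) {
            if (seen == n) {
                return i;
            }
            ++seen;
        }
    }
    return std::string::npos;
}

// The bytes (1 to 4 of them) of the n-th code point, or an empty string
// if n is out of range.
std::string code_point_at(const std::string& utf8, std::size_t n) {
    const std::size_t begin = code_point_offset(utf8, n);
    if (begin == std::string::npos) {
        return std::string();
    }
    std::size_t end = begin + 1;
    while (end < utf8.size() && is_continuation(utf8[end])) {
        ++end;
    }
    return utf8.substr(begin, end - begin);
}
```

Note the cost: unlike str[3], this lookup is O(n), which is why UTF-8 libraries usually offer iteration over code points rather than random access.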

answered Oct 01 '22 09:10

Sorin

You can't properly handle Unicode with std::string or any other tool from the Standard Library. Use an external library such as: http://utfcpp.sourceforge.net/

answered Oct 01 '22 09:10

jimvonmoon
