Standard way in C11 and C++11 to convert UTF-8?

C11 and C++11 both introduce the <uchar.h>/<cuchar> header, which defines char16_t and char32_t as character types at least 16 and 32 bits wide, add the literal syntax u"" and U"" for writing strings with these character types, and provide the macros __STDC_UTF_16__ and __STDC_UTF_32__, which tell you whether or not those types hold UTF-16 and UTF-32 code units. This helps remove the ambiguity of wchar_t, which is 16 bits wide on some platforms, where it generally holds UTF-16 code units, and 32 bits wide on others, where it generally holds UTF-32 code units. Assuming those macros are defined, you can now write portable, unambiguous code that refers to UTF-16 and UTF-32. __STDC_ISO_10646__ can also be used as a proxy to determine whether wchar_t is capable of holding UTF-32 values; if it can't, you can't necessarily assume that it holds UTF-16, but that's probably a close enough approximation to be portable.

They also add the functions mbrtoc16, mbrtoc32, c16rtomb, and c32rtomb for converting between multibyte characters and these types. Between these and the existing mbstowcs family of functions, it's possible to translate portably between UTF-16, UTF-32, the platform multibyte character set, and the platform wide character set. The translation is not necessarily lossless unless the platform-defined multibyte and wide character sets are UTFs; in particular, it seems like these functions will be fairly useless on Windows, where the locale-defined multibyte encoding is not allowed to use more than two bytes per character.
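As a rough sketch of how mbrtoc32 is meant to be driven (written in C++ via <cuchar>; the helper name multibyte_to_c32 is made up for illustration), the loop below decodes a string in the current locale's multibyte encoding into char32_t units. Whether the input is interpreted as UTF-8 depends entirely on the locale, and the output is only guaranteed to be UTF-32 when __STDC_UTF_32__ is defined.

```cpp
#include <cassert>
#include <cstddef>
#include <cuchar>
#include <cwchar>
#include <string>

// Hypothetical helper: decodes `in` (current locale's multibyte encoding)
// into char32_t units. Returns false on an invalid or truncated sequence.
bool multibyte_to_c32(const std::string& in, std::u32string& out) {
    std::mbstate_t state{};            // initial conversion state
    const char* p = in.data();
    std::size_t remaining = in.size();
    out.clear();
    while (remaining > 0) {
        char32_t c = 0;
        const std::size_t n = std::mbrtoc32(&c, p, remaining, &state);
        if (n == static_cast<std::size_t>(-1) ||   // encoding error
            n == static_cast<std::size_t>(-2)) {   // incomplete sequence
            return false;
        }
        if (n == static_cast<std::size_t>(-3)) {   // output from earlier input
            out.push_back(c);
            continue;
        }
        // A return of 0 means a null character was decoded; this sketch
        // assumes it occupied exactly one byte.
        const std::size_t used = (n == 0) ? 1 : n;
        assert(used <= remaining);
        out.push_back(c);
        p += used;
        remaining -= used;
    }
    return true;
}
```

In the default "C" locale this only round-trips plain ASCII reliably, which is exactly the portability problem described above.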

Furthermore, they added the u8"" syntax for writing literal UTF-8 encoded strings. As UTF-8 is an encoding that is compatible with most functions that deal in char * and std::string, this is one of the most useful new additions.
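To make the three literal forms concrete, here is a small C++11 sketch that counts code units for a single code point outside the Basic Multilingual Plane. The helper code_units and the constant kUcharFunctionsUseUtf are names invented for this example. C++11 itself specifies the encodings of u8"", u"" and U"" literals; the macro check only matters for the <uchar.h>/<cuchar> conversion functions.

```cpp
#include <cstddef>

// Hypothetical helper: number of code units in a string literal,
// excluding the terminating null.
template <typename CharT, std::size_t N>
constexpr std::size_t code_units(const CharT (&)[N]) {
    return N - 1;
}

// U+1F600 needs four UTF-8 code units, a UTF-16 surrogate pair,
// and a single UTF-32 code unit.
static_assert(code_units(u8"\U0001F600") == 4, "UTF-8: four code units");
static_assert(code_units(u"\U0001F600") == 2, "UTF-16: surrogate pair");
static_assert(code_units(U"\U0001F600") == 1, "UTF-32: one code unit");

// The macros report whether mbrtoc16/mbrtoc32 and friends deal in
// UTF-16/UTF-32; if they are absent, this sketch assumes nothing.
#if defined(__STDC_UTF_16__) && defined(__STDC_UTF_32__)
constexpr bool kUcharFunctionsUseUtf = true;
#else
constexpr bool kUcharFunctionsUseUtf = false;
#endif
```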

However, they seem to have failed to add any way to portably convert between UTF-8, UTF-16, and UTF-32. The mbrtoc16 and related functions convert between the implementation-defined multibyte encoding and UTF-16 or UTF-32, but you can't depend on that multibyte encoding being UTF-8. On Unix-like platforms it depends on the locale. Many of them use a UTF-8 locale by default, and even where it's not the default you can at least set the locale to a UTF-8 one, so that you know "multibyte" means UTF-8. On Windows, however, you explicitly can't use UTF-8, or any other encoding that requires more than two bytes per character, for the locale.

Am I just missing something, or is the UTF-8 string type not accompanied by any way to convert it to the other types of strings: platform-defined multibyte, platform-defined wide char, UTF-16, or UTF-32? Is there no way to even tell if your system multibyte encoding is UTF-8? Is there any reason why this support wasn't included (specifically, I'm looking for actual written justification or discussion by the C or C++ standards committees, not just speculation)? Is there any work being done to improve this situation; is it likely to improve in the future?

Or, is the current best solution, if you want to support UTF-8 in a portable fashion, to write your own implementation, pull in a library dependency, or use platform-specific functions like iconv and MultiByteToWideChar?
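For a sense of what the "write your own implementation" option involves, here is a minimal sketch of a strict UTF-8 to UTF-32 decoder in C++11. The function name utf8_to_utf32 is made up for this example, and it is not taken from any library; it rejects overlong forms, surrogates, and values above U+10FFFF, but has not been tuned or hardened beyond that.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: decodes UTF-8 into UTF-32 code points.
// Returns false on malformed input.
bool utf8_to_utf32(const std::string& in, std::u32string& out) {
    // Smallest code point that legitimately needs 1, 2, 3, or 4 bytes.
    static const char32_t kMin[] = {0, 0, 0x80, 0x800, 0x10000};
    out.clear();
    std::size_t i = 0;
    while (i < in.size()) {
        const unsigned char b0 = static_cast<unsigned char>(in[i]);
        char32_t cp = 0;
        std::size_t len = 0;
        if (b0 < 0x80)                { cp = b0;        len = 1; }
        else if ((b0 & 0xE0) == 0xC0) { cp = b0 & 0x1F; len = 2; }
        else if ((b0 & 0xF0) == 0xE0) { cp = b0 & 0x0F; len = 3; }
        else if ((b0 & 0xF8) == 0xF0) { cp = b0 & 0x07; len = 4; }
        else return false;                       // stray continuation or bad lead byte
        assert(len >= 1 && len <= 4);
        if (in.size() - i < len) return false;   // truncated sequence
        for (std::size_t k = 1; k < len; ++k) {
            const unsigned char b = static_cast<unsigned char>(in[i + k]);
            if ((b & 0xC0) != 0x80) return false;  // not a continuation byte
            cp = (cp << 6) | (b & 0x3F);
        }
        if (cp < kMin[len]) return false;                  // overlong encoding
        if (cp >= 0xD800 && cp <= 0xDFFF) return false;    // surrogate
        if (cp > 0x10FFFF) return false;                   // beyond Unicode range
        out.push_back(cp);
        i += len;
    }
    return true;
}
```

The same structure translates to C11 with char32_t from <uchar.h> and a caller-supplied output buffer.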

759

asked Oct 29 '13 03:10

Brian Campbell

1 Answer

Sounds like you're looking for the std::codecvt type. See the example on that page for usage.
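As a hedged illustration of what using std::codecvt can look like (this is not the example the answer links to), the sketch below relies on the specialization std::codecvt<char32_t, char, std::mbstate_t>, which C++11 specifies as converting between UTF-32 and UTF-8 regardless of locale. The names Utf8Utf32Facet and utf8_to_utf32_codecvt are invented here. Note that this is C++ only, so it does not help in C11, and that this specialization was deprecated in C++20.

```cpp
#include <cassert>
#include <cstddef>
#include <cwchar>
#include <locale>
#include <string>

// std::codecvt has a protected destructor; this thin wrapper (hypothetical
// name) lets the facet be used as a plain local object.
struct Utf8Utf32Facet : std::codecvt<char32_t, char, std::mbstate_t> {
    ~Utf8Utf32Facet() {}
};

// Hypothetical helper: converts UTF-8 to UTF-32 via the facet's in().
// Returns false if the input is malformed or ends mid-sequence.
bool utf8_to_utf32_codecvt(const std::string& in, std::u32string& out) {
    out.clear();
    if (in.empty()) return true;
    Utf8Utf32Facet facet;
    std::mbstate_t state{};
    out.assign(in.size(), U'\0');   // never more code points than bytes
    const char* from_next = nullptr;
    char32_t* to_next = nullptr;
    const std::codecvt_base::result r =
        facet.in(state, in.data(), in.data() + in.size(), from_next,
                 &out[0], &out[0] + out.size(), to_next);
    if (r != std::codecvt_base::ok) {
        out.clear();
        return false;
    }
    assert(to_next >= &out[0]);
    out.resize(static_cast<std::size_t>(to_next - &out[0]));
    return true;
}
```

Calling utf8_to_utf32_codecvt with the three bytes E2 82 AC should yield the single code point U+20AC.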

124

answered Oct 11 '22 15:10

MikeP

Tags:

character-encoding

c++11

unicode

utf-8

c11
