I've been looking for a way to convert between the Unicode string types and came across this method. Not only do I not completely understand the method (there are no comments) but also the article implies that in future there will be better methods.
If this is the best method, could you please point out what makes it work, and if not I would like to hear suggestions for better methods.
To convert UTF-8 to UTF-16 just call Utf32To16(Utf8To32(str)) and to convert UTF-16 to UTF-8 call Utf32To8(Utf16To32(str)) .
This function is used to convert the numerical value to the wide string i.e. it parses a numerical value of datatypes (int, long long, float, double ) to a wide string. It returns a wide string of data type wstring representing the numerical value passed in the function.
This is an instantiation of the basic_string class template that uses char16_t as the character type, with its default char_traits ad allocator types. In example, the std:string uses one byte (8 bits) while the std::u16string uses two bytes (16bits) per each character of the text string.
mbstowcs()
and wcstombs()
don't necessarily convert to UTF-16 or UTF-32, they convert to wchar_t
and whatever the locale wchar_t
encoding is. All Windows locales uses a two byte wchar_t
and UTF-16 as the encoding, but the other major platforms use a 4-byte wchar_t
with UTF-32 (or even a non-Unicode encoding for some locales). A platform that only supports single-byte encodings could even have a one byte wchar_t
and have the encoding differ by locale. So wchar_t
seems to me to be a bad choice for portability and Unicode. *
Some better options have been introduced in C++11; new specializations of std::codecvt, new codecvt classes, and a new template to make using them for conversions very convienent.
First the new template class for using codecvt is std::wstring_convert. Once you've created an instance of a std::wstring_convert class you can easily convert between strings:
std::wstring_convert<...> convert; // ... filled in with a codecvt to do UTF-8 <-> UTF-16 std::string utf8_string = u8"This string has UTF-8 content"; std::u16string utf16_string = convert.from_bytes(utf8_string); std::string another_utf8_string = convert.to_bytes(utf16_string);
In order to do different conversion you just need different template parameters, one of which is a codecvt facet. Here are some new facets that are easy to use with wstring_convert:
std::codecvt_utf8_utf16<char16_t> // converts between UTF-8 <-> UTF-16 std::codecvt_utf8<char32_t> // converts between UTF-8 <-> UTF-32 std::codecvt_utf8<char16_t> // converts between UTF-8 <-> UCS-2 (warning, not UTF-16! Don't bother using this one)
Examples of using these:
std::wstring_convert<std::codecvt_utf8_utf16<char16_t>,char16_t> convert; std::string a = convert.to_bytes(u"This string has UTF-16 content"); std::u16string b = convert.from_bytes(u8"blah blah blah");
The new std::codecvt specializations are a bit harder to use because they have a protected destructor. To get around that you can define a subclass that has a destructor, or you can use the std::use_facet template function to get an existing codecvt instance. Also, an issue with these specializations is you can't use them in Visual Studio 2010 because template specialization doesn't work with typedef'd types and that compiler defines char16_t and char32_t as typedefs. Here's an example of defining your own subclass of codecvt:
template <class internT, class externT, class stateT> struct codecvt : std::codecvt<internT,externT,stateT> { ~codecvt(){} }; std::wstring_convert<codecvt<char16_t,char,std::mbstate_t>,char16_t> convert16; std::wstring_convert<codecvt<char32_t,char,std::mbstate_t>,char32_t> convert32;
The char16_t specialization converts between UTF-16 and UTF-8. The char32_t specialization, UTF-32 and UTF-8.
Note that these new conversions provided by C++11 don't include any way to convert directly between UTF-32 and UTF-16. Instead you just have to combine two instances of std::wstring_convert.
***** I thought I'd add a note on wchar_t and its purpose, to emphasize why it should not generally be used for Unicode or portable internationalized code. The following is a short version of my answer https://stackoverflow.com/a/11107667/365496
wchar_t is defined such that any locale's char encoding can be converted to wchar_t where every wchar_t represents exactly one codepoint:
Type wchar_t is a distinct type whose values can represent distinct codes for all members of the largest extended character set specified among the supported locales (22.3.1). -- [basic.fundamental] 3.9.1/5
This does not require that wchar_t be large enough to represent any character from all locales simultaneously. That is, the encoding used for wchar_t may differ between locales. Which means that you cannot necessarily convert a string to wchar_t using one locale and then convert back to char using another locale.
Since that seems to be the primary use in practice for wchar_t you might wonder what it's good for if not that.
The original intent and purpose of wchar_t was to make text processing simple by defining it such that it requires a one-to-one mapping from a string's code-units to the text's characters, thus allowing the use of same simple algorithms used with ascii strings to work with other languages.
Unfortunately the requirements on wchar_t assume a one-to-one mapping between characters and codepoints to achieve this. Unicode breaks that assumption, so you can't safely use wchar_t for simple text algorithms either.
This means that portable software cannot use wchar_t either as a common representation for text between locales, or to enable the use of simple text algorithms.
Not much, for portable code anyway. If __STDC_ISO_10646__
is defined then values of wchar_t directly represent Unicode codepoints with the same values in all locales. That makes it safe to do the inter-locale conversions mentioned earlier. However you can't rely only on it to decide that you can use wchar_t this way because, while most unix platforms define it, Windows does not even though Windows uses the same wchar_t locale in all locales.
The reason Windows doesn't define __STDC_ISO_10646__
I think is because Windows use UTF-16 as its wchar_t encoding, and because UTF-16 uses surrogate pairs to represent codepoints greater than U+FFFF, which means that UTF-16 doesn't satisfy the requirements for __STDC_ISO_10646__
.
For platform specific code wchar_t may be more useful. It's essentially required on Windows (e.g., some files simply cannot be opened without using wchar_t filenames), though Windows is the only platform where this is true as far as I know (so maybe we can think of wchar_t as 'Windows_char_t').
In hindsight wchar_t is clearly not useful for simplifying text handling, or as storage for locale independent text. Portable code should not attempt to use it for these purposes.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With