
Unicode to UTF-8 in C++

I searched a lot, but couldn't find anything:

unsigned int unicodeChar = 0x5e9;
unsigned int utf8Char;
uni2utf8(unicodeChar, utf8Char);
assert(utf8Char == 0xd7a9);

Is there a library (preferably boost) that implements something similar to uni2utf8?

Ezra asked Jul 22 '12 19:07



2 Answers

Boost.Locale also has functions for encoding conversions:

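#include <boost/locale.hpp>
#include <cassert>
#include <string>

int main() {
  // A single code point, treated as a one-element UTF-32 sequence.
  unsigned int point = 0x5e9;
  // Convert the range [&point, &point + 1) to a UTF-8 encoded std::string.
  std::string utf8 = boost::locale::conv::utf_to_utf<char>(&point, &point + 1);
  assert(utf8.length() == 2);
  assert(utf8[0] == '\xD7');
  assert(utf8[1] == '\xA9');
}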
Philipp answered Oct 19 '22 05:10


You might want to try the UTF8-CPP library. Encoding a Unicode character with it would look like this:

std::wstring unicodeChar(L"\u05e9");
std::string utf8Char;
encode_utf8(unicodeChar, utf8Char);

std::string is used here just as a container for UTF-8 bytes.
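If pulling in a library is not an option, the encoding itself is small enough to write by hand. Below is a minimal sketch, assuming the input is a single code point held in a 32-bit integer. The name encode_utf8_codepoint is hypothetical and is not part of UTF8-CPP or Boost; like the snippet above, it returns the bytes in a std::string rather than packing them into an unsigned int.

```cpp
#include <cassert>
#include <cstdint>
#include <stdexcept>
#include <string>

// Hypothetical helper (not from Boost or UTF8-CPP): encodes one Unicode
// code point as a UTF-8 byte sequence of 1 to 4 bytes.
std::string encode_utf8_codepoint(std::uint32_t cp) {
    // Surrogates and values above U+10FFFF are not valid scalar values.
    if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) {
        throw std::invalid_argument("not a Unicode scalar value");
    }
    std::string out;
    if (cp < 0x80) {
        // 1 byte: 0xxxxxxx
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        // 2 bytes: 110xxxxx 10xxxxxx
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        // 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}
```

For the code point in the question, encode_utf8_codepoint(0x5e9) should yield the two bytes 0xD7 0xA9, which matches the value the question's assert expects.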

Desmond Hume answered Oct 19 '22 06:10