
Simplest way to convert unicode codepoint into UTF-8

Tags: c, unicode, utf-8

What's the simplest way to convert a Unicode code point into a UTF-8 byte sequence in C? The only way that springs to mind is using iconv to convert from UTF-32LE to UTF-8, but that seems like overkill.

Lily Ballard asked Oct 27 '08 19:10


People also ask

Is Unicode UTF-8?

UTF-8 is an encoding system for Unicode. It can translate any Unicode character to a matching unique binary string, and can also translate the binary string back to a Unicode character. This is the meaning of “UTF”, or “Unicode Transformation Format.”

How many bytes is a UTF-8 character?

UTF-8 is based on 8-bit code units. Each character is encoded as 1 to 4 bytes. The first 128 Unicode code points are encoded as 1 byte in UTF-8.
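
The 1-to-4-byte layout described above can be sketched in C as follows. This is a minimal illustration, not a vetted library routine: the function name `utf8_encode` and its return convention (0 for input that is not a valid code point) are assumptions made for this example.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (name and signature are assumptions for this sketch).
 * Encodes one Unicode code point into out, which must hold at least 4 bytes.
 * Returns the number of bytes written (1 to 4), or 0 if cp is a surrogate
 * (U+D800 to U+DFFF) or lies above U+10FFFF. */
size_t utf8_encode(uint32_t cp, unsigned char out[4])
{
    if (cp <= 0x7F) {                       /* 0xxxxxxx */
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp <= 0x7FF) {                      /* 110xxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp <= 0xFFFF) {                     /* 1110xxxx 10xxxxxx 10xxxxxx */
        if (cp >= 0xD800 && cp <= 0xDFFF)
            return 0;                       /* surrogates are not encodable */
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {                   /* 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;                               /* beyond the Unicode range */
}
```

For example, calling `utf8_encode(0x20AC, buf)` for the euro sign writes the three bytes E2 82 AC and returns 3.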

How many characters are in UTF-8?

UTF-8 is capable of encoding all 1,112,064 valid character code points in Unicode using one to four one-byte (8-bit) code units.

What are UTF-8 surrogates?

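UTF-8 has no surrogates of its own. Surrogates are the code points U+D800 to U+DFFF, which are reserved so that UTF-16 can represent code points above U+FFFF as pairs of 16-bit code units. They are not characters and don't mean anything by themselves, and a valid UTF-8 stream never encodes them. UTF-8 code units are 8 bits. UTF-8 encodes the ranges U+0000 to U+007F, U+0080 to U+07FF, U+0800 to U+FFFF, and U+10000 to U+10FFFF in one, two, three, and four code units, respectively.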


1 Answer

Unicode conversion is not a simple task. Using iconv doesn't seem like overkill at all to me. Perhaps there is a library version of iconv you can use to avoid making a system() call, if that's what you want to avoid.
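
A library interface does exist on POSIX systems: `iconv_open`, `iconv`, and `iconv_close`, declared in `<iconv.h>`. The following is a rough sketch of the UTF-32LE to UTF-8 conversion mentioned in the question, under some assumptions: the wrapper name `utf8_via_iconv` is made up for this example, the encoding names "UTF-32LE" and "UTF-8" are taken to be supported by the installed iconv, and some platforms need `-liconv` at link time.

```c
#include <iconv.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical wrapper (name and signature are assumptions for this sketch).
 * Converts one code point to UTF-8 through iconv.
 * Returns the number of bytes written to out, or 0 on any failure. */
size_t utf8_via_iconv(uint32_t cp, char *out, size_t out_size)
{
    /* iconv_open takes the target encoding first, then the source. */
    iconv_t cd = iconv_open("UTF-8", "UTF-32LE");
    if (cd == (iconv_t)-1)
        return 0;

    /* Serialize the code point as little-endian bytes by hand so the
     * result does not depend on the host byte order. */
    char in[4] = {
        (char)(cp & 0xFF),
        (char)((cp >> 8) & 0xFF),
        (char)((cp >> 16) & 0xFF),
        (char)((cp >> 24) & 0xFF)
    };
    char *in_ptr = in;
    size_t in_left = sizeof in;
    char *out_ptr = out;
    size_t out_left = out_size;

    /* iconv advances the pointers and decrements the remaining counts. */
    size_t rc = iconv(cd, &in_ptr, &in_left, &out_ptr, &out_left);
    iconv_close(cd);

    if (rc == (size_t)-1)
        return 0;                 /* invalid input or output buffer too small */
    return out_size - out_left;
}
```

Opening a conversion descriptor per call keeps the sketch short; code that converts many code points would more plausibly open the descriptor once and reuse it.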

JesperE answered Nov 11 '22 23:11