
Manually converting unicode codepoints into UTF-8 and UTF-16

I have a university programming exam coming up, and one section is on unicode.

I have checked all over for answers to this, and my lecturer is useless so that’s no help, so this is a last resort for you guys to possibly help.

The question will be something like:

The string 'mЖ丽' has these unicode codepoints: U+006D, U+0416 and U+4E3D. With answers written in hexadecimal, manually encode the string into UTF-8 and UTF-16.

Any help at all will be greatly appreciated as I am trying to get my head round this.

asked Jun 04 '11 23:06 by RSM



2 Answers

Wow. On the one hand I'm thrilled to know that university courses are teaching to the reality that character encodings are hard work, but actually knowing the UTF-8 encoding rules sounds like expecting a lot. (Will it help students pass the Turkey test?)

The clearest description I've seen so far of the rules for encoding UCS codepoints to UTF-8 is from the utf-8(7) manpage on many Linux systems:

Encoding
    The following byte sequences are used to represent a
    character.  The sequence to be used depends on the UCS code
    number of the character:

    0x00000000 - 0x0000007F:
        0xxxxxxx

    0x00000080 - 0x000007FF:
        110xxxxx 10xxxxxx

    0x00000800 - 0x0000FFFF:
        1110xxxx 10xxxxxx 10xxxxxx

    0x00010000 - 0x001FFFFF:
        11110xxx 10xxxxxx 10xxxxxx 10xxxxxx

    [... removed obsolete five and six byte forms ...]

    The xxx bit positions are filled with the bits of the
    character code number in binary representation.  Only the
    shortest possible multibyte sequence which can represent the
    code number of the character can be used.

    The UCS code values 0xd800–0xdfff (UTF-16 surrogates) as well
    as 0xfffe and 0xffff (UCS noncharacters) should not appear in
    conforming UTF-8 streams.

It might be easier to remember a 'compressed' version of the chart:

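The initial byte of a multi-byte sequence starts with a 1, followed by one more 1 for each subsequent byte, then a 0. Subsequent bytes start with 10. Each row below gives the first codepoint of a range, the number of data bits left in the initial byte, and the number of subsequent bytes:

0x80      5 bits, one byte
0x800     4 bits, two bytes
0x10000   3 bits, three bytes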

You can derive the ranges by taking note of how much space you can fill with the bits allowed in the new representation:

2**(5+1*6) == 2048       == 0x800
2**(4+2*6) == 65536      == 0x10000
2**(3+3*6) == 2097152    == 0x200000
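The same arithmetic can be checked in a few lines of Python. This is only a sketch of the derivation above; the helper name `range_limit` is made up for illustration, and the 7-bit row for the single-byte form is added for completeness.

```python
def range_limit(lead_bits, continuation_bytes):
    """First codepoint that no longer fits: 2 ** (total data bits)."""
    # Each continuation byte (10xxxxxx) carries 6 data bits.
    return 2 ** (lead_bits + 6 * continuation_bytes)

# (data bits in the initial byte, number of continuation bytes)
for lead_bits, continuation_bytes in [(7, 0), (5, 1), (4, 2), (3, 3)]:
    limit = range_limit(lead_bits, continuation_bytes)
    print(f"{1 + continuation_bytes} byte(s): codepoints below {limit:#x}")
```

Each printed limit is the start of the next row in the chart, which is why the chart can be rebuilt from the bit counts alone.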

I know I could remember the rules to derive the chart easier than the chart itself. Here's hoping you're good at remembering rules too. :)

Update

Once you have built the chart above, you can convert input Unicode codepoints to UTF-8 by finding their range, converting from hexadecimal to binary, inserting the bits according to the rules above, then converting back to hex:

U+4E3E 

This fits in the 0x00000800 - 0x0000FFFF range (0x0800 <= 0x4E3E <= 0xFFFF), so the representation will be of the form:

   1110xxxx 10xxxxxx 10xxxxxx 

0x4E3E is 100111000111110b. Drop the bits into the x above (start from the right, we'll fill in missing bits at the start with 0):

   1110x100 10111000 10111110 

There is an x spot left over at the start, fill it in with 0:

   11100100 10111000 10111110 

Convert from bits to hex:

   0xE4 0xB8 0xBE 
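The bit-dropping steps above can be sketched in Python as follows. The function name `utf8_encode` is hypothetical, and the code fills the x positions from the right, six bits at a time, exactly as in the manual walkthrough; it is meant as a study aid, not a replacement for a real codec.

```python
def utf8_encode(cp):
    """Encode one codepoint into a list of UTF-8 byte values by hand."""
    if 0xD800 <= cp <= 0xDFFF:
        raise ValueError("surrogates must not appear in UTF-8")
    if cp < 0x80:
        return [cp]                       # 0xxxxxxx
    if cp < 0x800:
        lead, extra = 0b11000000, 1       # 110xxxxx + 1 continuation byte
    elif cp < 0x10000:
        lead, extra = 0b11100000, 2       # 1110xxxx + 2 continuation bytes
    elif cp < 0x110000:
        lead, extra = 0b11110000, 3       # 11110xxx + 3 continuation bytes
    else:
        raise ValueError("codepoint out of range")

    out = []
    for _ in range(extra):
        out.append(0b10000000 | (cp & 0b111111))  # low 6 bits into 10xxxxxx
        cp >>= 6
    out.append(lead | cp)                 # remaining bits go in the initial byte
    return out[::-1]                      # bytes were built right to left

print(" ".join(f"0x{b:02X}" for b in utf8_encode(0x4E3E)))  # 0xE4 0xB8 0xBE
```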
answered Sep 21 '22 15:09 by sarnold


The descriptions on Wikipedia for UTF-8 and UTF-16 are good.

Procedures for your example string:

UTF-8

UTF-8 uses up to 4 bytes to represent Unicode codepoints. For the 1-byte case, use the following pattern:

1-byte UTF-8 = 0xxxxxxx (binary) = 7 bits = 0-7F (hex)

The initial byte of a 2-, 3- or 4-byte UTF-8 sequence starts with 2, 3 or 4 one bits, followed by a zero bit. Follow-on bytes always start with the two-bit pattern 10, leaving 6 bits for data:

2-byte UTF-8 = 110xxxxx 10xxxxxx (binary) = 5+6 (11) bits = 80-7FF (hex)
3-byte UTF-8 = 1110xxxx 10xxxxxx 10xxxxxx (binary) = 4+6+6 (16) bits = 800-FFFF (hex)
4-byte UTF-8 = 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx (binary) = 3+6+6+6 (21) bits = 10000-10FFFF (hex)

Unicode codepoints are undefined beyond 10FFFF (hex).

Your codepoints are U+006D, U+0416 and U+4E3D requiring 1-, 2- and 3-byte UTF-8 sequences, respectively. Convert to binary and assign the bits:

U+006D = 1101101 (binary) = 01101101 (binary) = 6D (hex)
U+0416 = 10000 010110 (binary) = 11010000 10010110 (binary) = D0 96 (hex)
U+4E3D = 0100 111000 111101 (binary) = 11100100 10111000 10111101 (binary) = E4 B8 BD (hex)

Final byte sequence:

6D D0 96 E4 B8 BD

or if nul-terminated strings are desired:

6D D0 96 E4 B8 BD 00
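A hand-computed answer like this can be double-checked with Python's built-in UTF-8 codec. This is just a verification sketch; the variable names are arbitrary, and `bytes.hex` with a separator needs Python 3.8 or later.

```python
# The example string, written with escapes so the codepoints are explicit.
text = "\u006D\u0416\u4E3D"

encoded = text.encode("utf-8")
print(encoded.hex(" ").upper())                # 6D D0 96 E4 B8 BD

# Nul-terminated variant: append a single zero byte.
print((encoded + b"\x00").hex(" ").upper())    # 6D D0 96 E4 B8 BD 00
```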

UTF-16

UTF-16 uses 2 or 4 bytes to represent Unicode codepoints. Algorithm:

U+0000 to U+D7FF uses 2-byte 0000 (hex) to D7FF (hex)
U+D800 to U+DFFF are invalid codepoints reserved for 4-byte UTF-16
U+E000 to U+FFFF uses 2-byte E000 (hex) to FFFF (hex)

U+10000 to U+10FFFF uses 4-byte UTF-16 encoded as follows:

  1. Subtract 10000 (hex) from the codepoint.
  2. Express the result as 20-bit binary.
  3. Use the pattern 110110xxxxxxxxxx 110111xxxxxxxxxx (binary) to encode the upper and lower 10 bits into two 16-bit words.
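The three steps above can be sketched in Python like this. The function name `utf16_units` is hypothetical; it returns the 16-bit words as integers and leaves byte order aside.

```python
def utf16_units(cp):
    """Return the 16-bit UTF-16 words for one codepoint, computed by hand."""
    if 0xD800 <= cp <= 0xDFFF:
        raise ValueError("U+D800 to U+DFFF are reserved for surrogates")
    if cp > 0x10FFFF:
        raise ValueError("codepoint out of range")
    if cp < 0x10000:
        return [cp]                       # 2-byte case: the word is the codepoint

    value = cp - 0x10000                  # step 1: subtract 10000 (hex)
    upper = value >> 10                   # step 2: upper 10 of the 20 bits
    lower = value & 0b1111111111          #         lower 10 of the 20 bits
    # step 3: 110110xxxxxxxxxx 110111xxxxxxxxxx
    return [0b1101100000000000 | upper, 0b1101110000000000 | lower]

print(" ".join(f"{w:04X}" for w in utf16_units(0x1F031)))  # D83C DC31
print(" ".join(f"{w:04X}" for w in utf16_units(0x4E3D)))   # 4E3D
```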

Using your codepoints:

U+006D = 006D (hex)
U+0416 = 0416 (hex)
U+4E3D = 4E3D (hex)

Now, we have one more issue. Some machines store the two bytes of a 16-bit word least significant byte first (so-called little-endian machines) and some store most significant byte first (big-endian machines). UTF-16 uses the codepoint U+FEFF (called the byte order mark or BOM) to help a machine determine if a byte stream contains big- or little-endian UTF-16:

big-endian = FE FF 00 6D 04 16 4E 3D
little-endian = FF FE 6D 00 16 04 3D 4E

With nul-termination, U+0000 = 0000 (hex):

big-endian = FE FF 00 6D 04 16 4E 3D 00 00
little-endian = FF FE 6D 00 16 04 3D 4E 00 00
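Both byte orders can be reproduced with Python's standard codecs as a cross-check. This sketch prepends the BOM explicitly, using the `codecs.BOM_UTF16_BE` and `codecs.BOM_UTF16_LE` constants, because the `utf-16-be` and `utf-16-le` codecs do not write one themselves.

```python
import codecs

text = "\u006D\u0416\u4E3D"

# BOM (U+FEFF) followed by the string in each byte order.
big = codecs.BOM_UTF16_BE + text.encode("utf-16-be")
little = codecs.BOM_UTF16_LE + text.encode("utf-16-le")

print("big-endian    =", big.hex(" ").upper())     # FE FF 00 6D 04 16 4E 3D
print("little-endian =", little.hex(" ").upper())  # FF FE 6D 00 16 04 3D 4E
```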

Since your instructor didn't give a codepoint that required 4-byte UTF-16, here's one example:

U+1F031: 1F031 (hex) - 10000 (hex) = F031 (hex) = 0000111100 0000110001 (binary)
Encoded: 1101100000111100 1101110000110001 (binary) = D83C DC31 (hex)

answered Sep 18 '22 15:09 by Mark Tolonen