I have a university programming exam coming up, and one section is on Unicode. I have looked all over for answers to this, and my lecturer is useless, so asking you guys is a last resort. The question will be something like: <blockquote> The string 'mЖ丽' has the Unicode codepoints <code>U+006D</code>, <code>U+0416</code> and <code>U+4E3D</code>. With answers written in hexadecimal, manually encode the string into UTF-8 and UTF-16. </blockquote> Any help at all will be greatly appreciated, as I am trying to get my head round this.


Manually converting unicode codepoints into UTF-8 and UTF-16

2 Answers

Wow. On the one hand I'm thrilled to know that university courses are teaching to the reality that character encodings are hard work, but actually knowing the UTF-8 encoding rules sounds like expecting a lot. (Will it help students pass the Turkey test?)

The clearest description I've seen so far of the rules to encode UCS codepoints to UTF-8 is from the utf-8(7) manpage on many Linux systems:

   Encoding
       The following byte sequences are used to represent a
       character.  The sequence to be used depends on the UCS code
       number of the character:

       0x00000000 - 0x0000007F:
           0xxxxxxx

       0x00000080 - 0x000007FF:
           110xxxxx 10xxxxxx

       0x00000800 - 0x0000FFFF:
           1110xxxx 10xxxxxx 10xxxxxx

       0x00010000 - 0x001FFFFF:
           11110xxx 10xxxxxx 10xxxxxx 10xxxxxx

       [... removed obsolete five and six byte forms ...]

       The xxx bit positions are filled with the bits of the
       character code number in binary representation.  Only the
       shortest possible multibyte sequence which can represent the
       code number of the character can be used.

       The UCS code values 0xd800–0xdfff (UTF-16 surrogates) as well
       as 0xfffe and 0xffff (UCS noncharacters) should not appear in
       conforming UTF-8 streams.

It might be easier to remember a 'compressed' version of the chart:

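The initial byte of an encoded multibyte codepoint starts with a 1, followed by padding of the form 1+0 (one or more 1 bits, then a 0). Subsequent bytes start with 10.

   0x80      5 bits in the initial byte, one subsequent byte
   0x800     4 bits in the initial byte, two subsequent bytes
   0x10000   3 bits in the initial byte, three subsequent bytes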

You can derive the ranges by taking note of how much space you can fill with the bits allowed in the new representation:

   2**(5+1*6) == 2048       == 0x800
   2**(4+2*6) == 65536      == 0x10000
   2**(3+3*6) == 2097152    == 0x200000
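If it helps to see the derivation run, here is a small Python sketch of the same arithmetic. The helper name `utf8_capacity` is made up for this illustration; it simply counts the payload bits in each form (bits in the initial byte plus 6 per subsequent byte).

```python
def utf8_capacity(initial_bits, subsequent_bytes):
    """Return the first code number that no longer fits in this form.

    Hypothetical helper: each subsequent byte (10xxxxxx) carries 6 bits.
    """
    return 2 ** (initial_bits + 6 * subsequent_bytes)


limits = [
    utf8_capacity(7, 0),  # 1-byte form: 0xxxxxxx
    utf8_capacity(5, 1),  # 2-byte form: 110xxxxx 10xxxxxx
    utf8_capacity(4, 2),  # 3-byte form: 1110xxxx 10xxxxxx 10xxxxxx
    utf8_capacity(3, 3),  # 4-byte form: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
]
print([hex(n) for n in limits])  # ['0x80', '0x800', '0x10000', '0x200000']
```

Each limit is the start of the next range in the chart, so a codepoint below a given limit fits in that form.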

I know I could remember the rules to derive the chart more easily than the chart itself. Here's hoping you're good at remembering rules too. :)

Update

Once you have built the chart above, you can convert input Unicode codepoints to UTF-8 by finding their range, converting from hexadecimal to binary, inserting the bits according to the rules above, then converting back to hex:

U+4E3E

This fits in the 0x00000800 - 0x0000FFFF range (0x0800 <= 0x4E3E <= 0xFFFF), so the representation will be of the form:

   1110xxxx 10xxxxxx 10xxxxxx

0x4E3E is 100111000111110b. Drop the bits into the x above (start from the right, we'll fill in missing bits at the start with 0):

   1110x100 10111000 10111110

There is one x spot left over at the start; fill it in with 0:

   11100100 10111000 10111110

Convert from bits to hex:

   0xE4 0xB8 0xBE
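To check a hand conversion like this, the same steps can be sketched in Python. This is only an illustration under my own naming (`encode_utf8` is not a standard function); it picks the form by range, peels off 6 bits at a time from the right, and puts whatever is left into the initial byte. Python's built-in `str.encode` is used as the cross-check.

```python
def encode_utf8(code_point):
    """Hand-encode one Unicode codepoint as UTF-8 (illustrative helper)."""
    if not 0 <= code_point <= 0x10FFFF or 0xD800 <= code_point <= 0xDFFF:
        raise ValueError(f"not an encodable codepoint: {code_point:#x}")
    if code_point <= 0x7F:
        return bytes([code_point])           # 0xxxxxxx
    if code_point <= 0x7FF:
        lead, extra = 0b11000000, 1          # 110xxxxx + 1 subsequent byte
    elif code_point <= 0xFFFF:
        lead, extra = 0b11100000, 2          # 1110xxxx + 2 subsequent bytes
    else:
        lead, extra = 0b11110000, 3          # 11110xxx + 3 subsequent bytes
    tail = []
    for _ in range(extra):
        # Take the lowest 6 bits and prefix them with 10.
        tail.append(0b10000000 | (code_point & 0b00111111))
        code_point >>= 6
    # The remaining high bits go into the initial byte; tail was built
    # right to left, so reverse it.
    return bytes([lead | code_point] + tail[::-1])


encoded = encode_utf8(0x4E3E)
print(" ".join(f"{b:02X}" for b in encoded))  # E4 B8 BE
assert encoded == chr(0x4E3E).encode("utf-8")
```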


answered Sep 21 '22 15:09

sarnold

The descriptions on Wikipedia for UTF-8 and UTF-16 are good.

Procedures for your example string:

UTF-8

UTF-8 uses up to 4 bytes to represent Unicode codepoints. For the 1-byte case, use the following pattern:

1-byte UTF-8 = 0xxxxxxx_bin = 7 bits = 0-7F_hex

The initial byte of a 2-, 3- or 4-byte UTF-8 sequence starts with 2, 3 or 4 one bits, followed by a zero bit. Follow-on bytes always start with the two-bit pattern 10, leaving 6 bits for data:

2-byte UTF-8 = 110xxxxx 10xxxxxx_bin = 5+6(11) bits = 80-7FF_hex
3-byte UTF-8 = 1110xxxx 10xxxxxx 10xxxxxx_bin = 4+6+6(16) bits = 800-FFFF_hex
4-byte UTF-8 = 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx_bin = 3+6+6+6(21) bits = 10000-10FFFF_hex^†

^†Unicode codepoints are undefined beyond 10FFFF_hex.

Your codepoints are U+006D, U+0416 and U+4E3D requiring 1-, 2- and 3-byte UTF-8 sequences, respectively. Convert to binary and assign the bits:

U+006D = 1101101_bin = 01101101_bin = 6D_hex
U+0416 = 10000 010110_bin = 11010000 10010110_bin = D0 96_hex
U+4E3D = 0100 111000 111101_bin = 11100100 10111000 10111101_bin = E4 B8 BD_hex

Final byte sequence:

6D D0 96 E4 B8 BD

or if nul-terminated strings are desired:

6D D0 96 E4 B8 BD 00
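A quick way to confirm the byte sequence is to let Python do the encoding and print the result in hex. This is just a sanity check with the built-in codec, not part of the manual method; the variable names are my own.

```python
# The example string, written with escapes so the source file's own
# encoding does not matter: m, U+0416, U+4E3D.
text = "m\u0416\u4e3d"

data = text.encode("utf-8")
hex_string = " ".join(f"{b:02X}" for b in data)
print(hex_string)  # 6D D0 96 E4 B8 BD

# Per-character view, matching the 1-, 2- and 3-byte cases above.
for ch in text:
    print(f"U+{ord(ch):04X}", ch.encode("utf-8").hex().upper())
```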

UTF-16

UTF-16 uses 2 or 4 bytes to represent Unicode codepoints. Algorithm:

U+0000 to U+D7FF uses 2-byte 0000_hex to D7FF_hex
U+D800 to U+DFFF are not valid codepoints on their own; they are reserved as surrogates for 4-byte UTF-16
U+E000 to U+FFFF uses 2-byte E000_hex to FFFF_hex

U+10000 to U+10FFFF uses 4-byte UTF-16 encoded as follows:

1. Subtract 10000_hex from the codepoint.
2. Express result as 20-bit binary.
3. Use the pattern 110110xxxxxxxxxx 110111xxxxxxxxxx_bin to encode the upper- and lower- 10 bits into two 16-bit words.
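The three steps translate fairly directly into bit operations. Below is a minimal Python sketch, assuming a helper name of my own choosing (`utf16_units`); it returns the 16-bit words as integers and leaves byte order out of the picture.

```python
def utf16_units(code_point):
    """Return the 16-bit UTF-16 words for one codepoint (illustrative)."""
    if not 0 <= code_point <= 0x10FFFF or 0xD800 <= code_point <= 0xDFFF:
        raise ValueError(f"not an encodable codepoint: {code_point:#x}")
    if code_point < 0x10000:
        return [code_point]               # 2-byte case: the value itself
    offset = code_point - 0x10000         # step 1: a 20-bit number
    high = 0xD800 | (offset >> 10)        # 110110 + upper 10 bits
    low = 0xDC00 | (offset & 0x3FF)       # 110111 + lower 10 bits
    return [high, low]


print([f"{u:04X}" for u in utf16_units(0x4E3D)])    # ['4E3D']
print([f"{u:04X}" for u in utf16_units(0x10000)])   # ['D800', 'DC00']
```

0xD800 and 0xDC00 are just the patterns 110110 and 110111 followed by ten zero bits, which is why OR-ing in the 10-bit halves works.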

Using your codepoints:

U+006D = 006D_hex
U+0416 = 0416_hex
U+4E3D = 4E3D_hex

Now, we have one more issue. Some machines store the two bytes of a 16-bit word least significant byte first (so-called little-endian machines) and some store most significant byte first (big-endian machines). UTF-16 uses the codepoint U+FEFF (called the byte order mark or BOM) to help a machine determine if a byte stream contains big- or little-endian UTF-16:

big-endian = FE FF 00 6D 04 16 4E 3D
little-endian = FF FE 6D 00 16 04 3D 4E

With nul-termination, U+0000 = 0000_hex:

big-endian = FE FF 00 6D 04 16 4E 3D 00 00
little-endian = FF FE 6D 00 16 04 3D 4E 00 00
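Here is a small Python sketch of the byte-order step, assuming all characters are below U+10000 (true for this string) so each one is a single 16-bit word. It prepends U+FEFF and serialises every word both ways with `int.to_bytes`; the variable names are mine.

```python
import codecs

text = "m\u0416\u4e3d"

# BOM first, then one 16-bit word per character (no surrogates needed here).
units = [0xFEFF] + [ord(ch) for ch in text]

big = b"".join(u.to_bytes(2, "big") for u in units)
little = b"".join(u.to_bytes(2, "little") for u in units)

print("big-endian    =", " ".join(f"{b:02X}" for b in big))
print("little-endian =", " ".join(f"{b:02X}" for b in little))
# big-endian    = FE FF 00 6D 04 16 4E 3D
# little-endian = FF FE 6D 00 16 04 3D 4E

# Cross-check against the built-in codecs, which do not add a BOM
# when the byte order is named explicitly.
assert big == codecs.BOM_UTF16_BE + text.encode("utf-16-be")
assert little == codecs.BOM_UTF16_LE + text.encode("utf-16-le")
```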

Since your instructor didn't give a codepoint that required 4-byte UTF-16, here's one example:

U+1F031 = 1F031_hex - 10000_hex = F031_hex = 0000111100 0000110001_bin =
1101100000111100 1101110000110001_bin = D83C DC31_hex
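The same example can be traced in Python using binary strings, which mirrors the pencil-and-paper version more closely than bit shifts do. This is only a sketch with names of my own; the last line checks the result against Python's built-in big-endian codec.

```python
code_point = 0x1F031

offset = code_point - 0x10000          # F031_hex
bits = format(offset, "020b")          # pad to 20 binary digits
upper, lower = bits[:10], bits[10:]
print(upper, lower)                    # 0000111100 0000110001

# Prefix each half with its surrogate marker and read it back as a number.
high = int("110110" + upper, 2)
low = int("110111" + lower, 2)
print(f"{high:04X} {low:04X}")         # D83C DC31

assert chr(code_point).encode("utf-16-be") == bytes.fromhex("D83CDC31")
```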

answered Sep 18 '22 15:09

Mark Tolonen
