
UTF-8 hex to unicode code point (only math)

Let's take this table of characters with their hex encodings as Unicode code points and as UTF-8.
Does anyone know how to convert UTF-8 bytes to a Unicode code point using only math operations?
E.g. take the first row: given 227, 129, 130, how do you get 12354?
Is there a simple way to do it using only math operations?

Unicode code point   UTF-8                        Char
30 42 (12354)        e3 (227) 81 (129) 82 (130)   あ
30 44 (12356)        e3 (227) 81 (129) 84 (132)   い
30 46 (12358)        e3 (227) 81 (129) 86 (134)   う

* Source: https://www.utf8-chartable.de/unicode-utf8-table.pl?start=12288&unicodeinhtml=hex

mihails.kuzmins asked Sep 05 '25 16:09


1 Answer

This video is the perfect source (watch from 6:15), but here is a summary with code samples in Go. The letters mark the bits taken from each UTF-8 byte. Once you understand the logic, it is easy to apply bitwise operators:

1-byte (ASCII), char E
  UTF-8 bytes:
    1. 0xxx xxxx
    0100 0101 or 0x45
  Unicode code point:
    0xxx xxxx
    0100 0101 or U+0045
  Explanation:
    No conversion needed; the value is the same in UTF-8 and as a Unicode code point.

2-byte, char Ê
  UTF-8 bytes:
    1. 110x xxxx
    2. 10yy yyyy
    1100 0011 1000 1010 or 0xC38A
  Unicode code point:
    0xxx xxyy yyyy
    0000 1100 1010 or U+00CA
  Explanation:
    1. Low 5 bits of the 1st byte (x)
    2. Low 6 bits of the 2nd byte (y)

3-byte, char あ
  UTF-8 bytes:
    1. 1110 xxxx
    2. 10yy yyyy
    3. 10zz zzzz
    1110 0011 1000 0001 1000 0010 or 0xE38182
  Unicode code point:
    xxxx yyyy yyzz zzzz
    0011 0000 0100 0010 or U+3042
  Explanation:
    1. Low 4 bits of the 1st byte (x)
    2. Low 6 bits of the 2nd byte (y)
    3. Low 6 bits of the 3rd byte (z)

4-byte, char 𐄟
  UTF-8 bytes:
    1. 1111 0xxx
    2. 10yy yyyy
    3. 10zz zzzz
    4. 10ww wwww
    1111 0000 1001 0000 1000 0100 1001 1111 or 0xF090_849F
  Unicode code point:
    000x xxyy yyyy zzzz zzww wwww
    0000 0001 0000 0001 0001 1111 or U+1011F
  Explanation:
    1. Low 3 bits of the 1st byte (x)
    2. Low 6 bits of the 2nd byte (y)
    3. Low 6 bits of the 3rd byte (z)
    4. Low 6 bits of the 4th byte (w)
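
Since the question asks for math operations only, the same extraction can probably be written without bitwise operators: for well-formed input, masking off a prefix is the same as subtracting it (192, 224 or 240 for the lead byte, 128 for each continuation byte), and shifting left by 6, 12 or 18 bits is multiplying by 64, 4096 or 262144. A minimal sketch, assuming valid UTF-8 input; the names cp2, cp3 and cp4 are made up for illustration:

```go
package main

import "fmt"

// cp2, cp3 and cp4 are hypothetical helpers that decode 2-, 3- and 4-byte
// UTF-8 sequences using only subtraction, multiplication and addition.
// They assume well-formed input and perform no validation.
func cp2(b1, b2 int) int {
	return (b1-192)*64 + (b2 - 128)
}

func cp3(b1, b2, b3 int) int {
	return (b1-224)*4096 + (b2-128)*64 + (b3 - 128)
}

func cp4(b1, b2, b3, b4 int) int {
	return (b1-240)*262144 + (b2-128)*4096 + (b3-128)*64 + (b4 - 128)
}

func main() {
	fmt.Println(cp3(227, 129, 130)) // first row of the question's table: 12354
	fmt.Println(cp2(195, 138))      // Ê: 202 (U+00CA)
	fmt.Println(cp4(240, 144, 132, 159)) // 𐄟: 65823 (U+1011F)
}
```

For the first row this works out to (227 - 224) * 4096 + (129 - 128) * 64 + (130 - 128) = 12288 + 64 + 2 = 12354.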

2-byte UTF-8

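// decode2 combines the payload bits of a 2-byte UTF-8 sequence into a code point.
func decode2(byte1 byte, byte2 byte) rune {
    int1 := uint16(byte1&0b_0001_1111) << 6 // low 5 bits of the 1st byte
    int2 := uint16(byte2 & 0b_0011_1111)    // low 6 bits of the 2nd byte
    return rune(int1 + int2)
}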

3-byte UTF-8

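// decode3 combines the payload bits of a 3-byte UTF-8 sequence into a code point.
func decode3(byte1 byte, byte2 byte, byte3 byte) rune {
    int1 := uint16(byte1&0b_0000_1111) << 12 // low 4 bits of the 1st byte
    int2 := uint16(byte2&0b_0011_1111) << 6  // low 6 bits of the 2nd byte
    int3 := uint16(byte3 & 0b_0011_1111)     // low 6 bits of the 3rd byte
    return rune(int1 + int2 + int3)
}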

4-byte UTF-8

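// decode4 combines the payload bits of a 4-byte UTF-8 sequence into a code point.
func decode4(byte1 byte, byte2 byte, byte3 byte, byte4 byte) rune {
    int1 := uint32(byte1&0b_0000_0111) << 18 // low 3 bits of the 1st byte
    int2 := uint32(byte2&0b_0011_1111) << 12 // low 6 bits of the 2nd byte
    int3 := uint32(byte3&0b_0011_1111) << 6  // low 6 bits of the 3rd byte
    int4 := uint32(byte4 & 0b_0011_1111)     // low 6 bits of the 4th byte
    return rune(int1 + int2 + int3 + int4)
}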
mihails.kuzmins answered Sep 07 '25 19:09



