UTF-8 extends the ASCII character set to use 8-bit code points, which allows for up to 256 different characters. This means that UTF-8 can represent all of the printable ASCII characters, as well as the non-printable characters.
UTF-8 is the universal code page for internationalization and is able to encode the entire Unicode character set. It is used pervasively on the web, and is the default for *nix-based platforms.
UTF-8 is capable of encoding all 1,112,064 valid character code points in Unicode using one to four one-byte (8-bit) code units. Code points with lower numerical values, which tend to occur more frequently, are encoded using fewer bytes.
0xC0, 0xC1, 0xF5, 0xF6, 0xF7, 0xF8, 0xF9, 0xFA, 0xFB, 0xFC, 0xFD, 0xFE, 0xFF are invalid UTF-8 code units. A UTF-8 code unit is 8 bits. If by char you mean an 8-bit byte, then the invalid UTF-8 code units would be char values that do not appear in UTF-8 encoded text.
Some characters such as the Unicode Character 'LATIN SMALL LETTER C WITH CARON' can be encoded as 0xC4 0x8D
, but can also be represented with the two code points for 'LATIN SMALL LETTER C' and 'COMBINING CARON', which is 0x63 0xcc 0x8c
.
More info here: http://www.fileformat.info/info/unicode/char/10d/index.htm
I wonder if there is a library which can convert a 'LATIN SMALL LETTER C' + 'COMBINING CARON' into 'LATIN SMALL LETTER C WITH CARON'. Or is there a table containing these conversions?
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With