What Unicode characters fit in 1, 2, 4 bytes? Can someone point me to a complete character chart?
By using less space to represent more common characters (i.e. ASCII characters), UTF-8 reduces file size while allowing for a much larger number of less-common characters. These less-common characters are encoded into two or more bytes, but this is okay if they're stored sparingly.
UTF-8 uses 1 to 4 bytes per character, depending on the Unicode symbol. UTF-8 has the following properties: The classical US-ASCII characters (0 to 0x7f) encode as themselves, so files and strings that are encoded with ASCII values have the same encoding under both ASCII and UTF-8.
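To see both properties in practice, here's a minimal Python sketch. The sample string and the four sample characters are my own picks for illustration, one from each length class; nothing about them comes from the question.

```python
# Sample text chosen for illustration only.
ascii_text = "Hello, world"

# Pure ASCII text produces identical bytes under ASCII and UTF-8.
same = ascii_text.encode("ascii") == ascii_text.encode("utf-8")

# One sample character per UTF-8 length class:
# U+0041, U+00E9, U+20AC, U+1F600.
samples = ["A", "\u00e9", "\u20ac", "\U0001F600"]
lengths = [len(ch.encode("utf-8")) for ch in samples]

print(same)     # True
print(lengths)  # [1, 2, 3, 4]
```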
UTF-8 is capable of encoding all 1,112,064 valid character code points in Unicode using one to four one-byte (8-bit) code units.
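In case the 1,112,064 figure looks arbitrary, it should be the full code space minus the surrogate range that is reserved for UTF-16. A quick sketch of that arithmetic (variable names are mine):

```python
# The Unicode code space runs from U+0000 to U+10FFFF inclusive.
total_code_points = 0x10FFFF + 1

# Surrogates U+D800..U+DFFF are reserved for UTF-16 and are not
# valid code points to encode in UTF-8.
surrogates = 0xDFFF - 0xD800 + 1

valid_code_points = total_code_points - surrogates
print(valid_code_points)  # 1112064
```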
Characters are encoded according to their position in the range. You can actually find the algorithm on the Wikipedia page for UTF-8, and you can implement it very quickly.
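The Wikipedia article on UTF-8 has a good enough description of the encoding: code points U+0000 to U+007F take 1 byte, U+0080 to U+07FF take 2 bytes, U+0800 to U+FFFF take 3 bytes, and U+10000 to U+10FFFF take 4 bytes.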
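For what it's worth, the range-based scheme is short enough to write by hand. Below is a rough Python sketch, not a reference implementation; `utf8_length` and `utf8_encode` are hypothetical helper names I made up, and in real code you'd just call `str.encode("utf-8")`.

```python
def utf8_length(code_point: int) -> int:
    """Return how many bytes UTF-8 needs for one code point."""
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError("outside the Unicode code space")
    if 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("surrogates cannot be encoded in UTF-8")
    if code_point <= 0x7F:
        return 1
    if code_point <= 0x7FF:
        return 2
    if code_point <= 0xFFFF:
        return 3
    return 4


def utf8_encode(code_point: int) -> bytes:
    """Encode one code point by hand (hypothetical helper)."""
    n = utf8_length(code_point)
    if n == 1:
        # 0xxxxxxx
        return bytes([code_point])
    if n == 2:
        # 110xxxxx 10xxxxxx
        return bytes([0xC0 | (code_point >> 6),
                      0x80 | (code_point & 0x3F)])
    if n == 3:
        # 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (code_point >> 12),
                      0x80 | ((code_point >> 6) & 0x3F),
                      0x80 | (code_point & 0x3F)])
    # 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([0xF0 | (code_point >> 18),
                  0x80 | ((code_point >> 12) & 0x3F),
                  0x80 | ((code_point >> 6) & 0x3F),
                  0x80 | (code_point & 0x3F)])


# Compare against Python's built-in encoder for U+20AC.
print(utf8_encode(0x20AC) == "\u20ac".encode("utf-8"))  # True
```

The lead byte's high bits tell a decoder how many continuation bytes follow, and every continuation byte starts with the bits `10`.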
The charts can be downloaded directly from unicode.org. It's a set of about 150 PDF files, because a single chart would be huge (maybe 30 MiB).
Also be aware that Unicode (compared to something like ASCII) is much more complex to process. There are things like right-to-left text, byte order marks, code points that can be combined ("composed") to create a single character, different ways of representing the exact same string (and a process to convert strings into a canonical form suitable for comparison), a lot more white-space characters, etc. I'd recommend downloading the entire Unicode specification and reading most of it if you're planning to do more than "not much".
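As a small illustration of the "different ways of representing the exact same string" point, here's a sketch using Python's standard `unicodedata` module. The choice of "é" as the example character is mine.

```python
import unicodedata

# "é" as one precomposed code point (U+00E9) ...
composed = "\u00e9"
# ... and as "e" followed by a combining acute accent (U+0301).
decomposed = "e\u0301"

# The two look identical on screen but compare unequal as raw strings.
equal_raw = composed == decomposed

# Normalizing both to the same canonical form makes them comparable.
equal_nfc = unicodedata.normalize("NFC", decomposed) == composed
equal_nfd = unicodedata.normalize("NFD", composed) == decomposed

print(equal_raw, equal_nfc, equal_nfd)  # False True True
```

Note that the two forms also differ in size once encoded: the precomposed form is 2 bytes in UTF-8 and the decomposed form is 3.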