
UTF-8 Encoding size

Tags: unicode, utf-8

Which Unicode characters fit in 1, 2, or 4 bytes? Can someone point me to a complete character chart?

user3234 asked Feb 03 '11 09:02


People also ask

Does UTF-8 encoding increase size?

By using less space to represent more common characters (i.e. ASCII characters), UTF-8 reduces file size while allowing for a much larger number of less-common characters. These less-common characters are encoded into two or more bytes, but this is okay if they're stored sparingly.

How many bytes is a UTF-8?

UTF-8 uses 1 to 4 bytes per character, depending on the Unicode symbol. UTF-8 has the following properties: The classical US-ASCII characters (0 to 0x7f) encode as themselves, so files and strings that are encoded with ASCII values have the same encoding under both ASCII and UTF-8.

How many characters can UTF-8 represent?

UTF-8 is capable of encoding all 1,112,064 valid character code points in Unicode using one to four one-byte (8-bit) code units.


2 Answers

Characters are encoded according to where their code point falls in the ranges below. The algorithm is described on the Wikipedia page for UTF-8, and it can be implemented very quickly.

  • U+0000 to U+007F are (correctly) encoded with one byte
  • U+0080 to U+07FF are encoded with 2 bytes
  • U+0800 to U+FFFF are encoded with 3 bytes
  • U+010000 to U+10FFFF are encoded with 4 bytes
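
To show how quickly the ranges above turn into code, here is a minimal Python sketch of an encoder for a single code point. The function name `utf8_encode` is made up for this illustration, and rejecting surrogates (U+D800 to U+DFFF) is an assumption based on how well-formed UTF-8 is usually defined. In practice you would just call the built-in `str.encode("utf-8")`.

```python
def utf8_encode(cp: int) -> bytes:
    """Encode one Unicode code point as UTF-8 (illustrative sketch only)."""
    # Out-of-range values and surrogates are not encodable in well-formed UTF-8.
    if cp < 0 or cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError(f"not an encodable code point: {cp:#x}")
    if cp <= 0x7F:
        # 1 byte: 0xxxxxxx
        return bytes([cp])
    if cp <= 0x7FF:
        # 2 bytes: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp <= 0xFFFF:
        # 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([
            0xE0 | (cp >> 12),
            0x80 | ((cp >> 6) & 0x3F),
            0x80 | (cp & 0x3F),
        ])
    # 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([
        0xF0 | (cp >> 18),
        0x80 | ((cp >> 12) & 0x3F),
        0x80 | ((cp >> 6) & 0x3F),
        0x80 | (cp & 0x3F),
    ])


# Compare against Python's built-in encoder for the euro sign (U+20AC).
print(utf8_encode(0x20AC) == "€".encode("utf-8"))
```

The leading byte's high bits tell a decoder how many continuation bytes follow, which is why each range has its own bit pattern.
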
Jimmy answered Sep 28 '22 11:09


The Wikipedia article on UTF-8 has a good enough description of the encoding:

  • 1 byte = code points 0x000000 to 0x00007F (inclusive)
  • 2 bytes = code points 0x000080 to 0x0007FF
  • 3 bytes = code points 0x000800 to 0x00FFFF
  • 4 bytes = code points 0x010000 to 0x10FFFF
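
If you want to check those boundaries yourself, a quick sketch like the following should do it. It leans on Python's built-in encoder instead of reimplementing anything, and `utf8_len` is just a helper name chosen for this example.

```python
def utf8_len(cp: int) -> int:
    """Number of bytes Python's UTF-8 encoder uses for one code point."""
    return len(chr(cp).encode("utf-8"))


# First and last code point of each range in the list above.
boundaries = [0x00, 0x7F, 0x80, 0x7FF, 0x800, 0xFFFF, 0x10000, 0x10FFFF]

for cp in boundaries:
    print(f"U+{cp:06X}: {utf8_len(cp)} byte(s)")
```

Note that surrogate code points (U+D800 to U+DFFF) sit inside the 3-byte range numerically, but Python refuses to encode them, so they are left out of the list here.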

The charts can be downloaded directly from unicode.org. It's a set of about 150 PDF files, because a single chart would be huge (maybe 30 MiB).

Also be aware that Unicode (compared to something like ASCII) is much more complex to process: there are things like right-to-left text, byte order marks, code points that can be combined ("composed") to create a single character, different ways of representing the exact same string (and a normalization process to convert strings into a canonical form suitable for comparison), a lot more white-space characters, etc. I'd recommend downloading the entire Unicode specification and reading most of it if you're planning to do more than "not much".
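
As a small illustration of the "different representations of the same string" point, here is a sketch using Python's standard `unicodedata` module. The helper name `same_text` and the choice of NFC as the canonical form are assumptions for this example; which normalization form is right depends on your application.

```python
import unicodedata

composed = "\u00e9"      # "é" as a single precomposed code point
decomposed = "e\u0301"   # "e" followed by a combining acute accent


def same_text(a: str, b: str) -> bool:
    """Compare two strings after normalizing both to NFC."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


# The raw strings differ, and so do their UTF-8 sizes.
print(composed == decomposed)
print(len(composed.encode("utf-8")), len(decomposed.encode("utf-8")))

# After normalization they compare equal.
print(same_text(composed, decomposed))
```

So the byte count of a piece of text in UTF-8 depends not only on which characters it contains but also on how those characters happen to be represented.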

Brendan answered Sep 28 '22 11:09