
What is the maximum number of bytes for a UTF-8 encoded character?

What is the maximum number of bytes for a single UTF-8 encoded character?

I'll be encrypting the bytes of a String encoded in UTF-8 and therefore need to be able to work out the maximum number of bytes for a UTF-8 encoded String.

Could someone please confirm the maximum number of bytes for a single UTF-8 encoded character?

Edd asked Mar 02 '12 12:03




1 Answer

The maximum number of bytes per character is 4, according to RFC 3629, which limited the code space to U+10FFFF:

In UTF-8, characters from the U+0000..U+10FFFF range (the UTF-16 accessible range) are encoded using sequences of 1 to 4 octets.

(The original specification allowed sequences of up to six bytes, for code points past U+10FFFF.)

Characters with a code point below 128 require only 1 byte, and the next 1,920 code points (U+0080 to U+07FF) require only 2 bytes. Unless you are working with an esoteric language, multiplying the character count by 4 will be a significant overestimate.
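To make the size classes concrete, here is a minimal Python sketch. The helper names utf8_len and max_utf8_bytes are made up for this illustration, and the sample characters are just one pick from each size class. It measures the encoded length of individual characters and computes the worst-case bound of 4 bytes per code point:

```python
def utf8_len(ch: str) -> int:
    """Number of bytes UTF-8 uses to encode the given character."""
    return len(ch.encode("utf-8"))


def max_utf8_bytes(text: str) -> int:
    """Worst-case UTF-8 size: 4 bytes for every code point in the text."""
    return 4 * len(text)


# One sample per size class (hypothetical picks, written as escapes):
# U+0041 'A', U+00E9 e-acute, U+20AC euro sign, U+1F600 emoji.
samples = ["\u0041", "\u00e9", "\u20ac", "\U0001F600"]
sizes = [utf8_len(c) for c in samples]
print(sizes)  # [1, 2, 3, 4]

text = "h\u00e9llo"
print(len(text.encode("utf-8")), "<=", max_utf8_bytes(text))  # 6 <= 20
```

If you need the exact size rather than an upper bound, encoding the string and taking the length of the result is usually simpler than estimating.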

Tamás answered Oct 19 '22 14:10