<p>What Unicode characters fit in 1, 2, or 4 bytes? Can someone point me to a complete character chart?</p>


UTF-8 Encoding size

2 Answers

The number of bytes a character needs depends on where its code point falls in the Unicode range. The algorithm is described on the Wikipedia page for UTF-8, and it is quick to implement.

U+0000 to U+007F are encoded with 1 byte (a longer, "overlong" encoding of these code points is not valid UTF-8)
U+0080 to U+07FF are encoded with 2 bytes
U+0800 to U+FFFF are encoded with 3 bytes
U+010000 to U+10FFFF are encoded with 4 bytes
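
To illustrate how quick the implementation is, here is a minimal sketch of an encoder for a single code point. The function name `utf8_encode` is made up for this example, and it follows the bit layout described in the Wikipedia article rather than any particular library's code. In practice you would normally just call Python's built-in `str.encode("utf-8")`.

```python
def utf8_encode(cp: int) -> bytes:
    """Encode one Unicode code point as UTF-8 (illustrative sketch only)."""
    # Surrogates (U+D800 to U+DFFF) and values above U+10FFFF are not encodable.
    if cp < 0 or cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError(f"not a valid Unicode scalar value: {cp:#x}")
    if cp <= 0x7F:
        # 1 byte: 0xxxxxxx
        return bytes([cp])
    if cp <= 0x7FF:
        # 2 bytes: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6),
                      0x80 | (cp & 0x3F)])
    if cp <= 0xFFFF:
        # 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    # 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([0xF0 | (cp >> 18),
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])


# Cross-check against the built-in encoder for the euro sign (U+20AC).
print(utf8_encode(0x20AC) == "\u20ac".encode("utf-8"))  # True
```

The leading byte's high bits tell a decoder how many continuation bytes (each starting with the bits `10`) follow, which is why the ranges split where they do.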

answered Sep 28 '22 11:09

Jimmy

The Wikipedia article on UTF-8 has a good enough description of the encoding:

1 byte = code points 0x000000 to 0x00007F (inclusive)
2 bytes = code points 0x000080 to 0x0007FF
3 bytes = code points 0x000800 to 0x00FFFF
4 bytes = code points 0x010000 to 0x10FFFF
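
If you only need the size and not the bytes themselves, the table above reduces to a few comparisons. The sketch below uses a hypothetical helper named `utf8_length` and checks it against Python's built-in encoder at each range boundary. It assumes the input is a valid code point and does not reject surrogates.

```python
def utf8_length(cp: int) -> int:
    """Return how many bytes UTF-8 uses for code point cp (sketch, no validation)."""
    if cp <= 0x7F:
        return 1
    if cp <= 0x7FF:
        return 2
    if cp <= 0xFFFF:
        return 3
    return 4


# Compare with the built-in encoder at the first and last code point of each range.
boundaries = [0x00, 0x7F, 0x80, 0x7FF, 0x800, 0xFFFF, 0x10000, 0x10FFFF]
for cp in boundaries:
    actual = len(chr(cp).encode("utf-8"))
    print(f"U+{cp:06X}: table says {utf8_length(cp)}, encoder says {actual}")
```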

The charts can be downloaded directly from unicode.org. They come as a set of about 150 PDF files, because a single chart would be huge (maybe 30 MiB).

Also be aware that Unicode is much more complex to process than something like ASCII. There are things like right-to-left text, byte order marks, code points that can be combined ("composed") to create a single character, different ways of representing the exact same string (with a process to convert strings into a canonical form suitable for comparison), many more white-space characters, and so on. If you're planning to do anything beyond the basics, I'd recommend downloading the entire Unicode specification and reading most of it.
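
As a small illustration of the "different ways of representing the exact same string" point, here is a sketch using Python's standard `unicodedata` module. The variable names are only for this example. It shows that "é" can be one code point or two, that the two forms have different UTF-8 sizes, and that normalizing to a canonical form (NFC here) makes them compare equal.

```python
import unicodedata

composed = "\u00e9"      # "é" as a single code point (U+00E9)
decomposed = "e\u0301"   # "e" followed by COMBINING ACUTE ACCENT (U+0301)

# The two strings render the same but are different code point sequences.
print(composed == decomposed)                 # False
print(len(composed.encode("utf-8")))          # 2
print(len(decomposed.encode("utf-8")))        # 3

# Normalizing both to the same canonical form makes comparison reliable.
nfc_a = unicodedata.normalize("NFC", composed)
nfc_b = unicodedata.normalize("NFC", decomposed)
print(nfc_a == nfc_b)                         # True
```

So the byte count of a piece of text depends on which form it is stored in, not only on which characters the reader sees.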

answered Sep 28 '22 11:09

Brendan



Tags:

unicode

utf-8

user3234
