Reading the Wikipedia article on UTF-8, I've been wondering about the term "overlong". The term is used several times, but the article doesn't provide a definition or reference for its meaning.
I would like to know if someone can explain the term and its purpose.
The Difference Between Unicode and UTF-8

Unicode is a character set; UTF-8 is an encoding. Unicode is a list of characters, each assigned a unique number called a code point.
A code point is a unique number for a character or a symbol such as an accent mark or ligature. Unicode supports more than a million code points, which are written as "U+" followed by the number in hexadecimal; for example, the word "Hello" is written U+0048 U+0065 U+006C U+006C U+006F.
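As a quick illustration, here is a minimal Python sketch (the `ord` built-in returns a character's code point):

```python
# Print the Unicode code point of each character in "Hello".
for ch in "Hello":
    print(f"U+{ord(ch):04X}")
# U+0048 U+0065 U+006C U+006C U+006F
```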
UTF-8 (UCS Transformation Format 8) is the most common character encoding on the World Wide Web. Each character is represented by one to four bytes. UTF-8 is backward-compatible with ASCII and can represent any valid Unicode code point.
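A small Python sketch of that variable width, printing the UTF-8 bytes and byte count for a few sample characters:

```python
# Each character encodes to a different number of UTF-8 bytes.
for ch in ("A", "é", "€", "😀"):
    encoded = ch.encode("utf-8")
    print(ch, encoded.hex(" "), len(encoded))
# A 41 1
# é c3 a9 2
# € e2 82 ac 3
# 😀 f0 9f 98 80 4
```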
The most common encoding schemes are UTF-8, UTF-16, and UTF-32.
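For comparison, here is a sketch of the same code point under all three schemes, using Python's codec names (the little-endian variants are chosen only to avoid byte-order marks):

```python
# The same code point, U+20AC (€), under the three common encodings.
for codec in ("utf-8", "utf-16-le", "utf-32-le"):
    print(codec, "€".encode(codec).hex(" "))
# utf-8 e2 82 ac
# utf-16-le ac 20
# utf-32-le ac 20 00 00
```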
An overlong encoding is an encoding of a code point that uses more code units (in UTF-8, bytes) than it needs to. For example, U+0020 is represented in UTF-8 by the single byte 0x20. If you decode the two bytes 0xC0 0xA0 in the normal fashion, you'll still end up back at U+0020, but that's an invalid representation.
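A conforming decoder is required to reject such a sequence, and Python's built-in UTF-8 codec does; a minimal sketch:

```python
# The canonical one-byte encoding of U+0020 decodes fine...
print(b"\x20".decode("utf-8"))  # -> " "

# ...but the overlong two-byte form is rejected outright.
try:
    b"\xc0\xa0".decode("utf-8")
except UnicodeDecodeError as exc:
    print(exc)  # 'utf-8' codec can't decode byte 0xc0 in position 0: invalid start byte
```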
The Unicode Corrigendum #1 has more information, particularly around table 3.1B.
UTF-8 would, in principle, allow several different representations of a character that also has a shorter one, because the longer multi-byte forms carry more payload bits than a short code point needs. For example, you could encode an ASCII character in two bytes by leaving the extra high-order payload bits zero. The UTF-8 specification explicitly forbids this; only the shortest encoding is valid.
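To make the bit layout concrete, here is a hedged Python sketch that hand-builds that forbidden two-byte form for U+0020 (SPACE). A two-byte UTF-8 sequence follows the template 110yyyyy 10xxxxxx, giving 11 payload bits where an ASCII character needs at most 7:

```python
cp = 0x20  # U+0020, SPACE

# Pack the code point into the two-byte template 110yyyyy 10xxxxxx,
# leaving the four high-order payload bits zero.
byte1 = 0b11000000 | (cp >> 6)    # -> 0xC0
byte2 = 0b10000000 | (cp & 0x3F)  # -> 0xA0
print(f"{byte1:02X} {byte2:02X}")  # C0 A0 -- the invalid overlong form from above
```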